## Intro to Variable Types

Generally, variables come in two varieties: categorical and quantitative. **Categorical** variables group observations into separate categories that can be ordered or unordered. **Quantitative** variables, on the other hand, are expressed numerically, whether as a count or a measurement.

![image.png](attachment:image.png)

### Quantitative Variables

We can think of quantitative variables as any information about an observation that can only be described with numbers. Quantitative variables are generally counts or measurements of something (e.g., the number of points earned in a game, or a person’s height). They are well suited for mathematical operations and quantitative analysis, and are helpful for answering questions like “How many/much?”, “What is the average?”, or “How often?”. There are two types of quantitative variables, discrete and continuous, and they serve different functions in a dataset.

#### Discrete Variables

Discrete quantitative variables are numeric values that represent counts and can only take on integer values. They represent whole units that cannot be broken down into smaller pieces, and as such cannot be meaningfully expressed with decimals or fractions. Examples of discrete variables are the number of children in a person’s family or the number of coin flips a person makes. Unless we are working with quantum mechanics, we cannot meaningfully have flipped a coin 3.5 times, or have 4.75 sisters.

When inspecting a dataset for discrete variables, ask yourself whether the variable would still make sense if you added 0.5 to any of its values. If it would not, the variable is most likely discrete.
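
As a rough companion to that question, the sketch below checks whether a numeric column contains only whole numbers. The DataFrame, its column names, and the helper `only_whole_numbers` are hypothetical and are not part of the datasets used in this notebook. A column of whole numbers is only a hint: a continuous measurement recorded coarsely would pass the same check.

```python
import pandas as pd

# Hypothetical data: neither column comes from the datasets used in this notebook.
df = pd.DataFrame({
    "num_children": [0, 2, 1, 3],                 # counts: whole units only
    "height_cm": [162.5, 180.0, 175.3, 169.8],    # measurements: decimals are meaningful
})

def only_whole_numbers(series):
    """Return True if every non-missing value in a numeric Series is a whole number."""
    values = series.dropna()
    return bool((values % 1 == 0).all())

print(only_whole_numbers(df["num_children"]))  # True
print(only_whole_numbers(df["height_cm"]))     # False
```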

#### Continuous Variables

Continuous quantitative variables are numeric measurements that can be expressed with decimal precision. Theoretically, continuous variables can take on infinitely many values within a given range. Examples of continuous variables are length, weight, and age, which can all be described with decimal values.

Sometimes the line between discrete and continuous variables can be a bit blurry. For example, age with decimal values is a continuous variable, but age rounded to the closest whole year is by definition discrete. In other words, the precision with which something is recorded can determine how we classify the variable.
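
The age example can be sketched in a few lines. The ages below are made-up values used only for illustration.

```python
import pandas as pd

# Hypothetical ages measured with decimal precision (continuous).
age_exact = pd.Series([23.4, 31.75, 47.1, 52.9])

# The same ages recorded to the closest whole year (discrete).
age_whole_years = age_exact.round().astype(int)

print(age_whole_years.tolist())  # [23, 32, 47, 53]
```

Both Series describe the same people; only the recording precision differs.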

### Categorical Variables

Categorical variables differ from quantitative variables in that they focus on the different ways data can be grouped rather than counted or measured. With categorical variables, we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes. When the groupings have a specific order or ranking, the variable is an ordinal categorical variable. If there is no apparent order or ranking to the categories, we refer to the variable as a nominal categorical variable.

#### Ordinal Variables

Do you remember working with a column in a dataset where the values of the column were groups that were greater or lesser than each other in some intrinsic way? Suppose there was a variable containing responses to the question “Rate your agreement with the statement: The minimum age to drive should be lowered.” The response options are “strongly disagree”, “disagree”, “neutral”, “agree”, and “strongly agree”. Because we can see an order where “strongly disagree” < “disagree” < “neutral” < “agree” < “strongly agree” in relation to agreement, we consider the variable to be ordinal.

Other examples of ordinal variables could be the standings in a sporting competition, age ranges, and customer ratings of a product or service.

It is important to keep in mind when working with ordinal variables that the differences between categories can vary. We can see in the table above that each age group in the age_group column has a range of five years, so the groups are evenly spaced. However, the same logic cannot be applied to the customer_rating variable. Here it is not accurate to assume that the difference between “satisfied” and “very satisfied” is the same as the difference between “dissatisfied” and “very dissatisfied”. This is a key difference between ordinal variables and discrete quantitative variables.
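
To make the idea of order concrete, here is a minimal sketch using an ordered pandas categorical. The responses are hypothetical. Comparisons, minimum, and maximum rely only on the order, so they are meaningful for ordinal data. Differences and averages are not, because the spacing between categories is unknown.

```python
import pandas as pd

levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey responses stored as an ordered categorical.
responses = pd.Series(
    pd.Categorical(
        ["agree", "neutral", "strongly disagree", "agree"],
        categories=levels,
        ordered=True,
    )
)

# Order-based operations are meaningful for ordinal data.
print(responses.min())                   # strongly disagree
print(responses.max())                   # agree
print((responses > "neutral").tolist())  # [True, False, False, True]
```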

#### Nominal Variables

Nominal categorical variables are those variables with two or more categories that do not have any relational order. Examples of nominal categories could be states in the U.S., brands of computers, or ethnicities. Notice how for each of these variables, there is no intrinsic ordering that distinguishes a category as greater than or less than another category.

The number of possible values for a nominal variable can be quite large. It’s even possible that a nominal categorical variable will take on a unique value for every observation in a dataset, like in the case of unique identifiers such as name or email_address.
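
One quick way to spot such identifier-like columns is to compare the number of distinct values with the number of rows. The small DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical dataset; the column names echo the examples in the text.
people = pd.DataFrame({
    "email_address": ["a@example.com", "b@example.com", "c@example.com"],
    "state": ["NY", "CA", "NY"],
})

# Number of distinct categories in each column.
print(people.nunique())

# A column behaves like a unique identifier when every row has its own value.
is_identifier = people.nunique() == len(people)
print(is_identifier)
```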

Sometimes, identifying a nominal variable can be tricky if that variable has attributes that are ordinal or quantitative. For example, the adoption_city variable above is nominal; however, we could assign an ordering to adoption_city based on a city-specific attribute like yearly average temperature. We might do this if we want to build a model to predict whether a particular animal will be adopted and we believe that temperature is relevant to this prediction. If temperature is the only thing we care about with respect to adoption city, we could assign an order to adoption_city based on temperature (and rename the variable to something like adoption_city_temp). Alternatively, we could create a new ordinal variable in our dataset named city_rel_temp, which is completely dependent on adoption_city.
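
A minimal sketch of the second option might look like the following. The city names and temperatures are made-up illustrative values, not real climate data, and the `pets` DataFrame is hypothetical. Only the column names adoption_city and city_rel_temp come from the text.

```python
import pandas as pd

# Hypothetical adoption records.
pets = pd.DataFrame({"adoption_city": ["Denver", "Miami", "Seattle", "Miami"]})

# Assumed yearly average temperatures (illustrative values only).
avg_temp_c = {"Seattle": 11, "Denver": 10, "Miami": 25}

# Rank the cities from coldest to warmest.
cities_by_temp = sorted(avg_temp_c, key=avg_temp_c.get)

# New ordinal variable that depends entirely on adoption_city.
pets["city_rel_temp"] = pd.Categorical(
    pets["adoption_city"], categories=cities_by_temp, ordered=True
)

print(pets["city_rel_temp"].cat.categories.tolist())  # ['Denver', 'Seattle', 'Miami']
```

The original adoption_city column stays nominal, while city_rel_temp carries the temperature-based order.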

#### Binary Variables

Binary or dichotomous variables are a special kind of nominal variable that has only two categories. The two categories are mutually exclusive: every observation falls into exactly one of them. We can imagine a variable that describes whether a picture contains a cat or a dog as a binary variable. In this case, if the picture is not a dog, it must be a cat, and vice versa. Binary variables can also be described with numbers, similar to bits, using the values 0 and 1. Likewise, you may find binary variables containing boolean values of True or False.
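
The cat-or-dog example can be expressed in each of these representations. The data and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical labels for a set of pictures.
pictures = pd.DataFrame({"animal": ["cat", "dog", "dog", "cat"]})

# Boolean representation: True when the picture shows a dog.
pictures["is_dog"] = pictures["animal"] == "dog"

# Integer representation: 1 for dog, 0 for cat.
pictures["is_dog_int"] = pictures["is_dog"].astype(int)

print(pictures)
```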

In [1]:
import pandas as pd

movies = pd.read_csv('movie_shows.csv', index_col = 0)

print(movies.head())
print('-'*10)
print(movies.tail())
print('-'*10)
print(movies.dtypes)

      type            title         country release_year rating  duration
0    Movie  Norm of the ...   United States      missing     PG    91.071
1    Movie  Jandino: Wha...  United Kingdom         2016      R    94.516
2  TV Show  Transformers...   United States         2013      G     1.127
3  TV Show  Transformers...   United States         2016  TV-14     1.687
4    Movie  #realityhigh...   United States         2017  TV-14    99.248
----------
         type            title         country release_year rating  duration
6229  TV Show  Red vs. Blue...   United States         2015      R    13.101
6230  TV Show         Maron...   United States         2016      R     4.122
6231    Movie  Little Baby ...             NaN         2016    NaN    60.362
6232  TV Show  A Young Doct...  United Kingdom         2013      R     2.082
6233  TV Show       Friends...   United States         2003  TV-14    10.972
----------
type             object
title            object
country          object


In [2]:
print(movies.country.unique())
print('-'*10)
print(movies.release_year.unique())

['United States' 'United Kingdom' 'Spain' 'Bulgaria' 'Chile' nan
 'Netherlands' 'France' 'Thailand' 'China' 'Belgium' 'India' 'Pakistan'
 'Canada' 'South Korea' 'Denmark' 'Turkey' 'Brazil' 'Indonesia' 'Ireland'
 'Hong Kong' 'Mexico' 'Vietnam' 'Nigeria' 'Japan' 'Norway' 'Lebanon'
 'Cambodia' 'Russia' 'Poland' 'Israel' 'Italy' 'Germany'
 'United Arab Emirates' 'Egypt' 'Taiwan' 'Australia' 'Czech Republic'
 'Argentina' 'Switzerland' 'Malaysia' 'Philippines' 'Serbia' 'Colombia'
 'Singapore' 'Peru' 'South Africa' 'New Zealand' 'Venezuela'
 'Saudi Arabia' 'Iceland' 'Austria' 'Uruguay' 'Finland' 'Ghana' 'Iran'
 'Sweden' 'Hungary' 'Guatemala' 'Portugal' 'Paraguay' 'Somalia' 'Ukraine'
 'Dominican Republic' 'Romania' 'Slovenia' 'Croatia' 'Bangladesh'
 'Soviet Union' 'Georgia' 'West Germany' 'Mauritius' 'Cyprus']
----------
['missing' '2016' '2013' '2017' '2014' '2015' '2009' '2012' '2010' '2018'
 '2011' '2019' '2004' '2000' '1983' '1982' '2006' '2005' '2002' '1997'
 '2008' '2007' '2003' '1981' '

In [3]:
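# Replace the 'missing' placeholder with 2019 so that the column can be cast to int
movies.release_year = movies.release_year.replace(['missing'], 2019)
# Convert the year strings to integers
movies.release_year = movies.release_year.astype(int)
print(movies.release_year.unique())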

[2019 2016 2013 2017 2014 2015 2009 2012 2010 2018 2011 2004 2000 1983
 1982 2006 2005 2002 1997 2008 2007 2003 1981 1991 1994 1988 1976 1973
 1974 1989 1986 1984 1978 1998 1972 1979 1960 1959 2001 1995 1992 1990
 1975 1985 1980 1970 1996 1967 1999 1987 1968 1993 2020 1958 1965 1956
 1962 1955 1977 1945 1946 1942 1944 1947 1943 1969 1954 1966 1971 1964
 1925 1963]
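
Replacing 'missing' with a fixed year makes the column convertible, but it also treats every unknown release year as 2019. If the missing values should stay missing, one alternative is sketched below on a small hypothetical Series rather than on the movies data itself.

```python
import pandas as pd

# Hypothetical column that mimics release_year: years stored as strings
# with a 'missing' placeholder.
years = pd.Series(["missing", "2016", "2013"])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
years_numeric = pd.to_numeric(years, errors="coerce")

# The nullable 'Int64' dtype keeps whole-number years while allowing missing values.
years_int = years_numeric.astype("Int64")
print(years_int)
```

Which approach fits better depends on the analysis: a filled-in year affects summaries such as the mean, while a missing value is skipped by default.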


### The Pandas Category Data Type
The pandas pd.Categorical() constructor can be used to store data as type category and to indicate the order of the categories.

In [5]:
# Store `rating` as an ordered category; values not in the list (such as 'TV-14') become NaN
movies['rating'] = pd.Categorical(movies['rating'], ['G', 'PG', 'PG-13', 'R', 'UNRATED', 'NOT RATED'], ordered=True)
print(movies.rating.unique())

['PG', 'R', 'G', NaN, 'PG-13', 'UNRATED', 'NOT RATED']
Categories (6, object): ['G' < 'PG' < 'PG-13' < 'R' < 'UNRATED' < 'NOT RATED']
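
Note that 'TV-14' appears in the rating column printed earlier but not in the output above. When a category list is passed to pd.Categorical(), any value that is not in the list is converted to NaN. The sketch below reproduces this on a small hypothetical Series and shows one way to list such values before converting.

```python
import pandas as pd

# Hypothetical ratings, including one value ('TV-14') that is not in the category list.
ratings = pd.Series(["PG", "TV-14", "R", "G"])

rating_levels = ["G", "PG", "PG-13", "R"]
ratings_cat = pd.Categorical(ratings, categories=rating_levels, ordered=True)

print(ratings_cat)
print(ratings_cat.isna())  # 'TV-14' has become NaN

# Listing the values that would be lost before converting can prevent surprises.
unlisted = sorted(set(ratings.dropna()) - set(rating_levels))
print(unlisted)            # ['TV-14']
```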


In [6]:
clothes = pd.read_csv('clothing.csv', index_col=0)

# View the first five rows of the dataframe
print(clothes.head())

# Print the unique values of the `Rating` column
print(clothes['Rating'].unique())

# Change the data type of `Rating` to category
clothes.Rating = pd.Categorical(clothes.Rating, ['very unsatisfied', 'unsatisfied', 'neutral', 'satisfied', 'very satisfied'], ordered=True)

# Recheck the values of `Rating` with .unique()
print(clothes.Rating.unique())

   Clothing ID  Age                    Title          Rating   Division Name
0          767   33                      NaN       satisfied       Initmates
1         1080   34                      NaN  very satisfied         General
2         1077   60  Some major design flaws         neutral         General
3         1049   50         My favorite buy!  very satisfied  General Petite
4          847   47         Flattering shirt  very satisfied         General
['satisfied' 'very satisfied' 'neutral' 'unsatisfied' 'very unsatisfied']
['satisfied', 'very satisfied', 'neutral', 'unsatisfied', 'very unsatisfied']
Categories (5, object): ['very unsatisfied' < 'unsatisfied' < 'neutral' < 'satisfied' < 'very satisfied']


### Label Encoding with .cat.codes

When working with categorical variables, it is sometimes necessary to convert the categories to numbers. This is called categorical encoding and can be achieved through several encoding methods.

Label encoding is when we specifically convert each category in a variable to an integer. This enables us to perform numerical operations on the column and widen our range of plotting capabilities.

In order to perform label encoding in pandas, we can use the .cat.codes accessor for any variable stored as type category.

In [7]:
clothes['rating_codes'] = clothes.Rating.cat.codes
print(clothes.head())

   Clothing ID  Age                    Title          Rating   Division Name  \
0          767   33                      NaN       satisfied       Initmates   
1         1080   34                      NaN  very satisfied         General   
2         1077   60  Some major design flaws         neutral         General   
3         1049   50         My favorite buy!  very satisfied  General Petite   
4          847   47         Flattering shirt  very satisfied         General   

   rating_codes  
0             3  
1             4  
2             2  
3             4  
4             4  
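
The codes follow the position of each category in the category list, which is why 'satisfied' maps to 3 and 'very satisfied' to 4 above. The sketch below, on hypothetical data, shows this along with two details that are easy to miss: missing values receive the code -1, and the codes can be mapped back to their labels.

```python
import pandas as pd

levels = ["very unsatisfied", "unsatisfied", "neutral", "satisfied", "very satisfied"]

# Hypothetical ratings with one missing value.
rating = pd.Series(
    pd.Categorical(
        ["satisfied", None, "neutral", "very satisfied"],
        categories=levels,
        ordered=True,
    )
)

# Codes follow the position of each category in `levels`; missing values get -1.
print(rating.cat.codes.tolist())  # [3, -1, 2, 4]

# Mapping from code back to category label.
code_to_label = dict(enumerate(rating.cat.categories))
print(code_to_label[3])           # satisfied
```

Because -1 is an ordinary integer, it is worth handling missing values before computing statistics on a codes column.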


### One-Hot Encoding

In the previous exercise, we saw how label encoding can be useful for ordinal categorical variables. But sometimes we need a different approach. This could be because:

- We have a nominal categorical variable (like breed of dog), so it doesn’t really make sense to assign numbers like 0, 1, 2, 3, 4, 5 to our categories, as this could create an order among the breeds that is not present.

- We have an ordinal categorical variable but we don’t want to assume that there’s equal spacing between categories.

Another way of encoding categorical variables is called One-Hot Encoding (OHE). With OHE, we essentially create a new binary variable for each of the categories within our original variable. This technique is useful when managing nominal variables because it encodes the variable without creating an order among the categories.
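
As a small, self-contained illustration of the idea, the sketch below one-hot encodes a hypothetical breed column. The data is made up. The `dtype=int` argument is included so the new columns hold 0 and 1, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical nominal variable.
dogs = pd.DataFrame({
    "name": ["Rex", "Bella", "Milo"],
    "breed": ["poodle", "beagle", "poodle"],
})

# One new 0/1 column per category; the original `breed` column is dropped.
encoded = pd.get_dummies(dogs, columns=["breed"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the breed_ columns, so no order among the breeds is implied.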

In [11]:
cereal = pd.read_csv('cereal.csv')
print(cereal.head())

   Unnamed: 0                       name             mfr type  fiber  rating  \
0           0                  100% Bran          Nestle    C   10.0   68.40   
1           1          100% Natural Bran     Quaker Oats    C    2.0   33.98   
2           2                   All-Bran        Kelloggs    C    9.0   59.43   
3           3  All-Bran with Extra Fiber        Kelloggs    C   14.0   93.70   
4           4             Almond Delight  Ralston Purina    C    1.0   34.38   

  shelf  vitamins  coupons  price  
0   top        25        4   4.06  
1   top         0        1   4.62  
2   top        25        3   4.39  
3   top        25        2   6.18  
4   top        25        2   6.05  


In [13]:
cereal = pd.get_dummies(data=cereal, columns=['mfr'])
print(cereal.head())

   Unnamed: 0                       name type  fiber  rating shelf  vitamins  \
0           0                  100% Bran    C   10.0   68.40   top        25   
1           1          100% Natural Bran    C    2.0   33.98   top         0   
2           2                   All-Bran    C    9.0   59.43   top        25   
3           3  All-Bran with Extra Fiber    C   14.0   93.70   top        25   
4           4             Almond Delight    C    1.0   34.38   top        25   

   coupons  price  mfr_American Home  mfr_General Mills  mfr_Kelloggs  \
0        4   4.06                  0                  0             0   
1        1   4.62                  0                  0             0   
2        3   4.39                  0                  0             1   
3        2   6.18                  0                  0             1   
4        2   6.05                  0                  0             0   

   mfr_Nestle  mfr_Post  mfr_Quaker Oats  mfr_Ralston Purina  
0           1    

### Exercise

In [16]:
# Import pandas with alias
import pandas as pd

# Read in the census dataframe
census = pd.read_csv('census_data.csv', index_col=0)

print(census.head())
print(census.dtypes)

# birth_year is stored as strings: inspect its values, replace the 'missing' placeholder, and cast to int
print(census.birth_year.unique())
census.birth_year = census['birth_year'].replace(['missing'], 1967)
census.birth_year = census['birth_year'].astype('int')
print(census['birth_year'].unique())

print(census.dtypes)
print(census.birth_year.mean())

# higher_tax is ordinal: store it as an ordered category
print(census.higher_tax.unique())
census.higher_tax = pd.Categorical(census.higher_tax, ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'], ordered=True)
print(census.higher_tax.unique())

# Label encode higher_tax and find the median response code
census['higher_tax_codes'] = census.higher_tax.cat.codes
print(census.higher_tax_codes.median())

# marital_status is nominal: store it as an unordered category and label encode it
print(census.marital_status.unique())
census.marital_status = pd.Categorical(census.marital_status, ['single', 'divorced', 'married', 'widowed'], ordered=False)
census['marital_codes'] = census.marital_status.cat.codes
print(census.marital_codes.head())

# One-hot encode marital_status
census = pd.get_dummies(census, columns=['marital_status'])
print(census.head())

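def age_group(current_year, birth_year, interval):
  """Return an age-range label such as '16 - 20' for the given birth year."""
  age = current_year - birth_year
  # The first group runs from 0 to `interval`; each later group starts one above the previous upper limit
  low_limit = 0
  upper_limit = low_limit + interval
  group = 0
  # Step up one group at a time until the age fits within the current range
  while age > upper_limit:
    group += 1
    low_limit = group*interval + 1
    upper_limit = (group + 1)*interval
  return str(low_limit) + ' - ' + str(upper_limit)

# Assign each person to a 5-year age group, using 2021 as the current year
census['age_group'] = census.birth_year.apply(lambda x: age_group(2021, x, 5))
print(census.age_group.head())
print(census.age_group.unique())
# astype('category') creates an unordered category whose labels are sorted as strings
census.age_group = census.age_group.astype('category')
census['age_group_code'] = census.age_group.cat.codes
print(census.head())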


  first_name  last_name birth_year  voted  num_children  income_year  \
0     Denise      Ratke       2005  False             0     92129.41   
1       Hali  Cummerata       1987  False             0     75649.17   
2    Salomon        Orn       1992   True             2    166313.45   
3     Sarina   Schiller       1965  False             2     71704.81   
4       Gust  Abernathy       1945  False             2    143316.08   

       higher_tax marital_status  
0        disagree         single  
1         neutral       divorced  
2           agree         single  
3  strongly agree        married  
4           agree        married  
first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' 

In [23]:
print(census.age_group.value_counts().sort_index())

11 - 15     6
16 - 20     6
21 - 25     5
26 - 30     9
31 - 35     6
36 - 40     8
41 - 45     7
46 - 50     6
51 - 55     5
56 - 60    10
61 - 65     6
66 - 70    11
71 - 75     9
76 - 80     5
81 - 85     1
Name: age_group, dtype: int64
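
As an alternative to a hand-written grouping function, pandas provides pd.cut() for binning numeric values. The sketch below is one possible way to build the same style of labels. The birth years are hypothetical, and the bin edges are an assumption chosen to cover the age groups shown above; ages outside the edges would become NaN.

```python
import pandas as pd

# Hypothetical birth years; 2021 is the reference year used in the exercise.
birth_year = pd.Series([2005, 1987, 1992, 1965, 1945])
age = 2021 - birth_year

# Bin edges 10, 15, ..., 85 give right-closed intervals (10, 15], (15, 20], ...
edges = list(range(10, 90, 5))
labels = [f"{low + 1} - {high}" for low, high in zip(edges[:-1], edges[1:])]

# With labels supplied, pd.cut returns an ordered categorical by default,
# so the groups keep their natural order instead of being sorted as strings.
age_group = pd.cut(age, bins=edges, labels=labels)
print(age_group.tolist())  # ['16 - 20', '31 - 35', '26 - 30', '56 - 60', '76 - 80']
```

Because the result is ordered, .cat.codes on it would follow the age order of the groups rather than the alphabetical order of their labels.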
