# Dealing with Categorical Variables

## Learning Objectives
___
By the end of this notebook, you should be able to:

- define categorical variables

- transform a categorical variable with label encoder and dummy variables

- explain multicollinearity and avoid it when using dummy variables



Let's look at our cars dataset. As always, we will import the necessary libraries and load the data.


In [None]:
import pandas as pd
data = pd.read_csv('data/cars_multivariate.csv')
data.head()

In [None]:
# check out the columns in our dataset
data.columns

In [None]:
# figure out which type the data is stored in
data.info()

Except for `car_name`, every column appears to hold numerical values, so each of them could be a candidate for describing the dependent variable `mpg` (miles per gallon).
Does this mean we don't have any categorical variables?

**What are categorical variables again?**  
Categorical variables take one of a limited set of values, each of which assigns an observation to a group.  
A closer look at the values in a column helps us figure out whether it is categorical or numerical.
As a first example, let's take a closer look at the column "origin". 

In [None]:
# we want to explore the descriptives of that column
print(data['origin'].describe())

In [None]:
# how many different values can be found in this column
print(data['origin'].nunique())

The values range from 1 to 3. In fact, the only values that appear in the dataset are 1, 2 and 3! It turns out that "origin" is a so-called **categorical** variable. It does not represent a continuous number but refers to a location: say, 1 may stand for US, 2 for Europe and 3 for Asia (note: for this dataset the actual meaning is not disclosed).

So, categorical variables are exactly what they sound like: they represent categories instead of numerical features. 
Note that even though that's not the case here, these features are often stored as text values which represent various levels of the observations.
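
As a minimal sketch of that idea, the numeric codes can be mapped to text labels and stored with the `category` dtype. The mapping `origin_labels` below is an assumption made purely for illustration (the dataset does not disclose the real meaning), and the values in `origin_codes` are made up.

```python
import pandas as pd

# Made-up sample of origin codes, similar to the "origin" column
origin_codes = pd.Series([1, 3, 2, 1, 1, 2])

# Hypothetical mapping, the real meaning of the codes is not disclosed
origin_labels = {1: 'US', 2: 'Europe', 3: 'Asia'}

# Replace the codes by labels and tell pandas that these are categories
origin_cat = origin_codes.map(origin_labels).astype('category')

print(origin_cat.dtype)
print(origin_cat.value_counts())
```

Storing the column this way makes it obvious to a reader (and to pandas) that the values are group labels rather than quantities.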

## Identifying categorical variables
___
Categorical variables need special treatment, especially in ML algorithms, as you'll see later on. You therefore need to identify which variables are categorical. 



In some cases, identification is easy: if the values are stored as strings, the column has the `object` data type (e.g. `car_name`). In other cases, the values are numeric and it is not immediately apparent that the variable is categorical (as with origin, model and cylinders in our example).



Note that this may not be trivial. A first thing you can do is use the `.describe()` and `.info()` methods: `.info()` lists the data type of each column (strings, integers, etc.), while `.describe()` gives summary statistics such as the minimum and maximum. But even then, continuous variables might have been imported as strings, or categorical ones as numbers, so it's very important to really have a look at your data and plot it. 
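
The following sketch shows what a continuous variable imported as strings can look like and one way to repair it with `pd.to_numeric`. The DataFrame `raw` and its `'?'` placeholder are invented for illustration and are not taken from our cars file.

```python
import pandas as pd

# Hypothetical raw data: a numeric column that was read in as text
raw = pd.DataFrame({
    'horsepower': ['130', '165', '?', '95'],
    'origin': [1, 1, 3, 2],
})
print(raw.dtypes)  # horsepower is not numeric yet

# Convert to numbers; entries that cannot be parsed become NaN
raw['horsepower'] = pd.to_numeric(raw['horsepower'], errors='coerce')
print(raw.dtypes)
print(raw['horsepower'].isna().sum(), 'value(s) could not be converted')
```

With `errors='coerce'` the conversion does not fail on unparseable entries, so you should always check how many missing values it produced.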

In [None]:
# Let's plot our variables against mpg to see whether they look continuous or categorical
import matplotlib.pyplot as plt
%matplotlib inline

fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(16,3))

for xcol, ax in zip(['acceleration', 'displacement', 'horsepower', 'weight'], axes):
    data.plot(kind='scatter', x=xcol, y='mpg', ax=ax, alpha=0.4, color='b')

In [None]:
# Do the same for the variables cylinders, model and origin
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,3))

for xcol, ax in zip([ 'cylinders', 'model', 'origin'], axes):
    data.plot(kind='scatter', x=xcol, y='mpg', ax=ax, alpha=0.4, color='b')

Note the structural difference between the top and bottom sets of graphs: instead of a fairly homogeneous "cloud", categorical variables generate vertical lines at discrete values. Another plot type that may be useful to look at is the histogram.

In [None]:
import warnings
warnings.filterwarnings('ignore')
fig = plt.figure(figsize = (8,8))
ax = fig.gca()
data.hist(ax = ax);

The structural difference between categorical and continuous variables also becomes clear in the histograms: since categorical variables take only a few discrete values, their histograms typically show a few separated bars rather than a smooth distribution.

In addition to the plots, also look at the number of unique values:

In [None]:
data[['cylinders', 'model', 'origin']].nunique()
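
If a dataset has many columns, the same check can be wrapped in a small helper. This is only a heuristic sketch: the function name `low_cardinality_columns`, the threshold `max_unique` and the `toy` data are assumptions for illustration, and a low number of unique values is a hint, not proof, that a column is categorical.

```python
import pandas as pd

def low_cardinality_columns(df, max_unique=15):
    """Return the columns with at most `max_unique` distinct values and their counts."""
    counts = df.nunique()
    return counts[counts <= max_unique].sort_values()

# Made-up miniature version of the cars data
toy = pd.DataFrame({
    'mpg': [18.0, 15.5, 24.1, 31.9, 27.2, 16.3],
    'cylinders': [8, 8, 4, 4, 4, 8],
    'origin': [1, 1, 3, 2, 3, 1],
})

candidates = low_cardinality_columns(toy, max_unique=3)
print(candidates)
```

On real data you would pick the threshold after looking at the output of `.nunique()` for all columns.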

## Transforming categorical variables
___

When you want to use categorical variables in regression models, they need to be transformed. There are two approaches to this:
1) Perform label encoding
2) Create dummy variables / one-hot-encoding

### 1) Label encoding

Label encoding is when you represent your labels as numbers. 

Let's illustrate label encoding and dummy creation with the following Pandas Series with 3 categories: "USA", "EU" and "ASIA".

In [None]:
# create pandas series
origin = ['USA', 'EU', 'EU', 'ASIA','USA', 'EU', 'EU', 'ASIA', 'ASIA', 'USA']
origin_series = pd.Series(origin)

Now you'll want to make sure Python recognizes the strings as categories. This can be done as follows:

In [None]:
# transform series as type category
cat_origin = origin_series.astype('category')
cat_origin

Note how the `dtype` (i.e., data type) here is `category` and the three categories are detected.

#### Label encoding with .cat.codes
Label encoding assigns numerical labels that always lie between 0 and (number_of_categories)-1. There are several ways to do this; one way is to use `.cat.codes`:

In [None]:
# assign each category a numeric label with cat.codes
cat_origin.cat.codes
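
To find out which code belongs to which category, you can pair the codes with `.cat.categories`. The sketch below rebuilds a shortened version of the series so that it runs on its own; the name `code_to_label` is just an illustrative choice.

```python
import pandas as pd

# Shortened copy of the example series
origin_series = pd.Series(['USA', 'EU', 'EU', 'ASIA', 'USA'])
cat_origin = origin_series.astype('category')

# The position of a category in .cat.categories is its code
code_to_label = dict(enumerate(cat_origin.cat.categories))
print(code_to_label)

# The codes of the individual observations
print(list(cat_origin.cat.codes))
```

For string categories created with `.astype('category')`, the categories are sorted alphabetically, which is why "ASIA" receives the code 0 here.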

#### Label encoding with scikit-learn's LabelEncoder
Another way is to use scikit-learn's `LabelEncoder`

In [None]:
# import module, set up LabelEncoder and fit_transform your pandas series
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()

origin_encoded = lb_make.fit_transform(cat_origin)

In [None]:
origin_encoded

Note that while `.cat.codes` can only be used on variables that have been converted with `.astype('category')`, `LabelEncoder` has no such requirement.
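
A short sketch of that point: `LabelEncoder` is fitted here on a plain Python list of strings (a shortened copy of the example data), and its `classes_` attribute and `inverse_transform` method are used to translate the codes back into labels.

```python
from sklearn.preprocessing import LabelEncoder

# Plain list of strings, no conversion to the category dtype needed
origin = ['USA', 'EU', 'EU', 'ASIA', 'USA']

encoder = LabelEncoder()
codes = encoder.fit_transform(origin)

print(encoder.classes_)                  # labels in the order of their codes
print(codes)                             # encoded values
print(encoder.inverse_transform(codes))  # back to the original labels
```

Keeping the fitted encoder around is useful whenever you need to report results in terms of the original labels.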

### 2) Creating Dummy Variables

Another way to transform categorical variables is to use one-hot encoding or "dummy variables". The idea is to convert each category into a new column and assign a 1 or 0 to the column. There are several libraries that support one-hot encoding; let's take a look at two:

In [None]:
# create dummy variables with the help of pandas
pd.get_dummies(cat_origin)

See how the label name has become the column name! Another method is through using the `LabelBinarizer` in scikit-learn. 

In [None]:
# create dummy variables with the help of scikit-learn
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
origin_dummies = lb.fit_transform(cat_origin)
# You need to convert this back to a dataframe
origin_dum_df = pd.DataFrame(origin_dummies,columns=lb.classes_)
origin_dum_df

The advantage of using dummies is that, whatever algorithm you'll be using, your numerical values cannot be misinterpreted as being continuous. Going forward, it's important to know that for linear regression (and most other algorithms in scikit-learn), **one-hot encoding is required** when adding unordered categorical variables to a model, because these algorithms would otherwise treat the label codes as ordinary numbers!
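
To see how label codes can mislead a linear model, here is a small sketch. The target values in `group_value` are invented for illustration and do not come from the cars dataset. A straight line is fitted through the label codes with `np.polyfit` and compared with the true value of each group.

```python
import numpy as np
import pandas as pd

labels = ['USA', 'EU', 'EU', 'ASIA', 'USA', 'EU', 'EU', 'ASIA', 'ASIA', 'USA']

# Made-up target: every observation in a group has the same value
group_value = {'ASIA': 30.0, 'EU': 20.0, 'USA': 28.0}
y = np.array([group_value[label] for label in labels])

# Label encoding: ASIA=0, EU=1, USA=2 (alphabetical order)
codes = pd.Series(labels).astype('category').cat.codes.to_numpy()

# A linear model treats the codes as a continuous scale
slope, intercept = np.polyfit(codes, y, deg=1)
line_pred = intercept + slope * np.array([0, 1, 2])
actual = np.array([group_value[c] for c in ['ASIA', 'EU', 'USA']])

print('prediction per code:', line_pred)
print('actual group value: ', actual)
```

The line has to assume that moving from ASIA to EU has the same effect as moving from EU to USA, so it cannot reproduce the three group values. Dummy variables give each category its own coefficient and avoid this restriction.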

## The Dummy Variable Trap
___

Due to the nature of how dummy variables are created, one variable can be predicted from all of the others. This is known as perfect **multicollinearity** and it can be a problem for regression. If this isn't super clear, go back to the one-hot encoded origin data above:

In [None]:
trap_df = pd.get_dummies(cat_origin)
trap_df

As a consequence of creating dummy variables for every origin, you can now predict any single origin dummy variable using the information from all of the others. OK, that might sound more like a tongue twister than an explanation.  
Let's look at an example: Focus on the ASIA column for now. You can perfectly predict this column by adding the values in the EU and USA columns then subtracting the sum from 1 as shown below:

In [None]:
# Predict ASIA column from EU and USA
predicted_asia = 1 - (trap_df['EU'] + trap_df['USA'])
predicted_asia.to_frame(name='Predicted_ASIA')

EU and USA can be predicted in a similar manner which you can work out on your own. 

You are probably wondering why this is a problem for regression. Recall that the coefficients derived from a regression model are used to make predictions. In a multiple linear regression, the coefficients represent the average change in the dependent variable for each 1 unit change in a predictor variable, assuming that all the other predictor variables are kept constant. This is no longer the case when predictor variables are related which, as you've just seen, happens automatically when you create dummy variables. This is what is known as the **Dummy Variable Trap**.

Fortunately, the dummy variable trap can be avoided by simply dropping one of the dummy variables. You can do this by subsetting the dataframe manually or, more conveniently, by passing `drop_first=True` to `get_dummies()`: 

In [None]:
# create dummy variables and drop first column to avoid multicollinearity
pd.get_dummies(cat_origin, drop_first=True)

If you take a close look at the DataFrame above, you'll see that there is no longer enough information to predict any of the columns so the multicollinearity has been eliminated. 
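
One way to check this numerically is to compare the rank of the design matrix with its number of columns. The sketch below assumes that the model contains an intercept (a column of ones, as added by `sm.add_constant`); the helper `design_matrix` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

origin = pd.Series(['USA', 'EU', 'EU', 'ASIA', 'USA',
                    'EU', 'EU', 'ASIA', 'ASIA', 'USA']).astype('category')

def design_matrix(dummies):
    """Put a column of ones (the intercept) in front of the dummy columns."""
    const = np.ones((len(dummies), 1))
    return np.hstack([const, dummies.to_numpy(dtype=float)])

full = design_matrix(pd.get_dummies(origin))                      # all three dummies
reduced = design_matrix(pd.get_dummies(origin, drop_first=True))  # one dummy dropped

# A matrix whose rank is lower than its number of columns has redundant columns
print('full:   ', full.shape[1], 'columns, rank', np.linalg.matrix_rank(full))
print('reduced:', reduced.shape[1], 'columns, rank', np.linalg.matrix_rank(reduced))
```

With all dummies, the three dummy columns add up to the intercept column, so one column is redundant. After dropping one dummy, every remaining column carries its own information.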

You'll soon see that dropping the first variable affects the interpretation of regression coefficients. The dropped category becomes what is known as the **reference category**. The regression coefficients that result from fitting the remaining variables represent the change *relative* to the reference.
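
Since `drop_first=True` always drops the first category, you can choose the reference category by controlling the category order. The following is a sketch of one way to do that with `pd.Categorical`; picking "USA" as the reference is an arbitrary choice for illustration.

```python
import pandas as pd

labels = ['USA', 'EU', 'EU', 'ASIA', 'USA', 'EU', 'EU', 'ASIA', 'ASIA', 'USA']

# Default: categories are sorted alphabetically, so ASIA is dropped
default = pd.get_dummies(pd.Series(labels), drop_first=True, dtype=int)
print(list(default.columns))

# Put USA first in the category order to make it the reference category
usa_first = pd.Series(pd.Categorical(labels, categories=['USA', 'ASIA', 'EU']))
usa_ref = pd.get_dummies(usa_first, drop_first=True, dtype=int)
print(list(usa_ref.columns))
```

Choosing a reference category that is meaningful for your question (for example the largest group) usually makes the coefficients easier to explain.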

You'll also see that in certain contexts, multicollinearity and the dummy variable trap are less of an issue and can be ignored. It is therefore important to understand which models are sensitive to multicollinearity and which are not.

## Dummy variables in practice: Back to our auto-mpg data
___

Let's go ahead and change our "cylinders", "model", and "origin" columns over to dummies.

In [None]:
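# dtype=int keeps the dummies numeric (recent pandas versions return booleans by default,
# which can cause problems when fitting the statsmodels regression further below)
cyl_dummies = pd.get_dummies(data['cylinders'], prefix='cyl', drop_first=True, dtype=int)
yr_dummies = pd.get_dummies(data['model'], prefix='yr', drop_first=True, dtype=int)
orig_dummies = pd.get_dummies(data['origin'], prefix='orig', drop_first=True, dtype=int)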

In [None]:
# Check the results - also have a look at your other dummy dfs
cyl_dummies.head()

Next, let's remove the original columns from our data and add the dummy columns instead:

In [None]:
data = data.drop(['cylinders','model','origin'], axis=1)

In [None]:
data = pd.concat([data, cyl_dummies, yr_dummies, orig_dummies], axis=1)
data.head()
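
The create, drop and concatenate steps can also be done in a single call, because `get_dummies` accepts a whole DataFrame together with the `columns` to encode. The sketch below uses a made-up miniature DataFrame `toy` instead of the real file so that it runs on its own.

```python
import pandas as pd

# Made-up miniature version of the cars data
toy = pd.DataFrame({
    'mpg': [18.0, 24.0, 31.0, 15.0],
    'cylinders': [8, 4, 4, 8],
    'model': [70, 70, 71, 71],
    'origin': [1, 3, 2, 1],
})

# Encode the listed columns, drop the originals and one dummy per variable
encoded = pd.get_dummies(
    toy,
    columns=['cylinders', 'model', 'origin'],
    prefix=['cyl', 'yr', 'orig'],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

The original columns are removed automatically, and the untouched columns (here `mpg`) are kept in front of the new dummy columns.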

## Build multiple regression model including categorical variables
___

Now, we want to use these variables in our regression model.

In [None]:
import statsmodels.api as sm

In [None]:
# Let's use all of our variables in our regression model, except mpg (since it's our dependent variable) and car_name
variables = data.columns.to_list()
variables.remove('mpg')
variables.remove('car_name')

In [None]:
X = data[variables]
y = data.mpg
X = sm.add_constant(X)
res = sm.OLS(y, X).fit()
res.summary()

For a better understanding of the interpretation of the dummy coefficients, have a look at [this website](https://dss.princeton.edu/online_help/analysis/dummy_variables.htm).
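
As a small worked example of that interpretation, the sketch below fits a regression with only an intercept and two origin dummies. The `mpg` values are invented for illustration, and `np.linalg.lstsq` is used as a simple stand-in for the statsmodels fit. In this special case the intercept equals the mean of the reference category, and each dummy coefficient equals the difference between its group mean and the reference mean.

```python
import numpy as np
import pandas as pd

# Made-up data: origin labels and mpg values
df = pd.DataFrame({
    'origin': ['USA', 'EU', 'EU', 'ASIA', 'USA', 'EU', 'EU', 'ASIA', 'ASIA', 'USA'],
    'mpg': [18.0, 26.0, 28.0, 31.0, 20.0, 27.0, 25.0, 33.0, 32.0, 19.0],
})

# ASIA is dropped and therefore becomes the reference category
dummies = pd.get_dummies(df['origin'], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])

# Ordinary least squares: coefficients for [intercept, EU, USA]
coef, *_ = np.linalg.lstsq(X, df['mpg'].to_numpy(), rcond=None)

group_means = df.groupby('origin')['mpg'].mean()
print('coefficients:', coef)
print('group means:')
print(group_means)
```

Once further predictors are added to the model, the dummy coefficients are no longer simple differences of group means; they describe the difference to the reference category with the other predictors held constant.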

What should you do with dummy variables that are not significant? 
A good answer to this question was found [here](https://www.quora.com/Is-it-necessary-to-include-a-dummy-variable-in-regression-analysis-when-it-is-not-significant):  
"The estimated coefficients of the included variables represent differences between their effects and those of the one that was left out. If one of the dummy variables has an insignificant coefficient, this only means that its effect is nearly the same as that of the one that was left out. If you choose a different dummy variable to leave out, the coefficients and apparent significance of all the others could change."

## Summary
___
You now know ...
- ... that you need to look closely at your data to **find categorical variables**. You can do so by looking at the info and descriptive statistics as well as by figuring out how many unique values each variable has. Also, scatter plots and histograms might help to find categorical variables.  
- ... how to transform a variable with **label encoder** and **dummy variables**. For both approaches you can use, for example, [scikit-learn's preprocessing tools](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).  
- ... what **multicollinearity** is and how to **avoid it when using dummy variables**.  
"Multicollinearity refers to a situation in which more than two explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if [...] the correlation between two independent variables is equal to 1 or −1." [Multicollinearity Wikipedia](https://en.wikipedia.org/wiki/Multicollinearity#:~:text=Multicollinearity%20refers%20to%20a%20situation,equal%20to%201%20or%20%E2%88%921.).  
To avoid collinearity when using dummy variables, you need to drop one dummy-column.

## Practice time
___
Can you find a combination of variables (including dummy variables) that beats your last multiple regression model from Notebook 4?
