## Including categorical predictors in a regression model: dummy coding

There are multiple ways to do dummy coding in Python and to include categorical predictors in a regression model.
We are going to learn only one of them (one that I consider very easy to apply).

You can explore other Python methods to incorporate categorical predictors in Regression and choose your preferred one.
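As one example of such an alternative, here is a minimal sketch in which the regression formula does the dummy coding itself through `C()`. The data frame `toy` and all of its numbers are made up for illustration; it is not the Credit data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data (NOT the Credit dataset), only to illustrate the syntax
toy = pd.DataFrame({
    "Income": [20.0, 35.0, 50.0, 65.0, 80.0, 95.0],
    "Student": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "Balance": [200.0, 580.0, 340.0, 730.0, 510.0, 880.0],
})

# C(Student) asks the formula machinery to treat Student as categorical
# and to create the dummy variable for us
toy_model = smf.ols("Balance ~ Income + C(Student)", data=toy).fit()

param_names = list(toy_model.params.index)
print(param_names)
```

With this approach the dummy is named `C(Student)[T.Yes]`, and `No` (the category that sorts first) acts as the reference category.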

<br>

In [None]:
# Importing useful libraries

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

We are going to use the Credit dataset, which is part of the ISLR package in R (we used this package in the Statistics with R class).

The goal with this dataset is to find a model to predict a person's credit card balance (the Balance column) using all other variables as predictors.

In [None]:
Credit_df= pd.read_csv('C:\\Users\\jheredi2\\Documents\\PythonDataAnalytics\\1-Datasets\\Credit_ISLR.csv')

In [None]:
Credit_df.info()

Let's remove the first two columns of this data frame because they are not useful.

__Note__: The second column could be useful if you wanted to replace the default integer index (which starts at 0) with an index starting at 1 and ending at 400, the number of rows in this dataset. However, we will not do that because we do not need to.

In [None]:
Credit_df.drop(['Unnamed: 0', 'ID'], axis= 1, inplace=True)

In [None]:
Credit_df.info()

Now let's use the Pandas function get_dummies() to convert the categorical predictors into dummy variables.

The argument drop_first lets us create the right number of dummy variables for each predictor; that is, one fewer than the number of categories of the predictor. When drop_first= True, the first category is removed.

__Note__: For text columns, get_dummies() orders the categories by their sorted values, so the first category is the one that sorts first. Not a big deal if you forget this, because you can quickly run get_dummies() and find out which category was dropped.
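One way to check which category drop_first= True removes is to sort the unique values of the column yourself. The sketch below uses a small made-up data frame (`toy`), not the Credit data, and passes `dtype=int` so that the dummies show up as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Made-up column with three categories, for illustration only
toy = pd.DataFrame({"Ethnicity": ["Caucasian", "Asian", "African American", "Asian"]})

# For text columns, the category that sorts first is the one that gets dropped
levels = sorted(toy["Ethnicity"].unique())
dropped = levels[0]

dummies = pd.get_dummies(toy, columns=["Ethnicity"], drop_first=True, dtype=int)
kept = list(dummies.columns)

print("Dropped category:", dropped)
print("Dummy columns kept:", kept)
```

Keep in mind that sorting is done character by character, so a value with a leading space (such as ' Male') sorts before 'Female'.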

In [None]:
Credit_df_dummies=pd.get_dummies(Credit_df,columns=['Gender','Student','Married','Ethnicity'], drop_first=True)

In [None]:
Credit_df_dummies.head()

__Note__: If you do not like the category that get_dummies() drops when drop_first= True, you can set drop_first= False (which is the default) and choose your own dummy (or dummies). This approach is shown in the optional material at the end of this notebook!

However, in class, to keep the analysis simpler, we will go with drop_first= True.

Let's run a regression analysis to get an equation with Balance as the dependent variable (DV), one quantitative predictor (Income), and all the qualitative predictors available (gender, student status, marital status, and ethnicity).

In [None]:
regression_object= smf.ols('Balance~ Income+ Gender_Female + Student_Yes + Married_Yes + Ethnicity_Asian+ Ethnicity_Caucasian', data= Credit_df_dummies)

In [None]:
regression_model= regression_object.fit()

In [None]:
regression_model.summary()

Let's re-run the regression with only the few predictors that make sense to keep in the analysis: Income and Student_Yes.

In [None]:
# Notice how the statsmodels methods are chained in this code cell. Compare with the three previous code cells.

smf.ols('Balance~ Income + Student_Yes', data= Credit_df_dummies).fit().summary()

In [None]:
regression_object2= smf.ols('Balance~ Income + Student_Yes', data= Credit_df_dummies)

In [None]:
regression_model2= regression_object2.fit()

Answer these questions. __I will give you five minutes to think about the answers and then we will discuss them together__:


1) Do people with more income tend to have more or less credit card balance?

2) Does being a student increase the chances of having a higher credit card balance?

Use the predict() method that we studied last class and:

3) Use the equation to predict the balance for a student making $60,000

__Note__: The income values are recorded in units of $1000.

4) Use the equation to predict the balance for a non-student making $60,000

5) Use the equation to predict the balance for a student making $45,000

6) Now try to get the predictions for 3, 4, and 5 using only one line of code
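As a reminder of how predict() takes new data, here is a minimal sketch on a made-up model. The data frame `toy` and the column names `x`, `d`, and `y` are hypothetical and are not part of the Credit data; adapt the idea to the model you fitted above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data that follows y = 10 + 2*x + 5*d exactly (no noise)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "d": [0, 1, 0, 1, 0, 1],   # a 0/1 dummy variable
})
toy["y"] = 10 + 2 * toy["x"] + 5 * toy["d"]

toy_model = smf.ols("y ~ x + d", data=toy).fit()

# New observations go in a data frame whose column names match the formula
new_rows = pd.DataFrame({"x": [60.0, 60.0, 45.0], "d": [1, 0, 1]})
predictions = toy_model.predict(new_rows)
print(predictions)
```

Each row of `new_rows` produces one prediction, in the same order as the rows.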

In [None]:
# Answer 3) here!



In [None]:
# Answer 4) here!



In [None]:
# Answer 5) here!



In [None]:
# Answer 6) here!



### About the importance of dropping one category when using dummy coding

In [None]:
Credit_df_dummies3= pd.get_dummies(Credit_df,columns=['Gender','Student','Married','Ethnicity'])

In [None]:
Credit_df_dummies3.head()

In [None]:
Credit_df_dummies3.rename(columns={'Gender_ Male':'Gender_Male','Ethnicity_African American':'Ethnicity_African_American'}, inplace=True)

In [None]:
my_regression_formula = 'Balance~' + '+'.join(Credit_df_dummies3.columns.difference(['Balance','Limit','Rating','Cards','Age','Education']))

In [None]:
regression_object3= smf.ols(my_regression_formula, data= Credit_df_dummies3)

In [None]:
regression_model3= regression_object3.fit()

In [None]:
regression_model3.summary()
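Why does keeping every category cause trouble? For each predictor, the full set of dummies adds up to a column of ones, which is exactly the intercept column, so the predictors are perfectly collinear. The sketch below shows this on a small made-up data frame (`toy`, not the Credit data) by comparing the number of columns of the design matrix with its rank.

```python
import numpy as np
import pandas as pd

# Made-up categorical predictor, for illustration only
toy = pd.DataFrame({"Student": ["Yes", "No", "No", "Yes", "No"]})
intercept = np.ones(len(toy))

# Keeping ALL categories: Student_No + Student_Yes equals the intercept column
all_dummies = pd.get_dummies(toy, columns=["Student"], dtype=int)
X_all = np.column_stack([intercept, all_dummies.to_numpy()])
rank_all = np.linalg.matrix_rank(X_all)

# Dropping one category removes the redundancy
k_dummies = pd.get_dummies(toy, columns=["Student"], drop_first=True, dtype=int)
X_drop = np.column_stack([intercept, k_dummies.to_numpy()])
rank_drop = np.linalg.matrix_rank(X_drop)

print("All categories:  columns =", X_all.shape[1], " rank =", rank_all)
print("One dropped:     columns =", X_drop.shape[1], " rank =", rank_drop)
```

When the rank is smaller than the number of columns, the individual coefficients are not uniquely determined. In the summary above, this typically shows up as a note about strong multicollinearity or a singular design matrix.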

### OPTIONAL MATERIAL!

#### The next few cells illustrate another way of using get_dummies() to convert categorical predictors into dummy variables

#### WE WON'T USE THIS WAY IN CLASS

#### Explore it as an independent study if you are curious about it

This second way of using get_dummies() does not set drop_first= True. Instead, it leaves it as False, which is the default value.

In [None]:
Credit_df_dummies2=pd.get_dummies(Credit_df,columns=['Gender','Student','Married','Ethnicity'])

In [None]:
Credit_df_dummies2.head()

Now, for each predictor, you can keep the dummy (or dummies) that you want, instead of being forced to use the ones chosen by Pandas when drop_first= True.

For example, let's keep Gender_Male and drop Gender_Female. 

In [None]:
Credit_df_dummies2.drop(['Gender_Female'], axis= 1, inplace=True)

In [None]:
Credit_df_dummies2.head()

For the other predictors, drop the same categories dropped by drop_first= True.

In [None]:
Credit_df_dummies2.drop(['Student_No', 'Married_No','Ethnicity_African American'], axis= 1, inplace=True)

In [None]:
Credit_df_dummies2.head()

You can also change the names of the dummies created by get_dummies() IF you do not like those names. For example, let's say you want to rename Ethnicity_Asian to Asian, and Ethnicity_Caucasian to Caucasian.

In [None]:
Credit_df_dummies2.rename(columns={'Ethnicity_Asian':'Asian','Ethnicity_Caucasian':'Caucasian'}, inplace=True)

In [None]:
Credit_df_dummies2.head()