# Working with Categorical Variables

### Introduction

The first assumption we are working under is that our true function is indeed linear.  By linear, we mean that the average of our target variable at each value of our features is a linear function of our features.  In other words, if we choose a feature and increase it by one unit, the target variable will increase or decrease by a constant amount.

When we say increase a feature, this implies that our underlying feature variable is a number.  So, for example, if we model how a change in temperature impacts the number of customers who visit a restaurant, our feature variable is numeric, and can increase or decrease.

But this is not always the case.  For example, we may want to use a restaurant's location, like its city, state or zip code, as a feature.  We generally don't think of these features as increasing or decreasing, at least not in a way that will be predictive of our target.  Or what about the day of the week?  It probably won't help to think of this as numeric.

Unfortunately, linear regression asks that *we do* think of our features as numeric.  As we know, regression implies that when we increase our feature, the target will increase or decrease.  So in this lesson, we'll see how we can "increase" a feature that is not numeric for the purposes of linear regression.

### Working with NYC SAT data

Let's load up some data on average SAT scores among NYC public schools.  [Click here](https://www.kaggle.com/nycopendata/high-schools#scores.csv) to view the data.

In [6]:
import pandas as pd
df = pd.read_csv('./scores.csv')
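# regex=False treats '(' and ')' as literal characters rather than regex syntax
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)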
df.columns

Index(['school_id', 'school_name', 'borough', 'building_code',
       'street_address', 'city', 'state', 'zip_code', 'latitude', 'longitude',
       'phone_number', 'start_time', 'end_time', 'student_enrollment',
       'percent_white', 'percent_black', 'percent_hispanic', 'percent_asian',
       'average_score_sat_math', 'average_score_sat_reading',
       'average_score_sat_writing', 'percent_tested'],
      dtype='object')

Now, we would like to tease out which borough is associated with higher math SAT scores.  So let's select just the relevant columns of our dataframe.

In [7]:
selected_columns = ['borough', 'average_score_sat_math']

In [8]:
selected_df = df[selected_columns]
selected_df[0:3]

| | borough | average_score_sat_math |
|---|---|---|
| 0 | Manhattan | NaN |
| 1 | Manhattan | NaN |
| 2 | Manhattan | 657.0 |


Now we can see that we have some missing data in our target variable, so let's remove the rows whose target is missing.

In [9]:
pruned_df = selected_df.dropna(subset=['average_score_sat_math'])

In [10]:
pruned_df[0:3]

| | borough | average_score_sat_math |
|---|---|---|
| 2 | Manhattan | 657.0 |
| 3 | Manhattan | 395.0 |
| 4 | Manhattan | 418.0 |


Now, we'd like to regress `average_score_sat_math` on `borough` to see how the borough a school is in relates to its average SAT math score.

### Fitting in categorical data

Now as we know, a linear regression model can only take in vectors of real numbers for the feature variables.  This is because when we train a linear regression model, we arrive at parameters that indicate how much we expect our target variable to change when we increase a feature by one unit.

This makes sense if we are asking: if we increase the average class size of a school by one student, by how much will the average SAT score change?  But this doesn't directly translate to borough.  What does it mean to increase a borough by one?
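To make the difficulty concrete, here is a minimal sketch of what happens if we simply number the boroughs.  The two numbering schemes and the variable names below are illustrative assumptions, not part of the lesson's dataset.

```python
boroughs = ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']

# Two equally defensible ways to number the same categories.
alphabetical_codes = {name: code for code, name in enumerate(boroughs)}
reversed_codes = {name: code for code, name in enumerate(reversed(boroughs))}

# Print each borough next to its code under both schemes.
for name in boroughs:
    print(name, alphabetical_codes[name], reversed_codes[name])
```

Under the first scheme Staten Island is "four units more" than the Bronx, and under the second it is four units less.  A regression fit on either column would treat that arbitrary ordering as meaningful.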

But it turns out we can fit our categorical data into our regression problem.  We'll break down the code later; for now, let's focus on the result.

In [11]:
pd.concat([pruned_df, pd.get_dummies(pruned_df.borough)], axis=1)[0:3]

| | borough | average_score_sat_math | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|---|---|
| 2 | Manhattan | 657.0 | 0 | 0 | 1 | 0 | 0 |
| 3 | Manhattan | 395.0 | 0 | 0 | 1 | 0 | 0 |
| 4 | Manhattan | 418.0 | 0 | 0 | 1 | 0 | 0 |


> The code above is slightly incorrect; we'll deal with that shortly.

Ok, so let's focus on the resulting dataframe we see above.

We have added five additional columns to our dataframe, one for each borough.  And we can see that whenever the borough is Manhattan, we move the Manhattan column from zero to one.  So in this way, we are able to translate our categories into numbers that make sense.

Here, each parameter indicates the impact of going from not being in that borough to being in that borough.
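As a small illustration of this encoding, the sketch below applies `pd.get_dummies` to a toy dataframe.  The toy rows and the name `toy` are assumptions made for the example, and `dtype=int` is passed so the columns hold 0 and 1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data: four schools in three boroughs.
toy = pd.DataFrame({'borough': ['Manhattan', 'Bronx', 'Queens', 'Manhattan']})

# One indicator column per category, ordered alphabetically.
dummies = pd.get_dummies(toy['borough'], dtype=int)
print(dummies)

# Each row has exactly one column switched on.
print(dummies.sum(axis=1).tolist())
```

Note that this keeps a column for every category, just like the cell above.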

### A problem

Our code above should be modified.  To see why, let's first encode a simpler feature, whether or not a school is in Manhattan, and once again focus on the result.

In [12]:
import numpy as np
# assign returns a new dataframe, which avoids pandas' SettingWithCopyWarning
pruned_df = pruned_df.assign(is_manhattan=np.where(pruned_df.borough == 'Manhattan', 1, 0))


In [13]:
updated_df = pd.concat([pruned_df, pd.get_dummies(pruned_df.is_manhattan)], axis=1)
updated_df = updated_df.dropna(subset=['average_score_sat_math'])
updated_df.columns = ['borough', 'average_score_sat_math', 'is_manhattan', 'outside_manhattan', 'inside_manhattan']

In [14]:
updated_df[0:5]

| | borough | average_score_sat_math | is_manhattan | outside_manhattan | inside_manhattan |
|---|---|---|---|---|---|
| 2 | Manhattan | 657.0 | 1 | 0 | 1 |
| 3 | Manhattan | 395.0 | 1 | 0 | 1 |
| 4 | Manhattan | 418.0 | 1 | 0 | 1 |
| 5 | Manhattan | 613.0 | 1 | 0 | 1 |
| 6 | Manhattan | 410.0 | 1 | 0 | 1 |


Here the column `outside_manhattan` captures the same information as `inside_manhattan`: whenever `inside_manhattan` switches on, `outside_manhattan` switches off.  This is problematic for our linear regression model.  Is the change in the SAT score due to the change in the `outside_manhattan` column or the `inside_manhattan` column?  The linear regression model doesn't know, because both change at the same time.
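One way to see the redundancy is to check the rank of the feature matrix once an intercept column is included.  This is a minimal sketch on made-up rows; the array names are assumptions for illustration.

```python
import numpy as np

# Hypothetical indicator values for five schools.
inside = np.array([1, 1, 0, 0, 1])
outside = 1 - inside
intercept = np.ones_like(inside)

# With both indicators, intercept == outside + inside in every row,
# so the three columns only carry two columns' worth of information.
design_both = np.column_stack([intercept, outside, inside])
rank_both = np.linalg.matrix_rank(design_both)

# Dropping one indicator leaves two independent columns.
design_single = np.column_stack([intercept, inside])
rank_single = np.linalg.matrix_rank(design_single)

print(rank_both, design_both.shape[1])
print(rank_single, design_single.shape[1])
```

When the rank is lower than the number of columns, many different sets of coefficients fit the data equally well.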

In [25]:
features = updated_df[['outside_manhattan', 'inside_manhattan']].to_numpy()
features[0:3]

array([[0, 1],
       [0, 1],
       [0, 1]], dtype=uint8)

In [27]:
targets = updated_df['average_score_sat_math'].to_numpy()
targets[0:3]

array([657., 395., 418.])

In [29]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(features, targets)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [31]:
model.coef_

array([-15.04172232,  15.04172232])

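This makes sense: switching `outside_manhattan` on has an effect of the same size as switching `inside_manhattan` on, but in the opposite direction.  So essentially, a single effect is being split across both of these features.  If we fit the model on `inside_manhattan` alone, the full effect shows up in one coefficient.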

In [35]:
from sklearn.linear_model import LinearRegression
updated_model = LinearRegression()
updated_model.fit(features[:, 1].reshape(-1, 1), targets)
updated_model.coef_

array([30.08344465])
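The relationship between these two fits can be reproduced on toy data with NumPy alone.  The sketch below is an assumption-laden illustration: the scores are made up, and `fit_with_intercept` is a hypothetical helper that handles the intercept by centering the data and then takes NumPy's minimum-norm least squares solution, which is not necessarily how scikit-learn computes its result internally.

```python
import numpy as np

# Hypothetical scores for three schools inside and four outside Manhattan.
inside = np.array([1, 1, 1, 0, 0, 0, 0], dtype=float)
outside = 1 - inside
scores = np.array([500., 520., 510., 470., 480., 490., 480.])

def fit_with_intercept(features, target):
    """Least squares coefficients, with the intercept handled by centering."""
    centered_features = features - features.mean(axis=0)
    centered_target = target - target.mean()
    # lstsq returns the minimum-norm solution when columns are redundant.
    coefs, *_ = np.linalg.lstsq(centered_features, centered_target, rcond=None)
    return coefs

both = fit_with_intercept(np.column_stack([outside, inside]), scores)
single = fit_with_intercept(inside.reshape(-1, 1), scores)
mean_gap = scores[inside == 1].mean() - scores[inside == 0].mean()

print(both, single, mean_gap)
```

On this toy data, the single coefficient equals the gap between the two group means, and the two-column fit splits that gap into equal and opposite halves.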

In general, we don't want the values of one column to be highly correlated with those of another.  When two columns always change together, the linear regression model sees each row as changing two features at once, so it can't tease out which feature is responsible.

### The fix

The way that we can fix this is by removing one of the columns.

In [44]:
selected = ['average_score_sat_math']
selected_df = pruned_df[selected]
dummies_df_drop_first = pd.concat([selected_df, pd.get_dummies(pruned_df.borough, drop_first=True)], axis=1)
dummies_df_drop_first[0:3]

| | average_score_sat_math | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|
| 2 | 657.0 | 0 | 1 | 0 | 0 |
| 3 | 395.0 | 0 | 1 | 0 | 0 |
| 4 | 418.0 | 0 | 1 | 0 | 0 |


In [58]:
numpy_dummies = dummies_df_drop_first.to_numpy()
dummies_features = numpy_dummies[:, 1:]
dummies_target = numpy_dummies[:, 0]
dummies_target.shape

(375,)

In [59]:
dummies_features.shape

(375, 4)

In [60]:
from sklearn.linear_model import LinearRegression
updated_model = LinearRegression()
updated_model.fit(dummies_features, dummies_target)
updated_model.coef_

array([12.04652687, 51.53049759, 58.00517598, 81.84285714])

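So we can read the above as follows: schools in Brooklyn average about 12.05 points higher on SAT math than schools in the Bronx, schools in Manhattan average about 51.53 points higher than schools in the Bronx, and so on.  The Bronx serves as the baseline because `drop_first=True` removed its column.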
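This baseline reading can be checked on toy data: with only dummy columns as features, each coefficient should equal that group's mean minus the baseline group's mean.  The sketch below uses made-up scores and solves the least squares problem with NumPy rather than scikit-learn; the names `toy`, `design` and `gaps` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two schools in each of three boroughs.
toy = pd.DataFrame({
    'borough': ['Bronx', 'Bronx', 'Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan'],
    'score': [400., 420., 430., 450., 470., 490.],
})

# drop_first=True removes the first category (Bronx), making it the baseline.
dummies = pd.get_dummies(toy['borough'], drop_first=True, dtype=float)

# Add an explicit intercept column and solve ordinary least squares.
design = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
params, *_ = np.linalg.lstsq(design, toy['score'].to_numpy(), rcond=None)
intercept, coefs = params[0], params[1:]

# Compare against the gaps between group means and the Bronx mean.
group_means = toy.groupby('borough')['score'].mean()
gaps = (group_means - group_means['Bronx']).drop('Bronx').to_numpy()

print(intercept, coefs, gaps)
```

Here the intercept recovers the baseline group's mean, and each coefficient recovers the difference from that baseline.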