## Encoding Categorical Variables
    
   - **One-Hot Encoding** - A categorical variable with k possible categories is represented by a group of k bits, one per category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be "on". Each bit is a feature.
   
    
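   -  **Dummy Coding** - The problem with one-hot encoding is that it allows for k degrees of freedom, while the variable itself needs only k-1. ***Dummy coding*** removes the extra degree of freedom by using only k-1 features in the representation. The category that is left out becomes the reference category and is represented by all zeros.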
   
---

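````The outcome of modeling with dummy coding is more interpretable than with one-hot encoding: the intercept represents the reference category, and each coefficient is the difference between a category and the reference.````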

---
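To make the degrees-of-freedom point above concrete, here is a minimal sketch that is not part of the original notebook. It uses a hypothetical three-category series and checks that every one-hot row sums to 1, which makes the one-hot columns linearly dependent once an intercept column is present, while dropping one column restores full rank. All variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Three categories (k = 3), one example row per category.
cities = pd.Series(['NYC', 'SF', 'Seattle'], name='city')

one_hot = pd.get_dummies(cities, dtype=int)                 # k columns
dummy = pd.get_dummies(cities, dtype=int, drop_first=True)  # k - 1 columns

# Every one-hot row sums to 1, so the columns together duplicate an intercept.
row_sums = one_hot.sum(axis=1).tolist()

# Rank of the design matrix [intercept | features].
ones = np.ones((len(cities), 1))
rank_one_hot = np.linalg.matrix_rank(np.hstack([ones, one_hot.to_numpy()]))
rank_dummy = np.linalg.matrix_rank(np.hstack([ones, dummy.to_numpy()]))

print(row_sums)
print(one_hot.shape[1] + 1, rank_one_hot)  # number of columns vs. rank (one-hot)
print(dummy.shape[1] + 1, rank_dummy)      # number of columns vs. rank (dummy)
```

With an intercept, the one-hot design matrix has one more column than its rank, which is the extra degree of freedom that dummy coding removes.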

In [1]:
# Import libraries

import pandas as pd
from sklearn.linear_model import LinearRegression

In [2]:
# Define a toy dataset of apartment rental prices in NYC, SF, and Seattle

df = pd.DataFrame({'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC', 'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]})
df

|   | City | Rent |
|---|------|------|
| 0 | SF | 3999 |
| 1 | SF | 4000 |
| 2 | SF | 4001 |
| 3 | NYC | 3499 |
| 4 | NYC | 3500 |
| 5 | NYC | 3501 |
| 6 | Seattle | 2499 |
| 7 | Seattle | 2500 |
| 8 | Seattle | 2501 |


In [3]:
df.Rent.mean()

3333.3333333333335
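
The fitted coefficients later in the notebook are easiest to read against per-city averages, so it can help to compute those alongside the overall mean. A small sketch, re-creating the same toy data so that it runs on its own; `city_means` is a name introduced here for illustration:

```python
import pandas as pd

# Same toy dataset as above.
df = pd.DataFrame({'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
                            'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501,
                            2499, 2500, 2501]})

# Mean rent within each city.
city_means = df.groupby('City')['Rent'].mean()
print(city_means)
```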

In [4]:
# Convert the categorical variable in the DataFrame to one-hot encoding
# and fit a linear regression model.

one_hot_df = pd.get_dummies(df, prefix=['city'])
one_hot_df

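|   | Rent | city_NYC | city_SF | city_Seattle |
|---|------|----------|---------|--------------|
| 0 | 3999 | 0 | 1 | 0 |
| 1 | 4000 | 0 | 1 | 0 |
| 2 | 4001 | 0 | 1 | 0 |
| 3 | 3499 | 1 | 0 | 0 |
| 4 | 3500 | 1 | 0 | 0 |
| 5 | 3501 | 1 | 0 | 0 |
| 6 | 2499 | 0 | 0 | 1 |
| 7 | 2500 | 0 | 0 | 1 |
| 8 | 2501 | 0 | 0 | 1 |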


In [5]:
# Assign the variables

X = one_hot_df.drop('Rent', axis=1)
y = one_hot_df['Rent']

In [6]:
# Create a model

lin_reg = LinearRegression()

In [7]:
# Train the model

lin_reg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [8]:
lin_reg.coef_

array([ 166.66666667,  666.66666667, -833.33333333])

In [9]:
lin_reg.intercept_

3333.3333333333335

In [10]:
# One-hot encoding weights + intercept

w1 = lin_reg.coef_
b1 = lin_reg.intercept_
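
One way to read these weights: in the output above, the intercept equals the overall mean rent, and each weight is a city's offset from that mean. The sketch below refits the same one-hot model on its own and checks that intercept plus weight reproduces each city's mean rent. Because the one-hot columns are redundant with the intercept, the particular split between intercept and weights is not unique; only the sums are pinned down by the data. The names `model` and `per_city` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same toy dataset as above.
df = pd.DataFrame({'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
                            'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501,
                            2499, 2500, 2501]})

# One-hot encode the City column (one column per city).
one_hot_df = pd.get_dummies(df, columns=['City'], prefix='city', dtype=float)
X = one_hot_df.drop('Rent', axis=1)
y = one_hot_df['Rent']

model = LinearRegression().fit(X, y)

# intercept + weight is the model's prediction for a row of that city.
per_city = dict(zip(X.columns, model.intercept_ + model.coef_))
for name, value in per_city.items():
    print(name, round(value, 2))
```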

---

In [11]:
# Train a linear regression model on dummy-coded data
# Specify the 'drop_first' flag to get dummy coding
dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
dummy_df

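|   | Rent | city_SF | city_Seattle |
|---|------|---------|--------------|
| 0 | 3999 | 1 | 0 |
| 1 | 4000 | 1 | 0 |
| 2 | 4001 | 1 | 0 |
| 3 | 3499 | 0 | 0 |
| 4 | 3500 | 0 | 0 |
| 5 | 3501 | 0 | 0 |
| 6 | 2499 | 0 | 1 |
| 7 | 2500 | 0 | 1 |
| 8 | 2501 | 0 | 1 |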


In [12]:
# Assign the variables

data = dummy_df.drop('Rent', axis=1)
target = dummy_df['Rent']

In [13]:
# Train the model

lin_reg.fit(data, target)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [14]:
lin_reg.coef_

array([  500., -1000.])

In [15]:
lin_reg.intercept_

3500.0

In [16]:
# Dummy coding weights + intercept
w2 = lin_reg.coef_
b2 = lin_reg.intercept_
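
As a rough check of how the two encodings relate, the following sketch refits both models on the same toy data and compares them. It assumes two fresh `LinearRegression` instances (`model_one_hot` and `model_dummy`, names introduced here) rather than the reused `lin_reg` object. Both encodings should give identical predictions. With dummy coding, the intercept is the mean rent of the dropped reference city (NYC), and each weight is that city's difference from NYC.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same toy dataset as above.
df = pd.DataFrame({'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
                            'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501,
                            2499, 2500, 2501]})
y = df['Rent']

# One-hot: k columns. Dummy coding: k - 1 columns (NYC is dropped).
X_one_hot = pd.get_dummies(df[['City']], columns=['City'], prefix='city',
                           dtype=float)
X_dummy = pd.get_dummies(df[['City']], columns=['City'], prefix='city',
                         dtype=float, drop_first=True)

model_one_hot = LinearRegression().fit(X_one_hot, y)
model_dummy = LinearRegression().fit(X_dummy, y)

# Predictions agree even though the coefficients differ.
pred_one_hot = model_one_hot.predict(X_one_hot)
pred_dummy = model_dummy.predict(X_dummy)
print(np.allclose(pred_one_hot, pred_dummy))

# Dummy coding: intercept is the reference (NYC) mean,
# weights are differences from that reference.
print(model_dummy.intercept_)
print(dict(zip(X_dummy.columns, model_dummy.coef_)))
```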