# <span style="color:purple; font-weight:bold">One-Hot Encoding</span>


## <span style="color:red; font-weight:bold">Categorical Variables</span>

**When the data contains one or more categorical variables, we need to encode them as numbers, because linear regression accepts only numerical inputs.**

![One Hot Encoding](assets/images/One_Hot_Encoding.png)    

---

## <span style="color:brown; font-weight:bold">Types of Encoding</span>

![Encoding Types](assets/images/Encoding_Types.png)

---

## <span style="color:brown; font-weight:bold">Types of Categorical Data</span>

![Nominal vs Ordinal](assets/images/Nominal_vs_Ordinal.png)

---

## We will focus on Label and One-Hot Encoding

1. <h4 style="color:purple">Label Encoding</h4>
   - Assigns a unique integer to each category

    ![Label Encoding](assets/images/Label_Encoding.jpg)

2. <h4 style="color:purple">One-Hot Encoding</h4>
   - Creates a new binary (0 or 1) column for each category. In each row, the column for that row's category is "hot" ('1' or 'True') and all the others are '0' or 'False'.

    ![One Hot Encoding](assets/images/One_Hot_Encoding.png)
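
To make the difference between the two encodings concrete, here is a minimal sketch on a made-up `color` column (the data is hypothetical and is not taken from `homeprices.csv`). It uses the pandas categorical codes for label encoding and `pd.get_dummies` for one-hot encoding.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer per category.
# pandas sorts the categories alphabetically: blue=0, green=1, red=2
label_codes = colors.astype("category").cat.codes
print(label_codes.tolist())

# One-hot encoding: one 0/1 column per category, exactly one 1 per row
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Label encoding implies an order (blue < green < red) that does not exist for nominal data, which is why one-hot encoding is usually the safer choice for such columns in a linear model.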

---

# <span style="color:blue">Multicollinearity</span>

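1. **Two or more independent variables** are **highly correlated**, meaning one can be linearly predicted from the others with high accuracy.
2. One-hot encoding **introduces this problem** in linear regression: the one-hot columns of a variable always sum to 1, so any one of them can be derived exactly from the others, which makes the individual coefficients unreliable.
3. To tackle this we intentionally **drop one column** from the one-hot encoded columns (which one is arbitrary) to reduce (sometimes eliminate) multicollinearity. The dropped category becomes the baseline that the remaining columns are compared against.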
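
The points above can be checked numerically. The sketch below uses a made-up `color` column (hypothetical data) and NumPy's `matrix_rank` to show that a full set of one-hot columns plus an intercept column is rank deficient, and that dropping one column restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, used only for illustration
colors = pd.Series(["red", "green", "blue", "green", "red"], name="color")

full = pd.get_dummies(colors, dtype=float)   # columns: blue, green, red
reduced = full.drop(columns=["blue"])        # drop one column (any one works)

def with_intercept(frame):
    """Prepend a column of ones, as a regression with an intercept would."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

design_full = with_intercept(full)
design_reduced = with_intercept(reduced)

# The intercept column equals blue + green + red, so one column is redundant
rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design_full.shape[1], rank_full)        # 4 columns, rank 3
print(design_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```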

---

In [25]:
import pandas as pd

In [26]:
df = pd.read_csv("./assets/files/homeprices.csv")
df

**1. Using pandas to create dummy variables**

In [27]:
dummies = pd.get_dummies(df.town)
dummies

In [28]:
merged = pd.concat([df,dummies],axis='columns')
merged

In [29]:
final = merged.drop(['town'], axis='columns')
final

**IMPORTANT! Dummy Variable Trap**

When one variable can be derived from the other variables, they are said to be multicollinear. For example, with one-hot state columns california, georgia and new jersey, knowing the values of california and georgia is enough to infer the value of new jersey: if california=0 and georgia=0, then new jersey=1. These state columns are therefore multicollinear. In this situation linear regression may not work as expected, so you need to drop one column. The same reasoning applies to the town columns here.

**NOTE: sklearn's linear regression copes with the dummy variable trap, so it still works even if you don't drop one of the dummy columns. However, we should make a habit of handling the dummy variable trap ourselves, in case the library we are using does not handle it for us.**

In [30]:
final = final.drop(['west windsor'], axis='columns')
final
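
As an aside, the create, concatenate and drop steps above can be collapsed into one call with the `columns` and `drop_first` parameters of `pd.get_dummies`. This is a sketch on made-up rows (the values are hypothetical stand-ins for `homeprices.csv`). Note that `drop_first=True` drops the alphabetically first category, whereas the cell above drops `west windsor`; either choice avoids the trap.

```python
import pandas as pd

# Hypothetical rows standing in for homeprices.csv
toy = pd.DataFrame({
    "town": ["robbinsville", "west windsor", "robbinsville", "west windsor"],
    "area": [2600, 2900, 3100, 3600],
    "price": [575000, 600000, 620000, 710000],
})

# Encode 'town' and drop the first dummy column in a single step;
# the new columns are prefixed with the original column name
encoded = pd.get_dummies(toy, columns=["town"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```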

In [31]:
X = final.drop('price', axis='columns')
X

In [32]:
y = final.price

In [33]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [34]:
model.fit(X,y)

In [35]:
model.predict(X) # predictions for every row of the training data

In [36]:
model.score(X,y)

In [37]:
model.predict([[3400,0,0]]) # 3400 sq ft home in west windsor (the dropped column, so both dummies are 0)

In [38]:
model.predict([[2800,0,1]]) # 2800 sq ft home in robbinsville
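
Passing bare lists such as `[[2800,0,1]]` relies on remembering the column order. One possible alternative, sketched here on made-up data with hypothetical column names `town_a` and `town_b`, is to build the new row as a DataFrame and reorder it to match the training columns before predicting.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data, already one-hot encoded with one dummy dropped
X_toy = pd.DataFrame({
    "area": [2600, 3000, 2600, 3100, 2800, 3300],
    "town_a": [1, 1, 0, 0, 0, 0],
    "town_b": [0, 0, 1, 1, 0, 0],
})
y_toy = pd.Series([550000, 565000, 575000, 620000, 600000, 650000])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Name each value explicitly, then enforce the training column order
new_home = pd.DataFrame([{"town_b": 1, "area": 2800, "town_a": 0}])
new_home = new_home[X_toy.columns]
prediction = toy_model.predict(new_home)
print(prediction)
```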

**2. Using sklearn OneHotEncoder**

In [39]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [40]:
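dfle = df.copy()  # copy, so the original df keeps its town names
dfle.town = le.fit_transform(dfle.town)  # replace each town name with an integer label
dfle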

In [41]:
X = dfle[['town','area']].values
X

In [42]:
y = dfle.price.values
y

In [43]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [44]:
X = ct.fit_transform(X)
X

In [45]:
X = X[:,1:]  # drop the first one-hot column to avoid the dummy variable trap
X

In [46]:
model.fit(X,y)

In [47]:
model.predict([[0,1,3400]]) # 3400 sq ft home in west windsor

In [48]:
model.predict([[1,0,2800]]) # 2800 sq ft home in robbinsville
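
The `LabelEncoder` step and the manual `X[:,1:]` slice above can be avoided: `OneHotEncoder` accepts string columns directly and has a `drop` parameter. The following is a minimal sketch of that variant, not the notebook's own method, using made-up rows (hypothetical values) and a `Pipeline` so that new data is encoded the same way as the training data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows standing in for homeprices.csv
toy = pd.DataFrame({
    "town": ["robbinsville", "west windsor"] * 3,
    "area": [2600, 2900, 3100, 3600, 2800, 3300],
    "price": [575000, 600000, 620000, 710000, 590000, 650000],
})

# One-hot encode 'town', dropping the first category; pass 'area' through unchanged
preprocess = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
)
pipeline = Pipeline([("encode", preprocess), ("regress", LinearRegression())])
pipeline.fit(toy[["town", "area"]], toy["price"])

# New homes are described with plain town names; the pipeline encodes them
new_homes = pd.DataFrame({
    "town": ["west windsor", "robbinsville"],
    "area": [3400, 2800],
})
predictions = pipeline.predict(new_homes)
print(predictions)
```

With this layout there is no need to remember which position in the input row corresponds to which town.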