### One hot encoding

- One-hot encoding is applied to nominal categorical variables, i.e. categories that have no natural order.
- It creates one binary (0/1) column for each category of the variable you want to encode.
- When all n columns are created, a multicollinearity problem arises between the independent variables, because the n dummy columns always sum to 1. This is a problem for models such as linear regression, so instead of making n columns we make n - 1 columns by dropping the first one.
- The multicollinearity that arises from creating dummy variables is called the dummy variable trap.
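
To make the dummy variable trap concrete, here is a minimal sketch on a small made-up `fuel` column (the toy values are an assumption for illustration, not the car dataset used below). It checks that an intercept plus all n dummy columns is rank deficient, while an intercept plus n - 1 dummy columns is not.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the car dataset)
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Petrol", "Diesel"]})

# n dummy columns: one per category
full = pd.get_dummies(toy["fuel"], dtype=int)

# Every row has exactly one 1, so the dummy columns always sum to 1
row_sums = full.sum(axis=1)

# Intercept column + all n dummies: the intercept equals the sum of the dummies
design_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(design_full)

# Dropping the first dummy removes the redundancy
reduced = pd.get_dummies(toy["fuel"], drop_first=True, dtype=int)
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print("columns / rank with n dummies:    ", design_full.shape[1], rank_full)
print("columns / rank with n - 1 dummies:", design_reduced.shape[1], rank_reduced)
```

With all n dummies the design matrix has more columns than its rank, which is exactly the redundancy that `drop_first = True` and `drop='first'` remove in the cells below.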

In [25]:
# importing libraries

import numpy as np
import pandas as pd


from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [26]:
# reading the data

df = pd.read_csv("data.csv")
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


### Using pandas

In [27]:
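# drop_first=True drops the first category of each encoded column;
# dtype=int gives 0/1 values (pandas 2.0+ returns booleans by default)
pd.get_dummies(df, columns=["fuel", "owner"], drop_first=True, dtype=int)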

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0


### Using Sklearn

In [28]:
X = df[["brand","km_driven","fuel","owner"]]
y = df["selling_price"]

In [29]:
# split the train and test data

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)

In [30]:
# scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)

In [31]:
X_train_transformed = ohe.fit_transform(X_train[['fuel','owner']])

In [32]:
X_test_transformed = ohe.transform(X_test[['fuel','owner']])


In [33]:
X_train_transformed

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [34]:
# the encoder returns a NumPy array, so we use hstack to join it with the remaining columns

np.hstack((X_train[['brand','km_driven']].values, X_train_transformed))

array([['Tata', 20000, 0, ..., 0, 0, 0],
       ['Maruti', 30000, 0, ..., 0, 0, 0],
       ['Maruti', 15000, 0, ..., 0, 0, 0],
       ...,
       ['Hyundai', 90000, 0, ..., 1, 0, 0],
       ['Volkswagen', 90000, 1, ..., 0, 0, 0],
       ['Hyundai', 110000, 0, ..., 0, 0, 0]], dtype=object)

#### Most frequent categories

- If a column has many distinct categories (say, more than 80) and we create a dummy variable for every category, the number of columns in our data grows sharply. So instead we keep the most frequent categories and label the rest as "others".

In [35]:
# Say I take a threshold of 120 cars. If a brand
# has at most 120 cars, I will put it under "others"

# take the count of brands

count = df["brand"].value_counts()

In [37]:
# take threshold = 120

threshold = 120

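# brands at or below the threshold are the infrequent ones
rare_brands = count[count <= threshold].index

# relabel the infrequent brands as "others", then get the dummies

pd.get_dummies(df["brand"].replace(rare_brands, 'others'), dtype=int)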

Unnamed: 0,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Tata,Toyota,Volkswagen,others
0,0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,1
2,0,0,1,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
8123,0,0,0,1,0,0,0,0,0,0,0
8124,0,0,0,1,0,0,0,0,0,0,0
8125,0,0,0,0,0,1,0,0,0,0,0
8126,0,0,0,0,0,0,0,1,0,0,0
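
The cell above picks categories by a count threshold. A variant of the same idea is to keep a fixed number of top categories instead. The sketch below is one way to do that; `keep_top_categories` is a hypothetical helper and the brand list is made up for illustration.

```python
import pandas as pd

def keep_top_categories(series, k, other_label="others"):
    """Keep the k most frequent values of `series`; relabel the rest."""
    top = series.value_counts().nlargest(k).index
    # where() keeps values that satisfy the condition and replaces the others
    return series.where(series.isin(top), other_label)

# Made-up brand column: Maruti x3, Hyundai x2, Tata x1, Skoda x1
brands = pd.Series(
    ["Maruti", "Maruti", "Maruti", "Hyundai", "Hyundai", "Tata", "Skoda"]
)

grouped = keep_top_categories(brands, k=2)
dummies = pd.get_dummies(grouped, dtype=int)
print(dummies)
```

A fixed `k` bounds the number of dummy columns at `k + 1` regardless of how the counts are distributed, whereas a count threshold bounds how rare a kept category can be.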
