# Handle nominal data with one hot encoding

# ONE HOT ENCODING 

<p>One-hot encoding converts nominal categorical data (categories with no inherent order) into a numerical format by creating one binary column per unique category: a value of 1 marks the presence of that category and 0 its absence. For example, a feature like Color with values Red, Blue, and Green is transformed into three columns (Color_Red, Color_Blue, Color_Green). Because no column outranks another, the model cannot infer a ranking or priority among the categories, which makes the technique suitable for algorithms that need numerical input but would misread integer labels as ordered.
</p>
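
As a minimal sketch of the Color example above, the snippet below encodes a small toy frame with `pd.get_dummies`. The data values are illustrative assumptions, and `dtype=int` is used only so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Toy data mirroring the Color example (values are illustrative).
colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(colors, columns=["Color"], dtype=int)

print(encoded)
```

Each row contains exactly one 1, in the column that matches its original category.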

# Multicollinearity

<p>Multicollinearity is a statistical problem in regression and other linear models where two or more independent (feature) variables are highly correlated with each other, meaning one feature can be predicted with high accuracy from a linear combination of the others. This makes it difficult for the model to separate the individual effect of each feature on the target variable, leading to unstable coefficients, inflated standard errors, and reduced interpretability, even though overall prediction accuracy may remain high. Multicollinearity is commonly detected with the variance inflation factor (VIF) and addressed through feature removal, dimensionality reduction, or regularization.</p>
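
To make the VIF idea concrete, here is a rough NumPy-only sketch that regresses each feature on the others and reports 1 / (1 - R²). The helper name `vif` and the synthetic features are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residuals = y - others @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=500)  # nearly a linear copy of x1
x3 = rng.normal(size=500)                         # unrelated feature

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The first two values come out very large because `x1` and `x2` nearly determine each other, while the third stays close to 1. A commonly quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though the cutoff is a convention rather than a strict rule.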

# Dummy variable Trap

<p>The dummy variable trap is a case of perfect multicollinearity that arises in regression models when every dummy variable created by one-hot encoding is kept. Because the K category columns of a feature always sum to 1, any one of them can be computed exactly from the others, and in a model with an intercept their sum duplicates the intercept column. As a result, the model may produce unstable coefficients or fail to compute a unique solution. To avoid the dummy variable trap, one category per feature is typically dropped (known as the reference category), leaving K-1 columns, so the model remains mathematically sound and the remaining coefficients are interpreted relative to that reference.</p>
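
The redundancy described above can be checked numerically. The sketch below, on an assumed toy `fuel` column, compares the rank of a design matrix (intercept plus dummies) with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Toy column with K = 4 categories (values are illustrative).
fuel = pd.Series(["Diesel", "Petrol", "CNG", "Diesel", "LPG", "Petrol"], name="fuel")

full = pd.get_dummies(fuel, dtype=float)                      # K columns
reduced = pd.get_dummies(fuel, drop_first=True, dtype=float)  # K-1 columns

# Prepend an intercept column, as a linear regression would.
intercept = np.ones((len(fuel), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 5 columns, rank 4
print(X_reduced.shape[1], rank_reduced)  # 4 columns, rank 4
```

With all K dummies the matrix has more columns than its rank, so the least-squares solution is not unique. Dropping one category makes the matrix full rank.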

In [21]:
import pandas as pd
import numpy as np


In [22]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
df = pd.read_csv(r'E:\WORK\AI-ML\data\cars.csv')


In [23]:
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [24]:
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

# 1. OneHotEncoding using Pandas

In [25]:
pd.get_dummies(df,columns=['fuel','owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


# 2. K-1 OneHotEncoding 

In [27]:
pd.get_dummies(df,columns=['fuel','owner'],drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


# 3. OneHotEncoding using Sklearn

In [28]:
from sklearn.model_selection import train_test_split

# Features: brand, km_driven, fuel, owner; target: selling_price (last column).
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:4], df.iloc[:, -1], test_size=0.2, random_state=42
)

In [29]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6518,Tata,2560,Petrol,First Owner
6144,Honda,80000,Petrol,Second Owner
6381,Hyundai,150000,Diesel,Fourth & Above Owner
438,Maruti,120000,Diesel,Second Owner
5939,Maruti,25000,Petrol,First Owner


In [30]:
from sklearn.preprocessing import OneHotEncoder

In [31]:
ohe= OneHotEncoder(drop='first')

In [32]:
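# Learn the categories from the training split only, then convert the sparse result to a dense array.
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']]).toarray()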

In [33]:
# Reuse the categories learned from the training split; do not refit on the test split.
X_test_new = ohe.transform(X_test[['fuel', 'owner']]).toarray()

In [42]:
X_train_new.shape

(6502, 7)

In [35]:
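# Place the untouched columns next to the encoded ones; mixing strings and floats yields an object array.
np.hstack((X_train[['brand', 'km_driven']].values, X_train_new))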

array([['Tata', 2560, 0.0, ..., 0.0, 0.0, 0.0],
       ['Honda', 80000, 0.0, ..., 1.0, 0.0, 0.0],
       ['Hyundai', 150000, 1.0, ..., 0.0, 0.0, 0.0],
       ...,
       ['Hyundai', 35000, 0.0, ..., 0.0, 0.0, 0.0],
       ['Maruti', 27000, 1.0, ..., 0.0, 0.0, 0.0],
       ['Maruti', 70000, 0.0, ..., 1.0, 0.0, 0.0]],
      shape=(6502, 9), dtype=object)

# 4. OneHotEncoding with Top categories

In [36]:
counts = df['brand'].value_counts()

In [37]:
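df['brand'].nunique()  # number of distinct brands (not displayed, since it is not the last expression)
threshold = 100        # brands with at most this many rows are treated as uncommon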

In [38]:
# Brands that appear at most `threshold` times.
repl = counts[counts <= threshold].index

In [39]:
# Collapse the rare brands into a single 'uncommon' label before encoding.
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(10)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
930,False,False,False,False,False,False,True,False,False,False,False,False,False
3821,False,False,False,False,False,True,False,False,False,False,False,False,False
2852,False,False,False,False,True,False,False,False,False,False,False,False,False
6474,False,False,False,False,False,False,False,True,False,False,False,False,False
1007,False,False,False,False,False,False,False,False,False,False,True,False,False
1983,False,False,False,False,False,False,True,False,False,False,False,False,False
1498,False,False,False,False,False,True,False,False,False,False,False,False,False
4900,False,False,True,False,False,False,False,False,False,False,False,False,False
6000,False,False,False,False,False,True,False,False,False,False,False,False,False
7962,False,False,True,False,False,False,False,False,False,False,False,False,False
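
The same grouping idea can be traced end to end on a small series where the counts are easy to verify by hand. This is only a sketch: the toy brand list and the cutoff of 2 are assumptions, and it uses `Series.where` with `isin` in place of the `replace` call above, which should give the same grouping.

```python
import pandas as pd

brands = pd.Series(
    ["Maruti"] * 5 + ["Hyundai"] * 4 + ["Tata"] * 3 + ["Jeep", "Lexus", "Jeep"],
    name="brand",
)

threshold = 2  # hypothetical cutoff for this toy series; the notebook uses 100
counts = brands.value_counts()
rare = counts[counts <= threshold].index

# Keep a brand where it is not rare; otherwise substitute the shared label.
grouped = brands.where(~brands.isin(rare), "uncommon")
dummies = pd.get_dummies(grouped, dtype=int)

print(dummies.sum())
```

Jeep and Lexus fall at or below the cutoff, so they share the single `uncommon` column, which keeps the number of encoded columns small when a feature has many infrequent categories.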
