One hot encoding is applied to nominal data (categories with no inherent order).

# 📌 One Hot Encoding – Brief Notes

**One Hot Encoding** is a data preprocessing technique used to convert **categorical variables** into a numerical format that machine learning models can understand.

---

## ✔ Why is it needed?

- Most machine learning algorithms cannot work directly with text labels (e.g., **"Red", "Blue", "Green"**).
- They require **numeric inputs**.
- One hot encoding transforms each category into a **binary (0/1) vector**.

---

## ✔ How does it work?

For a categorical column with **N unique values**, it creates **N new columns**, one per category.

### **Example**

**Input:**

__Color__
- Red
- Blue
- Green


**Output:**

| Red | Blue | Green |
|-----|------|--------|
| 1   | 0    | 0      |
| 0   | 1    | 0      |
| 0   | 0    | 1      |

Each row has a **1** in the column representing its category and **0** in others.
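
The mapping above can be sketched from scratch with NumPy. This is only an illustrative sketch: the helper name `one_hot` is hypothetical, and the columns follow order of first appearance so that they line up with the table.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix) where matrix[i, j] is 1 if values[i] == categories[j]."""
    # Keep categories in order of first appearance so the columns match the table.
    categories = list(dict.fromkeys(values))
    index = {cat: j for j, cat in enumerate(categories)}
    matrix = np.zeros((len(values), len(categories)), dtype=int)
    for i, value in enumerate(values):
        matrix[i, index[value]] = 1  # exactly one 1 per row
    return categories, matrix

categories, encoded = one_hot(["Red", "Blue", "Green"])
print(categories)
print(encoded)
```

Library encoders usually sort the categories instead, so their column order may differ from this sketch.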

---

## ✔ When to use it?

- When the categorical variable is **nominal (no order)**.
- Useful for algorithms like **Linear Regression, Logistic Regression, KNN, SVM**, etc.

---

## ✔ How to implement it?

### **In Pandas**
```python
import pandas as pd

# df is a DataFrame; 'col' is the categorical column to encode
pd.get_dummies(df, columns=['col'])
```


<b style="color:red">OneHotEncoding -> converts the data into 0s and 1s, i.e. it builds a vector for each value.
If there are 3 colors, each value is encoded in the form [1,0,0].</b>

__To avoid the dummy variable trap, remove one dummy column from the data.__

# 🎯 Dummy Variable Trap (DVT)

The Dummy Variable Trap occurs when **one dummy variable can be predicted exactly from the others**, causing perfect **multicollinearity** in a regression model.

---

## 📌 Why Does It Happen?

Suppose you have a categorical column:

**Color:**
- Red  
- Blue  
- Green  

If you create **3 dummy columns**:

| Red | Blue | Green |
|-----|------|--------|
| 1   | 0    | 0      |
| 0   | 1    | 0      |
| 0   | 0    | 1      |

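👉 All 3 dummy variables add up to **1 in every row**, so they are **perfectly collinear**: any one column equals 1 minus the sum of the other two, and together they duplicate the model's intercept column.

This leads to:

- ❌ Redundant information
- ❌ Multicollinearity
- ❌ Unstable regression model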
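
The redundancy can be checked numerically. The sketch below uses a small made-up design matrix (not the car dataset) and compares its rank with its number of columns, with and without the third dummy.

```python
import numpy as np

# Toy dummy columns for Red, Blue, Green (hypothetical rows).
dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
intercept = np.ones((dummies.shape[0], 1))

# Intercept + all 3 dummies: 4 columns, but one is a linear combination of the rest.
X_full = np.hstack([intercept, dummies])
# Intercept + 2 dummies (Green dropped): 3 independent columns.
X_reduced = np.hstack([intercept, dummies[:, :2]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # more columns than rank
print(X_reduced.shape[1], rank_reduced)  # columns equal rank
```

A rank lower than the column count means ordinary least squares has no unique solution for the coefficients.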

---

## 📌 How to Avoid the Dummy Variable Trap?

✔ **Drop one dummy column** (this becomes the *reference category*).

Example (Drop "Green"):

| Red | Blue |
|-----|------|
| 1   | 0    |
| 0   | 1    |
| 0   | 0    |

Now the model interprets:

- (0,0) → Green
- (1,0) → Red
- (0,1) → Blue

---

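## ⭐ Pandas Code

```python
# drop_first=True drops the first category in sorted order ("Blue" here),
# so the reference category differs from the "Green" example above.
pd.get_dummies(df['Color'], drop_first=True)
```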

__üéØ One-Hot Encoding Using the Most Frequent Category__

üß† Easy Explanation

- üëâ OHE creates a separate column for each category.
- üëâ To avoid dummy trap, we drop one.
- üëâ Dropping the most frequent category is usually better because:

It becomes the reference category

The model becomes more interpretable

Rare categories remain as dummy columns
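
A minimal pandas sketch of this idea, assuming toy fuel values rather than the real dataset: find the most frequent category with `value_counts()` and drop its dummy column explicitly instead of relying on `drop_first=True`.

```python
import pandas as pd

# Toy data (hypothetical values).
fuel = pd.Series(["Petrol", "Diesel", "Petrol", "CNG", "Petrol"], name="Fuel_Type")

# The most frequent category becomes the reference category.
most_frequent = fuel.value_counts().idxmax()

dummies = pd.get_dummies(fuel, prefix="Fuel_Type")
reduced = dummies.drop(columns=f"Fuel_Type_{most_frequent}")

print(most_frequent)
print(list(reduced.columns))
```

A row of all zeros in `reduced` then stands for the most frequent category.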

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv(r'C:\Users\Lenovo\Krishnaraj singh\Code\newml\Documents!.0\car data.csv')
df

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.60,6.87,42450,Diesel,Dealer,Manual,0
...,...,...,...,...,...,...,...,...,...
296,city,2016,9.50,11.60,33988,Diesel,Dealer,Manual,0
297,brio,2015,4.00,5.90,60000,Petrol,Dealer,Manual,0
298,city,2009,3.35,11.00,87934,Petrol,Dealer,Manual,0
299,city,2017,11.50,12.50,9000,Diesel,Dealer,Manual,0


In [5]:
df['Car_Name'].value_counts()
df['Car_Name'].nunique()
# There are 98 unique car names


98

In [7]:
df['Fuel_Type'].value_counts()

Fuel_Type
Petrol    239
Diesel     60
CNG         2
Name: count, dtype: int64

In [9]:
df['Owner'].nunique()

3

In [10]:
df.sample(14)

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
39,sx4,2003,2.25,7.98,62000,Petrol,Dealer,Manual,0
160,Bajaj Avenger Street 220,2011,0.45,0.95,24000,Petrol,Individual,Manual,0
212,creta,2016,11.25,13.6,22671,Petrol,Dealer,Manual,0
125,Royal Enfield Classic 500,2009,0.9,1.75,40000,Petrol,Individual,Manual,0
104,Royal Enfield Classic 350,2017,1.35,1.47,4100,Petrol,Individual,Manual,0
129,Yamaha FZ S V 2.0,2017,0.78,0.84,5000,Petrol,Individual,Manual,0
27,swift,2017,6.0,6.49,16200,Petrol,Individual,Manual,0
256,city,2016,10.25,13.6,49562,Petrol,Dealer,Manual,0
188,Hero Glamour,2013,0.25,0.57,18000,Petrol,Individual,Manual,0


In [13]:
df['Year'].value_counts()

Year
2015    61
2016    50
2014    38
2017    35
2013    33
2012    23
2011    19
2010    15
2008     7
2009     6
2005     4
2006     4
2007     2
2003     2
2018     1
2004     1
Name: count, dtype: int64

## __One hot encoding with pandas__

In [14]:
pd.get_dummies(df,columns=['Fuel_Type','Owner','Year'])

Unnamed: 0,Car_Name,Selling_Price,Present_Price,Kms_Driven,Seller_Type,Transmission,Fuel_Type_CNG,Fuel_Type_Diesel,Fuel_Type_Petrol,Owner_0,...,Year_2009,Year_2010,Year_2011,Year_2012,Year_2013,Year_2014,Year_2015,Year_2016,Year_2017,Year_2018
0,ritz,3.35,5.59,27000,Dealer,Manual,False,False,True,True,...,False,False,False,False,False,True,False,False,False,False
1,sx4,4.75,9.54,43000,Dealer,Manual,False,True,False,True,...,False,False,False,False,True,False,False,False,False,False
2,ciaz,7.25,9.85,6900,Dealer,Manual,False,False,True,True,...,False,False,False,False,False,False,False,False,True,False
3,wagon r,2.85,4.15,5200,Dealer,Manual,False,False,True,True,...,False,False,True,False,False,False,False,False,False,False
4,swift,4.60,6.87,42450,Dealer,Manual,False,True,False,True,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,city,9.50,11.60,33988,Dealer,Manual,False,True,False,True,...,False,False,False,False,False,False,False,True,False,False
297,brio,4.00,5.90,60000,Dealer,Manual,False,False,True,True,...,False,False,False,False,False,False,True,False,False,False
298,city,3.35,11.00,87934,Dealer,Manual,False,False,True,True,...,True,False,False,False,False,False,False,False,False,False
299,city,11.50,12.50,9000,Dealer,Manual,False,True,False,True,...,False,False,False,False,False,False,False,False,True,False


## __K-1 One hot encoding__

- Drop the first dummy column of every encoded feature (`Fuel_Type`, `Owner`, `Year`, etc.), so a feature with K categories yields K-1 columns.

In [16]:
pd.get_dummies(df,columns=['Fuel_Type','Owner','Year'],drop_first=True)

# This removes 3 columns (the first dummy of each encoded feature).
# Avoid this in an ML workflow: there, use sklearn's OneHotEncoder, because it remembers the categories it was fitted on, whereas pandas' get_dummies does not.

Unnamed: 0,Car_Name,Selling_Price,Present_Price,Kms_Driven,Seller_Type,Transmission,Fuel_Type_Diesel,Fuel_Type_Petrol,Owner_1,Owner_3,...,Year_2009,Year_2010,Year_2011,Year_2012,Year_2013,Year_2014,Year_2015,Year_2016,Year_2017,Year_2018
0,ritz,3.35,5.59,27000,Dealer,Manual,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
1,sx4,4.75,9.54,43000,Dealer,Manual,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
2,ciaz,7.25,9.85,6900,Dealer,Manual,False,True,False,False,...,False,False,False,False,False,False,False,False,True,False
3,wagon r,2.85,4.15,5200,Dealer,Manual,False,True,False,False,...,False,False,True,False,False,False,False,False,False,False
4,swift,4.60,6.87,42450,Dealer,Manual,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,city,9.50,11.60,33988,Dealer,Manual,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
297,brio,4.00,5.90,60000,Dealer,Manual,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
298,city,3.35,11.00,87934,Dealer,Manual,False,True,False,False,...,True,False,False,False,False,False,False,False,False,False
299,city,11.50,12.50,9000,Dealer,Manual,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
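
The note about pandas "forgetting" can be illustrated with a small sketch on made-up frames: `get_dummies` only sees the categories present in the frame it is given, so two frames can end up with different columns. One possible workaround, shown here as an assumption rather than the notebook's method, is to reindex to the training columns.

```python
import pandas as pd

# Toy train/test frames (hypothetical values).
train = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "Petrol"]})
test = pd.DataFrame({"Fuel_Type": ["Petrol", "CNG"]})

train_enc = pd.get_dummies(train, columns=["Fuel_Type"])
test_enc = pd.get_dummies(test, columns=["Fuel_Type"])

# The two encodings do not share the same columns.
print(list(train_enc.columns))
print(list(test_enc.columns))

# Align the test frame to the training columns: missing columns are filled
# with 0 and columns unseen during training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```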


## __One Hot Encoding Using Sklearn__

In [22]:
col = "Seller_Type"

# Move the target column to the end of the DataFrame
df[col] = df.pop(col)


In [23]:
df.sample()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Transmission,Owner,Seller_Type
270,city,2011,4.1,10.0,69341,Petrol,Manual,0,Dealer


In [25]:
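from sklearn.model_selection import train_test_split

# Features: the first 7 columns (Car_Name ... Transmission); target: the last column (Seller_Type)
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 0:7], df.iloc[:, -1])
x_train
# y_train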


Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Transmission
174,Honda CB Unicorn,2015,0.38,0.72,38600,Petrol,Manual
276,city,2015,8.65,13.60,24800,Petrol,Manual
52,innova,2017,18.00,19.77,15000,Diesel,Automatic
279,city,2014,6.25,13.60,40126,Petrol,Manual
20,alto k10,2016,2.85,3.95,25000,Petrol,Manual
...,...,...,...,...,...,...,...
282,city,2014,8.25,14.00,63000,Diesel,Manual
144,Bajaj Pulsar NS 200,2014,0.60,0.99,25000,Petrol,Manual
86,land cruiser,2010,35.00,92.60,78000,Diesel,Manual
193,Hero Ignitor Disc,2013,0.20,0.65,24000,Petrol,Manual


In [27]:
from sklearn.preprocessing import OneHotEncoder


In [None]:
ohe = OneHotEncoder()

First we apply OHE to the categorical columns separately, then combine the result with the remaining columns.

In [31]:
x_train.sample(2)

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Transmission
220,eon,2017,3.5,4.43,38488,Petrol,Manual
282,city,2014,8.25,14.0,63000,Diesel,Manual


In [40]:
x_train_new = ohe.fit_transform(x_train[['Fuel_Type','Transmission']]).toarray()
x_train_new

array([[0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 1., 0.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 1., 0.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 1., 0.],
       [0., 1., 0., 1.],
       [0., 1., 1., 0.],
       [1., 0., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 1., 0.],
       [0., 1., 0., 1.],
       [1., 0., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 1., 0.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],


In [38]:
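# Caution: fit_transform re-fits the encoder on the test set. The output below has 5 columns
# instead of the 4 produced for x_train, apparently because the test split contains a fuel
# type that is absent from x_train. In practice, fit on the training data only and call
# ohe.transform(...) here, with handle_unknown='ignore' set on the encoder so that unseen
# categories do not raise an error.
x_test_new = ohe.fit_transform(x_test[['Fuel_Type','Transmission']]).toarray()
x_test_new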


array([[0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0.
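
The train and test outputs above have different widths (4 versus 5 columns). A minimal sketch of the usual pattern, on made-up frames: fit the encoder on the training data only, then call `transform` on the test data. `handle_unknown="ignore"` is an added assumption here so that a category unseen during fitting becomes an all-zero block instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (hypothetical values); CNG appears only in the test data.
train = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})
test = pd.DataFrame({
    "Fuel_Type": ["CNG", "Diesel"],
    "Transmission": ["Manual", "Manual"],
})

ohe = OneHotEncoder(handle_unknown="ignore")
train_enc = ohe.fit_transform(train).toarray()  # learn the categories from train only
test_enc = ohe.transform(test).toarray()        # reuse them; no re-fitting

print(ohe.categories_)
print(train_enc.shape, test_enc.shape)  # same number of columns
print(test_enc)
```

Both arrays now share the same column layout, which is what a model trained on `train_enc` expects.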

In [36]:
x_train[['Car_Name','Year']].values

array([['Honda CB Unicorn', 2015],
       ['city', 2015],
       ['innova', 2017],
       ['city', 2014],
       ['alto k10', 2016],
       ['city', 2015],
       ['grand i10', 2015],
       ['i20', 2011],
       ['eon', 2012],
       ['sx4', 2008],
       ['Bajaj Pulsar NS 200', 2012],
       ['creta', 2016],
       ['Bajaj Avenger 150', 2016],
       ['Yamaha FZ S V 2.0', 2015],
       ['800', 2003],
       ['Yamaha FZ S V 2.0', 2017],
       ['i20', 2016],
       ['fortuner', 2017],
       ['Bajaj Dominar 400', 2017],
       ['xcent', 2014],
       ['Royal Enfield Classic 350', 2015],
       ['verna', 2017],
       ['Activa 3g', 2016],
       ['Royal Enfield Thunder 350', 2016],
       ['corolla altis', 2009],
       ['creta', 2016],
       ['ertiga', 2015],
       ['Hero Super Splendor', 2005],
       ['eon', 2017],
       ['etios liva', 2014],
       ['swift', 2013],
       ['Honda Activa 4G', 2017],
       ['ritz', 2014],
       ['i20', 2010],
       ['eon', 2016],
       ['fortu

In [42]:
np.hstack((x_train[['Car_Name','Year']].values,x_train_new)).shape
np.hstack((x_train[['Car_Name','Year']].values,x_train_new))

array([['Honda CB Unicorn', 2015, 0.0, 1.0, 0.0, 1.0],
       ['city', 2015, 0.0, 1.0, 0.0, 1.0],
       ['innova', 2017, 1.0, 0.0, 1.0, 0.0],
       ...,
       ['land cruiser', 2010, 1.0, 0.0, 0.0, 1.0],
       ['Hero  Ignitor Disc', 2013, 0.0, 1.0, 0.0, 1.0],
       ['Bajaj Pulsar 150', 2006, 0.0, 1.0, 0.0, 1.0]], dtype=object)
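
As an alternative to stacking arrays by hand with `np.hstack`, scikit-learn's `ColumnTransformer` can encode selected columns and pass the rest through in one step. The sketch below uses a made-up frame and is only one possible setup; `sparse_threshold=0` is chosen here to force a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical values).
train = pd.DataFrame({
    "Year": [2015, 2017, 2014],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

ct = ColumnTransformer(
    transformers=[
        # One-hot encode only the categorical columns.
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type", "Transmission"]),
    ],
    remainder="passthrough",  # keep the other columns (Year) unchanged
    sparse_threshold=0,       # always return a dense array
)

combined = ct.fit_transform(train)
print(combined.shape)
print(combined)
```

The encoded columns come first and the passed-through columns are appended at the end.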

## __One hot encoding with top categories__

In [43]:
df

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Transmission,Owner,Seller_Type
0,ritz,2014,3.35,5.59,27000,Petrol,Manual,0,Dealer
1,sx4,2013,4.75,9.54,43000,Diesel,Manual,0,Dealer
2,ciaz,2017,7.25,9.85,6900,Petrol,Manual,0,Dealer
3,wagon r,2011,2.85,4.15,5200,Petrol,Manual,0,Dealer
4,swift,2014,4.60,6.87,42450,Diesel,Manual,0,Dealer
...,...,...,...,...,...,...,...,...,...
296,city,2016,9.50,11.60,33988,Diesel,Manual,0,Dealer
297,brio,2015,4.00,5.90,60000,Petrol,Manual,0,Dealer
298,city,2009,3.35,11.00,87934,Petrol,Manual,0,Dealer
299,city,2017,11.50,12.50,9000,Diesel,Manual,0,Dealer


In [55]:
count = df['Car_Name'].value_counts()

df['Car_Name'].nunique()
threshold = 10

In [56]:
count = df['Car_Name'].value_counts()
threshold = 5

# Car names that appear fewer than `threshold` times
repl = count[count < threshold].index
for i in repl:
    print(i)


dzire
Royal Enfield Thunder 350
etios liva
wagon r
Bajaj Pulsar 150
ritz
Honda CB Hornet 160R
Bajaj Avenger 220
Yamaha FZ S V 2.0
xcent
Bajaj Pulsar NS 200
TVS Apache RTR 160
etios cross
etios g
Royal Enfield Thunder 500
creta
Honda CB Shine
Activa 3g
Bajaj Discover 125
elantra
Honda Karizma
Honda CB twister
Hero Extreme
Honda CBR 150
Yamaha FZ  v 2.0
Bajaj Avenger 220 dtsi
Hero Passion Pro
Hero Splender iSmart
TVS Apache RTR 180
Honda Activa 4G
Bajaj Pulsar 220 F
Royal Enfield Classic 500
KTM RC200
Hyosung GT250R
KTM RC390
Bajaj Dominar 400
UM Renegade Mojave
etios gd
camry
land cruiser
corolla
s cross
vitara brezza
alto 800
baleno
800
ignis
omni
KTM 390 Duke 
Bajaj Pulsar 135 LS
Honda CB Trigger
Yamaha FZ S 
Bajaj Avenger Street 220
Bajaj Pulsar  NS 200
Yamaha Fazer 
Honda Dream Yuga 
Hero Passion X pro
Mahindra Mojo XT300
Bajaj Pulsar RS200
Royal Enfield Bullet 350
Bajaj Avenger 150
Bajaj Avenger 150 street
Yamaha FZ 16
TVS Sport 
Hero Super Splendor
Hero Glamour
Suzuki Access 125
T

In [57]:
# Replace the rare car names with a single 'uncommon' label, then one-hot encode
pd.get_dummies(df['Car_Name'].replace(repl,'uncommon'))

Unnamed: 0,Royal Enfield Classic 350,alto k10,amaze,brio,ciaz,city,corolla altis,eon,ertiga,fortuner,grand i10,i10,i20,innova,jazz,swift,sx4,uncommon,verna
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
297,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
298,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
299,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
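
The same grouping can be wrapped in small helpers so that the set of frequent categories is learned once and reused on new data. This is a sketch on toy values; the helper names `fit_top_categories` and `group_rare` are hypothetical.

```python
import pandas as pd

def fit_top_categories(series, threshold):
    """Return the set of categories that occur at least `threshold` times."""
    counts = series.value_counts()
    return set(counts[counts >= threshold].index)

def group_rare(series, keep, other="uncommon"):
    """Replace every value not in `keep` with the label `other`."""
    return series.where(series.isin(keep), other)

# Toy data (hypothetical values); 'omni' never appears in the training series.
train = pd.Series(["city", "city", "city", "swift", "swift", "ritz"], name="Car_Name")
new = pd.Series(["city", "omni"], name="Car_Name")

keep = fit_top_categories(train, threshold=2)
train_enc = pd.get_dummies(group_rare(train, keep))

# New data: rare and unseen names both fall into 'uncommon';
# reindex so the columns match the training encoding.
new_grouped = group_rare(new, keep)
new_enc = pd.get_dummies(new_grouped).reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(new_grouped.tolist())
```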
