# Introduction: Dummy Variables & One-Hot Encoding
When working with categorical data in machine learning and statistical models, we need to convert categories into a numerical format since most models require numeric input. Two common techniques for encoding categorical variables are:

1. Dummy Variables
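A dummy variable is a binary variable (0 or 1) used to represent a category. If a categorical variable has N unique categories, we create N-1 dummy variables, and the omitted category becomes the baseline. Dropping one column avoids perfect multicollinearity (the "dummy variable trap") in regression models that include an intercept.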

🔹 Example: If we have a categorical variable "Color" with values Red, Blue, Green, we can encode it as:

| Color | Blue | Green |
|-------|------|-------|
| Red   | 0    | 0     |
| Blue  | 1    | 0     |
| Green | 0    | 1     |

Here, if both Blue and Green are 0, the color is Red.
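
The table above can be reproduced with pandas. This is a minimal sketch: the `colors` frame is made up for illustration, and the category order is set explicitly so that `drop_first=True` drops Red (with plain strings pandas would drop the alphabetically first category, Blue).

```python
import pandas as pd

# Hypothetical three-row frame matching the Color example above.
colors = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# Put Red first so that it becomes the dropped (baseline) category.
colors["Color"] = pd.Categorical(colors["Color"], categories=["Red", "Blue", "Green"])

# drop_first=True keeps N-1 = 2 columns; dtype=int gives 0/1 instead of booleans.
color_dummies = pd.get_dummies(colors["Color"], drop_first=True, dtype=int)
print(color_dummies)
```

Which category serves as the baseline is a modelling choice; it changes how the coefficients are read, not what the model can express.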

2. One-Hot Encoding
One-Hot Encoding converts categorical variables into a binary matrix, creating separate columns for each unique category. Unlike dummy variables, all categories are retained (N categories → N columns).

🔹 Using One-Hot Encoding for the Color column:

| Color | Red | Blue | Green |
|-------|-----|------|-------|
| Red   | 1   | 0    | 0     |
| Blue  | 0   | 1    | 0     |
| Green | 0   | 0    | 1     |
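
A minimal pandas sketch of the one-hot table above, again using a made-up `colors` frame. Leaving out `drop_first` keeps all N columns.

```python
import pandas as pd

# Hypothetical three-row frame matching the Color example above.
colors = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# Fix the column order to Red, Blue, Green (the default would be alphabetical).
colors["Color"] = pd.Categorical(colors["Color"], categories=["Red", "Blue", "Green"])

# No drop_first: one column per category, exactly one 1 in every row.
one_hot = pd.get_dummies(colors["Color"], dtype=int)
print(one_hot)
```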
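
## Key Differences: Dummy Variables vs. One-Hot Encoding

| Feature                     | Dummy Variables   | One-Hot Encoding                  |
|-----------------------------|-------------------|-----------------------------------|
| Columns created             | N-1               | N                                 |
| Prevents multicollinearity? | ✅ Yes            | ❌ No                             |
| Best for                    | Linear Regression | Deep Learning & Tree-Based Models |
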
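
## When to Use What?

✔ Dummy Variables → Used in linear regression models, where the redundant N-th column would be perfectly collinear with the intercept.
✔ One-Hot Encoding → Used in neural networks and tree-based models, where keeping a column for every category does no harm.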
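
The redundancy can be checked numerically. The sketch below uses NumPy and a made-up sequence of six observations (an assumption for illustration). It builds a design matrix with an intercept and compares its rank with and without the extra one-hot column.

```python
import numpy as np

# Six hypothetical observations: 0 = Red, 1 = Blue, 2 = Green.
codes = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[codes]            # shape (6, 3), columns Red, Blue, Green
intercept = np.ones((6, 1))

# Intercept + all N one-hot columns: Red + Blue + Green equals the intercept,
# so the four columns are linearly dependent.
full = np.hstack([intercept, one_hot])

# Intercept + N-1 dummy columns (Red dropped): all columns are independent.
reduced = np.hstack([intercept, one_hot[:, 1:]])

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 columns, rank 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 columns, rank 3
```

A rank lower than the number of columns means ordinary least squares has no unique coefficient vector, which is the problem the N-1 encoding avoids.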

This concept is widely used in machine learning, data preprocessing, and statistical modeling. Now, let’s practice encoding in Python! 🚀

In [2]:
import pandas as pd
df = pd.read_csv('Data/carpricesohe.csv')
df

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


In [4]:
dummies = pd.get_dummies(df['Car Model'])
dummies

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,False,True,False
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
5,True,False,False
6,True,False,False
7,True,False,False
8,True,False,False
9,False,False,True


In [8]:
merged = pd.concat([df, dummies], axis='columns')
merged

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,BMW X5,69000,18000,6,False,True,False
1,BMW X5,35000,34000,3,False,True,False
2,BMW X5,57000,26100,5,False,True,False
3,BMW X5,22500,40000,2,False,True,False
4,BMW X5,46000,31500,4,False,True,False
5,Audi A5,59000,29400,5,True,False,False
6,Audi A5,52000,32000,5,True,False,False
7,Audi A5,72000,19300,6,True,False,False
8,Audi A5,91000,12000,8,True,False,False
9,Mercedez Benz C class,67000,22000,6,False,False,True


In [10]:
# scikit-learn's LinearRegression still fits if every dummy column is kept,
# but it is good practice to drop one yourself (dummy variable trap).
final = merged.drop(['Car Model', 'Mercedez Benz C class'], axis='columns')
final

Unnamed: 0,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5
0,69000,18000,6,False,True
1,35000,34000,3,False,True
2,57000,26100,5,False,True
3,22500,40000,2,False,True
4,46000,31500,4,False,True
5,59000,29400,5,True,False
6,52000,32000,5,True,False
7,72000,19300,6,True,False
8,91000,12000,8,True,False
9,67000,22000,6,False,False
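
The encode, concat, and drop steps above can also be done in a single `pd.get_dummies` call. This is a sketch, not the notebook's method: `cars` is the ten displayed rows typed in by hand, and `drop_first=True` drops the alphabetically first category (Audi A5) rather than Mercedez Benz C class, so the baseline differs.

```python
import pandas as pd

# The ten rows displayed above, typed in so the sketch is self-contained.
cars = pd.DataFrame({
    "Car Model": ["BMW X5"] * 5 + ["Audi A5"] * 4 + ["Mercedez Benz C class"],
    "Mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "Sell Price($)": [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
})

# Encode 'Car Model', append the dummies, and drop the original column
# plus the first dummy, all in one call.
final_alt = pd.get_dummies(cars, columns=["Car Model"], drop_first=True)
print(final_alt.columns.tolist())
```

The new columns are prefixed with the source column name, for example `Car Model_BMW X5`.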


In [13]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
    

In [15]:
X = final.drop(['Sell Price($)'], axis = 'columns')
X

Unnamed: 0,Mileage,Age(yrs),Audi A5,BMW X5
0,69000,6,False,True
1,35000,3,False,True
2,57000,5,False,True
3,22500,2,False,True
4,46000,4,False,True
5,59000,5,True,False
6,52000,5,True,False
7,72000,6,True,False
8,91000,8,True,False
9,67000,6,False,False


In [16]:
y = final['Sell Price($)']
y

0     18000
1     34000
2     26100
3     40000
4     31500
5     29400
6     32000
7     19300
8     12000
9     22000
10    20000
11    21000
12    33000
Name: Sell Price($), dtype: int64

In [17]:
model.fit(X, y)

In [18]:
# Column order: Mileage, Age(yrs), Audi A5, BMW X5
# (both dummies 0 means Mercedez Benz C class)
model.predict([[7345, 4, 0, 0]])



array([50928.26467481])
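
Passing a bare list relies on remembering the column order. A sketch of a safer variant is to predict from a DataFrame that reuses `X.columns`. It is rebuilt here from only the ten displayed rows, so the fitted numbers will not match the output above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows displayed earlier, typed in so the sketch is self-contained.
cars = pd.DataFrame({
    "Car Model": ["BMW X5"] * 5 + ["Audi A5"] * 4 + ["Mercedez Benz C class"],
    "Mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "Sell Price($)": [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
})

dummies = pd.get_dummies(cars["Car Model"], dtype=int)
X = pd.concat([cars[["Mileage", "Age(yrs)"]], dummies[["Audi A5", "BMW X5"]]], axis="columns")
y = cars["Sell Price($)"]
model = LinearRegression().fit(X, y)

# Same query as the cell above, but with explicit column names.
query = pd.DataFrame([[7345, 4, 0, 0]], columns=X.columns)
prediction = model.predict(query)
print(prediction)
```

Recent scikit-learn versions compare the column names seen at `fit` and `predict` time, so a mismatch in the names is reported instead of passing silently.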

In [19]:
model.score(X,y)

0.9417050937281083

Now let's do the same with scikit-learn: first LabelEncoder, then OneHotEncoder.

In [43]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [22]:
# Note: dfle is an alias of df, not a copy, so df['Car Model'] is overwritten too.
dfle = df
dfle['Car Model'] = le.fit_transform(dfle['Car Model'])
dfle

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,1,69000,18000,6
1,1,35000,34000,3
2,1,57000,26100,5
3,1,22500,40000,2
4,1,46000,31500,4
5,0,59000,29400,5
6,0,52000,32000,5
7,0,72000,19300,6
8,0,91000,12000,8
9,2,67000,22000,6
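
To see which integer stands for which car model, the fitted encoder can be inspected. This is a small sketch on a hand-typed list of the three model names from the data.

```python
from sklearn.preprocessing import LabelEncoder

# The three model names from the data, in an arbitrary order.
models = ["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"]

le = LabelEncoder()
codes = le.fit_transform(models)

# classes_ is sorted; the position of each class is its integer code.
mapping = {name: code for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels.
print(le.inverse_transform(codes))
```

The codes are assigned alphabetically and carry no real order, which is why they are one-hot encoded before being passed to a linear model.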


In [44]:
X = dfle[['Car Model','Mileage','Age(yrs)']]
X

Unnamed: 0,Car Model,Mileage,Age(yrs)
0,1,69000,6
1,1,35000,3
2,1,57000,5
3,1,22500,2
4,1,46000,4
5,0,59000,5
6,0,52000,5
7,0,72000,6
8,0,91000,8
9,2,67000,6


In [45]:
y = dfle['Sell Price($)']
y

0     18000
1     34000
2     26100
3     40000
4     31500
5     29400
6     32000
7     19300
8     12000
9     22000
10    20000
11    21000
12    33000
Name: Sell Price($), dtype: int64

In [46]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
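# df['Car Model'] already holds the LabelEncoder integer codes (dfle is an alias of df),
# so the encoded columns are named Car Model_0, Car Model_1, Car Model_2.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Car Model'])],
    remainder='passthrough',
)
encoded_data = ct.fit_transform(df)
encoded_df = pd.DataFrame(encoded_data, columns=ct.get_feature_names_out())
print(encoded_df)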

    encoder__Car Model_0  encoder__Car Model_1  encoder__Car Model_2  \
0                    0.0                   1.0                   0.0   
1                    0.0                   1.0                   0.0   
2                    0.0                   1.0                   0.0   
3                    0.0                   1.0                   0.0   
4                    0.0                   1.0                   0.0   
5                    1.0                   0.0                   0.0   
6                    1.0                   0.0                   0.0   
7                    1.0                   0.0                   0.0   
8                    1.0                   0.0                   0.0   
9                    0.0                   0.0                   1.0   
10                   0.0                   0.0                   1.0   
11                   0.0                   0.0                   1.0   
12                   0.0                   0.0                  

In [47]:
X = encoded_df.drop('remainder__Sell Price($)', axis=1)  # Independent features (drop target)
y = encoded_df['remainder__Sell Price($)']
modelohe = LinearRegression()
modelohe.fit(X,y)

In [50]:
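# Column order: Car Model_0 (Audi A5), Car Model_1 (BMW X5),
# Car Model_2 (Mercedez Benz C class), Mileage, Age(yrs)
modelohe.predict([[0, 0, 1, 7345, 4]])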



array([50928.26467479])

In [51]:
modelohe.score(X,y)

0.9417050937281083