# One-Hot Encoding
- One-hot encoding converts categorical variables into a numeric format that machine learning models can take as input.
- It creates one binary column per category: 1 marks the category present in a row, and 0 marks every other category.
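To make the idea concrete before loading any data, here is a minimal sketch in plain Python. The `one_hot` helper and the `segments` list are hypothetical, written only for illustration, and are not part of the notebook's dataset or of any library.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 list with one slot per category."""
    categories = sorted(set(values))  # fixed, alphabetical column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the category that is present in this row
        rows.append(row)
    return categories, rows


# Hypothetical sample values, chosen to resemble the Customer_Segment column
segments = ["High Value", "Medium Value", "High Value", "Low Value"]
categories, encoded = one_hot(segments)

print(categories)
for value, row in zip(segments, encoded):
    print(value, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.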

In [1]:
import pandas as pd

In [4]:
ds = pd.read_csv("Sales_data.csv")
ds.head()

Unnamed: 0,Group,Customer_Segment,Sales_Before,Sales_After,Customer_Satisfaction_Before,Customer_Satisfaction_After,Purchase_Made
0,Control,High Value,240.548359,300.007568,74.684767,,No
1,Treatment,High Value,246.862114,381.337555,100.0,100.0,Yes
2,Control,High Value,156.978084,179.330464,98.780735,100.0,No
3,Control,Medium Value,192.126708,229.278031,49.333766,39.811841,Yes
4,,High Value,229.685623,,83.974852,87.738591,Yes


In [5]:
ds.isnull().sum()

Group                           1401
Customer_Segment                1966
Sales_Before                    1522
Sales_After                      767
Customer_Satisfaction_Before    1670
Customer_Satisfaction_After     1640
Purchase_Made                    805
dtype: int64

In [6]:
ds["Customer_Segment"] = ds["Customer_Segment"].fillna(ds["Customer_Segment"].mode()[0])

In [9]:
# Assign the result back instead of using inplace=True on a column selection,
# which raises a FutureWarning and will stop working in pandas 3.0
ds["Purchase_Made"] = ds["Purchase_Made"].fillna(ds["Purchase_Made"].mode()[0])



In [11]:
ds.isnull().sum()

Group                           1401
Customer_Segment                   0
Sales_Before                    1522
Sales_After                      767
Customer_Satisfaction_Before    1670
Customer_Satisfaction_After     1640
Purchase_Made                      0
dtype: int64

### One-hot encoding using pandas get_dummies()

In [13]:
en_ds = ds[["Customer_Segment", "Purchase_Made"]]
en_ds

Unnamed: 0,Customer_Segment,Purchase_Made
0,High Value,No
1,High Value,Yes
2,High Value,No
3,Medium Value,Yes
4,High Value,Yes
...,...,...
9995,Low Value,Yes
9996,High Value,Yes
9997,Low Value,No
9998,Medium Value,No


In [14]:
pd.get_dummies(en_ds)

Unnamed: 0,Customer_Segment_High Value,Customer_Segment_Low Value,Customer_Segment_Medium Value,Purchase_Made_No,Purchase_Made_Yes
0,True,False,False,True,False
1,True,False,False,False,True
2,True,False,False,True,False
3,False,False,True,False,True
4,True,False,False,False,True
...,...,...,...,...,...
9995,False,True,False,False,True
9996,True,False,False,False,True
9997,False,True,False,True,False
9998,False,False,True,True,False


In [17]:
pd.get_dummies(en_ds).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Customer_Segment_High Value    10000 non-null  bool 
 1   Customer_Segment_Low Value     10000 non-null  bool 
 2   Customer_Segment_Medium Value  10000 non-null  bool 
 3   Purchase_Made_No               10000 non-null  bool 
 4   Purchase_Made_Yes              10000 non-null  bool 
dtypes: bool(5)
memory usage: 49.0 KB
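The output above uses `bool` columns. If 0/1 integers are preferred, or one redundant column per feature should be dropped, `get_dummies` accepts `dtype` and `drop_first` arguments. The sketch below uses a small hypothetical frame named `toy` rather than the sales data, so it can be run on its own.

```python
import pandas as pd

# Hypothetical toy frame with the same two column names as en_ds
toy = pd.DataFrame({
    "Customer_Segment": ["High Value", "Medium Value", "Low Value", "High Value"],
    "Purchase_Made": ["No", "Yes", "No", "Yes"],
})

# dtype=int yields 0/1 columns instead of True/False
as_int = pd.get_dummies(toy, dtype=int)

# drop_first=True drops the first (alphabetically sorted) category of each column
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(as_int)
print(reduced.columns.tolist())
```

With `drop_first=True`, a row of all zeros for a feature stands for the dropped category, so no information is lost.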


### One-hot encoding using scikit-learn's OneHotEncoder

In [19]:
from sklearn.preprocessing import OneHotEncoder

In [21]:
en_ds1 = ds[["Customer_Segment", "Purchase_Made"]]
en_ds1

Unnamed: 0,Customer_Segment,Purchase_Made
0,High Value,No
1,High Value,Yes
2,High Value,No
3,Medium Value,Yes
4,High Value,Yes
...,...,...
9995,Low Value,Yes
9996,High Value,Yes
9997,Low Value,No
9998,Medium Value,No


In [32]:
ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() converts it to a dense NumPy array
dss = ohe.fit_transform(en_ds1).toarray()
dss

array([[1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 1.]], shape=(10000, 5))

In [27]:
# OneHotEncoder sorts categories alphabetically (High, Low, Medium), so take the
# column names from the encoder instead of typing them by hand
pd.DataFrame(dss, columns=ohe.get_feature_names_out())

Unnamed: 0,Customer_Segment_High Value,Customer_Segment_Low Value,Customer_Segment_Medium Value,Purchase_Made_No,Purchase_Made_Yes
0,1.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,1.0
4,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
9995,0.0,1.0,0.0,0.0,1.0
9996,1.0,0.0,0.0,0.0,1.0
9997,0.0,1.0,0.0,1.0,0.0
9998,0.0,0.0,1.0,1.0,0.0


In [37]:
ohe1 = OneHotEncoder(drop="first")   # drop="first" drops the first category of each feature, so one column per feature is removed
dss1 = ohe1.fit_transform(en_ds1).toarray()
dss1

array([[0., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.],
       ...,
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 1.]], shape=(10000, 3))

In [38]:
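# After drop="first" the remaining segment columns are Low Value and Medium Value, in that order;
# read the names from the encoder so the labels match the array
pd.DataFrame(dss1, columns=ohe1.get_feature_names_out())

Unnamed: 0,Customer_Segment_Low Value,Customer_Segment_Medium Value,Purchase_Made_Yes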
0,0.0,0.0,0.0
1,0.0,0.0,1.0
2,0.0,0.0,0.0
3,0.0,1.0,1.0
4,0.0,0.0,1.0
...,...,...,...
9995,1.0,0.0,1.0
9996,0.0,0.0,1.0
9997,1.0,0.0,0.0
9998,0.0,1.0,0.0
