### Encoding

Encoding is a data preprocessing technique that converts categorical (text) data into numerical form, because most machine learning models accept only numeric input.

### One-Hot Encoding

One-Hot Encoding converts a categorical column into one binary (0/1) column per category. Each row gets a 1 in the column that matches its category and a 0 in all the others.
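To make the definition concrete, here is a minimal sketch of one-hot encoding done by hand in plain Python. The `bp` list is a hypothetical toy column (its values mirror the BP categories that appear in the dataset), not data taken from the file.

```python
# Hypothetical toy column; not read from the dataset.
bp = ["HIGH", "LOW", "NORMAL", "LOW"]

# One output column per distinct category, in sorted order.
categories = sorted(set(bp))

# Each row gets 1 in the column matching its category and 0 elsewhere.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in bp]

print(categories)  # ['HIGH', 'LOW', 'NORMAL']
for row in one_hot:
    print(row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.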

In [1]:
import pandas as pd

In [2]:
# read_csv infers the zip compression from the file extension
dataset = pd.read_csv('Dataset1.zip')
dataset.head(4)

,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,DrugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX


### Method 1: Using Pandas

In [3]:
# Keep only the categorical columns that need encoding
en_data = dataset[['Sex', 'Drug', 'BP', 'Cholesterol']]

In [7]:
# Create one indicator column per category of every selected column
en = pd.get_dummies(en_data)

In [16]:
en.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Sex_F               200 non-null    bool 
 1   Sex_M               200 non-null    bool 
 2   Drug_DrugY          200 non-null    bool 
 3   Drug_drugA          200 non-null    bool 
 4   Drug_drugB          200 non-null    bool 
 5   Drug_drugC          200 non-null    bool 
 6   Drug_drugX          200 non-null    bool 
 7   BP_HIGH             200 non-null    bool 
 8   BP_LOW              200 non-null    bool 
 9   BP_NORMAL           200 non-null    bool 
 10  Cholesterol_HIGH    200 non-null    bool 
 11  Cholesterol_NORMAL  200 non-null    bool 
dtypes: bool(12)
memory usage: 2.5 KB
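The `info()` output above shows `bool` columns. If 0/1 integers are preferred, `pd.get_dummies` accepts a `dtype` argument. The sketch below uses a hypothetical toy frame (`toy`) rather than the real dataset, so it can be run on its own.

```python
import pandas as pd

# Hypothetical toy frame standing in for the categorical columns of the dataset.
toy = pd.DataFrame({"Sex": ["F", "M", "M"], "BP": ["HIGH", "LOW", "NORMAL"]})

# dtype=int yields 0/1 integers instead of True/False.
as_int = pd.get_dummies(toy, dtype=int)

print(list(as_int.columns))
print(as_int.iloc[0].tolist())
```

The column names follow the same `<column>_<category>` pattern seen in the `info()` output.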


### Method 2: Using Scikit-learn

In [9]:
from sklearn.preprocessing import OneHotEncoder

In [13]:
# By default, fit_transform returns a SciPy sparse matrix
ohe = OneHotEncoder()
ohe.fit_transform(en_data)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 800 stored elements and shape (200, 12)>
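The sparse result stores only the non-zero entries. Each row has exactly one 1 per original column, so 200 rows times 4 columns gives the 800 stored elements reported above. The following sketch checks the same arithmetic on a hypothetical toy frame (`toy`), assuming a scikit-learn version that returns a sparse result by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: 3 rows, 2 categorical columns.
toy = pd.DataFrame({"Sex": ["F", "M", "M"], "BP": ["HIGH", "LOW", "NORMAL"]})

encoded = OneHotEncoder().fit_transform(toy)

# 2 categories for Sex + 3 for BP = 5 output columns.
print(encoded.shape)  # (3, 5)

# Only the 1s are stored: one per row per original column.
print(encoded.nnz)    # 6 = 3 rows * 2 columns

# toarray() expands the sparse result into a dense NumPy array.
dense = encoded.toarray()
print(dense)
```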

In [17]:
# Convert the sparse result to a dense NumPy array
p = ohe.fit_transform(en_data).toarray()


In [18]:
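# Borrow the column names from the pandas result; both methods produce the columns in the same order here
en.select_dtypes(include=['bool']).columns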

Index(['Sex_F', 'Sex_M', 'Drug_DrugY', 'Drug_drugA', 'Drug_drugB',
       'Drug_drugC', 'Drug_drugX', 'BP_HIGH', 'BP_LOW', 'BP_NORMAL',
       'Cholesterol_HIGH', 'Cholesterol_NORMAL'],
      dtype='object')

In [19]:
pd.DataFrame(p, columns=['Sex_F', 'Sex_M', 'Drug_DrugY', 'Drug_drugA', 'Drug_drugB',
                         'Drug_drugC', 'Drug_drugX', 'BP_HIGH', 'BP_LOW', 'BP_NORMAL',
                         'Cholesterol_HIGH', 'Cholesterol_NORMAL'])

,Sex_F,Sex_M,Drug_DrugY,Drug_drugA,Drug_drugB,Drug_drugC,Drug_drugX,BP_HIGH,BP_LOW,BP_NORMAL,Cholesterol_HIGH,Cholesterol_NORMAL
0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
195,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
196,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
197,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
198,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0


In [20]:
# drop='first' removes the first category of each column, leaving 8 columns instead of 12
ohe = OneHotEncoder(drop='first')

In [22]:
p1 = ohe.fit_transform(en_data).toarray()
p1

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 1., 1.],
       [0., 0., 0., ..., 1., 0., 1.]], shape=(200, 8))

In [25]:
pd.DataFrame(p1, columns=['Sex_M', 'Drug_drugA', 'Drug_drugB',
                          'Drug_drugC', 'Drug_drugX', 'BP_LOW', 'BP_NORMAL',
                          'Cholesterol_NORMAL'])

,Sex_M,Drug_drugA,Drug_drugB,Drug_drugC,Drug_drugX,BP_LOW,BP_NORMAL,Cholesterol_NORMAL
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
195,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
196,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
197,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
198,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
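Dropping the first category loses no information, because a row with 0 in every remaining column of a feature must belong to the dropped category. The sketch below illustrates this on a hypothetical single-column toy frame (`toy`); it is an illustration of the idea, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with three categories.
toy = pd.DataFrame({"BP": ["HIGH", "LOW", "NORMAL", "LOW"]})

# Full encoding: columns BP_HIGH, BP_LOW, BP_NORMAL.
full = OneHotEncoder().fit_transform(toy).toarray()

# Reduced encoding: BP_HIGH is dropped, leaving BP_LOW and BP_NORMAL.
reduced = OneHotEncoder(drop="first").fit_transform(toy).toarray()

# The dropped column equals 1 minus the sum of the remaining columns.
recovered_high = 1 - reduced.sum(axis=1)
print((recovered_high == full[:, 0]).all())  # True
```

This redundancy is why the full encoding produces perfectly correlated columns, which some models, such as linear regression without regularization, handle poorly.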
