# 🔷 What is "Encoding"?
Encoding means converting words or categories into numbers so that a machine learning model can understand and use them.

Why?
Because most machine learning algorithms can’t work with text or categorical data directly — they only understand numbers.



# 🎯 What Kind of Data Needs Encoding?
Categorical features — like Gender, Country, Color, Size, Brand, etc.

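Basically, any column whose values are labels rather than measured quantities. The categories may be unordered (nominal, e.g. Country or Color) or ordered (ordinal, e.g. Size: Small < Medium < Large); both kinds need encoding.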
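
A quick way to spot candidate columns is to list the non-numeric ones. The sketch below uses a small made-up frame (`toy` is hypothetical, not the notebook's CSV), so the column names are only illustrative.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "age": [60, 27, 42],
    "gender": ["Male", "Male", "Female"],
    "fever": [103.0, 100.0, 101.0],
    "city": ["Kolkata", "Delhi", "Delhi"],
})

# Columns that are not numeric are the usual candidates for encoding
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```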



# 💡 Things to Remember
Encoding is only for text or categories, not for numeric data.

You should not encode numbers again (e.g., age or salary) — those are already numerical.

You encode before training the model — during data preprocessing.
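
One common way to do this, sketched here as an assumption rather than as this notebook's method, is to learn the category-to-number mapping from the training rows only and reuse it on new rows. The frames `train` and `test` are hypothetical, and mapping unseen categories to -1 is just one possible choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; "Pune" never appears in the training rows
train = pd.DataFrame({"city": ["Delhi", "Kolkata", "Mumbai", "Delhi"]})
test = pd.DataFrame({"city": ["Kolkata", "Pune"]})

# Categories unseen during fit are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = enc.fit_transform(train[["city"]])  # learn the mapping from train only
test_codes = enc.transform(test[["city"]])        # reuse the same mapping on test

print(train_codes.ravel())
print(test_codes.ravel())
```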



# 🔷 What is Label Encoding?
Label Encoding is a method to convert text labels (categories) into numbers by assigning each unique value a number.

It is simple.

It gives one number to one category.

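Scikit-learn's LabelEncoder is designed for the target column (y). It numbers the categories in sorted order, so when it is used on an input feature, some models may read the integers as a ranking even though the categories have no real order.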



# 🧠 Why Do We Use It?
Machine learning models don’t understand words like "Male", "Female", or "Red", "Blue", "Green".

So we replace these words with numbers. That process is called encoding, and Label Encoding is the simplest form of that.
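
As a minimal sketch of that replacement, the example below encodes a made-up list of colors (the list `colors` is an assumption for illustration) and shows which number each category received.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical input labels
colors = ["Red", "Blue", "Green", "Red"]

le = LabelEncoder()
codes = le.fit_transform(colors)

print(le.classes_)                  # categories in sorted order
print(codes)                        # integer code for each input value
print(le.inverse_transform(codes))  # back to the original labels
```

The position of a category in `classes_` is its code, which is why the numbering follows sorted order and not the order in which values appear.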



# 1. Label Encoding

In [2]:
# 1. Label Encoding: applied to one column at a time (col1, col2, ..., col n). The encoder assigns the numbers itself (in sorted order); we do not choose them.

In [3]:
import numpy as np
import pandas as pd

In [4]:
df = pd.read_csv("covid_toy.csv")

In [5]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [6]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [7]:
from sklearn.impute import SimpleImputer #handling missing values

In [8]:
si = SimpleImputer()

In [9]:
df['fever'] = si.fit_transform(df[['fever']])

In [10]:
df.isnull().sum()

age          0
gender       0
fever        0
cough        0
city         0
has_covid    0
dtype: int64
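
With no arguments, `SimpleImputer` fills missing values with the column mean. The sketch below checks this on a made-up `fever` column (the values are hypothetical, not taken from `covid_toy.csv`).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value
toy = pd.DataFrame({"fever": [98.0, 100.0, np.nan, 102.0]})

si = SimpleImputer()  # default strategy="mean"

# fit_transform returns a 2-D array; ravel() flattens it for a single column
toy["fever"] = si.fit_transform(toy[["fever"]]).ravel()

print(si.statistics_)         # the value learned for the column (its mean)
print(toy["fever"].tolist())  # the NaN is replaced by that mean
```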

In [11]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [12]:
from sklearn.preprocessing import LabelEncoder # converts category labels to integers; the encoder picks the numbering (sorted order)

In [13]:
lb = LabelEncoder()

In [14]:
# Pass a 1-D Series (single brackets); a 2-D frame such as df[['gender']] triggers a DataConversionWarning
df['gender'] = lb.fit_transform(df['gender'])
df['cough'] = lb.fit_transform(df['cough'])
df['city'] = lb.fit_transform(df['city'])
df['has_covid'] = lb.fit_transform(df['has_covid'])


In [15]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,1,103.0,0,2,0
1,27,1,100.0,0,1,1
2,42,1,101.0,0,1,0
3,31,0,98.0,0,2,0
4,65,0,101.0,0,3,0
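
Because the same `lb` object is refit for every column above, it only remembers the last column it saw. If the mappings are needed later, one option is to keep a separate encoder per column. This is a sketch on a hypothetical toy frame; the `encoders` dictionary is an illustrative name, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "city": ["Kolkata", "Delhi", "Mumbai"],
})

encoders = {}  # one fitted encoder per column
for col in ["gender", "city"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    encoders[col] = le

# Category -> code mapping for each column
mapping = {
    col: {str(c): i for i, c in enumerate(le.classes_)}
    for col, le in encoders.items()
}
print(mapping)

# Recover the original city names from the codes
original_city = encoders["city"].inverse_transform(toy["city"])
print(list(original_city))
```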


In [16]:
x = df.drop(columns = ['has_covid'])
y = df['has_covid']

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2 , random_state = 42)

In [19]:
x_train.head()

Unnamed: 0,age,gender,fever,cough,city
55,81,0,101.0,0,3
88,5,0,100.0,0,2
26,19,0,100.0,0,2
42,27,1,100.0,0,1
69,73,0,103.0,0,1


In [20]:
np.round(x_train.describe(), 2)

Unnamed: 0,age,gender,fever,cough,city
count,80.0,80.0,80.0,80.0,80.0
mean,42.91,0.41,100.98,0.4,1.3
std,24.47,0.5,1.93,0.49,1.12
min,5.0,0.0,98.0,0.0,0.0
25%,20.0,0.0,100.0,0.0,0.0
50%,42.0,0.0,101.0,0.0,1.0
75%,65.0,1.0,102.0,1.0,2.0
max,84.0,1.0,104.0,1.0,3.0


In [21]:
from sklearn.preprocessing import MinMaxScaler #Normalization

In [22]:
mn = MinMaxScaler()

In [23]:
x_train_mn = mn.fit_transform(x_train)

In [24]:
x_train_new = pd.DataFrame(x_train_mn, columns = x_train.columns)

In [25]:
np.round(x_train_new.describe(), 2)

Unnamed: 0,age,gender,fever,cough,city
count,80.0,80.0,80.0,80.0,80.0
mean,0.48,0.41,0.5,0.4,0.43
std,0.31,0.5,0.32,0.49,0.37
min,0.0,0.0,0.0,0.0,0.0
25%,0.19,0.0,0.33,0.0,0.0
50%,0.47,0.0,0.5,0.0,0.33
75%,0.76,1.0,0.67,1.0,0.67
max,1.0,1.0,1.0,1.0,1.0
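
`MinMaxScaler` rescales each column as (x - min) / (max - min), using the min and max learned during `fit`. The sketch below, on made-up arrays, shows how the already fitted scaler could be applied to held-out rows with `transform` (not `fit_transform`); values outside the training range then fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

mn = MinMaxScaler()
train_scaled = mn.fit_transform(train)  # learns min=0 and max=10 from train
test_scaled = mn.transform(test)        # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```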


In [None]:
# ------------------------------------------TIPS----------------------------------------------- 

In [30]:
df = pd.read_csv("tips (1).csv")

In [31]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [32]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [33]:
from sklearn.preprocessing import LabelEncoder

In [34]:
lb = LabelEncoder()

In [35]:
df['sex'] = lb.fit_transform(df['sex'])
df['smoker'] = lb.fit_transform(df['smoker'])
df['day'] = lb.fit_transform(df['day'])
df['time'] = lb.fit_transform(df['time'])

In [36]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,0,0,2,0,2
1,10.34,1.66,1,0,2,0,3
2,21.01,3.5,1,0,2,0,3
3,23.68,3.31,1,0,2,0,2
4,24.59,3.61,0,0,2,0,4


In [37]:
df['smoker'].value_counts()

smoker
0    151
1     93
Name: count, dtype: int64

In [38]:
x = df.drop(columns = ['smoker'])
y = df['smoker']

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2 , random_state = 42)

In [41]:
np.round(x_train.describe(), 2)

Unnamed: 0,total_bill,tip,sex,day,time,size
count,195.0,195.0,195.0,195.0,195.0,195.0
mean,20.22,3.09,0.65,1.71,0.27,2.57
std,8.77,1.43,0.48,0.93,0.45,0.94
min,5.75,1.0,0.0,0.0,0.0,1.0
25%,13.66,2.0,0.0,1.0,0.0,2.0
50%,17.92,3.0,1.0,2.0,0.0,2.0
75%,24.86,3.7,1.0,2.0,1.0,3.0
max,50.81,10.0,1.0,3.0,1.0,6.0


In [47]:
from sklearn.preprocessing import StandardScaler

In [48]:
sc = StandardScaler()

In [49]:
x_train_sc = sc.fit_transform(x_train)

In [50]:
x_train_new = pd.DataFrame(x_train_sc, columns = x_train.columns)

In [51]:
np.round(x_train_new.describe(), 2)

Unnamed: 0,total_bill,tip,sex,day,time,size
count,195.0,195.0,195.0,195.0,195.0,195.0
mean,0.0,-0.0,-0.0,0.0,0.0,-0.0
std,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.65,-1.46,-1.37,-1.84,-0.61,-1.68
25%,-0.75,-0.76,-1.37,-0.76,-0.61,-0.61
50%,-0.26,-0.06,0.73,0.31,-0.61,-0.61
75%,0.53,0.43,0.73,0.31,1.64,0.45
max,3.5,4.85,0.73,1.39,1.64,3.65
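
`StandardScaler` computes z = (x - mean) / std with the population standard deviation (ddof=0), whereas pandas `describe()` reports the sample standard deviation (ddof=1), so the `std` row above equals 1.0 only after rounding. A minimal check on a made-up array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)

# Same computation by hand; np.std uses ddof=0 by default
manual = (train - train.mean()) / train.std()

print(train_scaled.ravel())
print(np.allclose(train_scaled, manual))
```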


# 2. Ordinal Encoding

In [57]:
import numpy as np 
import pandas as pd 

In [58]:
df = pd.read_csv("covid_toy.csv")

In [59]:
df.head(3) 

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No


In [61]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [62]:
df = df.dropna() #drop missing values

In [63]:
df.isnull().sum()

age          0
gender       0
fever        0
cough        0
city         0
has_covid    0
dtype: int64

In [56]:
df.head() 

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [64]:
df = df.drop(columns = ['age' , 'fever'])
df.head(3) 

Unnamed: 0,gender,cough,city,has_covid
0,Male,Mild,Kolkata,No
1,Male,Mild,Delhi,Yes
2,Male,Mild,Delhi,No


In [65]:
df['city'].value_counts()

city
Kolkata      29
Bangalore    28
Delhi        20
Mumbai       13
Name: count, dtype: int64

In [68]:
from sklearn.preprocessing import OrdinalEncoder 

In [69]:
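# One list per column, in column order; a value's position in its list becomes its code (e.g. Male -> 0, Female -> 1)
oe = OrdinalEncoder(categories = [['Male','Female'],
                                 ['Mild','Strong'],
                                 ['Kolkata','Bangalore','Delhi','Mumbai'],
                                 ['No','Yes']])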

In [70]:
df_sc = oe.fit_transform(df)

In [72]:
df_new = pd.DataFrame(df_sc , columns = df.columns)

In [73]:
df_new

Unnamed: 0,gender,cough,city,has_covid
0,0.0,0.0,0.0,0.0
1,0.0,0.0,2.0,1.0
2,0.0,0.0,2.0,0.0
3,1.0,0.0,0.0,0.0
4,1.0,0.0,3.0,0.0
...,...,...,...,...
85,1.0,0.0,1.0,0.0
86,1.0,1.0,0.0,1.0
87,1.0,0.0,1.0,0.0
88,1.0,1.0,3.0,0.0
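
Ordinal encoding is most meaningful when the categories have a real ranking. The sketch below uses a hypothetical `size` column (not part of `covid_toy.csv`) where the order Small < Medium < Large is passed explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with a natural order
toy = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small"]})

# The position in the list defines the code: Small=0, Medium=1, Large=2
oe = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded = oe.fit_transform(toy[["size"]])

print(encoded.ravel())
```

Without the `categories` argument, the encoder would fall back to sorted order (Large, Medium, Small), which would reverse the intended ranking here.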


# 3. OneHotEncoder

In [74]:
df = pd.read_csv("covid_toy.csv")

In [75]:
df.head(3) 

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No


In [76]:
df = df.dropna() 

In [77]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [82]:
from sklearn.preprocessing import OneHotEncoder 

In [83]:
ohe = OneHotEncoder(drop = 'first' , sparse_output = False , dtype = np.int32) 

In [84]:
df_new = ohe.fit_transform(df[['gender','cough','city','has_covid']]) 

In [None]:
df_new

# 4. get_dummies

In [86]:
df = pd.read_csv("covid_toy.csv")

In [87]:
print(df.columns)

Index(['age', 'gender', 'fever', 'cough', 'city', 'has_covid'], dtype='object')


In [88]:
df.head(3) 

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No


In [89]:
df = df.dropna() 

In [90]:
df.head() 

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [100]:
# Encode the original categorical columns; drop_first=True drops one dummy per column
df = pd.get_dummies(df, columns = ['gender','cough','city','has_covid'], drop_first=True)

In [101]:
print(df.columns.tolist())

['age', 'fever', 'gender_Male', 'cough_Strong', 'city_Delhi', 'city_Kolkata', 'city_Mumbai', 'has_covid_Yes']


In [102]:
# Convert the boolean dummy columns to 0/1 (note: this also casts 'fever' to int)
df = df.astype(int)

In [105]:
# Running get_dummies a second time fails: the original columns were already replaced by the dummy columns
pd.get_dummies(df, columns = ['gender','cough','city','has_covid'])

KeyError: "None of [Index(['gender', 'cough', 'city', 'has_covid'], dtype='object')] are in the [columns]"
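
As an alternative sketch, `get_dummies` accepts a `dtype` argument, which yields 0/1 dummy columns directly and leaves the other columns untouched. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "age": [60, 27, 42],
    "gender": ["Male", "Female", "Male"],
    "city": ["Delhi", "Kolkata", "Mumbai"],
})

# dtype=int gives 0/1 dummies; 'age' keeps its original values and dtype
dummies = pd.get_dummies(toy, columns=["gender", "city"], drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```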