<h2>Label Encoding and One Hot Encoding</h2>

#### What is Label Encoding?

Label encoding is a technique for converting categorical data into numerical data. It works by assigning each unique category in a feature a numeric label, with the first category labeled as 0, the second as 1, and so on. The resulting numeric labels can then be used as inputs for machine learning models.

For example, suppose we have a dataset with a "Color" feature that contains three unique categories: "Red," "Green," and "Blue." We can use label encoding to assign each category a unique label as follows: "Red" = 0, "Green" = 1, and "Blue" = 2.
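
The Color mapping above can be reproduced in a few lines. This is a minimal sketch, assuming a small made-up series of colors (not data from this notebook). It uses `pd.factorize`, which numbers categories in order of first appearance, and contrasts it with scikit-learn's `LabelEncoder`, which sorts the categories before numbering them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample values for illustration only
colors = pd.Series(["Red", "Green", "Blue", "Green", "Red"], name="Color")

# pd.factorize numbers categories in order of first appearance,
# which gives Red = 0, Green = 1, Blue = 2 for this series.
codes, uniques = pd.factorize(colors)
print(codes.tolist())         # [0, 1, 2, 1, 0]

# LabelEncoder sorts the categories first, so the same data
# gets Blue = 0, Green = 1, Red = 2 instead.
encoder = LabelEncoder()
sorted_codes = encoder.fit_transform(colors)
print(sorted_codes.tolist())  # [2, 1, 0, 1, 2]
```

Either numbering is valid; the point is that the code a category receives depends on the tool, not on the category itself.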

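#### Advantages of Label Encoding:

Simplifies Categorical Data: Each category is replaced by a single integer, so the feature stays in one column.

Works with Many Machine Learning Algorithms: Most estimators require numeric input, and integer codes satisfy that requirement.

Preserves Information: Every category maps to exactly one code, so the encoding is reversible, and for ordinal features the codes can retain the ranking when the mapping follows it.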


#### Disadvantages of Label Encoding:

Arbitrary Numeric Labels: The labels assigned by label encoding are arbitrary, meaning that they have no inherent meaning or relationship to the categories they represent. This can lead to problems if the labels are interpreted as having a meaningful numerical relationship.


Can Create Bias: The arbitrary labels assigned by label encoding can create bias in some machine learning models. For example, if the labels are used in a regression model, the resulting predictions may be influenced by the arbitrary numerical relationship between the categories.


Not Suitable for Nominal Data: Label encoding is best suited to ordinal data, where the categories have a natural ordering that the numeric labels can follow. It is not appropriate for nominal data, where the categories have no inherent order or ranking.
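
The "arbitrary numerical relationship" problem can be made concrete by comparing distances between categories. The sketch below is illustrative only: it reuses the Color codes from the earlier example, and the helper names `label_gap` and `one_hot_gap` are made up here.

```python
import numpy as np

# Label codes from the Color example: Red = 0, Green = 1, Blue = 2
label = {"Red": 0, "Green": 1, "Blue": 2}

# One-hot vectors for the same three categories
one_hot = {
    "Red": np.array([1, 0, 0]),
    "Green": np.array([0, 1, 0]),
    "Blue": np.array([0, 0, 1]),
}

def label_gap(a, b):
    """Absolute difference between two label codes."""
    return abs(label[a] - label[b])

def one_hot_gap(a, b):
    """Euclidean distance between two one-hot vectors."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# With label codes, Red looks twice as far from Blue as from Green.
print(label_gap("Red", "Green"), label_gap("Red", "Blue"))
# With one-hot vectors, every pair of categories is equally far apart.
print(one_hot_gap("Red", "Green"), one_hot_gap("Red", "Blue"))
```

A model that uses the numeric value directly, such as linear regression, sees the first kind of spacing as real, which is why nominal features are usually one-hot encoded instead.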

#### Categorical Data Types:

1. Nominal: categories have no inherent order.
   Examples: Male, Female | red, blue, green | cities
2. Ordinal: categories have a rank-based order.
   Examples: High, Medium, Low | Grades | Class | Customer Satisfaction
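
For an ordinal feature, the numeric codes should follow the ranking rather than the order in which values happen to appear. A minimal sketch, assuming a made-up satisfaction column with the levels Low, Medium, High:

```python
import pandas as pd

# Made-up sample values for illustration only
satisfaction = pd.Series(["High", "Low", "Medium", "Low"])

# Declare the order explicitly, lowest to highest.
levels = ["Low", "Medium", "High"]
ordered = pd.Categorical(satisfaction, categories=levels, ordered=True)

# .codes gives each value's position in `levels`
rank = pd.Series(ordered.codes, index=satisfaction.index)
print(rank.tolist())  # [2, 0, 1, 0]
```

Because the order is declared explicitly, Low < Medium < High holds for the codes regardless of how the raw strings sort alphabetically.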

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,chennai,2600,550000
1,chennai,3000,565000
2,chennai,3200,610000
3,chennai,3600,680000
4,chennai,4000,725000
5,Delhi,2600,585000
6,Delhi,2800,615000
7,Delhi,3300,650000
8,Delhi,3600,710000
9,Bangalore,2600,575000


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   town    13 non-null     object
 1   area    13 non-null     int64 
 2   price   13 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 440.0+ bytes


# One Hot Encoding using SKLEARN

Now use `OneHotEncoder` to create a dummy variable for each town.

In [5]:
from sklearn.preprocessing import OneHotEncoder
one = OneHotEncoder()

In [15]:
# OneHotEncoder emits columns in sorted category order: 'Bangalore', 'Delhi', 'chennai'.
# The hand-written labels below must follow that same order.
Terms = ['Bangalore','Delhi','Chennai']
df_town = pd.DataFrame(one.fit_transform(df[['town']]).toarray(),columns=Terms)

In [16]:
df_town

Unnamed: 0,Bangalore,Delhi,Chennai
0,0.0,0.0,1.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
5,0.0,1.0,0.0
6,0.0,1.0,0.0
7,0.0,1.0,0.0
8,0.0,1.0,0.0
9,1.0,0.0,0.0
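
The hand-written label list in the cell above has to match the encoder's column order exactly. One way to avoid that risk is to read the labels from the fitted encoder. This is a minimal sketch using a few made-up stand-in rows rather than `homeprices.csv`; the names `sample`, `encoder`, and `sample_town` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in rows for illustration, not the contents of homeprices.csv
sample = pd.DataFrame({"town": ["chennai", "Delhi", "Bangalore", "Delhi"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(sample[["town"]]).toarray()

# categories_ holds one array per input column, listed in the same
# order as the encoded output columns.
town_labels = list(encoder.categories_[0])
sample_town = pd.DataFrame(encoded, columns=town_labels, index=sample.index)
print(sample_town)
```

Note that the labels come back exactly as they appear in the data, so the lowercase `chennai` is preserved.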


In [17]:
df = df.join(df_town)
df

Unnamed: 0,town,area,price,Bangalore,Delhi,Chennai
0,chennai,2600,550000,0.0,0.0,1.0
1,chennai,3000,565000,0.0,0.0,1.0
2,chennai,3200,610000,0.0,0.0,1.0
3,chennai,3600,680000,0.0,0.0,1.0
4,chennai,4000,725000,0.0,0.0,1.0
5,Delhi,2600,585000,0.0,1.0,0.0
6,Delhi,2800,615000,0.0,1.0,0.0
7,Delhi,3300,650000,0.0,1.0,0.0
8,Delhi,3600,710000,0.0,1.0,0.0
9,Bangalore,2600,575000,1.0,0.0,0.0


In [18]:
# Drop the original text column and one dummy column; Bangalore becomes the baseline (all zeros)
df = df.drop(['Bangalore','town'],axis=1)
df

Unnamed: 0,area,price,Delhi,Chennai
0,2600,550000,0.0,1.0
1,3000,565000,0.0,1.0
2,3200,610000,0.0,1.0
3,3600,680000,0.0,1.0
4,4000,725000,0.0,1.0
5,2600,585000,1.0,0.0
6,2800,615000,1.0,0.0
7,3300,650000,1.0,0.0
8,3600,710000,1.0,0.0
9,2600,575000,0.0,0.0
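
Dropping one dummy column is the usual way to avoid perfectly collinear inputs: with all three town columns present, each row sums to 1, which duplicates the intercept of a linear model. The encoder can also drop the column itself through its `drop` parameter. A minimal sketch on made-up stand-in rows (not the notebook's CSV):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in rows for illustration, not the contents of homeprices.csv
sample = pd.DataFrame({"town": ["chennai", "Delhi", "Bangalore", "Delhi"]})

# drop="first" removes the first category in sorted order ('Bangalore' here)
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(sample[["town"]]).toarray()

# The remaining columns correspond to every category except the first one
kept = list(encoder.categories_[0][1:])
reduced = pd.DataFrame(encoded, columns=kept, index=sample.index)
print(reduced)
```

The Bangalore row is encoded as all zeros, matching the result of dropping the column by hand.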


<h2 style='color:purple'>Using pandas to create dummy variables - One Hot Encoding</h2>

In [23]:
df1 = pd.read_csv("homeprices.csv")


In [20]:
# For a Series, get_dummies encodes every distinct value; columns= only applies to DataFrames
dummies = pd.get_dummies(df1.town)
dummies

Unnamed: 0,Bangalore,Delhi,chennai
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
5,0,1,0
6,0,1,0
7,0,1,0
8,0,1,0
9,1,0,0


In [21]:
df1 = pd.concat([df1,dummies],axis='columns')
df1

Unnamed: 0,town,area,price,Bangalore,Delhi,chennai
0,chennai,2600,550000,0,0,1
1,chennai,3000,565000,0,0,1
2,chennai,3200,610000,0,0,1
3,chennai,3600,680000,0,0,1
4,chennai,4000,725000,0,0,1
5,Delhi,2600,585000,0,1,0
6,Delhi,2800,615000,0,1,0
7,Delhi,3300,650000,0,1,0
8,Delhi,3600,710000,0,1,0
9,Bangalore,2600,575000,1,0,0


In [22]:
df1 = df1.drop(['town','Bangalore'], axis='columns')
df1

Unnamed: 0,area,price,Delhi,chennai
0,2600,550000,0,1
1,3000,565000,0,1
2,3200,610000,0,1
3,3600,680000,0,1
4,4000,725000,0,1
5,2600,585000,1,0
6,2800,615000,1,0
7,3300,650000,1,0
8,3600,710000,1,0
9,2600,575000,0,0


#### One Line Code - One Hot encoding

In [24]:
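# 'town' was already dropped from df1 above, so encode a fresh copy of the CSV
df1 = pd.get_dummies(data = pd.read_csv("homeprices.csv"), columns = ['town'],prefix = ['town'],drop_first = True)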

In [25]:
df1

Unnamed: 0,area,price,town_Delhi,town_chennai
0,2600,550000,0,1
1,3000,565000,0,1
2,3200,610000,0,1
3,3600,680000,0,1
4,4000,725000,0,1
5,2600,585000,1,0
6,2800,615000,1,0
7,3300,650000,1,0
8,3600,710000,1,0
9,2600,575000,0,0


# Model Prediction One Hot Encoding

In [26]:
y_train = df.price

In [27]:
x_train = df.drop('price', axis=1)
x_train

Unnamed: 0,area,Delhi,Chennai
0,2600,0.0,1.0
1,3000,0.0,1.0
2,3200,0.0,1.0
3,3600,0.0,1.0
4,4000,0.0,1.0
5,2600,1.0,0.0
6,2800,1.0,0.0
7,3300,1.0,0.0
8,3600,1.0,0.0
9,2600,0.0,0.0


In [28]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [29]:
linear_reg = LinearRegression()
linear_reg.fit(x_train, y_train)
y_pred_linear = linear_reg.predict(x_train)
r2_score(y_train, y_pred_linear)

0.9573929037221872
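
The score above is measured on the same rows the model was fitted on, so it describes fit rather than performance on unseen data. To use the model on a new home, the new row needs the same encoded columns in the same order. The sketch below is self-contained and rebuilds only the ten rows displayed in the tables above (the CSV may hold more, so the numbers will not match the notebook's score); the new home's values are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# The ten rows displayed in the tables above
homes = pd.DataFrame({
    "town": ["chennai"] * 5 + ["Delhi"] * 4 + ["Bangalore"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# One-hot encode 'town', dropping the first category (Bangalore) as the baseline
encoded = pd.get_dummies(homes, columns=["town"], drop_first=True, dtype=int)
features = encoded.drop(columns="price")
target = encoded["price"]

model = LinearRegression().fit(features, target)
train_r2 = r2_score(target, model.predict(features))

# Hypothetical new home: area 3000 in Delhi, laid out with the
# same column names and order as the training features.
new_home = pd.DataFrame([[3000, 1, 0]], columns=features.columns)
predicted = model.predict(new_home)
print(train_r2, predicted[0])
```

A Bangalore home would be written with both town columns set to 0, since Bangalore is the dropped baseline.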