## Data Encoding
 Nominal/OHE Encoding
 Label and Ordinal Encoding
 Target Guided Ordinal Encoding

## Nominal/OHE Encoding
One-hot encoding (OHE), referred to in this notebook as nominal encoding, represents categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector with one position per unique category: the position of the observed category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:
1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
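
The mapping above can be reproduced without any library. The following is a minimal sketch for illustration only, not how scikit-learn implements it; the helper name `one_hot` is hypothetical, and the category order (red, green, blue) is chosen to match the list above.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# Category order is an assumption made to match the list above.
categories = ["red", "green", "blue"]

vectors = {c: one_hot(c, categories) for c in categories}
print(vectors)
```

Note that `OneHotEncoder` sorts categories alphabetically by default, so its columns come out in the order blue, green, red rather than the order used here.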

In [2]:
import pandas as pd

In [3]:
from sklearn.preprocessing import OneHotEncoder

In [6]:
## create a simple dataframe
df=pd.DataFrame({'color':['red','blue','green','green','red','blue']})

In [7]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [8]:
## Create an instance of OneHotEncoder
encoder=OneHotEncoder()

In [10]:
## perform fit and then transform
encoder.fit_transform(df[['color']])

<6x3 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [12]:
encoded=encoder.fit_transform(df[['color']]).toarray()

In [14]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [15]:
encoder_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [17]:
## For new data, pass a DataFrame with the same column name that was used in fit
encoder.transform(pd.DataFrame({'color':['blue']})).toarray()



array([[1., 0., 0.]])
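
The encoder above was created with default settings, so transforming a category it never saw during fitting (say, "purple") would raise an error. One possible way to tolerate unseen values is the `handle_unknown="ignore"` option, sketched below; the names `tolerant_encoder`, `train`, and `new_data` are hypothetical and introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["color"]])

# "purple" was not present when the encoder was fitted.
new_data = pd.DataFrame({"color": ["blue", "purple"]})
encoded_new = tolerant_encoder.transform(new_data[["color"]]).toarray()
print(encoded_new)
```

The row for "blue" has a 1 in the first column (categories are sorted as blue, green, red), while the row for "purple" is all zeros.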

In [18]:
pd.concat([df,encoder_df],axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


In [11]:
import seaborn as sns 
df1=sns.load_dataset('tips')

In [12]:
df1=df1.head()

In [13]:
df1

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [14]:
df1=pd.DataFrame(df1)


In [15]:
encoder1=OneHotEncoder()

In [17]:
encoded1=encoder1.fit_transform(df1[['sex']]).toarray()

In [33]:
encoder_df1=pd.DataFrame(encoded1,columns=encoder1.get_feature_names_out())

In [34]:
encoder_df1

Unnamed: 0,sex_Female,sex_Male
0,1.0,0.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0


In [36]:
pd.concat([df1,encoder_df1],axis=1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0.0,1.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0
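
As a shorter alternative to the fit, transform, and concat steps above, pandas provides `get_dummies`. The sketch below uses a small hand-made frame (`tips_sample`, a hypothetical name) that mirrors a few rows of the tips data instead of loading the real dataset.

```python
import pandas as pd

# Small hand-made frame mirroring the first rows of the tips data above.
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "sex": ["Female", "Male", "Male"],
})

# get_dummies replaces each listed column with one indicator column per category.
dummies = pd.get_dummies(tips_sample, columns=["sex"], dtype=float)
print(dummies)
```

Unlike the `pd.concat` approach, `get_dummies` drops the original `sex` column. It also keeps no fitted state, so `OneHotEncoder` remains the better fit when the same encoding has to be applied to new data later.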


## Assignment 20 March

In [None]:
## Ans 1: Data encoding converts categorical data into numerical values so that machine learning algorithms can work with it.
## For example, danger and caution levels are categorical data: a column describing how earthquake-prone an area is (not likely, somewhat, most likely) can be encoded as numbers that follow that order.

In [None]:
## Ans 2: Nominal (one-hot) encoding is a technique used to represent categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector where each bit corresponds to a unique category.
## For example, the post of each employee in a company (manager, senior manager, vice president, CEO) can be encoded with one binary column per post.

In [None]:
## Ans 3: Nominal encoding is used where the categories have no rank and are just names, for example gender, city of residence, or marital status.
## When the categories do carry a rank, such as junior, senior, captain, or first, second, third, ordinal encoding is the appropriate choice rather than OHE.
## For example, in a bank's employee data, location and gender are nominal while post is ranked, so each column gets the encoding that matches it.
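
To make the ranked case concrete, here is a minimal sketch that encodes the junior, senior, captain example as integers using a pandas ordered categorical. The variable names and the sample values are hypothetical, and the order of `rank_order` is an assumption that must be supplied by hand because the ranking cannot be inferred from the strings themselves.

```python
import pandas as pd

# Assumed ranking, lowest to highest.
rank_order = ["junior", "senior", "captain"]
ranks = ["senior", "junior", "captain", "junior"]

# An ordered categorical assigns each value the index of its rank.
rank_codes = pd.Categorical(ranks, categories=rank_order, ordered=True).codes.tolist()
print(rank_codes)  # [1, 0, 2, 0]
```

A single integer column preserves the order (junior < senior < captain), which one-hot columns would discard.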

In [None]:
## Ans 4: For categorical data with 5 unique values that are only names, such as a person's name, a place, or a status (WFH or office, on roll or off roll), I will use nominal (one-hot) encoding.
## If ranks are involved, I will use ordinal encoding instead.

In [31]:
# Ans 5: One-hot encoding creates one new 0/1 column per unique category, so two categorical columns with two categories each produce 4 new columns (2 for the first column and 2 for the second).

#Example 
import pandas as pd

import seaborn as sns

df=sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [5]:
## we will use sex and smoker data as categorical data 
from sklearn.preprocessing import OneHotEncoder

In [6]:
encoder=OneHotEncoder()

In [18]:
encoded=encoder.fit_transform(df[['sex','smoker']]).toarray()

In [22]:
encoded[0:8]

array([[1., 0., 1., 0.],
       [0., 1., 1., 0.],
       [0., 1., 1., 0.],
       [0., 1., 1., 0.],
       [1., 0., 1., 0.],
       [0., 1., 1., 0.],
       [0., 1., 1., 0.],
       [0., 1., 1., 0.]])

In [33]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [34]:
encoder_df

Unnamed: 0,sex_Female,sex_Male,smoker_No,smoker_Yes
0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0
2,0.0,1.0,1.0,0.0
3,0.0,1.0,1.0,0.0
4,1.0,0.0,1.0,0.0
...,...,...,...,...
239,0.0,1.0,1.0,0.0
240,1.0,0.0,0.0,1.0
241,0.0,1.0,0.0,1.0
242,0.0,1.0,1.0,0.0


In [39]:
## Ans 5
pd.concat([df,encoder_df],axis=1)[0:10]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male,smoker_No,smoker_Yes
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0,1.0,0.0
5,25.29,4.71,Male,No,Sun,Dinner,4,0.0,1.0,1.0,0.0
6,8.77,2.0,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0
7,26.88,3.12,Male,No,Sun,Dinner,4,0.0,1.0,1.0,0.0
8,15.04,1.96,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0
9,14.78,3.23,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0
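
The column count in Ans 5 generalizes: one-hot encoding adds one column per distinct value of each encoded column. The sketch below checks this on a small hypothetical frame (`data`, with made-up rows) where the two categorical columns have 2 and 3 distinct values.

```python
import pandas as pd

# Hypothetical frame: two categorical columns and one numeric column.
data = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "day": ["Sun", "Sat", "Thur", "Sun"],
    "tip": [1.01, 1.66, 3.50, 3.31],
})
categorical = ["sex", "day"]

# Expected count: sum of the number of distinct values per categorical column.
expected_new_columns = int(data[categorical].nunique().sum())

# Actual count: number of indicator columns produced by one-hot encoding.
actual_new_columns = pd.get_dummies(data[categorical]).shape[1]
print(expected_new_columns, actual_new_columns)
```

Both numbers are 5 here (2 for `sex` plus 3 for `day`), which is why the answer depends on how many unique values each categorical column has, not only on how many categorical columns there are.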


In [None]:
## Ans 6: I will use nominal encoding for this case, because species, habitat, and diet are names with no inherent order.

In [None]:
## Ans 7: I will use ordinal encoding for the columns that have ranks involved; OHE will be used for the columns that are only names.