##### Handle Categorical Features (Nominal Encoding)
###### 1. One Hot Encoding
###### 2. One Hot Encoding with many categories in a feature
###### 3.Mean Encoding

### One Hot Encoding
#### Datasets often contain categorical columns that hold string values. For example, a Gender column takes labels such as Male and Female. These labels have no inherent order, and because they are strings, most machine learning models cannot work on them directly.

#### One approach is label encoding, which assigns a numerical value to each label, for example mapping Male to 0 and Female to 1. This can add bias to the model: it may treat Female as greater than Male because 1 > 0, although both labels are equally important and carry no order. The One Hot Encoding technique avoids this issue.

#### One hot encoding creates a separate binary column for each label of the categorical feature, here one for Male and one for Female. Whenever Gender is Male, the row gets 1 in the Male column and 0 in the Female column, and vice versa.
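
The difference between the two encodings can be seen on a small hand-made frame before moving to the real data. This is only a sketch: the `toy` frame and its values are made up for illustration, and `dtype=int` is passed because recent pandas versions return boolean indicator columns by default.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from titanic.csv)
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Label encoding: one integer column, which implies Female (1) > Male (0)
label_encoded = toy["Gender"].map({"Male": 0, "Female": 1})

# One hot encoding: one indicator column per label, no implied order
one_hot = pd.get_dummies(toy["Gender"], dtype=int)

print(label_encoded.tolist())
print(one_hot)
```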

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('titanic.csv',usecols=['Sex'])

In [3]:
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [4]:
pd.get_dummies(df).head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


##### As we can observe, the Sex attribute has 2 categories, so get_dummies() creates two binary columns, one per category.
##### For a 2-category feature, a single column carries enough information: in the Sex_male column, 1 means male and 0 means not male (female).
#### Hence, it is good practice to use drop_first=True, which drops the column of the first category.

In [9]:
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


In [10]:
df=pd.read_csv('titanic.csv',usecols=['Embarked'])

In [14]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [16]:
df.dropna(inplace=True)

In [19]:
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


### Onehotencoding with many categories in a feature

#### When a feature has a huge number of categories, plain one hot encoding adds one column per category and inflates the feature space. Instead, we create one binary variable for each of the 10 most frequent labels in each column. This approach was used in a Kaggle competition. https://www.kaggle.com/getting-started/114857

In [6]:
df=pd.read_csv('mercedes.csv',usecols=["X0","X1","X2","X3","X4","X5","X6"])

In [7]:
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [8]:
for i in df.columns:
    print(i,"-",len(df[i].unique()))

X0 - 47
X1 - 27
X2 - 44
X3 - 7
X4 - 4
X5 - 29
X6 - 12


As we see, X0, X1, X2, X5 and X6 have a lot of categories. One hot encoding them (dropping the first level of each) would add 46 + 26 + 43 + 28 + 11 = 154 columns to the data.
Instead, we select only the 10 most frequent labels of each column.

In [38]:
df.X1.value_counts().sort_values(ascending=False).head(10)

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
Name: X1, dtype: int64

In [40]:
lst_10=df.X1.value_counts().sort_values(ascending=False).head(10).index
lst_10=list(lst_10)

In [41]:
lst_10

['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

In [42]:
import numpy as np
for categories in lst_10:
    df[categories]=np.where(df['X1']==categories,1,0)

In [49]:
lst_10.append('X1')

In [50]:
df[lst_10]

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o,X1
0,0,0,0,0,1,0,0,0,0,0,v
1,0,0,0,0,0,0,0,0,0,0,t
2,0,0,0,0,0,0,0,0,0,0,w
3,0,0,0,0,0,0,0,0,0,0,t
4,0,0,0,0,1,0,0,0,0,0,v
5,0,0,1,0,0,0,0,0,0,0,b
6,0,0,0,0,0,1,0,0,0,0,r
7,0,0,0,1,0,0,0,0,0,0,l
8,0,1,0,0,0,0,0,0,0,0,s
9,0,0,1,0,0,0,0,0,0,0,b
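
The steps above handle X1 only. A minimal sketch of how they could be wrapped in a reusable helper is shown below. The function name `one_hot_top_k` and the toy frame are hypothetical, not part of the original notebook. The new columns are prefixed with the source column name, on the assumption that the same label (such as `a`) may appear in several columns and would otherwise overwrite an earlier indicator.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(frame, column, k=10):
    """Return a copy of `frame` with one 0/1 column per top-k label of `column`."""
    # value_counts() already sorts labels by descending frequency
    top_labels = frame[column].value_counts().head(k).index
    out = frame.copy()
    for label in top_labels:
        # Prefix with the source column so names from different columns cannot collide
        out[f"{column}_{label}"] = np.where(out[column] == label, 1, 0)
    return out

# Toy data standing in for mercedes.csv (values made up for illustration)
toy = pd.DataFrame({"X1": ["aa", "s", "aa", "b", "s", "aa", "t"]})
encoded = one_hot_top_k(toy, "X1", k=2)
print(encoded)
```

Rows whose label is outside the top k get 0 in every indicator column, which matches rows 1 to 3 in the table above.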


### Mean Encoding

In [36]:
df=pd.read_csv('titanic.csv',usecols=['Age','Survived'])

In [37]:
df.head()


Unnamed: 0,Survived,Age
0,0,22.0
1,1,38.0
2,1,26.0
3,1,35.0
4,0,35.0


#### Count the data points for each Age value

In [38]:
df.groupby(['Age'])['Survived'].count().head(10)

Age
0.42     1
0.67     1
0.75     2
0.83     2
0.92     1
1.00     7
2.00    10
3.00     6
4.00    10
5.00     4
Name: Survived, dtype: int64

#### Group the data by Age and take the mean of the target (Survived) within each group

In [39]:
df.groupby(['Age'])['Survived'].mean().head(10)

Age
0.42    1.000000
0.67    1.000000
0.75    1.000000
0.83    1.000000
0.92    1.000000
1.00    0.714286
2.00    0.300000
3.00    0.833333
4.00    0.700000
5.00    1.000000
Name: Survived, dtype: float64

#### The output shows, for each Age value, the mean of the target Survived (1 = positive, 0 = negative), that is, the fraction of passengers of that age who survived.

#### Finally, map these mean values back onto df['Age']

In [40]:
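# Mean of Survived for each Age value, as a lookup dictionary
mean_encoded_age = df.groupby(['Age'])['Survived'].mean().to_dict()

# Replace each Age with its group mean; rows with a missing Age stay NaN
df['Age_Mapped'] = df['Age'].map(mean_encoded_age)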

df.head(10)


Unnamed: 0,Survived,Age,Age_Mapped
0,0,22.0,0.407407
1,1,38.0,0.454545
2,1,26.0,0.333333
3,1,35.0,0.611111
4,0,35.0,0.611111
5,0,,
6,0,54.0,0.375
7,0,2.0,0.3
8,1,27.0,0.611111
9,1,14.0,0.5
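
The same groupby-then-map pattern applies to a nominal column with string labels. The sketch below uses a small made-up frame with an `Embarked` column in place of titanic.csv. The values are illustrative only, and `Embarked_Mapped` is a hypothetical column name.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from titanic.csv)
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "C", "S"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of the target within each category
category_means = toy.groupby("Embarked")["Survived"].mean()

# Replace each label with the mean target value of its category
toy["Embarked_Mapped"] = toy["Embarked"].map(category_means)

print(toy)
```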
