# Handling Categorical Values

Most machine learning algorithms cannot handle categorical variables directly, because they operate on numerical values. Categorical variables therefore need to be converted to numbers first.

**Types of categorical variables**
1. Nominal: the values have no particular order.
> For example, if *Color* is the variable, its values might be Red, Yellow, Pink, Blue.

2. Ordinal: the values have a meaningful order.
> For example, if *Rating* is the variable, its values might be Excellent, Okay, Bad.
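
The difference between the two types can be made explicit in pandas. The following is a minimal sketch on hypothetical toy data (not taken from Churn_Modelling.csv); the variable names are illustrative only. It declares *Rating* as an ordered categorical and *Color* as an unordered one.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
ratings = pd.Series(["Okay", "Bad", "Excellent", "Okay"])
colors = pd.Series(["Red", "Blue", "Pink", "Red"])

# Ordinal: declare the order explicitly, from lowest to highest.
rating_type = pd.CategoricalDtype(["Bad", "Okay", "Excellent"], ordered=True)
ordinal = ratings.astype(rating_type)

# Nominal: categories without any order.
nominal = colors.astype("category")

# Comparisons are only meaningful for the ordered variable.
better_than_bad = (ordinal > "Bad").tolist()

# Integer codes follow the declared order: Bad=0, Okay=1, Excellent=2.
rating_codes = ordinal.cat.codes.tolist()
```

Declaring the order up front keeps the encoding tied to the meaning of the categories rather than to their spelling.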


There are many ways we can encode these categorical variables
1. One Hot Encoding
2. Dummies
3. Binary Encoding
4. Label Encoding
5. Ordinal Encoding
6. Helmert Encoding
7. Frequency Encoding
8. Mean Encoding
9. Weight of Evidence Encoding
10. Probability Ratio Encoding
11. Hashing Encoding
12. Backward Difference Encoding
13. Leave One Out Encoding
14. James-Stein Encoding
15. M-estimator Encoding
16. Thermometer Encoder

In [130]:
import pandas as pd 
dataset = pd.read_csv('Churn_Modelling.csv')
df = dataset[['Geography','Gender']]
df

Unnamed: 0,Geography,Gender
0,France,Female
1,Spain,Female
2,France,Female
3,France,Female
4,Spain,Female
...,...,...
9995,France,Male
9996,France,Male
9997,France,Female
9998,Germany,Male


In [131]:
df.columns

Index(['Geography', 'Gender'], dtype='object')

In [133]:
# OneHot_Encoder
import category_encoders as ce
OneHot_Encoder = ce.OneHotEncoder(cols=['Geography','Gender']).fit_transform(df)
OneHot_Encoder

Unnamed: 0,Geography_1,Geography_2,Geography_3,Gender_1,Gender_2
0,1,0,0,1,0
1,0,1,0,1,0
2,1,0,0,1,0
3,1,0,0,1,0
4,0,1,0,1,0
...,...,...,...,...,...
9995,1,0,0,0,1
9996,1,0,0,0,1
9997,1,0,0,1,0
9998,0,0,1,0,1


*  One-hot encoding is one of the most popular encoding methods.

*  Each category is mapped to a vector of 1s and 0s that denotes the presence or absence of that category, so a feature with N categories produces N new columns. When a feature has a very large number of categories, this creates many columns, which can slow down training significantly.
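
To see how the column count grows with the number of categories, here is a small sketch. It uses a hypothetical high-cardinality column named City (not part of the dataset above) and pandas' get_dummies as a stand-in for a one-hot encoder.

```python
import pandas as pd

# Hypothetical toy column standing in for a high-cardinality feature.
toy = pd.DataFrame({"City": [f"city_{i}" for i in range(50)] * 2})

encoded = pd.get_dummies(toy, columns=["City"])

n_categories = toy["City"].nunique()  # number of distinct categories
n_columns = encoded.shape[1]          # one new column per category

# Each row has exactly one active column.
row_sums = encoded.sum(axis=1)
```

With 50 distinct values the single City column turns into 50 columns, which is why other encodings are often preferred for features with many categories.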



In [132]:
# Dummies
Dummies=pd.get_dummies(df,columns=['Geography','Gender']).head()
Dummies

Unnamed: 0,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,1,0,0,1,0
1,0,0,1,1,0
2,1,0,0,1,0
3,1,0,0,1,0
4,0,0,1,1,0


*  Pandas provides the get_dummies function, which is easy to use.

*  By default, the new column names are prefixed with the original column name. A different prefix can be passed if the default is not wanted.
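
As a short sketch of the prefix option, the example below passes a custom prefix per column through the prefix and prefix_sep parameters of get_dummies. The toy frame and the prefixes "Geo" and "Sex" are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Small illustrative frame with the same column names as above.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
})

# A dict maps each encoded column to its own prefix.
custom = pd.get_dummies(
    toy,
    columns=["Geography", "Gender"],
    prefix={"Geography": "Geo", "Gender": "Sex"},
    prefix_sep="-",
)
custom_columns = list(custom.columns)
```

The resulting columns are named Geo-France, Geo-Germany, Geo-Spain, Sex-Female and Sex-Male.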



In [134]:
# Binary Encoder
# Pass a one-column DataFrame so that cols=['Gender'] refers to a named column
Binary_Encoder = ce.BinaryEncoder(cols=['Gender']).fit_transform(df[['Gender']])
Binary_Encoder

Unnamed: 0,Gender_0,Gender_1
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
...,...,...
9995,1,0
9996,1,0
9997,0,1
9998,1,0


*  Binary encoding first converts each category to an integer and then writes that integer in binary digits. Each binary digit becomes one feature column.

*  If there are n unique categories, binary encoding needs only about log₂(n) feature columns, compared with n columns for one-hot encoding.
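
The idea can be sketched in plain Python. This is an illustrative implementation, not the code used by category_encoders; the function name binary_encode and the choice to number categories from 1 in order of first appearance are assumptions of the sketch.

```python
def binary_encode(values):
    """Map each category to an integer, then to a fixed-width list of bits."""
    # Assumption of this sketch: categories are numbered 1..n in order of
    # first appearance, so the all-zero bit pattern is never used.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Number of bits needed to write the largest code.
    width = len(codes).bit_length()
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


genders = ["Female", "Female", "Male"]
gender_bits = binary_encode(genders)  # Female -> [0, 1], Male -> [1, 0]

# 100 distinct categories fit into 7 bit columns instead of 100 one-hot columns.
cities = [f"city_{i}" for i in range(100)]
city_width = len(binary_encode(cities)[0])
```

For the two Gender values this numbering gives the same bit patterns as the output above: 0,1 for Female and 1,0 for Male.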

In [135]:
# LabelEncoder
from sklearn.preprocessing import LabelEncoder
Label_Encoder=LabelEncoder().fit_transform(df.Gender)
Label_Encoder

array([0, 0, 0, ..., 0, 1, 0])

*  LabelEncoder assigns each category an integer from 0 through N-1, where N is the number of categories for the feature.

*  One major issue with this approach is that the integers imply an order even when there is no relation or order between the classes, so an algorithm might treat the codes as ordered or related.
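
The spurious order is easy to see on a nominal column. The sketch below uses pandas category codes as a stand-in for LabelEncoder (both assign integers to the sorted class names); the toy series is illustrative only.

```python
import pandas as pd

geo = pd.Series(["France", "Spain", "Germany", "France"])

# Integer codes are assigned in sorted (alphabetical) category order.
as_category = geo.astype("category")
codes = as_category.cat.codes
mapping = {name: i for i, name in enumerate(as_category.cat.categories)}

# Arithmetic on the codes is possible but has no meaning for a nominal
# variable: the "average country" below is just an artifact of the encoding.
mean_code = codes.mean()
```

Here Spain receives a larger code than Germany only because of alphabetical order, yet a model that treats the column as numeric will read it as Spain > Germany > France.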

In [136]:
# Ordinal Encoding
Geography_dict = {'Germany' : 1,
                  'France' : 2,
                  'Spain' : 3}
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(Geography_dict=df.Geography.map(Geography_dict))
df



Unnamed: 0,Geography,Gender,Geography_dict
0,France,Female,2
1,Spain,Female,3
2,France,Female,2
3,France,Female,2
4,Spain,Female,3
...,...,...,...
9995,France,Male,2
9996,France,Male,2
9997,France,Female,2
9998,Germany,Male,1


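*  Ordinal encoding looks similar to label encoding, but the integers are assigned according to the order of the categories. LabelEncoder does not consider whether the variable is ordinal; it simply assigns a sequence of integers. Geography has no natural order, so the mapping above only illustrates the mechanics.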
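
For a variable that really is ordinal, the same dictionary approach could look like the sketch below. The Rating column, its values, and the extra category "Good" are hypothetical and serve only to show what happens to a value missing from the mapping.

```python
import pandas as pd

# Hypothetical ordinal column; "Good" is deliberately missing from the mapping.
ratings = pd.DataFrame({"Rating": ["Excellent", "Bad", "Okay", "Good"]})

# The integers follow the real order of the categories.
rating_order = {"Bad": 1, "Okay": 2, "Excellent": 3}

# assign() returns a new DataFrame instead of modifying the original in place.
encoded = ratings.assign(Rating_encoded=ratings["Rating"].map(rating_order))

# Categories that are not in the dictionary become NaN, so check for them.
unmapped = encoded.loc[encoded["Rating_encoded"].isna(), "Rating"].tolist()
```

Checking for unmapped values after the call to map is a cheap way to catch categories that were forgotten in the dictionary.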

In [137]:
# Helmert Encoder
encoding = ce.HelmertEncoder(cols=['Geography'],drop_invariant=True)
dhf = encoding.fit_transform(df['Geography'])
Helmert_Encoder = pd.concat([df,dhf],axis=1)
Helmert_Encoder


Unnamed: 0,Geography,Gender,Geography_dict,Geography_0,Geography_1
0,France,Female,2,-1.0,-1.0
1,Spain,Female,3,1.0,-1.0
2,France,Female,2,-1.0,-1.0
3,France,Female,2,-1.0,-1.0
4,Spain,Female,3,1.0,-1.0
...,...,...,...,...,...
9995,France,Male,2,-1.0,-1.0
9996,France,Male,2,-1.0,-1.0
9997,France,Female,2,-1.0,-1.0
9998,Germany,Male,1,0.0,2.0


*  In Helmert encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.
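
The values in the output above are consistent with a common form of the Helmert contrast matrix. The sketch below builds that matrix in plain Python; the function name helmert_contrasts is hypothetical, and the level order (France, Spain, Germany, by first appearance in the data) is an assumption read off the output, not a statement about how category_encoders is implemented.

```python
def helmert_contrasts(n_levels):
    """Build an n_levels x (n_levels - 1) Helmert-style contrast matrix."""
    rows = []
    for level in range(n_levels):
        row = []
        for col in range(n_levels - 1):
            if level <= col:
                row.append(-1.0)            # levels up to and including col
            elif level == col + 1:
                row.append(float(col + 1))  # the level being contrasted
            else:
                row.append(0.0)             # later levels are not involved
        rows.append(row)
    return rows


contrasts = helmert_contrasts(3)

# Assumed level order, taken from the first appearance of each value above.
by_level = dict(zip(["France", "Spain", "Germany"], contrasts))
```

Under these assumptions France maps to (-1, -1), Spain to (1, -1) and Germany to (0, 2), matching the Geography_0 and Geography_1 columns shown above. Each column of the matrix sums to zero.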

[For more details, see this article](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)