### **One Hot Encoding**

This technique is applied to nominal categorical features.

In one-hot encoding, each category value becomes a new column, and each row holds 1 in the column that matches its category and 0 in the others.

We do this with the pandas `get_dummies()` function and then `drop the first column` to avoid the dummy variable trap: with all columns kept, any one of them is fully determined by the others, so it carries no extra information.
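To make the effect of dropping the first column concrete, here is a minimal sketch on a hypothetical three-category toy column (the `toy` dataframe and its values are made up for illustration; `dtype=int` is used only so the output shows 1/0 instead of True/False).

```python
import pandas as pd

# Hypothetical toy column with three nominal categories.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# Full one-hot encoding: one column per category (C, Q, S).
full = pd.get_dummies(toy["Embarked"], dtype=int)

# drop_first=True removes the first category column ("C").
# A row with 0 in every remaining column then stands for "C".
reduced = pd.get_dummies(toy["Embarked"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `full`, every row sums to 1, which is exactly the redundancy that dropping one column removes.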

##### Advantages :

· Simple to use and fits well for `data with few categories`.

##### Disadvantages:

· A high-cardinality feature (one with many categories) greatly enlarges the feature space, which leads to the curse of dimensionality.

#### Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Reading Data

In [2]:
df = pd.read_csv("./data/titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
df.Sex.unique()

array(['male', 'female'], dtype=object)

`Sex` has only two categories, male and female, so one-hot encoding is a good fit.

In [4]:
Sex_converted_variable= pd.get_dummies(df.Sex, drop_first=True)
Sex_converted_variable.head()

Unnamed: 0,male
0,True
1,False
2,False
3,False
4,True


After converting the nominal categorical feature into a True/False (1 or 0) indicator column, we concatenate it to the original dataframe.

In [5]:
df = pd.concat([df, Sex_converted_variable], axis=1)

In [6]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,male
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,True


After concatenation, we drop the original nominal categorical column.

In [7]:
df.drop("Sex", axis=1, inplace=True)

In [8]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,False
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,False
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,True
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,True
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,False
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,S,False
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,True


# ---------------------------------------------------------------------

### One Hot Encoding with Multiple Categories

This technique was picked up from an ensemble selection solution to the KDD Cup Orange competition. The authors made a slight modification to one-hot encoding: instead of creating a new column for every category, they create new columns only for the 10 most frequent categories.

#### Advantages:

· Easy to implement

· Does not expand massively the feature space

##### Disadvantages :

· Does not keep any information about the categories that are left out (everything outside the top 10 becomes all zeros).

#### Example

> list_top_10 = dataset["column_name"].value_counts().sort_values(ascending=False).head(10).index

· `value_counts()` - counts the number of times each unique category appears in the selected column

· `sort_values(ascending=False)` - sorts these counts in descending order

· `head(10)` - takes only the top 10 sorted categories

`This method gives the 10 most frequently appearing categories`
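Once the top 10 categories are known, the encoding step itself can be sketched as follows. This is a minimal illustration on made-up data, not the competition authors' code: the `dataset` contents and the `column_name_<category>` column naming are assumptions.

```python
import pandas as pd

# Hypothetical data: 12 distinct categories, cat_0 the most frequent, cat_11 the rarest.
values = [f"cat_{i}" for i in range(12) for _ in range(12 - i)]
dataset = pd.DataFrame({"column_name": values})

# value_counts() already returns counts in descending order,
# so head(10) gives the 10 most frequent categories.
list_top_10 = dataset["column_name"].value_counts().head(10).index

# One indicator column per frequent category; rows whose category
# is outside the top 10 end up with 0 in every new column.
for category in list_top_10:
    dataset[f"column_name_{category}"] = (dataset["column_name"] == category).astype(int)

encoded = dataset.drop(columns="column_name")
print(encoded.shape)
```

Here the two rarest categories (`cat_10` and `cat_11`) get no column of their own, which illustrates the disadvantage listed above.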

# ---------------------------------------------------------------------

### **Ordinal Number Encoding**

As the name implies, this technique is used for ordinal categorical features.

In this technique, each unique category value is given an integer value. For instance, “red” equals 1, “green” equals 2 and “blue” equals 3.

Domain knowledge can be used to determine the order of the integer values. For example, most people love Saturdays and Sundays and hate Mondays. In this scenario the mapping for the days of the week is ‘Monday’ is 1, ‘Tuesday’ is 2, ‘Wednesday’ is 3, ‘Thursday’ is 4, ‘Friday’ is 5, ‘Saturday’ is 6, ‘Sunday’ is 7.
So Sunday is the best day, followed by Saturday and so on, with Monday being the worst day since it has the lowest integer value.


In [9]:
data = {'Temperature':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold']}
dataset = pd.DataFrame(data,columns=['Temperature'])
dataset.head()

Unnamed: 0,Temperature
0,Hot
1,Cold
2,Very Hot
3,Warm
4,Hot


In [10]:
mapping_dictionary_value={'Cold':1,'Warm':2,'Hot':3,'Very Hot':4}
dataset['Temperature_Ordinal']=dataset.Temperature.map(mapping_dictionary_value)
dataset

Unnamed: 0,Temperature,Temperature_Ordinal
0,Hot,3
1,Cold,1
2,Very Hot,4
3,Warm,2
4,Hot,3
5,Warm,2
6,Warm,2
7,Hot,3
8,Hot,3
9,Cold,1


In [11]:
dataset.sort_values("Temperature_Ordinal", ascending=False)


Unnamed: 0,Temperature,Temperature_Ordinal
2,Very Hot,4
0,Hot,3
8,Hot,3
4,Hot,3
7,Hot,3
5,Warm,2
6,Warm,2
3,Warm,2
1,Cold,1
9,Cold,1


# ---------------------------------------------------------------------

### **Count or Frequency Encoding**

As the name implies, this technique replaces each category with the number of observations in the dataset that belong to that category.

For example, if India appears 56 times in the country column and America appears 49 times, we replace India with 56 and America with 49 in the country column.

#### Advantages:

· Easy to implement

· There will be no increase in feature space.

· Works well with tree-based algorithms.

#### Disadvantages:

· Categories with the same frequency receive the same encoded value, so the model can no longer tell them apart.
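A minimal sketch of count and frequency encoding with pandas, using a small hypothetical `countries` dataframe (the data and the new column names are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical data: India x3, America x2, Japan x2.
countries = pd.DataFrame(
    {"country": ["India"] * 3 + ["America"] * 2 + ["Japan"] * 2}
)

# Count of observations per category.
counts = countries["country"].value_counts()

# Count encoding: replace each category with its count.
countries["country_count"] = countries["country"].map(counts)

# Frequency encoding: replace each category with its share of the rows.
countries["country_freq"] = countries["country"].map(counts / len(countries))

print(countries)
```

In this toy data America and Japan both appear twice, so they receive the same encoded value and become indistinguishable after encoding.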

# ---------------------------------------------------------------------

### **Target-Guided Ordinal Encoding**

Instead of assigning arbitrary integers (like Label/Ordinal Encoding), this method assigns numbers based on the average value of `a target variable` for each category.

> Suppose you have a `categorical feature City` and a `target variable Salary`
1. Compute the mean of the target for each category (Avg_Salary per City)
2. Rank the categories by Avg_Salary
3. Replace each category with its corresponding rank
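The three steps above can be sketched as follows. The city names and salary figures are made up, and ranking in ascending order (lowest mean gets rank 1) is an assumption; the opposite direction works equally well as long as it is applied consistently.

```python
import pandas as pd

# Hypothetical data: categorical feature City, numeric target Salary.
salaries = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "Salary": [50, 90, 30, 70, 110, 40],
})

# Step 1: mean target per category.
avg_salary = salaries.groupby("City")["Salary"].mean()

# Step 2: rank categories by mean target (lowest mean gets rank 1).
ordered_cities = avg_salary.sort_values().index
rank_map = {city: rank for rank, city in enumerate(ordered_cities, start=1)}

# Step 3: replace each category with its rank.
salaries["City_encoded"] = salaries["City"].map(rank_map)

print(rank_map)
print(salaries)
```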

# ---------------------------------------------------------------------

### **Mean Ordinal Encoding**

It’s a slight variant of target-guided ordinal encoding. We replace the category with the `obtained mean value` `instead of assigning integer values to it.`
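A minimal sketch of mean encoding, assuming a hypothetical `City` feature and `Salary` target (all values are made up). In practice the means would normally be computed on the training split only, so that the target does not leak into validation data.

```python
import pandas as pd

# Hypothetical data: categorical feature City, numeric target Salary.
train = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "Salary": [50, 90, 30, 70, 110, 40],
})

# Mean of the target for each category.
mean_map = train.groupby("City")["Salary"].mean()

# Replace each category with the mean itself rather than with a rank.
train["City_mean"] = train["City"].map(mean_map)

print(train)
```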

# ---------------------------------------------------------------------

### **Probability Ratio Encoding**

This technique is suitable only for classification problems where the `target variable is binary (1 or 0, True or False).`

In this technique, we will `substitute the category value` with the `probability ratio i.e. P(1)/P(0).`

1. Count number of 1s and 0s for each category
2. Compute the probability ratio P(1)/P(0) for each category. To avoid division by zero, smoothing is usually applied (e.g., add 1 to all counts).
3. Replace categories with ratios
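The steps above can be sketched as follows, using add-one smoothing on the counts so the ratio becomes (n1 + 1) / (n0 + 1). The `Cabin` and `Survived` toy values and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: categorical feature Cabin, binary target Survived.
passengers = pd.DataFrame({
    "Cabin": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 1, 1],
})

# Step 1: count 0s and 1s per category (reindex guarantees both columns exist).
counts = pd.crosstab(passengers["Cabin"], passengers["Survived"])
counts = counts.reindex(columns=[0, 1], fill_value=0)

# Step 2: smoothed probability ratio P(1)/P(0).
# Adding 1 to every count avoids division by zero for categories with no 0s.
ratio = (counts[1] + 1) / (counts[0] + 1)

# Step 3: replace each category with its ratio.
passengers["Cabin_ratio"] = passengers["Cabin"].map(ratio)

print(ratio)
```

Without smoothing, category C (which has no 0s) would produce a division by zero.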

# ---------------------------------------------------------------------