## What is the output of Capping (Floor & Ceiling)?


### Capping/Flooring Approach:


**Capping** replaces every value above a chosen upper control limit (UCL) with the UCL value. \
Statistical formula for UCL:
$$UCL = Q3 + 1.5 * IQR$$


**Flooring** replaces every value below a chosen lower control limit (LCL) with the LCL value. \
Statistical formula for LCL:
$$LCL = Q1 - 1.5 * IQR$$

Where $IQR = Q3-Q1$
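
The two formulas can also be applied in a single step. The following is a minimal sketch, assuming NumPy is available; the helper name `cap_and_floor`, its `k` parameter, and the small data set are hypothetical and only serve as illustration.

```python
import numpy as np


def cap_and_floor(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    lcl = q1 - k * iqr  # lower control limit
    ucl = q3 + k * iqr  # upper control limit
    # np.clip floors values below lcl and caps values above ucl
    return np.clip(values, lcl, ucl), lcl, ucl


data = [2.0, 4.0, 6.0, 8.0, 100.0]  # made-up data with one obvious outlier
clipped, lcl, ucl = cap_and_floor(data)
print("LCL =", lcl, "UCL =", ucl)
print(clipped)
```

Here Q1 = 4 and Q3 = 8, so IQR = 4, LCL = -2 and UCL = 14; only the value 100 is changed (to 14).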



### Example:


    import numpy as np

    # Use an array so that the comparisons below are element-wise
    sample = np.array([15.0, 101.0, 18.0, 7.0, 13.0, 16.0, 11.0, 21.0, 5.0, 15.0, 10.0, 9.0])

    Q1 = np.quantile(sample, 0.25)
    Q3 = np.quantile(sample, 0.75)
    IQR = Q3 - Q1
    UCL = round(Q3 + 1.5 * IQR, 1)
    print("UCL = ", UCL)
    LCL = round(Q1 - 1.5 * IQR, 1)
    print("LCL = ", LCL)

    # Floor values below the LCL, then cap values above the UCL
    new_data = np.where(sample < LCL, LCL, sample)
    new_data = np.where(new_data > UCL, UCL, new_data)

    print("Before:", *np.sort(sample), sep="|")
    print("After:", *np.sort(new_data), sep="|")

Output:

    UCL =  26.6
    LCL =  -0.4
    Before:|5.0|7.0|9.0|10.0|11.0|13.0|15.0|15.0|16.0|18.0|21.0|101.0
    After:|5.0|7.0|9.0|10.0|11.0|13.0|15.0|15.0|16.0|18.0|21.0|26.6



<!---
--------------------------------------------------------------------------------------------------------------
-->


## Encoding Types:

> ## $\color{blue}{Label\ Encoding:}$


#### Method:
Each category is assigned a value from 1 through N (where N is the number of categories for the feature).
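
As a rough illustration of this mapping, here is a minimal sketch in plain Python. The city names are made-up sample data, and assigning the integers in order of first appearance is an assumption, since any consistent assignment of 1 through N would do.

```python
# Hypothetical sample data for illustration
cities = ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]

# Map each distinct category to an integer 1..N, in order of first appearance
label_map = {}
for city in cities:
    if city not in label_map:
        label_map[city] = len(label_map) + 1

encoded = [label_map[city] for city in cities]
print(label_map)  # {'Delhi': 1, 'Mumbai': 2, 'Chennai': 3}
print(encoded)    # [1, 2, 1, 3, 2]
```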

#### Example:

<img src="https://miro.medium.com/max/1400/1*3woaQcawwYqzAwjCxv_sqg.webp" width="600" height="600" />


#### Assumptions:
- The categorical feature is ordinal, or imposing an order on it is not a problem.
- The number of categories is quite large.
 
#### Not Appropriate Cases:
The categorical feature is nominal and no order should be implied.

---


<!---
--------------------------------------------------------------------------------------------------------------
-->

> ## $\color{blue}{Ordinal\ Encoding:}$


#### Method:
Ordinal encoding converts each label to an integer so that the encoded values preserve the natural order of the categories.

#### Example:
If temperature is the ordering criterion, the categories run from "Cold" to "Very Hot". \
Ordinal encoding assigns the values **Cold (1) < Warm (2) < Hot (3) < Very Hot (4)**. \
Ordinal encoding usually starts from 1.
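
The temperature example can be sketched in plain Python as follows. The order is passed in explicitly because it comes from domain knowledge rather than from the data; the `readings` list is hypothetical.

```python
# The category order is supplied by hand, lowest to highest
temperature_order = ["Cold", "Warm", "Hot", "Very Hot"]

# Rank each label starting from 1
ordinal_map = {label: rank for rank, label in enumerate(temperature_order, start=1)}

readings = ["Hot", "Cold", "Very Hot", "Warm", "Hot"]  # hypothetical observations
encoded = [ordinal_map[r] for r in readings]
print(encoded)  # [3, 1, 4, 2, 3]
```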

<img src="https://miro.medium.com/max/1068/1*V4azn28p16AVCzb3xoKy4Q.webp" width="400" height="400" />

#### Assumptions:
Only for ordinal variables.

#### Not Appropriate Cases:
The categorical feature is nominal.

---


<!---
--------------------------------------------------------------------------------------------------------------
-->

> ## $\color{blue}{Frequency\ Encoding:}$


#### Method:
Frequency encoding uses the frequency of each category as its label.

Frequency Encoding steps:
- Select a categorical variable we would like to transform.
- Group by the categorical variable and obtain the count of each category.
- Add the frequency column to the dataset.
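
The steps above can be sketched with the standard library's `collections.Counter`. The colour values are made-up sample data, and raw counts are used as the frequency (a relative frequency would work the same way).

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "red", "blue"]  # hypothetical column

# Count the occurrences of each category
counts = Counter(colors)

# Replace each category with its count (the new frequency column)
freq_encoded = [counts[c] for c in colors]
print(freq_encoded)  # [3, 2, 3, 1, 3, 2]
```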

#### Example:
<img src="https://miro.medium.com/max/1330/1*Qgvyyag884CiwYTOoQkRKQ.webp" width="400" height="400" />

#### Assumptions:
Categorical feature.

#### Not Appropriate Cases:
When two different categories have the same number of observations, they are replaced by the same number, so the encoding can no longer distinguish them and valuable information may be lost.

---


<!---
--------------------------------------------------------------------------------------------------------------
-->

> ## $\color{blue}{Binary\ Encoding:}$


#### Method:
Binary encoding converts a category into binary digits. Each binary digit creates one feature column. \
Binary encoding steps:
- Categories are first converted to integers starting from 1 (the integers follow the order in which categories appear in the dataset and do not imply any ordinal relationship).
- Then those integers are converted into binary code, so for example, 3 becomes 011, 4 becomes 100.
- Then the digits of the binary number form separate columns.
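
A minimal plain-Python sketch of these three steps is shown below. The sample values are hypothetical, and padding every code to the bit width of the largest code is an assumption about how the columns are aligned.

```python
values = ["red", "green", "blue", "green", "yellow", "red"]  # hypothetical column

# Step 1: integer codes starting from 1, in order of first appearance
codes = {}
for v in values:
    codes.setdefault(v, len(codes) + 1)

# Step 2: number of binary digits needed for the largest code
n_bits = max(codes.values()).bit_length()

# Step 3: one column per binary digit, most significant digit first
rows = [[int(bit) for bit in format(codes[v], f"0{n_bits}b")] for v in values]

for v, row in zip(values, rows):
    print(v, codes[v], row)
```

With four categories, three columns are enough: code 3 ("blue") becomes `[0, 1, 1]` and code 4 ("yellow") becomes `[1, 0, 0]`.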

#### Example:
<img src="https://miro.medium.com/max/1400/1*VuNZWUX6b7GUGB0zRu2zrA.webp" width="600" height="600" />
<img src="https://miro.medium.com/max/1400/1*isrU_Uq2ScgQk6Y2fuomyA.webp" width="500" height="500" />

#### Assumptions:
Categorical feature.

#### Not Appropriate Cases:
- When we need faster training time and have limited memory space.


---


<!---
--------------------------------------------------------------------------------------------------------------
-->

> ## $\color{blue}{One-hot\ Encoding:}$

#### Method:
One-hot encoding maps each category to a vector of 1s and 0s, where 1 denotes the presence of the category and 0 its absence. The number of vectors equals the number of categories in the feature.
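
A minimal plain-Python sketch of this mapping follows. The sample values are hypothetical, and sorting the categories alphabetically to fix the column order is an arbitrary choice.

```python
values = ["red", "green", "blue", "green"]  # hypothetical column

# One output column per distinct category (alphabetical order, chosen arbitrarily)
categories = sorted(set(values))

# 1 marks the category that is present in a row, 0 marks all others
one_hot = [[1 if v == c else 0 for c in categories] for v in values]

print(categories)  # ['blue', 'green', 'red']
for v, row in zip(values, one_hot):
    print(v, row)
```

Each row contains exactly one 1, and the number of columns grows with the number of categories.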

#### Example:

<img src="https://miro.medium.com/max/1400/1*Pdl1YnkC4KrArH8_V9G7tQ.webp" width="600" height="600" />

#### Assumptions:
- The categorical feature is not ordinal.
- The number of categories is small, so one-hot encoding can be applied effectively.

#### Not Appropriate Cases:
- The categorical feature is ordinal.
- The number of categories is large.
- When we need faster training time and have limited memory space.

---


<!---
--------------------------------------------------------------------------------------------------------------
-->


> ## $\color{blue}{Target\ Mean\ Encoding:}$


#### Method:
Target mean encoding is similar to label encoding, except that the labels are tied directly to the target: each category in the feature is replaced by the mean value of the target variable for that category, computed on the training data.

Target Mean Encoding steps:
- Select a categorical variable we would like to transform.
- Group by the categorical variable and obtain the aggregated sum of the "Target" variable.
- Group by the categorical variable and obtain the aggregated count of the "Target" variable.
- Divide the sum by the count for each category and join the result back to the training set.
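
The steps above can be sketched in plain Python as follows. The `train` pairs are made-up data with a binary target, and the variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical training data: (category, target) pairs
train = [("A", 1), ("B", 0), ("A", 0), ("C", 1), ("B", 0), ("A", 1)]

# Aggregate the sum and the count of the target per category
sums = defaultdict(float)
counts = defaultdict(int)
for category, target in train:
    sums[category] += target
    counts[category] += 1

# Mean target per category = sum / count
means = {category: sums[category] / counts[category] for category in sums}

# Join the means back onto the training rows
encoded = [means[category] for category, _ in train]
print(means)
print(encoded)
```

Category "C" appears only once here, so its encoded value is simply its own target value, which is the kind of sparse class where this encoding can overfit.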

#### Example:
<img src="https://miro.medium.com/max/1400/1*iiM9g-qCa-Vff_HAFk-ppQ.webp" width="900" height="900" />
<img src="https://miro.medium.com/max/1400/1*b4VBM6uSdQvfqgLJSCzlbQ.webp" width="600" height="600" />


#### Assumptions:
Categorical feature.


#### Not Appropriate Cases:
When the data has sparse classes (categories with very few observations), target mean encoding can quietly cause the model to overfit.

---


### References
> K2 Analytics, Outlier Treatment in Python and R, [Link](https://www.k2analytics.co.in/outlier-treatment-in-python-and-r/).

> Baijayanta Roy, All about Categorical Variable Encoding, [Link](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02).
