# Pre-Processing Tutorial: Level 2  
**So this is a tutorial I presented as an instructor at GDSC Enet'Com, Tunisia.**  
**I considered calling it "Advanced pre-processing tutorial", but then again, whenever I see such a title on a Kaggle notebook it ends up being not as "advanced" as it claims. I didn't want to fall into a category that I myself criticize xD**  
**Plus, almost no matter what you do in this field, there's something much more advanced.**

# Content
### Exploration 
### Univariate Outlier Detection
### Multivariate Outlier Detection
### Frequency/Count Encoding
### Weight of Evidence Encoding
### Different Scaling Methods
### MICE imputation

The dataset that we'll be using here is about forests. The purpose, from what I've understood, is to predict the cover type of a forest. However, the purpose of this tutorial is just to demonstrate different processing methods, so we won't be doing any predictions.

# Exploring the Dataset (feel free to skip)

Can't do pre-processing without understanding the data first so gotta do this.  
Feel free to skip if uninterested.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
style.available

In [None]:
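style.use("seaborn-v0_8-dark" if "seaborn-v0_8-dark" in style.available else "seaborn-dark")  # the style was renamed in matplotlib 3.6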

In [None]:
data = pd.read_csv("../input/forest-cover-type-prediction/train.csv")

In [None]:
data.shape

In [None]:
data.head()

The id is useless here so we drop it.

In [None]:
data.drop("Id", axis=1, inplace=True)

In [None]:
x = data.drop("Cover_Type", axis=1)
y = data["Cover_Type"]

In [None]:
x.head(10)

In [None]:
x.columns

In [None]:
print('Soil_Type37'[:9])
print('Wilderness_Area3'[:15])

Separating feature names into two lists: one for the categoricals and one for the numericals.

In [None]:
categoricals = []
numericals = []
for col in x.columns:
    if col[:9]=="Soil_Type" or col[:15]=='Wilderness_Area':
        categoricals.append(col)
    else:
        numericals.append(col)

In [None]:
numericals

In [None]:
fig, axes = plt.subplots(nrows=len(numericals), ncols=1, figsize=(20, 9*len(numericals)))
for i in range(len(numericals)):
    col = numericals[i]
    sns.histplot(x=col, data=x, ax=axes[i], color="blueviolet", kde=True)
plt.show()

### Notes:
* Some are gaussian/normal, some are skewed.  
* There are outliers that we can trim/remove.

In [None]:
fig, axes = plt.subplots(nrows=len(categoricals), ncols=1, figsize=(20,9*len(categoricals)))
for i in range(len(categoricals)):
    col = categoricals[i]
    sns.countplot(x=col, data=x, ax=axes[i], palette="Set2")
    axes[i].set_yticks(x[col].value_counts())
plt.show()

### Notes:
* Many have low variation: most observations are in the "0" category.  
* Some have almost no variation: about 40 or fewer instances (out of ~15000) in the "1" category.
* Some have no variation whatsoever (soil types 7 & 15). These will be removed since they don't differentiate between instances/forests.

In [None]:
y.value_counts()

In [None]:
plt.figure(figsize=(20,9))
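sns.countplot(x=y)  # pass the series by keyword; recent seaborn versions treat the first positional argument as `data`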
plt.show()

* Y is balanced so no need to worry about imbalance in the target feature.

**Soil types 7 & 15 are not to be considered since they show no variance. All values are the same**

In [None]:
x.drop(["Soil_Type7", "Soil_Type15"], axis=1, inplace=True)
categoricals.remove("Soil_Type7")
categoricals.remove("Soil_Type15")

# Univariate Outlier Detection

**Outliers are observations/datapoints that are very different from the majority**  
**They can often negatively affect the performance of some models/algorithms**  
**So we usually just get rid of them**

**Univariate means we will try to detect outliers by looking at each feature by itself.**  
**For every feature, any point whose value is very different from the rest will be considered an outlier**

## Numerical Features

**Every point that is situated far from the others in a histogram is considered an outlier and removed**  
**Since we wouldn't do this manually, we can automate the process by calculating z-scores**  
**The z-score of an observation/datapoint with regards to a certain continuous feature, characterizes how far it is from the mean/average value**  
**The z-score is "how many standard deviations away from the mean is this value?"**  
**Or, more simply, "How far away from the mean is this value?", the unit being 1 standard deviation**  
**So if the z-score of a particular value with regards to a particular feature is 2.3, we would say that the difference between that value and the mean value is equal to 2.3 times the standard deviation of the feature**  
**To get the z-score, we simply subtract the mean from the value of the observation, then divide it by the standard deviation**  
![](https://toptipbio.com/wp-content/uploads/2020/02/Z-score-formula.jpg)

**If the zscore is negative then the value x is less than the mean, and if the zscore is positive then it's higher than the mean**  
**Usually, if the zscore (in absolute value) is above 3, the point/value is considered an outlier and removed**  
**You can choose other values though. Higher values mean you will tolerate more outliers and only remove the most extreme ones, whereas lower values will remove more points and only keep points that are close to the mean.**
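
To make the formula concrete, here is a minimal plain-Python sketch on a made-up list of values. The numbers and the threshold of 2 are illustrative assumptions, not taken from the forest dataset. It uses the population standard deviation, which is also what `scipy.stats.zscore` uses by default (`ddof=0`).

```python
import statistics

# Hypothetical toy feature: six ordinary values and one extreme value.
values = [10, 12, 11, 13, 12, 11, 40]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation (divides by n)

# z-score: subtract the mean, then divide by the standard deviation.
z_scores = [(v - mean) / std for v in values]

for v, z in zip(values, z_scores):
    print(f"value={v:>3}  z={z:+.2f}")

# Toy threshold: with only 7 points, |z| can never exceed sqrt(6) ~ 2.45,
# so the usual threshold of 3 would flag nothing here.
threshold = 2.0
outliers = [v for v, z in zip(values, z_scores) if abs(z) > threshold]
print("flagged as outliers:", outliers)
```

On a real dataset with thousands of rows the usual threshold of 3 is a more sensible starting point.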

In [None]:
from scipy.stats import zscore

In [None]:
uni_out = x.copy(deep=True)

In [None]:
zs = zscore(uni_out[numericals])

The following table contains the zscores of every point for every feature.  
Any row that contains a high absolute value (which *usually* means any value bigger than 3) will be removed.

In [None]:
zs

Make sure you use absolute values

In [None]:
scores = np.abs(zs)

Here, I used the usual threshold of 3.  
The following code creates a variable that indicates which indices ***do not*** correspond to outliers

In [None]:
non_outlier_indices = (scores<3).all(axis=1)
print(non_outlier_indices)

Notice that we're left with 13990 rows instead of 15120.  
If we used 1 instead of 3 we would be left with only around 600 rows.  
If we used 5 we would remove almost nothing. Only the most extreme ones.

In [None]:
uni_out[non_outlier_indices]

## Categorical Features ??

**Are there categorical outliers????**  
**Well, if any point that is different from the majority is an outlier, then any categorical value with a low number of points could be considered an anomaly, and any point corresponding to it would be an outlier**  
**If that feature is binary, then removing one categorical value will only leave one, and thus that feature will be uninformative and get removed**  
**So basically, any binary categorical feature that is so imbalanced that it has only a very small number of observations with a value of 1 (or 0) will be removed**  
#### Note:
**This isn't always a good idea, it depends on what you mean by "a very small number of observations"**  
**Here, I considered anything fewer than 100 to be a very small number of observations**

In [None]:
uni_out = x.copy(deep=True)

In [None]:
x["Soil_Type13"].value_counts()

In [None]:
probably_useless_features = []
for col in categoricals:
    if x[col].value_counts()[0] < 100 or x[col].value_counts()[1] < 100:
        probably_useless_features.append(col)

In [None]:
probably_useless_features

In [None]:
uni_out.drop(probably_useless_features, axis=1, inplace=True)

In [None]:
uni_out

**But how about categorical features that aren't binary?**  
**Well we can remove categorical values with "a very small number of observations" and leave the rest as it is**  
**Since all categorical features in our dataset are binary, I will create an artificial feature**

In [None]:
feat = []
for i in range(15120):
    r = np.random.rand()
    if r < 0.25:
        feat.append("value1")
    elif r < 0.53:
        feat.append("value2")
    elif r < 0.85:
        feat.append("value3")
    elif r < 0.99:
        feat.append("value4")
    else:
        feat.append("value5")
feat = pd.Series(feat)

In [None]:
sns.countplot(x=feat)  # keyword argument, so the series is not interpreted as `data`

**Observations that have a value equal to "value5" could be considered as outliers**  
**You're throwing away data, so think carefully before you do this**  
**If these observations are significant despite their small number you might want to keep them**

In [None]:
feat.shape

In [None]:
feat

We get which indices don't correspond to "value5"

In [None]:
feat!="value5"

In [None]:
feat = feat[feat!="value5"]

In [None]:
sns.countplot(x=feat)  # keyword argument, so the series is not interpreted as `data`

In [None]:
feat.shape

# Multivariate Outlier Detection with Isolation Forest

**Some outliers can not be detected by looking at histograms**  
**In the following image we have an example.**  
**The red point is clearly an outlier**  
**If we plot the histogram of the x-axis variable/feature or the y-axis variable/feature, it would be in the middle, not somewhere extreme**

![](https://www.intechopen.com/media/chapter/47833/media/image2.jpeg)

**There are many ways to detect these kinds of outliers, and one of them is the Isolation Forest**  


The Isolation Forest algorithm chooses a feature/variable randomly, then makes a split at a random value.  
This separates the space/points into two parts.  

![](https://storage.googleapis.com/kagglesdsdata/datasets/1766233/2883347/image3.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20211208%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20211208T105422Z&X-Goog-Expires=345599&X-Goog-SignedHeaders=host&X-Goog-Signature=9c04436e5d15a77f3baed0ccdb2cdebfd525595c9e53a67a6eee6ee9a84abd3485bd77555f3f56db98934fd045b0e3cdbea1062646064771b118cbdf57393e78325d4dae8e41c7e8635a7996720c5d946e110c2aede0156aaab896ade5b4c37a876caf9d469d97e6c7ab111e1267ffea2cb83fcba59750cb09d9211570f10fe7cbf0b25efde88a39c1ca0cce141116f7343769edad8bee906824938a223f49ed7d91ff1216b2010fb5e6acf5710cb2019e7b5255fb69aa062ed299fd07a810624427d246d7cc62389837565f30169fe67917739375b98198e2b72d9c026df963fc73c0fc51102f27d67b43611b35c0fb4d1a03264125df14f8ed80c901d92aa7)

Then it makes another random split.  
Notice that after the second split, the outlier is now isolated.  
It would take, on average, many more splits to isolate points that are not outliers.

![](https://storage.googleapis.com/kagglesdsdata/datasets/1766233/2883347/image4.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20211208%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20211208T105400Z&X-Goog-Expires=345599&X-Goog-SignedHeaders=host&X-Goog-Signature=5cb2167fba1c4c1154a7dda81eb382b168a39b47fe471279aa662a5e9cf027f5e751d0948cc853af48ec963c25244f4b5c0d05c49aa745e09f3de0cfd00db78e3a7ffa84bb75352ec370d1bccc98e07865407043775c06c1c01126a213a83382f8bf934affb5b1ce56bfc96faa1d4812b1185a685b786ddc94eefdccc4a78c7a44df9742c279da61dacf7739c52c71356395333dc703c6cdfb8ff1ee44a492f0b4e9b0c298c82507733950f91246d64c60de5711f41db3d1665b816b93cee96b82cdb86190d91165e2914a05432ee1e59bc599fad5d721827ca3ee0f04659d708966e79375257e0a479a10575d778eeb3c4b7b9068f6d10fb62d079c3e9b84ae)

The algorithm keeps making splits until each point is isolated (or until some early-stopping criterion is met).  
This builds an isolation tree.  
Then another tree is built, and another, and so on.  
Then an average score across all trees is calculated for every point: how many splits do we need, on average, to isolate that point?  
Outliers will have a low score since they're often isolated in just a few splits.  
Note that the isolation forest doesn't automatically remove outliers but rather ranks points/observations by how likely they are to be outliers.  
You can set a percentage that you would like to remove. For example you might want to remove the 5% most extreme points/observations in your data.  
In the sklearn version of the isolation forest, you set the "contamination" parameter to that proportion (for example, 0.05 for 5%).
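
The splitting idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not sklearn's implementation: the helper `isolation_depth`, the toy 2D points and the number of trees are all made up for the example, and the real algorithm builds its trees on random subsamples and converts the path lengths into an anomaly score.

```python
import random

def isolation_depth(target, points, rng, max_depth=50):
    """Number of random splits until `target` is alone in its partition."""
    current = points
    depth = 0
    n_features = len(target)
    while len(current) > 1 and depth < max_depth:
        # Only features that still vary inside the partition can be split.
        splittable = [f for f in range(n_features)
                      if min(p[f] for p in current) < max(p[f] for p in current)]
        if not splittable:
            break  # remaining points are identical, nothing left to split
        f = rng.choice(splittable)
        lo = min(p[f] for p in current)
        hi = max(p[f] for p in current)
        split = rng.uniform(lo, hi)  # random split value on a random feature
        # Keep only the side of the split that contains the target point.
        if target[f] < split:
            current = [p for p in current if p[f] < split]
        else:
            current = [p for p in current if p[f] >= split]
        depth += 1
    return depth

rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
inlier = (0.0, 0.0)    # hypothetical point in the middle of the cluster
outlier = (8.0, 8.0)   # hypothetical point far away from the cluster
points = cluster + [inlier, outlier]

n_trees = 100
avg_inlier = sum(isolation_depth(inlier, points, rng) for _ in range(n_trees)) / n_trees
avg_outlier = sum(isolation_depth(outlier, points, rng) for _ in range(n_trees)) / n_trees
print(f"average splits to isolate the inlier:  {avg_inlier:.1f}")
print(f"average splits to isolate the outlier: {avg_outlier:.1f}")
```

The far-away point should need far fewer splits on average than the point in the middle of the cluster, which is exactly the signal the isolation forest ranks observations by.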

In [None]:
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05) # I want to remove the 5% most extreme rows/datapoints

In [None]:
iso.fit(x)

In [None]:
outlier_indices = iso.predict(x)

-1 values indicate outliers

In [None]:
pd.Series(outlier_indices).value_counts()

In [None]:
outlier_indices == 1

In [None]:
x[outlier_indices == 1]

As you can see, we're left with 14364 rows; 5% were removed.

# Frequency/Count Encoding

Encoding means replacing categorical values with numerical ones.

Count encoding consists of simply replacing a categorical value with how many times it was observed. It literally counts it.  
Frequency encoding is the same thing, it just divides the count by the total number of rows to get a proportion/probability.  
So we just replace every categorical value with how common it is.
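
Before reaching for a library, here is what both encodings compute, as a minimal plain-Python sketch on a made-up list of colors (the data is an assumption for illustration only).

```python
from collections import Counter

# Hypothetical categorical feature.
colors = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(colors)  # how many times each category was observed

# Count encoding: replace each value with its number of occurrences.
count_encoded = [counts[c] for c in colors]

# Frequency encoding: same thing, divided by the total number of rows.
freq_encoded = [counts[c] / len(colors) for c in colors]

print(count_encoded)
print(freq_encoded)
```

With pandas, the same result can be obtained with `s.map(s.value_counts())` and `s.map(s.value_counts(normalize=True))` on a Series `s`.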

In [None]:
!pip install category_encoders

In [None]:
from category_encoders import CountEncoder

In [None]:
feat

In [None]:
enc = CountEncoder()
feat_enc = enc.fit_transform(feat)
print(feat_enc)

In [None]:
enc = CountEncoder(normalize=True)
feat_enc = enc.fit_transform(feat)
print(feat_enc)

# Weight Of Evidence Encoding

I'll be honest, I don't want to explain this. Maybe another time, but not right now.  
However, [This article](https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html) explains it really nicely, so make sure you check it out.

![](https://miro.medium.com/max/768/1*6Aw782wiyiFtzvK7EOY8CA.png)
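
For readers who want to see the arithmetic, here is a rough plain-Python sketch on made-up data. It assumes one common convention, WoE = ln(share of events / share of non-events) per category, where an "event" is a row with target 1. Some references flip the ratio, which only changes the sign, and library implementations typically add a smoothing term so that a category with zero events or zero non-events does not break the logarithm. This sketch does no smoothing.

```python
import math
from collections import Counter

# Hypothetical toy data: one categorical feature and a binary target.
category = ["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"]
target   = [ 1,   1,   0,   1,   0,   0,   0,   1,   1,   0 ]

total_events = sum(target)
total_non_events = len(target) - total_events

# Per-category counts of events (target == 1) and non-events (target == 0).
events = Counter(c for c, t in zip(category, target) if t == 1)
non_events = Counter(c for c, t in zip(category, target) if t == 0)

woe = {}
for c in sorted(set(category)):
    share_events = events[c] / total_events              # share of all events in this category
    share_non_events = non_events[c] / total_non_events  # share of all non-events in this category
    woe[c] = math.log(share_events / share_non_events)

for c, w in woe.items():
    print(f"{c}: WoE = {w:+.3f}")

# Encoding then means replacing each category with its WoE value.
encoded = [woe[c] for c in category]
```

Under this convention, a positive value means the category is over-represented among the events, and a negative value means it is over-represented among the non-events.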

In [None]:
v = pd.DataFrame()
v["Soil_Type10"] = x["Soil_Type10"].map({0:"no", 1:"yes"})
v["Other_feature"] = feat
t = x["Wilderness_Area3"]

In [None]:
v

In [None]:
t

In [None]:
from category_encoders import WOEEncoder

In [None]:
enc = WOEEncoder()

In [None]:
enc.fit(v, t)

In [None]:
enc.transform(v)

# Scaling Methods

Consider the following: Your dataset contains 2 features. One feature has values ranging from 0 to 1, whereas the other has values ranging from 5000 to 9000.  
Now say you want to measure the "similarity" between two data points. This is usually done using Euclidean distance.  
The following is the formula for the Euclidean distance between two points in 2 dimensions/features:

![](https://cdn.kastatic.org/googleusercontent/UPUY_dSWBpH3LM_ujmZAHhiFQdArEwklCUA-wOFSqBRo1Y4SFtnD5io397_Iw3YREocm_EkDPEUgKU3sDIMnZdU)

Say the first feature is represented in green (the x feature) and the second is in orange (the y feature).  
(x2-x1) will be a value in [-1 , 1] (because feature x has values between 0 and 1), and (y2-y1) will be a value in [-4000 , 4000] (because feature y has values between 5000 and 9000).  
The value of the distance will be almost equal to sqrt( (y2 - y1)² ), and the x term will have no effect.  
So basically the first feature will be ignored by default.  
We don't want that since it could be an important feature.  
The solution is to make all features have similar ranges, for example from 0 to 1.  
This is called feature scaling.  
However, if you're not going to use a model/method that depends on distances, you don't need scaling.  
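
A quick numeric check of this argument, as a minimal sketch: the three points below are made up, and only the two feature ranges (0 to 1 and 5000 to 9000) come from the text. The helper `min_max` is a hypothetical name for the usual rescaling to [0, 1].

```python
import math

# Points are (x, y): x lives in [0, 1], y lives in [5000, 9000].
p1 = (0.0, 5000.0)
p2 = (1.0, 9000.0)  # as different from p1 as possible on both features
p3 = (0.0, 9000.0)  # same x as p1, different y

# Raw distances: the x difference is practically invisible.
d_raw_12 = math.dist(p1, p2)
d_raw_13 = math.dist(p1, p3)
print(d_raw_12, d_raw_13)

def min_max(value, lo, hi):
    """Rescale a value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

def scale(p):
    return (min_max(p[0], 0.0, 1.0), min_max(p[1], 5000.0, 9000.0))

# After scaling, both features contribute equally.
d_scaled_12 = math.dist(scale(p1), scale(p2))
d_scaled_13 = math.dist(scale(p1), scale(p3))
print(d_scaled_12, d_scaled_13)
```

Before scaling, the two distances differ by far less than 0.001 even though `p2` and `p3` are at opposite ends of feature x. After scaling, the difference on x is clearly reflected in the distance.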

In this section we will explore a few different scaling methods.

In [None]:
from sklearn.preprocessing import QuantileTransformer, StandardScaler, RobustScaler, MinMaxScaler

In [None]:
NQT=QuantileTransformer(output_distribution='normal')
UQT=QuantileTransformer(output_distribution='uniform')
RS=RobustScaler()
SS=StandardScaler()
MMS=MinMaxScaler()

scalers = [NQT,UQT,RS,SS,MMS]
names = ["Gaussian", "Uniform", "Robust", "Standard", "Min-Max"]

In [None]:
fig, axes = plt.subplots(nrows=len(numericals), ncols=1+len(scalers), figsize=(20, 5*len(numericals)))
for i in range(len(numericals)):
    col = numericals[i]
    sns.histplot(x=col, data=x, ax=axes[i,0], color="blueviolet")
    axes[i,0].set_title("Original")
    for j in range(len(scalers)):
        scaler = scalers[j]
        reshaped_col = np.expand_dims(x[col], axis=1)
        transformed_col = scaler.fit_transform(reshaped_col)
        sns.histplot(x=transformed_col[:,0], ax=axes[i,j+1], color="crimson")
        axes[i,j+1].set_title(names[j])
plt.show()

### Definitions: (from right to left)  
* Min-Max Scaling makes all features have values in [0 , 1]. It does this by subtracting the minimum value then dividing by the range (maximum minus minimum).  
* Standardization subtracts the mean then divides by the standard deviation. This ensures all features have an average of 0 and a standard deviation of 1.  
* RobustScaler works like standardization, but instead of the mean and the standard deviation it uses the median and the IQR, which is the range between the 25th percentile and the 75th percentile. After scaling, the median becomes 0 and the central 50% of the values spans a range of 1 (roughly -0.5 to 0.5 when the distribution is symmetric around the median). You can use other percentiles if you wish.  
* The uniform quantile transformer transforms distributions into uniform ones with a range from 0 to 1.  
* The normal/gaussian quantile transformer transforms distributions into gaussian ones with mean 0 and a standard deviation of 1.  
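
The three linear scalers can be written out by hand, which makes the definitions easier to check. This is a minimal plain-Python sketch on a made-up skewed feature, not sklearn's code. The quantile transformers are left out because they are not simple formulas.

```python
import statistics

# Hypothetical skewed feature: eight small values and one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Min-Max: subtract the minimum, divide by the range.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Standardization: subtract the mean, divide by the standard deviation.
mean, std = statistics.fmean(values), statistics.pstdev(values)
standard = [(v - mean) / std for v in values]

# Robust scaling: subtract the median, divide by the IQR (75th minus 25th percentile).
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(v - median) / (q3 - q1) for v in values]

for name, scaled in [("min-max", min_max), ("standard", standard), ("robust", robust)]:
    print(f"{name:>8}:", [round(s, 2) for s in scaled])
```

In the output, the single extreme value pushes the other eight min-max values below 0.08, while the robust version keeps them spread out between -1 and 0.75.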

### Notes:
* StandardScaler, Min-Max Scaler & RobustScaler are "linear transformations"; they don't change the shape of the distributions.  
* Min-Max scaling isn't suitable for very skewed features or features with outliers, because a few extreme values stretch the range and most values get squeezed into a narrow part of [0 , 1].  
* Standard Scaler might not be very suitable for skewed features either, since it uses the mean and the std, which are affected by outliers.  
* Robust Scaler is better suited to skewed distributions since the median and the IQR are barely affected by outliers.  
* Uniform and Gaussian transformations might distort dependencies between features; features that are originally dependent might become less dependent and vice-versa. So if you have high correlations/dependencies between features then these two might not be suitable. If not, however, they could perhaps give better results, since many methods/models work better with normal/uniform distributions.

The following cell passes the whole dataset to a RobustScaler, here using the 5th and 95th percentiles instead of the default 25th and 75th.  
The syntax is the same for the other scalers.  
Keep in mind that it returns an array, not a dataframe.  
You can also pass a single feature, or write a loop if you want to scale features one by one.

In [None]:
rs = RobustScaler(quantile_range=(5,95))
scaled_data = rs.fit_transform(x)

# MICE: Multiple/Multivariate Imputation by Chained Equations

MICE is a method used to estimate missing values with machine learning.  
This is often better than dropping them or replacing them with means, modes, etc.  
I cannot explain mice here, but [THIS VIDEO RIGHT HERE](https://www.youtube.com/watch?v=WPiYOS3qK70) does a good job at doing that so check it out.

In [None]:
xcopy = x.copy(deep=True)
xcopy.drop(probably_useless_features, axis=1, inplace=True)

In [None]:
xcopy = xcopy.sample(n=3000, axis=0)

In [None]:
for col in xcopy.columns:
    for i in xcopy.index:
        r = np.random.rand()
        if r < 0.05:
            xcopy.loc[i,col] = np.nan

In [None]:
xcopy.shape

In [None]:
xcopy.isna().sum()

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor(max_depth = 5)

In [None]:
mice = IterativeImputer(estimator=tree, n_nearest_features=10)

In [None]:
impdata = mice.fit_transform(xcopy)

In [None]:
impdataframe = pd.DataFrame(impdata, columns=xcopy.columns)

In [None]:
impdataframe.isna().sum()

In [None]:
for col in numericals:
    sns.histplot(xcopy[col])
    plt.show()