# 🍀 Outlier detection, removal, and feature engineering

### **Purpose:**
+ 🎯 How to detect outliers
+ 🔥 Replace them with the column mean
+ 📐 Feature engineering
    + simple difference
    + ratio a
    + ratio b
    + power
    + sqrt
    + polynomial a
    + polynomial b
    + polynomial c
+ [Reference](https://arxiv.org/pdf/1701.07852.pdf) 

In [None]:
# system stuff..
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import numpy as np
import pandas as pd

# libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# models
from sklearn.ensemble import RandomForestClassifier

# stats
from scipy import stats

In [None]:
train = pd.read_csv("../input/tabular-playground-series-mar-2021/train.csv",index_col='id')
test = pd.read_csv("../input/tabular-playground-series-mar-2021/test.csv",index_col='id')
sample_submission = pd.read_csv("../input/tabular-playground-series-mar-2021/sample_submission.csv")

In [None]:
# label encoder
for c in tqdm(train.columns):
    if train[c].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(train[c].values)
        test[c] = lbl.transform(test[c].values)

# 🧮 Outlier Detection and Removal using IQR score
+ ➡️ [IQR WIKI](https://en.wikipedia.org/wiki/Interquartile_range)
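
As a minimal sketch of the rule applied in the cells below, the IQR fences can be computed on a toy series. The helper name `iqr_fences` and the sample values are illustrative assumptions and are not part of the competition data.

```python
import pandas as pd

def iqr_fences(s, k=1.5):
    """Return the (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# toy data (hypothetical): four ordinary values and one extreme value
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_fences(toy)

# a value is flagged as an outlier when it falls outside the fences
is_outlier = (toy < lower) | (toy > upper)
print(lower, upper, is_outlier.tolist())
```

Here the quartiles are 2 and 4, so the fences are -1 and 7 and only the value 100 is flagged.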

In [None]:
fig, (
      (ax0, ax1, ax2),
      (ax3, ax4, ax5),
      (ax6, ax7, ax8),
      (ax9, ax10, ax11),
      (ax12, ax13, ax14),
      (ax15, ax16, ax17),
      (ax18, ax19, ax20)
    ) = plt.subplots(nrows=7, ncols=3, sharey=True, figsize=(20, 10))

plt.subplots_adjust(hspace=1.3) 

sns.boxplot(x=train.cat0, ax=ax0)
sns.boxplot(x=train.cat1, ax=ax1)
sns.boxplot(x=train.cat2, ax=ax2)
sns.boxplot(x=train.cat3, ax=ax3)
sns.boxplot(x=train.cat4, ax=ax4)
sns.boxplot(x=train.cat5, ax=ax5)
sns.boxplot(x=train.cat6, ax=ax6)
sns.boxplot(x=train.cat7, ax=ax7)
sns.boxplot(x=train.cat8, ax=ax8)
sns.boxplot(x=train.cat9, ax=ax9)
sns.boxplot(x=train.cat10, ax=ax10)
sns.boxplot(x=train.cat11, ax=ax11)
sns.boxplot(x=train.cat12, ax=ax12)
sns.boxplot(x=train.cat13, ax=ax13)
sns.boxplot(x=train.cat14, ax=ax14)
sns.boxplot(x=train.cat15, ax=ax15)
sns.boxplot(x=train.cat16, ax=ax16)
sns.boxplot(x=train.cat17, ax=ax17)
sns.boxplot(x=train.cat18, ax=ax18)

plt.show()

In [None]:
# mask outliers as NaN using the IQR rule (values more than 1.5*IQR beyond the quartiles)
for i in train.columns:
    if i != 'target':
        q1 = train[i].quantile(0.25)
        q3 = train[i].quantile(0.75)
        iqr = q3 - q1
        train[i] = train[i][~((train[i]<(q1-1.5*iqr)) | (train[i]>(q3+1.5*iqr)))]

In [None]:
# same masking for the test set (fences are computed from the test data itself)
for i in test.columns:
    if i != 'target':
        q1 = test[i].quantile(0.25)
        q3 = test[i].quantile(0.75)
        iqr = q3 - q1
        test[i] = test[i][~((test[i]<(q1-1.5*iqr)) | (test[i]>(q3+1.5*iqr)))]

In [None]:
fig, (
      (ax0, ax1, ax2),
      (ax3, ax4, ax5),
      (ax6, ax7, ax8),
      (ax9, ax10, ax11),
      (ax12, ax13, ax14),
      (ax15, ax16, ax17),
      (ax18, ax19, ax20)
    ) = plt.subplots(nrows=7, ncols=3, sharey=True, figsize=(20, 10))

plt.subplots_adjust(hspace=1.3) 

sns.boxplot(x=train.cat0, ax=ax0)
sns.boxplot(x=train.cat1, ax=ax1)
sns.boxplot(x=train.cat2, ax=ax2)
sns.boxplot(x=train.cat3, ax=ax3)
sns.boxplot(x=train.cat4, ax=ax4)
sns.boxplot(x=train.cat5, ax=ax5)
sns.boxplot(x=train.cat6, ax=ax6)
sns.boxplot(x=train.cat7, ax=ax7)
sns.boxplot(x=train.cat8, ax=ax8)
sns.boxplot(x=train.cat9, ax=ax9)
sns.boxplot(x=train.cat10, ax=ax10)
sns.boxplot(x=train.cat11, ax=ax11)
sns.boxplot(x=train.cat12, ax=ax12)
sns.boxplot(x=train.cat13, ax=ax13)
sns.boxplot(x=train.cat14, ax=ax14)
sns.boxplot(x=train.cat15, ax=ax15)
sns.boxplot(x=train.cat16, ax=ax16)
sns.boxplot(x=train.cat17, ax=ax17)
sns.boxplot(x=train.cat18, ax=ax18)

plt.show()

Each outlier has been replaced with NaN rather than dropped, so the rows are kept and the missing values still need to be filled.
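
A small sketch of this mask-then-fill idea on toy data, assuming `Series.where` as an equivalent way to turn out-of-fence values into NaN (the notebook itself uses boolean indexing). The values are hypothetical.

```python
import pandas as pd

# toy data (hypothetical): 100.0 lies far outside the IQR fences
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1

# True for values inside the fences (inclusive on both ends)
inside = toy.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# keep values inside the fences, turn the rest into NaN
masked = toy.where(inside)

# fill the NaNs with the mean of the remaining values
filled = masked.fillna(masked.mean())
print(filled.tolist())
```

The row count is unchanged; only the extreme value is replaced, here by the mean of the four remaining values (2.5).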

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
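# replace NaNs with the column mean
for i in train.columns:
    if train[i].isnull().sum() > 0:
        # assign the result back; inplace fillna on a selected column is unreliable in recent pandas
        train[i] = train[i].fillna(train[i].mean())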

In [None]:
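# replace NaNs with the column mean
for i in test.columns:
    if test[i].isnull().sum() > 0:
        # assign the result back; inplace fillna on a selected column is unreliable in recent pandas
        test[i] = test[i].fillna(test[i].mean())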

# ⛲ Feature engineering
+ As someone with limited experience in ML/AI, feature engineering is something I am still learning about.
+ I found the following article helpful for understanding how to create new features ➡️ [ARTICLE](https://arxiv.org/pdf/1701.07852.pdf)
![img](https://raw.githubusercontent.com/andronikmk/kaggle-notebooks/main/mar-tab-playground/images/ferandomforrest.png)
+ Some of my features so far...
    + Simple Difference
        + $y = |x_{1} - x_{2}|$
    + Ratio a
        + $y = \frac{1}{x_{1} + x_{2}^2}$
    + Power
        + $y = x^2$
    + Square Root
        + $y = \sqrt{x}$
    + Polynomial a
        + $y = 1 + 5x + 8x^2$
    + Polynomial b
        + $y = \frac{x}{x^2}$
    + Polynomial c
        + $y = \frac{1}{5x_{1}+8x_{2}^2}$
+ **Note:** this is not the final list of features I will use, but it seems like a good start. Depending on the algorithm, some features will improve the score and others might not. 😊
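
To make a few of these formulas concrete, here is a minimal sketch on a toy frame. It uses the two-column form of the difference and ratio that the `wrangle` function uses. The helper `add_features` and the column names `a` and `b` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def add_features(df, x1, x2):
    """Return a copy of df with a few engineered features built from columns x1 and x2."""
    out = df.copy()
    out["diff"] = (out[x1] - out[x2]).abs()            # simple (absolute) difference
    out["ratio"] = 1 / (out[x1] + out[x2] ** 2)        # ratio a
    out["pow"] = out[x1] ** 2                          # power
    out["sqrt"] = np.sqrt(out[x1])                     # square root
    out["poly_a"] = 1 + 5 * out[x1] + 8 * out[x1] ** 2  # polynomial a
    return out

# toy data (hypothetical values chosen so every feature is finite)
toy = pd.DataFrame({"a": [1.0, 4.0], "b": [2.0, 1.0]})
feats = add_features(toy, "a", "b")
print(feats)
```

Unlike `wrangle`, this sketch works on a copy, so the input frame is left unchanged.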

In [None]:
def wrangle(X):
    """
    Add engineered features to X (modified in place) and return it:
    simple difference
    ratio a
    power
    square root
    polynomial a
    polynomial b
    polynomial c
    """
    
    X['diff_1'] = abs(X['cat0'] - X['cat1'])
    X['diff_2'] = abs(X['cat2'] - X['cat3'])
    X['diff_3'] = abs(X['cat4'] - X['cat5'])
    X['diff_4'] = abs(X['cat6'] - X['cat7'])
    X['diff_5'] = abs(X['cat8'] - X['cat9'])
    X['diff_6'] = abs(X['cat10'] - X['cat11'])
    X['diff_7'] = abs(X['cat12'] - X['cat13'])
    # each difference gets its own name so that none is overwritten
    X['diff_8'] = abs(X['cat14'] - X['cat15'])
    X['diff_9'] = abs(X['cat16'] - X['cat17'])
    X['diff_10'] = abs(X['cat18'] - X['cat0'])
    X['diff_11'] = abs(X['cont0'] - X['cont1'])
    X['diff_12'] = abs(X['cont2'] - X['cont3'])
    X['diff_13'] = abs(X['cont4'] - X['cont5'])
    X['diff_14'] = abs(X['cont6'] - X['cont7'])
    X['diff_15'] = abs(X['cont8'] - X['cont9'])
    X['diff_16'] = abs(X['cont10'] - X['cont0'])
    
    
    X['rat_1'] = 1 / (X['cont0'] + X['cont1']**2)
    X['rat_2'] = 1 / (X['cont2'] + X['cont3']**2)
    X['rat_3'] = 1 / (X['cont4'] + X['cont5']**2)
    X['rat_4'] = 1 / (X['cont6'] + X['cont7']**2)
    X['rat_5'] = 1 / (X['cont8'] + X['cont9']**2)
    X['rat_6'] = 1 / (X['cont10'] + X['cont7']**2)
    
    X['pow_0'] = X['cont0']**2
    X['pow_1'] = X['cont1']**2
    X['pow_2'] = X['cont2']**2
    X['pow_3'] = X['cont3']**2
    X['pow_4'] = X['cont4']**2
    X['pow_5'] = X['cont5']**2
    X['pow_6'] = X['cont6']**2
    X['pow_7'] = X['cont7']**2
    X['pow_8'] = X['cont8']**2
    X['pow_9'] = X['cont9']**2
    X['pow_10'] = X['cont10']**2
    
    X['sqrt_0'] = np.sqrt(X['cont0'])
    X['sqrt_1'] = np.sqrt(X['cont1'])
    X['sqrt_2'] = np.sqrt(X['cont2'])
    X['sqrt_3'] = np.sqrt(X['cont3'])
    X['sqrt_4'] = np.sqrt(X['cont4'])
    X['sqrt_5'] = np.sqrt(X['cont5'])
    X['sqrt_6'] = np.sqrt(X['cont6'])
    X['sqrt_7'] = np.sqrt(X['cont7'])
    X['sqrt_8'] = np.sqrt(X['cont8'])
    X['sqrt_9'] = np.sqrt(X['cont9'])
    X['sqrt_10'] = np.sqrt(X['cont10'])

    X['poly_0'] = 1 + 5*X['cont0'] + 8*X['cont0']**2
    X['poly_1'] = 1 + 5*X['cont1'] + 8*X['cont1']**2
    X['poly_2'] = 1 + 5*X['cont2'] + 8*X['cont2']**2
    X['poly_3'] = 1 + 5*X['cont3'] + 8*X['cont3']**2
    X['poly_4'] = 1 + 5*X['cont4'] + 8*X['cont4']**2
    X['poly_5'] = 1 + 5*X['cont5'] + 8*X['cont5']**2
    X['poly_6'] = 1 + 5*X['cont6'] + 8*X['cont6']**2
    X['poly_7'] = 1 + 5*X['cont7'] + 8*X['cont7']**2
    X['poly_8'] = 1 + 5*X['cont8'] + 8*X['cont8']**2
    X['poly_9'] = 1 + 5*X['cont9'] + 8*X['cont9']**2
    X['poly_10'] =1 + 5*X['cont10'] + 8*X['cont10']**2
    

    X['poly_b_0'] = X['cont0'] / X['cont0']**2
    X['poly_b_1'] = X['cont1'] / X['cont1']**2
    X['poly_b_2'] = X['cont2'] / X['cont2']**2
    X['poly_b_3'] = X['cont3'] / X['cont3']**2
    X['poly_b_4'] = X['cont4'] / X['cont4']**2
    X['poly_b_5'] = X['cont5'] / X['cont5']**2
    X['poly_b_6'] = X['cont6'] / X['cont6']**2
    X['poly_b_7'] = X['cont7'] / X['cont7']**2
    X['poly_b_8'] = X['cont8'] / X['cont8']**2
    X['poly_b_9'] = X['cont9'] / X['cont9']**2
    X['poly_b_10'] = X['cont10'] / X['cont10']**2
    

    X['drat_1'] = 1 / (5*X['cont0'] + 8*X['cont1']**2)
    X['drat_2'] = 1 / (5*X['cont2'] + 8*X['cont3']**2)
    X['drat_3'] = 1 / (5*X['cont4'] + 8*X['cont5']**2)
    X['drat_4'] = 1 / (5*X['cont6'] + 8*X['cont7']**2)
    X['drat_5'] = 1 / (5*X['cont8'] + 8*X['cont9']**2)
    X['drat_6'] = 1 / (5*X['cont10'] + 8*X['cont7']**2)
    
    return X

In [None]:
# run the function
train = wrangle(train)
test = wrangle(test)

### Now that we have run the wrangle function...
+ Some of the new features may contain NaNs (for example, the square root of a negative value), which we must handle.
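
The sketch below shows, on hypothetical toy values, how such missing values arise. It also shows that dividing by zero yields `inf` rather than NaN, so `fillna` alone would not catch it. Converting `inf` to NaN first is an optional extra step assumed here, not something the cells below do.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical): one negative value, one zero, one positive value
toy = pd.DataFrame({"x": [-1.0, 0.0, 4.0]})

# silence the NumPy warnings that these operations would otherwise raise
with np.errstate(invalid="ignore", divide="ignore"):
    toy["sqrt"] = np.sqrt(toy["x"])  # NaN for the negative value
    toy["inv"] = 1 / toy["x"]        # inf for the zero

# treat infinities as missing, then fill every NaN with its column mean
clean = toy.replace([np.inf, -np.inf], np.nan)
clean = clean.fillna(clean.mean())
print(clean)
```

After this step `clean` has no NaN or infinite entries; each gap is filled with the mean of the finite values in its column.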

In [None]:
# fill train NaNs with the column mean
for i in train.columns:
    if train[i].isnull().sum() > 0:
        train[i] = train[i].fillna(train[i].mean())

In [None]:
# fill test NaNs with the column mean
for i in test.columns:
    if test[i].isnull().sum() > 0:
        test[i] = test[i].fillna(test[i].mean())

In [None]:
train.head(3)

In [None]:
test.head(3)

In [None]:
print("Size of new train set:", train.shape)
print("Size of new test set:", test.shape)

### Conclusion:
+ Feature engineering and outlier detection are topics that I'm new to and still need to get better at. I hope this notebook helped in some way. Please leave a comment and upvote if you found this notebook useful. Thanks for reading! 🎉🎉🎉🎉