## Outlier Engineering


An outlier is a data point that differs significantly from the rest of the data. “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980].

Statistics such as the mean and variance are very susceptible to outliers. In addition, **some Machine Learning models are sensitive to outliers** which may decrease their performance. Thus, depending on which algorithm we wish to train, we often remove outliers from our variables.

We discussed in section 3 of this course how to identify outliers. In this section, we discuss how we can process them to train our machine learning models.


## How can we pre-process outliers?

- Trimming: remove the outliers from our dataset
- Treat outliers as missing data, and proceed with any missing data imputation technique
- Discretisation: outliers are placed in border bins together with higher or lower values of the distribution
- Censoring: capping the variable distribution at a maximum and/or minimum value
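
The four options above can be sketched with plain pandas. This is a minimal illustration only: the toy values, the limit of 20 and the bin edges are hypothetical choices, not taken from the dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy variable with one obvious outlier (hypothetical values).
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0], name="x")
upper = 20.0  # hypothetical, arbitrarily chosen limit

# 1. Trimming: drop the observations above the limit.
trimmed = s[s <= upper]

# 2. Treat outliers as missing data, then impute (here with the median).
as_missing = s.where(s <= upper)  # values above the limit become NaN
imputed = as_missing.fillna(as_missing.median())

# 3. Discretisation: the outlier falls into the open-ended top bin.
binned = pd.cut(s, bins=[-np.inf, 11, 12, np.inf], labels=["low", "mid", "high"])

# 4. Censoring: replace values above the limit with the limit itself.
capped = s.clip(upper=upper)

print(len(trimmed), imputed.iloc[-1], binned.iloc[-1], capped.max())
```

Trimming is the only option that changes the number of rows; the other three keep every observation and only change the value (or bin) of the outlier.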

**Censoring** is also known as:

- top and bottom coding
- winsorisation
- capping


## Censoring or Capping

**Censoring**, or **capping**, means capping the maximum and/or minimum of a distribution at an arbitrary value. In other words, values above the chosen maximum or below the chosen minimum are **censored**: they are replaced by that limit.

Capping can be done at both tails, or just one of the tails, depending on the variable and the user.

Check the [pydata](https://www.youtube.com/watch?v=KHGGlozsRtA) video, by Soledad Galli, for an example of capping used in a finance company.

The numbers at which to cap the distribution can be determined:

- arbitrarily
- using the inter-quartile range proximity rule
- using the Gaussian approximation
- using quantiles
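
As a rough sketch of the last three options, the helpers below compute a pair of (lower, upper) limits from a pandas Series. The function names and the default parameters (1.5 times the IQR, 3 standard deviations, 5th and 95th percentiles) are assumptions chosen for illustration; they are common conventions, not values prescribed by this notebook.

```python
import pandas as pd

def iqr_caps(s, fold=1.5):
    """Inter-quartile range proximity rule: Q1 - fold*IQR, Q3 + fold*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

def gaussian_caps(s, fold=3.0):
    """Gaussian approximation: mean -/+ fold standard deviations."""
    return s.mean() - fold * s.std(), s.mean() + fold * s.std()

def quantile_caps(s, lower=0.05, upper=0.95):
    """Quantiles: cap at the chosen lower and upper percentiles."""
    return s.quantile(lower), s.quantile(upper)

# Toy variable (hypothetical values) with one extreme observation.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

for rule in (iqr_caps, gaussian_caps, quantile_caps):
    low, high = rule(s)
    print(rule.__name__, round(low, 2), round(high, 2))
```

Whichever rule is used, the resulting limits can then be applied with `s.clip(lower=low, upper=high)`.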


### Advantages

- does not remove data

### Limitations

- distorts the distributions of the variables
- distorts the relationships among variables
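
A small sketch of both limitations, using hypothetical toy data: `y` is an exact linear function of `x`, and `x` is then capped at the arbitrary limits 3 and 8. The spread of `x` shrinks, values pile up at the caps, and the correlation with `y` is no longer perfect.

```python
import pandas as pd

# Toy data (hypothetical): y depends linearly on x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = 2 * x

# Cap x at both tails with arbitrary limits.
x_capped = x.clip(lower=3.0, upper=8.0)

# Distribution: the standard deviation shrinks and values accumulate at the caps.
std_before, std_after = x.std(), x_capped.std()
n_at_lower_cap = int((x_capped == 3.0).sum())

# Relationship: the correlation between x and y drops below 1.
corr_before = x.corr(y)
corr_after = x_capped.corr(y)

print(std_before, std_after, n_at_lower_cap, corr_before, corr_after)
```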


## In this Demo

We will see how to perform capping with arbitrary values using the Automobile dataset from the UCI Machine Learning Repository.

## Important

When capping, we tend to cap values in both the train and test sets. It is important to remember that the capping values MUST be derived from the train set, and that those same values are then used to cap the variables in the test set.

I will not do that in this demo, but please keep that in mind when setting up your pipelines.
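
A minimal sketch of that workflow with plain pandas, assuming a hypothetical train/test split and using the 25th and 75th percentiles of the train set as limits. The toy values and the choice of percentiles are illustrative only.

```python
import pandas as pd

# Hypothetical train and test sets for a single variable.
train = pd.DataFrame({"engine-size": [61, 97, 120, 141, 326]})
test = pd.DataFrame({"engine-size": [50, 130, 400]})

variables = ["engine-size"]

# Learn the limits from the TRAIN set only.
caps = {
    var: (train[var].quantile(0.25), train[var].quantile(0.75))
    for var in variables
}

# Apply the same limits to both sets.
train_capped = train.copy()
test_capped = test.copy()
for var, (low, high) in caps.items():
    train_capped[var] = train[var].clip(lower=low, upper=high)
    test_capped[var] = test[var].clip(lower=low, upper=high)

print(caps)
print(test_capped["engine-size"].tolist())
```

The test set contains values outside the range seen in train (50 and 400); they are still capped at the limits learned from train, so no information from the test set leaks into the transformation.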

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from feature_engine import missing_data_imputers  as msi
from feature_engine import outlier_removers as outr

## Dataset

The dataset for this paper has been obtained 
from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Automobile).

This data set consists of three types of entities: (a) the specification of an auto in terms of various 
characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars.

The second rating corresponds to the degree to which the auto is more risky than its price indicates.
Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky 
(or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process 
“symboling”. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

The third factor is the relative average loss payment per insured vehicle year. This value is normalized 
for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc…),
and represents the average loss per car per year.

There are 205 instances and 26 attributes in total, of which 15 are continuous, 1 is an integer and 
10 are nominal. There are missing values as well.

In [2]:
# let's load the imports-85-clean-data.csv dataset

data = pd.read_csv('C:\\Users\\gusal\\machine learning\\Feature engineering\\automobile data set\\imports-85-clean-data.csv')


In [3]:

data.head()

|  | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 128.414508 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | 128.414508 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | 128.414508 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500.0 | 18 | 22 | 17450.0 |


In [4]:
# find numerical variables
# (those whose dtype is not object; note that the target, price, is included)
features_numerical = [c for c in data.columns if data[c].dtypes!='O']

In [5]:
features_numerical

['symboling',
 'normalized-losses',
 'wheel-base',
 'length',
 'width',
 'height',
 'curb-weight',
 'engine-size',
 'bore',
 'stroke',
 'compression-ratio',
 'horsepower',
 'peak-rpm',
 'city-mpg',
 'highway-mpg',
 'price']
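
The list above still contains the target, price. If the intent is to leave the target out, one possible approach is sketched below on a small hypothetical frame; the variable name `target` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column,
# one numerical feature and the numerical target.
df = pd.DataFrame({
    "make": ["audi", "bmw"],
    "horsepower": [102, 182],
    "price": [13950.0, 30760.0],
})

target = "price"

# Select numerical columns by dtype, then drop the target by name.
features_numerical = [
    c for c in df.select_dtypes(include="number").columns if c != target
]

print(features_numerical)
```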

## ArbitraryOutlierCapper

The ArbitraryOutlierCapper caps the minimum and/or maximum values of the selected variables at limits determined by the user, passed as one dictionary per tail.
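
Conceptually, capping with user-defined dictionaries amounts to a per-column clip. The sketch below reproduces that idea with plain pandas; `cap_with_dicts` is a hypothetical helper written for illustration and is not part of feature_engine, whose internals may differ.

```python
import pandas as pd

def cap_with_dicts(df, max_capping_dict=None, min_capping_dict=None):
    """Return a copy of df with each listed variable capped at the given limits."""
    out = df.copy()
    for var, cap in (max_capping_dict or {}).items():
        out[var] = out[var].clip(upper=cap)  # right tail
    for var, cap in (min_capping_dict or {}).items():
        out[var] = out[var].clip(lower=cap)  # left tail
    return out

# Toy frame (illustrative values) with a low, a central and a high observation.
toy = pd.DataFrame({
    "curb-weight": [1488, 2414, 4066],
    "engine-size": [61, 120, 326],
})

capped = cap_with_dicts(
    toy,
    max_capping_dict={"curb-weight": 2935, "engine-size": 141},
    min_capping_dict={"curb-weight": 2145, "engine-size": 97},
)

print(capped)
```

Passing only one of the two dictionaries caps a single tail, which mirrors setting the other argument to None.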

In [17]:
# let's find out the maximum values for the numerical variables in data
for variable in features_numerical:
    data_max = data[variable].max()
    print('maximum value of {0} is {1}'.format(variable, data_max))

maximum value of symboling is 3
maximum value of normalized-losses is 256.0
maximum value of wheel-base is 120.9
maximum value of length is 208.1
maximum value of width is 72.3
maximum value of height is 59.8
maximum value of curb-weight is 4066
maximum value of engine-size is 326
maximum value of bore is 3.94
maximum value of stroke is 4.17
maximum value of compression-ratio is 23.0
maximum value of horsepower is 288
maximum value of peak-rpm is 6600.0
maximum value of city-mpg is 49
maximum value of highway-mpg is 54
maximum value of price is 45400.0


In [18]:
data.describe()

|  | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 0.834146 | 128.576317 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.324878 | 3.253366 | 10.142537 | 104.84878 | 5125.369458 | 25.219512 | 30.75122 | 13321.278623 |
| std | 1.245307 | 38.606463 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.273049 | 0.313937 | 3.97204 | 39.969861 | 476.979093 | 6.542142 | 6.886443 | 8095.99644 |
| min | -2.0 | 65.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | 95.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.13 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7775.0 |
| 50% | 1.0 | 125.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | 154.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.58 | 3.41 | 9.4 | 120.0 | 5500.0 | 30.0 | 34.0 | 16503.0 |
| max | 3.0 | 256.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 3.94 | 4.17 | 23.0 | 288.0 | 6600.0 | 49.0 | 54.0 | 45400.0 |


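### Maximum capping

We cap the right tail of curb-weight, engine-size and compression-ratio at their 75th percentiles.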

In [19]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict = {'curb-weight':2935, 'engine-size':141, 'compression-ratio': 9.4},
                                     min_capping_dict = None)
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict={'compression-ratio': 9.4,
                                         'curb-weight': 2935,
                                         'engine-size': 141},
                       min_capping_dict=None)

In [20]:
capper.right_tail_caps_

{'curb-weight': 2935, 'engine-size': 141, 'compression-ratio': 9.4}

In [21]:
capper.left_tail_caps_

{}

In [24]:
temp = capper.transform(data)

temp['curb-weight'].max(), temp['engine-size'].max(), temp['compression-ratio'].max()

(2935, 141, 9.4)

### Minimum capping

We cap the left tail of curb-weight, engine-size and compression-ratio at their 25th percentiles.

In [25]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict=None,
                                     min_capping_dict={'curb-weight':2145, 'engine-size':97, 'compression-ratio': 8.6
                                     })
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict=None,
                       min_capping_dict={'compression-ratio': 8.6,
                                         'curb-weight': 2145,
                                         'engine-size': 97})

In [26]:
capper.variables

['curb-weight', 'engine-size', 'compression-ratio']

In [27]:
capper.right_tail_caps_

{}

In [28]:
capper.left_tail_caps_

{'curb-weight': 2145, 'engine-size': 97, 'compression-ratio': 8.6}

In [29]:
temp = capper.transform(data)

temp['curb-weight'].min(), temp['engine-size'].min(), temp['compression-ratio'].min()

(2145, 97, 8.6)

### Both ends capping

In [30]:
capper = outr.ArbitraryOutlierCapper(max_capping_dict={
                                     'curb-weight':2935, 'engine-size':141, 'compression-ratio': 9.4},
                                     min_capping_dict={
                                     'curb-weight':2145, 'engine-size':97, 'compression-ratio': 8.6})
capper.fit(data)

ArbitraryOutlierCapper(max_capping_dict={'compression-ratio': 9.4,
                                         'curb-weight': 2935,
                                         'engine-size': 141},
                       min_capping_dict={'compression-ratio': 8.6,
                                         'curb-weight': 2145,
                                         'engine-size': 97})

In [31]:
capper.right_tail_caps_

{'curb-weight': 2935, 'engine-size': 141, 'compression-ratio': 9.4}

In [32]:
capper.left_tail_caps_

{'curb-weight': 2145, 'engine-size': 97, 'compression-ratio': 8.6}

In [33]:
temp = capper.transform(data)

temp['curb-weight'].min(), temp['engine-size'].min(), temp['compression-ratio'].min()

(2145, 97, 8.6)

In [34]:
temp['curb-weight'].max(), temp['engine-size'].max(), temp['compression-ratio'].max()

(2935, 141, 9.4)

In [35]:
temp.describe()

|  | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 0.834146 | 128.576317 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2513.229268 | 118.014634 | 3.324878 | 3.253366 | 9.037122 | 104.84878 | 5125.369458 | 25.219512 | 30.75122 | 13321.278623 |
| std | 1.245307 | 38.606463 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 319.598586 | 18.113606 | 0.273049 | 0.313937 | 0.330059 | 39.969861 | 476.979093 | 6.542142 | 6.886443 | 8095.99644 |
| min | -2.0 | 65.0 | 86.6 | 141.1 | 60.3 | 47.8 | 2145.0 | 97.0 | 2.54 | 2.07 | 8.6 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | 95.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.13 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7775.0 |
| 50% | 1.0 | 125.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | 154.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.58 | 3.41 | 9.4 | 120.0 | 5500.0 | 30.0 | 34.0 | 16503.0 |
| max | 3.0 | 256.0 | 120.9 | 208.1 | 72.3 | 59.8 | 2935.0 | 141.0 | 3.94 | 4.17 | 9.4 | 288.0 | 6600.0 | 49.0 | 54.0 | 45400.0 |


In [None]:
# comparing with the original data

In [36]:
data.describe()

|  | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 0.834146 | 128.576317 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.324878 | 3.253366 | 10.142537 | 104.84878 | 5125.369458 | 25.219512 | 30.75122 | 13321.278623 |
| std | 1.245307 | 38.606463 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.273049 | 0.313937 | 3.97204 | 39.969861 | 476.979093 | 6.542142 | 6.886443 | 8095.99644 |
| min | -2.0 | 65.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | 95.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.13 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7775.0 |
| 50% | 1.0 | 125.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | 154.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.58 | 3.41 | 9.4 | 120.0 | 5500.0 | 30.0 | 34.0 | 16503.0 |
| max | 3.0 | 256.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 3.94 | 4.17 | 23.0 | 288.0 | 6600.0 | 49.0 | 54.0 | 45400.0 |
