## Outliers and validity

When preparing data, we have to be careful about the accuracy of our dataset.
Outliers and invalid data points can be difficult to detect, and they should be handled with caution.

We start by importing our most important library.

In [1]:
import pandas as pd

### Silicon wafer thickness

Our first dataset contains information about the production of silicon wafers; the thickness of each wafer is measured at 9 different spots.
More information on the dataset can be found [here](https://openmv.net/info/silicon-wafer-thickness).

In [2]:
wafer_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c2_data_preparation/data/silicon-wafer-thickness.csv')
wafer_df.head()

Unnamed: 0,G1,G2,G3,G4,G5,G6,G7,G8,G9
0,0.175,0.188,-0.159,0.095,0.374,-0.238,-0.8,0.158,-0.211
1,0.102,0.075,0.141,0.18,0.138,-0.057,-0.075,0.072,0.072
2,0.607,0.711,0.879,0.765,0.592,0.187,0.431,0.345,0.187
3,0.774,0.823,0.619,0.37,0.725,0.439,-0.025,-0.259,0.496
4,0.504,0.644,0.845,0.681,0.502,0.151,0.404,0.296,0.26


We would like to investigate the distribution of the measurements. As we are early in this course, it is too soon to use visualisation techniques.
This does not mean we can't use simple mathematics, so we introduce the interquartile range (IQR).
A reason for using the IQR over the standard deviation is that with the IQR we do not assume a normal distribution.
The IQR is the distance between the first quartile (the 25th percentile) and the third quartile (the 75th percentile). It spans the middle 50% of the values and gives us an indication of the spread of our results. We calculate the IQR for each of the 9 measurements independently.
For more info about the IQR you can visit [Wikipedia](https://en.wikipedia.org/wiki/Interquartile_range).

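To get a feeling for why the IQR is a robust measure of spread, here is a minimal sketch on a made-up series (the values in `readings` are hypothetical and not part of the wafer data). A single extreme reading inflates the standard deviation, while the IQR only looks at the middle 50% of the values and stays small.

```python
import pandas as pd

# Hypothetical toy readings: five typical values and one extreme value.
readings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# IQR: distance between the 75th and the 25th percentile.
q1 = readings.quantile(0.25)
q3 = readings.quantile(0.75)
iqr_toy = q3 - q1

# Standard deviation of the same readings, for comparison.
std_toy = readings.std()

print(f"Q1={q1}, Q3={q3}, IQR={iqr_toy}, std={std_toy:.2f}")
```
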
In [3]:
iqr = wafer_df.quantile(0.75)-wafer_df.quantile(0.25)
iqr

G1    0.54425
G2    0.61000
G3    0.54075
G4    0.52475
G5    0.61175
G6    0.86750
G7    0.76175
G8    0.87225
G9    0.86300
dtype: float64

You can see that the IQR of each measurement lies between 0.5 and 1 unit, indicating that the 9 measurements of the wafer have a similar spread.
With these IQRs we can calculate, for each point, how far it is from the median relative to the spread of its measurement.

In [4]:
relative_spread_df = (wafer_df-wafer_df.median())/iqr
relative_spread_df.head()

Unnamed: 0,G1,G2,G3,G4,G5,G6,G7,G8,G9
0,-0.011024,-0.077869,-0.819233,-0.367794,0.176543,-0.352738,-1.029865,-0.130696,-0.254925
1,-0.145154,-0.263115,-0.264448,-0.205812,-0.209236,-0.144092,-0.07811,-0.229292,0.073001
2,0.782729,0.779508,1.100324,0.909004,0.532897,0.137176,0.58615,0.083692,0.206257
3,1.089573,0.963115,0.61951,0.156265,0.750306,0.427666,-0.012471,-0.60877,0.564311
4,0.593477,0.669672,1.037448,0.748928,0.385779,0.095677,0.550706,0.027515,0.290846


You can now see that some points are close to the median, whilst others are much further away, in both the positive and the negative direction.
By defining a threshold, we quantify how large a deviation must be to flag a reading as an outlier.
Here we use a threshold of 2 IQRs and separate the high outliers first. Note that a single measurement out of the 9 is enough to mark the whole wafer as an outlier.
Since we want to find wafers with varying thickness, that behaviour is desirable.

In [5]:
relative_spread_df[(relative_spread_df>2).any(axis='columns')]

Unnamed: 0,G1,G2,G3,G4,G5,G6,G7,G8,G9
8,2.23243,2.009016,1.956542,1.589328,1.84389,1.544669,1.233344,0.419604,1.582851
38,12.891135,12.827049,12.832178,13.913292,11.429506,9.500865,10.305875,9.9272,9.05562
39,3.691318,3.981148,3.774387,4.081944,3.248059,3.729107,3.30489,3.846374,3.149479
61,2.010106,2.153279,1.98798,1.863745,1.858602,1.274928,1.237283,0.825451,0.955968
110,3.678457,2.841803,3.204808,3.180562,2.669391,0.518732,0.700361,0.176555,0.727694
112,2.361047,2.086066,2.363384,2.10767,1.925623,1.23804,1.766328,0.8908,1.377752
117,1.475425,1.043443,2.154415,2.582182,0.653862,1.823631,1.581227,0.857552,1.188876
120,1.791456,1.484426,2.583449,1.440686,2.085819,0.990202,1.782081,1.034107,1.822711
121,1.791456,1.484426,2.583449,1.440686,2.085819,0.990202,1.782081,1.034107,1.822711
152,2.610932,2.102459,2.387425,2.549786,2.169187,1.730259,2.241549,1.713958,1.592121


It seems we have a few high outliers. You can clearly see that for most of these wafers the measurements are high across the board, but in some cases (e.g. id 154) only one measurement was an outlier.
We can do the same for the low outliers.

In [6]:
relative_spread_df[(relative_spread_df<-2).any(axis='columns')]

Unnamed: 0,G1,G2,G3,G4,G5,G6,G7,G8,G9
54,-1.550758,-1.52541,-1.843736,-2.082897,-1.659174,-1.203458,-1.184772,-1.650903,-1.245655
56,-1.73266,-1.510656,-2.121128,-2.122916,-1.781774,-1.521614,-1.909419,-1.782746,-1.159907
59,-1.97152,-1.310656,-2.328248,-1.175798,-2.067838,-0.915274,-1.783394,-1.304672,-1.514484
64,-1.234727,-1.361475,-0.736015,-1.055741,-2.224765,-0.839193,-0.679357,-0.865578,-0.663963
65,-2.226918,-1.194262,-2.117429,-2.161029,-2.043318,-0.190202,-1.004923,-0.270565,-0.794902
102,-2.484153,-2.330328,-1.568192,-2.808957,-1.945239,-1.340634,-0.846078,-1.691029,-0.887601

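The high and low checks above can also be combined into a single step by looking at the absolute deviation. The sketch below wraps this in a helper; the function name `flag_outliers`, its default threshold and the toy data are assumptions for illustration and not part of the original notebook.

```python
import pandas as pd


def flag_outliers(df: pd.DataFrame, threshold: float = 2.0) -> pd.DataFrame:
    """Return the rows with at least one value more than `threshold` IQRs from its column median."""
    iqr = df.quantile(0.75) - df.quantile(0.25)
    relative = (df - df.median()) / iqr
    # abs() treats high and low deviations the same way.
    mask = (relative.abs() > threshold).any(axis="columns")
    return df[mask]


# Hypothetical toy data: row 5 has a high value in A, row 0 a low value in B.
toy_df = pd.DataFrame({
    "A": [0.0, 1.0, 2.0, 3.0, 4.0, 50.0],
    "B": [-50.0, 1.0, 2.0, 3.0, 4.0, 0.0],
})

flagged = flag_outliers(toy_df)
print(flagged)
```

On the wafer data, `flag_outliers(wafer_df)` should select the same wafers as the two filters above combined, but it shows the raw thickness values instead of the relative ones.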

For a simple mathematical equation these results look promising, yet the approach can always be made more sophisticated.
Without going too deep into the subject, we could apply some machine learning, using an unsupervised method.
Here we use the sklearn library, which contains the Isolation Forest algorithm.
More info about the algorithm can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html).

In [7]:
from sklearn.ensemble import IsolationForest

We first create the classifier and train (fit) it on the wafer data.
Then we make a prediction for each record of the wafer data and keep the records that the classifier considers outliers, which it labels with -1.

In [8]:
clf = IsolationForest(random_state=0).fit(wafer_df)
wafer_df[clf.predict(wafer_df)==-1]

Unnamed: 0,G1,G2,G3,G4,G5,G6,G7,G8,G9
8,1.396,1.461,1.342,1.122,1.394,1.408,0.924,0.638,1.375
20,-0.558,-0.705,-0.526,-0.412,-0.753,-0.998,-0.27,0.598,-1.416
38,7.197,8.06,7.223,7.589,7.258,8.31,7.835,8.931,7.824
39,2.19,2.664,2.325,2.43,2.253,3.303,2.502,3.627,2.727
54,-0.663,-0.695,-0.713,-0.805,-0.749,-0.976,-0.918,-1.168,-1.066
56,-0.762,-0.686,-0.863,-0.826,-0.824,-1.252,-1.47,-1.283,-0.992
59,-0.892,-0.564,-0.975,-0.329,-0.999,-0.726,-1.374,-0.866,-1.298
61,1.275,1.549,1.359,1.266,1.403,1.174,0.927,0.992,0.834
65,-1.031,-0.493,-0.861,-0.846,-0.984,-0.097,-0.781,0.036,-0.677
102,-1.171,-1.186,-0.564,-1.186,-0.924,-1.095,-0.66,-1.203,-0.757

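Besides the -1/1 labels from `predict`, the fitted model can also give a score per record through `score_samples`, where a lower score means a more anomalous record. The sketch below uses randomly generated toy data with one extreme row added by hand (all of it hypothetical) to show how the records could be ranked instead of only labelled.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical toy data: 200 ordinary rows plus one extreme row at index 200.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["A", "B", "C"])
toy.loc[200] = [8.0, 8.0, 8.0]

clf_toy = IsolationForest(random_state=0).fit(toy)

# score_samples returns one score per row; lower means more anomalous.
scores = pd.Series(clf_toy.score_samples(toy), index=toy.index)
most_anomalous = scores.nsmallest(5)
print(most_anomalous)
```

Ranking by score lets you decide yourself how many records to inspect, rather than relying on the cut-off that `predict` applies.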

Comparing the results with our IQR approach, we see a lot of similarities. The id 154 record did not show up here, which fits our earlier impression that it was perhaps not a strong enough outlier.
You could enhance our IQR technique by counting the number of measurements that are above the threshold and responding accordingly; I will leave you a little hint.

In [9]:
(relative_spread_df>2).sum()

G1    7
G2    7
G3    8
G4    6
G5    6
G6    3
G7    3
G8    2
G9    2
dtype: int64

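The hint counts the exceedances per measurement (column). One possible way to build on it is to count per wafer (row) instead and only keep the wafers where several measurements exceed the threshold. The toy values and the minimum of 2 flagged measurements below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical relative deviations for three wafers and three measurements.
toy_relative = pd.DataFrame({
    "G1": [0.1, 2.5, 3.0],
    "G2": [0.3, 0.4, 2.7],
    "G3": [-0.2, 0.1, 2.2],
})

# Number of measurements per wafer that lie more than 2 IQRs from the median.
exceed_count = (toy_relative.abs() > 2).sum(axis="columns")

# Keep only wafers with at least 2 such measurements (assumed cut-off).
strong_outliers = toy_relative[exceed_count >= 2]
print(exceed_count)
print(strong_outliers)
```

With this rule a wafer with a single deviating spot is no longer flagged, so the cut-off should reflect whether such wafers matter for your use case.
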
### Distillation column

As an exercise, you can apply the same techniques to this dataset and see what you find. Good luck!
Be mindful not to include the date as a variable in your outlier algorithm.

In [10]:
distil_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c2_data_preparation/data/distillation-tower.csv')
distil_df

Unnamed: 0,Date,Temp1,FlowC1,Temp2,TempC1,Temp3,TempC2,TempC3,Temp4,PressureC1,...,Temp10,FlowC3,FlowC4,Temp11,Temp12,InvTemp1,InvTemp2,InvTemp3,InvPressure1,VapourPressure
0,2000-08-21,139.9857,432.0636,377.8119,100.2204,492.1353,490.1459,180.5578,187.4331,215.0627,...,513.9653,8.6279,10.5988,30.8983,489.9900,2.0409,2.6468,2.1681,4.3524,32.5026
1,2000-08-23,131.0470,487.4029,371.3060,100.2297,482.2100,480.3128,172.6575,179.5089,205.0999,...,504.5145,8.7662,10.7560,31.9099,480.2888,2.0821,2.6932,2.2207,4.5497,34.8598
2,2000-08-26,118.2666,437.3516,378.4483,100.3084,488.7266,487.0040,165.9400,172.9262,205.0304,...,508.9997,8.5319,10.5737,29.9165,486.6190,2.0550,2.6424,2.1796,4.5511,32.1666
3,2000-08-29,118.1769,481.8314,378.0028,95.5766,493.1481,491.1137,167.2085,174.2338,205.2561,...,514.1794,8.6260,10.6695,30.6229,491.1304,2.0361,2.6455,2.1620,4.5464,30.4064
4,2000-08-30,120.7891,412.6471,377.8871,92.9052,490.2486,488.6641,167.0326,173.9681,205.0883,...,511.0948,8.5939,10.4922,29.4977,487.6475,2.0507,2.6463,2.1704,4.5499,30.9238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,2003-01-26,130.8138,212.6385,341.5964,121.4354,468.3401,467.0299,174.7639,180.7649,229.7393,...,479.0290,5.5590,6.4470,16.4131,466.3347,2.1444,2.9274,2.2127,4.0911,38.8507
249,2003-01-28,128.9673,225.1412,349.8965,118.8604,479.7665,478.4652,176.2176,182.3646,230.5049,...,491.2362,5.6342,6.4360,17.2385,477.8816,2.0926,2.8580,2.1620,4.0783,34.2653
250,2003-01-31,130.5328,223.5965,345.9366,120.4027,474.5378,473.1145,176.3310,182.2578,230.6638,...,485.8786,5.4810,6.3575,16.9866,472.3176,2.1172,2.8907,2.1855,4.0756,36.5717
251,2003-02-03,128.5248,213.5613,343.4950,119.6989,469.3802,467.9954,174.6435,180.5093,230.5226,...,480.2879,5.4727,6.4175,16.6778,467.0001,2.1413,2.9113,2.2090,4.0780,38.1054

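As a starting point for the exercise, here is a minimal sketch of how the date could be kept out of the calculation. It uses a small toy frame with values copied from the first rows above and only two of the sensors; the names `toy_distil`, `features`, `iqr_distil` and `relative_distil` are made up for this example.

```python
import pandas as pd

# Toy frame mimicking the layout of the distillation data; the values are
# copied from the first rows shown above and only two sensors are included.
toy_distil = pd.DataFrame({
    "Date": ["2000-08-21", "2000-08-23", "2000-08-26"],
    "Temp1": [139.9857, 131.0470, 118.2666],
    "FlowC1": [432.0636, 487.4029, 437.3516],
})

# Move the date to the index so it stays available for lookups,
# then keep only the numeric columns as input for the outlier check.
features = toy_distil.set_index("Date").select_dtypes(include="number")

iqr_distil = features.quantile(0.75) - features.quantile(0.25)
relative_distil = (features - features.median()) / iqr_distil
print(relative_distil)
```

Keeping the date as the index means any flagged record can still be traced back to the day it was measured.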