# Outliers
This notebook collects several ways to detect outliers.
___
If you are coming from the previous notebook (2. Transforming data), I want to clarify two things first.
- This notebook may be updated frequently as I discover new ways to detect and remove outliers (from this book).
- The very first example is deliberately simple, and dealing with outliers doesn't end there.

In [11]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame


## Basic Outlier 

In [14]:
df = DataFrame(np.random.randn(100, 4))
df.describe()

Unnamed: 0,0,1,2,3
count,100.0,100.0,100.0,100.0
mean,0.173463,-0.015103,-0.231392,0.046166
std,1.06875,1.067518,1.093479,1.005034
min,-2.196299,-2.58284,-3.71338,-2.788125
25%,-0.504283,-0.740742,-1.150561,-0.716982
50%,0.298811,-0.036227,-0.054689,0.002347
75%,0.883417,0.757753,0.511272,0.890928
max,3.218513,2.789202,2.367302,2.559666


Now that we have some data, we can see that there are some values whose magnitude is greater than 2. 
As a dummy example, assume the values can't go above 2 or below -2.

So...

### 1. Check

In [29]:
df[np.abs(df[0]) > 2]

Unnamed: 0,0,1,2,3
4,3.218513,-0.688628,0.58629,0.373257
13,2.023627,0.695923,-0.520284,0.306238
46,-2.196299,1.640499,-1.075587,-0.069167
53,-2.084976,0.703409,-1.543034,0.555975
75,3.176328,0.236336,-1.808515,-0.576179


In [30]:
df[np.abs(df[1]) > 2]

Unnamed: 0,0,1,2,3
35,1.970374,2.789202,-3.71338,1.903543
45,-0.6534,-2.183555,0.76906,-0.763131
76,-0.508287,-2.030815,-1.161269,-1.711957
79,0.617072,-2.58284,-0.949572,0.162737
94,1.269597,-2.528789,-0.41381,0.581075


In [31]:
df[np.abs(df[2]) > 2]

Unnamed: 0,0,1,2,3
3,0.683447,1.386483,-2.061696,-0.186593
35,1.970374,2.789202,-3.71338,1.903543
41,1.900607,-1.561373,2.367302,1.680957
60,-1.068619,-1.823509,-2.006502,0.923302
62,-0.752614,0.739068,-2.485644,-1.259811


In [32]:
df[np.abs(df[3]) > 2]

Unnamed: 0,0,1,2,3
8,-0.810934,0.842951,-1.732263,2.559666
28,1.297917,1.020595,0.411189,-2.788125
72,0.269266,-0.747677,-1.827473,2.331064


### 2. Step 1 was verbose.
Get all rows at once.


In [59]:
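# These are the rows that contain at least one OUTLYING value
# The code below is based on a condition. Cool - I have made a gist on it. Check it here: https://gist.github.com/AayushSameerShah/330d000dfa9b245646391d1c5f167147

# Pass axis by keyword: the positional form .any(1) is rejected by pandas 2.x
# Note: on pandas 2.1+, Styler.applymap is deprecated in favour of Styler.map
df[(np.abs(df) > 2).any(axis=1)].style.applymap(lambda x: "background: yellow" if np.abs(x) > 2 else "")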

Unnamed: 0,0,1,2,3
3,0.683447,1.386483,-2.061696,-0.186593
4,3.218513,-0.688628,0.58629,0.373257
8,-0.810934,0.842951,-1.732263,2.559666
13,2.023627,0.695923,-0.520284,0.306238
28,1.297917,1.020595,0.411189,-2.788125
35,1.970374,2.789202,-3.71338,1.903543
41,1.900607,-1.561373,2.367302,1.680957
45,-0.6534,-2.183555,0.76906,-0.763131
46,-2.196299,1.640499,-1.075587,-0.069167
53,-2.084976,0.703409,-1.543034,0.555975
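
To see how the combined condition works, here is a minimal sketch on a small hand-made frame. The name `demo` and its values are made up for illustration and are not the notebook's data. It builds the boolean mask once, then uses it both to count the outlying cells per column and to pick the rows that hold at least one of them.

```python
import pandas as pd

# Hypothetical toy data (not the notebook's random frame)
demo = pd.DataFrame({
    0: [0.5, 3.2, -2.1, 0.1],
    1: [2.8, -0.7, 0.1, 0.2],
    2: [-3.7, 0.6, 1.9, 0.3],
})

mask = demo.abs() > 2                 # True wherever |value| exceeds the threshold
per_column = mask.sum()               # how many outlying cells each column has
rows_with_outlier = mask.any(axis=1)  # True for rows with at least one outlier

print(per_column)
print(demo[rows_with_outlier])
```

Keeping the mask in its own variable makes it easy to reuse the same condition for counting, selecting, and later fixing the values.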


### 3. Fix em 

In [63]:
df[np.abs(df) > 2] = np.sign(df) * 2

Unnamed: 0,0,1,2,3
0,,,,
1,,,,
2,,,,
3,,,-2.061696,
4,3.218513,,,
...,...,...,...,...
95,,,,
96,,,,
97,,,,
98,,,,


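The syntax above is a little weird. On its own, `df[np.abs(df) > 2]` returns the whole DF shape, with NaN wherever the condition is false (that is the output shown above). The assignment is only done to the cells where the condition is true, i.e. the non-NaN values, so everything else is left untouched. But it's fine.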
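
The behaviour described above can be checked on a tiny frame. This is a sketch with made-up values (`demo`, `selected` and `capped` are illustrative names, not part of the notebook): selecting with the boolean mask keeps the original shape and fills the rest with NaN, while assigning through the same mask only overwrites the cells where the mask is True.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({"a": [0.5, 3.2, -2.6], "b": [-4.0, 1.0, 2.0]})

mask = demo.abs() > 2

# Selection: same shape as demo, NaN wherever the mask is False
selected = demo[mask]

# Assignment: work on a copy so the original stays available for comparison
capped = demo.copy()
capped[mask] = np.sign(demo) * 2  # only the masked cells are replaced by +2 or -2

print(selected)
print(capped)
```

Note that a value of exactly 2 is not greater than 2, so it is not part of the mask and is simply left as it is.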

Done, for this simple outlier.
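
As a possible alternative (not the approach used in this notebook), pandas also offers `DataFrame.clip`, which caps values at given bounds in a single call. The sketch below uses made-up data and assumes a symmetric threshold of 2, matching the dummy rule from earlier. Unlike the masked assignment, it returns a new frame instead of modifying the original.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({"a": [0.5, 3.2, -2.6], "b": [-4.0, 1.0, 2.0]})

# Values below -2 become -2, values above 2 become 2, the rest are unchanged
clipped = demo.clip(lower=-2, upper=2)

print(clipped)
```

For this simple capping rule the result should match the sign-based assignment, so which one to use is mostly a matter of taste.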