# Identifying and Removing Outliers

To identify outliers in the data, we will use [the Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/).

This means that we will flag any point that lies more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile.
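As a quick illustration of the rule, here is a minimal sketch on a toy array. The values are made up for the example and are not taken from the customer data.

```python
import numpy as np

# Toy sample (illustrative values only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

toy_outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)               # 3.25 7.75 4.5
print(lower_fence, upper_fence)  # -3.5 14.5
print(toy_outliers)              # [100]
```

Only the value 100 falls outside the fences, so it is the only point flagged.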

In [2]:
import pandas as pd
import numpy as np

In [3]:
cd ..

/home/jovyan/UCLA_CSX_450_2_2018_W/09-wholesale_customers-3


In [4]:
run src/load_data.py

In [5]:
whos DataFrame

Variable             Type         Data/Info
-------------------------------------------
customer_df          DataFrame         Fresh   Milk  Grocer<...>n\n[440 rows x 6 columns]
customer_final_df    DataFrame            Fresh      Milk  <...>n\n[435 rows x 6 columns]
customer_log_df      DataFrame             Fresh       Milk<...>n\n[440 rows x 6 columns]
customer_log_sc_df   DataFrame            Fresh      Milk  <...>n\n[440 rows x 6 columns]
customer_sc_df       DataFrame            Fresh      Milk  <...>n\n[440 rows x 6 columns]


#### Note that the multiplier in Tukey's method (`param` below) defaults to 1.5. It can be adjusted depending on how aggressively you want to catch outliers: reduce `param` to flag more points, or increase it to flag fewer.
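To see the effect of the multiplier, the sketch below counts flagged points for a few values of `param`. The helper `count_tukey_outliers` and the synthetic sample are assumptions made for this illustration; they are not part of the notebook's data or code.

```python
import numpy as np
import pandas as pd

def count_tukey_outliers(series, param=1.5):
    """Count values outside [Q1 - param*IQR, Q3 + param*IQR]."""
    q1, q3 = np.percentile(series, [25, 75])
    window = param * (q3 - q1)
    mask = (series < q1 - window) | (series > q3 + window)
    return int(mask.sum())

# Synthetic, normally distributed sample (illustration only)
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(size=1000))

# A smaller multiplier narrows the fences, so it can only flag the same or more points
counts = {param: count_tukey_outliers(sample, param) for param in (1.0, 1.5, 3.0)}
print(counts)
```

The counts never increase as `param` grows, because widening the fences can only exclude points from the outlier set.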


In [14]:
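def display_outliers(dataframe, col, param=1.5):
    """Return the rows of `dataframe` whose `col` value lies outside the Tukey fences."""
    # First and third quartiles of the column
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Distance of each fence beyond its quartile: param times the IQR
    tukey_window = param * (Q3 - Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    # A row is an outlier if it falls below the lower fence or above the upper fence
    tukey_mask = less_than_Q1 | greater_than_Q3
    return dataframe[tukey_mask]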

In [15]:
for col in customer_log_sc_df:
    print(col, display_outliers(customer_log_sc_df, col).shape)

Fresh (16, 6)
Milk (4, 6)
Grocery (2, 6)
Frozen (10, 6)
Detergents_Paper (2, 6)
Delicatessen (14, 6)


What if we count the rows that show up as an outlier in more than one column?

In [20]:
from collections import Counter

In [17]:
raw_outliers = []
for col in customer_log_sc_df:
    outlier_df = display_outliers(customer_log_sc_df, col)
    raw_outliers += list(outlier_df.index)

In [18]:
raw_outliers

[65,
 66,
 81,
 95,
 96,
 128,
 171,
 193,
 218,
 304,
 305,
 338,
 353,
 355,
 357,
 412,
 86,
 98,
 154,
 356,
 75,
 154,
 38,
 57,
 65,
 145,
 175,
 264,
 325,
 420,
 429,
 439,
 75,
 161,
 66,
 109,
 128,
 137,
 142,
 154,
 183,
 184,
 187,
 203,
 233,
 285,
 289,
 343]

#### A Counter is a container that keeps track of how many times equivalent values are added.
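A small sketch of how `Counter` behaves, using hypothetical index labels rather than the real outlier indices:

```python
from collections import Counter

# Hypothetical row labels: 7 appears twice, 3 appears three times, 12 once
flagged = [7, 3, 7, 3, 12, 3]
counts = Counter(flagged)

print(counts[3])   # 3
print(counts[12])  # 1

# Keep only the labels that were flagged more than once
repeated = [label for label, n in counts.items() if n > 1]
print(repeated)    # [7, 3]
```

Keys come back in first-seen order, which is why 7 is listed before 3.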

In [23]:
outlier_count = Counter(raw_outliers)
outliers = [k for k,v in outlier_count.items() if v > 1]
outlier_count.items()

dict_items([(65, 2), (66, 2), (81, 1), (95, 1), (96, 1), (128, 2), (171, 1), (193, 1), (218, 1), (304, 1), (305, 1), (338, 1), (353, 1), (355, 1), (357, 1), (412, 1), (86, 1), (98, 1), (154, 3), (356, 1), (75, 2), (38, 1), (57, 1), (145, 1), (175, 1), (264, 1), (325, 1), (420, 1), (429, 1), (439, 1), (161, 1), (109, 1), (137, 1), (142, 1), (183, 1), (184, 1), (187, 1), (203, 1), (233, 1), (285, 1), (289, 1), (343, 1)])

In [10]:
len(outliers)

5

In [11]:
customer_log_sc_df.shape

(440, 6)

In [12]:
outliers

[65, 66, 128, 154, 75]
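Once the repeated outliers are known, one way to remove them is `DataFrame.drop` with the list of index labels. The sketch below uses a small stand-in frame with made-up values. Whether `src/load_data.py` builds `customer_final_df` this way is not shown here, although its 435 rows would be consistent with dropping 5 of the 440 rows.

```python
import pandas as pd

# Stand-in frame with made-up values; index labels mimic customer row labels
df = pd.DataFrame(
    {"Fresh": [1.0, 2.0, 3.0, 4.0], "Milk": [0.5, 0.1, 0.7, 0.2]},
    index=[65, 66, 70, 71],
)

# Hypothetical subset of labels to remove
labels_to_drop = [65, 66]

# drop returns a new frame; the original is left unchanged
cleaned = df.drop(index=labels_to_drop)

print(cleaned.shape)        # (2, 2)
print(list(cleaned.index))  # [70, 71]
```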