
#Based on Chapter 5 of the book Python Machine Learning by Wei-Meng Lee

In [None]:
# There are several techniques for handling outliers; we discuss two of them:
# 1. Tukey Fences

# Tukey Fences are based on the interquartile range (IQR), the difference between the third and first quartiles of a set of values.
# The first quartile, denoted Q1, is the value in the dataset that has 25% of the values below it.
# The third quartile, denoted Q3, is the value in the dataset that has 25% of the values above it.
# Hence, by definition: IQR = Q3 - Q1
# With Tukey Fences, an outlier is any value that is less than Q1 - (1.5 * IQR) or greater than Q3 + (1.5 * IQR).

# 2. Z-Score

# The Z-Score indicates how many standard deviations a data point is from the mean.
# It is computed with the formula: Z = (Xi - mean) / standard deviation
# A negative Z-Score indicates that the data point is less than the mean.
# A positive Z-Score indicates that the data point is greater than the mean.
# A Z-Score of 0 means the data point is exactly at the mean,
# a Z-Score of 1 means the data point is 1 standard deviation above the mean, and so on.
# By the convention used here, any Z-Score greater than 3 or less than -3 is considered an outlier.
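
# As a quick check of the Z-Score formula above, the sketch below computes Z-Scores by hand
# for a small made-up sample (the values are an assumption chosen so that the mean is 5 and
# the standard deviation is 2; they are not from the book or the Galton data).

```python
import numpy as np

# Made-up sample: mean is 5.0 and population standard deviation is 2.0
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()
std = values.std()  # NumPy's default is the population standard deviation (ddof=0)

# Z = (Xi - mean) / standard deviation, applied to every value at once
z = (values - mean) / std
print(z)
```

# Values below the mean get negative Z-Scores, the two values equal to the mean get 0,
# and none of these values comes close to the +/-3 cut-off.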


#Tukey Fences

In [1]:
# Tukey Fences function 

import numpy as np

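def outliers_iqr(data):
  """Return the positions of the values that lie outside the Tukey fences."""
  q1, q3 = np.percentile(data, [25, 75])
  iqr = q3 - q1
  lower_bound = q1 - (iqr * 1.5)
  upper_bound = q3 + (iqr * 1.5)
  # np.where returns a tuple of index arrays; element 0 holds the positions
  return np.where((data > upper_bound) | (data < lower_bound))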
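
# To see the fences worked out on numbers that are easy to check by hand, here is a minimal
# sketch on a made-up sample (the sample values are an assumption, not part of the Galton data).
# It repeats the same steps inline rather than calling the function.

```python
import numpy as np

# Made-up sample: eight ordinary values and one extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = np.percentile(sample, [25, 75])  # 3.0 and 7.0 with NumPy's default interpolation
iqr = q3 - q1                             # 4.0
lower_bound = q1 - 1.5 * iqr              # -3.0
upper_bound = q3 + 1.5 * iqr              # 13.0

# Positions of the values that fall outside the fences
outlier_positions = np.where((sample < lower_bound) | (sample > upper_bound))[0]
print(sample[outlier_positions])
```

# Only the value 100 lies outside the interval from -3 to 13, so it is the only value flagged.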

In [2]:
import pandas as pd
df = pd.read_csv('http://www.mosaic-web.org/go/datasets/galton.csv')

outlier_index_array = outliers_iqr(df['height'])

print('Outliers using outliers_iqr()')
print('=============================')
for i in outlier_index_array[0]:
  print(df.iloc[[i]])  # select the row by position, keeping it as a one-row DataFrame

Outliers using outliers_iqr()
=============================
    family  father  mother sex  height  nkids
288     72    70.0    65.0   M    79.0      7


In [4]:
df.head()

   family  father  mother sex  height  nkids
0       1    78.5    67.0   M    73.2      4
1       1    78.5    67.0   F    69.2      4
2       1    78.5    67.0   F    69.0      4
3       1    78.5    67.0   F    69.0      4
4       2    75.5    66.5   M    73.5      4


#Z-Score

In [3]:
# Z-Score

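def outlier_z_score(data):
  """Return the positions of the values whose absolute Z-Score exceeds 3."""
  threshold = 3
  mean = np.mean(data)
  std = np.std(data)  # population standard deviation (ddof=0)
  z_scores = (np.asarray(data) - mean) / std  # computed for all values at once
  return np.where(np.abs(z_scores) > threshold)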
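
# The notebook does not show the Z-Score approach being applied to data, so here is a hedged
# sketch on made-up heights. The function name outlier_z_score_sketch and its threshold
# parameter are hypothetical, and the sample values are an assumption, not the Galton data.

```python
import numpy as np

def outlier_z_score_sketch(data, threshold=3):
  # Hypothetical variant with an adjustable threshold (an assumption, not from the book)
  values = np.asarray(data, dtype=float)
  z_scores = (values - values.mean()) / values.std()
  return np.where(np.abs(z_scores) > threshold)

# Made-up heights: fifty ordinary values followed by one extreme value
heights = np.array([66.0, 70.0] * 25 + [90.0])

outlier_positions = outlier_z_score_sketch(heights)[0]
print(heights[outlier_positions])
```

# Only the last value is more than 3 standard deviations from the mean, so it is the only one flagged.
# The same call should work on a numeric DataFrame column such as df['height'].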

In [22]:
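# np.where test: np.where returns row positions, which .iloc can use directly

a = np.arange(12).reshape(3,4)
test_df = pd.DataFrame(a, columns = ['A','B','C','D'])  # separate name so the Galton df is not overwritten
required_rows = np.where(test_df['A'] > 3)[0]
test_df['A'].iloc[required_rows]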

1    4
2    8
Name: A, dtype: int64
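
# The [0] in the test above is needed because np.where, when given only a condition, returns a
# tuple with one index array per dimension. The sketch below shows this and compares the result
# with a plain boolean mask; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

a = np.arange(12).reshape(3, 4)
test_df = pd.DataFrame(a, columns=['A', 'B', 'C', 'D'])

result = np.where(test_df['A'] > 3)  # a tuple holding one array of positions
positions = result[0]                # the positions themselves

# Selecting by position with .iloc gives the same rows as a boolean mask with .loc
by_position = test_df['A'].iloc[positions]
by_mask = test_df.loc[test_df['A'] > 3, 'A']
print(by_position.equals(by_mask))
```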