<a href="https://colab.research.google.com/github/4dsolutions/clarusway_data_analysis/blob/main/DAwPy_S8_(Handling%20with%20Outliers)/DAwPy-S8%20(Handling%20with%20Outliers).ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a><br/>
[![nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/4dsolutions/clarusway_data_analysis/blob/main/DAwPy_S8_%28Handling%20with%20Outliers%29/DAwPy-S8%20%28Handling%20with%20Outliers%29.ipynb)


________

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" 
alt="CLRSWY"></p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:100%; text-align:center; border-radius:10px 10px;">WAY TO REINVENT YOURSELF</p>

<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:150%; text-align:center; border-radius:10px 10px;">Session - 08</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Working with Outliers</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#0)
* [HANDLING OUTLIERS](#1)
* [CATCHING & DETECTING OUTLIERS](#2)
* [REMOVING THE OUTLIERS](#3)    
* [LIMITATION & TRANSFORMATION OF THE OUTLIERS](#4)    
* [THE END OF THE SESSION - 08](#5)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Once NumPy, pandas, seaborn, and Matplotlib are installed, you can import them as libraries:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Handling Outliers</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

### What is an Outlier? [Source](https://statisticsbyjim.com/basics/remove-outliers/#:~:text=Outliers%20are%20unusual%20values%20in,what%20to%20do%20with%20them.)

In general, <b>``Outliers``</b> are **unusual values** in your dataset, and they can **distort statistical analyses and violate their assumptions**. ... Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant. Outliers can also have a disproportionate effect on statistical results such as the mean, which can lead to misleading interpretations. For example, a few unusually large values pull the mean upward and make it seem that the data values are higher than they really are.
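
As a small illustration of that effect, the sketch below compares the mean and the median of a handful of made-up numbers before and after one extreme value is added. The data and variable names are hypothetical and are not taken from the diamonds dataset.

```python
import numpy as np

# Hypothetical measurements; 100 is added as an obvious outlier.
typical = np.array([10, 12, 11, 13, 12])
with_outlier = np.append(typical, 100)

mean_before, mean_after = typical.mean(), with_outlier.mean()
median_before, median_after = np.median(typical), np.median(with_outlier)

# The mean jumps, while the median does not move at all.
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

The single extreme value more than doubles the mean, while the median stays where it was. That is why robust statistics such as the median and the quartiles are used for detecting outliers later in this session.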

### Most common causes of outliers in a data set:

- Data entry errors (human errors)
- Measurement errors (instrument errors)
- Experimental errors (data extraction or experiment planning/executing errors)
- Intentional (dummy outliers made to test detection methods)
- Data processing errors (data manipulation or data set unintended mutations)
- Sampling errors (extracting or mixing data from wrong or various sources)
- Natural (not an error, novelties in data) 

### Guideline for Handling Outliers [Source 01](https://statisticsbyjim.com/basics/remove-outliers/#:~:text=Outliers%20are%20unusual%20values%20in,what%20to%20do%20with%20them.) & [Source 02](https://www.researchgate.net/publication/258174106_Best-Practice_Recommendations_for_Defining_Identifying_and_Handling_Outliers)

How you treat an outlier depends on what caused it:

- **A measurement or data entry error:** correct the error if possible. If you can’t fix it, remove that observation because you know it’s incorrect.
- **Not a part of the population you are studying** (i.e., it has unusual properties or conditions): you can legitimately remove the outlier.
- **A natural part of the population you are studying:** you should not remove it.

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Catching and Detecting Outliers</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [None]:
sns.get_dataset_names()

In [None]:
import seaborn as sns  # already done, but harmless to do again

df = sns.load_dataset('diamonds')

More about this dataset [on Kaggle](https://www.kaggle.com/code/drvader/diamonds-dataset-exploration-and-regression/data).

![](https://i0.wp.com/www.adiamor.com/blog/wp-content/uploads/2017/04/Screen-Shot-2017-04-25-at-10.28.23-AM.png)

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.cut.nunique()

In [None]:
df.info()

In [None]:
df.isnull().sum()

`df.select_dtypes` is new here. It returns only the columns whose dtypes match the listed types.

In [None]:
df = df.select_dtypes(include = ['float64', 'int64'])  # drop category type cols
# df = df.dropna() no missing data to worry about
df
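
To see what `select_dtypes` does in isolation, here is a minimal sketch on a made-up three-row frame (the `toy` DataFrame is hypothetical). It uses the generic `"number"` selector, which should match any numeric dtype rather than only `float64` and `int64`.

```python
import pandas as pd

# Hypothetical mini-frame mixing numeric and categorical columns.
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": pd.Categorical(["Ideal", "Premium", "Good"]),
    "price": [326, 326, 334],
})

numeric_only = toy.select_dtypes(include="number")  # keeps carat and price
non_numeric = toy.select_dtypes(exclude="number")   # keeps cut

print(list(numeric_only.columns))
print(list(non_numeric.columns))
```

Both calls return new DataFrames and `toy` itself is left unchanged, which is why the notebook assigns the result back to `df`.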

In [None]:
df.sort_values(["table"]).tail(20)

In [None]:
df.corr()

### Detecting Outliers with Graphs

A boxplot shows the median and one quartile on each side of it, so the box runs from the 25th percentile (Q1) up to the 75th percentile (Q3). 

The whiskers then extend up to 1.5 times the inter-quartile range (IQR), i.e. 1.5 times the box width, beyond each edge of the box, but stop at the most extreme data point within that range in either direction.  Any data that falls outside the whiskers gets plotted as individual outlier points.
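
The boxplot geometry can be reproduced by hand. The sketch below uses ten made-up values (not the real `table` column) and assumes the default whisker length of 1.5 times the IQR. All variable names are illustrative.

```python
import numpy as np

# Hypothetical values with one extreme point at the top.
values = np.array([52, 55, 56, 57, 57, 58, 59, 60, 61, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # edges of the box
iqr = q3 - q1                             # width of the box
lower_fence = q1 - 1.5 * iqr              # furthest a whisker may reach
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the fences would be drawn as an individual point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)                # 56.25 59.75 3.5
print(lower_fence, upper_fence)   # 51.0 65.0
print(whisker_low, whisker_high)  # 52.0 61.0
print(outliers)                   # [95.]
```

Note that the whiskers end at 52 and 61, which are actual data points, not at the fences 51 and 65.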

In [None]:
df.table.describe()

In [None]:
# you don't need to know how this plotting happens.
# just focus on quarters
# remember Statistics lesson IQR

plt.figure(figsize = (15, 8))
sns.boxplot(x = df['table']);

In [None]:
# displot is figure-level and creates its own figure, so size it with height/aspect
sns.displot(df.table, kind="hist", bins=10, height=8, aspect=15 / 8);

In [None]:
df["table"]

### Detecting Outliers with Tukey's Fences | Tukey's Rule

**- First way** of specifying **``Q1 & Q3``** is to use the **``.quantile()``** method

In [None]:
Q1 = df["table"].quantile(0.25)
Q3 = df["table"].quantile(0.75)

IQR = Q3 - Q1

In [None]:
Q1

In [None]:
Q3

In [None]:
IQR

**- Second way** of specifying **``Q1 & Q3``** is to read them from the output of the **``.describe()``** method

In [None]:
df.table.describe()
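
The quartiles do not have to be read off the printed summary by eye. `.describe()` returns a Series, so they can be pulled out by label. A minimal sketch on made-up data (the `table` Series below is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df["table"].
table = pd.Series([52, 55, 56, 57, 57, 58, 59, 60, 61, 95], dtype=float)

summary = table.describe()
q1_desc = summary["25%"]  # same value as table.quantile(0.25)
q3_desc = summary["75%"]  # same value as table.quantile(0.75)
iqr_desc = q3_desc - q1_desc

print(q1_desc, q3_desc, iqr_desc)
print(np.isclose(q1_desc, table.quantile(0.25)))
```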

In [None]:
# think of the boxplot whiskers
lower_lim = Q1 - 1.5 * IQR
upper_lim = Q3 + 1.5 * IQR

In [None]:
lower_lim

In [None]:
upper_lim

In [None]:
(df.table < lower_lim).value_counts()

In [None]:
(df.table > upper_lim).value_counts()

In [None]:
df.table[(df.table < lower_lim) | (df.table > upper_lim)].count()

In [None]:
df.table[~((df.table < lower_lim) | (df.table > upper_lim))].count()

In [None]:
df.table[(df.table >= lower_lim) & (df.table <= upper_lim)].count()

In [None]:
53335 + 605
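
Rather than adding the two counts by hand, the same check can be written so that it holds for any column. The helper `tukey_fences` below is a hypothetical convenience function, not part of pandas, and the data is made up.

```python
import pandas as pd

def tukey_fences(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one low and one high extreme point.
table = pd.Series([43, 55, 56, 57, 57, 58, 59, 60, 61, 95], dtype=float)

low, high = tukey_fences(table)
is_outlier = (table < low) | (table > high)

n_outliers = int(is_outlier.sum())     # True counts as 1
n_inliers = int((~is_outlier).sum())

# Every row is either an outlier or not, so the counts must add up.
assert n_outliers + n_inliers == len(table)
print(n_outliers, n_inliers)
```

Summing a boolean mask is a shorter alternative to `.value_counts()` or `.count()` when only the number of flagged rows is needed.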

In [None]:
df.table.describe()

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Removing the Outliers</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [None]:
not_outliers = (df.table >= lower_lim) & (df.table <= upper_lim)

df[not_outliers]['table'].sort_values()

In [None]:
df.table[not_outliers].count()

In [None]:
len(df[not_outliers])

In [None]:
cleaned_df = df[not_outliers]  # new dataframe

In [None]:
cleaned_df

In [None]:
df.loc[(df.table < lower_lim) | (df.table > upper_lim)].index

In [None]:
df

In [None]:
outlier_index = df.loc[(df.table < lower_lim) | (df.table > upper_lim)].index
outlier_index

In [None]:
df.drop(outlier_index)

In [None]:
df.drop(outlier_index).table.describe()

In [None]:
cleaned_df.table.describe()
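
The two removal routes used above, filtering with a boolean mask and dropping by index label, should give identical results. The sketch below checks this on a made-up frame. `toy` and the other names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "table": [43.0, 55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 60.0, 61.0, 95.0],
    "price": list(range(10)),
})

q1 = toy["table"].quantile(0.25)
q3 = toy["table"].quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Route 1: keep rows whose value lies within the limits (both ends included).
keep = toy["table"].between(low, high)
by_mask = toy[keep]

# Route 2: collect the index labels of the outliers and drop them.
outlier_index = toy.index[~keep]
by_drop = toy.drop(outlier_index)

same = by_mask.equals(by_drop)
print(same, len(by_mask), len(toy))
```

Neither route modifies `toy`. Both return a new DataFrame, so the result has to be assigned to a name, as the notebook does with `cleaned_df`.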

In [None]:
cleaned_df.info()

In [None]:
plt.figure(figsize = (15, 8))

sns.boxplot(x = cleaned_df.table);

In [None]:
sns.displot(cleaned_df.table, bins=10, kde=False);

In [None]:
cleaned_df.table.describe()

In [None]:
df.table.describe()

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Limitation & Transformation of the Outliers</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Limitation using the ``winsorize()`` function

With winsorizing, any value of a variable beyond a chosen percentile k on either side of the variable’s distribution is replaced with the value at that percentile. For example, if k=5, all observations above the 95th percentile are recoded to the value of the 95th percentile, and all observations below the 5th percentile are recoded to the value of the 5th percentile [Source 01](https://towardsdatascience.com/detecting-and-treating-outliers-in-python-part-3-dcb54abaf7b0) & [Source 02](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.winsorize.html).
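
The definition can be sketched directly in pandas with `quantile` and `clip`. This is a hand-rolled illustration on made-up data, not the scipy function used in the rest of this section. As far as I understand, scipy's `winsorize` works from sorted positions rather than interpolated percentiles, so its cut values can differ slightly from the ones computed here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the numbers 1.0 to 100.0.
values = pd.Series(np.arange(1, 101, dtype=float))

k = 0.05                                # 5% on each side
lower_cut = values.quantile(k)          # 5th percentile
upper_cut = values.quantile(1 - k)      # 95th percentile

# Values beyond the cut points are replaced by the cut points themselves.
winsorized = values.clip(lower=lower_cut, upper=upper_cut)

print(lower_cut, upper_cut)
print(len(winsorized) == len(values))   # no rows are removed
print((winsorized == lower_cut).sum(), (winsorized == upper_cut).sum())
```

Unlike removal, winsorizing keeps every row. Only the extreme values change, which is why the cut-off values show up many times in the result.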

In [None]:
from scipy.stats.mstats import winsorize

In [None]:
df

In [None]:
df.sort_values("table")

In [None]:
plt.figure(figsize = (15, 8))
sns.boxplot(x = df['table']);

In [None]:
df.table.quantile(0.02)

In [None]:
df.table.quantile(.95)  # same as df.table.quantile(1 - .05)

In [None]:
winsorize(df.table, (0.02, 0.05))

In [None]:
lower_lim

In [None]:
# try just df.table[df.table < lower_lim], then how many?

a = len(df.table[df.table < lower_lim]) / len(df)
a

The quantiles at `a` and `1 - b` will be the new cut-off points, meaning every `table` value < 51.6 or > 63.5 will be replaced with the corresponding limit.

In [None]:
df.table.quantile(a)

In [None]:
b = len(df.table[df.table > upper_lim]) / len(df.table)
b

In [None]:
df.table.quantile(1-b)

Now winsorize with bounds derived from `lower_lim` and `upper_lim`, rather than the values 0.02 and 0.05 used earlier.

In [None]:
table_win = winsorize(df.table, (a, b))  # using percents derived from IQR "fences"
table_win

In [None]:
plt.figure(figsize = (10, 6))
sns.boxplot(x=table_win);

In [None]:
sns.displot(table_win, bins=10, kde=False);

[winsorize at scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.winsorize.html)

[What is a masked array?](https://numpy.org/doc/stable/reference/maskedarray.generic.html)

In [None]:
a = np.array([10, 4, 9, 8, 5, 3, 7, 2, 1, 6])  # simple data

In [None]:
winsorize(a, limits=[0.1, 0.2])  # consider bottom 1 and top 2 as outliers

In [None]:
ma = winsorize(a, limits=[0.1, 0.2])
type(ma)

In [None]:
ma.data

In [None]:
np.array(ma)
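
To make the small example concrete, the sketch below repeats it with the result written out. It also confirms that nothing in the returned masked array is actually masked, so the underlying data can be handed straight to pandas. The names `data`, `has_masked`, and `as_series` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

data = np.array([10, 4, 9, 8, 5, 3, 7, 2, 1, 6])

# Bottom 10% (one value) and top 20% (two values) are pulled in.
ma = winsorize(data, limits=[0.1, 0.2])

# getmaskarray always returns a full boolean array, even if nothing is masked.
has_masked = bool(np.ma.getmaskarray(ma).any())

as_series = pd.Series(ma.data)  # plain ndarray underneath, safe to wrap

print(as_series.tolist())  # [8, 4, 8, 8, 5, 3, 7, 2, 2, 6]
print(has_masked)          # False
```

The smallest value 1 becomes 2, and the two largest values 9 and 10 become 8. The input array `data` is left untouched because `winsorize` works on a copy by default.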

Back to our diamonds...

In [None]:
table_win  # returned as a masked array with everything unmasked

In [None]:
df_table_win = pd.Series(table_win) # make it a series

In [None]:
df_table_win.describe()

The upper and lower limits below were computed using 1.5 * IQR, basically the whiskers.  

In [None]:
upper_lim, lower_lim  # Q3 + 1.5 * IQR and Q1 - 1.5 * IQR (the Tukey fences)

In [None]:
df.table.describe()

In [None]:
df.table.sort_values().head(20)  # before winsorizing

In [None]:
df_table_win.sort_values().head(20) # after winsorizing

In [None]:
df_table_win.sort_values()[-610:-585] # everything after row 50313    63.4 is 63.5

In [None]:
df_table_win[df_table_win == 51.6].count() # high counts of percentile limits

In [None]:
df_table_win[df_table_win == 63.5].count() # high counts of percentile limits

## Transformation using the ``log()`` function

The **``numpy.log()``** function calculates the natural logarithm of a number or, element-wise, of every element in an input array.

The natural logarithm is the inverse of the exponential function, so that log(exp(x)) = x. It is the logarithm in base e [Source 01](https://www.geeksforgeeks.org/numpy-log-python/#:~:text=The%20numpy.,is%20log%20in%20base%20e.) & [Source 02](https://numpy.org/doc/stable/reference/generated/numpy.log.html).
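
A quick sketch of both properties on three made-up values: the round trip through `np.log` and `np.exp` returns the input, and the log scale pulls large values in much more than small ones.

```python
import numpy as np

# Hypothetical carat-like values.
x = np.array([0.2, 1.0, 5.0])

logged = np.log(x)           # element-wise natural logarithm
round_trip = np.exp(logged)  # exp undoes log

spread_raw = x.max() - x.min()            # distance between extremes, raw scale
spread_log = logged.max() - logged.min()  # same distance on the log scale

print(np.allclose(round_trip, x))
print(logged)
print(spread_raw, spread_log)
```

On the raw scale, 5.0 sits much further from 1.0 than 0.2 does. On the log scale, the two are equally far from log(1.0) = 0, which is what makes a right-skewed column like `carat` look more symmetric after the transformation.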

In [None]:
df.info()

In [None]:
df["carat"].sort_values()

In [None]:
plt.figure(figsize = (10, 6))

sns.boxplot(x=df.carat);

In [None]:
sns.displot(df.carat, bins=10, kde=False);

In [None]:
np.log(df.carat).sort_values()

In [None]:
np.e ** 1.611

In [None]:
plt.figure(figsize = (10, 6))
plt.plot(df.carat.sort_values(), np.log(df.carat.sort_values()))
plt.xlabel("carat")
plt.ylabel("natural log of carat");

In [None]:
plt.figure(figsize = (10, 6))

sns.boxplot(x = np.log(df.carat));

In [None]:
sns.displot(np.log(df.carat),  bins=10, kde=False);

In [None]:
df["carat_log"] = np.log(df.carat)
df

## Removing outliers after log() transformation

In [None]:
Q1 = df.carat_log.quantile(0.25)
Q3 = df.carat_log.quantile(0.75)

IQR = Q3 - Q1

In [None]:
Q1

In [None]:
np.e ** -0.916290

In [None]:
Q3

In [None]:
IQR

In [None]:
lower_lim = Q1 - 1.5 * IQR
upper_lim = Q3 + 1.5 * IQR

In [None]:
lower_lim

In [None]:
upper_lim

In [None]:
np.e ** lower_lim, np.e ** upper_lim
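
Mapping the limits back with the exponential shows what they mean in carats. The sketch below compares limits computed on the raw scale with limits computed on the log scale and then back-transformed. It uses twelve made-up values and a hypothetical helper `tukey_fences`, so the numbers are only illustrative.

```python
import numpy as np
import pandas as pd

def tukey_fences(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical right-skewed carat values.
carat = pd.Series([0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.2, 1.5, 2.0, 5.0])

raw_low, raw_high = tukey_fences(carat)
log_low, log_high = tukey_fences(np.log(carat))

# np.exp(x) is equivalent to np.e ** x; it maps the log limits back to carats.
back_low, back_high = np.exp(log_low), np.exp(log_high)

n_raw = int(((carat < raw_low) | (carat > raw_high)).sum())
n_log = int(((carat < back_low) | (carat > back_high)).sum())

print(raw_low, raw_high)    # lower limit is negative on the raw scale
print(back_low, back_high)  # always positive after back-transforming
print(n_raw, n_log)
```

In this toy data the largest stone is flagged on the raw scale but not on the log scale. The raw lower limit is negative, which no carat value can reach, while the back-transformed lower limit is a small positive weight.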

In [None]:
(df.carat_log < lower_lim).value_counts()

In [None]:
(df.carat_log > upper_lim).value_counts() # or try .sum()

In [None]:
(df.carat_log < lower_lim).sum()  # strictly below the limit, matching the check above

In [None]:
df.loc[(df.carat_log > upper_lim)]

In [None]:
outlier_index = df.loc[(df.carat_log > upper_lim)].index
outlier_index

In [None]:
df.drop(outlier_index)

In [None]:
not_outliers = (df.carat_log <= upper_lim)

In [None]:
len(df[not_outliers])

In [None]:
cleaned_df = df[not_outliers]

In [None]:
cleaned_df

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:150%; text-align:center; border-radius:10px 10px;">The End of The Session - 08</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

____