## Anomaly Detection
### Continuous Probabilistic Methods

In [1]:
# imports
import pandas as pd
import numpy as np
from pydataset import data

Define a function named get_lower_and_upper_bounds that takes two arguments. The first argument is a pandas Series. The second argument is the multiplier, which should default to 1.5.

In [17]:
# This function will work on a single column

def get_lower_and_upper_bounds(df, col, k=1.5):
    '''
    This function takes in a dataframe, a column name, and a multiplier k.
    It returns the lower bound, the upper bound, and the rows of the dataframe that fall outside those bounds.
    '''
    # Find the lower and upper quartiles
    q_25, q_75 = df[col].quantile([0.25, 0.75])
    # Find the interquartile range (IQR)
    q_iqr = q_75 - q_25
    # Find the Upper Bound
    q_upper = q_75 + (k * q_iqr)
    # Find the Lower Bound
    q_lower = q_25 - (k * q_iqr)
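    # Identify outliers on either side of the bounds
    # (a single combined mask, so the lower outliers are not overwritten by the upper ones)
    outliers = df[(df[col] < q_lower) | (df[col] > q_upper)]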
    
    return q_lower, q_upper, outliers
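
The prompt asks for a function whose first argument is a pandas Series, while the version above takes a dataframe and a column name. A minimal sketch of the Series-based variant follows. The name `get_bounds_from_series` and the toy data are hypothetical, introduced only for illustration, and are not part of the original notebook.

```python
import pandas as pd

def get_bounds_from_series(s, k=1.5):
    # Hypothetical Series-based variant of the IQR rule (name is illustrative only)
    q_25, q_75 = s.quantile([0.25, 0.75])
    iqr = q_75 - q_25
    # Return the lower and upper fences as a tuple
    return q_25 - k * iqr, q_75 + k * iqr

# Toy data, not taken from lemonade.csv
toy = pd.Series([1, 2, 3, 4, 100])
lower, upper = get_bounds_from_series(toy)

# Keep values on either side of the fences
toy_outliers = toy[(toy < lower) | (toy > upper)]
```

Returning only the bounds leaves the filtering to the caller, so the same function can be used for lower-only or upper-only checks.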

In [24]:
# This function will take in an entire dataframe and operate on a list of columns

def get_low_and_up_bounds_df(df, col_list=None, k=1.5):
    '''
    This function takes in a pandas dataframe, a list of columns, and a k value, and prints the lower and upper bounds and the outliers for each column.
    By default, col_list is all numeric columns of df and k is 1.5.
    '''
    # Resolve the default at call time, so it reflects the dataframe that was passed in
    # rather than a global df that must exist when the function is defined
    if col_list is None:
        col_list = list(df.select_dtypes(include='number'))

    for col in col_list:
        
        # Find the lower and upper quartiles
        q_25, q_75 = df[col].quantile([0.25, 0.75])
        # Find the interquartile range (IQR)
        q_iqr = q_75 - q_25
        # Find the Upper Bound
        q_upper = q_75 + (k * q_iqr)
        # Find the Lower Bound
        q_lower = q_25 - (k * q_iqr)
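        # Identify outliers on either side of the fences
        # (a single combined mask, so the lower outliers are not overwritten by the upper ones)
        outliers = df[(df[col] < q_lower) | (df[col] > q_upper)]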
        
        print('')
        print(col)
        print(f'K: {k}')
        print(f'Lower Fence: {q_lower}')
        print(f'Upper Fence: {q_upper}')
        print(f'Outliers in {col}')
        print('')
        print(outliers)
        print('-------------------------------------------------------------------')
        
        

1. Using the lemonade.csv dataset and focusing on continuous variables:

In [25]:
df = pd.read_csv('lemonade.csv')

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         365 non-null    object 
 1   Day          365 non-null    object 
 2   Temperature  365 non-null    float64
 3   Rainfall     365 non-null    float64
 4   Flyers       365 non-null    int64  
 5   Price        365 non-null    float64
 6   Sales        365 non-null    int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 20.1+ KB


In [27]:
get_low_and_up_bounds_df(df)


Temperature
K: 1.5
Lower Fence: 16.700000000000003
Upper Fence: 104.7
Outliers in Temperature

       Date       Day  Temperature  Rainfall  Flyers  Price  Sales
41  2/11/17  Saturday        212.0      0.91      35    0.5     21
---------------------------------------

Rainfall
K: 1.5
Lower Fence: 0.26
Upper Fence: 1.3
Outliers in Rainfall

         Date        Day  Temperature  Rainfall  Flyers  Price  Sales
0      1/1/17     Sunday         27.0      2.00      15    0.5     10
1      1/2/17     Monday         28.9      1.33      15    0.5     13
2      1/3/17    Tuesday         34.5      1.33      27    0.5     15
5      1/6/17     Friday         25.3      1.54      23    0.5     11
6      1/7/17   Saturday         32.9      1.54      19    0.5     13
10    1/11/17  Wednesday         32.6      1.54      23    0.5     12
11    1/12/17   Thursday         38.2      1.33      16    0.5     14
12    1/13/17     Friday         37.5      1.33      19    0.5     15
15    1/16/17     Monday  

1a. Use the IQR rule and the lower and upper bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

In [19]:
get_lower_and_upper_bounds(df, 'Sales', k=1.5)

(5.0,
 45.0,
        Date       Day  Temperature  Rainfall  Flyers  Price  Sales
 181  7/1/17  Saturday        102.9      0.47      59    0.5    143
 182  7/2/17    Sunday         93.4      0.51      68    0.5    158
 183  7/3/17    Monday         81.5      0.54      68    0.5    235
 184  7/4/17   Tuesday         84.2      0.59      49    0.5    534)

1b. Use the IQR rule and the lower and upper bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?

1c. Using the multiplier of 3, the IQR rule, and the lower and upper bounds, identify the outliers below the lower bound in each column of lemonade.csv. Do these lower outliers make sense? Which outliers should be kept?

1d. Using the multiplier of 3, the IQR rule, and the lower and upper bounds, identify the outliers above the upper bound in each column of lemonade.csv. Do these upper outliers make sense? Which outliers should be kept?
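
Questions 1a through 1d treat the lower and upper outliers separately and compare multipliers of 1.5 and 3. One possible way to do that is sketched below. The helper name `split_outliers`, the column name `value`, and the toy data are all hypothetical and are not part of the original notebook or of lemonade.csv.

```python
import pandas as pd

def split_outliers(df, col, k=1.5):
    # Hypothetical helper: returns the rows below the lower fence and the rows
    # above the upper fence as two separate dataframes
    q_25, q_75 = df[col].quantile([0.25, 0.75])
    iqr = q_75 - q_25
    lower, upper = q_25 - k * iqr, q_75 + k * iqr
    return df[df[col] < lower], df[df[col] > upper]

# Toy data chosen so that the two multipliers give different answers
toy_df = pd.DataFrame({'value': [-10, 10, 12, 14, 16, 18, 45, 500]})

low_15, high_15 = split_outliers(toy_df, 'value', k=1.5)
low_3, high_3 = split_outliers(toy_df, 'value', k=3)
```

On this toy data, k=1.5 flags -10 below and both 45 and 500 above, while k=3 flags only 500. The wider fences keep only the most extreme values.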

2. Identify if any columns in lemonade.csv are normally distributed. For normally distributed columns:

2a. Use a 2 sigma decision rule to isolate the outliers.

2b. Do these make sense?

2c. Should certain outliers be kept or removed?

3. Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv.
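
The 2 sigma and 3 sigma rules in questions 2 and 3 can be sketched with z-scores. The version below is only an illustration: the helper name `sigma_outliers` and the synthetic data are hypothetical, and it assumes the column has already been judged to be roughly normal. Note that `Series.std()` uses the sample standard deviation (ddof=1) by default.

```python
import numpy as np
import pandas as pd

def sigma_outliers(s, n_sigma=2):
    # Hypothetical helper: keeps values more than n_sigma standard deviations
    # away from the mean of the Series
    z_scores = (s - s.mean()) / s.std()
    return s[z_scores.abs() > n_sigma]

# Synthetic, roughly normal data with one extreme value appended
rng = np.random.default_rng(0)
toy = pd.Series(np.append(rng.normal(50, 5, 200), [120.0]))

two_sigma = sigma_outliers(toy, n_sigma=2)
three_sigma = sigma_outliers(toy, n_sigma=3)
```

Every value flagged at 3 sigma is also flagged at 2 sigma, so the 3 sigma result is always a subset of the 2 sigma result.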