# Exercises: Anomaly Detection - CONTINUOUS PROBABILISTIC METHODS
<a href = "https://ds.codeup.com/anomaly-detection/continuous-probabilistic-methods/#exercises">![image.png](attachment:4de3956f-2c34-42d7-a811-8842409149fd.png)</a>

<hr style="border:2px solid gray">

Using the repo setup directions, set up a new local and remote repository named `anomaly-detection-exercises`. The local version of your repo should live inside of `~/codeup-data-science`.

Save this work in your `anomaly-detection-exercises` repo as `continuous_probabilistic_methods.py` or `continuous_probabilistic_methods.ipynb`. Then add, commit, and push your changes.

1. Define a function named `get_lower_and_upper_bounds` that takes two arguments. The first argument is a pandas Series. The second argument is the multiplier, which should have a default value of 1.5.

In [None]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
def get_lower_and_upper_bounds(s, m=1.5):
    '''
    Given a series and a cutoff value, m, returns the upper/lower outlier bounds for the
    series.
    '''
    
    q1, q3 = s.quantile([.25, 0.75])
    
    iqr = q3 - q1
    
    upper_bound = q3 + (m * iqr)
    lower_bound = q1 - (m * iqr)
        
    return lower_bound, upper_bound
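
As a quick sanity check, the function can be tried on a small made-up Series. The values below are hypothetical and are not taken from `lemonade.csv`, and the function is repeated only so that the sketch runs on its own:

```python
import pandas as pd


def get_lower_and_upper_bounds(s, m=1.5):
    '''Return the (lower, upper) IQR-based outlier bounds for a Series.'''
    q1, q3 = s.quantile([.25, .75])
    iqr = q3 - q1
    return q1 - (m * iqr), q3 + (m * iqr)


# Hypothetical toy data: seven ordinary values and one suspicious one
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

lower, upper = get_lower_and_upper_bounds(toy)

# Keep only the observations that fall outside the bounds
toy_outliers = toy[(toy < lower) | (toy > upper)]

print(f'lower bound: {lower}, upper bound: {upper}')
print(toy_outliers)
```

Here q1 is 12.75 and q3 is 16.5, so with the default multiplier the bounds work out to 7.125 and 22.125, and only the value 95 is flagged.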

# 1. Using the <a href="https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv">`lemonade.csv`</a> dataset and focusing on continuous variables:

In [None]:
df = pd.read_csv('https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/lemonade.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
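# It looks like every row is a date, so
# let's treat it like a datetime index
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
# set the datetime version of Date as our index:
df = df.set_index('Date')

In [None]:
# let's add a month feature now that we have a datetime index
# (month names are one option; df.index.month would give month numbers instead)
df['Month'] = df.index.month_name()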

In [None]:
df.head(2)

In [None]:
def get_object_cols(df):
    '''
    This function takes in a dataframe and identifies the columns that are object types
    and returns a list of those column names. 
    '''
    # get a list of the column names that have an object or category dtype
    object_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    return object_cols



def get_numeric_cols(df):
    '''
    This function takes in a dataframe and identifies the columns that are numeric types
    (anything that is not an object or category) and returns a list of those column names. 
    '''
    # get a list of the column names that are not objects or categories
    num_cols = df.select_dtypes(exclude=['object', 'category']).columns.tolist()
    
    return num_cols


In [None]:
numeric_cols = get_numeric_cols(df)
object_cols = get_object_cols(df)

In [None]:
print(f"""Numerical: {numeric_cols}
Categorical: {object_cols}""")

In [None]:
# initial visual check:
# histograms of numerical information
df[numeric_cols].hist(figsize=(12, 8), bins=20)
plt.tight_layout()

<div class="alert alert-success" role="alert">
    
## Takeaways:
    
    
</div>

In [None]:
# sales, based on day/month


In [None]:
# what about rainfall?


## 1.a Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of `lemonade.csv`, using the **multiplier of 1.5**. 
- Do these lower outliers make sense? 
- Which outliers should be kept?

In [None]:
get_lower_and_upper_bounds(df.Rainfall)

In [None]:
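# Create an empty dictionary to store outliers
outliers = {}

# Iterate over each column in the DataFrame
for col in df.columns:
    # Check if the column data type is numeric
    if pd.api.types.is_numeric_dtype(df[col]):
        # Get the lower and upper bounds for outlier detection
        lower_bound, upper_bound = get_lower_and_upper_bounds(df[col])

        # Print the lower and upper bounds for the current column
        print(f"""Lower bound for {col}: {lower_bound}
Upper bound for {col}: {upper_bound}
_________________________________________""")

        # Create a sub-dictionary for the current column in the outliers dictionary
        outliers[col] = {}

        # Store the upper and lower bounds in the sub-dictionary
        # (the 'bounds' key name is an assumption; only 'df' is used later)
        outliers[col]['bounds'] = {'upper': upper_bound, 'lower': lower_bound}

        # Find the rows in the DataFrame where the column values are outside the bounds
        outliers[col]['df'] = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    else:
        # Skip non-numeric columns
        pass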

In [None]:
outliers.keys()

In [None]:
outliers['Temperature'].keys()

In [None]:
outliers['Temperature']['df']

In [None]:
for col in numeric_cols:
    print(f"""{col}:
{outliers[col]['df']}
_____________________________________""")

## 1.b Using the IQR Range Rule with a **multiplier of 3**, identify the outliers above the upper bound in each column of `lemonade.csv`.
- Do these upper outliers make sense? 
- Which outliers should be kept?
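
One possible way to approach 1.b is sketched below. It is only an illustration: the helper name `get_upper_outliers` and the toy DataFrame are assumptions made for this example, and the values are invented rather than taken from `lemonade.csv`.

```python
import pandas as pd


def get_upper_outliers(df, m=3.0):
    '''
    For each numeric column, return the rows whose value lies above
    q3 + m * IQR. The result is a dict of {column name: DataFrame}.
    '''
    result = {}
    for col in df.select_dtypes(include='number').columns:
        q1, q3 = df[col].quantile([.25, .75])
        upper_bound = q3 + m * (q3 - q1)
        result[col] = df[df[col] > upper_bound]
    return result


# Hypothetical toy data with one implausibly high temperature
toy_df = pd.DataFrame({
    'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon'],
    'Temperature': [55, 60, 61, 63, 65, 68, 70, 212],
    'Flyers': [30, 32, 35, 38, 40, 41, 45, 50],
})

upper_outliers = get_upper_outliers(toy_df)

for col, rows in upper_outliers.items():
    print(f'{col}: {len(rows)} row(s) above the upper bound')
    print(rows)
```

The non-numeric `Day` column is skipped, and with a multiplier of 3 only the most extreme values are flagged, so each one deserves a look before deciding whether it is a data-entry error or a real observation.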

# 2. Identify if any columns in `lemonade.csv` are normally distributed. For normally distributed columns:

## 2.a Use a 2 sigma decision rule to isolate the outliers.
- Do these make sense?
- Should certain outliers be kept or removed?
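
Before applying a sigma rule, it helps to check which columns look roughly normal. Histograms are the usual first look; the sketch below adds a rough numeric heuristic based on skewness and excess kurtosis, both of which are close to 0 for a normal distribution. The helper name `looks_roughly_normal`, its thresholds, and the simulated data are all assumptions made for illustration, not an established test.

```python
import numpy as np
import pandas as pd


def looks_roughly_normal(s, max_abs_skew=0.5, max_abs_kurt=1.0):
    '''
    Heuristic only: treat a Series as roughly normal when its skewness
    and excess kurtosis are both close to 0. The thresholds are arbitrary.
    '''
    return bool(abs(s.skew()) < max_abs_skew and abs(s.kurt()) < max_abs_kurt)


# Simulated (not real) data: one bell-shaped column, one right-skewed column
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    'bell_shaped': rng.normal(loc=60, scale=10, size=1000),
    'right_skewed': rng.exponential(scale=1.0, size=1000),
})

normal_cols = [col for col in toy_df.columns if looks_roughly_normal(toy_df[col])]
print(normal_cols)
```

A check like this is no substitute for looking at the histograms, but it gives a repeatable way to shortlist columns for the sigma rules.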

In [None]:
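# Iterate over each column in the list of numeric columns
for col in numeric_cols:
    # Calculate the z-scores for the current column
    z_scores = (df[col] - df[col].mean()) / df[col].std()

    # Create a new column name for the z-scores
    # (the '_zscore' suffix is an arbitrary naming choice)
    z_col = f'{col}_zscore'

    # Add the z-scores as a new column in the DataFrame
    df[z_col] = z_scores
    print(f"""{col}:

{df[z_scores.abs() >= 2]}
_____________________________________""")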


# 3. Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv
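
The 3 sigma rule works the same way as the 2 sigma rule, with a stricter cutoff. A minimal sketch, assuming a hypothetical helper named `sigma_outliers` and made-up values rather than the real `lemonade.csv` columns:

```python
import pandas as pd


def sigma_outliers(s, n_sigma=3.0):
    '''
    Return the values of a Series whose z-score is at least n_sigma
    standard deviations away from the mean, in either direction.
    '''
    z_scores = (s - s.mean()) / s.std()
    return s[z_scores.abs() >= n_sigma]


# Hypothetical toy data: 19 values close to 20 and one extreme value
toy = pd.Series([20, 21, 19, 22, 18, 20, 21, 19, 22, 18,
                 20, 21, 19, 22, 18, 20, 21, 19, 20, 100])

three_sigma = sigma_outliers(toy, n_sigma=3)
print(three_sigma)
```

Note that with very few observations no single value can reach a z-score of 3, so the rule is only informative when a column has enough rows.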