# Continuous Variable Probabilistic Methods

In [1]:
import pandas as pd

Define a function named get_lower_and_upper_bounds that takes two arguments. The first is a pandas Series. The second is the multiplier, which should default to 1.5.

In [3]:
def get_lower_and_upper_bounds(series, multiplier=1.5):
    """
    Calculate lower and upper bounds for outlier detection using the IQR rule:
    Q1 - multiplier * IQR and Q3 + multiplier * IQR.

    Args:
    - series: pandas Series containing the data.
    - multiplier: Multiplier to adjust the range for bounds. Default is 1.5.

    Returns:
    - lower_bound, upper_bound: Lower and upper bounds for outlier detection.
    """
    # Calculate the interquartile range (IQR)
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1

    # Calculate lower and upper bounds
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr

    return lower_bound, upper_bound
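
As a quick sanity check, the function can be tried on a small hand-made Series. The values below are made up for illustration and do not come from lemonade.csv, and the function is repeated in compact form only so the snippet runs on its own.

```python
import pandas as pd

def get_lower_and_upper_bounds(series, multiplier=1.5):
    # Same logic as the function above, repeated so this cell is self-contained
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Made-up values for illustration only (not from lemonade.csv)
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

lower, upper = get_lower_and_upper_bounds(sample)
print(lower, upper)

# Values outside [lower, upper] are flagged as outliers
flagged = sample[(sample < lower) | (sample > upper)]
print(flagged.tolist())
```

With pandas' default linear interpolation, Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and the bounds are -3.5 and 14.5. Only the value 100 falls outside them.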

1) Using the lemonade.csv dataset and focusing on continuous variables:

a) Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

In [4]:
df = pd.read_csv('lemonade.csv')

In [7]:
df.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17
4,1/5/17,Thursday,42.4,1.0,33,0.5,18


In [13]:
# Define the columns of interest (continuous variables)
continuous_columns = ["Temperature", "Rainfall", "Flyers", "Sales"]

# Set the multiplier for the IQR Range Rule
multiplier = 1.5

# Identify lower outliers for a column, reusing get_lower_and_upper_bounds
# so the IQR logic lives in one place
def identify_lower_outliers(column, multiplier=multiplier):
    lower_bound, _ = get_lower_and_upper_bounds(column, multiplier)
    return column[column < lower_bound]

# Identify lower outliers for each column
lower_outliers = {}
for col in continuous_columns:
    lower_outliers[col] = identify_lower_outliers(df[col])

# Print the lower outliers for each column
for col, outliers in lower_outliers.items():
    print(f"Lower Outliers in {col}:")
    print(outliers)
    print()

Lower Outliers in Temperature:
364    15.1
Name: Temperature, dtype: float64

Lower Outliers in Rainfall:
Series([], Name: Rainfall, dtype: float64)

Lower Outliers in Flyers:
324   -38
Name: Flyers, dtype: int64

Lower Outliers in Sales:
Series([], Name: Sales, dtype: int64)
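
Whether a flagged value makes sense is a domain judgment that the IQR rule cannot settle by itself. One way to support that judgment is a plausibility check: a flyer count cannot be negative, so the -38 above looks like a data-entry error, whereas a temperature of 15.1 is low but physically possible. The sketch below assumes a hypothetical helper, split_by_plausibility, and a hypothetical table of minimum valid values. Neither is part of the original notebook.

```python
import pandas as pd

# Hypothetical floors below which a value cannot be real.
# These limits are an assumption made for illustration.
MIN_VALID = {"Flyers": 0, "Rainfall": 0, "Sales": 0}

def split_by_plausibility(outliers, min_valid=MIN_VALID):
    """Split flagged outliers into plausible values and likely errors."""
    keep, drop = {}, {}
    for col, values in outliers.items():
        floor = min_valid.get(col)
        if floor is None:
            # No floor defined for this column: keep everything
            keep[col] = values
            drop[col] = values.iloc[0:0]
        else:
            keep[col] = values[values >= floor]
            drop[col] = values[values < floor]
    return keep, drop

# The two flagged values shown in the output above
flagged = {
    "Temperature": pd.Series([15.1], index=[364], name="Temperature"),
    "Flyers": pd.Series([-38], index=[324], name="Flyers"),
}

keep, drop = split_by_plausibility(flagged)
print("Keep:", keep["Temperature"].tolist())
print("Drop:", drop["Flyers"].tolist())
```

Under these assumed floors, the temperature reading is kept and the negative flyer count is set aside for correction or removal.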



b) Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?
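
Part (b) mirrors part (a) with the comparison flipped. The following is a minimal sketch, assuming a hypothetical helper named identify_upper_outliers (not in the original notebook) and a made-up frame standing in for lemonade.csv.

```python
import pandas as pd

def identify_upper_outliers(column, multiplier=1.5):
    """Return the values of `column` above Q3 + multiplier * IQR."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    upper_bound = q3 + multiplier * (q3 - q1)
    return column[column > upper_bound]

# Made-up frame for illustration only; with the real data, load it with
# pd.read_csv('lemonade.csv') and loop over the continuous columns instead.
toy = pd.DataFrame({
    "Temperature": [50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 150.0],
    "Sales": [10, 12, 14, 16, 18, 20, 22],
})

upper_outliers = {col: identify_upper_outliers(toy[col]) for col in toy.columns}

for col, outliers in upper_outliers.items():
    print(f"Upper Outliers in {col}:")
    print(outliers)
    print()
```

In the made-up frame, the Temperature upper bound works out to 95, so only the 150.0 reading is flagged. The Sales upper bound is 28, so nothing is flagged there.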