# **Measures of Dispersion**

This notebook takes a practical approach to the main statistical measures of dispersion, which describe how spread out the values in a dataset are.

<br>

****
The following will be explained:

1. Range
2. Variance
3. Standard Deviation
4. Interquartile Range (IQR)

<br>

****
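
For reference, the four measures can be written as below. The notation is an assumption introduced here rather than something the notebook defines: $x_i$ are the observations, $n$ is the number of observations, $\bar{x}$ is their mean, and $Q_1$ and $Q_3$ are the 25th and 75th percentiles. The variance is shown in its sample form (dividing by $n - 1$), which is the pandas default.

$$
\begin{aligned}
\text{Range} &= x_{\max} - x_{\min} \\
s^2 &= \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
s &= \sqrt{s^2} \\
\text{IQR} &= Q_3 - Q_1
\end{aligned}
$$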

# **Let's Go!**

In [1]:
# Import Necessary Libraries
import pandas as pd
import numpy as np

In [3]:
# Our DataFrame
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Age': [28, 35, 42, 29, 55],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Purchase_Amount': [150.50, 200.75, 50.00, 320.10, 88.99],
    'Product_Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing'],
    'Satisfaction_Rating': ['Good', 'Excellent', 'Poor', 'Excellent', 'Good']
}
df = pd.DataFrame(data)

print("\nLet's look at the first few rows:\n")
print(df.head())


Let's look at the first few rows:

   CustomerID  Age  Gender  Purchase_Amount Product_Category  \
0         101   28    Male           150.50      Electronics   
1         102   35  Female           200.75         Clothing   
2         103   42    Male            50.00      Electronics   
3         104   29  Female           320.10       Home Goods   
4         105   55    Male            88.99         Clothing   

  Satisfaction_Rating  
0                Good  
1           Excellent  
2                Poor  
3           Excellent  
4                Good  
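
Not every numeric column is worth measuring: `CustomerID` is stored as integers, but it is an identifier, so its spread carries no meaning. The sketch below shows one possible way to pick the columns programmatically instead of listing them by hand. It assumes a cut-down copy of the data, and `df_demo` and `identifier_cols` are hypothetical names used only for illustration.

```python
import pandas as pd

# Cut-down, illustrative copy of the notebook's data (hypothetical name: df_demo)
df_demo = pd.DataFrame({
    "CustomerID": [101, 102, 103],
    "Age": [28, 35, 42],
    "Gender": ["Male", "Female", "Male"],
    "Purchase_Amount": [150.50, 200.75, 50.00],
})

# Every column with a numeric dtype, identifiers included
numeric_candidates = df_demo.select_dtypes(include="number").columns.tolist()

# Identifiers are numeric but not measurements, so drop them explicitly
identifier_cols = ["CustomerID"]
numerical_cols = [c for c in numeric_candidates if c not in identifier_cols]

print(numeric_candidates)
print(numerical_cols)
```

Listing the columns by hand, as the next cell does, is just as valid for a table this small. The filter only becomes useful when there are many columns.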


In [4]:
print("\n--- Measures of Dispersion ---\n")

# For numerical columns
numerical_cols = ['Age', 'Purchase_Amount']

for col in numerical_cols:
    print(f"\n--- {col} ---")

    # Range
    data_range = df[col].max() - df[col].min()
    print(f"Range: {data_range:.2f}")

    # Variance (pandas returns the sample variance by default: ddof=1, i.e. divide by n - 1)
    data_variance = df[col].var()
    print(f"Variance: {data_variance:.2f}")

    # Standard Deviation (sample standard deviation by default: ddof=1)
    data_std_dev = df[col].std()
    print(f"Standard Deviation: {data_std_dev:.2f}")

    # Interquartile Range (IQR): Q3 - Q1 (pandas interpolates linearly between data points by default)
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    iqr = Q3 - Q1
    print(f"Q1 (25th percentile): {Q1:.2f}")
    print(f"Q3 (75th percentile): {Q3:.2f}")
    print(f"IQR (Q3 - Q1): {iqr:.2f}")


--- Measures of Dispersion ---


--- Age ---
Range: 27.00
Variance: 123.70
Standard Deviation: 11.12
Q1 (25th percentile): 29.00
Q3 (75th percentile): 42.00
IQR (Q3 - Q1): 13.00

--- Purchase_Amount ---
Range: 270.10
Variance: 11125.96
Standard Deviation: 105.48
Q1 (25th percentile): 88.99
Q3 (75th percentile): 200.75
IQR (Q3 - Q1): 111.76
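
The variance above is the sample variance, which divides the sum of squared deviations by n - 1. NumPy's `np.var` defaults to the population form (`ddof=0`, dividing by n), so the two libraries give different numbers for the same data unless `ddof` is set explicitly. The sketch below rebuilds the `Age` values as a standalone Series so that it runs on its own. The variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Same Age values as the notebook's DataFrame, rebuilt so this cell is self-contained
ages = pd.Series([28, 35, 42, 29, 55], name="Age")

n = len(ages)
squared_deviations = (ages - ages.mean()) ** 2

# Sample variance: divide by n - 1 (pandas default, ddof=1)
sample_var_manual = squared_deviations.sum() / (n - 1)
sample_var_pandas = ages.var()

# Population variance: divide by n (NumPy default, ddof=0)
population_var_numpy = np.var(ages.to_numpy())
population_var_pandas = ages.var(ddof=0)

print(f"Sample variance (manual):     {sample_var_manual:.2f}")
print(f"Sample variance (pandas):     {sample_var_pandas:.2f}")
print(f"Population variance (NumPy):  {population_var_numpy:.2f}")
print(f"Population variance (pandas): {population_var_pandas:.2f}")
```

The two sample figures come out at 123.70, matching the `Age` variance printed above, while the population figures come out at 98.96. With only five observations, the choice of divisor changes the result by a fifth.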


# **Testing the Effect of Outliers on Dispersion Measures**

In [5]:
# Add one extreme row to see how range, standard deviation, and IQR respond
df_with_outlier = df.copy()
df_with_outlier.loc[5] = {
    'CustomerID': 106,
    'Age': 200,
    'Gender': 'Female',
    'Purchase_Amount': 10000,
    'Product_Category': 'Electronics',
    'Satisfaction_Rating': 'Excellent'
}

print("\n--- Measures of Dispersion with Outlier ---")
for col in numerical_cols:
    print(f"\n--- {col} (with outlier) ---")
    print(f"Range: {df_with_outlier[col].max() - df_with_outlier[col].min():.2f}")
    print(f"Standard Deviation: {df_with_outlier[col].std():.2f}")  # std is sensitive to outliers
    Q1_outlier = df_with_outlier[col].quantile(0.25)
    Q3_outlier = df_with_outlier[col].quantile(0.75)
    iqr_outlier = Q3_outlier - Q1_outlier
    print(f"IQR: {iqr_outlier:.2f}")  # IQR moves far less than range or std


--- Measures of Dispersion with Outlier ---

--- Age (with outlier) ---
Range: 172.00
Standard Deviation: 66.96
IQR: 21.25

--- Purchase_Amount (with outlier) ---
Range: 9950.00
Standard Deviation: 4017.43
IQR: 185.90
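
To put a number on how sensitive each measure is, it can help to divide the value with the outlier by the value without it. The sketch below does this for `Age` only. `dispersion_summary` is a hypothetical helper written for this illustration, and the Series are rebuilt here so that the cell runs on its own.

```python
import pandas as pd


def dispersion_summary(values: pd.Series) -> dict:
    """Return range, sample standard deviation, and IQR for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return {
        "range": values.max() - values.min(),
        "std": values.std(),
        "iqr": q3 - q1,
    }


# Age values from the notebook, without and with the extreme value 200
ages = pd.Series([28, 35, 42, 29, 55])
ages_with_outlier = pd.concat([ages, pd.Series([200])], ignore_index=True)

before = dispersion_summary(ages)
after = dispersion_summary(ages_with_outlier)

# Growth factor of each measure once the outlier is added
growth = {name: after[name] / before[name] for name in before}

for name, factor in growth.items():
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f} (x{factor:.2f})")
```

For `Age`, the range and the standard deviation both grow roughly sixfold, while the IQR grows by a factor of about 1.6. The IQR is therefore not immune to the outlier in a sample this small, but it moves far less than the other two measures.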
