# Descriptive Statistics

This script does the following:

- It defines two functions: `descriptive_stats_numeric` for numeric data and `descriptive_stats_categorical` for categorical data.
- `descriptive_stats_numeric` computes the count, minimum, maximum, mean, standard deviation, quartiles, IQR, skewness, and kurtosis, and applies Tukey's fences to count outliers.
- `descriptive_stats_categorical` computes the count, the number of unique values, the mode, and the two most common categories with their frequencies.
- `analyze_dataset` applies the appropriate statistics function to each column based on its data type.
- The example usage builds a sample dataset with mixed data types (numeric and categorical), passes it to `analyze_dataset`, and prints the results.
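
To make the dtype-based dispatch concrete, here is a minimal sketch of how `pd.api.types.is_numeric_dtype` classifies a few columns. The `samples` dictionary and its column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample columns, used only to illustrate the dtype check
samples = {
    "ints": pd.Series([1, 2, 3]),
    "floats": pd.Series([1.5, None, 3.0]),
    "strings": pd.Series(["a", "b", "a"]),
    "bools": pd.Series([True, False, True]),
}

# True means the column would be routed to the numeric statistics
is_numeric = {name: pd.api.types.is_numeric_dtype(s) for name, s in samples.items()}
print(is_numeric)
```

Boolean columns count as numeric under this check, so a flag column that should be summarized as categorical would probably need to be converted (for example to strings) before analysis.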

Key points:

- The script uses `pd.api.types.is_numeric_dtype()` to decide whether a column is numeric; every other column is treated as categorical.
- For numeric columns, it calculates statistics such as the mean, median, and standard deviation, and uses Tukey's fences to flag outliers: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where the factor 1.5 can be changed through `tukey_factor`.
- For categorical columns, it reports the number of unique values, the mode, and the counts of the two most common categories.
- Missing values are removed with `dropna()` before any statistics are calculated, so `n` counts only the non-missing values.
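
As a worked example of Tukey's fences, the sketch below applies the rule to the `age` values from the sample dataset used later in this notebook. It assumes NumPy's default (linear interpolation) percentile method and the default factor of 1.5; the variable name `k` is introduced here for illustration.

```python
import numpy as np

# The 'age' column from the sample dataset
ages = np.array([25, 30, 35, 40, 45, 50, 55, 60, 100])

# Quartiles and interquartile range
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Tukey's fences with the conventional factor of 1.5
k = 1.5
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr

# Values outside the fences are flagged as outliers
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(lower_fence, upper_fence, outliers)  # 5.0 85.0 [100]
```

A larger factor (3.0 is a common choice for "far out" values) widens the fences and flags fewer points.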

To use this with your own dataset:

- Load your data into a pandas DataFrame.
- Call `analyze_dataset(df)` with your DataFrame.
- The function returns a dictionary keyed by column name; each value is itself a dictionary of statistics for that column.
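
Because the result is a dictionary of dictionaries, one convenient way to inspect it is to turn it into a table. The sketch below is only an illustration: `results` is a hand-written, truncated stand-in for what `analyze_dataset(df)` returns, and the `summary` name is an assumption.

```python
import pandas as pd

# Hand-written, truncated stand-in for the dictionary returned by analyze_dataset(df)
results = {
    "age": {"n": 9, "median": 45.0, "n_outliers": 1},
    "gender": {"n": 9, "n_unique": 2, "mode": "M"},
}

# One row per column of the original dataset, one column per statistic;
# statistics that do not apply to a column show up as NaN
summary = pd.DataFrame.from_dict(results, orient="index")
print(summary)
```

Numeric and categorical columns produce different keys, so the combined table is sparse; building separate tables for the two kinds of columns may read better.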

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

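def descriptive_stats_numeric(data, tukey_factor=1.5):
    """Calculate descriptive statistics for numeric data.

    Expects a pandas Series or NumPy array with no missing values.
    """
    # stats.describe returns nobs, minmax, mean, sample variance (ddof=1),
    # and skewness and excess (Fisher) kurtosis without bias correction
    desc = stats.describe(data)
    percentiles = np.percentile(data, [25, 50, 75])
    q1, q3 = percentiles[0], percentiles[2]
    iqr = q3 - q1
    # Tukey's fences: values more than tukey_factor * IQR beyond the quartiles are outliers
    lower_fence = q1 - tukey_factor * iqr
    upper_fence = q3 + tukey_factor * iqr
    outliers = data[(data < lower_fence) | (data > upper_fence)]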

    return {
        "n": desc.nobs,
        "min": desc.minmax[0],
        "max": desc.minmax[1],
        "mean": desc.mean,
        "std": np.sqrt(desc.variance),
        "median": percentiles[1],
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "skewness": desc.skewness,
        "kurtosis": desc.kurtosis,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "n_outliers": len(outliers)
    }

def descriptive_stats_categorical(data):
    """Calculate descriptive statistics for categorical data"""
    # value_counts() sorts categories by frequency, most common first
    value_counts = data.value_counts()
    return {
        "n": len(data),
        "n_unique": data.nunique(),
        "mode": data.mode().iloc[0],
        "mode_count": value_counts.iloc[0],
        "second_most_common": value_counts.index[1] if len(value_counts) > 1 else None,
        "second_most_common_count": value_counts.iloc[1] if len(value_counts) > 1 else None,
    }

def analyze_dataset(df):
    """Analyze each column in the dataset"""
    results = {}

    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            results[column] = descriptive_stats_numeric(df[column].dropna())
        else:
            results[column] = descriptive_stats_categorical(df[column].dropna())

    return results

In [2]:
# Example usage
if __name__ == "__main__":
    # Create a sample dataset
    data = {
        'age': [25, 30, 35, 40, 45, 50, 55, 60, 100],
        'income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 1000000],
        'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
        'education': ['HS', 'BS', 'MS', 'PhD', 'BS', 'MS', 'HS', 'BS', 'MS']
    }

    df = pd.DataFrame(data)

    # Analyze the dataset
    results = analyze_dataset(df)

    # Print results
    # Use col_stats rather than stats so the scipy.stats import is not shadowed
    for column, col_stats in results.items():
        print(f"\nStatistics for {column}:")
        for key, value in col_stats.items():
            print(f"  {key}: {value}")


Statistics for age:
  n: 9
  min: 25
  max: 100
  mean: 48.888888888888886
  std: 22.329601678290437
  median: 45.0
  q1: 35.0
  q3: 55.0
  iqr: 20.0
  skewness: 1.3254771038292048
  kurtosis: 1.1504508034543504
  lower_fence: 5.0
  upper_fence: 85.0
  n_outliers: 1

Statistics for income:
  n: 9
  min: 50000
  max: 1000000
  mean: 186666.66666666666
  std: 305859.4448435425
  median: 90000.0
  q1: 70000.0
  q3: 110000.0
  iqr: 40000.0
  skewness: 2.4481339371556046
  kurtosis: 4.049501576996898
  lower_fence: 10000.0
  upper_fence: 170000.0
  n_outliers: 1

Statistics for gender:
  n: 9
  n_unique: 2
  mode: M
  mode_count: 5
  second_most_common: F
  second_most_common_count: 4

Statistics for education:
  n: 9
  n_unique: 4
  mode: BS
  mode_count: 3
  second_most_common: MS
  second_most_common_count: 3


In [3]:
# Cross-check the numeric columns against pandas' built-in summary
print(df.describe())

              age          income
count    9.000000        9.000000
mean    48.888889   186666.666667
std     22.329602   305859.444844
min     25.000000    50000.000000
25%     35.000000    70000.000000
50%     45.000000    90000.000000
75%     55.000000   110000.000000
max    100.000000  1000000.000000
