# 02: Statistics and Exploratory Data Analysis

## Overview
This notebook covers fundamental statistical concepts and exploratory data analysis (EDA) techniques essential for understanding data before building machine learning models.

## Topics Covered:
1. Descriptive Statistics
2. Probability Distributions
3. Hypothesis Testing
4. Correlation and Covariance
5. Exploratory Data Analysis (EDA)
6. Outlier Detection
7. Bivariate Analysis
8. Distribution Analysis

## Interview Focus:
- Understanding of statistical measures
- Ability to identify data patterns
- Knowledge of statistical tests
- Data quality assessment

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully")

## 1. Descriptive Statistics

In [None]:
# Generate sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000)  # Mean=100, Std=15, n=1000

# Calculate descriptive statistics
print("Descriptive Statistics:")
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
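# Continuous values rarely repeat exactly, so the mode is taken on integer-truncated data
print(f"Mode: {stats.mode(data.astype(int), keepdims=True)[0][0]}")
# np.std and np.var default to ddof=0 (population formulas); pass ddof=1 for sample estimates
print(f"Standard Deviation: {np.std(data):.2f}")
print(f"Variance: {np.var(data):.2f}")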
print(f"Range: {np.max(data) - np.min(data):.2f}")
print(f"\nQuartiles:")
print(f"Q1 (25th percentile): {np.percentile(data, 25):.2f}")
print(f"Q2 (50th percentile - Median): {np.percentile(data, 50):.2f}")
print(f"Q3 (75th percentile): {np.percentile(data, 75):.2f}")
print(f"IQR (Interquartile Range): {np.percentile(data, 75) - np.percentile(data, 25):.2f}")

# Skewness and Kurtosis
print(f"\nSkewness: {stats.skew(data):.2f}")  # 0 = symmetric, >0 = right skew, <0 = left skew
print(f"Excess kurtosis: {stats.kurtosis(data):.2f}")  # Fisher definition: 0 = normal, >0 = heavy tails, <0 = light tails
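
NumPy's `np.std` and `np.var` divide by n unless told otherwise, while pandas' `.std()` divides by n - 1. The sketch below illustrates the gap on a small sample; the seed and sample size are arbitrary choices for illustration, not values from the cells above.

```python
import numpy as np

rng = np.random.default_rng(42)        # arbitrary seed, for reproducibility only
sample = rng.normal(100, 15, 20)       # small sample so the difference is visible
n = len(sample)

pop_std = np.std(sample)               # ddof=0: divides by n
sample_std = np.std(sample, ddof=1)    # ddof=1: divides by n - 1 (Bessel's correction)

print(f"Population formula (ddof=0): {pop_std:.3f}")
print(f"Sample formula (ddof=1):     {sample_std:.3f}")

# The two estimates differ by exactly sqrt(n / (n - 1)), which approaches 1 as n grows
print(f"Ratio: {sample_std / pop_std:.4f}, sqrt(n/(n-1)): {np.sqrt(n / (n - 1)):.4f}")
```

With n = 1000, as in the cell above, the two values agree to about three significant digits, so the choice matters mostly for small samples.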

### 1.1 Measures of Central Tendency

In [None]:
# Visualize central tendency
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(data, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(data), color='red', linestyle='--', linewidth=2, label='Mean')
plt.axvline(np.median(data), color='green', linestyle='--', linewidth=2, label='Median')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution with Central Tendency Measures')
plt.legend()

plt.subplot(1, 2, 2)
plt.boxplot(data, vert=True)
plt.ylabel('Value')
plt.title('Box Plot showing Quartiles')

plt.tight_layout()
plt.show()

## 2. Probability Distributions

In [None]:
# Common probability distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Normal Distribution
x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, 0, 1)
axes[0, 0].plot(x, y, 'b-', linewidth=2)
axes[0, 0].fill_between(x, y, alpha=0.3)
axes[0, 0].set_title('Normal Distribution (μ=0, σ=1)')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('Probability Density')

# Uniform Distribution
x_uniform = np.linspace(0, 10, 100)
y_uniform = stats.uniform.pdf(x_uniform, 2, 6)
axes[0, 1].plot(x_uniform, y_uniform, 'g-', linewidth=2)
axes[0, 1].fill_between(x_uniform, y_uniform, alpha=0.3)
axes[0, 1].set_title('Uniform Distribution (a=2, b=8)')
axes[0, 1].set_xlabel('x')
axes[0, 1].set_ylabel('Probability Density')

# Exponential Distribution
x_exp = np.linspace(0, 5, 100)
y_exp = stats.expon.pdf(x_exp, scale=1)
axes[1, 0].plot(x_exp, y_exp, 'r-', linewidth=2)
axes[1, 0].fill_between(x_exp, y_exp, alpha=0.3)
axes[1, 0].set_title('Exponential Distribution (λ=1)')
axes[1, 0].set_xlabel('x')
axes[1, 0].set_ylabel('Probability Density')

# Binomial Distribution
x_binom = np.arange(0, 21)
y_binom = stats.binom.pmf(x_binom, n=20, p=0.5)
axes[1, 1].bar(x_binom, y_binom, alpha=0.7, color='purple')
axes[1, 1].set_title('Binomial Distribution (n=20, p=0.5)')
axes[1, 1].set_xlabel('x')
axes[1, 1].set_ylabel('Probability Mass')

plt.tight_layout()
plt.show()

## 3. Hypothesis Testing

In [None]:
# T-test example
# Null hypothesis: The mean of the population is 100
# Alternative hypothesis: The mean is different from 100

sample1 = np.random.normal(100, 15, 100)
sample2 = np.random.normal(105, 15, 100)

# One-sample t-test
t_stat, p_value = stats.ttest_1samp(sample1, 100)
print("One-sample t-test:")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Result: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'}")

# Two-sample t-test
t_stat2, p_value2 = stats.ttest_ind(sample1, sample2)
print("\nTwo-sample t-test:")
print(f"T-statistic: {t_stat2:.4f}")
print(f"P-value: {p_value2:.4f}")
print(f"Result: {'Samples have different means' if p_value2 < 0.05 else 'Cannot conclude samples have different means'}")

# Chi-square test for independence
# Example: Testing if gender and preference are independent
# Note: for 2x2 tables, chi2_contingency applies Yates' continuity correction by default
observed = np.array([[30, 20], [15, 35]])
chi2, p_chi, dof, expected = stats.chi2_contingency(observed)
print("\nChi-square test:")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_chi:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Result: {'Variables are dependent' if p_chi < 0.05 else 'No evidence that variables are dependent'}")
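
To see where the chi-square statistic comes from, the expected counts can be rebuilt from the row and column totals. This is a minimal sketch on the same 2x2 table; it passes `correction=False` so that SciPy's statistic matches the uncorrected textbook formula, which means the value differs from a call that leaves Yates' correction on.

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 20], [15, 35]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n_total = observed.sum()

# Under independence, expected count = row total * column total / grand total
expected_manual = row_totals * col_totals / n_total

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p_scipy, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)

print("Expected counts:\n", expected_manual)
print(f"Manual chi-square: {chi2_manual:.4f}")
print(f"SciPy chi-square:  {chi2_scipy:.4f} (dof={dof}, p={p_scipy:.4f})")
```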

## 4. Correlation and Covariance

In [None]:
# Generate correlated data
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.5, n)  # Positively correlated
x3 = -x1 + np.random.normal(0, 0.5, n)  # Negatively correlated
x4 = np.random.normal(0, 1, n)  # No correlation

df_corr = pd.DataFrame({
    'Feature1': x1,
    'Feature2': x2,
    'Feature3': x3,
    'Feature4': x4
})

# Correlation matrix
corr_matrix = df_corr.corr()
print("Correlation Matrix:")
print(corr_matrix)

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap')
plt.show()

# Scatter plots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(x1, x2, alpha=0.5)
axes[0].set_xlabel('Feature1')
axes[0].set_ylabel('Feature2')
axes[0].set_title(f'Positive Correlation (r={np.corrcoef(x1, x2)[0,1]:.2f})')

axes[1].scatter(x1, x3, alpha=0.5, color='orange')
axes[1].set_xlabel('Feature1')
axes[1].set_ylabel('Feature3')
axes[1].set_title(f'Negative Correlation (r={np.corrcoef(x1, x3)[0,1]:.2f})')

axes[2].scatter(x1, x4, alpha=0.5, color='green')
axes[2].set_xlabel('Feature1')
axes[2].set_ylabel('Feature4')
axes[2].set_title(f'No Correlation (r={np.corrcoef(x1, x4)[0,1]:.2f})')

plt.tight_layout()
plt.show()

## 5. Exploratory Data Analysis (EDA)

In [None]:
# Create a sample dataset for EDA
np.random.seed(42)
n_samples = 500

df_eda = pd.DataFrame({
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.normal(50000, 20000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'num_accounts': np.random.poisson(3, n_samples),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    'approved': np.random.choice([0, 1], n_samples, p=[0.3, 0.7])
})

# Add some outliers
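# replace=False guarantees 10 distinct rows; the default samples with replacement and can repeat an index
df_eda.loc[np.random.choice(df_eda.index, 10, replace=False), 'income'] = np.random.uniform(150000, 200000, 10)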

print("Dataset Info:")
print(df_eda.info())
print("\nFirst few rows:")
print(df_eda.head())
print("\nDescriptive Statistics:")
print(df_eda.describe())

In [None]:
# Univariate analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Age distribution
axes[0, 0].hist(df_eda['age'], bins=20, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution')

# Income distribution
axes[0, 1].hist(df_eda['income'], bins=30, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_xlabel('Income')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Income Distribution')

# Region counts
region_counts = df_eda['region'].value_counts()
axes[1, 0].bar(region_counts.index, region_counts.values, alpha=0.7, color='orange')
axes[1, 0].set_xlabel('Region')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Distribution by Region')

# Credit score box plot
axes[1, 1].boxplot(df_eda['credit_score'])
axes[1, 1].set_ylabel('Credit Score')
axes[1, 1].set_title('Credit Score Distribution (Box Plot)')

plt.tight_layout()
plt.show()

## 6. Outlier Detection

In [None]:
# IQR method for outlier detection
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Detect outliers in income
outliers, lower, upper = detect_outliers_iqr(df_eda, 'income')
print(f"Number of outliers detected: {len(outliers)}")
print(f"Lower bound: {lower:.2f}")
print(f"Upper bound: {upper:.2f}")

# Visualize outliers
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
axes[0].boxplot(df_eda['income'])
axes[0].set_ylabel('Income')
axes[0].set_title('Income Distribution with Outliers (Box Plot)')

# Scatter plot
axes[1].scatter(range(len(df_eda)), df_eda['income'], alpha=0.5)
axes[1].axhline(y=upper, color='r', linestyle='--', label='Upper bound')
axes[1].axhline(y=lower, color='r', linestyle='--', label='Lower bound')
axes[1].scatter(outliers.index, outliers['income'], color='red', s=100, alpha=0.7, label='Outliers')
axes[1].set_xlabel('Index')
axes[1].set_ylabel('Income')
axes[1].set_title('Income with Outliers Highlighted')
axes[1].legend()

plt.tight_layout()
plt.show()

# Z-score method
from scipy.stats import zscore
z_scores = np.abs(zscore(df_eda['income']))
outliers_zscore = df_eda[z_scores > 3]
print(f"\nOutliers using Z-score method (|z| > 3): {len(outliers_zscore)}")

## 7. Bivariate Analysis

In [None]:
# Relationship between variables
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Scatter plot: Age vs Income
axes[0, 0].scatter(df_eda['age'], df_eda['income'], alpha=0.5)
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Income')
axes[0, 0].set_title('Age vs Income')

# Scatter plot: Credit Score vs Income
axes[0, 1].scatter(df_eda['credit_score'], df_eda['income'], alpha=0.5, color='green')
axes[0, 1].set_xlabel('Credit Score')
axes[0, 1].set_ylabel('Income')
axes[0, 1].set_title('Credit Score vs Income')

# Box plot: Income by Region
df_eda.boxplot(column='income', by='region', ax=axes[1, 0])
axes[1, 0].set_xlabel('Region')
axes[1, 0].set_ylabel('Income')
axes[1, 0].set_title('Income Distribution by Region')

# Box plot: Credit Score by Approved status
df_eda.boxplot(column='credit_score', by='approved', ax=axes[1, 1])
axes[1, 1].set_xlabel('Approved (0=No, 1=Yes)')
axes[1, 1].set_ylabel('Credit Score')
axes[1, 1].set_title('Credit Score by Approval Status')

# Each grouped boxplot call sets a figure-level "Boxplot grouped by ..." title,
# so clear it once, after the last call
plt.suptitle('')

plt.tight_layout()
plt.show()

## 8. Distribution Analysis

In [None]:
# Check if data follows normal distribution
from scipy.stats import normaltest

# Normality test
stat, p_value = normaltest(df_eda['income'])
print("D'Agostino's K-squared test:")
print(f"Statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Result: {'Data is not normally distributed' if p_value < 0.05 else 'No evidence against normality'}")

# Visual checks: histogram with fitted normal curve, and Q-Q plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram with normal curve
axes[0].hist(df_eda['income'], bins=30, density=True, alpha=0.7, edgecolor='black')
mu, sigma = df_eda['income'].mean(), df_eda['income'].std()
x = np.linspace(df_eda['income'].min(), df_eda['income'].max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Normal distribution')
axes[0].set_xlabel('Income')
axes[0].set_ylabel('Density')
axes[0].set_title('Income Distribution vs Normal Distribution')
axes[0].legend()

# Q-Q plot
stats.probplot(df_eda['income'], dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot')

plt.tight_layout()
plt.show()

## Interview Questions

### Q1: What's the difference between covariance and correlation?
**Answer:**
- **Covariance** measures how two variables change together but is scale-dependent
- **Correlation** is standardized covariance (ranges from -1 to 1), making it easier to interpret
- Correlation = Covariance / (σx * σy)
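
A short sketch of the scale dependence described above. The height and weight data are hypothetical, generated only for illustration: converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)                       # arbitrary seed
height_m = rng.normal(1.70, 0.10, 200)               # hypothetical heights in metres
weight_kg = 60 + 50 * (height_m - 1.70) + rng.normal(0, 5, 200)
height_cm = height_m * 100                           # same data, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

# Correlation = Covariance / (σx * σy); ddof=1 matches np.cov's default normalization
corr_from_cov = cov_m / (np.std(height_m, ddof=1) * np.std(weight_kg, ddof=1))

print(f"Covariance (m):   {cov_m:.4f}")
print(f"Covariance (cm):  {cov_cm:.4f}")
print(f"Correlation (m):  {corr_m:.4f}")
print(f"Correlation (cm): {corr_cm:.4f}")
print(f"Correlation rebuilt from covariance: {corr_from_cov:.4f}")
```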

### Q2: What are the p-value and the significance level?
**Answer:**
- **P-value**: Probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.
- **Significance level (α)**: Threshold for rejecting the null hypothesis, chosen before the test (commonly 0.05)
- If p-value < α, we reject the null hypothesis
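
The definition can be made concrete by computing a two-sided p-value by hand for a one-sample t-test and comparing it with `stats.ttest_1samp`. This is a sketch on a hypothetical sample; the seed, sample size, and means are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)            # arbitrary seed
sample = rng.normal(103, 15, 50)          # hypothetical sample
mu0 = 100                                 # mean under the null hypothesis
alpha = 0.05

n = len(sample)
# t statistic: distance of the sample mean from mu0, in standard-error units
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value: probability of a |t| at least this large under H0 (t distribution, n - 1 dof)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print(f"Manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"SciPy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
print("Reject H0" if p_manual < alpha else "Fail to reject H0")
```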

### Q3: How do you detect outliers?
**Answer:**
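- **IQR method**: Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
- **Z-score method**: Values with |z-score| > 3 (assumes roughly normal data; extreme outliers inflate the standard deviation and can mask each other)
- **Visual methods**: Box plots, scatter plots
- **Statistical tests**: Grubbs' test, Dixon's Q test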

### Q4: What is the Central Limit Theorem?
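**Answer:** The CLT states that the distribution of the mean of independent, identically distributed samples approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the population has finite variance. A common rule of thumb is n > 30, although heavily skewed populations may need larger samples.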
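
A small simulation can illustrate this. The sketch below draws repeated samples from an exponential population (strongly right-skewed) and tracks how the skewness of the sample means shrinks as n grows; the population, sample sizes, and number of repetitions are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # arbitrary seed

skews, stds = {}, {}
for n in (2, 30, 200):
    # 10,000 sample means, each computed from n draws of an Exponential(scale=1) population
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    skews[n] = stats.skew(means)
    stds[n] = means.std(ddof=1)
    # The population standard deviation is 1, so the standard error is 1 / sqrt(n)
    print(f"n={n:>3}: std of means={stds[n]:.3f} (theory {1 / np.sqrt(n):.3f}), "
          f"skewness={skews[n]:.3f}")
```

The skewness of the sample means moves toward 0 (the value for a normal distribution) as n increases, and their spread tracks σ/√n.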

### Q5: What are Type I and Type II errors?
**Answer:**
- **Type I Error (False Positive)**: Rejecting a true null hypothesis; its probability is α
- **Type II Error (False Negative)**: Failing to reject a false null hypothesis; its probability is β
- **Power**: 1 - β, probability of correctly rejecting a false null hypothesis
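
These rates can be estimated by simulation. The sketch below repeats a one-sample t-test many times, once with the null hypothesis true and once with it false; the effect size (true mean of 108 against a hypothesized 100), sample size, and seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)     # arbitrary seed
alpha, n, n_sims = 0.05, 30, 5_000

def rejection_rate(true_mean):
    """Fraction of simulated t-tests of H0: mean = 100 that reject at level alpha."""
    samples = rng.normal(true_mean, 15, size=(n_sims, n))   # one simulated study per row
    _, p_values = stats.ttest_1samp(samples, 100, axis=1)
    return np.mean(p_values < alpha)

type_i_rate = rejection_rate(100)   # H0 is true, so every rejection is a Type I error
power = rejection_rate(108)         # H0 is false, so every rejection is correct
type_ii_rate = 1 - power            # H0 is false but the test failed to reject

print(f"Estimated Type I error rate: {type_i_rate:.3f} (nominal alpha = {alpha})")
print(f"Estimated power:             {power:.3f}")
print(f"Estimated Type II rate:      {type_ii_rate:.3f}")
```

The estimated Type I rate should sit close to α, while power depends on the effect size, the noise level, and n.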

### Q6: When should you use the mean vs. the median?
**Answer:**
- **Mean**: Use when the data is roughly symmetric and has no extreme outliers
- **Median**: Use when data is skewed or has outliers (more robust)
- **Mode**: Use for categorical data or to find most frequent value
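
A tiny worked example of the robustness point, using hypothetical salary figures: one extreme value drags the mean far from the bulk of the data, while the median barely moves.

```python
import numpy as np

salaries = np.array([42_000, 45_000, 47_000, 50_000, 52_000])   # hypothetical values
with_outlier = np.append(salaries, 1_000_000)                   # add one extreme salary

print(f"Mean without outlier:   {np.mean(salaries):,.0f}")        # 47,200
print(f"Median without outlier: {np.median(salaries):,.0f}")      # 47,000
print(f"Mean with outlier:      {np.mean(with_outlier):,.0f}")    # 206,000
print(f"Median with outlier:    {np.median(with_outlier):,.0f}")  # 48,500
```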

### Q7: What are skewness and kurtosis?
**Answer:**
- **Skewness**: Measures asymmetry of distribution
  - Positive skew: Long right tail
  - Negative skew: Long left tail
  - Zero: Symmetric
- **Kurtosis**: Measures "tailedness" of distribution
  - High kurtosis: Heavy tails, more outliers
  - Low kurtosis: Light tails, fewer outliers
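
As a rough illustration, the sketch below estimates both measures on simulated samples from four distributions. The choice of distributions, the seed, and the sample size are assumptions made for this example; `stats.kurtosis` reports excess kurtosis, so a normal distribution scores about 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)     # arbitrary seed
size = 100_000

samples = {
    "normal": rng.normal(size=size),
    "exponential": rng.exponential(size=size),   # long right tail
    "uniform": rng.uniform(size=size),           # lighter tails than normal
    "laplace": rng.laplace(size=size),           # heavier tails than normal
}

results = {}
for name, values in samples.items():
    results[name] = (stats.skew(values), stats.kurtosis(values))
    print(f"{name:<12} skewness={results[name][0]:6.2f}  excess kurtosis={results[name][1]:6.2f}")
```

For reference, the theoretical values are skewness 2 and excess kurtosis 6 for the exponential, excess kurtosis -1.2 for the uniform, and excess kurtosis 3 for the Laplace distribution.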

## Practice Exercises

1. Perform a complete EDA on a real dataset (e.g., from Kaggle)
2. Conduct hypothesis testing to compare two groups
3. Identify and handle outliers using multiple methods
4. Create a correlation matrix and identify multicollinearity
5. Test for normality and apply appropriate transformations if needed
6. Implement custom functions for common statistical tests