## Types of Statistics {#types-of-statistics}

### 1. Descriptive Statistics

**Purpose**: Summarize and describe data characteristics

**What it does**:

- Calculates the mean, median, and standard deviation
- Visualizes data with histograms and box plots
- Builds frequency tables
- Characterizes data distributions

**ML Use**: Exploratory Data Analysis (EDA) before modeling
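
As a minimal sketch of these summaries, the snippet below computes them with pandas and NumPy. The `scores` and `labels` values are hypothetical illustration data, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: six numeric scores and six categorical labels
scores = pd.Series([85, 92, 78, 88, 95, 88])
labels = pd.Series(["cat", "dog", "cat", "bird", "dog", "cat"])

mean = scores.mean()          # arithmetic mean
median = scores.median()      # middle value of the sorted scores
std = scores.std(ddof=1)      # sample standard deviation
freq = labels.value_counts()  # frequency table of the categories

# Histogram counts without plotting: three equal-width bins
counts, edges = np.histogram(scores, bins=3)

print(f"mean={mean:.2f}, median={median:.1f}, std={std:.2f}")
print(freq)
print("histogram counts:", counts)
```

`scores.describe()` returns most of these statistics in one call, which is often enough for a first look during EDA.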

### 2. Inferential Statistics

**Purpose**: Draw inferences about a population from a sample

**Key Methods**:

#### **A. Hypothesis Testing**

**Purpose**: Test whether an observed effect is unlikely to be due to chance alone

**Example**: "Does new model perform better than baseline?"

```
Null Hypothesis (H0): No difference (μ = baseline)
Alternative (H1): New model is better (μ > baseline)

If p-value < 0.05 → Reject H0 (evidence of improvement)
```

**Code Example**:


```python
from scipy.stats import ttest_1samp

baseline = 0.85
new_scores = [0.89, 0.91, 0.88, 0.90, 0.87]
# One-sided test to match H1 (mean > baseline); needs SciPy >= 1.6
t_stat, p_value = ttest_1samp(new_scores, baseline, alternative="greater")
print(f"p-value: {p_value:.4f}")  # p < 0.05 → reject H0
```


#### **B. Confidence Intervals**

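**Purpose**: Range of plausible values for the true parameter at a chosen confidence level

**Formula**: $\bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$ when $\sigma$ is known. When the standard deviation is estimated from a small sample, use $\bar{x} \pm t_{n-1} \cdot \frac{s}{\sqrt{n}}$ instead.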

**95% CI Example**:


```python
import numpy as np
from scipy.stats import t

scores = [85, 92, 78, 88, 95]
confidence = 0.95
n = len(scores)
mean = np.mean(scores)
# Standard error from the sample standard deviation (ddof=1)
std_err = np.std(scores, ddof=1) / np.sqrt(n)
# t-distribution with n-1 degrees of freedom, since sigma is estimated
ci = t.interval(confidence, n - 1, loc=mean, scale=std_err)
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
# Output: 95% CI: (79.4, 95.8)
```


#### **C. Analysis of Variance (ANOVA)**

**Purpose**: Compare means across 3+ groups

**Example**: Compare 3 ML models across datasets


```python
from scipy.stats import f_oneway

# Accuracy of each model on three datasets
modelA = [0.85, 0.87, 0.84]
modelB = [0.89, 0.91, 0.88]
modelC = [0.82, 0.84, 0.81]
# One-way ANOVA: H0 is that all group means are equal
f_stat, p_val = f_oneway(modelA, modelB, modelC)
print(f"ANOVA p-value: {p_val:.4f}")
```


#### **D. Regression Analysis**

**Purpose**: Predict continuous outcomes

**Simple Linear**: $y = \beta_0 + \beta_1 x + \varepsilon$
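
A minimal sketch of fitting the simple linear model with `scipy.stats.linregress`, which estimates the intercept ($\beta_0$) and slope ($\beta_1$) by ordinary least squares. The `x` and `y` values are hypothetical (training-set size in thousands versus validation accuracy) and are chosen only for illustration.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data: training-set size (thousands) vs. validation accuracy
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([0.62, 0.68, 0.71, 0.77, 0.80])

# Ordinary least-squares fit of y = beta_0 + beta_1 * x
result = linregress(x, y)
print(f"intercept (beta_0): {result.intercept:.3f}")  # 0.581
print(f"slope (beta_1): {result.slope:.3f}")          # 0.045
print(f"R^2: {result.rvalue ** 2:.3f}")

# Predict the outcome for a new x value using the fitted line
x_new = 6.0
y_pred = result.intercept + result.slope * x_new
print(f"prediction at x={x_new}: {y_pred:.3f}")
```

For more than one predictor, `sklearn.linear_model.LinearRegression` or `statsmodels` would be the more usual choice.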

#### **E. Chi-square Test**

**Purpose**: Test independence between categorical variables

**Example**: Gender vs Churn


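```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: expected cell counts are below 5, so the p-value is illustrative only
data = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'churn': [0, 1, 0, 1, 0, 1, 0, 1]
})
# Cross-tabulate the two categorical variables into a 2x2 table
contingency = pd.crosstab(data['gender'], data['churn'])
# H0: gender and churn are independent
chi2, p, dof, expected = chi2_contingency(contingency)
print(f"Chi-square p-value: {p:.4f}")
```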


#### **F. Sampling (Train/Test Splits)**

**Methods**: Random, Stratified, Time-based split


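```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); stratify=y preserves class ratios
X, y = make_classification(n_samples=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```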
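
The stratified split above covers only one of the listed methods. A time-based split can be sketched as follows, assuming the data has a timestamp column: sort chronologically and hold out the most recent rows, so the model is never trained on data from after the test period. The DataFrame and its column names (`date`, `feature`, `target`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical time-ordered dataset: one row per day
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "feature": np.arange(10),
    "target": np.arange(10) % 2,
})

# Sort chronologically, then hold out the last 20% of rows as the test set
df = df.sort_values("date")
split_idx = int(len(df) * 0.8)
train, test = df.iloc[:split_idx], df.iloc[split_idx:]

print(len(train), len(test))  # 8 2
```

For repeated evaluation over several chronological folds, `sklearn.model_selection.TimeSeriesSplit` follows the same idea.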


#### **G. Bayesian Statistics**

**Purpose**: Update beliefs with new evidence

**Bayes' Theorem**: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
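
As a worked illustration of the theorem, the sketch below updates the probability that an email is spam after seeing a particular word. All probabilities are hypothetical numbers chosen for easy arithmetic; $A$ is "email is spam" and $B$ is "email contains the word".

```python
# Hypothetical inputs for a toy spam filter
p_spam = 0.2              # prior P(A): share of emails that are spam
p_word_given_spam = 0.6   # likelihood P(B|A)
p_word_given_ham = 0.05   # P(B|not A)

# Evidence P(B) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
posterior = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {posterior:.3f}")  # 0.750
```

Seeing the word raises the spam probability from the prior of 0.2 to a posterior of 0.75, which is the "update beliefs with new evidence" step in numeric form.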

***
