# Statistics & Math Foundations for Data Science

This section covers the fundamental statistical concepts required for
data analysis and machine learning. The focus is on **practical
implementation** using Python rather than heavy mathematical theory.


## 1. Descriptive Statistics (Refresher)

Descriptive statistics summarize and describe the main characteristics of a dataset.
Before building machine learning models, it is important to understand:

- Central tendency (mean, median, mode)
- Spread of data (variance, standard deviation)
- Distribution of data (percentiles, IQR)
- Shape of data (skewness, kurtosis)


In [None]:
import pandas as pd
import numpy as np

# Sample dataset: Monthly sales values
data = {
    "sales": [12000, 15000, 13000, 17000, 16000, 14000, 15500, 16500, 18000, 17500]
}

df = pd.DataFrame(data)
df


### Central Tendency

Central tendency measures represent the typical or central value of a dataset.
They help answer the question: *What is a normal or average value?*


In [None]:
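mean_sales = df["sales"].mean()      # arithmetic average
median_sales = df["sales"].median()  # middle value of the sorted data
# mode() returns every most-frequent value in sorted order; all ten values here
# are unique, so .iloc[0] simply picks the smallest one
mode_sales = df["sales"].mode().iloc[0]

mean_sales, median_sales, mode_sales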


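- **Mean** is the arithmetic average of the sales values; it is sensitive to extreme values.
- **Median** is the middle value of the sorted data and is preferred when the data contains outliers.
- **Mode** is the most frequent value. Every value in this dataset is unique, so the mode is not informative here.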
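
To make the outlier point concrete, here is a minimal sketch that appends one extreme month to the same sales values and compares how far the mean and the median move. The value `100000` and the names `sales_with_outlier`, `mean_shift`, `median_shift`, and `n_modes` are assumptions introduced only for illustration.

```python
import pandas as pd

# The ten monthly sales values used in this section
sales = pd.Series([12000, 15000, 13000, 17000, 16000,
                   14000, 15500, 16500, 18000, 17500])

# Hypothetical extreme month, added only to illustrate the effect of an outlier
sales_with_outlier = pd.concat([sales, pd.Series([100000])], ignore_index=True)

# How much each measure moves once the outlier is present
mean_shift = sales_with_outlier.mean() - sales.mean()
median_shift = sales_with_outlier.median() - sales.median()

# All original values are unique, so mode() returns every one of them
n_modes = len(sales.mode())

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
print(f"number of modes in the original data: {n_modes}")
```

The mean moves by several thousand while the median barely changes, which is why the median is usually reported alongside the mean for skewed or contaminated data.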


### Spread of Data

Measures of spread show how much the data varies from the central value.


In [None]:
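# pandas uses ddof=1 by default, i.e. the sample variance (divide by n - 1)
variance_sales = df["sales"].var()
std_sales = df["sales"].std()  # square root of the sample variance

variance_sales, std_sales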


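- **Variance** is the average squared deviation from the mean. pandas computes the sample variance (dividing by n - 1) by default.
- **Standard deviation** is the square root of the variance. It is expressed in the same units as the data, which makes it easier to interpret.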
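
The definitions above can be checked by computing the variance by hand. The sketch below assumes the same ten sales values and compares the manual result with the library defaults: pandas divides by n - 1 (`ddof=1`), while NumPy's `np.var` divides by n (`ddof=0`). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

sales = pd.Series([12000, 15000, 13000, 17000, 16000,
                   14000, 15500, 16500, 18000, 17500])

# Squared deviations from the mean
squared_dev = (sales - sales.mean()) ** 2

# Sample variance divides by n - 1; population variance divides by n
sample_var = squared_dev.sum() / (len(sales) - 1)
population_var = squared_dev.sum() / len(sales)

print(sample_var, sales.var())                     # pandas default: ddof=1
print(population_var, np.var(sales.to_numpy()))    # NumPy default: ddof=0
print(sample_var ** 0.5, sales.std())              # standard deviation
```

Passing `ddof` explicitly (for example `sales.var(ddof=0)`) avoids surprises when mixing the two libraries.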


### Percentiles and Interquartile Range (IQR)

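A percentile is the value below which a given percentage of the data falls.
The 25th, 50th, and 75th percentiles are the quartiles Q1, Q2 (the median), and Q3.
The IQR, defined as Q3 - Q1, is useful for identifying outliers.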


In [None]:
q1 = df["sales"].quantile(0.25)
q3 = df["sales"].quantile(0.75)
iqr = q3 - q1

q1, q3, iqr


- The IQR is the width of the range that contains the middle 50% of the data.
- It is more robust than the range (max - min) when outliers are present, because it ignores the extreme values.
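
One common way to use the IQR for outlier detection is the 1.5 × IQR rule (Tukey's fences). The sketch below is an illustration of that rule, not a method prescribed by this section; the extra value `100000` is a hypothetical outlier added so that something gets flagged.

```python
import pandas as pd

# Sales values from this section plus one hypothetical extreme month
sales = pd.Series([12000, 15000, 13000, 17000, 16000, 14000,
                   15500, 16500, 18000, 17500, 100000])

q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```

The multiplier 1.5 is a convention; a flagged value is a candidate for inspection, not automatically an error.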


### Skewness and Kurtosis

These measures describe the shape of the data distribution.


In [None]:
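skewness = df["sales"].skew()  # sample skewness (0 for perfectly symmetric data)
kurtosis = df["sales"].kurt()  # excess kurtosis (a normal distribution is about 0)

skewness, kurtosis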


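- **Skewness** measures asymmetry: values near 0 indicate a roughly symmetric distribution, positive values a longer right tail, and negative values a longer left tail.
- **Kurtosis** indicates how heavy the tails of the distribution are. pandas' `kurt()` returns excess kurtosis, so a normal distribution scores about 0.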
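
The sign conventions are easier to remember with small examples. This sketch uses hand-made series for skewness and seeded random samples for kurtosis. The Laplace distribution is chosen here only as a convenient heavy-tailed example, and all names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Skewness: the sign follows the direction of the longer tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 3, 4, 50])    # one large value stretches the right tail
left_skewed = pd.Series([-50, 1, 2, 3, 4])    # one small value stretches the left tail

print(symmetric.skew(), right_skewed.skew(), left_skewed.skew())

# Kurtosis: pandas reports excess kurtosis, so normal data is close to 0
rng = np.random.default_rng(0)
normal_kurt = pd.Series(rng.normal(size=100_000)).kurt()
laplace_kurt = pd.Series(rng.laplace(size=100_000)).kurt()  # heavier tails than normal

print(normal_kurt, laplace_kurt)
```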


### Conclusion

Understanding descriptive statistics helps in:
- Detecting anomalies
- Choosing appropriate preprocessing techniques
- Making informed decisions before applying ML models


## 2. Probability & Distributions

Probability helps quantify uncertainty in data.  
This section introduces probability concepts and commonly used probability
distributions in data science and machine learning.


### Probability Basics

Probability measures the likelihood of an event occurring, expressed as a
number between 0 (impossible) and 1 (certain).
It forms the foundation for statistical inference and machine learning.
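
As a small illustration of these ideas, the sketch below enumerates every outcome of rolling two fair dice and computes event probabilities as favorable outcomes divided by total outcomes. The dice scenario and the helper `probability` are assumptions chosen for the example.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """Probability of an event as favorable outcomes / total outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

p_sum_7 = probability(lambda o: o[0] + o[1] == 7)
p_not_sum_7 = probability(lambda o: o[0] + o[1] != 7)
p_double = probability(lambda o: o[0] == o[1])

print(p_sum_7, p_not_sum_7, p_double)
print(p_sum_7 + p_not_sum_7)  # an event and its complement sum to 1
```

Using `Fraction` keeps the results exact, which makes the complement rule visible without floating-point noise.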


### Random Variables

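A random variable assigns a numerical value to each outcome of a random process.
It can be discrete (for example, the number of purchases) or continuous (for example, time spent on a page).
In data science, random variables model uncertainty in real-world data.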
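
A fair die gives a simple discrete random variable. The sketch below computes its expected value and variance from the definition and compares the expected value with a seeded simulation. The die example, the seed, and the sample size are assumptions made for illustration.

```python
import numpy as np

# Discrete random variable X: the face shown by a fair six-sided die
faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)

# Expected value and variance from the definitions
expected = np.sum(faces * probs)
variance = np.sum((faces - expected) ** 2 * probs)

# Simulation: the sample mean approaches the expected value as the sample grows
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # upper bound is exclusive

print(expected, variance)
print(rolls.mean())
```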


### Bernoulli Distribution

The Bernoulli distribution models a single trial with two possible outcomes:
success (1) or failure (0).
Example: Purchase made (yes/no).


In [None]:
import numpy as np

# Bernoulli trial: a binomial with n=1, probability of success = 0.6
bernoulli_data = np.random.binomial(n=1, p=0.6, size=1000)

# np.var computes the population variance (ddof=0) by default
np.mean(bernoulli_data), np.var(bernoulli_data)


- The sample mean estimates the probability of success p (0.6 here)
- The variance of a Bernoulli variable is p(1 - p), which is 0.24 here
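
The relationship between the mean and the variance of 0/1 data can be checked directly. This is a minimal sketch using NumPy's `default_rng` with a fixed seed so the run is repeatable; the seed and the sample size are arbitrary choices.

```python
import numpy as np

p = 0.6
rng = np.random.default_rng(0)

# Bernoulli samples drawn as a binomial with a single trial
trials = rng.binomial(n=1, p=p, size=100_000)

sample_mean = trials.mean()   # estimate of p
sample_var = trials.var()     # population variance (ddof=0)
theoretical_var = p * (1 - p)

print(sample_mean, sample_var, theoretical_var)
```

For data made only of 0s and 1s, the population variance equals `sample_mean * (1 - sample_mean)`, which is why the two numbers track each other so closely.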


### Binomial Distribution

The Binomial distribution models the number of successes in a fixed
number of independent trials.
Example: Number of purchases out of 10 website visits.


In [None]:
# Binomial distribution: 10 trials, success probability 0.4
binomial_data = np.random.binomial(n=10, p=0.4, size=1000)

np.mean(binomial_data), np.var(binomial_data)


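- Applies when the number of trials is fixed, the trials are independent, and each has the same success probability
- With n trials and success probability p, the mean is np and the variance is np(1 - p) (4 and 2.4 in the example above)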
- Common in A/B testing and conversion analysis
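
The simulated values can be compared with the exact Binomial probabilities. The sketch below builds the probability mass function for 10 trials with success probability 0.4 using only the standard library; the variable names are illustrative.

```python
from math import comb

n, p = 10, 0.4

# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k) for k = 0..n
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.4f} (n*p = {n * p})")
print(f"variance: {variance:.4f} (n*p*(1-p) = {n * p * (1 - p):.4f})")
print(f"P(exactly 4 purchases): {pmf[4]:.4f}")
```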


### Normal Distribution

The Normal distribution is symmetric and bell-shaped, and it is fully described
by its mean and standard deviation. It is widely used to model natural
phenomena such as heights, exam marks, and measurement errors.


In [None]:
# Normal distribution: mean = 50, standard deviation = 10
normal_data = np.random.normal(loc=50, scale=10, size=1000)

np.mean(normal_data), np.std(normal_data)


In [None]:
import matplotlib.pyplot as plt

plt.hist(normal_data, bins=30)
plt.title("Normal Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
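
A useful property to check against the histogram is the empirical rule: roughly 68%, 95%, and 99.7% of normally distributed values fall within 1, 2, and 3 standard deviations of the mean. The sketch below estimates those fractions from a seeded sample with the same mean and standard deviation as above; the seed, sample size, and helper `fraction_within` are assumptions.

```python
import numpy as np

mu, sigma = 50, 10
rng = np.random.default_rng(42)
samples = rng.normal(loc=mu, scale=sigma, size=100_000)

def fraction_within(k):
    """Fraction of samples within k standard deviations of the mean."""
    return np.mean(np.abs(samples - mu) <= k * sigma)

within_1 = fraction_within(1)
within_2 = fraction_within(2)
within_3 = fraction_within(3)

print(within_1, within_2, within_3)
```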


### Poisson Distribution

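The Poisson distribution models the number of events occurring within
a fixed interval of time or space, when events occur independently at a
constant average rate (λ). Its mean and variance both equal λ.
Example: Number of customer arrivals per hour.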


In [None]:
# Poisson distribution: average events per interval = 5
poisson_data = np.random.poisson(lam=5, size=1000)

np.mean(poisson_data), np.var(poisson_data)
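
The simulated mean and variance should both be close to 5. As a cross-check, this sketch computes the exact Poisson probabilities from the formula P(X = k) = e^(-λ) λ^k / k! using only the standard library. The sum is truncated at k = 59, an arbitrary cutoff beyond which the probabilities are negligible for λ = 5.

```python
from math import exp, factorial

lam = 5

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

# Truncate the infinite support; the tail beyond k = 59 is negligible for lam = 5
ks = range(60)
pmf = [poisson_pmf(k, lam) for k in ks]

total = sum(pmf)
mean = sum(k * pk for k, pk in zip(ks, pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in zip(ks, pmf))

print(f"total: {total:.6f}, mean: {mean:.4f}, variance: {variance:.4f}")
```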


### Introduction to Hypothesis Testing

Hypothesis testing is used to make decisions based on sample data.

- **Null Hypothesis (H₀):** Assumes no effect or no difference
- **Alternative Hypothesis (H₁):** Assumes there is an effect or difference
- **p-value:** Probability of observing results at least as extreme as those actually observed, assuming H₀ is true

If the p-value is below a significance level chosen in advance (commonly α = 0.05),
we reject the null hypothesis; otherwise, we fail to reject it.
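
To show how a p-value is obtained in a simple case, the sketch below runs an exact two-sided binomial test by hand. The scenario is hypothetical: a coin is flipped 20 times and lands heads 15 times, and H₀ states that the coin is fair (p = 0.5). The names and the choice of α = 0.05 are assumptions for illustration.

```python
from math import comb

# Hypothetical experiment: 15 heads in 20 flips; H0 says the coin is fair
n, observed, p0 = 20, 15, 0.5
alpha = 0.05

def binom_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of a result at least as extreme as the one observed (upper tail)
p_upper = sum(binom_pmf(k, n, p0) for k in range(observed, n + 1))

# Under p0 = 0.5 the distribution is symmetric, so the two-sided p-value
# is twice the upper-tail probability
p_value = 2 * p_upper
reject_null = p_value < alpha

print(f"p-value: {p_value:.4f}, reject H0: {reject_null}")
```

The p-value here is about 0.041, just under 0.05, so H₀ is rejected at that level; with a stricter α such as 0.01 it would not be.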


### Conclusion

Probability distributions help model uncertainty in data.
They are widely used in:
- Statistical inference
- Hypothesis testing
- Machine learning algorithms
