# Statistical models

<img src="figs_III/xkcd_III.png">

## Some parametric models

- Normal (Gaussian) Distribution ($\mu$, $\sigma^2$) – Defined by its mean $\mu$ and variance $\sigma^2$, often used for modeling continuous data.
- Binomial Distribution ($n$, $p$) – Models the number of successes in $n$ independent trials with success probability $p$.
- Poisson Distribution ($\lambda$) – Models the count of events occurring in a fixed interval with rate $\lambda$.
- Exponential Distribution ($\lambda$) – Models waiting times between independent events with rate $\lambda$.
- Beta Distribution ($\alpha$, $\beta$) – A flexible distribution for probabilities, parameterized by $\alpha$ and $\beta$.
- Gamma Distribution – A generalization of the exponential distribution, often used for waiting times and reliability analysis.
- Log-Normal Distribution – Models data where the logarithm follows a normal distribution.
- Weibull Distribution – Used in survival analysis and reliability engineering, parameterized by shape and scale.
- Linear Regression – Assumes a linear relationship between input variables and output with normally distributed errors.
- Logistic Regression – Models binary outcomes using a sigmoid function to estimate probabilities.
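
What makes these models parametric is that a handful of numbers describes the whole distribution. A minimal sketch, assuming synthetic data drawn from a normal population with $\mu=5$ and $\sigma=2$ (values chosen only for illustration), fits the two parameters by maximum likelihood with `scipy.stats.norm.fit`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data; the "true" parameters mu = 5 and sigma = 2 are assumptions.
data = rng.normal(loc=5.0, scale=2.0, size=5_000)

# Fitting the model means estimating its two parameters from the data.
# norm.fit returns the maximum-likelihood estimates (location, scale).
mu_hat, sigma_hat = stats.norm.fit(data)

print(f"mu_hat = {mu_hat:.2f}, sigma_hat = {sigma_hat:.2f}")
```

The estimates should land close to the values used to generate the data, and they get closer as the sample grows.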



## Some non-parametric models

- Kernel Density Estimation (KDE) – Estimates a probability density function using a smoothing kernel (e.g., Gaussian).
- Histogram Density Estimation – Approximates a distribution by counting data points in fixed-width bins.
- K-Nearest Neighbors (KNN) – Classifies a point based on the majority label of its $k$ nearest neighbors.
- Decision Trees – Splits data into hierarchical regions based on feature values for classification or regression.
- Random Forests – An ensemble of decision trees that improves accuracy and reduces overfitting.
- Support Vector Machines (SVM) with Kernel Trick – Maps data into a higher-dimensional space using kernels for non-linear classification.
- Locally Weighted Regression – Fits local regressions to smooth data without assuming a global functional form.
- Spline Regression – Uses piecewise polynomials to fit flexible curves to data.
- Empirical Cumulative Distribution Function (ECDF) – Estimates the cumulative distribution directly from observed data.
- Bootstrap Resampling – Estimates properties (e.g., confidence intervals) by repeatedly sampling with replacement.
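
Non-parametric methods let the data determine the shape instead of a fixed formula. The sketch below illustrates two items from the list, KDE and the ECDF, on a synthetic bimodal sample (the mixture used to generate it is an assumption). The helper `ecdf` is a hypothetical name introduced here, not a library function:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic bimodal sample: 30% around -2 and 70% around 3 (assumed values).
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# Kernel density estimate with a Gaussian kernel (default bandwidth rule).
kde = stats.gaussian_kde(data)
grid = np.linspace(-5.0, 7.0, 241)
density = kde(grid)


def ecdf(sample, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return np.mean(np.asarray(sample) <= x)


# The estimated density should integrate to roughly 1 over the grid.
area = np.sum(density) * (grid[1] - grid[0])
print(f"area under KDE ~ {area:.3f}")
print(f"ECDF at 0: {ecdf(data, 0.0):.3f}")
```

No single normal distribution could capture both modes, while the KDE and the ECDF adapt to them without any change to the code.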

# Regression, prediction, classification

## Regression: Predicting House Prices
The goal is to predict the price of a house based on features like size, number of bedrooms, and location.\
Linear regression model:
$$\text{Price} = \beta_0+\beta_1\cdot\text{Size}+\beta_2\cdot\text{Bedrooms}+\epsilon$$\
The result is a continuous value (e.g., \$250,000 for a house).
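
As a rough illustration, the coefficients of this model can be estimated by ordinary least squares. The data below are synthetic, and the "true" coefficients in `beta_true` are assumptions made up to generate a toy data set; they are not real housing figures.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic features (assumed ranges): size in square meters, 1 to 5 bedrooms.
size = rng.uniform(50, 250, n)
bedrooms = rng.integers(1, 6, n)

# Assumed coefficients, used only to generate the toy prices.
beta_true = np.array([50_000.0, 1_500.0, 10_000.0])
noise = rng.normal(0.0, 10_000.0, n)
price = beta_true[0] + beta_true[1] * size + beta_true[2] * bedrooms + noise

# Design matrix with an intercept column; solve the least-squares problem.
X = np.column_stack([np.ones(n), size, bedrooms])
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

# Predict the price of a hypothetical 120 m^2 house with 3 bedrooms.
new_house = np.array([1.0, 120.0, 3.0])
predicted = new_house @ beta_hat
print("estimated coefficients:", np.round(beta_hat, 1))
print(f"predicted price: {predicted:,.0f}")
```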

## Prediction: Forecasting Temperature
The aim here is to predict tomorrow’s temperature based on historical weather data.\
One approach is time-series forecasting with ARIMA (autoregressive integrated moving average) or a neural network such as an LSTM.\
The result is again a continuous value (e.g., "Tomorrow’s temperature will be 75°F").

## Classification: Spam Detection
The goal is to classify emails as "Spam" or "Not Spam" based on their content and metadata.\
$$X=\{\text{content, metadata}\},\quad Y=\{\text{label}\}$$
Models we can use include logistic regression, a decision tree, or a neural network.\
The output is a binary label (e.g., "Spam" or "Not Spam").
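
To make the classification setting concrete, here is a minimal sketch of logistic regression trained by plain gradient descent. The two features (number of suspicious words and number of links) and the way the toy data are generated are assumptions for illustration only; a real spam filter would use far richer features and a tested library.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Synthetic emails: spam tends to have more suspicious words and more links.
is_spam = rng.integers(0, 2, n)
suspicious_words = rng.poisson(1.0 + 4.0 * is_spam)
links = rng.poisson(0.5 + 2.0 * is_spam)

X = np.column_stack([np.ones(n), suspicious_words, links]).astype(float)
y = is_spam.astype(float)


def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Gradient descent on the average logistic (cross-entropy) loss.
w = np.zeros(X.shape[1])
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / n
    w -= 0.1 * grad

# Turn estimated probabilities into binary labels with a 0.5 threshold.
p_spam = sigmoid(X @ w)
labels = np.where(p_spam >= 0.5, "Spam", "Not Spam")
accuracy = np.mean((p_spam >= 0.5) == (y == 1))
print(f"training accuracy: {accuracy:.2f}")
```

The model outputs a probability, and the binary label only appears once a threshold is applied. That threshold is a design choice and can be moved if false positives are more costly than false negatives.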

- Continuation: Data mining and machine learning course

# Point estimation with bias example
Goal: Estimate the variance of a population based on a sample.\
Suppose we have a random sample $X_1,X_2,\ldots,X_n$ from a population with true variance $\sigma^2$.\
The estimator is the (uncorrected) sample variance:\
$$S_n^2=\frac{1}{n}\sum^n_{i=1} \left(X_i-\overline{X}\right)^2$$

The bias comes into play with the expectation of $S_n^2$:\
$$\mathbb{E}(S_n^2)=\frac{n-1}{n}\sigma^2$$

This shows that $S_n^2$ underestimates $\sigma^2$ on average (by $\sigma^2/n$), making it a biased estimator.\
The unbiased sample variance corrects this bias by dividing by $n-1$ instead of $n$:\
$$\tilde{S}_n^2=\frac{1}{n-1}\sum^n_{i=1} \left(X_i-\overline{X}\right)^2=\frac{n}{n-1}S_n^2$$
which ensures $$\mathbb{E}(\tilde{S}_n^2)=\sigma^2.$$
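
The bias can also be checked by simulation. The sketch below assumes a normal population with $\sigma^2=4$ and samples of size $n=5$ (both values chosen only for illustration) and averages the two estimators over many repeated samples. In NumPy, `ddof=0` divides by $n$ and `ddof=1` divides by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(4)

sigma2 = 4.0      # assumed true population variance
n = 5             # sample size
reps = 200_000    # number of repeated samples

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

# Average each estimator over all repeated samples.
biased = samples.var(axis=1, ddof=0).mean()     # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()   # divides by n - 1

print(f"divide by n:     {biased:.3f}  (theory: {(n - 1) / n * sigma2:.3f})")
print(f"divide by n - 1: {unbiased:.3f}  (theory: {sigma2:.3f})")
```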

# Mean Squared Error calculation
The task: We want to estimate the mean of a population using a biased estimator. We compare it to an unbiased estimator and compute the Mean Squared Error (MSE).
1. Calculating the true population mean:\
    The true population values are:\
    $$X=\{3,5,7,9,11\}$$\
    The true mean is:\
    $$\mu=\frac{3+5+7+9+11}{5}=\frac{35}{5}=7$$
2. Calculating the biased and unbiased estimators:\
    Unbiased estimator ($\hat{\mu}_U$): The sample mean using all $n=5$ observations. Because the sample here is the whole population, it equals the true mean: $\hat{\mu}_U=7$.\
    Biased estimator ($\hat{\mu}_B$): Instead of using all data points, let's define our estimator as the mean of only the first four values:\
    $$\hat{\mu}_B=\frac{3+5+7+9}{4}=\frac{24}{4}=6$$\
    The bias is therefore (if we always choose the first four values):\
    $$Bias(\hat{\mu}_B)=\mathbb{E}[\hat{\mu}_B]-\mu=6-7=-1.$$
3. Computing the Mean Squared Error (MSE):\
    $$MSE\left(\hat{\theta}\right)=\mathbb{E}\left[\left(\hat{\theta}-\theta\right)^2\right]$$\
    Since our biased estimator always gives 6, the squared error is:\
    $$\left(\hat{\theta}-\theta\right)^2 = (6-7)^2=(-1)^2=1$$\
    Since this estimator does not change across samples (always choosing the first four values), the expected squared error is also 1, so:
    $$MSE(\hat{\mu}_B)=1$$\
    If we used the unbiased estimator ($\hat{\mu}_U=7$), we would get:\
    $$MSE(\hat{\mu}_U)=\mathbb{E}\left[(7-7)^2\right]=0$$\
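    showing that, in this example, the unbiased estimator performs better: neither estimator varies from sample to sample, so each MSE equals the squared bias. In general, $MSE(\hat{\theta})=Bias(\hat{\theta})^2+Var(\hat{\theta})$.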
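
The arithmetic above can be reproduced in a few lines. The second half of the sketch goes one step beyond the text and is an assumption added for illustration: instead of always taking the first four values, it averages four values chosen at random without replacement. That estimator is unbiased but has non-zero variance, so its MSE comes entirely from the variance.

```python
from itertools import combinations

import numpy as np

population = np.array([3, 5, 7, 9, 11])
mu = population.mean()

# Estimator from the text: always the mean of the first four values.
mu_b = population[:4].mean()
bias_b = mu_b - mu
mse_b = (mu_b - mu) ** 2  # no randomness, so the MSE is the squared bias

# Illustrative variation (not in the text): mean of 4 values drawn at random
# without replacement. Enumerate all 5 equally likely subsets.
estimates = np.array([np.mean(c) for c in combinations(population, 4)])
bias_r = estimates.mean() - mu
var_r = estimates.var()
mse_r = np.mean((estimates - mu) ** 2)

print(f"first four:    bias = {bias_b}, MSE = {mse_b}")
print(f"random subset: bias = {bias_r}, variance = {var_r}, MSE = {mse_r}")
```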



Why the MSE matters:

- Minimizing the MSE reduces the overall estimation error by balancing bias and variance.
- A lower MSE means the estimator is closer to the true value on average.
- Because errors are squared, large errors are penalized more heavily than small ones, which discourages estimators that are occasionally far off.
- By minimizing the MSE, we improve the accuracy and reliability of our statistical model or estimator.
- The same idea extends to loss-function optimization in machine learning tasks.

<img src="figs_III/lossfunc.png">

# Sample mean as an asymptotically normal estimator

Let $X_1,X_2,\ldots,X_n$ be a random sample from a population with unknown mean $\mu$ and finite variance $\sigma^2$. We estimate the mean with the sample mean:\
$$\hat{\mu}_n=\frac{1}{n}\sum^n_{i=1}X_i$$
1. Checking for asymptotic normality:\
    According to the Central Limit Theorem (CLT), as $n\rightarrow\infty$, the centered and scaled sample mean converges in distribution to a normal distribution:\
    $$\sqrt{n}\left(\hat{\mu}_n-\mu\right)\xrightarrow{d}\mathcal{N}(0,\sigma^2)$$\
    This means that, for large $n$, the sample mean $\hat{\mu}_n$ is approximately normally distributed, $\hat{\mu}_n\approx\mathcal{N}(\mu,\sigma^2/n)$, even if the original data are not normally distributed.
2. We have a population with $\mu=50$ (true mean) and $\sigma^2=100$ (population variance). We take a random sample of size $n=100$ and calculate the sample mean $\hat{\mu}_n$.\
    The variance of the sample mean is:\
    $$Var(\hat{\mu}_n)=\frac{\sigma^2}{n}$$\
    By the CLT, for large $n$:\
    $$\hat{\mu}_{100}\approx\mathcal{N}\left(50,\frac{100}{100}\right)=\mathcal{N}(50,1)$$\
    So, if we take repeated samples and compute sample means, they will be approximately normal with mean 50 and variance 1.
    
- This property is useful for constructing confidence intervals and hypothesis tests.
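
The numerical example can be checked with a small simulation. To show that normality of the data is not required, the sketch assumes a clearly skewed population: a shifted exponential distribution with mean 50 and variance 100. This choice of population is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 100          # sample size
reps = 20_000    # number of repeated samples

# Skewed population: 40 + Exponential(scale=10) has mean 50 and variance 100.
samples = 40.0 + rng.exponential(scale=10.0, size=(reps, n))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
var_of_means = sample_means.var(ddof=1)

# Under N(50, 1), about 95% of sample means fall within 1.96 of 50.
within = np.mean(np.abs(sample_means - 50.0) <= 1.96)

print(f"mean of sample means:     {mean_of_means:.3f}")
print(f"variance of sample means: {var_of_means:.3f}")
print(f"share within 50 +/- 1.96: {within:.3f}")
```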

# Confidence interval

## Simple calculation: small dataset
Task: We measure the heights (in cm) of 5 people: $X=\{160,165,170,175,180\}$. We want to estimate the population mean height with a $95\%$ confidence interval.
1. Computing the sample mean:\
    $$\overline{X}=\frac{160+165+170+175+180}{5}=\frac{850}{5}=170$$
2. Computing the sample standard deviation:\
    $$s=\sqrt{\frac{(160-170)^2+(165-170)^2+(170-170)^2+(175-170)^2+(180-170)^2}{5-1}}=\sqrt{\frac{250}{4}}\approx 7.91$$
3. Computing the standard error (the unknown $\sigma$ is replaced by its estimate $s$):\
    $$se\left(\overline{X}\right)=\sqrt{\frac{s^2}{n}}=\sqrt{\frac{7.91^2}{5}}\approx 3.54$$
4. Find the $t$-critical value:\
    A $t$-based interval is generally preferable for small datasets ($n<30$) when we do not know the population standard deviation.\
    The interval is two-sided. For a $95\%$ confidence level with 4 degrees of freedom ($n-1=5-1=4$), the $t$-value (from a $t$-table) is 2.776.
5. Computing the confidence interval:\
    $$\overline{X}\pm (t\cdot se)=170\pm(2.776\cdot3.54)=170\pm9.83=(160.17,179.83)$$\
    With $95\%$ confidence, the population mean height is between 160.17 cm and 179.83 cm.
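
The same steps can be sketched with NumPy and SciPy. Because the code does not round intermediate values, the endpoints may differ from the hand calculation in the second decimal place.

```python
import numpy as np
from scipy import stats

heights = np.array([160, 165, 170, 175, 180])
n = heights.size

mean = heights.mean()
s = heights.std(ddof=1)          # sample standard deviation (divides by n - 1)
se = s / np.sqrt(n)              # estimated standard error of the mean

# Two-sided 95% critical value of the t-distribution with n - 1 degrees of freedom.
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)

margin = t_crit * se
lower, upper = mean - margin, mean + margin
print(f"mean = {mean:.2f}, s = {s:.2f}, se = {se:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```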
    
## Normal-based confidence interval calculation: large dataset
Task: We survey 100 students and find: sample mean test score: $\overline{X}=75$, known population standard deviation: $\sigma=10$, confidence level: $95\%$.
1. Computing the standard error (because $\sigma$ is known, we do not need to estimate it from the sample):\
    $$se=\sqrt{\frac{\sigma^2}{n}}=\sqrt{\frac{100}{100}}=1$$
2. Find the $z$-critical value:\
    A $z$-based interval is generally preferable for larger datasets ($n\geq30$) and when we know the population standard deviation.\
    For $95\%$ confidence, the $z$-value (from a $z$-table) is 1.96.
3. Computing the confidence interval:
    $$\overline{X}\pm(z\cdot se)=75\pm(1.96\cdot1)=75\pm1.96=(73.04,76.96)$$\
    With 95% confidence, the population mean test score is between 73.04 and 76.96.
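
As a cross-check, the normal-based interval can be obtained in one call to `scipy.stats.norm.interval`, passing the confidence level positionally together with the sample mean as `loc` and the standard error as `scale`. This is a sketch of the calculation above, not a general-purpose routine.

```python
import math

from scipy import stats

n = 100
x_bar = 75.0     # sample mean test score
sigma = 10.0     # known population standard deviation

se = sigma / math.sqrt(n)

# Central 95% interval of a normal distribution centered at x_bar with scale se.
lower, upper = stats.norm.interval(0.95, loc=x_bar, scale=se)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```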

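The critical values used above can be computed with SciPy:

```python
from scipy import stats

alpha = 0.05  # 95% confidence level
# Two-sided t critical value, 4 degrees of freedom (one-sided: q = 1 - alpha).
t_value = stats.t.ppf(q=1 - alpha / 2, df=4)
print(f"t-value: {t_value}")

# Two-sided z critical value from the standard normal distribution.
z_value = stats.norm.ppf(1 - alpha / 2)
print(f"Z-critical value: {z_value}")
```

```
t-value: 2.7764451051977987
Z-critical value: 1.959963984540054
```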


## Some confidence interval homework

- $Z$-test confidence intervals (Use $Z$-distribution, assume known population variance)
    1. A sample of $n=50$ students has an average SAT score of 1050 with a known population standard deviation of 100. Compute a 95% confidence interval for the population mean SAT score.
    2. A factory produces bolts with an average length of 5.2 cm. A quality control sample of $n=40$ bolts has a mean length of 5.25 cm, and the population standard deviation is known to be 0.1 cm. Find the 99% confidence interval for the true mean length of all bolts.
    3. A researcher surveys $n=200$ customers and finds that 65% of them prefer online shopping. Compute a 90% confidence interval for the true proportion of customers who prefer online shopping.

- $t$-test confidence intervals (Use $t$-distribution, assume unknown population variance)
    1. A small sample of $n=10$ patients has an average systolic blood pressure of 120 mmHg with a sample standard deviation of 15 mmHg. Find the 95% confidence interval for the population mean blood pressure.
    2. A group of $n=15$ students took a math test, and their average score was 78 with a sample standard deviation of 10. Compute a 99% confidence interval for the true average test score.
    3. A nutritionist collects data from $n=8$ people on their daily calorie intake. The sample mean is 2200 calories, and the sample standard deviation is 250 calories. Find the 90% confidence interval for the average calorie intake in the population.