# Problem 1
Estimate and compare the confidence intervals (error bars) obtained for each distribution using Hoeffding's inequality and Chebyshev's inequality. For the latter, the variance must be derived analytically or estimated empirically.

In [9]:
import numpy as np
import pandas as pd
from scipy.stats import norm
import time
import matplotlib.pyplot as plt

# Load data generated in data_generation.ipynb

input_file_path = "data_100000_100013059.csv"
df = pd.read_csv(input_file_path)
N = len(df)


## Hoeffding inequality

Let $X_1, ..., X_n$ be independent random variables (i.i.d. in our setting), each bounded so that $a_i \leq X_i \leq b_i$. Let the empirical mean be $\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i$. Then for any $\epsilon > 0$: $$ P(|\overline{X} - \mathbb{E}[\overline{X}]| \geq \epsilon) \leq 2e^{-\frac{2n^2\epsilon^2}{\sum_{i=1}^n(b_i - a_i)^2}} $$

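To find the confidence interval (CI) with confidence level $1 - \alpha$ (e.g., 0.95 for 95%), we assume common bounds $a_i = a$ and $b_i = b$, so that with $N$ samples the right-hand side becomes $2e^{-2N\epsilon^2/(b-a)^2}$. Setting it equal to $\alpha$ and solving for the margin of error $\epsilon$ gives:

$$ \epsilon = (b-a)\sqrt{\frac{\ln(2/\alpha)}{2N}} $$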
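The margin formula above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the name `hoeffding_margin` and its signature are illustrative assumptions, and it assumes all samples share the same bounds $[a, b]$.

```python
import numpy as np

def hoeffding_margin(a, b, n, alpha=0.05):
    """Two-sided Hoeffding margin of error for the mean of n samples in [a, b].

    Hypothetical helper for illustration; assumes common bounds for all samples.
    """
    if b <= a:
        raise ValueError("upper bound b must be greater than lower bound a")
    return (b - a) * np.sqrt(np.log(2 / alpha) / (2 * n))

# Sanity check: plugging the margin back into the bound should recover alpha
n, alpha = 1000, 0.05
eps = hoeffding_margin(0.0, 1.0, n, alpha)
bound = 2 * np.exp(-2 * n * eps**2 / (1.0 - 0.0) ** 2)
print(f"margin = {eps:.4f}, bound evaluated at margin = {bound:.4f}")
```

The margin shrinks like $1/\sqrt{N}$ and grows linearly with the range width $b - a$, but only logarithmically as $\alpha$ decreases.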

### Gaussian distribution

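The Gaussian distribution is unbounded, so Hoeffding's inequality does not apply to it directly. To illustrate the confidence interval, we can truncate the distribution to a fixed range such as $\pm 3\sigma$ or $\pm 5\sigma$, or alternatively use the actual minimum and maximum of the sample as $a$ and $b$. The latter is a heuristic, since bounds taken from the data are not guaranteed to hold for the underlying distribution.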
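To see how much the choice of truncation range matters, the sketch below computes the Hoeffding margin for bounds at $\mu \pm k\sigma$. It is an illustration under assumptions: the helper name `truncated_gaussian_margin` is hypothetical, and the resulting bound strictly applies to the truncated variable, not to the original Gaussian.

```python
import numpy as np

def truncated_gaussian_margin(sigma, k, n, alpha=0.05):
    """Hoeffding margin when a Gaussian is treated as bounded in mu +/- k*sigma.

    Hypothetical helper; the interval width is 2*k*sigma regardless of mu.
    """
    width = 2 * k * sigma
    return width * np.sqrt(np.log(2 / alpha) / (2 * n))

# sigma = 0.5 and n = 100000 match the Gaussian column analysed in this notebook
for k in (3, 5):
    eps = truncated_gaussian_margin(sigma=0.5, k=k, n=100_000)
    print(f"k = {k}: range width = {2 * k * 0.5:.1f}, Hoeffding margin = {eps:.5f}")
```

A wider truncation range captures more of the distribution but loosens the bound proportionally, since the margin is linear in the width.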

In [None]:
# Hoeffding inequality of Gaussian distribution

# Gaussian with mu=0 and sigma=0.5

dfg_m0_s0_5 = df["normal mu=0 sigma=0.5"]
mu = 0
sigma = 0.5
sample_mean = dfg_m0_s0_5.mean()
sample_std = dfg_m0_s0_5.std()
sample_min = dfg_m0_s0_5.min()
sample_max = dfg_m0_s0_5.max()
a, b = sample_min, sample_max
alpha = 0.05  # 95% confidence
range_width = b - a
epsilon = range_width * np.sqrt(np.log(2 / alpha) / (2 * N))
lower_bound = sample_mean - epsilon
upper_bound = sample_mean + epsilon

# Standard Error = sigma / sqrt(N) (Using sample std dev here)
z_score = 1.96 # For 95% CI
clt_margin = z_score * (sample_std / np.sqrt(N))

# --- Output ---
print(f"Sample Mean: {sample_mean:.5f}")
print(f"Data Range [a, b]: [{a:.3f}, {b:.3f}] (Width: {range_width:.3f})")
print("-" * 30)
print(f"Hoeffding Margin of Error: {epsilon:.5f}")
print(f"Hoeffding 95% CI: [{lower_bound:.5f}, {upper_bound:.5f}]")
print("-" * 30)
print(f"Standard (CLT) Margin: {clt_margin:.5f}")
print("Note: Hoeffding is usually wider (more conservative) than CLT.")



Sample Mean: 0.00035
Data Range [a, b]: [-2.198, 2.055] (Width: 4.252)
------------------------------
Hoeffding Margin of Error: 0.01826
Hoeffding 95% CI: [-0.01792, 0.01861]
------------------------------
Standard (CLT) Margin: 0.00310
Note: Hoeffding is usually wider (more conservative) than CLT.


## Chebyshev inequality

For any random variable $X$ with finite variance and any $\lambda > 0$: $$ P(|X-\mathbb{E}[X]| \geq \lambda) \leq \frac{\mathrm{Var}(X)}{\lambda^2} $$
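Applied to the sample mean of $N$ i.i.d. variables with variance $\sigma^2$, we have $\mathrm{Var}(\overline{X}) = \sigma^2/N$. Setting the bound $\sigma^2/(N\lambda^2)$ equal to $\alpha$ gives the margin $\lambda = \sigma/\sqrt{N\alpha}$. A minimal sketch of this calculation follows; the helper name `chebyshev_margin` is an illustrative assumption, and in practice $\sigma$ would be replaced by the known or empirically estimated standard deviation.

```python
import numpy as np

def chebyshev_margin(std, n, alpha=0.05):
    """Chebyshev margin of error for the mean of n i.i.d. samples.

    Hypothetical helper; std is the (known or estimated) standard deviation
    of a single sample, so Var(mean) = std**2 / n.
    """
    return std / np.sqrt(n * alpha)

# Theoretical sigma = 0.5 and N = 100000, as for the Gaussian column above
print(f"Chebyshev margin: {chebyshev_margin(0.5, 100_000):.5f}")
```

With these values the margin is about 0.0071, which lies between the CLT margin (0.00310) and the Hoeffding margin (0.01826) reported above.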