In [2]:
import numpy as np
# ! pip3 install scipy
from scipy.stats import norm

In [4]:
mean = 0
std_dev = 1

data = np.random.normal(mean, std_dev, 1000)

print(data[:10])

[ 0.97140268 -0.23406852 -0.10130953 -0.89579478  1.07827426  0.24761632
  0.10037077  0.67288597  0.03296816  1.02691856]


<h4>The law of large numbers</h4>
The law of large numbers is a fundamental theorem in probability that describes the result of performing the same experiment a large number of times. It states that as the number of trials increases, the sample average of the outcomes will converge to the expected value.



The sample average is defined as

$$\overline{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
based on the samples $X_1, X_2, \ldots, X_n$ from a probability distribution with expectation $\mu$.
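
As a minimal sketch of this convergence, the cell below computes the sample mean of the first n draws from a standard normal distribution for increasing n. The seed, the sample sizes, and the names `rng`, `sample_sizes`, and `sample_means` are illustrative choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
mu, sigma = 0.0, 1.0

# One long i.i.d. sample; prefixes of it play the role of X_1, ..., X_n
sample = rng.normal(mu, sigma, 100_000)

# Sample mean of the first n observations for increasing n
sample_sizes = [10, 100, 1_000, 10_000, 100_000]
sample_means = {n: sample[:n].mean() for n in sample_sizes}

for n, m in sample_means.items():
    print(f"n = {n:>6}: sample mean = {m: .5f}")
```

The printed means should drift toward $\mu = 0$ as n grows, although individual values depend on the seed.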

<h4>Weak law of large numbers</h4>
Given a sequence of independent and identically distributed (i.i.d.) random variables with finite expected value $\mu$, the sample
mean $\overline{X}_n$ converges in probability to $\mu$: for every $\epsilon > 0$,

$$\lim_{n \to \infty} P(|\overline{X}_n - \mu| > \epsilon) = 0 $$
<br><br>
This means that for large n, the probability of the sample mean deviating from the true mean by any fixed amount $\epsilon$ becomes very small.
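
The probability in the weak law can be approximated by simulation. The sketch below repeats the experiment many times for each n and counts how often the sample mean misses $\mu$ by more than $\epsilon$. The values of `epsilon` and `n_repeats`, and the name `deviation_prob`, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
mu, sigma = 0.0, 1.0
epsilon = 0.1      # illustrative tolerance
n_repeats = 2_000  # number of independent experiments per sample size

deviation_prob = {}
for n in (10, 100, 1_000):
    # Each row is one independent experiment of size n
    means = rng.normal(mu, sigma, size=(n_repeats, n)).mean(axis=1)
    # Fraction of experiments whose sample mean deviates by more than epsilon
    deviation_prob[n] = np.mean(np.abs(means - mu) > epsilon)
    print(f"n = {n:>5}: estimated P(|mean - mu| > {epsilon}) = {deviation_prob[n]:.4f}")
```

The estimated probability should shrink toward 0 as n increases, which is what the limit above states.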

<h4>Strong law of large numbers (SLLN)</h4>
The sample mean $\overline{X}_n$ converges almost surely to $\mu$:
$$P\left(\lim_{n \to \infty} \overline{X}_n = \mu\right) = 1 $$

It guarantees that, with probability 1, the sample mean converges to the expected value in the long run.


<h4>Empirical Distribution Function (EDF)</h4>

The EDF estimates the cumulative distribution function (CDF) of the distribution from which a sample of data was drawn. It is a non-parametric estimate: it assumes no particular form for the underlying distribution.

$$F_n(x) = \frac{\text{number of elements in the dataset} \le x}{n} $$
<p> In fact, there is a one-to-one relation between the histogram and the empirical distribution function. The area of the histogram (normalized to total area 1) over a single bin equals the relative frequency of elements that lie in that bin, which is also the increase of $F_n$ over that bin.</p>
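
The relation between histogram areas and EDF increments can be checked numerically. The sketch below is one way to do it, assuming bin edges that extend slightly beyond the data so that no sample point sits exactly on an edge; the names `edges`, `bin_areas`, and `edf_increase` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, chosen only for reproducibility
data = rng.normal(0, 1, 1000)

# Edges padded beyond the data range, so no data point lies on an edge
edges = np.linspace(data.min() - 0.5, data.max() + 0.5, 21)

# density=True scales the bar heights so that the total area is 1
heights, _ = np.histogram(data, bins=edges, density=True)
bin_areas = heights * np.diff(edges)

# EDF evaluated at every bin edge, then its increase over each bin
edf_at_edges = np.array([np.mean(data <= e) for e in edges])
edf_increase = np.diff(edf_at_edges)

print(np.allclose(bin_areas, edf_increase))
```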

Given a sample of n independent and identically distributed (i.i.d.) random variables $X_1, X_2, \ldots, X_n$, the empirical distribution function $F_n(x)$ is defined as

$$F_n(x) = \frac{1}{n}  \sum_{i=1}^n 1_{\{X_i \le x\}}$$

where $1_{\{X_i \le x\}}$ is an indicator function that equals 1 if $X_i \le x$ and 0 otherwise.
$F_n(x)$ represents the proportion of sample points less than or equal to x.

<h4>Properties</h4>

- Step Function: The EDF is a step function that increases by $1/n$ at each sample point (by $k/n$ where $k$ sample points coincide).
- Convergence: $F_n(x)$ converges uniformly to the true cumulative distribution function F(x) as $n \to \infty$, with probability 1 (the Glivenko-Cantelli theorem).
- For most realizations of the random sample the empirical distribution function $F_n$ is close to F: $$F_n(a) \approx F(a) $$
- Law of Large Numbers: The EDF satisfies the Law of Large Numbers, meaning it approaches the true distribution as more data is collected.
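
The convergence property can be illustrated by measuring the largest vertical gap between $F_n$ and $F$ for growing n. The sketch below assumes standard normal data, so that `norm.cdf` is the true F; the helper `max_edf_error` and the chosen sample sizes are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)  # fixed seed, chosen only for reproducibility

def max_edf_error(sample):
    """Largest vertical gap between the EDF of sample and the N(0, 1) CDF."""
    x = np.sort(sample)
    n = x.size
    cdf = norm.cdf(x)
    # The EDF jumps from (i - 1)/n to i/n at the i-th sorted point, so the
    # largest gap occurs either just before or exactly at one of the jumps.
    gap_at_jump = np.arange(1, n + 1) / n - cdf
    gap_before_jump = cdf - np.arange(0, n) / n
    return max(gap_at_jump.max(), gap_before_jump.max())

errors = {n: max_edf_error(rng.normal(0, 1, n)) for n in (100, 10_000, 1_000_000)}
for n, err in errors.items():
    print(f"n = {n:>9}: max |F_n - F| = {err:.5f}")
```

The gap should shrink as n grows, roughly in proportion to $1/\sqrt{n}$.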


In [6]:
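def empirical_distribution_function(data, value):
    """Return F_n(value): the fraction of data points less than or equal to value."""
    data = np.asarray(data)

    # Count the number of data points less than or equal to the given value
    # (no sorting is needed for a single evaluation point)
    count = np.sum(data <= value)

    # Divide by the sample size to get the EDF
    edf = count / data.size

    return edf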


data = np.random.normal(0, 1, 1000)  
value = 0  # Example value
edf_value = empirical_distribution_function(data, value)
print(f"Empirical distribution function value for {value}: {edf_value}")

Empirical distribution function value for 0: 0.498


<h4>Identically distributed and Independent Random Variables</h4>
Identically distributed: Each $X_i$ should come from the same underlying distribution.

Independent: No sample value should influence another.

<h3> The inverse cumulative distribution function (CDF) </h3> The inverse CDF, also called the quantile function, "undoes"
the cumulative distribution function. Given a probability p, it returns the value
of the random variable such that the probability of the random variable being less than
or equal to that value is p. In mathematical terms, for a random variable X with a continuous, strictly increasing
cumulative distribution function $F(x)$, the inverse CDF, denoted as $F^{-1}(p)$, is such that:

$$F^{-1}(p) = x  \text{ if } F(x) = p$$

In simpler terms, if you have a probability p, the inverse CDF gives you the value 
x such that F(x)=p.
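
For a distribution with a closed-form CDF the inverse can be written out by hand. As a worked sketch, assume an exponential distribution with rate $\lambda$ (this distribution is not used elsewhere in the notebook): $F(x) = 1 - e^{-\lambda x}$, and solving $F(x) = p$ for x gives $F^{-1}(p) = -\ln(1 - p)/\lambda$. The cell below compares this formula with `scipy.stats.expon.ppf`; the value of `lam` is arbitrary.

```python
import numpy as np
from scipy.stats import expon

lam = 2.0  # hypothetical rate parameter, chosen only for illustration
p = np.array([0.1, 0.5, 0.9])

# Closed-form inverse CDF obtained by solving 1 - exp(-lam * x) = p for x
x_closed_form = -np.log(1 - p) / lam

# SciPy parameterizes the exponential distribution by scale = 1 / rate
x_scipy = expon.ppf(p, scale=1 / lam)

# Applying the CDF to the quantiles should give back the probabilities
round_trip = expon.cdf(x_scipy, scale=1 / lam)

print(np.allclose(x_closed_form, x_scipy), np.allclose(round_trip, p))
```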



In [7]:
def normal_cdf(x, mu, sigma):
    # x: The value(s) at which to compute the CDF.
    return norm.cdf(x, mu, sigma)


mu = 0
sigma = 1
x = 1.5
cdf_value = normal_cdf(x, mu, sigma)
print(f"CDF at {x}: {cdf_value}")
x = 0
cdf_value = normal_cdf(x, mu, sigma)
print(f"CDF at {x}: {cdf_value}")

CDF at 1.5: 0.9331927987311419
CDF at 0: 0.5


In [9]:
def inverse_normal_cdf(p, mu, sigma):
    #  percent-point function (ppf) aka inverse CDF
    return norm.ppf(p, mu, sigma)

mu = 0
sigma = 1
p = 0.5
x = inverse_normal_cdf(p, mu, sigma)
print(f"The value corresponding to the probability {p}: {x}")
p = 0.9331927987311419
x = inverse_normal_cdf(p, mu, sigma)
print(f"The value corresponding to the probability {p}: {x}")

The value corresponding to the probability 0.5: 0.0
The value corresponding to the probability 0.9331927987311419: 1.4999999999999996


<h4>Non-parametric estimates </h4>
When we lack knowledge about a phenomenon, we prefer not to specify a particular parametric type of distribution, and we model the given data set as the realization of a random sample of size n from a continuous probability distribution.
<br><br>
The kernel density estimate and the empirical distribution function of the dataset approximate the probability density function f and the distribution function F of that distribution. From the resulting graphs, we can judge whether they resemble the probability density function or distribution function of any of the familiar parametric distributions.
<br><br>
Instead of viewing the two graphs (kernel density estimate and empirical distribution function) only as graphical summaries of the data, we can also use both curves as estimates for f and F. We estimate the model probability density f by means of the kernel density estimate and the model distribution function F by means of the empirical distribution function.
<br><br>
Since neither estimate assumes a particular parametric model, they are called nonparametric estimates.
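
Both estimates can be computed in a few lines. The sketch below assumes standard normal data so the estimates can be compared with the known f and F. It uses `scipy.stats.gaussian_kde` with its default bandwidth, which is only one possible kernel density estimator; the grid and the names `f_hat` and `F_hat` are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(4)  # fixed seed, chosen only for reproducibility
data = rng.normal(0, 1, 1000)
grid = np.linspace(-3, 3, 61)   # points at which both estimates are evaluated

# Kernel density estimate of f (Gaussian kernel, default bandwidth rule)
kde = gaussian_kde(data)
f_hat = kde(grid)

# Empirical distribution function as an estimate of F:
# searchsorted with side="right" counts the data points <= each grid value
F_hat = np.searchsorted(np.sort(data), grid, side="right") / data.size

# Compare with the true density and distribution function of N(0, 1)
max_pdf_gap = np.max(np.abs(f_hat - norm.pdf(grid)))
max_cdf_gap = np.max(np.abs(F_hat - norm.cdf(grid)))
print(f"max |f_hat - f| = {max_pdf_gap:.4f}, max |F_hat - F| = {max_cdf_gap:.4f}")
```

Neither step uses the fact that the data are normal; the normal density and CDF appear only in the final comparison.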