When we repeatedly draw random samples of size 𝑛 from a distribution with a finite mean and variance, the distribution of the sample means approaches a normal distribution as 𝑛 grows, regardless of the shape of the original distribution. A common rule of thumb is 𝑛 ≥ 30, although heavily skewed distributions may need larger samples.

**Importance of the Central Limit Theorem**

- Allows us to use the normal distribution for inference (even when the original data is not normally distributed).
- Enables hypothesis testing and confidence intervals in data science.
- Foundation for many machine learning algorithms and statistical methods.
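
As a rough illustration of the confidence-interval point above, the sketch below builds an approximate 95% interval for a mean from skewed data using the normal approximation. The data set, the seed, and the variable names are assumptions made for this example, and `1.96` is the usual rounded normal quantile.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical skewed data: 200 draws from an exponential distribution (mean 2)
data = rng.exponential(scale=2.0, size=200)

n = data.size
sample_mean = data.mean()
# Estimated standard error of the mean: s / sqrt(n)
standard_error = data.std(ddof=1) / np.sqrt(n)

z = 1.96  # approximate 97.5th percentile of the standard normal
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.3f}")
print(f"Approximate 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval relies on the sample mean being approximately normal, which is what the Central Limit Theorem provides even though the raw data are skewed.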



**Properties of the Central Limit Theorem**

1. The Sample Mean Approaches Normality
- Even if the population distribution is skewed, the distribution of the sample means becomes approximately normal as 𝑛 increases.
2. The Mean of the Sample Means Approaches the Population Mean
- If the population has mean 𝜇, then the sample means will also center around 𝜇.
3. The Standard Deviation of the Sample Means Shrinks
- The variability of the sample means decreases as 𝑛 increases.
- The standard deviation of the sample means (Standard Error) is:

    ![standard error](assets/sd-error.png)

    $$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

    $\sigma_{\bar{X}}$ = standard deviation of sample means

    𝜎 = standard deviation of population

    𝑛 = sample size
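
The three properties can be checked numerically. The sketch below is a minimal simulation, assuming an exponential population with scale 2 (so that both 𝜇 and 𝜎 equal 2); the seed, the sample sizes, and the `results` dictionary are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility

mu = 2.0             # mean of an exponential distribution with scale 2
sigma = 2.0          # its standard deviation also equals the scale
num_samples = 5000   # repeated samples drawn for each sample size

results = {}
for n in (5, 30, 100):
    # Each row is one sample of size n; take the mean of every row
    sample_means = rng.exponential(scale=sigma, size=(num_samples, n)).mean(axis=1)
    empirical_se = sample_means.std(ddof=1)
    theoretical_se = sigma / np.sqrt(n)
    results[n] = (sample_means.mean(), empirical_se, theoretical_se)
    print(f"n={n:>3}  mean of means={sample_means.mean():.3f}  "
          f"empirical SE={empirical_se:.3f}  theoretical SE={theoretical_se:.3f}")
```

The mean of the sample means should stay near 𝜇 for every 𝑛, while the empirical standard error should track 𝜎/√𝑛 and shrink as 𝑛 grows.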


```python
import numpy as np
import matplotlib.pyplot as plt

# Build a right-skewed population from an exponential distribution
np.random.seed(42)
population = np.random.exponential(scale=2, size=10000)

num_samples = 1000  # number of repeated samples
sample_size = 30    # observations per sample (n)
sample_means = []

# Draw repeated samples and record the mean of each one
for _ in range(num_samples):
    sample = np.random.choice(population, size=sample_size, replace=True)
    sample_means.append(np.mean(sample))

plt.figure(figsize=(12, 5))

# Left panel: the original skewed population distribution
plt.subplot(1, 2, 1)
plt.hist(population, bins=50, color='orange', alpha=0.7, edgecolor='black')
plt.title("Original Skewed Population Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")

# Right panel: the distribution of the sample means
plt.subplot(1, 2, 2)
plt.hist(sample_means, bins=50, color='blue', alpha=0.7, edgecolor='black')
plt.title("Sample Means Approximate Normal Distribution")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")

plt.tight_layout()
plt.show()
```
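
The histograms show the change in shape visually. As a complementary numeric check, the sketch below compares the skewness of the population with the skewness of the sample means. The `skewness` helper is a hypothetical function written for this example (the mean cubed z-score), and the setup is regenerated with NumPy's `default_rng`, so the numbers will not match the cell above exactly.

```python
import numpy as np

def skewness(values):
    """Mean cubed z-score: 0 for a symmetric distribution, positive for a right skew."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(42)
population = rng.exponential(scale=2, size=10_000)

# 1,000 samples of size 30 drawn with replacement, one sample per row
samples = rng.choice(population, size=(1000, 30), replace=True)
sample_means = samples.mean(axis=1)

print(f"Skewness of population:   {skewness(population):.2f}")
print(f"Skewness of sample means: {skewness(sample_means):.2f}")
```

An exponential distribution has a theoretical skewness of 2, so the population value should be well above zero, while the sample means should be much closer to symmetric.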


**Population**

A population is the entire group of individuals or items that we want to study.

Examples:
- All students in a university.
- All smartphones manufactured by a company.
- All customers of an e-commerce website.

Characteristics of a Population:
- Includes all members of a group.
- Usually too large to study directly.
- Represented by parameters (e.g., population mean 𝜇, population standard deviation 𝜎).

**Sample**

A sample is a subset of the population that is selected for study.

Examples:
- 500 randomly selected students from a university.
- 100 randomly tested smartphones from a factory.
- A survey of 1,000 customers from an online store.

Characteristics of a Sample:
- Smaller than the population.
- Should be randomly selected for unbiased results.
- Represented by statistics (e.g., sample mean 𝑋ˉ, sample standard deviation 𝑠).
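
To see why random selection matters, the sketch below contrasts a random sample with a deliberately non-random one. The population, the seed, and the "keep only the largest values" selection rule are all assumptions invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, only for reproducibility

# Hypothetical population: 20,000 values with mean 50 and standard deviation 10
population = rng.normal(loc=50, scale=10, size=20_000)

# Random sample: every member has the same chance of being selected
random_sample = rng.choice(population, size=500, replace=False)

# Non-random sample, for contrast: only the 500 largest values
biased_sample = np.sort(population)[-500:]

print(f"Population mean:    {population.mean():.2f}")
print(f"Random sample mean: {random_sample.mean():.2f}")
print(f"Biased sample mean: {biased_sample.mean():.2f}")
```

The random sample should land close to the population mean, while the biased sample overestimates it by a wide margin regardless of its size.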

Reasons for using a sample instead of the population:

In real-world scenarios, analyzing an entire population is often impractical due to:
- Time constraints (surveying millions of people takes too long).
- High cost (testing all smartphones in a factory is expensive).
- Data availability (we cannot measure all internet users).

Thus, we use samples to estimate population characteristics. This is called statistical inference.

```python
import numpy as np

# Create a "population" of 100,000 values
np.random.seed(42)
population = np.random.normal(loc=50000, scale=10000, size=100000)  # Mean=50,000, Std Dev=10,000

# Take a random sample of size 1,000 from the population (without replacement)
sample = np.random.choice(population, size=1000, replace=False)

# Calculate parameters (population) and statistics (sample)
pop_mean = np.mean(population)       # Population mean
pop_std = np.std(population)         # Population standard deviation (divides by N)
sample_mean = np.mean(sample)        # Sample mean
sample_std = np.std(sample, ddof=1)  # Sample standard deviation (divides by n - 1)

print(f"Population Mean: {pop_mean:.2f}")
print(f"Sample Mean: {sample_mean:.2f}")
print(f"Population Standard Deviation: {pop_std:.2f}")
print(f"Sample Standard Deviation: {sample_std:.2f}")
```


Key takeaways:

- The sample mean is close to the population mean, but not exactly the same.
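- The sample standard deviation uses 𝑛 − 1 in the denominator (Bessel's correction), which makes the sample variance an unbiased estimate of the population variance. The result is slightly larger than dividing by 𝑛 on the same sample, but it can land above or below the population standard deviation.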
- Larger sample sizes give better estimates of the population (as per the Law of Large Numbers).
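
The last two takeaways can be explored with a short experiment. The sketch below draws samples of increasing size from the same kind of population and reports the error of the sample mean along with the standard deviation computed with and without Bessel's correction. The seed, the sample sizes, and the `results` list are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
population = rng.normal(loc=50_000, scale=10_000, size=100_000)
pop_mean = population.mean()

results = []
for n in (10, 100, 1_000, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    error = abs(sample.mean() - pop_mean)   # distance from the population mean
    std_n = sample.std(ddof=0)              # divides by n
    std_n_minus_1 = sample.std(ddof=1)      # divides by n - 1 (Bessel's correction)
    results.append((n, error, std_n, std_n_minus_1))
    print(f"n={n:>6}  |mean error|={error:9.2f}  "
          f"std (ddof=0)={std_n:9.2f}  std (ddof=1)={std_n_minus_1:9.2f}")
```

The error tends to shrink as 𝑛 grows, although a single run is not guaranteed to decrease at every step. The `ddof=1` value is always the larger of the two, and the gap between them fades as 𝑛 increases.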