
# **Statistical Distributions**

This notebook covers five statistical concepts:
- **Normal Distribution**: Symmetric and bell-shaped; common in natural phenomena.
- **Log-Normal Distribution**: Strictly positive and right-skewed; widely used in financial modeling.
- **Power Law Distribution**: Heavy-tailed; rare, extreme events can dominate outcomes.
- **Pareto Distribution**: A power-law distribution associated with the "80/20 rule"; common in economics.
- **Central Limit Theorem (CLT)**: Explains why the sampling distribution of the mean tends toward a normal distribution as the sample size grows.
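
The "80/20 rule" can be made concrete with a short calculation. For a Pareto (Type I) distribution with tail index α > 1, the share of the total held by the top fraction p of the population is p^(1 − 1/α). The sketch below applies only that formula; the function name `top_share` and the choice α = log 5 / log 4 ≈ 1.16 are illustrative assumptions, not part of the notebook's own code.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p under a Pareto(alpha) law.

    Valid only for alpha > 1, where the mean is finite.
    """
    if alpha <= 1:
        raise ValueError("alpha must be > 1 for the mean to exist")
    return p ** (1 - 1 / alpha)

# Tail index at which the top 20% hold exactly 80% of the total
alpha_80_20 = math.log(5) / math.log(4)

print(f"alpha for 80/20: {alpha_80_20:.3f}")
print(f"Top 20% share:   {top_share(0.2, alpha_80_20):.3f}")
print(f"Top 20% share at alpha = 2.5: {top_share(0.2, 2.5):.3f}")
```

A smaller α means heavier concentration in the tail. At α = 2.5, the value used in the Pareto cells of this notebook, the top 20% hold roughly 38% of the total, so that data is noticeably less concentrated than the classic 80/20 split.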

### **Objective**
Visualize and understand these distributions and the CLT using Python.


### **Normal Distribution Visualization with Empirical Rule**

This code draws random samples from a normal distribution and visualizes them with the Empirical Rule (68-95-99.7 rule): roughly 68%, 95%, and 99.7% of the data fall within one, two, and three standard deviations of the mean.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate data for a Normal Distribution
mean = 0      # Mean of the distribution
std_dev = 1   # Standard deviation
sample_size = 10000  # Number of random samples to generate

# Generate random samples from a normal distribution using the mean, standard deviation, and sample size
data = np.random.normal(mean, std_dev, sample_size)

# Calculate the boundaries for the Empirical Rule (68-95-99.7 rule)
one_std = (mean - std_dev, mean + std_dev)    # Boundaries for 1σ
two_std = (mean - 2*std_dev, mean + 2*std_dev) # Boundaries for 2σ
three_std = (mean - 3*std_dev, mean + 3*std_dev) # Boundaries for 3σ

# Set the plot style before creating the figure so that it takes effect
sns.set_style("whitegrid")

# Plot the distribution using a histogram and a Kernel Density Estimate (KDE)
plt.figure(figsize=(12, 6))  # Set the size of the plot
sns.histplot(data, kde=True, bins=50, color='skyblue', label='Data', stat="density", linewidth=0)  # Plot histogram with KDE curve

# Highlight regions for the Empirical Rule on the plot
plt.axvspan(one_std[0], one_std[1], color='green', alpha=0.3, label='68% within 1σ')  # Highlight 68% region
plt.axvspan(two_std[0], two_std[1], color='yellow', alpha=0.3, label='95% within 2σ')  # Highlight 95% region
plt.axvspan(three_std[0], three_std[1], color='red', alpha=0.2, label='99.7% within 3σ')  # Highlight 99.7% region

# Add title, axis labels, legend, and grid
plt.title("Normal Distribution with Empirical Rule", fontsize=18, fontweight='bold', pad=20)  # Title of the plot
plt.xlabel("Value", fontsize=14, labelpad=10)  # Label for the x-axis
plt.ylabel("Density", fontsize=14, labelpad=10)  # Label for the y-axis
plt.legend(fontsize=12, loc='upper right')  # Add legend
plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.6)  # Add grid with lighter style

# Show the plot
plt.show()

# Print percentages to validate that the data fits the Empirical Rule
within_1_std = np.mean((data >= one_std[0]) & (data <= one_std[1])) * 100  # Percentage within 1σ
within_2_std = np.mean((data >= two_std[0]) & (data <= two_std[1])) * 100  # Percentage within 2σ
within_3_std = np.mean((data >= three_std[0]) & (data <= three_std[1])) * 100  # Percentage within 3σ

# Output the calculated percentages
print(f"Data within 1σ: {within_1_std:.2f}%")
print(f"Data within 2σ: {within_2_std:.2f}%")
print(f"Data within 3σ: {within_3_std:.2f}%")
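
For comparison with the simulated percentages above, the exact probabilities behind the Empirical Rule can be computed from the standard normal CDF. This is a minimal sketch using only the standard library and the identity P(|Z| ≤ k) = erf(k/√2); the helper name `prob_within` is an illustrative assumption.

```python
import math

def prob_within(k):
    """Exact probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k}σ: {prob_within(k) * 100:.2f}%")
```

The exact values are about 68.27%, 95.45%, and 99.73%, which the rule rounds to 68-95-99.7. With 10,000 samples, the simulated percentages should land within a fraction of a percentage point of these.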


### **Central Limit Theorem (CLT) Demonstration**

This code demonstrates the **Central Limit Theorem (CLT)**: the distribution of sample means approaches a normal distribution as the sample size grows, even when the population itself is not normal (provided its variance is finite). The example uses a **uniform distribution** as the population and repeatedly samples from it to illustrate the CLT.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Parameters
population_size = 100000  # Size of the population
sample_size = 30          # Sample size (n >= 30 is a common rule of thumb for the CLT, not a strict requirement)
num_samples = 1000        # Number of random samples to take
population_min = 0        # Minimum value for the uniform distribution
population_max = 10       # Maximum value for the uniform distribution

# Step 1: Generate a population with a uniform distribution
population = np.random.uniform(population_min, population_max, population_size)

# Step 2: Take repeated random samples and calculate the sample mean
sample_means = []

for _ in range(num_samples):
    sample = np.random.choice(population, size=sample_size)
    sample_means.append(np.mean(sample))

# Step 3: Visualize the population and the sampling distribution of sample means
plt.figure(figsize=(12, 6))

# Plot the population distribution (Uniform Distribution)
plt.subplot(1, 2, 1)
sns.histplot(population, bins=50, kde=False, color='skyblue', label='Population', alpha=0.7)
plt.title('Population Distribution (Uniform)', fontsize=14)
plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True)

# Plot the sampling distribution of sample means
plt.subplot(1, 2, 2)
sns.histplot(sample_means, bins=30, kde=True, color='orange', label='Sample Means', alpha=0.7)
plt.title('Sampling Distribution of the Mean (CLT)', fontsize=14)
plt.xlabel('Sample Mean', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True)

# Show the plots
plt.tight_layout()
plt.show()

# Display some statistics for the sampling distribution
print(f"Population Mean: {np.mean(population)}")
print(f"Sampling Distribution Mean: {np.mean(sample_means)}")
print(f"Sampling Distribution Standard Deviation (Standard Error): {np.std(sample_means)}")
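
The CLT also predicts how wide the sampling distribution should be: the standard error of the mean is σ/√n, and for a Uniform(a, b) population σ = (b − a)/√12. The sketch below compares that prediction with a simulation. It is a simplified variant of the cell above, not a replacement for it: it samples directly from the uniform distribution instead of a finite population array, and the seeded generator `np.random.default_rng(0)` is an assumption added only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

low, high = 0, 10      # same uniform range as the cell above
sample_size = 30
num_samples = 1000

# Draw all samples at once: one row per sample, then average each row
samples = rng.uniform(low, high, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

# Theoretical values for Uniform(low, high)
population_std = (high - low) / np.sqrt(12)
theoretical_se = population_std / np.sqrt(sample_size)

print(f"Theoretical standard error: {theoretical_se:.4f}")
print(f"Simulated standard error:   {sample_means.std(ddof=1):.4f}")
```

The theoretical value here is about 0.527. Increasing `sample_size` should shrink both numbers in proportion to 1/√n.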


### **Plotting Normal, Log-Normal, and Recovered Normal Distributions**

This Python code generates normally distributed data, exponentiates it to obtain log-normal data, and then recovers the original normal data by taking the natural logarithm of the log-normal values.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Parameters for the normal distribution (used to generate the log-normal distribution)
mu = 0         # Mean of the normal distribution
sigma = 0.5    # Standard deviation of the normal distribution
sample_size = 10000  # Number of samples to generate

# Generate normal distribution data
normal_data = np.random.normal(mu, sigma, sample_size)

# Generate log-normal data from the normal distribution
log_normal_data = np.exp(normal_data)

# Recover the original normal data by taking the natural logarithm of the log-normal data
recovered_normal_data = np.log(log_normal_data)

# Plot the normal distribution (original), log-normal distribution, and recovered normal distribution
plt.figure(figsize=(18, 6))

# Plot the normal distribution (original)
plt.subplot(1, 3, 1)
sns.histplot(normal_data, bins=50, kde=True, color='skyblue', label='Normal Data')
plt.title('Normal Distribution (Original)', fontsize=16)
plt.xlabel('Value', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.grid(True)

# Plot the log-normal distribution
plt.subplot(1, 3, 2)
sns.histplot(log_normal_data, bins=50, kde=True, color='orange', label='Log-Normal Data')
plt.title('Log-Normal Distribution', fontsize=16)
plt.xlabel('Value', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.grid(True)

# Plot the recovered normal distribution
plt.subplot(1, 3, 3)
sns.histplot(recovered_normal_data, bins=50, kde=True, color='green', label='Recovered Normal Data')
plt.title('Recovered Normal Distribution', fontsize=16)
plt.xlabel('Value', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend()
plt.grid(True)

# Show the plots
plt.tight_layout()
plt.show()

# Print the mean and variance of the recovered normal data
print(f"Recovered Normal Data Mean: {np.mean(recovered_normal_data)}")
print(f"Recovered Normal Data Variance: {np.var(recovered_normal_data)}")
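
The right skew of the log-normal data shows up in its summary statistics. If ln X is normal with mean μ and standard deviation σ, then the median of X is e^μ and its mean is e^(μ + σ²/2), so the mean always sits above the median. The sketch below checks both closed-form values against simulated data. The seeded generator is an assumption added for reproducibility; the parameters match the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

mu, sigma = 0, 0.5  # same parameters as the cell above
log_normal_data = np.exp(rng.normal(mu, sigma, 10000))

# Closed-form median and mean of a log-normal distribution
theoretical_median = np.exp(mu)
theoretical_mean = np.exp(mu + sigma**2 / 2)

print(f"Median: simulated {np.median(log_normal_data):.3f}, theoretical {theoretical_median:.3f}")
print(f"Mean:   simulated {np.mean(log_normal_data):.3f}, theoretical {theoretical_mean:.3f}")
```

With μ = 0 and σ = 0.5 the theoretical median is 1 and the theoretical mean is about 1.133.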


### **Transforming Power-law Distributed Data to Approximate Normality using Box-Cox**

This code demonstrates how to generate Power-law distributed data and transform it into an approximately normal distribution using the Box-Cox transformation. It visualizes the original data, transformed data, and a QQ plot for normality check.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import boxcox

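# Define parameters for generating Power-law distributed data
alpha = 2.5  # Shape parameter (tail index); the density decays as x^-(alpha + 1)
xmin = 1     # Minimum value
sample_size = 10000  # Number of samples

# Generate Power-law distributed data
# np.random.pareto draws from a Lomax (Pareto II) distribution;
# adding 1 and scaling by xmin gives a classical Pareto starting at xmin
power_law_data = (np.random.pareto(alpha, sample_size) + 1) * xmin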

# Transform the data to approximate a normal distribution using Box-Cox
# Box-Cox requires all positive values, which is satisfied here
transformed_data, _ = boxcox(power_law_data)

# Plot the Power-law data and transformed data
plt.figure(figsize=(18, 6))

# Histogram of the original Power-law data
plt.subplot(1, 3, 1)
sns.histplot(power_law_data, bins=50, kde=True, color='blue', stat='density', alpha=0.6)
plt.title("Original Power-law Data")
plt.xlabel("Value")
plt.ylabel("Density")

# Histogram of the transformed data (approximate Normal Distribution)
plt.subplot(1, 3, 2)
sns.histplot(transformed_data, bins=50, kde=True, color='green', stat='density', alpha=0.6)
plt.title("Transformed Data (Approx. Normal)")
plt.xlabel("Value")
plt.ylabel("Density")

# QQ plot to check the transformed data against a normal distribution
from scipy.stats import probplot
plt.subplot(1, 3, 3)
probplot(transformed_data, dist="norm", plot=plt)
plt.title("QQ Plot of Transformed Data")

# Adjust layout for clarity
plt.tight_layout()
plt.show()
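
The cell above discards the second return value of `boxcox`, which is the fitted λ. A numerical check can complement the QQ plot: the sketch below keeps λ and compares sample skewness before and after the transformation. Using skewness as the summary (via `scipy.stats.skew`) and seeding the generator are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

alpha, xmin = 2.5, 1  # same parameters as the cell above
power_law_data = (rng.pareto(alpha, 10000) + 1) * xmin

# With lmbda left unset, boxcox returns the transformed data and the fitted lambda
transformed_data, fitted_lambda = boxcox(power_law_data)

skew_before = skew(power_law_data)
skew_after = skew(transformed_data)

print(f"Fitted lambda:   {fitted_lambda:.3f}")
print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after:  {skew_after:.3f}")
```

A skewness close to zero after the transformation is consistent with a more symmetric shape, but it does not prove normality on its own, so the QQ plot remains the more informative check.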


### **Transforming Pareto-Distributed Data to Normal Distribution**

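This code log-transforms Pareto-distributed data and standardizes it (z-scores) to bring it closer to a normal shape. The logarithm of Pareto data follows an exponential distribution, so the result is far less skewed than the original but only roughly normal. The data is split into two groups that are transformed separately, and the original and transformed distributions are visualized.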

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Parameters for Pareto distribution
alpha = 2.5  # Shape parameter
xmin = 1     # Minimum value
size = 10000  # Total samples

# Generate Pareto-distributed data
pareto_data = (np.random.pareto(alpha, size) + 1) * xmin

# Split the data into two groups
group1, group2 = np.split(pareto_data, 2)

# Log-transform and standardize Pareto data (reduces skew; the result is only roughly normal)
def pareto_to_normal(data):
    # Log-transform the data
    log_data = np.log(data)
    # Standardize (z-score normalization)
    return (log_data - np.mean(log_data)) / np.std(log_data)

# Apply transformation to each group
group1_normal = pareto_to_normal(group1)
group2_normal = pareto_to_normal(group2)

# Plot the results
plt.figure(figsize=(18, 6))

# Original Pareto Data
plt.subplot(1, 3, 1)
sns.histplot(pareto_data, bins=50, kde=True, color='blue', stat='density', alpha=0.6)
plt.title("Original Pareto Distribution")
plt.xlabel("Value")
plt.ylabel("Density")

# Transformed Normal Group 1
plt.subplot(1, 3, 2)
sns.histplot(group1_normal, bins=50, kde=True, color='green', stat='density', alpha=0.6)
plt.title("Group 1 - Transformed Normal")
plt.xlabel("Normalized Value")
plt.ylabel("Density")

# Transformed Normal Group 2
plt.subplot(1, 3, 3)
sns.histplot(group2_normal, bins=50, kde=True, color='orange', stat='density', alpha=0.6)
plt.title("Group 2 - Transformed Normal")
plt.xlabel("Normalized Value")
plt.ylabel("Density")

# Adjust layout for clarity
plt.tight_layout()
plt.show()
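
How close the log-transformed data gets to normal can be checked numerically. For a Pareto variable X with minimum `xmin` and shape α, ln(X / xmin) follows an exponential distribution with rate α, which has mean 1/α and skewness 2. The sketch below verifies this on simulated data and uses it to estimate α as the reciprocal of the mean log value. The seeded generator and the variable name `alpha_hat` are assumptions for this illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

alpha, xmin = 2.5, 1  # same parameters as the cell above
pareto_data = (rng.pareto(alpha, 10000) + 1) * xmin

# Log of the data relative to xmin: exponential with rate alpha
log_data = np.log(pareto_data / xmin)

# Estimate of alpha: reciprocal of the mean log value
alpha_hat = 1 / log_data.mean()

print(f"Estimated alpha:      {alpha_hat:.3f} (true value {alpha})")
print(f"Skewness of log data: {skew(log_data):.3f} (exponential: 2, normal: 0)")
```

The remaining skewness is why the transformed histograms above still lean to the right. Standardizing shifts and rescales the data but does not change its shape.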
