# **Statistics**

# 1. Simulation of Random Variables and Their Properties

**Objective**: Simulate random variables to understand their distributions, calculate their means and variances, and visualize the results.


## Example

Analyze the Iris dataset to calculate the mean and variance of the features for each species of iris.

I use the popular Iris dataset, which is often used for statistical testing, machine learning, and data visualization projects. This dataset includes measurements of 150 iris flowers from three species on four features: sepal length, sepal width, petal length, and petal width.

- Generate samples from different distributions (uniform, normal, binomial, Poisson) using `numpy` or `scipy.stats` and plot their histograms using `matplotlib` or `seaborn`.
- For each distribution, calculate the theoretical mean and variance, compare these with the sample mean and sample variance, and discuss the results in the context of the Law of Large Numbers.
- Demonstrate the Central Limit Theorem by showing that the distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the original distribution of the data.
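
The comparison between theoretical and sample moments described in the bullets above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the seed, the sample size `n`, and the distribution parameters are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility
n = 100_000                      # assumed sample size

# name -> (sampler, theoretical mean, theoretical variance)
distributions = {
    "uniform(0, 1)":     (lambda: rng.uniform(0, 1, n),     0.5, 1 / 12),
    "normal(0, 1)":      (lambda: rng.normal(0, 1, n),      0.0, 1.0),
    "binomial(10, 0.3)": (lambda: rng.binomial(10, 0.3, n), 3.0, 2.1),  # n*p, n*p*(1-p)
    "poisson(4)":        (lambda: rng.poisson(4, n),        4.0, 4.0),  # mean = variance = lambda
}

results = {}
for name, (sampler, true_mean, true_var) in distributions.items():
    sample = sampler()
    # ddof=1 gives the unbiased sample variance
    results[name] = (sample.mean(), sample.var(ddof=1))
    print(f"{name:18s} mean {results[name][0]:.3f} (theory {true_mean:.3f})  "
          f"var {results[name][1]:.3f} (theory {true_var:.3f})")
```

With a sample this large, the sample statistics should land close to the theoretical values, which is what the Law of Large Numbers predicts. Rerunning with a much smaller `n` makes the gap visible.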

In [None]:
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add species (target) to the DataFrame
iris_df['species'] = iris.target
iris_df['species'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Display the first few rows of the dataframe
iris_df.head()

# Calculate mean and variance for each feature grouped by species
mean_variance_df = iris_df.groupby('species').agg(['mean', 'var'])

# Display the calculated mean and variance
mean_variance_df

# Set the style of seaborn
sns.set_style("whitegrid")

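# Plot the mean sepal length per species, with the standard deviation as error bars
sepal_mean = mean_variance_df[('sepal length (cm)', 'mean')]
sepal_std = np.sqrt(mean_variance_df[('sepal length (cm)', 'var')])
plt.figure(figsize=(10, 6))
# plt.bar accepts precomputed error bars directly through yerr
plt.bar(sepal_mean.index, sepal_mean.values, yerr=sepal_std.values, capsize=5)
plt.title('Mean and Standard Deviation of Sepal Length for Each Species')
plt.xlabel('Species')
plt.ylabel('Sepal Length (cm)')
plt.show()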

### Interpretation

The mean tells us about the average sizes of different iris features per species, providing a quick reference for distinguishing between species based on size. The variance adds depth to this picture by revealing how much individuals within a species vary from that average, which is essential for understanding the diversity within each species.
- **Central Tendency (Mean)**: Calculating the mean of each feature (sepal length, sepal width, petal length, petal width) for each iris species offers a snapshot of the typical size of these features within each species group. For instance, if the mean sepal length for setosa is significantly lower than that for versicolor and virginica, it suggests that, on average, setosa irises have shorter sepals. This measure of central tendency is crucial for understanding the general physical characteristics that define each species.
- **Variability (Variance)**: The variance provides insights into the spread of each feature within the species. A high variance in sepal length for a particular species would indicate that the sepal lengths within that species vary widely, suggesting a high degree of diversity or perhaps different subpopulations within that species. Conversely, a low variance implies that the feature sizes are more uniform, indicating consistency in the physical characteristics of the flowers within that species.
- **Comparative Analysis**: By comparing the mean and variance of these features across species, we can identify distinguishing characteristics. For example, if the virginica species shows a significantly higher mean petal length with low variance, it could be inferred that longer petals are a consistent and defining characteristic of the virginica species.

--------------

# 2. Image Data Analysis for Computer Vision

**Objective**: Demonstrate the application of statistical concepts in processing and analyzing image data, relevant to computer vision.

## Example

Demonstrate how to calculate the mean image and variance across the MNIST dataset, perform image normalization and contrast adjustment, and implement a simple image classification model.

The MNIST dataset is well suited to demonstrating statistical concepts in computer vision: it contains images of handwritten digits (0 through 9) and is commonly used for training image processing systems.


*   Load an image dataset (e.g., MNIST, CIFAR-10) and calculate the mean image and variance across the dataset. Discuss how these metrics can be used in image preprocessing steps for machine learning models.
*   For a subset of images, calculate the pixel intensity distribution's mean and variance. Use these statistics to perform image normalization and contrast adjustment.
*   Implement a simple image classification model using a machine learning library (e.g., scikit-learn, TensorFlow, or PyTorch). Discuss how the mean and variance of image features influence the model's performance.
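
The per-image normalization and contrast adjustment mentioned in the second bullet can be sketched as below. The image here is a hypothetical low-contrast array generated for illustration (not an MNIST digit), and the linear min-max stretch is one simple choice of contrast adjustment among several.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical low-contrast 28x28 grayscale image: intensities squeezed into [100, 150]
image = rng.integers(100, 151, size=(28, 28)).astype(np.float64)

# Per-image standardization: zero mean, unit variance
mean, std = image.mean(), image.std()
standardized = (image - mean) / std if std > 0 else np.zeros_like(image)

# Linear contrast stretch: map the observed [min, max] onto the full [0, 255] range
lo, hi = image.min(), image.max()
stretched = (image - lo) / (hi - lo) * 255.0 if hi > lo else np.zeros_like(image)

print(f"original:     mean={mean:.1f}, var={image.var():.1f}")
print(f"standardized: mean={standardized.mean():.2f}, var={standardized.var():.2f}")
print(f"stretched:    min={stretched.min():.0f}, max={stretched.max():.0f}, "
      f"var={stretched.var():.1f}")
```

Standardization fixes the mean and variance of the pixel intensities, while the stretch increases the variance by spreading intensities over the full range, which is what higher contrast means statistically.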

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from numpy import nan_to_num

# Load the datasets
train_df = pd.read_csv('./mnist_train.csv')
test_df = pd.read_csv('./mnist_test.csv')

# Extract features and labels
X_train = train_df.drop('label', axis=1).values
y_train = train_df['label'].values
X_test = test_df.drop('label', axis=1).values
y_test = test_df['label'].values

# Calculate mean and variance
mean_image = np.mean(X_train, axis=0)
variance_image = np.var(X_train, axis=0)

# Normalize the training and testing set
X_train_normalized = (X_train - mean_image) / np.sqrt(variance_image)
X_test_normalized = (X_test - mean_image) / np.sqrt(variance_image)

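# Replace NaN and infinite values with 0 after normalization. Pixels with zero training
# variance give 0/0 = NaN, and x/0 = inf when a test image is nonzero at such a pixel.
X_train_normalized = nan_to_num(X_train_normalized, nan=0.0, posinf=0.0, neginf=0.0)
X_test_normalized = nan_to_num(X_test_normalized, nan=0.0, posinf=0.0, neginf=0.0)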

# Now, it's safe to check for NaN values, though they should have been handled by the previous step
nan_in_train = np.isnan(X_train_normalized).any()
nan_in_test = np.isnan(X_test_normalized).any()

print(f"NaN in Training Set: {nan_in_train}")
print(f"NaN in Testing Set: {nan_in_test}")

# Visualize the mean image
plt.imshow(mean_image.reshape(28, 28), cmap='gray')
plt.title('Mean Image')
plt.colorbar()
plt.show()

# Visualize the variance image
plt.imshow(variance_image.reshape(28, 28), cmap='gray')
plt.title('Variance Image')
plt.colorbar()
plt.show()

# Initialize and train the classifier
clf = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
clf.fit(X_train_normalized, y_train)

# Predict on the test set and calculate accuracy
y_pred = clf.predict(X_test_normalized)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy}')

## Discuss the Influence of Mean and Variance

Preparing the MNIST dataset for classification involved two preprocessing steps that depend on the mean and variance of the data: normalization and the handling of NaN values. Both affect how the model trains and how well it performs:

### Normalization Using Mean and Variance

Normalization involves adjusting the values in the dataset so that they share a common scale, without distorting differences in the ranges of values. For the MNIST dataset, each pixel was normalized by subtracting its mean over the training set and dividing by its standard deviation (the square root of its variance); the same training-set statistics were then applied to the test set. This method ensures that:

- **Feature Scaling**: Each pixel is put on a comparable scale, which makes the optimization problem better conditioned. This matters for models such as Logistic Regression that are fit with iterative, gradient-based solvers (here `lbfgs`), because the solver tends to converge more evenly across all features (pixels in this context).
- **Improved Model Performance**: With features on a similar scale, models often train faster and can reach better accuracy, because the features contribute more equally to the training process.
- **Reduction of Bias**: Without normalization, pixels with a larger spread of values could dominate the model's learning process. Normalization mitigates this by expressing each pixel relative to its own variability rather than its absolute value.

### Handling NaN Values

The appearance of NaN values during normalization (specifically when dividing by zero variance for pixels that do not change across all images) necessitated additional preprocessing. Replacing NaN values with zeros ensures that:

- **Data Integrity**: The model receives a complete dataset without missing values, which could otherwise introduce bias or errors during training.
- **Consistent Input**: Ensures that all inputs to the model are real numbers, which is a prerequisite for most mathematical operations involved in machine learning algorithms.
- **Uninterrupted Training Process**: By addressing potential NaN issues upfront, the training process runs smoothly without interruption due to unexpected input values.
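
An alternative to cleaning up NaN values after the fact is to avoid the division by zero in the first place. The sketch below is one way to do that; the helper name `standardize` and the toy arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def standardize(train, test):
    """Standardize features with training statistics.

    Features that are constant in the training set are only mean-centered,
    because their standard deviation is replaced by 1 before dividing.
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    safe_std = np.where(std == 0, 1.0, std)  # avoid dividing by zero
    return (train - mean) / safe_std, (test - mean) / safe_std

# Toy data (hypothetical): the first column is constant in the training set
train = np.array([[0.0, 1.0], [0.0, 3.0], [0.0, 5.0]])
test = np.array([[2.0, 3.0]])

train_z, test_z = standardize(train, test)
print(train_z)
print(test_z)
```

Note the behavioral difference: with this approach a test value that differs from a constant training pixel keeps its centered value, whereas replacing NaN with zero discards it. Which is preferable depends on whether such pixels are expected to carry signal.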

----------

# 3. Analysis of Variability in Weather Data

## Objective:

Analyze a dataset of daily temperatures to calculate mean, variance, and standard deviation, demonstrating your understanding of these concepts and their application in summarizing and understanding data variability.

## Dataset:

You can use any publicly available weather dataset that includes daily temperature readings. For the purpose of this exercise, let's assume you have a dataset **`daily_temperatures.csv`** with columns **`Date`** and **`Temperature`**.

## Steps:

1. **Load and Prepare the Data**: Import necessary libraries (**`pandas`**, **`numpy`**) and load the dataset into a DataFrame. Convert **`Date`** to a datetime type and **`Temperature`** to a float.
2. **Calculate Descriptive Statistics**: Compute the mean, variance, and standard deviation of **`Temperature`** using **`numpy`** or **`pandas`** functions.
3. **Visualize the Data**: Plot the temperature over time using **`matplotlib`** or **`seaborn`** to visualize trends and variability.
4. **Interpret the Results**: Discuss what the mean, variance, and the plot tell you about the temperature distribution and variability over time.
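
Step 1 could look roughly like the sketch below when working from a real CSV file. To keep the example self-contained, the file contents are a hypothetical in-memory string; with an actual file, its path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for daily_temperatures.csv
csv_text = """Date,Temperature
2023-01-01,12.5
2023-01-02,not available
2023-01-03,14.1
"""

df = pd.read_csv(io.StringIO(csv_text))
df["Date"] = pd.to_datetime(df["Date"])
# errors="coerce" turns unparseable readings into NaN instead of raising
df["Temperature"] = pd.to_numeric(df["Temperature"], errors="coerce")
# Drop rows without a usable reading and index the data by date
df = df.dropna(subset=["Temperature"]).set_index("Date")

print(df.dtypes)
print(df)
```

Coercing and then dropping bad readings is only one policy; interpolating missing days may be more appropriate when gaps are short.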

In [None]:
# Step 1: Load and Prepare the Data
# For this example, we simulate a dataset instead of loading daily_temperatures.csv.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulating a dataset of daily temperatures over a year
np.random.seed(0) # For reproducibility
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
temperatures = 20 + 10*np.sin(np.linspace(0, 3*np.pi, 365)) + np.random.normal(0, 5, 365)

daily_temperatures_df = pd.DataFrame({'Date': dates, 'Temperature': temperatures})

# Converting 'Date' to datetime type and 'Temperature' to float is not needed as they are already in the correct format

# Display the first few rows of the dataframe
daily_temperatures_df.head()

# Step 2: Calculate Descriptive Statistics
mean_temperature = daily_temperatures_df['Temperature'].mean()
variance_temperature = daily_temperatures_df['Temperature'].var()
std_dev_temperature = daily_temperatures_df['Temperature'].std()

(mean_temperature, variance_temperature, std_dev_temperature)

# Step 3: Visualize the Data
plt.figure(figsize=(10, 6))
plt.plot(daily_temperatures_df['Date'], daily_temperatures_df['Temperature'], label='Daily Temperature')
plt.axhline(mean_temperature, color='r', linestyle='--', label='Mean Temperature')
plt.title('Daily Temperatures Over a Year')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.show()

## Interpretation

The plot visualizes the daily temperature fluctuations over the course of a year. By calculating the mean temperature, variance, and standard deviation, we can understand the dataset's central tendency and variability:

- **Mean Temperature**: The mean provides a central value around which daily temperatures tend to cluster. In our simulated data, this is visually represented by the red dashed line. The mean temperature gives us an idea of the overall "average" temperature throughout the year.
- **Variance**: The variance measures how much the temperatures spread out from the mean. A higher variance indicates a wider range of temperatures. In our case, the variance suggests there is considerable spread in daily temperatures, which is expected given both the seasonal cycle and the day-to-day noise in the simulation.
- **Standard Deviation**: This is the square root of the variance and provides a measure of temperature spread in the same units as the data itself (degrees). It indicates roughly how far a typical daily temperature lies from the mean temperature.

The mean, variance, and standard deviation are fundamental for understanding the dataset's behavior. In the context of weather data analysis, these statistics can help in planning agricultural activities, energy usage forecasting, and preparing for weather-dependent events.
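
To see how much of the overall variance comes from the seasonal pattern rather than day-to-day noise, one option is to compare the overall variance with the average variance within each month. The sketch below regenerates the same simulated series as the notebook cell above; grouping by calendar month is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Same simulated series as in the notebook cell above
np.random.seed(0)
dates = pd.date_range(start="2023-01-01", end="2023-12-31", freq="D")
temps = 20 + 10 * np.sin(np.linspace(0, 3 * np.pi, 365)) + np.random.normal(0, 5, 365)
series = pd.Series(temps, index=dates)

# Mean and variance of the temperature within each calendar month
monthly = series.groupby(series.index.month).agg(["mean", "var"])
overall_var = series.var()
within_month_var = monthly["var"].mean()

print(monthly.round(2))
print(f"Overall variance:              {overall_var:.2f}")
print(f"Average within-month variance: {within_month_var:.2f}")
```

The within-month variance is smaller than the overall variance because the monthly means absorb most of the seasonal swing, leaving mainly the noise term.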

--------------

# 4. Simulating Dice Rolls to Understand Random Variables

## Objective:

Simulate rolling a six-sided die to explore the concepts of random variables, mean, and variance, and to illustrate the Law of Large Numbers.

The simulation uses an increasing number of rolls (10, 1,000, and 10,000) so that the sample mean and variance can be compared with their theoretical values at each sample size.

## Steps:

1. **Simulate Dice Rolls**: Use **`numpy`** to generate random integers between 1 and 6, simulating 1,000 dice rolls.
2. **Calculate Mean and Variance**: Compute the sample mean and variance of the outcomes to understand the distribution of dice rolls.
3. **Repeat with Increasing Trials**: Repeat the simulation with 10, 1,000, and 10,000 rolls, plotting the mean of the outcomes against the number of trials to demonstrate the Law of Large Numbers.
4. **Discussion**: Explain how the mean stabilizes around the theoretical mean (3.5 for a fair six-sided die) as the number of trials increases, and discuss the implications for understanding random variables and variance.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Simulate Dice Rolls
rolls_10 = np.random.randint(1, 7, 10)
rolls_1000 = np.random.randint(1, 7, 1000)
rolls_10000 = np.random.randint(1, 7, 10000)

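# Step 2: Calculate Mean and Variance
# Note: np.var defaults to ddof=0 (population formula); pass ddof=1 for the unbiased sample variance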
mean_10 = np.mean(rolls_10)
variance_10 = np.var(rolls_10)

mean_1000 = np.mean(rolls_1000)
variance_1000 = np.var(rolls_1000)

mean_10000 = np.mean(rolls_10000)
variance_10000 = np.var(rolls_10000)

# Step 3: Repeat with Increasing Trials and Plot
trial_counts = [10, 1000, 10000]
means = [mean_10, mean_1000, mean_10000]
variances = [variance_10, variance_1000, variance_10000]

plt.figure(figsize=(14, 6))

# Plotting Mean
plt.subplot(1, 2, 1)
plt.plot(trial_counts, means, marker='o', linestyle='-', color='b')
plt.title('Mean of Dice Rolls vs. Number of Trials')
plt.xlabel('Number of Trials')
plt.ylabel('Mean of Outcomes')
plt.xscale('log')  # Use logarithmic scale to better visualize the changes
plt.axhline(y=3.5, color='r', linestyle='--')  # Theoretical mean
plt.legend(['Experimental Mean', 'Theoretical Mean'])

# Plotting Variance
plt.subplot(1, 2, 2)
plt.plot(trial_counts, variances, marker='o', linestyle='-', color='g')
plt.title('Variance of Dice Rolls vs. Number of Trials')
plt.xlabel('Number of Trials')
plt.ylabel('Variance of Outcomes')
plt.xscale('log')  # Use logarithmic scale to better visualize the changes
plt.axhline(y=np.var(np.arange(1, 7)), color='r', linestyle='--')  # Theoretical variance
plt.legend(['Experimental Variance', 'Theoretical Variance'])

plt.tight_layout()
plt.show()

## Results and Discussion

- **Mean of Dice Rolls vs. Number of Trials**: The plot shows the sample mean moving toward the theoretical mean (3.5) as the number of trials increases. This illustrates the Law of Large Numbers, which states that the sample mean of independent, identically distributed trials converges to the expected value as the number of trials grows. With only 10 trials, the mean can deviate noticeably from 3.5; with 1,000 and then 10,000 rolls, it typically lies much closer to the theoretical mean.
- **Variance of Dice Rolls vs. Number of Trials**: The variance plot indicates the spread of the outcomes around the mean. The experimental variance approaches the theoretical variance of a fair six-sided die as the number of trials increases. The theoretical variance is 35/12 ≈ 2.92 (computed in the code as `np.var(np.arange(1, 7))`) and is shown as the red dashed line. Similar to the mean, the variance stabilizes as we increase the number of trials, providing a consistent measure of the outcomes' spread.
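
For reference, the theoretical mean and variance of a fair six-sided die follow directly from the definitions, assuming each face $k \in \{1, \dots, 6\}$ has probability $1/6$:

$$
\begin{aligned}
E[X] &= \sum_{k=1}^{6} k \cdot \frac{1}{6} = \frac{21}{6} = 3.5 \\
E[X^2] &= \sum_{k=1}^{6} k^2 \cdot \frac{1}{6} = \frac{91}{6} \\
\operatorname{Var}(X) &= E[X^2] - \left(E[X]\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92
\end{aligned}
$$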

This exercise demonstrates the Law of Large Numbers and the significance of mean and variance in understanding the distribution of random variables. By simulating an increasing number of dice rolls, we observed how the experimental mean and variance stabilize and converge towards their theoretical values. This not only illustrates the central concepts of statistics and probability but also underscores the importance of large sample sizes in achieving reliable and accurate estimates of population parameters.
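
How quickly the sample mean settles can be quantified by repeating the experiment many times and measuring the spread of the resulting sample means. The sketch below does this under assumed settings (the seed, 2,000 repetitions, and roll counts of 10, 100, and 1,000 chosen to keep the run light); the spread should shrink roughly like $\sqrt{\operatorname{Var}(X)/n}$.

```python
import numpy as np

rng = np.random.default_rng(1)     # assumed seed, for reproducibility
n_experiments = 2_000              # assumed number of repeated experiments
true_var = 35 / 12                 # variance of a single fair die roll

observed, predicted = {}, {}
for n_rolls in (10, 100, 1_000):
    # Each row is one experiment consisting of n_rolls dice rolls
    rolls = rng.integers(1, 7, size=(n_experiments, n_rolls))
    sample_means = rolls.mean(axis=1)
    observed[n_rolls] = sample_means.std(ddof=1)
    predicted[n_rolls] = np.sqrt(true_var / n_rolls)
    print(f"n={n_rolls:5d}  std of sample mean: {observed[n_rolls]:.4f}  "
          f"theory sqrt(Var/n): {predicted[n_rolls]:.4f}")
```

A histogram of `sample_means` for each `n_rolls` would also show the shape becoming increasingly bell-like, which is the Central Limit Theorem at work.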

-----------

# 5. Image Data Compression Using PCA

## Objective:

Use Principal Component Analysis (PCA) to compress and decompress an image, illustrating the concept of variance in data compression and dimensionality reduction.

## Dataset:

Select a simple image or use an image from a standard dataset (e.g., MNIST if focusing on a single digit image).

## Steps:

1. **Prepare the Image**: Load the image and convert it to grayscale if it's in color. Flatten the image into a 2D array if necessary.
2. **Apply PCA**: Use PCA (from **`sklearn.decomposition`**) to reduce the dimensionality of the image data, retaining different levels of variance (e.g., 95%, 90%, 85%).
3. **Reconstruct the Image**: Inverse transform the PCA components to reconstruct the image with reduced quality.
4. **Visualize and Compare**: Plot the original and reconstructed images side by side to visually compare the effects of data compression.
5. **Discussion**: Discuss how the choice of variance retained affects the image quality and compression ratio, and relate this to the importance of understanding variance in data science.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np
import warnings

# Load the data
train_df = pd.read_csv('./mnist_train.csv')
test_df = pd.read_csv('./mnist_test.csv')

# Assuming the first column is the label
X_train = train_df.drop(labels=["label"], axis=1).values
y_train = train_df["label"].values

# Optionally, visualize the first image
plt.imshow(X_train[0].reshape(28, 28), cmap='gray')
plt.title(f'Label: {y_train[0]}')
plt.show()

# Select the first image as a 28x28 matrix and scale pixel values to [0, 1]
image = X_train[0].reshape(28, 28) / 255.0

# Treat the 28 rows as samples and the 28 columns as features. PCA needs several
# samples: a single flattened 1 x 784 vector has no variance to decompose.
image_rows = image

# Apply PCA with different levels of variance retained
pca_95 = PCA(n_components=0.95)
image_transformed_95 = pca_95.fit_transform(image_rows)
image_reconstructed_95 = pca_95.inverse_transform(image_transformed_95)

pca_90 = PCA(n_components=0.90)
image_transformed_90 = pca_90.fit_transform(image_rows)
image_reconstructed_90 = pca_90.inverse_transform(image_transformed_90)

pca_85 = PCA(n_components=0.85)
image_transformed_85 = pca_85.fit_transform(image_rows)
image_reconstructed_85 = pca_85.inverse_transform(image_transformed_85)

# Number of principal components kept at each level
print(pca_95.n_components_, pca_90.n_components_, pca_85.n_components_)

fig, axs = plt.subplots(1, 4, figsize=(20, 5))

axs[0].imshow(image, cmap='gray')
axs[0].set_title('Original Image')

axs[1].imshow(image_reconstructed_95.reshape(28, 28), cmap='gray')
axs[1].set_title('95% Variance Retained')

axs[2].imshow(image_reconstructed_90.reshape(28, 28), cmap='gray')
axs[2].set_title('90% Variance Retained')

axs[3].imshow(image_reconstructed_85.reshape(28, 28), cmap='gray')
axs[3].set_title('85% Variance Retained')

for ax in axs:
    ax.axis('off')

plt.show()



## Discussion

The PCA compression experiment demonstrates the trade-off between data reduction and image quality.
- Retaining 95% of the variance preserves most details of the original image, making it almost indistinguishable from the original.
- As we reduce the variance retained to 90% and 85%, the reconstructed images become increasingly blurry, indicating loss of detail.

This exercise highlights the importance of variance in retaining information during dimensionality reduction. Choosing the right level of variance retention is crucial depending on the application's need for accuracy versus compression.
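
The link between the variance threshold, the number of components kept, and the storage cost can be made explicit with a small NumPy sketch. It implements PCA through the SVD rather than through `sklearn`, and it uses a synthetic 28x28 image (a diagonal stroke) as a hypothetical stand-in for an MNIST digit; both choices are assumptions made so the example runs without the dataset.

```python
import numpy as np

# Synthetic 28x28 "image" (hypothetical stand-in for a digit): a soft diagonal stroke
yy, xx = np.mgrid[0:28, 0:28]
image = np.exp(-((xx - yy) ** 2) / 8.0)

# PCA via SVD, treating rows as samples and columns as features
col_mean = image.mean(axis=0)
U, S, Vt = np.linalg.svd(image - col_mean, full_matrices=False)
cumulative = np.cumsum(S ** 2) / np.sum(S ** 2)  # cumulative explained-variance ratio

summary = {}
for target in (0.95, 0.90, 0.85):
    # Smallest number of components whose cumulative ratio reaches the target
    k = int(np.searchsorted(cumulative, target) + 1)
    reconstructed = (U[:, :k] * S[:k]) @ Vt[:k] + col_mean
    mse = np.mean((image - reconstructed) ** 2)
    # Values to store: 28 x k scores, k x 28 components, and the 28 column means
    stored = k * (28 + 28) + 28
    summary[target] = (k, mse, stored)
    print(f"target {target:.0%}: k={k}, MSE={mse:.5f}, "
          f"stored values={stored} (original: 784)")
```

Lowering the threshold can only keep the same number of components or fewer, so the reconstruction error never decreases while the storage cost never increases. For a single small image the savings are modest; compression pays off more when one set of components is shared across many images.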