# **Probability**

# 1. Exploring Gaussian Distribution Properties

**Objective**: Generate a synthetic dataset that follows a Gaussian distribution and visualize its properties.

## Example 1

- Generate a dataset with a mean (*μ*) of 50 and a standard deviation (*σ*) of 10. Use NumPy's **`np.random.normal()`** function.
- Plot the histogram of the dataset to visualize the Gaussian distribution. Use Matplotlib for visualization.
- Calculate and plot the empirical cumulative distribution function (CDF) of the dataset.
- Overlay the probability density function (PDF) on the histogram to show how well the synthetic data matches the theoretical distribution.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Create data that follows a Gaussian distribution with a specific mean and standard deviation
mu = 50  # mean
sigma = 10  # standard deviation
samples = 1000  # number of samples
data = np.random.normal(mu, sigma, samples)

# Visualize the distribution of the data with a density-normalized histogram
fig, ax = plt.subplots(figsize=(10, 6))
count, bins, ignored = ax.hist(data, 30, density=True, alpha=0.6, color='g', edgecolor='black')

# Add the theoretical probability density function on top of the histogram to compare the synthetic data with the theoretical distribution
xmin, xmax = ax.get_xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, sigma)
ax.plot(x, p, 'k', linewidth=2, label='Theoretical PDF')
ax.set_title('Gaussian samples: mu = %.2f,  std = %.2f' % (mu, sigma))
ax.set_xlabel('Data values')
ax.set_ylabel('Density')

# Calculate the empirical CDF: the fraction of samples at or below each sorted value
sorted_data = np.sort(data)
yvals = np.arange(1, len(sorted_data) + 1) / len(sorted_data)

# Plot the CDF on a secondary y-axis, since it spans 0 to 1 while the density peaks near 0.04
ax_cdf = ax.twinx()
ax_cdf.plot(sorted_data, yvals, label='Empirical CDF', color='blue')
ax_cdf.set_ylabel('Cumulative probability')

# Combine the legend entries from both axes
lines, labels = ax.get_legend_handles_labels()
lines_cdf, labels_cdf = ax_cdf.get_legend_handles_labels()
ax.legend(lines + lines_cdf, labels + labels_cdf, loc='upper left')
plt.show()


1. **Dataset Generation**: A synthetic dataset was created with a mean (μ) of 50 and a standard deviation (σ) of 10, consisting of 1000 samples. This dataset follows a Gaussian distribution.
2. **Histogram Visualization**: The histogram of the dataset is plotted, showing the distribution of data values. The green bars represent the frequency of data values within specific intervals, normalized to form a probability density.
3. **Empirical CDF Plot**: The blue line represents the empirical cumulative distribution function (CDF), illustrating the proportion of data values less than or equal to each value on the x-axis. The curve rises most steeply near the mean of 50, where the samples are most densely concentrated, and flattens in the tails.
4. **Theoretical PDF Overlay**: The black line shows the theoretical probability density function (PDF) based on the Gaussian distribution with the specified mean and standard deviation. This overlay demonstrates how well the synthetic data matches the theoretical Gaussian distribution.
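
To put a number on how closely the sample matches the theoretical distribution, one option is to compare the sample mean and standard deviation with μ and σ, and to measure the largest gap between the empirical CDF and `stats.norm.cdf`. The sketch below is only illustrative: it assumes a seeded `np.random.default_rng` generator (the seed 0 is an arbitrary choice, not part of the example above) so that repeated runs give the same numbers.

```python
import numpy as np
import scipy.stats as stats

mu, sigma, samples = 50, 10, 1000
rng = np.random.default_rng(0)  # assumption: arbitrary seed, used only for repeatability
data = rng.normal(mu, sigma, samples)

# Sample moments should land close to the generating parameters
sample_mean = data.mean()
sample_std = data.std(ddof=1)

# Empirical CDF at each sorted point versus the theoretical Gaussian CDF
sorted_data = np.sort(data)
ecdf = np.arange(1, samples + 1) / samples
max_gap = np.max(np.abs(ecdf - stats.norm.cdf(sorted_data, mu, sigma)))

print(f"sample mean = {sample_mean:.2f}, sample std = {sample_std:.2f}")
print(f"largest ECDF vs. CDF gap = {max_gap:.3f}")
```

With 1000 samples the mean and standard deviation should sit close to 50 and 10, and the largest CDF gap should be a few hundredths at most.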

## Example 2: Daily Temperatures in City X

Imagine we're analyzing the daily high temperatures (in degrees Celsius) in city X for the month of July. We've collected temperature data for all 31 days, resulting in the following dataset (simplified for this example):

28, 30, 29, 31, 27, 32, 28, 29, 30, 27, 31, 33, 29, 28, 30, 32, 31, 29, 28, 30, 27, 31, 30, 29, 32, 28, 29, 30, 31, 27, 33

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Input the dataset of daily high temperatures
temperatures = np.array([28, 30, 29, 31, 27, 32, 28, 29, 30, 27, 31, 33, 29, 28, 30, 32, 31, 29, 28, 30, 27, 31, 30, 29, 32, 28, 29, 30, 31, 27, 33])

# Visualize the empirical distribution of temperatures with a histogram
plt.hist(temperatures, bins=7, alpha=0.7, color='blue', edgecolor='black')
plt.title('Histogram of Daily High Temperatures in July for City X')
plt.xlabel('Temperature (°C)')
plt.ylabel('Number of Days')
plt.show()

# First, calculate the empirical CDF, then plot it
# Sort the data and calculate the CDF values
data_sorted = np.sort(temperatures)
cdf = np.arange(1, len(data_sorted)+1) / len(data_sorted)

# Plot the empirical CDF
plt.plot(data_sorted, cdf, marker='.', linestyle='none')
plt.title('Empirical CDF of Daily High Temperatures in July for City X')
plt.xlabel('Temperature (°C)')
plt.ylabel('CDF')
plt.grid(True)
plt.show()

### Interpretation of Results:

**Histogram Interpretation**:

- A histogram groups data into bins and counts how many days fall into each bin. Here the seven bins span 27°C to 33°C, so each whole-degree temperature gets its own bar. This gives you a visual representation of how the temperatures are distributed over the month.
- You can see the most common temperature ranges and how the data spread out. For example, if the histogram shows that most days are clustered around 30°C, it suggests that this temperature is typical for July in City X.

**Empirical CDF Interpretation**:

- The empirical CDF plot shows the proportion of days that had temperatures at or below each temperature value on the x-axis.
- For instance, the CDF value at 30°C is 21/31 ≈ 0.68, which means that about 68% of the days in July had a high temperature of 30°C or lower.
```
Empirical CDF Calculation and Interpretation

Given our dataset of daily high temperatures in city X for July:

28, 30, 29, 31, 27, 32, 28, 29, 30, 27, 31, 33, 29, 28, 30, 32, 31, 29, 28, 30, 27, 31, 30, 29, 32, 28, 29, 30, 31, 27, 33

How to Calculate the Empirical CDF:

1. Sort the Data: Arrange the temperatures in ascending order.
    - Sorted Data: 27,27,27,27,28,28,28,28,28,29,29,29,29,29,29,30,30,30,30,30,30,31,31,31,31,31,32,32,32,33,33
    
2. Calculate Cumulative Frequencies:
    - For each unique temperature value, calculate the cumulative frequency, which is the count of days with temperatures at or below that value.
    
3. Divide by Total Number of Days:
    - For the CDF at each temperature, divide the cumulative frequency by the total number of observations (31 days).

Example Calculations:

- CDF for 27°C: 4 days with temperatures of 27°C or lower. So, CDF(27°C) = 4/31.
- CDF for 29°C: To find this, count all days with temperatures of 29°C or lower. This covers the first 15 observations in the sorted list (four at 27°C, five at 28°C, six at 29°C). So, CDF(29°C) = 15/31.

Interpreting the Empirical CDF:

- What It Tells Us: The value of the empirical CDF at a specific temperature *t* tells us the proportion of the month that had a temperature of t degrees Celsius or lower. For example, since CDF(29°C) = 15/31 ≈ 0.48, about 48% of the days had a temperature of 29°C or lower.

- Graphical Interpretation: When plotted, the empirical CDF provides a visual representation of the distribution of temperatures throughout the month. It shows, at any given temperature on the x-axis, the fraction of days that had a temperature at or below that point on the y-axis.
```
- This plot provides a complete picture of the temperature distribution, allowing you to see not just the most common temperatures (like the histogram) but how all temperatures compare cumulatively.
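
The hand calculation described above (sort, accumulate counts, divide by the number of days) can be sketched with `np.unique` and `np.cumsum`. This is one possible way to tabulate the empirical CDF per unique temperature, not the only one. The variable names are illustrative.

```python
import numpy as np

temperatures = np.array([28, 30, 29, 31, 27, 32, 28, 29, 30, 27, 31, 33, 29, 28, 30, 32,
                         31, 29, 28, 30, 27, 31, 30, 29, 32, 28, 29, 30, 31, 27, 33])

# Unique temperatures (returned in ascending order) and the number of days at each one
values, counts = np.unique(temperatures, return_counts=True)

# Cumulative number of days at or below each value, divided by the total number of days
cumulative = np.cumsum(counts)
ecdf = cumulative / temperatures.size

for value, cum, frac in zip(values, cumulative, ecdf):
    print(f"CDF({value}°C) = {cum}/{temperatures.size} = {frac:.3f}")
```

The first and third rows reproduce the worked values: CDF(27°C) = 4/31 ≈ 0.129 and CDF(29°C) = 15/31 ≈ 0.484.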

**Normality and Variability**:

- By observing the shape of the empirical CDF, you can assess the variability and skewness of the temperature distribution. A perfectly straight diagonal line would indicate a uniform distribution, while a curve that rises steeply at first and then levels out would suggest a concentration of values in the lower range (or vice versa).
- If the empirical CDF closely resembles the CDF of a Gaussian distribution (which is a sigmoid curve), it suggests the data may be approximately normally distributed. However, formal statistical tests would be required for a definitive assessment of normality.
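
As one example of such a formal test, the sketch below applies the Shapiro-Wilk test from `scipy.stats.shapiro` to the temperature data. The choice of test is an assumption made for illustration. With only 31 whole-degree readings and many tied values, the result should be read with caution.

```python
import numpy as np
import scipy.stats as stats

temperatures = np.array([28, 30, 29, 31, 27, 32, 28, 29, 30, 27, 31, 33, 29, 28, 30, 32,
                         31, 29, 28, 30, 27, 31, 30, 29, 32, 28, 29, 30, 31, 27, 33])

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
statistic, p_value = stats.shapiro(temperatures)

print(f"Shapiro-Wilk W = {statistic:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Evidence against normality at the 5% level")
else:
    print("No evidence against normality at the 5% level")
```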

--------------------------

# 2. Gaussian Noise in Image Processing

**Objective**: Demonstrate the application of Gaussian noise to an image and use a Gaussian filter for noise reduction.

## Gaussian Noise
Gaussian noise, also referred to as normal noise, is statistical noise whose probability density function (PDF) is that of the normal (Gaussian) distribution. In simpler terms, it is a type of noise in images and signals where the intensity variations follow a Gaussian distribution. Here are key points to understand about Gaussian noise:

### Characteristics of Gaussian Noise

- **Distribution**: The values of the noise follow a bell-shaped curve when plotted, with most noise values being close to the mean and fewer values at the extremes.
- **Mean**: Often, the mean of Gaussian noise is zero, but it can be shifted to any other value.
- **Standard Deviation**: This measures the spread of the noise values around the mean. A larger standard deviation means the noise can cause more significant variations in signal or image intensity.
- **Randomness**: Each pixel in an image affected by Gaussian noise is altered in a random manner according to the Gaussian distribution, leading to variations in brightness or color information.
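
These characteristics can be seen in a small NumPy sketch that adds noise by hand. The function name `add_gaussian_noise`, its parameters, and the flat gray test image are all hypothetical choices for illustration. The sketch assumes a float image with intensities in [0, 1].

```python
import numpy as np

def add_gaussian_noise(image, mean=0.0, sigma=0.1, seed=None):
    """Return a copy of a float image in [0, 1] with additive Gaussian noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(mean, sigma, size=image.shape)  # one independent draw per pixel
    return np.clip(image + noise, 0.0, 1.0)            # keep intensities in the valid range

# Hypothetical stand-in for a real photo: a flat mid-gray 64x64 image
image = np.full((64, 64), 0.5)
noisy = add_gaussian_noise(image, mean=0.0, sigma=0.1, seed=0)

print(f"mean shift = {noisy.mean() - image.mean():+.4f}")
print(f"noise std  = {(noisy - image).std():.4f}")
```

With a zero mean the average brightness barely moves, while the spread of the per-pixel changes tracks the chosen `sigma`.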

### Impact on Images

Gaussian noise can be introduced into images due to various factors such as sensor noise in low light, electronic interference, and transmission in poor conditions. It typically results in every pixel in the image being altered from its original value by a small amount, leading to a grainy appearance if the noise level is high.

## Gaussian Filter

A Gaussian filter is a linear filter used in image processing to smooth images and reduce noise. It operates by convolving a Gaussian kernel with an image. The Gaussian kernel is a matrix that embodies the shape of a Gaussian (bell-shaped) curve in two dimensions. Each element of the kernel is calculated using the Gaussian function, and the kernel is applied to every pixel in the image to produce a smoothed output.
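
A minimal sketch of this idea builds a small kernel from the two-dimensional Gaussian function and convolves it with an image using `scipy.ndimage.convolve`. The helper name `gaussian_kernel`, the 5x5 size, and the single-pixel test image are assumptions made for illustration. Library routines such as `GaussianBlur()` choose their own kernel sizes and boundary handling.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian kernel (hypothetical helper; size should be odd)."""
    half = size // 2
    coords = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(coords, coords)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / kernel.sum()  # weights sum to 1 so overall brightness is preserved

kernel = gaussian_kernel(size=5, sigma=1.0)

# Hypothetical test image: a single bright pixel on a dark background
image = np.zeros((9, 9))
image[4, 4] = 1.0
smoothed = convolve(image, kernel, mode='nearest')

print(np.round(kernel, 3))
print(f"peak after smoothing = {smoothed.max():.3f}")
```

Smoothing a single bright pixel spreads its intensity over the neighborhood in the shape of the kernel, which is why isolated noisy pixels are damped.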

### Applications

- **Noise Reduction**: A Gaussian filter replaces each pixel with a weighted average of its neighbors, with weights that fall off smoothly with distance. This averaging suppresses independent, zero-mean pixel noise such as Gaussian noise, at the cost of some blur similar to slightly out-of-focus photography.
- **Preprocessing**: It's often used as a preprocessing step in computer vision algorithms to simplify the image or reduce details that might complicate tasks like edge detection.

## Example

- Load a sample image using OpenCV or Matplotlib.
- Add Gaussian noise to the image. Implement a function to manually add noise, or use skimage's `random_noise()`.
- Apply a Gaussian filter to the noisy image to reduce noise. Utilize OpenCV's `GaussianBlur()` or skimage's `gaussian()` functions.
- Display the original, noisy, and denoised images side by side.

Necessary libraries: `$ pip install matplotlib opencv-python scikit-image`

In [None]:
import matplotlib.pyplot as plt
from skimage import io, img_as_float
from skimage.util import random_noise
from skimage.filters import gaussian

# Load an image using skimage
image = img_as_float(io.imread('https://images.unsplash.com/photo-1589118949245-7d38baf380d6'))

# Add Gaussian noise to the image (var is the noise variance on the [0, 1] intensity scale)
noisy_image = random_noise(image, mode='gaussian', var=0.2)  # Increase the variance if needed

# Apply Gaussian filter (blur) to the noisy image.
# sigma is measured in pixels: a value as small as 0.05 leaves the image essentially unchanged.
denoised_image = gaussian(noisy_image, sigma=2, channel_axis=-1)  # Adjust sigma as needed

# Display the images
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
titles = ['Original Image', 'Image with Gaussian Noise', 'Denoised Image']
images = [image, noisy_image, denoised_image]

for ax, img, title in zip(axes, images, titles):
    ax.imshow(img)
    ax.set_title(title)
    ax.axis('off')
plt.tight_layout()
plt.show()

### Interpretation of Results

- **Original Image**: The first plot shows the original image without any alterations, serving as a baseline for comparison.
- **Noisy Image**: The second plot demonstrates the effect of adding Gaussian noise to the image. This simulates real-world scenarios where images might be affected by various types of noise, impacting their clarity and quality.
- **Denoised Image**: The third plot showcases the result of applying a Gaussian filter to the noisy image. The Gaussian filter smooths the image, reducing noise and making it more visually similar to the original. However, note that excessive smoothing might also blur important details.

This exercise demonstrates essential techniques in image processing, particularly in handling noise, which is a common issue in digital imaging.
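
The visual comparison can be backed by a simple error measure. The sketch below is an illustration under stated assumptions: it uses a synthetic gradient image instead of the photo, a noise standard deviation of 0.1, and `scipy.ndimage.gaussian_filter` as a stand-in for skimage's `gaussian()`, then compares the mean squared error (MSE) against the clean image before and after filtering.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability

# Hypothetical stand-in image: a smooth horizontal gradient with values in [0, 1]
image = np.tile(np.linspace(0.0, 1.0, 128), (128, 1))

# Additive Gaussian noise with standard deviation 0.1, clipped back to the valid range
noisy = np.clip(image + rng.normal(0.0, 0.1, image.shape), 0.0, 1.0)

# Gaussian smoothing with sigma measured in pixels
denoised = gaussian_filter(noisy, sigma=2)

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    return np.mean((a - b) ** 2)

mse_noisy = mse(image, noisy)
mse_denoised = mse(image, denoised)
print(f"MSE noisy    = {mse_noisy:.5f}")
print(f"MSE denoised = {mse_denoised:.5f}")
```

On a smooth image like this one the filtered result should be much closer to the original. On images with fine detail the gain is smaller, because the filter also blurs real edges.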

------------------

# 3. Probability Distributions in Data Science

**Objective**: Compare different probability distributions and their fit to real-world data.


## Kolmogorov-Smirnov (KS)

The KS test reports two statistics, **D** and the **p-value**, that are central to hypothesis testing and to assessing goodness-of-fit. Together they give a quantitative way to compare the distribution of your observed data against theoretical distributions, which helps determine the most suitable model for your data.

### D - The KS Statistic

- **What Is D?**: In the KS test, **D** is the maximum distance between the empirical cumulative distribution function (ECDF) of your sample data and the cumulative distribution function (CDF) of the reference distribution (in your case, Gaussian, Exponential, or Uniform distributions). It quantifies the greatest vertical distance between these two curves.
- **Interpretation**: A larger value of **D** indicates a greater discrepancy between the observed data's distribution and the theoretical distribution being tested. If **D** is small, it suggests that the sample data closely follow the theoretical distribution.
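
To make the definition concrete, the sketch below computes D by hand from the ECDF and checks it against `scipy.stats.kstest`. The standard normal sample and the seed are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
sample = rng.normal(0.0, 1.0, 200)

x = np.sort(sample)
n = x.size
cdf = stats.norm.cdf(x)  # reference CDF evaluated at each sorted observation

# The ECDF jumps from (i-1)/n to i/n at the i-th sorted point, so check both sides of each step
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

d_scipy = stats.kstest(sample, 'norm').statistic
print(f"manual D = {d_manual:.4f}, scipy D = {d_scipy:.4f}")
```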

### p-value

- **What Is the p-value?**: The p-value obtained from the KS test is the probability of observing a test statistic at least as extreme as **D**, assuming that the null hypothesis is true. The null hypothesis, in this case, is that the data follow the specified distribution (Gaussian, Exponential, or Uniform).
- **Interpretation**:
    - A **low p-value** (typically < 0.05) indicates strong evidence against the null hypothesis, leading to its rejection. This suggests the sample data are unlikely to have come from the distribution being tested.
    - A **high p-value** suggests insufficient evidence to reject the null hypothesis, indicating that the sample data could plausibly have come from the theoretical distribution.
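
A small simulation illustrates the two cases. Both samples below are synthetic and the seed is arbitrary (assumptions for illustration): one is drawn from a standard normal distribution and one from an exponential distribution, and each is tested against the standard normal.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
normal_sample = rng.normal(0.0, 1.0, 500)
skewed_sample = rng.exponential(1.0, 500)

# Null hypothesis in both tests: the sample comes from a standard normal distribution
d_normal, p_normal = stats.kstest(normal_sample, 'norm')
d_skewed, p_skewed = stats.kstest(skewed_sample, 'norm')

print(f"normal sample:      D = {d_normal:.3f}, p-value = {p_normal:.3g}")
print(f"exponential sample: D = {d_skewed:.3f}, p-value = {p_skewed:.3g}")
```

The exponential sample contains no negative values, so its ECDF is still 0 where the standard normal CDF has already reached 0.5. This gives a large D and a p-value close to zero.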

### Why Do We Use Them?

- **Assessing Goodness-of-Fit**: The KS test, through **D** and the **p-value**, helps us objectively assess how well our sample data fit a given theoretical distribution. This is crucial in many areas of statistics and data science, where assumptions about data distribution underpin many methods and tests.
- **Guiding Model Selection**: By comparing the **D** values and **p-values** across different distributions, you can select the most appropriate model for your data. This can inform further analysis, hypothesis testing, and predictive modeling.
- **Non-Parametric and Versatile**: The KS test does not assume a normal distribution of the data, making it a non-parametric test. It's versatile and can be used to compare a sample with a reference probability distribution or to compare two samples.

### A Good Fit by the Gaussian Distribution

- **In Terms of D**: A small D value suggests that there is minimal discrepancy between the observed data distribution and the expected Gaussian distribution, implying that the data likely follows a Gaussian distribution pattern.
- **In Terms of p-value**: A high p-value (typically greater than a significance level like 0.05 or 0.01) means that there is insufficient evidence to reject the null hypothesis that the data are drawn from a Gaussian distribution. In other words, the p-value suggests that the observed distribution is plausibly Gaussian.

### A Good Fit by the Exponential Distribution

- **In Terms of D**: A smaller D value for the Exponential distribution compared to the D values for Gaussian and Uniform distributions suggests that the empirical cumulative distribution function (ECDF) of your data is closer to the cumulative distribution function (CDF) of the Exponential distribution. This implies a better match between your data's distribution and the Exponential model.
- **In Terms of p-value**: A higher p-value for the Exponential distribution means that a discrepancy at least as large as the observed D would not be unusual if the true distribution were Exponential. If this p-value is much higher than those for the Gaussian and Uniform distributions, and especially if it is above a common threshold (like 0.05), the Exponential distribution is a plausible model for your data.

### A Good Fit by the Uniform Distribution

- **In Terms of D**: A smaller D value for the Uniform distribution indicates that the maximum discrepancy between the ECDF of your data and the CDF of the Uniform distribution is minimal among the distributions tested. This suggests that your data might be uniformly distributed across its range.
- **In Terms of p-value**: A higher p-value for the Uniform distribution suggests that there's a lack of evidence to reject the hypothesis that your data follow a Uniform distribution. If the Uniform distribution's p-value is the highest among those tested and is above a standard significance level, it suggests that the Uniform distribution could be a suitable model for your data.

### Interpretation and Usage

- **Comparative Analysis**: Comparing D and p-values across distributions helps identify which theoretical distribution best represents your data. A distribution with the smallest D and a p-value above a significance threshold (e.g., 0.05) is considered the best fit.
- **Model Selection and Hypothesis Testing**: Selecting a distribution that fits your data well is crucial for accurate modeling, hypothesis testing, and prediction. For instance, if an Exponential distribution fits your data better, it might influence how you model phenomena such as time between events or website visitor counts.
- **Understanding Data Characteristics**: The fitting distribution reflects inherent characteristics of your data. An Exponential fit might suggest a rapid drop-off in frequency as values increase, typical in wait times or service intervals. A Uniform fit implies equal likelihood across a range of values, which might be seen in scenarios where outcomes are equally probable within certain limits.


## Example

- Choose a real-world dataset (e.g., from Kaggle or UCI Machine Learning Repository) that has at least one numerical feature likely to follow a known distribution.
- Visualize the chosen feature using a histogram.
- Fit Gaussian, exponential, and uniform distributions to the data. You can use SciPy's statistical functions like `scipy.stats.norm`, `scipy.stats.expon`, and `scipy.stats.uniform`.
- Calculate goodness-of-fit measures (e.g., AIC, BIC, or Kolmogorov-Smirnov test) for each fitted distribution.
- Discuss which distribution best fits the data and why.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Load the dataset
#df = pd.read_csv('website_visitors.csv')

# Simulate daily visitors for a year
np.random.seed(42)  # For reproducibility
visitors = np.random.randint(100, 1000, size=365)

# Create a DataFrame and save to CSV
df = pd.DataFrame(visitors, columns=['visitors'])
df.to_csv('website_visitors.csv', index=False)

# Plot a density-normalized histogram of the visitors data so the fitted PDFs share its scale
plt.hist(df['visitors'], bins=30, density=True, alpha=0.7, color='blue', edgecolor='black')
plt.title('Histogram of Daily Visitors with Fitted Distributions')
plt.xlabel('Number of Visitors')
plt.ylabel('Density')

# Evaluate every fitted PDF on the same grid (read the axis limits before calling plt.show())
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)

# Fit a Gaussian Distribution
mu, std = stats.norm.fit(df['visitors'])
p = stats.norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2, label='Gaussian')

# Fit an Exponential Distribution
loc, scale = stats.expon.fit(df['visitors'])
p = stats.expon.pdf(x, loc, scale)
plt.plot(x, p, 'r', linewidth=2, label='Exponential')

# Fit a Uniform Distribution
min_visitors, max_visitors = df['visitors'].min(), df['visitors'].max()
p = stats.uniform.pdf(x, loc=min_visitors, scale=max_visitors-min_visitors)
plt.plot(x, p, 'g', linewidth=2, label='Uniform')

plt.legend()
plt.show()

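# Calculate goodness-of-fit measures
# Note: the parameters were estimated from this same sample, so the KS p-values tend to be optimistic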
# Gaussian
D, p_value = stats.kstest(df['visitors'], 'norm', args=(mu, std))
print(f'Gaussian KS test: D={D:.2f}, p-value={p_value:.2f}\n')

# Exponential
D, p_value = stats.kstest(df['visitors'], 'expon', args=(loc, scale))
print(f'Exponential KS test: D={D:.2f}, p-value={p_value:.2f}\n')

# Uniform
D, p_value = stats.kstest(df['visitors'], 'uniform', args=(min_visitors, max_visitors-min_visitors))
print(f'Uniform KS test: D={D:.2f}, p-value={p_value:.2f}\n\n')

### Discuss the Best Fit

After plotting the fitted distributions and calculating the goodness-of-fit measures, you'll have a clearer idea of which distribution best describes your data. The discussion will depend on the results:

- If the Gaussian fit has the highest p-value in the KS test, it suggests that the daily visitors follow a normal distribution, indicating that most days have a visitor count close to the average, with fewer very high or very low counts.
- If the Exponential distribution fits better, it may indicate that lower visitor counts are more common, with the probability of high counts dropping off exponentially.
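- A good fit by the Uniform distribution would suggest that all visitor counts within the range are equally likely. This is less common for real-world data like website visitors, but it is the expected outcome here because the example data were simulated with `np.random.randint`, which draws uniformly.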
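
The exercise also mentions AIC as a goodness-of-fit measure. A minimal sketch, assuming the same simulated visitor data and the same fitted parameters as in the example, computes AIC = 2k − 2·ln(L) for each candidate, where k is the number of fitted parameters and ln(L) is the log-likelihood. Lower values indicate a better trade-off between fit and complexity.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)  # same seed and simulation as the example
visitors = np.random.randint(100, 1000, size=365)

# Fitted parameters for each candidate distribution
candidates = {
    'Gaussian': (stats.norm, stats.norm.fit(visitors)),
    'Exponential': (stats.expon, stats.expon.fit(visitors)),
    'Uniform': (stats.uniform, (visitors.min(), visitors.max() - visitors.min())),
}

aic = {}
for name, (dist, params) in candidates.items():
    log_likelihood = np.sum(dist.logpdf(visitors, *params))
    k = len(params)  # number of fitted parameters (two for each candidate here)
    aic[name] = 2 * k - 2 * log_likelihood
    print(f"{name}: log-likelihood = {log_likelihood:.1f}, AIC = {aic[name]:.1f}")

best = min(aic, key=aic.get)
print(f"Lowest AIC: {best}")
```

Because all three candidates have two parameters, the ranking here is driven entirely by the log-likelihood.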

-----------------

# 4. Bayesian Inference with Gaussian Distributions

**Objective**: Use Bayesian inference to update beliefs about a dataset's mean based on new evidence.

## Bayesian Inference

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. It formally incorporates uncertainty and prior knowledge into statistical models, allowing for more nuanced and informed estimates than methods that rely solely on new data.

### Interpretation
- Posterior Mean: Your updated average height estimate considering both the prior belief and new evidence. If the sample mean is significantly different from the prior mean, and the sample size is large enough, the posterior mean will shift towards the sample mean.
- Posterior Standard Deviation: Indicates the uncertainty of the updated estimate. Gathering more data (increasing N) or having a smaller sample standard deviation (precise measurements) reduces uncertainty, narrowing the posterior distribution.
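
The effect of sample size on uncertainty can be checked numerically. The sketch below assumes a prior standard deviation of 5 cm and a known data standard deviation of 10 cm (the values used in the height example in this section) and evaluates the posterior standard deviation for several values of N. The helper name `posterior_std` is illustrative.

```python
import numpy as np

sigma_prior = 5.0   # prior standard deviation (cm)
sigma_data = 10.0   # assumed known standard deviation of individual heights (cm)

def posterior_std(n):
    """Posterior standard deviation of the mean after n observations (normal-normal model)."""
    return np.sqrt(1.0 / (1.0 / sigma_prior**2 + n / sigma_data**2))

for n in (1, 10, 50, 500):
    print(f"N = {n:>3}: posterior std = {posterior_std(n):.2f} cm")
```

The posterior standard deviation falls from about 4.47 cm with a single observation to about 1.36 cm with 50 observations.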

## Example: Estimating the average height of male adults

*   **Prior Belief**: Assume a prior distribution for the mean of a dataset. We could assume a Gaussian distribution as the prior.
    * Assume your prior belief about the average male height follows a normal distribution with a mean $\mu_{\text{prior}}$ of 175 cm and a standard deviation $\sigma_{\text{prior}}$ of 5 cm.
*   **New Evidence (Data)**: Generate a synthetic dataset that represents new evidence.
    * Simulate observed data that represents the heights of 50 male adults, drawn from a normal distribution with a mean of 178 cm and a standard deviation of 10 cm.
*   **Posterior Distribution**: Use Bayes' theorem to update the prior distribution with the new evidence, resulting in a posterior distribution.
    * With Gaussian distributions, when the prior and likelihood are normal, the posterior parameters can be calculated analytically using the formulas for the conjugate normal-normal model. The posterior mean $\mu_{\text{post}}$ and standard deviation $\sigma_{\text{post}}$ can be computed as follows:

    $$\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_{\text{prior}}^2} + \frac{N}{\sigma_{\text{data}}^2}, \qquad \mu_{\text{post}} = \sigma_{\text{post}}^2 \left( \frac{\mu_{\text{prior}}}{\sigma_{\text{prior}}^2} + \frac{\sum_{i=1}^{N} x_i}{\sigma_{\text{data}}^2} \right)$$

    where $N$ is the number of observations, $x_i$ are the observed values, and $\sigma_{\text{data}}$ is the (assumed known) standard deviation of the observed data.
*   Visualize the prior, likelihood, and posterior distributions to show how the initial belief (prior) is updated with new data (likelihood) to form an updated belief (posterior).

In [None]:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

mu_prior = 175  # Prior mean
sigma_prior = 5  # Prior standard deviation

# Plotting the prior distribution
x = np.linspace(160, 190, 1000)
y_prior = norm.pdf(x, mu_prior, sigma_prior)

plt.plot(x, y_prior, label='Prior')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.title('Prior Distribution of Mean Height')
plt.legend()
plt.show()

np.random.seed(42)  # For reproducibility
sample_data = np.random.normal(178, 10, 50)  # Generate synthetic data

N = len(sample_data)  # Number of observations
sigma_data = 10  # Known standard deviation of data

# Calculate the posterior mean and standard deviation
sigma_post_squared_inv = (1/sigma_prior**2) + (N/sigma_data**2)
mu_post = (mu_prior/sigma_prior**2 + sample_data.sum()/sigma_data**2) / sigma_post_squared_inv
sigma_post = np.sqrt(1/sigma_post_squared_inv)

print(f"Posterior Mean: {mu_post:.2f}")
print(f"Posterior Standard Deviation: {sigma_post:.2f}")

# Plot the posterior distribution
x = np.linspace(160, 190, 1000)
y_post = norm.pdf(x, mu_post, sigma_post)

plt.plot(x, y_post, label='Posterior')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.title('Posterior Distribution of Mean Height')
plt.legend()
plt.show()

### Interpretation

- **Prior Distribution**: Your initial belief about the mean height.
- **Likelihood**: Incorporates the new evidence from your sample.
- **Posterior Distribution**: Represents the updated belief after considering the new evidence. It combines your prior belief with the new evidence.

The posterior distribution plot will show how your belief about the average male height has been updated by the sample data. Here the data carry a precision of N/σ² = 50/10² = 0.5 against a prior precision of 1/5² = 0.04, so the posterior mean is a weighted average that places about 93% of its weight on the sample mean, and the posterior standard deviation shrinks from 5 cm to about 1.36 cm.
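
The example code plots the prior and the posterior but not the likelihood. The sketch below, under the same assumptions as the example (same seed, known data standard deviation of 10 cm), evaluates all three curves on a common grid and checks the analytic posterior against a numerical one obtained by multiplying prior and likelihood and renormalizing. The grid resolution and variable names are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

mu_prior, sigma_prior = 175, 5
sigma_data = 10  # assumed known standard deviation of the data

np.random.seed(42)  # same seed as the example
sample_data = np.random.normal(178, sigma_data, 50)
N = len(sample_data)

# Analytic posterior from the conjugate normal-normal formulas
precision_post = 1 / sigma_prior**2 + N / sigma_data**2
mu_post = (mu_prior / sigma_prior**2 + sample_data.sum() / sigma_data**2) / precision_post
sigma_post = np.sqrt(1 / precision_post)

# Grid of candidate values for the mean height
x = np.linspace(160, 190, 3001)
dx = x[1] - x[0]
y_prior = norm.pdf(x, mu_prior, sigma_prior)
# As a function of the mean, the likelihood is proportional to a normal curve
# centered on the sample mean with standard deviation sigma_data / sqrt(N)
y_like = norm.pdf(x, sample_data.mean(), sigma_data / np.sqrt(N))

# Numerical posterior: prior times likelihood, renormalized to integrate to 1
y_post = y_prior * y_like
y_post /= y_post.sum() * dx

grid_mean = np.sum(x * y_post) * dx
grid_std = np.sqrt(np.sum((x - grid_mean) ** 2 * y_post) * dx)

print(f"analytic:  mean = {mu_post:.3f}, std = {sigma_post:.3f}")
print(f"numerical: mean = {grid_mean:.3f}, std = {grid_std:.3f}")
```

The arrays `y_prior`, `y_like`, and `y_post` can each be drawn with `plt.plot(x, ...)` on one set of axes to show the prior, the likelihood, and the posterior together.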