# Statistics for Data Analysis: Probability Distributions & Variability

This project presents a collection of applied statistical analyses focused on
variability, probability distributions, and relationships between variables.

The objective is to demonstrate practical statistical concepts commonly used
in data analysis, including variance, standard deviation, and key probability
distributions.


## Project Scope

The notebook covers fundamental statistical concepts through practical examples,
including:

- Measures of spread
- Normal distribution
- Binomial distribution
- Poisson distribution
- Exponential distribution
- Covariance

Each section applies the underlying theory in Python to a problem inspired by
a real-world scenario.


## 1. Measures of Spread

This section evaluates the variability of quiz scores for a small class using
variance and standard deviation.


In [1]:
import numpy as np

scores = np.array([40, 30, 20, 60, 70, 60, 80, 50, 60, 60])

mean = np.mean(scores)
# np.var and np.std default to ddof=0, i.e. the population formulas (divide by n)
variance = np.var(scores)
standard_deviation = np.std(scores)

print(f"Mean: {mean:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {standard_deviation:.2f}")


Mean: 53.00
Variance: 301.00
Standard Deviation: 17.35


### Interpretation

A standard deviation of about 17.3 points around a mean of 53 indicates that the
quiz scores are widely dispersed, so student performance is far from consistent.
Both figures use the population formulas (dividing by n), which is NumPy's default.
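
Whether to divide by n or n - 1 depends on whether the ten scores are treated as
the whole class or as a sample from a larger group. The sketch below, which
assumes the same `scores` array, contrasts the two through NumPy's `ddof`
argument. The names `population_variance` and `sample_variance` are illustrative
and not part of the original notebook.

```python
import numpy as np

scores = np.array([40, 30, 20, 60, 70, 60, 80, 50, 60, 60])

# ddof=0 (NumPy's default) divides by n: population variance
population_variance = np.var(scores, ddof=0)

# ddof=1 divides by n - 1: sample variance (Bessel's correction)
sample_variance = np.var(scores, ddof=1)
sample_std = np.std(scores, ddof=1)

print(f"Population variance: {population_variance:.2f}")
print(f"Sample variance: {sample_variance:.2f}")
print(f"Sample standard deviation: {sample_std:.2f}")
```

If the class were treated as a sample, the `ddof=1` figures would be the ones to
report; they are always slightly larger than the population figures.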


## 2. Normal Distribution

This section calculates the Probability Density Function (PDF) for a normally
distributed random variable.


In [2]:
from scipy.stats import norm

mean = 4
std_dev = 2
x = 3

pdf_value = norm.pdf(x, mean, std_dev)

print(f"PDF of X at x = {x}: {pdf_value:.4f}")


PDF of X at x = 3: 0.1760


### Interpretation

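The PDF value is a density, not a probability: for a continuous variable the
probability of observing exactly x = 3 is zero. A density of 0.1760 at x = 3
describes how likely values near 3 are relative to values where the density is
lower; the probability of an interval is the area under the PDF over that interval.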
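
To make the distinction between density and probability concrete, the sketch
below recomputes the density from the closed-form normal PDF and then uses the
CDF to get the probability of a narrow window around x = 3. The window
`half_width = 0.05` is an arbitrary choice for illustration and does not come
from the original notebook.

```python
import math
from scipy.stats import norm

mean, std_dev, x = 4, 2, 3

# Closed-form normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
manual_pdf = math.exp(-((x - mean) ** 2) / (2 * std_dev**2)) / (
    std_dev * math.sqrt(2 * math.pi)
)

# Probability of a narrow interval around x, taken from the CDF
half_width = 0.05  # illustrative assumption
interval_prob = norm.cdf(x + half_width, mean, std_dev) - norm.cdf(
    x - half_width, mean, std_dev
)

print(f"Manual PDF at x = {x}: {manual_pdf:.4f}")
print(f"P({x - half_width} < X < {x + half_width}): {interval_prob:.4f}")
print(f"Density times interval width: {manual_pdf * 2 * half_width:.4f}")
```

For a narrow interval, the probability is approximately the density multiplied
by the interval width, which is why the last two printed values nearly agree.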


## 3. Binomial Distribution

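This section models a scenario with two possible outcomes per trial (success or
failure), such as passing or failing an exam or winning or losing a game. The
example computes the probability of winning exactly 15 of 20 games when each
game is won with probability 0.55.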


In [4]:
from scipy.stats import binom
import math

n = 20      # number of trials
p = 0.55    # probability of success
k = 15      # number of successes

prob_15_wins = binom.pmf(k, n, p)
mean_wins = n * p
std_dev_wins = math.sqrt(n * p * (1 - p))

print(f"Probability of winning 15 games: {prob_15_wins:.4f}")
print(f"Expected number of wins: {mean_wins:.2f}")
print(f"Standard deviation of wins: {std_dev_wins:.2f}")


Probability of winning 15 games: 0.0365
Expected number of wins: 11.00
Standard deviation of wins: 2.22


### Interpretation

The binomial distribution is appropriate when the number of trials is fixed,
the trials are independent, each trial has only two possible outcomes, and the
probability of success is the same on every trial.
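
As a cross-check on `binom.pmf`, the probability can be rebuilt from the binomial
formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The sketch below assumes the same
`n`, `p`, and `k` as above; `prob_at_least_k` is an extra illustrative quantity
that the original notebook does not compute.

```python
import math
from scipy.stats import binom

n, p, k = 20, 0.55, 15

# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
manual_pmf = math.comb(n, k) * p**k * (1 - p) ** (n - k)

# P(X >= k): the survival function at k - 1 gives P(X > k - 1)
prob_at_least_k = binom.sf(k - 1, n, p)

print(f"Manual P(X = {k}): {manual_pmf:.4f}")
print(f"P(X >= {k}): {prob_at_least_k:.4f}")
```

The probability of exactly 15 wins and the probability of at least 15 wins are
different quantities; which one applies depends on how the question is worded.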


## 4. Poisson Distribution

The Poisson distribution models the number of occurrences of an event within
a fixed interval of time or space, assuming the events occur independently at a
constant average rate.


In [5]:
from scipy.stats import poisson

lambda_ = 2  # average number of occurrences

prob_more_than_2 = 1 - poisson.cdf(2, lambda_)

print(f"Probability of more than 2 occurrences: {prob_more_than_2:.4f}")


Probability of more than 2 occurrences: 0.3233


### Interpretation

This result is the probability of observing more than 2 occurrences in an
interval when the average is 2 occurrences per interval, which happens roughly
one time in three.
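
The complement step can be spelled out from the Poisson probability mass
function P(X = k) = e^(-λ) λ^k / k!. The sketch below assumes the same rate of 2
and compares the manual sum with `poisson.sf`, SciPy's survival function, which
returns P(X > k) directly.

```python
import math
from scipy.stats import poisson

lambda_ = 2  # average number of occurrences per interval

# P(X <= 2) as the sum of the PMF for k = 0, 1, 2
prob_at_most_2 = sum(
    math.exp(-lambda_) * lambda_**k / math.factorial(k) for k in range(3)
)
manual_prob_more_than_2 = 1 - prob_at_most_2

# Survival function: P(X > 2) without the explicit complement
sf_prob_more_than_2 = poisson.sf(2, lambda_)

print(f"Manual P(X > 2): {manual_prob_more_than_2:.4f}")
print(f"poisson.sf(2, {lambda_}): {sf_prob_more_than_2:.4f}")
```

Using `sf` instead of `1 - cdf` tends to be more accurate when the tail
probability is very small, although the two agree closely in this example.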


## 5. Exponential Distribution

The exponential distribution models the time between independent events
occurring at a constant average rate.


In [6]:
from scipy.stats import expon

arrival_rate = 30 / 60  # 0.5 customers per minute (30 per hour)

average_time_between = 1 / arrival_rate
print(f"Average time between arrivals: {average_time_between:.2f} minutes")



Average time between arrivals: 2.00 minutes


In [7]:
customers = 3
average_time_three = customers / arrival_rate

print(f"Average time for 3 customers to arrive: {average_time_three:.2f} minutes")


Average time for 3 customers to arrive: 6.00 minutes


In [8]:
prob_less_than_1 = expon.cdf(1, scale=1/arrival_rate)
print(f"Probability next arrival is within 1 minute: {prob_less_than_1:.4f}")


Probability next arrival is within 1 minute: 0.3935


In [9]:
prob_more_than_5 = expon.sf(5, scale=1/arrival_rate)
print(f"Probability next arrival takes more than 5 minutes: {prob_more_than_5:.4f}")


Probability next arrival takes more than 5 minutes: 0.0821


In [10]:
time_70_percent = expon.ppf(0.7, scale=1/arrival_rate)
print(f"70% of inter-arrival times are shorter than: {time_70_percent:.2f} minutes")


70% of inter-arrival times are shorter than: 2.41 minutes


### Interpretation

An exponential distribution is reasonable here because customer arrivals are
assumed to be independent and occur at a constant average rate.
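
The three SciPy results above have simple closed forms for a rate λ: the CDF is
1 - e^(-λt), the survival function is e^(-λt), and the quantile for probability
q is -ln(1 - q) / λ. The sketch below assumes the same arrival rate of 0.5
customers per minute; the variable names are illustrative.

```python
import math
from scipy.stats import expon

arrival_rate = 30 / 60  # customers per minute

# P(T <= 1) = 1 - exp(-rate * t)
within_1_minute = 1 - math.exp(-arrival_rate * 1)

# P(T > 5) = exp(-rate * t)
beyond_5_minutes = math.exp(-arrival_rate * 5)

# Quantile: solve 1 - exp(-rate * t) = q for t
quantile_70 = -math.log(1 - 0.7) / arrival_rate

print(f"P(T <= 1): {within_1_minute:.4f}")
print(f"P(T > 5): {beyond_5_minutes:.4f}")
print(f"70th percentile: {quantile_70:.2f} minutes")
```

Note that SciPy parameterizes `expon` by `scale`, the mean time between events,
which is the reciprocal of the rate; passing the rate itself as `scale` would
silently give wrong answers.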


## 6. Covariance Matrix

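This section calculates the covariance matrix of two variables to measure how
they vary together. By default, `np.cov` returns sample covariances (dividing by
n - 1): the diagonal holds the variance of each variable and the off-diagonal
entries hold their covariance. A positive covariance means the two variables
tend to increase together.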


In [11]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 7, 8])

covariance_matrix = np.cov(x, y)

covariance_matrix


array([[2.5 , 3.75],
       [3.75, 5.8 ]])
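
To show where the entries of the matrix come from, the sketch below rebuilds
them from deviations about the mean, dividing by n - 1 as `np.cov` does by
default. It assumes the same `x` and `y` arrays; the `manual_*` names are
illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 7, 8])

n = len(x)
x_dev = x - x.mean()  # deviations of x from its mean
y_dev = y - y.mean()  # deviations of y from its mean

# Sample variance and covariance: sum of products of deviations over n - 1
manual_var_x = np.sum(x_dev**2) / (n - 1)
manual_var_y = np.sum(y_dev**2) / (n - 1)
manual_cov_xy = np.sum(x_dev * y_dev) / (n - 1)

manual_matrix = np.array(
    [[manual_var_x, manual_cov_xy], [manual_cov_xy, manual_var_y]]
)
print(manual_matrix)
```

Unlike `np.cov`, `np.var` divides by n by default, so the diagonal of the
covariance matrix only matches `np.var(x, ddof=1)` and `np.var(y, ddof=1)`.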

## Conclusion

This project demonstrates the application of core statistical concepts using
Python to solve practical problems relevant to data analysis.

Key topics include measures of variability, probability distributions, and
covariance analysis, all of which form a strong foundation for analytical roles.
