# Week 9: Discrete Random Variables & Distributions

**Course**: BSMA1002 - Statistics for Data Science I  
**Topic**: Random Variables, PMF, CDF  
**Week**: 9

## 🎯 Objectives
- Understand Random Variables (mapping outcomes to numbers)
- Work with Probability Mass Functions (PMF)
- Calculate and visualize Cumulative Distribution Functions (CDF)
- Implement custom discrete distributions in Python
- Apply concepts to real-world scenarios (Customer Support Calls)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
print("Libraries imported and configured!")

## 1. Random Variables and PMF

A **Random Variable** is a function that maps outcomes of a random experiment to real numbers.
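- **Discrete RV**: Takes values from a countable set (e.g., the roll of a die, the number of heads in a series of coin tosses).
- **PMF (Probability Mass Function)**: $P(X=x)$, the probability that the random variable takes each possible value $x$. The probabilities are non-negative and sum to 1.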
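To make the "function from outcomes to numbers" idea concrete, here is a minimal sketch that builds a PMF by enumeration. The two-coin experiment and the names `sample_space`, `num_heads`, and `two_coin_pmf` are illustrative assumptions, not part of the course material; it uses only the standard library and exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of two fair coin tosses; all four outcomes are equally likely.
sample_space = list(product("HT", repeat=2))  # ('H', 'H'), ('H', 'T'), ...

def num_heads(outcome):
    """Random variable X: maps an outcome (a tuple of tosses) to its number of heads."""
    return outcome.count("H")

# PMF: P(X = x) = (number of outcomes mapped to x) / (total number of outcomes)
counts = Counter(num_heads(outcome) for outcome in sample_space)
two_coin_pmf = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x, p in two_coin_pmf.items():
    print(f"P(X={x}) = {p}")

# A valid PMF sums to 1.
print("Total probability:", sum(two_coin_pmf.values()))
```

Several outcomes can map to the same value (here both `('H', 'T')` and `('T', 'H')` map to 1), which is why the PMF is defined on the values of $X$ rather than on the outcomes themselves.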

### Example: Fair Die
- Sample Space: $\{1, 2, 3, 4, 5, 6\}$
- PMF: $P(X=x) = 1/6$ for each $x \in \{1, 2, 3, 4, 5, 6\}$, and $0$ otherwise

In [None]:
# Define a fair die random variable
die_outcomes = np.array([1, 2, 3, 4, 5, 6])
die_probs = np.array([1/6] * 6)

# Create a discrete random variable using scipy.stats
die_rv = stats.rv_discrete(name='fair_die', values=(die_outcomes, die_probs))

# Calculate properties
print(f"Mean (Expectation): {die_rv.mean()}")
print(f"Variance: {die_rv.var()}")
print(f"Probability of rolling a 6: {die_rv.pmf(6):.4f}")

# Visualize PMF
plt.figure(figsize=(8, 5))
plt.bar(die_outcomes, die_rv.pmf(die_outcomes), alpha=0.7, color='skyblue', edgecolor='black')
plt.title("PMF of a Fair Die")
plt.xlabel("Outcome")
plt.ylabel("Probability P(X=x)")
plt.ylim(0, 0.25)
plt.show()

## 2. Cumulative Distribution Function (CDF)

The **CDF** gives the probability that the random variable is less than or equal to a value:
$$F_X(x) = P(X \le x) = \sum_{t \le x} P(X=t)$$

For a discrete random variable, the CDF is a **step function**: it jumps by $P(X=x)$ at each value $x$ the variable can take and stays flat in between.
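The summation in the CDF formula is a running total of the PMF. The sketch below builds the fair-die CDF that way and evaluates it at points between the die faces to show the step behaviour. The helper name `die_cdf` and the variables `support`, `pmf_values`, and `cdf_at_support` are hypothetical, and the block uses only the standard library rather than `scipy.stats`.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import accumulate

support = [1, 2, 3, 4, 5, 6]
pmf_values = [Fraction(1, 6)] * 6

# The running sum of the PMF gives the CDF at each support point.
cdf_at_support = list(accumulate(pmf_values))

def die_cdf(x):
    """F(x) = P(X <= x): total PMF mass on support points t <= x."""
    k = bisect_right(support, x)  # number of support points <= x
    return cdf_at_support[k - 1] if k > 0 else Fraction(0)

for x in [0.5, 1, 3.7, 4, 6, 10]:
    print(f"F({x}) = {die_cdf(x)}")
```

Because no probability mass lies strictly between 3 and 4, `die_cdf(3.7)` equals `die_cdf(3)`; the value only changes when `x` reaches the next face.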

In [None]:
# Calculate CDF for the die
cdf_values = die_rv.cdf(die_outcomes)

print("Outcome | PMF   | CDF")
print("-" * 25)
for x, p, c in zip(die_outcomes, die_probs, cdf_values):
    print(f"{x:7d} | {p:.4f} | {c:.4f}")

# Visualize CDF
plt.figure(figsize=(8, 5))
plt.step(die_outcomes, cdf_values, where='post', color='green', linewidth=2, label='CDF')
plt.plot(die_outcomes, cdf_values, 'go', alpha=0.6)
plt.title("CDF of a Fair Die")
plt.xlabel("Outcome (x)")
plt.ylabel("Cumulative Probability P(X ≤ x)")
plt.ylim(0, 1.1)
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

## 3. Real-World Application: Customer Support Calls

Suppose the number of support calls received per minute follows the distribution below, estimated from historical data.

| Calls (x) | 0 | 1 | 2 | 3 | 4 |
|-----------|---|---|---|---|---|
| Prob P(x) | 0.1 | 0.3 | 0.4 | 0.15 | 0.05 |

We can model this as a custom discrete random variable.
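Before handing the table to a library, it can help to compute the same quantities by hand. The sketch below assumes the standard definitions $E[X] = \sum_x x\,P(X=x)$ and $\mathrm{Var}(X) = E[X^2] - (E[X])^2$, which this notebook uses but does not state; the names `call_counts` and `call_probs` are illustrative. Exact fractions avoid floating-point rounding in the checks.

```python
from fractions import Fraction

# Values and probabilities copied from the table above.
call_counts = [0, 1, 2, 3, 4]
call_probs = [Fraction(p) for p in ("0.1", "0.3", "0.4", "0.15", "0.05")]

# Sanity check: a PMF must sum to exactly 1.
total = sum(call_probs)

# P(X <= 2) adds the PMF over x = 0, 1, 2; P(X > 2) is its complement.
p_at_most_2 = sum(p for x, p in zip(call_counts, call_probs) if x <= 2)
p_more_than_2 = 1 - p_at_most_2

# Expectation and variance from their definitions.
mean_calls = sum(x * p for x, p in zip(call_counts, call_probs))
second_moment = sum(x**2 * p for x, p in zip(call_counts, call_probs))
var_calls = second_moment - mean_calls**2

print(f"Total probability: {float(total)}")
print(f"P(X <= 2) = {float(p_at_most_2)}, P(X > 2) = {float(p_more_than_2)}")
print(f"E[X] = {float(mean_calls)}, Var(X) = {float(var_calls)}")
```

These hand-computed values should agree with what `pmf`, `cdf`, and `mean` report for the same table.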

In [None]:
# Define custom distribution
calls_x = np.array([0, 1, 2, 3, 4])
calls_p = np.array([0.1, 0.3, 0.4, 0.15, 0.05])

# Create RV
calls_rv = stats.rv_discrete(name='calls', values=(calls_x, calls_p))

# Questions
print(f"1. Probability of exactly 2 calls: {calls_rv.pmf(2):.2f}")
print(f"2. Probability of at most 2 calls: {calls_rv.cdf(2):.2f}")
print(f"3. Probability of more than 2 calls: {1 - calls_rv.cdf(2):.2f}")
print(f"4. Expected calls per minute: {calls_rv.mean():.2f}")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# PMF
ax1.bar(calls_x, calls_p, color='purple', alpha=0.7)
ax1.set_title("PMF: Calls per Minute")
ax1.set_xlabel("Number of Calls")
ax1.set_ylabel("Probability")

# CDF
ax2.step(calls_x, calls_rv.cdf(calls_x), where='post', color='orange', linewidth=2)
ax2.plot(calls_x, calls_rv.cdf(calls_x), 'o', color='orange')
ax2.set_title("CDF: Calls per Minute")
ax2.set_xlabel("Number of Calls")
ax2.set_ylabel("Cumulative Probability")

plt.show()

## 4. Practice Problems

**Problem 1**: A biased coin has P(Head) = 0.7. Let X = 1 for Head, 0 for Tail.
1. Define the PMF.
2. Calculate Mean and Variance.

**Problem 2**: Suppose $P(X=x) = k \cdot x$ for $x \in \{1, 2, 3, 4\}$. Find the value of $k$ that makes this a valid PMF, then verify that $\sum_x P(X=x) = 1$.

In [None]:
# Solution 1: Biased Coin
coin_x = [0, 1]
coin_p = [0.3, 0.7]
coin_rv = stats.rv_discrete(values=(coin_x, coin_p))
print(f"Coin Mean: {coin_rv.mean():.2f}")
print(f"Coin Variance: {coin_rv.var():.2f}")

# Solution 2: Find k
x_vals = np.array([1, 2, 3, 4])
sum_x = np.sum(x_vals)  # 1+2+3+4 = 10
k = 1 / sum_x
print(f"Value of k: {k}")

# Verify
probs = k * x_vals
print(f"Probabilities: {probs}")
print(f"Sum of probs: {np.sum(probs):.2f}")
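
The hand calculations behind both solutions, written out as a check on the printed values. They assume the standard definitions $E[X] = \sum_x x\,P(X=x)$ and $\mathrm{Var}(X) = E[X^2] - (E[X])^2$, which are not stated earlier in this notebook.

Problem 1 (biased coin):
$$E[X] = 0 \cdot 0.3 + 1 \cdot 0.7 = 0.7$$
$$E[X^2] = 0^2 \cdot 0.3 + 1^2 \cdot 0.7 = 0.7$$
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = 0.7 - 0.49 = 0.21$$

Problem 2 (finding $k$):
$$\sum_{x=1}^{4} k \cdot x = k(1 + 2 + 3 + 4) = 10k = 1 \quad\Rightarrow\quad k = 0.1$$

The resulting probabilities are $0.1, 0.2, 0.3, 0.4$, which are non-negative and sum to 1.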