**Statistics Advanced - 1**

## Question 1: What is a random variable in probability theory?

**Answer:**
A random variable is a function that assigns a numerical value to each outcome in a sample space of a random experiment. It quantifies outcomes so we can apply mathematical operations and probability distributions.
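
As a minimal sketch of this definition, the snippet below treats "number of heads in two fair coin tosses" as a random variable. The experiment and the name `number_of_heads` are illustrative choices, not part of the question.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair coin tosses (illustrative choice).
sample_space = list(product("HT", repeat=2))

def number_of_heads(outcome):
    """Random variable X: maps an outcome such as ('H', 'T') to a number."""
    return outcome.count("H")

# Every outcome is equally likely, so P(X = x) is the share of outcomes mapped to x.
pmf = {}
for outcome in sample_space:
    x = number_of_heads(outcome)
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(sample_space))

for x in sorted(pmf):
    print(f"P(X = {x}) = {pmf[x]}")
```

The function itself is deterministic; the randomness comes from which outcome the experiment produces.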

## Question 2: What are the types of random variables?

**Answer:**
1. **Discrete random variables**: take a finite or countably infinite set of values (e.g., 0, 1, 2, ...), such as the number of defective items in a batch.
2. **Continuous random variables**: take any value in an interval or continuum (e.g., measurements like height or weight).

## Question 3: Explain the difference between discrete and continuous distributions.

**Answer:**
- **Discrete distributions** assign probabilities to specific isolated values; probabilities sum to 1 across those values (e.g., Binomial, Poisson).
- **Continuous distributions** are described by probability density functions (pdf); probabilities for exact single points are zero and probabilities are computed over intervals (e.g., Normal distribution).
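
The contrast can be sketched with the standard library alone. The fair die and the Normal(50, 5) parameters below are assumptions picked for illustration.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete: a fair six-sided die puts probability 1/6 on each isolated value.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die_pmf.values())
print(f"Sum of die probabilities = {total_probability}")

# Continuous: a Normal(mean=50, sd=5) variable (illustrative parameters).
X = NormalDist(mu=50, sigma=5)

# Probability over an interval is a difference of CDF values.
p_interval = X.cdf(55) - X.cdf(45)
print(f"P(45 <= X <= 55) = {p_interval:.4f}")

# A single point is an interval of zero width, so its probability is zero,
# even though the density at that point is positive.
p_point = X.cdf(50) - X.cdf(50)
print(f"P(X = 50) = {p_point}, pdf(50) = {X.pdf(50):.4f}")
```

Note that the pdf value is a density, not a probability; only areas under the curve are probabilities.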

## Question 4: What is a binomial distribution, and how is it used in probability?

**Answer:**
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials with the same success probability p. It is used when we count successes (e.g., number of defective items in a sample of n).
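
A small sketch of the binomial probability mass function, P(X = k) = C(n, k) p^k (1 - p)^(n - k), follows. The helper `binomial_pmf` and the numbers n = 10, p = 0.1 are hypothetical, chosen to mirror the defective-items example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: a sample of n = 10 items, each defective with p = 0.1.
n, p = 10, 0.1

p_zero = binomial_pmf(0, n, p)
p_at_most_one = sum(binomial_pmf(k, n, p) for k in range(2))
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"P(no defectives)         = {p_zero:.4f}")
print(f"P(at most one defective) = {p_at_most_one:.4f}")
print(f"Sum over k = 0..{n}      = {total:.4f}")
print(f"Expected defectives n*p  = {n * p:.1f}")
```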

## Question 5: What is the standard normal distribution, and why is it important?

**Answer:**
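The standard normal distribution is the normal (Gaussian) distribution with mean 0 and standard deviation 1. It is important because any normal variable X with mean μ and standard deviation σ can be standardized to it through the z-score z = (x - μ) / σ, so a single table or function covers every normal distribution. Many statistical methods, such as confidence intervals and hypothesis tests, rely on its quantiles (for example, 1.96 for a two-sided 95% interval).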
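
The following sketch standardizes one value and looks up probabilities on the standard normal scale. The mean of 50, standard deviation of 5, and the value 60 are assumed purely for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Hypothetical measurement scale: mean 50, standard deviation 5.
mu, sigma = 50, 5
x = 60

# Standardize: z counts how many standard deviations x lies from the mean.
z = (x - mu) / sigma

# P(X <= 60) on the original scale equals P(Z <= z) on the standard scale.
p_below = standard_normal.cdf(z)
p_direct = NormalDist(mu, sigma).cdf(x)

# The 97.5th percentile of Z is the critical value used in two-sided 95% intervals.
z_975 = standard_normal.inv_cdf(0.975)

print(f"z-score of {x} = {z:.2f}")
print(f"P(Z <= z) = {p_below:.4f}, P(X <= x) = {p_direct:.4f}")
print(f"97.5th percentile of Z = {z_975:.4f}")
```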

## Question 6: What is the Central Limit Theorem (CLT), and why is it critical in statistics?

**Answer:**
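The CLT states that, for independent and identically distributed observations from a population with finite variance, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the original population distribution. The sample mean is then approximately normal with mean μ and standard deviation σ / sqrt(n). This justifies normal-based inference (confidence intervals, hypothesis tests) for sample means when the sample size is sufficiently large.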
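
A quick simulation can make the theorem concrete. This is only a sketch: the exponential population, the sample sizes, and the hand-written `skewness` helper are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Skewed population: Exponential with mean 1 and standard deviation 1.
population_mean, population_sd = 1.0, 1.0
n_repeats = 10_000

def skewness(values):
    """Sample skewness; close to 0 for a symmetric, bell-shaped distribution."""
    centered = values - values.mean()
    return (centered**3).mean() / values.std() ** 3

results = {}
for n in (2, 30, 200):
    # Each row is one sample of size n; each row mean is one sample mean.
    samples = rng.exponential(scale=population_mean, size=(n_repeats, n))
    sample_means = samples.mean(axis=1)
    results[n] = {
        "mean": sample_means.mean(),
        "sd": sample_means.std(ddof=1),
        "skew": skewness(sample_means),
    }
    print(
        f"n = {n:3d}: mean of means = {results[n]['mean']:.3f}, "
        f"sd of means = {results[n]['sd']:.3f} "
        f"(theory {population_sd / np.sqrt(n):.3f}), "
        f"skewness = {results[n]['skew']:.3f}"
    )
```

As n grows, the spread of the sample means should track σ / sqrt(n) and the skewness should shrink toward 0, which is the normal-like behavior the CLT describes.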

## Question 7: What is the significance of confidence intervals in statistical analysis?

**Answer:**
A confidence interval gives a range of plausible values for an unknown population parameter (e.g., mean) computed from sample data, together with a confidence level (e.g., 95%) which describes the long-run frequency of such intervals capturing the true parameter.
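
The "long-run frequency" reading can be checked by simulation. In the sketch below the population parameters are hypothetical and known only to the simulation; each simulated experiment builds a 95% t-interval and records whether it contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Hypothetical population; in practice the true mean is unknown.
true_mean, true_sd = 50.0, 5.0
n, n_experiments = 20, 5_000

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 1)

# One row per experiment: draw a sample, then build its interval.
samples = rng.normal(true_mean, true_sd, size=(n_experiments, n))
means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
lower = means - t_crit * standard_errors
upper = means + t_crit * standard_errors

# Fraction of intervals that capture the true mean.
covered = (lower <= true_mean) & (true_mean <= upper)
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land near 0.95. Any single interval either contains the parameter or does not; the 95% describes the procedure, not one interval.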

## Question 8: What is the concept of expected value in a probability distribution?

**Answer:**
The expected value (mean) of a random variable is the long-run average value it would take over repeated trials; for discrete variables it's the sum of values times their probabilities, and for continuous it's the integral of x times the pdf.
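
Both cases can be sketched in a few lines. The fair die and the Uniform(0, 2) distribution are illustrative assumptions, and the integral is approximated with a simple midpoint sum rather than a library routine.

```python
from fractions import Fraction

# Discrete case: E[X] = sum of x * P(X = x), here for a fair six-sided die.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
expected_die = sum(face * prob for face, prob in die_pmf.items())
print(f"E[die roll] = {expected_die} = {float(expected_die)}")

# Continuous case: E[X] = integral of x * f(x) dx, here for Uniform(0, 2),
# whose pdf is f(x) = 1/2 on [0, 2]. The exact answer is 1.
def uniform_pdf(x, low=0.0, high=2.0):
    """Density of a uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

# Midpoint Riemann sum over [0, 2]: add up x * f(x) * width at each slice midpoint.
n_slices = 10_000
width = 2.0 / n_slices
expected_uniform = sum(
    (i + 0.5) * width * uniform_pdf((i + 0.5) * width) * width
    for i in range(n_slices)
)
print(f"E[Uniform(0, 2)] is approximately {expected_uniform:.6f}")
```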

## Question 9: Code task description

**Task:** Generate 1000 random numbers from a normal distribution with mean = 50 and sd = 5. Compute mean and sd using NumPy and draw a histogram. The code and output follow in the next code cell.

```python
# Question 9: Generate 1000 random numbers from N(50, 5) and show mean, std, histogram
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # for reproducible results
data = np.random.normal(loc=50, scale=5, size=1000)

mean_data = np.mean(data)
std_data = np.std(data, ddof=1)  # sample standard deviation

print(f"Sample size = {len(data)}")
print(f"Mean (NumPy) = {mean_data:.4f}")
print(f"Std (sample, ddof=1) = {std_data:.4f}")

# Histogram
plt.figure(figsize=(8, 4.5))
plt.hist(data, bins=30, edgecolor='black')  # default bar color
plt.title('Histogram of 1000 samples from N(50, 5)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(alpha=0.25)
plt.show()
```


## Question 10: Sales trend and 95% confidence interval

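You are given `daily_sales` (20 values).

- Explain how to apply the CLT to estimate average sales with a 95% CI.

**Short explanation:** By the CLT, the sample mean is approximately normally distributed around the population mean, with a standard error estimated by s / sqrt(n). Because the sample is small (n = 20) and the population variance is unknown, use the t-distribution with n - 1 degrees of freedom rather than the normal distribution to construct the 95% CI for the mean: CI = sample_mean ± t_{n-1, 0.975} * (s / sqrt(n)). This assumes the daily values are roughly independent and not heavily skewed.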

The code to compute the mean and its 95% CI is in the next code cell.

```python
# Question 10: Compute mean sales and 95% CI using t-distribution
import numpy as np
import math
from scipy import stats

daily_sales = [220, 245, 210, 265, 230, 250, 260, 275, 240, 255,
               235, 260, 245, 250, 225, 270, 265, 255, 250, 260]

data = np.array(daily_sales)
n = len(data)
mean = np.mean(data)
s = np.std(data, ddof=1)  # sample standard deviation
se = s / math.sqrt(n)

# t critical value for 95% CI with df = n-1
alpha = 0.05
df = n - 1
t_crit = stats.t.ppf(1 - alpha/2, df)

margin = t_crit * se
ci_lower = mean - margin
ci_upper = mean + margin

print(f"Sample size = {n}")
print(f"Mean daily sales = {mean:.4f}")
print(f"Sample standard deviation = {s:.4f}")
print(f"95% CI for mean = ({ci_lower:.4f}, {ci_upper:.4f}) using t-distribution (df={df})")
```
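
As an optional cross-check, the same interval can be obtained from SciPy's built-in helpers. This is a sketch, not a replacement for the step-by-step version: it assumes `scipy.stats.sem` (which uses `ddof=1` by default) and `scipy.stats.t.interval`, with the confidence level passed positionally.

```python
import numpy as np
from scipy import stats

daily_sales = [220, 245, 210, 265, 230, 250, 260, 275, 240, 255,
               235, 260, 245, 250, 225, 270, 265, 255, 250, 260]

data = np.array(daily_sales)
n = len(data)
mean = data.mean()
se = stats.sem(data)  # standard error of the mean, s / sqrt(n) with ddof=1

# Two-sided 95% interval from the t-distribution with n - 1 degrees of freedom.
ci_lower, ci_upper = stats.t.interval(0.95, n - 1, loc=mean, scale=se)

print(f"Mean daily sales = {mean:.4f}")
print(f"95% CI via stats.t.interval = ({ci_lower:.4f}, {ci_upper:.4f})")
```

Both approaches should agree to floating-point precision, since `t.interval` applies the same mean ± t * se formula.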
