# Statistics Advanced 2

Question 1: What is hypothesis testing in statistics?

Answer:
Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It involves formulating two competing hypotheses (null and alternative), collecting sample data, and using probability theory to determine whether the evidence is strong enough to reject the null hypothesis.

Question 2: What is the null hypothesis, and how does it differ from the alternative hypothesis?

Answer:

Null Hypothesis (H₀): A statement that there is no effect or no difference; it represents the status quo or baseline assumption. Example: "The average test score = 50."

Alternative Hypothesis (H₁ or Ha): A statement that contradicts the null, suggesting that there is an effect or a difference. Example: "The average test score ≠ 50."

Key difference: The null assumes no change/relationship, while the alternative suggests the presence of change/relationship. Hypothesis testing evaluates whether data provides enough evidence to reject H₀ in favor of H₁.

Question 3: Explain the significance level in hypothesis testing and its role in deciding the outcome of a test.

Answer:
The significance level (α) is the threshold probability used to decide whether to reject the null hypothesis. Common choices are 0.05 (5%) or 0.01 (1%).

If the p-value ≤ α, we reject H₀ (evidence supports H₁).

If the p-value > α, we fail to reject H₀ (not enough evidence against it).

Role: It controls the probability of making a Type I error (rejecting a true null hypothesis). For example, α = 0.05 means we accept a 5% risk of incorrectly rejecting H₀.
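
The decision rule above can be sketched as a small helper. This is only an illustration: the function name `decide` and the example p-values are hypothetical and do not come from any library.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    # Reject H0 only when the p-value is at or below the threshold
    return "reject H0" if p_value <= alpha else "fail to reject H0"

# The same p-value can lead to different decisions under different thresholds
print(decide(0.03))              # compared against alpha = 0.05
print(decide(0.03, alpha=0.01))  # compared against the stricter alpha = 0.01
```

A stricter α lowers the risk of a Type I error but makes it harder to reject H₀.
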

Question 4: What are Type I and Type II errors? Give examples of each.

Answer:

Type I Error (False Positive): Rejecting the null hypothesis when it is actually true.

Example: A medical test concludes a patient has a disease (reject H₀: “no disease”) when they are actually healthy.

Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false.

Example: A medical test concludes a patient is healthy (fail to reject H₀: “no disease”) when they actually have the disease.
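
Both error rates can be estimated by simulation. The sketch below is a rough illustration under assumed values: a null mean of 50, a known standard deviation of 5, a sample size of 30, and a true mean of 52 for the case where H₀ is false. None of these numbers come from the questions above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)    # fixed seed, chosen arbitrarily
alpha, n, trials = 0.05, 30, 5000
mu_0, sigma = 50, 5               # hypothetical null mean and known population std

def reject_rate(true_mean):
    """Fraction of simulated two-sided Z-tests that reject H0: mu = mu_0."""
    samples = rng.normal(true_mean, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
    p = 2 * stats.norm.sf(np.abs(z))
    return np.mean(p <= alpha)

# H0 is true: every rejection is a Type I error, so the rate should be near alpha
type1_rate = reject_rate(true_mean=50)
# H0 is false: every non-rejection is a Type II error
type2_rate = 1 - reject_rate(true_mean=52)

print(f"Estimated Type I error rate:  {type1_rate:.3f}")
print(f"Estimated Type II error rate: {type2_rate:.3f}")
```

The Type II rate depends on how far the true mean is from the null value and on the sample size; the Type I rate is fixed by α.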

Question 5: What is the difference between a Z-test and a T-test? Explain when to use each.

Answer:

Z-test:

Used when the population variance (σ²) is known, or as an approximation when the sample size is large (rule of thumb: n > 30).

Based on the standard normal distribution (Z).

Example: Testing whether the mean of a sample differs from the population mean when σ is known.

T-test:

Used when the population variance is unknown and must be estimated from the sample, which matters most with small sample sizes (rule of thumb: n ≤ 30).

Based on the Student’s t-distribution, which accounts for extra uncertainty.

Example: Comparing the average marks of two small student groups.

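Key Difference: The Z-test assumes σ is known (or n is large enough for the sample standard deviation to stand in for it), while the T-test is used when σ is unknown and n is small. As n grows, the t-distribution approaches the standard normal and the two tests give nearly identical results.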
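
To see how the two distributions relate, the following sketch compares two-sided critical values at α = 0.05. The sample sizes in the loop are arbitrary choices for illustration.

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value, standard normal
print(f"Z critical value: {z_crit:.3f}")

for n in (5, 10, 30, 100, 1000):
    # The t-distribution with n - 1 degrees of freedom has heavier tails
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n = {n:4d}: t critical value = {t_crit:.3f}")
```

The t critical value is always larger than the Z critical value and shrinks toward it as n increases, which is why the distinction matters mainly for small samples.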

Question 6: Write a Python program to generate a binomial distribution with n=10 and p=0.5, then plot its histogram.


Answer (Python Code):

import numpy as np
import matplotlib.pyplot as plt

# Parameters
n = 10      # number of trials
p = 0.5     # probability of success
size = 1000 # number of samples

# Generate binomial random numbers
data = np.random.binomial(n=n, p=p, size=size)

# Print first few values
print("First 20 generated values:", data[:20])

# Plot histogram
plt.hist(data, bins=range(n+2), align='left', edgecolor='black')
plt.title("Binomial Distribution (n=10, p=0.5)")
plt.xlabel("Number of Successes")
plt.ylabel("Frequency")
plt.show()


Sample Output:

First 20 generated values: [5 4 6 7 3 5 6 4 8 5 6 7 4 3 5 6 7 5 6 4]


(A histogram with bars at 0 through 10 is displayed, peaking around 5. The printed values and bar heights vary from run to run because no random seed is set.)
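
As an optional check, the simulated proportions can be compared with the exact binomial probabilities. This is a sketch that assumes SciPy is available; the seed is arbitrary and only makes the run repeatable.

```python
import numpy as np
from scipy import stats

n, p, size = 10, 0.5, 1000
rng = np.random.default_rng(1)            # arbitrary seed for reproducibility
data = rng.binomial(n=n, p=p, size=size)

k = np.arange(n + 1)                                    # possible success counts 0..n
observed = np.bincount(data, minlength=n + 1) / size    # empirical proportions
expected = stats.binom.pmf(k, n, p)                     # exact probabilities

for ki, obs, exp in zip(k, observed, expected):
    print(f"k = {ki:2d}  observed = {obs:.3f}  expected = {exp:.3f}")
```

With 1000 samples the observed proportions should track the exact probabilities closely, with the largest value at k = 5.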

Question 7: Implement hypothesis testing using Z-statistics for a sample dataset in Python.


Answer:
We’ll test the null hypothesis H₀: μ = 50 against the alternative H₁: μ ≠ 50 using a Z-test.

import numpy as np
from scipy import stats

# Sample dataset
sample_data = [49.1, 50.2, 51.0, 48.7, 50.5, 49.8, 50.3, 50.7, 50.2, 49.6,
               50.1, 49.9, 50.8, 50.4, 48.9, 50.6, 50.0, 49.7, 50.2, 49.5,
               50.1, 50.3, 50.4, 50.5, 50.0, 50.7, 49.3, 49.8, 50.2, 50.9,
               50.3, 50.4, 50.0, 49.7, 50.5, 49.9]

# Convert to NumPy array
data = np.array(sample_data)
n = len(data)

# Hypothesized population mean
mu_0 = 50

# Sample statistics
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)
se = sample_std / np.sqrt(n)

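# Z statistic (the sample std stands in for the unknown σ; acceptable here because n = 36 > 30)
z_stat = (sample_mean - mu_0) / se

# Two-tailed p-value (sf is the survival function, 1 - cdf, and is more accurate in the tails)
p_value = 2 * stats.norm.sf(abs(z_stat))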

print("Sample Mean:", round(sample_mean, 4))
print("Sample Std Dev:", round(sample_std, 4))
print("Z Statistic:", round(z_stat, 4))
print("P-value:", round(p_value, 4))


Sample Output:

Sample Mean: 50.0889
Sample Std Dev: 0.5365
Z Statistic: 0.994
P-value: 0.3202


Interpretation: Since the p-value (0.3202) is well above 0.05, we fail to reject the null hypothesis. There is no significant evidence that the true mean differs from 50.
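
Because σ is estimated from the sample here, a one-sample t-test is a natural cross-check. The sketch below uses `scipy.stats.ttest_1samp` on the same dataset. The test statistic is computed with the same formula as the Z statistic above; only the reference distribution changes (t with n − 1 degrees of freedom instead of the standard normal).

```python
import numpy as np
from scipy import stats

# Same sample dataset as in the Z-test script
data = np.array([49.1, 50.2, 51.0, 48.7, 50.5, 49.8, 50.3, 50.7, 50.2, 49.6,
                 50.1, 49.9, 50.8, 50.4, 48.9, 50.6, 50.0, 49.7, 50.2, 49.5,
                 50.1, 50.3, 50.4, 50.5, 50.0, 50.7, 49.3, 49.8, 50.2, 50.9,
                 50.3, 50.4, 50.0, 49.7, 50.5, 49.9])

# Two-sided one-sample t-test of H0: mu = 50
t_stat, p_value_t = stats.ttest_1samp(data, popmean=50)

print("T Statistic:", round(float(t_stat), 4))
print("P-value (t-test):", round(float(p_value_t), 4))
```

The t-based p-value is slightly larger than the normal-based one because the t-distribution has heavier tails, but with 36 observations the conclusion is the same.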

Question 8: Write a Python script to simulate data from a normal distribution and calculate the 95% confidence interval for its mean. Plot the data using Matplotlib.


Answer:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulate normal data
np.random.seed(42)
data = np.random.normal(loc=100, scale=15, size=200)  # mean=100, std=15, n=200

# Statistics
n = len(data)
mean = np.mean(data)
std = np.std(data, ddof=1)
se = std / np.sqrt(n)

# 95% confidence interval
ci = stats.norm.interval(0.95, loc=mean, scale=se)

print("Sample Mean:", round(mean, 4))
print(f"95% Confidence Interval: ({ci[0]:.4f}, {ci[1]:.4f})")

# Plot
plt.hist(data, bins=30, edgecolor='black')
plt.axvline(ci[0], color='red', linestyle='dashed', linewidth=2, label="95% CI Lower")
plt.axvline(ci[1], color='green', linestyle='dashed', linewidth=2, label="95% CI Upper")
plt.axvline(mean, color='blue', linestyle='solid', linewidth=2, label="Sample Mean")
plt.title("Simulated Normal Data with 95% CI")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.legend()
plt.show()


Sample Output:

Sample Mean: 99.5077
95% Confidence Interval: (97.4386, 101.5767)


(The histogram shows the simulated data with vertical lines at the sample mean and at the lower and upper confidence limits.)
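
The script above builds the interval from the normal distribution even though the standard deviation is estimated from the data. A t-based interval is the stricter choice in that situation. The following sketch compares the two on the same simulated data, assuming the same seed and parameters as above.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(loc=100, scale=15, size=200)

n = len(data)
mean = np.mean(data)
se = np.std(data, ddof=1) / np.sqrt(n)   # standard error of the mean

# Normal-based interval (as in the script above) and t-based interval
ci_z = stats.norm.interval(0.95, loc=mean, scale=se)
ci_t = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print(f"Normal-based 95% CI: ({ci_z[0]:.4f}, {ci_z[1]:.4f})")
print(f"t-based 95% CI:      ({ci_t[0]:.4f}, {ci_t[1]:.4f})")
```

With n = 200 the t-based interval is only marginally wider, so the normal approximation is reasonable here; for small samples the gap would be larger.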

Question 9: Write a Python function to calculate the Z-scores from a dataset and visualize the standardized data using a histogram. Explain what the Z-scores represent.


Answer:

import numpy as np
import matplotlib.pyplot as plt

# Function to calculate Z-scores
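def calculate_zscores(data):
    data = np.asarray(data, dtype=float)  # accept plain lists as well as arrays
    mean = np.mean(data)
    std = np.std(data, ddof=1)            # sample standard deviation
    z_scores = (data - mean) / std
    return z_scores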

# Example dataset
data = np.random.normal(loc=60, scale=10, size=100)
z_scores = calculate_zscores(data)

# Print first few Z-scores
print("First 10 Z-scores:", np.round(z_scores[:10], 3))

# Plot histogram of standardized data
plt.hist(z_scores, bins=20, edgecolor='black')
plt.title("Histogram of Z-scores (Standardized Data)")
plt.xlabel("Z-score")
plt.ylabel("Frequency")
plt.show()


Explanation:

A Z-score measures how many standard deviations an observation is from the mean.

Z = 0 → exactly at the mean

Z = +1 → one standard deviation above the mean

Z = -2 → two standard deviations below the mean

Standardizing helps compare values from different scales/distributions.
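
A short sketch of that comparison, using made-up exam figures (the subjects, scores, means, and standard deviations are all hypothetical):

```python
def z_score(value, mean, std):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std

# Hypothetical results from two exams graded on different scales
math_z = z_score(82, mean=70, std=8)       # (82 - 70) / 8  = 1.5
physics_z = z_score(65, mean=50, std=12)   # (65 - 50) / 12 = 1.25

print(f"Math Z-score:    {math_z:.2f}")
print(f"Physics Z-score: {physics_z:.2f}")
```

Although the raw scores are not directly comparable, the Z-scores show that the math result is further above its class average than the physics result.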