# Statistics Advanced - 2

1. What is hypothesis testing in statistics?
  - Hypothesis testing in statistics is a formal method of making decisions or inferences about a population based on sample data.

  - It helps us determine whether there is enough statistical evidence to support a certain claim (hypothesis) about a population parameter (like mean, proportion, variance, etc.).

  - Example:

     A company claims the average life of its batteries is 100 hours.

     H₀: μ = 100

     H₁: μ ≠ 100

2. What is the null hypothesis, and how does it differ from the alternative
hypothesis?
  - Null Hypothesis (H₀):

      - The default assumption or status quo.

      - It assumes there is no effect, no difference, or no relationship in the population.

      - It is the statement we test directly: we assume it is true and reject it only if the sample data are sufficiently inconsistent with it.

      - Example:

           - A medicine has no effect on blood pressure.

           - The average height of students is 170 cm (μ = 170).

  - Alternative Hypothesis (H₁ or Ha):

      - The opposite of the null hypothesis.

      - It represents what we suspect, claim, or want to prove.

      - It states that there is an effect, a difference, or a relationship.

      - Example:

           - The medicine does affect blood pressure.

           - The average height of students is not 170 cm (μ ≠ 170).

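3. Explain the significance level in hypothesis testing and its role in deciding
the outcome of a test.
  - The significance level (α) is the threshold used in the decision rule of a hypothesis test. It is chosen before the test is run and equals the probability of a Type I error (rejecting H₀ when it is actually true) that we are willing to accept. Common choices are 0.05 and 0.01.

  - Decision rule: if the p-value is less than or equal to α, we reject H₀; otherwise, we fail to reject H₀.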

  - Example:

     A company claims its bulb lasts 1000 hours. We test this claim with α = 0.05.

     - If the test gives p = 0.03 → since 0.03 < 0.05, we reject H₀ (evidence bulbs don’t last 1000 hours).

     - If the test gives p = 0.08 → since 0.08 > 0.05, we fail to reject H₀ (not enough evidence against the claim).
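
The comparison of a p-value with α can be sketched in a few lines. The helper name `decide` is hypothetical and is used here only for illustration; it is not part of any library.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Reject H0 only when the p-value does not exceed the significance level
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# The two p-values from the bulb example, tested at alpha = 0.05
for p in (0.03, 0.08):
    print(f"p = {p:.2f}: {decide(p)}")
```

Calling `decide(0.03, alpha=0.01)` returns "fail to reject H0", which shows that the conclusion depends on the threshold chosen before the test.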

4. What are Type I and Type II errors? Give examples of each.
  - Type I error: rejecting H₀ when it is actually true. It is a false alarm (seeing an effect that isn’t there), and its probability is the significance level α.

  - Example:

     - A COVID test says a healthy person has COVID.

     - In a criminal trial: concluding the accused is guilty when they are actually innocent.

     - A company claims its bulb lasts 1000 hours (H₀: μ = 1000). If in reality it does last 1000 hours, but our test rejects H₀, we made a Type I error.

  - Type II error: failing to reject H₀ when it is actually false. It is a missed detection (failing to see an effect that is there), and its probability is denoted β.

  - Example:

     - A COVID test says an infected person does not have COVID.

     - In a criminal trial: concluding the accused is innocent when they are actually guilty.

     - If the bulb actually lasts only 900 hours, but our test fails to reject H₀ (μ = 1000), we made a Type II error.
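
Both error rates in the bulb example can be estimated by simulation. The sketch below uses only the standard library and rests on assumptions that are not in the text: a known population standard deviation of 200 hours, samples of 30 bulbs, and a two-tailed Z-test at α = 0.05.

```python
import random
from statistics import NormalDist, fmean

# Assumed setup (hypothetical values): known sigma = 200, sample size 30
MU0, SIGMA, N, ALPHA, TRIALS = 1000, 200, 30, 0.05, 2000
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96 for a two-tailed test
rng = random.Random(0)  # fixed seed for reproducibility


def rejects_h0(true_mean):
    """Draw one sample with the given true mean and test H0: mu = MU0."""
    sample = [rng.gauss(true_mean, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU0) / (SIGMA / N ** 0.5)
    return abs(z) > Z_CRIT


# Type I error: H0 is true (mu = 1000) but the test rejects it
type1_rate = sum(rejects_h0(1000) for _ in range(TRIALS)) / TRIALS

# Type II error: H0 is false (mu = 900) but the test fails to reject it
type2_rate = sum(not rejects_h0(900) for _ in range(TRIALS)) / TRIALS

print(f"Estimated Type I error rate:  {type1_rate:.3f}")
print(f"Estimated Type II error rate: {type2_rate:.3f}")
```

The estimated Type I rate should land close to α, since α is by definition the rejection probability when H₀ is true. The Type II rate depends entirely on the assumed σ, sample size, and true mean, so it changes if any of those are changed.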

5. What is the difference between a Z-test and a T-test? Explain when to use
each.
  - Z-test: Used when the population variance (σ²) is known, or when the sample is large (commonly n > 30) so that the sample standard deviation is a good estimate of σ. The test statistic is compared with the standard normal distribution. Example: Testing whether the average height of students is 160 cm when the population variance is known.

  - T-test: Used when the population variance is unknown and must be estimated from the sample, especially when the sample is small (n ≤ 30). The test statistic is compared with Student’s t-distribution with n - 1 degrees of freedom, which has heavier tails than the normal distribution. Example: Testing whether the average exam score of 15 students is 75 when the variance is unknown.
     
      - Use Z-test: Known population variance (or a large sample).

      - Use T-test: Unknown population variance, particularly with small samples.
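
The practical effect of the heavier tails can be seen by comparing two-tailed critical values. This is a minimal sketch using `scipy.stats`; the sample sizes in the loop are arbitrary choices for illustration.

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # two-tailed critical value, normal
print(f"z critical value: {z_crit:.3f}")

# t critical values for several (arbitrarily chosen) sample sizes
t_crits = {}
for n in (5, 15, 30, 100, 1000):
    t_crits[n] = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n = {n:4d}: t critical value = {t_crits[n]:.3f}")
```

The t critical value is always larger than the z critical value, so the T-test demands stronger evidence from small samples. As n grows, the two values converge, which is why the choice between the tests matters less for large samples.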
  

6. Write a Python program to generate a binomial distribution with n=10 and
p=0.5, then plot its histogram.
(Include your Python code and output in the code box below.)
Hint: Generate random number using random function.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Parameters
n = 10   # number of trials
p = 0.5  # probability of success
size = 1000  # number of samples

# Generate binomial distribution random numbers
binomial_data = np.random.binomial(n, p, size)

# Plot histogram
plt.hist(binomial_data, bins=range(n+2), edgecolor='black', alpha=0.7)
plt.title("Binomial Distribution (n=10, p=0.5)")
plt.xlabel("Number of Successes")
plt.ylabel("Frequency")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Display first 20 generated values
print(binomial_data[:20])


The histogram above shows that the distribution is centered around 5 successes, as expected for n = 10 and p = 0.5 (mean = n × p = 5).
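
As a cross-check, the simulated frequencies can be compared with the theoretical binomial probabilities. The sketch below uses only the standard library; the fixed seed is an assumption added for reproducibility.

```python
import random
from collections import Counter
from math import comb

n, p, size = 10, 0.5, 1000
rng = random.Random(0)  # fixed seed (an assumption, for reproducibility)

# Each binomial draw is the number of successes in n Bernoulli(p) trials
draws = [sum(rng.random() < p for _ in range(n)) for _ in range(size)]
counts = Counter(draws)

# Theoretical PMF: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

print(" k  empirical  theoretical")
for k in range(n + 1):
    print(f"{k:2d}  {counts[k] / size:9.3f}  {pmf[k]:11.3f}")
```

The empirical column should track the theoretical one, with the largest probability at k = 5 (252/1024, about 0.246).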


7. Implement hypothesis testing using Z-statistics for a sample dataset in
Python. Show the Python code and interpret the results.
sample_data = [49.1, 50.2, 51.0, 48.7, 50.5, 49.8, 50.3, 50.7, 50.2, 49.6,
50.1, 49.9, 50.8, 50.4, 48.9, 50.6, 50.0, 49.7, 50.2, 49.5,
50.1, 50.3, 50.4, 50.5, 50.0, 50.7, 49.3, 49.8, 50.2, 50.9,
50.3, 50.4, 50.0, 49.7, 50.5, 49.9]
(Include your Python code and output in the code box below.)
  

In [None]:
import numpy as np
from scipy.stats import norm

# Sample data
sample_data = [49.1, 50.2, 51.0, 48.7, 50.5, 49.8, 50.3, 50.7, 50.2, 49.6,
               50.1, 49.9, 50.8, 50.4, 48.9, 50.6, 50.0, 49.7, 50.2, 49.5,
               50.1, 50.3, 50.4, 50.5, 50.0, 50.7, 49.3, 49.8, 50.2, 50.9,
               50.3, 50.4, 50.0, 49.7, 50.5, 49.9]

# Convert to numpy array
data = np.array(sample_data)

# Hypothesis: H0: mu = 50, H1: mu != 50
mu0 = 50  # population mean under null hypothesis
n = len(data)
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # sample standard deviation

# Standard error
se = sample_std / np.sqrt(n)

# Z-statistic
z_stat = (sample_mean - mu0) / se

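# Two-tailed p-value (norm.sf is 1 - cdf, computed more accurately in the tails)
p_value = 2 * norm.sf(abs(z_stat))

# Print results rounded to 4 decimal places
print(f"Sample Mean: {sample_mean:.4f}")
print(f"Sample Std Dev: {sample_std:.4f}")
print(f"Z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")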

output:

Sample Mean: 50.0889
Sample Std Dev: 0.5365
Z-statistic: 0.9940
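p-value: 0.3202

Interpretation: The sample mean (50.0889) is about 0.99 standard errors above the hypothesized mean of 50. Since p = 0.3202 is greater than the conventional α = 0.05, we fail to reject H₀: the data do not provide enough evidence that the population mean differs from 50. Note that the population standard deviation is unknown here, so the sample standard deviation is used in its place; with n = 36 the normal approximation is reasonable.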
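
Because the population standard deviation is estimated from the sample, a one-sample t-test is a natural cross-check. A minimal sketch with `scipy.stats.ttest_1samp` on the same data and the same hypothesized mean of 50:

```python
from scipy import stats

sample_data = [49.1, 50.2, 51.0, 48.7, 50.5, 49.8, 50.3, 50.7, 50.2, 49.6,
               50.1, 49.9, 50.8, 50.4, 48.9, 50.6, 50.0, 49.7, 50.2, 49.5,
               50.1, 50.3, 50.4, 50.5, 50.0, 50.7, 49.3, 49.8, 50.2, 50.9,
               50.3, 50.4, 50.0, 49.7, 50.5, 49.9]

# Two-sided one-sample t-test of H0: mu = 50
result = stats.ttest_1samp(sample_data, popmean=50)
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
```

The t-statistic equals the Z-statistic because both use the same formula. Only the reference distribution differs: the t-distribution with 35 degrees of freedom has heavier tails, so its p-value is slightly larger, and the decision at α = 0.05 is unchanged.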

8. Write a Python script to simulate data from a normal distribution and
calculate the 95% confidence interval for its mean. Plot the data using Matplotlib.
(Include your Python code and output in the code box below.)


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulate data from a normal distribution
np.random.seed(42)  # for reproducibility
mu, sigma = 50, 5   # population mean and standard deviation
n = 100             # sample size
data = np.random.normal(mu, sigma, n)

# Sample statistics
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)

# 95% confidence interval using t-distribution
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha/2, df=n-1)  # critical value
margin_of_error = t_crit * (sample_std / np.sqrt(n))
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

# Plot histogram of data
plt.hist(data, bins=15, edgecolor='black', alpha=0.7)
plt.axvline(sample_mean, color='red', linestyle='dashed', linewidth=2, label=f"Mean = {sample_mean:.2f}")
plt.axvline(ci_lower, color='green', linestyle='dashed', linewidth=2, label=f"95% CI Lower = {ci_lower:.2f}")
plt.axvline(ci_upper, color='green', linestyle='dashed', linewidth=2, label=f"95% CI Upper = {ci_upper:.2f}")
plt.title("Normal Distribution Sample with 95% Confidence Interval")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.legend()
plt.show()

# Print results rounded to 2 decimal places
print(f"Sample Mean: {sample_mean:.2f}")
print(f"95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")

output:

Sample Mean: 49.48
95% Confidence Interval: (48.58, 50.38)
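
The same interval can be obtained more compactly with `scipy.stats.t.interval`. This sketch regenerates the data with the same seed and parameters and passes the confidence level positionally, since the keyword name for that argument has differed across SciPy versions.

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # same seed as above
data = np.random.normal(50, 5, 100)

mean = np.mean(data)
sem = stats.sem(data)  # standard error of the mean (uses ddof=1 by default)

# 95% confidence interval from the t-distribution with n - 1 degrees of freedom
ci_lower, ci_upper = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
```

This is equivalent to computing the critical value and margin of error by hand; the manual version makes each step of the formula visible, while this one is shorter.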

9. Write a Python function to calculate the Z-scores from a dataset and
visualize the standardized data using a histogram. Explain what the Z-scores represent
in terms of standard deviations from the mean.
(Include your Python code and output in the code box below.)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Function to calculate Z-scores and plot histogram
def calculate_z_scores(data):
    data = np.asarray(data, dtype=float)  # also accept plain Python lists
    mean = np.mean(data)
    std = np.std(data, ddof=1)  # sample standard deviation
    z_scores = (data - mean) / std

    # Plot histogram of Z-scores
    plt.hist(z_scores, bins=15, edgecolor='black', alpha=0.7)
    plt.title("Histogram of Z-scores (Standardized Data)")
    plt.xlabel("Z-score")
    plt.ylabel("Frequency")
    plt.axvline(0, color='red', linestyle='dashed', linewidth=2, label="Mean (0)")
    plt.legend()
    plt.show()

    return z_scores

# Example dataset
data = np.array([10, 12, 13, 15, 18, 20, 21, 22, 23, 25])

# Calculate Z-scores
z_scores = calculate_z_scores(data)
print("Z-scores:", np.round(z_scores, 2))

output:

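Z-scores: [-1.54 -1.15 -0.96 -0.57  0.02  0.41  0.60  0.80  0.99  1.38]

Explanation: A Z-score measures how many standard deviations a value lies from the mean: z = (x - mean) / standard deviation. A positive Z-score means the value is above the mean, a negative one means it is below, and 0 means it equals the mean. In this dataset (mean 17.9, sample standard deviation ≈ 5.13), the value 10 has z = -1.54, so it lies 1.54 standard deviations below the mean, while 25 has z = 1.38, so it lies 1.38 standard deviations above it.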
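
Two properties of standardized data can be checked directly on the same dataset: the Z-scores have mean 0 and sample standard deviation 1, and each original value can be recovered as mean + z × std. A minimal sketch:

```python
import numpy as np

data = np.array([10, 12, 13, 15, 18, 20, 21, 22, 23, 25], dtype=float)
mean, std = data.mean(), data.std(ddof=1)
z_scores = (data - mean) / std

# Standardized data has mean 0 and (sample) standard deviation 1
print("Mean is 0:", np.isclose(z_scores.mean(), 0.0))
print("Std is 1: ", np.isclose(z_scores.std(ddof=1), 1.0))

# Each original value can be recovered from its Z-score
reconstructed = mean + z_scores * std
print("Reconstruction matches:", np.allclose(reconstructed, data))
```

Standardizing therefore changes only the units of measurement, not the shape of the distribution, which is why the histogram of Z-scores looks like a shifted and rescaled version of the histogram of the raw data.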