Question 1: What is hypothesis testing in statistics?

Hypothesis testing is a statistical method used to make decisions or draw conclusions about a population based on sample data. It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1), then using statistical tests to determine whether there is enough evidence to reject H0 in favor of H1.

Question 2: What is the null hypothesis, and how does it differ from the alternative hypothesis?

The null hypothesis (H0) is a statement that there is no effect, difference, or relationship in the population, and any observed effect is due to random chance. The alternative hypothesis (H1) states that there is a true effect, difference, or relationship. In hypothesis testing, we attempt to reject H0 in favor of H1 based on statistical evidence.

Question 3: Explain the significance level in hypothesis testing and its role in deciding the outcome of a test.

The significance level (alpha, α) is the probability threshold used to decide whether to reject the null hypothesis. Commonly set at 0.05, it represents a 5% risk of making a Type I error (rejecting H0 when it is true). If the p-value is less than α, we reject H0; otherwise, we fail to reject it.
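The decision rule above can be written as a small helper. This is only an illustrative sketch; the function name `decide` and its return strings are made up for this example and are not part of any library.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return "Reject H0" if p_value < alpha else "Fail to reject H0"

print(decide(0.03))              # compared against the default alpha of 0.05
print(decide(0.03, alpha=0.01))  # same p-value, stricter threshold
```

The same p-value can lead to different decisions depending on α, which is why α should be chosen before looking at the data.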

Question 4: What are Type I and Type II errors? Give examples of each.

A Type I error occurs when we reject a true null hypothesis (false positive). Example: Concluding a medicine works when it does not. A Type II error occurs when we fail to reject a false null hypothesis (false negative). Example: Concluding a medicine does not work when it actually does.
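Both error rates can be estimated by simulation. The sketch below is an assumption-laden illustration, not a prescribed method: it uses a one-sample t-test, a hypothesized mean of 50, a standard deviation of 5, samples of size 30, and a true mean of 52 for the "H0 is false" case. All of these numbers and the helper name `rejection_rate` are chosen only for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_trials = 2000                 # number of simulated experiments
n = 30                          # observations per experiment

def rejection_rate(true_mean, mu0=50.0, sigma=5.0):
    """Fraction of simulated samples where a one-sample t-test rejects H0: mean == mu0."""
    samples = rng.normal(loc=true_mean, scale=sigma, size=(n_trials, n))
    p_values = stats.ttest_1samp(samples, popmean=mu0, axis=1).pvalue
    return np.mean(p_values < alpha)

# H0 is true (true mean equals mu0): every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=50.0)

# H0 is false (true mean is 52): every failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mean=52.0)

print(f"Estimated Type I error rate:  {type_i_rate:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```

The estimated Type I rate should land close to α, while the Type II rate depends on the sample size and on how far the true mean is from the hypothesized one.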

Question 5: What is the difference between a Z-test and a T-test? Explain when to use each.

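A Z-test is used when the population standard deviation is known, or when the sample is large enough (commonly n > 30) that the sample standard deviation is a good estimate of it. A T-test is used when the population standard deviation is unknown and must be estimated from the sample. This matters most for small samples (n ≤ 30), because the t-distribution has heavier tails that account for the extra uncertainty. As n grows, the t-distribution approaches the normal distribution, so the two tests give nearly identical results. Both tests assess whether a sample mean differs significantly from a hypothesized population mean, or whether two sample means differ from each other.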
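One way to see how the two tests relate is to compare their two-tailed critical values at α = 0.05 for several sample sizes. The sample sizes below are arbitrary choices for illustration.

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # two-tailed critical value for the Z-test

for n in (5, 10, 30, 100, 1000):
    # The one-sample t-test uses n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n={n:>4}  t critical={t_crit:.3f}  z critical={z_crit:.3f}")
```

For small n the t critical value is noticeably larger than the z critical value, so the T-test demands stronger evidence before rejecting H0. The gap shrinks as n increases.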

Question 6: Write a Python program to generate a binomial distribution with n=10 and p=0.5, then plot its histogram.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

n = 10
p = 0.5
size = 1000

# Generate binomial distribution
data = np.random.binomial(n, p, size)

# Plot histogram; bin edges at k - 0.5 center each bar on its integer outcome (0..n)
plt.hist(data, bins=np.arange(n + 2) - 0.5, edgecolor='black')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xlabel('Number of successes')
plt.ylabel('Frequency')
plt.show()

Question 7: Implement hypothesis testing using Z-statistics for a sample dataset in Python.

In [None]:
import numpy as np
from scipy import stats

sample_data = [49.1, 50.2, 51.0, 48.7, 50.5, 49.8, 50.3, 50.7, 50.2, 49.6,
                 50.1, 49.9, 50.8, 50.4, 48.9, 50.6, 50.0, 49.7, 50.2, 49.5,
                 50.1, 50.3, 50.4, 50.5, 50.0, 50.7, 49.3, 49.8, 50.2, 50.9,
                 50.3, 50.4, 50.0, 49.7, 50.5, 49.9]

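mu = 50       # Hypothesized population mean under H0
alpha = 0.05  # Significance level

mean_sample = np.mean(sample_data)
# The population standard deviation is unknown here, so the sample standard
# deviation (ddof=1) stands in for it; a common approximation when n > 30.
std_sample = np.std(sample_data, ddof=1)
n = len(sample_data)

# Z = (sample mean - hypothesized mean) / standard error
z_stat = (mean_sample - mu) / (std_sample / np.sqrt(n))
# Two-tailed p-value; sf(x) equals 1 - cdf(x) but is more accurate in the tails
p_value = 2 * stats.norm.sf(abs(z_stat))

print(f"Z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")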

Question 8: Write a Python script to simulate data from a normal distribution and calculate the 95% confidence interval for its mean. Plot the data using Matplotlib.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

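# Simulate 1000 draws from a normal distribution with mean 50 and standard deviation 5
data = np.random.normal(loc=50, scale=5, size=1000)

mean = np.mean(data)
sem = stats.sem(data)  # Standard error of the mean (uses ddof=1)
# Normal-based 95% interval for the mean; reasonable here because n is large
ci_low, ci_high = stats.norm.interval(0.95, loc=mean, scale=sem)

print(f"Mean: {mean:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")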

plt.hist(data, bins=30, edgecolor='black')
plt.title('Normal Distribution Simulation')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Question 9: Write a Python function to calculate the Z-scores from a dataset and visualize the standardized data using a histogram. Explain what the Z-scores represent in terms of standard deviations from the mean.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def calculate_z_scores(data):
    """Return (x - mean) / std for each value; stats.zscore uses the population std (ddof=0) by default."""
    return stats.zscore(data)

data = np.random.normal(50, 10, 1000)
z_scores = calculate_z_scores(data)

plt.hist(z_scores, bins=30, edgecolor='black')
plt.title('Z-score Distribution')
plt.xlabel('Z-score')
plt.ylabel('Frequency')
plt.show()

# Explanation: Z-scores indicate how many standard deviations a value is from the mean.
# Z = 0 means the value equals the mean, Z > 0 means above the mean, Z < 0 means below the mean.
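
As a quick check of that interpretation, the formula Z = (x - mean) / std can be applied by hand to a tiny dataset. The three values below are made up for illustration, and the population standard deviation (ddof=0) is used to match the default behavior of stats.zscore.

```python
import numpy as np

values = np.array([40.0, 50.0, 60.0])  # hypothetical data, chosen so the mean is 50
mean = values.mean()
std = values.std()                     # population standard deviation (ddof=0)

z = (values - mean) / std
print(z)                               # the middle value equals the mean, so its Z-score is 0
print(z.mean(), z.std())               # standardized data has mean 0 and standard deviation 1
```

The value 60 lies about 1.22 standard deviations above the mean and 40 lies the same distance below it, so their Z-scores are equal in size and opposite in sign.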