# Statistics Part 2

1. What is hypothesis testing in statistics?
- Ans. Hypothesis testing is a method used to make decisions or draw conclusions about a population based on sample data. It helps us test assumptions (hypotheses) using statistical evidence.

2. What is the null hypothesis, and how does it differ from the alternative hypothesis?
- Ans.

Null Hypothesis (H₀): Assumes there is no effect or difference.

Alternative Hypothesis (H₁ or Ha): Assumes there is an effect or difference.
We test the data to decide whether to reject H₀ in favor of H₁.

3. What is the significance level in hypothesis testing, and why is it important?
- Ans.

Significance level (α) is the probability of rejecting the null hypothesis when it's actually true (Type I error).

Common values: 0.05 (5%), 0.01 (1%), and 0.10 (10%).
It defines how strong the evidence must be to reject H₀.

4. What does a P-value represent in hypothesis testing?
- Ans.
P-value tells us the probability of getting the observed results (or more extreme) if the null hypothesis is true.

5. How do you interpret the P-value in hypothesis testing?
- Ans.

If P-value ≤ α: Reject the null hypothesis (results are significant).

If P-value > α: Fail to reject the null hypothesis (not enough evidence).
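
The decision rule above can be sketched as a small helper. The function name `decide` and the default `alpha=0.05` are illustrative choices, not part of any library.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Reject H0 only when the p-value is at or below the significance level
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.03))
print(decide(0.20))
```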

6. What are Type 1 and Type 2 errors in hypothesis testing?
- Ans.

Type I Error: Rejecting a true null hypothesis (false positive).

Type II Error: Not rejecting a false null hypothesis (false negative).
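
A minimal simulation sketch of both error types, assuming a two-tailed Z-test with known σ. All numbers (μ₀ = 100, true alternative mean 105, σ = 10, n = 30, 5000 trials) are made up for illustration. When H₀ is true, the share of rejections should land near α; when H₀ is false, the share of non-rejections estimates the Type II error rate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha = 0.05
n, mu0, sigma = 30, 100.0, 10.0   # illustrative values
n_trials = 5000

def p_values_for(true_mean):
    """Two-tailed Z-test p-values for n_trials samples drawn with the given true mean."""
    samples = rng.normal(true_mean, sigma, size=(n_trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    return 2 * norm.sf(np.abs(z))

# H0 is true: every rejection is a Type I error
type1_rate = np.mean(p_values_for(mu0) <= alpha)

# H0 is false (true mean is 105): every non-rejection is a Type II error
type2_rate = np.mean(p_values_for(105.0) > alpha)

print("Estimated Type I error rate:", type1_rate)
print("Estimated Type II error rate:", type2_rate)
```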

7. What is the difference between a one-tailed and a two-tailed test in hypothesis testing?
- Ans.

One-tailed test: Checks for an effect in one direction only.

Two-tailed test: Checks for an effect in both directions (increase or decrease).
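
To make the difference concrete, here is a small sketch that computes one-tailed and two-tailed P-values for the same test statistic. The value `z = 1.8` is an arbitrary example. At α = 0.05 the right-tailed test rejects H₀ while the two-tailed test does not, because the two-tailed P-value is twice as large.

```python
from scipy.stats import norm

z = 1.8  # arbitrary example test statistic

p_right = norm.sf(z)           # H1: mean is greater than the hypothesized value
p_left = norm.cdf(z)           # H1: mean is less than the hypothesized value
p_two = 2 * norm.sf(abs(z))    # H1: mean differs in either direction

print(f"Right-tailed P-value: {p_right:.4f}")
print(f"Left-tailed P-value:  {p_left:.4f}")
print(f"Two-tailed P-value:   {p_two:.4f}")
```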

8. What is the Z-test, and when is it used in hypothesis testing?
- Ans.
Z-test is used when the population standard deviation is known and the sample size is large (n > 30), typically for comparing means.

9. How do you calculate the Z-score, and what does it represent in hypothesis testing?
- Ans.
Formula:

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Where:

- $\bar{X}$ = sample mean
- $\mu$ = population mean
- $\sigma$ = standard deviation
- $n$ = sample size

It tells how many standard errors (σ/√n) the sample mean lies from the hypothesized population mean.

10. What is the T-distribution, and when should it be used instead of the normal distribution?
- Ans.
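The T-distribution is a bell-shaped distribution with heavier tails than the normal distribution. It should be used when the population standard deviation is unknown and is estimated from the sample, especially when the sample size is small (n < 30).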

11. What is the difference between a Z-test and a T-test?
- Ans.

Z-test: Used when population standard deviation is known.

T-test: Used when the population standard deviation is unknown and is estimated from the sample, which matters most when the sample size is small.

12. What is the T-test, and how is it used in hypothesis testing?
- Ans.
T-test compares sample means to check if they are significantly different.
Types: One-sample, Two-sample (independent), and Paired t-tests.

13. What is the relationship between Z-test and T-test in hypothesis testing?
- Ans.
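Both are used to test hypotheses about means. The T-test takes the place of the Z-test when the population standard deviation is unknown and must be estimated from the sample. As the sample size grows, the t-distribution approaches the normal distribution and the two tests give nearly identical results.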

14. What is a confidence interval, and how is it used to interpret statistical results?
- Ans.
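A confidence interval is a range of values, computed from the sample, that is likely to contain the population parameter.
Example: a 95% confidence interval comes from a procedure that captures the true mean in about 95% of repeated samples. Informally, we are 95% confident the true mean lies within that range.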
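
One way to see what "95% confident" means is a coverage simulation: build many intervals from independent samples and count how often they contain the true mean. This is a sketch under made-up settings (true mean 50, σ = 10, n = 25, 2000 repetitions), using the t-based interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sigma, n, n_trials = 50.0, 10.0, 25, 2000  # illustrative values

samples = rng.normal(true_mean, sigma, size=(n_trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)   # standard error of each sample mean
t_crit = stats.t.ppf(0.975, df=n - 1)             # two-sided 95% critical value

lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of intervals that actually contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print("Fraction of intervals containing the true mean:", coverage)
```

The printed fraction should be close to 0.95, though it varies slightly with the random seed.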

15. What is the margin of error, and how does it affect the confidence interval?
- Ans.
The margin of error is the half-width of the confidence interval: the amount added to and subtracted from the sample statistic (critical value × standard error). Larger margin = wider confidence interval = more uncertainty.

16. How is Bayes' Theorem used in statistics, and what is its significance?
- Ans.
Bayes' Theorem updates the probability of a hypothesis based on new evidence. It’s important in decision-making and probability modeling.

Formula:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

17. What is the Chi-square distribution, and when is it used?
- Ans.
Chi-square distribution is used for categorical data to test goodness of fit or independence between variables.

18. What is the Chi-square goodness of fit test, and how is it applied?
- Ans.
It checks whether the observed frequency distribution matches an expected distribution.
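
As a rough sketch of how the test is applied, `scipy.stats.chisquare` compares observed counts with expected counts. The die-roll counts below are invented for illustration (120 rolls, so a fair die would give 20 per face).

```python
from scipy.stats import chisquare

# Hypothetical counts for faces 1..6 from 120 rolls of a die
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6   # fair die: equal expected count for each face

stat, p = chisquare(f_obs=observed, f_exp=expected)

print("Chi-square statistic:", stat)
print("P-value:", p)

if p <= 0.05:
    print("Reject H0: the counts do not match a fair die")
else:
    print("Fail to reject H0: no evidence against a fair die")
```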

19. What is the F-distribution, and when is it used in hypothesis testing?
- Ans.
F-distribution is used in comparing two variances or in ANOVA for testing multiple group means.

20. What is an ANOVA test, and what are its assumptions?
- Ans.
ANOVA (Analysis of Variance) checks if there are significant differences between means of 3 or more groups.
Assumptions:

Independence of observations

Normal distribution

Equal variances
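
A minimal one-way ANOVA sketch using `scipy.stats.f_oneway`. The three groups are simulated with made-up parameters (equal spread, one group with a shifted mean), so the assumptions listed above hold by construction.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)

# Simulated groups: same standard deviation, third group has a higher mean
group_a = rng.normal(50, 5, 30)
group_b = rng.normal(50, 5, 30)
group_c = rng.normal(60, 5, 30)

f_stat, p_val = f_oneway(group_a, group_b, group_c)

print("F-statistic:", f_stat)
print("P-value:", p_val)
```

A small P-value only says that at least one group mean differs; it does not say which one.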

21. What are the different types of ANOVA tests?
- Ans.

One-Way ANOVA: One independent variable

Two-Way ANOVA: Two independent variables

Repeated Measures ANOVA: Same subjects tested under different conditions

22. What is the F-test, and how does it relate to hypothesis testing?
- Ans.
F-test compares the variances of two populations. It is used in ANOVA to decide whether the group means are statistically different.
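
The variance comparison can be sketched by hand with the F-distribution from `scipy.stats.f`. This assumes both samples are independent and roughly normal. The data are simulated with made-up standard deviations of 1 and 2.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
sample_a = rng.normal(0, 1.0, 40)   # simulated, standard deviation 1
sample_b = rng.normal(0, 2.0, 40)   # simulated, standard deviation 2

var_a = np.var(sample_a, ddof=1)    # unbiased sample variances
var_b = np.var(sample_b, ddof=1)

# F-statistic: ratio of the two sample variances
f_stat = var_b / var_a
dfn, dfd = len(sample_b) - 1, len(sample_a) - 1

# Two-tailed P-value: double the smaller tail, capped at 1
p_val = min(1.0, 2 * min(f.sf(f_stat, dfn, dfd), f.cdf(f_stat, dfn, dfd)))

print("F-statistic:", f_stat)
print("P-value:", p_val)
```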

# Practical Part - 1


In [None]:
# 1. Generate a random variable and display its value

import numpy as np

random_var = np.random.rand()
print("Random variable:", random_var)

In [None]:
# 2. Discrete Uniform Distribution & PMF

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint

x = np.arange(1, 7)  # Like a dice
pmf = randint.pmf(x, 1, 7)

plt.bar(x, pmf)
plt.title("PMF of Discrete Uniform Distribution (Dice)")
plt.xlabel("Value")
plt.ylabel("Probability")
plt.show()


In [None]:
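# 3. Bernoulli Distribution – PMF Calculation
# (Bernoulli is discrete, so it has a probability mass function rather than a PDF)

from scipy.stats import bernoulli

def bernoulli_pmf(p, x):
    """Return P(X = x) for a Bernoulli variable with success probability p."""
    return bernoulli.pmf(x, p)

print("P(X=1) when p=0.6:", bernoulli_pmf(0.6, 1))
print("P(X=0) when p=0.6:", bernoulli_pmf(0.6, 0))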


In [None]:
# 4. Simulate Binomial Distribution (n=10, p=0.5) and Plot Histogram

import numpy as np
import matplotlib.pyplot as plt

data = np.random.binomial(n=10, p=0.5, size=1000)

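# One bin centered on each integer outcome 0..10, regardless of which values were drawn
plt.hist(data, bins=np.arange(-0.5, 11.5, 1), edgecolor='black')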
plt.title("Binomial Distribution (n=10, p=0.5)")
plt.xlabel("Number of Successes")
plt.ylabel("Frequency")
plt.show()


In [None]:
# 5. Poisson Distribution and Visualization

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

mu = 3
x = np.arange(0, 10)
pmf = poisson.pmf(x, mu)

plt.bar(x, pmf)
plt.title("Poisson Distribution (λ=3)")
plt.xlabel("k")
plt.ylabel("P(X=k)")
plt.show()


In [None]:
# 6. Plot CDF of a Discrete Uniform Distribution

from scipy.stats import randint
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 7)
cdf = randint.cdf(x, 1, 7)

# where='post' makes each step jump at x and stay flat until the next value, as a CDF does
plt.step(x, cdf, where='post')
plt.title("CDF of Discrete Uniform Distribution")
plt.xlabel("Value")
plt.ylabel("Cumulative Probability")
plt.grid(True)
plt.show()


In [None]:
# 7. Continuous Uniform Distribution and Visualization

import numpy as np
import matplotlib.pyplot as plt

data = np.random.uniform(low=0.0, high=1.0, size=1000)

plt.hist(data, bins=20, edgecolor='black')
plt.title("Continuous Uniform Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()


In [None]:
# 8. Simulate Normal Distribution and Plot Histogram

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=1000)

plt.hist(data, bins=30, edgecolor='black')
plt.title("Normal Distribution (Mean=0, SD=1)")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()


In [None]:
# 9. Z-score Calculation and Plot

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore

data = np.random.normal(50, 10, size=1000)
z_scores = zscore(data)

plt.hist(z_scores, bins=30, edgecolor='black')
plt.title("Z-scores from Normal Distribution")
plt.xlabel("Z-score")
plt.ylabel("Frequency")
plt.show()


In [None]:
# 10. Central Limit Theorem (CLT) Demonstration

import numpy as np
import matplotlib.pyplot as plt

# Generate data from exponential (non-normal) distribution
population = np.random.exponential(scale=2.0, size=100000)

sample_means = []
for _ in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))

plt.hist(sample_means, bins=30, edgecolor='black')
plt.title("CLT: Sampling Distribution of Sample Means")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.show()


In [None]:
# 11. Simulate multiple samples from a normal distribution and verify the Central Limit Theorem

import numpy as np
import matplotlib.pyplot as plt

population = np.random.normal(loc=50, scale=15, size=100000)

sample_means = [np.mean(np.random.choice(population, size=30)) for _ in range(1000)]

plt.hist(sample_means, bins=30, edgecolor='black')
plt.title("CLT: Sample Means from Normal Distribution")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.show()


In [None]:
# 12. Function to calculate and plot the Standard Normal Distribution (mean=0, std=1)

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

def plot_standard_normal():
    x = np.linspace(-4, 4, 1000)
    y = norm.pdf(x, 0, 1)

    plt.plot(x, y)
    plt.title("Standard Normal Distribution (mean=0, std=1)")
    plt.xlabel("Z")
    plt.ylabel("Probability Density")
    plt.grid(True)
    plt.show()

plot_standard_normal()


In [None]:
# 13. Generate random variables and calculate binomial distribution probabilities

import numpy as np
from scipy.stats import binom

n, p = 10, 0.5
x = np.arange(0, 11)
probabilities = binom.pmf(x, n, p)

for i, prob in zip(x, probabilities):
    print(f"P(X={i}) = {prob:.3f}")


In [None]:
# 14. Calculate Z-score and compare with standard normal distribution

from scipy.stats import norm

def z_score(x, mean, std):
    z = (x - mean) / std
    print("Z-score:", z)
    print("Probability (standard normal):", norm.cdf(z))
    return z

z_score(75, 70, 10)


In [None]:
# 15. Hypothesis Testing using Z-statistics
import numpy as np
from scipy.stats import norm

# Example: H0: μ = 100 vs H1: μ > 100 (right-tailed test)
# sample mean = 104, population std = 10, n = 50
sample_mean = 104
pop_mean = 100
std_dev = 10
n = 50

z = (sample_mean - pop_mean) / (std_dev / np.sqrt(n))
p_value = norm.sf(z)  # right-tail probability, equivalent to 1 - norm.cdf(z)

print("Z-statistic:", z)
print("P-value:", p_value)

if p_value <= 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


In [None]:
# 16. Confidence Interval for a Dataset

import scipy.stats as stats

data = np.random.normal(70, 10, 100)
mean = np.mean(data)
sem = stats.sem(data)
ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=sem)

print(f"95% Confidence Interval: {ci}")



In [None]:
# 17. Confidence Interval from Normal Distribution Data
data = np.random.normal(60, 12, 100)
mean = np.mean(data)
sem = stats.sem(data)
ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=sem)

print("Sample Mean:", mean)
print("95% Confidence Interval:", ci)


In [None]:
# 18. PDF of a Normal Distribution

x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, loc=0, scale=1)

plt.plot(x, pdf)
plt.title("PDF of Normal Distribution (mean=0, std=1)")
plt.xlabel("x")
plt.ylabel("PDF")
plt.grid(True)
plt.show()


In [None]:
# 19. CDF of Poisson Distribution

from scipy.stats import poisson

mu = 3
x = np.arange(0, 10)
cdf = poisson.cdf(x, mu)

# where='post' makes each step jump at k and stay flat until the next value, as a CDF does
plt.step(x, cdf, where='post')
plt.title("CDF of Poisson Distribution (λ=3)")
plt.xlabel("k")
plt.ylabel("Cumulative Probability")
plt.grid(True)
plt.show()


In [None]:
# 20. Simulate Continuous Uniform Distribution & Calculate Expected Value

data = np.random.uniform(low=0, high=10, size=1000)
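expected_value = np.mean(data)    # sample mean, an estimate of E[X]
theoretical_value = (0 + 10) / 2  # E[X] = (low + high) / 2 for a uniform distribution

print("Expected Value (sample mean):", expected_value)
print("Expected Value (theoretical):", theoretical_value)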


In [None]:
# 21. Compare Standard Deviations of Two Datasets & Visualize

data1 = np.random.normal(60, 10, 1000)
data2 = np.random.normal(60, 20, 1000)

print("Standard Deviation of Data1:", np.std(data1))
print("Standard Deviation of Data2:", np.std(data2))

plt.hist(data1, bins=30, alpha=0.5, label='Std=10')
plt.hist(data2, bins=30, alpha=0.5, label='Std=20')
plt.legend()
plt.title("Comparison of Standard Deviations")
plt.show()


In [None]:
# 22. Calculate Range and IQR from Normal Distribution

data = np.random.normal(70, 15, 1000)

data_range = np.max(data) - np.min(data)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

print("Range:", data_range)
print("IQR (Interquartile Range):", iqr)


In [None]:
# 23. Z-score Normalization & Visualization

from scipy.stats import zscore

data = np.random.normal(100, 20, 1000)
z_scores = zscore(data)

plt.hist(z_scores, bins=30, edgecolor='black')
plt.title("Z-score Normalized Data")
plt.xlabel("Z-score")
plt.ylabel("Frequency")
plt.show()


In [None]:
# 24. Calculate Skewness and Kurtosis

from scipy.stats import skew, kurtosis

data = np.random.normal(0, 1, 1000)
print("Skewness:", skew(data))
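# scipy's kurtosis() returns excess (Fisher) kurtosis by default, so a normal distribution gives about 0
print("Excess kurtosis:", kurtosis(data))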


# Practical Part - 2

In [None]:
# 1. Perform a Z-test for comparing a sample mean to a known population mean
import numpy as np
from scipy.stats import norm

# Sample data
sample = np.array([102, 100, 98, 101, 97, 99, 100])
sample_mean = np.mean(sample)
n = len(sample)
population_mean = 100
population_std = 2

# Z-score calculation
z = (sample_mean - population_mean) / (population_std / np.sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))

print("Z-score:", z)
print("P-value:", p_value)



In [None]:
# 2. Simulate random data to perform hypothesis testing and calculate P-value

# Simulate random data
np.random.seed(0)
data = np.random.normal(loc=50, scale=10, size=100)

# H0: mean = 50
sample_mean = np.mean(data)
z = (sample_mean - 50) / (10 / np.sqrt(100))
p_value = 2 * (1 - norm.cdf(abs(z)))

print("Sample mean:", sample_mean)
print("Z-score:", z)
print("P-value:", p_value)


In [None]:
# 3. One-sample Z-test
def one_sample_z_test(data, population_mean, population_std):
    n = len(data)
    sample_mean = np.mean(data)
    z = (sample_mean - population_mean) / (population_std / np.sqrt(n))
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

# Example
data = np.random.normal(48, 5, 50)
z, p = one_sample_z_test(data, 50, 5)
print(f"Z = {z:.2f}, P-value = {p:.4f}")


In [None]:
# 4. Two-tailed Z-test with plot
import matplotlib.pyplot as plt

# Plot standard normal distribution
x = np.linspace(-4, 4, 1000)
y = norm.pdf(x)

plt.plot(x, y)
plt.fill_between(x, y, where=(x < -1.96) | (x > 1.96), color='red', alpha=0.3)
plt.title("Two-tailed Z-test (Critical region shaded)")
plt.xlabel("Z-score")
plt.ylabel("Probability Density")
plt.grid(True)
plt.show()


In [None]:
# 5. Visualize Type 1 and Type 2 Errors

def visualize_type1_type2(alpha=0.05, beta=0.2):
    x = np.linspace(-4, 4, 1000)
    null_dist = norm.pdf(x, 0, 1)
    alt_dist = norm.pdf(x, 1.5, 1)

    plt.plot(x, null_dist, label='Null Hypothesis')
    plt.plot(x, alt_dist, label='Alternative Hypothesis')

    critical_value = norm.ppf(1 - alpha)
    plt.axvline(critical_value, color='red', linestyle='--', label='Critical value (α)')

    plt.fill_between(x, null_dist, where=(x > critical_value), color='red', alpha=0.3, label='Type I Error')
    plt.fill_between(x, alt_dist, where=(x < critical_value), color='blue', alpha=0.3, label='Type II Error')

    plt.legend()
    plt.title("Type I and Type II Errors")
    plt.show()

visualize_type1_type2()


In [None]:
# 6. Independent T-test

from scipy.stats import ttest_ind

group1 = np.random.normal(100, 10, 30)
group2 = np.random.normal(102, 10, 30)

t_stat, p_val = ttest_ind(group1, group2)
print("T-statistic:", t_stat)
print("P-value:", p_val)


In [None]:
# 7. Paired Sample T-test

from scipy.stats import ttest_rel

before = np.random.normal(70, 5, 20)
after = before + np.random.normal(2, 2, 20)

t_stat, p_val = ttest_rel(before, after)
print("T-statistic:", t_stat)
print("P-value:", p_val)


In [None]:
# 8. Compare Z-test and T-test

# Z-test
sample = np.random.normal(100, 10, 30)
z_score = (np.mean(sample) - 100) / (10 / np.sqrt(30))
print("Z-test score:", z_score)

# T-test
from scipy.stats import ttest_1samp
t_stat, p_val = ttest_1samp(sample, 100)
print("T-test stat:", t_stat)


In [None]:
# 9. Confidence Interval for Sample Mean

import scipy.stats as stats

data = np.random.normal(50, 5, 100)
mean = np.mean(data)
sem = stats.sem(data)
conf_int = stats.t.interval(0.95, len(data)-1, loc=mean, scale=sem)

print("95% Confidence Interval:", conf_int)


In [None]:
# 10. Margin of Error Calculation

sample_size = 100
std_dev = 5
confidence_level = 0.95
z_score = norm.ppf(1 - (1 - confidence_level) / 2)

margin_of_error = z_score * (std_dev / np.sqrt(sample_size))
print("Margin of Error:", margin_of_error)


In [None]:
# 11. Implement a Bayesian inference method using Bayes' Theorem in Python
# Given:
# P(A): Probability of having a disease
# P(B|A): Probability of testing positive if the person has the disease
# P(B): Total probability of testing positive

P_A = 0.01         # Prior: 1% of population has the disease
P_B_given_A = 0.99 # Sensitivity: 99% test positive if they have it
P_B_given_notA = 0.05  # False positive rate: 5%

# Total probability of testing positive:
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# Apply Bayes' Theorem
P_A_given_B = (P_B_given_A * P_A) / P_B
print(f"Probability of having the disease given a positive test: {P_A_given_B:.4f}")


In [None]:
# 12. Perform a Chi-square test for independence between two categorical variables in Python

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table (e.g., Gender vs Product Preference)
data = [[20, 15],  # Male
        [30, 35]]  # Female

table = pd.DataFrame(data, columns=["Product A", "Product B"], index=["Male", "Female"])

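# Note: for a 2x2 table, chi2_contingency applies Yates' continuity correction by default (correction=True)
chi2, p, dof, expected = chi2_contingency(table)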

print("Chi-square Statistic:", chi2)
print("P-value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)


In [None]:
# 13. Write a Python program to calculate the expected frequencies for a Chi-square test
# We already calculated expected frequencies from chi2_contingency
print("Expected Frequencies:")
print(pd.DataFrame(expected, columns=["Product A", "Product B"], index=["Male", "Female"]))
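
For reference, the expected frequencies can also be computed directly from the table margins, since under independence each expected count is (row total × column total) / grand total. A minimal sketch using the same Gender vs Product Preference table, where `expected_manual` is just an illustrative variable name:

```python
import numpy as np

# Same contingency table as above (rows: Male, Female; columns: Product A, Product B)
observed = np.array([[20, 15],
                     [30, 35]])

row_totals = observed.sum(axis=1)   # totals per gender
col_totals = observed.sum(axis=0)   # totals per product
grand_total = observed.sum()

# Expected count for each cell under independence
expected_manual = np.outer(row_totals, col_totals) / grand_total

print("Expected Frequencies (manual):")
print(expected_manual)
```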
