Problem – Defective Items in a Factory
A factory has recorded the number of defective items produced per day over 1000 production days. The daily count is simulated as a random integer between 0 and 20 (inclusive) to mimic real-world variability in quality.

Using this dataset, calculate the probability that exactly 5 defective items will be produced on a new day. Use Python to:

Generate the data

Calculate the mean and standard deviation

Compute the probability using the normal distribution (with continuity correction)





In [8]:
from numpy.random import randint as ri
import pandas as pd
from scipy.stats import norm

# Generate random data for 1000 days (defective items between 0 and 20)
defects = ri(0, 21, 1000)
defects = pd.Series(defects)

mean = defects.mean()
std_dev = defects.std()

print(f"Mean: {mean}")
print(f"Standard Deviation: {std_dev}")

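# Continuity correction: P(X = 5) is approximated by P(4.5 < X < 5.5)
lower_bound = 4.5
upper_bound = 5.5

# Area under the fitted normal curve between the two bounds
probability_normal = norm.cdf(upper_bound, loc=mean, scale=std_dev) - norm.cdf(lower_bound, loc=mean, scale=std_dev)

print(f"Probability using Normal Distribution: {probability_normal:.3f}")



# Empirical probability: share of simulated days with exactly 5 defects
actual_count = (defects == 5).sum()
actual_probability = actual_count / len(defects)

print(f"Probability using Actual Data: {actual_probability}")



Mean: 9.869
Standard Deviation: 6.101234634405081
Probability using Normal Distribution: 0.048
Probability using Actual Data: 0.038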
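
The continuity-corrected normal approximation can be checked against the exact answer, because the simulated counts follow a discrete uniform distribution on 0 to 20. The sketch below is illustrative only: the helper name `normal_point_probability` is hypothetical, and it uses the theoretical mean and standard deviation of the uniform distribution rather than the sample values.

```python
from scipy.stats import norm, randint

def normal_point_probability(k, mean, std_dev):
    """Approximate P(X = k) for an integer k with a continuity-corrected normal."""
    return norm.cdf(k + 0.5, loc=mean, scale=std_dev) - norm.cdf(k - 0.5, loc=mean, scale=std_dev)

# Discrete uniform distribution on the integers 0..20 (upper bound is exclusive)
uniform = randint(0, 21)

p_exact = uniform.pmf(5)  # exactly 1/21
p_normal = normal_point_probability(5, uniform.mean(), uniform.std())

print(f"Exact uniform probability: {p_exact:.4f}")
print(f"Normal approximation:      {p_normal:.4f}")
```

The two values are close for a count of 5, but uniform data is not bell-shaped, so the approximation is noticeably rougher for counts near 0 or 20.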


Problem – Testing the Claim About Delivery Time
A food delivery company claims that its average delivery time is 30 minutes. Based on historical data, the population standard deviation is known to be 4 minutes.

To evaluate this claim, a consumer rights group decides to test the null hypothesis that the average delivery time is at most 30 minutes. They observe a sample of 40 deliveries, and the average delivery time for the sample comes out to be 31.2 minutes.

Objective:
Test the null hypothesis using the z-test.

Null Hypothesis (H₀): μ ≤ 30

Alternative Hypothesis (H₁): μ > 30


In [9]:
import numpy as np
from scipy.stats import norm

# Given values
population_mean = 30        # Claimed mean
sample_mean = 31.2          # Sample mean from 40 deliveries
std_dev = 4                 # Known population standard deviation
n = 40                     # Sample size



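# Standard error of the mean (population standard deviation is known)
std_error = std_dev / np.sqrt(n)

# Z-score of the observed sample mean under H0
z_score = (sample_mean - population_mean) / std_error

# Right-tailed p-value, since H1 is mu > 30
p_value = 1 - norm.cdf(z_score)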


print(f"Z-score: {z_score:.2f}")
print(f"P-value: {p_value:.4f}")

Z-score: 1.90
P-value: 0.0289
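
To turn these numbers into a decision, the p-value can be compared with a significance level. The sketch below assumes α = 0.05, which the problem statement does not specify.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05  # assumed significance level; not given in the problem

population_mean, sample_mean, std_dev, n = 30, 31.2, 4, 40
z_score = (sample_mean - population_mean) / (std_dev / np.sqrt(n))
p_value = norm.sf(z_score)        # right-tailed p-value, same as 1 - cdf
z_critical = norm.ppf(1 - alpha)  # right-tailed critical value

reject_h0 = p_value < alpha
print(f"Critical z at alpha={alpha}: {z_critical:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Under that assumption the p-value of 0.0289 falls below the threshold, so H₀ would be rejected; at α = 0.01 it would not be.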


Problem – Fitness Program Impact Analysis
A health company evaluates its 6-week fitness program using performance scores (out of 100) recorded before and after the program for 150 participants. The data also records each participant's gender (0 = Female, 1 = Male).

Your Task:
Perform the following tests:

One-Sample t-Test – Is the average initial score ≥ 65?

Two-Sample t-Test – Compare initial scores of males vs females

Paired t-Test – Are final scores significantly higher?

Dataset Generation:
import numpy as np
import pandas as pd

np.random.seed(100)
n = 150

gender = np.random.choice([0, 1], size=n)
initial_scores = np.random.normal(loc=64, scale=6, size=n)
final_scores = initial_scores + np.random.normal(loc=5, scale=3, size=n)

df = pd.DataFrame({
    'Gender': gender,
    'Initial_Score': initial_scores,
    'Final_Score': final_scores
})
df.head()

from scipy.stats import ttest_1samp, ttest_ind, ttest_rel


In [17]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

# Step 1: Generate dataset
np.random.seed(100)
n = 150

gender = np.random.choice([0, 1], size=n)  # 0 = Female, 1 = Male
initial_scores = np.random.normal(loc=64, scale=6, size=n)  # Before program
final_scores = initial_scores + np.random.normal(loc=5, scale=3, size=n)  # After program

df = pd.DataFrame({
    'Gender': gender,
    'Initial_Score': initial_scores,
    'Final_Score': final_scores
})

print("Sample data:")
print(df.head())

print("\n\n")

# One-sample t-test: does the mean initial score differ from 65? (two-tailed by default)
t_stat, p_value = ttest_1samp(df['Initial_Score'], 65)

print(f"T-statistic: {t_stat:.3f}")
print(f"P-value (two-tailed): {p_value:.3f}")

print("\n \n")

# Two-sample t-test: compare initial scores of males and females
male_scores = df[df['Gender'] == 1]['Initial_Score']
female_scores = df[df['Gender'] == 0]['Initial_Score']

t_stat2, p_value2 = ttest_ind(male_scores, female_scores, equal_var=False)  # Welch's t-test

print(f"T-statistic: {t_stat2:.3f}")
print(f"P-value: {p_value2:.3f}")


print("\n\n")


# Paired t-test: compare each participant's final score with their initial score
t_stat3, p_value3 = ttest_rel(df['Final_Score'], df['Initial_Score'])

print(f"T-statistic: {t_stat3:.3f}")
print(f"P-value (two-tailed): {p_value3:.3f}")


Sample data:
   Gender  Initial_Score  Final_Score
0       0      73.167718    76.049901
1       0      67.883235    75.156484
2       1      59.935980    65.727168
3       1      62.409887    68.352951
4       1      68.476639    70.330144



T-statistic: -2.299
P-value (two-tailed): 0.023

 

T-statistic: -0.449
P-value: 0.654



T-statistic: 19.001
P-value (two-tailed): 0.000
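
The one-sample and paired questions are directional, while `ttest_1samp` and `ttest_rel` report two-tailed p-values by default. A minimal sketch of obtaining one-sided p-values from the t distribution follows. The helper name `one_sided_p` is hypothetical, and the data is regenerated with the same seed so the block runs on its own.

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_1samp, ttest_rel

np.random.seed(100)
n = 150
gender = np.random.choice([0, 1], size=n)  # drawn first so the scores match the dataset
initial = np.random.normal(loc=64, scale=6, size=n)
final = initial + np.random.normal(loc=5, scale=3, size=n)

def one_sided_p(t_stat, dof, direction):
    """One-sided p-value for H1 'less' or 'greater' given a t statistic."""
    return t_dist.cdf(t_stat, dof) if direction == "less" else t_dist.sf(t_stat, dof)

# H0: mean initial score >= 65, H1: mean initial score < 65
t1, p1_two = ttest_1samp(initial, 65)
p1_less = one_sided_p(t1, n - 1, "less")

# H0: final <= initial, H1: final > initial (paired, so dof = n - 1)
t3, p3_two = ttest_rel(final, initial)
p3_greater = one_sided_p(t3, n - 1, "greater")

print(f"One-sample, H1 mean < 65:   p = {p1_less:.4f}")
print(f"Paired, H1 final > initial: p = {p3_greater:.4g}")
```

When the t statistic points in the direction of the alternative, the one-sided p-value is half the two-tailed one.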


Problem – ANOVA: Customer Satisfaction Across Store Branches
A retail company wants to check if satisfaction scores differ among Branch A, Branch B, and Branch C.

Hypotheses:
H₀: Mean satisfaction scores are equal across all branches

H₁: At least one branch has a different mean

Dataset Generation:
import numpy as np
import pandas as pd

np.random.seed(42)
n = 70

branch_a = np.random.normal(loc=420, scale=30, size=n)
branch_b = np.random.normal(loc=400, scale=35, size=n)
branch_c = np.random.normal(loc=430, scale=25, size=n)

data = pd.DataFrame({
    'Customer_ID': range(1, n*3 + 1),
    'Branch': ['A']*n + ['B']*n + ['C']*n,
    'Satisfaction_Score': np.concatenate([branch_a, branch_b, branch_c])
})
data.head()

In [18]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Generate dataset
np.random.seed(42)
n = 70

branch_a = np.random.normal(loc=420, scale=30, size=n)
branch_b = np.random.normal(loc=400, scale=35, size=n)
branch_c = np.random.normal(loc=430, scale=25, size=n)

data = pd.DataFrame({
    'Customer_ID': range(1, n*3 + 1),
    'Branch': ['A']*n + ['B']*n + ['C']*n,
    'Satisfaction_Score': np.concatenate([branch_a, branch_b, branch_c])
})

print(data.head())

   Customer_ID Branch  Satisfaction_Score
0            1      A          434.901425
1            2      A          415.852071
2            3      A          439.430656
3            4      A          465.690896
4            5      A          412.975399


In [19]:
scores_a = data[data['Branch'] == 'A']['Satisfaction_Score']
scores_b = data[data['Branch'] == 'B']['Satisfaction_Score']
scores_c = data[data['Branch'] == 'C']['Satisfaction_Score']

# Perform one-way ANOVA
f_stat, p_value = f_oneway(scores_a, scores_b, scores_c)

print("\nANOVA test results:")
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.4f}")


ANOVA test results:
F-statistic: 24.800
P-value: 0.0000
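
To make clear what `f_oneway` computes, the F-statistic can be rebuilt from the between-group and within-group sums of squares. This is a sketch for illustration: the data is regenerated with the same seed, and the names `ss_between`, `ss_within`, `f_manual`, and `p_manual` are introduced here rather than taken from the original.

```python
import numpy as np
from scipy.stats import f as f_dist, f_oneway

np.random.seed(42)
n = 70
groups = [
    np.random.normal(loc=420, scale=30, size=n),  # Branch A
    np.random.normal(loc=400, scale=35, size=n),  # Branch B
    np.random.normal(loc=430, scale=25, size=n),  # Branch C
]

grand_mean = np.mean(np.concatenate(groups))
k = len(groups)
n_total = sum(len(g) for g in groups)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = f_dist.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual F: {f_manual:.3f}, scipy F: {f_scipy:.3f}")
```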


Problem – Chi-Square Goodness of Fit: Passenger Forecast Accuracy
Forecasted vs. actual passenger counts (in thousands):

Forecasted: [95, 110, 100, 130, 160, 210, 230]

Observed: [90, 105, 98, 135, 165, 205, 225]

Hypotheses:
H₀: No difference between forecast and observed

H₁: There is a difference

Python Code:
import numpy as np
from scipy.stats import chi2

expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])

In [23]:
import numpy as np
from scipy.stats import chisquare

# Given data
expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])

# Scale the observed frequencies to match the sum of expected frequencies
observed_scaled = observed * (expected.sum() / observed.sum())

chi_stat, p_value = chisquare(f_obs=observed_scaled, f_exp=expected)

print(f"Chi-square statistic: {chi_stat:.3f}")
print(f"P-value: {p_value:.4f}")

Chi-square statistic: 0.990
P-value: 0.9860
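
The rescaling step is there because `chisquare` requires the observed and expected totals to agree. As an alternative sketch, the Pearson statistic can be computed directly from the raw counts and converted to a p-value with `chi2`. Using k − 1 = 6 degrees of freedom is an assumption here.

```python
import numpy as np
from scipy.stats import chi2

expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])

# Pearson chi-square statistic on the unscaled counts
chi_stat = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1  # assumed: k - 1 degrees of freedom
p_value = chi2.sf(chi_stat, dof)

print(f"Chi-square statistic: {chi_stat:.3f}")
print(f"P-value: {p_value:.4f}")
```

The statistic differs slightly from the rescaled version, but the p-value stays far above any conventional significance level, so the conclusion about H₀ is unchanged.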


Problem – Manual Covariance: Study Hours vs. Exam Scores
Student	Hours_Studied	Exam_Score
A	2	65
B	4	70
C	6	75
D	8	85
E	10	95




In [24]:
# Given data
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 85, 95]


In [25]:
mean_hours = sum(hours) / len(hours)
mean_scores = sum(scores) / len(scores)

In [26]:
cov_sum = 0
for i in range(len(hours)):
    cov_sum += (hours[i] - mean_hours) * (scores[i] - mean_scores)

In [27]:
covariance = cov_sum / (len(hours) - 1)

print(f"Mean of Hours Studied: {mean_hours}")
print(f"Mean of Exam Scores: {mean_scores}")
print(f"Covariance: {covariance}")

Mean of Hours Studied: 6.0
Mean of Exam Scores: 78.0
Covariance: 37.5
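
As a cross-check on the manual loop, the same sample covariance can be read from NumPy's covariance matrix. This is a short sketch; `np.cov` divides by n − 1 by default, matching the manual calculation.

```python
import numpy as np

hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 85, 95]

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(hours, scores)
cov_matrix = np.cov(hours, scores)
covariance = cov_matrix[0, 1]

print(f"Covariance: {covariance}")
```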


Problem – Manual Correlation: Exercise Hours vs. Stress Level

In [28]:
# Given data
exercise = [1, 3, 5, 7, 9]
stress = [85, 75, 60, 55, 40]

# Step 1: Calculate the means
mean_ex = sum(exercise) / len(exercise)
mean_st = sum(stress) / len(stress)

# Step 2: Calculate covariance and standard deviations
cov_sum = 0
std_ex_sum = 0
std_st_sum = 0

for i in range(len(exercise)):
    x_diff = exercise[i] - mean_ex
    y_diff = stress[i] - mean_st
    cov_sum += x_diff * y_diff
    std_ex_sum += x_diff ** 2
    std_st_sum += y_diff ** 2

# Step 3: Final calculations
covariance = cov_sum / (len(exercise) - 1)
std_ex = (std_ex_sum / (len(exercise) - 1)) ** 0.5
std_st = (std_st_sum / (len(exercise) - 1)) ** 0.5
correlation = covariance / (std_ex * std_st)

# Output results
print(f"Mean Exercise Hours: {mean_ex}")
print(f"Mean Stress Level: {mean_st}")
print(f"Covariance: {covariance}")
print(f"Std Dev of Exercise: {std_ex}")
print(f"Std Dev of Stress: {std_st}")
print(f"Correlation Coefficient (r): {correlation}")


Mean Exercise Hours: 5.0
Mean Stress Level: 63.0
Covariance: -55.0
Std Dev of Exercise: 3.1622776601683795
Std Dev of Stress: 17.53567791675018
Correlation Coefficient (r): -0.9918365981341756
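
The manual result can be verified with `np.corrcoef`, which returns the Pearson correlation matrix. This is a minimal sketch using the same two lists.

```python
import numpy as np

exercise = [1, 3, 5, 7, 9]
stress = [85, 75, 60, 55, 40]

# Off-diagonal entry of the 2x2 correlation matrix is r(exercise, stress)
r = np.corrcoef(exercise, stress)[0, 1]
print(f"Correlation Coefficient (r): {r}")
```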
