## Problem – Defective Items in a Factory

A factory has recorded the number of defective items produced per day over **1000 production days**. To simulate real-world variability in quality, each daily count is drawn uniformly at random from the integers **0 to 20** (inclusive).

Using this dataset, calculate the probability that **exactly 5 defective items** will be produced on a new day. Use Python to:

- Generate the data
- Calculate the mean and standard deviation
- Compute the probability using the normal distribution (with continuity correction)

---

### Step-by-step Solution in Python

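```python
import numpy as np
import pandas as pd
from numpy.random import randint as ri
from scipy.stats import norm

# Generate random data for 1000 days (defective items between 0 and 20;
# the upper bound of randint is exclusive, hence 21)
defects = pd.Series(ri(0, 21, 1000))

# Sample mean and sample standard deviation (ddof=1)
mean = np.mean(defects)
std_dev = np.std(defects, ddof=1)

# P(X = 5) under the normal approximation with continuity correction,
# i.e. the area between 4.5 and 5.5
prob_5_defects = norm.cdf(5.5, loc=mean, scale=std_dev) - norm.cdf(4.5, loc=mean, scale=std_dev)

print(f"Mean of defective items per day: {mean:.3f}")
print(f"Standard deviation: {std_dev:.3f}")
print(f"Probability of exactly 5 defective items: {prob_5_defects:.4f}")

# Sample output (no seed is set, so the values vary between runs):
# Mean of defective items per day: 9.633
# Standard deviation: 6.221
# Probability of exactly 5 defective items: 0.0486
```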
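The simulated counts are uniform on 0 to 20 rather than bell-shaped, so it can be useful to compare the normal approximation with two other estimates: the empirical frequency of days with exactly 5 defects and the exact uniform probability 1/21. The sketch below is illustrative only; the seed, the use of `np.random.default_rng`, and the variable names are assumptions that are not part of the original exercise.

```python
import numpy as np
from scipy.stats import norm

# Assumption: a fixed seed, chosen arbitrarily, so the comparison is repeatable
rng = np.random.default_rng(0)
defects = rng.integers(0, 21, size=1000)  # integers 0..20 inclusive

# Empirical frequency of days with exactly 5 defects
empirical = np.mean(defects == 5)

# Exact probability under a discrete uniform distribution on 0..20
exact_uniform = 1 / 21

# Normal approximation with continuity correction
mean = defects.mean()
std_dev = defects.std(ddof=1)
normal_approx = norm.cdf(5.5, mean, std_dev) - norm.cdf(4.5, mean, std_dev)

print(f"Empirical frequency:  {empirical:.4f}")
print(f"Exact uniform (1/21): {exact_uniform:.4f}")
print(f"Normal approximation: {normal_approx:.4f}")
```

All three numbers should land in the same neighborhood, but the normal curve is only a rough stand-in for a flat distribution, so the agreement at other values (for example near 0 or 20) can be noticeably worse.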


## Problem – Testing the Claim About Delivery Time

A food delivery company claims that its average delivery time is **30 minutes**. Based on historical data, the **population standard deviation** is known to be **4 minutes**.

To evaluate this claim, a consumer rights group decides to test the null hypothesis that the average delivery time is **at most 30 minutes**. They observe a sample of **40 deliveries**, and the average delivery time for the sample comes out to be **31.2 minutes**.

### Objective:

Test the null hypothesis using the z-test.  
- **Null Hypothesis (H₀): μ ≤ 30** (Average delivery time is 30 minutes or less)  
- **Alternative Hypothesis (H₁): μ > 30** (Average delivery time is more than 30 minutes)

---

### Step-by-step Solution in Python

```python
import numpy as np
from scipy.stats import norm

# Known values
mu_0 = 30        # Claimed average delivery time (minutes)
x_bar = 31.2     # Observed sample mean (minutes)
sigma = 4        # Known population standard deviation (minutes)
n = 40           # Sample size

# z-statistic for the sample mean
z = (x_bar - mu_0) / (sigma / np.sqrt(n))

# Right-tailed p-value, because H₁ is μ > 30
p_value = 1 - norm.cdf(z)

print(f"Z-statistic: {z:.3f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The average delivery time is significantly greater than 30 minutes.")
else:
    print("Fail to reject the null hypothesis: Not enough evidence to say the average delivery time is greater than 30 minutes.")

# Output:
# Z-statistic: 1.897
# P-value: 0.0289
# Reject the null hypothesis: The average delivery time is significantly greater than 30 minutes.
```
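The same decision can also be reached by comparing the z-statistic with a critical value instead of using the p-value. The following is a minimal sketch of that alternative, reusing the numbers from the problem statement; the variable names `z_crit` and `reject` are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Values from the problem statement
mu_0, x_bar, sigma, n = 30, 31.2, 4, 40
alpha = 0.05

# Test statistic
z = (x_bar - mu_0) / (sigma / np.sqrt(n))

# Right-tailed critical value: reject H0 when z exceeds it
z_crit = norm.ppf(1 - alpha)
reject = z > z_crit

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

Both approaches are equivalent for a given α: the p-value falls below α exactly when the statistic exceeds the critical value.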


## Problem – Fitness Program Impact Analysis

A health and wellness company is evaluating the impact of its **6-week fitness training program**. They collect performance data (in terms of fitness scores out of 100) from participants **before and after** the program.

You are provided with a dataset of **150 participants**, with the following information:
- **Initial Score** (before the program)
- **Final Score** (after the program)
- **Gender** of the participant (0 = Female, 1 = Male)

---

### Your Task:

Using the dataset provided below, perform the following statistical tests:

1. **One-Sample t-Test**  
   Test whether the **average initial fitness score** is at least **65**.

2. **Two-Sample Independent t-Test**  
   Compare the **initial fitness scores of male and female participants** to check if there's a significant difference.

3. **Paired Sample t-Test**  
   Test whether the **final scores are significantly higher than the initial scores**, i.e., whether the fitness program had a measurable impact.

---


### Hypotheses

1️⃣ One-Sample t-Test:

**Null Hypothesis** H₀: μ ≥ 65 (Average initial score is at least 65)

**Alternative Hypothesis** H₁: μ < 65 (Average initial score is less than 65)

2️⃣ Two-Sample Independent t-Test:

**Null Hypothesis** H₀: μ₁ = μ₂ (No difference in average initial scores between males and females)

**Alternative Hypothesis** H₁: μ₁ ≠ μ₂ (There is a difference in average initial scores)

3️⃣ Paired Sample t-Test (μ_diff is the mean of Initial Score − Final Score):

**Null Hypothesis** H₀: μ_diff = 0 (No change in scores before and after the program)

**Alternative Hypothesis** H₁: μ_diff < 0 (Final scores are higher than initial scores)

### Data Setup – Generate your dataset using the code below:

```python
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(100)

# Sample size
n = 150

# Gender (0 = Female, 1 = Male)
gender = np.random.choice([0, 1], size=n)

# Initial scores (mean slightly < 65 to create realistic test)
initial_scores = np.random.normal(loc=64, scale=6, size=n)

# Final scores (showing average improvement)
final_scores = initial_scores + np.random.normal(loc=5, scale=3, size=n)

# Create DataFrame
df = pd.DataFrame({
    'Gender': gender,
    'Initial_Score': initial_scores,
    'Final_Score': final_scores
})

df.head()
```

### T-Test Instructions

Use the appropriate `scipy.stats` test for each hypothesis:

### 1. One-Sample T-Test
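```python
from scipy.stats import ttest_1samp

# ttest_1samp returns a two-sided p-value by default
t_stat1, p_val1 = ttest_1samp(df['Initial_Score'], popmean=65)
# Convert to a left-tailed p-value for H₁: μ < 65
p_val1_one_tailed = p_val1 / 2 if t_stat1 < 0 else 1 - p_val1 / 2
print(f"t = {t_stat1:.3f}, one-tailed p = {p_val1_one_tailed:.4f}")
```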
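Recent SciPy releases (1.6 and later) also accept an `alternative` argument, which avoids the manual halving. The sketch below rebuilds the exercise dataset so it runs on its own and checks that both routes give the same left-tailed p-value; the names `p_manual` and `p_less` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# Rebuild the exercise dataset (same seed and parameters as the data setup)
np.random.seed(100)
n = 150
gender = np.random.choice([0, 1], size=n)
initial_scores = np.random.normal(loc=64, scale=6, size=n)
final_scores = initial_scores + np.random.normal(loc=5, scale=3, size=n)
df = pd.DataFrame({
    'Gender': gender,
    'Initial_Score': initial_scores,
    'Final_Score': final_scores
})

# Route 1: two-sided test, then convert to a left-tailed p-value by hand
t_two, p_two = ttest_1samp(df['Initial_Score'], popmean=65)
p_manual = p_two / 2 if t_two < 0 else 1 - p_two / 2

# Route 2: ask SciPy for the left-tailed test directly (requires SciPy >= 1.6)
t_less, p_less = ttest_1samp(df['Initial_Score'], popmean=65, alternative='less')

print(f"t = {t_less:.3f}")
print(f"manual one-tailed p = {p_manual:.4f}, alternative='less' p = {p_less:.4f}")
```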

### 2. Two-Sample Independent T-Test

```python
from scipy.stats import ttest_ind  # For independent t-test
from scipy.stats import ttest_rel  # For paired sample t-test (used later)

# Split the initial scores by gender (0 = Female, 1 = Male)
male_scores = df[df['Gender'] == 1]['Initial_Score']
female_scores = df[df['Gender'] == 0]['Initial_Score']

# equal_var=False runs Welch's t-test, which does not assume equal variances
t_stat2, p_val2 = ttest_ind(male_scores, female_scores, equal_var=False)
print(f"t = {t_stat2:.3f}, two-tailed p = {p_val2:.4f}")
```
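### 3. Paired Sample T-Test

The third test from the task list can be sketched in the same style. This is one possible approach, assuming SciPy 1.6 or later for the `alternative` argument; the names `t_stat3` and `p_val3` simply continue the numbering used above. With `alternative='less'`, `ttest_rel(a, b)` tests whether the mean of `a − b` is below zero, which matches H₁: μ_diff < 0 when the difference is Initial Score − Final Score.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

# Rebuild the exercise dataset (same seed and parameters as the data setup)
np.random.seed(100)
n = 150
gender = np.random.choice([0, 1], size=n)
initial_scores = np.random.normal(loc=64, scale=6, size=n)
final_scores = initial_scores + np.random.normal(loc=5, scale=3, size=n)
df = pd.DataFrame({
    'Gender': gender,
    'Initial_Score': initial_scores,
    'Final_Score': final_scores
})

# Left-tailed paired test on (Initial - Final)
t_stat3, p_val3 = ttest_rel(df['Initial_Score'], df['Final_Score'], alternative='less')

print(f"t = {t_stat3:.3f}, one-tailed p = {p_val3:.4g}")

alpha = 0.05  # assumed significance level; the task does not state one
if p_val3 < alpha:
    print("Reject H0: final scores are significantly higher than initial scores.")
else:
    print("Fail to reject H0: no evidence that final scores are higher.")
```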

## Problem – ANOVA Analysis of Customer Satisfaction Across Store Branches

A retail company wants to analyze whether the **average customer satisfaction scores** vary significantly across its three store branches: **Branch A, Branch B, and Branch C**.

You are provided with data containing:
- **Customer_ID**
- **Branch** (Categorical Variable)
- **Satisfaction_Score** (Continuous Variable on a scale from 0 to 500)

---

### Objective:
Use **One-Way ANOVA** to test the following hypotheses:

- **H₀ (Null Hypothesis)**: The average satisfaction scores across all three branches are **equal**.
- **H₁ (Alternative Hypothesis)**: At least one branch has a **different average** satisfaction score.

---

### Dataset Generation (Run this code block)

```python
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

# Sample size per branch
n = 70

# Create satisfaction scores for three branches
branch_a = np.random.normal(loc=420, scale=30, size=n)
branch_b = np.random.normal(loc=400, scale=35, size=n)
branch_c = np.random.normal(loc=430, scale=25, size=n)

# Combine into a DataFrame
data = pd.DataFrame({
    'Customer_ID': range(1, n*3 + 1),
    'Branch': ['A'] * n + ['B'] * n + ['C'] * n,
    'Satisfaction_Score': np.concatenate([branch_a, branch_b, branch_c])
})

data.head()
```

Then run the one-way ANOVA on the three branch groups:

```python
from scipy.stats import f_oneway

# Split the satisfaction scores by branch
scores_a = data[data['Branch'] == 'A']['Satisfaction_Score']
scores_b = data[data['Branch'] == 'B']['Satisfaction_Score']
scores_c = data[data['Branch'] == 'C']['Satisfaction_Score']

f_stat, p_value = f_oneway(scores_a, scores_b, scores_c)

print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")
```
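The ANOVA output still needs a decision rule. The sketch below adds one, together with the per-branch means for context. The significance level of 0.05 is an assumption, since the problem does not specify one, and the dataset is rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Rebuild the exercise dataset (same seed and parameters as above)
np.random.seed(42)
n = 70
branch_a = np.random.normal(loc=420, scale=30, size=n)
branch_b = np.random.normal(loc=400, scale=35, size=n)
branch_c = np.random.normal(loc=430, scale=25, size=n)
data = pd.DataFrame({
    'Customer_ID': range(1, n * 3 + 1),
    'Branch': ['A'] * n + ['B'] * n + ['C'] * n,
    'Satisfaction_Score': np.concatenate([branch_a, branch_b, branch_c])
})

# Per-branch sample means, for context
print(data.groupby('Branch')['Satisfaction_Score'].mean().round(2))

# One-way ANOVA across the three branches
groups = [g['Satisfaction_Score'] for _, g in data.groupby('Branch')]
f_stat, p_value = f_oneway(*groups)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print(f"F = {f_stat:.2f}, p = {p_value:.4g}: reject H0, at least one branch mean differs.")
else:
    print(f"F = {f_stat:.2f}, p = {p_value:.4g}: fail to reject H0.")
```

A significant F-test only says that some difference exists; identifying which branches differ would need a follow-up (post hoc) comparison.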

## Problem – Evaluate Forecast Accuracy Using the Chi-Square Goodness of Fit Test

The city’s public transportation authority uses a forecasting model to estimate the number of metro passengers for each day of the week. These forecasts help manage train schedules, staffing, and platform operations.

Recently, actual passenger counts were collected and compared to the forecasted values to evaluate how well the model performs.

---

### Question

You are provided with the forecasted and observed number of passengers (in thousands) for each day of a week:

- **Forecasted Values (Expected):**  
  `[95, 110, 100, 130, 160, 210, 230]`

- **Observed Values (Actual):**  
  `[90, 105, 98, 135, 165, 205, 225]`

Using a **Chi-Square Goodness of Fit Test**, determine whether the forecast model provides an accurate estimate of daily passenger traffic.

---

### Hypotheses

- **Null Hypothesis (H₀):** There is no significant difference between the forecasted and observed values (i.e., the model is accurate).
- **Alternative Hypothesis (H₁):** There is a significant difference between the forecasted and observed values (i.e., the model is inaccurate).

---

### Test Parameters

- **Significance Level (α):** 0.10  
- **Degrees of Freedom (df):** 6  

---

### Instructions

1. **Perform the Chi-Square Goodness of Fit Test** using the given data.
2. **Calculate**:
   - Chi-Square Test Statistic
   - Critical Value at α = 0.10
3. **Compare** the test statistic with the critical value.
4. **State your conclusion**:
   - Do you **reject** or **fail to reject** the null hypothesis?
   - What does this imply about the **accuracy of the forecasting model**?

---

### Python Starter Code

```python
import numpy as np
from scipy.stats import chi2

# Data (passengers in thousands)
expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])

# Test parameters
alpha = 0.10
df = len(observed) - 1  # 7 categories - 1 = 6

# Chi-square statistic: sum of (O - E)^2 / E
chi_square_stat = np.sum((observed - expected) ** 2 / expected)
print(f"Chi-Square Test Statistic: {chi_square_stat:.4f}")

critical_value = chi2.ppf(1 - alpha, df)
print(f"Critical Value at α = 0.10: {critical_value:.4f}")

if chi_square_stat > critical_value:
    conclusion = "Reject the null hypothesis: The model is not accurate."
else:
    conclusion = "Fail to reject the null hypothesis: The model is accurate."

print(conclusion)

# Output:
# Chi-Square Test Statistic: 1.1067
# Critical Value at α = 0.10: 10.6446
# Fail to reject the null hypothesis: The model is accurate.
```
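The comparison against the critical value can be cross-checked with a p-value. The following sketch uses the survival function `chi2.sf`, which gives the right-tail probability of the statistic; the variable names are illustrative and the data are copied from the question.

```python
import numpy as np
from scipy.stats import chi2

expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])

alpha = 0.10
df = len(observed) - 1  # 6 degrees of freedom

chi_square_stat = np.sum((observed - expected) ** 2 / expected)

# Right-tail probability of observing a statistic at least this large under H0
p_value = chi2.sf(chi_square_stat, df)

print(f"Chi-square statistic: {chi_square_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```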
Solve both questions below without using any built-in statistical functions (plain Python helpers such as `sum` and `len` are fine).

## Problem – Manual Covariance Calculation Between Study Hours and Exam Scores

A school counselor wants to understand how strongly the number of hours a student studies is related to their exam score.

She collected the following data:

| Student | Hours_Studied | Exam_Score |
|---------|---------------|------------|
| A       | 2             | 65         |
| B       | 4             | 70         |
| C       | 6             | 75         |
| D       | 8             | 85         |
| E       | 10            | 95         |

---

### Objective

Manually compute the **covariance** between `Hours_Studied` and `Exam_Score` **without using built-in functions** like `.cov()` or NumPy methods.

---

### Python Code for Manual Covariance Calculation

```python
# Dataset
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 85, 95]

n = len(hours)
mean_hours = sum(hours) / n
mean_scores = sum(scores) / n

# Sum of the products of the deviations from each mean
cov_sum = 0
for i in range(n):
    cov_sum += (hours[i] - mean_hours) * (scores[i] - mean_scores)

# Population covariance (divides by n)
covariance = cov_sum / n
print(f"Covariance between Hours_Studied and Exam_Score: {covariance}")

# Output:
# Covariance between Hours_Studied and Exam_Score: 30.0
```
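The calculation above divides by `n`, which gives the population covariance. Library defaults such as pandas `.cov()` divide by `n - 1` (the sample covariance), so they report a different number for the same data. A small sketch of both variants, still in plain Python; the names `population_cov` and `sample_cov` are illustrative:

```python
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 85, 95]

n = len(hours)
mean_hours = sum(hours) / n
mean_scores = sum(scores) / n

# Sum of the products of the deviations from each mean
cov_sum = 0
for h, s in zip(hours, scores):
    cov_sum += (h - mean_hours) * (s - mean_scores)

population_cov = cov_sum / n        # divides by n
sample_cov = cov_sum / (n - 1)      # divides by n - 1 (Bessel's correction)

print(f"Population covariance: {population_cov}")  # 30.0
print(f"Sample covariance:     {sample_cov}")      # 37.5
```

Which divisor is appropriate depends on whether the five students are treated as the whole population or as a sample from a larger one.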

## Problem – Manual Correlation Calculation Between Exercise Hours and Stress Level

A health researcher is analyzing the relationship between how many hours a person exercises per week and their reported stress level (on a scale of 1–100, where higher is more stress).

She collects data from 5 participants:

| Person | Exercise_Hours | Stress_Level |
|--------|----------------|--------------|
| A      | 1              | 85           |
| B      | 3              | 75           |
| C      | 5              | 60           |
| D      | 7              | 55           |
| E      | 9              | 40           |

---

### Objective

Manually compute the **Pearson correlation coefficient** between `Exercise_Hours` and `Stress_Level` without using built-in correlation functions.

---



### Python Code for Manual Correlation Calculation

```python
# Data
exercise = [1, 3, 5, 7, 9]
stress = [85, 75, 60, 55, 40]

n = len(exercise)
mean_exercise = sum(exercise) / n
mean_stress = sum(stress) / n

# Accumulate the cross-product and the squared deviations
numerator = 0
sum_sq_exercise = 0
sum_sq_stress = 0

for i in range(n):
    x_diff = exercise[i] - mean_exercise
    y_diff = stress[i] - mean_stress
    numerator += x_diff * y_diff
    sum_sq_exercise += x_diff ** 2
    sum_sq_stress += y_diff ** 2

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
denominator = (sum_sq_exercise ** 0.5) * (sum_sq_stress ** 0.5)
correlation = numerator / denominator

print(f"Pearson correlation coefficient: {correlation}")

# Output:
# Pearson correlation coefficient: -0.9918365981341756
```



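The two manual exercises are connected: the Pearson coefficient is the covariance divided by the product of the two standard deviations. The sketch below recomputes r that way in plain Python as a cross-check. It uses the population (divide by `n`) versions throughout; the divisor cancels, so the sample versions would give the same r. The helper names are illustrative.

```python
exercise = [1, 3, 5, 7, 9]
stress = [85, 75, 60, 55, 40]

n = len(exercise)
mean_x = sum(exercise) / n
mean_y = sum(stress) / n

# Population covariance and variances, all dividing by n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(exercise, stress)) / n
var_x = sum((x - mean_x) ** 2 for x in exercise) / n
var_y = sum((y - mean_y) ** 2 for y in stress) / n

# r = cov(x, y) / (sd_x * sd_y)
r = cov_xy / ((var_x ** 0.5) * (var_y ** 0.5))

print(f"Covariance: {cov_xy}")
print(f"Pearson correlation coefficient: {r:.4f}")
```

A value this close to -1 indicates a strong negative linear relationship in this small sample: more exercise hours go with lower reported stress.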
