##  Problem – Defective Items in a Factory

A factory has recorded the number of defective items produced per day over **1000 production days**. The number of defective items per day is randomly generated between **0 and 20** to simulate real-world variability in quality.

Using this dataset, calculate the probability that **exactly 5 defective items** will be produced on a new day. Use Python to:

- Generate the data
- Calculate the mean and standard deviation
- Compute the probability using the normal distribution (with continuity correction)

---

###  Step-by-step Solution in Python

```python
from numpy.random import randint as ri
import pandas as pd

#Generate random data for 1000 days (defective items between 0 and 20)
defects = ri(0, 21, 1000)
defects = pd.Series(defects)
```

##  Problem – Testing the Claim About Delivery Time

A food delivery company claims that its average delivery time is **30 minutes**. Based on historical data, the **population standard deviation** is known to be **4 minutes**.

To evaluate this claim, a consumer rights group decides to test the null hypothesis that the average delivery time is **at most 30 minutes**. They observe a sample of **40 deliveries**, and the average delivery time for the sample comes out to be **31.2 minutes**.

### Objective:

Test the null hypothesis using the z-test.  
- **Null Hypothesis (H₀): μ ≤ 30** (Average delivery time is 30 minutes or less)  
- **Alternative Hypothesis (H₁): μ > 30** (Average delivery time is more than 30 minutes)

---

### Step-by-step Solution in Python

```python
import numpy as np


# Known values
population_mean = 30        # Claimed average delivery time
sample_mean = 31.2          # Observed sample mean
std_dev = 4                 # Known population standard deviation
n = 40                      # Sample size
```

## Problem – Fitness Program Impact Analysis

A health and wellness company is evaluating the impact of its **6-week fitness training program**. They collect performance data (in terms of fitness scores out of 100) from participants **before and after** the program.

You are provided with a dataset of **150 participants**, with the following information:
- **Initial Score** (before the program)
- **Final Score** (after the program)
- **Gender** of the participant (0 = Female, 1 = Male)

---

### Your Task:

Using the dataset provided below, perform the following statistical tests:

1. **One-Sample t-Test**  
   Test whether the **average initial fitness score** is at least **65**.

2. **Two-Sample Independent t-Test**  
   Compare the **initial fitness scores of male and female participants** to check if there's a significant difference.

3. **Paired Sample t-Test**  
   Test whether the **final scores are significantly higher than the initial scores**, i.e., whether the fitness program had a measurable impact.

---


### Hypotheses

1️⃣ One-Sample t-Test:

- **Null Hypothesis** H₀: μ ≥ 65 (Average initial score is at least 65)
- **Alternate Hypothesis** H₁: μ < 65 (Average initial score is less than 65)

2️⃣ Two-Sample Independent t-Test:

- **Null Hypothesis** H₀: μ₁ = μ₂ (No difference in average initial scores between males and females)
- **Alternate Hypothesis** H₁: μ₁ ≠ μ₂ (There is a difference in average initial scores)

3️⃣ Paired Sample t-Test (μ_diff is the mean of initial score − final score):

- **Null Hypothesis** H₀: μ_diff = 0 (No change in scores before and after the program)
- **Alternate Hypothesis** H₁: μ_diff < 0 (Final scores are higher than initial scores)

### Data Setup – Generate your dataset using the code below:

```python
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(100)

# Sample size
n = 150

# Gender (0 = Female, 1 = Male)
gender = np.random.choice([0, 1], size=n)

# Initial scores (mean slightly < 65 to create realistic test)
initial_scores = np.random.normal(loc=64, scale=6, size=n)

# Final scores (showing average improvement)
final_scores = initial_scores + np.random.normal(loc=5, scale=3, size=n)

# Create DataFrame
df = pd.DataFrame({
    'Gender': gender,
    'Initial_Score': initial_scores,
    'Final_Score': final_scores
})

df.head()
```

### T-Test Instructions: Statistical Tests with `scipy.stats`

Use the appropriate test based on your data and hypothesis:

#### 1. One-Sample t-Test

```python
from scipy.stats import ttest_1samp
```

#### 2. Two-Sample Independent t-Test

```python
from scipy.stats import ttest_ind  # For independent t-test
from scipy.stats import ttest_rel  # For paired sample t-test (used later)
```

##  Problem – ANOVA Analysis of Customer Satisfaction Across Store Branches

A retail company wants to analyze whether the **average customer satisfaction scores** vary significantly across its three store branches: **Branch A, Branch B, and Branch C**.

You are provided with data containing:
- **Customer_ID**
- **Branch** (Categorical Variable)
- **Satisfaction_Score** (Continuous Variable on a scale from 0 to 500)

---

###  Objective:
Use **One-Way ANOVA** to test the following hypotheses:

- **H₀ (Null Hypothesis)**: The average satisfaction scores across all three branches are **equal**.
- **H₁ (Alternative Hypothesis)**: At least one branch has a **different average** satisfaction score.

---

###  Dataset Generation (Run this code block)

```python
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

# Sample size per branch
n = 70

# Create satisfaction scores for three branches
branch_a = np.random.normal(loc=420, scale=30, size=n)
branch_b = np.random.normal(loc=400, scale=35, size=n)
branch_c = np.random.normal(loc=430, scale=25, size=n)

# Combine into a DataFrame
data = pd.DataFrame({
    'Customer_ID': range(1, n*3 + 1),
    'Branch': ['A'] * n + ['B'] * n + ['C'] * n,
    'Satisfaction_Score': np.concatenate([branch_a, branch_b, branch_c])
})

data.head()
```
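
One way to run the test is with `scipy.stats.f_oneway`. The sketch below rebuilds the dataset with the generation code above and assumes a significance level of α = 0.05, which the problem statement does not specify. One-way ANOVA also assumes roughly equal variances across groups, so treat the result as indicative.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Rebuild the dataset exactly as in the generation code above
np.random.seed(42)
n = 70
branch_a = np.random.normal(loc=420, scale=30, size=n)
branch_b = np.random.normal(loc=400, scale=35, size=n)
branch_c = np.random.normal(loc=430, scale=25, size=n)
data = pd.DataFrame({
    'Customer_ID': range(1, n * 3 + 1),
    'Branch': ['A'] * n + ['B'] * n + ['C'] * n,
    'Satisfaction_Score': np.concatenate([branch_a, branch_b, branch_c])
})

# One array of satisfaction scores per branch
groups = [g['Satisfaction_Score'].to_numpy() for _, g in data.groupby('Branch')]

# One-way ANOVA: H0 says all branch means are equal
f_stat, p_value = f_oneway(*groups)

alpha = 0.05  # assumed significance level (not given in the problem)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.6f}")
if p_value < alpha:
    print("Reject H0: at least one branch mean differs.")
else:
    print("Fail to reject H0: no evidence that the branch means differ.")
```

A significant result only says that some mean differs. Identifying which branches differ would need a follow-up (post hoc) comparison.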

## Problem – Evaluate Forecast Accuracy Using the Chi-Square Goodness of Fit Test

The city’s public transportation authority uses a forecasting model to estimate the number of metro passengers for each day of the week. These forecasts help manage train schedules, staffing, and platform operations.

Recently, actual passenger counts were collected and compared to the forecasted values to evaluate how well the model performs.

---

### Question

You are provided with the forecasted and observed number of passengers (in thousands) for each day of a week:

- **Forecasted Values (Expected):**  
  `[95, 110, 100, 130, 160, 210, 230]`

- **Observed Values (Actual):**  
  `[90, 105, 98, 135, 165, 205, 225]`

Using a **Chi-Square Goodness of Fit Test**, determine whether the forecast model provides an accurate estimate of daily passenger traffic.

---

### Hypotheses

- **Null Hypothesis (H₀):** There is no significant difference between the forecasted and observed values (i.e., the model is accurate).
- **Alternative Hypothesis (H₁):** There is a significant difference between the forecasted and observed values (i.e., the model is inaccurate).

---

### Test Parameters

- **Significance Level (α):** 0.10  
- **Degrees of Freedom (df):** 6  

---

### Instructions

1. **Perform the Chi-Square Goodness of Fit Test** using the given data.
2. **Calculate**:
   - Chi-Square Test Statistic
   - Critical Value at α = 0.10
3. **Compare** the test statistic with the critical value.
4. **State your conclusion**:
   - Do you **reject** or **fail to reject** the null hypothesis?
   - What does this imply about the **accuracy of the forecasting model**?

---

### Python Starter Code

```python
import numpy as np
from scipy.stats import chi2

# Data
expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])
```

Solve both questions below without using built-in statistical functions such as `.cov()` or NumPy methods.

## Problem – Manual Covariance Calculation Between Study Hours and Exam Scores

A school counselor wants to understand how strongly the number of hours a student studies is related to their exam score.

She collected the following data:

| Student | Hours_Studied | Exam_Score |
|---------|---------------|------------|
| A       | 2             | 65         |
| B       | 4             | 70         |
| C       | 6             | 75         |
| D       | 8             | 85         |
| E       | 10            | 95         |

---

###  Objective

Manually compute the **covariance** between `Hours_Studied` and `Exam_Score` **without using built-in functions** like `.cov()` or NumPy methods.

---

### Python Code for Manual Covariance Calculation

```python
# Dataset
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 85, 95]
```

##  Problem – Manual Correlation Calculation Between Exercise Hours and Stress Level

A health researcher is analyzing the relationship between how many hours a person exercises per week and their reported stress level (on a scale of 1–100, where higher is more stress).

She collects data from 5 participants:

| Person | Exercise_Hours | Stress_Level |
|--------|----------------|--------------|
| A      | 1              | 85           |
| B      | 3              | 75           |
| C      | 5              | 60           |
| D      | 7              | 55           |
| E      | 9              | 40           |

---

###  Objective

Manually compute the **Pearson correlation coefficient** between `Exercise_Hours` and `Stress_Level` without using built-in correlation functions.

---



###  Python Code for Manual Correlation Calculation

```python
# Data
exercise = [1, 3, 5, 7, 9]
stress = [85, 75, 60, 55, 40]
```

##  Problem – Defective Items in a Factory

In [1]:
# Generate the data
import numpy as np

np.random.seed(3)  # fixed seed so the run is reproducible
data = np.random.randint(0, 21, size=1000)  # upper bound 21 is exclusive, so values are 0..20
print(data)

[10  3  8  0 19 10 11  9 10  6  0 20 12  7 14 17  2  2  1 19  5  8 14  1
 10  7 11  1 15 16  5 20 17 14 20  0  0  9 18 20  5  7  5 14  1 17  1 10
 11  4  3 16 16  0 16 18 18 11  0 13  5 16  1 20 17 18  2  4  8 12 16 10
 16  4 17 17  8  7  0 16  9  1 15 11 20 10 16 16 12  4 12 12 19 16  2  7
 18  1  3 13 18 20  1 18  2  3 20 17 11  7  1 11  0 16 11  8 20  8 14 12
 19  4  9 18  5 20  9 10 17  9  0 13 12  7  4 10  8 20 12 17 20  0  7  0
  1 19 15 11  9 17 19  7 13  4  9 13 13  0 15  4 10  6 12 15 14 10 11  3
 17  8  2 11  4  0 13  7  1  2  2 14 10  7 12 11  7 17  7 12 12  9  7  2
 15 13  4  8 20  9 19  8  5 12  3 20  2 11 18 18  0  7  0 14 19  3  5 14
 20 13 12 18  2  6  6 19 13  7 13  7  0 12  8  4 10  4 12  0  3 13  4  2
  5 17  7  9  6 13  0  1  7  0  3 19 14  5 12 11  1 15  3  6 14 15 14 11
  5 15 16  2  1  6 18 18 12 17  4  7 10  3 17 10  2 19  1 20 14 14 18 12
  4 17  3  2 12 17 10  4  8 19  1 18  6  5 17 11  3 10 19 16  7  2 12  6
 16  1  0  4 14  5  6 10  4 13  0 20 18 10 12 12  8

In [2]:
# Calculate the mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)  # population standard deviation (NumPy's default, ddof=0)
print("mean: ", mean)
print("std_dev: ", std_dev)

mean:  10.104
std_dev:  6.143385385925256


In [4]:
# Compute the probability using the normal distribution (with continuity correction):
# P(X = 5) is approximated by P(4.5 < X < 5.5) under N(mean, std_dev)
from scipy.stats import norm
probability = norm.cdf(5.5, loc=mean, scale=std_dev) - norm.cdf(4.5, loc=mean, scale=std_dev)
print("Probability using the normal distribution: ", probability)

Probability using the normal distribution:  0.04596931002918614
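
Because the data come from `randint(0, 21)`, they are discrete uniform rather than bell-shaped, so the normal approximation is only rough here. As a sanity check, the sketch below compares it with the observed share of days with exactly 5 defects and with the theoretical uniform probability 1/21 ≈ 0.0476. It regenerates the data with the same seed as the cell above.

```python
import numpy as np
from scipy.stats import norm

# Same data as in the cell above
np.random.seed(3)
data = np.random.randint(0, 21, size=1000)

mean = np.mean(data)
std_dev = np.std(data)

# Normal approximation with continuity correction
normal_approx = norm.cdf(5.5, loc=mean, scale=std_dev) - norm.cdf(4.5, loc=mean, scale=std_dev)

# Share of the 1000 simulated days with exactly 5 defects
empirical = np.mean(data == 5)

# Theoretical probability for a discrete uniform distribution on 0..20
exact_uniform = 1 / 21

print(f"Normal approximation: {normal_approx:.4f}")
print(f"Empirical frequency:  {empirical:.4f}")
print(f"Exact uniform value:  {exact_uniform:.4f}")
```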


## Problem – Testing the Claim About Delivery Time

In [5]:
from scipy.stats import norm
# Known values
population_mean = 30        # Claimed average delivery time
sample_mean = 31.2          # Observed sample mean
std_dev = 4                 # Known population standard deviation
n = 40                      # Sample size

# Calculate the standard error of the mean (SEM)
SEM = std_dev / (n ** 0.5)
print(SEM)

0.6324555320336759


In [6]:
# calculate z-score
z_score = (sample_mean - population_mean) / SEM
print(z_score)


1.8973665961010264


In [7]:
# Calculate the one-sided (right-tailed) p-value
p_value = 1 - norm.cdf(z_score)
print(p_value)

# p-value = 0.0289 < 0.05, so we reject the null hypothesis at the 5% significance level:
# the data suggest the average delivery time is more than 30 minutes.


0.028889785561798664
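
The same decision can be reached with the critical-value approach. The sketch below assumes α = 0.05, the level used in the comment above. The problem statement itself does not fix a significance level.

```python
from scipy.stats import norm

# Known values from the problem statement
population_mean = 30
sample_mean = 31.2
std_dev = 4
n = 40
alpha = 0.05  # assumed significance level

# Test statistic for a one-sample z-test
z_score = (sample_mean - population_mean) / (std_dev / n ** 0.5)

# Right-tailed critical value: reject H0 when z_score exceeds it
z_critical = norm.ppf(1 - alpha)

reject_null = z_score > z_critical
print(f"z-score: {z_score:.4f}, critical value: {z_critical:.4f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```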


## Problem – Fitness Program Impact Analysis


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("fitness_program_data.csv") 
print(df.head())
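
A minimal sketch of the three tests follows. It assumes the dataset is regenerated with the Data Setup code (seed 100) rather than read from `fitness_program_data.csv`, and it assumes α = 0.05, which the problem does not specify. The `alternative` argument requires SciPy 1.6 or newer.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

# Rebuild the dataset as in the Data Setup code
np.random.seed(100)
n = 150
gender = np.random.choice([0, 1], size=n)
initial_scores = np.random.normal(loc=64, scale=6, size=n)
final_scores = initial_scores + np.random.normal(loc=5, scale=3, size=n)
df = pd.DataFrame({
    'Gender': gender,
    'Initial_Score': initial_scores,
    'Final_Score': final_scores
})

alpha = 0.05  # assumed significance level

# 1. One-sample t-test. H0: mu >= 65, H1: mu < 65 (left-tailed)
t_one, p_one = ttest_1samp(df['Initial_Score'], popmean=65, alternative='less')

# 2. Independent two-sample t-test. H0: mu_male = mu_female (two-sided)
male = df.loc[df['Gender'] == 1, 'Initial_Score']
female = df.loc[df['Gender'] == 0, 'Initial_Score']
t_ind, p_ind = ttest_ind(male, female)  # default assumes equal variances

# 3. Paired t-test on (initial - final). H1: mean difference < 0,
#    i.e. final scores are higher than initial scores
t_rel, p_rel = ttest_rel(df['Initial_Score'], df['Final_Score'], alternative='less')

results = [
    ("One-sample", t_one, p_one),
    ("Independent", t_ind, p_ind),
    ("Paired", t_rel, p_rel),
]
for name, t_stat, p_val in results:
    decision = "Reject H0" if p_val < alpha else "Fail to reject H0"
    print(f"{name} t-test: t = {t_stat:.4f}, p = {p_val:.4g} -> {decision}")
```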

## Problem – Evaluate Forecast Accuracy Using the Chi-Square Goodness of Fit Test

In [None]:
import numpy as np
from scipy.stats import chi2

# Given data
expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])

# Calculate the Chi-Square test statistic
chi_square_statistic = np.sum((observed - expected)**2 / expected)

# Degrees of freedom and significance level
df = len(expected) - 1  # number of categories - 1 = 7 - 1 = 6
alpha = 0.10

# Find the critical value from the Chi-Square distribution
critical_value = chi2.ppf(1 - alpha, df)

# Results
print(f"Chi-Square Test Statistic: {chi_square_statistic:.4f}")
print(f"Critical Value (alpha={alpha}, df={df}): {critical_value:.4f}")

# Decision rule: reject H0 when the statistic exceeds the critical value
if chi_square_statistic > critical_value:
    print("Conclusion: Reject the null hypothesis.")
    print("There is a significant difference between forecasted and observed values.")
    print("The forecasting model may not be accurate.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("No significant difference between forecasted and observed values.")
    print("The forecasting model provides an accurate estimate of passenger traffic.")
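
The comparison against the critical value can also be expressed as a p-value. The sketch below reuses the same data and α = 0.10 and calls `chi2.sf` (the survival function, equal to 1 − CDF). The name `dof` is chosen here for illustration.

```python
import numpy as np
from scipy.stats import chi2

expected = np.array([95, 110, 100, 130, 160, 210, 230])
observed = np.array([90, 105, 98, 135, 165, 205, 225])

# Same statistic as above: sum of (O - E)^2 / E
chi_square_statistic = np.sum((observed - expected) ** 2 / expected)

dof = len(expected) - 1  # 7 categories - 1 = 6
alpha = 0.10

# Probability of a statistic at least this large if H0 is true
p_value = chi2.sf(chi_square_statistic, dof)

print(f"Chi-Square Test Statistic: {chi_square_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With a statistic of about 1.107 on 6 degrees of freedom, the p-value is roughly 0.98, far above α, so both approaches fail to reject H₀.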


## Problem – Manual Covariance Calculation Between Study Hours and Exam Scores

In [None]:
# given dataset
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 85, 95]

#calculate mean for both
mean_hours = sum(hours) / len(hours)
mean_scores = sum(scores) / len(scores)

#calculate sum of products of deviations
cov_sum = 0
for i in range(len(hours)):
    cov_sum += (hours[i] - mean_hours) * (scores[i] - mean_scores)

#calculate covariance
covariance = cov_sum / (len(hours) - 1)

print(f"Mean of Hours Studied: {mean_hours}")
print(f"Mean of Exam Scores: {mean_scores}")
print(f"Covariance between Hours Studied and Exam Scores: {covariance:.4f}")

# The covariance is positive, so the two variables move together:
# as study hours increase, exam scores tend to increase as well.
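
For reference, the same calculation written out by hand, using the sample covariance formula with means $\bar{x} = 6$ (hours) and $\bar{y} = 78$ (scores):

$$
\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
= \frac{(-4)(-13) + (-2)(-8) + (0)(-3) + (2)(7) + (4)(17)}{5 - 1}
= \frac{150}{4} = 37.5
$$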


## Problem – Manual Correlation Calculation Between Exercise Hours and Stress Level

In [12]:
# given data
exercise = [1, 3, 5, 7, 9]
stress = [85, 75, 60, 55, 40]

#calculate mean for both
mean_exercise = sum(exercise) / len(exercise)
mean_stress = sum(stress) / len(stress)

#calculate covariance numerator and sums of squared deviations
cov_num = 0
sum_sq_dev_exercise = 0
sum_sq_dev_stress = 0

for i in range(len(exercise)):
    dev_ex = exercise[i] - mean_exercise
    dev_st = stress[i] - mean_stress
    
    cov_num += dev_ex * dev_st
    sum_sq_dev_exercise += dev_ex ** 2
    sum_sq_dev_stress += dev_st ** 2

#calculate covariance (sample covariance)
covariance = cov_num / (len(exercise) - 1)

#calculate standard deviations (sample std dev)
std_exercise = (sum_sq_dev_exercise / (len(exercise) - 1)) ** 0.5
std_stress = (sum_sq_dev_stress / (len(exercise) - 1)) ** 0.5

#calculate correlation coefficient
correlation = covariance / (std_exercise * std_stress)

# results
print(f"Mean Exercise Hours: {mean_exercise:.2f}")
print(f"Mean Stress Level: {mean_stress:.2f}")
print(f"Covariance: {covariance:.4f}")
print(f"Standard Deviation (Exercise Hours): {std_exercise:.4f}")
print(f"Standard Deviation (Stress Level): {std_stress:.4f}")
print(f"Correlation Coefficient: {correlation:.4f}")


Mean Exercise Hours: 5.00
Mean Stress Level: 63.00
Covariance: -55.0000
Standard Deviation (Exercise Hours): 3.1623
Standard Deviation (Stress Level): 17.5357
Correlation Coefficient: -0.9918


In [None]:
# The correlation coefficient is -0.9918, which indicates a strong negative correlation between the two columns.
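
The result can be checked by hand. The $(n - 1)$ factors in the covariance and the two standard deviations cancel, so with $\bar{x} = 5$ and $\bar{y} = 63$:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
= \frac{-220}{\sqrt{40 \times 1230}}
= \frac{-220}{\sqrt{49200}} \approx -0.9918
$$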
