## 6. Comparing Groups and Hypothesis Testing

### Import libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

### Load dataset from a local folder into a DataFrame

In [2]:
# df = pd.read_csv('data/bottling_maintenance_events.csv')
# df.head()

### Load dataset from a URL into a DataFrame

In [3]:
# pandas is already imported in cell [1]; use the raw GitHub URL instead of a local path
url = "https://raw.githubusercontent.com/Dr-AlaaKhamis/ISE518/main/2_Statistics/data/bottling_maintenance_events.csv"

df = pd.read_csv(url)
df.head()

Unnamed: 0,event_date,line,asset_id,failure_mode,time_to_failure_days,repair_time_hours,downtime_hours
0,2024-01-02,C,C-M2,Electrical,129.9,3.48,4.94
1,2024-01-02,A,A-M10,Mechanical,19.2,3.93,6.56
2,2024-01-02,C,C-M2,Mechanical,22.4,2.22,3.77
3,2024-01-03,B,B-M5,Mechanical,95.9,6.46,10.79
4,2024-01-09,C,C-M1,Mechanical,64.2,4.83,5.71


### 📏 Welch t-test for Comparing Two Groups

The **Welch t-test** is a statistical test used to determine whether the **means of two groups** are significantly different.  
Unlike the classical Student’s t-test, Welch’s version does **not assume equal variances** between groups, which makes it safer for real-world industrial data.

---

#### Example in our dataset:
👉 *Do production Line A and Line B have different average time-to-failure values?*

---

#### How it works:
- Each group has:
  - Mean ($\bar{x}_1, \bar{x}_2$)  
  - Standard deviation ($s_1, s_2$)  
  - Sample size ($n_1, n_2$)  

- The Welch t-statistic is calculated as:

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

- The test also adjusts the **degrees of freedom** using the Welch–Satterthwaite equation:

$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$
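
The two formulas above can be checked by hand. The following is a minimal sketch using only the Python standard library; the function name `welch_t` and the sample values are made up for illustration and are not part of the dataset.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Return (t, df) for Welch's unequal-variance t-test (hypothetical helper)."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance uses the n-1 denominator, i.e. the sample variance s^2
    v1 = statistics.variance(sample1) / n1  # s1^2 / n1
    v2 = statistics.variance(sample2) / n2  # s2^2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch–Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up time-to-failure values in days, for illustration only
line_x = [19.2, 95.9, 40.1, 62.3, 28.7, 110.4]
line_y = [129.9, 22.4, 64.2, 88.0, 75.5]

t_stat, dof = welch_t(line_x, line_y)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The t-statistic computed this way should agree with the one returned by `stats.ttest_ind(..., equal_var=False)` on the same two samples; SciPy additionally converts it to a p-value using the t-distribution with the adjusted degrees of freedom.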

---

#### Interpretation:

<img src="Hypothesis_testing.png" alt="Hypothesis testing" width="500">

- **Null hypothesis (H₀):** the two groups have the same mean.  
- **Alternative hypothesis (H₁):** the two groups have different means.  
- **p-value < 0.05** → reject H₀ → there is evidence the mean time-to-failure differs between the two lines.  
- **p-value ≥ 0.05** → fail to reject H₀ → no statistically significant difference in mean time-to-failure (which is not proof that the means are equal).

---

#### Why it matters in reliability:
If the Welch t-test shows a difference, it suggests that one line runs longer between failures, on average, than the other.  
This can guide maintenance teams to investigate **root causes** (e.g., different operating conditions, equipment age, or operator practices).


#### Welch t‑test: Line A vs Line B time to failure

In [4]:
A = df.loc[df['line']=='A', 'time_to_failure_days']
B = df.loc[df['line']=='B', 'time_to_failure_days']

stat, pval = stats.ttest_ind(A, B, equal_var=False)
print(f"Welch t-test A vs B: statistic={stat:.3f}, p-value={pval:.4f}")
if pval < 0.05:
    print("Result: Significant difference at 5% level")
else:
    print("Result: No significant difference at 5% level")

Welch t-test A vs B: statistic=0.785, p-value=0.4333
Result: No significant difference at 5% level


**Interpretation tip:** A small p‑value does not tell you the size of the effect. Always pair hypothesis tests with descriptive summaries and plots.
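
One way to report the size of an effect alongside the p-value is the raw difference in means together with a standardized difference such as Cohen's d. This measure is not used elsewhere in this notebook; it is shown here only as one common choice. The sketch below uses the standard library, and the helper name `cohens_d` and the sample values are assumptions made for illustration.

```python
import math
import statistics


def cohens_d(sample1, sample2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)  # sample variance (n-1 denominator)
    s2_sq = statistics.variance(sample2)
    pooled_sd = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    return (statistics.mean(sample1) - statistics.mean(sample2)) / pooled_sd


# Made-up time-to-failure values in days, for illustration only
group_1 = [19.2, 95.9, 40.1, 62.3, 28.7, 110.4]
group_2 = [129.9, 22.4, 64.2, 88.0, 75.5]

mean_diff = statistics.mean(group_1) - statistics.mean(group_2)
print(f"Difference in means: {mean_diff:.1f} days")
print(f"Cohen's d: {cohens_d(group_1, group_2):.2f}")
```

With the notebook's data, something like `cohens_d(list(A), list(B))` could be reported next to the Welch result, so that readers see how large the difference is in days and in standard-deviation units, not only whether it is statistically significant.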

### 🔎 Chi-Square Test

The **Chi-square test of independence** is used to check whether two categorical variables are related.  
In our maintenance dataset, we can ask:

👉 *Are failure modes distributed the same way across different production lines, or does each line have its own characteristic failure profile?*

---

#### How it works:
- We build a **contingency table** (cross-tab) of counts.  
  For example:

| Line | Mechanical | Electrical | Misalignment |
|------|------------|------------|--------------|
| A    |    25      |    10      |     5        |
| B    |    18      |    14      |     8        |
| C    |    12      |     6      |    12        |

- The Chi-square test compares the **observed counts** with the **expected counts** (what we would expect if line and failure mode were independent).

- The test statistic is:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

where:
- $O$ = observed frequency  
- $E$ = expected frequency  
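
To make the formula concrete, the sketch below works through the example table above by hand, using plain Python lists. Each expected count is the row total times the column total divided by the grand total, which is what independence of line and failure mode would imply. Variable names are illustrative.

```python
# Example counts from the table above: rows = lines A, B, C;
# columns = Mechanical, Electrical, Misalignment
observed = [
    [25, 10, 5],
    [18, 14, 8],
    [12, 6, 12],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (O - E)^2 / E over every cell of the table
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
```

`scipy.stats.chi2_contingency()` performs the same calculation and also returns the expected-count table, so the hand computation is mainly useful for seeing which cells contribute most to the statistic.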

---

#### Interpretation:
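- **Null hypothesis (H₀):** line and failure mode are independent.  
- **p-value < 0.05**: we reject the null hypothesis → there is evidence that failure modes **depend on the line**.  
- **p-value ≥ 0.05**: we fail to reject the null hypothesis → there is **no significant evidence** that failure modes depend on the line (this does not prove independence).  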

---

#### Why it matters in reliability:
If the Chi-square test shows dependence, it suggests that some lines are more prone to certain failure modes.  
Maintenance managers can then prioritize inspections, spare parts, or training for those specific issues.

📊 The Python function `pd.crosstab()` helps us build the contingency table, and `scipy.stats.chi2_contingency()` performs the test.

#### Are failure modes distributed differently by line?

In [5]:
cont = pd.crosstab(df['line'], df['failure_mode'])
chi2, p, dof, exp = stats.chi2_contingency(cont)
print("Contingency table:\n", cont)
print(f"Chi-square={chi2:.2f}, dof={dof}, p-value={p:.4f}")

Contingency table:
 failure_mode  Electrical  Mechanical  Misalignment
line                                              
A                     24          52            19
B                     33          37            16
C                     17          30            12
Chi-square=3.96, dof=4, p-value=0.4113


### Repeat the Welch t-test for Lines B vs C

In [6]:
B = df.loc[df['line'] == 'B', 'time_to_failure_days']
C = df.loc[df['line'] == 'C', 'time_to_failure_days']

stat, pval = stats.ttest_ind(B, C, equal_var=False)
print(f"Welch t-test B vs C: statistic={stat:.3f}, p-value={pval:.4f}")
if pval < 0.05:
    print("Result: Significant difference at 5% level")
else:
    print("Result: No significant difference at 5% level")

Welch t-test B vs C: statistic=2.112, p-value=0.0364
Result: Significant difference at 5% level


### 📊 Mann–Whitney U Test (Non-Parametric Alternative)

The **Mann–Whitney U test** is used to compare whether the **distributions** of two independent groups are different.  
Unlike the t-test, it does **not assume normality** of the data, which makes it useful for skewed or non-normal reliability data.

---

#### Example in our dataset:
👉 *Do Lines B and C have different distributions of time-to-failure values?*

---

#### How it works:
1. Combine the data from both groups.  
2. Rank all values from smallest to largest.  
3. Compute the **sum of ranks** for each group.  
4. The U statistic counts the pairs in which a value from group 1 is smaller than a value from group 2 (tied pairs count as one half).  

The formula for the U statistic is:

$U = n_1 n_2 + \frac{n_1 (n_1+1)}{2} - R_1$

where:  
- $n_1, n_2$ = sample sizes of the two groups  
- $R_1$ = sum of ranks for group 1  
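
The ranking steps and the formula can be sketched in plain Python as follows. The helper names (`average_ranks`, `u_statistic`, `count_pairs_less`) and the sample values are hypothetical and used only for illustration. The sketch also counts pairs directly to confirm that the rank-sum formula gives the same number.

```python
def average_ranks(values):
    """Map each distinct value to its 1-based rank, averaging over ties."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Positions i..j-1 (0-based) hold ranks i+1..j; their mean is (i+1+j)/2
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks


def u_statistic(group1, group2):
    """U from the rank-sum formula U = n1*n2 + n1*(n1+1)/2 - R1."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    r1 = sum(ranks[x] for x in group1)  # sum of ranks for group 1
    return n1 * n2 + n1 * (n1 + 1) / 2 - r1


def count_pairs_less(group1, group2):
    """Direct count of pairs with the group-1 value smaller (ties count 0.5)."""
    return sum((x < y) + 0.5 * (x == y) for x in group1 for y in group2)


# Made-up time-to-failure values in days, for illustration only
group_1 = [19.2, 95.9, 40.1, 62.3, 28.7, 110.4]
group_2 = [129.9, 22.4, 64.2, 88.0, 75.5]

u_from_ranks = u_statistic(group_1, group_2)
u_from_pairs = count_pairs_less(group_1, group_2)
print(f"U (rank formula) = {u_from_ranks}, U (pair count) = {u_from_pairs}")
```

One caveat when comparing with SciPy: according to its documentation, recent versions of `scipy.stats.mannwhitneyu` report the statistic for the first sample, $R_1 - \frac{n_1(n_1+1)}{2}$, which equals $n_1 n_2$ minus the $U$ defined above. The two-sided p-value is the same under either convention.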

---

#### Interpretation:
- **Null hypothesis (H₀):** the two groups come from the same distribution.  
- **Alternative hypothesis (H₁):** the two groups come from different distributions.  
- **p-value < 0.05** → reject H₀ → evidence that the distributions differ.  
- **p-value ≥ 0.05** → fail to reject H₀ → no significant difference in distributions.  

---

#### Why it matters in reliability:
- Many reliability metrics (like time-to-failure) are **skewed** and not normally distributed.  
- Mann–Whitney U provides a **robust alternative** to the t-test for these cases.  
- If the test shows a difference, maintenance engineers can investigate whether operating conditions or asset types create distinct failure behaviors.


### Non-parametric alternative: Mann–Whitney U test

In [7]:
u_stat, pval_u = stats.mannwhitneyu(B, C, alternative='two-sided')
print(f"Mann–Whitney U test B vs C: U={u_stat:.3f}, p-value={pval_u:.4f}")
if pval_u < 0.05:
    print("Result: Significant difference at 5% level (non-parametric)")
else:
    print("Result: No significant difference at 5% level (non-parametric)")

Mann–Whitney U test B vs C: U=3039.000, p-value=0.0435
Result: Significant difference at 5% level (non-parametric)
