# Statistical Tests with SciPy

In this notebook, we'll use SciPy to run two families of statistical tests that come up constantly in biology:
- **Correlation tests** (Pearson and Spearman)
- **T-tests** (comparing group means)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Calculate and interpret Pearson correlation (linear relationships)
2. Calculate and interpret Spearman correlation (monotonic relationships)
3. Understand when to use each correlation method
4. Perform t-tests to compare two groups
5. Interpret p-values correctly

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr, ttest_ind, ttest_rel

# Set random seed for reproducibility
np.random.seed(42)

print('Libraries loaded!')

---

## Part 1: Pearson Correlation

### What is Pearson Correlation?
- Measures **linear** relationship between two variables
- Range: -1 to +1
  - **r = +1**: Perfect positive correlation
  - **r = 0**: No linear correlation (a non-linear relationship may still exist)
  - **r = -1**: Perfect negative correlation
- Returns: (correlation coefficient, p-value)
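
To make the definition above concrete, here is a minimal sketch that computes r by hand and compares it with `pearsonr`. The arrays `x` and `y` are made-up illustrative values, not data from this notebook.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative values only (assumed for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Deviations of each value from its own mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r: summed product of deviations, scaled by the spread of x and y
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

# SciPy returns the same coefficient plus a p-value
r_scipy, p_scipy = pearsonr(x, y)

print(f'Manual r = {r_manual:.4f}')
print(f'SciPy  r = {r_scipy:.4f}')
```

The two values should agree up to floating-point rounding, which shows that `pearsonr` is not doing anything beyond this formula to get r.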

### Example 1: Strong Positive Correlation

In [None]:
# Gene expression: as gene A goes up, gene B goes up (same pathway)
gene_A = np.array([2.1, 3.5, 5.2, 7.1, 8.9, 10.2, 12.5])
gene_B = np.array([1.8, 3.1, 5.5, 6.9, 9.2, 10.5, 12.1])

df1 = pd.DataFrame({'Gene_A': gene_A, 'Gene_B': gene_B})
print('Strong positive correlation example:')
print(df1)

In [None]:
# Calculate Pearson correlation
r, p_value = pearsonr(gene_A, gene_B)

print(f'Pearson r = {r:.3f}')
print(f'P-value = {p_value:.3g}')  # 3 significant digits, so a tiny p-value is not shown as 0.0000

if p_value < 0.05:
    print('✓ Significant correlation (p < 0.05)')
else:
    print('✗ Not significant (p >= 0.05)')

In [None]:
# Visualize the correlation
fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(gene_A, gene_B, s=100, alpha=0.6)
ax.set_xlabel('Gene A Expression')
ax.set_ylabel('Gene B Expression')
ax.set_title(f'Strong Positive Correlation\nr = {r:.3f}, p = {p_value:.3g}')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### Example 2: Negative Correlation

In [None]:
# As tumor suppressor gene goes up, oncogene goes down
tumor_suppressor = np.array([8.5, 7.2, 6.1, 4.8, 3.5, 2.3, 1.1])
oncogene = np.array([2.0, 3.5, 5.1, 6.8, 8.2, 9.5, 11.0])

df2 = pd.DataFrame({'Tumor_Suppressor': tumor_suppressor, 'Oncogene': oncogene})
print('Negative correlation example:')
print(df2)

In [None]:
# Calculate Pearson correlation
r_neg, p_neg = pearsonr(tumor_suppressor, oncogene)

print(f'Pearson r = {r_neg:.3f}')
print(f'P-value = {p_neg:.3g}')  # 3 significant digits, so a tiny p-value is not shown as 0.0000

# Visualize
fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(tumor_suppressor, oncogene, s=100, alpha=0.6, color='red')
ax.set_xlabel('Tumor Suppressor Expression')
ax.set_ylabel('Oncogene Expression')
ax.set_title(f'Negative Correlation\nr = {r_neg:.3f}, p = {p_neg:.3g}')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### Example 3: No Correlation

In [None]:
# Two unrelated genes (random pattern)
gene_X = np.array([3.2, 5.1, 2.8, 7.5, 4.2, 6.8, 3.9])
gene_Y = np.array([4.5, 3.9, 7.2, 2.1, 8.5, 5.3, 6.1])

r_none, p_none = pearsonr(gene_X, gene_Y)

print(f'Pearson r = {r_none:.3f}')
print(f'P-value = {p_none:.4f}')

# Visualize
fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(gene_X, gene_Y, s=100, alpha=0.6, color='gray')
ax.set_xlabel('Gene X Expression')
ax.set_ylabel('Gene Y Expression')
ax.set_title(f'No Correlation\nr = {r_none:.3f}, p = {p_none:.4f}')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 📝 Practice Question 1

**Task:** Create two arrays representing drug dose (1, 2, 3, 4, 5 mg) and cell viability (95, 85, 70, 50, 30 %).

Calculate the Pearson correlation and interpret the result.

**Expected:** Negative correlation (higher dose → lower viability)

In [None]:
# YOUR CODE HERE
# Create drug_dose and cell_viability arrays
# Calculate Pearson correlation


---

## Part 2: Spearman Correlation

### What is Spearman Correlation?
- Measures **monotonic** relationship (not necessarily linear)
- Works on **ranks** instead of actual values
- More robust to outliers
- Range: -1 to +1 (same as Pearson)
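
One way to see what "works on ranks" means: Spearman's coefficient is the Pearson coefficient computed on the ranks of the data. The sketch below checks this on made-up values, with `rankdata` from `scipy.stats` used to assign the ranks.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Illustrative values (assumed): y always rises with x, but along a curve
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

# Spearman directly
rho, _ = spearmanr(x, y)

# Pearson applied to the ranks gives the same number
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

# Pearson on the raw values is lower because the curve is not a straight line
r_raw, _ = pearsonr(x, y)

print(f'Spearman rho       = {rho:.3f}')
print(f'Pearson on ranks   = {r_on_ranks:.3f}')
print(f'Pearson on raw x,y = {r_raw:.3f}')
```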

### When to Use Spearman vs Pearson?
- **Pearson:** Linear relationship, normally distributed, no outliers
- **Spearman:** Non-linear but monotonic, outliers present, ordinal data

### Example: Non-Linear Relationship

In [None]:
# Gene expression with non-linear relationship (saturation effect)
time = np.array([0, 1, 2, 3, 4, 5, 6])
expression = np.array([1.0, 2.5, 4.8, 6.5, 7.2, 7.5, 7.6])  # Plateaus at later time points

df_nonlinear = pd.DataFrame({'Time (hours)': time, 'Expression': expression})
print('Non-linear relationship:')
print(df_nonlinear)

In [None]:
# Compare Pearson vs Spearman
r_pearson, p_pearson = pearsonr(time, expression)
r_spearman, p_spearman = spearmanr(time, expression)

print('Pearson correlation:')
print(f'  r = {r_pearson:.3f}, p = {p_pearson:.4f}')
print('\nSpearman correlation:')
print(f'  r = {r_spearman:.3f}, p = {p_spearman:.4f}')

print('\n→ Spearman is higher because it captures any monotonic (always increasing) relationship, not just a straight line')

In [None]:
# Visualize
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(time, expression, s=100, alpha=0.6, color='purple')
ax.plot(time, expression, 'k--', alpha=0.3)
ax.set_xlabel('Time (hours)')
ax.set_ylabel('Gene Expression')
ax.set_title(f'Non-Linear Relationship\nPearson r={r_pearson:.3f}, Spearman ρ={r_spearman:.3f}')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### Example: Data with Outliers

In [None]:
# Protein levels with one outlier measurement
protein_A = np.array([5.1, 5.3, 5.5, 5.7, 5.9, 6.1, 15.0])  # Last value is an outlier!
protein_B = np.array([3.2, 3.4, 3.6, 3.8, 4.0, 4.2, 4.5])

# Compare Pearson vs Spearman
r_p, p_p = pearsonr(protein_A, protein_B)
r_s, p_s = spearmanr(protein_A, protein_B)

print('Data with outlier:')
print('Pearson  r =', f'{r_p:.3f}', '← Affected by outlier')
print('Spearman ρ =', f'{r_s:.3f}', '← More robust')
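
A quick way to measure how much a single point matters is to recompute both coefficients with that point removed. The sketch below repeats the arrays from the cell above so that it runs on its own; the names `r_with`, `r_without`, `rho_with`, and `rho_without` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Same values as in the cell above; the last pair contains the outlier
protein_A = np.array([5.1, 5.3, 5.5, 5.7, 5.9, 6.1, 15.0])
protein_B = np.array([3.2, 3.4, 3.6, 3.8, 4.0, 4.2, 4.5])

# Correlations with all seven points
r_with, _ = pearsonr(protein_A, protein_B)
rho_with, _ = spearmanr(protein_A, protein_B)

# Correlations after dropping the last (outlier) pair
r_without, _ = pearsonr(protein_A[:-1], protein_B[:-1])
rho_without, _ = spearmanr(protein_A[:-1], protein_B[:-1])

print(f'Pearson  r   with outlier: {r_with:.3f}   without: {r_without:.3f}')
print(f'Spearman rho with outlier: {rho_with:.3f}   without: {rho_without:.3f}')
```

Pearson changes noticeably when the point is removed, while Spearman does not, because the outlier is still the largest value in both arrays and so keeps the same rank.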

In [None]:
# Visualize
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(protein_A[:-1], protein_B[:-1], s=100, alpha=0.6, label='Normal data')
ax.scatter(protein_A[-1], protein_B[-1], s=200, color='red', marker='X', label='Outlier')
ax.set_xlabel('Protein A')
ax.set_ylabel('Protein B')
ax.set_title('Effect of Outliers on Correlation')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 📝 Practice Question 2

**Task:** Create data for enzyme activity at different pH levels (pH 4, 5, 6, 7, 8, 9, 10) with activity (10, 30, 60, 90, 70, 40, 15).

This shows a **peak** at pH 7 (not monotonic!).

Calculate both Pearson and Spearman correlations. Is either one appropriate for this data, and why?

In [None]:
# YOUR CODE HERE
# Create pH and enzyme_activity arrays
# Calculate both correlations and compare


---

## Part 3: T-Tests

### What is a T-Test?
- Compares the **means** of two groups
- Answers: "Are these groups significantly different?"
- Returns: (t-statistic, p-value)

### Two Types:
1. **Independent t-test** (`ttest_ind`): Different subjects in each group
2. **Paired t-test** (`ttest_rel`): Same subjects measured twice
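
The difference between the two types can be sketched in code: a paired t-test is equivalent to a one-sample t-test on the per-subject differences. The arrays `first` and `second` below are made-up values for illustration, and `ttest_1samp` is an extra import not used elsewhere in this notebook.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

# Illustrative measurements on the same five subjects (assumed values)
first = np.array([10.2, 11.5, 9.8, 12.1, 10.9])
second = np.array([11.0, 12.4, 10.1, 13.0, 11.8])

# Paired test: uses the pairing between first[i] and second[i]
t_pair, p_pair = ttest_rel(first, second)

# Same thing, written as a one-sample test of the differences against 0
t_diff, p_diff = ttest_1samp(first - second, popmean=0)

# Independent test: ignores the pairing and treats the groups as unrelated
t_ind, p_ind = ttest_ind(first, second)

print(f'Paired:            t = {t_pair:.3f}, p = {p_pair:.3g}')
print(f'One-sample (diff): t = {t_diff:.3f}, p = {p_diff:.3g}')
print(f'Independent:       t = {t_ind:.3f}, p = {p_ind:.3g}')
```

In this sketch the subjects differ from each other far more than each subject changes, so ignoring the pairing gives a much larger p-value. That is why the choice of test should follow the study design.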

---

## Independent T-Test

### Use Case: Compare Control vs Treatment

In [None]:
# Tumor size (mm) in control vs drug-treated mice
control = np.array([12.5, 14.2, 13.8, 15.1, 13.2, 14.5, 12.9])
treatment = np.array([8.2, 9.5, 7.8, 10.1, 8.9, 9.2, 8.5])

df_ttest = pd.DataFrame({
    'Control': control,
    'Treatment': treatment
})

print('Tumor sizes (mm):')
print(df_ttest)
print('\nMeans:')
print(f'Control:   {control.mean():.2f} mm')
print(f'Treatment: {treatment.mean():.2f} mm')

In [None]:
# Perform independent t-test (two-sided by default)
t_stat, p_value = ttest_ind(control, treatment)

print(f'T-statistic = {t_stat:.3f}')
print(f'P-value = {p_value:.3g}')

if p_value < 0.05:
    print('\n✓ Significant difference (p < 0.05)')
    # A two-sided test gives no direction, so compare the means before claiming a reduction
    if treatment.mean() < control.mean():
        print('→ Treatment reduces tumor size!')
else:
    print('\n✗ No significant difference (p >= 0.05)')
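
By default, `ttest_ind` assumes that both groups have the same variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below repeats the tumor data from the cell above so that it runs on its own; the variable names ending in `_student` and `_welch` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Same tumor sizes (mm) as in the cell above
control = np.array([12.5, 14.2, 13.8, 15.1, 13.2, 14.5, 12.9])
treatment = np.array([8.2, 9.5, 7.8, 10.1, 8.9, 9.2, 8.5])

# Sample variances (ddof=1) give a rough idea of whether the spreads are similar
print(f'Variance, control:   {control.var(ddof=1):.3f}')
print(f'Variance, treatment: {treatment.var(ddof=1):.3f}')

# Student's t-test: pooled variance (SciPy's default)
t_student, p_student = ttest_ind(control, treatment)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(control, treatment, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.3g}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3g}")
```

With equal group sizes, as here, the two t-statistics are identical and only the degrees of freedom (and so the p-value) differ slightly. The choice matters more when group sizes and variances are both unequal.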

In [None]:
# Visualize with box plot
fig, ax = plt.subplots(figsize=(7, 5))

positions = [1, 2]
data_to_plot = [control, treatment]
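bp = ax.boxplot(data_to_plot, positions=positions)
# Label the boxes via the axis; boxplot's `labels` keyword is deprecated in recent Matplotlib releases
ax.set_xticks(positions)
ax.set_xticklabels(['Control', 'Treatment'])

ax.set_ylabel('Tumor Size (mm)')
ax.set_title(f'Control vs Treatment\np = {p_value:.3g}')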
ax.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

### 📝 Practice Question 3

**Task:** Compare gene expression in healthy (n=6) vs diseased (n=6) patients.

Create two arrays:
- Healthy: [5.2, 5.5, 5.1, 5.4, 5.3, 5.6]
- Diseased: [7.8, 8.2, 7.5, 8.0, 7.9, 8.3]

Perform an independent t-test and interpret the result.

In [None]:
# YOUR CODE HERE
# Create healthy and diseased arrays
# Perform t-test


---

## Paired T-Test

### Use Case: Before vs After Treatment (Same Subjects)

In [None]:
# Blood pressure before and after exercise program (same 8 patients)
before = np.array([145, 152, 138, 160, 149, 155, 142, 148])
after = np.array([138, 145, 135, 148, 142, 149, 139, 141])

df_paired = pd.DataFrame({
    'Patient': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8'],
    'Before': before,
    'After': after,
    'Change': after - before
})

print('Blood pressure (mmHg):')
print(df_paired)
print('\nMeans:')
print(f'Before: {before.mean():.1f} mmHg')
print(f'After:  {after.mean():.1f} mmHg')
print(f'Change: {(after - before).mean():.1f} mmHg')

In [None]:
# Perform paired t-test
t_stat_paired, p_paired = ttest_rel(before, after)

print(f'T-statistic = {t_stat_paired:.3f}')
print(f'P-value = {p_paired:.4f}')

if p_paired < 0.05:
    print('\n✓ Significant change (p < 0.05)')
    if after.mean() < before.mean():
        print('→ Exercise program reduces blood pressure!')
else:
    print('\n✗ No significant change (p >= 0.05)')
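
The test above is two-sided: it asks whether blood pressure changed in either direction. If the hypothesis was directional from the start (the program lowers blood pressure), a one-sided test is an option. The sketch below assumes SciPy 1.6 or later, where `ttest_rel` accepts the `alternative` argument, and repeats the data so that it runs on its own.

```python
import numpy as np
from scipy.stats import ttest_rel

# Same blood pressure values (mmHg) as in the cells above
before = np.array([145, 152, 138, 160, 149, 155, 142, 148])
after = np.array([138, 145, 135, 148, 142, 149, 139, 141])

# Two-sided: is the mean of (after - before) different from zero?
t_two, p_two = ttest_rel(after, before)

# One-sided: is the mean of (after - before) less than zero?
# `alternative` requires SciPy >= 1.6
t_one, p_one = ttest_rel(after, before, alternative='less')

print(f'Two-sided: t = {t_two:.3f}, p = {p_two:.3g}')
print(f'One-sided: t = {t_one:.3f}, p = {p_one:.3g}')
```

When the observed change is in the predicted direction, the one-sided p-value is half the two-sided one. The direction has to be chosen before looking at the data, otherwise the smaller p-value is not justified.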

In [None]:
# Visualize paired data
fig, ax = plt.subplots(figsize=(8, 5))

patients = np.arange(len(before))
ax.plot(patients, before, 'o-', label='Before', linewidth=2, markersize=8)
ax.plot(patients, after, 's-', label='After', linewidth=2, markersize=8)

# Connect paired points
for i in range(len(before)):
    ax.plot([i, i], [before[i], after[i]], 'k--', alpha=0.3)

ax.set_xlabel('Patient')
ax.set_ylabel('Blood Pressure (mmHg)')
ax.set_title(f'Before vs After Exercise Program\nPaired t-test: p = {p_paired:.4f}')
ax.set_xticks(patients)
ax.set_xticklabels(['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8'])
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 📝 Practice Question 4

**Task:** You measure cell counts before and after adding a growth factor to the same 5 cell cultures:

- Before: [1200, 1350, 1180, 1420, 1290]
- After: [1850, 2100, 1920, 2250, 2050]

Use a **paired t-test** to determine if the growth factor significantly increases cell count.

In [None]:
# YOUR CODE HERE
# Create before and after arrays
# Perform paired t-test


---

## Part 4: Choosing the Right Test

### Decision Guide:

| Question | Test | Function |
|----------|------|----------|
| Are two variables **related**? (linear) | Pearson correlation | `pearsonr()` |
| Are two variables **related**? (monotonic/outliers) | Spearman correlation | `spearmanr()` |
| Are two **independent groups** different? | Independent t-test | `ttest_ind()` |
| Did the **same subjects** change over time? | Paired t-test | `ttest_rel()` |

### Real Example: Complete Analysis

In [None]:
# Study: Does gene expression predict protein levels?
# Also: Does drug treatment change protein levels?

# Question 1: Correlation between gene expression and protein
gene_expr = np.array([3.2, 4.5, 5.8, 6.2, 7.1, 8.3, 9.0, 10.2])
protein = np.array([15, 22, 28, 32, 38, 45, 48, 55])

r, p = pearsonr(gene_expr, protein)
print('Question 1: Gene expression vs Protein level')
print(f'Pearson r = {r:.3f}, p = {p:.3g}')
print('→ Strong positive correlation!\n')

# Question 2: Does treatment change protein levels?
control_protein = np.array([18, 22, 19, 25, 21, 23])
treated_protein = np.array([35, 42, 38, 45, 40, 44])

t, p_t = ttest_ind(control_protein, treated_protein)
print('Question 2: Control vs Treated protein levels')
print(f'T-test: t = {t:.3f}, p = {p_t:.3g}')
print('→ Treatment significantly increases protein!')

### 📝 Practice Question 5 (Challenge)

**Scenario:** You're studying the effect of a drug on tumor growth.

**Data:**
1. Tumor size measured on Day 0 and Day 7 in the **same 6 mice** (paired)
   - Day 0: [5.2, 5.8, 5.5, 5.9, 5.3, 5.7]
   - Day 7: [8.5, 9.2, 8.8, 9.5, 8.7, 9.0]

2. Day 7 tumor sizes: **Control (n=6)** vs **Drug (n=6)** groups (independent)
   - Control: [8.5, 9.2, 8.8, 9.5, 8.7, 9.0]
   - Drug: [6.2, 5.8, 6.5, 5.9, 6.3, 6.0]

**Tasks:**
1. Test if tumors grew from Day 0 to Day 7 (paired t-test)
2. Test if drug reduces tumor size vs control (independent t-test)
3. Interpret both results

In [None]:
# YOUR CODE HERE
# Perform both t-tests and interpret


---

## Summary

### Key Concepts:

**1. Pearson Correlation:**
```python
from scipy.stats import pearsonr
r, p_value = pearsonr(x, y)
```
- Linear relationships
- Sensitive to outliers
- r close to ±1 = strong correlation

**2. Spearman Correlation:**
```python
from scipy.stats import spearmanr
rho, p_value = spearmanr(x, y)
```
- Monotonic relationships
- More robust to outliers than Pearson
- Works on ranks

**3. Independent T-Test:**
```python
from scipy.stats import ttest_ind
t, p_value = ttest_ind(group1, group2)
```
- Different subjects in each group
- Example: Control vs Treatment

**4. Paired T-Test:**
```python
from scipy.stats import ttest_rel
t, p_value = ttest_rel(before, after)
```
- Same subjects measured twice
- Example: Before vs After

### P-Value Interpretation:
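- **p < 0.05**: Significant at the conventional 5% level (reject the null hypothesis)
- **p ≥ 0.05**: Not significant (fail to reject the null hypothesis; this is not proof that there is no effect)
- The p-value is the probability of seeing data at least this extreme if the null hypothesis were true. It is not the probability that the null hypothesis is true.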
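
A small simulation can make the 0.05 threshold more tangible. The sketch below runs many made-up experiments in which both groups are drawn from the same distribution, so the null hypothesis is true by construction. The group size, mean, spread, and number of experiments are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the SAME distribution: there is no real effect
    group1 = rng.normal(loc=10.0, scale=2.0, size=8)
    group2 = rng.normal(loc=10.0, scale=2.0, size=8)
    _, p_sim = ttest_ind(group1, group2)
    if p_sim < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f'Fraction of "significant" results with no real effect: {false_positive_rate:.3f}')
```

The fraction should land close to 0.05. In other words, with a 0.05 threshold about 1 in 20 experiments with no real effect will still look significant, which is one reason a single p-value should not be read as proof.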

### Biological Applications:
- Gene co-expression analysis (correlation)
- Drug efficacy testing (t-tests)
- Biomarker discovery (correlation + t-tests)
- Time-course experiments (paired t-test)

Remember: **Always visualize your data before testing!**