<a href="https://colab.research.google.com/github/Ash100/Statistical_Analysis/blob/main/Week3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ============================================================
### 🧬 Practical: Statistical Tests on Gene Expression & Mutation Data
### ============================================================

### Author: **Dr. Ashfaq Ahmad**
### Course: Biostatistics / Computational Biology
### Topic: Hypothesis Testing — t-test and Chi-square test
### Tool: Google Colab
### ============================================================






In [1]:
# Let's import essential packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set style
sns.set(style="whitegrid", font_scale=1.2)


## 🧠 1. Introduction

In this practical, we will learn how to apply **t-tests** and **Chi-square tests** to biological data.

Both are **hypothesis testing methods** used to compare groups:

- **t-test** → used for comparing **means** of two groups (e.g., gene expression in healthy vs diseased tissue)
- **Chi-square test** → used for comparing **frequencies or proportions** (e.g., mutation presence across genders or tumor types)

We will:
1. Simulate gene expression data for two groups.
2. Perform an independent (Welch's) t-test and a paired t-test.
3. Interpret p-values.
4. Create a mutation dataset (categorical).
5. Perform Chi-square tests for independence.
6. Visualize and interpret the results.


In [None]:
# Step 3 — Simulate gene expression data for Healthy and Disease samples

np.random.seed(42)
n_samples = 50

# Expression values (log2 normalized)
healthy_expr = np.random.normal(loc=6.5, scale=0.8, size=n_samples)
disease_expr = np.random.normal(loc=7.2, scale=0.9, size=n_samples)

# Combine into a single DataFrame
gene_expr_df = pd.DataFrame({
    "Expression": np.concatenate([healthy_expr, disease_expr]),
    "Group": ["Healthy"] * n_samples + ["Disease"] * n_samples
})

# Display first few rows
gene_expr_df.head()


In [None]:
# Step 4: Boxplot with strip-plot overlay for visualization (with saving option)

plt.figure(figsize=(7,5))
sns.boxplot(x="Group", y="Expression", data=gene_expr_df, palette="Set2")
sns.stripplot(x="Group", y="Expression", data=gene_expr_df, color="black", size=3, alpha=0.7)
plt.title("Gene Expression in Healthy vs Disease Samples", fontsize=14)
plt.xlabel("Group")
plt.ylabel("Expression (log2 normalized)")
plt.tight_layout()  # Adjust layout before saving

# ✅ Save the figure at 600 dpi (high resolution for reports/publications)
plt.savefig("gene_expression_boxplot.png", dpi=600, bbox_inches="tight")

plt.show()

print("✅ Figure saved successfully as 'gene_expression_boxplot.png'")


In [None]:
# Step 5: Independent two-sample t-test
# equal_var=False selects Welch's t-test, which does not assume equal variances in the two groups

t_stat, p_value = stats.ttest_ind(healthy_expr, disease_expr, equal_var=False)

print(f"T-statistic = {t_stat:.3f}")
print(f"P-value = {p_value:.5f}")


### 📖 Interpretation (Based on Our Results)

We obtained:

- **T-statistic = -5.842**  
- **P-value < 0.00001** (printed as `0.00000` because the output is rounded to five decimal places)

#### Step 1: Recall the hypotheses

- **Null hypothesis (H₀):**  
  The mean gene expression is **the same** in the *Healthy* and *Disease* populations (μ_Healthy = μ_Disease).

- **Alternative hypothesis (H₁):**  
  The mean gene expression **differs** between the two populations (μ_Healthy ≠ μ_Disease).

#### Step 2: Compare p-value with significance level

We usually take **α = 0.05** as the threshold.

Since our **p-value < 0.05** (it is below 0.00001; a p-value is never exactly zero), we have **strong evidence against H₀**.
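
A p-value that prints as `0.00000` has only been rounded. A minimal sketch, re-creating the simulated groups with the same seed and draw order as the earlier cell, shows how scientific notation keeps the order of magnitude visible:

```python
import numpy as np
from scipy import stats

# Re-create the simulated groups from the earlier cell (same seed, same draw order).
np.random.seed(42)
n_samples = 50
healthy_expr = np.random.normal(loc=6.5, scale=0.8, size=n_samples)
disease_expr = np.random.normal(loc=7.2, scale=0.9, size=n_samples)

t_stat, p_value = stats.ttest_ind(healthy_expr, disease_expr, equal_var=False)

# Fixed-point formatting rounds very small p-values down to zero ...
print(f"Fixed-point: p = {p_value:.5f}")
# ... whereas scientific notation keeps the order of magnitude.
print(f"Scientific:  p = {p_value:.2e}")
```

When reporting, a bound such as "p < 0.00001" or the scientific-notation value is usually preferable to a string of zeros.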

#### Step 3: Decision

✅ **Reject the null hypothesis (H₀).**

#### Step 4: Biological Interpretation

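The data indicate that **mean gene expression differs significantly** between *Healthy* and *Disease* samples. Because `healthy_expr` was passed first, the negative t-statistic means that expression is **higher in the Disease group**.  
In a real experiment, such a result would suggest that the gene is **differentially expressed**; on its own it would not show that the gene is involved in disease progression or regulation. Here the difference was built into the simulation (means of 6.5 vs 7.2).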

> 💡 *Note for students:*  
> Always consider **biological meaning**, **sample size**, and **effect size**, not just the p-value alone.
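
To put a number on effect size, one common choice is Cohen's d (the mean difference divided by the pooled standard deviation). The sketch below is one way to compute it for the simulated groups. The helper `cohens_d` is a hypothetical name written for this example, not a SciPy function, and the data are re-created with the same seed as the earlier cell.

```python
import numpy as np

# Re-create the simulated groups from the earlier cell (same seed, same draw order).
np.random.seed(42)
n_samples = 50
healthy_expr = np.random.normal(loc=6.5, scale=0.8, size=n_samples)
disease_expr = np.random.normal(loc=7.2, scale=0.9, size=n_samples)


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


# Disease minus Healthy, so a positive value means higher expression in Disease.
mean_diff = disease_expr.mean() - healthy_expr.mean()
d = cohens_d(disease_expr, healthy_expr)

print(f"Mean difference (Disease - Healthy) = {mean_diff:.3f} log2 units")
print(f"Cohen's d = {d:.2f}")
```

By a widely used rule of thumb, d near 0.2 is considered small, 0.5 medium, and 0.8 large, but the biological context should decide what counts as meaningful.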


## 🧠 Understanding the Paired t-test

### 🔹 What is a Paired t-test?

A **paired t-test** is used when we measure **the same subjects twice** — under two different conditions — to see if there is a **significant change** between the two measurements.

Typical biological examples:
- Gene expression **before and after treatment** in the same cell line  
- Blood glucose levels **before and after drug administration** in the same patient  
- Protein abundance **before and after stress exposure** in the same tissue sample  

It compares the **mean of the differences** between the two conditions.

---

### 🔹 Difference Between Independent and Paired t-test

| Feature | Independent t-test | Paired t-test |
|----------|-------------------|---------------|
| **Samples** | Two **independent** groups (e.g., Healthy vs Disease) | Two **related** groups (e.g., before vs after) |
| **Purpose** | Tests if the means of two unrelated groups differ | Tests if the mean difference within the same group differs from zero |
| **Data Relationship** | Different individuals in each group | Same individuals measured twice |
| **Error Reduction** | More variability (individual differences) | Less variability (controls for subject-specific effects) |

---

### 🔹 Why Do We Use a Paired t-test?

Because it accounts for **inherent biological variability** between subjects.

Each subject acts as their **own control**, which:
- Removes individual differences,
- Increases statistical power,
- Makes it easier to detect true effects of treatment/intervention.

So, if the same samples or individuals are used before and after an experiment, a **paired t-test** is the correct and more sensitive choice.
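
The gain in sensitivity can be seen by running both tests on the same numbers. The following is a sketch on hypothetical data (the seed, sizes, and variable names are assumptions made for illustration): subjects differ a lot from each other, while the treatment shift is small and consistent.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

# Hypothetical design: large between-subject spread, small consistent shift.
n_subjects = 30
baseline = rng.normal(loc=5.8, scale=0.6, size=n_subjects)
after = baseline + rng.normal(loc=0.5, scale=0.3, size=n_subjects)

# Paired test: works on the within-subject differences.
t_paired, p_paired = stats.ttest_rel(baseline, after)

# The same statistic by hand: mean difference divided by its standard error.
diff = baseline - after
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n_subjects))

# Welch's independent test on the same numbers ignores the pairing.
t_indep, p_indep = stats.ttest_ind(baseline, after, equal_var=False)

print(f"Paired:      t = {t_paired:.3f}, p = {p_paired:.2e}")
print(f"By hand:     t = {t_manual:.3f}")
print(f"Independent: t = {t_indep:.3f}, p = {p_indep:.2e}")
```

Both tests see the same mean difference. The paired test divides it by a smaller standard error because the subject-to-subject spread cancels out in the differences.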


In [None]:
# Step 7 — Paired t-test (Pre- vs Post-treatment in same samples)

pre_treatment = np.random.normal(5.8, 0.6, 30)
post_treatment = pre_treatment + np.random.normal(0.5, 0.3, 30)

t_stat_paired, p_value_paired = stats.ttest_rel(pre_treatment, post_treatment)

print(f"Paired t-test → t = {t_stat_paired:.3f}, p = {p_value_paired:.5f}")


## 📖 Interpretation of Paired t-test Result

We obtained:
- **t-statistic = -9.181**
- **p-value < 0.00001** (printed as `0.00000` because the output is rounded to five decimal places)

---

### 🔹 Step 1: Define Hypotheses

- **Null hypothesis (H₀):** The **mean of the paired differences** (before vs after treatment) is **zero**.  
- **Alternative hypothesis (H₁):** The mean of the paired differences is **not zero**.

---

### 🔹 Step 2: Compare p-value with α = 0.05

Since **p < 0.05** (it is below 0.00001; a p-value is never exactly zero), we have **strong evidence against H₀**.

---

### 🔹 Step 3: Decision

✅ **Reject the null hypothesis (H₀).**

---

### 🔹 Step 4: Biological Interpretation

The result indicates a **highly significant change** in the measured variable after treatment.  
`stats.ttest_rel(pre_treatment, post_treatment)` computes the differences as *pre - post*, so the **negative t-statistic (-9.181)** means that the **post-treatment values are higher** on average than the pre-treatment ones.  

In biological terms, this suggests that the **treatment produced a measurable effect** — for instance, gene expression, metabolite levels, or response markers have significantly changed after the intervention.

> 💡 *Tip for Students:*  
> A paired t-test tells you **whether** the change is statistically significant,  
> but not **how large** or **biologically meaningful** the change is.  
> For that, always inspect the data visually (boxplots, scatter plots) and calculate the **mean difference** or **effect size**.
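
One way to quantify the size of a paired change is sketched below: the mean difference, its 95% confidence interval, and a standardized effect size (often written d_z). The data are hypothetical, drawn with an arbitrary seed, so the numbers will not match the notebook's output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed; values are illustrative only
pre = rng.normal(5.8, 0.6, 30)
post = pre + rng.normal(0.5, 0.3, 30)

diff = post - pre              # positive values = increase after treatment
n = len(diff)
mean_diff = diff.mean()
sd_diff = diff.std(ddof=1)

# 95% confidence interval for the mean difference (t distribution, n - 1 df)
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * sd_diff / np.sqrt(n)
ci_low, ci_high = mean_diff - half_width, mean_diff + half_width

# Standardized paired effect size: mean difference / SD of the differences
d_z = mean_diff / sd_diff

print(f"Mean difference (post - pre) = {mean_diff:.3f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"d_z = {d_z:.2f}")
```

Defining the difference as *post - pre* makes an increase after treatment show up as a positive number, which is often easier to read than the sign of the t-statistic.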


### Simulation of Mutation Data

In [None]:
# Step 8 — Simulate mutation data for two tumor types

np.random.seed(42)

mutation_df = pd.DataFrame({
    "Tumor_Type": np.random.choice(["Lung", "Breast"], size=100),
    "Mutation_Present": np.random.choice(["Yes", "No"], size=100, p=[0.4, 0.6])
})

print("✅ Mutation dataset created!")
mutation_df.head()


In [None]:
# Step 9 — Create contingency table

contingency = pd.crosstab(mutation_df["Tumor_Type"], mutation_df["Mutation_Present"])
print("Contingency Table:")
display(contingency)


In [None]:
# Step 10: Chi-square test of independence
# Note: for 2x2 tables, chi2_contingency applies Yates' continuity correction by default (correction=True)

chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)

print(f"Chi-square statistic = {chi2:.3f}")
print(f"Degrees of freedom = {dof}")
print(f"P-value = {p_chi:.5f}")
print("\nExpected frequencies:")
display(pd.DataFrame(expected, index=contingency.index, columns=contingency.columns))


## 📖 Interpretation of Chi-square Test Results

We obtained:

- **Chi-square statistic = 0.474**  
- **Degrees of freedom = 1**  
- **P-value = 0.49120**

### 🔹 Step 1: Define Hypotheses

- **Null hypothesis (H₀):**  
  Mutation frequency is **independent** of tumor type.  
  (In other words, tumor type has no effect on whether a mutation is present.)

- **Alternative hypothesis (H₁):**  
  Mutation frequency is **dependent** on tumor type.  
  (The distribution of mutations differs between tumor types.)

---

### 🔹 Step 2: Compare p-value with Significance Level (α = 0.05)

Since our **p-value = 0.49120 > 0.05**,  
we **fail to reject the null hypothesis (H₀)**.

This means there is **no statistically significant association** between tumor type and mutation presence in this dataset.

---

### 🔹 Step 3: Interpretation of Expected Frequencies

| Tumor Type | Expected “No” | Expected “Yes” |
|-------------|---------------|----------------|
| Breast | 30.8 | 25.2 |
| Lung | 24.2 | 19.8 |

These are the counts we would expect **if mutation and tumor type were completely independent**.  
The observed values were close to these expected values, which supports the non-significant result.
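
The expected counts can be reproduced by hand from the table margins: each expected count is the row total times the column total, divided by the grand total. A short sketch, using the marginal totals implied by the table above:

```python
import numpy as np

# Marginal totals implied by the expected-frequency table above
# (Breast row: 30.8 + 25.2 = 56; "No" column: 30.8 + 24.2 = 55).
row_totals = np.array([56, 44])   # Breast, Lung
col_totals = np.array([55, 45])   # No, Yes
n_total = row_totals.sum()        # 100 samples in total

# Under independence: expected = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n_total
print(expected)
```

For example, the Breast/"No" cell is 56 × 55 / 100 = 30.8, matching the value returned by `chi2_contingency`.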

---

### 🔹 Step 4: Biological Interpretation

From a biological perspective, we found **no evidence that mutation frequency differs** between *Lung* and *Breast* tumor samples in our simulated data.

This is expected here, because `Tumor_Type` and `Mutation_Present` were drawn independently in the simulation. With real data, a non-significant result alone would not show that the mutation occurs **randomly across tumor types**.

> 💡 *Teaching Note:*  
> A **non-significant Chi-square result** does not prove independence; it only means there is **not enough evidence** to claim a relationship exists.  
> Larger datasets or more detailed mutation classification might reveal patterns that small samples cannot detect.
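
The effect of sample size can be illustrated with a sketch on a hypothetical 2x2 table (the counts below are made up for illustration and are not taken from the notebook's data). The same proportions are tested once with 100 samples and once with 1000.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table (rows: two tumor types, columns: mutation No / Yes).
# Proportion without mutation: 48% in row 1 vs 60% in row 2.
small = np.array([[24, 26],
                  [30, 20]])
large = small * 10   # same proportions, ten times as many samples

results = {}
for label, table in [("n = 100", small), ("n = 1000", large)]:
    chi2, p, dof, _ = stats.chi2_contingency(table)
    results[label] = p
    print(f"{label}: chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4g}")
```

The difference in proportions is identical in both tables; only the amount of data changes, and with it the ability of the test to detect the difference.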
