# Feature selection parameters

## **ttest_ind** : For a binary target (0/1), mainly for numerical columns
## **f_oneway** : For multiclass classification (mainly for numerical columns)
## **chi2_contingency** : For binary and multiclass classification (for categorical columns)

## **spearmanr** : If the output is numerical (apply to numerical columns)
## **kruskal** : If the output is numerical (apply to categorical columns)

---

### ✅ **Simplified Feature Selection Guide with Examples**

| **Test** | **Use Case** | **Feature Example** | **Target Example** | **When to Use** | **Example Scenario** |
|----------|--------------|---------------------|---------------------|------------------|------------------------|
| **ttest_ind** | Binary Classification | `age`, `salary` (numerical) | `churn` = 0 or 1 (binary) | When target has 2 categories | Compare avg. age of churned vs non-churned users |
| **f_oneway** | Multiclass Classification | `income`, `score` (numerical) | `education_level` = ['high school', 'bachelor', 'master'] | When target has >2 categories | Check if income differs across education levels |
| **chi2** / `chi2_contingency` | Binary/Multiclass Classification | `gender`, `region` (categorical) | `purchased` = 0/1 or multiple classes | Both feature and target are categorical | Test if purchase behavior differs by gender |
| **spearmanr** | Regression | `age`, `hours_spent` (numerical) | `monthly_spend` (numerical) | When target is continuous | Correlation between age and spending |
| **kruskal** | Regression | `product_type`, `location` (categorical) | `revenue` (numerical) | When feature is categorical & target is numerical | Does revenue differ by product type? |



### 🔑 Summary Cheatsheet

| **If target is...** | **And feature is...** | **Use...** |
|---------------------|-----------------------|-------------|
| Binary categorical | Numerical | `ttest_ind` |
| Multiclass categorical | Numerical | `f_oneway` |
| Categorical (any) | Categorical | `chi2_contingency` |
| Numerical | Numerical | `spearmanr` |
| Numerical | Categorical | `kruskal` |
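
The last row of the cheatsheet (`kruskal`) can be sketched as follows. The revenue figures and group names are made up for illustration; the only library call used is `scipy.stats.kruskal`, which takes one sample per category.

```python
from scipy.stats import kruskal

# Hypothetical revenue values, grouped by a categorical feature (product type)
revenue_electronics = [520, 610, 580, 640, 600]
revenue_clothing = [310, 290, 350, 330, 300]
revenue_grocery = [150, 170, 160, 180, 140]

# Kruskal-Wallis H-test: a rank-based check of whether the groups
# plausibly come from the same distribution
h_stat, p_value = kruskal(revenue_electronics, revenue_clothing, revenue_grocery)

print(f"H-statistic: {h_stat:.3f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Revenue differs across product types; the feature looks useful.")
else:
    print("No clear evidence that revenue differs across product types.")
```

Because the test works on ranks, it does not assume normally distributed revenue, which is why it sits in the "numerical target, categorical feature" row.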

---

### **1. What is Feature Selection?**
- **Goal**: Identify and retain features that contribute most to the predictive power of your model while discarding irrelevant or redundant ones.
- **Why it matters**:
  - Reduces model complexity.
  - Speeds up training.
  - Improves generalization (avoids overfitting).
  - Enhances interpretability.



### **2. Types of Feature Selection Methods**
Feature selection techniques can be grouped into **three categories**:

#### **A. Filter Methods**
- **How it works**: Select features based on statistical measures (e.g., correlation, variance) **before** training the model.
- **Pros**: Fast, computationally efficient.
- **Cons**: Ignores feature interactions.
- **Common Techniques**:
  1. **Univariate Selection**: Rank features using statistical tests (e.g., ANOVA, chi-squared).
     ```python
     from sklearn.feature_selection import SelectKBest, chi2
     # chi2 requires non-negative feature values (e.g., counts or one-hot encodings)
     X_new = SelectKBest(chi2, k=5).fit_transform(X, y)
     ```
  2. **Correlation Analysis**: Remove features highly correlated with others (redundancy).
     ```python
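     import numpy as np

     # Absolute pairwise correlations; keep only the upper triangle so each pair is checked once
     corr_matrix = X.corr().abs()
     upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))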
     to_drop = [col for col in upper_tri.columns if any(upper_tri[col] > 0.8)]
     X = X.drop(to_drop, axis=1)
     ```
  3. **Variance Threshold**: Remove low-variance features (constant or near-constant values).
     ```python
     from sklearn.feature_selection import VarianceThreshold
     selector = VarianceThreshold(threshold=0.1)
     X_new = selector.fit_transform(X)
     ```

#### **B. Wrapper Methods**
- **How it works**: Use a machine learning model to evaluate feature subsets (e.g., forward/backward selection).
- **Pros**: Considers feature interactions.
- **Cons**: Computationally expensive.
- **Common Techniques**:
  1. **Recursive Feature Elimination (RFE)**: Iteratively remove the least important features.
     ```python
     from sklearn.feature_selection import RFE
     from sklearn.linear_model import LogisticRegression
     model = LogisticRegression()
     rfe = RFE(model, n_features_to_select=5)
     X_new = rfe.fit_transform(X, y)
     ```
  2. **Forward Selection**: Start with 0 features, add one at a time based on model performance.
  3. **Backward Elimination**: Start with all features, remove one at a time.
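
Forward selection and backward elimination can be sketched with scikit-learn's `SequentialFeatureSelector`. This is a minimal example, assuming the Wine dataset, a scaled logistic regression as the scoring model, and 3-fold cross-validation; all of these choices are illustrative, not prescribed by the notes above.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Model used to score each candidate feature subset (an assumption for this sketch)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# direction="forward" starts from 0 features and adds one at a time;
# direction="backward" would start from all features and remove one at a time
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=3
)
sfs.fit(X, y)

mask = sfs.get_support()  # boolean mask over the 13 original features
print("Selected feature indices:", [i for i, keep in enumerate(mask) if keep])
```

Every step refits the model once per remaining candidate and per fold, which is where the "computationally expensive" downside comes from.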

#### **C. Embedded Methods**
- **How it works**: Feature selection is built into the model training process.
- **Pros**: Balances speed and accuracy.
- **Common Techniques**:
  1. **Lasso (L1 Regularization)**: Penalizes non-important features by shrinking their coefficients to zero.
     ```python
     from sklearn.linear_model import Lasso
     lasso = Lasso(alpha=0.01)
     lasso.fit(X, y)
     selected_features = X.columns[lasso.coef_ != 0]
     ```
  2. **Tree-Based Models**: Use feature importance scores from algorithms like Random Forest or XGBoost.
     ```python
     from sklearn.ensemble import RandomForestClassifier
     model = RandomForestClassifier()
     model.fit(X, y)
     importance = model.feature_importances_
     ```

#### **D. Hybrid Methods**
- Combine filter and wrapper methods (e.g., use correlation to shortlist features, then apply RFE).
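
A hybrid approach might look like the sketch below: a cheap filter shortlists features, then RFE refines the shortlist. The ANOVA F-test as the filter, the cut-offs (8, then 4 features), and the logistic regression model are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

hybrid = Pipeline([
    ("scale", StandardScaler()),
    # Filter step: keep the 8 features with the highest ANOVA F-scores
    ("filter", SelectKBest(score_func=f_classif, k=8)),
    # Wrapper step: recursively eliminate features until 4 remain
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)),
])

X_selected = hybrid.fit_transform(X, y)
print("Shape after hybrid selection:", X_selected.shape)
```

The filter keeps the expensive wrapper step small, since RFE only has to search among the shortlisted columns.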



### **3. Key Considerations for Feature Selection**
1. **Domain Knowledge**: Use expert insights to retain meaningful features (e.g., "proline" in the Wine dataset may be critical).
2. **Feature Interactions**: Some features may only be useful when combined (e.g., BMI = weight/height²).
3. **Class Imbalance**: Ensure selection methods account for imbalance (e.g., use stratified sampling).
4. **Avoid Data Leakage**: Split the data into train/test sets first, then fit the feature selector on the **training set only**.
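
One way to respect the data-leakage rule is to put the selector inside a pipeline, so it is refit on the training portion of every cross-validation fold. A minimal sketch, assuming the Wine dataset, `SelectKBest` with `k=5`, and a logistic regression classifier:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling, selection and the model are fit together on each training fold,
# so the held-out fold never influences which features are chosen
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=5),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy with in-fold selection: {scores.mean():.3f}")
```

Calling `SelectKBest(...).fit_transform(X, y)` on the full dataset before `cross_val_score` would instead let test rows influence the selection.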



### **4. Step-by-Step Workflow**
1. **Preprocess Data**: Handle missing values, encode categorical variables.
2. **Filter Methods**: Use correlation, variance, or univariate tests to remove obvious noise.
3. **Wrapper/Embedded Methods**: Refine selection using model-based techniques.
4. **Validate**: Compare model performance (e.g., accuracy, F1-score) before and after selection.



### **5. Example with the Wine Dataset**
**Goal**: Select features to classify wines into 3 classes.
- **Filter Method Example**:
  ```python
  from sklearn.datasets import load_wine
  from sklearn.feature_selection import SelectKBest, f_classif

  data = load_wine()
  X, y = data.data, data.target

  # Select top 5 features using ANOVA F-test
  selector = SelectKBest(score_func=f_classif, k=5)
  X_new = selector.fit_transform(X, y)
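  # feature_names is a plain list, so filter it with the boolean mask explicitly
  print("Selected features:", [name for name, keep in zip(data.feature_names, selector.get_support()) if keep])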
  ```
- **Embedded Method Example** (Random Forest):
  ```python
  from sklearn.ensemble import RandomForestClassifier

  model = RandomForestClassifier()
  model.fit(X, y)
  importance = model.feature_importances_
  top_features = [data.feature_names[i] for i in importance.argsort()[-5:]]
  print("Top 5 features:", top_features)
  ```



### **6. Common Mistakes to Avoid**
- Using the entire dataset (including test data) for feature selection.
- Ignoring feature scaling (e.g., for methods like SVM or k-NN).
- Over-relying on a single technique (combine methods for robustness).



### **7. Advanced Techniques**
- **PCA (Dimensionality Reduction)**: Not strictly feature selection, but reduces dimensions.
- **Mutual Information**: Measures dependency between features and target.
  ```python
  from sklearn.feature_selection import mutual_info_classif
  mi = mutual_info_classif(X, y)
  ```
- **SHAP Values**: Explain model predictions and feature importance.
  ```python
  import shap
  explainer = shap.TreeExplainer(model)
  shap_values = explainer.shap_values(X)
  ```
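
The PCA bullet above has no code, so here is a minimal sketch. It assumes the Wine dataset and standardized inputs (PCA is sensitive to feature scale); the choice of 2 components is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Standardize first so that large-valued features do not dominate the components
X_scaled = StandardScaler().fit_transform(X)

# Project the 13 original features onto 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Reduced shape:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Each component is a linear combination of all original features, which is why PCA counts as dimensionality reduction rather than feature selection.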



### **8. When to Use Which Method?**
- **Small datasets**: Filter methods (fast).
- **Large datasets**: Embedded/wrapper methods (accuracy).
- **Interpretability**: Filter/embedded methods (e.g., Lasso, Random Forest).



### **9. Tools & Libraries**
- **Scikit-learn**: `SelectKBest`, `RFE`, `VarianceThreshold`.
- **Statsmodels**: For detailed statistical tests.
- **MLxtend**: Sequential feature selection.



### **10. Practice Exercise**
**Task**: Use the **Wine Dataset** and compare model performance (e.g., accuracy) using:
1. All 13 features.
2. Top 5 features selected via `SelectKBest`.
3. Top 5 features selected via Random Forest importance.

---

## **Feature Selection with the t-test (Independent Samples, Numerical Features)**

In feature selection, `ttest_ind` is used to identify features that are significantly different between two groups. It comes from **scipy.stats** and performs an **independent two-sample t-test**, which checks whether the means of two independent groups are statistically different.

### How `ttest_ind` Helps in Feature Selection:
- It helps determine whether a feature (column) is useful for distinguishing between two classes in classification problems.
- A small p-value (typically < 0.05) indicates a significant difference in the feature values between the two groups, meaning the feature is important for classification.
- A large p-value suggests the feature might not be useful for distinguishing between classes.

### Example:
```python
from scipy.stats import ttest_ind
import numpy as np

# Sample data: Two groups (Class 0 and Class 1) for a single feature
class_0 = np.array([2.5, 3.0, 3.2, 4.1, 3.8])
class_1 = np.array([5.2, 5.8, 6.1, 5.9, 6.5])

# Perform t-test
stat, p_value = ttest_ind(class_0, class_1)

print(f"t-statistic: {stat}, p-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("Feature is significant for classification.")
else:
    print("Feature is NOT significant for classification.")
```

### When to Use `ttest_ind` in Feature Selection:
- When dealing with **continuous numerical features**.
- When you have **two classes** in classification problems.
- If the feature distribution is approximately **normal** (since t-test assumes normality).

---

This section works through the **manual calculation** of the independent two-sample t-test (`ttest_ind`) step by step.



## **Formula for t-test (Independent)**
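The independent t-test checks whether two groups have significantly different means. The formula below is the unequal-variance (Welch) form; when both groups have the same size, it gives the same t-statistic as the pooled-variance version that `ttest_ind` uses by default: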

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

Where:
- $\bar{X}_1, \bar{X}_2$ = Means of the two groups
- $s_1^2, s_2^2$ = Variances of the two groups
- $n_1, n_2$ = Sample sizes of the two groups



## **Step-by-Step Calculation**
### **Example Data**
Let's say we have two groups (Class 0 and Class 1) with the following values:

| Class 0 | Class 1 |
|---------|---------|
| 2.5     | 5.2     |
| 3.0     | 5.8     |
| 3.2     | 6.1     |
| 4.1     | 5.9     |
| 3.8     | 6.5     |

### **Step 1: Calculate the Means ($\bar{X}_1, \bar{X}_2$)**
#### Mean of Class 0:
$$
\bar{X}_1 = \frac{2.5 + 3.0 + 3.2 + 4.1 + 3.8}{5} = \frac{16.6}{5} = 3.32
$$

#### Mean of Class 1:
$$
\bar{X}_2 = \frac{5.2 + 5.8 + 6.1 + 5.9 + 6.5}{5} = \frac{29.5}{5} = 5.9
$$



### **Step 2: Calculate the Variance ($s_1^2, s_2^2$)**
The formula for variance is:

$$
s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}
$$

#### Variance of Class 0:
$$
s_1^2 = \frac{(2.5 - 3.32)^2 + (3.0 - 3.32)^2 + (3.2 - 3.32)^2 + (4.1 - 3.32)^2 + (3.8 - 3.32)^2}{5 - 1}
$$

$$
= \frac{(-0.82)^2 + (-0.32)^2 + (-0.12)^2 + (0.78)^2 + (0.48)^2}{4}
$$

$$
= \frac{0.6724 + 0.1024 + 0.0144 + 0.6084 + 0.2304}{4}
$$

$$
= \frac{1.628}{4} = 0.407
$$

#### Variance of Class 1:
$$
s_2^2 = \frac{(5.2 - 5.9)^2 + (5.8 - 5.9)^2 + (6.1 - 5.9)^2 + (5.9 - 5.9)^2 + (6.5 - 5.9)^2}{5 - 1}
$$

$$
= \frac{(-0.7)^2 + (-0.1)^2 + (0.2)^2 + (0.0)^2 + (0.6)^2}{4}
$$

$$
= \frac{0.49 + 0.01 + 0.04 + 0.0 + 0.36}{4}
$$

$$
= \frac{0.9}{4} = 0.225
$$



### **Step 3: Compute the t-statistic**
$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

Substituting values:

$$
t = \frac{3.32 - 5.9}{\sqrt{\frac{0.407}{5} + \frac{0.225}{5}}}
$$

$$
t = \frac{-2.58}{\sqrt{0.0814 + 0.045}}
$$

$$
t = \frac{-2.58}{\sqrt{0.1264}}
$$

$$
t = \frac{-2.58}{0.3556}
$$

$$
t \approx -7.26
$$



### **Step 4: Compute the p-value (Using t-table or Python)**
To get the **p-value**, we can use statistical tables or Python's `scipy.stats.t.cdf` function.

For **degrees of freedom df = n1 + n2 - 2 = 5 + 5 - 2 = 8**, a t-score of about **-7.26** corresponds to a very small two-sided p-value (on the order of 0.0001, far below 0.05), meaning the feature is significant.
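
The manual steps can be checked in code. This sketch recomputes the means, sample variances, t-statistic and two-sided p-value with NumPy and `scipy.stats.t.cdf`, then compares them with `ttest_ind`. Variable names such as `t_manual` are just illustrative.

```python
import numpy as np
from scipy import stats

class_0 = np.array([2.5, 3.0, 3.2, 4.1, 3.8])
class_1 = np.array([5.2, 5.8, 6.1, 5.9, 6.5])

# Steps 1-2: means and sample variances (ddof=1 divides by n - 1)
mean_0, mean_1 = class_0.mean(), class_1.mean()
var_0, var_1 = class_0.var(ddof=1), class_1.var(ddof=1)
n_0, n_1 = len(class_0), len(class_1)

# Step 3: t-statistic from the formula above
t_manual = (mean_0 - mean_1) / np.sqrt(var_0 / n_0 + var_1 / n_1)

# Step 4: two-sided p-value from the t-distribution with n1 + n2 - 2 degrees of freedom
df = n_0 + n_1 - 2
p_manual = 2 * stats.t.cdf(-abs(t_manual), df)

# Cross-check against SciPy (equal group sizes, so both versions agree)
t_scipy, p_scipy = stats.ttest_ind(class_0, class_1)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy : t = {t_scipy:.4f}, p = {p_scipy:.6f}")
```

With unequal group sizes the two results would differ unless `equal_var=False` is passed to `ttest_ind`, because the formula above does not pool the variances.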



## **Final Conclusion**
Since the **p-value is very small (p < 0.05)**, we conclude that **this feature significantly differs between the two groups** and is **useful for classification**.



## **Summary of Steps**
1. Compute means for both groups ($\bar{X}_1, \bar{X}_2$).
2. Compute variances ($s_1^2, s_2^2$).
3. Plug values into the t-test formula.
4. Calculate the **t-statistic**.
5. Find the **p-value** (using a t-table or Python).
6. If **p < 0.05**, the feature is significant.

---

The same idea can be explained even **more simply** with a real-life analogy.  



## **Think of It Like a Taste Test 🍕 vs 🍔**  
Imagine you are running a **food competition** between **Pizza Lovers** and **Burger Lovers**. You want to check:  

👉 *Do people in the Pizza group eat significantly fewer/more calories than people in the Burger group?*  

You collect data on how many calories each person eats in both groups.



## **Step-by-Step in Layman Terms**  

### **Step 1: Find the Average Calories Each Group Eats (Mean)**
- You **add up all calories** each person eats in the Pizza group and divide by the number of people → **this is the average (mean) for Pizza Lovers**.  
- Do the same for Burger Lovers → **this is the average (mean) for Burger Lovers**.  
- If these averages are very different, it suggests that one group eats more than the other.  



### **Step 2: Check How Much People in Each Group Vary (Spread/Variance)**
- Not everyone eats the same amount of calories! Some Pizza Lovers eat a lot; some eat less.  
- The same happens with Burger Lovers.  
- So, we **measure how spread out the numbers are** in each group. This is called **variance**.  

If everyone in the Pizza group eats around **2000 calories**, and everyone in the Burger group eats around **3000 calories**, there is **low variance**.  
But if some Pizza Lovers eat **1000 calories** and some eat **3000 calories**, there is **high variance**.  



### **Step 3: Compare the Two Groups with a Formula (t-test)**
Now, we use a **formula** to compare:  
👉 **How different are the averages (step 1)?**  
👉 **How spread out is the data in each group (step 2)?**  

The **t-test formula** gives a **t-score** (a number that tells us how different the two groups are).  

- **If the t-score is far from zero** (large in absolute value), it means the groups are very different.  
- **If the t-score is close to zero**, the groups are similar.  



### **Step 4: Get the Final Answer (p-value)**
The **p-value** tells us:  
- **Small p-value (< 0.05)** → The difference between the groups is **unlikely to be due to chance** → The feature (calories) is likely useful!  
- **Large p-value (> 0.05)** → There is **no clear evidence** of a difference → The feature is probably **not useful** on its own.  

## **Example with Numbers** 🎯  

| Pizza Lovers (Calories) | Burger Lovers (Calories) |
|-------------------------|-------------------------|
| 2000                    | 3000                    |
| 2100                    | 3100                    |
| 1900                    | 2900                    |
| 2050                    | 3050                    |
| 1950                    | 2950                    |



- **Average for Pizza Lovers** = **2000 calories**  
- **Average for Burger Lovers** = **3000 calories**  
- The difference is **1000 calories**.  
- The variance is low (numbers in each group are close to their mean).  
- The t-test gives a **big t-score** and a **small p-value** (p < 0.05).  

👉 **Conclusion:** Pizza and Burger lovers eat **very different** calories, so "Calories" is an important feature!  
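
The calorie table can be run through `ttest_ind` directly. A quick sketch using the numbers from the table above:

```python
from scipy.stats import ttest_ind

pizza = [2000, 2100, 1900, 2050, 1950]
burger = [3000, 3100, 2900, 3050, 2950]

# Independent two-sample t-test on the two calorie groups
t_stat, p_value = ttest_ind(pizza, burger)

print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.2e}")
```

Both groups have a sample variance of 6250, so the standard error of the difference is 50 and the t-score is -1000 / 50 = -20, which is very far from zero.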



## **Now Relate This to Feature Selection**  
- Instead of **Pizza vs. Burger**, we have **Class 0 vs. Class 1** in a machine learning dataset.  
- Instead of **Calories**, we check each feature (e.g., height, weight, salary).  
- If a feature (e.g., salary) is **significantly different** between two groups, it is **important for classification**!  
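
Applied to a whole dataset, the idea becomes a loop over columns. The sketch below assumes scikit-learn's breast cancer dataset (binary target, numerical features) purely as a stand-in. It uses `equal_var=False` (Welch's variant) as a cautious default, which is a choice made for this example.

```python
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target  # y contains only 0 and 1

rows = []
for col in X.columns:
    # Compare the feature's values in class 0 against its values in class 1
    t_stat, p_value = ttest_ind(X.loc[y == 0, col], X.loc[y == 1, col], equal_var=False)
    rows.append({"feature": col, "t_stat": t_stat, "p_value": p_value})

results = pd.DataFrame(rows).sort_values("p_value")
significant = results.loc[results["p_value"] < 0.05, "feature"].tolist()

print(results.head())
print(f"{len(significant)} of {len(results)} features have p < 0.05")
```

When many features are tested at once, some will fall below 0.05 by chance alone, so treating the p-values as a ranking is usually safer than treating 0.05 as a hard cut-off.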



## **Final Takeaway**  
- `ttest_ind` helps check **if a feature is useful for distinguishing two groups**.  
- A **small p-value** means the feature **is important**.  
- A **large p-value** means the feature **doesn’t help** separate the groups.  



### **Super Simple Conclusion:**  
`ttest_ind` is like a **food competition** – it checks if two groups are really different. 🍕 vs 🍔 If they are, we use that feature for classification. 🚀  

---

This section breaks down the **Chi-Square Test for Independence (`chi2_contingency`)** step by step, in a **super simple** way.  



## **What is `chi2_contingency`?**
It checks whether **two categorical variables** are **related** or **independent** of each other.  

👉 In simple terms: **"Does one category affect the other?"**  

For example:  
- **Does gender (Male/Female) affect preference for a product (Yes/No)?**  
- **Does education level (High School/College) affect job type (Tech/Non-Tech)?**  
- **Does smoking (Yes/No) affect lung disease (Yes/No)?**  



## **Step-by-Step Process**

### **Step 1: Create a Contingency Table**  
A **contingency table** is like a frequency table that shows how many times each combination occurs.  

For example, let's say we surveyed **100 people** to check whether **gender** affects their **preference for a product**.  

|               | Prefers Product | Doesn't Prefer Product | Total |
|--------------|----------------|------------------------|------|
| **Male**     | 30             | 20                     | 50   |
| **Female**   | 40             | 10                     | 50   |
| **Total**    | 70             | 30                     | 100  |

Each cell in this table tells how many people **fall into a specific category combination**.



### **Step 2: Calculate Expected Counts**
We now calculate what the numbers **should be** if gender and product preference were completely **independent** (not related).  

Formula for **expected count** for each cell:

$$
E_{ij} = \frac{(Row\ Total) \times (Column\ Total)}{\text{Grand Total}}
$$

Let’s calculate for **Male - Prefers Product**:

$$
E_{Male,Prefers} = \frac{(Row\ Total\ for\ Male) \times (Column\ Total\ for\ Prefers)}{\text{Grand Total}}
$$

$$
E_{Male,Prefers} = \frac{50 \times 70}{100} = \frac{3500}{100} = 35
$$

Similarly, we calculate for other cells:

|               | Prefers Product (Expected) | Doesn't Prefer (Expected) |
|--------------|----------------|------------------------|
| **Male**     | 35             | 15                     |
| **Female**   | 35             | 15                     |

Now, we compare the **actual observed values** vs **expected values**.



### **Step 3: Compute the Chi-Square Statistic**
The Chi-Square formula is:

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

Where:
- $O$ = Observed count (actual values from survey)
- $E$ = Expected count (calculated from step 2)

Now, calculate for each cell:

For **Male - Prefers Product**:

$$
\frac{(30 - 35)^2}{35} = \frac{(-5)^2}{35} = \frac{25}{35} = 0.714
$$

For **Male - Doesn’t Prefer Product**:

$$
\frac{(20 - 15)^2}{15} = \frac{(5)^2}{15} = \frac{25}{15} = 1.667
$$

For **Female - Prefers Product**:

$$
\frac{(40 - 35)^2}{35} = \frac{(5)^2}{35} = \frac{25}{35} = 0.714
$$

For **Female - Doesn’t Prefer Product**:

$$
\frac{(10 - 15)^2}{15} = \frac{(-5)^2}{15} = \frac{25}{15} = 1.667
$$

Now sum them all up:

$$
\chi^2 = 0.714 + 1.667 + 0.714 + 1.667 = 4.76
$$



### **Step 4: Find the p-value**
To determine if the relationship is **significant**, we compare our Chi-Square value to a **critical value from a Chi-Square table**, or we calculate the **p-value** using Python.

The **degrees of freedom** (df) is calculated as:

$$
df = (\text{rows} - 1) \times (\text{columns} - 1)
$$

For our table:

$$
df = (2 - 1) \times (2 - 1) = 1
$$

Now, using a **Chi-Square table or Python**, we find the **p-value** for $\chi^2 = 4.76$ with $df = 1$.  
This gives **p ≈ 0.029**.
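
Steps 2 to 4 can be reproduced with NumPy and the chi-square survival function. This is a sketch of the manual calculation, without any continuity correction:

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[30, 20],   # Male
                     [40, 10]])  # Female

# Step 2: expected counts = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Step 3: chi-square statistic summed over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Step 4: degrees of freedom and p-value (sf = 1 - CDF)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)

print(f"expected:\n{expected}")
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.4f}")
```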



### **Step 5: Interpret the Results**
- **If p-value < 0.05**, reject the null hypothesis → *There is a significant relationship between gender and product preference!*  
- **If p-value > 0.05**, fail to reject the null hypothesis → *No significant relationship detected between gender and product preference.*  

Since **p = 0.029 < 0.05**, we conclude:  
✅ **Gender and product preference are related!**  



## **How is `chi2_contingency` Used in Feature Selection?**
- If a **categorical feature** (e.g., "Gender") is **highly related** to the target variable (e.g., "Buys Product"), it is an **important feature**.  
- If **p-value < 0.05**, the feature is useful for prediction.  
- If **p-value > 0.05**, the feature **does not help** and can be removed.
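
In a real dataset the counts come from raw rows rather than a ready-made table. A minimal sketch, assuming a hypothetical DataFrame with `gender` and `buys_product` columns rebuilt from the survey counts above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw survey rows matching the 30/20/40/10 counts
df = pd.DataFrame({
    "gender": ["Male"] * 50 + ["Female"] * 50,
    "buys_product": ["Yes"] * 30 + ["No"] * 20 + ["Yes"] * 40 + ["No"] * 10,
})

# Build the contingency table from the feature and the target
table = pd.crosstab(df["gender"], df["buys_product"])

# correction=False turns off Yates' continuity correction, which SciPy
# applies by default to 2x2 tables, so the result matches the manual steps
chi2_stat, p_value, dof, expected = chi2_contingency(table, correction=False)

print(table)
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.4f}, dof = {dof}")
```

The same two lines (`pd.crosstab` followed by `chi2_contingency`) can be repeated for each categorical feature against the target.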



## **Summary**
| Step | What You Do |
|------|------------|
| 1️⃣ | Make a **contingency table** (count how often each category appears). |
| 2️⃣ | Calculate **expected counts** (what would happen if categories were independent). |
| 3️⃣ | Use **Chi-Square formula** to compare observed vs. expected. |
| 4️⃣ | Find the **p-value** (probability of seeing a difference at least this large if the variables were independent). |
| 5️⃣ | If **p < 0.05**, feature is important! |



## **Real Example in Python**
Here’s how you do it in Python using **scipy.stats.chi2_contingency**:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Creating the observed frequency table
data = np.array([[30, 20],  # Male
                 [40, 10]]) # Female

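# Performing Chi-Square test
# correction=False disables Yates' continuity correction (applied by default
# to 2x2 tables) so the result matches the manual calculation (chi2 ≈ 4.76)
chi2, p, dof, expected = chi2_contingency(data, correction=False)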

# Output results
print(f"Chi-Square Value: {chi2}")
print(f"p-value: {p}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Table:\n{expected}")

# Check significance
if p < 0.05:
    print("Feature is significant!")
else:
    print("Feature is NOT significant.")
```



## **Final Takeaway**
🔹 `chi2_contingency` helps check if **two categorical variables are related**.  
🔹 If **p < 0.05**, the feature is useful for classification.  
🔹 If **p > 0.05**, the feature can be removed.  
🔹 **Used in feature selection** for categorical data in machine learning.

---

## **ANOVA (Analysis of Variance) Test – Full Explanation**  

### **What is ANOVA?**
ANOVA (**Analysis of Variance**) is a **statistical test** used to determine **if there is a significant difference between the means of three or more independent groups**. It helps us check whether the variations in the groups are **due to random chance or actual differences** in the population.

🔹 **Example Use Case**:  
Suppose you want to compare the **average test scores** of students from three different schools. ANOVA can tell you **whether at least one school has a significantly different mean score** compared to the others.



## **Types of ANOVA**
1. **One-Way ANOVA** → Compares means across **one independent variable** (one factor).  
   - Example: Comparing **test scores** of students from **three schools**.  
   
2. **Two-Way ANOVA** → Compares means across **two independent variables** (two factors).  
   - Example: Comparing test scores based on **school** (factor 1) and **teaching method** (factor 2).  

3. **Repeated Measures ANOVA** → Used when the **same subjects** are tested multiple times under different conditions (like before and after a treatment).  



## **1. One-Way ANOVA Formula**  
The **ANOVA test statistic (F-statistic)** is given by:

$$
F = \frac{\text{Between-group variance}}{\text{Within-group variance}}
$$

Where:  
- **Between-group variance** measures how much the group means differ from the overall mean.  
- **Within-group variance** measures how much the data points within each group differ from their group mean.  

A **higher F-value** means the groups are likely to have significantly different means.



## **Step-by-Step Calculation of One-Way ANOVA**
Let's consider **three groups (A, B, and C)** with test scores:

| **Group A** | **Group B** | **Group C** |
|------------|------------|------------|
| 85         | 78         | 90         |
| 88         | 82         | 94         |
| 92         | 84         | 89         |
| 94         | 80         | 96         |
| 90         | 86         | 91         |

### **Step 1: Compute Group Means ($\bar{X}$)**
#### Mean of Group A:
$$
\bar{X}_A = \frac{85 + 88 + 92 + 94 + 90}{5} = 89.8
$$

#### Mean of Group B:
$$
\bar{X}_B = \frac{78 + 82 + 84 + 80 + 86}{5} = 82.0
$$

#### Mean of Group C:
$$
\bar{X}_C = \frac{90 + 94 + 89 + 96 + 91}{5} = 92.0
$$

#### Overall Mean ($\bar{X}_{\text{overall}}$)
$$
\bar{X}_{\text{overall}} = \frac{(85+88+92+94+90) + (78+82+84+80+86) + (90+94+89+96+91)}{15} = 87.93
$$



### **Step 2: Compute Between-Group Variance (SSB)**
The formula for **Sum of Squares Between Groups (SSB)**:

$$
SSB = n_A (\bar{X}_A - \bar{X}_{\text{overall}})^2 + n_B (\bar{X}_B - \bar{X}_{\text{overall}})^2 + n_C (\bar{X}_C - \bar{X}_{\text{overall}})^2
$$

$$
SSB = 5(89.8 - 87.93)^2 + 5(82 - 87.93)^2 + 5(92 - 87.93)^2
$$

$$
SSB = 5(3.484) + 5(35.204) + 5(16.538) = 17.42 + 176.02 + 82.69 = 276.13
$$

(The squared differences use the unrounded overall mean, 87.933.)



### **Step 3: Compute Within-Group Variance (SSW)**
The formula for **Sum of Squares Within Groups (SSW)**:

$$
SSW = \sum (X - \bar{X})^2 \text{ for each group}
$$

#### **For Group A:**
$$
(85 - 89.8)^2 + (88 - 89.8)^2 + (92 - 89.8)^2 + (94 - 89.8)^2 + (90 - 89.8)^2
$$

$$
= 23.04 + 3.24 + 4.84 + 17.64 + 0.04 = 48.8
$$

#### **For Group B:**
$$
(78 - 82)^2 + (82 - 82)^2 + (84 - 82)^2 + (80 - 82)^2 + (86 - 82)^2
$$

$$
= 16 + 0 + 4 + 4 + 16 = 40
$$

#### **For Group C:**
$$
(90 - 92)^2 + (94 - 92)^2 + (89 - 92)^2 + (96 - 92)^2 + (91 - 92)^2
$$

$$
= 4 + 4 + 9 + 16 + 1 = 34
$$

$$
SSW = 48.8 + 40 + 34 = 122.8
$$



### **Step 4: Compute F-Statistic**
$$
F = \frac{\text{MSB}}{\text{MSW}}
$$

Where:
- **Mean Square Between Groups (MSB)** = $ \frac{SSB}{df_B} $  
  - Degrees of freedom **df_B** = Number of groups - 1 = 3 - 1 = 2  
  - $ MSB = \frac{276.13}{2} = 138.07 $  

- **Mean Square Within Groups (MSW)** = $ \frac{SSW}{df_W} $  
  - Degrees of freedom **df_W** = Total samples - Number of groups = 15 - 3 = 12  
  - $ MSW = \frac{122.8}{12} = 10.23 $  

$$
F = \frac{138.07}{10.23} \approx 13.5
$$
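
Steps 1 to 4 can be verified in a few lines. This sketch recomputes SSB, SSW and F with NumPy and compares the result with `scipy.stats.f_oneway`. Names such as `f_manual` are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

groups = [
    np.array([85, 88, 92, 94, 90]),  # Group A
    np.array([78, 82, 84, 80, 86]),  # Group B
    np.array([90, 94, 89, 96, 91]),  # Group C
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Step 2: between-group sum of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Step 3: within-group sum of squares
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Step 4: mean squares and F-statistic
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ssb / df_between) / (ssw / df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = f_oneway(*groups)

print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}")
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}, p = {p_scipy:.5f}")
```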



### **Step 5: Find the p-value**
The **p-value** is obtained from the **F-distribution table** or using Python:

```python
from scipy.stats import f

# Degrees of freedom
df_between = 2
df_within = 12

# Compute p-value (survival function = 1 - CDF, more accurate for tiny p-values)
p_value = f.sf(13.5, df_between, df_within)

print(f"P-value: {p_value}")
```

📌 **If p-value < 0.05**, we reject the null hypothesis and conclude that at least one group has a significantly different mean.



## **Final Conclusion**
- If **p < 0.05**, at least one group's mean is significantly different.
- If **p > 0.05**, there is **not enough evidence** that the group means differ.



## **Summary of Steps**
1. **Compute group means** and **overall mean**.
2. **Calculate SSB** (Between-group variance).
3. **Calculate SSW** (Within-group variance).
4. **Compute the F-statistic**.
5. **Find the p-value**.
6. **Interpret results** → If **p < 0.05**, at least one group is different.

---

The **comparison table** below shows when to use the **t-test (independent), ANOVA, and Chi-Square test (chi2_contingency)**, along with their use cases, assumptions, and examples:  

| **Test Name**          | **Type of Data**  | **Purpose**  | **Number of Groups** | **Assumptions** | **Example Use Case** | **Hypothesis Tested** | **Python Function** |
|------------------------|------------------|-------------|----------------------|-----------------|----------------------|------------------------|----------------------|
| **t-test (Independent)** <br> `ttest_ind()` | **Numerical (continuous) vs. Categorical** | Compare means between **two** independent groups | **2 groups only** | - Data is **normally distributed** <br> - Groups have **equal variance** (for Student’s t-test) <br> - Observations are **independent** | **Example:** Compare **average exam scores** of **male vs. female students** | **H₀:** No difference in means between the two groups <br> **H₁:** Means are significantly different | `scipy.stats.ttest_ind(group1, group2)` |
| **ANOVA (Analysis of Variance)** <br> `f_oneway()` | **Numerical (continuous) vs. Categorical** | Compare means across **three or more** independent groups | **3+ groups** | - Data is **normally distributed** <br> - Groups have **equal variance** <br> - Observations are **independent** | **Example:** Compare **average salaries** across job roles (**Engineer, Manager, Director, VP**) | **H₀:** All groups have the same mean <br> **H₁:** At least one group mean is different | `scipy.stats.f_oneway(group1, group2, group3, ...)` |
| **Chi-Square Test** <br> `chi2_contingency()` | **Categorical vs. Categorical** | Test **association** or **independence** between two categorical variables | **Any number of categories** | - Data is in **frequency/count format** <br> - Expected frequencies > 5 in most cells | **Example:** Check if **education level** is related to **preferred social media platform** | **H₀:** No association between the categorical variables <br> **H₁:** Variables are dependent | `scipy.stats.chi2_contingency(table)` |



## **How to Choose the Right Test?**
- **If comparing means between two groups → Use t-test (`ttest_ind`)**
- **If comparing means across 3+ groups → Use ANOVA (`f_oneway`)**
- **If testing association between two categorical variables → Use Chi-Square (`chi2_contingency`)**



## **Example Code for Each Test**
### **1️⃣ Independent t-test (Comparing Two Groups)**
```python
import scipy.stats as stats

# Sample data: Exam scores of males and females
male_scores = [85, 88, 90, 92, 86]
female_scores = [78, 82, 84, 80, 79]

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(male_scores, female_scores)

print(f"t-statistic: {t_stat}, p-value: {p_value}")
```



### **2️⃣ ANOVA (Comparing 3+ Groups)**
```python
import scipy.stats as stats

# Sample data: Salaries of different job roles
engineers = [60000, 62000, 58000, 59000, 61000]
managers = [80000, 82000, 78000, 81000, 79000]
directors = [120000, 125000, 118000, 122000, 121000]

# Perform ANOVA test
f_stat, p_value = stats.f_oneway(engineers, managers, directors)

print(f"F-statistic: {f_stat}, p-value: {p_value}")
```



### **3️⃣ Chi-Square Test (Categorical vs. Categorical)**
```python
import scipy.stats as stats
import numpy as np

# Contingency table: (Education Level x Social Media Preference)
data = np.array([[50, 30, 20],  # High School
                 [60, 40, 30],  # Bachelor's
                 [70, 50, 40]]) # Master's

# Perform Chi-Square test
chi2, p, dof, expected = stats.chi2_contingency(data)

print(f"Chi-Square Value: {chi2}, p-value: {p}")
```

---

# 🌟 **Spearman's Rank Correlation: The Complete Guide** 🎯

### **🔹 What is Spearman’s Rank Correlation?**
Spearman’s Rank Correlation, denoted as **Spearman's ρ (rho)**, is a **non-parametric** statistical test that measures the **monotonic** relationship between two variables.

🚀 Unlike **Pearson’s correlation**, which captures **linear relationships**, Spearman’s correlation checks **whether one variable increases (or decreases) as another does, even if the relationship is not linear**.


### **🔹 When Should You Use Spearman's Correlation?**
✅ **Your data is numerical** (continuous or ordinal).  
✅ **Your data is NOT normally distributed**.  
✅ **You suspect a monotonic (but possibly nonlinear) relationship**.  
✅ **You have ranked or ordinal data** (like ratings, preferences, etc.).  
✅ **You want a more robust alternative to Pearson's correlation** (less sensitive to outliers).

🔴 **Do NOT use Spearman’s if:**
❌ The relationship is NOT monotonic.  
❌ You specifically need to measure a **linear** relationship (use Pearson instead).


### **🔹 How Does Spearman’s Correlation Work?**
Spearman’s correlation works by converting the data into **ranks** and then calculating the **Pearson correlation** on the ranks.

👀 **Step-by-Step Process:**
1️⃣ **Convert values into ranks** (smallest value gets rank 1, second smallest gets rank 2, etc.).  
2️⃣ **Compute the difference between ranks** for each pair of data points.  
3️⃣ **Calculate Spearman’s correlation using the formula**:

$$
ρ = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$

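Where:
- $ d_i $ = difference between the ranks of each pair
- $ n $ = number of observations

This shortcut formula is exact only when there are no tied ranks. With ties, compute Pearson's correlation on the ranks instead.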
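
The three steps can be sketched in a few lines of Python. This is an illustration on made-up data with no tied values (the rank-difference formula assumes that), using `scipy.stats.rankdata` for the ranking step and `spearmanr` as a cross-check:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Made-up data without ties
x = np.array([3.1, 1.2, 5.4, 2.2, 4.8, 6.0])
y = np.array([30.0, 12.0, 41.0, 35.0, 50.0, 62.0])

# Step 1: convert values to ranks (smallest value -> rank 1)
rank_x = rankdata(x)
rank_y = rankdata(y)

# Step 2: difference between the ranks of each pair
d = rank_x - rank_y

# Step 3: rank-difference formula (valid here because there are no ties)
n = len(x)
rho_manual = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rho_scipy, _ = spearmanr(x, y)
print(f"manual: {rho_manual:.4f}, scipy: {rho_scipy:.4f}")
```

Both numbers should agree, which is a quick way to confirm that the formula and the library describe the same quantity.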

### **🔹 Spearman’s vs. Pearson’s vs. Mutual Information**
| Method | Relationship Type | Handles Nonlinear? | Affected by Outliers? | Works with Ordinal Data? |
|--------|------------------|--------------------|----------------------|------------------------|
| **Spearman’s (ρ)** | Monotonic | ✅ Yes | ✅ No (more robust) | ✅ Yes |
| **Pearson’s (r)** | Linear | ❌ No | ❌ Yes | ❌ No |
| **Mutual Information** | Any | ✅ Yes | ✅ No | ✅ Yes |

🔹 **Spearman's is a great choice when Pearson's fails due to nonlinearity or outliers!**
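
A small experiment makes the first two rows of the table concrete. The data below is synthetic (an exponential curve), chosen only because it is strictly increasing but far from linear:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = np.exp(x)          # strictly increasing, strongly nonlinear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print(f"Pearson r:    {r:.4f}")
print(f"Spearman rho: {rho:.4f}")
```

Pearson's r comes out well below 1 because the points do not lie on a line, while Spearman's ρ is 1 because the rank order is perfectly preserved.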


### **🔹 Spearman’s Correlation in Python**
Here’s how you can calculate Spearman’s correlation using `scipy`:

```python
import pandas as pd
import numpy as np
from scipy.stats import spearmanr

# Sample Data
data = {
    'Feature1': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'Feature2': [1, 3, 2, 5, 4, 7, 6, 9, 8, 10]
}

df = pd.DataFrame(data)

# Calculate Spearman's Correlation
rho, p_value = spearmanr(df['Feature1'], df['Feature2'])

# Print Results
print(f"Spearman's Correlation: {rho:.4f}")
print(f"P-Value: {p_value:.4f}")
```


### **🔹 Interpreting the Results**
- **ρ (Spearman's correlation) value:**
  - **+1** → Perfect positive monotonic relationship 📈
  - **-1** → Perfect negative monotonic relationship 📉
  - **0** → No monotonic relationship ❌

- **p-value:**
  - **p < 0.05** → The correlation is **statistically significant** ✅
  - **p ≥ 0.05** → No strong evidence of correlation ❌


### **🔹 Example Scenarios**
🔵 **Example 1 (Perfect Monotonic Relationship)**
```python
x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]
```
📊 **ρ = 1.0** (Perfect Positive Correlation)

🔴 **Example 2 (Nonlinear but Monotonic)**
```python
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # Squared values
```
📊 **ρ = 1.0** (The rank order is perfectly preserved, so the correlation is still perfect!)

🟢 **Example 3 (Non-monotonic)**
```python
x = [1, 2, 3, 4, 5]
y = [4, 1, 0, 1, 4]  # U-shaped: decreases, then increases
```
📊 **ρ = 0.0** (No monotonic correlation)
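
These values can be checked directly. The sketch below computes ρ for a linear, a squared, and a symmetric U-shaped series; the U-shaped values are an illustrative choice for the non-monotonic case:

```python
from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5]

cases = {
    "linear":   [10, 20, 30, 40, 50],
    "squared":  [1, 4, 9, 16, 25],
    "u_shaped": [4, 1, 0, 1, 4],   # falls, then rises: not monotonic
}

results = {}
for name, y in cases.items():
    rho, _ = spearmanr(x, y)
    results[name] = rho
    print(f"{name:>8}: rho = {rho:.4f}")
```

The linear and squared series both give ρ = 1 because only the ordering matters, while the U-shaped series gives ρ = 0 even though y depends entirely on x.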


### **🔹 Conclusion: When to Choose Spearman?**
✅ Use **Spearman’s correlation** when:
- Your data has a **nonlinear but monotonic** trend.  
- You have **ranked/ordinal** data (e.g., survey ratings).  
- You want a **robust** method less affected by **outliers**.

🚀 Spearman's is **more flexible than Pearson’s** and is often used in **feature selection** when working with numerical variables.
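
For feature selection this usually means scoring every numerical feature against a numerical target. A rough sketch, assuming a toy DataFrame with hypothetical columns `age`, `hours_spent`, `noise` and target `monthly_spend`, and an arbitrary 0.05 cutoff on the p-value:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical toy data, invented for illustration
df = pd.DataFrame({
    "age":           [22, 25, 31, 35, 41, 46, 52, 58, 63, 70],
    "hours_spent":   [1, 3, 2, 5, 4, 7, 6, 9, 8, 10],
    "noise":         [5, 1, 9, 3, 7, 2, 8, 4, 10, 6],
    "monthly_spend": [110, 150, 140, 210, 190, 260, 250, 330, 300, 400],
})

target = "monthly_spend"
rows = []
for col in df.columns.drop(target):
    rho, p = spearmanr(df[col], df[target])
    rows.append({"feature": col, "rho": rho, "p_value": p})

# Order features by absolute correlation strength, strongest first
scores = pd.DataFrame(rows).sort_values("rho", key=abs, ascending=False)
selected = scores.loc[scores["p_value"] < 0.05, "feature"].tolist()

print(scores)
print("Selected:", selected)
```

The cutoff and the choice to rank by absolute ρ are design decisions for this sketch. In practice the threshold depends on the dataset size and on how many features are being tested.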

---

Let's go step by step and manually calculate **Spearman's Rank Correlation Coefficient (ρ)** using an example.  



### **📌 Example Dataset**
We have two variables **X** and **Y**:

| X  | Y  |
|----|----|
| 10 | 200 |
| 20 | 150 |
| 30 | 300 |
| 40 | 250 |
| 50 | 400 |

We will calculate **Spearman's rank correlation coefficient** step by step.



## **🔢 Step 1: Rank the Data**
Spearman’s correlation works on **ranks** instead of actual values.

### **Ranking X and Y:**
The smallest value gets **rank 1**, the second smallest gets **rank 2**, and so on.

| X  | Rank(X) | Y  | Rank(Y) |
|----|--------|----|--------|
| 10 | 1      | 200 | 2      |
| 20 | 2      | 150 | 1      |
| 30 | 3      | 300 | 4      |
| 40 | 4      | 250 | 3      |
| 50 | 5      | 400 | 5      |

Now, we will **calculate the difference** between the ranks of each pair.



## **🔢 Step 2: Compute Rank Differences (d) and Square Them (d²)**

$$
d_i = \text{Rank}(X) - \text{Rank}(Y)
$$

| X  | Rank(X) | Y  | Rank(Y) | $ d_i $ | $ d_i^2 $ |
|----|--------|----|--------|------|------|
| 10 | 1      | 200 | 2      | -1   | 1    |
| 20 | 2      | 150 | 1      | 1    | 1    |
| 30 | 3      | 300 | 4      | -1   | 1    |
| 40 | 4      | 250 | 3      | 1    | 1    |
| 50 | 5      | 400 | 5      | 0    | 0    |

### **Sum of $ d_i^2 $ values:**
$$
\sum d_i^2 = 1 + 1 + 1 + 1 + 0 = 4
$$



## **🔢 Step 3: Apply Spearman’s Correlation Formula**
$$
ρ = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$

where:
- $ \sum d_i^2 = 4 $ (Sum of squared rank differences)
- $ n = 5 $ (Number of data points)

Now, plug in the values:

$$
ρ = 1 - \frac{6(4)}{5(5^2 - 1)}
$$

$$
ρ = 1 - \frac{24}{5(25 - 1)}
$$

$$
ρ = 1 - \frac{24}{5 \times 24}
$$

$$
ρ = 1 - \frac{24}{120}
$$

$$
ρ = 1 - 0.2
$$

$$
ρ = 0.8
$$



## **🎯 Interpretation of Spearman's Correlation**
- **ρ = 0.8** means **strong positive correlation**.
- As **X increases, Y tends to increase** as well.
- The relationship is not perfectly monotonic (two pairs of neighbouring ranks are swapped), but the ranking order is **mostly preserved**.



## **🔢 Verify with Python**
Let’s verify our manual calculation with `scipy.stats.spearmanr`:

```python
from scipy.stats import spearmanr

X = [10, 20, 30, 40, 50]
Y = [200, 150, 300, 250, 400]

rho, _ = spearmanr(X, Y)
print(f"Spearman's correlation coefficient: {rho:.4f}")
```

✅ **Output:**  
```
Spearman's correlation coefficient: 0.8000
```
Matches our manual calculation! 🎯



### **🎯 Summary**
✅ **Spearman’s correlation works by ranking the data first**.  
✅ **It measures how well the rank order is preserved** between two variables.  
✅ **ρ = 0.8 means strong positive correlation**, even if the actual values don’t follow a perfect straight line.

---

The **Kruskal-Wallis test** is a **non-parametric statistical test** used to compare three or more independent groups to determine if they come from the same distribution. It is an alternative to the one-way ANOVA when the assumption of normality isn’t met.

### **🚀 Kruskal-Wallis Test: A Deep Dive**
The Kruskal-Wallis test is based on **ranks** rather than actual data values, making it robust against outliers and non-normal distributions.

🔹 **When to Use?**  
- When comparing **three or more** independent groups  
- When data is **not normally distributed**  
- When sample sizes are small  

🔹 **How Does It Work?**
1. **Rank all values** from all groups together, from lowest to highest.  
2. Compute the **sum of ranks** for each group.  
3. Use the **Kruskal-Wallis H statistic** formula:  

   $$
   H = \frac{12}{N(N+1)} \sum \frac{R_i^2}{n_i} - 3(N+1)
   $$

   where:
   - $ N $ = Total number of observations  
   - $ n_i $ = Number of observations in each group  
   - $ R_i $ = Sum of ranks for each group  

4. Compare **H** to the critical value from the chi-square distribution with $ k-1 $ degrees of freedom.

5. If **p-value < significance level (0.05)**, we **reject the null hypothesis** → At least one group is different.
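
The same steps translate into a short function. This is a sketch, not SciPy's implementation: `kruskal_h` is a name made up here, and it applies the plain formula above without the tie correction that `scipy.stats.kruskal` adds, so the two agree only when no values are tied.

```python
import numpy as np
from scipy.stats import rankdata, chi2, kruskal

def kruskal_h(*groups):
    """H statistic from the plain formula (no tie correction)."""
    sizes = [len(g) for g in groups]
    ranks = rankdata(np.concatenate(groups))      # step 1: rank everything together
    N = len(ranks)
    total = 0.0
    start = 0
    for n_i in sizes:
        R_i = ranks[start:start + n_i].sum()      # step 2: rank sum per group
        total += R_i**2 / n_i
        start += n_i
    H = 12 / (N * (N + 1)) * total - 3 * (N + 1)  # step 3: H statistic
    p = chi2.sf(H, df=len(groups) - 1)            # steps 4-5: chi-square p-value
    return H, p

# Made-up groups with no tied values
g1 = [1.2, 3.4, 2.2, 4.1]
g2 = [5.0, 6.3, 4.7, 7.7]
g3 = [2.9, 8.1, 9.4, 6.8]

H, p = kruskal_h(g1, g2, g3)
H_ref, p_ref = kruskal(g1, g2, g3)
print(f"manual H = {H:.4f}, p = {p:.4f}")
print(f"scipy  H = {H_ref:.4f}, p = {p_ref:.4f}")
```

Using `chi2.sf` replaces the table lookup in step 4: it returns the p-value directly, which is then compared with the significance level.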



### **🌟 Example Scenario**
Imagine you're a **data scientist analyzing customer satisfaction** across three different stores (**A, B, and C**). You collect ratings from customers and want to check if satisfaction levels differ.

1. **Step 1: Collect Data**  
   - Store A: [4, 5, 6, 7, 8]  
   - Store B: [2, 3, 4, 5, 6]  
   - Store C: [7, 8, 9, 10, 10]  

2. **Step 2: Rank Data Across All Stores**
   - Rank values from 1 (lowest) to highest.
   
3. **Step 3: Compute H-statistic**  

4. **Step 4: Compare with Chi-Square Table**  

5. **Step 5: Interpret Results**  
   - If **p < 0.05**, at least one store’s satisfaction level is significantly different.
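
Steps 2 to 4 are handled by `scipy.stats.kruskal`, so the store comparison can be run in one call. A minimal sketch using the ratings listed above:

```python
from scipy.stats import kruskal

store_a = [4, 5, 6, 7, 8]
store_b = [2, 3, 4, 5, 6]
store_c = [7, 8, 9, 10, 10]

h_stat, p_value = kruskal(store_a, store_b, store_c)

print(f"H-statistic: {h_stat:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("At least one store's ratings differ significantly.")
else:
    print("No significant difference detected.")
```

For these ratings the p-value falls below 0.05, so the test points to a difference between stores without saying which store is responsible.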



### **🎯 Key Takeaways**
✅ Non-parametric → Works even if assumptions of normality fail  
✅ Compares multiple groups efficiently  
✅ Doesn’t require equal sample sizes  
✅ If **significant**, follow up with post-hoc tests like **Dunn’s Test** to identify which groups differ.
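
Dunn's test is not part of `scipy.stats` (third-party packages such as `scikit-posthocs` provide it), so the sketch below substitutes a simpler follow-up: pairwise Mann-Whitney U tests with a Bonferroni correction. That is a different procedure from Dunn's test, shown only to illustrate the idea of a post-hoc comparison. It reuses the store ratings from the scenario above.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

groups = {
    "A": [4, 5, 6, 7, 8],
    "B": [2, 3, 4, 5, 6],
    "C": [7, 8, 9, 10, 10],
}

pairs = list(combinations(groups, 2))
alpha = 0.05
adjusted = {}
for a, b in pairs:
    _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    # Bonferroni: multiply each p-value by the number of comparisons
    adjusted[(a, b)] = min(p * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    verdict = "differ" if p_adj < alpha else "no clear difference"
    print(f"{pair[0]} vs {pair[1]}: adjusted p = {p_adj:.4f} -> {verdict}")
```

Bonferroni is conservative, and with only five ratings per store the pairwise tests have little power, so a non-significant pair here does not show that the two stores are the same.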

---

Let's go step by step through a **manual calculation** of the **Kruskal-Wallis Test** with an example.  



## **📌 Example: Comparing Exam Scores of 3 Classes**  
Suppose we have exam scores from **three different classes**:  

| **Class A** | **Class B** | **Class C** |  
|------------|------------|------------|  
| 85         | 88         | 90         |  
| 80         | 75         | 95         |  
| 78         | 85         | 98         |  
| 92         | 82         | 89         |  

Our goal is to check **if there is a significant difference in scores** across the three classes.  



### **Step 1: Rank All Data Together**
We **combine all values** and **rank them from lowest to highest**. If two values are the same, assign them the **average rank**.

| Score | Rank | Group |
|--------|------|--------|
| 75     | 1    | B      |
| 78     | 2    | A      |
| 80     | 3    | A      |
| 82     | 4    | B      |
| 85     | 5.5  | A      |
| 85     | 5.5  | B      |
| 88     | 7    | B      |
| 89     | 8    | C      |
| 90     | 9    | C      |
| 92     | 10   | A      |
| 95     | 11   | C      |
| 98     | 12   | C      |



### **Step 2: Calculate Rank Sums for Each Group**  
Now, sum up the ranks for each class:

- **Class A**: $ R_A = 2 + 3 + 5.5 + 10 = 20.5 $  
- **Class B**: $ R_B = 1 + 4 + 5.5 + 7 = 17.5 $  
- **Class C**: $ R_C = 8 + 9 + 11 + 12 = 40 $  

Total number of observations:  
$$
N = 12
$$



### **Step 3: Apply Kruskal-Wallis Formula**
$$
H = \frac{12}{N(N+1)} \sum \frac{R_i^2}{n_i} - 3(N+1)
$$

Where:  
- $ N = 12 $ (total number of observations)  
- $ n_A = 4 $, $ n_B = 4 $, $ n_C = 4 $ (each group has 4 values)  
- $ R_A = 20.5 $, $ R_B = 17.5 $, $ R_C = 40 $  

First, calculate each term:

$$
\frac{(R_A)^2}{n_A} = \frac{(20.5)^2}{4} = \frac{420.25}{4} = 105.06
$$

$$
\frac{(R_B)^2}{n_B} = \frac{(17.5)^2}{4} = \frac{306.25}{4} = 76.56
$$

$$
\frac{(R_C)^2}{n_C} = \frac{(40)^2}{4} = \frac{1600}{4} = 400
$$

Now, compute:

$$
\sum \frac{R_i^2}{n_i} = 105.06 + 76.56 + 400 = 581.62
$$

$$
H = \frac{12}{12(13)} (581.62) - 3(13)
$$

$$
H = \frac{12}{156} \times 581.62 - 39
$$

$$
H = 44.7 - 39
$$

$$
H = 5.7
$$



### **Step 4: Compare with Chi-Square Critical Value**
The Kruskal-Wallis test statistic $ H $ follows a **chi-square distribution** with $ k - 1 $ degrees of freedom.  
- Here, $ k = 3 $ (number of groups), so $ df = 3 - 1 = 2 $.  
- At **α = 0.05**, the **chi-square critical value** for $ df = 2 $ is **5.99**.

Since **H = 5.7** is **less than 5.99**, we **fail to reject the null hypothesis**.  
👉 **Conclusion**: At α = 0.05, there is no statistically significant evidence that the three classes differ.
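
The hand calculation can be checked against `scipy.stats.kruskal`. SciPy additionally applies a correction for tied ranks (the two scores of 85 here), so its H comes out slightly higher than the plain formula gives:

```python
from scipy.stats import kruskal

class_a = [85, 80, 78, 92]
class_b = [88, 75, 85, 82]
class_c = [90, 95, 98, 89]

h_stat, p_value = kruskal(class_a, class_b, class_c)

print(f"H-statistic: {h_stat:.4f}, p-value: {p_value:.4f}")
```

The tie-corrected H is about 5.76 with a p-value of about 0.056, which is still above 0.05, so the conclusion is unchanged.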


### **🔹 Summary**
✅ We ranked data, summed ranks per group, and applied the Kruskal-Wallis formula.  
✅ We compared the test statistic to a chi-square table.  
✅ Since **H < critical value**, we **fail to reject the null hypothesis**: no significant difference was detected.  

---

Let’s break this down **step by step in the simplest way possible** with a real-world example.  



### **🍕 Imagine You Are Comparing Pizza Quality in 3 Restaurants**
Let's say you and your friends visit **three different pizza places (A, B, and C)** and rate their pizzas.  
Each of you gives a score from 1 to 10.  

Here are the ratings:  

| **Pizza Place A** | **Pizza Place B** | **Pizza Place C** |  
|-----------------|-----------------|-----------------|  
| 8.5           | 8.8           | 9.0           |  
| 8.0           | 7.5           | 9.5           |  
| 7.8           | 8.5           | 9.8           |  
| 9.2           | 8.2           | 8.9           |  

Our goal: **Do these pizza places have the same quality, or is at least one significantly better or worse?**  



### **📌 Step 1: Convert Raw Scores to Ranks**
Instead of using the actual ratings, we **rank all scores** together from lowest to highest.  

| Score | Rank | Pizza Place |
|--------|------|------------|
| 7.5  | 1    | B |
| 7.8  | 2    | A |
| 8.0  | 3    | A |
| 8.2  | 4    | B |
| 8.5  | 5.5  | A |
| 8.5  | 5.5  | B |
| 8.8  | 7    | B |
| 8.9  | 8    | C |
| 9.0  | 9    | C |
| 9.2  | 10   | A |
| 9.5  | 11   | C |
| 9.8  | 12   | C |

🔹 **Why do we rank?**  
Because Kruskal-Wallis doesn’t care about actual numbers, only their order. It looks at whether one group consistently has higher or lower ranks than others.



### **📌 Step 2: Add Up the Ranks for Each Pizza Place**
Now, let’s sum up the ranks for each restaurant:

- **Pizza Place A:**  
  $ 2 + 3 + 5.5 + 10 = 20.5 $  
- **Pizza Place B:**  
  $ 1 + 4 + 5.5 + 7 = 17.5 $  
- **Pizza Place C:**  
  $ 8 + 9 + 11 + 12 = 40 $  

Now we have total rank sums for each place.



### **📌 Step 3: Plug the Numbers into the Kruskal-Wallis Formula**
We use the formula:

$$
H = \frac{12}{N(N+1)} \sum \frac{R_i^2}{n_i} - 3(N+1)
$$

Where:  
- $ N = 12 $ (total ratings)  
- $ n_A = 4 $, $ n_B = 4 $, $ n_C = 4 $ (each place has 4 ratings)  
- $ R_A = 20.5 $, $ R_B = 17.5 $, $ R_C = 40 $  

Let’s calculate:

$$
H = \frac{12}{12(13)} ( \frac{20.5^2}{4} + \frac{17.5^2}{4} + \frac{40^2}{4}) - 3(13)
$$

$$
H = \frac{12}{156} (105.06 + 76.56 + 400) - 39
$$

$$
H = \frac{12}{156} \times 581.62 - 39
$$

$$
H = 44.7 - 39
$$

$$
H = 5.7
$$



### **📌 Step 4: Compare to the Chi-Square Table**
We now check if **5.7 is a big enough number** to say there’s a real difference.

- The Kruskal-Wallis test follows a **chi-square distribution**.
- For **three groups (A, B, C)**, we look at **df = 3 - 1 = 2** in the chi-square table.
- **Critical value at α = 0.05** is **5.99**.

Since **5.7 < 5.99**, we **fail to reject the null hypothesis**.  
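
The table lookup can also be done in code. A short sketch using `scipy.stats.chi2`, with H recomputed from the rank sums 20.5, 17.5 and 40 and the plain formula (no tie correction):

```python
from scipy.stats import chi2

# H from the rank sums of the three groups (4 ratings each, N = 12)
H = 12 / (12 * 13) * (20.5**2 / 4 + 17.5**2 / 4 + 40**2 / 4) - 3 * 13
df = 3 - 1
alpha = 0.05

critical = chi2.ppf(1 - alpha, df)   # the value a chi-square table would list
p_value = chi2.sf(H, df)             # probability of an H at least this large

print(f"H = {H:.2f}, critical value = {critical:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if H > critical else "Fail to reject H0")
```

Comparing H with the critical value and comparing the p-value with α are two views of the same decision, so they always agree.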



### **📢 Final Answer: No Significant Difference in Pizza Quality!**
Even though the numbers looked a little different, the Kruskal-Wallis test tells us that **there is no strong evidence that one pizza place is better than the others**.  

✅ **If the H value were greater than 5.99, we would say at least one place is significantly different.**  
✅ **If we got a significant result, we’d do a "post-hoc test" to find which pizza place is better.**  



### **🔹 In the Simplest Terms:**
1. **Sort all scores from lowest to highest** and assign ranks.  
2. **Add up the ranks for each group.**  
3. **Use the formula to compute H.**  
4. **Compare H to a threshold (chi-square table).**  
5. **If H is below the threshold → no evidence of a difference. If H is above it → at least one group is different.**  
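
In a feature-selection setting the groups come from a categorical column. A sketch assuming a toy DataFrame with a hypothetical `product_type` feature and a numerical `revenue` target; the groups are built with `groupby` and unpacked into `kruskal`:

```python
import pandas as pd
from scipy.stats import kruskal

# Hypothetical toy data, invented for illustration
df = pd.DataFrame({
    "product_type": ["basic"] * 5 + ["plus"] * 5 + ["pro"] * 5,
    "revenue": [12, 15, 11, 14, 13,
                21, 19, 24, 22, 20,
                35, 31, 38, 33, 36],
})

# One array of target values per category of the feature
samples = [grp["revenue"].to_numpy() for _, grp in df.groupby("product_type")]

h_stat, p_value = kruskal(*samples)
keep_feature = p_value < 0.05

print(f"H = {h_stat:.4f}, p = {p_value:.4f}, keep feature: {keep_feature}")
```

The three revenue ranges do not overlap at all in this toy data, so H is large and the feature is kept. The 0.05 cutoff is an assumption of the sketch, not a fixed rule.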

---