**R² and P-Value Notes**

### R² (Coefficient of Determination)
- Measures how well a regression model explains the variance in the dependent variable.
- Formula:
  $
   R^2 = 1 - \frac{SS_{res}}{SS_{tot}} 
  $
  
  Where:
  
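  - $ SS_{res} = \sum (Y_i - \hat{Y}_i)^2 $ = Sum of Squares of Residuals, where $ \hat{Y}_i $ is the value predicted by the model
  - $ SS_{tot} = \sum (Y_i - \bar{Y})^2 $ = Total Sum of Squares, where $ \bar{Y} $ is the mean of the observed values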
  
- Interpretation:
  
  - $ R^2 = 1 $: Perfect fit
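  - $ R^2 = 0 $: Model explains none of the variance (no better than always predicting $ \bar{Y} $)
  - Higher $ R^2 $ means the model explains more of the variance in the data it was fitted to. Because $ R^2 $ never decreases when predictors are added, a higher value alone does not prove a better model.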
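
The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `r_squared` and the sample values are hypothetical and chosen only to reproduce the two boundary cases from the interpretation list.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]              # hypothetical observations
perfect = r_squared(y, y)             # predictions equal the observations
baseline = r_squared(y, [2.5] * 4)    # always predict the mean of y
print(f"perfect fit: {perfect}, mean-only model: {baseline}")
```

A perfect fit gives 1 and a model that always predicts the mean gives 0, matching the interpretation above.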
  
### P-Value for R²
- Tests whether the predictors, taken together, explain a statistically significant share of the variance in the dependent variable.
- Calculated using the F-test:
  $
   F = \frac{MSR}{MSE} = \frac{(SS_{tot} - SS_{res}) / p}{SS_{res} / (n - p - 1)} 
  $
  
  Where:
  
  - $ MSR $ = Mean Square Regression
  - $ MSE $ = Mean Square Error
  - $ p $ = Number of predictors
  - $ n $ = Sample size
  
- Interpretation:
  - A small p-value (for example $ < 0.05 $) is evidence against the null hypothesis that all slope coefficients are zero, meaning the model as a whole is statistically significant.
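
As a rough sketch of how the F-statistic turns into a p-value, the snippet below evaluates the upper tail of the F-distribution with SciPy's `scipy.stats.f.sf`. The helper name `f_test_from_sums` and the input numbers are hypothetical, picked only to illustrate the formula.

```python
from scipy.stats import f

def f_test_from_sums(ss_tot, ss_res, n, p):
    """Overall F-statistic and p-value for a regression with p predictors."""
    msr = (ss_tot - ss_res) / p        # mean square regression
    mse = ss_res / (n - p - 1)         # mean square error
    f_stat = msr / mse
    # Upper-tail probability of an F(p, n - p - 1) distribution
    p_value = f.sf(f_stat, p, n - p - 1)
    return f_stat, p_value

# Hypothetical inputs: SS_tot = 100, SS_res = 20, n = 30 observations, p = 2 predictors
f_stat, p_value = f_test_from_sums(100.0, 20.0, n=30, p=2)
print(f"F = {f_stat:.2f}, p-value = {p_value:.2e}")
```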

### Example Calculation
Let's calculate R² and p-value for a simple dataset:

| X (Independent Variable) | Y (Dependent Variable) |
|--------------------------|--------------------------|
| 1                        | 2                        |
| 2                        | 2.8                      |
| 3                        | 3.6                      |
| 4                        | 4.5                      |
| 5                        | 5.1                      |

#### Step 1: Compute Mean of Y
$
 \bar{Y} = \frac{2 + 2.8 + 3.6 + 4.5 + 5.1}{5} = 3.6 
$
#### Step 2: Compute SS_tot
$
 SS_{tot} = \sum (Y_i - \bar{Y})^2 
 = (2-3.6)^2 + (2.8-3.6)^2 + (3.6-3.6)^2 + (4.5-3.6)^2 + (5.1-3.6)^2 
 = 2.56 + 0.64 + 0 + 0.81 + 2.25 = 6.26 
$
#### Step 3: Fit a Regression Line
Least squares regression gives the slope $ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{7.9}{10} = 0.79 $ and the intercept $ a = \bar{Y} - b\bar{X} = 3.6 - 0.79 \times 3 = 1.23 $, so:
$
 Y = 0.79X + 1.23 
$
#### Step 4: Compute SS_res

The predicted values $ \hat{Y}_i $ for X = 1 to 5 are 2.02, 2.81, 3.60, 4.39, and 5.18.

$
 SS_{res} = \sum (Y_i - \hat{Y}_i)^2 
 = (2-2.02)^2 + (2.8-2.81)^2 + (3.6-3.60)^2 + (4.5-4.39)^2 + (5.1-5.18)^2 
 = 0.0004 + 0.0001 + 0 + 0.0121 + 0.0064 = 0.019 
$
#### Step 5: Compute R²
$
R^2 = 1 - \frac{0.019}{6.26} = 1 - 0.0030 = 0.997 
$
#### Step 6: Compute F-Statistic
$
F = \frac{(6.26 - 0.019) / 1}{0.019 / (5-1-1)} 
 = \frac{6.241}{0.006333} \approx 985 
$
With 1 and 3 degrees of freedom, the F-distribution gives a very small p-value (< 0.001), indicating strong statistical significance.

### Conclusion
- The model explains about 99.7% of the variance in Y.
- The p-value is very low, meaning the relationship is statistically significant.

This demonstrates how R² and p-values help evaluate model performance and reliability.
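
The hand calculation above can be cross-checked numerically. The sketch below is one way to do it, assuming NumPy and SciPy are available; `scipy.stats.linregress` is used for the fit and the variable names are illustrative. Because it works with unrounded coefficients, its output may differ somewhat from figures rounded by hand.

```python
import numpy as np
from scipy.stats import f, linregress

# Dataset from the example table
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 3.6, 4.5, 5.1])

fit = linregress(x, y)                    # ordinary least squares fit
y_pred = fit.intercept + fit.slope * x

ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(x), 1                          # sample size, number of predictors
f_stat = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
p_value = f.sf(f_stat, p, n - p - 1)      # upper tail of F(1, 3)

print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
print(f"SS_tot = {ss_tot:.3f}, SS_res = {ss_res:.3f}, R^2 = {r2:.3f}")
print(f"F = {f_stat:.1f}, p-value = {p_value:.2e}")
```

With a single predictor, the p-value of the overall F-test coincides with the p-value `linregress` reports for the slope, which gives a second check on the result.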



The notes below cover the **T-test and ANOVA test**, their relationship with **p-values**, their **mathematical formulas**, and their implementation in Python using **SciPy and scikit-learn (sklearn)**.  

---

## **T-Test and ANOVA Test**  

### **1. T-Test**  
A **T-test** is used to compare the means of two groups to determine whether they are statistically different from each other.

#### **Types of T-Tests**  
- **Independent (Unpaired) T-test**: Compares means from two different groups.  
- **Paired T-test**: Compares means from the same group at different times.  
- **One-sample T-test**: Compares the sample mean to a known population mean.

#### **Mathematical Formula**  
For the independent two-sample T-test with equal variances assumed, the T-statistic is calculated as:  

$
t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
$

Where:  
- $ \bar{X}_1, \bar{X}_2 $ = Means of the two groups  
- $ n_1, n_2 $ = Sample sizes  
- $ s_p $ = Pooled standard deviation  

$
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
$

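where $ s_1, s_2 $ are the sample standard deviations of the two groups. Under the null hypothesis, $ t $ follows a t-distribution with $ n_1 + n_2 - 2 $ degrees of freedom.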
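
To make the pooled formula concrete, the sketch below computes $ s_p $ and $ t $ step by step and compares the result with `scipy.stats.ttest_ind`. It assumes NumPy and SciPy are available; the names `group_a`, `group_b`, `t_manual`, and `p_manual` are illustrative, and the data are small sample values.

```python
import numpy as np
from scipy.stats import t, ttest_ind

group_a = np.array([23, 21, 25, 22, 24], dtype=float)
group_b = np.array([28, 27, 29, 30, 26], dtype=float)
n1, n2 = len(group_a), len(group_b)

# Sample variances (ddof=1 divides by n - 1)
var1, var2 = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled standard deviation
s_p = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# t-statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom
t_manual = (group_a.mean() - group_b.mean()) / (s_p * np.sqrt(1 / n1 + 1 / n2))
p_manual = 2 * t.sf(abs(t_manual), n1 + n2 - 2)

# SciPy's equal-variance t-test should agree with the manual calculation
t_scipy, p_scipy = ttest_ind(group_a, group_b)
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.5f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.5f}")
```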

#### **Implementation in Python (T-test)**
```python
from scipy.stats import ttest_ind

# Example data
group1 = [23, 21, 25, 22, 24]
group2 = [28, 27, 29, 30, 26]

# Perform independent t-test (equal variances assumed by default;
# pass equal_var=False for Welch's t-test)
t_stat, p_value = ttest_ind(group1, group2)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
```

---

### **2. ANOVA (Analysis of Variance)**
The **ANOVA test** is used to compare the means of **three or more** groups to check if at least one group mean is significantly different.

#### **Types of ANOVA**  
- **One-way ANOVA**: Compares means of three or more independent groups.  
- **Two-way ANOVA**: Examines the influence of two independent variables on the dependent variable.

#### **Mathematical Formula for One-Way ANOVA**  
The **F-statistic** is given by:

$
F = \frac{\text{Between-group variability}}{\text{Within-group variability}}
$

Where:  
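- **Between-group variability** = Mean square between groups (variation of the group means around the overall mean)  
- **Within-group variability** = Mean square within groups (pooled variation of observations around their own group mean)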

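$
F = \frac{\sum_{i} n_i (\bar{X}_i - \bar{X})^2 / (k-1)}{\sum_{i} \sum_{j} (X_{ij} - \bar{X}_i)^2 / (N-k)}
$

Where:  
- $ k $ = Number of groups  
- $ N $ = Total number of observations  
- $ n_i $ = Number of observations in group $ i $  
- $ X_{ij} $ = Observation $ j $ in group $ i $  
- $ \bar{X}_i $ = Mean of group $ i $  
- $ \bar{X} $ = Overall mean  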
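
The F-statistic above can be worked through by hand in code. The following is a minimal sketch, assuming NumPy and SciPy are available; the variable names are illustrative and the three groups are small sample values. It compares the manual result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f, f_oneway

groups = [
    np.array([23, 21, 25, 22, 24], dtype=float),
    np.array([28, 27, 29, 30, 26], dtype=float),
    np.array([35, 32, 31, 33, 34], dtype=float),
]
k = len(groups)                          # number of groups
N = sum(len(g) for g in groups)          # total number of observations
grand_mean = np.concatenate(groups).mean()

# Numerator and denominator sums of squares from the formula
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (N - k))
p_manual = f.sf(f_manual, k - 1, N - k)  # upper tail of F(k - 1, N - k)

# SciPy's one-way ANOVA should agree with the manual calculation
f_scipy, p_scipy = f_oneway(*groups)
print(f"manual: F = {f_manual:.2f}, p = {p_manual:.2e}")
print(f"scipy:  F = {f_scipy:.2f}, p = {p_scipy:.2e}")
```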

#### **Implementation in Python (ANOVA)**
```python
from scipy.stats import f_oneway

# Example data for three groups
group1 = [23, 21, 25, 22, 24]
group2 = [28, 27, 29, 30, 26]
group3 = [35, 32, 31, 33, 34]

# Perform one-way ANOVA
f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat}, P-value: {p_value}")
```

---

## **Relation to P-Values**
- **P-value** is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually obtained.
- If **p-value < α** (commonly 0.05) → Reject the null hypothesis (significant difference).
- If **p-value ≥ α** → Fail to reject the null hypothesis (no significant difference detected).
- **T-tests** compare two groups, while **ANOVA** generalizes this comparison for three or more groups.
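
The link between the two tests can be checked directly: with exactly two groups, the one-way ANOVA F-statistic equals the square of the equal-variance t-statistic, and both tests return the same p-value. The sketch below illustrates this and the decision rule, assuming SciPy is available; the data and the names `alpha` and `reject_null` are illustrative.

```python
from scipy.stats import f_oneway, ttest_ind

group_a = [23, 21, 25, 22, 24]
group_b = [28, 27, 29, 30, 26]

t_stat, p_t = ttest_ind(group_a, group_b)   # equal variances assumed (default)
f_stat, p_f = f_oneway(group_a, group_b)    # one-way ANOVA on the same two groups

alpha = 0.05                                # chosen significance level
reject_null = p_t < alpha                   # decision rule from the list above

print(f"t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
print(f"p (t-test) = {p_t:.5f}, p (ANOVA) = {p_f:.5f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```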

---

## **Using Sklearn for Feature Selection**
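Scikit-learn provides F-tests for univariate feature selection: `f_classif` computes the one-way ANOVA F-value of each feature against class labels, and `f_regression` computes a univariate linear-regression F-test of each feature against a continuous target. It has no separate T-test function, but with two classes the ANOVA F-statistic equals the square of the equal-variance t-statistic.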

### **Example: Feature Selection Using ANOVA in Scikit-learn**
```python
from sklearn.feature_selection import f_classif
import numpy as np

# Example dataset (X: features, y: labels)
X = np.array([[2, 4, 3], [1, 5, 7], [6, 8, 9], [5, 3, 1], [8, 6, 5]])
y = np.array([0, 1, 0, 1, 0])  # Binary labels

# Perform ANOVA F-test
F_values, p_values = f_classif(X, y)
print(f"F-values: {F_values}, P-values: {p_values}")
```
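
For a continuous target, `f_regression` plays the same role. The sketch below is an illustration only, assuming scikit-learn and NumPy are available; the synthetic data are hypothetical and constructed so that the target depends on the first feature alone.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Hypothetical synthetic data: 50 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Target depends only on feature 0, plus a small amount of noise
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# Univariate F-test of each feature against the continuous target
F_values, p_values = f_regression(X, y)
for i, (f_val, p_val) in enumerate(zip(F_values, p_values)):
    print(f"feature {i}: F = {f_val:.2f}, p = {p_val:.3g}")
```

The feature that drives the target should stand out with by far the largest F-value and the smallest p-value.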

---

### **Summary**
| Test | Purpose | Groups Compared | Test Statistic | Python Implementation |
|------|---------|----------------|----------------|----------------------|
| **T-Test** | Compare two means | Two | T-statistic | `ttest_ind()` |
| **ANOVA** | Compare multiple means | Three or more | F-statistic | `f_oneway()` |

Both tests rely on **p-values** to determine statistical significance.
