## Hypothesis Testing

### **Hypothesis Testing**
Hypothesis testing is a statistical method used to draw conclusions about a population based on sample data. It involves formulating two contrasting hypotheses:
- **Null Hypothesis ($H_0$)**: Assumes there is no relationship, effect, or difference in the population.
- **Alternative Hypothesis ($H_a$)**: Suggests the presence of a relationship, effect, or difference.

The goal is to determine whether there is enough evidence to reject the null hypothesis in favor of the alternative. Hypothesis testing plays a crucial role in statistical analysis and is often applied in predictive modeling and feature selection.

### **P-Value**
The **P-value** indicates the probability of observing a test statistic as extreme as, or more extreme than, the one obtained, assuming the null hypothesis ($H_0$) is true. It serves as a key measure for decision-making in hypothesis testing:
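- **$p < 0.05$**: The null hypothesis is rejected at the 5% significance level, and the data are taken as evidence for the alternative hypothesis. 
- **$p \geq 0.05$**: There is insufficient evidence to reject the null hypothesis. This does not prove that the null hypothesis is true.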

A p-value threshold of 0.05 is commonly used, but this threshold may vary depending on the context. In machine learning, p-values are often used to test the relevance of independent variables, helping to identify which features have a significant impact on the model's predictions.


### **Steps in Hypothesis Testing**
The process of hypothesis testing involves the following key steps:

1. **Formulate the Null Hypothesis ($H_0$)**:  
   Assume a null hypothesis that represents the default position (for example, no difference, no significance, or no relationship). The null hypothesis typically suggests that there is no anomaly or pattern present in the data.

2. **Collect a Sample**:  
   Gather a representative sample from the population to test the hypothesis.

3. **Compute the Test Statistic**:  
   Use the sample data to calculate a test statistic (like z-statistic, t-statistic, or chi-square) to measure the degree of agreement between the sample and the null hypothesis.

4. **Make a Decision**:  
   Based on the value of the test statistic and the corresponding p-value, decide whether to **reject** or **fail to reject** the null hypothesis. If the p-value is less than a specified threshold (typically 0.05), the null hypothesis is rejected in favor of the alternative hypothesis ($H_a$).

These steps form the basis for hypothesis testing, which is widely used in statistical analysis and machine learning for feature selection, model validation, and evaluating relationships between variables.
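
The four steps can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the sample is synthetic, the null value of 500 and the one-sided alternative are assumptions chosen for the sketch, and the `alternative` argument of `scipy.stats.ttest_1samp` requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration only (not data from the text):
# 40 draws from a normal distribution with mean 495 and standard deviation 8.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=495, scale=8, size=40)

mu0 = 500      # Step 1: value of the mean under the null hypothesis
alpha = 0.05   # significance level

# Step 3: test statistic computed by hand
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Lower-tail p-value for the alternative "mean is less than mu0"
p_manual = stats.t.cdf(t_manual, df=n - 1)

# The same test using SciPy's built-in one-sample t-test
result = stats.ttest_1samp(sample, popmean=mu0, alternative="less")

# Step 4: decision
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4g} -> {decision}")
```

The hand-computed statistic and p-value should agree with the values returned by SciPy, which is a useful sanity check when learning the formulas.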


### **Example of Hypothesis Testing**
A soft drink manufacturer claims that every bottle of their soda contains **at least 500 ml** of liquid. You suspect that the actual volume might be less than 500 ml. To verify this, you collect a random sample of 40 soda bottles and measure the volume in each. The sample shows an **average volume of 495 ml** with a **sample standard deviation of 8 ml**. Given a **significance level of 0.05**, can you reject the manufacturer's claim?

### **Step 1: Formulate Hypotheses**
- **Null Hypothesis ($H_0$):** The average volume of soda bottles is at least 500 ml.  
  \[
  H_0: \mu \geq 500
  \]
- **Alternative Hypothesis ($H_a$):** The average volume of soda bottles is less than 500 ml.  
  \[
  H_a: \mu < 500
  \]

---

### **Step 2: Collect the Sample**
- Sample size ($n$) = 40  
- Sample mean ($\bar{x}$) = 495 ml  
- Sample standard deviation ($s$) = 8 ml  

---

### **Step 3: Calculate the Test Statistic**
To calculate the **t-statistic**, we use the formula:  
\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
\]
Where:  
- $\bar{x}$ = Sample mean (495)  
- $\mu_0$ = Population mean under $H_0$ (500)  
- $s$ = Sample standard deviation (8)  
- $n$ = Sample size (40)  

\[
t = \frac{495 - 500}{8 / \sqrt{40}} = \frac{-5}{8 / 6.32} = \frac{-5}{1.265} \approx -3.95
\]

---

### **Step 4: Make a Decision**
- **Significance level ($\alpha$) = 0.05**  
- Find the critical t-value for a one-tailed test with **n-1 = 40 - 1 = 39 degrees of freedom**.  
  Using a t-table, the critical t-value for $\alpha = 0.05$ and **df = 39** is **-1.685**.  
  Since **$t = -3.95$** is less than **$-1.685$**, we reject the null hypothesis.

---

### **Conclusion**
Since the t-statistic **(-3.95)** lies in the rejection region (beyond -1.685), we reject the null hypothesis ($H_0$).  
**Conclusion:** There is sufficient evidence at the 0.05 significance level to reject the claim that each soda bottle contains at least 500 ml of liquid. The data suggests that the actual average volume is less than 500 ml.


In [5]:
from scipy import stats
import numpy as np

xbar = 495   # sample mean (ml)
mu0 = 500    # mean under the null hypothesis (ml)
s = 8        # sample standard deviation (ml)
n = 40       # sample size

# Test statistic
t_smple = (xbar - mu0) / (s / np.sqrt(float(n)))
print("Test Statistic:", round(t_smple, 2))

# Critical value from the t-distribution (lower tail)
alpha = 0.05
t_alpha = stats.t.ppf(alpha, n - 1)
print("Critical value from t-table:", round(t_alpha, 3))

# Lower-tail p-value; by symmetry of the t-distribution,
# P(T <= t) for a negative t equals P(T >= |t|)
p_val = stats.t.sf(np.abs(t_smple), n - 1)
print("Lower tail p-value from t-table", p_val)

Test Statistic: -3.95
Critical value from t-table: -1.685
Lower tail p-value from t-table 0.00015761816112839098


### **Type I and Type II Errors in Hypothesis Testing**

When conducting hypothesis testing, we make decisions about the null hypothesis ($H_0$) based on sample data, not the entire population. As a result, there are two possible errors that can occur in this decision-making process:

1. **Type I Error ($\alpha$)**:  
   Occurs when we **reject a true null hypothesis**.  
   In other words, we conclude that an effect or difference exists when, in reality, it does not.  
   This is also known as a **"false positive"**.  
   Example: 
   - Suppose a drug company claims that a new medication has no side effects (null hypothesis: no side effects). 
   - If a study incorrectly finds evidence of side effects when, in reality, there are none, this is a Type I error.

2. **Type II Error ($\beta$)**:  
   Occurs when we **fail to reject a false null hypothesis**.  
   In other words, we conclude that there is no effect or difference when, in reality, there is one.  
   This is also known as a **"false negative"**.  
   Example: 
   - Suppose a medical test is being conducted to detect a disease.  
   - If the test fails to detect the disease in a patient who actually has it, this is a Type II error.  
   - This could happen if the sample size is too small or if the test is not sensitive enough.

---

### **Example of Type I and Type II Errors**

**Scenario**: A company claims that their bottled water has exactly **500 ml of water** on average.  
- **Null Hypothesis ($H_0$):** The average volume of water in the bottle is 500 ml.  
- **Alternative Hypothesis ($H_a$):** The average volume of water in the bottle is not 500 ml.  

| **Decision**               | **Reality (H0 is True)**                  | **Reality (H0 is False)**                    |
|--------------------------|--------------------------------------------|---------------------------------------------|
| **Reject $H_0$**           | **Type I Error** (False positive) – We conclude that the bottle does not have 500 ml, but it actually does. | **Correct Decision** – We correctly identify that the bottle does not have 500 ml. |
| **Fail to Reject $H_0$**   | **Correct Decision** – We correctly conclude that the bottle has 500 ml.  | **Type II Error** (False negative) – We fail to detect that the bottle does not have 500 ml. |

---

### **Reducing Type I and Type II Errors**
- **To reduce Type I Error ($\alpha$)**, decrease the significance level (e.g., from 0.05 to 0.01). At a fixed sample size, this makes a Type II error more likely.  
- **To reduce Type II Error ($\beta$)**, increase the sample size, which raises the test's power ($1 - \beta$), the probability of detecting an effect that is really there.  

---
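
Both error rates can be estimated by simulation. The sketch below is illustrative only: the population values (mean 500 or 495, standard deviation 8), the sample sizes, and the helper name `rejection_rate` are assumptions made for this example, and a two-sided one-sample t-test is used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
alpha = 0.05
n_trials = 2000
mu0, sigma = 500.0, 8.0   # illustrative values, chosen for this sketch


def rejection_rate(true_mean, n):
    """Fraction of simulated samples in which H0: mu = mu0 is rejected."""
    samples = rng.normal(loc=true_mean, scale=sigma, size=(n_trials, n))
    p_values = stats.ttest_1samp(samples, popmean=mu0, axis=1).pvalue
    return float(np.mean(p_values < alpha))


# H0 is true: every rejection is a Type I error, so the rate should be near alpha
type_1_rate = rejection_rate(true_mean=500.0, n=40)

# H0 is false (true mean is 495): the rejection rate is the power, 1 - beta
power_small = rejection_rate(true_mean=495.0, n=10)
power_large = rejection_rate(true_mean=495.0, n=40)

print(f"Type I error rate (n=40):  {type_1_rate:.3f}")
print(f"Type II error rate (n=10): {1 - power_small:.3f}")
print(f"Type II error rate (n=40): {1 - power_large:.3f}")
```

The Type I rate stays close to the chosen significance level regardless of sample size, while the Type II rate drops as the sample grows.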

In summary:  
- **Type I Error**: False positive – rejecting a true null hypothesis.  
- **Type II Error**: False negative – failing to reject a false null hypothesis.  
Both errors have consequences, and balancing them is essential for good decision-making in hypothesis testing.


### **What is a Normal Distribution?**
A **normal distribution** (also known as a Gaussian distribution or bell curve) is a probability distribution that is symmetric about its mean. It represents the distribution of many natural phenomena and datasets.

The **probability density function (PDF)** for a normal distribution is given by:

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$$

Where:  
- $x$: A data point  
- $\mu$: Mean (center) of the distribution  
- $\sigma$: Standard deviation (spread or width of the curve)  
- $\pi$ and $e$: Mathematical constants  

---

### **Key Characteristics of Normal Distribution**
1. **Symmetry**: The distribution is symmetric around the mean.  
2. **Mean = Median = Mode**: These three measures of central tendency are identical.  
3. **Bell Shape**: Most data points are concentrated around the mean, with fewer as you move away.  
4. **Empirical Rule (68-95-99.7 Rule)**:  
   - 68% of data falls within 1 standard deviation of the mean ($\mu \pm \sigma$).  
   - 95% of data falls within 2 standard deviations ($\mu \pm 2\sigma$).  
   - 99.7% of data falls within 3 standard deviations ($\mu \pm 3\sigma$).  

---
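
The density formula and the empirical rule can be checked numerically. This is a small sketch; `normal_pdf` is a name introduced here for illustration, and `scipy.stats.norm` serves as the reference implementation.

```python
import numpy as np
from scipy import stats


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density, written out from the formula."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))


# The hand-written density should agree with SciPy's implementation
x = np.linspace(-4, 4, 9)
matches_scipy = np.allclose(normal_pdf(x), stats.norm.pdf(x))

# Probability mass within k standard deviations of the mean
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} sd: {prob:.4f}")
```

The three probabilities come out at roughly 0.68, 0.95, and 0.997, which is where the 68-95-99.7 rule gets its name.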

### **Why is Data Often Normally Distributed?**

1. **Central Limit Theorem (CLT)**:  
   - The **Central Limit Theorem (CLT)** states that the suitably standardized sum (or mean) of a large number of independent random variables with finite variance tends toward a normal distribution, regardless of the original distributions of the variables.  
   - In simpler terms, when a quantity is the combined result of many small, independent factors, its distribution tends to be approximately normal.  

2. **Natural Phenomena**:  
   - Many real-world processes result from random interactions of multiple factors.  
   - **Examples**:  
     - Heights of people  
     - Measurement errors  
     - Blood pressure  
     - Exam scores  

3. **Maximum Entropy Principle**:  
   - Among all distributions with a given mean and variance, the normal distribution has the maximum entropy (is the most "unbiased").  
   - This makes it a natural choice for many random processes.  

4. **Measurement and Noise**:  
   - Data often arises from measurements, and random errors in measurements tend to be normally distributed.  

---
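
The Central Limit Theorem can be illustrated with a short simulation. The sketch below assumes an exponential distribution as the skewed starting point and uses skewness (0 for a normal distribution) as a rough measure of how close the sample means are to normal; both choices are made for this example only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_repeats = 20_000


def skewness_of_sample_means(sample_size):
    """Skewness of the mean of `sample_size` exponential draws."""
    draws = rng.exponential(scale=1.0, size=(n_repeats, sample_size))
    return float(stats.skew(draws.mean(axis=1)))


skew_n1 = skewness_of_sample_means(1)     # raw exponential data, strongly skewed
skew_n50 = skewness_of_sample_means(50)   # means of 50 draws, much more symmetric

print(f"skewness, n=1:  {skew_n1:.2f}")
print(f"skewness, n=50: {skew_n50:.2f}")
```

Averaging 50 draws already removes most of the skew, even though each individual draw comes from a clearly non-normal distribution.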

### **Examples of Normally Distributed Data**
1. **Biological Data**:  
   - Heights, weights, and blood pressure in a population.  

2. **Finance**:  
   - Stock price changes over small time intervals are often modeled as approximately normal, although observed returns tend to have heavier tails.  

3. **Physics**:  
   - Measurement errors in scientific experiments.  

---

### **Why is Normal Distribution Useful?**
1. **Statistical Modeling**:  
   - Many statistical tests (e.g., t-tests, ANOVA) assume that the data, or more precisely the errors within each group, are approximately normally distributed.  

2. **Predictive Modeling**:  
   - In machine learning, features are often transformed to approximate normality to improve model performance.  

3. **Real-World Interpretation**:  
   - Allows probabilistic reasoning (e.g., "What is the likelihood that a student scores above 90?").  

---



### **What is a Chi-Square Test?**
The **Chi-Square Test** is a statistical test used to determine if there is a significant relationship between categorical variables. It compares the observed data with the data expected under the assumption that the variables are independent.

There are two main types of Chi-Square tests:
1. **Chi-Square Test of Independence**: Tests whether two categorical variables are independent of each other.
2. **Chi-Square Goodness of Fit Test**: Tests how well an observed distribution fits a specific theoretical distribution.

---

### **Chi-Square Test Formula**
The Chi-Square statistic ($\chi^2$) is calculated as:

$$
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
$$

Where:  
- $O_i$: Observed frequency of category $i$  
- $E_i$: Expected frequency of category $i$  
- $\sum$: Summation over all categories  

---

### **When to Use the Chi-Square Test?**
- When you want to check if two categorical variables are related (Chi-Square Test of Independence).  
- When you want to check if the observed data fits a theoretical distribution (Goodness of Fit Test).  

---
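
The goodness-of-fit variant has no worked example in the text, so here is a minimal sketch using `scipy.stats.chisquare`. The die-roll counts are hypothetical and exist only to show the call.

```python
from scipy import stats

# Hypothetical counts from 120 rolls of a die (faces 1 to 6), for illustration only
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6   # a fair die gives 120 / 6 = 20 per face

# chi2 = sum((O - E)^2 / E), with df = number of categories - 1
result = stats.chisquare(f_obs=observed, f_exp=expected)

alpha = 0.05
fits_fair_die = result.pvalue >= alpha   # True means we fail to reject H0
print(f"chi2 = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

Here the null hypothesis is that the die is fair, so a large p-value means the observed counts are compatible with the theoretical distribution.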

### **Example of Chi-Square Test of Independence**

#### **Scenario**:
A school principal wants to know if **gender (Male/Female)** is related to students' **preference for online vs offline classes**.  
He collects the following data from 100 students:

| **Gender** | **Prefers Online** | **Prefers Offline** | **Total** |
|------------|-------------------|---------------------|------------|
| **Male**   | 30                 | 20                  | 50         |
| **Female** | 10                 | 40                  | 50         |
| **Total**  | 40                 | 60                  | 100        |

---

### **Step 1: Formulate Hypotheses**
- **Null Hypothesis ($H_0$)**: Gender and preference for online/offline classes are independent.  
- **Alternative Hypothesis ($H_a$)**: Gender and preference for online/offline classes are not independent.  

---

### **Step 2: Calculate Expected Frequencies**
The expected frequency ($E_{ij}$) for the cell in row $i$ and column $j$ is calculated using the formula:  

$$
E_{ij} = \frac{(\text{row total}) \cdot (\text{column total})}{\text{grand total}}
$$

| **Gender** | **Prefers Online (expected)** | **Prefers Offline (expected)** | **Total** |
|------------|-----------------------------|---------------------------------|------------|
| **Male**   | $E_{11} = \frac{50 \cdot 40}{100} = 20$ | $E_{12} = \frac{50 \cdot 60}{100} = 30$ | 50         |
| **Female** | $E_{21} = \frac{50 \cdot 40}{100} = 20$ | $E_{22} = \frac{50 \cdot 60}{100} = 30$ | 50         |
| **Total**  | 40                            | 60                             | 100        |

---

### **Step 3: Calculate Chi-Square Statistic**
Use the Chi-Square formula:  

$$
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
$$

| **Gender** | **Preference** | **Observed (O)** | **Expected (E)** | **$(O - E)^2 / E$** |
|------------|----------------|------------------|------------------|----------------------|
| **Male**   | Online         | 30               | 20               | $\frac{(30 - 20)^2}{20} = 5$ |
| **Male**   | Offline        | 20               | 30               | $\frac{(20 - 30)^2}{30} \approx 3.33$ |
| **Female** | Online         | 10               | 20               | $\frac{(10 - 20)^2}{20} = 5$ |
| **Female** | Offline        | 40               | 30               | $\frac{(40 - 30)^2}{30} \approx 3.33$ |

Total Chi-Square Statistic:  
$$
\chi^2 = 5 + \frac{10}{3} + 5 + \frac{10}{3} = \frac{50}{3} \approx 16.67
$$

---

### **Step 4: Determine the Degrees of Freedom (df)**
The degrees of freedom for a Chi-Square test of independence is:  

$$
df = (\text{rows} - 1) \cdot (\text{columns} - 1)
$$

In our case:  
$$
df = (2 - 1) \cdot (2 - 1) = 1
$$

---

### **Step 5: Make a Decision**
1. **Significance level ($\alpha$)** = 0.05  
2. **Critical value** for $\chi^2$ with **df = 1** and $\alpha = 0.05$ (from Chi-Square table) = **3.841**  
3. Our calculated $\chi^2 \approx 16.67$ is greater than the critical value (3.841), so it falls in the rejection region.  

---

### **Step 6: Conclusion**
Since $\chi^2 \approx 16.67$ is greater than the critical value of 3.841, we **reject the null hypothesis** ($H_0$).  
**Conclusion**: There is sufficient evidence to suggest that there is a relationship between **gender** and **preference for online/offline classes**.  

---
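
The worked example can be reproduced with `scipy.stats.chi2_contingency`. One detail is worth flagging: SciPy applies Yates' continuity correction to 2x2 tables by default, so the sketch passes `correction=False` to match the plain formula used in the steps above.

```python
import numpy as np
from scipy import stats

# Observed counts from the table above: rows = Male, Female; columns = Online, Offline
observed = np.array([[30, 20],
                     [10, 40]])

# correction=False turns off Yates' continuity correction, which SciPy applies
# by default to 2x2 tables, so the statistic matches the plain formula
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

# Critical value for alpha = 0.05 with the computed degrees of freedom
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, dof)

print("Expected counts:\n", expected)
print(f"chi2 = {chi2:.2f}, df = {dof}, critical value = {critical_value:.3f}")
print("Reject H0" if chi2 > critical_value else "Fail to reject H0")
```

Leaving the correction on gives a somewhat smaller statistic for the same table, which is a common source of mismatches between hand calculations and library output.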

### **Summary**
1. The **Chi-Square Test** checks the relationship between categorical variables.  
2. It compares **observed** and **expected frequencies** in a contingency table.  
3. The test can be used for:  
   - **Independence Test**: To see if two variables are related.  
   - **Goodness of Fit**: To check if observed data fits a particular distribution.  
4. It is useful in fields like marketing, social sciences, and biological research.  



In [9]:
import pandas as pd 
from scipy import stats 
 
survey = pd.read_csv("data/survey.csv")   
 
# Tabulating 2 variables with row & column variables respectively 
survey_tab = pd.crosstab(survey.Smoke, survey.Exer, margins = True) 


In [10]:
survey

Unnamed: 0,Sex,Wr.Hnd,NW.Hnd,W.Hnd,Fold,Pulse,Clap,Exer,Smoke,Height,M.I,Age
0,Female,18.5,18.0,Right,R on L,92.0,Left,Some,Never,173.0,Metric,18.250
1,Male,19.5,20.5,Left,R on L,104.0,Left,,Regul,177.8,Imperial,17.583
2,Male,18.0,13.3,Right,L on R,87.0,Neither,,Occas,,,16.917
3,Male,18.8,18.9,Right,R on L,,Neither,,Never,160.0,Metric,20.333
4,Male,20.0,20.0,Right,Neither,35.0,Right,Some,Never,165.0,Metric,23.667
...,...,...,...,...,...,...,...,...,...,...,...,...
232,Female,18.0,18.0,Right,L on R,85.0,Right,Some,Never,165.1,Imperial,17.667
233,Female,18.5,18.0,Right,L on R,88.0,Right,Some,Never,160.0,Metric,16.917
234,Female,17.5,16.5,Right,R on L,,Right,Some,Never,170.0,Metric,18.583
235,Male,21.0,21.5,Right,R on L,90.0,Right,Some,Never,183.0,Metric,17.167


In [11]:
survey_tab

Smoke \ Exer,Freq,Some,All
Heavy,7,3,10
Never,87,84,171
Occas,12,4,16
Regul,9,7,16
All,115,98,213


In [12]:
# Creating observed table for analysis 
observed = survey_tab.iloc[0:4, 0:2]
observed

Smoke \ Exer,Freq,Some
Heavy,7,3
Never,87,84
Occas,12,4
Regul,9,7


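### **Interpreting the p-value**
If the p-value is below 0.05, we reject the null hypothesis of independence and conclude that the two variables are associated. If the p-value is 0.05 or above, there is insufficient evidence of an association. The p-value itself does not measure how strong the dependency is.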

In [14]:
contg = stats.chi2_contingency(observed= observed) 
p_value = round(contg[1],3) 
print ("P-value is: ",p_value) 

P-value is:  0.206


# **Analysis of Variance (ANOVA)**

## **What is ANOVA?**
**ANOVA** (Analysis of Variance) is a statistical method used to determine if there is a **significant difference in the means** of two or more groups. It examines the amount of variation **between groups** compared to the amount of variation **within groups**.

---

## **Why Use ANOVA?**
- **When you have more than two groups** to compare.  
- When you want to identify if there is a difference between multiple group means.  
- ANOVA avoids running multiple pairwise **t-tests**. Each extra test adds another chance of a false positive, so the overall Type I error rate grows with the number of comparisons.

---

## **When to Use ANOVA?**
- When you have a **categorical independent variable** (e.g., Diet Type, Treatment Group).  
- When you have a **continuous dependent variable** (e.g., Exam Score, Weight, Revenue).  

---

## **Types of ANOVA**
1. **One-Way ANOVA**: Used to compare the means of multiple groups for one factor.  
2. **Two-Way ANOVA**: Used to compare the means across multiple factors (e.g., gender and education level).  
3. **Repeated Measures ANOVA**: Used to analyze repeated measurements of the same group over time.  

---

## **How Does ANOVA Work?**
ANOVA works by comparing the variance **between groups** and **within groups**.  
- If the variance **between groups** is significantly larger than the variance **within groups**, it indicates that at least one group is different.  
- It computes an **F-statistic** to determine the ratio of **between-group variance to within-group variance**.  

---

## **Key Terms**
- **Between-Group Variance**: Measures the difference between the group means.  
- **Within-Group Variance**: Measures the variation within each group.  
- **F-statistic**: Ratio of Between-Group Variance to Within-Group Variance.  
- **p-value**: The probability of observing an F-statistic at least as large as the one computed, assuming the null hypothesis is true.  

---

## **Mathematical Formulas**

### **1. Total Variation (SST)**
$SST = \sum_{i=1}^{N} (X_i - \bar{X})^2$
Where:  
- $X_i$ = Each observation  
- $\bar{X}$ = Grand mean of all observations  
- $N$ = Total number of data points  

---

### **2. Between-Group Variation (SSB)**
$SSB = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$
Where:  
- $n_j$ = Number of samples in group $j$  
- $\bar{X}_j$ = Mean of group $j$  
- $\bar{X}$ = Grand mean  

---

### **3. Within-Group Variation (SSW)**
$SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$
Where:  
- $X_{ij}$ = Data point $i$ in group $j$  
- $\bar{X}_j$ = Mean of group $j$  

---

### **4. F-statistic**
$F = \frac{\text{MSB}}{\text{MSW}}$
Where:  
- **MSB** = Mean Square Between Groups = $\frac{SSB}{k - 1}$  
- **MSW** = Mean Square Within Groups = $\frac{SSW}{N - k}$  
- $k$ = Number of groups  
- $N$ = Total number of data points  

---

## **Hypotheses for ANOVA**
- **Null Hypothesis ($H_0$)**: All group means are equal.  
  $H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$
- **Alternative Hypothesis ($H_1$)**: At least one of the group means is different.  
  $H_1: \text{At least one of the group means is different.}$

---

## **Example (One-Way ANOVA)**
**Problem**: A researcher wants to know if 3 different diets affect weight loss.  
The weights lost by participants on each diet are:  

| **Diet A** | **Diet B** | **Diet C** |
|------------|------------|------------|
| 4          | 3          | 6          |
| 5          | 7          | 8          |
| 6          | 4          | 7          |
| 7          | 5          | 5          |

---

### **Step 1: Calculate Group Means ($\bar{X}_j$)**
$\bar{X}_A = \frac{4 + 5 + 6 + 7}{4} = 5.5, \quad \bar{X}_B = \frac{3 + 7 + 4 + 5}{4} = 4.75, \quad \bar{X}_C = \frac{6 + 8 + 7 + 5}{4} = 6.5$

---

### **Step 2: Calculate Grand Mean ($\bar{X}$)**
$\bar{X} = \frac{4 + 5 + 6 + 7 + 3 + 7 + 4 + 5 + 6 + 8 + 7 + 5}{12} = 5.583$

---

### **Step 3: Calculate Between-Group Variation (SSB)**
$SSB = 4(5.5 - 5.583)^2 + 4(4.75 - 5.583)^2 + 4(6.5 - 5.583)^2$

$SSB = 0.028 + 2.778 + 3.361 \approx 6.17$

---

### **Step 4: Calculate Within-Group Variation (SSW)**
$SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$

For each group:  
- **Group A**: $(4 - 5.5)^2 + (5 - 5.5)^2 + (6 - 5.5)^2 + (7 - 5.5)^2 = 5$  
- **Group B**: $(3 - 4.75)^2 + (7 - 4.75)^2 + (4 - 4.75)^2 + (5 - 4.75)^2 = 8.75$  
- **Group C**: $(6 - 6.5)^2 + (8 - 6.5)^2 + (7 - 6.5)^2 + (5 - 6.5)^2 = 5$  

$SSW = 5 + 8.75 + 5 = 18.75$

---

### **Step 5: Calculate F-Statistic**
$F = \frac{\text{MSB}}{\text{MSW}}$
Where:  
- **MSB** = $\frac{SSB}{k - 1} = \frac{6.17}{3 - 1} \approx 3.08$  
- **MSW** = $\frac{SSW}{N - k} = \frac{18.75}{12 - 3} \approx 2.08$  

$F = \frac{3.08}{2.08} \approx 1.48$

---
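
The sums of squares and the F-statistic can be recomputed directly from the diet table as a cross-check. The sketch below follows the SSB, SSW, MSB, and MSW formulas from the text and compares the result with `scipy.stats.f_oneway`; the variable names are chosen for this example.

```python
import numpy as np
from scipy import stats

# Weight-loss values from the table in the example above
groups = {
    "Diet A": np.array([4, 5, 6, 7]),
    "Diet B": np.array([3, 7, 4, 5]),
    "Diet C": np.array([6, 8, 7, 5]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()
k = len(groups)        # number of groups
N = all_values.size    # total number of observations

# Between-group and within-group sums of squares
ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
sst = ((all_values - grand_mean) ** 2).sum()   # should equal ssb + ssw

# Mean squares and the F-statistic
msb = ssb / (k - 1)
msw = ssw / (N - k)
f_manual = msb / msw

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_value = stats.f_oneway(*groups.values())

print(f"SSB = {ssb:.3f}, SSW = {ssw:.3f}, SST = {sst:.3f}")
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}, p = {p_value:.3f}")
```

Checking that SSB and SSW add up to SST is a quick way to catch arithmetic slips in a hand calculation.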

### **When Do You Reject the Null Hypothesis?**
- If the **F-statistic** is larger than the **critical F-value**, reject the null hypothesis.  
- If the **p-value** is smaller than the significance level (e.g., 0.05), reject the null hypothesis.  

---

## **Conclusion**
- **ANOVA** is used to compare the means of multiple groups.  
- It avoids running multiple pairwise t-tests, which would inflate the overall chance of a **Type I error**.  
- If the p-value is **less than 0.05**, we reject the null hypothesis, indicating that at least one of the means is different.  
- One-way ANOVA is used for one independent variable, while two-way ANOVA is used for two independent variables.  



# **Decision to Reject the Null Hypothesis in ANOVA**

## **1️⃣ When to Reject the Null Hypothesis?**
- If the **F-statistic** is larger than the **critical F-value**, we **reject the null hypothesis**.  
- If the **p-value** is **less than 0.05** (or your chosen significance level), we **reject the null hypothesis**.  

The null hypothesis ($H_0$) for ANOVA is that **all group means are equal**.  
If we reject $H_0$, it means at least one of the group means is different.

---

## **2️⃣ Example Setup**
We want to determine if three different diets affect weight loss.  
Here is the summary of the key data from the earlier example.

| **Diet**  | **Group Mean** ($\bar{X}_j$) |
|-----------|------------------------------|
| **Diet A**| 5.5                            |
| **Diet B**| 4.75                           |
| **Diet C**| 6.5                            |

**Grand Mean** ($\bar{X}$) = 5.583  

- **Between-Group Variation (SSB)** = $4(5.5 - 5.583)^2 + 4(4.75 - 5.583)^2 + 4(6.5 - 5.583)^2 \approx 6.17$  
- **Within-Group Variation (SSW)** = Calculated using individual deviations from group means: $5 + 8.75 + 5 = 18.75$ for Diets A, B, and C.  

---

## **3️⃣ Calculate the F-statistic**

To calculate the F-statistic, use the formula:  
\[
F = \frac{\text{MSB}}{\text{MSW}}
\]
Where:  
- **MSB** = $\frac{SSB}{k - 1}$  
- **MSW** = $\frac{SSW}{N - k}$  

---

### **Step 1: Calculate MSB**  
Assuming:  
- $k = 3$ (number of groups: Diet A, Diet B, Diet C)  
- $N = 12$ (total number of observations)  

\[
MSB = \frac{SSB}{k - 1} = \frac{6.17}{3 - 1} = \frac{6.17}{2} \approx 3.08
\]

---

### **Step 2: Calculate MSW**  
The total within-group sum of squares (SSW) for this data is 18.75.  

\[
MSW = \frac{SSW}{N - k} = \frac{18.75}{12 - 3} = \frac{18.75}{9} \approx 2.08
\]

---

### **Step 3: Calculate the F-statistic**  
\[
F = \frac{\text{MSB}}{\text{MSW}} = \frac{3.08}{2.08} \approx 1.48
\]

---

## **4️⃣ Check the p-value**
To check if this F-statistic is significant, we compare it to a **critical F-value** from the **F-distribution table** for:  
- Degrees of Freedom for numerator ($df_1 = k - 1 = 2$)  
- Degrees of Freedom for denominator ($df_2 = N - k = 9$)  
- Significance level ($\alpha = 0.05$)  

From the F-table, the critical F-value for $df_1 = 2$ and $df_2 = 9$ at **$\alpha = 0.05$** is approximately **4.26**. Equivalently, the p-value for $F \approx 1.48$ with these degrees of freedom is about 0.28, well above 0.05.

---

## **5️⃣ Decision**
- Our calculated **F ≈ 1.48**.  
- The critical F-value from the table is **4.26**.  

Since **F ≈ 1.48** is **less than 4.26**, we **fail to reject the null hypothesis**.  
This means there is **not enough evidence** to say that the group means are significantly different.

---

## **6️⃣ Conclusion**
- **Null Hypothesis ($H_0$)**: The means of the 3 diets are equal.  
- **Decision**: We **fail to reject the null hypothesis** because the F-statistic (1.48) is less than the critical F-value (4.26).  
- **Conclusion**: There is no significant difference between the means of the 3 diets at the **0.05 significance level**.  

---
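
The F-table lookup can also be done in code with `scipy.stats.f`. The helper `anova_decision` below is a hypothetical name introduced for this sketch; it takes any F-statistic together with the number of groups and observations and applies the decision rule described above.

```python
from scipy import stats


def anova_decision(f_stat, k, n_total, alpha=0.05):
    """Compare an F-statistic with the critical value (hypothetical helper).

    Returns (critical_value, p_value, reject_h0) for k groups and
    n_total observations.
    """
    df_between = k - 1
    df_within = n_total - k
    critical_value = stats.f.ppf(1 - alpha, df_between, df_within)
    p_value = stats.f.sf(f_stat, df_between, df_within)   # upper-tail probability
    return critical_value, p_value, f_stat > critical_value


# Three groups and twelve observations, as in the diet example
critical_value, _, _ = anova_decision(f_stat=1.0, k=3, n_total=12)
print(f"Critical F(2, 9) at alpha = 0.05: {critical_value:.2f}")
```

Comparing the statistic with the critical value and comparing the p-value with the significance level always lead to the same decision, so either check is sufficient.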



In [17]:
import pandas as pd 
from scipy import stats 
fertilizers = pd.read_csv("Data/fetilizers.csv") 

In [18]:
fertilizers

Unnamed: 0,fertilizer1,fertilizer2,fertilizer3
0,62,54,48
1,62,56,62
2,90,58,92
3,42,36,96
4,84,72,92
5,64,34,80


In [19]:
one_way_anova = stats.f_oneway(fertilizers["fertilizer1"], fertilizers["fertilizer2"], fertilizers["fertilizer3"]) 
print ("Statistic :", round(one_way_anova[0],2),", p-value :",round(one_way_anova[1],3)) 


Statistic : 3.66 , p-value : 0.051


---

## **Classification Metrics Example**

### **Scenario**
We are developing a machine learning model to predict if a patient has a disease. There are **50 patients** tested, and for each patient, we have the **actual label** (Does the patient have the disease? Yes or No) and the **model's prediction**.

Here is a breakdown of the results:

| **Patient** | **Actual** | **Predicted** |
|-------------|------------|---------------|
| 1           | Yes        | Yes           |
| 2           | No         | No            |
| 3           | Yes        | No            |
| 4           | Yes        | Yes           |
| 5           | No         | No            |
| 6           | No         | Yes           |
| 7           | Yes        | Yes           |
| 8           | No         | No            |
| 9           | Yes        | No            |
| 10          | No         | No            |
| 11          | Yes        | Yes           |
| 12          | Yes        | Yes           |
| 13          | No         | Yes           |
| 14          | No         | No            |
| 15          | Yes        | No            |
| 16          | Yes        | Yes           |
| 17          | No         | No            |
| 18          | No         | No            |
| 19          | Yes        | Yes           |
| 20          | No         | Yes           |
| 21          | Yes        | No            |
| 22          | Yes        | Yes           |
| 23          | No         | No            |
| 24          | No         | No            |
| 25          | Yes        | No            |
| 26          | No         | Yes           |
| 27          | No         | No            |
| 28          | Yes        | Yes           |
| 29          | Yes        | No            |
| 30          | No         | No            |
| 31          | No         | Yes           |
| 32          | No         | Yes           |
| 33          | Yes        | Yes           |
| 34          | Yes        | No            |
| 35          | No         | No            |
| 36          | Yes        | Yes           |
| 37          | No         | No            |
| 38          | Yes        | Yes           |
| 39          | No         | No            |
| 40          | No         | Yes           |
| 41          | Yes        | Yes           |
| 42          | Yes        | Yes           |
| 43          | No         | No            |
| 44          | No         | No            |
| 45          | Yes        | No            |
| 46          | No         | Yes           |
| 47          | No         | No            |
| 48          | No         | Yes           |
| 49          | Yes        | Yes           |
| 50          | No         | No            |

---

## **Step 1: Calculate TP, TN, FP, FN**
- **True Positives (TP)**: 15
- **True Negatives (TN)**: 18
- **False Positives (FP)**: 9
- **False Negatives (FN)**: 8
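
These counts can be tallied directly from the table. In the sketch below the 50 rows are transcribed as strings of `Y` and `N` (an encoding chosen for this example), in patient order.

```python
# Labels transcribed from the table above, patients 1 to 50 (Y = Yes, N = No)
actual = (
    "YNYYNNYNYN" "YYNNYYNNYN" "YYNNYNNYYN" "NNYYNYNYNN" "YYNNYNNNYN"
)
predicted = (
    "YNNYNYYNNN" "YYYNNYNNYY" "NYNNNYNYNN" "YYYNNYNYNY" "YYNNNYNYYN"
)

pairs = list(zip(actual, predicted))

tp = sum(a == "Y" and p == "Y" for a, p in pairs)   # disease present, predicted present
tn = sum(a == "N" and p == "N" for a, p in pairs)   # disease absent, predicted absent
fp = sum(a == "N" and p == "Y" for a, p in pairs)   # disease absent, predicted present
fn = sum(a == "Y" and p == "N" for a, p in pairs)   # disease present, predicted absent

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}, total={len(pairs)}")
```

The four counts should add up to the number of patients, which is a simple check that no row was dropped or counted twice.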

---

## **Step 2: Calculate Evaluation Metrics**

### **1. Precision (P)**
$ \text{Precision} = \frac{TP}{TP + FP} = \frac{15}{15 + 9} = \frac{15}{24} = 0.625 $
**Interpretation**: When the model predicted "Yes", it was correct 62.5% of the time.

---

### **2. Recall (R) / Sensitivity / True Positive Rate (TPR)**
$ \text{Recall} = \frac{TP}{TP + FN} = \frac{15}{15 + 8} = \frac{15}{23} \approx 0.652 $
**Interpretation**: Out of all the actual "Yes" cases, the model correctly identified 65.2% of them.

---

### **3. F1-Score (F1)**
$ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $
$ F1 = 2 \cdot \frac{0.625 \cdot 0.652}{0.625 + 0.652} = 2 \cdot \frac{0.407}{1.277} = 0.637 $
**Interpretation**: The F1 score is 0.637, indicating a balance between precision and recall.

---

### **4. Specificity (True Negative Rate)**
$ \text{Specificity} = \frac{TN}{TN + FP} = \frac{18}{18 + 9} = \frac{18}{27} = 0.666 $
**Interpretation**: Out of all the "No" cases, the model correctly identified 66.6% of them.

---

### **5. False Positive Rate (FPR)**
$ \text{False Positive Rate} = \frac{FP}{FP + TN} = \frac{9}{9 + 18} = \frac{9}{27} = 0.333 $
**Note**: $ \text{Specificity} = 1 - \text{False Positive Rate} = 1 - 0.333 = 0.666 $

---

### **6. Area Under ROC Curve (AUC)**
The approximate AUC score for this model is **0.7**, as the TPR (0.652) is significantly higher than the FPR (0.333).

---

## **Summary of Metrics**
| **Metric**       | **Formula**                | **Value**   |
|------------------|---------------------------|-------------|
| **True Positives (TP)** | —                     | 15          |
| **True Negatives (TN)** | —                     | 18          |
| **False Positives (FP)** | —                    | 9           |
| **False Negatives (FN)** | —                    | 8           |
| **Precision**    | $ \frac{TP}{TP + FP} $  | 0.625       |
| **Recall (TPR)** | $ \frac{TP}{TP + FN} $  | 0.652       |
| **F1 Score**     | $ 2 \cdot \frac{P \cdot R}{P + R} $ | 0.637  |
| **Specificity**  | $ \frac{TN}{TN + FP} $  | 0.666       |
| **False Positive Rate (FPR)** | $ \frac{FP}{FP + TN} $ | 0.333 |
| **AUC**          | —                         | 0.7         |

---

## **Key Takeaways**
- **Precision**: When the model says "Yes", it is correct **62.5%** of the time.
- **Recall**: Out of all people with the disease, the model identifies **65.2%**.
- **Specificity**: Out of all people without the disease, the model identifies **66.7%**.
- **F1-Score**: The harmonic mean of precision and recall is **63.8%**.
- **AUC**: A value of roughly **0.7** means the model ranks a random "Yes" case above a random "No" case about 70% of the time.



## Adjusted R-squared

Adjusted R-squared is a key metric for evaluating the quality of a linear regression. As a rule of thumb, a linear regression model with adjusted $R^2 \geq 0.7$ is considered good enough to implement, although the acceptable threshold depends on the domain.
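
For reference, adjusted R-squared is usually computed as $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of samples and $p$ the number of predictors. The sketch below applies that formula to synthetic data; the helper name `adjusted_r2` and the data are assumptions made for illustration.

```python
import numpy as np


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# Synthetic data: one predictor plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4 + 3 * x + rng.normal(size=100)

slope, intercept = np.polyfit(x, y, deg=1)
score = adjusted_r2(y, intercept + slope * x, n_features=1)
print(f"adjusted R^2 = {score:.3f}")
```

Unlike plain $R^2$, the adjusted value drops when a predictor is added that does not improve the fit enough to justify the lost degree of freedom.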

# Bias vs. Variance Trade-off

---

## **What is Bias?**
- **Bias** refers to errors caused by overly simplistic assumptions in the model.
- A high-bias model fails to capture the complexity of the data, leading to **underfitting**.
- Underfitting happens when the model is too simple to explain the data well, causing both the training and testing errors to be high.

### **Example of High-Bias Models**:
- **Linear Regression**:
  - Imagine your data follows a complex curve, but linear regression tries to fit a straight line.
  - The model won’t capture the actual relationships well, leaving high errors.

---

## **What is Variance?**
- **Variance** refers to how sensitive a model is to fluctuations in the training data.
- A high-variance model captures even the noise in the training data, leading to **overfitting**.
- Overfitting happens when the model is too complex, performing well on training data but poorly on new data.

### **Example of High-Variance Models**:
- **Decision Trees**:
  - Decision trees can create very detailed, complex models.
  - Even small changes in the training data can result in drastically different trees, leading to instability.

---

## **Bias-Variance Trade-off**
- Bias and variance are like a balancing act:
  - **High bias** → The model is too simple → Underfitting.
  - **High variance** → The model is too complex → Overfitting.
- **Goal**: Find the right balance between bias and variance to minimize the total error.

### **Total Error = Bias² + Variance + Irreducible Error**
- **Irreducible error** is noise in the data that no model can fix.

---

## **Visualizing the Trade-off**
1. **High Bias**:
   - Predictions are far from the actual values.
   - The model doesn’t capture the data’s complexity.
   - Example: A straight line when the data forms a curve.

2. **High Variance**:
   - The model fits the training data almost perfectly.
   - Poor generalization to new data (test data error is high).
   - Example: A squiggly curve that tries to follow every point in the training set.

---

## **Modern Solution: Ensemble Methods**
- Modern machine learning techniques balance bias and variance effectively using **ensemble methods**.
- **Random Forest** is a great example:
  - It combines multiple high-variance models (decision trees).
  - Averaging their outputs lowers the overall variance while keeping bias low.
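
The variance-reduction effect of averaging can be illustrated without building a forest. The sketch below simulates the prediction errors of several models as correlated Gaussian noise and compares one model with the average of all of them. It is not a Random Forest implementation, and the values of `sigma` and `rho` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_repeats = 50, 20000
sigma, rho = 1.0, 0.3  # assumed per-model error std and pairwise correlation

# A shared component makes every pair of models correlated with coefficient rho.
shared = rng.normal(0.0, np.sqrt(rho) * sigma, size=(n_repeats, 1))
own = rng.normal(0.0, np.sqrt(1 - rho) * sigma, size=(n_repeats, n_models))
errors = shared + own  # each column holds one model's prediction error

single_var = errors[:, 0].var()
ensemble_var = errors.mean(axis=1).var()
theory = rho * sigma**2 + (1 - rho) * sigma**2 / n_models

print(f"single model variance: {single_var:.3f}")
print(f"ensemble variance:     {ensemble_var:.3f} (theory {theory:.3f})")
```

The ensemble variance approaches `rho * sigma**2` as more models are added, which is why Random Forest also decorrelates its trees by sampling rows and features.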

---

## **Key Takeaways**
- **High Bias = Underfitting**: The model is too simple to capture patterns in the data.
- **High Variance = Overfitting**: The model is too complex, capturing noise instead of patterns.
- **Balanced Model**: The sweet spot has both low bias and low variance, achieving good generalization.


# Statistical Modeling vs Machine Learning

## Statistical Modeling
- **Example**: Linear regression with two independent variables.
- **Goal**: Fits the **best plane** through the data points by minimizing errors.
- Focuses on understanding the **relationship** between the independent variables (inputs) and the dependent variable (output).

## Machine Learning
- **Focus**: Optimizes parameters (e.g., weights and biases) rather than just relationships between variables.
- Converts the problem into an **optimization task** (minimizing a function like squared error).
- Errors are **squared** so that, for a linear model, the loss is **convex** and smooth in the parameters, which means:
  - Gradient-based optimization converges reliably.
  - Any minimum it reaches is the global optimum.

---

# Convex vs Non-Convex Functions

## Convex Functions
- A function is convex if a straight line drawn between any two points on the curve always stays **above or on the curve**.
- In convex functions:
  - **Local minimum** = **Global minimum**.
- Optimization techniques like **gradient descent** work reliably and find the best solution.

## Non-Convex Functions
- A function is non-convex if a straight line between two points on the curve can sometimes go **below the curve**.
- In non-convex functions:
  - There can be **multiple local minima** (valleys).
  - It's hard to know if the solution is the **global minimum** (best solution).

---

# Why Does This Matter in Machine Learning?
- Machine learning models rely on **optimization** to find the best parameters (e.g., weights, biases).
- If the function being optimized is convex:
  - The optimization process (e.g., **gradient descent**) **converges to the global minimum**, provided the learning rate is small enough.
- If the function is non-convex:
  - The optimization process might get "stuck" in a local minimum and fail to find the best solution.
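
The "stuck in a local minimum" behaviour is easy to reproduce in one dimension. The sketch below runs plain gradient descent on a non-convex quartic from two starting points; the function, learning rate, and step count are illustrative choices.

```python
def f(x):
    # Non-convex: two valleys of different depth.
    return x**4 - 3 * x**2 + x


def df(x):
    return 4 * x**3 - 6 * x + 1


def descend(x0, learn_rate=0.01, n_steps=2000):
    """Plain gradient descent from starting point x0."""
    x = x0
    for _ in range(n_steps):
        x -= learn_rate * df(x)
    return x


x_left = descend(-2.0)
x_right = descend(2.0)
print(f"start -2.0 -> x = {x_left:.4f}, f(x) = {f(x_left):.4f}")
print(f"start +2.0 -> x = {x_right:.4f}, f(x) = {f(x_right):.4f}")
```

Both runs converge, but only the run started on the left reaches the deeper valley; the other one settles in the shallower local minimum.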

---

# Key Takeaways
1. Machine learning converts the problem into an **optimization task**.
2. For linear models, squaring the errors gives a **convex function**, which ensures:
   - Reliable convergence of gradient-based optimization.
   - Any minimum reached is the global optimum.
3. For non-convex problems:
   - Techniques such as random initialization, random restarts, or momentum-based optimizers are used to cope with multiple minima.


# Convex and Non-Convex Functions

### **Convex Function**

A function is **convex** if the line segment connecting any two points on the function lies **above or on the curve**.

#### **Example 1: Quadratic Function**
The function:  
$$f(x) = x^2$$  
This is a classic example of a convex function. If you take any two points, say $x_1 = -1$ and $x_2 = 2$, the line connecting $f(-1)$ and $f(2)$ will always lie **above or on** the curve of $f(x)$.

#### **Key Characteristics:**
1. Has a single **global minimum** (no other "valleys").
2. Easy to optimize (find the minimum).

#### **Real-Life Analogy**:
Imagine a ball placed on the rim of a perfectly smooth **bowl**. No matter where on the rim it starts, it always rolls down to the **lowest point**, which is the **global minimum**.

---

### **Non-Convex Function**

A function is **non-convex** if the line segment connecting two points on the function can lie **below the curve**.

#### **Example 2: Sinusoidal Function**
The function:  
$$f(x) = \sin(x)$$  
This is an example of a non-convex function. It has multiple **peaks** and **valleys** (local minima and maxima). If you take two points on this curve, say $x_1 = \pi/4$ and $x_2 = 3\pi/4$, the line connecting $f(\pi/4)$ and $f(3\pi/4)$ will fall **below the curve**.
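
Both chord examples (the quadratic with $x_1 = -1$, $x_2 = 2$ and the sine with $x_1 = \pi/4$, $x_2 = 3\pi/4$) can be checked numerically. The sketch below samples points along the chord and subtracts the function value; the helper name is illustrative, and a sampled check like this can only refute convexity on an interval, not prove it in general.

```python
import numpy as np


def chord_minus_curve(f, x1, x2, n_points=101):
    """Chord height minus function value at evenly spaced points in [x1, x2]."""
    t = np.linspace(0.0, 1.0, n_points)
    x = (1 - t) * x1 + t * x2
    chord = (1 - t) * f(x1) + t * f(x2)
    return chord - f(x)


gap_quadratic = chord_minus_curve(lambda x: x**2, -1.0, 2.0)
gap_sine = chord_minus_curve(np.sin, np.pi / 4, 3 * np.pi / 4)

# Convexity requires the chord to stay on or above the curve (gap >= 0).
print("x^2 chord stays above curve:", gap_quadratic.min() >= -1e-12)
print("sin chord dips below curve: ", gap_sine.min() < 0)
```

For the sine example the chord is flat at $\sin(\pi/4) \approx 0.707$ while the curve rises to 1 at $\pi/2$, so the gap is negative in between.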

#### **Key Characteristics:**
1. Has **multiple local minima** (valleys) and maxima (peaks).
2. Hard to optimize because an algorithm may "get stuck" in a **local minimum** and fail to find the **global minimum**.

#### **Real-Life Analogy**:
Imagine you're hiking in a mountainous region with multiple peaks and valleys. If you're trying to find the lowest point, you might get stuck in one of the small valleys (local minimum) and fail to reach the deepest valley (global minimum).

---

### **Practical Implications in Machine Learning**

1. **Convex Functions**:
   - Optimization (like gradient descent) is straightforward.
   - Example: Linear regression minimizes the squared error function, which is convex.

2. **Non-Convex Functions**:
   - Optimization is harder because algorithms might get stuck in local minima.
   - Example: Neural networks often involve non-convex functions due to the complexity of the loss function. Advanced techniques like random initialization, momentum, or Adam optimizer are used to overcome this challenge.

---

### **Visualization**

1. **Convex Function**: A smooth "U" shape, like a bowl.  
   Example: $f(x) = x^2$

2. **Non-Convex Function**: A wavy curve with multiple peaks and valleys.  
   Example: $f(x) = \sin(x)$  


## Gradient Descent

```python
import numpy as np


def gradient_descent(x, y, learn_rate, conv_threshold, batch_size, max_iter):
    """Fit y ~ t0 + t1 * x by gradient descent on the squared error.

    Note: only the first `batch_size` samples of x and y are used on every
    iteration; the remaining samples never enter the fit.
    """
    converged = False
    n_iter = 0  # named n_iter to avoid shadowing the built-in iter()
    m = batch_size
    t0 = np.random.random()  # Initialize intercept randomly
    t1 = np.random.random()  # Initialize coefficient randomly
    mse = float('inf')  # Start from infinity so the first check never passes

    while not converged:
        # Gradients with respect to the intercept and the coefficient
        grad0 = 1.0 / m * sum((t0 + t1 * x[i] - y[i]) for i in range(m))
        grad1 = 1.0 / m * sum((t0 + t1 * x[i] - y[i]) * x[i] for i in range(m))

        # Simultaneous update of both parameters
        t0, t1 = t0 - learn_rate * grad0, t1 - learn_rate * grad1

        mse_new = sum((t0 + t1 * x[i] - y[i]) ** 2 for i in range(m)) / m

        # Stop once the MSE changes by no more than the threshold
        if abs(mse - mse_new) <= conv_threshold:
            print('Converged, iterations:', n_iter)
            converged = True

        mse = mse_new
        n_iter += 1

        if n_iter == max_iter:
            print('Max iterations reached')
            converged = True

    return t0, t1


# Example usage: generate synthetic data with intercept 4 and slope 3
np.random.seed(42)
X = 2 * np.random.rand(100, 1).flatten()
y = 4 + 3 * X + np.random.randn(100)

# Perform gradient descent
Inter, Coeff = gradient_descent(
    x=X,
    y=y,
    learn_rate=0.00003,
    conv_threshold=1e-8,
    batch_size=32,
    max_iter=1500000
)
print('Gradient Descent Results:')
print(f'Intercept = {Inter}, Coefficient = {Coeff}')
```


```text
Converged, iterations: 543935
Gradient Descent Results:
Intercept = 4.225953652228333, Coefficient = 2.6128610303074944
```


# **Gradient Descent Explained with Code and Example**

---

## **Mathematical Formulation**

Gradient descent minimizes a squared-error objective for linear regression. The version below is half the Mean Squared Error (MSE); the factor $\frac{1}{2}$ cancels when differentiating, so the gradients in step 2 carry no extra constant.

1. **Objective Function:**
   $$
   J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x_i) - y_i \right)^2
   $$
   - $J(\theta_0, \theta_1)$: Half the MSE
   - $m$: Number of samples in the batch
   - $h_\theta(x_i) = \theta_0 + \theta_1 x_i$: Predicted value
   - $y_i$: Actual value

2. **Gradients:**
   $$
   \frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x_i) - y_i \right)
   $$
   $$
   \frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x_i) - y_i \right) x_i
   $$

3. **Parameter Updates:**
   $$
   \theta_0 \leftarrow \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}
   $$
   $$
   \theta_1 \leftarrow \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}
   $$
   - $\alpha$: Learning rate

4. **Convergence Condition:**
   $$
   | J_{\text{old}} - J_{\text{new}} | \leq \text{conv\_threshold}
   $$
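
The gradient formulas can be sanity-checked against finite differences. The sketch below assumes the half-MSE convention $J = \frac{1}{2m}\sum (h_\theta(x_i) - y_i)^2$, under which the gradients above are exact; with a plain $\frac{1}{m}$ in front, the derivative carries an extra factor of 2 that is effectively absorbed into the learning rate. The function names and the small dataset are illustrative.

```python
import numpy as np


def cost(theta0, theta1, x, y):
    """Half mean squared error: J = 1/(2m) * sum((h - y)^2)."""
    residual = theta0 + theta1 * x - y
    return np.sum(residual ** 2) / (2 * x.size)


def gradients(theta0, theta1, x, y):
    """Analytic partial derivatives of the half-MSE."""
    residual = theta0 + theta1 * x - y
    return residual.mean(), (residual * x).mean()


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.8, 4.5, 3.7])
theta0, theta1 = 0.5, 0.5

g0, g1 = gradients(theta0, theta1, x, y)

# Central finite differences as an independent check.
eps = 1e-6
g0_num = (cost(theta0 + eps, theta1, x, y) - cost(theta0 - eps, theta1, x, y)) / (2 * eps)
g1_num = (cost(theta0, theta1 + eps, x, y) - cost(theta0, theta1 - eps, x, y)) / (2 * eps)

print(f"analytic:  {g0:.6f}, {g1:.6f}")
print(f"numerical: {g0_num:.6f}, {g1_num:.6f}")
```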

---

## **Worked Example**

### **Dataset:**
- $x = [1, 2, 3, 4]$ (input values)
- $y = [2.2, 2.8, 4.5, 3.7]$ (actual values)

### **Initial Parameters:**
- $\theta_0 = 0.5, \theta_1 = 0.5$
- Learning rate $\alpha = 0.01$

### **Iterations:**

- Compute predictions $h_\theta(x) = \theta_0 + \theta_1 x$
- Compute gradients:
  $$
  \text{grad}_0 = \frac{1}{m} \sum \left( h_\theta(x_i) - y_i \right)
  $$
  $$
  \text{grad}_1 = \frac{1}{m} \sum \left( h_\theta(x_i) - y_i \right) x_i
  $$
- Update parameters:
  $$
  \theta_0 \leftarrow \theta_0 - \alpha \cdot \text{grad}_0
  $$
  $$
  \theta_1 \leftarrow \theta_1 - \alpha \cdot \text{grad}_1
  $$

- Repeat Until Convergence.


# **Step-by-Step Calculation for Gradient Descent Iteration**

### **Dataset:**
- $x = [1, 2, 3, 4]$ (input values)
- $y = [2.2, 2.8, 4.5, 3.7]$ (actual values)

### **Initial Parameters:**
- $\theta_0 = 0.5, \theta_1 = 0.5$
- Learning rate $\alpha = 0.01$
- Number of samples $m = 4$ (since there are 4 data points)

### **Iteration 1:**

1. **Compute Predictions:**

   For each value of $x_i$, the prediction is:
   $$
   h_\theta(x_i) = \theta_0 + \theta_1 x_i
   $$

   Using the initial values of $\theta_0 = 0.5$ and $\theta_1 = 0.5$, we compute the predictions for each $x_i$:
   - For $x_1 = 1$, $h_\theta(x_1) = 0.5 + 0.5(1) = 1.0$
   - For $x_2 = 2$, $h_\theta(x_2) = 0.5 + 0.5(2) = 1.5$
   - For $x_3 = 3$, $h_\theta(x_3) = 0.5 + 0.5(3) = 2.0$
   - For $x_4 = 4$, $h_\theta(x_4) = 0.5 + 0.5(4) = 2.5$

   Thus, the predictions are:
   $$
   h_\theta = [1.0, 1.5, 2.0, 2.5]
   $$

2. **Compute Gradients:**

   The gradients for $\theta_0$ and $\theta_1$ are calculated as:
   $$
   \text{grad}_0 = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x_i) - y_i \right)
   $$
   $$
   \text{grad}_1 = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x_i) - y_i \right) x_i
   $$

   We compute the individual errors $(h_\theta(x_i) - y_i)$ for each $i$:
   - For $x_1 = 1$, error = $1.0 - 2.2 = -1.2$
   - For $x_2 = 2$, error = $1.5 - 2.8 = -1.3$
   - For $x_3 = 3$, error = $2.0 - 4.5 = -2.5$
   - For $x_4 = 4$, error = $2.5 - 3.7 = -1.2$

   Now, calculate the gradients:
   - $\text{grad}_0$:
     $$
     \text{grad}_0 = \frac{1}{4} \left( (-1.2) + (-1.3) + (-2.5) + (-1.2) \right) = \frac{1}{4} \times (-6.2) = -1.55
     $$
   - $\text{grad}_1$:
     $$
     \text{grad}_1 = \frac{1}{4} \left( (-1.2)(1) + (-1.3)(2) + (-2.5)(3) + (-1.2)(4) \right)
     $$
     $$
     \text{grad}_1 = \frac{1}{4} \left( -1.2 - 2.6 - 7.5 - 4.8 \right) = \frac{1}{4} \times (-16.1) = -4.025
     $$

3. **Update Parameters:**

   Now, update the parameters using the gradients and learning rate:
   $$
   \theta_0 \leftarrow \theta_0 - \alpha \cdot \text{grad}_0
   $$
   $$
   \theta_1 \leftarrow \theta_1 - \alpha \cdot \text{grad}_1
   $$

   Using $\alpha = 0.01$, we compute the new values for $\theta_0$ and $\theta_1$:
   - Update $\theta_0$:
     $$
     \theta_0 = 0.5 - 0.01 \cdot (-1.55) = 0.5 + 0.0155 = 0.5155
     $$
   - Update $\theta_1$:
     $$
     \theta_1 = 0.5 - 0.01 \cdot (-4.025) = 0.5 + 0.04025 = 0.54025
     $$

### **New Parameters After Iteration 1:**
- $\theta_0 = 0.5155$
- $\theta_1 = 0.54025$
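
The hand calculation for iteration 1 can be reproduced in a few lines of NumPy. This is a minimal sketch; the helper name `gd_step` is illustrative.

```python
import numpy as np


def gd_step(theta0, theta1, x, y, alpha):
    """One gradient descent update for the line h(x) = theta0 + theta1 * x."""
    errors = theta0 + theta1 * x - y
    grad0 = errors.mean()
    grad1 = (errors * x).mean()
    return theta0 - alpha * grad0, theta1 - alpha * grad1


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.8, 4.5, 3.7])

theta0_new, theta1_new = gd_step(0.5, 0.5, x, y, alpha=0.01)
print(f"theta0 = {theta0_new:.5f}, theta1 = {theta1_new:.5f}")
# theta0 = 0.51550, theta1 = 0.54025
```

Calling `gd_step` again with the returned values performs the next iteration.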

---

### **Iteration 2:**

1. **Compute New Predictions:**
   
   Using the updated values of $\theta_0 = 0.5155$ and $\theta_1 = 0.54025$:
   - For $x_1 = 1$, $h_\theta(x_1) = 0.5155 + 0.54025(1) = 1.05575$
   - For $x_2 = 2$, $h_\theta(x_2) = 0.5155 + 0.54025(2) = 1.596$
   - For $x_3 = 3$, $h_\theta(x_3) = 0.5155 + 0.54025(3) = 2.13625$
   - For $x_4 = 4$, $h_\theta(x_4) = 0.5155 + 0.54025(4) = 2.6765$

2. **Compute Gradients:**

   Calculate the errors and gradients:
   - For $x_1 = 1$, error = $1.05575 - 2.2 = -1.14425$
   - For $x_2 = 2$, error = $1.596 - 2.8 = -1.204$
   - For $x_3 = 3$, error = $2.13625 - 4.5 = -2.36375$
   - For $x_4 = 4$, error = $2.6765 - 3.7 = -1.0235$

   Then compute the gradients:
   - $\text{grad}_0$:
     $$
     \text{grad}_0 = \frac{1}{4} \left( (-1.14425) + (-1.204) + (-2.36375) + (-1.0235) \right) = \frac{1}{4} \times (-5.7355) = -1.433875
     $$
   - $\text{grad}_1$:
     $$
     \text{grad}_1 = \frac{1}{4} \left( (-1.14425)(1) + (-1.204)(2) + (-2.36375)(3) + (-1.0235)(4) \right)
     $$
     $$
     \text{grad}_1 = \frac{1}{4} \left( -1.14425 - 2.408 - 7.09125 - 4.094 \right) = \frac{1}{4} \times (-14.7375) = -3.684375
     $$

3. **Update Parameters:**

   - Update $\theta_0$:
     $$
     \theta_0 = 0.5155 - 0.01 \cdot (-1.433875) = 0.5155 + 0.01433875 = 0.52983875
     $$

   - Update $\theta_1$:
     $$
     \theta_1 = 0.54025 - 0.01 \cdot (-3.684375) = 0.54025 + 0.03684375 = 0.57709375
     $$

### **New Parameters After Iteration 2:**
- $\theta_0 = 0.52983875$
- $\theta_1 = 0.57709375$

---

### **Continue the Process:**
Repeat the above steps (compute predictions, gradients, and parameter updates) until the parameters converge, i.e., the change in MSE between iterations is less than a predefined convergence threshold (e.g., $10^{-8}$).
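
Running the loop to convergence on the four-point dataset might look like the sketch below. It compares the result with the closed-form least-squares line from `np.polyfit`; the iteration cap `max_iter` is an assumed safety limit, not part of the procedure above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.8, 4.5, 3.7])

theta0, theta1, alpha = 0.5, 0.5, 0.01
threshold, max_iter = 1e-8, 100_000  # max_iter is an assumed safety cap


def mse(t0, t1):
    return np.mean((t0 + t1 * x - y) ** 2)


previous = mse(theta0, theta1)
for iteration in range(1, max_iter + 1):
    errors = theta0 + theta1 * x - y
    # Simultaneous update: both gradients use the same errors.
    theta0, theta1 = (theta0 - alpha * errors.mean(),
                      theta1 - alpha * (errors * x).mean())
    current = mse(theta0, theta1)
    if abs(previous - current) <= threshold:
        break
    previous = current

# Closed-form least-squares fit for comparison.
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)

print(f"gradient descent: intercept = {theta0:.4f}, slope = {theta1:.4f} "
      f"after {iteration} iterations")
print(f"least squares:    intercept = {intercept_ls:.4f}, slope = {slope_ls:.4f}")
```

The closed-form fit for this dataset is intercept 1.75 and slope 0.62. Gradient descent stops close to, but not exactly at, these values because the loop ends as soon as the MSE change falls below the threshold.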


## Implementation