

### **1. Mean Squared Error (MSE)**:
- **Definition**: MSE measures the average squared difference between the predicted values ($\hat{y}$) and the actual values ($y$) in a dataset.
- **Formula**:  
    $
  \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
  $
  
  where:
  - $n$ is the number of observations,
  - $y_i$ is the actual value for observation $i$,
  - $\hat{y}_i$ is the predicted value for observation $i$.

- **Purpose**: It is a measure of the **error** or **loss** in predictions, often used in regression models to evaluate their performance. A smaller MSE indicates better predictions.
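
As a quick illustration of the formula, here is a minimal sketch using NumPy. The function name `mean_squared_error` and the sample values are made up for this example and do not come from any particular dataset or library.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative values only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

# Squared errors are 0.25, 0, 1, 1, so the mean is 2.25 / 4 = 0.5625.
mse = mean_squared_error(y_true, y_pred)
print(mse)
```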

---

### **2. Variance**:
- **Definition**: Variance measures the average squared deviation of the actual data points $y$ from their mean $\bar{y}$.
- **Formula**:
    $
  \text{Variance} = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2
  $
  
  where:
  - $\bar{y}$ is the mean of the data. This is the population form; the sample variance divides by $n - 1$ instead of $n$.

- **Purpose**: It quantifies the **spread** or **dispersion** of the data around the mean, indicating how much the data varies.
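
The variance formula can be checked the same way. The sketch below uses made-up values and a hypothetical helper named `variance`; it compares the manual calculation with NumPy's `np.var`, which divides by $n$ by default (`ddof=0`).

```python
import numpy as np

def variance(y):
    """Average squared deviation from the mean (divides by n)."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

# Illustrative values only; their mean is 4.
y = [3.0, 5.0, 2.0, 7.0, 3.0]

# Squared deviations are 1, 1, 4, 9, 1, so the variance is 16 / 5 = 3.2.
var_manual = variance(y)
var_numpy = np.var(y)  # ddof=0 by default, matching the 1/n formula
print(var_manual, var_numpy)
```

Note that only the data appears in this calculation: no predictions are involved. Equivalently, the variance is the MSE of a model that always predicts the mean.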

---

### **Key Differences**:
| Aspect               | Mean Squared Error (MSE)                          | Variance                                 |
|----------------------|---------------------------------------------------|------------------------------------------|
| **What it measures** | Error between predicted and actual values         | Spread of actual values around the mean |
| **Reference point**  | Predictions ($\hat{y}_i$)                         | Mean of the data ($\bar{y}$)             |
| **Purpose**          | Evaluate model performance                       | Describe the data distribution          |

---

### **Relationship**:
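In regression problems, the **expected MSE** of a model on new data can be decomposed into **variance**, **bias**, and **irreducible error** (as part of the bias-variance tradeoff):

$
\text{MSE} = \text{Variance} + (\text{Bias})^2 + \text{Irreducible Error}
$

Note that the variance in this decomposition is the variance of the model's **predictions** across different training sets, not the variance of the actual data defined in section 2. In either sense, variance is at most one component of MSE; it is not the same as MSE.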
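
One way to sanity-check the decomposition is a small simulation. The sketch below assumes a toy setup that is not part of the text above: a fixed true value, Gaussian noise, and a deliberately biased estimator (0.8 times the training-sample mean). The variance term is measured over the model's predictions across many simulated training sets.

```python
import numpy as np

rng = np.random.default_rng(0)

f_x0 = 2.0          # true target value at a fixed input (toy assumption)
sigma = 1.0         # standard deviation of the irreducible noise
n_train = 10        # observations per training set
n_trials = 100_000  # number of simulated training sets

# Each row is one independent training set of noisy observations.
train = f_x0 + sigma * rng.standard_normal((n_trials, n_train))

# A deliberately biased estimator: shrink the sample mean toward zero.
y_hat = 0.8 * train.mean(axis=1)

# One fresh noisy test observation per trial.
y_new = f_x0 + sigma * rng.standard_normal(n_trials)

mse = np.mean((y_new - y_hat) ** 2)
variance = np.var(y_hat)                # theory: 0.8**2 * sigma**2 / n_train = 0.064
bias_sq = (np.mean(y_hat) - f_x0) ** 2  # theory: (0.8 * 2.0 - 2.0)**2 = 0.16
irreducible = sigma ** 2                # 1.0 by construction

# The two printed numbers should agree up to simulation noise.
print(mse, variance + bias_sq + irreducible)
```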

---

### **Conclusion**:
- MSE is a measure of how well a model predicts the actual values.
- Variance is a property of the actual data, independent of any model.
- They are **not the same** but are related, particularly in the context of regression and error analysis.

# Variance reduction

$
\text{Variance Reduction} = \text{Variance (Parent)} - \sum_{i=1}^k w_i \cdot \text{Variance (Child}_i\text{)}
$

The formula is used to measure **variance reduction** in decision tree algorithms when splitting a node.

---

### **Explanation:**

1. **Variance (Parent/Root)**:
   - This represents the variance of the target variable $y$ at the parent node (before the split).

2. **Variance (Child)**:
   - After a split, the target values are divided into $k$ child nodes (e.g., $k = 2$ for binary splits).
   - Each child node has its own variance.

3. **Weights $w_i$**:
   - The weights $w_i$ represent the proportion of data points in each child node relative to the total data points in the parent node:
     $
     w_i = \frac{\text{Number of data points in Child}_i}{\text{Total number of data points in Parent}}
     $

4. **Summation**:
   - The weighted sum of the variances of the child nodes represents the combined variance after the split.

5. **Variance Reduction**:
   - The difference between the parent node's variance and the weighted sum of the child nodes' variances gives the **variance reduction**.
   - A larger variance reduction indicates a better split.
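
The steps above can be sketched in a few lines of NumPy. The helper names `variance` and `variance_reduction` and the sample values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def variance(y):
    """Average squared deviation from the mean (divides by n)."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted sum of the child variances."""
    n_parent = len(parent)
    weighted_child_var = sum(
        (len(child) / n_parent) * variance(child) for child in children
    )
    return variance(parent) - weighted_child_var

# Illustrative target values: two well-separated groups.
parent = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
children = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]

# Parent variance is 125.5 / 6 (about 20.92); each child has variance 2 / 3,
# and both weights are 0.5, so the reduction is about 20.25.
reduction = variance_reduction(parent, children)
print(reduction)
```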

---

### **Use in Decision Trees:**

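This formula is used to evaluate the quality of a split in regression trees (e.g., **CART for regression**). The algorithm chooses the split that maximizes the variance reduction, which is the same as minimizing the weighted variance within the child nodes. Because a regression tree node typically predicts the mean of its target values, the variance within a node equals the MSE of that prediction on the training data, so the chosen split is also the one that most reduces the training MSE.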
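
As a rough sketch of how such a search might look for a single numeric feature, the code below tries every midpoint between consecutive distinct feature values and keeps the threshold with the largest variance reduction. The function `best_split`, the midpoint convention, and the data are assumptions made for illustration; real tree libraries differ in details such as candidate thresholds and stopping rules.

```python
import numpy as np

def variance(y):
    """Average squared deviation from the mean (divides by n)."""
    return np.mean((y - y.mean()) ** 2)

def best_split(x, y):
    """Return (threshold, reduction) with the largest variance reduction."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    parent_var = variance(y)
    values = np.unique(x)  # sorted distinct feature values
    best_threshold, best_reduction = None, 0.0
    # Candidate thresholds: midpoints between consecutive distinct values,
    # so both children are always non-empty.
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2.0
        left, right = y[x <= threshold], y[x > threshold]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        reduction = parent_var - weighted
        if reduction > best_reduction:
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction

# Illustrative data: the target jumps between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]

threshold, reduction = best_split(x, y)
print(threshold, reduction)  # the chosen threshold is 6.5
```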

---

### **Formula Recap**:

$
\text{Variance Reduction} = \text{Variance (Parent)} - \sum_{i=1}^k w_i \cdot \text{Variance (Child}_i\text{)}
$

Where:
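- $\text{Variance (Parent)} = \frac{1}{n} \sum_{j=1}^n (y_j - \bar{y})^2$, where $n$ is the number of data points in the parent node and $\bar{y}$ is their mean,
- $\text{Variance (Child}_i\text{)}$ is calculated the same way over the data points in child node $i$.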

This formula is commonly applied in regression decision trees.