Now that we know how to create a model, we need to know how to improve its accuracy. This is a complicated step, and it does not always give the result we want. The key idea is to try out different techniques and check whether each one has a positive or negative impact on the model.

# <span style="color:#2E86C1"><b>Scaling Data</b></span>

 
- ## <span style="color:#D35400"><b>Normalization</b></span>
    
    Normalization, as used here, refers to min-max scaling: each feature is rescaled to fit within a specific range, usually between 0 and 1. (In scikit-learn, the `Normalizer` class means something different: it scales each individual sample to unit norm.) Normalization is useful when the feature values have different ranges and you want to bring them to a common scale.
    
    **Example**: Consider the weight and the price of gold as two features; one is on a very small scale while the other is on a very large one.

    **Formula and Notation**:
    
    $$ 
    X' = \frac{X - X_{min}}{X_{max} - X_{min}} 
    $$
    
    where:
    
    - $X'$ = Normalized value
    - $X$ = Original value
    - $X_{min}$ = Minimum value of the feature
    - $X_{max}$ = Maximum value of the feature

    ```python
    from sklearn.preprocessing import MinMaxScaler

    # `data` is a 2D array or DataFrame of numeric features
    scaler = MinMaxScaler()  # default feature_range=(0, 1)
    normalized_data = scaler.fit_transform(data)
    ```

    **When to use**:
    
    Normalization is generally preferred when the features have different scales, particularly when using algorithms that rely on distances, such as **k-nearest neighbors (KNN)**, or on gradient-based training, such as **neural networks**.
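
As a worked instance of the formula above, the following minimal sketch applies min-max scaling by hand with NumPy. The gold weights and prices are made-up illustrative values, not real data.

```python
import numpy as np

# Hypothetical data: column 0 = weight in kg, column 1 = price in USD
data = np.array([
    [0.01,   650.0],
    [0.05,  3250.0],
    [0.10,  6500.0],
    [0.50, 32500.0],
])

# Apply X' = (X - X_min) / (X_max - X_min) to each column (feature)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

print(normalized)
```

Because the hypothetical price is exactly proportional to the weight, both columns end up identical after scaling, even though their raw ranges differ by several orders of magnitude. A constant column would make the denominator zero; this sketch does not guard against that case.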

--- 

- ## <span style="color:#D35400"><b>Standardization</b></span>
    
    Standardization (or Z-score normalization) transforms the data to have a mean of 0 and a standard deviation of 1. It centers and rescales each feature but does not change the shape of its distribution, so the result follows a standard normal distribution only if the original feature was already normally distributed.

    **Formula and Notation**:
    
    $$ 
    X' = \frac{X - \mu}{\sigma} 
    $$
    
    where:
    
    - $X'$ = Standardized value
    - $X$ = Original value
    - $\mu$ = Mean of the feature
    - $\sigma$ = Standard deviation of the feature

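    ```python
    from sklearn.preprocessing import StandardScaler

    # `data` is a 2D array or DataFrame of numeric features
    scaler = StandardScaler()  # subtracts the mean, divides by the standard deviation
    standardized_data = scaler.fit_transform(data)
    ```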

    **When to use**:
    
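    Standardization is more appropriate when the data roughly follows a **Gaussian distribution**. It is a common choice for algorithms like **linear regression**, **logistic regression**, and **support vector machines (SVM)**, especially in their regularized forms, where features on different scales would be penalized unevenly.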
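
The sketch below applies the standardization formula by hand with NumPy. It assumes a train/test split (the synthetic arrays `X_train` and `X_test` are illustrative) and estimates $\mu$ and $\sigma$ from the training split only, which is one common way to keep test data out of the preprocessing step.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Estimate mu and sigma from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply X' = (X - mu) / sigma with the same statistics to both splits
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

With scikit-learn, the equivalent pattern would be `fit_transform` on the training split followed by `transform` on the test split.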


# <span style="color:#2E86C1"><b>Imputing Data</b></span>


Data imputation is a method for retaining most of a dataset's information by substituting missing data with estimated values. These methods are used because removing a row every time a missing value is encountered would be impractical and could shrink the dataset considerably. Imputation helps maintain the integrity of the dataset and avoids the bias that dropping rows can introduce.

---

- ## <span style="color:#D35400"><b>Different Techniques</b></span>

    - ### <span style="color:#28B463"><b>Imputing with Mean, Median, Mode, Forward Fill (ffill), and Backward Fill (bfill)</b></span>
    
        You can use pandas' `fillna()`, `ffill()`, and `bfill()` methods to impute missing values in various ways.
        
        ```python
        import pandas as pd

        # Sample DataFrame
        data = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 3, 4, None]})

        # Each option below is an alternative; apply one per column, not all in sequence.

        # Impute with mean
        a_mean = data['A'].fillna(data['A'].mean())

        # Impute with median
        b_median = data['B'].fillna(data['B'].median())

        # Impute with mode (mode() returns a Series; take the first value)
        b_mode = data['B'].fillna(data['B'].mode()[0])

        # Forward fill (leading NaNs stay missing)
        data_ffill = data.ffill()

        # Backward fill (trailing NaNs stay missing)
        data_bfill = data.bfill()
        ```

    - ### <span style="color:#28B463"><b>Iterative Imputation (MICE)</b></span>
        
        - ### <span style="color:pink"><b>Overview</b></span>
            -  Multiple Imputation by Chained Equations (MICE) uses an iterative approach to fill in missing values based on other features.
            - Utilizes a regression model to predict missing values based on other features in the dataset.
        
        - ### <span style="color:pink"><b>Process</b></span>
            - Initializes missing values with a guess (e.g., mean).
            - Iteratively models each feature with missing values using regression on remaining features.
            - Updates missing values until convergence.

        - ### <span style="color:pink"><b>Benefits</b></span>
            - Captures complex relationships among features for more accurate imputations.
            - Suitable for datasets with correlated features.

        ```python
        # IterativeImputer is still experimental; this import enables it
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer

        imputer = IterativeImputer(random_state=0)
        imputed_data = imputer.fit_transform(data)
        ```

    - ### <span style="color:#28B463"><b>KNN Imputation</b></span>
    
        - ### <span style="color:pink"><b>Overview</b></span>
            - K-Nearest Neighbors (KNN) can also be used to impute missing values based on the nearest samples.
            - Fills missing values by averaging values from the K nearest neighbors in the dataset.
    
        - ### <span style="color:pink"><b>Process</b></span>
            - Calculates distance between instances to find K nearest neighbors.
            - Imputes each missing value using the mean of that feature across the neighbors. The mode can be used for categorical features, although scikit-learn's `KNNImputer` only supports numeric data.

        - ### <span style="color:pink"><b>Benefits</b></span>
            - Preserves local structure and relationships in the data.
            - Simple and effective when sufficient similar observations are present.
            

        ```python
        from sklearn.impute import KNNImputer
        
        imputer = KNNImputer(n_neighbors=5)
        imputed_data = imputer.fit_transform(data)
        ```

    - ### <span style="color:#28B463"><b>Simple Imputer</b></span>
    
        The `SimpleImputer` class can be used to specify different strategies for imputation.
        
        ```python
        from sklearn.impute import SimpleImputer
        
        imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent', etc.
        imputed_data = imputer.fit_transform(data)
        ```

    - ### <span style="color:#28B463"><b>Imputing with Min/Max Values</b></span>
    
        You can also use the minimum or maximum values for imputation.
        
        ```python
        # Impute with each column's minimum value
        data_min_filled = data.fillna(data.min())

        # Impute with each column's maximum value
        data_max_filled = data.fillna(data.max())
        ```
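
To see how the scikit-learn imputers above can differ on the same input, here is a small sketch that fills one missing cell with three of them. The array `X` is hypothetical; its second column is roughly twice the first, so the missing value has an obvious "expected" size.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Hypothetical data: the second column is roughly twice the first
X = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.0],
    [4.0, 8.2],
    [5.0, np.nan],
])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(random_state=0),
}

# Collect the value each imputer fills in for the missing cell (row 4, column 1)
filled = {name: imp.fit_transform(X)[4, 1] for name, imp in imputers.items()}
for name, value in filled.items():
    print(f"{name}: {value:.2f}")
```

On this toy input the mean strategy ignores the relationship between the columns, KNN averages the two closest rows, and the iterative imputer extrapolates the linear trend. Which behavior is preferable depends on the data.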

---

- ## <span style="color:#D35400"><b>When to Use Each Technique</b></span>

    - **Mean, Median, Mode Imputation**: Use the mean for numerical features that are symmetrically distributed; the median is preferable for skewed distributions or when outliers are present. Use the mode for categorical features.
    - **Forward Fill / Backward Fill**: Suitable for time series data where the order is important, and missing values are expected to be similar to nearby values.
    - **Iterative Imputation (MICE)**: Best for datasets with complex relationships among features. This technique often yields better results when features are correlated.
    - **KNN Imputation**: Effective for datasets where the values of a feature are influenced by other features. It's useful when the dataset is not too large, as it can be computationally expensive.
    - **Simple Imputer**: Useful for general cases where a specific strategy is required. It offers flexibility in choosing the imputation strategy.
    - **Min/Max Imputation**: Generally used for bounded features, but use with caution as it can reduce variability in the data.
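
The difference between mean and median imputation on skewed data can be checked with a few lines of NumPy. The income values below are invented for illustration.

```python
import numpy as np

# Hypothetical incomes: mostly similar values plus one extreme outlier
incomes = np.array([30_000.0, 32_000.0, 35_000.0, np.nan, 31_000.0, 1_000_000.0])

mean_fill = np.nanmean(incomes)      # pulled upward by the outlier
median_fill = np.nanmedian(incomes)  # stays near the typical value

print(f"mean fill:   {mean_fill:,.0f}")
print(f"median fill: {median_fill:,.0f}")
```

Here the mean would fill the gap with 225,600 while the median would use 32,000, which is much closer to the typical observation.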

---

- ## <span style="color:#D35400"><b>Why We Need Imputation</b></span>

    Imputation is crucial for maintaining the usability of a dataset, especially in real-world applications where missing values are common. By imputing missing data, we can preserve the size and integrity of the dataset, which is vital for training effective machine learning models.

---

- ## <span style="color:#D35400"><b>Impact on Actual Data and Model</b></span>

    The impact of data imputation can be both positive and negative:
    
    - **Positive**: Imputation can lead to more robust models that generalize better due to the increased amount of usable data.
    - **Negative**: If not done correctly, imputation can introduce bias, reduce variability, or distort relationships between features, ultimately leading to poor model performance.
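
The negative effects listed above can be observed directly. The sketch below uses synthetic data (an assumption for illustration), hides 40% of one variable at random, fills the gaps with the mean, and compares variance and correlation before and after.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)

# Hide 40% of y at random, then fill the gaps with the mean of what remains
missing = rng.random(1000) < 0.4
y_imputed = y.copy()
y_imputed[missing] = y[~missing].mean()

var_before, var_after = y.var(), y_imputed.var()
corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(x, y_imputed)[0, 1]

print(f"variance:   {var_before:.3f} -> {var_after:.3f}")
print(f"corr(x, y): {corr_before:.3f} -> {corr_after:.3f}")
```

Both numbers drop after imputation: every filled value sits at the mean, so it adds no spread and carries no information about `x`.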


# <span style="color:#2E86C1"><b>Regularization</b></span>

-   Regularization is a set of methods aimed at **reducing overfitting** in machine learning models. Typically, it involves trading a marginal decrease in **training accuracy** for an increase in **generalizability**, the model's ability to produce accurate predictions on new datasets.
-   In other words, a regularized model often has **higher training error**: it may fit the training data less closely while making better predictions on test data.

### <span style="color:#28B463"><b>Bias-Variance Tradeoff</b></span>

The concession of increased training error for decreased testing error is known as the **bias-variance tradeoff**. Here's a brief breakdown:

- **Bias**: Measures the average difference between predicted and true values. High bias results in high error on the training set.
  
- **Variance**: Measures how much predictions differ across various subsets of the same data. High variance indicates poor performance on unseen data.

### <span style="color:#D35400"><b>Key Points on Variance:</b></span>
- **Variance** in machine learning reflects how much a model's predictions change when trained on different data subsets. It signifies a model's sensitivity to training data.

- **Different Subsets, Different Models**: Training on different data subsets often results in slightly different models due to randomness.
  
- **Prediction Variation**: These models may produce varying predictions on unseen data, with variance measuring the extent of this variation.

- **Lower Prediction Variance**: Indicates that the model generalizes well rather than memorizing patterns.
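
The idea of variance as "how much predictions change across training subsets" can be measured empirically. This sketch, using synthetic data and polynomial models chosen purely for illustration, fits many models on random subsets and reports how much their predictions disagree at fixed query points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=x.size)
x_query = np.linspace(-1.0, 1.0, 50)  # fixed points where predictions are compared

def prediction_variance(degree, n_models=100, subset_size=30):
    """Fit one polynomial per random subset; return the mean variance of their predictions."""
    preds = []
    for _ in range(n_models):
        idx = rng.choice(x.size, size=subset_size, replace=False)
        coeffs = np.polyfit(x[idx], y[idx], deg=degree)
        preds.append(np.polyval(coeffs, x_query))
    return float(np.var(np.array(preds), axis=0).mean())

var_simple = prediction_variance(degree=1)
var_complex = prediction_variance(degree=10)
print(f"degree 1: {var_simple:.4f}, degree 10: {var_complex:.4f}")
```

The degree-10 model is far more sensitive to which 30 points it happens to see, so its predictions vary more from subset to subset than those of the straight line.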

### <span style="color:#D35400"><b>Aim of Regularization:</b></span>

Developers strive to reduce both bias and variance. However, simultaneous reduction isn't always achievable, leading to the need for regularization, which decreases model variance at the cost of increased bias.

### <span style="color:#D35400"><b>Understanding Overfitting and Underfitting:</b></span>

- **Overfitting**: 
    -   Characterized by low bias and high variance. This occurs when a model learns noise from the training data.
    -   Happens when the model is too complex and captures even the noise in the data, making it perform well on the training data but poorly on unseen data.
- **Underfitting**: 
    -   Characterized by high bias and low variance, resulting in poor predictions on both training and test data. This often arises from a model with too few parameters or from insufficient training.
    -   Occurs when a model is too simple to capture the underlying patterns in the data.
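
A quick way to see both failure modes is to fit models of increasing complexity to the same small dataset and compare training and test error. The data and the polynomial degrees below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(500)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (0, 3, 12):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE={errors[degree][0]:.3f}, test MSE={errors[degree][1]:.3f}")
```

The constant model (degree 0) has high error on both sets, which is underfitting. The degree-12 model drives training error toward zero, yet its test error stays well above its training error, which is the signature of overfitting.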

---

- ### <span style="color:#D35400"><b>Impact of Data Size on Underfitting and Overfitting</b></span>

    - #### <span style="color:#28B463"><b>1. Small Data Size</b></span>

        When the dataset is small, **overfitting** is more likely to occur because the model memorizes the limited data points and fails to generalize to new data.

        `Small data`: Models may learn specific details (including noise) and struggle when exposed to new data.

    - #### <span style="color:#28B463"><b>2. Large Data Size</b></span>

        With **more data**, the risk of **overfitting decreases**, as the model has a larger, more diverse set of examples to learn from. However, with a simple model, underfitting might occur because the model cannot capture the complexity of the larger dataset.

---

- ### <span style="color:#D35400"><b>Balancing the Data and Model Complexity</b></span>

    - **Larger datasets** generally help reduce overfitting because the model can generalize better. However, to prevent underfitting, **model complexity** should increase with the size of the dataset.

    - Proper techniques such as **cross-validation**, **regularization**, and **model tuning** are crucial to ensuring that the model neither underfits nor overfits, regardless of the data size.
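
As one possible way to combine cross-validation with regularization and tuning, the sketch below scores several regularization strengths with 5-fold cross-validation. The synthetic dataset and the candidate `alpha` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))            # few samples relative to the number of features
true_w = np.zeros(30)
true_w[:3] = [2.0, -1.0, 0.5]            # only three features actually matter
y = X @ true_w + rng.normal(scale=1.0, size=60)

# Score each candidate regularization strength with 5-fold cross-validation
scores = {}
for alpha in (0.001, 0.1, 1.0, 10.0, 100.0):
    cv_r2 = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    scores[alpha] = float(cv_r2.mean())
    print(f"alpha={alpha:>7}: mean CV R^2 = {scores[alpha]:.3f}")

best_alpha = max(scores, key=scores.get)
print("selected alpha:", best_alpha)
```

The selected value is the one that generalizes best across the held-out folds, not the one that fits the training data most closely.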


---

<center><img src="../../../images/bias_variance_tradeoff.jpg" alt="error" width="600"/></center>

### <span style="color:#D35400"><b>Regularization Effects:</b></span>

While regularization aims to reduce overfitting, it can also lead to underfitting if too much bias is introduced. Thus, determining the appropriate type and degree of regularization requires careful consideration of:

- Model complexity
- Dataset characteristics
- Specific requirements of the task

---



## <span style="color:#D35400"><b>1. L1 Regularization (Lasso Regression)</b></span>

- **Detailed Explanation**:  
  L1 regularization adds the **absolute values** of the feature weights (coefficients) to the model's error as a penalty. This encourages the model to reduce some weights to **exactly zero**. By doing so, the model effectively eliminates irrelevant or less important features, simplifying the model. This makes it great for **feature selection**, especially when you have a large number of features and want the model to automatically ignore those that don’t contribute much to predictions.

  - **Why does it lead to sparsity?**: The absolute value penalty pulls every weight toward zero with the same constant strength, no matter how small the weight already is. If a feature doesn’t reduce the error enough to offset that pull, its weight is set to exactly zero. This creates a **sparse** model, where only a few features have non-zero weights, and the rest are ignored.

  - **Use in practice**: L1 regularization (Lasso) is used when you expect some features to be irrelevant, or when you have many features and need to reduce the dimensionality by picking only the most important ones.

- **Formula**:  
$$ \text{Loss Function} = \text{Original Loss} + \lambda \sum |w_i| $$  
Where:  
$w_i$ = weight of each feature (coefficient)  
$\lambda$ = regularization parameter that controls how strong the penalty is. A higher value makes the model shrink more weights to zero.
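
The sparsity effect can be observed with scikit-learn's `Lasso`, where the regularization parameter $\lambda$ is exposed as `alpha`. This is a minimal sketch on synthetic data in which only the first two of ten features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Count how many weights are exactly zero for increasing penalty strength
zero_counts = {}
for alpha in (0.01, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(X, y)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0.0))
    print(f"alpha={alpha}: {zero_counts[alpha]} of 10 weights are exactly zero")
```

With a stronger penalty the irrelevant features drop out of the model entirely, while the informative ones keep non-zero (though shrunken) weights.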

---

## <span style="color:#D35400"><b>2. L2 Regularization (Ridge Regression)</b></span>

- **Detailed Explanation**:  
  L2 regularization adds the **squared values** of the feature weights as a penalty to the model's error. This doesn’t force any weights to become exactly zero like L1 does, but it **shrinks** all weights closer to zero. By doing this, L2 regularization reduces the impact of less important features without completely eliminating them.

  - **Why shrink, but not eliminate?**: Squaring the weights means the penalty grows faster for larger weights, so the model is encouraged to reduce them. Because the pull of the squared penalty fades as a weight approaches zero, weights get smaller but are generally not set to exactly zero. It results in **smaller weights**, so the model becomes simpler and less likely to overfit, while still considering all features.

  - **Use in practice**: L2 regularization (Ridge) is useful when you think **all features** are important, but you want to **control their influence** to prevent overfitting, especially when you have a lot of features or complex models.

- **Formula**:  
$$ \text{Loss Function} = \text{Original Loss} + \lambda \sum w_i^2 $$  
Where:  
$w_i$ = weight of each feature  
$\lambda$ = regularization parameter (controls how much the weights are shrunk).
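
For a linear model, the L2-penalized loss has a closed-form minimizer, which makes the shrinking effect easy to inspect. The sketch below assumes the original loss is the sum of squared errors with no intercept term; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([4.0, -3.0, 2.0, 0.5, 0.1])
y = X @ true_w + rng.normal(scale=0.5, size=100)

def ridge_weights(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * sum(w_i^2) via w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

norms = {}
for lam in (0.0, 10.0, 100.0, 1000.0):
    w = ridge_weights(X, y, lam)
    norms[lam] = float(np.linalg.norm(w))
    print(f"lambda={lam:>6}: w={np.round(w, 3)}, ||w||={norms[lam]:.3f}")
```

As $\lambda$ grows, the weight vector shrinks toward zero, but every weight remains non-zero.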

---

## <span style="color:#D35400"><b>3. Elastic Net Regularization</b></span>

- **Detailed Explanation**:  
  Elastic Net combines the strengths of both **L1** (Lasso) and **L2** (Ridge) regularization. It uses both the absolute value of the weights (L1) and their square (L2). This gives Elastic Net the ability to both **select important features** (L1 property) and **shrink weights** (L2 property) to prevent overfitting.

  - **Why combine L1 and L2?**: L1 regularization alone can sometimes be too aggressive and remove too many features, while L2 doesn’t remove any features. Elastic Net gives a **balance**, offering both **feature selection** and **weight shrinkage**. This is especially useful when you suspect that many features are correlated or when you don’t know if you need L1 or L2, and you want the model to figure it out.

  - **Use in practice**: Elastic Net is used when you want the **flexibility** of both regularization methods, making it a good choice when you have a lot of features and need both selection and shrinkage.

- **Formula**:  
$$ \text{Loss Function} = \text{Original Loss} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2 $$  
Where:  
$w_i$ = weight of each feature  
$\lambda_1, \lambda_2$ = regularization parameters for L1 and L2 penalties.
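
scikit-learn's `ElasticNet` expresses the two penalties through `alpha` and `l1_ratio` rather than $\lambda_1$ and $\lambda_2$ directly. The sketch below uses synthetic data with two strongly correlated informative features and six noise features; the parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Features 0 and 1 are noisy copies of the same signal; the other six are pure noise
X = np.hstack([base + 0.05 * rng.normal(size=(200, 2)), rng.normal(size=(200, 6))])
y = 3.0 * base[:, 0] + rng.normal(scale=0.5, size=200)

# scikit-learn parameterizes the two penalties as
#   lambda_1 = alpha * l1_ratio   and   lambda_2 = 0.5 * alpha * (1 - l1_ratio)
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

print("weights:", np.round(model.coef_, 3))
print("zeroed weights:", int(np.sum(model.coef_ == 0.0)))
```

The L1 part removes the noise features, while the L2 part lets the two correlated features share the weight instead of forcing the model to pick only one of them.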

---



## <span style="color:#D35400"><b>4. Dropout Regularization (Neural Networks)</b></span>

- **Explanation**: **Dropout** is a regularization technique primarily used in neural networks. During training, randomly selected neurons are "dropped" or ignored, preventing the model from becoming too dependent on particular neurons and reducing overfitting.
  
- **How It Works**: Neurons are randomly set to zero during each training step, which forces the model to learn more robust representations.

- **Sample Code** (Keras):
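```python
from tensorflow.keras.layers import Dropout

# `model` is assumed to be an existing Keras Sequential model
model.add(Dropout(0.5))  # drop 50% of the previous layer's outputs at each training step
```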

- **Use Case**: Especially useful in **deep learning models** to prevent overfitting, particularly in large networks.
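
To make the mechanism concrete, here is a minimal NumPy sketch of "inverted" dropout, the variant in which the surviving activations are scaled up by `1 / (1 - rate)` during training so that nothing needs to change at inference time. The function name and the all-ones input are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

activations = np.ones((4, 1000))
train_out = dropout(activations, rate=0.5, training=True)
eval_out = dropout(activations, rate=0.5, training=False)

print("fraction zeroed in training:", np.mean(train_out == 0.0))
print("mean activation in training:", train_out.mean())
```

Roughly half of the units are zeroed on each training call, the average activation stays close to its original value, and the input passes through untouched when `training=False`.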

---

## <span style="color:#D35400"><b>5. Early Stopping</b></span>

- **Explanation**: **Early stopping** halts the training process when the performance on a validation dataset starts to degrade. This prevents the model from continuing to fit the noise in the training data.

- **Sample Code** (Keras):
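```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 5 epochs and restore the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# epochs=100 is only an upper bound; training usually ends earlier
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, callbacks=[early_stop])
```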

- **Use Case**: Commonly used in **deep learning** to reduce overfitting when training for a large number of epochs.
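
The patience logic behind early stopping can be sketched in plain Python. This is a simplified illustration, not the Keras implementation; the function name and the validation losses are hypothetical, and epochs are counted from zero.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop: no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran out of epochs without triggering

# Hypothetical validation losses: improving at first, then degrading as the model overfits
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

With `patience=3`, training stops at epoch 6, three epochs after the best validation loss at epoch 3. Restoring the weights from the best epoch is a separate step that this sketch does not model.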

---

## <span style="color:#D35400"><b>6. Data Augmentation (Deep Learning)</b></span>

- **Explanation**: **Data Augmentation** increases the size of the training dataset by applying transformations (rotations, flips, etc.) to existing data. It’s a form of regularization that forces the model to learn more robust features by exposing it to slightly varied data.

- **Sample Code** (Keras):
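```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
# Yields randomly transformed batches; pass this to model.fit in place of the raw arrays
train_batches = datagen.flow(X_train, y_train, batch_size=32)
```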

- **Use Case**: Especially effective in **computer vision** tasks when training datasets are small.
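
What an augmentation step does to a single image can be shown without any deep learning library. The sketch below is a simplification: it uses horizontal flips and 90-degree rotations only (not the arbitrary-angle rotations mentioned above), and the 4x4 array stands in for a real image.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly transformed copy of an (H, W) image: optional flip plus a 90-degree-step rotation."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0, 90, 180, or 270 degrees
    return out.copy()

# Hypothetical 4x4 grayscale "image"
image = np.arange(16, dtype=float).reshape(4, 4)
augmented = [augment(image) for _ in range(8)]

for aug in augmented[:2]:
    print(aug)
```

Each augmented copy contains exactly the same pixel values in a different arrangement, so the label stays valid while the model sees more varied inputs.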

---

## <span style="color:#D35400"><b>7. Weight Regularization (Neural Networks)</b></span>

- **Explanation**: In neural networks, **weight regularization** techniques (like L1 or L2 penalties) are applied to the weights of the network to limit their size, thus preventing overfitting.

- **Sample Code** (Keras):
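```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

# Adds 0.01 * sum(w^2) over this layer's kernel weights to the loss
model.add(Dense(units=64, kernel_regularizer=l2(0.01)))
```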

- **Use Case**: Applied in deep learning networks to control the size of weights and avoid overfitting.
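
To show how an L2 weight penalty enters training, the sketch below runs plain gradient descent on a single linear layer with and without the penalty. The data, learning rate, and penalty strength are assumptions chosen for illustration; a real network would apply the same extra gradient term to each regularized layer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] + rng.normal(scale=0.1, size=50)

def train(lam, steps=500, lr=0.01):
    """Gradient descent on mean squared error + lam * sum(w^2) for a single linear layer."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad_loss = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of the data loss
        grad_penalty = 2.0 * lam * w                  # gradient of lam * sum(w^2)
        w -= lr * (grad_loss + grad_penalty)
    return w

w_plain = train(lam=0.0)
w_l2 = train(lam=0.5)
print("||w|| without penalty:", np.linalg.norm(w_plain))
print("||w|| with L2 penalty:", np.linalg.norm(w_l2))
```

The penalty term pulls every weight toward zero at each step, so the regularized run ends with a smaller weight norm.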

---

By understanding and applying the right type of regularization, you can control the complexity of your machine learning models, prevent overfitting, and improve generalization on unseen data.