### **Challenge 4: Compare Feature Scaling Methods in a Machine Learning Model**
#### **Topic:** Feature Scaling (StandardScaler, MinMaxScaler)  

#### **Problem Description:**  
You are given a dataset with both numerical and categorical features. Your task is to:
1. **Apply two different scaling methods** (`MinMaxScaler` and `StandardScaler`) to the numeric features.
2. **Train a simple Linear Regression model** on both scaled datasets.
3. **Compare the model's performance** by calculating the Mean Squared Error (MSE) for each scaling method.

Your function should return the MSE for both Min-Max Scaling and Standard Scaling.

---

### **Function Signature:**
```python
def compare_scaling_methods(data: pd.DataFrame, target_column: str) -> dict:
    """
    Compare the impact of Min-Max Scaling and Standard Scaling on Linear Regression performance.

    Args:
    data (pd.DataFrame): The dataset containing both numerical and categorical features.
    target_column (str): The column name of the target variable.

    Returns:
    dict: A dictionary containing MSE values for 'minmax' and 'zscore' scaling methods.
    """
```

---

### **Constraints:**
1. The dataset contains both **numerical and categorical** features.
2. The **target column is numeric** and should not be scaled.
3. Use **Linear Regression** from `sklearn.linear_model` to evaluate performance.
4. Compute **Mean Squared Error (MSE)** using `sklearn.metrics.mean_squared_error`.
5. **Non-numeric columns** should be dropped before scaling, as we're only scaling numeric features.
6. Return a dictionary with MSE values for both scaling methods.

---

### **Example Input:**
```python
import pandas as pd

data = pd.DataFrame({
    'feature1': [100, 200, 300, 400, 500],
    'feature2': [10, 20, 30, 40, 50],
    'category': ['A', 'B', 'A', 'B', 'A'],  # Categorical feature (to be ignored)
    'target': [5, 10, 15, 20, 25]  # Target variable
})

target_column = 'target'
```

---

### **Example Output:**
```python
{
    'minmax': 0.0,
    'zscore': 0.0
}
```
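(On this small, perfectly linear dataset the MSE is effectively `0.0`; an actual run may print a tiny floating-point residue instead. Because plain Linear Regression with an intercept absorbs any per-feature linear rescaling into its coefficients, the two values should agree up to round-off on any dataset. The choice of scaler matters more for regularized, distance-based, or gradient-trained models.)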

---

### **Hints:**
1. Use `select_dtypes` to **filter numeric columns** for scaling.
2. Drop the **target column** before scaling, as it should not be transformed.
3. Use `MinMaxScaler` and `StandardScaler` from `sklearn.preprocessing` to scale features.
4. Use `LinearRegression` from `sklearn.linear_model` to train and evaluate the model.
5. Use `train_test_split` to split the data into training and testing sets.

---

# Solution

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
def compare_scaling_methods(data: pd.DataFrame, target_column: str) -> dict:
    """
    Compare the impact of Min-Max Scaling and Standard Scaling on Linear Regression performance.

    Args:
    data (pd.DataFrame): The dataset containing both numerical and categorical features.
    target_column (str): The column name of the target variable.

    Returns:
    dict: A dictionary containing MSE values for 'minmax' and 'zscore' scaling methods.
    """
    # Step 1: Select numeric features and separate target variable
    numeric_cols = data.select_dtypes(include=['number']).columns.tolist()
    numeric_cols.remove(target_column)  # Remove target column from features

    X = data[numeric_cols]  # Feature matrix
    y = data[target_column]  # Target variable

    # Step 2: Split data into training and testing sets (80-20 split)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Step 3: Function to scale, train, and evaluate model
    def train_and_evaluate(scaler):
        """Scales the data using the given scaler, trains a Linear Regression model, and computes MSE."""
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model = LinearRegression()
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)

        return mean_squared_error(y_test, y_pred)

    # Step 4: Train and evaluate models using both scalers
    mse_results = {
        'minmax': train_and_evaluate(MinMaxScaler()),
        'zscore': train_and_evaluate(StandardScaler())
    }

    return mse_results

### **Time and Space Complexity:**
- **Time Complexity:**
  - `train_test_split`: $ O(n) $
  - `MinMaxScaler` / `StandardScaler`: $ O(n \cdot m) $ (where $ m $ is the number of features)
  - `LinearRegression.fit()`: $ O(n \cdot m^2) $ (depends on the solver)
  - `LinearRegression.predict()`: $ O(n \cdot m) $
  - **Overall: $ O(n \cdot m^2) $** (dominated by training step)

- **Space Complexity:**
  - Stores scaled copies of `X_train` and `X_test` $ O(n \cdot m) $
  - **Overall: $ O(n \cdot m) $**

---

### **Note: target variables do not need to be scaled**

Here's why:
- In **regression problems**, the target variable represents the value we want to predict. Scaling the target isn't necessary unless the target values have **very large magnitude differences** or we're using a model that is sensitive to the scale of `y`.
- **Linear Regression is unaffected by a linear rescaling of the target**: rescaling `y` only rescales the coefficients and intercept, so the predictions are the same once they are mapped back to the original units.
- We scale only the features (`X`) because:
  - Many machine learning algorithms (e.g., **gradient-based optimizers, distance-based models**) work better when the features are on the same scale.
  - For plain Linear Regression the predictions do not change under feature scaling, but scaling makes the coefficients comparable across features and improves numerical conditioning when feature magnitudes differ widely.

✅ **When should you scale the target (`y`)?**
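- If the target values span a very large range and the model is sensitive to that, scaling can help, e.g. in **neural networks** trained by gradient descent or in **SVR**, whose `epsilon` margin is expressed in target units.
- You can use `MinMaxScaler` or `StandardScaler` on `y` if needed (remember to invert the transform on the predictions), but this is uncommon in Linear Regression.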
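
The point about Linear Regression and target scaling can be checked with a minimal sketch. It assumes scikit-learn's `TransformedTargetRegressor`, which is not used elsewhere in this notebook, and synthetic data invented purely for illustration. The wrapper standardizes `y` before fitting and maps predictions back to the original units.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical, for illustration only): a target with large magnitude
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1000.0 * X[:, 0] - 250.0 * X[:, 1] + rng.normal(scale=5.0, size=50)

# Plain Linear Regression on the raw target
plain = LinearRegression().fit(X, y)

# Same model, but y is standardized before fitting and the predictions
# are inverse-transformed back to the original units
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
).fit(X, y)

# For ordinary least squares the two sets of predictions coincide
same_predictions = np.allclose(plain.predict(X), wrapped.predict(X))
print(same_predictions)
```

For models where the target scale does matter, the same wrapper can be used with a different `regressor`, which keeps the inverse transform from being forgotten.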

---

# Example Execution 1

In [3]:
data = pd.DataFrame({
    'feature1': [100, 200, 300, 400, 500],
    'feature2': [10, 20, 30, 40, 50],
    'category': ['A', 'B', 'A', 'B', 'A'],  # Categorical feature (to be ignored)
    'target': [5, 10, 15, 20, 25]  # Target variable
})

target_column = 'target'

In [4]:
data

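|   | feature1 | feature2 | category | target |
|---|----------|----------|----------|--------|
| 0 | 100 | 10 | A | 5 |
| 1 | 200 | 20 | B | 10 |
| 2 | 300 | 30 | A | 15 |
| 3 | 400 | 40 | B | 20 |
| 4 | 500 | 50 | A | 25 |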


In [5]:
compare_scaling_methods(data, 'target')

{'minmax': 3.1554436208840472e-30, 'zscore': 3.1554436208840472e-30}

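In **this specific dataset**, both **MinMaxScaler** and **StandardScaler** give a **Mean Squared Error (MSE) that is effectively 0.0**; the reported `3.16e-30` is floating-point round-off.  

**Why?**
- The dataset is **perfectly linear**:  
  - `feature1` and `feature2` are scaled versions of each other (`feature1 = 10 * feature2`).
  - The target `y` follows a simple linear equation (`target = feature2 / 2`).
  - Since Linear Regression fits linear relationships exactly, it finds a perfect fit, and the MSE is **0.0** up to round-off.

On real data the MSE will generally not be zero. For plain Linear Regression with an intercept, however, **MinMaxScaler and StandardScaler still give the same predictions**, because both are per-feature linear rescalings that the fitted coefficients absorb. The choice of scaler changes results mainly for models that are **regularized** (Ridge, Lasso), **distance-based** (k-NN, SVM), or **trained by gradient descent**. The next two examples add an outlier and mix feature distributions, and the two MSE values still agree.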
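
As a rough illustration of when the scaler choice matters, the sketch below compares predictions under both scalers for plain Linear Regression and for `Ridge`. The synthetic data, the `alpha=10.0` setting, and the `predictions` helper are assumptions made for this example and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data (hypothetical): one normal and one exponential feature
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50, 10, 200),
    rng.exponential(10, 200),
])
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=5.0, size=200)

X_train, X_test = X[:150], X[150:]
y_train = y[:150]

def predictions(scaler, model):
    """Fit scaler + model on the training rows and predict the test rows."""
    return make_pipeline(scaler, model).fit(X_train, y_train).predict(X_test)

# Largest absolute difference between predictions under the two scalers
ols_gap = np.max(np.abs(
    predictions(MinMaxScaler(), LinearRegression())
    - predictions(StandardScaler(), LinearRegression())
))
ridge_gap = np.max(np.abs(
    predictions(MinMaxScaler(), Ridge(alpha=10.0))
    - predictions(StandardScaler(), Ridge(alpha=10.0))
))

print(f"OLS gap:   {ols_gap:.2e}")    # round-off only
print(f"Ridge gap: {ridge_gap:.2e}")  # clearly non-zero
```

The Ridge penalty acts on the coefficients, whose size depends on the feature scale, so the same `alpha` shrinks the model by a different amount under each scaler.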

# Example Execution 2 (with outliers)

In [6]:
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 1000],  # Large outlier in last value
    'feature2': [10, 20, 30, 40, 50],
    'category': ['A', 'B', 'A', 'B', 'A'],  # Categorical feature (to be ignored)
    'target': [5, 10, 15, 20, 25]
})

In [7]:
compare_scaling_methods(data, 'target')

{'minmax': 3.1554436208840472e-30, 'zscore': 0.0}

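🔹 **Effect of the outlier:**
- `MinMaxScaler` maps `1000` to `1` and **compresses all other values close to 0**, because `1000` is the max.
- `StandardScaler` is **also affected**: the outlier inflates both the mean and the standard deviation, so the remaining values end up tightly clustered near `-0.5`. Its output is not bounded to a fixed range, but it is not robust to outliers.
- The MSE does not change here because Linear Regression absorbs either rescaling; the difference between `3.16e-30` and `0.0` is floating-point round-off.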
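
To see the effect directly, the following sketch scales the `feature1` column from this example with both scalers. `RobustScaler` (median and interquartile range) is added for contrast only; it is not used in the original notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# feature1 from Example Execution 2, as a single-column matrix
feature1 = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

spreads = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled = scaler.fit_transform(feature1).ravel()
    # Range covered by the four non-outlier values after scaling
    spreads[name] = scaled[:4].max() - scaled[:4].min()
    print(f"{name:>14}: {np.round(scaled, 3)}  inlier spread = {spreads[name]:.4f}")
```

Both `MinMaxScaler` and `StandardScaler` squeeze the four ordinary values into a very narrow band, while `RobustScaler` keeps them spread out.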

# Example Execution 3 (with different distribution in features)

In [8]:
import numpy as np

In [9]:
data = pd.DataFrame({
    'feature1': np.random.normal(50, 10, 5),  # Normal distribution
    'feature2': np.random.exponential(10, 5),  # Exponential distribution
    'category': ['A', 'B', 'A', 'B', 'A'],  # Categorical feature (to be ignored)
    'target': [5, 10, 15, 20, 25]
})

In [10]:
compare_scaling_methods(data, 'target')

{'minmax': 122.78079535758108, 'zscore': 122.78079535758108}

Both values are again identical, as expected for plain Linear Regression. The error is large because the fixed target is unrelated to the randomly drawn features and, with five rows and `test_size=0.2`, the test set holds a single row. The features are drawn without a fixed seed, so the exact number changes from run to run.

### **Summary of Key Differences**
| Property | **MinMaxScaler** | **StandardScaler** |
|---------|----------------|------------------|
| **Formula** | $ X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}} $ | $ X' = \frac{X - \mu}{\sigma} $ |
| **Range** | [0, 1] on the data the scaler was fitted on | Mean = 0, Std = 1 (unbounded) |
| **Sensitive to Outliers?** | ✅ Yes (min and max are set by the extremes) | ✅ Yes (outliers shift the mean and inflate the standard deviation) |
| **Changes Distribution?** | ❌ No (linear transform) | ❌ No (linear transform; it does not make data normal) |
| **Use Cases** | When a bounded range is required and outliers are rare | When features should be centered with unit variance, e.g. for regularized or gradient-based models |
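
The two formulas in the table can be checked against scikit-learn with a small sketch that reuses the feature values from Example Execution 1. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# feature1 and feature2 from Example Execution 1
X = np.array([
    [100.0, 10.0],
    [200.0, 20.0],
    [300.0, 30.0],
    [400.0, 40.0],
    [500.0, 50.0],
])

# Min-max formula: (X - min) / (max - min), computed per column
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score formula: (X - mean) / std, with the population std (ddof=0)
manual_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

minmax_matches = np.allclose(manual_minmax, MinMaxScaler().fit_transform(X))
zscore_matches = np.allclose(manual_zscore, StandardScaler().fit_transform(X))
print(minmax_matches, zscore_matches)
```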

---

# Alternative Solution 1 (modular)

In [11]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def scale_features(X_train: pd.DataFrame, X_test: pd.DataFrame, method: str):
    """Fits scaler on X_train and transforms both X_train and X_test."""
    if method not in ['minmax', 'zscore']:
        raise ValueError("Method must be 'minmax' or 'zscore'")
    
    scaler = MinMaxScaler() if method == 'minmax' else StandardScaler()
    
    # Fit on training data and transform both train & test sets
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

    return X_train_scaled, X_test_scaled

def train_linear_model(X_train, y_train, X_test, y_test):
    """Trains a linear model and returns the MSE."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test, y_pred)

def compare_scaling_methods(data: pd.DataFrame, target_column: str) -> dict:
    """Compares MinMax and Z-Score scaling methods."""
    # Select numeric columns only
    numeric_cols = data.select_dtypes(include=['number']).columns.tolist()
    numeric_cols.remove(target_column)

    X = data[numeric_cols]
    y = data[target_column]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit each scaler on X_train only, then transform both X_train and X_test
    X_train_minmax, X_test_minmax = scale_features(X_train, X_test, 'minmax')
    X_train_zscore, X_test_zscore = scale_features(X_train, X_test, 'zscore')

    # Train and evaluate both scaling methods
    mse_results = {
        'minmax': train_linear_model(X_train_minmax, y_train, X_test_minmax, y_test),
        'zscore': train_linear_model(X_train_zscore, y_train, X_test_zscore, y_test)
    }

    return mse_results


# Alternative Solution 2 (using pipelines, best for production)

In [12]:
from sklearn.pipeline import Pipeline

def create_pipeline(scaler):
    """Creates a pipeline that scales data and applies Linear Regression."""
    return Pipeline([
        ('scaler', scaler),
        ('regressor', LinearRegression())
    ])

def compare_pipeline_methods(data: pd.DataFrame, target_column: str) -> dict:
    """Compares MinMax and Z-Score scaling with pipelines."""
    numeric_cols = data.select_dtypes(include=['number']).columns.tolist()
    numeric_cols.remove(target_column)

    X = data[numeric_cols]
    y = data[target_column]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    mse_results = {}
    for method, scaler in [('minmax', MinMaxScaler()), ('zscore', StandardScaler())]:
        pipeline = create_pipeline(scaler)
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        mse_results[method] = mean_squared_error(y_test, y_pred)

    return mse_results
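
One reason pipelines suit production use is that they can be passed whole to cross-validation, so the scaler is refit on each training fold and never sees the held-out fold. The sketch below assumes synthetic data and 5-fold cross-validation; neither appears in the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv_mse = {}
for name, scaler in [("minmax", MinMaxScaler()), ("zscore", StandardScaler())]:
    pipeline = Pipeline([("scaler", scaler), ("regressor", LinearRegression())])
    # scikit-learn maximizes scores, so MSE is exposed as its negative
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

print(cv_mse)
```

Averaging over several folds also gives a steadier estimate than the single held-out row that a 5-row dataset with `test_size=0.2` provides.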
