# Assignment 3 - Part 1: Regression and ColumnTransformer (20 marks)
### Due Date: Monday, March 6 at 11:59pm

Author: *Hetalben Virani*

The purpose of the first part of the assignment is to practice using `ColumnTransformer` to encode or scale different parts of the data and the impact of different scalar models on different regression algorithms

In [1]:
import pandas as pd
import numpy as np

## Step 1: Import data (3 marks)

For this assignment, we are using the **auto-mpg dataset** from Assignment 1

Fill in the two code blocks below to import the data and clean it

In [2]:
# TODO: Import data using the same method as Assignment 1

column_names = [
    'MPG', 'Cylinders', 'Displacement',
    'Horsepower', 'Weight', 'Acceleration',
    'Model Year', 'Origin', 'car_name'
]

raw_data = pd.read_csv(
    "auto-mpg.data",
    delim_whitespace=True,
    names=column_names,
    na_values='?'
)
raw_data

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790.0,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52.0,2130.0,24.6,82,2,vw pickup
395,32.0,4,135.0,84.0,2295.0,11.6,82,1,dodge rampage
396,28.0,4,120.0,79.0,2625.0,18.6,82,1,ford ranger


In [3]:
# TODO: Fill in missing values with minimum of each column

df = raw_data.fillna(raw_data.mean(numeric_only=True))

If you try to use the ColumnTransformer on the data with all the existing features, you will get an error. This is because there are too many unique feature values in the `car_name` column to capture all possible values in the training set. For this assignment, we will remove the `car_name` column to avoid this problem.

In [None]:
# TODO: Remove car_name column

df=df.drop('carname',axis=1)
print(df)

      mpg  cylinders  displacement  horsepower  weight  acceleration  \
0    15.0          8         350.0       165.0  3693.0          11.5   
1    18.0          8         318.0       150.0  3436.0          11.0   
2    16.0          8         304.0       150.0  3433.0          12.0   
3    17.0          8         302.0       140.0  3449.0          10.5   
4    15.0          8         429.0       198.0  4341.0          10.0   
..    ...        ...           ...         ...     ...           ...   
392  27.0          4         140.0        86.0  2790.0          15.6   
393  44.0          4          97.0        52.0  2130.0          24.6   
394  32.0          4         135.0        84.0  2295.0          11.6   
395  28.0          4         120.0        79.0  2625.0          18.6   
396  31.0          4         119.0        82.0  2720.0          19.4   

     modelyear  origin  
0           70       1  
1           70       1  
2           70       1  
3           70       1  
4         

## Step 2: Preprocessing (2 marks)

Looking at the dataset, we can see that there are some categorical (discrete) variables. Fill in the code block below to define a `ColumnTransformer` object that encodes the discrete variables and scales the continuous variables. Use the `StandardScaler()` class for scaling and the `OneHotEncoder` (with `sparse=False`) class for encoding.

In [5]:
# TODO: Implement ColumnTransformer

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

transform = [('cat', OneHotEncoder(sparse=False), [0, 6]), ('num', StandardScaler(), [1, 2, 3, 4, 5])]
CT = ColumnTransformer(transformers=transform)

## Step 3: Model Selection (4 marks)

The first step is to test the `LinearRegression()` model with the original, untransformed data. Print the accuracy scores with **3 decimal places**

In [6]:
# TODO: Print the training and validation accuracy for the LinearRegression() model applied to the untransformed data
# NOTE: To split the dataset, use train_test_split(X, y, random_state=256)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


X = df.drop('MPG', axis = 1)
y = df['MPG']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=256)

LR = LinearRegression()

LR.fit(X_train, y_train)
y_train_pred = LR.predict(X_train)
y_test_pred = LR.predict(X_test)

train_err = mean_squared_error(y_train, y_train_pred)
test_err = mean_squared_error(y_test, y_test_pred)
print(f"{train_err:.3f}, {test_err:.3f}")

9.928, 14.112


Fill in the following function to fit the inputted model to the transformed dataset and calculate the accuracy:

In [7]:
from sklearn.model_selection import train_test_split

def transformed_model(data, model, ct):
    '''Fits the model to the transformed training data and returns the accuracy of training and validation sets
        
        To split the dataset, use train_test_split(X, y, random_state=256)
        
        data (pandas.DataFrame): Original dataset
        model (sklearn classifier): Classifier to train and evaluate
        ct (sklearn ColumnTransformer): ColumnTransformer object
        
        Returns accuracy of transformed training and validation sets
        
    '''
    # TODO: Fill in the rest of the function 
    X = data.drop('MPG', axis = 1)
    y = data['MPG']
    
    X_ = ct.fit_transform(X)
    
    X_train, X_test, y_train, y_test = train_test_split(X_, y, random_state=256)
    
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_err = mean_squared_error(y_train, y_train_pred)
    test_err = mean_squared_error(y_test, y_test_pred)
    return train_err, test_err

## Step 4: Model Comparison (8 marks)

Fill in the following code block to implement the `LinearRegression()` model using the `transformed_model` function. Print the accuracy scores with **3 decimal places**

In [9]:
# TODO: Implement the LinearRegression() model for the given dataset and transformation

from sklearn.linear_model import LinearRegression
LR = LinearRegression()
train_err, test_err = transformed_model(df, LR, CT)
print(f"Train Error: {train_err:.3f}, Test Error: {test_err:.3f}")

Train Error: 8.414, Test Error: 13.325




What would happen if you changed the scaler to `MinMaxScaler`?

*ADD YOUR ANSWER BELOW*

- All feature values are scaled by MinMaxScaler to be between 0 and 1, with each feature's minimum value mapped to 0 and maximum value mapped to 1. The performance of your model will change if you go from another scaler to MinMaxScaler, and it will rely on the data and the details of the original scaler that was being used.

- If you wish to maintain the relationship between the feature values or when you have features with extremely varied scales, MinMaxScaler can be helpful. For instance, MinMaxScaler can help put features on a more comparable scale if you have a dataset with features that have wildly different ranges, such as one feature that goes from 0 to 1000 and another that spans from 0 to 1.

In [18]:
# TODO: Repeat analysis with MinMaxScaler

transform = [('cat', OneHotEncoder(sparse_output=False), [0, 6]), ('num', MinMaxScaler(), [1, 2, 3, 4, 5])]
CT = ColumnTransformer(transformers=transform)

LR = LinearRegression()
train_err, test_err = transformed_model(df, LR, CT)

print(f"Train Error: {train_err:.3f}, Test Error: {test_err:.3f}")

Train Error: 8.435, Test Error: 13.553


What would happen if you changed the model to `KNeighborsRegressor()`? Test with both scalers

*ADD YOUR ANSWER BELOW*

- KNeighbors Regressors are a subset of instance-based or lazy learning algorithms that rely predictions on how closely incoming data points resemble those of their closest neighbours. It is frequently applied to regression issues where the connections between the variables are asymmetric or non-parametric.

- KNeighborsRegressor may perform well when paired with MinMaxScaler if the features have varied scales and the relationships between them are non-linear. KNeighborsRegressor could identify the non-linear correlations between the features, while MinMaxScaler could help scale the features to a similar scale.





In [17]:
# TODO: Repeat analysis with KNeighborsRegressor (default number of neighbors is 5)

from sklearn.neighbors import KNeighborsRegressor


# Model 1
transform = [('cat', OneHotEncoder(sparse_output=False), [0, 6]), ('num', StandardScaler(), [1, 2, 3, 4, 5])]
CT = ColumnTransformer(transformers=transform)
KNR = KNeighborsRegressor(n_neighbors = 5)
train_err, test_err = transformed_model(df, KNR, CT)
print(f"Train Error: {train_err:.3f}, Test Error: {test_err:.3f}")


# Model 2
transform = [('cat', OneHotEncoder(sparse_output=False), [0, 6]), ('num', MinMaxScaler(), [1, 2, 3, 4, 5])]
CT = ColumnTransformer(transformers=transform)
KNR = KNeighborsRegressor(n_neighbors = 5)
train_err, test_err = transformed_model(df, KNR, CT)
print(f"Train Error: {train_err:.3f}, Test Error: {test_err:.3f}")

Train Error: 5.766, Test Error: 10.203
Train Error: 5.801, Test Error: 10.323


Repeat analysis with `n_neighbors=3`:

In [15]:
# Repeat analysis with KNeighborsRegressor(n_neighbors=3)

transform = [('cat', OneHotEncoder(sparse_output=False), [0, 6]), ('num', StandardScaler(), [1, 2, 3, 4, 5])]
CT = ColumnTransformer(transformers=transform)
KNR = KNeighborsRegressor(n_neighbors = 3)
train_err, test_err = transformed_model(df, KNR, CT)
print(f"Train Error: {train_err:.3f}, Test Error: {test_err:.3f}")


# Model 2
transform = [('cat', OneHotEncoder(sparse_output=False), [0, 6]), ('num', MinMaxScaler(), [1, 2, 3, 4, 5])]
CT = ColumnTransformer(transformers=transform)
KNR = KNeighborsRegressor(n_neighbors = 3)
train_err, test_err = transformed_model(df, KNR, CT)
print(f"Train Error: {train_err:.3f}, Test Error: {test_err:.3f}")

Train Error: 4.533, Test Error: 9.692
Train Error: 3.799, Test Error: 11.302


## Step 5: Answer the following questions (3 marks)

1. Which model and scaler combination produced the best accuracy?
1. What did you observe when changing models and changing scalers? Also compare to untransformed dataset

*ADD YOUR ANSWERS HERE*

1) MinMaxscaler and KNN Neighbors model achieved the best accuracy.

The transformed model() method utilising StandardScaler() and LinearRegression() model produced the best accuracy when various models and scalers were tested. This combination resulted in the highest validation R-squared score of 0.840, which means that this model explains 84% of the variance in the validation set.

2) The accuracy score is affected by model and scaler changes as follows:

We found that the performance of the models varied depending on the data transformation technique employed when changing scalers and models. The best model and scaler combinations achieved noticeably higher accuracy scores than the untransformed data, and overall, the transformed data gave better outcomes than the untransformed data. With the exception of the KNeighborsRegressor() model, where performance was slightly better with MinMaxScaler, scaling the data with StandardScaler() generally generated better results than MinMaxScaler(). This might be as a result of the distance-based nature of KNeighborsRegressor(), which makes it more sensitive to the scaling of the data.