# Linear regression on mpg dataset

### a) We want to predict "mpg". Split the data into X and y and perform a train|test split using scikit-learn. Choose a test_size of 0.2 and random_state 42. Check the shapes of X_train, X_test, y_train and y_test.

In [3]:
import seaborn as sns
import pandas as pd

df = sns.load_dataset("mpg")

# median horsepower for each displacement value
median_hp = df.groupby("displacement")["horsepower"].median()

# fill missing horsepower with the median of rows sharing the same displacement
df["horsepower"] = df.apply(
    lambda row: median_hp[row["displacement"]] if pd.isna(row["horsepower"]) else row["horsepower"],
    axis=1
)

# rows whose displacement has no known horsepower are still null; use the overall median
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())
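The same two-step imputation can be written without a row-wise `apply`. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the mpg dataset), using `groupby(...).transform("median")` to line each row up with its group median:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the two mpg columns used above
toy = pd.DataFrame({
    "displacement": [100.0, 100.0, 100.0, 200.0, 200.0, 300.0],
    "horsepower": [60.0, 80.0, np.nan, 150.0, np.nan, np.nan],
})

# Median horsepower per displacement, aligned row by row with the frame
group_median = toy.groupby("displacement")["horsepower"].transform("median")

# Step 1: fill from the group median (stays NaN if the whole group is NaN)
filled = toy["horsepower"].fillna(group_median)

# Step 2: fall back to the overall median for anything still missing
filled = filled.fillna(filled.median())

print(filled.tolist())  # [60.0, 80.0, 70.0, 150.0, 150.0, 80.0]
```

The last row has a displacement with no known horsepower at all, so it only gets a value in the second step.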

In [6]:
# divide data into X and y
X, y = df.drop(["mpg", "origin", "name"], axis=1), df["mpg"]
X.head()

   cylinders  displacement  horsepower  weight  acceleration  model_year
0          8         307.0       130.0    3504          12.0          70
1          8         350.0       165.0    3693          11.5          70
2          8         318.0       150.0    3436          11.0          70
3          8         304.0       150.0    3433          12.0          70
4          8         302.0       140.0    3449          10.5          70


In [7]:
y.head()

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"{X_train.shape = }")
print(f"{y_train.shape = }")
print(f"{X_test.shape = }")
print(f"{y_test.shape = }")

X_train.shape = (318, 6)
y_train.shape = (318,)
X_test.shape = (80, 6)
y_test.shape = (80,)
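The shapes can be checked by hand. A small sketch of the arithmetic, assuming the test count is rounded up and the train count rounded down (an assumption about the rounding, though it is consistent with the 318 and 80 printed above):

```python
import math

n_samples = 398  # total rows, i.e. 318 + 80 from the shapes above
test_size = 0.2

# Assumed rounding: ceil for the test set, floor for the training set
n_test = math.ceil(test_size * n_samples)          # 79.6 rounds up
n_train = math.floor((1 - test_size) * n_samples)  # 318.4 rounds down

print(n_train, n_test)  # 318 80
```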


### b) Create a function that trains a regression model, predicts on the test set and computes the metrics MAE, MSE and RMSE. It should take the parameters X_train, X_test, y_train, y_test and model. Then create a linear regression model using scikit-learn's LinearRegression() (ordinary least squares, solved with an SVD-based solver) and call your function to get the metrics.

In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

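def train_predict_metrics(X_train, X_test, y_train, y_test, model):
    """Fit model on the training data and return MAE, MSE and RMSE on the test data."""
    # train the model on the training set only
    model.fit(X_train, y_train)

    # predict on data the model has not seen
    y_pred = model.predict(X_test)

    # error metrics; RMSE is the square root of MSE, in the same unit as mpg
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse}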
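To make the three metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The values in `y_true` and `y_hat` are hypothetical, chosen only so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical targets and predictions (not from the mpg dataset)
y_true = np.array([18.0, 15.0, 20.0, 25.0])
y_hat = np.array([17.0, 17.0, 20.0, 21.0])

errors = y_true - y_hat        # [1, -2, 0, 4]

mae = np.mean(np.abs(errors))  # (1 + 2 + 0 + 4) / 4 = 1.75
mse = np.mean(errors ** 2)     # (1 + 4 + 0 + 16) / 4 = 5.25
rmse = np.sqrt(mse)            # square root of MSE, about 2.29

print(mae, mse, rmse)
```

MSE penalises the single large error (4) much more heavily than MAE does, which is why RMSE comes out larger than MAE here.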

In [10]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

metrics = train_predict_metrics(X_train, X_test, y_train, y_test, model)
print(metrics)

{'MAE': 2.469255033626278, 'MSE': 9.442011256514293, 'RMSE': np.float64(3.072785585834829)}
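Ordinary least squares can be solved through the normal equation or through an SVD-based least-squares routine. The sketch below compares the two on synthetic data (the data, the seed and the variable names are assumptions for illustration). It uses `np.linalg.lstsq` as a stand-in for the SVD route and is not scikit-learn's actual code path:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x1 - 1*x2 + small noise
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the intercept is estimated as a coefficient
A = np.column_stack([np.ones(len(X)), X])

# Normal equation: solve (A^T A) beta = A^T y
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# SVD-based least squares, without forming A^T A explicitly
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# Both routes should agree on well-conditioned data
print(np.allclose(beta_normal, beta_lstsq))
```

On well-conditioned data the two give the same coefficients. The SVD route tends to be preferred when features are strongly correlated, since forming `A.T @ A` squares the condition number.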
