
<a href="https://colab.research.google.com/github/kokchun/Machine-learning-AI22/blob/main/Exercises/E02_sklearn.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; to see hints and answers.

---
# Scikit-learn exercises 

---
These are introductory machine learning exercises with a focus on **scikit-learn**.

<p class = "alert alert-info" role="alert"><b>Note</b> that you will sometimes get a slightly different answer than mine, which doesn't necessarily mean it is wrong. We may have used different parameters or randomization. It is also very important to remember that there won't be any answer sheets in the future, so use your skills in data analysis, mathematics and statistics to back up your work.</p>

<p class = "alert alert-info" role="alert"><b>Note</b> that when you start to repeat code, you should create functions to reuse it instead.</p>

<p class = "alert alert-info" role="alert"><b>Remember</b> to use <b>descriptive variable, function, index </b> and <b> column names</b> in order to get readable code </p>

The number of stars (\*), (\*\*), (\*\*\*) denotes the difficulty level of the task

---

## 0. EDA (*)

Throughout this exercise, we will work with the "mpg" dataset from seaborn. Start by loading "mpg" with the ```load_dataset``` function in the seaborn module. The goal is to use linear regression to predict mpg (miles per gallon).

&nbsp; a) Start by doing some initial EDA such as info(), describe() and figure out what you want to do with the missing values.

&nbsp; b) Use describe() only on the columns for which summary statistics are meaningful.

&nbsp; c) Make some plots on some of the columns that you find interesting.

&nbsp; d) Check if there are any columns you might want to drop. 

<details>

<summary>Answer</summary>

a) I have chosen to drop the rows with missing values, but that isn't necessarily the best method. Maybe some NaNs should be filled somehow?

b)
|      |      mpg |   cylinders |   displacement |   horsepower |   weight |   acceleration |
|:-----|---------:|------------:|---------------:|-------------:|---------:|---------------:|
| mean | 23.4459  |     5.47194 |        194.412 |     104.469  | 2977.58  |       15.5413  |
| std  |  7.80501 |     1.70578 |        104.644 |      38.4912 |  849.403 |        2.75886 |
| min  |  9       |     3       |         68     |      46      | 1613     |        8       |
| 25%  | 17       |     4       |        105     |      75      | 2225.25  |       13.775   |
| 50%  | 22.75    |     4       |        151     |      93.5    | 2803.5   |       15.5     |
| 75%  | 29       |     8       |        275.75  |     126      | 3614.75  |       17.025   |
| max  | 46.6     |     8       |        455     |     230      | 5140     |       24.8     |


c) Here are some example plots

<img src="../assets/EDA_mpg.png" height="400"/>

d) I have chosen to drop the columns origin and name. Consider for yourself whether this is reasonable and feel free to experiment. There might also be some domain experts in our class whom you can ask.

</details>

---

In [None]:
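import seaborn as sns

df = sns.load_dataset("mpg")

# Inspect the horsepower values and look for NaN
print(df["horsepower"].unique())

# Drop rows with missing values, then the columns not used for the regression
df.dropna(inplace=True)
df.drop(columns=["origin", "name"], inplace=True)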
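
Before deciding how to handle missing values, it helps to count them per column and compare the alternatives. The sketch below uses a small made-up DataFrame (the name `cars` and all its values are assumptions for illustration, not the real mpg data) to contrast dropping rows with filling in the median.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mpg data; the values are made up for illustration.
cars = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 25.0, 30.0, 22.0],
        "horsepower": [130.0, np.nan, 95.0, np.nan, 88.0],
        "weight": [3504, 3693, 2372, 2130, 2835],
    }
)

# Number of missing values per column.
missing_per_column = cars.isna().sum()
print(missing_per_column)

# Option 1: drop every row that contains a missing value.
cars_dropped = cars.dropna()

# Option 2: fill missing horsepower with the column median.
horsepower_median = cars["horsepower"].median()
cars_filled = cars.fillna({"horsepower": horsepower_median})

print(len(cars_dropped), len(cars_filled))
```

Dropping rows is simple but loses data, while filling keeps every row at the cost of introducing values that were never measured.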

In [None]:
sns.pairplot(df, corner=True, height=2)

## 1. Train|test split (*)

We want to predict "mpg". Split the data into X and y, and perform a train|test split using scikit-learn. Choose a test_size of 0.2 and random_state 42. Check the shapes of X_train, X_test, y_train and y_test.

<details>

<summary>Answer</summary>

Calculate the expected shapes by hand and compare them with the shapes you get after the train|test split.

</details>

---

In [None]:
from sklearn.model_selection import train_test_split

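X = df.drop(columns=["mpg"])
# No intercept column is needed: LinearRegression fits an intercept by default
# X.insert(0, "Intercept", 1)
y = df["mpg"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check all four shapes, not only X_train
X_train.shape, X_test.shape, y_train.shape, y_test.shape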
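
The manual calculation mentioned in the answer can be sketched as follows. For a float `test_size`, scikit-learn rounds the test set size up and gives the remaining rows to the training set. The figures of 392 rows and 6 feature columns are assumptions about what remains after the cleaning above, and `X_demo` and `y_demo` are placeholder arrays rather than the real data.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Assumed size after dropping NaN rows and the columns origin, name and mpg.
n_samples, n_features = 392, 6
test_size = 0.2

# The test set size is rounded up; the training set gets the rest.
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

# Placeholder arrays with the assumed shape.
X_demo = np.zeros((n_samples, n_features))
y_demo = np.zeros(n_samples)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=test_size, random_state=42
)

print(expected_train, expected_test)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```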

## 2. Function for evaluation (*)

Create a function for training a regression model, predicting and computing the metrics MAE, MSE and RMSE. It should take the parameters X_train, X_test, y_train, y_test and model. Now create a linear regression model using scikit-learn's ```LinearRegression()``` (OLS, solved with an SVD-based least-squares solver) and call your function to get the metrics.

<details>

<summary>Answer</summary>

MAE 2.50

MSE 10.50

RMSE 3.24

</details>

In [None]:
def evaluate(y, y_hat):

    from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error

    R2 = r2_score(y, y_hat)
    MAE = mean_absolute_error(y, y_hat)
    MSE = mean_squared_error(y, y_hat)
    RMSE = root_mean_squared_error(y, y_hat)

    return {
        "R2": R2,
        "MAE": MAE,
        "MSE": MSE,
        "RMSE": RMSE,
    }


def lr(X_train, y_train, X_test, y_test):
    
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_hat_train = model.predict(X_train)
    y_hat_test = model.predict(X_test)

    train_eval = evaluate(y_train, y_hat_train)
    test_eval = evaluate(y_test, y_hat_test)

    return (train_eval, test_eval)


linear = lr(X_train, y_train, X_test, y_test)
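
The task asks for a function that also takes the model as a parameter, so the same function can be reused for any regressor. A minimal sketch of that variant is shown below. The name `train_and_evaluate` and the synthetic, noise-free data are assumptions for illustration. RMSE is computed as the square root of MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error


def train_and_evaluate(X_train, X_test, y_train, y_test, model):
    """Fit model on the training data and return test-set MAE, MSE and RMSE."""
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mse = mean_squared_error(y_test, y_hat)
    return {
        "MAE": mean_absolute_error(y_test, y_hat),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }


# Synthetic, noise-free linear data, so the errors should be close to zero.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0

demo_metrics = train_and_evaluate(
    X_demo[:80], X_demo[80:], y_demo[:80], y_demo[80:], LinearRegression()
)
print(demo_metrics)
```

Passing the model in as an argument means a scaled or polynomial model can be evaluated by the same function without duplicating the metric code.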

---
## 3. Compare models (*)

Create the following models 
- Linear regression (SVD)
- Linear regression (SVD) with scaled data (feature standardization)
- Polynomial linear regression with degree 1
- Polynomial linear regression with degree 2
- Polynomial linear regression with degree 3

Make a DataFrame with the evaluation metrics for each model. Which model performed best overall?

<details>

<summary>Answer</summary>

|      |   Linear regr. SVD |   Linear regr. SVD scaled |   Linear regr. SGD |   Polynom. regr. deg 1 |   Polynom. regr. deg 2 |   Polynom. regr. deg 3 |
|:-----|-------------------:|--------------------------:|-------------------:|-----------------------:|-----------------------:|-----------------------:|
| mae  |            2.50386 |                   2.50386 |            2.53515 |                2.50386 |                1.98048 |                2.11788 |
| mse  |           10.5024  |                  10.5024  |           10.8908  |               10.5024  |                7.41986 |                9.27353 |
| rmse |            3.24074 |                   3.24074 |            3.30012 |                3.24074 |                2.72394 |                3.04525 |

</details>

---

In [None]:
def standardise(X_train, X_test):
    
    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler().fit(X_train)
    X_train_std = scaler.transform(X_train)
    X_test_std = scaler.transform(X_test)

    return X_train_std, X_test_std


def lr_std(X_train, y_train, X_test, y_test):
    
    from sklearn.linear_model import LinearRegression

    X_train, X_test = standardise(X_train, X_test)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_hat_train = model.predict(X_train)
    y_hat_test = model.predict(X_test)

    train_eval = evaluate(y_train, y_hat_train)
    test_eval = evaluate(y_test, y_hat_test)

    return (train_eval, test_eval)


def plr(X_train, y_train, X_test, y_test, degree=1, std=False):

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    if std:
        X_train, X_test = standardise(X_train, X_test)
    
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train = poly.fit_transform(X_train)
    X_test = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_hat_train = model.predict(X_train)
    y_hat_test = model.predict(X_test)

    train_eval = evaluate(y_train, y_hat_train)
    test_eval = evaluate(y_test, y_hat_test)

    return (train_eval, test_eval)

In [None]:
linear = lr(X_train, y_train, X_test, y_test)
linear_std = lr_std(X_train, y_train, X_test, y_test)
poly1 = plr(X_train, y_train, X_test, y_test)
poly2 = plr(X_train, y_train, X_test, y_test, 2)
poly3 = plr(X_train, y_train, X_test, y_test, 3)

In [None]:
import pandas as pd

model_names = ["linear", "linear_std", "poly1", "poly2", "poly3"]
model_results = [linear, linear_std, poly1, poly2, poly3]

# One column per model and one row per metric
results_train = pd.DataFrame({name: train for name, (train, _) in zip(model_names, model_results)})
results_test = pd.DataFrame({name: test for name, (_, test) in zip(model_names, model_results)})

print("Training results:")
display(results_train)
print("Testing results:")
display(results_test)
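
In the answer table, plain linear regression, the scaled version and the degree 1 polynomial all have identical metrics. This is expected: OLS predictions do not change when the features are standardized, and degree 1 polynomial features without a bias column are just the original features. The sketch below checks this on synthetic data (all names and values here are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[100.0, 3000.0], scale=[40.0, 800.0], size=(200, 2))
y_demo = (
    40.0
    - 0.05 * X_demo[:, 0]
    - 0.005 * X_demo[:, 1]
    + rng.normal(scale=2.0, size=200)
)

plain = LinearRegression().fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
degree_one = make_pipeline(
    PolynomialFeatures(degree=1, include_bias=False), LinearRegression()
).fit(X_demo, y_demo)

# The predictions agree up to floating-point precision.
same_scaled = np.allclose(plain.predict(X_demo), scaled.predict(X_demo))
same_degree_one = np.allclose(plain.predict(X_demo), degree_one.predict(X_demo))
print(same_scaled, same_degree_one)
```

Scaling starts to matter once the model is regularized or fitted with gradient descent, since both depend on the scale of the features.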


## 4. Further explorations (**)

Feel free to further explore the dataset, for example you could choose to 
- drop different columns
- find out feature importance in polynomial models
- fine tune further for a specific model by exploring hyperparameters (check documentation which type of parameters that can be changed)

In [None]:
from sklearn.metrics import root_mean_squared_error
import matplotlib.pyplot as plt


def plr(X_train, y_train, X_test, y_test, degree=1, std=False):

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    if std:
        X_train, X_test = standardise(X_train, X_test)
    
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train = poly.fit_transform(X_train)
    X_test = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_hat_train = model.predict(X_train)
    y_hat_test = model.predict(X_test)

    train_eval = evaluate(y_train, y_hat_train)
    test_eval = evaluate(y_test, y_hat_test)

    return (train_eval, test_eval), y_hat_test


poly1, y_hat_degree_1 = plr(X_train, y_train, X_test, y_test, degree=1, std=False)
error_list = []

# Test-set RMSE for polynomial degrees 1 to 10
for d in range(1, 11):
    _, y_hat_d = plr(X_train, y_train, X_test, y_test, degree=d, std=False)
    error_list.append(root_mean_squared_error(y_test, y_hat_d))

fig, ax = plt.subplots()

ax.plot(range(1, len(error_list) + 1), error_list, ".-")
ax.set(title="Elbow", xlabel="Degree", ylabel="Root Mean Squared Error")
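
The elbow plot above picks the degree by looking at the test error, which means the test set is no longer an unbiased estimate of performance. A common alternative is to choose the degree with cross-validation on the training data only. The sketch below shows the idea on synthetic quadratic data. The data, the names and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic relationship plus noise.
rng = np.random.default_rng(7)
X_demo = rng.uniform(-2, 2, size=(200, 1))
y_demo = (
    1.0
    + 2.0 * X_demo[:, 0]
    - 1.5 * X_demo[:, 0] ** 2
    + rng.normal(scale=0.3, size=200)
)

cv_rmse = {}
for degree in range(1, 6):
    candidate = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # The scorer returns negative RMSE, so flip the sign.
    scores = cross_val_score(
        candidate, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[degree] = -scores.mean()

best_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print(best_degree)
```

With the degree fixed this way, the test set is used only once, for the final evaluation.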

In [None]:
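# Predicted vs. actual mpg on the test set; regplot draws the scatter points itself
ax = sns.regplot(x=y_test, y=y_hat_degree_1, line_kws={"color": "red"})
ax.set(xlabel="Actual mpg", ylabel="Predicted mpg")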

In [None]:
from sklearn.linear_model import ElasticNetCV


def plr(X_train, y_train, X_test, y_test, degree=1, std=False, reg=False):

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    if std:
        X_train, X_test = standardise(X_train, X_test)
    
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train = poly.fit_transform(X_train)
    X_test = poly.transform(X_test)

    if reg:
        # ElasticNetCV selects alpha and l1_ratio by cross-validation on the training data
        ratios = [0.1, 0.5, 0.7, 0.9, 1.0]
        model = ElasticNetCV(
            l1_ratio=ratios,
            eps=0.001,
            n_alphas=100,
            max_iter=10000,
        )
    else:
        model = LinearRegression()

    # Fit once, regardless of which model was chosen
    model.fit(X_train, y_train)

    if reg:
        print(f"L1 ratio: {model.l1_ratio_}")
        print(f"alpha: {model.alpha_}")
    y_hat_train = model.predict(X_train)
    y_hat_test = model.predict(X_test)

    train_eval = evaluate(y_train, y_hat_train)
    test_eval = evaluate(y_test, y_hat_test)

    return (train_eval, test_eval)


metrics = plr(X_train, y_train, X_test, y_test, degree=1, std=True, reg=True)

cols = [key for key in metrics[0].keys()]
rows = ["poly_elastic"]
results_train = pd.DataFrame(columns=rows, index=cols)
results_test = pd.DataFrame(columns=rows, index=cols)
for model_name, model_results in zip(rows, [metrics]):
    results_train[model_name] = pd.Series(model_results[0])
    results_test[model_name] = pd.Series(model_results[1])

print("Training results:")
display(results_train)
print("Testing results:")
display(results_test)

---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---