
<a href="https://colab.research.google.com/github/kokchun/Machine-learning-AI22/blob/main/Exercises/E02_sklearn.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; to see hints and answers.

---
# Scikit-learn exercises 

---
These are introductory exercises in machine learning with a focus on **scikit-learn**.

<p class = "alert alert-info" role="alert"><b>Note</b> that you may not always get exactly the same answer as I do, but that doesn't necessarily mean yours is wrong. We might have used different parameters or randomization. Also, very importantly, in the future there won't be any answer sheets, so use your skills in data analysis, mathematics and statistics to back up your work.</p>

<p class = "alert alert-info" role="alert"><b>Note</b> that in cases when you start to repeat code, try not to. Create functions to reuse code instead. </p>

<p class = "alert alert-info" role="alert"><b>Remember</b> to use <b>descriptive variable, function, index </b> and <b> column names</b> in order to get readable code </p>

The number of stars (\*), (\*\*), (\*\*\*) denotes the difficulty level of the task

---

## 0. EDA (*)

Throughout this exercise, we will work with the "mpg" dataset from seaborn. Start by loading the dataset "mpg" with the ```load_dataset``` function in the seaborn module. The goal is to use linear regression to predict mpg (miles per gallon).

&nbsp; a) Start by doing some initial EDA such as info(), describe() and figure out what you want to do with the missing values.

&nbsp; b) Use describe only on those columns that are relevant to get statistical information from. 

&nbsp; c) Make some plots on some of the columns that you find interesting.

&nbsp; d) Check if there are any columns you might want to drop. 

<details>

<summary>Answer</summary>

a) I have chosen to drop the rows, but that isn't necessarily the best method. Maybe some NaNs should be filled somehow?

b)
|      |      mpg |   cylinders |   displacement |   horsepower |   weight |   acceleration |
|:-----|---------:|------------:|---------------:|-------------:|---------:|---------------:|
| mean | 23.4459  |     5.47194 |        194.412 |     104.469  | 2977.58  |       15.5413  |
| std  |  7.80501 |     1.70578 |        104.644 |      38.4912 |  849.403 |        2.75886 |
| min  |  9       |     3       |         68     |      46      | 1613     |        8       |
| 25%  | 17       |     4       |        105     |      75      | 2225.25  |       13.775   |
| 50%  | 22.75    |     4       |        151     |      93.5    | 2803.5   |       15.5     |
| 75%  | 29       |     8       |        275.75  |     126      | 3614.75  |       17.025   |
| max  | 46.6     |     8       |        455     |     230      | 5140     |       24.8     |


c) Here are some example plots

<img src="../assets/EDA_mpg.png" height="400"/>

d) I have chosen to drop the columns origin and name. Consider for yourself whether this is reasonable and feel free to experiment. There might also be some domain experts in our class that you can ask.

</details>

---

In [149]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = sns.load_dataset("mpg")

# Rows that contain at least one missing value (useful when deciding how to handle them)
nan_rows = df[df.isna().any(axis=1)]

df

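|     |  mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name                      |
|:----|-----:|----------:|-------------:|-----------:|-------:|-------------:|-----------:|:-------|:--------------------------|
| 0   | 18.0 |         8 |        307.0 |      130.0 |   3504 |         12.0 |         70 | usa    | chevrolet chevelle malibu |
| 1   | 15.0 |         8 |        350.0 |      165.0 |   3693 |         11.5 |         70 | usa    | buick skylark 320         |
| 2   | 18.0 |         8 |        318.0 |      150.0 |   3436 |         11.0 |         70 | usa    | plymouth satellite        |
| 3   | 16.0 |         8 |        304.0 |      150.0 |   3433 |         12.0 |         70 | usa    | amc rebel sst             |
| 4   | 17.0 |         8 |        302.0 |      140.0 |   3449 |         10.5 |         70 | usa    | ford torino               |
| ... |  ... |       ... |          ... |        ... |    ... |          ... |        ... | ...    | ...                       |
| 393 | 27.0 |         4 |        140.0 |       86.0 |   2790 |         15.6 |         82 | usa    | ford mustang gl           |
| 394 | 44.0 |         4 |         97.0 |       52.0 |   2130 |         24.6 |         82 | europe | vw pickup                 |
| 395 | 32.0 |         4 |        135.0 |       84.0 |   2295 |         11.6 |         82 | usa    | dodge rampage             |
| 396 | 28.0 |         4 |        120.0 |       79.0 |   2625 |         18.6 |         82 | usa    | ford ranger               |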


In [150]:
df.dropna(axis=0,inplace=True)

df = df.drop(["origin","name"], axis=1)
df

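|     |  mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|:----|-----:|----------:|-------------:|-----------:|-------:|-------------:|-----------:|
| 0   | 18.0 |         8 |        307.0 |      130.0 |   3504 |         12.0 |         70 |
| 1   | 15.0 |         8 |        350.0 |      165.0 |   3693 |         11.5 |         70 |
| 2   | 18.0 |         8 |        318.0 |      150.0 |   3436 |         11.0 |         70 |
| 3   | 16.0 |         8 |        304.0 |      150.0 |   3433 |         12.0 |         70 |
| 4   | 17.0 |         8 |        302.0 |      140.0 |   3449 |         10.5 |         70 |
| ... |  ... |       ... |          ... |        ... |    ... |          ... |        ... |
| 393 | 27.0 |         4 |        140.0 |       86.0 |   2790 |         15.6 |         82 |
| 394 | 44.0 |         4 |         97.0 |       52.0 |   2130 |         24.6 |         82 |
| 395 | 32.0 |         4 |        135.0 |       84.0 |   2295 |         11.6 |         82 |
| 396 | 28.0 |         4 |        120.0 |       79.0 |   2625 |         18.6 |         82 |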
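
The answer to a) asks whether some NaNs could be filled instead of dropped. The following is a minimal sketch of both options on a hypothetical toy DataFrame (the values are made up and only mimic a few mpg columns); it is one possible approach, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the mpg dataset (values are made up).
toy_cars = pd.DataFrame(
    {
        "mpg": [18.0, 25.0, 30.0, 22.0],
        "horsepower": [130.0, np.nan, 70.0, 100.0],
        "weight": [3504, 2600, 2100, 2900],
    }
)

# Count missing values per column to see where the gaps are.
missing_per_column = toy_cars.isna().sum()

# Option 1: drop every row that contains at least one NaN.
toy_cars_dropped = toy_cars.dropna(axis=0)

# Option 2: keep all rows and fill each gap with that column's median.
toy_cars_filled = toy_cars.fillna(toy_cars.median(numeric_only=True))

print(missing_per_column)
print(toy_cars_dropped.shape, toy_cars_filled.shape)
```

If you go for filling, it is usually safer to compute the median on the training set only, after the train|test split, so that no information from the test set leaks into training.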


## 1. Train|test split (*)

We want to predict "mpg". Split the data into X and y, then perform a train|test split using scikit-learn. Choose a test_size of 0.2 and random_state 42. Check the shapes of X_train, X_test, y_train and y_test.

<details>

<summary>Answer</summary>

Do a manual calculation to check against the shapes after train|test split. 

</details>

---

In [151]:
from sklearn.model_selection import train_test_split

# Features: every column except the target; the target is kept as a one-column DataFrame
X, y = df.drop(columns="mpg"), df[["mpg"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape,y_test.shape)



(313, 6) (313, 1) (79, 6) (79, 1)
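
The manual calculation suggested in the answer could look like the sketch below. It assumes 392 rows remain after dropping NaNs (313 + 79 in the output above) and that scikit-learn rounds the test set size up and gives the remaining rows to the training set. The dummy arrays are placeholders, not the real data.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 392  # assumed row count after dropping NaNs (313 + 79)
test_size = 0.2

# Expected sizes: test set rounded up, remainder goes to training.
n_test_expected = math.ceil(test_size * n_samples)
n_train_expected = n_samples - n_test_expected

# Dummy features and target with the same number of rows and feature columns.
X_dummy = np.zeros((n_samples, 6))
y_dummy = np.zeros(n_samples)
X_train_dummy, X_test_dummy, y_train_dummy, y_test_dummy = train_test_split(
    X_dummy, y_dummy, test_size=test_size, random_state=42
)

print(n_train_expected, n_test_expected)
print(X_train_dummy.shape, X_test_dummy.shape)
```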


## 2. Function for evaluation (*)

Create a function that trains a regression model, predicts, and computes the metrics MAE, MSE and RMSE. It should take X_train, X_test, y_train, y_test and model as parameters. Now create a linear regression model using scikit-learn's ```LinearRegression()``` (OLS solved with SVD) and call your function to get the metrics.

<details>

<summary>Answer</summary>

MAE 2.50

MSE 10.50

RMSE 3.24

</details>

In [152]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,mean_squared_error,root_mean_squared_error

model = LinearRegression()


def train_eval(X_train,X_test,y_train,y_test, model):
    model.fit(X_train,y_train)
    y_pred= model.predict(X_test)
    
    MAE = mean_absolute_error(y_test,y_pred)
    MSE = mean_squared_error(y_test,y_pred)
    RMSE = root_mean_squared_error(y_test,y_pred)
    metrics = {"MAE": MAE, "MSE": MSE, "RMSE": RMSE}
    return metrics


metrics =  train_eval(X_train,X_test,y_train,y_test,model)
print(metrics)

{'MAE': 2.503860089776125, 'MSE': 10.502370329417303, 'RMSE': 3.2407360783342574}


---
## 3. Compare models (*)

Create the following models 
- Linear regression (SVD)
- Linear regression (SVD) with scaled data (feature standardization)
- Polynomial linear regression with degree 1
- Polynomial linear regression with degree 2
- Polynomial linear regression with degree 3

Make a DataFrame with the evaluation metrics for each model. Which model performed best overall?

<details>

<summary>Answer</summary>

|      |   Linear regr. SVD |   Linear regr. SVD scaled |   Linear regr. SGD |   Polynom. regr. deg 1 |   Polynom. regr. deg 2 |   Polynom. regr. deg 3 |
|:-----|-------------------:|--------------------------:|-------------------:|-----------------------:|-----------------------:|-----------------------:|
| mae  |            2.50386 |                   2.50386 |            2.53515 |                2.50386 |                1.98048 |                2.11788 |
| mse  |           10.5024  |                  10.5024  |           10.8908  |               10.5024  |                7.41986 |                9.27353 |
| rmse |            3.24074 |                   3.24074 |            3.30012 |                3.24074 |                2.72394 |                3.04525 |

</details>

---

In [154]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures


def linear_regression_SVD():
    """Fit plain linear regression on the unscaled features."""
    metrics = train_eval(X_train, X_test, y_train, y_test, model)
    print(f"normal svd metrics: {metrics}")


def linear_regression_scaled():
    """Standardize the features; the scaler is fitted on the training data only."""
    scaler = StandardScaler()
    scaled_X_train = scaler.fit_transform(X_train)
    scaled_X_test = scaler.transform(X_test)
    metrics = train_eval(scaled_X_train, scaled_X_test, y_train, y_test, model)
    print(f"scaled metrics: {metrics}")


def poly_regression(degree):
    """Expand the features to the given polynomial degree, then fit linear regression."""
    poly = PolynomialFeatures(degree, include_bias=False)
    poly_features_train = poly.fit_transform(X_train)
    poly_features_test = poly.transform(X_test)
    metrics = train_eval(poly_features_train, poly_features_test, y_train, y_test, model)
    print(f"{degree} degree poly metrics: {metrics}")


linear_regression_SVD()
linear_regression_scaled()
for degree in (1, 2, 3):
    poly_regression(degree)



normal svd metrics: {'MAE': 2.503860089776125, 'MSE': 10.502370329417303, 'RMSE': 3.2407360783342574}
scaled metrics: {'MAE': 2.5038600897761234, 'MSE': 10.502370329417294, 'RMSE': 3.240736078334256}
1 degree poly metrics: {'MAE': 2.5038600897761247, 'MSE': 10.502370329417301, 'RMSE': 3.2407360783342574}
2 degree poly metrics: {'MAE': 1.980477209601935, 'MSE': 7.419858147786743, 'RMSE': 2.723941656457925}
3 degree poly metrics: {'MAE': 2.0375481146938856, 'MSE': 8.010495726743402, 'RMSE': 2.8302819164781803}
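
The task also asks for a DataFrame with the metrics. Below is a minimal sketch of one way to collect them, using scikit-learn pipelines on synthetic stand-in data (not the mpg dataset); all names ending in `_toy` are made up for this illustration. It also shows why the first three models give the same metrics: standardizing the features does not change OLS predictions, and polynomial features of degree 1 without bias are just the original features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: a noisy quadratic target with two features.
rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(200, 2))
y_toy = 1.5 * X_toy[:, 0] ** 2 - 2.0 * X_toy[:, 1] + rng.normal(0, 0.5, size=200)
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Each candidate is a pipeline, so preprocessing is fitted on training data only.
candidate_models = {
    "Linear regr.": make_pipeline(LinearRegression()),
    "Linear regr. scaled": make_pipeline(StandardScaler(), LinearRegression()),
}
for degree in (1, 2, 3):
    candidate_models[f"Polynom. regr. deg {degree}"] = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )


def evaluate(model):
    """Fit on the toy training set and return test metrics as a dict."""
    model.fit(X_train_toy, y_train_toy)
    y_pred = model.predict(X_test_toy)
    mse = mean_squared_error(y_test_toy, y_pred)
    return {
        "mae": mean_absolute_error(y_test_toy, y_pred),
        "mse": mse,
        "rmse": np.sqrt(mse),
    }


# Outer dict keys become columns (models), inner keys become rows (metrics).
metrics_table = pd.DataFrame(
    {name: evaluate(model) for name, model in candidate_models.items()}
)
print(metrics_table.round(3))
```

The same pattern should work with the real X_train, X_test, y_train and y_test; the numbers will of course differ from this toy example.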



## 4. Further explorations (**)

Feel free to further explore the dataset, for example you could choose to 
- drop different columns
- find out feature importance in polynomial models
- fine-tune a specific model further by exploring its hyperparameters (check the documentation to see which parameters can be changed)

---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---