# **Lab: ML Engineering**



## Exercise 2: Organising Git Repository

We will train an ElasticNet model on the Online News Popularity dataset:
https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Online%20News%20Popularity

The objective is to predict the volume of shares for a news article. 

**Pre-requisites:**
- Create a GitHub account (https://github.com/join)
- Install git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Install Python

The steps are:
1.   Download dataset
2.   Load and explore dataset
3.   Prepare Data
4.   Get Baseline model
5.   Train ElasticNet model
6.   Push changes


## 1. Download dataset

**[1.1]** Download the dataset into the folder `data/raw`

In [None]:
wget -P ~/Projects/adv_dsi_lab_1/data/raw https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Online%20News%20Popularity/OnlineNewsPopularity.csv

**[1.2]** Prevent pushes to the `master` branch by pointing its push remote at `no_push`, a remote name that is not expected to exist


In [None]:
git config branch.master.pushRemote no_push


## 2. Load and Explore Dataset



**[2.1]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (1 command line)
# Task: Launch Jupyter Lab from your virtual environment

In [None]:
#Solution:
pipenv run jupyter lab

**[2.2]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `1_elasticnet.ipynb`

**[2.3]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Import the pandas and numpy packages

In [None]:
# Solution
import pandas as pd
import numpy as np

**[2.4]** Load the dataset into a dataframe called `df`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Load the dataset into a dataframe called df

In [None]:
#Solution:
df = pd.read_csv('../data/raw/OnlineNewsPopularity.csv')

**[2.5]** Display the first 5 rows of df

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the first 5 rows of df

In [None]:
# Solution
df.head()

**[2.6]** Display the dimensions (shape) of df

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the dimensions (shape) of df

In [None]:
# Solution
df.shape

**[2.7]** Display the summary (info) of df

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the summary (info) of df

In [None]:
# Solution
df.info()

**[2.8]** Display the descriptive statistics of df


In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the descriptive statistics of df

In [None]:
# Solution
df.describe()

## 3. Prepare Data

**[3.1]** Create a copy of df and save it into a variable called df_cleaned

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a copy of df and save it into a variable called df_cleaned

In [None]:
# Solution
df_cleaned = df.copy()

**[3.2]** Drop the column `url`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Drop the column `url`

In [None]:
# Solution:
df_cleaned.drop('url', axis=1, inplace=True)
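
The reason for working on a copy can be illustrated with a small made-up dataframe (the names `original` and `cleaned` and the values are assumptions for illustration only): dropping a column in place on the copy leaves the source dataframe untouched.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
original = pd.DataFrame({'url': ['a', 'b'], 'shares': [593, 711]})

# copy() creates an independent dataframe
cleaned = original.copy()

# Dropping in place changes only the copy
cleaned.drop('url', axis=1, inplace=True)

print(original.columns.tolist())  # still contains 'url'
print(cleaned.columns.tolist())   # 'url' is gone
```

Keeping `df` intact means the raw data can be inspected again later without reloading the CSV.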

**[3.3]** Remove leading and trailing space from the column names

In [None]:
df_cleaned.columns = df_cleaned.columns.str.strip()
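
Headers read from a CSV can carry stray whitespace, in which case a lookup such as `df_cleaned['shares']` raises a `KeyError`. The sketch below uses a small made-up dataframe (column names and values are assumptions for illustration) to show what `str.strip()` changes.

```python
import pandas as pd

# Toy dataframe whose headers carry a leading space (illustrative values only)
toy = pd.DataFrame({' n_tokens_title': [12, 9], ' shares': [593, 711]})

print(toy.columns.tolist())       # names before stripping
found_before = 'shares' in toy.columns

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(toy.columns.tolist())       # names after stripping
found_after = 'shares' in toy.columns
```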

**[3.4]** Extract the column `shares` and save it into a variable called `target`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Extract the column shares and save it into a variable called target

In [None]:
# Solution:
target = df_cleaned.pop('shares')
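
As a minimal sketch of what `pop` does, using a made-up dataframe (names and values are illustrative assumptions): the column is returned as a Series and removed from the dataframe in the same step, so the features and the target end up separated.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'n_tokens_title': [12, 9, 10], 'shares': [593, 711, 1500]})

# pop() returns the column as a Series and removes it from the dataframe in place
toy_target = toy.pop('shares')

print(toy.columns.tolist())   # remaining feature columns
print(toy_target.tolist())    # extracted target values
```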

**[3.5]** Import StandardScaler from sklearn.preprocessing

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import StandardScaler from sklearn.preprocessing

In [None]:
# Solution
from sklearn.preprocessing import StandardScaler

**[3.6]** Instantiate the StandardScaler

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Instantiate the StandardScaler

In [None]:
# Solution
scaler = StandardScaler()

**[3.7]** Fit and apply the scaling on `df_cleaned`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Fit and apply the scaling on df_cleaned

In [None]:
# Solution:
df_cleaned = scaler.fit_transform(df_cleaned)
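
To see what the scaling step produces, here is a small sketch on made-up numbers (the dataframe `toy` is an assumption for illustration). With default settings, `fit_transform` returns a NumPy array rather than a dataframe, and each column ends up with a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values only)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

print(type(toy_scaled))          # NumPy array: column names are no longer attached
print(toy_scaled.mean(axis=0))   # approximately 0 for each column
print(toy_scaled.std(axis=0))    # approximately 1 for each column
```

This is why `df_cleaned` is an array, not a dataframe, in the steps that follow.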

**[3.8]** Import dump from joblib



In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import dump from joblib

In [None]:
# Solution:
from joblib import dump

**[3.9]** Save the scaler into the folder `models` and call the file `scaler.joblib`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Save the scaler into the folder models and call the file scaler.joblib

In [None]:
# Solution:
dump(scaler, '../models/scaler.joblib')
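
Saving the fitted scaler lets the same transformation be reapplied later without refitting. The sketch below is a hedged illustration of the round trip using `joblib.load`; it writes to a temporary directory rather than `models`, and the array `X_toy` is made up.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values only)
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_scaler = StandardScaler().fit(X_toy)

# Write the scaler to disk and read it back from a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scaler.joblib')
    dump(toy_scaler, path)
    restored = load(path)

# The restored scaler applies exactly the same transformation
same_output = np.allclose(restored.transform(X_toy), toy_scaler.transform(X_toy))
print(same_output)
```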

**[3.10]** Import train_test_split from sklearn.model_selection 

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import train_test_split from sklearn.model_selection

In [None]:
# Solution
from sklearn.model_selection import train_test_split

**[3.11]** Randomly split the dataset with random_state=8 into 2 different sets: data (80%) and test (20%)

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Randomly split the dataset with random_state=8 into 2 different sets: data (80%) and test (20%)

In [None]:
# Solution
X_data, X_test, y_data, y_test = train_test_split(df_cleaned, target, test_size=0.2, random_state=8)

**[3.12]** Split the remaining data (80%) randomly with random_state=8 into 2 different sets: training (80%) and validation (20%)

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Split the remaining data (80%) randomly with random_state=8 into 2 different sets: training (80%) and validation (20%)

In [None]:
# Solution
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=8)
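
Because the second split is applied to the remaining 80%, the final proportions relative to the full dataset are 0.8 × 0.8 = 64% training, 0.8 × 0.2 = 16% validation and 20% test. A quick check on 100 made-up rows (the `_toy` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each (illustrative values only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# First split: 80% data, 20% test
X_data_toy, X_test_toy, y_data_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8
)

# Second split: 80% of the remaining data for training, 20% for validation
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_data_toy, y_data_toy, test_size=0.2, random_state=8
)

sizes = (len(X_train_toy), len(X_val_toy), len(X_test_toy))
print(sizes)  # (64, 16, 20)
```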

**[3.13]** Save the different sets in the folder `data/processed`

In [None]:
# Placeholder for student's code (6 lines of Python code)
# Task: Save the different sets in the folder `data/processed`

In [None]:
# Solution:
np.save('../data/processed/X_train', X_train)
np.save('../data/processed/X_val',   X_val)
np.save('../data/processed/X_test',  X_test)
np.save('../data/processed/y_train', y_train)
np.save('../data/processed/y_val',   y_val)
np.save('../data/processed/y_test',  y_test)
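
Two details of `np.save` are worth knowing when reading these files back: it appends the `.npy` extension when the file name does not already have it, and a pandas Series is stored as a plain NumPy array. A small sketch with made-up values, written to a temporary directory instead of `data/processed`:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy target stored as a pandas Series (illustrative values only)
y_toy = pd.Series([593, 711, 1500], name='shares')

with tempfile.TemporaryDirectory() as tmp_dir:
    # No extension given: NumPy adds '.npy' automatically
    np.save(os.path.join(tmp_dir, 'y_train'), y_toy)
    saved_files = os.listdir(tmp_dir)

    # Load the array back using the full file name
    y_loaded = np.load(os.path.join(tmp_dir, 'y_train.npy'))

print(saved_files)
print(type(y_loaded), y_loaded)
```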

## 4. Get Baseline Model

**[4.1]** Calculate the average of the target variable for the training set and save it into a variable called `y_mean`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Calculate the average of the target variable for the training set and save it into a variable called y_mean

In [None]:
# Solution:
y_mean = y_train.mean()

**[4.2]** Create a numpy array called `y_base` of dimensions (len(y_train), 1) filled with this value

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a numpy array called y_base of dimensions (len(y_train), 1) filled with this value

In [None]:
# Solution:
y_base = np.full((len(y_train), 1), y_mean)

**[4.3]** Import the MSE and MAE metrics from sklearn

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Import the MSE and MAE metrics from sklearn

In [None]:
# Solution:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

**[4.4]** Display the RMSE and MAE scores of this baseline model

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Display the RMSE and MAE scores of this baseline model

In [None]:
# Solution:
# RMSE is the square root of MSE (the `squared` argument is deprecated in newer scikit-learn releases)
print(np.sqrt(mse(y_train, y_base)))
print(mae(y_train, y_base))
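
A worked example on five made-up target values may help interpret the baseline scores (the `_toy` names are assumptions for illustration). For the targets 1 to 5 the mean is 3, so the squared errors are 4, 1, 0, 1, 4 and the absolute errors are 2, 1, 0, 1, 2, giving an RMSE of √2 ≈ 1.414 and an MAE of 1.2. Note that the RMSE of a mean-only prediction equals the (population) standard deviation of the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy target values (illustrative only)
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline: predict the mean for every observation
y_toy_base = np.full((len(y_toy), 1), y_toy.mean())

rmse_base = np.sqrt(mean_squared_error(y_toy, y_toy_base))
mae_base = mean_absolute_error(y_toy, y_toy_base)

print(rmse_base, mae_base)
print(y_toy.std())  # same value as the baseline RMSE
```

Any trained model should beat these scores to be considered useful.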

## 5. Train ElasticNet model

**[5.1]** Import the ElasticNet class from sklearn.linear_model

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import the ElasticNet class from sklearn.linear_model

In [None]:
# Solution:
from sklearn.linear_model import ElasticNet 

**[5.2]** Instantiate the ElasticNet class into a variable called `reg`

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Instantiate the ElasticNet class into a variable called reg

In [None]:
# Solution
reg = ElasticNet()
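
ElasticNet is a linear regression that combines an L1 and an L2 penalty. Instantiating it without arguments uses scikit-learn's default hyperparameters, which can be inspected with `get_params()`; the variable name `toy_reg` below is only for illustration.

```python
from sklearn.linear_model import ElasticNet

toy_reg = ElasticNet()

# alpha sets the overall penalty strength, l1_ratio the mix between L1 and L2
params = toy_reg.get_params()
print(params['alpha'], params['l1_ratio'])
```

These defaults are a starting point; tuning `alpha` and `l1_ratio` is a natural follow-up experiment.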

**[5.3]** Fit the model with the prepared data

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Fit the model with the prepared data

In [None]:
# Solution
reg.fit(X_train, y_train)

**[5.4]** Save the fitted model into the folder `models` as a file called `elasticnet_default.joblib`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Save the fitted model into the folder models as a file called elasticnet_default.joblib

In [None]:
# Solution:
dump(reg, '../models/elasticnet_default.joblib')

**[5.5]** Save the predictions from this model for the training and validation sets into 2 variables called `y_train_preds` and `y_val_preds`


In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Save the predictions from this model for the training and validation sets into 2 variables called y_train_preds and y_val_preds

In [None]:
# Solution:
y_train_preds = reg.predict(X_train)
y_val_preds = reg.predict(X_val)

**[5.6]** Display the RMSE and MAE scores of this model on the training set

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Display the RMSE and MAE scores of this model on the training set

In [None]:
# Solution:
# RMSE is the square root of MSE
print(np.sqrt(mse(y_train, y_train_preds)))
print(mae(y_train, y_train_preds))

**[5.7]** Display the RMSE and MAE scores of this model on the validation set

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Display the RMSE and MAE scores of this model on the validation set

In [None]:
# Solution:
# RMSE is the square root of MSE
print(np.sqrt(mse(y_val, y_val_preds)))
print(mae(y_val, y_val_preds))
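
Since the same two metrics are printed for the baseline, training and validation sets, they can be wrapped in a small helper. The sketch below is one possible way to do it: the function name `reg_scores` and the synthetic data are assumptions for illustration, not part of the lab's code base.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error


def reg_scores(y_true, y_pred):
    """Return (RMSE, MAE) for a pair of target and prediction arrays."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    return rmse, mae


# Synthetic regression data: a linear signal plus noise (illustrative only)
rng = np.random.default_rng(8)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

toy_reg = ElasticNet().fit(X_toy, y_toy)

# Compare the mean-only baseline with the fitted model on the same data
baseline_scores = reg_scores(y_toy, np.full(len(y_toy), y_toy.mean()))
model_scores = reg_scores(y_toy, toy_reg.predict(X_toy))

print('baseline RMSE, MAE:', baseline_scores)
print('model RMSE, MAE:', model_scores)
```

Comparing model scores against the baseline, and training scores against validation scores, shows whether the model has learned anything and whether it is overfitting.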

## 6. Push changes

**[6.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (1 command line)
# Task: Add your changes to the git staging area

In [None]:
# Solution:
git add .

**[6.2]** Create a snapshot (commit) of your repository and add a description

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create a snapshot (commit) of your repository and add a description

In [None]:
# Solution:
git commit -m "first commit"

**[6.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (1 command line)
# Task: Push your snapshot to GitHub

In [None]:
# Solution:
git push

**[6.4]** Close Jupyter Lab by pressing Ctrl + C in the terminal where it is running