# **Lab: Neural Networks**

## Exercise 1: Regression with PyTorch

In this exercise, we will build a neural network with PyTorch to predict pollution levels. We will work with the Beijing Pollution dataset:
https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Beijing%20PM2.5

The steps are:
1.   Setup Repository
2.   Load and Explore Dataset
3.   Prepare Data
4.   Baseline Model
5.   Define Architecture
6.   Create Data Loader
7.   Train Model
8.   Assess Performance
9.   Push Changes

### 1. Setup Repository

**[1.1]** Go to a folder of your choice on your computer (where you store projects)

**[1.2]** Copy the cookiecutter data science template
- Follow the prompt (name the project and repo adv_dsi_lab_5)

**[1.3]** Go inside the created folder `adv_dsi_lab_5`

In [None]:
# Go to a folder of your choice on your computer (where you store projects)
cd ~/Projects/

# Copy the cookiecutter data science template
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
    
# Go inside the created folder adv_dsi_lab_5
cd adv_dsi_lab_5

**[1.4]** Create a file called `Dockerfile` and add the following content:

`FROM jupyter/scipy-notebook:0ce64578df46`

`RUN pip install torch==1.9.0+cpu torchvision==0.10.0+cpu torchtext==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html`

`ENV PYTHONPATH "${PYTHONPATH}:/home/jovyan/work"`

`RUN echo "export PYTHONPATH=/home/jovyan/work" >> ~/.bashrc`

`WORKDIR /home/jovyan/work`


**[1.5]** Build the image from this Dockerfile

**[1.6]** Run the built image

**[1.7]** Display last 50 lines of logs
- Copy the URL displayed and paste it into a browser to launch Jupyter Lab

In [None]:
# Build the image from this Dockerfile
docker build -t pytorch-notebook:latest .
    
# Run the built image
docker run -dit --rm --name adv_dsi_lab_5 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes \
  -v ~/Projects/adv_dsi/adv_dsi_lab_5:/home/jovyan/work \
  -v ~/Projects/adv_dsi/src:/home/jovyan/work/src \
  -v ~/Projects/adv_dsi/data:/home/jovyan/work/data \
  pytorch-notebook:latest
    
# Display last 50 lines of logs
docker logs --tail 50 adv_dsi_lab_5

**[1.8]** Initialise the repo

**[1.9]** Log in to GitHub with your account (https://github.com/) and create a public repo with the name `adv_dsi_lab_5`

**[1.10]** In your local repo `adv_dsi_lab_5`, link it with GitHub (replace the URL with your username)

**[1.11]** Add your changes to git staging area and commit them

**[1.12]** Push your master branch to origin

**[1.13]** Prevent pushes to the `master` branch

**[1.14]** Create a new git branch called `pytorch_reg`


In [None]:
"""
# Initialise the repo
git init

# Log in to GitHub with your account (https://github.com/)
# and create a public repo with the name `adv_dsi_lab_5`

# Link repo with GitHub
git remote add origin git@github.com:CazMayhem/adv_dsi_lab_5.git

# Add your changes to git staging area and commit them
git add .
git commit -m "init"

# Push your master branch to origin
git push https://<insert_pat>@github.com/CazMayhem/adv_dsi_lab_5.git --set-upstream origin master

# Prevent pushes to the master branch
git config branch.master.pushRemote no_push

# Create a new git branch called pytorch_reg
git checkout -b pytorch_reg

"""

### 2.   Load and Explore Dataset
**[2.1]** Download the dataset into the `data/raw` folder: https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Beijing%20PM2.5/PRSA_data_2010.1.1-2014.12.31.csv

In [33]:
!wget -P ../data/raw https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Beijing%20PM2.5/PRSA_data_2010.1.1-2014.12.31.csv

--2022-03-19 00:14:47--  https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Beijing%20PM2.5/PRSA_data_2010.1.1-2014.12.31.csv
Resolving code.datasciencedojo.com (code.datasciencedojo.com)... 167.99.111.153
Connecting to code.datasciencedojo.com (code.datasciencedojo.com)|167.99.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1966669 (1.9M) [text/plain]
Saving to: ‘../data/raw/PRSA_data_2010.1.1-2014.12.31.csv.1’


2022-03-19 00:14:50 (1.39 MB/s) - ‘../data/raw/PRSA_data_2010.1.1-2014.12.31.csv.1’ saved [1966669/1966669]



**[2.2]** Launch the magic commands for auto-reloading external modules

In [1]:
%load_ext autoreload
%autoreload 2

**[2.3]** Import the pandas and numpy packages

In [2]:
# import the pandas and numpy packages
import pandas as pd
import numpy as np

In [3]:
# Load the data in a dataframe called df
df = pd.read_csv('../data/raw/PRSA_data_2010.1.1-2014.12.31.csv')

# Display the first 5 rows of df
df.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


In [4]:
# Display the dimensions (shape) of df
df.shape

(43824, 13)

In [5]:
# Display the summary (info) of df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43824 entries, 0 to 43823
Data columns (total 13 columns):
No       43824 non-null int64
year     43824 non-null int64
month    43824 non-null int64
day      43824 non-null int64
hour     43824 non-null int64
pm2.5    41757 non-null float64
DEWP     43824 non-null int64
TEMP     43824 non-null float64
PRES     43824 non-null float64
cbwd     43824 non-null object
Iws      43824 non-null float64
Is       43824 non-null int64
Ir       43824 non-null int64
dtypes: float64(4), int64(8), object(1)
memory usage: 4.3+ MB


In [6]:
# Display the descriptive statistics of df
df.describe()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir
count,43824.0,43824.0,43824.0,43824.0,43824.0,41757.0,43824.0,43824.0,43824.0,43824.0,43824.0,43824.0
mean,21912.5,2012.0,6.523549,15.72782,11.5,98.613215,1.817246,12.448521,1016.447654,23.88914,0.052734,0.194916
std,12651.043435,1.413842,3.448572,8.799425,6.922266,92.050387,14.43344,12.198613,10.268698,50.010635,0.760375,1.415867
min,1.0,2010.0,1.0,1.0,0.0,0.0,-40.0,-19.0,991.0,0.45,0.0,0.0
25%,10956.75,2011.0,4.0,8.0,5.75,29.0,-10.0,2.0,1008.0,1.79,0.0,0.0
50%,21912.5,2012.0,7.0,16.0,11.5,72.0,2.0,14.0,1016.0,5.37,0.0,0.0
75%,32868.25,2013.0,10.0,23.0,17.25,137.0,15.0,23.0,1025.0,21.91,0.0,0.0
max,43824.0,2014.0,12.0,31.0,23.0,994.0,28.0,42.0,1046.0,585.6,27.0,36.0


### 3. Prepare Data

**[3.1]** Create a copy of `df` and save it into a variable called `df_cleaned`

**[3.2]** Remove the column `No` as it is an identifier for rows

**[3.3]** Remove the missing values from the target variable `pm2.5`

**[3.4]** Reset the indexes of the dataframe



In [7]:
# Create a copy of df and save it into a variable called df_cleaned
df_cleaned = df.copy()

# Remove the column No as it is an identifier for rows
df_cleaned.drop('No', axis=1, inplace=True)

# Remove the rows with missing values in the target variable pm2.5
df_cleaned.dropna(subset=['pm2.5'], inplace=True)

# Reset the indexes of the dataframe
df_cleaned.reset_index(drop=True, inplace=True)

**[3.5]** Import `StandardScaler` and `OneHotEncoder` from `sklearn.preprocessing`

**[3.6]** Create a list called `num_cols` that contains `year`, `DEWP`, `TEMP`, `PRES`, `Iws`, `Is`, `Ir`

**[3.7]** Instantiate a `StandardScaler` and call it `sc`

In [8]:
# Import StandardScaler and OneHotEncoder from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Create a list called num_cols that contains year, DEWP, TEMP, PRES, Iws, Is, Ir
num_cols = ['year', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir']

# Instantiate a StandardScaler and call it sc
sc = StandardScaler()

**[3.8]** Fit and transform the numeric features of `df_cleaned` and write the scaled values back into it

**[3.9]** Create a list called `cat_cols` that contains `month`, `day`, `hour`, `cbwd`

In [9]:
# Fit and transform the numeric features of df_cleaned and write the scaled values back into it
df_cleaned[num_cols] = sc.fit_transform(df_cleaned[num_cols])

# Create a list called cat_cols that contains month, day, hour, cbwd
cat_cols = ['month', 'day', 'hour', 'cbwd']
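As a rough illustration of what the scaling step does, the sketch below standardises the five `TEMP` values shown in the `df.head()` output by hand. `StandardScaler` subtracts the column mean and divides by the population standard deviation; the variable names here are illustrative and not part of the lab code.

```python
import numpy as np

# TEMP values from the first five rows displayed by df.head()
temp = np.array([-11.0, -12.0, -11.0, -14.0, -12.0])

# Standardisation: subtract the mean, divide by the population std (ddof=0)
scaled = (temp - temp.mean()) / temp.std(ddof=0)

print(scaled.round(3))
print(scaled.mean().round(6), scaled.std(ddof=0).round(6))
```

After scaling, the column has a mean of (approximately) 0 and a standard deviation of 1, which keeps features with large raw ranges such as `PRES` from dominating the gradient updates.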

**[3.10]** Instantiate a `OneHotEncoder` and call it `ohe`

**[3.11]** Perform One-Hot encoding on `cat_cols` and save them into a dataframe called `X_cat`

**[3.12]** Extract the feature names from `ohe` and replace the names of the columns of the `X_cat`

**[3.13]** Drop the original columns of `cat_cols` from `df_cleaned`

In [10]:
# Instantiate a OneHotEncoder and call it ohe
ohe = OneHotEncoder(sparse=False)

# Perform One-Hot encoding on cat_cols and save them into a dataframe called X_cat
X_cat = pd.DataFrame(ohe.fit_transform(df_cleaned[cat_cols]))

# Extract the feature names from ohe and replace the names of the columns of the X_cat
X_cat.columns = ohe.get_feature_names(cat_cols)

# Drop the original columns of cat_cols from df_cleaned
df_cleaned.drop(cat_cols, axis=1, inplace=True)
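One-hot encoding creates one column per distinct value, so it is worth checking how many features to expect. The arithmetic below is a small sketch: the ranges for `month`, `day` and `hour` come from the `df.describe()` output, while the count of four `cbwd` labels is an assumption that is consistent with the `in_features=78` printed in section 5.

```python
# Number of distinct values assumed for each categorical column
n_categories = {
    "month": 12,  # 1..12
    "day": 31,    # 1..31
    "hour": 24,   # 0..23
    "cbwd": 4,    # assumption: four wind-direction labels
}
num_cols = ["year", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]

n_one_hot = sum(n_categories.values())
n_features = len(num_cols) + n_one_hot
print(n_one_hot, n_features)  # 71 78
```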

**[3.14]** Concatenate `df_cleaned` with `X_cat` and save the result to a variable called `X`

**[3.15]** Import `split_sets_by_time` and `save_sets` from `src.data.sets`

**[3.16]** Split the data into training, validation and testing sets with `test_ratio=0.2`

In [11]:
# Concatenate df_cleaned with X_cat and save the result to a variable called X
X = pd.concat([df_cleaned, X_cat], axis=1)

# Import split_sets_by_time and save_sets from src.data.sets
from src.data.sets import split_sets_by_time, save_sets

# Split the data into training, validation and testing sets with test_ratio=0.2
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_by_time(X, target_col='pm2.5', test_ratio=0.2)
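The implementation of `split_sets_by_time` lives in `src/data/sets.py` and is not shown in this notebook. The sketch below is one plausible way to write such a helper, not the actual code: it keeps the rows in their existing (chronological) order and cuts the last two blocks off as validation and testing sets. The function name and the choice to reuse `test_ratio` for the validation set are assumptions.

```python
import pandas as pd

def split_by_time_sketch(df, target_col, test_ratio=0.2):
    """Hypothetical stand-in for split_sets_by_time: cut rows in their existing order."""
    features = df.copy()
    target = features.pop(target_col)
    n_holdout = int(len(features) * test_ratio)
    # The last block of rows is the test set, the block before it is validation
    # (assumption: validation uses the same ratio as test)
    test_start = len(features) - n_holdout
    val_start = test_start - n_holdout
    X_train, y_train = features.iloc[:val_start], target.iloc[:val_start]
    X_val, y_val = features.iloc[val_start:test_start], target.iloc[val_start:test_start]
    X_test, y_test = features.iloc[test_start:], target.iloc[test_start:]
    return X_train, y_train, X_val, y_val, X_test, y_test

# Toy frame for illustration only
toy = pd.DataFrame({"TEMP": range(10), "pm2.5": range(100, 110)})
X_tr, y_tr, X_va, y_va, X_te, y_te = split_by_time_sketch(toy, "pm2.5")
print(len(X_tr), len(X_va), len(X_te))  # 6 2 2
```

Splitting by position rather than at random avoids training on observations that come after the ones used for evaluation, which matters for hourly time series like this one.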

**[3.17]** Create the following folder: `data/processed/beijing_pollution`



In [12]:
# Create the following folder: data/processed/beijing_pollution
# (-p also creates missing parent folders and does not fail if they already exist)
!mkdir -p ../data/processed/beijing_pollution


**[3.18]** Save the sets in the `data/processed/beijing_pollution` folder



In [13]:
# Save the sets in the data/processed/beijing_pollution folder
save_sets(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test, 
          path='../data/processed/beijing_pollution/')
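`save_sets` is also defined in `src/data/sets.py` and is not reproduced here. As a hedged sketch of what such a helper might do, the function below writes each set that is not `None` to a `.npy` file named after its keyword argument. The function name, file naming and format are assumptions.

```python
import os
import tempfile
import numpy as np

def save_sets_sketch(path, **sets):
    """Hypothetical stand-in for save_sets: write each non-empty set to <path>/<name>.npy."""
    for name, values in sets.items():
        if values is not None:
            np.save(os.path.join(path, f"{name}.npy"), np.asarray(values))

# Write two toy sets to a temporary folder and read one back
with tempfile.TemporaryDirectory() as tmp_dir:
    save_sets_sketch(tmp_dir, X_train=[[1.0, 2.0]], y_train=[3.0], X_val=None)
    saved_files = sorted(os.listdir(tmp_dir))
    reloaded = np.load(os.path.join(tmp_dir, "y_train.npy"))

print(saved_files)
```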

### 4. Baseline Model

**[4.1]** Import `NullModel` from `src.models.null`

**[4.2]** Instantiate a `NullModel` and call `.fit_predict()` on the training target to extract your predictions into a variable called `y_base`

**[4.3]** Import `print_reg_perf` from `src.models.performance`

**[4.4]** Print the regression metrics for this baseline model

In [20]:
# Import NullModel from src.models.null
from src.models.null import NullModel

# Instantiate a NullModel and call .fit_predict() on the training target 
# to extract your predictions into a variable called y_base
baseline_model = NullModel()
y_base = baseline_model.fit_predict(y_train)

# Import print_reg_perf from src.models.performance
from src.models.performance import print_reg_perf

# Print the regression metrics for this baseline model
print_reg_perf(y_base, y_train, set_name='Training')

RMSE Training: 92.82545840756482
MAE Training : 69.67082209440568
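`NullModel` and `print_reg_perf` come from the `src` package and are not shown in this notebook. A common choice for a regression baseline is to predict the mean of the training target for every row; the sketch below assumes that behaviour and computes RMSE and MAE by hand. The class and function names are hypothetical.

```python
import numpy as np

class MeanBaseline:
    """Hypothetical stand-in for NullModel: always predicts the training mean."""

    def fit_predict(self, y):
        y = np.asarray(y, dtype=float)
        self.mean_ = y.mean()
        return np.full(len(y), self.mean_)

def rmse(y_pred, y_true):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def mae(y_pred, y_true):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

y_toy = np.array([10.0, 20.0, 30.0, 40.0])  # illustrative target values
y_hat = MeanBaseline().fit_predict(y_toy)
print(rmse(y_hat, y_toy), mae(y_hat, y_toy))
```

Under this assumption, the baseline RMSE equals the population standard deviation of the target, which gives a reference point that any trained model should beat.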


### 5. Define Architecture

**[5.1]** Import `torch`, `torch.nn` as `nn` and `torch.nn.functional` as `F`

**[5.3]** Instantiate `PytorchRegression` with the correct number of input features and save it into a variable called `model`

**[5.5]** Set `model` to use the device available


In [23]:
# Import torch and torch.nn as nn
import torch
import torch.nn as nn
import torch.nn.functional as F

# Instantiate PytorchRegression with the correct number of input features and save it into a variable called model
from src.models.pytorch import PytorchRegression

model = PytorchRegression(X_train.shape[1])

# Set model to use the device available
from src.models.pytorch import get_device

device = get_device()
model.to(device)

PytorchRegression(
  (layer_1): Linear(in_features=78, out_features=128, bias=True)
  (layer_out): Linear(in_features=128, out_features=1, bias=True)
)
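The class definition of `PytorchRegression` and the `get_device` helper sit in `src/models/pytorch.py` and are not shown here. The sketch below reproduces the two layers from the printed summary. The ReLU activation in `forward()` and the device-selection logic are assumptions, and both names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegressionSketch(nn.Module):
    """Hypothetical stand-in for PytorchRegression with the two printed layers."""

    def __init__(self, num_features):
        super().__init__()
        self.layer_1 = nn.Linear(num_features, 128)
        self.layer_out = nn.Linear(128, 1)

    def forward(self, x):
        # Assumption: ReLU between the layers; the source does not show forward()
        x = F.relu(self.layer_1(x))
        return self.layer_out(x)

def pick_device():
    """Hypothetical stand-in for get_device: prefer a GPU when one is visible."""
    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

sketch_model = RegressionSketch(78).to(pick_device())
out = sketch_model(torch.zeros(4, 78, device=pick_device()))
print(out.shape)  # torch.Size([4, 1])
```

The output layer has a single unit and no activation, which is the usual setup for predicting an unbounded continuous target.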

### 6. Create Data Loader

**[6.1]** Import `Dataset` and `DataLoader` from `torch.utils.data`

**[6.3]** Import the `PytorchDataset` class from `src/models/pytorch` and convert all sets to `PytorchDataset`

In [26]:
# Import Dataset and DataLoader from torch.utils.data
from torch.utils.data import Dataset, DataLoader

# Import the PytorchDataset class from src/models/pytorch and convert all sets to PytorchDataset
from src.models.pytorch import PytorchDataset

train_dataset = PytorchDataset(X=X_train, y=y_train)
val_dataset = PytorchDataset(X=X_val, y=y_val)
test_dataset = PytorchDataset(X=X_test, y=y_test)
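`PytorchDataset` is defined in `src/models/pytorch.py` and is not shown in this notebook. A custom `Dataset` needs `__len__` and `__getitem__`; the sketch below is one minimal way to wrap features and a target as float tensors. The class name and the conversion details are assumptions.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Hypothetical stand-in for PytorchDataset: wraps features and target as float tensors."""

    def __init__(self, X, y):
        self.X = torch.tensor(np.asarray(X), dtype=torch.float32)
        self.y = torch.tensor(np.asarray(y), dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.X[index], self.y[index]

# Toy data for illustration only
toy_dataset = ArrayDataset(X=np.zeros((10, 3)), y=np.arange(10))
features, target = next(iter(DataLoader(toy_dataset, batch_size=4)))
print(features.shape, target.shape)
```

Note that the target batch in this sketch has shape `(batch,)` while a model with one output unit returns `(batch, 1)`, so the two need to be aligned before computing the loss.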

### 7. Train Model

**[7.1]** Instantiate a `nn.MSELoss()` and save it into a variable called `criterion` 

**[7.2]** Instantiate a `torch.optim.Adam()` optimizer with the model's parameters and 0.001 as learning rate and save it into a variable called `optimizer`

**[7.5]** Create 2 variables called `N_EPOCHS` and `BATCH_SIZE`, set to 5 and 32 respectively

In [29]:
# Instantiate a nn.MSELoss() and save it into a variable called criterion
criterion = nn.MSELoss()

# Instantiate a torch.optim.Adam() optimizer with the model's parameters and 0.001 as learning rate 
# and save it into a variable called optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create the variables N_EPOCHS and BATCH_SIZE, set to 5 and 32 respectively
N_EPOCHS = 5
BATCH_SIZE = 32

**[7.6]** Create a for loop that iterates over the specified number of epochs. In each epoch, train the model on the training set, assess it on the validation set, and print both scores

In [30]:
# Loop over the specified number of epochs: train the model on the training set,
# assess it on the validation set, and print both scores
from src.models.pytorch import train_regression, test_regression

for epoch in range(N_EPOCHS):
    train_loss, train_rmse = train_regression(train_dataset, model=model, criterion=criterion, optimizer=optimizer, batch_size=BATCH_SIZE, device=device)
    valid_loss, valid_rmse = test_regression(val_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)

    print(f'Epoch: {epoch}')
    print(f'\t(train)\tLoss: {train_loss:.4f}\t|\tRMSE: {train_rmse:.1f}')
    print(f'\t(valid)\tLoss: {valid_loss:.4f}\t|\tRMSE: {valid_rmse:.1f}')


Epoch: 0
	(train)	Loss: 352.0528	|	RMSE: 18.8
	(valid)	Loss: 249.2544	|	RMSE: 15.8
Epoch: 1
	(train)	Loss: 273.6801	|	RMSE: 16.5
	(valid)	Loss: 237.7091	|	RMSE: 15.4
Epoch: 2
	(train)	Loss: 272.6869	|	RMSE: 16.5
	(valid)	Loss: 237.6739	|	RMSE: 15.4
Epoch: 3
	(train)	Loss: 272.2375	|	RMSE: 16.5
	(valid)	Loss: 237.9588	|	RMSE: 15.4
Epoch: 4
	(train)	Loss: 272.2680	|	RMSE: 16.5
	(valid)	Loss: 237.3370	|	RMSE: 15.4
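`train_regression` and `test_regression` are imported from `src/models/pytorch.py` and their bodies are not shown. The sketch below is one possible shape for a single training epoch, not the actual implementation. The function name, the reshape of the target, and the per-sample averaging of the loss are all choices made for this sketch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(dataset, model, criterion, optimizer, batch_size, device):
    """Hypothetical stand-in for train_regression: one pass over the training set."""
    model.train()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    total_loss = 0.0
    for features, target in loader:
        features = features.to(device)
        # Reshape the target to (batch, 1) so it matches the model output
        target = target.to(device).reshape(-1, 1)
        optimizer.zero_grad()
        loss = criterion(model(features), target)
        loss.backward()
        optimizer.step()
        # Weight each batch by its size to get a per-sample average
        total_loss += loss.item() * len(features)
    mse = total_loss / len(dataset)
    return mse, mse ** 0.5

torch.manual_seed(0)
X_toy = torch.randn(64, 3)
y_toy = X_toy.sum(dim=1)  # synthetic target, for illustration only
toy_model = nn.Linear(3, 1)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)
history = [
    train_one_epoch(TensorDataset(X_toy, y_toy), toy_model, nn.MSELoss(),
                    toy_optimizer, batch_size=16, device=torch.device("cpu"))
    for _ in range(20)
]
print(history[0], history[-1])
```

The reshape matters because `nn.MSELoss` broadcasts a `(batch,)` target against a `(batch, 1)` prediction, which PyTorch flags with a warning and which yields a loss over every prediction and target pair rather than the matching ones.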


**[7.7]** Save the model into the `models` folder

In [31]:
# Save the model into the models folder
torch.save(model, "../models/pytorch_reg_pm2_5.pt")
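`torch.save(model, ...)` pickles the whole model object, so loading it later requires the `PytorchRegression` class to be importable from the same module path. An alternative that is often preferred is to save only the `state_dict`. The sketch below shows the round trip on a toy layer; the file name and the stand-in model are hypothetical.

```python
import os
import tempfile
import torch
import torch.nn as nn

toy_model = nn.Linear(3, 1)  # stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "toy_weights.pt")  # hypothetical file name
    # Save only the learned parameters rather than the pickled model object
    torch.save(toy_model.state_dict(), weights_path)

    # Rebuild the same architecture, then load the parameters into it
    restored = nn.Linear(3, 1)
    restored.load_state_dict(torch.load(weights_path))

same_weights = torch.equal(toy_model.weight, restored.weight)
print(same_weights)  # True
```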

### 8.   Assess Performance

**[8.1]** Assess the model performance on the testing set and print its scores

In [32]:
# Assess the model performance on the testing set and print its scores
test_loss, test_rmse = test_regression(test_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)
print(f'\tLoss: {test_loss:.4f}\t|\tRMSE: {test_rmse:.1f}')

	Loss: 276.4697	|	RMSE: 16.6
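The printed RMSE values are consistent with being the square root of the printed loss, which can be checked directly from the outputs above. The dictionary below copies three of the reported pairs; the variable names are illustrative.

```python
import math

# (reported loss, reported RMSE) pairs copied from the outputs above
reported = {
    "train, epoch 0": (352.0528, 18.8),
    "valid, epoch 4": (237.3370, 15.4),
    "test": (276.4697, 16.6),
}

for name, (loss, rmse) in reported.items():
    print(f"{name}: sqrt({loss}) = {math.sqrt(loss):.1f} (reported {rmse})")
```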


### 9.   Push Changes

In [None]:
"""
# Add your changes to git staging area
git add .

# Create the snapshot of your repository and add a description
git commit -m "pytorch regression"

# Push your snapshot to GitHub
git push https://<insert_pat>@github.com/CazMayhem/adv_dsi_lab_5.git

# Check out the master branch
git checkout master

# Pull the latest updates
git pull https://<insert_pat>@github.com/CazMayhem/adv_dsi_lab_5.git

# Check out the pytorch_reg branch
git checkout pytorch_reg

# Merge the master branch and push your changes.
# If there are merge issues, use: git merge master --allow-unrelated-histories
git merge master 
git push https://<insert_pat>@github.com/CazMayhem/adv_dsi_lab_5.git

"""

In [None]:
# Stop the Docker container
docker stop adv_dsi_lab_5