# IMRSV TEST

#### Solution to ML-Engineer Exam

By Jana Rasras

## Import Libraries

I will be using:
* `pandas`, `NumPy`, and `scikit-learn` for data processing and engineering
* `PyTorch` for the neural network model

In [105]:
import torch
from torch import nn, optim
from torch.utils import data

import pandas as pd
import numpy as np
import sklearn.preprocessing as skl

## Load and prepare the data

The data is provided as a `.csv` file. Data curation includes:
* Remove unused columns and rows with missing values.
* Find the total count per day, then encode it into numerical class labels.
* Encode the date into a useful numerical format.
* Normalize the data.

Note: I decided to index the `pandas` `DataFrame` by the `date` column to simplify preprocessing. Also, `meta_df` will be used to test each step of the implementation.

In [106]:
df = pd.read_csv('data.csv', index_col='date', parse_dates=True, usecols=['date', 'max temp','mean temp','min temp','snow on grnd (cm)', 'total precip (mm)', 'total rain (mm)', 'total snow (cm)','count'])
df = df.astype('float64')

meta_df = df.copy()
print(len(meta_df))
meta_df.head()

23028


| date | count | max temp | mean temp | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-02-20 | 0.0 | -14.0 | -18.2 | -22.3 | 27.0 | 0.3 | 0.0 | 0.3 |
| 2016-08-28 | 968.0 | 29.5 | 23.8 | 18.0 | 0.0 | 2.8 | 2.8 | 0.0 |
| 2011-12-25 | 10.0 | -3.0 | -9.3 | -15.6 | 1.0 | 7.0 | 0.0 | 9.5 |
| 2017-05-01 | 122.0 | 14.0 | 8.8 | 3.5 | 0.0 | 34.4 | 34.4 | 0.0 |
| 2015-04-27 | 228.0 | 15.2 | 9.4 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 |


### A. Preprocessing

In this part, I will:

* Remove unused columns and rows with missing values.
* Find the total count per day, then encode it into numerical class labels.

In [107]:
meta_df = meta_df.dropna(axis=0)
print(len(meta_df))
meta_df.head()

22895


| date | count | max temp | mean temp | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-02-20 | 0.0 | -14.0 | -18.2 | -22.3 | 27.0 | 0.3 | 0.0 | 0.3 |
| 2016-08-28 | 968.0 | 29.5 | 23.8 | 18.0 | 0.0 | 2.8 | 2.8 | 0.0 |
| 2011-12-25 | 10.0 | -3.0 | -9.3 | -15.6 | 1.0 | 7.0 | 0.0 | 9.5 |
| 2017-05-01 | 122.0 | 14.0 | 8.8 | 3.5 | 0.0 | 34.4 | 34.4 | 0.0 |
| 2015-04-27 | 228.0 | 15.2 | 9.4 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 |


In [108]:
## Total Count
total_col = meta_df.groupby('date')[['count']].sum()
print(len(total_col))
total_col.head()

3260


| date | count |
| --- | --- |
| 2010-01-01 | 0.0 |
| 2010-01-02 | 0.0 |
| 2010-01-03 | 0.0 |
| 2010-01-04 | 0.0 |
| 2010-01-05 | 0.0 |


In [109]:
meta_df.insert(3, "total", total_col) 
print(len(meta_df))
meta_df.head()

22895


| date | count | max temp | mean temp | total | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-02-20 | 0.0 | -14.0 | -18.2 | 285.0 | -22.3 | 27.0 | 0.3 | 0.0 | 0.3 |
| 2016-08-28 | 968.0 | 29.5 | 23.8 | 6771.0 | 18.0 | 0.0 | 2.8 | 2.8 | 0.0 |
| 2011-12-25 | 10.0 | -3.0 | -9.3 | 91.0 | -15.6 | 1.0 | 7.0 | 0.0 | 9.5 |
| 2017-05-01 | 122.0 | 14.0 | 8.8 | 2206.0 | 3.5 | 0.0 | 34.4 | 34.4 | 0.0 |
| 2015-04-27 | 228.0 | 15.2 | 9.4 | 4512.0 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 |
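
The per-day total above is computed once per date and then broadcast back onto every row that shares that date. As an alternative sketch (not the approach used in this notebook), `groupby(...).transform("sum")` does both steps in one call. The toy values below are made up for illustration.

```python
import pandas as pd

# Toy frame with several rows per date, mimicking the layout of data.csv
# (the counts are invented for this example).
toy = pd.DataFrame(
    {"count": [1.0, 2.0, 3.0, 10.0, 20.0]},
    index=pd.to_datetime(
        ["2015-02-20", "2015-02-20", "2015-02-20", "2016-08-28", "2016-08-28"]
    ).rename("date"),
)

# transform("sum") returns one value per original row, already aligned
# with that row, so no separate insert/alignment step is required.
toy["total"] = toy.groupby(level="date")["count"].transform("sum")
print(toy)
```

Every row of a given date ends up with the same `total`, which is the property the class labels in the next cell rely on.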


In [110]:
## Create 3 Classes [Low:0, Med:1, High:2]

meta_df['total count'] = np.where(
    meta_df['total']>10000, 
    2, 
    np.where(
        meta_df['total']<=2000, 
        0,
        1
    )
)

meta_df.drop(columns=['count','total'], inplace=True)

print(len(meta_df))
meta_df.head()

22895


| date | max temp | mean temp | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) | total count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-02-20 | -14.0 | -18.2 | -22.3 | 27.0 | 0.3 | 0.0 | 0.3 | 0 |
| 2016-08-28 | 29.5 | 23.8 | 18.0 | 0.0 | 2.8 | 2.8 | 0.0 | 1 |
| 2011-12-25 | -3.0 | -9.3 | -15.6 | 1.0 | 7.0 | 0.0 | 9.5 | 0 |
| 2017-05-01 | 14.0 | 8.8 | 3.5 | 0.0 | 34.4 | 34.4 | 0.0 | 1 |
| 2015-04-27 | 15.2 | 9.4 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
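
The three classes use the rule: total <= 2000 is Low (0), total > 10000 is High (2), anything in between is Med (1). The sketch below checks that rule on a few boundary values and shows `pd.cut` as an equivalent, arguably more readable, way to write it. The sample totals are illustrative and not taken from the data set.

```python
import numpy as np
import pandas as pd

# Illustrative totals, chosen to sit on and around both thresholds.
totals = pd.Series([285.0, 2000.0, 2206.0, 10000.0, 10001.0])

# pd.cut bins are right-closed by default:
# (-inf, 2000] -> 0, (2000, 10000] -> 1, (10000, inf] -> 2
labels = pd.cut(
    totals, bins=[-np.inf, 2000, 10000, np.inf], labels=[0, 1, 2]
).astype(int)

# The same rule written with nested np.where, as in the cell above.
nested = np.where(totals > 10000, 2, np.where(totals <= 2000, 0, 1))

print(labels.tolist())
print(nested.tolist())
```

Both versions agree on the boundary cases: exactly 2000 is still Low and exactly 10000 is still Med.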


In [111]:
def preprocessing(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Pre-process a dataframe

    :param pd.DataFrame df: raw dataframe from data.csv

    :returns pd.DataFrame processed_df: processed dataframe
    '''
    
    processed_df = df.copy()
    processed_df = processed_df.dropna(axis=0)
    
    total_col = processed_df.groupby('date')[['count']].sum()
    processed_df.insert(3, "total", total_col) 
    
    processed_df['total count'] = np.where(
        processed_df['total']>10000, 
        2, 
        np.where(
            processed_df['total']<=2000, 
            0,
            1
        )
    )

    processed_df.drop(columns=['count', 'total'], inplace=True)

    return processed_df

**Test**:

In [112]:
processed_df = preprocessing(df)

meta_df = processed_df.copy()
print(len(meta_df))
meta_df.head()

22895


| date | max temp | mean temp | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) | total count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-02-20 | -14.0 | -18.2 | -22.3 | 27.0 | 0.3 | 0.0 | 0.3 | 0 |
| 2016-08-28 | 29.5 | 23.8 | 18.0 | 0.0 | 2.8 | 2.8 | 0.0 | 1 |
| 2011-12-25 | -3.0 | -9.3 | -15.6 | 1.0 | 7.0 | 0.0 | 9.5 | 0 |
| 2017-05-01 | 14.0 | 8.8 | 3.5 | 0.0 | 34.4 | 34.4 | 0.0 | 1 |
| 2015-04-27 | 15.2 | 9.4 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |


Save Dataframes to `.csv` files

In [113]:
with open('test/processed_df.csv', 'w') as f:
    processed_df.to_csv(f, index=False)

### B. Data Engineering

In this section, I will:

* Encode the date into a useful numerical format as follows:
    * Day/Month => `DayOfYear [1-366]`: captures trends across the seasons.
    * Year => `Index [2010=0, ..., 2018=8]`: captures the increasing popularity of bike riding as the population grows over time.
    * Day of week => `DayOfWeek [Monday=0, Sunday=6]`: also added as a feature in the code below.
* Normalize the data to the range [0, 1].
* Split the data into train and test `DataFrames`.
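
As a small worked example of the date encoding, the sketch below derives the three date features for the first rows shown in the tables of this notebook. The base year 2010 is hard-coded here as an assumption; the notebook itself derives it from the minimum year in the data.

```python
import pandas as pd

# Three dates taken from the sample rows printed in this notebook.
dates = pd.DatetimeIndex(["2015-02-20", "2016-08-28", "2011-12-25"])
base_year = 2010  # assumption: first year present in the data set

encoded = pd.DataFrame(
    {
        "year": dates.year - base_year,   # years elapsed since the first year
        "day_y": dates.dayofyear,         # 1..366, position within the year
        "day_w": dates.dayofweek,         # Monday=0 .. Sunday=6
    },
    index=dates,
)
print(encoded)
```

For 2015-02-20 this gives year 5, day of year 51 (31 days of January plus 20), and day of week 4 (a Friday).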

In [116]:
# Encode date
ts = pd.Series(meta_df.index)


encode_df = pd.DataFrame({
    'date': meta_df.index, 
    'year': ts.dt.year - min(ts.dt.year), 
    'day_y':ts.dt.dayofyear,
    'day_w':ts.dt.dayofweek
})
encode_df.set_index('date', inplace=True)

print(len(encode_df))
encode_df.head()

22895


| date | year | day_y | day_w |
| --- | --- | --- | --- |
| 2015-02-20 | 5 | 51 | 4 |
| 2016-08-28 | 6 | 241 | 6 |
| 2011-12-25 | 1 | 359 | 6 |
| 2017-05-01 | 7 | 121 | 0 |
| 2015-04-27 | 5 | 117 | 0 |


In [117]:
meta2_df = pd.concat([meta_df,encode_df], axis=1)
meta2_df.head()

| date | max temp | mean temp | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) | total count | year | day_y | day_w |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-02-20 | -14.0 | -18.2 | -22.3 | 27.0 | 0.3 | 0.0 | 0.3 | 0 | 5 | 51 | 4 |
| 2016-08-28 | 29.5 | 23.8 | 18.0 | 0.0 | 2.8 | 2.8 | 0.0 | 1 | 6 | 241 | 6 |
| 2011-12-25 | -3.0 | -9.3 | -15.6 | 1.0 | 7.0 | 0.0 | 9.5 | 0 | 1 | 359 | 6 |
| 2017-05-01 | 14.0 | 8.8 | 3.5 | 0.0 | 34.4 | 34.4 | 0.0 | 1 | 7 | 121 | 0 |
| 2015-04-27 | 15.2 | 9.4 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 5 | 117 | 0 |


In [118]:
# Normalization Transform

cols = ['max temp','mean temp','min temp','snow on grnd (cm)', 'total precip (mm)', 'total rain (mm)', 'total snow (cm)']

tf = skl.MinMaxScaler()
meta2_df.loc[:,cols] = tf.fit_transform( meta2_df[cols] )

print(len(meta2_df))
meta2_df.head()

22895


| date | max temp | mean temp | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) | total count | year | day_y | day_w |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-02-20 | 0.172697 | 0.151943 | 0.158672 | 0.409091 | 0.003546 | 0.0 | 0.008108 | 0 | 5 | 51 | 4 |
| 2016-08-28 | 0.888158 | 0.893993 | 0.902214 | 0.0 | 0.033097 | 0.033097 | 0.0 | 1 | 6 | 241 | 6 |
| 2011-12-25 | 0.353618 | 0.309187 | 0.282288 | 0.015152 | 0.082742 | 0.0 | 0.256757 | 0 | 1 | 359 | 6 |
| 2017-05-01 | 0.633224 | 0.628975 | 0.634686 | 0.0 | 0.406619 | 0.406619 | 0.0 | 1 | 7 | 121 | 0 |
| 2015-04-27 | 0.652961 | 0.639576 | 0.634686 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 5 | 117 | 0 |
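
With its default settings, `MinMaxScaler` maps each column to [0, 1] via `(x - min) / (max - min)`, using the column's own minimum and maximum. The sketch below reproduces the `max temp` column of the table above by hand. The bounds -24.5 and 36.3 are an assumption: they were inferred from the raw and scaled values printed in this notebook, not read from `data.csv`.

```python
import numpy as np

def min_max(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Scale x linearly so that lo maps to 0 and hi maps to 1."""
    return (x - lo) / (hi - lo)

# Raw `max temp` values of the five sample rows shown earlier.
max_temp = np.array([-14.0, 29.5, -3.0, 14.0, 15.2])

# Assumed column-wide bounds (inferred from the printed tables).
scaled = min_max(max_temp, lo=-24.5, hi=36.3)
print(np.round(scaled, 6))
```

The rounded results match the scaled `max temp` column printed above, which suggests the inferred bounds are consistent with the data.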


In [119]:
# Shuffle the data and split into train (80%) and test (20%) sets
meta3_df = meta2_df.sample(frac=1)

N = int( 0.8 * len(meta3_df) )

train_df = meta3_df[:N]
test_df = meta3_df[N:]

print("Train set: \t\t{}".format(train_df.shape), 
      "\nTest set: \t\t{}".format(test_df.shape))

Train set: 		(18316, 11) 
Test set: 		(4579, 11)


In [138]:
def data_engineering(processed_df: pd.DataFrame) -> (pd.DataFrame,
                                                     pd.DataFrame):
    '''
    Perform data engineering on processed dataframe

    :param pd.DataFrame processed_df: output of preprocessing()

    :returns pd.DataFrame train_df: training set of the engineered dataframe
    :returns pd.DataFrame test_df: test set of the engineered dataframe
    '''
    
    # Encode Date
    ts = pd.Series(processed_df.index)


    encode_df = pd.DataFrame({
        'date': processed_df.index, 
        'year': ts.dt.year - min(ts.dt.year), 
        'day_y':ts.dt.dayofyear,
        'day_w': ts.dt.dayofweek
    })
    encode_df.set_index('date', inplace=True)
    
    engineered_df = pd.concat([processed_df,encode_df], axis=1)

    # Normalization Transform
    cols = ['max temp','mean temp','min temp','snow on grnd (cm)', 
            'total precip (mm)', 'total rain (mm)', 'total snow (cm)',
           'year', 'day_w', 'day_y']

    tf = skl.MinMaxScaler()
    engineered_df.loc[:,cols] = tf.fit_transform( engineered_df[cols] )
    
    rand_df = engineered_df.sample(frac=1)

    N = int( 0.8 * len(rand_df) )

    train_df = rand_df[:N]
    test_df = rand_df[N:]

    return (train_df, test_df)

In [139]:
train_df, test_df = data_engineering(processed_df)

metax_df = train_df.copy()
metay_df = test_df.copy()

metax_df.head()

| date | max temp | mean temp | min temp | snow on grnd (cm) | total precip (mm) | total rain (mm) | total snow (cm) | total count | year | day_y | day_w |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2011-09-21 | 0.814145 | 0.768551 | 0.725092 | 0.0 | 0.040189 | 0.040189 | 0.0 | 2 | 0.125 | 0.720548 | 0.333333 |
| 2012-01-01 | 0.486842 | 0.484099 | 0.49631 | 0.181818 | 0.01182 | 0.01182 | 0.0 | 0 | 0.25 | 0.0 | 1.0 |
| 2011-11-19 | 0.608553 | 0.597173 | 0.595941 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.125 | 0.882192 | 0.833333 |
| 2015-03-15 | 0.435855 | 0.473498 | 0.53321 | 0.469697 | 0.0 | 0.0 | 0.0 | 0 | 0.625 | 0.2 | 1.0 |
| 2013-09-07 | 0.731908 | 0.765018 | 0.809963 | 0.0 | 0.050827 | 0.050827 | 0.0 | 2 | 0.375 | 0.682192 | 0.833333 |


Save Dataframes to `.csv` files

In [140]:
with open('test/train_df.csv', 'w') as f:
    train_df.to_csv(f, index=False)
    
with open('test/test_df.csv', 'w') as f:
    test_df.to_csv(f, index=False)

## C. Build NN in PyTorch

In this section, I will:
* Create the NN `model`.
* Create a `DataLoader` to batch the training data.
* Train and test the model.

Note: PyTorch layers use `float32` weights by default and the loss expects `long` class labels, so the `float64` values coming from `pandas` need an explicit cast.
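
To make the dtype point concrete: `torch.from_numpy` keeps the NumPy dtype, so features taken from a `float64` `DataFrame` arrive as `float64` tensors unless they are cast. The sketch below uses made-up feature values; casting with `.float()` is one common way to match the `float32` weights of `nn.Linear`.

```python
import numpy as np
import torch

# Made-up feature rows; NumPy (and pandas) default to float64.
features = np.array([[0.17, 0.15], [0.89, 0.90]])
targets = np.array([0, 1])

x64 = torch.from_numpy(features)          # dtype is preserved: float64
x32 = x64.float()                         # cast to float32 for the model
y = torch.from_numpy(targets).long()      # class labels as int64 ("long")

print(x64.dtype, x32.dtype, y.dtype)
```

An alternative is `torch.set_default_dtype(torch.float64)`, which makes the model weights `float64` instead of casting the data.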

In [141]:
# Cast to float32 so the features match the dtype of the model weights
x_train = torch.from_numpy( metax_df.drop(columns=['total count']).to_numpy() ).float()
x_train.shape

torch.Size([18316, 10])

In [142]:
y_train = torch.from_numpy( metax_df['total count'].to_numpy() ).long()
y_train.shape

torch.Size([18316])

In [143]:
# Cast to float32 so the features match the dtype of the model weights
x_test  = torch.from_numpy( metay_df.drop(columns=['total count']).to_numpy() ).float()
x_test.shape

torch.Size([4579, 10])

In [144]:
y_test = torch.from_numpy( metay_df['total count'].to_numpy() ).long()
y_test.shape

torch.Size([4579])

In [145]:
# Convert to Dataset
N = int( 0.8 * len(metax_df) )
M = len(metax_df) - N
trainDataset = data.TensorDataset(x_train, y_train)

trainDS, validDS = data.random_split(trainDataset, [N, M])
testDS = data.TensorDataset(x_test, y_test)

print(len(trainDS), len(validDS), len(testDS))

14652 3664 4579


In [146]:
# Create DataLoader
trainDL = data.DataLoader(trainDS, batch_size=1, shuffle=True)
validDL = data.DataLoader(validDS, batch_size=1, shuffle=True)
testDL = data.DataLoader(testDS, batch_size=1, shuffle=True)

x,y = next(iter(trainDL))
print(x, y)

tensor([[0.6332, 0.6113, 0.5978, 0.0000, 0.0189, 0.0189, 0.0000, 0.5000, 0.2767,
         0.8333]]) tensor([1])


### Create NN Class

As we can see from the previous cells, the number of features is 10 and the number of classes is 3.

In [154]:
# torch.set_default_dtype(torch.float64)

class NetModel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        
        self.fc1 = nn.Sequential(
            nn.Linear(dim[0], 128),
            nn.ReLU(),
            nn.Linear(128, 512),
            nn.ReLU(),
            nn.Dropout(0.25)
        )
        self.fc2 = nn.Sequential(
            nn.Linear(512, 128),
            nn.BatchNorm1d(128),
            nn.CELU(),
            nn.Dropout(0.25)
        )
        self.fc3 = nn.Sequential(
            nn.Linear(128,dim[1]),
            nn.Softmax(dim=1)
        )
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x
    
meta_mdl = NetModel( dim=(10,3) )
meta_mdl.eval()
z = meta_mdl(x)
_, ind = torch.max(z, 1)
print(ind, y)

tensor([0]) tensor([1])
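
One detail worth checking in this architecture: `nn.CrossEntropyLoss` applies log-softmax to its input internally, while `NetModel` already ends in `nn.Softmax`. The softmax is therefore applied twice during training. The sketch below, using hand-picked logits, shows the effect: with three classes the loss cannot drop below roughly 0.551 even for a perfectly confident, correct prediction, which may explain why the training loss in the log further down plateaus near 0.78.

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss()
target = torch.tensor([1])

# Hand-picked logits for a very confident, correct prediction of class 1.
logits = torch.tensor([[-20.0, 20.0, -20.0]])
probs = torch.softmax(logits, dim=1)      # what a final nn.Softmax would emit

loss_from_logits = criterion(logits, target).item()  # softmax applied once
loss_from_probs = criterion(probs, target).item()    # softmax applied twice

print(f"logits: {loss_from_logits:.4f}  probabilities: {loss_from_probs:.4f}")
```

A common remedy, offered here as a suggestion rather than as the author's method, is to drop the final `nn.Softmax` and return raw logits. `torch.max(outputs, 1)` still yields the same predicted class, because softmax does not change the ordering of the scores.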


In [151]:
def nn_ml(train_df: pd.DataFrame, test_df: pd.DataFrame, device='cpu') ->  ('model',
                                                           'test_accuracy'):
    '''
    Use neural networks to predict total counts

    :param pd.DataFrame train_df: training set dataframe
    :param pd.DataFrame test_df: test set dataframe

    :returns 'model': trained model
    :returns 'test_accuracy': accuracy on test set
    '''
    # Convert Pandas_DataFrame to Pytorch_Tensor
    # Features are cast to float32 to match the model weights; labels are long
    x_train = torch.from_numpy( train_df.drop(columns=['total count']).to_numpy() ).float()
    y_train = torch.from_numpy( train_df['total count'].to_numpy() ).long()

    x_test  = torch.from_numpy( test_df.drop(columns=['total count']).to_numpy() ).float()
    y_test = torch.from_numpy( test_df['total count'].to_numpy() ).long()

    # Convert to Dataset
    N = int( 0.8 * len(train_df) )
    M = len(train_df) - N
    trainDataset = data.TensorDataset(x_train, y_train)
    
    trainDS, validDS = data.random_split(trainDataset, [N, M])
    testDS = data.TensorDataset(x_test, y_test)

    # Create DataLoader
    trainDL = data.DataLoader(trainDS, batch_size=32, shuffle=True)
    validDL = data.DataLoader(validDS, batch_size=32, shuffle=True)
    testDL = data.DataLoader(testDS, batch_size=32, shuffle=True)
    
    
    # Create Model
    mdl = NetModel(dim=(10,3))
    mdl.to(device)
    criterion = nn.CrossEntropyLoss()                      # Multi-class Classification
    optimizer = optim.SGD(mdl.parameters(), lr=0.01)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)  # Multiply the LR by 0.9 every 5 epochs

    # ----------------- Start Training -----------------
    
    print('----------------- | Start Training | ----------------- ')

    n_epochs = 30
    for epoch in range(n_epochs):
        print(f'Epoch {epoch}/{n_epochs}', end='')

        trainLoss = 0.0
        train_acc = 0.0
        mdl.train()

        for batch_x, batch_y in trainDL:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            batchSize = batch_x.size(0)

            optimizer.zero_grad()
            
            outputs = mdl(batch_x)
            _, labels = torch.max(outputs, 1)
        
            loss = criterion(outputs, batch_y)

            loss.backward()
            # Clip after backward() so the freshly computed gradients are the ones clipped
            nn.utils.clip_grad_norm_(mdl.parameters(), .5)
            optimizer.step()

            trainLoss += loss.item() * batchSize
            train_acc += torch.sum(labels == batch_y).item()
        
        print(' @ LR: {:.7f} '.format(
            float( scheduler.get_last_lr()[0] )   # get_last_lr() is the supported way to read the current LR
            ) , end='')
        scheduler.step()

        trainLoss /= len(trainDL.sampler)
        train_acc = 100. * train_acc / len(trainDL.sampler)
        print('~ Train Error {:.3f}/{:2.1f} '.format(trainLoss,train_acc), end='')

        # ----------------- Validation ----------------- #

        validLoss = 0.0
        valid_acc = 0.0
        mdl.eval()

        with torch.no_grad():   # no gradients are needed for validation
            for batch_x, batch_y in validDL:
                batch_x, batch_y = batch_x.to(device), batch_y.to(device)
                batchSize = batch_x.size(0)

                outputs = mdl(batch_x)
                _, labels = torch.max(outputs, 1)

                loss = criterion(outputs, batch_y)

                validLoss += loss.item() * batchSize
                valid_acc += torch.sum(labels == batch_y).item()

        valid_acc = 100. * valid_acc / len(validDL.sampler)
        validLoss /= len(validDL.sampler)
        print('~ Valid Error {:.3f}/{:2.1f} '.format(validLoss,valid_acc))

    print('----------------- | End Training | ----------------- ')

    # -----------------   Testing  ----------------- #
    test_acc = 0.0
    mdl.eval()

    with torch.no_grad():   # no gradients are needed for testing
        for batch_x, batch_y in testDL:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            batchSize = batch_x.size(0)

            outputs = mdl(batch_x)
            _, labels = torch.max(outputs, 1)
            test_acc += torch.sum(labels == batch_y).item()

    test_acc = 100. * test_acc / len(testDL.sampler)
    print('Test Accuracy: {:2.1f}'.format(test_acc))

    
    torch.save(mdl.state_dict(), 'model.pt')


    return (mdl, test_acc)

In [152]:
mdl, test_acc = nn_ml(train_df, test_df)

print(test_acc)

----------------- | Start Training | ----------------- 
Epoch 0/30 @ LR: 0.0100000 ~ Train Error 0.904/64.2 ~ Valid Error 0.840/72.1 
Epoch 1/30 @ LR: 0.0100000 ~ Train Error 0.841/71.3 ~ Valid Error 0.806/75.1 
Epoch 2/30 @ LR: 0.0100000 ~ Train Error 0.815/73.9 ~ Valid Error 0.794/75.9 
Epoch 3/30 @ LR: 0.0100000 ~ Train Error 0.801/74.8 ~ Valid Error 0.788/75.7 
Epoch 4/30 @ LR: 0.0100000 ~ Train Error 0.796/75.2 ~ Valid Error 0.780/77.7 
Epoch 5/30 @ LR: 0.0090000 ~ Train Error 0.793/75.5 ~ Valid Error 0.779/77.5 
Epoch 6/30 @ LR: 0.0090000 ~ Train Error 0.790/75.9 ~ Valid Error 0.774/77.6 
Epoch 7/30 @ LR: 0.0090000 ~ Train Error 0.785/76.3 ~ Valid Error 0.773/78.3 
Epoch 8/30 @ LR: 0.0090000 ~ Train Error 0.784/76.2 ~ Valid Error 0.771/78.1 
Epoch 9/30 @ LR: 0.0090000 ~ Train Error 0.782/76.7 ~ Valid Error 0.767/78.6 
Epoch 10/30 @ LR: 0.0081000 ~ Train Error 0.780/76.8 ~ Valid Error 0.768/78.2 
Epoch 11/30 @ LR: 0.0081000 ~ Train Error 0.777/77.0 ~ Valid Error 0.766/78.1 
Epoch 
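
The learning rates printed in the log follow the `StepLR` rule: the initial rate is multiplied by `gamma` once every `step_size` epochs. The helper below is a plain-Python sketch of that arithmetic (`step_lr` is a hypothetical name, not a PyTorch function), using the values from `nn_ml`.

```python
def step_lr(initial_lr: float, epoch: int, step_size: int = 5, gamma: float = 0.9) -> float:
    """Learning rate in effect during `epoch` under a step decay schedule."""
    return initial_lr * gamma ** (epoch // step_size)

# Epochs on either side of each decay boundary.
for epoch in (0, 4, 5, 9, 10):
    print(f"epoch {epoch:2d}: lr = {step_lr(0.01, epoch):.7f}")
```

This reproduces the 0.0100000, 0.0090000, and 0.0081000 values seen at epochs 0, 5, and 10 of the log.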

In [155]:
meta_mdl2 = mdl

z = meta_mdl2(x)
_, ind = torch.max(z, 1)
print(ind, y)

tensor([1]) tensor([1])
