# Feature Engineering and Modeling

Hello and welcome to the second notebook. In this one, we'll train a baseline regression model to predict the `target` column of the dataset we generated previously. If you missed the data generation and export step, head on over to <code>data-generation.ipynb</code>.

#### Structure

This notebook is structured in three parts:

1. Import and preprocess the data.
2. Set up and train a baseline regression model.
3. Evaluate the trained model.

In [169]:
%matplotlib inline
import os
import numpy as np
import pandas as pd
import pickle
import plotly.express as px
import time
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [153]:
# Data imports
DATA_DIR = "../data"
FILENAME = "testData-42-raw.csv"
path = os.path.join(DATA_DIR, FILENAME)
raw_df = pd.read_csv(path, index_col=0)

In [139]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 4 columns):
target      10000 non-null float64
feature1    10000 non-null float64
feature2    10000 non-null float64
feature3    10000 non-null object
dtypes: float64(3), object(1)
memory usage: 390.6+ KB


In [140]:
len(raw_df.feature3.unique())

26

### Feature Engineering
Most statistical/machine learning tasks require some data preprocessing. In this case, we can keep it to a minimum because the dataset is simple. Just by looking at it, we can see that we have one categorical variable, `feature3` (lowercase letters of the English alphabet, 26 unique values). As matrix operations can't be done with strings, we'll have to encode it, and we do so with one-hot encoding.

**Note:** because our letters are sampled uniformly from all 26 categories, one-hot encoding produces 26 columns with a single non-zero entry per row, so we will end up with a VERY sparse matrix.

In [141]:
# Reshape and store categorical
categorical = raw_df.feature3.to_numpy().reshape(-1,1)

# Encoding categorical feature and remembering categories
encoded = OneHotEncoder(handle_unknown='ignore').fit(categorical)
categories = encoded.categories_

# Transform the column and convert the sparse result into a dense matrix
encoded_series = encoded.transform(categorical).todense()

In [142]:
encoded_series

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 0.]])
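The sparsity mentioned in the note above can be quantified. The following is a minimal sketch on stand-in data, not the notebook's actual `feature3` column: the names `letters`, `sample` and `encoded_sparse` are hypothetical. It keeps the encoder output in its default sparse form and counts the stored non-zero entries.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for feature3: 1,000 lowercase letters sampled uniformly
rng = np.random.default_rng(42)
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
sample = rng.choice(letters, size=1000).reshape(-1, 1)

# Fit on the full alphabet so all 26 columns exist regardless of the sample
encoder = OneHotEncoder(handle_unknown="ignore").fit(letters.reshape(-1, 1))
encoded_sparse = encoder.transform(sample)  # SciPy sparse matrix by default

# Each row stores exactly one non-zero value out of 26 columns
n_rows, n_cols = encoded_sparse.shape
density = encoded_sparse.nnz / (n_rows * n_cols)
print(f"shape={encoded_sparse.shape}, non-zero fraction={density:.4f}")
```

Only about 1 entry in 26 is non-zero, which is why the encoder returns a sparse matrix unless asked to densify it.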

### What about the numerical features?

We chose not to engineer the numerical features further. In a realistic scenario, this would likely be necessary, but in this case the features are uniformly random with a low range, so we wouldn't see much benefit.

As such, we can proceed to constructing our final processed data:

In [143]:
numerical_cols_matrix = raw_df[["feature1", "feature2"]].to_numpy()
X = np.concatenate((encoded_series, numerical_cols_matrix), axis=1)
y = raw_df.target

In [144]:
X.shape, y.shape

((10000, 28), (10000,))
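For completeness, here is a rough sketch of what scaling the numerical features could look like in a more realistic scenario. It is not part of this notebook's pipeline: the data is a hypothetical stand-in for `feature1` and `feature2`, and the 700/300 row split is an assumption made only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical block standing in for feature1 and feature2
rng = np.random.default_rng(42)
numerical = rng.uniform(low=0.0, high=10.0, size=(1000, 2))

# Fit the scaler on the training rows only, so that test-set statistics
# do not leak into the preprocessing step
train_rows, test_rows = numerical[:700], numerical[700:]
scaler = StandardScaler().fit(train_rows)
train_scaled = scaler.transform(train_rows)
test_scaled = scaler.transform(test_rows)

# Training columns now have (approximately) zero mean and unit variance
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The test rows are transformed with the training mean and variance, so their mean and standard deviation will be close to, but not exactly, 0 and 1.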

### Train/Test Split

Finally, we need to split our dataset so that we can evaluate the model on data it has not seen and detect overfitting on the training set. Ideally we would have 3 splits (train, dev, test), but in this case (again, random data) we'll just go with train/test. One could also opt for a more refined model selection strategy (e.g. k-fold cross-validation), and likely would in a realistic scenario.

In [145]:
# Train/test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)
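As a point of comparison, the k-fold strategy mentioned above might look roughly like the sketch below. It runs on hypothetical random data (`X_demo`, `y_demo`) rather than the notebook's `X` and `y`, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data: random features and a target unrelated to them
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(500, 5))
y_demo = rng.uniform(size=500)

# 5-fold cross-validation: each fold serves as the held-out set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=cv, scoring="r2")

# One R^2 value per fold; the spread hints at how stable the estimate is
print(scores.mean(), scores.std())
```

Compared with a single train/test split, this gives several estimates of out-of-sample performance at the cost of fitting the model once per fold.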

### Baseline Model Training and Evaluation
For our baseline, we will use Ridge regression with the Stochastic Average Gradient (SAG) solver.

In [146]:
# Initiate and train our baseline
baseline_model = Ridge(alpha=1.0, solver="sag", random_state=42).fit(X_train, y_train)
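For reference, scikit-learn documents the Ridge objective as a least-squares loss plus an L2 penalty weighted by `alpha`. The symbol $w$ (the coefficient vector) is notation introduced here, not something defined in this notebook, and the intercept is left out for brevity:

```latex
$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$
```

With `alpha=1.0` the penalty term is taken at full weight. The `sag` solver minimises this objective iteratively, and the scikit-learn documentation notes that its fast convergence is only guaranteed when the features are on approximately the same scale.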

In [158]:
# Predict on the test set and compute the R^2 score
y_hat = baseline_model.predict(X_test)
baseline_r2_score = r2_score(y_test, y_hat)

In [174]:
# Visualise predicted vs. actual while displaying test set r^2
fig = px.scatter(x=y_test, y=y_hat, title=f"Baseline Model - Testset Predicted vs. Actual, R^2={baseline_r2_score}")
fig.update_layout(
    xaxis_title="Actual",
    yaxis_title="Predicted",
)
fig.show()

## What a mess...
It seems that our baseline model is actually worse than if we were to guess using a simple average of the target, i.e. the test-set R^2 is below zero. In a realistic situation, this would most likely indicate serious issues with preprocessing or model setup. In our case, it might be considered normal given our completely arbitrary and random data.
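To make the "worse than a simple average" comparison concrete, here is a small worked sketch of how R^2 behaves. The helper `r2` and the toy arrays are hypothetical and exist only for illustration; the notebook itself uses `sklearn.metrics.r2_score`.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the mean guess
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_guess = np.full_like(y_true, y_true.mean())  # always predict the average
bad_guess = np.array([4.0, 3.0, 2.0, 1.0])        # predictions in reverse order

print(r2(y_true, mean_guess))  # 0.0
print(r2(y_true, bad_guess))   # -3.0
```

Predicting the average of the evaluated targets scores exactly 0, so any model with a negative score has a larger squared error than that constant guess.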

### Extended Model: Training and Comparative Evaluation
Let's see if we can actually do better using a different type of model. Considering what we've seen so far, that is unlikely. Let's go with another favourite in the ML community: random forest.
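The notebook does not include a cell for this model, so the following is only a sketch of how such a comparison could be set up, under several assumptions: the data is a hypothetical random stand-in for `X` and `y`, the names `forest_model` and `forest_r2_score` are made up to mirror the baseline cells, and `n_estimators=100` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed X and y (random, unrelated target)
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(1000, 5))
y_demo = rng.uniform(size=1000)

# Same split proportions and seed as the baseline cells
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.70, random_state=42
)

# Train the forest and score it on the held-out rows
forest_model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
forest_predictions = forest_model.predict(X_te)
forest_r2_score = r2_score(y_te, forest_predictions)
print(forest_r2_score)
```

Scoring both models on the same held-out rows with the same metric is what makes the comparison with the baseline meaningful.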

### Model Export via Serialization

In [179]:
# Timestamp for use in the file name (month-day-hour-minute-second-year).
# strftime avoids splitting asctime() output, which drops the day for two-digit days.
timestamp = time.strftime("%b-%d-%H-%M-%S-%Y", time.localtime())

# Setup filename and dirs
MODEL_DIR = "../models"
model_name = "baseline"
file_name = model_name + "-" + timestamp + ".bin"
file_path = os.path.join(MODEL_DIR, file_name)

# Guard against accidental overwrites (should be fine with the timestamp, but just in case)
assert not os.path.isfile(file_path), "A file with that name is already stored."

# Pickle and save to disk
with open(file_path,"wb") as f:
    pickle.dump(baseline_model, f)
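Loading the model back is the mirror image of the export step. Below is a minimal round-trip sketch, assuming a small Ridge model trained on hypothetical random data and a temporary directory instead of `../models`; the file name `baseline-demo.bin` is made up for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical stand-in model, trained on random data
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(100, 3))
y_demo = rng.uniform(size=100)
model = Ridge(alpha=1.0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "baseline-demo.bin")
    # Serialize to disk, then read the same bytes back
    with open(demo_path, "wb") as f:
        pickle.dump(model, f)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Keep in mind that unpickling can execute arbitrary code, so only load model files from sources you trust, and ideally with the same scikit-learn version that wrote them.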

# The end (or is it?)
You have reached the end of this notebook. We now have at least one trained model we can deploy inside an API for real-time access by an authorised public. There are no more notebooks for this task. The remainder will be done in a more traditional Python web dev environment. Please head on over to <code>../src</code> and follow the documentation if you want to deploy the solution locally yourself.