# Feature Engineering and Modeling

Hello and welcome to the second notebook. In this one, we'll train a baseline regression model to predict our target variable (as denoted in our previously generated dataset). If you missed the data generation and exporting step, head on over to <code>data-generation.ipynb</code>.

#### Structure

This notebook is structured in three parts:

1. Import and preprocess the data.
2. Set up and train a baseline regression model.
3. Evaluate the trained model.

In [286]:
import os
import numpy as np
import pandas as pd
import pickle
import plotly.express as px
import time
import string
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [151]:
# Data imports
DATA_DIR = "../data"
FILENAME = "testData-42-raw.csv"
path = os.path.join(DATA_DIR, FILENAME)
raw_df = pd.read_csv(path, index_col=0)

In [152]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 4 columns):
target      10000 non-null float64
feature1    10000 non-null float64
feature2    10000 non-null float64
feature3    10000 non-null object
dtypes: float64(3), object(1)
memory usage: 390.6+ KB


In [153]:
len(raw_df.feature3.unique())

26

In [211]:
# rename features with descriptive names and separate regressors from target
column_names = ["numerical_feature1", "numerical_feature2", "categorical_feature"]
raw_features_df = raw_df.drop("target", axis="columns")
raw_features_df.columns = column_names
raw_features_df.categorical_feature = raw_features_df.categorical_feature.astype("category")

In [212]:
raw_features_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 3 columns):
numerical_feature1     10000 non-null float64
numerical_feature2     10000 non-null float64
categorical_feature    10000 non-null category
dtypes: category(1), float64(2)
memory usage: 245.6 KB


### Feature Engineering
Most statistical/machine learning tasks require some data preprocessing. In this case, we can keep it to a minimum because the dataset is simple. Looking at the dataset, we can see one categorical variable (lowercase letters of the English alphabet). As matrix operations can't be done on strings, we'll have to encode it; here we use one-hot encoding, which gives each letter its own indicator column.

**Note:** because all 26 English letters appear in the data, the encoded matrix has 26 columns with a single 1 per row, so it is VERY sparse.
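
To make the encoding concrete, here is a minimal standard-library sketch of what one-hot encoding with ignored unknowns does conceptually. The helper name `one_hot` and the lookup table are hypothetical and exist only for illustration; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
import string

# Assumption: the categories are the 26 lowercase English letters, as in this dataset.
CATEGORIES = list(string.ascii_lowercase)
INDEX = {letter: i for i, letter in enumerate(CATEGORIES)}


def one_hot(value):
    """Return a 26-element indicator row; unknown values map to all zeros."""
    row = [0.0] * len(CATEGORIES)
    position = INDEX.get(value)
    if position is not None:
        row[position] = 1.0
    return row


known = one_hot("c")    # a letter seen during "fitting"
unknown = one_hot("?")  # a value outside the known categories

print(known.index(1.0), sum(known))  # column of the single 1, and the row sum
print(sum(unknown))                  # an unknown value produces an all-zero row
```

An all-zero row for unseen values mirrors the behaviour requested with `handle_unknown="ignore"`, which is what lets the encoder cope with unexpected input at prediction time.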

In [272]:
numerical_features = raw_features_df[["numerical_feature1", "numerical_feature2"]].to_numpy()
categorical_features = raw_features_df[["categorical_feature"]].to_numpy()

In [273]:
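# define one-hot encoder (handles unknowns, i.e. any letter not found in our dataset, by encoding them as all zeros)
# note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`
onehot_encoder = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")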
onehot = onehot_encoder.fit_transform(categorical_features.reshape(-1,1))
onehot.shape

(10000, 26)

In [278]:
# combine numerical and one-hot encoded features into a single array
clean_features = np.concatenate((numerical_features, onehot), axis=1)
clean_features.shape

(10000, 28)

### Now for our model setup...
In this case, we will just start with a baseline model. This is purely for the sake of getting up and running with the API endpoint and with model iteration (the latter not particularly fruitful on random data).

In [234]:
# Set up model with params
baseline_model = Ridge(alpha=1.0, solver="sag", random_state=42)

### Train/Test Split

Finally, we need to split our dataset so that we can measure performance on data the model has not seen, which is how overfitting to the training set gets detected. Ideally we would have 3 splits (train, dev, test), but in this case (again, random data) we'll just go with train/test. One could also opt for a more refined model selection strategy (e.g. k-fold cross-validation), and likely would in a realistic scenario.
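
As a rough illustration of the k-fold idea mentioned above, the sketch below generates train/test index pairs using only the standard library. The function `k_fold_indices` is a hypothetical helper written for this example; in practice one would more likely reach for `KFold` from `sklearn.model_selection`.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs; every sample is used as test data exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by n_splits.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each fold would be used once for evaluation while the model is refit on the remaining folds, and the scores averaged; this gives a less noisy estimate than a single train/test split.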

In [279]:
# pull out X (engineered feature array) and y (target column)
X, y = clean_features, raw_df.target

In [280]:
# Train/test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)

### Baseline Model, Training and Evaluation
For the choice of baseline, we will use Ridge regression with the Stochastic Average Gradient (`sag`) solver.
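
For reference, the objective that scikit-learn's `Ridge` minimizes can be written as follows, where $X$ is the feature matrix, $y$ the target, $w$ the coefficient vector, and $\alpha$ the regularization strength (set to `1.0` in this notebook). The symbols are notation chosen for this note; the intercept, fitted by default, is not penalized.

```latex
\hat{w} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
```

Larger values of $\alpha$ shrink the coefficients more strongly toward zero; the solver only affects how this minimum is found, not what is being minimized.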

In [281]:
baseline_model.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=42, solver='sag', tol=0.001)

In [282]:
# Evaluate on test set and compute the R^2 score
y_hat = baseline_model.predict(X_test)
baseline_r2_score = r2_score(y_test, y_hat)

In [287]:
# Visualise predicted vs. actual while displaying test set r^2
fig = px.scatter(x=y_test, y=y_hat, title=f"Baseline Model - Testset Predicted vs. Actual, R^2={baseline_r2_score}")
fig.update_layout(
    xaxis_title="Actual",
    yaxis_title="Predicted",
).show()

## What a mess...
It seems that our baseline model is actually worse than if we were to guess using a simple average. In a realistic situation, this would most likely indicate serious issues with preprocessing or model setup. In our case, it might be considered normal given our completely arbitrary and random data.

At least we have a model and a data transform that we want to put in production. Let's export both and get to it!
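
To see why "worse than guessing the average" shows up as a low score, here is a small hand-rolled sketch of the R^2 computation on made-up numbers. The function `r2` and the toy values are illustrative assumptions, not data from this notebook; the notebook itself uses `r2_score` from scikit-learn.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values chosen purely for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
mean_guess = [2.5] * 4             # always predict the average of y_true
bad_guess = [4.0, 3.0, 2.0, 1.0]   # predictions that are anti-correlated with the truth

r2_mean = r2(y_true, mean_guess)
r2_bad = r2(y_true, bad_guess)
print(r2_mean, r2_bad)
```

Always predicting the mean scores exactly zero, so any model that does worse than that ends up with a negative R^2.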

### Model and Encoder Export via Serialization

In [288]:
# timestamp for use in file name, e.g. "Jun-20-23-21-05-1993"
timestamp = time.strftime("%b-%d-%H-%M-%S-%Y", time.localtime())

# Set up transformer filename and dirs
TRANSFORMER_DIR = "../transformers"
transformer_name = "onehot"
transformer_file_name = transformer_name + "-" + timestamp + ".bin"
tf_file_path = os.path.join(TRANSFORMER_DIR, transformer_file_name)

# Pickle the fitted encoder (not the encoded array) and save to disk
with open(tf_file_path, "wb") as f:
    pickle.dump(onehot_encoder, f)


# Set up model filename and dirs
MODEL_DIR = "../models"
model_name = "baseline"
model_file_name = model_name + "-" + timestamp + ".bin"
model_file_path = os.path.join(MODEL_DIR, model_file_name)

# Pickle the trained model and save to disk
with open(model_file_path, "wb") as f:
    pickle.dump(baseline_model, f)
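
For completeness, the sketch below shows the pickle round trip (save, then load) that a consumer of these files would perform. The `artifact` dictionary and the file name are stand-ins invented for this example so that it runs without a trained model; a fitted encoder or estimator is loaded the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; any picklable Python object behaves the same way.
artifact = {"name": "baseline", "alpha": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only inside this temporary directory.
    file_path = os.path.join(tmp_dir, "baseline-example.bin")

    # Serialize to disk ...
    with open(file_path, "wb") as f:
        pickle.dump(artifact, f)

    # ... and read it back, as the serving code would at start-up.
    with open(file_path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Two practical caveats: only unpickle files from a trusted source, since loading a pickle can execute arbitrary code, and load scikit-learn objects with the same library version that was used to save them.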

# The end (or is it?)
You have reached the end of this notebook. We now have at least one trained model we can deploy inside an API for real-time access by an authorised public. There are no more notebooks for the task. The remainder will be done in a more traditional Python web dev environment. Please head on over to <code>../src</code> and follow the documentation if you want to deploy the solution locally yourself.