# Feature Engineering and Modeling

Hello and welcome to the second notebook. In this one, we'll train a baseline regression model to predict our target variable (as denoted in our previously generated dataset). If you missed the data generation and exporting step, head on over to <code>data-generation.ipynb</code>.

#### Structure

This notebook is structured in 3 parts:

1. Import and preprocess our data.
2. Set up and train a baseline regression model.
3. Evaluate the trained model.

Steps 1 and 2 will be fused into a single pipeline, but the overall process stays the same.

In [148]:
import os
import numpy as np
import pandas as pd
import joblib
import plotly.express as px
import time
import string
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [149]:
# Data imports
DATA_DIR = "../data"
FILENAME = "testData-42-raw.csv"
path = os.path.join(DATA_DIR, FILENAME)
raw_df = pd.read_csv(path, index_col=0)

In [150]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 4 columns):
target      10000 non-null float64
feature1    10000 non-null float64
feature2    10000 non-null float64
feature3    10000 non-null object
dtypes: float64(3), object(1)
memory usage: 390.6+ KB


In [151]:
len(raw_df.feature3.unique())

26

In [152]:
# clean up features with descriptive names and separate the regressors from the target
column_names = ["numerical_feature1", "numerical_feature2", "categorical_feature"]
raw_features_df = raw_df.drop("target", axis="columns")
raw_features_df.columns = column_names
raw_features_df.categorical_feature = raw_features_df.categorical_feature.astype("category")

In [153]:
raw_features_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 3 columns):
numerical_feature1     10000 non-null float64
numerical_feature2     10000 non-null float64
categorical_feature    10000 non-null category
dtypes: category(1), float64(2)
memory usage: 245.6 KB


### Feature Engineering
Most statistical/machine learning tasks require some data preprocessing. In this case, we are able to keep it to a minimum given the relative simplicity of the dataset. Just by looking at our dataset, we can see that we have a categorical variable (lowercase letters of the English alphabet). As matrix operations can't be done with strings, we'll have to encode it.

**Note:** because our letters were sampled uniformly, all 26 of them appear in the data. One-hot encoding therefore yields 26 columns with a single non-zero entry per row, i.e. a VERY sparse matrix.
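
To make the encoding step concrete, here is a minimal sketch on a toy column of letters (the toy data and variable names are assumptions for illustration, not part of the dataset). It keeps the encoder's default sparse output and converts it with `.toarray()`, and it also shows what `handle_unknown="ignore"` does with a letter that was never seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (assumption: three distinct letters instead of 26)
toy_letters = np.array(["a", "c", "b", "a"]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen categories as an all-zero row
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_onehot = toy_encoder.fit_transform(toy_letters).toarray()

# A letter that was not present when the encoder was fitted
unseen_onehot = toy_encoder.transform(np.array([["z"]])).toarray()

# Fraction of non-zero entries: exactly one per row, so 1 / n_categories
density = toy_onehot.sum() / toy_onehot.size

print(toy_onehot.shape)
print(unseen_onehot)
print(density)
```

With 26 categories the same reasoning gives a density of 1/26, which is why the encoded matrix is mostly zeros.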

In [154]:
numerical_features = raw_features_df[["numerical_feature1", "numerical_feature2"]].to_numpy()
categorical_features = raw_features_df[["categorical_feature"]].to_numpy()

In [155]:
# define one-hot encoder (handles unknowns, i.e. any letter not found in our dataset)
# note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`
onehot_encoder = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")
onehot = onehot_encoder.fit_transform(categorical_features.reshape(-1,1))
onehot.shape

(10000, 26)

In [156]:
# combine the numerical and one-hot encoded features into a single array
clean_features = np.concatenate((numerical_features, onehot), axis=1)
clean_features.shape

(10000, 28)

### Now for our model set up...
In this case, we will just start with a baseline model. This is purely for the sake of getting up and running with the API endpoint + model iteration (the latter not particularly fruitful on random data).

In [173]:
# Set up model with params
baseline_model = linear_model.Ridge(alpha=1.0, solver="sag", random_state=1337)

# There must be a better way?

Sklearn does have inbuilt support for tying both a model and its required input data transformations into a pipeline object. We will use this approach as it is arguably the neatest.

In [158]:
# The raw feature DataFrame is what the ColumnTransformer-based pipeline will consume
raw_X = raw_features_df
raw_X

Unnamed: 0,numerical_feature1,numerical_feature2,categorical_feature
0,0.950714,0.731994,m
1,0.156019,0.155995,f
2,0.866176,0.601115,a
3,0.020584,0.969910,f
4,0.212339,0.181825,m
5,0.304242,0.524756,m
6,0.291229,0.611853,m
7,0.292145,0.366362,t
8,0.785176,0.199674,d
9,0.592415,0.046450,e


### Data Transforms Pipeline
Note: Sklearn does not like a single categorical column combined with "skipped" (untransformed) numerical columns. The solution below is a little overkill, as we don't actually need to transform our numerical features, nor do we need to impute missing values (there are none).

Routing the numerical columns through an imputer is done merely because it makes it easier to get a well-behaved pipeline.
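
As a possible alternative to the imputer workaround, `ColumnTransformer` documents a `remainder="passthrough"` option that forwards untransformed columns as they are. The sketch below is an assumption about how that could look on a toy frame with the same column layout; it is not what this notebook uses. `sparse_threshold=0` is set so the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring the notebook's column layout (values are made up)
toy_df = pd.DataFrame({
    "numerical_feature1": [0.1, 0.2, 0.3],
    "numerical_feature2": [1.0, 2.0, 3.0],
    "categorical_feature": ["a", "b", "a"],
})

# Only the categorical column is transformed; the rest is passed through.
# sparse_threshold=0 forces a dense output array.
passthrough_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["categorical_feature"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,
)

toy_features = passthrough_preprocessor.fit_transform(toy_df)
print(toy_features.shape)
```

One thing to keep in mind with this variant: the passed-through columns are appended after the transformed ones, so the numerical features end up last rather than first.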

In [215]:
# debugger implementation for inspecting midway through pipeline
class Debug(BaseEstimator, TransformerMixin):

    def transform(self, X):
        # print("DEBUG: outputting shape after step: ", X.shape)
        return X

    def fit(self, X, y=None, **fit_params):
        return self


# setup numerical feature transformer pipeline
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])

# setup categorical feature transformer pipeline
cat_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown="ignore", sparse=False))
    #,('debug', Debug())
])

# Bind together the preprocessors by column position (fitting happens later, on the training data)
preprocessor = ColumnTransformer([
        ('num', num_transformer, [0, 1]),
        ('cat', cat_transformer, [2])
])
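
To illustrate how a pass-through step like `Debug` can be used, here is a small self-contained sketch. `ShapeRecorder` is a hypothetical variant that stores the shape it last saw instead of printing it, and the scaler, toy data, and step names are assumptions made only for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ShapeRecorder(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that remembers the last shape it saw."""

    def fit(self, X, y=None):
        # Nothing to learn; the step only observes the data
        return self

    def transform(self, X):
        self.last_shape_ = X.shape
        return X


rng = np.random.default_rng(0)
toy_X = rng.random((50, 3))
toy_y = rng.random(50)

toy_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("debug", ShapeRecorder()),
    ("estimator", Ridge(alpha=1.0)),
])
toy_pipeline.fit(toy_X, toy_y)

# Shape of the data as it left the "scale" step
recorded_shape = toy_pipeline.named_steps["debug"].last_shape_
print(recorded_shape)
```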

### Model Setup
For our baseline, we use Ridge regression with the Stochastic Average Gradient (SAG) solver. This felt like a realistic starting point in a regression scenario with plenty of data. I'd happily discuss these choices further.
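
For reference, the objective that scikit-learn's `Ridge` minimises can be written as follows, where $X$ is the preprocessed feature matrix, $y$ the target, $w$ the coefficient vector and $\alpha$ the regularisation strength (`alpha=1.0` in this notebook):

$$
\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

The `"sag"` solver minimises this objective iteratively. The scikit-learn documentation notes that its fast convergence is only guaranteed when features have approximately the same scale. That is probably not a concern here, assuming the numerical features lie roughly in the same range as the 0/1 one-hot columns, as the sample rows suggest.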

In [216]:
# Set up model with params
baseline_ridge_model = linear_model.Ridge(alpha=1.0, solver="sag", random_state=42)

### Fused Preprocessing and Model Pipeline

In [217]:
# Append regression model to preprocessing pipeline. 
ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('estimator', baseline_ridge_model)
])

In [218]:
# pull out x and y from df
X, y = raw_X, raw_df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=1337)

### Fit transform training data

In [219]:
features = ridge_pipeline["preprocessor"].fit_transform(X_train)
features.shape

(7000, 28)

In [221]:
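# just an example of what preprocessing a single row looks like
# (a DataFrame keeps the numerical columns numeric; a mixed NumPy array would cast everything to strings)
ridge_pipeline["preprocessor"].transform(pd.DataFrame([[2.1, 2.0, "f"]], columns=column_names))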

array([[2.1, 2. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
        0. , 0. ]])

In [222]:
# let's export the preprocessor just in case
with open("preprocessor.joblib","wb") as f:
    joblib.dump(ridge_pipeline["preprocessor"], f)

### Fit our model to the training data

In [223]:
model = ridge_pipeline["estimator"].fit(features, y_train)
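
The two-step fit used here (fit the preprocessor, then fit the estimator on its output) should give the same result as fitting the whole pipeline in one call. The sketch below checks this on toy data; the data, the variable names, and the use of Ridge's default solver (chosen to keep the comparison deterministic) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "numerical_feature1": rng.random(200),
    "numerical_feature2": rng.random(200),
    "categorical_feature": rng.choice(list("abc"), size=200),
})
toy_y = rng.random(200)

# Same layout as the notebook's preprocessor: columns 0-1 numeric, column 2 categorical
toy_preprocessor = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), [0, 1]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Variant 1: fit the preprocessor, then fit the estimator on its output
two_step_preprocessor = clone(toy_preprocessor)
two_step_features = two_step_preprocessor.fit_transform(toy_X)
two_step_model = Ridge(alpha=1.0).fit(two_step_features, toy_y)

# Variant 2: fit preprocessing and estimator with a single call
one_call_pipeline = Pipeline(steps=[
    ("preprocessor", clone(toy_preprocessor)),
    ("estimator", Ridge(alpha=1.0)),
])
one_call_pipeline.fit(toy_X, toy_y)

predictions_agree = np.allclose(
    one_call_pipeline.predict(toy_X),
    two_step_model.predict(two_step_preprocessor.transform(toy_X)),
)
print(predictions_agree)
```

Fitting in one call also means a single object could be exported, rather than a preprocessor and a model that have to be kept in sync.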

In [224]:
len(model.coef_) # doesn't include intercept

28

### Baseline Model Evaluation
Now that we have trained our model, we need to evaluate the quality of fit on the test data. This tells us how well the model generalises to unseen datapoints.

In [225]:
# Evaluate on test set, save performance metrics, and visualise actual vs. predicted on testset
y_hat = ridge_pipeline.predict(X_test)
baseline_r2_score = r2_score(y_test, y_hat)

In [226]:
y_hat

array([0.49699204, 0.50534232, 0.49946418, ..., 0.47932567, 0.49422431,
       0.51006225])

In [227]:
# Visualise predicted vs. actual while displaying test set r^2
fig = px.scatter(x=y_test, y=y_hat, title=f"Baseline Model - Testset Predicted vs. Actual, R^2={baseline_r2_score:.4f}")
fig.update_layout(
    xaxis_title="Actual",
    yaxis_title="Predicted",
).show()

## What a mess....
It seems that our baseline model is actually worse than if we were to guess using a simple average. In a realistic situation, this would most likely indicate serious issues with preprocessing or model set up. In our case, it might be considered normal given our completely arbitrary and random data.
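
To see why "worse than a simple average" corresponds to a negative score, recall that $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{tot}$ is the squared error of always predicting the mean of the targets. The sketch below uses made-up random targets (an assumption, not the notebook's data) to show that the mean predictor scores about 0 while predictions unrelated to the target score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
toy_y_true = rng.random(1000)

# Always guessing the average of the targets
mean_guess = np.full_like(toy_y_true, toy_y_true.mean())

# Predictions that carry no information about the target
unrelated_guess = rng.random(1000)

r2_mean = r2_score(toy_y_true, mean_guess)
r2_unrelated = r2_score(toy_y_true, unrelated_guess)

print(r2_mean)
print(r2_unrelated)
```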

At least we have a model and a data transform that we want to put in production. Let's export both and get to it!

### Model Export

In [229]:
# Pickle and save to disk
with open("model.joblib","wb") as f:
    joblib.dump(model, f)
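
On the serving side, the exported artifacts would be read back with `joblib.load`. Here is a minimal round-trip sketch on a toy model, written to a temporary directory (the toy data and the file name `toy-model.joblib` are assumptions for illustration).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
toy_X = rng.random((100, 3))
toy_y = rng.random(100)
toy_model = Ridge(alpha=1.0).fit(toy_X, toy_y)

# Dump to a temporary location and load it back, as a service would at start-up
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "toy-model.joblib")
    joblib.dump(toy_model, artifact_path)
    restored_model = joblib.load(artifact_path)

# The restored model should reproduce the original predictions
predictions_match = np.allclose(toy_model.predict(toy_X), restored_model.predict(toy_X))
print(predictions_match)
```

Because the model here is exported separately from the preprocessor, raw inputs would need to go through the loaded preprocessor first so that the model receives the 28 encoded columns it was trained on. Loading is also generally safest with the same scikit-learn version that was used for training.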

# The end (or is it?)
You have reached the end of this notebook. We now have at least one trained model we can deploy inside an API for real-time access by an authorised public. There are no more notebooks for the task. The remainder will be done in a more traditional Python web dev environment. Please head on over to <code>../src</code> and follow the documentation if you want to deploy the solution locally yourself.