### Creating and Persisting an ML Model

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/student-mat.csv', sep=';')

The goal is to predict whether a student is a quality student. We build a binary target from the final grade (G3): in this model, a quality student is one who achieves a final grade of 15 or higher.
Since G1 and G2 will not be available when reviewing students, we should not use them as predictors. G3 itself is folded into the target, so keeping it would leak the answer. We therefore drop all three grade columns.

In [2]:
df['qual_student'] = np.where(df['G3']>=15, 1, 0)
df.drop(columns=['G1', 'G2', 'G3'], inplace=True)
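
To make the thresholding rule concrete, here is a minimal sketch on a toy frame. The grade values are made up for illustration and are not taken from `student-mat.csv`; only the `>= 15` rule comes from the cell above. Checking the class share is a quick way to see how imbalanced the target is before training.

```python
import numpy as np
import pandas as pd

# Toy grades standing in for the real G3 column (hypothetical values).
toy = pd.DataFrame({"G3": [8, 15, 19, 14, 0, 16]})

# Same rule as the notebook: 1 for a final grade of 15 or higher, else 0.
toy["qual_student"] = np.where(toy["G3"] >= 15, 1, 0)

# Share of each class; a small share of 1s would signal class imbalance.
class_share = toy["qual_student"].value_counts(normalize=True).sort_index()
print(class_share)
```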

Many attributes are given as nominal strings. To use them with sklearn's random forest classifier, we encode them as `int`. This step is also applied in the microservice.

In [3]:
from sklearn.preprocessing import LabelEncoder
for col in df:
    if df[col].dtype == object:
        le = LabelEncoder()
        le.fit(df[col])
        df[col] = le.transform(df[col])
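
Because the same encoding has to be applied again at prediction time, it may help to keep the fitted encoders around instead of discarding them. The sketch below is one possible way to do that, not the notebook's method: the `encoders` dictionary and the toy values are assumptions for illustration. A fresh `LabelEncoder` fitted on a single query row would map every string to 0, whereas a stored encoder reproduces the training mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training frame with two nominal columns (hypothetical values).
train = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "sex": ["F", "M", "M"],
    "age": [15, 16, 17],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in train.columns:
    if not pd.api.types.is_numeric_dtype(train[col]):
        encoders[col] = LabelEncoder()
        train[col] = encoders[col].fit_transform(train[col])

# A later query is encoded with the stored mapping, not a fresh fit.
query = pd.DataFrame({"school": ["MS"], "sex": ["F"], "age": [15]})
for col, le in encoders.items():
    query[col] = le.transform(query[col])

print(query)
```

The `encoders` dictionary could be saved next to the model with `joblib.dump` so the service loads both together.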

We use grid search with stratified 10-fold cross-validation, scored by F1, to find the best hyperparameters and train the predictor.

In [4]:
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import f1_score

dependent_variable = 'qual_student'
# split into features (x) and target (y); difference() returns the columns sorted by name
x = df[df.columns.difference([dependent_variable])]
y = df[dependent_variable]
# hold out 20% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
params = {
    'max_depth' : [4,8,12],
    'max_features': [i for i in range(1, 31)]
}
clf = rf()
grid = GridSearchCV(clf, params, cv=k_fold, scoring='f1')
grid.fit(x_train, y_train)
print(grid.best_params_)
pred = grid.best_estimator_.predict(x_test)
f1_score(y_test, pred, average='binary')


{'max_depth': 8, 'max_features': 24}


0.29629629629629634
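
An F1 score this low is easier to interpret when split into precision and recall. The sketch below uses hand-written labels, not the notebook's test set, to show how the pieces relate; the arrays `y_true` and `y_pred` are hypothetical.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 4 quality students out of 10, only 1 of them found.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(precision, recall, f1)
print(cm)
```

In this toy case the score is dragged down by recall: most positives are missed, which is a common pattern when the positive class is rare.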

Let's export this model so we can use it in a microservice (a Flask API).

In [5]:
import joblib
# modify the file path to where you want to save the model
joblib.dump(grid, 'dockerfile/apps/model.pkl')

['dockerfile/apps/model.pkl']
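
On the service side, the persisted object would be read back with `joblib.load`. The round trip can be sketched as follows, using a small stand-in model and a temporary directory rather than the notebook's `grid` object and path; the synthetic data is an assumption for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (not the student data).
X, y = make_classification(n_samples=60, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical location
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Pickled sklearn models are generally only safe to load with the same library versions that wrote them, so pinning versions in the service image is a sensible precaution.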

In [6]:
query_df = pd.DataFrame({
         "school":"GP",
         "sex":"F",
         "age":15,
         "address":"U",
         "famsize":"LE3",
         "Pstatus":"T",
         "Medu":0,
         "Fedu":0,
         "Mjob":"teacher",
         "Fjob":"teacher",
         "reason":"home",
         "guardian":"mother",
         "traveltime":1,
         "studytime":1,
         "failures":1,
         "schoolsup":True,
         "famsup":True,
         "paid":True,
         "activities":True,
         "nursery":True,
         "higher":True,
         "internet":True,
         "romantic":True,
         "famrel":1,
         "freetime":1,
         "goout":1,
         "Dalc":1,
         "Walc":1,
         "health":1,
         "absences":0
     }, index=[0])

In [7]:
from sklearn.preprocessing import LabelEncoder
for col in query_df:
    if query_df[col].dtype == object:
        le = LabelEncoder()
        le.fit(query_df[col])
        query_df[col] = le.transform(query_df[col])

In [8]:
pred = grid.best_estimator_.predict(query_df)
print(pred)

[1]
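
One thing to check before trusting a prediction like this: the training features were selected with `df.columns.difference(...)`, which returns the columns sorted by name, while the query above lists them in file order. A minimal sketch of aligning a query to the training column order is shown below; `train_columns` is a hypothetical three-column subset, and in the notebook it would be `list(x.columns)`.

```python
import pandas as pd

# Hypothetical subset of training columns, sorted by name as difference() returns them.
train_columns = ["absences", "age", "school"]

# Query built in a different order, already integer-encoded.
query = pd.DataFrame({"school": [0], "age": [15], "absences": [0]})

# Reorder to match training; columns absent from the query become NaN.
aligned = query.reindex(columns=train_columns)

# Any column listed here was missing from the query and needs attention.
missing = aligned.columns[aligned.isna().any()].tolist()
print(aligned)
print(missing)
```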


