### Creating and Persisting an ML Model

Read the student data CSV and define a quality student as one whose final grade G3 is greater than or equal to 15. The grade columns G1, G2, and G3 are then dropped from the features.

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/student-mat.csv', sep=';')
df['qual_student'] = np.where(df['G3']>=15, 1, 0)
X = df.drop(['G3', 'qual_student', 'G1', 'G2'], axis=1)
y = df['qual_student']
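
To make the labeling rule concrete, here is a minimal sketch on a three-row toy frame. The rows and the `toy`, `features`, and `labels` names are invented for illustration and are not part of the student data set. The grade columns are dropped presumably because G1 and G2 are earlier grades that would largely give away G3.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; every value here is made up.
toy = pd.DataFrame({
    "G1": [10, 16, 8],
    "G2": [11, 15, 9],
    "G3": [12, 15, 7],
    "absences": [2, 0, 10],
})

# Label is 1 when the final grade G3 is at least 15, otherwise 0.
toy["qual_student"] = np.where(toy["G3"] >= 15, 1, 0)

# Drop the label and all grade columns so the features cannot leak the target.
features = toy.drop(["G3", "qual_student", "G1", "G2"], axis=1)
labels = toy["qual_student"]

print(labels.tolist())         # [0, 1, 0]
print(list(features.columns))  # ['absences']
```

Note that a G3 of exactly 15 counts as a quality student because the comparison is inclusive.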

Replace the non-numerical columns with numerical values.

In [3]:
# Integer codes for each categorical column, keyed by the original labels.
yes_no = {'yes': 0, 'no': 1}
job = {'teacher': 0, 'health': 1, 'services': 2, 'at_home': 3, 'other': 4}
encodings = {
    'school': {'GP': 0, 'MS': 1},
    'sex': {'F': 0, 'M': 1},
    'address': {'U': 0, 'R': 1},
    'famsize': {'LE3': 0, 'GT3': 1},
    'Pstatus': {'T': 0, 'A': 1},
    'Mjob': job,
    'Fjob': job,
    'reason': {'home': 0, 'reputation': 1, 'course': 2, 'other': 3},
    'guardian': {'mother': 0, 'father': 1, 'other': 2},
    **dict.fromkeys(['schoolsup', 'famsup', 'paid', 'activities',
                     'nursery', 'higher', 'internet', 'romantic'], yes_no),
}
for col, mapping in encodings.items():
    X[col] = X[col].replace(mapping)
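
After encoding, it can be worth checking that every column is numeric and that no label was missed. The sketch below uses `Series.map` instead of the `replace` call above, because `map` turns any label missing from the mapping into NaN, which makes gaps easy to count. The toy frame, the `encodings` dictionary, and the "maybe" label are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "internet": ["yes", "no", "maybe"],  # "maybe" is a deliberately unexpected label
    "age": [15, 16, 17],
})
encodings = {
    "sex": {"F": 0, "M": 1},
    "internet": {"yes": 0, "no": 1},
}

encoded = toy.copy()
for col, mapping in encodings.items():
    # Series.map yields NaN for any label that is not a key of the mapping.
    encoded[col] = toy[col].map(mapping)

# Count the labels that the mapping did not cover, per encoded column.
unmapped = {col: int(encoded[col].isna().sum()) for col in encodings}

# List any column that is still not numeric after encoding.
non_numeric = [c for c in encoded.columns
               if not pd.api.types.is_numeric_dtype(encoded[c])]

print(unmapped)     # {'sex': 0, 'internet': 1}
print(non_numeric)  # []
```

With `replace`, the unexpected "maybe" would have stayed in the column as a string and only surfaced later when fitting the model.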

Split the data into training and test sets with a 70/30 split.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
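
The split above is random, so the selected features and the scores further down will vary from run to run. A possible variation, shown here on made-up arrays, fixes `random_state` for repeatability and passes `stratify` so that the rarer positive class appears in both parts. Both arguments are additions for illustration and are not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 5 positive labels out of 20.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

def split(seed):
    # stratify keeps the class ratio roughly equal in both parts;
    # random_state makes the shuffle repeatable.
    return train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed, stratify=y_demo
    )

X_tr, X_te, y_tr, y_te = split(0)
X_tr2, X_te2, y_tr2, y_te2 = split(0)

print(len(X_tr), len(X_te))          # 14 6
print(np.array_equal(X_te, X_te2))   # True
```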

Using sklearn's feature selection, find the 5 features that score highest against the qual_student label (which is derived from G3).

In [5]:
from sklearn.feature_selection import SelectKBest, f_regression
# Score every feature against the label; k='all' keeps all scores available.
fs = SelectKBest(score_func=f_regression, k='all')
fs.fit(X_train, y_train)
# Pair each column name with its score, then keep the 5 highest-scoring names.
weights = dict(zip(X_train.columns, fs.scores_))
top_features = sorted(weights, key=weights.get, reverse=True)[:5]
top_features

['Medu', 'failures', 'Walc', 'goout', 'Dalc']
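
As an alternative sketch, `SelectKBest` can do the top-k selection itself, and `get_support()` returns a boolean mask that recovers the column names. The example below runs on synthetic data and swaps in `f_classif`, a score function intended for categorical targets, rather than the `f_regression` used above. The column names `a` to `f` and the rule that generates the label are assumptions made up for this demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: six noise columns, with the label driven by "a" and "b" only.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 6)), columns=list("abcdef"))
target = (demo["a"] + demo["b"] > 0).astype(int)

# Ask the selector for the 2 best features directly instead of k='all'.
selector = SelectKBest(score_func=f_classif, k=2).fit(demo, target)

# get_support() is a boolean mask over the input columns, in column order.
selected = demo.columns[selector.get_support()].tolist()
print(selected)  # ['a', 'b']
```

The names come back in column order rather than score order, so this form suits filtering a frame more than ranking features.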

Keep only those top 5 features.

In [6]:
X_train = X_train[top_features]
X_test = X_test[top_features]

Import scikit-learn and build a random forest classifier with 100 trees.

In [7]:
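from sklearn.ensemble import RandomForestClassifier as rf
import sklearn.metrics  # used below for the evaluation scores
# n_estimators=100 is the library default; it is spelled out here for clarity.
clf = rf(n_estimators=100)
clf.fit(X_train, y_train)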

RandomForestClassifier()

Compute the F1 score, precision, recall, and accuracy of the model on the test set.

In [205]:
pred = clf.predict(X_test)
print(sklearn.metrics.f1_score(y_test, pred, average='binary'))
print(sklearn.metrics.precision_score(y_test, pred, average='binary'))
print(sklearn.metrics.recall_score(y_test, pred, average='binary'))
print(sklearn.metrics.accuracy_score(y_test, pred))

0.380952380952381
0.4444444444444444
0.3333333333333333
0.7815126050420168
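
To show how the four numbers relate, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they are one set of values that reproduces the printed scores and are not taken from the notebook itself.

```python
# Hypothetical confusion-matrix counts chosen to match the printed scores.
tp, fp, fn, tn = 8, 10, 16, 85

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct

print(round(f1, 4), round(precision, 4), round(recall, 4), round(accuracy, 4))
# 0.381 0.4444 0.3333 0.7815
```

Under these assumed counts only 24 of 119 test rows are positive, which would explain how accuracy can sit near 78% while the model recovers just a third of the quality students. For an imbalanced label like this one, F1 and recall say more about the model than accuracy does.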


Save the model.

In [9]:
import joblib
# Modify the file path to where you want to save the model; the directory must already exist.
joblib.dump(clf, './dockerfile/apps/new_model.pkl')

['./dockerfile/apps/new_model.pkl']
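
A saved model is read back with `joblib.load`. The following sketch round-trips a small forest through a temporary file and checks that the restored model predicts identically. The synthetic data, the 10-tree forest, and the `demo_model.pkl` file name are placeholders, not the model or path used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem: 5 features, label driven by the first one.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(model, path)       # serialize the fitted model to disk
    restored = joblib.load(path)   # read it back into a new object

# The restored model should give exactly the same predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

The code that loads the model has to pass the same five features in the same column order used for training, and it is safest to load with the same scikit-learn version that wrote the file.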