### Creating and Persisting an ML Model

In [40]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/student-mat.csv', sep=';')

Summary of the data

In [26]:
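df.describe()                # numeric summary (a notebook only displays the last expression)
df.corr(numeric_only=True)   # correlations between numeric columns only
df.head()                    # first five rows, displayed below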

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [3]:
df.info  # without parentheses this shows the bound method's repr; call df.info() for the column summary

<bound method DataFrame.info of     school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob  \
0       GP   F   18       U     GT3       A     4     4   at_home   teacher   
1       GP   F   17       U     GT3       T     1     1   at_home     other   
2       GP   F   15       U     LE3       T     1     1   at_home     other   
3       GP   F   15       U     GT3       T     4     2    health  services   
4       GP   F   16       U     GT3       T     3     3     other     other   
..     ...  ..  ...     ...     ...     ...   ...   ...       ...       ...   
390     MS   M   20       U     LE3       A     2     2  services  services   
391     MS   M   17       U     LE3       T     3     1  services  services   
392     MS   M   21       R     GT3       T     1     1     other     other   
393     MS   M   18       R     LE3       T     3     2  services     other   
394     MS   M   19       U     LE3       T     1     1     other   at_home   

     ... famrel fre

Create a subset of features as an example.

In [41]:
include = ['health','Medu','Fedu','studytime','traveltime','absences','age','G3']
df.drop(columns=df.columns.difference(include), inplace=True)  # keep seven features plus G3

In [5]:
df.info  # without parentheses this shows the bound method's repr; call df.info() for the column summary

<bound method DataFrame.info of      age  health  absences  G3
0     18       3         6   6
1     17       3         4   6
2     15       3        10  10
3     15       5         2  15
4     16       5         4  10
..   ...     ...       ...  ..
390   20       4        11   9
391   17       2         3  16
392   21       3         3   7
393   18       5         0  10
394   19       5         5   9

[395 rows x 4 columns]>

The goal is to predict whether a student is a quality student. We derive the label from the final grade (G3): in this model, a quality student is one who achieves a final grade of 15 or higher.

In [42]:
df['qual_student'] = np.where(df['G3']>=15, 1, 0)
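
As a quick sanity check on the labelling rule, here is a minimal sketch that applies the same threshold to a small frame and looks at the class balance. The `toy` frame and its grades are hypothetical, not rows from `student-mat.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy grades, not rows from student-mat.csv.
toy = pd.DataFrame({"G3": [6, 10, 14, 15, 20]})

# Same rule as above: 1 when the final grade is 15 or higher, else 0.
toy["qual_student"] = np.where(toy["G3"] >= 15, 1, 0)

# Share of each class; a skewed split means accuracy alone can mislead.
class_share = toy["qual_student"].value_counts(normalize=True)
print(toy)
print(class_share)
```

Running the same `value_counts(normalize=True)` on the real `df['qual_student']` shows how rare the positive class is, which is one reason to report F1 rather than plain accuracy.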

In [16]:
df.describe()

Unnamed: 0,age,Fedu,failures,G3,qual_student
count,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.521519,0.334177,10.41519,0.18481
std,1.276043,1.088201,0.743651,4.581443,0.388636
min,15.0,0.0,0.0,0.0,0.0
25%,16.0,2.0,0.0,8.0,0.0
50%,17.0,2.0,0.0,11.0,0.0
75%,18.0,3.0,0.0,14.0,0.0
max,22.0,4.0,3.0,20.0,1.0


Drop the G3 score, since the label is derived directly from it and keeping it as a feature would leak the answer.

In [43]:
include = ['health','Medu','Fedu','studytime','traveltime', 'absences','age','qual_student']
df.drop(columns=df.columns.difference(include), inplace=True) 

Import scikit-learn and build a random forest classifier

In [44]:
import sklearn
from sklearn.ensemble import RandomForestClassifier as rf

dependent_variable = 'qual_student'
# Features: every column except the label (Index.difference returns them sorted by name)
x = df[df.columns.difference([dependent_variable])]
y = df[dependent_variable]
clf = rf(n_estimators=1000)  # forest of 1000 trees
clf.fit(x, y)

RandomForestClassifier(n_estimators=1000)

In [54]:
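import sklearn.metrics  # make sure the metrics submodule is loaded
pred = clf.predict(x)  # predictions on the training data itself
sklearn.metrics.f1_score(y, pred, average='binary')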

0.9863013698630136

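The score looks high, but it is measured on the same rows the model was trained on, so it says little about how the model generalizes. It's not a very good evaluation! We didn't even cross-validate. You'll need to do better :)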
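
One way to do better is k-fold cross-validation, where each fold is scored on rows held out from training. The following is a minimal sketch, not the notebook's own evaluation: `x_demo`, `y_demo`, and `clf_demo` are hypothetical names, and the data is synthetic so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data that borrows three of the notebook's column names.
rng = np.random.default_rng(0)
n = 200
x_demo = pd.DataFrame({
    "absences": rng.integers(0, 30, n),
    "age": rng.integers(15, 23, n),
    "health": rng.integers(1, 6, n),
})
# Synthetic label loosely tied to absences, for illustration only.
y_demo = (x_demo["absences"] + rng.normal(0, 5, n) < 8).astype(int)

clf_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: each fold is scored on rows the model did not see.
# For classifiers, an integer cv uses stratified folds.
cv_f1 = cross_val_score(clf_demo, x_demo, y_demo, cv=5, scoring="f1")
print(cv_f1.mean(), cv_f1.std())
```

With the real data, passing the notebook's `x` and `y` in place of the synthetic frame should give an F1 estimate that is usually lower than the training-set score above.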
Let's export this model so we can use it in a microservice (a Flask API).

In [55]:
import joblib
# modify the file path to where you want to save the model
joblib.dump(clf, 'dockerfile/apps/model.pkl')

['dockerfile/apps/model.pkl']
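
Before wiring the file into a service, it can help to confirm that the saved model loads and predicts the same as the in-memory one. A minimal round-trip sketch, assuming a small hypothetical model (`clf_demo`) and a temporary path rather than the real `model.pkl`:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical tiny model, standing in for the fitted `clf` above.
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 10, size=(50, 3))
y_demo = (x_demo[:, 0] > 4).astype(int)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical temporary path
    joblib.dump(clf_demo, path)            # persist to disk
    restored = joblib.load(path)           # what the service would do at startup

# The restored model should reproduce the original predictions.
same = np.array_equal(clf_demo.predict(x_demo), restored.predict(x_demo))
print(same)
```

The service should load the file with the same scikit-learn version that wrote it, since pickled estimators are not guaranteed to load across versions.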

In [47]:
query_df = pd.DataFrame({ 'age' : pd.Series(1) ,'health' : pd.Series(15) ,'absences' : pd.Series(10)})

In [48]:
pred = clf.predict(query_df)

ValueError: X has 3 features, but DecisionTreeClassifier is expecting 7 features as input.

In [39]:
x

Unnamed: 0,Fedu,age,failures
0,4,18,0
1,1,17,0
2,1,15,3
3,2,15,0
4,3,16,0
...,...,...,...
390,2,20,2
391,1,17,0
392,1,21,3
393,2,18,0


In [15]:
type(x)

pandas.core.frame.DataFrame