### Creating and Persisting an ML Model

In [None]:
import pandas as pd
import numpy as np

Summary of the data

In [None]:
df = pd.read_csv('data/student-mat.csv', sep=';')
df.info()
df.describe()

The goal is to predict the quality of a student. Because we are trying to find quality students, we will build a predictor based on the final grade (G3).
In this model, we define a quality student as one who achieves a final grade of 15 or higher.
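
To make the label definition concrete, here is a minimal sketch on a handful of made-up G3 values (the `toy` frame is an assumption for illustration, not rows from the dataset). It applies the 15-or-higher threshold and prints the share of each class, which is worth checking because an imbalanced label makes accuracy alone misleading.

```python
import numpy as np
import pandas as pd

# Made-up final grades, only to illustrate the threshold
toy = pd.DataFrame({'G3': [8, 15, 19, 14, 0, 16]})

# 1 = quality student (G3 >= 15), 0 = everyone else
toy['qual_student'] = np.where(toy['G3'] >= 15, 1, 0)

# Share of each class in the toy frame
print(toy['qual_student'].value_counts(normalize=True))
```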

Import scikit-learn and build a classifier. The baseline below uses logistic regression; RandomForestClassifier is imported but not used yet.

In [89]:
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_selection import SelectKBest, chi2

# Lists of all usable features in df (except 'G3' and 'qual_student')
# PROBLEM: linear/logistic regression can't use categorical features (e.g. school, sex) without encoding them first!
all_features = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
    'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
    'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
    'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
    'Walc', 'health', 'absences', 'G1', 'G2']
all_numerical_features = ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
    'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2']


# Return x (reduced features) and y (qual_student labels)
def parse_data(df, include):
    df['qual_student'] = np.where(df['G3']>=15, 1, 0)
    
    x = df[include]
    y = df['qual_student']
    return x, y

# Print accuracy/F1
def print_scores(y, pred):
    print('accuracy = {},\tF1 = {}'.format(
        accuracy_score(y, pred), f1_score(y, pred, average='binary')))

# Select the best k features to use
# k=5: ['Medu', 'failures', 'absences', 'G1', 'G2']
x, y = parse_data(df, all_numerical_features)
k = 5
features = SelectKBest(chi2, k=k).fit(x, y).get_support()
include = x.columns[features]
print(include)

# Build ML model using best k features
x, y = parse_data(df, include)
clf = LogisticRegression()
clf.fit(x, y)

# Get prediction scores (on the training data itself, no held-out set)
pred = clf.predict(x)
print_scores(y, pred)

Index(['Medu', 'failures', 'absences', 'G1', 'G2'], dtype='object')
accuracy = 0.9822784810126582,	F1 = 0.9510489510489512


These scores look great, but the model isn't very good! It was scored on the same rows it was trained on, and we didn't even cross-validate. You'll need to do better :)
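
One way to get a more honest estimate is k-fold cross-validation, with the feature selection moved inside a pipeline so it is refit on each training fold. The sketch below is only an illustration under assumptions: it runs on synthetic data from `make_classification` so it does not need the CSV, and the names `pipe`, `cv`, and `scores`, the 5 folds, and the `MinMaxScaler` step (added because `chi2` needs non-negative inputs) are choices made here, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for (x, y) with an imbalanced label.
# With the real data, use: x, y = parse_data(df, all_numerical_features)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=15, n_informative=5,
    weights=[0.8, 0.2], random_state=0)

pipe = Pipeline([
    ('scale', MinMaxScaler()),            # chi2 requires non-negative features
    ('select', SelectKBest(chi2, k=5)),   # selection is refit on every training fold
    ('clf', LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in each split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='f1')

print('F1 per fold:', scores.round(3))
print('mean F1 = {:.3f}'.format(scores.mean()))
```

Keeping `SelectKBest` inside the pipeline matters: selecting features on the full dataset first would let information from the validation folds leak into the model.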
Let's export this model so we can use it in a microservice (a Flask API).

In [None]:
import joblib
# TODO: Change these lines
# modify the file path to where you want to save the model
joblib.dump(clf, 'app/handlers/model.pkl')
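# The query must have the same columns, in the same order, as the features the model was fit on
# (the values below are placeholders)
query_df = pd.DataFrame({'Medu': [4], 'failures': [0], 'absences': [10], 'G1': [15], 'G2': [15]})
pred = clf.predict(query_df)
pred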
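
On the service side, a handler would load the pickled model and build a one-row DataFrame with the columns the model was trained on. The following is a minimal sketch under assumptions: `predict_from_payload` is a hypothetical helper (not part of Flask or joblib), the model is fit on a tiny made-up frame, and the file goes to a temporary directory so the block runs on its own.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Feature columns printed by the selection step in the notebook
FEATURES = ['Medu', 'failures', 'absences', 'G1', 'G2']

# Tiny made-up training frame, only so this sketch runs without the CSV
train = pd.DataFrame({
    'Medu':     [4, 1, 3, 2, 4, 0, 3, 2],
    'failures': [0, 2, 0, 1, 0, 3, 0, 1],
    'absences': [2, 14, 4, 10, 0, 20, 6, 8],
    'G1':       [16, 7, 15, 9, 18, 5, 14, 10],
    'G2':       [17, 6, 15, 8, 18, 4, 15, 9],
})
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = quality student (made up)

# Persist the fitted model, as the notebook does with joblib.dump
model_path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
joblib.dump(LogisticRegression(max_iter=1000).fit(train[FEATURES], labels), model_path)

def predict_from_payload(payload, path=model_path):
    """Hypothetical helper: load the saved model and score one dict of feature values."""
    model = joblib.load(path)
    # Select FEATURES explicitly so column order matches what the model was fit on
    query = pd.DataFrame([payload])[FEATURES]
    return int(model.predict(query)[0])

result = predict_from_payload(
    {'Medu': 4, 'failures': 0, 'absences': 2, 'G1': 16, 'G2': 17})
print(result)
```

A real handler would probably load the model once at startup rather than on every request; it is loaded inside the function here only to keep the sketch short.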