### Creating and Persisting an ML Model

In [1]:
import pandas as pd
import numpy as np

Summary of the data

In [2]:
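df = pd.read_csv('data/student-mat.csv', sep=';')
df.describe()  # not displayed: a notebook cell only shows its last expression
df.info  # no parentheses, so this shows the bound method (and the DataFrame repr); call df.info() for the column summary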

<bound method DataFrame.info of     school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob  \
0       GP   F   18       U     GT3       A     4     4   at_home   teacher   
1       GP   F   17       U     GT3       T     1     1   at_home     other   
2       GP   F   15       U     LE3       T     1     1   at_home     other   
3       GP   F   15       U     GT3       T     4     2    health  services   
4       GP   F   16       U     GT3       T     3     3     other     other   
..     ...  ..  ...     ...     ...     ...   ...   ...       ...       ...   
390     MS   M   20       U     LE3       A     2     2  services  services   
391     MS   M   17       U     LE3       T     3     1  services  services   
392     MS   M   21       R     GT3       T     1     1     other     other   
393     MS   M   18       R     LE3       T     3     2  services     other   
394     MS   M   19       U     LE3       T     1     1     other   at_home   

     ... famrel fre

The goal is to predict the quality of a student. Because we are trying to find quality students, we will build a predictor based on the final grade (G3).
In this model, we define a quality student as one who achieves a final grade of 15 or higher.
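
As a minimal sketch of that labeling rule, the snippet below applies the `G3 >= 15` threshold to a handful of made-up grades (the `toy` frame and its values are invented for illustration and are not taken from `student-mat.csv`). It also computes the share of positive labels, which is worth checking before reading too much into accuracy.

```python
import numpy as np
import pandas as pd

# Toy final grades, invented for illustration only.
toy = pd.DataFrame({'G3': [0, 9, 14, 15, 18, 20]})

# A "quality student" is one whose final grade is 15 or higher.
toy['qual_student'] = np.where(toy['G3'] >= 15, 1, 0)

# Share of positive labels; a skewed share makes plain accuracy less informative.
positive_rate = toy['qual_student'].mean()

print(toy)
print('positive rate =', positive_rate)
```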

Import scikit-learn and build a logistic regression classifier

In [23]:
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Lists of all usable features in df (except 'G3' and 'qual_student')
# PROBLEM: linear/logistic regression can't use categorical features (e.g. school, sex) directly; they must be encoded first!
all_features = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
    'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
    'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
    'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
    'Walc', 'health', 'absences', 'G1', 'G2']
all_numerical_features = ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
    'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2']


# Return x (reduced features) and y (qual_student labels)
# Note: this adds a 'qual_student' column to df in place
def parse_data(df, include):
    df['qual_student'] = np.where(df['G3'] >= 15, 1, 0)

    x = df[include]
    y = df['qual_student']
    return x, y

# Print accuracy/F1
def print_scores(y, pred):
    print('accuracy = {},\tF1 = {}'.format(
        accuracy_score(y, pred), f1_score(y, pred, average='binary')))

# TODO: PROBLEM: the current pipeline treats all features as categorical (in OrdinalEncoder).
# This means that at prediction time, unseen numerical values (e.g. G2=1) won't be recognized!
# To find out what to do, see
# https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html

# Select the best k features to use
# k=5: ['Medu', 'failures', 'absences', 'G1', 'G2']
x, y = parse_data(df, all_features)
k = 5
selector = make_pipeline(OrdinalEncoder(), SelectKBest(k=k))
include = selector.fit(x, y).get_feature_names_out()
print(include)

# Build ML model using best k features
x, y = parse_data(df, include)
clf = make_pipeline(OrdinalEncoder(), LogisticRegression())
clf.fit(x, y)

# Get prediction scores (computed on the training data, so they are optimistic)
pred = clf.predict(x)
print_scores(y, pred)

['Medu' 'failures' 'Dalc' 'G1' 'G2']
accuracy = 0.9822784810126582,	F1 = 0.9503546099290779
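
One possible way to address the TODO in the cell above is to route categorical and numerical columns through separate transformers with `ColumnTransformer`. The sketch below is only an illustration on invented data: the `toy` frame, the choice of `OneHotEncoder` and `StandardScaler`, and the column split are all assumptions, not the notebook's actual solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'school': ['GP', 'GP', 'MS', 'MS', 'GP', 'MS', 'GP', 'MS'],
    'sex': ['F', 'M', 'F', 'M', 'F', 'M', 'M', 'F'],
    'G1': [5, 8, 10, 12, 14, 15, 17, 19],
    'G2': [6, 7, 11, 12, 13, 16, 18, 19],
    'qual_student': [0, 0, 0, 0, 0, 1, 1, 1],
})
categorical = ['school', 'sex']
numerical = ['G1', 'G2']

# Encode only the categorical columns; scale the numerical ones as numbers.
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
    ('num', StandardScaler(), numerical),
])
model = make_pipeline(preprocess, LogisticRegression())
model.fit(toy[categorical + numerical], toy['qual_student'])

# G1=1 and G2=1 never appear in the training rows, yet prediction still works
# because the grades are treated as numbers rather than as categories.
unseen = pd.DataFrame({'school': ['GP'], 'sex': ['F'], 'G1': [1], 'G2': [1]})
unseen_pred = model.predict(unseen)
print(unseen_pred)
```

With this layout, a numerical value that was absent from the training set no longer trips the encoder, and `handle_unknown='ignore'` covers unseen categories as well.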


This model isn't very good! The scores above are measured on the training data itself, and we didn't even cross-validate. You'll need to do better :)
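
As a rough sketch of what cross-validating could look like, the snippet below runs 5-fold cross-validation with `cross_val_score` and reports F1 per fold. The data is synthetic (random `G1`/`G2` grades and a noisy `G3` generated here purely for illustration), and the `StandardScaler` step is an assumption, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic grades, generated for illustration only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'G1': rng.integers(0, 21, size=n),
    'G2': rng.integers(0, 21, size=n),
})
noise = rng.normal(0, 2, size=n)
toy['G3'] = ((toy['G1'] + toy['G2']) / 2 + noise).round().clip(0, 20)
toy['qual_student'] = np.where(toy['G3'] >= 15, 1, 0)

model = make_pipeline(StandardScaler(), LogisticRegression())

# Each fold is scored on rows the model did not see while fitting.
scores = cross_val_score(
    model, toy[['G1', 'G2']], toy['qual_student'], cv=5, scoring='f1')

print('F1 per fold:', scores.round(3))
print('mean F1 =', scores.mean().round(3))
```

For the real notebook, any feature selection step would ideally live inside the pipeline passed to `cross_val_score`, so that it is refit on each training fold rather than on the full dataset.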
Let's export this model so we can use it in a microservice (a Flask API).

In [4]:
import joblib
# TODO: Change these lines
# modify the file path to where you want to save the model
joblib.dump(clf, 'app/handlers/model.pkl')
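# NOTE: this query's columns don't match the k features the model was fit on,
# so clf.predict raises the ValueError shown below
query_df = pd.DataFrame({ 'age' : pd.Series(1) ,'health' : pd.Series(15) ,'absences' : pd.Series(10)})
pred = clf.predict(query_df)
type(x)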

Feature names unseen at fit time:
- age
- health
Feature names seen at fit time, yet now missing:
- G1
- G2
- Medu
- failures



ValueError: X has 3 features, but LogisticRegression is expecting 5 features as input.
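
The error above arises because the query's columns differ from the ones the model was fit on. A minimal sketch of a query that does match is shown below: it trains a small stand-in model on invented data, saves and reloads it with `joblib`, and builds the query from the same feature list. The `toy` frame, the `features` list (taken from the names in the error message), the `StandardScaler` step, and the temporary file path are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Feature names as they appear in the error message; the rows are invented.
features = ['Medu', 'failures', 'absences', 'G1', 'G2']
toy = pd.DataFrame({
    'Medu': [1, 2, 3, 4, 4, 2, 3, 4],
    'failures': [2, 1, 0, 0, 0, 3, 0, 0],
    'absences': [10, 6, 4, 0, 2, 12, 1, 0],
    'G1': [7, 9, 12, 15, 16, 6, 14, 18],
    'G2': [6, 10, 12, 16, 17, 5, 15, 18],
    'qual_student': [0, 0, 0, 1, 1, 0, 1, 1],
})

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(toy[features], toy['qual_student'])

# Persist to a temporary location (hypothetical path) and load it back.
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
joblib.dump(model, path)
loaded = joblib.load(path)

# The query must contain exactly the training features, in the same order.
query_df = pd.DataFrame(
    [{'Medu': 4, 'failures': 0, 'absences': 2, 'G1': 16, 'G2': 17}])[features]
pred = loaded.predict(query_df)
print(pred)
```

Keeping the feature list next to the saved model (or saving it alongside) makes it easier for the service that loads the model to build queries in the right shape.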