# Chronic Absenteeism Package: Develop and Train ModelResults Table, using StudentModel Table
This notebook is intended to explore the capabilities of the OEA Chronic Absenteeism package by developing and training an InterpretML glassbox ML model. 

**It is recommended that you review and execute all relevant module pipelines before testing this Chronic Absenteeism notebook.**

 - Train Model Table:
     * Uses the InterpretML ExplainableBoostingClassifier.
     * Provides a mean-value feature importance visual for comparing the feature importances found during ML model training.

This notebook develops and trains the ML model by:
 1. Training the ML model on the StudentModel table.
 2. Refining which specific features are wanted, and developing the InterpretML model predictions table.
 3. Extracting the top 5 feature-drivers of predicted chronic absenteeism.
 4. Writing the final "ModelResults_pseudo" table out to stage 3p.

**NOTE: This notebook must be attached to a Spark pool with the proper requirements.txt file.**

In [1]:
%run /OEA_py

In [2]:
# 0) Initialize the OEA framework.
oea = OEA()

## Import Relevant Python Packages

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_validate
from interpret.glassbox import ExplainableBoostingClassifier

## Read in the StudentModel table

In [4]:
dfStudentModel = oea.load_delta('stage3p/chronic_absenteeism/StudentModel_pseudo')

## 1.) Train ML Model on StudentModel table

Training uses InterpretML's ExplainableBoostingClassifier; see the documentation here: https://interpret.ml/docs/ebm.html. For debugging and questions about the model, see the FAQ page: https://interpret.ml/docs/faq.html#.

For technical explanations, see the GitHub repository: https://github.com/interpretml/interpret. For the `ebm.py` script (which contains most of the InterpretML functions used here), visit: https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/glassbox/ebm/ebm.py.

In [5]:
# select only relevant columns
dfStudentModel = dfStudentModel.drop('Surname', 'GivenName', 'MiddleName', 'InPerson_numDaysAttended', 'InPerson_percentDaysAttended', 'StudentGrade', 'SchoolName')
display(dfStudentModel.limit(10))

In [6]:
pdf = dfStudentModel.toPandas()

In [7]:
# cross validation accuracy check, used for model exploration.

df = pdf.copy()
labels = df.pop("InPerson_chronicAbsFlag")
X = df.copy()

clf = ExplainableBoostingClassifier()

scoring = {'acc': 'accuracy',
           'f1_macro': 'f1_macro'}

scores = cross_validate(clf, X, labels, cv=5, scoring=scoring)

acc_scores = scores['test_acc']
f1_scores = scores['test_f1_macro']
print("Classifier: %0.3f Accuracy with a standard deviation of %0.4f" % (acc_scores.mean(), acc_scores.std()))
print("Classifier: %0.3f macro F1-Score with a standard deviation of %0.4f\n" % (f1_scores.mean(), f1_scores.std()))
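
The cross-validation above reports macro F1 next to accuracy. One reason to track both, assuming chronically absent students are a minority of the rows (an assumption about the data, not something this notebook checks), is that accuracy alone can look strong for a model that never flags anyone. A minimal pure-Python sketch with invented labels; the helper functions are illustrative and are not part of scikit-learn:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (classes taken from y_true)."""
    f1_per_class = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)


# Invented example: 9 students not chronically absent, 1 chronically absent,
# and a model that predicts "not absent" for everyone.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

In this toy case accuracy is high while macro F1 is below 0.5, because the absent class is never identified.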

In [8]:
# full data model train and explain
clf.fit(X, labels)
output = clf.explain_local(X, labels)

In [9]:
clf.explain_global().visualize()

## 2.) Refine which specific features are wanted, and develop the final InterpretML model predictions table

**NOTES:** 
 - Update the cross-correlated feature columns (the `A x B` entries in the `features` list below) so that they pull the most relevant columns for processing in the Power BI dashboard analyses.
 - You may also want to pass the StudentIds through the training of the model. To do so, see the documentation here: https://interpret.ml/docs/ebm.html
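
As a rough illustration of what the next cell does for each student, the sketch below flattens one local explanation into a single row. The `datapoint` dict is a hand-written stand-in for `output.data(key=i)`: the `scores`, `values`, and `perf` keys match what the notebook reads, while the `names` key, the feature names, and all numbers are assumptions made for this example. The helper `explanation_to_row` is hypothetical and is not an InterpretML function.

```python
# Hypothetical stand-in for one `output.data(key=i)` result (invented values).
datapoint = {
    "names": ["StudentId_external_pseudonym", "feature_a", "feature_b", "feature_a x feature_b"],
    "scores": [0.0, 0.8, -0.3, 0.1],
    "values": ["abc123", 4.0, 0.5, None],
    "perf": {"actual": 1, "predicted": 0, "predicted_score": 0.7},
}


def explanation_to_row(datapoint, id_column="StudentId_external_pseudonym"):
    """Flatten one local explanation into a dict usable as a DataFrame row."""
    row = {}
    for name, score, value in zip(datapoint["names"], datapoint["scores"], datapoint["values"]):
        # Keep the raw identifier; for every other column keep the model's score.
        row[name] = value if name == id_column else score

    perf = datapoint["perf"]
    row["true_class"] = perf["actual"]
    row["predicted_class"] = perf["predicted"]
    row["predicted_score"] = perf["predicted_score"]

    # The notebook treats predicted_score as the probability of the predicted
    # class, so it is converted to the probability of class 1 (chronically absent).
    if perf["predicted"] == 1:
        row["absentee_probability"] = perf["predicted_score"]
    else:
        row["absentee_probability"] = 1 - perf["predicted_score"]
    return row


row = explanation_to_row(datapoint)
print(row)
```

Reading column names from the explanation itself, where available, would avoid maintaining the interaction-term list by hand; whether that fits depends on the installed InterpretML version.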


In [10]:
# student predictions and explanations

score_df_list = []
features = pd.Index(list(df)+['Clever_avgNumAccessesPerDay_Newsela x Clever_avgNumAccessesPerDay_Office365', 'Clever_avgNumAccessesPerDay_DestinyDiscover x Clever_avgNumAccessesPerDay_XtraMath', \
'Clever_avgNumAccessesPerDay_Scholastic x Clever_avgNumAccessesPerDay_SpringBoard', 'Clever_avgNumAccessesPerDay_Office365 x Clever_avgNumAccessesPerDay_Scholastic', \
'Clever_avgNumAccessesPerDay_EdgenuityCourseware x Clever_avgNumAccessesPerDay_Office365'])

for i in range(len(X)):
    datapoint = output.data(key=i)
    score_dict = {}
    for idx, column_name in enumerate(features):
        if (column_name == 'StudentId_external_pseudonym'):
            score_dict[column_name] = datapoint['values'][idx]
        else:
            # NOTE: to append '_interpret_val' to every feature column name, uncomment the next line and comment out the line after it
            #score_dict[column_name + "_interpret_val"] = datapoint['scores'][idx]
            score_dict[column_name] = datapoint['scores'][idx]
    score_dict['true_class'] = datapoint['perf']['actual']
    score_dict['predicted_class'] = datapoint['perf']['predicted']
    score_dict['predicted_score'] = datapoint['perf']['predicted_score']

    if score_dict['predicted_class'] == 0:
        score_dict['absentee_probability'] = 1-datapoint['perf']['predicted_score']
    else:
        score_dict['absentee_probability'] = datapoint['perf']['predicted_score']
    score_df_list.append(score_dict)

explainable_df = pd.DataFrame(score_df_list)

In [11]:
explainable_df

## 3.) Extract the top 5 Drivers of Predicted Chronic Absenteeism

The first step in this extraction is to take the absolute value of each potential driver column. Negative driver scores indicate raw data values that the model treats as preventative factors for chronic absenteeism, so the absolute value gives the weight of each driver regardless of its direction.
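
To make the role of the sign concrete, here is a small sketch with invented scores for one student. It assumes the usual additive reading of an explainable boosting model for binary classification, where the log-odds of the positive class equal an intercept plus the sum of the per-feature scores; the feature names and numbers are hypothetical.

```python
import math

# Invented per-feature scores for one student, in log-odds units.
intercept = -1.0
contributions = {
    "days_enrolled": -0.9,   # negative: pushes the prediction away from chronic absence
    "prior_absences": 1.4,   # positive: pushes the prediction toward chronic absence
    "app_usage": 0.2,
}

# Assumption: additive model, log-odds = intercept + sum of contributions.
log_odds = intercept + sum(contributions.values())
probability = 1 / (1 + math.exp(-log_odds))

# Rank by magnitude, but keep the signed value so the direction is not lost.
ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)

for name, score in ranked:
    direction = "risk" if score > 0 else "preventative"
    print(f"{name}: {score:+.1f} ({direction})")
print("predicted probability:", round(probability, 3))
```

Ranking on the absolute value puts `days_enrolled` second even though its score is negative, which is why the notebook keeps the signed value alongside the ranking.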

The original `explainable_df` is then melted into long format, with one column for the feature name and one for the feature value, and the top 5 feature drivers are extracted for each student.

The last code block then extracts the "true_class", "predicted_class", "predicted_score", and "absentee_probability" columns from `explainable_df` and adds them to the final `dfModelResults_features` table.

The final `dfModelResults_features` table has the dimensions: (the number of students) x (15 columns)

**NOTE:** If you'd like to keep all drivers and their weights, you can edit or remove the code-blocks in this step, as needed.
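
The melt-and-rank idea described above can be tried on a toy table before running the Spark cells. The sketch below is an alternative, pandas-only formulation (sort by magnitude, then keep the first rows per student) and is not the notebook's join-based implementation; the feature names `f_a` to `f_f` and all scores are invented. Unlike the repeated joins, it always keeps exactly one row per rank when two scores tie.

```python
import pandas as pd

# Invented wide table: one row per student, one column per driver score.
explainable_df = pd.DataFrame({
    "StudentId_external_pseudonym": ["s1", "s2"],
    "f_a": [0.5, -0.1],
    "f_b": [-1.2, 0.3],
    "f_c": [0.1, -0.9],
    "f_d": [0.0, 0.2],
    "f_e": [0.7, 0.05],
    "f_f": [-0.3, 0.6],
})

TOP_N = 5
id_col = "StudentId_external_pseudonym"

# Melt to long format: one row per (student, feature) pair.
long_df = explainable_df.melt(
    id_vars=[id_col], var_name="feature", value_name="feature_interpret_val"
)
long_df["abs_val"] = long_df["feature_interpret_val"].abs()

# Sort by magnitude within each student and keep the TOP_N strongest drivers.
long_df = long_df.sort_values([id_col, "abs_val"], ascending=[True, False])
top = long_df.groupby(id_col).head(TOP_N).copy()
top["rank"] = top.groupby(id_col).cumcount() + 1

# Pivot back to one row per student: feature names plus their signed scores.
names = top.pivot(index=id_col, columns="rank", values="feature")
vals = top.pivot(index=id_col, columns="rank", values="feature_interpret_val")
names.columns = [f"feature{r}" for r in names.columns]
vals.columns = [f"feature{r}_interpret_val" for r in vals.columns]
top5_wide = names.join(vals).reset_index()

print(top5_wide)
```

The result has 11 columns (the student identifier plus five name/value pairs); adding the four prediction columns gives the 15 columns stated above.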

In [23]:

explainable_df_melted = explainable_df.drop(columns=['true_class', 'predicted_class', 'predicted_score', 'absentee_probability'])

list_explainable_df_features = explainable_df_melted.drop('StudentId_external_pseudonym', axis=1)
list_explainable_df_features = list_explainable_df_features.columns.tolist()
# melt the df and create the absolute value table
explainable_df_melted = explainable_df_melted.melt(id_vars = ['StudentId_external_pseudonym'],value_vars = list_explainable_df_features,var_name='feature',value_name='feature_interpret_val')
explainable_df_melted_abs = explainable_df_melted.copy()
explainable_df_melted_abs['feature_interpret_val'] = explainable_df_melted_abs['feature_interpret_val'].abs()
# push these tables to pyspark dfs
dfModelResults_melted = spark.createDataFrame(explainable_df_melted)
dfModelResults_melted = dfModelResults_melted.withColumnRenamed('StudentId_external_pseudonym', 'StudentId')
dfModelResults_melted_abs = spark.createDataFrame(explainable_df_melted_abs)
# extract the first/top feature based on the interpretML value per student 
dfModelResults_f1 = dfModelResults_melted_abs.groupBy('StudentId_external_pseudonym').max('feature_interpret_val')
dfModelResults_f1 = dfModelResults_f1.withColumnRenamed('StudentId_external_pseudonym', 'StudentId').withColumnRenamed('max(feature_interpret_val)', 'feature_val')
# create features df and grab the feature name
dfModelResults_features = dfModelResults_f1.join(dfModelResults_melted_abs, (dfModelResults_f1.StudentId == dfModelResults_melted_abs.StudentId_external_pseudonym)\
 & (dfModelResults_f1.feature_val == dfModelResults_melted_abs.feature_interpret_val), how='inner')
dfModelResults_features = dfModelResults_features.drop('StudentId', 'feature_val')
dfModelResults_features = dfModelResults_features.withColumnRenamed('feature', 'feature1').withColumnRenamed('feature_interpret_val', 'feature1_interpret_absVal')
# grab the actual, model-assigned value for the feature
dfModelResults_features = dfModelResults_features.join(dfModelResults_melted, (dfModelResults_features.StudentId_external_pseudonym == dfModelResults_melted.StudentId)\
 & (dfModelResults_features.feature1 == dfModelResults_melted.feature), how='inner')
dfModelResults_features = dfModelResults_features.drop('StudentId', 'feature', 'feature1_interpret_absVal').withColumnRenamed('feature_interpret_val', 'feature1_interpret_val')

In [24]:
# remove the rows already in the dfModelResults_features from the melted_abs table to find the next top feature
dfModelResults_melted_abs = dfModelResults_melted_abs.withColumnRenamed('StudentId_external_pseudonym', 'StudentId_external')
dfModelResults_melted_abs = dfModelResults_melted_abs.join(dfModelResults_features, (dfModelResults_melted_abs.StudentId_external == dfModelResults_features.StudentId_external_pseudonym)\
 & (dfModelResults_melted_abs.feature == dfModelResults_features.feature1), how='leftanti')
# extract the second-top feature based on the interpretML value per student 
dfModelResults_f2 = dfModelResults_melted_abs.groupBy('StudentId_external').max('feature_interpret_val')
dfModelResults_f2 = dfModelResults_f2.withColumnRenamed('StudentId_external', 'StudentId').withColumnRenamed('max(feature_interpret_val)', 'feature_val')
# grab the feature name
dfModelResults_f2 = dfModelResults_f2.join(dfModelResults_melted_abs, (dfModelResults_f2.StudentId == dfModelResults_melted_abs.StudentId_external) \
& (dfModelResults_f2.feature_val == dfModelResults_melted_abs.feature_interpret_val), how='inner')
dfModelResults_f2 = dfModelResults_f2.drop('StudentId', 'feature_val')
dfModelResults_f2 = dfModelResults_f2.withColumnRenamed('feature', 'feature2').withColumnRenamed('feature_interpret_val', 'feature2_interpret_absVal')
# grab the actual, model-assigned value for the feature
dfModelResults_f2 = dfModelResults_f2.join(dfModelResults_melted, (dfModelResults_f2.StudentId_external == dfModelResults_melted.StudentId)\
 & (dfModelResults_f2.feature2 == dfModelResults_melted.feature), how='inner')
dfModelResults_f2 = dfModelResults_f2.drop('StudentId', 'feature', 'feature2_interpret_absVal').withColumnRenamed('feature_interpret_val', 'feature2_interpret_val')
# join feature 2 table to the final dfModelResults_features table
dfModelResults_features = dfModelResults_features.join(dfModelResults_f2, dfModelResults_features.StudentId_external_pseudonym == dfModelResults_f2.StudentId_external, how='inner')
dfModelResults_features = dfModelResults_features.drop('StudentId_external')

In [25]:
# remove the rows already in the dfModelResults_features from the melted_abs table to find the next top feature
dfModelResults_melted_abs = dfModelResults_melted_abs.join(dfModelResults_features, (dfModelResults_melted_abs.StudentId_external == dfModelResults_features.StudentId_external_pseudonym)\
 & (dfModelResults_melted_abs.feature == dfModelResults_features.feature2), how='leftanti')
# extract the third-top feature based on the interpretML value per student 
dfModelResults_f3 = dfModelResults_melted_abs.groupBy('StudentId_external').max('feature_interpret_val')
dfModelResults_f3 = dfModelResults_f3.withColumnRenamed('StudentId_external', 'StudentId').withColumnRenamed('max(feature_interpret_val)', 'feature_val')
# grab the feature name
dfModelResults_f3 = dfModelResults_f3.join(dfModelResults_melted_abs, (dfModelResults_f3.StudentId == dfModelResults_melted_abs.StudentId_external) \
& (dfModelResults_f3.feature_val == dfModelResults_melted_abs.feature_interpret_val), how='inner')
dfModelResults_f3 = dfModelResults_f3.drop('StudentId', 'feature_val')
dfModelResults_f3 = dfModelResults_f3.withColumnRenamed('feature', 'feature3').withColumnRenamed('feature_interpret_val', 'feature3_interpret_absVal')
# grab the actual, model-assigned value for the feature
dfModelResults_f3 = dfModelResults_f3.join(dfModelResults_melted, (dfModelResults_f3.StudentId_external == dfModelResults_melted.StudentId)\
 & (dfModelResults_f3.feature3 == dfModelResults_melted.feature), how='inner')
dfModelResults_f3 = dfModelResults_f3.drop('StudentId', 'feature', 'feature3_interpret_absVal').withColumnRenamed('feature_interpret_val', 'feature3_interpret_val')
# join feature 3 table to the final dfModelResults_features table
dfModelResults_features = dfModelResults_features.join(dfModelResults_f3, dfModelResults_features.StudentId_external_pseudonym == dfModelResults_f3.StudentId_external, how='inner')
dfModelResults_features = dfModelResults_features.drop('StudentId_external')

In [26]:
# remove the rows already in the dfModelResults_features from the melted_abs table to find the next top feature
dfModelResults_melted_abs = dfModelResults_melted_abs.join(dfModelResults_features, (dfModelResults_melted_abs.StudentId_external == dfModelResults_features.StudentId_external_pseudonym)\
 & (dfModelResults_melted_abs.feature == dfModelResults_features.feature3), how='leftanti')
# extract the fourth-top feature based on the interpretML value per student 
dfModelResults_f4 = dfModelResults_melted_abs.groupBy('StudentId_external').max('feature_interpret_val')
dfModelResults_f4 = dfModelResults_f4.withColumnRenamed('StudentId_external', 'StudentId').withColumnRenamed('max(feature_interpret_val)', 'feature_val')
# grab the feature name
dfModelResults_f4 = dfModelResults_f4.join(dfModelResults_melted_abs, (dfModelResults_f4.StudentId == dfModelResults_melted_abs.StudentId_external) \
& (dfModelResults_f4.feature_val == dfModelResults_melted_abs.feature_interpret_val), how='inner')
dfModelResults_f4 = dfModelResults_f4.drop('StudentId', 'feature_val')
dfModelResults_f4 = dfModelResults_f4.withColumnRenamed('feature', 'feature4').withColumnRenamed('feature_interpret_val', 'feature4_interpret_absVal')
# grab the actual, model-assigned value for the feature
dfModelResults_f4 = dfModelResults_f4.join(dfModelResults_melted, (dfModelResults_f4.StudentId_external == dfModelResults_melted.StudentId)\
 & (dfModelResults_f4.feature4 == dfModelResults_melted.feature), how='inner')
dfModelResults_f4 = dfModelResults_f4.drop('StudentId', 'feature', 'feature4_interpret_absVal').withColumnRenamed('feature_interpret_val', 'feature4_interpret_val')
# join feature 4 table to the final dfModelResults_features table
dfModelResults_features = dfModelResults_features.join(dfModelResults_f4, dfModelResults_features.StudentId_external_pseudonym == dfModelResults_f4.StudentId_external, how='inner')
dfModelResults_features = dfModelResults_features.drop('StudentId_external')

In [27]:
# remove the rows already in the dfModelResults_features from the melted_abs table to find the next top feature
dfModelResults_melted_abs = dfModelResults_melted_abs.join(dfModelResults_features, (dfModelResults_melted_abs.StudentId_external == dfModelResults_features.StudentId_external_pseudonym)\
 & (dfModelResults_melted_abs.feature == dfModelResults_features.feature4), how='leftanti')
# extract the fifth-top feature based on the interpretML value per student 
dfModelResults_f5 = dfModelResults_melted_abs.groupBy('StudentId_external').max('feature_interpret_val')
dfModelResults_f5 = dfModelResults_f5.withColumnRenamed('StudentId_external', 'StudentId').withColumnRenamed('max(feature_interpret_val)', 'feature_val')
# grab the feature name
dfModelResults_f5 = dfModelResults_f5.join(dfModelResults_melted_abs, (dfModelResults_f5.StudentId == dfModelResults_melted_abs.StudentId_external) \
& (dfModelResults_f5.feature_val == dfModelResults_melted_abs.feature_interpret_val), how='inner')
dfModelResults_f5 = dfModelResults_f5.drop('StudentId', 'feature_val')
dfModelResults_f5 = dfModelResults_f5.withColumnRenamed('feature', 'feature5').withColumnRenamed('feature_interpret_val', 'feature5_interpret_absVal')
# grab the actual, model-assigned value for the feature
dfModelResults_f5 = dfModelResults_f5.join(dfModelResults_melted, (dfModelResults_f5.StudentId_external == dfModelResults_melted.StudentId)\
 & (dfModelResults_f5.feature5 == dfModelResults_melted.feature), how='inner')
dfModelResults_f5 = dfModelResults_f5.drop('StudentId', 'feature', 'feature5_interpret_absVal').withColumnRenamed('feature_interpret_val', 'feature5_interpret_val')
# join feature 5 table to the final dfModelResults_features table
dfModelResults_features = dfModelResults_features.join(dfModelResults_f5, dfModelResults_features.StudentId_external_pseudonym == dfModelResults_f5.StudentId_external, how='inner')
dfModelResults_features = dfModelResults_features.drop('StudentId_external')

In [17]:
#print((dfModelResults_features.count(), len(dfModelResults_features.columns)))

In [28]:
# join the model results of true classification, predicted classification, predicted score, and absentee probability to the final model results table
dfModelResults = spark.createDataFrame(explainable_df)
dfModelResults = dfModelResults.select('StudentId_external_pseudonym', 'true_class', 'predicted_class', 'predicted_score', 'absentee_probability')
dfModelResults = dfModelResults.withColumnRenamed('StudentId_external_pseudonym', 'StudentId')
dfModelResults_features = dfModelResults_features.join(dfModelResults, dfModelResults_features.StudentId_external_pseudonym == dfModelResults.StudentId, how='inner')
dfModelResults_features = dfModelResults_features.drop('StudentId')
# clean column names to write to stage 3 per delta file requirements
#dfModelResults = dfModelResults.select([F.col(col).alias(col.replace(' ', '')) for col in dfModelResults.columns])
display(dfModelResults_features.limit(10))

## 4.) Write this final "ModelResults_pseudo" table out to stage 3p

In [29]:
dfModelResults_features.coalesce(1).write.format('delta').mode('overwrite').save(oea.stage3p + '/chronic_absenteeism/ModelResults_pseudo')