# Predicting incorrect student answers from DataShop data
<p style="margin:30px">
    <img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

In this tutorial, we show how to predict whether a student will successfully answer a problem using a dataset from [CMU DataShop](https://pslcdatashop.web.cmu.edu/). While online courses are logistically efficient, their structure can make it harder for a teacher to understand how students are learning. To fill in those gaps, we can apply machine learning. However, building an accurate machine learning model requires extracting informative variables called **features**. Finding the right features is crucial both for reaching a satisfactory answer and for understanding what to do next. [Featuretools](http://www.featuretools.com) simplifies this process of **feature engineering**.

*If you're running this notebook yourself, please download the [geometry dataset](https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76) into the `data` folder in this repository. You will only need the `.txt` file. The infrastructure in this notebook will work with **any** DataShop dataset, but you will need to change the file name in the first cell.*

## Highlights
* Show how to import a DataShop dataset into Featuretools
* Show efficacy of automatic feature generation with these datasets
* Show how to make custom primitives for stacking

# Step 1: Load data
At the beginning of any project, it is worthwhile to take a moment to think about how your dataset is structured.

In these datasets the unique events come from `transactions`: places where a student interacts with a system. Each transaction has a `time-index`, the time at which information in a row becomes known. Furthermore, the columns of those transactions have variables that can be grouped together. 

For instance, there are only 59 distinct students for the 6778 transactions we have in the geometry dataset. Those students log in to the system and have individual sessions. We can break down problems and problem steps in a similar way.

Featuretools stores data in an `EntitySet`. This is an abstraction which allows us to hold on to not only the data itself, but also to metadata like relationships and column types.

We create an entityset structure using the `datashop_to_entityset` function in [utils](utils.py). If you're interested in how `datashop_to_entityset` is structured, there's an associated notebook [entityset_function](entityset_function.ipynb) which explains choices made in more detail.

In [1]:
# Note that each branch is a one -> many relationship

# schools       students     problems
#        \        |         /
#   classes   sessions   problem steps
#          \     |       /
#           transactions  -- attempts
#

import utils

filename = 'data/ds2174_tx_All_Data_3991_2017_1128_123859.txt'
es = utils.datashop_to_entityset(filename)
es

Entityset: Dataset
  Entities:
    problem_steps (shape = [78, 50])
    sessions (shape = [59, 3])
    students (shape = [59, 2])
    transactions (shape = [6778, 26])
    problems (shape = [20, 2])
    ...And 3 more
  Relationships:
    transactions.Step Name -> problem_steps.Step Name
    problem_steps.Problem Name -> problems.Problem Name
    transactions.Session Id -> sessions.Session Id
    sessions.Anon Student Id -> students.Anon Student Id
    transactions.Class -> classes.Class
    ...and 2 more

The `students` entity reflects this grouping: it has only 59 rows, one for each anonymous student ID.

In [2]:
es['students'].df.head(3)

| Anon Student Id | Anon Student Id | first_sessions_time |
|---|---|---|
| Stu_c0bf45c22dc46067350d304ce330067e | Stu_c0bf45c22dc46067350d304ce330067e | 1996-02-01 00:00:00 |
| Stu_af3a2f63bda8c1338556108cb8d519a0 | Stu_af3a2f63bda8c1338556108cb8d519a0 | 1996-02-01 00:00:02 |
| Stu_d7f18a5fa205a889b0c5b0b56a7127d3 | Stu_d7f18a5fa205a889b0c5b0b56a7127d3 | 1996-02-01 00:00:02 |


Featuretools lets us create new entities by grouping rows on categorical values. Through this process of *normalization* we have created 8 connected entities from an initial table of transactions. We can look at what is left in `transactions` after normalization.

In [3]:
es['transactions'].head(3)

| Transaction Id | Sample Name | Transaction Id | Session Id | Time | Time Zone | Duration (sec) | Student Response Type | Student Response Subtype | Tutor Response Type | Tutor Response Subtype | ... | Outcome | Selection | Action | Input | Feedback Text | Feedback Classification | Help Level | Total Num Hints | Class | End Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 499a0a18d7b6d96d4ee9c16d4bead6f2 | All Data | 499a0a18d7b6d96d4ee9c16d4bead6f2 | GEO-408d5ed7:10e14be5d3a:-8000 | 1996-02-01 00:00:00 | US/Eastern | 0 | ATTEMPT | | RESULT | | ... | 0 | (CIRCLE-AREA_A QUESTION1) | | | | | | | | 1996-02-01 00:00:00 |
| d398b66148a76c537cba816efe946b85 | All Data | d398b66148a76c537cba816efe946b85 | GEO-408d5ed7:10e14be5d3a:-8000 | 1996-02-01 00:00:01 | US/Eastern | 1 | ATTEMPT | | RESULT | | ... | 1 | (CIRCLE-AREA_A QUESTION1) | | | | | | | | 1996-02-01 00:00:02 |
| 1c133061306fd8e099eb0e4f2ac21430 | All Data | 1c133061306fd8e099eb0e4f2ac21430 | GEO-408d5ed7:10e14be5d3a:-6e40 | 1996-02-01 00:00:02 | US/Eastern | 0 | ATTEMPT | | RESULT | | ... | 1 | (AREA QUESTION1) | | | | | | | | 1996-02-01 00:00:02 |
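
The normalization described above can be sketched in plain pandas. This is only an illustration on a hypothetical four-row table, not the implementation used by `datashop_to_entityset` or by Featuretools. The column names `first_transactions_time` and `first_sessions_time` mirror those that appear in the outputs of this notebook; the data values are made up.

```python
import pandas as pd

# Hypothetical flat transaction log (toy data, not the DataShop file).
flat = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Session Id": ["s1", "s1", "s2", "s3"],
    "Anon Student Id": ["stu_a", "stu_a", "stu_a", "stu_b"],
    "Time": pd.to_datetime([
        "1996-02-01 00:00:00",
        "1996-02-01 00:00:01",
        "1996-02-01 00:00:05",
        "1996-02-01 00:00:02",
    ]),
})

# One row per session: keep the parent key and the first time it was seen.
sessions = (
    flat.groupby("Session Id", as_index=False)
        .agg({"Anon Student Id": "first", "Time": "min"})
        .rename(columns={"Time": "first_transactions_time"})
)

# One row per student, derived from the sessions table.
students = (
    sessions.groupby("Anon Student Id", as_index=False)
            .agg({"first_transactions_time": "min"})
            .rename(columns={"first_transactions_time": "first_sessions_time"})
)

# The student column now lives on `sessions`, so it can leave `transactions`.
transactions = flat.drop(columns=["Anon Student Id"])
```

Each child table keeps only the key of its parent (`transactions` keeps `Session Id`, `sessions` keeps `Anon Student Id`), which is what makes every branch a one-to-many relationship.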


# Step 2: Building Features
Next, we calculate a feature matrix on the `transactions` entity to try to predict the outcome of a given transaction. It's at this step that our previous setup pays off: we can automatically calculate features using data from the whole `EntitySet`. 

## Cutoff times
We are going to be generating features and doing predictive modeling on time-sensitive data. That comes with a high risk of label leakage. 

In this case, we are predicting whether a student will get a particular problem correct. For a fixed problem, the feature "There exists an attempt number three" would be highly predictive of the result on attempts one and two: there can only be a third attempt if the first two attempts were wrong! Storing future `attempt` information in a feature used to predict `Outcome` would therefore yield higher test accuracy than the model deserves. It is not acceptable to use the feature "There exists an attempt number three" while predicting attempts one and two, because it contains information that cannot be known at that point in time.

To circumvent that, we introduce the notion of [cutoff_times](https://docs.featuretools.com/automated_feature_engineering/handling_time.html). A `cutoff_time` has an index column and a datetime column indicating the last acceptable date we can use while generating features for a historical training example. We can also add in a label, which will be passed through Deep Feature Synthesis ([DFS](https://docs.featuretools.com/automated_feature_engineering/afe.html)) untouched so we can recover it later.

Setting cutoff times guards against accidentally using future data, controls how many predictions we make, and controls which data is used when calculating features.
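
To make the leakage risk concrete, here is a minimal pandas sketch on hypothetical toy data. It compares a feature computed from all rows with one computed only from rows before each example's cutoff. The helper `percent_correct_before` is an illustrative name, not a Featuretools function, and using a strict "before" comparison is a simplifying assumption.

```python
import pandas as pd

# Toy transaction log (hypothetical data, not from the DataShop file).
toy = pd.DataFrame({
    "student": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime([
        "1996-02-01 00:00:00",
        "1996-02-01 00:00:05",
        "1996-02-01 00:00:09",
        "1996-02-01 00:00:02",
        "1996-02-01 00:00:07",
    ]),
    "outcome": [0, 1, 1, 1, 0],
})


def percent_correct_before(log, student, cutoff):
    """Share of correct answers by `student` strictly before `cutoff`."""
    past = log[(log["student"] == student) & (log["time"] < cutoff)]
    return past["outcome"].mean()  # NaN when the student has no history yet


# Leaky version: every row sees the student's full history,
# including future rows and its own label.
leaky = toy.groupby("student")["outcome"].transform("mean")

# Time-aware version: each row only sees what was known before its cutoff.
safe = pd.Series(
    [percent_correct_before(toy, s, t) for s, t in zip(toy["student"], toy["time"])],
    index=toy.index,
)
```

In the leaky column, the first row of student `a` already "knows" that two later answers were correct. In the time-aware column that row has no history at all, which is what a model would actually see at prediction time.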


In [4]:
cutoff_times = es['transactions'].df[['Transaction Id', 'End Time', 'Outcome']]
cutoff_times.head()

| Transaction Id | Transaction Id | End Time | Outcome |
|---|---|---|---|
| 499a0a18d7b6d96d4ee9c16d4bead6f2 | 499a0a18d7b6d96d4ee9c16d4bead6f2 | 1996-02-01 00:00:00 | 0 |
| d398b66148a76c537cba816efe946b85 | d398b66148a76c537cba816efe946b85 | 1996-02-01 00:00:02 | 1 |
| 1c133061306fd8e099eb0e4f2ac21430 | 1c133061306fd8e099eb0e4f2ac21430 | 1996-02-01 00:00:02 | 1 |
| 3ce42cc528ee77fb2db3b84c8b00ef39 | 3ce42cc528ee77fb2db3b84c8b00ef39 | 1996-02-01 00:00:02 | 1 |
| 82e2897fa00c8895bea39726e8dd303a | 82e2897fa00c8895bea39726e8dd303a | 1996-02-01 00:00:02 | 0 |


With that in hand, we can guarantee that future values of `Outcome` won't be used in any calculation, because we set the time index of `Outcome` to be after the cutoff time.

From there, we can call `ft.dfs` to generate our features and feature matrix. Deep Feature Synthesis creates features using reusable functions ([Primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html)). The algorithm attempts to combine primitives together with actual data to create a feature matrix. Here, we'll use the primitives `Sum`, `Mean`, `PercentTrue` and `Hour`. On an ordinary laptop, you should expect the following cell to take roughly 40 minutes to complete as there are more than 3000 unique cutoff times. For faster results, uncomment the approximate line. 

In [5]:
import featuretools as ft
from featuretools.primitives import Sum, Mean, Hour, PercentTrue
from featuretools.selection import remove_low_information_features

import pandas as pd
pd.options.display.max_columns = 500

fm, features = ft.dfs(entityset=es,
                      target_entity='transactions',
                      agg_primitives=[Sum, Mean, PercentTrue],
                      trans_primitives=[Hour],
                      max_depth=3,
                      # approximate='2m',
                      cutoff_time=cutoff_times[1000:],
                      verbose=True)

# Encode the feature matrix using One-Hot encoding
fm_enc, f_enc = ft.encode_features(fm, features)
fm_enc = fm_enc.fillna(0)
fm_enc = remove_low_information_features(fm_enc)

# Pop the label
label = fm_enc.pop('Outcome')

fm.tail()

Built 227 features
Elapsed: 37:51 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 3068/3068 cutoff times


Unnamed: 0_level_0,Class,Selection,Attempt At Step,Student Response Type,Tutor Response Subtype,Problem View,Sample Name,Feedback Classification,Level (Unit),Feedback Text,Session Id,Input,Action,Student Response Subtype,Tutor Response Type,Time Zone,Step Name,Help Level,Total Num Hints,problem_steps.KC (Khushboo),problem_steps.KC Category (KRE_circle_area),problem_steps.CF (Factor required),problem_steps.CF (Factor repeat),problem_steps.CF (Factor figure-part),problem_steps.KC Category (Geometry),problem_steps.KC (Zhulin_Textbook11_SquareRectMerge),problem_steps.KC Category (NewModel),problem_steps.KC Category (Textbook),problem_steps.KC Category (NNEWWW),problem_steps.CF (Factor cir-quad),problem_steps.KC Category (Unique-step),problem_steps.CF (Factor circle-formula),problem_steps.CF (Factor base-formula-p),problem_steps.KC (Textbook),problem_steps.KC Category (MJB-SQRECT-Merge),problem_steps.KC (Unique-step),problem_steps.KC Category (Single-KC),problem_steps.CF (Factor add-or-m),problem_steps.KC (new KC model name),problem_steps.KC (Geometry),problem_steps.KC Category (Textbook-test),problem_steps.KC Category (new KC model name),problem_steps.CF (Factor trapezoid-part),HOUR(End Time),problem_steps.CF (Factor circle-given),problem_steps.KC (KRE_circle_area),problem_steps.CF (Factor base-or-height),problem_steps.KC (Single-KC),problem_steps.CF (Factor embeddedness),problem_steps.CF (Factor circle-goal),problem_steps.KC (NNEWWW),problem_steps.KC (New),problem_steps.KC Category (New),problem_steps.KC Category (MyKC),problem_steps.KC (MyKC),problem_steps.KC (Textbook-test),problem_steps.KC (NewModel),problem_steps.KC Category (Khushboo),problem_steps.CF (Factor embedd3-tri-reg_prob_fix),problem_steps.CF (Factor basic-shape),problem_steps.CF (Factor non-standard-orientation-or-shape),sessions.Anon Student Id,HOUR(Time),problem_steps.CF (Factor parallelogram),problem_steps.KC (MJB-SQRECT-Merge),problem_steps.KC Category 
(Zhulin_Textbook11_SquareRectMerge),problem_steps.CF (Factor figure-type),HOUR(Problem Start Time),classes.School,problem_steps.Problem Name,problem_steps.CF (Factor parallelogram-type),problem_steps.CF (Factor backward),sessions.SUM(transactions.Tutor Response Subtype),attempts.MEAN(transactions.Input),problem_steps.MEAN(transactions.Problem View),attempts.SUM(transactions.Help Level),problem_steps.SUM(transactions.Action),problem_steps.HOUR(first_transactions_time),problem_steps.SUM(transactions.Total Num Hints),problem_steps.SUM(transactions.Feedback Text),problem_steps.MEAN(transactions.Help Level),problem_steps.MEAN(transactions.Tutor Response Subtype),problem_steps.MEAN(transactions.Action),sessions.PERCENT_TRUE(transactions.Outcome),attempts.MEAN(transactions.Feedback Classification),problem_steps.MEAN(transactions.Feedback Text),sessions.SUM(transactions.Total Num Hints),attempts.SUM(transactions.Feedback Text),attempts.MEAN(transactions.Action),problem_steps.SUM(transactions.Duration (sec)),attempts.MEAN(transactions.Problem View),sessions.MEAN(transactions.Is Last Attempt),sessions.HOUR(first_transactions_time),attempts.SUM(transactions.Total Num Hints),sessions.MEAN(transactions.Input),sessions.SUM(transactions.Help Level),problem_steps.PERCENT_TRUE(transactions.Outcome),attempts.SUM(transactions.Input),problem_steps.MEAN(transactions.Student Response Subtype),sessions.MEAN(transactions.Duration (sec)),attempts.SUM(transactions.Tutor Response Subtype),problem_steps.MEAN(transactions.Input),sessions.MEAN(transactions.Problem View),problem_steps.MEAN(transactions.Duration (sec)),attempts.SUM(transactions.Action),attempts.MEAN(transactions.Duration (sec)),attempts.SUM(transactions.Duration (sec)),attempts.SUM(transactions.Problem View),sessions.SUM(transactions.Problem View),sessions.MEAN(transactions.Tutor Response Subtype),problem_steps.SUM(transactions.Student Response Subtype),attempts.MEAN(transactions.Tutor Response 
Subtype),attempts.MEAN(transactions.Total Num Hints),sessions.MEAN(transactions.Student Response Subtype),attempts.SUM(transactions.Student Response Subtype),sessions.SUM(transactions.Is Last Attempt),sessions.MEAN(transactions.Action),sessions.SUM(transactions.Feedback Text),attempts.SUM(transactions.Is Last Attempt),sessions.SUM(transactions.Feedback Classification),problem_steps.SUM(transactions.Feedback Classification),attempts.SUM(transactions.Feedback Classification),problem_steps.MEAN(transactions.Is Last Attempt),problem_steps.SUM(transactions.Tutor Response Subtype),problem_steps.MEAN(transactions.Total Num Hints),problem_steps.SUM(transactions.Input),sessions.SUM(transactions.Input),sessions.MEAN(transactions.Total Num Hints),sessions.MEAN(transactions.Feedback Text),attempts.MEAN(transactions.Is Last Attempt),attempts.MEAN(transactions.Help Level),sessions.SUM(transactions.Action),sessions.MEAN(transactions.Feedback Classification),attempts.PERCENT_TRUE(transactions.Outcome),sessions.SUM(transactions.Student Response Subtype),attempts.MEAN(transactions.Student Response Subtype),problem_steps.SUM(transactions.Help Level),sessions.MEAN(transactions.Help Level),sessions.SUM(transactions.Duration (sec)),problem_steps.MEAN(transactions.Feedback Classification),problem_steps.SUM(transactions.Problem View),problem_steps.SUM(transactions.Is Last Attempt),attempts.MEAN(transactions.Feedback Text),problem_steps.problems.MEAN(problem_steps.KC Category (KRE_circle_area)),sessions.students.SUM(transactions.Feedback Text),problem_steps.problems.MEAN(problem_steps.KC Category (Single-KC)),problem_steps.problems.SUM(transactions.Feedback Text),sessions.students.HOUR(first_sessions_time),sessions.students.MEAN(transactions.Help Level),problem_steps.problems.SUM(problem_steps.KC Category (Unique-step)),problem_steps.problems.HOUR(first_problem_steps_time),sessions.students.SUM(transactions.Tutor Response Subtype),problem_steps.problems.MEAN(problem_steps.KC Category 
(NewModel)),problem_steps.problems.MEAN(transactions.Is Last Attempt),problem_steps.problems.MEAN(problem_steps.KC Category (MyKC)),problem_steps.problems.SUM(problem_steps.KC Category (Zhulin_Textbook11_SquareRectMerge)),sessions.students.MEAN(transactions.Feedback Classification),sessions.students.SUM(transactions.Action),problem_steps.problems.SUM(transactions.Duration (sec)),problem_steps.problems.MEAN(transactions.Student Response Subtype),problem_steps.problems.SUM(problem_steps.CF (Factor non-standard-orientation-or-shape)),problem_steps.problems.MEAN(transactions.Feedback Text),sessions.students.MEAN(transactions.Duration (sec)),problem_steps.problems.MEAN(transactions.Duration (sec)),problem_steps.problems.SUM(problem_steps.CF (Factor embedd3-tri-reg_prob_fix)),problem_steps.problems.MEAN(problem_steps.KC Category (Geometry)),problem_steps.problems.MEAN(problem_steps.KC Category (Textbook)),problem_steps.problems.SUM(problem_steps.KC Category (Textbook)),problem_steps.problems.SUM(problem_steps.CF (Factor backward)),sessions.students.MEAN(transactions.Action),problem_steps.problems.SUM(transactions.Problem View),problem_steps.problems.MEAN(transactions.Action),sessions.students.MEAN(transactions.Is Last Attempt),problem_steps.problems.SUM(transactions.Input),problem_steps.problems.SUM(transactions.Student Response Subtype),problem_steps.problems.SUM(problem_steps.KC Category (Geometry)),problem_steps.problems.MEAN(problem_steps.CF (Factor non-standard-orientation-or-shape)),sessions.students.SUM(transactions.Total Num Hints),sessions.students.PERCENT_TRUE(transactions.Outcome),sessions.students.MEAN(transactions.Student Response Subtype),problem_steps.problems.MEAN(problem_steps.CF (Factor parallelogram)),problem_steps.problems.MEAN(problem_steps.KC Category (MJB-SQRECT-Merge)),problem_steps.problems.SUM(problem_steps.KC Category (Khushboo)),problem_steps.problems.PERCENT_TRUE(transactions.Outcome),problem_steps.problems.SUM(problem_steps.CF (Factor 
parallelogram)),problem_steps.problems.SUM(problem_steps.KC Category (Textbook-test)),problem_steps.problems.SUM(problem_steps.KC Category (NewModel)),sessions.students.SUM(transactions.Input),problem_steps.problems.SUM(problem_steps.KC Category (MyKC)),problem_steps.problems.MEAN(transactions.Feedback Classification),problem_steps.problems.SUM(transactions.Is Last Attempt),sessions.students.MEAN(transactions.Tutor Response Subtype),sessions.students.SUM(transactions.Is Last Attempt),sessions.students.SUM(transactions.Duration (sec)),sessions.students.SUM(transactions.Problem View),problem_steps.problems.SUM(transactions.Total Num Hints),problem_steps.problems.SUM(transactions.Tutor Response Subtype),problem_steps.problems.SUM(problem_steps.KC Category (new KC model name)),problem_steps.problems.MEAN(transactions.Total Num Hints),problem_steps.problems.MEAN(transactions.Input),sessions.students.MEAN(transactions.Feedback Text),problem_steps.problems.SUM(problem_steps.KC Category (KRE_circle_area)),problem_steps.problems.MEAN(transactions.Tutor Response Subtype),problem_steps.problems.MEAN(problem_steps.KC Category (Textbook-test)),problem_steps.problems.MEAN(problem_steps.CF (Factor embedd3-tri-reg_prob_fix)),problem_steps.problems.MEAN(problem_steps.KC Category (NNEWWW)),problem_steps.problems.SUM(transactions.Help Level),problem_steps.problems.MEAN(problem_steps.KC Category (Khushboo)),problem_steps.problems.MEAN(problem_steps.KC Category (New)),problem_steps.problems.MEAN(problem_steps.KC Category (new KC model name)),sessions.students.SUM(transactions.Feedback Classification),problem_steps.problems.SUM(problem_steps.KC Category (New)),problem_steps.problems.MEAN(transactions.Problem View),sessions.students.MEAN(transactions.Problem View),sessions.students.MEAN(transactions.Total Num Hints),problem_steps.problems.MEAN(problem_steps.CF (Factor backward)),problem_steps.problems.MEAN(problem_steps.KC Category 
(Unique-step)),sessions.students.MEAN(transactions.Input),problem_steps.problems.SUM(transactions.Feedback Classification),problem_steps.problems.MEAN(problem_steps.KC Category (Zhulin_Textbook11_SquareRectMerge)),sessions.students.SUM(transactions.Student Response Subtype),problem_steps.problems.SUM(problem_steps.KC Category (Single-KC)),problem_steps.problems.MEAN(transactions.Help Level),problem_steps.problems.SUM(transactions.Action),problem_steps.problems.SUM(problem_steps.KC Category (MJB-SQRECT-Merge)),problem_steps.problems.SUM(problem_steps.KC Category (NNEWWW)),sessions.students.SUM(transactions.Help Level),Outcome
Transaction Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 
100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1,Unnamed: 141_level_1,Unnamed: 142_level_1,Unnamed: 143_level_1,Unnamed: 144_level_1,Unnamed: 145_level_1,Unnamed: 146_level_1,Unnamed: 147_level_1,Unnamed: 148_level_1,Unnamed: 149_level_1,Unnamed: 150_level_1,Unnamed: 151_level_1,Unnamed: 152_level_1,Unnamed: 153_level_1,Unnamed: 154_level_1,Unnamed: 155_level_1,Unnamed: 156_level_1,Unnamed: 157_level_1,Unnamed: 158_level_1,Unnamed: 159_level_1,Unnamed: 160_level_1,Unnamed: 161_level_1,Unnamed: 162_level_1,Unnamed: 163_level_1,Unnamed: 164_level_1,Unnamed: 165_level_1,Unnamed: 166_level_1,Unnamed: 167_level_1,Unnamed: 168_level_1,Unnamed: 169_level_1,Unnamed: 170_level_1,Unnamed: 171_level_1,Unnamed: 172_level_1,Unnamed: 173_level_1,Unnamed: 174_level_1,Unnamed: 175_level_1,Unnamed: 176_level_1,Unnamed: 177_level_1,Unnamed: 178_level_1,Unnamed: 179_level_1,Unnamed: 180_level_1,Unnamed: 181_level_1,Unnamed: 182_level_1,Unnamed: 183_level_1,Unnamed: 184_level_1,Unnamed: 185_level_1,Unnamed: 186_level_1,Unnamed: 187_level_1,Unnamed: 188_level_1,Unnamed: 189_level_1,Unnamed: 190_level_1,Unnamed: 191_level_1,Unnamed: 192_level_1,Unnamed: 193_level_1,Unnamed: 194_level_1,Unnamed: 
195_level_1,Unnamed: 196_level_1,Unnamed: 197_level_1,Unnamed: 198_level_1,Unnamed: 199_level_1,Unnamed: 200_level_1,Unnamed: 201_level_1,Unnamed: 202_level_1,Unnamed: 203_level_1,Unnamed: 204_level_1,Unnamed: 205_level_1,Unnamed: 206_level_1,Unnamed: 207_level_1,Unnamed: 208_level_1,Unnamed: 209_level_1,Unnamed: 210_level_1,Unnamed: 211_level_1,Unnamed: 212_level_1,Unnamed: 213_level_1,Unnamed: 214_level_1,Unnamed: 215_level_1,Unnamed: 216_level_1,Unnamed: 217_level_1,Unnamed: 218_level_1,Unnamed: 219_level_1,Unnamed: 220_level_1,Unnamed: 221_level_1,Unnamed: 222_level_1,Unnamed: 223_level_1,Unnamed: 224_level_1,Unnamed: 225_level_1,Unnamed: 226_level_1,Unnamed: 227_level_1,Unnamed: 228_level_1
a8899b5cab6774db36a200d174df1546,,(SQUARE-AREA QUESTION2),1,ATTEMPT,,2,All Data,,Area,,GEO-408d5ed7:10e14be5d3a:-63b8,,,,RESULT,US/Eastern,(SQUARE-AREA QUESTION2),,,area,,additional,repeat,area,,square-rect-area,,,,quad,,0,f,square-area,,KC59,,0,square-rect-area,Geometry,,,0,,0,square-area,0,Single-KC,embedded,0,square-area,square-rect-area,,,square-area,rectangle-area,square-rect-area,,0,parallelogram,0,Stu_ad3610752c4af1c3cac6638ef588e02b,1,1,sq_rect-area,,square,1,,POGS,square,0,0.0,,1.322034,0.0,0.0,0,0.0,0.0,,,,0.836634,,,0.0,0.0,,916.0,1.432464,0.585,0,0.0,,0.0,0.932203,0.0,,24.105,0.0,,1.237624,15.793103,0.0,13.075897,66674.0,7307,250,,0.0,,,,0.0,117.0,,0.0,2769.0,0.0,0.0,0.0,0.758621,0.0,,0.0,0.0,,,0.543048,,0.0,,0.751029,0.0,,0.0,,4821.0,,78,44.0,,,0.0,,0.0,0,,0.0,0,0.0,,0.62069,,0.0,,0.0,5601.0,,0,,24.105,11.361055,0,,,0.0,3,,681,,0.585,0.0,0.0,0.0,0,0.0,0.836634,,0.25,,0.0,0.856566,2,0.0,0.0,0.0,0.0,,306.0,,117.0,4821.0,250,0.0,0.0,0.0,,,,0.0,,,0,,0.0,,,,0.0,0.0,1.375758,1.237624,,0.375,,,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,1
2df16da47b60e3d15e098333b13ddf86,,(SQUARE-AREA QUESTION3),1,ATTEMPT,,2,All Data,,Area,,GEO-408d5ed7:10e14be5d3a:-63b8,,,,RESULT,US/Eastern,(SQUARE-AREA QUESTION3),,,area,,additional,repeat,area,,square-rect-area,,,,quad,,0,f,square-area,,KC50,,0,square-rect-area,Geometry,,,0,,0,square-area,0,Single-KC,embedded,0,square-area,square-rect-area,,,square-area,rectangle-area,square-rect-area,,0,parallelogram,0,Stu_ad3610752c4af1c3cac6638ef588e02b,1,1,sq_rect-area,,square,1,,POGS,square,0,0.0,,1.305085,0.0,0.0,0,0.0,0.0,,,,0.841584,,,0.0,0.0,,677.0,1.432464,0.587065,0,0.0,,0.0,0.932203,0.0,,24.0199,0.0,,1.237624,11.672414,0.0,13.074706,66681.0,7307,250,,0.0,,,,0.0,118.0,,0.0,2770.0,0.0,0.0,0.0,0.758621,0.0,,0.0,0.0,,,0.543137,,0.0,,0.751225,0.0,,0.0,,4828.0,,77,44.0,,,0.0,,0.0,0,,0.0,0,0.0,,0.621457,,0.0,,0.0,5608.0,,0,,24.0199,11.352227,0,,,0.0,3,,681,,0.587065,0.0,0.0,0.0,0,0.0,0.841584,,0.25,,0.0,0.858586,2,0.0,0.0,0.0,0.0,,307.0,,118.0,4828.0,250,0.0,0.0,0.0,,,,0.0,,,0,,0.0,,,,0.0,0.0,1.375758,1.237624,,0.375,,,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,1
ac8791497364a18b41957a72e6e1e6ec,,(SCRAP-METAL-AREA QUESTION1),1,ATTEMPT,,2,All Data,,Area,,GEO-408d5ed7:10e14be5d3a:-63b8,,,,RESULT,US/Eastern,(SCRAP-METAL-AREA QUESTION1),,,compose-by-addition,,additional,initial,area-difference,,compose-by-addition,,,,0,,0,no-f,compose-by-addition,,KC94,,a,compose-by-addition,Geometry,,,0,,0,compose-by-addition,0,Single-KC,embedded,0,compose-by-addition,compose-by-addition,,,compose-by-addition,compose-by-addition,compose-by-addition,,0,0,0,Stu_ad3610752c4af1c3cac6638ef588e02b,1,0,compose-by-addition,,0,1,,POGS,0,1,0.0,,1.333333,0.0,0.0,0,0.0,0.0,,,,0.842365,,,0.0,0.0,,677.0,1.432575,0.589109,0,0.0,,0.0,0.72381,0.0,,23.935644,0.0,,1.241379,6.509615,0.0,13.073515,66688.0,7309,252,,0.0,,,,0.0,119.0,,0.0,2771.0,0.0,0.0,0.0,0.528846,0.0,,0.0,0.0,,,0.543227,,0.0,,0.751274,0.0,,0.0,,4835.0,,140,55.0,,,0.0,,0.0,0,,0.0,0,0.0,,0.622222,,0.0,,0.0,5615.0,,0,,23.935644,11.343434,0,,,0.0,3,,683,,0.589109,0.0,0.0,0.0,0,0.0,0.842365,,0.25,,0.0,0.858871,2,0.0,0.0,0.0,0.0,,308.0,,119.0,4835.0,252,0.0,0.0,0.0,,,,0.0,,,0,,0.0,,,,0.0,0.0,1.377016,1.241379,,0.375,,,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,1
ef4433fd971485c7ffb3c9f15d9cfcf1,,(SCRAP-METAL-AREA QUESTION2),1,ATTEMPT,,2,All Data,,Area,,GEO-408d5ed7:10e14be5d3a:-63b8,,,,RESULT,US/Eastern,(SCRAP-METAL-AREA QUESTION2),,,compose-by-addition,,additional,initial,area-difference,,compose-by-addition,,,,0,,0,no-f,compose-by-addition,,KC33,,a,compose-by-addition,Geometry,,,0,,0,compose-by-addition,0,Single-KC,embedded,0,compose-by-addition,compose-by-addition,,,compose-by-addition,compose-by-addition,compose-by-addition,,0,0,0,Stu_ad3610752c4af1c3cac6638ef588e02b,1,0,compose-by-addition,,0,1,,POGS,0,1,0.0,,1.368421,0.0,0.0,0,0.0,0.0,,,,0.843137,,,0.0,0.0,,1354.0,1.432687,0.591133,0,0.0,,0.0,0.810526,0.0,,23.857143,0.0,,1.245098,14.404255,0.0,13.072521,66696.0,7311,254,,0.0,,,,0.0,120.0,,0.0,2772.0,0.0,0.0,0.0,0.595745,0.0,,0.0,0.0,,,0.543316,,0.0,,0.751323,0.0,,0.0,,4843.0,,130,56.0,,,0.0,,0.0,0,,0.0,0,0.0,,0.622984,,0.0,,0.0,5623.0,,0,,23.857143,11.336694,0,,,0.0,3,,685,,0.591133,0.0,0.0,0.0,0,0.0,0.843137,,0.25,,0.0,0.859155,2,0.0,0.0,0.0,0.0,,309.0,,120.0,4843.0,254,0.0,0.0,0.0,,,,0.0,,,0,,0.0,,,,0.0,0.0,1.37827,1.245098,,0.375,,,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,1
b1ec2f9894ea80eeb7d01763976ba012,,(SCRAP-METAL-AREA QUESTION3),1,ATTEMPT,,2,All Data,,Area,,GEO-408d5ed7:10e14be5d3a:-63b8,,,,RESULT,US/Eastern,(SCRAP-METAL-AREA QUESTION3),,,compose-by-addition,,additional,initial,area-difference,,compose-by-addition,,,,0,,0,no-f,compose-by-addition,,KC77,,a,compose-by-addition,Geometry,,,0,,0,compose-by-addition,0,Single-KC,embedded,0,compose-by-addition,compose-by-addition,,,compose-by-addition,compose-by-addition,compose-by-addition,,0,0,0,Stu_ad3610752c4af1c3cac6638ef588e02b,2,0,compose-by-addition,,0,1,,POGS,0,1,0.0,,1.352273,0.0,0.0,0,0.0,0.0,,,,0.843902,,,0.0,0.0,,1386.0,1.432798,0.593137,0,0.0,,0.0,0.875,0.0,,23.794118,0.0,,1.24878,15.931034,0.0,13.072114,66707.0,7313,256,,0.0,,,,0.0,121.0,,0.0,2773.0,0.0,0.0,0.0,0.643678,0.0,,0.0,0.0,,,0.543406,,0.0,,0.751371,0.0,,0.0,,4854.0,,119,56.0,,,0.0,,0.0,0,,0.0,0,0.0,,0.623742,,0.0,,0.0,5634.0,,0,,23.794118,11.336016,0,,,0.0,3,,687,,0.593137,0.0,0.0,0.0,0,0.0,0.843902,,0.25,,0.0,0.859438,2,0.0,0.0,0.0,0.0,,310.0,,121.0,4854.0,256,0.0,0.0,0.0,,,,0.0,,,0,,0.0,,,,0.0,0.0,1.379518,1.24878,,0.375,,,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,1


Above, you can scroll to the right to see the 227 features we created. If you look at the column names, you can see that we've done more than individually apply primitives one at a time to the raw data. Features were stacked and combined across entities in an exhaustive way. Using Deep Feature Synthesis is powerful because it greatly increases the likelihood of finding important features while decreasing the workload of the data scientist.
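
To show what a stacked feature name such as `sessions.students.PERCENT_TRUE(transactions.Outcome)` means, the sketch below rebuilds two of these columns by hand with pandas on hypothetical toy tables. It is not how Featuretools computes them internally, and for brevity it ignores cutoff times, so it should not be used as-is for modeling.

```python
import pandas as pd

# Hypothetical toy tables mirroring the transactions -> sessions -> students chain.
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Session Id": ["s1", "s1", "s2", "s3"],
    "Outcome": [True, False, True, True],
})
sessions = pd.DataFrame({
    "Session Id": ["s1", "s2", "s3"],
    "Anon Student Id": ["stu_a", "stu_a", "stu_b"],
})

# Depth 1: aggregate transactions up to their session.
per_session = (
    transactions.groupby("Session Id")["Outcome"].mean()
    .rename("sessions.PERCENT_TRUE(transactions.Outcome)")
)

# Depth 2: aggregate every transaction of a student, across all sessions.
tx_with_student = transactions.merge(sessions, on="Session Id")
per_student = (
    tx_with_student.groupby("Anon Student Id")["Outcome"].mean()
    .rename("sessions.students.PERCENT_TRUE(transactions.Outcome)")
)

# Bring both aggregates back down to one row per transaction.
features = (
    tx_with_student
    .join(per_session, on="Session Id")
    .join(per_student, on="Anon Student Id")
    .set_index("Transaction Id")
)
```

Reading a feature name from right to left gives the recipe: take `Outcome` from `transactions`, apply `PERCENT_TRUE` at the `students` level, then follow the `sessions` relationship back to the target row.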

# Step 3: Making predictions

Here we split the data into two parts using `train_test_split` from scikit-learn. Note that we don't want the splitter to shuffle our data, since shuffling time-sensitive data risks leaking labels from the future into the training set.
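
As a small check of that behaviour, the sketch below runs `train_test_split` with `shuffle=False` on a hypothetical, time-sorted toy frame. It assumes the rows are already in chronological order, as they are for a feature matrix sorted by cutoff time.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature matrix whose rows are already sorted by time (hypothetical data).
times = pd.date_range("1996-02-01", periods=8, freq="min")
X = pd.DataFrame({"feature": range(8)}, index=times)
y = pd.Series([0, 1, 1, 0, 1, 1, 0, 1], index=times)

# With shuffle=False the first rows become the training set and the
# last rows become the test set, so the split is chronological.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

latest_train = X_train.index.max()
earliest_test = X_test.index.min()
```

Because every training row precedes every test row, the model is evaluated only on examples that come after everything it was trained on.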

We can do feature selection with [Recursive Feature Elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html). This recursively removes features by checking feature importances (according to some model) with smaller and smaller feature sets. Here we'll set `RFE` to select 20 features.  
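
The elimination loop can be tried in isolation on synthetic data. The sketch below is independent of the DataShop feature matrix: the dataset comes from scikit-learn's `make_classification`, and the sizes (10 features, 3 kept, 25 trees) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Drop the least important feature one at a time (step=1) until 3 remain.
selector = RFE(
    RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=3,
    step=1,
)
selector.fit(X, y)

# support_ is a boolean mask over columns; ranking_ is 1 for kept features.
kept = [i for i, keep in enumerate(selector.support_) if keep]
```

The boolean mask in `selector.support_` is what lets a call such as `X_train.iloc[:, selector.support_]` keep only the selected columns.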

In [6]:
from sklearn.model_selection import TimeSeriesSplit, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from utils import feature_importances
from sklearn.feature_selection import RFE

# 1. Split X and y into a train and test set
X_train, X_test, y_train, y_test = train_test_split(fm_enc, label, shuffle=False)

# 2. Select features using RFE
clf = RandomForestClassifier()
estimator = clf
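# Pass the number of features by keyword; recent scikit-learn releases
# no longer accept it as a positional argument.
selector = RFE(estimator, n_features_to_select=20, step=1)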
selector = selector.fit(X_train, y_train)
X_train.iloc[:, selector.support_].tail()

Transaction Id,sessions.PERCENT_TRUE(transactions.Outcome),problem_steps.SUM(transactions.Duration (sec)),sessions.MEAN(transactions.Is Last Attempt),problem_steps.PERCENT_TRUE(transactions.Outcome),sessions.MEAN(transactions.Problem View),problem_steps.MEAN(transactions.Duration (sec)),attempts.MEAN(transactions.Duration (sec)),attempts.SUM(transactions.Duration (sec)),problem_steps.MEAN(transactions.Is Last Attempt),attempts.PERCENT_TRUE(transactions.Outcome),sessions.SUM(transactions.Duration (sec)),problem_steps.SUM(transactions.Problem View),problem_steps.problems.MEAN(transactions.Is Last Attempt),problem_steps.problems.SUM(transactions.Duration (sec)),sessions.students.MEAN(transactions.Duration (sec)),problem_steps.problems.MEAN(transactions.Duration (sec)),sessions.students.MEAN(transactions.Is Last Attempt),sessions.students.PERCENT_TRUE(transactions.Outcome),problem_steps.problems.PERCENT_TRUE(transactions.Outcome),sessions.students.SUM(transactions.Problem View)
1cdbd9f38140e09dc7fa4e642c0b44f1,0.744681,1290.0,0.359712,0.83237,1.787234,7.5,12.038432,47612.0,0.75,0.745972,1387.0,181,0.512195,7486.0,9.978417,5.705793,0.359712,0.744681,0.786474,252
387f37198680f1e9e6d449a9c62ef8d3,0.767347,477.0,0.360656,0.609375,1.440816,7.571429,12.038432,47612.0,0.444444,0.745972,1958.0,74,0.507143,1239.0,8.02459,8.85,0.360656,0.767347,0.702128,353
aa77623c3440cb105a939a1aca1d4fe2,0.753846,4291.0,0.4375,0.765182,1.907692,4.351927,3.707234,4356.0,0.425963,0.974511,1740.0,1486,0.511805,7491.0,13.59375,5.705255,0.4375,0.753846,0.786474,248
f4812c4adc51484bf1cdfc3be42a87c0,0.820652,453.0,0.502732,0.837209,1.929348,10.785714,12.036391,47628.0,0.619048,0.746036,1701.0,47,0.580392,2512.0,9.295082,9.85098,0.502732,0.820652,0.829457,355
ff861aa2fd87d304cb0759526cc1ed84,0.744856,565.0,0.344398,0.822222,1.205761,13.139535,3.704932,4357.0,0.627907,0.974533,1947.0,51,0.582031,2523.0,8.078838,9.855469,0.344398,0.744856,0.833333,293
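
To make the idea of recursive elimination more concrete, here is a hand-rolled sketch of the loop: fit a model, drop the least important feature, and repeat until the target count is reached. It is a simplification for illustration, not scikit-learn's `RFE` implementation. The synthetic data, the `eliminate_features` helper, and the forest settings are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def eliminate_features(X, y, n_keep):
    """Hypothetical helper: drop the weakest feature until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = RandomForestClassifier(n_estimators=25, random_state=0)
        model.fit(X[:, remaining], y)
        # Position (within `remaining`) of the least important feature.
        weakest = int(np.argmin(model.feature_importances_))
        del remaining[weakest]
    return remaining


# Synthetic data standing in for the encoded feature matrix.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0)

kept = eliminate_features(X_toy, y_toy, n_keep=3)
print(kept)  # column indices of the three surviving features
```

Refitting at every step is what makes the procedure recursive: importances are re-estimated on the smaller feature set instead of being ranked once.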


Finally, we can train a Random Forest Classifier to make predictions. Those predictions can be checked against our `y_test` from above, and scored with a [roc_auc_score](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). Below we'll train and score our model and output the five most important features according to this model.

In [7]:
# 3. Train a Random Forest Classifier
clf.fit(selector.transform(X_train), y_train)

# 4. Make predictions and score
preds = clf.predict(selector.transform(X_test))
print("Auc score of {:.3f}".format(roc_auc_score(y_test, preds)))

feats = feature_importances(X_train.iloc[:, selector.support_], clf)

Auc score of 0.555
Feature Importances: 
1: attempts.PERCENT_TRUE(transactions.Outcome)
2: problem_steps.PERCENT_TRUE(transactions.Outcome)
3: attempts.MEAN(transactions.Duration (sec))
4: problem_steps.problems.PERCENT_TRUE(transactions.Outcome)
5: attempts.SUM(transactions.Duration (sec))
-----
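
One thing worth knowing about the score above: `roc_auc_score` also accepts continuous scores such as predicted probabilities, and hard 0/1 predictions give it only a single threshold to work with. The sketch below uses made-up labels and scores (not results from this notebook) to show how the two inputs can differ.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
# Made-up probabilities: both positives outrank both negatives,
# but every score falls below a 0.5 decision threshold.
scores = [0.1, 0.3, 0.4, 0.45]
hard_labels = [int(s >= 0.5) for s in scores]  # [0, 0, 0, 0]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)  # 1.0 0.5
```

With a fitted classifier, the analogous input would be the positive-class column of `predict_proba`. Whether that changes the score for this dataset is something to check by running it.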



Let's examine a feature. The feature `problem_steps.MEAN(transactions.Duration (sec))` is the average time spent on a given problem step. It's easy to see how 'amount of time people spend on this problem' might be related to problem difficulty and ultimately the `Outcome` of a given attempt.
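
As a rough picture of what such an aggregation computes, the sketch below groups toy transactions by step and averages their durations with pandas. The column names and values are hypothetical, and the sketch ignores the time index, so it is only an approximation of what Featuretools produces.

```python
import pandas as pd

# Toy transactions: two attempts on step_a, one on step_b.
transactions = pd.DataFrame({
    "Step Name": ["step_a", "step_a", "step_b"],
    "Duration (sec)": [4.0, 6.0, 10.0],
})

# Mean duration per step, broadcast back onto each transaction row.
transactions["step_mean_duration"] = (
    transactions.groupby("Step Name")["Duration (sec)"].transform("mean"))

print(transactions["step_mean_duration"].tolist())  # [5.0, 5.0, 10.0]
```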

# Next Steps
This notebook showed how to structure your data and make predictions with machine learning. Rather than spending time creating features, it's now possible to explore the relationships and implications between thousands of features directly. Reasonable next steps might be to:
1. Try [plotting](#Appendix:-Plotting) some of the generated features
2. Run feature selection and tune the machine learning model
3. Explore other prediction problems on this `EntitySet`




# Appendix: Plotting
Here we look at two of the important features created above. Plots can help us understand why certain automatically generated features are useful, and they let us check the model's results against our own intuition.

In [41]:
import utils
from bokeh.io import show, output_notebook, output_file

output_notebook()
output_file('difficulty_vs_time.html')

p = utils.datashop_plot(fm,
                        col1='problem_steps.problems.PERCENT_TRUE(transactions.Outcome)',
                        col2='problem_steps.problems.MEAN(transactions.Duration (sec))',
                        label=label,
                        names=['Problem difficulty versus problem time', 
                               'Success rate on this problem', 
                               'Average time on this problem'])
show(p)


![](data/images/exampleimage.png)

If you're interested in understanding particular points and clusters, we have saved an interactive HTML version that you can download [here](data/images/difficulty_vs_time.html). That version lets you zoom in and hover over individual points to see which problem step and problem each one represents.

Notice that while a feature like *Success rate on this problem* would have only one value per problem if we used all of the data, the graph here shows that value changing over time. To start our analysis, let's get a baseline for the data. The blue dots represent correct answers and the grey dots represent incorrect answers. We can ask how often students are correct on average:

In [13]:
import numpy as np
print('Overall success rate is {:.2f}'.format(np.mean(fm['Outcome'])))

Overall success rate is 0.79


That is, if you were to pick a point at random, there's a roughly 79% chance it will be a correct answer. Some regions of the graph have a higher share of correct answers and others a lower share, and the decision trees that make up our Random Forest can pick up on those differences. From the graph it looks like correct answers cluster around problems that take about 10 seconds. We can check the other side of that directly by looking at problems that take 15 seconds or more on average:

In [36]:
maxtime = 15

print('If problem takes more than {} seconds: {:.2f}% of problems answered correctly'.format(maxtime,
    100 * np.mean(fm[
        (fm['problem_steps.problems.MEAN(transactions.Duration (sec))']>=maxtime)]['Outcome'])))

If problem takes more than 15 seconds: 74.80% of problems answered correctly
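
To see what this kind of threshold check measures, a minimal sketch on made-up data compares the success rate on both sides of the cutoff. The `toy` frame and its `mean_duration` column are hypothetical stand-ins for `fm` and its duration feature.

```python
import pandas as pd

toy = pd.DataFrame({
    "mean_duration": [5.0, 8.0, 12.0, 16.0, 20.0, 30.0],
    "Outcome": [1, 1, 1, 0, 1, 0],
})
toy_maxtime = 15

# Boolean mask for rows at or above the cutoff.
slow = toy["mean_duration"] >= toy_maxtime
fast_rate = toy.loc[~slow, "Outcome"].mean()
slow_rate = toy.loc[slow, "Outcome"].mean()

print("{:.2f} {:.2f}".format(fast_rate, slow_rate))  # 1.00 0.33
```

Reporting both sides makes it easier to tell whether the slow group differs from the fast group, not just from the overall average.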


In other words, problems that take 15 seconds or more on average are answered correctly 74.80% of the time, which is below the overall rate of 0.79. The average time spent on a problem is therefore an indicator of whether a student will answer it correctly in this dataset. There are a number of possible interpretations and testable hypotheses associated with that. The averages clearly don't tell the whole story, so let's look at success rate by problem.



In [35]:
split_line = .85

print('If Success Rate > {}: {:.2f}% of problems answered correctly'.format(split_line,
    100 * np.mean(fm[
        (fm['problem_steps.problems.PERCENT_TRUE(transactions.Outcome)']>=split_line)]['Outcome'])))
print('Problems with higher success than {}: {}'.format(split_line, 
                                                        fm[fm['problem_steps.problems.PERCENT_TRUE(transactions.Outcome)']
                                                           >=split_line]['problem_steps.Problem Name'].unique()))


If Success Rate > 0.85: 89.24% of problems answered correctly
Problems with higher success than 0.85: ['BUILDING_A_SIDEWALK' 'POGS' 'PAINTING_THE_WALL' 'CIRCLE_O'
 'DESIGNING_A_QUILT' 'DOG_ON_A_ROPE']


That is, of the 20 problems in this dataset, only 6 have a success rate of at least 85%. In this way the model has indicated that how previous students have done on a problem is a good predictor of how later students will do on it in this dataset.
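
The same per-problem view can be sketched with a plain pandas `groupby`. The frame below is toy data with hypothetical problem names, and it ignores the time index that the generated features respect.

```python
import pandas as pd

toy_attempts = pd.DataFrame({
    "Problem Name": ["P1", "P1", "P1", "P1", "P2", "P2", "P3", "P3"],
    "Outcome": [1, 1, 1, 1, 1, 0, 1, 0],
})
toy_split_line = 0.85

# Share of correct answers per problem.
success_by_problem = toy_attempts.groupby("Problem Name")["Outcome"].mean()

# Problems at or above the cutoff.
easy = success_by_problem[success_by_problem >= toy_split_line].index.tolist()
print(easy)  # ['P1']
```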

In addition to our earlier conclusion that "the problems that took a long time had worse scores", we have a secondary conclusion that "some problems are harder than others". What makes this line of inquiry interesting is that we didn't have to do very much work to reveal hard questions. In that way we have used automated feature engineering to make explicit our implicit understanding of this dataset.

# Feature Labs
If you have questions, comments or concerns regarding the notebook, feel free to [reach out](https://stackoverflow.com/questions/tagged/featuretools).