# M4 | Research Investigation Notebook

In this notebook, you will carry out a research investigation of your chosen dataset in teams. You will begin by formally selecting your research question (task 0), then process your data (task 1), create a predictive model (task 2), and evaluate your model's results (task 3).

Please upload your solved notebook to Moodle (under [Milestone 4 Submission](https://moodle.epfl.ch/mod/assign/view.php?id=1199557)) with your team name in the title, for example: `m4-lernnavi-teamname.ipynb`. Please run all cells before submission so we can grade effectively.


## Brief overview of Lernnavi
[Lernnavi](https://www.lernnavi.ch) is an instrument for promoting part of the basic technical study skills in German and mathematics.

Lernnavi's dataset is formatted in three main tables:
* *users*: demographic information about the users.
* *events*: events performed by the users on the platform.
* *transactions*: questions and answers solved by the users.

You should provide arguments and justifications for all of your design decisions throughout this investigation. You can use your M3 responses as the basis for this discussion.

In [1]:
# Import the tables of the data set as dataframes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import json
from collections import defaultdict
from tqdm import tqdm

DATA_DIR = '../../data/raw' # You may change the directory

users = pd.read_csv('{}/users.csv.gz'.format(DATA_DIR))
events = pd.read_csv('{}/events.csv.gz'.format(DATA_DIR))
transactions = pd.read_csv('{}/transactions.csv.gz'.format(DATA_DIR))

In [2]:
topic_trees = pd.read_csv('{}/topic_trees.csv.gz'.format(DATA_DIR))
topics_translated = pd.read_csv('{}/topics_translated.csv.gz'.format(DATA_DIR))

## Task 0: Research Question

**Research question:**
Does the study of students' behavior help predict their evaluation results?

1. To answer this question, we first focus on cleaning the data to remove all inactive students. Our preprocessing builds a dataset that tracks a subset of actions identified as expressing a user's activity, and keeps only the users who reach a minimum number of such actions.

2. Then we cluster the students based on a set of metrics as defined in [Shirvani Boroujeni et al.](https://infoscience.epfl.ch/record/218657/files/). The aim is to detect an optimal student behavior.

3. Finally, we train a time series classification model on the weekly data to predict whether a student will pass or fail their level-check evaluation.
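
The first step above can be sketched as follows. This is only a minimal illustration, not the final preprocessing: the set `ACTIVE_ACTIONS`, the threshold `MIN_ACTIONS`, the helper `filter_active_users`, and the action names `SUBMIT_ANSWER` and `LOGIN` are assumptions introduced here. Only the `user_id` and `action` columns and the `NAVIGATE_DASHBOARD` action appear in this notebook.

```python
import pandas as pd

# Hypothetical choices: which actions express activity, and how many are required.
ACTIVE_ACTIONS = {"SUBMIT_ANSWER", "NAVIGATE_DASHBOARD"}  # SUBMIT_ANSWER is a placeholder name
MIN_ACTIONS = 2  # placeholder threshold


def filter_active_users(events, actions=ACTIVE_ACTIONS, min_actions=MIN_ACTIONS):
    """Keep the relevant events of users with at least `min_actions` such events."""
    # Keep only the actions considered to express activity
    relevant = events[events["action"].isin(actions)]
    # Count the relevant actions per user
    counts = relevant.groupby("user_id").size()
    active_ids = counts[counts >= min_actions].index
    return relevant[relevant["user_id"].isin(active_ids)].reset_index(drop=True)


# Toy data standing in for the real `events` table
toy_events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "action": ["SUBMIT_ANSWER", "NAVIGATE_DASHBOARD", "SUBMIT_ANSWER",
               "SUBMIT_ANSWER", "LOGIN", "SUBMIT_ANSWER"],
})
active = filter_active_users(toy_events)
```

In the toy data only user 1 reaches the threshold, so `active` contains that user's three events. The threshold and the action subset would need to be justified from the actual distribution of actions per user.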

## Task 1: Data Preprocessing

In this section, you are asked to preprocess your data in a way that is relevant for the model. Please include 1-2 visualizations of features / data explorations that are related to your downstream prediction task.

In [3]:
# Your code for data processing goes here

### Getting Level Checks

The level-check scores are obtained as described in the [FAQ](https://docs.google.com/document/d/1RFcn1Y_o_MLfQo0-od8W7-Ty2rfhXZRnIVG0CLPl0NM/edit).

#### Utils
Creation of utility data structures & function definitions.

In [11]:
# Creating a lookup table (DataFrame) of topic ids, names and category (math/german)
topics_dict = topics_translated[['id', 'name_english', 'math']]
display(topics_dict.head(3))

Unnamed: 0,id,name_english,math
0,1,German,0.0
1,2,spelling,0.0
2,3,Spelling principles,0.0


In [12]:
def get_topic_name(_id, topics_dict = topics_dict, verbose = False):
    """
    Uses the topics_dict to get the english name of a topic given its id.
    """
    tmp = topics_dict.loc[topics_dict.id == _id, 'name_english'].reset_index(drop=True)
    if (tmp.shape[0] == 0):
        if verbose : print(f"Did not find any topic for id {_id}")
        return ''
    else : 
        return tmp.iloc[0]
    
def is_math(_id, topics_dict = topics_dict, verbose = False):
    """
    Uses the topics_dict to indicate whether the topic is a math topic or not (German topic) using its id.
    """
    tmp = topics_dict.loc[topics_dict.id == _id, 'math'].reset_index(drop=True)
    if (tmp.shape[0] == 0):
        if verbose : print(f"Did not find any topic for id {_id}")
        return None
    else : 
        return tmp.iloc[0] == 1

In [13]:
def get_keys(js, n) :
    """
    Prints out all keys of a json file.
    """
    if type(js) == dict:
        for k in js.keys():
            print('\t'*n + k)
            get_keys(js[k], n+1)
    elif type(js) == list:
        print('\t'*n + '[')        
        for i in js:
            get_keys(i, n)
            print('\t'*n + ',')
        print('\t'*n + ']')
    else :
        print('\t'*n + "...")

In [14]:
# Getting all the primary (non children) topics of a specific NAVIGATE_DASHBOARD event

def get_mastery(topic, n=0, scores = None, verbose=False):
    """
    Stores all the mastery scores for the input topic and its children. 
    Stops the recursion when no children or a non-zero mastery score is encountered.
    """
    # create the cumulative dict that stores the mastery scores if none was passed
    if scores is None:
        scores = {}
    
    # get data
    data = topic["userData"]
    
    # Get id and level of mastery
    mastery = float(data["mastery"])
    
    # get topic id and name info
    topic_id = topic["topic"]["id"]
    topic_name = get_topic_name(topic_id)
    topic_math = 'Math' if is_math(topic_id) else 'German'
    
    # result
    if verbose: print('\t'*n + f"[{topic_math}] {topic_name} ({topic_id}): {mastery}")
    
    # Get the topic's children if any (a missing or empty entry yields an empty list)
    children = topic.get('children') or []
        
    # If mastery is 0, dig deeper and try to get a non-0 score
    if (mastery == 0.) and (len(children) > 0) :
        for t in children:
            scores.update(get_mastery(t, n=n+1, scores=scores, verbose=verbose))
    else :
        scores[topic_id] = mastery * 10
    
    return scores
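
To make the recursion concrete, the sketch below reproduces the traversal on a hand-made topic tree. `collect_mastery` is a simplified stand-in for `get_mastery` (no name lookup, no printing), and the toy tree only mimics the `topic`, `userData` and `children` keys used above. The real `tracking_data` payload contains more fields.

```python
def collect_mastery(topic, scores=None):
    """Simplified stand-in for get_mastery: same traversal, no name lookup."""
    if scores is None:
        scores = {}
    mastery = float(topic["userData"]["mastery"])
    children = topic.get("children") or []
    if mastery == 0.0 and children:
        # A zero score on a parent means "no value here": look at the children instead
        for child in children:
            collect_mastery(child, scores)
    else:
        # Leaf topic, or topic with its own score: store it on a 0-10 scale
        scores[topic["topic"]["id"]] = mastery * 10
    return scores


# Hand-made tree: a parent without a score and two children
toy_topic = {
    "topic": {"id": 2},
    "userData": {"mastery": 0.0},
    "children": [
        {"topic": {"id": 3}, "userData": {"mastery": 0.5}},
        {"topic": {"id": 2055}, "userData": {"mastery": 0.0}, "children": []},
    ],
}
toy_scores = collect_mastery(toy_topic)
```

Here the parent (id 2) is skipped because its score is zero and it has children, while the childless topic 2055 is stored with a score of 0.0. This matches the zero entries visible in the real output further down.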

In [15]:
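def get_level(js, verbose = False):
    """
    Extracts all mastery scores from the tracking data of one NAVIGATE_DASHBOARD event.
    Returns an empty dict when the event carries no dashboard topics.
    """
    if not isinstance(js, dict): js = json.loads(js)
    all_scores = {}
    
    try:
        topics = js['dashboard']['topics']
    except (KeyError, TypeError):
        topics = []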
    
    for topic in topics:
        all_scores.update( get_mastery(topic, scores=all_scores, verbose = verbose) )

    return all_scores

In [16]:
def get_level_changes(levels):
    """
    Note: This assumes that the user can never get 0.0 as a mastery score. 0.0 denotes absence of value.
    """
    
    # Get levels as list to work with indexes
    if type(levels) != list : levels = levels.tolist()
    
    # set initial state (stores the previous non empty scores, defaults to 0)
    previous_state = defaultdict(lambda : 0.)
    # Create list to track changes
    changes = []
    
    # Loop through all the checks and track the changes.
    for level in levels:
        _changes = {}
        
        # For each key, check if it changed compared to its previously saved value
        for k in level.keys() :
            _previous = previous_state[k]
            _current = level[k]
            
            if (_previous != _current) :
                # compute change
                _change = _current - _previous
                # store (current, change)
                _changes[k] = (_current, _change)
            
            previous_state[k] = _current
    
        changes.append(_changes)
    
    return changes
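
As an illustration of the change tracking, the sketch below runs the same logic on a few hand-made snapshots. `level_changes` is a compact restatement of `get_level_changes` written for this example, and the snapshot values are made up.

```python
from collections import defaultdict


def level_changes(levels):
    """Compact restatement of get_level_changes for illustration purposes."""
    previous = defaultdict(float)  # last seen score per topic, defaults to 0.0
    changes = []
    for level in levels:
        step = {}
        for topic_id, current in level.items():
            if current != previous[topic_id]:
                # store (current score, difference to the previous score)
                step[topic_id] = (current, current - previous[topic_id])
            previous[topic_id] = current
        changes.append(step)
    return changes


# Four consecutive dashboard visits: empty, first scores, unchanged, one score dropped
snapshots = [{}, {3: 5.0, 7: 2.0}, {3: 5.0, 7: 2.0}, {3: 4.0, 7: 2.0}]
toy_changes = level_changes(snapshots)
```

Only the second and fourth snapshots produce entries. Note that the first appearance of a topic is always reported as a positive change, because the previous score defaults to 0.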

In [17]:
def build_change_df(df):
    """
    Isolates the level change events per date and per topic for the input user.
    Note: The validated mark looks at the change in level. If the change is positive, 
    we assume that the user progressed and thus we validate the level check
    """
    tmp2 = df.copy()

    # Get only the singleton changes per topic
    tmp2['change'] = tmp2.change.apply(lambda d : [(_id, *d[_id]) for _id in d.keys()])
    tmp2 = tmp2[['date', 'change']].explode("change")
    # remove empty rows (no changes at these dates)
    tmp2 = tmp2.loc[~tmp2.change.isna()].reset_index(drop=True)

    # Assign a validated or failed mark
    # validation depends on whether the change is positive (level got higher) or not
    tmp2['validated'] = tmp2.change.apply(lambda t: t[2] > 0)

    # Get topic name
    tmp2['topic'] = tmp2.change.apply(lambda t : get_topic_name(t[0]))
    
    return tmp2

#### Discovery

In [18]:
# Exploring the JSON tracking data of one NAVIGATE_DASHBOARD event (positional index 1)
line = events[events['action']=='NAVIGATE_DASHBOARD'].iloc[1]
l0 = json.loads(line['tracking_data'])

# >>>>> uncomment to see key structure
# get_keys(l0, 0)

# >>>>> uncomment to see raw content
# l0

In [19]:
# Getting all mastery scores for a single event
get_level(l0, verbose=True)

[German] spelling (2): 0.0
	[German] Spelling principles (3): 0.737535316620025
	[German] Large and lower case (2055): 0.3981179459718665
	[German] Spellishly difficult words (2065): 0.11930505543060275
[German] punctuation (3104): 0.0
	[German] Comma in sentence (3163): 0.4325001746856052
[German] Speech (3105): 0.0
	[German] Verb (3110): 0.04742587317756678
	[German] noun (3111): 0.04742587317756678
	[German] Adjectives (3179): 0.04742587317756678
	[German] pronoun (3112): 0.04742587317756678
	[German] Particle (3113): 0.04742587317756678
[German] sentence parts (3106): 0.0
	[German] Function of the sentence members (3141): 0.04742587317756678
	[German] Determination of the sentence members (3142): 0.04742587317756679
	[German] Structure of phrases (3154): 0.04742587317756678
	[German] Congress between phrases (3156): 0.04742587317756678
[German] Phenomena (3107): 0.0
	[German]  (3244): 0.04742587317756678
	[German] Relationship between phrases / attributes and sidelines (3250): 0.04

{3: 7.3753531662002505,
 2055: 3.981179459718665,
 2065: 1.1930505543060275,
 3163: 4.325001746856052,
 3110: 0.4742587317756678,
 3111: 0.4742587317756678,
 3179: 0.4742587317756678,
 3112: 0.4742587317756678,
 3113: 0.4742587317756678,
 3141: 0.4742587317756678,
 3142: 0.4742587317756679,
 3154: 0.4742587317756678,
 3156: 0.4742587317756678,
 3244: 0.4742587317756678,
 3250: 0.4742587317756678,
 2023: 0.4742587317756678,
 2042: 0.4742587317756678,
 2045: 0.8331439759984216,
 2048: 0.4742587317756678,
 3114: 0.4742587317756678,
 3115: 0.4742587317756678,
 3116: 0.4440693277319837,
 3117: 0.0,
 3119: 0.0}

In [20]:
# Testing out tracking the evolution for a single user

user_id = 390028
tmp = events.loc[(events.user_id == user_id) & (events.action == 'NAVIGATE_DASHBOARD'), ['timestamp', 'tracking_data']]

# Transforming to date
tmp['date'] = pd.to_datetime(tmp.timestamp, unit='ms')
# extracting all the mastery scores
tmp['levels'] = tmp.tracking_data.apply(get_level)

tmp.head(3)

Unnamed: 0,timestamp,tracking_data,date,levels
17006,1622773391671,"{""text"": ""Jetzt starten"", ""title"": ""Go To Dash...",2021-06-04 02:23:11.671,{}
17007,1622773392297,"{""dashboard"": {""title"": ""Deutsch"", ""topics"": [...",2021-06-04 02:23:12.297,"{3: 7.3753531662002505, 2055: 3.98117945971866..."
17008,1622773423216,"{""text"": ""open"", ""title"": ""TopicTileDialog"", ""...",2021-06-04 02:23:43.216,{}


In [21]:
# Making sure that the values are sorted from earliest to latest
tmp.sort_values('date', inplace=True)

In [22]:
# get changes for each non-zero mastery score if any, result is stored as (current, change)
tmp['change'] = get_level_changes(tmp.levels)
tmp.head()

Unnamed: 0,timestamp,tracking_data,date,levels,change
17006,1622773391671,"{""text"": ""Jetzt starten"", ""title"": ""Go To Dash...",2021-06-04 02:23:11.671,{},{}
17007,1622773392297,"{""dashboard"": {""title"": ""Deutsch"", ""topics"": [...",2021-06-04 02:23:12.297,"{3: 7.3753531662002505, 2055: 3.98117945971866...","{3: (7.3753531662002505, 7.3753531662002505), ..."
17008,1622773423216,"{""text"": ""open"", ""title"": ""TopicTileDialog"", ""...",2021-06-04 02:23:43.216,{},{}
17009,1622773430086,"{""text"": ""OpenSession: 35248"", ""title"": ""Topic...",2021-06-04 02:23:50.086,{},{}
17030,1622773707078,"{""dashboard"": {""title"": ""Deutsch"", ""topics"": [...",2021-06-04 02:28:27.078,"{3: 7.3753531662002505, 2055: 3.98117945971866...",{}


In [23]:
# build change df
build_change_df(tmp).head()

Unnamed: 0,date,change,validated,topic
0,2021-06-04 02:23:12.297,"(3, 7.3753531662002505, 7.3753531662002505)",True,Spelling principles
1,2021-06-04 02:23:12.297,"(2055, 3.981179459718665, 3.981179459718665)",True,Large and lower case
2,2021-06-04 02:23:12.297,"(2065, 1.1930505543060275, 1.1930505543060275)",True,Spellishly difficult words
3,2021-06-04 02:23:12.297,"(3163, 4.325001746856052, 4.325001746856052)",True,Comma in sentence
4,2021-06-04 02:23:12.297,"(3110, 0.4742587317756678, 0.4742587317756678)",True,Verb


#### Applying to all users

Running the routine to get all level checks for all users.

In [24]:
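# Use a separate name so that the `users` dataframe loaded earlier is not overwritten
user_ids = events.user_id.unique()

level_checks = []

for u in tqdm(user_ids):
    
    # Getting all relevant events for the current user (filter on `u`, not on the test `user_id`)
    tmp = events.loc[(events.user_id == u) & (events.action == 'NAVIGATE_DASHBOARD'), ['timestamp', 'tracking_data']].copy()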

    # Transforming to date
    tmp['date'] = pd.to_datetime(tmp.timestamp, unit='ms')
    
    # extracting all the mastery scores
    tmp['levels'] = tmp.tracking_data.apply(get_level)
    
    # Making sure that the values are sorted from earliest to latest
    tmp.sort_values('date', inplace=True)
    
    # Computing the changes in levels
    tmp['change'] = get_level_changes(tmp.levels)
    
    # Building the level check df
    tmp = build_change_df(tmp)
    
    # Adding user info
    tmp['user_id'] = u
    
    level_checks.append(tmp)
    
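# Keep the concatenated result in a variable so that it can be reused downstream
level_checks_df = pd.concat(level_checks, ignore_index=True)
level_checks_df.head()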

  0%|▍                                                                                                                                                                       | 25/10113 [00:34<3:49:20,  1.36s/it]


KeyboardInterrupt: 
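
The run above was interrupted at roughly 1.36 s per user over 10113 users. One likely cause is that each iteration re-scans the whole `events` table with a boolean mask. A possible alternative, sketched below under that assumption, filters the `NAVIGATE_DASHBOARD` events once and then iterates over a `groupby`. The helper `per_user_frames`, its `process_user` argument, and the `LOGIN` action in the toy data are hypothetical names introduced for this example.

```python
import pandas as pd


def per_user_frames(events, process_user):
    """Apply `process_user` to each user's NAVIGATE_DASHBOARD events, filtering only once."""
    cols = ["user_id", "timestamp", "tracking_data"]
    dashboard = events.loc[events["action"] == "NAVIGATE_DASHBOARD", cols].copy()
    dashboard["date"] = pd.to_datetime(dashboard["timestamp"], unit="ms")
    # Sort once so that every group is already ordered from earliest to latest
    dashboard = dashboard.sort_values(["user_id", "date"])

    frames = []
    for user_id, group in dashboard.groupby("user_id", sort=False):
        frames.append(process_user(group).assign(user_id=user_id))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Toy data standing in for the real `events` table
toy_events = pd.DataFrame({
    "user_id": [2, 1, 1, 2],
    "action": ["NAVIGATE_DASHBOARD", "NAVIGATE_DASHBOARD", "LOGIN", "NAVIGATE_DASHBOARD"],
    "timestamp": [1622773392297, 1622773391671, 1622773400000, 1622773391000],
    "tracking_data": ["{}", "{}", "{}", "{}"],
})

# Trivial per-user step: keep only the date column
result = per_user_frames(toy_events, lambda g: g[["date"]])
```

In the notebook, `process_user` would wrap the `get_level`, `get_level_changes` and `build_change_df` steps of the loop body. Whether this removes the bottleneck depends on how much of the time is spent parsing the JSON in `get_level`, which this sketch does not change.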

*Your discussion about your processing decisions goes here*

## Task 2: Model Building

Train a model for your research question. 

In [None]:
# Your code for training a model goes here

*Your discussion about your model training goes here*

## Task 3: Model Evaluation
In this task, you will use metrics to evaluate your model.

In [None]:
# Your code for model evaluation goes here

*Your discussion/interpretation about your model's behavior goes here*