# M4 | Research Investigation Notebook

In this notebook, you will carry out a research investigation of your chosen dataset in teams. You will begin by formally selecting your research question (Task 0), then process your data (Task 1), build a predictive model (Task 2), and evaluate your model's results (Task 3).

Please upload your solved notebook to Moodle (under [Milestone 4 Submission](https://moodle.epfl.ch/mod/assign/view.php?id=1199557)) with your team name in the file name, for example `m4-lernnavi-teamname.ipynb`. Please run all cells before submission so that we can grade effectively.


## Brief overview of Lernnavi
[Lernnavi](https://www.lernnavi.ch) is an instrument for promoting some of the basic technical study skills in German and mathematics.

Lernnavi's dataset is organized into three main tables:
* *users*: demographic information about the users.
* *events*: actions performed by users on the platform.
* *transactions*: questions attempted and answers submitted by users.

You should provide arguments and justifications for all of your design decisions throughout this investigation. You can use your M3 responses as the basis for this discussion.

In [1]:
# Import the tables of the data set as dataframes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_DIR = './data' # You may change the directory

users = pd.read_csv('{}/users.csv.gz'.format(DATA_DIR))
events = pd.read_csv('{}/events.csv.gz'.format(DATA_DIR))
transactions = pd.read_csv('{}/transactions.csv.gz'.format(DATA_DIR))

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

## Task 0: Research Question

**Research question:**
*Predicting user motivation and retention. E.g., dropout or long pauses prediction, level checks prediction (self-supervised learning)*

## Task 1: Data Preprocessing

In this section, you are asked to preprocess your data in a way that is relevant for the model. Please include 1-2 visualizations of features / data explorations that are related to your downstream prediction task.

In [44]:
events['action'].value_counts()

PAGE_VISIT              653725
REVIEW_TASK             513389
SUBMIT_ANSWER           419862
NAVIGATE_DASHBOARD      350821
NEXT                    277020
WINDOW_VISIBLE_FALSE    240660
WINDOW_VISIBLE_TRUE     199287
VIEW_QUESTION           154592
OPEN_FEEDBACK            87071
GO_TO_THEORY             80746
CLOSE_FEEDBACK           76780
SURVEY_BANNER            68644
VIEW_SESSION_END         52205
SKIP                     50114
SUBMIT_SEARCH            48953
WINDOW_OPEN              48501
CLOSE                    42276
WINDOW_CLOSE             36576
LOGIN                    23071
SELECT_STATISTICS        16897
ACCEPT_PROGRESS          11017
GO_TO_BUG_REPORT          3660
GO_TO_COMMENTS            3640
LOGOUT                    2064
PRETEST                   1709
REJECT_PROGRESS           1200
SHARE_SESSION              403
NAVIGATE_SURVEY            219
REQUEST_HINT               185
GO_TO_SESSION              182
SHARE                       90
Name: action, dtype: int64

In [17]:
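# Your code for data processing goes here
# events = events.drop(columns = ["year"])
# Convert the millisecond timestamp to a datetime, then derive ISO week and ISO year.
events['date'] = pd.to_datetime(events['timestamp'], unit='ms')
iso_calendar = events["date"].dt.isocalendar()
events["week"] = iso_calendar.week
# Use the ISO year (not dt.year) so that (year, week) pairs stay consistent around New Year.
events["year"] = iso_calendar.year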
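ISO week numbers and calendar years come from two different calendar systems, so they can disagree in the first and last days of a year. The following minimal sketch uses made-up timestamps (not taken from the dataset) to show the difference between `dt.year` and `dt.isocalendar().year`:

```python
import pandas as pd

# Illustrative dates around a year boundary (assumed values, not from the dataset).
dates = pd.Series(pd.to_datetime(["2021-12-27", "2022-01-01", "2022-01-03"]))

iso = dates.dt.isocalendar()  # DataFrame with columns: year, week, day

demo = pd.DataFrame({
    "date": dates,
    "calendar_year": dates.dt.year,
    "iso_year": iso["year"].astype(int),
    "iso_week": iso["week"].astype(int),
})
print(demo)
```

In this toy example, 1 January 2022 belongs to ISO week 52 of ISO year 2021, so pairing its week number with the calendar year 2022 would place it after all other weeks of 2022 when sorting.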

In [18]:
df_weekly = events[["user_id","year","week"]].dropna().groupby(["user_id","year","week"]).size().reset_index(name='num_events')
df_weekly

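| | user_id | year | week | num_events |
|---|---|---|---|---|
| 0 | 387604 | 2021 | 20 | 1 |
| 1 | 387604 | 2021 | 21 | 1 |
| 2 | 387604 | 2021 | 25 | 7 |
| 3 | 387604 | 2021 | 26 | 25 |
| 4 | 387604 | 2021 | 31 | 12 |
| ... | ... | ... | ... | ... |
| 27527 | 404600 | 2022 | 8 | 34 |
| 27528 | 404603 | 2022 | 8 | 112 |
| 27529 | 404604 | 2022 | 8 | 43 |
| 27530 | 404605 | 2022 | 8 | 22 |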


In [40]:
# Keep only answer submissions that carry a transaction token.
all_questions = events[["user_id","year","week","action","transaction_token"]].dropna()
all_questions = all_questions[all_questions['action']=='SUBMIT_ANSWER']
# Count the distinct transactions (answered questions) per user and week.
num_questions = all_questions.groupby(["user_id","year","week"])['transaction_token'].nunique().reset_index(name='num_questions')
num_questions

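| | user_id | year | week | num_questions |
|---|---|---|---|---|
| 0 | 387604 | 2021 | 35 | 1 |
| 1 | 387604 | 2021 | 38 | 4 |
| 2 | 387604 | 2021 | 39 | 2 |
| 3 | 387604 | 2021 | 40 | 1 |
| 4 | 387604 | 2021 | 41 | 1 |
| ... | ... | ... | ... | ... |
| 20520 | 404598 | 2022 | 8 | 1 |
| 20521 | 404599 | 2022 | 8 | 1 |
| 20522 | 404600 | 2022 | 8 | 1 |
| 20523 | 404603 | 2022 | 8 | 13 |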


In [41]:
transactions[['transaction_token', 'evaluation']].dropna().replace(["CORRECT","PARTIAL","WRONG"],[1,0.5,0])

| | transaction_token | evaluation |
|---|---|---|
| 2 | 6f292166-86bd-4ec9-81e8-22e9033d571e | 1.0 |
| 3 | 79a7d731-a36b-4529-a11b-108b9f877a04 | 1.0 |
| 4 | a45b6464-371e-47f4-842c-34f9e345b4ec | 1.0 |
| 5 | 813bd412-3041-48bd-98dc-6b56ac36f288 | 1.0 |
| 6 | 426596f6-a5e3-4efb-9c96-d08e7ab7cd8a | 1.0 |
| ... | ... | ... |
| 800002 | c5659a66-9d7c-4c30-9de7-68d62d2c486b | 0.5 |
| 800007 | 2cc6a524-43fc-480d-96d6-04b0a64e28e4 | 0.0 |
| 800012 | 5735066d-7090-47b2-af61-41431f3b6f30 | 0.5 |
| 800016 | 7e328437-52df-4697-94e9-186844c3e269 | 0.5 |


In [42]:
evaluations = all_questions.merge(transactions[['transaction_token', 'evaluation']].dropna().replace(["CORRECT","PARTIAL","WRONG"],[1,0.5,0]),
                                  on='transaction_token',
                                  how='left')
evaluations

| | user_id | year | week | action | transaction_token | evaluation |
|---|---|---|---|---|---|---|
| 0 | 393211 | 2021 | 20 | SUBMIT_ANSWER | 7a10ca52-ffb5-4069-8800-0dc86d778e94 | NaN |
| 1 | 393211 | 2021 | 20 | SUBMIT_ANSWER | 88fdcaad-f73b-46a2-b561-d262f2441442 | NaN |
| 2 | 393211 | 2021 | 20 | SUBMIT_ANSWER | a75eb7b4-b2c2-47d4-9200-27980c175037 | NaN |
| 3 | 393211 | 2021 | 20 | SUBMIT_ANSWER | 61eb829d-bdda-4107-86af-ad9a14a7bdc9 | NaN |
| 4 | 393211 | 2021 | 20 | SUBMIT_ANSWER | 30ff0d8a-865d-460b-9177-b698a52b0d5c | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 419857 | 404540 | 2022 | 8 | SUBMIT_ANSWER | 87832f2d-af7b-41f2-8321-953473f1d0fa | 0.0 |
| 419858 | 404556 | 2022 | 8 | SUBMIT_ANSWER | dd8cd76c-0a95-4b4e-9693-e0cb4d43bd25 | 0.0 |
| 419859 | 404550 | 2022 | 8 | SUBMIT_ANSWER | 8b1b72d5-551c-49fe-a4e8-a268a2956909 | 1.0 |
| 419860 | 404560 | 2022 | 8 | SUBMIT_ANSWER | 9edfed2d-ebe8-4b0e-873c-9ba13e6541b3 | 1.0 |
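The per-answer scores in `evaluations` could be turned into a weekly feature in the same way as the other counts. The sketch below is one possible aggregation, shown on a small made-up frame with the same column names; the name `mean_evaluation` is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the `evaluations` frame (values are made up).
evaluations_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "year": [2021, 2021, 2021, 2021, 2021],
    "week": [20, 20, 21, 20, 20],
    "evaluation": [1.0, 0.5, None, 0.0, 1.0],  # None marks an unmatched answer
})

# Mean score per user and week; NaN evaluations are skipped by mean(),
# and a week with no matched answer at all stays NaN.
weekly_score = (
    evaluations_demo
    .groupby(["user_id", "year", "week"])["evaluation"]
    .mean()
    .reset_index(name="mean_evaluation")
)
print(weekly_score)
```

Weeks whose answers have no matching transaction keep a missing score, so a decision on how to impute them (or whether to drop them) would still be needed before modeling.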


In [20]:
num_theory = events[["user_id","year","week","action"]].dropna()
num_theory = num_theory[num_theory["action"]=='GO_TO_THEORY'].groupby(
    ["user_id","year","week"]).size().reset_index(name='num_theory')
num_theory

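| | user_id | year | week | num_theory |
|---|---|---|---|---|
| 0 | 387604 | 2021 | 33 | 18 |
| 1 | 387604 | 2021 | 34 | 50 |
| 2 | 387604 | 2021 | 35 | 118 |
| 3 | 387604 | 2021 | 36 | 27 |
| 4 | 387604 | 2021 | 37 | 78 |
| ... | ... | ... | ... | ... |
| 7858 | 404589 | 2022 | 8 | 4 |
| 7859 | 404597 | 2022 | 8 | 8 |
| 7860 | 404598 | 2022 | 8 | 15 |
| 7861 | 404599 | 2022 | 8 | 38 |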


In [23]:
df_weekly_merged = df_weekly.merge(num_questions, on=['user_id', 'year', 'week'], how='left').fillna(value={'num_questions':0})
df_weekly_merged = df_weekly_merged.merge(num_theory, on=['user_id', 'year', 'week'], how='left').fillna(value={'num_theory':0})


In [24]:
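# Keep each user's first 13 active weeks (weeks with at least one event), in chronological order.
df_weekly_merged = df_weekly_merged.sort_values(by = ["user_id","year","week"]).groupby('user_id').head(13).reset_index(drop=True)
# Re-index the weeks per user as 0, 1, 2, ...; this counts active weeks only, so inactive weeks in between are not represented.
df_weekly_merged["week"] = df_weekly_merged.groupby('user_id').cumcount()
df_weekly_merged = df_weekly_merged.drop(columns = ["year"])
df_weekly_merged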

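| | user_id | week | num_events | num_questions | num_theory |
|---|---|---|---|---|---|
| 0 | 387604 | 0 | 1 | 0.0 | 0.0 |
| 1 | 387604 | 1 | 1 | 0.0 | 0.0 |
| 2 | 387604 | 2 | 7 | 0.0 | 0.0 |
| 3 | 387604 | 3 | 25 | 0.0 | 0.0 |
| 4 | 387604 | 4 | 12 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 26957 | 404600 | 0 | 34 | 1.0 | 0.0 |
| 26958 | 404603 | 0 | 112 | 13.0 | 0.0 |
| 26959 | 404604 | 0 | 43 | 4.0 | 0.0 |
| 26960 | 404605 | 0 | 22 | 0.0 | 0.0 |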
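A per-user `cumcount` numbers only the weeks in which a user was active, so a long pause between two active weeks disappears from the index. For a dropout or long-pause question, a calendar-based index may be more informative. The sketch below is one possible alternative on made-up data: it measures weeks since each user's first active week and fills inactive weeks with zero. The names `rel_week`, `week_start` and `N_WEEKS` are assumptions introduced for illustration.

```python
from datetime import date

import pandas as pd

# Toy weekly counts (made-up values); user 1 skips one week across New Year.
weekly = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "year": [2021, 2021, 2022, 2021],   # ISO years
    "week": [51, 52, 2, 30],            # ISO weeks
    "num_events": [5, 3, 7, 4],
})

# Monday of each ISO week, so that weeks can be subtracted across year boundaries.
weekly["week_start"] = pd.to_datetime(
    [date.fromisocalendar(int(y), int(w), 1) for y, w in zip(weekly["year"], weekly["week"])]
)

# Number of calendar weeks since the user's first active week.
first_week = weekly.groupby("user_id")["week_start"].transform("min")
weekly["rel_week"] = (weekly["week_start"] - first_week).dt.days // 7

# Dense (user, week) grid over a fixed horizon; inactive weeks get 0 events.
N_WEEKS = 13
full_index = pd.MultiIndex.from_product(
    [weekly["user_id"].unique(), range(N_WEEKS)], names=["user_id", "rel_week"]
)
dense = (
    weekly[weekly["rel_week"] < N_WEEKS]
    .set_index(["user_id", "rel_week"])["num_events"]
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(dense.head(5))
```

One caveat of this sketch: users who joined shortly before the end of the data collection period also receive zeros for weeks that lie beyond the last recorded date, so those users may need to be filtered out or handled separately.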


In [27]:
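# Standardize the count features to zero mean and unit variance.
# Note: the scaler is fit on all rows here; once a train/test split exists,
# it should be fit on the training rows only to avoid leaking test statistics.
columns_to_standarize = ['num_events', 'num_questions', 'num_theory']
scaler = StandardScaler()

df_weekly_stand = df_weekly_merged.copy()
df_weekly_stand[columns_to_standarize] = scaler.fit_transform(df_weekly_stand[columns_to_standarize])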
df_weekly_stand

| | user_id | week | num_events | num_questions | num_theory |
|---|---|---|---|---|---|
| 0 | 387604 | 0 | -0.765258 | -0.615650 | -0.242192 |
| 1 | 387604 | 1 | -0.765258 | -0.615650 | -0.242192 |
| 2 | 387604 | 2 | -0.728122 | -0.615650 | -0.242192 |
| 3 | 387604 | 3 | -0.616712 | -0.615650 | -0.242192 |
| 4 | 387604 | 4 | -0.697175 | -0.615650 | -0.242192 |
| ... | ... | ... | ... | ... | ... |
| 26957 | 404600 | 0 | -0.561008 | -0.574744 | -0.242192 |
| 26958 | 404603 | 0 | -0.078233 | -0.083881 | -0.242192 |
| 26959 | 404604 | 0 | -0.505303 | -0.452029 | -0.242192 |
| 26960 | 404605 | 0 | -0.635281 | -0.615650 | -0.242192 |


In [10]:
# USE THIS PIPELINE TO APPLY TRANSFORMATIONS TO ONLY SOME COLUMNS

# preprocessor = ColumnTransformer(
#     transformers=[('scaler', StandardScaler(), columns_to_standarize)],
#     remainder='passthrough'
# )

# pipeline = Pipeline([
#     ('preprocessor', preprocessor)
# ]) # add your model in pipeline

# pipeline.fit(THE_DATA)
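The commented template above can be made concrete as follows. This is a minimal sketch on synthetic data, not a recommended model: the label `y` is a hypothetical placeholder, `LogisticRegression` is just one possible estimator, and the positional train/test split is for illustration only. Placing the scaler inside the pipeline means its statistics are learned from the training rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic weekly features with the same column names as df_weekly_merged.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "week": rng.integers(0, 13, size=200),
    "num_events": rng.poisson(30, size=200),
    "num_questions": rng.poisson(5, size=200),
    "num_theory": rng.poisson(2, size=200),
})
y = (X["num_events"] < 28).astype(int)  # hypothetical label, for illustration only

columns_to_standarize = ["num_events", "num_questions", "num_theory"]

# Scale only the count columns; pass the remaining columns through unchanged.
preprocessor = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), columns_to_standarize)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# Simple positional split; the scaler is fit on the training rows only.
X_train, X_test = X.iloc[:150], X.iloc[150:]
y_train, y_test = y.iloc[:150], y.iloc[150:]

pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)
```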


*Your discussion about your processing decisions goes here*

## Task 2: Model Building

Train a model for your research question. 
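Because the weekly table contains several rows per user, a random row-level split would place weeks of the same user in both the training and the test set. One way to avoid this, sketched below on toy arrays, is a group-aware split with scikit-learn's `GroupShuffleSplit`; the features, labels and user ids here are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 hypothetical users with 4 weekly rows each.
user_ids = np.repeat(np.arange(6), 4)
X = np.arange(len(user_ids) * 2).reshape(-1, 2)
y = np.tile([0, 1], 12)

# Split by user so that all rows of a user fall on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=user_ids))

train_users = set(user_ids[train_idx])
test_users = set(user_ids[test_idx])
print(sorted(train_users), sorted(test_users))
```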

In [11]:
# Your code for training a model goes here

*Your discussion about your model training goes here*

## Task 3: Model Evaluation
In this task, you will use metrics to evaluate your model.
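As a starting point, the sketch below computes two metrics that are often used when the classes are imbalanced, which may be the case for dropout labels. The labels and predicted probabilities are made-up values, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up ground truth and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.6, 0.7, 0.3])

# Hard predictions at an assumed threshold of 0.5.
y_pred = (y_prob >= 0.5).astype(int)

# Balanced accuracy averages the recall of each class.
bal_acc = balanced_accuracy_score(y_true, y_pred)
# ROC AUC is threshold-free and uses the probabilities directly.
auc = roc_auc_score(y_true, y_prob)
print(bal_acc, auc)
```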

In [12]:
# Your code for model evaluation goes here

*Your discussion/interpretation about your model's behavior goes here*