# Problem-level Performance Prediction Notebook

## Dataset
We have a dataset of online learning activity from Junyi Academy. The Junyi Academy Foundation is a non-profit organization based in Taiwan that aims to provide all children with equitable, quality education through technology. The dataset is divided into the following files - 

1. Log_Problem.csv - This has data about 16,217,311 problem attempts of 72,630 selected students for a year from 2018/08 to 2019/07.
2. Info_Content.csv - This describes the metadata of the exercises. Each exercise is a basic unit of learning consisting of many problems.
3. Info_UserData.csv - This describes the metadata of the selected registered students in Junyi Academy.

The entire dataset can be downloaded from Kaggle - [Link](https://www.kaggle.com/junyiacademy/learning-activity-public-dataset-by-junyi-academy).

## Features

The files have the following columns - 

### Log_Problem

1. timestamp_TW - The timestamp of the user's first action on the problem (answering it or using a hint). It is in the UTC+8 timezone.
2. uuid - The unique ID of the user. It can be used to join with Info_UserData.
3. ucid - The unique ID of the content. It can be used to join with Info_Content.
4. upid - The unique ID of the problem.
5. problem_number - The number of problems this user had encountered, including this problem, in this exercise.
6. exercise_problem_repeat_session - The number of times the user encounters this problem in this exercise.
7. is_correct - Whether the answer is considered correct. An answer counts as correct only if the student answered correctly on the first attempt.
8. total_sec_taken - How many seconds the user spent on this problem encounter.
9. total_attempt_cnt - How many times the user submitted an answer for this problem encounter.
10. used_hint_cnt - How many hints the user used for this problem encounter.
11. is_hint_used - Whether the user used a hint or not.
12. is_downgrade - Whether the user is downgraded to a lower level after this attempt.
13. is_upgrade - Whether the user is upgraded to the next level after this attempt.
14. level - The level this user belongs to in this exercise after this attempt. There are five possible levels. All users start from level 0 and are declared Proficient at level 4.

***

### Info_Content
1. ucid - The hashed unique ID of the content.
2. content_pretty_name - The Chinese display name of this content.
3. content_kind - The kind of this content. The current dataset release only includes `Exercise`.
4. difficulty - The difficulty of this content. There are four possible values: `Easy`, `Normal`, `Hard` and `Unset`. Unset means that this content has not been set to any difficulty yet.
5. subject - The subject of this content. The current dataset release only includes `math`.
6. learning_stage - The learning stage of this content. There are three possible values: `Elementary`, `Junior` and `Senior`.
7. level1_id - The hashed level 1 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi Academy. The current dataset release has four levels in the hierarchy.
8. level2_id - The hashed level 2 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi Academy. The current dataset release has four levels in the hierarchy.
9. level3_id - The hashed level 3 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi Academy. The current dataset release has four levels in the hierarchy.
10. level4_id - The hashed level 4 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi Academy. The current dataset release has four levels in the hierarchy.

***

### Info_UserData
1. uuid - The unique ID of this user.
2. gender - The gender of this user. There are four possible values: `male`, `female`, `unspecified` and `null`.
3. points - The user will receive energy points from the Junyi Academy after completing exercises, watching videos, and when the user receives a badge.
4. badges_cnt - Badges are awarded to the users when the user achieves certain conditions.
5. first_login_date_TW - The first login date after the user registers to Junyi Academy.
6. user_grade - The grade of the user. The possible values are between 1 and 12.
7. user_city - The resident city of the user.
8. has_teacher_cnt - The number of teachers this user has in the Junyi Academy.
9. is_self_coach - Whether the user has added themselves as their own teacher.
10. has_student_cnt - The number of students this user has in the Junyi Academy. Even though this user's role is student, the user can still add another user as a student.
11. belongs_to_class_cnt - The number of classes this user belongs to.
12. has_class_cnt - The number of classes this user created to add other users. Even though this user's role is student, the user can still create a class and add other users to it.

***

Using the above features, the goal is to predict whether a student will answer a problem correctly, given the details of the problem and the student's performance history. 
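
The three files share the keys `uuid` and `ucid`, so each log row can be enriched with user and content metadata. The sketch below shows the join on a few made-up rows (the values are illustrative, not taken from the dataset). The `validate="many_to_one"` argument is an extra safety check assumed here; it raises an error if a key is duplicated in a lookup table.

```python
import pandas as pd

# Toy stand-ins for Log_Problem, Info_UserData and Info_Content (made-up values)
log = pd.DataFrame({
    "uuid": ["u1", "u2", "u1"],
    "ucid": ["c1", "c1", "c2"],
    "is_correct": [True, False, True],
})
user = pd.DataFrame({"uuid": ["u1", "u2"], "user_grade": [5, 7]})
content = pd.DataFrame({"ucid": ["c1", "c2"], "difficulty": ["easy", "hard"]})

# Left joins keep every log row, in the original order
joined = (
    log.merge(user, on="uuid", how="left", validate="many_to_one")
       .merge(content, on="ucid", how="left", validate="many_to_one")
)
print(joined)
```

A left join is used so that log rows without matching metadata are kept, with missing values, rather than silently dropped.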

In [None]:
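import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from collections import Counter, defaultdict
import seaborn as sns
from matplotlib import pyplot as plt
# scikit-learn: scaling and the classifiers used in the modeling sections
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier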

## Initialize the constants

In the following cell, we initialize the constants used throughout the notebook. Some variables control which parts of the notebook run, and others hold the paths to the raw (original) data and the pre-processed data. 

The pre-processed data has the following files -

1. FILE_LOG_PROCESSED - This contains the data from the Log_Problem table sorted by timestamp. 
2. FILE_USER_PROCESSED - This contains the data from Info_UserData in parquet format for quick read from disk into memory. 
3. FILE_CONTENT_PROCESSED - This contains the data from Info_Content in parquet format for quick read from disk into memory.
4. FILE_M_PROFICIENCY_LEVEL4 - This file has the level-4 proficiency matrix, with one row per log entry and one column per level_4 id. We use this to capture the learning history of a student. We will see more on this later.
5. FILE_M_PROFICIENCY_CONCEPT - This file has the concept proficiency matrix, with one row per log entry and one column per concept id. Again, we use this to capture the learning history of a student. We will see more on this later.
6. FILE_V_UCID_ACC - This file has the accuracy per content id, i.e., the fraction of correct answers received so far for the content id.
7. FILE_V_UPID_ACC - This file has the accuracy per problem id, i.e., the fraction of correct answers received so far for the problem id.
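
Several of these files are NumPy `.npz` archives. When an array is passed positionally to `np.savez_compressed`, it is stored under the default key `arr_0`, which is the key used when the archive is read back. A minimal round trip, using a temporary directory and a hypothetical file name `v_example`:

```python
import os
import tempfile

import numpy as np

v_example = np.array([[0.5], [0.75]], dtype="float16")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "v_example")
    # Positional arrays are stored as 'arr_0', 'arr_1', ...; '.npz' is appended to the name
    np.savez_compressed(path, v_example)
    with np.load(path + ".npz") as archive:
        keys = archive.files
        restored = archive["arr_0"]

print(keys, restored.dtype, restored.shape)
```

Passing the array as a keyword argument instead (for example `np.savez_compressed(path, my_name=v_example)`) would store it under that keyword, so the key used for loading has to match how the file was saved.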

In [None]:
# - we shouldn't use this info before the user takes the exercise
VARS_REDUNDANT = ['total_sec_taken','is_hint_used','is_downgrade','is_upgrade']
VARS_LOG_CATEGORY = ['uuid', 'ucid', 'upid']
VARS_CONTENT_CATEGORY = ['level3_id','level4_id']

# - par
# ORDER_MONTH = ['2018-08','2018-09','2018-10','2018-11','2018-12',
#                '2019-01','2019-02','2019-03','2019-04','2019-05','2019-06','2019-07','2019-08']

# - control which parts of the notebook to run
# -- false if want to read the preprocessed files to save time
RUN_PREPRECESS = False
RUN_LEVEL4 = True

# whether to compute the one-hot encoding vector for level4
RUN_M_LEVEL4 = True

# -- whether to compute the upid accuracy vector (False: read from input file)
RUN_V_UPID_ACC = False

# -- whether to compute the proficiency matrix (False: read from input file)
RUN_M_PROFICIENCY = False

# -- whether to compute the concept proficiency matrix.
RUN_M_CONCEPT = False

PLOT = True
# MONTH_EXCLUDED = ['2018-08','2019-08']

# -- False for reading only the top 1000 rows in the df_log
RUN_FULL = True

# - path
PATH_INPUT = '/kaggle/input/learning-activity-public-dataset-by-junyi-academy/'
PATH_OUTPUT = '/kaggle/working/'
PATH_PREPROCESSED_INPUT = '../input/junyi-preprocessed/'

# - file
# -- raw timestamp
FILE_LOG_FULL = os.path.join(PATH_PREPROCESSED_INPUT ,'Log_Problem_raw_timestamp.parquet.gzip')

FILE_USER = os.path.join(PATH_INPUT,'Info_UserData.csv')
FILE_CONTENT = os.path.join(PATH_INPUT,'Info_Content.csv')

# -- read the preprocessed files to save time
# --- raw timestamp
FILE_LOG_PROCESSED = os.path.join(PATH_PREPROCESSED_INPUT ,'Processed_Log_Problem_raw_timestamp.parquet.gzip')

# --- rounded timestamp
# FILE_LOG_PROCESSED = os.path.join(PATH_PREPROCESSED_INPUT ,'Processed_Log_Problem.parquet.gzip')
FILE_USER_PROCESSED = os.path.join(PATH_PREPROCESSED_INPUT ,'Processed_Info_UserData.parquet.gzip')
FILE_CONTENT_PROCESSED = os.path.join(PATH_PREPROCESSED_INPUT ,'Processed_Info_Content.parquet.gzip')

# FILE_M_HISTORY_LEVEL4 = os.path.join(PATH_PREPROCESSED_INPUT ,'m_history_level4.npz')
FILE_M_PROFICIENCY_LEVEL4 = os.path.join(PATH_PREPROCESSED_INPUT ,'m_proficiency_level4.npz')
FILE_M_PROFICIENCY_CONCEPT = os.path.join(PATH_PREPROCESSED_INPUT ,'m_concept_proficiency.npz')
FILE_V_UCID_ACC = os.path.join(PATH_PREPROCESSED_INPUT ,'v_ucid_acc.npz')
FILE_V_UPID_ACC = os.path.join(PATH_PREPROCESSED_INPUT ,'v_upid_acc.npz')

## Pre-processing

The following section contains the code for preprocessing the data.
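
One of the preprocessing steps redefines `level` as the level right before the attempt, by undoing the change that the attempt itself caused: `level_before = level_after + is_downgrade - is_upgrade`. The sketch below applies this to a few made-up rows, assuming that a missing flag means the level did not change. The column name `level_before` is only used for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "level": pd.Series([1, 1, 0, 2], dtype="int8"),  # level AFTER the attempt
    "is_upgrade": pd.array([True, False, pd.NA, True], dtype="boolean"),
    "is_downgrade": pd.array([False, False, True, pd.NA], dtype="boolean"),
})

# Missing flags are treated as "no change" before converting to 0/1
up = toy["is_upgrade"].fillna(False).astype("int8")
down = toy["is_downgrade"].fillna(False).astype("int8")

# Undo the effect of this attempt to recover the level before it
toy["level_before"] = (toy["level"] + down - up).astype("int8")
print(toy)
```

Using the pre-attempt level avoids feeding the model information that is only known after the answer has been submitted.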

In [None]:
# Read the preprocessed file (1.5GB)
# - should first run the preprocessing, save the files to output, and then download the files. 
# -- After that, "Add data->upload a data set".
if not RUN_PREPRECESS:
    df_log = pd.read_parquet(FILE_LOG_PROCESSED)
    df_user = pd.read_parquet(FILE_USER_PROCESSED)
    df_content = pd.read_parquet(FILE_CONTENT_PROCESSED)

In [None]:
if RUN_PREPRECESS:
    '''
    -----------------------------------------------------------------------------
    Read in the log file
    -----------------------------------------------------------------------------
    '''

    '''
    -----------------------------------------------------------------------------
    #         Un-comment this block of code to read from csv. 
    #         log_dtypes = {
    #                     'timestamp_TW':'object',
    #                     'uuid':'category',
    #                     'ucid':'category',
    #                     'upid':'category',
    #                     #int16: -32768 to 32767,
    #                     'problem_number':'int16',
    #                     'exercise_problem_repeat_session':'int16',
    #                     'is_correct':'boolean',
    #                     'total_sec_taken':'int16',
    #                     'total_attempt_cnt':'int16',
    #                     'used_hint_cnt':'int16',
    #                     'is_hint_used':'boolean',
    #                     'is_downgrade':'boolean',
    #                     'is_upgrade':'boolean',
    #                     #int8: -128 to 127
    #                     'level':'int8'                            
    #                       }
    #         df_log = pd.read_csv(FILE_LOG_FULL,dtype=log_dtypes)  
    #                 # 545.9+ MB
    -----------------------------------------------------------------------------
    '''

    # read from parquet         
    df_log = pd.read_parquet(FILE_LOG_FULL)
 
    
    '''
    -----------------------------------------------------------------------------
    Read in the user file
    -----------------------------------------------------------------------------
    '''
    user_dtype = {'uuid':'category',
                  'gender':'category',
                  #int8: -128 to 127
                  'user_grade':'int8'
                 }
    df_user = pd.read_csv(FILE_USER,dtype=user_dtype)
    
    '''
    -----------------------------------------------------------------------------
    Read in the content file
    -----------------------------------------------------------------------------
    '''
    content_dtype = {'ucid':'category',
                     'level3_id':'category',
                     'level4_id':'category',
                     'difficulty':'category',
                     'learning_stage':'category'
                     }    
    df_content = pd.read_csv(FILE_CONTENT,dtype=content_dtype)

In [None]:
if RUN_PREPRECESS:
    # join the "user_grade" and "gender" info
    df_log = pd.merge(df_log,df_user[['uuid','user_grade','gender']],on='uuid',how='left')  

In [None]:
if RUN_PREPRECESS:
    # sort the df by time stamp in ascending order
    # - to facilitate the derivation of the history vectors
    df_log = df_log.sort_values(by = 'timestamp_TW')   
    # reset the row index
    df_log = df_log.reset_index(drop=True)

    # NOTE: If using the log file with rounded timestamp:
    # one critical limitation: the timestamp was rounded to the closest 15 mins, 
    # so the order of the row does not reflect the actual order of a student's activity
    # print(df_log.head())

In [None]:
if RUN_PREPRECESS:
    # - Convert gender to one-hot encoding to handle "unspecified"
    # set NaN as "unspecified"
    df_log.fillna(value = {'gender':'unspecified'},inplace=True)
    df_log = pd.concat([df_log,pd.get_dummies(df_log.gender)],axis=1).drop(columns='gender')

In [None]:
if RUN_PREPRECESS:
    # Redefine "level" as the 'uuid' level of this exercise right before the attempt
    # - Should offset the change due to this attempt
    df_log['level'] = (df_log['level']+df_log['is_downgrade'].fillna(0).astype(int)-df_log['is_upgrade'].fillna(0).astype(int)).astype('int8')

In [None]:
if RUN_PREPRECESS:
    # Preprocessing
    # - drop redundant columns
    df_log = df_log.drop(columns = VARS_REDUNDANT)

##### Uncomment the following cell to save the output of pre-processed data if you are running pre-processing. This saves the data to output directory for future use.

In [None]:
# # save the preprocessed data
# df_log.to_parquet(os.path.join(PATH_OUTPUT ,'Processed_Log_Problem_raw_timestamp.parquet.gzip'))
# df_user.to_parquet(os.path.join(PATH_OUTPUT ,'Processed_Info_UserData.parquet.gzip'))
# df_content.to_parquet(os.path.join(PATH_OUTPUT ,'Processed_Info_Content.parquet.gzip'))

In [None]:
## Pick only a subset of records for quick testing. 
if not RUN_FULL:
    df_log = df_log.head(1000)

In [None]:
# Parameters used by the following cells for various tasks. 
# - used by the cells below
# list of problem id in order
list_upid = df_log.upid.unique().to_numpy()
# list of concept id in order
list_concept_id = df_content.ucid.to_numpy()
# list of the level4_id in order
list_level4_id = df_content.level4_id.unique().to_numpy()
# list of user id in order
list_user_id = df_user['uuid'].unique()

# dict of list_upid {id: order}
dict_upid = {id:order for order, id in enumerate(list_upid)}
# dict of list_concept_id {id: order}
dict_concept_id = {id:order for order, id in enumerate(list_concept_id)}
# dict of list_level4_id {id: order}
dict_level4_id = {id:order for order, id in enumerate(list_level4_id)}
# dict of list_user_id {id: order}
dict_user_id = {id:order for order, id in enumerate(list_user_id)}

### Feature Engineering

This section contains the code for creating concept proficiency, level_4 proficiency, upid_accuracy vector.

#### Creation of UPID Accuracy Matrix

- m = 25785 (the number of distinct `upid` values)
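
The running accuracy can also be expressed with grouped cumulative operations instead of a row-by-row loop. The sketch below is an illustration on a toy frame, not the notebook's implementation, and the helper name `past_accuracy` is hypothetical. For each row it uses only earlier rows with the same id, and it falls back to a default value (here the overall mean, in the spirit of `ACC_GRAND_AVG`) when an id has no history yet. It assumes the rows are already sorted by time.

```python
import numpy as np
import pandas as pd

def past_accuracy(ids: pd.Series, correct: pd.Series, default: float) -> np.ndarray:
    """Accuracy per id computed from strictly earlier rows only."""
    correct = correct.astype(int)
    grouped = correct.groupby(ids)
    prior_count = grouped.cumcount()          # attempts on this id before this row
    prior_sum = grouped.cumsum() - correct    # correct answers before this row
    # Rows without history divide by NaN and then receive the default value
    acc = (prior_sum / prior_count.where(prior_count > 0)).fillna(default)
    return acc.to_numpy()

toy = pd.DataFrame({
    "upid": ["a", "b", "a", "a", "b"],
    "is_correct": [1, 0, 0, 1, 1],
})
acc = past_accuracy(toy["upid"], toy["is_correct"], default=toy["is_correct"].mean())
print(acc)
```

Subtracting the current row from the cumulative sum is what keeps the feature free of the label it is meant to help predict.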

In [None]:
if RUN_V_UPID_ACC:
    ACC_GRAND_AVG = df_log.is_correct.mean()
    # create the accuracy vector (# logs, 1) which encodes the upid accuracy of each log so far (based only on past data)
    v_upid_acc = np.zeros((len(df_log),1),dtype = 'float16')
    # - initialize the helper vector (# upid, 1) to keep track of the sum of correct response
    v_sum_correct = np.zeros((len(list_upid),1),dtype='int')
    # - initialize the helper matrix (# upid, 1) to keep track of the count per upid
    v_count = np.zeros((len(list_upid),1),dtype='int')    

    # update the matrices while iterating over df_log
    for i_r,log in df_log.iterrows():
        if i_r % 10000 == 0:                
            print(i_r)    
        # update v_acc (should update this before processing the response of this current log)
        if v_count[dict_upid[log['upid']],0] == 0:            
            v_upid_acc[i_r,0] = ACC_GRAND_AVG
        else:
            v_upid_acc[i_r,0] = v_sum_correct[dict_upid[log['upid']],0]/v_count[dict_upid[log['upid']],0]
        # update v_sum_correct
        v_sum_correct[dict_upid[log['upid']],0] += log['is_correct']   
        
        # update v_count
        v_count[dict_upid[log['upid']],0] += 1
            
    # save the v_acc
    np.savez_compressed(os.path.join(PATH_OUTPUT,'v_upid_acc'), v_upid_acc)

In [None]:
if not RUN_V_UPID_ACC:
    v_upid_acc = np.load(FILE_V_UPID_ACC)['arr_0']

#### Creation of Concept proficiency matrix

- m = 1326 (the number of distinct concept ids, `ucid`)

In [None]:
if RUN_M_CONCEPT:
    # create the "proficiency matrix" (# logs, # concept id) which encodes the most recent level per concept id
    m_concept_proficiency = np.empty((len(df_log),len(list_concept_id)),dtype = 'float16')
    m_concept_proficiency[:] = np.nan     

    # update the matrices while iterating over df_log
    for i_r,log in df_log.iterrows():
        if i_r % 1000000 == 0:                
            print(i_r)

        # update the "proficiency matrix" with the level of the current concept (ucid)
        # - only update the relevant cell
        m_concept_proficiency[i_r,dict_concept_id[log['ucid']]] = log['level']

In [None]:
if not RUN_M_CONCEPT:
    m_concept_proficiency = np.load(FILE_M_PROFICIENCY_CONCEPT)['m_concept_proficiency']

#### Creation of Level-4 proficiency matrix

In [None]:
if RUN_LEVEL4:
    # initialize some useful variables for section below
    # - the map for looking up the dummy vector given a ucid
    df_content_level4_dummies = pd.get_dummies(list_level4_id)
    
    # - the map for looking up the list of ucid given a level4 id
    dict_level4_to_ucid = defaultdict(list)
    for i_r, row in df_content.iterrows():
        dict_level4_to_ucid[dict_level4_id[row['level4_id']]].append(dict_concept_id[row['ucid']])

In [None]:
if RUN_LEVEL4:
    # join the level 4 info
    df_log = df_log.merge(df_content[["ucid","level4_id"]],how ="left")

In [None]:
if RUN_LEVEL4:
    if RUN_M_LEVEL4:
        # Problem vector: one-hot encoding (one row vector for one log, i.e., one row in df_log)
        # - create the 2d numpy matrix of problem vectors: avoid joining to `df_log` (RAM expensive)
        # - (# logs = df_log.shape[0], # level4 id = df_content_level4_dummies.shape[1]-1)
        m_level4_id = df_content_level4_dummies[df_log.level4_id]
        m_level4_id = np.transpose(m_level4_id.to_numpy())

#### Create the matrix of level-4 proficiency vectors

- m = 171
- matrix: (# logs, # level4 id)
- For each cell: the student's most recent "level" of a level-4 category, which is derived by summing (`np.nansum`) the most recent levels of all concepts within one "level-4" category.
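
The bookkeeping behind this matrix can be sketched on a tiny example. The indices and levels below are made up, and the variable names are illustrative. A per-user table holds the most recent level of each concept, and each log row writes the NaN-aware sum over the concepts of the current level-4 category into a single cell.

```python
import numpy as np

# (user index, concept index, level-4 index, level before the attempt), sorted by time
logs = [
    (0, 0, 0, 1),
    (0, 1, 0, 2),
    (1, 0, 0, 3),
    (0, 2, 1, 4),
]
n_users, n_concepts, n_level4 = 2, 3, 2
level4_to_concepts = {0: [0, 1], 1: [2]}  # concepts that belong to each level-4 category

# A float dtype is needed so that unseen entries can be stored as NaN
concept_level = np.full((n_users, n_concepts), np.nan, dtype="float32")
proficiency = np.full((len(logs), n_level4), np.nan, dtype="float32")

for row, (user, concept, group, level) in enumerate(logs):
    concept_level[user, concept] = level
    # Only the cell of the current level-4 category is updated for this row
    proficiency[row, group] = np.nansum(concept_level[user, level4_to_concepts[group]])

print(proficiency)
```

In the second row, user 0 has levels 1 and 2 on the two concepts of category 0, so the cell holds 3. The other cell of that row stays NaN.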

In [None]:
# This will take a long while (1~2 hours...)
if RUN_LEVEL4:
    if RUN_M_PROFICIENCY:
        # create the "proficiency matrix" (# logs, # level 4 id) which encodes the most recent level per level-4 category (summed across ucid with np.nansum)
        # - note: unseen concept is encoded as NaN. Therefore, when computing the level of level-4 sum, one should use np.nansum().
        m_proficiency = np.empty((len(df_log),len(list_level4_id)),dtype = 'float16')
        m_proficiency[:] = np.nan     
        
        # create the helper "concept level matrix" (# users, # concept id) which encodes the most recent level per concept of each student
        # - note: unseen concept is encoded as NaN. Therefore, when computing the level of level-4 sum, one should use np.nansum().        
        # - a float dtype is required here: an integer array cannot hold NaN
        m_concept_level = np.empty((len(list_user_id),len(list_concept_id)),dtype = 'float16')
        m_concept_level[:] = np.nan     
        # update the matrices while iterating over df_log
        for i_r,log in df_log.iterrows():
            if i_r % 10000 == 0:                
                print(i_r)
            # update the "concept level matrix"
            m_concept_level[dict_user_id[log['uuid']],dict_concept_id[log['ucid']]] = log['level']
                            
            # update the "proficiency matrix" with the sum of the concept levels within the level 4 id
            # - only update the relevant cell
            m_proficiency[i_r,dict_level4_id[log['level4_id']]] =\
            np.nansum(m_concept_level[dict_user_id[log['uuid']],dict_level4_to_ucid[dict_level4_id[log['level4_id']]]])                              

        # save the m_proficiency matrix
        np.savez_compressed(os.path.join(PATH_OUTPUT,'m_proficiency_level4'), m_proficiency)

In [None]:
if RUN_LEVEL4:
    if not RUN_M_PROFICIENCY:
        m_proficiency = np.load(FILE_M_PROFICIENCY_LEVEL4)['arr_0']

In [None]:
if not RUN_FULL:
    m_proficiency = m_proficiency[:1000,]

***

#### The following cell is very important. We need to convert the NaN values in the proficiency matrix to 0. Otherwise, we will get an exception when training the model.

***
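
Filling with 0 makes an unseen category indistinguishable from a category at level 0. If that distinction matters, one option (an assumption of this sketch, not something the notebook does) is to record which cells were observed before filling them. The array below is made up for illustration.

```python
import numpy as np

m_example = np.array([[np.nan, 2.0],
                      [1.0, np.nan]], dtype="float32")

# Remember which cells were observed before the NaNs are overwritten
seen = ~np.isnan(m_example)

# Equivalent to m_example[np.isnan(m_example)] = 0, but returns a new array
filled = np.nan_to_num(m_example, nan=0.0)

print(filled)
print(seen.astype("int8"))
```

The `seen` indicator could be appended as extra feature columns, at the cost of doubling the width of this block of features.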

In [None]:
m_proficiency[np.isnan(m_proficiency)] = 0

## Data Exploration

The following section contains a few data exploration tasks that were done to understand the correlation between the input variables and the output. This section is entirely optional, and you can skip to the next section if needed. 
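
Correlation is only defined for numeric columns, so categorical columns need a numeric encoding before they can appear in the matrix. A minimal sketch on made-up rows, assuming the lowercase category labels used in the cells below:

```python
import pandas as pd

toy = pd.DataFrame({
    "is_correct": [1, 1, 0, 0, 1, 0],
    "difficulty": ["easy", "easy", "hard", "normal", "easy", "hard"],
    "used_hint_cnt": [0, 0, 2, 1, 0, 3],
})

# Ordinal codes: this assumes the categories have a meaningful order
difficulty_codes = {"unset": 0, "easy": 1, "normal": 2, "hard": 3}
toy["difficulty"] = toy["difficulty"].map(difficulty_codes)

corr = toy.corr()
print(corr.round(2))
```

An ordinal code is reasonable for `difficulty`, but for unordered categories such as `gender` the resulting correlation depends on the arbitrary choice of codes, so it should be read with care.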

In [None]:
user_data = pd.read_csv(FILE_USER)
# FILE_LOG_FULL points to a parquet file, so it is read with read_parquet
log_problem = pd.read_parquet(FILE_LOG_FULL)
content = pd.read_csv(FILE_CONTENT)

In [None]:
# Join tables based on uuid and ucid
df1 = pd.merge(log_problem, user_data, on='uuid')
df2 = pd.merge(df1, content, on='ucid')

In [None]:
'''
    -----------------------------------------------------------------------------
    Select only required columns
    -----------------------------------------------------------------------------
'''

required_columns = ['is_correct', 'total_sec_taken', 'total_attempt_cnt', 'used_hint_cnt', 'is_hint_used', 'level', 'difficulty', 'learning_stage', 'gender', 'user_grade', 'has_teacher_cnt', 'is_self_coach', 'has_student_cnt', 'belongs_to_class_cnt', 'has_class_cnt']

# copy so that later in-place edits do not act on a view of df2
df = df2[required_columns].copy()
df.head(5)

In [None]:
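# restrict to numeric/boolean columns; string columns cannot be correlated
corr = df.select_dtypes(include=['number', 'bool']).corr()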

In [None]:
f = plt.figure(figsize=(15, 15))
# label the axes with the columns of `corr`, which may be fewer than the columns of `df`
plt.matshow(corr, fignum=f.number)
plt.xticks(range(corr.shape[1]), corr.columns, fontsize=14, rotation=45)
plt.yticks(range(corr.shape[1]), corr.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

In [None]:
plt.figure(figsize = (13, 13))
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

In [None]:
'''
-----------------------------------------------------------------------------
The above correlation matrix does not include gender, difficulty, learning_stage because they are non-numeric values.
Changing them to numeric values so that we can use them for training.
-----------------------------------------------------------------------------
'''

print('Unique Gender values = ', df.gender.unique())
print('Unique Difficulty values = ', df.difficulty.unique())
print('Unique Learning Stage values = ', df.learning_stage.unique())

In [None]:
'''
-----------------------------------------------------------------------------
Assigning category labels to the Gender, Difficulty and Learning Stage columns.
-----------------------------------------------------------------------------
'''

# assign the result back instead of using a chained in-place replace
df['gender'] = df['gender'].replace({'unspecified': 0, 'male': 1, 'female': 2})
df['difficulty'] = df['difficulty'].replace({'unset': 0, 'easy': 1, 'normal': 2, 'hard': 3})
df['learning_stage'] = df['learning_stage'].replace({'elementary': 0, 'junior': 1, 'senior': 2})

In [None]:
# Dropping Nan values

df = df.dropna()

In [None]:
'''
-----------------------------------------------------------------------------
Plotting correlation matrix after updating gender, difficulty and learning stage with numerical values.
-----------------------------------------------------------------------------
'''

corr = df.corr()

# Print correlation matrix
corr

In [None]:
plt.figure(figsize = (10, 10))
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

## Model Training and Evaluation

The following sections contain the code for training and evaluation of several different models. For each model, we used a combination of following features - 

- level
- difficulty
- learning_stage
- gender
- user_grade
- has_teacher_cnt
- is_self_coach
- has_student_cnt
- belongs_to_class_cnt
- has_class_cnt
- m_level4_proficiency matrix
- m_concept_proficiency matrix
- v_upid_acc matrix
- v_ucid_acc matrix

In each model, the output variable is `is_correct`, i.e., whether the student got the particular problem right or wrong. Each subsection contains the code for creating the training and testing data, and we report the accuracy for training and testing sets of different sizes. In most cases, we were unable to train the model on the entire dataset due to memory constraints. Hence, we used data set sizes of `10K`, `100K`, and `1MM` for training and evaluation.

***


### Model 1: Benchmark model

#### Input Features
- level
- difficulty
- learning_stage
- gender
- user_grade
- has_teacher_cnt
- is_self_coach
- has_student_cnt
- belongs_to_class_cnt
- has_class_cnt

#### Output Feature
- is_correct

In [None]:
## Note: the df used here comes from the data exploration section. Only this logistic regression model uses this dataframe as input.
## All of the following models use a different dataframe as input. 

input_data = df.to_numpy()
n = input_data.shape[0]

In [None]:
'''
-------------------
    Split the data into 80 - 20% split for training and testing
-------------------
'''
num_samples = int(n * 0.8)

samples = np.random.choice(range(n), num_samples, replace=False)

mask = np.ones(n, dtype=bool)
mask[samples] = False

X_train = input_data[samples, 5:]
y_train = input_data[samples, 0]

#y_train = np.reshape(y_train, (num_samples, 1))
y_train = y_train.astype('int')

X_eval = input_data[mask, 5:]
y_eval = input_data[mask, 0]

#y_eval = np.reshape(y_eval, (n - num_samples, 1))
y_eval = y_eval.astype('int')

print('X_train shape is = ', np.shape(X_train))
print('y_train shape is = ', np.shape(y_train))


print('X_eval shape is = ', np.shape(X_eval))
print('y_eval shape is = ', np.shape(y_eval))

In [None]:
# fit the scaler on the training set only
scaler = preprocessing.MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

model = LogisticRegression(random_state=0).fit(X_train_scaled, y_train)

In [None]:
# reuse the scaler fitted on the training set so both sets share the same scale
X_eval_scaled = scaler.transform(X_eval)

model.score(X_eval_scaled, y_eval)

***

- Accuracy (n = 1MM) = 70.9 %
- Accuracy (n = 16MM) (Entire Dataset) = 71.1 %

***

### Model 2 - Full model


- For how the features were engineered, see the section [Feature Engineering](#Feature-Engineering)
- Labels (y) [# logs x 1]:
    - Whether the new problem is answered correctly (problem-level)
- Features (X):
    - Demographics [#logs x 4] [From **df_user**]
        - grade (#logs x 1)
        - gender (#logs x 3)
    - Difficulty features  [#logs x 1]:
        - upid accuracy [**From v_upid_acc**]
        - ~difficulty (only 3 levels, not quite informative)~
        - ~learning_stage (only elementary vs. junior, not quite informative)~
    - History features [#logs x 3]: 
        - most recent 'Level' of this ucid [From **df_log**]
        - 'problem_number' of this 'ucid' [From **df_log**]
        - 'exercise_problem_repeat_session' of this 'upid' [From **df_log**]        
    - One-hot encoding matrix [#logs x #level4 id]:  [**m_level4_id**]
        - one-hot encoding of the level-4 content ID of the new problem
    - Proficiency matrix [#logs x #level4 id]: [**m_proficiency**]
        - encodes the student’s performance of each content (i.e.,level)    
- Model:
    - Decision Tree
    - Logistic Regression
        - With L2 penalty
        - With L1 penalty
    - SVM
        - With rbf kernel
        - With linear kernel
- Evaluate Accuracy:
    - Hold-out 20% test set

***
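
The feature matrix is built by placing blocks of columns side by side and then splitting the rows with a boolean mask. The sketch below reproduces that pattern on random toy data. The sizes and the `group_idx` array are made up, and it uses `np.random.default_rng` rather than the global seed used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(760)
n_rows, n_groups = 10, 3

# Hypothetical level-4 index of each log row
group_idx = rng.integers(0, n_groups, size=n_rows)
# Row i of the identity matrix is the one-hot vector of category i
one_hot = np.eye(n_groups, dtype="int8")[group_idx]

# A single-column feature needs an explicit second axis before concatenation
grade = rng.integers(1, 13, size=n_rows)[:, np.newaxis]
X = np.concatenate((grade, one_hot), axis=1)

# 80/20 split: True marks training rows, False marks test rows
train_idx = rng.choice(n_rows, size=int(n_rows * 0.8), replace=False)
mask_train = np.zeros(n_rows, dtype=bool)
mask_train[train_idx] = True

X_train, X_test = X[mask_train], X[~mask_train]
print(X_train.shape, X_test.shape)
```

Because the mask and its negation partition the rows, no row can end up in both sets.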


#### Split the data into 80 - 20% split for training and testing

In [None]:
# set to `num_samples` for using full data. set to a small number for quick testing
# n_subset = 10000000 will overflow the RAM limit (this step `np.concatenate()`)
n_subset = 10000
# n_subset = df_log.shape[0]

num_samples = int(df_log.head(n_subset).shape[0])
num_train_samples = int(num_samples * 0.8)

np.random.seed(760)
samples_train = np.random.choice(range(num_samples), num_train_samples, replace=False)

# True: training set/ False: test set
mask_train = np.zeros(num_samples, dtype=bool)
mask_train[samples_train] = True

X_train = np.concatenate((
        # grade
        df_log.head(n_subset).loc[mask_train,"user_grade"].to_numpy()[:,np.newaxis],
        # gender
        df_log.head(n_subset).loc[mask_train,["female","male","unspecified"]].to_numpy(),
        # Difficulty features 
        v_upid_acc[:n_subset,:][mask_train,:],
        # History features
        df_log.head(n_subset).loc[mask_train,"level"].to_numpy()[:,np.newaxis],    
        df_log.head(n_subset).loc[mask_train,"problem_number"].to_numpy()[:,np.newaxis],
        df_log.head(n_subset).loc[mask_train,"exercise_problem_repeat_session"].to_numpy()[:,np.newaxis],    
        # one-hot matrix
        m_level4_id[:n_subset,:][mask_train,:],
        # proficiency matrix
        m_proficiency[:n_subset,:][mask_train,:]
#         # interaction between one-hot matrix and proficiency matrix
#         m_inter_level4_proficiency[:n_subset,:][mask_train,:]
    ),axis=1)

y_train = df_log.head(n_subset).loc[mask_train,"is_correct"].to_numpy(dtype = bool)

X_test = np.concatenate((
        # grade    
        df_log.head(n_subset).loc[~mask_train,"user_grade"].to_numpy()[:,np.newaxis],
        # gender
        df_log.head(n_subset).loc[~mask_train,["female","male","unspecified"]].to_numpy(),
        # Difficulty features 
        v_upid_acc[:n_subset,:][~mask_train,:],    
        # History features
        df_log.head(n_subset).loc[~mask_train,"level"].to_numpy()[:,np.newaxis],        
        df_log.head(n_subset).loc[~mask_train,"problem_number"].to_numpy()[:,np.newaxis],
        df_log.head(n_subset).loc[~mask_train,"exercise_problem_repeat_session"].to_numpy()[:,np.newaxis],    
        # one-hot matrix
        m_level4_id[:n_subset,:][~mask_train,:],
        # proficiency matrix
        m_proficiency[:n_subset,:][~mask_train,:]
#         # interaction between one-hot matrix and proficiency matrix
#         m_inter_level4_proficiency[:n_subset,:][~mask_train,:]    
    ),axis=1)
y_test = df_log.head(n_subset).loc[~mask_train,"is_correct"].to_numpy(dtype = bool)


print('X_train shape is = ', np.shape(X_train))
print('y_train shape is = ', np.shape(y_train))

print('X_test shape is = ', np.shape(X_test))
print('y_test shape is = ', np.shape(y_test))

#### Min-max transformation
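
Min-max scaling maps each feature to the range [0, 1] through `(x - min) / (max - min)`. The sketch below spells out that formula in plain NumPy under one common convention, assumed here: the minimum and range are estimated on the training rows only and then reused for the test rows. The helper names `fit_min_max` and `apply_min_max` are hypothetical.

```python
import numpy as np

def fit_min_max(X_train):
    """Return the per-column minimum and range estimated from the training rows."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return col_min, col_range

def apply_min_max(X, col_min, col_range):
    """Scale X with statistics that were estimated elsewhere."""
    return (X - col_min) / col_range

X_train_toy = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test_toy = np.array([[2.0, 10.0], [7.0, 10.0]])

col_min, col_range = fit_min_max(X_train_toy)
X_train_scaled_toy = apply_min_max(X_train_toy, col_min, col_range)
X_test_scaled_toy = apply_min_max(X_test_toy, col_min, col_range)
print(X_test_scaled_toy)
```

With this convention, test values can fall outside [0, 1] (here 7.0 maps to 1.5), which is expected: the test rows are placed on the scale learned from the training rows.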

In [None]:
# Overwrite the raw data matrix to reduce RAM usage
# - fit the scaler on the training set only and reuse it for the test set
scaler = preprocessing.MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#### Decision Tree

In [None]:
dc_full = DecisionTreeClassifier(criterion="entropy",random_state=0).fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(dc_full.score(X_train, y_train))+"/ ",end = "")
print("test = " + str(dc_full.score(X_test, y_test)))

# With difficulty ----------------
# [Grade + Gender model + content history + difficulty + one-hot matrix + proficiency matrix]
# n_subset = 10000 : train = 0.901875/ test = 0.6715
# n_subset = 100000 : train = 0.92925/ test = 0.6881
# n_subset = 1000000 : train = 0.9574675/ test = 0.67654

# [Grade + Gender model + content history + difficulty + one-hot matrix]
# n_subset = 10000 : train = 0.892375/ test = 0.6775
# n_subset = 100000 : train = 0.9190875/ test = 0.69365
# n_subset = 1000000 : train = 0.94633875/ test = 0.67734

# [Grade + Gender model + content history + difficulty]
# n_subset = 10000 : train = 0.82225/ test = 0.683
# n_subset = 100000 : train = 0.8305875/ test = 0.71545
# n_subset = 1000000 : train = 0.86713875/ test = 0.69857

# Without difficulty ----------------
# [Grade+Gender model]
# n_subset = 10000: train = 0.7225/ test = 0.721 (Best)
# n_subset = 100000: train = 0.7413/ test = 0.74865 (Best)
# n_subset = 1000000: train = 0.74041625/ test = 0.741355 (Best)

# [Grade only model] 
# n_subset = 10000: train = 0.7225 / test = 0.721 (Best)
# n_subset = 100000: train = 0.7411875 / test = 0.7486 (Best)
# n_subset = 1000000: train = 0.74041625/ test = 0.741355 (Best)
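
The [Grade + Gender] and [Grade only] rows above report almost the same accuracies. One possible explanation, which the notebook does not confirm, is that these models predict the majority class for every row. A sketch of how that could be checked with a majority-class baseline follows; the labels are synthetic and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly 72% positives (assumption, not the real data)
rng = np.random.default_rng(0)
y_demo = rng.random(1000) < 0.72
X_demo = np.zeros((y_demo.size, 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

# The baseline accuracy equals the share of the majority class
majority_share = max(y_demo.mean(), 1 - y_demo.mean())
print(baseline_acc, majority_share)
```

In the notebook, the same check would be `DummyClassifier(strategy="most_frequent").fit(X_train, y_train).score(X_test, y_test)`. If that matches the recorded test accuracy, the demographic features add nothing beyond the class prior.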

#### Gradient Boosting

- https://medium.com/@gabrieltseng/gradient-boosting-and-xgboost-c306c1bcfaf5

In [None]:
gb_full = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(gb_full.score(X_train, y_train))+"/ ",end = "")
print("test = " + str(gb_full.score(X_test, y_test)))

# With difficulty ----------------
# [Grade + Gender model + content history + difficulty + one-hot matrix + proficiency matrix]
# n_subset = 10000 : train = 0.760875/ test = 0.744
# n_subset = 100000 : train = 0.763675/ test = 0.7665
# n_subset = 1000000 : train = 0.76075/ test = 0.761455 (best)

# [Grade + Gender model + content history + difficulty + one-hot matrix]
# n_subset = 10000 : train = 0.755125/ test = 0.745 (best)
# n_subset = 100000 : train = 0.7642125/ test = 0.7678 (best)
# n_subset = 1000000 : train = 0.7605875/ test = 0.76125

# [Grade + Gender model + content history + difficulty]
# n_subset = 10000 : train = 0.74675/ test = 0.732
# n_subset = 100000 : train = 0.7594375/ test = 0.7638
# n_subset = 1000000 : train = 0.7604625/ test = 0.76084

# Without difficulty ----------------
# [Grade + Gender model]
# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000 : train = 0.74041625/ test = 0.741355

# [Grade only model] 
# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000 : train = 0.74041625/ test = 0.741355

#### Logistic Model (with L2 penalty)

In [None]:
logit_full = LogisticRegression(random_state=0,max_iter=1000).fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(logit_full.score(X_train, y_train))+"/ ",end = "")
print("test = " + str(logit_full.score(X_test, y_test)))

# With difficulty ----------------
# [Grade + Gender model + content history + difficulty + one-hot matrix + proficiency matrix]
# n_subset = 10000 : train = 0.74325/ test = 0.734 (Best)
# n_subset = 100000 : train = 0.7613375/ test = 0.7638
# n_subset = 1000000 : train = 0.7580275/ test = 0.75891 (Best)

# [Grade + Gender model + content history + difficulty + one-hot matrix]
# n_subset = 10000 : train = 0.741375/ test = 0.7325
# n_subset = 100000 : train = 0.760325/ test = 0.7641 (Best)
# n_subset = 1000000 : train = 0.7577325/ test = 0.75869

# [Grade + Gender model + content history + difficulty]
# n_subset = 10000 : train = 0.73125/ test = 0.728
# n_subset = 100000 : train = 0.75265/ test = 0.7574
# n_subset = 1000000 : train = 0.75702875/ test = 0.758

# Without difficulty ----------------
# [Grade + Gender model]
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

# [Grade only model]
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

#### Logistic Model (with L1 penalty)

In [None]:
lasso_full = LogisticRegression(penalty='l1', solver='saga',random_state=0,max_iter=1000).fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(lasso_full.score(X_train, y_train))+"/ ",end = "")
print("test = " + str(lasso_full.score(X_test, y_test)))

# With difficulty ----------------
# [Grade + Gender model + content history + difficulty + one-hot matrix + proficiency matrix]
# n_subset = 10000 : train = 0.7435/ test = 0.7335 (Best)
# n_subset = 100000 : train = 0.761575/ test = 0.7641
# n_subset = 1000000 : train = 0.75812375/ test = 0.758975 (Best)

# [Grade + Gender model + content history + difficulty + one-hot matrix]
# n_subset = 10000 : train = 0.741/ test = 0.7335 (Best)
# n_subset = 100000 : train = 0.760275/ test = 0.7643 (Best)
# n_subset = 1000000 : train = 0.75778875/ test = 0.758735

# [Grade + Gender model + content history + difficulty]
# n_subset = 10000 : train = 0.729875/ test = 0.7285
# n_subset = 100000 : train = 0.7526625/ test = 0.7576
# n_subset = 1000000 : train = 0.75708/ test = 0.7581

# Without difficulty ----------------
# [Grade + Gender model]
# n_subset = 10000: train = 0.7225 / test = 0.721
# n_subset = 100000: train = 0.7411875 / test = 0.7486
# n_subset = 1000000: train = 0.74041625 / test = 0.741355

# [Grade only model]
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355
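
An L1 penalty can drive some coefficients to exactly zero, so counting the non-zero coefficients shows how many features the model actually uses. The sketch below uses synthetic data in which only the first column is informative. The value `C=0.1` and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first of ten columns carries signal (assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((500, 10))
y_demo = X_demo[:, 0] > 0.5

# Same penalty and solver as the notebook; C=0.1 strengthens the penalty
l1_model = LogisticRegression(
    penalty="l1", solver="saga", C=0.1, random_state=0, max_iter=1000
).fit(X_demo, y_demo)

# Count how many coefficients survived the L1 penalty
n_nonzero = int(np.count_nonzero(l1_model.coef_))
print(n_nonzero, "of", l1_model.coef_.size, "coefficients are non-zero")
```

For the fitted notebook model, the equivalent inspection would be `np.count_nonzero(lasso_full.coef_)`.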

#### SVM (with RBF kernel)

In [None]:
svc_full = SVC().fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(svc_full.score(X_train, y_train))+"/ ",end = "")
print("test = " + str(svc_full.score(X_test, y_test)))

# With difficulty ----------------
# [Grade + Gender model + content history + difficulty + one-hot matrix + proficiency matrix]
# n_subset = 10000 : train = 0.748375/ test = 0.7395 (Best)
# n_subset = 100000: (exceeds the 6 hour time limit...)
# n_subset = 1000000:

# [Grade + Gender model + content level + difficulty + one-hot matrix]
# n_subset = 10000 : train = 0.7455/ test = 0.738
# n_subset = 100000: 
# n_subset = 1000000:

# [Grade + Gender model + content level + difficulty]
# n_subset = 10000 : train = 0.736625/ test = 0.7395 (Best)
# n_subset = 100000: 
# n_subset = 1000000:

# Without difficulty ----------------
# [Grade + Gender model]
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000:
# n_subset = 1000000:

# [Grade only model]
# n_subset = 10000: train = 0.7225/test = 0.721
# n_subset = 100000: 
# n_subset = 1000000:
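
The kernel SVM runs above did not finish within the time limit at `n_subset = 100000`. One alternative, not used in this notebook, is to approximate the RBF kernel with a Nystroem feature map and train a linear SVM on the mapped features. This is a sketch on synthetic data. `n_components=50` is an arbitrary choice and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic data standing in for the scaled feature matrix (assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((500, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1]) > 1.0

# Approximate RBF feature map followed by a linear SVM
approx_svc = make_pipeline(
    Nystroem(kernel="rbf", n_components=50, random_state=0),
    LinearSVC(random_state=0, max_iter=10000, dual=False),
)
approx_svc.fit(X_demo, y_demo)
approx_acc = approx_svc.score(X_demo, y_demo)
print("train accuracy =", approx_acc)
```

The result is an approximation, so its accuracy would not be directly comparable to the exact `SVC()` numbers recorded above.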

#### SVM (with polynomial kernel, degree = 3)

In [None]:
svc_poly_full = SVC(kernel='poly').fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(svc_poly_full.score(X_train, y_train))+"/ ",end = "")
print("test = " + str(svc_poly_full.score(X_test, y_test)))

# With difficulty ----------------
# [Grade + Gender model + content history + difficulty + one-hot matrix + proficiency matrix]
# n_subset = 10000 : train = 0.75025/ test = 0.74 (Best)
# n_subset = 100000: (exceeds the 6 hour time limit...)
# n_subset = 1000000:

# [Grade + Gender model + content level + difficulty + one-hot matrix]
# n_subset = 10000 : train = 0.74875/ test = 0.74 (Best)
# n_subset = 100000: 
# n_subset = 1000000:

# [Grade + Gender model + content level + difficulty]
# n_subset = 10000 : train = 0.731625/ test = 0.726
# n_subset = 100000: 
# n_subset = 1000000:

# Without difficulty ----------------
# [Grade + Gender model]
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000:
# n_subset = 1000000:

# [Grade only model]
# n_subset = 10000: train = 0.7225/test = 0.721
# n_subset = 100000:
# n_subset = 1000000:

#### SVM (with linear kernel)

In [None]:
svc_linear_full = LinearSVC(random_state=0,max_iter=10000,dual=False).fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(svc_linear_full.score(X_train, y_train))+"/ ",end = "")
print("test = " + str(svc_linear_full.score(X_test, y_test)))

# With difficulty ----------------
# [Grade + Gender model + content history + difficulty + one-hot matrix + proficiency matrix]
# n_subset = 10000 : train = 0.745375/ test = 0.732
# n_subset = 100000 : train = 0.7615375/ test = 0.76335 (Best)
# n_subset = 1000000 : train = 0.75712/ test = 0.75817 (Best)

# [Grade + Gender model + content level + difficulty + one-hot matrix]
# n_subset = 10000 : train = 0.741625/ test = 0.7325
# n_subset = 100000 : train = 0.760325/ test = 0.76335 (Best)
# n_subset = 1000000 : train = 0.7567025/ test = 0.757925

# [Grade + Gender model + content level + difficulty]
# n_subset = 10000 : train = 0.73225/ test = 0.7285
# n_subset = 100000 : train = 0.7518875/ test = 0.7568
# n_subset = 1000000 : train = 0.75613625/ test = 0.75743

# Without difficulty ----------------
# [Grade + Gender model]
# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

# [Grade only model]
# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000 : train = 0.74041625/ test = 0.741355

## Model 3:  Model without the proficiency matrix

- Features (X):
    - Demographics [#logs x 4] [From **df_user**]
        - grade (#logs x 1)
        - gender (#logs x 3)
    - Difficulty features  [#logs x 1]:
        - upid accuracy [**From v_upid_acc**]
        - ~difficulty (only 3 levels, not quite informative)~
        - ~learning_stage (only elementary vs. junior, not quite informative)~
    - History features [#logs x 3]: 
        - most recent 'Level' of this ucid [From **df_log**]
        - 'problem_number' of this 'ucid' [From **df_log**]
        - 'exercise_problem_repeat_session' of this 'upid' [From **df_log**]        
    - One-hot encoding matrix [#logs x #level4 id]:  [**m_level4_id**]
        - one-hot encoding of the content ID of the new 

In [None]:
# all features excluding the proficiency matrix
slice_no_prof = slice(None, -len(list_level4_id), None)
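
The negative-stop slice relies on the proficiency matrix being the last block of columns in `X_train`. A small sketch of the behaviour on a toy array follows. The layout of 3 base columns plus two blocks of `k` columns is an assumption for illustration, and `k` stands in for `len(list_level4_id)`.

```python
import numpy as np

# Hypothetical layout: 3 base columns, then two blocks of k columns each
k = 2
X_demo = np.arange(2 * (3 + 2 * k)).reshape(2, 3 + 2 * k)

# Drop the last block (analogous to slice_no_prof)
drop_last_block = slice(None, -k, None)
# Drop the last two blocks (analogous to slice_no_prof_onehot)
drop_last_two = slice(None, -(2 * k), None)

print(X_demo[:, drop_last_block].shape)  # 3 + k columns remain
print(X_demo[:, drop_last_two].shape)    # only the 3 base columns remain
```

Two caveats apply. If `k` were 0, `slice(None, -0)` would select no columns at all. If another block, such as the commented-out interaction matrix, were appended after the proficiency matrix, these slices would drop the wrong columns.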

#### Decision Tree

In [None]:
dc_no_prof = DecisionTreeClassifier(criterion="entropy",random_state=0).fit(X_train[:,slice_no_prof], y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(dc_no_prof.score(X_train[:,slice_no_prof], y_train))+"/ ",end = "")
print("test = " + str(dc_no_prof.score(X_test[:,slice_no_prof], y_test)))

# n_subset = 10000 : train = 0.892375/ test = 0.6775 *
# n_subset = 100000 : train = 0.9190875/ test = 0.69365 *
# n_subset = 1000000 : train = 0.94633875/ test = 0.67734 *
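
The recorded decision-tree accuracies show a large gap between train and test (for example 0.946 vs. 0.677 at `n_subset = 1000000`), which typically indicates overfitting. The sketch below compares an unrestricted tree with a depth-limited one on synthetic noisy data. The depth of 3 is an arbitrary assumption and this comparison was not run in the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (assumption, not the Junyi data)
rng = np.random.default_rng(0)
X_demo = rng.random((2000, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.standard_normal(2000)) > 0.5
X_tr, X_te = X_demo[:1600], X_demo[1600:]
y_tr, y_te = y_demo[:1600], y_demo[1600:]

# Compare an unrestricted tree (max_depth=None) with a shallow tree
scores = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    ).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print("max_depth =", depth, "train/test =", scores[depth])
```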

#### Gradient Boosting

In [None]:
gb_no_prof = GradientBoostingClassifier(random_state=0).fit(X_train[:,slice_no_prof], y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(gb_no_prof.score(X_train[:,slice_no_prof], y_train))+"/ ",end = "")
print("test = " + str(gb_no_prof.score(X_test[:,slice_no_prof], y_test)))
# n_subset = 10000 : train = 0.755125/ test = 0.745 *
# n_subset = 100000 : train = 0.7642125/ test = 0.7678 *
# n_subset = 1000000 : train = 0.7605875/ test = 0.76125 *

#### Logistic Model (with L2 penalty)

In [None]:
logit_no_prof = LogisticRegression(random_state=0,max_iter=1000).fit(X_train[:,slice_no_prof], y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(logit_no_prof.score(X_train[:,slice_no_prof], y_train))+"/ ",end = "")
print("test = " + str(logit_no_prof.score(X_test[:,slice_no_prof], y_test)))

# n_subset = 10000 : train = 0.741375/ test = 0.7325 *
# n_subset = 100000 : train = 0.760325/ test = 0.7641 *
# n_subset = 1000000 : train = 0.7577325/ test = 0.75869 *

#### Logistic Model (with L1 penalty)

In [None]:
lasso_no_prof = LogisticRegression(penalty='l1', solver='saga',random_state=0,max_iter=1000).fit(X_train[:,slice_no_prof], y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(lasso_no_prof.score(X_train[:,slice_no_prof], y_train))+"/ ",end = "")
print("test = " + str(lasso_no_prof.score(X_test[:,slice_no_prof], y_test)))

# n_subset = 10000 : train = 0.741/ test = 0.7335 *
# n_subset = 100000 : train = 0.760275/ test = 0.7643 *
# n_subset = 1000000 : train = 0.75778875/ test = 0.758735 *

#### SVM (with RBF kernel)

In [None]:
svc_no_prof = SVC().fit(X_train[:,slice_no_prof], y_train)
acc_train = svc_no_prof.score(X_train[:,slice_no_prof], y_train)
acc_test = svc_no_prof.score(X_test[:,slice_no_prof], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))
# n_subset = 10000 : train = 0.7455/ test = 0.738 *
# n_subset = 100000: 
# n_subset = 1000000:

#### SVM (with polynomial kernel, degree = 3)

In [None]:
svc_poly_no_prof = SVC(kernel = 'poly').fit(X_train[:,slice_no_prof], y_train)
acc_train = svc_poly_no_prof.score(X_train[:,slice_no_prof], y_train)
acc_test = svc_poly_no_prof.score(X_test[:,slice_no_prof], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))
# n_subset = 10000 : train = 0.74875/ test = 0.74 *
# n_subset = 100000: 
# n_subset = 1000000:

#### SVM (with linear kernel)

In [None]:
svc_linear_no_prof = LinearSVC(random_state=0,max_iter=10000,dual=False).fit(X_train[:,slice_no_prof], y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(svc_linear_no_prof.score(X_train[:,slice_no_prof], y_train))+"/ ",end = "")
print("test = " + str(svc_linear_no_prof.score(X_test[:,slice_no_prof], y_test)))

# n_subset = 10000 : train = 0.741625/ test = 0.7325 *
# n_subset = 100000 : train = 0.760325/ test = 0.76335 *
# n_subset = 1000000 : train = 0.7567025/ test = 0.757925 *

## Model 4: Model without proficiency matrix and one-hot encoding matrix

- Features (X):
    - Demographics [#logs x 4] [From **df_user**]
        - grade (#logs x 1)
        - gender (#logs x 3)
    - Difficulty features  [#logs x 1]:
        - upid accuracy [**From v_upid_acc**]
        - ~difficulty (only 3 levels, not quite informative)~
        - ~learning_stage (only elementary vs. junior, not quite informative)~
    - History features [#logs x 3]: 
        - most recent 'Level' of this ucid [From **df_log**]
        - 'problem_number' of this 'ucid' [From **df_log**]
        - 'exercise_problem_repeat_session' of this 'upid' [From **df_log**]

***


In [None]:
# all features excluding the proficiency matrix and the one-hot matrix
slice_no_prof_onehot = slice(None, -(2*len(list_level4_id)), None)

#### Decision Tree

In [None]:
dc_no_prof_onehot = DecisionTreeClassifier(criterion="entropy",random_state=0).fit(X_train[:,slice_no_prof_onehot], y_train)
acc_train = dc_no_prof_onehot.score(X_train[:,slice_no_prof_onehot], y_train)
acc_test = dc_no_prof_onehot.score(X_test[:,slice_no_prof_onehot], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))

# n_subset = 10000 : train = 0.82225/ test = 0.683 *
# n_subset = 100000 : train = 0.8305875/ test = 0.71545 *
# n_subset = 1000000 : train = 0.86713875/ test = 0.69857 *

#### Gradient Boosting

In [None]:
gb_no_prof_onehot = GradientBoostingClassifier(random_state=0).fit(X_train[:,slice_no_prof_onehot], y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(gb_no_prof_onehot.score(X_train[:,slice_no_prof_onehot], y_train))+"/ ",end = "")
print("test = " + str(gb_no_prof_onehot.score(X_test[:,slice_no_prof_onehot], y_test)))
# n_subset = 10000 : train = 0.74675/ test = 0.732 *
# n_subset = 100000 : train = 0.7594375/ test = 0.7638 *
# n_subset = 1000000 : train = 0.7604625/ test = 0.76084 *

#### Logistic Model (with L2 penalty)

In [None]:
logit_no_prof_onehot = LogisticRegression(random_state=0,max_iter=1000).fit(X_train[:,slice_no_prof_onehot], y_train)
acc_train = logit_no_prof_onehot.score(X_train[:,slice_no_prof_onehot], y_train)
acc_test = logit_no_prof_onehot.score(X_test[:,slice_no_prof_onehot], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))

# n_subset = 10000 : train = 0.73125/ test = 0.728 *
# n_subset = 100000 : train = 0.75265/ test = 0.7574 *
# n_subset = 1000000 : train = 0.75702875/ test = 0.758 *

#### Logistic Model (with L1 penalty)

In [None]:
lasso_no_prof_onehot = LogisticRegression(penalty='l1', solver='saga',random_state=0,max_iter=1000).fit(X_train[:,slice_no_prof_onehot], y_train)
acc_train = lasso_no_prof_onehot.score(X_train[:,slice_no_prof_onehot], y_train)
acc_test = lasso_no_prof_onehot.score(X_test[:,slice_no_prof_onehot], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))

# n_subset = 10000 : train = 0.729875/ test = 0.7285 *
# n_subset = 100000 : train = 0.7526625/ test = 0.7576 *
# n_subset = 1000000 : train = 0.75708/ test = 0.7581 *

#### SVM (with RBF kernel)

In [None]:
svc_no_prof_onehot = SVC().fit(X_train[:,slice_no_prof_onehot], y_train)
acc_train = svc_no_prof_onehot.score(X_train[:,slice_no_prof_onehot], y_train)
acc_test = svc_no_prof_onehot.score(X_test[:,slice_no_prof_onehot], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))
# n_subset = 10000 : train = 0.736625/ test = 0.7395
# n_subset = 100000: 
# n_subset = 1000000:

#### SVM (with polynomial kernel, degree = 3)

In [None]:
svc_poly_no_prof_onehot = SVC(kernel = 'poly').fit(X_train[:,slice_no_prof_onehot], y_train)
acc_train = svc_poly_no_prof_onehot.score(X_train[:,slice_no_prof_onehot], y_train)
acc_test = svc_poly_no_prof_onehot.score(X_test[:,slice_no_prof_onehot], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))
# n_subset = 10000 : train = 0.731625/ test = 0.726
# n_subset = 100000: 
# n_subset = 1000000:

#### SVM (with linear kernel)

In [None]:
svc_linear_no_prof_onehot = LinearSVC(random_state=0,max_iter=10000,dual=False).fit(X_train[:,slice_no_prof_onehot], y_train)
acc_train = svc_linear_no_prof_onehot.score(X_train[:,slice_no_prof_onehot], y_train)
acc_test = svc_linear_no_prof_onehot.score(X_test[:,slice_no_prof_onehot], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))

# n_subset = 10000 : train = 0.73225/ test = 0.7285 *
# n_subset = 100000 : train = 0.7518875/ test = 0.7568 *
# n_subset = 1000000 : train = 0.75613625/ test = 0.75743 *

## Model 5: Demographics only model

- Features (X):
    - Demographics [#logs x 4] [From **df_user**]
        - grade (#logs x 1)
        - gender (#logs x 3)

***


In [None]:
slice_demo = slice(None, 4, None)

#### Decision Tree

In [None]:
dc_demo= DecisionTreeClassifier(criterion="entropy",random_state=0).fit(X_train[:,slice_demo], y_train)
print(dc_demo.score(X_train[:,slice_demo], y_train))
print(dc_demo.score(X_test[:,slice_demo], y_test))
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: train = 0.7413/ test = 0.74865
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

#### Gradient Boosting

In [None]:
gb_demo = GradientBoostingClassifier(random_state=0).fit(X_train[:,slice_demo], y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(gb_demo.score(X_train[:,slice_demo], y_train))+"/ ",end = "")
print("test = " + str(gb_demo.score(X_test[:,slice_demo], y_test)))
# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000 : train = 0.74041625/ test = 0.741355

#### Logistic Model (with L2 penalty)

In [None]:
logit_demo = LogisticRegression(random_state=0,max_iter=1000).fit(X_train[:,slice_demo], y_train) 
print(logit_demo.score(X_train[:,slice_demo], y_train))
print(logit_demo.score(X_test[:,slice_demo], y_test))
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

#### Logistic Model (with L1 penalty)

In [None]:
lasso_demo = LogisticRegression(penalty='l1', solver='saga',random_state=0,max_iter=1000).fit(X_train[:,slice_demo], y_train)
print(lasso_demo.score(X_train[:,slice_demo], y_train))
print(lasso_demo.score(X_test[:,slice_demo], y_test))
# n_subset = 10000: train = 0.7225 / test = 0.721
# n_subset = 100000: train = 0.7411875 / test = 0.7486
# n_subset = 1000000: train = 0.74041625 / test = 0.741355

#### SVM (with RBF kernel)

In [None]:
svc_demo = SVC().fit(X_train[:,slice_demo], y_train)
print(svc_demo.score(X_train[:,slice_demo], y_train))
print(svc_demo.score(X_test[:,slice_demo], y_test))
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: 
# n_subset = 1000000:

#### SVM (with polynomial kernel, degree = 3)

In [None]:
svc_poly_demo = SVC(kernel= 'poly').fit(X_train[:,slice_demo], y_train)
print(svc_poly_demo.score(X_train[:,slice_demo], y_train))
print(svc_poly_demo.score(X_test[:,slice_demo], y_test))
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: 
# n_subset = 1000000:

#### SVM (with linear kernel)

In [None]:
# Use a distinct name so the RBF-kernel model `svc_demo` is not overwritten
svc_linear_demo = LinearSVC(random_state=0,max_iter=10000,dual=False).fit(X_train[:,slice_demo], y_train)
acc_train = svc_linear_demo.score(X_train[:,slice_demo], y_train)
acc_test = svc_linear_demo.score(X_test[:,slice_demo], y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))

# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

## Model 6: Grade only model


- Features (X):
    - Demographics [#logs x 1] [From **df_user**]
        - grade (#logs x 1)

#### Decision Tree

In [None]:
dc_grade= DecisionTreeClassifier(criterion="entropy",random_state=0).fit(X_train[:,0].reshape(-1, 1), y_train)
print(dc_grade.score(X_train[:,0].reshape(-1, 1), y_train))
print(dc_grade.score(X_test[:,0].reshape(-1, 1), y_test))
# n_subset = 10000: train = 0.7225 / test = 0.721
# n_subset = 100000: train = 0.7411875 / test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

#### Gradient Boosting

In [None]:
gb_grade = GradientBoostingClassifier(random_state=0).fit(X_train[:,0].reshape(-1, 1), y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(gb_grade.score(X_train[:,0].reshape(-1, 1), y_train))+"/ ",end = "")
print("test = " + str(gb_grade.score(X_test[:,0].reshape(-1, 1), y_test)))
# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000 : train = 0.74041625/ test = 0.741355

#### Logistic Model (with L2 penalty)

In [None]:
logit_grade= LogisticRegression(random_state=0,max_iter=1000).fit(X_train[:,0].reshape(-1, 1), y_train)
print(logit_grade.score(X_train[:,0].reshape(-1, 1), y_train))
print(logit_grade.score(X_test[:,0].reshape(-1, 1), y_test))
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

#### Logistic Model (with L1 penalty)

In [None]:
lasso_grade = LogisticRegression(penalty='l1', solver='saga',random_state=0,max_iter=1000).fit(X_train[:,0].reshape(-1, 1), y_train)
print(lasso_grade.score(X_train[:,0].reshape(-1, 1), y_train))
print(lasso_grade.score(X_test[:,0].reshape(-1, 1), y_test))
# n_subset = 10000: train = 0.7225/ test = 0.721
# n_subset = 100000: train = 0.7411875/ test = 0.7486
# n_subset = 1000000: train = 0.74041625/ test = 0.741355

#### SVM (with RBF kernel)

In [None]:
svc_grade = SVC().fit(X_train[:,0].reshape(-1, 1), y_train)
print(svc_grade.score(X_train[:,0].reshape(-1, 1), y_train))
print(svc_grade.score(X_test[:,0].reshape(-1, 1), y_test))
# n_subset = 10000: train = 0.7225/test = 0.721
# n_subset = 100000: 0.7486 
# n_subset = 1000000:

#### SVM (with polynomial kernel, degree = 3)

In [None]:
svc_poly_grade = SVC(kernel='poly').fit(X_train[:,0].reshape(-1, 1), y_train)
print(svc_poly_grade.score(X_train[:,0].reshape(-1, 1), y_train))
print(svc_poly_grade.score(X_test[:,0].reshape(-1, 1), y_test))
# n_subset = 10000: train = 0.7225/test = 0.721
# n_subset = 100000: 
# n_subset = 1000000:

#### SVM (with linear kernel)

In [None]:
# Use a distinct name so the RBF-kernel model `svc_grade` is not overwritten
svc_linear_grade = LinearSVC(random_state=0,max_iter=10000,dual=False).fit(X_train[:,0].reshape(-1, 1), y_train)
acc_train = svc_linear_grade.score(X_train[:,0].reshape(-1, 1), y_train)
acc_test = svc_linear_grade.score(X_test[:,0].reshape(-1, 1), y_test)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(acc_train)+"/ ",end = "")
print("test = " + str(acc_test))
# n_subset = 10000 : train = 0.7225/ test = 0.721
# n_subset = 100000 : train = 0.7411875/ test = 0.7486
# n_subset = 1000000 : train = 0.74041625/ test = 0.741355
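
The cells above repeat the same fit, score, and print pattern for every model and feature subset. A sketch of a small helper that loops over both follows. The helper name `fit_and_report`, the synthetic data, and the dictionary keys are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_and_report(model, X_tr, y_tr, X_te, y_te, cols=slice(None)):
    """Fit `model` on the selected columns and return (train_acc, test_acc)."""
    model.fit(X_tr[:, cols], y_tr)
    return model.score(X_tr[:, cols], y_tr), model.score(X_te[:, cols], y_te)

# Synthetic data standing in for X_train / X_test (assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((400, 6))
y_demo = X_demo[:, 0] > 0.5
X_tr, X_te, y_tr, y_te = X_demo[:300], X_demo[300:], y_demo[:300], y_demo[300:]

models = {
    "tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "logit": LogisticRegression(random_state=0, max_iter=1000),
}
# slice(None, 1) keeps the result 2-D, so no reshape(-1, 1) is needed
feature_sets = {"all": slice(None), "first column": slice(None, 1)}

results = {}
for model_name, model in models.items():
    for set_name, cols in feature_sets.items():
        # clone() gives each run a fresh, unfitted copy of the estimator
        results[(model_name, set_name)] = fit_and_report(
            clone(model), X_tr, y_tr, X_te, y_te, cols
        )
        print(model_name, "|", set_name, "|", results[(model_name, set_name)])
```

In the notebook, `feature_sets` could hold the existing `slice_no_prof`, `slice_no_prof_onehot`, and `slice_demo` objects, which would keep the recorded comparisons in one table-like dictionary.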