The data used in this notebook can be downloaded from https://github.com/riiid/ednet. We use the EdNet-KT1 dataset. After downloading and unzipping the archives, either place the resulting folders in the same directory as this notebook, or adjust the paths used when reading the CSV files (the code below expects `EdNet-Contents/contents/questions.csv` and the per-user files in `EdNet-KT1/KT1`). No further preprocessing is required to run the code in this notebook.

### Import libraries and read accompanying question CSV

In [None]:
import os
import timeit
import pandas as pd
import numpy as np

df_q = pd.read_csv('EdNet-Contents/contents/questions.csv')
df_q = df_q.drop(['bundle_id', 'explanation_id', 'tags', 'deployed_at'], axis = 1)
df_q.head()

### Read individual user CSVs, construct features, and store overall CSV

In [None]:
def create_feature_df(folder_path, n_users):
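    # Sort so the sampled users do not depend on the OS-specific order returned by os.listdir
    file_paths = sorted(os.path.join(folder_path, file) for file in os.listdir(folder_path) if file.endswith('.csv'))
    
    # Sample without replacement so that no user is selected more than once
    np.random.seed(42)
    file_indices = np.random.choice(len(file_paths), n_users, replace=False)
    file_paths = [file_paths[i] for i in file_indices]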
    
    dtype_dict = {'timestamp': str, 'solving_id': int, 'question_id': str, 'user_answer': str, 'elapsed_time': int}
    
    df_list = []
    lag_values = [i+1 for i in range(10)]
    for idx, file_path in enumerate(file_paths):
        if idx % 250 == 0:
            print(f"Now reading file {idx} of {n_users} user files.")
            
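        # Derive the user id from the file name (e.g. 'u1.csv' -> 'u1'), independent of the OS path separator
        user_id = os.path.splitext(os.path.basename(file_path))[0]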
        user_df = pd.read_csv(file_path, dtype=dtype_dict)
        user_df['user_id'] = user_id
        user_df = user_df.merge(df_q, on='question_id', how='left')
        # Subtract a constant offset from the Unix timestamp (in milliseconds) to make the eventual feature CSV smaller
        user_df['timestamp'] = [int(t) - 1400000000000 for t in user_df['timestamp']]
        # Convert elapsed time from milliseconds to seconds and cap it at 300 s, similar to Choi et al. (2020) and Shin et al. (2021)
        user_df['elapsed_time'] = [min(300, int(t)/1000) for t in user_df['elapsed_time']]
        user_df['correct_response'] = [1 if u == c else 0 for u, c in zip(user_df['user_answer'], user_df['correct_answer'])]
        for lag in lag_values:
            user_df[f'correct_response_lag_{lag}'] = user_df['correct_response'].shift(lag)
        
        df_list.append(user_df)
        
    df = pd.concat(df_list, axis = 0, ignore_index = True)
    # Average correctness over the previous 5 and 10 responses; skipna=False leaves the
    # average undefined (NaN) until the full history window is available
    last_five_cols = [f'correct_response_lag_{lag}' for lag in range(1, 6)]
    last_ten_cols = [f'correct_response_lag_{lag}' for lag in range(1, 11)]
    df['avg_correct_last_five'] = df.loc[:, last_five_cols].mean(axis = 1, skipna=False)
    df['avg_correct_last_ten'] = df.loc[:, last_ten_cols].mean(axis = 1, skipna=False)
    
    df = df.drop(['user_answer', 'correct_answer'], axis = 1)
    return df
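
To make the lag and rolling-average features more concrete, the following is a minimal sketch on a made-up response sequence for a single hypothetical user. The toy data and the three-step window (`avg_correct_last_three`) are assumptions chosen for brevity; the function above uses lags 1 to 10 and windows of five and ten.

```python
import pandas as pd

# Toy response sequence for one hypothetical user (1 = correct, 0 = incorrect).
toy = pd.DataFrame({"correct_response": [1, 0, 1, 1, 1, 0, 0]})

# Lag k holds the outcome of the k-th previous interaction;
# the first k rows have no such history and become NaN.
lag_cols = []
for lag in range(1, 4):
    col = f"correct_response_lag_{lag}"
    toy[col] = toy["correct_response"].shift(lag)
    lag_cols.append(col)

# skipna=False keeps the average undefined until all three lags exist.
toy["avg_correct_last_three"] = toy[lag_cols].mean(axis=1, skipna=False)
print(toy)
```

Because the shift is applied to each user's data frame before concatenation, the lag values never carry over from one user to the next.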

Note that the code below reads the data of 10,000 randomly selected users. Because constructing features for all users takes a prohibitively long time and produces a very large feature CSV, we use the 10,000-user CSV for our local learning and federated learning experiments. For central learning, we experimented with several values of N_USERS, as shown in our detailed results.
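
The random selection of user files can also be looked at in isolation. The following is a minimal sketch, not the notebook's own code: `sample_user_files` is a hypothetical helper, the file names are made up, and it assumes that each user should appear at most once and that the selection should not depend on the order in which the operating system lists the files.

```python
import numpy as np

def sample_user_files(file_paths, n_users, seed=42):
    """Return n_users distinct paths, chosen reproducibly (hypothetical helper)."""
    if n_users > len(file_paths):
        raise ValueError("n_users exceeds the number of available user files")
    rng = np.random.default_rng(seed)
    # Sort first so the result does not depend on the directory listing order.
    ordered = sorted(file_paths)
    # replace=False guarantees that no index is drawn twice.
    chosen = rng.choice(len(ordered), size=n_users, replace=False)
    return [ordered[i] for i in chosen]

# Made-up file names standing in for the per-user CSV files.
paths = [f"u{i}.csv" for i in range(1, 21)]
sample = sample_user_files(paths, 5)
print(sample)
```

With a fixed seed, calling the helper again with the same set of paths returns the same users, which keeps experiments with different values of N_USERS reproducible.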

In [None]:
FOLDER_PATH = 'EdNet-KT1/KT1'
N_USERS = 10000
file_str = f'ednet_features_{N_USERS}_users.csv'

feature_df = create_feature_df(FOLDER_PATH, N_USERS)
feature_df = feature_df.convert_dtypes()
feature_df.to_csv(file_str, index = False)
feature_df.head()
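
A CSV file does not store column dtypes, so the nullable dtypes produced by `convert_dtypes` are not preserved on disk. The following minimal sketch illustrates this on a made-up two-row frame, using an in-memory buffer instead of a real file; applying `convert_dtypes` again after reading is one possible way to restore them, not necessarily what the later notebooks do.

```python
import io
import pandas as pd

# Made-up frame with a missing lag value, mimicking the first row of a user.
toy = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "correct_response_lag_1": [float("nan"), 1.0],
}).convert_dtypes()

# Round-trip through CSV text held in memory.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The lag column comes back as float64 because of the missing value;
# convert_dtypes turns it into the nullable integer dtype again.
restored = reloaded.convert_dtypes()
print(reloaded.dtypes)
print(restored.dtypes)
```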