### Deep Learning Adventures Presents
### Attention and Transformers Bonus Edition | Applications of transformers for a non-NLP task
### https://www.meetup.com/Deep-Learning-Adventures/events/281617686

Hello dear friends 🌎 🌍,

We hope you are enjoying our latest sessions 🏖️ on deploying models and cool applications to our phones 📱 and edge devices 🕹️.

Our friend Dmitri has offered to lead sessions on some very interesting papers, code, and content, all about the attention mechanism in deep learning models and transformers! 😲

Attention and Transformers Bonus Edition | Applications of transformers for a non-NLP task

- Kaggle competition on knowledge tracing:
Riiid Answer Correctness Prediction
Track knowledge states of 1M+ students in the wild
https://www.kaggle.com/c/riiid-test-answer-prediction/overview

- Dmitri's approach, ranked #54 out of 3,395 participants (top 2%).
https://github.com/dlevonian/riiid-prediction ✅

- Presentation: https://docs.google.com/presentation/d/1QWWgewJ12unyPUDRpnOLsygl1_0lVCwpWSJZGIRB1YY/edit?usp=sharing

The recording of this cool event 😎 is available at:
https://bit.ly/dla-transformers

### Riiid Answer Correctness Prediction 
https://www.kaggle.com/c/riiid-test-answer-prediction

In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions. You will apply your machine learning skills to Riiid’s EdNet data.
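
To make "knowledge tracing" concrete, here is a minimal sketch on a made-up interaction log. The data and the `prior_accuracy` column name are illustrative assumptions, not part of the competition data. It predicts each answer from the fraction of that student's earlier answers that were correct, which is about the simplest possible model of knowledge over time.

```python
import pandas as pd

# Toy interaction log (hypothetical data): one row per answered question, in time order.
log = pd.DataFrame({
    "user_id":            [1, 1, 1, 1, 2, 2, 2],
    "answered_correctly": [0, 1, 1, 1, 1, 0, 0],
})

def prior_accuracy(answers: pd.Series) -> pd.Series:
    # Mean of all *earlier* answers by the same user; shift(1) keeps the current answer out.
    return answers.shift(1).expanding().mean()

log["prior_accuracy"] = (
    log.groupby("user_id")["answered_correctly"]
       .transform(prior_accuracy)
       .fillna(0.5)  # no history yet: fall back to a neutral 0.5
)
print(log)
```

The `shift(1)` is the important part: a feature for a given interaction may only use what was known before that interaction.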

Inspired by: https://www.kaggle.com/erikbruin/riiid-comprehensive-eda-baseline/notebook

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
import seaborn as sns
from matplotlib.ticker import FuncFormatter

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time

train = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('boolean')
print("Train size:", train.shape)

In [None]:
train.memory_usage(deep=True)

In [None]:
train.info()
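
With 100M+ rows, column dtypes dominate memory use. The sketch below uses a small synthetic frame (the column names mirror the training data, but the values and the chosen dtypes are assumptions for illustration) to show how much `memory_usage` drops when 64-bit integers are downcast to the smallest type that still holds the values.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)

# Synthetic frame stored with 64-bit integers, the pandas default.
df = pd.DataFrame({
    "user_id": rng.integers(0, 2_000_000_000, size=n),
    "content_id": rng.integers(0, 14_000, size=n),
    "answered_correctly": rng.integers(-1, 2, size=n),
}).astype("int64")
before = df.memory_usage(deep=True).sum()

# Downcast each column to a narrower type that still fits its value range (assumed ranges).
small = df.astype({"user_id": "int32", "content_id": "int16", "answered_correctly": "int8"})
after = small.memory_usage(deep=True).sum()

print(f"{before:,} bytes -> {after:,} bytes")
```

Check the value range of each column before downcasting; a value that does not fit the narrower type will silently wrap around.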

In [None]:
%%time

questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')
example_test = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')
example_sample_submission = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_sample_submission.csv')

In [None]:
cids = train.content_id.value_counts()[:30]

fig = plt.figure(figsize=(12,6))
ax = cids.plot.bar()
plt.title("Thirty most used content IDs")
plt.xticks(rotation=90)
ax.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ','))) #add thousands separator
plt.show()

In [None]:
#1 year = 31536000000 ms
ts = train['timestamp']/(31536000000/12)
fig = plt.figure(figsize=(12,6))
ts.plot.hist(bins=100)
plt.title("Histogram of timestamp")
plt.xticks(rotation=0)
plt.xlabel("Months between this user interaction and the first event completion from that user")
plt.show()
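
The conversion used for the x-axis can be checked by hand. This is a small sketch of the arithmetic only; the helper name `ms_to_months` is made up for illustration, and a "month" here simply means one twelfth of a 365-day year.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000   # 31,536,000,000 ms
MS_PER_MONTH = MS_PER_YEAR / 12            # one twelfth of a 365-day year

def ms_to_months(timestamp_ms: float) -> float:
    """Convert a millisecond timestamp to (approximate) months."""
    return timestamp_ms / MS_PER_MONTH

# A quarter of a year expressed in milliseconds.
print(ms_to_months(7_884_000_000))
```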

In [None]:
correct = train[train.answered_correctly != -1].answered_correctly.value_counts(ascending=True)

fig = plt.figure(figsize=(12,4))
correct.plot.barh()
#barh draws the bars at positions 0, 1, ..., so annotate by position rather than by index label
for pos, v in enumerate(correct.values):
    plt.text(v, pos, '{:,}'.format(v), color='white', fontweight='bold', fontsize=14, ha='right', va='center')
plt.title("Questions answered correctly")
plt.xticks(rotation=0)
plt.show()

In [None]:
user_percent = train[train.answered_correctly != -1].groupby('user_id')['answered_correctly'].agg(Mean='mean', Answers='count')
print(f'the highest number of questions answered by a user is {user_percent.Answers.max()}')

In [None]:
user_percent = user_percent.query('Answers <= 1000').sample(n=200, random_state=1)

fig = plt.figure(figsize=(12,6))
x = user_percent.Answers
y = user_percent.Mean
plt.scatter(x, y, marker='o')
plt.title("Fraction answered correctly versus number of questions answered per user")
plt.xticks(rotation=0)
plt.xlabel("Number of questions answered")
plt.ylabel("Fraction answered correctly")
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.show()

### Test data

In [None]:
example_test.shape

In [None]:
example_test.head()

In [None]:
batches_test = set(example_test.task_container_id.unique())
batches_train = set(train.task_container_id.unique())
print(f'All batches in example_test are also in train: {batches_test.issubset(batches_train)}.')

In [None]:
user_test = set(example_test.user_id.unique())
user_train = set(train.user_id.unique())

print(f'User_ids in example_test but not in train: {user_test - user_train}.')

### Baseline model

In [None]:
#this clears everything loaded in RAM, including the libraries
%reset -f

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import sys
sys.path.append('/kaggle/input/riiid-test-answer-prediction')
import riiideducation

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
import seaborn as sns

import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc
import sys
pd.set_option('display.max_rows', None)

In [None]:
%%time
cols_to_load = ['row_id', 'user_id', 'answered_correctly', 'content_id', 'prior_question_had_explanation', 'prior_question_elapsed_time']
train = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")[cols_to_load]
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('boolean')

print("Train size:", train.shape)

In [None]:
%%time

questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')
example_test = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')
example_sample_submission = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_sample_submission.csv')

In [None]:
print(train.shape)
train.head()

In [None]:
%%time
#adding user features
user_df = train[train.answered_correctly != -1].groupby('user_id').agg({'answered_correctly': ['count', 'mean']}).reset_index()
user_df.columns = ['user_id', 'user_questions', 'user_mean']

#count rows per user by answered_correctly value (-1 = lecture, 0 = wrong, 1 = right)
user_lect = train.groupby(["user_id", "answered_correctly"]).size().unstack()
user_lect.columns = ['Lecture', 'Wrong', 'Right']
#flag users with at least one lecture; compare before casting so large counts cannot overflow int8
user_lect['watches_lecture'] = (user_lect['Lecture'].fillna(0) > 0).astype('int8')
user_lect = user_lect.reset_index()
user_lect = user_lect[['user_id', 'watches_lecture']]

user_df = user_df.merge(user_lect, on = "user_id", how = "left")
del user_lect
user_df.head()
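
The lecture flag relies on `groupby(...).size().unstack()`, which can be hard to picture. Here is a minimal sketch on a toy log (hypothetical data) showing the intermediate table: one row per user, one column per `answered_correctly` value, where -1 marks a lecture row.

```python
import pandas as pd

# Toy log (hypothetical data); answered_correctly == -1 marks a lecture row.
toy = pd.DataFrame({
    "user_id":            [1, 1, 1, 2, 2, 3],
    "answered_correctly": [-1, 0, 1, 1, 1, -1],
})

# Rows: users. Columns: the values -1, 0, 1. Cells: number of rows with that value.
counts = toy.groupby(["user_id", "answered_correctly"]).size().unstack(fill_value=0)
print(counts)

# Flag users with at least one lecture row.
watches_lecture = (counts[-1] > 0).astype("int8")
print(watches_lecture.to_dict())
```

In this toy log, users 1 and 3 have a lecture row and user 2 does not.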

In [None]:
%%time
#adding content features
content_df = train[train.answered_correctly != -1].groupby('content_id').agg({'answered_correctly': ['count', 'mean']}).reset_index()
content_df.columns = ['content_id', 'content_questions', 'content_mean']
content_df.head()

In [None]:
%%time
#using one of the validation sets composed by tito
cv2_train = pd.read_pickle("../input/riiid-cross-validation-files/cv2_train.pickle")['row_id']
cv2_valid = pd.read_pickle("../input/riiid-cross-validation-files/cv2_valid.pickle")['row_id']

In [None]:
train = train[train.answered_correctly != -1]

#save mean before splitting
#please be aware that there is an issue with train.prior_question_elapsed_time.mean()
#see https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/195032
mean_prior = train.prior_question_elapsed_time.astype("float64").mean()

validation = train[train.row_id.isin(cv2_valid)]
train = train[train.row_id.isin(cv2_train)]

validation = validation.drop(columns = "row_id")
train = train.drop(columns = "row_id")

del cv2_train, cv2_valid
gc.collect()

In [None]:
def data_pipeline(df, user_df, content_df, label_enc, mode='train'):
    #relies on the global mean_prior computed before the train/validation split
    df = df.merge(user_df, on = "user_id", how = "left")
    df = df.merge(content_df, on = "content_id", how = "left")
    #assign the result back instead of calling fillna(inplace=True) on a single column,
    #which newer pandas versions may apply to a temporary copy
    df = df.fillna({'content_questions': 0,
                    'content_mean': 0.5,
                    'watches_lecture': 0,
                    'user_questions': 0,
                    'user_mean': 0.5,
                    'prior_question_elapsed_time': mean_prior,
                    'prior_question_had_explanation': False})

    if mode == 'train':
        label_enc.fit(df['prior_question_had_explanation'])

    df['prior_question_had_explanation'] = label_enc.transform(df['prior_question_had_explanation'])
    df[['content_questions', 'user_questions']] = df[['content_questions', 'user_questions']].astype(int)
    return df, label_enc
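
The left merges in `data_pipeline` leave `NaN` for any user or question that never appeared in the training aggregates, and the fill values decide what the model sees for them. A minimal sketch with toy frames (hypothetical data, and only the user-side columns) shows the effect for an unseen user:

```python
import pandas as pd

# Aggregates for known users (hypothetical values).
user_df = pd.DataFrame({"user_id": [1, 2], "user_questions": [10, 4], "user_mean": [0.7, 0.25]})

# Incoming batch; user 3 has no history in user_df.
batch = pd.DataFrame({"user_id": [1, 3], "content_id": [100, 101]})

merged = batch.merge(user_df, on="user_id", how="left")
# Unseen users get zero answered questions and a neutral 0.5 accuracy.
merged = merged.fillna({"user_questions": 0, "user_mean": 0.5})
merged["user_questions"] = merged["user_questions"].astype(int)
print(merged)
```

The 0.5 default is a neutral guess; the overall training accuracy would be another reasonable choice.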

In [None]:
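#Calling data_pipeline() here ran out of memory ("Your notebook tried to allocate more memory than is available. It has restarted."),
#so the calls are commented out and the same steps are written inline in the next cells.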
#label_enc = LabelEncoder()
#train, label_enc = data_pipeline(train, user_df, content_df, label_enc, mode='train')
#train.sample(5)

In [None]:
#validation, _ = data_pipeline(validation, user_df, content_df, label_enc, mode='validation')
#validation.sample(5)

In [None]:
label_enc = LabelEncoder()

train = train.merge(user_df, on = "user_id", how = "left")
train = train.merge(content_df, on = "content_id", how = "left")
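#assign back instead of fillna(inplace=True) on a column, which newer pandas versions may apply to a temporary copy
train['content_questions'] = train['content_questions'].fillna(0)
train['content_mean'] = train['content_mean'].fillna(0.5)
train['watches_lecture'] = train['watches_lecture'].fillna(0)
train['user_questions'] = train['user_questions'].fillna(0)
train['user_mean'] = train['user_mean'].fillna(0.5)
train['prior_question_elapsed_time'] = train['prior_question_elapsed_time'].fillna(mean_prior)
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].fillna(False)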

label_enc.fit(train['prior_question_had_explanation'])

train['prior_question_had_explanation'] = label_enc.transform(train['prior_question_had_explanation'])
train[['content_questions', 'user_questions']] = train[['content_questions', 'user_questions']].astype(int)
train.sample(5)

In [None]:
validation = validation.merge(user_df, on = "user_id", how = "left")
validation = validation.merge(content_df, on = "content_id", how = "left")
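#assign back instead of fillna(inplace=True) on a column, which newer pandas versions may apply to a temporary copy
validation['content_questions'] = validation['content_questions'].fillna(0)
validation['content_mean'] = validation['content_mean'].fillna(0.5)
validation['watches_lecture'] = validation['watches_lecture'].fillna(0)
validation['user_questions'] = validation['user_questions'].fillna(0)
validation['user_mean'] = validation['user_mean'].fillna(0.5)
validation['prior_question_elapsed_time'] = validation['prior_question_elapsed_time'].fillna(mean_prior)
validation['prior_question_had_explanation'] = validation['prior_question_had_explanation'].fillna(False)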

validation['prior_question_had_explanation'] = label_enc.transform(validation['prior_question_had_explanation'])
validation[['content_questions', 'user_questions']] = validation[['content_questions', 'user_questions']].astype(int)
validation.sample(5)

In [None]:
# features = ['user_questions', 'user_mean', 'content_questions', 'content_mean', 'watches_lecture',
#             'prior_question_elapsed_time', 'prior_question_had_explanation']

features = ['user_questions', 'user_mean', 'content_questions', 'content_mean', 'prior_question_elapsed_time']


#for now just taking 10,000,000 rows for training
train = train.sample(n=10_000_000, random_state = 1)

y_train = train['answered_correctly']
train = train[features]

y_val = validation['answered_correctly']
validation = validation[features]

In [None]:
params = {'objective': 'binary',
          'metric': 'auc',
          'seed': 2020,
          'learning_rate': 0.1, #default
          "boosting_type": "gbdt" #default
         }
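
The `'metric': 'auc'` setting matches the competition metric, the area under the ROC curve. As a sketch of what that number means (a plain NumPy illustration, not LightGBM's or scikit-learn's implementation), AUC is the probability that a randomly chosen correct answer gets a higher predicted score than a randomly chosen incorrect one:

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive with every negative; fine for small inputs only,
    # since this needs O(n_pos * n_neg) memory.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75. Because only the ordering of scores matters, AUC does not change if all predictions are rescaled monotonically.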

In [None]:
lgb_train = lgb.Dataset(train, y_train, categorical_feature = None)
lgb_eval = lgb.Dataset(validation, y_val, categorical_feature = None)

del train, y_train, validation, y_val
gc.collect()

In [None]:
%%time
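#verbose_eval and early_stopping_rounds were removed from lgb.train() in LightGBM 4.0;
#the callbacks below do the same job and need LightGBM 3.3 or newer
model = lgb.train(
    params, lgb_train,
    valid_sets=[lgb_train, lgb_eval],
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=8), lgb.log_evaluation(period=50)]
)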
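
Training is capped at 10,000 rounds but is expected to stop much earlier because of the early-stopping setting of 8 rounds. The logic can be sketched in a few lines of plain Python. This is a simplified illustration with made-up validation scores and a hypothetical helper name, not LightGBM's actual code:

```python
def best_round(valid_scores, patience):
    """Return (round, score) of the best validation score, scanning rounds in order
    and stopping once `patience` rounds have passed without improvement."""
    best_score, best_iter = float("-inf"), 0
    for i, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_iter, best_score

# Made-up validation AUC per boosting round.
scores = [0.70, 0.72, 0.73, 0.729, 0.728, 0.74]
print(best_round(scores, patience=2))
print(best_round(scores, patience=8))
```

With a patience of 2, the scan stops after round 5 and never sees the better score in round 6, which is the usual trade-off: a small patience saves time but can stop on a temporary plateau.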

In [None]:
lgb.plot_importance(model)
plt.show()

### Let's use the Competition API to bring in the test data and make a submission
https://www.kaggle.com/sohier/competition-api-detailed-introduction/notebook

In [None]:
env = riiideducation.make_env()

In [None]:
iter_test = env.iter_test()

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    test_df = test_df.merge(user_df, on = "user_id", how = "left")
    test_df = test_df.merge(content_df, on = "content_id", how = "left")
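    #assign back instead of fillna(inplace=True) on a column, which newer pandas versions may apply to a temporary copy
    test_df['content_questions'] = test_df['content_questions'].fillna(0)
    test_df['content_mean'] = test_df['content_mean'].fillna(0.5)
    test_df['watches_lecture'] = test_df['watches_lecture'].fillna(0)
    test_df['user_questions'] = test_df['user_questions'].fillna(0)
    test_df['user_mean'] = test_df['user_mean'].fillna(0.5)
    test_df['prior_question_elapsed_time'] = test_df['prior_question_elapsed_time'].fillna(mean_prior)
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(False)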
    test_df['prior_question_had_explanation'] = label_enc.transform(test_df['prior_question_had_explanation'])
    test_df[['content_questions', 'user_questions']] = test_df[['content_questions', 'user_questions']].astype(int)
    test_df['answered_correctly'] =  model.predict(test_df[features])
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
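
The loop follows a simple protocol: take a batch, predict, hand the predictions back, and only then move on to the next batch. To experiment with that flow outside Kaggle, a stand-in environment can be sketched as below. `MockEnv` is a hypothetical class written for illustration; it is not the real `riiideducation` module and only mimics the batch-then-predict ordering as we understand it, with a constant 0.5 in place of `model.predict(...)`.

```python
import pandas as pd

class MockEnv:
    """Hypothetical stand-in for the competition environment (not the real API)."""

    def __init__(self, batches):
        self._batches = batches
        self.submitted = []
        self._awaiting_prediction = False

    def iter_test(self):
        for batch in self._batches:
            if self._awaiting_prediction:
                raise RuntimeError("call predict() before requesting the next batch")
            self._awaiting_prediction = True
            # Sample submission: question rows only, with a placeholder prediction.
            sample = batch.loc[batch["content_type_id"] == 0, ["row_id"]].assign(answered_correctly=0.5)
            yield batch, sample

    def predict(self, predictions):
        self.submitted.append(predictions)
        self._awaiting_prediction = False

batches = [
    pd.DataFrame({"row_id": [0, 1], "content_type_id": [0, 1]}),  # second row is a lecture
    pd.DataFrame({"row_id": [2], "content_type_id": [0]}),
]
env = MockEnv(batches)
for test_df, sample_prediction_df in env.iter_test():
    test_df = test_df.copy()
    test_df["answered_correctly"] = 0.5  # placeholder instead of model.predict(...)
    # Submit predictions for question rows only (content_type_id == 0).
    env.predict(test_df.loc[test_df["content_type_id"] == 0, ["row_id", "answered_correctly"]])

submission = pd.concat(env.submitted, ignore_index=True)
print(submission)
```

The lecture row (`row_id` 1) is filtered out before submitting, just as in the loop above.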

In [None]:
#last batch overview
#submission_test_df = test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']]
#submission_test_df.to_csv('submission.csv')

In [None]:
#submission_test_df.shape

In [None]:
#submission_test_df