# Riiid! Answer Correctness Prediction
**Concept taken from @Ilia Start Notebook**
## Introduction
In this competition you will predict which questions each student is able to answer correctly. You will loop through a series of batches of questions. Once you make that prediction, you can move on to the next batch.

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Notebooks
* You must use our custom **`riiideducation`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions.  If you do not use this module properly, your code may fail.

In this Starter Notebook, we'll show how to use the **`riiideducation`** module to get the test features and make predictions.
## TL;DR: End-to-End Usage Example
```
import pandas as pd
import riiideducation

env = riiideducation.make_env()

# Training data is in the competition dataset as usual
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', low_memory=False)
train_my_model(train_df)

iter_test = env.iter_test()
for (test_df, sample_prediction_df) in iter_test:
    # Placeholder: replace 0.5 with make_my_predictions(test_df)
    test_df['answered_correctly'] = 0.5
    # Submit predictions for questions only (content_type_id == 0)
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
```
Note that `train_my_model` and `make_my_predictions` are functions you need to write; the constant `0.5` above is only a placeholder for the output of `make_my_predictions`.
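
Because `riiideducation` is only available inside Kaggle Notebooks, the loop above cannot be dry-run elsewhere. The following is a minimal sketch of a local stand-in, assuming only the `iter_test()` / `predict()` call pattern shown above. `MockEnv`, its `submitted` attribute, and the toy batches are hypothetical and are not part of the competition API.

```python
import pandas as pd


class MockEnv:
    """Hypothetical local stand-in that mimics the iter_test()/predict() call pattern."""

    def __init__(self, batches):
        self._batches = batches          # list of DataFrames, one per test batch
        self._awaiting_predict = False   # True while a batch is waiting for predictions
        self.submitted = []              # predictions collected so far

    def iter_test(self):
        for batch in self._batches:
            if self._awaiting_predict:
                raise RuntimeError("call predict() before requesting the next batch")
            self._awaiting_predict = True
            # Toy placeholder frame; the real sample frame is not modeled here
            sample = batch[["row_id"]].copy()
            sample["answered_correctly"] = 0.5
            yield batch.copy(), sample

    def predict(self, prediction_df):
        self.submitted.append(prediction_df.copy())
        self._awaiting_predict = False


# Two toy batches; content_type_id == 0 marks questions, as in the loop above
batches = [
    pd.DataFrame({"row_id": [0, 1, 2], "content_type_id": [0, 1, 0]}),
    pd.DataFrame({"row_id": [3, 4], "content_type_id": [0, 0]}),
]
env = MockEnv(batches)
for test_df, sample_prediction_df in env.iter_test():
    test_df["answered_correctly"] = 0.5
    env.predict(test_df.loc[test_df["content_type_id"] == 0, ["row_id", "answered_correctly"]])

all_predictions = pd.concat(env.submitted, ignore_index=True)
```

The stand-in raises an error if a batch is requested before the previous one has been answered, which is the one behavior of the real module that the introduction describes.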

## In-depth Introduction
We'll import the module and create an environment further below. First, let's define compact data types for the columns we load.

In [None]:
data_types_dict = {
    'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
#     'content_type_id': 'int8',
#     'task_container_id': 'int16',
#     'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',  # float16 overflows to inf above 65504
    'prior_question_had_explanation': 'boolean'
}

In [None]:
import pandas as pd
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', 
                       nrows=10**7,
                       usecols = data_types_dict.keys(),
                       dtype=data_types_dict, 
                       index_col = 0)

### Training data is in the competition dataset as usual
It's larger than will fit in memory with default settings, so we'll specify more efficient datatypes and only load a subset of the data for now.
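
As a rough illustration of why the explicit datatypes matter, the sketch below compares the memory footprint of a synthetic frame stored with default 64-bit integers against the same values stored with the compact types from `data_types_dict`. The frame `toy` and its value ranges are made up for the example and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic stand-in for a few train.csv columns (all int64 by default)
toy = pd.DataFrame({
    "timestamp": rng.integers(0, 10**9, size=n),
    "user_id": rng.integers(0, 10**6, size=n),
    "content_id": rng.integers(0, 13_000, size=n),
    "answered_correctly": rng.integers(-1, 2, size=n),
})

# Same values, stored with the narrower types used in data_types_dict
compact = toy.astype({
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "answered_correctly": "int8",
})

default_bytes = toy.memory_usage(deep=True).sum()
compact_bytes = compact.memory_usage(deep=True).sum()
print(f"default: {default_bytes / 1e6:.1f} MB, compact: {compact_bytes / 1e6:.1f} MB")
```

A narrower type is only safe when every value fits its range, so it is worth checking the column maxima before downcasting.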

In [None]:
grouped_by_user_df = train_df.groupby('user_id')

In [None]:
grouped_by_user_df.agg({'timestamp': 'max'}).hist(bins = 100)

**Answered correctly**

In [None]:
(train_df['answered_correctly']==-1).mean()

About 2% of activities are lectures (rows with `answered_correctly == -1`), so we exclude them when analyzing answers.
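
To see why the lecture rows need to go, here is a small sketch on a made-up frame: leaving the `-1` markers in drags the mean of `answered_correctly` down to a meaningless value. The five rows below are illustrative only.

```python
import pandas as pd

# Toy data: -1 marks a lecture row, 0/1 mark wrong/correct answers
toy = pd.DataFrame({"answered_correctly": [1, 0, -1, 1, -1]})

lecture_share = (toy["answered_correctly"] == -1).mean()
naive_accuracy = toy["answered_correctly"].mean()             # polluted by the -1 rows

questions_only = toy[toy["answered_correctly"] != -1]
accuracy = questions_only["answered_correctly"].mean()        # questions only

print(lecture_share, naive_accuracy, accuracy)
```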

In [None]:
train_questions_only_df = train_df[train_df['answered_correctly']!=-1]
train_questions_only_df['answered_correctly'].mean()

On average, users answer about 66% of questions correctly. Let's look at how this varies from user to user.

**Answers by users**

In [None]:
grouped_by_user_df = train_questions_only_df.groupby('user_id')

In [None]:
user_answers_df = grouped_by_user_df.agg({'answered_correctly': ['mean', 'count'] })

user_answers_df[('answered_correctly','mean')].hist(bins = 100)

It looks noisy; let's clean it up a little.

In [None]:
user_answers_df[('answered_correctly','count')].hist(bins = 100)

In [None]:
(user_answers_df[('answered_correctly','count')]< 50).mean()

54% of users answered fewer than 50 questions. Let's divide users into novices (fewer than 50 answers) and active users (50 or more).

In [None]:
user_answers_df[user_answers_df[('answered_correctly','count')]< 50][('answered_correctly','mean')].mean()

In [None]:
user_answers_df[user_answers_df[('answered_correctly','count')]< 50][('answered_correctly','mean')].hist(bins = 100)

In [None]:
user_answers_df[user_answers_df[('answered_correctly','count')] >= 50][('answered_correctly','mean')].hist(bins = 100)

In [None]:
user_answers_df[user_answers_df[('answered_correctly','count')] >= 50][('answered_correctly','mean')].mean()

We can see that active users do much better than novices. Still, the average per-user score is lower than the overall share of correct answers. Because the overall share weights each user by the number of questions answered, this means heavy users score even better. Let's look at them.
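
The gap between the two averages can be reproduced on a toy example. In this sketch (made-up data, two users), the plain mean of per-user accuracies treats both users equally, while the overall accuracy equals the mean weighted by each user's answer count.

```python
import pandas as pd

# User 1 answers 2 questions (1 correct); user 2 answers 8 questions (7 correct)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "answered_correctly": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
})

per_user = toy.groupby("user_id")["answered_correctly"].agg(["mean", "count"])

overall = toy["answered_correctly"].mean()          # every answer counts once
mean_of_user_means = per_user["mean"].mean()        # every user counts once
weighted = (per_user["mean"] * per_user["count"]).sum() / per_user["count"].sum()

print(overall, mean_of_user_means, weighted)
```

Here the heavy user is also the more accurate one, so the overall accuracy sits above the mean of user means, which is the same pattern the histograms above suggest.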

In [None]:
user_answers_df[user_answers_df[('answered_correctly','count')] >= 500][('answered_correctly','mean')].hist(bins = 100)

In [None]:
import matplotlib.pyplot as plt
plt.scatter(x = user_answers_df[('answered_correctly','count')], y=user_answers_df[ ('answered_correctly','mean')])

The timestamp, the average score of an active user, and the number of questions answered can all be useful features for a baseline.

**Answers by content**

In [None]:
grouped_by_content_df = train_questions_only_df.groupby('content_id')

In [None]:
# Aggregate per question (content_id), not per user
content_answers_df = grouped_by_content_df.agg({'answered_correctly': ['mean', 'count'] })

In [None]:
content_answers_df[('answered_correctly','count')].hist(bins = 100)

In [None]:
content_answers_df[('answered_correctly','mean')].hist(bins = 100)

Questions differ in popularity and difficulty, and both can also be used as features in the baseline.

In [None]:
content_answers_df[content_answers_df[('answered_correctly','count')]>50][('answered_correctly','mean')].hist(bins = 100)

Let's use the features we discovered in a model that predicts the probability of a correct answer.

In [None]:
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',
                       usecols = data_types_dict.keys(),
                       dtype=data_types_dict, 
                       index_col = 0)

In [None]:
features_part_df = train_df.iloc[:int(9 /10 * len(train_df))]
train_part_df = train_df.iloc[int(9 /10 * len(train_df)):]

In [None]:
train_questions_only_df = features_part_df[features_part_df['answered_correctly']!=-1]
grouped_by_user_df = train_questions_only_df.groupby('user_id')
user_answers_df = grouped_by_user_df.agg({'answered_correctly': ['mean', 'count']}).copy()
user_answers_df.columns = ['mean_user_accuracy', 'questions_answered']
# user_features_dict = user_answers_df.to_dict('index')

In [None]:
grouped_by_content_df = train_questions_only_df.groupby('content_id')
content_answers_df = grouped_by_content_df.agg({'answered_correctly': ['mean', 'count'] }).copy()
content_answers_df.columns = ['mean_accuracy', 'question_asked']
# content_features_dict = content_answers_df.to_dict('index')
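
The cells above aggregate with a dict and then overwrite the resulting two-level column names. As an alternative sketch, pandas named aggregation produces the flat column names in one step. The toy frame below is made up; the column names match the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "content_id": [10, 10, 10, 20, 20],
    "answered_correctly": [1, 0, 1, 0, 0],
})

# Named aggregation: output_column=(input_column, aggregation)
content_stats = toy.groupby("content_id").agg(
    mean_accuracy=("answered_correctly", "mean"),
    question_asked=("answered_correctly", "count"),
)
print(content_stats)
```

Both approaches should give the same numbers; the named form just avoids relying on the order of the renamed columns.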

In [None]:
del train_df
del features_part_df
del grouped_by_user_df
del grouped_by_content_df
del train_questions_only_df  # no longer needed once the feature tables are built

In [None]:
import gc
gc.collect()

In [None]:
features = ['timestamp','mean_user_accuracy', 'questions_answered','mean_accuracy', 'question_asked', 'prior_question_elapsed_time', 'prior_question_had_explanation']
target = 'answered_correctly'

In [None]:
train_part_df = train_part_df[train_part_df[target] != -1]

In [None]:
train_part_df = train_part_df.merge(user_answers_df, how = 'left', on = 'user_id')
train_part_df = train_part_df.merge(content_answers_df, how = 'left', on = 'content_id')

In [None]:
train_part_df['prior_question_had_explanation'] = train_part_df['prior_question_had_explanation'].fillna(value = False).astype(bool)
train_part_df.fillna(value = -1, inplace = True)
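
The left merge followed by `fillna(-1)` is what handles users or questions that never appeared in the feature part of the data. A minimal sketch with made-up values, where user 3 has no history:

```python
import pandas as pd

# Feature table indexed by user_id, shaped like user_answers_df above
user_stats = pd.DataFrame(
    {"mean_user_accuracy": [0.75, 0.40], "questions_answered": [120, 15]},
    index=pd.Index([1, 2], name="user_id"),
)

# Rows to score; user 3 is not in the feature table
rows = pd.DataFrame({"user_id": [1, 3, 2], "content_id": [7, 7, 8]})

# how='left' keeps every row and leaves NaN where no features exist
merged = rows.merge(user_stats, how="left", on="user_id")
n_unseen = int(merged["mean_user_accuracy"].isna().sum())

# -1 is outside the valid range of both features, so it acts as a "no history" marker
merged = merged.fillna(-1)
print(merged)
```

Using a sentinel such as `-1` is a simple choice that tree-based models can split on; it is an assumption of this baseline rather than the only option.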

In [None]:
train_part_df.columns

In [None]:
train_part_df = train_part_df[features + [target]]

In [None]:
train_part_df

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
from lightgbm import LGBMClassifier

In [None]:
lgbm = LGBMClassifier(
    boosting_type='gbdt',
    num_leaves=31,
    max_depth=-1,
    n_estimators=60,
    min_child_samples=1000,
    subsample=0.6,
    subsample_freq=1,
    n_jobs=2
)

In [None]:
lgbm.fit(train_part_df[features], train_part_df[target])

In [None]:
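# Note: this AUC is computed on the same rows the model was fitted on, so it is an optimistic estimate
roc_auc_score(train_part_df[target].values, lgbm.predict_proba(train_part_df[features])[:,1])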
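
For intuition about what `roc_auc_score` measures, the sketch below computes the same quantity by hand: the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one, with ties counted as one half. `rank_auc` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np
import pandas as pd


def rank_auc(y_true, y_score):
    """ROC AUC via the Mann-Whitney U statistic; tied scores get average ranks."""
    y_true = np.asarray(y_true)
    ranks = pd.Series(y_score).rank(method="average").to_numpy()
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Sum of positive ranks minus the smallest possible sum gives U
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = rank_auc(y_true, y_score)   # 3 of the 4 positive/negative pairs are ordered correctly
tie_auc = rank_auc([0, 1], [0.5, 0.5])
print(auc, tie_auc)
```

Since the metric depends only on the ordering of the scores, a constant prediction such as 0.5 for every row yields an AUC of 0.5.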

In [None]:
import riiideducation

env = riiideducation.make_env()

In [None]:
iter_test = env.iter_test()

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    test_df = test_df.merge(user_answers_df, how = 'left', on = 'user_id')
    test_df = test_df.merge(content_answers_df, how = 'left', on = 'content_id')
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(value = False).astype(bool)
    test_df.fillna(value = -1, inplace = True)

    test_df['answered_correctly'] = lgbm.predict_proba(test_df[features])[:,1]
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])