<img src='https://www.koreatechtoday.com/wp-content/uploads/2020/04/riiid-logo-background-scaled.jpg' width='640'>

<h1><center>Riiid! Answer Correctness Prediction - EDA</center></h1>
    
# 1. <a id='1'>Introduction 🃏 </a>




In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions. You will pair your machine learning skills using Riiid’s EdNet data.

## 1.1 Metric: Area under the ROC curve
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

- [Image link](http://arogozhnikov.github.io/2015/10/05/roc-curve.html)

<img src='http://arogozhnikov.github.io/images/roc_curve.gif' width='640'>
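
To make the metric concrete, here is a minimal sketch of AUC computed straight from its definition: the probability that a randomly chosen correct answer receives a higher predicted score than a randomly chosen incorrect one. The function name `roc_auc` and the toy labels and scores are made up for illustration; in practice a library routine would normally be used instead of this quadratic loop.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = answered correctly, 0 = answered incorrectly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)

# A constant prediction carries no ranking information
constant_auc = roc_auc([0, 1, 1], [0.5, 0.5, 0.5])
print(constant_auc)
```

Because only the ranking of the scores matters, a constant prediction such as 0.5 for every row scores exactly 0.5.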

## 1.2. Important point

This is a time-series code competition: you will receive test set data and make predictions with Kaggle's time-series API. Please be sure to review the Time-series API Details section closely.

You will predict whether students are able to answer their next questions correctly.

Please see these basic kernels:
- [Competition API Detailed Introduction](https://www.kaggle.com/sohier/competition-api-detailed-introduction)
- [Quick Sample Submission](https://www.kaggle.com/sohier/quick-sample-submission)

If you feel this was something new and fresh, and it added some value to you, please consider <font color='orange'>upvoting</font>; it motivates me to keep writing good kernels. 😄

## <font size='5' color='blue'>Contents</font> 



* [Basic Exploratory Data Analysis](#1)  
    * [Getting started - Importing libraries]()
    * [Reading the dataset]()
    
 
* [Basic Data Exploration](#2)   
     * [Check Train Info.]()
     * [Check Test Info.]()
     * [Check Metadata Info.]()

* [Data Exploration in Details for Train DataFrame](#3)   
     * [Distribution of columns]()
     * [Heatmap]()
     
 
* [Pandas Profiling](#3)    
     * [Pandas Profiling Report For Train Info.]()
     * [Pandas Profiling Report For Test Info.]()
     * [Pandas Profiling Report For Metadata Info.]()

     
* [Etc. Sample Submission](#4)

# 2. <a id='2'>Importing the necessary libraries📗</a>

In [None]:
# import libraries
import os
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# datatable
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

#color
from colorama import Fore, Back, Style
y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
sr_ = Style.RESET_ALL

#plotly
#!pip install chart_studio
!pip install ../input/chart-studio/chart_studio-1.0.0-py3-none-any.whl
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

# Settings for pretty nice plots
plt.style.use('fivethirtyeight')
plt.show()

# 3. <a id='3'>Reading the dataset 📚</a>

In [None]:
# List files available
print(f'{y_}{list(os.listdir("../input/riiid-test-answer-prediction"))}{sr_}')

## Train_df

### Original Reading Train.csv

It's larger than will fit in memory with default settings, so we'll specify more efficient datatypes and only load a subset of the data for now.

In [None]:
%%time

train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', low_memory=False, nrows=10**5, 
                       dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
                              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
                      )
print(Fore.YELLOW + 'Training data shape: ',Style.RESET_ALL,train_df.shape)
train_df
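
As a rough illustration of why the narrow dtypes matter, the sketch below compares the memory footprint of the same toy columns stored as `int64` and as `int8`. The frame is made up; only the column names and target dtypes are taken from the `read_csv` call above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration
n = 1_000
toy = pd.DataFrame({
    "content_type_id": np.zeros(n, dtype="int64"),
    "user_answer": np.full(n, 3, dtype="int64"),
    "answered_correctly": np.ones(n, dtype="int64"),
})
default_bytes = toy.memory_usage(index=False).sum()

# Downcast to the narrow dtypes passed to read_csv
compact = toy.astype({
    "content_type_id": "int8",
    "user_answer": "int8",
    "answered_correctly": "int8",
})
compact_bytes = compact.memory_usage(index=False).sum()

print(f"int64: {default_bytes} bytes, int8: {compact_bytes} bytes")
```

Each `int64` value takes 8 bytes and each `int8` value takes 1, so these three columns shrink by a factor of eight.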

From this we can see that there are two continuous features:
* `timestamp`: the time between this user interaction and the first event from that user.
* `prior_question_elapsed_time`: how long it took a user to answer their previous question bundle.

There are three integer ID features. They are stored as numbers but identify things rather than measure them:
* `user_id`: the ID code for the user.
* `content_id`: ID code for the user interaction.
* `task_container_id`: ID code for the batch of questions or lectures.

There are categorical features:
* `user_answer`: the user's answer to the question, if any (read -1 as null).
* `answered_correctly`: whether the user responded correctly (again, read -1 as null).
* `content_type_id`: 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

### Reading Train data in jay format

- https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid/data?

In [None]:
import gc

del train_df
gc.collect()

In [None]:
%%time

# reading the dataset from raw csv file
import datatable as dt

dt.fread("../input/riiid-test-answer-prediction/train.csv").to_jay("train.jay")

train_df = dt.fread("train.jay").to_pandas()

print(Fore.YELLOW + 'Training data shape: ',Style.RESET_ALL,train_df.shape)
train_df

## Test_df

In [None]:
import riiideducation

# You can only call make_env() once, so don't lose it!
env = riiideducation.make_env()

In [None]:
iter_test = env.iter_test()

In [None]:
iteration = 0
count = 0
for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
    print(f'{iteration} iteration !!!!')
    iteration += 1
    
    print(len(test_df))
    count += len(test_df)

In [None]:
print(f'{y_}Test data shape: {sr_}{test_df.shape}')

test_df.head()

The format is largely the same as `train.csv`. There are two additional columns that mirror what information the AI tutor actually has available at any given time, but with the user interactions grouped together for the sake of API performance rather than strictly showing information for a single user at a time:

- `prior_group_responses (string)`: all of the `user_answer` entries for the previous group, as a string representation of a list, in the first row of the group. All other rows in each group are null. If you are using Python, you will likely want to call eval on the non-null rows. Some rows may be null, or empty lists.

- `prior_group_answers_correct (string)`: all of the `answered_correctly` entries for the previous group, with the same format and caveats as `prior_group_responses`. Some rows may be null, or empty lists.

Some questions will appear in the hidden test set that have NOT been presented in the train set, emulating the challenge of quickly adapting to modeling newly introduced questions. Their metadata is still in `questions.csv` as usual.
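
The `prior_group_responses` and `prior_group_answers_correct` strings can be turned back into lists as sketched below. This version uses `ast.literal_eval` rather than plain `eval`, a swapped-in choice that only accepts Python literals; the helper name `parse_prior_group` and the sample values are hypothetical.

```python
import ast


def parse_prior_group(value):
    """Turn a string such as "[1, 0, 3]" into a list; null or empty input gives []."""
    if value is None:
        return []
    text = str(value).strip()
    if not text or text.lower() == "nan":
        return []
    return list(ast.literal_eval(text))


# Made-up cell values: a filled list, two kinds of null, and an empty list
rows = ["[1, 0, 3]", None, "[]", float("nan")]
parsed = [parse_prior_group(r) for r in rows]
print(parsed)
```

On a real batch this could be applied to the first row of the group, for example `parse_prior_group(test_df['prior_group_responses'].iloc[0])`.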

In [None]:
count

The API served `104` rows in total across all iterations, but the last `row_id` is `108`, so a few `row_id`s must be missing.

Let's see example_test.csv.

In [None]:
test_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')

In [None]:
test_df.iloc[[34, 35, 36]]

In [None]:
test_df.iloc[[51, 52, 53]]

In [None]:
test_df.iloc[[67, 68, 69]]

In [None]:
test_df.iloc[[78, 79, 80]]

In [None]:
test_df.iloc[[80, 81, 82]]

Some `row_id`s are missing from `example_test.csv`:

* `36`, `52`, `68`, `83`, `85`
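
Rather than inspecting slices by hand, the gaps can also be found programmatically. The sketch below is one possible approach, using a hypothetical helper `missing_row_ids` and made-up ids.

```python
def missing_row_ids(row_ids):
    """Return the integers absent from the range min(row_ids)..max(row_ids)."""
    present = set(row_ids)
    return [i for i in range(min(present), max(present) + 1) if i not in present]


# Made-up ids with two gaps, standing in for test_df['row_id']
example_ids = [0, 1, 2, 4, 5, 7]
gaps = missing_row_ids(example_ids)
print(gaps)
```

Calling `missing_row_ids(test_df['row_id'])` on the example file should list whichever ids are absent from it.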

# Metadata - Questions.csv

In [None]:
question_df = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')

print(f'{y_}Questions metadata shape: {sr_}{question_df.shape}')

In [None]:
question_df.head()

`question_id`: foreign key for the train/test content_id column, when the content type is question (0).

`bundle_id`: code for which questions are served together.

`correct_answer`: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

`part`: top level category code for the question.

`tags`: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
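
The link between the interaction log and the question metadata can be sketched as a merge on `content_id` and `question_id`, followed by a comparison of `user_answer` with `correct_answer`. All rows below are made up, and the assumption that `tags` is a space-separated string of integer codes should be checked against the real file.

```python
import pandas as pd

# Made-up rows that follow the train.csv / questions.csv column names
train = pd.DataFrame({
    "content_id": [10, 11, 10],
    "content_type_id": [0, 0, 0],
    "user_answer": [2, 1, 3],
})
questions = pd.DataFrame({
    "question_id": [10, 11],
    "correct_answer": [2, 0],
    "tags": ["51 131", "9"],  # assumed format: space-separated codes
})

# Join each question event to its metadata via content_id == question_id
merged = train[train["content_type_id"] == 0].merge(
    questions, left_on="content_id", right_on="question_id", how="left"
)
merged["is_correct"] = (merged["user_answer"] == merged["correct_answer"]).astype("int8")

# Split the tag string into a list of integer codes
merged["tag_list"] = merged["tags"].str.split().apply(lambda t: [int(x) for x in t])

print(merged[["content_id", "user_answer", "correct_answer", "is_correct", "tag_list"]])
```

On the real training data, `is_correct` should agree with `answered_correctly` for question rows.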

# Metadata - lectures.csv

In [None]:
lectures_df = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')

print(f'{y_}Lectures metadata shape: {sr_}{lectures_df.shape}')

In [None]:
lectures_df.head()

`lecture_id`: foreign key for the train/test content_id column, when the content type is lecture (1).

`part`: top level category code for the lecture.

`tag`: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

`type_of`: brief description of the core purpose of the lecture

# 4. <a id='4'>Basic Data Exploration 🏕️</a> 

### General Info

In [None]:
print(f'{y_}Train Set !!: {sr_}')
print(train_df.info())
print('-------------')
print(f'{y_}Question Set !!: {sr_}')
print(question_df.info())
print('-------------')
print(f'{y_}Lectures Set !!: {sr_}')
print(lectures_df.info())

In [None]:
print(f'Total row_id in Train set: {g_}{train_df["row_id"].count()}{sr_}')

### Missing Values

In [None]:
train_df.isnull().sum()

In [None]:
question_df.isnull().sum()

In [None]:
lectures_df.isnull().sum()

### Unique User Id

In [None]:
print(Fore.YELLOW + "The total user ids are",Style.RESET_ALL,f"{train_df['user_id'].count()},", Fore.BLUE + "from those the unique ids are", Style.RESET_ALL, f"{train_df['user_id'].value_counts().shape[0]}.")

### Value Counts

In [None]:
train_df['row_id'].value_counts().max()

In [None]:
train_df['user_id'].value_counts().max()

In [None]:
train_df['content_id'].value_counts().max()

In [None]:
train_df['task_container_id'].value_counts().max()

# 5. <a id='5'>Data Exploration in Details For Train Dataset 🎠</a> 

## Creating Individual User Id Dataframe for Train_df
For the 349 unique user ids, we build a new dataframe with `user_id` as the first column and duplicate rows dropped.

In [None]:
train_df = train_df[['user_id', 'row_id', 'timestamp', 'content_id', 'content_type_id', 'task_container_id', 'user_answer', 'answered_correctly', 'prior_question_elapsed_time', 'prior_question_had_explanation']].drop_duplicates()
train_df.head()

### Distribution of timestamp

`timestamp`: the time between this user interaction and the first event from that user.

In [None]:
train_df['timestamp'].iplot(kind='hist',
                              xTitle='timestamp', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the timestamp in the train_df')

https://www.kaggle.com/artgor/riiid-eda-feature-engineering-and-models

In [None]:
train_df.groupby(['user_id'])['timestamp'].max().sort_values(ascending=False)

In [None]:
fig = px.scatter(train_df, x="user_id", y="timestamp", color='user_id')
fig.show()

We can see that some users have a very long activity span, i.e. a very large maximum `timestamp`.
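
Raw `timestamp` values are hard to read at this scale. The sketch below converts each user's largest timestamp into days, assuming the column is measured in milliseconds (an assumption based on the competition's data description, not something shown in this notebook). The toy rows are made up.

```python
import pandas as pd

MS_PER_DAY = 1000 * 60 * 60 * 24  # assumes timestamp is in milliseconds

# Made-up interactions for two users
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "timestamp": [0, 172_800_000, 0, 43_200_000],
})

# Largest timestamp per user = span between first and last recorded event
span_days = toy.groupby("user_id")["timestamp"].max() / MS_PER_DAY
print(span_days)
```

The same expression applied to `train_df` would express each user's activity span in days.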

### Distribution of content_id

In [None]:
train_df['content_id'].value_counts()

In [None]:
train_df['content_id'].iplot(kind='hist',
                              xTitle='content_id', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the content_id column in the Unique Train_df')

To interpret a `content_id`, we need to look it up in `questions.csv`.

In [None]:
train_df.loc[train_df['content_id'] == 4120, 'user_answer'].value_counts()

In [None]:
question_df.loc[question_df['question_id'] == 4120]

### Distribution content_id over Unique user_id

In [None]:
fig = px.scatter(train_df, x="user_id", y="content_id", color='content_type_id')
fig.show()

This suggests that where `content_id` is low, some users did not answer questions.

## Distribution of content_type_id

In [None]:
train_df['content_type_id'].value_counts()

In [None]:
train_df['content_type_id'].value_counts().iplot(kind='bar',
                                          yTitle='Count', 
                                          linecolor='black', 
                                          opacity=0.7,
                                          color='blue',
                                          theme='pearl',
                                          bargap=0.8,
                                          gridcolor='white',
                                          title='Distribution of the Content_type_id column in Train_df')

`0` marks a question being posed to the user and `1` marks the user watching a lecture, so the rows with `0` are question events, not lectures.

In [None]:
# pull is given as a fraction of the pie radius
fig = go.Figure(data=[go.Pie(labels=train_df['content_type_id'].value_counts().index, values=train_df['content_type_id'].value_counts(), pull=[0, 0.2])])
fig.show()

In [None]:
fig = px.scatter(train_df, x="content_id", y="user_id", color='user_id')
fig.show()

## Distribution of task_container_id

In [None]:
train_df['task_container_id']

In [None]:
train_df['task_container_id'].iplot(kind='hist',
                              xTitle='task_container_id', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the task_container_id in the train_df')

In [None]:
fig = px.scatter(train_df, x="task_container_id", y="prior_question_elapsed_time", color='user_id')
fig.show()

We also need to check `questions.csv` here.

In [None]:
train_df['task_container_id'].value_counts()

In [None]:
train_df.loc[train_df['task_container_id'] == 15, 'user_answer'].value_counts()

`-1` means null for lectures.

In [None]:
question_df.loc[question_df['question_id'] == 15]

In [None]:
train_df.loc[train_df['task_container_id'] == 5283, 'user_answer'].value_counts()

In [None]:
question_df.loc[question_df['question_id'] == 5283]

## Distribution of User_answer

In [None]:
train_df['user_answer'].value_counts()

In [None]:
train_df['user_answer'].value_counts().iplot(kind='bar',
                                          yTitle='Count', 
                                          linecolor='black', 
                                          opacity=0.7,
                                          color='red',
                                          theme='pearl',
                                          bargap=0.8,
                                          gridcolor='white',
                                          title='Distribution of the user_answer column in Train_df')

- https://www.kaggle.com/dwchen/riiid-test-simpleeda-10m-data

In [None]:
ds = train_df['user_answer'].value_counts().reset_index()
ds.columns = ['user_answer', 'count']
fig = px.pie(
    ds, 
    values='count', 
    names="user_answer", 
    title='user_answer pie chart', 
    width=500, 
    height=500
)
fig.show()

In [None]:
fig = px.scatter(train_df, x="user_answer", y="content_type_id", color='user_id')
fig.show()

## Answered_correctly Distribution of Unique user_id

In [None]:
train_df['answered_correctly'].value_counts()

In [None]:
train_df['answered_correctly'].value_counts().iplot(kind='bar',
                                          yTitle='Count', 
                                          linecolor='black', 
                                          opacity=0.7,
                                          color='blue',
                                          theme='pearl',
                                          bargap=0.8,
                                          gridcolor='white',
                                        title='Distribution of the answered_correctly column in Train_df')

In [None]:
plt.figure(figsize = (16,12))

a = sns.countplot(data=train_df, x='answered_correctly', hue='prior_question_had_explanation')


for p in a.patches:
    a.annotate(format(p.get_height(), ','), 
           (p.get_x() + p.get_width() / 2., 
            p.get_height()), ha = 'center', va = 'center', 
           xytext = (0, 4), textcoords = 'offset points')

plt.title('Answers result with and without explanations', fontsize=20)
plt.xlabel('Answered_correctly', fontsize = 16)
sns.despine(left=True, bottom=True);

In [None]:
plt.figure(figsize=(16,8))
sns.countplot(x=train_df['user_answer'], hue=train_df['answered_correctly'], palette='Set1', hatch='-', linewidth=0.5)
plt.title('User_Answer vs Correctness', fontsize = 20)
plt.show()

In [None]:
fig = px.scatter(train_df, x="answered_correctly", y="task_container_id", color='user_id')
fig.show()

## Distribution of Prior_question_elapsed_time

In [None]:
train_df['prior_question_elapsed_time']

In [None]:
train_df['prior_question_elapsed_time'].iplot(kind='hist',
                              xTitle='prior_question_elapsed_time', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the prior_question_elapsed_time column in the Unique Train_df')

### Content id vs prior_question_elapsed_time

In [None]:
fig = px.scatter(train_df, x="content_id", y="prior_question_elapsed_time", color='user_id')
fig.show()

## Prior_question_had_explanation Distribution of Unique user_id

In [None]:
train_df['prior_question_had_explanation'].value_counts()

In [None]:
train_df['prior_question_had_explanation'].value_counts().iplot(kind='bar',
                                          yTitle='Count', 
                                          linecolor='black', 
                                          opacity=0.7,
                                          color='red',
                                          theme='pearl',
                                          bargap=0.8,
                                          gridcolor='white',
                                          title='Distribution of the prior_question_had_explanation column in Train_df')

## Distribution of correct answers percentage by each user
- https://www.kaggle.com/aykhanpy/riiid-answer-correctness-prediction-eda

In [None]:
# Lecture rows carry answered_correctly == -1, so keep question rows only
question_rows = train_df[train_df['answered_correctly'] != -1]
temp_train = question_rows.groupby('user_id').agg({'answered_correctly': 'sum', 'row_id':'count'})
plt.figure(figsize = (16,8))
sns.distplot((temp_train.answered_correctly * 100)/temp_train.row_id)
plt.title('Distribution of correct answers percentage by each user', fontdict = {'size': 16})
plt.xlabel('Percentage of correct answers', size = 12)

# Heatmap for train_df

In [None]:
corrmat = train_df.corr() 
f, ax = plt.subplots(figsize =(9, 8)) 
sns.heatmap(corrmat, ax = ax, cmap = 'RdYlBu_r', linewidths = 0.5) 

Compare this heatmap with the earlier plots, and with the Pandas Profiling reports below.

# 6. <a id='6'>Data Exploration in Details For Metadata-Question 🎠</a> 

In [None]:
question_df.head()

## Bundle Id

In [None]:
question_df['bundle_id'].iplot(kind='hist',
                              xTitle='bundle_id', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the bundle_id in the question_df')

## correct_answer

In [None]:
question_df['correct_answer'].iplot(kind='hist',
                              xTitle='correct_answer', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#098060',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the correct_answer in the question_df')

In [None]:
fig = px.scatter(question_df, x="bundle_id", y="correct_answer", color='question_id')
fig.show()

In [None]:
question_df['part'].iplot(kind='hist',
                              xTitle='part', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the part in the question_df')

In [None]:
fig = px.scatter(question_df, x="correct_answer", y="part", color='part')
fig.show()

## Heatmap for Question_df

In [None]:
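# Correlate numeric columns only; the string column `tags` cannot be correlated
corrmat = question_df.select_dtypes('number').corr()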
f, ax = plt.subplots(figsize =(9, 8)) 
sns.heatmap(corrmat, ax = ax, cmap = 'RdYlBu_r', linewidths = 0.5) 

# 7. <a id='7'>Data Exploration in Details For Metadata-Lectures 🎠</a> 

In [None]:
lectures_df.head()

In [None]:
lectures_df['tag'].iplot(kind='hist',
                              xTitle='tag', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the tag in the lectures_df')

In [None]:
lectures_df['part'].iplot(kind='hist',
                              xTitle='part', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#098060',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the part in the lectures_df')

In [None]:
lectures_df['type_of'].iplot(kind='hist',
                              xTitle='type_of', 
                              yTitle='Counts',
                              linecolor='black', 
                              opacity=0.7,
                              color='#FB8072',
                              theme='pearl',
                              bargap=0.2,
                              gridcolor='white',
                              title='Distribution of the type_of in the lectures_df')

In [None]:
fig = px.scatter(lectures_df, x="type_of", y="part", color='lecture_id')
fig.show()

In [None]:
fig = px.bar(lectures_df, x='type_of', color=lectures_df['type_of'], labels={'value':'type_of'}, title='Type of lectures distribution Overall')
fig.show()

https://www.kaggle.com/naim99/eda-riiid

In [None]:
fig = px.bar(lectures_df, x='type_of', color=lectures_df['type_of'], labels={'value':'type_of'}, title='Type of lectures distribution based on each part', facet_col='part')
fig.show()

## Heatmap for Lectures_df

In [None]:
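# Correlate numeric columns only; the string column `type_of` cannot be correlated
corrmat = lectures_df.select_dtypes('number').corr()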
f, ax = plt.subplots(figsize =(9, 8)) 
sns.heatmap(corrmat, ax = ax, cmap = 'RdYlBu_r', linewidths = 0.5) 

# This notebook is a work in progress.

# 8. <a id='8'>Pandas Profiling </a>

In [None]:
import pandas_profiling as pdp

In [None]:
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', low_memory=False, nrows=10**5, 
                       dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
                              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
                      )

test_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')

In [None]:
profile_train_df = pdp.ProfileReport(train_df)

In [None]:
profile_train_df

In [None]:
profile_test_df = pdp.ProfileReport(test_df)

In [None]:
profile_test_df

In [None]:
profile_question_df = pdp.ProfileReport(question_df)

In [None]:
profile_question_df

In [None]:
profile_lectures_df = pdp.ProfileReport(lectures_df)

In [None]:
profile_lectures_df

# 9. Etc - Sample Submission

- https://www.kaggle.com/sishihara/riiid-answered-correctly-benchmark

In [None]:
content_acc = train_df.query('answered_correctly != -1').groupby('content_id')['answered_correctly'].mean().to_dict()

In [None]:
def add_content_acc(x):
    # Fall back to 0.5 for content ids that are missing from content_acc
    return content_acc.get(x, 0.5)


# iter_test is a one-pass generator: if the earlier loop already consumed it,
# this loop has nothing left to iterate over.
for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = test_df['content_id'].apply(add_content_acc).values
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
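
The same per-content baseline can also be written without a Python-level function, using `Series.map` with a fallback for unseen ids. This is only a sketch on made-up data; it does not touch the competition API, and the frames `train` and `test` are stand-ins for `train_df` and one `test_df` batch.

```python
import pandas as pd

# Made-up training rows; -1 marks lecture rows, as in train.csv
train = pd.DataFrame({
    "content_id": [10, 10, 11, 12],
    "answered_correctly": [1, 0, 1, -1],
})
content_acc = (
    train[train["answered_correctly"] != -1]
    .groupby("content_id")["answered_correctly"]
    .mean()
)

# A test batch containing one content_id (99) that never appeared in training
test = pd.DataFrame({"row_id": [0, 1, 2], "content_id": [10, 11, 99]})
test["answered_correctly"] = test["content_id"].map(content_acc).fillna(0.5)
print(test)
```

Mapping through a Series leaves unseen ids as NaN, and `fillna(0.5)` then plays the role of the `else` branch in `add_content_acc`.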

## If this kernel is useful, <font color='orange'>please upvote</font>!
- See you next time!