# Introduction

Tailoring education to a student's ability level is one of the many valuable things an AI tutor can do.We will predict whether students are able to answer their next questions correctly.We will be provided with the same sorts of information a complete education app would have: that student's historic performance, the performance of other students on the same question, metadata about the question itself, and more.

In 2018, 260 million children weren't attending school. At the same time, more than half of these young students didn't meet minimum reading and math standards. Education was already in a tough place when COVID-19 forced most countries to temporarily close schools. This further delayed learning opportunities and intellectual development. The equity gaps in every country could grow wider. We need to re-think the current education system in terms of attendance, engagement, and individualized attention.

### **This Notebook will contain Full Pipeline from data analytics to data modelling with powerful insights of prediction with explainable comments.**

In [1]:
# Importing Libraries
import numpy as np # To deal with matrix calculations 
import pandas as pd # This library will be used for for Structural Data Handling
import os # This Library is used for making interactions with the OS for File Handling
import matplotlib.pyplot as plt #For Data Visualization
import seaborn as sns # For Data Visualization
%matplotlib inline 
sns.set() #for making background properties for visualization beautiful
from scipy import stats # For Stastical Approach

# Data Structure and its analysis

We have been given three csv (Comma Seperated Values) files: **Train**, **Questions**, **Lectures**. All the three has been given with different features and related to each other
* **Train**: This File Contains the **ID's related to Questions, User and Content**. Along with **Timestamps of Interaction of User with Content**.**Answers** Provided by the user and their overall **effeciency**.Now Content is Divided into two parts: As this Data has been collected from an **Educational Application**, they are divided into **Questions** and **Lectures**. Metadata related to Questions and Lectures has been provided in different files.

* **Questions**: This File contain question ID's, their correct answers and tags to which these questions are related to.
* **Lectures**:  This File contain Lecture ID along with Summary to what part is covered by this particular lecture.


### Training File Overview

**The Original Training File Contains Millions of Rows, which is not Memory effecient, So here we are taking first 1 lakh rows and in further in the notebook we will take the help of [Large DataFrame Import Technique Notebook](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets/), from which we will take Pickle technique.**

In [2]:
# Train File . We will be using Pandas library 
train = pd.read_csv(
    '/kaggle/input/riiid-test-answer-prediction/train.csv', 
    low_memory=False, 
    nrows=10**6, #We have the limit the rows for Memory Purposes
    dtype={ # We are providing Data type as per specified in the problem statement
        'row_id': 'int64', 
        'timestamp': 'int64', 
        'user_id': 'int32', 
        'content_id': 'int16', 
        'content_type_id': 'int8',
        'task_container_id': 'int16', 
        'user_answer': 'int8', 
        'answered_correctly': 'int8', 
        'prior_question_elapsed_time': 'float32', 
        'prior_question_had_explanation': 'boolean'
    }
) # We have import the file into train variable

#Let's See some Data Information
print(f'This File Contains {train.shape[0]} Records and {train.shape[1]} Features')
print('Below are the Top 5 Rows of the CSV file \n')
train.head()

This File Contains 1000000 Records and 10 Features
Below are the Top 5 Rows of the CSV file 



Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,0,1,3,1,,
1,1,56943,115,5716,0,2,2,1,37000.0,False
2,2,118363,115,128,0,0,0,1,55000.0,False
3,3,131167,115,7860,0,3,0,1,19000.0,False
4,4,137965,115,7922,0,4,1,1,11000.0,False


In [3]:
print('Below are the Insights of the null values percentage in our Data')
print(train.isnull().sum()*100/train.shape[0])

Below are the Insights of the null values percentage in our Data
row_id                            0.0000
timestamp                         0.0000
user_id                           0.0000
content_id                        0.0000
content_type_id                   0.0000
task_container_id                 0.0000
user_answer                       0.0000
answered_correctly                0.0000
prior_question_elapsed_time       2.3723
prior_question_had_explanation    0.3816
dtype: float64


In [4]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

ds = train['user_id'].value_counts().reset_index() #We are counting unique User ID's and creating another dataframe named as ds
ds.columns = ['user_id', 'count'] # Naming Columns for the DataFrame as User_ID and Count
ds['user_id'] = ds['user_id'].astype(str) + '-' #This Has been done for Visualization as you can see in below graph on Y-axis User Id with an -
ds = ds.sort_values(['count']) # We are sorting in Ascending Order for Top Users Count

fig = px.bar(
    ds.tail(30), 
    x='count', # Specifying the X-axis
    y='user_id', #Specifying the Y-axis
    orientation='h', 
    title='Top 30 users by number of actions', 
    height=900, 
    width=700
)

fig.show()


In [5]:
# Replacing NULL Values of prior_question_elapsed_time to 0 as according to the data description null values are given to that values which are 1st course in their bundles
train['prior_question_elapsed_time'].fillna(0,inplace=True)
train['prior_question_had_explanation'].fillna(False,inplace=True)
train.head()

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,0,1,3,1,0.0,False
1,1,56943,115,5716,0,2,2,1,37000.0,False
2,2,118363,115,128,0,0,0,1,55000.0,False
3,3,131167,115,7860,0,3,0,1,19000.0,False
4,4,137965,115,7922,0,4,1,1,11000.0,False


In [6]:
answers = train['answered_correctly'].value_counts().reset_index() # Creating DataFrame for Answers whether correct, incorrect or lectures 
answers.columns = ['answers','count']

#Here in Pie Chart 1 represents Answered Correctly, 0 represents Answered Incorrect and -1 represents Lectures
fig = px.pie(answers,values='count',names='answers', title='Showing relative percentage of Answers')
fig.show()

In [7]:
answers = train['prior_question_had_explanation'].value_counts().reset_index() # Creating DataFrame for Answers whether correct, incorrect or lectures 
answers.columns = ['explaination','count']

#Here in Pie Chart 1 represents Answered Correctly, 0 represents Answered Incorrect and -1 represents Lectures
fig = px.pie(answers,values='count',names='explaination', title='Showing relative percentage of Explanations presence for prior question')
fig.show()

In [8]:
fig = px.histogram(train,x='prior_question_elapsed_time',nbins=300)
fig.show()
print('In Above Figure we can see that the graph is skewed to left, means Most of the elapsed time is before 50000 milliseconds. This is Skewed Graph we will normalize it .')

In Above Figure we can see that the graph is skewed to left, means Most of the elapsed time is before 50000 milliseconds. This is Skewed Graph we will normalize it .


### Questions File Overview

In [9]:
questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
questions

Unnamed: 0,question_id,bundle_id,correct_answer,part,tags
0,0,0,0,1,51 131 162 38
1,1,1,1,1,131 36 81
2,2,2,0,1,131 101 162 92
3,3,3,0,1,131 149 162 29
4,4,4,3,1,131 5 162 38
...,...,...,...,...,...
13518,13518,13518,3,5,14
13519,13519,13519,3,5,8
13520,13520,13520,2,5,73
13521,13521,13521,0,5,125


In [10]:
questions['tag'] = questions['tags'].str.split(' ') #Splitiing Difeerent Tags so we can work with them
questions = questions.explode('tag') # Explode Function is used to make multiple values in a single row into single value row by duplicating the rows and different column values. e.g:- Tags here
questions = pd.merge(questions, questions.groupby('question_id')['tag'].count().reset_index(), on='question_id') #merge function
questions = questions.drop(['tag_x'], axis=1) # we are dropping the column from the dataframe
questions.columns = ['question_id', 'bundle_id', 'correct_answer', 'part', 'tags', 'tags_number']
questions = questions.drop_duplicates() # Removal of Duplicates
questions

Unnamed: 0,question_id,bundle_id,correct_answer,part,tags,tags_number
0,0,0,0,1,51 131 162 38,4
4,1,1,1,1,131 36 81,3
7,2,2,0,1,131 101 162 92,4
11,3,3,0,1,131 149 162 29,4
15,4,4,3,1,131 5 162 38,4
...,...,...,...,...,...,...
30988,13518,13518,3,5,14,1
30989,13519,13519,3,5,8,1
30990,13520,13520,2,5,73,1
30991,13521,13521,0,5,125,1


In [11]:
# Comparing the correct answers percentages given by Users and Actual Correct answers

actual_correct = questions['correct_answer'].value_counts().reset_index()
actual_correct.columns = ['correct_answers','count']

user_answers = train['user_answer'].value_counts().reset_index()
user_answers.columns = ['user_answers','count']

fig1 = px.pie(actual_correct,values='count',names='correct_answers',title='Actual Answers Relative Percentage')

fig2 = px.pie(user_answers[user_answers['user_answers'] != -1],values='count',names='user_answers',title='User Answers Relative Proportions')

fig1.show()
fig2.show()

In [12]:
# Top Most Tags 
distinct_tags = set() # A Set for Collecting Different Tags
tags = questions['tags'].astype(str) # Converting Data Type for Tags Column
# Collecting Tags
for i in tags:
    for t in i.split(' '):
        distinct_tags.add(t)
distinct_tags = list(distinct_tags)
n = []
#Collecting Different Counts for Different Tags
for tag in distinct_tags:
    count = 0
    for t in tags:
        if tag in t:
            count += 1
    n.append(count)
# Tags DataFrame 
tags_df = pd.DataFrame({'Tags':distinct_tags,'Count':n})
tags_df = tags_df.sort_values(['Count'])
tags_df['Tags'] = tags_df['Tags'].astype(str) + '-'
#Figure Creation
fig = px.bar(
    tags_df.tail(20), 
    x='Tags', # Specifying the X-axis
    y='Count', #Specifying the Y-axis
    title='Most Useful Contents', 
    height=900, 
    width=700
)

fig.show()


### Lectures File Overview

In [13]:
lectures = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')
lectures

Unnamed: 0,lecture_id,tag,part,type_of
0,89,159,5,concept
1,100,70,1,concept
2,185,45,6,concept
3,192,79,5,solving question
4,317,156,5,solving question
...,...,...,...,...
413,32535,8,5,solving question
414,32570,113,3,solving question
415,32604,24,6,concept
416,32625,142,2,concept


In [14]:
# Different type of Lectures : Concept or Solving Question
type_lec = lectures['type_of'].value_counts().reset_index()
type_lec.columns = ['Type','Count']

fig = px.bar(type_lec,x='Type',y='Count',title='Relative Type of Lectures',color=['Concept','Solving Question','Intention','Starter'])
fig.show()

In [15]:
#freeing up some memory
del train
del questions
del lectures

# Feature Extraction and Engineering

The **Most Important** process for Data Modeling is **Feature Extraction and Engineering**. As we all know, Machine Learning depends mostly on Data and its features, we are done with data analytics and now will be moving on to Feature creation and extraction. In this Section, we will be going to explore various feature creation and processing techniques.

First, we going to explore the **example_test.csv** file, through which we have have to create our feature pipeline later for Model preprocessing.
**As per the Data Description**:  Three sample groups of the test set data as it will be delivered by the time-series API. The format is largely the same as train.csv. There are two different columns that mirror what information the AI tutor actually has available at any given time, but with the user interactions grouped together for the sake of API performance rather than strictly showing information for a single user at a time. Some users will appear in the hidden test set that have NOT been presented in the train set, emulating the challenge of quickly adapting to modeling new arrivals to a website.

* **Steps**:
    1. Examining the Test Data Sample.     
    2. Creating a New Train Dataset for Model with important Features.
    3. Filter Train Data for Questions and Remove rows for Lecture based.
    4. Grouping and Dividing Data on Basis of User ID and Content ID.
    5. Creating New Features on Both DataSets.
    6. Merging Two DataSets and New and Insightful Data Formation and Rearranging Columns (Features and Target)
    7. Data Wrangling (Missing Values Treatment)

### Step-1: Examining the Test Data Sample

In [16]:
example_test = pd.read_csv('../input/riiid-test-answer-prediction/example_test.csv')
example_test
#Example Test Set is Similar to Train DataSet, unlike two more columns prior_group_answers_correct, prior_group_responses

Unnamed: 0,row_id,group_num,timestamp,user_id,content_id,content_type_id,task_container_id,prior_question_elapsed_time,prior_question_had_explanation,prior_group_answers_correct,prior_group_responses
0,0,0,0,275030867,5729,0,0,,,[],[]
1,1,0,13309898705,554169193,12010,0,4427,19000.0,True,,
2,2,0,4213672059,1720860329,457,0,240,17000.0,True,,
3,3,0,62798072960,288641214,13262,0,266,23000.0,True,,
4,4,0,10585422061,1728340777,6119,0,162,72400.0,True,,
...,...,...,...,...,...,...,...,...,...,...,...
99,104,3,13167339284,1900527744,3004,0,1179,24667.0,True,,
100,105,3,13167339284,1900527744,3003,0,1179,24667.0,True,,
101,106,3,64497673060,7792299,7908,0,676,19000.0,True,,
102,107,3,62798166743,288641214,9077,0,269,25000.0,True,,


### Step-2: Creating a New Train Dataset for Model with important Features.

**Now will take Large Dataset into handle using pickle file of Training DataSet Uploaded with this Notebook**

In [17]:
# Determing Features and their DataType for Train Data
used_features_dict = {
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'answered_correctly': 'int8', # This will be the Target Feature. Here we have been doing classification problem between 0 and 1
    'prior_question_elapsed_time': 'float16',
    'prior_question_had_explanation': 'boolean'
}

# Creating New DataFrame for Training Purpose
train = pd.read_pickle('../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip')
train_df = train[:5000000]
del train
print(f'DataFrame Shape: {train_df.shape}')

DataFrame Shape: (5000000, 10)


### Step-3: Filter Train Data for Questions and Remove rows for Lecture based.

As per the Data Description, any value in answered_correctly column and many other columns equal to -1 will be corressponding to Lectures

In [18]:
# We will be given condition for filtering out
train_df = train_df[used_features_dict.keys()]
train_questions_only_df = train_df[train_df['answered_correctly']!=-1]
train_questions_only_df.head()

Unnamed: 0,timestamp,user_id,content_id,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,115,5692,1,,
1,56943,115,5716,1,37000.0,False
2,118363,115,128,1,55000.0,False
3,131167,115,7860,1,19000.0,False
4,137965,115,7922,1,11000.0,False


### Step-4: Grouping and Dividing Data on Basis of User ID and Content ID.

In [19]:
user_df = train_questions_only_df.groupby('user_id') #Group By Function will group by user-id
content_df = train_questions_only_df.groupby('content_id')

### Step-5: Creating New Features on Both DataSets.

Now we have to engineer the features. What we can think of Making Statistical Features like Mean, Median, Standard Deviation and Skewness
First We create features and then merge both of them main dataframe using UserID and ContentID like we use to do in SQL Inner Join.In the next step we will be Using **Left Join** concept

In [20]:
user_answers_df = user_df.agg({'answered_correctly': ['mean', 'count', 'std', 'skew']}).copy() #Using AGG feature means Aggregate Functions like in SQL.
user_answers_df.columns = ['mean_user_accuracy', 'questions_answered', 'std_user_accuracy', 'skew_user_accuracy']

user_answers_df

Unnamed: 0_level_0,mean_user_accuracy,questions_answered,std_user_accuracy,skew_user_accuracy
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
115,0.695652,46,0.465215,-0.879359
124,0.233333,30,0.430183,1.328338
2746,0.578947,19,0.507257,-0.347892
5382,0.672000,125,0.471374,-0.741648
8623,0.642202,109,0.481566,-0.601619
...,...,...,...,...
106742797,0.742857,35,0.443440,-1.161718
106745694,0.266667,30,0.449776,1.111663
106747159,0.748363,3513,0.434015,-1.145142
106747466,0.552469,324,0.498008,-0.212025


In [21]:
del user_df

In [22]:
content_answers_df = content_df.agg({'answered_correctly': ['mean', 'count', 'std', 'skew'] }).copy()
content_answers_df.columns = ['mean_accuracy', 'question_asked', 'std_accuracy', 'skew_accuracy']

content_answers_df

Unnamed: 0_level_0,mean_accuracy,question_asked,std_accuracy,skew_accuracy
content_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.892857,364,0.309721,-2.550865
1,0.908854,384,0.288192,-2.852230
2,0.549070,2313,0.497694,-0.197362
3,0.778066,1158,0.415727,-1.340047
4,0.627488,1608,0.483624,-0.527874
...,...,...,...,...
13518,0.861111,36,0.350736,-2.180288
13519,0.613636,44,0.492545,-0.483398
13520,0.733333,45,0.447214,-1.092033
13521,0.723404,47,0.452151,-1.032104


In [23]:
# We can delete old dataframes to free up memory. It is a Good Practice.
del content_df

### Step-6: Merging Two DataSets and New and Insightful Data Formation and rearranging columns.

In [24]:
train_df = train_df.merge(user_answers_df, how='left', on='user_id')
train_df = train_df.merge(content_answers_df, how='left', on='content_id')
train_df

Unnamed: 0,timestamp,user_id,content_id,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation,mean_user_accuracy,questions_answered,std_user_accuracy,skew_user_accuracy,mean_accuracy,question_asked,std_accuracy,skew_accuracy
0,0,115,5692,1,,,0.695652,46,0.465215,-0.879359,0.727740,1752.0,0.445250,-1.024143
1,56943,115,5716,1,37000.0,False,0.695652,46,0.465215,-0.879359,0.743266,1188.0,0.437015,-1.115184
2,118363,115,128,1,55000.0,False,0.695652,46,0.465215,-0.879359,0.973442,979.0,0.160869,-5.898109
3,131167,115,7860,1,19000.0,False,0.695652,46,0.465215,-0.879359,0.959854,1096.0,0.196391,-4.691604
4,137965,115,7922,1,11000.0,False,0.695652,46,0.465215,-0.879359,0.953878,954.0,0.209858,-4.334655
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4999995,4151535711,106763796,3610,1,21000.0,True,0.659091,308,0.474786,-0.674538,0.605195,1155.0,0.489021,-0.430973
4999996,4151558505,106763796,8907,1,19000.0,True,0.659091,308,0.474786,-0.674538,0.687135,342.0,0.464339,-0.810766
4999997,4151583786,106763796,4039,1,12000.0,True,0.659091,308,0.474786,-0.674538,0.562341,1572.0,0.496256,-0.251565
4999998,4231505166,106763796,996,1,13000.0,True,0.659091,308,0.474786,-0.674538,0.693431,2603.0,0.461158,-0.839537


In [25]:
features = ['user_id', 'content_id',
       'prior_question_elapsed_time', 'prior_question_had_explanation',
       'mean_user_accuracy', 'questions_answered', 'std_user_accuracy',
        'skew_user_accuracy', 'mean_accuracy',
       'question_asked', 'std_accuracy',  'skew_accuracy']
target = ['answered_correctly']

from sklearn.preprocessing import LabelEncoder
train_df = train_df[features + target]

train_df.head()

Unnamed: 0,user_id,content_id,prior_question_elapsed_time,prior_question_had_explanation,mean_user_accuracy,questions_answered,std_user_accuracy,skew_user_accuracy,mean_accuracy,question_asked,std_accuracy,skew_accuracy,answered_correctly
0,115,5692,,,0.695652,46,0.465215,-0.879359,0.72774,1752.0,0.44525,-1.024143,1
1,115,5716,37000.0,False,0.695652,46,0.465215,-0.879359,0.743266,1188.0,0.437015,-1.115184,1
2,115,128,55000.0,False,0.695652,46,0.465215,-0.879359,0.973442,979.0,0.160869,-5.898109,1
3,115,7860,19000.0,False,0.695652,46,0.465215,-0.879359,0.959854,1096.0,0.196391,-4.691604,1
4,115,7922,11000.0,False,0.695652,46,0.465215,-0.879359,0.953878,954.0,0.209858,-4.334655,1


In [26]:
print(train_df.isnull().sum())

user_id                                0
content_id                             0
prior_question_elapsed_time       117345
prior_question_had_explanation     19479
mean_user_accuracy                     0
questions_answered                     0
std_user_accuracy                      5
skew_user_accuracy                    15
mean_accuracy                      59589
question_asked                     59589
std_accuracy                       59645
skew_accuracy                      59801
answered_correctly                     0
dtype: int64


### Step-7: Data Wrangling (Missing Values Treatment)

In [27]:
train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].fillna(value = False).astype(bool)
train_df = train_df.fillna(value = 0.5) #Filling Values with 0.5



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [28]:
train_df = train_df.replace([np.inf, -np.inf], np.nan)
train_df = train_df.fillna(0.5)
train_df = train_df[train_df['answered_correctly'] != -1]

In [29]:
lb = LabelEncoder()
train_df['prior_question_had_explanation'] = lb.fit_transform(train_df['prior_question_had_explanation'])

In [30]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = train_df[features]
Y = train_df[target]
X = sc.fit_transform(X)

# Modelling

Modelling is a part of the pipeline which actually learns the rule of the data they follow and make prediction according to it. The very first and crucial step in Modelling is **Splitting the Data into Train and Test Sets**. Then we will use hyperparameter tuning for our Model. There are many Classification Model available. We will going to use XGBoost and LGBMClassifier.

In [31]:
# Libraries Importation for Classification

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import optuna
from optuna.samplers import TPESampler

### Train and Test Dataset Spliting

In [32]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y, random_state=666, test_size=0.2,shuffle=True) # Going to split in 80% and 20% part


In [33]:
X_lstm_train = X_train
X_cnn_train = X_train
X_lstm_test = X_test
X_cnn_test  =X_test

In [34]:
Y_lstm_train = Y_train
Y_cnn_train = Y_train
Y_lstm_test = Y_test
Y_cnn_test  =Y_test

## DEEP LEARNING 

In [35]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout,Conv1D,Flatten,LSTM,Embedding
import tensorflow as tf

X_lstm_train = X_lstm_train.reshape(X_lstm_train.shape[0],1, X_lstm_train.shape[1])
X_lstm_test = X_lstm_test.reshape(X_lstm_test.shape[0],1, X_lstm_test.shape[1])
print(X_lstm_train.shape)
def create_lstm_model():
    model=Sequential()
    model.add(LSTM(50,input_shape=(1,X_lstm_train.shape[2]),return_sequences=True))
    model.add(LSTM(150))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Flatten())
    model.add(Dense(256,activation='relu'))
    model.add(Dense(256,activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])
    print(model.summary())
    return model

(3921707, 1, 12)


In [36]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout,Conv1D,Flatten
import tensorflow as tf


X_cnn_train = X_cnn_train.reshape(X_cnn_train.shape[0], X_cnn_train.shape[1], 1)
X_cnn_test = X_cnn_test.reshape(X_cnn_test.shape[0], X_cnn_test.shape[1], 1)
def create_model():
    model=Sequential()
    model.add(Conv1D(64, 2, activation='relu', input_shape=X_cnn_train.shape[0]))
    model.add(Conv1D(128, 2, activation='relu'))

    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(256,activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])
    return model

In [37]:


from keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_auc_1', mode='max',patience=5)



In [38]:
'''model = create_lstm_model()
history = model.fit(X_lstm_train,Y_lstm_train, epochs=150, verbose=1, validation_data=(X_lstm_test,Y_lstm_test),batch_size=65536,callbacks=[es])
model.save('model.h5')'''

"model = create_lstm_model()\nhistory = model.fit(X_lstm_train,Y_lstm_train, epochs=150, verbose=1, validation_data=(X_lstm_test,Y_lstm_test),batch_size=65536,callbacks=[es])\nmodel.save('model.h5')"

# Test Results Generation

In [39]:
from keras.models import load_model
model = load_model('../input/modelforriiid/model(3).h5')

In [40]:
import riiideducation

#Creating Environment
env = riiideducation.make_env()

iter_test = env.iter_test()

for (test_df, sample_prediction_df) in iter_test:
    test_df = test_df.merge(user_answers_df, how = 'left', on = 'user_id')
    test_df = test_df.merge(content_answers_df, how = 'left', on = 'content_id')
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(value = False).astype(bool)
    test_df = test_df.replace([np.inf, -np.inf], np.nan)
    test_df.fillna(value = 0.5, inplace = True)
    lb = LabelEncoder()
    test_df['prior_question_had_explanation'] = lb.fit_transform(test_df['prior_question_had_explanation'])
    '''test_df['prior_question_elapsed_time'] = test_df['prior_question_elapsed_time'].abs()
    test_df.prior_question_elapsed_time.replace(0,0.5,inplace=True)
    test_df['prior_question_elapsed_time'] = stats.boxcox(test_df['prior_question_elapsed_time'])[0]'''
    sc = StandardScaler()
    X_test = test_df[features]
    X_test = sc.fit_transform(X_test)
    X_test = X_test.reshape(X_test.shape[0],1,X_test.shape[1])
    test_df['answered_correctly'] = model.predict_proba(X_test)
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])


# WORK IN PROGRESS

In [41]:
df = pd.read_csv('./submission.csv')
df

Unnamed: 0,row_id,answered_correctly
0,0,0.111688
1,1,0.774972
2,2,0.858686
3,3,0.859270
4,4,0.260166
...,...,...
99,104,0.344121
100,105,0.651319
101,106,0.932898
102,107,0.567374
