# Riiid! Answer Correctness Prediction
## Track knowledge states of 1M+ students in the wild

In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions. You will pair your machine learning skills with Riiid’s EdNet data.

**Note: Sections of the code are intended to be run separately, so there are multiple data-loading cells at the appropriate breakpoints.**

<a id="top"></a>
# Table of contents

*  [Load Libraries and Data](#1)
*  [EDA using questions - train.csv](#2)
*  [EDA using questions - questions.csv](#3)
*  [EDA using questions - lectures.csv](#4)
*  [Model Building - Data Split, Preparation and Hyperparameter tuning](#5)
*  [Model Building - Training with full data](#6)

<a id="1"></a>
# Load Libraries and Data

In [None]:
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
!pip install seaborn --upgrade
import seaborn as sns
import random
import math
from sklearn.model_selection import train_test_split

SEED = 299458792

## train.csv

row_id: (int64) ID code for the row.

timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

user_id: (int32) ID code for the user.

content_id: (int16) ID code for the user interaction

content_type_id: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

task_container_id: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

user_answer: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

answered_correctly: (int8) if the user responded correctly. Read -1 as null, for lectures.

prior_question_elapsed_time: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.

prior_question_had_explanation: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback

In [None]:
train_df = pd.read_csv('../input/riiid-test-answer-prediction/train.csv', low_memory=False, nrows=10**5, 
                       dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
                              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
                      )

#train_df = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()

In [None]:
print("Train size:", train_df.shape)

In [None]:
train_df.head()

## questions.csv

questions.csv: metadata for the questions posed to users.

question_id: foreign key for the train/test content_id column, when the content type is question (0).

bundle_id: code for which questions are served together.

correct_answer: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

part: top level category code for the question.

tags: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
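The relationship between `user_answer`, `correct_answer`, and `answered_correctly` described above can be sketched on toy data. The two small frames below are made-up stand-ins for train.csv and questions.csv, and `derived_correct` is a hypothetical column name used only for this illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv and questions.csv (values are made up for illustration)
toy_train = pd.DataFrame({
    "content_id": [10, 11, 10, 12],
    "content_type_id": [0, 0, 0, 0],
    "user_answer": [1, 3, 0, 2],
})
toy_questions = pd.DataFrame({
    "question_id": [10, 11, 12],
    "correct_answer": [1, 2, 2],
})

# Attach the answer key to each question interaction (question_id is the foreign key for content_id)
merged = toy_train.merge(toy_questions, how="left",
                         left_on="content_id", right_on="question_id")

# A response counts as correct when the chosen option matches the key
merged["derived_correct"] = (merged["user_answer"] == merged["correct_answer"]).astype("int8")
print(merged[["content_id", "user_answer", "correct_answer", "derived_correct"]])
```

On the real data the same comparison should only be applied to rows where `content_type_id` is 0, since lecture rows have no answer.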

In [None]:
questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
questions['tags'] = questions['tags'].astype('category') 
questions[['question_id', 'bundle_id','correct_answer','part']] = questions[['question_id', 'bundle_id','correct_answer','part']].apply(pd.to_numeric, downcast='unsigned')

In [None]:
print("Questions:", questions.shape)

In [None]:
questions.head()

## lectures.csv

lectures.csv: metadata for the lectures watched by users as they progress in their education.

lecture_id: foreign key for the train/test content_id column, when the content type is lecture (1).

part: top level category code for the lecture.

tag: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

type_of: brief description of the core purpose of the lecture

In [None]:
lectures = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')
lectures['type_of'] = lectures['type_of'].astype('category') 
lectures[['lecture_id', 'tag','part']] = lectures[['lecture_id', 'tag','part']].apply(pd.to_numeric, downcast='unsigned')

In [None]:
print("Lectures:", lectures.shape)

In [None]:
lectures.head()

<a id="2"></a>
# EDA using questions - train.csv

In [None]:
train_df.describe()

In [None]:
train_df.isnull().sum()

In [None]:
print(train_df.dtypes)

In [None]:
for col in train_df.columns:
  print('Number of Unique variables in ' + col+':',len(train_df[col].unique()))

## Let's ask some simple questions about students who have answered at least 100 questions
1. Who are the top 5 students?
2. Who are the bottom 5 students?
3. Who are the 5 fastest students to answer questions?
4. Who are the 5 slowest students to answer questions? What metric was used? Does the distribution have an effect on the answer?
5. What is the performance distribution of the class based on answer correctness?
6. What is the performance distribution of the class based on answer speed?


In [None]:
train_df.head()

In [None]:
# Lecture rows carry answered_correctly == -1, so the accuracy statistics count question rows only
train_user_grouped=train_df.groupby(['user_id']).apply(lambda x: pd.Series({
      'total_answered_correctly'       : (x['answered_correctly']==1).sum(),
      'total_answered'       : (x['content_type_id']==0).sum(),
      'fraction_answered_correctly'      : (x['answered_correctly']==1).sum()/(x['content_type_id']==0).sum(),
      'total_lectures' : x['content_type_id'].sum(),
      'median_time_to_answer' : x['prior_question_elapsed_time'].median(),
      'std_time_to_answer' : x['prior_question_elapsed_time'].std()}))

In [None]:
train_user_grouped=train_user_grouped.loc[train_user_grouped.loc[:,'total_answered']>=100,:]

### 1. Who are the top 5 students?

In [None]:
train_user_grouped.sort_values(by=['fraction_answered_correctly'],ascending=False).head(5)

### 2. Who are the bottom 5 students?

In [None]:
train_user_grouped.sort_values(by=['fraction_answered_correctly'],ascending=True).head(5)

### 3. Who are the 5 fastest students to answer questions?

In [None]:
train_user_grouped.sort_values(by=['median_time_to_answer'],ascending=True).head(5)

### 4. Who are the slowest 5 students to answer questions? What metric was used? Does the distribution have an effect on the answer?

In [None]:
train_user_grouped.sort_values(by=['median_time_to_answer'],ascending=False).head(5)

In [None]:
def boxplot_sorted(df, by, column, ax, rot=90):
    # use dict comprehension to create new dataframe from the iterable groupby object
    # each group name becomes a column in the new dataframe
    df2 = pd.DataFrame({col:vals[column] for col, vals in df.groupby(by)})
    # find and sort the median values in this new dataframe
    meds = df2.median().sort_values()
    # use the columns in the dataframe, ordered sorted by median value
    # return axes so changes can be made outside the function
    return df2[meds.index].boxplot(rot=rot,ax=ax,return_type="axes")

fig, ax = plt.subplots(figsize=(50,8))
boxplot_sorted(train_df, by = ['user_id'], column = 'prior_question_elapsed_time', ax=ax)
plt.xlabel('user_id')
plt.ylabel('prior_question_elapsed_time')
plt.title('Boxplots')
plt.suptitle('')
plt.show()

Yes, the distribution does have a significant effect on the answer because the standard deviation is comparable to the median across all the students.
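A minimal sketch of why the spread matters when ranking students by speed. The two users and their timings below are made up, and `std_over_median` is a hypothetical helper column. When one slow outlier inflates a user's times, the median and the mean can rank the same two students in opposite order.

```python
import pandas as pd

# Made-up elapsed times in milliseconds: user 1 is steady, user 2 is quick with one large outlier
toy = pd.DataFrame({
    "user_id": [1] * 5 + [2] * 5,
    "prior_question_elapsed_time": [18000, 19000, 20000, 21000, 22000,
                                    10000, 11000, 12000, 13000, 90000],
})

summary = toy.groupby("user_id")["prior_question_elapsed_time"].agg(["median", "mean", "std"])
# Ratio of spread to typical value; values near or above 1 mean the summary statistic is fragile
summary["std_over_median"] = summary["std"] / summary["median"]

print(summary)
print("fastest by median:", summary["median"].idxmin())
print("fastest by mean:  ", summary["mean"].idxmin())
```

The median is the more robust choice here, which is why it was used for the ranking above.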

### 5. What is the performance distribution of the class based on answer correctness?

In [None]:
sns.histplot(train_user_grouped['fraction_answered_correctly'])

### 6. What is the performance distribution of the class based on answer speed?

In [None]:
sns.histplot(train_user_grouped['median_time_to_answer'])

<a id="3"></a>
# EDA using questions - questions.csv

In [None]:
questions.describe()

In [None]:
questions.isnull().sum()

In [None]:
print(questions.dtypes)

In [None]:
for col in questions.columns:
  print('Number of Unique variables in ' + col+':',len(questions[col].unique()))

In [None]:
questions.tail(100)

In [None]:
sns.pairplot(questions.iloc[:,:-2])

## Let's ask some simple questions about the questions.csv dataset

1. What is the distribution of questions per bundle?
2. What is the distribution of questions per part? Which part has the most questions? Which part has the fewest?
3. What is the distribution of questions per tag? Which tag has the most questions?


### 1.  What is the distribution of questions per bundle?

In [None]:
questions_bundle_grouped=questions.groupby(['bundle_id']).apply(lambda x: pd.Series({
      'total_questions'       : x['question_id'].count()}))

In [None]:
sns.histplot(questions_bundle_grouped['total_questions'])

### 2. What is the distribution of questions per part? Which part has the most questions? Which part has the fewest?

In [None]:
questions_part_grouped=questions.groupby(['part']).apply(lambda x: pd.Series({
      'total_questions'       : x['question_id'].count()}))

In [None]:
questions_part_grouped

### 3. What is the distribution of questions per tag?

In [None]:
questions_tags_grouped=questions.groupby(['tags']).apply(lambda x: pd.Series({
      'total_questions'       : x['question_id'].count()}))

In [None]:
sns.histplot(questions_tags_grouped['total_questions'])

In [None]:
questions_tags_grouped.sort_values(by=['total_questions'],ascending=False).head(15)

<a id="4"></a>
# EDA using questions - lectures.csv

In [None]:
lectures.describe()

In [None]:
lectures.isnull().sum()

In [None]:
print(lectures.dtypes)

In [None]:
for col in lectures.columns:
  print('Number of Unique variables in ' + col+':',len(lectures[col].unique()))

## Let's ask some simple questions about the lectures.csv dataset

1. What is the distribution of lectures per tag?
2. What is the distribution of lectures per part? Which part has the most lectures? Which part has the fewest?
3. What is the distribution of lectures per type_of? Which type has the most lectures?


### 1. What is the distribution of lectures per tag? 


In [None]:
lectures_tag_grouped=lectures.groupby(['tag']).apply(lambda x: pd.Series({
      'total_lectures'       : x['lecture_id'].count()}))

In [None]:
sns.histplot(lectures_tag_grouped['total_lectures'])

In [None]:
lectures_tag_grouped.sort_values(by=['total_lectures'],ascending=False).head(15)

### 2. What is the distribution of lectures per part? Which part has the most lectures? Which part has the fewest?

In [None]:
lectures_part_grouped=lectures.groupby(['part']).apply(lambda x: pd.Series({
      'total_lectures'       : x['lecture_id'].count()}))

In [None]:
lectures_part_grouped

### 3. What is the distribution of lectures per type_of? Which type has the most lectures?


In [None]:
lectures_type_grouped=lectures.groupby(['type_of']).apply(lambda x: pd.Series({
      'total_lectures'       : x['lecture_id'].count()}))

In [None]:
sns.histplot(lectures_type_grouped['total_lectures'])

In [None]:
lectures_type_grouped.sort_values(by=['total_lectures'],ascending=False).head(15)

<a id="5"></a>
# Model Building - Data Split, Preparation and Hyperparameter tuning

## Load dataset for Model building

row_id: (int64) ID code for the row.

timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

user_id: (int32) ID code for the user.

content_id: (int16) ID code for the user interaction

content_type_id: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

task_container_id: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

user_answer: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

answered_correctly: (int8) if the user responded correctly. Read -1 as null, for lectures.

prior_question_elapsed_time: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.

prior_question_had_explanation: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback

In [None]:
#columns for analysis, only select columns needed to prevent out of memory
cols_2_read = ['timestamp','content_id','user_id','content_type_id','answered_correctly','prior_question_elapsed_time','prior_question_had_explanation']#,'row_id','task_container_id','user_answer']

#column type to minimize memory usage
all_dtype_dic = {'row_id': 'uint64', 'timestamp': 'uint64', 'user_id': 'uint32', 'content_id': 'uint16', 'content_type_id': 'uint8',
                              'task_container_id': 'uint16', 'user_answer': pd.CategoricalDtype([-1,0,1,2,3]), 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
#dictionary to look up column type based on column selected
cols_2_read_dtype_dic = {k: all_dtype_dic[k] for k in all_dtype_dic.keys() & cols_2_read}
#read dataset in chunks to prevent out-of-memory, then concatenate the chunks once
chunks = pd.read_csv('../input/riiid-test-answer-prediction/train.csv', low_memory=False,nrows=10**7,
                       dtype=cols_2_read_dtype_dic,usecols = cols_2_read,chunksize=10**5 
                      )
train_df = pd.concat(chunks, ignore_index=True)

The above code allows you to add only the columns needed for analysis.
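The column-selection and chunked-reading pattern can be tried without the competition files. The sketch below reads a tiny in-memory CSV whose rows are made up, and the names `csv_text`, `wanted`, and `toy_df` are introduced only for this illustration.

```python
import io
import pandas as pd

# A tiny made-up CSV with the same kind of layout as train.csv
csv_text = (
    "row_id,timestamp,user_id,content_id,answered_correctly\n"
    "0,0,115,5692,1\n"
    "1,56943,115,5716,1\n"
    "2,118363,115,128,0\n"
    "3,0,124,7900,1\n"
    "4,32683,124,7876,0\n"
)

# Only these columns are parsed; the others are skipped at read time
wanted = ["user_id", "content_id", "answered_correctly"]
# Narrow dtypes keep the in-memory footprint small
dtypes = {"user_id": "uint32", "content_id": "uint16", "answered_correctly": "int8"}

# chunksize turns read_csv into an iterator of small DataFrames
chunks = pd.read_csv(io.StringIO(csv_text), usecols=wanted, dtype=dtypes, chunksize=2)
toy_df = pd.concat(chunks, ignore_index=True)

print(toy_df.dtypes)
print(toy_df.memory_usage(deep=True).sum(), "bytes")
```

Passing `usecols` and `dtype` together means the unused columns are never materialized and the kept ones are stored at the requested width.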

In [None]:
train_df['user_id'].describe()

In [None]:
train_df.describe()

## Data split

**This time we are going to split the data into 3 parts: train (70%), tune (15%), and test (15%). The tune set (named `validation_df` in the code) is used to perform model selection and feature selection (more in later lessons).**

In [None]:
# data split to have 70% of train, 30% of tune + test
train_df, validation_test_df = train_test_split(train_df,
                                          test_size=0.3,
                                          random_state=SEED,
                                          shuffle=True,
                                          stratify=None)


In [None]:
# further evenly split between tune and test
validation_df, test_df = train_test_split(validation_test_df,
                                    test_size=0.5,
                                    random_state=SEED,
                                    shuffle=True,
                                    stratify=None)

In [None]:
#reset index of created dataframes
train_df=train_df.reset_index(drop=True)
test_df=test_df.reset_index(drop=True)
validation_df=validation_df.reset_index(drop=True)

In [None]:
#check shape
print(train_df.shape)
print(validation_df.shape)
print(test_df.shape)
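As a quick sanity check of the two-stage split, the sketch below applies the same `test_size` values to a made-up frame of 1,000 rows and confirms that the pieces come out at 70%, 15%, and 15% with no overlap. The frame `toy` and the seed value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame of 1,000 rows standing in for the interaction table
toy = pd.DataFrame({"x": np.arange(1000)})

# Stage 1: hold out 30% of the rows
train_part, rest = train_test_split(toy, test_size=0.3, random_state=0, shuffle=True)
# Stage 2: split the held-out rows evenly into tune and test
tune_part, test_part = train_test_split(rest, test_size=0.5, random_state=0, shuffle=True)

print(len(train_part), len(tune_part), len(test_part))

# The three pieces should not share any rows
overlap = (set(train_part["x"]) & set(tune_part["x"])) | (set(tune_part["x"]) & set(test_part["x"]))
print("overlapping rows:", len(overlap))
```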

## Perform a left join of train with questions and lectures

In [None]:
questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
questions['tags'] = questions['tags'].astype('category') 
questions[['question_id', 'bundle_id','correct_answer','part']] = questions[['question_id', 'bundle_id','correct_answer','part']].apply(pd.to_numeric, downcast='unsigned')

lectures = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')
lectures['type_of'] = lectures['type_of'].astype('category') 
lectures[['lecture_id', 'tag','part']] = lectures[['lecture_id', 'tag','part']].apply(pd.to_numeric, downcast='unsigned')

In [None]:
def concat_questions_lectures(train_df,questions,lectures):
   ##questions
   train_questions = train_df.loc[train_df.loc[:,'content_type_id']==0,:]
   train_questions=train_questions.set_index('content_id')
   questions = questions.rename(columns={"question_id": "content_id",'tags':'question_tags'})
   questions = questions.set_index('content_id') 
   train_questions=train_questions.join(questions, how='left')
   #lectures
   train_lectures = train_df.loc[train_df.loc[:,'content_type_id']==1,:]
   train_lectures=train_lectures.set_index('content_id')
   lectures = lectures.rename(columns={"lecture_id": "content_id",'tag':'lecture_tag'})
   lectures = lectures.set_index('content_id') 
   train_lectures=train_lectures.join(lectures, how='left')
   #lectures and questions
   train_questions_lectures= pd.concat([train_questions,train_lectures], axis=0)
   train_questions_lectures.reset_index(inplace=True)
   train_questions_lectures = train_questions_lectures.rename(columns = {'index':'content_id'})
   return train_questions_lectures

train_df = concat_questions_lectures(train_df,questions,lectures)
validation_df = concat_questions_lectures(validation_df,questions,lectures)
test_df = concat_questions_lectures(test_df,questions,lectures)

## Convert all to Numeric

In [None]:
train_df.isnull().sum()

In [None]:
train_df.dtypes

In [None]:
train_df[['type_of','question_tags','lecture_tag']] = train_df[['type_of','question_tags','lecture_tag']].astype('category') 
train_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']] = train_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']].apply(pd.to_numeric, downcast='unsigned')

validation_df[['type_of','question_tags','lecture_tag']] = validation_df[['type_of','question_tags','lecture_tag']].astype('category') 
validation_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']] = validation_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']].apply(pd.to_numeric, downcast='unsigned')

test_df[['type_of','question_tags','lecture_tag']] = test_df[['type_of','question_tags','lecture_tag']].astype('category') 
test_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']] = test_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']].apply(pd.to_numeric, downcast='unsigned')

## Fill NaN values

In [None]:
##fill prior_question_elapsed_time
train_df['prior_question_elapsed_time'] = train_df['prior_question_elapsed_time'].fillna(0)
validation_df['prior_question_elapsed_time'] = validation_df['prior_question_elapsed_time'].fillna(0)
test_df['prior_question_elapsed_time'] = test_df['prior_question_elapsed_time'].fillna(0)

##prior_question_had_explanation
train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].fillna(False)
validation_df['prior_question_had_explanation'] = validation_df['prior_question_had_explanation'].fillna(False)
test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(False)

##fill question-only columns on rows that were lectures
train_df['question_tags']=train_df['question_tags'].cat.add_categories(-1)
train_df[['bundle_id','correct_answer','question_tags']] = train_df[['bundle_id','correct_answer','question_tags']].fillna(-1)
validation_df['question_tags']=validation_df['question_tags'].cat.add_categories(-1)
validation_df[['bundle_id','correct_answer','question_tags']] = validation_df[['bundle_id','correct_answer','question_tags']].fillna(-1)
test_df['question_tags']=test_df['question_tags'].cat.add_categories(-1)
test_df[['bundle_id','correct_answer','question_tags']] = test_df[['bundle_id','correct_answer','question_tags']].fillna(-1)

##fill lecture-only columns on rows that were questions
train_df['lecture_tag']=train_df['lecture_tag'].cat.add_categories(-1)
train_df[['lecture_tag']] = train_df[['lecture_tag']].fillna(-1)
validation_df['lecture_tag']=validation_df['lecture_tag'].cat.add_categories(-1)
validation_df[['lecture_tag']] = validation_df[['lecture_tag']].fillna(-1)
test_df['lecture_tag']=test_df['lecture_tag'].cat.add_categories(-1)
test_df[['lecture_tag']] = test_df[['lecture_tag']].fillna(-1)

##fill type_of as 'question' for rows that were questions
train_df['type_of']=train_df['type_of'].cat.add_categories('question')
train_df[['type_of']] = train_df[['type_of']].fillna('question')
validation_df['type_of']=validation_df['type_of'].cat.add_categories('question')
validation_df[['type_of']] = validation_df[['type_of']].fillna('question')
test_df['type_of']=test_df['type_of'].cat.add_categories('question')
test_df[['type_of']] = test_df[['type_of']].fillna('question')
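The cells above always call `cat.add_categories` before `fillna` on categorical columns. A minimal sketch of why, using a made-up three-element series: a categorical column only accepts fill values that are already registered as categories.

```python
import pandas as pd

# Toy categorical column with a gap, standing in for a lecture-only field on a question row
toy = pd.Series(["concept", None, "intention"], dtype="category")

# Filling with a label that is not yet a category is rejected by pandas
try:
    toy.fillna("question")
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Registering the label first makes the fill valid
filled = toy.cat.add_categories("question").fillna("question")

print("fill without add_categories rejected:", rejected)
print(filled.tolist())
```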

In [None]:
train_df.isnull().sum()

In [None]:
train_df.dtypes

## Feature Engineering

### Calculate Lag (difference in time between two consecutive interactions for each user)

In [None]:
train_df['user_interaction_lag'] = train_df.sort_values(['user_id','timestamp']).groupby('user_id')['timestamp'].diff()
validation_df['user_interaction_lag'] = validation_df.sort_values(['user_id','timestamp']).groupby('user_id')['timestamp'].diff()
test_df['user_interaction_lag'] = test_df.sort_values(['user_id','timestamp']).groupby('user_id')['timestamp'].diff()
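A small worked example of the lag computation, using made-up timestamps for two users. Although the rows are sorted before the per-user `diff`, pandas aligns the result back to the original row index on assignment, so the frame itself does not need to be reordered.

```python
import pandas as pd

# Made-up interactions, deliberately not sorted by user or time
toy = pd.DataFrame({
    "user_id":   [2, 1, 1,    2,    1],
    "timestamp": [0, 0, 5000, 7000, 12000],
})

# Sort within user, take consecutive differences, and let index alignment place the values
toy["user_interaction_lag"] = (
    toy.sort_values(["user_id", "timestamp"])
       .groupby("user_id")["timestamp"]
       .diff()
)

print(toy)
```

Each user's first interaction has no predecessor, so its lag is NaN and has to be handled later along with the other missing values.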

### User statistics

In [None]:
def user_answer_stats(df):
  questions_only_df = df[df['answered_correctly']!=-1]
  questions_only_df = questions_only_df[questions_only_df['user_id'].notna()]
  grouped_by_user_df = questions_only_df.groupby('user_id')
  user_answers_df = grouped_by_user_df.agg({'answered_correctly': ['mean', 'count', 'std', 'median', 'skew']}).copy()
  user_answers_df.columns = ['mean_user_accuracy', 'questions_answered', 'std_user_accuracy', 'median_user_accuracy', 'skew_user_accuracy']
  user_answers_df.reset_index(inplace=True)
  user_answers_df = user_answers_df.rename(columns = {'index':'user_id'})
  return user_answers_df

train_df = train_df.merge(user_answer_stats(train_df), how='left', on='user_id')
validation_df = validation_df.merge(user_answer_stats(validation_df), how='left', on='user_id')
test_df = test_df.merge(user_answer_stats(test_df), how='left', on='user_id')

### Content Statistics

In [None]:
def content_answer_stats(df):
  questions_only_df = df[df['answered_correctly']!=-1]
  questions_only_df = questions_only_df[questions_only_df['content_id'].notna()]
  grouped_by_content_df = questions_only_df.groupby('content_id')
  content_answers_df = grouped_by_content_df.agg({'answered_correctly': ['mean', 'count', 'std', 'median', 'skew']}).copy()
  content_answers_df.columns = ['mean_accuracy', 'questions_asked', 'std_accuracy', 'median_accuracy', 'skew_accuracy']
  content_answers_df.reset_index(inplace=True)
  content_answers_df = content_answers_df.rename(columns = {'index':'content_id'})
  return content_answers_df

train_df = train_df.merge(content_answer_stats(train_df), how='left', on='content_id')
validation_df = validation_df.merge(content_answer_stats(validation_df), how='left', on='content_id')
test_df = test_df.merge(content_answer_stats(test_df), how='left', on='content_id')
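The cells above build the statistics tables separately inside each split. The sketch below shows a different arrangement, not the one used above: the per-user table is built from training rows only and then merged into a held-out frame, which mirrors how the tables saved in the final section are meant to be reused on the competition test set. All frames and values are made up, and the 0.5 fallback for unseen users is an assumption.

```python
import pandas as pd

# Made-up training interactions (questions only)
toy_train = pd.DataFrame({
    "user_id":            [1, 1, 1, 2, 2],
    "answered_correctly": [1, 0, 1, 0, 0],
})
# Made-up held-out rows; user 3 never appears in training
toy_holdout = pd.DataFrame({"user_id": [1, 2, 3]})

# Per-user table computed from training labels only
user_table = (
    toy_train.groupby("user_id")["answered_correctly"]
             .agg(mean_user_accuracy="mean", questions_answered="count")
             .reset_index()
)

# Left merge keeps every held-out row; unseen users get NaN and then a neutral fallback
holdout_features = toy_holdout.merge(user_table, how="left", on="user_id")
holdout_features["mean_user_accuracy"] = holdout_features["mean_user_accuracy"].fillna(0.5)
holdout_features["questions_answered"] = holdout_features["questions_answered"].fillna(0)

print(holdout_features)
```

With this arrangement the held-out labels never contribute to the held-out features, so the validation and test scores are closer to what a submission would see.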

In [None]:
train_df.info()

## Features Selection

In [None]:
features = [
    'mean_user_accuracy', 
    'questions_answered',
    'std_user_accuracy', 
    'median_user_accuracy',
    'skew_user_accuracy',
    'mean_accuracy', 
    'questions_asked',
    'std_accuracy', 
    'median_accuracy',
    'prior_question_elapsed_time', 
    'prior_question_had_explanation',
    'skew_accuracy',
]

target = 'answered_correctly'

## Final Train, Validation, and Test data

In [None]:
train_df = train_df[train_df['answered_correctly']!=-1][features + [target]]
validation_df = validation_df[validation_df['answered_correctly']!=-1][features + [target]]
test_df = test_df[test_df['answered_correctly']!=-1][features + [target]]

In [None]:
train_df.isnull().sum()

In [None]:
train_df = train_df.replace([np.inf, -np.inf], np.nan)
train_df = train_df.fillna(0.5)

validation_df = validation_df.replace([np.inf, -np.inf], np.nan)
validation_df = validation_df.fillna(0.5)

test_df = test_df.replace([np.inf, -np.inf], np.nan)
test_df = test_df.fillna(0.5)

train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].astype('bool')
validation_df['prior_question_had_explanation'] = validation_df['prior_question_had_explanation'].astype('bool')
test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].astype('bool')

In [None]:
train_df.dtypes

## LightGBM model Creation and Hyperparameter tuning

In [None]:
import lightgbm as lgb
import optuna

In [None]:
def create_model(trial):
    num_leaves = trial.suggest_int("num_leaves", 2, 31)
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int('max_depth', 3, 8)
    min_child_samples = trial.suggest_int('min_child_samples', 100, 1200)
    # suggest_float replaces suggest_uniform, which is deprecated in recent Optuna releases
    learning_rate = trial.suggest_float('learning_rate', 0.0001, 0.99)
    # note: LightGBM treats min_data_in_leaf as an alias of min_child_samples
    min_data_in_leaf = trial.suggest_int('min_data_in_leaf', 5, 90)
    bagging_fraction = trial.suggest_float('bagging_fraction', 0.0001, 1.0)
    feature_fraction = trial.suggest_float('feature_fraction', 0.0001, 1.0)
    model = lgb.LGBMClassifier(
        num_leaves=num_leaves,
        n_estimators=n_estimators, 
        max_depth=max_depth, 
        min_child_samples=min_child_samples, 
        min_data_in_leaf=min_data_in_leaf,
        learning_rate=learning_rate,
        bagging_fraction=bagging_fraction,  # only takes effect when bagging_freq > 0
        feature_fraction=feature_fraction,
        random_state=666
    )
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(train_df[features], train_df[target])
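    # tune on the validation (tune) split so the test split stays untouched for the final check
    score = roc_auc_score(validation_df[target].values, model.predict_proba(validation_df[features])[:,1])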
    return score

# # uncomment to use optuna
# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=70)
# params = study.best_params
# params['random_state'] = 666

params = {'num_leaves': 6, 'n_estimators': 242, 'max_depth': 3, 
          'min_child_samples': 352, 'learning_rate': 0.1954705382751018, 'min_data_in_leaf': 72, 
          'bagging_fraction': 0.2654709578619099, 'feature_fraction': 0.4901470199766588} #0.7714113571421728 0.7717271921771559  0.7711878055353522

model = lgb.LGBMClassifier(**params)
model.fit(train_df[features], train_df[target])

In [None]:
print('LGB validation AUC: ', roc_auc_score(validation_df[target].values, model.predict_proba(validation_df[features])[:,1]))

In [None]:
print('LGB test AUC: ', roc_auc_score(test_df[target].values, model.predict_proba(test_df[features])[:,1]))

<a id="6"></a>
# Model Building - Training with full data

In this section, the model tuned in the previous step is trained using the full dataset. The relevant feature engineering tables are saved to CSV and the model is saved to a text file. The output will be loaded into a separate notebook in preparation for Kaggle submission.

## Load data

In [None]:
cols_2_read = ['timestamp','content_id','user_id','content_type_id','answered_correctly','prior_question_elapsed_time','prior_question_had_explanation']#,'row_id','task_container_id','user_answer']
all_dtype_dic = {'row_id': 'uint64', 'timestamp': 'uint64', 'user_id': 'uint32', 'content_id': 'uint16', 'content_type_id': 'uint8',
                              'task_container_id': 'uint16', 'user_answer': pd.CategoricalDtype([-1,0,1,2,3]), 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
cols_2_read_dtype_dic = {k: all_dtype_dic[k] for k in all_dtype_dic.keys() & cols_2_read}
#read dataset in chunks to prevent out-of-memory, then concatenate the chunks once
chunks = pd.read_csv('train.csv', low_memory=False,nrows=10**7,
                       dtype=cols_2_read_dtype_dic,usecols = cols_2_read,chunksize=10**5 
                      )
train_df = pd.concat(chunks, ignore_index=True)

## Data preparation

In [None]:
#add lectures and questions
train_df = concat_questions_lectures(train_df,questions,lectures)

In [None]:
#change to numeric
train_df[['type_of','question_tags','lecture_tag']] = train_df[['type_of','question_tags','lecture_tag']].astype('category') 
train_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']] = train_df[['timestamp', 'user_id','content_id','content_type_id','answered_correctly','bundle_id','correct_answer','part']].apply(pd.to_numeric, downcast='unsigned')

In [None]:
#Feature engineering

#user_interaction_lag
train_df['user_interaction_lag'] = train_df.sort_values(['user_id','timestamp']).groupby('user_id')['timestamp'].diff()

#content_answer_stats
train_content_answer_stats_df = content_answer_stats(train_df)
#train_content_answer_stats_df.to_csv('train_content_answer_stats_df.csv') # uncomment to save output to use for feature engineering of test dataset for competition submissions
train_df = train_df.merge(train_content_answer_stats_df, how='left', on='content_id')

#user answer stats
train_user_answer_stats_df = user_answer_stats(train_df)
#train_user_answer_stats_df.to_csv('train_user_answer_stats_df.csv') # uncomment to save output to use for feature engineering of test dataset for competition submissions
train_df = train_df.merge(train_user_answer_stats_df, how='left', on='user_id')

In [None]:
train_df = train_df[train_df['answered_correctly']!=-1][features + [target]]

In [None]:
train_df.isnull().sum()

In [None]:
train_df.dtypes

In [None]:
#deal with the nulls
train_df = train_df.replace([np.inf, -np.inf], np.nan)
train_df = train_df.fillna(0.5)
train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].astype('bool')

## Train full model and save

In [None]:
params = {'num_leaves': 6, 'n_estimators': 242, 'max_depth': 3, 
          'min_child_samples': 352, 'learning_rate': 0.1954705382751018, 'min_data_in_leaf': 72, 
          'bagging_fraction': 0.2654709578619099, 'feature_fraction': 0.4901470199766588}

model = lgb.LGBMClassifier(**params)
model.fit(train_df[features], train_df[target])
#model.booster_.save_model('model.txt') # uncomment to save the trained model for use in the submission notebook