# Student Performance from Game Play Using TensorFlow Decision Forests

---

This notebook will take you through the steps needed to train a baseline Gradient Boosted Trees Model using TensorFlow Decision Forests on the `Student Performance from Game Play` dataset made available for this competition, to predict if players will answer questions correctly.
We will load the data from a CSV file. Roughly, the code will look as follows:

```
import tensorflow_decision_forests as tfdf
import pandas as pd
  
dataset = pd.read_csv("project/dataset.csv")
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="my_label")

model = tfdf.keras.GradientBoostedTreesModel()
model.fit(tf_dataset)
  
print(model.summary())
```

We will also learn how to optimize the reading of large datasets, do some feature engineering and data visualization, and use the F1 score to pick a better decision threshold for the predictions.


Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are a strong starting point when working with tabular data and will often outperform neural networks, or at least provide a strong baseline before you begin experimenting with them.

One of the key aspects of TensorFlow Decision Forests that makes it even more suitable for this competition, particularly given the runtime limitations, is that it has been extensively tested for training and inference on CPUs, making it possible to train it on lower-end machines.

# Import the Required Libraries

In [1]:
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_decision_forests as tfdf

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gc
from sklearn.metrics import f1_score, roc_auc_score

2023-06-09 15:20:01.406654: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-09 15:20:01.464229: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-09 15:20:01.753512: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/thor_01/miniconda3/envs/ds2023/lib/:/home/thor_01/miniconda3/envs/ds

In [3]:
print("TensorFlow Decision Forests v" + tfdf.__version__)
print("TensorFlow Addons v" + tfa.__version__)
print("TensorFlow v" + tf.__version__)

TensorFlow Decision Forests v1.2.0
TensorFlow Addons v0.20.0
TensorFlow v2.11.1


# Load the Dataset

Since the dataset is large, reading it from the CSV file may cause memory errors. To avoid this, we will reduce the memory Pandas uses to load and store the dataset.


When Pandas loads a dataset, by default, it automatically detects the data types of the different columns.
Irrespective of the maximum value stored in these columns, Pandas assigns `int64` to integer columns, `float64` to float columns, the `object` dtype to string columns, etc.


We may be able to reduce the size of these columns in memory by downcasting numerical columns to smaller types (like `int8`, `int32`, `float32` etc.) if their values fit without needing the larger types (like `int64`, `float64` etc.).


Similarly, Pandas automatically detects string columns as `object` datatype. To reduce memory usage of string columns which store categorical data, we specify their datatype as `category`.


Many of the columns in this dataset can be downcast to smaller types.
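To make the saving concrete, here is a minimal sketch on a synthetic frame. The column names mirror the real dataset, but the values and the row count are made up for illustration; the exact numbers will differ on the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three columns of the real dataset (values are made up).
rng = np.random.default_rng(0)
n_rows = 100_000
demo_df = pd.DataFrame({
    "elapsed_time": np.arange(n_rows, dtype=np.int64),
    "level": rng.integers(0, 23, size=n_rows),
    "event_name": rng.choice(
        ["navigate_click", "person_click", "map_hover"], size=n_rows
    ),
})

# Memory with the dtypes Pandas infers by default.
bytes_default = demo_df.memory_usage(deep=True).sum()

# Downcast the numerical columns and store the repeated strings as categories.
demo_small = demo_df.astype(
    {"elapsed_time": np.int32, "level": np.uint8, "event_name": "category"}
)
bytes_small = demo_small.memory_usage(deep=True).sum()

print(f"default dtypes: {bytes_default / 1e6:.1f} MB")
print(f"downcast dtypes: {bytes_small / 1e6:.1f} MB")
```

Passing the same mapping through the `dtype` argument of `pd.read_csv` avoids ever materializing the larger types.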

We will provide a dict of `dtypes` for columns to pandas while reading the dataset.

In [4]:
# Reference: https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384359
dtypes={
    'elapsed_time':np.int32,
    'event_name':'category',
    'name':'category',
    'level':np.uint8,
    'room_coor_x':np.float32,
    'room_coor_y':np.float32,
    'screen_coor_x':np.float32,
    'screen_coor_y':np.float32,
    'hover_duration':np.float32,
    'text':'category',
    'fqid':'category',
    'room_fqid':'category',
    'text_fqid':'category',
    'fullscreen':'category',
    'hq':'category',
    'music':'category',
    'level_group':'category'}
work_dir = 'data/predict-student-performance-from-game-play' 

train_df = pd.read_csv(work_dir+'/train.csv', dtype=dtypes)
print("Full train dataset shape is {}".format(train_df.shape))

Full train dataset shape is (26296946, 20)


The data is composed of 20 columns and 26296946 entries. We can see all 20 dimensions of our dataset by printing out the first 5 entries using the following code:

In [5]:
# Display the first 5 examples
train_df.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,1,1323,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,2,831,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,3,1147,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,4,1863,person_click,basic,0,,-412.991394,-159.314682,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4


Please note that `session_id` uniquely identifies a user session.

# Load the labels

The labels for the training dataset are stored in `train_labels.csv`. It records whether the user in a particular session answered each question correctly. Load the labels by running the following code.

In [6]:
labels = pd.read_csv(work_dir + '/train_labels.csv')

Each value in the `session_id` column is a combination of the session and the question number, for example `20090312431273200_q1`.
We will split these into individual columns for ease of use.

In [7]:
labels['session'] = labels.session_id.apply(lambda x: int(x.split('_')[0]) )
labels['q'] = labels.session_id.apply(lambda x: int(x.split('_')[-1][1:]) )
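As an aside, the same split can be done without a Python-level `apply`. The sketch below uses Pandas' vectorized string methods on two hypothetical rows (`demo_labels` is not part of the notebook) and assumes every id has the form `<session>_q<number>`.

```python
import pandas as pd

# Two hypothetical rows in the same format as `train_labels.csv`.
demo_labels = pd.DataFrame({
    "session_id": ["20090312431273200_q1", "20090312433251036_q18"],
    "correct": [1, 0],
})

# Split once on "_q": the left part is the session, the right part the question.
parts = demo_labels["session_id"].str.split("_q", expand=True)
demo_labels["session"] = parts[0].astype("int64")
demo_labels["q"] = parts[1].astype("int64")

print(demo_labels)
```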

Let us take a look at the first 5 entries of `labels` using the following code:

In [8]:
# Display the first 5 examples
labels.head()

Unnamed: 0,session_id,correct,session,q
0,20090312431273200_q1,1,20090312431273200,1
1,20090312433251036_q1,0,20090312433251036,1
2,20090312455206810_q1,1,20090312455206810,1
3,20090313091715820_q1,0,20090313091715820,1
4,20090313571836404_q1,1,20090313571836404,1


Our goal is to train models for each question to predict the label `correct` for any input user session. 

# Prepare the dataset

As summarized in the competition overview, the dataset presents the questions and data to us in order of level segments (represented by the column `level_group`): 0-4, 5-12, and 13-22. We have to predict the correctness of each segment's questions as they are presented. To do this, we will create basic aggregate features from the relevant columns. You can create more features to boost your scores.

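First, we will create two separate lists with the names of the categorical and numerical columns. We will skip the columns `fullscreen`, `hq` and `music`, since they add little value for this problem. The lists also include derived columns (`time_diff`, the `*_diff` coordinate columns and `text_value`) that the `feature_engineer` function creates further below.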

In [11]:
CATEGORICAL = [ 'event_name', 'name', 'text', 'fqid', 'room_fqid', 'text_fqid', 'text_value']
NUMERICAL = ['elapsed_time', 'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y', 'hover_duration', 'time_diff', 'room_coor_x_diff', 'room_coor_y_diff', 'screen_coor_x_diff', 'screen_coor_y_diff']

In [12]:


# count_var = ['event_name', 'fqid','room_fqid', 'text']
# mean_var = ['elapsed_time','level']
# event_var = ['navigate_click','person_click','cutscene_click','object_click','map_hover','notification_click',
#             'map_click','observation_click','checkpoint','elapsed_time']



The active `feature_engineer` function below first filters the events to a single `level_group` and adds the derived columns. The commented-out cells before it are earlier versions kept for reference.

For each categorical column, we count the number of **distinct elements** (`nunique`) and the number of non-null values (`count`) in each group.

For each numerical column, we calculate the `mean`, `median`, `standard deviation`, `sum`, `min` and `max` in each group.

Both sets of aggregations are computed twice: once grouped by `session_id` and `level` (each level then becomes its own set of columns), and once grouped by `session_id` over the whole level group. Finally, we concatenate the two resulting data frames to create our new feature engineered dataset.
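The aggregation step described above can be illustrated on a toy frame. This is only a sketch with made-up events; it shows how a dict passed to `groupby(...).agg(...)` yields two-level column names, and how joining the levels with `_` produces flat feature names such as `elapsed_time_mean`.

```python
import pandas as pd

# Made-up events for two sessions.
events = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "elapsed_time": [0, 10, 30, 5, 15],
    "event_name": ["a", "b", "a", "a", "a"],
})

# One list of aggregations per column.
agg = events.groupby("session_id").agg({
    "elapsed_time": ["mean", "max"],
    "event_name": ["nunique", "count"],
})

# Flatten ("elapsed_time", "mean") into "elapsed_time_mean", and so on.
agg.columns = ["_".join(col) for col in agg.columns]

print(agg)
```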

In [13]:
# # reference: https://www.kaggle.com/code/cdeotte/random-forest-baseline-0-664/notebook
# def feature_engineer(train):
#     dfs = []
#     for c in count_var:
#         tmp = train.groupby(['session_id','level_group'])[c].agg('nunique')
#         tmp.name = tmp.name + '_nunique'
#         dfs.append(tmp)
#     for c in mean_var:
#         tmp = train.groupby(['session_id','level_group'])[c].agg('mean')
#         dfs.append(tmp)
#     for c in event_var:
#         tmp = train.groupby(['session_id','level_group'])[c].agg('sum')
#         tmp.name = tmp.name + '_sum'
#         dfs.append(tmp)
#     df = pd.concat(dfs,axis=1)
#     df = df.fillna(-1)
#     df = df.reset_index()
#     df = df.set_index('session_id')
#     return df

In [14]:
# Reference: https://www.kaggle.com/code/cdeotte/random-forest-baseline-0-664/notebook

# def feature_engineer(dataset_df):
#     dfs = []
#     for c in CATEGORICAL:
#         tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('nunique')
#         tmp.name = tmp.name + '_nunique'
#         dfs.append(tmp)
#     for c in NUMERICAL:
#         tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('mean')
#         dfs.append(tmp)
#     for c in NUMERICAL:
#         tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('std')
#         tmp.name = tmp.name + '_std'
#         dfs.append(tmp)
#     # for c in NUMERICAL:
#     #     tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('sum')
#     #     tmp.name = tmp.name + '_std'
#     #     dfs.append(tmp)
#     dataset_df = pd.concat(dfs,axis=1)
#     dataset_df = dataset_df.fillna(-1)
#     dataset_df = dataset_df.reset_index()
#     dataset_df = dataset_df.set_index('session_id')
#     return dataset_df


def feature_engineer(df, gr):

    # Keep only the events of the requested level group, e.g. "0-4".
    df = df.query(f'level_group == "{gr}"')

    # Keep the columns we need. `.copy()` avoids pandas' SettingWithCopyWarning
    # when the derived columns are added below.
    df = df[['session_id', 'elapsed_time', 'event_name', 'name', 'level',
    'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y',
    'hover_duration', 'text', 'fqid', 'room_fqid', 'text_fqid',
    'level_group']].copy()

    # Row-to-row differences. Note that `shift(1)` runs over the whole frame,
    # so the first event of a session is compared with the last event of the
    # previous session.
    df['time_diff'] = df['elapsed_time'] - df['elapsed_time'].shift(1)
    df['room_coor_x_diff'] = df['room_coor_x'] - df['room_coor_x'].shift(1)
    df['room_coor_y_diff'] = df['room_coor_y'] - df['room_coor_y'].shift(1)
    df['screen_coor_x_diff'] = df['screen_coor_x'] - df['screen_coor_x'].shift(1)
    df['screen_coor_y_diff'] = df['screen_coor_y'] - df['screen_coor_y'].shift(1)

    # Flag rows whose `text` is missing (1 = NaN, 0 = text present).
    df['text_value'] = df['text'].isna().astype('int')

    
    # Define aggregation operations for numerical and categorical columns
    agg_numerical = {num_col: ['mean', 'median', 'std', 'sum', 'min', 'max'] for num_col in NUMERICAL}
    agg_categorical = {cat_col: ['nunique','count'] for cat_col in CATEGORICAL}  # 'lambda x:x.value_counts().index[0] if x.nunique() else None' will compute mode

    agg_dict = {**agg_numerical, **agg_categorical}

    # Aggregate per (session_id, level), then pivot each level into its own columns.
    df_level = df.groupby(['session_id', 'level']).agg(agg_dict)
    df_level.columns = ['_'.join(col).strip() for col in df_level.columns.values]
    df_level = df_level.fillna(-1)
    df_level = df_level.unstack('level')
    df_level.columns = ['_'.join(map(str, col)) for col in df_level.columns]

    

    # Aggregate per session_id over the whole level group
    # (df was already filtered to a single level_group above).
    df_level_group = df.groupby(['session_id']).agg(agg_dict)
    df_level_group.columns = ['_'.join(col).strip() for col in df_level_group.columns.values]
    df_level_group = df_level_group.fillna(-1)

    # Concatenate the two resulting dataframes
    df_final = pd.concat([df_level, df_level_group], axis=1)

    return df_final
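One detail of the `*_diff` columns is worth knowing: `shift(1)` works on the whole frame, so the first event of each session is compared with the last event of the previous session. The sketch below contrasts this with a per-session difference on made-up data. It is an alternative to consider, not what the function above does, and switching to it would change the features, so any saved models would need retraining.

```python
import pandas as pd

# Made-up events for two sessions, sorted by session and time.
demo = pd.DataFrame({
    "session_id": [1, 1, 2, 2],
    "elapsed_time": [0, 100, 0, 50],
})

# Frame-wide shift: row 2 is compared with the last row of session 1.
demo["diff_shift"] = demo["elapsed_time"] - demo["elapsed_time"].shift(1)

# Per-session difference: the first row of every session becomes NaN.
demo["diff_grouped"] = demo.groupby("session_id")["elapsed_time"].diff()

print(demo)
```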

In [15]:
gc.collect()
#feature generation no split
df1_features = feature_engineer(train_df, "0-4" )
print(df1_features.shape)
df2_features = feature_engineer(train_df, "5-12" )
print(df2_features.shape)
df3_features = feature_engineer(train_df, "13-22")
print(df3_features.shape)

(23562, 480)
(23562, 720)
(23562, 880)


In [16]:
del train_df
gc.collect()

0

The feature engineering produces three datasets, one per level group. Each has 23562 rows (one per session) and 480, 720 and 880 columns respectively.

# Basic exploration of the prepared dataset

Let us print out the first 5 entries using the following code:

In [17]:
# # Display the first 5 examples
# dataset_df.head(5)

In [18]:
# dataset_df.describe()

# Numerical data distribution

Let us plot some numerical columns and their value against each level_group:

In [19]:
# figure, axis = plt.subplots(3, 2, figsize=(10, 10))

# for name, data in dataset_df.groupby('level_group'):
#     axis[0, 0].plot(range(1, len(data['room_coor_x_std'])+1), data['room_coor_x_std'], label=name)
#     axis[0, 1].plot(range(1, len(data['room_coor_y_std'])+1), data['room_coor_y_std'], label=name)
#     axis[1, 0].plot(range(1, len(data['screen_coor_x_std'])+1), data['screen_coor_x_std'], label=name)
#     axis[1, 1].plot(range(1, len(data['screen_coor_y_std'])+1), data['screen_coor_y_std'], label=name)
#     axis[2, 0].plot(range(1, len(data['hover_duration'])+1), data['hover_duration_std'], label=name)
#     axis[2, 1].plot(range(1, len(data['elapsed_time_std'])+1), data['elapsed_time_std'], label=name)
    

# axis[0, 0].set_title('room_coor_x')
# axis[0, 1].set_title('room_coor_y')
# axis[1, 0].set_title('screen_coor_x')
# axis[1, 1].set_title('screen_coor_y')
# axis[2, 0].set_title('hover_duration')
# axis[2, 1].set_title('elapsed_time_std')

# for i in range(3):
#     axis[i, 0].legend()
#     axis[i, 1].legend()

# plt.show()

Now let us split the dataset into training and validation (testing) datasets:

In [20]:
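def split_dataset(dataset, test_ratio=0.20):
    # Split by session: the first (1 - test_ratio) share of sessions goes to
    # training, the rest to validation.
    USER_LIST = dataset.index.unique()
    split = int(len(USER_LIST) * (1 - test_ratio))
    return dataset.loc[USER_LIST[:split]], dataset.loc[USER_LIST[split:]]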


In [21]:
df1_train, df1_valid = split_dataset(df1_features)
print("{} examples in training, {} examples in testing.".format(
    len(df1_train), len(df1_valid)))

df2_train, df2_valid = split_dataset(df2_features)
print("{} examples in training, {} examples in testing.".format(
    len(df2_train), len(df2_valid)))

df3_train, df3_valid = split_dataset(df3_features)
print("{} examples in training, {} examples in testing.".format(
    len(df3_train), len(df3_valid)))

18849 examples in training, 4713 examples in testing.
18849 examples in training, 4713 examples in testing.
18849 examples in training, 4713 examples in testing.
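The split above takes the first 80% of sessions in index order. If the order of sessions carries any signal, a shuffled split may be preferable. The following is a minimal sketch of that variant; `shuffled_split`, the `seed` parameter and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def shuffled_split(dataset, test_ratio=0.20, seed=42):
    """Split a session-indexed frame into train/validation after shuffling sessions."""
    users = dataset.index.unique().to_numpy()
    # `permutation` returns a shuffled copy and leaves the index untouched.
    users = np.random.default_rng(seed).permutation(users)
    split = int(len(users) * (1 - test_ratio))
    return dataset.loc[users[:split]], dataset.loc[users[split:]]

# Toy frame indexed by ten hypothetical session ids.
demo = pd.DataFrame(
    {"x": range(10)}, index=pd.Index(range(100, 110), name="session_id")
)
train_part, valid_part = shuffled_split(demo)
print(len(train_part), len(valid_part))
```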


# Select a Model
There are several tree-based models for you to choose from.

- RandomForestModel
- GradientBoostedTreesModel
- CartModel
- DistributedGradientBoostedTreesModel

We can list all the available models in TensorFlow Decision Forests using the following code:

In [22]:
tfdf.keras.get_all_models()

[tensorflow_decision_forests.keras.RandomForestModel,
 tensorflow_decision_forests.keras.GradientBoostedTreesModel,
 tensorflow_decision_forests.keras.CartModel,
 tensorflow_decision_forests.keras.DistributedGradientBoostedTreesModel]

To get started, we'll work with a Gradient Boosted Trees Model. This is one of the well-known Decision Forest training algorithms.

A Gradient Boosted Decision Tree is a set of shallow decision trees trained sequentially. Each tree is trained to predict and then "correct" for the errors of the previously trained trees.
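The idea of trees correcting earlier trees can be sketched in a few lines of NumPy. This is a toy illustration for squared-error regression with depth-1 trees (stumps), not TensorFlow Decision Forests' implementation; the helper names, data and learning rate are all made up.

```python
import numpy as np

def fit_stump(x, target):
    """Find the single split on x that best fits `target` in the least-squares sense."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = target[x <= threshold], target[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x)

learning_rate = 0.3
prediction = np.zeros_like(y)
errors = []
for _ in range(20):
    residual = y - prediction          # errors of the trees trained so far
    stump = fit_stump(x, residual)     # the next tree is fit to those errors
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(float(np.mean((y - prediction) ** 2)))

print(f"MSE after 1 tree: {errors[0]:.4f}, after 20 trees: {errors[-1]:.4f}")
```

Each round fits a new stump to the current residuals, so the training error shrinks as trees are added.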

# How can I configure a tree-based model?

TensorFlow Decision Forests provides good defaults for you (e.g., the top ranking hyperparameters on our benchmarks, slightly modified to run in reasonable time). If you would like to configure the learning algorithm, you will find many options you can explore to get the highest possible accuracy.

You can select a template and/or set parameters as follows:
```
rf = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1")
```

You can read more [here](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel).

# Training


We will train a model for each question to predict if the question will be answered correctly by a user. 
There are a total of 18 questions in the dataset. Hence, we will be training 18 models, one for each question.

We need to provide a few data structures to our training loop to store the trained models, predictions on the validation set and evaluation scores for the trained models.

We will create these using the following code:


In [23]:
# Fetch the unique list of user sessions in the validation dataset. We assigned 
# `session_id` as the index of our feature engineered dataset. Hence fetching 
# the unique values in the index column will give us a list of users in the 
# validation set.
VALID_USER_LIST = df1_valid.index.unique()

# Create a dataframe for storing the predictions of each question for all users
# in the validation set.
# For this, the required size of the data frame is: 
# (no: of users in validation set  x no of questions).
# We will initialize all the predicted values in the data frame to zero.
# The dataframe's index column is the user `session_id`s. 
prediction_df = pd.DataFrame(data=np.zeros((len(VALID_USER_LIST),18)), index=VALID_USER_LIST)

# Create an empty dictionary to store the models created for each question.
models = {}

# Create an empty dictionary to store the evaluation score for each question.
evaluation_dict ={}
f1_scores = {}
auc_scores = {}

Before training the models, we have to understand how `level_group`s and questions are associated with each other.

In this game, the first quiz checkpoint (i.e., questions 1 to 3) comes after finishing levels 0 to 4. So for questions 1 to 3 we will use data from the `level_group` 0-4. Similarly, we will use data from the `level_group` 5-12 for questions 4 to 13, and data from the `level_group` 13-22 for questions 14 to 18.

We will train a model for each question and store the trained model in the `models` dict.
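The question-to-level-group mapping described above can be written as a small helper. This is a sketch; `level_group_for_question` is a hypothetical name that the notebook itself does not use.

```python
def level_group_for_question(q_no):
    """Return the level_group whose data is used for question `q_no` (1 to 18)."""
    if 1 <= q_no <= 3:
        return "0-4"
    if 4 <= q_no <= 13:
        return "5-12"
    if 14 <= q_no <= 18:
        return "13-22"
    raise ValueError(f"question number must be between 1 and 18, got {q_no}")

for q_no in (1, 4, 14):
    print(q_no, level_group_for_question(q_no))
```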

In [24]:
gc.collect()

0

# Inference model

In [25]:
from tensorflow.keras.models import load_model

In [26]:
# Iterate through questions 1 to 18: load the saved model for each question
# (the training code is kept below, commented out), evaluate it and store the
# predicted values.
for q_no in range(1,19):
    # Select level group for the question based on the q_no.
    if q_no<=3:
        train_df = df1_train
        valid_df = df1_valid
    elif q_no<=13:
        train_df = df2_train
        valid_df = df2_valid
    else:  # questions 14 to 18
        train_df = df3_train
        valid_df = df3_valid
    
    
        
    # Session ids of the selected level group's training and validation sets.
    
    train_users = train_df.index.values
    valid_users = valid_df.index.values

    # Select the labels for the related q_no.
    train_labels = labels.loc[labels.q==q_no].set_index('session').loc[train_users]
    valid_labels = labels.loc[labels.q==q_no].set_index('session').loc[valid_users]

    # Add the label to the datasets. `train_df` and `valid_df` refer to the
    # shared per-level-group frames, so the `correct` column is overwritten
    # for every question.
    train_df["correct"] = train_labels["correct"]
    valid_df["correct"] = valid_labels["correct"]

    # There's one more step required before we can use the model.
    # We need to convert the dataset from Pandas format (pd.DataFrame)
    # into a TensorFlow dataset (tf.data.Dataset), the input format that
    # the TF-DF Keras models consume.
    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="correct")
    valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_df, label="correct")

    # We will now create the Gradient Boosted Trees Model with default settings. 
    # By default the model is set to train for a classification task.
    # gbtm = tfdf.keras.GradientBoostedTreesModel(verbose=0)
    # gbtm.compile(metrics=["accuracy"])

    # # Train the model.
    # gbtm.fit(x=train_ds)

    # Instead of retraining, load the model previously saved for this question.
    gbtm = load_model(f'Tensorflow_models/model_{q_no}')
    

    # Store the model
    models[f'{q_no}'] = gbtm

    # Evaluate the trained model on the validation dataset and store the 
    # evaluation accuracy in the `evaluation_dict`.
    # inspector = gbtm.make_inspector()
    # inspector.evaluation()
    evaluation = gbtm.evaluate(x=valid_ds,return_dict=True)
    evaluation_dict[q_no] = evaluation["accuracy"]         

    # Use the trained model to make predictions on the validation dataset and 
    # store the predicted values in the `prediction_df` dataframe.
    predict = gbtm.predict(x=valid_ds)
    prediction_df.loc[valid_users, q_no-1] = predict.flatten()
    
    # Calculate the F1 score
    f1 = f1_score(valid_labels["correct"], (predict > 0.5).astype(int))
    f1_scores[q_no] = f1
    
    # Calculate the AUC score
    auc = roc_auc_score(valid_labels["correct"], predict)
    auc_scores[q_no] = auc
    
    
    del train_df, valid_df, train_labels, valid_labels
    gc.collect()

2023-06-09 15:21:33.408951: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero



[INFO 2023-06-09T15:21:37.545693561+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_2/assets/ with prefix 8619a5faaf77467e
[INFO 2023-06-09T15:21:37.548275592+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:21:40.4288462+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_3/assets/ with prefix 3c0a956c30194d8e
[INFO 2023-06-09T15:21:40.432494189+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:21:44.03018306+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_4/assets/ with prefix f9ee2c2f779f45d5
[INFO 2023-06-09T15:21:44.03467537+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:21:48.381946238+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_5/assets/ with prefix 1c70991c7cb94820
[INFO 2023-06-09T15:21:48.384973414+02:00 abstract_model.cc:1311] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 2023-06-09T15:21:48.385068353+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:21:52.855522942+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_6/assets/ with prefix a557ea475ae443ce
[INFO 2023-06-09T15:21:52.859886484+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:21:57.224275894+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_7/assets/ with prefix 5e502d6e6cec4829
[INFO 2023-06-09T15:21:57.228739285+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:01.704513635+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_8/assets/ with prefix 97015a538ce74e98
[INFO 2023-06-09T15:22:01.707022577+02:00 abstract_model.cc:1311] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 2023-06-09T15:22:01.707106746+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:06.136203675+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_9/assets/ with prefix 268b0be30bd24da6
[INFO 2023-06-09T15:22:06.140495698+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:10.492903015+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_10/assets/ with prefix b1a79b6cc5ea415c
[INFO 2023-06-09T15:22:10.496031611+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:15.066659768+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_11/assets/ with prefix 8378d7a697964fca
[INFO 2023-06-09T15:22:15.068822514+02:00 abstract_model.cc:1311] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 2023-06-09T15:22:15.068912713+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:19.611189793+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_12/assets/ with prefix 600804e856494cff
[INFO 2023-06-09T15:22:19.613376979+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:24.106350235+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_13/assets/ with prefix 707ab8a5083b4fe6
[INFO 2023-06-09T15:22:24.108806288+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:29.163651344+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_14/assets/ with prefix 8d2ca04ae18e482a
[INFO 2023-06-09T15:22:29.167864169+02:00 abstract_model.cc:1311] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 2023-06-09T15:22:29.167984858+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:34.744147975+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_15/assets/ with prefix 37b1f7b6f1894683
[INFO 2023-06-09T15:22:34.747846885+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:40.330397226+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_16/assets/ with prefix 0a77a5cba05c4281
[INFO 2023-06-09T15:22:40.332281487+02:00 abstract_model.cc:1311] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 2023-06-09T15:22:40.332388956+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:46.112804226+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_17/assets/ with prefix f7378ed593884f1a
[INFO 2023-06-09T15:22:46.118512507+02:00 kernel.cc:1046] Use fast generic engine




[INFO 2023-06-09T15:22:51.911827961+02:00 kernel.cc:1214] Loading model from path Tensorflow_models/model_18/assets/ with prefix eb77cb7d9ff74ef6
[INFO 2023-06-09T15:22:51.915387414+02:00 abstract_model.cc:1311] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 2023-06-09T15:22:51.915556092+02:00 kernel.cc:1046] Use fast generic engine




In [27]:
for name, value in evaluation_dict.items():
  print(f"question {name}: accuracy {value:.4f}")

print("\nAverage accuracy", sum(evaluation_dict.values())/18)


for name, value in f1_scores.items():
  print(f"question {name}: f1_score {value:.4f}")

# Average over the F1 scores (not over the accuracies in `evaluation_dict`).
print("\nAverage f1_score", sum(f1_scores.values())/18)


for name, value in auc_scores.items():
  print(f"question {name}: auc_score {value:.4f}")

# Average over the AUC scores (not over the accuracies in `evaluation_dict`).
print("\nAverage auc_score", sum(auc_scores.values())/18)

question 1: accuracy 0.7560
question 2: accuracy 0.9728
question 3: accuracy 0.9338
question 4: accuracy 0.8080
question 5: accuracy 0.6435
question 6: accuracy 0.7906
question 7: accuracy 0.7454
question 8: accuracy 0.6351
question 9: accuracy 0.7708
question 10: accuracy 0.6153
question 11: accuracy 0.6635
question 12: accuracy 0.8684
question 13: accuracy 0.7191
question 14: accuracy 0.7431
question 15: accuracy 0.6488
question 16: accuracy 0.7481
question 17: accuracy 0.6991
question 18: accuracy 0.9497

Average accuracy 0.7617346809970008
question 1: f1_score 0.8496
question 2: f1_score 0.9862
question 3: f1_score 0.9657
question 4: f1_score 0.8882
question 5: f1_score 0.7044
question 6: f1_score 0.8790
question 7: f1_score 0.8487
question 8: f1_score 0.7657
question 9: f1_score 0.8640
question 10: f1_score 0.6419
question 11: f1_score 0.7852
question 12: f1_score 0.9295
question 13: f1_score 0.1524
question 14: f1_score 0.8423
question 15: f1_score 0.6515
question 16: f1_score 0.

# Visualize the model

One benefit of tree-based models is that we can easily visualize them. Our models are Gradient Boosted Trees models, which by default train up to 300 trees.

Let us pick one model from `models` dict and select a tree to display below.

In [28]:
tfdf.model_plotter.plot_model_in_colab(models['1'], tree_idx=0, max_depth=3)

# Threshold-Moving for Imbalanced Classification

Since the values of the column `correct` are fairly imbalanced, using the default threshold of `0.5` to map the predictions to classes 0 or 1 can result in poor performance.
In such cases, we can calculate the `F1 score` for a range of thresholds and pick the best one, i.e. the threshold with the highest `F1 score`. We then use this threshold to map the predicted probabilities to class labels 0 or 1.

Please note that we are using `F1 score` since it is a better metric than `accuracy` to evaluate problems with class imbalance.
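The threshold search can be sketched without any TensorFlow dependency. The example below uses a hand-written macro F1 and made-up labels and probabilities; `macro_f1` is a hypothetical helper, and the notebook's own cell uses `tfa.metrics.F1Score` instead.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Made-up, imbalanced example: six positives and two negatives.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])
y_prob = np.array([0.92, 0.83, 0.77, 0.71, 0.66, 0.62, 0.57, 0.43])

# Score every candidate threshold and keep the best one.
thresholds = np.arange(0.40, 0.80, 0.05)
scores = [macro_f1(y_true, (y_prob > t).astype(int)) for t in thresholds]
best_idx = int(np.argmax(scores))

print(f"best threshold {thresholds[best_idx]:.2f}, macro F1 {scores[best_idx]:.3f}")
```

In this toy data both negatives receive probabilities above 0.4, so a threshold higher than 0.5 separates the classes better, which is the situation threshold-moving is meant to address.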

In [31]:
# Create a dataframe of required size:
# (no: of users in validation set x no: of questions) initialized to zero values
# to store true values of the label `correct`. 
true_df = pd.DataFrame(data=np.zeros((len(VALID_USER_LIST),18)), index=VALID_USER_LIST)
for i in range(18):
    # Get the true labels.
    tmp = labels.loc[labels.q == i+1].set_index('session').loc[VALID_USER_LIST]
    true_df[i] = tmp.correct.values

max_score = 0; best_threshold = 0

# Loop through threshold values from 0.40 to 0.79 and select the threshold with 
# the highest `F1 score`.
for threshold in np.arange(0.4,0.8,0.01):
    metric = tfa.metrics.F1Score(num_classes=2,average="macro",threshold=threshold)
    y_true = tf.one_hot(true_df.values.reshape((-1)), depth=2)
    y_pred = tf.one_hot((prediction_df.values.reshape((-1))>threshold).astype('int'), depth=2)
    metric.update_state(y_true, y_pred)
    # Named `score` so it does not shadow sklearn's `f1_score` imported earlier.
    score = metric.result().numpy()
    if score > max_score:
        max_score = score
        best_threshold = threshold
        
print("Best threshold ", best_threshold, "\tF1 score ", max_score)

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089



Best threshold  0.6200000000000002 	F1 score  0.685995


# Submission

Here you'll use the `best_threshold` calculated in the previous cell.

In [None]:
# Reference
# https://www.kaggle.com/code/philculliton/basic-submission-demo
# https://www.kaggle.com/code/cdeotte/random-forest-baseline-0-664/notebook


import jo_wilder
env = jo_wilder.make_env()
iter_test = env.iter_test()

# Question range [a, b) for each level group.
limits = {'0-4':(1,4), '5-12':(4,14), '13-22':(14,19)}

for (sample_submission, test) in iter_test:

    session_id = test.session_id.values[0]
    gr = test.level_group.values[0]
    a,b = limits[gr]

    # The same feature engineering applies to every level group.
    test = feature_engineer(test, gr)
    test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test)

    for t in range(a,b):
        # `models` is keyed by the question number as a string.
        gbtm = models[f'{t}']
        predictions = gbtm.predict(test_ds)
        mask = sample_submission.session_id.str.contains(f'q{t}')
        n_predictions = (predictions > best_threshold).astype(int)
        sample_submission.loc[mask,'correct'] = n_predictions.flatten()

    env.predict(sample_submission)

In [None]:
! head submission.csv