# Student Performance from Game Play Using TensorFlow Decision Forests

---

This notebook will take you through the steps needed to train a baseline Gradient Boosted Trees Model using TensorFlow Decision Forests on the `Student Performance from Game Play` dataset made available for this competition, to predict if players will answer questions correctly.
We will load the data from a CSV file. Roughly, the code will look as follows:

```
import tensorflow_decision_forests as tfdf
import pandas as pd
  
dataset = pd.read_csv("project/dataset.csv")
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="my_label")

model = tfdf.keras.GradientBoostedTreesModel()
model.fit(tf_dataset)
  
print(model.summary())
```

We will also learn how to optimize the reading of big datasets, do some feature engineering and data visualization, and evaluate the models with the F1 score.


Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are a good place to start when working with tabular data, and they will often outperform neural networks, or at least provide a strong baseline before you begin experimenting with them.

One of the key aspects of TensorFlow Decision Forests that makes it even more suitable for this competition, particularly given the runtime limitations, is that it has been extensively tested for training and inference on CPUs, making it possible to train it on lower-end machines.

# Import the Required Libraries

In [1]:
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_decision_forests as tfdf

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gc
from sklearn.metrics import f1_score, roc_auc_score


In [2]:
print("TensorFlow Decision Forests v" + tfdf.__version__)
print("TensorFlow Addons v" + tfa.__version__)
print("TensorFlow v" + tf.__version__)

TensorFlow Decision Forests v1.2.0
TensorFlow Addons v0.20.0
TensorFlow v2.11.1


# Load the Dataset

Since the dataset is huge, you may run into memory errors while reading it from the CSV file. To avoid this, we will try to reduce the memory that Pandas uses to load and store the dataset.


When Pandas loads a dataset, by default, it automatically detects the data types of the different columns.
Irrespective of the maximum value stored in a column, Pandas assigns `int64` to integer columns, `float64` to floating-point columns, `object` to string columns, and so on.


We may be able to reduce the size of these columns in memory by downcasting numerical columns to smaller types (like `int8`, `int32` or `float32`) when their values fit and do not need the larger types (like `int64` or `float64`).


Similarly, Pandas automatically detects string columns as `object` datatype. To reduce memory usage of string columns which store categorical data, we specify their datatype as `category`.


Many of the columns in this dataset can be downcast to smaller types.
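
The effect of downcasting can be checked on a small synthetic frame. The sketch below is only an illustration: the column names mirror the competition data, but the values are made up, and the exact byte counts will depend on the Pandas version.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
n = 10_000
toy = pd.DataFrame({
    "elapsed_time": np.arange(n, dtype=np.int64),
    "level": np.arange(n, dtype=np.int64) % 23,  # levels 0..22 fit in uint8
    "event_name": ["navigate_click", "person_click"] * (n // 2),
})

# Memory with the default (wide) dtypes.
before = toy.memory_usage(deep=True).sum()

# Downcast the numerical columns and store the repeated strings as categories.
compact = toy.astype({
    "elapsed_time": np.int32,
    "level": np.uint8,
    "event_name": "category",
})
after = compact.memory_usage(deep=True).sum()

print(f"default dtypes:  {before:,} bytes")
print(f"downcast dtypes: {after:,} bytes")
```

Downcasting is only safe when every value fits into the smaller type, so it is worth checking the minimum and maximum of a column before choosing its dtype.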

We will provide a dict of `dtypes` for columns to pandas while reading the dataset.

In [3]:
# Reference: https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384359
dtypes={
    'elapsed_time':np.int32,
    'event_name':'category',
    'name':'category',
    'level':np.uint8,
    'room_coor_x':np.float32,
    'room_coor_y':np.float32,
    'screen_coor_x':np.float32,
    'screen_coor_y':np.float32,
    'hover_duration':np.float32,
    'text':'category',
    'fqid':'category',
    'room_fqid':'category',
    'text_fqid':'category',
    'fullscreen':'category',
    'hq':'category',
    'music':'category',
    'level_group':'category'}
work_dir = 'data/predict-student-performance-from-game-play' 

train_df = pd.read_csv(work_dir+'/train.csv', dtype=dtypes)
print("Full train dataset shape is {}".format(train_df.shape))

Full train dataset shape is (26296946, 20)


The data is composed of 20 columns and 26296946 entries. We can see all 20 dimensions of our dataset by printing out the first 5 entries using the following code:

In [4]:
# Display the first 5 examples
train_df.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,1,1323,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,2,831,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,3,1147,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,4,1863,person_click,basic,0,,-412.991394,-159.314682,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4


Please note that `session_id` uniquely identifies a user session.

# Load the labels

The labels for the training dataset are stored in `train_labels.csv`. This file records whether the user in a particular session answered each question correctly. Load the labels by running the following code:

In [5]:
labels = pd.read_csv(work_dir + '/train_labels.csv')

Each value in the `session_id` column is a combination of the session and the question number.
We will split these into individual columns for ease of use.

In [6]:
labels['session'] = labels.session_id.apply(lambda x: int(x.split('_')[0]) )
labels['q'] = labels.session_id.apply(lambda x: int(x.split('_')[-1][1:]) )
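
The same split can also be written with Pandas' vectorized string methods instead of `apply`. This is a minimal sketch on a toy frame; `toy_labels` and the question number in its second row are made up for illustration.

```python
import pandas as pd

# Toy labels in the same "<session>_q<number>" format as train_labels.csv.
toy_labels = pd.DataFrame({
    "session_id": ["20090312431273200_q1", "20090312433251036_q18"],
    "correct": [1, 0],
})

# Split once on "_q": the left part is the session, the right part the question.
parts = toy_labels["session_id"].str.split("_q", expand=True)
toy_labels["session"] = parts[0].astype("int64")
toy_labels["q"] = parts[1].astype("int64")

print(toy_labels)
```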

Let us take a look at the first 5 entries of `labels` using the following code:

In [7]:
# Display the first 5 examples
labels.head()

Unnamed: 0,session_id,correct,session,q
0,20090312431273200_q1,1,20090312431273200,1
1,20090312433251036_q1,0,20090312433251036,1
2,20090312455206810_q1,1,20090312455206810,1
3,20090313091715820_q1,0,20090313091715820,1
4,20090313571836404_q1,1,20090313571836404,1


Our goal is to train models for each question to predict the label `correct` for any input user session. 

# Bar chart for label column: correct

First we will plot a bar chart for the values of the label `correct`.

In [8]:
# plt.figure(figsize=(3, 3))
# plot_df = labels.correct.value_counts()
# plot_df.plot(kind="bar", color=['b', 'c'])

Now, let us plot the values of the label column `correct` for each question.

In [9]:
# plt.figure(figsize=(10, 20))
# plt.subplots_adjust(hspace=0.5, wspace=0.5)
# plt.suptitle("\"Correct\" column values for each question", fontsize=14, y=0.94)
# for n in range(1,19):
#     #print(n, str(n))
#     ax = plt.subplot(6, 3, n)

#     # filter df and plot ticker on the new subplot axis
#     plot_df = labels.loc[labels.q == n]
#     plot_df = plot_df.correct.value_counts()
#     plot_df.plot(ax=ax, kind="bar", color=['b', 'c'])
    
#     # chart formatting
#     ax.set_title("Question " + str(n))
#     ax.set_xlabel("")


# Prepare the dataset

As summarized in the competition overview, the dataset presents the questions and data to us in order of level segments (represented by the column `level_group`): 0-4, 5-12, and 13-22. We have to predict the correctness of each segment's questions as they are presented. To do this, we will create basic aggregate features from the relevant columns. You can create more features to boost your scores.

First, we will create two separate lists with names of the Categorical columns and Numerical columns. We will avoid columns `fullscreen`, `hq` and `music` since they don't add any useful value for this problem statement.

In [8]:
CATEGORICAL = [ 'event_name', 'name', 'text', 'fqid', 'room_fqid', 'text_fqid', 'text_value']
NUMERICAL = ['elapsed_time', 'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y', 'hover_duration', 'time_diff', 'room_coor_x_diff', 'room_coor_y_diff', 'screen_coor_x_diff', 'screen_coor_y_diff']

In [11]:


# count_var = ['event_name', 'fqid','room_fqid', 'text']
# mean_var = ['elapsed_time','level']
# event_var = ['navigate_click','person_click','cutscene_click','object_click','map_hover','notification_click',
#             'map_click','observation_click','checkpoint','elapsed_time']



We build the features separately for each `level_group`. After selecting the rows of one group, we add a few derived columns: the row-to-row differences of `elapsed_time` and of the room and screen coordinates, and `text_value`, a flag that is 1 when `text` is missing.

For each categorical column, we count the number of **distinct elements** (`nunique`) and the number of non-missing values (`count`). For each numerical column, we calculate the `mean`, `median`, `standard deviation`, `sum`, `min` and `max`.

We compute these aggregates twice: once per `session_id` and `level`, which we then pivot so that every level gets its own set of columns, and once per `session_id` over the whole level group. Finally, we concatenate the two resulting data frames to create our new feature engineered dataset.
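
To make the grouped aggregation concrete, here is a minimal sketch on a toy frame. The data and the reduced `agg_dict` are made up for illustration. It groups by `session_id` and `level` (any other key works the same way), flattens the resulting column names, and pivots the second key into columns with `unstack`.

```python
import pandas as pd

# Toy event log: two sessions, two levels each (made-up values).
toy = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 2],
    "level": [0, 0, 1, 0, 1, 1],
    "elapsed_time": [0, 10, 30, 0, 20, 40],
    "event_name": ["a", "b", "a", "a", "a", "c"],
})

# One list of aggregations per column.
agg_dict = {"elapsed_time": ["mean", "max"], "event_name": ["nunique"]}

# Aggregate per (session, level); the columns become a two-level MultiIndex.
per_level = toy.groupby(["session_id", "level"]).agg(agg_dict)
per_level.columns = ["_".join(col) for col in per_level.columns]

# Pivot `level` into the columns so that each session ends up on one row.
wide = per_level.unstack("level")
wide.columns = ["_".join(map(str, col)) for col in wide.columns]

print(wide)
```

Each session now occupies a single row, with one column per combination of feature, aggregation and level, for example `elapsed_time_mean_0`.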

In [12]:
# # reference: https://www.kaggle.com/code/cdeotte/random-forest-baseline-0-664/notebook
# def feature_engineer(train):
#     dfs = []
#     for c in count_var:
#         tmp = train.groupby(['session_id','level_group'])[c].agg('nunique')
#         tmp.name = tmp.name + '_nunique'
#         dfs.append(tmp)
#     for c in mean_var:
#         tmp = train.groupby(['session_id','level_group'])[c].agg('mean')
#         dfs.append(tmp)
#     for c in event_var:
#         tmp = train.groupby(['session_id','level_group'])[c].agg('sum')
#         tmp.name = tmp.name + '_sum'
#         dfs.append(tmp)
#     df = pd.concat(dfs,axis=1)
#     df = df.fillna(-1)
#     df = df.reset_index()
#     df = df.set_index('session_id')
#     return df

In [9]:
# Reference: https://www.kaggle.com/code/cdeotte/random-forest-baseline-0-664/notebook

# def feature_engineer(dataset_df):
#     dfs = []
#     for c in CATEGORICAL:
#         tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('nunique')
#         tmp.name = tmp.name + '_nunique'
#         dfs.append(tmp)
#     for c in NUMERICAL:
#         tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('mean')
#         dfs.append(tmp)
#     for c in NUMERICAL:
#         tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('std')
#         tmp.name = tmp.name + '_std'
#         dfs.append(tmp)
#     # for c in NUMERICAL:
#     #     tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('sum')
#     #     tmp.name = tmp.name + '_std'
#     #     dfs.append(tmp)
#     dataset_df = pd.concat(dfs,axis=1)
#     dataset_df = dataset_df.fillna(-1)
#     dataset_df = dataset_df.reset_index()
#     dataset_df = dataset_df.set_index('session_id')
#     return dataset_df


def feature_engineer(df, gr):

    # Select the rows of one level group, e.g. "0-4"
    df = df.query(f'level_group == "{gr}"')

    # Keep only the columns we need. `.copy()` lets us add columns below
    # without writing into a slice of the original data frame.
    df = df[['session_id', 'elapsed_time', 'event_name', 'name', 'level',
    'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y',
    'hover_duration', 'text', 'fqid', 'room_fqid', 'text_fqid',
    'level_group']].copy()

    # Generate new columns: difference to the previous row.
    # Note: the shift is not done per session, so the first row of a session
    # is compared with the last row of the preceding session.
    df['time_diff'] = df['elapsed_time'] - df['elapsed_time'].shift(1)
    df['room_coor_x_diff'] = df['room_coor_x'] - df['room_coor_x'].shift(1)
    df['room_coor_y_diff'] = df['room_coor_y'] - df['room_coor_y'].shift(1)
    df['screen_coor_x_diff'] = df['screen_coor_x'] - df['screen_coor_x'].shift(1)
    df['screen_coor_y_diff'] = df['screen_coor_y'] - df['screen_coor_y'].shift(1)

    # Flag missing text: 1 if `text` is NaN, 0 otherwise
    df['text_value'] = df['text'].isna().astype('int')

    # Define aggregation operations for numerical and categorical columns
    agg_numerical = {num_col: ['mean', 'median', 'std', 'sum', 'min', 'max'] for num_col in NUMERICAL}
    agg_categorical = {cat_col: ['nunique','count'] for cat_col in CATEGORICAL}  # 'lambda x:x.value_counts().index[0] if x.nunique() else None' will compute mode

    agg_dict = {**agg_numerical, **agg_categorical}

    # Aggregate per ['session_id', 'level'], then pivot the levels into columns
    df_level = df.groupby(['session_id', 'level']).agg(agg_dict)
    df_level.columns = ['_'.join(col).strip() for col in df_level.columns.values]
    df_level = df_level.fillna(-1)
    df_level = df_level.unstack('level')
    df_level.columns = ['_'.join(map(str, col)) for col in df_level.columns]

    # Aggregate per 'session_id' over the whole level group
    # (the rows were already filtered to a single `level_group` above)
    df_level_group = df.groupby(['session_id']).agg(agg_dict)
    df_level_group.columns = ['_'.join(col).strip() for col in df_level_group.columns.values]
    df_level_group = df_level_group.fillna(-1)

    # Concatenate the two resulting dataframes
    df_final = pd.concat([df_level, df_level_group], axis=1)

    return df_final

In [10]:
gc.collect()
#feature generation no split
df1_features = feature_engineer(train_df, "0-4" )
print(df1_features.shape)
df2_features = feature_engineer(train_df, "5-12" )
print(df2_features.shape)
df3_features = feature_engineer(train_df, "13-22")
print(df3_features.shape)

(23562, 480)
(23562, 720)
(23562, 880)
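
The three widths can be checked by hand. The sketch below assumes the `NUMERICAL` and `CATEGORICAL` lists and the aggregations used in `feature_engineer` above, and it assumes that every level of a group occurs in at least one session, so that `unstack` creates a column block for each level.

```python
# 11 numerical columns with 6 aggregations each,
# 7 categorical columns with 2 aggregations each.
n_numerical, n_numerical_aggs = 11, 6
n_categorical, n_categorical_aggs = 7, 2
features_per_block = n_numerical * n_numerical_aggs + n_categorical * n_categorical_aggs

# Number of levels in each level group (0-4, 5-12, 13-22).
levels_per_group = {"0-4": 5, "5-12": 8, "13-22": 10}

# One block per level plus one block for the whole level group.
expected_width = {
    group: features_per_block * (n_levels + 1)
    for group, n_levels in levels_per_group.items()
}
print(expected_width)
```

This gives 80 features per block and 480, 720 and 880 columns in total, which matches the shapes printed above.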


In [13]:
df3_features.dtypes

elapsed_time_mean_13    float64
elapsed_time_mean_14    float64
elapsed_time_mean_15    float64
elapsed_time_mean_16    float64
elapsed_time_mean_17    float64
                         ...   
room_fqid_count           int64
text_fqid_nunique         int64
text_fqid_count           int64
text_value_nunique        int64
text_value_count          int64
Length: 880, dtype: object

In [15]:
del train_df
gc.collect()

0

Our feature engineered data consists of three data frames, one per level group. Each has 23562 rows (one per session), and they have 480, 720 and 880 columns respectively.

# Basic exploration of the prepared dataset

Let us print out the first 5 entries using the following code:

In [16]:
# # Display the first 5 examples
# dataset_df.head(5)

In [17]:
# dataset_df.describe()

# Numerical data distribution

Let us plot some numerical columns and their value against each level_group:

In [18]:
# figure, axis = plt.subplots(3, 2, figsize=(10, 10))

# for name, data in dataset_df.groupby('level_group'):
#     axis[0, 0].plot(range(1, len(data['room_coor_x_std'])+1), data['room_coor_x_std'], label=name)
#     axis[0, 1].plot(range(1, len(data['room_coor_y_std'])+1), data['room_coor_y_std'], label=name)
#     axis[1, 0].plot(range(1, len(data['screen_coor_x_std'])+1), data['screen_coor_x_std'], label=name)
#     axis[1, 1].plot(range(1, len(data['screen_coor_y_std'])+1), data['screen_coor_y_std'], label=name)
#     axis[2, 0].plot(range(1, len(data['hover_duration'])+1), data['hover_duration_std'], label=name)
#     axis[2, 1].plot(range(1, len(data['elapsed_time_std'])+1), data['elapsed_time_std'], label=name)
    

# axis[0, 0].set_title('room_coor_x')
# axis[0, 1].set_title('room_coor_y')
# axis[1, 0].set_title('screen_coor_x')
# axis[1, 1].set_title('screen_coor_y')
# axis[2, 0].set_title('hover_duration')
# axis[2, 1].set_title('elapsed_time_std')

# for i in range(3):
#     axis[i, 0].legend()
#     axis[i, 1].legend()

# plt.show()

Now let us split the dataset into training and testing datasets:

In [19]:
def split_dataset(dataset, test_ratio=0.20):
    USER_LIST = dataset.index.unique()
    # Keep the first (1 - test_ratio) share of the sessions for training.
    split = int(len(USER_LIST) * (1 - test_ratio))
    return dataset.loc[USER_LIST[:split]], dataset.loc[USER_LIST[split:]]


In [20]:
df1_train, df1_valid = split_dataset(df1_features)
print("{} examples in training, {} examples in testing.".format(
    len(df1_train), len(df1_valid)))

df2_train, df2_valid = split_dataset(df2_features)
print("{} examples in training, {} examples in testing.".format(
    len(df2_train), len(df2_valid)))

df3_train, df3_valid = split_dataset(df3_features)
print("{} examples in training, {} examples in testing.".format(
    len(df3_train), len(df3_valid)))

18849 examples in training, 4713 examples in testing.
18849 examples in training, 4713 examples in testing.
18849 examples in training, 4713 examples in testing.
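
The split is done on sessions, not on individual rows, so no session ends up in both parts. The following sketch illustrates this on a toy frame; `split_by_session` is a hypothetical helper written for this example and the data is made up.

```python
import pandas as pd

def split_by_session(dataset, test_ratio=0.20):
    """Split a frame indexed by session into a leading and a trailing part."""
    sessions = dataset.index.unique()
    cut = int(len(sessions) * (1 - test_ratio))
    return dataset.loc[sessions[:cut]], dataset.loc[sessions[cut:]]

# Toy feature frame: 8 sessions, one row per session.
toy = pd.DataFrame(
    {"feature": range(8)},
    index=pd.Index(range(100, 108), name="session_id"),
)

train_part, valid_part = split_by_session(toy, test_ratio=0.25)
print(len(train_part), "sessions for training,", len(valid_part), "for validation")
```

Because the sessions are taken in their stored order, this is not a random split. Shuffling the session list first would be one possible variation.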


# Select a Model
There are several tree-based models for you to choose from.

- RandomForestModel
- GradientBoostedTreesModel
- CartModel
- DistributedGradientBoostedTreesModel

We can list all the available models in TensorFlow Decision Forests using the following code:

In [21]:
tfdf.keras.get_all_models()

[tensorflow_decision_forests.keras.RandomForestModel,
 tensorflow_decision_forests.keras.GradientBoostedTreesModel,
 tensorflow_decision_forests.keras.CartModel,
 tensorflow_decision_forests.keras.DistributedGradientBoostedTreesModel]

To get started, we'll work with a Gradient Boosted Trees Model. This is one of the well-known Decision Forest training algorithms.

A Gradient Boosted Decision Tree is a set of shallow decision trees trained sequentially. Each tree is trained to predict and then "correct" for the errors of the previously trained trees.
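
The idea of sequentially correcting errors can be sketched in a few lines of plain Python. This is not how TensorFlow Decision Forests implements it: the sketch uses one-split trees (stumps), a squared-error loss and made-up one-dimensional data, purely to illustrate the principle.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave examples on both sides
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees, learning_rate=0.3):
    """Start from the mean, then let each stump fit the remaining errors."""
    preds = [sum(ys) / len(ys)] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up data: a "bump" that a single split cannot represent.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]

mse_base = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_1 = mse(ys, boost(xs, ys, n_trees=1))
mse_20 = mse(ys, boost(xs, ys, n_trees=20))
print(mse_base, mse_1, mse_20)
```

The training error shrinks as more stumps are added, because each new stump is fitted to what the previous ones still get wrong.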

# How can I configure a tree-based model?

TensorFlow Decision Forests provides good defaults for you (e.g., the top ranking hyperparameters on our benchmarks, slightly modified to run in reasonable time). If you would like to configure the learning algorithm, you will find many options you can explore to get the highest possible accuracy.

You can select a template and/or set parameters as follows:
```
model = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1")
```

You can read more [here](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel).

# Training


We will train a model for each question to predict if the question will be answered correctly by a user. 
There are a total of 18 questions in the dataset. Hence, we will be training 18 models, one for each question.

We need to provide a few data structures to our training loop to store the trained models, predictions on the validation set and evaluation scores for the trained models.

We will create these using the following code:


In [22]:
# Fetch the unique list of user sessions in the validation dataset. We assigned 
# `session_id` as the index of our feature engineered dataset. Hence fetching 
# the unique values in the index column will give us a list of users in the 
# validation set.
VALID_USER_LIST = df1_valid.index.unique()

# Create a dataframe for storing the predictions of each question for all users
# in the validation set.
# For this, the required size of the data frame is: 
# (no: of users in validation set  x no of questions).
# We will initialize all the predicted values in the data frame to zero.
# The dataframe's index column is the user `session_id`s. 
prediction_df = pd.DataFrame(data=np.zeros((len(VALID_USER_LIST),18)), index=VALID_USER_LIST)

# Create an empty dictionary to store the models created for each question.
models = {}

# Create an empty dictionary to store the evaluation score for each question.
evaluation_dict ={}
f1_scores = {}
auc_scores = {}

Before training the models, we have to understand how `level_groups` and `questions` are associated with each other.

In this game, the first quiz checkpoint (i.e., questions 1 to 3) comes after finishing levels 0 to 4. So, to train the models for questions 1 to 3, we will use data from the `level_group` 0-4. Similarly, we will use data from the `level_group` 5-12 for questions 4 to 13 and data from the `level_group` 13-22 for questions 14 to 18.
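
This mapping from question number to level group can be captured in a small helper. `level_group_for_question` is a hypothetical name used only in this sketch.

```python
def level_group_for_question(q_no):
    """Return the level group whose data is used for question `q_no` (1 to 18)."""
    if not 1 <= q_no <= 18:
        raise ValueError(f"question number must be between 1 and 18, got {q_no}")
    if q_no <= 3:
        return "0-4"
    if q_no <= 13:
        return "5-12"
    return "13-22"

# Count how many questions fall into each level group.
questions_per_group = {}
for q_no in range(1, 19):
    group = level_group_for_question(q_no)
    questions_per_group[group] = questions_per_group.get(group, 0) + 1
print(questions_per_group)
```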

We will train a model for each question and store the trained model in the `models` dict.

In [23]:
gc.collect()

0

In [24]:
# Iterate through questions 1 to 18 to train models for each question, evaluate
# the trained model and store the predicted values.
for q_no in range(1,19):
    # Select level group for the question based on the q_no.
    if q_no<=3:
        train_df = df1_train
        valid_df = df1_valid
    elif q_no<=13:
        train_df = df2_train
        valid_df = df2_valid
    else:
        train_df = df3_train
        valid_df = df3_valid

    # The feature frames are indexed by `session_id`, so the index values
    # are the user sessions of each split.
    train_users = train_df.index.values
    valid_users = valid_df.index.values

    # Select the labels for the related q_no.
    train_labels = labels.loc[labels.q==q_no].set_index('session').loc[train_users]
    valid_labels = labels.loc[labels.q==q_no].set_index('session').loc[valid_users]

    # Add the label to the filtered datasets.
    train_df["correct"] = train_labels["correct"]
    valid_df["correct"] = valid_labels["correct"]

    # There's one more step required before we can train the model. 
    # We need to convert the dataset from Pandas format (pd.DataFrame)
    # into a TensorFlow dataset (tf.data.Dataset).
    # `tf.data` is TensorFlow's high performance data loading API, 
    # which is helpful when training neural networks with accelerators like GPUs and TPUs.
    # The feature frames contain only the aggregated features and the label,
    # so no columns need to be dropped here.
    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="correct")
    valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_df, label="correct")

    # We will now create the Gradient Boosted Trees Model with default settings. 
    # By default the model is set to train for a classification task.
    gbtm = tfdf.keras.GradientBoostedTreesModel(verbose=0)
    gbtm.compile(metrics=["accuracy"])

    # Train the model.
    gbtm.fit(x=train_ds)

    # Store the model
    models[f'{q_no}'] = gbtm

    # Evaluate the trained model on the validation dataset and store the 
    # evaluation accuracy in the `evaluation_dict`.
    inspector = gbtm.make_inspector()
    inspector.evaluation()
    evaluation = gbtm.evaluate(x=valid_ds,return_dict=True)
    evaluation_dict[q_no] = evaluation["accuracy"]         

    # Use the trained model to make predictions on the validation dataset and 
    # store the predicted values in the `prediction_df` dataframe.
    predict = gbtm.predict(x=valid_ds)
    prediction_df.loc[valid_users, q_no-1] = predict.flatten()
    
    # Calculate the F1 score
    f1 = f1_score(valid_labels["correct"], (predict > 0.5).astype(int))
    f1_scores[q_no] = f1
    
    # Calculate the AUC score
    auc = roc_auc_score(valid_labels["correct"], predict)
    auc_scores[q_no] = auc
    
    
    del train_df, valid_df, train_labels, valid_labels
    gc.collect()
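
The F1 score depends on the threshold used to turn predicted probabilities into 0/1 predictions (0.5 in the loop above). The sketch below computes the F1 score of the positive class by hand on made-up labels and probabilities, to show what the metric measures and how the threshold affects it. `f1_at_threshold` is a hypothetical helper, not part of scikit-learn.

```python
def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """F1 score of the positive class after thresholding the probabilities."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.2, 0.8]

f1_default = f1_at_threshold(y_true, y_prob, threshold=0.5)
f1_lower = f1_at_threshold(y_true, y_prob, threshold=0.3)
print(f1_default, f1_lower)
```

On this toy data, lowering the threshold recovers one more correct answer without adding false positives, so the F1 score rises. On real data the best threshold has to be found on the validation set.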

2023-06-09 14:57:01.545128: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:57:01.545149: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:57:01.545152: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:57:01.554615: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:57:01.554644: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 480 feature(s).
2023-06-09 14:57:0



2023-06-09 14:57:16.852617: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:57:16.852642: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:57:16.852645: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:57:16.860498: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:57:16.860521: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 480 feature(s).
2023-06-09 14:57:1







2023-06-09 14:57:36.215695: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:57:36.215720: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:57:36.215723: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:57:36.233156: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:57:36.233180: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).
2023-06-09 14:57:3







2023-06-09 14:58:03.028703: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:58:03.028728: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:58:03.028731: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:58:03.045234: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:58:03.045248: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).
2023-06-09 14:58:0







2023-06-09 14:58:22.918022: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:58:22.918046: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:58:22.918050: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:58:22.936191: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:58:22.936205: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).
2023-06-09 14:58:2







2023-06-09 14:58:48.402710: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:58:48.402733: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:58:48.402736: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:58:48.420945: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:58:48.420978: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).

2023-06-09 14:59:15.376509: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:59:15.376533: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:59:15.376536: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:59:15.393417: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:59:15.393432: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).

2023-06-09 14:59:33.878148: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:59:33.878173: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:59:33.878177: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:59:33.895558: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:59:33.895587: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).

2023-06-09 14:59:59.507971: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:59:59.507996: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 14:59:59.507999: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 14:59:59.525988: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 14:59:59.526015: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).

2023-06-09 15:00:19.766246: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:00:19.766277: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:00:19.766281: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:00:19.782360: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:00:19.782384: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).

2023-06-09 15:00:38.047882: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:00:38.047900: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:00:38.047904: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:00:38.065081: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:00:38.065095: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).

2023-06-09 15:00:57.051355: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:00:57.051379: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:00:57.051383: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:00:57.068329: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:00:57.068343: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 720 feature(s).

2023-06-09 15:01:18.255348: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:01:18.255371: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:01:18.255375: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:01:18.280339: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:01:18.280363: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 880 feature(s).

2023-06-09 15:01:50.605505: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:01:50.605532: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:01:50.605535: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:01:50.631075: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:01:50.631089: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 880 feature(s).

2023-06-09 15:02:19.509721: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:02:19.509746: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:02:19.509749: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:02:19.532886: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:02:19.532911: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 880 feature(s).

2023-06-09 15:02:40.944441: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:02:40.944469: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:02:40.944472: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:02:40.970615: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:02:40.970639: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 880 feature(s).

2023-06-09 15:03:11.174980: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:03:11.175005: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 15:03:11.175008: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 15:03:11.198925: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 15:03:11.198946: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 18849 example(s) and 880 feature(s).

In [34]:
gbtm.summary()

Model: "gradient_boosted_trees_model_17"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (880):
	elapsed_time_max
	elapsed_time_max_13
	elapsed_time_max_14
	elapsed_time_max_15
	elapsed_time_max_16
	elapsed_time_max_17
	elapsed_time_max_18
	elapsed_time_max_19
	elapsed_time_max_20
	elapsed_time_max_21
	elapsed_time_max_22
	elapsed_time_mean
	elapsed_time_mean_13
	elapsed_time_mean_14
	elapsed_time_mean_15
	elapsed_time_mean_16
	elapsed_time_mean_17
	elapsed_time_mean_18
	elapsed_time_mean_19
	elapsed_time_mean_20
	elapsed_time_mean_21
	elapsed_time_mean_22
	elapsed_time_median
	elapsed_time_median_13
	elapsed_time_median_14
	elapsed_time_median_15
	elapsed_time_median_16
	elapsed_time_median_17
	

### Training final model - No validation set 

In [26]:
# Iterate through questions 1 to 18, training one model per question on the
# full training data (no validation split) and storing each trained model.
for q_no in range(1, 19):
    # Select the feature table for the level group this question belongs to.
    if q_no <= 3:
        X = df1_features
    elif q_no <= 13:
        X = df2_features
    else:
        X = df3_features

    # Sessions (index values) present in the selected feature table.
    X_users = X.index.values

    # Select the labels for this question, aligned to the sessions in X.
    X_labels = labels.loc[labels.q == q_no].set_index('session').loc[X_users]

    # Add the label column. X refers to the shared feature table, so
    # "correct" is overwritten in place on every iteration.
    X["correct"] = X_labels["correct"]

    # There's one more step required before we can train the model.
    # We need to convert the dataset from pandas format (pd.DataFrame)
    # into a tf.data.Dataset, the input format the model trains on.
    X_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X, label="correct")

    # We will now create the Gradient Boosted Trees Model with default settings.
    # By default the model is set to train for a classification task.
    gbtm = tfdf.keras.GradientBoostedTreesModel(verbose=0)
    gbtm.compile(metrics=["accuracy"])

    # Train the model.
    gbtm.fit(x=X_ds)

    #gbtm = load_model(f'Tensorflow_models/model_{q_no}')

    # Store the model
    models[f'{q_no}'] = gbtm
    print(f'model{q_no} is done')

    # Free some memory before the next question.
    del X, X_ds, gbtm
    gc.collect()
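
The branching on the question number in the loop above can also be expressed as a small lookup. The following is a minimal sketch and not part of the original notebook: the names `QUESTION_RANGES`, `level_group_for_question` and `questions_by_group` are hypothetical, the group labels are used as plain dictionary keys, and the ranges simply copy the thresholds from the loop (questions 1-3, 4-13 and 14-18).

```python
# Hypothetical helper (not in the original notebook): map a question number
# to the level group whose feature table is used to train its model.
# Ranges are half-open, [first, stop), copied from the loop's thresholds.
QUESTION_RANGES = {
    "0-4": (1, 4),      # questions 1-3
    "5-12": (4, 14),    # questions 4-13
    "13-22": (14, 19),  # questions 14-18
}


def level_group_for_question(q_no: int) -> str:
    """Return the level group label for a question number between 1 and 18."""
    for group, (first, stop) in QUESTION_RANGES.items():
        if first <= q_no < stop:
            return group
    raise ValueError(f"question number out of range: {q_no}")


# List the questions that belong to each level group.
questions_by_group = {
    group: list(range(first, stop))
    for group, (first, stop) in QUESTION_RANGES.items()
}
print(questions_by_group["0-4"])  # [1, 2, 3]
```

Keeping the ranges in one place means the training loop and any later prediction code can share the same question-to-group mapping instead of repeating the thresholds.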

In [45]:
# Create an empty dictionary to store the models created for each question.
models = {}
# Half-open question ranges (first, last + 1) for each level group.
limits = {'0-4':(1,4), '5-12':(4,14), '13-22':(14,19)}

for grp in limits:
    a, b = limits[grp]

    # Select appropriate dataset
    if a == 1:
        X = df1_features
    elif a == 4:            
        del X
        gc.collect()
        X = df2_features
    elif a == 14:
        del X
        gc.collect()
        X = df3_features
    

    # Training models
    for t in range(a, b):

        X_users = X.index.values
        X_labels = labels.loc[labels.q == t].set_index('session').loc[X_users]
        # Add the labels to X dataframe
        X["correct"] = X_labels["correct"]
        
        # Creating TensorFlow Dataset
        X_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X, label="correct")

        # Create a Gradient Boosted Trees model with default settings
        # (classification by default).
        gbtm = tfdf.keras.GradientBoostedTreesModel(verbose=0)
        gbtm.compile(metrics=["accuracy"])

        # Train the model.
        gbtm.fit(x=X_ds)
        
        # Store the model
        models[f'{t}'] = gbtm

        # Free some memory before the next question.
        del gbtm, X_ds
        gc.collect()
        print(f'model{t} is done')

2023-06-09 17:25:22.014476: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:25:22.014501: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:25:22.014504: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:25:22.037498: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:25:22.037529: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 480 feature(s).

model1 is done


2023-06-09 17:25:50.896362: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:25:50.896385: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:25:50.896388: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:25:50.905001: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:25:50.905028: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 480 feature(s).

model2 is done


2023-06-09 17:26:02.555188: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:26:02.555209: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:26:02.555213: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:26:02.579262: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:26:02.579294: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 480 feature(s).

model3 is done


2023-06-09 17:26:24.327094: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:26:24.327119: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:26:24.327122: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:26:24.373799: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:26:24.373830: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model4 is done


2023-06-09 17:26:49.820736: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:26:49.820759: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:26:49.820763: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:26:49.855277: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:26:49.855308: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model5 is done


2023-06-09 17:27:12.271769: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:27:12.271793: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:27:12.271797: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:27:12.303448: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:27:12.303482: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model6 is done


2023-06-09 17:27:40.661168: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:27:40.661194: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:27:40.661198: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:27:40.679882: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:27:40.679905: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model7 is done


2023-06-09 17:28:10.020199: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:28:10.020222: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:28:10.020227: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:28:10.056796: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:28:10.056829: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model8 is done


2023-06-09 17:28:30.282511: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:28:30.282535: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:28:30.282539: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:28:30.315752: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:28:30.315787: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model9 is done


2023-06-09 17:29:01.098233: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:29:01.098262: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:29:01.098266: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:29:01.116449: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:29:01.116475: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model10 is done


2023-06-09 17:29:25.790612: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:29:25.790636: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:29:25.790639: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:29:25.809096: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:29:25.809122: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model11 is done


2023-06-09 17:29:48.595018: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:29:48.595061: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:29:48.595064: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:29:48.633135: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:29:48.633189: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model12 is done


2023-06-09 17:30:09.347591: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:30:09.347614: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:30:09.347618: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:30:09.382733: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:30:09.382765: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 720 feature(s).

model13 is done


2023-06-09 17:30:33.645875: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:30:33.645902: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:30:33.645905: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:30:33.692086: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:30:33.692118: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 880 feature(s).

model14 is done


2023-06-09 17:31:04.356553: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:31:04.356578: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:31:04.356581: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:31:04.428400: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:31:04.428432: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 880 feature(s).

model15 is done


2023-06-09 17:31:41.019625: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:31:41.019649: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:31:41.019652: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:31:41.063827: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:31:41.063864: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 880 feature(s).

model16 is done


2023-06-09 17:32:04.044043: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:32:04.044068: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:32:04.044072: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:32:04.092329: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:32:04.092361: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 880 feature(s).
2023-06-09 17:32:0

model17 is done


2023-06-09 17:32:29.119983: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1790] "goss_alpha" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:32:29.120003: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1800] "goss_beta" set but "sampling_method" not equal to "GOSS".
2023-06-09 17:32:29.120007: W external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1814] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
2023-06-09 17:32:29.165305: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
2023-06-09 17:32:29.165341: I external/ydf/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.cc:1077] Training gradient boosted tree on 23562 example(s) and 880 feature(s).
2023-06-09 17:32:2

model18 is done


In [48]:
models['18'].summary()

Model: "gradient_boosted_trees_model_72"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (880):
	elapsed_time_max
	elapsed_time_max_13
	elapsed_time_max_14
	elapsed_time_max_15
	elapsed_time_max_16
	elapsed_time_max_17
	elapsed_time_max_18
	elapsed_time_max_19
	elapsed_time_max_20
	elapsed_time_max_21
	elapsed_time_max_22
	elapsed_time_mean
	elapsed_time_mean_13
	elapsed_time_mean_14
	elapsed_time_mean_15
	elapsed_time_mean_16
	elapsed_time_mean_17
	elapsed_time_mean_18
	elapsed_time_mean_19
	elapsed_time_mean_20
	elapsed_time_mean_21
	elapsed_time_mean_22
	elapsed_time_median
	elapsed_time_median_13
	elapsed_time_median_14
	elapsed_time_median_15
	elapsed_time_median_16
	elapsed_time_median_17
	

# Inspect the Accuracy of the models.

We trained one model per question. Now let us check the accuracy, F1 score, and AUC of each model, along with the average of each metric across all 18 models.

Note: Since the label distribution is imbalanced, we can't make an assumption on the model performance from accuracy score alone. 
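
To see why accuracy can mislead on imbalanced labels, here is a small self-contained sketch. The toy labels and the always-predict-1 baseline are made up for illustration and are not taken from the competition data. The helper functions are hypothetical and written in plain Python rather than with `sklearn`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2


# Toy data (assumption): 90% of answers are correct, and a trivial baseline
# predicts "correct" for every example.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
print(f"accuracy={toy_accuracy:.3f}  macro F1={toy_macro_f1:.3f}")
```

On this toy data the baseline reaches an accuracy of 0.90 while its macro F1 is only about 0.47, because it never identifies the minority class.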

In [27]:
for name, value in evaluation_dict.items():
  print(f"question {name}: accuracy {value:.4f}")

print("\nAverage accuracy", sum(evaluation_dict.values())/18)


for name, value in f1_scores.items():
  print(f"question {name}: f1_score {value:.4f}")

print("\nAverage f1_score", sum(f1_scores.values())/18)


for name, value in auc_scores.items():
  print(f"question {name}: auc_score {value:.4f}")

print("\nAverage auc_score", sum(auc_scores.values())/18)

question 1: accuracy 0.7560
question 2: accuracy 0.9728
question 3: accuracy 0.9338
question 4: accuracy 0.8080
question 5: accuracy 0.6435
question 6: accuracy 0.7906
question 7: accuracy 0.7454
question 8: accuracy 0.6351
question 9: accuracy 0.7708
question 10: accuracy 0.6153
question 11: accuracy 0.6635
question 12: accuracy 0.8684
question 13: accuracy 0.7191
question 14: accuracy 0.7431
question 15: accuracy 0.6488
question 16: accuracy 0.7481
question 17: accuracy 0.6991
question 18: accuracy 0.9497

Average accuracy 0.7617346809970008
question 1: f1_score 0.8496
question 2: f1_score 0.9862
question 3: f1_score 0.9657
question 4: f1_score 0.8882
question 5: f1_score 0.7044
question 6: f1_score 0.8790
question 7: f1_score 0.8487
question 8: f1_score 0.7657
question 9: f1_score 0.8640
question 10: f1_score 0.6419
question 11: f1_score 0.7852
question 12: f1_score 0.9295
question 13: f1_score 0.1524
question 14: f1_score 0.8423
question 15: f1_score 0.6515
question 16: f1_score 0.

# Visualize the model

One benefit of tree-based models is that they are easy to visualize. By default, TensorFlow Decision Forests trains up to 300 trees for a Gradient Boosted Trees model.

Let us pick one model from the `models` dict and display its first tree below.

In [28]:
tfdf.model_plotter.plot_model_in_colab(models['1'], tree_idx=0, max_depth=3)

# Variable importances

Variable importances generally indicate how much a feature contributes to the model predictions or quality. There are several ways to identify important features using TensorFlow Decision Forests. Let us pick one model from the `models` dict and inspect it.

Let us list the available Variable Importances for Decision Trees:

In [29]:
inspector = models['1'].make_inspector()

print(f"Available variable importances:")
for importance in inspector.variable_importances().keys():
  print("\t", importance)

Available variable importances:
	 SUM_SCORE
	 INV_MEAN_MIN_DEPTH
	 NUM_AS_ROOT
	 NUM_NODES


As an example, let us display the important features for the variable importance NUM_AS_ROOT.

The larger a feature's NUM_AS_ROOT score, the more impact it has on the outcome of the model for Question 1 (i.e., `models['1']`).

By default, the list is sorted from the most important to the least. From the output you can infer that the feature at the top of the list is used as the root node in more trees of the gradient boosted trees model than any other feature.
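
As a rough illustration of what NUM_AS_ROOT measures, the sketch below counts how often each feature appears at the root of a tree. It is not how TensorFlow Decision Forests computes the value internally: the six-tree forest and the `num_as_root` helper are hypothetical, and only the feature names are borrowed from this notebook.

```python
from collections import Counter


def num_as_root(root_features):
    """Count how many trees use each feature at the root, most frequent first."""
    return Counter(root_features).most_common()


# Hypothetical forest of six trees, each described only by its root feature.
toy_root_features = [
    "time_diff_max_4",
    "fqid_count_3",
    "time_diff_max_4",
    "fqid_count",
    "time_diff_max_4",
    "fqid_count_3",
]

ranking = num_as_root(toy_root_features)
for feature, count in ranking:
    print(f"{feature}: root of {count} tree(s)")
```

A feature that is often chosen for the very first split separates the data well on its own, which is why this count is read as a measure of importance.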

In [30]:
# Each line is: (feature name, (index of the feature), importance score)
inspector.variable_importances()["NUM_AS_ROOT"]

[("time_diff_max_4" (1; #450), 19.0),
 ("fqid_count_3" (1; #53), 7.0),
 ("fqid_count" (1; #49), 3.0),
 ("name_nunique_3" (1; #107), 3.0),
 ("room_coor_x_median" (1; #157), 3.0),
 ("name_nunique" (1; #103), 2.0),
 ("screen_coor_x_mean_3" (1; #311), 2.0),
 ("screen_coor_x_median_4" (1; #318), 2.0),
 ("screen_coor_y_sum" (1; #403), 2.0),
 ("fqid_count_2" (1; #52), 1.0),
 ("hover_duration_median_4" (1; #78), 1.0),
 ("name_nunique_4" (1; #108), 1.0),
 ("room_coor_x_diff_mean_2" (1; #118), 1.0),
 ("room_coor_x_diff_mean_3" (1; #119), 1.0),
 ("room_coor_x_max" (1; #145), 1.0),
 ("room_coor_x_max_2" (1; #148), 1.0),
 ("room_coor_x_mean" (1; #151), 1.0),
 ("room_coor_x_min_1" (1; #165), 1.0),
 ("room_coor_x_min_2" (1; #166), 1.0),
 ("room_coor_x_sum_3" (1; #179), 1.0),
 ("room_coor_y_diff_median" (1; #193), 1.0),
 ("room_coor_y_diff_median_2" (1; #196), 1.0),
 ("screen_coor_x_mean_4" (1; #312), 1.0),
 ("screen_coor_x_min_3" (1; #323), 1.0),
 ("screen_coor_y_diff_sum" (1; #367), 1.0),
 ("screen_

# Saving Model

In [38]:
import os

# Directory that holds one sub-directory per saved model
save_directory = 'Tensorflow_models'

# Iterate through the `models` dictionary and save each model
for key, model in models.items():
    # Save each model in a separate directory, e.g. Tensorflow_models/model_1
    model_path = os.path.join(save_directory, f'model_{key}')
    model.save(model_path)



INFO:tensorflow:Assets written to: Tensorflow_models/model_1/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_2/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_3/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_4/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_5/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_6/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_7/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_8/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_9/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_10/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_11/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_12/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_13/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_14/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_15/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_16/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_17/assets
INFO:tensorflow:Assets written to: Tensorflow_models/model_18/assets
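
When the models are needed again later, the directory layout written above can be mapped back to the dictionary keys. The sketch below only recovers the `key -> path` mapping. The `find_saved_models` helper is hypothetical, and the demo runs on empty stand-in folders in a temporary directory rather than on real saved models.

```python
import os
import tempfile


def find_saved_models(save_directory, prefix="model_"):
    """Map each model key to its directory, e.g. '3' -> '<dir>/model_3'."""
    found = {}
    for entry in os.listdir(save_directory):
        path = os.path.join(save_directory, entry)
        if os.path.isdir(path) and entry.startswith(prefix):
            found[entry[len(prefix):]] = path
    return found


# Demo on a throwaway directory with empty stand-in folders (no real models).
with tempfile.TemporaryDirectory() as tmp:
    for key in ("1", "2", "18"):
        os.makedirs(os.path.join(tmp, f"model_{key}"))
    saved = find_saved_models(tmp)
    print(sorted(saved, key=int))
```

Each recovered path should then be loadable with `tf.keras.models.load_model`, provided `tensorflow_decision_forests` has been imported first. That step is left out here because it needs the actual saved models.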


# Threshold-Moving for Imbalanced Classification

Since the values in the column `correct` are fairly imbalanced, using the default threshold of `0.5` to map the predictions into classes 0 or 1 can result in poor performance.
In such cases, to improve performance we will calculate the `F1 score` for a range of thresholds and pick the best one, i.e., the threshold with the highest `F1 score`. We will then use this threshold to map the predicted probabilities to class labels 0 or 1.

Please note that we are using `F1 score` since it is a better metric than `accuracy` to evaluate problems with class imbalance.
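
The idea can be sketched without TensorFlow on a handful of made-up predictions. Everything in the block below is an assumption for illustration: the toy probabilities and labels, and the hypothetical helpers `macro_f1` and `find_best_threshold`. Only the threshold grid (0.40 to 0.79 in steps of 0.01) mirrors the cells that follow.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def find_best_threshold(y_true, probabilities, thresholds):
    """Return (threshold, macro F1) for the first threshold with the highest score."""
    best_t, best_score = None, -1.0
    for t in thresholds:
        y_pred = [int(p > t) for p in probabilities]
        score = macro_f1(y_true, y_pred)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy data (assumption): the model is over-confident, so three of the four
# negative examples still receive a probability above 0.5.
toy_probabilities = [0.95, 0.90, 0.85, 0.80, 0.75, 0.705, 0.615, 0.58, 0.52, 0.45]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# Threshold grid: 0.40, 0.41, ..., 0.79
grid = [round(0.40 + 0.01 * i, 2) for i in range(40)]

default_score = macro_f1(toy_labels, [int(p > 0.5) for p in toy_probabilities])
best_t, best_score = find_best_threshold(toy_labels, toy_probabilities, grid)
print(f"default 0.5: {default_score:.2f}   best {best_t}: {best_score:.2f}")
```

On this toy data the default threshold gives a macro F1 of 0.60, while moving the threshold to 0.62 separates the two classes perfectly. Real predictions overlap more, so the gain is usually smaller.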

In [32]:
# Create a dataframe of required size:
# (no: of users in validation set x no: of questions) initialized to zero values
# to store true values of the label `correct`. 
true_df = pd.DataFrame(data=np.zeros((len(VALID_USER_LIST),18)), index=VALID_USER_LIST)
for i in range(18):
    # Get the true labels.
    tmp = labels.loc[labels.q == i+1].set_index('session').loc[VALID_USER_LIST]
    true_df[i] = tmp.correct.values

max_score = 0
best_threshold = 0

# Loop through threshold values from 0.40 to 0.79 (step 0.01) and select the
# threshold with the highest `F1 score`.
for threshold in np.arange(0.4,0.8,0.01):
    metric = tfa.metrics.F1Score(num_classes=2,average="macro",threshold=threshold)
    y_true = tf.one_hot(true_df.values.reshape((-1)), depth=2)
    y_pred = tf.one_hot((prediction_df.values.reshape((-1))>threshold).astype('int'), depth=2)
    metric.update_state(y_true, y_pred)
    # Named `score` so that the `f1_score` function imported from sklearn is not shadowed.
    score = metric.result().numpy()
    if score > max_score:
        max_score = score
        best_threshold = threshold
        
print("Best threshold ", best_threshold, "\tF1 score ", max_score)

Best threshold  0.6200000000000002 	F1 score  0.685995


In [33]:
# Create a dataframe of required size:
# (no: of users in validation set x no: of questions) initialized to zero values
# to store true values of the label `correct`. 
true_df = pd.DataFrame(data=np.zeros((len(VALID_USER_LIST),18)), index=VALID_USER_LIST)
for i in range(18):
    # Get the true labels.
    tmp = labels.loc[labels.q == i+1].set_index('session').loc[VALID_USER_LIST]
    true_df[i] = tmp.correct.values

best_thresholds = np.zeros(18)
max_scores = np.zeros(18)

# Loop through each question
for i in range(18):
    # Question-local trackers, so that the overall `best_threshold` is not overwritten.
    q_max_score = 0
    q_best_threshold = 0
    # Loop through threshold values from 0.40 to 0.79 (step 0.01) and select the
    # threshold with the highest `F1 score` for each question.
    for threshold in np.arange(0.4,0.8,0.01):
        metric = tfa.metrics.F1Score(num_classes=2,average="macro",threshold=threshold)
        y_true = tf.one_hot(true_df[i].values, depth=2)
        y_pred = tf.one_hot((prediction_df[i].values>threshold).astype('int'), depth=2)
        metric.update_state(y_true, y_pred)
        # Named `score` so that the `f1_score` function imported from sklearn is not shadowed.
        score = metric.result().numpy()
        if score > q_max_score:
            q_max_score = score
            q_best_threshold = threshold
    best_thresholds[i] = q_best_threshold
    max_scores[i] = q_max_score

for i in range(18):
    print(f"Question {i+1}: Best threshold {best_thresholds[i]}, F1 score {max_scores[i]}")

Question 1: Best threshold 0.6700000000000003, F1 score 0.6619926691055298
Question 2: Best threshold 0.7800000000000004, F1 score 0.5130048394203186
Question 3: Best threshold 0.7800000000000004, F1 score 0.550042986869812
Question 4: Best threshold 0.7700000000000004, F1 score 0.6647797226905823
Question 5: Best threshold 0.5500000000000002, F1 score 0.6384062767028809
Question 6: Best threshold 0.7200000000000003, F1 score 0.6403340101242065
Question 7: Best threshold 0.6900000000000003, F1 score 0.6104722023010254
Question 8: Best threshold 0.6000000000000002, F1 score 0.5568194389343262
Question 9: Best threshold 0.6700000000000003, F1 score 0.634412944316864
Question 10: Best threshold 0.5100000000000001, F1 score 0.618131697177887
Question 11: Best threshold 0.6400000000000002, F1 score 0.6024016737937927
Question 12: Best threshold 0.7900000000000004, F1 score 0.5704087018966675
Question 13: Best threshold 0.4, F1 score 0.5853684544563293
Question 14: Best threshold 0.650000000

# Submission

Here you'll use the `best_threshold` calculated above to map the predicted probabilities to class labels.

In [None]:
# Reference
# https://www.kaggle.com/code/philculliton/basic-submission-demo
# https://www.kaggle.com/code/cdeotte/random-forest-baseline-0-664/notebook


import jo_wilder
env = jo_wilder.make_env()
iter_test = env.iter_test()

# Level group -> half-open range (first, last + 1) of its question numbers
limits = {'0-4':(1,4), '5-12':(4,14), '13-22':(14,19)}

for (sample_submission, test) in iter_test:

        grp = test.level_group.values[0]
        a,b = limits[grp]

        # The same feature engineering is applied to every level group
        test = feature_engineer(test, grp)

        for t in range(a,b):
            # Predict with the model trained for question `t`
            gbtm = models[f'{t}']
            test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test)
            predictions = gbtm.predict(test_ds)
            mask = sample_submission.session_id.str.contains(f'q{t}')
            n_predictions = (predictions > best_threshold).astype(int)
            sample_submission.loc[mask,'correct'] = n_predictions.flatten()
            
        env.predict(sample_submission)
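
The `limits` dictionary in the loop above decides which question models are applied to each level group. The sketch below spells that mapping out and applies a threshold to a few made-up probabilities. The helpers `questions_for_group` and `to_labels` and the toy probabilities are hypothetical, and `0.62` is assumed to be the overall best threshold printed earlier.

```python
# Same mapping as in the submission loop: level group -> half-open question range.
limits = {'0-4': (1, 4), '5-12': (4, 14), '13-22': (14, 19)}


def questions_for_group(level_group):
    """Question numbers whose models are applied to this level group."""
    start, stop = limits[level_group]
    return list(range(start, stop))


def to_labels(probabilities, threshold):
    """Map probabilities to 0/1 labels; values above the threshold become 1."""
    return [int(p > threshold) for p in probabilities]


# Hypothetical predicted probabilities for the three questions of group '0-4'.
toy_probabilities = {1: 0.71, 2: 0.93, 3: 0.55}
toy_threshold = 0.62  # assumption: the overall best threshold printed earlier

group_questions = questions_for_group('0-4')
toy_answers = to_labels([toy_probabilities[q] for q in group_questions], toy_threshold)
print(group_questions, toy_answers)
```

Taken together, the three ranges cover questions 1 to 18 exactly once, so every model is used for exactly one level group.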

In [None]:
! head submission.csv