<a href="https://colab.research.google.com/github/Dansah2/Identifying-Age-Related-Conditions/blob/main/GB_Tree_Model_ICR_Identifying_Age_Related_Conditions_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ICR - Identifying Age-Related Conditions

Kaggle Dataset Download API Command:

kaggle competitions download -c icr-identify-age-related-conditions

Predict whether a subject has or has not been diagnosed with one of these conditions -- a binary classification problem.

## Project Outline:

1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data

4) Create and train the baseline model

5) Save the Model

## Download the Dataset

1) Install required libraries

2) Import required libraries

3) Obtain the preprocessed data previously saved to Google Drive


#### Install Required Libraries

In [1]:
!pip install -q -U keras-tuner
!pip install -q -U scikit-learn
!pip install -q -U numpy
!pip install -q -U tensorflow_decision_forests
!pip install -q -U plotly

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cupy-cuda11x 11.0.0 requires numpy<1.26,>=1.20, but you have numpy 1.26.0 wh

#### Import Required Libraries

In [1]:
# loading and handling data
import numpy as np
import pandas as pd

# model training
import tensorflow as tf
from tensorflow import keras
import tensorflow_decision_forests as tfdf

# downloading data
from google.colab import drive

# Training/Evaluating the model
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.metrics import precision_score, recall_score, f1_score

# hyperparameter tuning
import keras_tuner as kt

# graph training accuracy and loss
import plotly.graph_objs as go
from plotly.subplots import make_subplots

Using TensorFlow backend


#### Obtain the preprocessed data previously saved to Google Drive


In [2]:
# Mount Google Drive to read the preprocessed data saved there
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# create a function to read the data into a dataframe

def read_function(csv_file):

    return pd.read_csv(csv_file)

clean_train = read_function('/content/drive/My Drive/ICR_Project/non_encoded_train_df.csv')

## Create and Train baseline model
1) Calculate the weights

2) Select the model

3) Tune hyperparameters

4) Train the model

### Calculate the weights

In [4]:
def get_weights(train_df, target):
  # Calculate the number of samples for each label.
  neg, pos = np.bincount(train_df[target])

  # Calculate total samples.
  total = neg + pos

  # Calculate the weight for each label.
  weight_for_0 = (1 / neg) * (total / 2.0)
  weight_for_1 = (1 / pos) * (total / 2.0)

  class_weight = {0: weight_for_0, 1: weight_for_1}

  print(f'Weight for class 0: {weight_for_0:.2f}')
  print(f'Weight for class 1: {weight_for_1:.2f}')

  return class_weight

class_weight = get_weights(clean_train, 'Class')

Weight for class 0: 0.61
Weight for class 1: 2.86
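
The formula in `get_weights` gives each class a weight of `total / (2 * count)`, so the rarer class receives the larger weight. As a sanity check, the sketch below compares that formula with scikit-learn's `compute_class_weight(class_weight="balanced", ...)`, which uses `n_samples / (n_classes * count)`. The label counts (80 negatives, 20 positives) are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels for illustration only: 80 negatives, 20 positives.
labels = np.array([0] * 80 + [1] * 20)

# Same arithmetic as get_weights above.
neg, pos = np.bincount(labels)
total = neg + pos
manual = np.array([total / (2.0 * neg), total / (2.0 * pos)])

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * bincount(y)).
balanced = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=labels
)

print(manual)    # weights from the notebook's formula
print(balanced)  # weights from scikit-learn
```

For a binary label the two computations should agree, which suggests `get_weights` could be swapped for the library call if preferred.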


### Select the model

In [5]:
#Look at the models to select from
tfdf.keras.get_all_models()

[tensorflow_decision_forests.keras.RandomForestModel,
 tensorflow_decision_forests.keras.GradientBoostedTreesModel,
 tensorflow_decision_forests.keras.CartModel,
 tensorflow_decision_forests.keras.DistributedGradientBoostedTreesModel]

In [6]:
# Instantiate a model from a predefined hyperparameter template to inspect the options it resolves to
model = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1")

Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpiyr_okod as temporary training directory


### Tune hyperparameters


In [7]:
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(clean_train, label="Class")

tuner = tfdf.tuner.RandomSearch(num_trials=20)

# Hyper-parameters to optimize.
tuner.choice("max_depth", [3, 4, 5, 6, 7, 8, 9, 10])
tuner.choice("num_trees", [400, 500, 600, 700, 800])

model = tfdf.keras.GradientBoostedTreesModel(tuner=tuner)
model.fit(tf_dataset)

# summary() prints the report itself and returns None, so no print() wrapper is needed
model.summary()





Use /tmp/tmp4ina0vna as temporary training directory
Reading training dataset...
Training dataset read in 0:00:07.416563. Found 617 examples.
Training model...
Model trained in 0:00:06.337140
Compiling model...


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code


Model compiled.
Model: "gradient_boosted_trees_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1 (1.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 1 (1.00 Byte)
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (56):
	AB
	AF
	AH
	AM
	AR
	AX
	AY
	AZ
	BC
	BD_
	BN
	BP
	BQ
	BR
	BZ
	CB
	CC
	CD_
	CF
	CH
	CL
	CR
	CS
	CU
	CW_
	DA
	DE
	DF
	DH
	DI
	DL
	DN
	DU
	DV
	DY
	EB
	EE
	EG
	EH
	EJ
	EL
	EP
	EU
	FC
	FD_
	FE
	FI
	FL
	FR
	FS
	GB
	GE
	GF
	GH
	GI
	GL

No weights

Variable Importance: INV_MEAN_MIN_DEPTH:
    1.  "DU"  0.399590 ################
    2.  "CC"  0.364146 ######
    3.  "BQ"  
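
To make the size of the search concrete: the two `tuner.choice` calls above define 8 x 5 = 40 possible (`max_depth`, `num_trees`) combinations, and `num_trials=20` means at most 20 of them are tried. The sketch below is a plain-Python illustration of the random-search idea only, not TF-DF's implementation; the helper `sample_trials` is hypothetical, and sampling each hyperparameter independently and uniformly is an assumption.

```python
import random

# The search space copied from the tuner.choice calls above.
search_space = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "num_trees": [400, 500, 600, 700, 800],
}


def sample_trials(space, num_trials, seed=0):
    """Hypothetical helper: draw one value per hyperparameter for each trial."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(num_trials)
    ]


trials = sample_trials(search_space, num_trials=20)

# Number of distinct configurations in the full grid.
n_combinations = len(search_space["max_depth"]) * len(search_space["num_trees"])

print(f"{len(trials)} trials drawn from {n_combinations} possible combinations")
print(trials[:3])
```

In a real search each sampled configuration would be trained and scored, and the best-scoring one kept.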

### Train the model

In [8]:
# Create list of ids for the creation of oof dataframe.
ID_LIST = clean_train.index

# Create a dataframe of required size with zero values.
oof = pd.DataFrame(data=np.zeros((len(ID_LIST),1)), index=ID_LIST)

# Create an empty dictionary to store the models trained for each fold.
models = {}

# Create empty dicts to save metrics for the models trained for each fold.
accuracy = {}
cross_entropy = {}

# Save the name of the label column to a variable.
label = "Class"

# Create a KFold splitter with 5 splits (no shuffling, no grouping)
kf = KFold(n_splits=5)
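
As a rough illustration of what the splitter produces, the sketch below applies `KFold(n_splits=5)` to a stand-in array with 617 rows (the example count reported in the tuning log) and prints the training-set size of each fold. It also shows `StratifiedKFold`, a scikit-learn alternative that keeps the class ratio similar across folds, which may be worth considering given the class imbalance. The labels here are hypothetical (roughly one positive in six) and are not the real `Class` column.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Stand-in data: 617 rows, matching the example count in the training log.
n_rows = 617
X = np.zeros((n_rows, 1))

# Hypothetical labels for illustration only: every sixth row is positive.
y = np.zeros(n_rows, dtype=int)
y[::6] = 1

# Plain KFold: contiguous, unshuffled folds.
kf = KFold(n_splits=5)
train_sizes = [len(train_idx) for train_idx, _ in kf.split(X)]
print("KFold training sizes:", train_sizes)

# StratifiedKFold: positives are spread evenly over the validation folds.
skf = StratifiedKFold(n_splits=5)
pos_per_fold = [int(y[valid_idx].sum()) for _, valid_idx in skf.split(X, y)]
print("Positives per stratified validation fold:", pos_per_fold)
```

Because 617 is not divisible by 5, the first two validation folds get one extra row, which leaves slightly smaller training sets for those folds.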

In [19]:
def train_model(dataset_df, class_weight):
  # Create one subplot per metric: accuracy, loss, precision, recall, and F1-score
  fig = make_subplots(rows=5, cols=1, subplot_titles=("Accuracy", "Loss", "Precision", "Recall", "F1-Score"))

  # initialize evaluation metrics
  precision_scores = []
  recall_scores = []
  f1_scores = []
  average_loss = 0
  average_acc = 0

  # Loop through each fold
  for i, (train_index, valid_index) in enumerate(kf.split(X=dataset_df)):
    print('##### Fold',i+1)

    # Fetch values corresponding to the index
    train_df = dataset_df.iloc[train_index]
    valid_df = dataset_df.iloc[valid_index]
    valid_ids = valid_df.index.values

    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label=label)
    valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_df, label=label)

    # Define the model and metrics
    model = tfdf.keras.GradientBoostedTreesModel(max_depth=3, num_trees=700)
    model.compile(metrics=["accuracy", "binary_crossentropy"])

    # Train the model
    model.fit(x=train_ds, class_weight=class_weight)

    # Store the model
    models[f"fold_{i+1}"] = model

    # Predict OOF value for validation data
    predict = model.predict(x=valid_ds)

    # Store the predictions in oof dataframe
    oof.loc[valid_ids, 0] = predict.flatten()

    # Threshold the predicted probabilities at 0.5, then calculate precision, recall, and F1-score
    pred_labels = (predict.flatten() > 0.5).astype(int)
    precision_fold = precision_score(valid_df[label], pred_labels)
    recall_fold = recall_score(valid_df[label], pred_labels)
    f1_fold = f1_score(valid_df[label], pred_labels)

    # Append the values to the respective lists
    precision_scores.append(precision_fold)
    recall_scores.append(recall_fold)
    f1_scores.append(f1_fold)

    # Evaluate and store the metrics in respective dicts
    evaluation = model.evaluate(x=valid_ds,return_dict=True)
    accuracy[f"fold_{i+1}"] = evaluation["accuracy"]
    cross_entropy[f"fold_{i+1}"]= evaluation["binary_crossentropy"]

    # Update accuracy plot
    fold_accuracy = evaluation["accuracy"]
    fig.add_trace(go.Scatter(x=[i+1], y=[fold_accuracy], mode='markers+lines', name=f'Accuracy Fold {i+1}'), row=1, col=1)

    # Update loss plot
    fold_loss = evaluation["binary_crossentropy"]
    fig.add_trace(go.Scatter(x=[i+1], y=[fold_loss], mode='markers+lines', name=f'Loss Fold {i+1}'), row=2, col=1)

    # Update precision plot
    fig.add_trace(go.Scatter(x=[i+1], y=[precision_fold], mode='markers+lines', name=f'Precision Fold {i+1}'), row=3, col=1)

    # Update recall plot
    fig.add_trace(go.Scatter(x=[i+1], y=[recall_fold], mode='markers+lines', name=f'Recall Fold {i+1}'), row=4, col=1)

    # Update f1-score plot
    fig.add_trace(go.Scatter(x=[i+1], y=[f1_fold], mode='markers+lines', name=f'F1-Score Fold {i+1}'), row=5, col=1)

  # Calculate and print evaluation metrics, averaging over however many folds were trained
  n_folds = len(models)
  for _model in models:
    average_loss += cross_entropy[_model]
    average_acc += accuracy[_model]
    print(f"\n{_model}: acc: {accuracy[_model]:.4f} loss: {cross_entropy[_model]:.4f}")

  print(f"\nAverage accuracy: {average_acc/n_folds:.4f}  Average loss: {average_loss/n_folds:.4f}\n")

  for i in range(n_folds):
    print(f"\nPrecision: {precision_scores[i]:.4f}, Recall: {recall_scores[i]:.4f}, F1 Score: {f1_scores[i]:.4f}")

  print(f"\nAverage Precision: {np.mean(precision_scores):.4f}  Average Recall: {np.mean(recall_scores):.4f}, Average F1 Score: {np.mean(f1_scores):.4f}")

  # Set the figure title and axis labels
  fig.update_layout(title="Evaluation Metrics per Fold")
  fig.update_yaxes(title_text="Value", row=3, col=1)
  fig.update_xaxes(title_text="Fold", row=5, col=1)

  fig.show()

  # Return the model trained on the last fold; every fold's model stays available in `models`
  return model

model = train_model(clean_train, class_weight)

##### Fold 1








Use /tmp/tmpmb4_d15u as temporary training directory
Reading training dataset...
Training dataset read in 0:00:01.742973. Found 493 examples.
Training model...
Model trained in 0:00:00.684577
Compiling model...
Model compiled.
##### Fold 2








Use /tmp/tmpi2fh7y97 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:01.361835. Found 493 examples.
Training model...
Model trained in 0:00:00.337900
Compiling model...
Model compiled.
##### Fold 3








Use /tmp/tmpaspah6ds as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.796959. Found 494 examples.
Training model...
Model trained in 0:00:00.656447
Compiling model...
Model compiled.
##### Fold 4








Use /tmp/tmpzluh_ju_ as temporary training directory
Reading training dataset...
Training dataset read in 0:00:01.185355. Found 494 examples.
Training model...
Model trained in 0:00:00.497402
Compiling model...
Model compiled.
##### Fold 5








Use /tmp/tmpob6ue_e4 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:01.398398. Found 494 examples.
Training model...
Model trained in 0:00:00.806367
Compiling model...
Model compiled.

fold_1: acc: 0.9597 loss: 0.1458

fold_2: acc: 0.8790 loss: 0.2566

fold_3: acc: 0.9268 loss: 0.2221

fold_4: acc: 0.9187 loss: 0.2841

fold_5: acc: 0.9512 loss: 0.1557

Average accuracy: 0.9271  Average loss: 0.2129


Precision: 0.8500, Recall: 0.8947, F1 Score: 0.8718

Precision: 0.5714, Recall: 0.6667, F1 Score: 0.6154

Precision: 0.8462, Recall: 0.8148, F1 Score: 0.8302

Precision: 0.7600, Recall: 0.8261, F1 Score: 0.7917

Precision: 0.8261, Recall: 0.9048, F1 Score: 0.8636

Average Precision: 0.7707  Average Recall: 0.8214, Average F1 Score: 0.7945
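
The averages hide a fair amount of variation between folds (fold 2 is noticeably weaker than the rest). A minimal sketch for summarizing that spread is shown below; it uses the per-fold values printed above, typed in by hand, and reports the mean and standard deviation for each metric. The dictionary names are illustrative and not part of the notebook.

```python
import numpy as np

# Per-fold metrics copied from the printed output above (folds 1 to 5).
fold_metrics = {
    "precision": [0.8500, 0.5714, 0.8462, 0.7600, 0.8261],
    "recall": [0.8947, 0.6667, 0.8148, 0.8261, 0.9048],
    "f1": [0.8718, 0.6154, 0.8302, 0.7917, 0.8636],
}

# Mean and (population) standard deviation per metric.
summary = {
    name: (float(np.mean(values)), float(np.std(values)))
    for name, values in fold_metrics.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f} std={std:.4f}")

# Index of the weakest fold by F1 (0-based), reported 1-based.
worst_fold = int(np.argmin(fold_metrics["f1"])) + 1
print(f"Lowest F1 is in fold {worst_fold}")
```

With only about 120 validation rows per fold, a spread of this size is not unusual, so the per-fold numbers are best read together rather than individually.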


## Save the Model


In [11]:
# Save the model returned by train_model (the one trained on the last fold)
model.save('/content/drive/My Drive/ICR_Project/ICR_model.keras', save_format="keras")