## Random Forest Model for Age-Related Conditions Classification

This code demonstrates how to train a Random Forest model for age-related conditions classification using the `TensorFlow Decision Forests` library. It includes the following steps:

1. **Load and preprocess the training dataset**: The code reads the training dataset from a CSV file, handles missing values by filling them with the mean values, and separates the features and target columns.

2. **Perform K-fold validation and Out-of-Fold (OOF) prediction**: The code splits the training data into K folds using stratified K-fold cross-validation. For each split, it trains a new Random Forest model on the training folds and predicts on the held-out fold. The OOF predictions are stored for later evaluation.

3. **Evaluate OOF accuracy**: The code thresholds the OOF predictions at 0.5 and compares the resulting labels with the true target values to calculate the OOF accuracy.

4. **Load and preprocess the test dataset**: The code reads the test dataset from a CSV file, fills missing values with the column means, reindexes it to the same feature columns as the training dataset, and adds a sequential 'Id' column.

5. **Convert the test dataset to a TensorFlow dataset**: The code converts the preprocessed test dataset to a TensorFlow dataset.

6. **Make predictions on the test dataset**: The code uses the Random Forest model trained in the last fold to make predictions on the test dataset.

7. **Create a submission DataFrame**: The code creates a DataFrame with the columns 'Id', 'class_0', and 'class_1', where 'class_0' is the predicted probability that the sample belongs to class 0 and 'class_1' is the predicted probability that it belongs to class 1.

8. **Save the submission file**: The code saves the submission DataFrame as a CSV file named 'submission.csv', without including the index column.

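**Note**: Ensure that the necessary libraries, `pandas`, `numpy`, `tensorflow_decision_forests`, `sklearn` (installed as `scikit-learn`), and `imblearn` (installed as `imbalanced-learn`), are installed before running the code. `SMOTENC` from `imblearn` is imported but never used by the code.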

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/icr-identify-age-related-conditions/sample_submission.csv
/kaggle/input/icr-identify-age-related-conditions/greeks.csv
/kaggle/input/icr-identify-age-related-conditions/train.csv
/kaggle/input/icr-identify-age-related-conditions/test.csv


In [2]:
import pandas as pd
import numpy as np
import tensorflow_decision_forests as tfdf
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTENC



In [3]:
# Load the training dataset
dataset_df = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/train.csv')

In [4]:
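# Handle missing values in the training dataset by filling numeric columns with their column means
# (numeric_only=True skips non-numeric columns, which newer pandas versions refuse to average)
dataset_df.fillna(dataset_df.mean(numeric_only=True), inplace=True)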
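The effect of mean imputation can be checked on a small, made-up frame. The sketch below is only an illustration: the frame `toy_df`, its column `AB`, and all values are hypothetical and not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one non-numeric column and one numeric column with a gap
toy_df = pd.DataFrame({
    "Id": ["a", "b", "c"],
    "AB": [1.0, np.nan, 3.0],
})

# Column means over numeric columns only; the string column is skipped
column_means = toy_df.mean(numeric_only=True)

# fillna with a Series fills each column using the value at the matching index label
imputed_df = toy_df.fillna(column_means)

print(imputed_df)
```

Columns without an entry in `column_means` (here `Id`) are left untouched, so any missing values in non-numeric columns would remain.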


In [5]:
# Define features and target
features = dataset_df.drop(columns=['Id', 'Class'])
target = dataset_df['Class']

In [6]:
# KFold validation and OOF
n_splits = 5
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

oof_predictions = np.zeros(dataset_df.shape[0])

# Iterate over each fold
for fold, (train_index, valid_index) in enumerate(kf.split(features, target)):
    X_train, X_valid = features.iloc[train_index], features.iloc[valid_index]
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]

    # Convert to TensorFlow datasets
    train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(X_train.assign(target=y_train), task=tfdf.keras.Task.CLASSIFICATION, label="target")
    valid_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(X_valid.assign(target=y_valid), task=tfdf.keras.Task.CLASSIFICATION, label="target")

    # Train a new model for each fold
    model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.CLASSIFICATION, hyperparameter_template="benchmark_rank1")
    model.fit(x=train_dataset)

    # Predict on the out-of-fold data
    valid_predictions = model.predict(valid_dataset)
    oof_predictions[valid_index] = valid_predictions.reshape(-1)

Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmp9qgme_f_ as temporary training directory
Reading training dataset...
Training dataset read in 0:00:07.429049. Found 493 examples.
Training model...
Model trained in 0:00:02.525329
Compiling model...


[INFO 23-06-24 19:16:00.7699 UTC kernel.cc:1242] Loading model from path /tmp/tmp9qgme_f_/model/ with prefix 4b3550adeea94d23
[INFO 23-06-24 19:16:00.8234 UTC decision_forest.cc:660] Model loaded with 300 root(s), 10214 node(s), and 55 input feature(s).
[INFO 23-06-24 19:16:00.8235 UTC abstract_model.cc:1311] Engine "RandomForestGeneric" built
[INFO 23-06-24 19:16:00.8235 UTC kernel.cc:1074] Use fast generic engine


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code
Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmps1_r02oc as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.792452. Found 493 examples.
Training model...
Model trained in 0:00:02.444898
Compiling model...


[INFO 23-06-24 19:16:07.5690 UTC kernel.cc:1242] Loading model from path /tmp/tmps1_r02oc/model/ with prefix 7046f4d163a94df8
[INFO 23-06-24 19:16:07.6207 UTC decision_forest.cc:660] Model loaded with 300 root(s), 10104 node(s), and 55 input feature(s).
[INFO 23-06-24 19:16:07.6207 UTC kernel.cc:1074] Use fast generic engine


Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmp7_y3un6v as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.808222. Found 494 examples.
Training model...
Model trained in 0:00:02.500155
Compiling model...


[INFO 23-06-24 19:16:11.7791 UTC kernel.cc:1242] Loading model from path /tmp/tmp7_y3un6v/model/ with prefix 86cf214dcfa84729
[INFO 23-06-24 19:16:11.8328 UTC decision_forest.cc:660] Model loaded with 300 root(s), 10224 node(s), and 56 input feature(s).
[INFO 23-06-24 19:16:11.8329 UTC abstract_model.cc:1311] Engine "RandomForestGeneric" built
[INFO 23-06-24 19:16:11.8329 UTC kernel.cc:1074] Use fast generic engine


Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpub3gsl_v as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.815610. Found 494 examples.
Training model...
Model trained in 0:00:02.487889
Compiling model...


[INFO 23-06-24 19:16:16.0385 UTC kernel.cc:1242] Loading model from path /tmp/tmpub3gsl_v/model/ with prefix ad4c2af1970d466f
[INFO 23-06-24 19:16:16.0860 UTC decision_forest.cc:660] Model loaded with 300 root(s), 9422 node(s), and 56 input feature(s).
[INFO 23-06-24 19:16:16.0861 UTC kernel.cc:1074] Use fast generic engine


Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpdbu2igy4 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.821059. Found 494 examples.
Training model...
Model trained in 0:00:02.501914
Compiling model...


[INFO 23-06-24 19:16:20.2476 UTC kernel.cc:1242] Loading model from path /tmp/tmpdbu2igy4/model/ with prefix 1662017b59784888
[INFO 23-06-24 19:16:20.2983 UTC decision_forest.cc:660] Model loaded with 300 root(s), 10284 node(s), and 55 input feature(s).
[INFO 23-06-24 19:16:20.2984 UTC kernel.cc:1074] Use fast generic engine


Model compiled.
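The out-of-fold bookkeeping in the loop above can be reproduced on synthetic data. This is a minimal sketch, not the notebook's model: it swaps in scikit-learn's `RandomForestClassifier` for the TF-DF model so that it runs without TensorFlow, and the data produced by `make_classification` is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary data (assumption: stands in for the real features/target)
X, y = make_classification(n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=42)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.full(len(y), np.nan)  # NaN marks rows that have not been predicted yet

for train_index, valid_index in kf.split(X, y):
    # A fresh model per fold, trained only on the training rows of this split
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X[train_index], y[train_index])
    # Probability of class 1 for the held-out rows only
    oof[valid_index] = clf.predict_proba(X[valid_index])[:, 1]

# Every row lands in exactly one validation fold, so no NaN should remain
print(np.isnan(oof).sum())
```

Initializing the array with NaN rather than zeros makes it easy to spot a row that was never assigned a prediction.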


In [7]:
# Evaluate OOF accuracy
oof_predictions_binary = np.where(oof_predictions > 0.5, 1, 0)
oof_accuracy = accuracy_score(target, oof_predictions_binary)
print(f"OOF Accuracy: {oof_accuracy}")

OOF Accuracy: 0.9222042139384117
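The thresholding step can be traced by hand on a few values. The probabilities and labels below are hypothetical and chosen only to illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical probabilities and labels, chosen only to illustrate the arithmetic
toy_probabilities = np.array([0.10, 0.40, 0.50, 0.70, 0.90])
toy_labels = np.array([0, 1, 0, 1, 1])

# Strictly greater than 0.5 maps to class 1, so a probability of exactly 0.5 becomes class 0
toy_binary = np.where(toy_probabilities > 0.5, 1, 0)

# Fraction of positions where the thresholded label equals the true label
toy_accuracy = accuracy_score(toy_labels, toy_binary)
print(toy_binary, toy_accuracy)
```

Here four of the five thresholded labels match the true labels, so the accuracy is 0.8.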


In [8]:
# Load the test dataset
test_data = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/test.csv')

# Handle missing values in the test dataset (numeric columns only)
test_data.fillna(test_data.mean(numeric_only=True), inplace=True)

# Ensure test dataset has the same feature columns as the training dataset
# (note: this drops every column not in features.columns, including the original 'Id')
test_data = test_data.reindex(columns=features.columns, fill_value=0)

# Create a sequential 'Id' column (1..n) in the test dataset; these are not the original test identifiers
test_data['Id'] = range(1, len(test_data) + 1)

# Convert the test dataset to a TensorFlow dataset
test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(test_data, task=tfdf.keras.Task.CLASSIFICATION)
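Because `reindex` keeps only the requested columns, the original identifiers are lost unless they are saved first. The following sketch shows one way this could be handled; `feature_columns`, `toy_test`, `toy_ids`, and `toy_features` are hypothetical stand-ins, not names from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's objects
feature_columns = ["AB", "AF"]      # plays the role of features.columns
toy_test = pd.DataFrame({
    "Id": ["id_a", "id_b"],
    "AB": [0.1, 0.2],
    "extra": [9, 9],                # a column the model never saw
})

# Save the identifiers before reindexing, because reindex drops 'Id'
toy_ids = toy_test["Id"].copy()

# Keep only the training feature columns; 'AF' is missing and is filled with 0
toy_features = toy_test.reindex(columns=feature_columns, fill_value=0)

print(list(toy_features.columns), toy_ids.tolist())
```

Keeping the identifiers in a separate Series also avoids passing `Id` to the model as if it were a feature.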


In [9]:
# Make predictions on the test dataset (`model` is the one trained in the last fold)
test_predictions = model.predict(test_dataset)

# Reshape the test predictions to match the shape of the 'Id' column
test_predictions_reshaped = test_predictions.reshape(-1)



In [10]:
# Create a DataFrame with the predicted labels
submission_df = pd.DataFrame({'Id': test_data['Id'], 'class_0': 1 - test_predictions_reshaped, 'class_1': test_predictions_reshaped})

# Save the submission file as a CSV
submission_df.to_csv('submission.csv', index=False)
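The shape handling and the complementary probabilities can be verified on a tiny example. The array `toy_predictions` and the identifiers below are hypothetical; the sketch assumes, as in the cells above, that the model returns one class-1 probability per row in an array of shape `(n, 1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical model output: one column holding the class-1 probability per row
toy_predictions = np.array([[0.25], [0.75]])
toy_ids = ["id_a", "id_b"]

# Flatten (n, 1) to (n,) so it lines up with the Id column
p_class_1 = toy_predictions.reshape(-1)

toy_submission = pd.DataFrame({
    "Id": toy_ids,
    "class_0": 1 - p_class_1,   # the two class probabilities sum to 1 in the binary case
    "class_1": p_class_1,
})

# Serialize without the index column, as in the cell above
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```

Calling `to_csv` without a path returns the CSV as a string, which is convenient for inspecting the header before writing a file.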

In [11]:
submission_df

|   | Id | class_0 | class_1 |
|---|----|----------|----------|
| 0 | 1 | 0.626667 | 0.373333 |
| 1 | 2 | 0.626667 | 0.373333 |
| 2 | 3 | 0.626667 | 0.373333 |
| 3 | 4 | 0.626667 | 0.373333 |
| 4 | 5 | 0.626667 | 0.373333 |
