# Age-Related Conditions Identification

## Overview
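This notebook identifies age-related conditions with a random forest model (`tfdf.keras.RandomForestModel` from TensorFlow Decision Forests) trained on imbalanced data. The training data is first rebalanced with SMOTENC, a variant of the Synthetic Minority Over-sampling Technique (SMOTE) that also handles categorical features.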

## Data Preprocessing
- Load the training and test data.
- Handle missing values by filling them with the mean of the corresponding column in the training data.
- Ensure the test data has the same features as the training data.
- Separate the features and target from the training data.
- Prepare the test features.
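
The preprocessing steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's exact code: the column names `AB` and `AF` and all values are made up, and `numeric_only=True` is an assumption added so that non-numeric columns such as `Id` are skipped when computing means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
train = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "AB": [1.0, np.nan, 3.0, 2.0],
    "AF": [10.0, 20.0, np.nan, 30.0],
    "Class": [0, 1, 0, 0],
})
test = pd.DataFrame({"Id": ["x", "y"], "AF": [np.nan, 5.0]})  # 'AB' is absent

# Column means come from the training features only (the target is excluded).
train_means = train.drop(columns=["Class"]).mean(numeric_only=True)
train = train.fillna(train_means)

# Add feature columns the test set lacks, then fill remaining gaps.
feature_cols = [c for c in train.columns if c not in ("Id", "Class")]
for col in feature_cols:
    if col not in test.columns:
        test[col] = train_means[col]
test = test.fillna(train_means)

# Separate features and target; select test features in the same column order.
X, y = train[feature_cols], train["Class"]
X_test = test[feature_cols]
```

Computing the means once from the training frame and reusing them for the test frame keeps both sets on the same imputation values.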

## Handling Imbalanced Data
- Use SMOTENC, the SMOTE variant for mixed numeric and categorical features, to oversample the minority class.
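
To make the oversampling idea concrete, here is a minimal NumPy sketch of the interpolation step behind SMOTE for purely numeric features. It is not the `imblearn` implementation that the notebook calls through `SMOTENC`; the function name `smote_like_oversample`, its parameters, and the toy data are assumptions for illustration.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # seed points
    pick = rng.integers(0, k, size=n_new)           # which neighbour to use
    gap = rng.random((n_new, 1))                    # interpolation weight in [0, 1)
    nb = X_min[neighbours[base, pick]]
    return X_min[base] + gap * (nb - X_min[base])

# Hypothetical imbalanced toy set: 8 majority rows, 4 minority rows.
X_major = np.random.default_rng(0).normal(0.0, 1.0, size=(8, 2))
X_minor = np.array([[3.0, 3.0], [3.5, 3.2], [2.8, 3.6], [3.3, 2.9]])

synthetic = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, synthetic])
print(X_minor_balanced.shape)  # (8, 2)
```

Each synthetic row lies on the segment between two real minority rows, which is why the method adds variety rather than duplicating existing samples.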

## K-Fold Validation and Model Training
- Split the resampled data with stratified 5-fold cross-validation.
- Train a TensorFlow Decision Forests (TF-DF) `RandomForestModel` on each fold's training split.
- Predict on the test data with each fold's model and average the predictions across folds.
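
The fold loop and prediction averaging can be sketched without TF-DF. The block below is a simplified stand-in: `stratified_folds` is a hypothetical helper that deals each class's shuffled indices round-robin across folds, and the "model" is a trivial nearest-centroid scorer rather than a random forest. Only the control flow (split, fit per fold, predict on test, average) mirrors the notebook.

```python
import numpy as np

def stratified_folds(y, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx) pairs with similar class ratios per fold."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits  # deal round-robin
    for f in range(n_splits):
        yield np.flatnonzero(fold_of != f), np.flatnonzero(fold_of == f)

def fit_centroids(X, y):
    """Stand-in 'model': one centroid per class."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict_proba(model, X):
    """Score in (0, 1): higher when a row is closer to the class-1 centroid."""
    c0, c1 = model
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return 1.0 / (1.0 + np.exp(d1 - d0))

# Hypothetical balanced toy data (as it would look after oversampling).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
X_test = rng.normal(1, 1, (5, 3))

fold_preds, fold_acc = [], []
for train_idx, valid_idx in stratified_folds(y, n_splits=5):
    model = fit_centroids(X[train_idx], y[train_idx])
    # Score the held-out fold so each model gets a validation estimate.
    valid_pred = (predict_proba(model, X[valid_idx]) >= 0.5).astype(int)
    fold_acc.append((valid_pred == y[valid_idx]).mean())
    fold_preds.append(predict_proba(model, X_test))

test_pred_avg = np.mean(fold_preds, axis=0)  # average across the 5 fold models
print(test_pred_avg.shape)  # (5,)
```

The notebook builds a `valid_dataset` for each fold but does not score it; the sketch evaluates the held-out fold to show where such a check would go.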

## Measuring Accuracy
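- Threshold the predicted probabilities at 0.5 and compare the resulting binary labels with the actual labels. The accuracy cell in this notebook uses hard-coded placeholder arrays rather than held-out labels, so its reported accuracy of 1.0 is not a measure of model performance.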
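
As a small illustration of the thresholding step, the sketch below binarizes predicted probabilities and computes accuracy against known labels. The helper `accuracy_at_threshold` and the arrays are hypothetical; in practice the labels would come from a held-out validation fold.

```python
import numpy as np

def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Binarize probabilities at `threshold` and return the fraction correct."""
    y_true = np.asarray(y_true).ravel()
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    return float((y_pred == y_true).mean())

# Hypothetical held-out labels and predicted class-1 probabilities, shape (n, 1).
y_valid = np.array([0, 0, 1, 1, 0, 1])
p_valid = np.array([[0.10], [0.62], [0.80], [0.45], [0.30], [0.91]])

print(accuracy_at_threshold(y_valid, p_valid))  # 4 of 6 predictions are correct
```

Only the probabilities are thresholded here; the true labels are already binary and are used as they are.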

## Submission
- Create a submission DataFrame with the 'Id', 'class_0', and 'class_1' columns.
- Save the submission DataFrame to a CSV file named 'submission.csv'.
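
A minimal sketch of the submission layout, assuming the averaged predictions are class-1 probabilities with shape `(n, 1)`. The ids and probabilities below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and averaged class-1 probabilities.
ids = ["id_a", "id_b", "id_c"]
class_1_prob = np.array([[0.25], [0.90], [0.05]])

p1 = class_1_prob.ravel()
submission = pd.DataFrame({"Id": ids, "class_0": 1.0 - p1, "class_1": p1})

# Each row's two probabilities should sum to 1.
assert np.allclose(submission[["class_0", "class_1"]].sum(axis=1), 1.0)

# With no path, to_csv returns the CSV text; pass 'submission.csv' to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])  # Id,class_0,class_1
```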

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/icr-identify-age-related-conditions/sample_submission.csv
/kaggle/input/icr-identify-age-related-conditions/greeks.csv
/kaggle/input/icr-identify-age-related-conditions/train.csv
/kaggle/input/icr-identify-age-related-conditions/test.csv


# Import the necessary libraries


In [2]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import tensorflow_decision_forests as tfdf
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTENC
import warnings
warnings.filterwarnings("ignore")

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


In [3]:
# Load training and test data
train_data = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/train.csv')
test_data = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/test.csv')

# Handle missing values in both train and test data


In [4]:
# Handle missing values in both train and test data
train_means = train_data.mean(numeric_only=True)  # Skip non-numeric columns such as 'Id'
train_data.fillna(train_means, inplace=True)
test_data.fillna(train_means, inplace=True)  # Fill missing values in test data with mean of train data

# Ensure test data has the same features as the training data


In [5]:
# Ensure test data has the same features as the training data.
# The target column 'Class' is excluded so it is not copied into the test data.
expected_cols = [c for c in train_data.columns if c != 'Class']
missing_cols = set(expected_cols) - set(test_data.columns)
for c in missing_cols:
    test_data[c] = train_data[c].mean()  # Fill with mean of corresponding column in train data

test_data = test_data[expected_cols]  # Reorder the columns to match train data

# Separate features and target from train data


In [6]:
# Separate features and target from train data
features = train_data.drop(columns=['Id', 'Class'])
target = train_data['Class']

# Prepare the test features
test_features = test_data.drop(columns=['Id'])

# Handle imbalanced data using SMOTE


In [7]:
# Handle imbalanced data using SMOTE
categorical_features = [features.dtypes[col] == 'object' for col in features.columns]
sm = SMOTENC(random_state=42, categorical_features=categorical_features)
features_res, target_res = sm.fit_resample(features, target)

# KFold validation


In [8]:
# KFold validation
n_splits = 5
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Define a list to store test predictions
test_predictions_list = []

# Iterate over each fold
for fold, (train_index, valid_index) in enumerate(kf.split(features_res, target_res)):
    X_train, X_valid = features_res.iloc[train_index], features_res.iloc[valid_index]
    y_train, y_valid = target_res.iloc[train_index], target_res.iloc[valid_index]

    # Convert to TensorFlow datasets
    train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(X_train.assign(target=y_train), task=tfdf.keras.Task.CLASSIFICATION, label="target")
    valid_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(X_valid.assign(target=y_valid), task=tfdf.keras.Task.CLASSIFICATION, label="target")

    # Train a new model for each fold
    model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.CLASSIFICATION, hyperparameter_template="benchmark_rank1")
    model.fit(x=train_dataset)

    # Predict on the test data
    test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(test_features)
    test_predictions = model.predict(test_dataset)
    test_predictions_list.append(test_predictions)

# Average test predictions from each fold
test_predictions_avg = np.mean(test_predictions_list, axis=0)

Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmptrclyfd_ as temporary training directory
Reading training dataset...
Training dataset read in 0:00:06.391717. Found 814 examples.
Training model...
Model trained in 0:00:04.856995
Compiling model...


[INFO 23-06-24 17:59:45.5692 UTC kernel.cc:1242] Loading model from path /tmp/tmptrclyfd_/model/ with prefix 9336dc1da8f0431a
[INFO 23-06-24 17:59:45.6592 UTC decision_forest.cc:660] Model loaded with 300 root(s), 16726 node(s), and 55 input feature(s).
[INFO 23-06-24 17:59:45.6592 UTC abstract_model.cc:1311] Engine "RandomForestGeneric" built
[INFO 23-06-24 17:59:45.6594 UTC kernel.cc:1074] Use fast generic engine


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code
Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpk0_i6s9y as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.798199. Found 814 examples.
Training model...
Model trained in 0:00:05.544301
Compiling model...


[INFO 23-06-24 17:59:55.5227 UTC kernel.cc:1242] Loading model from path /tmp/tmpk0_i6s9y/model/ with prefix 910e2d61b0764fb4
[INFO 23-06-24 17:59:55.6096 UTC decision_forest.cc:660] Model loaded with 300 root(s), 16592 node(s), and 56 input feature(s).
[INFO 23-06-24 17:59:55.6097 UTC kernel.cc:1074] Use fast generic engine


Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpyaq21o97 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.745603. Found 814 examples.
Training model...
Model trained in 0:00:04.804778
Compiling model...


[INFO 23-06-24 18:00:02.2469 UTC kernel.cc:1242] Loading model from path /tmp/tmpyaq21o97/model/ with prefix 03cd7cbbe47e40cb
[INFO 23-06-24 18:00:02.3355 UTC decision_forest.cc:660] Model loaded with 300 root(s), 16598 node(s), and 56 input feature(s).
[INFO 23-06-24 18:00:02.3355 UTC abstract_model.cc:1311] Engine "RandomForestGeneric" built
[INFO 23-06-24 18:00:02.3356 UTC kernel.cc:1074] Use fast generic engine


Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpmsn3l2q0 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.754408. Found 815 examples.
Training model...
Model trained in 0:00:04.881896
Compiling model...


[INFO 23-06-24 18:00:09.0843 UTC kernel.cc:1242] Loading model from path /tmp/tmpmsn3l2q0/model/ with prefix 1dfa3e95eff247dc
[INFO 23-06-24 18:00:09.1757 UTC decision_forest.cc:660] Model loaded with 300 root(s), 16634 node(s), and 56 input feature(s).
[INFO 23-06-24 18:00:09.1757 UTC kernel.cc:1074] Use fast generic engine


Model compiled.
Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpdkk482n6 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.766815. Found 815 examples.
Training model...
Model trained in 0:00:04.771715
Compiling model...


[INFO 23-06-24 18:00:15.8957 UTC kernel.cc:1242] Loading model from path /tmp/tmpdkk482n6/model/ with prefix 6d6ef64e0c5845f4
[INFO 23-06-24 18:00:15.9863 UTC decision_forest.cc:660] Model loaded with 300 root(s), 16664 node(s), and 56 input feature(s).
[INFO 23-06-24 18:00:15.9864 UTC abstract_model.cc:1311] Engine "RandomForestGeneric" built
[INFO 23-06-24 18:00:15.9864 UTC kernel.cc:1074] Use fast generic engine


Model compiled.


# Measuring the Accuracy

In [9]:
import numpy as np
from sklearn.metrics import accuracy_score

# Placeholder values, not ground-truth labels: the test set is unlabeled, and
# 0.175041 appears to be the training-set mean of 'Class'
actual_labels = np.array([[0.175041],
                          [0.175041],
                          [0.175041],
                          [0.175041],
                          [0.175041]], dtype=np.float32)

# Predicted class-1 probabilities, hard-coded here as a 2-dimensional array
# (the values match the averaged test predictions)
predicted_labels = np.array([[0.24933319],
                             [0.24933319],
                             [0.24933319],
                             [0.24933319],
                             [0.24933319]], dtype=np.float32)

# Define the threshold value
threshold = 0.5

# Convert the continuous actual labels into binary labels
binary_actual_labels = np.where(actual_labels >= threshold, 1, 0)

# Convert the continuous predicted labels into binary labels
binary_predicted_labels = np.where(predicted_labels >= threshold, 1, 0)

# Flatten the binary actual and predicted labels
binary_actual_labels = binary_actual_labels.flatten()
binary_predicted_labels = binary_predicted_labels.flatten()

# Calculate accuracy
accuracy = accuracy_score(binary_actual_labels, binary_predicted_labels)
print("Accuracy:", accuracy)


Accuracy: 1.0


# Submission

In [10]:
import pandas as pd
import numpy as np

# Assuming you have the test_data DataFrame with 'Id' column and test_predictions_avg array
test_data['class_0'] = 1 - test_predictions_avg.flatten()
test_data['class_1'] = test_predictions_avg.flatten()

# Creating the submission DataFrame with 'Id', 'class_0', and 'class_1' columns
submission_df = test_data[['Id', 'class_0', 'class_1']]

# Saving the submission DataFrame to a CSV file
submission_df.to_csv('submission.csv', index=False)
submission_df

|   | Id | class_0 | class_1 |
|---|---|---|---|
| 0 | 00eed32682bb | 0.750667 | 0.249333 |
| 1 | 010ebe33f668 | 0.750667 | 0.249333 |
| 2 | 02fa521e1838 | 0.750667 | 0.249333 |
| 3 | 040e15f562a2 | 0.750667 | 0.249333 |
| 4 | 046e85c7cc7f | 0.750667 | 0.249333 |
