# README

- About competition

The goal of this competition is to predict if a person has any of three medical conditions. You are being asked to predict if the person has one or more of any of the three medical conditions (Class 1), or none of the three medical conditions (Class 0). You will create a model trained on measurements of health characteristics.
Determining whether someone has these medical conditions requires a long and intrusive process of collecting information from patients. With predictive models, we can shorten this process and keep patient details private by collecting key characteristics relevant to the conditions, then encoding these characteristics.
Your work will help researchers discover the relationship between measurements of certain characteristics and potential patient conditions.
https://www.kaggle.com/competitions/icr-identify-age-related-conditions/data

- Data

The competition data comprises over fifty anonymized health characteristics linked to three age-related conditions. Your goal is to predict whether a subject has or has not been diagnosed with one of these conditions -- a binary classification problem.

train.csv - The training set.
  - Id - Unique identifier for each observation.
  - AB-GL - Fifty-six anonymized health characteristics. All are numeric except for EJ, which is categorical.
  - Class - A binary target: 1 indicates the subject has been diagnosed with one of the three conditions, 0 indicates they have not.

# Explore data

In [1]:
# import modules
import pandas as pd

# from sklearn
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 

In [2]:
# read data
df = pd.read_csv('data/train.csv')
# encode the only categorical column, EJ ('A'/'B'), as 0/1 without touching other columns
df['EJ'] = df['EJ'].map({'A': 0, 'B': 1})
df.shape

(617, 58)

In [9]:
# define X
X = df.drop(columns=['Class', 'Id'])
print(X.shape)

# define Y
y = df['Class']
print(y.shape)

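KeyError: "['Class', 'Id'] not found in axis"

Note: this error is most likely stale output. The cell was re-run (In [9]) at a point when `df` no longer held the training data, so the 'Class' and 'Id' columns were missing. Run directly after the read-data cell, it prints the shapes of X and y instead.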

In [4]:
# values
y.value_counts()

0    509
1    108
Name: Class, dtype: int64
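
The counts above show a class imbalance of roughly 5:1, which matters when reading accuracy scores later on. As a small worked example (the variable names are illustrative and not part of the original notebook), a model that always predicts Class 0 would already reach about 82.5% accuracy:

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 509, 1: 108}

total = sum(class_counts.values())

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(class_counts.values()) / total

# Share of positive (Class 1) samples
positive_rate = class_counts[1] / total

print(f"samples: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"positive rate: {positive_rate:.3f}")
```

Any accuracy figure is therefore best compared against this baseline rather than against 50%.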

In [5]:
# find NaNs
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 617 entries, 0 to 616
Data columns (total 55 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AB      617 non-null    float64
 1   AF      617 non-null    float64
 2   AH      617 non-null    float64
 3   AM      617 non-null    float64
 4   AR      617 non-null    float64
 5   AX      617 non-null    float64
 6   AY      617 non-null    float64
 7   AZ      617 non-null    float64
 8   BC      617 non-null    float64
 9   BD      617 non-null    float64
 10  BN      617 non-null    float64
 11  BP      617 non-null    float64
 12  BQ      557 non-null    float64
 13  BR      617 non-null    float64
 14  BZ      617 non-null    float64
 15  CB      615 non-null    float64
 16  CC      614 non-null    float64
 17  CD      617 non-null    float64
 18  CF      617 non-null    float64
 19  CH      617 non-null    float64
 20  CL      617 non-null    float64
 21  CR      617 non-null    float64
 22  CS

# Assess models

In [6]:
# import sklearn modules
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.impute import SimpleImputer

# scalers
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

scalers = [
    StandardScaler(),
    MinMaxScaler(),
    MaxAbsScaler(),
]

# import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
import xgboost as xgb


models = [
    LogisticRegression(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    HistGradientBoostingClassifier(),
    AdaBoostClassifier(),
    xgb.XGBClassifier(),
    SVC(),
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
]
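
To make the difference between the three scalers listed above concrete, here is a minimal sketch on a toy single-feature column. The values are hypothetical and do not come from train.csv. StandardScaler centers to zero mean and unit variance, MinMaxScaler maps the column onto [0, 1], and MaxAbsScaler divides by the largest absolute value.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Toy single-feature column (hypothetical values, not from train.csv)
column = np.array([[-2.0], [0.0], [2.0], [4.0]])

# Fit each scaler on the column and keep the flattened result by scaler name
scaled = {
    type(s).__name__: s.fit_transform(column).ravel()
    for s in (StandardScaler(), MinMaxScaler(), MaxAbsScaler())
}

for name, values in scaled.items():
    print(f"{name:>14}: {np.round(values, 3)}")
```

Tree-based models are generally insensitive to this kind of monotonic rescaling, so the choice of scaler is expected to matter mostly for the linear, distance-based, and kernel models in the list.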

In [7]:
def custom_pipeline(scaler, model):
    """Cross-validate an impute -> scale -> classify pipeline on the global X and y."""
    # Create the pipeline
    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with the column median
        ('scaler', scaler),  # Apply the scaler passed in
        ('classifier', model)  # Fit the classifier passed in
    ])

    # Perform 5-fold stratified cross-validation (default scoring for classifiers is accuracy)
    cv_scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(n_splits=5))

    # Optionally print the mean accuracy and twice the standard deviation
    # print("Cross-validation accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))
    
    return cv_scores
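
The function above relies on the default scoring of cross_val_score, which is plain accuracy for classifiers. With imbalanced classes, it may be worth comparing that against balanced accuracy and a class-weighted model. The sketch below is only an illustration under stated assumptions: it uses synthetic data from make_classification as a stand-in for train.csv, and the names X_demo, y_demo, and demo_pipeline are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption: about 18% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.82, 0.18], random_state=0
)

# Same three steps as custom_pipeline, with a class-weighted classifier
demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000)),
])

# Reuse one splitter so both metrics are computed on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accuracy = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv, scoring='accuracy')
balanced = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv, scoring='balanced_accuracy')

print(f"accuracy:          {accuracy.mean():.3f}")
print(f"balanced accuracy: {balanced.mean():.3f}")
```

Balanced accuracy averages the recall of each class, so a model that ignores the minority class scores 0.5 rather than the majority-class share.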

In [8]:
%%time
# generate df report of algorithms' performance
output_dict = {
    'scaler': [],
    'model': [],
    'mean': [],
    'std': []
}

for scaler in scalers:
    for model in models:
        # print(type(scaler).__name__, ' + ', type(model).__name__)
        cv_scores = custom_pipeline(scaler, model)
        
        output_dict['scaler'].append(type(scaler).__name__)
        output_dict['model'].append(type(model).__name__)
        output_dict['mean'].append( round(cv_scores.mean(),4) )
        # note: the 'std' column stores twice the standard deviation of the fold scores
        output_dict['std'].append( round(cv_scores.std() * 2, 4) )

# keep the report under its own name so the training data in `df` is not overwritten
results = pd.DataFrame(output_dict)
results

CPU times: total: 24.5 s
Wall time: 20.9 s


|   | scaler | model | mean | std |
|---|--------|-------|------|-----|
| 0 | StandardScaler | LogisticRegression | 0.8834 | 0.0447 |
| 1 | StandardScaler | RandomForestClassifier | 0.9044 | 0.0389 |
| 2 | StandardScaler | GradientBoostingClassifier | 0.9044 | 0.0722 |
| 3 | StandardScaler | HistGradientBoostingClassifier | 0.9287 | 0.04 |
| 4 | StandardScaler | AdaBoostClassifier | 0.906 | 0.0463 |
| 5 | StandardScaler | XGBClassifier | 0.9093 | 0.0534 |
| 6 | StandardScaler | SVC | 0.8849 | 0.0214 |
| 7 | StandardScaler | KNeighborsClassifier | 0.8768 | 0.0495 |
| 8 | StandardScaler | DecisionTreeClassifier | 0.8525 | 0.0437 |
| 9 | StandardScaler | GaussianNB | 0.8525 | 0.0288 |
