# Summary
The dataset for this project covers 617 people described by 57 columns of health data, one of which indicates whether the person has a certain health condition associated with aging. The goal of the competition this dataset originally belonged to was to build a classification model that predicts, from a person's characteristics, whether they have this condition; such a model could be a powerful tool to help doctors in their diagnostic work. The aim of this notebook is to recreate this competition and deliver possible models for the task. Five baseline models were employed: Logistic Regression, Random Forest Classifier, Support Vector Machine, Naive Bayes and XGBoost, with further hyperparameter tuning of the top three. In the end, Random Forest was established as the best choice for the problem with 96.75% accuracy on the held-out split (119 of 123 samples classified correctly), but it would make sense to try the model on more data and to test more sophisticated approaches as well.

# 1 Data preprocessing

Before we actually work on our models we will need to prepare the dataset.

<b> 1.1 </b> In this step I import the modules needed for the project as well as the dataset itself.

<b> 1.2 </b> After that I start cleaning the data by checking for duplicates, encoding categorical data and handling missing entries. For characteristics with only a few missing entries I simply drop the affected rows altogether; luckily, this removes just 5 rows. For the two columns with more missing data ('BQ' and 'EL', around 10% each) I regress the missing values on the other columns. Finally, I reduce dimensionality by computing each feature's correlation with the target and dropping the features whose absolute correlation is below 0.05, which allows us to drop 13 features.

<b> 1.3 </b> Normally, I would start the exploratory process here. However, the data we have was produced as a result of Principal Component Analysis, so the features are nearly uninterpretable. Instead, I go straight to preparing the dataset for training by scaling all features (this keeps columns with the largest values from dominating and helps some models train faster). After that I separate the data into training and testing sets.

## 1.1 Importing modules and reading data

In [2]:
# Pandas as a main tool to work with dataset
import pandas as pd

# Libraries for Data preprocessing, Cross-Validation and Accuracy measurements
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV

# Importing our Machine Learning models for the classification tasks, 
# as well as linear regression model which will be used to regress missing data in the columns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB

In [4]:
data = pd.read_csv('train.csv')
print(data.shape)

data = data.drop('Id', axis=1)
data.head()

(617, 58)


| | AB | AF | AH | AM | AR | AX | AY | AZ | BC | BD | ... | FL | FR | FS | GB | GE | GF | GH | GI | GL | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.209377 | 3109.03329 | 85.200147 | 22.394407 | 8.138688 | 0.699861 | 0.025578 | 9.812214 | 5.555634 | 4126.58731 | ... | 7.298162 | 1.73855 | 0.094822 | 11.339138 | 72.611063 | 2003.810319 | 22.136229 | 69.834944 | 0.120343 | 1 |
| 1 | 0.145282 | 978.76416 | 85.200147 | 36.968889 | 8.138688 | 3.63219 | 0.025578 | 13.51779 | 1.2299 | 5496.92824 | ... | 0.173229 | 0.49706 | 0.568932 | 9.292698 | 72.611063 | 27981.56275 | 29.13543 | 32.131996 | 21.978 | 0 |
| 2 | 0.47003 | 2635.10654 | 85.200147 | 32.360553 | 8.138688 | 6.73284 | 0.025578 | 12.82457 | 1.2299 | 5135.78024 | ... | 7.70956 | 0.97556 | 1.198821 | 37.077772 | 88.609437 | 13676.95781 | 28.022851 | 35.192676 | 0.196941 | 0 |
| 3 | 0.252107 | 3819.65177 | 120.201618 | 77.112203 | 8.138688 | 3.685344 | 0.025578 | 11.053708 | 1.2299 | 4169.67738 | ... | 6.122162 | 0.49706 | 0.284466 | 18.529584 | 82.416803 | 2094.262452 | 39.948656 | 90.493248 | 0.155829 | 0 |
| 4 | 0.380297 | 3733.04844 | 85.200147 | 14.103738 | 8.138688 | 3.942255 | 0.05481 | 3.396778 | 102.15198 | 5728.73412 | ... | 8.153058 | 48.50134 | 0.121914 | 16.408728 | 146.109943 | 8524.370502 | 45.381316 | 36.262628 | 0.096614 | 1 |


## 1.2 Data cleaning

### Checking for duplicates

In [5]:
duplicates = data.duplicated()
print("Number of duplicates:", duplicates.sum())

Number of duplicates: 0


### Categorical encoding

In [6]:
# Encode the categorical column 'EJ': 'A' -> 0, 'B' -> 1
data['EJ'] = data['EJ'].map({'A': 0, 'B': 1})

# Convert 'EJ' to integer
data['EJ'] = data['EJ'].astype(int)

### Dealing with insignificant missing data

In [7]:
missing_values = data.isna().sum().sum()
print("Number of missing values:", missing_values)

print("Shape of the data before dropping NULL values:", data.shape)

missing_names = []
for column_name, column_data in data.items():
    missing_percentage = (column_data.isnull().sum() / len(column_data)) * 100
    if missing_percentage < 9:
        data.dropna(subset=[column_name], inplace=True)
    else:
        missing_names.append(column_name) 
        

print("Shape of the data after dropping NULL values:", data.shape)

Number of missing values: 131
Shape of the data before dropping NULL values: (617, 57)
Shape of the data after dropping NULL values: (612, 57)
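
The per-column rule used in the cell above (drop rows for sparsely missing columns, keep the others for imputation) can also be expressed without an explicit loop. The following is a minimal sketch on a made-up frame; the `toy` DataFrame and the variable names `sparse_gaps` and `to_impute` are hypothetical, and only the 9% threshold comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 rows, three columns with different amounts of missing data
toy = pd.DataFrame({
    "a": [np.nan] + [1.0] * 19,       # 1 of 20 values missing (5%)
    "b": [1.0] * 17 + [np.nan] * 3,   # 3 of 20 values missing (15%)
    "c": [1.0] * 20,                  # nothing missing
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Columns below the 9% threshold: drop the affected rows
sparse_gaps = missing_pct[(missing_pct > 0) & (missing_pct < 9)].index.tolist()
# Columns at or above the threshold: keep the rows and impute later
to_impute = missing_pct[missing_pct >= 9].index.tolist()

cleaned = toy.dropna(subset=sparse_gaps)
print(sparse_gaps, to_impute, cleaned.shape)
```

Computing the percentages once, before any rows are dropped, also avoids modifying the frame while iterating over its columns.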


### Separating responses from predictors before regressing data

In [8]:
y = data['Class']  # Extract the response variable column
X = data.copy()  # Copy the full DataFrame; 'Class' stays in X for now and is dropped in section 1.3

print(missing_names)

['BQ', 'EL']


### Regressing missing data

In [9]:
#Regressing missing BQ data

train_X_BQ = X.drop('EL', axis=1)
full_BQ = train_X_BQ.copy()
test_X_BQ = train_X_BQ[train_X_BQ['BQ'].isnull()].copy()
test_X_BQ = test_X_BQ.drop('BQ', axis=1)

train_X_BQ.dropna(inplace=True)

train_Y_BQ = train_X_BQ['BQ']
train_X_BQ = train_X_BQ.drop('BQ', axis=1)

# LinearRegression was already imported in section 1.1
regressor = LinearRegression()
regressor.fit(train_X_BQ, train_Y_BQ)
test_Y_BQ = regressor.predict(test_X_BQ)

full_BQ.loc[full_BQ['BQ'].isnull(), 'BQ'] = test_Y_BQ
# Assign the result back instead of calling fillna in place on a column selection
X['BQ'] = X['BQ'].fillna(full_BQ['BQ'])

In [10]:
#Regressing missing EL data

train_X_EL = X.copy()
full_EL = train_X_EL.copy()
test_X_EL = train_X_EL[train_X_EL['EL'].isnull()].copy()
test_X_EL = test_X_EL.drop('EL', axis=1)

train_X_EL.dropna(inplace=True)

train_Y_EL = train_X_EL['EL']
train_X_EL = train_X_EL.drop('EL', axis=1)

regressor_EL = LinearRegression()
regressor_EL.fit(train_X_EL, train_Y_EL)
test_Y_EL = regressor_EL.predict(test_X_EL)

full_EL.loc[full_EL['EL'].isnull(), 'EL'] = test_Y_EL
# Assign the result back instead of calling fillna in place on a column selection
X['EL'] = X['EL'].fillna(full_EL['EL'])
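
The two cells above repeat the same steps for 'BQ' and 'EL'. As a sketch only, the pattern could be wrapped in one helper. The function name `impute_by_regression` and the toy data are hypothetical and not part of the notebook, and the helper assumes that every column other than the target is already free of missing values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_by_regression(df, target):
    """Fill NaNs in `target` with linear-regression predictions from the other columns.

    Assumes all other columns contain no missing values.
    """
    out = df.copy()
    missing = out[target].isna()
    if not missing.any():
        return out
    predictors = out.columns.drop(target)
    model = LinearRegression()
    # Fit on the rows where the target is known
    model.fit(out.loc[~missing, predictors], out.loc[~missing, target])
    # Predict only the rows where the target is missing
    out.loc[missing, target] = model.predict(out.loc[missing, predictors])
    return out

# Hypothetical toy data where y = 2 * x and one y value is missing
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
})
filled = impute_by_regression(toy, "y")
print(filled)
```

With a helper like this, the order of calls still matters when several columns have gaps: a column can only serve as a predictor once its own gaps are filled or it is excluded, which is why the notebook drops 'EL' before regressing 'BQ'.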

In [11]:
# Correlation of every column with the target, used to find weakly related features
correlation_matrix = X.corr()

class_correlations = correlation_matrix['Class']

low_correlation_columns = class_correlations[abs(class_correlations) < 0.05]

high_correlation_columns = class_correlations[abs(class_correlations) > 0.8]

print("Columns with low correlation (absolute value < 0.05):")
for column, correlation in low_correlation_columns.items():
    print(f"{column}: {abs(correlation)}")

print("\nColumns with high correlation (absolute value > 0.8):")
for column, correlation in high_correlation_columns.items():
    print(f"{column}: {abs(correlation)}")

Columns with low correlation (absolute value < 0.05):
AH: 0.04209024672535381
AZ: 0.01270428917437259
CB: 0.025776691867449207
CH: 0.0033435657242242853
CL: 0.014922630449296549
CS: 0.04964172746379496
DN: 0.027359905615540175
DV: 0.012521335596920034
EG: 0.0343881237284732
EU: 0.040800103477387994
FC: 0.030450532282503075
FS: 0.0011210074041133038
GH: 0.027651325525788365

Columns with high correlation (absolute value > 0.8):
Class: 1.0


In [12]:
col_drops = ['AH', 'AZ','CB', 'CH','CL','CS','DN','DV','EG','EU','FC','FS','GH'] 
X.drop(col_drops,axis=1,inplace=True) 
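
The list of dropped columns above is typed in by hand from the printed output. A minimal sketch of deriving the same kind of list directly from the threshold follows; the helper name `weakly_correlated` and the toy frame are hypothetical, and only the 0.05 cut-off on the absolute correlation with 'Class' comes from the notebook.

```python
import pandas as pd

def weakly_correlated(df, target, threshold=0.05):
    """Return the columns whose absolute correlation with `target` is below `threshold`."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()

# Hypothetical toy data: one feature tracks the target, the other is unrelated to it
toy = pd.DataFrame({
    "f_strong": [0, 1, 0, 1, 0, 1, 0, 1],
    "f_noise":  [1, 1, 0, 0, 0, 0, 1, 1],
    "Class":    [0, 1, 0, 1, 0, 1, 0, 1],
})

to_drop = weakly_correlated(toy, "Class")
reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```

Deriving the list this way keeps the dropped columns in sync with the data if the threshold or the cleaning steps change.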

## 1.3 Preparing for training

### Standardizing and splitting data

In [13]:
X = X.drop('Class', axis=1)
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [14]:
# train_test_split was already imported in section 1.1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
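
In this notebook the scaler is fitted on all rows before the split. A common variant, shown here only as a sketch and not as what the notebook does, is to split first and fit the scaler on the training rows alone, so that no statistics from the test rows enter the preprocessing. The synthetic arrays `X_demo` and `y_demo` and the `stratify` argument are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 100 rows, 3 features, 20% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# Split first; stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), int(y_te.sum()))
```

Switching to this order would change the numbers reported below, so it is best treated as a follow-up experiment.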

# 2 Training models
Now that we have prepared the datasets, we can start training models. Logistic Regression, Random Forest Classifier, Support Vector Machine, Naive Bayes and XGBoost were chosen as popular classification choices. XGBoost, Logistic Regression and Random Forest Classifier were then selected for further hyperparameter tuning, as they performed best out of the box among the 5 initial models. Tuning helped Random Forest the most: its misclassifications fell from 7 to 4, placing it ahead of both Logistic Regression and XGBoost, with 6 and 5 misclassifications respectively.
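
The misclassification counts quoted above follow directly from the confusion matrices printed in sections 2.1 and 2.2. As a small worked example, the sketch below recomputes them; the matrices are copied from those outputs, while the helper name `errors_and_accuracy` is hypothetical.

```python
# Confusion matrices copied from the notebook outputs.
# Layout used by sklearn: rows are true classes, columns are predicted classes.
confusion = {
    "Logistic Regression": [[103, 2], [4, 14]],
    "Random Forest": [[104, 1], [6, 12]],
    "Random Forest (tuned)": [[104, 1], [3, 15]],
    "XGBoost": [[102, 3], [2, 16]],
}

def errors_and_accuracy(cm):
    """Return (number of misclassified samples, accuracy) for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    errors = fp + fn  # off-diagonal entries
    return errors, (total - errors) / total

for name, cm in confusion.items():
    errors, acc = errors_and_accuracy(cm)
    print(f"{name}: {errors} misclassified, accuracy {acc:.4f}")
```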

## 2.1 Training models

In [15]:
logreg = LogisticRegression()

# Fit the model to the training data
logreg.fit(X_train, y_train)

# Predict the target variable for the test set
y_pred = logreg.predict(X_test)

# Calculate accuracy
accuracy_lr = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression:", accuracy_lr)

# Print confusion matrix
confusion_matrix_lr = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_matrix_lr)


random_forest_model = RandomForestClassifier()

# Train the model on the training data
random_forest_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_rf = random_forest_model.predict(X_test)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Accuracy of Random Forest Classifier:", accuracy_rf)

# Print confusion matrix
confusion_matrix_rf = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(confusion_matrix_rf)

# Initialize the SVC classifier
svc_model = SVC(kernel='rbf')
# Train the SVC model on the training data
svc_model.fit(X_train, y_train)

# Initialize the GaussianNB classifier
gnb_model = GaussianNB()
# Train the GaussianNB model on the training data
gnb_model.fit(X_train, y_train)

# Initialize the XGBClassifier
xgb_model = XGBClassifier()
# Train the XGBClassifier on the training data
xgb_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_svc = svc_model.predict(X_test)
y_pred_gnb = gnb_model.predict(X_test)
y_pred_xgb = xgb_model.predict(X_test)


# Calculate accuracy
accuracy_svc = accuracy_score(y_test, y_pred_svc)
print("Accuracy of Support Vector Machine:", accuracy_svc)


confusion_matrix_svc = confusion_matrix(y_test, y_pred_svc)
print("Confusion Matrix:")
print(confusion_matrix_svc)

# Calculate accuracy
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
print("Accuracy of Naive Bayes:", accuracy_gnb)


confusion_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)
print("Confusion Matrix:")
print(confusion_matrix_gnb)

# Calculate accuracy
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print("Accuracy of XGBoost:", accuracy_xgb)


confusion_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
print("Confusion Matrix:")
print(confusion_matrix_xgb)

Accuracy of Logistic Regression: 0.9512195121951219
Confusion Matrix:
[[103   2]
 [  4  14]]
Accuracy of Random Forest Classifier: 0.943089430894309
Confusion Matrix:
[[104   1]
 [  6  12]]
Accuracy of Support Vector Machine: 0.926829268292683
Confusion Matrix:
[[105   0]
 [  9   9]]
Accuracy of Naive Bayes: 0.8861788617886179
Confusion Matrix:
[[102   3]
 [ 11   7]]
Accuracy of XGBoost: 0.959349593495935
Confusion Matrix:
[[102   3]
 [  2  16]]


## 2.2 Fine-tuning top 3

In [16]:
# Initialize the Random Forest classifier
random_forest_best = RandomForestClassifier()

# Define the hyperparameters and their possible values
param_grid_rf_best = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 5, 10],       # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
}

# Perform grid search using 5-fold cross-validation
grid_search_rf_best = GridSearchCV(random_forest_best, param_grid_rf_best, cv=5)

# Fit the grid search to the training data
grid_search_rf_best.fit(X_train, y_train)

# Get the best parameters and the best model
best_params_rf_best = grid_search_rf_best.best_params_
best_model_rf_best = grid_search_rf_best.best_estimator_

# Make predictions on the test data using the best model
y_pred_rf_best = best_model_rf_best.predict(X_test)

accuracy_rf_best = accuracy_score(y_test, y_pred_rf_best)
print("Accuracy of best Random Forest:", accuracy_rf_best)

confusion_matrix_rf_best = confusion_matrix(y_test, y_pred_rf_best)
print("Confusion Matrix:")
print(confusion_matrix_rf_best)

Accuracy of best Random Forest: 0.967479674796748
Confusion Matrix:
[[104   1]
 [  3  15]]


In [17]:
# Initialize the XGBoost classifier
xgb_model_best = XGBClassifier()

# Define the hyperparameters and their possible values
param_grid_xgb_best = {
    'n_estimators': [100, 200, 300],  # Number of trees in the ensemble
    'max_depth': [3, 5, 7],           # Maximum depth of each tree
    'learning_rate': [0.1, 0.01, 0.001],  # Learning rate
}

# Perform grid search using 5-fold cross-validation
grid_search_xgb_best = GridSearchCV(xgb_model_best, param_grid_xgb_best, cv=5)

# Fit the grid search to the training data
grid_search_xgb_best.fit(X_train, y_train)

# Get the best parameters and the best model
best_params_xgb = grid_search_xgb_best.best_params_
best_model_xgb = grid_search_xgb_best.best_estimator_

# Make predictions on the test data using the best model
y_pred_xgb_best = best_model_xgb.predict(X_test)

accuracy_xgb_best = accuracy_score(y_test, y_pred_xgb_best)
print("Accuracy of best XGBoost:", accuracy_xgb_best)

confusion_matrix_xgb_best = confusion_matrix(y_test, y_pred_xgb_best)
print("Confusion Matrix:")
print(confusion_matrix_xgb_best)

Accuracy of best XGBoost: 0.959349593495935
Confusion Matrix:
[[102   3]
 [  2  16]]


In [18]:
# Initialize the Logistic Regression classifier
logreg_model_best = LogisticRegression()

# Define the hyperparameters and their possible values
param_grid_lr_best = {
    'C': [0.1, 1.0, 10.0],      # Inverse of regularization strength
    'penalty': ['l1', 'l2'],    # Regularization penalty type
    'solver': ['liblinear'],    # Algorithm to use in the optimization problem
}

# Perform grid search using 5-fold cross-validation
grid_search_lr_best = GridSearchCV(logreg_model_best, param_grid_lr_best, cv=5)

# Fit the grid search to the training data
grid_search_lr_best.fit(X_train, y_train)

# Get the best parameters and the best model
best_params_lr = grid_search_lr_best.best_params_
best_model_lr = grid_search_lr_best.best_estimator_

# Make predictions on the test data using the best model
y_pred_lr_best = best_model_lr.predict(X_test)

accuracy_lr_best = accuracy_score(y_test, y_pred_lr_best)
print("Accuracy of best Logistic Regression:", accuracy_lr_best)

confusion_matrix_lr_best = confusion_matrix(y_test, y_pred_lr_best)
print("Confusion Matrix:")
print(confusion_matrix_lr_best)

Accuracy of best Logistic Regression: 0.9512195121951219
Confusion Matrix:
[[103   2]
 [  4  14]]
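
The tuning cells above store `best_params_` but never print it, so the chosen settings are not visible in the outputs. The sketch below shows how the result of a grid search could be inspected. It runs on synthetic data from `make_classification` with a reduced grid over `C` only, so the data, the grid and the printed values are assumptions for illustration and say nothing about the notebook's actual best parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for X_train / y_train
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# Reduced grid for illustration: only the inverse regularization strength
grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(), grid, cv=5)
search.fit(X_syn, y_syn)

# best_params_ holds the winning combination,
# best_score_ its mean cross-validated accuracy
print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
```

Reporting `best_score_` next to the test accuracy also helps to judge whether a gain on the 123 held-out samples reflects a real improvement or the particular split.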


# 3 Conclusion
In conclusion, the tuned Random Forest Classifier proved the most accurate model on this dataset; however, further testing is needed. Additionally, different penalty strategies for the model should be discussed with healthcare providers, as it may be much more important to catch people with the disease even if that means flagging more healthy people as well. As for other models, it might prove useful to test deep learning architectures that can capture interactions between features, rather than only each feature's individual effect on the result.
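
To make the trade-off between missed cases and false alarms concrete, recall and precision for the positive class can be computed from the confusion matrices of the three tuned models. The matrices below are copied from the outputs in section 2.2; the helper name `recall_precision` is hypothetical.

```python
# Confusion matrices of the tuned models, copied from the outputs in section 2.2.
# Layout used by sklearn: [[TN, FP], [FN, TP]]
tuned = {
    "Random Forest": [[104, 1], [3, 15]],
    "XGBoost": [[102, 3], [2, 16]],
    "Logistic Regression": [[103, 2], [4, 14]],
}

def recall_precision(cm):
    """Return (recall, precision) of the positive class for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)     # share of people with the condition that the model catches
    precision = tp / (tp + fp)  # share of positive predictions that are correct
    return recall, precision

for name, cm in tuned.items():
    recall, precision = recall_precision(cm)
    print(f"{name}: recall {recall:.3f}, precision {precision:.3f}")
```

On these numbers the tuned Random Forest misses 3 of the 18 positive cases while XGBoost misses 2, so a metric that weights missed cases more heavily than false alarms could change which model is preferred.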