# Japan Heart Attack Classification

This project explores a dataset on heart attack incidents in Japan, with a focus on the differences between youth and adult age groups. The dataset records the health profiles, risk factors, and potential triggers associated with heart attacks in these two demographics. We will attempt to build a classification model that uses a patient's medical history to predict heart attack occurrence. The dataset was provided by a user on Kaggle (https://www.kaggle.com/datasets/ashaychoudhary/heart-attack-in-japan-youth-vs-adult).

## Import Libraries

In [158]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
from matplotlib import pyplot as plt 
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

In [159]:
df = pd.read_csv('japan_heart_attack_dataset.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      30000 non-null  int64  
 1   Gender                   30000 non-null  object 
 2   Region                   30000 non-null  object 
 3   Smoking_History          30000 non-null  object 
 4   Diabetes_History         30000 non-null  object 
 5   Hypertension_History     30000 non-null  object 
 6   Cholesterol_Level        30000 non-null  float64
 7   Physical_Activity        30000 non-null  object 
 8   Diet_Quality             30000 non-null  object 
 9   Alcohol_Consumption      26985 non-null  object 
 10  Stress_Levels            30000 non-null  float64
 11  BMI                      30000 non-null  float64
 12  Heart_Rate               30000 non-null  float64
 13  Systolic_BP              30000 non-null  float64
 14  Diastolic_BP          

Unnamed: 0,Age,Gender,Region,Smoking_History,Diabetes_History,Hypertension_History,Cholesterol_Level,Physical_Activity,Diet_Quality,Alcohol_Consumption,...,Extra_Column_6,Extra_Column_7,Extra_Column_8,Extra_Column_9,Extra_Column_10,Extra_Column_11,Extra_Column_12,Extra_Column_13,Extra_Column_14,Extra_Column_15
0,56,Male,Urban,Yes,No,No,186.400209,Moderate,Poor,Low,...,0.007901,0.794583,0.290779,0.497193,0.521995,0.799657,0.722398,0.148739,0.83401,0.061632
1,69,Male,Urban,No,No,No,185.136747,Low,Good,Low,...,0.083933,0.688951,0.830164,0.63449,0.302043,0.043683,0.451668,0.878671,0.535602,0.617825
2,46,Male,Rural,Yes,No,No,210.696611,Low,Average,Moderate,...,0.227205,0.496344,0.752107,0.181501,0.62918,0.018276,0.063227,0.146512,0.997296,0.974455
3,32,Female,Urban,No,No,No,211.165478,Moderate,Good,High,...,0.403182,0.741409,0.223968,0.329314,0.143191,0.907781,0.542322,0.922461,0.626217,0.228606
4,60,Female,Rural,No,No,No,223.814253,High,Good,High,...,0.689787,0.904574,0.757098,0.337761,0.362375,0.728552,0.176699,0.484749,0.312091,0.452809


The dataset contains 30000 rows and 32 columns. Fifteen of these columns (Extra_Column_1 to Extra_Column_15) are irrelevant and will be removed in the following section. 'Alcohol_Consumption' is the only column that contains null values.

Here's a summary of all the columns:

- **Age**: Age of patient
- **Gender**: Sex of patient
- **Region**: Region of residence (Urban or Rural)
- **Smoking_History**: Smokes/Smoked before
- **Diabetes_History**: Has/Had diabetes before
- **Hypertension_History**: Previous hypertension history
- **Cholesterol_Level**: Current cholesterol levels
- **Physical_Activity**: Frequency of exercise (Low, Moderate, High)
- **Diet_Quality**: Quality of diet (Poor, Average, Good)
- **Alcohol_Consumption**: Amount of alcohol consumption (Low, Moderate, High)
- **Stress_Levels**: Measured stress levels
- **BMI**: Body Mass Index (BMI)
- **Heart_Rate**: Average heart rate (bpm)
- **Systolic_BP**: Systolic blood pressure (mmHg)
- **Diastolic_BP**: Diastolic blood pressure (mmHg)
- **Family_History**: Family history of heart attacks
- **Heart_Attack_Occurrence**: Whether a heart attack has occurred before

## Data Cleaning and Preparation

In [160]:
df = df.drop(columns = ['Extra_Column_1', 'Extra_Column_2', 'Extra_Column_3', 'Extra_Column_4', 'Extra_Column_5', 'Extra_Column_6', 'Extra_Column_7',
                        'Extra_Column_8', 'Extra_Column_9', 'Extra_Column_10', 'Extra_Column_11', 'Extra_Column_12', 'Extra_Column_13', 
                        'Extra_Column_14', 'Extra_Column_15'])

In [161]:
df.head()

Unnamed: 0,Age,Gender,Region,Smoking_History,Diabetes_History,Hypertension_History,Cholesterol_Level,Physical_Activity,Diet_Quality,Alcohol_Consumption,Stress_Levels,BMI,Heart_Rate,Systolic_BP,Diastolic_BP,Family_History,Heart_Attack_Occurrence
0,56,Male,Urban,Yes,No,No,186.400209,Moderate,Poor,Low,3.644786,33.961349,72.301534,123.90209,85.682809,No,No
1,69,Male,Urban,No,No,No,185.136747,Low,Good,Low,3.384056,28.242873,57.45764,129.893306,73.524262,Yes,No
2,46,Male,Rural,Yes,No,No,210.696611,Low,Average,Moderate,3.810911,27.60121,64.658697,145.654901,71.994812,No,No
3,32,Female,Urban,No,No,No,211.165478,Moderate,Good,High,6.014878,23.717291,55.131469,131.78522,68.211333,No,No
4,60,Female,Rural,No,No,No,223.814253,High,Good,High,6.806883,19.771578,76.667917,100.694559,92.902489,No,No


### Dealing with null values

The 'Alcohol_Consumption' column has 3015 null values. We will replace them using backward filling (`bfill`), which copies the next valid value in the column into each null entry.

In [162]:
df[df.isnull().any(axis = 1)]

Unnamed: 0,Age,Gender,Region,Smoking_History,Diabetes_History,Hypertension_History,Cholesterol_Level,Physical_Activity,Diet_Quality,Alcohol_Consumption,Stress_Levels,BMI,Heart_Rate,Systolic_BP,Diastolic_BP,Family_History,Heart_Attack_Occurrence
7,38,Female,Urban,Yes,No,No,203.443219,High,Poor,,5.334899,14.979730,62.077527,138.127972,93.183468,No,No
16,53,Male,Rural,No,Yes,No,181.308138,Moderate,Average,,7.862301,31.152845,76.130461,126.735253,80.688332,No,No
31,29,Female,Urban,Yes,No,No,155.258188,Moderate,Good,,6.135744,28.545810,69.560957,102.542957,69.888920,Yes,No
35,61,Male,Urban,Yes,No,No,204.462973,Moderate,Good,,6.211564,23.026736,63.987772,148.858694,83.499061,No,Yes
37,66,Male,Urban,No,No,No,225.623215,Moderate,Good,,5.193163,28.083480,84.860209,122.623817,77.356078,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29942,19,Female,Urban,No,Yes,No,196.510358,High,Good,,4.857555,18.840955,73.712490,123.012627,71.667865,No,Yes
29946,27,Male,Urban,Yes,No,No,223.771158,High,Average,,5.477079,23.508492,50.071302,112.617064,80.582181,Yes,No
29961,78,Male,Urban,No,No,No,244.366064,Low,Poor,,6.691320,22.643378,88.814227,129.863203,97.209719,No,No
29962,21,Female,Urban,Yes,No,Yes,206.355701,Low,Average,,6.223192,26.814677,59.152131,108.056982,79.288965,No,No


In [163]:
# Backward fill: replace each null with the next valid value in the column
df.Alcohol_Consumption = df.Alcohol_Consumption.bfill()
df.Alcohol_Consumption.value_counts()

Alcohol_Consumption
Moderate    13400
Low         10169
High         6431
Name: count, dtype: int64
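
As a small illustration of what backward filling does, the sketch below applies `bfill` to a made-up Series (the values are hypothetical, not rows from the dataset). Each missing entry takes the next valid value below it, and a missing value at the very end stays missing because nothing follows it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; these are not rows from the real dataset
toy = pd.Series(['Low', np.nan, 'High', np.nan, np.nan, 'Moderate', np.nan])

# Backward fill: each NaN is replaced by the next non-null value
filled = toy.bfill()

print(filled.tolist())
print('Remaining nulls:', filled.isna().sum())
```

In the notebook the value counts above sum to 30000, so no trailing null was left unfilled.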

### Converting data types

In [164]:
# Boolean columns
bool_cols = ['Smoking_History', 'Diabetes_History', 'Hypertension_History', 'Family_History']

# Converting columns to boolean data type
for cols in bool_cols:
    df[cols] = df[cols].replace({'Yes':1, 'No':0})
    df[cols] = df[cols].astype('bool')

df['Gender'] = df['Gender'].replace({'Male':1, 'Female':0})
df['Gender'] = df['Gender'].astype('bool')
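
One possible alternative to `replace` followed by `astype('bool')` is to `map` the labels directly to booleans. The sketch below uses a hypothetical toy frame. A label that is missing from the mapping becomes NaN under `map`, so a simple check can catch it, whereas `astype('bool')` would silently turn any non-empty string into `True`.

```python
import pandas as pd

# Hypothetical toy frame with the same kind of labels as the dataset
toy = pd.DataFrame({
    'Smoking_History': ['Yes', 'No', 'No'],
    'Gender': ['Male', 'Female', 'Male'],
})

# Map labels straight to booleans
toy['Smoking_History'] = toy['Smoking_History'].map({'Yes': True, 'No': False})
toy['Gender'] = toy['Gender'].map({'Male': True, 'Female': False})

# Any label missing from a mapping would appear as NaN and fail this check
assert toy.notna().all().all()
print(toy)
```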

## Model Selection and Evaluation

### Feature Importances

In [165]:
# Feature and target variables
X = df.drop(columns = ['Heart_Attack_Occurrence'])
y = df.Heart_Attack_Occurrence

# Column groups by type, used to route each column to its transformer
cat_cols = ['Region']
num_cols = ['Age', 'Cholesterol_Level', 'Stress_Levels', 'BMI', 'Heart_Rate', 'Systolic_BP', 'Diastolic_BP']
bool_cols = ['Gender', 'Smoking_History', 'Diabetes_History', 'Hypertension_History', 'Family_History']
ord_cols = ['Physical_Activity', 'Diet_Quality', 'Alcohol_Consumption']

preprocesser = ColumnTransformer(
    transformers = [
        # `sparse_output` replaces the `sparse` argument (scikit-learn >= 1.2)
        ('cat', OneHotEncoder(sparse_output=False, drop='first'), cat_cols),
        ('num', StandardScaler(), num_cols),
        ('bin', 'passthrough', bool_cols),
        ('ord', OrdinalEncoder(categories=[['Low', 'Moderate', 'High'], ['Poor', 'Average', 'Good'], ['Low', 'Moderate', 'High']]), ord_cols)
    ]
)
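
To make the ordinal step concrete, here is a minimal sketch of `OrdinalEncoder` with explicit category orders, run on a hypothetical three-row frame. The position of each label in its list becomes its code, so 'Low' maps to 0 and 'High' to 2.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows using the same labels as the dataset
toy = pd.DataFrame({
    'Physical_Activity': ['Low', 'High', 'Moderate'],
    'Diet_Quality': ['Good', 'Poor', 'Average'],
})

# One ordered category list per column, lowest level first
encoder = OrdinalEncoder(categories=[['Low', 'Moderate', 'High'],
                                     ['Poor', 'Average', 'Good']])
codes = encoder.fit_transform(toy)

print(codes)
# [[0. 2.]
#  [2. 0.]
#  [1. 1.]]
```

Without the explicit lists the encoder would sort labels alphabetically, which would place 'High' before 'Low'.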

In [166]:
# Fit the preprocessor on the full dataset and transform it
# (this happens before the split, so the scaler also sees the test rows)
X_preprocessed = preprocesser.fit_transform(X)
X_preprocessed = pd.DataFrame(X_preprocessed, columns= preprocesser.get_feature_names_out())

# Split the data into train and test sets
x_train_processed, x_test_processed, y_train_processed, y_test_processed = train_test_split(X_preprocessed, y, test_size=0.2, random_state=0)
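
The cell above fits the preprocessor on all rows before splitting, so the scaling statistics include the test rows. A common way to avoid this is to split first and fit the transformer on the training rows only. The sketch below shows the idea on synthetic data with a plain `StandardScaler`; the arrays and variable names are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features and a binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# ... then learn the scaling from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.shape)
```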

In [167]:
# Fitting model for feature importance
dtc = DecisionTreeClassifier(random_state = 0, class_weight = 'balanced')
dtc.fit(x_train_processed, y_train_processed)

In [179]:
# Get feature importances
importances = dtc.feature_importances_

# Create a DataFrame to view feature importances
feature_importances = pd.DataFrame({'feature': x_train_processed.columns, 'importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='importance', ascending=False)

# Print features and importances
print(feature_importances.head(16))

                      feature  importance
5             num__Heart_Rate    0.135682
6            num__Systolic_BP    0.135092
4                    num__BMI    0.134873
3          num__Stress_Levels    0.129082
2      num__Cholesterol_Level    0.127831
7           num__Diastolic_BP    0.122539
1                    num__Age    0.083297
15   ord__Alcohol_Consumption    0.024309
13     ord__Physical_Activity    0.022278
14          ord__Diet_Quality    0.022185
12        bin__Family_History    0.013178
0           cat__Region_Urban    0.011880
10      bin__Diabetes_History    0.010446
8                 bin__Gender    0.009267
11  bin__Hypertension_History    0.009071
9        bin__Smoking_History    0.008989


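Heart rate ranks first, but only narrowly. The six continuous measurements (heart rate, systolic blood pressure, BMI, stress level, cholesterol level and diastolic blood pressure) all score between about 0.12 and 0.14, so no single feature dominates. Impurity-based importances also tend to favour continuous features, which offer many more candidate split points than the binary and ordinal columns.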
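
Impurity-based importances are computed from the training data, so a cross-check on held-out data can be useful. One option, not used in this notebook, is scikit-learn's `permutation_importance`, which measures how much the score drops when a feature's values are shuffled. A minimal sketch on synthetic data (the dataset and feature names here are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 6 features, only 2 of them informative
X_toy, y_toy = make_classification(n_samples=500, n_features=6, n_informative=2,
                                   n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out rows and record the score drop
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f'feature_{idx}: {result.importances_mean[idx]:.3f} '
          f'+/- {result.importances_std[idx]:.3f}')
```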

### Hyperparameter Tuning

In [169]:
# Creating parameters
parameters = {
    'min_samples_split':[2,4,6,8],
    'min_samples_leaf':[1,2,4,6]
}

# Initialising RandomizedSearchCV
rsv = RandomizedSearchCV(dtc, param_distributions = parameters, cv=5, scoring='accuracy')

# Fitting the search to the training set
rsv.fit(x_train_processed, y_train_processed)

In [170]:
# Best parameters and score
print("Best Parameters:", rsv.best_params_)
print("Best Score:", rsv.best_score_)

Best Parameters: {'min_samples_split': 2, 'min_samples_leaf': 1}
Best Score: 0.81775
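
The grid above has 4 × 4 = 16 combinations, while `RandomizedSearchCV` samples only `n_iter=10` of them by default, so some combinations may never be tried. With a grid this small, an exhaustive `GridSearchCV` (already imported) is affordable. The sketch below runs the same grid on synthetic data; the data is an assumption for illustration, so its best parameters say nothing about the heart attack dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

parameters = {
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 4, 6],
}

# Exhaustive search: every combination is evaluated with 5-fold CV
gsv = GridSearchCV(DecisionTreeClassifier(random_state=0, class_weight='balanced'),
                   param_grid=parameters, cv=5, scoring='accuracy')
gsv.fit(X_toy, y_toy)

n_candidates = len(gsv.cv_results_['params'])
print('Combinations evaluated:', n_candidates)
print('Best parameters:', gsv.best_params_)
```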


## Pipeline

In [171]:
# Initialising Pipeline
pipeline = Pipeline([('preprocess', preprocesser), ('classifier', rsv.best_estimator_)])

# Creating training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.25)

# Fitting model to training dataset
pipeline.fit(x_train, y_train)
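
A pipeline can also be passed to the search itself, so that preprocessing is refitted inside every cross-validation fold. Step parameters are then addressed as `<step name>__<parameter>`. The following is a sketch under stated assumptions: synthetic data, a plain `StandardScaler` in place of the notebook's `ColumnTransformer`, and a reduced grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the raw features
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

toy_pipeline = Pipeline([
    ('preprocess', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0, class_weight='balanced')),
])

# Parameters of a step are prefixed with the step name and a double underscore
param_grid = {
    'classifier__min_samples_split': [2, 8],
    'classifier__min_samples_leaf': [1, 6],
}

search = GridSearchCV(toy_pipeline, param_grid=param_grid, cv=5, scoring='accuracy')
search.fit(X_toy, y_toy)
print(search.best_params_)
```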

In [172]:
# Predictions from testing data
y_pred = pipeline.predict(x_test)

# Pipeline score
train_score = pipeline.score(x_train, y_train)
test_score = pipeline.score(x_test, y_test)
print(f'Train Score: {train_score}')
print(f'Test Score: {test_score}')

Train Score: 1.0
Test Score: 0.8176


### K-Fold Cross Validation

In [173]:
# Define K-Fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5-Fold CV

# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=kf, scoring='accuracy')

# Displaying cross val and accuracy scores
print(f'Cross-validation scores: {cv_scores}')
print(f'Average accuracy: {cv_scores.mean():.4f}')

Cross-validation scores: [0.81516667 0.81666667 0.82533333 0.815      0.82516667]
Average accuracy: 0.8195


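Our model's training and testing scores were 1.0 and 0.8176 respectively. A perfect training score is expected from an unpruned decision tree (`min_samples_leaf` of 1), which can memorise the training rows, so the gap between the two scores indicates overfitting rather than a well-trained model. 5-fold cross-validation gave an average accuracy of 0.8195, which is consistent with the test score. Because accuracy can look high whenever one class is much more common than the other, these scores should be compared against the share of the majority class before they are judged good.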
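
One way to put an accuracy score in context is to compare it with a baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels with a 90/10 class ratio; that ratio is an assumption for illustration, not the ratio in this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90% class 0, 10% class 1
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.zeros((1000, 1))  # the dummy model ignores the features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
y_hat = baseline.predict(X_te)

baseline_acc = accuracy_score(y_te, y_hat)
baseline_bal = balanced_accuracy_score(y_te, y_hat)
print(f'Baseline accuracy: {baseline_acc:.2f}')           # 0.90
print(f'Baseline balanced accuracy: {baseline_bal:.2f}')  # 0.50
```

A model is only informative to the extent that it beats this baseline; balanced accuracy makes the comparison easier because the majority-class baseline always scores 0.5 on it.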

## Conclusion

This project worked with a dataset of patient medical histories and records. The data was first cleaned and prepared, and a Decision Tree Classifier was fitted to estimate the feature importances. We then tuned two of the tree's parameters, combined the preprocessing and the tuned model in a pipeline, and evaluated it on held-out test data. The test accuracy was 0.8176, and 5-fold cross-validation gave a similar average of 0.8195, which suggests that the test result was not an outlier. The perfect training score, however, indicates that the tree overfits the training data.