# Final assignment - Heart dataset - Group 2 - Alessandro, Fadi & Timon

**Context:**

<sub>
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (31% of all deaths). Four out of five CVD deaths result from heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years old. Heart failure is a common event caused by CVDs, and this dataset comprises 11 features for predicting potential heart disease. People with cardiovascular disease or at high cardiovascular risk (due to risk factors such as hypertension, diabetes, hyperlipidemia, or established disease) need early detection and management, which a machine learning model can support.
</sub>

**Goal:**

<sub>To predict whether a patient has heart disease, based on the "Heart Failure Prediction" dataset. The output is a binary class label (0 or 1): 1 for heart disease, 0 for normal.</sub>

**Features (x):**

<sub>The features to be used are as follows: Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak and ST_Slope.</sub>

**Target Variable (y):**

<sub>The target variable is "HeartDisease."</sub>

**Attribute Information:**

<sub>

- Age: Age of the patient [years]
- Sex: Sex of the patient [M: Male, F: Female]
- ChestPainType: Chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: Resting blood pressure [mm Hg]
- Cholesterol: Serum cholesterol [mg/dl]
- FastingBS: Fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: Resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: Maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: Exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
- ST_Slope: The slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: Output class [1: heart disease, 0: Normal]

</sub>

--- Start of Timon's part ---

## Step 1 - Data Loading and Initial Exploration
<sub>

- Load the dataset into a Pandas DataFrame.
- Display basic information about the dataset.

</sub>

In [None]:
# Imports.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assign & print the dataset.
dataset = 'data/modified_heart_dataset_supervised.csv'
df = pd.read_csv(dataset)
df

# Based on the dataset, the columns can be divided as follows:

# Numerical columns:
# - Age
# - RestingBP
# - Cholesterol
# - FastingBS (binary flag stored as 0/1)
# - MaxHR
# - Oldpeak
# - GeneticMarker1
# - GeneticMarker2

# Categorical columns:
# - Sex
# - ChestPainType
# - RestingECG
# - ExerciseAngina (Y/N, encoded in Step 4)
# - ST_Slope
# - BodyWeightCategory
# - HeartDisease (target)

In [None]:
# Explore the distribution of the HeartDisease column, which is the target variable.

# Print the number of people with heart disease and without heart disease.
print(df['HeartDisease'].value_counts())

# The number of people with and without heart disease is almost equal.
# This is good as it means the dataset is balanced.
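
The balance check can also be expressed as relative frequencies, which are easier to compare than raw counts. The sketch below uses a small made-up stand-in for the `HeartDisease` column (the name `toy` and its values are assumptions for illustration, not data from the notebook).

```python
import pandas as pd

# Toy stand-in for the target column; the values are made up for illustration.
toy = pd.DataFrame({"HeartDisease": [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]})

# Absolute counts per class (what value_counts() prints by default).
counts = toy["HeartDisease"].value_counts()

# Relative frequencies per class; shares close to 0.5 indicate a balanced target.
shares = toy["HeartDisease"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real dataframe the same call would be `df['HeartDisease'].value_counts(normalize=True)`.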

In [None]:
# Display basic information about the dataset and check for missing values.
df.info()

## Step 2 - Data Preparation

<sub>

- Identify and handle missing values.

</sub>

In [None]:
# Secondary check for missing values.
missing_values = df.isnull().sum()
print(missing_values)

# No missing values found, proceed to the next step.

In [None]:
# Check for duplicate values.
duplicate_values = df.duplicated().sum()
print(duplicate_values)

# No duplicate values found, proceed to the next step.

## Step 3 - Exploratory Data Analysis (EDA)

<sub>

- Create visualizations to explore the relationships between the features and the target variable.
- Analyze the distribution of the target variable.

</sub>

In [None]:
# Display the summary statistics.
print(df.describe())

### Numerical columns

In [None]:
# Check out the correlation between the columns in the dataset using a heatmap.

# Set up the figure size.
plt.figure(figsize=(16, 9))

# Calculate the correlation matrix using all numerical columns.
cor = df[['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak',
           'GeneticMarker1', 'GeneticMarker2', 'HeartDisease']].corr()

# Create a heatmap with correlation values annotated.
# Correlations range from -1 to 1, so the color scale covers that full range.
heatmap = sns.heatmap(data=cor, annot=True, vmin=-1, vmax=1, center=0, cmap='mako')

# Set the title and adjust the padding.
heatmap.set_title('Correlation Heatmap - HeartDisease', fontdict={'fontsize': 12}, pad=12)

# Show the heatmap.
plt.show()

In [None]:
# Analyze the correlation between the features and the target variable HeartDisease.
# The focus is on the correlation of each eligible feature with HeartDisease.

# Creating a heatmap to visualize correlation.
plt.figure(figsize=(8, 12))

# Selecting relevant columns for correlation analysis.
cor = df[['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak',
       'GeneticMarker1', 'GeneticMarker2', 'HeartDisease']].corr()

# Sorting and extracting correlation with HeartDisease in descending order.
heatmap = sns.heatmap(data=cor[['HeartDisease']].sort_values(by='HeartDisease', ascending=False),
                      vmin=-1, vmax=1, annot=True, cmap='BrBG')

# Setting the title and formatting for better readability.
heatmap.set_title('Features Correlating with Heart Disease', fontdict={'fontsize': 18}, pad=16)

# Displaying the heatmap.
plt.show()

# Correlation interpretation guide:
# Perfect Negative Correlation: -1.0
# Very Strong Negative Correlation: -0.9 to -1.0
# Strong Negative Correlation: -0.7 to -0.9
# Moderate Negative Correlation: -0.5 to -0.7
# Weak Negative Correlation: -0.3 to -0.5
# Very Weak Negative Correlation: -0.1 to -0.3

# No Correlation: 0.0

# Very Weak Positive Correlation: 0.1 to 0.3
# Weak Positive Correlation: 0.3 to 0.5
# Moderate Positive Correlation: 0.5 to 0.7
# Strong Positive Correlation: 0.7 to 0.9
# Very Strong Positive Correlation: 0.9 to 1.0
# Perfect Positive Correlation: 1.0

# Conclusion:

# There seems to be no correlation between GeneticMarker1 and HeartDisease.
# There seems to be a very weak positive correlation between Cholesterol and HeartDisease.
# There seems to be a weak negative correlation between MaxHR and HeartDisease.
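
The interpretation guide above can be written as a small helper so that every coefficient is labeled the same way. This is only a sketch: the function name `describe_correlation` is hypothetical, values with an absolute size below 0.1 are treated as "no correlation" (the guide leaves that range unlabeled), and a value on a shared boundary such as 0.3 is assigned to the stronger band.

```python
def describe_correlation(r: float) -> str:
    """Map a correlation coefficient to the labels of the guide above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")

    strength = abs(r)
    if strength < 0.1:
        # Assumption: the guide does not label (0, 0.1); treat it as no correlation.
        return "no correlation"

    if strength == 1.0:
        label = "perfect"
    elif strength >= 0.9:
        label = "very strong"
    elif strength >= 0.7:
        label = "strong"
    elif strength >= 0.5:
        label = "moderate"
    elif strength >= 0.3:
        label = "weak"
    else:
        label = "very weak"

    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"


print(describe_correlation(-0.4))
print(describe_correlation(0.2))
```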

In [None]:
# Variable assignments.

# Set the number of bins for the histograms below.
bins = 20

#### Cholesterol

In [None]:
# Fadi and Alex raised concerns about the Cholesterol column, which seems to contain a rather high number of 0 values.

# Doing a check here.
zero_cholesterol_count = (df['Cholesterol'] == 0).sum()
print(f"The number of patients with a cholesterol value of 0 is {zero_cholesterol_count}.")
print(f"The correlation between Heart Disease and Cholesterol is {df['HeartDisease'].corr(df['Cholesterol'])}")
# The number of 0 values in the Cholesterol column seems too high. This needs to be explored further.
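
One possible way to handle these zeros, different from the outlier removal used in this notebook, is to treat them as missing measurements and fill them with the median of the measured values. The sketch below runs on made-up numbers. Whether a 0 really means "not measured" is an assumption that would need to be checked against the data source.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cholesterol column; the values are made up.
toy = pd.DataFrame({"Cholesterol": [0, 180, 220, 0, 260, 240]})

# Assumption: a value of 0 means "not measured", not a real reading.
cholesterol = toy["Cholesterol"].replace(0, np.nan)

# Median of the measured values only (NaN is skipped by default).
median_measured = cholesterol.median()

# Fill the gaps with that median instead of dropping the rows.
toy["Cholesterol"] = cholesterol.fillna(median_measured)

print(median_measured)
print(toy)
```

Imputing keeps all rows, while dropping them shrinks the dataset. Which option is better depends on how many rows are affected.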

In [None]:
# Create a boxplot for the Cholesterol column.
df.boxplot(column='Cholesterol')
plt.show()
# The boxplot shows a lot of outliers. These are removed in the next step.

In [None]:
# Calculate the IQR for the Cholesterol column.
Q1 = df['Cholesterol'].quantile(0.25)
Q3 = df['Cholesterol'].quantile(0.75)
IQR = Q3 - Q1

# Define the upper and lower bounds for outliers.
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers from the Cholesterol column.
df = df[(df['Cholesterol'] >= lower_bound) & (df['Cholesterol'] <= upper_bound)]

# Display the summary statistics again to check results.
print(df.describe())
# Standard deviation has decreased, indicating that the outliers have been removed.
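
To make the IQR rule concrete, here is a worked example on a tiny made-up series. The helper name `iqr_bounds` and the sample values are assumptions for illustration. The rule itself is the same as in the cell above (1.5 times the IQR below Q1 and above Q3).

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) outlier bounds of the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample with one very low and one very high value.
toy = pd.Series([0, 190, 200, 210, 220, 230, 240, 250, 600])

lower, upper = iqr_bounds(toy)

# Keep only the values inside the bounds.
kept = toy[(toy >= lower) & (toy <= upper)]

print(lower, upper)
print(kept.tolist())
```

For this sample Q1 is 200 and Q3 is 240, so the IQR is 40 and the bounds are 140 and 300. Both 0 and 600 fall outside and are removed.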

In [None]:
# Explore the relation between Cholesterol and HeartDisease.

# Declare variables.
no_heart_disease_cholesterol = df[df['HeartDisease'] == 0]['Cholesterol']
heart_disease_cholesterol = df[df['HeartDisease'] == 1]['Cholesterol']

# Create a histogram.
plt.hist([no_heart_disease_cholesterol, heart_disease_cholesterol], bins=bins, alpha=0.5, label=['No heart Disease', 'Heart Disease'], color=['green', 'red'], histtype='bar')

# Add labels and show the plot.
plt.gca().set(title='Relation between Cholesterol and Heart Disease', xlabel='Cholesterol', ylabel='Frequency')

# Add legend to the histogram.
plt.legend()

# Show the plot.
plt.show()

# There seems to be a relation between cholesterol and heart disease. 
# The lower the cholesterol, the lower the chance of heart disease.

# The correlation between Heart Disease and Cholesterol has also increased after removing the outliers.
print(f"The correlation between Heart Disease and Cholesterol is {df['HeartDisease'].corr(df['Cholesterol'])}")

#### GeneticMarker1

In [None]:
# Explore the relation between GeneticMarker1 and HeartDisease.

# Variable assignments.
no_heart_disease_genetic_marker1 = df[df['HeartDisease'] == 0]['GeneticMarker1']
heart_disease_genetic_marker1 = df[df['HeartDisease'] == 1]['GeneticMarker1']

# Create a histogram.
plt.hist([no_heart_disease_genetic_marker1, heart_disease_genetic_marker1], bins=bins, alpha=0.5, label=['No Heart Disease', 'Heart Disease'], color=['green', 'red'])
# Add labels and title to the histogram.
plt.gca().set(title='Histogram of GeneticMarker1 by Heart Disease', xlabel='GeneticMarker1', ylabel='Frequency')
# Add legend to the histogram.
plt.legend()
# Show the plot.
plt.show()

# There seems to be no correlation between GeneticMarker1 and heart disease:
# the distribution of GeneticMarker1 looks the same whether or not a person has heart disease.
# Therefore the GeneticMarker1 column is eligible to be dropped.

#### GeneticMarker2

In [None]:
# Explore the relation between GeneticMarker2 and HeartDisease.
# GeneticMarker1 did not show any correlation with HeartDisease, therefore it might be interesting to check how this compares with GeneticMarker2.

# Variable assignments.
no_heart_disease_genetic_marker2 = df[df['HeartDisease'] == 0]['GeneticMarker2']
heart_disease_genetic_marker2 = df[df['HeartDisease'] == 1]['GeneticMarker2']

# Create a histogram.
plt.hist([no_heart_disease_genetic_marker2, heart_disease_genetic_marker2], bins=bins, alpha=0.5, label=['No Heart Disease', 'Heart Disease'], color=['green', 'red'])
# Add labels and title to the histogram.
plt.gca().set(title='Histogram of GeneticMarker2 by Heart Disease', xlabel='GeneticMarker2', ylabel='Frequency')
# Add legend to the histogram.
plt.legend()
# Show the plot.
plt.show()

# There does seem to be a correlation between GeneticMarker2 and heart disease.
# The higher the value of GeneticMarker2, the more likely a person is to have heart disease and vice versa.
# Therefore the GeneticMarker2 column will be kept in.

#### MaxHR

In [None]:
# Explore the relation between MaxHR and HeartDisease.

# Variable assignments.
no_heart_disease_maxhr = df[df['HeartDisease'] == 0]['MaxHR']
heart_disease_maxhr = df[df['HeartDisease'] == 1]['MaxHR']

# Create a histogram.
plt.hist([no_heart_disease_maxhr, heart_disease_maxhr], bins=bins, alpha=0.5, label=['No Heart Disease', 'Heart Disease'], color=['green', 'red'])
# Add labels and title to the histogram.
plt.gca().set(title='Histogram of MaxHR by Heart Disease', xlabel='MaxHR', ylabel='Frequency')
# Add legend to the histogram.
plt.legend()
# Show the plot.
plt.show()

# There also seems to be a correlation between MaxHR and heart disease.
# The higher the value of MaxHR, the less likely a person is to have heart disease and vice versa.

#### Age

In [None]:
# Explore the relation between the Age and HeartDisease columns.

# Variable assignments.
heart_disease_age = df[df['HeartDisease'] == 1]['Age']
no_heart_disease_age = df[df['HeartDisease'] == 0]['Age']

# Create a histogram.
plt.hist([no_heart_disease_age, heart_disease_age], bins=bins, alpha=0.5, label=['No Heart Disease', 'Heart Disease'], color=['green', 'red'])

# Add labels and title to the histogram.
plt.gca().set(title='Relation between Age and Heart Disease', xlabel='Age', ylabel='Frequency')

# Add legend to the histogram.
plt.legend()

# Show the plot.
plt.show()

# From the looks of it, there seems to also be a relation between age and heart disease. 
# Older people seem to be more prone to heart disease.

--- End of Timon's part ---

--- Start of Fadi's part ---

### Categorical columns

#### Sex

In [None]:
# Explore the relation between the Sex and HeartDisease columns.

# Filter the dataframe for heart disease cases and non-heart disease cases for males and females.
male_heart_disease = df[(df['HeartDisease'] == 1) & (df['Sex'] == 'M')]
male_no_heart_disease = df[(df['HeartDisease'] == 0) & (df['Sex'] == 'M')]
female_heart_disease = df[(df['HeartDisease'] == 1) & (df['Sex'] == 'F')]
female_no_heart_disease = df[(df['HeartDisease'] == 0) & (df['Sex'] == 'F')]

# Set the number of categories and the width of each bar.
categories = ['No Heart Disease', 'Heart Disease']
bar_width = 0.5

# Set the x-axis positions for the bars.
male_positions = [0, 1]
female_positions = [x + bar_width for x in male_positions]

# Set the heights of the bars in the same order as `categories` (no heart disease first).
male_heights = [len(male_no_heart_disease), len(male_heart_disease)]
female_heights = [len(female_no_heart_disease), len(female_heart_disease)]

# Create the bar plot.
plt.bar(male_positions, male_heights, bar_width, label='Male', color='blue')
plt.bar(female_positions, female_heights, bar_width, label='Female', color='pink')

# Add labels and title to the plot.
plt.gca().set(title='Relation between sex and heart disease', xlabel='Heart Disease', ylabel='Frequency')

# Add x-axis tick labels, centered between each male/female pair of bars.
plt.xticks([position + bar_width / 2 for position in male_positions], categories)

# Add legend to the plot.
plt.legend()

# Show the plot.
plt.show()

# There seems to be a very strong relation between sex and heart disease.
# Check the number of males vs females.
print("Number of males:", len(df[df['Sex'] == 'M']))
print("Number of females:", len(df[df['Sex'] == 'F']))
# There are far more males than females. 
# This means we should start looking at relative numbers instead.

In [None]:
# Calculate the total number of males and females.
total_males = len(df[df['Sex'] == 'M'])
total_females = len(df[df['Sex'] == 'F'])

# Calculate the proportion of males and females with heart disease.
male_heart_disease_prop = len(df[(df['HeartDisease'] == 1) & (df['Sex'] == 'M')]) / total_males
female_heart_disease_prop = len(df[(df['HeartDisease'] == 1) & (df['Sex'] == 'F')]) / total_females

# Set the number of categories and the width of each bar.
categories = ['No Heart Disease', 'Heart Disease']
bar_width = 0.35

# Set the x-axis positions for the bars.
male_positions = [0, 1]
female_positions = [x + bar_width for x in male_positions]

# Set the heights of the bars to the proportions instead of the absolute numbers.
male_heights = [1 - male_heart_disease_prop, male_heart_disease_prop]
female_heights = [1 - female_heart_disease_prop, female_heart_disease_prop]

# Create the bar plot.
plt.bar(male_positions, male_heights, bar_width, label='Male', color='blue')
plt.bar(female_positions, female_heights, bar_width, label='Female', color='pink')

# Add labels and title to the plot.
plt.gca().set(title='Relative Relation between Sex and Heart Disease', xlabel='Heart Disease', ylabel='Proportion')

# Add x-axis tick labels.
plt.xticks([0.17, 1.17], categories)

# Add legend to the plot.
plt.legend()

# Show the plot.
plt.show()

# After calculating the relative numbers, it is clear that there is indeed a strong relation between sex and heart disease.
# Males are more prone to heart disease than females.

#### ChestPainType

In [None]:
# Explore the relation between the ChestPainType and HeartDisease columns.

# To-do

#### RestingECG

In [None]:
# Explore the relation between the RestingECG and HeartDisease columns.

# To-do

#### ST_Slope

In [None]:
# Explore the relation between the ST_Slope and HeartDisease columns.

# To-do
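
The three to-do cells (ChestPainType, RestingECG and ST_Slope) could be approached by comparing the share of heart disease cases per category, in the same spirit as the relative numbers used for Sex. The sketch below is only a suggestion and runs on made-up rows. The helper name `disease_rate_by` is hypothetical.

```python
import pandas as pd

# Toy stand-in data; the rows are made up for illustration.
toy = pd.DataFrame({
    "ChestPainType": ["ASY", "ASY", "ASY", "ASY", "ATA", "ATA", "NAP", "NAP"],
    "HeartDisease": [1, 1, 1, 0, 0, 0, 1, 0],
})


def disease_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows with HeartDisease == 1 within each category of `column`."""
    # The target is coded 0/1, so the group mean equals the share of positives.
    return frame.groupby(column)["HeartDisease"].mean()


rates = disease_rate_by(toy, "ChestPainType")
print(rates)
```

On the real dataframe the helper could be called once per column, for example `disease_rate_by(df, 'ST_Slope')`, and the result plotted with `rates.plot(kind='bar')`.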

#### BodyWeightCategory

In [None]:
# Explore the relation between the BodyWeightCategory and HeartDisease columns.

# Define the desired order of categories
categories = ['Underweight', 'Normal', 'Overweight', 'Obese']

# Filter the dataframe for heart disease cases and non-heart disease cases for different body weight categories.
heart_disease = df[df['HeartDisease'] == 1]['BodyWeightCategory']
no_heart_disease = df[df['HeartDisease'] == 0]['BodyWeightCategory']

# Set the width of each bar (below 0.5 so that neighboring groups do not touch).
bar_width = 0.4

# Set the x-axis positions for the bars based on the desired order.
heart_disease_positions = range(len(categories))
no_heart_disease_positions = [x - bar_width for x in heart_disease_positions]

# Set the heights of the bars.
heart_disease_heights = [len(heart_disease[heart_disease == category]) for category in categories]
no_heart_disease_heights = [len(no_heart_disease[no_heart_disease == category]) for category in categories]

# Create the bar plot.
plt.bar(heart_disease_positions, heart_disease_heights, bar_width, label='Heart Disease', color='red')
plt.bar(no_heart_disease_positions, no_heart_disease_heights, bar_width, label='No Heart Disease', color='green')

# Add labels and title to the plot.
plt.gca().set(xlabel='Body Weight Category', ylabel='Number of Cases', title='Comparison of Heart Disease Cases by Body Weight Category')

# Add x-axis tick labels, centered between the two bars of each category.
plt.xticks([x - bar_width / 2 for x in heart_disease_positions], categories)

# Add legend to the plot.
plt.legend()

# Show the plot.
plt.show()

# The relation between BodyWeightCategory and HeartDisease seems rather unexpected.
# The expected outcome would be that people with a higher body weight are more prone to heart disease.
# However, the bar chart shows that weight seems to have no clear effect on heart disease.
# Therefore the BodyWeightCategory column is eligible to be dropped.

In [None]:
# Drop the columns that are not needed.

df = df.drop(columns=['GeneticMarker1', 'BodyWeightCategory'])

# Check the dataset after dropping the columns.
df

## Step 4 - Feature Engineering

<sub>

- Encode categorical variables using Dummy Encoding.

</sub>

In [None]:
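# Binary encoding: map N/Y to 0/1.
# Assigning the result back avoids calling replace(..., inplace=True) on a column,
# which may not update df in recent pandas versions.
df['ExerciseAngina'] = df['ExerciseAngina'].map({'N': 0, 'Y': 1})

# Dummy encoding.
df = pd.get_dummies(df, columns=['Sex', 'ChestPainType', 'RestingECG', 'ST_Slope'], dtype=int)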
# Check out the updated df.
df
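
To show what the encoding step produces, here is a small sketch on a made-up three-row frame. It also contrasts keeping one column per category with dropping the first category through `drop_first=True`. The notebook's own cell keeps all columns. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with one binary and one multi-category column.
toy = pd.DataFrame({
    "ExerciseAngina": ["N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

# Binary column: map the two labels to 0/1 and assign the result back.
toy["ExerciseAngina"] = toy["ExerciseAngina"].map({"N": 0, "Y": 1})

# One column per category (what get_dummies does by default).
one_hot = pd.get_dummies(toy, columns=["ST_Slope"], dtype=int)

# Dropping the first category leaves it as the implicit reference level.
dummy = pd.get_dummies(toy, columns=["ST_Slope"], dtype=int, drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

Dropping one level removes a perfectly redundant column, which can matter for linear models such as logistic regression. Tree-based models are usually indifferent to it.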

--- End of Fadi's part ---

--- Start of Alessandro's part ---

## Step 5 - Model Training

<sub>

- Import necessary libraries for model training and evaluation
- Split the dataset into training and testing sets.
- Try these supervised learning algorithms (logistic regression, decision tree, random forest and SVM) and train a model to predict heart disease.
- Evaluate the performance of the models using accuracy, precision, recall and F1 score.

</sub>

In [None]:
# Import libraries for model training.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Split the dataset into features (X) and target variable (y).
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Tested with a few different test sizes but 0.2 seems to be the best option.
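
A possible refinement of the split, not used in this notebook, is the `stratify` parameter of `train_test_split`, which keeps the class ratio the same in the training and test sets. The sketch below demonstrates it on made-up data with 30% positives. The names `toy_X` and `toy_y` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 30 positives and 70 negatives.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 30 + [0] * 70)

# stratify=toy_y keeps the share of positives equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```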

In [None]:
# Train a LinearSVC model
lsvc = LinearSVC(max_iter=10000, dual=False)
lsvc.fit(X_train, y_train)

# Make predictions on the test set
lsvc_predictions = lsvc.predict(X_test)

# Evaluate the performance of the model
lsvc_accuracy = accuracy_score(y_test, lsvc_predictions)
lsvc_precision = precision_score(y_test, lsvc_predictions)
lsvc_recall = recall_score(y_test, lsvc_predictions)
lsvc_f1 = f1_score(y_test, lsvc_predictions)

print('LinearSVC Metrics:')
print(f'Accuracy: {lsvc_accuracy}')
print(f'Precision: {lsvc_precision}')
print(f'Recall: {lsvc_recall}')
print(f'F1 Score: {lsvc_f1}')

In [None]:
# Create and train the model.
model_lr = LogisticRegression(max_iter=2000)
model_lr.fit(X_train, y_train)

# Make predictions.
predictions_lr = model_lr.predict(X_test)

# Evaluate the performance of the model.
accuracy_lr = accuracy_score(y_test, predictions_lr)
precision_lr = precision_score(y_test, predictions_lr)
recall_lr = recall_score(y_test, predictions_lr)
f1_lr = f1_score(y_test, predictions_lr)

print('Logistic Regression Metrics:')
print(f'Accuracy: {accuracy_lr}')
print(f'Precision: {precision_lr}')
print(f'Recall: {recall_lr}')
print(f'F1 Score: {f1_lr}')

In [None]:
# Create and train the model.
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)

# Make predictions.
predictions_dt = model_dt.predict(X_test)

# Evaluate the performance of the model.
accuracy_dt = accuracy_score(y_test, predictions_dt)
precision_dt = precision_score(y_test, predictions_dt)
recall_dt = recall_score(y_test, predictions_dt)
f1_dt = f1_score(y_test, predictions_dt)

print('Decision Tree Metrics:')
print(f'Accuracy: {accuracy_dt}')
print(f'Precision: {precision_dt}')
print(f'Recall: {recall_dt}')
print(f'F1 Score: {f1_dt}')
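
The step description also lists a random forest, which is not trained in this notebook. The sketch below shows how one could be added, together with a small helper that computes the same four metrics used in the other cells. It runs on synthetic data from `make_classification`. The helper name `evaluate` and all variable names are assumptions, and the scores say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test) -> dict:
    """Fit the model and return accuracy, precision, recall and F1 on the test set."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions),
        "recall": recall_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
    }


# Synthetic stand-in data, not the heart dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

metrics_rf = evaluate(RandomForestClassifier(n_estimators=100, random_state=42), *split)
for name, value in metrics_rf.items():
    print(f"{name}: {value:.3f}")
```

The same helper could replace the repeated metric code in the other model cells, for example `evaluate(DecisionTreeClassifier(), X_train, X_test, y_train, y_test)`.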

In [None]:
# Create and train the model.
model_svm = SVC()
model_svm.fit(X_train, y_train)

# Make predictions.
predictions_svm = model_svm.predict(X_test)

# Evaluate the performance of the model.
accuracy_svm = accuracy_score(y_test, predictions_svm)
precision_svm = precision_score(y_test, predictions_svm)
recall_svm = recall_score(y_test, predictions_svm)
f1_svm = f1_score(y_test, predictions_svm)

print('SVM Metrics:')
print(f'Accuracy: {accuracy_svm}')
print(f'Precision: {precision_svm}')
print(f'Recall: {recall_svm}')
print(f'F1 Score: {f1_svm}')
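
SVMs are sensitive to the scale of the features, and columns such as Cholesterol and Oldpeak live on very different scales. One common way to account for this, not applied in this notebook, is to standardize the features inside a pipeline. The sketch below uses synthetic data with one artificially inflated feature. All names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the heart dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Inflate one feature to mimic columns measured on very different scales.
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is fitted on the training data only, then applied before the SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

score_scaled = scaled_svm.score(X_te, y_te)
print(score_scaled)
```

Putting the scaler in the pipeline keeps the test data out of the scaling statistics, which avoids leaking information into training.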

In [None]:
# Define the hyperparameter grid for Logistic Regression.
param_grid_lr = {
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
    'C': [0.1, 1, 10],
    'max_iter': [2000, 2500, 3000]
}

# Import the necessary library for hyperparameter tuning.
from sklearn.model_selection import GridSearchCV

# Create a Logistic Regression model.
model_lr = LogisticRegression()

# Perform GridSearchCV to find the best hyperparameters.
grid_search_lr = GridSearchCV(model_lr, param_grid_lr, cv=5)
grid_search_lr.fit(X_train, y_train)

# Print the best hyperparameters found by GridSearchCV.
print("Best hyperparameters:", grid_search_lr.best_params_)

# Evaluate the performance of the best model on the test set.
best_model_lr = grid_search_lr.best_estimator_
accuracy_best_lr = best_model_lr.score(X_test, y_test)
print("Accuracy on test set:", accuracy_best_lr)

In [None]:
# Define the hyperparameter grid for Decision Tree.
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20]
}

# Import the necessary library for hyperparameter tuning.
from sklearn.model_selection import GridSearchCV

# Create a Decision Tree model.
model_dt = DecisionTreeClassifier()

# Perform GridSearchCV to find the best hyperparameters.
grid_search_dt = GridSearchCV(model_dt, param_grid_dt, cv=5)
grid_search_dt.fit(X_train, y_train)

# Print the best hyperparameters found by GridSearchCV.
print("Best hyperparameters:", grid_search_dt.best_params_)

# Evaluate the performance of the best model on the test set.
best_model_dt = grid_search_dt.best_estimator_
accuracy_best_dt = best_model_dt.score(X_test, y_test)
print("Accuracy on test set:", accuracy_best_dt)
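
`GridSearchCV` selects hyperparameters by accuracy unless told otherwise. If missing a heart disease case is considered worse than a false alarm, the `scoring` parameter can be set to another metric such as recall. The sketch below shows this on synthetic data. It is an illustration under that assumption, not the tuning used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the heart dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Select max_depth by cross-validated recall instead of accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [None, 5, 10]},
    cv=5,
    scoring="recall",
)
search.fit(X_tr, y_tr)

print("Best hyperparameters:", search.best_params_)
print("Mean cross-validated recall:", search.best_score_)
# With scoring set, score() reports the same metric on the test set.
print("Recall on test set:", search.score(X_te, y_te))
```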

--- End of Alessandro's part ---