# Heart Disease Prediction

- This is supplementary material for the [Machine Learning Simplified](https://themlsbook.com) book. It covers Python implementations of the topics discussed; detailed explanations can be found in the book. 
- I assume you know Python syntax and how it works. If you don't, I highly recommend taking a break to get familiar with the language before going further with my code. 
- This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> `.ipynb`) so you can reproduce the code and experiment with it. 


## About
In this project, you will build a model that **predicts** (the probability of) **heart disease in a patient**.

The project contains 7 sections, each with step-by-step instructions. Note that, as the lessons progress, we will move away from guided projects like this one toward less-guided projects with fewer instructions. My advice is therefore to try to understand why we do each step, and why in this order.


## Structure
The project is split into **7 sections**, each containing **step-by-step instructions** of what to do. These sections are the following:

1.   Import the Libraries
2.   Import the Datasets
3.   Data Preprocessing
4.   Data Overview
5.   Model Building
6.   Model Evaluation & Hyperparameter Tuning
7.   Conclusion

## Data
There are 2 datasets provided that you should use for this project:
- heart1.csv
- heart2.csv

### > Columns:
- age: age in years
- sex: (1 = male; 0 = female)
- cp: chest pain type
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- target: 1 or 0

### > Description:
Attribute Information: 
> 1. age 
> 2. sex 
> 3. chest pain type (4 values) 
> 4. resting blood pressure 
> 5. serum cholesterol in mg/dl 
> 6. fasting blood sugar > 120 mg/dl
> 7. resting electrocardiographic results (values 0,1,2)
> 8. maximum heart rate achieved 
> 9. exercise induced angina 
> 10. oldpeak = ST depression induced by exercise relative to rest 
> 11. the slope of the peak exercise ST segment 
> 12. number of major vessels (0-3) colored by fluoroscopy 
> 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values. One file has been "processed", that one containing the Cleveland database. All four unprocessed files also exist in this directory.

# 1. Import the Libraries

First things first: import the libraries you need. Keep adding required libraries to this cell as you go further with the project.

In [2]:
import pandas as pd

# 2. Import the datasets

Do the following:

*   **Step 1**: Understand where the current working directory is
*   **Step 2**: Import two datasets as df1 and df2
*   **Step 3**: Check the shape of each dataset by returning two lines with one print function: 


        Data shape of df1 is (X, Y),
        Data shape of df2 is (X, Y)

Use the `.format` method for that.

---

## Step 1
Understand where the current working directory is

In [1]:
!pwd

/Users/andrewwolf/Projects/hse-2022/project


## Step 2
Import two datasets as df1 and df2

In [3]:
df1 = pd.read_csv("heart1.csv")
df2 = pd.read_csv("heart2.csv")

## Step 3
Check the shape of each dataset by returning two lines with one print function: 


        Data shape of df1 is (X, Y),
        Data shape of df2 is (X, Y)

Use the `.format` method for that.

In [4]:
df1.shape

(303, 8)
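
The two-line message requested in this step can be produced with a single `print` call and `str.format`. The sketch below is only an illustration: the two small DataFrames are hypothetical stand-ins, not the contents of `heart1.csv` and `heart2.csv`.

```python
import pandas as pd

# Stand-in frames for illustration only; in the notebook, df1 and df2 come from pd.read_csv.
df1 = pd.DataFrame({"patient_id": [1, 2, 3], "age": [63, 37, 41]})
df2 = pd.DataFrame({"patient_id": [1, 2, 3], "thalach": [150, 187, 172]})

# One print call, two lines: "\n" separates them and .format fills in the shapes.
message = "Data shape of df1 is {},\nData shape of df2 is {}".format(df1.shape, df2.shape)
print(message)
```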

# 3. Data Preprocessing

Do the following:


*   **Step 1:** Combine two datasets into one
*   **Step 2**: Write a function that checks the unique values of each column
*   **Step 3**: Check data types. Change if needed.
*   **Step 4**: Create a function that checks a given column for null values and inappropriate values (like "?" or "!") and, if there are any, replaces them with 0 (if numeric) or 'None' (if categorical). Run the function.
*   **Step 5**: You might have noticed that the target column contains values like "Null" and "One" that need to be converted to numeric form. Write a function that finds those cells and substitutes the appropriate value (either 0 or 1).
*   **Step 6**: Validate the data: check dtypes (presence of wrong values "?", "!", null values etc), and unique values for each column

---

## Step 1
Combine two datasets into one

In [10]:
df = pd.concat([df1, df2], axis=1)
df

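|  | patient_id | age | sex | cp | trestbps | chol | fbs | target | patient_id | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7365861 | 63 | 1 | 3 | 145 | 233 | 1 | 1 | 7365861 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 4786508 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 4786508 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 3975494 | 41 | 0 | 1 | 130 | 204 | 0 | 1 | 3975494 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 8380447 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 8380447 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 6894258 | 57 | ?0 | 0 | 120 | 354 | 0 | 1 | 6894258 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 3759764 | 57 | 0 | 0 | 140 | 241 | 0 | 0 | 3759764 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 |
| 299 | 3122171 | 45 | 1 | 3 | 110 | 264 | 0 | 0 | 3122171 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 |
| 300 | 9612252 | 68 | 1 | 0 | 144 | 193 | 1 | 0 | 9612252 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
| 301 | 2518231 | 57 | 1 | 0 | 130 | 131 | 0 | 0 | 2518231 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 |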
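
Concatenating with `axis=1` pairs rows by position and keeps both `patient_id` columns, as the output shows. If the two files were not guaranteed to be in the same row order, joining on the shared key would be a safer alternative. The sketch below illustrates that option on hypothetical stand-in data; it is not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical stand-in frames whose rows are deliberately in a different order.
df1 = pd.DataFrame({"patient_id": [11, 22, 33], "age": [63, 37, 41]})
df2 = pd.DataFrame({"patient_id": [33, 11, 22], "thalach": [172, 150, 187]})

# Join on the key; validate raises if a patient_id is duplicated on either side.
merged = df1.merge(df2, on="patient_id", how="inner", validate="one_to_one")
print(merged)
```

The result has a single `patient_id` column, and each patient's values line up regardless of row order.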


## Step 2
Write a function that checks the unique values of each column
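
One possible shape for such a function is sketched below. The name `show_unique_values`, the `max_values` parameter, and the demo data are all hypothetical. The sketch assumes column names are unique, so a duplicated `patient_id` column would need to be dropped or renamed first.

```python
import pandas as pd

def show_unique_values(frame, max_values=10):
    """Print and return the unique values of each column (truncated to max_values)."""
    summary = {}
    for column in frame.columns:
        values = frame[column].unique().tolist()
        summary[column] = values[:max_values]
        print("{}: {} unique -> {}".format(column, len(values), summary[column]))
    return summary

# Hypothetical demo data containing the kinds of odd values mentioned in this section.
demo = pd.DataFrame({"sex": ["1", "0", "?0", "1"], "target": ["1", "Null", "One", "0"]})
uniques = show_unique_values(demo)
```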

## Step 3
Check data types. Change if needed.

In [12]:
df.dtypes

patient_id      int64
age             int64
sex            object
cp              int64
trestbps        int64
chol            int64
fbs             int64
target         object
patient_id      int64
restecg         int64
thalach         int64
exang           int64
oldpeak       float64
slope           int64
ca              int64
thal            int64
dtype: object

## Step 4
Create a function that checks a given column for null values and inappropriate values (like "?" or "!") and, if there are any, replaces them with 0 (if numeric) or 'None' (if categorical). Run the function.
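
A minimal sketch of such a function follows. The name `clean_column` and its parameters are hypothetical. It treats any cell that contains one of the markers as bad, so a value like `"?0"` becomes 0 rather than being repaired, and it assumes that every remaining value in a numeric column can be parsed as a number.

```python
import pandas as pd

def clean_column(frame, column, bad_values=("?", "!"), numeric=True):
    """Replace nulls and cells containing a bad marker with 0 (numeric) or 'None' (categorical)."""
    filler = 0 if numeric else "None"
    as_text = frame[column].astype(str)
    is_bad = frame[column].isna()
    for marker in bad_values:
        # regex=False so that "?" is matched literally
        is_bad |= as_text.str.contains(marker, regex=False)
    cleaned = frame[column].mask(is_bad, filler)
    if numeric:
        cleaned = pd.to_numeric(cleaned)  # raises if other non-numeric text remains
    return frame.assign(**{column: cleaned})

# Hypothetical demo column with a bad marker and a missing value.
demo = pd.DataFrame({"sex": ["1", "?0", None, "0"]})
demo = clean_column(demo, "sex")
print(demo["sex"].tolist())
```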

## Step 5
You might have noticed that the target column contains values like "Null" and "One" that need to be converted to numeric form. Write a function that finds those cells and substitutes the appropriate value (either 0 or 1).
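
The conversion can be sketched with an explicit lookup table. The mapping below is an assumption: the text only mentions "Null" and "One", and reading "Null" as 0 is a guess that should be checked against the data. The function name `normalize_target` is hypothetical.

```python
import pandas as pd

# Assumed mapping; extend it if other spellings appear in the real column.
TARGET_MAP = {"Null": 0, "One": 1, "0": 0, "1": 1}

def normalize_target(frame, column="target"):
    """Map textual target labels to 0/1 and fail loudly on anything unexpected."""
    as_text = frame[column].astype(str).str.strip()
    mapped = as_text.map(TARGET_MAP)
    unknown = as_text[mapped.isna()].unique().tolist()
    if unknown:
        raise ValueError("Unmapped target values: {}".format(unknown))
    return frame.assign(**{column: mapped.astype(int)})

# Hypothetical demo column mixing strings and an integer.
demo = pd.DataFrame({"target": ["1", "Null", "One", 0]})
demo = normalize_target(demo)
print(demo["target"].tolist())
```

Raising on unknown labels, rather than silently filling them, makes it harder for a mislabeled row to slip into training.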

## Step 6
Validate the data: check dtypes (presence of wrong values "?", "!", null values etc), and unique values for each column
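
One way to run all of these checks at once is a small report with one row per column. The function name `validate` and the demo data are hypothetical, and the sketch assumes unique column names.

```python
import pandas as pd

def validate(frame, bad_markers=("?", "!")):
    """Return a per-column report: dtype, null count, cells with a bad marker, unique count."""
    rows = []
    for column in frame.columns:
        as_text = frame[column].astype(str)
        bad = pd.Series(False, index=frame.index)
        for marker in bad_markers:
            bad |= as_text.str.contains(marker, regex=False)
        rows.append({
            "column": column,
            "dtype": str(frame[column].dtype),
            "nulls": int(frame[column].isna().sum()),
            "bad_cells": int(bad.sum()),
            "n_unique": int(frame[column].nunique()),
        })
    return pd.DataFrame(rows).set_index("column")

# Hypothetical demo data: one clean column and one that still needs attention.
demo = pd.DataFrame({"age": [63, 37, 41], "sex": ["1", "?0", None]})
report = validate(demo)
print(report)
```

After preprocessing, the `nulls` and `bad_cells` columns should be zero everywhere and no numeric feature should still have dtype `object`.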

# 4. Data Overview

Observe the data:

*   **Step 1**: Find the mean of **trestbps** and **chol** for people with and without heart disease. 
*   **Step 2**: Find the mean of **thalach** for people with and without heart disease.

---

## Step 1
Find the mean of **trestbps** and **chol** for people with and without heart disease. 

## Step 2
Find the mean of **thalach** for people with and without heart disease.
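
Both steps in this section come down to a grouped mean. A minimal sketch on a tiny illustrative frame (the values are placeholders, not results from the real dataset):

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the cleaned, combined DataFrame.
demo = pd.DataFrame({
    "trestbps": [145, 130, 120, 140],
    "chol": [233, 250, 236, 241],
    "thalach": [150, 187, 178, 123],
    "target": [1, 1, 0, 0],
})

# One row per target value, one column per feature.
means = demo.groupby("target")[["trestbps", "chol", "thalach"]].mean()
print(means)
```

This requires `target` and the feature columns to be numeric already, which is what the preprocessing section is meant to ensure.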

<a id='model_building'></a>

# 5. Model Building

Do the following:
  
*   **Step 1**: Identify the X variables that are the most significant indicators for predicting the target. Set the target column as the y variable.
*   **Step 2**: Split the data into train and test
*   **Step 3**: Choose any classifier and train it. The classifier has to have some hyperparameters.

---

## Step 1
Identify the X variables that are the most significant indicators for predicting the target. Set the target column as the y variable.
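
There are many ways to rank features. One simple heuristic, sketched below, is to sort them by absolute correlation with the target and keep those above a cutoff. The 0.3 cutoff and the demo values are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({
    "thalach": [150, 187, 172, 178, 123, 132, 141, 115],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.2, 1.2, 3.4, 1.2],
    "exang":   [0, 0, 0, 0, 1, 0, 0, 1],
    "target":  [1, 1, 1, 1, 0, 0, 0, 0],
})

# Absolute Pearson correlation of every feature with the target, strongest first.
correlations = demo.corr()["target"].drop("target").abs().sort_values(ascending=False)
selected = correlations[correlations > 0.3].index.tolist()  # 0.3 is an arbitrary cutoff

X = demo[selected]
y = demo["target"]
print(correlations)
```

Correlation only captures linear, one-feature-at-a-time relationships, so treat the ranking as a starting point rather than a final answer.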

## Step 2
Split the data into train and test
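
A common way to do this is scikit-learn's `train_test_split`. The sketch below uses synthetic stand-in data; the 80/20 split and the `stratify` option are illustrative choices, not requirements of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 4 features, balanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```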

## Step 3
Choose any classifier and train it. The classifier has to have some hyperparameters that you would like to set.
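
As an illustration, the sketch below trains a decision tree with two explicitly set hyperparameters on synthetic data. The choice of classifier and the hyperparameter values are assumptions; any scikit-learn classifier follows the same `fit` pattern.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease features and target.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth and min_samples_leaf are the hyperparameters chosen for this example.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
print(clf.get_params()["max_depth"], clf.get_depth())
```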

<a id='hyperparameter_tuning'></a>

# 6. Model Evaluation & Hyperparameter Tuning

*   **Step 1**: Evaluate the model. Print a model score for the test data.
*   **Step 2**: Vary several hyperparameters in a loop to check whether the score of your model can actually be improved. 
*   **Step 3**: Describe what other metrics you could use to assess the model's performance.
*   **Step 4**: Predict the probability of heart disease by passing some values of the X variables to your classifier. 

----

## Step 1
Evaluate the model. Print a model score for the test data.
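
For scikit-learn classifiers, `score` returns the mean accuracy on the data passed to it. A minimal sketch on synthetic data (the classifier and its settings are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and an example classifier.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Fraction of test samples classified correctly.
test_score = clf.score(X_test, y_test)
print("Test accuracy: {:.3f}".format(test_score))
```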

## Step 2
Vary several hyperparameters in a loop to check whether the score of your model can actually be improved. 
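
The loop can be sketched as two nested `for` statements that record the test score for each combination. The candidate values and the decision tree are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for max_depth in [2, 4, 6]:
    for min_samples_leaf in [1, 5, 10]:
        clf = DecisionTreeClassifier(
            max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=0
        )
        clf.fit(X_train, y_train)
        results[(max_depth, min_samples_leaf)] = clf.score(X_test, y_test)

# Combination with the highest recorded score.
best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

Picking hyperparameters by test score tends to give an optimistic estimate, because the test set has then influenced the model. A separate validation split or cross-validation avoids this.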

## Step 3
Describe what other metrics you could use to assess the model's performance.
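
Some commonly used alternatives to accuracy are precision, recall, F1, ROC AUC, and the confusion matrix. The sketch below computes them with `sklearn.metrics` on small hand-made label vectors, which are illustrative only.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score
)

# Hand-made example: true labels, hard predictions, and predicted probabilities of class 1.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1, 0.7, 0.3]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)           # ranking quality of the probabilities
matrix = confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted class

print(precision, recall, f1, auc)
print(matrix)
```

In a medical setting recall is often watched closely, since a false negative means a missed case of disease.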

## Step 4
Predict the probability of heart disease by passing some values of the X variables to your classifier. 
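
Classifiers that expose `predict_proba` return one probability per class. The sketch below uses logistic regression on synthetic data; the feature values of the "new patient" are made up, and it assumes class 1 stands for heart disease.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 4 features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical feature values for one patient; the shape must be (n_samples, n_features).
new_patient = np.array([[0.5, -1.2, 0.3, 0.8]])
proba = clf.predict_proba(new_patient)

# Columns of proba follow clf.classes_, so look up the column for class 1.
disease_probability = proba[0, list(clf.classes_).index(1)]
print("Predicted probability of class 1: {:.3f}".format(disease_probability))
```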

# 7. Conclusion

Summarize your **findings**. Did you manage to build a reliable model? What **data preprocessing** strategies and **feature selection** techniques have you used in order to get the best model? Which model has performed the best?

Feel free to share/discuss your findings in our [Slack Channel](https://join.slack.com/t/mlcookbook/shared_invite/zt-eyz4czw4-l95j_2iuETCbVRPpgA3kWA)!

In [None]:
# Answer:

'''

I used X model and achieved Y accuracy...
I believe the model is reliable as I performed X feature selection technique...

'''

# 8.* Advanced Zone (OPTIONAL)

*This section is intended for advanced students or those who are willing to do some additional googling to familiarize themselves with potentially new concepts. The steps outlined below are typically used in production data science applications, which is why the ML-Book team thought it important to include them.*

# 8.1* Exploring Different Models

*As you know, in data science no single algorithm outperforms all others on every dataset. Thus, model selection is typically an iterative process in which we not only search for the optimal set of hyperparameters (as we did in [section 6, step 2](#hyperparameter_tuning)) but also explore different machine learning algorithms.*

*As you can imagine, for models with many hyperparameters it quickly becomes tedious to specify and check every combination of values by hand. For this and other reasons, people use [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1).*

*   **Step 1**: Train a [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *C*, *kernel* and *degree*. 


*   **Step 2**: Train a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *n_estimators* and *max_depth*.


*   **Step 3**: Compare the performance of the models built in this section with each other and with the model from [section 5](#model_building). Keep in mind that it only makes sense to compare models that were trained on the same set of data.

---

## Step 1
Train a [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *C*, *kernel* and *degree*. 
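
One option is `GridSearchCV`, which runs k-fold cross-validation for every combination in a grid. The sketch below uses synthetic data, and the grid values, the 5 folds, and the scaling step are all assumptions. Since *degree* only affects the polynomial kernel, the grid is split in two so it is not searched needlessly for the other kernels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training split.
X_train, y_train = make_classification(n_samples=150, n_features=6, random_state=0)

# Scaling inside the pipeline is refit on each training fold, so no fold leaks into another.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = [
    {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10], "svc__degree": [2, 3]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```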

## Step 2
Train a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model on the train dataset. Use any [cross-validation](https://medium.com/machine-learning-eli5/cross-validation-the-right-way-386839ed39b1) method of your choice to select optimal values of the hyperparameters *n_estimators* and *max_depth*.
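
The same grid-search pattern applies to a random forest. The candidate values and the 3-fold setting below are illustrative assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split.
X_train, y_train = make_classification(n_samples=150, n_features=6, random_state=0)

# max_depth=None lets each tree grow until its leaves are pure.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

By default `GridSearchCV` refits the best combination on the whole training set, so `search.best_estimator_` can then be scored on the held-out test set.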

## Step 3
Compare the performance of the models built in this section with each other and with the model from [section 5](#model_building). Keep in mind that it only makes sense to compare models that were trained on the same set of data.

# 8.2* Feature Engineering

*Oftentimes the relationship between our features and target variable is very complex. Thus, it can be fruitful to include some additional features based on already existing ones. In this section we will explore feature engineering for numerical columns only, but there are techniques that can be applied to categorical features as well. You can experiment with transformations that are not listed below as well!*

*   **Step 1**: Generate additional univariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations.
    *      Power of 2
    *      Square root (watch out for negative values!)
    *      Log transformation (can be applied only to positive values)


*   **Step 2**: Generate additional multivariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
    *      Multiplication of features' values
    *      Ratio of features' values (watch out for zero denominator)
    
    
*   **Step 3**: Generate additional categorical features. For every categorical feature in the dataset, add a column with [frequency encoded values](https://python-data-science.readthedocs.io/en/latest/preprocess.html#tree-based-models).
    
    
*   **Step 4**: Train any model that was described in this notebook on this extended dataset.


*   **Step 5**: Compare the performance of the model trained on the extended dataset against the models trained on the original dataset.

---

## Step 1
Generate additional univariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations.
- Power of 2
- Square root (watch out for negative values!)
- Log transformation (can be applied only to positive values)
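
The three transformations can be sketched with NumPy as follows. The demo values and new column names are hypothetical; clipping before the square root and using `log1p` for a column that contains zeros are just two possible ways of handling the caveats listed above.

```python
import numpy as np
import pandas as pd

# Illustrative values; oldpeak contains a zero on purpose.
demo = pd.DataFrame({"chol": [233, 250, 204], "oldpeak": [2.3, 0.0, 1.4]})

demo["chol_squared"] = demo["chol"] ** 2
# clip(lower=0) guards against negative inputs to the square root.
demo["oldpeak_sqrt"] = np.sqrt(demo["oldpeak"].clip(lower=0))
# chol is strictly positive here, so a plain log is safe.
demo["chol_log"] = np.log(demo["chol"])
# log1p computes log(1 + x), which is defined at x = 0.
demo["oldpeak_log1p"] = np.log1p(demo["oldpeak"])

print(demo)
```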

## Step 2
Generate additional multivariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
- Multiplication of features' values
- Ratio of features' values (watch out for zero denominator)
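
A minimal sketch of both transformations is shown below. The column pairs are arbitrary choices for illustration, and replacing a zero denominator with NaN is only one way to handle it; the resulting missing values still have to be dealt with before training.

```python
import numpy as np
import pandas as pd

# Illustrative values; oldpeak contains a zero on purpose.
demo = pd.DataFrame({
    "trestbps": [145, 130, 120],
    "chol": [233, 250, 204],
    "oldpeak": [2.3, 0.0, 1.4],
})

# Product of two features.
demo["trestbps_x_chol"] = demo["trestbps"] * demo["chol"]
# Ratio of two features; zeros in the denominator become NaN instead of inf.
demo["chol_per_oldpeak"] = demo["chol"] / demo["oldpeak"].replace(0, np.nan)

print(demo)
```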

## Step 3
Generate additional categorical features. For every categorical feature in the dataset, add a column with [frequency encoded values](https://python-data-science.readthedocs.io/en/latest/preprocess.html#tree-based-models).
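
Frequency encoding replaces each category with the share of rows in which it occurs. A minimal sketch on a made-up `cp` column:

```python
import pandas as pd

# Illustrative categorical column.
demo = pd.DataFrame({"cp": [0, 1, 1, 2, 1, 0]})

# Relative frequency of each category, then mapped back onto the rows.
frequencies = demo["cp"].value_counts(normalize=True)
demo["cp_freq"] = demo["cp"].map(frequencies)

print(demo)
```

To avoid leaking information from the test set, the frequencies would normally be computed on the training split only and then mapped onto both splits.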

## Step 4
Train any model that was described in this notebook on this extended dataset.

## Step 5
Compare the performance of the model trained on the extended dataset against the models trained on the original dataset.

# 8.3* Advanced Conclusion

Take a look at all the models that you have trained in this notebook and try to answer the following questions:

    Which one performed best on the test set? 
    Why do you think this happened?

In [None]:
# Answer:

'''

I believe...

'''