# Predicting stroke using Machine Learning

This notebook uses various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of predicting whether or not someone has had a stroke, based on their medical attributes.

We are going to take the following approach:
1. Problem Definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

# 1. Problem Definition

In a statement,
> Given clinical parameters about a patient, can we predict whether or not they have had a stroke?

# 2. Data

The original data came from Kaggle: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

# 3. Evaluation

> If we can reach 95% accuracy at predicting whether or not a patient has had a stroke during the proof of concept, we will pursue the project.
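
Accuracy can look high simply because one class is much more common than the other, so it may help to compare the 95% target against a trivial baseline. The sketch below is only an illustration: `majority_baseline_accuracy` is a hypothetical helper, and the labels are made-up toy values rather than the stroke data.

```python
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy of a model that always predicts the most common class."""
    # value_counts(normalize=True) gives each class's share of the rows;
    # the largest share is what a constant "majority" prediction would score.
    return float(labels.value_counts(normalize=True).max())


# Toy labels (hypothetical, NOT taken from the stroke dataset): 19 zeros, 1 one
toy_labels = pd.Series([0] * 19 + [1])
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

If the baseline on the real `stroke` column turns out to be close to the target, precision, recall and F1 are likely to say more about the model than accuracy does.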

# 4. Features
This is where you collect information about each of the features in your data. You can do this by doing your own research (such as looking at the links above) or by talking to a subject matter expert (someone who knows about the dataset).

 
 ***Create Data Dictionary***
 
1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
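
A data dictionary can also be used as a quick sanity check on the loaded data. The following is a minimal sketch, assuming the allowed values listed above; `EXPECTED_CATEGORIES` and `unexpected_values` are hypothetical names introduced for illustration, and the toy frame is made up.

```python
import pandas as pd

# Allowed values copied from the data dictionary (only some columns shown)
EXPECTED_CATEGORIES = {
    "gender": {"Male", "Female", "Other"},
    "ever_married": {"No", "Yes"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}


def unexpected_values(data: pd.DataFrame, expected: dict) -> dict:
    """Map each checked column to the set of values not listed for it."""
    problems = {}
    for column, allowed in expected.items():
        if column not in data.columns:
            continue  # skip columns the frame does not have
        extra = set(data[column].dropna().unique()) - allowed
        if extra:
            problems[column] = extra
    return problems


# Toy frame (hypothetical values); "female" is a deliberate typo
toy = pd.DataFrame({
    "gender": ["Male", "Female", "female"],
    "ever_married": ["Yes", "No", "Yes"],
})
issues = unexpected_values(toy, EXPECTED_CATEGORIES)
print(issues)
```

An empty result means every checked column only contains values the dictionary lists.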

# Preparing the tools

At the start of any project, it's customary to import the required libraries in one big chunk, as in the cell below.

However, in practice, your projects may import libraries as you go. After you've spent a couple of hours working on your problem, you'll probably want to do some tidying up. This is where you may want to consolidate every library you've used at the top of your notebook (like the cell below).

The libraries you use will differ from project to project, but there are a few you'll likely take advantage of in almost every structured data project:

* pandas for data analysis.
* NumPy for numerical operations.
* Matplotlib/seaborn for plotting or data visualization.
* Scikit-Learn for machine learning modelling and evaluation.

In [32]:
# Regular EDA and plotting libraries
import numpy as np # np is short for numpy
import pandas as pd # pandas is so commonly used, it's shortened to pd
import matplotlib.pyplot as plt
import seaborn as sns # seaborn gets shortened to sns
import xgboost as xgb

# We want our plots to appear in the notebook
%matplotlib inline 

# For Numerical Conversion
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

## Models for Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

## Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay

# Load Data

In [37]:
df = pd.read_csv("stroke-dataset.csv")
df

In [35]:
df.shape 

# Data Exploration (exploratory data analysis or EDA)

The goal here is to find out more about the data and become a subject matter expert on the data you are working with.

1. What questions are you trying to solve?
2. What kind of data do we have and how do we treat different types?
3. What is missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change, or remove features to get more out of your data?
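
Questions 2 to 4 can be partly answered with a couple of small helpers. This is a sketch under stated assumptions: `summarize_columns` and `iqr_outlier_mask` are hypothetical names, the 1.5 × IQR rule is just one common convention for flagging outliers, and the toy frame is made up rather than taken from the stroke data.

```python
import numpy as np
import pandas as pd


def summarize_columns(data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count."""
    return pd.DataFrame({
        "dtype": data.dtypes.astype(str),
        "n_missing": data.isna().sum(),
        "n_unique": data.nunique(),
    })


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy frame (hypothetical values, not the stroke data)
toy = pd.DataFrame({
    "age": [34, 51, 67, 45, 80],
    "bmi": [22.1, np.nan, 27.5, 24.0, 60.0],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
})

summary = summarize_columns(toy)
bmi_values = toy["bmi"].dropna()
bmi_outliers = iqr_outlier_mask(bmi_values)
print(summary)
print(bmi_values[bmi_outliers])
```

Whether a flagged value is a data error or a genuine extreme reading is a judgment call that usually needs domain knowledge.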

In [4]:
df.head()

In [5]:
df.tail()

In [6]:
df.info()

In [7]:
# Let's find out how many of each class there are
df["stroke"].value_counts()

In [8]:
df["stroke"].value_counts().plot(kind="bar", color=["darkred", "darkblue"]);

In [9]:
# Are there any missing values?
df.isna().sum()

In [10]:
# Fill missing BMI values with the column mean
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# Drop any rows that still contain missing values
df.dropna(inplace=True)

In [11]:
df.isna().sum()

In [12]:
df.info()

In [13]:
df

In [14]:
df.describe()

In [15]:
pd.crosstab(df.stroke, df.gender)

In [16]:
# Create a plot of crosstab
pd.crosstab(df.stroke, df.gender).plot(kind="bar",
                                       figsize=(10, 6),
                                       color=["salmon", "gray", "lightblue"])

plt.title("Stroke Disease Frequency For Gender")
plt.xlabel("0 = No Stroke, 1 = Stroke Present")
plt.ylabel("Amount")
# Keep the crosstab's own column labels so every gender category is named correctly
plt.legend(title="Gender")
plt.xticks(rotation=0);

# Age vs BMI for Stroke Disease

In [17]:
# Create another figure
plt.figure(figsize=(10, 6))

# Scatter with positive examples
plt.scatter(df.age[df.stroke==1],
            df.bmi[df.stroke==1],
            color="darkred");

# Scatter with negative examples
plt.scatter(df.age[df.stroke==0],
            df.bmi[df.stroke==0],
            color="lightblue");

# Add some helpful info
plt.title("Age vs BMI for Stroke Disease")
plt.xlabel("Age")
plt.ylabel("Body Mass Index (BMI)")
plt.legend(["Stroke Present", "No Stroke"])
plt.xticks(rotation=0);

In [18]:
# Check the distribution of the age column with a histogram
df.age.plot.hist();

# Stroke Disease Frequency per Hypertension
* Hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension.

In [19]:
pd.crosstab(df.hypertension, df.stroke)

In [20]:
# Make the crosstab more visual
pd.crosstab(df.hypertension, df.stroke).plot(kind="bar",
                                             figsize=(10, 6),
                                             color=["crimson", "gray"])

# Add some communication
plt.title("Stroke Disease Frequency per Hypertension")
plt.xlabel("Hypertension")
plt.ylabel("Amount")
plt.legend(["No Stroke", "Stroke Present"])
plt.xticks(rotation=0);

# Stroke Disease Frequency per Work Type

7) work_type: 
    * "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

In [21]:
pd.crosstab(df.work_type, df.stroke)

In [22]:
# Make the crosstab more visual
pd.crosstab(df.work_type, df.stroke).plot(kind="bar",
                                          figsize=(10, 6),
                                          color=["salmon", "lightblue"])

# Add some communication
plt.title("Stroke Disease Frequency per Work Type")
plt.xlabel("Work Type")
plt.ylabel("Amount")
plt.legend(["No Stroke", "Stroke Present"])
plt.xticks(rotation=0);

# Stroke Disease Frequency per Smoking Status

11) smoking_status: 
    * "formerly smoked", "never smoked", "smokes" or "Unknown"*

In [23]:
pd.crosstab(df.smoking_status, df.stroke)

In [24]:
# Make the crosstab more visual
pd.crosstab(df.smoking_status, df.stroke).plot(kind="bar",
                                               figsize=(10, 6),
                                               color=["lightblue", "darkblue"])

# Add some communication
plt.title("Stroke Disease Frequency per Smoking Status")
plt.xlabel("Smoking Status")
plt.ylabel("Amount")
plt.legend(["No Stroke Present", "Stroke Present"])
plt.xticks(rotation=0);

In [25]:
# Make a correlation matrix (numeric columns only; text columns cannot be correlated)
df.corr(numeric_only=True)

In [26]:
# Let's make our correlation matrix a little prettier
corr_matrix = df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="seismic");

In [27]:
df.head()

In [28]:
# Split Data into X and y
X = df.drop("stroke", axis=1)
y = df["stroke"]

In [29]:
transformed_X

In [None]:
y

In [None]:
# Set the random seed for reproducibility
np.random.seed(42)

# Split into train and test set
X