# Predicting Heart Disease Using Machine Learning

This notebook walks through predicting heart disease using some foundational machine learning and data science concepts. It is a classification problem.

It is intended to be an end-to-end example of what a data science and machine learning proof of concept might look like.


## What is Classification? 

Classification means assigning data to classes, whether the data is structured or unstructured. Classification with two possible outcomes is binary classification, classification with more than two outcomes is multi-class classification, and a problem in which a sample can belong to more than one class at once is multi-label classification.

In this notebook we are dealing with a binary classification problem.
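
As a small illustration of the three kinds of classification described above, the sketch below builds one label array per kind and asks scikit-learn's `type_of_target` helper how it interprets each. The label arrays are hypothetical and are not taken from the heart disease data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical label arrays, one per kind of classification problem.
binary_labels = np.array([0, 1, 1, 0])        # two possible outcomes
multiclass_labels = np.array([0, 2, 1, 2])    # exactly one of three classes per sample
multilabel_labels = np.array([[1, 0, 1],      # each row is one sample; a 1 marks
                              [0, 1, 1]])     # every class the sample belongs to

label_kinds = {
    "binary": type_of_target(binary_labels),
    "multiclass": type_of_target(multiclass_labels),
    "multilabel": type_of_target(multilabel_labels),
}
print(label_kinds)
```

The `target` column in this notebook holds only 0 and 1, so it falls into the first case.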

## What are we going to do throughout the notebook?

We have the UCI heart disease dataset, and we are going to use it to predict whether a patient has heart disease or not. We will approach the problem step by step using this cycle.

<img src='https://miro.medium.com/max/6608/1*Gf0bWgr2wst9A1XR5gakLg.png'/>

Fig: A 6-step machine learning modelling framework.

The main topics covered in this notebook are:

* Exploratory Data Analysis: The process of going through the data set and finding out more about it.
* Model Training: Creating a model that learns from the training data set, so that it can predict the correct outcome on the test data set.
* Model Evaluation: Evaluating a model using problem-specific evaluation metrics.
* Model Comparison: Comparing several different models to find the best one.
* Model fine-tuning: Once we have the best model for our problem, how can we tune/improve it?
* Feature Importance: Since we are predicting whether someone has heart disease, are some features more important for the prediction than others?
* Cross Validation: Cross-validation is one of the best ways to estimate how a model will perform on unseen data, because the model is trained and tested on several different folds. It is a more reliable choice than a single random split.
* Reporting what we've found: Presenting our work to others who are not familiar with the technical details.

We will dive into them one by one. We are using the following libraries:

### Data Analysis
* Numpy
* Pandas
* Matplotlib
* Seaborn

### Machine Learning and modelling
* Scikit-learn

To follow this notebook well, you should be familiar with the Python libraries mentioned above. By the end of the notebook we will be able to predict whether a patient has heart disease using the features in the data set. We will also know which columns matter the most for predicting the disease.


<img src = 'https://inteng-storage.s3.amazonaws.com/img/iea/Xy6xeK3Wwr/sizes/heart-attack-ai-oxford_resize_md.jpg'/>





## 1. Problem Definition

We are predicting whether a patient has heart disease or not. This is binary classification because there are only two possible outcomes, 'True' or 'False'. We're going to be using a number of different features (pieces of information) about a person. Our main goal is:

`Given clinical parameters about a patient, can we predict whether or not they have heart disease?`


## 2. Data

The original database contains 76 attributes, but here only 14 attributes will be used. Attributes (also called features) are the variables that we'll use to predict our target variable. Attributes and features are also referred to as independent variables, and a target variable can be referred to as a dependent variable.

**We use the independent variables to predict our dependent variable.**

In our case, the independent variables are a patient's different medical attributes and the dependent variable is whether or not they have heart disease.

## 3. Evaluation

Machine learning is all about experimentation, so there is no need to worry if the first attempt falls short. I think those who expect to succeed on the very first attempt will struggle to build a long-lasting career. We keep evaluating and refining the project until the predictions are good enough. Patience is the most important key in those moments.

'If we get close to 95% accuracy, our predictions will be close to what we see in the future. So stay calm and tune your model to get better accuracy.'


## 4. Features

Features play the most important role in predicting something we don't know, so good features always lead to better predictions.

To understand the features in a better way let's create a data dictionary.

### Data Dictionary 

This data dictionary describes the data we are dealing with. It does not describe the attributes that are not used during the modelling process.

The following are the features that we are going to use to predict the target variable (heart disease or no heart disease):
1. age - age in years
2. sex - (1=male,0=female)
3. cp - chest pain type
    * 0: Typical angina: chest pain related to decreased blood supply to the heart
    * 1: Atypical angina: chest pain not related to heart
    * 2: Non-anginal pain: typically esophageal spasms (non heart related)
    * 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
    * anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl
    * serum = LDL + HDL + .2 * triglycerides
    * above 200 is cause for concern
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    * '>126' mg/dL signals diabetes
7. restecg - resting electrocardiographic results
    * 0: Nothing to note
    * 1: ST-T Wave abnormality
        * can range from mild symptoms to severe problems
        * signals non-normal heart beat
    * 2: Possible or definite left ventricular hypertrophy
        * Enlarged heart's main pumping chamber
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
    * looks at stress of heart during exercise
    * unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
    * 0: Upsloping: better heart rate with exercise (uncommon)
    * 1: Flatsloping: minimal change (typical healthy heart)
    * 2: Downsloping: signs of unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy
    * colored vessel means the doctor can see the blood passing through
    * the more blood movement the better (no clots)
13. thal - thallium stress result
    * 1,3: normal
    * 6: fixed defect: used to be defect but ok now
    * 7: reversible defect: no proper blood movement when exercising
14. target - have disease or not (1=yes, 0=no) (= the predicted attribute)


## Preparing Tools

Python's standard library does not provide everything needed for this kind of problem, so before starting the project we have to import the libraries required for these tasks. I am going to import the main libraries at the beginning and bring in others later, as the need arises.

* [Pandas](https://pandas.pydata.org/) for data analysis
* [Numpy](https://numpy.org/) for numerical operations
* [Matplotlib](https://matplotlib.org/) / [Seaborn](https://seaborn.pydata.org/) for plotting and visualizations
* [Scikit-learn](https://scikit-learn.org/stable/) for machine learning

In [None]:
#Regular exploratory data analysis and plotting libraries.
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')

#models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

#model evaluators
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
#plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import confusion_matrix,classification_report,precision_score,recall_score,f1_score,RocCurveDisplay

## Load Data
Data comes in many different formats, and there are lots of tools for working with it. In this notebook we are dealing with data in comma-separated values (.csv) format. Pandas has built-in functions to load the data into a DataFrame and inspect it.

In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')
df.head() #first five rows

## Exploratory Data Analysis(Data Exploration or EDA)

Once we have imported our data and have it in tabular form, we need to explore it. The main goal is to become more and more familiar with the data: compare the different columns with each other, and compare those columns (features) with the target variable (label).

In [None]:
#let's count how many times each value appears in the target column.
df['target'].value_counts()

value_counts() shows how many times each value of a categorical column appears.
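
To see class balance as proportions rather than raw counts, `value_counts()` also accepts `normalize=True`. The sketch below uses a small hypothetical `toy_target` series, not the real `target` column.

```python
import pandas as pd

# Hypothetical target column: 1 = heart disease, 0 = no heart disease.
toy_target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="target")

counts = toy_target.value_counts()                 # absolute count per class
shares = toy_target.value_counts(normalize=True)   # fraction per class, sums to 1

print(counts)
print(shares)
```

On the real data, the same call would be `df['target'].value_counts(normalize=True)`.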

In [None]:
#Plot our target column
df['target'].value_counts().plot(kind='bar',color=["teal", "indigo"]);

In [None]:
#Checking whether there are any missing values in the dataset.
df.isna().sum()

df.info() gives a quick overview of how many missing values we have and what types of data we are working with.

In our case, there are no missing values and all of our columns are numerical in nature.

In [None]:
df.info()

Another way to get some quick insights from the data is the df.describe() function.

In [None]:
df.describe()

1 = Heart disease,  
0 = No heart disease

1 = Male,  
0 = Female

In [None]:
df['sex'].value_counts()

We have 207 males and 96 females in our dataset.

In [None]:
# comparing target column with sex column.
pd.crosstab(df['target'],df['sex'])

From the above table we can see that there are about 100 women, and 72 of them have a positive value, which is about 75% of all the women. On the other hand, there are about 200 men, and about 100 of them have a positive value, which is about 50% of all the men.

Averaging these two rates gives a rough baseline: based on no other parameters, a person in this dataset has about a 62.5% chance of having heart disease. Note that this is an unweighted average; because men outnumber women roughly two to one here, the overall rate in the dataset is somewhat lower.
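
The difference between averaging the two group rates and computing the overall rate can be checked with a few lines of pandas. This is a minimal sketch on hypothetical toy data (4 women, 8 men) chosen to mirror the rough 75% and 50% rates above; it is not the real dataset.

```python
import pandas as pd

# Hypothetical toy data (not the real dataset): 4 women (sex=0), 8 men (sex=1).
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})

# target is 0/1, so the mean within each group is the share of positive cases.
rate_by_sex = toy.groupby("sex")["target"].mean()

unweighted_average = rate_by_sex.mean()   # treats both groups as equally large
overall_rate = toy["target"].mean()       # every person counts once

print(rate_by_sex)
print(unweighted_average, overall_rate)
```

Here the unweighted average is 0.625 while the overall rate is 7/12, about 0.58, because the larger group (men) has the lower rate.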

## Making Crosstab Visuals

Let's compare the target and sex columns visually.

In [None]:
pd.crosstab(df['target'],df['sex']).plot(kind='bar',color=['darkgreen','cornflowerblue'])
plt.title('Heart disease frequency for sex')
plt.xlabel('0=No Heart Disease, 1=Heart Disease')
plt.ylabel('Number')
plt.legend(['Female','Male']);

## Age vs Max Heart Rate for Heart Disease

Let's first combine a couple of independent variables, age and thalach, and compare them with our target variable.

In [None]:
fig,ax = plt.subplots(nrows=1,
                     ncols=1,
                     figsize=(12,8))

#positive examples
ax.scatter(df['age'][df['target']==1],
          df['thalach'][df['target']==1],
          c = 'darkgreen')

#negative examples
ax.scatter(df['age'][df['target']==0],
          df['thalach'][df['target']==0],
          c = 'red')

ax.set(title='Heart disease in function of Age and Max Heart Rate',
      xlabel='Age',
      ylabel = 'Max Heart Rate')
ax.legend(['Disease','No disease']);

It seems the younger someone is, the higher their max heart rate (dots are higher on the left of the graph) and the older someone is, the more green dots there are. But this may be because there are more dots all together on the right side of the graph (older participants).

Both of these are observational of course, but this is what we're trying to do, build an understanding of the data.

In [None]:
#age distribution
df['age'].plot(kind='hist')

The distribution looks roughly normal, with a slight skew.

## Heart Disease frequency by chest pain type

Let's use the same process as used before.

In [None]:
pd.crosstab(df['cp'],df['target'])

In [None]:
pd.crosstab(df['cp'],df['target']).plot(kind='bar',
                                       figsize=(12,8),
                                       color=['springgreen','purple'])
plt.title('Heart Disease frequency Per chest pain')
plt.xlabel('Chest Pain Type')
plt.ylabel('Frequency')
plt.legend(['No Heart Disease','Heart Disease']);

From our data dictionary:
1. cp - chest pain type
    * 0: Typical angina: chest pain related decrease blood supply to the heart
    * 1: Atypical angina: chest pain not related to heart
    * 2: Non-anginal pain: typically esophageal spasms (non heart related)
    * 3: Asymptomatic: chest pain not showing signs of disease
    
It's interesting that atypical angina (value 1) is described as not related to the heart, yet it seems to have a higher ratio of participants with heart disease than without.


## Correlation between independent variables

Now let's compare all of our independent variables, because this gives a clear idea of which of them may or may not have an impact on the target variable. We can do this using the df.corr() function, which returns the overall result as one big table.

In [None]:
df_corr = df.corr()
df_corr

In [None]:
corr_matrix = df.corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr_matrix,
           annot=True,
           linewidths=0.5,
           fmt='.2f',
           cmap='YlGnBu');

Much better. A higher positive value means a potential positive correlation (increase) and a higher negative value means a potential negative correlation (decrease). This much data analysis gives us in-depth knowledge of the heart disease data.
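
When the full matrix is hard to scan, one option is to pull out only the column of correlations with the target and sort it. The sketch below does this on hypothetical synthetic data with made-up column names (`feature_a`, `feature_b`); on the real data the same expression would start from `df.corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: the target rises with feature_a and falls with feature_b.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"feature_a": rng.normal(size=n),
                    "feature_b": rng.normal(size=n)})
noise = rng.normal(scale=0.5, size=n)
toy["target"] = ((toy["feature_a"] - toy["feature_b"] + noise) > 0).astype(int)

# Keep only the correlations with the target, drop its self-correlation, sort ascending.
target_corr = toy.corr()["target"].drop("target").sort_values()
print(target_corr)
```

Correlation only captures linear relationships between pairs of columns, so this ranking is a first look rather than a measure of predictive importance.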

Now it's time to model.


## 5. Modeling

We now have a good understanding of the data, so it's time to build a model using machine learning. We will use the features to predict the label. In our case we have 13 features and 1 label, so we first split the data into two parts: one for the features and the other for the label.

In [None]:
# dropping the target variable
X = df.drop('target',axis=1)

# only target column
y = df['target']

X values.

In [None]:
X.head()

Y values.

In [None]:
y.head()

### Splitting our data into training and testing

We will use scikit-learn to split our data into a training set and a test set. Before doing that, we should remember not to train the model on the whole data set.
If we use all of our data to train a model, how would we know whether the model performs well on unseen data?
So before modelling, remember that we use the training set to train our model and the test set to test it.


In [None]:
#for reproducible code
np.random.seed(45)

#splitting our data into training and testing sets.
X_train,X_test,y_train,y_test = train_test_split(X, #independent variable
                                                 y, #dependent variable
                                                 test_size=0.2) #percentage of data used for testing

We have used 80% of our data to train and 20% to test.

In [None]:
#lets look at the shape of our training and testing data
X_train.shape, X_test.shape, y_train.shape,y_test.shape

Boom!! we are using 242 samples for training purpose and 61 samples for testing purpose.
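
The 242/61 split can be reproduced on stand-in data of the same shape. The sketch below is a variation on the split above, not the notebook's exact call: it passes `random_state` directly instead of relying on the global NumPy seed, and adds `stratify` so both sets keep a similar class balance. `X_demo` and `y_demo` are hypothetical random arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 303 rows, 13 features, binary target.
rng = np.random.default_rng(45)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of 303 rows, rounded up, gives 61 test rows
    random_state=45,    # makes the shuffle reproducible for this call only
    stratify=y_demo,    # keeps the share of each class similar in both sets
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```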


### Choosing the right model.

I have my data prepared and everything is going fine up to now. 
Now it's time to select the right model for the problem. I am going to tackle this problem using the following algorithms: 
1. Logistic Regression
2. K-Nearest Neighbors
3. Random Forest

### Why am I using these?

<img src='https://scikit-learn.org/stable/_static/ml_map.png'>

Fig: Scikit-learn algorithm cheat sheet



Following the chart above, we can see that we have relatively few samples in our dataset, so these algorithms are reasonable choices for this situation.

Since our dataset is relatively small, we can experiment to find which algorithm performs best.

All of the algorithms in the Scikit-Learn library use the same functions: model.fit(X_train, y_train) for training a model and model.score(X_test, y_test) for scoring it. For classifiers, score() returns the ratio of correct predictions (1.0 = 100% correct).

In [None]:
# Using logistic regression
log = LogisticRegression(max_iter=1000).fit(X_train,y_train)
log_score = log.score(X_test,y_test)
print('The accuracy score of Logistic regression is: {:.2f}%'.format(log_score*100))

#Using KNeighbor
knn = KNeighborsClassifier().fit(X_train,y_train)
knn_score = knn.score(X_test,y_test)
print('The accuracy of K-Nearest Neighbors is: {:.2f}%'.format(knn_score*100))

#Using Random Forest 
clf = RandomForestClassifier().fit(X_train,y_train)
clf_score = clf.score(X_test,y_test)
print('The accuracy of Random Forest is: {:.2f}%'.format(clf_score*100))
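
Because every estimator shares the same `fit` and `score` methods, the three blocks above can be folded into one loop. The helper below, `fit_and_score`, is a hypothetical name introduced for this sketch, and the data comes from `make_classification` rather than the heart disease file.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit every estimator in `models` and return {name: test accuracy}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


# Hypothetical synthetic data with the same number of features as the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
demo_scores = fit_and_score(demo_models, X_tr, X_te, y_tr, y_te)
print(demo_scores)
```

Keeping the scores in a dictionary keyed by model name also makes them easy to turn into a DataFrame for plotting.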

### Model Comparison

From the randomly selected split above, we can see that Logistic Regression and Random Forest have better accuracy than K-Nearest Neighbors.

In [None]:
#use a dict (not a set) so each score stays attached to its model name
model_scores = {'Logistic Regression': log_score,
                'K-Nearest Neighbors': knn_score,
                'Random Forest': clf_score}
model_comparison = pd.DataFrame(model_scores, index=['Accuracy']).T
model_comparison.plot(kind='barh', legend=False)
plt.ylabel('Algorithms')
plt.xlabel('Accuracy Score')
plt.title('Accuracy Comparison between different Models');

Let's go more in depth to the problem we are solving. 

* [Hyperparameter Tuning](https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html): A hyperparameter is a parameter whose value is set before the learning process begins. Changing these values may increase or decrease model performance.
* [Feature Importance](https://machinelearningmastery.com/calculate-feature-importance-with-python/#:~:text=Feature%20importance%20refers%20to%20a,feature%20when%20making%20a%20prediction.): When a dataset has many features, we want to know which ones play the most significant role in the prediction.
* [Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html): Compares the predicted values with the true values in a tabular way.
* [Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html): Splits the dataset into multiple parts, trains and tests the model on each part, and evaluates performance as an average.
* [Precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html): Proportion of true positives over total number of true positives and false positives.
* [Recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html): Proportion of true positives over total number of true positives and false negatives.
* [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html): Combines precision and recall into one metric
* [Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html): Returns some of the main classification metrics such as precision, recall and f1-score.
* [ROC curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html): Plot of true positive rate versus false positive rate.
* [AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html): The area underneath the ROC curve.
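
To make the precision, recall and F1 definitions in the list above concrete, the sketch below computes them by hand from a confusion matrix and compares the results with scikit-learn's functions. The `y_true` and `y_pred` arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions for 10 patients.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

With 3 true positives, 2 false positives and 1 false negative, precision is 0.6, recall is 0.75 and F1 is about 0.67.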

## Hyperparameter tuning and cross validation

To test different hyperparameters, we could use a validation set, but since we don't have much data, we'll use cross-validation.
The most common type of cross-validation is k-fold. It involves splitting the data into k folds and then testing a model on each. For example, let's say we had 5 folds (k = 5). This is what it might look like.

<img src='https://scikit-learn.org/stable/_images/grid_search_cross_validation.png'/>
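
The fold structure in the figure can be printed directly with `KFold`. This is a minimal sketch on 10 hypothetical samples; note that when `cv=5` is passed together with a classifier, scikit-learn uses a stratified variant of k-fold, so the exact folds in this notebook's searches would differ.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical samples, identified by their row index.
X_demo = np.arange(10).reshape(10, 1)

kfold = KFold(n_splits=5)   # no shuffling: folds are consecutive blocks of rows

folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kfold.split(X_demo)]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx} test={test_idx}")
```

Each sample lands in the test portion exactly once, which is why the averaged score uses all of the data for evaluation.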



## Tuning models with RandomizedSearchCV

We have seen that K-Nearest Neighbors has lower accuracy than Logistic Regression and Random Forest, so let's set KNN aside for now and tune the remaining two.

Let's create a hyperparameter dictionary for each and test them out.


In [None]:
# Logistic Regression hyperparameters
log_grid = {"C": np.logspace(-4, 4, 20),
            "solver": ["liblinear"]}

# Random Forest hyperparameters
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

We have made our hyperparameter dictionaries. Now it's time to tune our logistic regression model.

### Tuning Logistic Regression

In [None]:
np.random.seed(20)

#setup random hyperparameter search.
log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_grid,
                                cv=5,
                                n_iter=20,#try 20 different combinations of hyperparameters
                                verbose=True)

#fitting the model
log_reg.fit(X_train,y_train)

In [None]:
#checking the best parameters for logistic regression
log_reg.best_params_

In [None]:
print('The accuracy of Logistic regression using RandomizedSearchCV is: {:.2f}%'.format(log_reg.score(X_test,y_test)*100))

Now it's time to do the same with Random Forest.

### Tuning Random Forest

In [None]:
np.random.seed(20)

#setup random hyperparameter search.
rand = RandomizedSearchCV(RandomForestClassifier(),
                                param_distributions=rf_grid,
                                cv=5,
                                n_iter=20,#try 20 different combinations of hyperparameters
                                verbose=True)

#fitting the model
rand.fit(X_train,y_train)

In [None]:
#checking the best parameters for random forest
rand.best_params_

In [None]:
print('The accuracy of Random forest using RandomizedSearchCV is: {:.2f}%'.format(rand.score(X_test,y_test)*100))

At the beginning, when we used Logistic Regression and Random Forest without tuning or cross-validation, we got accuracy scores of 83.61% and 85.25% respectively. After tuning them, both reached 90.16%. This is the reason for doing hyperparameter tuning.

## Tuning a model with GridSearchCV

The difference between RandomizedSearchCV and GridSearchCV is that RandomizedSearchCV searches over a grid of hyperparameters by trying n_iter combinations, whereas GridSearchCV tests every single possible combination.

* RandomizedSearchCV - tries n_iter combinations of hyperparameters and saves the best.
* GridSearchCV - tries every single combination of hyperparameters and saves the best.
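
How many combinations "every single combination" means can be counted with `ParameterGrid`. The sketch below re-declares the two grids defined earlier so that it runs on its own.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# The same grids as defined earlier in the notebook.
log_grid = {"C": np.logspace(-4, 4, 20),
            "solver": ["liblinear"]}
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

n_log = len(ParameterGrid(log_grid))   # 20 * 1
n_rf = len(ParameterGrid(rf_grid))     # 20 * 4 * 9 * 10

print(n_log, n_rf)
# With cv=5, an exhaustive search fits 5 models per combination.
print(n_log * 5, n_rf * 5)
```

The logistic regression grid has only 20 combinations, so an exhaustive search is cheap, while the random forest grid has 7,200, which is why sampling 20 of them with RandomizedSearchCV is the more practical first step.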

In [None]:
log_search = GridSearchCV(LogisticRegression(),
                          param_grid=log_grid,
                          cv=5,
                          verbose=True)
#fitting the model
log_search.fit(X_train,y_train)

In [None]:
#checking the best hyperparameters.
log_search.best_params_

In [None]:
print('The accuracy of Logistic regression using GridSearchCV is: {:.2f}%'.format(log_search.score(X_test,y_test)*100))


In this case, we get the same results as before since our grid only has a maximum of 20 different hyperparameter combinations.

Note: If there are a large number of hyperparameter combinations in your grid, GridSearchCV may take a long time to try them all out. This is why it's a good idea to start with RandomizedSearchCV, try a certain number of combinations and then use GridSearchCV to refine them.

## Evaluating a model beyond Accuracy.

Now that we have a tuned model, let's evaluate it with further metrics.

* ROC curve and AUC score - RocCurveDisplay.from_estimator() (plot_roc_curve() in scikit-learn versions before 1.2)
* Confusion matrix - confusion_matrix()
* Classification report - classification_report()
* Precision - precision_score()
* Recall - recall_score()
* F1-score - f1_score()

In [None]:
#making predictions on test data
y_preds = log_search.predict(X_test)
y_preds

In [None]:
y_test.values

### ROC Curve and AUC score

In [None]:
from sklearn.metrics import RocCurveDisplay
#plotting the curve (plot_roc_curve was removed in scikit-learn 1.2)
RocCurveDisplay.from_estimator(log_search,X_test,y_test);

A perfect model would achieve an AUC score of 1.0; our model achieves a score of 0.93. So far we have done a great job.

### Confusion matrix

The confusion matrix shows where our model made the right predictions and where it made the wrong ones.

In [None]:
#displaying confusion matrix.
print(confusion_matrix(y_test,y_preds))

In [None]:
#plotting confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2).
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(log_search,X_test,y_test);

### Classification report

A classification report gives us information on the precision and recall of our model for each class.

In [None]:
# classification report
print(classification_report(y_test, y_preds))

* Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
* Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
* F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
* Support - The number of samples each metric was calculated on.
* Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
* Macro avg - Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.
* Weighted avg - Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. will give a high value when one class out performs another due to having more samples).
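
The difference between the macro and weighted averages described in the list above shows up clearly on imbalanced labels. The sketch below uses recall on hypothetical data with 8 samples of class 0 and 2 of class 1.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical imbalanced labels: 8 samples of class 0, 2 samples of class 1.
y_true = np.array([0] * 8 + [1] * 2)
# Every class-0 sample is predicted correctly; one of the two class-1 samples is missed.
y_pred = np.array([0] * 8 + [0, 1])

per_class = recall_score(y_true, y_pred, average=None)        # recall for class 0, class 1
macro = recall_score(y_true, y_pred, average="macro")         # plain mean over classes
weighted = recall_score(y_true, y_pred, average="weighted")   # mean weighted by support

print(per_class, macro, weighted)
```

Per-class recall is 1.0 and 0.5, so the macro average is 0.75, while the weighted average is 0.9 because the majority class dominates it.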


Let's see the same thing in action using cross validation.


<img src='https://img.devrant.com/devrant/rant/r_1393414_v3ymZ.jpg'/>

In [None]:
# our best hyperparameter
log_search.best_params_

In [None]:
from sklearn.model_selection import cross_val_score

#let's use the best model with the best hyperparameters.
clf = LogisticRegression(C=0.615848211066026,
                        solver='liblinear')

Let's compute some metrics using cross-validation.

In [None]:
#cross validation accuracy score
cross_val_accuracy = cross_val_score(clf,X,y,cv=5,scoring='accuracy')
cross_val_accuracy

In [None]:
#Let's find the average of the above 5 values.
cross_val_accuracy = np.mean(cross_val_accuracy)
print('Cross validation accuracy score is: {:.2f}'.format(cross_val_accuracy))

In [None]:
#cross validation precision score.
cross_val_precision = cross_val_score(clf,X,y,cv=5,scoring='precision')
cross_val_precision

In [None]:
#Let's find the average of the above 5 values.
cross_val_precision = np.mean(cross_val_precision)
print('Cross validation Precision score is: {:.2f}'.format(cross_val_precision))

In [None]:
#cross validation recall score.
cross_val_recall = cross_val_score(clf,X,y,cv=5,scoring='recall')
cross_val_recall

In [None]:
#Let's find the average of the above 5 values.
cross_val_recall = np.mean(cross_val_recall)
print('Cross validation recall score is: {:.2f}'.format(cross_val_recall))

In [None]:
#cross validation f1 score.
cross_val_f1 = cross_val_score(clf,X,y,cv=5,scoring='f1')
cross_val_f1

In [None]:
#Let's find the average of the above 5 values.
cross_val_f1 = np.mean(cross_val_f1)
print('Cross validation f1 score is: {:.2f}'.format(cross_val_f1))
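
The four separate `cross_val_score` calls above could also be collected in a single pass with `cross_validate`, which accepts a list of scorers. This is a sketch of that alternative on hypothetical synthetic data; `demo_clf` is a plain logistic regression, not the tuned model from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical synthetic data standing in for X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
demo_clf = LogisticRegression(max_iter=1000)

metric_names = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(demo_clf, X_demo, y_demo, cv=5, scoring=metric_names)

# cross_validate stores the per-fold scores under keys named "test_<metric>".
mean_scores = {name: float(np.mean(cv_results[f"test_{name}"]))
               for name in metric_names}
print(mean_scores)
```

This fits the model five times in total rather than five times per metric, and the resulting dictionary can be passed straight to `pd.DataFrame` for plotting.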

We have our cross-validated metrics. Now let's visualize them to compare how they perform.

In [None]:
# Visualizing cross-validated metrics
cross_val_metrics = pd.DataFrame({"Accuracy": cross_val_accuracy,
                            "Precision": cross_val_precision,
                            "Recall": cross_val_recall,
                            "F1": cross_val_f1},
                          index=[0])
cross_val_metrics.T.plot.bar(title="Cross-Validated Metrics", legend=False);

## Feature Importance

"Which features contribute most to the outcomes of the model?"


In [None]:
clf.fit(X_train,y_train)
#pair each feature name with its fitted coefficient
features_dict = dict(zip(X.columns, list(clf.coef_[0])))
features_dict

Looking at this, it might not make much sense. But these values show how much each feature contributes to the model's decision on whether patterns in a patient's health data lean more towards having heart disease or not.

In [None]:
# Visualize feature importance
features_df = pd.DataFrame(features_dict, index=[0])
features_df.T.plot.bar(title="Feature Importance", legend=False);

Here we notice that some are negative and some are positive.

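The larger the absolute value (bigger bar), the more the feature contributes to the model's decision.

If the value is negative, the feature is negatively associated with the target in the model: as the feature increases, the predicted chance of heart disease decreases. And vice versa for positive values.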

For example, the sex attribute has a negative value of -1.3, which means as the value for sex increases, the target value decreases.

We can see this by comparing the sex column to the target column.

In [None]:
pd.crosstab(df["sex"], df["target"])

We can see, when sex is 0 (female), there are almost 3 times as many (72 vs. 24) people with heart disease (target = 1) than without.

And then as sex increases to 1 (male), the ratio goes down to almost 1 to 1 (114 vs. 93) of people who have heart disease and who don't.

What does this mean?

It means the model has found a pattern which reflects the data. Looking at these figures and this specific dataset, it seems if the patient is female, they're more likely to have heart disease.

How about a positive correlation?

In [None]:
# Contrast slope (positive coefficient) with target
pd.crosstab(df["slope"], df["target"])

According to the model, slope has a positive coefficient of 0.70, not as strong as the one for sex but still more than 0.

This positive coefficient means our model is picking up the pattern that as slope increases, so does the target value.
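
One way to read logistic regression coefficients is as odds ratios: a coefficient is the change in the log-odds of the positive class per one-unit increase in the feature, with the other features held fixed, so exponentiating it gives the factor by which the odds are multiplied. The sketch below applies this to the two coefficient values quoted in this section; they are treated as illustrative numbers rather than re-computed from the model.

```python
import numpy as np

# Coefficient values quoted in the text above, used here as illustrative inputs.
coefficients = {"sex": -1.3, "slope": 0.70}

# exp(coefficient) = multiplicative change in the odds for a one-unit increase.
odds_ratios = {name: float(np.exp(value)) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f} per unit increase")
```

Under this reading, going from sex = 0 to sex = 1 multiplies the modelled odds of heart disease by roughly 0.27, and each step up in slope roughly doubles them. Coefficient sizes also depend on each feature's scale, so comparing bars across features is only a rough guide unless the features are standardized first.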