# Project Details

* Problem Definition: In a statement,
**Given clinical parameters about a patient, can we predict whether or not they have heart disease?**

* Data: The original data came from the Cleveland data from the UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/heart+Disease
* There is also a version of it available on Kaggle. https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

* Evaluation: If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue the project.

* Features: This is where you'll get information about each of the features in your data. You can do this by doing your own research (such as looking at the links above) or by talking to a subject matter expert (someone who knows about the dataset).

## Features 

* age - age in years

* sex - (1 = male; 0 = female)

* cp - chest pain type
    * 0: Typical angina: chest pain related to decreased blood supply to the heart
    * 1: Atypical angina: chest pain not related to the heart
    * 2: Non-anginal pain: typically esophageal spasms (not heart related)
    * 3: Asymptomatic: chest pain not showing signs of disease

* trestbps - resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern

* chol - serum cholesterol in mg/dl
    * serum = LDL + HDL + 0.2 * triglycerides
    * above 200 is cause for concern

* fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    * above 126 mg/dl signals diabetes

* restecg - resting electrocardiographic results
    * 0: Nothing to note
    * 1: ST-T wave abnormality
        * can range from mild symptoms to severe problems
        * signals a non-normal heartbeat
    * 2: Possible or definite left ventricular hypertrophy
        * enlarged main pumping chamber of the heart

* thalach - maximum heart rate achieved

* exang - exercise induced angina (1 = yes; 0 = no)

* oldpeak - ST depression induced by exercise relative to rest; looks at the stress of the heart during exercise (an unhealthy heart will stress more)

* slope - the slope of the peak exercise ST segment
    * 0: Upsloping: better heart rate with exercise (uncommon)
    * 1: Flatsloping: minimal change (typical healthy heart)
    * 2: Downsloping: signs of an unhealthy heart

* ca - number of major vessels (0-3) colored by fluoroscopy
    * a colored vessel means the doctor can see the blood passing through
    * the more blood movement the better (no clots)

* thal - thallium stress result
    * 1,3: normal
    * 6: fixed defect: used to be a defect but ok now
    * 7: reversible defect: no proper blood movement when exercising

* target - have disease or not (1 = yes, 0 = no) (= the predicted attribute)
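
The `chol` note above gives total serum cholesterol as LDL + HDL + 0.2 * triglycerides. Here is a minimal worked example of that arithmetic. The function name and the lab values are hypothetical and are not taken from the dataset.

```python
def total_serum_cholesterol(ldl, hdl, triglycerides):
    """Total serum cholesterol in mg/dl, using the formula from the feature notes."""
    return ldl + hdl + 0.2 * triglycerides


# Hypothetical lab values in mg/dl, chosen only to illustrate the arithmetic:
# 130 + 50 + 0.2 * 150 = 210 mg/dl
total = total_serum_cholesterol(ldl=130, hdl=50, triglycerides=150)

# The feature notes treat anything above 200 mg/dl as cause for concern
is_concerning = total > 200
print(total, is_concerning)
```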

### Importing Dependencies

In [None]:
#Importing Libraries
import numpy as np
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

In [None]:
#Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

#Model Evaluation
from sklearn.model_selection import train_test_split , cross_val_score
from sklearn.metrics import confusion_matrix , classification_report
from sklearn.model_selection import RandomizedSearchCV , GridSearchCV
#plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import precision_score , f1_score , recall_score , RocCurveDisplay

### Loading the Data 
 

In [None]:
df = pd.read_csv("heart-disease.csv")
df

# Data Exploration 
* Handling missing values and Outliers

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
df.tail()
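
The cells above inspect the data but do not count missing values or outliers directly. One possible way to do that is sketched below. The helper name is hypothetical, the 1.5 * IQR rule is just one common outlier convention, and the toy frame uses made-up values standing in for `df`.

```python
import numpy as np
import pandas as pd


def missing_and_outlier_summary(frame):
    """Per-column count of missing values and of values outside the 1.5 * IQR fences."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against NaN are False, so missing values are not counted as outliers
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return pd.DataFrame({"missing": numeric.isna().sum(),
                         "iqr_outliers": outside.sum()})


# Toy frame with made-up values, standing in for df
toy = pd.DataFrame({"chol": [200, 210, 220, 230, 600, np.nan],
                    "age": [40, 45, 50, 55, 60, 65]})
summary = missing_and_outlier_summary(toy)
print(summary)
```

In the notebook this would be called as `missing_and_outlier_summary(df)`.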

## Visualizing and Finding Patterns in Data

In [None]:
#Let's find out how many of each class are there
df["target"].value_counts()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.sex.value_counts()

In [None]:
#Checking how sex corresponds to the target column
pd.crosstab(df.sex , df.target)

In [None]:
pd.crosstab(df.target , df.sex).plot(kind = "bar", color = ["salmon" , "lightblue"], figsize = (10,5), title = "How Gender correlates to Heart Disease" , ylabel = "Count")
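
Raw counts can be hard to compare when the two sexes are not equally represented. The sketch below uses the `normalize="index"` option of `pd.crosstab` to show the share of each target value within each sex. The toy frame is made-up data, not the heart disease dataset.

```python
import pandas as pd

# Made-up toy data: 4 rows with sex = 0 and 6 rows with sex = 1
toy = pd.DataFrame({"sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]})

# normalize="index" divides each row by its row total,
# so every row sums to 1 and groups of different size are comparable
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```

On the notebook's data the equivalent call would be `pd.crosstab(df.sex, df.target, normalize="index")`.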

In [None]:
#Comparing Age vs Maximum Heart Rate
plt.figure(figsize = (10,6))

#Using scatter plot to plot age vs maximum heart rate for target = 1
plt.scatter(df.age[df.target == 1],
           df.thalach[df.target == 1],
           color = "salmon")
#Using scatter plot to plot age vs maximum heart rate for target = 0
plt.scatter(df.age[df.target == 0],
           df.thalach[df.target == 0],
           color = "lightblue")
plt.title("Heart Disease as a function of Age and Maximum Heart Rate")
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.legend(["Disease" , "No Disease"])

In [None]:
df.corr()

In [None]:
# Visualising Graphically
corr_matrix = df.corr()
fig , ax = plt.subplots(figsize = (12,7))
ax = sns.heatmap(corr_matrix , annot = True , linewidths = 0.5 , fmt = ".2f", cmap = sns.cubehelix_palette(as_cmap=True))
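
The heatmap shows every pairwise correlation, but for modelling the column of interest is usually the one for `target`. The sketch below ranks features by the absolute value of their correlation with the target. The helper name is hypothetical and the toy frame is made up.

```python
import pandas as pd


def correlation_with_target(frame, target="target"):
    """Correlation of every other column with the target, strongest first by absolute value."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up toy data: "a" falls as target rises, "b" is unrelated to target
toy = pd.DataFrame({"a": [4, 3, 2, 1],
                    "b": [0, 1, 0, 1],
                    "target": [0, 0, 1, 1]})
ranked = correlation_with_target(toy)
print(ranked)
```

In the notebook this would be called as `correlation_with_target(df)`.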

## Modelling 

In [None]:
#Splitting the dataset into features (X) and target (y)
X = df.drop("target" , axis = 1)
y = df["target"]

#setting the random seed
np.random.seed(42)

#Splitting the data
X_train, X_test, y_train , y_test = train_test_split(X,y , test_size = 0.2, stratify = y)
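
The split above passes `stratify = y`, which keeps the class proportions roughly equal in the train and test sets. A small sketch on synthetic labels illustrates the effect. The arrays are made up, and `random_state` is used here as an explicit alternative to seeding NumPy globally.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 positives and 40 negatives (not the notebook's data)
y_demo = np.array([1] * 60 + [0] * 40)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification both shares should stay close to the overall 0.6
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```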

In [None]:
X

In [None]:
y_train

In [None]:
X_train 

## Training the Model

Using the scikit-learn estimator map as a reference: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

*We are going to use five models:*

> 1. Logistic Regression
> 2. K-Nearest Neighbours Classifier
> 3. Random Forest Classifier
> 4. Decision Tree Classifier
> 5. Naive Bayes Classifier

In [None]:
#Creating an empty list to store model scores 
model_scores = []


### Logistic Regression

In [None]:
Logistic_clf = LogisticRegression()
Logistic_clf.fit(X_train , y_train)
model_scores.append(Logistic_clf.score(X_test , y_test))
Logistic_clf.score(X_test , y_test)

### KNN Classifier

In [None]:
KNN_clf = KNeighborsClassifier(n_neighbors = 4)
KNN_clf.fit(X_train , y_train)

model_scores.append(KNN_clf.score(X_test , y_test))
KNN_clf.score(X_test , y_test)

### Random Forest Classifier 

In [None]:
RandomForest_clf = RandomForestClassifier(n_estimators = 100)
RandomForest_clf.fit(X_train , y_train )
model_scores.append(RandomForest_clf.score(X_test , y_test))
RandomForest_clf.score(X_test , y_test)

### Decision Tree 

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=8, random_state=0)
tree.fit(X_train, y_train)

model_scores.append(tree.score(X_test, y_test))
tree.score(X_test, y_test)

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB  
NaiveBayes_clf = GaussianNB()  
NaiveBayes_clf.fit(X_train, y_train)
model_scores.append(NaiveBayes_clf.score(X_test , y_test))
NaiveBayes_clf.score(X_test, y_test)

In [None]:
model_scores
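
`model_scores` is a plain list, so each score is identified only by its position, and re-running a cell appends a duplicate. One possible way to label and sort the scores is sketched below. The helper name is hypothetical and the numbers are placeholders, not results from this notebook.

```python
# Order in which the cells above append to model_scores
model_names = ["Logistic Regression", "KNN", "Random Forest",
               "Decision Tree", "Naive Bayes"]


def label_scores(names, scores):
    """Pair each model name with its score, best score first."""
    if len(names) != len(scores):
        # Catches the case where a cell was re-run and appended twice
        raise ValueError("expected one score per model name")
    return dict(sorted(zip(names, scores), key=lambda item: item[1], reverse=True))


# Placeholder numbers only, NOT results from this notebook
demo_scores = label_scores(model_names, [0.1, 0.2, 0.3, 0.4, 0.5])
for name, score in demo_scores.items():
    print(f"{name}: {score:.2f}")
```

In the notebook this would be called as `label_scores(model_names, model_scores)`.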

In [None]:
train_scores = []
test_scores = []
neighbors = range(1,100)
knn_clf = KNeighborsClassifier()
# Tuning the n_neighbors hyperparameter of KNN
for i in neighbors:
    knn_clf.set_params(n_neighbors = i)
    knn_clf.fit(X_train , y_train)
    train_scores.append(knn_clf.score(X_train , y_train))
    test_scores.append(knn_clf.score(X_test, y_test))


In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbors, train_scores , label  = "Train Scores" , color = "Salmon")
plt.plot(neighbors, test_scores , label  = "Test Scores" , color = "blue")
plt.xticks(np.arange(1, 100 , 5))
plt.legend()
plt.xlabel("Number of Neighbours")
plt.ylabel("Model Score")

## Tuning Hyperparameters using RandomizedSearchCV

In [None]:
#generating grid space for logistic regression 
log_reg_grid = {"C" : np.logspace(-4,4,20) , 
               "solver" : ["liblinear"]}

#Creating Grid for Random Forest
#min_samples_split must be at least 2; min_samples_leaf may start at 1
rf_grid = {"n_estimators" : np.arange(10,1000 , 50) , 
          "max_depth" : [None, 3 ,5 , 10] ,
          "min_samples_split" : np.arange(2,20,2) ,
          "min_samples_leaf": np.arange(1 ,20, 2)}

In [None]:
#Tune Logistic Regression 
np.random.seed(42)

rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions = log_reg_grid ,
                               cv = 5 ,
                               n_iter = 20 ,
                               verbose = True)
#Fitting the model
rs_log_reg.fit(X_train , y_train)
rs_log_reg.score(X_test, y_test)

In [None]:
rs_log_reg.best_params_

In [None]:
#Tune Random Forest
np.random.seed(42)

rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                                param_distributions = rf_grid ,
                               cv = 5 ,
                               n_iter = 20 ,
                               verbose = True)
#Fitting the model
rs_rf.fit(X_train , y_train)
rs_rf.score(X_test, y_test)

In [None]:
rs_rf.best_params_
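
Besides `best_params_`, a fitted search exposes `best_score_` (the mean cross-validated score of the best candidate) and `cv_results_` (one entry per sampled candidate). The sketch below inspects them, assuming synthetic data from `make_classification` and a small logistic regression search in place of the notebook's fitted objects.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(LogisticRegression(solver="liblinear"),
                            param_distributions={"C": np.logspace(-4, 4, 20)},
                            n_iter=5,
                            cv=5,
                            random_state=42)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best rank first
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["param_C", "mean_test_score", "std_test_score"]
].head(3)

print(search.best_score_)
print(top)
```

The same attributes are available on `rs_log_reg` and `rs_rf`. Note that `best_score_` comes from cross-validation on the training data, while `.score(X_test, y_test)` is measured on the held-out test set.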

In [None]:
param_grid_nb = {
    'var_smoothing': np.logspace(0,-9, num=100)
}
nbModel_grid = RandomizedSearchCV(estimator=GaussianNB(), param_distributions=param_grid_nb, verbose=1, cv=10, n_jobs=-1 , n_iter = 20)

nbModel_grid.fit(X_train, y_train)

In [None]:
nbModel_grid.score(X_test , y_test)

## GridSearchCV 

In [None]:
#Creating Grid for Logistic Regression
log_reg_grid = {"C" : np.logspace(-4 ,4 ,20) ,
               "solver" : ["liblinear"]}
#Creating Grid for Random Forest
rf_grid = {"n_estimators" : np.arange(10 , 1000 , 50), 
           "max_depth" : [None, 3 , 5 ,10] , 
           "min_samples_split" : np.arange(2 ,20 ,2) ,
           "min_samples_leaf" : np.arange(1,20,2)}


In [None]:
#Setup grid hyperparameter search for Logistic Regression
gs_log_Reg = GridSearchCV(LogisticRegression(),
                         param_grid = log_reg_grid ,
                         cv=5 ,
                         verbose = True)

gs_log_Reg.fit(X_train, y_train)
gs_log_Reg.score(X_test , y_test)

In [None]:
gs_log_Reg.best_params_
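
`GridSearchCV` fits every combination in the grid, so its cost grows with the product of the option counts. The sketch below counts the candidates in the two grids defined in this section using scikit-learn's `ParameterGrid`. The size difference may be why only logistic regression is grid-searched here, though that is an inference rather than something the notebook states.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as defined in the GridSearchCV section
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

n_log_reg = len(ParameterGrid(log_reg_grid))   # 20 * 1
n_rf = len(ParameterGrid(rf_grid))             # 20 * 4 * 9 * 10

cv_folds = 5
print(f"Logistic regression: {n_log_reg} candidates, {n_log_reg * cv_folds} fits")
print(f"Random forest: {n_rf} candidates, {n_rf * cv_folds} fits")
```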


## Evaluating Models 

* ROC curve and AUC score
* Confusion Matrix
* Classification Report
* Precision
* Recall
* F1 score

In [None]:
y_preds = rs_rf.predict(X_test)
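
Precision, recall and F1 score can all be derived from the four counts in the confusion matrix. The sketch below works through them on made-up labels, which stand in for `y_test` and `y_preds`, and checks the hand-computed values against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels standing in for y_test and y_preds
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```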