# <center> Heart Disease Prediction Classifier
   
<b>Author: Kotha Charan



# Problem Statement
### Build a classification model that predicts heart disease in a subject. (Note: the target column to predict is 'TenYearCHD', where CHD = coronary heart disease.)

# Attributes

1.	male: male(1) or female(0);(Nominal)
2.	age: age of the patient;(Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
3.	currentSmoker: whether or not the patient is a current smoker (Nominal)
4.	cigsPerDay: the number of cigarettes that the person smoked on average in one day. (Can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette.)
5.	BPMeds: whether or not the patient was on blood pressure medication (Nominal)
6.	prevalentStroke: whether or not the patient had previously had a stroke (Nominal)
7.	prevalentHyp: whether or not the patient was hypertensive (Nominal)
8.	diabetes: whether or not the patient had diabetes (Nominal)
9.	totChol: total cholesterol level (Continuous)
10.	sysBP: systolic blood pressure (Continuous)
11.	diaBP: diastolic blood pressure (Continuous)
12.	BMI: Body Mass Index (Continuous)
13.	heartRate: heart rate (Continuous - In medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values.)
14.	glucose: glucose level (Continuous)
15.	TenYearCHD: 10-year risk of coronary heart disease, CHD (binary: “1” means “Yes”, “0” means “No”) - Target Variable


# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from google.colab import files
import io

# Reading dataset

In [None]:
uploaded=files.upload()
df=pd.read_csv(io.BytesIO(uploaded['framingham.csv']))


In [None]:
df.head()

# Preprocessing

In [None]:
#Duplicates
df.duplicated().sum()

<b> No duplicates

In [None]:
#Checking relationship between variables
cor=df.corr()
plt.figure(figsize=(20,10))
sn.heatmap(cor,xticklabels=cor.columns,yticklabels=cor.columns,annot=True)
cor

<b>Since the correlation coefficient between education and the target variable TenYearCHD is negligible, we can remove the education column

In [None]:
df=df.drop(['education'],axis=1)

In [None]:
df.head()

In [None]:
# Check missing/null values
df.isnull().sum()

In [None]:
#Drop rows with null values
df=df.dropna()
df.isnull().sum()

<b>No null values

In [None]:
#columns
print(df.columns,"\n")

#Dimensions
print(df.shape,"\n")

#Column datatypes
print(df.dtypes)

# Exploratory Data Analysis

In [None]:
#Distributions of variables
fig=plt.figure(figsize=(20,20))
ax=fig.gca()
df.hist(ax=ax)
plt.show()

<b> The above grid of histograms shows the distribution of every attribute in the dataset

In [None]:
#Distribution of outcome variable, Heart Disease
plt.subplots_adjust(right=2)
plt.subplot(121)
sn.countplot(x="TenYearCHD", data=df)
plt.subplot(122)
labels=[0,1]
plt.pie(df["TenYearCHD"].value_counts(),autopct="%1.1f%%",labels=labels,colors=["lime","red"])
plt.show()

<b> The distribution is highly imbalanced: negative cases far outnumber positive cases. Fitting models on such data leads to the class imbalance problem, so it needs to be addressed before model fitting.

# Resampling imbalanced dataset by oversampling positive cases

In [None]:
target1=df[df['TenYearCHD']==1]
target0=df[df['TenYearCHD']==0]

In [None]:
target1=resample(target1,replace=True,n_samples=len(target0),random_state=40)

In [None]:
target=pd.concat([target0,target1])

In [None]:
target['TenYearCHD'].value_counts()

In [None]:
df=target
np.shape(df)

In [None]:
#Distribution of heart disease cases in the balanced dataset, the outcome variable
plt.subplots_adjust(right=2)
plt.subplot(121)
sn.countplot(x="TenYearCHD", data=df)
plt.subplot(122)
counts=df["TenYearCHD"].value_counts()
plt.pie(counts,autopct="%1.1f%%",labels=counts.index,colors=["red","lime"])
plt.show()

<b> The number of positive and negative cases are equal. Hence the classes are now balanced for model fitting

# Analysis of each attribute

In [None]:
df.columns

In [None]:
df["male"].nunique()

In [None]:
print("Number of males: ",len(df[df["male"]==1]))
print("\nOthers: ",len(df[df["male"]!=1]))

In [None]:
#Distribution of male and not male
sn.countplot(x="male",data=df)

<b> From this plot we can see that the majority are female

In [None]:
df["age"].nunique()

In [None]:
#Distribution of age
plt.figure(figsize=(20,10))
sn.countplot(x="age",data=df)
plt.show()

<b> The above plot shows the distribution of ages. The most common age is 51

In [None]:
#Mode
df["age"].mode()

In [None]:
#Median
df["age"].median()

In [None]:
#Mean
df["age"].mean()

In [None]:
df["age"].describe()

In [None]:
#Boxplot and violinplot distribution of age
plt.subplots_adjust(right=2,top=1)
plt.subplot(121)
sn.boxplot(y=df["age"],color="gold")
plt.subplot(122)
sn.violinplot(y="age",data=df)
plt.show()

<b> The boxplot and violinplot confirm the distribution with respect to statistical results above

In [None]:
#Distribution of ages with respect to gender
plt.figure(figsize=(20,10))
sn.countplot(x="age",data=df,hue="male")
plt.show()

<b> From the above distribution we can see that the most common age is 51 among males and 63 among females, and 51 for both combined

In [None]:
#Distribution of current smokers
plt.subplots_adjust(right=2)
plt.subplot(121)
sn.countplot(x="currentSmoker", data=df)
plt.subplot(122)
#Take the labels from value_counts so each slice is labelled with its own class
counts=df["currentSmoker"].value_counts()
plt.pie(counts,autopct="%1.1f%%",labels=counts.index,colors=["red","lime"])
plt.show()

<b> The majority are current smokers

In [None]:
#Distribution of currentsmokers with respect to gender
sn.countplot(x="currentSmoker", data=df, hue="male")
plt.show()

<b> The majority of current smokers are male

In [None]:
#Distribution of current smokers with respect to age
plt.figure(figsize=(30,10))
sn.countplot(x="age",data=df,hue="currentSmoker")
plt.show()

<b> The most common age among current smokers is 51

In [None]:
#Distribution of age and heart disease condition
plt.figure(figsize=(30,10))
sn.countplot(x="age",data=df,hue="TenYearCHD")
plt.show()

<b> The most common age among heart disease patients is 63

In [None]:
#Distribution of gender and heart disease condition
sn.countplot(x="male",data=df,hue="TenYearCHD")
plt.show()

<b> Most people with heart disease are male

In [None]:
#Distribution of cigsPerDay with respect to gender
plt.figure(figsize=(20,5))
sn.countplot(x="cigsPerDay",data=df,hue="male")

<b> Males smoke more cigarettes per day than females. This supports the earlier conclusion that the majority of current smokers are male

In [None]:
#Distribution of BPMeds
sn.countplot(x="BPMeds",data=df)
plt.show()

<b> We can clearly observe that the vast majority were not on BP medication

In [None]:
#Distribution of BP Meds with respect to age
plt.figure(figsize=(20,5))
sn.countplot(x="age", data=df, hue="BPMeds")
plt.show()

<b> We can observe that those on BP medication are mostly older patients

In [None]:
sn.countplot(x="BPMeds",data=df,hue="male")
plt.show()

<b> Majority among those who needed medication are females

In [None]:
df["prevalentStroke"].nunique()
sn.countplot(x="prevalentStroke",data=df)
plt.show()

<b> Very few individuals had previously experienced a stroke

In [None]:
#Distribution of prevalentStrokes with respect to age
plt.figure(figsize=(20,5))
sn.countplot(x="age", data=df, hue="prevalentStroke")
plt.show()

<b> Out of the few, most people who experienced strokes previously are above the age of 50

In [None]:
#Distribution of prevalentHyp
sn.countplot(x="prevalentHyp",data=df)

<b> Most individuals weren't hypertensive before

In [None]:
#Distribution of prevalentHyp vs age
plt.figure(figsize=(20,5))
sn.countplot(x="age", data=df, hue="prevalentHyp")
plt.show()

<b> The most common age among individuals who were hypertensive before is 63

In [None]:
sn.countplot(x="prevalentHyp", data=df, hue="male")
plt.show()

<b> More females were hypertensive before than males

In [None]:
#Distribution of diabetes
sn.countplot(x="diabetes",data=df)
plt.show()

<b> Most individuals were non diabetic

In [None]:
#Distribution of diabetes in age groups
plt.figure(figsize=(20,5))
sn.countplot(x="age", data=df, hue="diabetes")
plt.show()

<b> The most common age among diabetic cases is 52

In [None]:
#Distribution of diabetes vs gender
sn.countplot(x="diabetes", data=df, hue="male")
plt.show()

<b> Equal number of males and others are diabetic

In [None]:
#Distribution of Total cholesterol
plt.figure(figsize=(10,5))
#distplot is deprecated in seaborn; histplot with kde=True draws the same kind of plot
sn.histplot(df["totChol"],kde=True,stat="density",color='red',edgecolor="black",linewidth=1)
plt.show()

<b> Most people have total cholesterol reading between 220-270

In [None]:
df["totChol"].describe()

<b> Range of cholesterol 113-696

In [None]:
#Boxplot and violinplot distribution of cholesterol
plt.subplots_adjust(right=2,top=1)
plt.subplot(121)
sn.boxplot(y=df["totChol"],color="lightgreen")
plt.subplot(122)
sn.violinplot(y="totChol",data=df)
plt.show()

<b> The above plots and statistical measures suggest that there are outliers in this column. Therefore they must be dropped

In [None]:
#Outliers in totChol
outliers=df[df['totChol']>500]
outliers

In [None]:
#Dropping outlier
df=df.drop(df[df['totChol']>500].index)
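
The cut-off of 500 used above was read off the plots. As a complement, here is a minimal sketch of the common 1.5×IQR rule, which is also how a boxplot decides which points to draw beyond the whiskers. The helper name `iqr_bounds` and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k*IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy cholesterol-like values (made up for illustration)
toy = pd.Series([180, 200, 210, 220, 230, 240, 250, 260, 600])
lower, upper = iqr_bounds(toy)

# Values outside the fences are flagged as candidate outliers
flagged = toy[(toy < lower) | (toy > upper)]
print(lower, upper)
print(flagged.tolist())
```

On a real column the same helper could be called as `iqr_bounds(df["totChol"])`; whether to drop every flagged row or only extreme ones remains a judgment call.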

In [None]:
sn.boxplot(x=df['totChol'])
plt.show()

In [None]:
sn.boxplot(x="male",y="totChol",data=df)
plt.show()

<b>The plot suggests that females tend to have higher total cholesterol than males. This should be read from the position of the median line; a taller box only indicates a wider spread

In [None]:
plt.figure(figsize=(20,10))
sn.boxplot(x="age",y="totChol",data=df)
plt.show()

<b> The boxes shift upward with age, suggesting that older people tend to have higher total cholesterol

In [None]:
#Distribution of Systolic bp
plt.figure(figsize=(10,5))
sn.histplot(df["sysBP"],kde=True,stat="density",color='green',edgecolor="black",linewidth=1)
plt.show()

<b>
    Most people have systolic bp within the range 120-135

In [None]:
df["sysBP"].describe()

In [None]:
sn.boxplot(x=df['sysBP'])
plt.show()

<b> The value 295 is an outlier and should be removed

In [None]:
df=df.drop(df[df['sysBP']==295].index)

In [None]:
sn.boxplot(x="male",y="sysBP",data=df)
plt.show()

<b>Females in general have higher systolic bp than males

In [None]:
plt.figure(figsize=(20,10))
sn.boxplot(x="age",y="sysBP",data=df)
plt.show()

<b>Age and SysBP are positively correlated. Aged people seem to have a higher systolic bp as seen above, in general

In [None]:
#Distribution of diastolic bp
plt.figure(figsize=(10,5))
sn.histplot(df["diaBP"],kde=True,stat="density",color='blue',edgecolor="black",linewidth=1)
plt.show()

<b> Most people have diastolic bp of around 80

In [None]:
df["diaBP"].describe()

In [None]:
sn.boxplot(x=df['diaBP'])
plt.show()

In [None]:
sn.boxplot(x="male",y="diaBP",data=df)
plt.show()

<b> Females have a higher diastolic bp

In [None]:
plt.figure(figsize=(20,10))
sn.boxplot(x="age",y="diaBP",data=df)
plt.show()

<b>Age and diaBP are very slightly positively correlated. Aged people seem to have a slightly higher diastolic bp as seen above, in general

In [None]:
#sysBP vs diaBP with respect to currentSmoker and male attributes
sn.lmplot(x='sysBP',y= 'diaBP', 
           data=df,
           hue="TenYearCHD",
           col="male",row="currentSmoker")
plt.show()

<b>
    The above graph plots the relationship between systolic blood pressure and diastolic blood pressure for patients based on their gender and whether they are current smokers or not and plots the best fit line

# Feature Selection


In [None]:
#To identify the features that contribute most to the outcome variable, heart disease
X=df.iloc[:,0:14]
y=df.iloc[:,-1]


In [None]:
#Apply SelectKBest and extract top 10 features
best=SelectKBest(score_func=chi2, k=10)

In [None]:
fit=best.fit(X,y)

In [None]:
df_scores=pd.DataFrame(fit.scores_)
df_columns=pd.DataFrame(X.columns)

In [None]:
#Join the two dataframes
scores=pd.concat([df_columns,df_scores],axis=1)
scores.columns=['Feature','Score']
print(scores.nlargest(11,'Score'))

In [None]:
#To visualize feature selection
scores=scores.sort_values(by="Score", ascending=False)
plt.figure(figsize=(20,7))
sn.barplot(x='Feature',y='Score',data=scores,palette='BuGn_r')
plt.show()

<B>Features and their respective scores

In [None]:
#Select 10 features
features=scores["Feature"].tolist()[:10]
features
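
The column names can also be read directly from the fitted selector instead of sorting the score table by hand. The following is a minimal sketch on made-up data, assuming the same `SelectKBest` and `chi2` setup as above; the toy column names are hypothetical. Note that `chi2` only accepts non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative data (made up for illustration, not the Framingham dataset)
rng = np.random.default_rng(0)
n = 200
y_toy = rng.integers(0, 2, size=n)
X_toy = pd.DataFrame({
    "informative": y_toy * 5 + rng.integers(0, 3, size=n),  # depends on the label
    "noise_a": rng.integers(0, 10, size=n),                 # unrelated to the label
    "noise_b": rng.integers(0, 10, size=n),                 # unrelated to the label
})

# Fit the selector and keep the single best-scoring column
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```

With the notebook's objects, the equivalent would be `X.columns[fit.get_support()]`, which returns the ten selected names in their original column order rather than in score order.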

<b> These ten features have the strongest influence on the target variable. In descending order of score, they are:</b>
<li>sysBP
<li>glucose
<li>age
<li>totChol
<li>cigsPerDay
<li>diaBP
<li>prevalentHyp
<li>BMI
<li>BPMeds
<li>male

In [None]:
df=df[['sysBP','glucose','age','cigsPerDay','totChol','diaBP','prevalentHyp','BPMeds','male','BMI','TenYearCHD']]
df.head()

# Feature Scaling

In [None]:
#Perform feature scaling to scale our features for different models 
scaler=MinMaxScaler(feature_range=(0,1)) 
scaled_df=pd.DataFrame(scaler.fit_transform(df),columns=df.columns)

In [None]:
scaled_df.describe()

In [None]:
df.describe()

In [None]:
df=scaled_df
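
For reference, `MinMaxScaler` with `feature_range=(0,1)` maps each column value x to (x - min) / (max - min), using the minimum and maximum seen during `fit`. The sketch below, on made-up numbers, shows that a new value has to be transformed with the already fitted scaler, and that the result matches the formula computed by hand. The names `toy_scaler`, `train` and `new_row` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (made up): the scaler is fitted on these rows only
train = np.array([[100.0], [120.0], [160.0], [200.0]])
new_row = np.array([[150.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)

# Reuse the fitted scaler for unseen data; do not refit on the new row
scaled_new = toy_scaler.transform(new_row)

# Same result by hand: (x - min) / (max - min)
manual = (new_row - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
print(scaled_new, manual)
```

Refitting a scaler on a single row would give that row a range of zero, so the fitted minimum and maximum from the training data are what make the scaled value meaningful.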

In [None]:
#Checking relationship between variables once again
cor=df.corr()
plt.figure(figsize=(20,10))
sn.heatmap(cor,xticklabels=cor.columns,yticklabels=cor.columns,annot=True)
cor

In [None]:
plt.figure(figsize=(20,20))
sn.pairplot(df)
plt.show()

<b> The above graphs describe the relationship between each attribute

# Train-Test split

In [None]:
#Train-test split
X=df.drop(['TenYearCHD'],axis=1)
y=df['TenYearCHD']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4, random_state=0)
#print(X_train,X_test,y_train,y_test)
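
One point worth keeping in mind: because the positive cases were oversampled before this split, copies of the same row can end up in both the training and the test set, which can make test scores look better than they would be on unseen patients. A minimal sketch of an alternative order follows, assuming the same `resample` approach as above: split first, then oversample only the training part. The toy frame and the names `train_part`, `test_part` and `train_balanced` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced frame (made up): 90 negatives, 10 positives
toy = pd.DataFrame({
    "feature": np.arange(100, dtype=float),
    "TenYearCHD": [0] * 90 + [1] * 10,
})

# 1) Split first, stratifying so both parts keep the original class ratio
train_part, test_part = train_test_split(
    toy, test_size=0.4, random_state=0, stratify=toy["TenYearCHD"]
)

# 2) Oversample the positive cases inside the training part only
pos = train_part[train_part["TenYearCHD"] == 1]
neg = train_part[train_part["TenYearCHD"] == 0]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=40)
train_balanced = pd.concat([neg, pos_up])

print(train_balanced["TenYearCHD"].value_counts().to_dict())

# Rows shared between the balanced training part and the untouched test part
overlap = set(train_balanced.index) & set(test_part.index)
print(overlap)
```

The test part keeps its original class ratio in this arrangement, so accuracy alone becomes less informative there and the per-class precision and recall deserve more attention.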

# Fitting Models

## Logistic Regression

In [None]:
reg=LogisticRegression(random_state=0)
lr=reg.fit(X_train,y_train)

In [None]:
y_pred=lr.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
#Confusion Matrix
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))
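
The precision, recall and accuracy figures in the report can all be derived from the four cells of the confusion matrix. A small worked sketch on made-up labels (the lists `y_true` and `y_hat` are hypothetical):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
print(tn, fp, fn, tp, precision, recall, accuracy)
```

For a heart disease screen, recall on the positive class is usually the figure to watch, since a false negative means a missed patient.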

## KNeighbors Classifier

In [None]:
knn=KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train,y_train)

In [None]:
y_pred=knn.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
#Confusion Matrix
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

## Decision Tree

In [None]:
dtc=DecisionTreeClassifier(random_state=0)
dtc.fit(X_train,y_train)

In [None]:
y_pred=dtc.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
#Confusion Matrix
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf=GradientBoostingClassifier()
print(clf.get_params())

### Hyperparameter tuning using Randomized search Cross Validation

In [None]:
#Number of trees
n_estimators = [int(i) for i in np.linspace(start=100,stop=1000,num=10)]
#Number of features to consider at every split
max_features = ['sqrt','log2']  #'auto' was removed in recent scikit-learn releases
#Maximum number of levels in tree
max_depth = [int(i) for i in np.linspace(10, 100, num=10)]
max_depth.append(None)
#Minimum number of samples required to split a node
min_samples_split=[2,5,10]
#Minimum number of samples required at each leaf node
min_samples_leaf = [1,2,4]

#Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

In [None]:
gb=GradientBoostingClassifier(random_state=0)
#Random search of parameters, using 3 fold cross validation, 
#search across 100 different combinations
gb_random = RandomizedSearchCV(estimator=gb, param_distributions=random_grid,
                              n_iter=100, scoring='f1', 
                              cv=3, verbose=2, random_state=0, n_jobs=-1,
                              return_train_score=True)
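
A search object like the one above only runs once `fit` is called; afterwards the best sampled combination is available through `best_params_`. The following is a minimal sketch on synthetic data with a deliberately tiny grid so that it finishes quickly. The names `toy_grid`, `search` and `best_model` are hypothetical, and the data is not the Framingham dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the Framingham dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately tiny grid so the sketch runs in seconds
toy_grid = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
    "min_samples_split": [2, 5],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_distributions=toy_grid,
    n_iter=4, scoring="f1", cv=3, random_state=0,
)
search.fit(X_toy, y_toy)               # the search only runs when fit is called

print(search.best_params_)             # best sampled combination
best_model = search.best_estimator_    # refit on all of X_toy, y_toy by default
```

With the notebook's objects this would be `gb_random.fit(X_train,y_train)` followed by `gb_random.best_params_`; with 100 iterations and up to 1000 deep trees that run can take a long time.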

In [None]:
clf=GradientBoostingClassifier(n_estimators=900, max_depth=40, min_samples_split=5,random_state=0)

In [None]:
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
#Confusion Matrix
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

## RandomForest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
print(rfc.get_params())

### Hyperparameter tuning using Randomized search Cross Validation

In [None]:
#Number of trees
n_estimators = [int(i) for i in np.linspace(start=100,stop=1000,num=10)]
#Number of features to consider at every split
max_features = ['sqrt','log2']  #'auto' was removed in recent scikit-learn releases
#Maximum number of levels in tree
max_depth = [int(i) for i in np.linspace(10, 100, num=10)]
max_depth.append(None)
#Minimum number of samples required to split a node
min_samples_split=[2,5,10]
#Minimum number of samples required at each leaf node
min_samples_leaf = [1,2,4]
#Method of selecting samples for training each tree
bootstrap = [True, False]

#Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
rf=RandomForestClassifier(random_state=0)
#Random search of parameters, using 3 fold cross validation, 
#search across 100 different combinations
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter=100, scoring='f1', 
                              cv=3, verbose=2, random_state=0, n_jobs=-1,
                              return_train_score=True)

In [None]:
rfc=RandomForestClassifier(n_estimators=900,max_depth=50,random_state=0)
rfc.fit(X_train,y_train)

In [None]:
y_pred=rfc.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
#Confusion Matrix
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

# Accuracy Scores

<b><li> Logistic Regression- 66.61%
   <li> KNeighbors Classification- 86.34%
   <li> Decision Tree- 89.21%
   <li> Gradient Boosting Classification- 89.13%
   <li> Random Forest Classification- 93.15%

### We can conclude that the Random Forest Classification model is best suited for this dataset.
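
The scores above come from a single train-test split. As a complementary check, the models can be compared with k-fold cross-validation, which averages accuracy over several splits. This is a minimal sketch on synthetic data, not a rerun of the results above; the dictionary names `models` and `cv_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Framingham dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model
cv_accuracy = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in cv_accuracy.items():
    print(f"{name}: {score:.2%}")
```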

In [114]:
def start_questionnaire():
    #Prompts keyed by feature name, so every answer is stored under the right column
    prompts={
        'sysBP':"Patient's systolic blood pressure : >>> ",
        'glucose':"What is the Patient's glucose level (mg/dL) >>> ",
        'age':"Patient's age : >>> ",
        'totChol':"Patient's cholesterol level (mg/dL): >>> ",
        'cigsPerDay':"Patient's smoked cigarettes per day : >>> ",
        'diaBP':"Patient's diastolic blood pressure : >>> ",
        'prevalentHyp':"Was Patient hypertensive? Yes=1, No=0 >>> ",
        'BMI':"Body Mass Index ? (weight(kg)/height(m)^2) >>> ",
        'BPMeds':"Has Patient been on Blood Pressure Medication? Yes=1, No=0 >>> ",
        'male':"Patient's gender, male=1, female=0: >>> ",
    }

    print('Input Patient Information : ')

    #input() returns strings, so convert every answer to a number
    my_data={name:float(input(prompt)) for name,prompt in prompts.items()}

    #Use the same column order as the training features
    parameters=list(X_train.columns)
    my_df=pd.DataFrame(my_data,index=[0])[parameters]

    #Reuse the min/max learned by the scaler fitted in Feature Scaling;
    #refitting a scaler on a single row would map every value to 0
    n=len(parameters)
    my_df_scaled=(my_df-scaler.data_min_[:n])/scaler.data_range_[:n]

    #Predict on the scaled row, since the model was trained on scaled features
    my_y_pred=rfc.predict(my_df_scaled)[0]
    print('\n')
    print('Result:')
    if my_y_pred==1:
        print("The model predicts that the patient will develop heart disease within ten years. ")
    else:
        print("The model predicts that the patient will not develop heart disease within ten years. ")
start_questionnaire()

