## Project Overview
The goal of this project is to develop a predictive model that estimates the likelihood of a person developing heart disease based on factors that include: smoking status, family history of heart disease, diabetes, C-reactive protein (CRP) level, homocysteine level, fasting blood sugar, sugar consumption, sleep hours, stress level, alcohol consumption, low HDL and high LDL cholesterol, blood pressure, age, exercise habits, and gender. Cardiovascular diseases are among the leading causes of death globally, including here in Kenya and in the U.S., so a model that flags risk early has the potential to save lives. The results are intended to help different stakeholders make better data-driven health decisions in the future.

# BUSINESS UNDERSTANDING


## The target audience for this project are:
Patients

Government

Insurance companies

Medical practitioners and agency boards


## Problem Statement


## Business Objectives
This project aims to:
1. Develop a predictive model that estimates the likelihood of a person developing heart disease.
2. Analyze the key factors that affect heart disease risk using machine learning techniques.
3. Build models that can support and assess our decisions.
4. Evaluate the models to review results and determine next steps.
5. Deploy the final model.
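
Objectives 1, 2, and 4 can be sketched as a small baseline workflow. The block below is only an illustration under stated assumptions: it uses synthetic stand-in data rather than `heart_disease.csv`, a hypothetical rule to generate the target, and a random forest chosen purely as an example. It should not be read as this project's final model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real project would load heart_disease.csv instead.
rng = np.random.default_rng(42)
n_rows = 500
features = pd.DataFrame({
    "Age": rng.integers(18, 81, n_rows),
    "BMI": rng.uniform(18, 40, n_rows),
    "Cholesterol Level": rng.integers(150, 301, n_rows),
})
# Hypothetical target: risk rises loosely with age, plus noise.
target = ((features["Age"] + rng.normal(0, 15, n_rows)) > 55).astype(int)

# Objective 1: hold out a test set and fit a classifier.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Objective 4: evaluate on the held-out rows.
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, predictions),
    "recall": recall_score(y_test, predictions),
    "roc_auc": roc_auc_score(y_test, probabilities),
}

# Objective 2: inspect which features the model relied on most.
importances = pd.Series(model.feature_importances_, index=features.columns)

print(metrics)
print(importances.sort_values(ascending=False))
```

Recall is included because, for a screening model, missing a true case is usually the more costly error; which metric to prioritize remains a project decision.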


## Metrics of success (benefits)
1. Early Detection and Prevention- Helps identify individuals who are at high risk of heart disease before symptoms appear. This encourages early lifestyle changes that prevent disease progression and gives access to the right medication, which reduces long-run costs.

2. Personalized Healthcare- Helps doctors and patients make informed decisions based on personal risk factors and judge which tests and screenings should be done.

3. Cost Savings in Healthcare- Detecting symptoms early and managing risk factors reduces hospital admissions. It also lowers the burden on insurance and healthcare systems by minimizing costly late-stage treatments.

4. Research and Development- A predictive model can aid medical research by identifying new patterns in heart disease risk. It provides insights into the key factors contributing to heart disease, leading to better prevention strategies.

5. Data-Driven Public Health Strategies- Helps governments and healthcare organizations design policies to reduce heart disease prevalence and allows governments to make better budget allocations for these cases.


## Data Understanding

Homocysteine level- High homocysteine levels may mean you are lacking vitamin B6, B12, or folic acid. Without treatment, this puts you at risk of blood clots, heart disease, and stroke.

CRP level- A CRP test is a blood test done to check for inflammation in your body. Normally you have low C-reactive protein levels, but your body releases more CRP into your bloodstream if you have inflammation.

Fasting blood sugar- Also known as the fasting glucose test. Anything less than 100 mg/dL is considered normal.

Triglyceride level- Triglycerides provide the body with energy from food. High levels can indicate diabetes or prediabetes, which can increase the risk of heart disease or stroke. Anything less than 150 mg/dL is normal.

Cholesterol level- Anything less than 200 mg/dL is considered normal according to Johns Hopkins Medicine.

LDL (bad) cholesterol level- 100-129 mg/dL is near/above optimal, 130-159 mg/dL is borderline high, 160-189 mg/dL is high, and 190 mg/dL and above is very high. Your cholesterol levels show how much cholesterol is circulating in your blood. Your HDL ("good" cholesterol) is the one number you want to be high (ideally above 60 mg/dL). Your LDL ("bad" cholesterol) should be below 100 mg/dL. Your total cholesterol should be below 200 mg/dL.

BMI- Adults should keep their BMI between 18.5 and 24.9, as this is considered normal. Adults with a BMI of 25 or more are considered overweight, and a BMI of 30 or more is considered obese. Older adults, though, do better if they have a BMI between 25 and 27.

Blood pressure levels- Normal blood pressure is usually considered to be between 90/60 mmHg and 120/80 mmHg.
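
The reference ranges above can be turned into simple flags on the numeric columns. The following is a minimal sketch, assuming the column names shown later in the dataset and using only the upper limits quoted in this section. The helper name `flag_above_normal` and the toy rows are hypothetical, and the thresholds are descriptive aids rather than clinical rules.

```python
import pandas as pd

# Upper limits of the "normal" range quoted in the notes above
# (mg/dL for the blood tests, kg/m^2 for BMI).
NORMAL_UPPER_LIMITS = {
    "Fasting Blood Sugar": 100,
    "Triglyceride Level": 150,
    "Cholesterol Level": 200,
    "BMI": 25,
}

def flag_above_normal(df: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame: True where a reading is at or above the quoted limit."""
    flags = pd.DataFrame(index=df.index)
    for column, limit in NORMAL_UPPER_LIMITS.items():
        if column in df.columns:
            # Missing readings compare as False, so they are never flagged.
            flags[column + " Above Normal"] = df[column] >= limit
    return flags

# Toy rows for illustration only (not taken from heart_disease.csv).
sample = pd.DataFrame({
    "Fasting Blood Sugar": [92.0, 157.0, None],
    "Triglyceride Level": [133.0, 393.0, 149.0],
    "Cholesterol Level": [155.0, 286.0, 200.0],
    "BMI": [24.9, 29.8, 31.2],
})
flags = flag_above_normal(sample)
print(flags)
```

Flags like these could serve as extra features or as a quick sanity check on how many rows fall outside the quoted ranges.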


In [40]:

# importing the libraries used in this model
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# For NLP
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer  # RegexpTokenizer lives in nltk.tokenize, not nltk.corpus
from nltk.stem import PorterStemmer, WordNetLemmatizer
# For modelling
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,f1_score, roc_auc_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer

In [41]:
# Load the dataset and display a preview
data = pd.read_csv('heart_disease.csv')
data


| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56.0 | Male | 153.0 | 155.0 | High | Yes | Yes | No | 24.991591 | Yes | ... | No | High | Medium | 7.633228 | Medium | 342.0 | | 12.969246 | 12.387250 | No |
| 1 | 69.0 | Female | 146.0 | 286.0 | High | No | Yes | Yes | 25.221799 | No | ... | No | Medium | High | 8.744034 | Medium | 133.0 | 157.0 | 9.355389 | 19.298875 | No |
| 2 | 46.0 | Male | 126.0 | 216.0 | Low | No | No | No | 29.855447 | No | ... | Yes | Low | Low | 4.440440 | Low | 393.0 | 92.0 | 12.709873 | 11.230926 | No |
| 3 | 32.0 | Female | 122.0 | 293.0 | High | Yes | Yes | No | 24.130477 | Yes | ... | Yes | Low | High | 5.249405 | High | 293.0 | 94.0 | 12.509046 | 5.961958 | No |
| 4 | 60.0 | Male | 166.0 | 242.0 | Low | Yes | Yes | Yes | 20.486289 | Yes | ... | No | Low | High | 7.030971 | High | 263.0 | 154.0 | 10.381259 | 8.153887 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 25.0 | Female | 136.0 | 243.0 | Medium | Yes | No | No | 18.788791 | Yes | ... | Yes | Medium | High | 6.834954 | Medium | 343.0 | 133.0 | 3.588814 | 19.132004 | Yes |
| 9996 | 38.0 | Male | 172.0 | 154.0 | Medium | No | No | No | 31.856801 | Yes | ... | Yes | | High | 8.247784 | Low | 377.0 | 83.0 | 2.658267 | 9.715709 | Yes |
| 9997 | 73.0 | Male | 152.0 | 201.0 | High | Yes | No | Yes | 26.899911 | No | ... | Yes | | Low | 4.436762 | Low | 248.0 | 88.0 | 4.408867 | 9.492429 | Yes |
| 9998 | 23.0 | Male | 142.0 | 299.0 | Low | Yes | No | Yes | 34.964026 | Yes | ... | Yes | Medium | High | 8.526329 | Medium | 113.0 | 153.0 | 7.215634 | 11.873486 | Yes |


In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   9971 non-null   float64
 1   Gender                9981 non-null   object 
 2   Blood Pressure        9981 non-null   float64
 3   Cholesterol Level     9970 non-null   float64
 4   Exercise Habits       9975 non-null   object 
 5   Smoking               9975 non-null   object 
 6   Family Heart Disease  9979 non-null   object 
 7   Diabetes              9970 non-null   object 
 8   BMI                   9978 non-null   float64
 9   High Blood Pressure   9974 non-null   object 
 10  Low HDL Cholesterol   9975 non-null   object 
 11  High LDL Cholesterol  9974 non-null   object 
 12  Alcohol Consumption   7414 non-null   object 
 13  Stress Level          9978 non-null   object 
 14  Sleep Hours           9975 non-null   float64
 15  Sugar Consumption   

In [48]:
# Check for unique values per column
data.nunique()

Age                       63
Gender                     2
Blood Pressure            61
Cholesterol Level        151
Exercise Habits            3
Smoking                    2
Family Heart Disease       2
Diabetes                   2
BMI                     9978
High Blood Pressure        2
Low HDL Cholesterol        2
High LDL Cholesterol       2
Alcohol Consumption        3
Stress Level               3
Sleep Hours             9975
Sugar Consumption          3
Triglyceride Level       301
Fasting Blood Sugar       81
CRP Level               9974
Homocysteine Level      9980
Heart Disease Status       2
dtype: int64
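
The columns with only two or three unique values are categorical and need numeric encoding before most scikit-learn models can use them. The block below is one possible sketch, not the notebook's own preprocessing. It assumes the `Low`/`Medium`/`High` and `Yes`/`No` labels seen in the data preview, and the helper name `encode_categoricals` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed orderings, based on the labels visible in the data preview.
ORDINAL_LEVELS = {"Low": 0, "Medium": 1, "High": 2}
BINARY_LEVELS = {"No": 0, "Yes": 1}

def encode_categoricals(df, ordinal_columns, binary_columns):
    """Return a copy with ordinal and Yes/No columns mapped to numbers.

    Missing or unexpected labels become NaN so they can be handled separately.
    """
    encoded = df.copy()
    for column in ordinal_columns:
        encoded[column] = encoded[column].map(ORDINAL_LEVELS)
    for column in binary_columns:
        encoded[column] = encoded[column].map(BINARY_LEVELS)
    return encoded

# Toy rows for illustration only (not taken from heart_disease.csv).
sample = pd.DataFrame({
    "Exercise Habits": ["High", "Low", "Medium"],
    "Stress Level": ["Medium", "High", None],
    "Smoking": ["Yes", "No", "No"],
})
encoded = encode_categoricals(
    sample,
    ordinal_columns=["Exercise Habits", "Stress Level"],
    binary_columns=["Smoking"],
)
print(encoded)
```

An explicit mapping keeps the `Low < Medium < High` order, which `LabelEncoder` would not guarantee because it sorts labels alphabetically.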

In [43]:
data.isna().sum()

Age                       29
Gender                    19
Blood Pressure            19
Cholesterol Level         30
Exercise Habits           25
Smoking                   25
Family Heart Disease      21
Diabetes                  30
BMI                       22
High Blood Pressure       26
Low HDL Cholesterol       25
High LDL Cholesterol      26
Alcohol Consumption     2586
Stress Level              22
Sleep Hours               25
Sugar Consumption         30
Triglyceride Level        26
Fasting Blood Sugar       22
CRP Level                 26
Homocysteine Level        20
Heart Disease Status       0
dtype: int64
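
Most columns are missing well under 1% of their values, while `Alcohol Consumption` is missing 2586 of 10000 (about 26%). The block below sketches one possible way to treat the two cases differently. It is an assumption, not the notebook's chosen method: numeric columns get the median, sparsely missing categorical columns get the mode, and heavily missing categorical columns get an explicit placeholder label. The helper name, the `Unknown` label, and the threshold are hypothetical. It may also be worth checking whether the raw file stores a literal `None` for non-drinkers, since recent pandas versions treat that string as missing by default.

```python
import pandas as pd

def summarize_and_impute(df, unknown_label="Unknown", high_missing_threshold=0.10):
    """Return (share of missing values per column, imputed copy of df)."""
    missing_share = df.isna().mean()
    imputed = df.copy()
    for column in df.columns:
        if missing_share[column] == 0:
            continue
        if pd.api.types.is_numeric_dtype(df[column]):
            # Numeric column: fill with the median.
            imputed[column] = df[column].fillna(df[column].median())
        elif missing_share[column] > high_missing_threshold:
            # Heavily missing categorical column: keep missingness as its own level.
            imputed[column] = df[column].fillna(unknown_label)
        else:
            # Sparsely missing categorical column: fill with the most frequent label.
            imputed[column] = df[column].fillna(df[column].mode().iloc[0])
    return missing_share, imputed

# Toy rows for illustration only (not taken from heart_disease.csv).
sample = pd.DataFrame({
    "Age": [56.0, None, 46.0, 32.0],
    "Alcohol Consumption": ["High", None, None, "Low"],
    "Smoking": ["Yes", "No", "No", None],
})
# A higher threshold is used here only because the toy frame has four rows.
missing_share, imputed = summarize_and_impute(sample, high_missing_threshold=0.3)
print(missing_share)
print(imputed)
```

In a real workflow the fill values would be computed on the training split only and then applied to the test split, to avoid leaking information.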

In [49]:
duplicates = data[data.duplicated()]
print("Duplicate Rows:")
duplicates.info()

Duplicate Rows:
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   0 non-null      float64
 1   Gender                0 non-null      object 
 2   Blood Pressure        0 non-null      float64
 3   Cholesterol Level     0 non-null      float64
 4   Exercise Habits       0 non-null      object 
 5   Smoking               0 non-null      object 
 6   Family Heart Disease  0 non-null      object 
 7   Diabetes              0 non-null      object 
 8   BMI                   0 non-null      float64
 9   High Blood Pressure   0 non-null      object 
 10  Low HDL Cholesterol   0 non-null      object 
 11  High LDL Cholesterol  0 non-null      object 
 12  Alcohol Consumption   0 non-null      object 
 13  Stress Level          0 non-null      object 
 14  Sleep Hours           0 non-null      float64
 15  Sugar Consumption     0 
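
The `Index: 0 entries` line shows that the dataset has no fully duplicated rows. A shorter way to get the same answer is to count the duplicates directly. The sketch below uses a toy frame with one deliberate duplicate, since the real data has none.

```python
import pandas as pd

# Toy rows for illustration only; the third row repeats the first.
sample = pd.DataFrame({
    "Age": [56.0, 69.0, 56.0],
    "Gender": ["Male", "Female", "Male"],
})

# duplicated() marks every repeat of an earlier row, so the sum is the duplicate count.
duplicate_count = int(sample.duplicated().sum())
print(f"Duplicate rows: {duplicate_count}")

# If duplicates were present, they could be dropped while keeping the first occurrence.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```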

In [44]:
data.describe()

| | Age | Blood Pressure | Cholesterol Level | BMI | Sleep Hours | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level |
|---|---|---|---|---|---|---|---|---|---|
| count | 9971.0 | 9981.0 | 9970.0 | 9978.0 | 9975.0 | 9974.0 | 9978.0 | 9974.0 | 9980.0 |
| mean | 49.296259 | 149.75774 | 225.425577 | 29.077269 | 6.991329 | 250.734409 | 120.142213 | 7.472201 | 12.456271 |
| std | 18.19397 | 17.572969 | 43.575809 | 6.307098 | 1.753195 | 87.067226 | 23.584011 | 4.340248 | 4.323426 |
| min | 18.0 | 120.0 | 150.0 | 18.002837 | 4.000605 | 100.0 | 80.0 | 0.003647 | 5.000236 |
| 25% | 34.0 | 134.0 | 187.0 | 23.658075 | 5.449866 | 176.0 | 99.0 | 3.674126 | 8.723334 |
| 50% | 49.0 | 150.0 | 226.0 | 29.079492 | 7.003252 | 250.0 | 120.0 | 7.472164 | 12.409395 |
| 75% | 65.0 | 165.0 | 263.0 | 34.520015 | 8.531577 | 326.0 | 141.0 | 11.255592 | 16.140564 |
| max | 80.0 | 180.0 | 300.0 | 39.996954 | 9.999952 | 400.0 | 160.0 | 14.997087 | 19.999037 |


In [45]:
data.columns

Index(['Age', 'Gender', 'Blood Pressure', 'Cholesterol Level',
       'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes', 'BMI',
       'High Blood Pressure', 'Low HDL Cholesterol', 'High LDL Cholesterol',
       'Alcohol Consumption', 'Stress Level', 'Sleep Hours',
       'Sugar Consumption', 'Triglyceride Level', 'Fasting Blood Sugar',
       'CRP Level', 'Homocysteine Level', 'Heart Disease Status'],
      dtype='object')