In [16]:
import time

# print(f'Notebook First Created: {time.asctime()}')  # commented out; don't run this code again

Notebook Last Updated: Sat Jan 20 11:48:22 2024


In [20]:
import time

print(f'Notebook Last Updated: {time.asctime()}')

Notebook Last Updated: Sat Jan 20 12:49:39 2024


# Heart Attack Prediction AI 🤖


**Introduction:**

- **Objective 🎯:** Develop an AI system for early heart attack detection, a crucial component of our larger Heart Attack Detection and Assistance System.

- **Technology 💻:** Leveraging scikit-learn (sklearn) for machine learning model development.

- **Process 🔄:**
  1. **Data Collection 📊:** Gather relevant data sources for training and testing.
  
  2. **Data Cleaning 🧹:** Prepare the dataset by handling missing values, outliers, and ensuring data quality.

  3. **Exploratory Data Analysis (EDA) 📊:** Gain insights into the cleaned dataset, understanding its nuances and patterns.

  4. **Feature Engineering 🛠️:** Create new informative features or transform existing ones to enhance model performance.

  5. **Model Building 🛠️:** Utilize various machine learning algorithms to build predictive models.

  6. **Evaluation 📈:** Rigorously assess model performance using diverse metrics.

  7. **Model Selection 🎉:** Identify the most effective model for heart attack detection.

  8. **Hyperparameter Tuning ⚙️:** Refine the chosen model for optimal performance.

  9. **Model-Driven EDA 🧠:** Utilize model insights to enhance understanding of the dataset.

  10. **Deployment 🚀:** Implement the trained model into a production environment for real-world application.

- **Integration 🔗:** Seamlessly integrate the developed AI system with the main project using the Hugging Face API.

In [3]:
# data manipulation
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# metrics and model information goes here 



## Exploratory Data Analysis 


**EDA (Exploratory Data Analysis) Checklist:**

1. **Initial Data Exploration:**
   - Find head, tail, and a sample of the dataset for a quick overview. **Complete ✅**

2. **Data Cleaning:**
   - Remove NaN (Not a Number) values.
   - Eliminate duplicate data for improved accuracy. **Complete ✅**

3. **Outlier Detection:**
   - Utilize box plots for each parameter to identify and visualize outliers. **[Pending]**

4. **Correlation Analysis:**
   - Construct a correlation matrix to understand relationships between different variables. **[Pending]**

5. **Dataset Characteristics:**
   - Determine the size of the dataset.
   - Explore the shape of the dataset (rows, columns). **Complete ✅**

6. **Assumptions and Testing:**
   - Formulate assumptions about the model and its parameters.
   - Test assumptions through visualizations and statistical methods. **[Pending]**

7. **Normalization:**
   - Normalize data if necessary for better model performance. **[Pending]**

8. **Additional Analysis (if needed):**
   - Perform statistical tests on relevant parameters. **[Pending]**

9. **Documentation:**
   - Document findings, observations, and any decisions made during the EDA process.
   - Record insights that may influence model building and feature selection. **[Pending]**

10. **Plotting:**
    - Generate visual representations (histograms, scatter plots, etc.) to aid in understanding the distribution of data.
    - Plot each parameter against others to explore relationships.
    - Plot each parameter against the target variable for insights. **[Partial 🧩]**

11. **Review and Iterate:**
    - Review the EDA results and iterate if needed based on the insights gained. **[Pending]**


   
*adapted from: Warmbein, Karen. “An EDA Checklist - DataSeries - Medium.” Medium, 14 Dec. 2021, medium.com/dataseries/an-eda-checklist-800beeaee555.*

In [5]:
df = pd.read_csv('dataset/heart.csv')

In [6]:
df.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [7]:
df.tail()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


In [8]:
# the data has no missing values (every column reports 303 non-null entries), so NaN removal is not necessary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
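
The checklist also lists duplicate removal, which `df.info()` does not report. A minimal sketch of both checks follows. It assumes a small stand-in frame built from the five `df.head()` rows shown above, plus one deliberately repeated row (hypothetical, added only so there is something to drop). In the notebook itself the same calls would run on `df`.

```python
import pandas as pd

columns = ["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
           "thalachh", "exng", "oldpeak", "slp", "caa", "thall", "output"]

# Rows copied from the df.head() output above.
head_rows = [
    [63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
    [37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1],
    [41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1],
    [56, 1, 1, 120, 236, 0, 1, 178, 0, 0.8, 2, 0, 2, 1],
    [57, 0, 0, 120, 354, 0, 1, 163, 1, 0.6, 2, 0, 2, 1],
]

# Hypothetical: repeat the first row so the duplicate check has a hit.
sample = pd.DataFrame(head_rows + [head_rows[0]], columns=columns)

# Missing values per column (all zeros here, matching df.info()).
missing_per_column = sample.isna().sum()

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(f"missing values: {int(missing_per_column.sum())}")
print(f"duplicate rows: {n_duplicates}")
print(f"rows after dropping duplicates: {len(deduplicated)}")
```

Running `df.duplicated().sum()` on the full dataset would show whether any of the 303 rows actually repeat; the stand-in frame says nothing about that.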


In [9]:
# there are a lot more men than women, and a lot more middle-aged patients than younger or older ones (let's see if this holds true at the time of plotting)

df.describe()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0
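
The summary table already supports some quick arithmetic before any plotting. The sketch below uses only numbers copied from the `df.describe()` output above. It turns the means of the 0/1 columns into row counts and applies the 1.5 × IQR rule (the same rule a default box plot uses for its whiskers) to the quartiles. It assumes `sex == 1` encodes male, which matches the comment in the cell above but is not stated in the notebook.

```python
# Values copied from the df.describe() output above.
n_rows = 303
sex_mean = 0.683168      # mean of a 0/1 column = share of rows equal to 1
output_mean = 0.544554

# Assumption: sex == 1 means male (the notebook does not state the encoding).
n_sex_one = round(sex_mean * n_rows)
n_output_one = round(output_mean * n_rows)
print(f"rows with sex == 1: {n_sex_one} of {n_rows}")
print(f"rows with output == 1: {n_output_one} of {n_rows}")

# (min, 25%, 75%, max) for the continuous columns.
quartiles = {
    "age": (29.0, 47.5, 61.0, 77.0),
    "trtbps": (94.0, 120.0, 140.0, 200.0),
    "chol": (126.0, 211.0, 274.5, 564.0),
    "thalachh": (71.0, 133.5, 166.0, 202.0),
    "oldpeak": (0.0, 0.0, 1.6, 6.2),
}


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# For each column: is the minimum below the lower fence,
# and is the maximum above the upper fence?
flags = {}
for col, (col_min, q1, q3, col_max) in quartiles.items():
    low, high = tukey_fences(q1, q3)
    flags[col] = (col_min < low, col_max > high)
    print(f"{col}: fences=({low:.2f}, {high:.2f}), "
          f"low outlier={flags[col][0]}, high outlier={flags[col][1]}")
```

A `True` flag only shows that at least one value lies beyond the fence; counting the outliers still requires the raw column or the pending box plots.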


In [10]:
df.sample(10)

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
152,64,1,3,170,227,0,0,155,0,0.6,1,0,3,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
35,46,0,2,142,177,0,0,160,1,1.4,0,0,2,1
41,48,1,1,130,245,0,0,180,0,0.2,1,0,2,1
53,44,0,2,108,141,0,1,175,0,0.6,1,0,2,1
39,65,0,2,160,360,0,0,151,0,0.8,2,0,2,1
212,39,1,0,118,219,0,1,140,0,1.2,1,0,3,0
105,68,0,2,120,211,0,0,115,0,1.5,1,0,2,1
129,74,0,1,120,269,0,0,121,1,0.2,2,1,2,1
96,62,0,0,140,394,0,0,157,0,1.2,1,0,2,1


In [11]:
# checking the shape of the dataset
# dataset has 303 samples with 14 parameters each
df.shape


(303, 14)
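
With the shape confirmed, the pending correlation analysis from the checklist could be sketched as below. This is an illustration under stated assumptions, not a result: it runs on the ten rows shown in the `df.head()` and `df.tail()` outputs above, so the coefficients are not representative of the full dataset. It also assumes `output` is the target column. In the notebook, `df.corr()` would replace `sample.corr()`.

```python
import pandas as pd

columns = ["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
           "thalachh", "exng", "oldpeak", "slp", "caa", "thall", "output"]

# Rows copied from the df.head() and df.tail() outputs above.
rows = [
    [63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
    [37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1],
    [41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1],
    [56, 1, 1, 120, 236, 0, 1, 178, 0, 0.8, 2, 0, 2, 1],
    [57, 0, 0, 120, 354, 0, 1, 163, 1, 0.6, 2, 0, 2, 1],
    [57, 0, 0, 140, 241, 0, 1, 123, 1, 0.2, 1, 0, 3, 0],
    [45, 1, 3, 110, 264, 0, 1, 132, 0, 1.2, 1, 0, 3, 0],
    [68, 1, 0, 144, 193, 1, 1, 141, 0, 3.4, 1, 2, 3, 0],
    [57, 1, 0, 130, 131, 0, 1, 115, 1, 1.2, 1, 1, 3, 0],
    [57, 0, 1, 130, 236, 0, 0, 174, 0, 0.0, 1, 1, 2, 0],
]
sample = pd.DataFrame(rows, columns=columns)

# Pairwise Pearson correlation between all columns.
corr = sample.corr()

# Correlation of each feature with the assumed target column,
# ordered by absolute strength (strongest first).
target_corr = corr["output"].drop("output")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.loc[order]

print(target_corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would give the visual version of the same matrix, using the seaborn import from the top of the notebook.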