# Heart Disease Prediction

## 1. Introduction
This notebook predicts the presence of heart disease in patients from a set of medical attributes, using the Heart Disease UCI dataset. It is a binary classification task, and the workflow covers data loading, preprocessing, exploratory data analysis, model training, and evaluation.

## 2. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('heart.csv')

# Display the first few rows
df.head()

In [None]:
# Get a summary of the dataframe
df.info()

The dataset is loaded correctly, and there are no immediate signs of missing values from the `.info()` output. The column names are a bit cryptic, so we will rename them for better readability.

## 3. Data Cleaning and Preprocessing

In [None]:
# Rename columns for readability (positional: assumes exactly these 14 columns, in this order)
df.columns = [
    'age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar',
    'rest_ecg', 'max_heart_rate_achieved', 'exercise_induced_angina', 'st_depression',
    'st_slope', 'num_major_vessels', 'thalassemia', 'target',
]

# Check for missing values
df.isnull().sum()

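The check reports no missing values in any column, so no imputation is needed. Note that `isnull()` only detects NaN entries; it would not flag placeholder codes stored as ordinary numbers.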
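
A null count is only one part of a cleanliness check. The sketch below shows a slightly broader report covering missing values, distinct values per column, and duplicate rows. The helper `quality_report` and the `demo` frame are hypothetical stand-ins, not part of the original notebook; in practice you would pass `df` instead.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column null count, number of distinct values, and dtype."""
    return pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
        "dtype": frame.dtypes.astype(str),
    })


# Hypothetical stand-in rows (made up for illustration, not real patient data).
demo = pd.DataFrame({
    "age": [63, 37, 41, 63],
    "sex": [1, 1, 0, 1],
    "cholesterol": [233.0, 250.0, None, 233.0],
    "target": [1, 1, 1, 1],
})

report = quality_report(demo)
n_duplicate_rows = int(demo.duplicated().sum())

print(report)
print("duplicate rows:", n_duplicate_rows)
```

A column with very few distinct values is usually an encoded category, which is worth knowing before interpreting correlations.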

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Target variable distribution
sns.countplot(x='target', data=df)
plt.title('Distribution of Heart Disease')
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(14, 12))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

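In the correlation matrix, `chest_pain_type`, `max_heart_rate_achieved`, and `st_slope` correlate positively with `target`, while `exercise_induced_angina`, `st_depression`, `num_major_vessels`, and `thalassemia` correlate negatively. Several of these features are integer-coded categories, so their Pearson coefficients are best read as a rough guide to direction rather than a precise measure of effect.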
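
Reading signs off a 14x14 heatmap is error-prone, so it can help to pull out the `target` column of the correlation matrix and sort it. The following is a minimal sketch on synthetic data: the helper `rank_by_target_correlation` and the generated columns are illustrative assumptions, and the numbers carry no medical meaning.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(frame: pd.DataFrame, target: str = "target") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in data; in the notebook, pass `df` instead.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "max_heart_rate_achieved": 140 + 20 * target + rng.normal(0, 10, size=n),
    "st_depression": 1.5 - 1.0 * target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(size=n),
    "target": target,
})

ranked = rank_by_target_correlation(demo)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.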

## 5. Model Building and Training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define features and target
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
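
A fitted random forest exposes impurity-based feature importances through its `feature_importances_` attribute. The sketch below shows how they could be ranked. It uses synthetic data from `make_classification` with hypothetical column names so that it runs on its own; with the notebook's objects you would use `model` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (column names are placeholders).
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so they are a starting point for inspection rather than a definitive ranking.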

## 6. Model Evaluation

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('\nClassification Report:')
print(classification_report(y_test, y_pred))
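print('\nConfusion Matrix:')
# Rows are actual classes, columns are predicted classes
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()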
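
To make the link between the confusion matrix and the classification report concrete, the sketch below derives precision, recall, and F1 for the positive class directly from the four cells of a binary confusion matrix. The labels are a small hand-made example, not output from the model above.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels for illustration (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
y_hat = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, accuracy={accuracy:.2f}")
```

In a screening setting, false negatives (`fn`) are often the costlier error, which is why recall on the positive class deserves attention alongside overall accuracy.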

## 7. Conclusion
The Random Forest classifier achieved high accuracy on the held-out 20% test set (see the score printed in the evaluation cell). The classification report breaks performance down into precision, recall, and F1-score for each class, and the confusion matrix shows the counts of correct and incorrect predictions. A model like this could serve as a tool to assist in diagnosing heart disease.
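
The accuracy reported in this notebook comes from a single 80/20 split, so it depends on which rows happened to land in the test set. One common way to get a steadier estimate is stratified k-fold cross-validation. The sketch below is an illustration under assumptions rather than part of the original analysis: it uses synthetic data from `make_classification` so that it runs on its own, and with the notebook's data you would pass `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with 13 features; not the heart disease data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=42
)

# Stratification keeps the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"fold accuracies: {scores.round(2)}")
print(f"mean = {scores.mean():.2f}, std = {scores.std():.2f}")
```

The spread across folds gives a rough sense of how much the single-split accuracy could vary.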