# Heart Attack Analysis & Prediction: Exploring Factors and Building a Predictive Model
The project analyzes heart attack data, applies machine learning techniques, and aims to predict the likelihood of heart attacks.

# Contents  
1. [Extraction](#1)     
2. [Exploratory Data Analysis (EDA)](#2) 
3. [Transformation & Analysis](#3) 
    1. [Age Category Column (Optional)](#3.1) 
    2. [Risk Group Category Column (Optional)](#3.2) 
    3. [Multi-Risk Factors Category Column (Optional)](#3.3) 
4. [Data Visualization](#4)
5. [Prediction](#5)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

<a id="1"></a>
# 1. Extraction
Use **pandas** to load the data from a CSV file.

In [None]:
df = pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

---

<a id="2"></a>
# 2. Exploratory Data Analysis (EDA)
Perform initial data exploration to understand the structure and characteristics of the dataset.

In [None]:
# Check the dimensions of the dataset
df.shape

In [None]:
# Display a summary of the dataset
df.info()

In [None]:
# Display the first few rows of the dataset
df.head()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Generate descriptive statistics of the dataset
df.describe()
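
Two further checks are often useful at this stage: duplicated rows and the balance of the target classes. The sketch below is only an illustration on a small made-up frame; `toy_df` and its values are hypothetical and not taken from heart.csv, while the column name `output` matches the dataset's target.

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv (values are made up)
toy_df = pd.DataFrame({
    "age":    [63, 37, 41, 63, 57],
    "chol":   [233, 250, 204, 233, 354],
    "output": [1, 1, 0, 1, 0],
})

# Count rows that are exact copies of an earlier row
n_duplicates = toy_df.duplicated().sum()
print("Duplicate rows:", n_duplicates)

# Share of each target class
class_share = toy_df["output"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same two calls can be applied to `df` directly.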

---

<a id="3"></a>
# 3. Transformation & Analysis
This section adds three derived columns: 'age_category', 'risk_group', and 'heart_disease_risk'. They simplify comparisons across age groups, stratify individuals by cholesterol level, and flag individuals who combine high cholesterol with high blood pressure, one example of how multiple factors together may relate to heart attacks.

<a id="3.1"></a>
## 3.1. Age Category Column (Optional)

The 'age_category' column simplifies the analysis by splitting individuals into two age groups: 'Senior' (older than 60) and 'Adult' (60 or younger). This makes comparisons easier and helps reveal age-related patterns.

In [None]:
# Create a new column 'age_category' based on age
df['age_category'] = df['age'].apply(lambda x: 'Senior' if x > 60 else 'Adult')

# Show the number of rows in each 'age_category'
age_category_counts = df['age_category'].value_counts()
print(age_category_counts)


There are **224 individuals** categorized as **'Adult'** and **79 individuals** categorized as **'Senior'**.
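
One way to check for an age-related pattern is to compare the share of `output` = 1 rows in each age category. The following is a minimal sketch on made-up data; `toy_df` and its values are hypothetical, and only the 'Senior'/'Adult' rule is taken from the cell above.

```python
import pandas as pd

# Hypothetical toy data (values are made up, not from heart.csv)
toy_df = pd.DataFrame({
    "age":    [45, 52, 61, 67, 58, 70],
    "output": [1, 0, 0, 0, 1, 1],
})

# Same rule as above: older than 60 -> 'Senior', otherwise 'Adult'
toy_df["age_category"] = toy_df["age"].apply(lambda x: "Senior" if x > 60 else "Adult")

# The mean of a 0/1 target is the share of rows with output == 1 in each group
rate_by_age = toy_df.groupby("age_category")["output"].mean()
print(rate_by_age)
```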

In [None]:
age_category_counts.plot(kind='pie', autopct='%1.1f%%')
plt.ylabel('')  # Remove y-axis label
plt.title('Distribution of Age Categories')

plt.show()


<a id="3.2"></a>
## 3.2. Risk Group Category Column (Optional)

The **'risk_group' column** stratifies individuals by **cholesterol level** (`chol`): **'Low'** for values below 200, **'Moderate'** for values from 200 up to (but not including) 240, and **'High'** for values of 240 and above. Grouping the values this way makes cholesterol-related risk easier to compare across individuals and to relate to health outcomes.

In [None]:
# Create a new column 'risk_group' based on cholesterol levels
conditions = [
    (df['chol'] < 200),
    (df['chol'] >= 200) & (df['chol'] < 240),
    (df['chol'] >= 240)
]
choices = ['Low', 'Moderate', 'High']
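# Pass a string default: the implicit default (0) is an integer, which recent
# NumPy versions refuse to combine with string choices
df['risk_group'] = np.select(conditions, choices, default='Unknown')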

risk_group_counts = df['risk_group'].value_counts()
print(risk_group_counts)

In [None]:
risk_groups = risk_group_counts.index
counts = risk_group_counts.values

colors = ['darkred', 'lightcoral', 'indianred']

plt.bar(risk_groups, counts, color=colors)
plt.xlabel('Risk Group')
plt.ylabel('Count')
plt.title('Distribution of Risk Groups')

plt.show()


There are **155 individuals** in the **'High' risk group**, **98 individuals** in the **'Moderate' risk group**, and **50 individuals** in the **'Low' risk group**.
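
The same three bands can also be expressed with `pd.cut`, shown here as an alternative sketch rather than the method used in this notebook. The readings in `chol` are made up for illustration; `right=False` makes each bin closed on the left, which matches the `<` and `>=` conditions above.

```python
import numpy as np
import pandas as pd

# Hypothetical cholesterol readings (made up for illustration)
chol = pd.Series([180, 199, 200, 239, 240, 310])

# right=False gives the bins [-inf, 200), [200, 240), [240, inf)
risk_group = pd.cut(
    chol,
    bins=[-np.inf, 200, 240, np.inf],
    labels=["Low", "Moderate", "High"],
    right=False,
)
print(risk_group.tolist())
```

The result is an ordered categorical, so sorting follows Low < Moderate < High rather than alphabetical order.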

<a id="3.3"></a>
## 3.3. Multi-Risk Factors Category Column (Optional)

**High cholesterol** and **high blood pressure** are significant **risk factors for heart attacks**. They contribute to the formation of artery-clogging plaque, leading to atherosclerosis and an increased likelihood of a heart attack. However, it's essential to recognize that heart attacks are influenced by multiple factors, and **high cholesterol and high blood pressure are not the only determinants**.

In [None]:
# Flag rows that combine high cholesterol (> 200) with high resting blood pressure (> 140)
high_risk = (df['chol'] > 200) & (df['trtbps'] > 140)
df['heart_disease_risk'] = np.where(high_risk, 'High', 'Low')

# Show the number of rows in each 'heart_disease_risk' category
heart_disease_risk_counts = df['heart_disease_risk'].value_counts()
print(heart_disease_risk_counts)

In [None]:
plt.bar(heart_disease_risk_counts.index, heart_disease_risk_counts.values)
plt.xlabel('Heart Disease Risk')
plt.ylabel('Count')
plt.title('Distribution of Heart Disease Risk')
plt.show()


There are **247 individuals** categorized as having a **low risk** of heart disease based on the conditions of **cholesterol levels and blood pressure**, while **56 individuals** are categorized as having a **high risk**.

In [None]:
correlation = df['heart_disease_risk'].astype('category').cat.codes.corr(df['output'])
print("Correlation between heart_disease_risk and output:", correlation)

The correlation between the 'heart_disease_risk' column and the 'output' column across the 303 rows of the dataset is 0.128, which indicates only a weak relationship. Note that `cat.codes` numbers the categories alphabetically ('High' = 0, 'Low' = 1), so the positive sign means the 'Low' group is slightly more associated with `output` = 1, not the 'High' group.

However, it's important to interpret this correlation with caution, considering the relatively small dataset size. The correlation estimates in smaller datasets can be more susceptible to random variation and may not fully capture the true underlying relationships.
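
Because the sign of this correlation depends on how the categories are coded, it can help to map them explicitly. The sketch below uses made-up data (`toy_df` is hypothetical) to show that alphabetical codes and an explicit 'Low' = 0, 'High' = 1 mapping give the same magnitude with opposite signs.

```python
import pandas as pd

# Hypothetical toy data (made up for illustration)
toy_df = pd.DataFrame({
    "heart_disease_risk": ["High", "Low", "Low", "High", "Low", "Low"],
    "output":             [0, 1, 1, 0, 0, 1],
})

# Alphabetical category codes: 'High' -> 0, 'Low' -> 1
alpha_codes = toy_df["heart_disease_risk"].astype("category").cat.codes

# Explicit mapping so that a larger code means a higher assigned risk
risk_codes = toy_df["heart_disease_risk"].map({"Low": 0, "High": 1})

corr_alpha = alpha_codes.corr(toy_df["output"])
corr_explicit = risk_codes.corr(toy_df["output"])
print(corr_alpha, corr_explicit)
```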

---

<a id="4"></a>
# 4. Data Visualization
Use seaborn and matplotlib to draw a correlation heatmap that shows how strongly each pair of numeric columns in the dataset is related.

In [None]:
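# Calculate the correlation matrix (numeric columns only; the derived
# category columns hold strings and cannot be correlated directly)
correlation_matrix = df.corr(numeric_only=True)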

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()
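
A heatmap can be hard to read when the main interest is the target column. As a complementary sketch, the numeric features can be ranked by the absolute value of their correlation with `output`. The frame `toy_df` and its values are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data (made up for illustration)
toy_df = pd.DataFrame({
    "age":          [63, 37, 41, 56, 57, 67],
    "thalachh":     [150, 187, 172, 178, 163, 108],
    "age_category": ["Senior", "Adult", "Adult", "Adult", "Adult", "Senior"],
    "output":       [1, 1, 1, 1, 0, 0],
})

# Keep numeric columns only; string columns are dropped
numeric_df = toy_df.select_dtypes(include="number")

# Correlation of every numeric feature with the target, strongest first
target_corr = (
    numeric_df.corr()["output"]
    .drop("output")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```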


<a id="5"></a>
# 5. Prediction
A Random Forest classifier model is created and trained using the selected features from the dataset. The trained model is then used to predict the target variable (chance of a heart attack) for the test set. The accuracy of the model's predictions is evaluated using the test set, providing an assessment of how well the machine learning model performs in predicting heart attack outcomes.

In [None]:
# Select the columns for analysis
columns_of_interest = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output']

# Create a subset of the original DataFrame
subset_df = df[columns_of_interest]

# One-hot encode any non-numeric columns (if every selected column is
# already numeric, this leaves the data unchanged)
subset_df_encoded = pd.get_dummies(subset_df)

# Split the data into features (X) and the target variable (y)
X = subset_df_encoded.drop('output', axis=1)
y = subset_df_encoded['output']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model (no random_state is set, so results can
# differ slightly between runs)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

An accuracy of 0.8524 means the Random Forest classifier predicted the correct class for about 85% of the records in the test set. This suggests that the model captures useful patterns between the input features and the target variable, although the estimate comes from a single split of a small dataset, and the exact figure can change between runs because no `random_state` is set for the model.
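
A single train/test split on about 300 rows gives a noisy accuracy estimate, so a cross-validated score is a common complement. The following is a minimal sketch, not the evaluation used above: it runs on synthetic data from `make_classification` (a stand-in with the same number of rows and features as the heart data), and the names `X_demo`, `y_demo`, and `model_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart data: 303 rows, 13 features (random values)
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=42
)

# Fixing random_state makes the forest itself reproducible
model_demo = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: each row is used for testing exactly once
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

With the real data, `X` and `y` from the cell above would take the place of `X_demo` and `y_demo`; the spread across folds gives a sense of how much the reported accuracy depends on the particular split.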