# HEART DISEASE ANALYSIS

### Made By - [Pratyush Puri](https://www.pratyushpuri.space)
### [LinkedIn](https://www.linkedin.com/in/pratyushpuri/) ---   [GitHub](https://github.com/PratyushPuri)


<br><br><br><br><br>

## DATA LOADING AND INITIAL EXPLORATION

In [None]:
#importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as w

#filtering warnings (if any)
w.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [None]:
# loading the dataset and sample checking
df = pd.read_csv('/kaggle/input/heart-disease-dataset/heart.csv')
df.sample(7)

In [None]:
# checking for number of rows and columns
print(f'Total number of records are {df.shape[0]}')
print(f'Total number of features are {df.shape[1]}')

In [None]:
# checking the datatypes of all columns present
display(df.dtypes)

In [None]:
#checking the names of the columns present
display(df.columns.to_list())

In [None]:
#describing the columns present (plotting the summary statistics as lines)
df.describe(include='all').plot(kind='line', figsize=(20,10))
plt.show()

In [None]:
display(df.describe(include='all').T)

In [None]:
#Checking for unique values
display(df.nunique())

In [None]:
# checking information about all features
# df.info() prints its report directly and returns None, so it is not wrapped in display()
df.info()

The data types of all columns look appropriate, so no `dtype` conversions are needed. We can move on to data cleaning and wrangling.

## DATA CLEANING

In [None]:
#checking for duplicates
print(f'Total Duplicated Records present are : {df.duplicated().sum()}')

In [None]:
#removing the duplicates
df.drop_duplicates(inplace=True)
print(f'Total Duplicated Records present are : {df.duplicated().sum()}')

In [None]:
#checking for null values
display(df.isnull().sum())

No null values were found, so there is nothing to impute.

## EXPLORATORY DATA ANALYSIS
<br>

### `Basic Level Questions`

### 1. What is the average age of patients in the dataset?

In [None]:
# Calculate and display the average age
average_age = df['age'].mean()
print(f"The average age of patients in the dataset is: {average_age:.2f}")

### 2. What is the gender distribution of patients?

In [None]:
# Calculate and display the gender distribution
gender_distribution = df['sex'].value_counts()
print("Gender Distribution:")
print(gender_distribution)

### 3. What is the average resting blood pressure of patients?

In [None]:
# Calculate and display the average resting blood pressure
average_trestbps = df['trestbps'].mean()
print(f"The average resting blood pressure of patients is: {average_trestbps:.2f}")

### 4. How many patients have fasting blood sugar levels higher than 120 mg/dl?

In [None]:
# Count patients with fasting blood sugar > 120 mg/dl
# The 'fbs' column is binary (0 or 1), where 1 indicates fbs > 120 mg/dl
high_fbs_count = df['fbs'].sum()
print(f"Number of patients with fasting blood sugar > 120 mg/dl: {high_fbs_count}")

### 5. What are the different types of chest pain recorded in the dataset?

In [None]:
# Get the unique values in the 'cp' column
chest_pain_types = df['cp'].unique()
print("Different types of chest pain:")
print(chest_pain_types)

### 6. What is the maximum heart rate achieved by patients?

In [None]:
# Find the maximum value in the 'thalach' column
max_thalach = df['thalach'].max()
print(f"Maximum heart rate achieved by patients: {max_thalach}")

### 7. What percentage of patients experience exercise-induced angina?

In [None]:
# Calculate the percentage of patients with exercise-induced angina
# The 'exang' column is binary (0 or 1), where 1 indicates exercise-induced angina
exang_percentage = df['exang'].mean() * 100
print(f"Percentage of patients experiencing exercise-induced angina: {exang_percentage:.2f}%")

### 8. What is the average cholesterol level in the dataset?

In [None]:
# Calculate the average cholesterol level
average_chol = df['chol'].mean()
print(f"The average cholesterol level in the dataset is: {average_chol:.2f}")

### 9. How many patients have a probable or definite left ventricular hypertrophy based on their resting electrocardiographic results?

In [None]:
# Count patients with probable or definite left ventricular hypertrophy (restecg = 2)
# Assuming 'restecg' = 2 corresponds to probable or definite left ventricular hypertrophy
lvh_count = df['restecg'].value_counts().get(2, 0)
print(f"Number of patients with probable or definite left ventricular hypertrophy: {lvh_count}")

### 10. What is the distribution of the number of major vessels colored by fluoroscopy?

In [None]:
# Get the distribution of the number of major vessels colored by fluoroscopy ('ca')
ca_distribution = df['ca'].value_counts().sort_index()
print("Distribution of the number of major vessels colored by fluoroscopy:")
print(ca_distribution)

### `Medium Level Questions`

### 1. What is the correlation between age and cholesterol levels?

In [None]:
# Calculate the correlation between age and cholesterol
correlation_age_chol = df[['age', 'chol']].corr().iloc[0, 1]
print(f"The correlation between age and cholesterol levels is: {correlation_age_chol:.2f}")
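
A correlation coefficient alone does not say whether the relationship could be due to chance. The sketch below shows one way to report Pearson's r together with its p-value using `scipy.stats.pearsonr`. The helper name `correlation_with_pvalue` and the `demo` frame are hypothetical; the demo data is synthetic and is not the heart dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def correlation_with_pvalue(frame, col_a, col_b):
    """Return Pearson's r and its two-sided p-value for two numeric columns."""
    r, p_value = pearsonr(frame[col_a], frame[col_b])
    return float(r), float(p_value)

# Synthetic stand-in data (NOT the heart dataset), used only to keep the sketch runnable
rng = np.random.default_rng(0)
demo = pd.DataFrame({'age': rng.integers(30, 78, size=200)})
demo['chol'] = 200 + 0.8 * demo['age'] + rng.normal(0, 40, size=200)

r, p_value = correlation_with_pvalue(demo, 'age', 'chol')
print(f"r = {r:.2f}, p-value = {p_value:.4f}")
```

In the notebook, the same helper could be called as `correlation_with_pvalue(df, 'age', 'chol')`.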

### 2. What is the distribution of chest pain types across different age groups?

In [None]:
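# Visualize the distribution of chest pain types across age groups
# Ages are binned into 10-year bands so the bars show groups rather than single ages
# (the bin edges are an assumed choice; adjust them to the age range in the data)
age_groups = pd.cut(df['age'], bins=[20, 30, 40, 50, 60, 70, 80])
pd.crosstab(age_groups, df['cp']).plot(kind='bar', stacked=True, figsize=(15, 7))
plt.title('Distribution of Chest Pain Types Across Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Number of Patients')
plt.legend(title='Chest Pain Type')
plt.show()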

### 3. How does maximum heart rate vary with exercise-induced angina?

In [None]:
# Compare maximum heart rates for patients with and without exercise-induced angina
thalach_by_exang = df.groupby('exang')['thalach'].mean()
print("Average maximum heart rate by exercise-induced angina (0: No, 1: Yes):")
print(thalach_by_exang)

### 4. Is there a significant difference in resting blood pressure between male and female patients?

In [None]:
# Calculate the average resting blood pressure for each gender
average_trestbps_by_sex = df.groupby('sex')['trestbps'].mean()
print("Average resting blood pressure by sex (0: Female, 1: Male):")
print(average_trestbps_by_sex)
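
Comparing group means shows the size of the difference but not whether it is statistically significant. A minimal sketch of one common approach, Welch's t-test via `scipy.stats.ttest_ind(..., equal_var=False)`, is given below. The helper `welch_ttest_by_group` and the `demo` frame are hypothetical, and the demo values are synthetic rather than taken from the heart dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def welch_ttest_by_group(frame, value_col, group_col):
    """Welch's t-test on value_col between the two levels of group_col."""
    groups = [g[value_col].to_numpy() for _, g in frame.groupby(group_col)]
    if len(groups) != 2:
        raise ValueError("group_col must have exactly two distinct values")
    # equal_var=False selects Welch's test, which does not assume equal variances
    t_stat, p_value = ttest_ind(groups[0], groups[1], equal_var=False)
    return float(t_stat), float(p_value)

# Synthetic stand-in data (NOT the heart dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'sex': rng.integers(0, 2, size=200),
    'trestbps': rng.normal(130, 17, size=200),
})

t_stat, p_value = welch_ttest_by_group(demo, 'trestbps', 'sex')
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```

In the notebook this would be `welch_ttest_by_group(df, 'trestbps', 'sex')`; a p-value below a chosen threshold (0.05 is a common convention) would suggest the difference in means is unlikely to be chance alone.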

### 5. What is the relationship between fasting blood sugar levels and the presence of heart disease?

In [None]:
# Visualize the relationship between fasting blood sugar and target
pd.crosstab(df['fbs'], df['target']).plot(kind='bar', stacked=True, figsize=(8, 5))
plt.title('Heart Disease Presence by Fasting Blood Sugar')
plt.xlabel('Fasting Blood Sugar (> 120 mg/dl: 1, <= 120 mg/dl: 0)')
plt.ylabel('Number of Patients')
plt.xticks(rotation=0)
plt.legend(title='Heart Disease (0: No, 1: Yes)')
plt.show()
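
Stacked counts can be hard to compare when the two fasting blood sugar groups differ a lot in size. As a hedged sketch, the crosstab can be row-normalized to show the share of each target class within each group, and a chi-square test of independence can be run on the raw counts. The helper `disease_rate_table` and the small `demo` frame are hypothetical; the demo values are made up for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def disease_rate_table(frame, factor_col, target_col='target'):
    """Return raw counts, row-normalized rates, and a chi-square p-value."""
    counts = pd.crosstab(frame[factor_col], frame[target_col])
    # normalize='index' makes every row sum to 1 (share of each target class per group)
    rates = pd.crosstab(frame[factor_col], frame[target_col], normalize='index')
    chi2, p_value, dof, expected = chi2_contingency(counts)
    return counts, rates, float(p_value)

# Made-up illustrative data (NOT the heart dataset)
demo = pd.DataFrame({
    'fbs':    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'target': [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})

counts, rates, p_value = disease_rate_table(demo, 'fbs')
print(rates)
print(f"chi-square p-value: {p_value:.4f}")
```

The same call with `df` and `'fbs'` would apply to the notebook's data; the helper also works for other categorical columns such as `'ca'` or `'thal'`.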

### 6. How does the number of major vessels (ca) affect the target variable (heart disease presence)?

In [None]:
# Visualize the relationship between the number of major vessels and target
pd.crosstab(df['ca'], df['target']).plot(kind='bar', stacked=True, figsize=(8, 5))
plt.title('Heart Disease Presence by Number of Major Vessels')
plt.xlabel('Number of Major Vessels (0-3) colored by fluoroscopy')
plt.ylabel('Number of Patients')
plt.xticks(rotation=0)
plt.legend(title='Heart Disease (0: No, 1: Yes)')
plt.show()

### 7. What is the average oldpeak value for patients with different types of chest pain?

In [None]:
# Find the average oldpeak values for each chest pain type
average_oldpeak_by_cp = df.groupby('cp')['oldpeak'].mean()
print("Average oldpeak value by chest pain type:")
print(average_oldpeak_by_cp)

### 8. Analyze the distribution of thalassemia types (thal) among patients with heart disease.

In [None]:
# Visualize the distribution of thalassemia types among patients with heart disease
pd.crosstab(df['thal'], df['target']).plot(kind='bar', stacked=True, figsize=(8, 5))
plt.title('Heart Disease Presence by Thalassemia Type')
plt.xlabel('Thalassemia Type')
plt.ylabel('Number of Patients')
plt.xticks(rotation=0)
plt.legend(title='Heart Disease (0: No, 1: Yes)')
plt.show()

### 9. What are the most common combinations of risk factors in patients with heart disease?

In [None]:
# Find the most common combinations of risk factors in patients with heart disease
# Focusing on 'cp', 'fbs', 'exang', 'thal' as suggested
risk_factor_combinations = df[df['target'] == 1].groupby(['cp', 'fbs', 'exang', 'thal']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)
print("Most common combinations of risk factors in patients with heart disease:")
display(risk_factor_combinations.head()) # Display the top combinations

### 10. Perform a pairwise comparison of clinical measurements for patients with and without heart disease.

In [None]:
# Get descriptive statistics for patients with and without heart disease
heart_disease_present = df[df['target'] == 1].describe().T
heart_disease_absent = df[df['target'] == 0].describe().T

print("Descriptive statistics for patients with heart disease:")
display(heart_disease_present)

print("\nDescriptive statistics for patients without heart disease:")
display(heart_disease_absent)

# Optional: Visualize pairwise comparisons for key features (similar to original suggestion but for target groups)
# sns.pairplot(df, hue='target', vars=['age', 'chol', 'trestbps', 'thalach', 'oldpeak'])
# plt.suptitle('Pairwise Comparison of Clinical Measurements by Heart Disease Presence', y=1.02)
# plt.show()
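
Two separate `describe()` tables are awkward to compare by eye. One possible alternative, sketched below, puts the group means side by side and adds their difference. The helper `compare_group_means` is hypothetical, it assumes the target column is coded 0/1, and the `demo` values are made up for illustration.

```python
import pandas as pd

def compare_group_means(frame, columns, target_col='target'):
    """Side-by-side means per target class, plus the difference (class 1 minus class 0)."""
    summary = frame.groupby(target_col)[columns].mean().T
    # Assumes target_col takes exactly the integer values 0 and 1
    summary['difference'] = summary[1] - summary[0]
    return summary

# Made-up illustrative data (NOT the heart dataset)
demo = pd.DataFrame({
    'age':    [50, 60, 40, 50],
    'chol':   [200, 220, 240, 260],
    'target': [0, 0, 1, 1],
})

summary = compare_group_means(demo, ['age', 'chol'])
print(summary)
```

In the notebook, `compare_group_means(df, ['age', 'chol', 'trestbps', 'thalach', 'oldpeak'])` would cover the measurements listed in the commented-out pairplot.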

### `Advanced-Level Questions`

### 1. What is the effect of combining multiple risk factors (age, cholesterol, blood pressure) on the likelihood of heart disease?

In [None]:
# Visualize the interactions between age, cholesterol, and blood pressure, colored by the target variable
relevant_columns = df[['age', 'chol', 'trestbps', 'target']]
sns.pairplot(relevant_columns, hue='target')
plt.suptitle('Pairwise Comparison of Age, Cholesterol, and Blood Pressure by Heart Disease Presence', y=1.02)
plt.show()

### 2. Which clinical measurement has the strongest correlation with heart disease presence?

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Correlation of each feature with 'target', ranked by absolute strength
# (the trivial self-correlation of 'target' with itself is dropped)
target_correlation = correlation_matrix['target'].drop('target').sort_values(key=abs, ascending=False)

print("Correlation with Heart Disease Presence:")
display(target_correlation)

## OUTLIER ANALYSIS
### Outlier Detection

In [None]:
# all numerical columns
numerical_cols = [col for col in df.columns if df[col].dtype in ['int64', 'float64']]
cols_per_row = 3
rows_needed = (len(numerical_cols) + cols_per_row - 1) // cols_per_row

# Subplot
fig, axes = plt.subplots(nrows=rows_needed, ncols=cols_per_row, figsize=(15, rows_needed * 4))
axes = axes.flatten()

# boxplot for each column
for i, col in enumerate(numerical_cols):
    sns.boxplot(x=df[col], ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}')
    axes[i].grid(False)

# hide any unused axes when there are fewer charts than grid cells
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

Here, we winsorize (cap) the outliers at the IQR limits.

### Outlier Handling

In [None]:
# only numerical data
numerical_cols = [col for col in df.columns if df[col].dtype in ['int64', 'float64']]

# capping of each column
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR

    # Winsorization: clip values to lower/upper limit
    df[col] = np.where(df[col] < lower_limit, lower_limit, df[col])
    df[col] = np.where(df[col] > upper_limit, upper_limit, df[col])

print("Outliers handled with Winsorization using IQR:")
print(df.head())
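
One caveat with the IQR rule: on a binary column where more than 75% of the rows share one value, Q1 and Q3 coincide, the IQR is 0, and every minority value gets capped to the majority value. A hedged variant is to cap only the continuous measurements. The sketch below assumes a hypothetical helper `winsorize_iqr`; treating `age`, `trestbps`, `chol`, `thalach`, and `oldpeak` as the continuous columns is an assumption, not something the dataset states. The `demo` values are made up.

```python
import pandas as pd

def winsorize_iqr(frame, columns, k=1.5):
    """Return a copy of frame with the given columns clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    capped = frame.copy()
    for col in columns:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped

# Made-up illustrative data (NOT the heart dataset)
demo = pd.DataFrame({
    'chol': [200, 210, 220, 230, 240, 600],  # 600 is an obvious outlier
    'fbs':  [0, 0, 0, 0, 0, 1],              # binary flag, mostly 0
})

capped = winsorize_iqr(demo, ['chol'])       # cap the continuous column only
print(capped)

# Applying the same rule to the binary flag erases its only positive case
print(winsorize_iqr(demo, ['fbs'])['fbs'].sum())
```

In the notebook, this could be applied as `winsorize_iqr(df, ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'])`, leaving the categorical codes and `target` untouched.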

In [None]:
# all numerical columns (re-plotted after winsorization to check the result)
numerical_cols = [col for col in df.columns if df[col].dtype in ['int64', 'float64']]
cols_per_row = 3
rows_needed = (len(numerical_cols) + cols_per_row - 1) // cols_per_row

# Subplot
fig, axes = plt.subplots(nrows=rows_needed, ncols=cols_per_row, figsize=(15, rows_needed * 4))
axes = axes.flatten()

# boxplot for each column
for i, col in enumerate(numerical_cols):
    sns.boxplot(x=df[col], ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}')
    axes[i].grid(False)

# hide any unused axes when there are fewer charts than grid cells
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

## MODEL TRAINING AND FURTHER ANALYSIS

### 3. Perform a logistic regression analysis to predict the presence of heart disease using all available features.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Separate features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000) # Increased max_iter for convergence
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
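
Logistic regression with the default regularization is sensitive to feature scale, and a plain random split can leave the two classes unevenly represented in the test set. A minimal sketch of one possible variant is shown below: a `StandardScaler` plus `LogisticRegression` pipeline with a stratified split. It runs on synthetic data from `make_classification` (an assumption made to keep the example self-contained), so its accuracy says nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (NOT the heart dataset): 300 rows, 13 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# stratify=y_demo keeps the class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, inside the pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Accuracy on the synthetic test set: {accuracy:.2f}")
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would apply the same pipeline to the heart data.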

### 4. How do the values of the slope of the peak exercise ST segment (slope) vary with different chest pain types?

In [None]:
# Visualize the variation of slope with different chest pain types
pd.crosstab(df['cp'], df['slope']).plot(kind='bar', stacked=True, figsize=(8, 5))
plt.title('Slope of Peak Exercise ST Segment by Chest Pain Type')
plt.xlabel('Chest Pain Type')
plt.ylabel('Number of Patients')
plt.xticks(rotation=0)
plt.legend(title='Slope')
plt.show()

### 5. Analyze the survival rates of patients with different thalassemia types over a period.

**Note:** This dataset does not contain information about survival rates over time. Therefore, a direct analysis of survival rates with thalassemia types over a period is not possible with the current data.

However, we can analyze the distribution of thalassemia types among patients with and without heart disease to understand the relationship between thalassemia and heart disease presence. (This was already done in Medium Level Question 8).

## CONCLUSION

This exploratory data analysis and initial modeling of the heart disease dataset revealed several key insights:

**Data Loading and Initial Exploration:**
- The dataset contains 1025 records and 14 features, providing a good basis for analysis.
- The data types of the columns are appropriate, requiring no initial changes.

**Data Cleaning:**
- A significant number of duplicate records (723) were identified and successfully removed, resulting in a cleaner dataset of 302 unique entries for further analysis.
- No missing values were found, which simplifies the data preparation process.

**Exploratory Data Analysis:**
- **Basic Level Questions:** We calculated basic statistics such as the average age (around 54 years), identified the gender distribution (more males than females), and found the average resting blood pressure (around 131 mmHg). We also determined the number of patients with high fasting blood sugar, the different types of chest pain, the maximum heart rate, and the percentage of patients with exercise-induced angina. The distribution of major vessels colored by fluoroscopy and the count of patients with probable or definite left ventricular hypertrophy were also analyzed.
- **Medium Level Questions:** We explored relationships between variables. The correlation between age and cholesterol was found to be relatively weak but positive. Visualizations showed the distribution of chest pain types across age groups and how maximum heart rate varies with exercise-induced angina (patients without angina tend to have higher maximum heart rates). We also compared resting blood pressure between genders, analyzed the relationship between fasting blood sugar and heart disease presence, and examined how the number of major vessels affects the likelihood of heart disease (a higher number of colored vessels is associated with a higher likelihood of heart disease). The average oldpeak values for different chest pain types were also calculated. The distribution of thalassemia types among patients with and without heart disease was visualized, suggesting that certain thalassemia types might be more prevalent in patients with heart disease.
- **Advanced-Level Questions:** Visualizing the combined effect of age, cholesterol, and blood pressure using a pairplot provided insights into how these factors interact in patients with and without heart disease. We identified the clinical measurements with the strongest correlations to heart disease presence (both positive and negative).

**Outlier Analysis:**
- Outliers were detected in several numerical features through boxplots.
- Winsorization using the Interquartile Range (IQR) method was applied to cap these outliers, which can help improve the performance of some models.

**Model Training:**
- A logistic regression model was trained to predict the presence of heart disease using all available features.
- The model achieved an accuracy of 0.80 on the test set. The confusion matrix and classification report provided further details on the model's performance, including precision, recall, and F1-score for both classes (heart disease present/absent).

**Limitations:**
- The dataset contains no information on survival over time, which rules out any analysis of the long-term impact of factors such as thalassemia.

**Overall:**
The analysis gives an overview of the dataset: the characteristics of the patient population, the relationships between clinical factors, and their association with the presence of heart disease. The initial logistic regression model, at 0.80 test accuracy, is a reasonable baseline for predicting heart disease.

**Next Steps:**
- Further analysis could involve exploring other machine learning models (e.g., Support Vector Machines, Random Forests) to compare their predictive performance.
- Feature engineering could be performed to create new features that might improve model accuracy.
- Deeper investigation into the features with the strongest correlations to the target variable could provide more targeted insights for prevention and treatment strategies.
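
The model comparison suggested in the first step could be sketched as follows, using 5-fold stratified cross-validation so that each candidate is scored on the same folds. The model choices, hyperparameters, and the synthetic `make_classification` data are assumptions for illustration; the scores do not describe the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X and y (NOT the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Scale-sensitive models get a scaler; the random forest does not need one
models = {
    'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'svm_rbf': make_pipeline(StandardScaler(), SVC()),
    'random_forest': RandomForestClassifier(n_estimators=200, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mean cross-validated accuracy per model
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy').mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

With only a few hundred unique records, cross-validated scores are likely to be more stable than a single 80/20 split.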