<a href="https://colab.research.google.com/github/GenAIUnplugged/Exploratory-Data-Analysis/blob/main/Detailed_EDA_on_Heart_Disease_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install kaggle



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download johnsmith88/heart-disease-dataset

Dataset URL: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
License(s): unknown
heart-disease-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!unzip -o heart-disease-dataset.zip

Archive:  heart-disease-dataset.zip

# Task
Tell me about this dataset.

Here is all the data you need:
"heart.csv"

## Data loading

### Subtask:
Load the heart disease dataset from the provided CSV file.


**Reasoning**:
Load the dataset and display its first few rows and shape to verify successful loading.



In [None]:
import pandas as pd

try:
    df = pd.read_csv('heart.csv')
    display(df.head())
    print(df.shape)
except FileNotFoundError:
    print("Error: 'heart.csv' not found. Please ensure the file is in the correct location and accessible.")
except Exception as e:
    print(f"An error occurred: {e}")
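
The loading cell above only prints a message on failure, so later cells would then fail with a `NameError` on `df`. A stricter loader could look like the minimal sketch below. It assumes the 14 column names used in the later cells of this notebook make up the full schema. `load_heart_csv` and `EXPECTED_COLUMNS` are hypothetical names introduced here for illustration, and the sample row contains made-up values.

```python
import io
import pandas as pd

# Column names as used in later cells of this notebook (assumed to be the full schema).
EXPECTED_COLUMNS = [
    'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target',
]

def load_heart_csv(source):
    """Read the CSV and raise if any expected column is missing."""
    frame = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame

# Tiny in-memory example with made-up values, just to exercise the function.
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "50,1,0,120,200,0,1,150,0,1.0,2,0,2,1\n"
)
sample = load_heart_csv(sample_csv)
print(sample.shape)
```

In the notebook itself the call would presumably be `df = load_heart_csv('heart.csv')`. Raising early stops a failed load at the cell that caused it.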

## Data exploration

### Subtask:
Explore the loaded heart disease dataset to understand its characteristics.


**Reasoning**:
I need to explore the dataset by examining data types, shape, missing values, and distributions of numerical and categorical features.



In [None]:
import matplotlib.pyplot as plt

# Examine data types
print("Data Types:\n", df.dtypes)

# Determine the shape
print("\nShape:", df.shape)

# Identify and count missing values
print("\nMissing Values:\n", df.isnull().sum())

# Analyze numerical features
print("\nNumerical Features Description:\n", df.describe())

# Generate histograms for numerical features (adjust figure size as needed)
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_features):
    plt.subplot(2, 3, i + 1)
    plt.hist(df[col], bins=20, color='skyblue', edgecolor='black')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Analyze categorical features
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
for col in categorical_features:
    print(f"\nCategorical Feature: {col}")
    print(df[col].value_counts())
    # Generate bar plots for categorical features
    plt.figure(figsize=(8, 6))
    df[col].value_counts().plot(kind='bar', color='salmon')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
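
The exploration cell hardcodes which columns count as numerical and which as categorical. As a rough cross-check, the split can be inferred from the number of distinct values per column. The sketch below is a heuristic only: `split_features` is a hypothetical helper, the threshold of 10 is an arbitrary assumption that should be checked against the real data, and the toy frame contains made-up values.

```python
import pandas as pd

def split_features(frame, max_categories=10):
    """Split columns into (numerical, categorical) by number of distinct values."""
    numerical, categorical = [], []
    for col in frame.columns:
        # Few distinct values suggests a coded category rather than a measurement.
        if frame[col].nunique() <= max_categories:
            categorical.append(col)
        else:
            numerical.append(col)
    return numerical, categorical

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    'age': range(30, 50),
    'sex': [0, 1] * 10,
    'chol': range(200, 220),
})
num_cols, cat_cols = split_features(toy)
print(num_cols, cat_cols)
```

On the real dataset, `split_features(df)` could be compared against the hand-written `numerical_features` and `categorical_features` lists to spot columns that were put in the wrong group.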

## Data analysis

### Subtask:
Analyze the heart disease dataset to identify potential correlations between features and calculate descriptive statistics.


**Reasoning**:
Calculate descriptive statistics for numerical features, analyze categorical features, explore correlations between numerical features, and investigate the relationship between each feature and the target variable.



In [None]:
# Calculate descriptive statistics for numerical features
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
print("Descriptive Statistics for Numerical Features:\n", df[numerical_features].describe())

# Analyze categorical features
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
for col in categorical_features:
    print(f"\nCategorical Feature: {col}")
    print(df[col].value_counts(normalize=True))  # Display proportions

# Explore correlations between numerical features
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 8))
correlation_matrix = df[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()


# Investigate the relationship between each feature and the target variable
for col in categorical_features:
    if col == 'target':
        continue  # grouping the target by itself is uninformative
    print(f"\nMean Target Value for each category of {col}:")
    print(df.groupby(col)['target'].mean())

for col in numerical_features:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x='target', y=col, data=df)
    plt.title(f'Box Plot of {col} vs. Target')
    plt.show()
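
A per-category mean of the target is hard to interpret without knowing how many rows fall into each category. The sketch below reports both, using a hypothetical helper `target_rate_by` and a made-up toy frame. It assumes the target is coded 0/1, so its mean is the share of positive cases.

```python
import pandas as pd

def target_rate_by(frame, feature, target='target'):
    """Return row count and mean of the target for each category of `feature`."""
    return (
        frame.groupby(feature)[target]
        .agg(count='count', target_rate='mean')
        .sort_values('target_rate', ascending=False)
    )

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    'cp':     [0, 0, 0, 1, 1, 2],
    'target': [0, 0, 1, 1, 1, 0],
})
summary = target_rate_by(toy, 'cp')
print(summary)
```

A category with a high target rate but only a handful of rows deserves less weight than one backed by many rows, which is why the count sits next to the rate.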

## Data visualization

### Subtask:
Visualize the data distributions and correlations.


**Reasoning**:
Create histograms for numerical features, bar plots for categorical features, a heatmap for correlations, and box plots to show the relationship between categorical and numerical features.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms for numerical features
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_features):
    plt.subplot(2, 3, i + 1)
    plt.hist(df[col], bins=20, color='skyblue', edgecolor='black')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Bar plots for categorical features
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
for col in categorical_features:
    plt.figure(figsize=(8, 6))
    df[col].value_counts().plot(kind='bar', color='salmon')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

# Heatmap for correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

# Box plots for categorical vs. numerical features
for col in categorical_features:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=col, y='chol', data=df) # Example using 'chol', change as needed
    plt.title(f'Box Plot of chol vs. {col}')
    plt.show()

## Summary

### Q&A
The task asked only for a general description of the dataset, so there were no specific questions to answer. The analysis explored the dataset's characteristics, examined correlations among the numerical features, and visualized the data distributions.

### Data Analysis Key Findings
* The dataset contains 1025 rows and 14 columns with no missing values.
* Numerical features like age, resting blood pressure (trestbps), cholesterol (chol), maximum heart rate (thalach), and ST depression induced by exercise relative to rest (oldpeak) showed varying distributions, with potential outliers observed in chol and oldpeak.
* The categorical features were analyzed for their distributions and their relationships with the target variable: sex, chest pain type (cp), fasting blood sugar (fbs), resting electrocardiographic results (restecg), exercise-induced angina (exang), slope of the peak exercise ST segment (slope), number of major vessels (ca), thalassemia (thal), and the target itself (presence of heart disease). The 'target' variable appears to be relatively balanced.
* Pairwise correlations between the numerical features were computed with `df.corr()` (Pearson by default) and visualized as a heatmap.
* Box plots of each numerical feature grouped by the target show how these features vary with the presence or absence of heart disease.
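
The statement that the target is relatively balanced can be backed by a number. The sketch below computes the class proportions and the minority-to-majority ratio. `class_balance` is a hypothetical helper and the toy frame is made up, so the printed values say nothing about the real dataset.

```python
import pandas as pd

def class_balance(frame, target='target'):
    """Return class proportions and the minority-to-majority ratio."""
    proportions = frame[target].value_counts(normalize=True)
    ratio = proportions.min() / proportions.max()
    return proportions, ratio

# Made-up toy data for illustration only.
toy = pd.DataFrame({'target': [1, 1, 1, 0, 0]})
proportions, ratio = class_balance(toy)
print(proportions)
print(round(ratio, 2))
```

A ratio close to 1 indicates balanced classes. How far below 1 still counts as "relatively balanced" is a judgment call that depends on the modeling goal.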


### Insights or Next Steps
* Investigate the potential outliers in 'chol' and 'oldpeak' and consider appropriate handling methods (e.g., removal, transformation).
* Further explore the relationships between features and the target variable using more advanced statistical methods or machine learning models to build a predictive model for heart disease.
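
One common way to start investigating the potential outliers in 'chol' and 'oldpeak' is the interquartile-range (IQR) rule, the same convention box plots typically use for their whiskers. The sketch below is an assumption about how that step might look, not part of the original analysis. `iqr_outliers` is a hypothetical helper, `k=1.5` is the conventional multiplier, and the toy values are made up.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up toy data for illustration only.
toy = pd.DataFrame({'chol': [200, 210, 220, 230, 240, 600]})
mask = iqr_outliers(toy['chol'])
print(toy[mask])
```

On the real data, `df[iqr_outliers(df['chol'])]` would list the flagged rows for inspection. Flagged values are candidates for review, not automatic removals, since extreme but genuine measurements can be clinically meaningful.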
