<a href="https://colab.research.google.com/github/Rishabhg2501/Heart-Disease-Prediction/blob/main/Heard_Disease_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Team Id - PTID-CDS-FEB-24-1793
### Project Id- PRCP-1016-Heart Disease Pred

#### Business Case: Predicting Potential Heart Diseases Using Machine Learning

The project aims to address the significant global issue of cardiovascular diseases (CVDs) by using machine learning algorithms to predict potential heart disease in individuals. It also involves preparing a comprehensive data analysis report and providing recommendations to hospitals for early detection and management of cardiovascular diseases.

### Problem Statement: The project involves three key tasks:

1. Prepare a complete data analysis report on the given dataset.

2. Create a model predicting potential heart diseases in individuals using machine learning algorithms.

3. Provide suggestions to hospitals to enhance the prediction of heart diseases and prevent life-threatening situations.

## Dataset Overview:

The dataset contains 14 columns, with the patient_id column serving as a unique and random identifier for each individual.
Features:

slope_of_peak_exercise_st_segment (type: int): Represents the slope of the peak exercise ST segment, an electrocardiography readout indicating the quality of blood flow to the heart.

thal (type: categorical): Results of the thallium stress test measuring blood flow to the heart, with possible values including normal, fixed_defect, and reversible_defect.

resting_blood_pressure (type: int): Represents the resting blood pressure of the individuals.

chest_pain_type (type: int): Denotes the type of chest pain experienced by the individuals with four possible values.

num_major_vessels (type: int): Indicates the number of major vessels (ranging from 0 to 3) colored by fluoroscopy.

fasting_blood_sugar_gt_120_mg_per_dl (type: binary): Represents whether the fasting blood sugar is greater than 120 mg/dl.

resting_ekg_results (type: int): Represents the results of resting electrocardiographic tests, with values 0, 1, and 2.

serum_cholesterol_mg_per_dl (type: int): Indicates the serum cholesterol levels in mg/dl.

oldpeak_eq_st_depression (type: float): Represents the ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms.

sex (type: binary): Denotes the gender of the individuals, with 0 representing female and 1 representing male.

age (type: int): Represents the age of the individuals in years.

max_heart_rate_achieved (type: int): Denotes the maximum heart rate achieved by the individuals in beats per minute.

exercise_induced_angina (type: binary): Indicates whether the individuals experienced exercise-induced chest pain, with 0 representing False and 1 representing True.

#### Purpose: The dataset is designed to aid in the prediction of potential heart diseases in individuals, supporting the broader goal of early detection and management of cardiovascular diseases.

### Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
from sklearn.model_selection import GridSearchCV


### Load and explore the dataset

In [None]:
data= pd.read_csv('values.csv')
data1=pd.read_csv('labels.csv')

In [None]:
data

In [None]:
data1

In [None]:
data.head()

In [None]:
data1.head()

In [None]:
data.tail()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

### Data Cleaning And Preprocessing

In [None]:
# Count missing values in both tables; print both so neither result is hidden
print(data.isnull().sum())
print(data1.isnull().sum())

In [None]:
merged_data = pd.merge(data,data1,on='patient_id')

In [None]:
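# Drop the identifier before encoding so patient_id is not one-hot encoded as a feature
merged_data = pd.get_dummies(merged_data.drop(columns='patient_id'))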
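
The merge and encoding steps above can be sanity-checked on a small example. The sketch below uses made-up rows (not the project data) and assumes that each `patient_id` appears exactly once in both tables. The `validate="one_to_one"` argument makes pandas raise an error if that assumption is violated, and `columns=["thal"]` limits one-hot encoding to the categorical feature.

```python
import pandas as pd

# Hypothetical miniature versions of values.csv and labels.csv
values_demo = pd.DataFrame({
    "patient_id": ["a1", "b2", "c3"],
    "thal": ["normal", "reversible_defect", "fixed_defect"],
    "age": [45, 61, 58],
})
labels_demo = pd.DataFrame({
    "patient_id": ["a1", "b2", "c3"],
    "heart_disease_present": [0, 1, 1],
})

# validate="one_to_one" raises MergeError if an id is duplicated on either side
merged_demo = pd.merge(values_demo, labels_demo, on="patient_id",
                       how="inner", validate="one_to_one")

# Encode only the categorical column; the identifier is dropped first
encoded_demo = pd.get_dummies(merged_demo.drop(columns="patient_id"),
                              columns=["thal"])
print(encoded_demo.columns.tolist())
```

Comparing `len(merged_demo)` with the length of each input is a quick way to confirm that no patients were lost in the join.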

In [None]:
print(data.describe())

### Exploratory Data Analysis (EDA)
The data analysis report will involve an in-depth examination of the provided dataset, which includes 14 columns with various features related to cardiovascular health. This analysis will include descriptive statistics, data visualization, identification of correlations, and potential insights to guide the development of the predictive model.
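
One way to approach the "identification of correlations" step is to rank numeric features by their correlation with the target. The sketch below runs on synthetic data: the column names mirror the dataset, but the values and effect sizes are arbitrary assumptions chosen only to make the example run, not findings about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic stand-in for the merged table; the relationships are invented
demo = pd.DataFrame({
    "heart_disease_present": target,
    "age": 50 + 5 * target + rng.normal(0, 8, size=n),
    "max_heart_rate_achieved": 155 - 15 * target + rng.normal(0, 15, size=n),
    "noise_feature": rng.normal(size=n),
})

# Correlation of each numeric feature with the label, strongest first
target_corr = (
    demo.corr(numeric_only=True)["heart_disease_present"]
    .drop("heart_disease_present")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Because the label is binary, these values are point-biserial correlations; they only capture linear association and should be read alongside the plots.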

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='heart_disease_present', data=data1)
plt.title('Distribution of Heart Disease Presence')
plt.xlabel('Heart Disease Present')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
correlation_matrix = data.corr(numeric_only=True)  # skip non-numeric columns such as thal
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Values Dataset')
plt.show()

In [None]:
sns.pairplot(data)
# pairplot builds its own figure, so set a figure-level title instead of an axes title
plt.suptitle('Pairplot of Variables in Values Dataset', y=1.02)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data['age'], bins=20, kde=True)
plt.title('Distribution of Age')
plt.xlabel('age')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=data1['heart_disease_present'], y=data['age'])
plt.title('Age vs Heart Disease Presence')
plt.xlabel('heart_disease_present')
plt.ylabel('age')
plt.xticks(rotation=45)
plt.show()

In [None]:
summary_stats = data.describe()
print("\nSummary Statistics:\n", summary_stats)

In [None]:
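plt.figure(figsize=(12, 8))
# Plot numeric columns only; identifiers and categorical text columns are skipped
for i, column in enumerate(data.select_dtypes(include='number').columns):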
    plt.subplot(3, 5, i + 1)
    sns.histplot(data[column], bins=20, kde=True)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='heart_disease_present', y='num_major_vessels', data=merged_data)
plt.title('Number of Major Vessels vs Heart Disease Presence')
plt.xlabel('Heart Disease Present')
plt.ylabel('Number of Major Vessels')
plt.show()

In [None]:
avg_values = merged_data.groupby('heart_disease_present')['age'].mean()
print("\nAverage Age for Each Label:\n", avg_values)

In [None]:
avg_values_multiple_columns = merged_data.groupby('heart_disease_present')[['slope_of_peak_exercise_st_segment', 'resting_blood_pressure', 'serum_cholesterol_mg_per_dl']].mean()
print("\nAverage Values for Each Label ('heart_disease_present'):\n", avg_values_multiple_columns)

In [None]:
print("\nData Analysis Report:")
print("1. The data has been successfully loaded and cleaned.")
print("2. The distribution of values shows a roughly normal distribution with some outliers.")
print("3. The relationship between labels and values indicates varying ranges and spreads.")
print("4. Average values for each label have been calculated.")

### Split the data into training and testing sets

In [None]:
X = merged_data.drop('heart_disease_present', axis=1)
y = merged_data['heart_disease_present']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
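
With a small dataset, a purely random split can leave the two classes in different proportions in the training and test sets. A minimal sketch of a stratified split follows, using synthetic arrays (`X_demo`, `y_demo` are hypothetical) rather than the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))      # made-up features
y_demo = np.array([0] * 60 + [1] * 40)  # made-up labels, 40% positive

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test):", y_te.mean())
```

Applied to this notebook, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.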

### Machine Learning Model Development:
The project requires the development and evaluation of machine learning models to predict potential heart diseases in individuals. This involves features such as the slope of the peak exercise ST segment, thallium stress test results, resting blood pressure, chest pain type, and other relevant health indicators. Several machine learning algorithms will be employed and compared to determine the most effective model for accurate predictions.
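
A common way to compare several algorithms is cross-validation on the training data. The sketch below is one possible setup, not the notebook's own procedure: it uses a synthetic dataset from `make_classification` and wraps the scaler and logistic regression in a pipeline so that scaling is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(
    n_samples=180, n_features=13, n_informative=6, random_state=42
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each candidate model
cv_scores = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, estimator in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the standard deviation next to the mean helps judge whether a difference between models is larger than the fold-to-fold noise.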

### Choose a machine learning algorithm and train the model

In [None]:
# Standardize the features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_scaled, y_train)

In [None]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

### Evaluate the model's performance and Predict on the testing set

In [None]:
y_pred = rf_classifier.predict(X_test_scaled)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))

In [None]:
predictions = rf_classifier.predict(X_test_scaled)

In [None]:
print("Classification Report:")
print(classification_report(y_test, predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
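
In a screening context, sensitivity (the share of true cases that are detected) and specificity are often more informative than accuracy alone. Both can be read off the confusion matrix. The sketch below uses hand-written example labels, not the model's real predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity = tp / (tp + fn)  # recall for the positive class
specificity = tn / (tn + fp)  # recall for the negative class
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

The same two lines can be applied to `confusion_matrix(y_test, predictions)` from the cell above.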

### Fine-tune the model

In [None]:
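# liblinear supports both the l1 and l2 penalties (the default lbfgs solver does not support l1)
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}
model = LogisticRegression(solver='liblinear')
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
# Tune on the standardized features, consistent with the models trained above
grid_search.fit(X_train_scaled, y_train)
best_model = grid_search.best_estimator_

In [None]:
test_accuracy = best_model.score(X_test_scaled, y_test)
print("Test Accuracy:", test_accuracy)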
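
An alternative arrangement for tuning, sketched below under the assumption that scaling should be refit within each cross-validation fold, is to place the scaler and classifier in a `Pipeline` and search over the pipeline. The data is synthetic and the step names `scaler` and `clf` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data
X_demo, y_demo = make_classification(
    n_samples=180, n_features=13, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_te, y_te))
```

With this layout the test set is touched only once, after the best configuration has been chosen.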

### Suggestions to the Hospital

In [None]:
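risk_threshold = 0.5  # Adjust the threshold as needed
# Compare predicted probabilities (not hard 0/1 labels) against the threshold
risk_scores = rf_classifier.predict_proba(X_test_scaled)[:, 1]
high_risk_individuals = X_test[risk_scores > risk_threshold]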

In [None]:
for idx, individual in high_risk_individuals.iterrows():
    # Example suggestion: Encourage regular check-ups and lifestyle modifications
    print(f"Suggestion for individual {idx}: Encourage regular health check-ups and lifestyle modifications.")
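
The choice of risk threshold trades missed cases against false alarms. The sketch below illustrates that trade-off with invented labels and risk scores (`y_true_demo` and `risk_demo` are hypothetical values, not model output).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted risk scores for eight patients
y_true_demo = np.array([0, 0, 1, 1, 0, 1, 0, 1])
risk_demo = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.90])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Flag every patient whose risk score reaches the threshold
    flagged = (risk_demo >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_true_demo, flagged),
        "recall": recall_score(y_true_demo, flagged),
    }
    print(threshold, results[threshold])
```

In this toy data, lowering the threshold raises recall at the cost of precision; which balance is acceptable is a clinical decision rather than a modeling one.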

In [None]:
print("\nSuggestions to the Hospital:")
print("- Implement routine screening programs for individuals at risk of heart diseases.")
print("- Utilize predictive models developed from historical patient data to identify individuals at high risk.")
print("- Provide personalized risk assessments to patients based on their individual characteristics and medical history.")
print("- Offer lifestyle modification programs focusing on promoting healthy behaviors.")
print("- Establish collaborative care teams consisting of healthcare professionals.")
print("- Promote adherence to medication and treatment plans.")
print("- Continuously evaluate and improve prevention programs and interventions.")

### Feature importances (if RandomForestClassifier is used)

In [None]:
feature_importances = pd.Series(rf_classifier.feature_importances_, index=X.columns)
feature_importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Important Features')
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()
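
Impurity-based importances from a random forest are computed on the training data and can favor features with many distinct values. As a complementary check, permutation importance measures how much the held-out score drops when a feature is shuffled. The sketch below uses synthetic data and is only an illustration of the technique.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with six features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```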

## Recommendations to Hospitals:
The project will culminate in providing actionable recommendations to hospitals for leveraging the predictive model to enhance the early detection and management of cardiovascular diseases. These suggestions will focus on how hospitals can integrate the predictive model into their existing healthcare systems to identify individuals at high risk of heart diseases and initiate timely interventions.

## Challenges Faced and Techniques Used:
A dedicated report will be created to outline the challenges encountered while working with the dataset and developing the machine learning models. The report will explain the techniques used to address these challenges, including data preprocessing, feature engineering, model selection, hyperparameter tuning, and addressing class imbalance if present. Each technique will be accompanied by a rationale for its application, ensuring transparency in the model development process.

## Conclusion:
This project uses machine learning to predict potential heart disease, a task that addresses a global healthcare challenge and helps hospitals manage cardiovascular health proactively. The data analysis, model development, and recommendations to hospitals are the core components for meeting the project's objectives and improving healthcare outcomes.