# PRCP-1016-HeartDieseasePred



### STEP 1: Setup & Load the Data

###  1.1 Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1.2 Load Both Files

In [None]:
# Load the feature values
values_df = pd.read_csv("values.csv")

# Load the target labels
labels_df = pd.read_csv("labels.csv")

# Check their shapes
print("Values shape:", values_df.shape)
print("Labels shape:", labels_df.shape)

### 1.3 Merge Both DataFrames

In [None]:
# Combine both into a single dataset (rows are paired by index,
# so both files must list patients in the same order)
df = pd.concat([values_df, labels_df], axis=1)

# Show the first 5 rows
df.head()
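
With the default index from `read_csv`, `pd.concat(..., axis=1)` pairs rows by position, so it silently assumes both files list patients in the same order. A minimal sketch of a key-based alternative, assuming both files carry a `patient_id` column (the toy frames and their values below are made up for illustration, not taken from the real data):

```python
import pandas as pd

# Hypothetical stand-ins for values.csv and labels.csv (made-up rows).
values_demo = pd.DataFrame({
    "patient_id": ["a1", "b2", "c3"],
    "age": [45, 61, 52],
})
labels_demo = pd.DataFrame({
    "patient_id": ["c3", "a1", "b2"],   # deliberately in a different order
    "heart_disease_present": [1, 0, 1],
})

# Join on the key; validate="one_to_one" raises if an id appears twice.
merged_demo = values_demo.merge(labels_demo, on="patient_id", validate="one_to_one")

print(merged_demo)
```

A key-based join also leaves a single `patient_id` column instead of two.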

### 1.4 Basic Info About Data

In [None]:
# Basic structure (rows, columns)
print("Shape:", df.shape)

# List of all column names
print("Columns:", df.columns.tolist())

# Number of unique values per column
print(df.nunique())

# Check data types and non-null counts
df.info()

# Summary statistics
print(df.describe())

# Check for missing values
print(df.isnull().sum())

In [None]:
# Count heart disease present
df['heart_disease_present'].value_counts()

### STEP 2: DATA PREPROCESSING & CLEANING


### 2.1 Drop Irrelevant Columns

In [None]:
df.drop(columns=['patient_id'], inplace=True)

### 2.2 Handle Missing Values

In [None]:
df.isnull().sum()

### 2.3 Encode Categorical Variables

In [None]:
df['thal'].unique() 

### Encoding Plan

thal: contains the text categories normal, fixed_defect, and reversible_defect → convert to integer codes with LabelEncoder

The other columns are already numerical or binary.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['thal'] = le.fit_transform(df['thal'])

In [None]:
dict(zip(le.classes_, le.transform(le.classes_)))
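
`LabelEncoder` assigns integer codes in sorted order of the class names, which implies an ordering that the `thal` categories may not have. One possible alternative, sketched here on a made-up column rather than the real data, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Made-up sample of the three thal categories.
thal_demo = pd.DataFrame({
    "thal": ["normal", "reversible_defect", "fixed_defect", "normal"],
})

# One indicator column per category; no order is implied between them.
thal_onehot = pd.get_dummies(thal_demo, columns=["thal"], prefix="thal", dtype=int)

print(thal_onehot)
```

Tree-based models are usually indifferent to this choice, while linear and distance-based models can be affected by the implied ordering.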

### 2.4 Feature Scaling 

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_cols = ['age', 'resting_blood_pressure', 'serum_cholesterol_mg_per_dl',
               'max_heart_rate_achieved', 'oldpeak_eq_st_depression']

df[scaled_cols] = scaler.fit_transform(df[scaled_cols])
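
Calling `fit_transform` on the full DataFrame lets the scaler see rows that later end up in the test set. A minimal sketch of one way to avoid that, fitting the scaler on the training split only; the data is randomly generated and the column names `feat_a` and `feat_b` are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for two numerical columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feat_a": rng.normal(50, 10, size=100),
    "feat_b": rng.normal(130, 20, size=100),
})

train_demo, test_demo = train_test_split(demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_demo)  # learn mean/std from training rows only
test_scaled = demo_scaler.transform(test_demo)        # reuse those statistics unchanged

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The test rows will not have exactly zero mean and unit variance after this, which is expected.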

### 2.5 Check Final Dataset

In [None]:
# Print each summary explicitly; a notebook cell only displays its last expression
df.info()
print(df.head())
print(df.describe())

In [None]:
df.describe().T

### STEP 3: EXPLORATORY DATA ANALYSIS (EDA)

### 3.1 Class Balance (Target Distribution)

In [None]:
sns.countplot(x='heart_disease_present', data=df)
plt.title("Heart Disease Class Distribution")
plt.xlabel("Heart Disease Present (1 = Yes, 0 = No)")
plt.ylabel("Count")
plt.show()

# Print class percentages
df['heart_disease_present'].value_counts(normalize=True) * 100

### 3.2 Univariate Analysis (One Column at a Time)

### Numerical Features

In [None]:
df[['age', 'resting_blood_pressure', 'serum_cholesterol_mg_per_dl',
    'max_heart_rate_achieved', 'oldpeak_eq_st_depression']].hist(
    figsize=(12, 8), bins=20, edgecolor='black'
)
plt.suptitle("Histograms of Numerical Features")
plt.show()

### INSIGHTS

Age: Age is spread fairly evenly across its range, with most patients middle-aged to elderly, which makes age a potentially valuable predictor of heart disease.

Resting Blood Pressure: Most patients have moderate blood pressure, but a few show significantly high readings.

Serum Cholesterol (mg/dl): While most patients have cholesterol in the healthy range, some are dangerously high, a potential indicator of heart risk.

Max Heart Rate Achieved: A strong and cleanly distributed variable. People with lower max heart rates could potentially have weaker cardiac output.

Oldpeak (ST Depression): Very telling for cardiovascular stress testing. Values above 2 may indicate ischemic changes.



### Categorical/Binary Features

In [None]:
categorical = ['sex', 'chest_pain_type', 'thal', 'fasting_blood_sugar_gt_120_mg_per_dl',
               'resting_ekg_results', 'exercise_induced_angina', 'num_major_vessels']

for col in categorical:
    sns.countplot(x=col, data=df)
    plt.title(f"{col} distribution")
    plt.show()

### 3.3 Bivariate Analysis (Feature vs Target)

###  Boxplots: Feature distribution by target class

In [None]:
for col in ['age', 'resting_blood_pressure', 'serum_cholesterol_mg_per_dl',
            'max_heart_rate_achieved', 'oldpeak_eq_st_depression']:
    sns.boxplot(x='heart_disease_present', y=col, data=df)
    plt.title(f"{col} vs Heart Disease")
    plt.show()

### Categorical Features vs Target

In [None]:
for col in categorical:
    sns.countplot(x=col, hue='heart_disease_present', data=df)
    plt.title(f"{col} vs Heart Disease")
    plt.show()

### 3.4 Correlation Heatmap

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


### STEP 4: MODEL BUILDING

### 4.1 Data Preparation

In [None]:
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target (y) in the preprocessed DataFrame
X = df.drop('heart_disease_present', axis=1)
y = df['heart_disease_present']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### 4.2 Model Training and Evaluation

### 1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_lr))

### 2. Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

svm_model = SVC(kernel='rbf', probability=True)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))

### 3. Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

print("Decision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt))

### 4. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

### 5. K-Nearest Neighbors (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)

print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

### 4.3 Model Comparison

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {
    'Logistic Regression': y_pred_lr,
    'SVM': y_pred_svm,
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
    'KNN': y_pred_knn,
}

results = []

for model_name, y_pred in models.items():
    results.append({
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred)
    })

results_df = pd.DataFrame(results)
print(results_df.sort_values(by='F1 Score', ascending=False))
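
A single 80/20 split leaves relatively few test rows on a small dataset, so the ranking above can shift with the random seed. A hedged sketch of a cross-validated comparison follows; it runs on synthetic data from `make_classification` (the sample and feature counts are arbitrary), and only three of the five models are included to keep it short:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Five stratified folds: every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    cv_f1[name] = (scores.mean(), scores.std())
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much of the gap between two models is noise.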

# Prepare a complete data analysis report on the given data.

##  Data Analysis Report: Heart Disease Dataset
### 1. Introduction
Objective: To analyze patient data to identify key factors associated with the presence of heart disease and to develop predictive models that can aid in early diagnosis.

Dataset Overview: The dataset comprises various medical attributes, including demographic details, clinical measurements, and diagnostic results, aimed at determining the presence or absence of heart disease in patients.

Target Variable: heart_disease_present (1 indicates presence of heart disease, 0 indicates absence).

### 2. Data Understanding
Data Collection: The dataset is sourced from a reputable medical repository, ensuring reliability and relevance for heart disease analysis.

Data Description: The dataset contains multiple records with features such as age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, and the thallium stress test result (thal).

### 3. Data Preprocessing
Handling Missing Values: The dataset was checked for missing or null values with df.isnull().sum(). The notebook applies no imputation or row removal.

Encoding Categorical Variables: The thal column, which holds the text categories normal, fixed_defect, and reversible_defect, was converted to integer codes with LabelEncoder. The remaining columns were already numerical or binary.

Feature Scaling: So that features measured on different scales contribute comparably to scale-sensitive models, StandardScaler was applied to age, resting blood pressure, serum cholesterol, maximum heart rate achieved, and ST depression.

StandardScaler: Standardizes each feature by subtracting its mean and dividing by its standard deviation. This changes the scale of a feature, not the shape of its distribution.

MinMaxScaler: An alternative that scales features to a given range, typically between 0 and 1, also preserving the shape of the original distribution. It was not used in the notebook.

### 4. Exploratory Data Analysis (EDA)
#### 4.1 Class Balance (Target Distribution)

Visualization: A count plot was generated to visualize the distribution of the target variable, heart_disease_present.

Percentage Distribution: The proportion of patients with and without heart disease was calculated to assess class balance.

Insight: The dataset exhibits a relatively balanced distribution between patients with and without heart disease, which reduces the risk that predictive models trained on this data will be biased towards a particular class.

#### 4.2 Univariate Analysis

Numerical Features: Histograms were plotted for numerical features such as age, resting blood pressure, serum cholesterol, maximum heart rate achieved, and ST depression.

Insights:

Age: The age distribution is slightly right-skewed, indicating a higher number of younger patients in the dataset.

Resting Blood Pressure: Most patients have resting blood pressure within the normal range, with a few outliers indicating hypertension.

Serum Cholesterol: The distribution is right-skewed, suggesting that while most patients have cholesterol levels within the normal range, some have significantly higher levels.

Maximum Heart Rate Achieved: The distribution is approximately normal, with most patients achieving a heart rate between 140 and 170 bpm.

ST Depression: Most patients have low ST depression values, with a few exhibiting higher values, indicating potential heart issues.

Categorical/Binary Features: Bar plots were created for categorical features like sex, chest pain type, thal (thallium stress test result), fasting blood sugar, resting ECG results, exercise-induced angina, and the number of major vessels.

Insights:

Sex: A higher proportion of male patients is observed in the dataset.

Chest Pain Type: Typical angina is the most common chest pain type among patients.

Thal: The majority of patients have a normal thallium stress test result, with fewer cases of fixed or reversible defects.

Fasting Blood Sugar: Most patients have fasting blood sugar levels below 120 mg/dl.

Resting ECG Results: Normal ECG results are predominant, with some showing ST-T wave abnormalities.

Exercise-Induced Angina: A smaller proportion of patients experience angina induced by exercise.

Number of Major Vessels: Most patients have zero or one major vessel colored by fluoroscopy, indicating fewer blockages.

### 5. Model Building
#### 5.1 Data Splitting

The dataset was divided into training (80%) and testing (20%) sets, stratified by the target, to evaluate model performance effectively.

#### 5.2 Model Training and Evaluation

Several classification models were trained and evaluated:

Logistic Regression: A statistical model that predicts the probability of a binary outcome. It performed well, providing interpretable coefficients for each feature.

Support Vector Machine (SVM): Effective in high-dimensional spaces, SVM provided a robust decision boundary between classes.

Decision Tree Classifier: This model created a tree-like structure of decisions, offering clear insights into feature importance.

Random Forest Classifier: An ensemble of decision trees, this model improved accuracy and controlled overfitting.

K-Nearest Neighbors (KNN): A non-parametric method that classified patients based on the majority class of their nearest neighbors.

Performance Metrics:

Each model was evaluated using accuracy, precision, recall, and F1-score. The Random Forest classifier achieved the strongest overall results of the five models, including the highest recall, indicating its suitability for this dataset.
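
All four metrics named above derive from the confusion matrix. A small worked example with made-up labels (these are not the project's results) shows the arithmetic and checks it against scikit-learn:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up ground truth and predictions for ten patients.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```

Here one sick patient is missed (a false negative) and one healthy patient is flagged (a false positive), so precision and recall happen to be equal.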

### 6. Conclusions and Recommendations
Key Findings:

Features such as chest pain type, number of major vessels, and exercise-induced angina significantly influence the presence of heart disease.

Models like Random Forest offer high accuracy and can be utilized for predictive diagnostics.

Recommendations:

Implementing these models in clinical settings can aid in early detection of heart disease.

Further data collection, especially focusing on underrepresented groups, can enhance model generalizability.

### 7. Appendices
Data Sources: [Specify the source of the dataset, e.g., UCI Machine Learning Repository]

Glossary:

ST Depression: A measure of the change in the ST segment of an ECG, which can indicate heart problems.

Thal: The result of a thallium stress test, which measures blood flow to the heart; recorded as normal, fixed defect, or reversible defect.

Angina: Chest pain caused by reduced blood flow to the heart muscles.

# Model Comparison Report

### Create a report stating the performance of multiple models on this data and suggest the best model for production.


## Comparative Analysis of Machine Learning Models for Heart Disease Prediction

### 1. Insights and Observations
#### Random Forest:

Strengths: Achieved the highest recall (100%), indicating it correctly identified all positive cases of heart disease in the test set. This is crucial in medical diagnostics where missing a positive case can have severe consequences.

Considerations: Slightly lower precision suggests a higher rate of false positives, which may lead to unnecessary further testing.
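
The trade-off between recall and precision described here can be steered through the decision threshold applied to predicted probabilities. A minimal sketch on synthetic data, assuming a Random Forest with arbitrary settings; the 0.3 threshold is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy stand-in for the heart disease data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
proba_pos = rf_demo.predict_proba(X_te)[:, 1]  # estimated probability of class 1

threshold_results = {}
for threshold in (0.5, 0.3):
    # Lowering the threshold flags more patients as positive.
    y_hat = (proba_pos >= threshold).astype(int)
    threshold_results[threshold] = {
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat),
    }
    print(threshold, threshold_results[threshold])
```

Recall can only stay the same or rise as the threshold drops, while precision typically falls, which is the false-positive cost noted above.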

#### K-Nearest Neighbors (KNN):

Strengths: Demonstrated a balanced performance with high precision and recall, leading to a strong F1 score. This balance indicates reliability in both identifying true positives and minimizing false positives.

Considerations: Performance can be sensitive to the choice of 'k' and may not scale well with larger datasets.

#### Support Vector Machine (SVM):

Strengths: Provided solid performance across all metrics, making it a dependable model.

Considerations: May require careful tuning of parameters and kernel selection for optimal performance (source: PubMed Central).

#### Logistic Regression:

Strengths: Offers interpretability, allowing for understanding the influence of each feature on the prediction.

Considerations: Lower accuracy and F1 score compared to other models suggest it may not capture complex patterns as effectively.

#### Decision Tree:

Strengths: Simple to understand and interpret, making it useful for initial exploratory analysis.

Considerations: Lower performance metrics indicate it may not be the best standalone model for this dataset.

### 2. Recommendation for Production Deployment
Considering the performance metrics and practical aspects:

Primary Recommendation: Random Forest

Justification: Its perfect recall ensures that all positive cases are identified, which is paramount in medical applications. The high F1 score indicates a good balance between precision and recall.

Alternative Option: K-Nearest Neighbors (KNN)

Justification: Offers a balanced performance and could be considered if computational simplicity and interpretability are prioritized.

### 3. Conclusion
For the task of heart disease prediction, the Random Forest model stands out due to its exceptional recall and overall strong performance metrics, making it highly suitable for production deployment where the cost of false negatives is high.


# Suggestions to the Hospital to Raise Awareness of Heart Disease Prediction and Prevent Life Threats

## Hospital Strategies to Enhance Heart Disease Prediction and Prevention
### 1. Implement Advanced Predictive Analytics and AI Tools
Early Detection through AI: Adopt artificial intelligence models capable of analyzing ECG data to predict life-threatening arrhythmias up to two weeks in advance. This proactive approach enables timely interventions (source: The Times of India).

Machine Learning for Risk Stratification: Utilize machine learning algorithms to identify patients at high risk of heart disease, allowing for targeted preventive measures.

### 2. Enhance Lifestyle Modification Programs
Promote Heart-Healthy Habits: Encourage patients to adopt daily practices such as regular physical activity, balanced nutrition, adequate hydration, and stress management to reduce heart attack risks. 

Smoking Cessation Initiatives: Implement programs to assist patients in quitting smoking, a significant modifiable risk factor for heart disease.

### 3. Strengthen Preventive Care and Patient Education
Routine Health Screenings: Offer regular check-ups to monitor blood pressure, cholesterol, and glucose levels, facilitating early detection of cardiovascular risk factors.

Patient Awareness Campaigns: Educate patients about subtle heart disease symptoms, such as fatigue, dizziness, and jaw pain, to promote early medical consultation (source: The Sun).

### 4. Leverage Technology for Continuous Monitoring
Wearable Health Devices: Integrate wearable technology to monitor vital signs, enabling real-time data collection and early identification of anomalies.

Telemedicine Services: Expand telehealth offerings to provide accessible consultations and follow-ups, ensuring consistent patient engagement and care continuity.

### 5. Foster Community Engagement and Support
Community Health Programs: Organize workshops and seminars to educate the public on heart disease prevention and healthy lifestyle choices.

Support Groups: Establish support networks for patients with heart conditions to share experiences and encourage adherence to treatment plans.

# Create a report which should include challenges you faced on data and what technique used with proper reason.




## Data Analysis Report: Challenges and Solutions in Heart Disease Prediction
### 1. Data Quality Issues
Challenge: Missing values, inconsistencies, and outliers in the dataset could compromise model accuracy.

Missing Value Check: Counted nulls per column with df.isnull().sum(). The notebook applies no imputation; where gaps do exist, mean or median imputation suits numerical features and mode imputation suits categorical features.

Outlier Detection: Used boxplots of each numerical feature by target class to spot outliers, with Z-scores as a complementary numeric check. Any treatment should be decided using domain knowledge.

Reasoning: Ensuring data completeness and consistency is crucial for reliable model training and accurate predictions.
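
The Z-score check mentioned above can be sketched in a few lines. The readings below are made up for illustration, and the 2.5 cut-off is an arbitrary choice for this tiny sample:

```python
import pandas as pd

# Made-up cholesterol readings with one extreme value.
chol_demo = pd.Series(
    [210, 198, 250, 230, 205, 564, 220, 240, 215, 225], name="cholesterol_demo"
)

# Z-score: distance from the mean in units of (population) standard deviation.
z_scores = (chol_demo - chol_demo.mean()) / chol_demo.std(ddof=0)

# Flag readings more than 2.5 standard deviations from the mean.
outliers = chol_demo[z_scores.abs() > 2.5]

print(outliers)
```

Flagging is only the first step; whether a flagged reading is an error or a genuinely high-risk patient is a clinical judgment.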

### 2. Feature Scaling
Challenge: Features had varying scales, which could adversely affect distance-based algorithms like KNN and SVM.

Standardization: Transformed features to have a mean of zero and a standard deviation of one.

Normalization: Scaling features to a specific range, typically [0, 1], is an alternative for algorithms sensitive to feature magnitudes; the notebook itself uses standardization only.

Reasoning: Scaling ensures that all features contribute equally to the model's learning process, preventing bias towards features with larger scales.
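
The two rescaling options discussed in this section reduce to short formulas. A worked example on made-up readings, checked against scikit-learn's `StandardScaler` and `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up blood pressure readings, one feature column.
x = np.array([[120.0], [130.0], [140.0], [180.0]])

standardized = (x - x.mean()) / x.std()            # z = (x - mean) / std
normalized = (x - x.min()) / (x.max() - x.min())   # x' = (x - min) / (max - min)

# Both manual results match the library transformers.
assert np.allclose(standardized, StandardScaler().fit_transform(x))
assert np.allclose(normalized, MinMaxScaler().fit_transform(x))

print(standardized.ravel().round(3))
print(normalized.ravel().round(3))
```

Both are linear rescalings, so neither changes the shape of the distribution; they differ in how strongly a single extreme value compresses the rest of the range.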

### 3. Class Imbalance
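Challenge: An imbalance between the classes can bias model predictions towards the majority class.

Class Distribution Check: Plotted the target counts and computed class percentages, then used a stratified train/test split so that both sets keep the same class proportions.

Resampling Methods: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can balance the class distribution when the imbalance is severe; the notebook does not apply resampling.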

Evaluation Metrics: Focused on metrics like precision, recall, and F1-score, which provide a more comprehensive assessment in imbalanced scenarios.

Reasoning: Addressing class imbalance is vital to ensure the model accurately identifies both classes, especially the minority class, which is often of greater interest in medical diagnoses.
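
As a lighter-weight alternative to resampling, scikit-learn classifiers can reweight classes during training. The sketch below swaps in `class_weight="balanced"` rather than SMOTE (which lives in the separate imbalanced-learn package) and runs on synthetic data that is deliberately imbalanced at roughly 90/10:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic data with about 10% positives.
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=0
)

# "balanced" weights are n_samples / (n_classes * class_count).
class_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_tr)
print("Class weights:", class_weights)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print("Minority recall, unweighted vs weighted:", recall_plain, recall_weighted)
```

Reweighting usually raises minority-class recall at some cost in precision, so it should be judged with the same precision, recall, and F1 metrics.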

### 4. Model Selection and Evaluation
Challenge: Determining the most suitable algorithm for accurate heart disease prediction.

Models Evaluated:

Logistic Regression

Support Vector Machine (SVM)

Decision Tree Classifier

Random Forest Classifier

K-Nearest Neighbors (KNN)

Evaluation Metrics:

Accuracy

Precision

Recall

F1 Score

Reasoning: Evaluating multiple models using diverse metrics provides a holistic view of each model's performance, facilitating informed selection for deployment.

### 5. Overfitting and Underfitting
Challenge: Balancing model complexity to avoid overfitting (model too complex) and underfitting (model too simple).

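Cross-Validation: The notebook evaluates each model on a single stratified 80/20 hold-out split. K-fold cross-validation, which assesses model performance on several different subsets of the data, is the natural next check.

Hyperparameter Tuning: The models were trained with mostly default parameters. Adjusting those parameters is how the balance between bias and variance would be optimized.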

Reasoning: These techniques help in building models that generalize well to unseen data, ensuring robustness and reliability.
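
Cross-validation and hyperparameter tuning can be combined in a single grid search. A minimal sketch on synthetic data; the parameter grid is a small illustrative assumption, not a recommended search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

# Shallower trees and larger leaves reduce variance at the cost of more bias.
param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean cross-validated F1:", round(search.best_score_, 3))
```

Because the same folds are used to pick the parameters, a final check on an untouched test set is still needed for an unbiased estimate.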

### 6. Interpretability vs. Performance
Challenge: Balancing the need for model interpretability with predictive performance, especially critical in healthcare applications.

Model Selection: Considered simpler models like Logistic Regression for their interpretability and complex models like Random Forest for higher accuracy.

Feature Importance Analysis: Analyzed feature contributions to understand model decisions better.

Reasoning: In healthcare, understanding the rationale behind predictions is as important as the predictions themselves, aiding in clinical decision-making.
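
One common way to run the feature importance analysis mentioned above is a Random Forest's impurity-based `feature_importances_`. A hedged sketch on synthetic data, with placeholder feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative features plus 3 pure-noise features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholder names

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Importances are normalized to sum to 1; higher means more impurity reduction.
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances describe what the model relied on, not causal effects, so they are best read alongside clinical knowledge.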

### Conclusion
Throughout the heart disease prediction project, several challenges were encountered, ranging from data quality issues to model selection dilemmas. By systematically addressing each challenge with appropriate techniques and thoughtful reasoning, a robust and reliable predictive model was developed.

The Random Forest Classifier emerged as the top-performing model, offering a balance between accuracy and interpretability, making it a suitable choice for deployment in a clinical setting.
