<a href="https://colab.research.google.com/github/AIsoroush/deep-learning-projects/blob/main/Heart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Heart Attack Risk Prediction

**Author:** soroush taghados

## Description
This project predicts the likelihood of a heart attack in patients from medical data.  
Given features such as age, sex, blood pressure, chest pain type, number of major vessels, heart rate, and other indicators, the model predicts whether the risk of a heart attack is low or high.

## Dataset
Data file: `heart.csv`  
Features:

- `age`: patient age
- `sex`: sex
- `cp`: chest pain type
- `trtbps`: resting blood pressure
- `chol`: cholesterol level
- `fbs`: high fasting blood sugar
- `restecg`: resting ECG result
- `thalachh`: maximum heart rate
- `exng`: whether the pain is triggered by physical activity
- `caa`: number of major vessels
- `oldpeak`: ST depression
- `thall`: thallium test result
- `output`: heart attack risk (0 = low, 1 = high)
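
A quick schema check can catch a wrong or renamed file before any modelling starts. The sketch below uses only the standard library and assumes the column names listed above, plus `slp`, which is not in the list but appears in the loaded data. The helper name `missing_columns` and the two-line sample are illustrative, not part of the notebook.

```python
import csv
import io

# Column names as documented above; `slp` is taken from the loaded data.
EXPECTED_COLUMNS = [
    "age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
    "thalachh", "exng", "oldpeak", "slp", "caa", "thall", "output",
]


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Illustrative sample in the same layout as heart.csv (header plus one row).
sample = (
    "age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output\n"
    "63,1,3,145,233,1,0,150,0,2.3,0,0,1,1\n"
)
print(missing_columns(sample))  # prints []
```

In practice the same function could be given the contents of the downloaded file instead of the inline sample.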

## Preprocessing
- Check for and remove missing values
- Select important features using Mutual Information
- Normalize the data
- Split the data into training and test sets (80/20, 70/30, and 60/40 for comparison)
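
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the notebook's exact code: it runs on synthetic data from `make_classification` as a stand-in for `heart.csv`, keeps `k=10` features as an assumption, and fits the selection and scaling on the training split only so that test data does not influence them.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv: 303 rows, 13 features (assumption for illustration).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# Compare the three split ratios mentioned above.
for test_size in (0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Mutual Information feature selection followed by standardization.
    prep = make_pipeline(
        SelectKBest(mutual_info_classif, k=10),
        StandardScaler(),
    )
    X_train_p = prep.fit_transform(X_train, y_train)  # fit on training data only
    X_test_p = prep.transform(X_test)                 # reuse the fitted transforms
    print(test_size, X_train_p.shape, X_test_p.shape)
```

Because the transforms are fitted inside the loop, each split ratio gets its own feature selection and scaling statistics.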

## Models
Three main models were trained:

1. **Support Vector Machine (SVM)**
2. **Logistic Regression**
3. **Decision Tree**

Hyperparameters were tuned with **GridSearchCV**.
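
As a rough sketch of how one search loop could cover all three models, the example below runs `GridSearchCV` on synthetic data. The parameter grids here are deliberately small and are assumptions for illustration; in particular, the Logistic Regression grid is not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One (estimator, parameter grid) pair per model; grids are illustrative.
search_spaces = {
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "LogisticRegression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
    ),
    "DecisionTree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 4, 8]},
    ),
}

best = {}
for name, (estimator, grid) in search_spaces.items():
    # 5-fold cross-validated search scored by accuracy.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Keeping the grids in one dictionary makes it easy to add or drop a model without touching the search loop.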

## Evaluation
The models were evaluated with the following metrics:

- Accuracy
- F1 Score
- F2 Score
- Confusion Matrix (including False Negatives, which matter most because they correspond to missed high-risk patients)
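
To make the difference between F1 and F2 concrete, the sketch below computes the F-beta score directly from confusion-matrix counts using the standard formula `(1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP)`. The helper name `fbeta_from_counts` and the counts are hypothetical, chosen only to show that F2 penalizes false negatives more heavily than false positives.

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives, and false negatives."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts: 8 correct high-risk predictions, 2 false alarms, 4 misses.
tp, fp, fn = 8, 2, 4
f1 = fbeta_from_counts(tp, fp, fn, beta=1)
f2 = fbeta_from_counts(tp, fp, fn, beta=2)
print(round(f1, 4), round(f2, 4))  # prints 0.7273 0.6897

# Swapping the error counts leaves F1 unchanged but raises F2,
# because fewer high-risk patients are missed.
f2_fewer_misses = fbeta_from_counts(tp, fp=4, fn=2, beta=2)
print(round(f2_fewer_misses, 4))  # prints 0.7692
```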

SVM and Logistic Regression performed best.  
The Decision Tree performed worse, which is attributed to the complexity of the data.

## How to Use
1. Install the required libraries:
```bash
pip install pandas numpy scikit-learn matplotlib seaborn gdown
```

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

In [35]:
# -------------------------------
# Download dataset
# -------------------------------
import gdown
import os

os.makedirs("data", exist_ok=True)

file_id = "1PTheun1CF8YQUrP7ueaEyDLcVyRPqID3"
url = f"https://drive.google.com/uc?id={file_id}"  # Direct download link

out_path = "data/drug_dataset.csv"

print("Downloading dataset...")
gdown.download(url, out_path, quiet=False)
print(f"✅ Dataset downloaded to {out_path}")

file = out_path

Downloading dataset...


Downloading...
From: https://drive.google.com/uc?id=1PTheun1CF8YQUrP7ueaEyDLcVyRPqID3
To: /content/data/drug_dataset.csv
100%|██████████| 11.3k/11.3k [00:00<00:00, 22.2MB/s]

✅ Dataset downloaded to data/drug_dataset.csv





In [36]:
# -------------------------------
# 1. Load dataset
# -------------------------------
data = pd.read_csv(file)
data.head()

   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  caa  thall  output
0   63    1   3     145   233    1        0       150     0      2.3    0    0      1       1
1   37    1   2     130   250    0        1       187     0      3.5    0    0      2       1
2   41    0   1     130   204    0        0       172     0      1.4    2    0      2       1
3   56    1   1     120   236    0        1       178     0      0.8    2    0      2       1
4   57    0   0     120   354    0        1       163     1      0.6    2    0      2       1


In [37]:
# Check for missing values
data.isnull().sum()
np.isnan(data).sum()

age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0


In [38]:
# Check data types
print(data.dtypes)

age           int64
sex           int64
cp            int64
trtbps        int64
chol          int64
fbs           int64
restecg       int64
thalachh      int64
exng          int64
oldpeak     float64
slp           int64
caa           int64
thall         int64
output        int64
dtype: object


In [39]:

# -------------------------------
# 2. Feature importance using Mutual Information
# -------------------------------
from sklearn.feature_selection import mutual_info_classif

a = data.drop("output", axis=1)  # features
b = data["output"]               # target

mi = mutual_info_classif(a, b)
mi_series = pd.Series(mi, index=a.columns).sort_values(ascending=False)
print(mi_series)


cp          0.187007
caa         0.152116
thall       0.137283
slp         0.081425
exng        0.074514
chol        0.072812
oldpeak     0.072283
thalachh    0.069315
sex         0.019844
restecg     0.018875
trtbps      0.006065
age         0.000000
fbs         0.000000
dtype: float64


In [40]:

# -------------------------------
# 3. Basic statistics
# -------------------------------
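# These describe() results are not printed, so they do not appear in the output below.
data['age'].describe()
data['trtbps'].describe()

# Compute the correlation matrix once and reuse it for each column of interest.
corr = data.corr()
print(corr['age'].sort_values(ascending=False))
print('--')
print(corr['trtbps'].sort_values(ascending=False))
print('--')
print(corr['fbs'].sort_values(ascending=False))
print('--')
print(corr['restecg'].sort_values(ascending=False))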


age         1.000000
trtbps      0.279351
caa         0.276326
chol        0.213678
oldpeak     0.210013
fbs         0.121308
exng        0.096801
thall       0.068001
cp         -0.068653
sex        -0.098447
restecg    -0.116211
slp        -0.168814
output     -0.225439
thalachh   -0.398522
Name: age, dtype: float64
--
trtbps      1.000000
age         0.279351
oldpeak     0.193216
fbs         0.177531
chol        0.123174
caa         0.101389
exng        0.067616
thall       0.062210
cp          0.047608
thalachh   -0.046698
sex        -0.056769
restecg    -0.114103
slp        -0.121475
output     -0.144931
Name: trtbps, dtype: float64
--
fbs         1.000000
trtbps      0.177531
caa         0.137979
age         0.121308
cp          0.094444
sex         0.045032
exng        0.025665
chol        0.013294
oldpeak     0.005747
thalachh   -0.008567
output     -0.028046
thall      -0.032019
slp        -0.059894
restecg    -0.084189
Name: fbs, dtype: float64
--
restecg     1.000000
output 

In [41]:
# -------------------------------
# 4. Select features (X) and target (y)
# -------------------------------
x = a[['cp', 'thall', 'caa', 'slp','exng', 'oldpeak', 'thalachh','chol', 'sex', 'age']].values
y = b.values


Scaling, Train/Test split

In [42]:
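# Normalize features
# Caveat: the scaler is fit on the full dataset before the split, so test-set
# statistics influence the scaling. Fitting on the training split only avoids this.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x)
X = scaler.transform(x)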

# Split dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Train set:', x_train.shape, y_train.shape)
print('Test set:', x_test.shape, y_test.shape)


Train set: (242, 10) (242,)
Test set: (61, 10) (61,)


Train multiple models and check accuracy

In [43]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Initialize models
models = {
    'LogisticRegression': LogisticRegression(C=0.01),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(),
    'DecisionTree': DecisionTreeClassifier()
}

# Train and evaluate models
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name}: Accuracy = {acc:.2f}")


LogisticRegression: Accuracy = 0.89
KNN: Accuracy = 0.85
SVM: Accuracy = 0.90
DecisionTree: Accuracy = 0.80


In [44]:
from sklearn.metrics import f1_score, fbeta_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

def full_evaluation(model, X_test, y_test):
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    f2 = fbeta_score(y_test, y_pred, beta=2)
    cm = confusion_matrix(y_test, y_pred)
    FN = cm[1][0]

    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Accuracy:", round(acc, 4))
    print("F1 Score:", round(f1, 4))
    print("F2 Score:", round(f2, 4))
    print("False Negatives (FN):", FN)

    plt.figure()
    # confusion_matrix orders labels ascending (0, then 1), so tick labels must match.
    plot_confusion_matrix(cm, classes=['output=0','output=1'], normalize=False)
    plt.show()

In [45]:
from sklearn.model_selection import GridSearchCV

# SVM grid search
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search.fit(x_train, y_train)
print(grid_search.best_params_)

# Decision Tree grid search
dt = DecisionTreeClassifier()
params_grid = {'max_depth': range(1,21), 'criterion': ['gini', 'entropy']}
gs = GridSearchCV(dt, params_grid, cv=5)
gs.fit(x_train, y_train)
print(gs.best_params_)


Fitting 5 folds for each of 24 candidates, totalling 120 fits
{'C': 1, 'gamma': 'scale', 'kernel': 'poly'}
{'criterion': 'gini', 'max_depth': 4}
