# ❤️ Heart Disease Prediction
This notebook contains a complete machine learning pipeline for predicting heart disease.

In [221]:
# 📦 Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, RocCurveDisplay


# ❤️ Project: Heart Disease Prediction

## 🎯 Goal
Build a model that predicts, from a patient's diagnostic data, whether that person has heart disease, and identify the factors that raise the risk of heart disease.

## 🧾 Background
This dataset comes from the UCI Machine Learning Repository and is based on data from several medical institutions, including Cleveland. It contains 13 core features used in the clinical diagnosis of heart disease. The final target variable indicates whether the person has heart disease (`1 = disease`, `0 = healthy`).

## 🧠 Problem type
This is a **binary classification** problem:
- **1 → Heart disease**
- **0 → Healthy**

## ❓ Key questions to answer:
1. Which features most strongly affect the probability of heart disease?
2. How much do variables such as age, sex, blood pressure, and cholesterol matter?
3. Are stress test results, ECG readings, and exercise-induced angina associated with heart disease?
4. Can high-risk individuals be identified from the model's output?
5. Can we build a reliable model that predicts correctly?

## 📏 Success criteria:
- **Model evaluation metrics** on the test set:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - ROC AUC

- The ability to **interpret the results** and determine feature importance
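
To make the metrics above concrete, here is a minimal plain-Python sketch of how each one follows from predicted labels and scores. The labels and scores are made-up illustration values, and the helper names (`confusion_counts`, `classification_metrics`, `roc_auc`) are hypothetical; the notebook itself relies on `sklearn.metrics`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = disease."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

metrics = classification_metrics(y_true, y_pred)
auc_value = roc_auc(y_true, y_score)
print(metrics, auc_value)
```

In a screening setting, recall (the share of diseased patients the model catches) is often the metric to watch alongside accuracy, because a missed case is usually costlier than a false alarm.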

## 🚧 Limitations and caveats:
- The dataset is relatively small: the Cleveland subset has about 300 records, and the combined file loaded below has 920 rows
- Some values may be missing (`NaN`, `?`)
- Overfitting must be guarded against
- The medical context should be explained clearly when the results are used
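
As a small illustration of the missing-value caveat, the sketch below normalizes placeholder markers such as `?` to a single missing value before counting gaps. It uses only the standard library; the sample rows and the `parse_number` helper are hypothetical and not taken from the dataset. With pandas, the `na_values` argument of `read_csv` serves the same purpose.

```python
import csv
import io

# Strings that should all be treated as "missing"
MISSING_MARKERS = {"", "?", "NaN", "nan"}


def parse_number(cell):
    """Convert a raw CSV cell to float, or None if it is a missing marker."""
    cell = cell.strip()
    return None if cell in MISSING_MARKERS else float(cell)


# Hypothetical sample rows: one '?' marker and one empty cell
raw = "age,chol,ca\n63,233,0\n67,?,3\n41,204,\n"

rows = [
    {column: parse_number(value) for column, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

# Count missing values per column, similar in spirit to df.isnull().sum()
missing_per_column = {
    column: sum(row[column] is None for row in rows) for column in rows[0]
}
print(missing_per_column)
```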

## 📦 Required deliverables:
1. A cleaned, processed dataset
2. **Exploratory Data Analysis**: plots, correlation matrix, distributions
3. Feature engineering: encode categorical features, apply scaling
4. Trained ML models (Logistic Regression, Decision Tree, SVM, Neural Net)
5. A report interpreting the model results: which features mattered most and which groups are at high risk

## 1. Load the data and run EDA

In [233]:
import pandas as pd
df = pd.read_csv('../data/raw/heart_disease_uci.csv')
df.columns = df.columns.str.lower()

print("Rows, columns:", df.shape)
print("\nFirst rows:")
print(df.head())

print("\nGeneral dataset info:")
df.info()  # info() prints its own summary and returns None, so no print() wrapper


Rows, columns: (920, 16)

First rows:
   id  age     sex    dataset               cp  trestbps   chol    fbs  \
0   1   63    Male  Cleveland   typical angina     145.0  233.0   True   
1   2   67    Male  Cleveland     asymptomatic     160.0  286.0  False   
2   3   67    Male  Cleveland     asymptomatic     120.0  229.0  False   
3   4   37    Male  Cleveland      non-anginal     130.0  250.0  False   
4   5   41  Female  Cleveland  atypical angina     130.0  204.0  False   

          restecg  thalch  exang  oldpeak        slope   ca  \
0  lv hypertrophy   150.0  False      2.3  downsloping  0.0   
1  lv hypertrophy   108.0   True      1.5         flat  3.0   
2  lv hypertrophy   129.0   True      2.6         flat  2.0   
3          normal   187.0  False      3.5  downsloping  0.0   
4  lv hypertrophy   172.0  False      1.4    upsloping  0.0   

                thal  num  
0       fixed defect    0  
1             normal    2  
2  reversable defect    1  
3             normal

In [234]:

print("\nSummary statistics:")
print(df.describe())

print("\nMissing values per column:")
print(df.isnull().sum())


Summary statistics:
               id         age    trestbps        chol      thalch     oldpeak  \
count  920.000000  920.000000  861.000000  890.000000  865.000000  858.000000   
mean   460.500000   53.510870  132.132404  199.130337  137.545665    0.878788   
std    265.725422    9.424685   19.066070  110.780810   25.926276    1.091226   
min      1.000000   28.000000    0.000000    0.000000   60.000000   -2.600000   
25%    230.750000   47.000000  120.000000  175.000000  120.000000    0.000000   
50%    460.500000   54.000000  130.000000  223.000000  140.000000    0.500000   
75%    690.250000   60.000000  140.000000  268.000000  157.000000    1.500000   
max    920.000000   77.000000  200.000000  603.000000  202.000000    6.200000   

               ca         num  
count  309.000000  920.000000  
mean     0.676375    0.995652  
std      0.935653    1.142693  
min      0.000000    0.000000  
25%      0.000000    0.000000  
50%      0.000000    1.000000  
75%      1.000000    2.00
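
The summary statistics above report a minimum of 0 for both `trestbps` and `chol`. A resting blood pressure or serum cholesterol of 0 is not physiologically plausible, so these zeros are probably missing values recorded as 0 (an assumption; the source file does not document this). The sketch below, with hypothetical values and helper names, shows how treating zeros as missing changes the median used for imputation.

```python
from statistics import median


def zeros_to_missing(values):
    """Replace implausible zeros with None so they are treated as missing."""
    return [None if v == 0 else v for v in values]


def fill_with_median(values):
    """Fill None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical cholesterol readings: two zeros and one true gap
chol = [233.0, 0.0, 286.0, None, 250.0, 0.0, 204.0]

# Median if zeros are kept as real measurements
naive_median = median(v for v in chol if v is not None)

# Median-imputed column after zeros are treated as missing
cleaned = fill_with_median(zeros_to_missing(chol))

print(naive_median)
print(cleaned)
```

Keeping the zeros pulls the median down, and every imputed value inherits that bias, so it may be worth checking how many zeros each column holds before filling gaps.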

## 2. Preprocessing

In [235]:


# 🎯 Create the binary target column: 1 if num > 0, else 0
df["target"] = (df["num"] > 0).astype(int)

# 🧹 Drop the num and id columns; both should be removed early
df.drop(["num", "id"], axis=1, inplace=True, errors="ignore")

# 🧠 Group columns by dtype
numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "bool"]).columns.tolist()

# 🎯 Exclude the target from the numeric list
numerical_cols = [col for col in numerical_cols if col != "target"]

# 🔁 Fill missing numeric values with the column median
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())

# 🔁 Fill missing categorical values with the column mode
for col in categorical_cols:
    mode = df[col].mode()
    if not mode.empty:
        df[col] = df[col].fillna(mode[0])

    




In [236]:
print("\nFirst rows:")
print(df.head())


First rows:
   age     sex    dataset               cp  trestbps   chol    fbs  \
0   63    Male  Cleveland   typical angina     145.0  233.0   True   
1   67    Male  Cleveland     asymptomatic     160.0  286.0  False   
2   67    Male  Cleveland     asymptomatic     120.0  229.0  False   
3   37    Male  Cleveland      non-anginal     130.0  250.0  False   
4   41  Female  Cleveland  atypical angina     130.0  204.0  False   

          restecg  thalch  exang  oldpeak        slope   ca  \
0  lv hypertrophy   150.0  False      2.3  downsloping  0.0   
1  lv hypertrophy   108.0   True      1.5         flat  3.0   
2  lv hypertrophy   129.0   True      2.6         flat  2.0   
3          normal   187.0  False      3.5  downsloping  0.0   
4  lv hypertrophy   172.0  False      1.4    upsloping  0.0   

                thal  target  
0       fixed defect       0  
1             normal       1  
2  reversable defect       1  
3             normal       0  
4             normal       0 
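
One caveat about the imputation in this step: the medians and modes are computed over all 920 rows, before any train/test split, so information from the eventual test rows influences the filled values. The training pipeline in step 3 does include `SimpleImputer` steps, but since the saved CSV no longer has gaps, they have nothing left to fill. A minimal sketch of the alternative, learning the fill value on the training rows only, is shown below; the values and helper names are hypothetical.

```python
from statistics import median


def fit_median(train_values):
    """Learn the fill value from the training rows only."""
    return median(v for v in train_values if v is not None)


def apply_fill(values, fill):
    """Fill gaps in any split with the value learned on the training split."""
    return [fill if v is None else v for v in values]


# Hypothetical resting blood pressure values, already split
train = [120.0, None, 140.0, 130.0, 160.0]
test = [None, 150.0]

fill_value = fit_median(train)             # uses training rows only
train_filled = apply_fill(train, fill_value)
test_filled = apply_fill(test, fill_value)  # test rows never affect fill_value

print(fill_value, train_filled, test_filled)
```

On a dataset of this size the difference is likely small, but keeping imputation inside the cross-validated pipeline makes the reported scores easier to trust.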

In [237]:
print("\nMissing values per column:")
print(df.isnull().sum())

print("Rows, columns:", df.shape)


Missing values per column:
age         0
sex         0
dataset     0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalch      0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
Rows, columns: (920, 15)


In [238]:
import os
os.makedirs("../data/processed", exist_ok=True)
df.to_csv('../data/processed/heart_disease_uci_cleaned.csv', index=False)

## Data visualization


In [241]:
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Save the figures to a folder
output_dir = "../outputs/figures"
os.makedirs(output_dir, exist_ok=True)

# 1. Class balance: patients with and without heart disease
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=df)
plt.title('Heart Disease Presence (0 = Healthy, 1 = Diseased)')
plt.savefig(f'{output_dir}/disease_count.png')
plt.close()

# 2. Disease status by age
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='age', hue='target', bins=20, kde=True)
plt.title('Age Distribution by Heart Disease')
plt.savefig(f'{output_dir}/age_distribution.png')
plt.close()

# 3. Disease status by sex (the column holds 'Male'/'Female' strings)
plt.figure(figsize=(6, 4))
sns.countplot(x='sex', hue='target', data=df)
plt.title('Heart Disease by Gender')
plt.savefig(f'{output_dir}/disease_by_gender.png')
plt.close()

# 4. Disease status by chest pain type
plt.figure(figsize=(8, 5))
sns.countplot(x='cp', hue='target', data=df)
plt.title('Heart Disease by Chest Pain Type')
plt.savefig(f'{output_dir}/disease_by_cp.png')
plt.close()

# 5. thal category (normal / fixed defect / reversable defect) vs. disease
plt.figure(figsize=(8, 5))
sns.countplot(x='thal', hue='target', data=df)
plt.title('Heart Disease by Thal Category')
plt.savefig(f'{output_dir}/disease_by_thal.png')
plt.close()

# 6. ST depression (oldpeak) distribution
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='oldpeak', hue='target', bins=30, kde=True)
plt.title('Oldpeak (ST Depression) by Heart Disease')
plt.savefig(f'{output_dir}/oldpeak_distribution.png')
plt.close()

# 7. Maximum heart rate distribution (the column is named thalch in this file)
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='thalch', hue='target', bins=30, kde=True)
plt.title('Max Heart Rate (Thalach) by Disease')
plt.savefig(f'{output_dir}/thalach_distribution.png')
plt.close()

# 8. Correlation matrix (numeric variables plus the target)
corr_cols = ['age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'target']
corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numeric Features')
plt.savefig(f'{output_dir}/correlation_matrix.png')
plt.close()
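
The heatmap in plot 8 is built from `df.corr()`, which computes the Pearson correlation coefficient by default. As a reading aid, the sketch below computes that coefficient by hand for one feature against the binary target. The `oldpeak` and `target` values are hypothetical, and `pearson` is an illustrative helper rather than part of the notebook.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical values: higher ST depression tends to go with target = 1
oldpeak = [0.0, 0.5, 1.0, 2.0, 3.0, 0.2]
target = [0, 0, 1, 1, 1, 0]

r = pearson(oldpeak, target)
print(round(r, 3))
```

The coefficient only captures linear association, so a value near 0 in the heatmap does not rule out a non-linear relationship, and categorical columns such as `cp` or `thal` are not covered by it at all.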


## 3. Model training
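
The training cell below compares models with `StratifiedKFold(n_splits=5, shuffle=True)`. As a rough illustration of what stratification means, the following sketch assigns sample indices to folds so that each fold keeps about the same class ratio. It is a simplified stand-in, not scikit-learn's actual algorithm; the 55/45 class split and the `stratified_folds` helper are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample index to a fold while keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Hypothetical labels: 55 diseased (1) and 45 healthy (0)
labels = [1] * 55 + [0] * 45
folds = stratified_folds(labels)

for number, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {number}: size={len(fold)}, positives={positives}")
```

Each fold serves once as the validation set while the other four are used for fitting, which is why every model in the comparison gets five accuracy scores.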

In [None]:


# Re-execute after kernel reset: prepare environment again
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import joblib
from datetime import datetime
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.exceptions import ConvergenceWarning
import warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)


os.makedirs("../outputs/figures", exist_ok=True)
os.makedirs("../outputs/reports", exist_ok=True)
os.makedirs("../models", exist_ok=True)

# Load the cleaned dataset written at the end of preprocessing
df = pd.read_csv("../data/processed/heart_disease_uci_cleaned.csv")

# Split the feature columns by dtype for the preprocessing pipeline
numerical_features = df.select_dtypes(include=["int64", "float64"]).drop(columns=["target"]).columns.tolist()
categorical_features = df.select_dtypes(include=["object", "bool"]).columns.tolist()

# Preprocessing
num_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
cat_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", drop="first"))
])
preprocessor = ColumnTransformer([
    ("num", num_transformer, numerical_features),
    ("cat", cat_transformer, categorical_features)
])

# Models
RANDOM_STATE = 42
models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    "GradientBoosting": GradientBoostingClassifier(random_state=RANDOM_STATE),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
    "DecisionTree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "SVC": SVC(probability=True, random_state=RANDOM_STATE),
    "MLPClassifier": MLPClassifier(random_state=RANDOM_STATE, max_iter=1000)
}

X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
pipelines = {}
cv_results = {}

for name, model in models.items():
    pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", model)
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="accuracy")
    pipelines[name] = pipe
    cv_results[name] = {
        "mean_accuracy": scores.mean(),
        "std_accuracy": scores.std()
    }
# Pick the model with the best mean cross-validated accuracy
best_model_name = max(cv_results.items(), key=lambda x: x[1]["mean_accuracy"])[0]
best_pipeline = pipelines[best_model_name]

# Define the hyperparameter grid for the selected model before running the search
param_grid = {}
if best_model_name == "RandomForest":
    param_grid = {
        "classifier__n_estimators": [100, 200],
        "classifier__max_depth": [None, 5, 10],
    }
elif best_model_name == "GradientBoosting":
    param_grid = {
        "classifier__n_estimators": [100, 200],
        "classifier__learning_rate": [0.01, 0.1],
        "classifier__max_depth": [3, 5]
    }
elif best_model_name == "LogisticRegression":
    param_grid = {
        "classifier__C": [0.01, 0.1, 1.0, 10.0]
    }
elif best_model_name == "DecisionTree":
    param_grid = {
        "classifier__max_depth": [None, 5, 10, 15],
        "classifier__min_samples_split": [2, 5, 10]
    }
elif best_model_name == "SVC":
    param_grid = {
        "classifier__C": [0.1, 1, 10],
        "classifier__kernel": ['linear', 'rbf']
    }
elif best_model_name == "MLPClassifier":
    param_grid = {
        "classifier__hidden_layer_sizes": [(50,), (100,), (50, 50)],
        "classifier__alpha": [0.0001, 0.001]
    }

# Tune the selected pipeline with cross-validation on the training set
grid_search = GridSearchCV(best_pipeline, param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Final evaluation on the held-out test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Write the classification report for the tuned model
report_path = "../outputs/reports/classification_report.txt"
with open(report_path, "w") as f:
    f.write("Accuracy: {:.4f}\n".format(accuracy))
    f.write(classification_report(y_test, y_pred))
    f.write("\nConfusion Matrix:\n")
    f.write(np.array2string(confusion_matrix(y_test, y_pred)))

# Save model
joblib.dump(best_model, "../models/heart_disease_best_model.pkl")

# Feature importance (only for models that expose feature_importances_)
if hasattr(best_model["classifier"], "feature_importances_"):
    importance = best_model["classifier"].feature_importances_
    if hasattr(best_model["preprocessor"], "get_feature_names_out"):
        feature_names = best_model["preprocessor"].get_feature_names_out()
    else:
        # Fall back to generic names, one per transformed feature
        feature_names = [f"feature_{i}" for i in range(len(importance))]

    fi_df = pd.DataFrame({"Feature": feature_names, "Importance": importance})
    fi_df.sort_values("Importance", ascending=False, inplace=True)

    plt.figure(figsize=(12, 8))
    sns.barplot(x="Importance", y="Feature", data=fi_df.head(15))
    plt.title("Top Feature Importances")
    plt.tight_layout()
    plt.savefig("../outputs/figures/feature_importance.png")
    plt.close()

# ROC Curve
if hasattr(best_model["classifier"], "predict_proba"):
    y_prob = best_model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})", lw=2)
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()
    plt.savefig("../outputs/figures/roc_curve.png")
    plt.close()

# Summary
summary_txt = f"""Best Model: {best_model_name}
Test Accuracy: {accuracy:.4f}
Best Parameters: {grid_search.best_params_}
"""

with open("../outputs/reports/summary.txt", "w") as f:
    f.write(summary_txt)

# ✅ Model trained successfully. The model and reports have been saved.
best_model_name, accuracy, grid_search.best_params_



('LogisticRegression', 0.842391304347826, {'classifier__C': 1.0})
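
The `classifier__C` key in the result follows scikit-learn's `<step name>__<parameter>` convention for addressing a parameter of a step inside a `Pipeline`. The sketch below is a plain-Python approximation of how a parameter grid expands into candidate settings; `expand_grid` and `split_step` are hypothetical helpers, not scikit-learn functions, and the grid is the SVC grid from the cell above.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of the values in a parameter grid."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def split_step(param_name):
    """Split 'classifier__C' into the pipeline step and the parameter name."""
    step, _, param = param_name.partition("__")
    return step, param


svc_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

candidates = expand_grid(svc_grid)
n_cv_fits = len(candidates) * 5  # one fit per candidate per CV fold (5 folds)

print(len(candidates), n_cv_fits)
print(split_step("classifier__C"))

# An empty grid still yields one candidate: the pipeline with its defaults
default_only = expand_grid({})
```

Counting candidates this way gives a quick estimate of search cost before launching it; on a dataset this small all of the grids above are cheap, but the count grows multiplicatively with every added parameter.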