# Heart Disease ML Pipeline (UCI) — Notebooks

These notebooks implement a full pipeline on the **UCI Heart Disease** dataset, loaded with the `ucimlrepo` package:

```python
from ucimlrepo import fetch_ucirepo
heart_disease = fetch_ucirepo(id=45)
X = heart_disease.data.features
y = heart_disease.data.targets
```

> Bonus items (Streamlit/Ngrok) are intentionally **omitted** per the request.

## 03 — Feature Selection

Techniques:
- Chi-Square test (requires non-negative features, so each feature is MinMax-scaled to [0, 1] first)
- RFE with Logistic Regression
- Random Forest feature importance

Saves a ranked feature table and a suggested subset list to `results/`.
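
The notebook cell scales the full feature matrix once before computing chi-square scores. When chi-square selection feeds a model that is evaluated with cross-validation, one option is to place the scaler and the selector inside a scikit-learn `Pipeline`, so both are refit on the training part of each fold. The following is a minimal sketch of that idea, not the notebook's own code: the data is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, `chi2_pipeline`, and the choice `k=5` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: NOT the heart disease features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

chi2_pipeline = Pipeline([
    ("scale", MinMaxScaler()),           # rescales training features to [0, 1]
    ("select", SelectKBest(chi2, k=5)),  # keeps the 5 features with the highest chi2 score
    ("clf", LogisticRegression(max_iter=1000)),
])

# The scaler and selector are refit on the training part of every fold.
cv_scores = cross_val_score(chi2_pipeline, X_demo, y_demo, cv=5)

# Fit once on all demo data to inspect which columns were kept.
chi2_pipeline.fit(X_demo, y_demo)
kept_mask = chi2_pipeline.named_steps["select"].get_support()
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
print("kept feature indices:", np.flatnonzero(kept_mask))
```

Keeping the selector inside the pipeline means the held-out fold never influences which features are chosen, which is why this layout is often preferred for model evaluation even though a one-off ranking table does not need it.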

```python
import os

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Load processed data
proc_df = pd.read_csv("../data/processed_full.csv")
feature_names = [c for c in proc_df.columns if c != "target"]
X = proc_df[feature_names].values
y = proc_df["target"].values

# Make sure the output directory exists before writing to it
os.makedirs("../results", exist_ok=True)

# Chi-square (requires non-negative inputs, so rescale each feature to [0, 1])
X_pos = MinMaxScaler().fit_transform(X)
chi2_vals, p_vals = chi2(X_pos, y)

chi_df = pd.DataFrame({
    "feature": feature_names,
    "chi2": chi2_vals,
    "p_value": p_vals
}).sort_values("chi2", ascending=False)

# RFE with Logistic Regression: keep at most 12 features (rank 1 = selected)
lr = LogisticRegression(max_iter=500)
rfe = RFE(lr, n_features_to_select=min(12, X.shape[1]))
rfe.fit(X, y)
rfe_rank = rfe.ranking_
rfe_df = pd.DataFrame({
    "feature": feature_names,
    "rfe_rank": rfe_rank,
    "selected": rfe.support_
}).sort_values("rfe_rank")

# Random Forest importance
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X, y)
rf_importances = rf.feature_importances_

rf_df = pd.DataFrame({
    "feature": feature_names,
    "rf_importance": rf_importances
}).sort_values("rf_importance", ascending=False)

# Merge the three score tables on feature name
merged = chi_df.merge(rfe_df, on="feature", how="outer").merge(rf_df, on="feature", how="outer")
merged.to_csv("../results/feature_selection_all_scores.csv", index=False)

# Suggest a final subset: union of the top-k RF features and the RFE-selected features.
# Note: top_k is capped at the feature count, so with 15 or fewer features
# the union contains every feature.
top_k = min(15, X.shape[1])
top_rf = set(rf_df.head(top_k)["feature"])
sel_rfe = set(rfe_df[rfe_df["selected"]]["feature"])
suggested = sorted(top_rf.union(sel_rfe))

with open("../results/selected_features.txt", "w") as f:
    for feat in suggested:
        f.write(f"{feat}\n")

print(f"Saved feature rankings to ../results and suggested subset ({len(suggested)} features) to selected_features.txt")
```

Output:

```text
Saved feature rankings to ../results and suggested subset (13 features) to selected_features.txt
```
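
Because `top_k` is capped at the number of columns, the union rule keeps every feature whenever the table has 15 or fewer of them, so the suggested subset can end up being the full feature set. If a stricter shortlist is wanted, one possible alternative (an assumption, not something the notebook does) is to turn the three scores into ranks and sort by their mean. The sketch below uses a toy table with made-up values and hypothetical feature names `a` to `d`; its column names mirror those of `feature_selection_all_scores.csv`.

```python
import pandas as pd

# Toy scores with hypothetical values; the real table would be read from
# ../results/feature_selection_all_scores.csv.
scores = pd.DataFrame({
    "feature": ["a", "b", "c", "d"],
    "chi2": [10.0, 2.0, 7.0, 0.5],
    "rfe_rank": [1, 3, 1, 2],
    "rf_importance": [0.40, 0.10, 0.35, 0.15],
})

# Convert every score to a rank where 1 is best; ties share their average rank.
ranks = pd.DataFrame({
    "chi2": scores["chi2"].rank(ascending=False),            # higher chi2 is better
    "rfe": scores["rfe_rank"].rank(ascending=True),          # lower RFE rank is better
    "rf": scores["rf_importance"].rank(ascending=False),     # higher importance is better
})

# Average the three ranks and order features from best to worst.
scores["mean_rank"] = ranks.mean(axis=1)
consensus = scores.sort_values("mean_rank")["feature"].tolist()
top_2 = consensus[:2]
print(consensus)  # ['a', 'c', 'd', 'b']
```

Mean rank treats the three methods as equally trustworthy; weighting them differently, or requiring agreement between at least two methods, are equally reasonable design choices.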
