# Step 4 — Model Training (Logistic Regression & Random Forest)

**Goals**
- Load the Cleveland dataset and reapply the preprocessing from Step 2
- Split data into train/test sets
- Train Logistic Regression and Random Forest
- Evaluate with **accuracy**
- Compare models and select the better one


In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


In [2]:
# 4.1 — Load the raw Cleveland data and repeat the Step 2 preprocessing
df = pd.read_csv("../data/cleveland.csv", header=None)
df.columns = [
    "age","sex","cp","trestbps","chol","fbs","restecg",
    "thalach","exang","oldpeak","slope","ca","thal","target"
]
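# The raw file marks missing values with "?"; turn them into NaN and make every column numeric
df = df.replace("?", pd.NA)
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Impute missing values.
# Caveat: the imputers and the scaler below are fit on the full dataset, before the
# train/test split, so test-set statistics leak into the preprocessing.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Mode imputation for the categorical codes ca and thal
mode_imputer = SimpleImputer(strategy="most_frequent")
df[["ca","thal"]] = mode_imputer.fit_transform(df[["ca","thal"]])

# Median imputation for any gaps left in the remaining feature columns
median_imputer = SimpleImputer(strategy="median")
num_cols = df.columns.drop("target")
df[num_cols] = median_imputer.fit_transform(df[num_cols])

# Scale every feature to the [0, 1] range
scaler = MinMaxScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# Binary target: 0 = no disease, 1 = any disease (original values 1-4)
df["target_bin"] = (df["target"] > 0).astype(int)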

X = df.drop(columns=["target", "target_bin"])
y = df["target_bin"]

print("X shape:", X.shape)
print("y distribution:\n", y.value_counts())


X shape: (303, 13)
y distribution:
 0    164
1    139
Name: target_bin, dtype: int64


In [3]:
# 4.2 — Train/test split (80/20, stratified so both sets keep the class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train size:", X_train.shape, " Test size:", X_test.shape)



Train size: (242, 13)  Test size: (61, 13)
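
Cell 4.1 fits the imputers and the scaler on all 303 rows before this split, so the test rows contribute to the imputation values and the scaling range. One way to avoid that is to wrap preprocessing and model in a scikit-learn `Pipeline` that is fit on the training rows only. The sketch below illustrates the idea on a synthetic stand-in for the data: the frame `demo`, its four columns, the random labels, and the variable names are all assumptions made for illustration, not the notebook's actual data or results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a few Cleveland features (hypothetical values, illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(30, 80, size=100).astype(float),
    "chol": rng.integers(150, 400, size=100).astype(float),
    "ca": rng.integers(0, 4, size=100).astype(float),
    "thal": rng.choice([3.0, 6.0, 7.0], size=100),
})
demo.loc[[3, 17], "ca"] = np.nan   # mimic missing entries in ca
demo.loc[[42], "thal"] = np.nan    # and in thal
labels = pd.Series(rng.integers(0, 2, size=100))

# Split first, so nothing is learned from the test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    demo, labels, test_size=0.2, random_state=42, stratify=labels
)

# Mode imputation for the categorical codes, median for the rest
impute = ColumnTransformer([
    ("mode", SimpleImputer(strategy="most_frequent"), ["ca", "thal"]),
    ("median", SimpleImputer(strategy="median"), ["age", "chol"]),
])

model = Pipeline([
    ("impute", impute),
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns imputation values and scaling ranges from the training rows only;
# score() reuses those learned values when transforming the test rows
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.4f}")
```

Because the labels here are random, the printed accuracy carries no meaning; the point is only the order of operations. Applying the same structure to the real data would likely change the accuracies reported in this notebook slightly.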


In [4]:
# 4.3 — Train Logistic Regression & Random Forest
# Logistic Regression
log = LogisticRegression(max_iter=1000)
log.fit(X_train, y_train)
log_pred = log.predict(X_test)
log_acc = accuracy_score(y_test, log_pred)

# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)

print(f"Logistic Regression Accuracy: {log_acc:.4f}")
print(f"Random Forest Accuracy:      {rf_acc:.4f}")

# Pick the model with the higher test accuracy (a tie goes to Random Forest)
best = "Random Forest" if rf_acc >= log_acc else "Logistic Regression"
print(f"\nSelected model (by accuracy): {best}")


Logistic Regression Accuracy: 0.8525
Random Forest Accuracy:      0.9016

Selected model (by accuracy): Random Forest


## 4.4 — Interpretation

- **Logistic Regression Accuracy:** 0.8525  
- **Random Forest Accuracy:** 0.9016  
- **Selected model (by accuracy):** Random Forest  
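
With only 61 test samples, each accuracy figure corresponds to a small count of correct predictions. The sketch below converts the reported accuracies back into counts and attaches an approximate 95% Wilson score interval to each, as a rough indication of how much the numbers could move on a different split. The helper `wilson_interval` and the `results` dictionary are illustrative names, not part of the notebook.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion correct / n."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

n_test = 61  # test-set size printed by the split in 4.2
reported = {"Logistic Regression": 0.8525, "Random Forest": 0.9016}

results = {}
for name, acc in reported.items():
    correct = round(acc * n_test)  # recover the number of correct predictions
    results[name] = (correct, wilson_interval(correct, n_test))
    low, high = results[name][1]
    print(f"{name}: {correct}/{n_test} correct, 95% interval ~ [{low:.3f}, {high:.3f}]")
```

The two intervals overlap, which suggests that a gap of three test samples is too small to rank the models with confidence on this split alone.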

**Notes**
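- Random Forest outperformed Logistic Regression by about 4.9 percentage points (0.9016 vs. 0.8525), which corresponds to 55 vs. 52 correct predictions on the 61-sample test set.
- Since the dataset is fairly balanced (164 negative vs. 139 positive cases, roughly 54% vs. 46%), accuracy is a reasonable metric here.
- However, medical prediction tasks should also be evaluated with:
  - **Precision**: the share of predicted positives that are truly positive (penalizes false positives)
  - **Recall**: the share of actual positives that are detected (penalizes false negatives, which is very important in healthcare)
  - **F1-score**: the harmonic mean of precision and recall
- Next step: extend the evaluation with these metrics to check that the ranking holds beyond accuracy.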
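
To make the three metrics listed above concrete, here is a minimal sketch that computes them from true-positive, false-positive, and false-negative counts. The function `binary_metrics` and the toy labels are made up for illustration; they are not the notebook's predictions.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disease correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy flagged as diseased
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # disease missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels for illustration only
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.60 recall=0.75 f1=0.67
```

In the notebook itself, `precision_score`, `recall_score`, `f1_score`, or `classification_report` from `sklearn.metrics` can be applied to `y_test` together with `log_pred` or `rf_pred` to obtain the same quantities.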
