## Assignment 3 – Using a Library (Scikit‑learn)

- Task: Use `scikit-learn` to build two models: (i) **Binary Logistic Regression** (Graduate vs Non‑graduate) and (ii) **Multinomial Logistic Regression** for 3 classes; evaluate both with accuracy, a confusion matrix, and a classification report.
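
As a rough illustration of how the two models named in the task differ, the sketch below turns linear scores into probabilities with a sigmoid (binary case) and a softmax (multinomial case). The score values are made up for illustration and are not taken from the dataset; the helper names `sigmoid` and `softmax` are hypothetical and not part of `scikit-learn`.

```python
import numpy as np

def sigmoid(z):
    """Map a single linear score to P(positive class)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Map one linear score per class to probabilities that sum to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical linear scores (w·x + b) for one sample; the values are invented.
binary_score = 0.8
p_graduate = sigmoid(binary_score)

multiclass_scores = np.array([0.2, -0.5, 1.1])  # dropout, enrolled, graduate
probs = softmax(multiclass_scores)

# With only two classes, softmax over [0, z] reduces to sigmoid(z).
two_class = softmax(np.array([0.0, binary_score]))

print("P(graduate), binary model:", p_graduate)
print("Class probabilities, multinomial model:", probs)
print("Predicted class index:", int(np.argmax(probs)))
```

The predicted class is the one with the largest probability, which is what `predict` returns for a fitted logistic regression model.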

In [1]:

# (1) Load and prepare the data
from ucimlrepo import fetch_ucirepo
import numpy as np, pandas as pd

ds = fetch_ucirepo(id=697)
X = ds.data.features.copy()
y = ds.data.targets.copy()

target_col = 'Target' if 'Target' in y.columns else y.columns[0]
y = y[target_col].astype(str).str.lower()

X_mat = X.select_dtypes(include=[np.number]).fillna(0).values.astype(float)

# Train/Test split
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(X_mat, y, test_size=0.2, random_state=42, stratify=y)

# (2) Standardization + Logistic Regression (binary & multiclass)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# 2a) Binary: Graduate vs Non-graduate
ytr_bin = np.where(ytr.str.startswith('gradu'), 'graduate', 'non-graduate')
yte_bin = np.where(yte.str.startswith('gradu'), 'graduate', 'non-graduate')

pipe_bin = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs"))
])
pipe_bin.fit(Xtr, ytr_bin)
pred_bin = pipe_bin.predict(Xte)

print("=== Binary (Graduate vs Non-graduate) ===")
print("Accuracy:", accuracy_score(yte_bin, pred_bin))
print("Confusion Matrix:\n", confusion_matrix(yte_bin, pred_bin))
print(classification_report(yte_bin, pred_bin))

# 2b) Multiclass: Dropout vs Enrolled vs Graduate
pipe_mc = Pipeline([
    ("scaler", StandardScaler()),
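    # With lbfgs, a multiclass target is fit as a multinomial (softmax) model by default;
    # the explicit multi_class argument is deprecated since scikit-learn 1.5.
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs"))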
])
pipe_mc.fit(Xtr, ytr)
pred_mc = pipe_mc.predict(Xte)

print("\n=== Multiclass (Dropout / Enrolled / Graduate) ===")
print("Accuracy:", accuracy_score(yte, pred_mc))
print("Confusion Matrix:\n", confusion_matrix(yte, pred_mc))
print(classification_report(yte, pred_mc))


=== Binary (Graduate vs Non-graduate) ===
Accuracy: 0.8542372881355932
Confusion Matrix:
 [[389  53]
 [ 76 367]]
              precision    recall  f1-score   support

    graduate       0.84      0.88      0.86       442
non-graduate       0.87      0.83      0.85       443

    accuracy                           0.85       885
   macro avg       0.86      0.85      0.85       885
weighted avg       0.86      0.85      0.85       885


=== Multiclass (Dropout / Enrolled / Graduate) ===
Accuracy: 0.768361581920904
Confusion Matrix:
 [[218  29  37]
 [ 43  53  63]
 [ 14  19 409]]
              precision    recall  f1-score   support

     dropout       0.79      0.77      0.78       284
    enrolled       0.52      0.33      0.41       159
    graduate       0.80      0.93      0.86       442

    accuracy                           0.77       885
   macro avg       0.71      0.68      0.68       885
weighted avg       0.75      0.77      0.75       885
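
To make the link between the confusion matrix and the classification report explicit, the sketch below recomputes accuracy and per-class precision, recall, and F1 from the multiclass matrix printed above. The matrix values are copied from that output. The helper `metrics_from_confusion` is a hypothetical name introduced here, and the code assumes the `scikit-learn` convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Multiclass confusion matrix copied from the output above.
# Rows = true class, columns = predicted class; order: dropout, enrolled, graduate.
cm = np.array([[218, 29, 37],
               [43, 53, 63],
               [14, 19, 409]])
labels = ["dropout", "enrolled", "graduate"]

def metrics_from_confusion(cm):
    """Return accuracy and per-class precision, recall, F1 from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly classified samples per class
    precision = tp / cm.sum(axis=0)  # column sums = samples predicted as each class
    recall = tp / cm.sum(axis=1)     # row sums = true samples of each class (support)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = metrics_from_confusion(cm)
print(f"accuracy: {accuracy:.4f}")
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>9}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Read this way, the low recall for `enrolled` comes directly from its row: only 53 of the 159 enrolled students are predicted as enrolled, while 63 are predicted as graduates.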



