# ML SYSTEM EVALUATION EXERCISE

## Problem Statement

Evaluate models trained on the *churn_modelling.csv* dataset to predict whether a customer will leave the service (churn).

Split the dataset into train and test sets with *test_size = 0.2* and *random_state = 42*.

Train and predict with three models:
- Model 1: always predicts 0
- Model 2: Logistic regression
- Model 3: KNN

Metrics to compute:
- Accuracy
- Confusion matrix
- Precision, recall, F1
- AUC, ROC (if available)
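
As a rough illustration of the last metric: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute-force pairwise comparison. The helper `pairwise_auc` and the toy labels and scores are hypothetical, chosen only for illustration, and this is not how scikit-learn computes the value internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two negatives followed by two positives
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))     # 0.75
print(pairwise_auc(y_true, [0.0] * 4))  # 0.5
```

A model that gives every customer the same score, such as a constant classifier, ties on every pair and therefore lands at 0.5.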

## Script

In [22]:
import pandas as pd
import sklearn.metrics as m
import sklearn.model_selection as ms

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [23]:
df = pd.read_csv('churn_modelling.csv')
df.head()

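| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |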


In [24]:
# Drop the columns that are not required
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

In [25]:
# Encode the categorical data
label_encoder = LabelEncoder()
df['Geography'] = label_encoder.fit_transform(df['Geography'])
df['Gender'] = label_encoder.fit_transform(df['Gender'])
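
For reference, `LabelEncoder` sorts the distinct values and replaces each one with its position in that sorted order. The plain-Python sketch below mimics that mapping on hypothetical toy values. `encode_labels` is an illustrative helper, not part of scikit-learn.

```python
def encode_labels(values):
    """Map each distinct value to its index in sorted order (mimics LabelEncoder)."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes


# Hypothetical toy column, not the real data
codes, classes = encode_labels(['France', 'Spain', 'France', 'Germany'])
print(classes)  # ['France', 'Germany', 'Spain']
print(codes)    # [0, 2, 0, 1]
```

The integers imply an order that categories such as countries do not have, which may affect distance-based models like KNN; one-hot encoding is a common alternative. Note also that refitting the same encoder on `Gender` discards the `Geography` mapping, so a separate encoder per column would be needed to invert the codes later.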

In [26]:
# Split the data into features and target
X = df.drop('Exited', axis=1)
y = df['Exited']

In [27]:
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
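
Fitting the scaler on the full dataset lets test-set statistics influence the training features. The effect is usually small for standardization, but a common way to avoid it is to fit the scaler on the training split only, for example by wrapping it in a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn table; the variable names are illustrative and this is not the notebook's own code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features and target (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits StandardScaler on the training split only,
# then reuses those statistics when transforming the test split
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```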

In [28]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = ms.train_test_split(X_scaled, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((8000, 10), (2000, 10), (8000,), (2000,))

In [29]:
# Initialize models
models = {
    'Model 1: Always Predict 0': DummyClassifier(strategy='constant', constant=0),
    'Model 2: Logistic Regression': LogisticRegression(),
    'Model 3: KNN': KNeighborsClassifier()
}

In [32]:
# Train and predict with each model
print('RESULTS:')
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Probability of the positive class (1), used for AUC-ROC when the model exposes it
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
    accuracy = m.accuracy_score(y_test, y_pred)
    # Flattened row by row: [TN, FP, FN, TP]
    conf_matrix = m.confusion_matrix(y_test, y_pred).flatten()
    # zero_division=0 reports 0 when the model predicts no positives
    precision, recall, f1, _ = m.precision_recall_fscore_support(y_test, y_pred, average='binary', zero_division=0)
    auc_roc = m.roc_auc_score(y_test, y_prob) if y_prob is not None else 'N/A'
    print(f'''
{name}:
- Accuracy: {accuracy}
- Confusion Matrix: {conf_matrix}
- Precision: {precision}
- Recall: {recall}
- F1: {f1}
- AUC-ROC: {auc_roc}''')

RESULTS:

Model 1: Always Predict 0:
- Accuracy: 0.8035
- Confusion Matrix: [1607    0  393    0]
- Precision: 0.0
- Recall: 0.0
- F1: 0.0
- AUC-ROC: 0.5

Model 2: Logistic Regression:
- Accuracy: 0.8155
- Confusion Matrix: [1559   48  321   72]
- Precision: 0.6
- Recall: 0.183206106870229
- F1: 0.2807017543859649
- AUC-ROC: 0.7635535372440231

Model 3: KNN:
- Accuracy: 0.835
- Confusion Matrix: [1519   88  242  151]
- Precision: 0.6317991631799164
- Recall: 0.3842239185750636
- F1: 0.4778481012658228
- AUC-ROC: 0.7773726904082172
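
The reported scores can be cross-checked by hand from the flattened confusion matrices, which scikit-learn orders as [TN, FP, FN, TP] for binary labels 0 and 1. The sketch below recomputes accuracy, precision, recall, and F1 from those four counts. `metrics_from_counts` is a hypothetical helper written for this check, with zero denominators mapped to 0 to mirror `zero_division=0`.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Recompute binary metrics from confusion-matrix counts; 0/0 is reported as 0."""
    def safe_div(num, den):
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tn + fp + fn + tp)
    precision = safe_div(tp, tp + fp)       # of predicted churners, how many churned
    recall = safe_div(tp, tp + fn)          # of actual churners, how many were found
    f1 = safe_div(2 * tp, 2 * tp + fp + fn) # harmonic mean of precision and recall
    return accuracy, precision, recall, f1


# Counts copied from the confusion matrices printed above
reported = {
    'Model 1: Always Predict 0': (1607, 0, 393, 0),
    'Model 2: Logistic Regression': (1559, 48, 321, 72),
    'Model 3: KNN': (1519, 88, 242, 151),
}
for name, counts in reported.items():
    acc, prec, rec, f1 = metrics_from_counts(*counts)
    print(f'{name}: accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}')
```

Model 1 reaches 0.8035 accuracy only because 1607 of the 2000 test customers did not churn. Its recall of 0 shows that it never identifies a churner, which is why accuracy alone is a weak yardstick on this imbalanced dataset.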
