# Logistic Regression Model - NumPy Implementation

This notebook implements a **Logistic Regression** model using only **NumPy** to predict **Customer Churn**. It uses the data preprocessed in the previous notebook.

## Goals:
- Build Logistic Regression from scratch using only NumPy
- Train and evaluate the model on real-world data
- Answer a set of questions that analyze feature importance and model insights

#### Code environment

In [1]:
import sys
sys.executable

'/home/xv6/anaconda3/envs/min_ds-env/bin/python'

## 1. Import Libraries và Load Data

In [2]:
# NumPy is the only third-party dependency (os and sys are standard library)
import numpy as np
import os
import sys

os.chdir("/home/xv6/Lab 2/Credit Card Customer Analysis")

project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.append(project_root)

# Import custom Logistic Regression model
import src.models as models

In [3]:
# Load the preprocessed data
data_dir = "data/processed"

print("LOADING PREPROCESSED DATA")
print("=" * 50)

# Load training and test data
X_train = np.load(f"{data_dir}/X_train.npy")
y_train = np.load(f"{data_dir}/y_train.npy")
X_test = np.load(f"{data_dir}/X_test.npy")
y_test = np.load(f"{data_dir}/y_test.npy")

# Load feature names (one name per line)
with open(f"{data_dir}/feature_names.txt", "r") as f:
    feature_names = [line.strip() for line in f]

print("Data loaded successfully!")
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Features: {len(feature_names)}")
print(f"Class distribution - Train: Class 0: {np.sum(y_train == 0)}, Class 1: {np.sum(y_train == 1)}")
print(f"Class distribution - Test:  Class 0: {np.sum(y_test == 0)}, Class 1: {np.sum(y_test == 1)}")


LOADING PREPROCESSED DATA
Data loaded successfully!
Training set: (8102, 37)
Test set: (2025, 37)
Features: 37
Class distribution - Train: Class 0: 6800, Class 1: 1302
Class distribution - Test:  Class 0: 1700, Class 1: 325


## 2. Khởi tạo và Training Logistic Regression Model

In [4]:
# Initialize the Logistic Regression model
model = models.LogisticRegression(
    learning_rate=0.01,    # Learning rate
    max_iter=1000,         # Maximum number of iterations
    tolerance=1e-6,        # Tolerance for stopping at convergence
    fit_intercept=True,    # Add a bias term
    verbose=True           # Print training progress
)

print("LOGISTIC REGRESSION MODEL")
print("=" * 50)
print(f"Learning rate: {model.learning_rate}")
print(f"Max iterations: {model.max_iter}")
print(f"Tolerance: {model.tolerance}")
print(f"Fit intercept: {model.fit_intercept}")
print("\nStarting model training...")

LOGISTIC REGRESSION MODEL
Learning rate: 0.01
Max iterations: 1000
Tolerance: 1e-06
Fit intercept: True

Starting model training...
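
The implementation in `src.models` is not shown in this notebook. As a rough orientation only, the sketch below shows how batch gradient descent on the mean cross-entropy loss could use the same hyperparameters (`learning_rate`, `max_iter`, `tolerance`). The function names `sigmoid` and `fit_logistic_gd`, the zero initialization, and the stopping rule are assumptions, not the project's actual code.

```python
import numpy as np

def sigmoid(z):
    # Clip the input so np.exp cannot overflow for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logistic_gd(X, y, learning_rate=0.01, max_iter=1000, tolerance=1e-6):
    """Batch gradient descent on mean binary cross-entropy (illustrative only)."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    prev_loss = np.inf
    eps = 1e-15  # keeps log() finite when a probability hits 0 or 1

    for iteration in range(1, max_iter + 1):
        p = sigmoid(X @ weights + bias)
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

        # Gradients of the mean cross-entropy loss
        error = p - y
        weights -= learning_rate * (X.T @ error) / n_samples
        bias -= learning_rate * error.mean()

        # Assumed stopping rule: loss change smaller than the tolerance
        if abs(prev_loss - loss) < tolerance:
            return weights, bias, iteration, True
        prev_loss = loss

    return weights, bias, max_iter, False

# Tiny synthetic check (not the churn data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(float)
w, b, n_iter, converged = fit_logistic_gd(X_demo, y_demo, learning_rate=0.1)
demo_accuracy = np.mean((sigmoid(X_demo @ w + b) >= 0.5) == y_demo)
print(f"iterations={n_iter}, converged={converged}, accuracy={demo_accuracy:.3f}")
```

The returned flag plays the same role as the `converged` entry that a model summary might report when the loop ends before `max_iter`.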


In [5]:
# Train the model
model.fit(X_train, y_train)

print("\nTRAINING SUMMARY:")
summary = model.get_model_summary()
for key, value in summary.items():
    if isinstance(value, float):
        print(f"{key}: {value:.6f}")
    else:
        print(f"{key}: {value}")

print("\nModel trained successfully!")

STARTING LOGISTIC REGRESSION TRAINING
   Features: 37
   Samples: 8102
   Learning rate: 0.01
   Max iterations: 1000
--------------------------------------------------
   Iteration  100: Loss = 0.501711, Accuracy = 0.765
   Iteration  200: Loss = 0.424545, Accuracy = 0.821
   Iteration  300: Loss = 0.381359, Accuracy = 0.843
   Iteration  400: Loss = 0.353228, Accuracy = 0.854
   Iteration  500: Loss = 0.333206, Accuracy = 0.864
   Iteration  600: Loss = 0.318160, Accuracy = 0.872
   Iteration  700: Loss = 0.306422, Accuracy = 0.878
   Iteration  800: Loss = 0.296997, Accuracy = 0.882
   Iteration  900: Loss = 0.289253, Accuracy = 0.887
   Iteration 1000: Loss = 0.282771, Accuracy = 0.888

 TRAINING COMPLETE:
   Final loss: 0.282771
   Final accuracy: 0.888
   Total iterations: 1000

TRAINING SUMMARY:
n_features: 37
n_samples: 8102
learning_rate: 0.010000
max_iter: 1000
actual_iter: 1000
final_loss: 0.282771
fit_intercept: True
converged: False
bias: -0.786739
n_weights: 37

Model trained successfully!
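
The summary reports `converged: False`, which fits a loss that is still falling when the iteration cap is reached. The sketch below uses only the loss values printed in the log to estimate the average decrease per iteration in each 100-iteration window. It assumes that the model's convergence test compares the change in loss against `tolerance`; that detail is not shown in the notebook.

```python
import numpy as np

# Loss values copied from the training log above (every 100 iterations)
iterations = np.arange(100, 1001, 100)
losses = np.array([0.501711, 0.424545, 0.381359, 0.353228, 0.333206,
                   0.318160, 0.306422, 0.296997, 0.289253, 0.282771])

tolerance = 1e-6  # value passed to the model above

# Average loss decrease per iteration inside each 100-iteration window
per_iter_decrease = -np.diff(losses) / np.diff(iterations)

for end, delta in zip(iterations[1:], per_iter_decrease):
    print(f"iterations {end - 100:4d}-{end:4d}: {delta:.2e} per iteration")

# True means the loss was still dropping faster than the tolerance at the end
still_improving = per_iter_decrease[-1] > tolerance
print(f"Last window above tolerance: {still_improving}")
```

If the last window is still well above the tolerance, raising `max_iter` or `learning_rate` are the usual options to explore.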

## 3. Model Prediction và Evaluation

In [6]:
# Predict on the training and test sets
print("MAKING PREDICTIONS")
print("=" * 50)

# Training set predictions
y_train_pred = model.predict(X_train)
y_train_proba = model.predict_proba(X_train)[:, 1]  # Probability for class 1

# Test set predictions
y_test_pred = model.predict(X_test)
y_test_proba = model.predict_proba(X_test)[:, 1]

print("Predictions complete!")
print(f"Train predictions: {y_train_pred.shape}")
print(f"Test predictions: {y_test_pred.shape}")

# Quick preview of predictions
print("\nSAMPLE PREDICTIONS (first 10 test samples):")
print("True | Pred | Probability")
print("-" * 25)
for i in range(10):
    print(f"  {y_test[i]}  |  {y_test_pred[i]}  |   {y_test_proba[i]:.3f}")

# Class distribution in predictions
print("\nPREDICTION DISTRIBUTION:")
print(f"Train - Class 0: {np.sum(y_train_pred == 0):,}, Class 1: {np.sum(y_train_pred == 1):,}")
print(f"Test  - Class 0: {np.sum(y_test_pred == 0):,}, Class 1: {np.sum(y_test_pred == 1):,}")

MAKING PREDICTIONS
Predictions complete!
Train predictions: (8102,)
Test predictions: (2025,)

SAMPLE PREDICTIONS (first 10 test samples):
True | Pred | Probability
-------------------------
  0  |  0  |   0.212
  0  |  0  |   0.007
  0  |  0  |   0.369
  0  |  0  |   0.112
  0  |  0  |   0.047
  0  |  0  |   0.033
  0  |  0  |   0.031
  0  |  0  |   0.093
  0  |  0  |   0.042
  0  |  0  |   0.071

PREDICTION DISTRIBUTION:
Train - Class 0: 7,440, Class 1: 662
Test  - Class 0: 1,876, Class 1: 149
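
The model labels only 149 test samples as class 1, so the decision threshold matters. The sketch below assumes that `model.predict` applies a 0.5 cut-off to the class-1 probability. This matches the preview above but is not confirmed by the notebook. The helper `predict_with_threshold` is hypothetical and shows how a lower cut-off flags more customers.

```python
import numpy as np

def predict_with_threshold(proba, threshold=0.5):
    """Turn class-1 probabilities into 0/1 labels."""
    return (np.asarray(proba) >= threshold).astype(int)

# Probabilities of the first 10 test samples, copied from the preview above
sample_proba = np.array([0.212, 0.007, 0.369, 0.112, 0.047,
                         0.033, 0.031, 0.093, 0.042, 0.071])

for threshold in (0.5, 0.3, 0.2):
    labels = predict_with_threshold(sample_proba, threshold)
    print(f"threshold={threshold:.1f}: predicted churners = {labels.sum()}")
```

All ten preview samples are true class 0, so every extra positive here would be a false positive. Lowering the threshold trades precision for recall.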


In [7]:
# Compute metrics for the training set
train_metrics = models.calculate_classification_metrics(y_train, y_train_pred, y_train_proba)

print(" TRAINING SET PERFORMANCE:")
models.print_classification_report(train_metrics, class_names=['Existing Customer', 'Attrited Customer'])

# Extract metrics for use in later cells
train_accuracy = train_metrics['accuracy']
train_precision = train_metrics['precision']
train_recall = train_metrics['recall']
train_f1 = train_metrics['f1_score']
train_auc = train_metrics['auc_roc']

 TRAINING SET PERFORMANCE:
CLASSIFICATION REPORT

CONFUSION MATRIX:
                          Predicted
                          Existing Customer  Attrited Customer
Actual Existing Customer               6667                133
       Attrited Customer                773                529

PERFORMANCE METRICS:
Accuracy    : 0.8882
Precision   : 0.7991
Recall      : 0.4063
F1-Score    : 0.5387
Specificity : 0.9804
AUC-ROC     : 0.8967
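
The report includes AUC-ROC, computed by `models.calculate_auc_roc`, whose code is not shown here. One common way to obtain it with NumPy alone is the pairwise (Mann-Whitney) formulation sketched below. The function name `auc_from_scores` and the toy data are illustrative assumptions.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive with every negative; ties count as half a win
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy example (not the churn data)
y_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])
demo_auc = auc_from_scores(y_demo, scores_demo)
print(f"AUC = {demo_auc:.2f}")
```

The pairwise matrix uses memory proportional to the number of positives times the number of negatives. That is manageable at this dataset's size, but a rank-based version would scale better.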


In [8]:
# Compute metrics for the test set
test_metrics = models.calculate_classification_metrics(y_test, y_test_pred, y_test_proba)

print("TEST SET PERFORMANCE:")
models.print_classification_report(test_metrics, class_names=['Existing Customer', 'Attrited Customer'])

# Extract metrics for use in later cells
test_accuracy = test_metrics['accuracy']
test_precision = test_metrics['precision']
test_recall = test_metrics['recall']
test_f1 = test_metrics['f1_score']
model_auc = test_metrics['auc_roc']

TEST SET PERFORMANCE:
CLASSIFICATION REPORT

CONFUSION MATRIX:
                          Predicted
                          Existing Customer  Attrited Customer
Actual Existing Customer               1673                 27
       Attrited Customer                203                122

PERFORMANCE METRICS:
Accuracy    : 0.8864
Precision   : 0.8188
Recall      : 0.3754
F1-Score    : 0.5148
Specificity : 0.9841
AUC-ROC     : 0.8948
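
Every threshold-based metric in the report follows from the four cells of the confusion matrix. As a cross-check, the sketch below recomputes them from the test-set counts printed above. The variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix copied from the report above:
# rows = actual, columns = predicted, order = [Existing, Attrited]
cm = np.array([[1673, 27],
               [203, 122]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)      # share of predicted churners who really churned
recall = tp / (tp + fn)         # share of actual churners that were caught
specificity = tn / (tn + fp)    # share of existing customers correctly kept
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy    : {accuracy:.4f}")
print(f"Precision   : {precision:.4f}")
print(f"Recall      : {recall:.4f}")
print(f"F1-Score    : {f1:.4f}")
print(f"Specificity : {specificity:.4f}")
```

The low recall comes from the 203 churners predicted as existing customers, while the 27 false positives keep precision and specificity high.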


## 4. Questions for understanding the model

### Question 1: Is the model overfitting or underfitting? How is this assessed?

In [9]:
# Question 1: Assess overfitting/underfitting
print(" QUESTION 1: Is the model overfitting or underfitting?")
print("=" * 60)

# Compare train vs. test performance
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

train_accuracy = np.mean(train_pred == y_train)
test_accuracy = np.mean(test_pred == y_test)

# Compute AUC for train and test
train_proba = model.predict_proba(X_train)
test_proba = model.predict_proba(X_test)

train_auc = models.calculate_auc_roc(y_train, train_proba[:, 1])
test_auc = models.calculate_auc_roc(y_test, test_proba[:, 1])

# Absolute train/test gaps, used in the diagnosis below
accuracy_gap = abs(train_accuracy - test_accuracy)
auc_gap = abs(train_auc - test_auc)

print("\n PERFORMANCE COMPARISON:")
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy:     {test_accuracy:.4f}")
print(f"Accuracy Gap:      {accuracy_gap:.4f}")

print(f"\nTraining AUC:      {train_auc:.4f}")
print(f"Test AUC:          {test_auc:.4f}")
print(f"AUC Gap:           {auc_gap:.4f}")

# Diagnosis
print("\n DIAGNOSIS:")
if accuracy_gap < 0.05 and auc_gap < 0.05:
    if test_accuracy > 0.8:
        print(" GOOD FIT: The model generalizes well!")
        print("   - Small performance gaps (< 5%)")
        print("   - High test performance")
    else:
        print("  UNDERFIT: The model is too simple")
        print("   - Low performance on both train and test")
        print("   - Needs a more complex model or more features")
elif accuracy_gap > 0.1 or auc_gap > 0.1:
    print(" OVERFIT: The model memorizes the training data")
    print("   - Train performance >> Test performance")
    print("   - Needs regularization or reduced complexity")
else:
    print(" MILD OVERFIT: Slight overfitting")
    print("   - Moderate gap (5-10%)")
    print("   - Hyperparameter tuning may help")

 QUESTION 1: Is the model overfitting or underfitting?

 PERFORMANCE COMPARISON:
Training Accuracy: 0.8882
Test Accuracy:     0.8864
Accuracy Gap:      0.0018

Training AUC:      0.8967
Test AUC:          0.8948
AUC Gap:           0.0019

 DIAGNOSIS:
 GOOD FIT: The model generalizes well!
   - Small performance gaps (< 5%)
   - High test performance
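
The diagnosis rule treats a test accuracy above 0.8 as high. With imbalanced classes, that figure is easier to read next to the accuracy of a trivial predictor that always outputs the majority class. The sketch below uses only numbers already printed in this notebook. The variable names are illustrative.

```python
# Test-set values copied from the outputs above
n_existing, n_attrited = 1700, 325
test_accuracy = 0.8864

# Accuracy of always predicting the majority class ("Existing Customer")
baseline_accuracy = max(n_existing, n_attrited) / (n_existing + n_attrited)
lift = test_accuracy - baseline_accuracy

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
print(f"Model test accuracy:     {test_accuracy:.4f}")
print(f"Improvement:             {lift:.4f}")
```

Since the baseline already exceeds 0.8, a fixed 0.8 accuracy cut-off cannot distinguish the model from this trivial predictor. AUC and recall are more informative here.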


### Question 2: Which features matter most for predicting churn?

In [10]:
# Question 2: Feature importance analysis
print(" QUESTION 2: Which features matter most?")
print("=" * 50)

# Get feature importance from the model weights
feature_importance = model.get_feature_importance(feature_names)

# If an array is returned, pair it with the feature names first
if not isinstance(feature_importance, dict):
    feature_importance = dict(zip(feature_names, feature_importance))

# Sort by importance, largest first
sorted_features = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)
importance_values = [x[1] for x in sorted_features]
feature_names_sorted = [x[0] for x in sorted_features]

print("\n TOP 10 MOST IMPORTANT FEATURES:")
print("-" * 50)
for i, (feature, importance) in enumerate(sorted_features[:10], 1):
    # Simple text bar chart: 50 characters per unit of importance,
    # padded with "░" up to 20 characters when the bar is shorter
    bar_length = int(importance * 50)
    bar = "█" * bar_length + "░" * max(0, 20 - bar_length)
    print(f"{i:2d}. {feature:<25} │{bar}│ {importance:.4f}")

# Analysis by feature group
print("\n  FEATURE GROUP ANALYSIS:")
print("-" * 40)

# Assign the top 10 features to groups by keyword
transaction_features = [f for f in feature_names_sorted[:10] if any(keyword in f.lower() 
                       for keyword in ['trans', 'amt', 'ct', 'revolving', 'utilization'])]
demographic_features = [f for f in feature_names_sorted[:10] if any(keyword in f.lower() 
                       for keyword in ['age', 'gender', 'education', 'marital', 'income'])]
account_features = [f for f in feature_names_sorted[:10] if any(keyword in f.lower() 
                   for keyword in ['credit', 'limit', 'card', 'months', 'relationship'])]

print(f" Transaction Features: {len(transaction_features)}")
for f in transaction_features:
    idx = feature_names_sorted.index(f)
    print(f"   - {f}: {importance_values[idx]:.4f}")

print(f"\n Demographic Features: {len(demographic_features)}")
for f in demographic_features:
    idx = feature_names_sorted.index(f)
    print(f"   - {f}: {importance_values[idx]:.4f}")

print(f"\n Account Features: {len(account_features)}")
for f in account_features:
    idx = feature_names_sorted.index(f)
    print(f"   - {f}: {importance_values[idx]:.4f}")

 QUESTION 2: Which features matter most?

 TOP 10 MOST IMPORTANT FEATURES:
--------------------------------------------------
 1. Total_Trans_Ct            │█████████████████████████████████████████│ 0.8264
 2. Gender_M                  │████████████████████████████│ 0.5724
 3. Total_Ct_Chng_Q4_Q1       │███████████████████████│ 0.4654
 4. Gender_F                  │█████████████████████│ 0.4300
 5. Marital_Status_Married    │████████████████████│ 0.4190
 6. Card_Category_Blue        │██████████████████░░│ 0.3750
 7. Total_Revolving_Bal       │██████████████████░░│ 0.3736
 8. Education_Level_Post-Graduate │█████████████████░░░│ 0.3559
 9. Total_Relationship_Count  │████████████████░░░░│ 0.3386
10. Income_Category_$60K - $80K │███████████████░░░░░│ 0.3098

  FEATURE GROUP ANALYSIS:
----------------------------------------
 Transaction Features: 3
   - Total_Trans_Ct: 0.8264
   - Total_Ct_Chng_Q4_Q1: 0.4654
   - Total_Revolving_Bal: 0.3736

 Demographic Features: 5
   - Gender_M:
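
The internals of `model.get_feature_importance` are not shown. The all-positive scores above suggest coefficient magnitudes, but that is an assumption. The sketch below, with a hypothetical helper `rank_features` and made-up weights, ranks features by absolute coefficient while keeping the sign, which indicates whether a feature pushes the predicted churn probability up or down. The comparison is only meaningful when features are on comparable scales.

```python
import numpy as np

def rank_features(weights, names, top_k=3):
    """Rank features by absolute coefficient size, keeping the sign."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-np.abs(weights))[:top_k]
    return [(names[i], float(weights[i])) for i in order]

# Hypothetical coefficients, not taken from the trained model
demo_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
demo_weights = [0.10, -0.83, 0.47, -0.05]

for name, weight in rank_features(demo_weights, demo_names):
    direction = "raises" if weight > 0 else "lowers"
    print(f"{name:<10} weight={weight:+.2f} ({direction} predicted churn probability)")
```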

### Question 3: Is the model stable? Using Cross-Validation
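
In k-fold cross-validation, the data is shuffled once and cut into k parts. Each part serves once as the validation set while the remaining parts form the training set. The generator below is a minimal NumPy sketch of that splitting step. The name `kfold_indices` is hypothetical, and the splitting logic inside `models.kfold_cross_validation` may differ.

```python
import numpy as np

def kfold_indices(n_samples, k=5, random_state=42):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold splitting."""
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # fold sizes differ by at most 1
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

# 10,127 = 8,102 training + 2,025 test samples from this notebook
for fold, (train_idx, val_idx) in enumerate(kfold_indices(10127), start=1):
    print(f"Fold {fold}: train={train_idx.size}, validation={val_idx.size}")
```

Every sample lands in exactly one validation fold, so each fold score is computed on data the corresponding model did not see during fitting.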

In [11]:
# Question 3: Cross-validation to assess model stability
print(" QUESTION 3: Cross-Validation - Is the model stable?")
print("=" * 60)

# Run cross-validation
print("Starting cross-validation with the current model...")

# Combine train and test to get the full dataset
X_full = np.vstack([X_train, X_test])
y_full = np.concatenate([y_train, y_test])

# Same model parameters as the model trained above
cv_model_params = {
    'learning_rate': 0.01,
    'max_iter': 1000,
    'tolerance': 1e-6,
    'fit_intercept': True,
    'verbose': False  # Turn off verbose for CV
}

# Run 5-fold CV using the function from models.py
cv_scores, cv_details = models.kfold_cross_validation(
    X_full, y_full, 
    models.LogisticRegression, 
    cv_model_params, 
    k=5, 
    random_state=42
)

 QUESTION 3: Cross-Validation - Is the model stable?
Starting cross-validation with the current model...
Performing 5-Fold Cross Validation...
Total samples: 10,127
Samples per fold: ~2,025
--------------------------------------------------
Fold 1/5: Acc: 0.8894, AUC: 0.9105
Fold 2/5: Acc: 0.9081, AUC: 0.9155
Fold 3/5: Acc: 0.8983, AUC: 0.9094
Fold 4/5: Acc: 0.9002, AUC: 0.9107
Fold 5/5: Acc: 0.8821, AUC: 0.9074
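
With only five fold scores, an interval for the mean accuracy is usually built from the t distribution rather than the normal value 1.96. The sketch below uses the fold accuracies printed above, the sample standard deviation (`ddof=1`), and the two-sided 95% critical value for 4 degrees of freedom (about 2.776). It is an illustrative alternative, not the calculation done in `models.py`.

```python
import numpy as np

# Fold accuracies copied from the output above
fold_acc = np.array([0.8894, 0.9081, 0.8983, 0.9002, 0.8821])
n = fold_acc.size

mean = fold_acc.mean()
std_sample = fold_acc.std(ddof=1)  # sample standard deviation
t_crit = 2.776                     # two-sided 95% t value for n - 1 = 4 degrees of freedom

margin = t_crit * std_sample / np.sqrt(n)
print(f"Mean accuracy: {mean:.4f}")
print(f"95% CI (t-based): [{mean - margin:.4f}, {mean + margin:.4f}]")
```

Fold scores share most of their training data and are therefore not independent, so any such interval should be read as a rough indication only.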


In [12]:
# Stability analysis and interpretation using functions from models.py
single_split_metrics = {
    'accuracy': test_accuracy,
    'precision': test_precision, 
    'recall': test_recall,
    'f1': test_f1,
    'auc': model_auc
}

# Print the cross-validation results with a comparison to the single split
models.print_cv_results(cv_scores, cv_details, single_split_metrics)

# Add a detailed analysis
analysis_results = models.analyze_cv_stability(cv_scores, single_split_metrics)

print(f"\n 95% CONFIDENCE INTERVALS:")
print("-" * 35)
for metric_name, summary in analysis_results['metrics_summary'].items():
    mean = summary['mean']
    std = summary['std']
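    # Normal-approximation 95% CI: mean ± 1.96 * std / sqrt(n).
    # With only n = 5 folds, a t critical value (about 2.776 for 4 degrees
    # of freedom) would give a wider and more appropriate interval.
    margin_error = 1.96 * std / np.sqrt(5)  # n = 5 folds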
    ci_lower = mean - margin_error
    ci_upper = mean + margin_error
    print(f"{metric_name:<10}: [{ci_lower:.4f}, {ci_upper:.4f}]")

print(f"\n BUSINESS IMPLICATIONS:")
print("-" * 25)

metrics_summary = analysis_results['metrics_summary']
avg_cv = analysis_results['avg_cv']

print(f" Average Performance: {metrics_summary['accuracy']['mean']:.1%} accuracy")
print(f" Performance Variability: ±{metrics_summary['accuracy']['std']*100:.1f}%")

# Business implications
if metrics_summary['accuracy']['std'] < 0.02:
    print(f" LOW VARIANCE - Reliable for production deployment")
elif metrics_summary['accuracy']['std'] < 0.05:
    print(f"  MODERATE VARIANCE - Monitor performance closely")
else:
    print(f" HIGH VARIANCE - Consider more data or model tuning")


 CROSS-VALIDATION RESULTS:
ACCURACY  : 0.8956 ± 0.0090
Range     : [0.8821, 0.9081]
-------------------------
PRECISION : 0.8372 ± 0.0213
Range     : [0.8148, 0.8773]
-------------------------
RECALL    : 0.4358 ± 0.0394
Range     : [0.3862, 0.4955]
-------------------------
F1        : 0.5722 ± 0.0357
Range     : [0.5286, 0.6203]
-------------------------
AUC       : 0.9107 ± 0.0027
Range     : [0.9074, 0.9155]
-------------------------

 DETAILED FOLD ANALYSIS:
------------------------------------------------------------
Fold   │ Accuracy   │ Precision  │ Recall     │ F1         │ AUC       
------------------------------------------------------------
1      │ 0.8894     │ 0.8148     │ 0.4049     │ 0.5410     │ 0.9105    
2      │ 0.9081     │ 0.8773     │ 0.4628     │ 0.6059     │ 0.9155    
3      │ 0.8983     │ 0.8272     │ 0.4295     │ 0.5654     │ 0.9094    
4      │ 0.9002     │ 0.8291     │ 0.4955     │ 0.6203     │ 0.9107    
5      │ 0.8821     │ 0.8375     │ 0.3862     │ 0