# Model Building

In this notebook, we will build and train several machine learning models to predict customer churn. We will:
1. Load the preprocessed data.
2. Train a baseline Logistic Regression model.
3. Train advanced models: Random Forest, Gradient Boosting, and XGBoost.
4. Perform hyperparameter tuning on the best-performing model.
5. Save the trained models for evaluation and future use.

In [None]:
import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

## 1. Load Preprocessed Data

In [None]:
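# allow_pickle=True is only required if the arrays were saved with object dtype.
# Unpickling can execute arbitrary code, so only load files from a trusted source.
X_train = np.load('X_train_resampled.npy', allow_pickle=True)
y_train = np.load('y_train_resampled.npy', allow_pickle=True)
X_test = np.load('X_test_processed.npy', allow_pickle=True)
y_test = np.load('y_test.npy', allow_pickle=True)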

print(f'Training data shape: {X_train.shape}')
print(f'Test data shape: {X_test.shape}')
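
Before training, it can help to compare the class balance of the training and test labels. The file names suggest the training set was resampled while the test set was not, but that is an assumption about the preprocessing step. The sketch below uses small hypothetical label arrays (`y_train_demo`, `y_test_demo`) in place of the real ones, and `class_distribution` is a helper introduced here for illustration.

```python
import numpy as np

def class_distribution(y):
    """Return a {label: fraction} mapping for a 1-D label array."""
    labels, counts = np.unique(y, return_counts=True)
    fractions = counts / counts.sum()
    return dict(zip(labels.tolist(), fractions.tolist()))

# Hypothetical stand-ins for y_train and y_test (not the real data).
y_train_demo = np.array([0, 1, 0, 1, 0, 1])  # balanced, as after resampling
y_test_demo = np.array([0, 0, 0, 1])         # imbalanced, as in raw data

print('Train:', class_distribution(y_train_demo))
print('Test: ', class_distribution(y_test_demo))
```

With the real arrays, the same call would be `class_distribution(y_train)` and `class_distribution(y_test)`.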

## 2. Train Baseline Model: Logistic Regression

In [None]:
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred_lr = log_reg.predict(X_test)
print('--- Logistic Regression ---')
print('Accuracy:', accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
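
Accuracy alone can hide poor performance on the churn class when the test set is imbalanced, which is why we also print the classification report. The following minimal sketch computes accuracy, precision, recall, and F1 by hand on a small hypothetical label set. The helper `binary_metrics` and the demo arrays are illustrative and not part of the pipeline.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical test set: 8 non-churners (0) and 2 churners (1).
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that never predicts churn.
never_churn = [0] * 10
# A model that finds one of the two churners and raises one false alarm.
finds_one = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

metrics_never = binary_metrics(y_true_demo, never_churn)
metrics_finds = binary_metrics(y_true_demo, finds_one)
print(metrics_never)
print(metrics_finds)
```

Both hypothetical models reach the same accuracy, but only the second one has a non-zero F1 for the churn class.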

## 3. Train Advanced Models

### Random Forest

In [None]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
print('--- Random Forest ---')
print('Accuracy:', accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

### Gradient Boosting

In [None]:
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)

y_pred_gb = gb.predict(X_test)
print('--- Gradient Boosting ---')
print('Accuracy:', accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))

### XGBoost

In [None]:
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here.
xgb_clf = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
xgb_clf.fit(X_train, y_train)

y_pred_xgb = xgb_clf.predict(X_test)
print('--- XGBoost ---')
print('Accuracy:', accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
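
To compare models side by side rather than reading four separate reports, the scores can be collected in one loop. This is a sketch under stated assumptions: it runs on synthetic data from `make_classification` and on two small stand-in models, and names such as `demo_models` and `summary` are hypothetical. In the notebook, the dictionary would hold `log_reg`, `rf`, `gb`, and `xgb_clf`, already fitted, and the loop would only call `predict` on `X_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the churn dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each model and record accuracy and F1 on the held-out split.
summary = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    y_pred_demo = model.predict(X_te)
    summary[name] = {
        'accuracy': accuracy_score(y_te, y_pred_demo),
        'f1': f1_score(y_te, y_pred_demo),
    }

# Print the models ordered by F1, best first.
for name, scores in sorted(summary.items(), key=lambda kv: kv[1]['f1'], reverse=True):
    print(f"{name:<20} accuracy={scores['accuracy']:.3f}  f1={scores['f1']:.3f}")
```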

## 4. Hyperparameter Tuning (for Random Forest)

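Based on the initial results above, Random Forest appears to perform well, so we tune it here. For demonstration purposes we use a small grid: two values each for `n_estimators`, `max_depth`, and `min_samples_split`, scored by F1 with 3-fold cross-validation.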
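
To get a feel for the cost of the search before running it, we can count the parameter combinations. The sketch below copies the grid from the next cell into a hypothetical `param_grid_demo` and enumerates it with `itertools.product`. Grid search fits every combination once per fold, and with the default `refit=True` it also fits the best combination one more time on the full training set.

```python
from itertools import product

# Copy of the grid used in the tuning cell (names with _demo are illustrative).
param_grid_demo = {
    'n_estimators': [150, 200],
    'max_depth': [20, 30],
    'min_samples_split': [2, 5],
}
cv_folds = 3

# Every combination of one value per hyperparameter.
combinations = [
    dict(zip(param_grid_demo, values))
    for values in product(*param_grid_demo.values())
]
n_cv_fits = len(combinations) * cv_folds

print(f'{len(combinations)} combinations x {cv_folds} folds = {n_cv_fits} cross-validation fits')
print('First combination:', combinations[0])
```

Each added value in the grid multiplies the number of fits, so larger grids get expensive quickly.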

In [None]:
param_grid = {
    'n_estimators': [150, 200],
    'max_depth': [20, 30],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    n_jobs=-1,
    verbose=2,
    scoring='f1',
)
grid_search.fit(X_train, y_train)

print(f'Best parameters found: {grid_search.best_params_}')

best_rf = grid_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)
print('\n--- Tuned Random Forest ---')
print('Accuracy:', accuracy_score(y_test, y_pred_best_rf))
print(classification_report(y_test, y_pred_best_rf))

## 5. Save the Models

In [None]:
joblib.dump(log_reg, 'logistic_regression_model.joblib')
joblib.dump(rf, 'random_forest_model.joblib')
joblib.dump(gb, 'gradient_boosting_model.joblib')
joblib.dump(xgb_clf, 'xgboost_model.joblib')
joblib.dump(best_rf, 'best_random_forest_model.joblib')

print('Models saved successfully.')
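
A saved model can be restored later with `joblib.load`. The following sketch checks the round trip on a small synthetic model written to a temporary directory. The data, the file name `demo_model.joblib`, and the variable names are hypothetical. In an evaluation notebook the call would be, for example, `joblib.load('best_random_forest_model.joblib')`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model used only to demonstrate the save/load round trip.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
demo_model = LogisticRegression(max_iter=1000, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_model.joblib')
    joblib.dump(demo_model, demo_path)
    restored_model = joblib.load(demo_path)

# The restored model should reproduce the original predictions exactly.
predictions_match = np.array_equal(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print('Predictions match after reload:', predictions_match)
```

Loading is most reliable with the same scikit-learn and XGBoost versions that were used for saving, so it is worth recording those versions alongside the model files.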