In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [3]:
# Importing the preprocessed data
Xtrain = pd.read_csv('Xtrain.csv', index_col=0)
Xtest = pd.read_csv('Xtest.csv', index_col=0)
ytrain = pd.read_csv('ytrain.csv', index_col=0).values.ravel()
ytest = pd.read_csv('ytest.csv', index_col=0).values.ravel()

In [5]:
# Standardizing the features
scaler = StandardScaler()
Xtrain_scaled = scaler.fit_transform(Xtrain)
Xtest_scaled = scaler.transform(Xtest)

## Baseline Method - Logistic Regression

In [8]:
# Initialize and fit Logistic Regression model
log_reg = LogisticRegression(max_iter=10000, random_state=7)
log_reg.fit(Xtrain_scaled, ytrain)

In [10]:
# Predict on test set
y_pred_log_reg = log_reg.predict(Xtest_scaled)

In [28]:
# Evaluate the model
print("Baseline Logistic Regression Model:")
print("Accuracy:", accuracy_score(ytest, y_pred_log_reg))
print("\nClassification Report:\n", classification_report(ytest, y_pred_log_reg))
print("\nConfusion Matrix:\n", confusion_matrix(ytest, y_pred_log_reg))

Baseline Logistic Regression Model:
Accuracy: 0.807

Classification Report:
               precision    recall  f1-score   support

         0.0       0.83      0.96      0.89      1593
         1.0       0.57      0.20      0.30       407

    accuracy                           0.81      2000
   macro avg       0.70      0.58      0.59      2000
weighted avg       0.77      0.81      0.77      2000


Confusion Matrix:
 [[1531   62]
 [ 324   83]]


### The baseline logistic regression model achieved an accuracy of 80.7%. Although the accuracy seems decent, the main weakness is recall for class 1 (churned customers): only 83 of 407 churners (20%) are identified. Precision for class 1 is 57%, meaning 62 of the 145 customers flagged as churners did not actually churn. The false positive rate itself is low, at 62 of 1593 non-churned customers (about 4%).
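
As a quick sanity check, the per-class figures can be recomputed from the baseline confusion matrix printed above. The variable names below are illustrative only and are not part of the notebook.

```python
# Counts copied from the baseline confusion matrix
# (rows = actual class, columns = predicted class).
tn, fp = 1531, 62
fn, tp = 324, 83

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)             # share of predicted churners who actually churned
recall = tp / (tp + fn)                # share of actual churners the model caught
false_positive_rate = fp / (fp + tn)   # share of non-churners wrongly flagged

print(f"accuracy            = {accuracy:.3f}")
print(f"precision (class 1) = {precision:.3f}")
print(f"recall (class 1)    = {recall:.3f}")
print(f"false positive rate = {false_positive_rate:.3f}")
```

Low precision and a high false positive rate are different things: precision is measured against the customers the model flags, while the false positive rate is measured against all customers who did not churn.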



#### The baseline logistic regression model's performance suggests that while it correctly identifies most non-churned customers (recall of 0.96 for class 0), it struggles to identify churned customers, as shown by its recall of 0.20 for class 1. The overall accuracy is misleading because of the class imbalance in the dataset: 1593 of the 2000 test customers (about 80%) did not churn, so a model that always predicts class 0 would already reach 79.65% accuracy. The imbalance needs to be addressed for better performance.
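
One common way to address class imbalance in logistic regression is the `class_weight="balanced"` option in scikit-learn. The sketch below is not run on the churn data: it uses a synthetic dataset from `make_classification` with roughly 20% positives as a stand-in (an assumption), so the numbers it prints say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the churn data, about 20% positives.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# Fit the scaler on the training split only, as in the notebook.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Same model twice: once unweighted, once with classes reweighted
# inversely to their frequency.
plain = LogisticRegression(max_iter=10000, random_state=7).fit(X_tr_s, y_tr)
weighted = LogisticRegression(max_iter=10000, class_weight="balanced",
                              random_state=7).fit(X_tr_s, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te_s))
recall_weighted = recall_score(y_te, weighted.predict(X_te_s))
print("Recall (class 1), unweighted:", round(recall_plain, 3))
print("Recall (class 1), balanced:  ", round(recall_weighted, 3))
```

Reweighting usually raises recall for the minority class at the expense of precision, so both metrics should be checked before adopting it.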

## Hyperparameter Tuning

In [76]:
# Hyperparameter grid for Logistic Regression
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'penalty': ['l2']}
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(Xtrain_scaled, ytrain)

In [33]:
# Best hyperparameters and best model
best_params = grid_search.best_params_
best_log_reg = grid_search.best_estimator_
y_pred_best = best_log_reg.predict(Xtest_scaled)

In [35]:
# Evaluate the best model
print("Best Logistic Regression Model after Hyperparameter Tuning:")
print("Accuracy:", accuracy_score(ytest, y_pred_best))
print("\nClassification Report:\n", classification_report(ytest, y_pred_best))
print("\nConfusion Matrix:\n", confusion_matrix(ytest, y_pred_best))

Best Logistic Regression Model after Hyperparameter Tuning:
Accuracy: 0.811

Classification Report:
               precision    recall  f1-score   support

         0.0       0.82      0.97      0.89      1593
         1.0       0.63      0.18      0.28       407

    accuracy                           0.81      2000
   macro avg       0.72      0.57      0.58      2000
weighted avg       0.78      0.81      0.77      2000


Confusion Matrix:
 [[1550   43]
 [ 335   72]]


### After hyperparameter tuning, the logistic regression model achieved a slightly improved accuracy of 81.1% (up from 80.7%). The precision for class 1 increased from 57% to 63%, but recall for class 1 fell from 20% to 18% (72 of 407 churners identified, down from 83). Since the grid search was scored on accuracy, this trade-off is not surprising, and the model still struggles to identify churned customers.


#### The tuned logistic regression model shows a slight improvement in accuracy and in precision for class 1, at the cost of slightly lower recall and F1-score (0.28 versus 0.30) for that class. Despite hyperparameter tuning, the model may require additional features or more sophisticated techniques to capture the underlying patterns of churn.
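
When the goal is to find churned customers, the grid search can be scored on a metric that reflects the minority class instead of accuracy. The sketch below assumes synthetic data from `make_classification` in place of the churn files, and the grid values are illustrative rather than the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the churn data, about 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)
X_tr_s = StandardScaler().fit_transform(X_tr)

# Illustrative grid: regularization strength plus optional class reweighting.
param_grid = {"C": [0.01, 0.1, 1, 10],
              "class_weight": [None, "balanced"]}

# scoring="f1" evaluates each candidate on the F1-score of class 1,
# so candidates that ignore the minority class are penalized.
search = GridSearchCV(LogisticRegression(max_iter=10000, random_state=7),
                      param_grid, cv=5, scoring="f1")
search.fit(X_tr_s, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```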

## Alternative Predictive Models:

## Decision Tree Classifier

In [39]:
# Initialize and fit Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=7)
dt_classifier.fit(Xtrain_scaled, ytrain)

In [41]:
# Predict on test set
y_pred_dt = dt_classifier.predict(Xtest_scaled)

In [43]:
# Evaluate the model
print("Decision Tree Classifier:")
print("Accuracy:", accuracy_score(ytest, y_pred_dt))
print("\nClassification Report:\n", classification_report(ytest, y_pred_dt))
print("\nConfusion Matrix:\n", confusion_matrix(ytest, y_pred_dt))

Decision Tree Classifier:
Accuracy: 0.8045

Classification Report:
               precision    recall  f1-score   support

         0.0       0.88      0.87      0.88      1593
         1.0       0.52      0.53      0.53       407

    accuracy                           0.80      2000
   macro avg       0.70      0.70      0.70      2000
weighted avg       0.81      0.80      0.81      2000


Confusion Matrix:
 [[1392  201]
 [ 190  217]]


### The decision tree classifier achieved an accuracy of 80.45%, marginally below the logistic regression baseline. It has a much higher recall for class 1 (0.53 versus 0.20) but also a higher false positive rate (201 of 1593 non-churned customers, about 13%). An unconstrained decision tree is prone to overfitting the training data, which may affect its generalization to unseen data.

#### The decision tree classifier has shown decent performance but may be overfitting. Training-set performance was not reported here, so this should be confirmed by comparing training and test scores. The model's interpretability is a plus, and its accuracy and precision could likely be improved by limiting overfitting through techniques like pruning.
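
A simple way to check for overfitting is to compare training and test accuracy, then repeat with a constrained tree. The sketch below assumes synthetic data in place of the churn files, and `max_depth=5` is an arbitrary illustrative value, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the churn data, about 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# Fully grown tree (scikit-learn default) versus a depth-limited tree.
full_tree = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)
pruned_tree = DecisionTreeClassifier(max_depth=5, random_state=7).fit(X_tr, y_tr)

train_full, test_full = full_tree.score(X_tr, y_tr), full_tree.score(X_te, y_te)
train_pruned, test_pruned = pruned_tree.score(X_tr, y_tr), pruned_tree.score(X_te, y_te)

# A large gap between training and test accuracy is the usual sign of overfitting.
print(f"Full tree   (depth {full_tree.get_depth()}): train={train_full:.3f}, test={test_full:.3f}")
print(f"Pruned tree (depth {pruned_tree.get_depth()}): train={train_pruned:.3f}, test={test_pruned:.3f}")
```

scikit-learn also supports cost-complexity pruning through the `ccp_alpha` parameter, which could be tuned with cross-validation instead of fixing `max_depth` by hand.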

## Random Forest Classifier

In [57]:
# Initialize and fit Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=7)
rf_classifier.fit(Xtrain_scaled, ytrain)

In [59]:
# Predict on test set
y_pred_rf = rf_classifier.predict(Xtest_scaled)


In [61]:
# Evaluate the model
print("Random Forest Classifier:")
print("Accuracy:", accuracy_score(ytest, y_pred_rf))
print("\nClassification Report:\n", classification_report(ytest, y_pred_rf))
print("\nConfusion Matrix:\n", confusion_matrix(ytest, y_pred_rf))

Random Forest Classifier:
Accuracy: 0.87

Classification Report:
               precision    recall  f1-score   support

         0.0       0.88      0.97      0.92      1593
         1.0       0.80      0.49      0.60       407

    accuracy                           0.87      2000
   macro avg       0.84      0.73      0.76      2000
weighted avg       0.86      0.87      0.86      2000


Confusion Matrix:
 [[1542   51]
 [ 209  198]]


### The random forest classifier outperformed the other models with an accuracy of 87%. For class 1 it reaches a precision of 0.80 and an F1-score of 0.60, but recall is only 0.49, so about half of churned customers are still missed. It is a promising model for churn prediction, but further analysis is needed to understand feature importance and potential overfitting.

#### The Random Forest classifier stands out with its high accuracy and high precision for both classes (0.88 and 0.80). Recall is uneven, however: 0.97 for class 0 versus 0.49 for class 1. It would be beneficial to investigate potential overfitting and feature importance to ensure the model's robustness and interpretability.
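
Feature importance for a fitted random forest is available through its `feature_importances_` attribute. The sketch below assumes synthetic data, and the `feature_0`, `feature_1`, ... names are placeholders; on the real data the column names of `Xtrain` would be used instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic stand-in for the churn data with placeholder names.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(random_state=7).fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important

for idx in order:
    print(f"{feature_names[idx]:<12} {importances[idx]:.3f}")
```

Impurity-based importances can favor high-cardinality features; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.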

## Gradient Boosting Classifier

In [63]:
# Initialize and fit Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(random_state=7)
gb_classifier.fit(Xtrain_scaled, ytrain)

In [65]:
# Predict on test set
y_pred_gb = gb_classifier.predict(Xtest_scaled)

In [67]:
# Evaluate the model
print("Gradient Boosting Classifier:")
print("Accuracy:", accuracy_score(ytest, y_pred_gb))
print("\nClassification Report:\n", classification_report(ytest, y_pred_gb))
print("\nConfusion Matrix:\n", confusion_matrix(ytest, y_pred_gb))

Gradient Boosting Classifier:
Accuracy: 0.869

Classification Report:
               precision    recall  f1-score   support

         0.0       0.88      0.97      0.92      1593
         1.0       0.79      0.48      0.60       407

    accuracy                           0.87      2000
   macro avg       0.84      0.72      0.76      2000
weighted avg       0.86      0.87      0.86      2000


Confusion Matrix:
 [[1542   51]
 [ 211  196]]


### The gradient boosting classifier achieved an accuracy of 86.9%, marginally below Random Forest (87%) and well above logistic regression. Its class 1 precision (0.79) and recall (0.48) are nearly identical to Random Forest's, so it also misses about half of churned customers. Like the decision tree, it might suffer from overfitting if not properly tuned.

#### Gradient Boosting shows competitive performance, trailing Random Forest by only two test predictions (1738 versus 1740 correct out of 2000). Because it is built from decision trees, it can also overfit, particularly with many boosting stages or deep trees. Tuning the model's hyperparameters and further feature engineering might enhance its performance.
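
One way to limit overfitting in scikit-learn's `GradientBoostingClassifier` is early stopping via `n_iter_no_change`, which holds out part of the training data and stops adding trees once the validation score stops improving. The sketch below assumes synthetic data, and the hyperparameter values are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the churn data, about 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting stages
    learning_rate=0.1,
    validation_fraction=0.2,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 stages without improvement
    random_state=7,
)
gb.fit(X_tr, y_tr)

# n_estimators_ is the number of stages actually fitted after early stopping.
print("Boosting stages used:", gb.n_estimators_)
print("Test accuracy:", round(gb.score(X_te, y_te), 3))
```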


## Support Vector Machine (SVM) Classifier

In [70]:
# Initialize and fit SVM Classifier
svm_classifier = SVC(random_state=7)
svm_classifier.fit(Xtrain_scaled, ytrain)

In [72]:
# Predict on test set
y_pred_svm = svm_classifier.predict(Xtest_scaled)

In [74]:
# Evaluate the model
print("Support Vector Machine Classifier:")
print("Accuracy:", accuracy_score(ytest, y_pred_svm))
print("\nClassification Report:\n", classification_report(ytest, y_pred_svm))
print("\nConfusion Matrix:\n", confusion_matrix(ytest, y_pred_svm))

Support Vector Machine Classifier:
Accuracy: 0.8565

Classification Report:
               precision    recall  f1-score   support

         0.0       0.86      0.98      0.92      1593
         1.0       0.84      0.36      0.51       407

    accuracy                           0.86      2000
   macro avg       0.85      0.67      0.71      2000
weighted avg       0.85      0.86      0.83      2000


Confusion Matrix:
 [[1565   28]
 [ 259  148]]


### The SVM classifier achieved an accuracy of 85.65%. It has the highest class 1 precision of all the models (0.84) and a recall of 0.98 for class 0, so it rarely mislabels non-churned customers. However, its recall for class 1 is only 0.36, meaning it misses nearly two thirds of churned customers. This might be due to the imbalanced nature of the dataset or the choice of hyperparameters.

#### The SVM classifier is precise when it does predict churn but lacks recall for churned customers. This indicates that the model is conservative in predicting churn. Hyperparameter tuning and possibly adjusting the class weights could improve its ability to identify churned customers.
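
Adjusting class weights in `SVC` is done with the `class_weight` parameter. The sketch below compares the default against `class_weight="balanced"` on synthetic data (an assumption standing in for the churn files), so the printed values do not correspond to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the churn data, about 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

results = {}
for label, weights in [("default", None), ("balanced", "balanced")]:
    # class_weight="balanced" scales the penalty C for each class
    # inversely to that class's frequency in the training data.
    svm = SVC(class_weight=weights, random_state=7).fit(X_tr_s, y_tr)
    pred = svm.predict(X_te_s)
    results[label] = (precision_score(y_te, pred, zero_division=0),
                      recall_score(y_te, pred))

for label, (prec, rec) in results.items():
    print(f"{label:<9} precision={prec:.3f} recall={rec:.3f}")
```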

## Possible Future Improvements:

### Feature Engineering: Introducing more features or exploring interactions between existing features.
### Resampling Techniques: Addressing class imbalance using techniques like oversampling, undersampling, or Synthetic Minority Over-sampling Technique (SMOTE).
### Model Ensemble: Combining predictions from multiple models to improve overall performance.
### Advanced Techniques: Exploring neural network-based models or other advanced machine learning techniques for better capturing non-linear relationships.
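
As an illustration of the resampling idea, the sketch below performs random oversampling of the minority class with `sklearn.utils.resample`. It assumes synthetic data in place of the churn files and deliberately uses plain random oversampling rather than SMOTE, which lives in the separate imbalanced-learn package. Resampling is applied to the training split only, so the test set keeps its original class distribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Assumption: synthetic stand-in for the churn data, about 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# Split the training data by class.
X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]

# Draw minority rows with replacement until both classes have equal size.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=7)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])

print("Class counts before:", np.bincount(y_tr))
print("Class counts after: ", np.bincount(y_bal))
```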

## Possible Real-World Business Scenarios for Deploying the Models:

### Customer Retention: Implementing predictive models to identify customers at risk of churn and designing targeted retention strategies.
### Marketing Campaigns: Utilizing the model predictions to personalize marketing campaigns to different customer segments.
### Resource Allocation: Prioritizing resources and efforts on high-risk customers to maximize ROI.
### Product Development: Using insights from churn prediction to inform product or service improvements based on customer feedback.
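
For retention and resource allocation, a model's predicted probabilities can be used to rank customers by churn risk. The sketch below is a hypothetical illustration on synthetic data: the model choice, the `top_k` value, and the row indices standing in for customer IDs are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the churn data, about 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

model = LogisticRegression(max_iter=10000, random_state=7).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1 (churn).
proba = model.predict_proba(X_te)[:, 1]

# Hypothetical budget: contact only the 10 highest-risk customers.
top_k = 10
top_idx = np.argsort(proba)[::-1][:top_k]  # row positions, highest risk first

for rank, idx in enumerate(top_idx, start=1):
    print(f"{rank:>2}. row {idx:<4} churn probability = {proba[idx]:.3f}")
```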