In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt

# Shows plots in jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)

In [3]:
df = pd.read_csv('./data_for_predictions.csv')
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

Unnamed: 0,id,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,24011ae4ebbe3035111d65fa7c15bc57,0.0,4.739944,0.0,0.0,0.0,0.444045,0.114481,0.098142,40.606701,...,2,6,0,0,1,0,0,0,0,1
1,d29c2c54acc38ff3c0614d0a653813dd,3.668479,0.0,0.0,2.28092,0.0,1.237292,0.145711,0.0,44.311378,...,76,4,1,0,0,0,0,1,0,0
2,764c75f661154dac3a6c254cd082ea7d,2.736397,0.0,0.0,1.689841,0.0,1.599009,0.165794,0.087899,44.311378,...,68,8,0,0,1,0,0,1,0,0
3,bba03439a292a1e166f80264c16191cb,3.200029,0.0,0.0,2.382089,0.0,1.318689,0.146694,0.0,44.311378,...,69,9,0,0,0,1,0,1,0,0
4,149d57cf92fc41cf94415803a877cb4b,3.646011,0.0,2.721811,2.650065,0.0,2.122969,0.1169,0.100015,40.606701,...,71,9,1,0,0,0,0,1,0,0


In [4]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [5]:
# Make a copy of our data
train_df = df.copy()

# Separate target variable from independent variables
y = df['churn']
X = df.drop(columns=['id', 'churn'])
print(X.shape)
print(y.shape)

(14606, 61)
(14606,)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10954, 61)
(10954,)
(3652, 61)
(3652,)


In [69]:
# Add model training in here!
model = RandomForestClassifier(n_estimators=50, max_depth=12, random_state=42, max_features = 'sqrt', min_samples_split = 10, min_samples_leaf = 5,oob_score = True, n_jobs = -1)
model.fit(X_train, y_train)

# Generate Predictions and Evaluate Metrics

In [70]:
from sklearn.metrics import accuracy_score, classification_report

In [71]:
y_pred = model.predict(X_test)

In [72]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.90


In [64]:
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      3286
           1       1.00      0.02      0.03       366

    accuracy                           0.90      3652
   macro avg       0.95      0.51      0.49      3652
weighted avg       0.91      0.90      0.86      3652



# Key Issues

Imbalanced Class Performance: The model is heavily biased towards predicting non-churners (class 0), and it misses most churners (class 1).

Churners' Recall: The recall for churners is extremely low (0.02), meaning that almost no real churners are identified, even though identifying them is the primary goal of a churn prediction model.

Churners' F1-Score: The F1-score for churners is also very low (0.03): the perfect precision (1.00) cannot compensate for the near-zero recall.


# Suggested course of action: talk with the senior Data Scientist to authorize research into a better-suited model and, if possible, into using fewer features to reduce overfitting

# Accuracy
Accuracy tells us how well the model performs in terms of the overall correct predictions (both churners and non-churners). However, it is not the best metric for imbalanced datasets because it can be skewed towards the majority class. For example, predicting "no churn" for all customers can still give a high accuracy if the majority of customers don't churn.
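As a quick check of the point above, the sketch below computes the accuracy of a model that always predicts "no churn". The class counts come from the support column of the classification report earlier in this notebook; the variable names are illustrative.

```python
# Test-set class counts, taken from the classification report's support column
n_non_churn = 3286
n_churn = 366

# A model that always predicts "no churn" is right on every non-churner
# and wrong on every churner.
baseline_accuracy = n_non_churn / (n_non_churn + n_churn)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model that reports roughly 0.90 accuracy on this split therefore has to be judged on churner precision and recall, not on accuracy alone.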


# Precision
Precision is the proportion of positive predictions (churners) that were actually correct. In churn prediction, high precision means that when the model predicts a customer will churn, it is likely to be correct. This is crucial when businesses want to avoid spending unnecessary resources on incorrectly predicting churners. Precision is more important than accuracy in such scenarios because false positives (predicting a non-churner as a churner) can lead to unnecessary retention efforts.

# Recall
Recall is the proportion of actual churners that the model correctly identified. In churn prediction, high recall is important because we don't want to miss real churners. If a customer is likely to churn but isn't identified, the company could lose a valuable customer without taking corrective actions. A good model should maximize recall to catch as many true churners as possible.

# F1-Score
The F1-score combines both precision and recall into a single metric, making it a good way to balance the trade-off between them. In churn prediction, high F1-score ensures that we’re correctly identifying churners without excessively increasing false positives. It is especially useful when we have class imbalance, as it takes both false positives and false negatives into account. The F1-score helps avoid a situation where we optimize one metric at the expense of the other.

# ROC-AUC
ROC-AUC measures the model's ability to distinguish between the classes (churn vs non-churn). The area under the ROC curve (AUC) tells us how well the model can identify churners at various threshold levels. An AUC close to 1 means that the model has excellent ability to separate churners from non-churners, while an AUC closer to 0.5 indicates that the model is no better than random guessing. Since churn prediction is often about identifying the likelihood of a customer churning, this metric helps evaluate the model's discriminative power across different probability thresholds.
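One way to read the AUC is as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way in plain Python on made-up labels and scores; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (churner, non-churner) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # a tie counts as half a correct ranking
    return correct / (len(pos) * len(neg))

# Hypothetical labels and predicted churn probabilities
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
auc = pairwise_auc(y_true, scores)
print(f"AUC: {auc:.4f}")
```

Because only the ranking of the scores matters, the AUC does not depend on any single classification threshold.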

# Confusion Matrix
The confusion matrix gives us a detailed breakdown of the model’s predictions:

1. True Positives (TP): Correctly predicted churners.
2. True Negatives (TN): Correctly predicted non-churners.
3. False Positives (FP): Incorrectly predicted churners (non-churners predicted as churners).
4. False Negatives (FN): Incorrectly predicted non-churners (churners predicted as non-churners).

By analyzing the confusion matrix, you can understand the types of errors your model is making. In churn prediction, it’s particularly important to minimize false negatives (missed churners) since these missed opportunities can lead to lost customers.
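To make the link between the four cells and the reported metrics concrete, the sketch below recomputes precision, recall, F1 and accuracy for the churn class from the first confusion matrix printed later in this notebook, `[[3144, 142], [307, 59]]`. It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Cells of the confusion matrix [[TN, FP], [FN, TP]]
tn, fp = 3144, 142
fn, tp = 307, 59

precision = tp / (tp + fp)   # share of predicted churners that really churned
recall = tp / (tp + fn)      # share of real churners that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The values agree with the classification report for that run once rounded to two decimals.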

## This is the end of the basic model section. From here on are experiments to build a better model or choose a different one

In [12]:
from sklearn.model_selection import train_test_split

# Take a stratified subset of the training data (20%)
subset_size = 0.2
X_train_subset, _, y_train_subset, _ = train_test_split(
    X_train, y_train, train_size=subset_size, random_state=42, stratify=y_train
)

In [13]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Train with default hyperparameters
svm_model = SVC(random_state=42)
svm_model.fit(X_train_subset, y_train_subset)

# Evaluate on the test set
y_pred = svm_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8997809419496167
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      3286
           1       0.00      0.00      0.00       366

    accuracy                           0.90      3652
   macro avg       0.45      0.50      0.47      3652
weighted avg       0.81      0.90      0.85      3652





In [14]:
#for C in [0.1, 1, 10]:
#    for kernel in ['linear', 'rbf']:
#        svm_model = SVC(C=C, kernel=kernel, random_state=42)
#        svm_model.fit(X_train_subset, y_train_subset)
#        y_pred = svm_model.predict(X_test)
#        accuracy = accuracy_score(y_test, y_pred)
#        print(f"C={C}, kernel={kernel}, Accuracy={accuracy:.2f}")

# The dataset has many features (61), so cut them down to the 20 best to speed up training and reduce overfitting

In [15]:
from sklearn.feature_selection import SelectKBest, f_classif

# Select top 20 features
selector = SelectKBest(f_classif, k=20)
X_train_selected = selector.fit_transform(X_train_subset, y_train_subset)
X_test_selected = selector.transform(X_test)

# Train SVM on selected features
svm_model = SVC(C=1, kernel='rbf', random_state=42)
svm_model.fit(X_train_selected, y_train_subset)
y_pred = svm_model.predict(X_test_selected)
print("Accuracy with selected features:", accuracy_score(y_test, y_pred))

Accuracy with selected features: 0.8997809419496167


# The accuracy is still no better than my first model's (0.8998 is exactly the share of non-churners in the test set, 3286/3652)
### Use linear SVC for faster training

In [16]:
from sklearn.svm import LinearSVC

linear_svc = LinearSVC(C=1, random_state=42)
linear_svc.fit(X_train_subset, y_train_subset)
y_pred = linear_svc.predict(X_test)
print("Accuracy with LinearSVC:", accuracy_score(y_test, y_pred))

Accuracy with LinearSVC: 0.8956736035049289


In [17]:
# Use SGD to train the linear SVM faster

In [18]:
from sklearn.linear_model import SGDClassifier

# Use SGDClassifier with hinge loss (a linear SVM trained with stochastic gradient descent)
sgd_svm = SGDClassifier(loss='hinge', alpha=0.01, max_iter=1000, random_state=42)
sgd_svm.fit(X_train_subset, y_train_subset)
y_pred = sgd_svm.predict(X_test)
print("Accuracy with SGDClassifier:", accuracy_score(y_test, y_pred))

Accuracy with SGDClassifier: 0.6848302300109529


In [19]:
# Retrain on the full training set with the settings used above (C=1, rbf kernel); no hyperparameter search was run
best_svm_model = SVC(C=1, kernel='rbf', random_state=42)
best_svm_model.fit(X_train, y_train)
y_pred = best_svm_model.predict(X_test)
print("Final Accuracy:", accuracy_score(y_test, y_pred))

Final Accuracy: 0.8997809419496167


# Since this is a binary classification problem (will the customer churn or not), try alternative classification algorithms

In [37]:
from sklearn.linear_model import LogisticRegression

# Train Logistic Regression
log_reg = LogisticRegression(max_iter=500, random_state=42)
log_reg.fit(X_train, y_train)

# Evaluate
y_pred = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Logistic Regression Accuracy: 0.8995071193866374
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      3286
           1       0.44      0.01      0.02       366

    accuracy                           0.90      3652
   macro avg       0.67      0.50      0.48      3652
weighted avg       0.85      0.90      0.85      3652



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
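The convergence warning above suggests scaling the data. A minimal sketch of that idea, assuming a `StandardScaler` placed in front of the same `LogisticRegression` inside a pipeline, is shown below. It runs on synthetic data from `make_classification` (a stand-in, not the churn dataset), and names such as `X_demo` and `scaled_log_reg` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 10% positives; not the real churn data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied at predict time
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=500, random_state=42),
)
scaled_log_reg.fit(X_tr, y_tr)

n_iter = scaled_log_reg.named_steps["logisticregression"].n_iter_[0]
demo_accuracy = scaled_log_reg.score(X_te, y_te)
print("Iterations used:", n_iter)
print("Test accuracy:", demo_accuracy)
```

In the notebook the same pipeline could be fitted with `X_train, y_train`. Scaling may also help the distance- and gradient-based models tried here (SVC, k-NN, SGDClassifier), though that is untested in this notebook.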


In [39]:
from sklearn.neighbors import KNeighborsClassifier

# Train k-NN
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Evaluate
y_pred = knn_model.predict(X_test)
print("k-NN Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

k-NN Accuracy: 0.8934830230010953
              precision    recall  f1-score   support

           0       0.90      0.99      0.94      3286
           1       0.27      0.04      0.07       366

    accuracy                           0.89      3652
   macro avg       0.59      0.51      0.51      3652
weighted avg       0.84      0.89      0.86      3652



In [40]:
from sklearn.naive_bayes import GaussianNB

# Train Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Evaluate
y_pred = nb_model.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Naive Bayes Accuracy: 0.8384446878422782
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      3286
           1       0.13      0.11      0.12       366

    accuracy                           0.84      3652
   macro avg       0.52      0.52      0.52      3652
weighted avg       0.83      0.84      0.83      3652



In [41]:
from sklearn.tree import DecisionTreeClassifier

# Train Decision Tree
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)

# Evaluate
y_pred = dt_model.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Decision Tree Accuracy: 0.8973165388828039
              precision    recall  f1-score   support

           0       0.90      0.99      0.95      3286
           1       0.33      0.02      0.05       366

    accuracy                           0.90      3652
   macro avg       0.62      0.51      0.50      3652
weighted avg       0.84      0.90      0.86      3652



# Since Random Forest performs better than all of the other algorithms, I will optimize the Random Forest

In [73]:
# Use SMOTE to oversample churners so the training set is more balanced

In [74]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.5, random_state=42)  # Oversample churners to 50% of the non-churner count (about one third of the training set)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
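A float `sampling_strategy` is the desired ratio of minority to majority samples after resampling, not the minority share of the whole dataset. The sketch below works through that arithmetic for the three ratios tried in this notebook. The class counts are hypothetical (the training-set counts are not printed here), and `minority_count_after_oversampling` is an illustrative helper, not part of imbalanced-learn.

```python
def minority_count_after_oversampling(n_majority, n_minority, sampling_strategy):
    """Approximate minority-class size once it is oversampled to the given
    minority/majority ratio (never below the original minority count)."""
    return max(n_minority, int(n_majority * sampling_strategy))

# Hypothetical training-set class counts (not taken from the notebook)
n_majority, n_minority = 9000, 1000

for ratio in (0.5, 0.7, 1.0):
    new_minority = minority_count_after_oversampling(n_majority, n_minority, ratio)
    share = new_minority / (n_majority + new_minority)
    print(f"sampling_strategy={ratio}: churners={new_minority}, "
          f"share of training set={share:.2f}")
```

Only a ratio of 1.0 makes churners half of the training set.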

In [75]:
# Adjust the model to new parameters
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=12,
    class_weight="balanced_subsample", 
    min_samples_split=5,
    min_samples_leaf=3,
    n_jobs=-1,
    random_state=42
)

In [76]:
model.fit(X_train_balanced, y_train_balanced)

In [77]:
# Predict on test set
y_pred = model.predict(X_test)

# Predict class probabilities for ROC-AUC evaluation
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for class 1 (churn)

In [79]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

In [80]:
# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Display confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.96      0.93      3286
           1       0.29      0.16      0.21       366

    accuracy                           0.88      3652
   macro avg       0.60      0.56      0.57      3652
weighted avg       0.85      0.88      0.86      3652

Confusion Matrix:
[[3144  142]
 [ 307   59]]
ROC-AUC Score: 0.6417


# Interpretation of Results

#### Problems:
🔴 Low Recall (16%) for churners → The model is missing most churners (307 of 366).
🔴 Low Precision (29%) for churners → Many false positives (142 non-churners flagged as churners, against 59 correct flags).
🔴 F1-Score (21%) is too low → The model struggles to balance precision & recall.
⚠️ Accuracy (88%) is misleading due to class imbalance.
⚠️ ROC-AUC (0.64) is weak (Good models for churn should aim for 0.75+).

# Revisit the SMOTE settings

In [84]:
# Apply SMOTE to balance classes
smote = SMOTE(sampling_strategy=0.7, random_state=42)  # Make churners 70% of majority class
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)


In [86]:
model = RandomForestClassifier(
    n_estimators=200,         # More trees for better learning
    max_depth=15,             # Slightly deeper trees
    min_samples_split=3,      # Allow more splits (better learning of churners)
    min_samples_leaf=2,       # Reduce minimum leaf size (capture smaller patterns)
    class_weight="balanced",  # Penalize misclassified churners more
    n_jobs=-1,
    random_state=42
)

In [87]:
# Train model on balanced dataset
model.fit(X_train_balanced, y_train_balanced)

In [88]:
# Predict on test set
y_pred = model.predict(X_test)

# Predict class probabilities for ROC-AUC evaluation
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for class 1 (churn)

In [89]:
# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Display confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      3286
           1       0.42      0.11      0.17       366

    accuracy                           0.90      3652
   macro avg       0.67      0.55      0.56      3652
weighted avg       0.86      0.90      0.87      3652

Confusion Matrix:
[[3233   53]
 [ 327   39]]
ROC-AUC Score: 0.6436


# Insights
#### Recall for churners (11%) is still too low → Model is still missing most churners.
#### ROC-AUC (0.64) is weak → Push this above 0.75.
#### F1-score for churners (17%) is too low → The model struggles to balance precision & recall.


# RandomForestClassifier is still biased towards the majority class (non-churners).
#### Solution 1: Increase Churner Recall with SMOTE (Harder Oversampling)

In [92]:
# Apply SMOTE to increase churn samples
smote = SMOTE(sampling_strategy=1.0, random_state=42)  # Equalize churn & non-churn
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)


In [93]:
model = RandomForestClassifier(
    n_estimators=300,  
    max_depth=20,  
    min_samples_split=2,  
    min_samples_leaf=1,  
    class_weight="balanced",  # No extra effect here: classes are already 1:1 after SMOTE
    n_jobs=-1,
    random_state=42
)
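For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to hypothetical class counts before and after oversampling to a 1:1 ratio; `balanced_class_weights` is an illustrative helper, not a scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """Weights as n_samples / (n_classes * count), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Hypothetical class counts before and after oversampling to a 1:1 ratio
weights_imbalanced = balanced_class_weights({0: 9000, 1: 1000})
weights_equalized = balanced_class_weights({0: 9000, 1: 9000})
print("Imbalanced:", weights_imbalanced)
print("After 1:1 oversampling:", weights_equalized)
```

Once the classes have equal counts, both weights are 1.0, so the option no longer shifts the model towards churners.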

In [94]:
# Retrain model
model.fit(X_train_balanced, y_train_balanced)

In [95]:
# Predict on test set
y_pred = model.predict(X_test)

# Predict class probabilities for ROC-AUC evaluation
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for class 1 (churn)

In [96]:
# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Display confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.99      0.95      3286
           1       0.50      0.10      0.17       366

    accuracy                           0.90      3652
   macro avg       0.70      0.54      0.56      3652
weighted avg       0.87      0.90      0.87      3652

Confusion Matrix:
[[3249   37]
 [ 329   37]]
ROC-AUC Score: 0.6549


## Results
#### Churner Recall (10%) is still too low → The model is missing almost 90% of churners.
#### ROC-AUC (0.65) is still weak → We need above 0.75 to consider the model useful.
#### Higher precision (50%) but low recall → The model flags very few customers as churners (74 in total); half of those flags are correct, but about 90% of churners go undetected.

In [98]:
model = RandomForestClassifier(
    n_estimators=300,  
    max_depth=8,  # Reduce overfitting
    min_samples_split=2,  
    min_samples_leaf=1,  
    class_weight="balanced",  # No extra effect here: classes are already 1:1 after SMOTE
    n_jobs=-1,
    random_state=42
)

In [99]:
model.fit(X_train_balanced, y_train_balanced)

In [100]:
# Predict on test set
y_pred = model.predict(X_test)

# Predict class probabilities for ROC-AUC evaluation
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for class 1 (churn)

In [101]:
# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Display confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.87      0.89      3286
           1       0.17      0.24      0.20       366

    accuracy                           0.80      3652
   macro avg       0.54      0.55      0.54      3652
weighted avg       0.84      0.80      0.82      3652

Confusion Matrix:
[[2845  441]
 [ 277   89]]
ROC-AUC Score: 0.6116


### Churner Recall (24%) is still too low → The model is missing 76% of actual churners.
### ROC-AUC (0.61) is weak → A good churn model should be above 0.75.
### Many False Negatives (277 churners predicted as non-churners) → This is a risk for business decisions, and the recall gain came at the cost of 441 false positives (precision 17%).
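The runs above all use `model.predict`, which for a binary problem amounts to flagging a customer when the predicted churn probability is at least 0.5. One possible next step, sketched below under that assumption, is to lower the decision threshold on `y_proba` and watch how precision and recall move. The labels and probabilities here are made up for illustration, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall for the churn class at a given probability cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted churn probabilities
demo_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
demo_proba = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.55, 0.40, 0.70]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(demo_true, demo_proba, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With the notebook's own variables the call would be `precision_recall_at(list(y_test), list(y_proba), 0.3)`. A threshold chosen this way should be picked on a validation split, not on the test set.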
