# Feature Engineering and Modelling

---

1. Import packages
2. Load data
3. Modelling

---

## 1. Import packages

In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt

# Shows plots in jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)

---
## 2. Load data

In [3]:
df = pd.read_csv('./data_for_predictions.csv')
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

Unnamed: 0,id,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,24011ae4ebbe3035111d65fa7c15bc57,0.0,4.739944,0.0,0.0,0.0,0.444045,0.114481,0.098142,40.606701,...,2,6,0,0,1,0,0,0,0,1
1,d29c2c54acc38ff3c0614d0a653813dd,3.668479,0.0,0.0,2.28092,0.0,1.237292,0.145711,0.0,44.311378,...,76,4,1,0,0,0,0,1,0,0
2,764c75f661154dac3a6c254cd082ea7d,2.736397,0.0,0.0,1.689841,0.0,1.599009,0.165794,0.087899,44.311378,...,68,8,0,0,1,0,0,1,0,0
3,bba03439a292a1e166f80264c16191cb,3.200029,0.0,0.0,2.382089,0.0,1.318689,0.146694,0.0,44.311378,...,69,9,0,0,0,1,0,1,0,0
4,149d57cf92fc41cf94415803a877cb4b,3.646011,0.0,2.721811,2.650065,0.0,2.122969,0.1169,0.100015,40.606701,...,71,9,1,0,0,0,0,1,0,0


---

## 3. Modelling

We now have a dataset containing features that we have engineered and we are ready to start training a predictive model. Remember, we only need to focus on training a `Random Forest` classifier.

In [4]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

### Data sampling

The first step is to split the dataset into training and test samples. Holding out a test sample lets us simulate a real-life situation: we generate predictions for data points the model has never seen, which shows how well it generalises to new data.

It is typical to hold out 20-30% of the data for testing. In this example we use a 75%/25% split between train and test respectively.

In [5]:
# Make a copy of our data
train_df = df.copy()

# Separate target variable from independent variables
y = train_df['churn']
X = train_df.drop(columns=['id', 'churn'])
print(X.shape)
print(y.shape)

(14606, 61)
(14606,)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10954, 61)
(10954,)
(3652, 61)
(3652,)
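The split above is purely random. When the target is imbalanced, as churn data often is, a stratified split keeps the class proportions the same in both samples. The sketch below illustrates this on synthetic data; the `_demo` arrays and the 10% positive rate are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (roughly 10% positives).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.10).astype(int)

# Plain random split, as in the cell above.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Stratified split: class proportions are preserved in train and test.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print("Overall positive rate:        ", y_demo.mean())
print("Plain split, test rate:       ", y_te_plain.mean())
print("Stratified split, test rate:  ", y_te_strat.mean())
```

Applied to this notebook, the equivalent change would be passing `stratify=y` to `train_test_split`, although doing so would change the numbers reported in the rest of the section.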


### Model training

Once again, we are using a `Random Forest` classifier in this example. A Random Forest sits within the category of `ensemble` algorithms because internally the `Forest` refers to a collection of `Decision Trees` which are tree-based learning algorithms. As the data scientist, you can control how large the forest is (that is, how many decision trees you want to include).

An `ensemble` algorithm is powerful because of the law of averages applied to weak learners: the errors of many imperfect models tend to cancel out when their predictions are combined. If we take a single decision tree and give it a sample of data and some parameters, it will learn patterns from the data. It may overfit or it may underfit, and we would then be relying entirely on that one model.

With `ensemble` methods, instead of banking on a single trained model, we can train thousands of decision trees, each using a different sample of the data and learning different patterns. It would be like asking 1000 people to learn how to code: you would end up with 1000 people with different answers, methods and styles. The weak learner notion applies here too. If you train your learners not to overfit but to pick up weak patterns in the data, and you have many of these weak learners, together they form a highly predictive pool of knowledge. This is a real-life application of the idea that many brains are better than one.

Instead of relying on a single decision tree for prediction, the random forest combines the views of the entire collection of decision trees. Some ensemble algorithms use a voting approach to decide which prediction is best; others use averaging.
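As a minimal sketch of the averaging approach, scikit-learn's `RandomForestClassifier` computes class probabilities as the mean of the probabilities predicted by its individual trees. The example below checks this on synthetic data; the dataset and the `forest` and `tree_probas` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic binary classification problem.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own output.
avg_proba = tree_probas.mean(axis=0)
print("Forest output equals the mean over trees:",
      np.allclose(avg_proba, forest.predict_proba(X_demo)))
```

The predicted label is then the class with the highest averaged probability.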

As we increase the number of learners, the idea is that the random forest's performance should converge to its best possible solution.
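One way to observe this behaviour is to train forests of increasing size and watch the test accuracy settle. The sketch below does so on synthetic data; the dataset, the tree counts and the `scores` dictionary are assumptions for illustration, and the exact values will differ on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit forests of increasing size and record the test accuracy of each.
scores = {}
for n_trees in (1, 5, 25, 100, 300):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_tr, y_tr)
    scores[n_trees] = forest.score(X_te, y_te)
    print(f"{n_trees:>4} trees -> test accuracy {scores[n_trees]:.3f}")
```

Accuracy typically fluctuates for very small forests and changes little once a few hundred trees are in place, which is why adding trees beyond that point mostly costs training time.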

Some additional advantages of the random forest classifier include:

- The random forest uses a rule-based approach instead of a distance calculation and so features do not need to be scaled
- It handles non-linear relationships between the features and the target better than linear models
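The first point can be checked with a small experiment: train one forest on raw features and another on standardised features, then compare their predictions. This is a sketch on synthetic data; the column scale factors and the `rf_raw` and `rf_scaled` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=1)

# Put the columns on very different scales.
X_raw = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])
X_raw_tr, X_raw_te = X_raw[:450], X_raw[450:]
y_tr = y_demo[:450]

# Standardise using statistics from the training rows only.
scaler = StandardScaler().fit(X_raw_tr)
X_std_tr, X_std_te = scaler.transform(X_raw_tr), scaler.transform(X_raw_te)

# Same hyperparameters and seed; only the feature scaling differs.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_raw_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_std_tr, y_tr)

agreement = np.mean(rf_raw.predict(X_raw_te) == rf_scaled.predict(X_std_te))
print(f"Share of identical test predictions: {agreement:.3f}")
```

Because each split only asks whether a feature is above or below a threshold, rescaling a feature moves the threshold but not the resulting partition, so the two forests should agree on essentially every prediction.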

On the flip side, some disadvantages of the random forest classifier include:

- The computational power needed to train a random forest on a large dataset is high, since we need to build a whole ensemble of estimators.
- Training time can be longer due to the increased complexity and size of the ensemble
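Because the trees are built independently, part of this cost can be offset by training them in parallel. The sketch below uses scikit-learn's `n_jobs` parameter on synthetic data and checks that the result does not depend on the number of workers; the dataset and the `rf_serial` and `rf_parallel` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=1 builds trees one at a time; n_jobs=-1 uses all available cores.
rf_serial = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=1)
rf_parallel = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf_serial.fit(X_demo, y_demo)
rf_parallel.fit(X_demo, y_demo)

# With a fixed random_state, both forests should give the same probabilities.
same = np.allclose(rf_serial.predict_proba(X_demo), rf_parallel.predict_proba(X_demo))
print("Same probabilities with and without parallelism:", same)
```

Parallelism reduces wall-clock time, not the total amount of computation, so the speed-up depends on how many cores are available.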

In [16]:
# Import the necessary libraries
from sklearn.ensemble import RandomForestClassifier

# Create a RandomForestClassifier with the following hyperparameters:
# n_estimators: Number of trees in the forest. Setting to 1000 for a more robust model.
# criterion: 'entropy' specifies that the model will use the information gain for splitting the nodes.
# min_samples_split: Minimum number of samples required to split an internal node. Setting this to 10 helps prevent overfitting.
# random_state: Setting a random seed to 42 for reproducibility so that you get the same results every time you run the model.
rf = RandomForestClassifier(n_estimators=1000, criterion='entropy', min_samples_split=10, random_state=42)

# Train (fit) the RandomForestClassifier on the training data.
# X_train: Features of the training dataset.
# y_train: Target variable of the training dataset.
rf.fit(X_train, y_train)

# After this, the model has been trained on the training data and is ready to make predictions or be evaluated.


In [18]:
# Make predictions using the RandomForestClassifier on the test set.
# X_test: The features of the test set that the model hasn't seen during training.
# The model will predict the target labels for these test data points.
y_pred_rf = rf.predict(X_test)

In [19]:
y_pred_rf #contains the predicted class labels for your test data

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [29]:
predictions = rf.predict(X_test)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()

In [30]:
print(f"True positives: {tp}")
print(f"False positives: {fp}")
print(f"True negatives: {tn}")
print(f"False negatives: {fn}\n")

print(f"Accuracy: {metrics.accuracy_score(y_test, predictions)}")
print(f"Precision: {metrics.precision_score(y_test, predictions)}")
print(f"Recall: {metrics.recall_score(y_test, predictions)}")

True positives: 12
False positives: 1
True negatives: 3285
False negatives: 354

Accuracy: 0.9027929901423878
Precision: 0.9230769230769231
Recall: 0.03278688524590164


### Evaluation

Now let's evaluate how well this trained model is able to predict the values of the test dataset.

In [17]:
# Evaluate the performance of the RandomForestClassifier on the test set.
# The score method computes the mean accuracy of the model's predictions.
# X_test: The features of the test set.
# y_test: The true labels of the test set.
# The method returns the accuracy, which is the proportion of correctly predicted samples.
accuracy = rf.score(X_test, y_test)

# Print the accuracy score
print("Accuracy on the test set:", accuracy)


Accuracy on the test set: 0.9027929901423878

In [28]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

# Calculate accuracy: the proportion of correctly predicted samples.
accuracy = accuracy_score(y_test, y_pred_rf)

# Calculate precision: the proportion of true positives among the predicted positives.
precision = precision_score(y_test, y_pred_rf)

# Calculate recall: the proportion of true positives among the actual positives.
recall = recall_score(y_test, y_pred_rf)

# Calculate F1-Score: the harmonic mean of precision and recall.
f1 = f1_score(y_test, y_pred_rf)

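# Calculate ROC-AUC Score: the area under the ROC curve, which measures the trade-off between true positive rate and false positive rate.
# Note: y_pred_rf holds hard 0/1 labels, so this value equals the balanced accuracy.
# Passing rf.predict_proba(X_test)[:, 1] instead would score how well the model ranks customers.
roc_auc = roc_auc_score(y_test, y_pred_rf)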

# Calculate the confusion matrix: a table showing the counts of true positives, false positives, true negatives, and false negatives.
conf_matrix = confusion_matrix(y_test, y_pred_rf)

# Print the performance metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"ROC-AUC Score: {roc_auc}")

# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)


Accuracy: 0.9027929901423878
Precision: 0.9230769230769231
Recall: 0.03278688524590164
F1-Score: 0.0633245382585752
ROC-AUC Score: 0.516241281941271
Confusion Matrix:
[[3285    1]
 [ 354   12]]
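The ROC-AUC value above was computed from hard 0/1 predictions, in which case it reduces to balanced accuracy: (12/366 + 3285/3286) / 2 ≈ 0.5162. ROC-AUC is usually computed from predicted probabilities instead. The sketch below contrasts the two on synthetic imbalanced data; the dataset and variable names are assumptions for illustration, and the values it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

hard_labels = forest.predict(X_te)               # 0/1 predictions
churn_proba = forest.predict_proba(X_te)[:, 1]   # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_proba = roc_auc_score(y_te, churn_proba)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"ROC-AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:          {balanced_acc:.4f}")
print(f"ROC-AUC from probabilities: {auc_from_proba:.4f}")
```

The first two numbers coincide because a single hard prediction gives the ROC curve only one interior point. The probability-based score evaluates the ranking of customers across all thresholds.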


In [26]:
from sklearn.metrics import classification_report

# Generate and print the classification report for the test data.
# y_test: True labels for the test set.
# y_pred_rf: Predicted labels by the RandomForestClassifier for the test set.
print(classification_report(y_test, y_pred_rf))


              precision    recall  f1-score   support

           0       0.90      1.00      0.95      3286
           1       0.92      0.03      0.06       366

    accuracy                           0.90      3652
   macro avg       0.91      0.52      0.51      3652
weighted avg       0.90      0.90      0.86      3652



### Model performance summary

Accuracy: 0.9028
The model correctly predicted whether customers would churn or not about 90.28% of the time. This figure is misleading because of class imbalance: only 366 of the 3,652 test customers (about 10%) churned, so a model that always predicts "no churn" would already score about 90.0%.

Precision: 0.9231
When the model predicts a customer will churn, it is correct about 92.31% of the time. However, this is based on only 13 positive predictions (12 correct, 1 incorrect): the model is accurate on the few churners it flags, but it flags very few.

Recall: 0.0328
The model identified only 3.28% of the actual churners. This is extremely low, showing that the model is missing most of the churners, which is a critical issue since the primary goal is to detect as many churners as possible.

F1-Score: 0.0633
The F1-Score, which balances precision and recall, is very low. This reflects the poor performance in terms of recall, despite high precision.

ROC-AUC Score: 0.5162
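The ROC-AUC score is only slightly above 0.5. Because it was computed from hard class labels rather than predicted probabilities, it equals the balanced accuracy (the mean of the recall on churners, 0.0328, and the recall on non-churners, 0.9997). It therefore mainly reflects the low recall at the default threshold and says little about how well the model ranks customers by churn risk.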

Confusion Matrix:
[[3285    1]
 [ 354   12]]
The confusion matrix shows that the model correctly predicted 3285 non-churners (true negatives) and 12 churners (true positives). However, it incorrectly predicted 354 actual churners as non-churners (false negatives) and only 1 non-churner as a churner (false positive).

### Is the model’s performance satisfactory?

The model’s performance is not satisfactory. Despite the high accuracy and precision, the model fails to identify most of the actual churners, which is the primary objective of the churn prediction task. The very low recall and F1-Score, along with a ROC-AUC score of 0.52 computed from hard labels, show that at the default decision threshold the model is not effectively separating churners from non-churners.
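Low recall at the default 0.5 threshold does not by itself mean the model carries no signal. One hedged way to explore the trade-off is to lower the probability threshold at which a customer is flagged as a churner. The sketch below does this on synthetic imbalanced data; the dataset, the thresholds and the `recalls` dictionary are assumptions for illustration, not results from the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
churn_proba = forest.predict_proba(X_te)[:, 1]

# Flag a customer as a churner when the predicted probability reaches the threshold.
recalls = {}
for threshold in (0.5, 0.3, 0.1):
    labels = (churn_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, labels)
    precision = precision_score(y_te, labels, zero_division=0)
    print(f"threshold {threshold:.1f}: recall {recalls[threshold]:.3f}, "
          f"precision {precision:.3f}")
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision, so the right operating point depends on the relative cost of missing a churner versus contacting a customer who would have stayed.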