# Feature Engineering and Modelling

---

1. Import packages
2. Load data
3. Modelling
4. Evaluation and Interpretation

---

## 1. Import packages

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt

# Shows plots in jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)

---
## 2. Load data

In [None]:
df = pd.read_csv('./data_for_predictions.csv')
# Remove the unnamed index column if it exists
if 'Unnamed: 0' in df.columns:
    df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

In [None]:
# Check the distribution of the target variable
print(f"Churn distribution:\n{df['churn'].value_counts()}")
print(f"Churn rate: {df['churn'].mean():.2%}")

# Visualize the class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='churn', data=df)
plt.title('Distribution of Churn')
plt.xlabel('Churn (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

---

## 3. Modelling

We now have a dataset containing features that we have engineered and we are ready to start training a predictive model. Remember, we only need to focus on training a `Random Forest` classifier.

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, precision_recall_curve

### Data sampling

The first thing we want to do is split our dataset into training and test samples. We do this to simulate a real-life situation: the model generates predictions for the test sample without ever having seen those data points during training. This lets us see how well the model generalises to new data, which is critical.

A typical share to dedicate to testing is between 20% and 30%. For this example we will use a 75-25% split between train and test respectively.

In [None]:
# Work on a copy so the original DataFrame stays untouched
train_df = df.copy()

# Separate target variable from independent variables
y = train_df['churn']
X = train_df.drop(columns=['id', 'churn'])
print(X.shape)
print(y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
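
Churn datasets are often imbalanced, so it can help to keep the churn rate the same in the training and test samples. The sketch below is only an illustration on synthetic data (`X_demo` and `y_demo` are made-up names, not the notebook's dataset) and uses the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 10% positives (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the class proportions equal in both samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```

Whether stratification changes the results for the churn data here depends on how imbalanced the target turns out to be.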

### Model training

Once again, we are using a `Random Forest` classifier in this example. A Random Forest sits within the category of `ensemble` algorithms because internally the `Forest` refers to a collection of `Decision Trees` which are tree-based learning algorithms. As the data scientist, you can control how large the forest is (that is, how many decision trees you want to include).

An `ensemble` algorithm is powerful because averaging many imperfect models reduces the variance of the combined prediction. If we take a single decision tree and give it a sample of data and some parameters, it will learn patterns from the data. It may overfit or it may underfit, and if it does, that single algorithm is all we have to fall back on.

With `ensemble` methods, instead of banking on a single trained model, we can train hundreds or even thousands of decision trees, each on a different bootstrap sample of the data and a random subset of features, so each learns slightly different patterns. It is like asking 1000 people to learn how to code: you end up with 1000 people with different answers, methods and styles. The weak learner notion applies here too: many learners that are individually imperfect, but whose errors are not strongly correlated, combine into a highly predictive pool of knowledge. This is a real-life application of the idea that many brains are better than one.

Now, instead of relying on a single decision tree for prediction, the random forest pools the views of the entire collection of decision trees. Some ensemble algorithms use a voting approach to decide which prediction is best; others average the individual predictions.
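
As a rough illustration of the averaging approach: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities. The sketch below checks this on synthetic data (`X_demo`, `y_demo` and the small forest are assumptions made for the example, not part of this notebook's pipeline).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the manual average with the forest's own probabilities
matches = np.allclose(manual_average, forest.predict_proba(X_demo))
print(f"Manual average matches forest output: {matches}")
```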

As we increase the number of learners, the idea is that the random forest's performance should converge to its best possible solution.
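
One way to look at this convergence is to grow the same forest step by step and track the test accuracy. This is a minimal sketch on synthetic data; the forest sizes are arbitrary choices and `warm_start=True` is used only to reuse the trees already built.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# warm_start=True keeps the trees already built and only adds new ones
forest = RandomForestClassifier(warm_start=True, random_state=0)
forest_sizes = [5, 25, 50, 100, 200]
test_accuracy = []
for n_trees in forest_sizes:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_tr, y_tr)
    test_accuracy.append(forest.score(X_te, y_te))

for n_trees, acc in zip(forest_sizes, test_accuracy):
    print(f"{n_trees:>4} trees -> test accuracy {acc:.3f}")
```

The scores usually fluctuate for small forests and flatten out as trees are added, although the exact shape depends on the data.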

Some additional advantages of the random forest classifier include:

- The random forest uses a rule-based approach instead of a distance calculation, so features do not need to be scaled
- It is able to handle non-linear relationships better than linear models
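
The point about scaling can be checked with a small experiment: train one forest on raw features and one on standardised features, with the same `random_state`, and compare their predictions on held-out rows. This is a sketch on synthetic data, not a proof; the predictions should agree on all or nearly all rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=800, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# Fit the scaler on the training rows only, then apply it to both samples
scaler = StandardScaler().fit(X_tr)
X_tr_scaled, X_te_scaled = scaler.transform(X_tr), scaler.transform(X_te)

raw_forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
scaled_forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr_scaled, y_tr)

# Fraction of held-out rows where both forests predict the same class
agreement = np.mean(raw_forest.predict(X_te) == scaled_forest.predict(X_te_scaled))
print(f"Prediction agreement: {agreement:.3f}")
```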

On the flip side, some disadvantages of the random forest classifier include:

- The computational power needed to train a random forest on a large dataset is high, since we need to build a whole ensemble of estimators.
- Training time can be longer due to the increased complexity and size of the ensemble

In [None]:
# Train the Random Forest model with hand-picked hyperparameters (not yet tuned)
# n_estimators: Number of trees in the forest
# max_depth: Maximum depth of the trees (helps prevent overfitting)
# min_samples_split: Minimum samples required to split an internal node
# min_samples_leaf: Minimum samples required to be at a leaf node
# random_state: For reproducibility
# class_weight: To handle class imbalance if present

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=4,
    random_state=42,
    class_weight='balanced',
    n_jobs=-1  # Use all available cores
)

# Fit the model on the training data
model.fit(X_train, y_train)

# Print training completion message
print("Random Forest model training completed!")
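
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_per_class)`, so the rarer class counts for more during training. The sketch below reproduces that formula on a made-up target with 10% churners and compares it with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative target with 10% churners (not the notebook's data)
y_demo = np.array([0] * 900 + [1] * 100)
classes = np.unique(y_demo)

# 'balanced' weights: n_samples / (n_classes * count_per_class)
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

for cls, weight in zip(classes, manual_weights):
    print(f"class {cls}: weight {weight:.3f}")
```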

In [None]:
# Feature importance analysis
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': model.feature_importances_
})

# Sort by importance
feature_importances = feature_importances.sort_values('Importance', ascending=False).reset_index(drop=True)

# Display top 15 most important features
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances.head(15))
plt.title('Top 15 Feature Importances in Random Forest Model')
plt.tight_layout()
plt.show()

# Print top 15 features
print("Top 15 most important features:")
print(feature_importances.head(15))
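
The `feature_importances_` attribute is impurity-based and computed from the training data, so it can be worth cross-checking against permutation importance on held-out data. The following is a minimal sketch on synthetic data using `sklearn.inspection.permutation_importance`; the scoring choice and number of repeats are assumptions for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in ROC AUC
result = permutation_importance(
    forest, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0
)

comparison = pd.DataFrame({
    "impurity_importance": forest.feature_importances_,
    "permutation_importance": result.importances_mean,
}).sort_values("permutation_importance", ascending=False)
print(comparison)
```

If the two rankings disagree strongly for a feature, that feature deserves a closer look before it is used to draw business conclusions.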

### Evaluation

Now let's evaluate how well this trained model is able to predict the values of the test dataset.

In [None]:
# Generate predictions on the test set
y_pred = model.predict(X_test)  # Class predictions (0 or 1)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability predictions for the positive class (1)

In [None]:
# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")

# Display the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Not Churned (0)', 'Churned (1)'],
            yticklabels=['Not Churned (0)', 'Churned (1)'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Display the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
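
To make the link between the confusion matrix and the printed metrics explicit, here is a small worked example on ten hand-written labels (`y_true` and `y_hat` are made up for illustration). It derives precision, recall and F1 from the four cell counts.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)  # 3 / (3 + 1)
recall_manual = tp / (tp + fn)     # 3 / (3 + 1)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision_manual:.2f}, recall={recall_manual:.2f}, f1={f1_manual:.2f}")
```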

In [None]:
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Plot Precision-Recall curve
precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(10, 8))
plt.plot(recall_curve, precision_curve, label='Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.grid(True, alpha=0.3)
plt.show()

## 4. Evaluation and Interpretation

### Why I Chose These Evaluation Metrics

I selected multiple evaluation metrics to provide a comprehensive assessment of the model's performance:

1. **Accuracy**: I included accuracy as it gives an overall view of correct predictions. However, accuracy alone can be misleading in imbalanced datasets where one class (typically non-churn) dominates. If 90% of customers don't churn, a model could achieve 90% accuracy by simply predicting "no churn" for everyone.

2. **Precision**: This metric answers the question: "Of all customers we predicted would churn, what percentage actually churned?" High precision is important when the cost of false positives is high. In a churn context, this might relate to the cost of retention offers given to customers who wouldn't have churned anyway.

3. **Recall**: This metric answers: "Of all customers who actually churned, what percentage did we correctly identify?" High recall is crucial when the cost of false negatives is high. In churn prediction, missing customers who will churn (false negatives) is typically more costly than incorrectly flagging loyal customers (false positives).

4. **F1 Score**: As the harmonic mean of precision and recall, F1 score provides a balance between these two metrics. This is particularly valuable in churn prediction where we need to balance identifying as many churners as possible while minimizing false alarms.

5. **ROC-AUC**: This metric evaluates how well the model can distinguish between classes across various threshold settings. It's threshold-independent, making it useful for comparing model performance regardless of the specific classification threshold chosen.

6. **Confusion Matrix**: This visual representation helps understand the types of errors the model is making (false positives vs. false negatives), which is crucial for business decision-making in churn prevention.

7. **Precision-Recall Curve**: This visualization is particularly useful for imbalanced datasets as it focuses on the positive class (churners) and shows the trade-off between precision and recall at different thresholds.
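
The precision-recall trade-off can also be used to pick a classification threshold other than the default of 0.5. The sketch below, on synthetic imbalanced data, selects the highest threshold that still reaches a target recall; the 0.80 target is a hypothetical value, not a recommendation from this analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives), for illustration only
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = forest.predict_proba(X_te)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_te, proba)

# Highest threshold that still reaches the target recall (target is hypothetical)
target_recall = 0.80
candidates = np.where(recalls[:-1] >= target_recall)[0]
chosen_threshold = thresholds[candidates[-1]]

y_hat = (proba >= chosen_threshold).astype(int)
print(f"Threshold: {chosen_threshold:.2f}")
print(f"Precision: {precision_score(y_te, y_hat):.3f}")
print(f"Recall:    {recall_score(y_te, y_hat):.3f}")
```

In practice the threshold would be chosen on a separate validation sample rather than on the test set, so that the reported test metrics stay unbiased.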

### Assessment of Model Performance

Based on the evaluation metrics, I can assess whether the model performance is satisfactory:

1. **Contextual Evaluation**: The model should be evaluated in the context of the business problem. For churn prediction, even modest improvements over random guessing can translate to significant business value if the cost of churn is high.

2. **Baseline Comparison**: The model should perform significantly better than simple baselines. The ROC curve comparison against the random guess line (diagonal) helps visualize this improvement.

3. **Business Impact**: A good model for churn prediction should have high recall (to catch most potential churners) while maintaining reasonable precision (to avoid wasting resources on false alarms). The F1 score helps balance these concerns.

4. **Class Imbalance Consideration**: Given that churn datasets are typically imbalanced, the model's ability to correctly identify the minority class (churners) is particularly important. The precision-recall curve helps assess this capability.

5. **Feature Importance Analysis**: Understanding which features drive the predictions helps validate the model from a business perspective. If the important features align with business intuition about churn drivers, it increases confidence in the model.

Overall, I would consider the model satisfactory if:
- The ROC-AUC is significantly above 0.5 (random guessing)
- The recall for the churn class is high enough to identify a meaningful proportion of potential churners
- The precision is sufficient to ensure that retention efforts are not wasted on too many false positives
- The model's predictions make business sense when examining feature importances

The final assessment would depend on the specific results obtained after running the model, but the framework above provides a structured approach to evaluating its performance.
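
The baseline comparison described above can be sketched with scikit-learn's `DummyClassifier`. The example uses a made-up target with 90% non-churners to show why accuracy alone is misleading: a model that always predicts "no churn" scores 90% accuracy while finding no churners at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Illustrative data: 90% non-churn, 10% churn; the dummy ignores the features
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)
proba_base = baseline.predict_proba(X_demo)[:, 1]

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base)
baseline_auc = roc_auc_score(y_demo, proba_base)

print(f"Accuracy: {baseline_accuracy:.2f}")  # 0.90
print(f"Recall:   {baseline_recall:.2f}")    # 0.00
print(f"ROC AUC:  {baseline_auc:.2f}")       # 0.50
```

Any trained model should clearly beat these numbers on recall and ROC AUC before it is considered useful.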

## Conclusion

In this notebook, we've built a Random Forest classifier to predict customer churn. We've evaluated its performance using multiple metrics and visualizations to gain a comprehensive understanding of its strengths and limitations.

The Random Forest model is particularly well-suited for churn prediction because:
1. It can capture non-linear relationships between features and churn
2. It provides feature importance rankings that offer business insights
3. It can handle features on very different scales without normalisation (in scikit-learn, categorical features still need to be encoded as numbers first)
4. It's relatively robust to outliers and noisy data

For future improvements, we could consider:
1. Hyperparameter tuning using cross-validation
2. Exploring other ensemble methods like Gradient Boosting
3. Feature engineering to create more predictive variables
4. Addressing class imbalance through techniques like SMOTE if necessary
5. Calibrating the model's probability outputs for better threshold selection
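
As a starting point for the first item, hyperparameter tuning with cross-validation could look roughly like the sketch below. It runs on synthetic data, and the grid values are hypothetical; sensible ranges depend on the real dataset and the available compute budget.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data, for illustration only
X_demo, y_demo = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# Small, hypothetical grid of candidate hyperparameters
param_grid = {"max_depth": [5, 15], "min_samples_leaf": [1, 4]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated ROC AUC: {search.best_score_:.3f}")
```

The search should be run on the training sample only, so that the test sample remains an honest check of the tuned model.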

The ultimate success of a churn prediction model should be measured by its impact on reducing customer attrition when deployed in a real business context.