# Customer Churn Prediction

**Objective**: Predict which bank customers are likely to leave using classification models.

**Dataset**: Churn Modelling Dataset


#### Import Libraries
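
The import cell did not survive in this export. Below is a minimal sketch of the imports the later cells appear to need, inferred from the names they use (`pd`, `sns`, `plt`, `joblib`). The scikit-learn imports are an assumption based on the three models and the metrics named in the evaluation section.

```python
# Names used directly by the cells in this notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib

# Assumed: scikit-learn pieces for the three models and the metrics discussed later
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
```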

#### Load Dataset

In [None]:

df = pd.read_csv('../data/Churn_Modelling.csv')
df.head()

#### Data Cleaning and Preparation

In [None]:
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)
df = pd.get_dummies(df, columns=['Geography', 'Gender'], drop_first=True)

#### Exploratory Data Analysis

In [None]:
# Distribution of the target variable (Exited)
sns.countplot(x='Exited', data=df)
plt.title("Churn Distribution")
plt.show()



##### The Above Graph Shows that: 
The dataset is imbalanced:
- Majority (~80%) of customers did not churn.
- Minority (~20%) exited.

In [None]:
sns.histplot(df['Age'], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()



##### The Above Graph Shows that:
Most customers are between 30 and 45 years old.
- Very few are under 20 or over 60.
- The distribution is slightly right-skewed.

Age could play a role in customer churn.


In [None]:
sns.countplot(x='Tenure', hue='Exited', data=df)
plt.title("Tenure vs Churn")
plt.show()



##### The Above Graph Shows that:
Churn is fairly uniform across all tenure levels (0–10 years).
- No clear trend of more or less churn at any specific tenure.
- Tenure alone may not be a strong predictor.


In [None]:
sns.boxplot(x='Exited', y='Balance', data=df)
plt.title("Balance vs Churn")
plt.show()


##### The Above Graph Shows that:
Churners often have **higher account balances** than non-churners.
- Customers with high balances may leave due to poor engagement or service.

Balance appears to be an important feature for churn prediction.

#### Model Training Using Different Models
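
The training cells are not present in this export. The sketch below shows one way the three models compared in the evaluation section could be trained and scored. It is not the notebook's original code: the synthetic data, the 80/20 stratified split, `random_state=42`, and all hyperparameters are assumptions, chosen so the block runs on its own. The name `rf` matches the cells further down.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with a roughly 80/20 class balance.
# In the notebook, X would be df.drop('Exited', axis=1) and y would be df['Exited'].
X, y = make_classification(
    n_samples=2000, n_features=11, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# Stratify so the train and test sets keep the same churn rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
models = {
    "Random Forest": rf,
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    # Logistic regression is sensitive to feature scale, so standardize first
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Fit each model and record its accuracy on the held-out test set
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2%}")
```

The scores printed here come from synthetic data and will not match the figures reported below.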

#### Model Evaluation

##### Model Accuracy Comparison:

- Random Forest: 86.65%
- Gradient Boosting: 86.75% (highest, by 0.10 percentage points)
- Logistic Regression: 81.10% (lowest)

We will nevertheless use Random Forest because it’s simpler, faster, and more interpretable, and the accuracy difference is negligible in this case.

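Confusion Matrix (Random Forest; the four counts sum to 2,000 test customers):
- 1550 correct non-churn predictions (true negatives)
- 183 correct churn predictions (true positives)
- 210 churners missed, wrongly predicted as non-churn (false negatives)
- 57 non-churners wrongly predicted as churn (false positives)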
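
The four counts above are enough to derive accuracy, precision, and recall by hand. The sketch below does that arithmetic, and also computes the accuracy of always predicting "no churn" as a point of comparison. The variable names are illustrative.

```python
# Counts taken from the confusion matrix above
tn, tp, fn, fp = 1550, 183, 210, 57
total = tn + tp + fn + fp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted churners who actually churned
recall = tp / (tp + fn)        # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting "no churn", for comparison
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} majority_baseline={baseline:.4f}")
```

The accuracy works out to 86.65%, matching the Random Forest figure reported above. On these counts the model catches fewer than half of the actual churners (183 of 393), which is worth weighing if the goal is retention outreach.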

**ROC Curve – Random Forest**

- The ROC curve shows the trade-off between the true positive rate and the false positive rate as the classification threshold varies.
- An AUC closer to 1 means better model performance; 0.5 corresponds to random guessing.

This Random Forest model has a good AUC score, indicating strong predictive power.
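
AUC has an equivalent reading that can make the score easier to interpret: it is the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The sketch below computes AUC that way by counting pairs. It is an illustration, not the notebook's evaluation code (which would more likely pass `rf.predict_proba` output to scikit-learn's `roc_auc_score`); the function name `auc_by_pairs` and the sample data are hypothetical.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (churner, non-churner) pairs ranked correctly.

    Ties between a churner and a non-churner count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churned) and predicted churn probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, scores))  # 0.75: three of the four pairs are ranked correctly
```

Pair counting is quadratic in the number of customers, so it suits small illustrations rather than a full test set.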


#### Feature Importance

In [None]:
# Feature Importance (Random Forest)
importances = rf.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title("Feature Importance (Random Forest)")
plt.show()

##### Top features influencing churn:
- **Age** is the most influential feature in this model.
- Other strong predictors include EstimatedSalary, CreditScore, Balance, and Number of Products.
- Features like Geography_Spain, Gender_Male, and HasCrCard have much lower importance.

Age and the other top-ranked features are the key drivers of the Random Forest model’s decisions.

In [None]:
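import os
# Save the final model; create the output directory first so joblib.dump does not fail
os.makedirs('model', exist_ok=True)
joblib.dump(rf, 'model/churn_rf_model.pkl')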

## Conclusion

- **Random Forest** (86.65% accuracy) was selected as the final model; Gradient Boosting scored marginally higher (86.75%).
- Key churn indicators: **Age**, **EstimatedSalary**, **Balance**, **CreditScore**, **Activity**, and **Number of Products**.
- Customers with high balance but low activity were more likely to churn.
- This model can help the bank **target at-risk customers** for retention efforts.

✅ The trained model is saved to `model/churn_rf_model.pkl` and can be reloaded with `joblib.load` for deployment.