# Customer Churn Prediction


## Problem Definition
**Customer Churn** refers to customers who stop using a company's product or service.  
The goal is to **predict which customers are likely to churn** based on historical behavior.


## Why This Problem is Important
- Retaining existing customers is **cheaper than acquiring new ones**.  
- Helps companies **increase revenue and improve customer satisfaction**.  
- Enables **targeted retention strategies** for at-risk customers.


## How Machine Learning Can Help
- ML models can **analyze historical customer data** to detect churn patterns.  
- **Predictive models** identify high-risk customers **before they leave**.  
- Supports **data-driven decision making** in marketing and customer support.


## Data Description
| Feature | Type | Description |
|---------|------|-------------|
| CustomerID | Identifier | Unique ID for each customer |
| Age | Numeric | Age of the customer |
| Gender | Categorical | Male / Female |
| Tenure | Numeric | Months customer has been with company |
| Usage Frequency | Numeric | How often customer uses the service |
| Support Calls | Numeric | Number of calls to support |
| Payment Delay | Numeric | Delays in payments |
| Subscription Type | Categorical | Basic / Standard / Premium |
| Contract Length | Categorical | Monthly / Quarterly / Annual |
| Total Spend | Numeric | Total amount spent by the customer |
| Last Interaction | Numeric | Days since last interaction |
| Churn | Target | 0 = Active, 1 = Churned |





In [None]:
# Core data-science imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Splitting, hyperparameter search, scaling, encoding, metrics, resampling, and pipeline utilities
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
# Model Imports
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

**Loading the Datasets**

In [None]:
train = pd.read_csv('/content/drive/MyDrive/Customer churn/Train.csv')
test = pd.read_csv('/content/drive/MyDrive/Customer churn/Test.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.drop('CustomerID', axis=1, inplace=True)
test.drop('CustomerID', axis=1, inplace=True)

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
train.dropna(inplace=True)

In [None]:
print(train.duplicated().sum())

In [None]:
print(test.duplicated().sum())

In [None]:
train.describe()

In [None]:
train.describe(include= 'object')

In [None]:
train_num_col = train.select_dtypes(include='number').columns
train_cat_col = train.select_dtypes(include='object').columns

In [None]:
test_num_col = test.select_dtypes(include='number').columns
test_cat_col = test.select_dtypes(include='object').columns

**Dataset Summary**

Numerical columns (8): Age, Tenure, Usage Frequency, Support Calls, Payment Delay, Total Spend, Last Interaction, and the target Churn (CustomerID was dropped above)

Binary nominal column (1): Gender

Ordinal categorical columns (2): Contract Length, Subscription Type

**Outlier Detection**

In [None]:
plt.figure(figsize=(10, 8))
sns.boxplot(data=train[train_num_col], palette='Greens')

plt.title('Boxplot for Outlier Detection')
plt.xticks(rotation=45)
plt.show()

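The boxplots above show no points beyond the whiskers, so none of the numerical columns contains outliers under seaborn's default 1.5 × IQR rule.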
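The visual check can be backed by a count. The following is a minimal sketch, assuming the same 1.5 × IQR whisker rule that the boxplot uses by default; `count_iqr_outliers` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns.
    outside = (num < (q1 - k * iqr)) | (num > (q3 + k * iqr))
    return outside.sum()

# Toy values invented for illustration only.
toy = pd.DataFrame({
    "Age": [25, 31, 40, 52, 38],
    "Support Calls": [1, 2, 1, 3, 40],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

On the notebook's data this would be called as `count_iqr_outliers(train[train_num_col])`; all-zero counts would agree with the plot.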

In [None]:
plt.figure(figsize=(5, 5))

sns.countplot(
    data=train,
    x="Churn",
    width=0.4
)

plt.title("Churn Distribution")
plt.xlabel("Churn")
plt.ylabel("Count")

plt.show()

The plot above reveals a significant class imbalance: Class 0 (Retained) is the majority, while Class 1 (Churned) is the minority.

To keep the model from becoming biased toward the majority class, techniques such as random undersampling or SMOTE can be applied to the training data. Note that the pipelines built later in this notebook do not include a resampling step, even though `RandomUnderSampler` is imported.
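Random undersampling can be sketched in a few lines of pandas. This is an illustrative version under stated assumptions, not the notebook's method: `undersample_majority` and the toy frame are hypothetical, and in practice imbalanced-learn's `RandomUnderSampler` plays the same role.

```python
import pandas as pd

def undersample_majority(df: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Randomly drop rows so that every class keeps as many rows as the rarest class."""
    n_min = df[target].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=seed) for _, grp in df.groupby(target)]
    balanced = pd.concat(parts)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data invented for illustration: 8 retained customers, 2 churned.
toy = pd.DataFrame({
    "Tenure": range(10),
    "Churn": [0] * 8 + [1] * 2,
})
print(toy["Churn"].value_counts(normalize=True))  # class shares before resampling

balanced = undersample_majority(toy, "Churn")
print(balanced["Churn"].value_counts())           # class counts after resampling
```

Resampling should touch only the training rows. The test set keeps its original class ratio so that the reported metrics reflect the conditions the model would face in use.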

In [None]:
n_cols = len(train_num_col)
n_rows = (n_cols + 3) // 4

plt.figure(figsize=(15, 4 * n_rows))

for i, col in enumerate(train_num_col, 1):
    plt.subplot(n_rows, 4, i)
    sns.histplot(data=train, x=col, kde=True, bins=30, color='skyblue')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')

plt.tight_layout()
plt.show()


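As the plots above show, none of the numerical features is normally distributed. This is acceptable because most of the models used here are tree-based (Decision Tree, Random Forest, XGBoost), which are non-parametric and do not assume any particular data distribution. Logistic Regression does not require normally distributed inputs either; it mainly benefits from the feature scaling applied in the preprocessing step.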
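One way to see why the shape of a feature's distribution matters little to a tree: a tree splits on the order of the values, so a monotonic transform such as a logarithm leaves the partition of the training rows unchanged. The sketch below uses an invented, strongly skewed toy feature and is only an illustration of that property.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Deterministic, strongly right-skewed toy feature (cubes of 1..40).
x = (np.arange(1, 41, dtype=float) ** 3).reshape(-1, 1)

# Toy label: high values churn, with three labels flipped by hand as noise.
y = (x[:, 0] > 20 ** 3).astype(int)
y[[3, 10, 30]] = 1 - y[[3, 10, 30]]

# Fit the same shallow tree on the raw feature and on its logarithm.
raw_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x, y)
log_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(np.log(x), y)

# Both trees split the training rows into the same groups,
# so their predictions on those rows agree.
same_predictions = np.array_equal(raw_tree.predict(x), log_tree.predict(np.log(x)))
print(same_predictions)
```

The agreement is guaranteed only for the training rows; a new value that falls between two training values can land on different sides of the two thresholds.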

In [None]:
plt.figure(figsize=(10, 6))

for i, col in enumerate(train_cat_col, 1):
    plt.subplot(1, 3, i)
    sns.countplot(data=train, x=col, hue='Churn', palette="Greens")
    plt.title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

- We observed that females have a higher churn rate than males.
- Customers with monthly contracts have the highest churn rate.
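Count plots show absolute counts, and the groups differ in size, so a rate per group is a safer basis for statements like the ones above. A minimal sketch on invented toy data: because `Churn` is coded 0/1, the group mean is the churn rate.

```python
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male"],
    "Churn":  [1, 1, 0, 1, 0, 0, 0, 0],
})

# Mean of a 0/1 target within each group is that group's churn rate.
churn_rate = toy.groupby("Gender")["Churn"].mean()
print(churn_rate)
```

On the notebook's data the same pattern would be `train.groupby(col)["Churn"].mean()` for each column in `train_cat_col`.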

In [None]:
sns.boxplot(data=train, x='Churn', y='Tenure', palette="Blues")

In [None]:
plt.figure(figsize=(6, 6))

corr = train[train_num_col].corr()
corr_with_target = corr['Churn'].sort_values(ascending=True).to_frame()
sns.heatmap(
    data=corr_with_target,
    annot=True,
    fmt=".2f",
    cmap="Blues",
    cbar=True,
    linewidths=0.5,
    linecolor='white',
    square=True
)

plt.title("Correlation With Target", fontsize=18)
plt.show()

The heatmap shows the correlation of each numerical feature with the target variable (Churn). Support Calls shows the strongest positive correlation (0.52).

Note: A full correlation matrix was also analyzed and showed no significant multicollinearity between the independent features, which supports including a linear model (Logistic Regression) alongside the tree-based ones.
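The multicollinearity check itself is not shown in the notebook. One way it could be done is to look for the largest absolute pairwise correlation among the features; the sketch below is an assumption about how such a check might look, with a hypothetical helper `max_feature_correlation` and invented toy values.

```python
import numpy as np
import pandas as pd

def max_feature_correlation(df: pd.DataFrame, target: str):
    """Return the feature pair with the largest absolute Pearson correlation, and its value."""
    corr = df.drop(columns=[target]).corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs.idxmax(), pairs.max()

# Toy data invented for illustration: Total Spend is almost a multiple of Tenure.
toy = pd.DataFrame({
    "Tenure": [1, 2, 3, 4, 5],
    "Total Spend": [2, 4, 6, 8, 11],
    "Age": [5, 1, 4, 2, 3],
    "Churn": [0, 0, 1, 1, 1],
})
pair, value = max_feature_correlation(toy, "Churn")
print(pair, round(value, 3))
```

A value close to 1 for any pair would signal redundant features, which can make Logistic Regression coefficients unstable; low values across all pairs are consistent with the note above.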

In [None]:
X_train = train.drop('Churn', axis=1)
y_train = train['Churn']

X_test = test.drop('Churn', axis=1)
y_test = test['Churn']

In [None]:
print("\nStarting Modeling Phase...\n")
numeric_features = ['Age', 'Tenure', 'Usage Frequency', 'Support Calls',
                    'Payment Delay', 'Total Spend', 'Last Interaction']
ordinal_features = ['Subscription Type', 'Contract Length']
nominal_features = ['Gender']
print("Numeric:", numeric_features)
print("Ordinal:", ordinal_features)
print("Nominal:", nominal_features)

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat_nom', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), nominal_features),
        ('cat_ord', OrdinalEncoder(categories=[
            ['Basic', 'Standard', 'Premium'],
            ['Monthly', 'Quarterly', 'Annual']
        ]), ordinal_features)
    ],
    remainder='drop',
    verbose_feature_names_out=False
)

In [None]:
models = {
    "Logistic Regression": LogisticRegression(solver='lbfgs', max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(
        # use_label_encoder is omitted: XGBoost releases that accept `device` no longer use it
        eval_metric='logloss',
        device="cpu",
        tree_method="hist"
    )
}

param_grids = {
    "Logistic Regression": {
        "classifier__C": [0.01, 0.1, 1, 10]
    },
    "Decision Tree": {
        "classifier__max_depth": [3, 5, 10, None],
        "classifier__min_samples_split": [2, 5, 10]
    },
    "Random Forest": {
        "classifier__n_estimators": [50, 100, 150],
        "classifier__max_depth": [3, 5, 10, None]
    },
    "XGBoost": {
        "classifier__n_estimators": [50, 100, 150],
        "classifier__learning_rate": [0.01, 0.1, 0.2],
        "classifier__max_depth": [3, 5, 7]
    }
}

In [None]:
results = []

#looping through every model
for name, model in models.items():
    print(f"\nTraining {name}...")


    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    if name in param_grids:

        grid = GridSearchCV(
            pipeline,
            param_grid=param_grids[name],
            scoring='accuracy',
            cv=3,
            n_jobs=-1
        )
        grid.fit(X_train, y_train)
        best_model = grid.best_estimator_
        best_params = grid.best_params_
    else:
        best_model = pipeline.fit(X_train, y_train)
        best_params = "Default"

    #  Predict & Evaluate
    y_pred = best_model.predict(X_test)

    # Calculate metrics
    acc = round(accuracy_score(y_test, y_pred), 2)
    prec = round(precision_score(y_test, y_pred), 2)
    rec = round(recall_score(y_test, y_pred), 2)
    f1 = round(f1_score(y_test, y_pred), 2)

    results.append({
        "Model": name,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1-score": f1,
        "Best Params": best_params
    })

    print(f"--- {name} Results ---")
    print(f"Accuracy: {acc} | F1: {f1}")

# Display final comparison table
results_df = pd.DataFrame(results)
print("\nFinal Model Comparison:")
display(results_df)

In [None]:
# After the loop, `name`, `best_model`, and `best_params` refer to the last model trained (XGBoost).
# Its metrics are already stored in `results`, so only the detailed report is printed here;
# appending again would add a duplicate row to the comparison table.
y_pred = best_model.predict(X_test)

print(f"\n--- {name} Results ---")
print(classification_report(y_test, y_pred))

In [None]:
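# Correlation of each numeric feature with the target.
# Churn's correlation with itself (always 1.0) is dropped so it does not top the list.
correlations = train.corr(numeric_only=True)['Churn'].drop('Churn').sort_values()

print("Top Positive Correlations:\n", correlations.tail(5))
print("Top Negative Correlations:\n", correlations.head(5))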

In [None]:
import shap

# NOTE: `best_model` is the pipeline from the last loop iteration, i.e. the tuned XGBoost model,
# which is the tree model that shap.TreeExplainer is given below.
preprocessor_step = best_model.named_steps['preprocessor']
xgboost_step = best_model.named_steps['classifier']

# Apply the fitted preprocessing so SHAP sees the same features the classifier was trained on
X_test_transformed = preprocessor_step.transform(X_test)

# Output column names in the ColumnTransformer's order (num, cat_nom, cat_ord);
# verbose_feature_names_out=False keeps them free of transformer prefixes
feature_names = preprocessor_step.get_feature_names_out().tolist()

explainer = shap.TreeExplainer(xgboost_step)
shap_values = explainer.shap_values(X_test_transformed)

plt.figure(figsize=(12, 10))
shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names, show=False)
plt.title("SHAP Summary Plot (Feature Impact on Churn)", fontsize=16)
plt.show()

In [None]:
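# Assumes Google Drive is already mounted at /content/drive, e.g. in Colab:
# from google.colab import drive; drive.mount('/content/drive')
import joblib
import os

folder_path = '/content/drive/MyDrive/Customer churn'
os.makedirs(folder_path, exist_ok=True)

# NOTE: best_model is the pipeline from the last loop iteration (XGBoost),
# which is not necessarily the top scorer in results_df.
file_path = os.path.join(folder_path, 'Best_Model.pkl')
joblib.dump(best_model, file_path)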

In [None]:
from huggingface_hub import login, HfApi

login(token="HF TOKEN")  # placeholder: supply your own token and keep it out of shared notebooks
api = HfApi()

model_repo = "MY REPO LINK"
api.upload_file(
    path_or_fileobj="/content/drive/MyDrive/Customer churn/Best_Model.pkl",
    path_in_repo="Best_Model.pkl",
    repo_id=model_repo,
    repo_type="model"
)

dataset_repo = "MY DATA SET REPO"
api.upload_file(
    path_or_fileobj="/content/drive/MyDrive/Customer churn/Train.csv",
    path_in_repo="Train.csv",
    repo_id=dataset_repo,
    repo_type="dataset"
)
api.upload_file(
    path_or_fileobj="/content/drive/MyDrive/Customer churn/Test.csv",
    path_in_repo="Test.csv",
    repo_id=dataset_repo,
    repo_type="dataset"
)