# **Customer Churn Prediction**

What is customer churn?
Customer churn occurs when customers stop using a company's service.

Customer churn is a critical problem for subscription-based businesses, as retaining existing customers is often more cost-effective than acquiring new ones.
In this project, I built a customer churn prediction model using the Telco Customer Churn dataset to identify customers who are likely to discontinue the service.

The project involves data cleaning, handling missing values, encoding categorical features, feature scaling, and addressing class imbalance. A neural network model was implemented using PyTorch and trained on the processed dataset. The model’s performance was evaluated using ROC-AUC, achieving a validation score of approximately 0.84, indicating good predictive capability.

This project demonstrates an end-to-end machine learning workflow using PyTorch on real-world tabular data.

# **Why PyTorch?**
I used PyTorch to learn and demonstrate deep learning fundamentals such as **tensors, neural networks, and training loops** on a real-world problem.

*Simpler models could also solve this problem; PyTorch was chosen for practice with model training rather than out of necessity.*

Customer churn is a binary classification problem with:

many features

non-linear relationships

class imbalance

Neural networks can model these patterns well.
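
As a point of comparison for the simpler models mentioned above, here is a minimal sketch of a logistic-regression baseline. It runs on synthetic data from `make_classification` (an assumption made so the snippet is self-contained; it is not the Telco data), so the printed score says nothing about the churn results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for a churn table (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.73, 0.27], random_state=42
)
Xtr, Xva, ytr, yva = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# class_weight="balanced" reweights classes inversely to their frequency
baseline = LogisticRegression(max_iter=1000, class_weight="balanced")
baseline.fit(Xtr, ytr)

baseline_auc = roc_auc_score(yva, baseline.predict_proba(Xva)[:, 1])
print(f"Baseline ROC-AUC on synthetic data: {baseline_auc:.3f}")
```

Running the same few lines on the processed Telco features would give a reference score to compare the neural network against.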

**Step 1:** Dataset Selection

I used the Telco Customer Churn dataset from Kaggle, which contains customer demographic information, service usage details, billing information, and a churn label indicating whether the customer left.

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Telco-Customer-Churn.csv")

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


**Step 2:** Data Cleaning

The dataset contained:

An ID column (customerID) with no predictive value

A numerical column (TotalCharges) stored as text with missing values

I removed the ID column and converted TotalCharges to numeric format, replacing missing values with the median.

**Step 3:** Target Encoding

The churn label was converted from categorical values (Yes, No) into numerical values (1, 0) so it could be used for binary classification.

In [3]:
# Drop ID
df.drop("customerID", axis=1, inplace=True)

# Fix TotalCharges
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")  # convert strings to numbers; unparseable values become NaN
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())  # assign back instead of chained inplace fillna, which triggers a pandas FutureWarning

# Encode target
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})  # map the Yes/No label to 1/0 for binary classification

# One-hot encode
df = pd.get_dummies(df, drop_first=True)  # convert categorical columns into binary indicator columns

# Scale features
scaler = StandardScaler()  # scales each input feature to mean 0 and standard deviation 1
X = df.drop("Churn", axis=1)
df[X.columns] = scaler.fit_transform(X)

df.head()


Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,-0.439916,-1.277445,-1.160323,-0.994242,0,-1.009559,1.03453,-0.654012,-3.05401,3.05401,...,-0.525927,-0.790132,-0.525927,-0.79607,-0.514249,-0.562975,0.829798,-0.525047,1.406418,-0.544807
1,-0.439916,0.066327,-0.259629,-0.173244,0,0.990532,-0.966622,-0.654012,0.327438,-0.327438,...,-0.525927,-0.790132,-0.525927,-0.79607,1.944582,-0.562975,-1.205113,-0.525047,-0.711026,1.835513
2,-0.439916,-1.236724,-0.36266,-0.959674,1,0.990532,-0.966622,-0.654012,0.327438,-0.327438,...,-0.525927,-0.790132,-0.525927,-0.79607,-0.514249,-0.562975,0.829798,-0.525047,-0.711026,1.835513
3,-0.439916,0.514251,-0.746535,-0.194766,0,0.990532,-0.966622,-0.654012,-3.05401,3.05401,...,-0.525927,-0.790132,-0.525927,-0.79607,1.944582,-0.562975,-1.205113,-0.525047,-0.711026,-0.544807
4,-0.439916,-1.236724,0.197365,-0.94047,1,-1.009559,-0.966622,-0.654012,0.327438,-0.327438,...,-0.525927,-0.790132,-0.525927,-0.79607,-0.514249,-0.562975,0.829798,-0.525047,1.406418,-0.544807
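
Column names such as `Contract_One year` in the preview above come from one-hot encoding. The sketch below shows what `pd.get_dummies(..., drop_first=True)` does on a small toy frame; the toy values mirror the `Contract` categories, and `dtype=int` is added here only to make the printout show 0/1.

```python
import pandas as pd

# Toy frame with one categorical column (values mirror the Contract column)
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# drop_first=True drops the first category in sorted order, which becomes the
# implicit baseline: a row of all zeros means "Month-to-month"
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Dropping one category per feature avoids a redundant column, since its value is fully determined by the remaining indicators.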


**Step 4:** Train–Validation Split

The processed dataset was split into training and validation sets using an 80-20 ratio while preserving the churn distribution.
This helps evaluate how well the model generalizes to unseen data.

In [4]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split

X = df.drop("Churn", axis=1).values  # .values converts the pandas DataFrame to a NumPy array
y = df["Churn"].values
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,    # 80% of the data is used for training, 20% is held out for validation
    random_state=42,  # ensures the same split every run
    stratify=y        # keeps the churn ratio the same in both splits
)
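
The earlier preprocessing cell fits `StandardScaler` on the full dataset before this split, so validation rows contribute to the scaling statistics. A common alternative, sketched below on random stand-in data, fits the scaler on the training split only. This is a possible refinement, not what the notebook ran, and the score reported later comes from the original order of operations. All variable names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # stand-in features
y_raw = rng.integers(0, 2, size=200)                      # stand-in labels

Xtr_raw, Xva_raw, ytr_raw, yva_raw = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

split_scaler = StandardScaler()
Xtr_scaled = split_scaler.fit_transform(Xtr_raw)  # statistics come from training rows only
Xva_scaled = split_scaler.transform(Xva_raw)      # validation rows reuse those statistics

print(Xtr_scaled.mean(axis=0).round(3))  # approximately 0 for every column
```

With this ordering the validation set stays unseen during preprocessing, which makes the validation metric a slightly more honest estimate.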

**Step 5:** PyTorch Dataset and DataLoader

A custom PyTorch Dataset class was created and wrapped in DataLoaders to load data in mini-batches during training.

In [5]:
# Convert the data to a PyTorch dataset
import torch
from torch.utils.data import Dataset, DataLoader

class TelcoDataset(Dataset):  # custom dataset
    def __init__(self, X, y):  # takes NumPy arrays X and y and converts them to tensors, which PyTorch models expect
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):  # tells PyTorch how many samples are in the dataset
        return len(self.y)

    def __getitem__(self, idx):  # returns one feature-label pair; the DataLoader collates these into batches
        return self.X[idx], self.y[idx]

In [6]:
# Datasets and dataloaders
train_ds = TelcoDataset(X_train, y_train)  # separate datasets for training and validation
val_ds = TelcoDataset(X_val, y_val)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # batches of 64, reshuffled every epoch
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False)  # batches of 64, no shuffling

Before this step, the data existed only as NumPy arrays. After it, data flows through the standard PyTorch pipeline:

Dataset → DataLoader → Batches → Model

*PyTorch DataLoaders are used to train the model in mini-batches and to reshuffle the training data each epoch, which typically helps generalization.*
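
To make the batching concrete, here is a small sketch that runs a `DataLoader` over ten placeholder samples. It uses PyTorch's built-in `TensorDataset` instead of the custom `TelcoDataset` purely to keep the snippet self-contained; the tensor values are made up.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 fake samples with 4 features each (placeholder values)
features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
labels = torch.zeros(10)

demo_loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False)

# 10 samples in batches of 4 -> batch sizes 4, 4, 2
batch_sizes = [xb.shape[0] for xb, yb in demo_loader]
print(batch_sizes)
```

The last batch is smaller because the loader keeps leftover samples by default (`drop_last=False`).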

**Step 6:** Model Architecture

A simple feed-forward neural network was built using PyTorch, consisting of:

An input layer matching the number of features

One hidden layer with ReLU activation

An output layer producing a single churn score

In [7]:
# Define the model
import torch.nn as nn  # PyTorch's neural network utilities; models are built on the nn.Module API

class ChurnModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(  # define the network layers
            nn.Linear(input_dim, 64),
            nn.ReLU(),  # ReLU introduces non-linearity into the model
            nn.Linear(64, 1)
        )

    def forward(self, x):  # defines how data flows through the network
        return self.net(x)

model = ChurnModel(input_dim=X_train.shape[1])  # instantiate the network

***Implemented a simple feed-forward neural network with one hidden layer using PyTorch for binary churn prediction.***
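
A quick sanity check for an architecture like this is to count its parameters and push a random batch through it. The sketch below rebuilds the same layer stack with a hypothetical `input_dim` of 30 (an assumed value for illustration, not read from the notebook).

```python
import torch
import torch.nn as nn

input_dim = 30  # hypothetical feature count, not taken from the notebook

demo_net = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# Weights + biases: (30*64 + 64) + (64*1 + 1) = 2049
n_params = sum(p.numel() for p in demo_net.parameters())

out = demo_net(torch.randn(8, input_dim))  # batch of 8 random rows
print(n_params, out.shape)
```

The output has shape `(batch, 1)`: one raw score (logit) per customer, which is why later cells remove the trailing dimension before computing the loss.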

**Step 7:** Loss Function and Optimization

Binary Cross-Entropy loss with logits was used along with class weighting to handle data imbalance.
The Adam optimizer was chosen for efficient and stable training.

In [8]:
# Loss function and optimizer
# Handle class imbalance
churn_rate = y_train.mean()  # fraction of churners in the training set
pos_weight = torch.tensor((1 - churn_rate) / churn_rate, dtype=torch.float32)  # ratio of non-churners to churners

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # combines the sigmoid and the loss in a numerically stable way
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam with a learning rate of 0.001
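
To show what `pos_weight` does, the sketch below compares `BCEWithLogitsLoss` against the same loss written out by hand. The logits, targets, and the weight of 3.0 are illustrative values, not numbers from this dataset.

```python
import torch
import torch.nn as nn

demo_logits = torch.tensor([0.5, -1.0, 2.0, -0.3])
demo_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
w = torch.tensor(3.0)  # illustrative: a 25% positive rate gives (1 - 0.25) / 0.25 = 3

builtin = nn.BCEWithLogitsLoss(pos_weight=w)(demo_logits, demo_targets)

# Same quantity written out: positive examples are multiplied by w, negatives are not
p = torch.sigmoid(demo_logits)
manual = -(w * demo_targets * torch.log(p) + (1 - demo_targets) * torch.log(1 - p)).mean()

print(builtin.item(), manual.item())
```

Setting the weight to the ratio of negatives to positives makes errors on churners count as much in total as errors on the larger non-churner class.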

**Step 8:** Model Training

The model was trained over multiple epochs using mini-batch gradient descent.
During training, the model learned patterns that differentiate churned and non-churned customers.

In [9]:
# Train the model
for epoch in range(20):  # 20 full passes over the training data
    model.train()
    train_loss = 0

    for Xb, yb in train_loader:  # Xb: batch of features, yb: batch of labels (up to 64 rows)
        optimizer.zero_grad()  # clear old gradients
        logits = model(Xb).squeeze(1)  # forward pass; drop the trailing dimension so the shape is (batch,)
        loss = criterion(logits, yb)  # how far predictions are from the actual labels
        loss.backward()  # backpropagation: compute gradients for each parameter
        optimizer.step()  # update model weights using the gradients
        train_loss += loss.item()  # add this batch's mean loss to the epoch total

    print(f"Epoch {epoch+1} | Train Loss: {train_loss:.4f}")  # summed batch losses for the epoch

Epoch 1 | Train Loss: 75.2056
Epoch 2 | Train Loss: 65.2627
Epoch 3 | Train Loss: 64.1190
Epoch 4 | Train Loss: 62.8750
Epoch 5 | Train Loss: 62.8547
Epoch 6 | Train Loss: 62.2766
Epoch 7 | Train Loss: 62.0746
Epoch 8 | Train Loss: 61.4291
Epoch 9 | Train Loss: 61.7636
Epoch 10 | Train Loss: 61.1129
Epoch 11 | Train Loss: 61.1855
Epoch 12 | Train Loss: 61.1434
Epoch 13 | Train Loss: 61.1005
Epoch 14 | Train Loss: 61.0461
Epoch 15 | Train Loss: 60.6036
Epoch 16 | Train Loss: 60.2677
Epoch 17 | Train Loss: 60.0216
Epoch 18 | Train Loss: 59.8976
Epoch 19 | Train Loss: 60.3010
Epoch 20 | Train Loss: 59.6806
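
The values printed above are sums of per-batch losses (about 89 batches per epoch for this split and batch size), which is why they sit in the 60 to 75 range instead of below 1. A per-sample average is easier to compare across datasets and batch sizes. The sketch below shows one way to compute it; `mean_loss` is a hypothetical helper that is not part of the notebook, and the model and data are tiny stand-ins.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def mean_loss(model, loader, criterion):
    """Average per-sample loss over a loader (hypothetical helper, not in the notebook)."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for xb, yb in loader:
            logits = model(xb).squeeze(1)
            # criterion returns the batch mean, so weight it by the batch size
            total += criterion(logits, yb).item() * len(yb)
            n += len(yb)
    return total / n

# Tiny stand-in model and data so the snippet runs on its own
torch.manual_seed(0)
toy_model = nn.Linear(5, 1)
toy_loader = DataLoader(
    TensorDataset(torch.randn(50, 5), torch.randint(0, 2, (50,)).float()),
    batch_size=16,
)
avg = mean_loss(toy_model, toy_loader, nn.BCEWithLogitsLoss())
print(f"Average loss per sample: {avg:.4f}")
```

Calling a helper like this on the validation loader after each epoch would also show whether the model starts to overfit while the training loss keeps falling.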


**Step 9:** Model Evaluation

Model performance was evaluated on the validation set using ROC-AUC, achieving a score of approximately 0.84, indicating good discrimination ability.

In [10]:
# Evaluate
from sklearn.metrics import roc_auc_score  # ROC-AUC is threshold-independent, which suits imbalanced classes

model.eval()
with torch.no_grad():
    logits = model(torch.tensor(X_val, dtype=torch.float32)).squeeze(1)  # convert validation data to a tensor and run the forward pass
    probs = torch.sigmoid(logits)  # convert logits into probabilities between 0 and 1

print("Validation ROC-AUC:", roc_auc_score(y_val, probs.numpy()))

Validation ROC-AUC: 0.8432328399080318


As a rough rule of thumb for reading ROC-AUC values:

0.5 → random guessing

0.7 → okay

0.8+ → good

0.85+ → strong

At 0.84, this is a solid baseline model.

ROC-AUC answers one question:

“How well can my model tell churners apart from non-churners?”

On the validation set, the churn prediction model achieved a ROC-AUC of about 0.84, which indicates good separation between churners and non-churners.
If I randomly pick one churner and one non-churner, the model gives the churner the higher churn score about 84% of the time.
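
This pairwise reading of ROC-AUC can be checked directly. The sketch below uses a small set of made-up labels and scores, counts how often a positive outranks a negative, and compares the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])                   # made-up labels
scores = np.array([0.9, 0.3, 0.4, 0.5, 0.2, 0.8, 0.1, 0.6])  # made-up scores

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every (churner, non-churner) pair; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Here 13 of the 15 pairs are ordered correctly, so both numbers come out to 13/15, roughly 0.867.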