<a href="https://colab.research.google.com/github/GabyPugaBR/AAI2025/blob/main/Part_2_Predict_Customer_Churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Coding Exercise - ML Basics**

---

**Part 2: Predict Customer Churn** \
The churn rate, also known as the attrition rate or customer churn, is the percentage of customers who stop doing business with an entity over a given period.
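
As a quick illustration of the definition above, the churn rate can be computed as customers lost divided by customers at the start of the period. This is a minimal sketch: the function name `churn_rate` and the customer counts are hypothetical and do not come from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the share of starting customers who left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical numbers: 50 of 1,000 customers left during the period
rate = churn_rate(customers_at_start=1_000, customers_lost=50)
print(f"Churn rate: {rate:.1%}")  # prints "Churn rate: 5.0%"
```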

Original source: https://www.kaggle.com/datasets/muhammadshahidazeem/customer-churn-dataset?select=customer_churn_dataset-training-master.csv

In [12]:
# Got assistance from Claude.ai to reduce dataset while staying true to the data
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('/content/customer_churn_dataset-training-master.csv')

# Remove rows where Churn is missing
df = df.dropna(subset=['Churn'])

# Sample 30% with stratified sampling (preserves churn ratio)
sampled_df, _ = train_test_split(
    df,
    train_size=0.3,
    random_state=42,
    stratify=df['Churn']
)

# Save to Google Drive
sampled_df.to_csv(
    '/content/drive/MyDrive/Colab Notebooks/Datasets/reduced_churn_master.csv',
    index=False
)

print(f"✅ Saved {len(sampled_df):,} rows (30% of {len(df):,})")

✅ Saved 132,249 rows (30% of 440,832)
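
One way to check that stratified sampling keeps the churn ratio is to compare the share of churned rows before and after sampling. The sketch below uses a small synthetic frame (the `toy` data and its 30% churn share are assumptions made for illustration), not the Kaggle dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 30% of them churned
toy = pd.DataFrame({
    "feature": range(1000),
    "Churn": [1] * 300 + [0] * 700,
})

# Same call pattern as the cell above: keep 30% of rows, stratified on Churn
sample, _ = train_test_split(
    toy,
    train_size=0.3,
    random_state=42,
    stratify=toy["Churn"],
)

full_share = toy["Churn"].mean()
sample_share = sample["Churn"].mean()
print(f"Churn share in full data: {full_share:.3f}")
print(f"Churn share in sample:    {sample_share:.3f}")
```

If the two shares are close, the sample preserves the class balance of the original data.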


In [46]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# # Generate sample customer data
# data = {
# 'age': [25, 34, 45, 28, 52, 36, 41, 29, 47, 33],
# 'monthly_usage_hours': [10, 50, 20, 15, 60, 30, 25, 12, 55, 40],
# 'purchase_amount': [100, 250, 150, 80, 300, 200, 175, 90, 280, 220],
# 'customer_service_calls': [5, 2, 8, 6, 1, 3, 7, 4, 0, 2],
# 'region': ['North', 'South', 'West', 'East', 'South', 'North', 'West', 'East',
# 'South', 'North'],
# 'churn': [1, 0, 1, 1, 0, 0, 1, 1, 0, 0] # 1 = churned, 0 = not churned
# }
# df = pd.DataFrame(data)

df = pd.read_csv('/content/customer_churn_dataset-training-master.csv')
#df.columns
# ['CustomerID', 'Age', 'Gender', 'Tenure', 'Usage Frequency',
#        'Support Calls', 'Payment Delay', 'Subscription Type',
#        'Contract Length', 'Total Spend', 'Last Interaction', 'Churn']
#df.info()
df = df.dropna()
#df.info()

# Features and target
X = df[['Age', 'Usage Frequency', 'Total Spend', 'Support Calls',
        'Subscription Type', 'Gender', 'Contract Length']]
y = df['Churn']

# With assistance from Claude.ai
# Preprocessing: Scale numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'Usage Frequency', 'Total Spend', 'Support Calls']),
        ('cat', OneHotEncoder(sparse_output=False), ['Subscription Type', 'Gender', 'Contract Length'])
    ])

# Create pipeline with preprocessing and model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Split data into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train model
model.fit(X_train, y_train)


# Predict churn probability for a new customer
new_customer = pd.DataFrame({
    'Age': [20],
    'Usage Frequency': [10],
    'Total Spend': [50],
    'Support Calls': [5],
    'Subscription Type': ['Standard'],
    'Gender': ['Male'],
    'Contract Length': ['Quarterly']
})

churn_probability = model.predict_proba(new_customer)[0][1]  # Probability of churn (class 1)

# low_churn_customer = pd.DataFrame({
#     'Age': [36],
#     'Usage Frequency': [16],
#     'Total Spend': [750],
#     'Support Calls': [2],
#     'Subscription Type': ['Standard'],
#     'Gender': ['Male'],
#     'Contract Length': ['Annual']
# })

# churn_probability = model.predict_proba(low_churn_customer)[0][1]  # Probability of churn (class 1)

# Classify based on threshold (0.5)
threshold = 0.5
churn_prediction = 1 if churn_probability > threshold else 0
print(f"Churn Probability for new customer: {churn_probability:.2f}")
print(f"Churn Prediction (1 = churn, 0 = no churn): {churn_prediction}")

# Display model coefficients
# ColumnTransformer emits columns in transformer order: 'num' first, then 'cat',
# so the numeric names must come before the one-hot names to line up with coef_.
feature_names = ['Age', 'Usage Frequency', 'Total Spend', 'Support Calls'] + (
    model.named_steps['preprocessor']
    .named_transformers_['cat']
    .get_feature_names_out(['Subscription Type', 'Gender', 'Contract Length'])
).tolist()

coefficients = model.named_steps['classifier'].coef_[0]

print("\nModel Coefficients:")
for feature, coef in zip(feature_names, coefficients):
    print(f"{feature}: {coef:.2f}")

# Evaluate the model on test data
accuracy = model.score(X_test, y_test)
print(f"Model Accuracy: {accuracy:.2%}")


Churn Probability for new customer: 0.98
Churn Prediction (1 = churn, 0 = no churn): 1

Model Coefficients:
Age: 0.42
Usage Frequency: -0.12
Total Spend: -1.40
Support Calls: 2.18
Subscription Type_Basic: 0.66
Subscription Type_Premium: 0.56
Subscription Type_Standard: 0.56
Gender_Female: 1.31
Gender_Male: 0.47
Contract Length_Annual: -2.73
Contract Length_Monthly: 7.25
Contract Length_Quarterly: -2.73
Model Accuracy: 87.98%
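
For a binary logistic regression, the predicted probability is the logistic (sigmoid) function applied to the intercept plus the weighted sum of the preprocessed feature values. The sketch below shows that arithmetic with NumPy only. The intercept, coefficients, and feature values are hypothetical placeholders, not the fitted model's parameters.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical values for illustration only
intercept = -0.5
coefficients = np.array([1.2, -0.8])   # one weight per preprocessed feature
features = np.array([0.5, 1.0])        # already scaled / one-hot encoded

score = intercept + coefficients @ features   # linear part (log-odds)
probability = sigmoid(score)                  # probability of class 1
print(f"Score: {score:.2f}, probability: {probability:.2f}")
```

A positive coefficient pushes the score, and therefore the churn probability, up as the feature increases; a negative coefficient pushes it down.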


**The churn probability is the model's estimate of how likely a customer is to
cancel their service or stop doing business with the company. `predict_proba`
returns it as a value between 0 and 1, which can be read as a percentage from
0% to 100%. For example, a churn probability of 0.73 means the model estimates
a 73% chance that this customer will leave. With the 0.5 threshold used above,
a probability above 0.5 is classified as churn and anything at or below 0.5 as
no churn. This helps businesses identify at-risk customers before they actually
cancel, which matters because gaining new customers typically costs more than retaining current ones.**
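
The threshold rule described above can be applied to several customers at once. This is a minimal sketch: the helper `flag_churn` and the probability values are hypothetical stand-ins for the second column of `predict_proba`.

```python
import numpy as np

# Hypothetical churn probabilities for five customers
probabilities = np.array([0.08, 0.35, 0.50, 0.73, 0.98])


def flag_churn(probs, threshold=0.5):
    """Return 1 where the probability is strictly above the threshold, else 0."""
    return (np.asarray(probs) > threshold).astype(int)


print(flag_churn(probabilities))                 # [0 0 0 1 1]
print(flag_churn(probabilities, threshold=0.3))  # [0 1 1 1 1]
```

Lowering the threshold flags more customers as at risk, which may be worthwhile when missing a churner costs more than contacting a customer who would have stayed.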