# imports and plotting

Typical first cell in a notebook. It pulls in Pandas and NumPy for data handling, scikit-learn's train_test_split for splitting the data later on, and Seaborn and Matplotlib for plotting. The %matplotlib inline line is a Jupyter magic command that makes plots render inside the notebook.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline


# load the CSV into a DataFrame

In [None]:
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')


# quick dataset checks / exploration

When I first get a new dataset, my immediate instinct is to just get a feel for it. I'll quickly check how big it is with len(df) to see how many rows I'm dealing with. Then, I always take a sneak peek at the actual data with df.head()—it's like glancing at the first few pages of a book. If there are too many columns to read easily, my little trick is to flip it by adding .T to get a much cleaner, vertical view. And I never, ever skip checking df.dtypes; it's my first line of defense against those sneaky problems like numbers masquerading as text.

In [None]:
len(df)
df.head()
df.head().T
df.dtypes


column,dtype
customerID,object
gender,object
SeniorCitizen,int64
Partner,object
Dependents,object
tenure,int64
PhoneService,object
MultipleLines,object
InternetService,object
OnlineSecurity,object


# converting TotalCharges

When I see TotalCharges as an 'object' type, I immediately know there are hidden non-numeric values. My first step is always to run pd.to_numeric(df['TotalCharges'], errors='coerce') to force conversion and turn the problems into NaNs. Then I check for those missing values to see what needs cleaning.
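As a small illustration of what errors='coerce' does, here is a sketch on a made-up Series. The values are hypothetical and not taken from the dataset; the point is only that unparseable entries turn into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical toy column: two valid numbers, one blank string, one junk string
raw = pd.Series(['29.85', ' ', '1889.5', 'n/a'])

# Entries that cannot be parsed as numbers become NaN
converted = pd.to_numeric(raw, errors='coerce')

# Count how many entries failed to parse
n_missing = int(converted.isna().sum())

print(converted)
print("values that failed to parse:", n_missing)
```

Counting the NaNs right after the conversion shows how much cleaning is left before deciding how to fill them.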



In [None]:
# convert TotalCharges to numeric, coerce invalid parsing to NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# quick checks after conversion
df['TotalCharges'].isna().sum()   # how many became NaN
df.info()                         # overview of dtypes and non-null counts


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


# Handling non-numeric values in the TotalCharges column

The conversion above tells me how many values failed to parse, but not which ones. To find the offending rows, I run pd.to_numeric(df['TotalCharges'], errors='coerce') into a separate Series and use its NaN positions as a mask on the DataFrame. The customers listed below are the ones whose TotalCharges could not be read as a number, which tells me exactly what's left to clean up.

In [None]:
# Convert 'TotalCharges' to numeric values, coercing errors to NaN
total_charges = pd.to_numeric(df.TotalCharges, errors='coerce')

# Find rows where conversion failed (NaN)
df[total_charges.isnull()][['customerID', 'TotalCharges']]


Unnamed: 0,customerID,TotalCharges
488,4472-LVYGI,
753,3115-CZMZD,
936,5709-LVOEQ,
1082,4367-NUYAO,
1340,1371-DWPAZ,
3331,7644-OMVMY,
3826,3213-VVOLG,
4380,2520-SGTTA,
5218,2923-ARZLG,
6670,4075-WKNIU,


# Handling missing values in TotalCharges

In [None]:
# Replace NaN values in 'TotalCharges' with 0
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
df.TotalCharges = df.TotalCharges.fillna(0)


# Standardizing column names

Column names are inconsistently cased: some start with a lowercase letter and others with an uppercase one. We standardize by making all column names lowercase and replacing any spaces with underscores, then apply the same normalization to the values of every string column.

In [None]:

df.columns = df.columns.str.lower().str.replace(' ', '_')

string_columns = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')


# Encoding the target variable (churn)

In [None]:
df.churn = (df.churn == 'yes').astype(int)

In [None]:

# Split the data into train and test sets (80% train, 20% test)
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)


We split the full training data into two parts: one for actual training and one for validation. Holding out a validation set is a common way to check the model's performance on data it has not seen during training.
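To make the resulting proportions concrete, the sketch below applies the same two splits to a stand-in array with 7043 elements (the row count reported by df.info()). Only the sizes matter here, so the array contents are just placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 7043 rows of the dataset; only the row count matters here
rows = np.arange(7043)

# First split: 80% full training set, 20% test set
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)

# Second split: 67% of the full training set for training, 33% for validation
train, val = train_test_split(full_train, test_size=0.33, random_state=11)

for name, part in [('train', train), ('validation', val), ('test', test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(rows):.1%} of all rows)")
```

Because the second split is taken from the 80% share, the validation set ends up at roughly a quarter of all rows, not a third.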

In [None]:
# Split the training data further into train and validation sets (67% train, 33% validation)
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

y_train = df_train.churn.values
y_val = df_val.churn.values

del df_train['churn']
del df_val['churn']


In [None]:
# Check for missing values in each column of the full training set
df_train_full.isnull().sum()

column,missing
customerid,0
gender,0
seniorcitizen,0
partner,0
dependents,0
tenure,0
phoneservice,0
multiplelines,0
internetservice,0
onlinesecurity,0


# Inspecting the distribution of the target variable (churn)

In [None]:
# Check the count of each class in the target variable 'churn'
df_train_full.churn.value_counts()

churn,count
0,4113
1,1521


# calculating the churn rate

To quantify the proportion of churned customers, we calculate the churn rate. This gives insight into how imbalanced the dataset is and is a useful metric when evaluating model performance.

In [None]:
# Calculate the churn rate (proportion of customers who churned)
global_mean = df_train_full.churn.mean()

Before I even think about building a model, my first gut-check is always to see how balanced our classes are. So, I calculated the churn rate and found it's about 27%, meaning for every ten customers, nearly three are leaving. That immediately tells me we're dealing with a classic imbalanced dataset. It's not a rare event, but it's a skewed playing field, and that's a crucial piece of context that will shape everything I do next to make sure our model doesn't just take the easy way out and predict 'not churned' every time.
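The 27% figure can be reproduced by hand from the class counts shown in the value_counts() output. The sketch below also computes the accuracy of a trivial model that always predicts "not churned", which is the baseline any real model has to beat.

```python
# Class counts copied from the value_counts() output of df_train_full.churn
not_churned, churned = 4113, 1521
total = not_churned + churned

# Share of customers who churned
churn_rate = churned / total

# Accuracy of a trivial model that always predicts "not churned"
baseline_accuracy = not_churned / total

print(f"churn rate: {churn_rate:.3f}")
print(f"accuracy of always predicting 'not churned': {baseline_accuracy:.3f}")
```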



# Categorizing Variables

One of the first things I do with a new dataset is to sort the variables into two mental buckets: categorical and numerical. I find that this simple act of organization is incredibly powerful. It sets the stage for everything that follows, because each type tells its story differently and requires its own approach, especially when we later try to understand which features are the most important players.

In [None]:
# Categorical variables
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice',
               'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup',
               'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']

# Numerical variables
numerical = ['tenure', 'monthlycharges', 'totalcharges']


# Unique Values in Categorical Variables

I always do a quick sanity check on my categorical variables by peeking at how many unique values each one has. It’s like looking under the hood—I want to make sure nothing is overly complex or messy before moving forward. If I see a column with hundreds of categories, I know it might need some simplification down the line.

In [None]:
# Checking the number of unique values in categorical variables
df_train_full[categorical].nunique()


column,nunique
gender,2
seniorcitizen,2
partner,2
dependents,2
phoneservice,2
multiplelines,3
internetservice,3
onlinesecurity,3
onlinebackup,3
deviceprotection,3


# Feature Importance: Checking Churn Rate by Categories

I like to start my feature importance analysis with a simple but powerful question: does the churn rate change across different categories? For example, I'll calculate the churn rate separately for 'female' and 'male' customers. If the rates are almost identical, it's a strong clue that gender might not be a key driver in this case. It’s a quick gut-check that tells me a lot before I even run a complex model.

In [None]:
# Calculate churn rate for female customers
female_mean = df_train_full[df_train_full.gender == 'female'].churn.mean()

# Calculate churn rate for male customers
male_mean = df_train_full[df_train_full.gender == 'male'].churn.mean()


In [None]:

print(f"Female churn rate: {female_mean}")
print(f"Male churn rate: {male_mean}")


Female churn rate: 0.27682403433476394
Male churn rate: 0.2632135306553911


# Feature Importance: Partner Variable

In [None]:
# churn rate for customers with a partner
partner_yes = df_train_full[df_train_full.partner == 'yes'].churn.mean()

# churn rate for customers without a partner
partner_no = df_train_full[df_train_full.partner == 'no'].churn.mean()


In [None]:
# Output churn rates for customers with and without a partner
print(f"Partnered churn rate: {partner_yes}")
print(f"Non-partnered churn rate: {partner_no}")


Partnered churn rate: 0.20503330866025166
Non-partnered churn rate: 0.3298090040927694


# Calculating the Risk Ratio

We now introduce the concept of risk ratio, which compares the churn rate of a specific group with the global churn rate. This helps us understand the relative risk of each group.
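As a minimal sketch of the idea, the helper below divides a group's churn rate by the overall rate. The function name risk_ratio is my own choice for illustration; the input numbers are the partner churn rates and class counts printed earlier in this notebook.

```python
# Overall churn rate, from the class counts of df_train_full (4113 stayed, 1521 churned)
global_rate = 1521 / (4113 + 1521)

# Group churn rates copied from the partner output printed earlier
group_rates = {
    'partner=yes': 0.20503330866025166,
    'partner=no': 0.3298090040927694,
}

def risk_ratio(group_rate, overall_rate):
    """Churn rate of a group relative to the overall churn rate (hypothetical helper)."""
    return group_rate / overall_rate

ratios = {group: risk_ratio(rate, global_rate) for group, rate in group_rates.items()}

for group, ratio in ratios.items():
    direction = 'higher' if ratio > 1 else 'lower'
    print(f"{group}: risk {ratio:.2f} ({direction} than average)")
```

A ratio close to 1 means the group behaves like the average customer; the further it is from 1 in either direction, the more the group stands out.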

In [None]:
# Calculate risk ratio for gender (female vs. global churn rate)
gender_risk = female_mean / global_mean
print(f"Risk ratio for female: {gender_risk}")


Risk ratio for female: 1.0253955354648652


# Risk Ratio for Partner

In [None]:
# Calculate risk ratio for partner (yes vs. global churn rate)
partner_risk = partner_yes / global_mean
print(f"Risk ratio for partner: {partner_risk}")


Risk ratio for partner: 0.7594724924338315


# Applying to All Categorical Variables

Once I know which features are categorical, my next move is to see how churn plays out in each category. Doing this one-by-one for every column would be tedious and error-prone, so I write a quick loop. It's like building a little helper that runs the same analysis on every category, making it easy to spot which groups have shockingly high or low churn rates. It scales perfectly and lets me quickly identify the real story hidden in the categories.

In [None]:
# Calculate churn rate for each category in the categorical variables
for column in categorical:
    for value in df_train_full[column].unique():
        group_churn_rate = df_train_full[df_train_full[column] == value].churn.mean()
        print(f"Churn rate for {column} = {value}: {group_churn_rate}")


Churn rate for gender = male: 0.2632135306553911
Churn rate for gender = female: 0.27682403433476394
Churn rate for seniorcitizen = 0: 0.24227022448115204
Churn rate for seniorcitizen = 1: 0.4133771929824561
Churn rate for partner = yes: 0.20503330866025166
Churn rate for partner = no: 0.3298090040927694
Churn rate for dependents = yes: 0.16566626650660263
Churn rate for dependents = no: 0.3137600806451613
Churn rate for phoneservice = yes: 0.2730489482995872
Churn rate for phoneservice = no: 0.2413162705667276
Churn rate for multiplelines = no: 0.2574074074074074
Churn rate for multiplelines = yes: 0.29074151654796815
Churn rate for multiplelines = no_phone_service: 0.2413162705667276
Churn rate for internetservice = no: 0.07780507780507781
Churn rate for internetservice = dsl: 0.1923474663908997
Churn rate for internetservice = fiber_optic: 0.42517144009681324
Churn rate for onlinesecurity = no_internet_service: 0.07780507780507781
Churn rate for onlinesecurity = yes: 0.1532258064516

# Churn Rates and Risk Table

The churn rate and risk ratio are computed for two variables (gender and partner), showing the churn rate and the risk of churning for each group. This helps identify which customer characteristics are more likely to lead to churn.

# Pandas Equivalent of a SQL GROUP BY for the Gender Variable (Churn Rate and Risk)

In [None]:
# Calculate global churn rate
global_mean = df_train_full.churn.mean()

# Group by gender and calculate churn rate for each gender
df_group = df_train_full.groupby(by='gender').churn.agg(['mean'])

# Calculate the difference between the group churn rate and global churn rate
df_group['diff'] = df_group['mean'] - global_mean

# Calculate the risk ratio: group churn rate / global churn rate
df_group['risk'] = df_group['mean'] / global_mean


# Loop Over All Categorical Variables for Churn Rate and Risk

To calculate churn rates and risk ratios for all categorical variables, we loop through each categorical feature, apply the same logic, and display the results.

In [None]:
from IPython.display import display

# Loop through each categorical variable to calculate churn rate and risk
for col in categorical:
    df_group = df_train_full.groupby(by=col).churn.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['rate'] = df_group['mean'] / global_mean
    display(df_group)


gender,mean,diff,rate
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


seniorcitizen,mean,diff,rate
0,0.24227,-0.027698,0.897403
1,0.413377,0.143409,1.531208


partner,mean,diff,rate
no,0.329809,0.059841,1.221659
yes,0.205033,-0.064935,0.759472


dependents,mean,diff,rate
no,0.31376,0.043792,1.162212
yes,0.165666,-0.104302,0.613651


phoneservice,mean,diff,rate
no,0.241316,-0.028652,0.89387
yes,0.273049,0.003081,1.011412


multiplelines,mean,diff,rate
no,0.257407,-0.012561,0.953474
no_phone_service,0.241316,-0.028652,0.89387
yes,0.290742,0.020773,1.076948


internetservice,mean,diff,rate
dsl,0.192347,-0.077621,0.712482
fiber_optic,0.425171,0.155203,1.574895
no,0.077805,-0.192163,0.288201


onlinesecurity,mean,diff,rate
no,0.420921,0.150953,1.559152
no_internet_service,0.077805,-0.192163,0.288201
yes,0.153226,-0.116742,0.56757


onlinebackup,mean,diff,rate
no,0.404323,0.134355,1.497672
no_internet_service,0.077805,-0.192163,0.288201
yes,0.217232,-0.052736,0.80466


deviceprotection,mean,diff,rate
no,0.395875,0.125907,1.466379
no_internet_service,0.077805,-0.192163,0.288201
yes,0.230412,-0.039556,0.85348


techsupport,mean,diff,rate
no,0.418914,0.148946,1.551717
no_internet_service,0.077805,-0.192163,0.288201
yes,0.159926,-0.110042,0.59239


streamingtv,mean,diff,rate
no,0.342832,0.072864,1.269897
no_internet_service,0.077805,-0.192163,0.288201
yes,0.302723,0.032755,1.121328


streamingmovies,mean,diff,rate
no,0.338906,0.068938,1.255358
no_internet_service,0.077805,-0.192163,0.288201
yes,0.307273,0.037305,1.138182


contract,mean,diff,rate
month-to-month,0.431701,0.161733,1.599082
one_year,0.120573,-0.149395,0.446621
two_year,0.028274,-0.241694,0.10473


paperlessbilling,mean,diff,rate
no,0.172071,-0.097897,0.637375
yes,0.338151,0.068183,1.25256


paymentmethod,mean,diff,rate
bank_transfer_(automatic),0.168171,-0.101797,0.622928
credit_card_(automatic),0.164339,-0.10563,0.608733
electronic_check,0.45589,0.185922,1.688682
mailed_check,0.19387,-0.076098,0.718121


Now for the really insightful part: by looking at the churn rates and risk ratios, I can start to see the real story behind the numbers. For instance, it becomes clear that customers without a partner have a 22% higher risk of churning, while those with a partner tend to stay. It’s these kinds of patterns—whether someone is a senior citizen, has a partner, or uses tech support—that really stand out. They give me a much clearer sense of which factors truly drive customer behavior and will be most powerful for predicting churn.

# Mutual Information for Feature Importance

When I need to quickly identify which features truly matter for predicting churn, my go-to tool is mutual information. Think of it as a way to ask each feature, 'How much do you really know about whether a customer will leave?' A high score means that feature holds a lot of valuable, predictive signal—it's clearly in the know. This lets me cut through the noise and focus my model on the variables that actually tell the story.
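To get a feel for the scale of the scores, here is a sketch on hypothetical toy data: one feature that determines churn exactly and one that is independent of it. The feature values are made up for illustration.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical toy target: 1 = churned, 0 = stayed
churn = [0, 0, 1, 1]

# A feature that determines churn exactly
informative = ['two_year', 'two_year', 'monthly', 'monthly']

# A feature that is independent of churn
uninformative = ['a', 'b', 'a', 'b']

mi_high = mutual_info_score(informative, churn)
mi_low = mutual_info_score(uninformative, churn)

print(f"informative feature:   {mi_high:.3f}")  # equals ln(2) for a balanced binary target
print(f"uninformative feature: {mi_low:.3f}")
```

The score is measured in nats, so a feature that fully determines a balanced binary target tops out at ln(2), and an independent feature scores zero. Real features fall somewhere in between.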

In [None]:
from sklearn.metrics import mutual_info_score

# Function to calculate mutual information for each categorical variable
def calculate_mi(series):
    return mutual_info_score(series, df_train_full.churn)

# Apply the function to each categorical column and sort the results
df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')

# Display the ranked scores
df_mi


After running the mutual information, I love this part—it basically gives me a ranked list of the most predictive features. The higher the score, the more that feature truly informs the model about churn. It’s no surprise to see 'contract type,' 'online security,' and 'tech support' at the top; they clearly separate who stays and who goes. Meanwhile, a feature like 'gender' falls to the bottom, confirming it doesn’t hold much signal. This helps me focus on what really matters.

# Correlation Coefficient for Numerical Variables

When I want to see how a number, like a customer's tenure or monthly bill, relates to whether they leave, I use correlation. It's like a compass that points to the strength of a relationship. A high correlation means that as that number changes, the chance of churn changes quite predictably. This quickly shows me which numerical factors, like how long someone has been with the company, are the most powerful clues for predicting if they'll stay or go.

# Understanding Correlation with the Target Variable (churn)

To see how numerical variables such as tenure or monthly charges relate to churn, I compute their correlation with the target. A negative coefficient means that as the value increases, for example a customer's tenure, churn becomes less likely; a positive coefficient means the opposite. Correlation describes association only and does not by itself show that one thing causes the other.
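The sign convention is easiest to see on a tiny made-up example. In the sketch below the toy values are hypothetical: churned customers have short tenure and high monthly charges.

```python
import pandas as pd

# Hypothetical toy data: churned customers have short tenure and high monthly charges
toy = pd.DataFrame({
    'tenure':         [1, 2, 3, 40, 50, 60],
    'monthlycharges': [90, 85, 95, 30, 40, 35],
    'churn':          [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each numerical column with the binary target
corr = toy[['tenure', 'monthlycharges']].corrwith(toy['churn'])
print(corr)
```

Tenure comes out negative (longer tenure, less churn) and monthly charges positive (higher bill, more churn), which is the same pattern of signs to look for in the real data.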

In [None]:
# Calculate correlation between numerical variables and churn
df_train_full[numerical].corrwith(df_train_full.churn)


column,correlation
tenure,-0.351885
monthlycharges,0.196805
totalcharges,-0.196353


The analysis shows a moderate negative correlation between tenure and churn (about -0.35): longer-tenured customers are less likely to churn. Total charges are also negatively correlated with churn (about -0.20), while monthly charges are positively correlated (about 0.20), which suggests that customers with higher monthly bills leave more often.

These insights directly inform the subsequent feature engineering phase, where categorical variables are transformed into numerical representations using one-hot encoding. This process ensures compatibility with machine learning algorithms while preserving the predictive integrity of the features.



# One-Hot Encoding for Categorical Variables

One-hot encoding is a technique that converts categorical variables into a matrix form that machine learning models can handle. It creates binary columns for each category, where a "1" indicates the presence of a category, and a "0" indicates its absence.

In [None]:
# One-hot encode categorical variables using DictVectorizer
from sklearn.feature_extraction import DictVectorizer

# Convert the dataframe to a list of dictionaries
train_dict = df_train[categorical + numerical].to_dict(orient='records')

# Initialize DictVectorizer and fit it to the dictionary data
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)

# Transform the dictionary into a matrix
X_train = dv.transform(train_dict)


# Creating One-Hot Encoded Features

The DictVectorizer takes the dictionary created from the dataframe, fits it to determine how to convert the categorical values to binary vectors, and then transforms the data into a matrix. This matrix is now ready for model training.

In [None]:
# Show the first row of the one-hot encoded matrix
X_train[0]


array([0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
       1.0000e+00, 0.0000e+00, 0.0000e+00, 8.6100e+01, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 7.1000e+01, 6.0459e+03])

In [None]:
# Get the feature names after one-hot encoding
dv.get_feature_names_out()



array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'dependents=no', 'dependents=yes',
       'deviceprotection=no', 'deviceprotection=no_internet_service',
       'deviceprotection=yes', 'gender=female', 'gender=male',
       'internetservice=dsl', 'internetservice=fiber_optic',
       'internetservice=no', 'monthlycharges', 'multiplelines=no',
       'multiplelines=no_phone_service', 'multiplelines=yes',
       'onlinebackup=no', 'onlinebackup=no_internet_service',
       'onlinebackup=yes', 'onlinesecurity=no',
       'onlinesecurity=no_internet_service', 'onlinesecurity=yes',
       'paperlessbilling=no', 'paperlessbilling=yes', 'partner=no',
       'partner=yes', 'paymentmethod=bank_transfer_(automatic)',
       'paymentmethod=credit_card_(automatic)',
       'paymentmethod=electronic_check', 'paymentmethod=mailed_check',
       'phoneservice=no', 'phoneservice=yes', 'seniorcitizen',
       'streamingmovies=no', 'streamingmovies=no_internet_service',

Here we see how the DictVectorizer handles a list of dictionaries for one-hot encoding. It automatically creates one column per category for string-valued features and keeps numerical features as they are. Note that seniorcitizen is stored as an integer, so it appears as a single numeric column rather than being split into two.

# Transitioning to Logistic Regression

Once the data is ready (with features encoded), the next step is training a logistic regression model. Logistic regression is used for binary classification tasks, predicting whether a customer will churn (1) or not (0).

# Logistic Regression Formula in Python

The logistic regression formula is based on a linear combination of features. The sigmoid function then maps the result to a probability, which can be interpreted as the likelihood that the customer will churn.
First, we calculate a weighted sum of the features, then apply the sigmoid function to output a probability between 0 and 1.

In [None]:
# Sigmoid function that converts the score to a probability
import math

def sigmoid(score):
    return 1 / (1 + math.exp(-score))

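# Logistic regression function: weighted sum of the features plus the bias,
# passed through the sigmoid. The weights w and the bias are passed in
# explicitly so the function does not depend on undefined globals.
def logistic_regression(xi, w, bias):
    score = bias
    for j in range(len(w)):
        score = score + xi[j] * w[j]
    prob = sigmoid(score)
    return prob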
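The explicit loop is handy for understanding the formula, but in practice the same computation is usually written as a matrix product. The sketch below is one possible vectorized version; predict_proba_manual and the toy weights are hypothetical names and values chosen for illustration.

```python
import numpy as np

def sigmoid(score):
    # Map any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-score))

def predict_proba_manual(X, w, bias):
    # Weighted sum for every row at once, then the sigmoid
    return sigmoid(X @ w + bias)

# Hypothetical weights and three toy customers
w_toy = np.array([0.5, -0.25])
bias_toy = 0.0
X_rows = np.array([
    [0.0, 0.0],   # score  0 -> probability 0.5
    [2.0, 0.0],   # score  1 -> above 0.5
    [0.0, 4.0],   # score -1 -> below 0.5
])

probs = predict_proba_manual(X_rows, w_toy, bias_toy)
print(probs)
```

A score of zero maps to exactly 0.5, and scores of equal size but opposite sign give probabilities that add up to 1.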


# Training the Logistic Regression Model

To train the logistic regression model using Scikit-learn, we first import the LogisticRegression class and train the model on the training data.

LogisticRegression is a classification model provided by Scikit-learn. We specify parameters like solver (for optimization) and random_state (for reproducibility) to ensure consistent results during training.

In [None]:
# Import Logistic Regression from Scikit-learn
from sklearn.linear_model import LogisticRegression

# Initialize and train the logistic regression model
model = LogisticRegression(solver='liblinear', random_state=1)
model.fit(X_train, y_train)


# Transforming Validation Data for Prediction

After training the model, we need to apply the one-hot encoding transformation to the validation dataset before making predictions. This ensures the validation features match the format used during training.

The validation set is transformed using the same DictVectorizer that was fitted on the training data to ensure consistent feature encoding.

In [None]:
# Convert the validation data to a list of dictionaries
val_dict = df_val[categorical + numerical].to_dict(orient='records')

# Apply the same transformation as during training
X_val = dv.transform(val_dict)


# Making Predictions Using the Trained Model

Now that the validation data is encoded, we can pass it through the trained model to get churn probabilities for each customer in the validation set.
The predict_proba method of the model outputs the probabilities for each class (churn or no churn) for every observation in the validation set. The second column contains the probability that a customer will churn.
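The shape of that output is easy to check on a tiny model. The sketch below fits a throwaway classifier on hypothetical one-feature data (clf, X_demo and y_demo are illustration-only names) and looks at what predict_proba returns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: small values churn (1), large values stay (0)
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)
print(clf.classes_)   # column order of predict_proba follows classes_
print(proba.shape)    # one row per observation, one column per class

# Each row sums to 1; the column for class 1 is the churn probability
churn_proba = proba[:, 1]
print(churn_proba.round(3))
```

The column order follows clf.classes_, which is why index 1 picks out the churn probability when the target is encoded as 0 and 1.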

In [None]:
# Use the model to predict probabilities on the validation set
y_pred = model.predict_proba(X_val)


In [None]:
# Example of prediction output
y_pred = model.predict_proba(X_val)
# predict_proba returns a two-column array: the first column is the probability
# that a customer will not churn, the second the probability that they will.

# Extracting the Churn Probabilities

In [None]:
# Select only the second column for churn probabilities
y_pred = model.predict_proba(X_val)[:, 1]


In [None]:
# Convert probabilities to binary predictions (True for churn, False for not churn)
churn = y_pred >= 0.5


# Evaluating the Model: Accuracy

Once we have the binary predictions, we compare them to the actual values in the validation set. The comparison gives us an array of True and False values, where True means the prediction matches the actual value.
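Here is the same comparison on four hypothetical customers, small enough to check by eye. The probabilities and labels are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and predicted churn probabilities
y_true = np.array([0, 1, 1, 0])
probs = np.array([0.2, 0.8, 0.4, 0.1])

# Threshold at 0.5 to get hard predictions
predicted = probs >= 0.5

# True where the prediction agrees with the label (NumPy treats True as 1, False as 0)
matches = (y_true == predicted)

# The mean of a boolean array is the share of True values, i.e. the accuracy
accuracy = matches.mean()
print(matches, accuracy)
```

The third customer churned but got a probability below 0.5, so three of the four predictions are correct.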

In [None]:
# Calculate accuracy by comparing predictions with actual values
(y_val == churn).mean()


np.float64(0.8016129032258065)

# Model Interpretation: Extracting Coefficients

To understand how the logistic regression model makes its predictions, we examine the learned coefficients. The model has a bias term and a set of weights associated with each feature. We extract and display these coefficients to interpret the model.

In [None]:
# Extract feature names and corresponding model coefficients
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))


{'contract=month-to-month': np.float64(0.563),
 'contract=one_year': np.float64(-0.086),
 'contract=two_year': np.float64(-0.599),
 'dependents=no': np.float64(-0.03),
 'dependents=yes': np.float64(-0.092),
 'deviceprotection=no': np.float64(0.1),
 'deviceprotection=no_internet_service': np.float64(-0.116),
 'deviceprotection=yes': np.float64(-0.106),
 'gender=female': np.float64(-0.027),
 'gender=male': np.float64(-0.095),
 'internetservice=dsl': np.float64(-0.323),
 'internetservice=fiber_optic': np.float64(0.317),
 'internetservice=no': np.float64(-0.116),
 'monthlycharges': np.float64(0.001),
 'multiplelines=no': np.float64(-0.168),
 'multiplelines=no_phone_service': np.float64(0.127),
 'multiplelines=yes': np.float64(-0.081),
 'onlinebackup=no': np.float64(0.136),
 'onlinebackup=no_internet_service': np.float64(-0.116),
 'onlinebackup=yes': np.float64(-0.142),
 'onlinesecurity=no': np.float64(0.258),
 'onlinesecurity=no_internet_service': np.float64(-0.116),
 'onlinesecurity=yes':

# Training a Smaller Logistic Regression Model

To further understand the model, we train a smaller logistic regression model using only a subset of the features: contract, tenure, and totalcharges. We preprocess the categorical features (like contract) using one-hot encoding.

We create a new feature set containing only contract, tenure, and totalcharges, convert them to dictionaries, and apply one-hot encoding using DictVectorizer. This will help us compare the simpler model with the full model.

In [None]:
# Define the subset of features for the smaller model
small_subset = ['contract', 'tenure', 'totalcharges']

# Convert the training data for the smaller model to a list of dictionaries
train_dict_small = df_train[small_subset].to_dict(orient='records')

# Apply one-hot encoding to the small model
dv_small = DictVectorizer(sparse=False)
dv_small.fit(train_dict_small)

# Transform the training data into the encoded matrix for the small model
X_small_train = dv_small.transform(train_dict_small)


# Getting Feature Names for the Smaller Model

We use the get_feature_names_out method of the DictVectorizer to get the feature names after applying one-hot encoding to the smaller model's data.
The smaller model will have one-hot encoded features for contract (since it's categorical), while tenure and totalcharges remain as numerical features.

In [None]:
# Get the feature names for the smaller model
dv_small.get_feature_names_out()


array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'tenure', 'totalcharges'], dtype=object)

# Training the Smaller Model

A logistic regression model is trained using only a few selected features. We use the LogisticRegression class from Scikit-learn with the liblinear solver and set a random seed for reproducibility.

In [None]:
# Train a smaller model on selected features
model_small = LogisticRegression(solver='liblinear', random_state=1)
model_small.fit(X_small_train, y_train)

# Checking the Bias Term and Coefficients

After training the model, we analyze the learned parameters to interpret its decision-making process. The bias term (intercept) represents the baseline log-odds prediction when all feature values are zero. We examine the coefficients associated with each feature to quantify their contribution to the prediction: positive coefficients increase the likelihood of churn, while negative coefficients decrease it.



To enhance interpretability, we pair feature names with their corresponding coefficients using the zip function, allowing clear identification of which features most significantly influence the model’s output. This step is crucial for validating model behavior and ensuring alignment with domain knowledge.



In [None]:
# Check the bias term
model_small.intercept_[0]

# Check the model's coefficients
dict(zip(dv_small.get_feature_names_out(), model_small.coef_[0].round(3)))


{'contract=month-to-month': np.float64(0.91),
 'contract=one_year': np.float64(-0.144),
 'contract=two_year': np.float64(-1.404),
 'tenure': np.float64(-0.097),
 'totalcharges': np.float64(0.001)}

# Interpreting the Model Coefficients

The coefficients of the model indicate how strongly each feature affects the likelihood of churn. Positive coefficients mean a higher probability of churn, while negative coefficients suggest a lower likelihood of churn. We interpret the weights associated with features such as contract, tenure, and totalcharges.

We interpret the weights for the contract feature to understand which contract types increase the likelihood of churn. The month-to-month contract has a positive coefficient, indicating that customers with this contract are more likely to churn. In contrast, the two-year contract has a negative coefficient, indicating a lower likelihood of churn.

In [None]:
# Rounded model weights (contract, tenure, totalcharges), repeated here for reference
{'contract=month-to-month': 0.91,
 'contract=one_year': -0.144,
 'contract=two_year': -1.404,
 'tenure': -0.097,
 'totalcharges': 0.001}


{'contract=month-to-month': 0.91,
 'contract=one_year': -0.144,
 'contract=two_year': -1.404,
 'tenure': -0.097,
 'totalcharges': 0.001}

# Dot Product and One-Hot Encoding

To understand how the model uses the weights, we perform the dot product between the one-hot encoded feature vector and the weights. Only the active (hot) feature contributes to the score, while the inactive (cold) features are ignored.

In one-hot encoding, for each categorical feature, only one feature value is active at a time. This means that during prediction, only the weight associated with the active feature is used.

In [None]:
# Dot product between one-hot encoded feature vector and weights
1 * 0.91 + 0 * -0.144 + 0 * -1.404  # For a customer with 'contract=month-to-month'


0.91

# Handling Different Contract Types

In [None]:
# Dot product for a customer with 'contract=two_year'
0 * 0.91 + 0 * -0.144 + 1 * -1.404


-1.404
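
The same two calculations can be written as actual dot products. This is a minimal sketch, assuming the contract weights are ordered month-to-month, one_year, two_year as in the output above; the array names are made up for illustration.

```python
import numpy as np

# Contract weights in the order month-to-month, one_year, two_year (rounded)
w_contract = np.array([0.91, -0.144, -1.404])

# One-hot vectors: exactly one position is 1 for each customer
x_month_to_month = np.array([1, 0, 0])
x_two_year = np.array([0, 0, 1])

# The dot product picks out the weight of the active category
print(x_month_to_month.dot(w_contract))
print(x_two_year.dot(w_contract))
```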

# Significance of Coefficient Signs and Magnitude

The sign of a coefficient gives the direction of a feature's effect, and the magnitude gives its strength. A positive weight raises the churn score, a negative weight lowers it, and the larger the absolute value, the bigger the shift.

For the contract feature, the positive weight for contract=month-to-month (0.91) indicates a strong tendency for these customers to churn. The small negative weight for contract=one_year (-0.144) lowers the score only slightly, while the large negative weight for contract=two_year (-1.404) suggests that long-term contracts are strong indicators of customer loyalty.

In [None]:
# Weight analysis for the contract feature
contract_weights = {
    "month_to_month": 0.91,    # positive and large: strongly raises the churn score
    "one_year": -0.144,        # negative but small: slightly lowers the churn score
    "two_year": -1.404         # negative and large: strongly lowers the churn score
}


# Interpreting Numerical Features

The coefficients also show the direction of each numerical feature's influence. Tenure has a negative weight (-0.097): each additional month lowers the churn score, so customers with longer tenure are less likely to churn. The totalcharges weight is close to zero (0.001 in the model output above, which rounds to 0.0 at two decimal places), so each additional dollar barely moves the score. Because totalcharges can run into the thousands, a near-zero per-dollar weight does not by itself show that the feature is irrelevant; for simplicity, the walkthrough that follows treats its contribution as zero.
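
To put the tenure weight in more familiar terms, it can be converted into an odds ratio: in logistic regression, exponentiating a weight gives the factor by which the odds of churn change for a one-unit increase in that feature, other features held fixed. The sketch below uses the rounded weight from the output above; the variable names are illustrative.

```python
import math

w_tenure = -0.097  # rounded tenure weight from the output above

# exp(weight) is the multiplicative change in the odds of churn per extra month
odds_ratio_per_month = math.exp(w_tenure)

# Effect of 12 extra months on the log-odds and on the odds
log_odds_shift_12 = 12 * w_tenure
odds_ratio_12 = math.exp(log_odds_shift_12)

print(round(odds_ratio_per_month, 3),
      round(log_odds_shift_12, 3),
      round(odds_ratio_12, 3))
```

An odds ratio below 1 means the odds of churn shrink as tenure grows, which matches the negative sign of the weight.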



# Predicting Churn for a Customer Example

Finally, we walk through the prediction process for a customer with a month-to-month contract, 12 months of tenure, and a total charge of $1,000. We calculate the score step-by-step based on the weights and features, ultimately determining whether the customer is likely to churn or not.

We start with the baseline score (bias), then add the contribution of the contract type (month-to-month), and then add the contribution of tenure, which is negative and so lowers the score. Since the weight for totalcharges is approximately zero, we treat its contribution as zero in this walkthrough.

In [None]:
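# Example calculation for a month-to-month customer with 12 months of tenure
bias = -0.639                 # baseline score (intercept)
contract_month = 0.91         # weight of the active feature contract=month-to-month
tenure = 12 * (-0.097)        # 12 months times the tenure weight
totalcharges = 0              # weight is approximately zero, so treated as no impact

# Calculate the score (log-odds of churn)
score = bias + contract_month + tenure + totalcharges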
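
The score is a log-odds value, so one more step is needed to decide whether the customer is likely to churn: pass it through the sigmoid function to get a probability. The sketch below repeats the rounded numbers from the cell above so it runs on its own; `sigmoid` is a small helper written here for illustration.

```python
import math

def sigmoid(score):
    """Map a log-odds score to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-score))

# Same rounded numbers as the cell above
bias = -0.639
contract_month = 0.91
tenure = 12 * (-0.097)
score = bias + contract_month + tenure   # totalcharges contribution treated as 0

churn_probability = sigmoid(score)
print(round(score, 3), round(churn_probability, 2))
```

With these rounded numbers the score is negative, so the probability falls below 0.5 and this customer would not be flagged as likely to churn.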


# Scoring a Customer with the Model

We apply the logistic regression model to score a new customer, determining the probability that the customer will churn. The process involves converting the customer's data into the same format the model expects, and then using the trained model to make predictions.

A customer’s data is stored in a dictionary with the same feature names as the ones used during training. The data needs to undergo the same preprocessing steps (e.g., one-hot encoding) before feeding it into the model for prediction.

In [None]:
# Customer data to be scored
customer = {
    'customerid': '8879-zkjof',
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'no',
    'dependents': 'no',
    'tenure': 41,
    'phoneservice': 'yes',
    'multiplelines': 'no',
    'internetservice': 'dsl',
    'onlinesecurity': 'yes',
    'onlinebackup': 'no',
    'deviceprotection': 'yes',
    'techsupport': 'yes',
    'streamingtv': 'yes',
    'streamingmovies': 'yes',
    'contract': 'one_year',
    'paperlessbilling': 'yes',
    'paymentmethod': 'bank_transfer_(automatic)',
    'monthlycharges': 79.85,
    'totalcharges': 3320.75,
}


# Preprocessing the Customer Data

We need to apply one-hot encoding to the categorical variables (like contract) before feeding the data into the model. We use DictVectorizer from Scikit-learn for this task.

We pass the customer dictionary to the transform method of the already fitted DictVectorizer (dv). It one-hot encodes the categorical features and keeps the numerical features intact. Keys that the vectorizer did not see during fitting are ignored.

In [None]:
# Convert the customer data into the feature matrix
X_test = dv.transform([customer])


# Making Predictions for a Single Customer

Once the customer data is preprocessed, we call the predict_proba method to get the churn probabilities. It returns a two-column array with one row per customer: the first column is the probability that the customer will not churn, and the second column is the probability that they will churn, which is the one we are interested in.

In [None]:
# Get the churn probability for the customer
model.predict_proba(X_test)


array([[0.92667889, 0.07332111]])

# Extracting the Probability of Churn

To extract the probability of churn from the prediction output, we use array indexing to select the value in the second column.

In [None]:
# Select the probability of churn for the first customer
model.predict_proba(X_test)[0, 1]


np.float64(0.07332111084949638)
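
The column layout of predict_proba can be checked on a toy model. The sketch below fits a throwaway classifier on made-up one-feature data (not the churn data) just to show that the columns follow the order of `classes_` and that each row sums to 1; `toy_model`, `X_toy`, and `y_toy` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data: larger values go with the positive class
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression(solver='liblinear', random_state=1)
toy_model.fit(X_toy, y_toy)

proba = toy_model.predict_proba(np.array([[0.5], [4.5]]))

print(toy_model.classes_)   # column order of predict_proba
print(proba.sum(axis=1))    # each row sums to 1
print(proba[:, 1])          # probability of the positive class
```

Indexing with `[0, 1]`, as in the cell above, therefore selects the positive-class probability for the first row.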

# Evaluating Another Customer

We apply the same steps to another customer with different feature values: transform the customer dictionary with the fitted vectorizer, then use the trained model to predict the churn probability. The same model can score any number of customers this way.

In [None]:
# New customer data
customer = {
    'gender': 'female',
    'seniorcitizen': 1,
    'partner': 'no',
    'dependents': 'no',
    'phoneservice': 'yes',
    'multiplelines': 'yes',
    'internetservice': 'fiber_optic',
    'onlinesecurity': 'no',
    'onlinebackup': 'no',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'yes',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 85.7,
    'totalcharges': 85.7
}

# Transform the new customer data
X_test = dv.transform([customer])

# Get the churn probability for the new customer
model.predict_proba(X_test)[0, 1]


np.float64(0.8321656556055403)

# Decision Based on Churn Probability

Once we have the churn probability, we can decide whether to send the customer a promotional mail. If the probability of churn is greater than or equal to 50%, we send the promotional mail; otherwise, we do not.

For the second customer, the predicted probability of churn is 83%. Since it is above 50%, we would send the customer a promotional mail to try to retain them.

In [None]:
# Decision to send promotional mail based on churn probability
if model.predict_proba(X_test)[0, 1] >= 0.5:
    print("Send promotional mail")
else:
    print("Do not send promotional mail")


Send promotional mail
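
The same rule can be wrapped in a small function so that it applies to several customers at once. This is a sketch, with `should_send_promo` as a hypothetical helper and the 0.5 threshold exposed as a parameter; the probabilities are the rounded values from the two customers scored above.

```python
def should_send_promo(churn_probability, threshold=0.5):
    """Return True when the churn probability reaches the threshold."""
    return churn_probability >= threshold

# Rounded probabilities of the two customers scored above
scored = {'first customer': 0.0733, 'second customer': 0.8322}

for label, p in scored.items():
    if should_send_promo(p):
        action = "Send promotional mail"
    else:
        action = "Do not send promotional mail"
    print(f"{label}: {p:.2f} -> {action}")
```

Keeping the threshold as a parameter makes it easy to try a stricter or looser cutoff later without touching the decision logic.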


We trained the logistic regression model, interpreted its weights, and applied it to score new customers to determine whether they are likely to churn. Along the way we saw how one-hot encoding converts categorical variables into numerical values and how to turn a predicted probability into a decision.