# Customer Churn Prediction

## Main Goals

- Preprocessing
    - Label encode categorical features to convert them into machine-readable form.
    - Engineer new time-based features.
    - Apply log transformation to skewed numerical features (like total charges) to help the model better understand the data.
- Predict whether a customer will churn based on their demographic and usage data.

### Context

Customer churn is a significant concern for businesses across many industries, as losing existing customers can have a major impact on revenue and growth. Understanding and predicting why customers leave is a key focus for data-driven organizations aiming to improve retention and sustain long-term success. In the field of data science, predictive modeling enables companies to identify customers who are at risk of leaving (often referred to as “churn”) by analyzing patterns in demographic information, account tenure, and usage behavior.

## 1. Data Loading 

For this project, we will use the [Telco Customer Churn Dataset](https://www.kaggle.com/datasets/blastchar/telco-customer-churn), a widely used collection of customer records from a telecommunications company that suits our goals well. In accordance with Kaggle's licensing, please visit the Kaggle website directly and download the `WA_Fn-UseC_-Telco-Customer-Churn.csv` file for this activity. The name is a little long, so feel free to rename it if that's more convenient.

We can start by loading the dataset into a pandas DataFrame and displaying the first few rows. This confirms that the file loaded correctly and shows us what the features are and how the target is represented. To do this, we first need to import pandas.

It's worth mentioning that anytime you have a dataset from an external source, such as Kaggle, you can and should refer back to the source of the data to clear up misconceptions and also to get a better understanding of the data.

In [1]:
#Import pandas so as to read the CSV file
import pandas as pd

#Read the CSV file
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

#Display the first few rows of the dataframe
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Understanding the Data

Displaying the data shows us that we have many features, so many that the output truncates the middle columns (shown as `...`). Additionally, it shows us that most of the features are categorical, with a few numerical ones.

When working with data, it is always important to make sure that you completely understand what you're working with. Since we weren't able to view every feature with `df.head()`, we'll go ahead and print out the column names so that we know what they are. Additionally, we can refer back to the source of the data, in this case Kaggle, to learn more about the dataset.

As such, we'll print out the columns of the dataset and take note of some features.

In [2]:
#Print the columns of the dataframe
print(df.columns)

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')


Viewing the columns like this, we realize that 'OnlineBackup' was the only feature that was truncated. Let's take our understanding of the data a step further and use info from the Kaggle source to clarify some features.

- tenure: The number of months a customer has stayed with the company.
- PhoneService: Whether the customer has phone service or not. 
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (many of the following features are like this, whether they have backup, protection, support, or not)
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer has churned or not. For this project, this will be our target.

We can see that many of the features are binary, indicating whether the customer has a certain service or not. Most of the features we didn't go over are either binary or nearly binary, with a third option such as "No internet service" or "No phone service". With this in mind, we can go ahead and start preprocessing the dataset.
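
One quick way to back up this observation is to count the distinct values in each column. The sketch below uses a small made-up frame (`toy`, not part of the original notebook) rather than the real data; on the actual dataset, the same `nunique()` call could be run on `df`.

```python
import pandas as pd

# Made-up rows that mimic three of the dataset's columns (illustration only)
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "OnlineSecurity": ["No", "Yes", "No internet service", "No"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Number of distinct values in each column
unique_counts = toy.nunique()

# Columns with exactly two distinct values are truly binary
binary_like = unique_counts[unique_counts == 2].index.tolist()

print(unique_counts)
print(binary_like)
```

Columns that report three values, like `OnlineSecurity` here, are the "nearly binary" ones that need one-hot encoding rather than a simple Yes/No mapping.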

## 2. Preprocessing

Having viewed and understood the features of our data, it's now time to see if our data needs to be cleaned up. Before we split the data or build the model, it is important to make sure the data is ready for the model and any other transformations. Additionally, we should take this opportunity to remove any features that will not help whatever model we choose. In this case, that is the `customerID` column, as an arbitrary identifier tells us nothing about whether somebody might churn.

Let's start by using functions from pandas to check if there are any null entries in the dataset. Missing entries can cause errors when training predictive models, so it's important to ensure that null entries are dealt with before proceeding. Unless a dataset explicitly says there are no null values, this should always be checked.

Afterwards, we can remove our `customerID` column.

In [3]:
#check for missing values in the dataframe
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

Fortunately for us, none of the features are missing any piece of data. Or so it seems. Inspecting for blank strings in features such as TotalCharges tells us a different story. 

In [4]:
print(df['TotalCharges'].apply(lambda x: isinstance(x, str) and x.strip() == '').sum())

11


Surprisingly enough, we have 11 blank strings in our TotalCharges feature. This isn't a common occurrence, and most of the time `isnull().sum()` is enough. But if you run into errors later on, or don't mind spending extra time making sure the data types are correct, you can always come back to this point in preprocessing and double check everything.
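
The same check can be run across every column at once. This is a minimal sketch on a made-up frame; the helper name `count_blank_strings` is hypothetical and not part of pandas or the original notebook.

```python
import pandas as pd

# Made-up frame: one whitespace-only entry hides in 'TotalCharges'
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Contract": ["Month-to-month", "Two year", "One year"],
})

def count_blank_strings(frame):
    """Return, per column, how many cells are strings made only of whitespace."""
    return {
        col: int(frame[col].map(lambda v: isinstance(v, str) and v.strip() == "").sum())
        for col in frame.columns
    }

blank_counts = count_blank_strings(toy)
print(blank_counts)
```

Any column with a non-zero count is worth a closer look before it is converted or encoded.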

Fortunately, this is an easy fix. We can convert our TotalCharges feature to a numeric format using the pandas `to_numeric` function. With `errors='coerce'`, the blank strings become null values, which we can then impute with 0.

In [5]:
#Convert to numeric (just in case) to catch blank strings
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

#Now fill any new NaNs (caused by blanks being coerced)
df['TotalCharges'] = df['TotalCharges'].fillna(0)
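
To see what the coercion does in isolation, here is a small sketch on a made-up Series (`raw` is illustrative, not taken from the dataset). The blank entry cannot be parsed as a number, so `errors="coerce"` turns it into `NaN`, which `fillna(0)` then replaces.

```python
import pandas as pd

# Made-up values: two valid numbers stored as text and one blank string
raw = pd.Series(["29.85", " ", "108.15"])

# Unparseable entries become NaN instead of raising an error
as_numbers = pd.to_numeric(raw, errors="coerce")

# Replace the NaN created by the blank string with 0
filled = as_numbers.fillna(0)

print(as_numbers.isna().sum())
print(filled.tolist())
```

Filling with 0 is one choice among several; a median or a value derived from other columns would also be defensible, depending on what the blanks represent.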

With the null values taken care of, let's now remove the `customerID` column to further clean the dataset.


In [6]:
#remove the 'customerID' column as it is not needed for analysis
df = df.drop(columns=['customerID'])

#View our dataframe again to see our changes
display(df) 


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.50,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.50,No
7039,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.90,No
7040,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.60,Yes


With this, we can now safely move on to the next step of preprocessing.

### Encoding Categorical Features

To prepare our dataset for machine learning, we’ll encode all categorical features into a numerical format, ensuring that our predictive models can process the data effectively. Many machine learning algorithms require numeric input, so any columns containing text or categories must be transformed. For binary features, such as those with “Yes” or “No” responses, we’ll use label encoding, which assigns a unique number to each category (for example, “Yes” becomes 1 and “No” becomes 0). For features with more than two possible categories, such as “Contract Type” or “Payment Method,” we’ll use one-hot encoding, creating separate columns for each unique value and indicating presence with a 1 or 0. We can easily accomplish this step using pandas get_dummies() function. By converting all categorical variables to a numeric format, we ensure our dataset is ready for modeling and avoid introducing unintended relationships between categories.

It's important to note that some columns, like `MultipleLines`, seem binary at first glance but are not: besides "Yes" and "No", it also contains "No phone service". As such, be careful with which features you encode, and how.

Additionally, inspecting the dataset, there is a binary feature, `SeniorCitizen`, which is already in a numerical binary format. It's an example of why you should always view and analyze your data before going ahead with processing it. To be safe, we can check whether it is stored as an integer or as a number in string format, and deal with it accordingly.
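
Both cautions can be checked in a few lines. The sketch below uses made-up Series rather than the real columns. It relies on the fact that `Series.map` silently returns `NaN` for any value missing from the mapping, which is exactly how a not-quite-binary column would slip through.

```python
import pandas as pd

# Made-up sample of a column that looks binary but has a third value
multiple_lines = pd.Series(["No phone service", "No", "Yes", "No"], name="MultipleLines")

# A plain Yes/No mapping leaves the third value unmapped (NaN)
yes_no = {"Yes": 1, "No": 0}
mapped = multiple_lines.map(yes_no)

# List the original values that the mapping failed to cover
unmapped = multiple_lines[mapped.isna()].unique().tolist()
print(unmapped)

# Check whether an already-numeric flag is stored as an integer type
senior = pd.Series([0, 1, 0], name="SeniorCitizen")
is_int = pd.api.types.is_integer_dtype(senior)
print(is_int)
```

Running a check like this after every `map` call is a cheap way to confirm that no category was dropped by accident.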

We'll start by encoding the categorical binary features, and then one-hot encoding the remaining features.

In [7]:
#Put all the binary columns in need of encoding in a list.
#We figure this out by viewing the dataset
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']

#Encode the binary columns using the map function. 
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0})

# Senior citizen (already 0/1, but ensure it's integer type)
df['SeniorCitizen'] = df['SeniorCitizen'].astype(int)

# Check the results
print(df.head())



   gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService  \
0       0              0        1           0       1             0   
1       1              0        0           0      34             1   
2       1              0        0           0       2             1   
3       1              0        0           0      45             0   
4       0              0        0           0       2             1   

      MultipleLines InternetService OnlineSecurity OnlineBackup  \
0  No phone service             DSL             No          Yes   
1                No             DSL            Yes           No   
2                No             DSL            Yes          Yes   
3  No phone service             DSL            Yes           No   
4                No     Fiber optic             No           No   

  DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0               No          No          No              No  Month-to-month   
1             

Though the output is truncated, we can see that the columns we listed as binary are now in numerical format. Let's now one-hot encode the remaining categorical features using `pd.get_dummies()`.

In [8]:
#Find all the categorical columns in the dataframe
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

#Remove the binary columns from the categorical columns list
categorical_cols = [col for col in categorical_cols if col not in binary_cols]

#'TotalCharges' needs no special handling here: it was converted to a numeric type in an earlier cell,
#so select_dtypes no longer picks it up as an object column.

#one-hot encode the categorical columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

#Display the categorical columns
print(categorical_cols)
display(df)

['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,...,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,0,1,0,1,0,1,29.85,29.85,0,...,False,False,False,False,False,False,False,False,True,False
1,1,0,0,0,34,1,0,56.95,1889.50,0,...,False,False,False,False,False,True,False,False,False,True
2,1,0,0,0,2,1,1,53.85,108.15,1,...,False,False,False,False,False,False,False,False,False,True
3,1,0,0,0,45,0,0,42.30,1840.75,0,...,True,False,False,False,False,True,False,False,False,False
4,0,0,0,0,2,1,1,70.70,151.65,1,...,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,1,0,1,1,24,1,1,84.80,1990.50,0,...,True,False,True,False,True,True,False,False,False,True
7039,0,0,1,1,72,1,1,103.20,7362.90,0,...,False,False,True,False,True,True,False,True,False,False
7040,0,0,1,1,11,0,1,29.60,346.45,0,...,False,False,False,False,False,False,False,False,True,False
7041,1,1,1,0,4,1,1,74.40,306.60,1,...,False,False,False,False,False,False,False,False,False,True
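
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up `Contract` column. With three categories, only two indicator columns are kept; a row with zeros in both is understood to belong to the dropped category (the first one in sorted order).

```python
import pandas as pd

# Made-up sample of a three-category feature
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# One indicator column per category, minus the first category
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(encoded.columns.tolist())
print(encoded.astype(int).values.tolist())
```

Dropping one category avoids carrying a column that is fully determined by the others, which matters most for linear models such as logistic regression.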


### Creating time-based features

Once our data has been cleaned and encoded, we can focus on making the most of the time-related information in the dataset. In churn prediction, how long a customer has been with the company can tell us a lot about their risk of leaving. The “tenure” column already gives us the number of months each customer has stayed. We can build on this by creating features like a “new” customer flag for those who have just joined, and a “long_tenure” flag for customers who have been with the company for a longer period. These new features help our model pick up on loyalty patterns and identify which groups are most likely to churn, making our predictions more meaningful.

As such, let's go ahead and create our new features.

In [9]:
#Flag "new" customers as those with tenure less than 6 months
df['new'] = (df['tenure'] < 6).astype(int)

#Flag "long_tenure" customers as those with tenure greater than or equal to 24 months
df['long_tenure'] = (df['tenure'] >= 24).astype(int)

#inspect the new columns
print(df[['tenure', 'new', 'long_tenure']].head())

   tenure  new  long_tenure
0       1    1            0
1      34    0            1
2       2    1            0
3      45    0            1
4       2    1            0
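
An alternative way to express the same thresholds is a single banded feature built with `pd.cut`. This is only a sketch: the `mid` band and the label names are assumptions added for illustration, and the tenure values are the five shown in the output above.

```python
import pandas as pd

# The five tenure values from the preview above
tenure = pd.Series([1, 34, 2, 45, 2], name="tenure")

# Bands: up to 5 months = new, 6 to 23 = mid, 24 and above = long
tenure_band = pd.cut(
    tenure,
    bins=[-1, 5, 23, float("inf")],
    labels=["new", "mid", "long"],
)

print(tenure_band.tolist())
```

A banded column like this would still need to be one-hot encoded before modeling, so the two 0/1 flags used in the notebook are the more direct route.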


## 3. Train Test Split

With a majority of the preprocessing completed, we can move on to splitting the data into training and testing sets. This step is important because it lets us train our model on one portion of the data while reserving another portion to fairly evaluate how well the model performs on unseen examples. We’ll use scikit-learn’s `train_test_split` function to randomly divide our dataset, keeping 80% for training and 20% for testing. This gives us a reliable way to measure accuracy and to detect overfitting as we move forward with building and evaluating our model.

Before the split, we'll also separate our features from our target, so that 80% of the features and their matching target values go to training and the remaining 20% go to testing. Keeping the target out of the feature matrix also means that whatever model we use has to predict the target from the other columns, rather than simply reading the answer from its inputs.

As such, we'll start by importing the `train_test_split` function, and then do our split.

In [10]:
#Import train_test_split from sklearn
from sklearn.model_selection import train_test_split

#Separate the features (X) from the target (y)
X = df.drop(columns=['Churn'])
y = df['Churn']

#Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=64)
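
One optional refinement, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. The sketch below uses made-up arrays with a 75/25 class ratio; applying it to the real data would change the split and therefore the results reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 25% of them in the positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 75 + [1] * 25)

# stratify=y_toy keeps the 75/25 ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=64, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

With an imbalanced target such as churn, stratifying helps make sure the test set contains a representative share of churners.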

## 4. Applying Transformations

Now that our data has been split, we can do our final bit of preprocessing. This is the point where we apply transformations that should be fitted using only the training data. By fitting these transformations on the training set and then applying them to the test set, we make sure that no information from the test data sneaks into the training process. This keeps our evaluation fair and realistic, helping us build a model that will generalize well to new, unseen customers.

In this part, we’ll apply a log transformation to features like “TotalCharges” because billing amounts are often highly skewed: most people pay lower totals, but a few pay very high amounts. The log transformation compresses those large values, which reduces the impact of outliers and makes the data easier for the model to work with. To do this, we’ll import the FunctionTransformer from scikit-learn, fit it on the training set, and then use it to transform both the train and test data. The log itself has no parameters to learn, so fitting is a formality here, but following the same fit-on-train, transform-both pattern keeps the workflow consistent with the scaler that follows.

Afterwards, we can standardize our data, rescaling each feature so that its mean is 0 and its standard deviation is 1. We do this so that models sensitive to feature magnitude are not dominated by the features with the largest values.

We'll start by applying the log transformations, and then scaling the data after.

In [None]:
#Import FunctionTransformer and StandardScaler from sklearn, and numpy for the log1p function
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
import numpy as np

#Wrap np.log1p (log(1 + x), so the zeros we imputed stay at 0) in a transformer
logTransform = FunctionTransformer(func=np.log1p, validate=True)

#Work on copies so pandas does not warn about writing to a slice of df
X_train, X_test = X_train.copy(), X_test.copy()

#Apply the log transform to 'TotalCharges'; ravel() flattens the (n, 1) output into a column
X_train['TotalCharges'] = logTransform.fit_transform(X_train[['TotalCharges']]).ravel()
X_test['TotalCharges'] = logTransform.transform(X_test[['TotalCharges']]).ravel()

#Standardize the entire training and testing sets (the scaler is fitted on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#Show the transformed arrays
print("Transformed Training Set:")
print(X_train)
print("\nTransformed Testing Set:")
print(X_test)




Transformed Training Set:
[[ 0.99398331 -0.44435292  1.04317472 ... -0.54055697 -0.48503137
   0.89571383]
 [ 0.99398331 -0.44435292  1.04317472 ... -0.54055697 -0.48503137
   0.89571383]
 [ 0.99398331 -0.44435292 -0.95861219 ... -0.54055697 -0.48503137
  -1.116428  ]
 ...
 [-1.00605311 -0.44435292 -0.95861219 ... -0.54055697 -0.48503137
   0.89571383]
 [-1.00605311 -0.44435292 -0.95861219 ... -0.54055697 -0.48503137
   0.89571383]
 [ 0.99398331 -0.44435292 -0.95861219 ... -0.54055697 -0.48503137
   0.89571383]]

Transformed Testing Set:
[[ 0.99398331 -0.44435292 -0.95861219 ...  1.84994378 -0.48503137
   0.89571383]
 [ 0.99398331 -0.44435292  1.04317472 ... -0.54055697 -0.48503137
  -1.116428  ]
 [-1.00605311 -0.44435292 -0.95861219 ... -0.54055697 -0.48503137
   0.89571383]
 ...
 [-1.00605311 -0.44435292  1.04317472 ... -0.54055697 -0.48503137
  -1.116428  ]
 [ 0.99398331 -0.44435292  1.04317472 ... -0.54055697 -0.48503137
  -1.116428  ]
 [ 0.99398331 -0.44435292 -0.95861219 ... -0.5

Good work! With our TotalCharges transformed and our data scaled, we are ready to finally create our models.
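
For reference, the same log-then-scale logic can be packaged with scikit-learn's `ColumnTransformer` and `Pipeline`, which makes the fit-on-train, transform-on-test pattern harder to get wrong. This is a sketch under stated assumptions, not the notebook's method: the two small frames are made up, and only two columns are included to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up stand-ins for X_train / X_test (values are illustrative only)
train = pd.DataFrame({"tenure": [1, 34, 2, 45],
                      "TotalCharges": [29.85, 1889.5, 0.0, 1840.75]})
test = pd.DataFrame({"tenure": [2, 24],
                     "TotalCharges": [151.65, 1990.5]})

# TotalCharges: log1p first, then standardize
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

# Route each column to the right transformation
preprocess = ColumnTransformer([
    ("charges", log_then_scale, ["TotalCharges"]),
    ("rest", StandardScaler(), ["tenure"]),
])

# Statistics are learned from the training frame only, then reused on the test frame
train_t = preprocess.fit_transform(train)
test_t = preprocess.transform(test)

print(train_t.round(2))
print(test_t.round(2))
```

Note that the output columns follow the order given to `ColumnTransformer` (here `TotalCharges` first), not the order in the original frame.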

## 5. Building and Training the Models

Now that the data is fully scaled and preprocessed, we are ready to move on to building our models. Both logistic regression and random forest classifiers are well suited to the churn prediction problem, but they approach it differently. Logistic regression is a linear model that predicts the probability of churn using a weighted combination of all our features. It is interpretable and can show us which features are most strongly associated with customers leaving. Random forests, on the other hand, are an ensemble of decision trees that can capture more complex and nonlinear relationships in the data. They are generally less sensitive to outliers and can often achieve higher accuracy on structured datasets like ours. For the sake of comparing and learning, we will train both models and see how their predictions and performance measure up, helping us understand the strengths and weaknesses of each approach.

We'll go ahead and import both the logistic regression and random forest classes from sklearn. After creating both models, we'll fit them on the training data.

In [12]:
#import the logistic regression model and the random forest classifier from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#Create a logistic regression model
#Raise max_iter above the default so the solver has enough iterations to converge
logistic_model = LogisticRegression(max_iter=1000, random_state=64)

#Create a random forest classifier
#n_estimators = 100 means 100 trees in the forest, giving a good balance between performance and speed
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=64)

#Fit both models to the training data
logistic_model.fit(X_train, y_train)    
random_forest_model.fit(X_train, y_train)


Just like that, both models have been fit on the training data and are ready to be tested.
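
Since interpretability was one of the reasons for choosing logistic regression, here is a minimal sketch of how fitted coefficients can be paired with feature names. It uses synthetic data with hypothetical feature names. In the notebook itself, the names would have to be saved from `X.columns` before scaling, because `StandardScaler` returns a plain NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends only on the first feature
rng = np.random.default_rng(64)
feature_names = ["informative", "noise"]  # hypothetical names for illustration
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Pair each coefficient with its feature name
coefficients = pd.Series(model.coef_[0], index=feature_names)

# Rank features by the absolute size of their coefficient
order = coefficients.abs().sort_values(ascending=False).index
ranked = coefficients.reindex(order)

print(ranked)
```

Because the features are standardized, coefficient magnitudes are roughly comparable across features; a positive sign means higher values push the prediction towards class 1.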

## 6. Evaluating the Models

With both models trained on our data, it's time to test them and see how well they do. We'll import several evaluation metrics from scikit-learn. Fortunately, these metrics work with both models, so we can go straight ahead and see how they perform.

In [16]:
#Import necessary libraries for evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Make predictions on the test set and store them in variables
y_pred1 = logistic_model.predict(X_test)
y_pred2 = random_forest_model.predict(X_test)

#Check the accuracy of our predictions
print("Validation Accuracy (Logistic Regression):", accuracy_score(y_test, y_pred1))
print("Validation Accuracy (Random Forest):", accuracy_score(y_test, y_pred2))

#Display the confusion matrix
print("Confusion Matrix (Logistic Regression):\n", confusion_matrix(y_test, y_pred1))
print("Confusion Matrix (Random Forest):\n", confusion_matrix(y_test, y_pred2))

#Print the classification report for more details
print("Classification Report (Logistic Regression):\n", classification_report(y_test, y_pred1))
print("Classification Report (Random Forest):\n", classification_report(y_test, y_pred2))

Validation Accuracy (Logistic Regression): 0.8062455642299503
Validation Accuracy (Random Forest): 0.7806955287437899
Confusion Matrix (Logistic Regression):
 [[935  98]
 [175 201]]
Confusion Matrix (Random Forest):
 [[921 112]
 [197 179]]
Classification Report (Logistic Regression):
               precision    recall  f1-score   support

           0       0.84      0.91      0.87      1033
           1       0.67      0.53      0.60       376

    accuracy                           0.81      1409
   macro avg       0.76      0.72      0.73      1409
weighted avg       0.80      0.81      0.80      1409

Classification Report (Random Forest):
               precision    recall  f1-score   support

           0       0.82      0.89      0.86      1033
           1       0.62      0.48      0.54       376

    accuracy                           0.78      1409
   macro avg       0.72      0.68      0.70      1409
weighted avg       0.77      0.78      0.77      1409



### Metrics Overview
Before diving into the results, it’s helpful to clarify what each metric means.

Accuracy reflects the proportion of all predictions that the model got right.

Precision measures how often the model’s positive predictions (predicting a customer will churn) are correct.

Recall tells us how well the model finds all the actual churners in the data.

F1-score is the harmonic mean of precision and recall, balancing the trade-off between the two.

Support simply counts the number of samples in each class.

The confusion matrix visually breaks down how the model classified each group, highlighting both successes and errors.
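
These definitions can be checked by hand against the logistic regression confusion matrix reported above. The sketch below recomputes each metric for the churn class (class 1) from the four cell counts, using scikit-learn's layout where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Logistic regression confusion matrix from the output above
cm = np.array([[935, 98],
               [175, 201]])

# Unpack as true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted churners who really churned
recall = tp / (tp + fn)              # share of actual churners that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values agree with the accuracy and the class 1 row of the classification report.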

### Analysis

#### Confusion Matrix
The confusion matrices for both models show how well each approach distinguishes between customers who stayed (class 0) and those who churned (class 1). For logistic regression, the model correctly predicted 935 customers who stayed and 201 who churned, but also misclassified 98 customers who stayed as churners, and missed 175 true churners. The random forest model shows similar patterns, with 921 correct predictions for non-churners and 179 for churners, while making more errors for both types: 112 false positives and 197 missed churners. Both models are clearly more effective at recognizing customers who stayed, but have more difficulty correctly identifying those who will actually churn.

#### Classification Report
Precision, recall, and F1-score shed further light on each model’s strengths:

- Precision: For customers who stayed (class 0), both models are relatively strong (0.84 for logistic regression and 0.82 for random forest), meaning most “no churn” predictions are correct. Precision for churners (class 1) is lower, 0.67 for logistic regression and 0.62 for random forest, indicating both models generate a fair number of false alarms.

- Recall: Logistic regression achieves a recall of 0.91 for class 0, showing it successfully identifies the vast majority of customers who stayed. For churners, recall drops to 0.53, suggesting nearly half of actual churners are missed. Random forest is slightly lower, with recall values of 0.89 for class 0 and just 0.48 for class 1.

- F1-score: The F1-scores for non-churners are higher for both models (0.87 for logistic regression, 0.86 for random forest), but for churners, the scores are noticeably lower (0.60 and 0.54, respectively), reinforcing that both models struggle with identifying churners as effectively as non-churners.

The macro and weighted averages for all metrics fall between roughly 0.68 and 0.81, indicating moderate overall balance in classification but with a clear drop-off in performance for the minority class.

#### Overall Analysis
Accuracy for logistic regression is about 0.81, slightly higher than random forest at 0.78. While both models perform well in classifying customers who stayed, they struggle to catch churners, with lower recall and F1-scores for class 1. The models tend to be conservative, more likely to classify a customer as staying rather than churning. This reflects a common challenge in churn prediction, where imbalanced datasets and subtle churn patterns make it difficult to catch every customer at risk of leaving.

The results suggest that logistic regression holds a slight edge over random forest for this specific dataset, offering better recall and F1-score for churners, although both models are far from perfect. In a real-world setting, you might want to experiment with techniques to boost recall for churn (such as adjusting class weights, resampling, or fine-tuning hyperparameters) in order to identify more customers at risk.
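
As a rough sketch of two of these ideas, the block below trains one model with `class_weight="balanced"` and, separately, lowers the decision threshold of a default model from 0.5 to 0.3. It runs on synthetic data from `make_classification`, so the numbers it prints say nothing about the Telco results; the 0.3 threshold is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 27% positives), for illustration only
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.73, 0.27], random_state=64)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=64
)

# Option 1: weight classes inversely to their frequency during training
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
recall_balanced = recall_score(y_te, balanced.predict(X_te))

# Option 2: keep the default model but flag churn at a lower probability
default = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = default.predict_proba(X_te)[:, 1]
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lower = recall_score(y_te, (proba >= 0.3).astype(int))

print(recall_default, recall_lower, recall_balanced)
```

Lowering the threshold can only keep or raise recall, since every customer flagged at 0.5 is still flagged at 0.3, but it usually costs precision. Which trade-off is acceptable depends on how expensive a missed churner is compared with an unnecessary retention offer.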

Comparing both models side by side not only helps us understand their strengths and weaknesses, but also highlights the value of evaluating models beyond simple accuracy. As with any predictive modeling project, this experience reinforces the importance of careful preprocessing, thorough evaluation, and continual improvement, whether in pursuit of actionable business insights or simply for the sake of learning the techniques. Whatever your reason for following along, hopefully this all helped.

Congratulations on reaching this point in the project and gaining a deeper understanding of model comparison and churn prediction!