# Churn Prediction Project

## 1. Introduction
## 2. Load and Explore the Dataset
## 3. Data Cleaning & Feature Engineering
## 4. Exploratory Data Analysis (EDA)
## 5. Data Preprocessing for ML
## 6. Model Training (Logistic Regression)
## 7. Model Evaluation
## 8. Insights & Business Actions
## 9. Conclusion


#### Load and Explore the Dataset

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("telco_churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [2]:
df.info()
df['Churn'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Churn
No     5174
Yes    1869
Name: count, dtype: int64
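
The counts above show an imbalanced target, which matters when reading accuracy later. As a rough sketch, the class shares and the accuracy of a trivial "always predict No" model can be worked out directly from those counts (computed on the full dataset, before any rows are dropped; the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {"No": 5174, "Yes": 1869}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model accuracy should be compared against this baseline rather than against 50%.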

#### Data Cleaning & Feature Engineering

In [4]:
#1. Check TotalCharges: it's often stored as a string column with blank values
# Check type and sample
print(df['TotalCharges'].dtype)
print(df['TotalCharges'].head())

object
0      29.85
1     1889.5
2     108.15
3    1840.75
4     151.65
Name: TotalCharges, dtype: object


In [5]:
#2. Convert TotalCharges to numeric
# Coerce errors to NaN and drop them
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.dropna(subset=['TotalCharges'], inplace=True)
df['TotalCharges'] = df['TotalCharges'].astype(float)
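
Before dropping rows, it can help to see how many values fail to parse. The snippet below is a minimal sketch on a hypothetical four-value sample (not the real column), where a blank string stands in for whatever non-numeric entries the dataset contains:

```python
import pandas as pd

# Hypothetical sample: the third entry is a blank string that cannot be parsed.
sample = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(sample, errors="coerce")

# Count how many values could not be parsed before dropping them.
n_missing = int(converted.isna().sum())
print(f"Unparseable values: {n_missing}")

# Drop the NaN rows; the result is already a float column.
cleaned = converted.dropna()
print(cleaned.dtype)
```

Applied to the real `df['TotalCharges']`, the same count would show how many customers are removed by the `dropna` step.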

In [6]:
#3. Drop columns not useful for ML
df.drop('customerID', axis=1, inplace=True)

#### Encode Target and Categorical Features

In [8]:
#1. Convert target Churn to 1/0
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

In [9]:
#2. One-hot encode categorical columns
# Identify categorical features
cat_cols = df.select_dtypes(include='object').columns.tolist()

# One-hot encode
# This gives us a clean dataset with no string columns, ready for modeling.
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)
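
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical toy frame (the `toy` data is invented for illustration). The first category in sorted order is dropped and becomes the implicit baseline that the remaining dummy columns are measured against:

```python
import pandas as pd

# Toy frame with the three contract types seen in the dataset preview.
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"]}
)

# drop_first=True removes the first category in sorted order.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
print(encoded.columns.tolist())

# A month-to-month row has no dummy set: it is the baseline.
print(encoded.iloc[0].tolist())
```

This is why a column such as `Contract_Month-to-month` never appears among the model coefficients: its effect is expressed through the signs of the other contract columns.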

#### Split Data and Train Logistic Regression

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [14]:
#pip install scikit-learn

In [15]:
#Create X and y
X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']

In [20]:
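#Scale features, then split into Train/Test sets
# Note: the scaler is fit on all rows before the split, so test-set statistics
# influence the scaling; fitting it on the training rows only would avoid this.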
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split scaled data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
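
An alternative worth considering is to split first and let a pipeline fit the scaler on the training rows only, so no test-set statistics reach the model. The sketch below is not the notebook's method; it uses synthetic data from `make_classification` as a stand-in for the encoded churn features, and all `*_demo` names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded churn features (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=42
)

# Split the raw (unscaled) data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# At scoring time the test rows are transformed with the training statistics.
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

On a dataset of this size the difference in scores is usually small, but the pipeline form also carries over cleanly to cross-validation.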



In [21]:
#Train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

#### Model Evaluation

In [22]:
#Predict and evaluate
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8038379530916845
Confusion Matrix:
 [[916 117]
 [159 215]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87      1033
           1       0.65      0.57      0.61       374

    accuracy                           0.80      1407
   macro avg       0.75      0.73      0.74      1407
weighted avg       0.80      0.80      0.80      1407



#### Feature Importance

In [23]:
coefficients = pd.Series(model.coef_[0], index=X.columns).sort_values(ascending=False)
print("Top positive churn predictors:")
print(coefficients.head(10))

print("\nTop negative churn predictors:")
print(coefficients.tail(10))

Top positive churn predictors:
InternetService_Fiber optic       0.730976
TotalCharges                      0.640771
StreamingTV_Yes                   0.250834
StreamingMovies_Yes               0.238019
MultipleLines_Yes                 0.215326
PaymentMethod_Electronic check    0.181180
PaperlessBilling_Yes              0.142979
SeniorCitizen                     0.071341
DeviceProtection_Yes              0.069668
PaymentMethod_Mailed check        0.033041
dtype: float64

Top negative churn predictors:
OnlineSecurity_No internet service    -0.088878
StreamingMovies_No internet service   -0.088878
StreamingTV_No internet service       -0.088878
Dependents_Yes                        -0.104762
TechSupport_Yes                       -0.117001
OnlineSecurity_Yes                    -0.136189
Contract_One year                     -0.310620
Contract_Two year                     -0.599723
MonthlyCharges                        -0.862380
tenure                                -1.351516
dtype: float64
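
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to discuss. Because the features were standardized, each ratio corresponds to a one-standard-deviation increase in that feature (for a dummy column this is not the same as switching from 0 to 1). A minimal sketch using a few coefficients copied from the output above:

```python
import math

# Coefficients copied from the printed output (standardized features).
coefs = {
    "InternetService_Fiber optic": 0.730976,
    "TotalCharges": 0.640771,
    "Contract_Two year": -0.599723,
    "tenure": -1.351516,
}

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-standard-deviation increase in the feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
```

Ratios above 1 indicate higher odds of churn and ratios below 1 indicate lower odds, holding the other features fixed.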

## Insights & Business Actions

### 🔍 Model Performance

- The logistic regression model achieved an accuracy of **80.4%** on the test set.
- The model performs better at predicting **non-churned customers** (class 0) than churned ones (class 1).
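- Precision for churned customers is **moderate (0.65)**: roughly two out of three customers flagged as churners actually churned, so about a third of the flags are false positives.
- Recall for churned customers is 0.57: the model catches 215 of the 374 churners in the test set and misses 159, so there is room to improve how many at-risk customers we catch.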
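
The figures in these bullets can be re-derived from the confusion matrix printed in the evaluation cell. A short sketch of that arithmetic, assuming scikit-learn's layout of rows as true classes and columns as predicted classes:

```python
# Confusion matrix from the evaluation output: [[TN, FP], [FN, TP]].
tn, fp = 916, 117
fn, tp = 159, 215

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```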

### 💡 Churn-Driving Factors

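Top positive contributors to churn (largest positive coefficients):
- Fiber optic internet service
- Higher total charges
- Streaming TV, streaming movies, and multiple lines
- Electronic check payment
- Paperless billing

Top negative contributors (less likely to churn):
- Longer tenure
- Two-year and one-year contracts (relative to month-to-month, the baseline category dropped during encoding)
- Online security and tech support
- Higher monthly charges carry a negative coefficient here, but `MonthlyCharges`, `TotalCharges`, and `tenure` are closely related (total charges accumulate monthly charges over tenure), so these signs should be read with caution.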

### ✅ Business Recommendations

- Proactively engage customers with **month-to-month contracts** and promote longer-term plans.
- Offer **tech support or online security incentives** to customers missing those services.
- Use churn prediction scores in marketing to **trigger loyalty emails or retention offers**.
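
One way to turn scores into retention triggers is to flag customers whose predicted churn probability exceeds a chosen cut-off. The sketch below uses synthetic data and a hypothetical helper `flag_for_retention`; the 0.3 threshold is purely illustrative and would need to be tuned against the cost of an offer versus a lost customer:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.73, 0.27], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive (churn) class for each test customer.
churn_scores = clf.predict_proba(X_te)[:, 1]


def flag_for_retention(scores, threshold):
    """Return 1 for customers whose churn score meets the threshold."""
    return (scores >= threshold).astype(int)


default_flags = flag_for_retention(churn_scores, 0.5)
eager_flags = flag_for_retention(churn_scores, 0.3)  # hypothetical lower cut-off

print("Flagged at 0.5:", int(default_flags.sum()),
      "recall:", round(recall_score(y_te, default_flags), 2))
print("Flagged at 0.3:", int(eager_flags.sum()),
      "recall:", round(recall_score(y_te, eager_flags), 2))
```

Lowering the threshold can only flag the same or more customers, so recall does not decrease, at the price of more false positives receiving offers.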
