# Full Data Science Cycle for This Project

## Problem Definition
The business wants to reduce customer loss. The task is to predict which customers are likely to churn so the company can act before they leave.

Define:

- The target variable: `Churn` (0 = stay, 1 = leave)
- The success metric: F1-score or recall (how many of the actual churners the model identifies)

## 1. Data Cleaning
Perform:

1. Handle missing values
1. Remove duplicates
1. Fix incorrect data types
1. Encode categorical features
1. Treat outliers

You will learn:

- `.isnull()`, `.fillna()`, `.dropna()`
- Label encoding / one-hot encoding
- Dealing with imbalanced classes

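
The cleaning steps above can be sketched on a small made-up frame. The rows below are illustrative and not taken from the Telco file, and filling with the median is only one possible choice:

```python
import pandas as pd

# Hypothetical toy data; not rows from the real dataset.
toy = pd.DataFrame({
    "tenure": [1, 34, 34, 2],
    "TotalCharges": ["29.85", "1889.5", "1889.5", " "],
    "Contract": ["Month-to-month", "One year", "One year", "Month-to-month"],
})

# Fix the data type: strings that cannot be parsed (such as " ") become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Remove exact duplicate rows.
toy = toy.drop_duplicates()

# Count the missing values, then fill them (median is an assumption here).
n_missing = int(toy["TotalCharges"].isnull().sum())
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

# One-hot encode the categorical column.
encoded = pd.get_dummies(toy, columns=["Contract"])
```

Dropping the affected rows with `.dropna()` is the alternative to filling when only a few values are missing.
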
## 2. Exploratory Data Analysis (EDA)
Visualize:

- Churn rate
- Distribution of fees, tenure, contract type
- Correlations between features
- Boxplots, histograms, violin plots, heatmaps

Find insights like:

- Customers with month-to-month plans churn more
- Customers with high monthly charges are at risk

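
Before plotting, the same questions can be answered numerically. A minimal sketch on made-up rows (the values and the helper column `churned` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "MonthlyCharges": [70.7, 29.85, 56.95, 42.3, 99.0, 53.85],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Map the target to 0/1 so that a mean is a churn rate.
toy["churned"] = (toy["Churn"] == "Yes").astype(int)

# Overall churn rate and churn rate per contract type.
overall_rate = toy["churned"].mean()
rate_by_contract = toy.groupby("Contract")["churned"].mean()

# Correlation between a numeric feature and the 0/1 target.
corr = toy["MonthlyCharges"].corr(toy["churned"])
```

On the real data, `rate_by_contract` is the table a bar chart of churn by contract type would be drawn from.
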
## 3. Feature Engineering
Create new variables that help prediction, such as:

- Tenure groups (0–3 months, 4–12 months, etc.)
- Total spend = `MonthlyCharges * tenure`
- Has multiple services? (binary)

Remove features that carry no predictive signal, such as:

- `customerID`

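
A minimal sketch of these features on made-up rows. The bin edges, the group labels, and the definition of "multiple services" (phone plus internet) are assumptions, not choices fixed by the project:

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names.
toy = pd.DataFrame({
    "customerID": ["A", "B", "C"],
    "tenure": [1, 10, 45],
    "MonthlyCharges": [29.85, 56.95, 42.30],
    "PhoneService": ["No", "Yes", "Yes"],
    "InternetService": ["DSL", "DSL", "No"],
})

# Tenure groups with right-closed bins, so 3 months falls in the first group.
toy["tenure_group"] = pd.cut(
    toy["tenure"],
    bins=[0, 3, 12, 24, 72],
    labels=["0-3", "4-12", "13-24", "25-72"],
    include_lowest=True,
)

# Approximate total spend.
toy["total_spend"] = toy["MonthlyCharges"] * toy["tenure"]

# Binary flag: the customer has both phone and internet service.
toy["multiple_services"] = (
    (toy["PhoneService"] == "Yes") & (toy["InternetService"] != "No")
).astype(int)

# The ID identifies a row but says nothing about churn.
features = toy.drop(columns=["customerID"])
```
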
## 4. Modeling
Train several ML models:

- Logistic Regression
- Random Forest
- Gradient Boosting (XGBoost or LightGBM)
- SVM
- KNN

Split data:

- `train_test_split(X, y, test_size=0.2)`

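
A minimal sketch of the split and a first baseline model, run on synthetic data rather than the churn table. The `stratify` and `random_state` arguments are additions to the call above: the first keeps the class ratio similar in both splits, the second makes the split reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn features (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=42
)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline model; max_iter is raised so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

The other models in the list follow the same `fit` / `score` pattern, so they can be swapped in for `LogisticRegression` and compared on the same split.
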
## 5. Evaluation
Measure performance using:

- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC
- Confusion matrix

Focus on recall: the higher the recall, the fewer actual churners the model misses.

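
A small worked example of these metrics on made-up labels (1 = churn, 0 = stay), using `sklearn.metrics`:

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 4 actual churners, 6 customers who stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
accuracy = (tp + tn) / len(y_true)
```

Here accuracy is 0.7 even though the model finds only half of the churners (recall 0.5), which is why accuracy alone can mislead when the classes are imbalanced.
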
## 6. Model Improvement
- Tune hyperparameters with `GridSearchCV` or `RandomizedSearchCV`
- Add or remove features
- Try class balancing (SMOTE, class weights)

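
A minimal sketch of a grid search that tunes for recall and tries class weights at the same time. The data is synthetic and the parameter grid is an assumption chosen to keep the example small; SMOTE is not shown because it lives in a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn features (an assumption).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Candidate settings: regularization strength and class weighting.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# 5-fold cross-validation, scoring each candidate by recall.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_recall = search.best_score_
```

`RandomizedSearchCV` takes the same estimator and scoring arguments but samples a fixed number of candidates, which helps when the grid gets large.
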
## 7. Visualization & Storytelling
Create visuals to explain:

- Why customers churn
- What factors matter most (feature importance)
- How churn rate changes across groups

Use:

- Matplotlib
- Seaborn
- Plotly
- SHAP (advanced interpretability)
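
As a starting point for the feature-importance visual, a fitted tree ensemble exposes impurity-based importances directly. A minimal sketch on synthetic data, with hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; only two of the five features carry signal.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    random_state=0,
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances sum to 1; sort so the strongest feature comes first.
importances = pd.Series(
    forest.feature_importances_, index=names
).sort_values(ascending=False)
```

Calling `importances.plot.barh()` draws this as a horizontal bar chart with Matplotlib. SHAP values give a more detailed, per-customer view of the same question.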

--------

# Let's get started
--------

## 1. Data Cleaning

1. Handle missing values
1. Remove duplicates
1. Fix incorrect data types
1. Encode categorical features
1. Treat outliers

In [25]:
# Get data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('../data/Telco_Customer_Churn.csv')
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [27]:
for col in data.select_dtypes(include=['object']).columns:
    print(f"{col}: {data[col].nunique()} unique values")

customerID: 7043 unique values
gender: 2 unique values
Partner: 2 unique values
Dependents: 2 unique values
PhoneService: 2 unique values
MultipleLines: 3 unique values
InternetService: 3 unique values
OnlineSecurity: 3 unique values
OnlineBackup: 3 unique values
DeviceProtection: 3 unique values
TechSupport: 3 unique values
StreamingTV: 3 unique values
StreamingMovies: 3 unique values
Contract: 3 unique values
PaperlessBilling: 2 unique values
PaymentMethod: 4 unique values
TotalCharges: 6531 unique values
Churn: 2 unique values


In [28]:
# TotalCharges was read as text; convert it to numbers (unparseable entries become NaN)
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors='coerce')
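
After a conversion with `errors='coerce'`, it is worth checking which entries failed to parse, since they are now missing values that need handling. A minimal sketch on a made-up series (the values are illustrative, not taken from the file):

```python
import pandas as pd

# Hypothetical raw values; the blank string stands in for a bad entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present as text but could not be parsed as numbers.
failed = raw[converted.isna() & raw.notna()]
n_failed = len(failed)
```

Keeping a copy of the raw column before overwriting it makes the same check possible on `data["TotalCharges"]`.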

In [30]:
data.set_index("customerID", inplace=True)
data.head()

| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
