# **Customer Churn in Telecom**

## **Problem Statement**

**Objective**

The objective of this project is to analyze customer data from a telecom company to understand the factors that lead to customer churn.

Goals:
- Perform Exploratory Data Analysis (EDA)
- Identify patterns and key drivers of churn
- Build a machine learning model to predict customer churn

Business Impact:
Reducing churn helps improve customer retention and revenue.

## **Import Libraries**

In [1]:
#Data Handling
import pandas as pd
import numpy as np

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#Warning
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")

## **Load The Data**

**Mount Drive**

In [2]:
#Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Load CSV**

In [3]:
file_path = "/content/drive/Othercomputers/SAGAR LAPTOP DATA/DATA SCIENCE/Projects/Machine-Learning-Portfolio/03_ML_Practice/Customer Churn in Telecom/raw_data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
df = pd.read_csv(file_path)

Update the file path based on your Drive location.
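Because the path above is specific to one Drive layout, a wrong path is the most likely first failure. A minimal sketch of a loader that fails early with a readable message is shown below; `load_churn_csv` and the two-row demo file are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_churn_csv(path):
    """Read the churn CSV, raising a clear error if the file is not where we expect."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"CSV not found at {csv_path}; update the path for your environment"
        )
    return pd.read_csv(csv_path)


# Demo on a throwaway two-row file (toy data, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("customerID,Churn\nA-1,No\nB-2,Yes\n")
    demo_df = load_churn_csv(demo_path)
```

In the notebook itself, the call would simply be `df = load_churn_csv(file_path)`.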

## **Know Your Data**

### **First View Of The Data**

First 5 rows

In [4]:
df.head()

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


Last 5 rows

In [5]:
df.tail()

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7038 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.8 | 1990.5 | No |
| 7039 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.2 | 7362.9 | No |
| 7040 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.6 | 346.45 | No |
| 7041 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.4 | 306.6 | Yes |
| 7042 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.5 | No |


### **Shape & Size Of Data**

In [6]:
#shape of data

df.shape

(7043, 21)

The dataset has 7043 rows and 21 columns.

In [7]:
#size of the data
df.size

147903
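The size is simply rows × columns, which is why 7043 × 21 gives 147903. A small sketch on a hypothetical toy frame (not the churn data) makes the relationship explicit:

```python
import pandas as pd

# Toy frame: 3 rows, 2 columns (illustrative values only)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape
total_cells = toy.size  # number of cells = rows * columns

# The same arithmetic for the shape reported above
churn_cells = 7043 * 21
```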

### **Columns in Data**

In [8]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

### **Data Information**

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


### **Missing Values In Data**

In [10]:
#Missing values in data
df.isnull().sum()

| Column | Missing values |
|---|---|
| customerID | 0 |
| gender | 0 |
| SeniorCitizen | 0 |
| Partner | 0 |
| Dependents | 0 |
| tenure | 0 |
| PhoneService | 0 |
| MultipleLines | 0 |
| InternetService | 0 |
| OnlineSecurity | 0 |
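Note that `isnull()` only counts true NaN/None values. Blank or whitespace-only strings in text columns are not flagged, which matters later for TotalCharges. A minimal sketch of an additional check, using a hypothetical three-row toy frame rather than the real data:

```python
import pandas as pd

# Toy frame with one whitespace-only entry (illustrative values only)
toy = pd.DataFrame({
    "tenure": [0, 12, 5],
    "TotalCharges": [" ", "350.5", "99.9"],
    "Churn": ["No", "Yes", "No"],
})

# isnull() sees no missing values, because " " is a valid string
nan_counts = toy.isnull().sum()

# Count entries that are empty once surrounding whitespace is stripped
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
blank_counts = {
    c: int(toy[c].astype(str).str.strip().eq("").sum()) for c in text_cols
}
```

On the real frame the same dictionary comprehension can be run over `df` to see which text columns hide blanks.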


### **Statistical Summary Of The Data**

In [11]:
df.describe()

|  | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |
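By default `describe()` summarizes numeric columns only, so TotalCharges is absent from the summary while it is still stored as text. A hedged sketch on a hypothetical toy frame shows the difference when `include="all"` is passed:

```python
import pandas as pd

# Toy frame: TotalCharges deliberately kept as strings (illustrative values only)
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", "1889.5", "108.15"],
})

# Default: numeric columns only, so TotalCharges is left out
numeric_summary = toy.describe()

# include="all" adds count/unique/top/freq rows for the text column
full_summary = toy.describe(include="all")
```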


### **Duplicate Rows In Data**

In [14]:
#duplicate rows in data
df.duplicated().sum()

np.int64(0)

### **Unique Values In Data**

In [15]:
for col in df.select_dtypes(include='object').columns:
    print(f"{col} -> {df[col].nunique()} unique values")


customerID -> 7043 unique values
gender -> 2 unique values
Partner -> 2 unique values
Dependents -> 2 unique values
PhoneService -> 2 unique values
MultipleLines -> 3 unique values
InternetService -> 3 unique values
OnlineSecurity -> 3 unique values
OnlineBackup -> 3 unique values
DeviceProtection -> 3 unique values
TechSupport -> 3 unique values
StreamingTV -> 3 unique values
StreamingMovies -> 3 unique values
Contract -> 3 unique values
PaperlessBilling -> 2 unique values
PaymentMethod -> 4 unique values
TotalCharges -> 6531 unique values
Churn -> 2 unique values
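The counts alone do not show what the categories are (for example, the third level in MultipleLines) or how balanced the target is. A small sketch on a hypothetical toy frame lists the levels and the class share of Churn:

```python
import pandas as pd

# Toy frame (illustrative values only, not the real distribution)
toy = pd.DataFrame({
    "MultipleLines": ["No phone service", "No", "Yes", "No"],
    "Churn": ["No", "No", "Yes", "Yes"],
})

# Distinct levels per column, sorted for readability
levels = {col: sorted(toy[col].unique()) for col in toy.columns}

# Share of each class in the target
churn_share = toy["Churn"].value_counts(normalize=True)
```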


### **Final Observation - Know Your Data**


**Dataset Structure**

- 7043 rows, 21 columns
- Mix of categorical and numerical features
- Target variable: Churn (binary)

**Data Quality**

- No null values reported by df.isnull()
- No duplicate rows
- Data is structurally clean

**Data Type Issues Identified**

- TotalCharges is stored as object → needs conversion to numeric
- SeniorCitizen is numeric (0/1) but logically categorical
- customerID is an identifier → not useful for modeling

**Feature Nature**

- Most categorical columns have 2–3 unique values → manageable encoding
- PaymentMethod has 4 categories
- TotalCharges is continuous once converted to numeric
- Dataset is suitable for classification

## **Data Cleaning**

### **Dropping CustomerID Column**

In [17]:
# Dropping the customerID column (a unique identifier, not a feature)
df.drop('customerID',axis=1,inplace=True)
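One side effect worth checking: the earlier duplicate check ran while every row still carried a unique customerID. Once the identifier is gone, rows with identical feature values become exact duplicates. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame: two customers share identical feature values (illustrative only)
toy = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "gender": ["Male", "Male", "Female"],
    "tenure": [1, 1, 5],
    "Churn": ["No", "No", "Yes"],
})

# With the unique ID present, no row can be a duplicate
dupes_with_id = int(toy.duplicated().sum())

# drop(columns=...) returns a new frame instead of modifying in place
without_id = toy.drop(columns="customerID")
dupes_without_id = int(without_id.duplicated().sum())
```

Whether such rows should be removed is a judgment call, since two different customers can legitimately share the same profile.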

### **Convert Datatypes**

#### **Convert TotalCharges to Numeric**

In [19]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

Check

In [20]:
df['TotalCharges'].isnull().sum()

np.int64(11)

After the conversion, 11 rows became NaN. These rows most likely contained blank strings (" ") in the original data, which cannot be parsed as numbers.
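With `errors='coerce'`, any value that cannot be parsed is silently replaced by NaN, so it is worth looking at the raw values that failed. A hedged sketch on a hypothetical toy Series (the real check would run on the original, unconverted column):

```python
import pandas as pd

# Toy column with one blank entry (illustrative values only)
raw = pd.Series([" ", "29.85", "1889.5"], name="TotalCharges")

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Raw values that were present but failed to convert
failed = raw[converted.isna() & raw.notna()]
```

Inspecting `failed.unique()` confirms whether the NaNs come from blanks or from some other malformed value.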


#### **Convert SeniorCitizen to Categorical**

In [21]:
df['SeniorCitizen'] = df['SeniorCitizen'].astype('object')
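Casting to object keeps the values 0 and 1 as Python integers inside a text-typed column. As an alternative sketch (an assumption, not what the notebook does), the values could be mapped to "No"/"Yes" so the column matches the other binary features such as Partner and Dependents:

```python
import pandas as pd

# Toy column (illustrative values only)
toy = pd.DataFrame({"SeniorCitizen": [0, 1, 0]})

# Map 0/1 to the same labels used by the other yes/no columns
toy["SeniorCitizen"] = toy["SeniorCitizen"].map({0: "No", 1: "Yes"})
```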

#### **Final Check**

In [22]:
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   object 
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 


| Column | Missing values |
|---|---|
| gender | 0 |
| SeniorCitizen | 0 |
| Partner | 0 |
| Dependents | 0 |
| tenure | 0 |
| PhoneService | 0 |
| MultipleLines | 0 |
| InternetService | 0 |
| OnlineSecurity | 0 |
| OnlineBackup | 0 |


### **Dropping Null Rows**

In [23]:
#Check rows
df['TotalCharges'].isnull().sum()

np.int64(11)

In [24]:
#dropping rows
df = df.dropna(subset=['TotalCharges'])

**Check**

In [None]:
df.isnull().sum()


In [28]:
df.shape

(7032, 20)
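After `dropna`, the remaining rows keep their original index labels, so the index has gaps where the 11 rows were removed. A small sketch on a hypothetical toy frame shows the effect and an optional reset (the notebook does not reset the index):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing TotalCharges value (illustrative only)
toy = pd.DataFrame({
    "tenure": [0, 12, 5],
    "TotalCharges": [np.nan, 350.5, 99.9],
})

# Drop rows where TotalCharges is missing; labels 1 and 2 survive
cleaned = toy.dropna(subset=["TotalCharges"])

# Optional: rebuild a contiguous 0..n-1 index
cleaned_reset = cleaned.reset_index(drop=True)
```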

### **Standardize Column Names**

In [29]:
df.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [30]:
# Standardizing column names: strip whitespace, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.columns

Index(['gender', 'seniorcitizen', 'partner', 'dependents', 'tenure',
       'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity',
       'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv',
       'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod',
       'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')
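Since the original names contain no spaces, the step above only lowercases them (seniorcitizen, not senior_citizen). If true snake_case is wanted, one possible sketch is a regex-based conversion applied before lowercasing; `to_snake` is a hypothetical helper, not part of the notebook:

```python
import re


def to_snake(name):
    """Convert a CamelCase or mixedCase name to snake_case."""
    # Split before a capital that starts a lowercase run: SeniorCitizen -> Senior_Citizen
    step1 = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name.strip())
    # Split between a lowercase letter or digit and a capital: StreamingTV -> Streaming_TV
    step2 = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step1)
    return step2.lower()


original = ["gender", "SeniorCitizen", "StreamingTV", "customerID", "MonthlyCharges"]
converted = [to_snake(col) for col in original]
```

On the DataFrame this would be applied as `df.columns = [to_snake(c) for c in df.columns]`, using the original mixed-case names as input.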

### **Final Summary - Data Cleaning**

During the data cleaning phase, we performed the following steps:

- Removed customerID, as it is a unique identifier and not useful for analysis or modeling.
- Converted TotalCharges from object to a numeric datatype.
- Identified and removed 11 rows with invalid or blank TotalCharges values.
- Converted SeniorCitizen from numeric (0/1) to a categorical (object) type.
- Standardized all column names to lowercase. The original names contain no spaces, so the space-to-underscore replacement has no effect (for example, seniorcitizen rather than senior_citizen).

Revalidated the dataset to ensure:

- No missing values
- No duplicate rows (checked before customerID was dropped; not re-checked afterwards)
- Correct data types

Final shape: (7032, 20)

The dataset is now clean and ready for Exploratory Data Analysis (EDA).
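A quick re-check along these lines can confirm the summary on the final frame. The sketch below uses a hypothetical `cleaning_report` helper and a toy frame; on the real data it would be called as `cleaning_report(df)`.

```python
import pandas as pd


def cleaning_report(frame):
    """Collect the basic post-cleaning checks in one dictionary."""
    return {
        "shape": frame.shape,
        "missing": int(frame.isnull().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }


# Toy frame with one repeated row (illustrative values only)
toy = pd.DataFrame({
    "tenure": [1, 1, 5],
    "totalcharges": [29.85, 29.85, 500.0],
    "churn": ["No", "No", "Yes"],
})

report = cleaning_report(toy)
```

A non-zero duplicates count after dropping the identifier is not necessarily an error, but it is worth knowing before modeling.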