## **Dataset Attributes**

**customerID** : Customer ID

**gender** : The customer's gender (Male, Female)

**SeniorCitizen** : Whether the customer is a senior citizen or not (1, 0)

**Partner** : Whether the customer has a partner or not (Yes, No)

**Dependents** : Whether the customer has dependents or not (Yes, No)

**tenure** : Number of months the customer has stayed with the company

**PhoneService** : Whether the customer has a phone service or not (Yes, No)

**MultipleLines** : Whether the customer has multiple lines or not (Yes, No, No phone service)

**InternetService** : The customer’s internet service type (DSL, Fiber optic, No)

**OnlineSecurity** : Whether the customer has online security or not (Yes, No, No internet service)

**OnlineBackup** : Whether the customer has online backup or not (Yes, No, No internet service)

**DeviceProtection** : Whether the customer has device protection or not (Yes, No, No internet service)

**TechSupport** : Whether the customer has tech support or not (Yes, No, No internet service)

**StreamingTV** : Whether the customer has streaming TV or not (Yes, No, No internet service)

**StreamingMovies** : Whether the customer has streaming movies or not (Yes, No, No internet service)

**Contract** : The contract term of the customer (Month-to-month, One year, Two year)

**PaperlessBilling** : Whether the customer has paperless billing or not (Yes, No)

**PaymentMethod** : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

**MonthlyCharges** : The amount charged to the customer monthly

**TotalCharges** : The total amount charged to the customer

**Churn** : Whether the customer churned or not (Yes, No)
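
The value sets listed above can double as a lightweight schema check before any modeling. The following is a minimal sketch, not part of the original notebook: `EXPECTED_VALUES` covers only a subset of columns, `unexpected_values` is a hypothetical helper, and the sample frame is made up for illustration.

```python
import pandas as pd

# Allowed values copied from the attribute list above (subset, for illustration).
EXPECTED_VALUES = {
    "SeniorCitizen": {0, 1},
    "Partner": {"Yes", "No"},
    "MultipleLines": {"Yes", "No", "No phone service"},
    "InternetService": {"DSL", "Fiber optic", "No"},
    "Contract": {"Month-to-month", "One year", "Two year"},
    "Churn": {"Yes", "No"},
}

def unexpected_values(frame, expected=EXPECTED_VALUES):
    """Hypothetical helper: map each column to the values outside its documented set."""
    problems = {}
    for col, allowed in expected.items():
        if col not in frame.columns:
            continue  # skip columns that are absent from this frame
        extra = set(frame[col].dropna().unique()) - allowed
        if extra:
            problems[col] = extra
    return problems

# Made-up sample with one deliberately wrong Contract label.
sample = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "2 year"],
})
report = unexpected_values(sample)
print(report)
```

An empty dictionary would mean every checked column contains only documented values.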

In [1]:
# ! pip install imbalanced-learn
# ! pip install catboost

In [None]:
#imports
import pandas as pd
pd.set_option('display.max_columns', None)
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from plotly.subplots import make_subplots
from plotly.offline import iplot
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, roc_curve
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import seaborn as sns
import numpy as np
import math
import warnings
warnings.filterwarnings('ignore')

In [None]:
file_path = '/content/Telco-Customer-Churn.csv'
df = pd.read_csv(file_path)
df


,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [None]:
def summarize(df):
    summary_df = pd.DataFrame({
        "Column": df.columns,
        "Data Type": df.dtypes.values,
        "Missing Values": df.isnull().sum().values,
        "Unique Values": df.nunique().values,
        "First Value": df.iloc[0].values if len(df) > 0 else None,
        "Second Value": df.iloc[1].values if len(df) > 1 else None,
        "Third Value": df.iloc[2].values if len(df) > 2 else None
    })
    return summary_df

summarize(df)

,Column,Data Type,Missing Values,Unique Values,First Value,Second Value,Third Value
0,customerID,object,0,7043,7590-VHVEG,5575-GNVDE,3668-QPYBK
1,gender,object,0,2,Female,Male,Male
2,SeniorCitizen,int64,0,2,0,0,0
3,Partner,object,0,2,Yes,No,No
4,Dependents,object,0,2,No,No,No
5,tenure,int64,0,73,1,34,2
6,PhoneService,object,0,2,No,Yes,Yes
7,MultipleLines,object,0,3,No phone service,No,No
8,InternetService,object,0,3,DSL,DSL,DSL
9,OnlineSecurity,object,0,3,No,Yes,Yes


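* The summary reports no missing values, but TotalCharges is stored as a string, so entries that are not valid numbers would not be counted as missing.
* tenure, MonthlyCharges, and TotalCharges are numerical; all other features are categorical (SeniorCitizen is stored as int64 but is a binary flag).
* Categorical features must be encoded as numbers before modeling.
* TotalCharges holds numeric values but is stored as a string, so it must be cast to float before modeling. Entries that cannot be parsed become NaN during the cast.
> **TotalCharges ≈ MonthlyCharges × tenure** (approximately; for example, row 1 has 56.95 × 34 = 1936.30 against a recorded 1889.5)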
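
Because TotalCharges is stored as text, a null check on the raw column can come back clean even when some entries are not valid numbers. A minimal sketch on a made-up four-value series (the blank entry is an assumption about how such gaps might appear in the file):

```python
import pandas as pd

# Toy column mimicking TotalCharges as read from CSV: numbers stored as text,
# plus one blank entry (assumed form of a gap, for illustration only).
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# The blank is still a string, so isnull() does not flag it.
missing_before = int(raw.isnull().sum())

# Coercing to numeric turns anything unparseable into NaN.
numeric = pd.to_numeric(raw, errors="coerce")
missing_after = int(numeric.isnull().sum())

print(missing_before, missing_after, numeric.dtype)
```

Comparing the two counts shows how many entries only become visible as missing after the cast.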


In [None]:
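# Cast TotalCharges to float; entries that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Fill only the NaN entries with MonthlyCharges × tenure
df['TotalCharges'] = df['TotalCharges'].fillna(df['MonthlyCharges'] * df['tenure'])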

- Missing TotalCharges values are filled only where they are null, by recomputing them as MonthlyCharges × tenure. This follows the billing logic instead of relying on arbitrary imputation.
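
The fill relies on TotalCharges being close to MonthlyCharges × tenure. A quick way to see how close is to compare the two on the five rows displayed at the top of the notebook. This is only a sketch on those five rows; the column names `estimate` and `relative_error` are introduced here for illustration.

```python
import pandas as pd

# First five rows as shown in the DataFrame preview above.
rows = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "TotalCharges": [29.85, 1889.5, 108.15, 1840.75, 151.65],
})

# Recompute the total the same way the fill does, then measure the gap.
rows["estimate"] = rows["MonthlyCharges"] * rows["tenure"]
rows["relative_error"] = (
    (rows["estimate"] - rows["TotalCharges"]).abs() / rows["TotalCharges"]
)

print(rows.round(4))
```

On these rows the product matches exactly for the one-month customer and stays within a few percent for the others, so it serves as an estimate, not an identity.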

## Label Encode Categorical Features

In [None]:
# Create a deep copy of the dataset
encoded_df = df.copy(deep=True)
encoded_df.drop(columns=['customerID'], inplace=True)

# Identify categorical (non-numeric) features by dtype
categorical_features = encoded_df.select_dtypes(exclude='number').columns.tolist()

label_encoders = {}

for col in categorical_features:
    le = LabelEncoder()
    encoded_df[col] = le.fit_transform(encoded_df[col].astype(str))
    label_encoders[col] = le

    print(
        f"Label Encoding Transformation\n"
        f"{col} : {encoded_df[col].unique()} = "
        f"{le.inverse_transform(encoded_df[col].unique())}"
    )

Label Encoding Transformation
gender : [0 1] = ['Female' 'Male']
Label Encoding Transformation
Partner : [1 0] = ['Yes' 'No']
Label Encoding Transformation
Dependents : [0 1] = ['No' 'Yes']
Label Encoding Transformation
PhoneService : [0 1] = ['No' 'Yes']
Label Encoding Transformation
MultipleLines : [1 0 2] = ['No phone service' 'No' 'Yes']
Label Encoding Transformation
InternetService : [0 1 2] = ['DSL' 'Fiber optic' 'No']
Label Encoding Transformation
OnlineSecurity : [0 2 1] = ['No' 'Yes' 'No internet service']
Label Encoding Transformation
OnlineBackup : [2 0 1] = ['Yes' 'No' 'No internet service']
Label Encoding Transformation
DeviceProtection : [0 2 1] = ['No' 'Yes' 'No internet service']
Label Encoding Transformation
TechSupport : [0 2 1] = ['No' 'Yes' 'No internet service']
Label Encoding Transformation
StreamingTV : [0 2 1] = ['No' 'Yes' 'No internet service']
Label Encoding Transformation
StreamingMovies : [0 2 1] = ['No' 'Yes' 'No internet service']
Label Encoding Transform
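
The codes printed above follow from how `LabelEncoder` works: it sorts the distinct labels and numbers them in that order, which is why 'No internet service' sits between 'No' and 'Yes'. The sketch below reproduces this on a made-up four-value series and, as an alternative not used in this notebook, shows one-hot encoding with `pd.get_dummies`, which avoids implying an order between the three labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample using the three labels of OnlineSecurity.
values = pd.Series(["No", "Yes", "No internet service", "No"], name="OnlineSecurity")

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the sorted labels; the position in it is the assigned code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))

# Alternative (assumption, not the notebook's approach): one column per label.
one_hot = pd.get_dummies(values, prefix="OnlineSecurity")
print(list(one_hot.columns))
```

Tree-based models are usually tolerant of the integer codes, while linear models such as logistic regression may read them as an ordering, so the choice of encoding depends on the model.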

In [None]:
encoded_df.describe()

,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,0.504756,0.162147,0.483033,0.299588,32.371149,0.903166,0.940508,0.872923,0.790004,0.906432,0.904444,0.797104,0.985376,0.992475,0.690473,0.592219,1.574329,64.761692,2279.734304,0.26537
std,0.500013,0.368612,0.499748,0.45811,24.559481,0.295752,0.948554,0.737796,0.859848,0.880162,0.879949,0.861551,0.885002,0.885091,0.833755,0.491457,1.068104,30.090047,2266.79447,0.441561
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.25,0.0,0.0
25%,0.0,0.0,0.0,0.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,35.5,398.55,0.0
50%,1.0,0.0,0.0,0.0,29.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,2.0,70.35,1394.55,0.0
75%,1.0,0.0,1.0,1.0,55.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,89.85,3786.6,1.0
max,1.0,1.0,1.0,1.0,72.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,118.75,8684.8,1.0
