# TOPIC

Telco Customer Churn

# Dataset Information

About Dataset

Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

Dataset:

https://www.kaggle.com/datasets/blastchar/telco-customer-churn

Each row represents a customer; each column contains a customer attribute described in the column metadata.

The data set includes information about:

Customers who left within the last month – the column is called Churn

Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

Demographic info about customers – gender, age range, and if they have partners and dependents

# Dependencies

In [36]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import classification_report
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Data Importation

In [37]:
df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")


## Data Head

In [38]:
# Preview data head
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


# Data Preprocessing

## (SID)-Analysis

### Shape

In [39]:
df.shape

(7043, 21)

### Info

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


### De-Stats

In [41]:
df.describe()

,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


## Numerical VS Categorical

In [42]:
# Check data types of each column
print("Data Types:")
print(df.dtypes)
print()

Data Types:
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object



In [43]:
# Separate numerical and categorical variables
numerical_vars = df.select_dtypes(include=['int64', 'float64']).columns
categorical_vars = df.select_dtypes(include=['object']).columns

# Display numerical and categorical variables
print("Numerical Variables:")
print(numerical_vars)
print()

print("Categorical Variables:")
print(categorical_vars)

Numerical Variables:
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')

Categorical Variables:
Index(['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges',
       'Churn'],
      dtype='object')


## Missing Values

In [44]:
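# Calculate the percentage of missing values in each column.
# Note: isnull() only detects NaN/None. Non-numeric strings, such as the
# entries that keep TotalCharges at dtype object, are not counted here.
missing_percentage = df.isnull().mean() * 100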

# Display the result
print("Percentage of missing values in each column:")
for column, percentage in missing_percentage.items():
    print(f"{column}: {round(percentage, 2)}%")


Percentage of missing values in each column:
customerID: 0.0%
gender: 0.0%
SeniorCitizen: 0.0%
Partner: 0.0%
Dependents: 0.0%
tenure: 0.0%
PhoneService: 0.0%
MultipleLines: 0.0%
InternetService: 0.0%
OnlineSecurity: 0.0%
OnlineBackup: 0.0%
DeviceProtection: 0.0%
TechSupport: 0.0%
StreamingTV: 0.0%
StreamingMovies: 0.0%
Contract: 0.0%
PaperlessBilling: 0.0%
PaymentMethod: 0.0%
MonthlyCharges: 0.0%
TotalCharges: 0.0%
Churn: 0.0%
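
The 0.0% figures above only count NaN/None values, so a column stored as text can still hide unusable entries. A minimal sketch of how such entries might be surfaced, using a hypothetical toy frame (the values below are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: the blank string stands in for a non-numeric entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# isnull() reports nothing because " " is a perfectly valid string.
n_null = int(toy["TotalCharges"].isnull().sum())

# Coercing to numeric turns unparseable entries into NaN, which exposes them.
coerced = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_hidden = int(coerced.isnull().sum())

print(f"NaN before coercion: {n_null}, NaN after coercion: {n_hidden}")
```

Counting the NaN values after coercion, and before any `fillna`, shows how many rows a later fill value would actually affect.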


## TASK-1:

Perform initial data preparation by converting the 'TotalCharges' column to numeric values and filling missing values with 0.

In [45]:
# Convert 'TotalCharges' to numeric and fill missing values with 0
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)


In [46]:
# Verify conversion and missing value handling
print(df['TotalCharges'].head())
print(df['TotalCharges'].isnull().sum())

0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
Name: TotalCharges, dtype: float64
0


In [47]:
# Display numerical variables
numerical_vars = df.select_dtypes(include=['int64', 'float64']).columns
print("Numerical Variables:")
print(numerical_vars)
print()

Numerical Variables:
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges'], dtype='object')



## TASK-2:

Convert the 'Churn' column to binary values, where 'No' is mapped to 0 and 'Yes' is mapped to 1.

In [48]:
# Convert 'Churn' to binary values
df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})


In [49]:
#Verify Conversion
print(df['Churn'].head())

0    0
1    0
2    1
3    0
4    1
Name: Churn, dtype: int64
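
The int64 dtype above suggests every label was mapped, since `Series.map` with a dict returns NaN for any value that is not a key (which would force a float dtype). A small sketch of that check on hypothetical labels, where `"yes "` is an invented malformed value:

```python
import pandas as pd

# Hypothetical labels; "yes " (lowercase, trailing space) is not a dict key.
labels = pd.Series(["No", "Yes", "yes ", "No"])
mapped = labels.map({"No": 0, "Yes": 1})

# Collect the original values that failed to map.
unmapped = labels[mapped.isnull()].unique().tolist()

print(mapped.dtype)   # float64, because a NaN is present
print(unmapped)
```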


In [50]:
# Display numerical variables
numerical_vars = df.select_dtypes(include=['int64', 'float64']).columns
print("Numerical Variables:")
print(numerical_vars)
print()

Numerical Variables:
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn'], dtype='object')



## TASK-3:

Split the data into an 80-20 train-test split with a random state of "1"

In [51]:
# Split data
# X = independent variables, y = dependent variable

X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
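
The split above follows the task as stated. As an optional variation that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts, which can matter for an imbalanced target. A sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target with a 20% positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Both parts keep the 20% positive rate.
print(y_tr.mean(), y_te.mean())
```

Using it here would change which rows land in the test set, so the recorded results in this notebook would no longer be reproduced.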

## TASK-4:

Select features based on the instructions

In [52]:

categorical = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
               'PhoneService', 'MultipleLines', 'InternetService',
               'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
               'TechSupport', 'StreamingTV', 'StreamingMovies','Contract',
               'PaperlessBilling', 'PaymentMethod']

numerical = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Feature Engineering

## TASK-1:

Scale the numerical features using StandardScaler, then convert the output back to a dataframe and restore the column names.

In [53]:
# Select numerical features
numerical = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Scale numerical features using StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data only, then transform both train and test data
X_train_num_scaled = scaler.fit_transform(X_train[numerical])
X_test_num_scaled = scaler.transform(X_test[numerical])

# Convert the scaled arrays back to dataframes and restore the column names
X_train_num_scaled_df = pd.DataFrame(X_train_num_scaled, columns=numerical)
X_test_num_scaled_df = pd.DataFrame(X_test_num_scaled, columns=numerical)

# Display the first few rows of the scaled dataframes
print("Scaled Training Data:")
print(X_train_num_scaled_df.head())
print("\nScaled Test Data:")
print(X_test_num_scaled_df.head())


Scaled Training Data:
     tenure  MonthlyCharges  TotalCharges
0 -0.825884       -1.497530     -0.890947
1  0.395961        0.302996      0.389693
2  1.577078        0.012320      1.060945
3  1.577078        0.686687      1.775397
4 -0.092777        0.186726     -0.102671

Scaled Test Data:
     tenure  MonthlyCharges  TotalCharges
0  0.355233        0.500655      0.460383
1  1.373437        1.249767      1.850854
2 -0.825884       -0.657063     -0.773570
3 -1.110981       -0.471031     -0.894653
4 -0.907340        0.037235     -0.713691
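
The test values above are scaled with the training mean and standard deviation, not their own. A minimal sketch on hypothetical toy arrays, checking that `StandardScaler.transform` matches the manual formula `(x - train_mean) / train_std`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for one numerical column.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

toy_scaler = StandardScaler().fit(train)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std().
manual = (test - train.mean(axis=0)) / train.std(axis=0)
matches = np.allclose(toy_scaler.transform(test), manual)

print(matches)
```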


## TASK-2:

One-hot encode the categorical features using OneHotEncoder (with sparse_output set to False), then convert the output back to a dataframe and restore the column names.

In [54]:
# One-hot encode categorical features using OneHotEncoder with sparse_output=False.
# drop='first' removes the first level of each feature to avoid redundant columns.
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit the encoder on the training data and transform both train and test data
X_train_cat_encoded = encoder.fit_transform(X_train[categorical])
X_test_cat_encoded = encoder.transform(X_test[categorical])

# Convert the encoded arrays back to dataframes and restore the column names
X_train_cat_encoded_df = pd.DataFrame(X_train_cat_encoded, columns=encoder.get_feature_names_out(categorical))
X_test_cat_encoded_df = pd.DataFrame(X_test_cat_encoded, columns=encoder.get_feature_names_out(categorical))


# Display the first few rows of the processed dataframes
print("Processed Training Data:")
print(X_train_cat_encoded_df.head())
print("\nProcessed Test Data:")
print(X_test_cat_encoded_df.head())

Processed Training Data:
   gender_Male  SeniorCitizen_1  Partner_Yes  Dependents_Yes  \
0          1.0              0.0          1.0             1.0   
1          0.0              0.0          0.0             0.0   
2          1.0              0.0          1.0             0.0   
3          1.0              0.0          1.0             1.0   
4          1.0              0.0          0.0             0.0   

   PhoneService_Yes  MultipleLines_No phone service  MultipleLines_Yes  \
0               1.0                             0.0                0.0   
1               1.0                             0.0                0.0   
2               1.0                             0.0                1.0   
3               1.0                             0.0                1.0   
4               1.0                             0.0                0.0   

   InternetService_Fiber optic  InternetService_No  \
0                          0.0                 1.0   
1                          0.0       

## TASK-3:

Combine scaled numerical and one-hot encoded categorical features into train and test set dataframes (use pd.concat)

In [55]:
# Combine scaled numerical and one-hot encoded categorical features
X_train_processed = pd.concat([X_train_num_scaled_df, X_train_cat_encoded_df], axis=1)
X_test_processed = pd.concat([X_test_num_scaled_df, X_test_cat_encoded_df], axis=1)

# Display the first few rows of the processed dataframes
print("Processed Training Data:")
print(X_train_processed.head())
print("\nProcessed Test Data:")
print(X_test_processed.head())

Processed Training Data:
     tenure  MonthlyCharges  TotalCharges  gender_Male  SeniorCitizen_1  \
0 -0.825884       -1.497530     -0.890947          1.0              0.0   
1  0.395961        0.302996      0.389693          0.0              0.0   
2  1.577078        0.012320      1.060945          1.0              0.0   
3  1.577078        0.686687      1.775397          1.0              0.0   
4 -0.092777        0.186726     -0.102671          1.0              0.0   

   Partner_Yes  Dependents_Yes  PhoneService_Yes  \
0          1.0             1.0               1.0   
1          0.0             0.0               1.0   
2          1.0             0.0               1.0   
3          1.0             1.0               1.0   
4          0.0             0.0               1.0   

   MultipleLines_No phone service  MultipleLines_Yes  ...  \
0                             0.0                0.0  ...   
1                             0.0                0.0  ...   
2                           
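
`pd.concat(..., axis=1)` aligns rows by index label, not by position. The concat above works because both inputs were rebuilt from arrays and share a default 0..n-1 index. A sketch with hypothetical toy frames of what happens when the indexes differ:

```python
import pandas as pd

# Hypothetical toy frames: one with a default index, one with original row labels.
num = pd.DataFrame({"tenure": [0.1, 0.2]})
cat = pd.DataFrame({"gender_Male": [1.0, 0.0]}, index=[5, 9])

# Labels do not overlap, so the result has 4 rows padded with NaN.
misaligned = pd.concat([num, cat], axis=1)

# Resetting the index restores positional alignment.
aligned = pd.concat([num, cat.reset_index(drop=True)], axis=1)

print(misaligned.shape, int(misaligned.isnull().sum().sum()))
print(aligned.shape, int(aligned.isnull().sum().sum()))
```

Checking the shape and NaN count of the combined frame is a cheap way to catch this before training.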

## TASK-4:

Use scikit-learn to train a random forest classifier and an extra trees classifier, and use xgboost and lightgbm to train an extreme gradient boosting model and a light gradient boosting model. Use random_state = 1 for all models and evaluate them on the test set.

In [56]:
# Initialize models
models = {
    'RandomForest': RandomForestClassifier(random_state=1),
    'ExtraTrees': ExtraTreesClassifier(random_state=1),
    # use_label_encoder is omitted: it is deprecated in recent xgboost releases and not needed for 0/1 labels
    'XGBoost': XGBClassifier(random_state=1, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(random_state=1)
}

# Train and evaluate models
for name, model in models.items():
    model.fit(X_train_processed, y_train)
    y_pred = model.predict(X_test_processed)
    print(f"{name} Classification Report:")
    print(classification_report(y_test, y_pred))
    print("-" * 80)

RandomForest Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.88      0.87      1061
           1       0.61      0.55      0.58       348

    accuracy                           0.80      1409
   macro avg       0.73      0.72      0.72      1409
weighted avg       0.80      0.80      0.80      1409

--------------------------------------------------------------------------------
ExtraTrees Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.87      0.86      1061
           1       0.57      0.51      0.54       348

    accuracy                           0.78      1409
   macro avg       0.71      0.69      0.70      1409
weighted avg       0.78      0.78      0.78      1409

--------------------------------------------------------------------------------
XGBoost Classification Report:
              precision    recall  f1-score   support

           0       0.86     
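
Full classification reports are hard to compare side by side. One possible way to condense them is to collect a few metrics per model into a single table. The sketch below uses synthetic data from `make_classification` and only the two scikit-learn models, so its numbers are unrelated to the results in this notebook; the names `toy_models` and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Telco dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

toy_models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=1),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=1),
}

rows = []
for name, model in toy_models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "f1_positive": f1_score(y_te, y_hat),  # F1 for class 1
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```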

# QUIZ

## Q-14

ANS

## Q-15

ANS

## Q-16

ANS

## Q-17

ANS

In [57]:


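# Define hyperparameter grid
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
# Note: 'auto' is deprecated for max_features since scikit-learn 1.1 and removed in 1.3
# (use 'sqrt' there); it is kept here so the recorded output below stays reproducible.
max_features = ['auto', 'sqrt', 'log2', None]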
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'min_samples_leaf': min_samples_leaf,
    'min_samples_split': min_samples_split,
    'max_features': max_features
}

# Initialize Extra Trees Classifier
et_classifier = ExtraTreesClassifier(random_state=1)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=et_classifier,
                                   param_distributions=hyperparameter_grid,
                                   n_iter=10,
                                   scoring='accuracy',
                                   cv=5,
                                   n_jobs=-1,
                                   verbose=1,
                                   random_state=1)

# Fit RandomizedSearchCV on the training data
random_search.fit(X_train_processed, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_
print("Best Hyperparameters:")
print(best_params)


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best Hyperparameters:
{'n_estimators': 1000, 'min_samples_split': 2, 'min_samples_leaf': 8, 'max_features': None}
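
`best_params_` reports only the winning candidate. The fitted search object also stores every sampled candidate in `cv_results_`, which helps judge how close the runners-up were. A sketch on synthetic data with a deliberately tiny, hypothetical grid (not the grid used above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a small illustrative grid.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=1)
toy_grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 4]}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=1),
    param_distributions=toy_grid,
    n_iter=3,
    scoring="accuracy",
    cv=3,
    random_state=1,
)
search.fit(X_syn, y_syn)

# One row per sampled candidate, best rank first.
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols].sort_values("rank_test_score")
print(results)
```

If the top candidates differ by less than their `std_test_score`, the choice among them is likely within cross-validation noise.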


## Q-18

ANS

In [58]:


# Initialize Extra Trees Classifier with the best hyperparameters
optimal_et_classifier = ExtraTreesClassifier(random_state=1, **best_params)

# Train the new model
optimal_et_classifier.fit(X_train_processed, y_train)

# Evaluate the accuracy of the new model on the test set
accuracy_optimal = optimal_et_classifier.score(X_test_processed, y_test)

# Print the accuracy of the new model
print("Accuracy of the new optimal ExtraTreesClassifier model:", accuracy_optimal)


Accuracy of the new optimal ExtraTreesClassifier model: 0.8048261178140526


## Q-19

ANS

## Q-20

ANS

In [59]:
# Get feature importances from the optimal ExtraTreesClassifier model
feature_importances = optimal_et_classifier.feature_importances_

# Create a DataFrame to store feature importances along with feature names
feature_importance_df = pd.DataFrame({'Feature': X_train_processed.columns, 'Importance': feature_importances})

# Sort the DataFrame by feature importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top two most important features
print("Top two most important features:")
print(feature_importance_df.head(2))


Top two most important features:
                        Feature  Importance
0                        tenure    0.246930
10  InternetService_Fiber optic    0.219039
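
`feature_importances_` is impurity-based and computed from the training data, so it can be worth cross-checking against permutation importance on held-out data. This is a swapped-in technique that the notebook itself does not use. A sketch on synthetic data with illustrative names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Telco dataset.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

toy_model = ExtraTreesClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Impurity-based importances are normalised to sum to 1.
impurity = toy_model.feature_importances_

# Permutation importance: mean drop in test accuracy when one column is shuffled.
perm = permutation_importance(
    toy_model, X_te, y_te, n_repeats=5, random_state=1
)

for i, (imp, pm) in enumerate(zip(impurity, perm.importances_mean)):
    print(f"feature {i}: impurity={imp:.3f}, permutation={pm:.3f}")
```

Features that rank highly under both measures are more likely to be genuinely informative.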
