The telecommunications sector in India is rapidly evolving, with many businesses being created and customers frequently switching between providers. **Churn** refers to the process where customers stop using a company's services or products. It is a key challenge for telecom companies to predict and minimize customer churn, as retaining customers is critical to business growth.

As a data scientist for a telecom company, your task is to **predict customer churn** using demographic and usage data from four major telecom partners: *Airtel, Reliance Jio, Vodafone, and BSNL*. You will explore the key factors contributing to customer churn and build predictive models to help the company reduce churn rates.

The dataset contains two csv-files:

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                      |
|----------------------|--------------------------------------------------|
| `customer_id `         | Unique identifier for each customer.             |
| `telecom_partner `     | The telecom partner associated with the customer.|
| `gender `              | The gender of the customer.                      |
| `age `                 | The age of the customer.                         |
| `state`                | The Indian state in which the customer is located.|
| `city`                 | The city in which the customer is located.       |
| `pincode`              | The pincode of the customer's location.          |
| `registration_event` | When the customer registered with the telecom partner.|
| `num_dependents`      | The number of dependents (e.g., children) the customer has.|
| `estimated_salary`     | The customer's estimated salary.                 |

- `telecom_usage` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [1]:
# Useful libraries
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn import tree
import seaborn as sns
import numpy as np

Objective:

The goal of this project is to predict customer churn in the Indian tele-
com sector using demographic and usage data. You will explore data
science techniques to build predictive models and determine which fac-
tors are most influential in customer churn. Additionally, you will com-
pare the performance of two encoding methods – OneHotEncoder and
OrdinalEncoder – and evaluate their effect on model performance.

In [42]:
demo = pd.read_csv('telecom_demographics.csv')
usage = pd.read_csv('telecom_usage.csv')

Basic Analysis

In [None]:
demo.isnull().sum()

In [None]:
usage.isnull().sum()

Merged Files

In [43]:
combined = pd.merge(demo, usage, on='customer_id', how='inner')

In [None]:
combined.isnull().sum()

In [44]:
combined = combined.drop(columns=['customer_id', 'pincode', 'registration_event', 'state'])     # The state column is misleading, for example as it says Chhattisgarh is the state and Kolkata
                                                                                                # is the city which is completely wrong.

In [40]:
demo.sample()

Unnamed: 0,customer_id,telecom_partner,gender,age,state,city,pincode,registration_event,num_dependents,estimated_salary
746,146709,BSNL,M,44,Chhattisgarh,Kolkata,677367,2022-01-04,2,112357


In [45]:
combined.sample()

Unnamed: 0,telecom_partner,gender,age,city,num_dependents,estimated_salary,calls_made,sms_sent,data_used,churn
55,Reliance Jio,F,38,Bangalore,2,149301,83,19,8797,1


In [46]:
combined['telecom_partner'].unique()

array(['Airtel', 'Reliance Jio', 'Vodafone', 'BSNL'], dtype=object)

In [47]:
combined['gender'].unique()

array(['F', 'M'], dtype=object)

In [48]:
combined['city'].unique()

array(['Delhi', 'Hyderabad', 'Chennai', 'Bangalore', 'Kolkata', 'Mumbai'],
      dtype=object)

One Hot Encoder

In [105]:
telecom_partner_columns = ['telecom_partner_Airtel', 'telecom_partner_Reliance Jio', 'telecom_partner_Vodafone', 'telecom_partner_BSNL']
gender_columns = ['gender_F', 'gender_M']
city_columns = ['city_Delhi', 'city_Hyderabad', 'city_Chennai', 'city_Bangalore', 'city_Kolkata', 'city_Mumbai']

# Combine all columns
one_hot_encoded_columns = telecom_partner_columns + gender_columns + city_columns

In [50]:
ohe = OneHotEncoder()

In [56]:
combined.shape

(6500, 10)

In [59]:
X_train, X_test, y_train, y_test = train_test_split(combined.iloc[:,0:9], combined.iloc[:,-1], test_size=0.2, random_state=42)

In [68]:
X_train_ohe = ohe.fit_transform(X_train[['telecom_partner', 'city', 'gender']]).toarray()

In [70]:
X_test_ohe = ohe.transform(X_test[['telecom_partner', 'city', 'gender']]).toarray()

In [108]:
X_train_ohe = pd.DataFrame(X_train_ohe, columns=one_hot_encoded_columns)

In [106]:
X_test_ohe = pd.DataFrame(X_test_ohe, columns=one_hot_encoded_columns)

In [72]:
scaler = StandardScaler()

In [92]:
X_train_scaled = X_train.drop(columns=['telecom_partner', 'city', 'gender', 'num_dependents'])

In [100]:
X_test_scaled = X_test.drop(columns=['telecom_partner', 'city', 'gender', 'num_dependents'])

In [94]:
X_train_scaled_columns = X_train_scaled.columns
X_train_scaled_columns

Index(['age', 'estimated_salary', 'calls_made', 'sms_sent', 'data_used'], dtype='object')

In [95]:
scaler.fit(X_train_scaled)

In [96]:
scaler.mean_

array([4.59938462e+01, 8.55309787e+04, 4.99913462e+01, 2.41878846e+01,
       5.02386096e+03])

In [102]:
X_train_scaled = scaler.transform(X_train_scaled)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train_scaled_columns)
X_train_scaled

Unnamed: 0,age,estimated_salary,calls_made,sms_sent,data_used
0,-2.720818,-2.277935,-1.634994,-1.695691,-1.707315
1,-2.845998,-2.278019,-1.619313,-1.611191,-1.708091
2,-2.820226,-2.278013,-1.687637,-1.601803,-1.707032
3,-2.728181,-2.277998,-1.629393,-1.639358,-1.707212
4,-2.842316,-2.277951,-1.733560,-1.719163,-1.707923
...,...,...,...,...,...
5195,-2.842316,-2.277938,-1.694358,-1.784885,-1.708219
5196,-2.742909,-2.277962,-1.637234,-1.761413,-1.707507
5197,-2.724500,-2.277956,-1.647315,-1.714469,-1.707148
5198,-2.724500,-2.277977,-1.686517,-1.752024,-1.707144


In [101]:
X_test_scaled = scaler.transform(X_test_scaled)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_train_scaled_columns)
X_test_scaled

Unnamed: 0,age,estimated_salary,calls_made,sms_sent,data_used
0,0.303761,-0.246837,1.506328,-1.383190,-0.193017
1,1.153247,-1.288013,0.167627,1.974087,-0.175002
2,1.274602,0.528725,-0.970268,1.631508,0.979302
3,-0.788435,0.121954,-0.300918,-1.040611,1.290992
4,1.638667,0.300664,0.535770,-1.588738,1.318524
...,...,...,...,...,...
1295,0.667827,1.673558,0.669640,-0.766548,-0.560110
1296,-1.091823,-0.904759,-1.338411,-1.314674,0.510579
1297,-0.667080,-1.697793,-0.300918,0.603770,-0.056376
1298,-0.788435,-1.298613,-0.970268,1.631508,-1.281380


In [110]:
X_train_ohe.head()

Unnamed: 0,telecom_partner_Airtel,telecom_partner_Reliance Jio,telecom_partner_Vodafone,telecom_partner_BSNL,gender_F,gender_M,city_Delhi,city_Hyderabad,city_Chennai,city_Bangalore,city_Kolkata,city_Mumbai
0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [111]:
X_train_scaled.head()

Unnamed: 0,age,estimated_salary,calls_made,sms_sent,data_used
0,-2.720818,-2.277935,-1.634994,-1.695691,-1.707315
1,-2.845998,-2.278019,-1.619313,-1.611191,-1.708091
2,-2.820226,-2.278013,-1.687637,-1.601803,-1.707032
3,-2.728181,-2.277998,-1.629393,-1.639358,-1.707212
4,-2.842316,-2.277951,-1.73356,-1.719163,-1.707923


In [112]:
X_train_final = pd.concat([X_train_ohe, X_train_scaled], axis=1)


In [113]:
X_train_final

Unnamed: 0,telecom_partner_Airtel,telecom_partner_Reliance Jio,telecom_partner_Vodafone,telecom_partner_BSNL,gender_F,gender_M,city_Delhi,city_Hyderabad,city_Chennai,city_Bangalore,city_Kolkata,city_Mumbai,age,estimated_salary,calls_made,sms_sent,data_used
0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,-2.720818,-2.277935,-1.634994,-1.695691,-1.707315
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,-2.845998,-2.278019,-1.619313,-1.611191,-1.708091
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,-2.820226,-2.278013,-1.687637,-1.601803,-1.707032
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,-2.728181,-2.277998,-1.629393,-1.639358,-1.707212
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-2.842316,-2.277951,-1.733560,-1.719163,-1.707923
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5195,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,-2.842316,-2.277938,-1.694358,-1.784885,-1.708219
5196,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-2.742909,-2.277962,-1.637234,-1.761413,-1.707507
5197,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,-2.724500,-2.277956,-1.647315,-1.714469,-1.707148
5198,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-2.724500,-2.277977,-1.686517,-1.752024,-1.707144


In [114]:
X_test_final = pd.concat([X_test_ohe, X_test_scaled], axis=1)


In [117]:
X_test_final

Unnamed: 0,telecom_partner_Airtel,telecom_partner_Reliance Jio,telecom_partner_Vodafone,telecom_partner_BSNL,gender_F,gender_M,city_Delhi,city_Hyderabad,city_Chennai,city_Bangalore,city_Kolkata,city_Mumbai,age,estimated_salary,calls_made,sms_sent,data_used
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.303761,-0.246837,1.506328,-1.383190,-0.193017
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.153247,-1.288013,0.167627,1.974087,-0.175002
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.274602,0.528725,-0.970268,1.631508,0.979302
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,-0.788435,0.121954,-0.300918,-1.040611,1.290992
4,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.638667,0.300664,0.535770,-1.588738,1.318524
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1295,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.667827,1.673558,0.669640,-0.766548,-0.560110
1296,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,-1.091823,-0.904759,-1.338411,-1.314674,0.510579
1297,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.667080,-1.697793,-0.300918,0.603770,-0.056376
1298,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-0.788435,-1.298613,-0.970268,1.631508,-1.281380
