India's telecommunications (telecom) sector is changing rapidly: new telecom businesses keep entering the market, and many customers switch between providers.

"Churn" refers to customers or subscribers ceasing to use a company's services or products.

Understanding the factors that drive customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction.

As the data scientist on this project, you aim to explore how customer behavior and demographics in the Indian telecom sector relate to churn, and to predict it, using two datasets that cover four major telecom partners: Airtel, Reliance Jio, Vodafone, and BSNL.





* Load the two CSV files into separate DataFrames. Merge them into a DataFrame named churn_df. Calculate and print churn rate, and identify the categorical variables in churn_df.

In [89]:
# Import libraries

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set_palette("magma") 
sns.set_style("whitegrid")
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [90]:
# Load the datasets
# Demographics dataset
tel_demog = pd.read_csv("telecom_demographics.csv")

# Usage dataset
tel_usea = pd.read_csv("telecom_usage.csv")

In [4]:
# First five rows of the demographics dataset
tel_demog.head()

   customer_id  telecom_partner  gender  age             state       city  pincode  registration_event  num_dependents  estimated_salary
0        15169           Airtel       F   26  Himachal Pradesh      Delhi   667173          2020-03-16               4             85979
1       149207           Airtel       F   74       Uttarakhand  Hyderabad   313997          2022-01-16               0             69445
2       148119           Airtel       F   54         Jharkhand    Chennai   549925          2022-01-11               2             75949
3       187288     Reliance Jio       M   29             Bihar  Hyderabad   230636          2022-07-26               3             34272
4        14016         Vodafone       M   45          Nagaland  Bangalore   188036          2020-03-11               4             34157


In [91]:
# First five rows of the usage dataset
tel_usea.head()

   customer_id  calls_made  sms_sent  data_used  churn
0        15169          75        21       4532      1
1       149207          35        38        723      1
2       148119          70        47       4688      1
3       187288          95        32      10241      1
4        14016          66        23       5246      1


In [92]:
# Join the two datasets on the shared customer_id key (inner join by default)
churn_df = tel_demog.merge(tel_usea, on="customer_id")

# First five rows of churn_df
churn_df.head()

   customer_id  telecom_partner  gender  age             state       city  pincode  registration_event  num_dependents  estimated_salary  calls_made  sms_sent  data_used  churn
0        15169           Airtel       F   26  Himachal Pradesh      Delhi   667173          2020-03-16               4             85979          75        21       4532      1
1       149207           Airtel       F   74       Uttarakhand  Hyderabad   313997          2022-01-16               0             69445          35        38        723      1
2       148119           Airtel       F   54         Jharkhand    Chennai   549925          2022-01-11               2             75949          70        47       4688      1
3       187288     Reliance Jio       M   29             Bihar  Hyderabad   230636          2022-07-26               3             34272          95        32      10241      1
4        14016         Vodafone       M   45          Nagaland  Bangalore   188036          2020-03-11               4             34157          66        23       5246      1
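
A merge on `customer_id` silently drops customers that appear in only one file and silently duplicates rows if the key repeats. The following is a minimal sketch of how that could be checked, using small made-up frames (the names `demog`, `usage`, and their values are assumptions for illustration, not the real CSV contents) together with the `validate` and `indicator` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Tiny stand-in frames (hypothetical values, not the real CSV files)
demog = pd.DataFrame({"customer_id": [1, 2, 3], "gender": ["F", "M", "F"]})
usage = pd.DataFrame({"customer_id": [1, 2, 4], "churn": [1, 0, 0]})

# validate="one_to_one" raises MergeError if customer_id repeats on either side;
# indicator=True adds a _merge column naming the side each row came from
merged = demog.merge(
    usage, on="customer_id", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows present in only one of the two inputs
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["customer_id", "_merge"]])
```

If `unmatched` is empty on the real data, the default inner join used in the notebook loses no customers.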


In [93]:
# Count fully duplicated rows
churn_df.duplicated().sum()

0

In [94]:
# Count missing values per column
churn_df.isna().sum().sort_values()

customer_id           0
telecom_partner       0
gender                0
age                   0
state                 0
city                  0
pincode               0
registration_event    0
num_dependents        0
estimated_salary      0
calls_made            0
sms_sent              0
data_used             0
churn                 0
dtype: int64

In [17]:
# Column dtypes and non-null counts
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6500 entries, 0 to 6499
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 
 7   registration_event  6500 non-null   object
 8   num_dependents      6500 non-null   int64 
 9   estimated_salary    6500 non-null   int64 
 10  calls_made          6500 non-null   int64 
 11  sms_sent            6500 non-null   int64 
 12  data_used           6500 non-null   int64 
 13  churn               6500 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 761.7+ KB


In [95]:
# Calculate the churn rate: share of each class (1 = churned, 0 = retained)
churn_rate = churn_df["churn"].value_counts() / len(churn_df)
churn_rate

0    0.799538
1    0.200462
Name: churn, dtype: float64
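
Because `churn` is coded as 0/1, the churn rate can also be read off as a single number: the mean of the column. A small sketch on made-up labels (the `churn` Series below is a toy assumption, not the project data):

```python
import pandas as pd

# Toy labels: 2 churned customers out of 10
churn = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0, 0], name="churn")

# The mean of a 0/1 column is the fraction of 1s, i.e. the churn rate
churn_rate = churn.mean()

# Equivalent per-class shares, as in the cell above
shares = churn.value_counts(normalize=True)

print(f"Churn rate: {churn_rate:.1%}")
```

On the real data this would correspond to the `1` row of the output above, about 20%.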

In [96]:
# One-hot encode every object (categorical) column, dropping the first level of each
churn_df = pd.get_dummies(churn_df, drop_first=True)
churn_df.shape

(6500, 1260)
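
The task also asks for the categorical variables. According to the `info()` output, five columns have the `object` dtype: `telecom_partner`, `gender`, `state`, `city`, and `registration_event`. Growing from 14 to 1260 columns after encoding suggests that at least one of them has very many distinct values; `registration_event`, a date stored as text, is a likely candidate, but that is an assumption to verify on the real data. A minimal sketch of listing the non-numeric columns and their cardinality before encoding, on a toy frame (`df` and its values are illustrative only):

```python
import pandas as pd

# Toy frame mimicking a few churn_df columns (illustrative values)
df = pd.DataFrame({
    "telecom_partner": ["Airtel", "Airtel", "Reliance Jio", "Vodafone"],
    "gender": ["F", "F", "M", "M"],
    "registration_event": ["2020-03-16", "2022-01-16", "2022-07-26", "2020-03-11"],
    "age": [26, 74, 29, 45],
})

# Non-numeric columns are the candidates for one-hot encoding
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per categorical column, largest first;
# each column contributes (nunique - 1) dummies when drop_first=True
cardinality = df[categorical_cols].nunique().sort_values(ascending=False)

print(categorical_cols)
print(cardinality)
```

A column with hundreds of distinct values is usually better dropped or converted (for example, a date turned into a tenure in days) than one-hot encoded.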

* Encode the categorical features in churn_df, separate the features from the target, and scale the features into features_scaled. Define the scaled features and the target variable for the churn prediction model.

In [97]:
# Feature matrix: every column except the identifier and the target
features_vars = churn_df.drop(columns=["customer_id", "churn"]).values

# Target vector
target_var = churn_df["churn"].values

In [98]:
# Standardize the features to zero mean and unit variance
scaler = StandardScaler()

features_scaled = scaler.fit_transform(features_vars)
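
The cell above fits the scaler on all rows before the split, as the task statement asks. A common alternative, shown here only as a sketch on random toy data (`X`, `y`, and the `_tr`/`_te` names are assumptions), is to split first and fit the scaler on the training rows alone, so that test-set statistics do not leak into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 100 rows, 3 numeric features, binary target
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ... then learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# The test rows are transformed with the training statistics
X_te_scaled = scaler.transform(X_te)
```

With this ordering the training columns have exactly zero mean, while the test columns are only approximately centered.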


* Split the processed data into training and testing sets named X_train, X_test, y_train, and y_test, using an 80-20 split and a random state of 42 for reproducibility.

In [99]:
# Split the dataset: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, target_var, test_size=0.2, random_state=42
)

* Train Logistic Regression and Random Forest Classifier models, setting a random seed of 42.

In [100]:
# Logistic Regression
loges = LogisticRegression(random_state=42)

# Fit on the training data
loges.fit(X_train, y_train)

# Predict on the test data
loges_pred = loges.predict(X_test)

# Confusion matrix (rows: true class, columns: predicted class)
confusion_marx_loges = confusion_matrix(y_test, loges_pred)
print(confusion_marx_loges)

# Classification report
Classifi_repo_loges = classification_report(y_test, loges_pred)
print(Classifi_repo_loges)


[[920 107]
 [245  28]]
              precision    recall  f1-score   support

           0       0.79      0.90      0.84      1027
           1       0.21      0.10      0.14       273

    accuracy                           0.73      1300
   macro avg       0.50      0.50      0.49      1300
weighted avg       0.67      0.73      0.69      1300
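
The numbers in the report follow directly from the confusion matrix printed above. As a worked check, this sketch recomputes precision and recall for the churn class and compares the accuracy with what always predicting "no churn" would score on the same 1300 test rows (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted
cm = np.array([[920, 107],
               [245,  28]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)    # 28 / 135: predicted churners who really churned
recall_1 = tp / (tp + fn)       # 28 / 273: real churners the model found
accuracy = (tp + tn) / cm.sum() # 948 / 1300

# Baseline that predicts class 0 for everyone: 1027 / 1300
majority_baseline = (tn + fp) / cm.sum()

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"accuracy={accuracy:.2f} baseline={majority_baseline:.2f}")
```

The accuracy of 0.73 is below the 0.79 majority-class baseline, and a macro-average recall of 0.50 means the model separates churners from non-churners no better than chance on this split.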



In [None]:
# Random Forest
random_class = RandomForestClassifier(random_state=42)

# Fit on the training data
random_class.fit(X_train, y_train)

# Predict on the test data
random_class_pred = random_class.predict(X_test)

# Confusion matrix (rows: true class, columns: predicted class)
confusion_mat_randomfor = confusion_matrix(y_test, random_class_pred)
print(confusion_mat_randomfor)

# Classification report
classifi_repo_randf = classification_report(y_test, random_class_pred)
print(classifi_repo_randf)
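
Both model cells repeat the same fit, predict, and report steps. One possible way to keep the two evaluations identical is a small helper; the sketch below uses a hypothetical function `fit_and_report` and synthetic data from `make_classification` with roughly 20% positives, so its numbers say nothing about the telecom data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def fit_and_report(model, X_train, X_test, y_train, y_test):
    """Fit a classifier and return its confusion matrix and text report."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    cm = confusion_matrix(y_test, pred)
    report = classification_report(y_test, pred, zero_division=0)
    return cm, report


# Synthetic stand-in data: 500 rows, about 20% in the positive class
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    cm, report = fit_and_report(model, X_tr, X_te, y_tr, y_te)
    results[name] = cm
    print(name)
    print(cm)
    print(report)
```

In the notebook, the same helper could be called with `X_train`, `X_test`, `y_train`, and `y_test` from the split above.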