## **Rupa Navale**
## **CodeClause Internship**
## **Data Science Intern**

## **Task: Churn Prediction in Telecom Industry using Logistic Regression**
### **Objective:**
Develop and implement a churn prediction model using Logistic Regression for the telecom industry, leveraging customer data to identify and anticipate customer churn patterns, contributing to improved customer retention strategies.

## **Index:**
#### **Step 1**: Import the necessary libraries
#### **Step 2**: Importing and Merging Datasets
#### **Step 3**: Data Cleaning & Transformation
#### **Step 4**: Data Visualization
#### **Step 5**: Model Building and Analysis



## **Step 1: Import the necessary libraries:**

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
import plotly.offline as po
import plotly.graph_objs as go
%matplotlib inline
from google.colab import drive

## **Step 2: Importing and Merging Datasets:**

In [6]:
drive.mount('/content/drive')
customer_data = pd.read_csv("/content/drive/MyDrive/Data/customer_data.csv")
churn_data = pd.read_csv("/content/drive/MyDrive/Data/churn_data.csv")
internet_data = pd.read_csv("/content/drive/MyDrive/Data/internet_data.csv")

Mounted at /content/drive


#### **Merging all datasets on the shared key column ("customerID"):**

In [7]:
# Merging on 'customerID'
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')

# Final dataframe with all predictor variables
churn_dataset = pd.merge(df_1, internet_data, how='inner', on='customerID')

# Let's see the head of our master dataset
churn_dataset.head()

| | customerID | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | gender | ... | Partner | Dependents | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | Female | ... | Yes | No | No phone service | DSL | No | Yes | No | No | No | No |
| 1 | 5575-GNVDE | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.5 | No | Male | ... | No | No | No | DSL | Yes | No | Yes | No | No | No |
| 2 | 3668-QPYBK | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | Male | ... | No | No | No | DSL | Yes | Yes | No | No | No | No |
| 3 | 7795-CFOCW | 45 | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | Male | ... | No | No | No phone service | DSL | Yes | No | Yes | Yes | No | No |
| 4 | 9237-HQITU | 2 | Yes | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | Female | ... | No | No | No | Fiber optic | No | No | No | No | No | No |
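
An inner merge silently drops any customer that is missing from one of the tables, and duplicated keys multiply rows. A minimal sketch, using small made-up frames rather than the project data, of how the `validate` and `indicator` options of `pd.merge` can surface both problems before the final merge:

```python
import pandas as pd

# Toy stand-ins for churn_data and customer_data (values are illustrative only)
left = pd.DataFrame({"customerID": ["A", "B", "C"], "tenure": [1, 34, 2]})
right = pd.DataFrame({"customerID": ["A", "B", "D"],
                      "gender": ["Female", "Male", "Male"]})

# validate raises pandas.errors.MergeError if customerID repeats on either side;
# indicator adds a _merge column saying where each key was found
checked = pd.merge(left, right, how="outer", on="customerID",
                   validate="one_to_one", indicator=True)

# Keys that an inner merge would have dropped
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["customerID", "_merge"]])

# The inner merge keeps only customers present in both tables
inner = pd.merge(left, right, how="inner", on="customerID",
                 validate="one_to_one")
print(inner.shape)
```

Running the same check on the real tables would show whether any `customerID` is lost in the inner merges above.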


In [8]:
# Discovering number of rows & columns:
churn_dataset.shape

(7042, 21)

In [9]:
# Basic summary about the data:
churn_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7042 entries, 0 to 7041
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7042 non-null   object 
 1   tenure            7042 non-null   int64  
 2   PhoneService      7042 non-null   object 
 3   Contract          7042 non-null   object 
 4   PaperlessBilling  7042 non-null   object 
 5   PaymentMethod     7042 non-null   object 
 6   MonthlyCharges    7042 non-null   float64
 7   TotalCharges      7042 non-null   object 
 8   Churn             7042 non-null   object 
 9   gender            7042 non-null   object 
 10  SeniorCitizen     7042 non-null   int64  
 11  Partner           7042 non-null   object 
 12  Dependents        7042 non-null   object 
 13  MultipleLines     7042 non-null   object 
 14  InternetService   7042 non-null   object 
 15  OnlineSecurity    7042 non-null   object 
 16  OnlineBackup      7042 non-null   object 


In [10]:
# Statistical exploration of dataset:
churn_dataset.describe()

| | tenure | MonthlyCharges | SeniorCitizen |
|---|---|---|---|
| count | 7042.0 | 7042.0 | 7042.0 |
| mean | 32.366373 | 64.755886 | 0.16217 |
| std | 24.557955 | 30.088238 | 0.368633 |
| min | 0.0 | 18.25 | 0.0 |
| 25% | 9.0 | 35.5 | 0.0 |
| 50% | 29.0 | 70.35 | 0.0 |
| 75% | 55.0 | 89.85 | 0.0 |
| max | 72.0 | 118.75 | 1.0 |




## **Step 3: Data Cleaning & Transformation:**

In [12]:
# Convert the string values ('Yes'/'No') of the Churn column to integers 1/0
# (map returns an int64 column, so the target is numeric for the steps below)
churn_dataset['Churn'] = churn_dataset['Churn'].map({'No': 0, 'Yes': 1})

In [13]:
# Convert 'No internet service' to 'No' for the below mentioned columns
cols = ['OnlineBackup', 'StreamingMovies','DeviceProtection',
                'TechSupport','OnlineSecurity','StreamingTV']
for i in cols :
    churn_dataset[i]  = churn_dataset[i].replace({'No internet service' : 'No'})

In [14]:
# Replace blank strings in 'TotalCharges' with null values
churn_dataset['TotalCharges'] = churn_dataset["TotalCharges"].replace(" ",np.nan)

# Drop rows where 'TotalCharges' is null and rebuild a clean 0..n-1 index
churn_dataset = churn_dataset[churn_dataset["TotalCharges"].notnull()]
churn_dataset = churn_dataset.reset_index(drop=True)

# Convert 'TotalCharges' column values to float data type
churn_dataset["TotalCharges"] = churn_dataset["TotalCharges"].astype(float)
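
The replace/drop/cast sequence above can also be written with `pd.to_numeric`. A small sketch on a made-up series (not the project data), assuming blank strings are the only non-numeric entries; `errors="coerce"` turns anything unparseable into `NaN`, which makes it easy to count how many rows will be dropped:

```python
import pandas as pd

# Toy column mimicking TotalCharges stored as text, with one blank entry
charges = pd.Series(["29.85", " ", "1889.5", "108.15"], name="TotalCharges")

# Unparseable strings (here the single blank) become NaN instead of raising
numeric = pd.to_numeric(charges, errors="coerce")

# Count the rows that would be removed before actually dropping them
n_missing = int(numeric.isna().sum())
print("rows with missing TotalCharges:", n_missing)

# Drop the missing rows and rebuild a clean 0..n-1 index
cleaned = numeric.dropna().reset_index(drop=True)
print(cleaned)
```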

In [15]:
churn_dataset["Churn"].value_counts().values

array([5162, 1869])

## **Step 4: Data Visualization:**

In [16]:
# Visualize Total Customer Churn
plot_by_churn_labels = churn_dataset["Churn"].value_counts().keys().tolist()
plot_by_churn_values = churn_dataset["Churn"].value_counts().values.tolist()

plot_data= [
    go.Pie(labels = plot_by_churn_labels,
           values = plot_by_churn_values,
           marker = dict(colors =  [ 'Teal' ,'Grey'],
                         line = dict(color = "white",
                                     width =  1.5)),
           rotation = 90,
           hoverinfo = "label+value+text",
           hole = .6)
]
plot_layout = go.Layout(dict(title = "Customer Churn",
                   plot_bgcolor  = "rgb(243,243,243)",
                   paper_bgcolor = "rgb(243,243,243)",))


fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

In [17]:
# Visualize Churn Rate by Gender
plot_by_gender = churn_dataset.groupby('gender').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=plot_by_gender['gender'],
        y=plot_by_gender['Churn'],
        width = [0.3, 0.3],
        marker=dict(
        color=['orange', 'green'])
    )
]
plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis={"title": "Churn Rate"},
        title='Churn Rate by Gender',
        plot_bgcolor  = 'rgb(243,243,243)',
        paper_bgcolor  = 'rgb(243,243,243)',
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

In [18]:
# Visualize Churn Rate by Tech Support
plot_by_techsupport = churn_dataset.groupby('TechSupport').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=plot_by_techsupport['TechSupport'],
        y=plot_by_techsupport['Churn'],
        width = [0.3, 0.3, 0.3],
        marker=dict(
        color=['orange', 'green', 'teal'])
    )
]
plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis={"title": "Churn Rate"},
        title='Churn Rate by Tech Support',
        plot_bgcolor  = 'rgb(243,243,243)',
        paper_bgcolor  = 'rgb(243,243,243)',
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

In [19]:
# Visualize Churn Rate by Internet Services
plot_by_internet_service = churn_dataset.groupby('InternetService').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=plot_by_internet_service['InternetService'],
        y=plot_by_internet_service['Churn'],
        width = [0.3, 0.3, 0.3],
        marker=dict(
        color=['orange', 'green', 'teal'])
    )
]
plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis={"title": "Churn Rate"},
        title='Churn Rate by Internet Service',
        plot_bgcolor  = 'rgb(243,243,243)',
        paper_bgcolor  = 'rgb(243,243,243)',
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

In [20]:
# Visualize Churn Rate by Payment Method
plot_by_payment = churn_dataset.groupby('PaymentMethod').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=plot_by_payment['PaymentMethod'],
        y=plot_by_payment['Churn'],
        width = [0.3, 0.3,0.3,0.3],
        marker=dict(
        color=['orange', 'green','teal','magenta'])
    )
]
plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis={"title": "Churn Rate"},
        title='Churn Rate by Payment Method',
        plot_bgcolor  = 'rgb(243,243,243)',
        paper_bgcolor  = 'rgb(243,243,243)',
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

In [21]:
# Visualize Churn Rate by Contract Duration
plot_by_contract = churn_dataset.groupby('Contract').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=plot_by_contract['Contract'],
        y=plot_by_contract['Churn'],
        width = [0.3, 0.3,0.3],
        marker=dict(
        color=['orange', 'green','teal'])
    )
]
plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis={"title": "Churn Rate"},
        title='Churn Rate by Contract Duration',
        plot_bgcolor  = 'rgb(243,243,243)',
        paper_bgcolor  = 'rgb(243,243,243)',
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

In [22]:
# Visualize Relation between Tenure & Churn rate
plot_by_tenure = churn_dataset.groupby('tenure').Churn.mean().reset_index()
plot_data = [
    go.Scatter(
        x=plot_by_tenure['tenure'],
        y=plot_by_tenure['Churn'],
        mode='markers',
        name='Low',
        marker= dict(size= 5,
            line= dict(width=0.8),
            color= 'green'
           ),
    )
]
plot_layout = go.Layout(
        yaxis= {'title': "Churn Rate"},
        xaxis= {'title': "Tenure"},
        title='Relation between Tenure & Churn rate',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)
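
Every bar chart in this step starts from the same `groupby(...).Churn.mean()` pattern. A possible refactoring sketch is shown below; `churn_rate_by` is a hypothetical helper (not part of the notebook), and the toy frame is made up. It also returns the group size, because a churn rate computed from only a handful of customers is easy to over-read:

```python
import pandas as pd

def churn_rate_by(df, column, target="Churn"):
    """Return the mean of `target` and the group size for each category."""
    return (df.groupby(column)[target]
              .agg(churn_rate="mean", customers="size")
              .sort_values("churn_rate", ascending=False)
              .reset_index())

# Toy data: Churn is already encoded as 0/1, as in Step 3
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": [1, 0, 0, 0],
})

rates = churn_rate_by(toy, "Contract")
print(rates)
```

The resulting frame can be passed to `go.Bar` exactly as `plot_by_contract` is used above.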

## **Step 5: Model Building and Analysis:**

In [23]:
#Perform One Hot Encoding using get_dummies method
churn_dataset = pd.get_dummies(churn_dataset, columns = ['Contract','Dependents','DeviceProtection','gender',
                                                        'InternetService','MultipleLines','OnlineBackup',
                                                        'OnlineSecurity','PaperlessBilling','Partner',
                                                        'PaymentMethod','PhoneService','SeniorCitizen',
                                                        'StreamingMovies','StreamingTV','TechSupport'],
                              drop_first=True)
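
What `drop_first=True` does is easiest to see on a single column. In this toy sketch (made-up frame), the first category in sorted order, `Month-to-month`, becomes the implicit baseline, which is why the encoded dataset only has `Contract_One year` and `Contract_Two year` columns:

```python
import pandas as pd

toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy, columns=["Contract"])

# drop_first=True removes the first category, avoiding perfectly collinear columns
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```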

In [24]:
#Perform Feature Scaling on the numeric columns
from sklearn.preprocessing import StandardScaler

#Perform Feature Scaling on 'tenure', 'MonthlyCharges', 'TotalCharges' in order to bring them on same scale
standardScaler = StandardScaler()
columns_for_ft_scaling = ['tenure', 'MonthlyCharges', 'TotalCharges']

#Apply the feature scaling operation on dataset using fit_transform() method
churn_dataset[columns_for_ft_scaling] = standardScaler.fit_transform(churn_dataset[columns_for_ft_scaling])
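
Note that the cell above fits the scaler on the whole dataset, so the means and standard deviations also reflect rows that later end up in the test set. A common alternative, sketched here on synthetic numbers (the array and its shape are assumptions for illustration), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three numeric columns
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Split first, so the test rows play no part in the scaling statistics
X_tr, X_te = train_test_split(X_toy, test_size=0.30, random_state=50)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean/std from training rows
X_te_scaled = scaler.transform(X_te)       # reuse those statistics on test rows

print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The test-set means are close to zero but not exactly zero, which is expected because the scaler never saw those rows.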

In [25]:
# See subset of values
churn_dataset.head()

| | customerID | tenure | MonthlyCharges | TotalCharges | Churn | Contract_One year | Contract_Two year | Dependents_Yes | DeviceProtection_Yes | gender_Male | ... | PaperlessBilling_Yes | Partner_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | PhoneService_Yes | SeniorCitizen_1 | StreamingMovies_Yes | StreamingTV_Yes | TechSupport_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | -1.280133 | -1.161571 | -0.994124 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5575-GNVDE | 0.064501 | -0.2607 | -0.173491 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3668-QPYBK | -1.239386 | -0.363752 | -0.959571 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 7795-CFOCW | 0.512713 | -0.747702 | -0.195004 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 9237-HQITU | -1.239386 | 0.196383 | -0.940375 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |


In [26]:
#The number of columns has increased and the encoded columns carry category suffixes, as a result of the get_dummies method.
churn_dataset.columns

Index(['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn',
       'Contract_One year', 'Contract_Two year', 'Dependents_Yes',
       'DeviceProtection_Yes', 'gender_Male', 'InternetService_Fiber optic',
       'InternetService_No', 'MultipleLines_No phone service',
       'MultipleLines_Yes', 'OnlineBackup_Yes', 'OnlineSecurity_Yes',
       'PaperlessBilling_Yes', 'Partner_Yes',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check',
       'PhoneService_Yes', 'SeniorCitizen_1', 'StreamingMovies_Yes',
       'StreamingTV_Yes', 'TechSupport_Yes'],
      dtype='object')

In [27]:
#Create Feature variable X and Target variable y
y = churn_dataset['Churn']
X = churn_dataset.drop(['Churn','customerID'], axis = 1)

In [28]:
#Split the data into training set (70%) and test set (30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 50)
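
The churn classes are imbalanced (5162 vs 1869 in the counts above), so a plain random split can give the training and test sets slightly different churn rates. A hedged sketch of the `stratify` option of `train_test_split`, using a synthetic target with a similar class ratio (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with roughly the same imbalance as the churn column
y_toy = np.array([0] * 734 + [1] * 266)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=50, stratify=y_toy
)

print("overall churn rate:", y_toy.mean())
print("train churn rate:  ", round(y_tr.mean(), 4))
print("test churn rate:   ", round(y_te.mean(), 4))
```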

In [63]:
# Machine Learning classification model libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score

In [64]:
# Fit the logistic regression model
logmodel = LogisticRegression(random_state=50, solver='lbfgs')
logmodel.fit(X_train, y_train)

In [65]:
# Encode the target as an integer array (Churn holds 0/1 values) using LabelEncoder
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)

# Fit the logistic regression model
logmodel = LogisticRegression(random_state=50, solver='lbfgs')
logmodel.fit(X_train, y_train)

# Predict the value for new, unseen data
pred = logmodel.predict(X_test)

# Get the coefficients
coefficients = logmodel.coef_
intercept = logmodel.intercept_


print("Coefficients:", coefficients)
print("Intercept:", intercept)


Coefficients: [[-1.54622455  0.18845629  0.74109063 -0.60318734 -1.31335751 -0.06568736
  -0.10078498  0.05660995  0.556659   -0.78146959  0.31729281  0.18755963
  -0.16969922 -0.4079393   0.3474447  -0.01268697 -0.16466757  0.32538302
  -0.00358772 -0.31747293  0.17631259  0.20521581  0.10323468 -0.4371966 ]]
Intercept: [-1.29677683]
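
The raw coefficient array is hard to read because it has no feature names attached. One way to label it is sketched below on synthetic data (the `tenure` and `noise` columns are made up for illustration); with the real model, the same idea would use `X_train.columns` as the index:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target depends on 'tenure' only, 'noise' is irrelevant
rng = np.random.default_rng(50)
X_toy = pd.DataFrame({"tenure": rng.normal(size=300),
                      "noise": rng.normal(size=300)})
y_toy = (X_toy["tenure"] + 0.3 * rng.normal(size=300) < 0).astype(int)

model = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem; take the first row
coef_table = pd.Series(model.coef_[0], index=X_toy.columns).sort_values()
print(coef_table)
```

A negative coefficient lowers the predicted log-odds of the positive class as the feature grows, so the most negative entries are listed first.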


In [66]:
# logmodel was already fitted on the training data in the previous cell

# Encode y_test with the same label encoder that was fitted on y_train
y_test_encoded = label_encoder.transform(y_test)

# Predict the value for new, unseen data
y_pred = logmodel.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test_encoded, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8028436018957346
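
Accuracy alone can flatter a model on imbalanced data. Using the class counts from Step 3 (5162 non-churners, 1869 churners), a model that always predicts "no churn" would already score about 0.73 on the full dataset; the exact figure on the test split may differ slightly. The sketch below computes that baseline and shows, on made-up labels, how a confusion matrix and recall expose missed churners:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Majority-class baseline from the value counts printed in Step 3
majority_baseline = 5162 / (5162 + 1869)
print("always-'no churn' accuracy:", round(majority_baseline, 4))

# Made-up labels for illustration only (1 = churn)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 0, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Recall for the churn class: share of actual churners that were caught
recall = recall_score(y_true_toy, y_pred_toy)
print("churn recall:", round(recall, 3))
```

On the real model, `confusion_matrix(y_test_encoded, y_pred)` and `classification_report(y_test_encoded, y_pred)` (both already imported above) would give the equivalent breakdown.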


In [70]:
# Predict the value for new, unseen data
y_pred = logmodel.predict(X_test)
y_pred

array([1, 0, 0, ..., 0, 0, 1])
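
`predict` returns hard 0/1 labels using a 0.5 probability cut-off. For retention work it can be useful to look at the probabilities themselves and flag customers at a lower threshold. A minimal sketch on synthetic data (the features, target, and the 0.3 threshold are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary problem
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=400) > 0.8).astype(int)

model = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)

# Column 1 of predict_proba is the probability of the positive class
proba = model.predict_proba(X_toy)[:, 1]

default_pred = (proba >= 0.5).astype(int)   # same cut-off predict() uses
lenient_pred = (proba >= 0.3).astype(int)   # flags more customers as at risk

print("flagged at 0.5:", int(default_pred.sum()))
print("flagged at 0.3:", int(lenient_pred.sum()))
```

Lowering the threshold trades more false alarms for fewer missed churners; the right balance depends on the cost of a retention offer versus a lost customer.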

In [88]:
import statsmodels.api as sm
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 4921.0 |
| Model: | GLM | Df Residuals: | 4897.0 |
| Model Family: | Binomial | Df Model: | 23.0 |
| Link Function: | Logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -2017.1 |
| Date: | Wed, 30 Aug 2023 | Deviance: | 4034.1 |
| Time: | 11:43:10 | Pearson chi2: | 5330.0 |
| No. Iterations: | 9 | Pseudo R-squ. (CS): | 0.292 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.4147 | 0.762 | -1.856 | 0.064 | -2.909 | 0.080 |
| tenure | -1.6282 | 0.191 | -8.524 | 0.000 | -2.003 | -1.254 |
| MonthlyCharges | -0.6682 | 1.144 | -0.584 | 0.559 | -2.910 | 1.574 |
| TotalCharges | 0.8342 | 0.198 | 4.209 | 0.000 | 0.446 | 1.223 |
| Contract_One year | -0.6148 | 0.129 | -4.777 | 0.000 | -0.867 | -0.363 |
| Contract_Two year | -1.3713 | 0.216 | -6.354 | 0.000 | -1.794 | -0.948 |
| Dependents_Yes | -0.0624 | 0.107 | -0.581 | 0.561 | -0.273 | 0.148 |
| DeviceProtection_Yes | 0.0369 | 0.212 | 0.174 | 0.862 | -0.378 | 0.452 |
| gender_Male | 0.0577 | 0.078 | 0.740 | 0.459 | -0.095 | 0.210 |