<h3 style='color:blue' align='center'>Customer Churn Prediction Using Logistic Regression</h3>

Customer churn prediction estimates which customers are likely to leave a business and helps explain why they leave. This tutorial looks at customer churn in a telecom business. We will build a machine learning model to predict churn and use precision, recall, and F1-score to measure the performance of our model.

# Import the Libraries

In [257]:
import numpy as np               # Mathematical and statistical operations
import pandas as pd              # Data loading and manipulation
import matplotlib.pyplot as plt  # Plotting (the pyplot submodule, not the top-level package)
import seaborn as sns            # Statistical plots

import warnings
warnings.filterwarnings("ignore")

# Load the file

In [259]:
df=pd.read_csv('https://raw.githubusercontent.com/codebasics/deep-learning-keras-tf-tutorial/master/11_chrun_prediction/customer_churn.csv')
df.sample(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
3531,4311-QTTAI,Female,0,No,No,16,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,No,Credit card (automatic),19.35,295.55,No
2543,0952-KMEEH,Male,0,No,No,13,Yes,Yes,Fiber optic,Yes,...,No,No,Yes,Yes,Month-to-month,No,Mailed check,98.15,1230.25,Yes
1921,6732-VAILE,Male,0,Yes,Yes,70,Yes,Yes,DSL,No,...,Yes,Yes,Yes,Yes,Two year,Yes,Credit card (automatic),85.95,5931.75,No
3379,0396-HUJBP,Female,0,No,No,2,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,19.3,44.4,No
783,4678-DVQEO,Female,0,No,No,1,Yes,No,DSL,No,...,No,Yes,No,No,Month-to-month,Yes,Electronic check,52.2,52.2,Yes


# Exploratory Data Analysis

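Exploratory data analysis (EDA) means inspecting the raw data to understand its structure and spot problems. Turning the raw data into clean, model-ready data is preprocessing, and this section covers both.
Preprocessing and cleaning: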
1.) Handling the missing values
2.) Encoding
3.) Scaling
4.) Feature Engineering

## 1.) Handling the missing values

### Check the data type

In [261]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [263]:
# TotalCharges is stored as strings, not numbers
df.TotalCharges.values

array(['29.85', '1889.5', '108.15', ..., '346.45', '306.6', '6844.5'],
      dtype=object)

### *Note :- pd.to_numeric(): This is a pandas function that attempts to convert the input into a numeric format.*

In [265]:
# let's convert it into numeric
pd.to_numeric(df.TotalCharges)

ValueError: Unable to parse string " " at position 488

In [267]:
# Explicitly (manually) replace blank strings (" ") with np.nan.
# Assigning the result back is more robust than inplace=True on a column selection.
df["TotalCharges"]=df["TotalCharges"].replace(" ",np.nan)

#### *Note:- pd.to_numeric(df.TotalCharges,errors='coerce') converts a column to numeric values; any value that cannot be parsed (e.g. blank strings or text such as 'invalid') is replaced by NaN.*

In [269]:
# Let errors='coerce' handle all bad values
pd.to_numeric(df.TotalCharges,errors='coerce')

0         29.85
1       1889.50
2        108.15
3       1840.75
4        151.65
         ...   
7038    1990.50
7039    7362.90
7040     346.45
7041     306.60
7042    6844.50
Name: TotalCharges, Length: 7043, dtype: float64

#### *Note:- Either replace the blanks manually or let errors='coerce' handle them.*
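
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a toy Series. The name `charges` and its three values are made up for illustration and only stand in for the real `TotalCharges` column.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus one blank entry
charges = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" turns anything unparseable (here the single space) into NaN
numeric = pd.to_numeric(charges, errors="coerce")

print(numeric.isna().sum())   # how many values could not be parsed
print(numeric.dtype)          # float64
```

Note that `pd.to_numeric` returns a new Series, so the result has to be assigned back to the column if the conversion should persist.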

In [271]:
# Let's check if there are any null values
df[pd.to_numeric(df.TotalCharges,errors='coerce').isnull()]

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,3115-CZMZD,Male,0,No,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,4367-NUYAO,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,1371-DWPAZ,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,7644-OMVMY,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,3213-VVOLG,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,2520-SGTTA,Female,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,2923-ARZLG,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,4075-WKNIU,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,...,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


In [273]:
# Convert the column to float now that the blank strings have been replaced by NaN
df["TotalCharges"]=df["TotalCharges"].astype(float)

In [275]:
#Calculating the mean of that column 
TotalCharges_mean=df["TotalCharges"].mean()

In [277]:
# Fill the null values with the mean (assign the result back instead of using inplace=True)
df["TotalCharges"]=df["TotalCharges"].fillna(TotalCharges_mean)
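
The mean imputation above can be sketched on a small, made-up frame. The `toy` frame and its values are hypothetical. It assigns the result back to the column, which tends to behave more predictably across pandas versions than `inplace=True` on a column selection.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing charge
toy = pd.DataFrame({"TotalCharges": [10.0, np.nan, 30.0]})

# mean() skips NaN by default, so the mean here is (10 + 30) / 2
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())

print(toy["TotalCharges"].tolist())  # [10.0, 20.0, 30.0]
```

The rows with missing `TotalCharges` listed earlier all have `tenure` equal to 0, so filling with 0 instead of the mean is arguably a reasonable alternative. The notebook uses the mean.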

In [279]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [281]:
# Let's check the other columns
def print_unique_col_values(df):
  for column in df:
    if df[column].dtypes=='object':            # only object columns, because those are the ones we have to encode
      print(f'{column}:{df[column].unique()}')  # braces (not parentheses) so the column name is printed
print_unique_col_values(df)

customerID:['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
gender:['Female' 'Male']
Partner:['Yes' 'No']
Dependents:['No' 'Yes']
PhoneService:['No' 'Yes']
MultipleLines:['No phone service' 'No' 'Yes']
InternetService:['DSL' 'Fiber optic' 'No']
OnlineSecurity:['No' 'Yes' 'No internet service']
OnlineBackup:['Yes' 'No' 'No internet service']
DeviceProtection:['No' 'Yes' 'No internet service']
TechSupport:['No' 'Yes' 'No internet service']
StreamingTV:['No' 'Yes' 'No internet service']
StreamingMovies:['No' 'Yes' 'No internet service']
Contract:['Month-to-month' 'One year' 'Two year']
PaperlessBilling:['Yes' 'No']
PaymentMethod:['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
Churn:['No' 'Yes']


In [283]:
# Some columns contain 'No phone service' and 'No internet service'; both can be replaced with a simple 'No'
df.replace('No phone service','No',inplace=True)
df.replace('No internet service','No',inplace=True)

In [285]:
# Checking that the values have been replaced with No
print_unique_col_values(df)

customerID:['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
gender:['Female' 'Male']
Partner:['Yes' 'No']
Dependents:['No' 'Yes']
PhoneService:['No' 'Yes']
MultipleLines:['No' 'Yes']
InternetService:['DSL' 'Fiber optic' 'No']
OnlineSecurity:['No' 'Yes']
OnlineBackup:['Yes' 'No']
DeviceProtection:['No' 'Yes']
TechSupport:['No' 'Yes']
StreamingTV:['No' 'Yes']
StreamingMovies:['No' 'Yes']
Contract:['Month-to-month' 'One year' 'Two year']
PaperlessBilling:['Yes' 'No']
PaymentMethod:['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
Churn:['No' 'Yes']


In [287]:
# Replacing Yes/No values in those columns with 1 and 0
yes_no_columns=['Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','Churn']
for col in yes_no_columns:
  df[col]=df[col].replace({'Yes':1,'No':0})   # assign back instead of inplace=True on a column selection
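
As an alternative sketch of the same Yes/No encoding, `Series.map` can be used instead of `replace`. The `toy` frame below is hypothetical. The practical difference is that `map` yields NaN for any label missing from the dictionary, which makes unexpected values easy to detect.

```python
import pandas as pd

# Hypothetical two-column frame with Yes/No labels
toy = pd.DataFrame({"Partner": ["Yes", "No", "Yes"], "Churn": ["No", "No", "Yes"]})

for col in ["Partner", "Churn"]:
    # Labels not present in the dictionary would become NaN instead of passing through
    toy[col] = toy[col].map({"Yes": 1, "No": 0})

print(toy.isna().sum().sum())    # 0 when every label was mapped
print(toy["Partner"].tolist())   # [1, 0, 1]
```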

In [289]:
# Let's check that Yes and No have been replaced with 1 and 0
for col in df:
  print(f'{col}:{df[col].unique()}')   # unique() needs parentheses, otherwise the bound method is printed

customerID:['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
gender:['Female' 'Male']
SeniorCitizen:[0 1]
Partner:[1 0]
Dependents:[0 1]
...

In [291]:
# Now let's replace the gender column with 1 and 0
df['gender']=df['gender'].replace({'Female':1,'Male':0})
df['gender'].unique()

array([1, 0])

## 2.) Encoding

#### *Note:- pd.get_dummies is the pandas function for one-hot encoding: it replaces each listed categorical column with one indicator column per category.*

In [293]:
df=pd.get_dummies(data=df,columns=['InternetService','Contract','PaymentMethod'])
df.sample(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,...,InternetService_DSL,InternetService_Fiber optic,InternetService_No,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
1878,0463-WZZKO,0,0,0,0,3,1,0,0,0,...,False,False,True,True,False,False,False,True,False,False
912,3865-YIOTT,0,0,1,1,72,1,1,1,0,...,False,True,False,False,True,False,True,False,False,False
3420,8663-UPDGF,1,0,0,0,26,1,1,1,1,...,True,False,False,True,False,False,True,False,False,False
1275,5095-ETBRJ,1,0,1,0,55,1,0,1,1,...,True,False,False,False,False,True,False,False,False,True
4859,8041-TMEID,0,1,1,0,63,1,1,1,1,...,False,True,False,False,False,True,False,True,False,False


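In recent pandas versions, get_dummies returns the new indicator columns as booleans (True/False), as in the sample above. If you prefer 0/1 integers, either pass dtype=int to get_dummies or convert the boolean columns afterwards. This step is optional and is not applied in this notebook, so the indicator columns stay boolean in the outputs that follow:
# Convert the boolean indicator columns to integers
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)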
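
The following minimal sketch shows one-hot encoding with integer output on a made-up `Contract` column. The `toy` and `encoded` names are illustrative only.

```python
import pandas as pd

# Hypothetical column with the three contract types seen in the data
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# dtype=int makes the indicator columns 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)

print(encoded.columns.tolist())
print(encoded.sum(axis=1).tolist())  # each row has exactly one indicator set
```

For a linear model, `drop_first=True` is a common option that removes one redundant indicator per feature. The notebook keeps all indicator columns.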

## 3.) Scaling

Standardization rescales each column so that it has mean 0 and standard deviation 1:
z = (x - mean) / standard deviation
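
As a quick check of the formula, the sketch below compares a manual z-score with `StandardScaler` on four made-up values. The array `x` is hypothetical. `StandardScaler` divides by the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])   # hypothetical tenure values

# Manual z-score using the population standard deviation (ddof=0)
z_manual = (x - x.mean()) / x.std()

# The same transformation done by scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))  # True
```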

In [303]:
cols_to_scale=['tenure','MonthlyCharges','TotalCharges']

from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
df[cols_to_scale]=ss.fit_transform(df[cols_to_scale])

In [307]:
# Let's check if the columns are scaled
for col in df:
  print(f'{col}:{df[col].unique()}')

customerID:['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
gender:[1 0]
SeniorCitizen:[0 1]
Partner:[1 0]
Dependents:[0 1]
tenure:[-1.27744458  0.06632742 -1.23672422  0.51425142 -0.99240204 -0.42231695
 -0.91096131 -0.17799476  1.2064976  -0.78880022 -0.66663913  1.04361615
  0.67713287 -0.30015585  1.49154015  0.79929397  1.57298088 -0.46303731
 -0.82952058 -0.09655404  0.59569215  1.61370124 -0.62591876 -0.21871513
 -1.11456313  0.55497178 -0.87024095  1.53226051  1.24721797  0.43281069
 -0.70735949  1.12505688 -0.5851984   1.36937906 -0.95168167 -1.19600386
 -0.05583367  0.71785324  1.28793833  0.96217542 -1.0331224   0.39209033
  0.10704778  0.63641251 -0.1372744   1.32865869  0.22920887  1.45081979
 -0.01511331  0.92145506  0.18848851  0.14776815  0.35136997 -1.07384277
 -1.15528349  0.02560706  1.41009942 -0.38159658  1.00289578  1.16577724
 -0.74807986 -0.50375767  0.84001433  0.3106496   1.08433651 -0.34087622
  0.47353106 -0.54447804  0.88

## 4.) Feature Engineering

In [322]:
# Drop the customerID column: it is a unique identifier with no predictive value
df.drop('customerID',axis='columns',inplace=True)
df.dtypes

gender                                       int64
SeniorCitizen                                int64
Partner                                      int64
Dependents                                   int64
tenure                                     float64
PhoneService                                 int64
MultipleLines                                int64
OnlineSecurity                               int64
OnlineBackup                                 int64
DeviceProtection                             int64
TechSupport                                  int64
StreamingTV                                  int64
StreamingMovies                              int64
PaperlessBilling                             int64
MonthlyCharges                             float64
TotalCharges                               float64
Churn                                        int64
InternetService_DSL                           bool
InternetService_Fiber optic                   bool
InternetService_No             

# Separate x and y

In [330]:
x=df.drop('Churn',axis='columns') # Independent variables (features)
y=df['Churn'] # Dependent variable (target)

# Train-Test Split

In [332]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.20,random_state=1) # random_state fixes the shuffle, so the same train/test split (and therefore the same scores) is reproduced on every run
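
Two refinements are often applied at this step, shown here as a sketch on synthetic data rather than as the notebook's method. `stratify=y` keeps the class ratio identical in both splits, and the scaler is fitted on the training rows only so that no test-set statistics leak into training. All names and values below are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical numeric features
y = np.array([0] * 75 + [1] * 25)        # imbalanced labels, 25% positives

# stratify=y keeps the 75/25 class ratio in both the train and the test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())     # 0.25 0.25
```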

# Build a model

In [337]:
#1. Import the model:
from sklearn.linear_model import LogisticRegression

#2.Create instance of the model:
logreg=LogisticRegression()

#3. Train/Fit the model:
logreg.fit(xtrain,ytrain)

#4. Predict the observation:
ypred=logreg.predict(xtest)
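
`predict` returns hard 0/1 labels, which for binary logistic regression corresponds to a 0.5 probability cut-off. The sketch below uses synthetic data from `make_classification` (not the churn dataset) to show how `predict_proba` allows a lower cut-off when catching more churners matters more than avoiding false alarms. The 0.3 threshold is an arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, imbalanced stand-in for the churn data (roughly 25% positives)
X, y = make_classification(n_samples=500, n_features=5, weights=[0.75], random_state=1)

model = LogisticRegression().fit(X, y)
churn_proba = model.predict_proba(X)[:, 1]    # estimated probability of class 1

# A lower cut-off flags more customers as likely churners
pred_default = (churn_proba >= 0.5).astype(int)
pred_lenient = (churn_proba >= 0.3).astype(int)

# Scored on the training data purely for brevity
print(recall_score(y, pred_default), recall_score(y, pred_lenient))
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision.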

In [339]:
from sklearn.metrics import accuracy_score,classification_report

In [341]:
accuracy_score(ytest,ypred)

0.8119233498935415

In [343]:
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.87      0.89      0.88      1061
           1       0.63      0.59      0.61       348

    accuracy                           0.81      1409
   macro avg       0.75      0.74      0.74      1409
weighted avg       0.81      0.81      0.81      1409
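
In the report above, row 1 is the churn class: a recall of 0.59 means the model catches about 59% of the customers who actually churned, and a precision of 0.63 means about 63% of the customers it flags really do churn. To show where these numbers come from, the sketch below recomputes precision, recall, and F1 by hand from a confusion matrix. The labels are toy values, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels: 1 = churned, 0 = stayed (not the notebook's data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)    # of the predicted churners, how many really churned
recall = tp / (tp + fn)       # of the real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)         # 5 1 2 2
print(precision, recall, f1)
```

The hand-computed values agree with `precision_score`, `recall_score`, and `f1_score` on the same labels.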

