# Week 3 - Binary Classification (Logistic Regression)

## 3.1 Churn Prediction
Imagine you work at a telecom company, Telco, and you have been asked to predict which customers are likely to churn (stop using the service).

## 3.2 Data preparation
- download the data, read it with pandas
- look at the data
- make column names and values look uniform
- check that all the columns were read correctly
- check if the churn variable needs any preparation

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'

In [4]:
!wget $data -O data.csv

--2023-09-25 08:19:04--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 977501 (955K) [text/plain]
Saving to: ‘data.csv’


2023-09-25 08:19:04 (7.98 MB/s) - ‘data.csv’ saved [977501/977501]



Note: the leading `!` above tells Jupyter to run the line as a shell command, and `$data` is replaced with the value of the Python variable `data`.
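
If `wget` is not installed (for example on Windows), the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download(url, filename="data.csv"):
    """Download `url` to `filename`, skipping the download if the file already exists."""
    path = Path(filename)
    if not path.exists():
        # urlretrieve copies the resource at `url` into the local file
        urlretrieve(url, str(path))
    return path
```

In the notebook this would be called as `download(data)`, where `data` is the URL defined above.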

In [9]:
#load the downloaded csv file into a DataFrame
df = pd.read_csv('data.csv')
df.head().T #transpose to easily see all columns

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |


In [10]:
#clean column titles to all be lower case and spaces replaced by underscores
df.columns = df.columns.str.lower().str.replace(' ','_')

#clean categorical (string) columns to all be lower case and spaces replaced by underscores
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ','_')
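
To see what this normalization does, here is a minimal sketch on a made-up three-column frame (the column names and values are illustrative, not read from `data.csv`). It selects string columns with `is_string_dtype` instead of `dtypes == 'object'`, on the assumption that newer pandas versions may report string columns under a dedicated string dtype rather than `object`.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Illustrative toy data, not the real dataset
toy = pd.DataFrame({
    "Customer ID": ["7590-VHVEG", "5575-GNVDE"],
    "Payment Method": ["Electronic check", "Mailed check"],
    "Monthly Charges": [29.85, 56.95],
})

# Normalize the column names: lower case, spaces replaced by underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

# Normalize the values of every string column in the same way
string_columns = [c for c in toy.columns if is_string_dtype(toy[c])]
for c in string_columns:
    toy[c] = toy[c].str.lower().str.replace(" ", "_", regex=False)

print(toy)
```

Numeric columns such as `monthly_charges` are left untouched because they are not string columns.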

In [11]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |


In [13]:
#explore datatypes
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

- `seniorcitizen` is type `int64` (encoded as 0/1), whereas the other yes/no columns are type `object`
- `totalcharges` is type `object` when we would expect a numeric type (`float64`, like `monthlycharges`)

In [16]:
#attempt to convert totalcharges to numeric
pd.to_numeric(df.totalcharges)

ValueError: Unable to parse string "_" at position 488

In [19]:
#let's look at the problem records; errors='coerce' turns unparseable values into NaN
tc = pd.to_numeric(df.totalcharges, errors='coerce')
df[tc.isnull()][['customerid','totalcharges']]

| | customerid | totalcharges |
|---|---|---|
| 488 | 4472-lvygi | _ |
| 753 | 3115-czmzd | _ |
| 936 | 5709-lvoeq | _ |
| 1082 | 4367-nuyao | _ |
| 1340 | 1371-dwpaz | _ |
| 3331 | 7644-omvmy | _ |
| 3826 | 3213-vvolg | _ |
| 4380 | 2520-sgtta | _ |
| 5218 | 2923-arzlg | _ |
| 6670 | 4075-wkniu | _ |


Notice that these values were originally blank (spaces only). Because the column was read as type `object`, our earlier cleaning step replaced the spaces with underscores.
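
The chain of events can be reproduced on a small made-up series (the values are illustrative). The sketch also shows one possible alternative to editing the strings: coerce to numeric first and then fill the resulting missing values with 0. This is an assumption about what a reasonable fill value is, not the only option.

```python
import pandas as pd

# Illustrative values: two real-looking charges and two blank entries
raw = pd.Series(["29.85", " ", "1889.5", " "], name="totalcharges")

# The earlier cleaning step turns the blank entries into underscores
cleaned = raw.str.lower().str.replace(" ", "_", regex=False)

# errors='coerce' converts anything unparseable ("_") into NaN instead of raising
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric.isnull().sum())  # number of problem records

# Alternative fix: fill the missing values with 0 after coercion
filled = numeric.fillna(0)
print(filled.dtype)
```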

In [24]:
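#replace the underscore placeholders with '0' (safe here because the other values are numeric strings), then re-check the same rows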
df['totalcharges'] = df['totalcharges'].str.replace('_', '0')
df[tc.isnull()][['customerid','totalcharges']]

| | customerid | totalcharges |
|---|---|---|
| 488 | 4472-lvygi | 0 |
| 753 | 3115-czmzd | 0 |
| 936 | 5709-lvoeq | 0 |
| 1082 | 4367-nuyao | 0 |
| 1340 | 1371-dwpaz | 0 |
| 3331 | 7644-omvmy | 0 |
| 3826 | 3213-vvolg | 0 |
| 4380 | 2520-sgtta | 0 |
| 5218 | 2923-arzlg | 0 |
| 6670 | 4075-wkniu | 0 |


In [26]:
#convert to numeric and check dtype
df['totalcharges'] = pd.to_numeric(df.totalcharges)
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                object
dtype: object

In [27]:
#now investigate churn
df.churn

0        no
1        no
2       yes
3        no
4       yes
       ... 
7038     no
7039     no
7040     no
7041    yes
7042     no
Name: churn, Length: 7043, dtype: object

In [31]:
#replace yes and no with 1 and 0 respectively
df.churn = (df.churn == 'yes').astype(int)
df.churn

0       0
1       0
2       1
3       0
4       1
       ..
7038    0
7039    0
7040    0
7041    1
7042    0
Name: churn, Length: 7043, dtype: int64
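
One thing to watch for: this conversion is not idempotent. If the cell is run a second time, the column already holds integers, the comparison with `'yes'` is false everywhere, and every value becomes 0. A minimal sketch of a guard follows; the helper name `encode_churn` is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def encode_churn(series):
    """Map 'yes'/'no' strings to 1/0; return an already-encoded column unchanged."""
    if is_integer_dtype(series):
        return series
    return (series == "yes").astype(int)


# First five churn values from the output above
churn = pd.Series(["no", "no", "yes", "no", "yes"])

once = encode_churn(churn)
twice = encode_churn(once)              # safe to call again
naive = (once == "yes").astype(int)     # re-running the plain conversion wipes the labels

print(once.tolist(), twice.tolist(), naive.tolist())
```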

## 3.3 Setting up the validation framework

Split the dataset into training, validation, and test sets (60%/20%/20%).

In [34]:
#import packages
from sklearn.model_selection import train_test_split

In [37]:
#split the dataset into train and test
#random_state is essentially the seed to make sure it is reproducible
df_full_train, df_test = train_test_split(df, test_size = 0.2, random_state=1)

`train_test_split` only splits the data into two parts, so we cannot get train/validation/test from a single call. Instead, we first split 80%/20% into full-train/test and then split the full-train part again into train and validation.
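
The two-stage split can be wrapped in a small helper. This is a sketch only: the function name `train_val_test_split` and its parameters are hypothetical, and it is demonstrated on a plain array of 100 integers rather than on the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(data, val_size=0.2, test_size=0.2, seed=1):
    """Split `data` into train/validation/test using two calls to train_test_split."""
    # Stage 1: hold out the test set
    full_train, test = train_test_split(data, test_size=test_size, random_state=seed)
    # Stage 2: the validation share must be expressed relative to what is left
    relative_val = val_size / (1 - test_size)
    train, val = train_test_split(full_train, test_size=relative_val, random_state=seed)
    return train, val, test


items = np.arange(100)
train, val, test = train_val_test_split(items)
print(len(train), len(val), len(test))
```

Because the seed is fixed, calling the helper again on the same data yields the same three parts, and every item ends up in exactly one of them.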

In [38]:
len(df_full_train), len(df_test)

(5634, 1409)

In [39]:
#split the dataset into train and validation
#the validation set should be 20% of df, which is 25% of df_full_train (0.20 / 0.80 = 0.25)
df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state=1)

In [41]:
#ensure lengths match up
len(df_train), len(df_test), len(df_val)

(4225, 1409, 1409)
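
The sizes above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional test size up to the next whole row and gives the remainder to the training part, which is consistent with the lengths printed above.

```python
import math

n_total = 7043  # number of rows in df

# Stage 1: 20% of the full dataset goes to the test set
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# Stage 2: 25% of the remaining rows go to the validation set
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)  # 4225 1409 1409
```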

In [42]:
#reset index, not necessary but looks better
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

In [43]:
#get y values
y_train = df_train.churn.values
y_test = df_test.churn.values
y_val = df_val.churn.values

In [44]:
#delete churn from the dataframes so the target cannot be used as a feature
del df_train['churn']
del df_test['churn']
del df_val['churn']
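
Note that `del df_train['churn']` raises a `KeyError` if the cell is run a second time, because the column is already gone. A minimal sketch of an alternative that leaves the original frame untouched is shown below; the helper name `split_target` is hypothetical and the toy values are illustrative.

```python
import pandas as pd


def split_target(frame, target="churn"):
    """Return (features, y) without modifying `frame`."""
    features = frame.drop(columns=[target])  # drop returns a new DataFrame
    y = frame[target].values
    return features, y


# Illustrative toy data, not the real dataset
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "monthlycharges": [29.85, 56.95, 53.85],
    "churn": [0, 0, 1],
})

X_toy, y_toy = split_target(toy)
print(list(X_toy.columns), y_toy)
```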