Instructions
- Load the dataset and explore the variables.
- We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen, and MonthlyCharges.
- Extract the target variable.
- Extract the independent variables and scale them.
- Build the logistic regression model.
- Evaluate the model.
- Even a simple model will give us more than 70% accuracy. Why?
- Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique based on nearest neighbors that adds new synthetic points between existing minority-class points.
- Apply imblearn.over_sampling.SMOTE to the dataset.
- Build and evaluate the logistic regression model. Is there any improvement?
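
The idea behind SMOTE in the instructions above can be sketched in a few lines of NumPy. This is only a simplified illustration of "adding new points between existing points", not the imblearn implementation: the function name smote_like_samples, its parameters, and the sample rows are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority-class
    point and one of its k nearest minority-class neighbors (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a point is never its own neighbor.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)   # random starting points
    pick = rng.integers(0, k, size=n_new)   # which neighbor to move toward
    gap = rng.random(n_new)[:, None]        # position along the segment, in [0, 1)
    chosen = X_minority[neighbors[base, pick]]
    return X_minority[base] + gap * (chosen - X_minority[base])

# Illustrative minority-class rows as (tenure, monthlycharges); values are hypothetical.
X_min = np.array([[2, 53.85], [2, 70.70], [4, 74.40], [1, 29.85], [8, 99.65], [10, 80.00]])
synthetic = smote_like_samples(X_min, n_new=4, k=2)
print(synthetic)
```

Each synthetic row lies on the segment between two real minority rows, so it never falls outside the range already covered by that class.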

In [1]:
import imblearn
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline


# LogisticRegression is the classifier used for both models in this lab
from sklearn.linear_model import LogisticRegression

In [2]:
# Read the dataset and explore the variables

df = pd.read_csv("/Users/giulianamiranda/Documents/Labs/lab-imbalanced-data/files_for_lab/customer_churn.csv", sep = ',')
df

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [3]:
# Normalize column names: lowercase, with spaces replaced by underscores
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

df.columns


Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [4]:
df.isnull().sum()


customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [5]:
df.shape

(7043, 21)

In [6]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [7]:
df['churn'].value_counts()

No     5174
Yes    1869
Name: churn, dtype: int64

# encoding the variable churn

We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen, and MonthlyCharges.
- Extract the target variable.
- Extract the independent variables and scale them.

In [8]:
# Encode churn as a binary target: 1 for 'Yes', 0 otherwise
df['churn_num'] = (df['churn'] == 'Yes').astype(int)

target = df['churn_num']

var = ['tenure', 'seniorcitizen', 'monthlycharges']

ind_var = df[var]
ind_var


,tenure,seniorcitizen,monthlycharges
0,1,0,29.85
1,34,0,56.95
2,2,0,53.85
3,45,0,42.30
4,2,0,70.70
...,...,...,...
7038,24,0,84.80
7039,72,0,103.20
7040,11,0,29.60
7041,4,1,74.40


In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x = ind_var
y = target


x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)

# Scaling the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(x_train)
X_test_scaled = scaler.transform(x_test)
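
As a rough illustration of what the scaling step above does, the sketch below standardizes a tiny array by hand. The rows and the train/test split are hypothetical; the point is that the mean and standard deviation come from the training rows only and are then reused on the test rows, which mirrors fit_transform on the training set followed by transform on the test set.

```python
import numpy as np

# Illustrative (tenure, monthlycharges) rows; this split is hypothetical.
train = np.array([[1, 29.85], [34, 56.95], [2, 53.85], [45, 42.30]])
test = np.array([[2, 70.70]])

# Statistics are computed on the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, never refit on test data

print("train column means:", train_scaled.mean(axis=0))
print("train column stds: ", train_scaled.std(axis=0))
print("scaled test row:   ", test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the test row is simply expressed in the same units.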


In [10]:
# Creating the model

LR = LogisticRegression() 
LR.fit(X_train_scaled, y_train)


pred = LR.predict(X_test_scaled)


In [12]:
print("Accuracy is: ", LR.score(X_test_scaled, y_test))

Accuracy is:  0.8076650106458482


In [13]:
# Evaluating the results

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report  # per-class precision, recall, F1 and support


print("Precision is: ", precision_score(y_test, pred))
print("Recall is: ", recall_score(y_test, pred))
print("F1 score is: ", f1_score(y_test, pred))

print(classification_report(y_test, pred))

Precision is:  0.6961538461538461
Recall is:  0.48525469168900803
F1 score is:  0.5718799368088469
              precision    recall  f1-score   support

           0       0.83      0.92      0.88      1036
           1       0.70      0.49      0.57       373

    accuracy                           0.81      1409
   macro avg       0.76      0.70      0.72      1409
weighted avg       0.80      0.81      0.80      1409



In [17]:
# Even a simple model will give us more than 70% accuracy. Why?

# Because the sample contains many more 'No' cases (5,174) than 'Yes' cases (1,869).
# A model that always predicts 'No' would already be right about 73% of the time (5174 / 7043).
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred)

array([[957,  79],
       [192, 181]])
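
The 73% figure can be checked with simple arithmetic. The sketch below uses the class counts printed by value_counts() and the support column of the classification report; the variable names are only illustrative.

```python
# Class counts from df['churn'].value_counts()
n_no, n_yes = 5174, 1869
baseline_full = n_no / (n_no + n_yes)

# Test-set class counts from the "support" column of the classification report
test_no, test_yes = 1036, 373
baseline_test = test_no / (test_no + test_yes)

# Accuracy of a model that always predicts the majority class ('No')
print(f"Majority-class accuracy, full data: {baseline_full:.3f}")
print(f"Majority-class accuracy, test set:  {baseline_test:.3f}")
```

Against this baseline, the 0.81 accuracy of the first model is a smaller gain than it first appears, which is why recall and precision for class 1 matter more here.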

Apply imblearn.over_sampling.SMOTE to the dataset


In [14]:
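from imblearn.over_sampling import SMOTE
# Oversample only the training split; sampling_strategy=1 requests a 1:1 minority-to-majority ratio.
# Note: the unscaled features are resampled here, unlike the scaled ones used for the first model.
sm = SMOTE(random_state=0, sampling_strategy=1)
x_train_SMOTE, y_train_SMOTE = sm.fit_resample(x_train, y_train)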



x_train_SMOTE

,tenure,seniorcitizen,monthlycharges
0,21,0,64.850000
1,54,0,97.200000
2,1,0,23.450000
3,4,0,70.200000
4,0,0,61.900000
...,...,...,...
8271,7,0,95.361237
8272,4,0,80.963614
8273,18,0,88.318056
8274,3,1,83.754083


In [22]:
# Build and evaluate the logistic regression model. Is there any improvement?

train_smote = pd.concat([x_train_SMOTE, y_train_SMOTE], axis=1)
train_smote


LR = LogisticRegression(max_iter = 1000)
LR.fit(x_train_SMOTE, y_train_SMOTE)

pred2 = LR.predict(x_test) 

print("Accuracy is: ", LR.score(x_test, y_test)) 
print("Precision is: ", precision_score(y_test, pred2))
print("Recall is: ", recall_score(y_test, pred2))
print("F1 score is: ", f1_score(y_test, pred2))

print(classification_report(y_test, pred2))

Accuracy is:  0.7452093683463449
Precision is:  0.5126811594202898
Recall is:  0.7587131367292225
F1 score is:  0.6118918918918917
              precision    recall  f1-score   support

           0       0.89      0.74      0.81      1036
           1       0.51      0.76      0.61       373

    accuracy                           0.75      1409
   macro avg       0.70      0.75      0.71      1409
weighted avg       0.79      0.75      0.76      1409



In [23]:
from sklearn.metrics import confusion_matrix
LR.score(x_test, y_test)
confusion_matrix(y_test, pred2)

array([[767, 269],
       [ 90, 283]])

In [26]:
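# SMOTE improved the detection of churners: recall for class 1 rose from 0.49 to 0.76, and the new model
# caught 283 churn cases instead of 181 (102 more than the previous one).
# The cost is more false alarms (269 vs. 79), so precision fell from 0.70 to 0.51 and accuracy from 0.81 to 0.75.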
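
The trade-off can be read directly off the two confusion matrices printed above. The helper below is a small sketch (the name summarize is an assumption made for this example) that recomputes accuracy, precision and recall for class 1 from each matrix.

```python
import numpy as np

def summarize(cm):
    """Accuracy, precision and recall for class 1 from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]], the layout scikit-learn returns."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

before = summarize(np.array([[957, 79], [192, 181]]))  # model trained on the original split
after = summarize(np.array([[767, 269], [90, 283]]))   # model trained on the SMOTE-resampled split

for name in before:
    print(f"{name:9s} {before[name]:.3f} -> {after[name]:.3f}")
```

Whether the second model is "better" depends on the relative cost of a missed churner (false negative) versus an unnecessary retention action (false positive).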