# Lab | Imbalanced data


We will be using the files_for_lab/customer_churn.csv dataset to build a churn predictor.


1. Load the dataset and explore the variables.
2. We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen, and MonthlyCharges.
3. Extract the target variable.
4. Extract the independent variables and scale them.
5. Build the logistic regression model.
6. Evaluate the model.
7. Even a simple model will give us more than 70% accuracy. Why?
8. Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply imblearn.over_sampling.SMOTE to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
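
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imblearn implementation: the helper `smote_like_samples` and its parameters are hypothetical, and it uses a brute-force neighbor search that only suits small arrays.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class
    points and their nearest minority-class neighbors (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority points (brute force).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    k = min(k, len(X) - 1)
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point

    base = rng.integers(0, len(X), size=n_new)  # random starting point per sample
    pick = rng.integers(0, k, size=n_new)       # which of its neighbors to use
    gap = rng.random((n_new, 1))                # position along the segment, in [0, 1)
    return X[base] + gap * (X[neighbors[base, pick]] - X[base])

# Four minority points on the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6, k=2)
print(synthetic.shape)
```

Each synthetic point lies on the segment between an existing minority point and one of its neighbors, so the new points stay inside the region the minority class already occupies.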

### Step 1: Import Python Libraries

In [295]:
# prep: import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
import seaborn as sns
import statistics as stats
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error




### Step 2: Read the Dataset

In [296]:
# get the data
df = pd.read_csv('customer_churn.csv')
df.head()


|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [297]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [298]:
# drop unnecessary columns

In [299]:
df2 = df.drop(['customerID', 'gender', 'Partner', 'Dependents','PhoneService', 
               'MultipleLines', 'InternetService','OnlineSecurity', 'OnlineBackup',
               'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 
               'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges'], axis=1)

print(df2.shape)
df2.head()

(7043, 4)


|  | SeniorCitizen | tenure | MonthlyCharges | Churn |
|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | No |
| 1 | 0 | 34 | 56.95 | No |
| 2 | 0 | 2 | 53.85 | Yes |
| 3 | 0 | 45 | 42.3 | No |
| 4 | 0 | 2 | 70.7 | Yes |


### Step 3: Explore the Dataset

1. Show main information about the data set: data type of each column, number of columns, memory usage, and the number of records in the dataset

In [300]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   SeniorCitizen   7043 non-null   int64  
 1   tenure          7043 non-null   int64  
 2   MonthlyCharges  7043 non-null   float64
 3   Churn           7043 non-null   object 
dtypes: float64(1), int64(2), object(1)
memory usage: 220.2+ KB


In [301]:
# There are no null values - the total count is the same for all columns!

In [302]:
df2.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |


In [303]:
df2['SeniorCitizen'].value_counts()

0    5901
1    1142
Name: SeniorCitizen, dtype: int64

In [304]:
# On average, customers stay about 32 months and pay about 65 per month.
# Most customers are not senior citizens: the SeniorCitizen mean is 0.16,
# i.e. only 1142 of the 7043 customers are classified as senior.

In [305]:
# Since SeniorCitizen column can have only 0 or 1, it will be treated as categorical data

df2['SeniorCitizen'] = df2['SeniorCitizen'].astype('object')


### Step 4: Build the model

In [306]:
# X-y-split
y = df2['Churn']
X = df2.drop(['Churn'], axis=1)

In [307]:
df2.shape

(7043, 4)

In [308]:
#Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)
    

In [309]:
print('X_train dataset size: ', X_train.shape)
print('y_train dataset size: ', y_train.shape)
print('X_test dataset size: ', X_test.shape)
print('y_test dataset size: ', y_test.shape)

X_train dataset size:  (5634, 3)
y_train dataset size:  (5634,)
X_test dataset size:  (1409, 3)
y_test dataset size:  (1409,)


In [310]:
# Scale numerical data
X_train_num = X_train.select_dtypes(include = np.number)
    
transformer = MinMaxScaler().fit(X_train_num)
X_train_normalized = transformer.transform(X_train_num)
X_train_norm = pd.DataFrame(X_train_normalized, columns = X_train_num.columns)

X_train_norm.shape

(5634, 2)
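
MinMaxScaler rescales each feature to the range [0, 1] as (x - min) / (max - min), with min and max learned from the training data only. A minimal sketch of that arithmetic follows. It assumes the training split contains the full-dataset extremes reported by describe() (tenure 0 to 72, MonthlyCharges 18.25 to 118.75), and the helper `min_max` is hypothetical.

```python
def min_max(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# Assumed training ranges, taken from the describe() output for the full dataset.
print(min_max(9, 0, 72))               # tenure of 9 months -> 0.125
print(min_max(68.50, 18.25, 118.75))   # monthly charge of 68.50 -> 0.5
```

The test set is transformed with the same fitted scaler, so test values outside the training range would fall outside [0, 1].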

In [311]:
# Select categorical data
X_train_categorical = X_train.select_dtypes(include = object)
X_train_categorical.columns

# No need to scale as all values are either 0 or 1

Index(['SeniorCitizen'], dtype='object')

In [312]:
#Finally, concatenate the scaled data: numerical and categorical
X_train_transformed = np.concatenate([X_train_norm, X_train_categorical], axis=1)

In [313]:
from sklearn.linear_model import LogisticRegression
classification = LogisticRegression(random_state=0, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_transformed, y_train)

In [314]:
# Scale test data
X_test_num = X_test.select_dtypes(include = np.number)
    
X_test_normalized = transformer.transform(X_test_num)
X_test_norm = pd.DataFrame(X_test_normalized, columns = X_test_num.columns)

X_test_norm.shape

(1409, 2)

In [315]:
# Split categorical data
X_test_categorical = X_test.select_dtypes(include = object)


In [316]:
#Finally, concatenate the scaled data: numerical and categorical
X_test_transformed = np.concatenate([X_test_norm, X_test_categorical], axis=1)


In [317]:
y_pred = classification.predict(X_test_transformed)
classification.score(X_test_transformed, y_test)

0.7856635911994322

In [318]:
df2['Churn'].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

An accuracy of 79% looks relatively high, but it is not a good measure of model quality here because the dataset is imbalanced.
Non-churned customers (5174) far outnumber churned customers (1869), so a model can score well simply by favoring the majority class.
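
The effect of the imbalance can be checked directly from the class counts printed by value_counts(). The sketch below computes what a trivial model that always predicts "No" would score. The variable names are illustrative, and the counts are for the full dataset rather than the test split.

```python
# Class counts from df2['Churn'].value_counts()
n_no, n_yes = 5174, 1869
total = n_no + n_yes

# A model that always predicts "No" is right for every non-churner...
baseline_accuracy = n_no / total
# ...but it finds none of the churners.
baseline_recall_yes = 0 / n_yes

print(f"always-'No' accuracy: {baseline_accuracy:.3f}")          # 0.735
print(f"always-'No' recall for 'Yes': {baseline_recall_yes:.3f}")  # 0.000
```

The logistic regression's 0.786 is therefore only about five percentage points above this baseline, which is why precision and recall for the "Yes" class are more informative than accuracy.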

### Oversampling

In [319]:
from sklearn.utils import resample

In [320]:
X_train_transformed = pd.DataFrame(X_train_transformed)
y_train = pd.DataFrame(y_train)

print(X_train_transformed.shape)
print(y_train.shape)

(5634, 3)
(5634, 1)


In [340]:
train = np.concatenate([X_train_transformed, y_train], axis=1)
train = pd.DataFrame(train, columns = ['tenure', 'MonthlyCharges','SeniorCitizen', 'Churn'])
train

|  | tenure | MonthlyCharges | SeniorCitizen | Churn |
|---|---|---|---|---|
| 0 | 0.125 | 0.658209 | 1 | Yes |
| 1 | 0.333333 | 0.014428 | 0 | No |
| 2 | 0.888889 | 0.624876 | 0 | No |
| 3 | 0.527778 | 0.019403 | 0 | No |
| 4 | 0.222222 | 0.014428 | 0 | No |
| ... | ... | ... | ... | ... |
| 5629 | 0.222222 | 0.310448 | 0 | No |
| 5630 | 1.0 | 0.673134 | 0 | No |
| 5631 | 0.361111 | 0.015423 | 0 | No |
| 5632 | 0.555556 | 0.381592 | 0 | No |


In [341]:
Churned_no = train[train['Churn']=='No']
Churned_yes = train[train['Churn']=='Yes']

In [342]:
display(Churned_no.shape)
display(Churned_yes.shape)

(4141, 4)

(1493, 4)

In [343]:
# oversample minority
Churned_yes_oversampled = resample(Churned_yes,
                                    replace=True,
                                    n_samples = len(Churned_no),#<- make both sets the same size
                                    random_state=0)

In [344]:
# both classes now have the same number of rows
display(Churned_no.shape)
display(Churned_yes_oversampled.shape)
Churned_yes_oversampled.head(20)

(4141, 4)

(4141, 4)

|  | tenure | MonthlyCharges | SeniorCitizen | Churn |
|---|---|---|---|---|
| 2650 | 0.902778 | 0.868159 | 0 | Yes |
| 2208 | 0.125 | 0.265672 | 0 | Yes |
| 4666 | 0.583333 | 0.755721 | 0 | Yes |
| 3197 | 0.527778 | 0.387065 | 1 | Yes |
| 2933 | 0.041667 | 0.697512 | 0 | Yes |
| 5247 | 0.027778 | 0.510945 | 1 | Yes |
| 3978 | 0.25 | 0.774129 | 0 | Yes |
| 1083 | 0.236111 | 0.579104 | 1 | Yes |
| 2364 | 0.055556 | 0.558706 | 1 | Yes |
| 4236 | 0.75 | 0.867164 | 0 | Yes |


In [345]:
train_oversampled = pd.concat([Churned_no,Churned_yes_oversampled],axis=0)
train_oversampled.tail()

|  | tenure | MonthlyCharges | SeniorCitizen | Churn |
|---|---|---|---|---|
| 2475 | 0.152778 | 0.478607 | 0 | Yes |
| 1412 | 0.027778 | 0.634328 | 0 | Yes |
| 2616 | 0.041667 | 0.227861 | 1 | Yes |
| 3846 | 0.013889 | 0.026368 | 0 | Yes |
| 4274 | 0.111111 | 0.626866 | 1 | Yes |


In [346]:
y_train_over = train_oversampled['Churn'].copy()
X_train_over = train_oversampled.drop('Churn',axis = 1).copy()

In [364]:
# Refit the logistic regression on the oversampled training data and
# evaluate how well it detects churned customers.
LR_over = LogisticRegression(random_state=0, solver='lbfgs')
LR_over.fit(X_train_over, y_train_over)
pred = LR_over.predict(X_test_transformed)

# The labels are the strings 'No' and 'Yes', so pos_label must be 'Yes';
# pos_label=0 raises "ValueError: pos_label=0 is not a valid label".
print("precision: ", precision_score(y_test, pred, pos_label='Yes'))
print("recall: ", recall_score(y_test, pred, pos_label='Yes'))
print("f1: ", f1_score(y_test, pred, pos_label='Yes'))