![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Handling Data Imbalance in Classification Models

In this lab and the following lessons, we will build a model for a customer churn binary classification problem. You will use the `files_for_lab/Customer-Churn.csv` file.

### Scenario

You are working as an analyst for an internet service provider. You are given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to default/churn, so that it can prevent losses from such customers.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the required libraries and modules.
- Read the data into Python and call the dataframe `churnData`.
- Check the datatypes of all the columns in the data. You will see that the column `TotalCharges` has the object type. Convert this column to a numeric type using the `pd.to_numeric` function.
- Check for null values in the dataframe. Replace the null values.
- Use the following features: `tenure`, `SeniorCitizen`, `MonthlyCharges` and `TotalCharges`:
  - Scale the features using either a normalizer or a standard scaler.
  - Split the data into a training set and a test set.
  - Fit a logistic regression model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.
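
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the reference solution: the data is a synthetic stand-in for `Customer-Churn.csv` with made-up values, missing `TotalCharges` entries are replaced with the median (one possible choice), and a scikit-learn pipeline is used so that the scaler is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Customer-Churn.csv (all values are made up).
rng = np.random.default_rng(23)
n = 500
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n).round(2)
churnData = pd.DataFrame({
    "tenure": tenure,
    "SeniorCitizen": rng.integers(0, 2, size=n),
    "MonthlyCharges": monthly,
    # Stored as text, like the TotalCharges column in the lab data.
    "TotalCharges": (tenure * monthly).round(2).astype(str),
    "Churn": np.where(rng.random(n) < 0.27, "Yes", "No"),
})
churnData.loc[:4, "TotalCharges"] = " "  # a few blank entries

# Convert to numeric; blank strings become NaN and are then replaced (assumed: median).
churnData["TotalCharges"] = pd.to_numeric(churnData["TotalCharges"], errors="coerce")
churnData["TotalCharges"] = churnData["TotalCharges"].fillna(churnData["TotalCharges"].median())

features = ["tenure", "SeniorCitizen", "MonthlyCharges", "TotalCharges"]
X = churnData[features]
y = churnData["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

# The pipeline fits the scaler on X_train only, then the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Because the labels here are random, the printed accuracy says nothing about the real dataset; the sketch only shows the order of the steps.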

### Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time, fit the model and check how its accuracy changes.
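
As a rough illustration of the two resampling strategies, the sketch below balances a toy dataframe with pandas `sample`. The frame and its values are made up; only the `Churn` column name is taken from the lab.

```python
import pandas as pd

# Toy frame with an 8:2 class imbalance (made-up values).
df = pd.DataFrame({
    "tenure": range(10),
    "Churn": ["No"] * 8 + ["Yes"] * 2,
})

majority = df[df["Churn"] == "No"]
minority = df[df["Churn"] == "Yes"]

# Downsampling: keep a random subset of the majority class.
down = pd.concat([majority.sample(len(minority), random_state=23), minority])

# Upsampling: draw minority rows with replacement until the classes match.
up = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=23)])

print(down["Churn"].value_counts().to_dict())
print(up["Churn"].value_counts().to_dict())
```

Downsampling discards majority rows, while upsampling repeats minority rows, so the upsampled frame contains duplicates.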




In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
pd.set_option('display.max_columns', None)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
data = pd.read_csv('Customer-Churn.csv')

In [3]:
data

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


In [4]:
data.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [5]:
data['TotalCharges'].value_counts()

          11
20.2      11
19.75      9
20.05      8
19.9       8
          ..
6849.4     1
692.35     1
130.15     1
3211.9     1
6844.5     1
Name: TotalCharges, Length: 6531, dtype: int64

In [6]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
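
A small sketch of why `errors='coerce'` matters here: the column contains blank strings, which cannot be parsed as numbers, and with `coerce` they are turned into `NaN` instead of raising an error. The three values below are illustrative only.

```python
import pandas as pd

# One blank entry between two valid numbers, mimicking the TotalCharges column.
raw = pd.Series(["29.85", " ", "1889.5"])

converted = pd.to_numeric(raw, errors="coerce")
print(converted)
print("Missing after conversion:", converted.isna().sum())
```

This is consistent with the notebook: the 11 blank entries reported by `value_counts()` show up as 11 null values after the conversion.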

In [7]:
data.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

In [8]:
data.isna().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [9]:
data.shape

(7043, 16)

In [10]:
data.dropna(inplace=True)

In [11]:
data.shape

(7032, 16)

In [12]:
data

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.50,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.50,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.90,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.60,Yes


In [13]:
data.isna().sum().sum()

0

In [14]:
data[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']].dtypes

tenure              int64
SeniorCitizen       int64
MonthlyCharges    float64
TotalCharges      float64
dtype: object

In [15]:
imp_data = data[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges', 'Churn']] 

In [16]:
imp_data.shape

(7032, 5)

In [17]:
data_to_scale = imp_data.drop(['Churn'], axis=1)

In [18]:
scaler = StandardScaler()

In [19]:
scaler.fit(data_to_scale)

In [20]:
trs_data = scaler.transform(data_to_scale)

In [21]:
trs_data

array([[-1.28024804, -0.44032709, -1.16169394, -0.99419409],
       [ 0.06430269, -0.44032709, -0.26087792, -0.17373982],
       [-1.23950408, -0.44032709, -0.36392329, -0.95964911],
       ...,
       [-0.87280842, -0.44032709, -1.17000405, -0.85451414],
       [-1.15801615,  2.27103902,  0.31916782, -0.87209546],
       [ 1.36810945, -0.44032709,  1.35793167,  2.01234407]])

In [22]:
scaled_data = pd.DataFrame(trs_data, columns=data_to_scale.columns)
scaled_data

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,-1.280248,-0.440327,-1.161694,-0.994194
1,0.064303,-0.440327,-0.260878,-0.173740
2,-1.239504,-0.440327,-0.363923,-0.959649
3,0.512486,-0.440327,-0.747850,-0.195248
4,-1.239504,-0.440327,0.196178,-0.940457
...,...,...,...,...
7027,-0.343137,-0.440327,0.664868,-0.129180
7028,1.612573,-0.440327,1.276493,2.241056
7029,-0.872808,-0.440327,-1.170004,-0.854514
7030,-1.158016,2.271039,0.319168,-0.872095


In [23]:
scaled_data.shape

(7032, 4)

In [24]:
scaled_data.isna().sum()

tenure            0
SeniorCitizen     0
MonthlyCharges    0
TotalCharges      0
dtype: int64

In [25]:
imp_data['Churn'].shape

(7032,)

In [26]:
imp_data['Churn'].isna().sum()

0

In [27]:
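# Note: without drop=True the old index is kept as an 'index' column,
# which later ends up among the features in X.
scaled_data.reset_index(inplace=True)

In [28]:
# Reset the index so that row labels line up with scaled_data before copying Churn across.
imp_data.reset_index(inplace=True)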

In [29]:
scaled_data['Churn'] = imp_data['Churn']
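
The index resets matter because pandas aligns on index labels, not on position, when a Series is assigned as a column. The toy sketch below (made-up values, hypothetical variable names) shows what happens when the labels still carry gaps from dropped rows.

```python
import pandas as pd

# Labels that kept their original row labels after rows 1 and 2 were dropped.
labels = pd.Series(["No", "Yes", "No"], index=[0, 3, 4], name="Churn")

# A frame rebuilt from a NumPy array gets a fresh 0..n-1 index.
scaled = pd.DataFrame({"tenure": [-1.2, 0.5, 0.1]})

misaligned = scaled.copy()
misaligned["Churn"] = labels  # aligned on labels: rows 1 and 2 receive NaN

aligned = scaled.copy()
aligned["Churn"] = labels.reset_index(drop=True)  # positions now match

print(misaligned["Churn"].isna().sum(), aligned["Churn"].isna().sum())
```

Passing `drop=True` avoids adding the old index as an extra column.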

In [30]:
scaled_data


Unnamed: 0,index,tenure,SeniorCitizen,MonthlyCharges,TotalCharges,Churn
0,0,-1.280248,-0.440327,-1.161694,-0.994194,No
1,1,0.064303,-0.440327,-0.260878,-0.173740,No
2,2,-1.239504,-0.440327,-0.363923,-0.959649,Yes
3,3,0.512486,-0.440327,-0.747850,-0.195248,No
4,4,-1.239504,-0.440327,0.196178,-0.940457,Yes
...,...,...,...,...,...,...
7027,7027,-0.343137,-0.440327,0.664868,-0.129180,No
7028,7028,1.612573,-0.440327,1.276493,2.241056,No
7029,7029,-0.872808,-0.440327,-1.170004,-0.854514,No
7030,7030,-1.158016,2.271039,0.319168,-0.872095,Yes


In [31]:
scaled_data.isna().sum()

index             0
tenure            0
SeniorCitizen     0
MonthlyCharges    0
TotalCharges      0
Churn             0
dtype: int64

In [32]:
target = data['Churn']
target.shape

(7032,)

In [33]:
scaled_data

Unnamed: 0,index,tenure,SeniorCitizen,MonthlyCharges,TotalCharges,Churn
0,0,-1.280248,-0.440327,-1.161694,-0.994194,No
1,1,0.064303,-0.440327,-0.260878,-0.173740,No
2,2,-1.239504,-0.440327,-0.363923,-0.959649,Yes
3,3,0.512486,-0.440327,-0.747850,-0.195248,No
4,4,-1.239504,-0.440327,0.196178,-0.940457,Yes
...,...,...,...,...,...,...
7027,7027,-0.343137,-0.440327,0.664868,-0.129180,No
7028,7028,1.612573,-0.440327,1.276493,2.241056,No
7029,7029,-0.872808,-0.440327,-1.170004,-0.854514,No
7030,7030,-1.158016,2.271039,0.319168,-0.872095,Yes


In [34]:
X = scaled_data.drop(['Churn'],axis=1)
y = scaled_data['Churn']

In [35]:
X
y

0        No
1        No
2       Yes
3        No
4       Yes
       ... 
7027     No
7028     No
7029     No
7030    Yes
7031     No
Name: Churn, Length: 7032, dtype: object

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
#Train 80% Test 20%

In [37]:
logreg = LogisticRegression()
# the model is trained here
logreg.fit(X_train, y_train)

In [38]:
# predict on data the model has not seen
y_pred = logreg.predict(X_test)
# this should resemble y_test

In [39]:
y_pred

array(['Yes', 'Yes', 'No', ..., 'No', 'No', 'No'], dtype=object)

In [40]:
normal_socre = logreg.score(X_test, y_test)

## Managing imbalance in the dataset

In [41]:
data['Churn'].value_counts()

No     5163
Yes    1869
Name: Churn, dtype: int64
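
These counts give a useful reference point for accuracy. A model that always predicts the majority class `No` would already be right for the share of rows computed below, so accuracy on the unbalanced data should be read against that baseline. The calculation uses only the counts shown above; the share in a particular test split may differ slightly.

```python
# Class counts reported by data['Churn'].value_counts() above.
counts = {"No": 5163, "Yes": 1869}

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(f"Rows: {total}, majority-class accuracy: {majority_share:.4f}")
```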

In [42]:
data.isna().sum().sum()

0

In [43]:
categories = scaled_data['Churn'].value_counts().index
sep = []
for i in categories:
    # keep the rows of each class (list(df) would only return the column names)
    sep.append(scaled_data[scaled_data['Churn']==i])
len(sep)

2

In [44]:
category_0 = scaled_data[scaled_data['Churn']=='No']
category_1 = scaled_data[scaled_data['Churn']=='Yes']

In [45]:
category_0

Unnamed: 0,index,tenure,SeniorCitizen,MonthlyCharges,TotalCharges,Churn
0,0,-1.280248,-0.440327,-1.161694,-0.994194,No
1,1,0.064303,-0.440327,-0.260878,-0.173740,No
3,3,0.512486,-0.440327,-0.747850,-0.195248,No
6,6,-0.424625,-0.440327,0.807802,-0.147313,No
7,7,-0.913552,-0.440327,-1.165018,-0.874169,No
...,...,...,...,...,...,...
7026,7026,1.612573,-0.440327,-1.450886,-0.381142,No
7027,7027,-0.343137,-0.440327,0.664868,-0.129180,No
7028,7028,1.612573,-0.440327,1.276493,2.241056,No
7029,7029,-0.872808,-0.440327,-1.170004,-0.854514,No


In [46]:
category_1

Unnamed: 0,index,tenure,SeniorCitizen,MonthlyCharges,TotalCharges,Churn
2,2,-1.239504,-0.440327,-0.363923,-0.959649,Yes
4,4,-1.239504,-0.440327,0.196178,-0.940457,Yes
5,5,-0.995040,-0.440327,1.158489,-0.645369,Yes
8,8,-0.180161,-0.440327,1.329677,0.336516,Yes
13,13,0.675462,-0.440327,1.293113,1.214589,Yes
...,...,...,...,...,...,...
7010,7010,-0.832064,-0.440327,-0.166143,-0.686267,Yes
7015,7015,-0.954296,-0.440327,-0.684694,-0.829411,Yes
7021,7021,-1.280248,2.271039,0.364042,-0.973944,Yes
7023,7023,1.408853,-0.440327,1.268182,2.030764,Yes


## Downsampling

In [47]:
category_0 = scaled_data[scaled_data['Churn']=='No']
category_1 = scaled_data[scaled_data['Churn']=='Yes']

In [48]:
category_0 = category_0.sample(len(category_1))
category_0

Unnamed: 0,index,tenure,SeniorCitizen,MonthlyCharges,TotalCharges,Churn
2210,2210,1.531085,-0.440327,-0.496885,0.479526,No
3572,3572,0.308766,-0.440327,0.010032,0.133547,No
1101,1101,-0.954296,-0.440327,0.497004,-0.668046,No
571,571,-0.872808,-0.440327,0.003384,-0.699746,No
3259,3259,-0.913552,-0.440327,0.874283,-0.581904,No
...,...,...,...,...,...,...
6130,6130,-1.198760,-0.440327,-1.482464,-0.985039,No
1759,1759,1.245878,-0.440327,0.362380,1.055938,No
2497,2497,1.531085,-0.440327,-1.288008,-0.188343,No
4701,4701,0.960670,2.271039,1.065416,1.295481,No


In [49]:
len(category_0) == len(category_1)

True

In [50]:
data2 = pd.concat([category_0, category_1], axis=0)
data2

Unnamed: 0,index,tenure,SeniorCitizen,MonthlyCharges,TotalCharges,Churn
2210,2210,1.531085,-0.440327,-0.496885,0.479526,No
3572,3572,0.308766,-0.440327,0.010032,0.133547,No
1101,1101,-0.954296,-0.440327,0.497004,-0.668046,No
571,571,-0.872808,-0.440327,0.003384,-0.699746,No
3259,3259,-0.913552,-0.440327,0.874283,-0.581904,No
...,...,...,...,...,...,...
7010,7010,-0.832064,-0.440327,-0.166143,-0.686267,Yes
7015,7015,-0.954296,-0.440327,-0.684694,-0.829411,Yes
7021,7021,-1.280248,2.271039,0.364042,-0.973944,Yes
7023,7023,1.408853,-0.440327,1.268182,2.030764,Yes


In [51]:
data2['Churn'].value_counts()

No     1869
Yes    1869
Name: Churn, dtype: int64

In [52]:
X = data2.drop(['Churn'],axis=1)
y = data2['Churn']

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

logreg = LogisticRegression()
# the model is trained here
logreg.fit(X_train, y_train)

# predict on data the model has not seen
y_pred = logreg.predict(X_test)
# this should resemble y_test

In [54]:
y_test

1726    Yes
20      Yes
6582    Yes
5699     No
2726     No
       ... 
2681    Yes
1420     No
540      No
608     Yes
3424     No
Name: Churn, Length: 748, dtype: object

In [55]:
y_pred

array(['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes',
       'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'No',
       'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No',
       'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
       'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes',
       'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No',
       'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No',
       'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No',
       'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'No',
       'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No',
       'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes',
       'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No',
       'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No',
       'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 

In [56]:
Down_socre = logreg.score(X_test, y_test)

## Upsampling


In [57]:
category_0 = scaled_data[scaled_data['Churn']=='No']
category_1 = scaled_data[scaled_data['Churn']=='Yes']

In [58]:
len(category_0) == len(category_1)

False

In [59]:
category_1 = category_1.sample(len(category_0), replace=True)

In [60]:
len(category_0) == len(category_1)

True

In [61]:
data2 = pd.concat([category_0, category_1], axis=0)

In [62]:
data2

Unnamed: 0,index,tenure,SeniorCitizen,MonthlyCharges,TotalCharges,Churn
0,0,-1.280248,-0.440327,-1.161694,-0.994194,No
1,1,0.064303,-0.440327,-0.260878,-0.173740,No
3,3,0.512486,-0.440327,-0.747850,-0.195248,No
6,6,-0.424625,-0.440327,0.807802,-0.147313,No
7,7,-0.913552,-0.440327,-1.165018,-0.874169,No
...,...,...,...,...,...,...
6925,6925,-1.117272,-0.440327,0.510300,-0.837506,Yes
3836,3836,-1.280248,2.271039,0.201164,-0.976105,Yes
4076,4076,1.449597,2.271039,1.176771,2.015499,Yes
5054,5054,0.268022,-0.440327,-0.295780,-0.043656,Yes


In [63]:
data2['Churn'].value_counts()

No     5163
Yes    5163
Name: Churn, dtype: int64

In [64]:
X = data2.drop(['Churn'],axis=1)
y = data2['Churn']

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

logreg = LogisticRegression()
# the model is trained here
logreg.fit(X_train, y_train)

# predict on data the model has not seen
y_pred = logreg.predict(X_test)
# this should resemble y_test

In [66]:
y_pred

array(['Yes', 'Yes', 'Yes', ..., 'Yes', 'No', 'No'], dtype=object)

In [67]:
y_test

2685    Yes
3778    Yes
6632    Yes
1315    Yes
3109     No
       ... 
3214     No
1437     No
6581    Yes
4256     No
6879     No
Name: Churn, Length: 2066, dtype: object

In [68]:
Up_socre = logreg.score(X_test, y_test)

In [80]:
print('Normal:', normal_socre)
print('UpScale:', Up_socre)
print('DownScale:', Down_socre)

Normal: 0.7867803837953091
UpScale: 0.723136495643756
DownScale: 0.7286096256684492
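
One caveat when reading these scores: in the cells above the data is resampled before `train_test_split`, so the test sets are balanced as well, and with upsampling the same duplicated row can land in both the training and the test split. A common alternative is to split first and resample only the training part. The sketch below shows that order on synthetic data; the features, the label rule, and the use of `stratify` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: churn is made more likely for low tenure (made-up rule).
rng = np.random.default_rng(23)
n = 1000
df = pd.DataFrame({
    "tenure": rng.normal(size=n),
    "MonthlyCharges": rng.normal(size=n),
})
prob = 1 / (1 + np.exp(1.3 + 1.5 * df["tenure"]))
df["Churn"] = np.where(rng.random(n) < prob, "Yes", "No")

# Split first; stratify keeps the class ratio similar in both splits.
train, test = train_test_split(df, test_size=0.2, random_state=23, stratify=df["Churn"])

# Upsample the "Yes" class inside the training split only.
majority = train[train["Churn"] == "No"]
minority = train[train["Churn"] == "Yes"]
train_balanced = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=23),
])

features = ["tenure", "MonthlyCharges"]
logreg = LogisticRegression()
logreg.fit(train_balanced[features], train_balanced["Churn"])

# The test split keeps the original class ratio and shares no rows with training.
score = logreg.score(test[features], test["Churn"])
print(f"Accuracy on untouched test split: {score:.3f}")
```

With this order the three scores are measured on the same kind of test data, which makes them easier to compare.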


In [88]:
category_0 = scaled_data[scaled_data['Churn']=='No']
category_1 = scaled_data[scaled_data['Churn']=='Yes']

In [89]:
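# Note: data2 is still the upsampled frame from the previous section, so X and y are
# already balanced here; build them from scaled_data to undersample the original data.
X = data2.drop(['Churn'],axis=1)
y = data2['Churn']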

In [90]:
len(category_0) == len(category_1)

False

In [91]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=23)
X_rus, y_rus = rus.fit_resample(X, y)

In [93]:
len(X_rus) == len(y_rus)

True

In [94]:
y_rus.value_counts()

No     5163
Yes    5163
Name: Churn, dtype: int64

In [95]:
# Note: this splits X, y rather than X_rus, y_rus, so the resampled data is never used
# and the score below is identical to the upsampling score.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

logreg = LogisticRegression()
# the model is trained here
logreg.fit(X_train, y_train)

# predict on data the model has not seen
y_pred = logreg.predict(X_test)
# this should resemble y_test

In [96]:
y_pred

array(['Yes', 'Yes', 'Yes', ..., 'Yes', 'No', 'No'], dtype=object)

In [97]:
y_test

2685    Yes
3778    Yes
6632    Yes
1315    Yes
3109     No
       ... 
3214     No
1437     No
6581    Yes
4256     No
6879     No
Name: Churn, Length: 2066, dtype: object

In [98]:
RandUnder_socre = logreg.score(X_test, y_test)

In [99]:
RandUnder_socre

0.723136495643756

In [103]:
print('SCORES:')
print('Normal:', normal_socre)
print('UpScale:', Up_socre)
print('DownScale:', Down_socre)
print('RandUnder:', RandUnder_socre)

SCORES:
Normal: 0.7867803837953091
UpScale: 0.723136495643756
DownScale: 0.7286096256684492
RandUnder: 0.723136495643756
