![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Cross Validation

For this lab, we will build a model for a customer churn binary classification problem. You will be using the `files_for_lab/Customer-Churn.csv` file.



### Instructions

1. Apply SMOTE for upsampling the data

    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.


2. Apply TomekLinks for downsampling

    - It is important to remember that TomekLinks does not make the two classes equal. It only removes the majority-class points that form a Tomek link, that is, a pair of points from different classes that are each other's nearest neighbor.
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
    - You can also apply this algorithm one more time and check how the imbalance between the two classes changed compared with the first pass.


In [23]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from imblearn.under_sampling import TomekLinks

In [2]:
data = pd.read_csv('Customer-Churn.csv')

In [3]:
data

| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | No | Yes | Yes | Yes | Yes | One year | 84.80 | 1990.5 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | No | Yes | Yes | No | Yes | Yes | One year | 103.20 | 7362.9 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | Yes | No | No | No | No | No | Month-to-month | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | No | No | No | No | No | No | Month-to-month | 74.40 | 306.6 | Yes |


In [4]:
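# Keep only the numeric columns; TotalCharges is not selected because pandas did not parse it as a number.
numerical = data.select_dtypes(np.number)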

In [5]:
numerical

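| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| 0 | 0 | 1 | 29.85 |
| 1 | 0 | 34 | 56.95 |
| 2 | 0 | 2 | 53.85 |
| 3 | 0 | 45 | 42.30 |
| 4 | 0 | 2 | 70.70 |
| ... | ... | ... | ... |
| 7038 | 0 | 24 | 84.80 |
| 7039 | 0 | 72 | 103.20 |
| 7040 | 0 | 11 | 29.60 |
| 7041 | 1 | 4 | 74.40 |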


In [6]:
data['Churn'].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [7]:
X = numerical
y = data['Churn']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

In [9]:
# Oversample only the training split so the test set keeps the original class mix.
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(X_train, y_train)

In [10]:
len(X_sm) == len(y_sm)

True

In [11]:
y_sm.value_counts()

No     3612
Yes    3612
Name: Churn, dtype: int64
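
The balanced counts above come from synthetic minority rows. As a rough illustration of the idea behind SMOTE, the sketch below interpolates between a minority point and one of its nearest minority neighbors. It is a toy, pure-Python version, not imblearn's implementation: the function `smote_sketch`, its parameters, and the sample points are all hypothetical, and it uses `k=2` neighbors only because the toy set is tiny (imblearn's `SMOTE` defaults to `k_neighbors=5`).

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> other, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical toy data: 4 minority points, 7 majority points.
minority = [(1.0, 50.0), (2.0, 55.0), (3.0, 70.0), (10.0, 90.0)]
n_majority = 7

synthetic = smote_sketch(minority, n_new=n_majority - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority), n_majority)
```

Because every synthetic point lies on a segment between two existing minority points, the new rows stay inside the region the minority class already occupies rather than duplicating rows exactly.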

In [12]:
lr = LogisticRegression()
lr.fit(X_sm, y_sm)
y_pred_lr = lr.predict(X_test)

In [22]:
accuracy_score(y_test, y_pred_lr)

0.7259820160908661
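
To put this accuracy in context, it helps to know what a model that always predicts the majority class would score. The sketch below recomputes accuracy by hand and applies it to labels with the class mix reported by `value_counts()` earlier (5174 "No", 1869 "Yes"). The helper `accuracy` is a hypothetical stand-in for `accuracy_score`, and the baseline uses the full dataset's class mix as an approximation, since the test split's exact counts are not shown.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Class counts reported by data['Churn'].value_counts() earlier in the notebook.
counts = {"No": 5174, "Yes": 1869}

# Labels with the same class mix, and a "model" that always predicts "No".
y_true = ["No"] * counts["No"] + ["Yes"] * counts["Yes"]
y_always_no = ["No"] * len(y_true)

baseline = accuracy(y_true, y_always_no)
print(f"majority-class baseline: {baseline:.4f}")
```

The baseline comes out at roughly 0.73, so an accuracy near that value says little about how well the minority class is being detected.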

In [14]:
arbol = DecisionTreeClassifier()
arbol.fit(X_sm, y_sm)
y_pred_arbol = arbol.predict(X_test)
y_pred_arbol

array(['Yes', 'No', 'Yes', ..., 'Yes', 'Yes', 'No'], dtype=object)

In [20]:
accuracy_score(y_test, y_pred_arbol)


0.7027922385234264

## TomekLinks

In [35]:
tomek = TomekLinks(sampling_strategy='majority', n_jobs=10)
X_tl, y_tl = tomek.fit_resample(X_train,y_train)

In [37]:
X_tl

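| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| 0 | 1 | 52 | 109.10 |
| 1 | 0 | 7 | 69.70 |
| 2 | 1 | 61 | 64.05 |
| 3 | 0 | 68 | 91.70 |
| 4 | 0 | 44 | 54.05 |
| ... | ... | ... | ... |
| 4563 | 0 | 2 | 74.75 |
| 4564 | 0 | 1 | 45.70 |
| 4565 | 0 | 14 | 55.70 |
| 4566 | 0 | 67 | 109.70 |
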

In [29]:
len(X_tl) == len(y_tl)

True

In [38]:
y_tl.value_counts()

No     3250
Yes    1318
Name: Churn, dtype: int64
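
The counts above are still imbalanced: compared with the 3612 "No" rows in the training split, TomekLinks removed 362 of them and left the 1318 "Yes" rows untouched. The sketch below illustrates why, and why a second pass can remove more points. It is a toy, pure-Python version of the idea, not imblearn's implementation: the helpers `nearest_neighbor` and `remove_tomek_links` and the one-dimensional sample points are hypothetical.

```python
import math
from collections import Counter

def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i] (excluding i itself)."""
    best_j, best_d = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def remove_tomek_links(points, labels, majority):
    """Drop the majority-class member of every Tomek link.

    A Tomek link is a pair of points from different classes that are
    each other's nearest neighbor.
    """
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and labels[i] != labels[j]:
            drop.add(i if labels[i] == majority else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Hypothetical toy data on a line: five majority points, two minority points.
points = [(0.0,), (1.0,), (2.1,), (4.0,), (4.7,), (5.0,), (9.0,)]
labels = ["No", "No", "No", "No", "No", "Yes", "Yes"]

pass1_points, pass1_labels = remove_tomek_links(points, labels, majority="No")
pass2_points, pass2_labels = remove_tomek_links(pass1_points, pass1_labels, majority="No")

print(Counter(labels))        # before resampling
print(Counter(pass1_labels))  # after one pass
print(Counter(pass2_labels))  # after a second pass
```

In this toy set the first pass removes the "No" point at 4.7, which exposes the "No" point at 4.0 as the new nearest neighbor of the "Yes" point at 5.0, so the second pass removes that one too. The classes never become equal because only points on the class boundary are affected.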

In [39]:
lr = LogisticRegression()
lr.fit(X_tl, y_tl)
y_pred_lr = lr.predict(X_test)

In [40]:
accuracy_score(y_test, y_pred_lr)

0.7766209181258874

In [74]:
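arbol = DecisionTreeClassifier(random_state=1)
# Fit on the TomekLinks-resampled data. The run recorded here fit on X_sm, y_sm
# instead, so the outputs shown below come from a SMOTE-trained tree.
arbol.fit(X_tl, y_tl)
y_pred_arbol = arbol.predict(X_test)
y_pred_arbol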

array(['No', 'No', 'Yes', ..., 'No', 'Yes', 'No'], dtype=object)

In [75]:
accuracy_score(y_test, y_pred_arbol)

0.7004259346900142
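
The lab asks for a comparison of the models. One simple way to do that is to collect the accuracies recorded in the cell outputs and rank them, as in the sketch below. The numbers are copied from the outputs in this notebook; the dictionary layout and labels are only illustrative. The last entry is labelled SMOTE because the recorded run of that cell fit the tree on `X_sm, y_sm`.

```python
# Accuracies copied from the cell outputs in this notebook (30% hold-out test set).
results = {
    ("SMOTE", "LogisticRegression"): 0.7259820160908661,
    ("SMOTE", "DecisionTreeClassifier"): 0.7027922385234264,
    ("TomekLinks", "LogisticRegression"): 0.7766209181258874,
    # Recorded in the TomekLinks section, but that run fit on the SMOTE data.
    ("SMOTE", "DecisionTreeClassifier(random_state=1)"): 0.7004259346900142,
}

# Print the runs from highest to lowest accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for (sampler, model), acc in ranked:
    print(f"{sampler:<11} {model:<40} {acc:.4f}")

best = ranked[0][0]
```

Among the recorded runs, logistic regression trained on the TomekLinks-resampled data has the highest accuracy. Accuracy alone favors the majority class, so this ranking is best read alongside the class balance of the test set.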