# Lab 7.07

## Instructions

1. Apply SMOTE to upsample the data.
 * Fit a logistic regression model and compute its accuracy.
 * Fit a decision tree classifier and compute its accuracy.
 * Compare the accuracies of the two models.
 

2. Apply TomekLinks to downsample the data.
 * Remember that Tomek Links does not make the two classes equal in size. It only removes majority-class points that lie close to minority-class points, that is, points that form a mutual nearest-neighbour pair with a point of the other class.
 * Fit a logistic regression model and compute its accuracy.
 * Fit a decision tree classifier and compute its accuracy.
 * Compare the accuracies of the two models.
 * You can also apply the algorithm a second time and check how the imbalance between the two classes changes compared with the first pass.
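
To make the Tomek Links remark concrete, here is a minimal NumPy sketch of the idea. It is not the imbalanced-learn implementation. It treats a Tomek link as a pair of points from different classes that are each other's nearest neighbour and drops the majority-class member of each pair. The toy one-dimensional points and the helper name `find_tomek_links` are assumptions made up for illustration.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest)
            if i < j and nearest[j] == i and y[i] != y[j]]

# Hypothetical toy data: five majority points ("No"), two minority points ("Yes").
X = np.array([[0.0], [0.9], [2.0], [3.0], [5.0], [3.4], [9.0]])
y = np.array(["No", "No", "No", "No", "No", "Yes", "Yes"])

links = find_tomek_links(X, y)

# Drop only the majority-class member of every link.
majority_in_links = [i if y[i] == "No" else j for i, j in links]
keep = np.setdiff1d(np.arange(len(y)), majority_in_links)
X_clean, y_clean = X[keep], y[keep]

counts = {label: int((y_clean == label).sum()) for label in ("No", "Yes")}
print("Tomek links:", links)
print("Class counts after cleaning:", counts)
```

Only the majority point at 3.0, which sits right next to the minority point at 3.4, is removed. The classes stay unequal, which is the behaviour the bullet above describes.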

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE 
from imblearn.under_sampling import TomekLinks
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [2]:
data = pd.read_csv('files_for_lab/Customer-Churn.csv')
data.head()

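| Unnamed: 0 | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.7 | 151.65 | Yes |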


In [3]:
data['Churn'].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64
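
These counts give a useful reference point for the accuracy figures in this lab: a model that always predicts the majority class `No` already reaches a fixed accuracy on the original data. The quick check below uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"No": 5174, "Yes": 1869}

total = sum(counts.values())

# Accuracy of a trivial model that always predicts the majority class.
majority_share = max(counts.values()) / total

# How many majority rows there are per minority row.
imbalance_ratio = counts["No"] / counts["Yes"]

print(f"Total rows: {total}")
print(f"Majority-class baseline accuracy: {majority_share:.3f}")
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

Because the baseline depends on the class proportions, it changes whenever the data are resampled, so accuracies measured on differently resampled data are not directly comparable.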

In [4]:
# Convert TotalCharges to numeric; values that cannot be parsed become NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors = 'coerce')
# Replace the resulting NaN values with 0
data['TotalCharges'] = data['TotalCharges'].fillna(0)


In [5]:
X = data[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y = data['Churn']
X.isna().sum()

tenure            0
SeniorCitizen     0
MonthlyCharges    0
TotalCharges      0
dtype: int64

In [6]:
transformer = StandardScaler().fit(X)
X = transformer.transform(X)

### SMOTE Upsampling

In [7]:
sm = SMOTE()
X_sm, y_sm = sm.fit_resample(X, y)
y_sm.value_counts()

No     5174
Yes    5174
Name: Churn, dtype: int64

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size = 0.3)

#### Logistic Regression

In [9]:
classification = LogisticRegression(solver = 'lbfgs').fit(X_train, y_train)
y_test_predict = classification.predict(X_test)

In [10]:
print("The accuracy of the Logistic model on test set is: %4.2f " % accuracy_score(y_test, y_test_predict))

The accuracy of the Logistic model on test set is: 0.73 


#### Decision Tree Classifier

In [11]:
model = DecisionTreeClassifier(max_depth = 5).fit(X_train, y_train)
y_test_predict = model.predict(X_test)
print("The accuracy of the Decision Tree model on test set is: %4.2f " % accuracy_score(y_test, y_test_predict))

The accuracy of the Decision Tree model on test set is: 0.75 


#### With Cross Validation

In [12]:
model_1 = LogisticRegression(solver = 'lbfgs')
model_2 = DecisionTreeClassifier()

In [13]:
models_pipeline = [model_1, model_2]
models_name = ['Logistic', 'Decision Tree']

In [14]:
scores = {}
# Pair each model with its display name instead of tracking a manual counter
for name, model in zip(models_name, models_pipeline):
    mean_score = np.mean(cross_val_score(model, X_train, y_train, cv = 10))
    scores[name] = round(mean_score, 3)

print(scores)

{'Logistic': 0.736, 'Decision Tree': 0.751}


### Tomek Links Downsampling

In [15]:
t1 = TomekLinks()
X_tl, y_tl = t1.fit_resample(X, y)
y_tl.value_counts()

No     4665
Yes    1869
Name: Churn, dtype: int64

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size = 0.3)

#### Logistic Regression

In [17]:
classification = LogisticRegression(solver = 'lbfgs').fit(X_train, y_train)
y_test_predict = classification.predict(X_test)
print("The accuracy of the Logistic model on test set is: %4.2f " % accuracy_score(y_test, y_test_predict))

The accuracy of the Logistic model on test set is: 0.79 


#### Decision Tree Classifier

In [18]:
model = DecisionTreeClassifier(max_depth = 5).fit(X_train, y_train)
y_test_predict = model.predict(X_test)
print("The accuracy of the Decision Tree model on test set is: %4.2f " % accuracy_score(y_test, y_test_predict))

The accuracy of the Decision Tree model on test set is: 0.78 


#### With Cross Validation

In [19]:
model_1 = LogisticRegression(solver = 'lbfgs')
model_2 = DecisionTreeClassifier()
models_pipeline = [model_1, model_2]
models_name = ['Logistic', 'Decision Tree']

In [20]:
scores = {}
# Pair each model with its display name instead of tracking a manual counter
for name, model in zip(models_name, models_pipeline):
    mean_score = np.mean(cross_val_score(model, X_train, y_train, cv = 10))
    scores[name] = round(mean_score, 3)

print(scores)

{'Logistic': 0.793, 'Decision Tree': 0.746}
