# Model evaluation in Python - Sampling Strategies
by María Óskarsdóttir

This notebook demonstrates the basics of splitting a dataset before learning a model in Python.  In addition, it shows balancing strategies for imbalanced data. It uses a churn dataset obtained from Kaggle.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

## Part 1: Prepare data
We start by preparing the data: we read it in, encode the categorical variables, and separate the features from the target.

In [None]:
#Load data
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
# The data can be found at https://www.kaggle.com/blastchar/telco-customer-churn

data.head(10)


Clean and prepare the data

In [None]:
#Make dummy variables for categorical variables with >2 levels
dummy_columns = ["MultipleLines","InternetService","OnlineSecurity",
                 "OnlineBackup","DeviceProtection","TechSupport",
                 "StreamingTV","StreamingMovies","Contract",
                 "PaymentMethod"]

df = pd.get_dummies(data, columns = dummy_columns)

#Encode categorical variables with 2 levels
enc = LabelEncoder()
encode_columns = ["Churn","PaperlessBilling","PhoneService",
                  "gender","Partner","Dependents","SeniorCitizen"]

for col in encode_columns:
    df[col] = enc.fit_transform(df[col])
    
#Remove customer ID column
del df["customerID"]

#Make TotalCharges column numeric; values that cannot be parsed (such as blank strings) become NaN and are then set to 0
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce').fillna(0)

df.head(10)
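
To make the two encodings above concrete, here is a minimal sketch on a toy data frame. The frame `toy` and its values are made up for illustration and are not taken from the Telco data; only the column names mirror it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (an assumption for illustration, not the Telco dataset)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Partner": ["Yes", "No", "No", "Yes"],
})

# One-hot encode the three-level column: one new column per level
toy_encoded = pd.get_dummies(toy, columns=["Contract"])

# Label-encode the two-level column: classes are sorted, so "No" -> 0 and "Yes" -> 1
toy_encoded["Partner"] = LabelEncoder().fit_transform(toy_encoded["Partner"])

print(toy_encoded.columns.tolist())
print(toy_encoded["Partner"].tolist())
```

The three-level column is replaced by three indicator columns, while the two-level column stays a single column of zeros and ones.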

Next we separate the target (y) from the other variables (x).

In [None]:
#Split data into x (independent variables) and y (dependent variable, target)
y = df[["Churn"]]
x = df.drop("Churn", axis=1)
print('The size of the data is:', x.shape, 'and of the target vector: ',y.shape)
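
One detail worth knowing: selecting with double brackets returns a one-column data frame, whereas single brackets return a series. The sketch below shows the difference on a made-up frame (`toy` is hypothetical). Many scikit-learn estimators expect a one-dimensional target and may warn when given a column vector, so the series form can be handy later on.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration)
toy = pd.DataFrame({"tenure": [1, 24, 60], "Churn": [1, 0, 0]})

y_frame = toy[["Churn"]]   # double brackets: DataFrame with one column
y_series = toy["Churn"]    # single brackets: one-dimensional Series

print(y_frame.shape, y_series.shape)
```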

## Part 2: Splitting strategies
Now we can split the data.
1. Holdout split. The training set holds 80% of the observations in the original data and the test set holds the remaining 20%. The split is random; `random_state` fixes the seed so that it is reproducible.

In [None]:
#Create test and training sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size= .2, random_state= 1)
print('The size of the training set is:', x_train.shape, 'with the target vector: ', y_train.shape)
print('The size of the test set is:', x_test.shape, 'with the target vector: ', y_test.shape)
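
With an imbalanced target, a purely random split can leave the two parts with slightly different class proportions. As a possible variation (not part of the original notebook), `train_test_split` accepts a `stratify` argument that preserves the proportions. The sketch below uses synthetic stand-in data; `x_demo`, `y_demo` and the 20% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positives (an assumption for illustration)
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the class proportions the same in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print('Positive rate in the training set:', y_tr.mean())
print('Positive rate in the test set:', y_te.mean())
```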

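2. 10-fold cross-validation. The data is divided into 10 folds, and each fold serves once as the test set while the other nine form the training set. We start by setting up the splits.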

In [None]:
from sklearn.model_selection import KFold # import KFold
kf = KFold(n_splits=10) # Define the split - into 10 folds 
kf.get_n_splits(x) # returns the number of splitting iterations in the cross-validator


We can investigate the sizes of the training and test set in each iteration.

In [None]:
for train_index, test_index in kf.split(x):
    print('TRAIN:', train_index.shape, 'TEST:', test_index.shape)
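
When the number of rows is not divisible by the number of folds, the folds differ in size by at most one. In addition, `KFold` without `shuffle=True` cuts the rows in their original order. The following sketch, on a hypothetical array of 103 rows (`x_demo` is an assumption), shuffles before splitting and checks that every row lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in data with 103 rows (not divisible by 10)
x_demo = np.arange(103).reshape(-1, 1)

# shuffle=True randomizes the row order before the folds are cut
kf_demo = KFold(n_splits=10, shuffle=True, random_state=1)

test_sizes = []
seen = []
for train_idx, test_idx in kf_demo.split(x_demo):
    test_sizes.append(len(test_idx))
    seen.extend(test_idx.tolist())

print('Test fold sizes:', test_sizes)
print('Every row tested exactly once:', sorted(seen) == list(range(103)))
```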


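3. Leave-one-out (LOO) cross-validation. Each observation serves once as a test set of size one, so there are as many splits as there are observations.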

In [None]:
from sklearn.model_selection import LeaveOneOut 
loo = LeaveOneOut()
loo.get_n_splits(x)

# Note: this prints one line per observation in x
for train_index, test_index in loo.split(x):
    print('TRAIN:', train_index.shape, 'TEST:', test_index.shape)
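
Because leave-one-out produces one split per observation, printing every split on a large dataset gives very long output. A minimal sketch of how one might inspect only the first few splits, using a hypothetical six-row array (`x_demo` is an assumption) and `itertools.islice`:

```python
from itertools import islice

import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical stand-in data with 6 rows
x_demo = np.arange(6).reshape(-1, 1)

loo_demo = LeaveOneOut()
n_splits = loo_demo.get_n_splits(x_demo)  # one split per observation
print('Number of splits:', n_splits)

# Take only the first three splits instead of iterating over all of them
first_three = list(islice(loo_demo.split(x_demo), 3))
for train_idx, test_idx in first_three:
    print('TRAIN:', train_idx, 'TEST:', test_idx)
```

Each test set contains a single index and the training set contains all the others.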
   