<a href="https://colab.research.google.com/github/yohanesnuwara/machine-learning/blob/master/03_resampling/resampledata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Algorithm Evaluation Methods Using Resampling**

The goal of resampling methods is to make the best use of your training data in order to accurately estimate the performance of a model on new unseen data.

1. Train and test split
2. k-fold cross-validation split

In [1]:
!git clone https://github.com/yohanesnuwara/machine-learning

Cloning into 'machine-learning'...

## Resampling Method 1: Train and test split

Split the dataset into train and test sets according to a **split percentage**. A good default is 0.6 (60% of the dataset for training and 40% for testing).

**Limitation**: a single split gives a noisy estimate of algorithm performance, because the score depends on which rows happen to land in the test set.

In [0]:
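from random import randrange

# Split a dataset into a train and test set
def train_test_split(dataset, split=0.60):
  train = list()
  train_size = split * len(dataset)  # target number of training rows
  dataset_copy = list(dataset)       # copy so the original list is not modified
  while len(train) < train_size:
    # move a randomly chosen row from the copy into the train set
    index = randrange(len(dataset_copy))
    train.append(dataset_copy.pop(index))
  return train, dataset_copy         # the remaining rows form the test set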

Apply the function to a dataset consisting of 10 rows.

In [17]:
from random import seed

# test train/test split
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
train, test = train_test_split(dataset)
print(train)
print(test)

[[3], [2], [7], [1], [8], [9]]
[[4], [5], [6], [10]]
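
To illustrate the limitation noted above, the sketch below repeats a 60/40 split with several random seeds and scores the same trivial predictor on each test set. The helper names `split_rows` and `majority_accuracy`, the toy labelled rows, and the majority-class predictor are assumptions made for illustration; they are not part of the notebook's code.

```python
from random import Random

def split_rows(rows, split=0.60, rng=None):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = rng or Random()
    pool = list(rows)
    rng.shuffle(pool)
    cut = int(round(split * len(pool)))
    return pool[:cut], pool[cut:]

def majority_accuracy(train, test):
    """Always predict the most common training label; return test accuracy."""
    labels = [row[-1] for row in train]
    majority = max(set(labels), key=labels.count)
    return sum(row[-1] == majority for row in test) / len(test)

# Toy data (assumption): 30 rows, label 1 for every third row
rows = [[i, int(i % 3 == 0)] for i in range(30)]

scores = []
for s in range(5):
    train, test = split_rows(rows, 0.60, Random(s))
    scores.append(majority_accuracy(train, test))

# The predictor is identical each time; only the split changes
print(scores)
```

Any spread in the printed scores comes purely from which rows fell into the test set, which is the noise a single split cannot reveal.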


## Resampling Method 2: k-fold Cross-Validation Split

Split the dataset into `k` groups (folds). The algorithm is then trained and evaluated `k` times, each time holding out a different fold as the test set and training on the remaining `k - 1` folds; the mean of the `k` scores is reported as the performance. Good defaults are `k = 3` for a small dataset and `k = 10` for a larger one.

In [0]:
from random import randrange

# Split a dataset into k folds
def cross_validation_split(dataset, folds=3):
  dataset_split = list()
  dataset_copy = list(dataset)
  # integer fold size: any leftover rows are not assigned to a fold
  fold_size = int(len(dataset) / folds)
  for i in range(folds):
    fold = list()
    while len(fold) < fold_size:
      # draw rows without replacement so no row appears in two folds
      index = randrange(len(dataset_copy))
      fold.append(dataset_copy.pop(index))
    dataset_split.append(fold)
  return dataset_split

Apply the function to a dataset consisting of 10 rows. With 10 rows and 3 folds, each fold holds 3 rows and one row is left out.

In [30]:
from random import seed

# test cross validation split
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, 3)
print(folds)

[[[3], [2], [7]], [[1], [8], [9]], [[10], [6], [5]]]
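
A minimal sketch of how such folds are typically consumed: each fold serves once as the test set while the remaining folds are merged into the training set, and the per-fold scores are averaged. The folds are copied from the output above; the mean-value predictor and the `evaluate_fold` helper are hypothetical stand-ins for a real model and metric.

```python
folds = [[[3], [2], [7]], [[1], [8], [9]], [[10], [6], [5]]]

def evaluate_fold(train, test):
    """Predict the training mean for every test row; return mean absolute error."""
    prediction = sum(row[0] for row in train) / len(train)
    return sum(abs(row[0] - prediction) for row in test) / len(test)

scores = []
for i, test in enumerate(folds):
    # Merge every fold except the i-th into one training set
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    scores.append(evaluate_fold(train, test))

mean_score = sum(scores) / len(scores)
print("Per-fold error:", scores)
print("Mean error:", mean_score)
```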


# Resampling Methods Using `scikit-learn`

Resource: [Machine Learning Mastery](https://machinelearningmastery.com/evaluate-performance-machine-learning-algorithms-python-using-resampling/)

## Train-Test Split

Apply the train-test split to the `pima-indians-diabetes` dataset.

In [53]:
# Evaluate using a train and a test set
import pandas

url = "/content/machine-learning/datasets/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
dataframe

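| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |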


In [51]:
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

array = dataframe.values
X = array[:,0:8] 
Y = array[:,8] # the class column (0 or 1) is the target
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 78.740%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
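
The warning suggests scaling the data or raising `max_iter`. One way to do both is sketched below, assuming scikit-learn is installed. Synthetic data from `make_classification` stands in for the Pima dataset so the block runs on its own, which means the accuracy it prints is not comparable to the figure above; the `max_iter=1000` value is an arbitrary illustrative choice.

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7)

# Standardize features, then fit logistic regression with a higher iteration cap.
# The pipeline fits the scaler on the training rows only, so no test-set
# statistics leak into training.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)
demo_accuracy = scaled_model.score(X_te, y_te)
print("Accuracy: %.3f%%" % (demo_accuracy * 100.0))
```

The same pipeline object can be passed to `model_selection.cross_val_score` in place of a bare `LogisticRegression()`, in which case the scaler is refit inside each fold.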


## k-fold Cross-Validation

In [66]:
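# random_state only takes effect when shuffle=True; recent scikit-learn versions
# raise an error if it is set while shuffle is False, so it is omitted here
kfold = model_selection.KFold(n_splits=6)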
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.953% (4.053%)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

## Repeated Train-Test Split

Another variation on k-fold cross-validation is to create a random train/test split as described above, but to repeat the splitting and evaluation multiple times, as in cross-validation.

In [0]:
# Unsuccessful
# Evaluate using Shuffle Split Cross Validation

test_size = 0.33
seed = 7
# 10 independent random splits, each holding out 33% of the rows for testing
kfold = model_selection.ShuffleSplit(n_splits=10, test_size=test_size, random_state=seed)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
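
A plain-Python sketch of the same idea, for intuition only: it draws several independent random splits and counts how often each row ends up in a test set. Unlike k-fold, where every row in a fold is tested exactly once, rows here can be tested several times or not at all. The `repeated_splits` helper and the toy row indices are hypothetical and do not reproduce scikit-learn's implementation.

```python
from collections import Counter
from random import Random

def repeated_splits(n_rows, n_splits=10, test_size=0.33, seed=7):
    """Yield (train_idx, test_idx) pairs from independent random shuffles."""
    rng = Random(seed)
    n_test = max(1, int(round(test_size * n_rows)))
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        # First n_test shuffled indices form the test set, the rest the train set
        yield indices[n_test:], indices[:n_test]

test_counts = Counter()
for train_idx, test_idx in repeated_splits(n_rows=12, n_splits=10):
    test_counts.update(test_idx)

# Number of times each row was used for testing across the 10 splits
for row in range(12):
    print(row, test_counts[row])
```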

## Leave One Out Cross Validation (LOOCV)

You can configure cross validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross validation is called leave-one-out cross validation.

In [0]:
# Unsuccessful

# LeaveOneOut takes no fold count: it creates one split per row of X
loocv = model_selection.LeaveOneOut()
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
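
For intuition, leave-one-out splitting can be written in a few lines of plain Python: with `n` rows it yields `n` splits, each holding out a single row. The `leave_one_out` generator below is a hypothetical illustration, not scikit-learn's implementation.

```python
def leave_one_out(rows):
    """Yield (train, test) pairs where test holds exactly one row."""
    for i in range(len(rows)):
        yield rows[:i] + rows[i + 1:], [rows[i]]

rows = [[1], [2], [3], [4]]
splits = list(leave_one_out(rows))
for train, test in splits:
    print(train, test)
```

Because the model is refit once per row, LOOCV on a dataset with several hundred rows means several hundred fits, which makes it much slower than k-fold with a small `k`.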