### 1.0 Split into Train and Test Sets:
The simplest method that we can use to evaluate the performance of a machine learning
algorithm is to use different training and testing datasets. We can take our original dataset and
split it into two parts. Train the algorithm on the first part, make predictions on the second
part and evaluate the predictions against the expected results. The size of the split can depend
on the size and specifics of your dataset, although it is common to use 67% of the data for
training and the remaining 33% for testing.

In [62]:
# Evaluate using a train and a test set
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
filename = 'pima.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')  # liblinear was the default solver before scikit-learn 0.22
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)

print("Accuracy: ", result*100.0) 

Accuracy:  75.5905511811
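
The split performed in the cell above can be sketched in plain Python to show what happens to the row indices. This is an illustration only, not scikit-learn's implementation: the helper name `split_indices` is hypothetical, its shuffle will not match `train_test_split` for the same seed, and the row count of 768 is an assumption (the usual size of the Pima Indians diabetes file).

```python
import math
import random

def split_indices(n_rows, test_size=0.33, seed=7):
    """Shuffle row indices and split them into train and test lists (illustrative helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # seeded so the split is repeatable
    n_test = math.ceil(n_rows * test_size)  # round the test share up to a whole row
    return indices[n_test:], indices[:n_test]

# Assumed row count: 768 rows in the dataset
train_idx, test_idx = split_indices(768)
print(len(train_idx), len(test_idx))  # 514 254
```

Under that assumption the test set holds 254 rows, which is consistent with the printed accuracy: 192 correct predictions out of 254 is 75.59%.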


### K-Fold Cross Validation:
Cross validation is an approach that you can use to estimate the performance of a machine
learning algorithm with less variance than a single train-test set split. It works by splitting
the dataset into k-parts (e.g. k = 5 or k = 10). Each split of the data is called a fold. The
algorithm is trained on k − 1 folds with one held back and tested on the held back fold. This is
repeated so that each fold of the dataset is given a chance to be the held back test set. After
running cross validation you end up with k different performance scores that you can summarize
using a mean and a standard deviation.
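
The procedure described above can be sketched in plain Python. This is a simplified illustration using a hypothetical helper, `kfold_indices`, with consecutive, unshuffled folds; it is not scikit-learn's code, and the tiny dataset of 10 rows is made up to keep the output readable.

```python
def kfold_indices(n_rows, k):
    """Yield (train, test) index lists for k consecutive, unshuffled folds (illustrative helper)."""
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size

folds = list(kfold_indices(n_rows=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row index appears in exactly one test fold, so every observation is used for testing once and for training k − 1 times.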

In [74]:
# Evaluate using Cross Validation
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
filename = 'pima.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
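num_folds = 10
# Without shuffling, folds are consecutive blocks of rows, so no random seed is needed
# (recent scikit-learn versions raise an error if random_state is set while shuffle=False)
kfold = KFold(n_splits=num_folds)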
model = LogisticRegression(solver='liblinear')  # liblinear was the default solver before scikit-learn 0.22
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: ", (results.mean()*100.0, results.std()*100.0)) 

Accuracy:  (76.951469583048521, 4.8410519245671946)


The output reports both the mean and the standard deviation of the accuracy across the 10
folds. When summarizing performance measures, it is good practice to describe their
distribution rather than a single number. Here we assume the scores are roughly Gaussian (a
reasonable working assumption) and record the mean and standard deviation.
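
As a small sketch of that summary step, the snippet below computes the mean and standard deviation of a list of per-fold scores using only the standard library. The scores are hypothetical values made up for illustration, not results from the dataset above. Note that NumPy's `std()` divides by n by default, which corresponds to `statistics.pstdev` here rather than `statistics.stdev`.

```python
import statistics

# Hypothetical per-fold accuracies, made up for illustration
scores = [0.70, 0.80, 0.75, 0.85, 0.65]

mean = statistics.mean(scores)
pop_std = statistics.pstdev(scores)    # divides by n, like NumPy's default std()
sample_std = statistics.stdev(scores)  # divides by n - 1, so it is slightly larger

print(f"Accuracy: {mean * 100:.2f}% (+/- {pop_std * 100:.2f}%)")
```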

### 1.3 Leave One Out Cross Validation:
You can configure cross validation so that each fold contains a single observation (k is set to
the number of observations in your dataset). This variation is called leave-one-out cross
validation. The result is one performance measure per observation, which can be summarized to
give a reasonable estimate of the accuracy of your model on unseen data. A downside is that it
is computationally more expensive than k-fold cross validation with a small k, because the
model is trained once per observation. The example below uses leave-one-out cross validation.

In [None]:
# Evaluate using Leave One Out Cross Validation
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
filename = 'pima.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# LeaveOneOut needs no fold count: it creates one fold per observation
loocv = LeaveOneOut()
model = LogisticRegression(solver='liblinear')  # liblinear was the default solver before scikit-learn 0.22
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: ", (results.mean()*100.0, results.std()*100.0))
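
One thing to keep in mind when reading the output of this cell: each leave-one-out fold holds a single observation, so every per-fold accuracy is either 0 or 1. The standard deviation of such scores is fixed by the mean accuracy p as sqrt(p(1 − p)), so it will look large and says little about how stable the estimate is. The sketch below checks this relationship on hypothetical outcomes made up for illustration.

```python
import math
import statistics

# Hypothetical leave-one-out outcomes: 1.0 = held-out row predicted correctly, 0.0 = not
results = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]

p = statistics.mean(results)      # overall accuracy
std = statistics.pstdev(results)  # population standard deviation of the 0/1 scores

print(f"Accuracy: {p * 100:.1f}% (std {std * 100:.1f}%)")
print(math.isclose(std, math.sqrt(p * (1 - p))))  # True
```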