**K-Fold is a validation technique in which we divide the data into k subsets (folds) and repeat the holdout method k times. In each round, one fold serves as the test set and the remaining k-1 folds serve as the training set.**

**Because every sample is used for testing exactly once, the averaged score depends far less on how any single train/test split happens to be chosen.**



In [37]:
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier 
import numpy as np
from sklearn.datasets import load_digits

`Digits is a collection of images of handwritten digits (0 to 9). Each feature is the intensity of one pixel in an 8 x 8 image, so every sample has 64 features.`

In [38]:
digits = load_digits()
print(dir(digits))  # dir() lists the attributes of the object

['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']
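To make the attribute list above more concrete, here is a small sketch that inspects the shapes of `data` and `images`. The variable names `first_image_flat` and `same_pixels` are introduced only for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each sample is an 8x8 image; `data` stores it flattened into 64 features
print(digits.data.shape)
print(digits.images.shape)

# The flattened row and the 2-D image hold the same pixel values
first_image_flat = digits.images[0].reshape(-1)
same_pixels = (first_image_flat == digits.data[0]).all()
print(same_pixels)

# The class labels are the ten digits
print(digits.target_names)
```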


`model_selection is the scikit-learn module that provides tools for splitting data, cross-validating models, and tuning hyperparameters. To estimate how well a model will predict on new data, we train it on one part of the dataset and then evaluate it on a part it has not seen.`

`Sklearn model_selection has a function called train_test_split that splits data arrays into two subsets: training data and testing data, so we don't have to divide the dataset manually. By default, train_test_split shuffles the data and creates random partitions, so the split changes from run to run unless random_state is set.`

In [39]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size = 0.3)

*The len() function returns the number of items in an object. For the arrays used here, that is the number of samples (rows).*

In [40]:
print(len(x_train))

1257


In [41]:
print(len(y_test))

540
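The split above is random, so the samples in each subset change on every run. As a minimal sketch, assuming we want a reproducible split that also keeps the class proportions similar in both subsets, `random_state` and `stratify` can be passed. The names `x_tr`, `x_te`, `y_tr`, `y_te` and the seed value are chosen here for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# random_state fixes the shuffle; stratify keeps class proportions similar
x_tr, x_te, y_tr, y_te = train_test_split(
    digits.data,
    digits.target,
    test_size=0.3,
    random_state=0,
    stratify=digits.target,
)

# The subset sizes match the 70/30 split used above
print(len(x_tr), len(x_te))
```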


`Logistic regression is a Machine Learning classification algorithm that predicts the probability of each class of a categorical dependent variable. In its basic form the dependent variable is binary, coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). scikit-learn's LogisticRegression also handles multiclass targets such as the ten digit classes used here.`

In [42]:
lr =  LogisticRegression()
lr.fit(x_train, y_train)
lr.score(x_test, y_test)*100

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


94.81481481481482
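The warning above suggests raising `max_iter` or scaling the data. A minimal sketch of both, assuming a `StandardScaler` in a pipeline and an illustrative `max_iter=1000` (neither appears in the original notebook, and the resulting score will differ from the one shown above):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
x_tr, x_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Scale the features, then fit logistic regression with a larger iteration budget
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(x_tr, y_tr)

acc = scaled_lr.score(x_te, y_te)
print(acc * 100)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which matters later when the same estimator is evaluated fold by fold.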

`A support vector classifier (SVC) fits the data we provide and returns a decision boundary that divides or categorizes our data; with a linear kernel this boundary is a "best fit" hyperplane. Note that SVC() as called below uses scikit-learn's default RBF kernel, not a linear one. Once the classifier is fitted, we can feed it some features to see what the predicted class is.`

In [43]:
sm = SVC()
sm.fit(x_train, y_train)
sm.score(x_test, y_test)*100

99.25925925925925
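To see which kernel `SVC()` uses by default and how a linear kernel compares, the following sketch fits both on one fixed split. The split seed and the `kernel_scores` dictionary are assumptions made for this illustration, so the numbers will not match the cell above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
x_tr, x_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# The default kernel of SVC is the RBF kernel
print(SVC().kernel)

# Fit one classifier per kernel and record the test accuracy in percent
kernel_scores = {}
for kernel in ("rbf", "linear"):
    clf = SVC(kernel=kernel)
    clf.fit(x_tr, y_tr)
    kernel_scores[kernel] = clf.score(x_te, y_te) * 100

print(kernel_scores)
```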

`A random forest classifier is an ensemble classification algorithm. It is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting.`


In [44]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
rf.score(x_test, y_test)*100 

98.14814814814815
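Because a random forest is randomized, its score can change slightly between runs on the same data. The sketch below, assuming a hypothetical helper `rf_score` and arbitrary seed values, shows that fixing `random_state` makes the result repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
x_tr, x_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)


def rf_score(seed):
    """Fit a forest with the given seed and return test accuracy in percent."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(x_tr, y_tr)
    return model.score(x_te, y_te) * 100


# Same seed twice: identical forests, identical scores
same_seed_scores = [rf_score(42), rf_score(42)]
# A different seed may give a slightly different score
other_seed_score = rf_score(7)

print(same_seed_scores, other_seed_score)
```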

`K-fold cross-validation is commonly used for model selection and hyperparameter tuning. A typical workflow is to split the data into training and test sets, apply K-fold cross-validation to the training set to compare candidate models, and select the one with the best average performance.`
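One way to sketch that tuning workflow is scikit-learn's `GridSearchCV`, which runs cross-validation on the training set for every parameter combination. The parameter grid, the split seed, and the choice of `SVC` below are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = load_digits()
x_tr, x_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Candidate hyperparameters (an arbitrary small grid for demonstration)
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

# cv=3 runs 3-fold cross-validation on the training set for each combination
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(x_tr, y_tr)

print(search.best_params_)
print(search.best_score_ * 100)

# The held-out test set is used only once, for the final estimate
test_acc = search.score(x_te, y_te)
print(test_acc * 100)
```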

In [45]:
from sklearn.model_selection import KFold
kf = KFold(n_splits = 3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [46]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)  

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
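The output above suggests that every position lands in the test set exactly once and never appears in both sets of the same round. A short sketch that checks this property directly (the names `kf_demo` and `test_counts` are introduced here for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(1, 10)
kf_demo = KFold(n_splits=3)

# Count how many times each position is used as a test sample
test_counts = np.zeros(len(data), dtype=int)
overlaps = 0
for train_idx, test_idx in kf_demo.split(data):
    # Train and test indices of one round should never overlap
    overlaps += len(np.intersect1d(train_idx, test_idx))
    test_counts[test_idx] += 1

print(test_counts)
print(overlaps)
```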


In [47]:
def get_score(model, x_train, x_test, y_train, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

In [48]:
get_score(LogisticRegression(), x_train, x_test, y_train, y_test)*100


94.81481481481482

In [49]:
get_score(SVC(), x_train, x_test, y_train, y_test)*100

99.25925925925925

*StratifiedKFold is a cross-validator that provides train/test indices for splitting data into train and test sets. It is a variant of KFold that returns stratified folds: each fold preserves, as closely as possible, the percentage of samples of each class.*

In [50]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits = 3)

In [51]:
scores_l = []
scores_svm = []
scores_rf = []

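# Note: this loop iterates over kf (plain KFold) from In [45]. To use the stratified
# splitter instead, iterate over skf.split(digits.data, digits.target).
for train_index, test_index in kf.split(digits.data):
    x_train, x_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]

    # Each model is fitted once here for printing ...
    print('LR:',get_score(LogisticRegression(),x_train, x_test, y_train, y_test)*100)
    print('SVC:',get_score(SVC(),x_train, x_test, y_train, y_test)*100)
    print('RF:',get_score(RandomForestClassifier(),x_train, x_test, y_train, y_test)*100)

    # ... and fitted again here for storing. RandomForestClassifier is randomized,
    # so its stored score can differ from the printed one.
    scores_l.append(get_score(LogisticRegression(),x_train, x_test, y_train, y_test)*100)
    scores_svm.append(get_score(SVC(),x_train, x_test, y_train, y_test)*100)
    scores_rf.append(get_score(RandomForestClassifier(),x_train, x_test, y_train, y_test)*100)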

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LR: 92.32053422370618
SVC: 96.661101836394
RF: 94.32387312186978



LR: 94.15692821368948
SVC: 98.1636060100167
RF: 94.65776293823038



LR: 91.48580968280467
SVC: 95.49248747913188
RF: 93.15525876460768



In [52]:
scores_l

[92.32053422370618, 94.15692821368948, 91.48580968280467]

In [53]:
scores_svm

[96.661101836394, 98.1636060100167, 95.49248747913188]

In [54]:
scores_rf

[92.65442404006677, 94.82470784641069, 93.32220367278798]
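The manual loop above can also be expressed with scikit-learn's `cross_val_score`, which handles the splitting, fitting, and scoring in one call. This is a minimal sketch for the SVC only; the variable name `svc_scores` is introduced here, and a plain 3-fold `KFold` is passed to mirror the loop above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# One accuracy value per fold, using the same unshuffled 3-fold split as above
svc_scores = cross_val_score(
    SVC(), digits.data, digits.target, cv=KFold(n_splits=3)
)

print(svc_scores * 100)
# The mean across folds is the usual single-number summary
print(svc_scores.mean() * 100)
```

Passing an integer such as `cv=3` instead of a `KFold` object makes `cross_val_score` use stratified folds for classifiers.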