## Training more efficiently

We can estimate how well a model will perform, without touching the test set, by using a technique called cross-validation.  Basically:

1. Shuffle the training set and split it into k parts.
2. Train on (k-1) parts and test on the remaining one, which we call the validation set.  The validation score is the one we are interested in.
3. Repeat, setting aside a different part each time, until every part has served as the validation set once.
4. We will end up with k different fitted models and k scores.  We average the scores to estimate how well this kind of model performs.
5. Repeat the whole process on a new model, then compare the averages to find the best one.
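
The steps above can be sketched by hand as follows. This is only an illustrative sketch, not what `cross_val_score` does internally: the helper name `manual_cross_val` and its `k` and `seed` parameters are made up for this example, and the splitting is done with scikit-learn's `KFold` with shuffling turned on.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def manual_cross_val(model, X, Y, k=5, seed=2022):
    """Hypothetical helper: return the k validation accuracies for `model`."""
    # Step 1: shuffle the rows and split them into k parts
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X):
        # Steps 2-3: train on k-1 parts, score on the part that was set aside
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], Y[train_idx])
        Y_val_pred = fold_model.predict(X[val_idx])
        scores.append(accuracy_score(Y[val_idx], Y_val_pred))
    return np.array(scores)


digits = datasets.load_digits()
scores = manual_cross_val(
    tree.DecisionTreeClassifier(max_depth=3, random_state=2022),
    digits["data"],
    digits["target"],
)

# Step 4: average the k validation scores
print("Scores per fold:", scores)
print("Mean validation accuracy:", scores.mean())
```

The numbers from this sketch will not match `cross_val_score` exactly. For a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, so the rows land in different parts.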


In [1]:
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn import neighbors
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import datasets

## data

digits = datasets.load_digits()

X = digits['data']   
Y = digits['target']

In [2]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2022)

In [16]:
## tree model of depth 3:
tree_model = tree.DecisionTreeClassifier(max_depth=3, random_state=2022)

## Train on 4/5 of the training data and test on the remaining 1/5; do this 5 times
CV_score = cross_val_score(tree_model, X_train, Y_train, cv=5)
 
print("We have 5 scores: ", CV_score)
print("Averaging them gives us an accuracy score of:", CV_score.mean())

We have 5 scores:  [0.4604811  0.48442907 0.47222222 0.41403509 0.48239437]
Averaging them gives us an accuracy score of: 0.4627123683078011


In [18]:
## when we actually fit the model on the whole training set and score it on the test set, we get:
tree_1 = tree.DecisionTreeClassifier(max_depth=3, random_state=2021) 
tree_1 = tree_1.fit(X_train, Y_train)

Y_pred = tree_1.predict(X_test)
Y_pred_train = tree_1.predict(X_train)

print("Training Accuracy is ", accuracy_score(Y_train, Y_pred_train))
print("Testing Accuracy is ", accuracy_score(Y_test, Y_pred))

Training Accuracy is  0.4954766875434934
Testing Accuracy is  0.45


In [4]:
## try a better model: k-nearest neighbors with 15 neighbors and Euclidean distance (p=2)
cupcake = neighbors.KNeighborsClassifier(n_neighbors=15, p=2)
CV_score = cross_val_score(cupcake, X_train, Y_train, cv=5)
 
print("We have 5 scores: ", CV_score)
print("Averaging them gives us an accuracy score of:", CV_score.mean())

We have 5 scores:  [0.97222222 0.98263889 0.97909408 0.96515679 0.97909408]
Averaging them gives us an accuracy score of: 0.9756412117692606
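
The last step of the procedure, repeating the process for each model and comparing, can be written as one loop. The sketch below assumes the same data split and the same two models as above. The names `candidates`, `mean_scores`, `best_name` and `best_model` are introduced here for illustration only.

```python
from sklearn import datasets, neighbors, tree
from sklearn.model_selection import cross_val_score, train_test_split

digits = datasets.load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits["data"], digits["target"], test_size=0.2, random_state=2022
)

# Candidate models to compare (same settings as in the cells above)
candidates = {
    "tree_depth_3": tree.DecisionTreeClassifier(max_depth=3, random_state=2022),
    "knn_15": neighbors.KNeighborsClassifier(n_neighbors=15, p=2),
}

# Mean 5-fold cross-validation accuracy for each candidate, training set only
mean_scores = {
    name: cross_val_score(model, X_train, Y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Pick the candidate with the highest mean validation accuracy
best_name = max(mean_scores, key=mean_scores.get)

# Refit the winner on the whole training set, then score it once on the test set
best_model = candidates[best_name].fit(X_train, Y_train)
test_accuracy = best_model.score(X_test, Y_test)

print("Mean CV accuracy per model:", mean_scores)
print("Best model:", best_name)
print("Test accuracy of best model:", test_accuracy)
```

The test set is used only once, after the choice has been made, so the final test accuracy is not influenced by the model comparison.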
