# I. Introduction to Scikit-learn

### Step 1. Import the necessary libraries

In [68]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

### Step 2. Import one of the toy datasets (digits) from Scikitlearn
This is a copy of the test set of the UCI ML hand-written digits datasets https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where each class refers to a digit.

Each datapoint is a 8x8 image of a digit, Classes=10, Samples per class~180, 

Samples total=1797, Dimensionality=64, and Features=(integers 0-16)


After importing this dataset, split it into test and train sets. You may check the shape of data and the target attributes of the dataset. You may also want to print a few samples from the dataset.

In [69]:
from sklearn.datasets import load_digits
dataset_digits=load_digits()
dataset_digits.keys()


dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

In [70]:
import pandas as pd
df=pd.DataFrame(dataset_digits["data"])
df.head(5)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


In [71]:
dataset_digits["target_names"]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# II. Sklearn API for model training
-------------------
### Step 1. Import your model class

As an example, let us `LinearSVC`, a linear support vector classifier. This classifier is imprted from `sklearn.svm` module which includes Support Vector Machine algorithms.

In [72]:
from sklearn.svm import LinearSVC


### Step2. Instantiate an object and set the parameters

In [73]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(dataset_digits['data'], dataset_digits['target'])

### Step 3. Fit the model
When fitting the model, use the train dataset.

In [74]:
LinearSVC()
y=LinearSVC().fit(X_train, y_train)

### Step 4. Predict and Evaluate
Use the test set for this purpose, for now.

In [75]:
y.predict(X_test)
y.score
print("Accuracy on training set: {:.3f}".format(y.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(y.score(X_test, y_test)))

Accuracy on training set: 0.996
Accuracy on test set: 0.956


What happens with the cross validation?


In [76]:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import scale

cancer_data=load_breast_cancer()
cancer_data.keys()

X,y= cancer_data.data, cancer_data.target
X=scale(X)

X_trainval, X_test, y_trainval, y_test=train_test_split(X,y)
X_train, X_val, y_train, y_val=train_test_split(X_trainval, y_trainval)
print( "length: ", len(X), len(X_trainval),  len(X_train), len(X_val), len(X_test),)
print( "length: ", len(X), len(X_trainval)/len(X)*100,  len(X_train)/len(X)*100, len(X_val)/len(X)*100, len(X_test)/len(X)*100)

length:  569 426 319 107 143
length:  569 74.86818980667839 56.06326889279437 18.804920913884008 25.13181019332162


In [77]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Train: {: .3f} ".format(knn.score(X_train,y_train)))
print("Validation: {: .3f} ".format(knn.score(X_val,y_val)))
print("Test: {: .3f} ".format(knn.score(X_test,y_test)))


Train:  0.975 
Validation:  0.972 
Test:  0.958 


In [78]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

print("Train: {: .3f} ".format(knn.score(X_train,y_train)))
print("Validation: {: .3f} ".format(knn.score(X_val,y_val)))
print("Test: {: .3f} ".format(knn.score(X_test,y_test)))

Train:  1.000 
Validation:  0.953 
Test:  0.916 


# Overfitting the Validation DataSet 
To see the overfitting the validation dataset, a noise is added to the training data on a knn with n_neighbor=5

In [79]:
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

validation_resultset=[]
training_resultset=[]
test_resultset=[]

for j in range(1000):
    rng=np.random.RandomState(j)
    noise=rng.normal(scale=.1, size=X_train.shape)
    knn=KNeighborsClassifier(n_neighbors=5).fit(X_train + noise, y_train)
#    training_resultset.append(knn.score(X_train,y_train))
    validation_resultset.append(knn.score(X_val,y_val))
    test_resultset.append(knn.score(X_test,y_test))
 

#print(" Best score: training {:.3f}".format(np.max(training_resultset)))
print(" Best score: validation {:.3f}".format(np.max(validation_resultset))) 
print(" Best score: test {:.3f}".format(np.max(test_resultset)))

print(" Best epoch: {:.3f}".format(np.argmax(validation_resultset)))

print(" Test score when best neighbor score of val: {:.3f}".format(test_resultset[np.argmax(validation_resultset)]))
#print(" Training score when best neighbor score of val: {:.3f}".format(training_resultset[np.argmax(validation_resultset)]))


d={'training': training_resultset, 'test': test_resultset, 'validation': validation_resultset }
d2={'test': test_resultset, 'validation': validation_resultset }
TrainingScore= pd.DataFrame(d2)
plt.plot(TrainingScore)
plt.xlabel("Epoch")
plt.ylabel("Accuracy Score")
plt.title("Finding the best epoch for KNN n=5")
plt.legend(loc="upper left")
plt.show()


 Best score: validation 0.981
 Best score: test 0.986
 Best epoch: 6.000
 Test score when best neighbor score of val: 0.958


TypeError: 'str' object is not callable

Issues with the above:    

1. The lecture results different: <br/>
Validation: 1.00 <br/>
Test: 0.958 <br/>
https://youtu.be/a6r_7OWwLII?t=3819
2. Legend does not show on the plot 


In [66]:
d=[]
neighbors=np.arange(2,len(X_train))
for i in neighbors:
        knn=KNeighborsClassifier(n_neighbors=i).fit(X_train , y_train)
        d.append({
            'Training' : knn.score(X_train,y_train),
            'Validation' : knn.score(X_val,y_val),
            'Test' : knn.score(X_test,y_test)
        })

import matplotlib.pyplot as plt
TrainingScore= pd.DataFrame(d)
TrainingScore.plot()
plt.title("Accuracy of n neighbors ")
plt.xlabel("n neighbor")
plt.ylabel("Score/Accuracy")
print(" Best n neighbors : {:.3f}".format(neighbors[np.argmax(TrainingScore["Validation"])]))
print(" Best n neighbors value: {:.3f}".format(np.max(TrainingScore["Validation"])))
print(" Best n neighbors value: {:.3f}".format(np.max(TrainingScore["Test"])))

ValueError: query data dimension must match training data dimension

In [65]:
import matplotlib.pyplot as plt

val_scores=[]
# neighbors=np.arange(2,len(X_train)-4)
#  Best n neighbor : 3
#  Best n neighbors value: 0.971963
#  Best n neighbors value: 0.979021
neighbors=np.arange(1,15,2)
for i in neighbors:
        knn=KNeighborsClassifier(n_neighbors=i).fit(X_train , y_train)
        val_scores.append(knn.score(X_val,y_val))

TrainingScore= pd.DataFrame(val_scores, columns=['Validation DataSet'])
TrainingScore.plot()
plt.title("Accuracy of n neighbors ")
plt.xlabel("n neighbor")
plt.ylabel("Score/Accuracy")
plt.legend='loc=upper left'
best_neighbor=neighbors[np.argmax(val_scores)]
print(" Best n neighbor :",(best_neighbor))
knn=KNeighborsClassifier(n_neighbors=best_neighbor)
knn.fit(X_trainval, y_trainval)
print(" Best validation score : {:.3f}".format(np.max(val_scores)))
print(" Best n neighbors value: {:.3f}".format(knn.score(X_test, y_test)))

ValueError: query data dimension must match training data dimension

Problems above: <br/>

1.  The lecture suggested 11 is the best answer with : 
    ```
    0.991 best validation score, 
    0.951 test score  <br/>
    ```
    Mine got : 
    ```
    Best n neighbor : 3
    Best validation score : 0.971963
    Best n neighbors value: 0.979021
    ```
2. Why we are redefining the fit operation, and running on trainval, before we test on test data?

# Decision Tree from Machine Learning with Python (Ch2)

In [47]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

Accuracy on training set: 1.000
Accuracy on test set: 0.937


The accuracy on the training set is 100%—because the leaves are pure, the tree was grown deep enough that it could perfectly memorize all the labels on the training data. If we select max depth 4, the model will generalise better on the test data it has not seen before.

In [50]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

Accuracy on training set: 0.984
Accuracy on test set: 0.916


# Visualise

In [51]:
from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["1", "2"],
                feature_names=dataset_digits.feature_names, impurity=False, filled=True)

        

AttributeError: feature_names

In [52]:
from sklearn.tree import export_graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)

In [33]:
import graphviz

with open("tree.dot") as f:
    dot_graph = f.read()
display(graphviz.Source(dot_graph))

ModuleNotFoundError: No module named 'graphviz'

### Step5. Try another Algorithm
Try `RandomForestCLassifier` this time, import it from `sklearn.ensemble` module.

In [53]:
from sklearn.datasets import load_digits
data_digits=load_digits()
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
data_digits.keys()

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

In [59]:
X_train, X_test, y_train, y_test = train_test_split(data_digits["data"], data_digits["target"], random_state=0)

In [None]:
Random state means split data

In [56]:
import numpy as np
np.bincount(data_digits.target)

array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180])

In [81]:
from sklearn.svm import LinearSVC
svm=LinearSVC()
svm.fit(X_train, y_train)
score_training=svm.score(X_train, y_train)
score_test=svm.score(X_test, y_test)
print("Accuracy training : {:.3f}".format(score_training))
print("Accuracy test : {:.3f}".format(score_test))

Accuracy training : 0.994
Accuracy test : 0.972


In [84]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train, y_train)
score_rfc= rfc.score(X_train, y_train)
score_rfc_test=rfc.score(X_test, y_test)
print("Accuracy training: {:.3f}".format(score_rfc))
print("Accuracy test: {:.3f}".format(score_rfc_test))


Accuracy training: 1.000
Accuracy test: 0.951


In [86]:
from sklearn.ensemble import RandomForestClassifier
rfc2=RandomForestClassifier(max_depth=4)
rfc2.fit(X_train, y_train)
score_rfc= rfc2.score(X_train, y_train)
score_rfc_test=rfc2.score(X_test, y_test)
print("Accuracy training: {:.3f}".format(score_rfc))
print("Accuracy test: {:.3f}".format(score_rfc_test))


Accuracy training: 0.997
Accuracy test: 0.951


Questions :
1. Why LinearSVC is better? Could we change the parameters on rfc to behave better?

# II. Cross-validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. 

When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

A model is trained using  of the folds as training data; the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
(Check https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)

In [96]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn= KNeighborsClassifier(n_neighbors=5)
scores=cross_val_score(knn, X_train, y_train, cv=5)
print("Accuracy : {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()*2))



Accuracy : 0.956 +/- 0.072


In [98]:
!jupyter kernelspec list

Available kernels:
  python3    /Users/ebrucucen/opt/anaconda3/share/jupyter/kernels/python3


Question:  </br>
1. The lecture using knn default value, get the result of accuracy `.986` with `.015` interval (.85 confidence? )
2. Does `.072` mean % 93 confidence?


Grid Searches
=================
Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

See (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

Grid-Search with build-in cross validation

A GridSearchCV object behaves just like a normal classifier.