# Basic use of scikit-learn

Scikit-learn is a well-known machine learning library. It implements many basic "methods" and is widely used in the community, both in research and in industry.

In this lab exercise, you will take your first steps with this library.

Start by reading this introduction page: https://scikit-learn.org/1.4/tutorial/basic/tutorial.html

**Note:** You may not know some of the methods discussed in the tutorial page. Don't focus on them, but on the general philosophy of Scikit-learn.

In the following, try to explore the data, the different functions, etc., by writing and executing a few lines of Python code! It is a good idea to always prototype simple examples and to edit the code to check what works and what doesn't, what the shapes of the different tensors are, what they contain, etc.
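
As a minimal sketch of this kind of exploration, the lines below inspect a small made-up NumPy array (the name `toy` and its contents are purely illustrative, not part of the lab data). The same calls can be applied to any array you meet later.

```python
import numpy as np

# A small toy array standing in for a dataset: 4 examples, 3 features each
toy = np.arange(12, dtype=float).reshape(4, 3)

print(type(toy))   # what kind of object is it?
print(toy.shape)   # (number of examples, number of features)
print(toy.dtype)   # type of the stored values
print(toy[:2])     # first two rows (examples)
print(toy[:, 0])   # first column (feature 0 of every example)
```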

Questions are there to help you identify what is important. Try to answer them, but you don't need to submit your answers.

**WARNING:** The exam may contain questions about lab exercises.

# 1. Loading pre-formatted data

We will use the Iris dataset as a first example. You can read about this dataset on Wikipedia (page in French): https://fr.wikipedia.org/wiki/Iris_de_Fisher

The following block of code loads the data.

In [None]:
# load the IRIS dataset
from sklearn.datasets import load_iris
irisData=load_iris()
# get info on the dataset
#print(irisData.data)
print(irisData.target)
print(irisData.target_names)
print(irisData.feature_names)
#print(irisData.DESCR)

**Q1:** What type of machine learning problem is that?

**Q2:** How many features are there? What kind of features?

# 2. Plotting the data

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt # import pyplot under the shorter alias "plt"
X=irisData.data
y=irisData.target
xi=0
yi=1

colors=["red","green","blue"] # associate a color to each class label
for num_label in range(3): # for each label
    plt.scatter(
        X[y==num_label][:, xi],
        X[y==num_label][:,yi],
        color=colors[num_label],
        label=irisData.target_names[num_label]
    )
plt.legend()
plt.xlabel(irisData.feature_names[xi]) 
plt.ylabel(irisData.feature_names[yi])
plt.title("Iris Data - size of the sepals only") 
plt.show()

**Q3:** From the previous visualisation, what can you predict about the difficulty of this dataset?

# 3. Classification with k-nearest neighbors

We will now use a k-nearest neighbors classifier on this dataset. Start by reading the documentation page: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
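
To make the idea concrete, here is a rough from-scratch sketch of what a k-nearest neighbors prediction does: compute the distance from the query point to every training point, keep the k closest ones, and take a majority vote among their labels. The function `knn_predict` and the toy arrays are hypothetical names made up for this illustration; the scikit-learn class is more general (other metrics, weighting, faster neighbor search).

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query` (Euclidean distance)."""
    distances = np.linalg.norm(train_points - query, axis=1)
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]              # most frequent label among them


# Two well-separated toy clusters, one per class
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.3, 0.3]), k=3))  # query near the first cluster
print(knn_predict(toy_X, toy_y, np.array([4.8, 5.0]), k=3))  # query near the second cluster
```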

And then study the code below.
Try to use different values for k (called `nb_neighb` in the code).

In [None]:
from sklearn import neighbors

nb_neighb = 15
# to know more about the parameters, type help(neighbors.KNeighborsClassifier)
clf = neighbors.KNeighborsClassifier(nb_neighb)

clf.fit(X, y) # training
print('accuracy on training data is', clf.score(X,y))

# to predict on a specific example
print('class predicted is', clf.predict([[ 5.4, 3.2, 1.6, 0.4]]))
print('proba of each class is', clf.predict_proba([[ 5.4, 3.2, 1.6, 0.4]]))

y_pred = clf.predict(X)
print('misclassified training examples are:',X[y_pred!=y])

**Q4:** What kind of problem do you see with the evaluation?

## 3.1 About training and test sets

If we want a training set and a test set, we can split the data:

In [None]:
X_train, y_train = X[0:100], y[0:100] # 100 examples for training
X_test, y_test = X[100:], y[100:] # rest for testing

**Q5:** Explain why it is a really bad idea to split this iris dataset as we've done.

Here is a much better way to split the data into training and test sets.
Again, start by reading the documentation page of the `train_test_split` function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
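from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    # random.seed() returns None, so the original call was equivalent to random_state=None:
    # a different random split on each run. Use an integer for a reproducible split.
    random_state=None
)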
print(
    'size of train / test = ',
    len(X_train),
    len(X_test)
)

print(
    'nb of training data with class 0/1/2 =',
    len(X_train[y_train==0]),
    len(X_train[y_train==1]),
    len(X_train[y_train==2])
)


You can now train and evaluate on two different parts of the original data.
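
With a purely random split, the number of examples of each class in the training set can change from run to run. As a hedged sketch, the `stratify` parameter of `train_test_split` asks for a split that preserves the class proportions. The variable names below (`iris_X`, `Xs_train`, ...) are chosen for this illustration only, so that they do not overwrite the variables used in the rest of the lab.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_X, iris_y = load_iris(return_X_y=True)

# stratify=iris_y keeps the same class proportions in both parts
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    iris_X, iris_y, test_size=0.3, stratify=iris_y, random_state=0
)

print('train class counts:', sorted(Counter(ys_train.tolist()).items()))
print('test class counts: ', sorted(Counter(ys_test.tolist()).items()))
```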

To display the results, we build a confusion matrix. Start by reading:

- the Wikipedia page: https://en.wikipedia.org/wiki/Confusion_matrix
- the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
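
Before applying it to the classifier, it may help to see the function on a tiny hand-made example. In the sketch below the label lists are invented for illustration; in scikit-learn's convention, row i counts the examples whose true class is i and column j those predicted as class j, with the true labels passed as the first argument.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for six examples and three classes (illustrative only)
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_cm = confusion_matrix(toy_true, toy_pred)  # first argument: true labels
print(toy_cm)
print('row sums (examples per true class):', toy_cm.sum(axis=1))
```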

In [None]:
from sklearn.metrics import confusion_matrix

clf=clf.fit(X_train, y_train)
y_pred =clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n',cm)


**Q6:** What is on the diagonal of the confusion matrix?

**Q7:** What is the real error rate (give details)?

## 3.2 K-fold

The dataset is small, so the test results may vary a lot from one split to another.
One way to alleviate this issue is to evaluate the model on several different train/test splits and to average the results.
This approach is called k-fold cross-validation.
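
As a small sketch of the mechanism, the code below splits six made-up examples into three folds and prints which indices are used for training and for testing in each round. Shuffling is left off here so that the folds are easy to read; `toy_data` and `toy_kf` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_data = np.arange(6)       # six examples, identified by their index
toy_kf = KFold(n_splits=3)    # no shuffling: folds are contiguous blocks

for fold, (train_idx, test_idx) in enumerate(toy_kf.split(toy_data)):
    # each example appears in the test part of exactly one fold
    print('fold', fold, 'train:', train_idx, 'test:', test_idx)
```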

Start by reading the section about this approach in the documentation: https://scikit-learn.org/stable/modules/cross_validation.html#k-fold

And then try to understand the code below.

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

nb_folds = 10
kf=KFold(n_splits=nb_folds,shuffle=True)
score=0
for training_ind,test_ind in kf.split(X):
    #print("training index: ",training_ind,"\ntest index:",test_ind,'\n') 
    X_train=X[training_ind]
    y_train=y[training_ind]
    clf.fit(X_train, y_train)
    X_test=X[test_ind]
    y_test=y[test_ind]
    y_pred = clf.predict(X_test)
    score = score + accuracy_score(y_test, y_pred) # true labels first, then predictions

print('average accuracy:',score/nb_folds)

Or, more compactly, with `cross_val_score`:

In [None]:
from sklearn.model_selection import cross_val_score
t_scores = cross_val_score(clf, X, y, cv=10)
print(t_scores.mean())
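
Note that the compact version is not strictly identical to the loop: when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. A possible way to reproduce the shuffled k-fold of the loop is to pass a `KFold` object as `cv`, as in the sketch below (the names `iris_X`, `knn`, `shuffled_kf` and the seed value are assumptions made for this example).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris_X, iris_y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=15)

# Same splitting strategy as the explicit loop: 10 shuffled folds
shuffled_kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(knn, iris_X, iris_y, cv=shuffled_kf)

print('accuracy per fold:', cv_scores)
print('mean / std:', cv_scores.mean(), cv_scores.std())
```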

# 4. Decision tree

In this second part, we will build a decision tree using scikit-learn.
Start by reading the documentation: https://scikit-learn.org/stable/modules/tree.html#tree

To read the data, we will use the pandas library, which simplifies data manipulation.
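
The separation between feature columns and the label column can first be tried on a tiny in-memory CSV. In this sketch, the three rows and the column names `age` and `chol` are made up for illustration and are not taken from the lab's data file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; only the "target" column name mirrors the lab
toy_csv = io.StringIO("age,chol,target\n63,233,1\n41,204,0\n57,354,1\n")
toy_df = pd.read_csv(toy_csv)

toy_features = toy_df.drop(columns=["target"])  # every column except the label
toy_labels = toy_df["target"]                   # the label column alone

print(toy_features.shape, toy_labels.shape)
print(list(toy_features.columns))
```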

In [None]:
# we will use another dataset (a CSV file). Pandas helps us to read this type of file.

import pandas as pd

data = 'heart.csv'
df = pd.read_csv(data)


X = df.drop(columns=['target'])
y = df['target']


X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y)

features = X.columns
classes = ['Not heart disease','heart disease']

print(features)

df.head()

In [None]:
from sklearn import tree
from graphviz import Source

X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.3,random_state=42)

clf = tree.DecisionTreeClassifier(max_depth=20,criterion='entropy')
clf.fit(X_train, y_train)

graph = Source(tree.export_graphviz(clf, out_file=None,
                                    feature_names=features,
                                    class_names=classes,
                                    filled=True, rounded=True))
graph

If Graphviz is not working with your setup, look at http://people.irisa.fr/Vincent.Claveau/cours/fd/TP1.html

**Q8:** Explain each line displayed in the nodes/leaves of the tree.
    
**Q9:** What is the name of this decision tree according to the course?


Here is another nice visualisation of the decision tree. (The dtreeviz package is available on GitHub. It can be installed with `pip install dtreeviz`. It requires Graphviz to be installed.)

In [None]:
# Depending on the installed dtreeviz version, the import and the function name differ:
# with older versions, uncomment the next line and call dtreeviz(...) instead of model(...)
#from dtreeviz.trees import dtreeviz
from dtreeviz import *
graph = model(clf, X_train, y_train,
                target_name="target",
                feature_names=features,
                class_names=classes
                )

#graph
graph.view()


**Q10:** Explain what the displayed histograms represent.

**Q11:** From the scikit-learn manual, explain what effect `max_depth` or `min_samples_split` will have on the decision tree. If time permits, show the effects experimentally.

### Pruning Tmax

First, check the documentation again: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html

Here, we use a criterion called "Cost Complexity". Cost complexity pruning is all about finding the right value for the parameter alpha. We will get the candidate alpha values for this tree and check the accuracy of the corresponding pruned trees.
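
For reference, the criterion minimised by cost complexity pruning can be written as below, following the formulation given in the scikit-learn documentation on minimal cost-complexity pruning. Here $T$ is a tree, $|\widetilde{T}|$ its number of leaves, and $R(T)$ the total impurity of its leaves (traditionally the misclassification rate; scikit-learn uses the sample-weighted impurity).

```latex
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
```

A larger alpha penalises the number of leaves more strongly, so it favours smaller trees; with alpha equal to 0 no pruning takes place.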

In [None]:
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)

In [None]:
# For each alpha we will append our model to a list
t_clf = []
for ccp_alpha in ccp_alphas:
    clf = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    t_clf.append(clf)
    
# we remove the last element of t_clf and ccp_alphas, because it is the trivial tree with only one node.
t_clf = t_clf[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in t_clf]
depth = [clf.tree_.max_depth for clf in t_clf]
plt.scatter(ccp_alphas,node_counts)
plt.scatter(ccp_alphas,depth)
plt.plot(ccp_alphas,node_counts,label='no of nodes',drawstyle="steps-post")
plt.plot(ccp_alphas,depth,label='depth',drawstyle="steps-post")
plt.legend()
plt.title('Tree complexity vs alpha')
plt.show()


# accuracy versus alpha
train_acc = []
val_acc = []
for c in t_clf:
    y_train_pred = c.predict(X_train)
    y_val_pred = c.predict(X_val)
    train_acc.append(accuracy_score(y_train, y_train_pred)) # true labels first
    val_acc.append(accuracy_score(y_val, y_val_pred))

plt.scatter(ccp_alphas,train_acc)
plt.scatter(ccp_alphas,val_acc)
plt.plot(ccp_alphas,train_acc,label='train_accuracy',drawstyle="steps-post")
plt.plot(ccp_alphas,val_acc,label='val_accuracy',drawstyle="steps-post")
plt.legend()
plt.title('Accuracy vs alpha')
plt.show()

**Q12:** From the graph above, what is the best value for alpha? Replace the value in the first line below with it.

In [None]:
best_alpha = 0.12 # <-- replace this value
clf_ = tree.DecisionTreeClassifier(random_state=0,ccp_alpha=best_alpha)
clf_.fit(X_train,y_train)
y_train_pred = clf_.predict(X_train)
y_val_pred = clf_.predict(X_val)

# true labels come first, so that rows are true classes and columns are predicted classes
print('Train score', accuracy_score(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))

print('Validation score', accuracy_score(y_val, y_val_pred))
print(confusion_matrix(y_val, y_val_pred))