# SL - Comparing classifiers - CORRECTION

In this hands-on lesson, we will compare the performance of different algorithms for classification on the Iris data set.  

This notebook is adapted from https://medium.com/@pinnzonandres/iris-classification-with-svm-on-python-c1b6e833522c and https://www.kaggle.com/tcvieira/simple-random-forest-iris-dataset

Iris is a genus of 260–300 species of flowering plants with showy flowers. It takes its name from the Greek word for a rainbow, iris.

In the dataset we have three types of iris:
- Iris Setosa
- Iris Versicolour
- Iris Virginica

For each flower we know four measurements, which are the features of our machine learning classifier:

- Sepal length
- Sepal width
- Petal length
- Petal width

The goal is to correctly classify the three types of iris using the four features.

## Loading and visualization

### Load the data and inspect it

In [None]:
#Import scikit-learn dataset library
from sklearn import datasets
#Load dataset
iris = datasets.load_iris()

In [None]:
iris.keys()

In [None]:
# We use pandas for better manipulation
import pandas as pd
# Transform the data into a dataframe
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = pd.Series(iris.target)
df.head()

In [None]:
# Print the number of samples in each class
print(df['target'].value_counts())

### Some useful visualizations

In [None]:
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns

### Exercise
- Make a scatter plot of 'petal width' against each of the other features

In [None]:
targets = np.unique(df['target'].values)
ncols = 3  # one panel per remaining feature
fig, axs = plt.subplots(ncols=ncols, nrows=1, figsize=(10,3))
feature_y = df.columns[-2]  # 'petal width (cm)', the column just before 'target'
for col in range(ncols):
    feature_x = df.columns[col]
    ax = axs[col]
    # Plot each class separately so that it gets its own colour and label
    for i in targets:
        mask = df['target'].values == i
        label = iris.target_names[i]

        ax.scatter(df[feature_x].values[mask],
                   df[feature_y].values[mask],
                   label=label)
    if col == 0:
        ax.set_ylabel(feature_y)
    ax.set_xlabel(feature_x)
# Show the legend once, on the last panel
ax.legend()

In [None]:
# We can plot something more insightful with seaborn
sns.set()
sns.pairplot(df[['petal length (cm)', 'petal width (cm)', 
                   'sepal length (cm)', 'sepal width (cm)', 'target']],
             hue="target", diag_kind="kde", 
             palette=['blue','orange','green'])


From the visual inspection, petal width and petal length appear to be the two most informative features for distinguishing the three groups of flowers.

We can also look at the correlations.

In [None]:
# Plot correlation heatmap
fig, ax = plt.subplots(figsize=(6,5))
sns.heatmap(df.corr(), ax=ax)
ax.set_title('Correlation between iris features and classes')

We observe that:
- petal width and petal length are the features most strongly correlated with the type of flower
- sepal length also has a positive, but weaker, correlation with the type of flower
- sepal width is negatively correlated with the type of flower, and this correlation is not very strong
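
As a rough numerical check of the observations above, the sketch below extracts the correlation of each feature with the class label. It rebuilds the dataframe so that it runs on its own, and the name `corr_with_target` is introduced here only for illustration. Since the target is an integer-encoded category, the Pearson correlation is only a coarse indicator.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the dataframe so that this cell is self-contained
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Pearson correlation of each feature with the integer-encoded class label
corr_with_target = df.corr()['target'].drop('target')
print(corr_with_target.sort_values(ascending=False))
```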

## Preprocessing: prepare the data to apply machine learning classification models

In [None]:
X = df.iloc[:,:-1]  # define the features
y = df.iloc[:, -1].values  # define the target values

In [None]:
print(X.shape)
print(y.shape)
print(type(X))
print(type(y))

In [None]:
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

In [None]:
y_train.shape
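
The split above is random, so the accuracies reported later change from run to run. Below is a minimal sketch of a reproducible variant, assuming we are free to fix the seed (`random_state=0` is an arbitrary choice) and to preserve the class proportions with `stratify`. The `_s` suffix is only there to avoid overwriting the variables used in the rest of the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_all, y_all = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class balance in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all)

print(X_train_s.shape, X_test_s.shape)
print("Test samples per class:", np.bincount(y_test_s))
```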

## Classification with SVM

### Simple SVM with the first two features

Here we use scikit-learn to fit an SVM model.
In this simple example we use only the first two features (sepal length and sepal width) and plot the decision boundaries.


In [None]:
from sklearn import svm
kernel = 'linear'
svc1 = svm.SVC(kernel=kernel)
svc1.fit(X_train.iloc[:, :2], y_train)

In [None]:
%matplotlib inline

In [None]:
from matplotlib.colors import ListedColormap
from sklearn.inspection import DecisionBoundaryDisplay

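def plot_decision_boundaries(clf, X_train, y_train):
    """Plot the decision regions of clf over the first two features, with the training points on top."""
    cmap_light = ListedColormap(["orange", "cyan", "cornflowerblue"])
    cmap_bold = ["darkorange", "c", "darkblue"]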

    _, ax = plt.subplots()
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X_train.iloc[:, :2],
        cmap=cmap_light,
        ax=ax,
        response_method="predict",
        plot_method="pcolormesh",
        xlabel=X_train.columns[0],
        ylabel=X_train.columns[1],
        shading="auto",
    )

    # Also plot the training points

    sns.scatterplot(
        x=X_train.iloc[:, 0],
        y=X_train.iloc[:, 1],
        hue=y_train,
        palette=cmap_bold,
        alpha=1.0,
        edgecolor="black",
        ax=ax,
    )

In [None]:
plot_decision_boundaries(svc1, X_train, y_train)
plt.title("3-Class classification SVM Kernel: " + kernel)

### Exercise

- Use the trained model to predict the labels of the test set, and evaluate the prediction (i.e., compute the accuracy of the model).
- Define an SVM classifier with a polynomial kernel of degree 3 and compare its performance with the linear SVM classifier.

### Question
What do you deduce from the results?


In [None]:
prediction = svc1.predict(X_test.iloc[:, :2])
accuracy = np.sum(prediction == y_test) / y_test.shape[0]
print(f'Predicted array: {prediction}')
print(f'Test array: {y_test}')
print(f'Accuracy: {accuracy}')

In [None]:
kernel = 'poly'
degree = 3
svc2 = svm.SVC(kernel=kernel, degree=degree)
svc2.fit(X_train.iloc[:, :2], y_train)
plot_decision_boundaries(svc2, X_train, y_train)
plt.title(f"3-Class classification SVM Kernel: {kernel} Degree: {degree}")

In [None]:
prediction = svc2.predict(X_test.iloc[:, :2])
accuracy = np.sum(prediction == y_test) / y_test.shape[0]
print(f'Predicted array: {prediction}')
print(f'Test array: {y_test}')
print(f'Accuracy: {accuracy}')
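
One way to approach the question raised in this exercise is to check whether the choice of features matters more than the choice of kernel. The sketch below assumes a fixed stratified split (`random_state=0` is arbitrary) and uses a hypothetical helper, `linear_svm_accuracy`, to train the same linear SVM once on the two sepal features and once on the two petal features.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all)

def linear_svm_accuracy(columns):
    """Fit a linear SVM on the given feature columns and return its test accuracy."""
    model = SVC(kernel='linear')
    model.fit(X_tr[:, columns], y_tr)
    return model.score(X_te[:, columns], y_te)

acc_sepal = linear_svm_accuracy([0, 1])  # sepal length, sepal width
acc_petal = linear_svm_accuracy([2, 3])  # petal length, petal width
print(f"Sepal features: {acc_sepal:.2f}")
print(f"Petal features: {acc_petal:.2f}")
```

If the petal features score clearly higher, that would be consistent with what the pair plot suggested.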

### Train the model with all features and compute accuracy

In [None]:
# Create the SVM model
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)

# Fit the model to the training data
classifier.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = classifier.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
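
In the confusion matrix, row i and column j count the test samples of true class i that were predicted as class j, so the correct predictions lie on the diagonal. The self-contained sketch below (the split seed and the names `cm_demo` and `acc_from_cm` are assumptions made for illustration) shows that the accuracy is the sum of the diagonal divided by the total.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0)

y_hat = SVC(kernel='linear').fit(X_tr, y_tr).predict(X_te)
# labels=[0, 1, 2] guarantees a 3x3 matrix whatever classes end up in the test set
cm_demo = confusion_matrix(y_te, y_hat, labels=[0, 1, 2])

# Correct predictions are on the diagonal of the matrix
acc_from_cm = np.trace(cm_demo) / cm_demo.sum()
print(cm_demo)
print("Accuracy from the confusion matrix:", acc_from_cm)
print("Accuracy from accuracy_score:", accuracy_score(y_te, y_hat))
```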

In [None]:
# Number of correct predictions
n_correct = sum(y_test == y_pred)
print("Correct predictions:", n_correct)
# Accuracy
print("Accuracy:", n_correct / len(y_test))

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
print("Accuracy sk learn:", metrics.accuracy_score(y_test, y_pred))

In [None]:
# We can use cross validation to compute the accuracy of our model
from sklearn.model_selection import cross_val_score
svm_clf = SVC(kernel = 'linear', random_state = 0)
accuracies = cross_val_score(estimator = svm_clf, X = X, y = y, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
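
The mean alone hides how much the score varies between folds. The sketch below also reports the standard deviation and uses an explicitly shuffled `StratifiedKFold` (the seed is an arbitrary assumption). For a classifier, `cv=5` already uses stratified folds, but without shuffling.

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X_all, y_all = datasets.load_iris(return_X_y=True)

# Shuffled, stratified folds with a fixed seed for reproducibility
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel='linear'), X_all, y_all, cv=folds)

print("Scores per fold:", scores)
print("Accuracy: {:.2f} % (+/- {:.2f} %)".format(scores.mean() * 100,
                                                 scores.std() * 100))
```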

### Visualize important features

In [None]:
classifier.coef_[0]

In [None]:
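# Note: with three classes, SVC trains one linear classifier per pair of classes,
# so coef_[0] holds only the weights of the first pair (class 0 vs class 1)
svc_feature_imp = pd.Series(abs(classifier.coef_[0]), index=df.columns[:-1]).sort_values(ascending=False)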

In [None]:
# Creating a bar plot
svc_feature_imp.plot(kind='barh')

In [None]:
# Creating a better bar plot
sns.barplot(x=svc_feature_imp, y=svc_feature_imp.index) 

# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
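
With three classes, `SVC` fits one linear classifier per pair of classes, so `coef_` has three rows and `coef_[0]` covers only one pair. A possible variant, offered as a sketch and not as a standard importance measure, averages the absolute weights over the three pairwise classifiers. Because the features are not standardized here, the magnitudes are only indicative. The names `model` and `mean_abs_coef` are introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
model = SVC(kernel='linear').fit(iris.data, iris.target)

# One row per pair of classes, one column per feature
print(model.coef_.shape)

# Average absolute weight of each feature over the pairwise classifiers
mean_abs_coef = pd.Series(np.abs(model.coef_).mean(axis=0),
                          index=iris.feature_names).sort_values(ascending=False)
print(mean_abs_coef)
```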

## Classification with Random Forest

### Exercise
- Try to repeat what was done with the SVM, this time using a random forest
- Vary the depth of your trees and the size of your forest to see how the results change

If you want to get information on random forests, type `RandomForestClassifier?` in a code cell.
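
One possible way to organise this exploration is a small loop over a few depths and forest sizes, each scored by cross-validation. The values in the grid and the seed are arbitrary assumptions, chosen only to keep the run short.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_all, y_all = datasets.load_iris(return_X_y=True)

results = {}
for max_depth in (1, 3, None):        # None lets the trees grow fully
    for n_estimators in (10, 100):
        forest = RandomForestClassifier(max_depth=max_depth,
                                        n_estimators=n_estimators,
                                        random_state=0)
        scores = cross_val_score(forest, X_all, y_all, cv=5)
        results[(max_depth, n_estimators)] = scores.mean()
        print(f"max_depth={max_depth}, n_estimators={n_estimators}: "
              f"{scores.mean() * 100:.2f} %")
```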

### Train the model and compute accuracy

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest Classifier (an instance of the class)
clf=RandomForestClassifier(max_depth=3, n_estimators=100)

#Train the model using the training set
clf.fit(X_train,y_train)
#Predict the labels of the test set
y_pred=clf.predict(X_test)


In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
rf_clf = RandomForestClassifier(max_depth=3, n_estimators=100)
accuracies = cross_val_score(estimator = rf_clf, X = X, y = y, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

### Visualize the important features

In [None]:
clf.feature_importances_

In [None]:
feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
feature_imp


In [None]:
feature_imp.index

In [None]:
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index) 

# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
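
To close the comparison, the two classifiers can be scored side by side on the same cross-validation folds. This is a minimal sketch, assuming the same hyperparameters as in the cells above and an arbitrary seed; the `summary` dictionary is introduced only for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X_all, y_all = datasets.load_iris(return_X_y=True)
# The same folds are reused for both models so that the scores are comparable
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Linear SVM": SVC(kernel='linear'),
    "Random forest": RandomForestClassifier(max_depth=3, n_estimators=100,
                                            random_state=0),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_all, y_all, cv=folds)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean() * 100:.2f} % (+/- {scores.std() * 100:.2f} %)")
```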