# Lecture 2
## Introduction to Sklearn
### Getting to know the API

<ol>
<li> Data used: the Iris data set, loaded through the Seaborn package (https://archive.ics.uci.edu/ml/datasets/iris)
<li> Notebook Goal: Create a classifier trained on the iris data set using Sklearn.
<li> Extra Exercise: Yes, see below.
</ol>

In [1]:
#Necessary imports
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


We start by loading the data. This is done using the 'load_dataset' function from the seaborn package.
Calling the head() method returns the first five rows of the dataframe.

In [2]:
iris = sns.load_dataset("iris")
iris.head()

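|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |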


In this example, we want to construct a classifier that predicts the species of an observation from its sepal and petal lengths and widths.

Recall that a supervised Sklearn classifier needs the following ingredients:
<ol>
<li> A numerical feature matrix
<li> A target vector
</ol>

Since the target vector is the species in this case, we need to split the data set into these two parts.

In [3]:
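# Feature matrix: every column except the species label (drop returns a new dataframe)
X_iris = iris.drop(columns='species')
# Target vector: the species label
y_iris = iris['species']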

In [4]:
X_iris.head()

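|   | sepal_length | sepal_width | petal_length | petal_width |
|---|--------------|-------------|--------------|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |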


In [5]:
y_iris.head()

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
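
As a quick sanity check on the two ingredients, the sketch below confirms that a feature matrix has one row per observation and that the target vector has one label per row. It is only an illustration: it assumes scikit-learn's bundled copy of the iris data (load_iris) in place of the seaborn dataframe so that it runs offline, and the names X_check and y_check are made up for this example. In that copy the species are integer-coded rather than strings.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: the copy of iris shipped with scikit-learn stands in for
# sns.load_dataset("iris"); the species labels are integers here.
X_check, y_check = load_iris(return_X_y=True)

# Feature matrix: (n_samples, n_features)
print(X_check.shape)
# Target vector: (n_samples,)
print(y_check.shape)

# The features must be numerical, and there must be one label per row
print(np.issubdtype(X_check.dtype, np.number))
print(len(X_check) == len(y_check))
```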

It is important to always use two data sets: a training data set and a test data set. These data sets need to be disjoint, so that the test data set can be used to evaluate the performance of the model on unseen data.

These two data sets can be obtained using the train_test_split function. We fix the random state so that we always get the same split, and hence the same results.

In [6]:
X_train, X_test, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)
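
To see what such a split looks like, the following sketch splits row indices alongside the data and checks that the training and test parts do not overlap. It is an illustration under assumptions: it uses scikit-learn's bundled iris copy (load_iris) rather than the seaborn dataframe, and the variable names are hypothetical. With no test_size given, train_test_split holds out 25% of the rows by default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris copy stands in for the seaborn dataframe
X_demo, y_demo = load_iris(return_X_y=True)
row_ids = np.arange(len(X_demo))  # hypothetical helper: one id per observation

# Any number of equally long arrays can be split together in one call
X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X_demo, y_demo, row_ids, random_state=1
)

# Sizes of the two parts (default test_size is 0.25)
print(len(X_tr), len(X_te))
# No observation appears in both parts
print(set(id_tr).isdisjoint(id_te))
```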

We now follow the recipe as given in the slides!

In [7]:
# We use the default hyperparameters. To change these, consult the documentation.
model = GaussianNB()

# Fit the model on the training data, then predict the species of the test data.
model.fit(X_train, ytrain)
y_model = model.predict(X_test)
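
To make the fit and predict steps more concrete, here is a minimal sketch that inspects what they produce. It assumes scikit-learn's bundled iris copy (integer-coded species) instead of the seaborn dataframe, and clf and y_pred are names chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's bundled iris copy stands in for the seaborn dataframe
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

clf = GaussianNB()           # 1. choose a model and its hyperparameters
clf.fit(X_tr, y_tr)          # 2. fit it to the training data
y_pred = clf.predict(X_te)   # 3. predict labels for unseen data

# After fitting, the estimator remembers which labels it has seen
print(clf.classes_)
# predict returns one label per row of the test feature matrix
print(y_pred.shape == y_te.shape)
```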

The model object has now been fit to the training data and has predicted the species of the test data. To evaluate the performance of this classification, we use the accuracy.

Recall that the accuracy is given by (# Correct Classifications)/(# Total Classifications).

In [8]:
accuracy_score(ytest, y_model)

0.9736842105263158
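
The accuracy formula can be checked by hand on a tiny example. The labels below are made up for illustration and are not taken from the notebook's test set; the sketch only shows that counting matches and dividing by the total gives the same number as accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels: 3 of the 4 predictions are correct
y_true = np.array(["setosa", "versicolor", "virginica", "virginica"])
y_pred = np.array(["setosa", "versicolor", "versicolor", "virginica"])

# (# correct classifications) / (# total classifications)
manual_accuracy = (y_true == y_pred).sum() / len(y_true)

print(manual_accuracy)                 # 0.75
print(accuracy_score(y_true, y_pred))  # 0.75
```

Read the same way, the score of 0.9736842105263158 above equals 37/38, which is consistent with 37 correct predictions on a test set of 38 rows (the default 25% of the 150 observations, rounded up).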

EXERCISE:

Choose any classifier in Sklearn and change the above code to fit this chosen model. Compare your accuracy with the one obtained above.

In [None]:
#Exercise