# First steps with scikit-learn

Scikit-learn is one of the most widely used machine learning libraries in the Python community. It is so comprehensive that we are going to limit ourselves to supervised learning, using classification examples to illustrate the most important ideas behind this library.

> This notebook is a critical review of the [Getting started](https://scikit-learn.org/stable/getting_started.html) entry from the original scikit-learn documentation.

As the official documentation states, scikit-learn is built mostly on SciPy, NumPy, and Matplotlib. If you've been around since pandas-zero, you know that we can use higher-level libraries that help us do the same work.

> This notebook assumes you've already taken the [pandas-zero](https://github.com/leobezerra/pandas-zero) course. If you haven't, please do 😉

In [0]:
import pandas as pd
import seaborn as sns

## A way too simplistic example

The first goal of this tutorial is to have you understand the basic objects and conventions in scikit-learn. For this reason, let's start with an example that is way too simplistic by real-life standards.

### Loading the data

We'll load the iris dataset from seaborn, which returns it as a pandas `DataFrame`.

> This dataset is way too simplistic for real life standards, but I told you it would be 🙃

In [0]:
iris_dataset = sns.load_dataset('iris')
iris_dataset.head()

As we can see above, this dataset presents four numerical features and a target feature that represents the flower species to which the sample belongs. By convention, we isolate the target feature from the rest of the features:

> `X` stands for features that will be used for prediction.

> `y` stands for the target feature, or **label**.

In [0]:
X = iris_dataset.iloc[:,0:4]

In [0]:
y = iris_dataset["species"]
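The same split can also be written by column name instead of by position, which keeps working if the column order ever changes. The sketch below is only an illustration: it uses a hypothetical three-row `toy` DataFrame with the same column names as the iris dataset, so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical three-row stand-in for the iris DataFrame (same column names).
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Features selected by position (as in the cells above) and by name.
X_by_position = toy.iloc[:, 0:4]
X_by_name = toy.drop(columns="species")

# The target feature is a single column.
y_toy = toy["species"]

print(X_by_name.equals(X_by_position))  # True
print(X_by_name.shape, y_toy.shape)     # (3, 4) (3,)
```

Either way, `X` ends up with one row per sample and one column per feature, and `y` with one label per sample.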

### Selecting a classifier

In a supervised classification problem, we have to select a classification algorithm (or **classifier**) to learn the patterns that define each class. scikit-learn offers a wide range of classifiers, so we'll stick to one of the simplest: kNN.



In [0]:
from sklearn.neighbors import KNeighborsClassifier

kNN is short for k-nearest neighbors. In a nutshell, kNN predicts the label for a given sample by taking a vote among the labels of its k nearest neighbors in the training data.

> By default, `KNeighborsClassifier` uses k = 5 and finds the nearest neighbors according to Euclidean distance.
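To make the idea concrete, here is a minimal sketch of kNN written by hand with NumPy. It is not how scikit-learn implements it (the library uses faster neighbor searches and its own tie handling); the `knn_predict` helper and the toy data are hypothetical and exist only for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, sample, k=3):
    """Predict the label of `sample` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the sample to every training point.
    distances = np.linalg.norm(X_train - sample, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy groups in two dimensions.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array(["small", "small", "small", "large", "large", "large"])

prediction = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
print(prediction)  # small
```

Notice that there is no real "training" here: kNN just stores the samples and looks them up at prediction time.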

Creating a classifier is very straightforward:

In [0]:
clf = KNeighborsClassifier()

### Fitting a model

Every classifier in scikit-learn provides a method `fit()`, which builds a model for estimating labels `y` from features `X`. Once again, fitting a model is pretty straightforward:

In [0]:
clf.fit(X, y)

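Depending on your scikit-learn version, the output of the cell above may be very verbose. Older versions print every inner aspect of the classifier that scikit-learn allows you to customize when creating one, while recent versions show only the parameters changed from their defaults.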

> We'll talk about this later, since it deserves a whole notebook!

Also, note that this output is not the model itself: `fit()` returns the fitted classifier, and the notebook simply displays its representation. The fitted model is stored internally by the `clf` object.
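If you want to peek at what `fit()` gives back, the sketch below may help. It assumes the copy of iris bundled with scikit-learn (`load_iris`), where labels are the integer codes 0, 1 and 2 instead of species names, and uses hypothetical variable names ending in `_demo` so it does not clash with the notebook's own variables.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris copy, with integer labels 0, 1, 2.
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

clf_demo = KNeighborsClassifier()
returned = clf_demo.fit(X_demo, y_demo)

# fit() returns the very same object, now fitted.
print(returned is clf_demo)                  # True
# Attributes ending in an underscore are learned during fit().
print(clf_demo.classes_)                     # [0 1 2]
# Parameters chosen at creation time are available through get_params().
print(clf_demo.get_params()["n_neighbors"])  # 5
```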

### Predicting labels

Once we have fitted a model to the data, we can use it to predict labels. scikit-learn classifiers provide a method `predict()`, which predicts labels for given samples:

> By convention, we refer to labels predicted by classifiers as `y_pred`.

In [0]:
y_pred = clf.predict(X)

As discussed in the beginning, scikit-learn is built on NumPy, so the output of the method is an `ndarray` from NumPy.

In [0]:
y_pred

Since that is not very high-level, let's convert it to a Pandas `Series`:

In [0]:
y_pred = pd.Series(y_pred)
y_pred.value_counts()
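`predict()` is not limited to the samples used for fitting. As a hedged sketch, the cell below predicts the species of a single made-up flower. It assumes scikit-learn's bundled iris copy (integer labels, translated back through `target_names`), and the measurements of `new_flower` are hypothetical values chosen to look like a setosa.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris copy, with integer labels 0, 1, 2.
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

clf_demo = KNeighborsClassifier().fit(X_demo, y_demo)

# A hypothetical new flower, using the same feature columns as the training data.
new_flower = pd.DataFrame([[5.0, 3.4, 1.5, 0.2]], columns=X_demo.columns)

prediction = clf_demo.predict(new_flower)
print(type(prediction))                   # <class 'numpy.ndarray'>
print(iris.target_names[prediction[0]])   # setosa
```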

### Evaluating the prediction

Scikit-learn offers a number of approaches to evaluate the quality of a prediction. One of the simplest is the confusion matrix, which counts how many samples of each true class were assigned to each predicted class:

In [0]:
from sklearn.metrics import confusion_matrix

In [0]:
confusion_matrix(y, y_pred)

Not very readable, is it? Let's plot this and I'll walk you through it.

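> Note that the `ConfusionMatrixDisplay.from_estimator()` method below takes as arguments the fitted classifier, the input features, and the target labels. It is available since scikit-learn 1.0; the older `plot_confusion_matrix()` function was removed in version 1.2.

In [0]:
from sklearn.metrics import ConfusionMatrixDisplay

In [0]:
ConfusionMatrixDisplay.from_estimator(clf, X, y)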

Now we can see things more clearly. On the rows, we have the true labels. On the columns, the predicted labels. Ideally, all samples should fall on the main diagonal, which would mean they were predicted correctly.

That's what happens to the setosa species. kNN is able to create a model that correctly classifies all of its examples. For the remaining classes, kNN mistakes three versicolor samples for virginica, and two virginica samples for versicolor.
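A tiny hand-made example can help when reading these matrices. The labels below are made up for illustration; the point is only to show scikit-learn's convention that each row is a true class and each column is a predicted class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: three true cats (one mistaken for a dog) and two true dogs.
y_true_toy = ["cat", "cat", "cat", "dog", "dog"]
y_pred_toy = ["cat", "cat", "dog", "dog", "dog"]

# labels= fixes the order of rows and columns.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["cat", "dog"])
print(cm)
# [[2 1]
#  [0 2]]
```

The first row says that, out of three true cats, two were predicted as cats and one as a dog. The second row says both true dogs were predicted correctly.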

## A still simplistic example

In the example above, a very simplistic assumption is that the classifier could see all the samples to fit its model. In practice, a model fitted like that would probably **overfit** to the data it has seen. In real life, we use **sampling** to assess **generalization**.

> Sampling means that the classifier will be evaluated on data that it has not seen when fitting its model.

> Generalization is the ability of a classifier to predict labels correctly for samples it has not seen when fitting the model.

Scikit-learn offers several sampling approaches. The simplest is called **holdout**, and is provided as the `train_test_split()` function:

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

Note that the original data `X` and `y` is now partitioned into two subsets:

> `X_train` and `y_train` will be used to fit the model. They are known as training data.

> `X_test` and `y_test` will be used to evaluate the model. They are known as test data.

In [0]:
clf.fit(X_train, y_train)

In [0]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)

Note that the number of samples used for testing is much smaller than the number of samples used for training: by default, `train_test_split()` holds out 25% of the samples for testing. This ratio is controlled by the `test_size` parameter, which one can configure when splitting the data.

> Once again, sampling is so important that it will have its own notebook!
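As a quick preview, the sketch below shows how the split sizes change with `test_size`. It assumes scikit-learn's bundled iris copy (150 samples), and the variable names are hypothetical so they do not overwrite the notebook's own `X_train` and friends.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris copy, with 150 samples.
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# Default split: 25% of the samples are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))  # 112 38

# A smaller test set, given as a fraction of the data.
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(X_demo, y_demo, test_size=0.2)
print(len(X_tr20), len(X_te20))
```

Whatever ratio you pick, every sample ends up in exactly one of the two subsets.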

## A less simplistic example

Another very simplistic assumption we made in the examples above is that the data is ready for modelling as it comes. 

> Since we're using the iris dataset, this is still kinda true.

In real life, however, fitting a model is the last stage in a machine learning **pipeline**. 

> And if you have taken the pandas-zero course, machine learning is a very late stage in the data science process.

### Understanding pipelines

Pipelines are the atomic unit in machine learning. They comprise:
* Data preparation
* Feature engineering
* Estimators

> All these topics are super important, so each will have their own notebook 😁

The classifier we have been using plays the role of the estimator. Since this is still a simplistic example, I'll show you how to build a very simple pipeline where we include a data preparation component.

### Preparing the data

Even if the iris dataset is quite simplistic, its features present very different ranges:

In [0]:
iris_dataset.describe()

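In general, machine learning algorithms work better with features that have been rescaled to present a similar range. This is especially true for distance-based algorithms such as kNN, where a feature with a larger range weighs more in the distance. In scikit-learn, we can do this using the `MinMaxScaler` object: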

In [0]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
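With its default settings, `MinMaxScaler` maps each feature to the [0, 1] range using `(x - min) / (max - min)`, computed column by column. The sketch below checks that formula by hand on a made-up array; the `values` data is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features with very different ranges.
values = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 40.0]])

# Min-max scaling by hand, one column at a time.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
by_hand = (values - col_min) / (col_max - col_min)

# The same thing with scikit-learn.
by_sklearn = MinMaxScaler().fit_transform(values)

print(by_hand)
print(np.allclose(by_hand, by_sklearn))  # True
```

After scaling, both columns go from 0 to 1, so neither dominates a distance computation.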

Next, we create our pipeline using the `make_pipeline()` function:

In [0]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(scaler, clf)
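To get a feeling for what a pipeline does under the hood, the sketch below compares a pipeline with the same steps done by hand: fit the scaler on the training data, transform both subsets, and fit the classifier on the scaled training data. It assumes scikit-learn's bundled iris copy and fixes `random_state=0` (an arbitrary choice) only so both approaches see the same split; all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# With a pipeline: one fit, one predict.
pipe_demo = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
pipe_demo.fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

# By hand: the scaler only ever learns from the training data.
scaler_demo = MinMaxScaler().fit(X_tr)
knn_demo = KNeighborsClassifier().fit(scaler_demo.transform(X_tr), y_tr)
manual_pred = knn_demo.predict(scaler_demo.transform(X_te))

print(np.array_equal(pipe_pred, manual_pred))  # True
```

The pipeline saves you from repeating the transform step and from accidentally fitting the scaler on test data.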

A pipeline object follows the same fit-predict pattern as classifiers:

> Note that the model that had been fitted by `clf` will be lost, since the pipeline will refit the model.

In [0]:
pipe.fit(X_train, y_train)

This output is even more verbose and starts to give you an idea of how complex creating and configuring a machine learning pipeline may become. Let's see if adding the data preparation component affected results somehow:

In [0]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test)

The output you see above may be better, equal, or worse than the result presented in the previous section. The reason is that the `train_test_split()` function splits the data in a randomized way, so by default the exact results are not reproducible from one run to the next.

> This is possible in a controlled setup, and we'll discuss it in a notebook just for that 🤓
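As a small teaser, one common way to get a controlled setup is the `random_state` parameter of `train_test_split()`. The sketch below assumes scikit-learn's bundled iris copy, and the seed value 42 is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# Two independent calls with the same seed (42 is arbitrary).
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X_demo, y_demo, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X_demo, y_demo, random_state=42)

# Both calls pick exactly the same rows for the test set.
same_split = X_te_a.index.equals(X_te_b.index)
print(same_split)  # True
```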

More importantly, this means that every component in a pipeline might bring benefits or worsen results, which makes creating and configuring pipelines quite challenging and dataset-specific.

### Evaluating models analytically

To conclude our notebook, let's switch from a graphical analysis to an analytical one. This is very important in real life, since it is not feasible to use confusion matrices for many datasets.

> Guess what? Yep. A notebook for that 😎

For classification problems where the number of samples is evenly distributed among the different classes, the most used metric is called accuracy.

> Accuracy is the ratio between the number of samples correctly classified and the total number of samples.

We can use the `accuracy_score()` function from scikit-learn to compute it:
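The definition is simple enough to compute by hand with pandas. The labels below are made up for illustration: five samples, one of which is misclassified.

```python
import pandas as pd

# Made-up labels: the third sample is predicted incorrectly.
y_true_toy = pd.Series(["setosa", "versicolor", "virginica", "virginica", "setosa"])
y_pred_toy = pd.Series(["setosa", "versicolor", "versicolor", "virginica", "setosa"])

# Number of correct predictions divided by the total number of samples.
correct = (y_true_toy == y_pred_toy).sum()
accuracy = correct / len(y_true_toy)

print(correct, len(y_true_toy))  # 4 5
print(accuracy)                  # 0.8
```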

In [0]:
from sklearn.metrics import accuracy_score

In [0]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

In [0]:
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)

Once again, whether results are better, equal, or worse depends on the run. Yet, we now have an analytical metric for which higher scores represent better predictions.