# Introducing Scikit-learn

There are several Python libraries which provide solid implementations of a range of machine learning algorithms. One of the best known is scikit-learn, a package which provides efficient versions of a large number of common algorithms. 

Scikit-learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of scikit-learn for one type of model, switching to a new model or algorithm is very straightforward.

This section provides an overview of the API of scikit-learn; a solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches.

## Machine Learning Settings

Machine learning algorithms come in two different "flavours":

- **Supervised Learning**
    - _Classification_
    - _Regression_
    
    
- **Unsupervised Learning**
    - _Dimensionality Reduction_
    - _Clustering_

## Scikit-learn Cheat Sheet

<img src="images/scikit-learn-cheatsheet.png" width="100%" />

## 1. Representation and Visualization of Data

Data in scikit-learn, with very few exceptions, is assumed to be stored as a
**two-dimensional array**, of size `[n_samples, n_features]`. 
This array is usually referred to as the **feature matrix**.

There is also the **label vector**, of size `n_samples`, containing the label
of each sample (_Note_: **ONLY** in the Supervised Learning setting).

$$
{\rm feature~matrix:~~~} {\bf X}~=~\left[
\begin{matrix}
x_{11} & x_{12} & \cdots & x_{1D}\\
x_{21} & x_{22} & \cdots & x_{2D}\\
x_{31} & x_{32} & \cdots & x_{3D}\\
\vdots & \vdots & \ddots & \vdots\\
\vdots & \vdots & \ddots & \vdots\\
x_{N1} & x_{N2} & \cdots & x_{ND}\\
\end{matrix}
\right]
$$

$$
{\rm label~vector:~~~} {\bf y}~=~ [y_1, y_2, y_3, \cdots y_N]
$$

Here there are $N$ samples and $D$ features.

- $N$ (`n_samples`):   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in a database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- $D$ (`n_features`):  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.

The number of features must be fixed in advance. 

However, it can be very large
(e.g. millions of features), with most of them being zero for a given sample. This is a case
where `scipy.sparse` matrices can be useful, in that they are
much more memory-efficient than dense NumPy arrays.

Each sample (data point) is a row in the data array, and each feature is a column.
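As a minimal sketch of this layout, the snippet below builds a tiny feature matrix and label vector by hand, then compares the memory footprint of a mostly-zero array stored densely and as a `scipy.sparse` CSR matrix. The numbers and array sizes are made up purely for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 4 samples (rows), 3 features (columns)
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4],
              [5.9, 3.0, 5.1]])
y = np.array([0, 0, 2, 2])  # one label per sample

n_samples, n_features = X.shape
print("n_samples:", n_samples, "n_features:", n_features)
print("label vector shape:", y.shape)

# A mostly-zero matrix: only two of the 1,000,000 entries are non-zero
X_dense = np.zeros((1000, 1000))
X_dense[0, 5] = 1.0
X_dense[42, 999] = 3.0

# CSR format stores only the non-zero values plus index arrays
X_sparse = sparse.csr_matrix(X_dense)
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print("dense bytes: ", X_dense.nbytes)
print("sparse bytes:", sparse_bytes)
```

Many scikit-learn estimators accept such sparse matrices directly in place of a dense array, although this depends on the estimator.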

### Simple Example: Iris Dataset

As an example of a simple dataset, we're going to take a look at the iris data bundled with scikit-learn.
The data consists of measurements of individual iris flowers.  

There are three species of iris in the dataset:

<style type="text/css">
    ul#flowers li {
        display:inline !important;
    }
</style>

<ul id="flowers">
    <li>
        Iris Setosa
        <img src="images/iris_setosa.jpg" width="25%">
    </li>
    <li>
        Iris Versicolor
        <img src="images/iris_versicolor.jpg" width="25%">
    </li>
    <li>
        Iris Virginica
        <img src="images/iris_virginica.jpg" width="25%">
    </li>
</ul>

### Quick Question:

**If we want to design an algorithm to recognize iris species, what might the data be?**

Remember: we need a 2D array of size `[n_samples x n_features]`.

- What would the `n_samples` refer to?

- What might the `n_features` refer to?

Remember that there must be a **fixed** number of features for each sample, and feature
number ``i`` must be a similar kind of quantity for each sample.

## Ex 1.1

Load the iris dataset from `sklearn` and print out its description.

#### Hint: Inspect attributes of the `sklearn.utils.Bunch` object
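One possible way to approach this exercise, shown here only as a sketch: `load_iris` returns a dictionary-like `Bunch` object, so its contents can be listed with `keys()` and read as attributes.

```python
from sklearn.datasets import load_iris

# load_iris returns a dictionary-like Bunch object
iris = load_iris()

# List the available fields (data, target, DESCR, ...)
print(iris.keys())

# DESCR holds a plain-text description of the dataset
print(iris.DESCR)
```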

## Ex 1.2

Save the `iris` data into a NumPy array `X` and print out the feature values along with the corresponding targets.
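A minimal sketch of one way to do this, assuming the dataset is loaded with `load_iris`. The variable name `y` and the choice to print only the first five samples are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # feature matrix, shape (n_samples, n_features)
y = iris.target  # label vector, shape (n_samples,)

print("X shape:", X.shape)
print("features:", iris.feature_names)
print("classes: ", list(iris.target_names))

# Print the first few samples next to the name of their class
for features, label in zip(X[:5], y[:5]):
    print(features, "->", iris.target_names[label])
```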


## Ex 1.3

Plot `iris` data using a scatter plot
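A possible starting script is sketched below, assuming `matplotlib` is installed. The names `x_index` and `y_index` select which two feature columns are plotted; the initial values `0` and `1` are arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()

# Columns of the feature matrix to put on each axis
x_index = 0
y_index = 1

fig, ax = plt.subplots()

# One point per sample, colored by its class label
ax.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
ax.set_xlabel(iris.feature_names[x_index])
ax.set_ylabel(iris.feature_names[y_index])
```

In a notebook the figure is rendered when the cell finishes; in a plain script, call `plt.show()` at the end.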

## Ex 1.3.1

**Change** `x_index` **and** `y_index` **in the above script
and find a combination of two features
which maximally separates the three classes.**

This exercise is a preview of **dimensionality reduction**.

## Ex 1.4

Load the `digits` data and print some data extracted from the dataset.

#### Hint: See `digits.data` and `digits.images`. Do you see any difference?
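The sketch below shows one way to compare the two attributes. It loads the data with `load_digits` and checks whether `digits.data` holds the same values as `digits.images` with each image flattened into a single row.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# data: one flat row of pixel values per sample
print("data shape:  ", digits.data.shape)

# images: the same samples kept as 2D pixel grids
print("images shape:", digits.images.shape)

# Flatten each image into a row and compare with the feature matrix
flattened = digits.images.reshape(len(digits.images), -1)
same_values = np.array_equal(digits.data, flattened)
print("same values:", same_values)
```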

## 2. Supervised Learning: Classification

<img src="images/supervised_workflow.svg" width="80%">

## Ex 2.1

As classification is a supervised task, and we are interested in how well the model generalizes, we split our data into a training set,
to build the model from, and a test set, to evaluate how well our model performs on new data. 
The ``train_test_split`` function from the ``model_selection`` module does that for us, by default randomly splitting off 25% of the data for testing.

<img src="images/train_test_split.svg" width="80%">

Load data from the IRIS dataset and split training and test sets
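A minimal sketch of the split, assuming a recent scikit-learn release where `train_test_split` is imported from `sklearn.model_selection`. The `random_state=0` seed is an arbitrary choice that only makes the split reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print("training set:", X_train.shape, y_train.shape)
print("test set:    ", X_test.shape, y_test.shape)
```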

## Ex 2.2 

`fit` a `LogisticRegression` model to classify on training data

#### Hint: Suggestions in the text !-)

## Ex 2.3

Make Predictions on the remaining test set data

#### Hint: Leverage the previous `train-test` split
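Fitting and predicting could look like the sketch below. It repeats the loading and splitting steps so that it runs on its own. The `max_iter=1000` setting is an assumption made here to give the solver room to converge, and `random_state=0` is an arbitrary seed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Learn the model parameters from the training set only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict a class label for each held-out sample
y_pred = clf.predict(X_test)

print("predicted:", y_pred[:10])
print("actual:   ", y_test[:10])
```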

## Ex 2.4

Try a new Machine Learning model: `KNeighborsClassifier`

## Ex 2.5

Calculate the accuracy `score` on test data with the `KNeighborsClassifier`

#### Hint: Clues in the text (as usual)
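Because scikit-learn estimators share the same interface, switching models mostly means changing the class name. The sketch below assumes `n_neighbors=5` and `random_state=0`, both arbitrary choices, and uses the `score` method, which returns the mean accuracy for classifiers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Same fit / predict / score interface as LogisticRegression
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Fraction of test samples whose predicted label is correct
accuracy = knn.score(X_test, y_test)
print("test accuracy:", accuracy)
```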

# 3. Regression

In regression we try to predict a continuous output variable. 

This can be most easily visualized in one dimension.

## Linear Regression

Once again, one of the simplest models is a linear one, which simply tries to predict the data as lying on a line. One way to find such a line is `LinearRegression` (also known as ordinary least squares).
The interface for `LinearRegression` is exactly the same as for the classifiers before, except that ``y`` now contains float values instead of classes.

## Ex 3.1

Generate `X` data and `y` target picking from the `sin` function plus some random noise (i.e. `np.random.uniform`)

## Ex 3.2

Fit a linear regression model using `sklearn.linear_model.LinearRegression`
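One way to set this up is sketched below. The exact target function (`sin(4x) + x`), the number of points and the seed are illustrative assumptions, not part of the exercise statement. Note that scikit-learn expects `X` to be two-dimensional even when there is a single feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility

# 100 points on a line, with a noisy sine-based target
x = np.linspace(-3, 3, 100)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))

# Reshape to (n_samples, n_features) = (100, 1)
X = x[:, np.newaxis]

# Same fit / predict interface as the classifiers
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)

print("slope:    ", regressor.coef_)
print("intercept:", regressor.intercept_)
```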

## Ex 3.3

Plot original data and predicted data

## Ex 3.4

Try to think of a better model to fit. Fit and plot the data, as in previous exercises.

#### Hint: `KNeighborsRegressor` model, maybe?
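In scikit-learn the k-nearest-neighbors regressor is the class `KNeighborsRegressor` in `sklearn.neighbors`. The sketch below compares it with a straight-line fit on the same kind of synthetic data; the data-generating function, `n_neighbors=3` and the seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)  # arbitrary seed
x = np.linspace(-3, 3, 100)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]

linear = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# For regressors, score returns the R^2 coefficient of determination.
# Both scores are computed on the training data, so they show how
# closely each model follows the curve, not how well it generalizes.
linear_r2 = linear.score(X, y)
knn_r2 = knn.score(X, y)
print("linear R^2:", linear_r2)
print("kNN R^2:   ", knn_r2)
```

A fair comparison would evaluate both models on held-out data, for example with `train_test_split`.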

# 4. Unsupervised Learning

<img src="images/unsupervised_workflow.svg" width="80%">

## Clustering

Clustering is the task of gathering samples into groups of similar
samples according to some predefined similarity or dissimilarity
measure (such as the Euclidean distance).
In this section we will explore a basic clustering task on some synthetic and real datasets.

Here are some common applications of clustering algorithms:

- Compression, in a data reduction sense
- Can be used as a preprocessing step for recommender systems
- Similarly:
   - grouping related web news (e.g. Google News) and web search results
   - grouping related stock quotes for investment portfolio management
   - building customer profiles for market analysis
- Building a code book of prototype samples for unsupervised feature extraction



## Ex 4.1

Generate some random data, organised in blobs (i.e. artificially created agglomerations of points in space)

#### Hint: Take a look at `make_blobs`

## Ex 4.2

Now we will use one of the simplest clustering algorithms, K-means.
This is an iterative algorithm which, in our case, searches for three cluster
centers such that the sum of squared distances from each point to its
assigned cluster center is minimized.


#### Hint: See `sklearn.cluster.KMeans`
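The two clustering exercises could be combined as in the sketch below. The sample count, the seed and `n_init=10` (the number of random restarts) are arbitrary choices. The true blob labels returned by `make_blobs` are deliberately discarded, since clustering is unsupervised.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 two-dimensional points drawn around 3 centers; ignore true labels
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Search for three cluster centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# fit_predict learns the centers and returns a cluster index per sample
labels = kmeans.fit_predict(X)

print("cluster centers:\n", kmeans.cluster_centers_)
print("first labels:", labels[:10])
print("sum of squared distances:", kmeans.inertia_)
```

The cluster indices are arbitrary: cluster `0` found by K-means need not correspond to blob `0` generated by `make_blobs`.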