### BIO-210: Projects in Informatics for SV
# Python Introduction 3 - Scikit-learn

Scikit-learn is a Python library offering a set of tools for data mining, data analysis and machine learning (https://scikit-learn.org/stable/). Today you will learn how to load a dataset and how to perform some elementary exploration and visualization. You will then apply two important data analysis techniques: a clustering algorithm (k-means) and linear regression. Enjoy!

In [None]:
import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics
import sklearn.model_selection as cv
import numpy as np
import scipy as sp

# We will use a custom library with visualizations developed for this exercise
import lib.viz as viz

Today we will make plotting easy for you, as the lesson about visualization will come later in this course. For the moment, just call the function <code>viz.plot()</code> and give it the following arguments: x, y, color, plot_type ('line' or 'scatter'). Here follows a minimal example:

In [None]:
x = np.linspace(0, 1, 100)
y = np.sqrt(x)
viz.plot(x, y, 'green', 'line')

<code>sklearn.datasets</code> is the scikit-learn module for handling datasets. It includes some toy datasets to experiment with your algorithms, but it also allows you to load real-world datasets or to generate data with specific structures, as in the following example:

In [None]:
X, y = sk_data.make_blobs(n_samples=100, centers=3, n_features=2, center_box=(-30.0, 30.0), random_state=0)
print(X.shape, y.shape, type(y))

# Plot each blob in its own color: select the rows of one blob, then column 0 (x) and column 1 (y)
viz.plot(x=X[y == 1, 0], y=X[y == 1, 1], color='blue', plot_type='scatter')
viz.plot(x=X[y == 0, 0], y=X[y == 0, 1], color='green', plot_type='scatter')
viz.plot(x=X[y == 2, 0], y=X[y == 2, 1], color='red', plot_type='scatter')

## Clustering (k-means)

The k-means algorithm clusters data by trying to separate samples into k groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.

Do you remember the bonus exercise from last week?
We will first implement k-means with numpy and then look at how to do it with scikit-learn!

**Exercise 1**. Implement k-means clustering to group features related to potential breast cancer masses. Clustering algorithms are used to group data that are similar to each other. In this case we would like to create 2 clusters. If the features are meaningful, each group should include a majority of positive (breast cancer) or negative (non breast cancer) outcomes. Proceed as follows:

1 - Run the cell below, which downloads the dataset and saves the breast cancer features and target labels (cancer / non-cancer)

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

features = data.data
target = data.target
print("Shape of the feature matrix: ", features.shape)
print("Shape of the label vector: ", target.shape)

2 - Normalize the features by subtracting the mean from each column and dividing it by its standard deviation. Normalizing the features is a standard way to make the clustering robust to the scale of the features. You can use the relevant <code>numpy</code> functions to do so:

In [None]:
# Your code here
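To illustrate the idea without giving away the exercise, here is a minimal sketch of column-wise normalization on a made-up matrix. The array <code>toy</code> and its values are assumptions chosen for illustration; the same pattern should carry over to the real feature matrix.

```python
import numpy as np

# Hypothetical toy matrix: 4 samples (rows), 2 features (columns) on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

# axis=0 computes one mean and one standard deviation per column (feature)
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)

# Broadcasting applies the per-column statistics to every row
toy_norm = (toy - col_mean) / col_std

print(toy_norm.mean(axis=0))  # close to 0 for each column
print(toy_norm.std(axis=0))   # close to 1 for each column
```

After normalization the two columns of this toy matrix become identical, which shows that the original difference in scale no longer matters.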

3 - Each cluster is characterized by a centroid, which is its center of mass. The k-means algorithm will start from two random centroids and iteratively update their values. Define the initial values of the centroids by creating 2 vectors of size equal to the number of features, containing random values sampled from a standard normal distribution.

In [None]:
# Your code here

Now define the iteration loop, which should run until the centroids do not change their value for two consecutive iterations (or the cluster assignment does not change for two consecutive iterations). In each step:

4 - Assign each element of the dataset to the closest centroid. Measure the distance between each centroid and an element with the standard Euclidean distance. If the element is closer to centroid 0, then it belongs to cluster 0. Otherwise it belongs to cluster 1. Run this assignment for all the elements.

5 - Update the centroids. They are the average of all the elements assigned to their cluster. Hint: if <code>features</code> is your features matrix and <code>clusters</code> the vector of the cluster assignment, you can get the features of the elements in a certain cluster with the code <code>features[clusters == cluster_id]</code>
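Steps 4 and 5 can be sketched for a single iteration on a tiny made-up dataset. The arrays <code>points</code> and <code>toy_centroids</code> are hypothetical, and the starting centroids are hand-picked rather than random, so this is only an illustration of the mechanics and not the full loop asked for in the exercise.

```python
import numpy as np

# Hypothetical toy data: 6 points in 2 dimensions forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

# Hand-picked starting centroids (the exercise asks for random ones instead)
toy_centroids = np.array([[1.0, 1.0], [8.0, 8.0]])

# Step 4: Euclidean distance from every point to every centroid.
# Broadcasting gives an array of shape (n_points, n_centroids).
distances = np.linalg.norm(points[:, None, :] - toy_centroids[None, :, :], axis=2)
toy_clusters = distances.argmin(axis=1)  # index of the closest centroid

# Step 5: each centroid becomes the mean of the points assigned to it
new_centroids = np.array([points[toy_clusters == k].mean(axis=0) for k in range(2)])

print(toy_clusters)
print(new_centroids)
```

Note that this sketch does not handle an empty cluster: if no point is assigned to a centroid, the mean of an empty selection is NaN. With random starting centroids this can happen, so your own loop may need to deal with it.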

Verify that the algorithm converges in a finite number of steps. Once the clustering is completed, check the distribution of target labels associated with the elements of each cluster (Hint: for both clusters, count the elements with label 0 or 1). If the distribution is substantially different between the two clusters, it means that this simple algorithm has learnt how to approximately distinguish a cancer mass from a non-cancer one!

In [None]:
# Your code here

### Now let's look at how to run k-means with scikit-learn:

Clustering of unlabeled data can be performed with the module <code>sklearn.cluster</code>.

One important thing to note is that the clustering algorithms can take different kinds of matrix as input. All the methods accept standard data matrices of shape <code>(n_samples, n_features)</code>.

First we import the KMeans algorithm and a scaler object that helps us normalize our data.

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

We can use the **StandardScaler** to normalize our data:

In [None]:
sc = StandardScaler()
norm_features = sc.fit_transform(features)

The <code>KMeans</code> class, like several other clustering classes, offers 3 methods:
- **fit()** builds the model from the training data (e.g. finding the centroids)
- **predict()** assigns labels to test data after the model has been built
- **fit_predict()** does both on the same data (e.g. in k-means, it finds the centroids and assigns the labels to the dataset)
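The difference between the three methods can be seen in a small sketch on generated data. The names <code>blobs</code>, <code>train</code>, <code>test</code> and <code>model</code> are made up for this illustration, and <code>random_state=0</code> is set only to make the two runs comparable.

```python
import numpy as np
import sklearn.datasets as sk_data
from sklearn.cluster import KMeans

# Hypothetical toy data: 60 points in 2 blobs, split into 40 training and 20 test points
blobs, _ = sk_data.make_blobs(n_samples=60, centers=2, n_features=2, random_state=0)
train, test = blobs[:40], blobs[40:]

# fit() finds the centroids, predict() assigns new points to the closest centroid
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(train)
test_labels = model.predict(test)

# fit_predict() does both steps on the same data and returns the labels
same_data_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(train)

print(model.cluster_centers_.shape)  # one row per centroid
print(test_labels.shape)             # one label per test point
print(np.array_equal(same_data_labels, model.labels_))
```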

We are finally ready to run k-means clustering:

In [None]:
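kmeans = KMeans(init='k-means++', n_clusters=2, n_init=10)
# fit_predict() fits the model and returns the cluster index of each sample
# (the same values are also stored in kmeans.labels_)
labels = kmeans.fit_predict(norm_features)
centroids = kmeans.cluster_centers_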

A simple way to check whether the distribution of the two classes (cancer or non-cancer) differs between the clusters is to print the confusion matrix. The confusion matrix for an N-class problem (in our case N=2) is an N x N matrix in which each column represents a cluster and each row a label. If the numbers on one of the diagonals are considerably larger than on the other one, there is a label distribution imbalance between the clusters and therefore the algorithm worked well.

In [None]:
metrics.confusion_matrix(target, labels)
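To make the row and column convention concrete, here is a tiny sketch with made-up values. The lists <code>true_labels</code> and <code>cluster_ids</code> are hypothetical and unrelated to the breast cancer data.

```python
import sklearn.metrics as metrics

# Hypothetical labels for 5 samples and the cluster each one was assigned to
true_labels = [0, 0, 1, 1, 1]
cluster_ids = [1, 1, 0, 0, 1]

# Rows are true labels, columns are cluster indices
cm = metrics.confusion_matrix(true_labels, cluster_ids)
print(cm)
# [[0 2]
#  [2 1]]
```

Here the anti-diagonal dominates: both samples with label 0 ended up in cluster 1, and two of the three samples with label 1 ended up in cluster 0. The clusters separate the labels fairly well, only with swapped indices.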

For an overview of the other metrics functions available in scikit-learn, take a look at sklearn-metrics here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

**Exercise 2**. Compare the performance of the scikit-learn implementation of k-means with that of your own implementation. Measure the accuracy of each algorithm to check which implementation did a better job! Hint: to compute the accuracy, consider the cluster assignment as a label. If in cluster 0 there are more elements belonging to the "cancer" group, define "0" as the label "cancer" and "1" as the label "non-cancer", otherwise do the opposite. The label assignment can be different for the two clustering algorithms, because they do not know anything about the meaning of the label. Once you have assigned a prediction to each data point, compute the accuracy with <code>metrics.accuracy_score</code>.

In [None]:
# Your code here
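The mapping described in the hint can be sketched on made-up arrays. The names <code>toy_target</code>, <code>toy_clusters</code> and <code>toy_pred</code> are hypothetical, and the sketch assumes exactly two clusters and two labels encoded as 0 and 1.

```python
import numpy as np
import sklearn.metrics as metrics

# Hypothetical toy values, not taken from the breast cancer dataset
toy_target = np.array([0, 0, 0, 1, 1, 1, 1, 1])
toy_clusters = np.array([1, 1, 0, 0, 0, 0, 0, 1])

# Fraction of elements with label 1 among those assigned to cluster 0
share_of_ones = toy_target[toy_clusters == 0].mean()

if share_of_ones > 0.5:
    # Cluster 0 mostly contains label 1, so the cluster indices must be flipped
    toy_pred = 1 - toy_clusters
else:
    # Cluster 0 mostly contains label 0, so the indices already match the labels
    toy_pred = toy_clusters

toy_accuracy = metrics.accuracy_score(toy_target, toy_pred)
print(toy_accuracy)  # 0.75
```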

**Exercise 3.** Usually the scikit-learn implementation of k-means performs differently from your own implementation. Can you explain this in your own words? (hint: look up the meaning of the parameters <code>init='k-means++'</code> and <code>n_init</code>)

 *Your answer here*

## Linear Regression

In [None]:
from sklearn import linear_model

In your linear algebra class you have studied methods to solve linear systems, even when they are overdetermined. The standard way to find a solution in that case is the least squares method. Least squares is the basis of an important statistical tool: linear regression. In fact, datasets often define an overdetermined system, as there are many more data points than features. Of course, scikit-learn offers its own implementation of it. Consider the following linear system:


$$Y = \alpha +\beta X +\epsilon$$

We **know**: the dataset $X$ (in the form of a matrix) and the target vector $Y$

We **do not know**: the intercept $\alpha$, the coefficient vector $\beta$ and the residual noise vector $\epsilon$

**Goal:** Given $X$ and $Y$ produce estimates of $\alpha$ and $\beta$ denoted by $\widehat{\alpha}$ and $\widehat{\beta}$ 

Input data comes in the form of pairs $\left(X_i,Y_i\right)$  for $i=1,\ldots ,n$

The **true regression line**: For **every** individual it should hold that:
$$Y_i = \alpha +\beta X_i +\epsilon_i$$


**Error** for the $i$-th data point is: $$ \epsilon_i = Y_i-\alpha-\beta X_i $$


The **estimated regression line** : $$\widehat{Y_i}=\widehat{\alpha}+\widehat{\beta}X_i$$


**Residuals** measure the distance of each observation from the estimated regression line and are defined as follows: $$\widehat{\epsilon_i} = Y_i-\widehat{Y_i}$$

##### Ordinary Least Squares Regression as an optimization problem

**Question**: How do we find $\widehat{\alpha}$ and $\widehat{\beta}$?

**Answer**: By minimizing the *sum of squared residuals* (SSR):

\begin{eqnarray}
\text{SSR} & = & \sum_{i=1}^n \widehat{\epsilon_i}^2 \\
& = & \sum_{i=1}^n \left(Y_i-\widehat{Y_i}\right)^2
\end{eqnarray}
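As an illustration of this optimization problem, the following sketch estimates $\widehat{\alpha}$ and $\widehat{\beta}$ for a single feature with numpy's least squares solver and compares the SSR of the estimate with the SSR of the line used to generate the data. The data, the "true" values 2 and 3, and the helper <code>ssr</code> are assumptions made up for this example.

```python
import numpy as np

# Hypothetical data generated from Y = 2 + 3 X + noise
rng = np.random.default_rng(0)
x_toy = np.linspace(0.0, 1.0, 50)
y_toy = 2.0 + 3.0 * x_toy + rng.normal(scale=0.1, size=x_toy.size)

# Design matrix with a column of ones so that the intercept is estimated too
A = np.column_stack([np.ones_like(x_toy), x_toy])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, y_toy, rcond=None)

def ssr(alpha, beta):
    """Sum of squared residuals of the line alpha + beta * x on the toy data."""
    residuals = y_toy - (alpha + beta * x_toy)
    return np.sum(residuals ** 2)

print(alpha_hat, beta_hat)                       # estimates close to 2 and 3
print(ssr(alpha_hat, beta_hat), ssr(2.0, 3.0))   # the estimate never has the larger SSR
```

The least squares estimate minimizes the SSR on the observed data, so its SSR is at most that of any other line, including the one that generated the data.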

### Example I:
Generate a dataset using the <code>sk_data.make_regression()</code> function:

In [None]:
X, y = sk_data.make_regression(n_samples=100, n_features=1, bias=0.1, noise=42, random_state=1)
print(X.shape, y.shape)
viz.plot(X, y, 'blue', 'scatter')

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

regr.fit(X, y)
y_pred = regr.predict(X)

# The coefficients
print('Coefficients: \n', regr.coef_, regr.intercept_)
# The mean squared error
print('Mean squared error: %.2f'
      % metrics.mean_squared_error(y, y_pred))

# Plot outputs
viz.plot(X, y, 'blue', 'scatter')
viz.plot(X, y_pred, 'red', 'line')

### Now it's your turn!

**Exercise 4**: Analyze the **multi-dimensional** California housing data with a linear regression model.

First of all, load the housing data, which is already available in scikit-learn. It also comes with a lengthy description!

In [None]:
# Loading housing data
california = sk_data.fetch_california_housing()
X = california["data"]
y = california["target"]

print(california['DESCR'])

First step: split the data into training and testing. Hint: use the function <code>cv.train_test_split()</code>

In [None]:
# Your code here
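For reference, here is a minimal sketch of how <code>train_test_split</code> behaves on generated data. The variable names, the 25% test fraction and the <code>random_state</code> value are assumptions for illustration; choose your own split for the housing data.

```python
import sklearn.datasets as sk_data
import sklearn.model_selection as cv

# Hypothetical toy regression problem with 100 samples and 3 features
X_toy, y_toy = sk_data.make_regression(n_samples=100, n_features=3, noise=10, random_state=1)

# Hold out 25% of the samples for testing; random_state makes the split reproducible
X_toy_train, X_toy_test, y_toy_train, y_toy_test = cv.train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

print(X_toy_train.shape, X_toy_test.shape)  # (75, 3) (25, 3)
print(y_toy_train.shape, y_toy_test.shape)  # (75,) (25,)
```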

Now fit a linear regression model on the training data and evaluate it by computing the MSE on both the training set and the test set:

In [None]:
# Your code here

Print the coefficients for all features:

In [None]:
# Your code here

Plot the feature *MedInc* and the corresponding regression line:

In [None]:
# Your code here

Look at some example predictions on the test set and compare to the ground-truth labels:

In [None]:
# Your code here