# Introduction to the Course

In this series of workshops, we will develop our understanding of machine learning techniques so that we can ultimately apply them to our research problems in the conus exogenomics group.

We will begin the course by becoming familiar with the popular machine learning library: __scikit-learn__.

![alt-text](http://www.scipy-lectures.org/_images/scikit-learn-logo.png)

## Installation

These tutorials will require recent installations of:
* numpy
* pandas
* scipy
* matplotlib
* scikit-learn
* ipython with ipython notebook (Jupyter Notebook)

To install all of these at once, I suggest installing [Anaconda](https://www.continuum.io/what-is-anaconda).
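To confirm that the packages are available before starting, a short check along the following lines can help. This is a minimal sketch: the helper name `installed_versions` is hypothetical, and the list uses import names (scikit-learn imports as `sklearn`; pandas is included because the notebook imports it later).

```python
import importlib

# Import names of the required packages. scikit-learn is imported as "sklearn";
# pandas is used later in this notebook.
PACKAGES = ["numpy", "scipy", "matplotlib", "sklearn", "pandas", "IPython"]


def installed_versions(names=PACKAGES):
    """Return {import name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in installed_versions().items():
    print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before running the notebooks.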

## Downloading Course Materials

To download the course materials, I highly recommend installing __git__, and creating a __GitHub__ account if you haven't already. 

To install git on Debian or Ubuntu, run the following command in the terminal:

```
sudo apt-get install git
```

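Once git is installed, you can clone the repository that contains the workshop material by running the following command. Note that `git clone` takes the URL of the repository, not of a subdirectory; after cloning, the tutorials are in `machine_learning/workshops/sklearn/tutorials` inside the repository:

```
git clone https://github.com/INASIC/conus-exogenomics.git
```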

After you have downloaded the material, start Jupyter by running the following command in the terminal, then navigate to the relevant notebooks:

```
jupyter notebook
```



## Table of Contents

1. Linear Regression
 * Loading the dataset
 * Preparing the dataset
 * Fitting the model
 * Evaluating the model
 * Exercise
2. Introduction to Classification
 * The Iris dataset
 * Scikit-learn's built-in datasets
 * Logistic Regression
 * K-Nearest Neighbors
 * Exercises

# 1. Linear Regression

Regression is a method used in machine learning to predict a __continuous__ output. That is, the predicted value can be any real number within a range, as opposed to one of a fixed set of discrete values.

We will begin the workshop by looking at the simple linear regression algorithm to fit a straight line onto a dataset. 

## The Investigation

The dataset we will look at is taken from the following source:

R.J. Gladstone (1905). "*A Study of the Relations of the Brain to the Size of the Head*", Biometrika, Vol. 4, pp. 105-123

Here, we will investigate the relationship between brain weight (in grams) and head size (in cubic centimetres) for 237 adults classified by gender and age group.

## Import Dependencies

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# suppress warning messages
import warnings; warnings.simplefilter('ignore')

## Loading the Dataset

Here, we use pandas to load the dataset into a pandas __dataframe__.

We then print out the first few lines of this dataset using the __.head()__ method.

In [None]:
# skip lines starting with '#'; columns are separated by whitespace
df = pd.read_csv('./data/dataset_brain.txt',
                 encoding='utf-8',
                 comment='#',
                 sep=r'\s+')
df.head()
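
To see what the `comment` and `sep` arguments do, the sketch below parses a small in-memory sample instead of the course file. The rows are made up for illustration, and the `gender` and `age-group` column names are assumptions; only `head-size` and `brain-weight` are names used in this notebook.

```python
import io

import pandas as pd

# Made-up sample: a '#' comment line, a header row, then whitespace-separated
# values with uneven spacing. None of these numbers come from the real dataset.
raw = """\
# hypothetical comment line
gender age-group head-size brain-weight
1 1 4000 1500
2   2   3500   1200
"""

# comment='#' drops the comment line; sep=r'\s+' splits on runs of whitespace.
sample = pd.read_csv(io.StringIO(raw), comment='#', sep=r'\s+')
print(sample.shape)
print(list(sample.columns))
```

The comment line is ignored, and the uneven spacing in the last row still produces four columns.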

## Visualize the Dataset

Here, we use matplotlib to visualize the dataset.

In [None]:
plt.scatter(df['head-size'], df['brain-weight'])
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

## Preparing the Dataset

Before we can apply linear regression to our data, we first need to extract the target and feature arrays from the pandas dataframe.

In [None]:
y = df['brain-weight'].values

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
X = df['head-size'].values
X = X[:, np.newaxis]
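
The effect of `np.newaxis` is easiest to see on a tiny array. The sketch below uses made-up values and also shows `reshape(-1, 1)`, which gives the same result.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # 1-D array, shape (3,)
X_a = x[:, np.newaxis]          # add a column axis, shape (3, 1)
X_b = x.reshape(-1, 1)          # same result; -1 lets NumPy infer the row count

print(x.shape, X_a.shape, X_b.shape)
```

Each row of the 2-D array is one sample, and each column is one feature.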

## Separating the Data

Here, we split our dataset into two sections: 
1. Training data
    * What we train our algorithm on
2. Testing data
    * What we use to evaluate the performance of our algorithm, after training
    
We then visualize this split dataset to get a better intuition into the spread of our data.

In [None]:
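# train_test_split lives in sklearn.model_selection;
# the older sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split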

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=123)
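
As a rough check of what `test_size` and `random_state` do, the sketch below splits stand-in arrays with 237 rows (the number of adults in the study) rather than the real data. The names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 237 rows; the values are just row indices.
X_demo = np.arange(237).reshape(-1, 1)
y_demo = np.arange(237)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=123)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=123)
print(np.array_equal(X_te, X_te2))
```

Roughly a third of the rows end up in the test set, and fixing `random_state` makes the split repeatable between runs.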

In [None]:
plt.scatter(X_train, y_train, c='blue', marker='o')
plt.scatter(X_test, y_test, c='red', marker='s')
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

## Training our Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
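
`LinearRegression` fits the line by ordinary least squares. For a single feature, the slope and intercept have a closed form, sketched below in plain NumPy on toy data that lies exactly on y = 2x + 1. This illustrates the idea only and is not how scikit-learn computes the fit internally.

```python
import numpy as np

# Toy data on the line y = 2x + 1 (illustrative, not the brain dataset).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Ordinary least squares for one feature:
#   slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_centered = x - x.mean()
slope = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```

The recovered slope and intercept play the same role as `lr.coef_` and `lr.intercept_` in the fitted model.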

## Evaluating the Performance of our Model

In [None]:
# R^2 (coefficient of determination) on the test set
lr.score(X_test, y_test)

In [None]:
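# residual sum of squares: squared errors of the model's predictions
res_sum_of_squares = ((y_test - y_pred) ** 2).sum()
# total sum of squares: squared deviations of the targets from their mean
total_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()
r2 = 1 - (res_sum_of_squares / total_sum_of_squares)
print('R2 score: %s' % r2)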
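
The manual calculation above is the coefficient of determination: one minus the residual sum of squares divided by the total sum of squares. The sketch below uses made-up numbers to check that the same formula agrees with `sklearn.metrics.r2_score`. The function is accessed through the `metrics` module so that it cannot clash with any variable in the notebook.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = ((y_true - y_hat) ** 2).sum()          # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, metrics.r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of the targets.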

In [None]:
lr.coef_

In [None]:
lr.intercept_

## Visualizing the Regression Line

In [None]:
# lr.coef_ holds one slope per feature; take the single entry as a scalar
min_pred = X_train.min() * lr.coef_[0] + lr.intercept_
max_pred = X_train.max() * lr.coef_[0] + lr.intercept_

plt.scatter(X_train, y_train, c='blue', marker='o')
plt.plot([X_train.min(), X_train.max()],
         [min_pred, max_pred],
         color='red',
         linewidth=4)
plt.suptitle('Linear Regression Algorithm')
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)')

# Exercise: k-Nearest Neighbors Regressor

Using the same dataset, train a k-nearest neighbors regressor (`KNeighborsRegressor`) to fit the data. How does this algorithm compare to linear regression?
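
As a starting point, the sketch below shows the basic `KNeighborsRegressor` API on toy data rather than the brain dataset. The arrays and the choice of `n_neighbors=2` are illustrative assumptions, not a recommended setting for the exercise.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: six points on the line y = x (illustrative only).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# With the default uniform weights, the prediction is the mean target
# of the n_neighbors closest training points.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_toy, y_toy)

pred = knn.predict(np.array([[2.4]]))
print(pred)
```

For the query 2.4, the two nearest training points are 2 and 3, so the prediction is the average of their targets. Unlike linear regression, the result is a piecewise-constant function rather than a single straight line.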

In [None]:
from sklearn.neighbors import KNeighborsRegressor



## Solutions

In [None]:
# %load solutions/01_regression_solutions.py

## Next Workshop

[Introduction to Classification](https://github.com/INASIC/conus-exogenomics/tree/master/machine_learning/workshops/sklearn/tutorials/classification)