# Exercise 1 - Machine Learning Basics

This exercise is based on https://github.com/rasbt/pydata-chicago2016-ml-tutorial

# Table of Contents

* [1 Linear Regression](#1--Linear-Regression)
    * [Loading the dataset](#Loading-the-dataset)
    * [Preparing the dataset](#Preparing-the-dataset)
    * [Fitting the model](#Fitting-the-model)
    * [Evaluating the model](#Evaluating-the-model)
* [2 Classification](#2-Classification)
    * [The Iris dataset](#The-Iris-dataset)
    * [Class label encoding](#Class-label-encoding)
    * [Scikit-learn's in-build datasets](#Scikit-learn's-in-build-datasets)
    * [Test/train splits](#Test/train-splits)
    * [Logistic Regression](#Logistic-Regression)
    * [K-Nearest Neighbors](#K-Nearest-Neighbors)

# 1  Linear Regression

### Loading the dataset

We will use a dataset from an old publication that studied the relation between brain weight and head size for different genders and age ranges.

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to
the Size of the Head", Biometrika, Vol. 4, pp. 105-123

The dataset is stored in a file called 
**`dataset_brain.txt`**

Description: Brain weight (grams) and head size (cubic cm) for 237 adults classified by gender and age group.

Variables/Columns
- Gender (1=Male, 2=Female)
- Age Range (1=20-46, 2=46+)
- Head size (cm^3)
- Brain weight (grams)


### Task 1: Print the first 30 lines of the dataset

We will use **`pandas`** to read in the dataset.

https://pandas.pydata.org/pandas-docs/stable/

'pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.'


In [None]:
import pandas as pd

The file is a plain-text table with whitespace-separated columns (comment lines start with `#`), and we will use a pandas **`DataFrame`** to handle the data.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame

In [None]:
df = pd.read_csv('dataset_brain.txt', 
                 encoding='utf-8', 
                 comment='#',
                 sep=r'\s+')
df.head()
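
To see what the `comment` and `sep` options do, here is a minimal sketch that parses a small in-memory table instead of the real file. The three data rows are made up for illustration and are not taken from `dataset_brain.txt`; the column names `gender` and `age-group` are assumptions, only `head-size` and `brain-weight` appear in the cells of this notebook.

```python
import io
import pandas as pd

# Hypothetical miniature file: the '#' line is a comment, and columns are
# separated by a variable amount of whitespace.
raw = """# toy example, values are made up
gender  age-group  head-size  brain-weight
1       1          4000       1400
2       1          3500       1250
1       2          3800       1300
"""

# comment='#' skips the comment line, sep=r'\s+' splits on any whitespace
toy = pd.read_csv(io.StringIO(raw), comment='#', sep=r'\s+')
print(toy.shape)             # (rows, columns)
print(list(toy.columns))
```

The raw string `r'\s+'` is a regular expression meaning "one or more whitespace characters", so the number of spaces between columns does not matter.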

Let's look at the relation of the brain weight to the head size by plotting them in a 2D scatter plot. We will use **`matplotlib`** for that.

https://matplotlib.org/



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

We can access the columns of the pandas DataFrame simply by using their names as keys.

In [None]:
plt.scatter(df['head-size'], df['brain-weight'])
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

### Preparing the dataset

In order to use the dataset, we need to retrieve a numpy array containing only the values.

http://www.numpy.org/

In [None]:
import numpy as np

In [None]:
y = df['brain-weight'].values
print(y)

How many data points do we have?

In [None]:
y.shape

The same with the head size:

In [None]:
X = df['head-size'].values
print(X)
X.shape

scikit-learn expects the features as a 2D array of shape (n_samples, n_features). Instead of one flat array, we therefore need n rows containing one value each:

In [None]:
X = X[:, np.newaxis]
print(X)
X.shape
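
As a small illustration of what `np.newaxis` does to the shape, here is a sketch on a made-up three-element array. The `reshape(-1, 1)` call is shown as an equivalent alternative; it is not used elsewhere in this exercise.

```python
import numpy as np

a = np.array([10, 20, 30])      # 1D array, shape (3,)
col = a[:, np.newaxis]          # 2D column, shape (3, 1)
col_alt = a.reshape(-1, 1)      # equivalent spelling

print(a.shape, col.shape)
print(np.array_equal(col, col_alt))
```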

We will use the machine learning library **`scikit-learn`** in the following.

http://scikit-learn.org/stable/


A very useful feature of scikit-learn is that it can easily split a dataset into a training and a testing set. Here the dataset is split randomly with seed 123; the test size is 30% and the train size 70%:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)
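
The effect of `test_size` and `random_state` can be checked on a tiny made-up dataset. This is only a sketch: the ten samples below are invented, and the variable names (`X_toy`, `Xa_train`, ...) are chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(10).reshape(-1, 1)   # 10 samples, 1 feature
y_toy = np.arange(10) * 2              # target is twice the feature

# Two calls with the same seed
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123)

print(len(Xa_train), len(Xa_test))           # sizes of the two parts
print(np.array_equal(Xa_test, Xb_test))      # same seed, same split
print(np.array_equal(ya_test, 2 * Xa_test.ravel()))  # X and y stay aligned
```

Fixing the seed makes the random split reproducible, which is useful when results should be comparable between runs.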

### Task 2: Plot the training and testing datasets as separate series in a 2D scatter plot again, including axis labels. Use different colors (option `c='blue'` or `color='blue'`) and different markers (option `marker='o'`).

https://matplotlib.org/api/colors_api.html

https://matplotlib.org/api/markers_api.html

### Fitting the model

We would like to fit the training data now using the LinearRegression model of scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

It fits a linear function to the data using the ordinary least squares method.

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

OK, what is the result of the fit?

In [None]:
# The coefficients (one slope per feature)
print('Coefficients: \n', lr.coef_)
# The intercept
print('Intercept: \n', lr.intercept_)
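
The fitted model is the straight line `prediction = coef_[0] * head size + intercept_`. As a sketch of what ordinary least squares does, the following fits a line to made-up, noise-free points with NumPy's `np.linalg.lstsq`. It is meant as an illustration and is not necessarily how `LinearRegression` is implemented internally.

```python
import numpy as np

# Made-up points that lie exactly on y = 2 * x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y_line = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y_line, rcond=None)

print('slope: %.2f, intercept: %.2f' % (slope, intercept))

# A prediction is just the line evaluated at a new x
x_new = 4.0
print('prediction at x=4: %.2f' % (slope * x_new + intercept))
```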

OK, let's plot this linear function.

In [None]:
plt.scatter(X_test, y_test,  color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=3)
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

### Evaluating the model

How do we know if the fit was good? We need to define a performance measure. One way is to calculate the **Coefficient of determination**, denoted R^2. It is the proportion of the variance in the dependent variable that is predictable from the independent variables. It is calculated the following way:

In [None]:
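# Residual sum of squares: squared differences between truth and prediction
res_sum_of_squares = ((y_test - y_pred) ** 2).sum()
# Total sum of squares: squared differences between truth and its mean
total_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()
r2_score = 1 - (res_sum_of_squares / total_sum_of_squares)
print('R2 score: %.2f' % r2_score)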

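Values close to 1 mean good agreement. The score is at most 1; on the training data of a least-squares fit it lies between 0 and 1, but on test data it can become negative if the model predicts worse than simply using the mean of `y_test`. Luckily, scikit-learn already includes several performance measures (metrics) for regression: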

http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# The mean absolute error
print("Mean absolute error: %.2f" % mean_absolute_error(y_test, y_pred))
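
To make the three metrics concrete, they can be computed by hand on a few made-up numbers. This is a sketch with NumPy only; the helper `r2` and the arrays `good` and `bad` are names invented for this example.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])

def r2(y_true, y_hat):
    ss_res = ((y_true - y_hat) ** 2).sum()          # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - ss_res / ss_tot

good = np.array([1.1, 1.9, 3.2, 3.8])   # close to the truth
bad = np.array([4.0, 3.0, 2.0, 1.0])    # worse than predicting the mean

print('R2 good: %.2f' % r2(y_true, good))
print('R2 bad:  %.2f' % r2(y_true, bad))
print('MSE good: %.3f' % np.mean((y_true - good) ** 2))
print('MAE good: %.3f' % np.mean(np.abs(y_true - good)))
```

The second prediction gives a negative R2, because its residual sum of squares is larger than the total sum of squares.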



# 2 Classification

### The Iris dataset

### Task 3: The Iris flower dataset is stored in the file **`dataset_iris.txt`**. Read in the dataset using a pandas DataFrame and have a look at the first entries.

We now need to create a 150x4 design matrix containing only our feature values. In order to do that, we need to strip the class column from the dataset. We use the **`iloc`** indexer for that:

`DataFrame.iloc
Purely integer-location based indexing for selection by position.`

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html

In [None]:
X = df.iloc[:, :4]
X

And now we get a 150x4 numpy array (design matrix) by using the `values` attribute:

In [None]:
X = X.values
X
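
The two steps, position-based selection with `iloc` and conversion with `values`, can be tried on a small made-up frame. The column names and numbers below are invented for this sketch and do not come from `dataset_iris.txt`.

```python
import pandas as pd

# Made-up frame: two feature columns followed by a label column
toy = pd.DataFrame({
    'length': [5.1, 4.9, 6.3],
    'width': [3.5, 3.0, 3.3],
    'class': ['a', 'a', 'b'],
})

features = toy.iloc[:, :2]      # all rows, first two columns (by position)
X_toy = features.values         # plain NumPy array without labels

print(list(features.columns))
print(type(X_toy).__name__, X_toy.shape)
```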

However, we also need a numpy array containing the class labels in order to classify. Let's get the class column and create a numpy array out of it:

In [None]:
y = df['class'].values
y

We could also just inspect the targets by only looking at unique values:

In [None]:
np.unique(y)

### Class label encoding

We will now use the **`LabelEncoder`** class to convert the class labels into numerical labels:

In [None]:
from sklearn.preprocessing import LabelEncoder

l_encoder = LabelEncoder()
l_encoder.fit(y)
l_encoder.classes_

By simply using **`transform`**, we can convert the labels into numerical targets:

In [None]:
y_enc = l_encoder.transform(y)
y_enc

Or just the unique values:

In [None]:
np.unique(y_enc)

We can also convert it back by using **`inverse_transform`**:

In [None]:
np.unique(l_encoder.inverse_transform(y_enc))
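
The encoding is easier to follow on a handful of made-up labels. The sketch below also shows that the same integer codes can be obtained with `np.unique(..., return_inverse=True)`; this comparison is an illustration and makes no claim about how `LabelEncoder` works internally.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['b', 'a', 'c', 'a'])     # made-up string labels

enc = LabelEncoder().fit(labels)
codes = enc.transform(labels)

print(enc.classes_)                   # sorted unique labels
print(codes)                          # index of each label in classes_
print(enc.inverse_transform(codes))   # back to the original strings

# The same mapping using NumPy only
uniques, inverse = np.unique(labels, return_inverse=True)
print(np.array_equal(codes, inverse))
```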

### Scikit-learn's in-build datasets

Scikit-learn also ships with a couple of built-in datasets:

http://scikit-learn.org/stable/datasets/index.html

The iris dataset is part of it, which you can simply load:

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
print(iris['DESCR'])

We get the feature design matrix by calling data:

In [None]:
iris.data

And the target array:

In [None]:
iris.target

### Test/train splits

OK, now we need to split the dataset into training and testing sets again. Let's first assign the design matrix to X and the target to y:

In [None]:
X, y = iris.data[:, :], iris.target
# Note: all four features are used here; iris.data[:, :2] would select
# only the two sepal features (e.g. for plotting)


How many examples do we have of each class?

In [None]:
print('Class labels:', np.unique(y))
print('Class proportions:', np.bincount(y))

### Task 4: Split the dataset into 40% testing and 60% training sets. How many examples of each class do you expect in the training set? How many are there? What happened?

By default, the dataset is shuffled. What happens if we don't shuffle?

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42, shuffle=False)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))

OK, we want to shuffle, but we also want the same proportion of each class in both sets. We can achieve that by using the `stratify` option:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42,
        stratify=y)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))
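
The effect of `stratify` is easiest to see on imbalanced data. The following sketch uses ten made-up labels (six of class 0, four of class 1); all names in it are chosen for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: six samples of class 0, four of class 1
y_toy = np.array([0] * 6 + [1] * 4)
X_toy = np.arange(10).reshape(-1, 1)

# Only the label parts of the split are needed here
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy)

print('train counts:', np.bincount(y_tr))
print('test counts: ', np.bincount(y_te))
```

Both halves keep the 6:4 ratio of the full dataset, which a purely random split does not guarantee.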

### Task 5: Plot the sepal length vs the sepal width of the training set for the different classes in a scatter plot. You can set the colors to the classes with `c=y_train`

### Logistic Regression

Let's perform a classification using logistic regression:

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='newton-cg', 
                        multi_class='multinomial', 
                        random_state=1)

lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

OK, how do we evaluate the classification? We can choose one of the classification performance measures:

http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, roc_auc_score
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
print("Precision: %.2f" % precision_score(y_test, y_pred, average='weighted'))
print("Recall: %.2f" % recall_score(y_test, y_pred, average='weighted'))
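
With `average='weighted'`, precision and recall are computed per class and then averaged with the number of true samples of each class as weights. A sketch of that calculation with NumPy only, on six made-up labels (the arrays `y_true` and `y_hat` are invented for this example):

```python
import numpy as np

# Made-up labels for three classes with two samples each
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0])

classes = np.unique(y_true)
support = np.array([(y_true == c).sum() for c in classes])

# Per-class precision: correct predictions of c / all predictions of c
precision = np.array([((y_hat == c) & (y_true == c)).sum() / (y_hat == c).sum()
                      for c in classes])
# Per-class recall: correct predictions of c / all true samples of c
recall = np.array([((y_hat == c) & (y_true == c)).sum() / (y_true == c).sum()
                   for c in classes])

accuracy = (y_true == y_hat).mean()
w_precision = np.average(precision, weights=support)
w_recall = np.average(recall, weights=support)

print('Accuracy: %.2f' % accuracy)
print('Weighted precision: %.2f' % w_precision)
print('Weighted recall: %.2f' % w_recall)
```

Note that the support-weighted recall is always equal to the accuracy, so the two numbers carry the same information.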


Or we use the classification report function:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, roc_auc_score, classification_report
print('Classification Report:\n' + classification_report(y_test, y_pred))

### K-Nearest Neighbors

### Task 6: Perform a classification using K-nearest neighbors classifier and evaluate the performance.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
