# ANITA astroinformatics summer school 2019 - "Rise of the machines"

## Part III - Regression

This notebook provides an introduction to machine learning and walks through how to develop workflows for training machine learning models.

This lesson is prepared by:
- [Kevin Chai](http://computation.curtin.edu.au/about/computational-specialists/health-sciences/)
- [Rebecca Lange](http://computation.curtin.edu.au/about/computational-specialists/humanities/)

from the [Curtin Institute for Computation](http://computation.curtin.edu.au) at Curtin University in Perth, Australia. 

Some of the materials in this notebook have been referenced and adapted from:
- [Randal Olson's Data Science Notebook](https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/tree/master/example-data-science-notebook)
- [Sebastian Raschka's Python Machine Learning Notebooks](https://github.com/rasbt/python-machine-learning-book)
- [Kevin Markham's Scikit Learn Notebooks](https://github.com/justmarkham/scikit-learn-videos)

Make sure to open this notebook in the root directory of the code repository.

This work is made available under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

## Table of contents


1. [Problem definition](#1.-Problem-definition)
2. [Prepare the dataset](#2.-Prepare-the-dataset)
3. [Train models](#3.-Train-models)
4. [Feature normalisation](#4.-Feature-normalisation)
5. [Feature engineering](#5.-Feature-engineering)

### Required libraries

[[ go back to the top ]](#Table-of-contents)

This notebook uses several Python packages that come standard with the [Anaconda Python distribution](http://continuum.io/downloads). The primary libraries that we'll be using are:

* **NumPy**: a fast numerical array structure and helper functions.
* **pandas**: a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: a machine learning package.
* **matplotlib**: a basic plotting library; most other plotting libraries are built on top of it.
* **seaborn**: an advanced statistical plotting library.

To make sure you have all of the packages you need, install them with `conda`:

    conda install numpy pandas scikit-learn matplotlib seaborn

`conda` may ask you to update some of the packages if you don't have the most recent version. Allow it to do so.

Alternatively, you can install the packages with [pip](https://pip.pypa.io/en/stable/installing/) (a Python package manager):

    pip install numpy pandas scikit-learn matplotlib seaborn


# Regression

## 1. Problem definition

[[ go back to the top ]](#Table-of-contents)

Your colleague has heard about your recent success in applying machine learning to galaxy classification. They mention they have a dataset containing _ugriz_ photometry of galaxies but no redshift measurements. They would like you to develop a machine learning model to estimate the redshift values.

This is a regression problem. Regression is a supervised learning approach, but instead of predicting the category of an example we predict a continuous value, i.e. $y_i \in \mathbb{R}$. For example, the predicted redshift of galaxy $i$ might be 0.2312.

Developing a regression model follows the same process as building a classification model with the exception of using a different metric for evaluating the model performance. i.e. we can't use classification accuracy for evaluating our regression model.

## 2. Prepare the dataset

[[ go back to the top ]](#Table-of-contents)

Load and prepare the dataset for use with the scikit-learn library.

In [1]:
import numpy as np
import pandas as pd

# Set a random seed number to reproduce our results
seed = 11

# 1. Load the clean dataset using pandas read_csv function
df = pd.read_csv('data/galaxies-clean.csv')

# 2. Select the columns of interest for modelling
features = ['mag_u','mag_g','mag_r','mag_i','mag_z']

# 3. Create the features matrix as X
X = df[features]

# 4. Create the labels vector as y
y = df['redshift']


###  Split the dataset

Create a training and test dataset using the `train_test_split` function.

In [2]:
# Import the function
from sklearn.model_selection import train_test_split

# Split the dataset into X_train, X_test, y_train, y_test
# Use a training dataset size of 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8,
                                                   random_state = seed)
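
As a rough illustration of what the split produces, the sketch below runs `train_test_split` on synthetic data (the arrays `X_demo` and `y_demo` are made up for this example and are not the galaxy dataset). It checks the 80/20 partition sizes and shows that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the galaxy table (assumption): 100 rows, 5 features
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.normal(size=100)

# 80% of the rows go to training, the remaining 20% to testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=11
)
print(X_tr.shape, X_te.shape)  # (80, 5) (20, 5)

# Repeating the call with the same random_state gives the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=11
)
print(np.array_equal(X_te, X_te2))  # True
```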




## 3. Train models

[[ go back to the top ]](#Table-of-contents)

We will fit the regression models without any parameter tuning using the following four models:

- [Linear Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [RandomForestRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [MLPRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html)
- [BayesianRidge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html)

We will measure model performance using the root mean squared error (RMSE) metric. This can be calculated by using the NumPy function `np.sqrt` and the scikit-learn `mean_squared_error` function. 
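
For reference, the RMSE is $\sqrt{\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2}$. The following minimal sketch, using a handful of made-up redshift values (not taken from the dataset), computes it by hand and with scikit-learn, and compares it with the mean absolute error to show that RMSE weights large errors more heavily.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted redshifts (illustrative values only)
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.12, 0.18, 0.33, 0.36])

# RMSE from its definition
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# RMSE using scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# Mean absolute error for comparison; RMSE is never smaller than this
mae = np.mean(np.abs(y_true - y_pred))

print('RMSE (manual):  %.4f' % rmse_manual)   # 0.0287
print('RMSE (sklearn): %.4f' % rmse_sklearn)  # 0.0287
print('MAE:            %.4f' % mae)           # 0.0275
```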

In [4]:
# Step 1: Import the classes
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Step 2: Instantiate the estimators
lr = LinearRegression()
rf = RandomForestRegressor(random_state=seed)
mlp = MLPRegressor(random_state=seed)
br = BayesianRidge()

# Step 3: Fit the estimators on data (i.e. train the models)
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)
mlp.fit(X_train, y_train)
br.fit(X_train, y_train)

# Step 4: Generate predictions
y_pred_m1 = lr.predict(X_test)
y_pred_m2 = rf.predict(X_test)
y_pred_m3 = mlp.predict(X_test)
y_pred_m4 = br.predict(X_test)

# Calculate the Root Mean Squared Error (RMSE)
# np.sqrt(mean_squared_error(...))
m1_score = np.sqrt(mean_squared_error(y_test,y_pred_m1))
m2_score = np.sqrt(mean_squared_error(y_test,y_pred_m2))
m3_score = np.sqrt(mean_squared_error(y_test,y_pred_m3))
m4_score = np.sqrt(mean_squared_error(y_test,y_pred_m4))

# Display the model scores
print('Linear regression: %.3f' % (m1_score))
print('Random forest regressor: %.3f' % (m2_score))
print('Multi layer perceptron: %.3f' % (m3_score))
print('Bayesian Ridge: %.3f' % (m4_score))

Linear regression: 0.022
Random forest regressor: 0.026
Multi layer perceptron: 0.047
Bayesian Ridge: 0.022


The linear regression and Bayesian ridge models both achieve an RMSE of 0.022 in redshift, while the random forest regressor has an RMSE of 0.026. The multi layer perceptron (without any parameter tuning) achieves a poorer RMSE of 0.047.

Let's examine the correlation between the predicted and actual redshift values for each model.

In [5]:
# calculate the correlation between the labels and predictions
m1_corr = np.corrcoef(y_test, y_pred_m1)[0][1]
m2_corr = np.corrcoef(y_test, y_pred_m2)[0][1]
m3_corr = np.corrcoef(y_test, y_pred_m3)[0][1]
m4_corr = np.corrcoef(y_test, y_pred_m4)[0][1]

print('Linear regression: %.3f' % (m1_corr))
print('Random forest regressor: %.3f' % (m2_corr))
print('Multi layer perceptron: %.3f' % (m3_corr))
print('Bayesian Ridge: %.3f' % (m4_corr))

Linear regression: 0.888
Random forest regressor: 0.838
Multi layer perceptron: 0.602
Bayesian Ridge: 0.888


That's impressive. The linear regression and Bayesian ridge models both have a correlation of 0.888, while the multi layer perceptron has the lowest correlation of 0.602.
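
A word of caution when reading correlation values: the Pearson correlation only measures how well predictions track the labels linearly, so it is blind to a constant offset or rescaling. The sketch below uses synthetic values (an assumption for illustration) to show predictions with a perfect correlation but a large RMSE, which is why it is worth reporting both metrics.

```python
import numpy as np

# Synthetic "true" redshifts (assumption: uniform between 0 and 0.3)
rng = np.random.default_rng(0)
y_true = rng.uniform(0.0, 0.3, size=500)

# Predictions that follow the truth exactly but are stretched and shifted
y_biased = 2.0 * y_true + 0.1

corr = np.corrcoef(y_true, y_biased)[0, 1]
rmse = np.sqrt(np.mean((y_true - y_biased) ** 2))

print('correlation: %.3f' % corr)  # 1.000
print('RMSE: %.3f' % rmse)         # much larger than the model RMSEs above
```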

Let's look at the generated coefficients and intercept values for the linear regression model.

In [6]:
print(lr.intercept_)
print(lr.coef_)

-0.2941654156646464
[-0.02794643  0.25568438 -0.13844462 -0.37345313  0.3028921 ]


In [7]:
# pair coefficients with feature names
list(zip(features, lr.coef_))

[('mag_u', -0.027946434186041844),
 ('mag_g', 0.2556843840450466),
 ('mag_r', -0.13844461806634395),
 ('mag_i', -0.37345313161319504),
 ('mag_z', 0.3028921026506884)]

This shows that the linear regression model for estimating the redshift of a galaxy is:

<p style="text-align:center;font-weight:bold">$redshift = -0.028(mag_u) + 0.256(mag_g) - 0.138(mag_r) - 0.373(mag_i) + 0.303(mag_z) - 0.294$</p>
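
To make the link between this equation and `predict` concrete, here is a small sketch on synthetic magnitudes (the data and the names `X_demo`, `y_demo` and `lr_demo` are invented for illustration). It checks that a fitted `LinearRegression` predicts exactly the intercept plus the dot product of the features with the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic magnitudes and a noisy linear target (assumption, not real data)
rng = np.random.default_rng(11)
X_demo = rng.uniform(12.0, 24.0, size=(200, 5))
true_coef = np.array([-0.03, 0.26, -0.14, -0.37, 0.30])
y_demo = X_demo @ true_coef + rng.normal(scale=0.02, size=200)

lr_demo = LinearRegression().fit(X_demo, y_demo)

# Prediction written out by hand: intercept + sum(coef_j * feature_j)
manual = lr_demo.intercept_ + X_demo @ lr_demo.coef_

print(np.allclose(manual, lr_demo.predict(X_demo)))  # True
```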

## 4. Feature normalisation

[[ go back to the top ]](#Table-of-contents)

An important data preparation step for many machine learning models is to [normalise / standardise the feature values](http://scikit-learn.org/stable/modules/preprocessing.html) in our dataset, e.g. we can scale the magnitude values to a standardised range of [0, 1]. Note: this is not required for decision trees and random forests, as those models are invariant to the scale of the features.

We can normalise our features to a range of [0, 1] using the scikit-learn `MinMaxScaler` class.

In [8]:
# Import the MinMaxScaler class
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training dataset ONLY
scaler = MinMaxScaler().fit(X_train)

# Transform both the training and test datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Print out the new scaled values
print('Original scale:')
print('  X_train %.3f - %.3f' % (X_train.values.min(), X_train.values.max()))
print('  X_test %.3f - %.3f' % (X_test.values.min(), X_test.values.max()))
print('Transformed scale:')
print('  X_train %.3f - %.3f' % (np.min(X_train_scaled), np.max(X_train_scaled)))
print('  X_test %.3f - %.3f' % (np.min(X_test_scaled), np.max(X_test_scaled)))

Original scale:
  X_train 11.729 - 24.691
  X_test 12.357 - 21.861
Transformed scale:
  X_train 0.000 - 1.000
  X_test 0.014 - 0.895
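
Because the scaler is fitted on the training data only, the transformed test data is not guaranteed to span [0, 1]. The sketch below uses tiny made-up arrays to reproduce the transformation by hand, as `(x - min) / (max - min)` per column with the minimum and maximum taken from the training set, and shows a test value outside the training range mapping outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up training and test sets with two features (illustrative only)
X_tr = np.array([[12.0, 15.0], [18.0, 17.0], [24.0, 21.0]])
X_te = np.array([[15.0, 16.5], [27.0, 14.0]])

scaler_demo = MinMaxScaler().fit(X_tr)

# Same transformation by hand, using the TRAINING column minima and maxima
col_min = X_tr.min(axis=0)
col_max = X_tr.max(axis=0)
manual_te = (X_te - col_min) / (col_max - col_min)

print(np.allclose(manual_te, scaler_demo.transform(X_te)))  # True
print(manual_te)
# First row is [0.25, 0.25]; second row is [1.25, -0.1667] (rounded),
# i.e. outside [0, 1] because those values lie outside the training range
```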


Now fit a linear regression and multi layer perceptron model on the scaled dataset.

In [10]:
# Instantiate the estimators
lr_scaled = LinearRegression()
mlp_scaled = MLPRegressor(random_state=seed)

# Fit the estimators
lr_scaled.fit(X_train_scaled,y_train)
mlp_scaled.fit(X_train_scaled, y_train)

# Generate predictions
y_pred_m1 = lr_scaled.predict(X_test_scaled)
y_pred_m2 = mlp_scaled.predict(X_test_scaled)

# Calculate the Root Mean Squared Error (RMSE)
# np.sqrt(mean_squared_error(...))
m1_score = np.sqrt(mean_squared_error(y_test, y_pred_m1))
m2_score = np.sqrt(mean_squared_error(y_test, y_pred_m2))

# calculate the correlation coefficient
m1_corr = np.corrcoef(y_test,y_pred_m1)[0][1]
m2_corr = np.corrcoef(y_test, y_pred_m2)[0][1]

# print the results
print('Linear regression scaled')
print('  RMSE: %.3f' % (m1_score))
print('  correlation: %.3f' % (m1_corr))
print('Multi layer perceptron scaled')
print('  RMSE: %.3f' % (m2_score))
print('  correlation: %.3f' % (m2_corr))

Linear regression scaled
  RMSE: 0.022
  correlation: 0.888
Multi layer perceptron scaled
  RMSE: 0.034
  correlation: 0.690


The linear regression model achieved the same performance on the scaled and unscaled dataset (0.022 RMSE). However, the multi layer perceptron model has an improved RMSE of 0.034 (previously 0.047) and correlation coefficient of 0.69 (previously 0.602) on the scaled dataset.
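
The unchanged linear regression result is expected: ordinary least squares with an intercept gives the same predictions under any per-feature shift and rescale, because the coefficients and intercept simply absorb the change. A minimal sketch on synthetic data (an assumption; the names below are illustrative) checks this numerically.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and a noisy linear target (assumption, not real data)
rng = np.random.default_rng(11)
X_demo = rng.uniform(12.0, 24.0, size=(300, 5))
y_demo = X_demo @ rng.normal(size=5) + rng.normal(scale=0.1, size=300)

# Simple hold-out: first 240 rows for training, the rest for testing
X_tr, X_te, y_tr = X_demo[:240], X_demo[240:], y_demo[:240]

scaler_demo = MinMaxScaler().fit(X_tr)

# Fit once on raw features and once on min-max scaled features
pred_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)
pred_scaled = (
    LinearRegression()
    .fit(scaler_demo.transform(X_tr), y_tr)
    .predict(scaler_demo.transform(X_te))
)

print(np.allclose(pred_raw, pred_scaled))  # True
```

Models trained by gradient descent, such as the multi layer perceptron, do not share this property, which is consistent with the change in its scores.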

Normalisation might not always improve performance but it is common practice to do so before training certain types of models. You should also think about your dataset and what features you are normalising. e.g. are the features measured on a linear scale? Should magnitude be normalised or should the raw fluxes be normalised instead?
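
One way to keep scaling and model fitting together is a scikit-learn pipeline, which fits the scaler on whatever data is passed to `fit` and reuses those statistics in `predict`. This is a sketch on synthetic data rather than the approach used in this notebook; `max_iter=500` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic magnitudes with a colour-like target (assumption, not real data)
rng = np.random.default_rng(11)
X_demo = rng.uniform(12.0, 24.0, size=(200, 5))
y_demo = 0.05 * (X_demo[:, 1] - X_demo[:, 2]) + rng.normal(scale=0.01, size=200)

# The scaler is fitted inside fit(), on the training rows only
model = make_pipeline(
    MinMaxScaler(),
    MLPRegressor(random_state=11, max_iter=500),
)
model.fit(X_demo[:160], y_demo[:160])

# predict() applies the stored scaling before calling the regressor
y_hat = model.predict(X_demo[160:])
print(y_hat.shape)  # (40,)
```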

## 5. Feature engineering

[[ go back to the top ]](#Table-of-contents)

Another approach to improve the performance of our models is to try new features based on our knowledge of the domain. This is known as feature engineering.

We are currently using the raw magnitude bands as features to estimate a galaxy's redshift. However, it might make more sense to use colour features that measure the ratio of flux in neighbouring filters. This is equivalent to subtracting the magnitudes of the neighbouring filters. The key to photometric redshift estimation is that a redshifted galaxy will have different observed colours to what it would have at redshift zero, i.e. galaxies at higher redshift tend to be redder in colour.
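
The equivalence between a flux ratio and a magnitude difference follows from the standard logarithmic magnitude definition, $m = -2.5\log_{10}(f/f_0)$, which gives $m_1 - m_2 = -2.5\log_{10}(f_1/f_2)$. The sketch below assumes that definition (survey catalogues may use a variant of it) and uses arbitrary flux values and a hypothetical helper `magnitude` to check the identity.

```python
import numpy as np

def magnitude(flux, zero_point_flux=1.0):
    """Standard logarithmic magnitude (assumed definition for this sketch)."""
    return -2.5 * np.log10(flux / zero_point_flux)

# Arbitrary illustrative fluxes in two neighbouring filters
flux_g, flux_r = 2.0e-3, 5.0e-3

# Colour as a difference of magnitudes...
colour_from_mags = magnitude(flux_g) - magnitude(flux_r)

# ...and as a function of the flux ratio alone (the zero point cancels)
colour_from_ratio = -2.5 * np.log10(flux_g / flux_r)

print(np.isclose(colour_from_mags, colour_from_ratio))  # True
print('g - r = %.3f' % colour_from_mags)                # g - r = 0.995
```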

With this knowledge, let's build a model with 4 engineered features by subtracting neighbouring bands in the `ugriz` magnitude channels:

- $mag_u - mag_g$
- $mag_g - mag_r$
- $mag_r - mag_i$
- $mag_i - mag_z$

In [11]:
# Make a copy of the features matrix
X_new = X.copy()

# Create new features
# subtract neighbouring magnitudes
X_new['u-g'] = X_new['mag_u'] - X_new['mag_g']
X_new['g-r'] = X_new['mag_g'] - X_new['mag_r']
X_new['r-i'] = X_new['mag_r'] - X_new['mag_i']
X_new['i-z'] = X_new['mag_i'] - X_new['mag_z']

# Remove the old columns
X_new.drop(['mag_u', 'mag_g', 'mag_r', 'mag_i', 'mag_z'], axis=1, inplace=True)

Fit a linear regression and a multi layer perceptron model with the new features.

In [15]:
# Split the dataset into X_train, X_test, y_train, y_test
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y, train_size=0.8, 
                                                                    random_state=seed)

# Instantiate the estimators
lr_new = LinearRegression()
mlp_new = MLPRegressor(random_state=seed)

# Fit the estimators
lr_new.fit(X_train_new, y_train_new)
mlp_new.fit(X_train_new, y_train_new)

# Generate predictions
m1_pred = lr_new.predict(X_test_new)
m2_pred = mlp_new.predict(X_test_new)

# Calculate the RMSE
m1_score = np.sqrt(mean_squared_error(y_test_new, m1_pred))
m2_score = np.sqrt(mean_squared_error(y_test_new, m2_pred))

# Calculate the correlation coefficient
m1_corr = np.corrcoef(y_test_new, m1_pred)[0][1]
m2_corr = np.corrcoef(y_test_new, m2_pred)[0][1]

# Print the result
print('Linear regression new features')
print('  RMSE: %.3f' % (m1_score))
print('  correlation: %.3f' % (m1_corr))
print('Multi layer perceptron new features')
print('  RMSE: %.3f' % (m2_score))
print('  correlation: %.3f' % (m2_corr))

Linear regression new features
  RMSE: 0.025
  correlation: 0.846
Multi layer perceptron new features
  RMSE: 0.026
  correlation: 0.845




The linear regression model achieved a poorer RMSE of 0.025 (previously 0.022) and correlation coefficient of 0.846 (previously 0.888). However, the multi layer perceptron model achieved improved performance with an RMSE of 0.026 (previously 0.034 on the scaled dataset) and a correlation coefficient of 0.845 (previously 0.690) with the new features.

It is up to you to decide how long you want to spend coming up with and experimenting with new features. The amount of time you spend on feature engineering is often based on your desired model performance and project requirements (e.g. deadlines).
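
One plausible reason the linear model cannot improve here is that the four colours are themselves linear combinations of the five magnitudes. A linear model on colours is therefore a restricted version of the linear model on magnitudes, and it loses any information carried by overall brightness. The sketch below illustrates this on synthetic data where the target is deliberately built to depend on brightness; it is not a diagnosis of the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic u, g, r, i, z magnitudes (assumption, not real data)
rng = np.random.default_rng(11)
mags = rng.uniform(12.0, 24.0, size=(500, 5))

# Target depends on a colour (g - r) AND on brightness in the z band
y_demo = (0.05 * (mags[:, 1] - mags[:, 2])
          + 0.01 * mags[:, 4]
          + rng.normal(scale=0.01, size=500))

# Neighbouring colours: u-g, g-r, r-i, i-z
colours = mags[:, :-1] - mags[:, 1:]

def train_rmse(X, y):
    """Fit a linear regression and return its RMSE on the same data."""
    model = LinearRegression().fit(X, y)
    return np.sqrt(np.mean((y - model.predict(X)) ** 2))

rmse_mags = train_rmse(mags, y_demo)
rmse_colours = train_rmse(colours, y_demo)

# The colour-only fit can never beat the magnitude fit on the training data
print(rmse_mags <= rmse_colours)  # True
```

A non-linear model such as the multi layer perceptron is not bound by this argument, which fits with it benefiting from the colour features.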

## Your turn

Apply the steps used in the notebook "Part II Classification" (the subsections on cross-validation, generating learning curves, parameter tuning and reporting) to your regression models.

Are there any ways you can improve the performance? Can you come up with better features or try other [regression algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for this dataset?
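
As a possible starting point for the cross-validation step, the sketch below estimates RMSE with 5-fold cross-validation. It runs on synthetic data (replace `X_demo` and `y_demo`, which are invented here, with your own features and labels). scikit-learn scorers follow a "higher is better" convention, so the mean squared error is returned negated and has to be flipped before taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic magnitudes and redshift-like target (assumption, not real data)
rng = np.random.default_rng(11)
X_demo = rng.uniform(12.0, 24.0, size=(300, 5))
y_demo = 0.05 * (X_demo[:, 1] - X_demo[:, 2]) + rng.normal(scale=0.02, size=300)

# Shuffled 5-fold split with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=11)

# Scores are negative MSE values, one per fold
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=cv)

# Convert to RMSE per fold and summarise
rmse_per_fold = np.sqrt(-neg_mse)
print('RMSE: %.3f +/- %.3f' % (rmse_per_fold.mean(), rmse_per_fold.std()))
```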