# Session 01, Task A

In this task you will read in some toy data from an experiment and, using linear regression, try to predict the outcome for an 'unseen' or unknown measurement.

In [None]:
import pandas as pd  # We will use pandas DataFrames for storing features
import numpy as np  # We will use NumPy arrays to store numerical data

# Utils is a custom module written to simplify these tutorials
# You do not need to understand this code for this practical
from utils.practice_data import generateNiceData

# Generate a pandas DataFrame with some toy results
# Each row (a cell) has four features: A, B, C, Y
# These could, for example, be length, intensity, etc.
number_of_samples=100
problem = generateNiceData(3,number_of_samples,noNoise=True)
problem.describe()

## The Task

In one experiment we were able to measure four features: A, B, C, and Y. However, in our next experiment we could only measure A, B and C.

Given these three features (A, B and C) we want to predict the value of feature Y.

1. Let's start with a multivariate linear regression (assuming the relationship between our variables is simple).
2. Advanced: Then try providing non-linear features (i.e. assuming the relationship between our variables is complex).
3. Advanced: Next, see how the performance changes with and without noise in the data.
4. Advanced: See how the performance changes depending on the size of the problem.
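
For reference, the model in step 1 can be written as follows. The coefficient symbols $\beta_0$, $\beta_A$, $\beta_B$ and $\beta_C$ are notation introduced here for illustration; they do not come from the `utils` module.

$$
\hat{Y} = \beta_0 + \beta_A A + \beta_B B + \beta_C C
$$

In step 2 the model stays linear in its coefficients, but some inputs are replaced by transformed versions, for example $\hat{Y} = \beta_0 + \beta_A A + \beta_B f(B) + \beta_C g(C)$ for some chosen functions $f$ and $g$.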

In [None]:
# Data from the pandas DataFrame needs to be converted to familiar NumPy arrays
x = problem.loc[:,['A','B','C']].values
y = problem.loc[:,'Y'].values

print(x.shape, y.shape)  # check our array shapes make sense

## Task 1: 

Read the documentation for the function [sklearn.model_selection.train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html); the second example provided is a good place to start.

In the cell below, complete the function call (by replacing all the `____`s) to split our dataset into `x_train`/`y_train` (training) and `x_test`/`y_test` (testing) subsets.
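
As a minimal sketch of what the split does, the example below applies `train_test_split` to a small made-up dataset. The arrays `x_demo` and `y_demo` are hypothetical stand-ins, not the tutorial data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with 3 features and 1 target each
x_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

print(xd_train.shape, xd_test.shape)  # (8, 3) (2, 3)
print(yd_train.shape, yd_test.shape)  # (8,) (2,)
```

Rows are shuffled before splitting, but each feature row stays paired with its own target value.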

In [None]:
from sklearn import model_selection

____, ____, ____, ____  = model_selection.train_test_split(____, ____, test_size=0.2, random_state=0)

### REMOVE THESE LINES ###
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2, random_state=0)
### REMOVE THESE LINES ###

## Task 2:

Look at the [Linear Regression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) documentation and examples. In the code below, we have initialised a Linear Regression model. Add the following lines of code:
1. A line that fits the model to our training data (using `fit`),
2. A line that predicts the value of Y for our test data (using `predict`).

What do the intercept and slope represent?
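
One way to build intuition is to fit a model to data whose generating formula is known. The sketch below uses hypothetical noise-free data (the names `features`, `target` and `demo_model` are made up for this example) and checks which numbers the fitted model recovers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated as: target = 2 + 3*a - 1*b
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(50, 2))
target = 2.0 + 3.0 * features[:, 0] - 1.0 * features[:, 1]

demo_model = LinearRegression()
demo_model.fit(features, target)

print(demo_model.intercept_)  # approximately 2.0, the constant term
print(demo_model.coef_)       # approximately [3.0, -1.0], one weight per feature
```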

In [None]:
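from sklearn import linear_model  # this submodule contains the 'LinearRegression' class

# Note: the 'normalize' argument was removed in scikit-learn 1.2, so it is not passed here.
# How would the fit change if the features were standardised first (e.g. with StandardScaler)?
mv_regression = linear_model.LinearRegression()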

# Fit regression model to feature data x_train and target y_train
# ADD CODE HERE:
### REMOVE THESE LINES ###
mv_regression.fit(x_train,y_train)
### REMOVE THESE LINES ###

# Fill vector y_predict with estimates of target y from data x_test
# ADD CODE HERE:
### REMOVE THESE LINES ###
y_predict = mv_regression.predict(x_test)
### REMOVE THESE LINES ###

print("Intercept {}".format(mv_regression.intercept_))
print("Slope {}".format(mv_regression.coef_))

In machine learning we want to know how well our model fits the data, which we measure by comparing our predictions to the known values in the test set. A variety of error metrics can be used for this. Run the cell below to compare the 'True' (known) values with the predicted values and to compute three common error metrics. How do you interpret these numbers?
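
To make the three metrics concrete, the sketch below computes them by hand with NumPy on four made-up values and compares the results with scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the tutorial data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical known values and predictions
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true_demo - y_pred_demo
mae = np.mean(np.abs(errors))  # average size of the errors
mse = np.mean(errors ** 2)     # average squared error, penalises large errors more
rmse = np.sqrt(mse)            # square root of MSE, in the same units as Y

print(mae, mse, rmse)  # MAE 0.5, MSE 0.375, RMSE about 0.612

# The hand-written versions agree with the scikit-learn functions
print(np.isclose(mae, metrics.mean_absolute_error(y_true_demo, y_pred_demo)))
print(np.isclose(mse, metrics.mean_squared_error(y_true_demo, y_pred_demo)))
```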

In [None]:
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score

results = pd.DataFrame({'True value': y_test.flatten(), 'Predicted': y_predict.flatten()})
print(results.head())  # show the first few true and predicted values side by side
print('Mean Absolute Error: {}'.format(metrics.mean_absolute_error(y_test, y_predict))  )
print('Mean Squared Error: {}'.format(metrics.mean_squared_error(y_test, y_predict)) )
print('Root Mean Squared Error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_predict))))

## Task 3: 

Given your extensive knowledge of the **property Y**, you suspect that measurements B and C have a non-linear relation to Y.

Modify the following cell to use the functions [np.power()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.power.html) and [np.sin()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sin.html?highlight=sin#numpy.sin) to create the additional non-linear features `z_b` and `z_c`.

*Hint:* Try different powers of `x_b`, from 2 to 8, to see which gives the closest fit.

*Hint:* Apply the sine function directly to `x_c`: `z_c = np.sin(x_c)`
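
The sketch below illustrates why transformed features can help. It assumes a made-up relationship (a fourth power of one input plus the sine of another), which is not necessarily how `generateNiceData` works, and compares the R² score of a linear model on the raw inputs with the score on the transformed inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
b = rng.uniform(-2, 2, size=(200, 1))
c = rng.uniform(-3, 3, size=(200, 1))

# Hypothetical ground truth, chosen only for this illustration
target = (np.power(b, 4) + np.sin(c)).ravel()

# Raw inputs versus non-linear features built from the same inputs
raw = np.concatenate((b, c), axis=1)
engineered = np.concatenate((np.power(b, 4), np.sin(c)), axis=1)

score_raw = LinearRegression().fit(raw, target).score(raw, target)
score_eng = LinearRegression().fit(engineered, target).score(engineered, target)

print("R^2 with raw features:       ", score_raw)
print("R^2 with non-linear features:", score_eng)
```

The model is still linear regression in both cases; only the columns it is given have changed.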

In [None]:
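# Split the measurements into single-column arrays of shape (number_of_samples, 1)
x_a = problem.loc[:, ['A']].values
x_b = problem.loc[:, ['B']].values
x_c = problem.loc[:, ['C']].values

z_b = x_b
z_c = x_c

z_b = np.power(x_b, 4)
z_c = np.sin(x_c)

# You can either add the z-features to your existing measurements...
x = np.concatenate((x_a, x_b, x_c, z_b, z_c), axis=1)
# ...or replace x_b and x_c with the non-linear features (this line overrides the one above).
x = np.concatenate((x_a, z_b, z_c), axis=1)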

## Task 4:

Copy the code from the cells above to try linear regression with the non-linear features: split the new `x`, fit the model, predict, and compute the error metrics.

## Task 5: 
Display the results on the test dataset by executing the following cell.

In [None]:
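import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(y_test, 'x', label='True value')    # known test values
plt.plot(y_predict, 'o', label='Predicted')  # model predictions
plt.xlabel('Test sample index')
plt.ylabel('Y')
plt.legend()
plt.show()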

### ADVANCED Task:

Gather the code above into a loop and create a graph showing how the Root Mean Squared Error changes with the size of the noisy dataset.
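
A minimal sketch of such a loop is shown below. It uses a hypothetical generator, `make_noisy_demo`, as a stand-in for `generateNiceData` so that the example is self-contained; the signal and noise level are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def make_noisy_demo(n_samples, rng):
    """Hypothetical stand-in for generateNiceData: a linear signal plus Gaussian noise."""
    x = rng.uniform(-1, 1, size=(n_samples, 3))
    y = 1.0 + x @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=n_samples)
    return x, y


rng = np.random.default_rng(0)
sizes = list(range(20, 500, 50))
rmse_per_size = []

for n in sizes:
    x_n, y_n = make_noisy_demo(n, rng)
    xn_train, xn_test, yn_train, yn_test = train_test_split(
        x_n, y_n, test_size=0.2, random_state=0
    )
    model = LinearRegression().fit(xn_train, yn_train)
    rmse = np.sqrt(mean_squared_error(yn_test, model.predict(xn_test)))
    rmse_per_size.append(rmse)

for n, rmse in zip(sizes, rmse_per_size):
    print(n, round(rmse, 3))
```

The two lists can then be passed to `plt.plot(sizes, rmse_per_size)` to draw the curve.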


In [None]:
for number_of_samples in range(20, 500, 50):
    # noNoise=False so that the generated data is noisy, as the task asks
    problem = generateNiceData(3, number_of_samples, noNoise=False)
    # put your code here