<a href="https://colab.research.google.com/github/ElioRame/ProgrammingAssignment2/blob/master/PALS0039_Ex_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 2.2 Regression task

In this exercise we set up a simple regression task and explore how well different regression models fit the data.

(a) The following code generates some random training and test data for a regression problem. Run the code, then add comments to explain the different steps.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Fix the random seed so that the "random" numbers are reproducible between runs
np.random.seed(0)

# The true (noise-free) relationship is a straight line:
# gradient = slope, intercept = value of y where the line crosses the y-axis (x = 0)
def linear_polynomial(x, gradient=4.0, intercept=2.0):
  return gradient * x + intercept

# Draw noisy samples from the true relationship:
# - np.linspace creates num_samples evenly spaced x values over the domain (0 to 1 by default)
# - linear_polynomial maps each x to its noise-free y value on the line
# - np.random.normal adds independent Gaussian noise (mean 0, standard deviation 1) to each y value
def generate_noisy_samples(num_samples, domain=[0.0, 1.0]):
  x = np.linspace(domain[0], domain[1], num_samples)
  y = linear_polynomial(x) + np.random.normal(size=num_samples)
  return x, y

# Generate a training set and a test set: the x values are identical, but the noise draws are independent
x_train, y_train = generate_noisy_samples(100)
x_test, y_test = generate_noisy_samples(100)

# Plot the noisy training samples together with the true relationship;
# the dashed line is linear_polynomial applied to x_train with no noise added
plt.plot(x_train, y_train,'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), label="true relationship", linestyle="--")
plt.legend()
plt.show()

(b) Use the numpy [`Polynomial.fit` method](https://numpy.org/doc/stable/reference/generated/numpy.polynomial.polynomial.Polynomial.fit.html) to find a polynomial with **degree of 0** from the training data. Plot this function against the training data (as above).

Hint: to evaluate the estimated model returned by `Polynomial.fit` you can simply use it as a callable function that takes x values as input.
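
To make the hint concrete, here is a minimal sketch on illustrative noise-free data (the names `x_demo`, `y_demo` and `line_fit` are assumptions for this example, not part of the exercise). It also shows one detail that is easy to miss: `Polynomial.fit` fits on a rescaled x axis, so the coefficients in the original variable are obtained through `convert()`.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noise-free points on the line y = 4x + 2 (illustrative data, not the exercise data)
x_demo = np.linspace(0.0, 1.0, 50)
y_demo = 4.0 * x_demo + 2.0

# Fit a degree-1 polynomial; the result is a callable Polynomial object
line_fit = Polynomial.fit(x_demo, y_demo, deg=1)

# Evaluate the fitted model by calling it like a function
y_at_half = line_fit(0.5)

# line_fit.coef is expressed in the internally rescaled variable;
# convert() maps the polynomial back to the original x axis
coef_original = line_fit.convert().coef  # ordered from lowest to highest degree

print(y_at_half)
print(coef_original)
```

Calling the fitted object directly is all that is needed for plotting and for computing errors; `convert()` only matters when the coefficients themselves are to be read off.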

In [None]:
from numpy.polynomial import Polynomial
#(b)
#ANSWER
# A degree-0 polynomial is a constant, so the fitted model is a horizontal line
poly0 = Polynomial.fit(x_train, y_train, deg=0)
plt.plot(x_train, y_train,'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), label="true relationship", linestyle="--")
plt.plot(x_train, poly0(x_train), label="degree 0 fit")
plt.legend()
plt.show()


(c) Repeat the previous task but fit two additional models with **degree 1 and 2** respectively.

In [None]:
#(c)
#ANSWER
poly1 = Polynomial.fit(x_train, y_train, deg=1)
poly2 = Polynomial.fit(x_train, y_train, deg=2)

plt.plot(x_train, y_train,'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), label="true relationship", linestyle="--")
plt.plot(x_train, poly0(x_train), label="degree 0 fit")
plt.plot(x_train, poly1(x_train), label="degree 1 fit")
plt.plot(x_train, poly2(x_train), label="degree 2 fit")
plt.legend()
plt.show()

(d) Calculate the **mean squared error** (MSE) on both the training and test sets for the three polynomial models.

Hint: SKLearn has a function that can be used to [calculate the MSE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
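
For reference, the MSE is the average of the squared differences between the observed and the predicted values. The sketch below computes it directly with NumPy; the helper name `mse` and the demo values are illustrative assumptions. For one-dimensional inputs, `sklearn.metrics.mean_squared_error(y_true, y_pred)` is expected to return the same number.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Two exact predictions and one that is off by 2: (0 + 0 + 4) / 3
y_true_demo = [1.0, 2.0, 3.0]
y_pred_demo = [1.0, 2.0, 5.0]
demo_error = mse(y_true_demo, y_pred_demo)
print(demo_error)
```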

In [None]:
from sklearn.metrics import mean_squared_error


y_train_predicted_0 = poly0(x_train)
y_test_predicted_0  = poly0(x_test)
y_train_predicted_1 = poly1(x_train)
y_test_predicted_1  = poly1(x_test)
y_train_predicted_2 = poly2(x_train)
y_test_predicted_2  = poly2(x_test)
# Evaluate the three fitted models (deg 0, 1 and 2) on the training and test inputs.
# Expectation: deg 1 should approximate the true (linear) relationship best on both sets;
# deg 0 is too simple (underfitting), while deg 2 is more flexible than needed and can partly fit the noise in x_train/y_train (overfitting).
# Both sets follow the same linear relationship but contain independent noise, so the test MSE shows how well each model generalises.
mses_train = {}
mses_test = {}
# Each MSE compares a model's predictions with the noisy observed values; the lower the MSE, the better.
# mean_squared_error expects the arguments in the order (y_true, y_pred)
mses_train["poly0"] = mean_squared_error(y_train, y_train_predicted_0)
mses_train["poly1"] = mean_squared_error(y_train, y_train_predicted_1)
mses_train["poly2"] = mean_squared_error(y_train, y_train_predicted_2)

mses_test["poly0"] = mean_squared_error(y_test, y_test_predicted_0)
mses_test["poly1"] = mean_squared_error(y_test, y_test_predicted_1)
mses_test["poly2"] = mean_squared_error(y_test, y_test_predicted_2)

print("TRAIN:", mses_train, sep="\t")
print("TEST:", mses_test, sep="\t")
#(d)
#ANSWER

(e) Can you identify underfitting and overfitting? Which model would you eventually deploy on this task? Why?

In [None]:
#(e)
#ANSWER
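#poly0 = underfitting: a constant cannot capture the trend, so performance is poor on both the train and test sets
#poly1 = good fit: test performance differs only slightly from train performance; the model matches the form of the true relationship without adapting to one specific dataset, so it generalises
#poly2 = overfitting: better performance on the training set, but poorer generalisation to the test set
#deploy poly1: it is the simplest of the three models that captures the relationship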
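
The pattern behind underfitting and overfitting can be explored by sweeping over more degrees. The sketch below is an illustration under stated assumptions: it regenerates data of the same form with its own generator (`np.random.default_rng(0)`, so the numbers will not match the cells above), and the helper `make_data` is hypothetical. One property holds by construction: because each higher-degree model contains the lower-degree ones, the training MSE of a least-squares fit cannot increase with the degree, whereas the test MSE carries no such guarantee.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # illustrative generator, separate from np.random.seed

def make_data(n):
    """Noisy samples from y = 4x + 2 on [0, 1] (same form as the exercise data)."""
    x = np.linspace(0.0, 1.0, n)
    y = 4.0 * x + 2.0 + rng.normal(size=n)
    return x, y

x_tr, y_tr = make_data(100)
x_te, y_te = make_data(100)

train_mse, test_mse = {}, {}
for deg in range(0, 6):
    model = Polynomial.fit(x_tr, y_tr, deg=deg)
    train_mse[deg] = float(np.mean((y_tr - model(x_tr)) ** 2))
    test_mse[deg] = float(np.mean((y_te - model(x_te)) ** 2))
    print(deg, round(train_mse[deg], 3), round(test_mse[deg], 3))
```

Comparing the two printed columns is what matters: a model whose training error keeps falling while its test error does not is adapting to noise rather than to the underlying relationship.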

(f) Pick a regression method of your choice from the `sklearn` library (regression models with example code can be found in [this index](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)). Identify some of the hyperparameters of this model and try to find out how they affect the modelling.

In [None]:
#(f)
#ANSWER
#DecisionTreeRegressor: max_depth, min_samples_split, min_samples_leaf, ...
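
One way to see what these hyperparameters do is to inspect the size of the fitted tree. The following is a rough sketch on illustrative data (the names `x_demo`, `X_demo` and `configs` are assumptions for this example). `max_depth` caps the number of successive splits, so a tree of depth d has at most 2^d leaves; `min_samples_leaf` forces every leaf to cover at least that many training samples, which also limits the number of leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # illustrative data, not the exercise data
x_demo = np.linspace(0.0, 1.0, 100)
y_demo = 4.0 * x_demo + 2.0 + rng.normal(size=100)
X_demo = x_demo.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

configs = {
    "max_depth=2": DecisionTreeRegressor(max_depth=2),
    "max_depth=8": DecisionTreeRegressor(max_depth=8),
    "min_samples_leaf=20": DecisionTreeRegressor(min_samples_leaf=20),
}

leaf_counts = {}
for name, tree in configs.items():
    tree.fit(X_demo, y_demo)
    leaf_counts[name] = tree.get_n_leaves()
    print(name, "depth:", tree.get_depth(), "leaves:", leaf_counts[name])
```

Fewer leaves mean a coarser, smoother step function; more leaves let the tree follow individual noisy points more closely.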

(g) Fit your choice of regression model with two or three different configurations of the hyperparameters and calculate the train and test set MSE for this model (as done before). Which model would you eventually deploy on this task?

In [None]:
#(g)
#ANSWER
#from professor
from sklearn.tree import DecisionTreeRegressor

tree_d2 = DecisionTreeRegressor(max_depth=2).fit(x_train.reshape(-1, 1), y_train)
tree_d4 = DecisionTreeRegressor(max_depth=4).fit(x_train.reshape(-1, 1), y_train)
tree_d8 = DecisionTreeRegressor(max_depth=8).fit(x_train.reshape(-1, 1), y_train)
#a decision tree regressor predicts the target variable through a sequence of learned if-else threshold decisions on the input; the resulting prediction is a piecewise-constant (step) function, not a smooth curve
#max_depth limits how many successive splits the tree can make: deeper trees pay more attention to fine details of the training data, and here the higher depths exhibit overfitting
#scikit-learn requires X to be a 2-D array of shape (n_samples, n_features); x_train is 1-D, so .reshape(-1, 1) turns it into a single column (one feature, with -1 letting numpy infer the number of rows)
#with a single feature every split is a threshold on x; the reshape only changes the array layout, not the information available to the tree

mses_train = {}
mses_test = {}

mses_train["tree_d2"] = mean_squared_error(tree_d2.predict(x_train.reshape(-1, 1)), y_train)
mses_train["tree_d4"] = mean_squared_error(tree_d4.predict(x_train.reshape(-1, 1)), y_train)
mses_train["tree_d8"] = mean_squared_error(tree_d8.predict(x_train.reshape(-1, 1)), y_train)

mses_test["tree_d2"] = mean_squared_error(tree_d2.predict(x_test.reshape(-1, 1)), y_test)
mses_test["tree_d4"] = mean_squared_error(tree_d4.predict(x_test.reshape(-1, 1)), y_test)
mses_test["tree_d8"] = mean_squared_error(tree_d8.predict(x_test.reshape(-1, 1)), y_test)

print("TRAIN:", mses_train, sep="\t")
print("TEST:", mses_test, sep="\t")

plt.plot(x_train, y_train,'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), label="true relationship", linestyle="--")
plt.plot(x_train, tree_d2.predict(x_train.reshape(-1, 1)), label="tree with max depth 2")
plt.plot(x_train, tree_d4.predict(x_train.reshape(-1, 1)), label="tree with max depth 4")
plt.plot(x_train, tree_d8.predict(x_train.reshape(-1, 1)), label="tree with max depth 8")
plt.legend()
plt.show()

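# Check the array shapes: x_train and y_train are 1-D, whereas the trees were fitted on a 2-D column
print(y_train.shape, x_train.shape, x_train.ndim)
x_for_tree = x_train.reshape(-1, 1)
print(x_for_tree.shape)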
#Deploy tree_d2 (best generalisation error) -- other models exhibited overfitting!
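
The tree code above passes `x_train.reshape(-1, 1)` to scikit-learn. As a small check on an illustrative array (the names below are assumptions for this example), the sketch shows how that differs from `reshape(1, -1)`: the first yields one column with a row per sample, the second yields a single row that would be read as one sample with many features.

```python
import numpy as np

x_1d = np.linspace(0.0, 1.0, 100)  # 1-D array of 100 values
x_column = x_1d.reshape(-1, 1)     # 100 samples, 1 feature
x_row = x_1d.reshape(1, -1)        # 1 sample, 100 features

print(x_1d.shape, x_column.shape, x_row.shape)
# prints: (100,) (100, 1) (1, 100)
```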