<a href="https://colab.research.google.com/github/HanSong19/PALS0039-Introduction-to-Deep-Learning-for-Speech-and-Language-Processing-/blob/main/PALS0039_Ex_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 2.2 Regression task

In this exercise we set up a simple regression task and explore how well different regression models fit the data.

(a) The following code generates some random training and test data for a regression problem. Run the code, then add comments to explain the different steps.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# set the seed so that the same random numbers are generated on every run
np.random.seed(0)

# 1st order polynomial (a line)
def linear_polynomial(x, gradient=4.0, intercept=2.0):
  return gradient * x + intercept                                               # evaluate the line at x

# generate noisy samples around a line
# num_samples: number of samples to generate
# domain: input range for which samples are generated (default: 0-1)
# output: x and y
def generate_noisy_samples(num_samples, domain=(0.0, 1.0)):
  x = np.linspace(domain[0], domain[1], num_samples)                            # num_samples equidistant values from domain[0] to domain[1] (inclusive)
  y = linear_polynomial(x) + np.random.normal(size=num_samples)                 # line plus standard-normal noise (mean 0, std 1)
  return x, y

# generate 100 samples for training and 100 for testing
x_train, y_train = generate_noisy_samples(100)
x_test, y_test = generate_noisy_samples(100)

# plot the noisy training samples and the noise-free ground truth
plt.plot(x_train, y_train, 'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), label="true relationship",
         linestyle="--")
plt.legend()
plt.show()
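
The effect of the seed can be checked directly. The following is a minimal sketch, not part of the original exercise, showing that re-seeding NumPy's global generator reproduces the same noise; the helper name `draw_noise` is hypothetical.

```python
import numpy as np

def draw_noise(seed, size=5):
    # Hypothetical helper: re-seed the global generator, then draw standard-normal noise.
    np.random.seed(seed)
    return np.random.normal(size=size)

first = draw_noise(0)
second = draw_noise(0)   # same seed, so the draws are identical
other = draw_noise(1)    # different seed, so the draws differ

print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

In the cell above the seed is set only once, so the second call to `generate_noisy_samples` continues the random stream: the training and test sets share the same `x` grid but receive different noise.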

(b) Use the NumPy [`Polynomial.fit` method](https://numpy.org/doc/stable/reference/generated/numpy.polynomial.polynomial.Polynomial.fit.html) to fit a polynomial of **degree 0** to the training data. Plot this function against the training data (as above).

Hint: to evaluate the estimated model returned by `Polynomial.fit` you can simply use it as a callable function that takes x values as input.
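
As a small illustration of the hint, the sketch below fits a line to noise-free points and then calls the fitted object. The arrays `x_demo` and `y_demo` and the name `line_fit` are illustrative assumptions, not part of the exercise data.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noise-free points on the line y = 4x + 2 (illustrative data only).
x_demo = np.linspace(0.0, 1.0, 20)
y_demo = 4.0 * x_demo + 2.0

# Fit a degree-1 polynomial; the result is a callable Polynomial object.
line_fit = Polynomial.fit(x_demo, y_demo, deg=1)

# Evaluate the fitted model at new x values by calling it.
print(line_fit(np.array([0.0, 0.5, 1.0])))

# Polynomial.fit stores coefficients for a rescaled x (the data range mapped to [-1, 1]).
# convert() returns the polynomial expressed in the original x,
# with coefficients ordered from lowest to highest degree.
print(line_fit.convert().coef)
```

Because of the internal rescaling, reading `line_fit.coef` directly would not give the intercept and gradient; `convert()` is the step that recovers them.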

In [None]:
from numpy.polynomial import Polynomial

#(b)

# fit a degree-0 polynomial (a constant) to the training data
plot0 = Polynomial.fit(x_train, y_train, deg=0)

# plot the training data, the ground truth and the fitted constant
plt.plot(x_train, y_train, 'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), linestyle='--', label="true relationship")
plt.plot(x_train, plot0(x_train), label="deg 0 polynomial")

plt.legend() 
plt.show()





(c) Repeat the previous task but fit two additional models with degree 1 and 2 respectively.

In [None]:
#(c)
# fit degree-1 (line) and degree-2 (parabola) polynomials to the training data
plot1 = Polynomial.fit(x_train, y_train, deg=1)
plot2 = Polynomial.fit(x_train, y_train, deg=2)

# plot the training data, the ground truth and all three fitted models
plt.plot(x_train, y_train, 'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), linestyle='--', label="true relationship")
plt.plot(x_train, plot0(x_train), label="deg 0 polynomial")
plt.plot(x_train, plot1(x_train), label="deg 1 polynomial")
plt.plot(x_train, plot2(x_train), label="deg 2 polynomial")
plt.legend()
plt.show()

(d) Calculate the **mean squared error** (MSE) on both the training and test sets for the three polynomial models.

Hint: scikit-learn has a function that can be used to [calculate the MSE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
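
For reference, the MSE is the mean of the squared differences between targets and predictions. A minimal NumPy sketch follows; the function name `mse` and the toy values are illustrative assumptions.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]    # only the last prediction is off, by 2
print(mse(y_true, y_pred))  # (0 + 0 + 4) / 3
```

For one-dimensional targets, `sklearn.metrics.mean_squared_error(y_true, y_pred)` computes the same quantity.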

In [None]:
from sklearn.metrics import mean_squared_error

#(d)
mses_train = {}
mses_test = {}

# the fitted models from (b) and (c) are named plot0, plot1 and plot2
# mean_squared_error expects (y_true, y_pred)
mses_train["poly0"] = mean_squared_error(y_train, plot0(x_train))
mses_train["poly1"] = mean_squared_error(y_train, plot1(x_train))
mses_train["poly2"] = mean_squared_error(y_train, plot2(x_train))

mses_test["poly0"] = mean_squared_error(y_test, plot0(x_test))
mses_test["poly1"] = mean_squared_error(y_test, plot1(x_test))
mses_test["poly2"] = mean_squared_error(y_test, plot2(x_test))

print("TRAIN:", mses_train, sep="\t")
print("TEST:", mses_test, sep="\t")

(e) Can you identify underfitting and overfitting? Which model would you eventually deploy on this task? Why?

In [None]:
#(e)
# Underfitting: poly0 (poly1 and poly2 both achieve a lower generalisation error)
# Overfitting: poly2 (lower training error than poly1 but higher generalisation error)
# Deploy poly1 (lowest generalisation error; it also has the same form as the true linear relationship)
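
The comparison in (d) and (e) can be extended to more degrees with a loop. The sketch below is an illustration under stated assumptions: it regenerates data with the same recipe as part (a) but uses a local generator with an arbitrary seed, so its numbers will not match the exercise output, and the names `make_data`, `train_mse` and `test_mse` are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # local generator; seed chosen arbitrarily

def make_data(n=100):
    # Same recipe as the exercise: y = 4x + 2 plus standard-normal noise.
    x = np.linspace(0.0, 1.0, n)
    return x, 4.0 * x + 2.0 + rng.normal(size=n)

x_tr, y_tr = make_data()
x_te, y_te = make_data()

train_mse, test_mse = {}, {}
for deg in range(0, 6):
    model = Polynomial.fit(x_tr, y_tr, deg=deg)
    train_mse[deg] = np.mean((model(x_tr) - y_tr) ** 2)
    test_mse[deg] = np.mean((model(x_te) - y_te) ** 2)
    print(deg, train_mse[deg], test_mse[deg])
```

Training MSE cannot increase with the degree, because each higher-degree family contains the lower-degree ones. Test MSE carries no such guarantee, which is why it is the quantity to compare when choosing a model.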

(f) Pick a regression method of your choice from the `sklearn` library (regression models with example code can be found in [this index](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)). Identify some of the hyperparameters of this model and try to find out how they affect the modelling.

In [None]:
#(f)
# DecisionTreeRegressor: max_depth, min_samples_split, min_samples_leaf, ...
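
One way to explore the hyperparameters is to list them with `get_params()` and then inspect the size of the fitted tree. The following is a sketch only: the data are regenerated with the recipe from part (a) using an arbitrary seed, and the chosen settings are examples rather than recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data following the exercise recipe (not the notebook's arrays).
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 1.0, 100).reshape(-1, 1)  # sklearn expects 2-D inputs
y_demo = 4.0 * x_demo.ravel() + 2.0 + rng.normal(size=100)

# get_params() lists every hyperparameter and its current value.
params = DecisionTreeRegressor().get_params()
print(sorted(params))

# Compare the size of the fitted tree under different settings.
for kwargs in ({"max_depth": 2}, {"max_depth": 8}, {"min_samples_leaf": 20}):
    tree = DecisionTreeRegressor(random_state=0, **kwargs).fit(x_demo, y_demo)
    print(kwargs, tree.get_depth(), tree.get_n_leaves())
```

Each leaf predicts a constant, so the number of leaves controls how coarse the fitted step function is: `max_depth=d` allows at most 2^d leaves, and `min_samples_leaf=20` with 100 samples allows at most 5.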

(g) Fit your choice of regression model with two or three different configurations of the hyperparameters and calculate the train and test set MSE for this model (as done before). Which model would you eventually deploy on this task?

In [None]:
#(g)
from sklearn.tree import DecisionTreeRegressor

tree_d2 = DecisionTreeRegressor(max_depth=2).fit(x_train.reshape(-1, 1), y_train)
tree_d4 = DecisionTreeRegressor(max_depth=4).fit(x_train.reshape(-1, 1), y_train)
tree_d8 = DecisionTreeRegressor(max_depth=8).fit(x_train.reshape(-1, 1), y_train)

mses_train = {}
mses_test = {}

mses_train["tree_d2"] = mean_squared_error(tree_d2.predict(x_train.reshape(-1, 1)), y_train)
mses_train["tree_d4"] = mean_squared_error(tree_d4.predict(x_train.reshape(-1, 1)), y_train)
mses_train["tree_d8"] = mean_squared_error(tree_d8.predict(x_train.reshape(-1, 1)), y_train)

mses_test["tree_d2"] = mean_squared_error(tree_d2.predict(x_test.reshape(-1, 1)), y_test)
mses_test["tree_d4"] = mean_squared_error(tree_d4.predict(x_test.reshape(-1, 1)), y_test)
mses_test["tree_d8"] = mean_squared_error(tree_d8.predict(x_test.reshape(-1, 1)), y_test)

print("TRAIN:", mses_train, sep="\t")
print("TEST:", mses_test, sep="\t")

# plot the training data, the ground truth and the predictions of the three trees
plt.plot(x_train, y_train, 'bo', label="noisy training samples")
plt.plot(x_train, linear_polynomial(x_train), label="true relationship", linestyle="--")
plt.plot(x_train, tree_d2.predict(x_train.reshape(-1, 1)), label="tree with max depth 2")
plt.plot(x_train, tree_d4.predict(x_train.reshape(-1, 1)), label="tree with max depth 4")
plt.plot(x_train, tree_d8.predict(x_train.reshape(-1, 1)), label="tree with max depth 8")
plt.legend()  # without this call the labels above are never displayed
plt.show()

# Deploy tree_d2 (lowest generalisation error); the deeper trees exhibited overfitting