# Notes

The mathematical principle behind regression is called $\textbf{ordinary least squares}$. Suppose we want to fit the model
$$X\theta = y,$$
where $X$ is the data matrix and $y$ is the vector of target values we want to predict. In general this system has no exact solution, so we find the parameters $\theta$ of the regression model by minimizing the error function
$$E(\theta) = ||X\theta-y||,$$
which is the Euclidean norm of the residual vector $X\theta - y$.
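As a quick illustration of the error function, the sketch below evaluates $E(\theta)$ for two candidate parameter vectors. The small matrix `X_demo`, the vector `y_demo` and both candidates are made-up values chosen only for this illustration.

```python
import numpy as np

# Hypothetical toy data: a column of ones and one feature, with targets y = 1 + 2x.
X_demo = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
y_demo = np.array([[1.0], [3.0], [5.0]])

def error(theta):
    """Euclidean norm of the residual vector X_demo @ theta - y_demo."""
    return np.linalg.norm(X_demo @ theta - y_demo)

theta_good = np.array([[1.0], [2.0]])   # reproduces y_demo exactly
theta_bad = np.array([[0.0], [2.0]])    # intercept is off by 1

print(error(theta_good))   # 0.0
print(error(theta_bad))    # sqrt(3), since every residual equals -1
```

The smaller the value of $E(\theta)$, the closer the predictions $X\theta$ are to the targets.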

Many types of regression models are based on the concept of ordinary least squares, either directly or after a suitable change of variables. For example,

(i) $y = ax + b$ (linear model)

(ii) $y = a_0 + a_1 x + \dots + a_n x^n$ (polynomial model)

(iii) $y = \theta_0 + \theta_1x_1 + \dots + \theta_nx_n$ (multiple linear regression)

(iv) $y = k x^m$ (power model)

(v) $y = a + k(b^x)$ (exponential model)

(vi) $y = m\log x + c$ (logarithmic model)
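Models (i) to (iii) are linear in their parameters, so ordinary least squares applies directly. Some of the others need a change of variables first. The sketch below, using synthetic data generated for this illustration only, fits the power model (iv) by taking logarithms: $\log y = \log k + m\log x$ is linear in $\log k$ and $m$. This assumes $x > 0$ and $y > 0$, and it minimizes the error in $\log y$ rather than in $y$.

```python
import numpy as np

# Synthetic, noise-free data from y = 3 * x**2 (values assumed for illustration).
x_pow = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pow = 3.0 * x_pow**2

# Linearized model: log(y) = log(k) + m * log(x).
A = np.column_stack((np.ones_like(x_pow), np.log(x_pow)))
coef, *_ = np.linalg.lstsq(A, np.log(y_pow), rcond=None)

k = np.exp(coef[0])   # undo the logarithm to recover k
m = coef[1]
print(k, m)   # approximately 3.0 and 2.0
```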

## ```Example (Implementation by Hand)```

Suppose we are going to model the estimated profit $y$ (in US dollars) for a concert when the charge is $x$ dollars per ticket.  The data points are as follows.
$$(2, 2600), (5, 6500), (8, 8600), (11, 8900), (14, 7400), (17, 4100)$$

We want to fit the model $y = \theta_0 + \theta_1x + \theta_2x^2$.

### _Step 1: Define the data matrix $X$ and target vector $y$._

In [None]:
import numpy as np

In [None]:
data = np.array([[2],[5],[8],[11],[14],[17]])
X = np.hstack((np.ones(data.shape), data, data**2))      # np.hstack() combines the three columns horizontally into X
X

array([[  1.,   2.,   4.],
       [  1.,   5.,  25.],
       [  1.,   8.,  64.],
       [  1.,  11., 121.],
       [  1.,  14., 196.],
       [  1.,  17., 289.]])

In [None]:
y = np.array([[2600], [6500], [8600], [8900], [7400], [4100]])
y

array([[2600],
       [6500],
       [8600],
       [8900],
       [7400],
       [4100]])

### _Step 2: The optimal solution that minimizes the error function $E(\theta)$ is_
$$\theta^* = (X^TX)^{-1}X^Ty.$$
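
For reference, this formula follows from minimizing the squared error, which has the same minimizer as $E(\theta)$. A short derivation, assuming $X^TX$ is invertible (that is, the columns of $X$ are linearly independent):
$$\begin{aligned}
E(\theta)^2 &= (X\theta-y)^T(X\theta-y) = \theta^TX^TX\theta - 2\theta^TX^Ty + y^Ty,\\
\nabla_\theta\, E(\theta)^2 &= 2X^TX\theta - 2X^Ty = 0 \;\Longrightarrow\; X^TX\theta^* = X^Ty \;\Longrightarrow\; \theta^* = (X^TX)^{-1}X^Ty.
\end{aligned}$$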

In [None]:
from numpy.linalg import inv

In [None]:
theta = inv(X.T@X)@X.T@y
theta

array([[-1000.],
       [ 2000.],
       [ -100.]])

Hence, the model is $y = -1000+2000x-100x^2$.
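
Forming $(X^TX)^{-1}$ explicitly works for this small example, but it can lose accuracy when $X^TX$ is ill-conditioned. As a hedged alternative, the sketch below solves the same least-squares problem with `np.linalg.lstsq`, which avoids the explicit inverse. It rebuilds the data under new names (`x_pts`, `y_vec`, `X_chk`, all introduced here) so that it runs on its own.

```python
import numpy as np

# Same six data points as in the example above.
x_pts = np.array([2, 5, 8, 11, 14, 17], dtype=float)
y_vec = np.array([2600, 6500, 8600, 8900, 7400, 4100], dtype=float)

# Columns: 1, x, x^2.
X_chk = np.column_stack((np.ones_like(x_pts), x_pts, x_pts**2))

# lstsq returns (solution, residuals, rank, singular values).
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X_chk, y_vec, rcond=None)
print(theta_lstsq)   # approximately [-1000., 2000., -100.]
```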

## ```Example (Implementation by Sklearn Package)```

We now fit the same model using the scikit-learn package.

### _Step 1: Define the data matrix $X$ and target vector $y$._

We reuse the data matrix $X$ and target vector $y$ defined in the previous example.

### _Step 2: Separate $X$ and $y$ into training set and testing set._

In practice, the data set and target are split into a training set and a testing set so that the accuracy of the model can be checked on data that was not used for fitting. By default, the code below assigns 75% of the data to the training set and 25% to the testing set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
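
Because the split is random, repeated runs give different training and testing sets, and therefore different results. A minimal sketch, assuming a fixed seed is acceptable: passing `random_state` makes the split reproducible, and `test_size` states the proportion explicitly. The seed value 0 and the names ending in `_demo` are choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_demo = np.array([[2], [5], [8], [11], [14], [17]])
y_demo = np.array([[2600], [6500], [8600], [8900], [7400], [4100]])

# test_size=0.25 matches the default; random_state fixes the shuffle.
Xtr, Xte, ytr, yte = train_test_split(x_demo, y_demo, test_size=0.25, random_state=0)

print(Xtr.shape, Xte.shape)   # (4, 1) (2, 1): with 6 samples, 2 go to the testing set
```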

### _Step 3: Standardize features by removing the mean and scaling to unit variance._

In practice, it is suggested to subtract the mean and divide by the standard deviation for each feature in the data matrix $X$. Putting the features on a common scale can $\textbf{increase the speed}$ of iterative solvers and $\textbf{reduce the numerical error}$ in computing the model parameters.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)    # Compute the mean and s.d. of each feature from X_train, then scale X_train.
X_test_scaled = scaler.transform(X_test)          # transform() (without "fit_") reuses the mean and s.d. of X_train,
                                                  # so that no information from the test set leaks into the fitting
                                                  # of the model.
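
To make the transformation concrete, the sketch below reproduces `StandardScaler` by hand on a small made-up matrix. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (two columns on very different scales).
A_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])
A_test = np.array([[2.5, 400.0]])

scaler_demo = StandardScaler().fit(A_train)

# Manual version: the statistics come from the training set only.
mu = A_train.mean(axis=0)
sigma = A_train.std(axis=0)
manual_train = (A_train - mu) / sigma
manual_test = (A_test - mu) / sigma

print(np.allclose(manual_train, scaler_demo.transform(A_train)))   # True
print(np.allclose(manual_test, scaler_demo.transform(A_test)))     # True
```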

### _Step 4: Model Fitting._

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree = 2)
X_train_poly = poly.fit_transform(X_train_scaled)   # Generate all polynomial terms up to degree 2.
X_test_poly = poly.transform(X_test_scaled)         # Apply the same expansion to the test set.

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_poly, y_train)       # This is the degree 2 polynomial model we want.

In [None]:
y_pred = model.predict(X_test_poly)
y_pred

array([[4189.13216769],
       [2689.13216769]])
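
For comparison, here is a minimal sketch of a more compact workflow, assuming only the raw ticket price column is passed in and all six points are used for fitting. `make_pipeline` chains the polynomial expansion and the regression, and `include_bias=False` leaves the constant term to `LinearRegression`. The names `price`, `profit` and `quad_model` are introduced for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

price = np.array([[2], [5], [8], [11], [14], [17]], dtype=float)
profit = np.array([2600, 6500, 8600, 8900, 7400, 4100], dtype=float)

# Expand x into (x, x^2), then fit an ordinary least squares model with intercept.
quad_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
quad_model.fit(price, profit)

reg = quad_model.named_steps["linearregression"]
print(reg.intercept_, reg.coef_)   # approximately -1000.0 and [2000., -100.]
```

Fitted this way, the coefficients agree with the model $y = -1000+2000x-100x^2$ found by hand.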

### _Step 5: Model Evaluation. (Check whether the model is good enough compared to other candidate models.)_

In [None]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)         # Compute the mean squared error of the model on the test data set.
mse

7944.543316285124

In [None]:
rss = ((y_test-y_pred)**2).sum()                 # Compute the residual sum of squares on the test data set.
rss

np.float64(15889.086632570248)
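
The two quantities are related by $\text{MSE} = \text{RSS}/n$, where $n$ is the number of test samples; in the output above, $15889.09 / 2 \approx 7944.54$. The sketch below checks this relation on made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, for illustration only.
y_true_demo = np.array([[10.0], [20.0], [30.0]])
y_pred_demo = np.array([[12.0], [19.0], [33.0]])

rss_demo = ((y_true_demo - y_pred_demo) ** 2).sum()    # 4 + 1 + 9 = 14
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)

print(rss_demo, mse_demo)                               # 14.0 and 14/3
print(np.isclose(mse_demo, rss_demo / len(y_true_demo)))   # True
```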

# Exercise

1. Consider the following two data sets. Which one should be represented by a linear model? Which one should be represented by a quadratic model? Why?

(i) $(0.8, -0.752), (1, -0.8), (1.1, -0.818), (1.3, -0.842), (1.6, -0.848), (1.8, -0.832), (2, -0.8), (2.1, -0.778), (2.5, -0.65), (2.9, -0.458)$

(ii) $(0.8, 3.08), (1, 3), (1.1, 2.96), (1.3, 2.88), (1.6, 2.76), (1.8, 2.68), (2, 2.6), (2.1, 2.56), (2.5, 2.4), (2.9, 2.24)$

In [None]:
# (i)
X1 = np.array([[0.8], [1], [1.1], [1.3], [1.6], [1.8], [2], [2.1], [2.5], [2.9]])
X1 = np.hstack((np.ones(X1.shape), X1, X1**2))
y1 = np.array([[-0.752], [-0.8], [-0.818], [-0.842], [-0.848], [-0.832], [-0.8], [-0.778], [-0.65], [-0.458]])
th1_lin = inv(X1[:, :2].T@X1[:, :2])@X1[:, :2].T@y1
th1_qua = inv(X1.T@X1)@X1.T@y1
rss1_lin = ((y1 - X1[:, :2]@th1_lin)**2).sum()
rss1_qua = ((y1 - X1@th1_qua)**2).sum()

print('RSS for linear model =', rss1_lin)
print('RSS for quadratic model =', rss1_qua)
print('')
print('Quadratic model should be selected as its RSS is smaller.')

RSS for linear model = 0.06475521995682415
RSS for quadratic model = 8.363774488089337e-28

Quadratic model should be selected as its RSS is smaller.


In [None]:
# (ii)
y2 = np.array([[3.08], [3], [2.96], [2.88], [2.76], [2.68], [2.6], [2.56], [2.4], [2.24]])   # same x values as data set (i)
th2_lin = inv(X1[:, :2].T@X1[:, :2])@X1[:, :2].T@y2
th2_qua = inv(X1.T@X1)@X1.T@y2
rss2_lin = ((y2 - X1[:, :2]@th2_lin)**2).sum()
rss2_qua = ((y2 - X1@th2_qua)**2).sum()

print('RSS for linear model =', rss2_lin)
print('RSS for quadratic model =', rss2_qua)
print('')
print('Both RSS values are zero up to rounding error, so the simpler linear model should be selected.')

RSS for linear model = 1.9721522630525295e-31
RSS for quadratic model = 9.985006907834957e-27

Both RSS values are zero up to rounding error, so the simpler linear model should be selected.


2. Consider the following eight data points.
$$(-3, 46), (-2, 13), (-1, 0), (0, 1), (6, -35), (7, -104), (10, -539), (11, -780)$$
Build a model to fit these data points using a $\textbf{polynomial}$.
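
One possible way to choose the degree, sketched here under the assumption that a low RSS on the given points is the selection criterion: fit polynomials of increasing degree with `np.polyfit` and stop at the lowest degree whose RSS is zero up to rounding error. The helper name `rss_for_degree` is introduced for this illustration.

```python
import numpy as np

x_pts = np.array([-3, -2, -1, 0, 6, 7, 10, 11], dtype=float)
y_pts = np.array([46, 13, 0, 1, -35, -104, -539, -780], dtype=float)

def rss_for_degree(x, y, degree):
    """Fit a least-squares polynomial of the given degree and return its RSS."""
    coeffs = np.polyfit(x, y, degree)        # coefficients, highest power first
    residuals = y - np.polyval(coeffs, x)
    return float((residuals ** 2).sum())

# Compare candidate degrees 1 to 4.
for degree in range(1, 5):
    print(degree, rss_for_degree(x_pts, y_pts, degree))
```

A sharp drop of the RSS to roughly machine precision indicates that the data are reproduced exactly at that degree; higher degrees cannot do better and only add unnecessary parameters.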

3. Consider the following eight data points.
$$(-4, 385), (-2.5, 47.875), (-1.5, 4.375), (-1, 1), (1, 5), (2, 49), (5, 1501), (6, 3025)$$
Build a model to fit these data points using a $\textbf{polynomial}$.

4. We are going to build a model using a famous data set, the "Iris Flower Data Set". The following description is copied from Wikipedia.

* The Iris flower data set or Fisher's Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper.

* The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish each species.

* In the data set, "data" is the input of the model and "target" is the species to be predicted. In the column "target", "0" stands for "setosa", "1" stands for "versicolor", and "2" stands for "virginica".

<img src='https://drive.google.com/uc?id=1jPmMDqG5PYjnVGDlmAerhdz6aJbli6ur'>

Now, we would like to build a multiple linear regression model to predict the species of an iris flower.

(a) Execute the following code to import the data set. "keys()" is a command that lists the fields available in the data set.

In [None]:
# from sklearn.datasets import load_iris
# iris_data = load_iris()
# iris_data.keys()

In [None]:
# iris_data.feature_names

In [None]:
# iris_data.data

In [None]:
# iris_data.target

(b) Define the data matrix $X$ and target vector $y$.

(c) Solve the model $X\theta = y$ by using the theoretical optimal solution.
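
A minimal sketch of parts (b) and (c), offered as one possible approach rather than a model answer. It assumes a column of ones is prepended for the intercept and uses `np.linalg.solve` on the normal equations $X^TX\theta = X^Ty$ instead of forming the inverse. Rounding the real-valued predictions to the nearest label is an extra assumption made here to obtain a species.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# (b) Data matrix with an intercept column, and target as a column vector.
X_iris = np.hstack((np.ones((iris.data.shape[0], 1)), iris.data))
y_iris = iris.target.reshape(-1, 1).astype(float)

# (c) Solve the normal equations X^T X theta = X^T y.
theta_iris = np.linalg.solve(X_iris.T @ X_iris, X_iris.T @ y_iris)

# Round predictions to the nearest label in {0, 1, 2} (assumption for illustration).
labels = np.clip(np.rint(X_iris @ theta_iris), 0, 2)
accuracy = (labels == y_iris).mean()
print(theta_iris.ravel())
print(accuracy)
```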

5. Redo the work in Q4, but this time use the scikit-learn package to complete the task.

(a) Separate the data set into training set and testing set.

(b) Standardize the features. (i.e. Remove the mean and divide by the standard deviation.)

(c) Model fitting.

(d) Model Evaluation.
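
The four parts can be sketched as follows, again as one possible approach rather than a model answer. The seed `random_state=0` is an arbitrary choice for reproducibility, and the pipeline object ensures that the scaler is fitted on the training set only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# (a) 75% training set, 25% testing set (the default proportions).
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)

# (b) + (c) Standardize the features, then fit the linear model.
iris_model = make_pipeline(StandardScaler(), LinearRegression())
iris_model.fit(X_tr, y_tr)

# (d) Mean squared error on the testing set.
mse_iris = mean_squared_error(y_te, iris_model.predict(X_te))
print(mse_iris)
```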