# Linear Regression in PyTorch

## What is linear regression?

Linear = our predictions are a **linear combination** of our inputs

Regression = we learn the relationship that maps features to continuous-valued labels

$X$ is a matrix of training data. Each row represents a different training example (of which there are $m$). Each column represents a different feature (of which there are $n$). Hence $X$ has dimensions $m \times n$, i.e. $X \in  R^{m \times n}$.

$W$ is our matrix of weights, which controls how much each feature contributes to the hypothesis. If a particular weight equals 5, then increasing its associated feature by 1 increases the hypothesis by 5. $W \in R^{n \times 1}$.

$h$ is our hypothesis - our prediction of the mapping from input to output. In this example, our model will predict a single scalar output for each of our $m$ inputs - so $h \in R^{m \times 1}$.

## $ h = X W, \quad h^{(i)} = w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + \dots + w_{n-1} x^{(i)}_{n-1} + w_n x^{(i)}_n$

This linear combination is a **weighted sum of the input features**. As we vary the value of one feature, our hypothesis will change proportionately and linearly.
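
As a quick check of the shapes involved, here is a minimal sketch with made-up numbers ($m = 3$ examples, $n = 2$ features). The values of `X` and `W` are assumptions chosen purely for illustration.

```python
import torch

# Hypothetical toy data: m = 3 examples, n = 2 features
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])        # shape (m, n) = (3, 2)
W = torch.tensor([[0.5],
                  [-1.0]])            # shape (n, 1) = (2, 1)

h = X @ W                             # matrix product, shape (m, 1) = (3, 1)

# The same result written out as an explicit weighted sum for each example
h_manual = (X[:, 0] * W[0, 0] + X[:, 1] * W[1, 0]).unsqueeze(1)

print(h.shape)                        # torch.Size([3, 1])
print(torch.allclose(h, h_manual))    # True
```

Each row of `h` is the weighted sum of the features in the matching row of `X`.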

Imagine that we are trying to predict house price. Consider:
- The weight associated with the number of rooms should be large and positive, because each extra room adds substantially to the price of a house.
- The weight associated with the age of the house may be negative, as the training data may show that older houses are worth less.
- The weight associated with a feature that is the age of the person last living there should be zero, because the house price is independent of this feature. It does not contribute at all to the house price.
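
To make the house price intuition concrete, the sketch below uses invented weights (they are not learned from any data) and shows how the prediction responds when a single feature changes.

```python
import torch

# Hypothetical weights, for illustration only (not learned from any data):
# [number of rooms, age of house in years, age of previous occupant]
w = torch.tensor([20000.0, -500.0, 0.0])

house = torch.tensor([4.0, 30.0, 65.0])
base_price = torch.dot(w, house)            # 80000 - 15000 + 0 = 65000

one_more_room = house + torch.tensor([1.0, 0.0, 0.0])
older_occupant = house + torch.tensor([0.0, 0.0, 10.0])

print((torch.dot(w, one_more_room) - base_price).item())   # 20000.0
print((torch.dot(w, older_occupant) - base_price).item())  # 0.0
```

Adding a room shifts the prediction by exactly the room weight, while the zero-weight feature has no effect at all.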

## Cost functions

For our algorithms to learn, we need a way to evaluate their current performance, so that we can determine how to improve. We can mathematically define when our algorithm is performing well by evaluating an appropriate objective function. We usually try to minimise a function which indicates the error in our hypothesis. In this case, we will use the mean squared error (MSE) between our predictions and labels as our cost function.

## $ MSE\ Loss,\ J = \frac{1}{2m} \sum_{i=1}^{m} (h^{(i)} - y^{(i)})^2$

The cost function has as many dimensions as we have parameters. Changing these parameters moves us around parameter space, in which the cost varies. Varying different parameters will have varying influence on how the cost changes - as such, some are more important to optimise.

(See cost functions notebook for more detail)
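
The sketch below evaluates the cost formula above on three made-up predictions and compares it with `torch.nn.MSELoss`. Note that PyTorch averages the squared errors with a factor of $\frac{1}{m}$ rather than $\frac{1}{2m}$. The two differ only by a constant factor, so they share the same minimum.

```python
import torch

h = torch.tensor([[2.0], [4.0], [6.0]])   # hypothetical predictions, shape (m, 1)
y = torch.tensor([[1.0], [4.0], [8.0]])   # hypothetical labels, shape (m, 1)
m = h.shape[0]

J = ((h - y) ** 2).sum() / (2 * m)        # the formula above: (1 + 0 + 4) / 6
mse = torch.nn.MSELoss()(h, y)            # PyTorch's mean: (1 + 0 + 4) / 3

print(J.item(), mse.item())
print(torch.isclose(mse, 2 * J).item())   # True
```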

## Optimization
We optimize this model using the gradient descent algorithm: we iteratively calculate the derivative of our cost with respect to our parameters and use it to update our weights in the direction that reduces the cost.
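
The update described above can be written as $W \leftarrow W - \alpha \frac{\partial J}{\partial W}$, where $\alpha$ is the learning rate. Below is a minimal sketch of that loop written by hand, without an optimizer object. The synthetic data, the learning rate of 0.1 and the 200 steps are all assumptions chosen so that the example converges quickly.

```python
import torch

torch.manual_seed(0)

# Synthetic data (an assumption for illustration): y = 2*x1 - 3*x2, no noise
X = torch.randn(100, 2)
true_W = torch.tensor([[2.0], [-3.0]])
y = X @ true_W

W = torch.zeros(2, 1, requires_grad=True)   # parameters to learn
lr = 0.1                                     # learning rate (step size)

for step in range(200):
    h = X @ W
    J = ((h - y) ** 2).mean() / 2            # cost: 1/(2m) * sum of squared errors
    J.backward()                             # autograd fills W.grad with dJ/dW
    with torch.no_grad():
        W -= lr * W.grad                     # step against the gradient
    W.grad.zero_()                           # clear the gradient for the next step

print(W.detach().squeeze())                  # close to the true weights [2, -3]
```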


## Implementation
First, we import the libraries we need.

In [30]:
# import functionality from these libraries
%matplotlib notebook
import numpy as np      # for efficient numerical computation
import torch            # for building computational graphs
from torch.autograd import Variable     # deprecated since PyTorch 0.4: tensors track gradients themselves, and Variable(t) now simply returns a tensor
import matplotlib.pyplot as plt     # for plotting absolutely anything
from mpl_toolkits.mplot3d import Axes3D # for plotting 3D graphs
import pandas as pd #allows us to easily import any data

We will import our data into a pandas data frame and shuffle it

In [31]:
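#import our dataset into a pandas dataframe
#note: the file has no header row, so the first data row ends up as the column names (header=None would keep it as data)
df = pd.read_csv('airfoil_self_noise.dat', sep='\t')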
df = df.sample(frac=1) #shuffle our dataset
print(df.head())

       800     0  0.3048  71.3  0.00266337  126.201
1452  3150  12.3  0.1016  39.6    0.040827  113.055
974   4000   0.0  0.0254  55.5    0.000412  133.223
1306   800   3.3  0.1016  55.5    0.002211  129.119
1421  4000  12.3  0.1016  71.3    0.033779  118.018
1229   630  22.2  0.0254  39.6    0.022903  137.026
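
The output above shows the first data row being used as the column names. One possible way to avoid this is to pass `header=None` together with explicit names. The column names below are an assumption based on the UCI Airfoil Self-Noise dataset description, and the two sample rows are taken from the printed output so that the sketch runs without the data file.

```python
import io
import pandas as pd

# Column names are an assumption based on the UCI Airfoil Self-Noise description
columns = ['frequency', 'angle_of_attack', 'chord_length',
           'free_stream_velocity', 'displacement_thickness', 'sound_pressure']

# Two rows in the same tab-separated layout as the file (values from the output above)
sample = ("800\t0\t0.3048\t71.3\t0.00266337\t126.201\n"
          "3150\t12.3\t0.1016\t39.6\t0.040827\t113.055\n")

df_named = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns)
print(df_named.shape)   # (2, 6): the first row is kept as data
```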


Convert the datapoints into torch tensors, normalize our features and split into training and test sets. It is very important to normalize in this case, as our features differ by several orders of magnitude. Try training without normalization and you will see that the loss is significantly higher.

In [71]:
#convert our data into torch tensors
X = torch.Tensor(np.array(df[df.columns[0:-1]])) #pick our features from our dataset
Y = torch.Tensor(np.array(df[df.columns[-1]])).unsqueeze(1) #select our label and reshape it to (N, 1) so it matches the model output

X = (X-X.mean(0)) / X.std(0) #normalize our features along the 0th axis

m = 1100 #size of training set

#split our data into training and test set
#training set
x_train = Variable(X[0:m])
y_train = Variable(Y[0:m])

#test set
x_test = Variable(X[m:])
y_test = Variable(Y[m:])
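
The cell above normalizes with the mean and standard deviation of the whole dataset. A common variant, sketched below on made-up data, computes these statistics on the training rows only and reuses them for the test rows, so the test set does not influence preprocessing. The names `X_demo` and `m_demo` are hypothetical stand-ins for `X` and `m`.

```python
import torch

torch.manual_seed(0)

# Stand-in for the feature matrix: 5 features on very different scales (made up)
X_demo = torch.randn(1500, 5) * torch.tensor([1000.0, 10.0, 0.1, 20.0, 0.01])
m_demo = 1100

x_train_raw, x_test_raw = X_demo[:m_demo], X_demo[m_demo:]

# Compute the statistics on the training rows only, then reuse them for the test rows
mean = x_train_raw.mean(0)
std = x_train_raw.std(0)

x_train_norm = (x_train_raw - mean) / std
x_test_norm = (x_test_raw - mean) / std

print(x_train_norm.mean(0))   # approximately 0 for every feature
print(x_train_norm.std(0))    # approximately 1 for every feature
```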

Define the model class which we will use to instantiate our model.

In [68]:
#define model class - inherit useful functions and attributes from torch.nn.Module
class linearmodel(torch.nn.Module):
    def __init__(self):
        super().__init__() #call parent class initializer
        self.linear = torch.nn.Linear(5, 1) #define linear combination function with 5 inputs and 1 output

    def forward(self, x):
        x = self.linear(x) #linearly combine our inputs to give 1 output
        return x
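
To see what the layer inside this class holds, the sketch below builds the same `torch.nn.Linear(5, 1)` layer on its own and checks its shapes on a hypothetical batch of 4 examples. Note that `torch.nn.Linear` also learns a bias term by default, which the $h = XW$ formula leaves out. The last lines illustrate why labels should have shape `(m, 1)` rather than `(m,)`.

```python
import torch

# Same layer as in the class above, built directly for a quick shape check
layer = torch.nn.Linear(5, 1)

print(layer.weight.shape)   # torch.Size([1, 5]): one weight per input feature
print(layer.bias.shape)     # torch.Size([1]): a single offset

batch = torch.randn(4, 5)   # hypothetical batch of 4 examples
out = layer(batch)
print(out.shape)            # torch.Size([4, 1]): one prediction per example

# The layer computes x @ W^T + b
manual = batch @ layer.weight.T + layer.bias
print(torch.allclose(out, manual, atol=1e-6))   # True

# Shape pitfall: subtracting flat labels broadcasts instead of pairing up examples
labels_flat = torch.randn(4)                     # shape (4,)
print((out - labels_flat).shape)                 # torch.Size([4, 4])
print((out - labels_flat.unsqueeze(1)).shape)    # torch.Size([4, 1])
```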

Define the necessary hyper-parameters, instantiate the model from the class, cost function and optimizer.

In [69]:
no_epochs = 100
lr = 10

#create our model from defined class
mymodel = linearmodel()
criterion = torch.nn.MSELoss() #mean squared error cost function, as this is a regression problem
optimizer = torch.optim.Adam(mymodel.parameters(), lr = lr) #define our optimizer

Create the axes which we will use to plot our costs each epoch. Define the training loop and train.

In [None]:
#for plotting costs
costs=[]
plt.ion()
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('Epoch')
ax.set_ylabel('Cost')
ax.set_xlim(0, no_epochs-1)
plt.show()

#training loop - same as last time
def train(no_epochs):
    for epoch in range(no_epochs):
        h = mymodel(x_train) #forward propagate - calculate our hypothesis
        #calculate, plot and print cost
        cost = criterion(h, y_train)
        costs.append(cost.item()) #.item() extracts the Python float from the 0-dim cost tensor
        ax.plot(costs, 'b')
        fig.canvas.draw()
        print('Epoch ', epoch, ' Cost: ', cost.item())

        #calculate gradients + update weights using gradient descent step with our optimizer
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

train(no_epochs)
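
For a version of this loop that can be run without the dataset or the plotting setup, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions rather than the notebook's exact setup: the data is generated from known weights, and plain SGD with a learning rate of 0.1 is used instead of Adam so that convergence is easy to verify.

```python
import torch

torch.manual_seed(0)

# Synthetic regression data (an assumption for illustration): 5 features, known weights
X_syn = torch.randn(200, 5)
true_w = torch.tensor([[1.0], [-2.0], [0.5], [0.0], [3.0]])
y_syn = X_syn @ true_w + 4.0                 # labels with shape (200, 1)

model = torch.nn.Linear(5, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

history = []
for epoch in range(200):
    h = model(X_syn)                         # forward pass
    cost = criterion(h, y_syn)               # scalar cost
    history.append(cost.item())              # .item() converts a 0-dim tensor to a float

    optimizer.zero_grad()                    # clear gradients from the previous step
    cost.backward()                          # compute d(cost)/d(parameters)
    optimizer.step()                         # update weights and bias

print(history[0], history[-1])               # the cost should fall towards zero
```

After training, `model.weight` and `model.bias` should be close to `true_w` and the offset of 4.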

Now we check the cost of the model on the test set. If the cost on the test set is significantly higher than on the training set, our model does not generalize well to new examples.

In [None]:
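def test():
    with torch.no_grad(): #no gradients are needed for evaluation
        h = mymodel(x_test)
        cost = criterion(h, y_test)

    return cost.item() #extract the Python float from the 0-dim cost tensor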

test_cost = test()
print('Cost on test set: ', test_cost)
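
A reusable evaluation helper might look like the sketch below. The function name `evaluate` and the random stand-in model and data are hypothetical. The point is the pattern of wrapping the forward pass in `torch.no_grad()` so that no computational graph is built during evaluation.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Linear(5, 1)                # stand-in for the trained model
criterion = torch.nn.MSELoss()
x_held_out = torch.randn(50, 5)              # hypothetical held-out features
y_held_out = torch.randn(50, 1)              # hypothetical held-out labels

def evaluate(model, x, y):
    model.eval()                             # disable training-only behaviour (no effect on a plain linear layer)
    with torch.no_grad():                    # no graph is built, so no gradients are tracked
        cost = criterion(model(x), y)
    return cost.item()

held_out_cost = evaluate(model, x_held_out, y_held_out)
print('Cost on held-out set:', held_out_cost)
```

Calling the same helper on the training data gives a like-for-like number to compare against the test cost.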