## Introduction

Linear Regression is a statistical analysis tool in order to predict the relationship between two variables. Typically, we refer to independent variable as the $x$ variable and the dependent variable as the $y$ variable.

These models are linear making them simple which is an advantage because due to their easy to understand nature, it makes them vastly modular for many applications.

### Notation
Before moving forward, we need to define some notations that will be used throughout this file.

- $x$ is independent vector
- $y$ is dependent vector
- $m$ is the slope
- $b$ is the intercept
- $\sigma_m$ is the standard deviation of $m$
- $\sigma_b$ is the standard deviation of $b$
- $x_i$ is one element of input vector
- $y_i$ is one element of output vector
- $\bar{y}$ is mean of output vector
- $\sigma_n$ is the noise

## Linear Regression

A linear equation is model using

$$ y = mx + b$$

where $x$ is the independent variable, $y$ is the dependent variable, and $b$ is the $y$-intercept.

Our goal in linear regression is to optimize $m$ and $b$. To be able to optimize any model, we need a way to judge how good the parameters are. For this we can use the coefficient of determination. This will tell us how good the best fit line is from the actual data. The closer $R^2$ is to $1$, the more accurate the best fit line is.

$$ R^2 = \frac{RSS}{TSS} $$

where $RSS$ is the sum of squares of residuals and can be found using

$$ RSS = \sum_{i=1}^n (y_i - f(x_i))^2 $$

and $TSS$ is the total sum of squares

$$ TSS = \sum_{i=1}^n (y_i - \bar{y})^2 $$


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim


class StraightLine_v1(nn.Module):
    '''
    This class models a straight line
    '''
    def __init__(self):
        super().__init__()
        self.m = nn.Parameter(torch.tensor(1.0))
        self.b = nn.Parameter(torch.tensor(1.0))

    
    def forward(self, x):
        return self.m * x + self.b

In [2]:
class RSquared_v1(nn.Module):
    '''
    This class implements the R2 loss function
    '''
    def __init__(self):
        super().__init__()
        self.f = StraightLine_v1()
    
    def forward(self, x: torch.Tensor, y: torch.Tensor):
        y_pred = self.f(x)
        
        RSS = torch.sum((y - y_pred)**2)
        TSS = torch.sum((y - torch.mean(y))**2)

        return (-1.) * (1 - RSS / TSS)

## Our Data

We will be using an advertising dataset with different types of advertising platforms. For our case, we want to know how TV advertising impacts Sales and so our $x$ vector will be TV and our $y$ vector will be Sales.

In [3]:
import pandas as pd
import torch
from torch.autograd import Variable

# Obtains the file and cleans it
FILE_PATH = 'data/Advertising.csv' 
df = pd.read_csv(FILE_PATH).drop('Unnamed: 0', axis=1)

# Obtains our x and y vectors
x = df.TV.to_numpy().reshape(-1, 1)
y = df.Sales.to_numpy().reshape(-1, 1)

x_tensor = Variable(torch.from_numpy(x).type(torch.FloatTensor)) # type: ignore
y_tensor = Variable(torch.from_numpy(y).type(torch.FloatTensor)) # type: ignore

## Training Data
In order to train all of our models, we will use the Adam optimizer.

In [4]:
from tqdm import tqdm

def train(model, x_tensor, y_tensor, epochs=10000, learning_rate=0.02):
    '''
    This function trains a model
    '''
    optimizer = optim.Adam(model.parameters(), lr = learning_rate)

    for epoch in tqdm(range(epochs), desc="Training..."):
        optimizer.zero_grad()
        
        nLogLik = model(x_tensor, y_tensor)
        nLogLik.backward(retain_graph=True)

        optimizer.step()

In [6]:
# Trains the model
model_rs = RSquared_v1()
train(model_rs, x_tensor, y_tensor)

# Prints the results
model_rs.f.m, model_rs.f.b

Training...: 100%|██████████| 10000/10000 [00:05<00:00, 1757.22it/s]


(Parameter containing:
 tensor(0.0475, requires_grad=True),
 Parameter containing:
 tensor(7.0326, requires_grad=True))

## Maximum Liklihood Estimation (MLE)

Maximum Likelihood Estimation is a method we can use to estimate the parameters of an assumed probability distribution.

$$ \sum_{i=1}^n \frac{(y_i - mx_i - b)^2}{2\sigma_n^2} + \frac{1}{2\sigma_n^2}$$

Before we can do bayesian inference, this is equivalent to maximum a posteriori (MAP) estimation with uniform prior distributions.

In [7]:
class nLogLikelyhood_v1(nn.Module):
    '''
    This class models the log likelyhood of a straight line
    which is used to assess the quality of the fit
    '''
    def __init__(self):
        super().__init__()
        self.f = StraightLine_v1()
        self.sigmaN = nn.Parameter(torch.tensor(1.0))
    
    def forward(self, x, y):
        pred = self.f(x)
        return 0.5 * (torch.log(self.sigmaN ** 2) + 0.5 * ((y - pred) ** 2) / (self.sigmaN ** 2)).sum()
        # The line above is equivalent to the following:
        # return (-1.) * Normal(pred, self.sigmaN).log_prob(y).sum()

In [8]:
model_mle = nLogLikelyhood_v1()
train(model_mle, x_tensor, y_tensor)

model_mle.f.m, model_mle.f.b, model_mle.sigmaN

Training...: 100%|██████████| 10000/10000 [00:07<00:00, 1402.18it/s]


(Parameter containing:
 tensor(0.0475, requires_grad=True),
 Parameter containing:
 tensor(7.0326, requires_grad=True),
 Parameter containing:
 tensor(2.2635, requires_grad=True))

## Maximum a Posteriori Estimation (MAP)

Maximum a Posteriori Estimation adds on to the maximum likelihood estimation and can be seen as a regularization of maximum likelihood estimation.

$$-\log{P(y \mid X, w) * P(m) * P(b)}$$

To find both $P(m)$ and $P(b)$ we need to use these equation...

$$P(m) = \log\sigma_m + \frac{m^2}{2\sigma_m^2}$$

$$ \& $$

$$P(b) = \log\sigma_b + \frac{b^2}{2\sigma_b^2}$$


In [9]:
from torch.distributions.normal import Normal

class maxPosterior_v1(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = StraightLine_v1()
        self.sigmaN = nn.Parameter(torch.tensor(1.0))
        
    def forward(self, x, y):
        pred = self.f(x)
        nLogLik = (-1.) * Normal(pred, self.sigmaN).log_prob(y).sum()
        
        nLogPriorM = (-1.) * Normal(0, 1).log_prob(self.f.m)
        nLogPriorB = (-1.) * Normal(0, 10).log_prob(self.f.b)
        
        return nLogLik + nLogPriorM + nLogPriorB

In [10]:
model_map = maxPosterior_v1()
train(model_map, x_tensor, y_tensor)

model_map.f.m, model_map.f.b, model_map.sigmaN

Training...: 100%|██████████| 10000/10000 [00:15<00:00, 640.61it/s]


(Parameter containing:
 tensor(0.0476, requires_grad=True),
 Parameter containing:
 tensor(7.0218, requires_grad=True),
 Parameter containing:
 tensor(2.8215, requires_grad=True))

## Advanced Straight Line

So far we have obtained only one slope and one intercept but we want a distribution for both the slope and intercept.

In [11]:
class StraightLine_v2(nn.Module):
    def __init__(self):
        super().__init__()
        self.muM = nn.Parameter(torch.tensor(1.0))
        self.sigmaM = nn.Parameter(torch.tensor(1.0))
        self.muB = nn.Parameter(torch.tensor(1.0))
        self.sigmaB = nn.Parameter(torch.tensor(1.0))
        
        self.samples = 100
        
        self.m = self.muM + torch.exp(self.sigmaM) * Normal(0, 1).sample(torch.Size([1, self.samples]))
        self.b = self.muB + torch.exp(self.sigmaB) * Normal(0, 1).sample(torch.Size([1, self.samples]))
    
    def forward(self, x):        
        self.m = self.muM + torch.exp(self.sigmaM) * Normal(0, 1).sample(torch.Size([1, self.samples]))
        self.b = self.muB + torch.exp(self.sigmaB) * Normal(0, 1).sample(torch.Size([1, self.samples]))

        return torch.matmul(x, self.m) + self.b.repeat(x.shape[0], 1)

## KL Divergence

This allows us to measure how one probability distributition can be different from another.

We want to emulate...
$$ -\int q(w) \log\frac{p(y \mid x, w)p(w)}{q(w)} dw $$

Which is approximately equal to...
$$ \frac{1}{M}\sum_{i=1}^M\log P(y \mid x, w^{(i)})

In [12]:
class variationalBayes_v1(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = StraightLine_v2()
        self.sigmaN = nn.Parameter(torch.tensor(1.0))
        self.prior_M_std = torch.tensor([1.])
        self.prior_B_std = torch.tensor([10.])
        
    def forward(self, x, y):
        pred = self.f(x)
        
        y_truth = y.reshape(y.shape[0], -1).repeat(1, self.f.samples)
        nLogLik = (-1.) * Normal(pred, self.sigmaN).log_prob(y_truth).sum() / self.f.samples

        nLogPriorM = torch.log(self.prior_M_std) + 0.5 * (self.f.muM / self.prior_M_std) ** 2 + 0.5 * (self.f.sigmaM / self.prior_M_std) ** 2
        nLogPriorB = torch.log(self.prior_B_std) + 0.5 * (self.f.muB / self.prior_B_std) ** 2 + 0.5 * (self.f.sigmaB / self.prior_B_std) ** 2
        
        LogVarPostM = (-1.) * torch.log(self.f.sigmaM) 
        LogVarPostB = (-1.) * torch.log(self.f.sigmaB)
        
        return LogVarPostM + LogVarPostB + nLogLik + nLogPriorM + nLogPriorB

In [13]:
model = variationalBayes_v1()
train(model, x_tensor, y_tensor)

model.f.muM, model.f.muB, torch.exp(model.f.sigmaM), torch.exp(model.f.sigmaB), model.sigmaN

Training...: 100%|██████████| 10000/10000 [00:56<00:00, 177.73it/s]


(Parameter containing:
 tensor(0.0495, requires_grad=True),
 Parameter containing:
 tensor(7.0114, requires_grad=True),
 tensor(0.0248, grad_fn=<ExpBackward0>),
 tensor(1.0513, grad_fn=<ExpBackward0>),
 Parameter containing:
 tensor(3.3328, requires_grad=True))