# Linear or Bayesian Regression to predict course prices?

In this section, we use Linear and Bayesian Linear Regression to predict course prices. Bayesian inference is an alternative to frequentist inference that is especially useful when the dataset is too small to build robust models and/or we have prior knowledge about the phenomenon under study.

### Frequentist Linear Regression
Linear Regression models the dependent variable $y$ as a linear combination of a set of predictor variables, $x$, each multiplied by a weight. The equation also includes a random error term, $\epsilon$, assumed to be normally distributed. The matrix generalization of the linear model for any number of predictors is:

\begin{align*}
y = \beta^TX+\epsilon
\end{align*}

In Machine Learning, Linear Regression is used as a supervised learning method: the learning process consists of using a training dataset to find the coefficients $\beta$ that best describe the observed variable. The best solution minimizes the *RSS* (Residual Sum of Squares), the sum of squared differences between each observed value $y_n$ and the value predicted for that observation.

\begin{align*}
RSS = \sum_{n=1}^{N}(y_n-\hat{y}_n)^2 = \sum_{n=1}^{N}(y_n-\beta^T{x_n})^2
\end{align*}
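As a minimal sketch of the RSS formula, the snippet below evaluates it with NumPy on a tiny made-up dataset. The function name `rss` and the toy arrays are assumptions for illustration, and the data are arranged with one row per observation, so the prediction for all rows is `X @ beta`.

```python
import numpy as np

def rss(X, y, beta):
    """Residual sum of squares for a design matrix X (N x d), targets y (N,), weights beta (d,)."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

# Toy data (hypothetical): a column of ones for the intercept plus one predictor.
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([1.0, 3.0, 5.0])

# y = 1 + 2x reproduces the targets exactly, so the RSS is zero.
rss_exact = rss(X_toy, y_toy, np.array([1.0, 2.0]))

# With the intercept off by 1, each of the 3 residuals equals 1, so the RSS is 3.
rss_off = rss(X_toy, y_toy, np.array([0.0, 2.0]))
```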

Minimizing the RSS is called Ordinary Least Squares (OLS). Setting the gradient of the RSS with respect to $\beta$ to zero gives the optimum, which under normally distributed errors is also the **maximum likelihood estimate of $\beta$**: the coefficients under which the observed outputs $y$ are most probable given the inputs $X$. This method provides a single point estimate for every model parameter, based only on the available training data. For small datasets, it makes more sense to estimate a distribution of probable values for the parameters.


The estimate of $\beta$, written $\hat{\beta}$, in matrix form is:

\begin{align*}
\hat{\beta}=(X^TX)^{-1}X^Ty
\end{align*}

Finally, using this $\hat{\beta}$ we are able to estimate the output value:

\begin{align*}
\hat{y} = \hat{\beta}^TX
\end{align*}
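The two formulas above can be sketched with NumPy on simulated data. Everything here (the simulated `X_sim`, `y_sim`, and the "true" coefficients) is made up for illustration, and the design matrix stores one observation per row with a leading column of ones for the intercept. The system is solved with `np.linalg.solve` instead of forming the inverse explicitly, which is a numerically safer way to evaluate the same expression.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Simulated design matrix: intercept column plus two random predictors.
X_sim = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
y_sim = X_sim @ beta_true + rng.normal(scale=0.1, size=N)

# Normal equations: solve (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X_sim.T @ X_sim, X_sim.T @ y_sim)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X_sim, y_sim, rcond=None)

# Fitted values for the training inputs.
y_hat = X_sim @ beta_hat
```

At the solution the gradient of the RSS vanishes, so `X_sim.T @ (y_sim - y_hat)` is numerically zero.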

### Bayesian Linear Regression

In the Bayesian approach, the observed variable is modeled as a draw from a probability distribution rather than as a single value. Assuming that the output is normally distributed, the model for Bayesian Linear Regression is expressed as follows:

\begin{align*}
y \sim \mathcal{N}(\beta^TX,\sigma^2{I})
\end{align*}

where the mean is the transposed weight vector, $\beta$, multiplied by the predictor matrix, and the variance is the square of the standard deviation $\sigma$ of the error term, multiplied by the identity matrix $I$ so that its dimensions match the number of observations.
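To make the model concrete, here is a hedged sketch of the Gaussian log-likelihood it implies, evaluated on simulated data. The function name `gaussian_log_likelihood`, the simulated arrays, and the fixed value of `sigma` are assumptions for illustration. Because the log-likelihood depends on $\beta$ only through the RSS, the OLS coefficients maximize it for any fixed $\sigma$.

```python
import numpy as np

def gaussian_log_likelihood(X, y, beta, sigma):
    """Log of N(y | X beta, sigma^2 I) for a design matrix X with one row per observation."""
    n = y.shape[0]
    residuals = y - X @ beta
    rss = residuals @ residuals
    return float(-0.5 * n * np.log(2.0 * np.pi * sigma**2) - rss / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X_sim = np.column_stack([np.ones(50), rng.normal(size=50)])
y_sim = X_sim @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=50)

# OLS coefficients versus the same coefficients with the intercept shifted by 0.5.
beta_ols = np.linalg.lstsq(X_sim, y_sim, rcond=None)[0]
ll_ols = gaussian_log_likelihood(X_sim, y_sim, beta_ols, sigma=0.5)
ll_shifted = gaussian_log_likelihood(X_sim, y_sim, beta_ols + np.array([0.5, 0.0]), sigma=0.5)
```

Any coefficient vector other than the OLS one yields a lower log-likelihood, as `ll_shifted` illustrates.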

Unlike Linear Regression, Bayesian Regression generates a distribution of possible predicted values, not just a single one. To do that, the model determines the posterior distribution of the model parameters; that is, it assumes that the parameters themselves come from distributions.

According to Bayesian Inference, the posterior probability of the parameters is the conditional probability of the coefficients given the training inputs $X$ and outputs $y$:

\begin{align*}
P(\beta|y, X)=\frac{P(y|\beta, X)P(\beta|X)}{P(y|X)}
\end{align*}


where:
- $P(y|\beta, X)$ is the likelihood of the data
- $P(\beta|X)$ is the prior probability of the parameters
- $P(y|X)$ is the normalization constant of Bayes' theorem

Therefore, the distribution of possible model parameters depends on both the data and prior knowledge. The prior probability lets us include previous knowledge about the parameters in the model. When no such knowledge is available, the usual choice is a non-informative or weakly informative prior, such as a normal distribution with a large variance.

For small datasets, the uncertainty of the posterior distribution can be large, but as the amount of data grows the posterior concentrates on a narrower range of values, and in the limit of infinite data it converges to the OLS estimates.
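This behavior can be checked in a special case where the posterior has a closed form: a known noise variance $\sigma^2$ and a zero-mean normal prior $\beta \sim \mathcal{N}(0, \tau^2 I)$. Both are simplifying assumptions made here for illustration, as are the function name `gaussian_posterior` and the simulated data. Under those assumptions the posterior is normal with covariance $(X^TX/\sigma^2 + I/\tau^2)^{-1}$ and mean equal to that covariance times $X^Ty/\sigma^2$.

```python
import numpy as np

def gaussian_posterior(X, y, sigma2, tau2):
    """Posterior mean and covariance of beta for known noise variance and a N(0, tau2 * I) prior."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma2
    return mean, cov

rng = np.random.default_rng(42)
beta_true = np.array([1.0, 2.0])
sigma = 1.0

results = {}
for n in (10, 100, 10_000):
    # Simulated data: intercept column plus one random predictor.
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    mean, cov = gaussian_posterior(X, y, sigma2=sigma**2, tau2=10.0)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    results[n] = {
        "posterior_sd": np.sqrt(np.diag(cov)),           # uncertainty per coefficient
        "gap_to_ols": np.abs(mean - beta_ols).max(),     # distance from the OLS estimate
    }
```

In this sketch the posterior standard deviations shrink as `n` grows, and the posterior mean moves toward the OLS estimate.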

The posterior distribution usually cannot be calculated directly, because the normalization constant $P(y|X)$ requires integrating over all possible parameter values. Markov Chain Monte Carlo (MCMC) methods approximate the posterior by drawing random samples from it. Starting from normal priors for the parameters, the sampler explores the parameter space by proposing new parameter values at every step, each limited to its own parameter space. The more samples are generated, the more accurate the approximation of the posterior distribution.

Is it possible to converge to the right distribution? In a Markov chain, each new random sample of the parameters depends only on the current state (the current parameter values), and the sampler also takes into account the assumed prior distribution of the parameters (a normal distribution is used when there are no previous assumptions).
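One of the simplest MCMC algorithms is random-walk Metropolis, sketched below for a linear model with normal priors. This is an illustrative sampler, not necessarily the one a given library uses; the function names, the fixed noise level `sigma`, the prior width `prior_sd`, the step size, and the simulated data are all assumptions. Each proposal depends only on the current parameter values, and it is accepted with a probability given by the ratio of unnormalized posterior densities, so the normalization constant is never needed.

```python
import numpy as np

def log_posterior(beta, X, y, sigma, prior_sd):
    """Unnormalized log posterior: Gaussian likelihood plus independent N(0, prior_sd^2) priors."""
    residuals = y - X @ beta
    log_lik = -0.5 * residuals @ residuals / sigma**2
    log_prior = -0.5 * beta @ beta / prior_sd**2
    return log_lik + log_prior

def metropolis(X, y, sigma, prior_sd, n_samples, step, rng):
    """Random-walk Metropolis sampler over the coefficient vector beta."""
    beta = np.zeros(X.shape[1])
    current = log_posterior(beta, X, y, sigma, prior_sd)
    samples = np.empty((n_samples, beta.size))
    for i in range(n_samples):
        # Propose a move that depends only on the current state.
        proposal = beta + rng.normal(scale=step, size=beta.size)
        candidate = log_posterior(proposal, X, y, sigma, prior_sd)
        # Accept with probability min(1, posterior ratio), computed in log space.
        if np.log(rng.uniform()) < candidate - current:
            beta, current = proposal, candidate
        samples[i] = beta
    return samples

rng = np.random.default_rng(0)
n = 100
X_sim = np.column_stack([np.ones(n), rng.normal(size=n)])
y_sim = X_sim @ np.array([1.0, 2.0]) + rng.normal(scale=1.0, size=n)

samples = metropolis(X_sim, y_sim, sigma=1.0, prior_sd=10.0,
                     n_samples=20_000, step=0.1, rng=rng)
posterior_draws = samples[5_000:]          # discard the early draws as burn-in
mcmc_mean = posterior_draws.mean(axis=0)   # Monte Carlo estimate of the posterior mean
```

For this particular model the posterior mean is also available in closed form, which gives a convenient way to check that the chain has converged to the right distribution.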

In [13]:
import pandas as pd
from sklearn import linear_model
from sklearn.preprocessing import OneHotEncoder

In [2]:
df_courses = pd.read_csv('../Data/interim/Courses.csv')

In [3]:
df_courses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3351 entries, 0 to 3350
Data columns (total 16 columns):
id                      3351 non-null int64
title                   3351 non-null object
url                     3351 non-null object
isPaid                  3351 non-null bool
price                   3351 non-null float64
numSubscribers          3351 non-null int64
numReviews              3351 non-null int64
numPublishedLectures    3351 non-null int64
instructionalLevel      3351 non-null object
contentInfo             3351 non-null object
publishedTime           3351 non-null object
category                3351 non-null object
timeSpent               3351 non-null float64
publishDate             3351 non-null object
level                   3351 non-null object
paidBool                3351 non-null bool
dtypes: bool(2), float64(2), int64(4), object(8)
memory usage: 373.1+ KB


In [4]:
df_courses.drop(columns=['url', 'isPaid', 'instructionalLevel', 'contentInfo', 'publishedTime'], inplace=True)

In [5]:
df_courses.head()

Unnamed: 0,id,title,price,numSubscribers,numReviews,numPublishedLectures,category,timeSpent,publishDate,level,paidBool
0,28295,Learn Web Designing & HTML5/CSS3 Essentials in...,75.0,43285,525,24,WebDevelopment,4.0,2013-01-03,All Levels,True
1,19603,Learning Dynamic Website Design - PHP MySQL an...,50.0,47886,285,125,WebDevelopment,12.5,2012-06-18,All Levels,True
2,889438,ChatBots: Messenger ChatBot with API.AI and No...,50.0,2577,529,64,WebDevelopment,4.5,2016-06-30,All Levels,True
3,197836,Projects in HTML5,60.0,8777,206,75,WebDevelopment,15.5,2014-06-17,Intermediate Level,True
4,505208,Programming Foundations: HTML5 + CSS3 for Entr...,20.0,23764,490,58,WebDevelopment,5.5,2015-10-17,Beginner Level,True


In [6]:
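def oneHotEncode(df, name_column):
    # Build one indicator column per category, named "<name_column>_<value>"
    dummies = pd.get_dummies(df[name_column], prefix=name_column)
    # Append the indicator columns and drop the original categorical column
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(columns=[name_column])
    return df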

In [8]:
df = oneHotEncode(df_courses, 'category')
df = oneHotEncode(df, 'level')
df = oneHotEncode(df, 'paidBool')

In [9]:
df.head()

Unnamed: 0,id,title,price,numSubscribers,numReviews,numPublishedLectures,timeSpent,publishDate,category_BussinessFinance,category_GraphicDesign,category_MusicInstrument,category_WebDevelopment,level_All Levels,level_Beginner Level,level_Expert Level,level_Intermediate Level,paidBool_False,paidBool_True
0,28295,Learn Web Designing & HTML5/CSS3 Essentials in...,75.0,43285,525,24,4.0,2013-01-03,0,0,0,1,1,0,0,0,0,1
1,19603,Learning Dynamic Website Design - PHP MySQL an...,50.0,47886,285,125,12.5,2012-06-18,0,0,0,1,1,0,0,0,0,1
2,889438,ChatBots: Messenger ChatBot with API.AI and No...,50.0,2577,529,64,4.5,2016-06-30,0,0,0,1,1,0,0,0,0,1
3,197836,Projects in HTML5,60.0,8777,206,75,15.5,2014-06-17,0,0,0,1,0,0,0,1,0,1
4,505208,Programming Foundations: HTML5 + CSS3 for Entr...,20.0,23764,490,58,5.5,2015-10-17,0,0,0,1,0,1,0,0,0,1
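Note that the indicator columns of each encoded variable always sum to 1 (for example `paidBool_False` and `paidBool_True`), so together with an intercept they are perfectly collinear. A possible variant, sketched here on a made-up three-row frame, uses the `drop_first` option of `pd.get_dummies` to keep one column fewer per variable. The function name `one_hot_encode` and the `toy` data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def one_hot_encode(df, column, drop_first=True):
    """Return a copy of df with `column` replaced by indicator columns."""
    dummies = pd.get_dummies(df[column], prefix=column, drop_first=drop_first)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Toy frame (hypothetical values) mimicking two of the course columns.
toy = pd.DataFrame({
    "price": [75.0, 50.0, 20.0],
    "level": ["All Levels", "Beginner Level", "All Levels"],
    "paidBool": [True, True, False],
})

encoded = one_hot_encode(one_hot_encode(toy, "level"), "paidBool")
encoded_columns = list(encoded.columns)
```

The first category of each variable (in sorted order) becomes the implicit baseline, and the input frame is left unmodified.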


In [10]:
df.drop(columns=['id', 'title', 'publishDate'], inplace=True)

In [12]:
y = df.price
X = df.drop(columns='price')

Predicting prices based on the features above:

In [14]:
clf = linear_model.ARDRegression()
clf.fit(X, y)

ARDRegression(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, threshold_lambda=10000.0, tol=0.001, verbose=False)

In [15]:
# predict expects the 2D feature matrix X; passing the 1D target y raises
# ValueError: Expected 2D array, got 1D array instead
clf.predict(X)
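Because `ARDRegression` is a Bayesian model, its `predict` method can also return a standard deviation for each prediction through `return_std=True`, which gives the distribution of predicted values discussed earlier rather than a single number. The sketch below uses simulated data, not the course dataset, so the arrays and coefficients are assumptions for illustration. Note that `predict` receives a 2D feature matrix with the same columns used in `fit`.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)

# Simulated features and target (hypothetical), one row per observation.
X_sim = rng.normal(size=(200, 3))
y_sim = X_sim @ np.array([2.0, 0.0, -1.0]) + 5.0 + rng.normal(scale=0.5, size=200)

model = ARDRegression()
model.fit(X_sim, y_sim)

# Predictive mean and standard deviation for the first five rows.
y_mean, y_std = model.predict(X_sim[:5], return_std=True)

# Approximate 95% predictive interval under the model's normality assumption.
lower, upper = y_mean - 1.96 * y_std, y_mean + 1.96 * y_std
```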