# Linear Regression

The goal of this assignment is to build a simple linear regression algorithm from scratch. Linear regression is a useful and easy-to-understand method for predicting values from a set of training data. The outcome of regression is a best-fit line, which, by definition, is the line that minimizes the sum of the squared errors. When plotted on a two-dimensional coordinate system, the errors are the vertical distances between the actual values Y and the values Y' predicted by the line. In machine learning, the line equation Y' = bX + A is often solved with gradient descent, which approaches the solution gradually. **We will use the statistical approach here, which solves the line equation directly without an iterative algorithm.**

---
### Imports

In [None]:
from k2datascience import linear_regression

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

---
### Load Data

In [None]:
ad = linear_regression.AdvertisingSimple()

## Exercise 1 - Explore the Data

The `Advertising` data set consists of the sales of a product in 200 different
markets, along with advertising budgets for the product in each of those
markets for three different media: TV, radio, and newspaper. Explore the data and decide which variable you would like to use to predict `Sales`.

In [None]:
ad.data.info()
ad.data.head()
ad.data.describe()

In [None]:
ad.plot_correlation_joint_plots()

In [None]:
ad.plot_correlation_heatmap()

#### Findings
- TV advertising has the strongest correlation with sales, and will be used for prediction.

## Exercise 2 - Build a Simple Linear Regression Class

The derivation can be [found here on Wikipedia](https://en.wikipedia.org/wiki/Simple_linear_regression).

The general steps are:
- Calculate mean and variance
- Calculate covariance
- Estimate coefficients
- Make predictions on out-of-sample data
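
For reference, the steps above correspond to the closed-form estimates below. The notation is introduced here for illustration: $x_i$ and $y_i$ are the $n$ training points, $b$ is the slope (coefficient), and $A$ is the intercept. Population ($1/n$) normalization is assumed; the $1/n$ factors cancel in the ratio, so sample ($1/(n-1)$) normalization gives the same $b$.

$$
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \\
\operatorname{Var}(x) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\operatorname{Cov}(x, y) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
b &= \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}, \qquad A = \bar{y} - b\,\bar{x} \\
Y' &= b\,x + A
\end{aligned}
$$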

The class should do the following:
- Fit a set of x,y points
- Predict the values for new x values based on the coefficients
- Plot the best-fit line on the points
- Return the coefficient and intercept
- Return the coefficient of determination (R^2)
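
A minimal sketch of such a class is shown below, assuming pure Python with no third-party dependencies. The class name `SimpleLinearRegression` and its attribute names are hypothetical and are not the `k2datascience` implementation; plotting is left out to keep the example dependency-free.

```python
from statistics import fmean


class SimpleLinearRegression:
    """Closed-form simple linear regression (illustrative sketch)."""

    def __init__(self):
        self.coefficient = None
        self.intercept = None
        self.r2 = None

    def fit(self, x, y):
        """Estimate the slope, intercept and R^2 from paired x, y values."""
        x, y = list(x), list(y)
        if len(x) != len(y) or len(x) < 2:
            raise ValueError('x and y need the same length (at least 2).')

        x_mean, y_mean = fmean(x), fmean(y)

        # Sums of squares and cross-products. Dividing both by n would give
        # the variance and covariance, but the 1/n factors cancel in the ratio.
        ss_xx = sum((xi - x_mean) ** 2 for xi in x)
        ss_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        if ss_xx == 0:
            raise ValueError('All x values are identical; slope is undefined.')

        self.coefficient = ss_xy / ss_xx
        self.intercept = y_mean - self.coefficient * x_mean

        # Coefficient of determination: 1 - (residual SS / total SS).
        ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, self.predict(x)))
        ss_tot = sum((yi - y_mean) ** 2 for yi in y)
        self.r2 = 1 - ss_res / ss_tot if ss_tot else float('nan')
        return self

    def predict(self, x_new):
        """Return predicted values for an iterable of new x values."""
        return [self.coefficient * xi + self.intercept for xi in x_new]


# Small hand-made data set lying on the line y = 2x + 1.
model = SimpleLinearRegression().fit([1, 2, 3, 4], [3, 5, 7, 9])
print(model.coefficient, model.intercept, model.r2)
print(model.predict([5, 6]))
```

Returning `self` from `fit` allows the fit and the first query to be chained, which mirrors the convention used by Scikit-learn estimators.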

## Exercise 3 - Try it out on the Advertising Data Set

In [None]:
ad.simple_stats_fit()
f'Coefficient: {ad.coefficients[0]:.4f}'
f'Intercept: {ad.intercept:.4f}'
f'R-Squared Value: {ad.r2:.3f}'

In [None]:
ad.plot_simple_stats()

## Exercise 4 - Check via Statsmodels and Scikit-learn

In [None]:
import statsmodels.api as sm

X = ad.data.tv
y = ad.data.sales

X = sm.add_constant(X)
model = sm.OLS(y, X)
ln_reg = model.fit()

ln_reg.summary()

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

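# Convert to a NumPy array first; multi-dimensional indexing on a pandas
# Series is not supported in recent pandas versions.
advertising_X = ad.data.tv.values[:, np.newaxis]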
ad_X_train = advertising_X[:-20]
ad_X_test = advertising_X[-20:]

ad_y_train = ad.data.sales[:-20]
ad_y_test = ad.data.sales[-20:]

ln_reg = linear_model.LinearRegression()
ln_reg.fit(ad_X_train, ad_y_train)

f'Coefficients: {ln_reg.coef_}'
f'Intercept: {ln_reg.intercept_}'
mse = np.mean((ln_reg.predict(ad_X_test) - ad_y_test)**2)
f'Mean Squared Error: {mse:.2f}'
# score() returns the coefficient of determination (R^2) on the test set.
r2 = ln_reg.score(ad_X_test, ad_y_test)
f'R-Squared (test set): {r2:.2f}'

In [None]:
fig = plt.figure('Sales vs TV Advertising', figsize=(8, 6),
                 facecolor='white', edgecolor='black')
rows, cols = (1, 1)
ax = plt.subplot2grid((rows, cols), (0, 0))

test_sort = np.argsort(ad_X_test.flatten())

ax.scatter(ad_X_test, ad_y_test, alpha=0.5, marker='d')
ax.plot(ad_X_test[test_sort], ln_reg.predict(ad_X_test)[test_sort],
        color='black', linestyle='--')

ax.set_title('Sales vs TV Advertising', fontsize=20)
ax.set_xlabel('TV Advertising', fontsize=14)
ax.set_ylabel('Sales', fontsize=14)

plt.show();

#### Findings
- Statsmodels and SciKit-Learn both state that they use Ordinary Least Squares to perform the linear regression, but the coefficient and intercept values are slightly different.
    - This is because Statsmodels uses the entire data set to create the fit, while SciKit-Learn uses only a training subset.

# Additional Optional Exercises

- Train / test split with RMSE calculation
- Proper documentation for class methods and attributes
- Build with NumPy methods and compare computation time
- Multiple Linear Regression (SGD covered in Advanced Regression Unit)