# Regression Project Drafting

## 1 Linear Regression
### 1.1 Background
Consider $n$ samples indexed by $i = 1, \dots, n$, where each sample has $d$ features $x_{ij}$, $j = 1, \dots, d$, and a label $y_i$. Linear regression models the label as a linear function of the features, i.e.,
$$ \hat{y}_i=w_1x_{i1}+w_2x_{i2}+\cdots+w_dx_{id} = \sum_{j=1}^d w_jx_{ij} = w^Tx_i $$
where $w_j$ is the weight (regression coefficient) on feature $j$, and $w$ and $x_i$ are the corresponding $d$-dimensional vectors of weights and features.
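
As a minimal sketch of the prediction formula, the snippet below computes $\hat{y}_i$ both as an explicit sum over features and as a matrix-vector product. The toy values of `X` and `w` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: n = 3 samples (rows), d = 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
# Hypothetical weight vector w, one weight per feature.
w = np.array([0.5, -1.0])

# Prediction written out as the sum over features, sample by sample.
y_hat_loop = np.array([
    sum(w[j] * X[i, j] for j in range(X.shape[1]))
    for i in range(X.shape[0])
])

# The same predictions as a single matrix-vector product, w^T x_i for every row.
y_hat = X @ w
print(y_hat)
```

Stacking the samples as rows of `X` is a convention assumed here; it lets all $n$ predictions be computed at once.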
#### 1.1.1 Least Squares Objective
A common way to determine the regression coefficients is by minimizing the sum of squared errors between the predicted label ($\hat{y}_i = w^Tx_i$) and the true label ($y_i$), i.e., 
$$f(w) = \frac{1}{2}\sum_{i=1}^n(w^Tx_i-y_i)^2$$
where $f(w)$ is commonly referred to as the loss function.
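
The loss above can be evaluated and minimized numerically. The sketch below assumes the samples are stacked as rows of a matrix $X$, so the gradient is $\nabla f(w) = X^T(Xw - y)$, which vanishes at the minimizer. The synthetic data, `w_true`, and the helper name `least_squares_loss` are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def least_squares_loss(w, X, y):
    """Evaluate f(w) = 1/2 * sum_i (w^T x_i - y_i)^2."""
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2)

# Synthetic data (hypothetical): labels are a noisy linear function of the features.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# np.linalg.lstsq returns the w that minimizes ||Xw - y||^2,
# which is the same minimizer as f(w) (the factor 1/2 does not change it).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimizer the gradient X^T (X w - y) is numerically zero.
gradient = X.T @ (X @ w_hat - y)
print(w_hat, least_squares_loss(w_hat, X, y))
```

Using `np.linalg.lstsq` rather than explicitly inverting $X^TX$ is a design choice that tends to be more stable numerically when features are nearly collinear.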

Further reading: [Deriving the ridge regression solution](https://stats.stackexchange.com/questions/69205/how-to-derive-the-ridge-regression-solution)

Regression types to consider include:
* Ordinary least squares regression (OLSR)
* Linear regression
* Logistic regression
* Stepwise regression
* Multivariate adaptive regression splines (MARS)
* Locally estimated scatterplot smoothing (LOESS)
* Jackknife regression

## Linear Regression from Learnds.com

### From Overview

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# The raw data set needs cleaning, which is covered in the next lesson.
# For now, load the cleaned-up data set into a pandas DataFrame.
df = pd.read_csv('../datasets/loanf.csv')

# Extract FICO Score and Interest Rate and plot them:
# FICO Score on the x-axis, Interest Rate on the y-axis.
intrate = df['Interest.Rate']
fico = df['FICO.Score']
fig, ax = plt.subplots()
p = ax.plot(fico, intrate, 'o')
xt = ax.set_xlabel('FICO Score')
yt = ax.set_ylabel('Interest Rate %')

In [None]:
from IPython.display import HTML
def css_styling():
    # Read the custom notebook stylesheet and return it as rendered HTML.
    with open("../styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()

### From Data Exploration: Lending Club

In [None]:
%matplotlib inline
# first we ingest the data from the source on the web
# this contains a reduced version of the data set from Lending Club
import pandas as pd
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

In [None]:
loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate

In [None]:
loansData['Loan.Length'][0:5] # first five rows of Loan.Length

In [None]:
loansData['FICO.Range'][0:5] # first five rows of FICO.Range

In [None]:
# Loan.Length holds strings (assumed to look like "36 months").
LoanLength = loansData['Loan.Length']
# The .str accessor lives on a Series, not on a DataFrame, so the earlier
# attempt df.str.replace("mo", '') raised an AttributeError.
# Strip the unit and convert to numbers; values that cannot be parsed become NaN.
LoanLength = pd.to_numeric(
    LoanLength.str.replace(r'\s*months?', '', regex=True), errors='coerce'
)
LoanLength[0:5]
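
String methods in pandas are exposed through the `.str` accessor on a Series (a DataFrame has no `.str`). A minimal sketch on a toy Series, assuming values formatted like "36 months"; the toy data is hypothetical and is not the Lending Club file:

```python
import pandas as pd

# Hypothetical toy data standing in for the Loan.Length column.
loan_length = pd.Series(['36 months', '60 months', '36 months'])

# .str applies string methods element-wise; regex=True treats the pattern
# as a regular expression, here optional whitespace followed by "month(s)".
months = loan_length.str.replace(r'\s*months?$', '', regex=True).astype(int)
print(months.tolist())
```

If the real column contains missing values, `astype(int)` would fail; `pd.to_numeric(..., errors='coerce')` is one alternative in that case.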

In [None]:
help(str.translate)

Well, that was a fail, so now I am moving on to a new tutorial.

## From Introduction to Linear Regression
[Introduction to Linear Regression](http://nbviewer.jupyter.org/github/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb)

In [None]:
# imports
import pandas as pd
import matplotlib.pyplot as plt

# this allows plots to appear directly in the notebook
%matplotlib inline

# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# print explicitly: only the last expression in a cell is displayed automatically
print(data.head())

# print the shape of the DataFrame
print(data.shape)

# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
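
Once a scatterplot suggests a roughly linear relationship, a natural next step is fitting a line by least squares. The sketch below does this on synthetic data; the arrays `tv` and `sales` and the coefficients used to generate them are hypothetical stand-ins, not values from the Advertising data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): a linear trend plus Gaussian noise.
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(scale=1.0, size=200)

# A degree-1 polyfit is a least-squares line fit; np.polyfit returns
# coefficients from the highest degree down, i.e. slope first, then intercept.
slope, intercept = np.polyfit(tv, sales, 1)

# Predictions from the fitted line.
sales_pred = intercept + slope * tv
print(slope, intercept)
```

With a real DataFrame, the same call could be made on two of its numeric columns, assuming they contain no missing values.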