# Introduction to Regression Analysis

## _Regression:_ (metaphorical) movement to an underlying trend
0. _Regression toward the mean_
1. _Linear regression_ (simple linear, multiple linear, quantile)
2. _Polynomial regression_ (& spline regression)
3. _Non-parametric regression_ (regression trees)
4. _Binomial regression_ (binary, probit, logit/logistic)
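
Item 0, regression toward the mean, can be seen in a small simulation. This is a minimal sketch on synthetic data: the names `ability`, `test1`, and `test2` and all the numbers are assumptions made for illustration, not part of the course data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: a stable trait measured twice with independent noise.
ability = rng.normal(100, 10, n)
test1 = ability + rng.normal(0, 10, n)
test2 = ability + rng.normal(0, 10, n)

# Select the top 10% of scorers on the first test only.
top = test1 >= np.percentile(test1, 90)

mean_top_test1 = test1[top].mean()
mean_top_test2 = test2[top].mean()

# The same people score closer to the overall average the second time.
print("top group, test 1:", round(mean_top_test1, 1))
print("top group, test 2:", round(mean_top_test2, 1))
print("everyone,  test 2:", round(test2.mean(), 1))
```

The selected group was partly lucky on the first test, and that luck does not carry over to the second one.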

In [None]:
# Turns on/off pretty printing 
%pprint

# Every returned Out[] is displayed, not just the last one. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import numpy as np
import pandas as pd
import sklearn               # sklearn is the ML package we will use
import nltk 

import matplotlib.pyplot as plt
import seaborn as sns        # seaborn graphical package
sns.set_style('darkgrid')

In [None]:
# statsmodels.api is for actually looking at
# the regression equation and statistical measures thereof
import statsmodels.api as sm
import statsmodels.formula.api as smf

### Linear Regression: fitting to lines
#### Assumptions:
* continuous values
* a linear relationship
* multivariate normality
* no multicollinearity
* homoskedasticity

Regression can be used for __explanation__, and also for __prediction__:
* explain the overall relationship between predictor variable(s) and outcome variable
* predict individual outcomes for new data

A regression model itself does not prove the direction of causation; conclusions about causation _must_ come from an a priori understanding of the relationship between variables.

In [None]:
# CSV files on GitHub are rendered. Click on "Raw" to get to the raw file. 
# This salary data has cleaner correlation. 
url = "https://raw.githubusercontent.com/csjcode/course-machinelearning-az/master/Machine-Learning-A-Z/Part%202%20-%20Regression/Section%204%20-%20Simple%20Linear%20Regression/Salary_Data.csv"
df = pd.read_csv(url)
df.columns = ['years_experience', 'salary']
 
# This salary data is bigger, has more variability. 
url = "https://raw.githubusercontent.com/bokeh/bokeh/master/examples/app/export_csv/salary_data.csv"
df2 = pd.read_csv(url)

In [None]:
df.describe()

In [None]:
df

In [None]:
# Least-squares fit of a degree-1 polynomial (a straight line)
df_line = np.polyfit(df['years_experience'], df['salary'], 1)
df_line
# returns [slope, intercept], the pair that minimizes the sum of squared residuals
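
For a single predictor, the least-squares line has a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below checks this against `np.polyfit` on synthetic data; the data and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "experience vs. salary" data (made-up numbers).
x = rng.uniform(1, 10, 30)
y = 25_000 + 9_000 * x + rng.normal(0, 5_000, 30)

# Closed-form simple linear regression.
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# np.polyfit returns the highest-degree coefficient first: [slope, intercept].
fit = np.polyfit(x, y, 1)

print(slope, intercept)
print(fit)
```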

In [None]:
df_line2 = np.polyfit(df['salary'], df['years_experience'], 1)
df_line2

In [None]:
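# Correlate the two variables themselves; corrcoef on two length-2 coefficient vectors is always +1 or -1
np.corrcoef(df['years_experience'], df['salary'])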
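
The two fits above swap the roles of predictor and outcome. They are not simply inverses of each other: the product of the two slopes equals the squared correlation coefficient. A minimal sketch on synthetic data (all values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 50)
y = 3.0 * x + rng.normal(0, 4, 50)

slope_yx = np.polyfit(x, y, 1)[0]   # regress y on x
slope_xy = np.polyfit(y, x, 1)[0]   # regress x on y
r = np.corrcoef(x, y)[0, 1]         # Pearson correlation of the data

print("product of slopes:", slope_yx * slope_xy)
print("r squared:        ", r ** 2)
```

Because the fit is equally good in both directions, the regression alone says nothing about which variable drives the other.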

* continuous values?
* a linear relationship?
* multivariate normality?
* no multicollinearity?
* homoskedasticity?
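
One rough numerical check for the last question is whether the size of the residuals changes with the predictor. The helper `abs_residual_trend` below is a hypothetical name, and the data is synthetic; this is a sketch of the idea, not a formal test.

```python
import numpy as np

def abs_residual_trend(x, y):
    """Correlation between x and |residual| of a straight-line fit.

    Values near 0 are consistent with constant spread; clearly
    positive or negative values suggest heteroskedasticity.
    """
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return np.corrcoef(x, np.abs(residuals))[0, 1]

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 2_000)

y_even = 2 * x + rng.normal(0, 1.0, x.size)      # constant noise
y_fan = 2 * x + rng.normal(0, 1.0, x.size) * x   # noise grows with x

trend_even = abs_residual_trend(x, y_even)
trend_fan = abs_residual_trend(x, y_fan)
print(trend_even, trend_fan)
```

A scatter plot of residuals against the predictor shows the same thing visually, as a band of even width versus a fan shape.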

In [None]:
plt.scatter(df['years_experience'], df['salary'])

In [None]:
sns.regplot(x=df.years_experience, y=df.salary, color='blue')

### Multiple Linear Regression: fitting to multiple lines(!?)

In [None]:
english = pd.read_csv('../../Class-Exercise-Repo/activity3/english_updated.csv', index_col='Index')

In [None]:
english.describe()

In [None]:
elm = smf.ols("RTlexdec ~ Familiarity + WrittenFrequency", english)

In [None]:
elmf = elm.fit()

In [None]:
print(elmf.summary())
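
With two predictors, ordinary least squares fits a plane rather than a line: one intercept plus one coefficient per predictor. The sketch below shows the underlying computation with `np.linalg.lstsq` on synthetic data; `x1`, `x2`, and the true coefficients are made up and do not come from the `english` dataset.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Two hypothetical predictors and an outcome generated from a known plane.
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 6.5 - 0.02 * x1 - 0.04 * x2 + rng.normal(0, 0.1, n)

# Design matrix: a column of ones for the intercept, then each predictor.
design = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution: [intercept, coefficient of x1, coefficient of x2].
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefs)
```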

### Stepwise regression?
* adding/removing one predictor, comparing the resulting model to the original
* rinse, repeat

__Probably don't do this.__
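* It encourages brute-force search, and the predictors it keeps are often those whose apparent effect was inflated by chance or by outliers (such effects tend to regress toward the mean on new data)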
* It encourages overfitting to the data
* It discourages thinking about the data
* It biases p-values downward, overstating significance: a form of _p-hacking_
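
The selection problem behind these points can be simulated. In this minimal sketch, every candidate predictor is pure noise, yet picking the best of many still yields a correlation that looks meaningful. All names and sizes here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs, n_candidates = 30, 50

# Outcome and candidate predictors are all independent random noise.
y = rng.normal(size=n_obs)
noise_predictors = rng.normal(size=(n_obs, n_candidates))

# Correlation of each candidate with the outcome.
r = np.array([
    np.corrcoef(noise_predictors[:, j], y)[0, 1]
    for j in range(n_candidates)
])

best_abs_r = np.abs(r).max()            # what a "pick the best" search keeps
typical_abs_r = np.median(np.abs(r))    # what a typical candidate looks like

# With 30 observations, |r| above roughly 0.36 passes a two-sided p < 0.05 test.
n_false_hits = int((np.abs(r) > 0.361).sum())

print(best_abs_r, typical_abs_r, n_false_hits)
```

The reported p-value for the selected predictor ignores the fact that many candidates were tried first.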

In [None]:
elm2 = smf.ols("RTlexdec ~ Familiarity * WrittenFrequency", english)
elmf2 = elm2.fit()

In [None]:
print(elmf2.summary())
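
In the formula syntax, `A * B` expands to `A + B + A:B`: both main effects plus their interaction. A sketch of the equivalent design matrix on synthetic data, where the interaction is just the product of the two columns (`fam`, `freq`, and the coefficients are made-up stand-ins, not values from the `english` dataset):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000

fam = rng.uniform(1, 7, n)     # hypothetical stand-in for a familiarity rating
freq = rng.uniform(0, 10, n)   # hypothetical stand-in for a log frequency
y = (7.0 - 0.05 * fam - 0.06 * freq + 0.005 * fam * freq
     + rng.normal(0, 0.05, n))

ones = np.ones(n)
additive = np.column_stack([ones, fam, freq])               # A + B
interaction = np.column_stack([ones, fam, freq, fam * freq])  # A * B

coefs_add, *_ = np.linalg.lstsq(additive, y, rcond=None)
coefs_int, *_ = np.linalg.lstsq(interaction, y, rcond=None)

# Residual sum of squares for each model.
rss_add = np.sum((y - additive @ coefs_add) ** 2)
rss_int = np.sum((y - interaction @ coefs_int) ** 2)

print(coefs_int)
print(rss_add, rss_int)
```

The larger model can never fit the training data worse, so a drop in residual error alone does not show that the interaction is real.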

### Preparing data for machine learning. 
Need to create:
- Input data, which we will call X. 1+ columns of data points ("features"). 
    - We have only 1 "feature", however, which is years of experience.  
- Target data, which we will call y. A series of data points. 
    - Target is salary dollar amount. 

In [None]:
x = df['years_experience']    # series: lower-case x
X = df[['years_experience']]  # dataframe with only one column. Uppercase X. 
y = df['salary']              # series

In [None]:
x.head()         # Won't be using these, just for illustration
X.head()         # input feature(s)
y.head()         # output target values

In [None]:
# sklearn provides a function for splitting data. A fixed random_state makes the split reproducible.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
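
Conceptually, a random split shuffles the row indices and cuts them into two groups. The sketch below does this by hand with NumPy on made-up data. It is an illustration of the idea and not sklearn's implementation, so it will not reproduce the same rows as `train_test_split`.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, so the split is repeatable
n = 30

# Made-up data: one feature column and a target.
X = np.arange(n, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0

shuffled = rng.permutation(n)    # row indices in random order
n_test = n // 3                  # hold out one third for testing

test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(len(X_train), len(X_test))
```

Indexing `X` and `y` with the same index arrays keeps each feature row paired with its target.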

In [None]:
len(X_train)
len(X_test)

In [None]:
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
X_test
y_test

In [None]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

In [None]:
X_test[:5]    # test set, years of experience
y_test[:5]    # test set, real salaries
y_pred[:5]    # salaries predicted by regressor
                 # <-- hopefully not too far away from real numbers! 
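
"Not too far away" can be put into numbers. Two common summaries are the root mean squared error (in the same units as the target) and R squared (the share of variance explained). The sketch below computes both by hand; the salary values are made up for illustration and are not the notebook's actual predictions.

```python
import numpy as np

# Made-up true and predicted salaries.
y_true = np.array([40_000.0, 55_000.0, 62_000.0, 90_000.0, 110_000.0])
y_hat = np.array([42_000.0, 52_000.0, 65_000.0, 88_000.0, 113_000.0])

errors = y_true - y_hat

# Root mean squared error: typical size of a prediction error.
rmse = np.sqrt(np.mean(errors ** 2))

# R squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```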

In [None]:
dir(regressor)

In [None]:
regressor.coef_
regressor.get_params()
regressor.intercept_
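
The fitted model is just the equation `prediction = intercept_ + coef_ * x`. A small check on synthetic data that `predict` agrees with the equation written out by hand (the data and the name `model` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Made-up data with one feature column, shaped (n_samples, 1).
X = rng.uniform(1, 10, size=(30, 1))
y = 25_000 + 9_000 * X[:, 0] + rng.normal(0, 5_000, 30)

model = LinearRegression().fit(X, y)

# coef_ holds one slope per feature; intercept_ is a single number.
by_hand = model.intercept_ + X @ model.coef_

print(model.coef_, model.intercept_)
print(np.allclose(by_hand, model.predict(X)))
```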

### Plotting data and prediction
1. On training set
2. On test set

In [None]:
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

In [None]:
plt.scatter(X_test, y_test, color='red')
# Same fitted line as in the training plot; only the scattered points change
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

In [None]:
# How about someone with just 0.5 year of experience? How about 15? 
newdf = pd.DataFrame({'years_experience':[0.5, 15]})
newdf
regressor.predict(newdf)
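
The model will return a number for any input, including values it never saw anything like during training. A simple guard is to check whether new inputs fall inside the range of the training data. The values in `train_years` below are made up for illustration.

```python
import numpy as np

# Made-up training inputs and the two new query points.
train_years = np.array([1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 10.5])
new_years = np.array([0.5, 15.0])

# True where a query lies outside the observed training range.
outside = (new_years < train_years.min()) | (new_years > train_years.max())

for years, is_outside in zip(new_years, outside):
    label = "extrapolation" if is_outside else "interpolation"
    print(years, label)
```

Predictions flagged as extrapolation rely on the linear trend continuing beyond the data, which the fit itself cannot confirm.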