# REGRESSION AND CLASSIFICATION

## Regression: 
Regression measures the relationship between the mean value of one variable (the response) and the corresponding values of one or more other variables (the predictors). Unlike classification (discussed below), regression models continuous or ordered values.
In regression, we evaluate a model by how close its predicted value is to the actual value of each observation.

## Classification:
The process of determining the category or class that an observation belongs to.
It can be binary (spam or not spam) or multiclass (which one of "n" classes does an observation belong to?).
Classification is a supervised learning technique: we need to have a dataset with correct class labels for training. When correct class labels are not available, the corresponding unsupervised learning technique we can use is clustering.

## Supervised Learning - Steps:

#### STEP 1:  Process and generate the initial data set for modeling. 
It will contain both the predictor variables (x1...xn) and the target variable y (y is called the target variable because the goal of supervised learning is to predict y when it is missing).

#### STEP 2: Split the data into training and testing sets for evaluation of our prediction model. 

Q: Why split? Why can't we evaluate our model on the training data itself? 
A: Because then we'd have a very optimistic view of our model's performance, since it was optimized directly on the training data.

The split needs to be performed in such a way that both sets contain the same patterns of data. For example, if the initial data set has 100 observations, with the target value being "Y" in 60 of them and "N" in 40, the individual training and testing sets should also have Y and N in the same ratio of 3:2.

Splitting is usually done through random selection.

Typically, the training set has 70% of the records in the initial data set.
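A minimal sketch of such a split, assuming NumPy is available. The helper name `stratified_split` and the fixed seed are illustrative choices, not part of the original notes. It shuffles the rows of each class separately and sends 70% of each class to the training set, so the 3:2 ratio from the example is preserved in both sets.

```python
import numpy as np

def stratified_split(y, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx) that preserve the class ratio of y.

    Hypothetical helper written for this example.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle the row indices of this class, then cut at train_frac
        idx = rng.permutation(np.flatnonzero(y == label))
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

# 100 observations: 60 "Y" and 40 "N", as in the example above
labels = np.array(["Y"] * 60 + ["N"] * 40)
train_idx, test_idx = stratified_split(labels)

train_counts = {c: int(np.sum(labels[train_idx] == c)) for c in ("Y", "N")}
test_counts = {c: int(np.sum(labels[test_idx] == c)) for c in ("Y", "N")}
print(train_counts)  # {'Y': 42, 'N': 28}
print(test_counts)   # {'Y': 18, 'N': 12}
```

Both sets keep the 3:2 ratio (42:28 and 18:12). A plain random split without the per-class step would only match the ratio approximately.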

#### STEP 3: Build a model on the training data set.

Building a model (also called training or fitting) boils down to optimizing a set of parameters on the training set. For instance, in the case of linear regression, we assume a linear relationship between our features and our response variable, and learn the coefficients of the model from our training set. This occurs through minimizing our squared error loss function:

$$\text{Loss}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - X_i \beta\right)^2$$

with respect to each $\beta$ parameter (i.e., by taking the partial derivatives and setting them to zero).
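Setting those partial derivatives to zero leads to the normal equations $X^\top X \beta = X^\top y$. The sketch below, which assumes NumPy and uses made-up toy data (true intercept 2.0, true slope 0.5), solves them directly and checks the answer against `np.linalg.lstsq`. It is meant as an illustration of the idea, not as the method any particular library uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, n)  # made-up toy data

# Design matrix: a column of ones for the intercept, then the feature
X = np.column_stack([np.ones(n), x])

# Solve the normal equations (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same problem solved by NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

def loss(beta):
    """Mean squared error loss from the formula above."""
    return float(np.mean((y - X @ beta) ** 2))

print(beta_normal, beta_lstsq)
print(loss(beta_normal))
```

Moving the coefficients away from `beta_normal` in any direction increases `loss`, which is what it means for the partial derivatives to vanish at the minimum of this convex function.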

Avoid overfitting, that is, fitting the training data so closely that the model performs worse on new data.

#### STEP 4: Test the model using the test data set. 
We use the model to predict the value of Y for each test record, then compare the predicted values with the actual values of Y. This gives an indication of the accuracy of the prediction model.

#### STEP 5: Fine-tune the model using cross-validation. 
Try different algorithms and parameters until the accuracy reaches an acceptable level.
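As a rough sketch of what this step can look like, the code below runs k-fold cross-validation by hand with NumPy and uses it to compare polynomial degrees. The helper `k_fold_mse`, the toy data, and the candidate degrees are all assumptions made for illustration.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial fit over k shuffled folds.

    Hypothetical helper written for this example.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        # Train on every fold except the i-th, validate on the i-th
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[val])
        scores.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(scores))

# Made-up toy data: a noisy sine curve
rng = np.random.default_rng(1)
x = np.linspace(0, 3, 120)
y = np.sin(2 * x) + rng.normal(0, 0.1, x.size)

# Try several model complexities and keep the one with the lowest CV error
cv_scores = {d: k_fold_mse(x, y, d) for d in (1, 3, 5)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_degree)
```

Because every observation is used for validation exactly once, the averaged score is less sensitive to one lucky or unlucky split than a single train/test evaluation.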

#### STEP 6: Predict unknown data.
Apply the fitted model to new data (where Y is NOT known) to predict Y.

### Demonstration of Overfitting:

In [27]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# Set seed for reproducible results
np.random.seed(414)

# Gen toy data
X = np.linspace(0, 15, 1000)

In [24]:
y = 3 * np.sin(X) + np.random.normal(1 + X, .2, 1000)

In [25]:
train_X, train_y = X[:700], y[:700]

In [26]:
test_X, test_y = X[700:], y[700:]

In [20]:
train_df = pd.DataFrame({'X': train_X, 'y': train_y})
print(train_df)

             X         y
0     0.000000  1.015209
1     0.015015  0.980087
2     0.030030  1.169571
3     0.045045  1.269522
4     0.060060  1.258568
5     0.075075  1.451896
6     0.090090  1.222422
7     0.105105  1.301232
8     0.120120  1.459951
9     0.135135  1.236465
10    0.150150  1.416206
11    0.165165  1.712582
12    0.180180  1.674174
13    0.195195  1.481929
14    0.210210  1.789626
15    0.225225  1.702812
16    0.240240  1.907079
17    0.255255  2.017517
18    0.270270  2.184883
19    0.285285  1.975227
20    0.300300  2.270087
21    0.315315  2.271371
22    0.330330  2.238102
23    0.345345  2.097258
24    0.360360  2.191813
25    0.375375  2.312131
26    0.390390  2.929569
27    0.405405  2.387823
28    0.420420  2.774248
29    0.435435  2.873006
..         ...       ...
670  10.060060  9.189485
671  10.075075  8.839575
672  10.090090  9.301995
673  10.105105  9.089993
674  10.120120  9.122525
675  10.135135  9.437679
676  10.150150  9.290701
677  10.165165  9.264488


In [21]:
test_df = pd.DataFrame({'X': test_X, 'y': test_y})
print(test_df)

             X          y
0    10.510511   9.113557
1    10.525526   8.798065
2    10.540541   8.931348
3    10.555556   9.047734
4    10.570571   8.659095
5    10.585586   9.115196
6    10.600601   8.966861
7    10.615616   8.947426
8    10.630631   8.761166
9    10.645646   8.570319
10   10.660661   8.542211
11   10.675676   8.840849
12   10.690691   8.685590
13   10.705706   9.116769
14   10.720721   9.013263
15   10.735736   8.645331
16   10.750751   8.956652
17   10.765766   8.911172
18   10.780781   8.496090
19   10.795796   8.867917
20   10.810811   8.936554
21   10.825826   8.829840
22   10.840841   9.052438
23   10.855856   8.630678
24   10.870871   8.828351
25   10.885886   8.995500
26   10.900901   8.735839
27   10.915916   9.041912
28   10.930931   8.613124
29   10.945946   8.498318
..         ...        ...
270  14.564565  18.319723
271  14.579580  18.164795
272  14.594595  18.381468
273  14.609610  18.050854
274  14.624625  18.022420
275  14.639640  18.263179
276  14.6546

In [22]:
# Linear Fit
poly_1 = smf.ols(formula='y ~ 1 + X', data=train_df).fit()
poly_1.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | y | R-squared: | 0.642 |
| Model: | OLS | Adj. R-squared: | 0.642 |
| Method: | Least Squares | F-statistic: | 1254.0 |
| Date: | Wed, 30 Sep 2015 | Prob (F-statistic): | 5.52e-158 |
| Time: | 11:27:39 | Log-Likelihood: | -1483.4 |
| No. Observations: | 700 | AIC: | 2971.0 |
| Df Residuals: | 698 | BIC: | 2980.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|:--|--:|--:|--:|--:|:--|
| Intercept | 1.9959 | 0.152 | 13.104 | 0.000 | 1.697 2.295 |
| X | 0.8896 | 0.025 | 35.405 | 0.000 | 0.840 0.939 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 701.108 | Durbin-Watson: | 0.02 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52.98 |
| Skew: | -0.259 | Prob(JB): | 3.13e-12 |
| Kurtosis: | 1.756 | Cond. No. | 12.4 |


In [23]:
# Quadratic Fit
poly_2 = smf.ols(formula='y ~ 1 + X + I(X**2)', data=train_df).fit()
poly_2.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | y | R-squared: | 0.666 |
| Model: | OLS | Adj. R-squared: | 0.665 |
| Method: | Least Squares | F-statistic: | 694.4 |
| Date: | Wed, 30 Sep 2015 | Prob (F-statistic): | 1.25e-166 |
| Time: | 11:28:09 | Log-Likelihood: | -1459.6 |
| No. Observations: | 700 | AIC: | 2925.0 |
| Df Residuals: | 697 | BIC: | 2939.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|:--|--:|--:|--:|--:|:--|
| Intercept | 3.1458 | 0.221 | 14.261 | 0.000 | 2.713 3.579 |
| X | 0.2313 | 0.097 | 2.382 | 0.017 | 0.041 0.422 |
| I(X ** 2) | 0.0627 | 0.009 | 7.004 | 0.000 | 0.045 0.080 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 1210.467 | Durbin-Watson: | 0.022 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 49.911 |
| Skew: | -0.091 | Prob(JB): | 1.45e-11 |
| Kurtosis: | 1.705 | Cond. No. | 160.0 |
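The two summaries above report only how well each model fits the training data. To see overfitting, the training error has to be compared with the error on the held-out test points. The sketch below regenerates the same toy data and does that comparison. As an assumption to keep it dependent on NumPy only, `np.polyfit` stands in for `smf.ols`. Both solve the same least-squares problem for the linear and quadratic formulas, so the fitted coefficients should agree up to numerical precision.

```python
import numpy as np

# Recreate the toy data and the 700/300 split used above
np.random.seed(414)
X = np.linspace(0, 15, 1000)
y = 3 * np.sin(X) + np.random.normal(1 + X, .2, 1000)
train_X, train_y = X[:700], y[:700]
test_X, test_y = X[700:], y[700:]

def mse(actual, predicted):
    """Mean squared error between two arrays."""
    return float(np.mean((actual - predicted) ** 2))

results = {}
for degree in (1, 2):
    # Least-squares polynomial fit on the training set only
    coefs = np.polyfit(train_X, train_y, degree)
    results[degree] = (
        mse(train_y, np.polyval(coefs, train_X)),  # training error
        mse(test_y, np.polyval(coefs, test_X)),    # test error
    )
    print(f"degree {degree}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

A model that improves the training error while doing worse on the test set is overfitting. Note that the split here is not random: the test set is the last 30% of the X range, so the models are also being asked to extrapolate beyond the data they were trained on.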


### Metrics and Cross-Validation:

When building a supervised model, it's important to have a clear idea of how you will evaluate its performance. As we've discussed, one crucial part of evaluation is to keep your training and testing sets separate, so that an overfit model is not rewarded with an optimistic score. Once we have done this, it's time to decide exactly what metric we will use to evaluate our model. Metrics differ between regression and classification models:

#### Regression:

1. Mean Squared Error: The average of the squared differences between our predicted outcomes and the true responses in our test set.

2. Mean Absolute Error: The average of the absolute differences between our predicted outcomes and the true responses in our test set.

3. R-Squared: The coefficient of determination, 1 - SS_res / SS_tot, which compares the model's squared error with that of always predicting the mean response.
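The three regression metrics can be computed in a few lines. This is a minimal sketch assuming NumPy. The function name `regression_metrics` and the four data points are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R-squared for a set of predictions.

    Hypothetical helper written for this example.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean
    return {
        "mse": float(np.mean(residuals ** 2)),
        "mae": float(np.mean(np.abs(residuals))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 8.0, 9.5])
print(metrics)  # mse = 0.375, mae = 0.5, r2 = 0.925
```

Mean squared error penalizes the single large miss (7.0 predicted as 8.0) more heavily than mean absolute error does, which is the main practical difference between the two.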

#### Classification:

4. Accuracy: The percentage of data points labeled correctly by the model.

5. Precision: The ratio of (true_positives / (true_positives + false_positives)). The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

6. Recall: The ratio of (true_positives / (true_positives + false_negatives)). The recall is intuitively the ability of the classifier to find all the positive samples.
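A small sketch of the three classification metrics in plain Python, using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The function name `classification_metrics` and the ten labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for binary predictions.

    Hypothetical helper written for this example.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Guard against division by zero when there are no actual positives
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
scores = classification_metrics(y_true, y_pred)
print(scores)  # accuracy 0.7, precision 2/3, recall 0.5
```

Here the classifier is right 70% of the time overall, yet it finds only half of the actual positives, which shows why accuracy alone can hide poor recall.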

A typical workflow while building a model is to split your data, build your model, and then choose a metric to evaluate it. For instance, suppose we wanted to build a classifier to predict whether or not an individual has cancer. In this scenario, incorrectly predicting that someone who has cancer does not have it (a false negative) is far worse than predicting that a healthy person has cancer (a false positive). Thus, we want to minimize our false negatives, and might choose recall as our metric.
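To make the cancer example concrete, the sketch below assumes a classifier that outputs a score between 0 and 1 and shows how lowering the decision threshold trades precision for recall. The scores, labels, thresholds, and the helper `recall_and_precision` are all invented for illustration.

```python
def recall_and_precision(y_true, scores, threshold):
    """Label a case positive when its score reaches the threshold.

    Hypothetical helper written for this example.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# 1 = has cancer, 0 = healthy; scores are made-up model outputs
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

by_threshold = {t: recall_and_precision(y_true, scores, t) for t in (0.5, 0.25)}
for t, (recall, precision) in by_threshold.items():
    print(f"threshold {t}: recall = {recall:.2f}, precision = {precision:.2f}")
# threshold 0.5:  recall = 0.50, precision = 0.67
# threshold 0.25: recall = 1.00, precision = 0.57
```

With the lower threshold, no sick patient is missed, at the cost of more healthy people being flagged for follow-up. Choosing recall as the metric makes that trade-off explicit.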
