# REGRESSION AND CLASSIFICATION

## Regression: 
Regression measures the relationship between the mean value of one variable and the corresponding value(s) of one or more other variables. Unlike classification (discussed below), regression models continuous or ordered values.
In regression, we evaluate a model by how close its predicted value is to the actual value of each observation.
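
One common way to quantify "how close" is the mean squared error (MSE) and its square root (RMSE). The snippet below is a minimal sketch; the helper name `mean_squared_error` and the numbers are made up for illustration and are not part of the original notes.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical actual and predicted values, for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse = mean_squared_error(y_true, y_pred)  # squared errors are 1, 0 and 4
rmse = mse ** 0.5                         # same units as y
print(mse, rmse)
```

A smaller value means the predictions sit closer to the observations; a value of zero means every prediction is exact.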

## Classification:
Process of determining the category or class that an observation belongs to.
Classification can be binary (spam or not spam) or multiclass (which one of n classes does an observation belong to?).
Classification is a supervised learning technique: we need to have a dataset with correct class labels for training. When correct class labels are not available, the corresponding unsupervised learning technique we can use is clustering.

## Supervised Learning - Steps:

#### STEP 1:  Process and generate the initial data set for modeling. 
It will contain both the predictor variables (x1...xn) and the target variable y (y is called the target variable because the goal of supervised learning is to predict the missing value of y).

#### STEP 2: Split the data into training and testing sets for evaluation of our prediction model. 

Q: Why split? Why can't we evaluate our model on the training data itself? 
A: Because then we'd have an overly optimistic view of our model's performance, since the model was optimized directly on the training data.

The split should be performed so that both sets reflect the same patterns as the full data set. For example, if the initial data set has 100 observations, with the target variable being "Y" in 60 of them and "N" in 40, the training and testing sets should each contain Y and N in the same 3:2 ratio (this is known as stratified sampling).

Splitting is usually done through random selection.

Typically, the training set has 70% of the records in the initial data set.
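
The 60/40 example above can be sketched with plain NumPy. This is one possible implementation under stated assumptions: the function `stratified_split`, its parameters, and the seed are hypothetical names chosen for illustration, and the split is done separately within each class so the 3:2 ratio is preserved.

```python
import numpy as np

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) that preserve the class ratio of `labels`."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # positions of this class
        rng.shuffle(idx)                     # random selection within the class
        n_train = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# 100 observations: 60 "Y" and 40 "N", as in the example above
labels = np.array(["Y"] * 60 + ["N"] * 40)
train_idx, test_idx = stratified_split(labels)

print((labels[train_idx] == "Y").sum(), (labels[train_idx] == "N").sum())
print((labels[test_idx] == "Y").sum(), (labels[test_idx] == "N").sum())
```

With a 70% training fraction, the training set receives 42 "Y" and 28 "N" records and the testing set receives 18 "Y" and 12 "N", so both keep the 3:2 ratio.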

#### STEP 3: Build a model on the training data set.

Building a model (also called training or fitting) boils down to optimizing a set of parameters on the training set. For instance, in the case of linear regression, we assume a linear relationship between our features and our response variable, and learn the coefficients of the model from our training set. This occurs through minimizing our squared error loss function:

$$\mathrm{Loss}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - X_i \beta\right)^2$$

with respect to each $\beta$ parameter (i.e., taking the partial derivatives and setting them to zero).
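
Setting the partial derivatives of the squared error loss to zero leads to the normal equations, (X^T X) beta = X^T y. The sketch below solves them on made-up data and compares the result with NumPy's built-in least-squares routine. The data, the seed, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2 + 3*x plus a little Gaussian noise
x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=x.size)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same problem solved by NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimum, the gradient of the loss (proportional to X^T residuals) vanishes
gradient = X.T @ (y - X @ beta_normal)
print(beta_normal, beta_lstsq, gradient)
```

Both approaches should agree up to floating-point error. In practice, a least-squares routine is usually preferred over forming X^T X explicitly, because it is numerically more stable.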

Avoid overfitting, that is, learning the training data so closely that the model performs poorly when predicting new data.

#### STEP 4: Test the model using the test data set. 
We use the model to predict the value of Y for each of the test records, then compare the predicted value of Y with the actual value of Y. This gives an indication of the accuracy of the prediction model.
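
For a classification target, the comparison can be as simple as counting matches. The following is a minimal sketch; the helper `accuracy` and the label lists are hypothetical and stand in for a real model's predictions on the test set.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of test records whose predicted class equals the actual class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical actual and predicted labels for five test records
actual = ["Y", "N", "Y", "Y", "N"]
predicted = ["Y", "N", "N", "Y", "N"]

test_accuracy = accuracy(actual, predicted)  # 4 of the 5 predictions match
print(test_accuracy)
```

For a continuous target, an error measure such as the mean squared error plays the same role.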

#### STEP 5: Fine-tune the model using cross-validation.
Try different algorithms/parameters, until the accuracy reaches an acceptable level.
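
As a rough illustration of this step, the sketch below uses k-fold cross-validation to compare polynomial models of different degrees. It uses `np.polyfit` rather than the statsmodels API used later in these notes, and the helper `k_fold_mse`, the toy data, and the candidate degrees are all assumptions made for the example.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial of the given degree over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))   # shuffle before cutting into folds
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)  # fit on k-1 folds
        pred = np.polyval(coeffs, x[val_idx])                    # predict held-out fold
        errors.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(errors))

# Hypothetical training data: a noisy quadratic
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = 1.0 + 0.5 * x**2 + rng.normal(0.0, 0.3, size=x.size)

# Try several candidate degrees and keep the one with the lowest validation error
scores = {d: k_fold_mse(x, y, d) for d in (1, 2, 3)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Because each candidate is scored on data it was not fitted to, the comparison is less optimistic than one based on training error. The final check on the held-out test set should still happen only once, after the model has been chosen.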

#### STEP 6: Predict unknown data.
Apply the fitted model to new data (where Y is NOT known) to predict Y.

### Demonstration of Overfitting:

In [14]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# Set seed for reproducible results
np.random.seed(414)

# Generate toy data: 1000 evenly spaced points between 0 and 15
X = np.linspace(0, 15, 1000)
print(X)

[  0.           0.01501502   0.03003003   0.04504505   0.06006006
   0.07507508   0.09009009   0.10510511   0.12012012   0.13513514
   0.15015015   0.16516517   0.18018018   0.1951952    0.21021021
   0.22522523   0.24024024   0.25525526   0.27027027   0.28528529
   0.3003003    0.31531532   0.33033033   0.34534535   0.36036036
   0.37537538   0.39039039   0.40540541   0.42042042   0.43543544
   0.45045045   0.46546547   0.48048048   0.4954955    0.51051051
   0.52552553   0.54054054   0.55555556   0.57057057   0.58558559
   0.6006006    0.61561562   0.63063063   0.64564565   0.66066066
   0.67567568   0.69069069   0.70570571   0.72072072   0.73573574
   0.75075075   0.76576577   0.78078078   0.7957958    0.81081081
   0.82582583   0.84084084   0.85585586   0.87087087   0.88588589
   0.9009009    0.91591592   0.93093093   0.94594595   0.96096096
   0.97597598   0.99099099   1.00600601   1.02102102   1.03603604
   1.05105105   1.06606607   1.08108108   1.0960961    1.11111111
   1.12612

In [15]:
# Target: a sine wave plus Gaussian noise centered on the linear trend 1 + X (sd 0.2)
y = 3 * np.sin(X) + np.random.normal(1 + X, .2, 1000)
print(y)

[  1.01520936   0.98008704   1.16957092   1.26952224   1.2585685
   1.45189621   1.22242195   1.30123238   1.45995083   1.2364653
   1.41620578   1.71258217   1.67417434   1.48192944   1.78962584
   1.70281211   1.907079     2.01751679   2.1848829    1.97522696
   2.27008691   2.27137052   2.23810165   2.09725812   2.19181292
   2.31213097   2.92956864   2.38782317   2.77424833   2.8730056
   2.89318493   3.12798683   2.78281687   2.77490083   3.15287467
   3.08164434   2.8963036    3.05261644   3.14668547   3.43812556
   3.150099     3.53356016   3.26583005   3.26371572   3.47221439
   3.21361257   3.51723974   3.92010008   3.87391172   3.68324095
   3.8636242    3.61396205   3.83162179   3.83978501   3.57562442
   3.88597127   4.4217664    4.1467011    4.22793397   4.14681179
   4.34847966   4.27689894   4.1101172    4.14859791   4.09909015
   4.30477853   4.71687249   4.45833828   4.43732799   4.75443227
   4.62131963   4.51588032   4.78509598   4.78102217   4.87506689
   4.83685376

In [16]:
# Training set: the first 700 points (70% of the data)
train_X, train_y = X[:700], y[:700]
print(train_X)

[  0.           0.01501502   0.03003003   0.04504505   0.06006006
   0.07507508   0.09009009   0.10510511   0.12012012   0.13513514
   0.15015015   0.16516517   0.18018018   0.1951952    0.21021021
   0.22522523   0.24024024   0.25525526   0.27027027   0.28528529
   0.3003003    0.31531532   0.33033033   0.34534535   0.36036036
   0.37537538   0.39039039   0.40540541   0.42042042   0.43543544
   0.45045045   0.46546547   0.48048048   0.4954955    0.51051051
   0.52552553   0.54054054   0.55555556   0.57057057   0.58558559
   0.6006006    0.61561562   0.63063063   0.64564565   0.66066066
   0.67567568   0.69069069   0.70570571   0.72072072   0.73573574
   0.75075075   0.76576577   0.78078078   0.7957958    0.81081081
   0.82582583   0.84084084   0.85585586   0.87087087   0.88588589
   0.9009009    0.91591592   0.93093093   0.94594595   0.96096096
   0.97597598   0.99099099   1.00600601   1.02102102   1.03603604
   1.05105105   1.06606607   1.08108108   1.0960961    1.11111111
   1.12612

In [17]:
print(train_y)

[  1.01520936   0.98008704   1.16957092   1.26952224   1.2585685
   1.45189621   1.22242195   1.30123238   1.45995083   1.2364653
   1.41620578   1.71258217   1.67417434   1.48192944   1.78962584
   1.70281211   1.907079     2.01751679   2.1848829    1.97522696
   2.27008691   2.27137052   2.23810165   2.09725812   2.19181292
   2.31213097   2.92956864   2.38782317   2.77424833   2.8730056
   2.89318493   3.12798683   2.78281687   2.77490083   3.15287467
   3.08164434   2.8963036    3.05261644   3.14668547   3.43812556
   3.150099     3.53356016   3.26583005   3.26371572   3.47221439
   3.21361257   3.51723974   3.92010008   3.87391172   3.68324095
   3.8636242    3.61396205   3.83162179   3.83978501   3.57562442
   3.88597127   4.4217664    4.1467011    4.22793397   4.14681179
   4.34847966   4.27689894   4.1101172    4.14859791   4.09909015
   4.30477853   4.71687249   4.45833828   4.43732799   4.75443227
   4.62131963   4.51588032   4.78509598   4.78102217   4.87506689
   4.83685376

In [18]:
# Testing set: the remaining 300 points (note: an ordered split, not a random one)
test_X, test_y = X[700:], y[700:]
print(test_X)

[ 10.51051051  10.52552553  10.54054054  10.55555556  10.57057057
  10.58558559  10.6006006   10.61561562  10.63063063  10.64564565
  10.66066066  10.67567568  10.69069069  10.70570571  10.72072072
  10.73573574  10.75075075  10.76576577  10.78078078  10.7957958
  10.81081081  10.82582583  10.84084084  10.85585586  10.87087087
  10.88588589  10.9009009   10.91591592  10.93093093  10.94594595
  10.96096096  10.97597598  10.99099099  11.00600601  11.02102102
  11.03603604  11.05105105  11.06606607  11.08108108  11.0960961
  11.11111111  11.12612613  11.14114114  11.15615616  11.17117117
  11.18618619  11.2012012   11.21621622  11.23123123  11.24624625
  11.26126126  11.27627628  11.29129129  11.30630631  11.32132132
  11.33633634  11.35135135  11.36636637  11.38138138  11.3963964
  11.41141141  11.42642643  11.44144144  11.45645646  11.47147147
  11.48648649  11.5015015   11.51651652  11.53153153  11.54654655
  11.56156156  11.57657658  11.59159159  11.60660661  11.62162162
  11.63663664

In [19]:
print(test_y)

[  9.11355686   8.79806467   8.9313485    9.04773388   8.65909466
   9.11519643   8.96686064   8.94742599   8.76116573   8.57031857
   8.54221085   8.84084919   8.68558964   9.11676948   9.01326327
   8.64533076   8.95665168   8.91117216   8.49608993   8.86791681
   8.9365543    8.82984015   9.05243769   8.63067848   8.8283513
   8.99550006   8.73583869   9.04191238   8.61312444   8.49831794
   8.60425784   9.07859536   9.40423353   8.83186679   8.90617173
   9.03933547   9.14214397   9.28679145   9.4477728    9.08270999
   8.97105014   9.02831612   9.01271953   8.73272281   9.477346     9.2343554
   9.27133816   9.19043422   9.21479079   9.24396465   9.43891631
   9.04300428   9.42446634   9.61860665   9.55746811   9.6253561
   9.38501677   9.42592119   9.29303562   9.72044866   9.83152674
   9.83701568   9.540746     9.76983419   9.75756532  10.00231158
   9.46086691   9.79179735  10.29434853   9.69510326   9.85496034
  10.08673495  10.27936696  10.04872078  10.12713876  10.15839925


In [20]:
train_df = pd.DataFrame({'X': train_X, 'y': train_y})
print(train_df)

             X         y
0     0.000000  1.015209
1     0.015015  0.980087
2     0.030030  1.169571
3     0.045045  1.269522
4     0.060060  1.258568
5     0.075075  1.451896
6     0.090090  1.222422
7     0.105105  1.301232
8     0.120120  1.459951
9     0.135135  1.236465
10    0.150150  1.416206
11    0.165165  1.712582
12    0.180180  1.674174
13    0.195195  1.481929
14    0.210210  1.789626
15    0.225225  1.702812
16    0.240240  1.907079
17    0.255255  2.017517
18    0.270270  2.184883
19    0.285285  1.975227
20    0.300300  2.270087
21    0.315315  2.271371
22    0.330330  2.238102
23    0.345345  2.097258
24    0.360360  2.191813
25    0.375375  2.312131
26    0.390390  2.929569
27    0.405405  2.387823
28    0.420420  2.774248
29    0.435435  2.873006
..         ...       ...
670  10.060060  9.189485
671  10.075075  8.839575
672  10.090090  9.301995
673  10.105105  9.089993
674  10.120120  9.122525
675  10.135135  9.437679
676  10.150150  9.290701
677  10.165165  9.264488


In [21]:
test_df = pd.DataFrame({'X': test_X, 'y': test_y})
print(test_df)

             X          y
0    10.510511   9.113557
1    10.525526   8.798065
2    10.540541   8.931348
3    10.555556   9.047734
4    10.570571   8.659095
5    10.585586   9.115196
6    10.600601   8.966861
7    10.615616   8.947426
8    10.630631   8.761166
9    10.645646   8.570319
10   10.660661   8.542211
11   10.675676   8.840849
12   10.690691   8.685590
13   10.705706   9.116769
14   10.720721   9.013263
15   10.735736   8.645331
16   10.750751   8.956652
17   10.765766   8.911172
18   10.780781   8.496090
19   10.795796   8.867917
20   10.810811   8.936554
21   10.825826   8.829840
22   10.840841   9.052438
23   10.855856   8.630678
24   10.870871   8.828351
25   10.885886   8.995500
26   10.900901   8.735839
27   10.915916   9.041912
28   10.930931   8.613124
29   10.945946   8.498318
..         ...        ...
270  14.564565  18.319723
271  14.579580  18.164795
272  14.594595  18.381468
273  14.609610  18.050854
274  14.624625  18.022420
275  14.639640  18.263179
276  14.6546

In [22]:
# Linear Fit
poly_1 = smf.ols(formula='y ~ 1 + X', data=train_df).fit()
poly_1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.642 |
| Model: | OLS | Adj. R-squared: | 0.642 |
| Method: | Least Squares | F-statistic: | 1254.0 |
| Date: | Wed, 30 Sep 2015 | Prob (F-statistic): | 5.52e-158 |
| Time: | 11:27:39 | Log-Likelihood: | -1483.4 |
| No. Observations: | 700 | AIC: | 2971.0 |
| Df Residuals: | 698 | BIC: | 2980.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 1.9959 | 0.152 | 13.104 | 0.000 | 1.697 2.295 |
| X | 0.8896 | 0.025 | 35.405 | 0.000 | 0.840 0.939 |

| | | | |
|---|---|---|---|
| Omnibus: | 701.108 | Durbin-Watson: | 0.02 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52.98 |
| Skew: | -0.259 | Prob(JB): | 3.13e-12 |
| Kurtosis: | 1.756 | Cond. No. | 12.4 |


In [23]:
# Quadratic Fit
poly_2 = smf.ols(formula='y ~ 1 + X + I(X**2)', data=train_df).fit()
poly_2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.666 |
| Model: | OLS | Adj. R-squared: | 0.665 |
| Method: | Least Squares | F-statistic: | 694.4 |
| Date: | Wed, 30 Sep 2015 | Prob (F-statistic): | 1.25e-166 |
| Time: | 11:28:09 | Log-Likelihood: | -1459.6 |
| No. Observations: | 700 | AIC: | 2925.0 |
| Df Residuals: | 697 | BIC: | 2939.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 3.1458 | 0.221 | 14.261 | 0.000 | 2.713 3.579 |
| X | 0.2313 | 0.097 | 2.382 | 0.017 | 0.041 0.422 |
| I(X ** 2) | 0.0627 | 0.009 | 7.004 | 0.000 | 0.045 0.080 |

| | | | |
|---|---|---|---|
| Omnibus: | 1210.467 | Durbin-Watson: | 0.022 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 49.911 |
| Skew: | -0.091 | Prob(JB): | 1.45e-11 |
| Kurtosis: | 1.705 | Cond. No. | 160.0 |
