# Regression Models
Regression predicts a continuous-valued attribute associated with an object.

Regression vs. classification:
1. The target variable is continuous in regression and discrete in classification.
2. The evaluation metrics differ.
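
The two differences can be seen on toy data. The sketch below is illustrative only: the arrays are made up and do not come from the forest fires data used later.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Regression: the target is continuous, so error is measured as a distance.
y_true_reg = np.array([0.0, 1.5, 3.2, 10.0])
y_pred_reg = np.array([0.5, 1.0, 3.0, 8.0])
mae = mean_absolute_error(y_true_reg, y_pred_reg)

# Classification: the target is discrete, so each prediction is right or wrong.
y_true_clf = np.array([0, 1, 1, 0])
y_pred_clf = np.array([0, 1, 0, 0])
acc = accuracy_score(y_true_clf, y_pred_clf)

print(f"MAE = {mae:.3f}, accuracy = {acc:.2f}")
```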

Reference papers for three regression models:
1. [SVR](http://www.svms.org/regression/SmSc98.pdf)
2. [Boosting Regression](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.314&rep=rep1&type=pdf)
3. [Random Forest](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf)

In [None]:
class sklearn.svm.SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0,
                      epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)

In [1]:
import pandas as pd 
import numpy as np 
df = pd.read_csv("forestfires.csv")

In [4]:
newDf = pd.get_dummies(data = df,columns=['month','day'])

In [6]:
xdata = newDf.drop('area',axis = 1)
ydata = newDf['area']

In [14]:
from sklearn.svm import SVR
svr = SVR(kernel ="linear")

In [8]:
svr.fit(xdata,ydata)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [10]:
svr.score(xdata,ydata)

-0.032413254922760926

In [12]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

In [13]:
xdata_ss = ss.fit_transform(xdata)

In [15]:
svr.fit(xdata_ss,ydata)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [16]:
svr.score(xdata_ss,ydata)

-0.032109399537486372

In [18]:
from sklearn.ensemble import GradientBoostingRegressor

In [20]:
gbr = GradientBoostingRegressor(learning_rate=0.01,n_estimators=200,subsample=0.7,max_depth=5)

In [21]:
gbr.fit(xdata,ydata)

GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.01, loss='ls',
             max_depth=5, max_features=None, max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=200,
             presort='auto', random_state=None, subsample=0.7, verbose=0,
             warm_start=False)

In [22]:
gbr.score(xdata,ydata)

0.8866365842788213

In [25]:
xdata.columns

Index(['X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain',
       'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
       'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
       'month_oct', 'month_sep', 'day_fri', 'day_mon', 'day_sat', 'day_sun',
       'day_thu', 'day_tue', 'day_wed'],
      dtype='object')

In [24]:
gbr.feature_importances_

array([  4.41360129e-02,   3.94356106e-02,   4.75935407e-02,
         1.54505802e-01,   3.58018321e-02,   9.22938387e-02,
         3.82758601e-01,   5.13449694e-02,   6.08557819e-02,
         2.32186595e-05,   3.09211797e-04,   3.69627013e-03,
         1.48685051e-04,   1.69273872e-04,   0.00000000e+00,
         8.90379516e-03,   6.26749916e-04,   1.39397354e-03,
         4.28218621e-04,   0.00000000e+00,   8.19459640e-05,
         3.91820110e-03,   2.87109662e-04,   4.49093905e-03,
         1.84765529e-02,   1.19510928e-02,   2.67285926e-02,
         4.80272631e-03,   4.83745434e-03])
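
The importances are easier to read when paired with their column names. The sketch below copies the two outputs above into plain lists and assumes the importances are in the same order as `xdata.columns`, which is the order in which the model saw the features. The variable names `ranked` and `top5` are introduced here for illustration.

```python
import pandas as pd

# Column names and importances copied from the two outputs above.
columns = [
    'X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain',
    'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
    'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
    'month_oct', 'month_sep', 'day_fri', 'day_mon', 'day_sat', 'day_sun',
    'day_thu', 'day_tue', 'day_wed',
]
importances = [
    4.41360129e-02, 3.94356106e-02, 4.75935407e-02,
    1.54505802e-01, 3.58018321e-02, 9.22938387e-02,
    3.82758601e-01, 5.13449694e-02, 6.08557819e-02,
    2.32186595e-05, 3.09211797e-04, 3.69627013e-03,
    1.48685051e-04, 1.69273872e-04, 0.00000000e+00,
    8.90379516e-03, 6.26749916e-04, 1.39397354e-03,
    4.28218621e-04, 0.00000000e+00, 8.19459640e-05,
    3.91820110e-03, 2.87109662e-04, 4.49093905e-03,
    1.84765529e-02, 1.19510928e-02, 2.67285926e-02,
    4.80272631e-03, 4.83745434e-03,
]

# Label each importance with its feature name and sort from high to low.
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
top5 = ranked.head(5)
print(top5)
```

With the fitted model in memory, the same ranking could be built directly from `gbr.feature_importances_` and `xdata.columns`.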

## Regression Metrics

#### Explained variance score
The explained_variance_score computes the explained variance regression score.

If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and Var is the variance (the square of the standard deviation), then the explained variance is estimated as follows:
![image1](http://scikit-learn.org/stable/_images/math/494cda4d8d05a44aa9aa20de549468e4d121e04c.png)

The best possible score is 1.0; lower values are worse.
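
As a rough check of the definition, the score can be computed by hand as one minus the variance of the residuals divided by the variance of the true values, then compared with the library function. The arrays below are made-up toy values.

```python
import numpy as np
from sklearn.metrics import explained_variance_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# 1 - Var(y - y_hat) / Var(y), using the population variance (ddof=0).
manual = 1.0 - np.var(y_true - y_pred) / np.var(y_true)
library = explained_variance_score(y_true, y_pred)

print(manual, library)
```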

#### Mean absolute error
The mean_absolute_error function computes mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or l1-norm loss.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_{\text{samples}}$ is defined as
![image2](http://scikit-learn.org/stable/_images/math/c38d771fb5eb121916c06cf8c651363583d17794.png)
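
A minimal sketch of the same calculation on made-up toy values: average the absolute differences, then compare with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Absolute errors are 0.5, 0.5, 0.0 and 1.0, so their mean is 0.5.
manual_mae = np.mean(np.abs(y_true - y_pred))
library_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, library_mae)
```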

#### Mean squared error
The mean_squared_error function computes the mean squared error, a risk metric corresponding to the expected value of the squared (quadratic) error loss.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_{\text{samples}}$ is defined as
![image](http://scikit-learn.org/stable/_images/math/44f36557fef9b30b077b21550490a1b9a0ade154.png)
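
The same kind of check works for the MSE: average the squared differences and compare with `mean_squared_error`. The values are toy data; note that the result is in squared units of the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so their mean is 0.375.
manual_mse = np.mean((y_true - y_pred) ** 2)
library_mse = mean_squared_error(y_true, y_pred)

print(manual_mse, library_mse)
```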

#### R² score, the coefficient of determination
The r2_score function computes R², the coefficient of determination. It measures how well future samples are likely to be predicted by the model. The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, gets an R² score of 0.0.

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the score R² estimated over $n_{\text{samples}}$ is defined as
![image3](http://scikit-learn.org/stable/_images/math/bdab7d608c772b3e382e2822a73ef557c80fbca2.png)

where ![image](http://scikit-learn.org/stable/_images/math/4b4e8ee0c1363ed7f781ed3a12073cfd169e3f79.png)
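
The three cases described above (a good model, the constant mean predictor, and a model worse than the mean) can be reproduced on toy data. The helper `r2_manual` is a hypothetical name used only for this sketch.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
good = np.array([2.5, 0.0, 2.0, 8.0])            # close to the truth
mean_only = np.full_like(y_true, y_true.mean())  # always predicts the mean
bad = np.array([7.0, 2.0, -0.5, 3.0])            # deliberately poor predictions

for name, pred in [("good", good), ("mean_only", mean_only), ("bad", bad)]:
    print(name, r2_manual(y_true, pred), r2_score(y_true, pred))
```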

#### Practical Regression Metrics

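Regression can also be scored with an accuracy-style metric borrowed from classification: choose the largest absolute error you can tolerate and report the fraction of predictions whose error falls within it.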

In [30]:
# errorList holds the absolute-error tolerances you are willing to accept
def errorForRegression(prediction, target, errorList):
    if not isinstance(errorList, list):
        raise TypeError("errorList should be a list")
    # absolute error of every prediction
    error = np.abs(prediction - target)
    # fraction of samples whose error is within each tolerance
    result = [np.sum(error <= x) / len(error) for x in errorList]
    return result
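
A small worked example may make the metric concrete. The sketch below restates the helper under the hypothetical name `error_for_regression` so that it runs on its own, and applies it to made-up numbers rather than the model's predictions.

```python
import numpy as np

def error_for_regression(prediction, target, error_list):
    """Fraction of predictions whose absolute error is within each tolerance."""
    error = np.abs(np.asarray(prediction) - np.asarray(target))
    return [float(np.mean(error <= tol)) for tol in error_list]

prediction = [1.0, 2.0, 3.0, 10.0]
target = [1.0, 4.0, 8.0, 10.0]

# The absolute errors are 0, 2, 5 and 0.
fractions = error_for_regression(prediction, target, [0, 2, 5])
print(fractions)  # [0.5, 0.75, 1.0]
```

Two of four predictions are exact, three are within 2, and all four are within 5.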
    

In [26]:
# Coding in here 

pred = gbr.predict(xdata)

In [31]:
# fraction of predictions within 3, 4 and 5 of the true area
result = errorForRegression(pred, ydata, [3, 4, 5])

In [32]:
result

## Pipeline
Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting to None.

In [None]:
# Standardization
# Feature selection
# PCA
# Logistic Regression

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
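
One way the four steps listed above could be chained is sketched here. The step names (`scale`, `select`, `pca`, `clf`), the parameter values, and the synthetic data are all assumptions made so the example runs on its own; they are not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, used only to make the sketch self-contained.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Intermediate steps are transformers; the last step is the estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression()),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
pipe.set_params(select__k=8, pca__n_components=3, clf__C=0.5)

pipe.fit(X, y)
train_accuracy = pipe.score(X, y)
print(train_accuracy)
```

The same `step__parameter` names can be used as keys in a parameter grid when the whole pipeline is cross-validated.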



## How to choose the right estimator
[Choosing the right estimator](http://scikit-learn.org/stable/tutorial/machine_learning_map/)
![image](http://scikit-learn.org/stable/_static/ml_map.png)