## Outline of Steps
* [step0](#step0): import necessary packages
* [step1](#step1): import dataset `X_remaining50.pickle` as `X_remaining50`
* [step2](#step2): select the relevant predictors and target variable
* [step3](#step3): shuffle and sample the remaining dataset
* [step4](#step4): create Lasso model as benchmark model - using default parameter
* [step5](#step5): create Lasso model as benchmark model - using grid-search
* [step6](#step6): optimized Lasso model's performance on testing dataset 
* [step7](#step7): insights of coefficients from Lasso model
* [step8](#step8): create comparing model support vector regression - using default parameter
* [step9](#step9): create comparing model support vector regression - using grid-search
* [step10](#step10): optimized SVR model's performance on testing dataset
* [step11](#step11): choose better model and predict the rating for the testing dataset
* [step12](#step12): save the output dataset for later use
* [step13](#step13): appendix - increase the sample number in training dataset and re-fit Lasso model again
* [step14](#step14): appendix - try `DecisionTreeRegressor` for rate prediction model
* [step15](#step15): appendix - try `AdaBoosting` for rate prediction model
* [step16](#step16): appendix - try `multi-layer perceptron` for rate prediction model

<a id="step0"></a>
## step0: import necessary packages

In [1]:
# import necessary packages
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno # module for missing value visualization
from scipy import stats # implement box-cox transformation
from math import ceil
from sklearn.utils import shuffle # shuffling the dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.naive_bayes import MultinomialNB # for sentiment analysis benchmark model
from sklearn.model_selection import cross_val_score # cross validation score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR

from scipy.stats import uniform
from numpy import flatnonzero # return the index for nonzero value

# Pretty display for notebooks
%matplotlib inline
pd.options.display.max_columns = None # show up all column values in display

# suppress warning
import warnings
warnings.simplefilter("ignore")

# suppress scientific notation
np.set_printoptions(suppress=True)

<a id="step1"></a>
## step1: import dataset `X_remaining50.pickle` as `X_remaining50`

In [2]:
X_remaining50 = pd.read_pickle("X_remaining50.pickle")

<a id="step2"></a>
## step2: select the relevant predictors and target variable
1. The target variable is `transformed_score`.
2. Based on the correlation matrices shown in part3 and part4, the following explanatory features are worth trying, because each has a relatively high correlation with `transformed_score`:
    - a. `mlp_predict_review_sentiment`
    - b. `transformed_review_total_negative_word_counts`
    - c. `transformed_review_total_positive_word_counts`
    - d. `transformed_average_score`
    - e. `quarter_transformed_score`
    - f. `quarter_previous_transformed_score`
3. To apply `pd.get_dummies()`, I first replace the 0/1 values in `mlp_predict_review_sentiment` with strings.
4. Drop the rows containing NA, so that `MinMaxScaler()` does not fail later.
5. Create the dummy variables for the dataset.
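
The three preparation steps above (map to strings, drop NA, create dummies) can be sketched on a toy frame. This is a minimal illustration only: the `toy` DataFrame and its values are made up, and only two of the real columns are used.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "transformed_score": [120.5, None, 310.0, 95.2],
    "mlp_predict_review_sentiment": [1, 0, 1, 0],
})

# 1) map the 0/1 flag to strings so get_dummies treats it as categorical
toy["mlp_predict_review_sentiment"] = toy["mlp_predict_review_sentiment"].map(
    {0: "negative", 1: "positive"})

# 2) drop rows with missing values, then 3) one-hot encode the string column
toy = pd.get_dummies(toy.dropna())

print(toy.columns.tolist())
```

The string column is replaced by one `_negative` and one `_positive` column, which is why those two names show up later among the Lasso coefficients.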

In [3]:
# select the relevant target variable and predictors
cols = ["transformed_score",
        "mlp_predict_review_sentiment",
        "transformed_review_total_negative_word_counts",
        "transformed_review_total_positive_word_counts",
        "transformed_average_score",
        "quarter_transformed_score",
        "quarter_previous_transformed_score"]

# take an explicit copy so the in-place edits below act on an independent DataFrame
X_remaining_raw = X_remaining50[cols].copy()

In [4]:
# map 0/1 in mlp_predict_review_sentiment to negative/positive sentiment
value_replace = {0:"negative",
                 1:"positive"}

X_remaining_raw["mlp_predict_review_sentiment"] = X_remaining_raw["mlp_predict_review_sentiment"].map(value_replace)


In [5]:
# drop rows that contain NA
X_remaining_raw.dropna(inplace=True)

In [6]:
# convert categorical feature into dummies
X_remaining_raw = pd.get_dummies(X_remaining_raw)

<a id="step3"></a>
## step3: shuffle and sample the remaining dataset
1. In previous experiments, training a Support Vector Machine became very slow once the number of samples went far beyond 10,000. To keep training time manageable, I use only **5%** of the data (around *11,728* samples) as the training dataset for both the Lasso and SVR models.
2. This 5% serves as both the training and validation dataset (through cross-validation).
3. The remaining 95% serves as the testing dataset for the rating prediction model.
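
As a quick sanity check of how `test_size=0.95` divides the data, the sketch below runs `train_test_split` on a made-up array of 200 rows. The names `X_toy` and `y_toy` are hypothetical stand-ins for the real predictors and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(400).reshape(200, 2)   # 200 made-up rows, 2 features
y_toy = np.arange(200)

# test_size=0.95 keeps only 5% of the rows for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.95, random_state=20)

print(len(X_tr), len(X_te))
```

`train_test_split` shuffles by default, so no separate call to `shuffle` is needed before splitting.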

In [7]:
# separate target variable out - transformed_score
target_variable = X_remaining_raw.transformed_score

# drop out the target variable in X dataset
X_remaining_sub = X_remaining_raw.drop(["transformed_score"], axis=1)

# use the remaining 95% as the testing dataset 
X_train, X_test, y_train, y_test = train_test_split(X_remaining_sub, target_variable,
                                                    test_size = 0.95, random_state=20)

<a id="step4"></a>
## step4: create Lasso model as benchmark model - using default parameter
Default parameter for Lasso model:
* alpha = 1

In [8]:
# create a Lasso model with default argument
lasso_default = Lasso()

# define scoring function
scorer = make_scorer(r2_score, greater_is_better=True)

# compute cross-validation score with the R^2 scorer defined above
scores = cross_val_score(lasso_default, X_train, y_train, scoring=scorer, cv=5)

# print out the average score
print("average cross-validation R^2 score of default Lasso model: {:.4f}".format(scores.mean()))

average cross-validation R^2 score of default Lasso model: 0.3868


<a id="step5"></a>
## step5: create Lasso model as benchmark model - using grid-search
Two parameters are tuned in the grid search. The first is **alpha**: as alpha decreases, the regularization weakens and the model becomes more complex, moving toward overfitting. The second is **max_iter**, which usually has to increase as alpha gets smaller so that the solver can still converge.

Comparing the **R^2** between the default parameter and the optimized parameter:
1. before: 0.3868
2. after: 0.3870
3. the parameters being chosen by grid-search are:
    - alpha: 0.001
    - max_iter: 10000

**BRIEF RESULT:**
1. `GridSearchCV` chose a smaller alpha and a larger max_iter than the defaults, which suggests that the default model slightly underfits. The gain in R^2 is very small, though (0.3868 to 0.3870).
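
To illustrate how alpha controls model complexity in Lasso, here is a minimal sketch on synthetic data. The data, the true coefficients (3 and -2), and the alpha values are assumptions chosen for the illustration and are not taken from the hotel review dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))
# only the first two features carry signal; the other three are noise
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

nonzero_counts = {}
for alpha in [10.0, 1.0, 0.001]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_toy, y_toy)
    # count how many coefficients survive the L1 penalty
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

With a very large alpha every coefficient is shrunk to zero (an underfit model), while a small alpha lets more coefficients stay in the model.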

In [9]:
# create a Lasso model
lasso = Lasso()

# set up the parameter range for grid-search
param_grid = {"alpha":[80,50,20,15,10,5,3,2,1,0.5,0.1,0.001],
              "max_iter":[10000,5000,1000,500,100]}

scorer = make_scorer(r2_score, greater_is_better=True)

grid = GridSearchCV(lasso, param_grid=param_grid, scoring=scorer, cv=5)

grid.fit(X_train,y_train)

print("the best R^2 of all model parameters' combination on Lasso model: {:.4f}".format(grid.best_score_))

the best R^2 of all model parameters' combination on Lasso model: 0.3870


In [10]:
# the argument setting for best estimator
print("the parameter setting of optimized Lasso model")
grid.best_estimator_

the parameter setting of optimized Lasso model


Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=10000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

<a id="step6"></a>
## step6: optimized Lasso model's performance on testing dataset 

In [11]:
# the estimator's performance on the testing dataset
print("the R^2 of the optimized Lasso model on testing dataset:")
print("{:.4f}".format(grid.score(X_test, y_test)))

the R^2 of the optimized Lasso model on testing dataset:
0.3899


<a id="step7"></a>
## step7: insights of coefficients from Lasso model
Because Lasso can shrink the coefficients of unimportant explanatory variables to exactly 0, it is a convenient way to inspect which explanatory variables actually matter when predicting the target variable, `transformed_score`.

**NOTICE:** Because the target variable and the explanatory variables were Box-Cox transformed, the coefficients can only be used to judge the relative importance and direction of each variable. Their absolute values do not represent the effect on the original target variable.

**BRIEF RESULT:**
1. From the coefficients, **`transformed_review_total_positive_word_counts`** is strongly associated with higher scores.
2. In contrast, **`transformed_review_total_negative_word_counts`** and **`mlp_predict_review_sentiment_negative`** are strongly associated with lower scores.
3. One interesting finding is that `mlp_predict_review_sentiment_negative` is quite useful for identifying low scores, which supports the value of the text sentiment model. It is much less useful for identifying high scores, however: the coefficient of `mlp_predict_review_sentiment_positive` rounds to 0.

In [12]:
# filter out the nonzero coefficients
nonzero_index = flatnonzero(grid.best_estimator_.coef_) # return the index for nonzero coefficients
nonzero_feature_name = X_train.columns[nonzero_index] # index out the feature name
nonzero_coef = grid.best_estimator_.coef_[nonzero_index] # index out the coefficient's value

In [13]:
# display the coefficients as sorted dataframe
value = nonzero_coef[nonzero_coef.argsort()[::-1]]
value = np.round(value, 6) # avoid scientific notation
feature = nonzero_feature_name[nonzero_coef.argsort()[::-1]]
lasso_coefficient = pd.DataFrame({"feature":feature, "value":value})
display(lasso_coefficient)

|   | feature | value |
|---|---|---|
| 0 | transformed_review_total_positive_word_counts | 16.342751 |
| 1 | quarter_transformed_score | 0.684411 |
| 2 | transformed_average_score | 0.019337 |
| 3 | quarter_previous_transformed_score | 0.009739 |
| 4 | mlp_predict_review_sentiment_positive | 0.0 |
| 5 | transformed_review_total_negative_word_counts | -17.836231 |
| 6 | mlp_predict_review_sentiment_negative | -40.343272 |


<a id="step8"></a>
## step8: create comparing model support vector regression - using default parameter
Default parameter for SVR:
* C = 1
* gamma = 1/n_features ('auto')

**SPECIAL NOTICE**:
1. A Support Vector Machine needs all features on a similar scale. Since the dataset contains dummy variables (0/1), I use `MinMaxScaler()` to scale every numeric feature into the range [0, 1] as well.
2. To use the training and validation data properly during cross-validation, it is better to wrap the scaler and the model in a pipeline.
    - In each cross-validation fold, the pipeline fits the scaler on that fold's training data only, which avoids leaking information from the validation data.
3. Training time for a Support Vector Machine grows faster than quadratically with the number of samples, so very large training sets (on the order of 100,000 rows) are not recommended. Here I keep the training set at roughly 10,000 rows (about 11,728 in this notebook).
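
The leakage point in item 2 can be sketched as follows. The synthetic `X_toy` and `y_toy` are assumptions made for this illustration. The second half of the snippet mimics by hand what the pipeline does inside a single fold.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 100, size=(100, 3))
y_toy = X_toy[:, 0] + rng.normal(size=100)

# the scaler is re-fit on the training part of every fold
pipe = Pipeline([("scaler", MinMaxScaler()), ("svr", SVR(C=1, gamma="auto"))])
fold_scores = cross_val_score(pipe, X_toy, y_toy, cv=5)

# by hand, for one fold: fit on the training rows only, then transform the rest
scaler = MinMaxScaler().fit(X_toy[:80])
val_scaled = scaler.transform(X_toy[80:])

print(fold_scores.shape, val_scaled.min(), val_scaled.max())
```

Because the scaler never sees the validation rows, the scaled validation values may fall slightly outside [0, 1]. That is expected and is the price of avoiding leakage.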

In [14]:
# create a pipeline with default model parameter
svr_pipe_default = Pipeline([("scaler", MinMaxScaler()),("svr", SVR(C=1, gamma = 'auto'))])

# define scoring function
scorer = make_scorer(r2_score, greater_is_better=True)

# compute cross-validation score with the R^2 scorer defined above
scores = cross_val_score(svr_pipe_default, X_train, y_train, scoring=scorer, cv=5)

# print out the average score
print("average cross-validation R^2 score of default SVR model: {:.2f}".format(scores.mean()))


average cross-validation R^2 score of default SVR model: 0.27


<a id="step9"></a>
## step9: create comparing model support vector regression - using grid-search
1. To speed up training, I use `RandomizedSearchCV` this time instead of `GridSearchCV`. `RandomizedSearchCV` evaluates only a sample of the parameter combinations, which reduces training time.
2. `RandomizedSearchCV` accepts distributions for the parameters instead of hard-coded lists of values.
3. Two SVR parameters are tuned. The first is **C**, which controls regularization much like alpha in the Lasso model, but in the opposite direction: the regularization strength is inversely proportional to C (and the penalty is a squared L2 penalty rather than L1), so as C increases the model moves toward overfitting. The second is **gamma**, which sets how far the influence of each sample reaches. As gamma increases, each sample influences a shorter range, so the model also moves toward overfitting.
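
The effect of gamma described in item 3 follows from the RBF kernel, `exp(-gamma * ||x - x'||^2)`. The sketch below evaluates that expression for a few distances. The helper `rbf_similarity` and the chosen distances and gamma values are hypothetical, used only for illustration.

```python
import numpy as np

def rbf_similarity(distance, gamma):
    """RBF kernel value exp(-gamma * distance**2) for two points `distance` apart."""
    return np.exp(-gamma * distance ** 2)

distances = np.array([0.1, 0.5, 1.0])
for gamma in [0.1, 8.0]:
    # a larger gamma makes the similarity drop off much faster with distance
    print(gamma, np.round(rbf_similarity(distances, gamma), 4))
```

With a small gamma, points one unit apart are still considered very similar. With a gamma near the value chosen by the search, the similarity is almost zero at that distance, so each training sample only influences its close neighbourhood.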

Comparing the **R^2** between the default parameter and the optimized parameter:
1. before: 0.27
2. after: 0.41
3. the parameters chosen by the randomized search are:
    - C: 8.92
    - gamma: 8.16

**BRIEF RESULT:**
1. `RandomizedSearchCV` chose a larger C and a larger gamma than the defaults, which suggests that the default model underfits.
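
One detail worth checking when passing distributions to `RandomizedSearchCV`: `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, not from `[loc, scale]`. The short sketch below, with made-up sample sizes, shows the difference.

```python
from scipy.stats import uniform

c_dist = uniform(0, 10)    # loc=0, scale=10 -> values in [0, 10]
shifted = uniform(2, 3)    # loc=2, scale=3  -> values in [2, 5], not [2, 3]

c_samples = c_dist.rvs(size=1000, random_state=20)
shifted_samples = shifted.rvs(size=1000, random_state=20)

print(c_samples.min(), c_samples.max())
print(shifted_samples.min(), shifted_samples.max())
```

For `uniform(0, 10)` the two readings coincide, so the search range used for C and gamma in this step is indeed 0 to 10.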

In [15]:
# create a randomized pipeline
svr_pipe = Pipeline([("scaler", MinMaxScaler()),("svr", SVR())])


# set up the parameter distributions for the randomized search
param_grid = {"svr__C":uniform(0,10), # use distributions instead (only applicable in randomized search)
              "svr__gamma":uniform(0,10)}

scorer = make_scorer(r2_score, greater_is_better=True)

random_grid = RandomizedSearchCV(svr_pipe, param_distributions=param_grid, # use param_distributions
                                 scoring=scorer, cv=5, n_iter=8, random_state=20)

random_grid.fit(X_train,y_train)

print("the best R^2 of all model parameters' combination on SVR model: {:.2f}".format(random_grid.best_score_))

the best R^2 of all model parameters' combination on SVR model: 0.41


In [16]:
# the argument setting for best estimator
print("the parameter setting of optimized SVR model")
random_grid.best_estimator_

the parameter setting of optimized SVR model


Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svr', SVR(C=8.9153072947470804, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma=8.1583747730768401, kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False))])

<a id="step10"></a>
## step10: optimized SVR model's performance on testing dataset

In [17]:
# the estimator's performance on the testing dataset
print("the R^2 of the optimized SVR model on testing dataset:")
print("{:.4f}".format(random_grid.score(X_test, y_test)))

the R^2 of the optimized SVR model on testing dataset:
0.4127


<a id="step11"></a>
## step11: choose better model and predict the rating for the testing dataset
Using **R^2** as the criterion for choosing between the two models (*Lasso or SVR*), **SVR** comes out ahead by a relatively small margin (0.4127 versus 0.3899 on the testing dataset). I therefore use SVR as the final rating model and predict **transformed_score** for the testing dataset, so that the predictions can be compared with the true scores in the later discussion.

In [18]:
# use SVR to predict the transformed_score in testing dataset
svr_predict_transformed_score_test = random_grid.predict(X_test)

# have a look at the predicted transformed_score
svr_predict_transformed_score_test[:5]

array([ 281.75857095,  392.40150371,  554.86369912,  551.22471957,
         90.05032052])

<a id="step12"></a>
## step12: save the output dataset for later use
The datasets I will use for later discussion include:
1. `X_remaining_sub` and `target_variable`: for the model evaluation and validation task
2. `X_test`, `y_test` and `svr_predict_transformed_score_test`: for examining the SVR model's performance and for comparing `transformed_score`, `svr_predict_transformed_score_test` and the original `Reviewer_Score`.

In [19]:
# save the output dataset for later use
X_remaining_sub.to_pickle("X_remaining_sub.pickle")
target_variable.to_pickle("target_variable.pickle")
X_test.to_pickle("X_test.pickle")
y_test.to_pickle("y_test.pickle")
pd.Series(svr_predict_transformed_score_test).to_pickle("svr_predict_transformed_score_test.pickle")

<a id="step13"></a>
## step13: appendix - increase the sample number in training dataset and re-fit Lasso model again
1. Even when I train the Lasso model on a larger dataset, its R^2 still falls short of the SVR model, which was trained on far fewer samples.
2. Nevertheless, neither model performs well against the benchmark originally set.
3. The Lasso model reaches only about **0.39 R^2** and the SVR model only **0.41 R^2**. Both are below the **0.5 R^2** benchmark.

In [20]:
# increase samples in training dataset - model training version 2
# separate target variable out - transformed_score
target_variable_v2 = X_remaining_raw.transformed_score

# drop out the target variable in X dataset
X_remaining_sub_v2 = X_remaining_raw.drop(["transformed_score"], axis=1)

# hold out 25% as the testing dataset, so 75% is used for training
X_train_v2, X_test_v2, y_train_v2, y_test_v2 = train_test_split(X_remaining_sub_v2, target_variable_v2,
                                                                test_size = 0.25, random_state=20)

In [21]:
# create a grid-searched Lasso model - version 2
lasso = Lasso()

# set up the parameter range for grid-search
param_grid = {"alpha":[80,50,20,15,10,5,3,2,1,0.5,0.1,0.001],
              "max_iter":[10000,5000,1000,500,100]}

scorer = make_scorer(r2_score, greater_is_better=True)

grid = GridSearchCV(lasso, param_grid=param_grid, scoring=scorer, cv=5)

grid.fit(X_train_v2,y_train_v2)

print("the best R^2 of all model parameters' combination on Lasso model: {:.4f}".format(grid.best_score_))

the best R^2 of all model parameters' combination on Lasso model: 0.3898


In [22]:
# the argument setting for best estimator - version 2
print("the parameter setting of optimized Lasso model")
grid.best_estimator_

the parameter setting of optimized Lasso model


Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=10000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [23]:
# the estimator's performance on the testing dataset - version 2
print("the R^2 of the optimized Lasso model on testing dataset:")
print("{:.4f}".format(grid.score(X_test_v2, y_test_v2)))

the R^2 of the optimized Lasso model on testing dataset:
0.3907


<a id="step14"></a>
## step14: appendix - try `DecisionTreeRegressor` for rate prediction model
**REF:**<br/>
[DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

In [24]:
# separate target variable out - transformed_score
target_variable = X_remaining_raw.transformed_score

# drop out the target variable in X dataset
X_remaining_sub = X_remaining_raw.drop(["transformed_score"], axis=1)

# use the remaining 95% as the testing dataset 
X_train, X_test, y_train, y_test = train_test_split(X_remaining_sub, target_variable,
                                                    test_size = 0.95, random_state=20)

In [25]:
# create a DecisionTreeRegressor model
from sklearn.tree import DecisionTreeRegressor

DecisionTR = DecisionTreeRegressor()

# set up the parameter range for grid-search
param_grid = {"max_depth":[50,20,10,5,3,2,1],
              "random_state":[10]} # other options include: max_features, min_samples_leaf, max_leaf_nodes

scorer = make_scorer(r2_score, greater_is_better=True)

grid = GridSearchCV(DecisionTR, param_grid=param_grid, scoring=scorer, cv=5)

grid.fit(X_train,y_train)

print("the best R^2 of all model parameters' combination on DecisionTreeRegressor model: {:.4f}".format(grid.best_score_))

the best R^2 of all model parameters' combination on DecisionTreeRegressor model: 0.3782


In [26]:
# the argument setting for best estimator
print("the parameter setting of optimized DecisionTreeRegressor model")
print(grid.best_estimator_)
print("\n")

# the estimator's performance on the testing dataset
print("the R^2 of the optimized DecisionTreeRegressor model on testing dataset:")
print("{:.4f}".format(grid.score(X_test, y_test)))

the parameter setting of optimized DecisionTreeRegressor model
DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=10, splitter='best')


the R^2 of the optimized DecisionTreeRegressor model on testing dataset:
0.3740


<a id="step15"></a>
## step15: appendix - try `AdaBoosting` for rate prediction model
**REF:**<br/>
[AdaBoostRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html)<br/>
[Decision Tree Regression with AdaBoost](http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html#sphx-glr-auto-examples-ensemble-plot-adaboost-regression-py)<br/>
[Adaboost Classifier](https://chrisalbon.com/machine_learning/trees_and_forests/adaboost_classifier/)<br/>
[Ensemble Machine Learning Algorithms in Python with scikit-learn](https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/)

In [27]:
# create an AdaBoostRegressor model
from sklearn.ensemble import AdaBoostRegressor

AdaBoost = AdaBoostRegressor()

# set up the parameter range for grid-search
param_grid = {"n_estimators":[100,80,50,20,10,5,3,2,1],
              "learning_rate":[0.5,0.3,0.2,0.1,0.05,0.01],
              "random_state":[10]}

scorer = make_scorer(r2_score, greater_is_better=True)

grid = GridSearchCV(AdaBoost, param_grid=param_grid, scoring=scorer, cv=5)

grid.fit(X_train,y_train)

print("the best R^2 of all model parameters' combination on AdaBoostRegressor model: {:.4f}".format(grid.best_score_))

the best R^2 of all model parameters' combination on AdaBoostRegressor model: 0.3704


In [28]:
# the argument setting for best estimator
print("the parameter setting of optimized AdaBoostRegressor model")
print(grid.best_estimator_)
print("\n")

# the estimator's performance on the testing dataset
print("the R^2 of the optimized AdaBoostRegressor model on testing dataset:")
print("{:.4f}".format(grid.score(X_test, y_test)))

the parameter setting of optimized AdaBoostRegressor model
AdaBoostRegressor(base_estimator=None, learning_rate=0.2, loss='linear',
         n_estimators=20, random_state=10)


the R^2 of the optimized AdaBoostRegressor model on testing dataset:
0.3670


<a id="step16"></a>
## step16: appendix - try `multi-layer perceptron` for rate prediction model

In [29]:
from keras.utils import np_utils # encode categorical variable
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.callbacks import ModelCheckpoint, EarlyStopping 

Using TensorFlow backend.


REF:[custom R^2 score for keras](https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/34019) 

In [30]:
# custom R2-score metrics for keras backend
from keras import backend as K

def r2_keras(y_true, y_pred):
    SS_res =  K.sum(K.square(y_true - y_pred)) 
    SS_tot = K.sum(K.square(y_true - K.mean(y_true))) 
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )
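
The custom metric above implements `R^2 = 1 - SS_res / SS_tot`. As a sanity check of the formula itself, the sketch below reproduces it in NumPy and compares it with `sklearn.metrics.r2_score` on made-up values. The helper name `r2_numpy` is hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_numpy(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

print(r2_numpy(y_true, y_pred), r2_score(y_true, y_pred))
```

Keras evaluates metrics batch by batch and averages them, so the `r2_keras` values in the training log are likely to differ somewhat from an R^2 computed once over the whole dataset.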

In [31]:
# Building the model architecture
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))
model.summary()

# Compiling the model using mean_squared_error loss and the rmsprop optimizer.
model.compile(loss='mean_squared_error',
              optimizer='rmsprop',
              metrics=['mae',r2_keras])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               4096      
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               65664     
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_4 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 33        
Total params: 80,129
Trainable params: 80,129
Non-trainable params: 0
_________________________________________________________________


In [32]:
# normalize input data
mean = X_train.mean(axis=0)
X_train = X_train - mean
std = X_train.std(axis=0)
X_train = X_train/std

In [33]:
X_test = X_test - mean
X_test = X_test/std 
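
The two cells above standardize the inputs with statistics taken from the training data only and then reuse them on the test data. A minimal sketch of the same idea on made-up frames (`train` and `test` are hypothetical) is:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
train = pd.DataFrame(rng.normal(50, 10, size=(100, 2)), columns=["a", "b"])
test = pd.DataFrame(rng.normal(50, 10, size=(20, 2)), columns=["a", "b"])

mean = train.mean(axis=0)
std = train.std(axis=0)             # pandas uses the sample std (ddof=1)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.mean().round(6).tolist(), train_scaled.std().round(6).tolist())
```

The training columns end up with mean 0 and standard deviation 1. The test columns are only approximately standardized, because their own statistics are never used.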

In [34]:
# Running and evaluating the model

checkpointer = ModelCheckpoint(filepath='rate.model.best.hdf5', 
                               verbose=1, save_best_only=True)

earlystop = EarlyStopping(patience=2)

hist = model.fit(X_train.values, y_train.values,
          batch_size=50,
          epochs=20,
          validation_split=0.25,
          callbacks=[checkpointer, earlystop],
          verbose=2,
          shuffle=True)

Train on 8796 samples, validate on 2932 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 21623.96487, saving model to rate.model.best.hdf5
1s - loss: 44850.7644 - mean_absolute_error: 164.9602 - r2_keras: -3.8778e-01 - val_loss: 21623.9649 - val_mean_absolute_error: 119.9491 - val_r2_keras: 0.3093
Epoch 2/20
Epoch 00001: val_loss improved from 21623.96487 to 20046.13426, saving model to rate.model.best.hdf5
1s - loss: 21537.4712 - mean_absolute_error: 119.9884 - r2_keras: 0.3451 - val_loss: 20046.1343 - val_mean_absolute_error: 116.8699 - val_r2_keras: 0.3574
Epoch 3/20
Epoch 00002: val_loss improved from 20046.13426 to 19074.28822, saving model to rate.model.best.hdf5
1s - loss: 20558.0097 - mean_absolute_error: 116.4715 - r2_keras: 0.3746 - val_loss: 19074.2882 - val_mean_absolute_error: 113.4209 - val_r2_keras: 0.3909
Epoch 4/20
Epoch 00003: val_loss did not improve
0s - loss: 20344.6784 - mean_absolute_error: 115.8818 - r2_keras: 0.3865 - val_loss: 19285.6775 - val_mea

In [35]:
# make prediction
mlp_rate_predict = model.predict(X_test.values)

In [36]:
# compare the predict value to the actual value
from scipy.special import inv_boxcox
inv_mlp_rate_predict = inv_boxcox(mlp_rate_predict, 3.3)
inv_y_test = inv_boxcox(y_test.values, 3.3)

In [37]:
inv_mlp_rate_predict[25:30]

array([[ 9.33035088],
       [ 8.4790411 ],
       [ 9.44911098],
       [ 8.59064198],
       [ 8.73257637]], dtype=float32)

In [38]:
inv_y_test[25:30]

array([  7.50265102,   4.6011632 ,  10.00412818,   6.70221246,   7.50265102])
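
The comparison above relies on `inv_boxcox` undoing the Box-Cox transformation with lambda 3.3. The sketch below shows the round trip on made-up scores. The `scores` array is hypothetical, and the lambda value is the one used in the cell above.

```python
import numpy as np
from scipy.special import boxcox, inv_boxcox

lmbda = 3.3                                  # lambda used in the cell above
scores = np.array([2.5, 5.0, 7.5, 10.0])     # made-up review scores

transformed = boxcox(scores, lmbda)          # (x**lmbda - 1) / lmbda
recovered = inv_boxcox(transformed, lmbda)   # (y*lmbda + 1) ** (1/lmbda)

print(np.round(transformed, 2), recovered)
```

Because the transformation is strongly non-linear for this lambda, equal differences on the transformed scale do not correspond to equal differences in the original score, which is worth keeping in mind when reading the predictions.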

In [None]:
# footnotes
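# Notes on SVM (for regression problems, use SVR)
# SVM works best when every feature has the same scale, so the features need to be rescaled; a common choice is to scale values into the range 0 to 1 (MinMaxScaler)
# In SVM there are two main ways to raise the dimensionality of the features: the polynomial kernel or the radial basis function (RBF) kernel
# The parameters that can be fine-tuned are C and gamma
# C plays a role like the L2 and L1 regularization discussed for ridge and lasso
# gamma is the range that each data point influences; the smaller gamma is, the wider each point's influence, which generalizes more and tends toward underfitting
# The larger C is, the more complex the model becomes, tending toward overfitting
# Dummy variables can also be used in SVM

# Ridge uses the alpha parameter: the larger alpha is, the closer the coefficients get to 0; a smaller alpha gives a more complex model that tends toward overfitting
# In lasso, note that when alpha is decreased, max_iter (the maximum number of iterations) has to be increased at the same time
# Also in lasso, decreasing alpha moves the model toward overfitting
# In practice, ridge is usually tried first; lasso is used only when there are many features and they need to be pruned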