## Outline of Steps
* [step0](#step0): import necessary packages
* [step1](#step1): import `X_remaining_sub` and `target_variable` for model evaluation
* [step2](#step2): create a self-defined function for model evaluation
* [step3](#step3): take out one data sample and implement model trials
* [step4](#step4): import `X_test` and `y_test` and `svr_predict_transformed_score_test` for model justification
* [step5](#step5): model justification - compare the predicted value vs actual value
* [step6](#step6): model justification - compare the prediction power on lowest or highest points

<a id="step0"></a>
## step0: import necessary packages

In [2]:
# import necessary packages
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
import seaborn as sns
import matplotlib.pyplot as plt
#import missingno as msno # module for missing value visualization
from scipy import stats # implement box-cox transformation
from math import ceil
from sklearn.utils import shuffle # shuffling the dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.naive_bayes import MultinomialNB # for sentiment analysis benchmark model
from sklearn.model_selection import cross_val_score # cross validation score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR

from scipy.stats import uniform
from scipy.special import inv_boxcox # used to find out the inverse of box-cox transformation
from numpy import flatnonzero # return the index for nonzero value

# Pretty display for notebooks
%matplotlib inline
pd.options.display.max_columns = None # show up all column values in display

# suppress warning
import warnings
warnings.simplefilter("ignore")

# suppress scientific notation
np.set_printoptions(suppress=True)

<a id="step1"></a>
## step1: import `X_remaining_sub` and `target_variable` for model evaluation

In [2]:
X_remaining_sub = pd.read_pickle("X_remaining_sub.pickle")
target_variable = pd.read_pickle("target_variable.pickle")

<a id="step2"></a>
## step2: create a self-defined function for model evaluation
Here, I reuse the model evaluation approach from the course assignment **Predicting Boston Housing Prices**. The function used in that project, **PredictTrials**, builds the model on a different training dataset in each trial and always predicts on the same data point. Comparing the predicted outputs across all trials therefore lets us evaluate the stability and validity of the model.

In [163]:
# self-defined function for model evaluation:
# refit the tuned SVR on a different random training subset in each trial
# and predict the same held-out data point every time
def PredictTrials_SVR(X, y, trials, data_X):
    outputs = []
    inv_outputs = []
    for k in range(trials):
        # use random_state k to draw a different training subset in each trial
        # (test_size=0.95 keeps only 5% of the rows for training)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=k)

        # use the optimized parameters from previous modeling in part08
        svr_pipe = Pipeline([("scaler", MinMaxScaler()), ("svr", SVR(C=8.91, gamma=8.15))])

        model = svr_pipe.fit(X_train, y_train)
        # predict() returns a one-element array for the single sample
        predict = float(model.predict(data_X)[0])
        outputs.append(predict)

        # map the prediction back to the original Reviewer Score scale (lmbda 3.3)
        inv_predict = float(inv_boxcox(predict, 3.3))
        inv_outputs.append(inv_predict)

        print("Trial {} prediction: {:.2f}".format(k, predict))
        print("Trial {} prediction on original Reviewer Score scale: {:.2f}".format(k, inv_predict))

    # display the range of predicted transformed_score
    print("The range of predicted transformed_score: {:.2f}".format(max(outputs) - min(outputs)))
    print("The range of predicted score on Reviewer Score scale: {:.2f}".format(max(inv_outputs) - min(inv_outputs)))

    # return both lists so the caller can inspect the individual predictions
    return outputs, inv_outputs
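
The same trial idea can be tried on a toy problem without the hotel review data. The sketch below is only an illustration under stated assumptions: the synthetic data, the helper names `fit_line` and `predict_trials`, and the use of a simple least-squares line instead of SVR are all hypothetical and not part of the original notebook. It refits on a different 5% subset per trial and always predicts the same query point.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept on 1-D data."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_trials(xs, ys, query_x, trials=5, train_fraction=0.05):
    """Refit on a different random subset per trial; always predict query_x."""
    n_train = max(2, int(len(xs) * train_fraction))
    predictions = []
    for k in range(trials):
        rng = random.Random(k)  # the trial number seeds the subset, like random_state=k
        idx = rng.sample(range(len(xs)), n_train)
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(slope * query_x + intercept)
    return predictions

# Synthetic data (hypothetical): y = 2x + 1 plus Gaussian noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(1000)]
ys = [2 * x + 1 + data_rng.gauss(0, 0.5) for x in xs]

preds = predict_trials(xs, ys, query_x=5.0)
spread = max(preds) - min(preds)
print("predictions:", [round(p, 2) for p in preds])
print("range: {:.2f}".format(spread))
```

A small range relative to the typical prediction indicates that the fitted model depends little on which training subset was drawn, which is the property the function above checks for the SVR pipeline.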

<a id="step3"></a>
## step3: take out one data sample and implement model trials
In this model evaluation step, I first extract one data sample and exclude it from the training dataset. Then I run **5** trials to see the predicted value both as the **Box-Cox transformed_score** and as the **score on the original Reviewer Score scale**. The **PredictTrials_SVR** function also reports the range of the predicted scores across these trials. Last, I include the **actual transformed_score and its score on the original scale**, so that we can tell how accurate the model's predictions are.

**BRIEF RESULT:**

1. The trial results suggest that the model's predictions are quite consistent. The range (17.13) of the predicted Box-Cox transformed_score is only around **5.4%** of the average predicted transformed_score. On the original scale, the range (0.13) of the predicted score is only around **1.6%** of the average predicted score. Both figures suggest that the SVR model gives fairly consistent predictions that depend little on which training subset is used for model building.
2. Nonetheless, comparing the predictions to the **actual score (sample_y)** shows that the SVR model somewhat over-estimates this data point: the predictions are around 310 to 330 on the transformed scale and 8.2 to 8.3 on the original scale, whereas the actual value is about 234 on the transformed scale and 7.5 on the original scale.
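
The percentages in point 1 are the range of the trial predictions divided by their mean. A minimal sketch of that arithmetic, using the values from the trial output shown in this step (the helper name `relative_range` is an assumption introduced here for illustration):

```python
from statistics import fmean

# Values copied from the trial output printed in this step.
transformed = [315.57, 313.27, 329.59, 327.72, 312.46]
original = [8.21, 8.20, 8.32, 8.31, 8.19]

def relative_range(values):
    """Range (max - min) expressed as a fraction of the mean."""
    return (max(values) - min(values)) / fmean(values)

print("transformed scale: {:.1%}".format(relative_range(transformed)))
print("original scale: {:.1%}".format(relative_range(original)))
```

The ratio is much smaller on the original scale because the Box-Cox transform with a power above 1 stretches differences between high scores.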

In [164]:
# take out one data sample
sample_X = X_remaining_sub.iloc[10,:].to_frame().T
sample_y = target_variable.iloc[10]

In [165]:
# drop the data sample from the original dataset
# (pass the index itself; converting a one-element array with int() is deprecated in recent NumPy)
X = X_remaining_sub.drop(labels=sample_X.index, axis=0)
y = target_variable.drop(labels=sample_X.index, axis=0)

In [166]:
# implement model trials
predict_outputs = PredictTrials_SVR(X, y, trials=5, data_X= sample_X)

Trial 0 prediction: 315.57
Trial 0 prediction on original Reviewer Score scale: 8.21
Trial 1 prediction: 313.27
Trial 1 prediction on original Reviewer Score scale: 8.20
Trial 2 prediction: 329.59
Trial 2 prediction on original Reviewer Score scale: 8.32
Trial 3 prediction: 327.72
Trial 3 prediction on original Reviewer Score scale: 8.31
Trial 4 prediction: 312.46
Trial 4 prediction on original Reviewer Score scale: 8.19
The range of predicted transformed_score: 17.13
The range of predicted score on Reviewer Score scale: 0.13


In [167]:
# the actual value of sample_y
print("The actual value of sample_y: {:.2f}".format(sample_y))
print("The actual value of sample_y on original Reviewer Score scale: {:.2f}".format(float(inv_boxcox(sample_y, 3.3))))

The actual value of sample_y: 233.96
The actual value of sample_y on original Reviewer Score scale: 7.50


<a id="step4"></a>
## step4: import `X_test` and `y_test` and `svr_predict_transformed_score_test` for model justification

In [149]:
X_test = pd.read_pickle("X_test.pickle")
y_test = pd.read_pickle("y_test.pickle")
svr_predict_transformed_score_test = pd.read_pickle("svr_predict_transformed_score_test.pickle")

# synchronize the index for svr_predict_transformed_score_test to the rest
svr_predict_transformed_score_test = pd.Series(svr_predict_transformed_score_test.values, 
                                               index=y_test.index, 
                                               name="svr_predict_transformed_score_test")

<a id="step5"></a>
## step5: model justification - compare the predicted value vs actual value
**SPECIAL NOTICE:**<br/>
Because `y_test` is the Box-Cox transformed_score of `Reviewer_Score`, I need to reverse the transformation to get back to the original scale. I will use `lmbda` **3.3**, the value used earlier for the Box-Cox transformation in the *Part04_in_capstone_data_preprocessing_and_feature_engineering* Jupyter notebook (step3).
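
For reference, the Box-Cox transform with a nonzero `lmbda` is `(x**lmbda - 1) / lmbda`, and its inverse is `(lmbda * y + 1) ** (1 / lmbda)`, which is what `scipy.special.inv_boxcox` computes in that case. The sketch below writes both out by hand to show the round trip; the function names `boxcox_forward` and `boxcox_inverse` are hypothetical helpers, not part of the notebook.

```python
def boxcox_forward(x, lmbda):
    """Box-Cox transform for lmbda != 0: (x**lmbda - 1) / lmbda."""
    return (x ** lmbda - 1.0) / lmbda

def boxcox_inverse(y, lmbda):
    """Inverse for lmbda != 0: (lmbda * y + 1) ** (1 / lmbda)."""
    return (lmbda * y + 1.0) ** (1.0 / lmbda)

LMBDA = 3.3  # the value quoted above

score = 7.5
transformed = boxcox_forward(score, LMBDA)
recovered = boxcox_inverse(transformed, LMBDA)
print("score {:.2f} -> transformed {:.2f} -> recovered {:.2f}".format(
    score, transformed, recovered))
```

Using a different `lmbda` for the inverse than for the forward transform would silently distort every recovered score, which is why the same value has to be carried over from the preprocessing notebook.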

In [172]:
# have a look at the actual transformed_score
print("The actual transformed_score of y_test")
print(y_test[:10])

The actual transformed_score of y_test
74665     459.479808
510245    459.479808
502247    396.735455
428486    396.735455
420858    195.188936
123309    396.735455
325197    396.735455
330588    528.824767
294449    233.955126
93174     528.824767
Name: transformed_score, dtype: float64


In [173]:
# have a look at the predicted transformed_score
print("The predicted transformed_score from SVR model")
print(svr_predict_transformed_score_test[:10])

The predicted transformed_score from SVR model
74665     281.758571
510245    392.401504
502247    554.863699
428486    551.224720
420858     90.050321
123309    391.900867
325197    475.122006
330588    365.473248
294449    252.555632
93174     501.185548
Name: svr_predict_transformed_score_test, dtype: float64


In [174]:
# have a look at the actual Reviewer Score - reverse back using lmbda 3.3
inv_y_test = inv_boxcox(y_test, 3.3)
print("The actual Reviewer Score of y_test")
print(inv_y_test[:10])

The actual Reviewer Score of y_test
74665     9.203640
510245    9.203640
502247    8.803401
428486    8.803401
420858    7.102429
123309    8.803401
325197    8.803401
330588    9.603882
294449    7.502651
93174     9.603882
Name: transformed_score, dtype: float64


In [175]:
# have a look at the predicted score in Reviewer Score scale - reverse back using lmbda 3.3
inv_svr_predict_transformed_score_test = inv_boxcox(svr_predict_transformed_score_test, 3.3)
print("The predicted score in Reviewer Score scale")
print(inv_svr_predict_transformed_score_test[:10])

The predicted score in Reviewer Score scale
74665     7.936955
510245    8.774169
502247    9.744710
428486    9.725309
420858    5.621288
123309    8.770778
325197    9.297419
330588    8.587315
294449    7.678391
93174     9.449011
Name: svr_predict_transformed_score_test, dtype: float64


<a id="step6"></a>
## step6: model justification - compare the prediction power on lowest or highest points
To check performance on extreme points, I use scores of **3 and 7** as the cut-offs for low and high points respectively.

**BRIEF RESULT:**

The SVR model does not seem able to identify **low score** data points; it over-estimates every one of the examples shown by a wide margin. For data points with an actual Reviewer Score below 3, the SVR predictions stay at around 5 and go as high as 9.5.

On the other hand, the SVR predictions are fairly consistent with the data points that actually have high scores. For example, when the actual score drops to around 7, the SVR prediction follows the trend and drops as well (predicted score 5.6).
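
One way to put numbers on this comparison is to compute the average signed error and the mean absolute error per group. The sketch below does this for the first ten low-score and high-score points, with values copied from the outputs in this step; the helper names `mean_bias` and `mean_abs_error` are assumptions introduced for illustration, and ten points per group is only a small sample of the test set.

```python
from statistics import fmean

# First ten low-score and high-score test points, copied from the outputs in this step.
low_actual = [2.900461, 2.500324, 2.900461, 2.500324, 2.900461,
              2.900461, 2.900461, 2.900461, 2.500324, 2.500324]
low_pred = [5.203802, 6.612412, 9.532126, 7.982823, 8.837372,
            5.233020, 6.420626, 6.513173, 7.303907, 8.553344]
high_actual = [9.203640, 9.203640, 8.803401, 8.803401, 7.102429,
               8.803401, 8.803401, 9.603882, 7.502651, 9.603882]
high_pred = [7.936955, 8.774169, 9.744710, 9.725309, 5.621288,
             8.770778, 9.297419, 8.587315, 7.678391, 9.449011]

def mean_bias(actual, predicted):
    """Average signed error; a positive value means over-estimation."""
    return fmean([p - a for a, p in zip(actual, predicted)])

def mean_abs_error(actual, predicted):
    """Average absolute error, ignoring the direction of the miss."""
    return fmean([abs(p - a) for a, p in zip(actual, predicted)])

for name, actual, pred in [("low", low_actual, low_pred),
                           ("high", high_actual, high_pred)]:
    print("{:>4}: bias {:+.2f}, MAE {:.2f}".format(
        name, mean_bias(actual, pred), mean_abs_error(actual, pred)))
```

On these twenty points the low-score group is over-estimated by roughly 4.5 points on average, while the high-score group is off by less than 1 point on average.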

In [192]:
# subset data with score lower than 3
low_inv_y_test = inv_y_test[inv_y_test < 3]
print("Have a look at the data points with score lower than 3")
print(low_inv_y_test[:10])

Have a look at the data points with score lower than 3
112827    2.900461
245690    2.500324
143016    2.900461
398986    2.500324
78282     2.900461
324904    2.900461
69139     2.900461
66199     2.900461
137210    2.500324
384990    2.500324
Name: transformed_score, dtype: float64


In [193]:
# have a look at the inv_svr_predict_transformed_score_test on all these data points
print("The predicted score of these low score data points")
print(inv_svr_predict_transformed_score_test[low_inv_y_test.index[:10]])

The predicted score of these low score data points
112827    5.203802
245690    6.612412
143016    9.532126
398986    7.982823
78282     8.837372
324904    5.233020
69139     6.420626
66199     6.513173
137210    7.303907
384990    8.553344
Name: svr_predict_transformed_score_test, dtype: float64


In [194]:
# subset data with score higher than 7
high_inv_y_test = inv_y_test[inv_y_test > 7]
print("Have a look at the data points with score higher than 7")
print(high_inv_y_test[:10])

Have a look at the data points with score higher than 7
74665     9.203640
510245    9.203640
502247    8.803401
428486    8.803401
420858    7.102429
123309    8.803401
325197    8.803401
330588    9.603882
294449    7.502651
93174     9.603882
Name: transformed_score, dtype: float64


In [195]:
# have a look at the inv_svr_predict_transformed_score_test on all these data points
print("The predicted score of these high score data points")
print(inv_svr_predict_transformed_score_test[high_inv_y_test.index[:10]])

The predicted score of these high score data points
74665     7.936955
510245    8.774169
502247    9.744710
428486    9.725309
420858    5.621288
123309    8.770778
325197    9.297419
330588    8.587315
294449    7.678391
93174     9.449011
Name: svr_predict_transformed_score_test, dtype: float64
