# Yelp Review Prediction
## 1.1 Background Information
The main goal of this project is to identify a small set of informative features and a model that accurately predicts review ratings. The training, validation, and test data are all drawn from about 1.5 million Yelp reviews.
## 1.2 Data Cleaning
- Expand Abbreviations and Replace Special Symbols
    - Before:  nâ€™t
    - After:  not


- Remove Non-English Reviews
- Negative Sentences
    - Before: They NEVER get my order right
    - After: they never notget notmy notorder notright


- Remove Punctuation
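
The negation step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the cue list `NEGATION_CUES`, the function name `mark_negation`, and the rule that negation ends at the next punctuation mark are all assumptions made for this example.

```python
import re

# Hypothetical cue list; the project's real list of negation words is not given.
NEGATION_CUES = {"not", "no", "never", "cannot"}
# Assumption: a negation's scope ends at the next punctuation mark.
CLAUSE_END = re.compile(r"[.,;:!?]")


def mark_negation(text: str) -> str:
    """Lowercase the text and prefix 'not' to each word after a negation cue,
    until the next punctuation mark."""
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    out = []
    negating = False
    for tok in tokens:
        if CLAUSE_END.fullmatch(tok):
            negating = False          # punctuation closes the negation scope
            out.append(tok)
        elif tok in NEGATION_CUES:
            negating = True           # the cue itself is kept unchanged
            out.append(tok)
        elif negating:
            out.append("not" + tok)   # e.g. "get" -> "notget"
        else:
            out.append(tok)
    return " ".join(out)


print(mark_negation("They NEVER get my order right"))
# they never notget notmy notorder notright
```

Running negation marking before punctuation removal lets punctuation serve as the scope boundary, which matches the order of the steps listed above.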

# 2 Model Description


![Learning Curve](../image/learning_curve.png)

## 2.1


![Embedding](../image/embedding.png)

## 2.2 Additional Variables

- **year**: scaled year variable.
- **loc1**: 1 if the restaurant is in the western United States, otherwise 0.
- **loc2**: 1 if the restaurant is in the eastern United States, otherwise 0.
- **loc3**: 1 if the restaurant isn't in the United States, otherwise 0.

![Year](../image/year.png)|  | ![World Map](../image/worldmap.png)
:- | :- | :- 


- **Score1 ~ Score5**: $\text{Score1}[\text{word}] = \frac{P(\text{word appears in a 1-star review})}{P(\text{word appears in a review with any other rating})}$. Score2 ~ Score5 are defined analogously for 2 ~ 5 stars.
- **S1 ~ S5**: S1[review] = number of words in the review with a high Score1. S2 ~ S5 are defined analogously.

| Word               | Variable    | 1-star | 2-star | 3-star | 4-star | 5-star |
| ------------------ |:-----------:| :-----:| :-----:| :-----:| :-----:| :-----:|
| **refund**         | Frequency   | 115    | 15     | 7      | 4      | 2      |
|                    | Probability | 0.011  | 0.002  | 0      | 0      | 0      |
|                    | Score       | 34.200 | 1.080  | 0.300  | 0.072  | 0.025  |
| **notdisappoints** | Frequency   | 0      | 2      | 5      | 43     | 110    |
|                    | Probability | 0      | 0      | 0      | 0.002  | 0.003  |
|                    | Score       | 0      | 0.116  | 0.188  | 0.917  | 3.870  |
| **and**            | Frequency   | 9196   | 8691   | 12851  | 25604  | 32071  |
|                    | Probability | 0.859  | 0.886  | 0.877  | 0.895  | 0.886  |
|                    | Score       | 0.968  | 1.000  | 0.991  | 1.020  | 1.000  |

Intuitively, **"refund"** is a negative word (you won't ask for a refund if you are satisfied with the restaurant), **"notdisappoints"** is a positive one, and **"and"** carries no information. If we considered only the probability of a word, we would mistakenly conclude that **"and"** is important and that **"refund"** and **"notdisappoints"** are useless, since their probabilities are close to 0. However, if we use **Score1 ~ Score5** to judge the sentiment of words, **"refund"** gets a high score for 1-star reviews and **"notdisappoints"** gets a high score for 5-star reviews, while **"and"** shows no preference (its scores are all close to 1).
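
To make the two definitions concrete, here is a small sketch of how the per-word score and the per-review count could be computed. It assumes reviews are already cleaned and whitespace-tokenized. The function names, the toy reviews, the threshold of 2.0, and the handling of words never seen in other classes are illustrative assumptions, not the project's actual choices.

```python
from collections import Counter


def word_scores(reviews, stars, target):
    """Score_target[word] = P(word in a target-star review)
                            / P(word in a review with any other rating)."""
    in_target, in_other = Counter(), Counter()
    n_target = n_other = 0
    for text, star in zip(reviews, stars):
        words = set(text.split())  # presence per review, not raw counts
        if star == target:
            n_target += 1
            in_target.update(words)
        else:
            n_other += 1
            in_other.update(words)
    if n_target == 0 or n_other == 0:
        raise ValueError("need at least one review inside and outside the target class")
    scores = {}
    for word in set(in_target) | set(in_other):
        p_target = in_target[word] / n_target
        p_other = in_other[word] / n_other
        # Assumption: a word never seen in other classes gets an infinite score;
        # the project may smooth or filter such words instead.
        scores[word] = p_target / p_other if p_other > 0 else float("inf")
    return scores


def count_high_score_words(text, scores, threshold):
    """S_target[review]: number of tokens whose score exceeds a (hypothetical) threshold."""
    return sum(1 for w in text.split() if scores.get(w, 0.0) > threshold)


# Toy data invented for illustration only.
reviews = ["refund and rude", "refund and slow", "great and tasty",
           "good and refund", "great and fast", "fine and good"]
stars = [1, 1, 5, 3, 5, 4]

scores = word_scores(reviews, stars, target=1)
print(scores["refund"], scores["and"], scores["great"])
print(count_high_score_words("refund and rude", scores, threshold=2.0))
```

In the toy data, "and" appears in every review of every rating, so its ratio is exactly 1, mirroring the behaviour of **"and"** in the table above.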

We treat words with a high Score1 value as negative and words with a high Score5 value as positive. The following two word clouds show the positive and negative words selected through Score5 and Score1.

![Positive](../image/dist5.png)|![White](../image/white.jpg) | ![Negative](../image/dist1.png)
:-: | :- | :-:


# 3 Model MSE Comparison

| Feature\Model       | LM     | NB     | NN     | LSTM       | GLM    | SVM    |
| ------------------- |:------:|:------:|:------:|:----------:|:------:|:------:|
| vector + additional | 0.673  | 0.974  | 0.494  | **0.493**  | 0.698  | NA     |
| vector              | 0.720  | 1.112  | 0.524  | 0.526      | 0.756  | 0.585  |
| additional          | 0.836  | 1.459  | 0.614  | 0.612      | 0.894  | NA     |
| frequency           | NA     | NA     | NA     | NA         | 0.864  | 0.790  |
| tf-idf              | NA     | NA     | 0.804  | NA         | 0.836  | 0.770  |

# 4 Interpretable Model


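$$y = 3.65 + 0.04\,\mathrm{scale}(\mathrm{year}) + 0.04\,\mathrm{loc1} + 0.06\,\mathrm{loc2} - 0.11\,\mathrm{S1} - 0.17\,\mathrm{S2} - 0.03\,\mathrm{S3} + 0.03\,\mathrm{S4} + 0.14\,\mathrm{S5}$$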
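
As a quick illustration of how this equation turns features into a predicted rating, the sketch below evaluates it for one review. The coefficients are copied from the equation. The function name `predict_rating`, the feature key names, and the example feature values are made up, and the sketch assumes S1 ~ S5 enter the model as raw counts.

```python
# Coefficients copied from the fitted equation above.
INTERCEPT = 3.65
COEFFICIENTS = {
    "year_scaled": 0.04,  # scale(year)
    "loc1": 0.04,
    "loc2": 0.06,
    "S1": -0.11,
    "S2": -0.17,
    "S3": -0.03,
    "S4": 0.03,
    "S5": 0.14,
}


def predict_rating(features):
    """Evaluate the linear model; features missing from the dict count as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical review: western US restaurant with three high-Score1 words.
example = {"loc1": 1, "S1": 3}
print(round(predict_rating(example), 2))  # 3.65 + 0.04 - 0.33 = 3.36
```

The signs read naturally: each additional strongly negative word (S1, S2) lowers the predicted rating, while each strongly positive word (S4, S5) raises it.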


| Line  | Review                               | True rating | Residual |
| ----- | ------------------------------------ |:-----------:|:--------:|
| 19033 | ...but this steakhouse was awful...  | 1           | 0.161    |
| 627   | one of the greatest hookah bar ...   | 5           | -0.677   |

# 5 Model Strengths and Weaknesses
**Strengths**
<br>Our choice of model and features produces robust and accurate predictions, and including the additional informative variables reduces the MSE of the best model (LSTM) by 0.033, from 0.526 to 0.493.
<br>**Weaknesses**
<br>We have done little grid search over model hyperparameters, so there is likely room to further optimize our results.

# 6 Reference
Wang, S. and Manning, C. D., 2012, *'Baselines and Bigrams: Simple, Good Sentiment and Topic Classification'*, ACL