# Checkpoint: Yelp Restaurant Reviews Sentiment and TF-IDF Result
Let's analyze the sentiment for Yelp reviews.

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image
from IPython.display import display_html
import os

In [None]:
df = pd.read_csv('../reference/dataframe/restaurant.csv', index_col=0)
positive_phrases = pd.read_csv('../reference/dataframe/positive.csv', header=0, names=['phrases', 'count'])
negative_phrases = pd.read_csv('../reference/dataframe/negative.csv', header=0, names=['phrases', 'count'])

## What are the most positive phrases?
Here are the most positive phrases in the given txt file, ranked by frequency.

In [None]:
positive_phrases[0:10].plot.barh(title='Top 10 Positive Phrases', x='phrases').invert_yaxis()

## What are the most negative phrases?
Here are the most negative phrases in the given txt file, ranked by frequency.

In [None]:
negative_phrases[0:10].plot.barh(title='Top 10 Negative Phrases', x='phrases').invert_yaxis()

In [None]:
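# Split each sentence on spaces, then replace the token list with its length (word count).
df = df.assign(tokens=df.sentence.str.split(' '))
df = df.assign(tokens=df.tokens.apply(len))
# Sum the compound sentiment scores and word counts of all rows that share a review index.
df_rev = df.groupby('index')[['compound', 'tokens']].sum()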

## What is the distribution of sentiment for these reviews?
This looks at how extreme the sentiment is across these reviews in general.

In [None]:
df_rev

In [None]:
plt.hist(df_rev.compound)
plt.title(label='Distribution of Sentiment')
plt.show()

## What is the distribution of review length?
Review length is the number of words written for a review.

In [None]:
plt.hist(df_rev.tokens)
plt.title(label='Distribution of Review Length')
plt.show()

## What is the relationship between sentiment and review length?

In [None]:
plt.scatter(x=df_rev['tokens'], y=df_rev['compound'])
plt.title("Scatterplot of Sentiment vs. Review Length")
plt.show()

## EDA of User Review

- The figure below shows the distribution of users in our test dataframe.
- In the test data, individual users do not have many reviews, and most users have only one review record.
- We then explore the effect of the number of reviews on our AutoPhrase analysis.

In [None]:
Image(filename='../reference/img/most_20_user.png', width = 1000, height = 5000)

The AutoPhrase results for three users in our test review set:
- We choose the user with the most reviews, the user with the second most reviews, and one random user from the test data.
- From the AutoPhrase results, we can observe that users with more reviews yield more significant phrases than users with fewer reviews.
- We therefore filter out users with fewer than 5 reviews in our later TF-IDF analysis to ensure the quality of our recommendation results.
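
The filtering step itself is not shown in this notebook. A minimal sketch of how it might look in pandas follows, assuming a review table with one row per review; the column names `user_id` and `text`, the toy data, and the names `MIN_REVIEWS` and `active_reviews` are all hypothetical.

```python
import pandas as pd

# Hypothetical review table: the column names and contents are assumptions,
# not the schema of the actual dataset.
reviews = pd.DataFrame({
    "user_id": ["u1"] * 6 + ["u2"] * 2 + ["u3"] * 5,
    "text": [f"review {i}" for i in range(13)],
})

MIN_REVIEWS = 5  # threshold from the text: drop users with fewer than 5 reviews

# Count reviews per user, then map each row's user to that count.
reviews_per_user = reviews["user_id"].value_counts()
row_counts = reviews["user_id"].map(reviews_per_user)

# Keep only the rows written by users with at least MIN_REVIEWS reviews.
active_reviews = reviews[row_counts >= MIN_REVIEWS]

print(sorted(active_reviews["user_id"].unique().tolist()))
```

In this toy table, `u2` has only two reviews and is dropped, while `u1` and `u3` are kept.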

In [None]:
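def display_side_by_side(*args):
    # Each argument is an iterable of DataFrames; render them all inline, side by side.
    html_str = ''
    for frames in args:
        for frame in frames:
            html_str += frame.to_html()
    # Patch only the opening tags so the closing </table> tags stay valid.
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)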

In [None]:
# Load every non-empty AutoPhrase result file (fixed-width columns: score, then phrase).
phrase_dfs = []
result_dir = '../reference/AutoPhrase_result'
for filename in os.listdir(result_dir):
    path = os.path.join(result_dir, filename)
    if os.stat(path).st_size != 0:
        phrase_dfs.append(pd.read_fwf(path, header=None, engine='python').rename(columns={0: 'score', 1: 'phrase'}))
display_side_by_side(phrase_dfs)

## TF-IDF Result

The following two dataframes show the target user we query for recommendations and the list of recommended users we generated.
- We can observe significant similarity between the recommended users and the target user.
- The recommendation most likely surfaces users who have been to the same restaurants and given reviews similar to the target user's.

In [None]:
user_index_df = pd.read_csv('../reference/dataframe/user_index.csv').drop(columns = ['Unnamed: 0'])
user_recommendation_df = pd.read_csv('../reference/dataframe/user_recommendation.csv').drop(columns = ['Unnamed: 0'])
display_side_by_side([user_index_df, user_recommendation_df])
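
This notebook loads precomputed recommendation lists rather than computing them. As a rough illustration of how TF-IDF similarity between users could produce such a ranking, the sketch below treats each user's reviews as one document, weights terms by length-normalized term frequency and a smoothed inverse document frequency, and ranks other users by cosine similarity to the target. The data and the names `user_docs`, `tfidf_vectors`, `cosine`, and `recommend` are hypothetical, and the actual pipeline may differ.

```python
import math
from collections import Counter

# Hypothetical data: one concatenated review text per user (IDs and text are made up).
user_docs = {
    "target": "great tacos friendly staff great salsa",
    "user_a": "great tacos and fresh salsa",
    "user_b": "slow service cold pizza",
}

def tfidf_vectors(docs):
    """Return {doc_id: {term: weight}} with normalized tf and smoothed idf."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(target_id, docs, top_n=2):
    """Rank every other document by similarity to the target, highest first."""
    vectors = tfidf_vectors(docs)
    target = vectors[target_id]
    scores = {
        doc_id: cosine(target, vec)
        for doc_id, vec in vectors.items()
        if doc_id != target_id
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

ranking = recommend("target", user_docs)
print(ranking)
```

Here `user_a` ranks first because it shares the terms "great", "tacos", and "salsa" with the target, while `user_b` shares no terms and scores zero. This matches the observation that users who describe the same places in similar words tend to be recommended.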

The following two dataframes show the target restaurant we query for recommendations and the list of recommended restaurants we generated.
- We can observe significant similarity between the recommended restaurants and the target restaurant.
- The recommendation most likely surfaces restaurants that have similar categories.
- In addition, user reviews of the target restaurant are treated as important evaluation features when making the recommendation.

In [None]:
rest_index_df = pd.read_csv('../reference/dataframe/rest_index.csv').drop(columns = ['Unnamed: 0'])
rest_recommendation_df = pd.read_csv('../reference/dataframe/restaurant_recommendation.csv').drop(columns = ['Unnamed: 0'])
display_side_by_side([rest_index_df, rest_recommendation_df])