# Final Explainer Notebook

This notebook was made as part of the final project for 02806 Social Data Analysis and Visualization at the Technical University of Denmark.

This notebook contains all the code used for the website: https://restaurant-guide.github.io/

Please note that this notebook is rendered only with data from 2019-2021, not the full dataset, because of GitHub storage limitations.

## 1. Motivation
### 1.1 Purpose 

The aim of this final project for Social Data Analysis and Visualization is to investigate restaurants and their reviews using a dataset from Yelp; the company is described in the next subsection. Nowadays, most people look a restaurant up on Google before visiting it, checking its ratings and comments. This makes fake reviews a hot topic, since business owners can use them to promote their business falsely. The idea of this project is to find the best restaurant for a chosen restaurant type, kitchen, price range and review score, with fake reviews removed from the dataset. The user can then find the perfect restaurant through an interactive application and also see how the fake reviews are detected.






### 1.2 Data
The dataset considered for this project is from Yelp, an American company founded in 2004. The idea behind the company is to publish crowd-sourced reviews of businesses through its website Yelp.com and its mobile app. In the second quarter of 2019, Yelp had an average of 61.8 million unique visitors on its website and 76.7 million unique visitors in the app. As the internet fills with more and more information, some of which is likely to be false, Yelp aims to provide a local platform where people can discover trustworthy information. The Yelp dataset covers many kinds of businesses, not only restaurants, which are the focus of this project. Restaurants that are not on Yelp are not considered in this project.

Overall, the Yelp dataset is a subset of Yelp's businesses, reviews and users, covering 11 metropolitan areas. The full dataset contains the following:
* 8,635,403 reviews
* 160,585 businesses
* 200,000 pictures
* 2,189,457 unique users

The dataset spans the period from 14 October 2004 to 28 January 2021.

For this project, three of Yelp's data files are used. The main file contains all the businesses along with attributes such as opening hours, price range and category. A second file contains all the reviews and the time each review was given. To detect anomalies in the data, the file describing the users who wrote the reviews is also used. To sum up, the following data files were used throughout this project:
* yelp_academic_dataset_business.json (121,466 KB)
* yelp_academic_dataset_review.json (6,774,100 KB)
* yelp_academic_dataset_user.json (3,598,150 KB)

See reference [1] for the data. 
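
Files of this size are easier to handle when read record by record. The sketch below assumes the files are line-delimited JSON (one object per line); the helper name `iter_json_lines` and its `keep` argument are hypothetical and not part of the original data preparation.

```python
import json

def iter_json_lines(path, keep=None):
    """Yield one dict per line of a line-delimited JSON file.

    keep: optional list of keys to retain (hypothetical convenience argument).
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield record if keep is None else {k: record.get(k) for k in keep}
```

Passing such a generator to `pd.DataFrame(...)` with a short `keep` list would keep only the needed columns in memory.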

## 2. Basic stats. Let's understand the dataset better

Before analysing the dataset, it is key to compute some basic statistics on it. Before that, some data cleaning and preprocessing are done.

### 2.1 Data cleaning & preprocessing

As mentioned in the section about the dataset, the dataset contains many kinds of businesses. Businesses that do not have the word *Restaurant* in the `categories` column are not considered. Restaurants that are closed have also been removed, because it does not make sense to advise the user to go to a closed restaurant.
To get some useful information from the `categories` column, two new variables are created. The first describes the kitchen type of the restaurant, whereas the second describes the type of restaurant, e.g. whether it is a pizzeria. The lists of words for the two variables are presented below.

* Kitchen type: Thai, Chinese, Japanese, Korean, Indian, American, Caribbean, Italian, Mediterranean, Mexican, Cajun, Vietnamese and Greek.
* Type: Food, nightlife, bars, sandwiches, pizza, breakfast & brunch, fast food, burgers, salad, buffet, cafes, coffee & tea, vegetarian, steakhouse, sushi bars, diners and wine bars. 
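
The derivation of the two variables can be sketched as a keyword lookup on the comma-separated `categories` string. This is only an illustration: the function names `first_match` and `is_restaurant`, the `'Other'` fallback and the rule of taking the first keyword that matches are assumptions, and the keyword list is shortened.

```python
KITCHEN_WORDS = ["Thai", "Chinese", "Japanese", "Italian", "Mexican"]  # shortened list

def first_match(categories, keywords, default="Other"):
    """Return the first keyword that appears among the comma-separated categories."""
    present = {part.strip().lower() for part in categories.split(",")}
    for word in keywords:
        if word.lower() in present:
            return word
    return default  # assumed fallback when no keyword matches

def is_restaurant(categories):
    """True when the word 'Restaurant' occurs anywhere in the categories string."""
    return "restaurant" in categories.lower()

example = "Restaurants, Pizza, Italian"
kitchen = first_match(example, KITCHEN_WORDS)
```

The same `first_match` helper could be reused with the list of type words to fill the second variable.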

The data from Yelp also give the state each restaurant is in, but only as an abbreviation. Hence the full state name is added and, to plot the states on a map later on, each state's latitude and longitude are also inserted.

Merging the three datasets gives a shape of `(6588460, 27)`. To make the data more manageable, a shorter time period is considered, namely 1 January 2014 to 28 January 2021, which gives a shape of `(5177322, 27)`. Remember that these numbers cannot be compared with the shapes below, since this notebook only covers around two years.
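
The merge and the date restriction can be sketched with pandas as follows. The tiny frames are invented toy data, and the join keys `business_id` and `user_id` are assumed from the column list of the merged dataset; the actual data preparation may differ.

```python
import pandas as pd

# Toy stand-ins for the three Yelp files (values are invented for illustration)
business = pd.DataFrame({"business_id": ["b1", "b2"], "name": ["Cafe A", "Diner B"]})
users = pd.DataFrame({"user_id": ["u1", "u2"], "average_stars": [4.2, 3.1]})
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u1"],
    "review_stars": [5, 3, 4],
    "date": ["2013-06-01 12:00:00", "2019-09-16 19:26:44", "2020-02-02 08:30:00"],
})

# One row per review, enriched with business and user attributes
merged = reviews.merge(business, on="business_id").merge(users, on="user_id")

# Keep only reviews from 1 January 2014 onwards
merged["date"] = pd.to_datetime(merged["date"])
recent = merged[merged["date"] >= pd.Timestamp("2014-01-01")]
print(recent.shape)
```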

For the full data preparation, please see the Python file: `NAME OF FILE`

#### Libraries
The libraries used for this project are presented below.

In [3]:
import pandas as pd
import numpy as np
from functools import reduce
import matplotlib.pyplot as plt
import os

# plot functions
import seaborn as sns
from bokeh.plotting import figure
from bokeh.io import show, output_notebook, curdoc, output_file
from bokeh.models import ColumnDataSource, FactorRange, Legend, HoverTool, GeoJSONDataSource, \
                        LinearColorMapper, ColorBar, NumeralTickFormatter, Div, Select, TableColumn, \
                        DataTable, CheckboxGroup, Tabs, Panel, CheckboxButtonGroup, RadioButtonGroup
from bokeh.application.handlers import FunctionHandler
from bokeh.application import Application
from bokeh.palettes import Category20c, Pastel1, Set3, Blues
from bokeh.layouts import column, row, WidgetBox, gridplot
from bokeh.embed import file_html
from bokeh.resources import CDN
from bokeh.tile_providers import get_provider, Vendors
from bokeh.transform import linear_cmap,factor_cmap
output_notebook()

#Anomaly detection

import sys
!{sys.executable} -m pip install gensim
!{sys.executable} -m pip install python-Levenshtein

from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
from gensim.parsing.porter import PorterStemmer
from collections import Counter
from sklearn.ensemble import IsolationForest

Collecting gensim
  Downloading gensim-4.0.1-cp37-cp37m-win_amd64.whl (23.9 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.0.0-py3-none-any.whl (56 kB)
Collecting Cython==0.29.21
  Downloading Cython-0.29.21-cp37-cp37m-win_amd64.whl (1.6 MB)
Installing collected packages: smart-open, Cython, gensim
  Attempting uninstall: Cython
    Found existing installation: Cython 0.29.23
    Uninstalling Cython-0.29.23:
      Successfully uninstalled Cython-0.29.23
Successfully installed Cython-0.29.21 gensim-4.0.1 smart-open-5.0.0




In [4]:
path = 'C:/Users/Miche/OneDrive - Danmarks Tekniske Universitet/MMC/2. Semester/Social Data/websites/Restaurant-Guide/data'
os.chdir(path)

In [5]:
df_2019 = pd.read_csv('yelp_reviews_RV_categories_2019.csv')
df_2020 = pd.read_csv('yelp_reviews_RV_categories_2020.csv')

df = pd.concat([df_2019, df_2020]) #concatenating the two datasets (DataFrame.append was removed in pandas 2.0)

print(df.columns) #checking our columns
print('shape of dataset', df.shape)

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'attributes',
       'categories', 'hours', 'cat_kitchen', 'cat_type', 'state_name',
       'latitude_state', 'longitude_state', 'PriceRange', 'AvgPrice',
       'user_id', 'review_stars', 'text', 'date', 'username', 'user_count',
       'average_stars'],
      dtype='object')
shape of dataset (1427833, 27)


In [6]:
print('Timeperiod of the dataset')
print (df.date.min())
print (df.date.max())

Timeperiod of the dataset
2019-01-01 00:00:09
2021-01-28 15:23:52


In [7]:
df.head(3)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,...,longitude_state,PriceRange,AvgPrice,user_id,review_stars,text,date,username,user_count,average_stars
0,7bioKzHBPPRdM7-jJQAwkg,Cafe Aion,1235 Pennsylvania Ave,Boulder,CO,80302,40.008789,-105.276503,3.5,144,...,-105.358887,$$,20,eXRC79iX60xwA1UuGRuWNg,4,Best paella I have enjoyed in a long time on a...,2019-09-16 19:26:44,Daniel,376,4.29
1,7bioKzHBPPRdM7-jJQAwkg,Cafe Aion,1235 Pennsylvania Ave,Boulder,CO,80302,40.008789,-105.276503,3.5,144,...,-105.358887,$$,20,eXRC79iX60xwA1UuGRuWNg,4,Best paella I have enjoyed in a long time on a...,2019-09-16 19:26:44,Daniel,376,4.29
2,7bioKzHBPPRdM7-jJQAwkg,Cafe Aion,1235 Pennsylvania Ave,Boulder,CO,80302,40.008789,-105.276503,3.5,144,...,-105.358887,$$,20,eXRC79iX60xwA1UuGRuWNg,4,Best paella I have enjoyed in a long time on a...,2019-09-16 19:26:44,Daniel,376,4.29


Start by checking NaN values

In [8]:
print('Any missing values:', df.isnull().sum().sum())

Any missing values: 13558


From further analysis, the missing values were in the columns: `hours`,`address` and `username`. Since these are not being used, these can be dropped.
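
The per-column check behind this statement can be sketched as follows, using an invented toy frame rather than the real data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "name": ["Cafe A", "Diner B", "Grill C"],
    "hours": [np.nan, "9-17", np.nan],
    "address": ["Main St 1", None, "Side St 3"],
})

# Count missing values per column and keep only the columns that have any
missing_per_column = toy.isnull().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```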

In [9]:
df = df.drop(['hours','address', 'username'],axis=1) 
print('Any missing values:', df.isnull().sum().sum())

Any missing values: 0


Checking the top categories in different columns

In [10]:
#exploding the dataframe so the categories column is not a list type
df_duplicates = df.drop_duplicates(subset=['name','date'])
df_explode = df_duplicates.assign(categories = df_duplicates['categories'].str.split(', ')).explode('categories')

print('Top categories in the dataset')
print(df_explode['categories'].value_counts().head(10))
print('All rows have the value *Restaurant*:', df_duplicates['categories'].str.contains('Restaurant').all())

MemoryError: Unable to allocate 304. MiB for an array with shape (13, 3060959) and data type object

The check above makes sure that all rows have *Restaurant* in their categories, which is true.

In [None]:
print('Top categories in the dataset based on our own variables')
print(df['cat_kitchen'].value_counts().head(10))

In [None]:
print('Top categories in the dataset based on our own variables')
print(df['cat_type'].value_counts().head(10))

In [None]:
print('Count of the different price ranges')
print(df['PriceRange'].value_counts())

Some descriptive statistics for the different variables

In [None]:
variables = ['name','city','state','cat_kitchen','cat_type','PriceRange']
df[variables].describe()

So there are 11 unique states in the dataset and 302 unique cities.

In [None]:
variables = ['stars', 'AvgPrice']
round(df[variables].describe(),2)

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.boxplot(x=df['stars'], color='#2b8cbe', saturation=0.5)
plt.tight_layout(h_pad=2)

In [None]:
fig = plt.figure(figsize=(10,5))
selected = ['stars']
plt.title('Stars')
sns.boxplot(x='state_name', y='stars', data=df, palette='YlGnBu', saturation=0.5)
plt.xticks(rotation=90)
plt.tight_layout(h_pad=2)


## 3. Exploratory Data Analysis

### 3.1. General Overview of the data

We start by illustrating the different states and how many restaurants there are in each state.

In [None]:
restaurants = df.drop_duplicates(subset=['name','state'])
restaurant_ = restaurants.groupby(['state_name', 'latitude_state', 'longitude_state']).size().reset_index(name='Counts').sort_values(by='Counts',ascending=False)
restaurants.shape

The code below defines the sizes of the circles on our map. It rescales the count values to lie between 15 and 50. This is done because some states have more than 1,000 restaurants, so an unscaled circle would fill the whole plot and hide the map.

In [None]:
r_max = max(restaurant_['Counts'])
r_min = min(restaurant_['Counts'])
t_min = 15
t_max = 50

restaurant_['Size'] = (restaurant_['Counts'] - r_min)/(r_max - r_min) * (t_max - t_min) + t_min
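
As a small worked illustration of this min-max rescaling, with invented counts (the function name `rescale` is hypothetical):

```python
def rescale(counts, t_min=15, t_max=50):
    """Min-max scale counts linearly into the interval [t_min, t_max]."""
    r_min, r_max = min(counts), max(counts)
    return [(c - r_min) / (r_max - r_min) * (t_max - t_min) + t_min for c in counts]

sizes = rescale([10, 505, 1000])
```

The smallest count maps to 15, the largest to 50 and everything else falls linearly in between. Note that the formula divides by zero if all counts are equal.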

Map of the states in which we have restaurants

In [None]:
# Define function to switch from lat/long to mercator coordinates
def x_coord(x, y):
    
    lat = x
    lon = y
    
    r_major = 6378137.000
    x = r_major * np.radians(lon)
    scale = x/lon
    y = 180.0/np.pi * np.log(np.tan(np.pi/4.0 + 
        lat * (np.pi/180.0)/2.0)) * scale
    return (x, y)
# Define coord as tuple (lat,long)
restaurants_copy = restaurant_.copy()
restaurants_copy['coordinates'] = list(zip(restaurants_copy['latitude_state'], restaurants_copy['longitude_state']))
# Obtain list of mercator coordinates
mercators = [x_coord(x, y) for x, y in restaurants_copy['coordinates']]

# Create mercator column in our df
restaurants_copy['mercator'] = mercators
# Split that column out into two separate columns - mercator_x and mercator_y
restaurants_copy[['mercator_x', 'mercator_y']] = restaurants_copy['mercator'].apply(pd.Series)
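
For reference, the conversion above is the spherical Web Mercator projection: x = R·λ and y = R·ln(tan(π/4 + φ/2)), with R = 6378137 m and longitude λ and latitude φ in radians. A compact sketch of the same formula using only the standard library follows; the function name `to_web_mercator` is hypothetical. Unlike `x_coord`, it does not divide by the longitude, so it also works for a longitude of exactly 0.

```python
import math

R_MAJOR = 6378137.0  # sphere radius in metres used by Web Mercator

def to_web_mercator(lat, lon):
    """Convert latitude/longitude in degrees to Web Mercator x/y in metres."""
    x = R_MAJOR * math.radians(lon)
    y = R_MAJOR * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y

# Coordinates of the sample restaurant in Boulder, CO shown earlier
x, y = to_web_mercator(40.008789, -105.276503)
```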

In [None]:
# Select tile set to use
chosentile = get_provider(Vendors.CARTODBPOSITRON)

tooltips = [("State","@state_name"), ("Count", "@Counts")]

counts = restaurants_copy['Counts'].to_list()

p = figure(title = 'Number of restaurants in North America', 
           x_axis_type="mercator", y_axis_type="mercator", 
           x_axis_label = 'Longitude', y_axis_label = 'Latitude', 
           tooltips = tooltips, plot_width=800, plot_height=600)

p.add_tile(chosentile)

p.circle(x = 'mercator_x', y = 'mercator_y', color = 'lightblue', source=restaurants_copy, 
         size='Size', fill_alpha = 0.7)
show(p)
html = file_html(p, CDN, "North America - Map")
# print(html)

General bar charts of the three main variables: state, kitchen and type.

In [None]:
group_state_ = restaurants.groupby(['cat_kitchen']).size().reset_index(name='Counts')
group_state_['Type'] = group_state_['cat_kitchen'].astype(str)
df_dict_ = group_state_.to_dict('list')

group_state__ = restaurants.groupby(['cat_type']).size().reset_index(name='Counts')
group_state__['Type'] = group_state__['cat_type'].astype(str)
df_dict__ = group_state__.to_dict('list')

group_state = restaurants.groupby(['state_name']).size().reset_index(name='Counts')
group_state['state'] = group_state['state_name'].astype(str)
df_dict = group_state.to_dict('list')

In [None]:
title = 'Count of restaurants by State'
xlabel = 'State'
range_x = group_state.state.unique().tolist()

plot1 = figure(x_range=FactorRange(factors=range_x), y_range=(0,2500), plot_width=800, plot_height=500,
               x_axis_label=xlabel, toolbar_location=None, title=title)
plot1.vbar(x='state_name', width=0.7, bottom=0,
           top='Counts', source=df_dict, color='lightblue')

# hover tool
plot1.add_tools(HoverTool(tooltips=[('Count', "@Counts{1}")]))

# axis ticks
plot1.xaxis.major_tick_line_color = None 
plot1.xaxis.minor_tick_line_color = None 
plot1.yaxis.major_tick_line_color = None  
plot1.yaxis.minor_tick_line_color = None  
plot1.title.text_font_size = '13pt'
plot1.title.align = 'center'

# show(plot1)

In [None]:
title = 'Count of restaurants by Kitchen'
range_x = group_state_['cat_kitchen'].unique().tolist()
xlabel = 'Kitchen type'

plot2 = figure(x_range=FactorRange(factors=range_x),y_range=(0,2500),  plot_width=800, plot_height=500,
               x_axis_label=xlabel, toolbar_location=None, title=title)
plot2.vbar(x='Type', width=0.7, bottom=0,
           top='Counts', source=df_dict_, color='lightblue')

# hover tool
plot2.add_tools(HoverTool(tooltips=[('Count', "@Counts")]))

# axis ticks
plot2.xaxis.major_tick_line_color = None 
plot2.xaxis.minor_tick_line_color = None 
plot2.yaxis.major_tick_line_color = None  
plot2.yaxis.minor_tick_line_color = None  
plot2.xaxis.major_label_orientation = "vertical"
plot2.title.text_font_size = '13pt'
plot2.title.align = 'center'

# show(plot2)

In [None]:
title = 'Count of restaurants by Type'
range_x = group_state__['cat_type'].unique().tolist()
xlabel = 'Types'

plot3 = figure(x_range=FactorRange(factors=range_x), y_range=(0,2500), plot_width=600, plot_height=500,
               x_axis_label=xlabel, toolbar_location=None, title=title)
plot3.vbar(x='Type', width=0.7, bottom=0,
           top='Counts', source=df_dict__, color='lightblue')

# hover tool
plot3.add_tools(HoverTool(tooltips=[('Count', "@Counts")]))

# axis ticks
plot3.xaxis.major_tick_line_color = None 
plot3.xaxis.minor_tick_line_color = None 
plot3.yaxis.major_tick_line_color = None  
plot3.yaxis.minor_tick_line_color = None  
plot3.xaxis.major_label_orientation = "vertical"
plot3.title.text_font_size = '13pt'
plot3.title.align = 'center'

# show(plot3)

In [None]:
# Increase the plot widths
plot1.plot_width = plot2.plot_width = plot3.plot_width = 800
plot1.plot_height = plot2.plot_height = plot3.plot_height = 500


# Create three panels, one for each grouping (state, kitchen, type)
state_panel = Panel(child=plot1, title='Count of restaurants by State')
kitchen_panel = Panel(child=plot2, title='Count of restaurants by Kitchen')
types_panel = Panel(child=plot3, title='Count of restaurants by Type')

# Assign the panels to Tabs
tabs = Tabs(tabs=[state_panel, kitchen_panel, types_panel])

# Show the tabbed layout
show(tabs)

html = file_html(tabs, CDN, "General Overview - BarChart")
# print(html)

Since the dataset contains all the reviews from the different users, the number of reviews per state, kitchen and type is considered next.

In [None]:
group_state_ = df.groupby(['cat_kitchen']).size().reset_index(name='Counts')
group_state_['Type'] = group_state_['cat_kitchen'].astype(str)
df_dict_ = group_state_.to_dict('list')

group_state__ = df.groupby(['cat_type']).size().reset_index(name='Counts')
group_state__['Type'] = group_state__['cat_type'].astype(str)
df_dict__ = group_state__.to_dict('list')

group_state = df.groupby(['state_name']).size().reset_index(name='Counts')
group_state['state'] = group_state['state_name'].astype(str)
df_dict = group_state.to_dict('list')

In [None]:
title = 'Count of reviews by State'
xlabel = 'State'
range_x = group_state.state.unique().tolist()

plot1 = figure(x_range=FactorRange(factors=range_x), y_range=(0,1300000), plot_width=800, plot_height=500,
               x_axis_label=xlabel, toolbar_location=None, title=title)
plot1.vbar(x='state_name', width=0.7, bottom=0,
           top='Counts', source=df_dict, color='lightblue')

# hover tool
plot1.add_tools(HoverTool(tooltips=[('Count', "@Counts{1}")]))

# axis ticks
plot1.xaxis.major_tick_line_color = None 
plot1.xaxis.minor_tick_line_color = None 
plot1.yaxis.major_tick_line_color = None  
plot1.yaxis.minor_tick_line_color = None  
plot1.title.text_font_size = '13pt'
plot1.title.align = 'center'

# show(plot1)

In [None]:
title = 'Count of reviews by Kitchen'
range_x = group_state_['cat_kitchen'].unique().tolist()
xlabel = 'Kitchen type'

plot2 = figure(x_range=FactorRange(factors=range_x), y_range=(0,1700000), plot_width=800, plot_height=500,
               x_axis_label=xlabel, toolbar_location=None, title=title)
plot2.vbar(x='Type', width=0.7, bottom=0,
           top='Counts', source=df_dict_, color='lightblue')

# hover tool
plot2.add_tools(HoverTool(tooltips=[('Count', "@Counts")]))

# axis ticks
plot2.xaxis.major_tick_line_color = None 
plot2.xaxis.minor_tick_line_color = None 
plot2.yaxis.major_tick_line_color = None  
plot2.yaxis.minor_tick_line_color = None  
plot2.xaxis.major_label_orientation = "vertical"
plot2.title.text_font_size = '13pt'
plot2.title.align = 'center'

# show(plot2)

In [None]:
title = 'Count of reviews by Type'
range_x = group_state__['cat_type'].unique().tolist()
xlabel = 'Type'

plot3 = figure(x_range=FactorRange(factors=range_x), y_range=(0,1100000),plot_width=600, plot_height=500,
               x_axis_label=xlabel, toolbar_location=None, title=title)
plot3.vbar(x='Type', width=0.7, bottom=0,
           top='Counts', source=df_dict__, color='lightblue')

# hover tool
plot3.add_tools(HoverTool(tooltips=[('Count', "@Counts")]))

# axis ticks
plot3.xaxis.major_tick_line_color = None 
plot3.xaxis.minor_tick_line_color = None 
plot3.yaxis.major_tick_line_color = None  
plot3.yaxis.minor_tick_line_color = None  
plot3.xaxis.major_label_orientation = "vertical"
plot3.title.text_font_size = '13pt'
plot3.title.align = 'center'

# show(plot3)

In [None]:
# Increase the plot widths
plot1.plot_width = plot2.plot_width = plot3.plot_width = 800
plot1.plot_height = plot2.plot_height = plot3.plot_height = 500


# Create three panels, one for each grouping (state, kitchen, type)
state_panel = Panel(child=plot1, title='Count of reviews by State')
kitchen_panel = Panel(child=plot2, title='Count of reviews by Kitchen')
types_panel = Panel(child=plot3, title='Count of reviews by Type')

# Assign the panels to Tabs
tabs = Tabs(tabs=[state_panel, kitchen_panel, types_panel])

# Show the tabbed layout
show(tabs)

html = file_html(tabs, CDN, "Overview Reviews")
# print(html)

### 3.2. Score Overview

In [None]:
stars_df = restaurants.groupby(['state_name','stars']).size().reset_index(name='counts')
stars_df = stars_df.pivot_table(values = 'counts', index='state_name', columns='stars').reset_index()
stars_df = stars_df.rename(columns={1.0:'1', 1.5:'1.5',2.0:'2',2.5:'2.5',
                                    3.0:'3',3.5:'3.5',4.0:'4',4.5:'4.5',5.0:'5'}) 
stars_df = stars_df.fillna(0)
source = ColumnDataSource(stars_df)

title = 'Count of stars by State'
range_x = restaurants['state_name'].sort_values().unique().tolist()
xlabel = 'States'
stack = ['1', '1.5', '2', '2.5', '3', '3.5', '4', '4.5', '5']
colors = ['#30678d','#08306b', '#08519c', '#2171b5', '#4292c6', '#6baed6', '#9ecae1', '#c6dbef', '#deebf7']

state = figure(x_range=FactorRange(factors=range_x), y_range=(0,2400),plot_width=800, plot_height=550,
           x_axis_label=xlabel, toolbar_location=None, tools="",
           title=title)

renderers = state.vbar_stack(stack, x='state_name', width=0.9, color=colors, source=source,
             legend_label=stack)


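for r in renderers:
    star = r.name  # each renderer is named after its stacked column, e.g. '4.5'
    hover = HoverTool(tooltips=[
        # braces are needed because column names such as '1.5' contain a dot
        ("Number of restaurants with %s stars" % star, "@{%s}" % star),
        ("index", "$index")
    ], renderers=[r])
    state.add_tools(hover)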

# axis ticks
state.yaxis.ticker = [0, 500, 1000, 1500, 2000, 2500]
state.xaxis.major_tick_line_color = None 
state.xaxis.minor_tick_line_color = None 
state.yaxis.major_tick_line_color = None  
state.yaxis.minor_tick_line_color = None  
state.legend.location = "top_center"
state.legend.label_text_font_size = "7pt"
state.legend.orientation = "horizontal"
state.title.text_font_size = '13pt'
state.title.align = 'center'

# show(state)

In [None]:
stars_df = restaurants.groupby(['cat_kitchen','stars']).size().reset_index(name='counts')
stars_df = stars_df.pivot_table(values = 'counts', index='cat_kitchen', columns='stars').reset_index()
stars_df = stars_df.rename(columns={1.0:'1', 1.5:'1.5',2.0:'2',2.5:'2.5',
                                    3.0:'3',3.5:'3.5',4.0:'4',4.5:'4.5',5.0:'5'}) 

stars_df = stars_df.fillna(0)
source = ColumnDataSource(stars_df)

title = 'Count of stars by Kitchen'
range_x = restaurants['cat_kitchen'].sort_values().unique().tolist()
xlabel = 'Kitchen Types'
ylabel = '# of stars'
stack = ['1', '1.5', '2', '2.5', '3', '3.5', '4', '4.5', '5']
colors = ['#30678D','#08306b', '#08519c', '#2171b5', '#4292c6', '#6baed6', '#9ecae1', '#c6dbef', '#deebf7']

kitchen = figure(x_range=FactorRange(factors=range_x), y_range=(0,2400), plot_width=800, plot_height=500,
           x_axis_label=xlabel,toolbar_location=None, tools="",
           title=title)

renderers = kitchen.vbar_stack(stack, x='cat_kitchen', width=0.9, color=colors, source=source,
             legend_label=stack)

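for r in renderers:
    star = r.name  # each renderer is named after its stacked column, e.g. '4.5'
    hover = HoverTool(tooltips=[
        # braces are needed because column names such as '1.5' contain a dot
        ("Number of restaurants with %s stars" % star, "@{%s}" % star),
        ("index", "$index")
    ], renderers=[r])
    kitchen.add_tools(hover)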

# axis ticks
kitchen.xaxis.major_tick_line_color = None 
kitchen.xaxis.minor_tick_line_color = None 
kitchen.yaxis.major_tick_line_color = None  
kitchen.yaxis.minor_tick_line_color = None  
kitchen.legend.location = "top_center"
kitchen.legend.label_text_font_size = "7pt"
kitchen.legend.orientation = "horizontal"
kitchen.xaxis.major_label_orientation = "vertical"
kitchen.title.text_font_size = '13pt'
kitchen.title.align = 'center'

# show(kitchen)

In [None]:
stars_df = restaurants.groupby(['cat_type','stars']).size().reset_index(name='counts')
stars_df = stars_df.pivot_table(values = 'counts', index='cat_type', columns='stars').reset_index()
stars_df = stars_df.rename(columns={1.0:'1', 1.5:'1.5',2.0:'2',2.5:'2.5',
                                    3.0:'3',3.5:'3.5',4.0:'4',4.5:'4.5',5.0:'5'}) 

stars_df = stars_df.fillna(0)
source = ColumnDataSource(stars_df)

title = 'Count of stars by Type'
range_x = restaurants['cat_type'].sort_values().unique().tolist()
xlabel = 'Type'
ylabel = '# of stars'
stack = ['1', '1.5', '2', '2.5', '3', '3.5', '4', '4.5', '5']
colors = ['#30678D','#08306b', '#08519c', '#2171b5', '#4292c6', '#6baed6', '#9ecae1', '#c6dbef', '#deebf7']

types = figure(x_range=FactorRange(factors=range_x),y_range=(0,2000), plot_width=800, plot_height=500,
           x_axis_label=xlabel ,toolbar_location=None, tools="",
           title=title)

renderers = types.vbar_stack(stack, x='cat_type', width=0.9, color=colors, source=source,
             legend_label=stack)

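for r in renderers:
    star = r.name  # each renderer is named after its stacked column, e.g. '4.5'
    hover = HoverTool(tooltips=[
        # braces are needed because column names such as '1.5' contain a dot
        ("Number of restaurants with %s stars" % star, "@{%s}" % star),
        ("index", "$index")
    ], renderers=[r])
    types.add_tools(hover)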

# axis ticks
types.xaxis.major_tick_line_color = None 
types.xaxis.minor_tick_line_color = None 
types.yaxis.major_tick_line_color = None  
types.yaxis.minor_tick_line_color = None  
types.legend.location = "top_center"
types.legend.label_text_font_size = "7pt"
types.legend.orientation = "horizontal"
types.xaxis.major_label_orientation = "vertical"
types.title.text_font_size = '13pt'
types.title.align = 'center'

# show(types)

In [None]:
# Increase the plot widths
state.plot_width = kitchen.plot_width = types.plot_width = 800
state.plot_height = kitchen.plot_height = types.plot_height = 600

# Create three panels, one for each grouping (state, kitchen, type)
state_panel = Panel(child=state, title='Count stars by State')
kitchen_panel = Panel(child=kitchen, title='Count stars by Kitchen')
types_panel = Panel(child=types, title='Count stars by Type')

# Assign the panels to Tabs
tabs = Tabs(tabs=[state_panel, kitchen_panel, types_panel])

# Show the tabbed layout
show(tabs)

html = file_html(tabs, CDN, "Score Overview")
# print(html)

### 3.3. Time-based Overview

## 4. Genre

### 4.1. Which genre of data story did you use?

In their paper, Edward Segel and Jeffrey Heer describe several genres, and more than one of them is used to create the content for this website.

* The annotated chart is used on the pages 'Exploratory Data Analysis' and 'Reviews', since a hover tool presents selected, important information to the user. Furthermore, the last map on 'Exploratory Data Analysis' presents the timeline of reviews for a given year, where the user can see the number of reviews, the stars and of course the name of the restaurant.
* The annotated chart and animation are also used on the page 'Find your favorite restaurant'. This page presents an interactive figure in which users can filter the data to their own preferences. The hover tool explained above is still active. Further, if users are in doubt about how to use the plot, a GIF above the plot demonstrates it.

### 4.2. Which tools did you use from each of the 3 categories of Visual Narrative?

Figure 7 in Edward Segel and Jeffrey Heer's paper presents the three categories under *Visual Narrative*; the tools used from each category are explained here.

* Visual Structuring: A consistent visual platform is used. Since the same color palette is used for the graphs and the same theme for the website, this gives a good flow through the presentation of our dataset and our findings.
* Highlighting: Three highlighting methods are used. On the page 'Exploratory Data Analysis', close-ups, zooming and feature distinction are used. Close-ups are used on the map showing where the restaurants are located in North America, where the user is also able to zoom. On the other plots, feature distinction is achieved with different colors, so that the user can see the difference between, for example, star ratings by state. Highlighting is also used on the page 'Find your favorite restaurant': the user is first presented with an overview of all the restaurants and can then zoom in and out and filter the data by their own preferences.
* Transition Guidance: A simple website setup with familiar objects is used. This (hopefully) also makes the website easier to use.

### 4.3. Which tools did you use from each of the 3 categories of Narrative Structure?

Again, Figure 7 in Edward Segel and Jeffrey Heer's paper is used. The narrative structure is explained in this part.

* Ordering: As explained above, the website is built from separate posts on the front page, and the user can always return to the front page using the *home* button in the upper right corner. Hence users can access the pages in whatever order they want. The page 'Exploratory Data Analysis' is driven by the authors of this project, since it presents what they found interesting during the exploratory part.
* Interactivity: All plots have hover highlighting, which gives the user easy access to a lot of information by just moving the mouse. Further, there is an interactive plot where the user can filter the data shown in the plot.
* Messaging: Several forms of messaging are used. When entering the website, an introductory text is presented. Further, every post has headers for its sections and captions for all figures. This makes it easy for the user to see when the topic changes and to get a brief understanding of what the figures illustrate. Lastly, each figure on the website also has an annotation, which goes into more detail for users who want a more in-depth understanding of the dataset.


## 5. Visualizations

*Explain the visualizations you've chosen.*

*Why are they right for the story you want to tell?*

The constructed website has three pages with plots. The first page, 'Exploratory Data Analysis', has bar charts, line plots and a map overview. The second page, 'Reviews', presents the machine learning model that has been used to detect fake reviews and further presents **WHICH PLOTS**. Lastly, the page 'Find your favorite restaurant' contains an interactive plot with a table of all relevant restaurants and a map showing where the restaurants are located.

These visualizations are good because they give the user easy access to a lot of information without needing deeper knowledge. Further, they also let the user interact with the last figure.

**Is there something else more sensible we can write?**


### 5.1. Anomaly Detection

A Harvard Business School report has estimated that 20% of reviews on Yelp are fake. In the anomaly detection, the 20% of reviews with the highest risk of being fake are therefore identified using an Isolation Forest.
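
How the 20% share can enter the model is sketched below with scikit-learn's `IsolationForest`, whose `contamination` parameter sets the expected proportion of outliers. The sketch uses random toy features and assumes `contamination=0.2`; the features and settings of the actual model may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # toy feature matrix: 500 "reviews", 3 features

# contamination=0.2 is an assumption matching the 20% estimate above
model = IsolationForest(contamination=0.2, random_state=0)
labels = model.fit_predict(X)      # -1 = flagged as anomaly, 1 = normal
scores = model.score_samples(X)    # lower score = more anomalous

flagged_share = (labels == -1).mean()
print(round(flagged_share, 2))
```

Rows labelled `-1` are the candidates for fake reviews, and `scores` allows them to be ranked by how anomalous they are.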

In [None]:
df = df.drop_duplicates(subset='text')
print(df.columns) #checking our columns
print('shape of dataset', df.shape)

N = df.shape[0]
MaxUsed = 100

df = df.iloc[0:N]

#### Creation of features

The length of the review and the number of sentences in the review are used as two of the features of the Isolation Forest model.

In [None]:
#Length of review
df['reviewlength'] = df['text'].str.len()

#number of sentences
df['sentences'] = df["text"].str.count(r'\.')

#### Text Processing

The reviews are processed in the following way:
* Stop words are removed
* The text is tokenized
* The text is stemmed

In [None]:
#remove stop words
df['filtered_text'] = df.text.apply(remove_stopwords)

In [None]:
# Tokenize the text
df['tokenized_text'] = [simple_preprocess(line, deacc=True) for line in df['filtered_text']] 

In [None]:
#Stem the text
porter_stemmer = PorterStemmer()
df['stemmed_tokens'] = [[porter_stemmer.stem(word) for word in tokens] for tokens in df['tokenized_text']]

df['stemmed_text'] = df['stemmed_tokens'].apply(' '.join)

Some of the model features indicate whether a given word occurs in the review. For this, the 100 most used words (excluding stop words) are found. Most reviews have 5 stars, so to make sure that the list is not dominated by positive words, the 100 most used words are found in a balanced dataset with an equal number of reviews with 1, 2, 3, 4 and 5 stars.

In [None]:
#number of reviews in the smallest star group
min_review = min(df['review_stars'].value_counts())

#dataframes for each star
df1 = df.loc[(df['review_stars'] == 1)]
df2 = df.loc[(df['review_stars'] == 2)]
df3 = df.loc[(df['review_stars'] == 3)]
df4 = df.loc[(df['review_stars'] == 4)]
df5 = df.loc[(df['review_stars'] == 5)]

#sample dataframe for each star
df1 = df1.sample(n=min_review,axis='rows')
df2 = df2.sample(n=min_review,axis='rows')
df3 = df3.sample(n=min_review,axis='rows')
df4 = df4.sample(n=min_review,axis='rows')
df5 = df5.sample(n=min_review,axis='rows')

In [None]:
#balanced reviews
df_word = pd.concat([df1, df2, df3, df4, df5])

In [None]:
#find the 100 most used words from the balanced review set(without stop words)
WordCounter = Counter(" ".join(df_word["stemmed_text"]).split()).most_common(MaxUsed)
#keep only the words, not their counts
Word = [word for word, count in WordCounter]
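
The balancing and word-count steps can also be sketched on a toy example. The version below is only an illustration under assumptions: the data is invented, and it uses `groupby(...).sample(...)` as a compact alternative to building one dataframe per star, with a fixed `random_state` so the draw is repeatable.

```python
from collections import Counter
import pandas as pd

# Toy data, invented for illustration: 5-star reviews are over-represented
toy = pd.DataFrame({
    "review_stars": [5, 5, 5, 5, 1, 1],
    "stemmed_text": ["great food", "great servic", "great place",
                     "love great", "bad food", "bad servic"],
})

# Size of the smallest star group
min_review = int(toy["review_stars"].value_counts().min())

# Draw the same number of reviews from every star group
balanced = toy.groupby("review_stars").sample(n=min_review, random_state=0)

# Most frequent words in the balanced sample
counts = Counter(" ".join(balanced["stemmed_text"]).split())
top_words = [word for word, count in counts.most_common(3)]

print(balanced["review_stars"].value_counts().to_dict())
print(top_words)
```

Without the balancing step, "great" would appear four times in this toy set and dominate the list, which is the effect the balanced dataset is meant to avoid.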

#### Encoding

The reviews are encoded using one-hot encoding: for each review, it is checked whether it contains each of the 100 most used words from the balanced dataset. 

Furthermore, userID and businessID are encoded using label encoding, so these also enter as features in the model.

Lastly, the date and time are split into numeric components so they can be used as features in the model: year, month, day of the month, hour, minute and day of the week. 

In [None]:
Words = np.zeros((N,MaxUsed))

#one-hot encoding over the 100 most used words
#(note: `in` on a string matches substrings, so 'good' also matches 'goodness')
for pos, text in enumerate(df['stemmed_text']):
    for i, word in enumerate(Word):
        if word in text:
            Words[pos, i] = 1
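
Because the check above is done on the whole stemmed string, a word also counts as present when it is only part of a longer word. The sketch below shows a possible whole-word variant on toy data. It is an alternative for comparison, not the encoding used in this notebook, and the names `vocab`, `texts` and `indicators` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy inputs, invented for illustration
vocab = ["good", "bad", "food"]   # stands in for the 100 most used words
texts = pd.Series(["good food", "goodness me", "bad bad food"])

# Split each review into a set of tokens so that only whole words match
token_sets = texts.str.split().apply(set)

# One row per review, one column per vocabulary word
indicators = np.array([[int(word in tokens) for word in vocab]
                       for tokens in token_sets])

print(indicators)
```

Here the second review gets no indicator for "good", whereas a substring check on the string "goodness me" would set it.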

In [None]:
#drop attributes with text
df = df.drop(['text','filtered_text','tokenized_text', 'stemmed_tokens',
       'stemmed_text'], axis=1)

In [None]:
#dataframe for encoding
df_words = pd.DataFrame(Words)

#Encoding column names are changed
Names = list(range(0, MaxUsed))
ColNames = [str(n) for n in Names]

df_words.columns = ColNames
print(df_words)

#The encoding is added to the dataset
df = df.join(df_words)

In [None]:
#label encoding for userID and businessID
df['user_id'] = df['user_id'].astype('category')
df['business_id'] = df['business_id'].astype('category')

df['UserCat'] = df['user_id'].cat.codes
df['BusinessCat'] = df['business_id'].cat.codes

#encode date
df['date']= pd.to_datetime(df['date'])

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df['DayOfWeek'] = df['date'].dt.dayofweek
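
To make the encoding concrete, the sketch below runs the same steps on a toy frame with invented IDs and timestamps. Category codes follow the sorted order of the categories, and `dayofweek` counts from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "user_id": ["u_b", "u_a", "u_b"],
    "date": ["2019-03-01 18:45:00", "2020-12-24 09:05:00", "2021-07-04 23:59:00"],
})

# Label encoding: each distinct ID is mapped to an integer code
toy["UserCat"] = toy["user_id"].astype("category").cat.codes

# Split the timestamp into numeric components
toy["date"] = pd.to_datetime(toy["date"])
toy["year"] = toy["date"].dt.year
toy["hour"] = toy["date"].dt.hour
toy["DayOfWeek"] = toy["date"].dt.dayofweek

print(toy)
```

The integer codes carry no meaningful order, so they mainly let the model isolate individual users or businesses rather than measure any distance between them.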

#### Anomaly detection - Isolation Forest

As already mentioned, anomaly detection is performed using an Isolation Forest. The features used in the model are the stars of the restaurant, the number of reviews it has received, the stars given in the particular review, the length of the review, the number of sentences in the review, userID, businessID, year, month, day of the month, hour, minute, day of the week, and the indicators for which of the 100 most used words the review contains. 

In [None]:
#features for Isolation Forest
#ColNames holds the word-indicator columns '0' to '99'
df_feature = df[['stars', 'review_count','review_stars','reviewlength', 'sentences', 
                 'UserCat','BusinessCat', 'year', 'month', 'day', 'hour', 'minute', 
                 'DayOfWeek'] + ColNames]

In [None]:
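#contamination=0.2 flags the 20% most anomalous reviews
#note: the integer max_features=1 draws a single feature per tree (the float 1.0 would use all features)
model = IsolationForest(n_estimators=50, max_samples='auto', 
                        contamination=0.2, max_features=1).fit(df_feature)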

In [None]:
df['scores'] = model.decision_function(df_feature)
df['anomaly'] = model.predict(df_feature)
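
The relation between `contamination`, `decision_function` and `predict` can be checked on synthetic data. The sketch below is only an illustration: the features are random numbers rather than review data, and `random_state` is fixed so the run is repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic features, invented for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

model = IsolationForest(n_estimators=50, contamination=0.2, random_state=0).fit(X)

scores = model.decision_function(X)   # lower score = more anomalous
labels = model.predict(X)             # -1 = anomaly, 1 = inlier

# Share of points flagged as anomalies, close to the contamination value
share_flagged = (labels == -1).mean()
print(round(float(share_flagged), 2))
```

The points labelled -1 are exactly those with a negative score, so in the notebook the `anomaly` column marks the reviews treated as likely fake and `scores` ranks them.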

In [None]:
#drop encodings
df = df.drop(['reviewlength', 'sentences'] + ColNames +
             ['UserCat', 'BusinessCat', 'year', 'month', 'day', 'hour', 'minute', 'DayOfWeek'], axis = 1)

### 5.2. Interactive plot

In [48]:
df_new = df.drop_duplicates(subset=['name','state','cat_kitchen','cat_type'])
df_new.shape

(24530, 27)

In [54]:
def modify_doc(doc):
    
    def make_dataset(selectedState, selectedKitchen, selectedType, selectedPrice):
        df_ = df_new.copy()

        # Keep only the selected price ranges ('No Preference' skips the price filter).
        # isin replaces the row-by-row DataFrame.append, which was removed in pandas 2.0.
        if selectedPrice != 'No Preference':
            df_ = df_[df_['PriceRange'].isin(selectedPrice)]

        # Apply the remaining filters; 'All' means no restriction
        if selectedState != 'All':
            df_ = df_[df_['state_name'] == selectedState]
        if selectedKitchen != 'All':
            df_ = df_[df_['cat_kitchen'] == selectedKitchen]
        if selectedType != 'All':
            df_ = df_[df_['cat_type'] == selectedType]

        # Work on a copy so that new columns can be added without warnings
        df_ = df_.copy()


        # Preparing with long/lat coordinates 
        df_['coordinates'] = list(zip(df_['latitude'], df_['longitude']))
        # Obtain list of mercator coordinates
        mercators = [x_coord(x, y) for x, y in df_['coordinates'] ]

        # Create mercator column in our df
        df_['mercator'] = mercators
        # Split that column out into two separate columns - mercator_x and mercator_y
        df_[['mercator_x', 'mercator_y']] = df_['mercator'].apply(pd.Series)
    
        #title for the plot  
        div_title = Div(text="<b> Restaurant matching Preferences: {} </b>".format(len(df_['name'].unique())),
                   style={'font-size': '150%'})
        
        # Convert dataframe to column data source
        return ColumnDataSource(df_), div_title
    
    def make_plot(source):
    
        #table in the plot
        columns = [
            TableColumn(field="name", title="Restaurant Name", width=100),
            TableColumn(field="stars", title="Stars", width=50),
            TableColumn(field="state_name", title="State", width=80),
            TableColumn(field="city", title="City", width=80),
            TableColumn(field="cat_kitchen", title="Kitchen", width=110),
            TableColumn(field="cat_type", title="Type", width=100),
            TableColumn(field="PriceRange", title="Price", width=60)
        ]
        table = DataTable(source=source, columns=columns, width=660, height=200, fit_columns=False)

        #Map layout of North America
        tooltips = [("Restaurant","@name"), ("Stars", "@stars")]

        p = figure(x_axis_type="mercator", y_axis_type="mercator", 
               x_axis_label = 'Longitude', y_axis_label = 'Latitude', 
               tooltips = tooltips, plot_width=500, plot_height=500, 
               toolbar_location='below', tools="pan,wheel_zoom,reset", 
                active_scroll='auto')

        p.circle(x = 'mercator_x', y = 'mercator_y', color = 'lightblue', source=source, 
             size=10, fill_alpha = 0.7)

        chosentile = get_provider(Vendors.CARTODBPOSITRON)
        p.add_tile(chosentile)

        return table, p
    
    # Update maps
    def update(attr, old, new):
        
        # Read the currently selected filter values
        selectedState = select_state.value
        selectedKitchen = select_kitchen.value
        selectedType = select_type.value
        selectedPrice = [select_price.labels[i] for i in select_price.active]
        
        # Make a new dataset based on the selected filters and the make_dataset function defined earlier
        new_src, div_title = make_dataset(selectedState, selectedKitchen, selectedType, selectedPrice)
        # Update the data source shared by the table and the map
        src.data.update(new_src.data)
        
        layout.children[0] = div_title

        
    #selection the different filters
    div_subtitle = Div(text="<i> Filter with the data and find your favorite restaurant </i>")

    # User select: State
    div_state = Div(text="<b> Select State </b>")
    state = ['All']+df_new['state_name'].unique().tolist()
    select_state = Select(options=state, value=state[0]) #by default All is chosen
    select_state.on_change('value', update)

    # User select: Kitchen
    div_kitchen = Div(text="<b> Select Kitchen </b>")
    kitchen = ['All']+df_new['cat_kitchen'].unique().tolist()
    select_kitchen = Select(options=kitchen, value=kitchen[0]) #by default All is chosen
    select_kitchen.on_change('value', update)

    # User select: Type
    div_type = Div(text="<b> Select Type </b>")
    types = ['All']+df_new['cat_type'].unique().tolist()
    select_type = Select(options=types, value=types[0]) #by default All is chosen
    select_type.on_change('value', update)

    # User select : Price Range
    div_price = Div(text="<b> Select Price </b>")
    price_range = ['$','$$','$$$','$$$$','Unknown']
    select_price = CheckboxButtonGroup(labels=price_range, active=[2,3])
    select_price.on_change('active', update)
    
    #initial source and plot
    initial_select_price = [select_price.labels[i] for i in select_price.active]
    
    src, div_title = make_dataset(select_state.value,select_kitchen.value, select_type.value, initial_select_price)
    table, p = make_plot(src)

    # Combine all controls to get in column
    col_tab_plot = row(table, p, height=200, width=1200)
    col_filters_1 = column(div_state, select_state, div_kitchen, select_kitchen , width=290)
    col_filters_2 = column(div_type, select_type, div_price, select_price, width=290)

    # Layout
    layout = column(div_title, div_subtitle, col_tab_plot, row(col_filters_1,col_filters_2)) 
    #it is possible to add multiple col_filters in the row(), you just need to specify it above
    doc.add_root(layout)

# Set up an application
app = Application(FunctionHandler(modify_doc))
show(app)

ERROR:bokeh.server.views.ws:Refusing websocket connection from Origin 'http://localhost:8890';                       use --allow-websocket-origin=localhost:8890 or set BOKEH_ALLOW_WS_ORIGIN=localhost:8890 to permit this; currently we allow origins {'localhost:8888'}
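
The map uses Mercator axes, so latitude and longitude have to be converted before plotting. The helper `x_coord` used in `make_dataset` is not shown in this section, so the sketch below is only an assumption about what such a conversion typically looks like. It uses the standard Web Mercator formulas, and the name `to_web_mercator` is hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius in metres

def to_web_mercator(lat, lon):
    """Convert latitude/longitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# Example: an approximate location in Las Vegas (coordinates chosen for illustration)
x, y = to_web_mercator(36.17, -115.14)
print(x, y)
```

The function takes latitude first and longitude second, matching the order in which the `coordinates` tuples are built above.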


## 6. Discussion


Think critically about your creation

What went well?,

What is still missing? What could be improved?, Why?

## 7. Contributions

(This is just a suggestion)

* Data preprocessing and cleaning:
* Exploratory data analysis - General Overview:
* Exploratory data analysis - Score Overview:
* Exploratory data analysis - Time-based score:
* Data Analysis - Anomaly Detection:
* Data Analysis - Interactive Bokeh Plot:
* Website + GitHub: 
* Explainer Notebook:

## 8. References
[1] Dataset: https://www.yelp.com/dataset

[2] https://arturomoncadatorres.com/creating-a-shareable-bokeh-dashboard-with-binder/

[3] https://towardsdatascience.com/data-visualization-with-bokeh-in-python-part-one-getting-started-a11655a467d4

[4] https://towardsdatascience.com/data-visualization-with-bokeh-in-python-part-iii-a-complete-dashboard-dc6a86aa6e23

[5] https://towardsdatascience.com/data-visualization-with-bokeh-in-python-part-ii-interactions-a4cf994e2512

[6] https://towardsdatascience.com/discover-your-next-favorite-restaurant-exploration-and-visualization-on-yelps-dataset-157d9799123c