# Introduction and Dataset

## Introduction

Yelp is a business directory service and crowd-sourced review forum that lets users review businesses based on their own experience. It is easy for a human to understand the sentiment behind a review, specifically whether it is positive or negative, even without any context about the business. However, Yelp holds a huge amount of review data, so it is not realistic for business owners to read every review themselves in order to improve their performance.

Our project aims to:

1. Provide useful analytical insights to business owners on Yelp and, based on these insights, propose data-driven, actionable decisions that help them improve their ratings on Yelp.
2. Build a web dashboard/widget/web application that visualizes our analysis and makes it easier for business owners to understand.

## The Dataset
Four datasets are provided: business data, review data, tip data, and user data. We focus on the attributes in the business data and the review texts in the review data for all Mexican restaurants.

For the **business data**, there are 4628 businesses categorized as Mexican restaurants. We transform all the attributes into a data frame, filter out the attributes with more than 90% missing values, and impute the remaining missing values using KNN (which substitutes each missing value with the majority value of its 5 nearest neighbors). We then build a linear model on the clean business data and give recommendations after conducting the $\chi^2$ test. The clean data include the following 40 variables:

|     |  |  |  |  |  |  |  |  |  |
|------|------|------|------|------|------|------|------|------|------|
|   business_id  | name | latitude | longitude | stars | review_count | is_open | WheelchairAccessible | RestaurantsTakeOut | RestaurantsGoodForGroups |
| romantic | intimate | classy |  hipster  | divey | touristy | trendy |  OutdoorSeating | RestaurantsTableService | RestaurantsDelivery |
| upscale | casual | dessert | latenight | lunch | dinner | brunch  |  breakfast | RestaurantsPriceRange2 |BusinessAcceptsCreditCards | 
| BikeParking | GoodForKids |  Caters  | NoiseLevel | HasTV | WiFi | Alcohol | park | RestaurantsReservations |  RestaurantsAttire | 

For the **reviews data**, there are 403941 reviews that belong to Mexican restaurants. We tokenize/lowercase/lemmatize the review text and extract the top 1000 most frequent words. Following that, we manually pick 49 food/service words that make sense for analysis, which are

|     |  |  |  |  |  |  |  |  |  |
|------|------|------|------|------|------|------|------|------|------|
|   taco  | burrito | salsa | chips | rice | guacamole | asada | chipotle | tortilla | salad |
| nacho | enchilada | meat |  time  | chicken | fish | shrimp |  beef | pork | steak |
| service | wait | minute | staff | waitress | waiter | atmosphere  |  price | internet |wifi | 
| noisy | noise |  breakfast  | lunch | dinner | line | drink | margarita | table |  fries | 
| manager | beer |  patio  | bacon | vegan | quesadilla | fajita | chile | enchilada |   | 

Then we give recommendations based on our analysis of these words.
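The preprocessing and frequency count described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: `naive_lemma` is a hypothetical stand-in that only strips a plural "s" (a real lemmatizer would be used in practice), and the three reviews are made-up examples.

```python
import re
from collections import Counter

def naive_lemma(token):
    """Rough stand-in for a real lemmatizer (assumption): strip a plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def top_words(reviews, n=1000):
    """Return the n most frequent lowercased, lemmatized tokens."""
    counts = Counter()
    for text in reviews:
        # Lowercase, then tokenize into runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(naive_lemma(t) for t in tokens)
    return [word for word, _ in counts.most_common(n)]

# Made-up reviews, for illustration only.
reviews = [
    "The tacos were great and the salsa was fresh.",
    "Great taco, slow service.",
    "Tacos again! Service was friendly.",
]
top3 = top_words(reviews, n=3)
print(top3)
```

The manual selection of the 49 food/service words then happens on the resulting frequency list.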

# EDA

The details of EDA can be found in the file, "EDA_summary.pdf".

# Recommendation Based on Business Attributes

## Imputation: KNN
Since many business attributes have missing values, we need to impute them. We use a KNN model to perform the imputation:
* Distance measure: Gower distance
* Hyper-parameter: k = 5
* Voting method: majority voting
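The three settings above can be illustrated with a small pure-Python sketch. It is written under assumptions that are not in the original: records are dictionaries with `None` marking a missing value, only categorical attributes are imputed, and the Gower distance is averaged over the attributes observed in both records. The function names and the toy rows are hypothetical.

```python
from collections import Counter

def gower_distance(a, b, numeric_ranges):
    """Gower distance over the features observed in both records.

    Numeric features contribute |a - b| / range; categorical features
    contribute 0 if equal and 1 otherwise. Returns None if nothing is shared.
    """
    total, used = 0.0, 0
    for key in a:
        if a[key] is None or b.get(key) is None:
            continue
        if key in numeric_ranges:
            rng = numeric_ranges[key]
            total += abs(a[key] - b[key]) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
        used += 1
    return total / used if used else None

def knn_impute(records, numeric_cols, k=5):
    """Fill missing categorical values by majority vote of the k nearest rows."""
    ranges = {}
    for col in numeric_cols:
        vals = [r[col] for r in records if r[col] is not None]
        ranges[col] = (max(vals) - min(vals)) if vals else 0.0
    imputed = [dict(r) for r in records]
    for i, row in enumerate(records):
        for col, value in row.items():
            if value is not None or col in numeric_cols:
                continue
            # Candidate neighbors: other rows where this column is observed.
            scored = []
            for j, other in enumerate(records):
                if j == i or other[col] is None:
                    continue
                d = gower_distance(row, other, ranges)
                if d is not None:
                    scored.append((d, j))
            scored.sort()
            votes = Counter(records[j][col] for _, j in scored[:k])
            if votes:
                imputed[i][col] = votes.most_common(1)[0][0]
    return imputed

# Toy rows (made up); the last one has a missing WiFi value.
rows = [
    {"stars": 4.5, "HasTV": True, "WiFi": "free"},
    {"stars": 4.0, "HasTV": True, "WiFi": "free"},
    {"stars": 4.5, "HasTV": True, "WiFi": "free"},
    {"stars": 2.0, "HasTV": False, "WiFi": "no"},
    {"stars": 1.5, "HasTV": False, "WiFi": "no"},
    {"stars": 4.0, "HasTV": True, "WiFi": None},
]
filled = knn_impute(rows, numeric_cols={"stars"}, k=3)
print(filled[-1]["WiFi"])
```

The toy call uses k = 3 only because the example has six rows; the analysis itself uses k = 5.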

## Linear Regression
We fit a stepwise linear regression (using AIC as criterion) with the following formula:
$$rating \sim attributes$$ 
The following table summarizes the significant predictors:
<img style="float: center;" src="image/summary.jpg" width="60%">
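To make the selection criterion concrete, here is a minimal sketch of stepwise selection by AIC. Several things are assumptions rather than the project's method: the search direction (forward only), the pure-Python least-squares solver, and the simulated data. For a Gaussian linear model the AIC is computed, up to an additive constant, as $n \log(RSS/n) + 2p$, where $p$ is the number of fitted coefficients.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def aic_of_fit(X, y, cols):
    """AIC (up to a constant) of OLS on an intercept plus the columns in cols."""
    rows = [[1.0] + [x[c] for c in cols] for x in X]
    p = len(cols) + 1
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    n = len(y)
    return n * math.log(rss / n) + 2 * p

def forward_stepwise(X, y):
    """Greedily add the predictor that lowers AIC most; stop when none does."""
    selected, remaining = [], list(range(len(X[0])))
    best = aic_of_fit(X, y, selected)
    while remaining:
        score, col = min((aic_of_fit(X, y, selected + [c]), c) for c in remaining)
        if score >= best:
            break
        best = score
        selected.append(col)
        remaining.remove(col)
    return selected, best

# Simulated data (assumption): column 0 drives y, column 1 is pure noise.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(60)]
y = [3 + 2 * x[0] + random.gauss(0, 0.3) for x in X]
selected, aic = forward_stepwise(X, y)
print(selected)
```

In practice a statistics package would handle the fitting; the hand-written solver is only there to keep the sketch self-contained.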

## $\chi^2$ test
Considering businesses with 3.5\~5.0 stars as positive and businesses with 1.0\~3.0 stars as negative, we performed a $\chi^2$ test of each attribute against the ratings and dropped the attributes that were not significant (we dropped **OutdoorSeating**, **WiFi** and **GoodForKids**):
<img style="float: center;" src="image/chisq.jpg" width="80%">


Some contingency tables for significant attributes are shown below:
<img style="float: center;" src="image/matrices.png" width="40%">
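A contingency table of this kind can be built by binarizing the star rating and cross-tabulating it with an attribute. The sketch below assumes each business is a dictionary with a `stars` field and the attribute of interest; the six rows are made up and do not come from the dataset.

```python
from collections import Counter

def rating_label(stars):
    """3.5-5.0 stars count as 'positive', 1.0-3.0 stars as 'negative'."""
    return "positive" if stars >= 3.5 else "negative"

def contingency_table(businesses, attribute):
    """Map each attribute level to [positive count, negative count]."""
    counts = Counter(
        (b[attribute], rating_label(b["stars"])) for b in businesses
    )
    levels = sorted({b[attribute] for b in businesses}, key=str)
    return {
        level: [counts[(level, "positive")], counts[(level, "negative")]]
        for level in levels
    }

# Made-up rows, for illustration only.
businesses = [
    {"stars": 4.5, "Caters": True},
    {"stars": 4.0, "Caters": True},
    {"stars": 2.5, "Caters": True},
    {"stars": 3.0, "Caters": False},
    {"stars": 3.5, "Caters": False},
    {"stars": 1.5, "Caters": False},
]
table = contingency_table(businesses, "Caters")
print(table)
```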

## Recommendations
We divide the significant predictors into two lists: one for positive predictors (those with positive coefficients) and one for negative predictors (those with negative coefficients). We give business owners suggestions based on these results:
<img style="float: center;" src="image/list.jpg" width="40%">
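The split into two lists and the resulting suggestions can be sketched as below. Everything concrete here is a placeholder: `AttributeA`, `AttributeB`, and `AttributeC` and their coefficient values are hypothetical and are not results from the fitted model, and the sketch assumes the attributes are boolean.

```python
def split_by_sign(coefficients):
    """Split predictors into positive and negative lists by coefficient sign."""
    positive = [name for name, beta in coefficients.items() if beta > 0]
    negative = [name for name, beta in coefficients.items() if beta < 0]
    return positive, negative

def suggest(business, positive, negative):
    """Suggest adding missing positive attributes and revisiting negative ones."""
    tips = []
    for name in positive:
        if business.get(name) is False:
            tips.append(f"Consider adding {name}")
    for name in negative:
        if business.get(name) is True:
            tips.append(f"Reconsider {name}")
    return tips

# Placeholder coefficients (hypothetical, not the project's estimates).
coefficients = {"AttributeA": 0.12, "AttributeB": -0.08, "AttributeC": 0.05}
positive, negative = split_by_sign(coefficients)

# Placeholder business record (hypothetical).
business = {"AttributeA": False, "AttributeB": True, "AttributeC": True}
tips = suggest(business, positive, negative)
print(tips)
```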


Taking the business **Taco Bell** (business ID: 1Dfx3zM-rW4n-31KeC8sJg) as an example, we can give the owner the following suggestions:
* 
*
* 
*  

## Strengths
* From the QQ plot, we can see that the assumption of normality is well satisfied.
* The approach provides business owners with clear, interpretable recommendations backed by statistical evidence.

## Weaknesses
* The coefficient of determination $R^2$ is low, so the model explains only a limited share of the variation in ratings.
* The residuals vs. fitted plot shows some patterns, which indicates that some assumptions (e.g. equal variance) may not be well satisfied.

# Recommendation Based on Review Texts

We tokenize/lowercase/lemmatize the review text and extract the top 1000 most frequent words. Following that, we manually pick 49 words that can serve as a basis for recommendations for Mexican restaurants, and build a contingency table for each of them. For example, the contingency table for **taco** is:

|     | Rating Positive | Rating Negative |
|------|------|------|
| Taco Positive | 11832 | 4577 |
| Taco Negative | 949 | 1551 |


To define the sentiment of a review with respect to a target word, we take the neighborhood of the word, meaning the 6 words around it. If the majority of them are positive, we count the review as a positive one for this target word; otherwise we count it as a negative one.
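The neighborhood rule can be sketched as follows. Several details are assumptions, since the text does not fix them: the 6-word neighborhood is taken as 3 words on each side of the target, "majority positive" is read as more positive than negative lexicon words in that window, and the two tiny word sets are a hypothetical stand-in for a real sentiment lexicon.

```python
# Hypothetical mini-lexicon; a real sentiment lexicon would be used in practice.
POSITIVE = {"great", "delicious", "fresh", "friendly", "good"}
NEGATIVE = {"bad", "cold", "slow", "rude", "bland"}

def word_sentiment(tokens, target, half_window=3):
    """Label a review for one target word from the words around it.

    Returns 'positive' or 'negative', or None if the target does not occur.
    """
    pos = neg = 0
    found = False
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        found = True
        # Up to half_window words before and after the target (6 in total).
        neighbors = tokens[max(0, i - half_window):i] + tokens[i + 1:i + 1 + half_window]
        pos += sum(w in POSITIVE for w in neighbors)
        neg += sum(w in NEGATIVE for w in neighbors)
    if not found:
        return None
    # Ties fall through to 'negative', following the "otherwise" rule.
    return "positive" if pos > neg else "negative"

tokens = "the taco was delicious and fresh but service was slow".split()
taco_label = word_sentiment(tokens, "taco")
service_label = word_sentiment(tokens, "service")
print(taco_label, service_label)
```

Counting these labels against the review's rating over all reviews yields a 2x2 contingency table for each target word.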

Then we conduct a $\chi^2$ test on this contingency table to test the relationship between the sentiment of the reviews containing a word and the ratings of those reviews. From these tests we obtain a list of significant words. The following are some examples of the distribution of the significant words.
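As a worked illustration, the Pearson $\chi^2$ statistic for the **taco** table above can be computed by hand. This sketch assumes no continuity correction and compares the statistic with 3.841, the 5% critical value of the $\chi^2$ distribution with one degree of freedom; the significance level used in the project is not stated, so 5% is an assumption.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Rows: taco positive / taco negative; columns: rating positive / negative.
taco = [[11832, 4577], [949, 1551]]
stat, dof = chi_square(taco)

CRITICAL_5PCT_DF1 = 3.841  # 5% critical value for 1 degree of freedom
significant = stat > CRITICAL_5PCT_DF1
print(round(stat, 1), dof, significant)
```

For this table the statistic is far above the critical value, so the sentiment around "taco" and the review rating are clearly associated.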

The next step is to classify the reviews of each business into two categories, positive and negative, and then use *TF-IDF* to extract the 100 most "important" words in each category. We obtain the positive and negative aspects of each business by intersecting those "important" words with the high-frequency words.
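A minimal sketch of this step is given below. It rests on assumptions not stated in the text: all reviews of a category are concatenated into one document, the IDF is smoothed as $\log\frac{1+N}{1+df}+1$ so that words appearing in both categories keep a nonzero weight, and ties are broken alphabetically. The token lists and the word set are made up.

```python
import math
from collections import Counter

def tfidf_top_words(docs, k=100):
    """docs maps a category name to its token list; return top-k words each."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    top = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores = {
            w: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, c in tf.items()
        }
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top[name] = ranked[:k]
    return top

def aspects(docs, high_frequency_words, k=100):
    """Keep only the top TF-IDF words that are also high-frequency words."""
    top = tfidf_top_words(docs, k)
    return {
        name: [w for w in words if w in high_frequency_words]
        for name, words in top.items()
    }

# Made-up token lists for one business, split by review category.
docs = {
    "positive": "taco great taco salsa fresh".split(),
    "negative": "service slow wait service cold".split(),
}
high_frequency_words = {"taco", "salsa", "service", "wait"}
result = aspects(docs, high_frequency_words, k=5)
print(result)
```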

Below is an example of what recommendations we can give to the business owners based on reviews for their businesses.

# Shiny App

# Contributions

Han Liao and Lingfeng Zhu: Generate recommendations based on business attributes. 

Qiaochu Yu and Yujie Zhang: Recommendations based on review data.

The summary and the app were developed by all members.

# References