# Modeling Ohio's Restaurant Yelp Review Data: Comparison Between Latent Dirichlet Allocation and Multinomial Logistic Regression

**Authors:** Ningning Long, Yue You, Tian Xia

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import shelve

Owners of businesses such as restaurants and cafes care a great deal about how to make use of the reviews customers leave after a visit. Insights mined from these reviews can reveal a business's weaknesses and guide improvements to service, products, or ambience. This study uses machine learning and natural language processing (NLP) techniques to analyze customer reviews from the [Yelp Dataset Challenge](https://www.yelp.com/dataset/challenge).

We focus on a subset of the full challenge dataset: reviews of restaurants in the state of Ohio. We use latent Dirichlet allocation (LDA) to uncover the topics underlying customers' reviews. LDA is an unsupervised machine learning algorithm that generates topics from the word frequencies in a set of documents. We then investigate how well the topics match the reviews and use them to predict customers' ratings. For comparison, we also compute the TF-IDF representation of the reviews and use it to fit a multinomial logistic regression on customers' ratings.
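The two modeling routes described above can be sketched on a handful of toy reviews. This is a minimal illustration that assumes scikit-learn; the example reviews, the ratings, the choice of two topics, and all variable names are made up for the sketch and are not the project's actual settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy reviews and star ratings (invented for illustration, not from the dataset).
reviews = [
    "great pizza and friendly service",
    "terrible service and cold food",
    "amazing burgers, great fries",
    "slow service, rude staff, cold pizza",
    "friendly staff and amazing food",
    "cold fries and terrible burgers",
]
stars = [5, 1, 5, 1, 4, 2]

# Route 1: LDA on raw word counts; each review becomes a vector of topic proportions.
counts = CountVectorizer(stop_words="english").fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 topics is an arbitrary choice
topic_features = lda.fit_transform(counts)  # shape: (n_reviews, n_topics)

# Route 2: TF-IDF weights; each review becomes a sparse vector over the vocabulary.
tfidf_features = TfidfVectorizer(stop_words="english").fit_transform(reviews)

# Logistic regression on each feature set. With the default lbfgs solver,
# recent scikit-learn versions fit a multinomial (softmax) model for multi-class targets.
topic_model = LogisticRegression(max_iter=1000).fit(topic_features, stars)
tfidf_model = LogisticRegression(max_iter=1000).fit(tfidf_features, stars)

topic_pred = topic_model.predict(topic_features)
tfidf_pred = tfidf_model.predict(tfidf_features)
```

In a real comparison the models would be evaluated on held-out reviews rather than on the data they were fitted to.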

## Data Extraction & Cleaning

The raw data are JSON files from Round 10 of the [Yelp Dataset Challenge](https://www.yelp.com/dataset/challenge). We extract a subset of the data and save it in the **'data'** folder. The *'restaurant.csv'* file holds the metadata for restaurants in the state of Ohio. The *'reviews.csv'* file, which we mainly work with, contains the customers' reviews of those restaurants. Both our latent Dirichlet allocation (LDA) model and our multinomial logistic regression model are built on the reviews in that file.
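The subsetting step might look roughly like the sketch below. It assumes the raw files hold one JSON object per line and that `categories` is a list of strings; the in-memory records, the variable names, and the exact filter are illustrative and are not taken from the project's cleaning notebook.

```python
import io

import pandas as pd

# Stand-in for the raw business file: one JSON object per line (records are made up).
raw_business = io.StringIO(
    '{"business_id": "a1", "state": "OH", "categories": ["Restaurants", "Pizza"]}\n'
    '{"business_id": "b2", "state": "NV", "categories": ["Restaurants"]}\n'
    '{"business_id": "c3", "state": "OH", "categories": ["Auto Repair"]}\n'
)
business = pd.read_json(raw_business, lines=True)

# Keep businesses that are in Ohio and tagged as restaurants.
is_ohio = business["state"] == "OH"
is_restaurant = business["categories"].apply(lambda cats: "Restaurants" in cats)
restaurants = business[is_ohio & is_restaurant]

# Stand-in for the raw review table; keep only reviews of the selected restaurants.
reviews = pd.DataFrame(
    {"business_id": ["a1", "b2", "a1"], "stars": [4, 5, 2]}
)
ohio_reviews = reviews[reviews["business_id"].isin(restaurants["business_id"])]
```

The two resulting frames correspond in spirit to *'restaurant.csv'* and *'reviews.csv'*, and could be written out with `DataFrame.to_csv`.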

Here is a quick look at the *'restaurant.csv'* file in the **'data'** folder:

In [17]:
pd.read_csv('./data/restaurant.csv',index_col=0).head()

| Unnamed: 0 | state | city | address | name | business_id | stars | review_count | categories |
|---|---|---|---|---|---|---|---|---|
| 0 | OH | Painesville | 1 S State St | Sidewalk Cafe Painesville | Bl7Y-ATTzXytQnCceg5k6w | 3.0 | 26 | ['American (Traditional)', 'Breakfast & Brunch... |
| 1 | OH | Northfield | 10430 Northfield Rd | Zeppe's Pizzeria | 7HFRdxVttyY9GiMpywhhYw | 3.0 | 7 | ['Pizza', 'Caterers', 'Italian', 'Wraps', 'Eve... |
| 2 | OH | Mentor | 9209 Mentor Ave | Firehouse Subs | lXcxSdPa2m__LqhsaL9t9A | 3.5 | 9 | ['Restaurants', 'Sandwiches', 'Delis', 'Fast F... |
| 3 | OH | Cleveland | 13181 Cedar Rd | Richie Chan's Chinese Restaurant | Pawavw9U8rjxWVPU-RB7LA | 3.5 | 22 | ['Chinese', 'Restaurants'] |
| 4 | OH | Northfield | 134 E Aurora Rd | Romeo's Pizza | RzVHK8Jfcy8RvXjn_z3OBw | 4.0 | 4 | ['Restaurants', 'Pizza'] |


Here is a quick look at the *'reviews.csv'* file in the **'data'** folder. Our analysis primarily uses the 'text' column, which stores the customers' reviews, and the 'stars' column, which stores the ratings customers gave to restaurants.

In [14]:
pd.read_csv('./data/reviews.csv',index_col=0).head()

| Unnamed: 0 | business_id | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|
| 52 | tulUhFYMvBkYHsjmn30A9w | 1 | 2013-11-19 | 0 | FsS5TUFPI8QJEE60-HR3dw | 2 | Wished it was better..\nAfter watching man vs.... | 1 | bWh4k_cCuVt5GLVd33xIxg |
| 53 | tulUhFYMvBkYHsjmn30A9w | 1 | 2014-12-18 | 0 | 7xGHiLP1vAaGmX6srC_XXw | 4 | Decor and service leave much to be desired, bu... | 0 | nQ4e81UdfczimYcIUtO3HA |
| 54 | tulUhFYMvBkYHsjmn30A9w | 1 | 2014-09-12 | 0 | ZWlXWc9LHPLiOksrp-enyw | 5 | My husband and I ate here tonight for the firs... | 0 | gJPa95ZRozMhiOqvENpspA |
| 55 | tulUhFYMvBkYHsjmn30A9w | 1 | 2012-02-28 | 1 | KpRwKYyQ93ypyDSdA7IXfw | 2 | Don't believe the hype. Nooooo! \n\nIn the Cle... | 5 | bAwfPH4lXNzgcYp9JFy6ow |
| 56 | tulUhFYMvBkYHsjmn30A9w | 3 | 2014-10-06 | 6 | OZvrgp4vWBsYqIt3-YMSEw | 3 | Don't believe the hype!\n\nAfter seeing this l... | 10 | BjtJ3VkMOxV2Lan037AFuw |


In summary, we focus on the 316 restaurants in the state of Ohio that have at least 100 customer reviews. The maximum number of reviews for a single restaurant in our sample is around 900. The distribution of restaurants' mean star ratings is skewed, with a peak around 4 stars.
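The review-count filter and the per-restaurant summary statistics could be computed along these lines. This is a sketch on toy data: the column names follow the *'reviews.csv'* preview, while the records, the `MIN_REVIEWS` name, and the lowered threshold are assumptions for illustration.

```python
import pandas as pd

# Toy review table with the two columns needed for the summary (values are made up).
reviews = pd.DataFrame(
    {
        "business_id": ["a", "a", "a", "b", "b", "c"],
        "stars": [5, 4, 3, 2, 4, 5],
    }
)

MIN_REVIEWS = 2  # the study uses 100; lowered here so the toy data survives the filter

# Number of reviews and mean star rating for each restaurant.
per_restaurant = reviews.groupby("business_id")["stars"].agg(
    n_reviews="count", mean_stars="mean"
)

# Keep only restaurants with enough reviews.
kept = per_restaurant[per_restaurant["n_reviews"] >= MIN_REVIEWS]
```

Histograms of `kept["n_reviews"]` and `kept["mean_stars"]` would give plots of the kind shown in the figures.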

<img src="./fig/NumberOfReviewsPerRestaurant.png">

<img src="./fig/MeanRatings.png">

## Latent Dirichlet Allocation

## The Multinomial Logistic Regression

## Conclusion

## Author Contributions

This repository and project are a collaboration between **Ningning Long**, **Yue You** and **Tian Xia**. Their contributions to the project are summarized below:

**Ningning Long**:
-	‘*.ipynb’, latent Dirichlet allocation modeling and analysis
-	‘makefile’
-	‘environment.yml’
-	Some write-up of ‘main.ipynb’

**Yue You**:
-	‘Modeling.ipynb’, statistical modeling of multinomial logistic regression
-	‘.gitignore’
-	Some write-up of ‘main.ipynb’

**Tian Xia**:
-	‘Data_Cleaning.ipynb’, data extraction and cleaning
-	‘README.md’
-	Some write-up of ‘main.ipynb’
-	‘LICENSE.md’
