# COGS 108 - Final Project 

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that PIDs will be scraped from the public submission, but student names will be included.)

* [ X ] YES - make available
* [  ] NO - keep private

# Overview

Everyone has an idea of what makes a restaurant upscale: we associate expensive food, a distinctive cuisine, and a fancy name with higher ratings. In this project, we use data from the Michelin Star Restaurant Guide to find out whether these factors really do make a restaurant more likely to gain stars. We then compare the results against data from Yelp to see whether the same factors relate to ratings on Yelp’s site. We used visualizations and predictive models to determine whether cuisine, price, or name has any significant relationship with higher ratings. 

# Names

- Marcus Choy-Arioli
- Kassandra Johnson
- Howard Kim
- Janae Zhang

# Group Members IDs

- A15752688
- A14922596
- A16178162
- A15008235

# Research Question

*Of the Michelin Star Restaurants in the United States, do factors such as price, cuisine, and/or name contribute to a higher chance of getting more Michelin Stars? Do these factors also contribute to the ratings of non-Michelin star restaurants?*

## Background and Prior Work

When the Michelin Guide was started, its original goal was to promote travel, which in turn would theoretically increase tire sales as more people drove to visit the locations in the guide. By 1920, the guide itself cost 7 francs, and in 1936 the criteria for star rankings were published ("About the Michelin Guide"). Over time, the idea of particular restaurants being worth traveling to has turned the Michelin star into the symbol of prestige as we know it today. As of March 2020, Japan hosts the most Michelin-starred restaurants, with France coming in a close second (Boland). However, while Michelin stars are awarded worldwide, the official guides are broken down by region -- in some cases each country is a region, but in places like China and the US, regions are broken down by city (Boland).

Michelin claims that five factors go into determining which restaurants get awarded stars -- three are based around food quality, with the other two considering the chef's personality and financial value of the experience (Bennett, 2017). However, the heavy connotations of a Michelin star have prompted some restaurants to ask the company to remove their star (Bennett, 2017) (Heighton-Ginns, 2018). It's clear from these anecdotes that the average layperson has a specific idea in their mind about what a Michelin-star restaurant is like, and as such there is a possibility of bias among Michelin's own restaurant inspectors. While food critics are quick to point to examples of casual eateries that have earned three stars, whether this particular standard is evenly applied remains to be seen.

To help mitigate the number of confounding factors, we've also opted to compare any findings from the Michelin star data to Yelp data. Since Yelp is based on crowdsourced reviews (McCoy, 2020), and used most often for restaurants, it provides a good metric for understanding if any results from the Michelin data are truly statistically significant when compared to an overall population. 

We will also be limiting our analysis to restaurants in the United States, as cultural norms and tastes may affect how Michelin stars are awarded in other countries. The factors we will be considering in our analysis are the restaurant's name, cuisine, location, and price. 


References (include links):
- 1) About the MICHELIN Guide. Retrieved April 24, 2020, from https://guide.michelin.com/en/about-us

- 2) Boland, K. (2020, March 9). Countries That Have The Most Michelin Restaurants. Retrieved from https://www.worldatlas.com/articles/countries-that-have-the-most-michelin-restaurants.html

- 3) Bennett, M. (2017, October 3). Star quality: What does it take to win the Michelin award? Retrieved from https://www.bbc.com/news/uk-scotland-scotland-business-41449529

- 4) Heighton-Ginns, L. (2018, October 12). The Business Behind Michelin Stars. Retrieved from https://www.bbc.com/news/business-45733941

- 5) McCoy, J. (2020, April 26). 15 Things You May Not Know About Yelp. Retrieved from https://www.searchenginejournal.com/yelp-facts/355044/


# Hypothesis


 
We hypothesize that the listed factors will contribute to a restaurant's ability to obtain Michelin stars. We also believe that the type of cuisine will be the strongest contributing factor.

Null Hypothesis: There will be no effect. A restaurant's price, cuisine, or name will not matter in the likelihood of higher ratings.

Alternative Hypothesis: There will be a change in ratings dependent on at least one of the factors. 


# Datasets

The following three datasets are the main datasets for our study. Because all three share the same format, we combine them into a single dataframe, which we then clean and wrangle for our model. Since the source is set to update automatically with new information, note that the data used here was retrieved in June 2020.

**Dataset #1**
*   **Name:** one-star-michelin-restaurants
*   **Link to the dataset:** https://www.kaggle.com/jackywang529/michelin-restaurants
*   **Number of observations:** 544
*   **Description:** This is a dataset containing all the one-star Michelin Star Restaurants as of 2019, in most of the regions that have Michelin Star Restaurants. Each restaurant has its name, the year it was awarded stars, location information, cuisine, price, and a url to the restaurant.

**Dataset #2**
- **Name:** two-star-michelin-restaurants
- **Link to the dataset:** https://www.kaggle.com/jackywang529/michelin-restaurants
- **Number of observations:** 110
- **Description:** This is a dataset containing all the two-star Michelin Star Restaurants as of 2019, in most of the regions that have Michelin Star Restaurants. Each restaurant has its name, the year it was awarded stars, location information, cuisine, price, and a url to the restaurant. 
 
**Dataset #3**
- **Name:**  three-star-michelin-restaurants
- **Link to the dataset:** https://www.kaggle.com/jackywang529/michelin-restaurants
- **Number of observations:** 36
- **Description:** This is a dataset containing all the three-star Michelin Star Restaurants as of 2019, in most of the regions that have Michelin Star Restaurants. Each restaurant has its name, the year it was awarded stars, location information, cuisine, price, and a url to the restaurant. 


The Yelp Dataset is our supporting dataset, used to test our model. We compare Michelin star values against these Yelp businesses by putting Michelin Stars and Yelp Stars on a common scale, to see how closely the two rating systems match.
 
**Dataset #4**
- **Name:**  Yelp Dataset
- **Link to the dataset:** https://www.kaggle.com/yelp-dataset/yelp-dataset
- **Number of observations:** 209390
- **Description:** This is the Yelp dataset, which has all the information available from Yelp. It includes the name of each place, location information, reviews, stars, and categories describing the type of place. 

In [None]:
!pip install --user geopandas
!pip install --user descartes
!pip install --user nltk
import nltk


# Setup

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

from shapely.geometry import Point, Polygon

import matplotlib.colors as colors
import matplotlib.cm as cmap
from matplotlib.font_manager import FontProperties
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
from pandas import json_normalize  # public location since pandas 1.0; pandas.io.json.json_normalize is deprecated
import json


nltk.download('punkt')
nltk.download('stopwords')

# Display options
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

np.set_printoptions(threshold=np.inf)

# Data Imports

Import the Michelin Star Restaurant data (.csv) and Yelp's public business data (.json). Note that the Yelp data required initial cleaning and wrangling due to file size limitations on GitHub. The code used to clean and wrangle the raw data is shown in the Data Cleaning section.

In [None]:
one_star_path = 'Datasets/datasets_279628_577823_one-star-michelin-restaurants.csv'
one_df = pd.read_csv(one_star_path)
one_df.shape

In [None]:
two_star_path = 'Datasets/datasets_279628_577823_two-stars-michelin-restaurants.csv'
two_df = pd.read_csv(two_star_path)
two_df.shape

In [None]:
three_star_path = 'Datasets/datasets_279628_577823_three-stars-michelin-restaurants.csv'
three_df = pd.read_csv(three_star_path)
three_df.shape

In [None]:
yelp_path = 'Datasets/yelp.json'
yelp_df = pd.read_json(yelp_path)
yelp_df.shape

# Data Cleaning

Because the Michelin Starred Restaurant data was already fairly clean, we adopted its format and organized and cleaned Yelp's raw business data to match it.

The code below performed the initial cleaning and wrangling of the raw Yelp business data. Because of file size constraints on GitHub, these steps were run externally on University of California, San Diego (UCSD) DataHub servers, which can handle larger files. We found that Yelp's business data included businesses that were not restaurants, so we filtered the data down to businesses with the 'restaurant' tag in the 'categories' column. The raw data also had a few rows with missing 'categories' values, which on initial inspection we determined were not usable in our study. Therefore, we dropped all businesses with empty categories as well as those without 'restaurant' in the 'categories' column of our dataframe.

In [None]:
# Initial Data Wrangling done before being added to the repo
# Reason: Due to file size constraints for GitHub


# raw_df = pd.read_json('yelp_academic_dataset_business.json', lines = True)
  # read in raw data
# raw_df = raw_df.drop(columns=['business_id', 'address', 'hours'])
  # drop initial unnecessary data columns
# raw_df = raw_df.dropna(subset=['categories'])
  # drop data with missing categories
# raw_df = raw_df[raw_df['categories'].str.contains("Restaurant")]
  # create a new dataframe with categories that contains 'Restaurant'

We realized that our raw Yelp business data also included businesses from Canada. These can be identified by their postal codes: Canadian postal codes contain letters, while US ZIP codes are all digits.

In [None]:
# raw_df = raw_df[raw_df['postal_code'].str.isdigit()]
  # remove any businesses whose postal code is not all digits
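
As a minimal sketch of the postal code filter described above, the toy rows below (hypothetical names and codes, not taken from the Yelp data) show how `str.isdigit()` separates all-digit US ZIP codes from Canadian postal codes:

```python
import pandas as pd

# Hypothetical toy rows mixing US ZIP codes and a Canadian postal code
toy_df = pd.DataFrame({
    'name': ['Diner A', 'Bistro B', 'Cafe C'],
    'postal_code': ['85016', 'M5V 2T6', '89109'],
})

# str.isdigit() is True only when every character is a digit,
# so codes containing letters or spaces are filtered out
us_only = toy_df[toy_df['postal_code'].str.isdigit()]
print(us_only['name'].tolist())
```

Note that this filter would also drop any code containing a hyphen or other punctuation, and rows with a missing postal code would need to be handled before the mask is applied.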

Furthermore, our raw Yelp business data contained an 'attributes' column, which held our price range data; however, the values in the 'attributes' column were in a nested JSON format. To clean this data, we exported our initially cleaned dataframe back to JSON. Then, we used json_normalize from pandas (together with the json library) to flatten the 'attributes' column into a new dataframe in which each tag inside 'attributes' becomes a standalone column. Finally, we dropped the unnecessary columns by creating a new dataframe containing only the columns we want for our study.

In [None]:
# new_df = raw_df.to_json(orient='records')
  # export back to json
# finaldf = json_normalize(json.loads(new_df), meta=['name']) 
  # normalize the exported json to include 'attribute' stand alone columns
# list(finaldf.columns)
# finaldf = finaldf[['name', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'stars', 'review_count', 'categories', 'attributes.RestaurantsPriceRange2']]
  # new dataframe to drop unnecessary columns
# finaldf.to_json('yelp.json')
  # export back out to json file to be included in our datasets on GitHub


# Specs after initial clean and wrangle
# Size = ~8mb
# Rows = 38216 rows
# Columns = 10 Columns
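
To illustrate the flattening step, here is a small sketch using `pd.json_normalize` on two hypothetical records shaped like the Yelp business data. The record values are made up; only the `attributes.RestaurantsPriceRange2` column name comes from the code above.

```python
import pandas as pd

# Hypothetical records: 'attributes' is a nested dict, as in the raw Yelp data
records = [
    {'name': 'Diner A', 'stars': 4.0,
     'attributes': {'RestaurantsPriceRange2': '2', 'WiFi': 'free'}},
    {'name': 'Bistro B', 'stars': 3.5,
     'attributes': {'WiFi': 'no'}},
]

# Nested keys become dotted column names such as 'attributes.WiFi'
flat = pd.json_normalize(records)
print(flat[['name', 'attributes.RestaurantsPriceRange2']])
```

A record that lacks a given tag simply gets a missing value in that column, which is why the price column needs further cleaning later on.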

Next is the data cleaning for the Michelin Starred Restaurants. We added a 'stars' column to each imported dataframe to record how many stars its restaurants hold. Then, we concatenated all three dataframes (one star, two star, three star) into a single dataframe named mich_df (Michelin Dataframe). 

From there, we did an initial inspection of the region column to see which regions of Michelin starred restaurants were included. Because we would like to focus only on the United States, we filtered our data to include only US regions, which were California, New York City, Washington DC, and Chicago. We then dropped any columns that were unnecessary or redundant for our study.

In [None]:
one_df['stars'] = 1
  # label our one starred restaurants
two_df['stars'] = 2
  # label our two starred restaurants
three_df['stars'] = 3
  # label our three starred restaurants
concatList = [one_df,two_df,three_df]
mich_df = pd.concat(concatList)
  # combine our three dataframes into a single dataframe for Michelin restaurants

mich_df['region'].value_counts()
  # see all our regions inside Michelin regions column
main_df = mich_df[(mich_df.region == 'California') | (mich_df.region == 'New York City') | (mich_df.region == 'Washington DC') | (mich_df.region == 'Chicago')]
  # include only regions from United States
main_df = main_df.drop(columns=['year', 'zipCode', 'url', 'region'])
  # drop columns that are unnecessary or redundant
main_df.reset_index(inplace=True, drop=True)
  # reset index
main_df


We then needed to clean up the price column to have it represented on a numerical scale, as the original Michelin price ranges were indicated by a certain number of "$".

In [None]:
def standardize_price(price):
    # Count the '$' characters; treat missing or non-string values as NaN
    if isinstance(price, str) and '$' in price:
        return len(price)
    else:
        return np.nan
    
main_df['price'] = main_df['price'].apply(standardize_price)

Going back to the Yelp dataset, we needed further cleaning and wrangling to suit our needs. This involved renaming a few columns and cleaning the cuisine column. Through careful inspection of the data, we concluded that in the Yelp data, the '(NEW)' value in the cuisine category refers to 'Contemporary' cuisine. 

In [None]:
yelp_df = yelp_df.rename(columns={'attributes.RestaurantsPriceRange2': 'price', 'categories': 'cuisine'})
yelp_df = yelp_df.dropna()
yelp_df = yelp_df[['name', 'latitude', 'longitude', 'city', 'cuisine', 'price', 'stars']]
yelp_df

                  

In [None]:
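# Rows whose categories mention 'New'
new = yelp_df[yelp_df['cuisine'].str.contains('New', na=False)]
# Exclude fast food and 'Traditional' entries
cont = new[~new['cuisine'].str.contains('Fast Food', na=False)]
contemporary = cont[~cont['cuisine'].str.contains('Traditional', na=False)].copy()
# Relabel the remaining rows and write them back into yelp_df by index
contemporary['cuisine'] = 'Contemporary'
yelp_df.update(contemporary)
yelp_df.cuisine.value_counts()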





In [None]:
def standardize_category(string):
    
    string = str(string)
    string = string.lower()
    string = string.strip()
    string = string.replace(' ', '')
    
    if 'chinese' in string:
        output = 'Chinese'
    elif 'american' in string:
        output = 'American'
    elif 'contemporary' in string:
        output = 'Contemporary'
    elif 'seafood' in string:
        output = 'Seafood'
    elif 'mexican' in string:
        output = 'Mexican'
    elif 'italian' in string:
        output = 'Italian'
    elif 'pizza' in string:
        output = 'Italian'
    elif 'japanese' in string:
        output = 'Japanese'
    elif 'sushi' in string:
        output = 'Japanese'
    elif 'greek' in string:
        output = 'Greek'
    elif 'french' in string:
        output = 'French'
    elif 'thai' in string:
        output = 'Thai'
    elif 'spanish' in string:
        output = 'Spanish'
    elif 'indian' in string:
        output = 'Indian'
    elif 'middleeastern' in string:
        output = 'Middle Eastern'
    elif 'hawaiian' in string:
        output = 'Hawaiian'
    elif 'southern' in string:
        output = 'Southern'
    elif 'mediterranean' in string:
        output = 'Mediterranean'
    elif 'korean' in string:
        output = 'Korean'
    elif 'vietnamese' in string:
        output = 'Vietnamese'
    elif 'steakhouse' in string:
        output = 'Steakhouse'
    # Otherwise, if uncaught - mark as missing
    else:
        return np.nan
    return output     
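
The same first-match logic can also be written as an ordered lookup table, which may be easier to extend than a long `elif` chain. The sketch below is an illustrative alternative, not the code used in this notebook; `CUISINE_KEYWORDS` and `standardize_category_table` are hypothetical names, and the keyword list is abbreviated.

```python
# Ordered (keyword, label) pairs; earlier entries win, mirroring the elif chain.
# This list is shortened for illustration only.
CUISINE_KEYWORDS = [
    ('chinese', 'Chinese'),
    ('american', 'American'),
    ('contemporary', 'Contemporary'),
    ('pizza', 'Italian'),
    ('sushi', 'Japanese'),
    ('middleeastern', 'Middle Eastern'),
]

def standardize_category_table(value):
    """Return the first matching cuisine label, or NaN if nothing matches."""
    text = str(value).lower().strip().replace(' ', '')
    for keyword, label in CUISINE_KEYWORDS:
        if keyword in text:
            return label
    return float('nan')  # same missing-value marker as np.nan

print(standardize_category_table('Sushi Bars, Japanese'))
print(standardize_category_table('Middle Eastern, Falafel'))
```

Because the pairs are checked in order, a category string that mentions several cuisines is assigned to whichever keyword appears first in the table, just as in the function above.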

In [None]:
yelp_df['cuisine'] = yelp_df['cuisine'].apply(standardize_category)
yelp_df['cuisine'].value_counts().loc[lambda x : x>25] 


To fix the prices in the Yelp data, we just needed to convert any "None" values to NaNs.

In [None]:
def fix_yelp_prices(price):
    if price == "None":
        return np.nan
    else:
        return float(price)
    
yelp_df['price'] = yelp_df['price'].apply(fix_yelp_prices)
yelp_df = yelp_df.dropna()
yelp_df.reset_index(inplace=True, drop=True)

# Data Analysis & Results

We started with EDA and Visualizations then moved on to more structured analysis. This section is broken down by each factor we were analyzing.

## Cuisine

We want to see how the type of cuisine contributes to Michelin star ratings. We are looking at the visualization of each type of cuisine compared to the number of star ratings it received (1, 2, or 3). We will then do the same process for the Yelp Data. In order to do this we used grouped countplots from seaborn. 

In [None]:
# Compare the number of restaurants at each star level across cuisines
# for the Michelin data using a grouped bar plot

# Set the parameters for the graph
plt.rcParams['figure.figsize'] = (12, 17)
sns.set(font_scale=2.0, style="white")

# Enter the data the graph will show
# Make the graph horizontal so it is easier to view
mich_cus = sns.countplot(y = 'cuisine', hue = 'stars', 
              data = main_df, order = main_df['cuisine'].value_counts().index  )

#Label the graph
mich_cus.set_title('Michelin Star Amount by Type of Cuisine', loc='left')
mich_cus.set_ylabel('Cuisine')
mich_cus.set_xlabel('Count')

We can see from this graph that most Michelin restaurants are rated one star. We can also tell that the more restaurants there are in a particular category, the more three-star ratings that category has. However, this does not show whether contemporary cuisine, for example, is more likely to get a higher rating or simply has more restaurants. 
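
One way to separate "more likely to earn a higher rating" from "has more restaurants" is to look at row-normalized proportions instead of raw counts. The sketch below uses hypothetical toy data standing in for `main_df[['cuisine', 'stars']]`; it is an illustration of the idea, not a result from our dataset.

```python
import pandas as pd

# Hypothetical toy data: cuisine label and star count per restaurant
toy = pd.DataFrame({
    'cuisine': ['Contemporary'] * 6 + ['Japanese'] * 3 + ['French'],
    'stars':   [1, 1, 1, 1, 2, 3, 1, 2, 3, 1],
})

# normalize='index' makes each row sum to 1: the share of a cuisine's
# restaurants at each star level, independent of how many there are
star_share = pd.crosstab(toy['cuisine'], toy['stars'], normalize='index')
print(star_share.round(2))
```

In this toy example, Contemporary has the most restaurants overall, yet a smaller share of them hold three stars than is the case for Japanese.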

In [None]:
# Compare the number of restaurants at each star level across cuisines
# for the Yelp data using a grouped bar plot, to compare with the Michelin graph

# Set the parameters of the Yelp graph
plt.rcParams['figure.figsize'] = (12, 27)

# Enter the data the graph will show
# Make the graph horizontal so it is easier to view
yelp_cus = sns.countplot(y = 'cuisine', hue = 'stars', 
              data= yelp_df, order = yelp_df['cuisine'].value_counts().index  )

#label the graph
yelp_cus.set_title('Yelp Star Amount by Type of Cuisine', loc='left')
yelp_cus.set_ylabel('Cuisine')
yelp_cus.set_xlabel('Count')

This plot again makes it hard to see whether cuisine has a significant effect on rating. It mostly shows the number of restaurants in each type of cuisine.

To make this easier to determine, we looked at the average number of stars by cuisine in both the Michelin and Yelp datasets. We use the groupby function to find the mean number of stars per cuisine. 

In [None]:
#Find the average star amount by type of cuisine for Michelin data
main_df.groupby('cuisine', as_index=False)['stars'].mean()

In [None]:
#Find the average star amount by type of cuisine for Yelp data
yelp_df.groupby('cuisine', as_index=False)['stars'].mean()

The mean scores for the Yelp and Michelin data give a better picture of average star ratings by cuisine, removing the effect of how many restaurants fall in each category. From these averages, we do not see a large difference in star ratings based on the type of cuisine, although comparing means alone is not a formal significance test. 
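
If a formal check of the cuisine effect is wanted, one option is a permutation test on the spread of the group means. This is a technique we did not use in the analysis above; the sketch below is a minimal illustration on hypothetical toy data, and `permutation_test_group_means` is a made-up helper name.

```python
import numpy as np
import pandas as pd

def permutation_test_group_means(df, group_col, value_col, n_perm=1000, seed=0):
    """Estimate how often shuffled data gives group means as spread out as observed."""
    rng = np.random.default_rng(seed)
    values = df[value_col].to_numpy(dtype=float)
    groups = df[group_col].to_numpy()

    def spread(vals):
        # Test statistic: variance of the per-group means
        return pd.Series(vals).groupby(groups).mean().var()

    observed = spread(values)
    # Shuffling the values across groups simulates "cuisine has no effect"
    perm_stats = np.array([spread(rng.permutation(values)) for _ in range(n_perm)])
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (np.sum(perm_stats >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical toy data with an obvious difference between two groups
toy = pd.DataFrame({'cuisine': ['A'] * 5 + ['B'] * 5,
                    'stars': [1, 1, 1, 1, 1, 3, 3, 3, 3, 3]})
observed, p_value = permutation_test_group_means(toy, 'cuisine', 'stars')
print(observed, p_value)
```

A small p-value suggests the group means differ more than random relabeling would produce. With many cuisines and very uneven group sizes, as in our data, the choice of test statistic matters and this simple version should be read as a starting point only.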

## Price

In this section, we first evaluate the distribution of Michelin stars among different price points, then repeat that for the range of Yelp stars.

In [None]:
#Set the parameters for the visualizations
plt.rcParams['figure.figsize'] = (12, 7)

In [None]:
mich_price = sns.countplot(y = 'price', hue = 'stars', data = main_df)

mich_price.set_title('Michelin Star Amount by Price', loc='left')
mich_price.set_ylabel('Price')
mich_price.set_xlabel('Count')

From the above graph, it's clear that Michelin restaurants with 2 or more stars are in the highest price range, and that the majority of all Michelin-starred restaurants are as well.

Repeat for the Yelp star data:

In [None]:
yelp_price = sns.countplot(y = 'price', hue = 'stars', data = yelp_df)

yelp_price.set_title('Yelp Star Amount by Price', loc='left')
yelp_price.set_ylabel('Price')
yelp_price.set_xlabel('Count')

While this data is more widely distributed, the majority of restaurants are in the lower price ranges, with the majority of ratings for all restaurants being in the 3-4 star range. Restaurants in the highest price range are mostly given 4 out of 5 stars, with 3 stars being the lowest rating.

Now to take a look at the basic statistics for the Michelin star data:

In [None]:
mich_desc = main_df.describe()
mich_desc

Ignoring latitude and longitude for now, we can see some basic trends in the price and stars columns. With the average price in the upper half of the range while the average number of stars is close to one, it appears that Michelin-starred restaurants trend higher in price than most. This is especially clear when compared to the same analysis on the Yelp data below, where the average price is in the lower half and the number of stars, even when scaled to the Michelin range, is still higher than in the Michelin data.

In [None]:
yelp_desc = yelp_df.describe()
yelp_desc

To see just how strongly price ties to the number of Michelin stars a restaurant receives, we turn to Pearson correlation.

In [None]:
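# Pearson correlation over the numeric columns only (text columns cannot be correlated)
mich_corrs = main_df.select_dtypes('number').corr()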
mich_corrs

From the above, we can see a correlation of 0.33 between price and number of stars. We then do the same with the Yelp data.



In [None]:
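# Pearson correlation over the numeric columns only (text columns cannot be correlated)
yelp_corrs = yelp_df.select_dtypes('number').corr()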
yelp_corrs

Using either stars or scaled_stars, we see a correlation of about 0.14 between price and stars. The correlation in the general restaurant data is therefore less than half as strong as it is among Michelin-starred restaurants.
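
For reference, the Pearson coefficient reported by `.corr()` can be computed by hand as the covariance of the two columns divided by the product of their standard deviations. The sketch below uses hypothetical price and star values, and `pearson_r` is an illustrative helper rather than part of our analysis code.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations over the root of the squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical toy values for price level and star count
price = [2, 3, 3, 4, 4, 4]
stars = [1, 1, 2, 1, 2, 3]
r = pearson_r(price, stars)
print(round(r, 3), round(r ** 2, 3))
```

Under a simple linear fit, the squared coefficient is the share of variance explained, so the correlations of 0.33 and 0.14 reported above correspond to roughly 11% and 2% of the variance in stars, respectively.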

### Visualizing the price data

From here, we fit a linear regression of stars on price for both the Michelin data and the scaled Yelp data. Note that since the data points themselves take only a few discrete values, the scatter plot will appear to have very few data points.

In [None]:
# Use polyfit to determine values for linear regression line
a1, b1 = np.polyfit(main_df['price'], main_df['stars'], 1)
pred_mich_stars = (a1 * np.arange(0, 5, 1) + b1)

# Graph predicted number of stars based on price using Michelin star data
main_df.plot.scatter('price', 'stars')
plt.plot(np.arange(0, 5), pred_mich_stars, 'r')

In [None]:
# Scale Yelp stars down to the Michelin star range (5 stars -> 3 stars)
yelp_df['scaled_stars'] = yelp_df['stars'] * 0.6
# Keep only the rows that have a price
yelp_stars_df = yelp_df[yelp_df['price'].notnull()]
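
The factor 0.6 comes from the ratio of the two scale maxima (3 / 5), and the factor 1.667 used later in the predictor cells is its inverse (5 / 3). The sketch below spells this out; the function names are hypothetical, and the min-max variant is shown only as an alternative that this notebook does not use.

```python
def yelp_to_michelin(stars):
    """Proportional rescale from Yelp's 5-star maximum to Michelin's 3-star maximum."""
    return stars * 3 / 5  # same as multiplying by 0.6

def michelin_to_yelp(stars):
    """Inverse rescale; 5 / 3 is approximately 1.667."""
    return stars * 5 / 3

def yelp_to_michelin_minmax(stars):
    """Alternative (not used here): map the range 1-5 onto the range 1-3."""
    return 1 + (stars - 1) * (3 - 1) / (5 - 1)

print(yelp_to_michelin(5), yelp_to_michelin(1), yelp_to_michelin_minmax(1))
```

With the proportional scaling, a one-star Yelp rating maps to 0.6, which is below the one-star minimum in the Michelin data, so the two scaled ranges do not line up exactly at the low end.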


In [None]:
# Use polyfit to determine values for linear regression line
a2, b2 = np.polyfit(yelp_stars_df['price'], yelp_stars_df['scaled_stars'], 1)
pred_yelp_stars = (a2 * np.arange(0, 5, 1) + b2)

# Graph predicted number of stars based on price using Yelp data
yelp_df.plot.scatter('price', 'scaled_stars')
plt.plot(np.arange(0, 5), pred_yelp_stars, 'r')

# Scikit Learn Predictors

## Setup


In [None]:
# import libraries
from sklearn.svm import SVR
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets, linear_model

from nltk.tokenize import word_tokenize


# tfidf vectorizer for restaurant names and cuisine
tfidf = TfidfVectorizer(sublinear_tf=True, analyzer='word', max_features=2000, tokenizer=word_tokenize)

# Support Vector Machines (SVR) for TF-IDF on Restaurant Names & Cuisine
def train_SVM(X, y):
    clf = SVR()
    clf.fit(X,y)
    return clf
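
To make the vectorizer settings more concrete, the sketch below computes TF-IDF weights by hand in plain Python. It assumes the smoothed idf formula, sublinear term frequency, and L2 normalization that scikit-learn documents for `TfidfVectorizer`, and it splits on whitespace instead of using `word_tokenize`. The restaurant names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """L2-normalized TF-IDF with sublinear tf and smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Sublinear tf replaces the raw count with 1 + ln(count)
        row = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0
               for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

names = ['le petit bistro', 'golden dragon', 'le jardin']
vocab, X = tfidf_vectors(names)
print(vocab)
print([round(v, 3) for v in X[0]])
```

Words that appear in many names, such as "le" in this toy list, receive a lower weight than rarer words, which is what lets the model focus on distinctive terms.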

In [None]:
# Original Data Star Values
mich_stars = main_df[['stars']]
yelp_stars = yelp_df[['stars']]

# Training Dataframes
mich_train_df = main_df[['name','cuisine','price','stars']]
yelp_train_df = yelp_df[['name','cuisine','price','stars']]

# Testing Dataframes
mich_test_df = main_df[['name','cuisine','price']]
yelp_test_df = yelp_df[['name','cuisine','price']]

# Concatenated Dataframes based on training data
concat_mich_train_df = pd.concat([mich_train_df, yelp_test_df])
concat_yelp_train_df = pd.concat([yelp_train_df, mich_test_df])
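
The concatenation above relies on a simple trick: rows from the dataframe without a 'stars' column get NaN in that column, so a null check later separates training rows from test rows. The sketch below shows this on hypothetical toy frames. Note that `ignore_index=True` is our addition for clarity and is not used in the cell above.

```python
import pandas as pd

# Hypothetical toy frames: one labeled with stars, one without
labeled = pd.DataFrame({'name': ['A', 'B'], 'price': [4, 3], 'stars': [2, 1]})
unlabeled = pd.DataFrame({'name': ['C', 'D', 'E'], 'price': [1, 2, 2]})

# Rows from `unlabeled` receive NaN in the 'stars' column
combined = pd.concat([labeled, unlabeled], ignore_index=True)
is_train = combined['stars'].notnull()

train_rows = combined[is_train]
test_rows = combined[~is_train]
print(len(train_rows), len(test_rows))
```

Combining the frames before vectorizing gives both sets the same vocabulary and feature columns, at the cost that the test names also influence the TF-IDF weights.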

## Name TF-IDF

### Michelin Dataset Trained

In [None]:
# Vectorize the restaurant names for training
name_mich_tfidf_X = tfidf.fit_transform(concat_mich_train_df['name']).toarray()

# Split the training data from concatenated dataframe
name_mich_train_X = name_mich_tfidf_X[~concat_mich_train_df['stars'].isnull()]
name_mich_train_y = concat_mich_train_df['stars'][~concat_mich_train_df['stars'].isnull()]

In [None]:
# run SVM
name_mich_clf = train_SVM(name_mich_train_X, name_mich_train_y)

In [None]:
# Get testing data and run predictor
name_yelp_test_X = name_mich_tfidf_X[concat_mich_train_df['stars'].isnull()]
name_pred_yelp_test_y = name_mich_clf.predict(name_yelp_test_X)

In [None]:
# Wrangling Old data and New Predictor values to a new dataframe
name_pred_yelp_stars_df = pd.DataFrame(data=name_pred_yelp_test_y, columns=["stars"])
name_pred_yelp_stars_df = name_pred_yelp_stars_df.rename(columns={'stars': 'Predicted Mich'})
name_pred_yelp_stars_df['Yelp'] = yelp_stars
name_pred_yelp_stars_df['Pred Mich Scaled'] = 1.667 * name_pred_yelp_stars_df['Predicted Mich']
name_pred_yelp_stars_df

### Yelp Dataset Trained

Because the Yelp dataset is so large, this training ran for over an hour and caused the program to crash on our personal computers, so we have opted to comment out this code. In the one instance where it did run, the results did not appear to be meaningful. The same applies to the other commented-out Yelp-trained models.

In [None]:
# # Vectorize the restaurant names for training
# name_yelp_tfidf_X = tfidf.fit_transform(concat_yelp_train_df['name']).toarray()

# # Split the training data from concatenated dataframe
# name_yelp_train_X = name_yelp_tfidf_X[~concat_yelp_train_df['stars'].isnull()]
# name_yelp_train_y = concat_yelp_train_df['stars'][~concat_yelp_train_df['stars'].isnull()]

In [None]:
# # run SVM
# name_yelp_clf = train_SVM(name_yelp_train_X, name_yelp_train_y)

In [None]:
# # Get testing data and run predictor
# name_mich_test_X = name_yelp_tfidf_X[concat_yelp_train_df['stars'].isnull()]
# name_pred_mich_test_y = name_yelp_clf.predict(name_mich_test_X)

In [None]:
# # Wrangling Old data and New Predictor values to a new dataframe
# name_pred_mich_stars_df = pd.DataFrame(data=name_pred_mich_test_y, columns=["stars"])
# name_pred_mich_stars_df = name_pred_mich_stars_df.rename(columns={'stars': 'Predicted Yelp'})
# name_pred_mich_stars_df['Mich'] = mich_stars
# name_pred_mich_stars_df['Pred Yelp Scaled'] = 0.6 * name_pred_mich_stars_df['Predicted Yelp']
# name_pred_mich_stars_df

## Cuisine TF-IDF

### Michelin Dataset Trained

In [None]:
# Vectorize the cuisine labels for training
cuisine_mich_tfidf_X = tfidf.fit_transform(concat_mich_train_df['cuisine']).toarray()

# Split the training data from concatenated dataframe
cuisine_mich_train_X = cuisine_mich_tfidf_X[~concat_mich_train_df['stars'].isnull()]
cuisine_mich_train_y = concat_mich_train_df['stars'][~concat_mich_train_df['stars'].isnull()]

In [None]:
# run SVM
cuisine_mich_clf = train_SVM(cuisine_mich_train_X, cuisine_mich_train_y)

In [None]:
# Get testing data and run predictor
cuisine_yelp_test_X = cuisine_mich_tfidf_X[concat_mich_train_df['stars'].isnull()]
cuisine_pred_yelp_test_y = cuisine_mich_clf.predict(cuisine_yelp_test_X)

In [None]:
# Wrangling Old data and New Predictor values to a new dataframe
cuisine_pred_yelp_stars_df = pd.DataFrame(data=cuisine_pred_yelp_test_y, columns=["stars"])
cuisine_pred_yelp_stars_df = cuisine_pred_yelp_stars_df.rename(columns={'stars': 'Predicted Mich'})
cuisine_pred_yelp_stars_df['Yelp'] = yelp_stars
cuisine_pred_yelp_stars_df['Pred Mich Scaled'] = 1.667 * cuisine_pred_yelp_stars_df['Predicted Mich']
cuisine_pred_yelp_stars_df

### Yelp Dataset Trained

In [None]:
# # Vectorize the cuisine labels for training
# cuisine_yelp_tfidf_X = tfidf.fit_transform(concat_yelp_train_df['cuisine']).toarray()

# # Split the training data from concatenated dataframe
# cuisine_yelp_train_X = cuisine_yelp_tfidf_X[~concat_yelp_train_df['stars'].isnull()]
# cuisine_yelp_train_y = concat_yelp_train_df['stars'][~concat_yelp_train_df['stars'].isnull()]

In [None]:
# # run SVM
# cuisine_yelp_clf = train_SVM(cuisine_yelp_train_X, cuisine_yelp_train_y)

In [None]:
# # Get testing data and run predictor
# cuisine_mich_test_X = cuisine_yelp_tfidf_X[concat_yelp_train_df['stars'].isnull()]
# cuisine_pred_mich_test_y = cuisine_yelp_clf.predict(cuisine_mich_test_X)

In [None]:
# # Wrangling Old data and New Predictor values to a new dataframe
# cuisine_pred_mich_stars_df = pd.DataFrame(data=cuisine_pred_mich_test_y, columns=["stars"])
# cuisine_pred_mich_stars_df = cuisine_pred_mich_stars_df.rename(columns={'stars': 'Predicted Yelp'})
# cuisine_pred_mich_stars_df['Mich'] = mich_stars
# cuisine_pred_mich_stars_df['Pred Yelp Scaled'] = 0.6 * cuisine_pred_mich_stars_df['Predicted Yelp']
# cuisine_pred_mich_stars_df

## Price Linear Regression

### Michelin Dataset Trained

In [None]:
# Split the training data from concatenated dataframe
price_mich_train_X = concat_mich_train_df['price'][~concat_mich_train_df['stars'].isnull()]
price_mich_train_y = concat_mich_train_df['stars'][~concat_mich_train_df['stars'].isnull()]

# Reshape to a 2D column array, as scikit-learn expects
price_mich_train_X = price_mich_train_X.values.reshape(-1,1)
price_mich_train_y = price_mich_train_y.values.reshape(-1,1)

In [None]:
# run Linear Regression
mich_train_regr = linear_model.LinearRegression()
mich_train_regr.fit(price_mich_train_X, price_mich_train_y)

In [None]:
# Get testing data and run predictor
price_yelp_test_X = concat_mich_train_df['price'][concat_mich_train_df['stars'].isnull()]
price_yelp_test_X = price_yelp_test_X.values.reshape(-1,1)
price_pred_yelp_test_y = mich_train_regr.predict(price_yelp_test_X)

In [None]:
# Wrangling Old data and New Predictor values to a new dataframe
price_pred_yelp_stars_df = pd.DataFrame(data=price_pred_yelp_test_y, columns=["stars"])
price_pred_yelp_stars_df = price_pred_yelp_stars_df.rename(columns={'stars': 'Predicted Mich'})
price_pred_yelp_stars_df['Yelp'] = yelp_stars
price_pred_yelp_stars_df['Pred Mich Scaled'] = 1.667 * price_pred_yelp_stars_df['Predicted Mich']
price_pred_yelp_stars_df

### Yelp Dataset Trained

In [None]:
# Split the training data from concatenated dataframe
price_yelp_train_X = concat_yelp_train_df['price'][~concat_yelp_train_df['stars'].isnull()]
price_yelp_train_y = concat_yelp_train_df['stars'][~concat_yelp_train_df['stars'].isnull()]

# Reshape to a 2D column array, as scikit-learn expects
price_yelp_train_X = price_yelp_train_X.values.reshape(-1,1)
price_yelp_train_y = price_yelp_train_y.values.reshape(-1,1)

In [None]:
# run Linear Regression
yelp_train_regr = linear_model.LinearRegression()
yelp_train_regr.fit(price_yelp_train_X, price_yelp_train_y)

In [None]:
# Get testing data and run predictor
price_mich_test_X = concat_yelp_train_df['price'][concat_yelp_train_df['stars'].isnull()]
price_mich_test_X = price_mich_test_X.values.reshape(-1,1)
price_pred_mich_test_y = yelp_train_regr.predict(price_mich_test_X)

In [None]:
# Combine the predicted values with the actual Michelin ratings in a new dataframe
price_pred_mich_stars_df = pd.DataFrame(data=price_pred_mich_test_y, columns=["Predicted Yelp"])
price_pred_mich_stars_df['Mich'] = mich_stars
# Scale predictions from the 5-star Yelp range to the 3-star Michelin range (3/5 = 0.6)
price_pred_mich_stars_df['Pred Yelp Scaled'] = 0.6 * price_pred_mich_stars_df['Predicted Yelp']
price_pred_mich_stars_df

Given the strong uniformity of the predictions, it appears that restaurant name, cuisine, and price have little ability to predict a restaurant's rating on one platform when the model is trained on the other.
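
One way to make the uniformity observation concrete is to compare the spread of the predictions with the spread of the actual ratings. The sketch below uses made-up numbers and a hypothetical helper, `prediction_spread`; it does not use the notebook's data or report its results.

```python
import numpy as np

def prediction_spread(predicted, actual):
    """Return the standard deviations of predicted and actual ratings and their ratio."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    pred_std = predicted.std()
    actual_std = actual.std()
    # Guard against division by zero when every actual rating is identical
    ratio = pred_std / actual_std if actual_std > 0 else float("nan")
    return pred_std, actual_std, ratio

# Made-up example: near-constant predictions against varied actual ratings
predicted = [1.20, 1.21, 1.19, 1.20, 1.22]
actual = [3.0, 4.5, 2.5, 5.0, 4.0]
pred_std, actual_std, ratio = prediction_spread(predicted, actual)
print(f"predicted std={pred_std:.3f}, actual std={actual_std:.3f}, ratio={ratio:.3f}")
```

A ratio close to zero indicates that the model returns nearly the same value for every restaurant, which is what uniform predictions look like numerically.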

# Location

Another variable we can analyze within the Michelin dataset is the location of each restaurant. Using each restaurant's latitude and longitude, we can plot the restaurants over a map of the United States to look for interesting patterns or clusters. To do this, we use geopandas to load the map data, then build a geopandas dataframe of points that stores each restaurant's location. Finally, we plot this dataframe on top of the map.

In [None]:
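# Get data from geopandas datasets
# (gpd.datasets is deprecated in geopandas 0.14 and removed in 1.0; newer versions need the Natural Earth file downloaded separately)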
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Set map to North America
ax = world[world.continent == 'North America'].plot(alpha = 0.4, figsize=(15,15),
    color='lightgrey', edgecolor='black')
# Load each restaurant's lat & lon into a geopandas dataframe
gdf = gpd.GeoDataFrame(
    main_df, geometry=gpd.points_from_xy(main_df.longitude, main_df.latitude))

# Plot restaurants on the map
gdf.plot(ax = ax, markersize = 20, color='red', marker='o', label = 'Michelin Starred Restaurants')
plt.legend(prop={'size':15})

The map shows clusters of restaurants in a handful of cities. To see which cities these are, we can count the restaurants per city:

In [None]:
main_df['city'].value_counts()

From the data we can see that the five clusters are the cities of New York, San Francisco, Los Angeles, Chicago, and Washington D.C. Only a few restaurants are located outside these major population centers. The map suggests that if a restaurant is not located in one of those five cities, then its chances of obtaining a Michelin star are drastically lower.
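
The concentration in five cities can be expressed as a single number: the share of all restaurants that fall in the five largest clusters. The sketch below assumes a dataframe with a `city` column like `main_df`, but uses a hypothetical stand-in, `demo_df`, with invented counts rather than the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for main_df; the city counts here are invented
demo_df = pd.DataFrame({
    "city": ["New York"] * 5 + ["San Francisco"] * 3 + ["Chicago"] * 2
            + ["Los Angeles"] * 2 + ["Washington D.C."] * 2 + ["Other"]
})

# value_counts() sorts cities by frequency, so head(5) picks the five largest clusters
city_counts = demo_df["city"].value_counts()
top5_share = city_counts.head(5).sum() / city_counts.sum()
print(f"Share of restaurants in the five largest clusters: {top5_share:.1%}")
```

Running the same two lines on `main_df['city']` would give the corresponding share for the real data.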

# Ethics & Privacy

As with any data science project, we carefully considered privacy and ethical guidelines and worked in a manner that respects them, to ensure that our research does not cause undue harm.

Starting with our research question, we believe that it is well-posed and unbiased because we are simply looking for contributing factors: which factors may lead to a greater likelihood of receiving Michelin Stars. Our scope of investigation is strictly limited to data that is publicly available on the internet. We believe that these datasets are fully capable of answering our research question well.

For this study, the main stakeholders are the Michelin tire company itself, because we are essentially evaluating the quality of its rating system, and the restaurants that have received Michelin stars. A conclusion that the Michelin Guide is biased toward these contributing factors could suggest that its ratings are unreasonable and cast doubt on their validity. This could hurt both Michelin's brand and the restaurants. On the other hand, a finding that the Michelin Guide does fulfill its role could be favorable to Michelin and appeal to its investors. Additionally, the study could serve as a tool for quality control should any bias be found, and in the long term would be more helpful than harmful. As for the Michelin Guide data itself, we do not believe it can be used for nefarious purposes, because it is publicly available data that Michelin publishes to create exposure (i.e., favorable sales records for investors).

Another possible stakeholder in this study is Yelp. However, since we are not evaluating Yelp and are only using it as a comparison tool, the study should not cause any harm to the company. We are also not evaluating any individual reviewer: no person who has written a review for Yelp or the Michelin Guide is identified, and none will be individually affected by this analysis. We are, however, naming the restaurants involved so that we can test whether their names contribute to their ratings. While including names carries a potential risk of raising doubts about why a restaurant holds Michelin stars, we want to reiterate that we are not reviewing restaurants; we are only looking for correlations in factors that contribute to receiving Michelin Stars. Any restaurant reviews are already publicly available on Yelp. We also understand that the most important factor is the quality and taste of the food itself, and that aspect is not included. We are not implying that any restaurant is good or bad; we are only looking for trends.

We strongly believe that this study can be carried out and concluded using only publicly available data. Although the analysis is somewhat correlational, the data available on the surface web is enough to make reliable inferences.

We carefully considered whether informed consent from Michelin was required to perform this study; however, we are convinced that the influence of the Michelin Guide is far greater than any impact this study could have. We therefore concluded that any unintended negative consequences of this research are negligible for Michelin; that is, the study does not pose any meaningful risk to Michelin's standing.


# Conclusion & Discussion

After completing our analysis, we failed to reject the null hypothesis. There was no significant difference in ratings across the different categories of cuisine, price, or name. Where a category had more stars, this reflected the larger number of restaurants in that category rather than the influence of the category itself. None of the factors we identified specifically contributes to higher ratings. There is a slight correlation between price and the number of Michelin stars received, but because we lack data over time, it is unclear whether higher prices followed the award of stars or contributed to it. The Michelin star data here is limited in general, as it encompasses only one year and was further restricted by geographical area (in our case, a single country).
 
We used a mixture of visualizations and predictive models to analyze our data. Most of the visualizations were skewed by the number of restaurants in each category. For cuisine, we also looked at the average stars per type of cuisine and found little difference between categories. In the price data, the graphs showed that higher prices tended to come with higher ratings, especially when compared to the Yelp data, but confounding factors make the cause of this trend unclear. We also looked at where restaurants were located, but we already knew the Michelin Guide covers only certain areas, so this was mostly for visualization. Additionally, areas with a higher population density are likely to have more restaurants in general, which likely contributes to those areas having more Michelin stars. We then used TF-IDF with SVR to make predictions for the name and cuisine categories, and linear regression to make predictions for the price category. These predictive models showed no statistical significance for higher ratings based on any of our tested factors.

Although we found no statistical significance, there is still a chance that we missed something in our analysis that might have changed this result. We also found that some cuisines are more popular than others, which might imply that certain cuisines have a better chance of receiving Michelin stars in general. We did not test every attribute of a restaurant, so there may in fact be another factor that contributes more to ratings. This data was also limited to the United States, so our conclusions apply only to this region. Based on our current findings, it seems most appropriate to believe that Michelin is properly basing its star awards on the five factors it claims to use.


# Team Contributions

Marcus worked on data cleaning and did work standardizing the data. He also did the analysis for the location data.

Kassandra worked on the overview, hypothesis, and explanation of the data sets. She also worked on the cuisine data analysis and on writing the ethics and privacy section and the conclusion.

Howard imported and cleaned the Yelp dataset. He then worked on the scikit-learn predictors. He also helped with ethics and privacy.

Janae helped with data cleaning, and filled in the background and prior work section. She also did analysis for the price section and helped write the conclusion. 