# Exploratory Data Analysis
### KirbyDownB
In this phase, we perform exploratory data analysis (EDA) on the data we've collected so far: restaurant data from Yelp, along with income data by zip code, which we scraped.

## Install Packages

In [1]:
!pip3 install pymongo nltk wordcloud textblob numpy pandas matplotlib



## Import Packages

In [2]:
from pymongo import MongoClient
from constants import mongodb_atlas_connection
from helpers import isZipCodeValid
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

import nltk
import pandas as pd
import numpy as np
import re

nltk.download('punkt')
nltk.download("stopwords")

ModuleNotFoundError: No module named 'constants'

## Fetch Data

In [None]:
client = MongoClient(mongodb_atlas_connection)

yelp_db = client['yelp']
income_db = client['income']

restaurants_collection = yelp_db['restaurants_new']
reviews_collection = yelp_db['reviews']
zip_code_collection = income_db['zipcode']

zip_codes = list(zip_code_collection.find())
zip_restaurants = list(restaurants_collection.find())
reviews = list(reviews_collection.find())

# Check for valid zip codes
zip_codes = list(filter(lambda x: isZipCodeValid(x['zip']), zip_codes))

client.close()



In [None]:
zip_restaurants[0]['restaurants']

In [None]:
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
rest_review = {}

for review in reviews:
    polarity = 0
    subjectivity = 0
    review_count = 0
    rid = review['rid']

    for i in review['reviews']:
        # Normalize the text: lowercase, strip punctuation, tokenize, drop stop words
        pre_process = i['text'].lower()
        pre_process = re.sub('[^a-z0-9]+', ' ', pre_process)
        pre_process = word_tokenize(pre_process)
        pre_process = [w for w in pre_process if w not in stop_words]
        sentiment = TextBlob(' '.join(pre_process)).sentiment
        polarity += sentiment.polarity
        subjectivity += sentiment.subjectivity
        review_count += 1

    # Skip restaurants without reviews to avoid dividing by zero
    if review_count == 0:
        continue

    # Store the per-restaurant averages as [polarity, subjectivity]
    rest_review[rid] = [polarity / review_count, subjectivity / review_count]

In [None]:
rest_zip_code = {}
rest_ratings = {}

for zip_restaurant in zip_restaurants:
    restaurants = zip_restaurant['restaurants']
    zip_code = zip_restaurant['zipcode']
    
    for restaurant in restaurants:
        rid = restaurant['id']
        rating = restaurant['rating']
        rest_zip_code[rid] = zip_code
        rest_ratings[rid] = rating

In [None]:
zip_income = {}

for obj in zip_codes:
    zip_code = obj['zip']
    income = obj['income']
    zip_income[zip_code] = income

In [None]:
labels = ['id', 'restaurant_id', 'restaurant_name', 'zip_code', 'avg_sentiment_polarity', 'avg_sentiment_subjectivity', 'zip_code_income', 'restaurant_rating']

data = []
for index, obj in enumerate(reviews):
    restaurant_id = obj['rid']
    restaurant_name = obj['name']
    
    if restaurant_id not in rest_zip_code:
        continue
    zip_code = rest_zip_code[restaurant_id]
    
    sentiment_polarity = np.nan
    sentiment_subjectivity = np.nan
    
    if restaurant_id in rest_review:
        sentiment = rest_review[restaurant_id]
        sentiment_polarity = sentiment[0]
        sentiment_subjectivity = sentiment[1]
    
    # Fall back to NaN when a zip code has no income record
    zip_code_income = zip_income.get(zip_code, np.nan)
    restaurant_rating = rest_ratings[restaurant_id]
    
    value = [index, restaurant_id, restaurant_name, zip_code, sentiment_polarity, sentiment_subjectivity, zip_code_income, restaurant_rating]
    data.append(value)

In [None]:
df = pd.DataFrame(data, columns=labels)
df.set_index('id', inplace=True)
df.head()

In [None]:
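# regex=False treats '$' as a literal character rather than an end-of-string anchor
df['zip_code_income'] = (
    df['zip_code_income']
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
    .astype('float')
)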
df.head()

In [None]:
# Preview the first rows of each zip code group
dfg = df.groupby('zip_code')
dfg.head()

In [None]:
%matplotlib inline
# Average rating by zip code
avg_rating_by_zip = df.groupby('zip_code')['restaurant_rating'].mean()
avg_rating_by_zip.plot.bar()
avg_rating_by_zip

## Histogram

In [None]:
# histogram
df.restaurant_rating.hist(bins=3)

As we can see, most of the ratings fall between 4 and 5 stars.
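
To put a number on that impression instead of reading it off three wide bins, the share of ratings in a given star range can be computed directly. This is a minimal sketch, assuming a `restaurant_rating` column like the one in `df`; the `rating_share` helper and the `toy` frame are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd


def rating_share(ratings: pd.Series, low: float, high: float) -> float:
    """Return the fraction of non-missing ratings with low <= rating <= high."""
    ratings = ratings.dropna()
    if ratings.empty:
        return float("nan")
    # between() is inclusive on both ends by default
    return float(ratings.between(low, high).mean())


# Hypothetical ratings, NOT taken from the Yelp data
toy = pd.DataFrame({"restaurant_rating": [4.5, 4.0, 3.5, 5.0, 2.0, 4.0, 4.5, 3.0]})
share = rating_share(toy["restaurant_rating"], 4.0, 5.0)
print(f"{share:.1%} of the toy ratings are between 4 and 5 stars")
```

On the real data, the equivalent call would be `rating_share(df['restaurant_rating'], 4.0, 5.0)`.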

## Boxplots

In [None]:
df.boxplot(column=['restaurant_rating'])

In [None]:
df.boxplot(column=['zip_code_income'])

The restaurant ratings have a few outliers, and it is interesting to note that all of them are on the lower end. This is the opposite of the income data, where all of the outliers are on the higher end.
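
The points a box plot draws as outliers can also be listed explicitly. The sketch below applies the 1.5 × IQR rule, which is matplotlib's default whisker setting, and splits the outliers into a low and a high group. The `iqr_outliers` helper and the sample values are hypothetical and only illustrate the idea.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5):
    """Return (low, high) outliers lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = values.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    low = values[values < q1 - whis * iqr]
    high = values[values > q3 + whis * iqr]
    return low, high


# Hypothetical ratings, NOT taken from the Yelp data
toy_ratings = pd.Series([4.0, 4.0, 4.5, 4.5, 4.0, 3.5, 4.5, 5.0, 1.0])
low, high = iqr_outliers(toy_ratings)
print("low outliers:", low.tolist())
print("high outliers:", high.tolist())
```

Calling `iqr_outliers(df['restaurant_rating'])` and `iqr_outliers(df['zip_code_income'])` would show which side the outliers fall on for each column.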

## Scatter Plots

In [None]:
import matplotlib.pyplot as plt
import pylab

# Mean restaurant rating at each distinct zip code income level
series = df.groupby('zip_code_income')['restaurant_rating'].mean()
pylab.scatter(series.index, series)

Although there is not a clear correlation we can make with the data, we can at least notice a few trends. While restaurants that are in the lower income range are all over the place, the restaurants in the higher income areas tend to have a more stable rating.
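
One way to check the "more stable" impression is to compare the spread of ratings across income brackets. The following is a rough sketch, assuming `zip_code_income` and `restaurant_rating` columns as in `df`; the helper name, the number of brackets, and the toy values are assumptions made for illustration.

```python
import pandas as pd


def rating_spread_by_income(frame: pd.DataFrame, n_bins: int = 2) -> pd.Series:
    """Standard deviation of ratings within equal-sized income brackets."""
    # qcut assigns each row to a quantile-based bracket (0 = lowest incomes)
    brackets = pd.qcut(
        frame["zip_code_income"], q=n_bins, labels=False, duplicates="drop"
    )
    spread = frame.groupby(brackets)["restaurant_rating"].std()
    return spread.rename_axis("income_bracket")


# Hypothetical values, NOT taken from the scraped data
toy = pd.DataFrame({
    "zip_code_income": [20000, 25000, 30000, 35000, 80000, 90000, 100000, 120000],
    "restaurant_rating": [2.0, 5.0, 3.0, 4.5, 4.0, 4.5, 4.0, 4.5],
})
spread = rating_spread_by_income(toy)
print(spread)
```

A smaller standard deviation in the higher brackets would support the observation above. The choice of two brackets is arbitrary, and the result may change with a different `n_bins`.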

In [None]:
# Polarity measures how favorable a review is, so compare it against the rating
series = df.groupby('avg_sentiment_polarity')['restaurant_rating'].mean()
pylab.scatter(series.index, series)

Judging from the plot above, review sentiment is not a good indicator of a restaurant's star rating. This is surprising: one would expect a restaurant with more favorable reviews to have a higher average rating.
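
A scatter plot can hide or exaggerate a relationship, so a correlation coefficient is a useful complement. The sketch below computes Pearson's r between two columns after dropping rows with missing values. The `pearson_r` helper and the toy frame are hypothetical; the toy values are deliberately perfectly linear and say nothing about the real data.

```python
import pandas as pd


def pearson_r(frame: pd.DataFrame, x: str, y: str) -> float:
    """Pearson correlation between two columns, ignoring rows with missing values."""
    pair = frame[[x, y]].dropna()
    return float(pair[x].corr(pair[y]))


# Hypothetical values, NOT taken from the Yelp data
toy = pd.DataFrame({
    "avg_sentiment_polarity": [0.1, 0.4, 0.2, 0.5, 0.3, None],
    "restaurant_rating": [3.0, 4.5, 3.5, 5.0, 4.0, 2.0],
})
r = pearson_r(toy, "avg_sentiment_polarity", "restaurant_rating")
print(f"Pearson r = {r:.2f}")
```

On the real data, a value of `pearson_r(df, 'avg_sentiment_polarity', 'restaurant_rating')` close to zero would be consistent with the weak relationship seen in the plot.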

In [None]:
# Review polarity plotted against the income of the restaurant's zip code
df.plot.scatter(x='zip_code_income', y='avg_sentiment_polarity')

As in the earlier comparisons, reviews in areas with a lower median income are more wide ranging: sentiment spans from -0.6 all the way to 0.8 in areas with a median income of $20,000. In wealthier areas, review sentiment is much more tightly clustered.

## Conclusion

Although we didn't find the clear positive or negative correlations between data points that we were expecting, we still found some interesting trends in the data. The most interesting insight we found was that the sentiment of the reviews doesn't necessarily correspond to the ratings. That is, the number of stars someone gives a restaurant is not a clear indicator of how favorably they will review it, which makes sense. People may use the same words to describe a similar experience, but their overall sentiment may be very different, as these things are influenced by many other factors.

Another trend we identified was that ratings in areas with a lower median income tended to be all over the place, while those from areas with a higher median income tended to be significantly more stable, both in the actual ratings and in the sentiment of the reviews.

In [None]:
df.to_csv('result.csv', index=False)