##### Main authors: WENQI HOU, GAURAVI SAHA, MANYING (JANE) TSANG
#### Repurposed with adaptations and changes from: GIOVANNI FICARRA & LEONARDO PICCHIAMI

### YELP DATA PREPROCESSING - RESTAURANT RECOMMENDATION SYSTEM 

##### CONTEXT OF THE DATA
We chose the Yelp dataset for three main reasons:
- The volume of data (3.6 GB) is large enough to make the project feasible and promising.
- The information comes directly from the Yelp website, so it is authentic and will help us develop practical insights.
- The datasets cover a multitude of restaurants across 36 states and 1,200 cities, with users nationwide, which enriches the quality of the data.

### 1. Overall Project Objectives

Focusing on Las Vegas restaurants, we are implementing a high-fidelity system for users, restaurants and Yelp to transform the restaurant recommendation experience. We gather region-specific insights about our customer base and identify the strategic factors that influence a customer's decision to visit a particular restaurant.

 - User Perspective: Trending cuisines, upscale bars and restaurant quality, to give the customer a wholesome experience.
 - Restaurant's Profitability: Identifying revenue from highly reviewed users and achieving targeted success through region-specific analytics.
 - Yelp's Perspective: Testing Yelp's tracking of restaurant hours and keeping abreast of the current status of restaurants (newly opened, permanently closed, etc.). Developing a recommendation system for a new customer that identifies the top 5 restaurants based on input parameters such as cuisine, ambience and type of restaurant.




### 2. Description of Data

The 5 datasets, in JSON format, were retrieved from the Yelp website: business.json, user.json, checkin.json, tip.json and review.json.

Business has the following attributes:

- business_id: ID of the business
- name: name of the business
- neighborhood
- address: address of the business
- city: city of the business
- state: state of the business
- postal_code: postal code of the business
- latitude: latitude of the business
- longitude: longitude of the business
- stars: average rating of the business
- review_count: number of reviews received
- is_open: 1 if the business is open, 0 otherwise
- categories: multiple categories of the business

Review has the following attributes:

- review_id: ID of the review
- user_id: ID of the user
- business_id: ID of the business
- stars: ratings of the business
- date: review date
- text: review from the user
- useful: number of users who vote a review as useful
- funny: number of users who vote a review as funny
- cool: number of users who vote a review as cool

User data has these variables:
- average_stars
- compliment_cool, compliment_cute, compliment_funny, compliment_hot, compliment_list, compliment_more, compliment_note, compliment_photos, compliment_plain, compliment_profile, compliment_writer
- cool
- elite
- fans
- friends
- funny
- name
- review_count
- useful
- user_id
- yelping_since

Check in has two columns: 

- business_id
- date

And the most important data for our analysis: Tip data

- business_id
- compliment_count
- date
- text
- user_id

### 3. Data Processing Tasks

#### Generating a cleaned and transformed version of the data:

1. Transfer the JSON files into pandas dataframes with proper indexing. Extract the data for restaurants in Las Vegas.
2. Replace garbage data, such as incorrect states and postal codes. Replace missing values.
3. Transform and standardize dates.
4. Merge and reshape multiple dataframes.
5. Delete unnecessary columns that could add ambiguity, based on logical assumptions.
6. Delete duplicate restaurant entries and combine their reviews.
7. Fix typographical errors in reviews.
8. Discretize review counts.
9. Count users' ratings as a function of restaurant type to find their preferences. Improve the accuracy of business categories by tracking 'buzz words' in reviews.
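
As an illustration of the review-count discretization step, here is a minimal sketch using `pd.cut`. The sample counts, bin edges and labels are assumptions chosen for the example, not values taken from this project.

```python
import pandas as pd

# Hypothetical review counts; bin edges and labels are illustrative assumptions
review_counts = pd.Series([3, 12, 48, 150, 900], name='review_count')
bins = [0, 10, 50, 200, float('inf')]
labels = ['low', 'medium', 'high', 'very high']

# Each count falls into the right-closed interval (lower, upper]
review_level = pd.cut(review_counts, bins=bins, labels=labels)
print(review_level.tolist())
```

Suitable bin edges depend on the actual distribution of review counts, so they should be chosen after inspecting the data.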


### Enhancement to the Data:

We have improved and enhanced the data at every level by cleaning information within the columns. Further data cleaning and enhancements are covered in the data cleaning section.

### 4. Exploratory Data Analysis

### Data Import

We imported each large JSON file into a dataframe by splitting the file into multiple chunks, collecting the chunks in a list, and concatenating them into a final dataframe.
After creating each dataframe, we checked its columns, shape and head to get an overall idea of what the data looks like and which features it has.
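
The chunked import described above can be tried on a small in-memory sample before running it on the real files. This is only a sketch: the helper name `read_json_lines` and the sample records are assumptions made for illustration.

```python
import io
import pandas as pd

def read_json_lines(source, chunksize=10000):
    """Read a JSON-lines source chunk by chunk and concatenate the chunks."""
    reader = pd.read_json(source, lines=True, chunksize=chunksize)
    frames = [chunk for chunk in reader]
    return pd.concat(frames, ignore_index=True)

# Hypothetical sample standing in for a Yelp JSON-lines file
sample = io.StringIO(
    '{"business_id": "b1", "date": "2016-04-26 19:49:16"}\n'
    '{"business_id": "b2", "date": "2011-06-04 18:22:23"}\n'
    '{"business_id": "b3", "date": "2014-12-29 19:25:50"}\n'
)
sample_df = read_json_lines(sample, chunksize=2)
print(sample_df.shape)
```

Because only one chunk is parsed at a time, rows could also be filtered inside the loop before concatenation, which would reduce peak memory further.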

In [1]:
import re
from collections import Counter
import datetime as dt

import pandas as pd
import numpy as np
import pickle
from wordcloud import WordCloud
import matplotlib.pyplot as plt


In [2]:
'''
QUESTION
If I understand correctly, setting chunksize reads the whole JSON by dividing it into chunks of a size you choose, so that
operations on a large amount of data can be done iteratively, working on a small part at a time. But what is the point
of loading everything, splitting it into parts, putting the parts in a list and then putting them back together? Is it
a matter of efficiency?

ANSWER: it should save memory during parsing.
'''

frames_tip = []
for chunk in pd.read_json('../dataset/yelp_academic_dataset_tip.json', lines=True, chunksize = 10000):
    frames_tip.append(chunk)
tip=pd.concat(frames_tip)

In [3]:
tip.columns

Index(['user_id', 'business_id', 'text', 'date', 'compliment_count'], dtype='object')

In [4]:
tip.head()

Unnamed: 0,user_id,business_id,text,date,compliment_count
0,UPw5DWs_b-e2JRBS-t37Ag,VaKXUpmWTTWDKbpJ3aQdMw,"Great for watching games, ufc, and whatever el...",2014-03-27 03:51:24,0
1,Ocha4kZBHb4JK0lOWvE0sg,OPiPeoJiv92rENwbq76orA,Happy Hour 2-4 daily with 1/2 price drinks and...,2013-05-25 06:00:56,0
2,jRyO2V1pA4CdVVqCIOPc1Q,5KheTjYPu1HcQzQFtm4_vw,Good chips and salsa. Loud at times. Good serv...,2011-12-26 01:46:17,0
3,FuTJWFYm4UKqewaosss1KA,TkoyGi8J7YFjA6SbaRzrxg,The setting and decoration here is amazing. Co...,2014-03-23 21:32:49,0
4,LUlKtaM3nXd-E4N4uOk_fQ,AkL6Ous6A1atZejfZXn1Bg,Molly is definately taking a picture with Sant...,2012-10-06 00:19:27,0


In [5]:
frames_checkin = []
for chunk in pd.read_json('../dataset/yelp_academic_dataset_checkin.json', lines=True, chunksize = 10000):
    frames_checkin.append(chunk)
checkin=pd.concat(frames_checkin)

In [6]:
checkin.columns

Index(['business_id', 'date'], dtype='object')

In [7]:
checkin.shape

(161950, 2)

In [8]:
checkin.head()

Unnamed: 0,business_id,date
0,--1UhMGODdWsrMastO9DZw,"2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016..."
1,--6MefnULPED_I942VcFNA,"2011-06-04 18:22:23, 2011-07-23 23:51:33, 2012..."
2,--7zmmkVg-IMGaXbuVd0SQ,"2014-12-29 19:25:50, 2015-01-17 01:49:14, 2015..."
3,--8LPVSo5i0Oo61X01sV9A,2016-07-08 16:43:30
4,--9QQLMTbFzLJ_oT-ON3Xw,"2010-06-26 17:39:07, 2010-08-01 20:06:21, 2010..."


In [None]:
#Original testing size 20000
#My testing size 10

frames_review = []
for chunk in pd.read_json('../dataset/yelp_academic_dataset_review.json', lines=True, chunksize = 10):
    frames_review.append(chunk)
review=pd.concat(frames_review)

In [None]:
review.columns

In [None]:
review.shape

In [None]:
review.head()

In [None]:
frames = []
for chunk in pd.read_json('../dataset/yelp_academic_dataset_user.json', lines=True, chunksize = 10000):
    frames.append(chunk)
user = pd.concat(frames)

In [None]:
user.columns

In [None]:
user.shape

In [None]:
user.head()

In [None]:
frames_business = []
for chunk in pd.read_json('../dataset/yelp_academic_dataset_business.json', lines=True, chunksize = 10000):
    frames_business.append(chunk)
business = pd.concat(frames_business)

In [None]:
business.columns

In [None]:
business.head()

In [None]:
# TODO: this can probably be removed, since we are no longer interested in cities
business['city'].value_counts().head()

### Flow of Data Processing:

We started with 'business' because it contains attributes we can use to extract all businesses in Las Vegas, and then to extract the restaurants from all business types based on 'categories'.
- By creating a new dataframe 'business_vegas_restaurant', we were able to filter the 'review' table by matching its 'business_id' against the 'business_id' in 'business_vegas_restaurant', creating a new dataframe 'review_in_vegas'.

- Using the same logic, we then filtered the 'user' dataframe by matching its 'user_id' against the 'user_id' in 'review_in_vegas'.

- The resulting user dataframe contains all customers who have been to at least one restaurant in Las Vegas and left a review.

- In the same way, the new dataframes 'tip_vegas' and 'checkin_vegas' were created for the remaining two datasets by matching 'business_id'.
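
The filtering chain described in this list can be sketched on toy tables as follows. All IDs and values are made up, and the variable names are illustrative rather than the ones used in the notebook.

```python
import pandas as pd

# Toy tables; all IDs and values are made up for illustration
business = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'categories': ['Restaurants, Thai', 'Hair Salons', 'Bars, Restaurants'],
})
review = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'user_id': ['u1', 'u2', 'u3'],
    'business_id': ['b1', 'b2', 'b3'],
})
user = pd.DataFrame({'user_id': ['u1', 'u2', 'u3', 'u4']})

# Keep businesses whose categories mention restaurants (missing categories count as no match)
is_restaurant = business['categories'].str.contains('Restaurants', case=False, na=False)
restaurants = business[is_restaurant]

# Keep reviews of those restaurants, then the users who wrote those reviews
restaurant_reviews = review[review['business_id'].isin(restaurants['business_id'])]
restaurant_users = user[user['user_id'].isin(restaurant_reviews['user_id'])]
print(restaurant_users['user_id'].tolist())
```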

#### To avoid importing data from the large json files every time, we converted the new dataframes to pickle files for future use.
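
A minimal round-trip sketch of the pickling idea, written against a temporary directory so it does not touch the project files. The dataframe contents and file name are arbitrary assumptions.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({'business_id': ['b1', 'b2'], 'stars': [4.5, 3.0]})

# Write to a temporary location; the file name is arbitrary
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pickle')
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored dataframe has the same values, dtypes and index
print(restored.equals(frame))
```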

In [None]:
business['restaurant']=business['categories'].str.contains('Restaurants',flags = re.IGNORECASE)

In [None]:
business_restaurant=business[business['restaurant'] == True]

In [None]:
business_restaurant.head()

In [None]:
business_restaurant.reset_index(drop=True).head()

In [None]:
business_restaurant.shape

#### The Pickling Process

In [None]:
business_restaurant.to_pickle('../dataset/restaurants.pickle')
review=review.drop('text',axis=1)

In [None]:
review_all_restaurant=review.loc[review['business_id'].isin(business_restaurant['business_id'].unique())]

In [None]:
review_all_restaurant.reset_index(drop=True).head()

In [None]:
review_all_restaurant.to_pickle('../dataset/all_review.pickle')

In [None]:
user.columns

In [None]:
user_all_restaurant=user.loc[user['user_id'].isin(review_all_restaurant['user_id'].unique())]

In [None]:
user_all_restaurant.to_pickle('../dataset/all_users.pickle')

In [None]:
tip.columns

In [None]:
tip_all_restaurant = tip.loc[tip['user_id'].isin(review_all_restaurant['user_id'].unique())].reset_index(drop=True)

In [None]:
tip_all_restaurant.to_pickle('../dataset/all_tips.pickle')

In [None]:
check_all_restaurant=checkin.loc[checkin['business_id'].isin(business_restaurant['business_id'].unique())].reset_index(drop=True)

In [None]:
check_all_restaurant.to_pickle('../dataset/all_checkin.pickle')

# Data Cleaning - Making the Data useful for analysis

##### Working with Business pickle file:

This file contains information about our restaurants and other related parameters. This dataframe acts as the focus of our analysis and we intend to derive meaningful insights from it.

Summary of actions:
- Reading the business pickle file for clean up
- Missing values cleaned up
- Using only a few selected columns for meaningful analysis
- Extract useful information from the categories column to investigate restaurants' cuisine


In [None]:
rest = pd.read_pickle('../dataset/restaurants.pickle')

In [None]:
# pd.np was removed from recent pandas versions; use numpy directly
rest.fillna(value=np.nan, inplace=True)

In [None]:
Rest = rest.reset_index(drop = True)
Rest.index +=1
Rest.head()

In [None]:
Rest.columns

In [None]:
'''
QUESTION
Here some columns are excluded. However:
- The original does not consider city, because in its case it is always Las Vegas. I added it; I think it is useful information.
  For the prediction, I think it is also useful to know where the recommended restaurant is located.
'''

# .copy() avoids SettingWithCopyWarning when new columns are added later
Rest_final = Rest[['name', 'business_id', 'address', 'categories', 'postal_code','attributes','hours','latitude','longitude','review_count','stars', 'city']].copy()

In [None]:
'''
I ran this step. unique now seems to have worked. Before, it almost looked like a crash.
'''

categories=', '.join(list(Rest_final['categories'].unique()))
categories=categories.split(', ')
categories[:5]

In [None]:
c = Counter(categories)
c.most_common(60)

In [None]:
'''
Here a new feature is added: the type of cuisine.
'''

cuisine = 'American|Chinese|Italian|Japanese|Mexican|Asian Fusion|Thai|Korean|Mediterranean'
Rest_final['cuisine']=Rest_final['categories'].str.findall(cuisine)

In [None]:
'''
Map each element to a list, which makes sense because a restaurant can have more than one type of cuisine. If it is not
among the main types, it goes under Others.
'''

Rest_final['cuisine'] = Rest_final['cuisine'].map(lambda x: list(x))
Rest_final['cuisine'] = Rest_final['cuisine'].map(lambda x: ['Others'] if x==[] else x)

In [None]:
Rest_final['cuisine'].head(20)

#### Remove redundant entries (e.g: American, American)

In [None]:
Rest_final['cuisine'] = Rest_final['cuisine'].map(lambda x: list(dict.fromkeys(x)))
Rest_final['cuisine'] = Rest_final['cuisine'].map(', '.join) # convert list of string to string
Rest_final['cuisine'].head(20)

Check all cuisines and, for simplicity, relabel every restaurant whose cuisine includes Asian Fusion as Asian Fusion only.
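
The cuisine labelling steps in this section can be condensed into a single function, sketched here on toy category strings. The helper name `label_cuisine` and the sample categories are assumptions; the cuisine pattern is the one used in this notebook.

```python
import re
import pandas as pd

# Toy category strings, made up for illustration
categories = pd.Series([
    'Restaurants, American (New), American (Traditional)',
    'Sushi Bars, Japanese, Asian Fusion, Restaurants',
    'Restaurants, Pizza',
])
pattern = 'American|Chinese|Italian|Japanese|Mexican|Asian Fusion|Thai|Korean|Mediterranean'

def label_cuisine(text):
    """Return a comma-separated cuisine label for one categories string."""
    found = list(dict.fromkeys(re.findall(pattern, text)))  # de-duplicate, keep order
    if not found:
        return 'Others'
    if 'Asian Fusion' in found:
        return 'Asian Fusion'  # collapse any combination containing Asian Fusion
    return ', '.join(found)

cuisine = categories.map(label_cuisine)
print(cuisine.tolist())
```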

In [None]:
Rest_final['cuisine'].unique()

In [None]:
# Use a boolean mask with .loc instead of chained indexing, so the assignment reliably modifies Rest_final
Rest_final.loc[Rest_final['cuisine'].str.contains('Asian Fusion'), 'cuisine'] = 'Asian Fusion'

In [None]:
Rest_final['cuisine'].unique()

#### Analysis of messy data in the attribute column:

Each item in the attributes column is a dictionary of values, which makes the column messy to work with. On Yelp, attributes act as filters that customers can click to narrow down restaurants. For example, WiFi = Yes would be selected (tick-marked) while searching on Yelp.

To fix this, we split the dictionaries in the attributes column into separate columns, one per filter.
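
A small sketch of expanding a dictionary column into one column per key. It builds the expanded frame with the `pd.DataFrame` constructor rather than `apply(pd.Series)`, which is a substitution made here for the example; the toy rows are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'attributes': [
        {'WiFi': "u'free'", 'Alcohol': "u'full_bar'"},
        {'WiFi': "'no'"},
        None,  # restaurant with no attributes at all
    ],
})

# One row per restaurant, one column per attribute key; absent keys become NaN
rows = [d if isinstance(d, dict) else {} for d in toy['attributes']]
expanded = pd.DataFrame(rows, index=toy.index)

result = pd.concat([toy.drop(columns='attributes'), expanded], axis=1)
print(result)
```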

In [None]:
Rest_final.isnull().sum()

In [None]:
Rest_final['attributes'].apply(pd.Series).head()
# Split the attributes dictionary into all its values

#### Summary of actions:
- Concatenating the attributes to the dataframe.
- Since there are a lot of missing values in most of the columns, we have cherry-picked a few columns out of the list and included a few filters for our analysis.
- Clean up of the WiFi column.
- Clean up of the Alcohol column.
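
The WiFi and Alcohol clean-ups both normalize several raw spellings to one label. One possible way to express that is a single lookup dictionary, sketched here for WiFi on made-up values; the name `wifi_map` is an assumption.

```python
import numpy as np
import pandas as pd

wifi_raw = pd.Series(["u'free'", "'no'", "u'paid'", 'None', np.nan, "'free'"])

# Raw spelling -> clean label; values not listed are left unchanged
wifi_map = {
    "u'no'": 'No', "'no'": 'No', 'None': 'No',
    "u'free'": 'Free', "'free'": 'Free',
    "u'paid'": 'Paid', "'paid'": 'Paid',
}
wifi_clean = wifi_raw.map(lambda value: wifi_map.get(value, value))
print(wifi_clean.tolist())
```

Missing values pass through untouched, so they can still be handled separately later.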

 

In [None]:
R = Rest_final['attributes'].apply(pd.Series)
list(R.columns)

In [None]:
Rest_new = pd.concat([Rest_final.drop(['attributes'], axis=1), Rest_final['attributes'].apply(pd.Series)], axis=1)
Rest_new.head()

In [None]:
'''
Here too, I added city to the features to consider.
'''

Rest_new = Rest_new[['name', 'business_id', 'address', 'cuisine', 'postal_code','hours','latitude','longitude',
                   'review_count','stars','OutdoorSeating','BusinessAcceptsCreditCards','RestaurantsDelivery',
                   'RestaurantsReservations','WiFi','Alcohol','categories', 'city']]

In [None]:
# pd.np was removed from recent pandas versions; use numpy directly
Rest_new.fillna(value=np.nan, inplace=True)
Rest_new['WiFi'].unique()

In [None]:
a=Rest_new['WiFi'].map(lambda x: 'No' if x in np.array(["u'no'", "'no'",'None']) else x)
a=a.map(lambda x: 'Free' if x in np.array(["'free'", "u'free'"]) else x)
a.unique()

In [None]:
a=a.map(lambda x: 'Paid' if x in np.array(["'paid'", "u'paid'"]) else x)
a.unique()

In [None]:
Rest_new['WiFi']=a

In [None]:
Rest_new['Alcohol'].unique()

In [None]:
Alc = Rest_new['Alcohol'].map(lambda x: 'Full_Bar' if x in np.array(["u'full_bar'", "'full_bar'"]) else x)
Alc.unique()

In [None]:
Alc = Alc.map(lambda x: 'Beer&Wine' if x in np.array(["u'beer_and_wine'", "'beer_and_wine'"]) else x)
Alc.unique()

In [None]:
Alc = Alc.map(lambda x: 'No' if x in np.array(["u'none'", "'none'",'None']) else x)
Alc.unique()

###### Cleaned Version:

In [None]:
Rest_new['Alcohol']= Alc
Rest_new.head()

### Splitting up restaurant hours:

Summary of Actions:
- Clean up the hours by splitting them into multiple columns for the opening and closing time of each day.
- Check whether every restaurant opens and closes once per day.
- Use the function defined below to pair the keys (days) and values (hours) of the hours dictionary for later information extraction.
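
As a point of comparison, opening and closing times can also be read straight from the hours dictionary without building an intermediate string. This is a sketch under the assumption that each value looks like `'11:0-22:0'`; the helper `split_hours` and the sample data are hypothetical.

```python
import datetime as dt
import pandas as pd

# Hypothetical hours column: a dictionary per restaurant, or None when unknown
hours = pd.Series([
    {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:30'},
    None,
])

def parse_time(text):
    """Parse an 'H:M' string into a datetime.time."""
    return dt.datetime.strptime(text, '%H:%M').time()

def split_hours(day_hours, day):
    """Return (open, close) for one day, or (None, None) if unavailable."""
    if not isinstance(day_hours, dict) or day not in day_hours:
        return (None, None)
    open_str, close_str = day_hours[day].split('-')
    return (parse_time(open_str), parse_time(close_str))

monday = hours.map(lambda h: split_hours(h, 'Monday'))
print(monday.tolist())
```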

In [None]:
print(Rest_new['hours'][Rest_new['hours'].notnull()].map(lambda x: x.values()).map(len).sort_values().value_counts())

In [None]:
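def merge(x, y):
    # Interleave day names with their hours: [day1, hours1, day2, hours2, ...]
    try:
        result = []
        for day, hours in zip(x, y):
            result.append(day)
            result.append(hours)
        return result
    except TypeError:
        # x or y is NaN (no hours recorded): return an explicit missing value
        return np.nan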

In [None]:
Rest_new['business_days']=Rest_new['hours'][Rest_new['hours'].notnull()].map(lambda x:list(x.keys()))
Rest_new['business_hours']=Rest_new['hours'][Rest_new['hours'].notnull()].map(lambda x:list(x.values()))
Rest_new['hours_day'] = Rest_new.apply(lambda row: merge(row['business_days'], row['business_hours']), axis=1)

In [None]:
Rest_new_hours = Rest_new.copy()  # explicit copy, so later edits do not touch Rest_new
Rest_new_hours.head(10)

In [None]:
# Join each [day, hours, day, hours, ...] list into one string, using .loc to avoid chained assignment
has_hours = Rest_new_hours['hours_day'].notnull()
Rest_new_hours.loc[has_hours, 'hours_day'] = Rest_new_hours.loc[has_hours, 'hours_day'].map(''.join)
Rest_new_hours.head()

In [None]:
# Extract the opening and closing time of each day from strings such as 'Monday11:0-22:0Tuesday...'
# All *_Open columns are created first, then all *_Close columns, to keep the original column order
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
for day in days:
    Rest_new_hours[day + '_Open'] = Rest_new_hours['hours_day'].str.extract(day + r'(\d*:\d*)-\d*:\d*', expand=False)
for day in days:
    Rest_new_hours[day + '_Close'] = Rest_new_hours['hours_day'].str.extract(day + r'\d*:\d*-(\d*:\d*)', expand=False)

In [None]:
Rest_new_hours.head(5)

In [None]:
Rest_new_hours.drop(['hours_day','business_days','business_hours'],axis=1,inplace=True)
Rest_new_hours.columns

In [None]:
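def str2time(val):
    # Parse an 'H:M' string into a time; return NaT when the value is missing or malformed
    try:
        return dt.datetime.strptime(val, '%H:%M').time()
    except (ValueError, TypeError):
        return pd.NaT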

In [None]:
'''
I changed the indices because, after adding city, the wrong features were being selected.
'''

Rest_new_hours.iloc[:,18:32]=Rest_new_hours.iloc[:,18:32].astype(str)
Rest_new_hours.iloc[:,18:32]=Rest_new_hours.iloc[:,18:32].applymap(lambda x: str2time(x))
Rest_new_hours.iloc[:,18:32].head()

In [None]:
Rest_new_hours.loc[3801]

In [None]:
Rest_new_hours.drop('hours',axis=1,inplace=True)
Rest_new_hours.head()

In [None]:
'''
I think it is useful for our task to save the pickle again.
'''
Rest_new_hours.to_pickle('../dataset/restaurants.pickle') 


### Cleaning of the Review Dataset:



Summary of Actions:
- Reset the index to start at 1, for ease of reading.
- Rearrange the columns in the dataframe.
- Update the timestamp to keep only the date (YYYY-MM-DD).
- Use the pandas to_datetime function to drop the time from the 'date' column.
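
A minimal sketch of the date truncation described in the last two points, using two timestamps in the same format as the Yelp data. The dataframe name `review_dates` is an assumption for the example.

```python
import pandas as pd

review_dates = pd.DataFrame({'date': ['2014-03-27 03:51:24', '2013-05-25 06:00:56']})

# Parse the strings as timestamps, then keep only the calendar date
review_dates['date'] = pd.to_datetime(review_dates['date']).dt.date
print(review_dates['date'].tolist())
```

Note that `.dt.date` yields plain Python `date` objects, so the column dtype becomes `object` rather than a pandas datetime type.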

In [None]:
'''
QUESTION
Honestly, this part of the preprocessing seems a bit pointless to me. Do we keep it?
'''

In [None]:
pickle_review = open("../dataset/all_review.pickle","rb")
review = pickle.load(pickle_review)
review.head()

In [None]:
Review = review.reset_index(drop=True)
Review.index +=1
Review.head()

In [None]:
Review = Review[['business_id', 'user_id', 'review_id', 'date', 'cool','funny','useful','stars']]
Review.head()

In [None]:
'''
We save the pickle again here as well.
'''

# Save the cleaned dataframe (Review), not the raw one loaded above
Review.to_pickle('../dataset/all_review.pickle')

### Cleaning Users Dataset

Summary of Actions:
- After processing, the dataset shrinks from 22 columns to 11. The 'compliment' columns are all dropped because they behave much like the 'cool' and 'funny' columns, which also count the different kinds of compliments a user received from others. We removed them to avoid redundancy.
- Because only Las Vegas data was extracted, the index was no longer in order. The first step is therefore to reset the index so that it starts at 1.
- Second, we rearrange the column order so that the most important information is shown first, which makes it easier for readers to gain insights from the dataframe.
- Third, 'yelping_since' includes both date and time (hour and minute), and we do not need the time for our analysis. We therefore used the pandas to_datetime function to drop the time from that column.
- After that, we worked on the multivalued columns, 'elite' and 'friends'. The 'elite' column contains, as a single string, all the years in which the user was an elite member.
- We decided that the individual years do not help with the analysis; counting how many years the user has been an elite member is more useful.
- Therefore, we first used a regular expression to find all the years, which also converts each string to a list, and then took the length of the list.
- A similar method applies to 'friends', but instead of a regular expression we used a string method to split the strings.
- Finally, we renamed 'name' to 'user_name' to make clear which dataset the column belongs to.
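
The elite and friends steps above can be sketched on two made-up users as follows. The dataframe `toy_users` and its values are assumptions for illustration.

```python
import re
import pandas as pd

toy_users = pd.DataFrame({
    'elite': ['2015,2016,2017', ''],
    'friends': ['u2, u3, u4', 'u1'],
})

# Count the four-digit years found in the elite string
toy_users['years_of_elite'] = toy_users['elite'].map(lambda s: len(re.findall(r'20\d\d', s)))

# Split the comma-separated friend IDs and count the pieces
toy_users['friends'] = toy_users['friends'].str.split(',').map(len)
print(toy_users[['years_of_elite', 'friends']])
```

With this approach an empty or placeholder friends string still splits into one piece, so such users are counted as having one friend unless they are handled separately.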

In [None]:
pickle_users = open("../dataset/all_users.pickle","rb")
users = pickle.load(pickle_users)

In [None]:
#dropping org index 
users = users.reset_index(drop=True)
users.index +=1

In [None]:
titles = ['user_id','name','average_stars','yelping_since','review_count','elite','fans','useful','cool','funny','friends']
users =users.reindex(columns=titles)

#rename columns
users = users.rename(columns={'name':'user_name','review_count':'review'})   

In [None]:
#converting timestamp to date 
users['yelping_since'] = pd.to_datetime(users['yelping_since'])
users['yelping_since'] = users['yelping_since'].dt.date

In [None]:
# Raw string avoids an invalid-escape warning for \d
users['elite'] = users['elite'].apply(lambda x: re.findall(r'20\d\d', x))

In [None]:
users['elite'] = users['elite'].apply(lambda x: len(x))

In [None]:
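# Assign the split result; otherwise len() below counts characters instead of friends
users['friends'] = users['friends'].str.split(',')
users['friends'] = users['friends'].apply(lambda x: len(x))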

In [None]:
users = users.rename(columns={'elite':'years_of_elite'})
users.head()

In [None]:
users.to_pickle('../dataset/all_users.pickle')

### Cleaning Tip Dataset

Summary of Actions:

- Since the original tip dataset only contains 'business_id', we extracted 'business_id' and 'name' from the restaurant dataset in order to add the 'name' column to the tip_new dataset.
- We added the column with an inner join.
- We also used the pandas to_datetime function to drop the time from the 'date' column.
- After that, we rearranged the column order and renamed the columns so that the same names do not appear across different datasets.
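
A compact sketch of the four steps above on toy data. The restaurant names, IDs and tips are invented for the example; only the column names follow this notebook.

```python
import pandas as pd

tips = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b9'],
    'user_id': ['u1', 'u2', 'u3'],
    'text': ['Great happy hour', 'Try the salsa', 'Closed on Mondays'],
    'date': ['2014-03-27 03:51:24', '2013-05-25 06:00:56', '2012-10-06 00:19:27'],
}).set_index('business_id')

names = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'name': ['Toy Diner', 'Toy Cantina'],
}).set_index('business_id')

# Inner join on the shared index: tips for businesses without a name (b9) are dropped
tips_named = tips.join(names, how='inner')

# Keep only the calendar date, then reorder and rename the columns
tips_named['date'] = pd.to_datetime(tips_named['date']).dt.date
tips_named = tips_named.reindex(columns=['name', 'date', 'text', 'user_id'])
tips_named = tips_named.rename(
    columns={'name': 'restaurant_name', 'text': 'user_tips', 'date': 'tips_date'}
)
print(tips_named)
```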

In [None]:
pickle_tip = open("../dataset/all_tips.pickle","rb")
tip = pickle.load(pickle_tip)
tip = tip.set_index(keys='business_id')

In [None]:
#load in restaurant pickle file in order to get the restaurant names
# Use the restaurants pickle written earlier in this notebook
pickle_restaurant = open("../dataset/restaurants.pickle","rb")
restaurant = pickle.load(pickle_restaurant)
restaurant_new = restaurant[['name','business_id']]
restaurant_new = restaurant_new.set_index(keys='business_id')

In [None]:
tip_new = tip.join(restaurant_new,how='inner')

In [None]:
tip_new['date'] = pd.to_datetime(tip_new['date'])
tip_new['date'] = tip_new['date'].dt.date

In [None]:
titles = ['name','date','text','user_id']
tip_new = tip_new.reindex(columns=titles)

In [None]:
tip_new = tip_new.rename(columns={'name':'restaurant_name','text':'user_tips','date':'tips_date'})

In [None]:
tip_new.to_pickle('../dataset/all_tips.pickle')

# WordCloud

#### WordCloud of User's tips:

To reinforce what we mentioned above, we created a word cloud. It confirms that the following words and phrases stand out:

- Great food
- Great service
- Love place
- Best food
- Happy hour
- Tasty, yummy, delicious

Together they give an overall positive impression of Las Vegas restaurants.


In [None]:
cloud = WordCloud(width=1200, height= 1080,max_words= 1000).generate(' '.join(tip_new['user_tips'].astype(str)))
plt.figure(figsize=(15, 25))
plt.imshow(cloud)
plt.axis('off');
