<a href="https://colab.research.google.com/github/Nandeesh-U/ZOMATO-RESTAURANT-CLUSTERING-AND-SENTIMENT-ANALYSIS/blob/main/ZOMATO_RESTAURANT_CLUSTERING_AND_SENTIMENT_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato provides information, menus and user-reviews of restaurants, and also has food delivery options from partner restaurants in select cities.

India is well known for its diverse cuisines, served in a large number of restaurants and hotel resorts, a reflection of its unity in diversity. The restaurant business in India is constantly evolving. More Indians are warming up to the idea of eating restaurant food, whether by dining out or by getting food delivered. The growing number of restaurants in every state of India has been a motivation to inspect the data for insights, interesting facts and figures about the Indian food industry in each city. So, this project focuses on analysing the Zomato restaurant data for each city in India.

**The project focuses on customers and the company.** You have to **analyze the sentiments of the reviews given by customers in the data** and **draw useful conclusions in the form of visualizations**. Also, cluster the Zomato restaurants into different segments. The data is visualized because that makes it easy to analyse at a glance. The analysis also addresses business cases that can directly help customers find the best restaurant in their locality, and help the company grow and work on the areas in which it is currently lagging.

The data could help in clustering the restaurants into segments. It also has valuable information on cuisine and cost, which can be used in a cost vs. benefit analysis.

The data could also be used for sentiment analysis, and the reviewer metadata can be used to identify the critics in the industry.

# **Attribute Information**

## **Zomato Restaurant names and Metadata**
Use this dataset for the clustering part.

1. Name : Name of Restaurants

2. Links : URL Links of Restaurants

3. Cost : Per person estimated Cost of dining

4. Collections : Tagging of Restaurants w.r.t. Zomato categories

5. Cuisines : Cuisines served by Restaurants

6. Timings : Restaurant Timings

## **Zomato Restaurant reviews**
Merge this dataset with the Names and Metadata dataset, then use it for the sentiment analysis part.

1. Restaurant : Name of the Restaurant

2. Reviewer : Name of the Reviewer

3. Review : Review Text

4. Rating : Rating Provided by Reviewer

5. MetaData : Reviewer Metadata - No. of Reviews and followers

6. Time : Date and Time of Review

7. Pictures : No. of pictures posted with review
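
The merge mentioned above can be sketched as follows. This is a minimal illustration on made-up rows, assuming the join key is the restaurant name (`Name` in the metadata, `Restaurant` in the reviews). The toy dataframes `meta_df` and `reviews_df` and the rating values are hypothetical and are not the project's files.

```python
import pandas as pd

# Hypothetical toy rows standing in for the two datasets
meta_df = pd.DataFrame({
    'Name': ['Beyond Flavours', 'Paradise'],
    'Cost': [800, 800],
})
reviews_df = pd.DataFrame({
    'Restaurant': ['Beyond Flavours', 'Beyond Flavours', 'Paradise'],
    'Rating': [5, 4, 3],
})

# A left join keeps every review and attaches the metadata of its restaurant
merged_df = reviews_df.merge(meta_df, how='left', left_on='Restaurant', right_on='Name')
print(merged_df[['Restaurant', 'Rating', 'Cost']])
```

A left join from the reviews side is one reasonable choice here, since it never drops a review even if a restaurant name has no match in the metadata.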

In [None]:
# Importing the libraries
from urllib.request import urlopen
import re
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import ast

In [None]:
# Defining a function that fetches a page through the ScrapingBee API and returns the response
# (the html of the page is available in response.content)
import os

def send_request(url):
    response = requests.get(
        url='https://app.scrapingbee.com/api/v1/',
        params={
            # Reading the API key from an environment variable instead of hard-coding it
            # (the variable name SCRAPINGBEE_API_KEY is an assumption)
            'api_key': os.environ['SCRAPINGBEE_API_KEY'],
            'url': url,
        },
    )
    #print('Response HTTP Status Code: ', response.status_code)
    #print('Response HTTP Response Body: ', response.content)
    return response

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Reading the restaurant metadata csv into a dataframe
names_df = pd.read_csv('/content/drive/MyDrive/Data Squad zomato/Zomato Restaurant names and Metadata.csv')

In [None]:
names_df.head()

Unnamed: 0,Name,Links,Cost,Collections,Cuisines,Timings
0,Beyond Flavours,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
1,Paradise,https://www.zomato.com/hyderabad/paradise-gach...,800,Hyderabad's Hottest,"Biryani, North Indian, Chinese",11 AM to 11 PM
2,Flechazo,https://www.zomato.com/hyderabad/flechazo-gach...,1300,"Great Buffets, Hyderabad's Hottest","Asian, Mediterranean, North Indian, Desserts","11:30 AM to 4:30 PM, 6:30 PM to 11 PM"
3,Shah Ghouse Hotel & Restaurant,https://www.zomato.com/hyderabad/shah-ghouse-h...,800,Late Night Restaurants,"Biryani, North Indian, Chinese, Seafood, Bever...",12 Noon to 2 AM
4,Over The Moon Brew Company,https://www.zomato.com/hyderabad/over-the-moon...,1200,"Best Bars & Pubs, Food Hygiene Rated Restauran...","Asian, Continental, North Indian, Chinese, Med...","12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no..."


In [None]:
# Creating a new column to store the html string of each url
names_df['content'] = np.nan

In [None]:
# Scraping through each url and storing the html string in the content column of the data frame
#for i,url in enumerate(names_df['Links']):
#  response = send_request(url)
#  content = response.content
#  names_df.loc[i,'content'] = str(content)

In [None]:
# Writing the dataframe to a csv so that the scraped content is not lost
#names_df.to_csv('/content/drive/MyDrive/Data Squad zomato/Nandeesh/names_df_v2.csv')

In [None]:
# reading the dataframe from the csv file again
names_df = pd.read_csv('/content/drive/MyDrive/Data Squad zomato/Nandeesh/names_df_v2.csv')

In [None]:
# checking for null entries
sum(names_df['content'].isnull())

0

In [None]:
names_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Name,Links,Cost,Collections,Cuisines,Timings,content
0,0,0,Beyond Flavours,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)","b'<!DOCTYPE html><html lang=""en"" data-rh=""lang..."
1,1,1,Paradise,https://www.zomato.com/hyderabad/paradise-gach...,800,Hyderabad's Hottest,"Biryani, North Indian, Chinese",11 AM to 11 PM,"b'<!DOCTYPE html><html lang=""en"" data-rh=""lang..."
2,2,2,Flechazo,https://www.zomato.com/hyderabad/flechazo-gach...,1300,"Great Buffets, Hyderabad's Hottest","Asian, Mediterranean, North Indian, Desserts","11:30 AM to 4:30 PM, 6:30 PM to 11 PM","b'<!DOCTYPE html><html lang=""en"" data-rh=""lang..."
3,3,3,Shah Ghouse Hotel & Restaurant,https://www.zomato.com/hyderabad/shah-ghouse-h...,800,Late Night Restaurants,"Biryani, North Indian, Chinese, Seafood, Bever...",12 Noon to 2 AM,"b'<!DOCTYPE html><html lang=""en"" data-rh=""lang..."
4,4,4,Over The Moon Brew Company,https://www.zomato.com/hyderabad/over-the-moon...,1200,"Best Bars & Pubs, Food Hygiene Rated Restauran...","Asian, Continental, North Indian, Chinese, Med...","12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...","b'<!DOCTYPE html><html lang=""en"" data-rh=""lang..."
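
The `Unnamed: 0` style columns in the output above are what pandas produces when a dataframe is written with its index and then read back, because the index comes back as an extra unnamed column. A small sketch of the effect, using an in-memory buffer and a made-up one-row dataframe instead of the Drive files:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Paradise'], 'Cost': [800]})

# Default round trip: the index is written and comes back as 'Unnamed: 0'
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# Writing with index=False keeps the columns unchanged
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))     # ['Unnamed: 0', 'Name', 'Cost']
print(list(without_index.columns))  # ['Name', 'Cost']
```

Passing `index=False` to `to_csv`, or `index_col=0` to `read_csv`, would avoid accumulating these columns across save and load cycles.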


In [None]:
# No null entries, so all the urls were scraped

In [None]:
names_df['Timings']=names_df['Timings'].replace(np.nan,'')

In [None]:
names_df['Timings'] = names_df['Timings'].str.lower()

In [None]:
def multiple_str_replaces(org_str,maps):
  '''
  This function takes a string and a dictionary of mappings, with keys as the substrings to be replaced
  and values as the substrings to replace them with, and returns the string after applying every replacement
  '''
  for l,r in maps.items():
    org_str = org_str.replace(l,r)
  return org_str

In [None]:
mappings = {'noon':'pm','midnight':'am','),':');'}

In [None]:
names_df['Timings'] = names_df['Timings'].apply(lambda x: multiple_str_replaces(x,maps = mappings))
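
To make the effect of these mappings concrete, here is a small sketch on a made-up, already lower-cased timings string. The function is restated so the snippet runs on its own, and the `sample` value is hypothetical.

```python
def multiple_str_replaces(org_str, maps):
    # Same logic as the function above: apply each replacement in turn
    for l, r in maps.items():
        org_str = org_str.replace(l, r)
    return org_str

mappings = {'noon': 'pm', 'midnight': 'am', '),': ');'}

# Hypothetical timings string
sample = '12noon to 11pm (mon, tue, wed, thu, sun), 12noon to 12midnight (fri-sat)'
cleaned = multiple_str_replaces(sample, mappings)
print(cleaned)
# 12pm to 11pm (mon, tue, wed, thu, sun); 12pm to 12am (fri-sat)
```

The replacements are applied in the dictionary's insertion order, so a key that overlaps with the output of an earlier replacement would see the already modified string.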

In [None]:
def drop_closed_days(in_str):
  '''
  This function drops the days on which the restaurant is closed. Ex: '; tue closed' and '; closed (thu)' will be dropped
  '''
  # Raw string so that the escaped parentheses are passed to the regex engine unchanged
  regex = re.compile(r'[a-z]{3} closed|closed \([a-z]{3}\)')
  result = regex.findall(in_str)
  for text in result:
    in_str = in_str.replace('; '+text,'')
  return in_str

In [None]:
names_df['Timings'] = names_df['Timings'].apply(lambda x: drop_closed_days(x))
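
A short sketch of what this step does, on made-up timings strings. The function is restated so the snippet is self-contained, and both inputs are hypothetical.

```python
import re

def drop_closed_days(in_str):
    # Same logic as the function above
    regex = re.compile(r'[a-z]{3} closed|closed \([a-z]{3}\)')
    for text in regex.findall(in_str):
        in_str = in_str.replace('; ' + text, '')
    return in_str

trailing = drop_closed_days('12 pm to 11 pm (mon, wed, thu, fri, sat, sun); tue closed')
print(trailing)
# 12 pm to 11 pm (mon, wed, thu, fri, sat, sun)

# A closed-day note that is not preceded by '; ' is matched but left in place
leading = drop_closed_days('closed (tue); 11 am to 11 pm (wed-mon)')
print(leading)
# closed (tue); 11 am to 11 pm (wed-mon)
```

The second call shows a limitation worth keeping in mind: only notes that follow a `'; '` separator are removed.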

In [None]:
def expand_days(in_str):
  '''
  This function takes a from-to string of week days and replaces it with all the days in between, wrapping around the week if needed.
  For Ex: 'tue-sat' will be replaced by 'tue, wed, thu, fri, sat'
  '''
  days = ['mon','tue','wed','thu','fri','sat','sun']
  in_days = in_str.split('-')
  
  result = ''

  for i,day in enumerate((days*2)[(days*2).index(in_days[0]):]):
    if day == in_days[1]:
      result = result+', '+day
      break
    elif i==0:
      result = result+day
    else:
      result = result+', '+day
  return result
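
The wrap-around behaviour is easiest to see on two inputs. The sketch below restates the function so it runs on its own. The only change is that the two halves of the range are unpacked into `start` and `end`.

```python
def expand_days(in_str):
    days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
    start, end = in_str.split('-')
    result = ''
    # Doubling the week lets a range continue past sunday into the next week
    for i, day in enumerate((days * 2)[(days * 2).index(start):]):
        if day == end:
            result = result + ', ' + day
            break
        elif i == 0:
            result = result + day
        else:
            result = result + ', ' + day
    return result

print(expand_days('tue-sat'))  # tue, wed, thu, fri, sat
print(expand_days('sat-mon'))  # sat, sun, mon
```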

In [None]:
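def open_days(in_str):
  '''
  This function returns the list of days (as a string) on which a restaurant is open, given a string from the 'Timings' column as argument.
  An empty string is returned when no days are listed
  '''
  # Capturing the text inside each pair of parentheses
  regex = re.compile(r".*?\((.*?)\)")
  result = regex.findall(in_str)
  for i,text in enumerate(result):
    # Expanding day ranges such as 'fri-sat' into the individual days
    if '-' in text:
      result[i] = expand_days(text)
  result = ', '.join(result)
  # De-duplicating the days; an empty string is kept as it is
  if result != '':
    result = str(list(set(result.split(', '))))
  return result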

In [None]:
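# Manually setting the timings of row 95 so that the single day and the day range are in separate parentheses, which open_days can parse
names_df.at[95,'Timings'] = '1 pm to 2 am (mon), (wed-sun)'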

In [None]:
names_df['Open_days'] = names_df['Timings'].apply(lambda x: open_days(x))
# Assuming that the restaurants whose open days are not listed are open on all days
names_df['Open_days']=names_df['Open_days'].replace('',str(['mon','tue','wed','thu','fri','sat','sun']))
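
`open_days` passes a whole parenthesised group to `expand_days` whenever it contains a `-`, so a group that mixes a single day with a range would need manual handling. One way to avoid that is sketched below. This is a hypothetical variant and not the notebook's method: `expand_range` and `open_days_tolerant` are illustrative names, and the function returns a list in week order where `open_days` returns a string.

```python
import re

DAYS = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

def expand_range(day_range):
    # 'fri-mon' -> ['fri', 'sat', 'sun', 'mon'], wrapping around the week
    start, end = (d.strip() for d in day_range.split('-'))
    i, j = DAYS.index(start), DAYS.index(end)
    if j < i:
        j += 7
    return [DAYS[k % 7] for k in range(i, j + 1)]

def open_days_tolerant(timings):
    # Hypothetical variant of open_days: split each group on commas first,
    # then expand only the parts that are ranges
    found = set()
    for group in re.findall(r'\((.*?)\)', timings):
        for part in group.split(','):
            part = part.strip()
            if '-' in part:
                found.update(expand_range(part))
            elif part:
                found.add(part)
    # Returning the days in week order makes the output deterministic
    return [d for d in DAYS if d in found]

print(open_days_tolerant('1 pm to 2 am (mon, wed-sun)'))
# ['mon', 'wed', 'thu', 'fri', 'sat', 'sun']
```

An input without any parentheses yields an empty list, which could then be replaced by all seven days under the same assumption used above.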

In [None]:
# Extracting the location, additional services, featured flag and 'known for' text from the scraped html of each restaurant
for i, content in enumerate(names_df['content']):
  # Parsing the html and converting it back to a string so that it can be searched with regular expressions
  soup = str(BeautifulSoup(content,"html.parser"))

  # Parsing the latitude and longitude from the first maps.zomato.com link on the page
  tmp = list(re.finditer('https://maps.zomato.com/',soup))
  if len(tmp) == 0:
    names_df.loc[i,'latitude'] = np.nan
    names_df.loc[i,'longitude'] = np.nan
  else:
    loc = tmp[0].span()[0]
    geo_loc = re.findall('=.+&map',soup[loc:loc+200])[0][1:-4]
    names_df.loc[i,'latitude'] = geo_loc.split(',')[0]
    names_df.loc[i,'longitude'] = geo_loc.split(',')[1]

  # Parsing the list of additional services (stored as a stringified list in the dataframe column)
  tmp_loc = re.search("More Info",soup)
  if tmp_loc is None:
    names_df.loc[i,'additional_services'] = np.nan
  else:
    more_loc = tmp_loc.span()[0]
    tmp = soup[more_loc:]
    inds = list(re.finditer('color="#4F4F4F"',tmp))
    services = list()
    for ind in inds:
      loc = ind.span()[0]
      services.append(re.findall('>.+</p',tmp[loc:loc+50])[0][1:-3])
    names_df.loc[i,'additional_services'] = str(services)

  # Identifying if the restaurant has featured in any of the best lists of the city - binary variable = 1 if featured, 0 otherwise
  names_df.loc[i,'Has_Featured'] = int(len(list(re.finditer('Featured In',soup)))>0)

  # Identifying what people associate this restaurant with
  inds = list(re.finditer("People Say This Place Is Known For",soup))
  if len(inds) == 0:
    names_df.loc[i,'People say this is known for'] = np.nan
  else:
    ind = inds[0].span()[0]
    tmp = soup[ind:ind+500]
    names_df.loc[i,'People say this is known for']=re.findall('color="#4F4F4F">.+</p><h3',tmp)[0][16:-7]
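
The coordinate extraction above relies on string offsets and slicing. The same idea can be sketched with a single regular expression and capture groups. The HTML fragment below is hypothetical: the path and the query parameter names after `maps.zomato.com/` are assumptions, chosen only to match the shape the loop expects (an `=` followed by `latitude,longitude` and then `&map`).

```python
import re

# Hypothetical fragment shaped like what the loop above searches for
html = '<img src="https://maps.zomato.com/php/staticmap?center=17.4286,78.3731&maptype=zomato"/>'

# Capturing latitude and longitude directly instead of slicing the match
pattern = re.compile(r'https://maps\.zomato\.com/[^"]*?=(-?\d+\.\d+),(-?\d+\.\d+)&map')
match = pattern.search(html)

if match is None:
    latitude, longitude = float('nan'), float('nan')
else:
    latitude, longitude = float(match.group(1)), float(match.group(2))

print(latitude, longitude)
```

Unlike the loop above, this sketch converts the coordinates to floats straight away. Whether that is preferable depends on how the columns are used later.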

In [None]:
# dropping the content column
names_df.drop('content',axis = 1,inplace = True)

In [None]:
# Creating a master list to find out how many and what categories of additional services are available in total
master_list_add_servs = list()
# Rows without a 'More Info' section hold nan and are skipped
for add_list in names_df['additional_services'].dropna():
  master_list_add_servs.extend(ast.literal_eval(add_list))
master_list_add_servs = list(set(master_list_add_servs))

In [None]:
len(master_list_add_servs)

9

* So there are 9 unique additional services in total

In [None]:
master_list_add_servs

['Wifi',
 'Valet Parking Available',
 'Indoor Seating',
 'Buffet',
 'Takeaway Available',
 'Home Delivery',
 'Table booking recommended',
 'Seating Available',
 'Romantic Dining']

In [None]:
# Replacing nan values with an empty list
names_df['additional_services'] = names_df['additional_services'].replace(np.nan,'[]')

In [None]:
# Creating a column for each of the 9 additional services through one hot encoding
encode_list = list()
for adds_list in names_df['additional_services']:
  tmp = ast.literal_eval(adds_list)
  tmp_dict = dict()
  for service in master_list_add_servs:
    if service in tmp:
      tmp_dict.update({service:1})
    else:
      tmp_dict.update({service:0})
  encode_list.append(tmp_dict)

# Converting the list of dictionaries to a dataframe
add_services_df = pd.DataFrame(encode_list)

In [None]:
# Appending the new columns to the names_df dataframe
names_df = pd.concat([names_df,add_services_df],axis = 1)
names_df.drop('additional_services',axis = 1,inplace = True)
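
The same one hot encoding can also be expressed with pandas alone. The sketch below is an alternative and not the notebook's method, run on a made-up three-row column. The `toy` dataframe and its values are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical column of stringified lists, as stored in 'additional_services'
toy = pd.DataFrame({'additional_services': ["['Wifi', 'Buffet']", "['Wifi']", "[]"]})

services = toy['additional_services'].apply(ast.literal_eval)

# explode gives one row per (restaurant, service) pair; an empty list becomes NaN,
# which get_dummies encodes as a row of zeros
dummies = pd.get_dummies(services.explode())

# Collapsing back to one row per restaurant
encoded = dummies.groupby(level=0).max().astype(int)
print(encoded)
```

With this approach the set of columns is derived from the data itself, so no separate master list is needed.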

In [None]:
# Cleaning the 'People say this is known for' column
names_df['People say this is known for'] = names_df['People say this is known for'].replace(np.nan,'')

# Some observations in this column have extra unnecessary text that contains 'class='. Identifying them and keeping only the text before the first html tag
bool_series = names_df['People say this is known for'].str.contains('class=')

for i in names_df.loc[bool_series].index:
  tmp = names_df.at[i,'People say this is known for']
  names_df.at[i,'People say this is known for'] = tmp[:tmp.find('<')]
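
The trimming step can also be written without an explicit loop. This is a sketch under the assumption that the wanted text never contains a `<` character. The sample values are made up.

```python
import pandas as pd

# Hypothetical values: one clean entry, one with trailing html, one empty
known_for = pd.Series([
    'Good Crowd, Bartender',
    'Elegantly Decorated</p></div><section class="sc-abc"',
    '',
])

# Keeping only the text before the first '<' in every row
cleaned = known_for.str.split('<', n=1).str[0]
print(cleaned.tolist())
# ['Good Crowd, Bartender', 'Elegantly Decorated', '']
```

Note that this trims every row at the first `<`, whereas the loop above only touches rows that contain `class=`.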

In [None]:
# Writing the updated dataframe to a csv file
names_df.to_csv('/content/drive/MyDrive/Data Squad zomato/Nandeesh/names_df_v3.csv')