# Web scraping - Yelp (Restaurants)

I am going to scrape restaurant information from Yelp.

1. Setting Environment
2. Collect Information from the First item
3. Combine all the code & create functions
4. Save the data to a CSV file

## 1. Setting environment

In [1]:
# libraries
from bs4 import BeautifulSoup
from selenium import webdriver 
import requests
import pandas as pd
from time import sleep

In [19]:
# Set the URL for scraping
url = 'https://www.yelp.ca/search?find_desc=&find_loc=Markham%2C+ON&ns=1'
res = requests.get(url)
driver = webdriver.Chrome('chromedriver.exe')

In [20]:
# Check that the URL was read successfully (200 means OK)
res

<Response [200]>

In [21]:
# Parse the page as HTML
soup = BeautifulSoup(res.text,'html.parser')

## 2. Collect information from the first Item

Before scraping all the pages, I am going to scrape only the first item's information.<br>
The fields are the **restaurant's name, rating, number of reviews, category, price range, and area.**

In [22]:
items = soup.find_all('div',attrs={'class':'container__09f24__sxa9-'})
item = items[0]
item.text

'1.\xa0Duo Patisserie & Cafe407DessertsCoffee & TeaBakeries$$This is a placeholder“Their pastries always impress because the sweetness and flavour is just right, ontop of the beautiful presentation of their pastries. They also open really…”\xa0more'

In [7]:
items[9].h4

<h4 class="css-1l5lt1i"><span class="css-1pxmz4g">10<!-- -->. <a class="css-166la90" href="/biz/felix-and-norton-markham-2" name="Felix &amp; Norton">Felix &amp; Norton</a></span></h4>

In [8]:
# Restaurant's name
name = item.find('a',attrs={'class':'css-166la90'}).text
name

'Duo Patisserie & Cafe'

In [9]:
# Rating
rating = item.find('div',attrs={'class':'i-stars__09f24___sZu0'})['aria-label']
rating

'4.5 star rating'
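The rating comes back as a label string rather than a number. If a numeric value is wanted later (for sorting or averaging), one possible approach is to pull the leading number out of the label. This is only a sketch: `parse_rating` is a hypothetical helper that is not part of the original notebook, and it assumes the label always starts with the number, as in the output above.

```python
import re
from typing import Optional


def parse_rating(label: str) -> Optional[float]:
    """Return the leading number of a label such as '4.5 star rating', or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", label)
    return float(match.group(1)) if match else None


print(parse_rating("4.5 star rating"))  # 4.5
print(parse_rating("4 star rating"))    # 4.0
print(parse_rating(""))                 # None
```

Returning `None` for an unparseable label keeps a missing rating distinguishable from a real rating of zero.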

In [10]:
# Number of reviews
num_review = item.find('span',attrs={'class':'reviewCount__09f24__3GsGY'}).text
num_review = int(num_review)
num_review

407

In [11]:
# Category
kind_of_food = item.find('p',attrs={'class':'text__09f24__2NHRu'}).text
kind_of_food

'Desserts'

In [12]:
# Price range
price_range = item.find('span', attrs={'class':'priceRange__09f24__2GspP'}).text
price_range

'$$'

In [13]:
# Area
area = item.p.text.split('$')[-1]
area

''
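The empty string above is what `split('$')[-1]` produces when nothing follows the price range. The sketch below reproduces that behaviour on made-up strings that only imitate what `item.p.text` might contain (they are assumptions, not scraped values), and shows a hypothetical helper, `area_from_text`, that also returns an empty string when the text has no `$` at all.

```python
def area_from_text(text: str) -> str:
    """Return whatever follows the last '$' in text, or '' if there is no '$'."""
    _, sep, tail = text.rpartition("$")
    return tail.strip() if sep else ""


# Hypothetical strings imitating item.p.text (not real scraped output)
with_area = "Comfort Food$$Unionville"
without_area = "DessertsCoffee & TeaBakeries$$"
no_price = "Coffee & Tea"

# The original expression: fine when an area follows the price range ...
print(repr(with_area.split("$")[-1]))     # 'Unionville'
# ... empty when the text ends with the price range ...
print(repr(without_area.split("$")[-1]))  # ''
# ... and the whole text when there is no price range at all
print(repr(no_price.split("$")[-1]))      # 'Coffee & Tea'

# The helper treats the last case as "no area"
print(repr(area_from_text(no_price)))     # ''
```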

In [14]:
# Number of items on one page; 10 is the expected value
len(items)

10

## 3. Combine all the Code & Create functions

By using the code above, I am going to create some functions to collect all the information.

#### Function: get_info()
1. Initialize a list called "data" (all the information will be stored in this list).
2. Create a function called "get_info" that loops over all the items on a page and extracts their information.
3. For each item, store the collected information in a dictionary called "details".
4. Append the dictionary to the "data" list.
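Each field lookup can fail when a tag is missing, so every lookup needs a fallback value. One way to express that pattern once, instead of repeating a `try`/`except` per field, is a small wrapper. The sketch below is an assumption-laden illustration: `extract` is a hypothetical helper that is not in the notebook, and `missing = None` stands in for a BeautifulSoup `find()` call that found nothing.

```python
from typing import Any, Callable


def extract(getter: Callable[[], Any], default: Any = "") -> Any:
    """Run getter and return its value, or default if the lookup fails."""
    try:
        return getter()
    except (AttributeError, TypeError, KeyError, ValueError):
        return default


# Stand-in for a lookup that found nothing: find() returns None in that case
missing = None

print(repr(extract(lambda: missing.text)))     # ''
print(extract(lambda: int("407"), default=0))  # 407
print(extract(lambda: int("n/a"), default=0))  # 0
```

Inside the loop, a field could then be read as `name = extract(lambda: item.find('a', attrs={'class': 'css-166la90'}).text)`, with the default chosen per field.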

#### Function: main()
1. This function runs get_info().
2. It calls get_info() once for each page in the page list.
3. After each call, it opens the next page and repeats the process.
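The page loop depends on generating one search URL per page. As a rough sketch, the URLs can be built from a result offset rather than a hand-typed list. It assumes that the `start` query parameter is a 0-based offset of the first result and that each page holds 10 results (the count observed earlier); `build_page_urls` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from typing import List
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.ca/search"


def build_page_urls(location: str, pages: int, per_page: int = 10) -> List[str]:
    """Build one search URL per page, stepping the 'start' offset by per_page."""
    urls = []
    for start in range(0, pages * per_page, per_page):
        query = urlencode(
            {"find_desc": "", "find_loc": location, "ns": 1, "start": start}
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_page_urls("Markham, ON", pages=24)
print(len(urls))  # 24
print(urls[0])    # https://www.yelp.ca/search?find_desc=&find_loc=Markham%2C+ON&ns=1&start=0
print(urls[-1])   # https://www.yelp.ca/search?find_desc=&find_loc=Markham%2C+ON&ns=1&start=230
```

Deriving the offsets from `range` avoids accidental overlaps between pages, and `urlencode` takes care of escaping the comma and space in the location.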

In [36]:
# Collect the details of every item on a page into the shared list "data"
data = []
def get_info(items):
    
    # Loop over the items; fall back to a default value when a tag is missing
    for item in items:
        try:
            name = item.find('a',attrs={'class':'css-166la90'}).text
        except AttributeError:
            name =  ""
        
        try:
            rating = item.find('div',attrs={'class':'i-stars__09f24___sZu0'})['aria-label']
        except (TypeError, KeyError):
            rating =  ""

        try:
            num_review = item.find('span',attrs={'class':'reviewCount__09f24__3GsGY'}).text
            num_review = int(num_review)
        except (AttributeError, ValueError):
            num_review =  0
        
        try:
            kind_of_food = item.find('p',attrs={'class':'text__09f24__2NHRu'}).text
        except AttributeError:
            kind_of_food =  ""

        try:
            price_range = item.find('span',attrs={'class':'priceRange__09f24__2GspP'}).text
        except AttributeError:
            price_range =  ""

        try:
            # Text after the last '$' of the first <p>; empty if nothing follows the price
            area = item.p.text.split('$')[-1]
        except AttributeError:
            area =  ""
        
        # Put the information in the dictionary "details"
        details = {}
        
        details['Name'] = name
        details['Rating'] = rating
        details['Review_count'] = num_review
        details['category'] = kind_of_food
        details['price_range'] = price_range
        details['area'] = area
        
        # Add the dictionary to the list "data"
        data.append(details)
        

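def main():  
    # Selenium 3 style call; newer Selenium versions expect the driver path via a Service object
    driver = webdriver.Chrome('chromedriver.exe')
    # "start" is assumed to be the 0-based offset of the first result: 24 pages x 10 items
    page_list = range(0, 240, 10)

    for page in page_list:
        url = "https://www.yelp.ca/search?find_desc=&find_loc=Markham%2C%20Ontario&ns=1&start={}".format(page)
        # Open the page in the browser; the parsing below uses the requests response
        driver.get(url)
        res = requests.get(url)
        soup = BeautifulSoup(res.text,'html.parser')
        items = soup.find_all('div',attrs={'class':'container__09f24__sxa9-'})
        get_info(items)
        # Pause between requests
        sleep(2)

    # Close the browser once all pages have been visited
    driver.quit()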

In [37]:
# Run main function
main()

In [38]:
# Number of items
len(data)

240

In [39]:
# Check if the information is stored properly
data[:5]

[{'Name': 'Smash Kitchen & Bar',
  'Rating': '4 star rating',
  'Review_count': 762,
  'category': 'Comfort Food',
  'price_range': '$$',
  'area': 'Unionville'},
 {'Name': 'NextDoor Restaurant',
  'Rating': '4 star rating',
  'Review_count': 475,
  'category': 'Canadian (New)',
  'price_range': '$$',
  'area': 'Unionville Mainstreet'},
 {'Name': 'Alchemy Coffee',
  'Rating': '4 star rating',
  'Review_count': 528,
  'category': 'Coffee & Tea',
  'price_range': '$$',
  'area': 'Unionville'},
 {'Name': 'Inspire Restaurant',
  'Rating': '4 star rating',
  'Review_count': 494,
  'category': 'Asian Fusion',
  'price_range': '$$',
  'area': 'Unionville'},
 {'Name': 'Fat Ninja Bite',
  'Rating': '4.5 star rating',
  'Review_count': 507,
  'category': 'Japanese',
  'price_range': '$$',
  'area': 'Milliken'}]

In [40]:
# Put the data into a DataFrame using pandas
df = pd.DataFrame(data)

In [41]:
# First 5 rows 
df.head()

| | Name | Rating | Review_count | category | price_range | area |
|---|---|---|---|---|---|---|
| 0 | Smash Kitchen & Bar | 4 star rating | 762 | Comfort Food | $$ | Unionville |
| 1 | NextDoor Restaurant | 4 star rating | 475 | Canadian (New) | $$ | Unionville Mainstreet |
| 2 | Alchemy Coffee | 4 star rating | 528 | Coffee & Tea | $$ | Unionville |
| 3 | Inspire Restaurant | 4 star rating | 494 | Asian Fusion | $$ | Unionville |
| 4 | Fat Ninja Bite | 4.5 star rating | 507 | Japanese | $$ | Milliken |


In [42]:
# Generate a CSV file
df.to_csv('yelp.csv',encoding='utf-8', index=False)

## Successfully saved a CSV file!

The CSV file was saved in the same directory as the notebook, and it has 240 rows and 6 columns, as expected.
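A quick way to confirm the row and column counts is to read the file back and count them. The sketch below uses the standard-library `csv` module instead of pandas, purely to keep it dependency-free. `csv_shape` is a hypothetical helper, and the demo writes a two-row sample (taken from the records shown earlier) to a temporary file rather than touching `yelp.csv`.

```python
import csv
import os
import tempfile
from typing import Tuple


def csv_shape(path: str) -> Tuple[int, int]:
    """Return (data rows, columns) of a CSV file, not counting the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Two records copied from the scraped output shown earlier
sample = [
    {"Name": "Smash Kitchen & Bar", "Rating": "4 star rating", "Review_count": 762,
     "category": "Comfort Food", "price_range": "$$", "area": "Unionville"},
    {"Name": "NextDoor Restaurant", "Rating": "4 star rating", "Review_count": 475,
     "category": "Canadian (New)", "price_range": "$$", "area": "Unionville Mainstreet"},
]

# Write the sample to a temporary file so the check does not depend on yelp.csv
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(sample[0].keys()))
    writer.writeheader()
    writer.writerows(sample)

print(csv_shape(path))  # (2, 6)
```

Calling `csv_shape('yelp.csv')` on the real file should give `(240, 6)` if the file matches the counts stated above.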