# Data Gathering Notebook

This notebook consists of code that will:

1. Search Yelp based on a search phrase and location.
2. Scrape (using Beautiful Soup) Yelp search results to store information for each business found.
3. Scrape (using Beautiful Soup) reviews for each business found in step 2.
4. Store the cleaned results into two csv files for use in other notebooks: `csv_files/business_info.csv` and `csv_files/reviews.csv`

# Import Libraries

In [11]:
import pandas as pd
import numpy as np
import json
import requests
import re
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline

#Need to use a delay between page scrapes in order to limit getting blocked by Yelp
from time import sleep

# Scrape Restaurant Information

Get a list of businesses and their info:
- Name
- Address
- Star Rating
- Price Range
- Number of Reviews
- URL for business Yelp review page
- URL for image used as business icon

### Enter city and food type search terms

In [20]:
#ENTER SEARCH TERMS BELOW:
cuisine_type = "Italian"
location = "Albany, NY"

#Generate URL based on search terms
base_url = "https://www.yelp.com"
search_url = f"{base_url}/search?find_desc={cuisine_type}&find_loc={location}"

#Or manually set search_url by copying directly from Yelp Page if desired
#search_url = "https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY"
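
The f-string above inserts the search terms verbatim, so the comma and space in "Albany, NY" are left for the HTTP client to escape. As a minimal sketch, the same query can be built with the standard library's `urllib.parse.urlencode`, which percent-encodes each value. The helper name `build_search_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.com"

def build_search_url(cuisine_type, location, base_url=BASE_URL):
    """Return a Yelp search URL with both search terms URL-encoded."""
    query = urlencode({"find_desc": cuisine_type, "find_loc": location})
    return f"{base_url}/search?{query}"

example_url = build_search_url("Italian", "Albany, NY")
print(example_url)
```

`urlencode` writes spaces as `+` rather than `%20`; both forms are standard encodings of a space in a query string.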

### Set class names that are used on the Yelp business search results page

In [55]:
star_container_class = "lemon--div__373c0__1mboc attribute__373c0__1hPI_ display--inline-block__373c0__2de_K u-space-r1 border-color--default__373c0__2oFDT"
price_range_class = "lemon--span__373c0__3997G text__373c0__2pB8f priceRange__373c0__2DY87 text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_ text-bullet--after__373c0__1ZHaA"
review_count_class = "lemon--span__373c0__3997G text__373c0__2pB8f reviewCount__373c0__2r4xT text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_"
next_page_class = "lemon--a__373c0__IEZFH link__373c0__29943 next-link navigation-button__373c0__1D3Ug link-color--blue-dark__373c0__1mhJo link-size--default__373c0__1skgq"
search_result_class = "lemon--div__373c0__1mboc searchResult__373c0__1yggB border-color--default__373c0__2oFDT"
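
These full class strings carry what look like build-generated suffixes (for example `__373c0__2r4xT`), so they only match while Yelp serves exactly this markup. As a hedged alternative, Beautiful Soup also accepts a compiled regular expression for `class_` and tests it against each individual class on a tag. The sketch below matches on the readable prefix only; the HTML fragment is made up and simply reuses class names from the cell above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment of a search result, reusing class names from the cell above.
sample_html = """
<div>
  <span class="lemon--span__373c0__3997G text__373c0__2pB8f priceRange__373c0__2DY87">$$</span>
  <span class="lemon--span__373c0__3997G text__373c0__2pB8f reviewCount__373c0__2r4xT">45 reviews</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The pattern is checked against each class token, so "^priceRange" matches
# the token "priceRange__373c0__2DY87" regardless of the other classes.
price_tag = sample_soup.find(class_=re.compile(r"^priceRange"))
review_tag = sample_soup.find(class_=re.compile(r"^reviewCount"))
print(price_tag.text, "|", review_tag.text)
```

Prefix matching is looser than the exact strings used in this notebook, so it may pick up unrelated elements on a real page; it is a trade-off, not a drop-in replacement.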

### Scrape data from each results page

This loop will run continuously until there is no longer a "next page" button found on the bottom of the screen.

In this loop:

1. Request the HTML page and load it into a Beautiful Soup object.
2. Loop through each result found on the page.
3. For each result store desired business information: url, name, address, categories, star rating, price range, number of reviews.
4. Find "next page" button and store url for next pass through the while loop. 
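
Step 3 relies on two small text operations: a regular expression that pulls the first number out of a label, and a string split that isolates the business ID from the link. The sketch below isolates both so they can be checked without a network request. The helper names and the sample strings are hypothetical; the pattern and the split logic are the ones used in the loop.

```python
import re

# Same pattern as in the scraping loop: a decimal number, or else an integer.
NUMBER_PATTERN = r"[-+]?\d*\.\d+|\d+"

def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    matches = re.findall(NUMBER_PATTERN, text)
    return float(matches[0]) if matches else None

def biz_id_from_href(href):
    """Return the business slug between '/biz/' and any query string."""
    return href.split("/biz/")[1].split("?")[0]

# Hypothetical label text and link, shaped like the values the loop reads.
star_rating = first_number("4.5 star rating")
num_reviews = int(first_number("45 reviews"))
biz_id = biz_id_from_href("/biz/nicos-pizzeria-schenectady?osq=burger")
print(star_rating, num_reviews, biz_id)
```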

In [56]:
next_page_url = search_url
page_counter = 1
business_list = []

#Run continuously until there is no longer a "next page" url found.
while next_page_url:
    #Request HTML page and load into Beautiful Soup object
    request = requests.get(next_page_url)
    soup = BeautifulSoup(request.content,'html.parser')
    
    #Find search results container on page.
    search_results = soup.findAll(class_=search_result_class)
    print(f"Page {page_counter}, {len(search_results)-1} results {next_page_url}")
    result_counter = 1
    
    #Loop through search results and store information for each business
    for search_result in search_results:
        business_info = {}
        try:
            business_name_url = search_result.findAll('a', href=True)[1]
            business_info['url'] = f"https://www.yelp.com{business_name_url['href']}"
            business_info['name'] = business_name_url['name']
            business_info['biz_id'] = business_name_url['href'].split('/biz/')[1].split('?')[0]
        except:
            continue
            
        try:
            business_info['address'] = search_result.find('address').text
        except:
            pass
        try:
            business_info['category'] = [category.text for category in search_result.findAll("a",attrs={"role":"link"})]
        except:
            pass
        try:
            business_info['star_rating'] = float(re.findall(r"[-+]?\d*\.\d+|\d+", 
                                                      search_result.find(
                                                          class_=star_container_class).find('div')['aria-label'] )[0] )
        except:
            pass
        try:
            business_info['price_range'] = search_result.find(class_=price_range_class).text
        except:
            pass
        try:
            business_info['num_reviews'] = int(re.findall(r"[-+]?\d*\.\d+|\d+",
                                                      search_result.find(
                                                          class_=review_count_class).text )[0] )
        except:
            pass
        try:
            business_info['image_shown'] = search_result.find('img')['src']
        except:
            pass
        
        #Append business information for each search result to a list containing all businesses.
        if business_info:
            business_list.append(business_info)
            
        result_counter+=1
    
    #Set url for next page. If not found, break out of loop.
    if soup.find(class_=next_page_class):
        next_page_url = base_url + soup.find(class_=next_page_class)['href']
        page_counter+=1
    else:
        break
    
    #Random delay between 2 and 19 seconds (randint's upper bound is exclusive) to prevent getting blocked
    sleep(np.random.randint(2,20))

print(len(business_list), "businesses scraped")

Page 1, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY
Page 2, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=10
Page 3, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=20
Page 4, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=30
Page 5, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=40
Page 6, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=50
Page 7, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=60
Page 8, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=70
Page 9, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=80
Page 10, 10 results https://www.yelp.com/search?find_desc=burger&find_loc=Albany%2C%20NY&start=90
Page 11, 10 results https://www.yelp.c

1761

### Load results into a Pandas DataFrame

In [57]:
business_info_df = pd.DataFrame(business_list)
#Drop businesses with no reviews
business_info_df.dropna(subset=['num_reviews'], inplace=True)
#Drop duplicates
business_info_df.drop(business_info_df[business_info_df.biz_id.duplicated(keep='first')].index, inplace=True)
print(len(business_info_df))
business_info_df.tail()

1081


| | address | biz_id | category | image_shown | name | num_reviews | price_range | star_rating | url |
|---|---|---|---|---|---|---|---|---|---|
| 1756 | | johnnys-to-go-schenectady | [Italian] | https://s3-media4.fl.yelpcdn.com/bphoto/dr6Msn... | Johnny’s To-Go | 5.0 | | 3.0 | https://www.yelp.com/biz/johnnys-to-go-schenec... |
| 1757 | 93 W Campbell Rd | burger-grill-and-middle-eastern-cuisine-schene... | [Imported Food] | https://s3-media1.fl.yelpcdn.com/assets/srv0/y... | Burger Grill & Middle Eastern Cuisine | 2.0 | | 2.0 | https://www.yelp.com/biz/burger-grill-and-midd... |
| 1758 | 908 River St | kennedy-fried-chicken-and-pizza-troy | [Chicken Wings, Pizza, Halal] | https://s3-media1.fl.yelpcdn.com/bphoto/rvwZlp... | Kennedy Fried Chicken & Pizza | 26.0 | $ | 2.0 | https://www.yelp.com/biz/kennedy-fried-chicken... |
| 1759 | 123 4th St | ginos-pizzeria-troy | [Pizza] | https://s3-media1.fl.yelpcdn.com/bphoto/Ie8eEW... | Gino’s Pizzeria | 8.0 | $ | 2.0 | https://www.yelp.com/biz/ginos-pizzeria-troy?o... |
| 1760 | 441 State St | nicos-pizzeria-schenectady | [Pizza, Italian] | https://s3-media3.fl.yelpcdn.com/bphoto/XDhLzF... | Nico’s Pizzeria | 45.0 | $ | 3.0 | https://www.yelp.com/biz/nicos-pizzeria-schene... |
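
The two clean-up steps in the cell above (drop rows without a review count, then keep the first row per `biz_id`) can also be written with `DataFrame.drop_duplicates`. This is a sketch on made-up rows, not a change to the notebook's results; `sample_df` and `cleaned_df` are hypothetical names.

```python
import pandas as pd

# Made-up rows; only biz_id and num_reviews matter for this illustration.
sample_df = pd.DataFrame({
    "biz_id": [
        "ginos-pizzeria-troy",
        "nicos-pizzeria-schenectady",
        "ginos-pizzeria-troy",   # duplicate of the first row
        "no-reviews-yet",        # has no review count
    ],
    "num_reviews": [8.0, 45.0, 8.0, None],
})

cleaned_df = (
    sample_df
    .dropna(subset=["num_reviews"])                   # drop businesses with no reviews
    .drop_duplicates(subset="biz_id", keep="first")   # keep the first row per business
)
print(cleaned_df)
```

Like the original cell, this keeps the original index labels, which is why the table above still ends at 1760 even though only 1081 rows remain.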


In [None]:
#CLEAN UP CATEGORY VALUES - remove parentheses
business_info_categories = []
for category in business_info_df.category:
    cat_list = []
    for cat in category:
        cat = cat.replace('(',' ')
        cat = cat.replace(')',' ')
        cat = re.sub(' +',' ', cat).strip()
        cat_list.append(cat)
    business_info_categories.append(cat_list)
business_info_df.category = business_info_categories

In [13]:
#CLEAN UP BUSINESS NAMES
business_names = []
for name in business_info_df.name:
    name = name.replace('â\x80\x99',"\'")
    business_names.append(name)
business_info_df.name = business_names
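
The sequence `â\x80\x99` replaced above is what the three UTF-8 bytes of a right single quotation mark (U+2019) look like when they are decoded as Latin-1. Assuming that is how the names became garbled, the sketch below reproduces the effect and shows a general round-trip repair. It is an illustration only; note that the cell above substitutes a plain ASCII apostrophe instead of restoring the original character.

```python
original = "Gino\u2019s Pizzeria"  # contains a right single quotation mark (U+2019)

# Decoding the UTF-8 bytes as Latin-1 reproduces the garbled sequence.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the two steps recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repr(garbled), repr(repaired))
```

The round trip only works for text that was garbled in exactly this way; a string that was never mis-decoded may raise `UnicodeEncodeError` or `UnicodeDecodeError`.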

# Scrape Reviews Functions

Yelp serves two versions of its review pages, so two page scrapers are needed to handle the different page structures. The following functions are used to scrape review data:

`get_reviews()`: returns list of reviews for a business across all of the review pages. For each page, it will determine which version of the review page is being used, and call the appropriate review scraper (see below) to gather all review data for the page.

`get_reviews_page_v1()`: Returns a list of review details for a single page of review results.

`get_reviews_page_v2()`: Does the same thing as v1 but uses a different set of class names and a different structure of soup object.

In [264]:
def get_reviews(business_name, business_index, yelp_business_url, verbose=False):
    """
    This function will iterate through all of the review pages for a particular business and
    return a list populated with all reviews found.
    
    INPUTS:
    business_name     = The name of the business. It is contained in the results list records.
    business_index    = The business index (unique identifier). It is contained in the results records.
    yelp_business_url = The URL for the starting page of reviews for the business.
    verbose           = If True, print the URL of each review page as it is parsed.
    
    OUTPUT:
    List of reviews. Each review is a dictionary containing desired review information.
    """
    
    #Class names used in Yelp Review pages.
    #There are two flavors of page design that yelp uses
    search_result_class_v1 = "lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT"
    search_result_class_v2 = "review review--with-sidebar"
    
    #Set starting page (first page of reviews)
    next_page_url = yelp_business_url

    reviews_list = []
    page_counter=1

    #Continue to loop through review pages until there is no longer a "next" link at the bottom.
    while next_page_url:
        if verbose:
            #Print the page url being parsed
            print(f"Page {page_counter}, {next_page_url}")

        #Request html for page and load into BeautifulSoup object.
        request = requests.get(next_page_url)
        soup = BeautifulSoup(request.content,'html.parser')
        
        #Check which version of the page is being used. If neither is found, print error message.
        if len(soup.findAll(class_=search_result_class_v1))!=0:
            reviews_list.extend(get_reviews_page_v1(soup,business_name,business_index,verbose))
        elif len(soup.findAll(class_=search_result_class_v2))!=0:
            reviews_list.extend(get_reviews_page_v2(soup,business_name,business_index,verbose))
        else:
            print("Could not parse page: ", next_page_url)
        
        #Check for "next" page link - update next_page_url if found.
        #Break from while loop if there is no next page.
        if soup.find("link", attrs={'rel':'next'}):
            next_page_url = soup.find("link", attrs={'rel':'next'})['href']
            page_counter+=1
        else:
            break
        
        #Random delay of 1 or 2 seconds (randint's upper bound is exclusive) to prevent getting blocked
        sleep(np.random.randint(1,3))
    
    return reviews_list

In [63]:
def get_reviews_page_v1(soup, business_name, business_index, verbose=False):
    """
    This function will extract reviews information from the BeautifulSoup object representing
    version 1 of a Yelp review page.
    
    INPUTS:
    soup           = BeautifulSoup object to traverse.
    business_name  = The name of the business. It is contained in the results list records.
    business_index = The business index (unique identifier). It is contained in the results records.
    verbose        = If True, print status of review extraction.
    
    OUTPUT:
    List of reviews. Each review is a dictionary containing desired review information.
    """
    search_result_class = "lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT"
    star_container_class = "lemon--div__373c0__1mboc arrange-unit__373c0__1piwO border-color--default__373c0__2oFDT"
    date_class = "lemon--span__373c0__3997G text__373c0__2pB8f text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_"
    pic_class = "lemon--span__373c0__3997G photo-box-grid-item__373c0__2kFqV display--inline__373c0__1DbOG u-space-r2 u-space-b2 border-color--default__373c0__2oFDT"
    pic_url_class = "lemon--img__373c0__3GQUb photo-box-img__373c0__O0tbt"
    
    #Get each review block
    reviews = soup.findAll(class_=search_result_class)
    reviews_list=[]
    skipped_review_counter=0
    #Loop through each review and pull out pertinent information. Put into list of dictionaries.
    for review in reviews:
        try:
            review_info = {}
            review_info["business_name"] = business_name
            review_info["business_index"] = business_index
            review_info["date"] = review.find(class_=date_class).text.strip()
            #review_info["review"] = review.find("span", attrs={"class": "lemon--span__373c0__3997G", "lang": "en"}).text
            review_info["review"] = review.find(attrs={"lang": "en"}).text
            review_info['star_rating'] = float(re.findall(r"[-+]?\d*\.\d+|\d+", 
                           review.find(class_=star_container_class).find('div')['aria-label'] )[0] )
            review_info["pic_count"] = len(review.find_all(class_=pic_class))
            review_info["pic_urls"] = [obj['src'] for obj in review.findAll(class_=pic_url_class)]

            #Sometimes the user id is not being found
            try:
                review_info["user_id"] = review.find('a')['href'].split('userid=')[1]
            except:
                pass

            reviews_list.append(review_info)
        except:
            skipped_review_counter+=1
            
    if verbose:
        if skipped_review_counter!=0:
            print(f"Skipped {skipped_review_counter} reviews")

    return reviews_list

In [64]:
def get_reviews_page_v2(soup, business_name, business_index, verbose=False):
    """
    This function will extract reviews information from the BeautifulSoup object representing
    version 2 of a Yelp review page.
    
    INPUTS:
    soup           = BeautifulSoup object to traverse.
    business_name  = The name of the business. It is contained in the results list records.
    business_index = The business index (unique identifier). It is contained in the results records.
    verbose        = If True, print status of review extraction.
    
    OUTPUT:
    List of reviews. Each review is a dictionary containing desired review information.
    """
    
    search_result_class = "review review--with-sidebar"
    star_container_class = "biz-rating__stars"
    date_class = "rating-qualifier"
    review_photo_box_class = "photo-box-grid clearfix js-content-expandable lightbox-media-parent"
    
    #Get each review block
    reviews = soup.findAll(class_=search_result_class)
    reviews_list=[]
    skipped_review_counter=0
    #Loop through each review and pull out pertinent information. Put into list of dictionaries.
    for review in reviews:
        try:
            review_info = {}
            review_info["business_name"] = business_name
            review_info["business_index"] = business_index
            review_info["date"] = review.find(class_=date_class).text.strip()
            review_info["review"] = review.find(attrs={"lang": "en"}).text
            review_info['star_rating'] = float(re.findall(r"[-+]?\d*\.\d+|\d+", 
                           review.find(class_=star_container_class).find('div')['title'])[0] )
            try:
                pic_line_items = review.find(class_=review_photo_box_class).findAll('li')
                review_info["pic_count"] = len(pic_line_items)
                review_info["pic_urls"] = [obj.find('img')['src'] for obj in pic_line_items]
            except:
                review_info["pic_count"] = 0
                review_info["pic_urls"] = []

            #Sometimes the user id is not being found
            try:
                review_info["user_id"] = review.find('a')['href'].split('userid=')[1]
            except:
                pass

            reviews_list.append(review_info)
        except:
            skipped_review_counter+=1

    if verbose:
        if skipped_review_counter!=0:
            print(f"Skipped {skipped_review_counter} reviews")
            
    return reviews_list

Test the review scraper on a single business in `business_info_df`.

In [78]:
#TESTING SCRAPER FOR A SINGLE BUSINESS
index_num = 4
business_url = business_info_df.url[index_num]
business_name = business_info_df.name[index_num]
business_index = business_info_df.biz_id[index_num]

reviews_df = pd.DataFrame(get_reviews(business_name,business_index, business_url,verbose=True))

Page 1, https://www.yelp.com/biz/kuma-ani-albany-4?osq=restaurants
Page 2, https://www.yelp.com/biz/kuma-ani-albany-4?osq=restaurants&start=20


In [79]:
reviews_df.head()

| | business_index | business_name | date | pic_count | pic_urls | review | star_rating | user_id |
|---|---|---|---|---|---|---|---|---|
| 0 | kuma-ani-albany-4 | Kuma Ani | 9/1/2019 | 3 | [https://s3-media2.fl.yelpcdn.com/bphoto/FLsrY... | I heard this place makes its own ramen noodles... | 5.0 | 6HOtEJ8wp9Q3XkpLVStKnA |
| 1 | kuma-ani-albany-4 | Kuma Ani | 9/11/2019 | 3 | [https://s3-media4.fl.yelpcdn.com/bphoto/IVQgv... | Alright, a ramen place in Colonie! It's a welc... | 4.0 | JRSIPNLEpWnZkoFGGO-WDw |
| 2 | kuma-ani-albany-4 | Kuma Ani | 9/6/2019 | 3 | [https://s3-media3.fl.yelpcdn.com/bphoto/Bf2jH... | As someone that loves japanese food when I hea... | 4.0 | MVIlQinGxgwPH_XOUpoXrA |
| 3 | kuma-ani-albany-4 | Kuma Ani | 8/26/2019 | 3 | [https://s3-media2.fl.yelpcdn.com/bphoto/O8WK_... | I am very excited to have a ramen place in an ... | 5.0 | YX927Jezj85xgM_XX6bVxw |
| 4 | kuma-ani-albany-4 | Kuma Ani | 8/25/2019 | 0 | [] | So, like I said... my 1st visit would not be m... | 5.0 | 7MohwGwMyJaErZ_SRQQNdQ |


# Scrape ALL Reviews for each Business in initial search
Get all reviews for businesses listed in `business_info_df`

This step is slow and fragile: scraping every page takes time, and Yelp may block requests if it suspects that I am a bot. You may have to stop and resume this code several times.

`all_reviews` will contain all of the reviews for all businesses. 

`last_index_completed` is the position in `business_info_df` (as used by `.iloc`) at which scraping starts or resumes.

In [120]:
# Set container
last_index_completed = 0
all_reviews = []

Use the code below to set `last_index_completed` when you want to continue scraping reviews after stopping.

In [1]:
#Use this cell to set the index starting point
#last_index_completed = 828

### Loop through restaurants and get reviews for each

This loop iterates over each business in `business_info_df`, gathers the reviews from all of that business's review pages, and appends them to the `all_reviews` list.

In this loop:

1. For each business, get the URL of its first page of reviews from `business_info_df`.
2. Pass the url into `get_reviews()` which will iteratively scrape all review pages for the business.
3. Add results to master `all_reviews` list. 
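
Because the counter is only advanced after a business has been scraped successfully, an interrupted run can pick up at the business that failed. The sketch below shows that stop-and-resume pattern in isolation. `scrape_with_resume` and `fake_fetch` are hypothetical stand-ins (the latter replaces `get_reviews()` so the example runs offline), and catching the exception to report the resume point is an assumption about how one might handle a block, not something the notebook does.

```python
def scrape_with_resume(businesses, fetch_reviews, start_index=0):
    """Scrape businesses[start_index:] and report where to resume after a failure.

    Returns (reviews, next_index); next_index == len(businesses) when finished.
    """
    reviews = []
    index = start_index
    while index < len(businesses):
        try:
            reviews.extend(fetch_reviews(businesses[index]))
        except Exception as error:
            # Leave index pointing at the business that failed.
            print(f"Stopped at index {index}: {error}")
            break
        index += 1
    return reviews, index

# Stand-in for get_reviews() that fails on one business.
def fake_fetch(name):
    if name == "blocked":
        raise RuntimeError("request blocked")
    return [f"review of {name}"]

businesses = ["a", "b", "blocked", "c"]
first_reviews, resume_at = scrape_with_resume(businesses, fake_fetch)

# Later run: continue from the saved index with a fetcher that now succeeds.
more_reviews, finished_at = scrape_with_resume(
    businesses, lambda name: [f"review of {name}"], start_index=resume_at
)
```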

In [493]:
while last_index_completed < len(business_info_df):
    biz = business_info_df.iloc[last_index_completed]
    
    #Random delay between 2 and 4 seconds (randint's upper bound is exclusive) to try to prevent getting blocked
    sleep(np.random.randint(2,5))
    print(last_index_completed, int(biz['num_reviews']),"reviews\t", biz['name'], biz['url'], end='')
    
    #Get all reviews for the business and add to list.
    all_reviews.extend(get_reviews(biz['name'],biz['biz_id'],biz['url'])) 

    print(" completed")
    last_index_completed+=1

1076 5 reviews	 Johnny’s To-Go https://www.yelp.com/biz/johnnys-to-go-schenectady?osq=burger completed
1077 2 reviews	 Burger Grill & Middle Eastern Cuisine https://www.yelp.com/biz/burger-grill-and-middle-eastern-cuisine-schenectady?osq=burger completed
1078 26 reviews	 Kennedy Fried Chicken & Pizza https://www.yelp.com/biz/kennedy-fried-chicken-and-pizza-troy?osq=burger completed
1079 8 reviews	 Gino’s Pizzeria https://www.yelp.com/biz/ginos-pizzeria-troy?osq=burger completed
1080 45 reviews	 Nico’s Pizzeria https://www.yelp.com/biz/nicos-pizzeria-schenectady?osq=burger completed


In [526]:
all_reviews_df = pd.DataFrame(all_reviews)
len(all_reviews_df)

59274

### Clean the Date Field

I noticed that some date strings have "updated" after the date, indicating that the review was updated. I want to strip this text from the date field.

In [12]:
#Scrub date strings and convert to datetime type
scrubbed_dates = [re.sub(r"\D\D\D", "", date) for date in all_reviews_df.date]
all_reviews_df.date = pd.to_datetime(scrubbed_dates)
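
The substitution above removes every run of three non-digit characters, which works as long as the leftover characters do not confuse `pd.to_datetime`. A hedged alternative is to extract the date itself rather than delete what surrounds it. In the sketch below, `extract_date` is a hypothetical helper and the raw strings are assumptions about what the scraped field can look like.

```python
import re

# Matches dates written as M/D/YYYY, the format seen in the scraped reviews.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

def extract_date(raw):
    """Return the first M/D/YYYY date in raw, or None if no date is present."""
    match = DATE_PATTERN.search(raw)
    return match.group(0) if match else None

# Hypothetical raw strings; the exact text Yelp appends is an assumption.
raw_dates = ["9/1/2019", "8/26/2019\n    Updated review"]
clean_dates = [extract_date(raw) for raw in raw_dates]
print(clean_dates)
```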

# Write dataframes to CSV files

Currently commented out because I have scraped everything I need.

In [11]:
#business_info_df.to_csv('csv_files/business_info_all.csv',index=False)
#all_reviews_df.to_csv('csv_files/reviews_all.csv',index=False)