### Unit 1 Homework:  Scraping the Yelp Website

Welcome!  For this homework assignment you'll build a web scraper that extends what was covered in our web scraping class.

The assignment builds on the lab work from that class, where we created a dataset listing the name, number of reviews and price range for each restaurant on the following web page: https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1

**What You'll Turn In:**

A finished Jupyter notebook that walks us through the steps you took to get your results.  Provide notes where appropriate to explain what you are doing.

The notebook should produce a finished dataset at the end.  

If for some reason you're experiencing problems with the final result, please let someone know when turning it in.
 
The homework is divided into five tiers of increasing difficulty:

##### Tier 1: Five Columns From the First Page

At the most basic level, you will need to extend what we did in class and create a dataset with five columns and 30 rows.  You will not need to go beyond the first page to complete this tier.

##### Tier 2:  100 Row Dataset With At Least 3 Columns

For this tier, take 3 of your columns from Tier 1 and extend them across multiple pages of the Yelp website.  You should account appropriately for missing values.

##### Tier 3:  100 Row Dataset With At Least 5 Columns

Very similar to Tier 2, but with this many columns you will have to handle some columns that frequently have missing values, whereas in Tier 2 you could likely skip them if you wanted to.

##### Tier 4:  100 Row Dataset With At Least 5 Columns + Individual Restaurant Categories

Restaurants often have several categories associated with them, and grabbing each one as a separate value can be challenging.  To complete this tier, you'll have to find a way to 'pick out' each individual category as its own column value.
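
One possible way to give each category its own column is to pad every restaurant's category list to the same length.  The sketch below assumes the categories have already been collected as one list per restaurant; the helper name `categories_to_columns` and the `Category N` column names are illustrative and not part of the assignment.

```python
def categories_to_columns(category_lists, fill_value=None):
    """Turn per-restaurant category lists into equal-length column lists."""
    # The restaurant with the most categories decides how many columns are needed.
    width = max((len(cats) for cats in category_lists), default=0)
    columns = {}
    for position in range(width):
        columns[f'Category {position + 1}'] = [
            cats[position] if position < len(cats) else fill_value
            for cats in category_lists
        ]
    return columns


# Hypothetical input: one list of categories per restaurant.
example = [
    ['British', 'Steakhouses', 'Cocktail Bars'],
    ['Indian'],
    [],
]
category_columns = categories_to_columns(example)
print(category_columns)
```

The resulting dictionary can be merged into the dictionary you pass to `pd.DataFrame`, since every column has one entry per restaurant.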

##### Tier 5:  Unlimited Row Dataset With At Least 5 Columns + Individual Restaurant Categories

Take what you did in Tier 4, and extend it so that the code works with an arbitrary number of pages.  I.e., regardless of how many pages there are listing the best restaurants in London, your scraper will find them and parse their information into clean datasets.
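
A minimal sketch of one way to handle an unknown number of pages is to keep requesting pages until one comes back empty.  Here `scrape_page` is a hypothetical stand-in for your own per-page logic, the step of 10 results per page is an assumption based on the `start` offsets 0, 10 and 20 used in the scraper, and `max_pages` is an invented safety cap.

```python
def scrape_all_pages(scrape_page, step=10, max_pages=1000):
    """Collect rows page by page until a page returns nothing."""
    all_rows = []
    for page_number in range(max_pages):
        rows = scrape_page(page_number * step)
        if not rows:
            # An empty page is treated as the end of the listing.
            break
        all_rows.extend(rows)
    return all_rows


# Fake page function for demonstration only: 25 results in total.
def fake_scrape_page(start):
    results = [f'Restaurant {n}' for n in range(25)]
    return results[start:start + 10]


rows = scrape_all_pages(fake_scrape_page)
print(len(rows))
```

In a real run, `scrape_page` would request the URL with `start` set to the given offset and return the parsed rows for that page.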

### Hints

Here are a few tips that will save you time when completing this assignment:

 - The name, average rating, total ratings and neighborhood of a restaurant tend to be the 'easy' ones because they rarely have missing values, so whatever logic you use on the first page will typically apply to all pages.  They are a good place to start.
 - Phone numbers, price ranges and reviews are more commonly missing, so if you are trying to collect them across multiple pages you should expect to do some error handling.
 - You can specify any sort of selector when using the `find_all()` method, not just `class`.  For example, imagine you have the following `<div>` tag:
    `<div class='main-container red-blue-green' role='front-unit' aria-select='left-below'>Some content here</div>`
    
   This means that when you use `scraper.find_all('div')`, you can pass in arguments like `scraper.find_all('div', {'role': 'front-unit'})` or anything else that allows you to isolate that particular tag.
 - When specifying selectors like `{'class': 'dkght__384Ko'}`, sometimes less is more.  If you pass a list of values for an attribute, such as `{'class': ['first-class', 'second-class']}` (placeholder names), you are asking for tags that match **any one of these** values, not all of them.  So if your results are large, try different combinations of selectors to get the smallest result set possible.
 - If the values you are scraping are not reliably present, use the 'outside in' technique: grab a parent container that holds the element, then search inside it to check whether the value is there.  The best way to do this is to find a container that is unique to each restaurant.  That gives you a reliable parent element for every restaurant, and within *each of these* you can search for `<p>`, `<a>`, `<div>`, and `<span>` tags and apply further logic.
 - When you get results from `BeautifulSoup`, you will be given data that's denoted as either `bs4.element.Tag` or `bs4.element.ResultSet`.  They are **not the same**.  Critically, you can search a `bs4.element.Tag` for further items, but you cannot do this with a `bs4.element.ResultSet`.  
 
   For example, let's say you grab all of the divs from a page with `scraper.find_all('div')` and save the result as the variable `total_divs`.  This means `total_divs` will look something like this:  
   
   `[<div><p>Div content</p><p>Second paragraph</p></div>,`
      `<div><p>Div content</p><p>Second paragraph</p></div>,`
      `<div><p>Div content</p><p>Second paragraph</p></div>]`
      
   In this case the variable `total_divs` is a result set and there's nothing else you can do to it directly.  However, every item within `total_divs` is a tag, which means you can scrape it further.  
   
   So if you wanted you could write a line like:  `total_paragraphs = [div.find_all('p') for div in total_divs]`, and get the collection of paragraphs within each div.  
   
   If you confuse the two you'll get the following error message:  
   
   `AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?`
 - The values of the different selectors change periodically on Yelp, so if your scraper suddenly stops working, that's probably why.  I.e., if you have a command like `scraper.find_all('div', {'class': '485dk0W__container09'})` that no longer returns results, the class `485dk0W__container09` may now be `r56kW__container14` or something similar.
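
The hints above can be tried out on a small, made-up HTML snippet.  The markup, the `role` values and the class names below are invented for the example and do not match Yelp's real page; the sketch only illustrates selecting on an attribute other than `class`, the difference between a `ResultSet` and a `Tag`, and the 'outside in' pattern with a missing value.

```python
from bs4 import BeautifulSoup

# Invented markup: the second restaurant has no price element.
html = """
<div role="restaurant"><a class="name">Cafe One</a><span class="price">££</span></div>
<div role="restaurant"><a class="name">Cafe Two</a></div>
<div role="advert"><a class="name">Sponsored</a></div>
"""

scraper = BeautifulSoup(html, 'html.parser')

# Select on any attribute, not just class; the result is a ResultSet.
containers = scraper.find_all('div', {'role': 'restaurant'})

names, prices = [], []
for container in containers:
    # Each item in the ResultSet is a Tag, so it can be searched further.
    name_tag = container.find('a', {'class': 'name'})
    price_tag = container.find('span', {'class': 'price'})
    # find() returns None when nothing matches, so record a placeholder.
    names.append(name_tag.get_text() if name_tag is not None else None)
    prices.append(price_tag.get_text() if price_tag is not None else None)

print(names)
print(prices)
```

Because every restaurant contributes exactly one entry to each list, the lists stay the same length even when a value is missing.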

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [226]:
# Offsets for the `start` query parameter (three pages of results)
page_ranges = [0, 10, 20]
all_titles = []
total_num_reviews = []
all_price_ranges = []
all_avg_ratings = []
all_reviews = []
all_cuisines = []

for i in page_ranges:
    url = f'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start={i}'
    req = requests.get(url).text
    # Name the parser explicitly so the result does not depend on what is installed
    scraper = BeautifulSoup(req, 'html.parser')

    # Restaurant names: strip the closing tag, then keep the text after the first '>'
    titles = scraper.find_all('a', {'class': 'css-166la90'})
    titles = [str(title) for title in titles]
    titles = [title.replace('</a', '') for title in titles]
    titles = [title.split('>')[1] for title in titles]
    # Drop 'more' links and anchors that wrap other tags rather than a name
    titles = [title for title in titles if title != 'more' and '<div' not in title and '<span' not in title]
    all_titles += titles

    # Number of reviews: keep only the spans whose text is purely digits
    num_reviews = scraper.find_all('span', {'class': 'css-e81eai'})
    num_reviews = [str(review) for review in num_reviews]
    num_reviews = [review.replace('</span>', '') for review in num_reviews]
    num_reviews = [review.split('>')[1] for review in num_reviews]
    num_reviews = [review for review in num_reviews if review.isdigit()]
    total_num_reviews += num_reviews

    # Price ranges: keep only the spans that contain a pound sign ('\xA3')
    price_ranges = scraper.find_all('span', {'class': 'css-xtpg8e'})
    price_ranges = [str(range_) for range_ in price_ranges]
    price_ranges = [range_.replace('</span>', '') for range_ in price_ranges]
    price_ranges = [range_.split('>')[1] for range_ in price_ranges]
    price_ranges = [range_ for range_ in price_ranges if '\xA3' in range_]
    all_price_ranges += price_ranges

    # Average ratings: the first word of the aria-label is the numeric rating
    avg_ratings = scraper.find_all('div', {'class': 'i-stars__09f24__1T6rz'})
    avg_ratings = [float(item['aria-label'].split()[0]) for item in avg_ratings]
    all_avg_ratings += avg_ratings

    # Cuisines: strip the closing tag and the nested span opening, then keep the link text
    cuisines = scraper.find_all('a', {'class': 'css-1joxor6'})
    cuisines = [str(cuisine) for cuisine in cuisines]
    cuisines = [cuisine.replace('</a>', '') for cuisine in cuisines]
    cuisines = [cuisine.replace('<span class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"','') for cuisine in cuisines]
    cuisines = [cuisine.replace('see all', '') for cuisine in cuisines]
    cuisines = [cuisine.split('>')[1] for cuisine in cuisines]
    all_cuisines += cuisines

    # Review snippets: keep non-trivial paragraphs, first 10 per page, without the trailing 'more'
    reviews = scraper.find_all('p', {'class':'css-e81eai'})
    reviews = [str(review.text) for review in reviews]
    reviews = [review for review in reviews if len(review)>1]
    reviews = [review.replace('\xa0more','') for review in reviews[:10]]
    all_reviews += reviews
    

In [115]:
all_titles

['The Mayfair Chippy',
 'Dishoom',
 'The Breakfast Club',
 'Flat Iron',
 'Ffiona’s Restaurant',
 'Dishoom',
 'The Fat Bear',
 'Restaurant Gordon Ramsay',
 'Mother Mash',
 'NOPI',
 'Sketch',
 'The Golden Chippy',
 'Honest Burgers Meard St - Soho',
 'Padella',
 'Regency Café',
 'BAO - Soho',
 'Mestizo',
 'Hawksmoor Seven Dials',
 'Lanzhou Noodle Bar',
 'The Queens Arms',
 'Abeno',
 'Yauatcha',
 'Duck &amp; Waffle',
 'The Palomar Restaurant',
 'Wahaca',
 'Barrafina',
 'Blacklock',
 'Kennington Lane Cafe',
 'Burger &amp; Lobster',
 'Homeslice Neal’s Yard']
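
Some of the names above still contain the HTML entity `&amp;`, because they were taken from the raw tag string.  One way to clean them up, sketched here with the standard library's `html.unescape` on a small sample of the list, is:

```python
import html

# A sample of the scraped names; in the notebook this would be all_titles.
raw_titles = ['Duck &amp; Waffle', 'Burger &amp; Lobster', 'Dishoom']
clean_titles = [html.unescape(title) for title in raw_titles]
print(clean_titles)
```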

In [124]:
total_num_reviews

['282',
 '1841',
 '494',
 '380',
 '267',
 '547',
 '122',
 '204',
 '470',
 '272',
 '826',
 '106',
 '278',
 '206',
 '184',
 '152',
 '342',
 '351',
 '120',
 '101',
 '480',
 '705',
 '104',
 '310',
 '59',
 '109',
 '96',
 '291',
 '228',
 '166']

In [126]:
all_price_ranges

['££',
 '££',
 '££',
 '££',
 '££',
 '££',
 '££',
 '££££',
 '££',
 '£££',
 '£££',
 '££',
 '££',
 '££',
 '££',
 '££',
 '£££',
 '£',
 '££',
 '££',
 '£££',
 '£££',
 '£££',
 '££',
 '££',
 '£££',
 '£',
 '££',
 '££',
 '£££']

In [131]:
all_avg_ratings

[4.5,
 4.5,
 4.0,
 4.5,
 4.5,
 4.5,
 4.5,
 4.5,
 4.0,
 4.5,
 4.0,
 5.0,
 4.5,
 4.5,
 4.0,
 4.0,
 4.5,
 4.0,
 4.5,
 4.5,
 4.0,
 4.0,
 4.5,
 4.0,
 4.5,
 4.5,
 5.0,
 4.0,
 4.5,
 4.0]

In [227]:
all_cuisines 

['See all',
 'See all',
 'See all',
 '',
 'Fish &amp; Chips',
 'Indian',
 'British',
 'Steakhouses',
 'French',
 'British',
 'American (Traditional)',
 'Soul Food',
 'Cajun/Creole',
 'Indian',
 'French',
 'Modern European',
 'Cocktail Bars',
 'Coffee &amp; Tea',
 'Breakfast &amp; Brunch',
 'American (Traditional)',
 'Mediterranean',
 '',
 '',
 'See all',
 'See all',
 'See all',
 '',
 'Italian',
 'Mexican',
 'Bars',
 'Fish &amp; Chips',
 'Chinese',
 'Noodles',
 'Modern European',
 'Bars',
 'British',
 'Middle Eastern',
 'Mediterranean',
 'Burgers',
 'Cafes',
 'British',
 'Pubs',
 'Gastropubs',
 'Dim Sum',
 'Seafood',
 'Noodles',
 '',
 '',
 'See all',
 'See all',
 'See all',
 '',
 'British',
 'Steakhouses',
 'Cocktail Bars',
 'British',
 'Japanese',
 'Taiwanese',
 'Caribbean',
 'Japanese',
 'Mexican',
 'Korean',
 'Cafes',
 'British',
 'Cocktail Bars',
 'Steakhouses',
 '',
 '']

In [220]:
all_cuisines = [cuisine.remove]

['See all',
 'See all',
 'See all',
 ' class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 'Fish &amp; Chips',
 'Indian',
 'Coffee &amp; Tea',
 'Breakfast &amp; Brunch',
 'American (Traditional)',
 'Steakhouses',
 'British',
 'Indian',
 'French',
 'British',
 'American (Traditional)',
 'Soul Food',
 'Cajun/Creole',
 'British',
 'Mediterranean',
 ' class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 ' class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 'See all',
 'See all',
 'See all',
 ' class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 'French',
 'Modern European',
 'Cocktail Bars',
 'Fish &amp; Chips',
 'Burgers',
 'Italian',
 'Taiwanese',
 'Mexican',
 'Bars',
 'British',
 'Steakhouses',
 'Cocktail Bars',
 'Chinese',
 'Noodles',
 'British',
 'Pubs',
 'Gastropubs',
 'Japanese',
 ' 

In [159]:
Food_types 

['Indian',
 'Coffee &amp; Tea',
 'Breakfast &amp; Brunch',
 'American (Traditional)',
 'Steakhouses',
 'British',
 'Indian',
 'American (Traditional)',
 'Soul Food',
 'Cajun/Creole',
 'French',
 'British',
 'British',
 'Mediterranean',
 '<span class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 '<span class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 'See all',
 'See all',
 'See all',
 '<span class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 'French',
 'Modern European',
 'Cocktail Bars',
 'Fish &amp; Chips',
 'Burgers',
 'Italian',
 'Taiwanese',
 'Mexican',
 'Bars',
 'British',
 'Steakhouses',
 'Cocktail Bars',
 'Chinese',
 'Noodles',
 'British',
 'Pubs',
 'Gastropubs',
 'Japanese',
 '<span class="display--inline__09f24__EhyFv margin-r0-5__09f24__zUVuo border-color--default__09f24__1eOdn"',
 '<span class="display--inline__09f24__E

In [169]:
reviews = scraper.find_all('p', {'class':'css-e81eai'})

In [172]:
reviews = [str(review.text) for review in reviews]

In [175]:
reviews = [review for review in reviews if len(review)>1]

In [191]:
all_reviews

['“One of the best fish ever with the most tasty chips.\n\nAll of the sauce was on point and it really gave loads of flavour”',
 '“Hard to find a way to add any higher praise to the restaurant of the decade in London. Great food, great decor, great folks running it, even great bathrooms.…”',
 "“By far one of my most favorite breakfast places in London! If you're in the area I highly recommend making time to go. It's super small inside and its always…”",
 '“Went to London for vacation and stopped by this  place for dinner! We were originally trying to find nandos (first time trying) but we saw this place across…”',
 "“Ffiona's is easily my favorite restaurant in London. The whole experience, from the food, to the atmosphere, to Ffiona herself, felt like family/home.\n\nThe…”",
 "“I visited Dishoom during my recent London trip (pre-COVID) and can't wait to go back.\nI went for breakfast and got the vegan Bombay; it was the best vegan meal…”",
 "“WOW, this place is delicious!\n\nOur famil

In [198]:
df_dict = {
    'Name': all_titles, 
    'Num Reviews': total_num_reviews,
    'Price Range': all_price_ranges,
    'Avg Ratings': all_avg_ratings,
    'First Review': all_reviews,
}

df = pd.DataFrame(df_dict)
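
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which tends to happen once a column has missing values.  A small check like the following can show which column is short before the DataFrame is built; the helper name `column_lengths` and the sample data are illustrative only.

```python
def column_lengths(columns):
    """Return the length of each column list, keyed by column name."""
    return {name: len(values) for name, values in columns.items()}


# Hypothetical columns where one price range is missing.
sample = {
    'Name': ['A', 'B', 'C'],
    'Price Range': ['££', '£'],
}
lengths = column_lengths(sample)
mismatched = len(set(lengths.values())) > 1
print(lengths)
print(mismatched)
```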

In [199]:
df

| | Name | Num Reviews | Price Range | Avg Ratings | First Review |
|---|---|---|---|---|---|
| 0 | The Mayfair Chippy | 282 | ££ | 4.5 | “One of the best fish ever with the most tasty... |
| 1 | Dishoom | 1841 | ££ | 4.5 | “Hard to find a way to add any higher praise t... |
| 2 | The Breakfast Club | 494 | ££ | 4.0 | “By far one of my most favorite breakfast plac... |
| 3 | Flat Iron | 380 | ££ | 4.5 | “Went to London for vacation and stopped by th... |
| 4 | Ffiona’s Restaurant | 267 | ££ | 4.5 | “Ffiona's is easily my favorite restaurant in ... |
| 5 | Dishoom | 547 | ££ | 4.5 | “I visited Dishoom during my recent London tri... |
| 6 | The Fat Bear | 122 | ££ | 4.5 | “WOW, this place is delicious!\n\nOur family s... |
| 7 | Restaurant Gordon Ramsay | 204 | ££££ | 4.5 | “Compared to Michelin 3-star restaurants in Ca... |
| 8 | Mother Mash | 470 | ££ | 4.0 | “Soho is full of culture and amazing places to... |
| 9 | NOPI | 272 | £££ | 4.5 | “10/10 recommend!\nGood cocktails to have as a... |


In [228]:
food_types = scraper.find_all('div',{'class':'container__09f24__21w3G'})

In [231]:
food_types[0].find_all('a', {'class': 'css-1joxor6'})

[<a class="css-1joxor6" href="/search?cflt=british&amp;find_loc=London%2C+United+Kingdom" name="" rel="" role="link" target="">British</a>,
 <a class="css-1joxor6" href="/search?cflt=steak&amp;find_loc=London%2C+United+Kingdom" name="" rel="" role="link" target="">Steakhouses</a>,
 <a class="css-1joxor6" href="/search?cflt=cocktailbars&amp;find_loc=London%2C+United+Kingdom" name="" rel="" role="link" target="">Cocktail Bars</a>]
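
The tags returned above still carry their markup.  A minimal sketch of pulling out just the category names with `get_text()`, re-using the first two links from that output inside a container `<div>` as sample input (the sample string is abbreviated and only stands in for the live page):

```python
from bs4 import BeautifulSoup

# Abbreviated sample built from the output above; not the full Yelp markup.
sample = (
    '<div class="container__09f24__21w3G">'
    '<a class="css-1joxor6" href="/search?cflt=british&amp;find_loc=London%2C+United+Kingdom">British</a>'
    '<a class="css-1joxor6" href="/search?cflt=steak&amp;find_loc=London%2C+United+Kingdom">Steakhouses</a>'
    '</div>'
)

container = BeautifulSoup(sample, 'html.parser').find('div')
# Search inside the single container Tag, then keep only the link text.
links = container.find_all('a', {'class': 'css-1joxor6'})
categories = [link.get_text() for link in links]
print(categories)
```

Running the same comprehension over every item in `food_types` would give one list of categories per container.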