### Unit 1 Homework:  Scraping the Yelp Website

Welcome!  For this homework assignment you'll be tasked with building a web scraper in a manner that builds on what was covered in our web scraping class.

The assignment will extend the lab work done during that time, where we built a dataset that listed the name, number of reviews and price range for each restaurant on the following web page: https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1

**What You'll Turn In:**

A finished Jupyter notebook that walks us through the steps you took to get your results.  Provide notes where appropriate to explain what you are doing.

The notebook should produce a finished dataset at the end.  

If for some reason you're experiencing problems with the final result, please let someone know when turning it in.
 
The homework is divided into five tiers, each of which has an increasing level of difficulty:

##### Tier 1: Five Columns From the First Page

At the most basic level for this assignment, you will need to extend what we did in class, and create a dataset that has five columns in it that are 30 rows long.  This means you will not need to go off the first page in order to complete this section.

##### Tier 2:  100 Row Dataset With At Least 3 Columns

For this portion of the assignment, take 3 of your columns from step 1, and extend them out to multiple pages on the yelp website.  You should appropriately account for the presence of missing values.

##### Tier 3:  100 Row Dataset With At Least 5 Columns

Very similar to Tier 2, but with this many columns you will be forced to deal with some columns that frequently have missing values, whereas in Tier 2 you could likely skip them if you wanted to.  

##### Tier 4:  100 Row Dataset With At Least 5 Columns + Individual Restaurant Categories

Restaurants often have different categories associated with them, so grabbing them individually as separate values is often challenging.  To complete this tier, you'll have to find a way to 'pick out' each of the individual categories as their own separate column value.  

##### Tier 5:  Unlimited Row Dataset With At Least 5 Columns + Individual Restaurant Categories

Take what you did in Tier 4, and extend it so that the code will work with an arbitrary number of pages.  I.e., regardless of how many pages there are listing the best restaurants in London, your scraper will find them and parse their information into clean datasets.

### Hints

Here are a few tips that will save you time when completing this assignment:

 - The name, average rating, total ratings and neighborhood of a restaurant tend to be the 'easy' ones, because they rarely have missing values, so whatever logic you use on the first page will typically apply to all pages.  They are a good place to start.
 - Phone numbers, price ranges and reviews are more commonly missing, so if you are trying to get a larger number of items from them across multiple pages you should expect to do some error handling.
 - You can specify any sort of selector when using the `find_all()` method, not just `class`.  For example, imagine you have the following `<div>` tag:
    `<div class='main-container red-blue-green' role='front-unit' aria-select='left-below'>Some content here</div>`
    
   This means that when you use `scraper.find_all('div')`, you can pass in arguments like `scraper.find_all('div', {'role': 'front-unit'})` or anything else that allows you to isolate that particular tag.
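 - When specifying selectors like `{'class': 'dkght__384Ko'}`, sometimes less is more.  A single class name matches any tag that has that class, even when the tag carries other classes as well.  If you pass a list of class names, you are saying return a tag with **any one of these** classes, not all of them, while a space-separated string of classes only matches a tag whose class attribute is exactly that string.  So if your results are large, try different combinations of selectors to get the smallest results possible.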
 - If you begin dealing with values that are unreliably entered, you should use the 'outside in' technique where you grab a parent container that holds the element and find a way to check to see if a particular value is there by scraping it further.  The best way to do this is to try and find a unique container for every single restaurant.  This means that you will have a reliable parent element for every single restaurant, and within *each of these* you can search for `<p>`, `<a>`, `<div>`, and `<span>` tags and apply further logic.
 - When you get results from `BeautifulSoup`, you will be given data that's denoted as either `bs4.element.Tag` or `bs4.element.ResultSet`.  They are **not the same**.  Critically, you can search a `bs4.element.Tag` for further items, but you cannot do this with a `bs4.element.ResultSet`.  
 
   For example, let's say you grab all of the divs from a page with `scraper.find_all('div')` and save it as the variable `total_divs`.  This means `total_divs` will look something like this:  
   
   `[<div><p>Div content</p><p>Second paragraph</p></div>,`
      `<div><p>Div content</p><p>Second paragraph</p></div>,`
      `<div><p>Div content</p><p>Second paragraph</p></div>]`
      
   In this case the variable `total_divs` is a result set and there's nothing else you can do to it directly.  However, every item within `total_divs` is a tag, which means you can scrape it further.  
   
   So if you wanted you could write a line like:  `total_paragraphs = [div.find_all('p') for div in total_divs]`, and get the collection of paragraphs within each div.  
   
   If you confuse the two you'll get the following error message:  
   
   `AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?`
 - The values of the different selectors change periodically on Yelp, so if your scraper suddenly stops working that's probably why.  I.e., if you have a command like `scraper.find_all('div', {'class': '485dk0W__container09'})` that no longer returns results, the class `485dk0W__container09` may now be `r56kW__container14` or something similar.
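
The selector, `Tag` vs. `ResultSet`, and 'outside in' hints can be tried out on a small inline snippet before pointing them at Yelp. The sketch below is only an illustration: the HTML and the class names `card`, `name` and `price` are made up for the example and are not Yelp's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two "restaurant" containers, the second has no price.
html = """
<div class="card" role="front-unit"><a class="name">Cafe A</a><span class="price">££</span></div>
<div class="card" role="front-unit"><a class="name">Cafe B</a></div>
<div class="card" role="footer">Not a restaurant</div>
"""

scraper = BeautifulSoup(html, "html.parser")

# Any attribute can be used as a selector, not just class.
containers = scraper.find_all("div", {"role": "front-unit"})

# containers is a ResultSet; each item inside it is a Tag that can be searched again.
rows = []
for container in containers:
    name_tag = container.find("a", {"class": "name"})
    price_tag = container.find("span", {"class": "price"})  # None when the tag is missing
    rows.append({
        "Name": name_tag.text if name_tag is not None else None,
        "Price range": price_tag.text if price_tag is not None else None,
    })

print(rows)
```

Because every restaurant gets exactly one row, a missing price shows up as `None` in the right place instead of shifting the values that come after it.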

# Jake's Coding Solutions

## Tier 1: Five Columns From the First Page
- First is me showing my work
- Below that is my cleaned-up, all-in-one-cell code block

### Tier 1 - Showing My Work
This is going to involve a few steps:
- Do an http "get" request to essentially import a webpage's source code
- "Scrape" the source code with the BeautifulSoup library, which basically means we find valuable nuggets in the HTML using tricks like looking for patterns in tags and CSS classes
- Turn that scraped data into an actual structured Pandas DataFrame (which makes it available for cool Pandas data science stuff)

In [1]:
# We're going to need three Python libraries to make this code work:
import requests   # this library will help us make http requests, which is how we get webpages
from bs4 import BeautifulSoup   # this library will help us parse html source code (i.e., webscraping)
import pandas as pd   # this library will help us with data science stuff

In [2]:
# So let's get to requestin'!

url_to_scrape = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

http_response = requests.get(url_to_scrape)   # this actually requests the page and stores the resulting response

Just for knowledge, http_response is its own weirdo object, not just a string or some other standard data type.

In [3]:
print(f"The data type of http_response is {type(http_response)}")

The data type of http_response is <class 'requests.models.Response'>


Now that we've got the source html from Yelp, we need to put it in some kind of usable format so we can work with it.

In [4]:
# first we'll get the text version of the http_response
source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

# then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # naming the parser avoids a warning and keeps results consistent across machines

Just for knowledge, yelp_scrape is a bs4 object, not just a string or some other standard data type.

In [5]:
print(f"The data type of yelp_scrape is {type(yelp_scrape)}")

The data type of yelp_scrape is <class 'bs4.BeautifulSoup'>


Now we can use this bs4 object to parse ("scrape") the html source code.

Tier 1 asks us to create a dataset of five restaurant attributes from the first page. I'll choose the following:
- Restaurant name
- Restaurant rank
- Restaurant price range
- Restaurant star rating
- Restaurant neighborhood

In [6]:
# we're going to structure this data as a dictionary of lists
london_yelp_restaurants = {
    "Name": [],
    "Rank": [],
    "Price range": [],
    "Star rating": [],
    "Neighborhood": []
}

Let's look for names first.  

We'll start by going to the website in Chrome, opening the developer tools, and inspecting restaurant title elements.  This zooms us to the corresponding html code.  I'm looking for some sort of common pattern of HTML tags or CSS classes I can use to identify all the restaurant titles.

In [7]:
# found the restaurant name inside of <a> tags, with a specific-looking css class, so we'll go with that
scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

# let's take a look at what we found:
scraped_names

[<a class="css-166la90" href="/biz/the-mayfair-chippy-london-2?osq=Restaurants" name="The Mayfair Chippy" rel="" target="">The Mayfair Chippy</a>,
 <a class="css-166la90" href="/biz/dishoom-london?osq=Restaurants" name="Dishoom" rel="" target="">Dishoom</a>,
 <a class="css-166la90" href="/biz/flat-iron-london-2?osq=Restaurants" name="Flat Iron" rel="" target="">Flat Iron</a>,
 <a class="css-166la90" href="/biz/ffionas-restaurant-london?osq=Restaurants" name="Ffiona’s Restaurant" rel="" target="">Ffiona’s Restaurant</a>,
 <a class="css-166la90" href="/biz/dishoom-london-7?osq=Restaurants" name="Dishoom" rel="" target="">Dishoom</a>,
 <a class="css-166la90" href="/biz/the-breakfast-club-london-2?osq=Restaurants" name="The Breakfast Club" rel="" target="">The Breakfast Club</a>,
 <a class="css-166la90" href="/biz/restaurant-gordon-ramsay-london-3?osq=Restaurants" name="Restaurant Gordon Ramsay" rel="" target="">Restaurant Gordon Ramsay</a>,
 <a class="css-166la90" href="/biz/the-fat-bear-

In [8]:
# looks like we got the goods, but we need to narrow it down more
# this will be easier if we just get at the text between the tags using the .text attribute
scraped_names = [scraped_name.text for scraped_name in scraped_names]

scraped_names

['The Mayfair Chippy',
 'Dishoom',
 'Flat Iron',
 'Ffiona’s Restaurant',
 'Dishoom',
 'The Breakfast Club',
 'Restaurant Gordon Ramsay',
 'The Fat Bear',
 'NOPI',
 'Sketch',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '']

In [9]:
# now it's clear to see how we can narrow the list
scraped_names = [name for name in scraped_names if len(name) >1]  # this will be a problem if a restaurant has a one-character name...

scraped_names

['The Mayfair Chippy',
 'Dishoom',
 'Flat Iron',
 'Ffiona’s Restaurant',
 'Dishoom',
 'The Breakfast Club',
 'Restaurant Gordon Ramsay',
 'The Fat Bear',
 'NOPI',
 'Sketch']
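
The length check works on this page, but it would keep a two-digit pagination link and drop a one-character restaurant name. A possible alternative, sketched below, filters on content instead of length. The helper name `looks_like_restaurant_name` and the extra `'10'` entry are assumptions added for illustration, and the approach has its own blind spot: a restaurant whose name is made up only of digits would be dropped.

```python
# Sample values taken from the scraped output above, plus a hypothetical '10' page link.
raw_names = ['The Mayfair Chippy', 'Dishoom', 'Sketch', '2', '3', '9', '10', '']


def looks_like_restaurant_name(text):
    """Return True unless the text is empty or made up only of digits."""
    cleaned = text.strip()
    return bool(cleaned) and not cleaned.isdigit()


names = [name for name in raw_names if looks_like_restaurant_name(name)]
print(names)
```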

In [10]:
# let's double-check to make sure there are ten results
print(f"# of scraped names: {len(scraped_names)}")

# of scraped names: 10


In [11]:
# et voila!  The names for the first ten restaurants.  Let's put it in the dictionary!

london_yelp_restaurants['Name'] = scraped_names

london_yelp_restaurants

{'Name': ['The Mayfair Chippy',
  'Dishoom',
  'Flat Iron',
  'Ffiona’s Restaurant',
  'Dishoom',
  'The Breakfast Club',
  'Restaurant Gordon Ramsay',
  'The Fat Bear',
  'NOPI',
  'Sketch'],
 'Rank': [],
 'Price range': [],
 'Star rating': [],
 'Neighborhood': []}

Now let's apply similar logic and scrape out the rest of the data we want.  I'm purposefully going to leave in the scratchwork so I can show my logic, but maybe I'll create a cleaned-up version at the end.

Next up is ranks.

In [12]:
scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

scraped_ranks

[<span class="css-1pxmz4g">1<!-- -->. <a class="css-166la90" href="/biz/the-mayfair-chippy-london-2?osq=Restaurants" name="The Mayfair Chippy" rel="" target="">The Mayfair Chippy</a></span>,
 <span class="css-1pxmz4g">2<!-- -->. <a class="css-166la90" href="/biz/dishoom-london?osq=Restaurants" name="Dishoom" rel="" target="">Dishoom</a></span>,
 <span class="css-1pxmz4g">3<!-- -->. <a class="css-166la90" href="/biz/flat-iron-london-2?osq=Restaurants" name="Flat Iron" rel="" target="">Flat Iron</a></span>,
 <span class="css-1pxmz4g">4<!-- -->. <a class="css-166la90" href="/biz/ffionas-restaurant-london?osq=Restaurants" name="Ffiona’s Restaurant" rel="" target="">Ffiona’s Restaurant</a></span>,
 <span class="css-1pxmz4g">5<!-- -->. <a class="css-166la90" href="/biz/dishoom-london-7?osq=Restaurants" name="Dishoom" rel="" target="">Dishoom</a></span>,
 <span class="css-1pxmz4g">6<!-- -->. <a class="css-166la90" href="/biz/the-breakfast-club-london-2?osq=Restaurants" name="The Breakfast Clu

In [13]:
scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

In [14]:
scraped_ranks  # see what we got

['1.\xa0The Mayfair Chippy',
 '2.\xa0Dishoom',
 '3.\xa0Flat Iron',
 '4.\xa0Ffiona’s Restaurant',
 '5.\xa0Dishoom',
 '6.\xa0The Breakfast Club',
 '7.\xa0Restaurant Gordon Ramsay',
 '8.\xa0The Fat Bear',
 '9.\xa0NOPI',
 '10.\xa0Sketch']

In [15]:
# now parse out the number
scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

scraped_ranks

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

In [16]:
# let's assume we want integers, so convert
scraped_ranks = [int(rank) for rank in scraped_ranks]

scraped_ranks

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
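
The slicing and integer conversion above can also be folded into one small helper. This is just a sketch of the same idea using `str.split`; the function name `split_rank_and_name` is made up for the example, and it assumes every heading contains at least one `.` (it raises a `ValueError` otherwise).

```python
def split_rank_and_name(heading_text):
    """Split a heading like '1.\xa0The Mayfair Chippy' into (1, 'The Mayfair Chippy')."""
    # Split only on the first '.', so names that contain a '.' stay intact.
    rank_part, name_part = heading_text.split('.', 1)
    # strip() also removes the non-breaking space ('\xa0') in front of the name.
    return int(rank_part), name_part.strip()


samples = ['1.\xa0The Mayfair Chippy', '10.\xa0Sketch']
parsed = [split_rank_and_name(text) for text in samples]
print(parsed)
```

A side benefit is that the name comes out of the same tag as the rank, so the two values cannot drift out of step with each other.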

In [17]:
# we have what we want!  Now let's add it to the dictionary

london_yelp_restaurants['Rank'] = scraped_ranks

london_yelp_restaurants

{'Name': ['The Mayfair Chippy',
  'Dishoom',
  'Flat Iron',
  'Ffiona’s Restaurant',
  'Dishoom',
  'The Breakfast Club',
  'Restaurant Gordon Ramsay',
  'The Fat Bear',
  'NOPI',
  'Sketch'],
 'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 'Price range': [],
 'Star rating': [],
 'Neighborhood': []}

Now let's do price ranges!

In [18]:
scraped_price_ranges = yelp_scrape.find_all('span', {'class': 'priceRange__09f24__2O6le css-xtpg8e'})

scraped_price_ranges

[<span class="priceRange__09f24__2O6le css-xtpg8e">££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">££££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">£££</span>,
 <span class="priceRange__09f24__2O6le css-xtpg8e">£££</span>]

In [19]:
scraped_price_ranges = [price.text for price in scraped_price_ranges]

scraped_price_ranges

['££', '££', '££', '££', '££', '££', '££££', '££', '£££', '£££']

In [20]:
# we have what we want!  Now let's add it to the dictionary

london_yelp_restaurants['Price range'] = scraped_price_ranges

london_yelp_restaurants

{'Name': ['The Mayfair Chippy',
  'Dishoom',
  'Flat Iron',
  'Ffiona’s Restaurant',
  'Dishoom',
  'The Breakfast Club',
  'Restaurant Gordon Ramsay',
  'The Fat Bear',
  'NOPI',
  'Sketch'],
 'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 'Price range': ['££',
  '££',
  '££',
  '££',
  '££',
  '££',
  '££££',
  '££',
  '£££',
  '£££'],
 'Star rating': [],
 'Neighborhood': []}

Now on to star rating!

In [21]:
scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

scraped_star_ratings

[<div aria-label="4.5 star rating" class="i-stars__09f24__1T6rz i-stars--regular-4-half__09f24__1YrPo border-color--default__09f24__1eOdn overflow--hidden__09f24__3z7CX" role="img"><img alt="" class="offscreen__09f24__1VFco" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132"/></div>,
 <div aria-label="4.5 star rating" class="i-stars__09f24__1T6rz i-stars--regular-4-half__09f24__1YrPo border-color--default__09f24__1eOdn overflow--hidden__09f24__3z7CX" role="img"><img alt="" class="offscreen__09f24__1VFco" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132"/></div>,
 <div aria-label="4.5 star rating" class="i-stars__09f24__1T6rz i-stars--regular-4-half__09f24__1YrPo border-color--default__09f24__1eOdn overflow--hidden__09f24__3z7CX" role="img"><img alt="" class="offscreen__09f24__1VFco" height="560" src="https://s3-media0.fl.yelpcdn.com/

In [22]:
len(scraped_star_ratings)

10

In [23]:
scraped_star_ratings = [rating.text for rating in scraped_star_ratings]

scraped_star_ratings

['', '', '', '', '', '', '', '', '', '']

Oh snap!  There is no text between the tags here...can we get at this "aria-label" tag attribute instead of the text?

In [24]:
scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

scraped_star_ratings

[<div aria-label="4.5 star rating" class="i-stars__09f24__1T6rz i-stars--regular-4-half__09f24__1YrPo border-color--default__09f24__1eOdn overflow--hidden__09f24__3z7CX" role="img"><img alt="" class="offscreen__09f24__1VFco" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132"/></div>,
 <div aria-label="4.5 star rating" class="i-stars__09f24__1T6rz i-stars--regular-4-half__09f24__1YrPo border-color--default__09f24__1eOdn overflow--hidden__09f24__3z7CX" role="img"><img alt="" class="offscreen__09f24__1VFco" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132"/></div>,
 <div aria-label="4.5 star rating" class="i-stars__09f24__1T6rz i-stars--regular-4-half__09f24__1YrPo border-color--default__09f24__1eOdn overflow--hidden__09f24__3z7CX" role="img"><img alt="" class="offscreen__09f24__1VFco" height="560" src="https://s3-media0.fl.yelpcdn.com/

In [25]:
scraped_star_ratings[0]

<div aria-label="4.5 star rating" class="i-stars__09f24__1T6rz i-stars--regular-4-half__09f24__1YrPo border-color--default__09f24__1eOdn overflow--hidden__09f24__3z7CX" role="img"><img alt="" class="offscreen__09f24__1VFco" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132"/></div>

In [26]:
scraped_star_ratings[0].attrs   # the .attrs attribute holds all of the tag's attributes

{'class': ['i-stars__09f24__1T6rz',
  'i-stars--regular-4-half__09f24__1YrPo',
  'border-color--default__09f24__1eOdn',
  'overflow--hidden__09f24__3z7CX'],
 'aria-label': '4.5 star rating',
 'role': 'img'}

In [27]:
# that returned a dictionary, so we just need to call the key to get its corresponding value
scraped_star_ratings[0].attrs['aria-label']  # this gives us the value for the aria-label attribute

'4.5 star rating'

In [28]:
# now that we've learned that, we can iterate through the list!

scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

scraped_star_ratings

['4.5 star rating',
 '4.5 star rating',
 '4.5 star rating',
 '4.5 star rating',
 '4.5 star rating',
 '4 star rating',
 '4.5 star rating',
 '4.5 star rating',
 '4.5 star rating',
 '4 star rating']

In [29]:
# sweet!  now let's get rid of the "star rating" suffix and convert to a float
scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

scraped_star_ratings

[4.5, 4.5, 4.5, 4.5, 4.5, 4.0, 4.5, 4.5, 4.5, 4.0]
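
Slicing off the suffix works as long as every label ends in " star rating". A slightly more defensive sketch is shown below; it takes the first word of the label and falls back to `None` when the label is missing or doesn't start with a number. The helper name `parse_star_rating` is an assumption for this example, not something from the class material.

```python
def parse_star_rating(label):
    """Turn '4.5 star rating' into 4.5; return None for a missing or unexpected label."""
    parts = label.split() if label else []
    if not parts:
        return None
    try:
        return float(parts[0])
    except ValueError:
        # The first word wasn't a number, so treat the rating as missing.
        return None


labels = ['4.5 star rating', '4 star rating', None, 'no rating yet']
ratings = [parse_star_rating(label) for label in labels]
print(ratings)
```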

In [30]:
len(scraped_star_ratings)  # double-check we still have 10 items and they match the website

10

In [31]:
# awesome!  now we put it in the dictionary

london_yelp_restaurants['Star rating'] = scraped_star_ratings

london_yelp_restaurants

{'Name': ['The Mayfair Chippy',
  'Dishoom',
  'Flat Iron',
  'Ffiona’s Restaurant',
  'Dishoom',
  'The Breakfast Club',
  'Restaurant Gordon Ramsay',
  'The Fat Bear',
  'NOPI',
  'Sketch'],
 'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 'Price range': ['££',
  '££',
  '££',
  '££',
  '££',
  '££',
  '££££',
  '££',
  '£££',
  '£££'],
 'Star rating': [4.5, 4.5, 4.5, 4.5, 4.5, 4.0, 4.5, 4.5, 4.5, 4.0],
 'Neighborhood': []}

Let's do a quick DataFrame check to make sure we're headed in the right direction.  Probably should have done this for the first scraped list.

In [32]:
# turns out that DataFrame requires all lists to be equal lengths, so we need to kick out the neighborhood for now
test_dictionary = {}

for element in london_yelp_restaurants:   # element here represents the key (a string), not the key+value pair
    print(element)
    if len(london_yelp_restaurants[element]) > 0:   # if the value's length is > 0
        test_dictionary[element] = london_yelp_restaurants[element]   # assign the key and its corresponding value
        
test_dictionary

Name
Rank
Price range
Star rating
Neighborhood


{'Name': ['The Mayfair Chippy',
  'Dishoom',
  'Flat Iron',
  'Ffiona’s Restaurant',
  'Dishoom',
  'The Breakfast Club',
  'Restaurant Gordon Ramsay',
  'The Fat Bear',
  'NOPI',
  'Sketch'],
 'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 'Price range': ['££',
  '££',
  '££',
  '££',
  '££',
  '££',
  '££££',
  '££',
  '£££',
  '£££'],
 'Star rating': [4.5, 4.5, 4.5, 4.5, 4.5, 4.0, 4.5, 4.5, 4.5, 4.0]}

In [33]:
pd.DataFrame(test_dictionary)

| | Name | Rank | Price range | Star rating |
|---|---|---|---|---|
| 0 | The Mayfair Chippy | 1 | ££ | 4.5 |
| 1 | Dishoom | 2 | ££ | 4.5 |
| 2 | Flat Iron | 3 | ££ | 4.5 |
| 3 | Ffiona’s Restaurant | 4 | ££ | 4.5 |
| 4 | Dishoom | 5 | ££ | 4.5 |
| 5 | The Breakfast Club | 6 | ££ | 4.0 |
| 6 | Restaurant Gordon Ramsay | 7 | ££££ | 4.5 |
| 7 | The Fat Bear | 8 | ££ | 4.5 |
| 8 | NOPI | 9 | £££ | 4.5 |
| 9 | Sketch | 10 | £££ | 4.0 |


In retrospect, to avoid creating this type of test dictionary, it's probably best practice not to create an empty dictionary at the beginning, but rather just append dictionary key+value pairs as you go, so you can just call DataFrame whenever.
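
Another way to catch the unequal-length problem early is to compare the column lengths before handing the dictionary to `pd.DataFrame`. The sketch below is one possible check; the helper names `column_lengths` and `ready_for_dataframe` are made up for illustration, and the sample dictionary is a shortened stand-in for `london_yelp_restaurants`.

```python
def column_lengths(columns):
    """Return the number of values stored under each column name."""
    return {name: len(values) for name, values in columns.items()}


def ready_for_dataframe(columns):
    """Return True when every column list has the same length."""
    return len(set(column_lengths(columns).values())) <= 1


# Shortened stand-in for the dictionary built above: Neighborhood is still empty.
partial_scrape = {
    "Name": ["The Mayfair Chippy", "Dishoom", "Flat Iron"],
    "Rank": [1, 2, 3],
    "Neighborhood": [],
}

print(column_lengths(partial_scrape))
print(ready_for_dataframe(partial_scrape))
```

Printing the lengths also points straight at the column that is short, which is handy once missing values start showing up on later pages.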

In any case, a spot-check of the dataframe confirms that it matches the yelp site itself, so we're on track!

Time to grab our last item, the neighborhood.

In [34]:
scraped_neighborhoods = yelp_scrape.find_all('p', {'class': 'css-8jxw1i'})

scraped_neighborhoods

[<p class="css-8jxw1i">020 7741 2233</p>,
 <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">14 North Audley Street</span></p>,
 <p class="css-8jxw1i">Mayfair</p>,
 <p class="css-8jxw1i">020 7420 9320</p>,
 <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">12 Upper Saint Martin's Lane</span></p>,
 <p class="css-8jxw1i">Covent Garden</p>,
 <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">17 Beak Street</span></p>,
 <p class="css-8jxw1i">Soho</p>,
 <p class="css-8jxw1i">020 7937 4152</p>,
 <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">51 Kensington Church Street</span></p>,
 <p class="css-8jxw1i">Kensington</p>,
 <p class="css-8jxw1i">020 7420 9322</p>,
 <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">22 Kingly Street</span></p>,
 <p class="css-8jxw1i">Soho</p>,
 <p class="css-8jxw1i">020 7434 2571</p>,
 <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">33 D'Arblay Street</span></p>,
 <p class="css-8jxw1i">Soho</p>,
 <p class="css-8jxw1i">020 7352 4441</p>,

Well, crud.  How do we reliably find the neighborhoods in this list?

It's not quite every third element in the list, because one of them is missing a phone number.  And what would happen if neighborhood data is blank/null/non-existent?

We need a more structural way of finding it, and assigning a `None` value if we can't find it.

This is where the outside-in approach comes in.

In [35]:
# using Chrome's inspector, I found a div tag that contains the address, phone number, neighborhood, etc.
scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

scraped_neighborhood_containers

[<div class="container__09f24__1fWZl padding-l2__09f24__2MHQ3 border-color--default__09f24__1eOdn text-align--right__09f24__2OpQD"><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i">020 7741 2233</p></div></div></div><address class=""><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i"><span class="raw__09f24__3Obuy">14 North Audley Street</span></p></div></div></div></address><div class="margin-b1__09f24__1647o border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i">Mayfair</p></div></div></div></div>

In [36]:
# seems promising; let's confirm there are ten
len(scraped_neighborhood_containers)

10

In [37]:
# cool!  let's look at just one of them
scraped_neighborhood_containers[0]

<div class="container__09f24__1fWZl padding-l2__09f24__2MHQ3 border-color--default__09f24__1eOdn text-align--right__09f24__2OpQD"><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i">020 7741 2233</p></div></div></div><address class=""><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i"><span class="raw__09f24__3Obuy">14 North Audley Street</span></p></div></div></div></address><div class="margin-b1__09f24__1647o border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i">Mayfair</p></div></div></div></div><

In [38]:
# it looks like our neighborhood is in a <p> tag, so let's find_all within a container
scraped_neighborhood_containers[4].find_all('p')


[<p class="css-8jxw1i">020 7420 9322</p>,
 <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">22 Kingly Street</span></p>,
 <p class="css-8jxw1i">Soho</p>]

We can use that structure to pick the last element.

Side note: this isn't a 100%-foolproof method, because it'll pick up an address or phone number if the neighborhood is missing.  We could get closer to 100% (but not quite there) by checking whether the string begins with a digit, which will only mislead us if the neighborhood starts with a digit or the address doesn't.  This also won't cover the case where all three are missing...but for now, let's settle for the "good enough" solution and worry about those fringe cases later.
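
For reference, the "begins with a digit" check described in the side note could look roughly like the sketch below. The function name `pick_neighborhood` is an assumption for the example, and the heuristic has the limits already mentioned: a neighborhood that starts with a digit is treated as missing, and an address that doesn't start with one would be mistaken for a neighborhood.

```python
def pick_neighborhood(texts):
    """Return the last text in a container, or None if it looks like a phone number or address."""
    if not texts:
        return None  # nothing in the container at all
    last = texts[-1].strip()
    if not last or last[0].isdigit():
        return None  # probably a phone number or a street address
    return last


containers = [
    ['020 7741 2233', '14 North Audley Street', 'Mayfair'],
    ['17 Beak Street', 'Soho'],
    ['020 7420 9320', "12 Upper Saint Martin's Lane"],  # hypothetical: neighborhood missing
    [],                                                 # hypothetical: everything missing
]

print([pick_neighborhood(texts) for texts in containers])
```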

So where were we?  Doing a more structural outside-in approach for finding neighborhood data.

In [39]:
# get the list of div tags that contain the address, phone number, neighborhood, etc.
scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

scraped_neighborhood_containers

[<div class="container__09f24__1fWZl padding-l2__09f24__2MHQ3 border-color--default__09f24__1eOdn text-align--right__09f24__2OpQD"><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i">020 7741 2233</p></div></div></div><address class=""><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i"><span class="raw__09f24__3Obuy">14 North Audley Street</span></p></div></div></div></address><div class="margin-b1__09f24__1647o border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><div class="display--inline-block__09f24__3L1EB border-color--default__09f24__1eOdn"><div class="border-color--default__09f24__1eOdn"><p class="css-8jxw1i">Mayfair</p></div></div></div></div>

In [40]:
# now let's just get the <p> tags within each container
scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

scraped_neighborhoods  # this is a list of lists

[[<p class="css-8jxw1i">020 7741 2233</p>,
  <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">14 North Audley Street</span></p>,
  <p class="css-8jxw1i">Mayfair</p>],
 [<p class="css-8jxw1i">020 7420 9320</p>,
  <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">12 Upper Saint Martin's Lane</span></p>,
  <p class="css-8jxw1i">Covent Garden</p>],
 [<p class="css-8jxw1i"><span class="raw__09f24__3Obuy">17 Beak Street</span></p>,
  <p class="css-8jxw1i">Soho</p>],
 [<p class="css-8jxw1i">020 7937 4152</p>,
  <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">51 Kensington Church Street</span></p>,
  <p class="css-8jxw1i">Kensington</p>],
 [<p class="css-8jxw1i">020 7420 9322</p>,
  <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">22 Kingly Street</span></p>,
  <p class="css-8jxw1i">Soho</p>],
 [<p class="css-8jxw1i">020 7434 2571</p>,
  <p class="css-8jxw1i"><span class="raw__09f24__3Obuy">33 D'Arblay Street</span></p>,
  <p class="css-8jxw1i">Soho</p>],
 [<p class="css-8j

In [41]:
# let's simplify what we're looking at by getting text of each tag

scraped_neighborhoods = [[tag.text for tag in collection] for collection in scraped_neighborhoods]

scraped_neighborhoods

[['020 7741 2233', '14 North Audley Street', 'Mayfair'],
 ['020 7420 9320', "12 Upper Saint Martin's Lane", 'Covent Garden'],
 ['17 Beak Street', 'Soho'],
 ['020 7937 4152', '51 Kensington Church Street', 'Kensington'],
 ['020 7420 9322', '22 Kingly Street', 'Soho'],
 ['020 7434 2571', "33 D'Arblay Street", 'Soho'],
 ['020 7352 4441', '68 Royal Hospital Road', 'Chelsea'],
 ['020 7236 2498', '61 Carter Lane', 'Blackfriars'],
 ['020 7494 9584', '21-22 Warwick Street', 'Soho'],
 ['020 7659 4500', '9 Conduit Street', 'Mayfair']]

In [42]:
# now let's get the last element of each list

scraped_neighborhoods = [location[-1] for location in scraped_neighborhoods]   # index -1 is the last element

scraped_neighborhoods

['Mayfair',
 'Covent Garden',
 'Soho',
 'Kensington',
 'Soho',
 'Soho',
 'Chelsea',
 'Blackfriars',
 'Soho',
 'Mayfair']

In [43]:
len(scraped_neighborhoods)  # double-check we still have 10 items and they match the website

10

In [44]:
# awesome!  now we put it in the dictionary

london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods

london_yelp_restaurants

{'Name': ['The Mayfair Chippy',
  'Dishoom',
  'Flat Iron',
  'Ffiona’s Restaurant',
  'Dishoom',
  'The Breakfast Club',
  'Restaurant Gordon Ramsay',
  'The Fat Bear',
  'NOPI',
  'Sketch'],
 'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 'Price range': ['££',
  '££',
  '££',
  '££',
  '££',
  '££',
  '££££',
  '££',
  '£££',
  '£££'],
 'Star rating': [4.5, 4.5, 4.5, 4.5, 4.5, 4.0, 4.5, 4.5, 4.5, 4.0],
 'Neighborhood': ['Mayfair',
  'Covent Garden',
  'Soho',
  'Kensington',
  'Soho',
  'Soho',
  'Chelsea',
  'Blackfriars',
  'Soho',
  'Mayfair']}

In [45]:
# let's update that DataFrame!

pd.DataFrame(london_yelp_restaurants)

| | Name | Rank | Price range | Star rating | Neighborhood |
|---|---|---|---|---|---|
| 0 | The Mayfair Chippy | 1 | ££ | 4.5 | Mayfair |
| 1 | Dishoom | 2 | ££ | 4.5 | Covent Garden |
| 2 | Flat Iron | 3 | ££ | 4.5 | Soho |
| 3 | Ffiona’s Restaurant | 4 | ££ | 4.5 | Kensington |
| 4 | Dishoom | 5 | ££ | 4.5 | Soho |
| 5 | The Breakfast Club | 6 | ££ | 4.0 | Soho |
| 6 | Restaurant Gordon Ramsay | 7 | ££££ | 4.5 | Chelsea |
| 7 | The Fat Bear | 8 | ££ | 4.5 | Blackfriars |
| 8 | NOPI | 9 | £££ | 4.5 | Soho |
| 9 | Sketch | 10 | £££ | 4.0 | Mayfair |


That's the product we were looking for, so let's refactor all of the above into a nice, clean block!

### Tier 1 - Cleaned-Up Code Answer
Below is my cleaned-up code that executes everything in a single cell:

In [46]:
# We're going to need three Python libraries to make this code work:

import requests   # this library will help us make http requests, which is how we get webpages
from bs4 import BeautifulSoup   # this library will help us parse html source code (i.e., webscraping)
import pandas as pd   # this library will help us with data science stuff



# So let's get to requestin'!
url_to_scrape = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

http_response = requests.get(url_to_scrape)   # this actually requests the page and stores the resulting response

# get the text version of the http_response
source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

# then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # naming the parser avoids bs4's "no parser was explicitly specified" warning



# we're going to structure this data as a dictionary of lists
london_yelp_restaurants = {}



# let's scrape for names:
scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

# get the text
scraped_names = [scraped_name.text for scraped_name in scraped_names]

# clean it up
scraped_names = [name for name in scraped_names if len(name) >1]  # this will be a problem if a restaurant has a one-character name...

# add it to the dictionary
london_yelp_restaurants['Name'] = scraped_names



# now let's scrape for ranks:
scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

# get the text
scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

# now parse out the number
scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

# let's assume we want integers, so convert
scraped_ranks = [int(rank) for rank in scraped_ranks]

# add it to the dictionary
london_yelp_restaurants['Rank'] = scraped_ranks



# now let's scrape for price ranges:
scraped_price_ranges = yelp_scrape.find_all('span', {'class': 'priceRange__09f24__2O6le css-xtpg8e'})

# get the text
scraped_price_ranges = [price.text for price in scraped_price_ranges]

# add it to the dictionary
london_yelp_restaurants['Price range'] = scraped_price_ranges



# now let's scrape for star rating:
scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

# narrow it down
scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

# clean it up by getting rid of the "star rating" suffix and converting to a float
scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

# add it to the dictionary
london_yelp_restaurants['Star rating'] = scraped_star_ratings



# now let's scrape for neighborhood:
scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

# now let's just get the <p> tags within each container
scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

# let's simplify what we're looking at by getting text of each tag
for collection in scraped_neighborhoods:
    for tag in range(0,len(collection)):
        collection[tag] = collection[tag].text   # is there a more efficient way to do this?
        
# now let's get the last element of each list
scraped_neighborhoods = [location[len(location)-1] for location in scraped_neighborhoods]

# add it to the dictionary
london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods



# now put it in a DataFrame!
pd.DataFrame(london_yelp_restaurants)

| | Name | Rank | Price range | Star rating | Neighborhood |
|---|---|---|---|---|---|
| 0 | The Mayfair Chippy | 1 | ££ | 4.5 | Mayfair |
| 1 | Dishoom | 2 | ££ | 4.5 | Covent Garden |
| 2 | Flat Iron | 3 | ££ | 4.5 | Soho |
| 3 | Ffiona’s Restaurant | 4 | ££ | 4.5 | Kensington |
| 4 | Dishoom | 5 | ££ | 4.5 | Soho |
| 5 | The Breakfast Club | 6 | ££ | 4.0 | Soho |
| 6 | Restaurant Gordon Ramsay | 7 | ££££ | 4.5 | Chelsea |
| 7 | The Fat Bear | 8 | ££ | 4.5 | Blackfriars |
| 8 | NOPI | 9 | £££ | 4.5 | Soho |
| 9 | Sketch | 10 | £££ | 4.0 | Mayfair |


## Tiers 2 and 3:  100 Row Dataset With At Least 3 and 5 Columns, Respectively
Since Tier 3 builds off of Tier 2, I'm jumping right to Tier 3 so I can kill two birds with one stone.
- First is me showing my work
- Below that is my cleaned-up, all-in-one-cell code block

### Tiers 2 and 3 - Showing My Work
This is going to involve a few steps:
- Figure out how to modify the Yelp page URL to load the next ten results (or load more than ten)
- Do webscraping using the same methods as Tier 1
  - Figure out how to handle missing values and replace them with `None` so we feed same-length lists into the DataFrame
- Generate the dataframe
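
As a rough illustration of the missing-values step, here is a minimal sketch. It assumes each restaurant has already been collected into its own dictionary; the `restaurant_records` data and the `to_columns` helper are hypothetical and not part of the notebook. Looking each column up per restaurant and falling back to `None` keeps every list the same length.

```python
# Hypothetical per-restaurant records; the second one has no price range.
restaurant_records = [
    {"Name": "Example Cafe", "Price range": "££"},
    {"Name": "Example Diner"},
]


def to_columns(records, column_names):
    """Turn a list of per-restaurant dicts into a dictionary of equal-length lists."""
    # dict.get returns None when a key is missing, so every list gets one entry per record
    return {name: [record.get(name) for record in records] for name in column_names}


columns = to_columns(restaurant_records, ["Name", "Price range"])
print(columns)
```

Because every list has exactly one entry per restaurant, a dictionary built this way can be handed to `pd.DataFrame` without a length mismatch.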

The Yelp page won't let me load more than ten results, so we'll have to scrape ten results at a time.

To do this, we'll load the first page, scrape, load the next page, scrape, load the next page, etc.

To get the next page, we'll have to add a `&start=10` parameter to the end of the URL.
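
An alternative to gluing the suffix on by hand is to let the standard library build the whole query string. This is only a sketch: the `build_search_url` helper and the `SEARCH_URL` constant are names made up for illustration, and the parameter values are simply the ones visible in the URL above.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://www.yelp.com/search"  # the part of the URL before the "?"


def build_search_url(start):
    """Build the search URL for the page of results beginning at offset `start`."""
    query = {
        "find_desc": "Restaurants",
        "find_loc": "London, United Kingdom",
        "ns": 1,
        "start": start,
    }
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use "+")
    return f"{SEARCH_URL}?{urlencode(query, quote_via=quote)}"


second_page_url = build_search_url(10)
print(second_page_url)
```

The upside is that the location string stays readable in the code and the encoding is handled for you.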

In [47]:
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

# generate a list like [0, 10, 20, ..., 90] to get the first ten pages (first 100 results)
page_start_values = [i * 10 for i in range(0,10)] 

# create a list of URLs to scrape
urls_to_scrape = [f"{base_url}&start={value}" for value in page_start_values]   # append a suffix to the base url

urls_to_scrape

['https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=0',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=10',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=20',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=30',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=40',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=50',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=60',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=70',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=80',
 'https://www.yelp.com/search?find_desc=Restaur

Now let's take the cleaned-up code from Tier 1 and put it to good use!

We'll loop through the code ten times, each time scraping a different URL from the list we just created.

This will be easier to read if we define the scraper as a function.

In [48]:
# We're going to need three Python libraries to make this code work:

import requests   # this library will help us make http requests, which is how we get webpages
from bs4 import BeautifulSoup   # this library will help us parse html source code (i.e., webscraping)
import pandas as pd   # this library will help us with data science stuff



# make a function similar to the code in Tier 1, but have it return a dictionary (and don't create a DataFrame)
def scrape_a_page(site_to_scrape):
    # So let's get to requestin'!
    http_response = requests.get(site_to_scrape)   # this actually requests the page and stores the resulting response

    # get the text version of the http_response
    source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

    # then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
    yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # naming the parser avoids bs4's "no parser was explicitly specified" warning



    # we're going to structure this data as a dictionary of lists
    london_yelp_restaurants = {}



    # let's scrape for names:
    scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

    # get the text
    scraped_names = [scraped_name.text for scraped_name in scraped_names]

    # clean it up
    scraped_names = [name for name in scraped_names if len(name) >1]  # this will be a problem if a restaurant has a one-character name...

    # add it to the dictionary
    london_yelp_restaurants['Name'] = scraped_names



    # now let's scrape for ranks:
    scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

    # get the text
    scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

    # now parse out the number
    scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

    # let's assume we want integers, so convert
    scraped_ranks = [int(rank) for rank in scraped_ranks]

    # add it to the dictionary
    london_yelp_restaurants['Rank'] = scraped_ranks



    # now let's scrape for price ranges:
    scraped_price_ranges = yelp_scrape.find_all('span', {'class': 'priceRange__09f24__2O6le css-xtpg8e'})

    # get the text
    scraped_price_ranges = [price.text for price in scraped_price_ranges]

    # add it to the dictionary
    london_yelp_restaurants['Price range'] = scraped_price_ranges



    # now let's scrape for star rating:
    scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

    # narrow it down
    scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

    # clean it up by getting rid of the "star rating" suffix and converting to a float
    scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

    # add it to the dictionary
    london_yelp_restaurants['Star rating'] = scraped_star_ratings



    # now let's scrape for neighborhood:
    scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

    # now let's just get the <p> tags within each container
    scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

    # let's simplify what we're looking at by getting text of each tag
    for collection in scraped_neighborhoods:
        for tag in range(0,len(collection)):
            collection[tag] = collection[tag].text   # is there a more efficient way to do this?

    # now let's get the last element of each list
    scraped_neighborhoods = [location[len(location)-1] for location in scraped_neighborhoods]

    # add it to the dictionary
    london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods

    return london_yelp_restaurants  # return the dictionary with the results of the single-page scrape



# so let's loop through all the pages we want to scrape, and call the function

# first, create a dictionary to hold the results (a dictionary of lists)
london_yelp_restaurants = {
    "Name": [],
    "Rank": [],
    "Price range": [],
    "Star rating": [],
    "Neighborhood": []
}


# now let's call the functions!
for page in urls_to_scrape:  # for each site in the list of URLs to scrape
    # call the scrape function for the page and store it in a dictionary
    single_page_yelp_results = scrape_a_page(page)
    
    # now append the single-page results to the main results dictionary
    for key in london_yelp_restaurants:
        london_yelp_restaurants[key] += single_page_yelp_results[key]



In [49]:
# now let's see what the result is!
london_yelp_restaurants

{'Name': ['The Mayfair Chippy',
  'Dishoom',
  'Flat Iron',
  'Ffiona’s Restaurant',
  'Dishoom',
  'The Breakfast Club',
  'Restaurant Gordon Ramsay',
  'The Fat Bear',
  'NOPI',
  'Sketch',
  'The Golden Chippy',
  'Honest Burgers Meard St - Soho',
  'Padella',
  'Mestizo',
  'BAO - Soho',
  'Hawksmoor Seven Dials',
  'Lanzhou Noodle Bar',
  'The Queens Arms',
  'Duck & Waffle',
  'The Palomar Restaurant',
  'Wahaca',
  'Barrafina',
  'Mother Mash',
  'Blacklock',
  'Kennington Lane Cafe',
  'Burger & Lobster',
  'Homeslice Neal’s Yard',
  'Yauatcha',
  'Bocca Di Lupo',
  'Abeno',
  'Honey & Co',
  'Kiln',
  'Bibimbap',
  'The Rum Kitchen',
  'Busaba Soho',
  'Savoir Faire',
  'The Ledbury',
  'Shoryu Ramen',
  'Jinjuu',
  'Regency Café',
  'The Wolseley',
  'Gordon Ramsay Street Pizza',
  'Silk Road',
  'Misato',
  'San Carlo Cicchetti',
  'Barrafina',
  'Soho Joe',
  'The Orangery',
  'Nando’s',
  'Bodean’s',
  'The Alchemist',
  'Circolo Popolare',
  'Ye Olde Cheshire Cheese',
  '

So the code ran without producing errors, but the results aren't right.

It seems there's a problem with the restaurant names, which is honestly the one I thought would be bulletproof.  Let's run some tests to troubleshoot:

In [50]:
len(london_yelp_restaurants['Name'])

114

Length is longer than 100, so that's a problem.  There are also numbers in the mix instead of names, which seems wrong.  If we know what they are, we can go to the specific page in the browser and check on them to see if we can find a pattern.

In [51]:
for i in range(0,len(london_yelp_restaurants['Name'])):
    if london_yelp_restaurants['Name'][i].isdigit():
        print(f"At index {i} (restaurant {i + 1}), the restaurant name is a number: {london_yelp_restaurants['Name'][i]}")
    else:
        print(f"At index {i} (restaurant {i + 1}), the restaurant name has letters: {london_yelp_restaurants['Name'][i]}")

At index 0 (restaurant 1), the restaurant name has letters: The Mayfair Chippy
At index 1 (restaurant 2), the restaurant name has letters: Dishoom
At index 2 (restaurant 3), the restaurant name has letters: Flat Iron
At index 3 (restaurant 4), the restaurant name has letters: Ffiona’s Restaurant
At index 4 (restaurant 5), the restaurant name has letters: Dishoom
At index 5 (restaurant 6), the restaurant name has letters: The Breakfast Club
At index 6 (restaurant 7), the restaurant name has letters: Restaurant Gordon Ramsay
At index 7 (restaurant 8), the restaurant name has letters: The Fat Bear
At index 8 (restaurant 9), the restaurant name has letters: NOPI
At index 9 (restaurant 10), the restaurant name has letters: Sketch
At index 10 (restaurant 11), the restaurant name has letters: The Golden Chippy
At index 11 (restaurant 12), the restaurant name has letters: Honest Burgers Meard St - Soho
At index 12 (restaurant 13), the restaurant name has letters: Padella
At index 13 (restauran

Whoa, whoa, whoa!  Starting on page 2, the titles are all out of order!  The order on the Yelp page itself does **not** match the order in the list above.  And some on the page are even skipped altogether!

(Aside: part of this was because Yelp's ranking was changing in real time.)

I guess this shows the value of spot-checking.  Augh.  I guess it's time to revise the logic of the restaurant name finder, and then I need to check the other fields to make sure they're good.

Relatedly, this is the major disadvantage of putting everything in one block...it's tough to troubleshoot!  I guess I shouldn't fight against the small-batch nature of Jupyter.  That's the point!

Splitting it back up and troubleshooting!

#### Generate the URLs

In [52]:
# let's generate the URLs we want to look at
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

page_start_value = 90

url_to_scrape = f"{base_url}&start={page_start_value}"  # append a suffix to the base url

#### Make the Request

In [53]:
# So let's get to requestin'!
http_response = requests.get(url_to_scrape)   # this actually requests the page and stores the resulting response

# get the text version of the http_response
source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

# then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # naming the parser avoids bs4's "no parser was explicitly specified" warning

#### Initialize the Dictionary of Results

In [54]:
# we're going to structure this data as a dictionary of lists
london_yelp_restaurants = {}

#### Scrape for Names
I revised the logic in the cleaning part because it was allowing numbers where we didn't want them.

In [55]:
# let's scrape for names:
scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

# scraped_names

In [56]:
# get the text
scraped_names = [scraped_name.text for scraped_name in scraped_names]

# scraped_names

In [57]:
# clean it up by filtering out empty strings and numbers 
# this logic will be a problem if a restaurant's name is just a number...
scraped_names = [name for name in scraped_names if name != '' and not name.isdigit()]  # one pass is enough; no outer loop needed

# scraped_names

In [58]:
# add it to the dictionary
london_yelp_restaurants['Name'] = scraped_names

#### Scrape for Ranks

In [59]:
# now let's scrape for ranks:
scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

# get the text
scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

# now parse out the number
scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

# let's assume we want integers, so convert
scraped_ranks = [int(rank) for rank in scraped_ranks]

# add it to the dictionary
london_yelp_restaurants['Rank'] = scraped_ranks
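
The rank parsing can also be written with `str.split`, which some readers find easier to follow than slicing up to an index. A minimal sketch, assuming the rank text looks like a number followed by a period; the `parse_rank` helper and the sample strings are made up for illustration.

```python
def parse_rank(rank_text):
    """Return the integer before the first '.' in text such as '12.'."""
    # split(".")[0] keeps everything before the first period
    return int(rank_text.split(".")[0])


ranks = [parse_rank(text) for text in ["1.", "10.", "100."]]
print(ranks)
```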

#### Scrape for Price Ranges

In [60]:
# now let's scrape for price ranges.  This one got tricky, and we had to go for an outside-in approach.
scraped_price_ranges = yelp_scrape.find_all('div', {'class': 'priceCategory__09f24__2IbAM'})

# scraped_price_ranges


In [61]:
# get the text
scraped_price_ranges = [price.text for price in scraped_price_ranges]

# scraped_price_ranges

In [62]:
# now grab the £ signs if they're there, otherwise `None`
for element in range(len(scraped_price_ranges)):
    if scraped_price_ranges[element].count("\xA3") == 0:
        scraped_price_ranges[element] = None
    else:
        scraped_price_ranges[element] = scraped_price_ranges[element].count("\xA3")

# scraped_price_ranges

In [63]:
# add it to the dictionary
london_yelp_restaurants['Price range'] = scraped_price_ranges
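
The count-or-`None` logic above can be condensed. This is a sketch only: the `count_pound_signs` helper and the sample strings are hypothetical stand-ins for the text of each price/category container.

```python
def count_pound_signs(text):
    """Return the number of £ signs in `text`, or None if there are none."""
    # "0 or None" evaluates to None, so a missing price range becomes None
    return text.count("£") or None


sample_texts = ["££Fish & Chips", "Coffee & Tea", "££££French"]
price_levels = [count_pound_signs(text) for text in sample_texts]
print(price_levels)
```

This relies on `0` being falsy in Python, which is compact but a little subtle; the explicit `if`/`else` version says the same thing more plainly.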

#### Scrape for Star Ratings

In [64]:
# now let's scrape for star rating:
scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

# narrow it down
scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

# clean it up by getting rid of the "star rating" suffix and converting to a float
scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

# add it to the dictionary
london_yelp_restaurants['Star rating'] = scraped_star_ratings
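
Another way to strip the suffix, sketched under the assumption that every label has the form "<number> star rating" as the slicing above implies: split on whitespace and convert the first piece. The `parse_star_rating` helper is a hypothetical name.

```python
def parse_star_rating(label):
    """Pull the leading number out of a label such as '4.5 star rating'."""
    # split() breaks the label on whitespace; the first piece is the number
    return float(label.split()[0])


ratings = [parse_star_rating(label) for label in ["4.5 star rating", "4 star rating"]]
print(ratings)
```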

#### Scrape for Neighborhood

In [65]:
# now let's scrape for neighborhood:
scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

# now let's just get the <p> tags within each container
scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

# let's simplify what we're looking at by getting text of each tag
for collection in scraped_neighborhoods:
    for tag in range(0,len(collection)):
        collection[tag] = collection[tag].text   # is there a more efficient way to do this?

# now let's get the last element of each list
scraped_neighborhoods = [location[len(location)-1] for location in scraped_neighborhoods]

# add it to the dictionary
london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods
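
On the question in the code comment about a more efficient way: since only the last `<p>` of each container is used, one option is to skip the nested loop and read `.text` from the last tag directly. The sketch below uses `SimpleNamespace` objects as stand-ins for BeautifulSoup tags (all it needs is a `.text` attribute), so `fake_tag` and `tag_collections` are illustrative names, not part of the notebook.

```python
from types import SimpleNamespace


def fake_tag(text):
    """Stand-in for a BeautifulSoup tag: an object with a .text attribute."""
    return SimpleNamespace(text=text)


tag_collections = [
    [fake_tag("020 7741 2233"), fake_tag("14 North Audley Street"), fake_tag("Mayfair")],
    [fake_tag("17 Beak Street"), fake_tag("Soho")],  # no phone number on this one
]

# one comprehension: take the text of the last tag in each collection
neighborhoods = [collection[-1].text for collection in tag_collections]
print(neighborhoods)
```

Negative indexing (`collection[-1]`) does the same job as `collection[len(collection)-1]`.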

#### Check the Results for Problems

In [66]:
# let's see what we got!
for key in london_yelp_restaurants:
    print(f"Key: {key}")
    if len(london_yelp_restaurants[key]) == 10:
        print(f"{len(london_yelp_restaurants[key])} elements:")
    else:
        print(f"PROBLEM >>> {len(london_yelp_restaurants[key])} elements:")
    print(london_yelp_restaurants[key])
    print()



# and let's load it into a dataframe
# pd.DataFrame(london_yelp_restaurants)

Key: Name
10 elements:
['Fuckoffee', 'Nando’s', 'Darjeeling Express', 'Bill’s', 'ROVI', 'Cecconi’s Mayfair', 'Laksamania', 'The Barbary', 'La Porchetta Pollo Bar', 'Belgo Centraal']

Key: Rank
10 elements:
[91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

Key: Price range
10 elements:
[1, 2, None, 2, None, 3, None, 3, 1, 2]

Key: Star rating
10 elements:
[4.0, 4.0, 4.5, 4.0, 4.0, 4.0, 4.5, 4.5, 4.0, 4.0]

Key: Neighborhood
10 elements:
['Borough', 'South Kensington', 'Soho', 'Soho', 'Fitzrovia', 'Mayfair', 'Fitzrovia', 'Covent Garden', 'Bloomsbury', 'Covent Garden']



#### Load the Results into a DataFrame
Note: this will throw an error if the lists above are not the same length!

In [67]:
# and let's load it into a dataframe
pd.DataFrame(london_yelp_restaurants)

| | Name | Rank | Price range | Star rating | Neighborhood |
|---|---|---|---|---|---|
| 0 | Fuckoffee | 91 | 1.0 | 4.0 | Borough |
| 1 | Nando’s | 92 | 2.0 | 4.0 | South Kensington |
| 2 | Darjeeling Express | 93 | NaN | 4.5 | Soho |
| 3 | Bill’s | 94 | 2.0 | 4.0 | Soho |
| 4 | ROVI | 95 | NaN | 4.0 | Fitzrovia |
| 5 | Cecconi’s Mayfair | 96 | 3.0 | 4.0 | Mayfair |
| 6 | Laksamania | 97 | NaN | 4.5 | Fitzrovia |
| 7 | The Barbary | 98 | 3.0 | 4.5 | Covent Garden |
| 8 | La Porchetta Pollo Bar | 99 | 1.0 | 4.0 | Bloomsbury |
| 9 | Belgo Centraal | 100 | 2.0 | 4.0 | Covent Garden |


Ok, so after all that troubleshooting, it seems to be working!  Let's put it back into a cell-by-cell solution that iterates through the ten pages!

I also learned at this point that it's easy to merge and split cells in a Jupyter notebook, so I'm going to re-merge everything related to the scraping and put it into a function definition so it's easy to iterate.

#### Define the scraping function

In [73]:
def scrape_a_page(page_to_scrape):
    # So let's get to requestin'!
    http_response = requests.get(page_to_scrape)   # this actually requests the page and stores the resulting response

    # get the text version of the http_response
    source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

    # then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
    yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # naming the parser avoids bs4's "no parser was explicitly specified" warning

    #### Initialize the Dictionary of Results
    # we're going to structure this data as a dictionary of lists
    london_yelp_restaurants = {}

    #### Scrape for Names
    # I revised the logic in the cleaning part because it was allowing numbers where we didn't want them.
    # let's scrape for names:
    scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

    # scraped_names

    # get the text
    scraped_names = [scraped_name.text for scraped_name in scraped_names]

    # scraped_names

    # clean it up by filtering out empty strings and numbers 
    # this logic will be a problem if a restaurant's name is just a number...
    scraped_names = [name for name in scraped_names if name != '' and not name.isdigit()]  # one pass is enough; no outer loop needed

    # scraped_names

    # add it to the dictionary
    london_yelp_restaurants['Name'] = scraped_names

    
    
    #### Scrape for Ranks
    # now let's scrape for ranks:
    scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

    # get the text
    scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

    # now parse out the number
    scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

    # let's assume we want integers, so convert
    scraped_ranks = [int(rank) for rank in scraped_ranks]

    # add it to the dictionary
    london_yelp_restaurants['Rank'] = scraped_ranks

    
    
    #### Scrape for Price Ranges
    # now let's scrape for price ranges.  This one got tricky, and we had to go for an outside-in approach.
    scraped_price_ranges = yelp_scrape.find_all('div', {'class': 'priceCategory__09f24__2IbAM'})

    # scraped_price_ranges

    # get the text
    scraped_price_ranges = [price.text for price in scraped_price_ranges]

    # scraped_price_ranges

    # now grab the £ signs if they're there, otherwise `None`
    for element in range(len(scraped_price_ranges)):
        if scraped_price_ranges[element].count("\xA3") == 0:
            scraped_price_ranges[element] = None
        else:
            scraped_price_ranges[element] = scraped_price_ranges[element].count("\xA3")

    # scraped_price_ranges

    # add it to the dictionary
    london_yelp_restaurants['Price range'] = scraped_price_ranges

    
    
    #### Scrape for Star Ratings
    # now let's scrape for star rating:
    scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

    # narrow it down
    scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

    # clean it up by getting rid of the "star rating" suffix and converting to a float
    scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

    # add it to the dictionary
    london_yelp_restaurants['Star rating'] = scraped_star_ratings

    
    
    #### Scrape for Neighborhood
    # now let's scrape for neighborhood:
    scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

    # now let's just get the <p> tags within each container
    scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

    # let's simplify what we're looking at by getting text of each tag
    for collection in scraped_neighborhoods:
        for tag in range(0,len(collection)):
            collection[tag] = collection[tag].text   # is there a more efficient way to do this?

    # now let's get the last element of each list
    scraped_neighborhoods = [location[len(location)-1] for location in scraped_neighborhoods]

    # add it to the dictionary
    london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods
    
    
    #### Return the dictionary
    return london_yelp_restaurants

#### Create a dictionary to hold the results

In [74]:
# first, create a dictionary to hold the results (a dictionary of lists)
london_yelp_restaurants = {
    "Name": [],
    "Rank": [],
    "Price range": [],
    "Star rating": [],
    "Neighborhood": []
}

#### Generate the URLs

In [75]:
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

# generate a list like [0, 10, 20, ..., 90] to get the first ten pages (first 100 results)
page_start_values = [i * 10 for i in range(0, 10)] 

# create a list of URLs to scrape
urls_to_scrape = [f"{base_url}&start={value}" for value in page_start_values]   # append a suffix to the base url

urls_to_scrape

['https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=0',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=10',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=20',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=30',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=40',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=50',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=60',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=70',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=80',
 'https://www.yelp.com/search?find_desc=Restaur

#### Loop through the pages and call the scraping function

In [76]:
# let's loop through all the pages we want to scrape, and call the function

for page in urls_to_scrape:  # for each site in the list of URLs to scrape
    # call the scrape function for the page and store it in a dictionary
    single_page_yelp_results = scrape_a_page(page)
    
    # now append the single-page results to the main results dictionary
    for key in london_yelp_restaurants:
        london_yelp_restaurants[key] += single_page_yelp_results[key]
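
The merge loop can also be wrapped in a small helper that takes the scraping function as an argument, which makes it easy to try out without touching the network. This is a sketch under assumptions: `collect_pages`, its `delay_seconds` pause between requests, and the `fake_scrape` stand-in are all hypothetical additions, not something the notebook defines.

```python
import time


def collect_pages(urls, scrape_fn, delay_seconds=0.0):
    """Call scrape_fn on each URL and merge the per-page dictionaries of lists."""
    combined = {}
    for url in urls:
        page_results = scrape_fn(url)
        for key, values in page_results.items():
            # setdefault creates the list the first time a key is seen
            combined.setdefault(key, []).extend(values)
        time.sleep(delay_seconds)  # optional pause between requests
    return combined


def fake_scrape(url):
    """Stand-in scraper so the sketch runs offline."""
    return {"Name": [f"Restaurant from {url}"], "Rank": [1]}


merged = collect_pages(["page-a", "page-b"], fake_scrape)
print(merged)
```

With the real scraper, the call would look like `collect_pages(urls_to_scrape, scrape_a_page, delay_seconds=1)`.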

#### Load the Results into a DataFrame
Note: this will throw an error if the lists above are not the same length!
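
To find out which list is the wrong length before pandas complains, a quick check along these lines can help. It is a sketch only; `find_length_mismatches` and the `example_columns` data are hypothetical.

```python
def find_length_mismatches(columns, expected_length):
    """Return {column name: actual length} for every list that is not expected_length long."""
    return {
        name: len(values)
        for name, values in columns.items()
        if len(values) != expected_length
    }


# Hypothetical scrape result in which one name went missing
example_columns = {"Name": ["A", "B"], "Rank": [1, 2, 3]}
mismatches = find_length_mismatches(example_columns, expected_length=3)
print(mismatches)
```

For the full scrape, the expected length would be 100, and an empty result means the dictionary is safe to load.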

In [77]:
# and let's load it into a dataframe
pd.DataFrame(london_yelp_restaurants)

| | Name | Rank | Price range | Star rating | Neighborhood |
|---|---|---|---|---|---|
| 0 | The Mayfair Chippy | 1 | 2.0 | 4.5 | Mayfair |
| 1 | Dishoom | 2 | 2.0 | 4.5 | Covent Garden |
| 2 | Flat Iron | 3 | 2.0 | 4.5 | Soho |
| 3 | Ffiona’s Restaurant | 4 | 2.0 | 4.5 | Kensington |
| 4 | Dishoom | 5 | 2.0 | 4.5 | Soho |
| ... | ... | ... | ... | ... | ... |
| 95 | Cecconi’s Mayfair | 96 | 3.0 | 4.0 | Mayfair |
| 96 | Laksamania | 97 | NaN | 4.5 | Fitzrovia |
| 97 | The Barbary | 98 | 3.0 | 4.5 | Covent Garden |
| 98 | La Porchetta Pollo Bar | 99 | 1.0 | 4.0 | Bloomsbury |


It worked!!!!!!

### Tiers 2 and 3 - Cleaned-Up Code Answer
By providing five columns, the below answers both tiers 2 and 3.

Below is my cleaned-up code that executes everything in a single cell:

In [78]:
# We're going to need three Python libraries to make this code work:

import requests   # this library will help us make http requests, which is how we get webpages
from bs4 import BeautifulSoup   # this library will help us parse html source code (i.e., webscraping)
import pandas as pd   # this library will help us with data science stuff



def scrape_a_page(page_to_scrape):
    # So let's get to requestin'!
    http_response = requests.get(page_to_scrape)   # this actually requests the page and stores the resulting response

    # get the text version of the http_response
    source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

    # then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
    yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # naming the parser avoids bs4's "no parser was explicitly specified" warning

    #### Initialize the Dictionary of Results
    # we're going to structure this data as a dictionary of lists
    london_yelp_restaurants = {}

    #### Scrape for Names
    # I revised the logic in the cleaning part because it was allowing numbers where we didn't want them.
    # let's scrape for names:
    scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

    # scraped_names

    # get the text
    scraped_names = [scraped_name.text for scraped_name in scraped_names]

    # scraped_names

    # clean it up by filtering out empty strings and numbers 
    # this logic will be a problem if a restaurant's name is just a number...
    scraped_names = [name for name in scraped_names if name != '' and not name.isdigit()]  # one pass is enough; no outer loop needed

    # scraped_names

    # add it to the dictionary
    london_yelp_restaurants['Name'] = scraped_names

    
    
    #### Scrape for Ranks
    # now let's scrape for ranks:
    scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

    # get the text
    scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

    # now parse out the number
    scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

    # let's assume we want integers, so convert
    scraped_ranks = [int(rank) for rank in scraped_ranks]

    # add it to the dictionary
    london_yelp_restaurants['Rank'] = scraped_ranks

    
    
    #### Scrape for Price Ranges
    # now let's scrape for price ranges.  This one got tricky, and we had to go for an outside-in approach.
    scraped_price_ranges = yelp_scrape.find_all('div', {'class': 'priceCategory__09f24__2IbAM'})

    # scraped_price_ranges

    # get the text
    scraped_price_ranges = [price.text for price in scraped_price_ranges]

    # scraped_price_ranges

    # now grab the £ signs if they're there, otherwise `None`
    for element in range(len(scraped_price_ranges)):
        if scraped_price_ranges[element].count("\xA3") == 0:
            scraped_price_ranges[element] = None
        else:
            scraped_price_ranges[element] = scraped_price_ranges[element].count("\xA3")

    # scraped_price_ranges

    # add it to the dictionary
    london_yelp_restaurants['Price range'] = scraped_price_ranges

    
    
    #### Scrape for Star Ratings
    # now let's scrape for star rating:
    scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

    # narrow it down
    scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

    # clean it up by getting rid of the "star rating" suffix and converting to a float
    scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

    # add it to the dictionary
    london_yelp_restaurants['Star rating'] = scraped_star_ratings

    
    
    #### Scrape for Neighborhood
    # now let's scrape for neighborhood:
    scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

    # now let's just get the <p> tags within each container
    scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

    # let's simplify what we're looking at by getting the text of each tag
    # (a nested list comprehension is a tidier way to do this than looping over indexes)
    scraped_neighborhoods = [[tag.text for tag in collection] for collection in scraped_neighborhoods]

    # now let's get the last element of each list
    scraped_neighborhoods = [location[-1] for location in scraped_neighborhoods]

    # add it to the dictionary
    london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods
    
    
    #### Return the dictionary
    return london_yelp_restaurants




# create a dictionary to hold the results (a dictionary of lists)
london_yelp_restaurants = {
    "Name": [],
    "Rank": [],
    "Price range": [],
    "Star rating": [],
    "Neighborhood": []
}




# Generate the URLs
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

# generate a list like [0, 10, 20, ..., 90] to get the first ten pages (first 100 results)
page_start_values = [i * 10 for i in range(0,10)] 

# create a list of URLs to scrape
urls_to_scrape = [f"{base_url}&start={value}" for value in page_start_values]   # append a suffix to the base url





# Loop through the pages and call the scraping function

for page in urls_to_scrape:  # for each site in the list of URLs to scrape
    # call the scrape function for the page and store it in a dictionary
    single_page_yelp_results = scrape_a_page(page)
    
    # now append the single-page results to the main results dictionary
    for key in london_yelp_restaurants:
        london_yelp_restaurants[key] += single_page_yelp_results[key]

        
        
# Load the results into a DataFrame
# Note: this will throw an error if the lists above are not the same length!

pd.DataFrame(london_yelp_restaurants)

Unnamed: 0,Name,Rank,Price range,Star rating,Neighborhood
0,The Mayfair Chippy,1,2.0,4.5,Mayfair
1,Dishoom,2,2.0,4.5,Covent Garden
2,Flat Iron,3,2.0,4.5,Soho
3,Ffiona’s Restaurant,4,2.0,4.5,Kensington
4,Dishoom,5,2.0,4.5,Soho
...,...,...,...,...,...
95,Cecconi’s Mayfair,96,3.0,4.0,Mayfair
96,Laksamania,97,,4.5,Fitzrovia
97,The Barbary,98,3.0,4.5,Covent Garden
98,La Porchetta Pollo Bar,99,1.0,4.0,Bloomsbury


## Tier 4: 100 Row Dataset With At Least 5 Columns + Individual Restaurant Categories
- First is me showing my work
- Below that is my cleaned-up, all-in-one-cell code block

### Tier 4 - Showing My Work
- Building off the Tier 2 and 3 code, now we want to bring in the category data.  A restaurant may have any number of categories, including zero, so we'll want to import a list of them for each restaurant.
- When we have the returned results, we'll want to split the restaurant categories into their own columns, and put a `None` if a given restaurant has fewer than the maximum number of categories.
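Before going cell by cell, here is a minimal sketch of the cleaning that each raw category string needs.  It assumes the scraped text looks like `'££Indian, Pakistani, Halal'` (price symbols stuck to the front, categories separated by commas).  The helper name `parse_categories` is made up for this illustration and isn't used in the cells below.

```python
def parse_categories(raw_text, currency_symbol="£"):
    """Turn a string like '££Indian, Pakistani, Halal' into a list of category names."""
    # drop the price symbols that are fused to the front of the text
    without_price = raw_text.replace(currency_symbol, "")
    # split on commas, trim whitespace, and skip any empty pieces
    return [piece.strip() for piece in without_price.split(",") if piece.strip()]


print(parse_categories("££Indian, Pakistani, Halal"))
print(parse_categories("Japanese, Sushi Bars"))
print(parse_categories("££"))   # a restaurant with a price but no categories
```

Skipping empty pieces means a restaurant with zero categories ends up with an empty list rather than a list holding one empty string.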

First, let's scrape the list of categories for each restaurant.

#### Generate the URLs

In [79]:
# let's generate the URLs we want to look at
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

page_start_value = 70

url_to_scrape = f"{base_url}&start={page_start_value}"  # append a suffix to the base url

#### Make the Request

In [80]:
# So let's get to requestin'!
http_response = requests.get(url_to_scrape)   # this actually requests the page and stores the resulting response

# get the text version of the http_response
source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

# then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # name the parser explicitly so bs4 doesn't have to guess one

#### Initialize the Dictionary of Results

In [81]:
# we're going to structure this data as a dictionary of lists
london_yelp_restaurants = {}

#### Scrape for Names

In [82]:
# let's scrape for names:
scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

# scraped_names

In [83]:
# get the text
scraped_names = [scraped_name.text for scraped_name in scraped_names]

# scraped_names

In [84]:
# clean it up by filtering out empty strings and numbers 
# this logic will be a problem if a restaurant's name is just a number...
# (the list comprehension already visits every name, so no outer loop is needed)
scraped_names = [name for name in scraped_names if name != '' and not name.isdigit()]

# scraped_names

In [85]:
# add it to the dictionary
london_yelp_restaurants['Name'] = scraped_names

#### Scrape for Ranks

In [86]:
# now let's scrape for ranks:
scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

# get the text
scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

# now parse out the number
scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

# let's assume we want integers, so convert
scraped_ranks = [int(rank) for rank in scraped_ranks]

# add it to the dictionary
london_yelp_restaurants['Rank'] = scraped_ranks

#### Scrape for Price Ranges

In [87]:
# now let's scrape for price ranges.  This one got tricky, and we had to go for an outside-in approach.
scraped_price_ranges = yelp_scrape.find_all('div', {'class': 'priceCategory__09f24__2IbAM'})

# scraped_price_ranges


In [88]:
# get the text
scraped_price_ranges = [price.text for price in scraped_price_ranges]

# scraped_price_ranges

In [89]:
# now grab the £ signs if they're there, otherwise `None`
for element in range(len(scraped_price_ranges)):
    if scraped_price_ranges[element].count("\xA3") == 0:
        scraped_price_ranges[element] = None
    else:
        scraped_price_ranges[element] = scraped_price_ranges[element].count("\xA3")

# scraped_price_ranges

In [90]:
# add it to the dictionary
london_yelp_restaurants['Price range'] = scraped_price_ranges

#### Scrape for Star Ratings

In [91]:
# now let's scrape for star rating:
scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

# narrow it down
scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

# clean it up by getting rid of the "star rating" suffix and converting to a float
scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

# add it to the dictionary
london_yelp_restaurants['Star rating'] = scraped_star_ratings

#### Scrape for Neighborhood

In [92]:
# now let's scrape for neighborhood:
scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

# now let's just get the <p> tags within each container
scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

# let's simplify what we're looking at by getting the text of each tag
# (a nested list comprehension is a tidier way to do this than looping over indexes)
scraped_neighborhoods = [[tag.text for tag in collection] for collection in scraped_neighborhoods]

# now let's get the last element of each list
scraped_neighborhoods = [location[-1] for location in scraped_neighborhoods]

# add it to the dictionary
london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods

#### Scrape for Categories

In [93]:
# now let's scrape for categories
scraped_categories = yelp_scrape.find_all('p', {'class': 'css-n6i4z7'})

# scraped_categories

In [94]:
# get the text
scraped_categories = [categories.text for categories in scraped_categories]

scraped_categories

['££Indian, Pakistani, Halal',
 '£££British, Modern European',
 '££Italian',
 '£Thai, Asian Fusion, Vegetarian',
 '£££Thai',
 'Japanese, Sushi Bars',
 '££British, Pubs',
 '££Bars, Waffles',
 '££Korean',
 '£££Chicken Shop, American (New), Soul Food',
 'Anyone experience bad service at Indian restaurants?',
 'Can someone please suggest which restaurants in Chinatown you would recommend? And also feel free to say which restaurants you would avoid in that…',
 'Can someone please suggest which restaurants in Chinatown you would recommend? And also feel free to say which restaurants you would avoid in that…']

In [95]:
# now clean it up!  

# Let's assume for now that the ten restaurant-specific results will be at the top.  This could be problematic later.
scraped_categories = scraped_categories[0:10]    # get the first ten results only

# now clean off any £ symbols (some strings may have none)
for categories in range(len(scraped_categories)):
    scraped_categories[categories] = scraped_categories[categories].replace('\xA3', '')

scraped_categories

['Indian, Pakistani, Halal',
 'British, Modern European',
 'Italian',
 'Thai, Asian Fusion, Vegetarian',
 'Thai',
 'Japanese, Sushi Bars',
 'British, Pubs',
 'Bars, Waffles',
 'Korean',
 'Chicken Shop, American (New), Soul Food']

In [96]:
# now split the remaining strings into lists
for element in range(len(scraped_categories)):
    scraped_categories[element] = scraped_categories[element].split(',')
    
scraped_categories    

[['Indian', ' Pakistani', ' Halal'],
 ['British', ' Modern European'],
 ['Italian'],
 ['Thai', ' Asian Fusion', ' Vegetarian'],
 ['Thai'],
 ['Japanese', ' Sushi Bars'],
 ['British', ' Pubs'],
 ['Bars', ' Waffles'],
 ['Korean'],
 ['Chicken Shop', ' American (New)', ' Soul Food']]

In [97]:
# there are still leading spaces on strings after the split, so trim those off
# (.strip() is also safe on an empty string, where indexing [0] would raise an IndexError)
for element in scraped_categories:
    for category in range(len(element)):
        element[category] = element[category].strip()

scraped_categories

[['Indian', 'Pakistani', 'Halal'],
 ['British', 'Modern European'],
 ['Italian'],
 ['Thai', 'Asian Fusion', 'Vegetarian'],
 ['Thai'],
 ['Japanese', 'Sushi Bars'],
 ['British', 'Pubs'],
 ['Bars', 'Waffles'],
 ['Korean'],
 ['Chicken Shop', 'American (New)', 'Soul Food']]

In [98]:
# add it to the dictionary
london_yelp_restaurants['Categories'] = scraped_categories

london_yelp_restaurants['Categories']

[['Indian', 'Pakistani', 'Halal'],
 ['British', 'Modern European'],
 ['Italian'],
 ['Thai', 'Asian Fusion', 'Vegetarian'],
 ['Thai'],
 ['Japanese', 'Sushi Bars'],
 ['British', 'Pubs'],
 ['Bars', 'Waffles'],
 ['Korean'],
 ['Chicken Shop', 'American (New)', 'Soul Food']]

#### Split the category lists into their own individual columns
After we iterate through all the pages, we can find the maximum number of categories we encountered.  We'll make each restaurant's list be the same length by "padding" with `None` values.
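As a compact illustration of the padding idea, here is a small sketch on made-up sample data.  The function name `pad_category_lists` is hypothetical; unlike the cells below, it builds new lists instead of changing the originals in place.

```python
def pad_category_lists(category_lists, fill_value=None):
    """Return copies of the lists, each padded with fill_value to the longest length."""
    # default=0 keeps max() from failing when there are no lists at all
    longest = max((len(categories) for categories in category_lists), default=0)
    return [
        list(categories) + [fill_value] * (longest - len(categories))
        for categories in category_lists
    ]


sample = [["Indian", "Pakistani", "Halal"], ["Italian"], ["British", "Pubs"]]
padded = pad_category_lists(sample)
print(padded)
```

Leaving the original lists untouched makes it easier to re-run a cell while troubleshooting, since the input data never changes.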

In [99]:
# find the max number of categories
max_categories = 0
for category_list in london_yelp_restaurants['Categories']:
    if len(category_list) > max_categories:
        max_categories = len(category_list)
        
max_categories

3

In [100]:
# Now we pad the shorter lists with `None` values so they're all the same length
for category_list in london_yelp_restaurants['Categories']:
    if len(category_list) < max_categories:   # if the restaurant's category list is shorter
        while len(category_list) < max_categories:
            category_list.append(None)    # keep appending None to the end of the category list until it's long enough
            print(category_list)

london_yelp_restaurants['Categories']

['British', 'Modern European', None]
['Italian', None]
['Italian', None, None]
['Thai', None]
['Thai', None, None]
['Japanese', 'Sushi Bars', None]
['British', 'Pubs', None]
['Bars', 'Waffles', None]
['Korean', None]
['Korean', None, None]


[['Indian', 'Pakistani', 'Halal'],
 ['British', 'Modern European', None],
 ['Italian', None, None],
 ['Thai', 'Asian Fusion', 'Vegetarian'],
 ['Thai', None, None],
 ['Japanese', 'Sushi Bars', None],
 ['British', 'Pubs', None],
 ['Bars', 'Waffles', None],
 ['Korean', None, None],
 ['Chicken Shop', 'American (New)', 'Soul Food']]

Now we need to break up these equal-length lists into their own columns, one per category position.
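One way to picture this step: once every list is the same length, turning rows into columns is a transpose.  The sketch below uses `zip(*rows)` on a small made-up sample; it is an alternative illustration and not the approach used in the cells that follow.

```python
# made-up sample of already-padded category lists (one list per restaurant)
padded = [
    ["Indian", "Pakistani", "Halal"],
    ["Italian", None, None],
    ["British", "Pubs", None],
]

# zip(*padded) groups the first items together, then the second items, and so on
category_columns = {
    f"Category_{number}": list(column)
    for number, column in enumerate(zip(*padded), start=1)
}

print(category_columns)
```

Note that `zip` stops at the shortest list, so this only works after the padding step has made every list the same length.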

In [101]:
# we'll create one new dictionary key for each category number
for i in range(max_categories):
    column_name = f"Category_{i + 1}"
    print(column_name)
    london_yelp_restaurants[column_name] = []
    
london_yelp_restaurants

Category_1
Category_2
Category_3


{'Name': ['Tayyabs',
  'The Ivy',
  'Da Mario Restaurant',
  'Wok to Walk - Oxford St',
  'Patara',
  'Kazu',
  'Mr Fogg’s Tavern',
  'Duck & Waffle Local',
  'Naru',
  'Absurd Bird Soho'],
 'Rank': [71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
 'Price range': [2, 3, 2, 1, 3, None, 2, 2, 2, 3],
 'Star rating': [4.0, 4.0, 4.5, 4.5, 4.0, 5.0, 4.0, 4.5, 4.0, 4.0],
 'Neighborhood': ['Whitechapel',
  'Covent Garden',
  'Kensington',
  'Soho',
  'Bloomsbury',
  'Fitzrovia',
  'Covent Garden',
  'Leicester Square',
  'Covent Garden',
  'Soho'],
 'Categories': [['Indian', 'Pakistani', 'Halal'],
  ['British', 'Modern European', None],
  ['Italian', None, None],
  ['Thai', 'Asian Fusion', 'Vegetarian'],
  ['Thai', None, None],
  ['Japanese', 'Sushi Bars', None],
  ['British', 'Pubs', None],
  ['Bars', 'Waffles', None],
  ['Korean', None, None],
  ['Chicken Shop', 'American (New)', 'Soul Food']],
 'Category_1': [],
 'Category_2': [],
 'Category_3': []}

In [102]:
# now we'll populate the new category values
# we want this to be flexible so any number of categories will work.  
# So avoid hard-coding variables like 'Category_2'

# iterate through each category list in the dictionary and place the elements into the appropriate individual category key
for category_list in london_yelp_restaurants['Categories']:
    for category_number in range(max_categories):
        london_yelp_restaurants[f"Category_{category_number + 1}"].append(category_list[category_number])

london_yelp_restaurants

{'Name': ['Tayyabs',
  'The Ivy',
  'Da Mario Restaurant',
  'Wok to Walk - Oxford St',
  'Patara',
  'Kazu',
  'Mr Fogg’s Tavern',
  'Duck & Waffle Local',
  'Naru',
  'Absurd Bird Soho'],
 'Rank': [71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
 'Price range': [2, 3, 2, 1, 3, None, 2, 2, 2, 3],
 'Star rating': [4.0, 4.0, 4.5, 4.5, 4.0, 5.0, 4.0, 4.5, 4.0, 4.0],
 'Neighborhood': ['Whitechapel',
  'Covent Garden',
  'Kensington',
  'Soho',
  'Bloomsbury',
  'Fitzrovia',
  'Covent Garden',
  'Leicester Square',
  'Covent Garden',
  'Soho'],
 'Categories': [['Indian', 'Pakistani', 'Halal'],
  ['British', 'Modern European', None],
  ['Italian', None, None],
  ['Thai', 'Asian Fusion', 'Vegetarian'],
  ['Thai', None, None],
  ['Japanese', 'Sushi Bars', None],
  ['British', 'Pubs', None],
  ['Bars', 'Waffles', None],
  ['Korean', None, None],
  ['Chicken Shop', 'American (New)', 'Soul Food']],
 'Category_1': ['Indian',
  'British',
  'Italian',
  'Thai',
  'Thai',
  'Japanese',
  'British',
  'Bar

We'll eventually want to delete the 'Categories' key/value pair from the dictionary, but for now, let's keep it for troubleshooting purposes.

#### Check the Results for Problems

In [103]:
# let's see what we got!
for key in london_yelp_restaurants:
    print(f"Key: {key}")
    if len(london_yelp_restaurants[key]) == 10:
        print(f"{len(london_yelp_restaurants[key])} elements:")
    else:
        print(f"PROBLEM >>> {len(london_yelp_restaurants[key])} elements:")
    print(london_yelp_restaurants[key])
    print()



# and let's load it into a dataframe
# pd.DataFrame(london_yelp_restaurants)

Key: Name
10 elements:
['Tayyabs', 'The Ivy', 'Da Mario Restaurant', 'Wok to Walk - Oxford St', 'Patara', 'Kazu', 'Mr Fogg’s Tavern', 'Duck & Waffle Local', 'Naru', 'Absurd Bird Soho']

Key: Rank
10 elements:
[71, 72, 73, 74, 75, 76, 77, 78, 79, 80]

Key: Price range
10 elements:
[2, 3, 2, 1, 3, None, 2, 2, 2, 3]

Key: Star rating
10 elements:
[4.0, 4.0, 4.5, 4.5, 4.0, 5.0, 4.0, 4.5, 4.0, 4.0]

Key: Neighborhood
10 elements:
['Whitechapel', 'Covent Garden', 'Kensington', 'Soho', 'Bloomsbury', 'Fitzrovia', 'Covent Garden', 'Leicester Square', 'Covent Garden', 'Soho']

Key: Categories
10 elements:
[['Indian', 'Pakistani', 'Halal'], ['British', 'Modern European', None], ['Italian', None, None], ['Thai', 'Asian Fusion', 'Vegetarian'], ['Thai', None, None], ['Japanese', 'Sushi Bars', None], ['British', 'Pubs', None], ['Bars', 'Waffles', None], ['Korean', None, None], ['Chicken Shop', 'American (New)', 'Soul Food']]

Key: Category_1
10 elements:
['Indian', 'British', 'Italian', 'Thai', 'Thai

#### Load the Results into a DataFrame
Note: this will throw an error if the lists above are not the same length!
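To catch that problem before pandas does, a quick length check can point at the column that came back short.  This is a minimal sketch on made-up data; the helper name `column_lengths_if_uneven` is hypothetical.

```python
def column_lengths_if_uneven(columns):
    """Return {column name: length} when lengths differ, otherwise an empty dict."""
    lengths = {key: len(values) for key, values in columns.items()}
    # more than one distinct length means at least one column is off
    return lengths if len(set(lengths.values())) > 1 else {}


good = {"Name": ["Tayyabs", "The Ivy"], "Rank": [71, 72]}
bad = {"Name": ["Tayyabs", "The Ivy"], "Rank": [71]}

print(column_lengths_if_uneven(good))
print(column_lengths_if_uneven(bad))
```

An empty result means the dictionary should be safe to hand to `pd.DataFrame`; a non-empty one shows which key to go back and debug.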

In [104]:
# and let's load it into a dataframe
pd.DataFrame(london_yelp_restaurants)

Unnamed: 0,Name,Rank,Price range,Star rating,Neighborhood,Categories,Category_1,Category_2,Category_3
0,Tayyabs,71,2.0,4.0,Whitechapel,"[Indian, Pakistani, Halal]",Indian,Pakistani,Halal
1,The Ivy,72,3.0,4.0,Covent Garden,"[British, Modern European, None]",British,Modern European,
2,Da Mario Restaurant,73,2.0,4.5,Kensington,"[Italian, None, None]",Italian,,
3,Wok to Walk - Oxford St,74,1.0,4.5,Soho,"[Thai, Asian Fusion, Vegetarian]",Thai,Asian Fusion,Vegetarian
4,Patara,75,3.0,4.0,Bloomsbury,"[Thai, None, None]",Thai,,
5,Kazu,76,,5.0,Fitzrovia,"[Japanese, Sushi Bars, None]",Japanese,Sushi Bars,
6,Mr Fogg’s Tavern,77,2.0,4.0,Covent Garden,"[British, Pubs, None]",British,Pubs,
7,Duck & Waffle Local,78,2.0,4.5,Leicester Square,"[Bars, Waffles, None]",Bars,Waffles,
8,Naru,79,2.0,4.0,Covent Garden,"[Korean, None, None]",Korean,,
9,Absurd Bird Soho,80,3.0,4.0,Soho,"[Chicken Shop, American (New), Soul Food]",Chicken Shop,American (New),Soul Food


Nice!  That's just what we want.  We know it works for this one page (results 71 through 80), so now let's try this approach on some other individual pages and see what speed bumps we run into.

Update: after some spot checks on random single pages, it appears to be working!  So let's try iterating through all ten pages, and include removing the 'Categories' key/value pair.

To make this easier, we'll put the scraping code back into its own function so it's easy to iterate.
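The reason a function helps: the loop that combines pages only needs something that takes a URL and returns a dictionary of lists.  Here is a sketch of that pattern using a stand-in scraper so it runs without touching the network.  Both `collect_pages` and `fake_scrape` are hypothetical names, and the `example.com` URLs are placeholders.

```python
def collect_pages(urls, scrape_function):
    """Call scrape_function on each URL and merge the per-page dictionaries of lists."""
    combined = {}
    for url in urls:
        page_results = scrape_function(url)
        for key, values in page_results.items():
            # create the list the first time a key shows up, then keep extending it
            combined.setdefault(key, []).extend(values)
    return combined


def fake_scrape(url):
    """Stand-in for a real scraper: builds one fake result from the start value."""
    start = int(url.split("start=")[1])
    return {"Name": [f"Restaurant {start + 1}"], "Rank": [start + 1]}


urls = [f"https://example.com/search?start={value}" for value in (0, 10, 20)]
combined = collect_pages(urls, fake_scrape)
print(combined)
```

Using `setdefault` means the results dictionary doesn't have to be set up with every key ahead of time, which is a different design choice from the cells below.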

#### Define the scraping function

In [105]:
def scrape_a_page(url_to_scrape):
    #### Make the Request
    # So let's get to requestin'!
    http_response = requests.get(url_to_scrape)   # this actually requests the page and stores the resulting response

    # get the text version of the http_response
    source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

    # then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
    yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # name the parser explicitly so bs4 doesn't have to guess one

    
    
    #### Initialize the Dictionary of Results
    # we're going to structure this data as a dictionary of lists
    london_yelp_restaurants = {}

    
    
    #### Scrape for Names
    # let's scrape for names:
    scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

    # get the text
    scraped_names = [scraped_name.text for scraped_name in scraped_names]

    # clean it up by filtering out empty strings and numbers 
    # this logic will be a problem if a restaurant's name is just a number...
    # (the list comprehension already visits every name, so no outer loop is needed)
    scraped_names = [name for name in scraped_names if name != '' and not name.isdigit()]

    # add it to the dictionary
    london_yelp_restaurants['Name'] = scraped_names

    
    
    #### Scrape for Ranks
    # now let's scrape for ranks:
    scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

    # get the text
    scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

    # now parse out the number
    scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

    # let's assume we want integers, so convert
    scraped_ranks = [int(rank) for rank in scraped_ranks]

    # add it to the dictionary
    london_yelp_restaurants['Rank'] = scraped_ranks

    
    
    #### Scrape for Price Ranges
    # now let's scrape for price ranges.  This one got tricky, and we had to go for an outside-in approach.
    scraped_price_ranges = yelp_scrape.find_all('div', {'class': 'priceCategory__09f24__2IbAM'})

    # get the text
    scraped_price_ranges = [price.text for price in scraped_price_ranges]

    # now grab the £ signs if they're there, otherwise `None`
    for element in range(len(scraped_price_ranges)):
        if scraped_price_ranges[element].count("\xA3") == 0:
            scraped_price_ranges[element] = None
        else:
            scraped_price_ranges[element] = scraped_price_ranges[element].count("\xA3")

    # add it to the dictionary
    london_yelp_restaurants['Price range'] = scraped_price_ranges

    
    
    #### Scrape for Star Ratings
    # now let's scrape for star rating:
    scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

    # narrow it down
    scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

    # clean it up by getting rid of the "star rating" suffix and converting to a float
    scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

    # add it to the dictionary
    london_yelp_restaurants['Star rating'] = scraped_star_ratings

    
    
    #### Scrape for Neighborhood
    # now let's scrape for neighborhood:
    scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

    # now let's just get the <p> tags within each container
    scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

    # let's simplify what we're looking at by getting the text of each tag
    # (a nested list comprehension is a tidier way to do this than looping over indexes)
    scraped_neighborhoods = [[tag.text for tag in collection] for collection in scraped_neighborhoods]

    # now let's get the last element of each list
    scraped_neighborhoods = [location[-1] for location in scraped_neighborhoods]

    # add it to the dictionary
    london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods

    
    
    #### Scrape for Categories
    # now let's scrape for categories
    scraped_categories = yelp_scrape.find_all('p', {'class': 'css-n6i4z7'})

    # get the text
    scraped_categories = [categories.text for categories in scraped_categories]

    # now clean it up!  

    # Let's assume for now that the ten restaurant-specific results will be at the top.  This could be problematic later.
    scraped_categories = scraped_categories[0:10]    # get the first ten results only

    # now clean off any £ symbols (some strings may have none)
    for categories in range(len(scraped_categories)):
        scraped_categories[categories] = scraped_categories[categories].replace('\xA3', '')

    # now split the remaining strings into lists
    for element in range(len(scraped_categories)):
        scraped_categories[element] = scraped_categories[element].split(',')

    # there are still leading spaces on strings after the split, so trim those off
    # (.strip() is also safe on an empty string, where indexing [0] would raise an IndexError)
    for element in scraped_categories:
        for category in range(len(element)):
            element[category] = element[category].strip()

    # add it to the dictionary
    london_yelp_restaurants['Categories'] = scraped_categories

    
    
    #### Return the dictionary
    return london_yelp_restaurants

#### Generate the URLs

In [106]:
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

# generate a list like [0, 10, 20, ..., 90] to get the first ten pages (first 100 results)
page_start_values = [i * 10 for i in range(0,10)] 

# create a list of URLs to scrape
urls_to_scrape = [f"{base_url}&start={value}" for value in page_start_values]   # append a suffix to the base url

urls_to_scrape

['https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=0',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=10',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=20',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=30',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=40',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=50',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=60',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=70',
 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start=80',
 'https://www.yelp.com/search?find_desc=Restaur

#### Create a dictionary to hold the results

In [107]:
# first, create a dictionary to hold the results (a dictionary of lists)
london_yelp_restaurants = {
    "Name": [],
    "Rank": [],
    "Price range": [],
    "Star rating": [],
    "Neighborhood": [],
    "Categories": []
}

#### Loop through the pages and call the scraping function

In [108]:
# let's loop through all the pages we want to scrape, and call the function

for url in urls_to_scrape:  # for each site in the list of URLs to scrape
    # call the scrape function for the page and store it in a dictionary
    single_page_yelp_results = scrape_a_page(url)
    
    # now append the single-page results to the main results dictionary
    for key in london_yelp_restaurants:
        london_yelp_restaurants[key] += single_page_yelp_results[key]

#### Split the category lists into their own individual columns
After we iterate through all the pages, we can find the maximum number of categories we encountered.  We'll make each restaurant's list be the same length by "padding" with `None` values.

In [109]:
# find the max number of categories
max_categories = 0
for category_list in london_yelp_restaurants['Categories']:
    if len(category_list) > max_categories:
        max_categories = len(category_list)
        
# max_categories

In [110]:
# Now we pad the shorter lists with `None` values so they're all the same length
for category_list in london_yelp_restaurants['Categories']:
    if len(category_list) < max_categories:   # if the restaurant's category list is shorter
        while len(category_list) < max_categories:
            category_list.append(None)    # keep appending None to the end of the category list until it's long enough

# london_yelp_restaurants['Categories']

In [111]:
# now we need to break up these equal-length lists into their own categories
# to do this, we'll create one new dictionary key for each category number
for i in range(max_categories):
    column_name = f"Category_{i + 1}"
    london_yelp_restaurants[column_name] = []
    
# london_yelp_restaurants

In [112]:
# now we'll populate the new category values
# we want this to be flexible so any number of categories will work.  
# So avoid hard-coding variables like 'Category_2'

# iterate through each category list in the dictionary and place the elements into the appropriate individual category key
for category_list in london_yelp_restaurants['Categories']:
    for category_number in range(max_categories):
        london_yelp_restaurants[f"Category_{category_number + 1}"].append(category_list[category_number])

# london_yelp_restaurants

In [113]:
# now delete the 'Categories' key/value pair from the dictionary, because we don't need it anymore

london_yelp_restaurants.pop('Categories')

[['Fish & Chips', None, None],
 ['Indian', None, None],
 ['Steakhouses', None, None],
 ['British', None, None],
 ['Indian', None, None],
 ['Coffee & Tea', 'Breakfast & Brunch', 'American (Traditional)'],
 ['French', 'British', None],
 ['American (Traditional)', 'Soul Food', 'Cajun/Creole'],
 ['Mediterranean', None, None],
 ['French', 'Modern European', 'Cocktail Bars'],
 ['Fish & Chips', None, None],
 ['Burgers', None, None],
 ['Italian', None, None],
 ['Mexican', 'Bars', None],
 ['Taiwanese', None, None],
 ['British', 'Steakhouses', 'Cocktail Bars'],
 ['Chinese', 'Noodles', None],
 ['British', 'Pubs', 'Gastropubs'],
 ['Modern European', 'Bars', 'British'],
 ['Middle Eastern', 'Mediterranean', None],
 ['Mexican', None, None],
 ['Spanish', 'Tapas Bars', None],
 ['British', None, None],
 ['British', 'Cocktail Bars', 'Steakhouses'],
 ['Cafes', None, None],
 ['Seafood', 'Burgers', None],
 ['Pizza', None, None],
 ['Dim Sum', 'Seafood', 'Noodles'],
 ['Italian', None, None],
 ['Japanese', Non

#### Load the Results into a DataFrame

In [114]:
# and let's load it into a dataframe
pd.DataFrame(london_yelp_restaurants)

Unnamed: 0,Name,Rank,Price range,Star rating,Neighborhood,Category_1,Category_2,Category_3
0,The Mayfair Chippy,1,2.0,4.5,Mayfair,Fish & Chips,,
1,Dishoom,2,2.0,4.5,Covent Garden,Indian,,
2,Flat Iron,3,2.0,4.5,Soho,Steakhouses,,
3,Ffiona’s Restaurant,4,2.0,4.5,Kensington,British,,
4,Dishoom,5,2.0,4.5,Soho,Indian,,
...,...,...,...,...,...,...,...,...
95,Cecconi’s Mayfair,96,3.0,4.0,Mayfair,Italian,,
96,Laksamania,97,,4.5,Fitzrovia,Malaysian,,
97,The Barbary,98,3.0,4.5,Covent Garden,Middle Eastern,Mediterranean,
98,La Porchetta Pollo Bar,99,1.0,4.0,Bloomsbury,Italian,,


It worked!!!!!!

## Tier 4 - Cleaned-Up Code Answer
Below is my cleaned-up code that executes everything in a single cell:

In [115]:
# We're going to need three Python libraries to make this code work:

import requests   # this library will help us make http requests, which is how we get webpages
from bs4 import BeautifulSoup   # this library will help us parse html source code (i.e., webscraping)
import pandas as pd   # this library will help us with data science stuff



#### Define the scraping function
def scrape_a_page(url_to_scrape):
    
    
    
    #### Make the Request
    # So let's get to requestin'!
    http_response = requests.get(url_to_scrape)   # this actually requests the page and stores the resulting response

    # get the text version of the http_response
    source_code_text = http_response.text   # .text is an attribute of the requests Response object, not part of BeautifulSoup

    # then we'll give that text source code to BeautifulSoup, which creates an object with useful scraping methods
    yelp_scrape = BeautifulSoup(source_code_text, 'html.parser')   # name the parser explicitly so bs4 doesn't have to guess one

    
    
    #### Initialize the Dictionary of Results
    # we're going to structure this data as a dictionary of lists
    london_yelp_restaurants = {}

    
    
    #### Scrape for Names
    # let's scrape for names:
    scraped_names = yelp_scrape.find_all('a', {'class': 'css-166la90'})  # this returns a list of tags that match the criteria

    # get the text
    scraped_names = [scraped_name.text for scraped_name in scraped_names]

    # clean it up by filtering out empty strings and numbers 
    # this logic will be a problem if a restaurant's name is just a number...
    scraped_names = [name for name in scraped_names if name != '' and not name.isdigit()]

    # add it to the dictionary
    london_yelp_restaurants['Name'] = scraped_names

    
    
    #### Scrape for Ranks
    # now let's scrape for ranks:
    scraped_ranks = yelp_scrape.find_all('span', {'class': 'css-1pxmz4g'})

    # get the text
    scraped_ranks = [rank.text for rank in scraped_ranks]  # get the text between the tags

    # now parse out the number
    scraped_ranks = [rank[0:rank.index('.')] for rank in scraped_ranks]  # grab the first characters before the "."

    # let's assume we want integers, so convert
    scraped_ranks = [int(rank) for rank in scraped_ranks]

    # add it to the dictionary
    london_yelp_restaurants['Rank'] = scraped_ranks

    
    
    #### Scrape for Price Ranges
    # now let's scrape for price ranges.  This one got tricky, and we had to go for an outside-in approach.
    scraped_price_ranges = yelp_scrape.find_all('div', {'class': 'priceCategory__09f24__2IbAM'})

    # get the text
    scraped_price_ranges = [price.text for price in scraped_price_ranges]

    # now count the £ signs if they're there, otherwise `None` (a count of 0 becomes None)
    scraped_price_ranges = [price.count("\xA3") or None for price in scraped_price_ranges]

    # add it to the dictionary
    london_yelp_restaurants['Price range'] = scraped_price_ranges

    
    
    #### Scrape for Star Ratings
    # now let's scrape for star rating:
    scraped_star_ratings = yelp_scrape.find_all('div', {'class': 'i-stars__09f24__1T6rz'})

    # narrow it down
    scraped_star_ratings = [rating.attrs['aria-label'] for rating in scraped_star_ratings]

    # clean it up by getting rid of the "star rating" suffix and converting to a float
    scraped_star_ratings = [float(rating[0:len(rating) - len(" star rating")]) for rating in scraped_star_ratings]

    # add it to the dictionary
    london_yelp_restaurants['Star rating'] = scraped_star_ratings

    
    
    #### Scrape for Neighborhood
    # now let's scrape for neighborhood:
    scraped_neighborhood_containers = yelp_scrape.find_all('div', {'class': 'container__09f24__1fWZl'})

    # now let's just get the <p> tags within each container
    scraped_neighborhoods = [container.find_all('p') for container in scraped_neighborhood_containers]

    # let's simplify what we're looking at by getting the text of each tag
    scraped_neighborhoods = [[tag.text for tag in collection] for collection in scraped_neighborhoods]

    # now let's get the last element of each list
    scraped_neighborhoods = [location[-1] for location in scraped_neighborhoods]

    # add it to the dictionary
    london_yelp_restaurants['Neighborhood'] = scraped_neighborhoods

    
    
    #### Scrape for Categories
    # now let's scrape for categories
    scraped_categories = yelp_scrape.find_all('p', {'class': 'css-n6i4z7'})

    # get the text
    scraped_categories = [categories.text for categories in scraped_categories]

    # now clean it up!  

    # Let's assume for now that the ten restaurant-specific results will be at the top.  This could be problematic later.
    scraped_categories = scraped_categories[0:10]    # get the first ten results only

    # now clean off any £ symbols (some strings may have none)
    scraped_categories = [categories.replace('\xA3', '') for categories in scraped_categories]

    # now split the remaining strings into lists
    scraped_categories = [categories.split(',') for categories in scraped_categories]

    # there is still whitespace around the strings after the split, so trim it off
    # (.strip() is also safe on empty strings, where indexing with [0] would raise an IndexError)
    scraped_categories = [[category.strip() for category in categories] for categories in scraped_categories]

    # add it to the dictionary
    london_yelp_restaurants['Categories'] = scraped_categories

    
    
    #### Return the dictionary
    return london_yelp_restaurants



#### Generate the URLs
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"

# generate a list like [0, 10, 20, ..., 90] to get the first ten pages (first 100 results)
page_start_values = [i * 10 for i in range(0,10)] 

# create a list of URLs to scrape
urls_to_scrape = [f"{base_url}&start={value}" for value in page_start_values]   # append a suffix to the base url



#### Create a dictionary to hold the results
# first, create a dictionary to hold the results (a dictionary of lists)
london_yelp_restaurants = {
    "Name": [],
    "Rank": [],
    "Price range": [],
    "Star rating": [],
    "Neighborhood": [],
    "Categories": []
}



#### Loop through the pages and call the scraping function
# let's loop through all the pages we want to scrape, and call the function

for url in urls_to_scrape:  # for each site in the list of URLs to scrape
    # call the scrape function for the page and store it in a dictionary
    single_page_yelp_results = scrape_a_page(url)
    
    # now append the single-page results to the main results dictionary
    for key in london_yelp_restaurants:
        london_yelp_restaurants[key] += single_page_yelp_results[key]
        
        
        

#### Split the category lists into their own individual columns
# After we iterate through all the pages, we can find the maximum number of categories we encountered
# We'll make each restaurant's list be the same length by "padding" with `None` values.

# find the max number of categories
max_categories = 0
for category_list in london_yelp_restaurants['Categories']:
    if len(category_list) > max_categories:
        max_categories = len(category_list)

# Now we pad the shorter lists with `None` values so they're all the same length
for category_list in london_yelp_restaurants['Categories']:
    while len(category_list) < max_categories:   # while the restaurant's category list is too short
        category_list.append(None)    # keep appending None to the end of the category list until it's long enough

# now we need to break up these equal-length lists into their own categories
# to do this, we'll create one new dictionary key for each category number
for i in range(max_categories):
    column_name = f"Category_{i + 1}"
    london_yelp_restaurants[column_name] = []

# now we'll populate the new category values
# we want this to be flexible so any number of categories will work.  
# So avoid hard-coding variables like 'Category_2'
# iterate through each category list in the dictionary and place the elements into the appropriate individual category key
for category_list in london_yelp_restaurants['Categories']:
    for category_number in range(max_categories):
        london_yelp_restaurants[f"Category_{category_number + 1}"].append(category_list[category_number])

# now delete the 'Categories' key/value pair from the dictionary, because we don't need it anymore
london_yelp_restaurants.pop('Categories')




#### Load the Results into a DataFrame
pd.DataFrame(london_yelp_restaurants)

| | Name | Rank | Price range | Star rating | Neighborhood | Category_1 | Category_2 | Category_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | The Mayfair Chippy | 1 | 2.0 | 4.5 | Mayfair | Fish & Chips | | |
| 1 | Dishoom | 2 | 2.0 | 4.5 | Covent Garden | Indian | | |
| 2 | Flat Iron | 3 | 2.0 | 4.5 | Soho | Steakhouses | | |
| 3 | Ffiona’s Restaurant | 4 | 2.0 | 4.5 | Kensington | British | | |
| 4 | Dishoom | 5 | 2.0 | 4.5 | Soho | Indian | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Cecconi’s Mayfair | 96 | 3.0 | 4.0 | Mayfair | Italian | | |
| 96 | Laksamania | 97 | | 4.5 | Fitzrovia | Malaysian | | |
| 97 | The Barbary | 98 | 3.0 | 4.5 | Covent Garden | Middle Eastern | Mediterranean | |
| 98 | La Porchetta Pollo Bar | 99 | 1.0 | 4.0 | Bloomsbury | Italian | | |
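
As an aside, the pad-and-split step for the categories could also be sketched with `itertools.zip_longest` from the standard library, which transposes the per-restaurant lists and fills the gaps with `None` in one pass. This is an alternative to the loops in the cell above, not what that cell does. The helper name `split_categories` is hypothetical, and the sample lists are illustrative.

```python
from itertools import zip_longest


def split_categories(category_lists):
    """Turn per-restaurant category lists into padded 'Category_N' columns."""
    columns = {}
    # zip_longest(*lists) yields one tuple per category position,
    # padding restaurants that have fewer categories with None
    transposed = zip_longest(*category_lists, fillvalue=None)
    for number, column in enumerate(transposed, start=1):
        columns[f"Category_{number}"] = list(column)
    return columns


# illustrative input: three restaurants with two, one and three categories
example_categories = [
    ['Seafood', 'Burgers'],
    ['Pizza'],
    ['Dim Sum', 'Seafood', 'Noodles'],
]

category_columns = split_categories(example_categories)
print(category_columns)
```

The resulting dictionary could then be merged into the main results with `london_yelp_restaurants.update(category_columns)` before building the DataFrame.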
