# Web Scraping



We'll use tools from several packages. The `requests` package gives us a way of getting data from the web. This is the main way we will get the actual raw HTML from a webpage. However, that raw HTML is very difficult to work with directly. The `lxml` module parses the HTML into a tree of elements that we can query for just the pieces we want. Later on, we'll also use `BeautifulSoup` (from the `bs4` package) as a second way of doing the same job.

Finally, we'll use `pandas` to clean everything up.

In [None]:
from bs4 import BeautifulSoup
from lxml import html 
from requests import get
import pandas as pd

## Example: Restaurants of the Year from USA Today

Suppose we wanted to study which restaurants are considered the best in the US and compare the types of cuisines that are featured in different parts of the country. To do this, we might want to grab some information from various sources, such as top restaurant lists, and then link that information to the regions that the restaurants are in.

Let's take a look at an example of getting some of that information from an article published in USA Today about the best restaurants in the US. This was a random website I found from a Google search: https://www.usatoday.com/story/life/food-dining/2024/02/15/best-restaurants-of-the-year-across-usa/71909704007/.

First, let's define the URL of the website that we want to scrape.

In [None]:
url = 'https://www.usatoday.com/story/life/food-dining/2024/02/15/best-restaurants-of-the-year-across-usa/71909704007/'

We then use the `get` function from the `requests` package to get the HTML code from the URL that we specified.

In [None]:
webpage = get(url)

We can check the status code to see if we were able to actually connect to the webpage. We want to see a status code of 200. If you see something else, that's a sign that something went wrong. There are many reasons something might go wrong at this stage, including a typo in the URL, issues with your internet connection, lack of permissions, and so on. The status code can help you diagnose what the issue might be if you don't see 200. 

In [None]:
webpage.status_code
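
If the number alone doesn't tell you much, Python's standard library can translate it. Here is a minimal sketch; the helper name `describe_status` is made up for this example and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # not one of the codes the standard library knows about
        return f"{code}: unrecognized status code"
    return f"{code}: {status.phrase}"

# a few codes you are likely to run into while scraping
for code in (200, 403, 404, 500):
    print(describe_status(code))
```

With the notebook's own response object, you could call `describe_status(webpage.status_code)`.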

In [None]:
tree = html.fromstring(webpage.content)

Next, we will use Selector Gadget (https://selectorgadget.com) to grab the pieces that we want. This is a nifty tool that lets you click on the parts of a webpage you care about and then gives you an XPath expression that selects only those parts. We copy that expression into the cell below.

In [None]:
xp = '//*[contains(concat( " ", @class, " " ), concat( " ", "gnt_ar_b_h2", " " ))]//a'

We use the `xpath` method to pull out only the elements that match that XPath expression.

In [None]:
restaurants = tree.xpath(xp)

Then, we can use `.text` to pull out just the text of an individual element in this list.

In [None]:
restaurants[0].text
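
To see what that XPath expression is doing without hitting the network, here is a small sketch on a hand-written HTML snippet. The snippet, the restaurant names, and the `sample_` variable names are all made up for illustration. The real page's markup may differ, apart from the `gnt_ar_b_h2` class that appears in the expression above.

```python
from lxml import html

# made-up HTML that imitates headings with and without the target class
sample_html = """
<html><body>
  <h2 class="gnt_ar_b_h2 extra"><a href="/a">Example Diner | Springfield, Illinois</a></h2>
  <h2 class="other"><a href="/b">Not a match</a></h2>
  <h2 class="gnt_ar_b_h2"><a href="/c">Sample Cafe | Portland, Maine</a></h2>
</body></html>
"""

sample_tree = html.fromstring(sample_html)

# same expression as above: any element whose class list contains gnt_ar_b_h2,
# then every <a> element nested inside it
sample_xp = '//*[contains(concat( " ", @class, " " ), concat( " ", "gnt_ar_b_h2", " " ))]//a'
sample_links = sample_tree.xpath(sample_xp)

sample_texts = [a.text for a in sample_links]
print(sample_texts)
```

The padding with spaces in the expression is what lets it match `gnt_ar_b_h2` as a whole class name, even when the element has other classes too.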

<font color ='red'>**Question 1: Create a list `names` of restaurant names using the `restaurants` object. These should not contain the location of the restaurant.**</font>

*Hint:* For strings, you can use the `.split()` method to split a string on a specific character (such as `|`).
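
As a quick illustration of the hint, here is how `.split()` behaves on a single made-up string. The entry below is invented, and the real headings may use slightly different spacing around the `|`, which is why the sketch also strips whitespace.

```python
# made-up entry in a "name | location" layout
entry = "Example Diner | Springfield, Illinois"

# split on the bar, then strip stray spaces from each piece
pieces = [part.strip() for part in entry.split("|")]
print(pieces)
```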

<font color ='red'>**Question 2: Create a list `locations` of restaurant locations using the `restaurants` object. These should not contain the names of the restaurant.**</font>

Next, let's create two lists, `city` and `state`, that separate out the parts of each restaurant location. We can use the same splitting process on the `locations` list.

In [None]:
city = [l.split(', ')[0] for l in locations]
city

Looking at the `city` list, though, we might notice an issue. There might be a typo in the article that leaves one location without a comma. (Since webpages change, this is not guaranteed to be there, but there was one when I put this together.) This means that we need to do a bit of cleaning first.

In [None]:
locations[-7] = 'Nashville, Tennessee'
city = [l.split(', ')[0] for l in locations]
city
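
Rather than hunting for the problem entry by eye, you can ask Python which entries are missing the separator. This is a sketch on a made-up list (`sample_locations` is not the real data), but the same check should work on `locations`.

```python
# made-up locations; the middle one is missing its comma
sample_locations = ["Springfield, Illinois", "Austin Texas", "Portland, Maine"]

# keep the position and value of every entry that lacks ", "
missing_comma = [(i, loc) for i, loc in enumerate(sample_locations) if ", " not in loc]
print(missing_comma)
```

Knowing the position makes it easy to patch that one entry by index, the way the cell above does.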

In [None]:
state = [l.split(', ')[1] for l in locations]
state

Once we get the data cleaned up a bit, we can put it into a pandas DataFrame to make working with it easier.

In [None]:
restaurants_dict = {'Name':names, 'City':city, "State":state}
rest_df = pd.DataFrame(restaurants_dict)
rest_df.head()

<font color ='red'>**Question 3: Which state had the most restaurants appear on the USA Today list of Best Restaurants in 2024?**</font>

### Grabbing more information

We have now gotten the restaurant name along with City and State. We can keep going and identify other pieces from the webpage that we want to grab. For example, we see that there is a short description for each restaurant. We might want to pull out that text.

In [None]:
xp = '//*[contains(concat( " ", @class, " " ), concat( " ", "gnt_ar_b_p", " " ))]'

In [None]:
tree = html.fromstring(webpage.content)

In [None]:
article_text = [x.text for x in tree.xpath(xp)]
article_text[:10]

This grabbed a bit too much information, so we'll need to cut it down. For example, we can remove the first few elements because they are just the introductory part of the article. There are also some `None` values in there. Let's clean these up a bit.

In [None]:
descriptions = [x.text for x in tree.xpath(xp) if x.text is not None][6:]
descriptions[:5]
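
One likely source of those `None` values is how `lxml` defines `.text`: it only holds the text that comes before an element's first child tag. If a paragraph starts with a link or other tag, `.text` is `None`, while the `text_content()` method gathers all of the text inside. Here is a sketch on a made-up snippet (I haven't confirmed this is what happens on the real page):

```python
from lxml import html

# made-up paragraphs: the second one begins with a child <a> element
sample_fragment = (
    '<div>'
    '<p class="gnt_ar_b_p">Plain paragraph.</p>'
    '<p class="gnt_ar_b_p"><a href="#">Linked</a> start of paragraph.</p>'
    '</div>'
)

sample_root = html.fromstring(sample_fragment)
sample_paras = sample_root.xpath('.//p')

# .text stops at the first child element, so the second value is None
text_only = [p.text for p in sample_paras]
# text_content() includes the text of nested elements as well
full_text = [p.text_content() for p in sample_paras]

print(text_only)
print(full_text)
```

Filtering out `None`, as in the cell above, drops such paragraphs entirely, so `text_content()` may be worth trying if some descriptions appear to be missing.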

There's a little more work to be done here before we add this to the `rest_df` DataFrame. You'll notice that this list is a bit longer than the number of rows in the DataFrame. That's because there are a few stray pieces that we still need to clean up. Near the end of the article, there are a few extra paragraphs that aren't about specific restaurants. Also, one restaurant had a two-paragraph description, so it has been split into two elements. You'll have to go in manually and update those so that it all matches up.

Feel free to try giving this a shot. Note that the `.join` method for strings can join together the strings within a list. For example, `' '.join(['some', 'text'])` creates the string `'some text'` because it joins the strings within the list using the space `' '` as the separator.

In [None]:
' '.join(['some', 'text'])
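
To merge a description that got split across two list elements, you can combine `.join` with list slicing. This is a sketch on a made-up list; the position of the split entry (index 1 here) is an assumption you would replace with wherever it actually shows up in `descriptions`.

```python
# made-up descriptions; items 1 and 2 belong to the same restaurant
sample_descriptions = [
    "Description A.",
    "Description B, part one.",
    "Description B, part two.",
    "Description C.",
]

split_at = 1  # hypothetical index of the first half of the split description

# keep everything before, join the two halves, keep everything after
merged = (
    sample_descriptions[:split_at]
    + [" ".join(sample_descriptions[split_at:split_at + 2])]
    + sample_descriptions[split_at + 2:]
)
print(merged)
```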

### Notes about using the Selector Gadget method

This is a relatively easy way to grab the data you want from HTML pages. Notably, you don't need to actually know what HTML tags mean, or how to identify them. The Selector Gadget provides a way to do this all by pointing and clicking.

However, this can sometimes be finicky and not grab exactly what you want. A more consistent way is to identify the tags exactly and use those. This might mean doing a bit more cleaning afterward, but it helps ensure that you don't miss data or lose information.

In the next section, we'll talk about another method for grabbing data from HTML files using Beautiful Soup.

## Using Beautiful Soup

Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library that is designed to make pulling data out of HTML files easier. 

We'll first take the HTML content, parse it, then extract the pieces we want using the HTML tags that are part of the webpage. 

In [None]:
webpage = get(url)

We'll create a `BeautifulSoup` data structure from the content that we get from the webpage. This will essentially organize the data so that we can pull out the pieces that we want. You can think of this as similar to how we organize data in DataFrames (except not in the same tabular format).

In [None]:
soup = BeautifulSoup(webpage.content, 'html.parser')

In [None]:
type(soup)

Instead of a tabular structure, though, the data are stored in a tree structure, according to the tags. So, in order to find the data we want, we first need to identify the tags that contain it. It can be helpful to look at the HTML source code to identify the pieces of the webpage that we want to pull from. This might involve a little bit of trial and error, but the browser tools should be useful for finding the tags that you want.
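
To make the tree idea concrete, here is a sketch on a tiny made-up page (not the USA Today article). Every tag has a parent, and a tag can have child tags nested inside it.

```python
from bs4 import BeautifulSoup

# made-up HTML with a few levels of nesting
tiny_html = "<html><body><div><h2>Title</h2><p>Body <b>bold</b> text.</p></div></body></html>"
tiny_soup = BeautifulSoup(tiny_html, "html.parser")

p_tag = tiny_soup.find("p")

# walk upward from the <p> tag: each ancestor is one level higher in the tree
ancestors = [parent.name for parent in p_tag.parents]

# look one level down from the <div>: only its direct child tags
div_children = [child.name for child in tiny_soup.div.find_all(True, recursive=False)]

print(ancestors)
print(div_children)
```

The last entry in `ancestors` is the `BeautifulSoup` object itself, which sits at the root of the tree.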

As a reminder, here is the webpage we are scraping from: https://www.usatoday.com/story/life/food-dining/2024/02/15/best-restaurants-of-the-year-across-usa/71909704007/. 

We can access different parts of the data in the webpage using the dot notation similar to how we might access columns of a DataFrame. For example, since the restaurant names are in the `h2` tag, we can use `soup.h2`.

In [None]:
soup.h2

This is a bit messy, but we can use the `get_text()` method to clean it up and get just the relevant text. 

In [None]:
soup.h2.get_text()

Note that this only provides the first one. To get them all, we can use `.find_all`.

In [None]:
soup.find_all('h2')
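
Here is the difference between the two on a small made-up snippet, so you can see it without scrolling through the full page's output. The HTML and the `demo_` names are invented for this sketch.

```python
from bs4 import BeautifulSoup

# made-up article with two headings
demo_html = """
<article>
  <h2><a href="/a">Example Diner | Springfield, Illinois</a></h2>
  <p>First description.</p>
  <h2><a href="/b">Sample Cafe | Portland, Maine</a></h2>
  <p>Second description.</p>
</article>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# dot notation returns only the first matching tag
first_heading = demo_soup.h2.get_text()

# find_all returns every matching tag, so we can loop over them
all_headings = [h.get_text() for h in demo_soup.find_all("h2")]

print(first_heading)
print(all_headings)
```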

<font color ='red'>**Question 4: Create a list of restaurant names using `soup`.**</font>

Finding the tags that correspond to the pieces of information you want can be a bit time-consuming, but the Inspect or View Page Source feature of your browser can be very helpful here. How you access it will differ according to the browser you are using, but they should all have some way of clicking on an item on the page to find out which tag it is associated with.

For example, to get the paragraph descriptions of each restaurant, we need to find out what tag it's under. Inspecting the webpage should reveal that it is using the 'p' tag.

<font color ='red'>**Question 5: Pull out the descriptions of each restaurant from `soup` as a list. Make sure there isn't any of the front matter introduction to the article. The first element should be details about "Urban Bar & Kitchen".**</font>

Extra code: example of pulling all of the text descriptions using a loop


In [None]:
headers = soup.find_all('h2')
textlist = []
# loop through each header
for head in headers:
    # collect the text of every sibling that follows this header
    pieces = []
    sibling = head.find_next_sibling()
    # stop at the end of the siblings or when you encounter a new header
    while sibling is not None and sibling.name != 'h2':
        pieces.append(sibling.get_text())
        sibling = sibling.find_next_sibling()
    # always add exactly one description per header, even if it is empty
    textlist.append(' '.join(pieces))



# put everything in a data frame
rest_descriptions = pd.DataFrame({"locations":locations ,"name":names,"description": textlist})

# and print the result
rest_descriptions.head()