<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web scraping with BeautifulSoup

---



### Learning Objectives

**After this lesson, you will be able to:**

- Identify whether it's ethical to scrape a website
- Write selectors to pick out elements from HTML
- Perform web scraping tasks using `requests` and `beautifulsoup`
- Convert the results of a scraping task to a `pandas DataFrame` 


#  What is web scraping?
---

Sometimes the data we're after won't be available via a spreadsheet download, or an API. It might be embedded in a web page, or spread across several web pages. In this case, web scraping can be a good solution. 

Web scraping is a slightly more complicated data acquisition method. It involves two steps:

* Grabbing (or 'scraping') the HTML underlying a website

* Searching (or 'parsing') it to extract the information you're interested in. HTML is a language where different pieces of content on a website are sandwiched or enclosed inside 'tags' that describe exactly what that piece of content is. So, a large heading would be enclosed between opening and closing heading tags: ``<h1> My Heading </h1>``. By searching for particular tags in our scraped HTML, we can pick out and store the exact pieces of content we're interested in.

Scraping is the programmatic equivalent of browsing a website, and copy-pasting content from the website into your own local file or spreadsheet. There are some basic rules to follow when scraping websites, to avoid getting into trouble:

* **Don't scrape websites that ask you not to scrape them** It's important to avoid scraping websites that explicitly prohibit scrapers, crawlers, spiders or robots (terms that are often used interchangeably) in their Terms of Use or Terms and Conditions. Under special circumstances, it might be possible to get permission to scrape a website if you contact the owners directly and explain why you'd like to scrape their site and what you'll do with the results. 

* **Ask permission** If possible, it's polite and good practice to drop the organisation behind a website a note or email to let them know you'll be scraping their site. 

* **Avoid scraping personal data**

* **Be considerate** Don't send one million requests to a website in the space of one second! If you're looping through several URLs, add in a pause of a few seconds using a function like ``time.sleep(5)`` to make sure you don't overwhelm a website's servers.
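
One way to build that pause into a loop is sketched below. The helper name `fetch_politely`, its parameters and the two-second default are illustrative assumptions rather than part of this lesson; `fetch` can be any function that takes a URL, such as `requests.get`. The demonstration uses a stand-in function so that it runs without touching a real website.

```python
import time


def fetch_politely(urls, fetch, pause_seconds=2.0):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch_politely and its parameters are hypothetical names for this sketch.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one
            time.sleep(pause_seconds)
        results.append(fetch(url))
    return results


# Offline demonstration: a stand-in fetch function instead of a real request
demo_pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: f"<html>{url}</html>",
    pause_seconds=0.1,
)
print(demo_pages)
```

With real pages, the call might look like `fetch_politely(list_of_urls, requests.get)`.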

The Office for National Statistics has published a good set of ethical scraping guidelines here: https://www.ons.gov.uk/aboutus/transparencyandgovernance/datastrategy/datapolicies/webscrapingpolicy
    

#  Making `get` requests
---

We're going to make a `get` request to scrape the HTML that drives `toscrape.com`.

Let's start by defining the URL we want to scrape, and using ``requests.get()`` to grab the HTML behind the site. 

Make a `get` request to a website:

In [None]:
import requests

example_url = 'https://toscrape.com/'
example_response = requests.get(example_url)


Let's check the status code of the response:

In [None]:
example_response.status_code

All good! A status code of `200` means the request succeeded. Now let's access the text of the response.

## <font color='green'> Exercise 1: Making `get` requests</font>
--- 

Try making `get` requests to the following URLs and observing the HTML you get back. Double-check your results by opening your web browser, visiting the website and using `View Source`.

* https://toscrape.com/
* https://en.wikipedia.org/wiki/Sensitivity_and_specificity
* https://www.bbc.co.uk/news 
* https://www.reddit.com/
* https://twitter.com/home

Do your results look as expected? If not, why not?

#  Writing HTML selectors
---

The HTML we scraped from `toscrape.com` is currently one big, messy string that contains all the HTML from the `toscrape.com` front page. How can we turn this into a searchable object? 

We use a library called ``beautiful soup`` to transform this string into a searchable object.

In [5]:
from bs4 import BeautifulSoup

Let's now convert our raw text from the `toscrape.com` front page into a more easily searchable object.

In [3]:
example_text = example_response.text
example_soup = BeautifulSoup(example_text, 'html.parser')
example_soup

If you're curious about `html.parser`, check out [the documentation](https://docs.python.org/3/library/html.parser.html).

Our output doesn't look very different to ``example_text``, but let's check the types of our two variables:
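
A minimal sketch of that check is shown here. The short hard-coded HTML string is an assumption used as a stand-in for the downloaded page; it is not the real `toscrape.com` source.

```python
from bs4 import BeautifulSoup

# Stand-in for the response text (not the real page source)
example_text = "<html><body><h1>My Heading</h1></body></html>"
example_soup = BeautifulSoup(example_text, "html.parser")

# One is a plain Python string, the other a BeautifulSoup object
print(type(example_text))
print(type(example_soup))
```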

Whereas ``example_text`` is a string, ``example_soup`` is a 'beautiful soup object.' 

This means we can very easily and precisely search for tagged HTML content using our new-found knowledge of how to write **selectors**.

We know that the HTML tag for a hyperlink is ``'a'``. 

We can use this knowledge, together with the ``select`` method in beautiful soup, to extract every URL.

This is a list; if there was more than one `a` element on the page, the list would contain all of these elements, because **select** finds **all** HTML elements that match our selector.
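
The sketch below illustrates how `select` behaves with a few common selector types. The HTML is made up for illustration and does not match the real `toscrape.com` markup; the class and id names are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration (not the real toscrape.com markup)
sample_html = """
<div class="card"><a href="/first">First link</a></div>
<div class="card"><a href="/second">Second link</a></div>
<div id="footer"><p>No link here</p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = sample_soup.select("a")          # every <a> element
cards = sample_soup.select("div.card")   # <div> elements with class "card"
footer = sample_soup.select("#footer")   # the element with id "footer"
missing = sample_soup.select("table")    # no match gives an empty list

print(len(links), len(cards), len(footer), len(missing))
```

Because `select` always returns a list, it is worth checking the length before indexing into the result.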

We can access the first element of the list, and get the hyperlink text using the `.get_text()` method.

We can also access the actual URL of the link.
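
As a rough illustration of both steps, the sketch below works on a one-line, hand-written HTML string (an assumption standing in for the scraped page).

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a scraped page
sample_html = '<p>Visit <a href="https://toscrape.com/">the sandbox</a> today.</p>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_link = sample_soup.select("a")[0]

link_text = first_link.get_text()      # the text between <a> and </a>
link_url = first_link["href"]          # attributes are read like dictionary keys
link_title = first_link.get("title")   # .get() returns None if the attribute is absent

print(link_text, link_url, link_title)
```

Using `.get()` instead of square brackets avoids a `KeyError` when some links lack the attribute you are looking for.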

## <font color='green'> Exercise 2: Writing selectors</font>
---

* Write a selector to extract the title of the page, 'Scraping Sandbox'

* Write a selector to extract the text in the two sub-headers (hint: check the HTML to see what elements they are!)

* Write a selector to find out how many `<div>` elements on the page have a **class** of `col-md-6`

## <font color='green'> Exercise 3: Putting it all together</font>
---

Now let's put our web scraping skills to the test and retrieve book prices from http://books.toscrape.com/

#### First, identify the HTML tag that contains the price of a book

#### Now write a CSS selector to retrieve all price elements from the page

#### Use `requests` to retrieve the HTML of the webpage, and `BeautifulSoup` and your selector to select all HTML tags from the page that contain a book's price

#### Use Python to extract the text from these HTML tags, so you're left with a list of strings representing prices

#### Clean up the strings (removing any unnecessary characters) and convert them into a `float` type

#### Create a new `pandas` DataFrame from these prices

#### Use your new DataFrame to calculate the average price of a book in the store

## <font color="green">Stretch</font>
---

If you scroll down to the bottom of the page you'll notice this is only one page of a possible 50! In order to get a true average price of books on this site, we'll need to scrape them all.

#### Identify the url structure of further pages of books. How can you change a url to navigate to a specific one of the 50 pages?

#### Write a `for` loop to go through the pages 1-50 and retrieve prices, just like you did from the first page

Reuse your code as much as you can, adapting it to allow your price list to include all 50 pages worth.

#### Revise your average book price calculation

Use all 50 pages' worth of data to calculate the real average book price

#  Repetitive scraping with `for` loops
---

We're familiar with control structures like `if` statements and `for` loops, so we have all the tools we need to carry out more advanced scraping tasks. 

Imagine we want to scrape Wikipedia to find the latitude and longitude of several cities.

**Let's start by trying to scrape Wikipedia to find the latitude and longitude of a single city.**

In [8]:
import requests
from bs4 import BeautifulSoup

city = 'Budapest'
wiki_url = 'https://en.wikipedia.org/wiki/' + city

# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
city_soup = BeautifulSoup(requests.get(wiki_url).text, 'html.parser')
city_soup

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Budapest - Wikipedia</title>
...

We now have the HTML from a single city's page. Let's see where the latitude/longitude live on such a page:

https://en.wikipedia.org/wiki/Budapest
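
On English Wikipedia city pages, the coordinates typically sit in `<span>` elements with the classes `latitude` and `longitude`. That markup is an assumption based on how such pages are usually structured, so confirm it with your browser's inspector. The snippet below is a hand-written stand-in for the relevant part of the page, with illustrative values.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the coordinates markup (illustrative values)
snippet = """
<span class="geo-dms">
  <span class="latitude">47°29′33″N</span>
  <span class="longitude">19°03′05″E</span>
</span>
"""
snippet_soup = BeautifulSoup(snippet, "html.parser")

# select_one returns the first match, or None if nothing matches
latitude = snippet_soup.select_one("span.latitude").get_text()
longitude = snippet_soup.select_one("span.longitude").get_text()

print(latitude, longitude)
```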

Easy! Now we could put this into a function, loop through all the cities we want and call the function each time

While we loop through our cities, we can then just store the lat/long values each time in lists

Now we can store the data in a new DataFrame
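
The whole workflow might look something like the sketch below. The function name `extract_lat_long`, the `span.latitude` / `span.longitude` selectors and the placeholder pages are all assumptions for illustration; in practice each page's HTML would come from `requests.get('https://en.wikipedia.org/wiki/' + city).text`.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_lat_long(html):
    """Return (latitude, longitude) text from page HTML, or (None, None)."""
    soup = BeautifulSoup(html, "html.parser")
    lat_tag = soup.select_one("span.latitude")
    lon_tag = soup.select_one("span.longitude")
    if lat_tag is None or lon_tag is None:
        return None, None
    return lat_tag.get_text(), lon_tag.get_text()


# Stand-in pages with placeholder coordinates (not scraped data)
pages = {
    "City A": '<span class="latitude">10°00′N</span><span class="longitude">20°00′E</span>',
    "City B": '<span class="latitude">30°00′S</span><span class="longitude">40°00′W</span>',
    "City C": "<p>No coordinates on this page</p>",
}

cities, latitudes, longitudes = [], [], []
for city, html in pages.items():
    lat, lon = extract_lat_long(html)
    cities.append(city)
    latitudes.append(lat)
    longitudes.append(lon)

city_df = pd.DataFrame({"city": cities, "latitude": latitudes, "longitude": longitudes})
print(city_df)
```

Returning `(None, None)` for pages without coordinates keeps the three lists the same length, so the DataFrame can still be built.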

#  Extracting data from tables
---

Often data is stored in HTML `table` elements. We *could* loop through each row and cell and extract the data cell by cell, but that would take a long time. There is a better way!

You could then use `BeautifulSoup` to extract individual `<td>` elements that contain the data we're after...
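
A sketch of that cell-by-cell approach, using a made-up table with placeholder values rather than a real scraped page:

```python
from bs4 import BeautifulSoup

# Made-up table with placeholder values
table_html = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Team A</td><td>3</td></tr>
  <tr><td>Team B</td><td>1</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in table_soup.select("tr"):
    cells = [td.get_text() for td in tr.select("td")]
    if cells:  # the header row only has <th> cells, so it is skipped
        rows.append(cells)

print(rows)
```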

**or** the much simpler `pandas` approach:
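
Here is a minimal sketch of `pd.read_html` on a made-up table with placeholder values. It assumes an HTML parser that `pandas` can use, such as `lxml`, is installed. The HTML string is wrapped in `StringIO` because recent `pandas` versions deprecate passing literal HTML directly.

```python
from io import StringIO

import pandas as pd

# Made-up table with placeholder values
table_html = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Team A</td><td>3</td></tr>
  <tr><td>Team B</td><td>1</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(table_html))
wins_df = tables[0]

print(len(tables))
print(wins_df)
```

`pd.read_html` also accepts a URL, in which case it downloads the page itself.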

The table we're after is already a DataFrame!

It's not perfect but it's a pretty great start!

Now it's just some quick cleaning and analysis to find which country has won the most World Cups



## <font color='green'> Exercise 4: Putting it all together</font>

---
    
We will now combine all these concepts to scrape some new data.
    
The [National UFO Reporting Center Online Database](http://www.nuforc.org) stores information about reported UFO sightings.
    
We want to answer the following questions:
    
- is there a seasonal pattern in UFO sightings?
- which US state sees the most UFO sightings? Within that, which cities report the most sightings?
- what shape of UFO is the most common for people to see?

#### 1. Visit http://www.nuforc.org/webreports/ndxevent.html, click through some of the rows and take a look at the format the data will be in

#### 2. Use Python and `requests` to read in the page at https://www.nuforc.org/webreports/ndxe202203.html and display the HTML

In [9]:
import requests

url = "https://www.nuforc.org/webreports/ndxe202203.html"
response = requests.get(url)
# Named ufo_html so that the cleaning step later in this exercise can use it
ufo_html = response.text
ufo_html

#### 3. Use `BeautifulSoup` to find all the table elements. How many are there?

If you inspect the page you will see that there is a closing `</BODY>` and `</HTML>` tag near the top where there shouldn't be. Sometimes websites have bad HTML and your browser does its best to display it. It's only when you try to scrape it that these things cause an issue!

There is a bit of code below to help you clean up the page:

In [13]:
# replace the FIRST instance of those </BODY> and </HTML> only

ufo_html_cleaned = ufo_html.replace("</BODY>\r</HTML>", "", 1)
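
To see what the third argument of `str.replace` does, here is a small sketch on a made-up string (not the real page) that contains the closing pair twice:

```python
# Tiny made-up page with a stray closing pair near the top
messy = "<HTML><BODY></BODY>\r</HTML><TABLE></TABLE></BODY>\r</HTML>"

# A count of 1 removes only the first occurrence; the final pair stays in place
cleaned = messy.replace("</BODY>\r</HTML>", "", 1)

print(repr(cleaned))
```

Without the count, every occurrence would be removed, including the legitimate closing tags at the end of the page.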

#### 4. Use `pandas` to read in the contents of the table as a DataFrame

How well did pandas manage to import the data?

In [16]:
import pandas as pd

tables = pd.read_html(url)

df = tables[0]
print(df)

       Date / Time           City State Country     Shape  \
0    3/31/22 23:50        Lincoln    CA     USA    Circle   
1    3/31/22 20:57        Hickory    NC     USA     Light   
2    3/31/22 14:08     Coupeville    WA     USA    Sphere   
3    3/31/22 12:07    Bloomington    IN     USA  Triangle   
4    3/31/22 10:32      St. Louis    MO     USA       NaN   
..             ...            ...   ...     ...       ...   
333   3/1/22 03:00     Wentzville    MO     USA     Other   
334   3/1/22 03:00    Wentzville,    MO     USA       NaN   
335   3/1/22 01:30      Daleville    VA     USA      Disk   
336   3/1/22 01:27  Kearneysville    WV     USA  Triangle   
337   3/1/22 00:00     Fort Payne    AL     USA  Cylinder   

                     Duration  \
0                     10+ min   
1                   2 minutes   
2            9 seconds approx   
3                A long time.   
4                         NaN   
..                        ...   
333     6 to8 seconds Tuesday   
334

#### 5. Deduce the url pattern for the monthly sightings

The sightings on the page http://www.nuforc.org/webreports/ndxevent.html are organised by month. Inspect the monthly urls to determine a pattern

#### 6. Write a function that reads in a month of UFO sightings

Your function should:

- take as input a year and a month (in whatever format you see fit)
- request the appropriate HTML
- extract the `<table>` into a DataFrame. Combine the code from previous questions to achieve this. Don't forget to include the code to clean up the HTML with those problematic tags!

Test it with March 2022 and make sure your DataFrame looks the same as above.

In [37]:
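from io import StringIO

def read_ufo_sightings(year, month):
    # Build the monthly URL, e.g. ndxe202203.html for March 2022
    url = f"http://www.nuforc.org/webreports/ndxe{year}{month:02d}.html"
    response = requests.get(url)
    ufo_html = response.text

    # Remove only the first stray </BODY></HTML> pair
    ufo_html_cleaned = ufo_html.replace("</BODY>\r</HTML>", "", 1)

    soup = BeautifulSoup(ufo_html_cleaned, 'html.parser')
    tables = soup.find_all('table')
    # Wrap the HTML in StringIO: recent pandas versions deprecate literal HTML strings
    df = pd.read_html(StringIO(str(tables[0])))[0]

    return df

year = 2022
month = 3
read_ufo_sightings(year, month)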

Unnamed: 0,Date / Time,City,State,Country,Shape,Duration,Summary,Posted,Images
0,3/31/22 23:50,Lincoln,CA,USA,Circle,10+ min,2 Floating / Hovering Circles of Light,4/22/22,
1,3/31/22 20:57,Hickory,NC,USA,Light,2 minutes,Light saw making a circular rotation repeatedl...,4/22/22,
2,3/31/22 14:08,Coupeville,WA,USA,Sphere,9 seconds approx,Black sphere high in sky moving at fast speed,4/22/22,
3,3/31/22 12:07,Bloomington,IN,USA,Triangle,A long time.,I have managed to get in a Fight with Invading...,4/22/22,Yes
4,3/31/22 10:32,St. Louis,MO,USA,,,MADAR Node 70,4/22/22,
...,...,...,...,...,...,...,...,...,...
333,3/1/22 03:00,Wentzville,MO,USA,Other,6 to8 seconds Tuesday,"I seen space craft ,the bottom ,look like squa...",4/22/22,
334,3/1/22 03:00,"Wentzville,",MO,USA,,15 seconds or 20 seconds,"I seen space craft ,the bottom ,look like squa...",4/22/22,Yes
335,3/1/22 01:30,Daleville,VA,USA,Disk,2 hrs,"Object moved in gridlike pattern for 2 hours, ...",4/22/22,
336,3/1/22 01:27,Kearneysville,WV,USA,Triangle,8 to 12 seconds,Only the edges of the craft were visible the c...,3/4/22,


#### 7. Write a `for` loop to call your function for every month since 2019

Store each returned DataFrame in the list below

In [38]:
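import time

dataframes = []

for year in range(2019, 2023):
    for month in range(1, 13):
        df = read_ufo_sightings(year, month)
        dataframes.append(df)
        time.sleep(1)  # pause between requests to be considerate

# Check how many monthly DataFrames were collected
len(dataframes)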

#### 8. Use the `pandas` `pd.concat()` function to combine your DataFrames into a single dataset of UFO sightings
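
As a rough guide, the sketch below combines two stand-in DataFrames with placeholder rows; in the exercise you would pass your own list of monthly DataFrames instead.

```python
import pandas as pd

# Stand-in monthly DataFrames with placeholder rows
january = pd.DataFrame({"City": ["City A", "City B"], "Shape": ["Light", "Disk"]})
february = pd.DataFrame({"City": ["City C"], "Shape": ["Circle"]})

# ignore_index=True renumbers the rows instead of repeating each month's index
all_sightings = pd.concat([january, february], ignore_index=True)

print(all_sightings)
```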

#### 9. Convert the `Date / Time` column to a `datetime`
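
One possible approach is sketched below on a few sample strings. The format string `%m/%d/%y %H:%M` is an assumption inferred from values such as `3/31/22 23:50` in the output shown earlier, so check it against your own data.

```python
import pandas as pd

# Sample strings in the style of the 'Date / Time' column, plus one bad value
sample = pd.DataFrame({"Date / Time": ["3/31/22 23:50", "3/1/22 03:00", "not a date"]})

# errors="coerce" turns unparseable values into NaT instead of raising an error
sample["Date / Time"] = pd.to_datetime(
    sample["Date / Time"], format="%m/%d/%y %H:%M", errors="coerce"
)

print(sample)
```

Once the column is a `datetime`, accessors such as `.dt.month` and `.dt.year` become available for grouping.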

#### 10. Answer the "research" questions


* What are the most common shapes of UFOs that people sight?
* Is there a seasonal pattern in sightings?
* How (if at all) has the pandemic changed UFO sighting behaviour?