# Web scraping with BeautifulSoup

---



### Learning Objectives

**After this lesson, you will be able to:**

- Identify whether it's ethical to scrape a website
- Write selectors to pick out elements from HTML
- Perform web scraping tasks using `requests` and `beautifulsoup`
- Convert the results of a scraping task to a `pandas DataFrame` 


# <font color='blue'> What is web scraping?

Sometimes the data we're after won't be available via a spreadsheet download, or an API. It might be embedded in a web page, or spread across several web pages. In this case, web scraping can be a good solution. 

Web scraping is a slightly more complicated data acquisition method. It involves two steps:

* Grabbing (or 'scraping') the HTML underlying a website

* Searching (or 'parsing') it to extract the information you're interested in. HTML is a language where different pieces of content on a website are sandwiched or enclosed inside 'tags' that describe exactly what that piece of content is. So, a large heading would be enclosed between opening and closing heading tags: ``<h1> My Heading <\h1>``. By searching for particular tags in our scraped HTML, we can pick out and store the exact pieces of content we're interested in.

Scraping is the programmatic equivalent of browsing a website, and copy-pasting content from the website into your own local file or spreadsheet. There are some basic rules to follow when scraping websites, to avoid getting into trouble:

* **Don't scrape websites that ask you not to scrape them** It's important to avoid scraping websites that explicitly prohibit scrapers/crawlers/spiders/robots (these can sometimes be used to all mean scraping) in their Terms of Use or Terms and Conditions. Under special circumstances, it might be possible to get permission from a website to scrape them if you make direct contact with the owners, explain why you'd like to scrape their site, and what you'll do with the results. 

* **Ask permission** If possible, it's polite/good practise to drop the organisation behind a website a note or email to let them know you'll be scraping their site. 

* **Avoid scraping personal data**

* **Be considerate** Don't send one million requests to a website in the space of one second! If you're looping through several URLs, add in a pause of a second or two using a function like ``time.sleep(5)`` to make sure you don't overwhelm a website's servers.

The Office for National Statistics has published a good set of ethical scraping guidelines here: https://www.ons.gov.uk/aboutus/transparencyandgovernance/lookingafterandusingdataforpublicbenefit/policies/policieswebscrapingpolicy


# <font color='blue'> Making `get` requests

We're going to make a `get` request to scrape the HTML that drives `example.com`.

Let's start by defining the URL we want to scrape, and using ``requests.get()`` to grab the HTML behind the site. 

In [1]:
import requests

Make a `get` request to a website

In [2]:
example_url = 'http://toscrape.com'
example_response = requests.get(example_url)

Let's check the status code of the response

In [3]:
example_response.status_code

200

All good! Now let's access the text of the response.

In [4]:
example_text = example_response.text

print(example_text)

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Scraping Sandbox</title>
        <link href="./css/bootstrap.min.css" rel="stylesheet">
        <link href="./css/main.css" rel="stylesheet">
    </head>
    <body>
        <div class="container">
            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10 well">
                    <img class="logo" src="img/zyte.png" width="200px">
                    <h1 class="text-right">Web Scraping Sandbox</h1>
                </div>
            </div>

            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10">
                    <h2>Books</h2>
                    <p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their s

---
## <font color='red'> Exercise 1: Making `get` requests</font>
    
Try making `get` requests to the following URLs and observing the HTML you get back. Double check your results by opening your web browsers, visiting the website and using `View Source`.

* https://toscrape.com/
* https://en.wikipedia.org/wiki/Sensitivity_and_specificity
* https://www.bbc.co.uk/news 
* https://www.reddit.com/
* https://twitter.com/home

Do your results look as expected? If not, why not?

In [64]:
example_url = 'https://toscrape.com/'
example_response = requests.get(example_url)

In [65]:
example_response.status_code

200

In [66]:
example_text = example_response.text

print(example_text)

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Scraping Sandbox</title>
        <link href="./css/bootstrap.min.css" rel="stylesheet">
        <link href="./css/main.css" rel="stylesheet">
    </head>
    <body>
        <div class="container">
            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10 well">
                    <img class="logo" src="img/zyte.png" width="200px">
                    <h1 class="text-right">Web Scraping Sandbox</h1>
                </div>
            </div>

            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10">
                    <h2>Books</h2>
                    <p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their s

In [6]:
example_url = 'https://en.wikipedia.org/wiki/Sensitivity_and_specificity'
example_response = requests.get(example_url)

In [7]:
example_response.status_code

200

In [8]:
example_text = example_response.text

print(example_text)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Sensitivity and specificity - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"5c3078f6-7276-4f4e-9a10-79d6160d3d62","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Sensitivity_and_specificity","wgTitle":"Sensitivity and specificity","wgCurRevisionId":1099790443,"wgRevisionId":1099790443,"wgArticleId":5599330,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description matches Wikidata","Accuracy and precision","Bioinformatics"

In [9]:
example_url = 'https://www.bbc.co.uk/news'
example_response = requests.get(example_url)

In [10]:
example_response.status_code

200

In [11]:
example_text = example_response.text

print(example_text)

<!DOCTYPE html>
<html lang="en-GB" class="b-pw-1280 b-reith-sans-font no-touch" id="responsive-news">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="google-site-verification" content="Tk6bx1127nACXoqt94L4-D-Of1fdr5gxrZ7u2Vtj9YI">
    <link href="//static.bbc.co.uk" rel="preconnect" crossorigin>
    <link href="//m.files.bbci.co.uk" rel="preconnect" crossorigin>
    <link href="//nav.files.bbci.co.uk" rel="preconnect" crossorigin>
    <link href="//ichef.bbci.co.uk" rel="preconnect" crossorigin>
    <link rel="dns-prefetch" href="//mybbc.files.bbci.co.uk">
    <link rel="dns-prefetch" href="//ssl.bbc.co.uk/">
    <link rel="dns-prefetch" href="//sa.bbc.co.uk/">
    <link rel="dns-prefetch" href="//ichef.bbci.co.uk">


    <link rel="preload" as="style" href="//m.files.bbci.co.uk/modules/bbc-morph-news-page-styles/2.4.25/enhanced.

In [12]:
example_url = 'https://www.reddit.com/'
example_response = requests.get(example_url)

In [13]:
example_response.status_code

200

In [14]:
example_text = example_response.text

print(example_text)


    <!DOCTYPE html>
    <html lang="en-US">
      <head>
        <script>
    var __SUPPORTS_TIMING_API = typeof performance === 'object' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;
    function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };
    var __firstPostLoaded = false;
    function __markFirstPostVisible() {
      if (__firstPostLoaded) { return; }
      __firstPostLoaded = true;
      __perfMark("first_post_title_image_loaded");
    }
    var __firstCommentLoaded = false;
    function __markFirstCommentVisible() {
      if (__firstCommentLoaded) { return; }
      __firstCommentLoaded = true;
      __perfMark("first_comment_loaded");
    }
  </script>
        <script>__perfMark('head_tag_start');</script>
        <meta charSet="utf-8"/>
        <meta name="viewport" content="width=device-width, initial-scale=1" />
        <meta name="referrer" content="origin-when-cross-origin" />
        <style>
  /* http://meyerwe

In [37]:
example_url = 'https://twitter.com/home'
example_response = requests.get(example_url)

In [38]:
example_response.status_code

200

In [39]:
example_text = example_response.text

print(example_text)

<!DOCTYPE html>
<html dir="ltr" lang="en">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" /><link rel="preconnect" href="//abs.twimg.com" /><link rel="dns-prefetch" href="//abs.twimg.com" /><link rel="preconnect" href="//api.twitter.com" /><link rel="dns-prefetch" href="//api.twitter.com" /><link rel="preconnect" href="//pbs.twimg.com" /><link rel="dns-prefetch" href="//pbs.twimg.com" /><link rel="preconnect" href="//t.co" /><link rel="dns-prefetch" href="//t.co" /><link rel="preconnect" href="//video.twimg.com" /><link rel="dns-prefetch" href="//video.twimg.com" /><link rel="preload" as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/feature-switch-manifest.463c3bd8.js" nonce="MTc0MTY5ZjItMjI1ZC00YjA3LTllZDUtMDAwOGIxYTY0ZGVi" /><link rel="preload" as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/po

In [40]:
type(example_text)

str

# <font color='blue'> Writing HTML selectors

The HTML we scraped from `example.com` is currently one big, messy string that contains all the HTML from the `example.com` front page. How can we turn this into a searchable object? 

We use a library called ``beautiful soup`` to transform this string into a searchable object.

In [19]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.3.2.post1-py3-none-any.whl (37 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.11.1 soupsieve-2.3.2.post1


In [20]:
from bs4 import BeautifulSoup

Let's now convert our raw text from the `toscrape.com` front page into a more easily searchable object.

In [28]:
example_url = 'https://toscrape.com/'
example_response = requests.get(example_url)

In [29]:
example_response.status_code

200

In [30]:
example_text = example_response.text

print(example_text)

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Scraping Sandbox</title>
        <link href="./css/bootstrap.min.css" rel="stylesheet">
        <link href="./css/main.css" rel="stylesheet">
    </head>
    <body>
        <div class="container">
            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10 well">
                    <img class="logo" src="img/zyte.png" width="200px">
                    <h1 class="text-right">Web Scraping Sandbox</h1>
                </div>
            </div>

            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10">
                    <h2>Books</h2>
                    <p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their s

In [31]:
example_soup = BeautifulSoup(example_response.text, 'html.parser')
example_soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet"/>
<link href="./css/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10 well">
<img class="logo" src="img/zyte.png" width="200px"/>
<h1 class="text-right">Web Scraping Sandbox</h1>
</div>
</div>
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Books</h2>
<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>
<div class="col-md-6">
<a href="http://books.toscrape.com"><img class="img-thumbnail" src="./img/books.png"/></a>
</div>
<div class="col-

Our output doesn't look very different to ``example_text``, but let's check the types of our two variables:

In [32]:
type(example_text)

str

In [33]:
type(example_soup)

bs4.BeautifulSoup

Whereas ``example_text`` is a string, ``example_soup`` is a 'beautiful soup object.' 

This means we can very easily and precisely search for tagged HTML content using our new-found knowledge of how to write **selectors**.

We know that the HTML tag for a hyperlink is ``'a'``. 

We can use this knowledge, together with the ``select`` method in beautiful soup, to extract every URL.

In [34]:
example_soup.select('a')

[<a href="http://books.toscrape.com">fictional bookstore</a>,
 <a href="http://books.toscrape.com">books.toscrape.com</a>,
 <a href="http://books.toscrape.com"><img class="img-thumbnail" src="./img/books.png"/></a>,
 <a href="http://quotes.toscrape.com/">A website</a>,
 <a href="http://quotes.toscrape.com"><img class="img-thumbnail" src="./img/quotes.png"/></a>,
 <a href="http://quotes.toscrape.com/">Default</a>,
 <a href="http://quotes.toscrape.com/scroll">Scroll</a>,
 <a href="http://quotes.toscrape.com/js">JavaScript</a>,
 <a href="http://quotes.toscrape.com/js-delayed">Delayed</a>,
 <a href="http://quotes.toscrape.com/tableful">Tableful</a>,
 <a href="http://quotes.toscrape.com/login">Login</a>,
 <a href="http://quotes.toscrape.com/search.aspx">ViewState</a>,
 <a href="http://quotes.toscrape.com/random">Random</a>]

This is a list; if there was more than one `a` element on the page, the list would contain all of these elements, because **select** finds **all** HTML elements that match our selector.

We can access the first element of the list, and get the hyperlink text using the `.get_text()` method.

In [35]:
example_soup.select('a')[0].get_text()

'fictional bookstore'

We can also access the actual URL of the link.

In [36]:
example_soup.select('a')[0]['href']

'http://books.toscrape.com'


## <font color='red'> Exercise 2: Writing selectors

* Write a selector to extract the title of the page, 'Scraping Sandbox'

* Write a selector to extract the text in the two sub-headers (hint: check the HTML to see what elements they are!)

* Write a selector to find out how many `<div>` elements on the page have a **class** of `col-md-6`

In [80]:
example_soup = BeautifulSoup(example_response.text, 'html.parser')
example_soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [68]:
example_soup.select('title')[0].get_text()

'Scraping Sandbox'

In [69]:
example_soup.find_all('h2')

[<h2>Books</h2>, <h2>Quotes</h2>]

In [70]:
for h2 in example_soup.find_all('h2'):
    print(h2.get_text())

Books
Quotes


In [71]:
len(example_soup.select('div.col-md-6'))

4

## <font color='red'> Exercise 3: Putting it all together</font>

Now let's put our web scraping skills to the test and retrieve book prices from http://books.toscrape.com/

#### First, identify the HTML tag that contains the price of a book

In [107]:
book_url = 'https://books.toscrape.com/'
book_response = requests.get(example_url)

In [108]:
book_response.status_code

200

In [109]:
book_text = example_response.text

# print(example_text)

In [110]:
book_soup = BeautifulSoup(example_response.text, 'html.parser')
# example_soup

In [119]:
book_soup.select("p.price_color")

[<p class="price_color">Â£51.77</p>,
 <p class="price_color">Â£53.74</p>,
 <p class="price_color">Â£50.10</p>,
 <p class="price_color">Â£47.82</p>,
 <p class="price_color">Â£54.23</p>,
 <p class="price_color">Â£22.65</p>,
 <p class="price_color">Â£33.34</p>,
 <p class="price_color">Â£17.93</p>,
 <p class="price_color">Â£22.60</p>,
 <p class="price_color">Â£52.15</p>,
 <p class="price_color">Â£13.99</p>,
 <p class="price_color">Â£20.66</p>,
 <p class="price_color">Â£17.46</p>,
 <p class="price_color">Â£52.29</p>,
 <p class="price_color">Â£35.02</p>,
 <p class="price_color">Â£57.25</p>,
 <p class="price_color">Â£23.88</p>,
 <p class="price_color">Â£37.59</p>,
 <p class="price_color">Â£51.33</p>,
 <p class="price_color">Â£45.17</p>]

#### Now write a CSS selector to retrieve all price elements from the page

#### Use `requests` to retrieve the HTML of the webpage, and `BeautifulSoup` and your selector to select all HTML tags from the page that contain a book's price

In [117]:
price_tags = book_soup.select("p.price_color")
price_tags

[<p class="price_color">Â£51.77</p>,
 <p class="price_color">Â£53.74</p>,
 <p class="price_color">Â£50.10</p>,
 <p class="price_color">Â£47.82</p>,
 <p class="price_color">Â£54.23</p>,
 <p class="price_color">Â£22.65</p>,
 <p class="price_color">Â£33.34</p>,
 <p class="price_color">Â£17.93</p>,
 <p class="price_color">Â£22.60</p>,
 <p class="price_color">Â£52.15</p>,
 <p class="price_color">Â£13.99</p>,
 <p class="price_color">Â£20.66</p>,
 <p class="price_color">Â£17.46</p>,
 <p class="price_color">Â£52.29</p>,
 <p class="price_color">Â£35.02</p>,
 <p class="price_color">Â£57.25</p>,
 <p class="price_color">Â£23.88</p>,
 <p class="price_color">Â£37.59</p>,
 <p class="price_color">Â£51.33</p>,
 <p class="price_color">Â£45.17</p>]

#### Use Python to extract the text from these HTML tags, so you're left with a list of strings representing prices

#### Clean up the strings (removing any unnecessary characters) and convert them into a `float` type

In [144]:
prices = []
for price_tag in price_tags:
    prices.append(price_tag.get_text())

In [145]:
prices

['Â£51.77',
 'Â£53.74',
 'Â£50.10',
 'Â£47.82',
 'Â£54.23',
 'Â£22.65',
 'Â£33.34',
 'Â£17.93',
 'Â£22.60',
 'Â£52.15',
 'Â£13.99',
 'Â£20.66',
 'Â£17.46',
 'Â£52.29',
 'Â£35.02',
 'Â£57.25',
 'Â£23.88',
 'Â£37.59',
 'Â£51.33',
 'Â£45.17']

In [150]:
type(prices[0])

str

In [149]:
prices_str = ([float(prices.('Â£')) for price in prices])


AttributeError: 'list' object has no attribute 'strip'

In [None]:
price_values = []

for price in prices:
    price_value = float(price[2:])
    price_values.append(price_value)

#### Create a new `pandas` DataFrame from these prices

In [140]:
import pandas as pd

In [141]:
df_price = pd.Dataframe(price)

AttributeError: module 'pandas' has no attribute 'Dataframe'

#### Use your new DataFrame to calculate the average price of a book in the store

## <font color="green">Stretch</font>

If you scroll down to the bottom of the page you'll notice this is only one page of a possible 50! In order to get a true average price of books on this site, we'll need to scrape them all.

#### Identify the url structure of further pages of books. How can you change a url to navigate to a specific one of the 50 pages?

#### Write a `for` loop to go through the pages 1-50 and retrieve prices, just like you did from the first page

Reuse your code as much as you can, adapting it to allow your price list to include all 50 pages worth.

#### Revise your average book price calculation

Use all 50 pages' worth of data to calculate the real average book price

# <font color='blue'> Repetitive scraping with `for` loops

We're familiar with control structures like `if` statements and `for` loops, so we have all the tools we need to carry out more advanced scraping tasks. 

Imagine we want to scrape Wikipedia to find the latitude and longitude of several cities.

**Let's start by trying to Wikipedia to find the latitude and longitude of a single city.**

In [151]:
city = 'Budapest'

wiki_url = 'https://en.wikipedia.org/wiki/'+city

city_soup = BeautifulSoup(requests.get(wiki_url).text)

In [152]:
type(city_soup)

bs4.BeautifulSoup

We now have the HTML from a single city's page. Let's see where the latitude/longitude live on such a page:

https://en.wikipedia.org/wiki/Budapest

In [153]:
city_soup.select("span.latitude")[0]

<span class="latitude">47°29′33″N</span>

In [154]:
city_soup.select("span.longitude")[0]

<span class="longitude">19°03′05″E</span>

Easy! Now we could put this into a function, loop through all the cities we want and call the function each time

In [155]:
def get_lat_long(city):
    wiki_url = 'https://en.wikipedia.org/wiki/'+city
    city_soup = BeautifulSoup(requests.get(wiki_url).text)
    
    latitude = city_soup.select("span.latitude")[0]
    longitude = city_soup.select("span.longitude")[0]
    
    return (latitude.get_text(), longitude.get_text())

get_lat_long("London")

('51°30′26″N', '0°7′39″W')

While we loop through our cities, we can then just store the lat/long values each time in lists

In [156]:
import time

cities = ["Budapest", "London", "Mumbai"]
latitudes = []
longitudes = []

for city in cities:
    print(f"Scraping {city}")
    lat, long = get_lat_long(city)
    latitudes.append(lat)
    longitudes.append(long)
    
    # let's be polite :-)
    time.sleep(1)

print("Done")

Scraping Budapest
Scraping London
Scraping Mumbai
Done


In [157]:
latitudes

['47°29′33″N', '51°30′26″N', '19°04′34″N']

In [158]:
longitudes

['19°03′05″E', '0°7′39″W', '72°52′39″E']

Now we can store the data in a new DataFrame

In [159]:
import pandas as pd

city_df = pd.DataFrame(
    {
        "city": cities,
        "latitude": latitudes,
        "longitude": longitudes
    }
)

city_df

Unnamed: 0,city,latitude,longitude
0,Budapest,47°29′33″N,19°03′05″E
1,London,51°30′26″N,0°7′39″W
2,Mumbai,19°04′34″N,72°52′39″E


# <font color='blue'> Extracting data from tables

Often data is stored in HTML `table` elements. We *could* loop through each row and cell and extract the data cell by cell, but that would take a long time. There is a better way!

In [160]:
world_cup_url = "https://en.wikipedia.org/wiki/List_of_FIFA_Women%27s_World_Cup_finals"

wc_html = requests.get(world_cup_url).text

wc_soup = BeautifulSoup(wc_html)

In [161]:
tables = wc_soup.select("table")
len(tables)

6

In [162]:
tables[2]

<table class="sortable plainrowheaders wikitable">
<caption>List of finals matches, their venues and locations, the finalists and final scores
</caption>
<tbody><tr>
<th scope="col">Year
</th>
<th scope="col">Winners
</th>
<th scope="col">Score
</th>
<th scope="col">Runners-up
</th>
<th scope="col">Venue
</th>
<th scope="col">Location
</th>
<th scope="col">Attendance
</th></tr>
<tr>
<th scope="row" style="text-align:center"><a href="/wiki/1991_FIFA_Women%27s_World_Cup" title="1991 FIFA Women's World Cup">1991</a>
</th>
<td align="right"><a href="/wiki/United_States_women%27s_national_soccer_team" title="United States women's national soccer team">United States</a><span class="flagicon"> <img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_S

You could then use `BeautifulSoup` to extract individual `<td>` elements that contain the data we're after...

**or** the much simpler `pandas` approach:

In [163]:
table_df_list = pd.read_html(wc_html)

In [164]:
type(table_df_list)

list

In [165]:
len(table_df_list)

6

The table we're after is already a DataFrame!

In [166]:
table_df_list[2]

Unnamed: 0,Year,Winners,Score,Runners-up,Venue,Location,Attendance
0,1991,United States,2–1,Norway,Tianhe Stadium,"Guangzhou, China","63,000[3]"
1,1995,Norway,2–0,Germany,Råsunda Stadium,"Stockholm, Sweden","17,158[4]"
2,1999,United States,0–0,China,Rose Bowl,"Pasadena, California, US","90,185[5]"
3,2003,Germany,*2–1*,Sweden,Home Depot Center,"Carson, California, US","26,137[6]"
4,2007,Germany,2–0,Brazil,Hongkou Football Stadium,"Shanghai, China","31,000[7]"
5,2011,Japan,2–2,United States,Commerzbank-Arena,"Frankfurt, Germany","48,817[8]"
6,2015,United States,5–2,Japan,BC Place,"Vancouver, Canada","53,341[9]"
7,2019,United States,2–0,Netherlands,Parc Olympique Lyonnais,"Décines-Charpieu, France","57,900[10]"
8,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals
9,Year,Finalists,Match,Finalists,Venue,Location,Attendance


It's not perfect but it's a pretty great start!

Now it's just some quick cleaning and analysis to find which country has won the most World Cups

In [167]:
df_wc = table_df_list[2]

# if the "Winners" column starts with "Upcoming" or "Finalists" or NaN it's erroneous
df_wc.dropna(subset=["Winners"], inplace=True)
df_wc = df_wc[df_wc["Winners"].str.startswith("Finalists") == False]
df_wc = df_wc[df_wc["Winners"].str.startswith("Upcoming") == False]

len(df_wc)

8

In [168]:
df_wc["Winners"].value_counts()

United States    4
Germany          2
Norway           1
Japan            1
Name: Winners, dtype: int64

---

## <font color='red'> Exercise 4: Putting it all together</font>
    
We will now combine all these concepts to scrape some new data.
    
The [National UFO Reporting Center Online Database](http://www.nuforc.org) stores information about reported UFO sightings.
    
We want to answer the following questions:
    
- is there a seasonal pattern in UFO sightings?
- which US state sees the most UFO sightings? Within that, which cities report the most sightings?
- what shape of UFO is the most common for people to see?

#### 1. Visit http://www.nuforc.org/webreports/ndxevent.html, click through some of the rows and take a look at the format the data will be in

#### 2. Use Python and `requests` to read in the page at https://www.nuforc.org/webreports/ndxe202203.html and display the HTML

In [170]:
ufo_url = "https://www.nuforc.org/webreports/ndxe202203.html"

ufo_req = requests.get(ufo_url)
ufo_req.status_code

# display the returned HTML below

200

In [173]:
ufo_req.text

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"><META NAME="GENERATOR" CONTENT="Mozilla/4.5b2 [en] (WinNT; I) [Netscape]"><TITLE>UFO Report Index For 03/2022</TITLE></HEAD><BODY><FONT FACE="Calibri"><FONT SIZE=+1>National UFO ReportingCenter</FONT></FONT><BR><FONT FACE="Calibri"><FONT SIZE=+1>UFO Report Index For 03/2022</FONT><BR><P><FONT FACE="Calibri"><FONT SIZE=+1><A HREF="https://www.nuforc.org">NUFORC Home</A><BR><BR></FONT></BODY><TABLE  CELLSPACING=1><THEAD><TR><TH BGCOLOR=#c0c0c0 BORDERCOLOR=#000000 ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>Date / Time</TH><TH BGCOLOR=#c0c0c0 BORDERCOLOR=#000000 ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>City</TH><TH BGCOLOR=#c0c0c0 BORDERCOLOR=#000000 ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>State</TH><TH BGCOLOR=#c0c0c0 BORDERCOLOR=#000000 ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>Coun

#### 3. Use `BeautifulSoup` to find all the table elements. How many are there?

If you inspect the page you will see that there is a closing `</BODY>` and `</HTML>` tag near the top where there shouldn't be. Sometimes websites have bad HTML and your browser does its best to display it. It's only when you try to scrape it that these things cause an issue!

There is a bit of code below to help you clean up the page:

In [174]:
# replace the FIRST instance of those </BODY> and </HTML> only
ufo_html_cleaned = ufo_req.text.replace("</BODY>\r</HTML>", "", 1)

In [176]:
ufo_soup = BeautifulSoup(ufo_html_cleaned)

ufo_tables = ufo_soup.select("table")

ufo_tables

[<table cellspacing="1"><thead><tr><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">Date / Time</font></th><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">City</font></th><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">State</font></th><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">Country</font></th><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">Shape</font></th><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">Duration</font></th><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">Summary</font></th><th bgcolor="#c0c0c0" bordercolor="#000000"><font color="#000000" face="Calibri" style="FONT-SIZE:11pt">Po

#### 4. Use `pandas` to read in the contents of the table as a DataFrame

How well did pandas manage to import the data?

In [178]:
import numpy as np

In [207]:
df_ufo = pd.read_html(ufo_html_cleaned)
df_ufo[0].head()

Unnamed: 0,Date / Time,City,State,Country,Shape,Duration,Summary,Posted,Images
0,3/31/22 23:50,Lincoln,CA,USA,Circle,10+ min,2 Floating / Hovering Circles of Light,4/22/22,
1,3/31/22 20:57,Hickory,NC,USA,Light,2 minutes,Light saw making a circular rotation repeatedl...,4/22/22,
2,3/31/22 14:08,Coupeville,WA,USA,Sphere,9 seconds approx,Black sphere high in sky moving at fast speed,4/22/22,
3,3/31/22 12:07,Bloomington,IN,USA,Triangle,A long time.,I have managed to get in a Fight with Invading...,4/22/22,Yes
4,3/31/22 10:32,St. Louis,MO,USA,,,MADAR Node 70,4/22/22,


#### 5. Deduce the url pattern for the monthly sightings

The sightings on the page http://www.nuforc.org/webreports/ndxevent.html are organised by month. Inspect the monthly urls to determine a pattern

#### 6. Write a function that reads in a month of UFO sightings

Your function should:

- take as input a year and a month (in whatever format you see fit)
- request the appropriate HTML
- extract the `<table>` into a DataFrame. Combine the code from previous questions to achieve this. Don't forget to include the code to clean up the HTML with those problematic tags!

Test it with March 2022 and make sure your DataFrame looks the same as above.

In [204]:
def get_monthly_sightings(year, month):
    
    # if we pass in integers we have to make sure single digits are in "01" format
    # lots of ways to do this, one is with zfill
    month_string = str(month).zfill(2)
#     print('month_string: ', month_string)
    
    ufo_url_monthly = f"http://www.nuforc.org/webreports/ndxe{year}{month_string}.html"
    
    # retrieve the HTML
    ufo_req_month = requests.get(ufo_url_monthly)
    
    ufo_req_month.status_code
    
    # clean the HTML
    ufo_html_cleaned = ufo_req_month.text.replace("</BODY></HTML>", "", 1)
    
    # return the DataFrame
    return pd.read_html(ufo_html_cleaned)[0]

In [205]:
get_monthly_sightings(2022, 3).head()

Unnamed: 0,Date / Time,City,State,Country,Shape,Duration,Summary,Posted,Images
0,3/31/22 23:50,Lincoln,CA,USA,Circle,10+ min,2 Floating / Hovering Circles of Light,4/22/22,
1,3/31/22 20:57,Hickory,NC,USA,Light,2 minutes,Light saw making a circular rotation repeatedl...,4/22/22,
2,3/31/22 14:08,Coupeville,WA,USA,Sphere,9 seconds approx,Black sphere high in sky moving at fast speed,4/22/22,
3,3/31/22 12:07,Bloomington,IN,USA,Triangle,A long time.,I have managed to get in a Fight with Invading...,4/22/22,Yes
4,3/31/22 10:32,St. Louis,MO,USA,,,MADAR Node 70,4/22/22,


#### 7. Write a `for` loop to call your function for every month since 2019

Store each returned DataFrame in the list below

In [214]:
ufo_sightings_df = []

for year in [2019, 2020, 2021, 2022]:
    for month in range(1, 13):
        print(f"{year} month {month}")
    try:
        get_monthly_sightings(year, month)
        ufo_sightings_df.append(get_monthly_sightings(year, month)) 
    except:
        print(f"No data found for {year}/{month}.")
        pass
    

print("Done!")

2019 month 1
2019 month 2
2019 month 3
2019 month 4
2019 month 5
2019 month 6
2019 month 7
2019 month 8
2019 month 9
2019 month 10
2019 month 11
2019 month 12
2020 month 1
2020 month 2
2020 month 3
2020 month 4
2020 month 5
2020 month 6
2020 month 7
2020 month 8
2020 month 9
2020 month 10
2020 month 11
2020 month 12
2021 month 1
2021 month 2
2021 month 3
2021 month 4
2021 month 5
2021 month 6
2021 month 7
2021 month 8
2021 month 9
2021 month 10
2021 month 11
2021 month 12
2022 month 1
2022 month 2
2022 month 3
2022 month 4
2022 month 5
2022 month 6
2022 month 7
2022 month 8
2022 month 9
2022 month 10
2022 month 11
2022 month 12
No data found for 2022/12.
Done!


In [215]:
ufo_sightings_df[0]

Unnamed: 0,Date / Time,City,State,Country,Shape,Duration,Summary,Posted,Images
0,12/31/19 23:55,Wesley Chapel,FL,USA,Light,30 minutes,8-9 Disc shaped lights moving south about 1000...,2/7/20,
1,12/31/19 23:23,Sanford,FL,USA,Circle,5 minutes,3 unidentified bright orange objects travellin...,2/7/20,
2,12/31/19 23:00,Pinellas Park,FL,USA,Sphere,4 minutes,At work on New Year's Eve. As I looked up in t...,5/1/20,
3,12/31/19 23:00,Fort Lauderdale,FL,USA,Fireball,:30,"I was in Fort Lauderdale, FL looking toward th...",2/7/20,
4,12/31/19 22:45,Lakeland,FL,USA,Oval,3 minutes,Two VERY bright glowing objects moving from no...,2/7/20,
...,...,...,...,...,...,...,...,...,...
758,12/1/19 06:00,Mcleansville,NC,USA,Disk,3 minutes,I was walking with my son to check my mail wit...,12/19/19,
759,12/1/19 04:48,Newburgh,IN,USA,,,MADAR Node 142,12/19/19,
760,12/1/19 01:25,Alexandria,VA,USA,,,MADAR Node 141,12/19/19,
761,12/1/19 00:30,Woolwich,ME,USA,Other,20 minutes,I saw what looked like a person standing by a ...,12/1/19,


#### 8. Use the `pandas` `.concat()` method to combine your DataFrames into a single dataset of UFO sightings

#### 9. Convert the `Date / Time` column to a `datetime`

#### 10. Answer the "research" questions


* What are the most common shapes of UFOs that people sight?
* Is there a seasonal pattern in sightings?
* How (if at all) has the pandemic changed UFO sighting behaviour?