## Scraping from a simple page

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "https://wagon-public-datasets.s3.amazonaws.com/02-Data-Toolkit/02-Data-Sourcing/example.html"

In [3]:
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [4]:
# You can now query the `soup` object! (A notebook displays only the last expression.)
soup.title.string
soup.find('h1')
soup.find_all('a')

[<a href="http://www.lewagon.com" id="author">Le Wagon</a>]

In [5]:
paragraph = soup.find("p")
paragraph

<p>This is a very simple html page.</p>

In [6]:
articles = soup.find_all("article")
articles

[<article>How to load CSVs?</article>,
 <article>API calls with Python</article>,
 <article>Scraping with BeautifulSoup</article>]

In [7]:
items = soup.find_all("li", class_="package")
items
# Mind the use of `class_` instead of `class`!
# `class` is a reserved keyword in Python, so we can't use it as an argument name.

[<li class="package">Pandas</li>,
 <li class="package">Requests</li>,
 <li class="package">BeautifulSoup</li>]
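
As a small sketch of the `class_` point above, the same filter can also be written with an `attrs` dictionary or a CSS selector, and `get_text()` pulls the text out of each tag. The inline HTML here is a made-up stand-in for the example page, not its actual source.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the example page
html = """
<ul>
  <li class="package">Pandas</li>
  <li class="package">Requests</li>
  <li class="package">BeautifulSoup</li>
  <li class="other">Not a package</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Three ways to filter on the HTML class attribute
by_keyword = soup.find_all("li", class_="package")
by_attrs = soup.find_all("li", attrs={"class": "package"})
by_css = soup.select("li.package")

# Extract the text of each matching tag
names = [li.get_text(strip=True) for li in by_keyword]
print(names)
```

All three calls should select the same three `<li>` tags here; which one to use is mostly a matter of taste.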

## Let's scrape Wikipedia to enrich our data
### Our goal: find the region of each country.

In [8]:
!curl -s https://wagon-public-datasets.s3.amazonaws.com/02-Data-Toolkit/02-Data-Sourcing/iso2_codes.csv > data/iso2_codes.csv

import pandas as pd
iso_df = pd.read_csv('data/iso2_codes.csv', na_filter=False)
iso_df.head()

Unnamed: 0,Name,Code,Full Name
0,Afghanistan,AF,Afghanistan
1,Åland Islands,AX,Åland Islands
2,Albania,AL,Albania
3,Algeria,DZ,Algeria
4,American Samoa,AS,American Samoa


In [9]:
import requests
from bs4 import BeautifulSoup

In [10]:
# When you start scraping, get it working for one row first.
# Then refactor it to get all the data.

# What does our URL look like?
# We can find the region via this url: https://en.wikipedia.org/wiki/Geography_of_<Country_name>
# Let's try with the best country ever - Belgium!
url = "https://en.wikipedia.org/wiki/Geography_of_Belgium"
# Get the response
response = requests.get(url)
# Turn it into Soup
soup = BeautifulSoup(response.text, "html.parser")
# Find the first infobox data cell (on this page it holds the region)
infobox = soup.find(class_="infobox-data")
# Extract the region
region = infobox.text
print(region)

Europe


In [11]:
# Let's now wrap these steps in a function!
def region_scraper(country):
    url = f"https://en.wikipedia.org/wiki/Geography_of_{country}"
    print(f"Scraping info for {country}")
    try:
        # Get the response
        response = requests.get(url)
        # Turn it into Soup
        soup = BeautifulSoup(response.text, "html.parser")
        # Find the infobox
        infobox = soup.find(class_="infobox-data")
        # Extract the region
        region = infobox.text
        return region
    except Exception:
        # The request failed or the page has no infobox data cell
        return None

In [12]:
region_scraper("Indonesia")

Scraping info for Indonesia


'Asia and Oceania'
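
The function above returns `None` for any failure, which hides whether the request failed or the page simply had no infobox. Below is a minimal sketch of one alternative, assuming we split parsing from fetching so the parsing step can be checked offline. `parse_region` and `fetch_region` are hypothetical names, and the timeout value is arbitrary.

```python
import requests
from bs4 import BeautifulSoup


def parse_region(html):
    """Return the text of the first `infobox-data` cell, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find(class_="infobox-data")
    return cell.get_text(strip=True) if cell is not None else None


def fetch_region(country, timeout=10):
    """Download the Wikipedia geography page for `country` and parse it."""
    url = f"https://en.wikipedia.org/wiki/Geography_of_{country}"
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 404 and other HTTP errors as failures
    except requests.RequestException:
        return None
    return parse_region(response.text)


# Offline check on a hypothetical snippet shaped like an infobox row
sample = '<table><tr><td class="infobox-data">Europe</td></tr></table>'
print(parse_region(sample))
print(parse_region("<p>No infobox here</p>"))
```

Catching only `requests.RequestException` means network and HTTP errors become `None`, while an unexpected bug in the parsing code still raises and stays visible.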

In [13]:
# We can now get all the regions with map:
regions = iso_df['Full Name'].map(region_scraper)
regions

Scraping info for Afghanistan
Scraping info for Åland Islands
Scraping info for Albania
Scraping info for Algeria
Scraping info for American Samoa
Scraping info for Andorra
Scraping info for Angola
Scraping info for Anguilla
Scraping info for Antarctica
Scraping info for Antigua and Barbuda
Scraping info for Argentina
Scraping info for Armenia
Scraping info for Aruba
Scraping info for Australia
Scraping info for Austria
Scraping info for Azerbaijan
Scraping info for Bahamas
Scraping info for Bahrain
Scraping info for Bangladesh
Scraping info for Barbados
Scraping info for Belarus
Scraping info for Belgium
Scraping info for Belize
Scraping info for Benin
Scraping info for Bermuda
Scraping info for Bhutan
Scraping info for Bolivia
Scraping info for Bolivia
Scraping info for Bonaire
Scraping info for Bosnia and Herzegovina
Scraping info for Botswana
Scraping info for Bouvet Island
Scraping info for Brazil
Scraping info for British Indian Ocean Territory
Scraping info for Brunei Darussalam

0                   Asia
1                   None
2                 Europe
3                 Africa
4       United States[a]
             ...        
272               France
273               Africa
274                 Asia
275               Africa
276               Africa
Name: Full Name, Length: 277, dtype: object
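
Some scraped values carry Wikipedia footnote markers, such as `United States[a]` in the output above. Here is a minimal sketch of stripping them with a regular expression. `strip_footnotes` is a hypothetical helper, and the pattern assumes markers are short bracketed letters or digits at the end of the text.

```python
import re

# Matches one or more trailing markers such as "[a]" or "[12]"
FOOTNOTE = re.compile(r"(\[[0-9a-zA-Z]{1,3}\])+\s*$")


def strip_footnotes(value):
    """Remove trailing footnote markers; pass None through unchanged."""
    if value is None:
        return None
    return FOOTNOTE.sub("", value).strip()


print(strip_footnotes("United States[a]"))  # United States
print(strip_footnotes("Asia and Oceania"))  # Asia and Oceania
```

It could then be applied to the whole Series with `regions.map(strip_footnotes)`.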

In [14]:
# Now that we know it works, we can assign the result to a new column in our DataFrame.
iso_df['Region'] = regions

In [17]:
# Check the results
iso_df['Region'].isnull().sum()
# Our scraping missed quite a few regions... it's not always reliable!

103
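
To see which countries were missed rather than only how many, a boolean mask on the `Region` column works. The sketch below uses a small hand-made DataFrame (its values mirror the first rows of the output above) in place of the full `iso_df`.

```python
import pandas as pd

# Hand-made stand-in for iso_df after scraping
df = pd.DataFrame({
    "Full Name": ["Afghanistan", "Åland Islands", "Albania"],
    "Region": ["Asia", None, "Europe"],
})

# Keep only the rows where scraping returned nothing
missing = df[df["Region"].isnull()]
print(missing["Full Name"].tolist())
```

On the real data, the equivalent would be `iso_df[iso_df['Region'].isnull()]`, which gives a list of names to inspect by hand.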

In [18]:
# It's good practice to save your data so you don't have to scrape it again.
iso_df.to_csv("countries_regions.csv", index=False)

In [20]:
# Check the output
!head -n 2 countries_regions.csv

Name,Code,Full Name,Region
Afghanistan,AF,Afghanistan,Asia
