### LSE Data Analytics Online Career Accelerator


## Web scraping with BeautifulSoup (tutorial video)

Web scraping, also called web harvesting, is frequently used by data analysts to retrieve current data. BeautifulSoup is a popular Python library for parsing the HTML that web scraping retrieves. This Notebook illustrates, step by step, how to scrape a web page with BeautifulSoup, save the extracted data as JSON and CSV files, and load it into a Pandas DataFrame. You will learn how to:
- Perform web scraping with BeautifulSoup.
- Save extracted data as JSON and CSV files.
- Import the extracted data into a Pandas DataFrame for analysis.


In [1]:
# Install libraries.
!pip install requests
!pip install bs4
!pip install lxml

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=021e16c870d3d1ae2daf43c99a5bde223a389186e59fe8e16d60b2f2299f29c8
  Stored in directory: /Users/b23/Library/Caches/pip/wheels/e4/62/1d/d4d1bc4f33350ff84227f89b258edb552d604138e3739f5c83
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


In [2]:
# Import packages.
# Analyse the data.
import pandas as pd

# Get data from the internet.
import requests

# Parse data with BeautifulSoup.
from bs4 import BeautifulSoup

In [3]:
# Specify the URL of the page to scrape.
url = 'https://www.worldometers.info/world-population/population-by-country/'

# Make contact with the website and store the response.
page = requests.get(url)

# View the response; a status code of 200 means the request succeeded.
page

<Response [200]>
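
The request above has no timeout, and its status code is only checked by eye. A minimal sketch of a more defensive fetch follows, assuming a helper named `fetch_html` (hypothetical, not part of this Notebook) and an arbitrarily chosen 10-second timeout:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of a page, raising an exception on HTTP errors.

    The helper name and the default timeout are assumptions for illustration.
    """
    # Give up if the server does not respond within `timeout` seconds.
    response = requests.get(url, timeout=timeout)

    # Raise requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()

    return response.text
```

Calling `fetch_html(url)` would then return the same text as `page.text`, or raise an exception instead of continuing silently with a failed request.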

In [4]:
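# Get the HTML from the response, stopping if the request failed.
if page.status_code == 200:
    html_doc = page.text
else:
    raise RuntimeError(f'Request failed with status code {page.status_code}')

# Parse the HTML and store the result; name the parser explicitly (lxml was installed above).
soup = BeautifulSoup(html_doc, 'lxml')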

# Print the output in a readable format.
print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Population by Country (2023) - Worldometer
  </title>
  <meta content="List of countries and dependencies in the world ranked by population, from the most populated. Growth rate, median age, fertility rate, area, density, population density, urbanization, urban population, share of world population." name="description"/>
  <!-- Favicon -->
  <link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="/favicon/apple-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
  <link href="/favicon/apple-icon-72x72.png" rel="ap

## Extract the table by its ID

In [5]:
# Inspect the page in a browser to determine the table ID (here, 'example2').
# Extract the table with that ID.
table = soup.find('table', attrs={'id': 'example2'})

# View the information in a readable format.
print(table.prettify())

<table cellspacing="0" class="table table-striped table-bordered" id="example2" width="100%">
 <thead>
  <tr>
   <th>
    #
   </th>
   <th>
    Country (or dependency)
   </th>
   <th>
    Population
    <br/>
    (2023)
   </th>
   <th>
    Yearly
    <br/>
    Change
   </th>
   <th>
    Net
    <br/>
    Change
   </th>
   <th>
    Density
    <br/>
    (P/Km²)
   </th>
   <th>
    Land Area
    <br/>
    (Km²)
   </th>
   <th>
    Migrants
    <br/>
    (net)
   </th>
   <th>
    Fert.
    <br/>
    Rate
   </th>
   <th>
    Med.
    <br/>
    Age
   </th>
   <th>
    Urban
    <br/>
    Pop %
   </th>
   <th>
    World
    <br/>
    Share
   </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>
    1
   </td>
   <td style="font-weight: bold; font-size:15px; text-align:left">
    <a href="/world-population/india-population/">
     India
    </a>
   </td>
   <td style="font-weight: bold;">
    1,428,627,663
   </td>
   <td>
    0.81 %
   </td>
   <td>
    11,454,490
   </td>
   <td>
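
`find()` returns `None` when no element matches, which would make a later call such as `prettify()` fail with an `AttributeError`. The sketch below shows that behaviour on a small, made-up HTML snippet (the snippet and the variable names `sample_soup`, `found` and `missing` are assumptions for illustration, not part of the scraped page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page (made up for illustration).
sample_html = """
<html><body>
  <table id="example2">
    <tr><th>#</th><th>Country</th></tr>
    <tr><td>1</td><td>India</td></tr>
  </table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Look up one table that exists and one that does not.
found = sample_soup.find('table', id='example2')
missing = sample_soup.find('table', id='no-such-table')

# Check the result before using it, in case the page layout has changed.
if found is None:
    raise ValueError('Table not found; the page layout may have changed.')

print(found.get('id'))
print(missing)
```

A check like this gives a clearer error message if the website ever renames or removes the table.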
    

In [6]:
# Find all of the rows of the table.
rows = table.find_all('tr')

# View the rows.
rows

[<tr> <th>#</th> <th>Country (or dependency)</th> <th>Population<br/> (2023)</th> <th>Yearly<br/> Change</th> <th>Net<br/> Change</th> <th>Density<br/> (P/Km²)</th> <th>Land Area<br/> (Km²)</th> <th>Migrants<br/> (net)</th> <th>Fert.<br/> Rate</th> <th>Med.<br/> Age</th> <th>Urban<br/> Pop %</th> <th>World<br/> Share</th> </tr>,
 <tr> <td>1</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/india-population/">India</a></td> <td style="font-weight: bold;">1,428,627,663</td> <td>0.81 %</td> <td>11,454,490</td> <td>481</td> <td>2,973,190</td> <td>-486,136</td> <td>2.0</td> <td>28</td> <td>36 %</td> <td>17.76 %</td> </tr>,
 <tr> <td>2</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/china-population/">China</a></td> <td style="font-weight: bold;">1,425,671,352</td> <td>-0.02 %</td> <td>-215,985</td> <td>152</td> <td>9,388,211</td> <td>-310,220</td> <td>1.2</td> <td>39</td> <td>65 %</td> <td>17.72 %</t
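
The first item in `rows` is the header row, whose `<th>` cells hold the column labels. As a possible alternative to typing the labels by hand, they could be read from those cells. The sketch below uses a shortened copy of the header markup shown above; the names `header_html`, `header_soup` and `header_names` are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# A shortened copy of the header row printed above.
header_html = """
<table id="example2">
  <thead><tr>
    <th>#</th>
    <th>Country (or dependency)</th>
    <th>Population<br/> (2023)</th>
    <th>Yearly<br/> Change</th>
  </tr></thead>
</table>
"""

header_soup = BeautifulSoup(header_html, 'html.parser')

# get_text(' ', strip=True) trims each text fragment and joins the
# fragments with a single space, so the <br/> tags do not glue words together.
header_names = [th.get_text(' ', strip=True)
                for th in header_soup.find_all('th')]

print(header_names)
```

Reading the labels from the page keeps them in step with whatever the site currently displays, at the cost of depending on the site's wording.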

In [8]:
# Store the extracted data.
# Create an empty list.
output = []

# Specify the column names.
column_names = ['ID', 'Country (or dependency)', 'Population (2020)',
                'Yearly Change', 'Net Change', 'Density (P/Km2)',
                'Land Area (Km2)', 'Migrants (net)', 'Fert. Rate',
                'Med. Age', 'Urbn Pop', 'World Share']

# Loop over the rows.
# Extract the text within each <td> cell (the header row has none, so it is skipped).
# Pair each column name with its cell value using zip() and store the row as a dictionary.
for country in rows:
    country_data = country.find_all('td')
    if country_data:
        country_text = [td.text for td in country_data]
        output.append(dict(zip(column_names, country_text)))
        
# View the result.
output

[{'ID': '1',
  'Country (or dependency)': 'India',
  'Population (2020)': '1,428,627,663',
  'Yearly Change': '0.81 %',
  'Net Change': '11,454,490',
  'Density (P/Km2)': '481',
  'Land Area (Km2)': '2,973,190',
  'Migrants (net)': '-486,136',
  'Fert. Rate': '2.0',
  'Med. Age': '28',
  'Urbn Pop': '36 %',
  'World Share': '17.76 %'},
 {'ID': '2',
  'Country (or dependency)': 'China',
  'Population (2020)': '1,425,671,352',
  'Yearly Change': '-0.02 %',
  'Net Change': '-215,985',
  'Density (P/Km2)': '152',
  'Land Area (Km2)': '9,388,211',
  'Migrants (net)': '-310,220',
  'Fert. Rate': '1.2',
  'Med. Age': '39',
  'Urbn Pop': '65 %',
  'World Share': '17.72 %'},
 {'ID': '3',
  'Country (or dependency)': 'United States',
  'Population (2020)': '339,996,563',
  'Yearly Change': '0.50 %',
  'Net Change': '1,706,706',
  'Density (P/Km2)': '37',
  'Land Area (Km2)': '9,147,420',
  'Migrants (net)': '999,700',
  'Fert. Rate': '1.7',
  'Med. Age': '38',
  'Urbn Pop': '83 %',
  'World Shar
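
Every value in `output` is a string, including figures such as `'1,428,627,663'` and `'0.81 %'`. One way to make them usable in calculations is sketched below, assuming a hypothetical helper named `to_number` that is not part of this Notebook:

```python
def to_number(text):
    """Convert strings such as '1,428,627,663' or '0.81 %' to int or float.

    Percentages stay in percent units (0.81, not 0.0081).
    """
    # Drop thousands separators, percent signs and surrounding spaces.
    cleaned = text.replace(',', '').replace('%', '').strip()
    try:
        return int(cleaned)
    except ValueError:
        try:
            return float(cleaned)
        except ValueError:
            # Leave non-numeric text (for example a country name) unchanged.
            return text


# A shortened row in the same shape as the items in `output`.
sample_row = {'Country (or dependency)': 'India',
              'Population (2020)': '1,428,627,663',
              'Yearly Change': '0.81 %',
              'Migrants (net)': '-486,136'}

cleaned_row = {key: to_number(value) for key, value in sample_row.items()}
print(cleaned_row)
```

The same helper could be applied inside the loop, for example with `to_number(td.text)`, if numeric values are preferred from the start.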

## Create a CSV or JSON file and import into Pandas

In [9]:
# Create a DataFrame directly from the output.
data = pd.DataFrame(output)

# View the DataFrame.
data.head()

|  | ID | Country (or dependency) | Population (2020) | Yearly Change | Net Change | Density (P/Km2) | Land Area (Km2) | Migrants (net) | Fert. Rate | Med. Age | Urbn Pop | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | India | 1428627663 | 0.81 % | 11454490 | 481 | 2973190 | -486136 | 2.0 | 28 | 36 % | 17.76 % |
| 1 | 2 | China | 1425671352 | -0.02 % | -215985 | 152 | 9388211 | -310220 | 1.2 | 39 | 65 % | 17.72 % |
| 2 | 3 | United States | 339996563 | 0.50 % | 1706706 | 37 | 9147420 | 999700 | 1.7 | 38 | 83 % | 4.23 % |
| 3 | 4 | Indonesia | 277534122 | 0.74 % | 2032783 | 153 | 1811570 | -49997 | 2.1 | 30 | 59 % | 3.45 % |
| 4 | 5 | Pakistan | 240485658 | 1.98 % | 4660796 | 312 | 770880 | -165988 | 3.3 | 21 | 35 % | 2.99 % |


In [10]:
# Save the DataFrame as a CSV file without index.
data.to_csv('countries.csv', index=False)
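
For comparison, a list of dictionaries can also be written as CSV without Pandas, using the standard library's `csv.DictWriter`. The sketch below writes to an in-memory buffer rather than a file, and `sample_output` is a shortened, made-up stand-in for `output`:

```python
import csv
import io

# A shortened stand-in for the scraped `output` list.
sample_output = [
    {'ID': '1', 'Country (or dependency)': 'India'},
    {'ID': '2', 'Country (or dependency)': 'China'},
]

# Write to an in-memory buffer so that no file on disk is touched.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(sample_output[0].keys()))

# The header row comes from the dictionary keys, followed by one line per dictionary.
writer.writeheader()
writer.writerows(sample_output)

csv_text = buffer.getvalue()
print(csv_text)
```

The Pandas route is shorter when a DataFrame already exists; the `csv` module may be handy when the data is only being passed through.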

In [11]:
# Import the json module.
import json

# Serialise the output as a JSON string.
output_json = json.dumps(output)

# View the output.
output_json

'[{"ID": "1", "Country (or dependency)": "India", "Population (2020)": "1,428,627,663", "Yearly Change": "0.81 %", "Net Change": "11,454,490", "Density (P/Km2)": "481", "Land Area (Km2)": "2,973,190", "Migrants (net)": "-486,136", "Fert. Rate": "2.0", "Med. Age": "28", "Urbn Pop": "36 %", "World Share": "17.76 %"}, {"ID": "2", "Country (or dependency)": "China", "Population (2020)": "1,425,671,352", "Yearly Change": "-0.02 %", "Net Change": "-215,985", "Density (P/Km2)": "152", "Land Area (Km2)": "9,388,211", "Migrants (net)": "-310,220", "Fert. Rate": "1.2", "Med. Age": "39", "Urbn Pop": "65 %", "World Share": "17.72 %"}, {"ID": "3", "Country (or dependency)": "United States", "Population (2020)": "339,996,563", "Yearly Change": "0.50 %", "Net Change": "1,706,706", "Density (P/Km2)": "37", "Land Area (Km2)": "9,147,420", "Migrants (net)": "999,700", "Fert. Rate": "1.7", "Med. Age": "38", "Urbn Pop": "83 %", "World Share": "4.23 %"}, {"ID": "4", "Country (or dependency)": "Indonesia", 

In [12]:
# Save the output to a JSON file.
with open('countries.json', 'w') as f:
    json.dump(output, f)
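
A quick way to confirm that a JSON file holds exactly what was written is to load it back and compare. The sketch below does this in a temporary directory so that `countries.json` is left untouched; `sample_output` and the file name `sample_countries.json` are assumptions for illustration:

```python
import json
import os
import tempfile

# A shortened stand-in for the scraped `output` list.
sample_output = [{'ID': '1',
                  'Country (or dependency)': 'India',
                  'Density (P/Km2)': '481'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_countries.json')

    # Write the list to disk; indent=2 makes the file easier to read.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample_output, f, indent=2)

    # Read it back while the temporary directory still exists.
    with open(path, encoding='utf-8') as f:
        reloaded = json.load(f)

# The reloaded data should equal the original list.
round_trip_ok = reloaded == sample_output
print(round_trip_ok)
```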

In [13]:
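# Read the JSON string using Pandas, output to .csv.
# Wrap the string in StringIO; recent Pandas versions deprecate passing literal JSON.
from io import StringIO
pd.read_json(StringIO(output_json)).to_csv('countries.csv', index=False)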

In [14]:
# Import the CSV file with Pandas.
# Alternatively: data = pd.read_json('countries.json')
data = pd.read_csv('countries.csv')

# View the DataFrame.
data.head()

|  | ID | Country (or dependency) | Population (2020) | Yearly Change | Net Change | Density (P/Km2) | Land Area (Km2) | Migrants (net) | Fert. Rate | Med. Age | Urbn Pop | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | India | 1428627663 | 0.81 % | 11454490 | 481 | 2973190 | -486136 | 2.0 | 28.0 | 36 % | 17.76 % |
| 1 | 2 | China | 1425671352 | -0.02 % | -215985 | 152 | 9388211 | -310220 | 1.2 | 39.0 | 65 % | 17.72 % |
| 2 | 3 | United States | 339996563 | 0.50 % | 1706706 | 37 | 9147420 | 999700 | 1.7 | 38.0 | 83 % | 4.23 % |
| 3 | 4 | Indonesia | 277534122 | 0.74 % | 2032783 | 153 | 1811570 | -49997 | 2.1 | 30.0 | 59 % | 3.45 % |
| 4 | 5 | Pakistan | 240485658 | 1.98 % | 4660796 | 312 | 770880 | -165988 | 3.3 | 21.0 | 35 % | 2.99 % |


In [15]:
# Open the JSON file with Pandas.
data = pd.read_json('countries.json')

# View the DataFrame.
data.head()

|  | ID | Country (or dependency) | Population (2020) | Yearly Change | Net Change | Density (P/Km2) | Land Area (Km2) | Migrants (net) | Fert. Rate | Med. Age | Urbn Pop | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | India | 1428627663 | 0.81 % | 11454490 | 481 | 2973190 | -486136 | 2.0 | 28 | 36 % | 17.76 % |
| 1 | 2 | China | 1425671352 | -0.02 % | -215985 | 152 | 9388211 | -310220 | 1.2 | 39 | 65 % | 17.72 % |
| 2 | 3 | United States | 339996563 | 0.50 % | 1706706 | 37 | 9147420 | 999700 | 1.7 | 38 | 83 % | 4.23 % |
| 3 | 4 | Indonesia | 277534122 | 0.74 % | 2032783 | 153 | 1811570 | -49997 | 2.1 | 30 | 59 % | 3.45 % |
| 4 | 5 | Pakistan | 240485658 | 1.98 % | 4660796 | 312 | 770880 | -165988 | 3.3 | 21 | 35 % | 2.99 % |
