# 1) Introduction to Web Scraping

Web scraping is a way to automate the process of collecting data from websites. It's like **sending a robot to a web page**, instructing it to read the page's content, and then asking it to bring back the specific pieces of information you need.

Web scraping is particularly useful when the data you need isn't readily accessible through an API. For instance, the Wikipedia page `List of countries and dependencies by population`, which we'll retrieve on the next screen, doesn't have a corresponding API. So, in this lesson, we'll learn how to extract its data using web scraping techniques.

The `robots.txt` file is a text file that webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. You can usually find this file by appending `/robots.txt` to the base URL of a website. For example, the `robots.txt` file for Wikipedia is at `https://en.wikipedia.org/robots.txt`. Always check this file before scraping a website to ensure you're not violating any rules.
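As a minimal sketch of how such a check could be automated, the snippet below uses Python's standard `urllib.robotparser` module. The `robots.txt` content, the `my-scraper` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and chosen only for illustration; they are not Wikipedia's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data.html"))  # False
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/public/page.html"))   # True
```

To check a live site instead, the same parser can download the file itself: call `set_url()` with the site's `/robots.txt` address, then `read()`, before calling `can_fetch()`.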

In this lesson, we'll use two Python libraries: `requests`, which sends HTTP requests and returns the server's response, and `BeautifulSoup` (from the `bs4` package), which helps us **parse** a web page's HTML content to find the data we need.
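To get a feel for the parsing half of that pair, here is a small sketch that runs BeautifulSoup on a hand-written HTML string rather than a downloaded page. The HTML, its tag contents, and the `intro` class are hypothetical; in practice, the string would come from the `text` attribute of a `requests` response.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (hypothetical, for illustration only)
html = """
<html>
  <body>
    <h1>Population data</h1>
    <p class="intro">Figures are estimates.</p>
    <a href="/page-1">First page</a>
    <a href="/page-2">Second page</a>
  </body>
</html>
"""

# Parse the HTML string with Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() extracts its text
heading = soup.find("h1").get_text(strip=True)
intro = soup.find("p", class_="intro").get_text(strip=True)

# find_all() returns every matching tag; attributes are read like dict keys
links = [a["href"] for a in soup.find_all("a")]

print(heading)
print(intro)
print(links)
```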

# 2) Practical Applications of Web Scraping

Web scraping can be a powerful tool for a variety of applications across different domains:

**Data Journalism**: Reporters often need to analyze large amounts of data to uncover stories. Web scraping allows journalists to collect data from various sources for their investigative work.

**E-commerce**: Retailers and e-commerce companies use web scraping to monitor competitors' prices and product reviews. This information can help them adjust their strategies and improve their products.

**Recruitment**: HR professionals use web scraping to gather data on potential candidates from professional networking sites and job boards.

**Social Media Analysis**: Web scraping can gather data from social media platforms to understand customer sentiment and trends.

**SEO Monitoring**: Digital marketers use web scraping to track website performance, monitor SEO rankings, and gather intelligence on competitors.

**Research**: Academics and researchers use web scraping to collect data for research in fields like linguistics, data science, and sociology.

Let's apply what we've learned to a practical example. We'll continue with our scenario at EcoData Inc., where we've identified valuable data on the [List of countries and dependencies by population Wikipedia page](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population).

> **_Note: For consistency and stability, in this lesson and the ones to come, we will use a dedicated page we’ve hosted on [GitHub](https://dataquestio.github.io/web-scraping-pages/). The page mirrors the official List of countries and dependencies by population Wikipedia page._**

To extract this data, we'll use `requests` to download the page and BeautifulSoup to parse its HTML content and locate the population table. Analyzing demographic trends like these is a common application of web scraping in data science.

The code below finds the main table and prints the text from each column in each row. Each data row includes the location (`country`/`dependency`), `population`, `% of world population`, `date`, `source`, and `explanatory notes`. The code is followed by its output for the first few rows:

In [19]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the URL of the webpage
response = requests.get('https://dataquestio.github.io/web-scraping-pages/')

# Parse the content of the response
soup = BeautifulSoup(response.text, 'html.parser')

# Find the main table using the class attribute
table = soup.find('table', {'class': 'wikitable'})

# Find all rows in the table
rows = table.find_all('tr')

# Loop through each row
for row in rows:
    # Find all columns in each row
    cols = row.find_all('td')
    # Get the text from each column
    cols = [col.text.strip() for col in cols]
    # Print the columns
    print(cols)  

[]
['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', '']
['India', '1,417,492,000', '17.3%', '1 Jul 2025', 'Official projection[4]', '[b]']
['China', '1,408,280,000', '17.2%', '31 Dec 2024', 'Official estimate[5]', '[c]']
['United States', '340,110,988', '4.1%', '1 Jul 2024', 'Official estimate[6]', '[d]']
['Indonesia', '284,438,782', '3.5%', '30 Jun 2025', 'National annual projection[7]', '']
['Pakistan', '241,499,431', '2.9%', '1 Mar 2023', '2023 census result[8]', '[e]']
['Nigeria', '223,800,000', '2.7%', '1 Jul 2023', 'Official projection[9]', '']
['Brazil', '213,421,037', '2.6%', '1 Jul 2025', 'Official estimate[10]', '']
['Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[11]', '[f]']
['Russia', '146,028,325', '1.8%', '1 Jan 2025', 'Official estimate[13]', '[g]']
['Mexico', '130,575,786', '1.6%', '30 Jun 2025', 'National quarterly estimate[14]', '']
['Japan', '123,300,000', '1.5%', '1 Aug 2025', 'Monthly national estimate[15]', '']
['Ph
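Notice that the first list printed above is empty. The most likely reason is that the table's header row is built from `<th>` (table header) cells rather than `<td>` cells, so `row.find_all('td')` finds nothing in it. The sketch below reproduces this on a miniature table; the HTML and its column names are hypothetical stand-ins, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table with a <th> header row, for illustration only
html = """
<table class="wikitable">
  <tr><th>Location</th><th>Population</th></tr>
  <tr><td>India</td><td>1,417,492,000</td></tr>
  <tr><td>China</td><td>1,408,280,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# Same approach as the lesson code: collect the <td> text of every row
all_rows = []
for row in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    all_rows.append(cells)

# The header row has only <th> cells, so its list of <td> cells is empty
print(all_rows[0])  # []

# Read the header text separately and keep only the rows that contain data
headers = [th.get_text(strip=True) for th in table.find_all("th")]
data_rows = [cells for cells in all_rows if cells]

print(headers)
print(data_rows)
```

Filtering out empty lists is one simple way to drop the header row; whether to do so depends on how the data will be used afterwards.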

## Instructions

In this exercise, we'll extract the data from the page and store it in a more structured format, which will make it easier to analyze later. We'll also make the function handle potential errors, such as an unsuccessful HTTP request or a missing table.

1. Write a function named `extract_data` that takes a URL as an argument and returns a list of lists containing the data from the main table on the page. Each inner list should contain the country/dependency, population, % of world population, date, source, and explanatory notes for a single row of the table.

1. Test the function using the URL `https://dataquestio.github.io/web-scraping-pages/` and assign the result to a variable named `population_data`.

1. Print the first five lists in `population_data` to check if the data was extracted correctly.

In [24]:
import requests
from bs4 import BeautifulSoup

def extract_data(url):
    # Send an HTTP request to the URL of the webpage
    response = requests.get(url)
    # Raise requests.HTTPError if the request was unsuccessful (4xx or 5xx status)
    response.raise_for_status()
    # Parse the content of the response
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the main table using the class attribute
    table = soup.find('table', {'class': 'wikitable'})
    # Return an empty list if the page has no matching table
    if table is None:
        return []
    # Find all rows in the table; in HTML, table rows are defined with the <tr> tag
    rows = table.find_all('tr')

    data = []
    # Loop through each row
    for row in rows:
        # Find all columns in each row; <td> is short for "table data"
        cols = row.find_all('td')
        # Get the text from each column
        cols = [col.text.strip() for col in cols]
        data.append(cols)

    return data

population_data = extract_data('https://dataquestio.github.io/web-scraping-pages/')

print(population_data[:5])

[[], ['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', ''], ['India', '1,417,492,000', '17.3%', '1 Jul 2025', 'Official projection[4]', '[b]'], ['China', '1,408,280,000', '17.2%', '31 Dec 2024', 'Official estimate[5]', '[c]'], ['United States', '340,110,988', '4.1%', '1 Jul 2024', 'Official estimate[6]', '[d]']]
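The values above are still plain strings, so a further cleaning step would be needed before any numeric analysis. The following is a rough sketch of one way to do that, using two rows copied from the output above. The `FIELDS` names and the `to_records` helper are assumptions introduced here for illustration (the column names are inferred from the values, not read from the page), and real rows may contain values this simple cleanup does not handle.

```python
import re

# Sample rows copied from the output above; the leading empty list is the header row
sample_rows = [
    [],
    ['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', ''],
    ['India', '1,417,492,000', '17.3%', '1 Jul 2025', 'Official projection[4]', '[b]'],
]

# Assumed column names, inferred from the values (hypothetical, not taken from the page)
FIELDS = ['location', 'population', 'pct_of_world', 'date', 'source', 'notes']


def to_records(rows):
    """Convert raw row lists into dictionaries with numeric fields converted."""
    records = []
    for row in rows:
        # Skip the empty header row and any row with an unexpected length
        if len(row) != len(FIELDS):
            continue
        record = dict(zip(FIELDS, row))
        # '1,417,492,000' -> 1417492000
        record['population'] = int(record['population'].replace(',', ''))
        # '17.3%' -> 17.3
        record['pct_of_world'] = float(record['pct_of_world'].rstrip('%'))
        # Remove numeric footnote markers such as '[4]' from the source text
        record['source'] = re.sub(r'\[\d+\]', '', record['source']).strip()
        records.append(record)
    return records


records = to_records(sample_rows)
for record in records:
    print(record)
```

A list of dictionaries like this can be passed directly to tools such as `csv.DictWriter` or a pandas `DataFrame` if the data needs to be saved or analyzed further.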
