# 1) Introduction to Web Scraping

Web scraping is a way to automate the process of collecting data from websites. It's like **sending a robot to a web page**, instructing it to read the page's content, and then asking it to bring back the specific pieces of information you need.

Web scraping is particularly useful when the data you need is not readily accessible through an API. For instance, as we'll see on the next screen, we'll retrieve data from the `List of countries and dependencies by population` Wikipedia page, which doesn't have a corresponding API. So, in this lesson, we'll learn how to extract this data using web scraping techniques.

The `robots.txt` file is a text file that webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. You can usually find this file by appending `/robots.txt` to the base URL of a website. For example, the `robots.txt` file for Wikipedia is at `https://en.wikipedia.org/robots.txt`. Always check this file before scraping a website to ensure you're not violating any rules.
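As a rough illustration of that check, the sketch below uses Python's standard `urllib.robotparser` module to test whether a URL may be fetched. The rules in `SAMPLE_ROBOTS_TXT`, the `my-scraper` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for this example so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no HTTP request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/data/page.html"))       # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/report.html"))  # False
```

Against a live site, you would typically call `parser.set_url()` with the site's `/robots.txt` address and then `parser.read()` instead of passing the rules in as a string.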

In this lesson, we'll use two Python libraries: `requests`, which sends HTTP requests, and `bs4` (BeautifulSoup), which **parses** a web page's HTML content so we can find the data we need.

# 2) Practical Applications of Web Scraping

Web scraping can be a powerful tool for a variety of applications across different domains:

**Data Journalism**: Reporters often need to analyze large amounts of data to uncover stories. Web scraping allows journalists to collect data from various sources for their investigative work.

**E-commerce**: Retailers and e-commerce companies use web scraping to monitor competitors' prices and product reviews. This information can help them adjust their strategies and improve their products.

**Recruitment**: HR professionals use web scraping to gather data on potential candidates from professional networking sites and job boards.

**Social Media Analysis**: Web scraping can gather data from social media platforms to understand customer sentiment and trends.

**SEO Monitoring**: Digital marketers use web scraping to track website performance, monitor SEO rankings, and gather intelligence on competitors.

**Research**: Academics and researchers use web scraping to collect data for research in fields like linguistics, data science, and sociology.

Let's apply what we've learned to a practical example. We'll continue with our scenario at EcoData Inc., where we've identified valuable data on the [List of countries and dependencies by population Wikipedia page](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population).

> **_Note: For consistency and stability, in this lesson and the ones to come, we will use a dedicated page we’ve hosted on [GitHub](https://dataquestio.github.io/web-scraping-pages/). The page mirrors the official List of countries and dependencies by population Wikipedia page._**

To extract this data, we'll use `requests` to download the page and BeautifulSoup to parse its HTML, navigate it, and pull out the population table. Collecting data like this lets us analyze demographic trends, which is a common application of web scraping in data science.

Here's a snippet of code that extracts the main table from the page and prints the cells of each row: the location, population, % of world population, date, source, and explanatory notes. The first list printed is empty because the header row uses `th` cells rather than `td` cells.




In [19]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the URL of the webpage
response = requests.get('https://dataquestio.github.io/web-scraping-pages/')

# Parse the content of the request
soup = BeautifulSoup(response.text, 'html.parser')

# Find the main table using the class attribute
table = soup.find('table', {'class': 'wikitable'})

# Find all rows in the table
rows = table.find_all('tr')

# Loop through each row
for row in rows:
    # Find all columns in each row
    cols = row.find_all('td')
    # Get the text from each column
    cols = [col.text.strip() for col in cols]
    # Print the columns
    print(cols)  

[]
['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', '']
['India', '1,417,492,000', '17.3%', '1 Jul 2025', 'Official projection[4]', '[b]']
['China', '1,408,280,000', '17.2%', '31 Dec 2024', 'Official estimate[5]', '[c]']
['United States', '340,110,988', '4.1%', '1 Jul 2024', 'Official estimate[6]', '[d]']
['Indonesia', '284,438,782', '3.5%', '30 Jun 2025', 'National annual projection[7]', '']
['Pakistan', '241,499,431', '2.9%', '1 Mar 2023', '2023 census result[8]', '[e]']
['Nigeria', '223,800,000', '2.7%', '1 Jul 2023', 'Official projection[9]', '']
['Brazil', '213,421,037', '2.6%', '1 Jul 2025', 'Official estimate[10]', '']
['Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[11]', '[f]']
['Russia', '146,028,325', '1.8%', '1 Jan 2025', 'Official estimate[13]', '[g]']
['Mexico', '130,575,786', '1.6%', '30 Jun 2025', 'National quarterly estimate[14]', '']
['Japan', '123,300,000', '1.5%', '1 Aug 2025', 'Monthly national estimate[15]', '']
['Ph

## Instructions

In this exercise, we'll extract the data from the Wikipedia page and store it in a more structured format. This will allow us to analyze the data more easily in the future. Additionally, we'll modify the function to handle potential errors, such as a missing table or an unsuccessful HTTP request.

1. Write a function named `extract_data` that takes a URL as an argument and returns a list of lists containing the data from the main table on the page. Each inner list should contain the cells of a single row of the table: location, population, % of world population, date, source, and explanatory notes.

1. Test the function using the URL `https://dataquestio.github.io/web-scraping-pages/` and assign the result to a variable named `population_data`.

1. Print the first five lists in `population_data` to check if the data was extracted correctly.

In [24]:
import requests
from bs4 import BeautifulSoup

def extract_data(url):
    # Send an HTTP request to the URL of the webpage
    response = requests.get(url)
    # Parse the content of the request
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the main table using the class attribute
    table = soup.find('table', {'class': 'wikitable'})
    # Find all rows in the table (in HTML, table rows are defined with the <tr> tag)
    rows = table.find_all('tr')

    data = []
    # Loop through each row
    for row in rows:
        # Find all columns in each row ("td" is short for "table data")
        cols = row.find_all('td')
        # Get the text from each column
        cols = [col.text.strip() for col in cols]
        data.append(cols)

    return data

population_data = extract_data('https://dataquestio.github.io/web-scraping-pages/')

   
print(population_data[:5])


[[], ['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', ''], ['India', '1,417,492,000', '17.3%', '1 Jul 2025', 'Official projection[4]', '[b]'], ['China', '1,408,280,000', '17.2%', '31 Dec 2024', 'Official estimate[5]', '[c]'], ['United States', '340,110,988', '4.1%', '1 Jul 2024', 'Official estimate[6]', '[d]']]
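
The exercise introduction mentions handling a missing table, and the output above starts with an empty list for the header row. The sketch below is one possible way to deal with both. The `extract_rows` helper and the hand-written `SAMPLE_HTML` are hypothetical and not part of the lesson; the function takes an HTML string rather than a URL so it can be tried without a network connection.

```python
from bs4 import BeautifulSoup

def extract_rows(html, table_class="wikitable"):
    """Return the non-empty rows of the first table with the given class.

    Returns an empty list when no such table exists.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": table_class})
    if table is None:
        # find() returns None when nothing matches, so guard before using it
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # header rows contain only <th> cells, so they come back empty
            rows.append(cells)
    return rows

# Small hand-written table standing in for the real page
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Location</th><th>Population</th></tr>
  <tr><td>India</td><td>1,417,492,000</td></tr>
  <tr><td>China</td><td>1,408,280,000</td></tr>
</table>
"""

print(extract_rows(SAMPLE_HTML))             # [['India', '1,417,492,000'], ['China', '1,408,280,000']]
print(extract_rows("<p>No table here</p>"))  # []
```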


# 3) Extracting Data from Web Pages

Now, let's explore the process of extracting data from web pages using Python's `requests` library and `BeautifulSoup`.

Web scraping essentially involves two main steps: sending a request to the webpage and parsing its content.

1. Sending a Request: This is like traveling to an archaeological site: you need to reach the location before you can start your work. In web scraping, this means sending an HTTP request to the URL of the webpage you want to access. The server responds by returning the HTML content of the webpage.

1. Parsing the Content: Once you've reached the site, you need to dig carefully to find the artifacts (in our case, the data). This corresponds to parsing the HTML content of the webpage to find the information you need. The Python library BeautifulSoup is designed for this purpose. It builds a parse tree from the HTML content that can be used to extract data in a hierarchical and more readable manner.

```Python
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the URL of the webpage
response = requests.get('https://dataquestio.github.io/web-scraping-pages/')

# Parse the content of the request
soup = BeautifulSoup(response.text, 'html.parser')

# Find the main table using the class attribute
table = soup.find('table', {'class': 'wikitable'})

# Find all rows in the table
rows = table.find_all('tr')

# Loop through each row
for row in rows:
    # Find all columns in each row
    cols = row.find_all('td')
    # Get the text from each column
    cols = [col.text.strip() for col in cols]
    # Print the columns
    print(cols)

```

In this code, we follow these steps:

1. First, we send an HTTP request to the URL of the webpage using `requests.get()`. The server responds to the request and returns the HTML content of the webpage, which we store in the `response` object.

1. Next, we use `BeautifulSoup` to parse the HTML content of the webpage. This creates a `BeautifulSoup` object (`soup`) that represents the document as a **nested data structure**. We can now use this object to extract data from the webpage in a more readable and hierarchical manner.

1. To find the main table on the webpage, we use the `find()` method, which returns the first matching element. We pass in the HTML element we're looking for (`'table'`) and a **dictionary** that describes the attribute(s) the element should have (`{'class': 'wikitable'}`).

1. Then, we use the `find_all()` method to find all row elements (`'tr'`) in the table (**Table Rows**). This method returns a ResultSet object containing all the matching elements.

1. Next, we loop through each row and find all column elements (`'td'`) in each row (**Table data**). We use a list comprehension to get the text from each column and strip any extra whitespace.

1. Finally, we print the columns, which gives us the data from each column in each row of the main table. A header row that contains only `th` cells prints as an empty list.

## Handling Different Data Types While Web Scraping

Web pages can contain various types of data, such as text, numbers, dates, and even images and videos. When we scrape data from a web page, we need to be aware of the data types we are dealing with.

* **Text**: Text is the most common form of data you'll extract from a web page. In **BeautifulSoup**, you can use the `.text` attribute or the `.get_text()` method to extract all the text inside a tag, including the text of nested tags. The `.string` attribute returns the text only when the tag contains a single piece of text; if the tag contains more than one child, it returns `None`.

* **Numbers**: Numbers are usually **represented as text** in HTML. After extracting the text and removing formatting such as thousands separators, you can convert it to an integer or float using the `int()` or `float()` functions.

* **Dates**: Dates can be tricky because they can be in different formats. You might need to use the `datetime` module to parse and format dates.

* **Images and Videos**: To scrape images or videos, you typically extract the URL of the image or video file rather than the content itself. In BeautifulSoup, you can get the `src` attribute of an `img` or `video` tag to **get the URL**.

Remember, it's important to inspect the HTML content of the web page to understand the structure and data types before you start scraping.
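
To make the difference between these accessors concrete, here is a small sketch on a hand-written HTML fragment. The fragment, its `id` values, and the `example.com` image URL are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment, invented for this example
SAMPLE_HTML = """
<div>
  <p id="plain">Population data</p>
  <p id="mixed">Updated <b>daily</b></p>
  <img src="https://example.com/flag.png" alt="Flag">
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
plain = soup.find("p", {"id": "plain"})
mixed = soup.find("p", {"id": "mixed"})
image = soup.find("img")

print(plain.text)          # Population data
print(plain.string)        # Population data (the tag has a single text child)
print(mixed.get_text())    # Updated daily
print(mixed.string)        # None (the tag has more than one child)
print(image["src"])        # https://example.com/flag.png
print(image.get("width"))  # None (.get() avoids a KeyError for a missing attribute)
```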

# 4) Handling Errors and Exceptions in Web Scraping

In Python, exceptions are errors detected while a program is running. They often arise from unexpected situations, like trying to open a non-existent file or access a web page that is currently down. When such an error occurs, Python raises an exception, and if it is not handled, it will **terminate the program**.

To prevent our web scraping program from crashing mid-execution, it's important to anticipate the types of errors that could occur and handle them appropriately. This is where **error handling** comes in.

There are two types of errors we typically encounter when web scraping:

1. **Connection Errors**: These occur when there is a network problem, like a DNS failure (when the domain name cannot be converted into its corresponding IP address) or a refused connection (when the server refuses to respond). A common exception for this is `requests.exceptions.ConnectionError`. Like every exception that `requests` raises, it is a subclass of `requests.exceptions.RequestException`.

1. **HTTP Errors**: These occur when an HTTP request returns an unsuccessful status code. For example, a 404 Not Found error means that the requested resource could not be found on the server, and a 500 Internal Server Error means that the server encountered an unexpected condition. `requests` does not raise an exception for these on its own; calling `response.raise_for_status()` raises `requests.exceptions.HTTPError`.

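To handle these exceptions, we use a `try/except` block. Here's how it works:

```Python
import requests
from requests.exceptions import RequestException, HTTPError

try:
    response = requests.get('https://www.dataquest.io')
    response.raise_for_status()  # Raise an HTTPError if the status is 4xx or 5xx
except HTTPError as e:
    # HTTPError is a subclass of RequestException, so it must be caught first
    print(f"HTTP error occurred: {e}")
except RequestException as e:
    # Any other requests failure: connection error, timeout, invalid URL, etc.
    print(f"There was an issue with your request: {e}")
```

In the above code, we first try to send a GET request to `https://www.dataquest.io`. Then, we call `response.raise_for_status()`, which raises an `HTTPError` if the status code is 4xx or 5xx. Because `HTTPError` is a subclass of `RequestException`, its `except` clause has to come first; otherwise the more general `RequestException` clause would catch it and the `HTTPError` clause would never run. The `RequestException` clause then covers every other failure, such as a connection error or a timeout.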

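By handling these exceptions, our program can report the problem and carry on (for example, by moving on to the next URL) instead of crashing. Note that when the request itself fails, `response` is never assigned, so any code that uses it should run only after a successful request.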
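
The relationship between the two exception classes can be checked without any network access. In the sketch below, `describe_failure` is a hypothetical helper, and the 404 response is built by hand from `requests.Response` purely for demonstration; a real response would come from `requests.get()`.

```python
import requests
from requests.exceptions import HTTPError, RequestException

# HTTPError derives from RequestException, which is why it must be caught first
print(issubclass(HTTPError, RequestException))  # True

def describe_failure(response):
    """Return a message describing a failed response, or None if it succeeded."""
    try:
        response.raise_for_status()
    except HTTPError as e:
        return f"HTTP error occurred: {e}"
    except RequestException as e:
        return f"There was an issue with your request: {e}"
    return None

# Build a fake 404 response by hand so the example runs offline
fake_response = requests.Response()
fake_response.status_code = 404
fake_response.reason = "Not Found"
fake_response.url = "https://example.com/missing"

print(describe_failure(fake_response))
```

Swapping the two `except` clauses in `describe_failure` would make the second message appear instead, because the general clause would match first.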

## Instructions

In this exercise, you'll modify the code we've been building to extract data from the Wikipedia page and add exception handling for the `requests.get()` method. You'll handle the `RequestException` and `HTTPError` exceptions. This will ensure your web scraping code can handle potential errors and continue running even if a single request encounters an issue. We have imported all the necessary libraries.

1. Write a `try/except` block. In the `try` block, send an HTTP request to the URL (`https://dataquestio.github.io/web-scraping-pages/`) of the Wikipedia page and store the response in a variable. Then, call the `raise_for_status()` method on the response.

1. In the first `except` block, catch an `HTTPError` and print a message that includes the exception. In a second `except` block, catch a `RequestException` and print a different message that includes this exception. `HTTPError` must come first because it is a subclass of `RequestException`.

1. Outside the `try/except` block, parse the content of the response using BeautifulSoup and store the resulting object in a variable.

1. Find the main table on the webpage using the `find()` method. The table can be identified by its class attribute `wikitable`.

1. Find all rows in the table using the `find_all()` method and store them in a variable called `rows`.

1. Skip the first two rows and select the next `20` rows from `rows` (`rows[2:22]`). Assign the result to a variable called `top_20_countries`.

1. Loop through each row, find all columns in each row, get the text from each column, and print the columns. To get the text from each column, use a list comprehension with the `text` attribute and the `strip()` method.

In [1]:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException, HTTPError

try:
    response = requests.get('https://dataquestio.github.io/web-scraping-pages/')
    response.raise_for_status()

# HTTPError is a subclass of RequestException, so catch it first
except HTTPError as e:
    print(f"HTTP error occurred: {e}")

# Any other requests failure (connection error, timeout, etc.)
except RequestException as err:
    print(f"There was an issue with your request: {err}")

# Parse the content of the response
soup = BeautifulSoup(response.text, 'html.parser')

# Find the main table
table = soup.find('table', {'class': 'wikitable'})

# Find all rows (<tr> elements) in the table
rows = table.find_all('tr')

# Skip the first two rows and select the next 20
top_20_countries = rows[2:22]

# Loop through the selected rows
for row in top_20_countries:
    # Find all columns in each row
    cols = row.find_all('td')
    # Get the stripped text of each column
    cols = [col.text.strip() for col in cols]
    print(cols)
    

    

['India', '1,417,492,000', '17.3%', '1 Jul 2025', 'Official projection[4]', '[b]']
['China', '1,408,280,000', '17.2%', '31 Dec 2024', 'Official estimate[5]', '[c]']
['United States', '340,110,988', '4.1%', '1 Jul 2024', 'Official estimate[6]', '[d]']
['Indonesia', '284,438,782', '3.5%', '30 Jun 2025', 'National annual projection[7]', '']
['Pakistan', '241,499,431', '2.9%', '1 Mar 2023', '2023 census result[8]', '[e]']
['Nigeria', '223,800,000', '2.7%', '1 Jul 2023', 'Official projection[9]', '']
['Brazil', '213,421,037', '2.6%', '1 Jul 2025', 'Official estimate[10]', '']
['Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[11]', '[f]']
['Russia', '146,028,325', '1.8%', '1 Jan 2025', 'Official estimate[13]', '[g]']
['Mexico', '130,575,786', '1.6%', '30 Jun 2025', 'National quarterly estimate[14]', '']
['Japan', '123,300,000', '1.5%', '1 Aug 2025', 'Monthly national estimate[15]', '']
['Philippines', '114,123,600', '1.4%', '1 Jul 2025', 'Official projection[16]', '']


# 5) Understanding HTML Elements, IDs, and Classes

Webpages are built using HTML (HyperText Markup Language), a markup language that structures content on the web. It uses tags to create elements such as headings, paragraphs, links, images, tables, etc.

HTML tags typically come in pairs: an opening tag and a closing tag. The content goes between these tags. For example, `<p>This is a paragraph.</p>`. Here, `<p>` is the opening tag, `This is a paragraph.` is the content, and `</p>` is the closing tag. This whole line of code is known as a **paragraph element**.

Most HTML elements have attributes. Attributes provide additional information about the element. The most common attributes are `class` and `id`. These labels help us find exactly the content we need among all the others.

* A `class` is an attribute that specifies one or more class names for an element. The class attribute is mainly used to point to a class in a style sheet. However, it can also be used by JavaScript to access and manipulate elements with the specific class name. In BeautifulSoup, we can find elements with a specific class name using a dictionary, as we saw in the previous screen's code snippet.

* An `id` is a unique identifier. Each ID can only be used once within a webpage, so it specifies a single, unique element. You can think of an ID as a unique label on one book on a bookshelf.

In BeautifulSoup, we can find elements with a specific id using a dictionary with the id as the key, like so:

```Python
element = soup.find('div', {'id': 'unique_id'})
```
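
As a small illustration of both lookups, the sketch below runs `find()` and `find_all()` on a hand-written fragment. The `summary` id and the `box` and `highlighted` class names are invented for this example.

```python
from bs4 import BeautifulSoup

# Hand-written fragment, invented for this example
SAMPLE_HTML = """
<div id="summary" class="box highlighted">World population summary</div>
<div class="box">Another box</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look up a single element by its id
by_id = soup.find("div", {"id": "summary"})
# Look up every element that carries the class 'box'
by_class = soup.find_all("div", {"class": "box"})

print(by_id.text)      # World population summary
print(len(by_class))   # 2 (an element matches if 'box' is any one of its classes)
print(by_id["class"])  # ['box', 'highlighted']
print(soup.find("div", {"id": "missing"}))  # None
```

The same class matching is what lets `{'class': 'wikitable'}` find a table whose `class` attribute lists several names.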


## Instructions

1. Send an HTTP request to the Wikipedia page's URL(`"https://dataquestio.github.io/web-scraping-pages/"`) and store the response in a variable. Handle any potential exceptions as you learned on the previous screen.

1. Parse the content of the response using BeautifulSoup and store the resulting object in a variable.

1. Find the main table on the webpage using the `find()` method and store the result in a variable called `table`. The table can be identified by its class attribute `wikitable`. Print the table to the console.

1. Examine the printed table and identify the HTML elements used in the main table. Write down the tag names of these elements.

1. Identify the classes used in the main table. Write down the class names.

1. Identify the IDs used in the main table. Write down the ID names.

In [3]:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException, HTTPError

try: 
    response = requests.get('https://dataquestio.github.io/web-scraping-pages/')
    response.raise_for_status()

# HTTPError is a subclass of RequestException, so catch it first
except HTTPError as e:
    print(e)

# Any other requests failure (connection error, timeout, etc.)
except RequestException as err:
    print(err)
    
#parse the content to BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

#find the main table
table = soup.find('table', {'class': 'wikitable'})

print(table.prettify())

<table class="wikitable sortable mw-datatable sort-under static-row-numbers sticky-header col1left col5left" style="text-align:right">
 <caption>
  List of countries and territories by total population
 </caption>
 <tbody>
  <tr>
   <th>
    Location
   </th>
   <th>
    Population
   </th>
   <th style="width:2em">
    % of
    <br/>
    world
   </th>
   <th>
    Date
   </th>
   <th>
    <span class="nowrap">
     Source (official or from
    </span>
    <br/>
    the
    <a href="/wiki/United_Nations" title="United Nations">
     United Nations
    </a>
    )
   </th>
   <th class="unsortable">
    Notes
   </th>
  </tr>
  <tr class="static-row-numbers-norank">
   <td>
    <b>
     <span class="flagicon" style="padding-left:25px;">
     </span>
     World
    </b>
   </td>
   <td>
    8,232,000,000
   </td>
   <td>
    <div class="center">
     100%
    </div>
   </td>
   <td>
    <span data-sort-value="000000002025-06-13-0000" style="white-space:nowrap">
     13 Jun 2025
    </spa

**HTML Elements:**

* These are the tags that make up the structure of a web page.

* Each HTML element has a specific name and defines a type of content.

* Examples: `p`, `div`, `span`, `table`, `tr`, `td`, etc.

**Classes:**

* These are attributes that can be applied to HTML elements to group or identify them.

* Classes are defined using the `class` attribute in an HTML element.

* They can be used to apply CSS styles or to select specific elements with JavaScript or BeautifulSoup.

* Examples: `class="wikitable"`, `class="header"`, etc.

**IDs:**

* These are attributes that uniquely identify an HTML element on a page.

* IDs are defined using the `id` attribute in an HTML element.

* They must be unique throughout the entire page.

* Examples: `id="main-header"`, `id="footer"`, etc.



For example, if the printed table has a structure like this:
```HTML
<table class="wikitable">
  <tr>
    <th>Column 1</th>
    <th>Column 2</th>
  </tr>
  <tr>
    <td>Value 1</td>
    <td>Value 2</td>
  </tr>
</table>

```
You can identify:

* **HTML Elements**: `table`, `tr`, `th`, `td`

* **Classes**: `wikitable`

* **IDs**: None (since there are no id attributes defined)

# 6) Applying CSS Selectors for Targeted Data Extraction

**CSS** (Cascading Style Sheets) is a language used to describe the look and formatting of a document written in HTML. For our purposes in web scraping, we're interested in CSS selectors. These are patterns used to select the element(s) you want to style. They can be used to select elements based on their ID, class, type, attribute, and more.

Think of a CSS selector as a magnet that can attract specific elements from a webpage. For instance, if we're looking for a paragraph within a webpage, we can use the CSS selector '`p`' to attract all paragraph elements.

In BeautifulSoup, we can use the `.select()` method to apply CSS selectors and extract specific data. For example, to select all paragraph elements, we'd write:

> paragraphs = soup.select('p')

CSS selectors can be more specific. For instance, to select all elements with a certain class, we use the pattern `.class_name`. To select elements with a certain ID, we use `#id_name`.

Let's say we're looking for a book on our bookshelf with the label `history`. In CSS selector terms, we'd use `.history` to attract that book. If our book has a unique identifier, `book_123`, we'd use `#book_123` to select it.

Here's how we'd use these selectors in BeautifulSoup:

```Python
# Select all elements with the class 'history' (returns a list)
history_elements = soup.select('.history')

# Select the element with the id 'book_123' (returns one element, or None if there is no match)
book_123 = soup.select_one('#book_123')
```

CSS selectors also support more complex patterns to select elements. For example, we can select all `p` elements inside `div` elements using the selector `'div p'`.
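
The selectors described so far can be tried on a small hand-written fragment. The markup below reuses the `history` class and `book_123` id from the bookshelf example, but the fragment itself is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment, invented for this example
SAMPLE_HTML = """
<div class="history"><p>First</p><p>Second</p></div>
<p id="book_123">Outside</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Type selector: every <p> element on the page
all_paragraphs = [p.text for p in soup.select("p")]
# Descendant selector: only <p> elements nested inside a <div>
inside_div = [p.text for p in soup.select("div p")]
# Class selector: returns a list, even when only one element matches
history = soup.select(".history")
# ID selector with select_one(): returns a single element or None
book = soup.select_one("#book_123")

print(all_paragraphs)  # ['First', 'Second', 'Outside']
print(inside_div)      # ['First', 'Second']
print(len(history))    # 1
print(book.text)       # Outside
```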



The `select()` method returns a list of elements, so we can also pick out every `nth` element with Python's **modulo** operator. The modulo operation finds the remainder after the division of one number by another. In Python, we use the `%` symbol for it.

For example, if we have a flat list of `td` elements from rows of 6 cells each and we want the `4th` cell of every row (which corresponds to the `Date` column), we can use a loop and the modulo operation as follows:

In [None]:
# Assumes td_elements already holds the result of soup.select('tr td')
# Skip the first 12 cells (the first two 6-cell rows) so that index 0 is the first cell of the third row
td_elements = td_elements[12:]

# Loop through each 'td' element
for i in range(len(td_elements)):
    # If the position of the 'td' element within its row is 3 (the 'Date' column)
    if i % 6 == 3:
        # Extract the text from the 'td' element and print it
        date = td_elements[i].text
        print(date)

In this code, `i % 6 == 3` checks if the remainder of the division of `i` by `6` is `3`. This condition is true for the `4th` cell of every row (indices 3, 9, 15, etc.), so these elements are selected.

We use `6` because each row of the HTML table we are scraping has `6` cells (country name, population, world percentage, date, source, and notes), so the position of a cell within its row is its index modulo 6.
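
The same indexing idea can be tried on a plain Python list, without any HTML. The sketch below uses the cell values of two rows shown earlier in this lesson; the constant names `CELLS_PER_ROW` and `DATE_COLUMN` are made up for the example.

```python
# Flattened cells from two 6-cell rows (values taken from the output shown earlier)
cells = [
    'China', '1,408,280,000', '17.2%', '31 Dec 2024', 'Official estimate[5]', '[c]',
    'United States', '340,110,988', '4.1%', '1 Jul 2024', 'Official estimate[6]', '[d]',
]
CELLS_PER_ROW = 6
DATE_COLUMN = 3

# Keep the cells whose position within their row is DATE_COLUMN
dates = [cell for i, cell in enumerate(cells) if i % CELLS_PER_ROW == DATE_COLUMN]
print(dates)  # ['31 Dec 2024', '1 Jul 2024']

# Slicing with a step selects the same cells
same_dates = cells[DATE_COLUMN::CELLS_PER_ROW]
print(same_dates == dates)  # True

# Regrouping the flat list into rows makes each column reachable by index
rows = [cells[i:i + CELLS_PER_ROW] for i in range(0, len(cells), CELLS_PER_ROW)]
print(rows[1][0])  # United States
```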

## Instructions

In this exercise, we'll use CSS selectors to extract specific data from our Wikipedia page. Specifically, we'll target the '**Population**' column data from the table. This will allow us to collect population data for each country, which might be crucial for our analysis at EcoData Inc.

1. Send an HTTP request to the URL(`"https://dataquestio.github.io/web-scraping-pages/"`) of the Wikipedia page and store the response in a variable. Handle any potential exceptions as you learned on the previous screen.

1. Parse the content of the response using BeautifulSoup and store the resulting object in a variable.

1. Use CSS selectors to select all `td` elements inside `tr` elements from the main table. Store the result in a variable called `td_elements`, keeping only the elements from index 12 onwards (`td_elements[12:]`).

1. Create an empty list called `population_list` to store the population of each country.

1. Loop through each `td` element.

    * If the index of the `td` element modulo 6 is `1` (which corresponds to the `Population` column), extract the text from the `td` element and append it to `population_list`.

    * Remember, our table has 6 cells in each row.

1. Print the first ten population results in our `population_list` variable for observation.

Note: We're using the modulo operator (`%`) to select the second `td` element of every 6-cell row (indices 1, 7, 13, etc.). This corresponds to the `Population` column in our table.

In [21]:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException, HTTPError

try:
    response = requests.get('https://dataquestio.github.io/web-scraping-pages/')
    response.raise_for_status()

# HTTPError is a subclass of RequestException, so catch it first
except HTTPError as e:
    print(e)

# Any other requests failure (connection error, timeout, etc.)
except RequestException as err:
    print(err)
    
#parse the content to BeautifulSoup:
soup = BeautifulSoup(response.text, 'html.parser')

#CSS to select all TD elements inside tr
td_elements = soup.select('tr td')
td_elements=td_elements[12:]

population_list = []


#loop through td_elements
for i in range(len(td_elements)):
    
    #if index i of 'td' element is 1 - population column
    if i % 6 == 1:
        #extract the text from td element and append it
        population = td_elements[i].text
        population_list.append(population)

#print first 10  population results        
print(population_list[:10])

        

['1,408,280,000', '340,110,988', '284,438,782', '241,499,431', '223,800,000', '213,421,037', '169,828,911', '146,028,325', '130,575,786', '123,300,000']


# 7) Handling Different Data Types in Web Scraping

When scraping, we'll come across various data types on webpages, including **strings**, **numbers**, and **dates**. Similar to categorizing distinct items in a collection, we must process each data type appropriately to organize and understand our findings.

Handling these data types correctly ensures that the data we extract is accurate and usable in later analyses. Everything we scrape arrives as text, and only after converting it to the right type can we perform mathematical operations on numbers, compare dates, or manipulate strings as intended.

Consider a row like this one:

> `['China', '1,411,750,000', '17.5%', '31 Dec 2022', 'Official estimate[4]', '[b]']`

Here, `1,411,750,000` is a string representation of a `population`, `17.5%` is a string representation of a `percentage`, and `31 Dec 2022` is a string representation of a `date`. To work with these values, we'll need to convert them into the appropriate data types.

For the population, we can remove the commas and convert the string to an integer:

In [22]:
population = '1,411,750,000'
population = int(population.replace(',', ''))
print(population)

1411750000


For the percentage, we can remove the '%' sign and convert the string to a float:

In [23]:
percentage = '17.5%'
percentage = float(percentage.replace('%', ''))
print(percentage)

17.5


For the date, we can use the `datetime` module to convert the string to a datetime object:

In [24]:
from datetime import datetime

date = '31 Dec 2022'
date = datetime.strptime(date, '%d %b %Y')
print(date)

2022-12-31 00:00:00


Sometimes, we may encounter values that can't be converted into the desired data type. For instance, trying to convert a string with non-numeric characters into an integer will raise a `ValueError`. To handle such errors, we can use a `try-except` block.

In this block, the `try` clause (the code between the try and except keywords) is executed. If no exception occurs, the `except` clause is skipped. However, if an exception occurs, the rest of the try clause is skipped, and the `except` clause is executed.

```Python
raw_population = '1,411,750,000'
try:
    # Remove the thousands separators before converting
    population = int(raw_population.replace(',', ''))
except ValueError:
    print(f"Could not convert {raw_population} to an integer.")
```
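
One way to package the three conversions is as small helpers that return `None` instead of raising. This is only a sketch: the names `to_int`, `to_float`, and `to_date` are hypothetical, and returning `None` on failure is an assumption about what the caller wants.

```python
from datetime import datetime

def to_int(text):
    """Convert text such as '1,408,280,000' to an int, or return None on failure."""
    try:
        return int(text.replace(',', '').strip())
    except ValueError:
        return None

def to_float(text):
    """Convert text such as '17.2%' to a float, or return None on failure."""
    try:
        return float(text.replace('%', '').strip())
    except ValueError:
        return None

def to_date(text, fmt='%d %b %Y'):
    """Convert text such as '31 Dec 2024' to a datetime, or return None on failure."""
    try:
        return datetime.strptime(text.strip(), fmt)
    except ValueError:
        return None

print(to_int('1,408,280,000'))  # 1408280000
print(to_float('17.2%'))        # 17.2
print(to_date('31 Dec 2024'))   # 2024-12-31 00:00:00
print(to_int('n/a'))            # None
```

Note that `%b` matches abbreviated month names in the current locale, so the date format above assumes English month names.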

## Instructions

1. Send an HTTP request to the URL(`"https://dataquestio.github.io/web-scraping-pages/"`) of the Wikipedia page and store the response in a variable. Handle any potential exceptions as you learned on the previous screen.

1. Parse the content of the response using BeautifulSoup and store the resulting object in a variable.

1. Use CSS selectors to select all `td` elements inside `tr` elements from the main table. Store the result in a variable called `td_elements`.

    * Include only the elements from the third row onwards. These begin at index 12, so use index slicing (`td_elements[12:]`).

1. Create an empty list called `population_list` to store the population of each country.

1. Loop through each `td` element.

    * If the index of the `td` element modulo 6 is `1` (which corresponds to the `Population` column), extract the text from the `td` element and append it to `population_list`.

    * If the index of the `td` element modulo 6 is `2` (which corresponds to the `% of world` column), extract the text from the `td` element, remove the `%` sign, convert it to a **float**, and print.

    * If the index of the `td` element modulo 6 is `3` (which corresponds to the `Date` column), extract the text from the `td` element, strip it, convert it to a **datetime** object, and print.

    * Limit the processing to the first `30` rows in `td_elements` by breaking out of the loop once 30 rows have been processed.

In [27]:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException, HTTPError
from datetime import datetime


try:
    response = requests.get('https://dataquestio.github.io/web-scraping-pages/')
    response.raise_for_status()
except HTTPError as e:
    # HTTPError is a subclass of RequestException, so it must be caught first
    print(f"HTTP error occurred: {e}")
    raise
except RequestException as e:
    print(f"There was an issue with your request: {e}")
    # Re-raise so the cell stops here instead of failing on an undefined response
    raise

soup = BeautifulSoup(response.text, 'html.parser')
td_elements = soup.select('tr td')
td_elements = td_elements[12:]  # keep only cells from the third row onward

population_list = []
first_30_rows = 0
for i in range(len(td_elements)):
    # Each row has six cells, so i % 6 is the column position within the row
    if i % 6 == 0:
        first_30_rows += 1
    if first_30_rows > 30:
        break
    if i % 6 == 1:
        population = td_elements[i].text
        population = int(population.replace(',', ''))
        population_list.append(population)
        print(population)
    elif i % 6 == 2:
        percentage = td_elements[i].text
        percentage = float(percentage.replace('%', ''))
        print(percentage)
    elif i % 6 == 3:
        date = td_elements[i].text.strip()
        try:
            date = datetime.strptime(date, '%d %b %Y')
            print(date)
        except ValueError:
            print("Date not found")

1408280000
17.2
2024-12-31 00:00:00
340110988
4.1
2024-07-01 00:00:00
284438782
3.5
2025-06-30 00:00:00
241499431
2.9
2023-03-01 00:00:00
223800000
2.7
2023-07-01 00:00:00
213421037
2.6
2025-07-01 00:00:00
169828911
2.1
2022-06-14 00:00:00
146028325
1.8
2025-01-01 00:00:00
130575786
1.6
2025-06-30 00:00:00
123300000
1.5
2025-08-01 00:00:00
114123600
1.4
2025-07-01 00:00:00
112832000
1.4
2025-07-01 00:00:00
111652998
1.4
2025-07-01 00:00:00
107271260
1.3
2025-01-01 00:00:00
101343800
1.2
2024-12-31 00:00:00
85961000
1.0
2024-03-20 00:00:00
85664944
1.0
2024-12-31 00:00:00
83517030
1.0
2025-03-31 00:00:00
68688000
0.8
2025-08-01 00:00:00
68265209
0.8
2023-07-01 00:00:00
68153004
0.8
2025-07-01 00:00:00
65859640
0.8
2025-07-31 00:00:00
63100945
0.8
2025-06-30 00:00:00
58919230
0.7
2025-06-30 00:00:00
53330978
0.7
2025-07-01 00:00:00
53057212
0.6
2025-01-01 00:00:00
51662000
0.6
2025-07-01 00:00:00
51316756
0.6
2024-10-15 00:00:00
51159889
0.6
2025-07-31 00:00:00
49315949
0.6
2025-07-01 00:00:00
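
The modulo arithmetic in the solution works because every table row holds six cells. The same idea can be sketched by grouping the flat list of cells into rows first and then reading each column by position. In this sketch, `cells` is hand-written sample text standing in for the `.text` of consecutive `td` elements, and `chunk_rows` is a hypothetical helper, not part of `bs4`.

```python
from datetime import datetime

# Hand-written sample standing in for the text of consecutive td elements
cells = [
    'China', '1,411,750,000', '17.5%', '31 Dec 2022', 'Official estimate[4]', '[b]',
    'India', '1,392,329,000', '17.3%', '1 Mar 2023', 'Official projection[5]', '[c]',
]


def chunk_rows(values, width=6):
    """Split a flat list into consecutive rows of `width` items."""
    return [values[i:i + width] for i in range(0, len(values), width)]


records = []
for row in chunk_rows(cells):
    records.append({
        'population': int(row[1].replace(',', '')),
        'percent': float(row[2].replace('%', '')),
        'date': datetime.strptime(row[3].strip(), '%d %b %Y').date(),
    })

for record in records:
    print(record)
```

Grouping by row keeps the column positions (`row[1]`, `row[2]`, `row[3]`) visible in one place, which can be easier to follow than tracking `i % 6` across branches.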

# Storing and Structuring Scraped Data

The process of structuring and storing data is much like organizing a library. Imagine each piece of information we scraped is a book. We can't just leave our books in a pile; we need to arrange them on shelves and categorize them so that we can easily find what we're looking for. In the same way, we need to structure our scraped data into a suitable format and store it for easy access and analysis.

One common way to structure web-scraped data is by using pandas **DataFrames**.

Let's say we've scraped the following data from our Wikipedia page. We can convert this data into a DataFrame like this:

In [28]:
import pandas as pd

data = [['China', '1,411,750,000', '17.5%', '31 Dec 2022', 'Official estimate[4]', '[b]'],
        ['India', '1,392,329,000', '17.3%', '1 Mar 2023', 'Official projection[5]', '[c]'], 
        ['United States', '335,495,000', '4.2%', '11 Oct 2023', 'National population clock[7]', '[d]']]

# Define the column names
columns = ['Country/Dependency', 'Population', '% of World', 'Date', 'Source', 'Notes']

# Create a DataFrame from the data
df = pd.DataFrame(data, columns=columns)

print(df)

  Country/Dependency     Population % of World         Date  \
0              China  1,411,750,000      17.5%  31 Dec 2022   
1              India  1,392,329,000      17.3%   1 Mar 2023   
2      United States    335,495,000       4.2%  11 Oct 2023   

                         Source Notes  
0          Official estimate[4]   [b]  
1        Official projection[5]   [c]  
2  National population clock[7]   [d]  
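
Every column in this DataFrame still holds strings. A minimal sketch of converting them to numeric and datetime types is shown below, assuming the values follow the same layout as the sample rows; it rebuilds a reduced four-column version of the DataFrame so that it can run on its own.

```python
import pandas as pd

# Reduced copy of the sample data (four of the six columns)
df = pd.DataFrame(
    [['China', '1,411,750,000', '17.5%', '31 Dec 2022'],
     ['India', '1,392,329,000', '17.3%', '1 Mar 2023'],
     ['United States', '335,495,000', '4.2%', '11 Oct 2023']],
    columns=['Country/Dependency', 'Population', '% of World', 'Date'],
)

# Strip thousands separators, then convert to integers
df['Population'] = df['Population'].str.replace(',', '', regex=False).astype('int64')
# Drop the trailing percent sign, then convert to floats
df['% of World'] = df['% of World'].str.rstrip('%').astype(float)
# Parse 'day month-abbreviation year' strings into datetime values
df['Date'] = pd.to_datetime(df['Date'], format='%d %b %Y')

print(df.dtypes)
```

With typed columns, operations such as sorting by population or filtering by date behave numerically rather than alphabetically.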


After structuring our data, we can store it in a file for later use. One common way to do this is by writing the data to a **CSV file**. Here's how we can do it:

```Python
# Write the DataFrame to a CSV file
df.to_csv('population_data.csv', index=False)
```
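
To check that the data survives the round trip, the CSV can be read back with `pd.read_csv`. The sketch below makes one assumption for the sake of being self-contained: it writes to an in-memory `io.StringIO` buffer instead of `population_data.csv`, and it uses a reduced copy of the sample data.

```python
import io

import pandas as pd

df = pd.DataFrame(
    [['China', '1,411,750,000', '17.5%'],
     ['India', '1,392,329,000', '17.3%']],
    columns=['Country/Dependency', 'Population', '% of World'],
)

# An in-memory buffer stands in for 'population_data.csv' in this sketch
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and load the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Values that contain commas, such as `1,411,750,000`, are quoted by `to_csv` when written, so they come back as single fields rather than being split across columns.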