<a href="https://colab.research.google.com/github/Mick971/Python-Web-Scraping/blob/main/lessons/02_web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping with Beautiful Soup

* * *

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scrape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [None]:
# Install the packages for the libraries we will use to develop the code


In [4]:
%pip install requests



In [5]:
%pip install beautifulsoup4



We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [6]:
%pip install lxml



In [7]:
# Import the libraries from the installed packages

In [24]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

<a id='extract'></a>

# Extracting and Parsing HTML

In order to successfully scrape and analyse HTML, we'll be going through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

In [25]:
# Make a GET request to the Illinois Senate web page and show the first 1000
# characters of the HTML content received

In [26]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])

The service is unavailable.
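A response body like the one above is an error message rather than the page we asked for, so it can help to check the request before parsing. Below is a minimal sketch of one way to do that; the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the workshop code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return a page's HTML, raising an exception if the request fails.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # for example a 503 sent while a server is unavailable
    response.raise_for_status()
    return response.text


# Example usage (needs network access):
# src = fetch_html('http://www.ilga.gov/senate/default.asp')
```

Failing loudly at this step is a design choice: an error here is easier to diagnose than an empty list several cells later.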


## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` function to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [27]:
# Use the BeautifulSoup library to parse the HTML we obtained so that it has a clear structure,
# then print the first 1000 characters

In [28]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

<html>
 <body>
  <p>
   The service is unavailable.
  </p>
 </body>
</html>



The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.
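To make "traversing the page" concrete without depending on a live website, here is a small sketch that parses a hand-written HTML string. The HTML is hypothetical (it is not taken from ilga.gov), and the sketch uses Python's built-in `html.parser` so that it runs even where `lxml` is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written for this sketch (not taken from ilga.gov)
sample_html = """
<html>
  <head><title>Example Senate Page</title></head>
  <body>
    <p>Welcome to the <a href="/senate">Senate</a> page.</p>
  </body>
</html>
"""

# "html.parser" ships with Python; "lxml" is passed in the same position
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached by name, walking down the tree
print(sample_soup.title.text)    # text of the <title> tag
print(sample_soup.body.p.text)   # text of the first <p> inside <body>
print(sample_soup.p.a["href"])   # href of the first <a> inside the first <p>
```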

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**.

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [None]:
# Use the soup object to find all <a> link elements and print the first 10

In [29]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

[]


Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object.

These two lines of code are equivalent:

In [None]:
# Use two equivalent ways to find all <a> elements on the web page, then
# show the first element of each result

In [30]:
a_tags = soup.find_all("a")
a_tags_alt = soup("a")
print(a_tags[0])
print(a_tags_alt[0])

IndexError: list index out of range

How many links did we obtain?

In [None]:
# Show the number of elements in a_tags, which stores all the link elements from the
# Senate page

In [31]:
print(len(a_tags))

0


That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get many hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes?

We can do this by adding an additional argument to `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_="sidemenu"`.

In [32]:
# Extract all <a> links with the class sidemenu from the HTML, then show the first 5 elements

In [33]:
# Get only the 'a' tags in 'sidemenu' class
side_menus = soup("a", class_="sidemenu")
side_menus[:5]

[]
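To see what these searches return on a page that does contain links, here is a sketch on a hand-written snippet. The HTML, link texts, and paths are made up for illustration; only the class names `sidemenu` and `mainmenu` come from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical menu markup, written for this sketch
menu_html = """
<ul>
  <li><a class="sidemenu" href="/senate/">Members</a></li>
  <li><a class="sidemenu" href="/senate/committees/">Committees</a></li>
  <li><a class="mainmenu" href="/">Home</a></li>
  <li><a href="/contact">Contact</a></li>
</ul>
"""
menu_soup = BeautifulSoup(menu_html, "html.parser")

# Every <a> tag, regardless of class
all_links = menu_soup.find_all("a")
# Only the <a> tags whose class is "sidemenu"
side_links = menu_soup.find_all("a", class_="sidemenu")
# Calling the soup object directly is shorthand for find_all
side_links_alt = menu_soup("a", class_="sidemenu")

print(len(all_links))
print([link.text for link in side_links])
print([link.text for link in side_links_alt])
```

The keyword is spelled `class_` because `class` is a reserved word in Python.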

A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Pass a CSS selector string to `.select()` to get all the elements that match it.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [None]:
# Use a CSS selector to get all <a> links with the class sidemenu,
# then show the first 5 elements

In [34]:
# Get elements with "a.sidemenu" CSS Selector.
selected = soup.select("a.sidemenu")
selected[:5]

[]
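CSS selectors can express more than "tag with class". The sketch below tries a few common selector forms on a hypothetical snippet (the markup and the `nav` id are invented for this example), so the results can be checked by eye.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation markup, written for this sketch
nav_html = """
<div id="nav">
  <a class="mainmenu" href="/">Home</a>
  <ul>
    <li><a class="sidemenu" href="/senate/">Senate</a></li>
    <li><a class="sidemenu" href="/house/">House</a></li>
  </ul>
</div>
"""
nav_soup = BeautifulSoup(nav_html, "html.parser")

# tag.class: every <a> with class "sidemenu"
print([a.text for a in nav_soup.select("a.sidemenu")])
# ancestor descendant: every <a> anywhere inside a <ul>
print([a.text for a in nav_soup.select("ul a")])
# parent > child: only <a> tags that are direct children of the element with id "nav"
print([a.text for a in nav_soup.select("#nav > a")])
# select_one returns the first match, or None when nothing matches
print(nav_soup.select_one("a.missing"))
```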

## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

In [None]:
# Use a CSS selector to get all <a> links with the class mainmenu,
# then show the first 5 elements

In [35]:
# YOUR CODE HERE
selected = soup.select("a.mainmenu")
selected[:5]

[]

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [37]:
# Select all <a> links with the class "sidemenu" and store the first one in first_link.
# Then print that link and show that its type is bs4.element.Tag, which represents an HTML node in BeautifulSoup.

In [38]:
# Get all sidemenu links as a list
side_menu_links = soup.select("a.sidemenu")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

IndexError: list index out of range

It's a Beautiful Soup tag! This means it has a `text` member:

In [None]:
# Print the text inside the first link with the class sidemenu

In [39]:
print(first_link.text)

NameError: name 'first_link' is not defined

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [41]:
# Print the value of the href attribute of the first sidemenu link

In [40]:
print(first_link['href'])

NameError: name 'first_link' is not defined
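Since the cells above could not find a link to inspect, here is a sketch of text and attribute access on a hand-written snippet. The HTML is hypothetical; the calls shown (`.text`, `get_text(strip=True)`, dictionary-style access, `.get()`, and `find_all(..., href=True)`) are standard Beautiful Soup features.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one without
link_html = '<p><a class="sidemenu" href="/senate/"> Members </a><a id="top">Top</a></p>'
link_soup = BeautifulSoup(link_html, "html.parser")
with_href, without_href = link_soup.find_all("a")

print(repr(with_href.text))            # text exactly as written, spaces included
print(with_href.get_text(strip=True))  # text with surrounding whitespace removed
print(with_href["href"])               # dictionary-style access to an attribute
print(with_href.attrs)                 # all attributes as a dictionary
print(without_href.get("href"))        # .get() returns None instead of raising KeyError

# Collect the href of every link that actually has one
hrefs = [a["href"] for a in link_soup.find_all("a", href=True)]
print(hrefs)
```

Using `.get()` or `href=True` is a small safeguard: `tag['href']` raises a `KeyError` when the attribute is missing.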

## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

In [43]:
# Extract the href of a link with the class mainmenu

In [42]:
# YOUR CODE HERE

# Get all mainmenu links as a list
main_menu_links = soup.select("a.mainmenu")

# Examine the first link
second_link = main_menu_links[0]
print(second_link)

# What class is this variable?
print('Class: ', type(second_link))

print(second_link['href'])

IndexError: list index out of range

<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [45]:
# Make a GET request to the Illinois Senate page and parse the result to make it easier to analyze

In [44]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [46]:
# Get all <tr> elements from the HTML, then return the number of rows found

In [47]:
# Get all table row elements
rows = soup.find_all("tr")
len(rows)

0

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [49]:
# Use a CSS selector to select all <tr> elements nested inside two other <tr> elements,
# then print the first 5 rows

In [48]:
# Select every element matching the 'tr tr tr' CSS selector on the page
rows = soup.select('tr tr tr')

for row in rows[:5]:
    print(row, '\n')

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.
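As an aside on the `'tr tr tr'` selector: a space between names means "descendant of", so it matches only rows that sit inside two other rows. The sketch below shows this on hypothetical nested tables; it assumes, as the selector suggests, that the real page nests its tables in a similar way.

```python
from bs4 import BeautifulSoup

# Hypothetical nested tables: the data rows live in the innermost table
nested_html = """
<table>
 <tr><td>
  <table>
   <tr><td>
    <table>
     <tr><td>Header</td></tr>
     <tr><td class="detail">Example Senator</td></tr>
    </table>
   </td></tr>
  </table>
 </td></tr>
</table>
"""
nested_soup = BeautifulSoup(nested_html, "html.parser")

all_rows = nested_soup.select("tr")          # rows at any depth
inner_rows = nested_soup.select("tr tr tr")  # rows nested inside two other rows

print(len(all_rows), len(inner_rows))
print([row.get_text(strip=True) for row in inner_rows])
```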

In [51]:
# Take the third <tr> row in the list and print it in a readable format using prettify

In [50]:
example_row = rows[2]
print(example_row.prettify())

IndexError: list index out of range

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.
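The three options in the list above can be compared on a hand-written row. The markup below is hypothetical; its column order (name, bills, committees, district, party) is an assumption based on the indices used later in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical senator row, written for this sketch
row_html = """
<table><tr>
  <td class="detail"><a href="#">Example Senator</a></td>
  <td class="detail"><a href="#">Bills</a></td>
  <td class="detail"><a href="#">Committees</a></td>
  <td class="detail">1</td>
  <td class="detail">D</td>
</tr></table>
"""
sample_row = BeautifulSoup(row_html, "html.parser").tr

by_tag = sample_row.select("td")          # any <td>
by_class = sample_row.select(".detail")   # any element with class "detail"
by_both = sample_row.select("td.detail")  # only <td> elements with class "detail"

print(len(by_tag), len(by_class), len(by_both))
print([cell.text for cell in by_both])
```

All three agree here, but they would diverge if a row held a `td` without the class, or another tag that shared it, which is why the combined selector is the most specific choice.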

In [None]:
# Print all <td> cells in the row,
# then all elements in the row that have the class detail,
# then only the <td> cells that also have the class detail

In [52]:
for cell in example_row.select('td'):
    print(cell)
print()

for cell in example_row.select('.detail'):
    print(cell)
print()

for cell in example_row.select('td.detail'):
    print(cell)
print()

NameError: name 'example_row' is not defined

We can confirm that these are all the same.

In [54]:
# Check that the three selections return exactly the same elements

In [53]:
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')

NameError: name 'example_row' is not defined

Let's use the selector `td.detail` to be as specific as possible.

In [56]:
# Select all <td> cells in example_row that have the class detail and store them in detail_cells

In [55]:
# Select only those 'td' tags with class 'detail'
detail_cells = example_row.select('td.detail')
detail_cells

NameError: name 'example_row' is not defined

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [58]:
# Extract only the text of each cell with the class "detail" in example_row, store the texts in a list called row_data,
# then print row_data

In [57]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]

print(row_data)

NameError: name 'detail_cells' is not defined

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [60]:
# Print the name, district, and political party

In [59]:
print(row_data[0]) # Name
print(row_data[3]) # District
print(row_data[4]) # Party

NameError: name 'row_data' is not defined

## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

In [62]:
# Print the first, second, and last rows of the table

In [61]:
print('Row 0:\n', rows[0], '\n')
print('Row 1:\n', rows[1], '\n')
print('Last Row:\n', rows[-1])

IndexError: list index out of range

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [64]:
# Print the number of direct children in the specified rows

In [None]:
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))

Perhaps good rows have a length of 5. Let's check:

In [66]:
# Create a list good_rows that includes only the rows with exactly 5 direct children,
# filtering for rows with the correct structure.
# Then print the first, second-to-last, and last of those rows

In [65]:
good_rows = [row for row in rows if len(row) == 5]

# Let's check some rows
print(good_rows[0], '\n')
print(good_rows[-2], '\n')
print(good_rows[-1])

IndexError: list index out of range

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [67]:
# Select and return all <td> cells with the class "detail" in the third row,
# then print the first and last filtered rows to check

In [None]:
rows[2].select('td.detail')

In [None]:
# Bad row
print(rows[-1].select('td.detail'), '\n')

# Good row
print(rows[5].select('td.detail'), '\n')

# How about this?
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n')
print(good_rows[-1])

Looks like we found something that worked!
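The two filters tried above can be compared on a small hypothetical table. `len(row)` counts a row's direct children, so rows with the same number of cells look identical to it, whereas `row.select('td.detail')` returns an empty (falsy) list for rows without detail cells. The `header` and `footer` class names below are invented for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row, two data rows, and a footer row
table_html = """
<table>
<tr><td class="header">Senator</td><td class="header">District</td></tr>
<tr><td class="detail">Example Senator A</td><td class="detail">1</td></tr>
<tr><td class="detail">Example Senator B</td><td class="detail">2</td></tr>
<tr><td class="footer">Back to top</td><td class="footer"></td></tr>
</table>
"""
sample_rows = BeautifulSoup(table_html, "html.parser").select("tr")

# Every row has two child cells, so length alone cannot separate them
row_lengths = [len(row) for row in sample_rows]
print(row_lengths)

# An empty result from select() is falsy, so only rows with td.detail survive
kept_rows = [row for row in sample_rows if row.select("td.detail")]
print(len(sample_rows), len(kept_rows))
```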

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [69]:
# Define an empty list called members to store information. Then filter the valid rows of the table,
# removing those that do not contain cells with the class "detail". Next, loop over each valid row,
# extract the text of the cells with the class "detail", and store the senator's name, district, and political party in variables.
# Finally, create a tuple with this data and append it to the members list

In [68]:
# Define storage list
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

In [None]:
# Return the number of elements in the members list

In [70]:
# Should be 61
len(members)

0

Let's take a look at what we have in `members`.

In [72]:
# Print the first 5 elements of the members list

In [71]:
print(members[:5])

[]
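Because the loop depends on the live page, it can be useful to try the same logic on HTML we control. The sketch below wraps the loop in a hypothetical function, `parse_members`, that takes an HTML string; the function name, the `row_selector` parameter, and the sample table are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def parse_members(html, row_selector="tr tr tr"):
    """Return (name, district, party) tuples for rows that contain td.detail cells."""
    soup = BeautifulSoup(html, "html.parser")
    members = []
    for row in soup.select(row_selector):
        cells = [cell.get_text(strip=True) for cell in row.select("td.detail")]
        # Skip junk rows: too few detail cells, or a district that is not a number
        if len(cells) < 5 or not cells[3].isdigit():
            continue
        members.append((cells[0], int(cells[3]), cells[4]))
    return members


# Hypothetical, un-nested sample table, so we pass row_selector="tr"
sample_table = """
<table>
<tr><td>Senator</td><td>Bills</td><td>Committees</td><td>District</td><td>Party</td></tr>
<tr><td class="detail">Example Senator A</td><td class="detail">Bills</td>
    <td class="detail">Committees</td><td class="detail">1</td><td class="detail">D</td></tr>
<tr><td class="detail">Example Senator B</td><td class="detail">Bills</td>
    <td class="detail">Committees</td><td class="detail">2</td><td class="detail">R</td></tr>
</table>
"""
parsed = parse_members(sample_table, row_selector="tr")
print(parsed)
```

Checking the cell count and the district before calling `int()` keeps one malformed row from stopping the whole loop.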


## 🥊  Challenge: Get `href` elements pointing to members' bills

The code above retrieves information on:  

- the senator's name,
- their district number,
- and their party.

We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format.

The format for the list of bills for a given senator is:

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

to get something like:

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`

in which `MEMBER_ID=1911`.

You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips:

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details.
* There are a _lot_ of different ways to use BeautifulSoup to get things done. Whatever you need to do to pull the `href` out is fine.
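One way to turn a relative link into a full URL is `urljoin` from Python's standard library, sketched below. The relative path here is an assumption modeled on the bills URL shown earlier; the actual `href` values on the page may differ.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# The page we scraped, used as the base for resolving relative links
base_url = "http://www.ilga.gov/senate/default.asp?GA=98"
# Assumed shape of a relative href, modeled on the example bills URL above
relative_href = "/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True"

# urljoin resolves the relative link against the base URL
full_path = urljoin(base_url, relative_href)
print(full_path)

# The member ID can then be read from the query string if needed
member_id = parse_qs(urlparse(full_path).query)["MemberID"][0]
print(member_id)
```

Compared with concatenating strings, `urljoin` also handles links that do not start with a slash.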

The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`.

In [None]:
# This code makes an HTTP request to get the Illinois senators page,
# parses it with BeautifulSoup, filters the relevant rows of the table,
# extracts each senator's name, district, and party,
# and also gets the direct link to each senator's bills ("Bills").
# Finally, it stores all the information in a list of tuples.

In [73]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")
# Create empty list to store our data
members = []

# Select every element matching the 'tr tr tr' CSS selector on the page
rows = soup.select('tr tr tr')
# Get rid of junk rows
rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]



    # Find the link whose text is 'Bills'
    bills_link = row.find('a', string='Bills')
    if bills_link and bills_link.has_attr('href'):
        # Build the full URL using the base domain
        full_path = 'http://www.ilga.gov' + bills_link['href']
    else:
        full_path = ''


    # Store in a tuple
    senator = (name, district, party, full_path)
    # Append to list
    members.append(senator)

In [74]:
# Uncomment to test
# members[:5]

## 🥊  Challenge: Modularize Your Code

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator.

In [75]:
# YOUR CODE HERE

def get_members(url):
    # Make a GET request to the given URL and get the HTML content
    req = requests.get(url)
    src = req.text
    # Parse the HTML using BeautifulSoup with the 'lxml' parser
    soup = BeautifulSoup(src, "lxml")
    # Create an empty list to store the member data
    members = []
    # Select the candidate table rows using the CSS selector 'tr tr tr'
    rows = soup.select('tr tr tr')
    # Keep only the rows that contain at least one <td> cell with class 'detail'
    rows = [row for row in rows if row.select('td.detail')]
    # Loop through all valid rows
    for row in rows:
        # Select only the <td> cells that have the class 'detail'
        detail_cells = row.select('td.detail')
        # Keep only the text in each of those cells
        row_data = [cell.text for cell in detail_cells]
        # Get the senator's name (first element)
        name = row_data[0]
        # Get the senator's district (fourth element) and convert it to an integer
        district = int(row_data[3])
        # Get the senator's party (fifth element)
        party = row_data[4]
        # Find the link to the senator's list of bills:
        # the link whose text is 'Bills'
        bill_link = row.find('a', string='Bills')
        # If the link is found, build the full URL
        if bill_link and bill_link.has_attr('href'):
            full_path = 'http://www.ilga.gov' + bill_link['href']
        else:
            full_path = ''
        # Store the information in a tuple
        senator = (name, district, party, full_path)
        # Append the tuple to the list of members
        members.append(senator)
    # Return the list of members
    return members


In [76]:
# Test your code
url = 'http://www.ilga.gov/senate/default.asp?GA=98'
senate_members = get_members(url)
len(senate_members)

0

## 🥊 Take-home Challenge: Writing a Scraper Function

We want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - returning a _list_ of tuples, each with:
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
This function has been partially completed. Fill in the rest.

In [77]:
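def get_bills(url):
    src = requests.get(url).text
    # Name the parser explicitly so Beautiful Soup doesn't have to guess
    soup = BeautifulSoup(src, "lxml")
    rows = soup.select('tr')
    bills = []
    for row in rows:
        # YOUR CODE HERE
        # Keep only the text of the cells with class 'billlist'
        cells = [cell.text for cell in row.select('td.billlist')]
        # Skip rows that don't have the five expected columns
        if len(cells) != 5:
            continue
        # Assumes the bill ID is in the 1st column, before the description
        bill_id = cells[0]
        description = cells[1]
        chamber = cells[2]
        last_action = cells[3]
        last_action_date = cells[4]
        bill = (bill_id, description, chamber, last_action, last_action_date)
        bills.append(bill)
    return bills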

In [None]:
# Uncomment to test your code
# test_url = senate_members[0][3]
# get_bills(test_url)[0:5]

### Scrape All Bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `senate_members` and calling `get_bills()` for each of their associated bill URLs.

**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site.
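One possible shape for this loop is sketched below. The helper name `collect_bills` and its parameters are hypothetical; it takes the bill-fetching function as an argument so that it can be tried with a stand-in function and no network access, and it assumes each member is a `(name, district, party, url)` tuple as built earlier.

```python
import time


def collect_bills(members, fetch_bills, delay=1.0):
    """Map each district number to the list of bills returned by fetch_bills.

    members: iterable of (name, district, party, bills_url) tuples
    fetch_bills: function that takes a URL and returns a list of bills
    delay: seconds to pause after each request
    """
    bills_by_district = {}
    for name, district, party, bills_url in members:
        bills_by_district[district] = fetch_bills(bills_url)
        # Pause between requests so we don't overload the server
        time.sleep(delay)
    return bills_by_district


# Stand-in data and fetcher so the sketch runs offline
fake_members = [("Example Senator A", 1, "D", "url-1"),
                ("Example Senator B", 2, "R", "url-2")]
demo_bills = collect_bills(fake_members, lambda url: [url.upper()], delay=0)
print(demo_bills)
```

With the real functions, the call would look like `bills_dict = collect_bills(senate_members, get_bills)`.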

In [None]:
# YOUR CODE HERE


In [None]:
# Uncomment to test your code
# bills_dict[52]