<a href="https://colab.research.google.com/github/Pala63/Python-Web-Scraping/blob/main/lessons/02_web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping with Beautiful Soup

* * *

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scrape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [None]:
# Install the 'requests' library, which lets Python make HTTP requests
%pip install requests



In [None]:
# Install Beautiful Soup, used to parse HTML or XML content and extract information from it (web scraping)
%pip install beautifulsoup4



We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [None]:
# Install 'lxml', a fast and efficient parser for HTML and XML files, used together with Beautiful Soup
%pip install lxml



In [None]:
# Import the libraries we need
from bs4 import BeautifulSoup  # To parse HTML or XML and extract data from it
from datetime import datetime  # To work with dates and times
import requests  # To make HTTP requests and fetch the content of web pages
import time  # To pause execution (for example, to avoid being blocked for sending many requests in a row)


<a id='extract'></a>

# Extracting and Parsing HTML

In order to successfully scrape and analyze HTML, we'll be going through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

In [None]:
# Make a GET request to the Senate members page of the Illinois General Assembly website
req = requests.get('http://www.ilga.gov/Senate/Members')

# Store the HTML content of the server's response in the variable 'src'
src = req.text

# Print the first 1000 characters of the HTML to inspect part of the page content
print(src[:1000])

<!DOCTYPE html>
<html lang="en">
<head id="Head1">
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta charset="utf-8" />
    <meta charset="UTF-8">
    <!-- Meta Description -->
    <meta name="description" content="Welcome to the official government website of the Illinois General Assembly">
    <meta name="contactName" content="Legislative Information System">
    <meta name="contactOrganization" content="LIS Staff Services">
    <meta name="contactStreetAddress1" content="705 Stratton Office Building">
    <meta name="contactCity" content="Springfield">
    <meta name="contactZipcode" content="62706">
    <meta name="contactNetworkAddress" content="webmaster@ilga.gov">
    <meta name="contactPhoneNumber" content="217-782-3944">
    <meta name="contactFaxNumber" content="217-524-6059">
    <meta name

This cell makes an HTTP GET request to the Illinois State Senate page using the Requests library.
It then stores the page's HTML content in the variable `src`.
Finally, it prints the first 1000 characters of the HTML to show how the page's code is structured.

Why do we do this?
To scrape a page, we first need to see how its HTML is structured, because that is where the information we want to extract lives. In this case, the Senate page holds the senators' names and the links to their pages, and we will capture them directly from the HTML.
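A request can fail or hang, so it can help to check the response before parsing it. The sketch below shows one way to do that. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the workshop code:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text (hypothetical helper, not from the workshop)."""
    # Without a timeout, requests can wait indefinitely for a slow server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Example usage (needs network access):
# src = fetch_html('http://www.ilga.gov/Senate/Members')
# print(src[:1000])
```

Raising early on a failed request keeps a later parsing step from silently working on an error page.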

## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` function to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [None]:
# Parse the HTML content with BeautifulSoup, using the 'lxml' parser, to structure it as a tree
soup = BeautifulSoup(src, 'lxml')

# Show the first 1000 characters of the formatted, structured HTML to make it easier to read
print(soup.prettify()[:1000])


<!DOCTYPE html>
<html lang="en">
 <head id="Head1">
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta charset="utf-8"/>
  <!-- Meta Description -->
  <meta content="Welcome to the official government website of the Illinois General Assembly" name="description"/>
  <meta content="Legislative Information System" name="contactName"/>
  <meta content="LIS Staff Services" name="contactOrganization"/>
  <meta content="705 Stratton Office Building" name="contactStreetAddress1"/>
  <meta content="Springfield" name="contactCity"/>
  <meta content="62706" name="contactZipcode"/>
  <meta content="webmaster@ilga.gov" name="contactNetworkAddress"/>
  <meta content="217-782-3944" name="contactPhoneNumber"/>
  <meta content="217-524-6059" name="contactFaxNumber"/>
  <meta content="State Of Illinois" name="originatorJur

The HTML content (`src`) is converted into a BeautifulSoup object.
This lets us navigate and search the HTML as a structured tree.
We use `'lxml'` as the parser because it is fast and efficient.
The `prettify()` function displays the HTML with indentation, which makes it easier to read.
Only the first 1000 characters are printed here, to keep the output short.

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.
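To see what traversing the tree looks like, here is a small sketch on a made-up HTML snippet. The snippet and its contents are invented for illustration, and `html.parser` (Python's built-in parser) is used so the example runs without `lxml`:

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="top">Members</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Dot notation returns the first tag with that name
page_title = demo_soup.title.string
first_paragraph = demo_soup.p.text
# Each tag knows its parent, so we can move up the tree as well
heading_parent = demo_soup.h1.parent.name

print(page_title)
print(first_paragraph)
print(heading_parent)
```

Dot notation only ever returns the first matching tag, which is why the search methods in the next step are usually more useful on real pages.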

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**.

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [None]:
# Find every element on the page with the <a> tag (links)
a_tags = soup.find_all("a")

# Print the first 10 links found on the page
print(a_tags[:10])

[<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="af" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-za"></span> Afrikaans
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="sq" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-al"></span> Albanian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="ar" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-ae"></span> Arabic
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="hy" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-am"></span> Armenian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="az" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-az"></span> Azerbaijani
      

This searches for all the `<a>` tags on the page (which normally represent links).
`find_all("a")` returns a list with all the `<a>` elements.
The first 10 results are printed to show what they look like.
This is useful for seeing which pages the site links to (such as the senators' pages in this case).

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object.

These two lines of code are equivalent:

In [None]:
# Find all the <a> elements using the find_all() method
a_tags = soup.find_all("a")

# Alternatively, find the <a> elements using the shorthand notation (equivalent to find_all("a"))
a_tags_alt = soup("a")

# Print the first link found with each method to check that both return the same thing
print(a_tags[0])
print(a_tags_alt[0])


<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>
<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>


These two forms (`soup.find_all("a")` and `soup("a")`) do exactly the same thing.
Both return a list with all the `<a>` elements in the HTML.
The first element of each list is printed to show that they are the same.

How many links did we obtain?

In [None]:
print(len(a_tags))  # Print the total number of <a> tags (links) found on the page

270


This prints the total number of `<a>` tags found on the page.
There will generally be many links, and not all of them will be useful for your goal.

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get more hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes?

We can do this by passing an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="notranslate"`.

In [None]:
# Get only the <a> tags that have the class "notranslate"
side_menus = soup("a", class_="notranslate")

# Show the first 2 <a> tags with that class as examples
side_menus[:2]


[<a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>,
 <a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>]

This keeps only the `<a>` links whose `class` attribute is `notranslate`.
On this page, those appear to be the links on the senators' names, as the output above shows.
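The same filtering can be tried on a small, made-up snippet, which makes it easier to see what each call returns. The HTML below is invented for illustration; only the class name `notranslate` is borrowed from the page above:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a class="notranslate" href="/members/1">Ada</a>
  <a class="notranslate featured" href="/members/2">Grace</a>
  <a class="menu" href="/home">Home</a>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# All <a> tags, regardless of attributes
all_links = demo_soup.find_all("a")

# class_ has a trailing underscore because `class` is a reserved word in Python;
# it matches any tag whose class list includes "notranslate"
name_links = demo_soup.find_all("a", class_="notranslate")

# Other attributes can be passed in an attrs dictionary
home_links = demo_soup.find_all("a", attrs={"href": "/home"})

print(len(all_links), len(name_links), len(home_links))
```

Note that the second link is matched even though it carries two classes, since `class_` checks each class on the tag individually.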

Another way to search for elements on a page is with a **CSS selector**, which is often more concise. For this we use a different method called `select()`: pass it a CSS selector string and it returns all the elements that match.

For instance, the selector `"a.notranslate"` returns all `a` tags with class `notranslate`, just like the example above. Below, we use `"div.member-overlay"` to select every `div` with class `member-overlay`.

In [None]:
# Select every <div> that has the class "member-overlay", using a CSS selector
selected = soup.select("div.member-overlay")

# Show the first 5 elements found with that class
selected[:5]


[<div class="member-overlay">
 <h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
 <p class="card-text">
                                             Republican Caucus Chair
                                             <br/>47th District
                                         </p>
 </div>,
 <div class="member-overlay">
 <h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
 <p class="card-text">
                                             Republican Caucus Chair
                                             <br/>47th District
                                         </p>
 </div>,
 <div class="member-overlay">
 <h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3316">Omar Aquino</a> (D)</h5>
 <p class="card-text">
                                             Majority Caucus Chair
                                             <br/>2nd District
       

Another, more efficient way to search for elements is with CSS selectors.
`"div.member-overlay"` selects all the `<div>` tags that have the class `member-overlay`.
The first 5 elements found are displayed.
This method is very flexible and allows complex searches such as `"div > ul > li > a"`.
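As a rough sketch of how selectors compose, the made-up snippet below (invented for illustration, loosely modeled on the card layout above) compares a class selector, a descendant selector, and a child selector. `select_one` returns the first match, or `None` if there is none:

```python
from bs4 import BeautifulSoup

html = """
<div class="member-overlay">
  <h5 class="card-title"><a class="notranslate" href="/m/1">Ada</a> (D)</h5>
  <p class="card-text">1st District</p>
</div>
<div class="footer">
  <a href="/about">About</a>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Tag plus class: every <a> with class "notranslate"
by_class = demo_soup.select("a.notranslate")

# Descendant selector: <a> tags anywhere inside a div.member-overlay
nested = demo_soup.select("div.member-overlay a")

# Child selector: <a> tags that are direct children of a <div>
direct = demo_soup.select("div > a")

# select_one returns a single tag instead of a list
district = demo_soup.select_one("p.card-text").text

print([a.text for a in by_class])
print([a.text for a in nested])
print([a.text for a in direct])
print(district)
```

The child selector skips the first link because it sits inside an `h5`, not directly inside the `div`.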

## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

In [None]:
# Find all the <a> tags whose href attribute is exactly "/Legislation"
side_menus = soup.find_all("a", href="/Legislation")

# Show the first 10 matches found
side_menus[:10]



[<a aria-expanded="false" aria-haspopup="true" b-0yw6sxot5c="" data-toggle="dropdown" href="/Legislation" role="button">
 <span b-0yw6sxot5c="">LEGISLATION &amp; LAWS</span> <i b-0yw6sxot5c="" class="fa fa-chevron-down"></i>
 </a>,
 <a b-0yw6sxot5c="" href="/Legislation">Bills &amp; Resolutions</a>]

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [None]:
# Get all the links (<a>) that have the class "notranslate" and store them in a list
side_menu_links = soup.select("a.notranslate")

# Show the first link in that list
first_link = side_menu_links[0]
print(first_link)

# Print the type of first_link, which is a Beautiful Soup Tag object
print('Class: ', type(first_link))


<a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>
Class:  <class 'bs4.element.Tag'>


It's a Beautiful Soup tag! This means it has a `text` member:

In [None]:
# Print the visible text inside the first link, that is, what a user sees and can click on
print(first_link.text)


Neil Anderson


This is interesting because it ignores the HTML tags and returns only the text.

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [None]:
# Print the value of the first link's 'href' attribute, that is, the URL or path the link points to
print(first_link['href'])


/Senate/Members/Details/3312
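The `href` values on this page are relative paths, so they need a base URL before they can be requested. A minimal sketch, assuming the page URL used earlier as the base, and using `urljoin` from the standard library. It also shows `.get()`, which returns `None` for a missing attribute where square brackets would raise a `KeyError`:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed base: the page the link was scraped from
base_url = "http://www.ilga.gov/Senate/Members"

link = BeautifulSoup(
    '<a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>',
    "html.parser",
).a

# Resolve the relative href against the page URL
full_url = urljoin(base_url, link["href"])
print(full_url)

# .get() is the safer choice when an attribute may be absent
print(link.get("title"))
```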


## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

In [None]:
print(soup.prettify()[:500])  # Only the first 500 characters, to keep the output short


<!DOCTYPE html>
<html lang="en">
 <head id="Head1">
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta charset="utf-8"/>
  <!-- Meta Description -->
  <meta content="Welcome to the official government website of the Illinois General Assembly" name="description"/>
  <meta content="Legislative Information System" nam


In [None]:
# YOUR CODE HERE
# Find all the links with class 'notranslate'
no_translate = soup.find_all('a', class_='notranslate')

# Extract their href attributes
dropdown_hrefs = [link['href'] for link in no_translate]

# Show the results
print(dropdown_hrefs)


['/Senate/Members/Details/3312', '/Senate/Members/Details/3312', '/Senate/Members/Details/3316', '/Senate/Members/Details/3316', '/Senate/Members/Details/3383', '/Senate/Members/Details/3383', '/Senate/Members/Details/3413', '/Senate/Members/Details/3413', '/Senate/Members/Details/3337', '/Senate/Members/Details/3337', '/Senate/Members/Details/3386', '/Senate/Members/Details/3386', '/Senate/Members/Details/3317', '/Senate/Members/Details/3317', '/Senate/Members/Details/3403', '/Senate/Members/Details/3403', '/Senate/Members/Details/3410', '/Senate/Members/Details/3410', '/Senate/Members/Details/3443', '/Senate/Members/Details/3443', '/Senate/Members/Details/3291', '/Senate/Members/Details/3291', '/Senate/Members/Details/3329', '/Senate/Members/Details/3329', '/Senate/Members/Details/3334', '/Senate/Members/Details/3334', '/Senate/Members/Details/3407', '/Senate/Members/Details/3407', '/Senate/Members/Details/3339', '/Senate/Members/Details/3339', '/Senate/Members/Details/3412', '/Senat
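Each link appears twice in the output above, most likely because every senator's name is linked in two places on the page. One way to drop the repeats while keeping the original order is `dict.fromkeys`, sketched here on a shortened copy of that list:

```python
# A shortened copy of the list printed above
hrefs = [
    '/Senate/Members/Details/3312',
    '/Senate/Members/Details/3312',
    '/Senate/Members/Details/3316',
    '/Senate/Members/Details/3316',
    '/Senate/Members/Details/3383',
    '/Senate/Members/Details/3383',
]

# Dictionaries preserve insertion order, so this removes duplicates
# without reordering the links (a set would not guarantee the order)
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```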

<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [None]:
# Make a GET request to the URL that shows the status of a bill on the Illinois General Assembly website
req = requests.get('https://ilga.gov/Legislation/BillStatus?DocNum=818&DocTypeID=HR&GA=101&GAID=9&LegID=34456&SessionID=51')

# Store the HTML content received in the variable 'src'
src = req.text

# Parse the HTML content with BeautifulSoup, using the 'lxml' parser, to structure it and make data extraction easier
soup = BeautifulSoup(src, "lxml")


## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

**DYNAMIC WEB PAGE**

When scraping the page https://ilga.gov/Legislation/BillStatus?... with Requests and Beautiful Soup, I ran into a problem: searching for `<tr>` elements (table rows) returned an empty list. This happens because the page loads its content dynamically with JavaScript.

When we call `requests.get(...)`, we download only the static HTML that the server delivers initially. In this case, however, the table holding the data is not in that original HTML; it is generated afterwards by JavaScript, once the page has fully loaded in the browser. Because Requests cannot execute JavaScript, the table simply does not appear in the downloaded content.

For this reason, I decided to use Selenium, a tool that simulates the behavior of a real browser (such as Chrome or Firefox). Selenium does execute JavaScript, which gives access to the full content of the page, including the dynamically generated data. With it, I was able to capture the final rendered HTML and then use Beautiful Soup to find all the `<tr>` rows that make up the table.

In [None]:
# Get all the table rows (<tr>) from the parsed HTML
rows = soup.find_all("tr")

# Count how many table rows were found
len(rows)

# Check whether the string "<tr" appears in the downloaded HTML (to confirm whether it contains table rows)
print("Is '<tr' in the downloaded HTML?", "<tr" in src)

# Show the first 500 characters of the HTML to get an idea of its initial structure
print("First 500 characters:\n", src[:500])


Is '<tr' in the downloaded HTML? False
First 500 characters:
 <!DOCTYPE html>
<html lang="en">
<head id="Head1">
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta charset="utf-8" />
    <meta charset="UTF-8">
    <!-- Meta Description -->
    <meta name="description" content="Welcome to the official government website of the Illinois General Assembly">
    <meta name="contactName


In [None]:
# Install the Selenium library, which automates web browsers for testing or for scraping dynamic pages
%pip install selenium




In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # To configure Chrome browser options
from bs4 import BeautifulSoup
import time

# Configure options for Chrome
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')  # Avoids permission problems in some environments
chrome_options.add_argument('--headless')  # Runs Chrome without opening a visible window
chrome_options.add_argument('--disable-dev-shm-usage')  # Avoids errors on systems with little shared memory

# Create the Chrome browser with the options above
driver = webdriver.Chrome(options=chrome_options)

try:
    # Load the dynamic page we want to analyze
    driver.get('https://ilga.gov/Legislation/BillStatus?DocNum=818&DocTypeID=HR&GA=101&GAID=9&LegID=34456&SessionID=51')

    # Pause for 5 seconds so the JavaScript content can finish loading
    time.sleep(5)

    # Get the rendered HTML, including content generated by JavaScript
    html = driver.page_source

    # Parse the rendered HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')

    # Find all the table rows (<tr>) in the HTML
    rows = soup.find_all("tr")

    # Print the number of rows found
    print(f"Number of rows found: {len(rows)}")

except Exception as e:
    # If something goes wrong, print what happened
    print(f"An error occurred: {e}")

finally:
    # Close the browser to free resources
    driver.quit()


Number of rows found: 18


⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [None]:
# Returns every element matching the 'tr tr tr' CSS selector on the page
# rows = soup.select('tr tr tr')

# for row in rows[:5]:
#     print(row, '\n')

This code finds nothing on this page, because it has no `tr` tags nested three levels deep inside other `tr` tags.

In [None]:
# Select all the rows (<tr>) inside any table (<table>)
rows = soup.select('table tr')
print(f"Number of rows found: {len(rows)}")

# If rows were found, print the first 5 in a readable format
if rows:
    for row in rows[:5]:
        print(row.prettify(), '\n')
else:
    # If there are no rows, report that no <tr> elements were found
    print("No <tr> rows were found on the page.")


Number of rows found: 18
<tr>
 <th>
  Date
 </th>
 <th>
  Chamber
 </th>
 <th>
  Action
 </th>
</tr>
 

<tr>
 <td align="left" class="content" valign="top" width="13%">
  11/02/2007
 </td>
 <td align="left" class="content" valign="top" width="12%">
  House
 </td>
 <td align="left" class="content" valign="top" width="75%">
  Filed with the Clerk by
  <a href="../../house/members/details/1307">
   Rep. Daniel V. Beiser
  </a>
 </td>
</tr>
 

<tr>
 <td align="left" class="content" valign="top" width="13%">
  11/07/2007
 </td>
 <td align="left" class="content" valign="top" width="12%">
  House
 </td>
 <td align="left" class="content" valign="top" width="75%">
  Added Chief Co-Sponsor
  <a href="../../house/members/details/1291">
   Rep. John E. Bradley
  </a>
 </td>
</tr>
 

<tr>
 <td align="left" class="content" valign="top" width="13%">
  11/07/2007
 </td>
 <td align="left" class="content" valign="top" width="12%">
  House
 </td>
 <td align="left" class="content" valign="top" widt

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [None]:
# Take the third row from the list of rows found (index 2)
example_row = rows[2]

# Print that row in a readable format to understand its HTML structure
print(example_row.prettify())


<tr>
 <td align="left" class="content" valign="top" width="13%">
  11/07/2007
 </td>
 <td align="left" class="content" valign="top" width="12%">
  House
 </td>
 <td align="left" class="content" valign="top" width="75%">
  Added Chief Co-Sponsor
  <a href="../../house/members/details/1291">
   Rep. John E. Bradley
  </a>
 </td>
</tr>



This code prints the third row, that is, the row at index 2 (counting 0, 1, 2).

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.content`.
* We could combine both and use the selector `td.content`.

In [None]:
# Loop over all the cells (<td>) in the example row and print them
for cell in example_row.select('td'):
    print(cell)
print()

# Find and show every element in the row that has the class 'content'
for cell in example_row.select('.content'):
    print(cell)
print()

# Find and show only the <td> cells that also have the class 'content'
for cell in example_row.select('td.content'):
    print(cell)
print()


<td align="left" class="content" valign="top" width="13%">
11/07/2007                                        </td>
<td align="left" class="content" valign="top" width="12%">
House                                        </td>
<td align="left" class="content" valign="top" width="75%">
Added Chief Co-Sponsor <a href="../../house/members/details/1291">Rep. John E. Bradley</a></td>

<td align="left" class="content" valign="top" width="13%">
11/07/2007                                        </td>
<td align="left" class="content" valign="top" width="12%">
House                                        </td>
<td align="left" class="content" valign="top" width="75%">
Added Chief Co-Sponsor <a href="../../house/members/details/1291">Rep. John E. Bradley</a></td>

<td align="left" class="content" valign="top" width="13%">
11/07/2007                                        </td>
<td align="left" class="content" valign="top" width="12%">
House                                        </td>
<td align="le

We can confirm that these are all the same.

In [None]:
# Check that the three ways of selecting elements inside example_row return exactly the same result
# This confirms that every <td> has the class 'content' and that the selection is consistent
assert example_row.select('td') == example_row.select('.content') == example_row.select('td.content')


Let's use the selector `td.content` to be as specific as possible.

In [None]:
# Select only the <td> cells with the class 'content' inside example_row
detail_cells = example_row.select('td.content')

# Show the list of cells found
detail_cells


[<td align="left" class="content" valign="top" width="13%">
 11/07/2007                                        </td>,
 <td align="left" class="content" valign="top" width="12%">
 House                                        </td>,
 <td align="left" class="content" valign="top" width="75%">
 Added Chief Co-Sponsor <a href="../../house/members/details/1291">Rep. John E. Bradley</a></td>]

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [None]:
# Extract just the text of each cell with class 'content' and store it in a list
row_data = [cell.text for cell in detail_cells]

# Print the list with the text of each cell to inspect the data
print(row_data)


['\n11/07/2007                                        ', '\nHouse                                        ', '\nAdded Chief Co-Sponsor Rep. John E. Bradley']
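The extracted strings still carry newlines and trailing spaces. Below is a minimal sketch of one way to tidy them, using `str.strip()` and, as an assumption about how the date might be used later, `datetime.strptime` with the `%m/%d/%Y` format seen in the table. The row is a trimmed copy of the example row, parsed with the built-in `html.parser`:

```python
from datetime import datetime
from bs4 import BeautifulSoup

row_html = """
<table><tr>
<td class="content">
11/07/2007                    </td>
<td class="content">
House                         </td>
<td class="content">
Added Chief Co-Sponsor <a href="../../house/members/details/1291">Rep. John E. Bradley</a></td>
</tr></table>
"""
row = BeautifulSoup(row_html, "html.parser").find("tr")

# strip() removes the leading newline and the trailing spaces of each cell
cells = [cell.text.strip() for cell in row.select("td.content")]
date_text, chamber, action = cells

# Convert the date string into a date object so it can be sorted or compared
action_date = datetime.strptime(date_text, "%m/%d/%Y").date()

print(cells)
print(action_date)
```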


Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. In this table, the three cells hold the date, the chamber, and the action.

In [None]:
# Print the first value, which is the date
print(row_data[0])  # Date

# Print the second value, which is the chamber
print(row_data[1])  # Chamber

# Print the third value, which is the action taken
print(row_data[2])  # Action



11/07/2007                                        

House                                        

Added Chief Co-Sponsor Rep. John E. Bradley


## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually hold data; the first one, for example, is the table header. We'll need to do some cleaning before we can proceed. Take a look at some examples:

In [None]:
# Print the full HTML of the first row (row 0)
print('Row 0:\n', rows[0], '\n')

# Print the full HTML of the second row (row 1)
print('Row 1:\n', rows[1], '\n')

# Print the full HTML of the last row of the table
print('Last Row:\n', rows[-1])


Row 0:
 <tr>
<th>Date</th>
<th>Chamber</th>
<th>Action</th>
</tr> 

Row 1:
 <tr>
<td align="left" class="content" valign="top" width="13%">
11/02/2007                                        </td>
<td align="left" class="content" valign="top" width="12%">
House                                        </td>
<td align="left" class="content" valign="top" width="75%">
Filed with the Clerk by <a href="../../house/members/details/1307">Rep. Daniel V. Beiser</a></td></tr> 

Last Row:
 <tr>
<td align="left" class="content" valign="top" width="13%">
<b>5/19/2008</b>
</td>
<td align="left" class="content" valign="top" width="12%">
<b>House</b>
</td>
<td align="left" class="content" valign="top" width="75%">
<b>Tabled By Sponsor <a href="../../house/members/details/1307">Rep. Daniel V. Beiser</a></b>
</td></tr>


When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and the right one will depend on the website. We'll show a few here to give you an idea of how to do this.

In [None]:
# Number of direct children of row 0, the header row (this counts whitespace strings between tags, not just cells)
print(len(rows[0]))

# Number of direct children of row 1, a data row
print(len(rows[1]))

# Number of direct children of row 2, a data row
print(len(rows[2]))

# Number of direct children of row 3, a data row
print(len(rows[3]))


7
6
6
6
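
Why does `len()` on a row not simply return the number of cells? `len(tag)` counts every direct child of the tag, and the whitespace between tags counts as a child too. The sketch below uses a made-up two-row table (`sample_html` is an assumption, not the real page) and the built-in `html.parser` to compare that count with the number of actual cells:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table mimicking the layout of the rows printed above
sample_html = (
    "<table>"
    "<tr>\n<th>Date</th>\n<th>Chamber</th>\n<th>Action</th>\n</tr>"
    "<tr>\n<td>11/02/2007</td>\n<td>House</td>\n<td>Filed with the Clerk</td></tr>"
    "</table>"
)
demo_rows = BeautifulSoup(sample_html, "html.parser").select("tr")

for demo_row in demo_rows:
    # len(tag) counts all direct children, including whitespace-only strings
    n_children = len(demo_row)
    # find_all counts only the matching cell tags
    n_cells = len(demo_row.find_all(["td", "th"]))
    print(n_children, n_cells)
```

Both rows hold three cells, yet their lengths differ because of a single trailing newline. That makes the raw length a fragile thing to filter on.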


Perhaps good rows are the ones with exactly three `<td>` cells. Let's check:

In [None]:
# Build a list of only the rows that have exactly 3 <td> cells, discarding rows with a different structure
good_rows = [row for row in rows if len(row.find_all('td')) == 3]

# Print how many rows with 3 cells were found
print(f"Rows with 3 cells found: {len(good_rows)}\n")

# Show the first well-formed row, pretty-printed
print(good_rows[0].prettify(), '\n')

# Show the second-to-last well-formed row
print(good_rows[-2].prettify(), '\n')

# Show the last well-formed row
print(good_rows[-1].prettify())


Rows with 3 cells found: 17

<tr>
 <td align="left" class="content" valign="top" width="13%">
  11/02/2007
 </td>
 <td align="left" class="content" valign="top" width="12%">
  House
 </td>
 <td align="left" class="content" valign="top" width="75%">
  Filed with the Clerk by
  <a href="../../house/members/details/1307">
   Rep. Daniel V. Beiser
  </a>
 </td>
</tr>
 

<tr>
 <td align="left" class="content" valign="top" width="13%">
  5/19/2008
 </td>
 <td align="left" class="content" valign="top" width="12%">
  House
 </td>
 <td align="left" class="content" valign="top" width="75%">
  Motion Prevailed
 </td>
</tr>
 

<tr>
 <td align="left" class="content" valign="top" width="13%">
  <b>
   5/19/2008
  </b>
 </td>
 <td align="left" class="content" valign="top" width="12%">
  <b>
   House
  </b>
 </td>
 <td align="left" class="content" valign="top" width="75%">
  <b>
   Tabled By Sponsor
   <a href="../../house/members/details/1307">
    Rep. Daniel V. Beiser
   </a>
  </b>
 </td>
<

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [None]:
# Select all <td> cells with the class 'content' in the third row (index 2)
rows[2].select('td.content')


[<td align="left" class="content" valign="top" width="13%">
 11/07/2007                                        </td>,
 <td align="left" class="content" valign="top" width="12%">
 House                                        </td>,
 <td align="left" class="content" valign="top" width="75%">
 Added Chief Co-Sponsor <a href="../../house/members/details/1291">Rep. John E. Bradley</a></td>]

In [None]:
# Show the <td> cells with class 'content' in the last row, which we flagged as a possible "bad" row
print(rows[-1].select('td.content'), '\n')

# Show the <td> cells with class 'content' in the row at index 5, a "good" row
print(rows[5].select('td.content'), '\n')

# Keep every row that has at least one <td> cell with class 'content'
good_rows = [row for row in rows if row.select('td.content')]

print("Checking rows...\n")

# Show the first valid row found
print(good_rows[0], '\n')

# Show the last valid row found
print(good_rows[-1])


[<td align="left" class="content" valign="top" width="13%">
<b>5/19/2008</b>
</td>, <td align="left" class="content" valign="top" width="12%">
<b>House</b>
</td>, <td align="left" class="content" valign="top" width="75%">
<b>Tabled By Sponsor <a href="../../house/members/details/1307">Rep. Daniel V. Beiser</a></b>
</td>] 

[<td align="left" class="content" valign="top" width="13%">
11/07/2007                                        </td>, <td align="left" class="content" valign="top" width="12%">
House                                        </td>, <td align="left" class="content" valign="top" width="75%">
Added Chief Co-Sponsor <a href="../../house/members/details/1348">Rep. Fred Crespo</a></td>] 

Checking rows...

<tr>
<td align="left" class="content" valign="top" width="13%">
11/02/2007                                        </td>
<td align="left" class="content" valign="top" width="12%">
House                                        </td>
<td align="left" class="content" valign="top"

Looks like we found something that worked!
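
Note that the `td.content` test drops the header row, but the bolded final row also has `td.content` cells, so it stays in the list. If you wanted to drop that row as well, one option is sketched below. It rests on the assumption that bold text is what marks the row to skip, and it uses a made-up three-row table (`sample_html`) rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row, a plain data row, and a bolded final row
sample_html = """
<table>
<tr><th>Date</th><th>Chamber</th><th>Action</th></tr>
<tr><td class="content">11/02/2007</td><td class="content">House</td><td class="content">Filed with the Clerk</td></tr>
<tr><td class="content"><b>5/19/2008</b></td><td class="content"><b>House</b></td><td class="content"><b>Tabled By Sponsor</b></td></tr>
</table>
"""
demo_rows = BeautifulSoup(sample_html, "html.parser").select("tr")

# First filter: keep rows that contain at least one td.content cell
with_content = [r for r in demo_rows if r.select("td.content")]

# Second filter (assumption: bold marks the row to skip): drop rows containing a <b> tag
plain_only = [r for r in with_content if r.find("b") is None]

print(len(demo_rows), len(with_content), len(plain_only))
```

Whether the bolded row is junk or the most important row in the table depends on your question, so treat the second filter as optional.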

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [None]:
# Create an empty list to store the extracted data
members = []

# Keep only the valid rows, discarding those without cells of class 'content'
valid_rows = [row for row in rows if row.select('td.content')]

# Loop through each valid row
for row in valid_rows:
    # Select only the <td> cells that have the class 'content'
    detail_cells = row.select('td.content')

    # Keep only the text of each cell
    row_data = [cell.text for cell in detail_cells]

    # Assign each value to a descriptively named variable
    date = row_data[0]
    chamber = row_data[1]
    action = row_data[2]

    # Store the row's information in a tuple
    record = (date, chamber, action)

    # Append the tuple to the list
    members.append(record)


In [None]:
# Show the total number of records stored in the 'members' list
len(members)


17

Let's take a look at what we have in `members`.

In [None]:
# Print the first 5 tuples in the 'members' list to see examples of the extracted data
print(members[:5])


[('\n11/02/2007                                        ', '\nHouse                                        ', '\nFiled with the Clerk by Rep. Daniel V. Beiser'), ('\n11/07/2007                                        ', '\nHouse                                        ', '\nAdded Chief Co-Sponsor Rep. John E. Bradley'), ('\n11/07/2007                                        ', '\nHouse                                        ', '\nAdded Chief Co-Sponsor Rep. Robert F. Flider'), ('\n11/07/2007                                        ', '\nHouse                                        ', '\nAdded Chief Co-Sponsor Rep. Elizabeth Hernandez'), ('\n11/07/2007                                        ', '\nHouse                                        ', '\nAdded Chief Co-Sponsor Rep. Fred Crespo')]
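
The tuples are usable, but every field is still a padded string. Below is a minimal sketch of a post-processing step that strips the whitespace and parses the date. `clean_record` is a hypothetical helper, and the `%m/%d/%Y` format is an assumption based on the dates shown above:

```python
from datetime import datetime

# Two tuples copied from the output above
raw_members = [
    ('\n11/02/2007                                        ',
     '\nHouse                                        ',
     '\nFiled with the Clerk by Rep. Daniel V. Beiser'),
    ('\n11/07/2007                                        ',
     '\nHouse                                        ',
     '\nAdded Chief Co-Sponsor Rep. John E. Bradley'),
]

def clean_record(record):
    """Strip whitespace from each field and parse the date (hypothetical helper)."""
    date_text, chamber, action = (field.strip() for field in record)
    return {
        # Assumes dates are written month/day/four-digit year
        "date": datetime.strptime(date_text, "%m/%d/%Y").date(),
        "chamber": chamber,
        "action": action,
    }

cleaned = [clean_record(r) for r in raw_members]
print(cleaned[0])
```

With real `date` objects you can sort or filter the actions chronologically instead of comparing strings.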


## 🥊  Challenge: Get `href` elements pointing to members' bills

The starter code below retrieves information on:  

- the senator's name,
- their district number,
- and their party.

We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format.

The format for the list of bills for a given senator is:

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

to get something like:

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`

in which `MEMBER_ID=1911`.

You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.
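
To make the URL format concrete, here is a small sketch using only the standard library. It pulls `MemberID` out of the example URL above and then rebuilds a URL from its parts. `build_bills_url` is a hypothetical helper written for illustration; it is not part of the workshop code.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

bills_url = "http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True"

# parse_qs maps each query parameter to a list of its values
query = parse_qs(urlsplit(bills_url).query)
member_id = query["MemberID"][0]
print(member_id)

def build_bills_url(member_id, ga=98):
    """Assemble a bills URL in the format shown above (hypothetical helper)."""
    params = {"GA": ga, "MemberID": member_id, "Primary": "True"}
    return "http://www.ilga.gov/senate/SenatorBills.asp?" + urlencode(params)

print(build_bills_url(member_id))
```

The order of query parameters does not matter to the server, which is why the format string and the example URL above can list them differently.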

Your initial task is to modify the starter code below so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips:

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details.
* There are a _lot_ of different ways to use BeautifulSoup to get things done. Whatever you need to do to pull the `href` out is fine.
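
As a warm-up for the tips above, the sketch below shows dictionary-style attribute access and `urljoin` on a made-up row. The link targets in `sample_row` are placeholders, not real ILGA paths, and picking the anchor by its visible text is only one of several possible approaches:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Made-up row; the href values are placeholders, not real ILGA paths
sample_row = """
<table><tr>
<td class="detail"><a href="/senate/profile-placeholder">Jane Doe</a></td>
<td class="detail"><a href="/senate/bills-placeholder?MemberID=0000">Bills</a></td>
<td class="detail"><a href="/senate/committees-placeholder">Committees</a></td>
</tr></table>
"""
demo_row = BeautifulSoup(sample_row, "html.parser").select_one("tr")

# Collect every anchor in the row, then keep the one whose text is "Bills"
anchors = demo_row.select("td a")
bills_anchor = [a for a in anchors if a.get_text(strip=True) == "Bills"][0]

# Attributes are read like dictionary keys
relative = bills_anchor["href"]

# urljoin resolves the relative path against the site's base URL
full_path_demo = urljoin("http://www.ilga.gov", relative)
print(full_path_demo)
```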

The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`.

In [None]:
# # Make a GET request
# req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# # Read the content of the server’s response
# src = req.text
# # Soup it
# soup = BeautifulSoup(src, "lxml")
# # Create empty list to store our data
# members = []

# # Returns every ‘tr tr tr’ css selector in the page
# rows = soup.select('tr tr tr')
# # Get rid of junk rows
# rows = [row for row in rows if row.select('td.detail')]

# # Loop through all rows
# for row in rows:
#     # Select only those 'td' tags with class 'detail'
#     detail_cells = row.select('td.detail')
#     # Keep only the text in each of those cells
#     row_data = [cell.text for cell in detail_cells]
#     # Collect information
#     name = row_data[0]
#     district = int(row_data[3])
#     party = row_data[4]

#     # YOUR CODE HERE
#     full_path = ''

#     # Store in a tuple
#     senator = (name, district, party, full_path)
#     # Append to list
#     members.append(senator)

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Make a GET request to the current list of Illinois state senators
url = "https://www.ilga.gov/Senate/Members/list"
req = requests.get(url)

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(req.text, "lxml")

# Create a list to store the data
members = []

# Select every row (<tr>) inside a table on the page
rows = soup.select("table tr")

# Loop over each row; rows without a link, such as a header, are skipped
for row in rows:
    # Find the first link in the row (this is where the senator's name is)
    link = row.find("a", href=True)
    if link:
        name = link.get_text(strip=True)  # Get the senator's name
        relative_href = link["href"]      # Extract the relative href, e.g. /Senate/Members/Details/3312
        full_url = urljoin("https://www.ilga.gov", relative_href)  # Convert to a full URL

        # Store the data in a tuple
        member_data = (name, full_url)
        members.append(member_data)

# Show the results
for m in members:
    print(m)


('Neil Anderson', 'https://www.ilga.gov/Senate/Members/Details/3312')
('Omar Aquino', 'https://www.ilga.gov/Senate/Members/Details/3316')
('Li Arellano, Jr.', 'https://www.ilga.gov/Senate/Members/Details/3383')
('Chris Balkema', 'https://www.ilga.gov/Senate/Members/Details/3413')
('Christopher Belt', 'https://www.ilga.gov/Senate/Members/Details/3337')
('Terri Bryant', 'https://www.ilga.gov/Senate/Members/Details/3386')
('Cristina Castro', 'https://www.ilga.gov/Senate/Members/Details/3317')
('Javier L. Cervantes', 'https://www.ilga.gov/Senate/Members/Details/3403')
('Andrew S. Chesney', 'https://www.ilga.gov/Senate/Members/Details/3410')
('Lakesia Collins', 'https://www.ilga.gov/Senate/Members/Details/3443')
('Bill Cunningham', 'https://www.ilga.gov/Senate/Members/Details/3291')
('John F. Curran', 'https://www.ilga.gov/Senate/Members/Details/3329')
('Donald P. DeWitte', 'https://www.ilga.gov/Senate/Members/Details/3334')
('Mary Edly-Allen', 'https://www.ilga.gov/Senate/Members/Details/3

In [None]:
# Uncomment to test
# members[:5]

## 🥊  Challenge: Modularize Your Code

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator.

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_senators_from_url(url):
    """
    Take the URL of the list of senators on the ILGA website
    and return a list of (name, detail_URL) tuples.

    Parameters:
        url (str): URL of the page containing the table of senators.

    Returns:
        list[tuple]: List of (name, full URL of the details page) tuples.
    """
    try:
        # Make the HTTP request
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception on a 4xx or 5xx status

        # Parse the HTML content
        soup = BeautifulSoup(response.text, "lxml")

        # List where the results are stored
        members = []

        # Find all the table rows
        rows = soup.select("table tr")

        for row in rows:
            # Find the first link with an href inside the row
            link = row.find("a", href=True)
            if link:
                name = link.get_text(strip=True)
                print(name)
                relative_href = link["href"]
                print(relative_href)
                full_url = urljoin(url, relative_href)  # Build the absolute URL

                members.append((name, full_url))

        return members

    except Exception as e:
        print(f"Error processing the URL: {e}")
        return []

# Example usage
url = "https://www.ilga.gov/Senate/Members/list"
senators = get_senators_from_url(url)

# Print the results
    print(senator)


Neil Anderson
/Senate/Members/Details/3312
Omar Aquino
/Senate/Members/Details/3316
Li Arellano, Jr.
/Senate/Members/Details/3383
Chris Balkema
/Senate/Members/Details/3413
Christopher Belt
/Senate/Members/Details/3337
Terri Bryant
/Senate/Members/Details/3386
Cristina Castro
/Senate/Members/Details/3317
Javier L. Cervantes
/Senate/Members/Details/3403
Andrew S. Chesney
/Senate/Members/Details/3410
Lakesia Collins
/Senate/Members/Details/3443
Bill Cunningham
/Senate/Members/Details/3291
John F. Curran
/Senate/Members/Details/3329
Donald P. DeWitte
/Senate/Members/Details/3334
Mary Edly-Allen
/Senate/Members/Details/3407
Laura Ellman
/Senate/Members/Details/3339
Paul Faraci
/Senate/Members/Details/3412
Sara Feigenholtz
/Senate/Members/Details/3376
Laura Fine
/Senate/Members/Details/3338
Dale Fowler
/Senate/Members/Details/3318
Suzy Glowiak Hilton
/Senate/Members/Details/3341
Graciela Guzmán
/Senate/Members/Details/3442
Michael W. Halpin
/Senate/Members/Details/3408
Don Harmon
/Senate/Me
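
One design choice worth considering when you modularize is to keep the parsing separate from the network request, so the parsing logic can be checked against a fixed HTML string without contacting the server. The sketch below assumes that split. `parse_senators` is a hypothetical helper, and `sample_html` is a made-up table that reuses two names from the output above:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def parse_senators(html, base_url):
    """Return (name, absolute_url) tuples from the first link in each table row."""
    page = BeautifulSoup(html, "html.parser")
    found = []
    for row in page.select("table tr"):
        link = row.find("a", href=True)
        if link:
            found.append((link.get_text(strip=True), urljoin(base_url, link["href"])))
    return found

# Made-up table standing in for the real page
sample_html = """
<table>
<tr><th>Senator</th></tr>
<tr><td><a href="/Senate/Members/Details/3312">Neil Anderson</a></td></tr>
<tr><td><a href="/Senate/Members/Details/3316">Omar Aquino</a></td></tr>
</table>
"""

demo_senators = parse_senators(sample_html, "https://www.ilga.gov/Senate/Members/list")
print(demo_senators)
```

A function that fetches the page then only needs to call `requests.get(url)` and hand `response.text` to the parser.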

## 🥊 Take-home Challenge: Writing a Scraper Function

We want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - returning a _list_ of tuples, each with:
      - the bill ID (1st column)
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
This function has been partially completed. Fill in the rest.

In [None]:
import requests
from bs4 import BeautifulSoup

def get_bills(url):
    # Get the HTML content of the page
    src = requests.get(url).text
    soup = BeautifulSoup(src, "lxml")

    # Select every table row on the page
    rows = soup.select('tr')
    bills = []

    for row in rows:
        # Select only the cells with class 'billlist'
        cells = row.select('td.billlist')

        # A bill row must have at least 5 cells with class 'billlist'
        if len(cells) >= 5:
            bill_id = cells[0].get_text(strip=True)
            description = cells[1].get_text(strip=True)
            chamber = cells[2].get_text(strip=True)
            last_action = cells[3].get_text(strip=True)
            last_action_date = cells[4].get_text(strip=True)

            # Pack the values into a tuple
            bill = (bill_id, description, chamber, last_action, last_action_date)
            bills.append(bill)

    return bills


In [None]:
# Uncomment to test your code
# test_url = senate_members[0][3]
# get_bills(test_url)[0:5]

### Scrape All Bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.

**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site.
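
The pattern the note describes can be sketched without touching the real site. Below, `scrape_all` is a hypothetical helper that takes the fetching function as an argument, so a fake fetcher and placeholder members (both made up for illustration) can stand in for `get_bills()` and the real list:

```python
import time

def scrape_all(members, fetch, delay=1.0):
    """Map each district to fetch(url), pausing `delay` seconds after each request."""
    results = {}
    for name, district, party, bills_url in members:
        results[district] = fetch(bills_url)
        # Pause so the requests are spread out rather than sent all at once
        time.sleep(delay)
    return results

# Placeholder data and a fake fetcher, so nothing is requested from a real server
demo_members = [
    ("Senator A", 1, "D", "https://example.com/bills?MemberID=1"),
    ("Senator B", 2, "R", "https://example.com/bills?MemberID=2"),
]

def fake_fetch(url):
    return [("SB0001", f"bill from {url}")]

# A tiny delay keeps the demo fast; use delay=1 or more against a real site
demo_bills = scrape_all(demo_members, fake_fetch, delay=0.01)
print(sorted(demo_bills))
```

In the real loop you would pass `get_bills` as `fetch` and keep the one-second delay.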

In [None]:
# # YOUR CODE HERE
# import time

# # We assume you already have this list with each senator's name, district, party, and bills URL
# # For example:
# # members = [
# #     ('Neil Anderson', 47, 'R', 'https://www.ilga.gov/Senate/SenatorBills.asp?MemberID=3312&GA=103&Primary=True'),
# #     ...
# # ]

# bills_dict = {}

# for name, district, party, bills_url in members:
#     print(f"Fetching bills for district {district} ({name})...")

#     # Call the function that retrieves the bills
#     bills = get_bills(bills_url)

#     # Associate the list of bills with the corresponding district
#     bills_dict[district] = bills

#     # Sleep 1 second between requests out of respect for the server
#     time.sleep(1)

# # (Optional) Check the first few entries
# for district, bills in list(bills_dict.items())[:3]:
#     print(f"\nDistrict {district} has {len(bills)} bills:")
#     for bill in bills[:2]:  # Show only the first 2 per district
#         print("  ➤", bill)


I have left this cell commented out until I have a working URL to test with that meets these parameters.

In [None]:
# Uncomment to test your code
# bills_dict[52]