*Note: the page this notebook currently works with is https://www.ilga.gov/Senate/Members, because the original URL now redirects to it.*

# Web Scraping with Beautiful Soup

* * *

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scrape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

*This section installs the libraries we need, in this case requests and beautifulsoup4.*

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [1]:
%pip install requests



In [3]:
%pip install beautifulsoup4



We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [4]:
%pip install lxml



In [5]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

<a id='extract'></a>

# Extracting and Parsing HTML

In order to successfully scrape and analyze HTML, we'll go through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

*Here we define the URL that the GET request is sent to.*

In [6]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])

<!DOCTYPE html>
<html lang="en">
<head id="Head1">
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta charset="utf-8" />
    <meta charset="UTF-8">
    <!-- Meta Description -->
    <meta name="description" content="Welcome to the official government website of the Illinois General Assembly">
    <meta name="contactName" content="Legislative Information System">
    <meta name="contactOrganization" content="LIS Staff Services">
    <meta name="contactStreetAddress1" content="705 Stratton Office Building">
    <meta name="contactCity" content="Springfield">
    <meta name="contactZipcode" content="62706">
    <meta name="contactNetworkAddress" content="webmaster@ilga.gov">
    <meta name="contactPhoneNumber" content="217-782-3944">
    <meta name="contactFaxNumber" content="217-524-6059">
    <meta name
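Because a site can be reorganized, it may help to check whether a request was redirected before parsing the result. The sketch below is one possible way to do that. The helper names `was_redirected` and `fetch_html` are illustrative and not part of the workshop, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def was_redirected(resp):
    """Return True if Requests followed at least one redirect for this response."""
    return len(resp.history) > 0


def fetch_html(url, timeout=10):
    """Download a page, fail loudly on HTTP errors, and report redirects."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    if was_redirected(resp):
        # resp.url holds the final URL after all redirects
        print(f"Redirected: {url} -> {resp.url}")
    return resp.text
```

Calling `fetch_html('http://www.ilga.gov/senate/default.asp')` would then print the final URL whenever the server redirects the request.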

## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` function to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [21]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html lang="en">
 <head id="Head1">
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta charset="utf-8"/>
  <!-- Meta Description -->
  <meta content="Welcome to the official government website of the Illinois General Assembly" name="description"/>
  <meta content="Legislative Information System" name="contactName"/>
  <meta content="LIS Staff Services" name="contactOrganization"/>
  <meta content="705 Stratton Office Building" name="contactStreetAddress1"/>
  <meta content="Springfield" name="contactCity"/>
  <meta content="62706" name="contactZipcode"/>
  <meta content="webmaster@ilga.gov" name="contactNetworkAddress"/>
  <meta content="217-782-3944" name="contactPhoneNumber"/>
  <meta content="217-524-6059" name="contactFaxNumber"/>
  <meta content="State Of Illinois" name="originatorJur

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.
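To make "traversing" concrete, here is a minimal sketch on a tiny hand-written page (not the real ILGA markup). It uses a separate variable, `demo_soup`, so it does not overwrite the `soup` object above, and it uses Python's built-in `html.parser`, so it runs even without `lxml`.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used only for illustration
html = """
<html>
  <head>
    <title>Demo page</title>
    <meta name="description" content="A small example">
  </head>
  <body><p>Hello, <b>soup</b>!</p></body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup or of other tags
print(demo_soup.title.string)           # text of the <title> tag
print(demo_soup.head.meta["content"])   # attribute of the first <meta> in <head>
print(demo_soup.body.p.get_text())      # all text inside the first <p>
```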

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**.

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

*Here we search for `a` tags using Beautiful Soup.*

In [9]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

[<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="af" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-za"></span> Afrikaans
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="sq" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-al"></span> Albanian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="ar" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-ae"></span> Arabic
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="hy" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-am"></span> Armenian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="az" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-az"></span> Azerbaijani
      

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object.

These two lines of code are equivalent:

In [10]:
a_tags = soup.find_all("a")
a_tags_alt = soup("a")
print(a_tags[0])
print(a_tags_alt[0])

<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>
<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>


How many links did we obtain?

In [11]:
print(len(a_tags))

270


That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get a large number of hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes?

We can do this by passing an additional argument to `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_="sidemenu"`.

In [39]:
# Get only the 'a' tags in 'sidemenu' class
side_menus = soup("a", class_="sidemenu")
side_menus[:5]

[]

*The result is empty because the class `sidemenu` does not exist on the page we receive: the request is redirected to https://www.ilga.gov/Senate/Members. We will therefore work with the elements of class `member-card`, which hold each senator's information, to validate the results:*

In [55]:
members = soup.find_all("div", class_="member-card mb-4")
members[:5]

[<div class="member-card mb-4" onclick="goToURL('Members/Details/3312')" style="background-image: url('https://cdn.ilga.gov/assets/img/members/%7B90CDA259-1DEA-4D18-AE97-30051E03D154%7D.jpg');">
 <div class="member-overlay">
 <h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
 <p class="card-text">
                                             Republican Caucus Chair
                                             <br/>47th District
                                         </p>
 </div>
 </div>,
 <div class="member-card mb-4" onclick="goToURL('Members/Details/3312')" style="background-image: url('https://cdn.ilga.gov/assets/img/members/%7B90CDA259-1DEA-4D18-AE97-30051E03D154%7D.jpg');">
 <div class="member-overlay">
 <h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
 <p class="card-text">
                                             Republican Caucus Chair
                   

*In the original notebook, `select` was used here to extract the `a` elements with class `sidemenu`. For the reason mentioned above, it now returns nothing.*

A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Pass a CSS selector string to `.select()` to get all the elements that match it.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [58]:
# Get elements with "a.sidemenu" CSS Selector.
selected = soup.select("a.sidemenu")
selected[:5]

[]
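The difference between `find_all` with `class_` and a CSS selector is easiest to see on a small example. The sketch below uses hand-written markup loosely modeled on the member cards (the names, links, and the swapped class order are made up for illustration).

```python
from bs4 import BeautifulSoup

# Hand-written markup; the class order on the second card is swapped on purpose
html = """
<div class="member-card mb-4">
  <h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/1">A. Example</a> (R)</h5>
</div>
<div class="mb-4 member-card">
  <h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/2">B. Example</a> (D)</h5>
</div>
<a class="footer-link" href="/contact">Contact</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# A multi-word class_ string is compared with the whole class attribute, so order matters
exact = demo_soup.find_all("div", class_="member-card mb-4")

# The CSS selector matches any div carrying both classes, in any order
demo_cards = demo_soup.select("div.member-card.mb-4")

# Descendant selector: every a.notranslate somewhere inside a div.member-card
demo_links = demo_soup.select("div.member-card a.notranslate")

print(len(exact), len(demo_cards), [a.get_text() for a in demo_links])

# select_one returns the first match, or None when nothing matches
print(demo_soup.select_one("a.sidemenu"))
```

Here the exact-string search finds only one card, while the selector finds both, which is one reason a CSS selector can be the more robust choice.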

## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

*The solution would be the following. However, the class `mainmenu` does not exist on the current page, so the result is empty.*

In [18]:
soup.select("a.mainmenu")

[]

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in them. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [73]:
# Get all member-card links as a list
side_menu_links = soup.select("div.member-card.mb-4 .notranslate")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

<a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>
Class:  <class 'bs4.element.Tag'>


*Here we printed the type of the `notranslate` element found inside the senators' cards.*


It's a Beautiful Soup tag! This means it has a `text` member:

*Here we print the text inside the selected element, in this case the senator's name.*

In [74]:
print(first_link.text)

Neil Anderson


Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [75]:
print(first_link['href'])

/Senate/Members/Details/3312
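The `href` above is a relative link. As a small sketch, the snippet below shows two things: `tag.get(...)` as a safer alternative to `tag[...]` when an attribute may be missing, and `urllib.parse.urljoin` for turning a relative link into an absolute one. The second `<a>` tag without an `href` is made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = '<a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a><a>No link</a>'
demo_soup = BeautifulSoup(html, "html.parser")
with_href, without_href = demo_soup.find_all("a")

# tag["href"] raises KeyError if the attribute is missing; tag.get returns None instead
print(with_href["href"])
print(without_href.get("href"))

# Resolve the relative link against the page it was found on
base_url = "https://www.ilga.gov/Senate/Members"
full_url = urljoin(base_url, with_href["href"])
print(full_url)
```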


## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

In [76]:
[link['href'] for link in soup.select("a.mainmenu")]


[]

*This collects the `href` attributes of the `mainmenu` links, which comes back empty. Below we apply the same idea to the `notranslate` elements inside each card:*


In [77]:
cards = soup.select("div.member-card.mb-4")

senadores = []

for card in cards:
    # Inside each card, find every element with class 'notranslate'
    datos = [el.get_text(strip=True) for el in card.select(".notranslate")]

    # Store what we found (on this page, only the senator's name carries that class)
    senadores.append(datos)

# Show the results
for idx, s in enumerate(senadores, start=1):
    print(f"Senador {idx}: {s}")

Senador 1: ['Neil Anderson']
Senador 2: ['Neil Anderson']
Senador 3: ['Omar Aquino']
Senador 4: ['Omar Aquino']
Senador 5: ['Li Arellano, Jr.']
Senador 6: ['Li Arellano, Jr.']
Senador 7: ['Chris Balkema']
Senador 8: ['Chris Balkema']
Senador 9: ['Christopher Belt']
Senador 10: ['Christopher Belt']
Senador 11: ['Terri Bryant']
Senador 12: ['Terri Bryant']
Senador 13: ['Cristina Castro']
Senador 14: ['Cristina Castro']
Senador 15: ['Javier L. Cervantes']
Senador 16: ['Javier L. Cervantes']
Senador 17: ['Andrew S. Chesney']
Senador 18: ['Andrew S. Chesney']
Senador 19: ['Lakesia Collins']
Senador 20: ['Lakesia Collins']
Senador 21: ['Bill Cunningham']
Senador 22: ['Bill Cunningham']
Senador 23: ['John F. Curran']
Senador 24: ['John F. Curran']
Senador 25: ['Donald P. DeWitte']
Senador 26: ['Donald P. DeWitte']
Senador 27: ['Mary Edly-Allen']
Senador 28: ['Mary Edly-Allen']
Senador 29: ['Laura Ellman']
Senador 30: ['Laura Ellman']
Senador 31: ['Paul Faraci']
Senador 32: ['Paul Faraci']
Sen

<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

*Now we apply the exercise to the members page of the 98th General Assembly.*

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [79]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [81]:
# Get all table row elements
rows = soup.find_all("tr")
len(rows)

0

*No results are returned because the page we requested no longer exists and we are still on the Senate members page. That page has no `tr` tags, as the count of 0 shows.*

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [83]:
# Returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')

for row in rows[:5]:
    print(row, '\n')

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [86]:
example_row = rows[2]
print(example_row.prettify())

IndexError: list index out of range

*This raises an error because the list is empty, so there is no element at index 2.*
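An `IndexError` like this one is a common sign that a selector no longer matches the page. One defensive pattern, sketched below on a made-up snippet with no table rows, is to check how many elements matched before indexing into the list.

```python
from bs4 import BeautifulSoup

# A made-up page with a card but no table rows
html = '<div class="member-card"><h5 class="card-title">A. Example (R)</h5></div>'
demo_soup = BeautifulSoup(html, "html.parser")

demo_rows = demo_soup.select("tr tr tr")

# Check the length before asking for the third element
if len(demo_rows) > 2:
    example = demo_rows[2]
else:
    example = None
    print(f"Only {len(demo_rows)} rows matched; check the selector or the page structure.")
```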

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.

In [85]:
for cell in example_row.select('td'):
    print(cell)
print()

for cell in example_row.select('.detail'):
    print(cell)
print()

for cell in example_row.select('td.detail'):
    print(cell)
print()

NameError: name 'example_row' is not defined

*This fails because `example_row` was never defined, since the page has no `tr` elements. Instead, we repeat the same exercise with the `card-title` elements:*

In [88]:
titles = soup.select(".card-title")

# Show the first 5 results
for idx, title in enumerate(titles[:5], start=1):
    print(f"Título {idx}: {title.get_text(strip=True)}")

Título 1: Neil Anderson(R)
Título 2: Neil Anderson(R)
Título 3: Omar Aquino(D)
Título 4: Omar Aquino(D)
Título 5: Li Arellano, Jr.(R)


We can confirm that these are all the same.

*The `assert` checks that the three selectors return identical results for the `td` elements.*

In [89]:
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')

NameError: name 'example_row' is not defined
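Since the live page no longer contains the table, the comparison can still be tried on a hand-written row. The markup below is an assumption based on the description in this section (five `td` cells with class `detail`, with the name, district, and party at positions 0, 3, and 4); the values are made up.

```python
from bs4 import BeautifulSoup

# Hand-written row; the real table layout may have differed
html = """
<table><tr>
  <td class="detail">A. Example</td>
  <td class="detail">Bills</td>
  <td class="detail">Committees</td>
  <td class="detail">47</td>
  <td class="detail">R</td>
</tr></table>
"""
demo_row = BeautifulSoup(html, "html.parser").select_one("tr")

by_tag = demo_row.select("td")
by_class = demo_row.select(".detail")
by_both = demo_row.select("td.detail")

# All three selectors pick out the same cells in this row
assert by_tag == by_class == by_both
print([cell.text for cell in by_both])
```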

Let's use the selector `td.detail` to be as specific as possible.

In [90]:
# Select only those 'td' tags with class 'detail'
detail_cells = example_row.select('td.detail')
detail_cells

NameError: name 'example_row' is not defined

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [91]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]

print(row_data)

NameError: name 'detail_cells' is not defined

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [92]:
print(row_data[0]) # Name
print(row_data[3]) # District
print(row_data[4]) # Party

NameError: name 'row_data' is not defined

*Applied to our example, it looks like this:*

In [93]:
senadores = []

for card in cards:
    # Name and party
    title_tag = card.select_one(".card-title")
    if title_tag:
        nombre = title_tag.select_one(".notranslate").get_text(strip=True)
        partido = title_tag.get_text(strip=True).replace(nombre, "").strip()
    else:
        nombre, partido = None, None

    # Position and district
    text_tag = card.select_one(".card-text")
    if text_tag:
        partes = [t.strip() for t in text_tag.get_text(separator="|").split("|") if t.strip()]
        # usually: [position, district]
        cargo = partes[0] if len(partes) > 0 else None
        distrito = partes[1] if len(partes) > 1 else None
    else:
        cargo, distrito = None, None

    senadores.append({
        "nombre": nombre,
        "partido": partido,
        "cargo": cargo,
        "distrito": distrito
    })

# Show the first results
for s in senadores[:5]:
    print(s)

{'nombre': 'Neil Anderson', 'partido': '(R)', 'cargo': 'Republican Caucus Chair', 'distrito': '47th District'}
{'nombre': 'Neil Anderson', 'partido': '(R)', 'cargo': 'Republican Caucus Chair', 'distrito': '47th District'}
{'nombre': 'Omar Aquino', 'partido': '(D)', 'cargo': 'Majority Caucus Chair', 'distrito': '2nd District'}
{'nombre': 'Omar Aquino', 'partido': '(D)', 'cargo': 'Majority Caucus Chair', 'distrito': '2nd District'}
{'nombre': 'Li Arellano, Jr.', 'partido': '(R)', 'cargo': 'Senator', 'distrito': '37th District'}


*As shown, we get the name, party, position, and district for the first 5 results.*
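The values above are still raw strings such as `'47th District'` and `'(R)'`. If a numeric district or a bare party letter is needed, a small cleaning step could look like the sketch below. The helper names `parse_district` and `parse_party` are illustrative and not part of the workshop.

```python
import re


def parse_district(text):
    """Return the district number from text such as '47th District', or None."""
    match = re.search(r"(\d+)", text or "")
    return int(match.group(1)) if match else None


def parse_party(text):
    """Return the party letter from text such as '(R)', or None."""
    match = re.search(r"\(([A-Za-z])\)", text or "")
    return match.group(1) if match else None


print(parse_district("47th District"))
print(parse_party("(R)"))
```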


## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

In [94]:
print('Row 0:\n', rows[0], '\n')
print('Row 1:\n', rows[1], '\n')
print('Last Row:\n', rows[-1])

IndexError: list index out of range

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [95]:
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))

IndexError: list index out of range

Perhaps good rows have a length of 5. Let's check:

In [96]:
good_rows = [row for row in rows if len(row) == 5]

# Let's check some rows
print(good_rows[0], '\n')
print(good_rows[-2], '\n')
print(good_rows[-1])

IndexError: list index out of range

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [97]:
rows[2].select('td.detail')

IndexError: list index out of range

In [98]:
# Bad row
print(rows[-1].select('td.detail'), '\n')

# Good row
print(rows[5].select('td.detail'), '\n')

# How about this?
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n')
print(good_rows[-1])

IndexError: list index out of range

*These cells raise errors because the lists are empty: the page has no `td` elements. To clean the data in our example, we instead remove the duplicated entries, as shown below:*

In [99]:
# Remove duplicates by turning each dict into a hashable tuple of items
# (a set does not preserve the original page order)
unicos = [dict(t) for t in {tuple(s.items()) for s in senadores}]

# Sanity check: the de-duplicated list cannot be longer than the original
assert len(unicos) <= len(senadores)

# Show the results
for s in unicos[:10]:
    print(s)

{'nombre': 'Mattie Hunter', 'partido': '(D)', 'cargo': 'Assistant Majority Leader', 'distrito': '3rd District'}
{'nombre': 'Robert F. Martwick', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '10th District'}
{'nombre': 'Robert Peters', 'partido': '(D)', 'cargo': 'Majority Caucus Whip', 'distrito': '13th District'}
{'nombre': 'Mike Simmons', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '7th District'}
{'nombre': 'Mary Edly-Allen', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '31st District'}
{'nombre': 'Jason Plummer', 'partido': '(R)', 'cargo': 'Assistant Republican Leader', 'distrito': '55th District'}
{'nombre': 'Meg Loughran Cappel', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '49th District'}
{'nombre': 'Ram Villivalam', 'partido': '(D)', 'cargo': 'Majority Caucus Whip', 'distrito': '8th District'}
{'nombre': 'Celina Villanueva', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '12th District'}
{'nombre': 'Linda Holmes', 'partido': '(D)', 'cargo': 'Assistant Ma

Looks like we found something that worked!
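A set does not keep the order in which senators appear on the page. If the original order matters, one alternative is to keep the first occurrence of each record, as in the sketch below. The helper name `unique_in_order` is illustrative, and the sample records are shortened to two fields.

```python
def unique_in_order(records):
    """Drop repeated dicts while keeping the first occurrence of each."""
    seen = set()
    result = []
    for record in records:
        key = tuple(record.items())  # dicts are not hashable, tuples of items are
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


sample = [
    {"nombre": "Neil Anderson", "partido": "(R)"},
    {"nombre": "Neil Anderson", "partido": "(R)"},
    {"nombre": "Omar Aquino", "partido": "(D)"},
    {"nombre": "Omar Aquino", "partido": "(D)"},
]
print(unique_in_order(sample))
```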

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

*Here we combine the cleaned data.*

In [101]:
# Define storage list
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

In [102]:
# Should be 61
len(members)

0

*For our de-duplicated data we get 60 results:*

In [104]:
len(unicos)

60

Let's take a look at what we have in `members`.

In [105]:
print(members[:5])

[]


## 🥊  Challenge: Get `href` elements pointing to members' bills

The code above retrieves information on:  

- the senator's name,
- their district number,
- and their party.

We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format.

The format for the list of bills for a given senator is:

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

to get something like:

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`

in which `MEMBER_ID=1911`.

You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.
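Once a member ID is known, building the URL in the format shown above is plain string work. The sketch below uses `urllib.parse.urlencode` for the query string. The function name `bills_url` and the default `ga=98` are illustrative assumptions, and the old `SenatorBills.asp` address may no longer resolve on the current site.

```python
from urllib.parse import urlencode

BILLS_BASE = "http://www.ilga.gov/senate/SenatorBills.asp"


def bills_url(member_id, ga=98):
    """Build the primary-sponsored bills URL for one member."""
    query = urlencode({"MemberID": member_id, "GA": ga, "Primary": "True"})
    return f"{BILLS_BASE}?{query}"


print(bills_url(1911))
```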

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips:

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details.
* There are a _lot_ of different ways to use BeautifulSoup to get things done. Whatever you need to do to pull the `href` out is fine.

The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`.

*The code would look like the following. However, as we saw, the page has no `tr` elements, so the result is empty.*


In [114]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")
# Create empty list to store our data
members = []

# Returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')
# Get rid of junk rows
rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]

    # YOUR CODE HERE
    # Extract href
    href = row.select('a')[1]['href']
    # Create full path
    full_path = "http://www.ilga.gov/senate/" + href + "&Primary=True"

    # Store in a tuple
    senator = (name, district, party, full_path)
    # Append to list
    members.append(senator)


In [115]:
# Uncomment to test
members[:5]

[]

## 🥊  Challenge: Modularize Your Code

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator.

In [132]:
import requests
from bs4 import BeautifulSoup

def get_members(url):
    # Make the GET request and stop early on HTTP errors
    req = requests.get(url)
    req.raise_for_status()
    soup = BeautifulSoup(req.text, "lxml")

    members = []

    # Select every senator card
    cards = soup.select("div.member-card.mb-4")

    for card in cards:
        # Name and party
        title = card.select_one(".card-title")
        name_tag = title.select_one(".notranslate") if title else None
        if not name_tag:
            continue

        nombre = name_tag.get_text(strip=True)
        partido = title.get_text(strip=True).replace(nombre, "").strip()

        # Position and district (inside <p class="card-text">, separated by <br>)
        text = card.select_one(".card-text")
        partes = [t.strip() for t in text.get_text(separator="|").split("|") if t.strip()] if text else []
        cargo = partes[0] if len(partes) > 0 else None
        distrito = partes[1] if len(partes) > 1 else None

        # Store each senator as a dict
        members.append({
            "nombre": nombre,
            "partido": partido,
            "cargo": cargo,
            "distrito": distrito
        })

    # Remove duplicates from this function's own results (not the global
    # `senadores` list) by turning each dict into a hashable tuple of items
    unicos = [dict(t) for t in {tuple(m.items()) for m in members}]
    # Sanity check: the de-duplicated list cannot be longer than the original
    assert len(unicos) <= len(members)
    return unicos


In [133]:
# Test your code
url = 'https://www.ilga.gov/Senate/Members'
senate_members = get_members(url)
for s in senate_members[:5]:
    print(s)

{'nombre': 'Mattie Hunter', 'partido': '(D)', 'cargo': 'Assistant Majority Leader', 'distrito': '3rd District'}
{'nombre': 'Robert F. Martwick', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '10th District'}
{'nombre': 'Robert Peters', 'partido': '(D)', 'cargo': 'Majority Caucus Whip', 'distrito': '13th District'}
{'nombre': 'Mike Simmons', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '7th District'}
{'nombre': 'Mary Edly-Allen', 'partido': '(D)', 'cargo': 'Senator', 'distrito': '31st District'}


*Here we wrapped the logic in a function that takes a URL and returns the results, as seen above.*


## 🥊 Take-home Challenge: Writing a Scraper Function

We want to scrape the webpages corresponding to bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - return a _list_ of tuples, each with:
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
This function has been partially completed. Fill in the rest.

In [134]:
import requests
from bs4 import BeautifulSoup

def get_bills(url):
    # Request the page content
    response = requests.get(url)
    response.raise_for_status()  # Raise an error if the request fails

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all table rows; the ones holding bill data are filtered below
    bill_rows = soup.find_all('tr')

    bills = []

    for row in bill_rows:
        # Find all cells with class 'billlist'
        cells = row.find_all('td', class_='billlist')

        # Make sure there are at least 5 columns
        if len(cells) >= 5:
            description = cells[1].get_text(strip=True)
            chamber = cells[2].get_text(strip=True)
            last_action = cells[3].get_text(strip=True)
            last_action_date = cells[4].get_text(strip=True)

            bills.append((description, chamber, last_action, last_action_date))

    return bills


In [137]:
url = 'https://www.ilga.gov/Senate/Members'
result = get_bills(url)
for bill in result:
    print(bill)

*The code would look like the cells above. However, the current page has no `billlist` class, so it produces no results.*
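Because the live page returns nothing, the parsing logic can still be exercised offline by separating it from the download step. The sketch below does that with a hypothetical helper, `parse_bills`, and a hand-written table whose layout follows the column description above. The bill number, description, action, and date are invented for illustration.

```python
from bs4 import BeautifulSoup


def parse_bills(html):
    """Extract (description, chamber, last_action, last_action_date) tuples from bill rows."""
    soup = BeautifulSoup(html, "html.parser")
    bills = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td", class_="billlist")
        # Rows without at least 5 'billlist' cells (such as headers) are skipped
        if len(cells) >= 5:
            bills.append(tuple(cell.get_text(strip=True) for cell in cells[1:5]))
    return bills


# Hand-written sample; the real page layout may differ
sample_html = """
<table>
  <tr><th>Bill</th><th>Description</th><th>Chamber</th><th>Last Action</th><th>Date</th></tr>
  <tr>
    <td class="billlist">SB 0001</td>
    <td class="billlist">EXAMPLE ACT</td>
    <td class="billlist">S</td>
    <td class="billlist">Referred to Assignments</td>
    <td class="billlist">1/1/2013</td>
  </tr>
</table>
"""
print(parse_bills(sample_html))
```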

### Scrape All Bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.

**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site.
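One possible shape for such a loop is sketched below. To keep it runnable without network access, the download function is passed in as a parameter and replaced by a stand-in. The names `scrape_all`, `fake_fetch`, and the `example.org` URLs are hypothetical.

```python
import time


def scrape_all(urls_by_district, fetch, delay=1.0):
    """Call fetch(url) for each district, pausing `delay` seconds between requests."""
    results = {}
    for i, (district, url) in enumerate(urls_by_district.items()):
        if i > 0:
            time.sleep(delay)  # pause between requests to be polite to the server
        results[district] = fetch(url)
    return results


# Stand-in for a real scraper function, so the sketch runs offline
def fake_fetch(url):
    return [f"bill list from {url}"]


demo = scrape_all({1: "https://example.org/a", 2: "https://example.org/b"}, fake_fetch, delay=0)
print(demo)
```

With real data this might be called as `bills_dict = scrape_all(urls_by_district, get_bills)`, where `urls_by_district` is a dict you would build from the scraped members.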

*In this case we instead fetch every senator's biography. The senator's ID is passed as the member ID in the profile URL, and the biography is read from the element with class `col-sm-12`.*

In [173]:
import requests
from bs4 import BeautifulSoup
import time

def get_all_member_ids():
    url = "https://www.ilga.gov/Senate/Members"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    member_ids = []
    for a in soup.select('a[href*="/Senate/Members/Details/"]'):
        href = a['href']
        if "/Senate/Members/Details/" in href:
            member_id = href.split("/Senate/Members/Details/")[-1]
            member_ids.append(member_id)
    return member_ids

def get_biography(member_id):
    url = f"https://www.ilga.gov/Senate/Members/Details/{member_id}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    bio_div = soup.find('div', class_='col-sm-12')
    if bio_div:
        bio_text = bio_div.get_text(strip=True)
        return bio_text
    return "No biography found."

# Run everything
# Note: each card appears twice on the page, so the same ID can show up twice
bios_dict = {}
member_ids = get_all_member_ids()

for member_id in member_ids:
    bio = get_biography(member_id)
    bios_dict[member_id] = bio
    print(f"🧾 Senador {member_id}:\n{bio[:300]}...\n")  # Show the first 300 characters
    time.sleep(1)


🧾 Senador 3312:
BiographyState Senator Neil Anderson, a professional firefighter and paramedic for the City of Moline, was first elected in 2015 and represents Illinois’ 47th District, spanning 15 counties. Raised in the Quad Cities, he worked in his family’s flooring business and walked on to the University of Neb...

🧾 Senador 3312:
BiographyState Senator Neil Anderson, a professional firefighter and paramedic for the City of Moline, was first elected in 2015 and represents Illinois’ 47th District, spanning 15 counties. Raised in the Quad Cities, he worked in his family’s flooring business and walked on to the University of Neb...

🧾 Senador 3316:
BiographyBorn and raised on the Northwest Side of Chicago; B.A. in Criminal Justice and Sociology, Loyola University Chicago; Bilingual Case Manager at Central West Case Management Unit at the Jane Addams School of Social Work; Legislative Assistant in the Illinois House of Representatives; Outreach...

🧾 Senador 3316:
BiographyBorn and rai

KeyboardInterrupt: 