## Scraping


There are several ways to refer to an element in a web page:
- `soup.find_all("a")` finds all `<a>` tags
- any tag can be found by its `id` or `class` attribute
- a loop prints every link on a page:
```python
for link in soup.find_all("a"):
    print(link.get("href"))
```
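or, as a list comprehension (note that `a['href']` raises `KeyError` for an `<a>` tag without an `href`):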
```py
[a['href'] for a in soup.find_all('a')]
```
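
The `id` and `class` lookups listed above can be sketched as follows. This is a minimal example on a made-up HTML snippet (the snippet, the `intro` id and the `note` class are assumptions for illustration), parsed with Python's built-in `html.parser` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, only for illustration
html = """
<div>
  <p id="intro">Welcome</p>
  <p class="note">First note</p>
  <p class="note extra">Second note</p>
  <a href="/about">About</a>
  <a>No link here</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="intro")                  # first tag whose id is "intro"
by_class = soup.find_all("p", class_="note")   # class_ because class is a Python keyword

# a["href"] would raise KeyError for the second <a>; href=True keeps only anchors that have one
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(by_id.text)                    # Welcome
print([p.text for p in by_class])    # ['First note', 'Second note']
print(hrefs)                         # ['/about']
```

Note that `class_="note"` also matches the paragraph with `class="note extra"`, because Beautiful Soup treats `class` as a list of values.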

In [None]:
from bs4 import BeautifulSoup
import requests

url = ("https://raw.githubusercontent.com/joelgrus/data/master/getting-data.html")
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')

first_paragraph = soup.find('p')        # or just soup.p


assert str(soup.find('p')) == '<p id="p1">This is the first paragraph.</p>'

first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()


assert first_paragraph_words == ['This', 'is', 'the', 'first', 'paragraph.']

first_paragraph_id = soup.p['id']       # raises KeyError if no 'id'
first_paragraph_id2 = soup.p.get('id')  # returns None if no 'id'


assert first_paragraph_id == first_paragraph_id2 == 'p1'

all_paragraphs = soup.find_all('p')  # or just soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]


assert len(all_paragraphs) == 2
assert len(paragraphs_with_ids) == 1

important_paragraphs = soup('p', {'class' : 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
                         if 'important' in p.get('class', [])]


assert important_paragraphs == important_paragraphs2 == important_paragraphs3
assert len(important_paragraphs) == 1

In [None]:
first_paragraph_text

'This is the first paragraph.'

In [None]:
soup('h1')

[<h1>Getting Data</h1>]

In [None]:
soup.h1.text

'Getting Data'

In [None]:
[p for p in soup('p') if p.get('class', [])]

[<p class="important">This is the second paragraph.</p>]

Other useful tricks with Beautiful Soup:

In [None]:
len(list(soup.children))

2

In [None]:
len(list(soup.descendants))

42

In [None]:
len(list(soup.h1.descendants))  # the <h1> has a single descendant: its text node

1

In [None]:
len(soup.contents)

2

In [None]:
soup.contents[0]

'html'

In [None]:
for string in soup.stripped_strings:
    print(repr(string))
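
The difference between the counts above comes from what each attribute walks: `.children` yields only direct children, `.descendants` yields every nested node, and `.stripped_strings` yields the text nodes without surrounding whitespace. A small sketch on a hypothetical snippet (not the page used above), assuming the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

html = "<ul><li>one <b>bold</b></li><li> two </li></ul>"  # hypothetical snippet
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

children = list(ul.children)          # direct children only: the two <li> tags
descendants = list(ul.descendants)    # every nested node, tags and text nodes alike
strings = list(ul.stripped_strings)   # text nodes with surrounding whitespace removed

print(len(children))      # 2
print(len(descendants))   # 6
print(strings)            # ['one', 'bold', 'two']
```

The six descendants are the two `<li>` tags, the `<b>` tag, and the three text nodes.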

## Applied to the Abscis website

In [None]:
url = ("https://www.abscis-architecten.be/nl/publicaties")
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')

# soup.find('p')                   # finds first paragraph, or soup.p
first_header = soup.h2             # finds first h2 header


In [None]:
links = [a['href'] for a in soup.find_all('a')]
links

['#',
 'https://www.abscis-architecten.be/nl/architectenbureau',
 'https://www.abscis-architecten.be/nl/projecten',
 'https://www.abscis-architecten.be/nl/nieuws',
 'https://www.abscis-architecten.be/nl/publicaties',
 'https://www.abscis-architecten.be/nl/vacatures',
 'https://www.abscis-architecten.be/nl/contact',
 'https://www.abscis-architecten.be/nl',
 'https://www.abscis-architecten.be/nl',
 'https://www.abscis-architecten.be/en',
 'https://www.abscis-architecten.be/nl/publicaties/houtbouw-voor-gestapelde-woningbouw/14-03-2023/794',
 'https://www.abscis-architecten.be/nl/publicaties/houtbouw-voor-gestapelde-woningbouw/14-03-2023/794',
 'https://www.abscis-architecten.be/nl/publicaties/houtbouw-voor-gestapelde-woningbouw/14-03-2023/794',
 'https://www.abscis-architecten.be/nl/publicaties/comfort-in-scholenbouw-hoe-ontwerp-je-een-gezonde-en-veilige-school/07-11-2022/790',
 'https://www.abscis-architecten.be/nl/publicaties/comfort-in-scholenbouw-hoe-ontwerp-je-een-gezonde-en-veilige-

In [None]:
first_header.find('a')['href']  # .find('a') returns the first <a> tag inside first_header;
                                # ['href'] then reads that tag's href attribute (the link)

'https://www.abscis-architecten.be/nl/publicaties/houtbouw-voor-gestapelde-woningbouw/14-03-2023/794'

In [None]:
headers = soup('h2')

In [None]:
links_dict = {}

for h in headers:
    a_tag = h.find('a')  # first look for an anchor inside the header; find() returns None if there is none
    
    if a_tag:
        title = h.text
        link = a_tag['href']  # href is an attribute of the anchor tag (<a href=...>)
        links_dict[title] = link
links_dict 

{'Houtbouw voor gestapelde woningbouw': 'https://www.abscis-architecten.be/nl/publicaties/houtbouw-voor-gestapelde-woningbouw/14-03-2023/794',
 'Comfort in scholenbouw: Hoe ontwerp je een gezonde en veilige school?': 'https://www.abscis-architecten.be/nl/publicaties/comfort-in-scholenbouw-hoe-ontwerp-je-een-gezonde-en-veilige-school/07-11-2022/790',
 'Sociaal wooncomplex tilt circulaire houtskeletbouw naar hoger niveau': 'https://www.abscis-architecten.be/nl/publicaties/sociaal-wooncomplex-tilt-circulaire-houtskeletbouw-naar-hoger-niveau/27-09-2022/787',
 'Blending Old and New: 7 Delicate Alterations to Brick Façades': 'https://www.abscis-architecten.be/nl/publicaties/blending-old-and-new-7-delicate-alterations-to-brick-faades/08-08-2022/783',
 'Karakteristieke kloostergevel herbergt sociaal woonproject': 'https://www.abscis-architecten.be/nl/publicaties/karakteristieke-kloostergevel-herbergt-sociaal-woonproject/05-08-2022/782',
 'Voetgangersveer, Appels': 'https://www.abscis-architect
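
The same title-to-link mapping can also be sketched with a CSS selector instead of a loop over the headers. This is an alternative under stated assumptions, not the approach used above: the markup is hypothetical, and the title is taken from the anchor text rather than from the whole header.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a list of publications
html = """
<h2><a href="/pub/1">First title</a></h2>
<p>Summary one</p>
<h2>Heading without a link</h2>
<h2><a href="/pub/2">Second title</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# "h2 a[href]" selects anchors that carry an href and sit inside an <h2>
links_by_title = {
    a.get_text(strip=True): a["href"]
    for a in soup.select("h2 a[href]")
}

print(links_by_title)  # {'First title': '/pub/1', 'Second title': '/pub/2'}
```

Headers without an anchor are skipped automatically, so no `if a_tag:` check is needed.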

In [None]:
for h in headers:
    a_tag = h.find('a')  # look up the anchor of *this* header; reusing the old a_tag repeats the last link
    if a_tag:
        print(h.text)
        summary = h.find_next_sibling('p')  # the paragraph that follows the header, if any
        if summary:
            print(summary.text)
        print(a_tag['href'], "\n")

Houtbouw voor gestapelde woningbouw
De klassieke opvatting dat een stenen gebouw langer meegaat dan houtbouw, is al lang achterhaald. De circulaire bouwmethode levert een grote bijdrage aan de verduurzaming van de bouwindustrie.

https://www.abscis-architecten.be/nl/publicaties/houtbouw-voor-gestapelde-woningbouw/14-03-2023/794 

Comfort in scholenbouw: Hoe ontwerp je een gezonde en veilige school?
Welke zijn de grootste problemen wat comfort in scholenbouw betreft? Die vraag stond centraal in de poll die Architectura organiseerde tijdens een webinar. Luchtkwaliteit, akoestiek, ergonomie en daglicht scoorden het…
https://www.abscis-architecten.be/nl/publicaties/comfort-in-scholenbouw-hoe-ontwerp-je-een-gezonde-en-veilige-school/07-11-2022/790 

Sociaal wooncomplex tilt circulaire houtskeletbouw naar hoger niveau
Voor zijn nieuw sociaal huisvestingsproject in de 

Fun aside: in a notebook, Python can also render HTML with `IPython`:

In [None]:
from IPython.display import display, HTML

html_string = '''
<!DOCTYPE html>
<html>
<head>
<style>
  .highlight {
    background-color: yellow;
  }
</style>
</head>
<body>
  <p>
    This is a normal paragraph, but
    <span class="highlight">this part is highlighted</span>
    using a span element with a custom CSS class.
  </p>
</body>
</html>
'''

display(HTML(html_string))

> Checking for mentions of a keyword (the same logic is applied further below to find 'data' in congressional press releases)

In [None]:
from bs4 import BeautifulSoup
import requests

url = ("https://www.abscis-architecten.be/nl/publicaties/houtbouw-voor-gestapelde-woningbouw/14-03-2023/794")
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')


text = ' '.join(p.get_text() for p in soup('p'))  # join with a space so words from adjacent paragraphs don't run together
keyword = 'houtk'

keyword.lower() in text.lower()  # `in` on two strings is already a substring test; any() is only needed to test several strings

False

In [None]:
paragraph

'\n\t\t\t\t\t\t\t+32 (0)9 244 60 20\n\t\t\t\t\t\t\tinfo@abscis.be\n\t\t\t\t\t\t'

In [None]:
all_text = [p.get_text() for p in soup('p')]  # one string per paragraph, kept in a list for reuse
keywrd = 'houtbouw'

# A membership check over all paragraphs would answer only yes or no:
# any(keywrd.lower() in paragraph.lower() for paragraph in all_text)
# The loop below instead prints the first word of every matching paragraph.
            
for paragraph in all_text:
    if keywrd.lower() in paragraph.lower():
        print(paragraph.split()[0])

De
Arthur
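
`any()` becomes useful once there are several strings to test: it takes an iterable of booleans and stops at the first `True`. The sketch below separates the yes/no question from the "which paragraphs" question. The helper names `mentions` and `matching` and the sample paragraphs are hypothetical, chosen only for illustration.

```python
from typing import List


def mentions(paragraphs: List[str], keyword: str) -> bool:
    """True if at least one paragraph contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    # The generator yields one bool per paragraph; any() stops at the first True
    return any(needle in paragraph.lower() for paragraph in paragraphs)


def matching(paragraphs: List[str], keyword: str) -> List[str]:
    """All paragraphs that contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [p for p in paragraphs if needle in p.lower()]


# Made-up paragraphs, for illustration only
sample = [
    "Timber construction is growing.",
    "Contact: info@example.com",
    "Arthur writes about TIMBER frames.",
]

print(mentions(sample, "timber"))                           # True
print(mentions(sample, "concrete"))                         # False
print([p.split()[0] for p in matching(sample, "timber")])   # ['Timber', 'Arthur']
```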


## Find *data* in press releases from members of the US Congress

In [None]:
from bs4 import BeautifulSoup
import requests

url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, "html5lib")

all_urls = [a['href']
            for a in soup('a')
            if a.has_attr('href')]

print(len(all_urls))  # 965 for me, way too many

967


In [None]:
import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")
assert not re.match(regex, "joel.house.gov")


# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]

print(len(good_urls))  # still 862 for me

880
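
A few more test cases make the behaviour of this pattern easier to see. The sketch below checks the regex from the cell above against some made-up URLs. Because `.*` matches anything, the pattern also accepts a non-House URL that merely ends in `.house.gov`. The `strict` pattern is an assumption of mine, not part of the original code, and only allows a single host label before `.house.gov`.

```python
import re

regex = r"^https?://.*\.house\.gov/?$"

# url -> whether the original pattern is expected to match
cases = {
    "https://jayapal.house.gov": True,
    "https://jayapal.house.gov/": True,
    "http://joel.house.gov": True,
    "joel.house.gov": False,                        # no scheme
    "https://jayapal.house.gov/media": False,       # has a path
    "https://www.house.gov": True,                  # the directory site itself also passes
    "https://example.com/?next=.house.gov": True,   # false positive: ".*" swallows the real host
}
results = {url: bool(re.match(regex, url)) for url in cases}
assert results == cases

# Hypothetical stricter variant: exactly one host label before ".house.gov"
strict = r"^https?://[\w-]+\.house\.gov/?$"
print(bool(re.match(strict, "https://jayapal.house.gov")))             # True
print(bool(re.match(strict, "https://example.com/?next=.house.gov")))  # False
```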


In [None]:
num_original_good_urls = len(good_urls)

good_urls = list(set(good_urls))

print(len(good_urls))  # only 431 for me

440
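
`set()` removes exact duplicates only, so `https://sewell.house.gov` and `https://sewell.house.gov/` would still count as two sites, and the original order is lost. A minimal sketch of a deduplication that ignores a trailing slash and keeps first-seen order (the helper name `unique_sites` and the sample list are hypothetical):

```python
from typing import Iterable, List


def unique_sites(urls: Iterable[str]) -> List[str]:
    """Drop duplicates, ignoring a trailing slash, and keep first-seen order."""
    normalized = (url.rstrip("/") for url in urls)
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(normalized))


# Made-up input, for illustration only
urls = [
    "https://sewell.house.gov/",
    "https://takano.house.gov",
    "https://sewell.house.gov",
    "https://takano.house.gov",
]

print(unique_sites(urls))  # ['https://sewell.house.gov', 'https://takano.house.gov']
```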


In [None]:
html = requests.get('https://jayapal.house.gov').text
soup = BeautifulSoup(html, 'html5lib')

# Use a set because the links might appear multiple times.
links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}

print(links) # {'/media/press-releases'}

{'https://jayapal.house.gov/category/news/', 'https://jayapal.house.gov/category/press-releases/'}


In [None]:
def paragraph_mentions(text: str, keyword: str) -> bool:
    """
    Returns True if a <p> inside the text mentions {keyword}
    """
    soup = BeautifulSoup(text, 'html5lib')
    paragraphs = [p.get_text() for p in soup('p')]

    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)
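
One caveat of the substring test in `paragraph_mentions` is that 'data' also matches inside words such as 'database' or 'metadata'. If only the whole word should count, a word-boundary regex is one option. The helper `mentions_word` below is a hypothetical variant that works on plain text, not part of the original code.

```python
import re


def mentions_word(text: str, keyword: str) -> bool:
    """True if keyword occurs as a whole word in text (case-insensitive)."""
    # re.escape guards against regex metacharacters in the keyword;
    # \b marks a boundary between a word character and a non-word character
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(mentions_word("We published new Data today.", "data"))   # True
print(mentions_word("The database was updated.", "data"))      # False
print("data" in "The database was updated.".lower())           # True (plain substring test)
```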

In [None]:
# I don't want this file to scrape all 400+ websites every time it runs.
# So I'm going to randomly throw out most of the urls.
# The code in the book doesn't do this... only the github code
import random
good_urls = random.sample(good_urls, 5)
print(f"after sampling, left with {good_urls}")

from typing import Dict, Set

press_releases: Dict[str, Set[str]] = {}

for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    
    pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links

after sampling, left with ['https://auchincloss.house.gov', 'https://desposito.house.gov', 'https://clyde.house.gov', 'https://sewell.house.gov/', 'https://takano.house.gov']
https://auchincloss.house.gov: {'https://auchincloss.house.gov/media/press-releases'}
https://desposito.house.gov: {'/media/press-releases'}
https://clyde.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://sewell.house.gov/: {'/press-releases'}
https://takano.house.gov: {'https://takano.house.gov/newsroom/press-releases'}


In [None]:
press_releases

{'https://auchincloss.house.gov': {'https://auchincloss.house.gov/media/press-releases'},
 'https://desposito.house.gov': {'/media/press-releases'},
 'https://clyde.house.gov': {'/news/documentquery.aspx?DocumentTypeID=27'},
 'https://sewell.house.gov/': {'/press-releases'},
 'https://takano.house.gov': {'https://takano.house.gov/newsroom/press-releases'}}
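
The collected links are a mix of absolute URLs and site-relative paths, so they cannot all be requested as they are. A minimal sketch of resolving them against the site they came from with the standard library's `urljoin`, using a few of the values printed above:

```python
from urllib.parse import urljoin

# A subset of the press_releases dict shown above
press_releases = {
    "https://desposito.house.gov": {"/media/press-releases"},
    "https://sewell.house.gov/": {"/press-releases"},
    "https://takano.house.gov": {"https://takano.house.gov/newsroom/press-releases"},
}

# urljoin resolves a relative path against the base and leaves absolute URLs untouched
absolute = {
    house_url: {urljoin(house_url, link) for link in links}
    for house_url, links in press_releases.items()
}

for house_url, links in absolute.items():
    print(house_url, "->", links)
```

This avoids hard-coding a host name and also handles a base URL with or without a trailing slash.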

Next, I want to read all the press releases and scan them for the word 'data'.

In [None]:
# I need the links to the individual press releases,
# so first scrape them from each press-release overview page.
pr = ['https://desposito.house.gov/media/press-releases',
      'https://clyde.house.gov/news/documentquery.aspx?DocumentTypeID=27',
      'https://sewell.house.gov/press-releases']

pr_text_links = []

for pr_link in pr:
    text = requests.get(pr_link).text
    soup = BeautifulSoup(text, 'html5lib')

    # Collect the links. Filtering on the 'ContentGrid' class
    # only works for one of the three sites.
    pr_text_links += [a['href']
                      for a in soup('a', {'class': 'ContentGrid'})
                      if a.has_attr('href')]  # skip anchors without an href to avoid a KeyError

pr_text_links

['/2023/4/rep-sewell-calls-on-congress-to-reject-republican-debt-ceiling-demands-and-avoid-catastrophic-default',
 '/2023/4/rep-sewell-announces-2-million-in-federal-funding-for-the-lovelady-center-in-birmingham',
 '/2023/4/reps-sewell-and-rogers-introduce-bipartisan-legislation-to-combat-alabama-s-rural-wastewater-crisis',
 '/2023/4/rep-sewell-celebrates-the-80th-birthday-of-the-honorable-judge-u-w-clemon-by-honoring-him-on-the-house-floor',
 '/2023/4/rep-sewell-sends-letter-to-federal-railroad-administrator-amit-bose-requesting-action-to-address-recent-train-blockages-and-safety-concerns',
 '/2023/4/rep-sewell-announces-2023-congressional-art-competition-for-high-school-students-in-alabama-s-7th-congressional-district',
 '/2023/4/rep-sewell-joins-reps-tenney-davis-and-kelly-in-introducing-the-bipartisan-new-markets-tax-credit-extension-act',
 '/2023/3/reps-sewell-arrington-hudson-and-ruiz-introduce-the-nancy-gardner-sewell-medicare-multi-cancer-early-detection-screening-coverage-act'

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
texts = []

# Read every press release, keep its HTML, and stop at the first one that mentions 'data'
for pr_text_link in pr_text_links:
    # The scraped links are site-relative and already start with '/',
    # so prepend the host without a trailing slash
    url = f'https://sewell.house.gov{pr_text_link}'
    text = requests.get(url).text

    texts.append(text)  # `texts += text` would add the page character by character

    # Final check: is 'data' mentioned in any paragraph?
    if paragraph_mentions(text, 'data'):
        print(url)
        break  # stop at the first press release that mentions 'data'


https://sewell.house.gov/2023/3/reps-sewell-fitzpatrick-introduce-bipartisan-legislation-to-combat-physician-shortage-and-improve-access-to-health-care


So at least one press release mentions 'data'.