A notebook that does some basic web scraping using the `requests` and Beautiful Soup libraries.

In [None]:
import requests  # To get the pages
from bs4 import BeautifulSoup # and to process them

I have a list of Democratic candidates' websites (as of 2019-09-25). Let's read that in. 

In [None]:
sites = []
with open("candidates_websites.txt",'r') as infile :
    for line in infile :
        sites.append(line.strip())

Let's take a look at Joe Biden's website, which is in the first spot of our list. 

In [None]:
print(sites[0])
r = requests.get(sites[0])
r.status_code

After you pull a page, it's a good idea to see what the status code is. Here's a [link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to what the numbers mean. 
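If you'd rather not leave the notebook to look up a code, Python's standard library knows the standard names. Here's a minimal sketch using `http.HTTPStatus`; the helper name `describe_status` is made up for this example and isn't part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus only knows the registered codes
        return f"{code} (not a standard status code)"
    return f"{code} {status.phrase}"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In the cell above you could call `describe_status(r.status_code)` to get the same kind of label for Biden's page.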

Now let's look at the text that's on the page. Warning: this is going to be a mess.

In [None]:
r.text

That page was a mess, so let's try Beautiful Soup:

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')

We can print a prettier version, but it's not _that_ much prettier.

In [None]:
print(soup.prettify())

One of the cool things we can do is search the soup to find things like `a` tags. Go look up what those tags are used for. 
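To see what `find_all` gives back without wading through a real page, here's a small sketch on a made-up HTML snippet. The HTML, the `example.com` URLs, and the `sample_*` names are all invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page fragment with three `a` tags
sample_html = """
<p>Read our <a href="https://example.com/plan">plan</a> or
<a href="/volunteer">volunteer</a>. <a name="top">Back to top</a></p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_a_tags = sample_soup.find_all("a")

# .get('href') returns None when the tag has no href attribute
sample_hrefs = [a.get("href") for a in sample_a_tags]
sample_labels = [a.get_text() for a in sample_a_tags]

print(sample_hrefs)
print(sample_labels)
```

Notice that an `a` tag doesn't have to carry an `href`, and that an `href` can be a relative path like `/volunteer` rather than a full URL. Both of those show up on real pages too.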

In [None]:
all_a_tags = soup.find_all('a')

In [None]:
len(all_a_tags)

So there were 46 `a` tags on this page when I ran this. Let's make a list of the links they point to.

In [None]:
biden_links = []

# link.get('href') returns None for `a` tags that have no href
for link in soup.find_all('a'):
    biden_links.append(link.get('href'))


In [None]:
biden_links
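A list of raw `href` values usually needs some cleanup before you can request each one: some entries are `None`, some are relative paths, and some aren't web pages at all (like `mailto:` links). Here's one way to tidy that up, sketched with the standard library's `urllib.parse`. The function name `clean_links` and the example URLs are made up for this illustration.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def clean_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) URLs, in order."""
    cleaned = []
    seen = set()
    for href in hrefs:
        if not href:
            continue  # skip None and empty strings
        # Turn relative paths into full URLs and drop any #fragment
        absolute = urldefrag(urljoin(base_url, href)).url
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript:, etc.
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

example_hrefs = [None, "/join", "https://example.com/about",
                 "mailto:info@example.com", "#top", "/join"]
print(clean_links("https://example.com/", example_hrefs))
```

With the real data, something like `clean_links(sites[0], biden_links)` would give a list that is safer to loop over.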

One thing we might want to do now is crawl each one of those pages to extract the text. Let's store the text in a dictionary with the URL as the key and the page text as the value. One trick we'll use is to extract just the visible text from the page, using the code found at this StackOverflow [answer](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text).

In [None]:
from bs4.element import Comment

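def tag_visible(element):
    """Return True if a text node would be visible to someone reading the page."""
    # Text inside these tags is never shown in the body of the page
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']: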
        return False
    if isinstance(element, Comment):
        return False
    return True
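To check that the filter does what we want, here's a quick sketch that runs it on a tiny made-up page. The HTML and the `demo_*` names are invented for this example, and `tag_visible` is repeated so the snippet runs on its own.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

# A made-up page with a title, a stylesheet, a comment, and a script
demo_html = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello</p><!-- hidden --><script>var x;</script>"
    "<a href='/join'>Join us</a></body></html>"
)

demo_soup = BeautifulSoup(demo_html, "html.parser")
demo_texts = demo_soup.find_all(string=True)  # every text node, visible or not
demo_visible = " ".join(t.strip() for t in filter(tag_visible, demo_texts))
print(demo_visible)
```

Only the paragraph and the link text should survive; the title, stylesheet, comment, and script get filtered out.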


In [None]:
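biden_text = dict()

for link in biden_links :
    if not link :
        continue  # some `a` tags have no href

    try :
        r = requests.get(link, timeout=10)
    except requests.exceptions.RequestException :
        # Skip links we can't fetch (relative paths, mailto:, timeouts, ...)
        continue

    if r.status_code == 200 :
        soup = BeautifulSoup(r.text, 'html.parser')
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        biden_text[link] = " ".join(t.strip() for t in visible_texts)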