## Scraping Congressional Candidates

In this notebook, we'll scrape some congressional candidates' websites and store those results. This is really just scratching the surface of web scraping, but it'll orient you to the process. 

There are two main libraries you need to know about: `requests` and `BeautifulSoup` (the package you import is actually called `bs4`). 

The [`requests`](http://docs.python-requests.org/en/master/#) library is a pretty delightful way to ask for information from web servers. I'd urge you to walk through the [quickstart guide](http://docs.python-requests.org/en/master/user/quickstart/) at your leisure to understand what's possible. 

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is the Python standard for parsing HTML. We don't have time to do a deep dive on HTML (nor am I really the person to teach it), so we'll pick and choose our pieces to use. But there's a ton there and, really, anything is possible.

In [None]:
import requests
from bs4 import BeautifulSoup
from random import sample, seed
from collections import defaultdict, Counter

We'll grab some of our congressional candidate data. Let's read it in. For now we'll just grab some names and websites, but obviously there's a lot more data in there.

In [None]:
congressional_file = "partial_candidate_set.txt"

name_website = dict()

with open(congressional_file,'r') as input_file :
    next(input_file) # skip the first line (presumably the header row)
    
    for line in input_file :
        line = line.strip().split("\t")
        
        # name is in the 7th spot, website in the 9th
        # The strips make sure we don't get extra spaces, which "name" seems
        # to have
        name_website[line[6].strip()] = line[8].strip()
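
As an alternative to splitting each line by hand, the standard library's `csv` module can read tab-delimited files. The sketch below runs on a small in-memory sample rather than `partial_candidate_set.txt`. The field positions (name in the 7th spot, website in the 9th) follow the cell above, but the sample rows and the placeholder column headers are made up for illustration.

```python
import csv
import io

# Hypothetical stand-in for the real file: a header row plus two
# tab-delimited rows with the name in field 7 and the website in field 9.
sample_file = io.StringIO(
    "c1\tc2\tc3\tc4\tc5\tc6\tname\tc8\twebsite\n"
    "a\tb\tc\td\te\tf\t Jane Doe \tg\thttps://janedoe.example/\n"
    "a\tb\tc\td\te\tf\tJohn Roe\tg\thttps://johnroe.example\n"
)

name_website_demo = {}

reader = csv.reader(sample_file, delimiter="\t")
next(reader)  # skip the header row

for row in reader:
    # same positions as above: index 6 is the name, index 8 the website
    name_website_demo[row[6].strip()] = row[8].strip()

print(name_website_demo)
```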
    

Let's take a look at some of these and see if they're right.

In [None]:
for idx, name in enumerate(name_website) :
    print("For {} we have {} as their website.".format(name,name_website[name]))
    
    if idx == 4 :
        break

It looks like some candidates are listed with the incumbent's House website. That's not right, but let's see how often it happens. 

In [None]:
count = 0
for name, site in name_website.items() :
    # many of the websites end in "/". Let's
    # strip those off the end. 
    site = site.rstrip("/")
    
    # Usually I delete comments like the code below, but I 
    # thought maybe you'd like to see how I actually do the work. 
    # The next two lines are the ones I used to figure out 
    # how to work the "slices" of the string to properly get 
    # the .gov at the end of the string
#    print(site[-4:])
#    break
    if site[-4:] == ".gov" :
        print("Oops, for {} we have {}".format(name,site))
        count += 1
        
print("Damn, there were {} wrong websites.".format(count))
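
Checking the last four characters misses a `.gov` address that includes a path, such as a link to a specific page. A slightly more robust sketch parses the URL and looks at the host name instead. The helper name `is_gov_site` and the example URLs are made up for illustration.

```python
from urllib.parse import urlparse

def is_gov_site(url):
    """Return True if the URL's host name ends in .gov."""
    # urlparse only fills in the host when the URL has a scheme,
    # so add one if it is missing (an assumption about the data).
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    return host.lower().endswith(".gov")

print(is_gov_site("https://doe.house.gov/contact"))
print(is_gov_site("https://www.janedoe.example/"))
```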

Well, that's going to be irksome.... Anyway, back to scraping. Let's randomly choose a website and go scrape it.

In [None]:
seed(20181001) # change your seed if you'd like a different candidate.
this_site = sample(list(name_website.values()),k=1)
print(this_site)

Now that we've picked the website, let's pull it using the requests library.

In [None]:
r = requests.get(this_site[0])

To see what's in the request result, type `r.` and press the tab key in a cell, and you'll see all the options. Some very useful ones are `status_code`, `headers`, `json`, and `text`. Take a look at all of these and see if you can figure out what we're getting. 

In [None]:
print(r.status_code)
print(r.headers)
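# json is a method, not an attribute, so this prints the bound method;
# calling r.json() only works when the response body is JSON
print(r.json)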
print(r.text)

So the response basically has all the information coming back from the page. We're going to use a very tiny amount of its functionality: pass the text into BeautifulSoup and use the BeautifulSoup parser to get the raw text out. Parsing HTML is not easy, but we can follow [this Stack Overflow answer](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) and get most of the way there. 

In [None]:
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

In [None]:
print(text_from_html(r.text))

Slick. I definitely couldn't have figured that out in any reasonable amount of time.
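
To see what the filter keeps and drops, here is a small sketch that runs the same two functions on a hand-written HTML string, so no request is needed. The HTML snippet is invented for illustration, and the functions are repeated so the cell runs on its own. This version uses `find_all(string=True)`, which is the current spelling of that call in recent BeautifulSoup releases.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # drop text that lives inside tags a browser would not display
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # drop HTML comments
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

# A made-up page: a title, a style block, a comment, and a script,
# plus two pieces of text a visitor would actually see.
demo_html = (
    "<html><head><title>Vote</title><style>p {color: red;}</style></head>"
    "<body><h1>Jane Doe</h1><!-- hidden note --><p>Jobs and schools.</p>"
    "<script>var x = 1;</script></body></html>"
)

demo_text = text_from_html(demo_html)
print(demo_text)
```

Only the heading and paragraph text should survive; the title, style rules, comment, and script are filtered out.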

---

Let's put all those pieces together and build a dictionary that has the candidate's name as the key and all the text from that candidate's website as the value. This will require a trick to grab the name from our first dictionary. There are more efficient ways to do this, but this is fast enough for our purposes. 

In [None]:
seed(20181002)

name_to_text = dict()

sites = sample(list(name_website.values()),k=5)

In [None]:
for this_site in sites :
    
    # first, let's get the names
    for name, site in name_website.items() :
        if this_site == site :
            this_name = name
            break
            
    # Now request the website
    r = requests.get(this_site)
    
    if r.status_code == requests.codes.ok : # check out the requests doc. What is this doing?
        name_to_text[this_name] = text_from_html(r.text)
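
One of the more efficient approaches is to sample `(name, site)` pairs directly, so there is no need to search the dictionary for the name afterwards. A minimal sketch on a toy dictionary (the names and URLs are placeholders, not real candidates):

```python
from random import sample, seed

# Toy stand-in for name_website
name_website_demo = {
    "Jane Doe": "https://janedoe.example",
    "John Roe": "https://johnroe.example",
    "Ann Poe": "https://annpoe.example",
}

seed(20181002)

# sample key/value pairs rather than just the values
picked = sample(list(name_website_demo.items()), k=2)

for this_name, this_site in picked:
    print("{} -> {}".format(this_name, this_site))
```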

So, now what? Well, there are a lot of things that we could do. For instance, let's look at the ten most common words on each candidate's website.

In [None]:
for name, site_text in name_to_text.items() :
    
    # let's split the text on whitespace, cast everything
    # to lowercase, and throw it in a counter.
    clean_text = site_text.split()
    clean_text = [item.lower() for item in clean_text]
    
    text_count = Counter(clean_text)
    
    print("Candidate: " + name + "\n")
    print(text_count.most_common(10))
    print("\n"*3)
    

Notice the counts. Some people apparently don't have a lot of text, so we're getting some "most common words" with just a couple of counts. Others have higher counts, but we're getting things like the candidates' names and a bunch of "stop words". We haven't talked about those yet, but they're just very common words that typically don't carry much information. It's relatively easy to strip them out. I'll bring them in, then repeat the same code as above, removing stop words.

Note: if the cell below doesn't work for you, follow these steps.

On a Mac: 

1. Open up a terminal window.
1. Type the following `sudo pip install -U nltk`
1. Test the installation by typing `python` (opens python in the console), then type `import nltk`

On Windows:

1. Open up a command window.
1. Type the following `conda install -c anaconda nltk`
1. Test the installation by typing `python` (opens python in the console), then type `import nltk`


In [None]:
import nltk
from nltk.corpus import stopwords
stopwords.words('english')[:20]

In [None]:
stop_words = set(stopwords.words('english'))

for name, site_text in name_to_text.items() :
    
    # let's split the text on whitespace, cast everything
    # to lowercase, and throw it in a counter.
    clean_text = site_text.split()
    clean_text = [item.lower() for item in clean_text]
    
    # remove the stopwords
    clean_text = [word for word in clean_text if word not in stop_words]
    
    text_count = Counter(clean_text)
    
    print("Candidate: " + name + "\n")
    print(text_count.most_common(10))
    print("\n"*3)
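
Since the same cleaning steps now appear twice, one option is to wrap them in a small function. The sketch below is one way to do that. The function name `top_words`, the tiny stop word set, and the sample sentence are all made up so the example runs without `nltk` or a network connection.

```python
from collections import Counter

def top_words(text, stop_words, n=10):
    """Lowercase the text, split on whitespace, drop stop words,
    and return the n most common remaining words with their counts."""
    words = [word.lower() for word in text.split()]
    words = [word for word in words if word not in stop_words]
    return Counter(words).most_common(n)

# A tiny hand-picked stop word set, only for this illustration
demo_stop_words = {"the", "and", "for", "a", "to"}

demo_text = "Jobs for the district and jobs for the state"
print(top_words(demo_text, demo_stop_words, n=2))
```

With the notebook's data, the equivalent call would be `top_words(site_text, stop_words)`.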
    

Okay, now we're starting to see some more interesting patterns. Sure, most of the words seem candidate-specific (names, districts, etc.), but the gist is there. Let's look across 25 candidates, count the most common words, and see if anything emerges. 

In [None]:
seed(20181001)

num_candidates = 25

# an empty list to hold all the words
candidate_words = []

sites = sample(list(name_website.values()),k=num_candidates)

for this_site in sites :
                
    # Now request the website
    r = requests.get(this_site)
    
    if r.status_code == requests.codes.ok : # check out the requests doc. What is this doing?
        site_text = text_from_html(r.text)
        
        clean_text = site_text.split()
        clean_text = [item.lower() for item in clean_text]
        clean_text = [word for word in clean_text if word not in stop_words]

        candidate_words.extend(clean_text)
        
print("Grabbed all the sites.")
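
With 25 live websites, a single site that is down or slow can raise an exception and stop the loop. One way to guard against that, sketched below, is to set a timeout and catch `requests.exceptions.RequestException`. The helper name `fetch_html` and the 10-second timeout are arbitrary choices for illustration, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails
    or the server does not answer with a 200 status."""
    try:
        r = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as err:
        # covers connection errors, timeouts, malformed URLs, etc.
        print("Skipping {}: {}".format(url, err))
        return None

    if r.status_code != requests.codes.ok:
        print("Skipping {}: status {}".format(url, r.status_code))
        return None

    return r.text
```

Inside the loop, you could call `html = fetch_html(this_site)` and only pass `html` to `text_from_html` when it is not `None`.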

In [None]:
Counter(candidate_words).most_common(20)

Based on what you're seeing here, what are the next things you might do to clean up this data? 

Okay, now it's time for you to do some original work. Write some code that does the following:

1. Randomly select 10 congressional websites.
1. Pull the text for the site, using the same tricks we did: cast to lowercase, clean out stopwords. Bonus: do the things you thought of in the previous question. For
1. Build a dictionary with the following form: `my_dict[party][gender][word] = count`. Note that you'll have to go back to the original file to get party and gender associated with the url.  
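
A nested `defaultdict` with a `Counter` at the innermost level is one way to get the `my_dict[party][gender][word] = count` shape without checking whether each key exists first. The sketch below only shows the data structure. The party and gender labels and the word lists are placeholder values, not taken from the candidate file.

```python
from collections import defaultdict, Counter

# results[party][gender] is a Counter mapping word -> count
results = defaultdict(lambda: defaultdict(Counter))

# Placeholder records: (party, gender, cleaned words from one site)
demo_records = [
    ("PartyA", "F", ["jobs", "schools", "jobs"]),
    ("PartyA", "M", ["taxes"]),
    ("PartyB", "F", ["jobs", "health"]),
]

for party, gender, words in demo_records:
    # update() adds these counts to whatever is already stored
    results[party][gender].update(words)

print(results["PartyA"]["F"]["jobs"])
print(results["PartyB"]["F"].most_common(1))
```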

In [None]:
# Start by getting the mapping of url to party and gender    

In [None]:
# now let's select 10 URLs

In [None]:
# build your results dictionary that will have the form
# results[party][gender][word] = count

In [None]:
# and now fill it in!