A notebook that does some basic web scraping using the `requests` library to fetch pages and Beautiful Soup to parse them.

In [None]:
import requests  # To get the pages
from bs4 import BeautifulSoup # and to process them

Let's scrape some webpages for some politicians. As I write this, the obvious candidates (pun intended) are Donald Trump and Joe Biden. Feel free to adjust the URLs to candidates that you find interesting. We may use these in some other contexts, so having two candidates on different sides of some issue could be nice.

In [None]:
sites = ["https://joebiden.com/",
         "https://www.donaldjtrump.com/"]

Let's take a look at the site in the first spot of our list. 

In [None]:
print(sites[0])
r = requests.get(sites[0])
r.status_code

After you pull a page, it's a good idea to see what the status code is. Here's a [link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to what the numbers mean. 
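
If you don't feel like clicking through, Python's standard library already knows the standard codes. Here's a small sketch that turns a number into its short description. The helper name `describe_status` is just something I made up for illustration, and it works on any integer, so you could hand it `r.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus only knows about the registered codes
        return f"{code}: not a standard HTTP status code"
    return f"{code}: {status.phrase}"


for code in (200, 404, 503):
    print(describe_status(code))
```

As a rule of thumb, codes in the 200s mean success, 300s are redirects, 400s mean something was wrong with our request, and 500s mean the server had a problem.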

Now let's look at the text that's on the page. Warning, this is going to be a mess.

In [None]:
r.text

I was right: that page was a mess. Let's try parsing it with Beautiful Soup:

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')

We can print a prettier version, but it's not _that_ much prettier.

In [None]:
print(soup.prettify())

One of the cool things we can do is search the soup to find things like `a` tags. Go look up what those tags are used for. 

In [None]:
all_a_tags = soup.find_all('a')

In [None]:
len(all_a_tags)

That's the number of `a` tags, and so roughly the number of links, on this page. Let's make a list of where they all point.

In [None]:
candidate_links = []

for link in soup.find_all('a'):
    candidate_links.append(link.get('href'))


In [None]:
candidate_links[:10]
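
Depending on the site, that list can contain things that aren't fetchable as-is: `None` for `a` tags without an `href`, relative paths like `/about/`, `mailto:` addresses, and duplicates. Here's a hedged sketch of one way to tidy such a list with `urllib.parse` from the standard library. The function name `normalize_links` and the example URLs are my own placeholders, not anything from the sites above.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping unique http(s) URLs in order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue  # skip None and empty strings
        # Turn relative paths into absolute URLs, then drop any #fragment
        absolute = urldefrag(urljoin(base_url, href)).url
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript:, and friends
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


example_hrefs = ["/about/", "news/today.html", "https://example.org/x",
                 None, "mailto:team@example.com", "#top", "/about/"]
print(normalize_links("https://example.com/", example_hrefs))
```

Whether you want to keep links that point off the candidate's own domain is a judgment call; this sketch keeps them.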

One thing we might want to do now is crawl each one of those pages to extract the text. Let's store the text in a dictionary with the URL as the key and the page text as the value. One trick we'll use is to extract just the visible text from the page, using the code found at this StackOverflow [answer](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text).

In [None]:
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
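
To get a feel for what "visible text" means without touching the network, here's a rough standard-library sketch of the same idea. To be clear, this is not the StackOverflow code and it doesn't use Beautiful Soup: it swaps in `html.parser.HTMLParser`, and `VisibleTextParser` and `visible_text` are names I invented for this example. It also leaves `meta` out of the hidden list, since `meta` tags have no closing tag and no text inside them.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside style, script, head, or title tags."""

    HIDDEN_TAGS = {"style", "script", "head", "title"}

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # how many hidden tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach handle_data, so they are dropped for free
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ("<html><head><title>T</title><style>p{}</style></head>"
          "<body><!-- a comment --><p>Hello</p>"
          "<script>var x = 1;</script><p>world</p></body></html>")
print(visible_text(sample))
```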


In [None]:
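candidate_text = dict()

for link in candidate_links :
    try :
        r = requests.get(link, timeout=10)
    except requests.exceptions.RequestException :
        # Missing, relative, or unreachable links land here. Skip them so we
        # don't reuse the response from a previous iteration.
        print(f"Could not retrieve this link: {link}")
        continue

    if r.status_code == 200 :
        soup = BeautifulSoup(r.text, 'html.parser')
        # string=True is the current spelling of the older text=True
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        candidate_text[link] = " ".join(t.strip() for t in visible_texts)
    else :
        print(f"We got code {r.status_code} for this link: {link}")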

Let's write out the results. Storing text data can be tricky, because the text often contains characters, like tabs and carriage returns, that we typically use to separate fields and rows in our files. We'll replace those with spaces in the file we're about to write out, so we can use tab delimiters. It's also nice to have a way to turn a URL into a nice file name. Here's a [function](https://stackoverflow.com/questions/9055249/simple-way-to-convert-a-url-into-a-filename) that does it.

In [None]:
def generate_filename_from_url(url) :
    
    if not url :
        return None
    
    # drop the scheme (http:// or https://) from the front only
    name = url.split("://", 1)[-1]

    # Replace useless characters with UNDERSCORE
    name = name.replace(".","_").replace("/","_")
    
    # remove any trailing underscore left behind by a trailing slash
    name = name.rstrip("_")

    # tack on .txt
    name = name + ".txt"
    
    return name


In [None]:
output_file_name = generate_filename_from_url(sites[0])

In [None]:
with open(output_file_name,'w',encoding = "UTF-8") as outfile :
    outfile.write("\t".join(["link","text"]) + "\n")
    for link in candidate_text :
        the_text = candidate_text[link]
        
        # get rid of some of our more annoying output chars
        the_text = the_text.replace("\t"," ").replace("\n"," ").replace("\r"," ") 
        
        if not link :
            link = "empty link"
        
        if the_text : # test to see if it is non-empty
            outfile.write("\t".join([link,the_text]) + "\n")
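
If you'd rather not chain `replace` calls, here's an alternative sketch, offered as one option rather than the right way. It collapses every run of whitespace with `str.split()` and lets the standard `csv` module handle the tab-delimited writing and reading. The name `clean_text` and the example row are made up, and I write to an in-memory `io.StringIO` buffer instead of a real file so the round trip is easy to check.

```python
import csv
import io


def clean_text(text):
    """Collapse tabs, newlines, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(text.split())


# Hypothetical scraped data, standing in for candidate_text
example_text = {"https://example.com/a": "Line one\nline\ttwo\r\n"}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["link", "text"])
for link, text in example_text.items():
    writer.writerow([link, clean_text(text)])

# Read it back to confirm each record stayed on one line with two fields
buffer.seek(0)
read_back = list(csv.reader(buffer, delimiter="\t"))
print(read_back)
```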
        

## Exercise

Create a new notebook with a name like "Basic Scraping 2". Rework this code so that it processes the full 
list of URLs in "sites", creating an output file for each site. Test it by adding a politician or two and scraping
them all. 