A notebook that does some basic webscraping using the `requests` library.

In [1]:
import requests  # To get the pages
from bs4 import BeautifulSoup # and to process them

Let's scrape some webpages for some politicians. As I write this, the obvious candidates (pun intended) are Donald Trump and 
Joe Biden. Feel free to adjust the URLs to candidates that you find interesting. We may use these in some other contexts, so having two candidates on different sides of some issue could be nice. 

In [2]:
sites = ["https://joebiden.com/",
         "https://www.donaldjtrump.com/"]

Let's take a look at the site in the first spot of our list. 

In [5]:
print(sites[0])
r = requests.get(sites[0])
r.status_code

https://joebiden.com/


200

In [4]:
r.status_code

200

After you pull a page, it's a good idea to see what the status code is. Here's a [link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to what the numbers mean. 

Now let's look at the text that's on the page. Warning, this is going to be a mess.

In [6]:
r.text

'<!DOCTYPE HTML>\n<html lang="en">\n<head>\n<title>Home | Donald J. Trump</title>\n<meta property="og:title" content="Save America" />\n<meta name="twitter:title" content="Save America">\n<meta name="description" content="Over the past four years, Donald J. Trump\'s administration delivered for Americans of all backgrounds like never before. Save America is about building on those accomplishments!">\n<meta property="og:description" content="Over the past four years, Donald J. Trump\'s administration delivered for Americans of all backgrounds like never before. Save America is about building on those accomplishments!" />\n<meta name="twitter:description" content="Over the past four years, Donald J. Trump\'s administration delivered for Americans of all backgrounds like never before. Save America is about building on those accomplishments!">\n<meta property="og:image" content="https://cdn.donaldjtrump.com/djtweb/general/seo-image.jpg" />\n<meta name="twitter:image" content="https://cdn.d

I was right, that page was a mess, so let's try Beautiful Soup:

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')

We can print a prettier version, but it's not _that_ much prettier.

In [8]:
print(soup.prettify())

<!DOCTYPE HTML>
<html lang="en">
 <head>
  <title>
   Home | Donald J. Trump
  </title>
  <meta content="Save America" property="og:title"/>
  <meta content="Save America" name="twitter:title"/>
  <meta content="Over the past four years, Donald J. Trump's administration delivered for Americans of all backgrounds like never before. Save America is about building on those accomplishments!" name="description"/>
  <meta content="Over the past four years, Donald J. Trump's administration delivered for Americans of all backgrounds like never before. Save America is about building on those accomplishments!" property="og:description">
   <meta content="Over the past four years, Donald J. Trump's administration delivered for Americans of all backgrounds like never before. Save America is about building on those accomplishments!" name="twitter:description"/>
   <meta content="https://cdn.donaldjtrump.com/djtweb/general/seo-image.jpg" property="og:image">
    <meta content="https://cdn.donaldjtru

One of the cool things we can do is search the soup to find things like `a` tags. Go look up what those tags are used for. 

In [9]:
all_a_tags = soup.find_all('a')

In [10]:
?soup.find_all

In [11]:
all_a_tags

[<a href="#main" id="skipnav">Skip to main content</a>,
 <a class="logo icon" href="/" id="header-logo">Back to Home Page</a>,
 <a class="level-1-link btn btn-text-white btn-border-white" href="https://secure.winred.com/save-america-joint-fundraising-committee/storefront?location=djt_sa_header" target="_blank">Shop</a>,
 <a class="level-1-link btn btn-filled" href="https://secure.winred.com/save-america-joint-fundraising-committee/early-renewal-founding-membership/?recurring=true&amp;money_pledge=true&amp;amount=100&amp;utm_medium=homepage&amp;utm_source=na_na_na&amp;utm_campaign=homepage-button_na_saveamerica&amp;utm_content=donate_cpyrs_na" target="_blank">Contribute</a>,
 <a class="level-1-link" href="/about">About</a>,
 <a class="level-1-link" href="/events">Events</a>,
 <a class="level-1-link" href="/news">News</a>,
 <a class="level-1-link" href="/alerts">Alerts</a>,
 <a class="level-1-link" href="https://www.45office.com/?location=djt_sa_header" target="_blank">Contact</a>,
 <a c

In [12]:
len(all_a_tags)

21

That's the number of links on this page. Let's make a list of all of those.

In [28]:
candidate_links = []

for link in soup.find_all('a'):
    candidate_links.append(link.get('href'))


In [29]:
candidate_links[:10]

['https://rsci.app.link/?%24canonical_url=https%3A%2F%2Fmedium.com%2F%40joebiden&~feature=LoOpenInAppButton&~channel=ShowUser&~stage=mobileNavBar&source=user_profile-------------------------------------',
 'https://medium.com/?source=user_profile-------------------------------------',
 '/@JoeBiden?source=user_profile-------------------------------------',
 '/m/signin?operation=login&redirect=https%3A%2F%2Fmedium.com%2F%40joebiden&source=user_profile--------------------------nav_reg-----------',
 'https://medium.com/?source=user_profile-------------------------------------',
 '/m/signin?actionUrl=%2F_%2Fapi%2Fsubscriptions%2Fnewsletters%2F4667f1531d8c&operation=register&redirect=https%3A%2F%2Fmedium.com%2F%40joebiden&newsletterV3=83aa09df6397&newsletterV3Id=4667f1531d8c&user=Joe%20Biden&userId=83aa09df6397&source=user_profile--------------------------subscribe_user-----------',
 '/@JoeBiden/followers?source=user_profile-------------------------------------',
 '/@JoeBiden/about?source=us

One thing we might want to do now is crawl each one of those pages to extract the text. Let's store the text in a dictionary that has the url as the key and the value is the text. One trick we'll use is to just extract visible text from the page, using the code found at this StackOverflow [answer](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text).

In [11]:
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


In [40]:
x = list(range(20))

In [42]:
def is_even(num) :
    if num % 2 == 0 :
        return True
    else :
        return False

In [48]:
(filter(is_even,x))

<filter at 0x7fc0b84aa820>

In [51]:
candidate_text = dict()

for link in candidate_links :
    try :
        r = requests.get(link)
    except :
        next 
    
    if r.status_code == 200 :
        soup = BeautifulSoup(r.text, 'html.parser')
        texts = soup.findAll(text=True)
        visible_texts = filter(tag_visible, texts) 
        candidate_text[link] = " ".join(t.strip() for t in visible_texts)
    else :
        print(f"We got code {r.status_code} for this link: {link}")

We got code 403 for this link: https://help.medium.com/hc/en-us?source=user_profile-------------------------------------


In [52]:
list(candidate_text.keys())

['https://rsci.app.link/?%24canonical_url=https%3A%2F%2Fmedium.com%2F%40joebiden&~feature=LoOpenInAppButton&~channel=ShowUser&~stage=mobileNavBar&source=user_profile-------------------------------------',
 'https://medium.com/?source=user_profile-------------------------------------',
 '/@JoeBiden?source=user_profile-------------------------------------',
 '/m/signin?operation=login&redirect=https%3A%2F%2Fmedium.com%2F%40joebiden&source=user_profile--------------------------nav_reg-----------',
 '/m/signin?actionUrl=%2F_%2Fapi%2Fsubscriptions%2Fnewsletters%2F4667f1531d8c&operation=register&redirect=https%3A%2F%2Fmedium.com%2F%40joebiden&newsletterV3=83aa09df6397&newsletterV3Id=4667f1531d8c&user=Joe%20Biden&userId=83aa09df6397&source=user_profile--------------------------subscribe_user-----------',
 '/@JoeBiden/followers?source=user_profile-------------------------------------',
 '/@JoeBiden/about?source=user_profile-------------------------------------',
 '/@JoeBiden/south-carolina-a-y

In [57]:
nltk.probability.FreqDist([w.lower() for w in candidate_text['http://JoeBiden.com'].split()]).most_common(10)

[('the', 86),
 ('to', 72),
 ('and', 68),
 ('of', 47),
 ('a', 45),
 ('—', 34),
 ('this', 30),
 ('we', 28),
 ('i', 27),
 ('in', 23)]

In [55]:
import nltk

Let's write out the results. Storing text data can be tricky, because often that text will have characters in it, like tabs and carriage returns, that we typically use to split up our files. We'll replace those with spaces in the file we're about to write out, so we can use tab delimiters. It's also nice to have a way to turn a URL into a nice file name. Here's a 
[function](https://stackoverflow.com/questions/9055249/simple-way-to-convert-a-url-into-a-filename)
that does it. 

In [13]:
def generate_filename_from_url(url) :
    
    if not url :
        return None
    
    # drop the http or https
    name = url.replace("https","").replace("http","")

    # Replace useless chareacters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # remove last underscore
    last_underscore_spot = name.rfind("_")
    
    name = name[:last_underscore_spot] + name[(last_underscore_spot+1):]

    # tack on .txt
    name = name + ".txt"
    
    return(name)


In [14]:
output_file_name = generate_filename_from_url(sites[0])

In [15]:
with open(output_file_name,'w',encoding = "UTF-8") as outfile :
    outfile.write("\t".join(["link","text"]) + "\n")
    for link in candidate_text :
        the_text = candidate_text[link]
        
        # get rid of some of our more annoying output chars
        the_text = the_text.replace("\t"," ").replace("\n"," ").replace("\r"," ") 
        
        if not link :
            link = "empty link"
        
        if the_text : # test to see if it is non-empty
            outfile.write("\t".join([link,the_text]) + "\n")
        

## Exercise

Create a new notebook with a name like "Basic Scraping 2". Rework this code so that it processes the full 
list of URLs in "sites", creating an output file for each site. Test it by adding a politician or two and scraping
them all. 