### Scraping Candidates' Sites

In this notebook we'll scrape candidates' websites and store the information for later analysis. We'll store the data locally in a database, which we'll build via Python to practice that skill. Let's get started!

In [1]:
import sqlite3
import requests

from bs4 import BeautifulSoup 
from bs4.element import Comment

import random 
# This package is good for randomization, which we may 
# use when we're pulling a fraction of the pages. 

import datetime # this is good for working with dates and times. It's a bit confusing though. 
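
The comment on `random` mentions pulling only a fraction of the pages. A minimal sketch of what that could look like is below. The list `demo_links`, the 30% fraction, and the seed are all made up for illustration and aren't used elsewhere in this notebook.

```python
import random

# Made-up list standing in for the links found on a homepage
demo_links = ["/page-" + str(i) for i in range(10)]

random.seed(20) # fixed seed so the sample is repeatable; the value is arbitrary

# Pull roughly 30% of the links, but always at least one
fraction = 0.3
n_to_pull = max(1, int(len(demo_links) * fraction))
demo_sample = random.sample(demo_links, n_to_pull)

print(demo_sample)
```

`random.sample` draws without replacement, so no link is picked twice.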

First, let's build a DB for this. Right now I'm assuming we'll have five fields in our only table:

* `dt`: the date and time when we pulled the page.
* `base_url`: the main URL we're pulling from. E.g., https://joebiden.com/ for Biden's site. 
* `url`: the specific URL we're pulling from. E.g., https://joebiden.com/joes-story/.
* `text`: the visible text of the page at `url`. 
* `pulled`: a boolean that is TRUE if we've tried to pull the text from the URL. This is useful for keeping track of which URLs we've already attempted. 

In [2]:
db = sqlite3.connect("candidate_websites.db") # feel free to change this to something you like. 
cur = db.cursor()

Now let's create the table in the DB. 

In [3]:
cur.execute('''DROP TABLE IF EXISTS site_text''')
cur.execute('''CREATE TABLE site_text (
    dt DATETIME, 
    base_url TEXT, 
    url TEXT,
    text TEXT,
    pulled BOOLEAN)''')
db.commit()
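
To see how the five fields fit together, here is a minimal sketch that builds the same table in a throwaway in-memory database, inserts one made-up row, and counts the attempted pages per site. The names `demo_db`, `demo_cur`, and `demo_row` and the placeholder text are assumptions for illustration, not data from a real pull.

```python
import sqlite3
import datetime

# Throwaway in-memory database, so candidate_websites.db is left untouched
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()

demo_cur.execute('''CREATE TABLE site_text (
    dt DATETIME, 
    base_url TEXT, 
    url TEXT,
    text TEXT,
    pulled BOOLEAN)''')

# A made-up row; the text is a placeholder, not scraped content
demo_row = [datetime.datetime.now().isoformat(" "),
            "https://joebiden.com/",
            "https://joebiden.com/joes-story/",
            "placeholder page text",
            1]

demo_cur.execute('''INSERT INTO site_text (dt,base_url,url,text,pulled) 
               VALUES (?,?,?,?,?)''', demo_row)
demo_db.commit()

# Count the pages we've tried to pull for each site
demo_counts = demo_cur.execute('''SELECT base_url, COUNT(*) 
                                  FROM site_text 
                                  WHERE pulled = 1 
                                  GROUP BY base_url''').fetchall()
print(demo_counts)

demo_db.close()
```

SQLite has no separate boolean type, so `pulled` is stored as the integer 1 or 0.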

Now we'll build off the previous notebook (`Intro to Scraping.ipynb`) to scrape candidates' sites. Let's begin by reading the list of websites.  

In [4]:
sites = []
with open("candidates_websites.txt",'r') as infile :
    for line in infile :
        if line.strip() : # skip blank lines, which aren't valid URLs
            sites.append(line.strip())

In [5]:
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
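
`tag_visible` decides whether a piece of text would be seen by someone reading the page. It rejects text inside tags like `<script>` and `<style>`, and it rejects HTML comments. Here is a small sketch of it at work on a tiny made-up page. The names `demo_html`, `demo_soup`, and `demo_visible` are just for this example, and the function is repeated so the cell runs on its own.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Same filter as in the cell above, repeated so this example is self-contained
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

# A tiny made-up page, not taken from any candidate's site
demo_html = """<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Hello</h1><!-- hidden note --><p>Visible paragraph.</p>
<script>var x = 1;</script></body></html>"""

demo_soup = BeautifulSoup(demo_html, 'html.parser')
demo_texts = demo_soup.find_all(string=True)

# Keep the visible strings and drop the ones that are only whitespace
demo_visible = [t.strip() for t in demo_texts if tag_visible(t) and t.strip()]
print(demo_visible)
```

Only the heading and the paragraph text should survive. The title, the CSS, the comment, and the JavaScript are filtered out.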

In [6]:
from urllib.parse import urljoin # resolves relative links against the site's URL

for this_site in sites :
    
    print(this_site)

    # First get the links on the homepage
    r = requests.get(this_site)
    soup = BeautifulSoup(r.text, 'html.parser')

    links = [link.get('href') for link in soup.find_all('a')]
    links = [link for link in links if link] # drop <a> tags with no href

    # Now loop through the links and get the pages
    for sub_link in sorted(set(links)) :
        # urljoin leaves absolute URLs alone and attaches relative ones to the site
        good_sub_link = urljoin(this_site, sub_link)

        # Skip anything that isn't a web page (mailto:, tel:, javascript:, ...)
        if not good_sub_link.startswith('http') :
            continue

        r = requests.get(good_sub_link)

        # Reset for every page so a failed pull doesn't store the previous page's text
        page_text = None

        if r.status_code == 200 :
            soup = BeautifulSoup(r.text, 'html.parser')
            texts = soup.find_all(string=True)
            visible_texts = filter(tag_visible, texts) 
            page_text = " ".join(t.strip() for t in visible_texts)

        # Store the timestamp as text, in the format sqlite3 has used by default
        new_row = [datetime.datetime.now().isoformat(" "),
                   this_site,
                   good_sub_link,
                   page_text, 
                   1] # pulled = 1 means we tried this URL, even if we got no text

        cur.execute('''INSERT INTO site_text (dt,base_url,url,text,pulled) 
               VALUES (?,?,?,?,?)''',new_row)

    db.commit() # don't forget this!


https://joebiden.com/
https://berniesanders.com/
https://elizabethwarren.com
https://kamalaharris.org/
https://peteforamerica.com/
https://betoorourke.com/
https://corybooker.com/
https://amyklobuchar.com
https://www.yang2020.com/
https://www.julianforthefuture.com/
https://www.tulsi2020.com/
https://www.tomsteyer.com/
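
The loop above builds full URLs from the links found on each homepage. One way to do that join is `urljoin` from the standard library's `urllib.parse`, which resolves relative links against a base URL and leaves absolute ones alone. Here is a small sketch. The `example.org` and `example.com` links are made up for illustration.

```python
from urllib.parse import urljoin

demo_site = "https://joebiden.com/"

# Relative link: attached to the site without doubling the slash
relative = urljoin(demo_site, "/joes-story/")

# Absolute link (made up): returned unchanged
absolute = urljoin(demo_site, "https://example.org/page")

# Non-web link (made up): keeps its own scheme, so it's easy to spot and skip
email = urljoin(demo_site, "mailto:info@example.com")

print(relative)
print(absolute)
print(email)
```

Because a `mailto:` link comes back unchanged, checking whether the joined URL starts with `http` is a simple way to skip links that aren't web pages.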


In [7]:
db.close()

In [10]:
line = "adkkdd; dkdkdkd; 1334; akdkdkd"

In [11]:
line = line.split(";")

In [12]:
",".join(line)

'adkkdd, dkdkdkd, 1334, akdkdkd'