### Scraping Candidates' Sites

In this notebook we'll scrape candidates' websites and store the information for later analysis. We'll store the data locally in a database, which we'll build via Python to practice that skill. Let's get started!

In [None]:
import sqlite3
import requests

from bs4 import BeautifulSoup 
from bs4.element import Comment

import random 
# this package is good for randomization, which we may
# use when we're pulling a fraction of the pages. 

import datetime # this is good for working with dates and times. It's a bit confusing though. 

First, let's build a DB for this. Right now I'm assuming we'll have five fields in our only table:

* `dt`: the date and time when we pulled the page.
* `base_url`: the main URL we're pulling from. E.g., https://joebiden.com for Biden's site. 
* `url`: the specific URL we're pulling from. E.g., https://joebiden.com/joes-story/.
* `text`: the text of the `url`. 
* `pulled`: a boolean that is TRUE (stored as 1 in SQLite) once we've tried to pull the text from the URL. This is useful for keeping track of which pages are done. 

In [None]:
db = sqlite3.connect("candidate_websites_2.db") # feel free to change this to something you like. 
cur = db.cursor()

Now let's create the table in the DB. 

In [None]:
cur.execute('''DROP TABLE IF EXISTS site_text''')
cur.execute('''CREATE TABLE site_text (
    dt DATETIME, 
    base_url TEXT, 
    url TEXT,
    text TEXT,
    pulled BOOLEAN)''')
db.commit()
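
If you want to confirm that the table came out with the columns described above, one option is SQLite's `PRAGMA table_info`. Here's a minimal sketch. It uses a throwaway in-memory database and the names `demo_db` and `demo_cur`, which are assumptions for illustration so the check doesn't touch the file created above.

```python
import sqlite3

# Throwaway in-memory database (an assumption for this sketch; the notebook
# itself writes to a file on disk).
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute('''CREATE TABLE site_text (
    dt DATETIME,
    base_url TEXT,
    url TEXT,
    text TEXT,
    pulled BOOLEAN)''')

# Each row returned by PRAGMA table_info is
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in demo_cur.execute("PRAGMA table_info(site_text)")]
for name, declared_type in columns:
    print(name, declared_type)

demo_db.close()
```

Running the same `PRAGMA` through `cur` should list the same five columns for the real database.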

Now we'll build off the previous notebook (`Intro to Scraping.ipynb`) to scrape candidates' sites. Let's begin by reading the list of websites.  

In [None]:
sites = []
with open("candidates_websites.txt", "r") as infile:
    for line in infile:
        line = line.strip()
        if line:  # skip blank lines so we don't end up with empty URLs
            sites.append(line)

### Your Turn

Here's where you take over. We want to pull all the text from each candidate's website. This is probably a two-step process. Step one is to get all the links from the homepage. Step two is to follow each link, grab the text, and store it in the DB. Try starting small by pulling a single page and putting it into the DB. Here's an example. 

In [None]:
this_url = "https://facts.elizabethwarren.com/"
r = requests.get(this_url, timeout=10) # a timeout keeps a slow site from hanging the notebook

# get the visible text from this page. Use the Intro notebook as your guide!
page_text = "some text that you'll replace with the result of excellent work!"

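# Store the timestamp as ISO text; sqlite3's implicit datetime
# adapter is deprecated as of Python 3.12.
new_row = [datetime.datetime.now().isoformat(sep=" "),
           "https://elizabethwarren.com",
           this_url,
           page_text, 
           1] # pulled = TRUE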

cur.execute('''INSERT INTO site_text (dt,base_url,url,text,pulled) 
       VALUES (?,?,?,?,?)''',new_row)

db.commit() # don't forget this!
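
For the `page_text` placeholder, one common way to get only the visible text is to walk every string in the parsed page and drop the ones that live inside non-rendered tags or HTML comments. The sketch below is one possible approach, not necessarily the one in the Intro notebook. The function names `tag_visible` and `text_from_html` and the `sample_html` string are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Tags whose contents a browser does not show as page text.
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}

def tag_visible(element):
    """Return True if this string would be visible to a reader."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(html):
    """Join all visible strings in the page with single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    strings = soup.find_all(string=True)
    visible = (s.strip() for s in strings if tag_visible(s))
    return " ".join(s for s in visible if s)

# A tiny stand-in page so the sketch runs without a network request.
sample_html = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><!-- hidden --><p>Hello</p>"
    "<script>var x = 1;</script><p>world</p></body></html>"
)
print(text_from_html(sample_html))
```

In the cell above, something like `page_text = text_from_html(r.text)` would replace the placeholder string.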

Okay, this got you one page. Now see if you can get a list of every link on *one* candidate's homepage, then pull the text from each linked page and put it into the DB.
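
As a starting point for the link-gathering step, here's a minimal sketch that collects the links on a page, resolves relative ones against the base URL, and keeps only those on the same host. The function name `links_from_html` and the `sample_home` markup are assumptions for illustration, and whether to drop off-site links is a design choice you may want to change.

```python
from urllib.parse import urljoin, urlparse, urldefrag
from bs4 import BeautifulSoup

def links_from_html(html, base_url):
    """Return a sorted list of unique same-host links found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(base_url).netloc
    links = set()
    for a in soup.find_all("a", href=True):
        # Resolve relative links and drop any "#fragment" part.
        full = urldefrag(urljoin(base_url, a["href"])).url
        parsed = urlparse(full)
        # Keep only web links that stay on the candidate's own site.
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            links.add(full)
    return sorted(links)

# A tiny stand-in homepage so the sketch runs without a network request.
sample_home = """
<a href="/joes-story/">Story</a>
<a href="https://joebiden.com/issues/#top">Issues</a>
<a href="https://twitter.com/joebiden">Twitter</a>
<a href="mailto:info@joebiden.com">Email</a>
"""
found = links_from_html(sample_home, "https://joebiden.com/")
print(found)
```

With a real page you would pass `r.text` and the candidate's base URL, then loop over the result, pulling each link and inserting a row as in the single-page example.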