# Example Scraping to Local Postgres

First, create a database from the terminal.

Type `psql` to open the Postgres shell.

Then run `CREATE DATABASE capstone_scrape;`. Postgres should respond with `CREATE DATABASE`, which confirms that the database exists and that you now have something to connect to.

In [1]:
# Imports and settings
import requests
from bs4 import BeautifulSoup
import psycopg2
config = {'host':"localhost",'database':"capstone_scrape", 'user':"Tim", 'password':""}
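
The `config` dictionary above holds the same values that the later cells pass to `psycopg2.connect` by hand. As a minimal sketch of how it could be reused, the hypothetical helper `connection_kwargs` (not part of the original notebook) copies the dictionary and applies any per-call overrides, so the connection details live in one place.

```python
# Same values as the `config` dict defined in the cell above.
config = {"host": "localhost", "database": "capstone_scrape", "user": "Tim", "password": ""}


def connection_kwargs(base, **overrides):
    """Return a copy of `base` with optional overrides applied (hypothetical helper)."""
    kwargs = dict(base)       # copy, so the shared config is never mutated
    kwargs.update(overrides)  # e.g. database="postgres" for a one-off connection
    return kwargs


# Intended usage (needs psycopg2 and a running Postgres server):
#   conn = psycopg2.connect(**connection_kwargs(config))
#   conn = psycopg2.connect(**connection_kwargs(config, database="postgres"))
```

Unpacking with `**` works here because the keys of `config` match the keyword arguments the notebook already uses when connecting.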

Next, define a table in the database and decide what it is going to hold. Before this stage, make sure you understand your source data and what you want to store, because changing the schema later means migrating or re-scraping the data you have already collected.

In [2]:
### CREATE TABLES IN DATABASE EXAMPLE

def create_table(database):
    """
    USE WITH CAUTION - DEMO (NOT FOR PRODUCTION)
    Checks if a table exists in Database, DROPS it, and then creates table in the PostgreSQL database
    """
    commands = (
        """
        DROP TABLE IF EXISTS scrape
        """,
        """
        CREATE TABLE scrape (
            id SERIAL PRIMARY KEY,
            url VARCHAR(255) NOT NULL,
            visited VARCHAR(255) NOT NULL,
            title VARCHAR(255) NOT NULL
            )       
        """)
    conn = None
    try:
        # connect to the PostgreSQL server
        conn = psycopg2.connect(host="localhost",database=database, user="Tim", password="")
        cur = conn.cursor()
        # Drop if exists then add table
        for command in commands:
            cur.execute(command)
        # close communication with the PostgreSQL database server
        cur.close()
        # commit the changes
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

In [3]:
create_table("capstone_scrape")

In [4]:
def insert_data(url,visited,title):
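    """
    CLASSROOM EXAMPLE - NOT PRODUCTION TESTED

    Insert a new row into the scrape table and return its generated id
    (or None if the insert failed).
    """
    sql = """INSERT INTO scrape (url, visited, title) VALUES(%s, %s, %s) RETURNING id;"""
    conn = None
    tbl_id = None  # stays None if the insert fails, so the return below cannot raise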
    try:
        # connect to the PostgreSQL database
        conn = psycopg2.connect(host="localhost",database="capstone_scrape", user="Tim", password="")
        # create a new cursor
        cur = conn.cursor()
        # execute the INSERT statement
        data = (url,visited,title)
        cur.execute(sql, data)
        # get the generated id back
        tbl_id = cur.fetchone()[0]
        # commit the changes to the database
        conn.commit()
        # close communication with the database
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
    return tbl_id

In [5]:
# Example of an individual scraping helper
# You will need lots of these helper functions, one for each element you want to store
# It's important that each function returns the object type that the database is expecting

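def extract_title_from_soup(soup):
    # soup.title is None when the page has no <title> tag, and .string is None
    # when the tag is empty or contains nested markup
    title = soup.title.string if soup.title is not None else None
    if not title or not title.strip():
        return "No Title"
    return title.strip()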
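
Every text column in the `scrape` table is `VARCHAR(255) NOT NULL`, so a missing value or an overlong string would make the insert fail. The sketch below shows one way a helper could coerce whatever it scraped into a value the table accepts. The name `to_varchar` and its defaults are assumptions for illustration, not part of the original notebook.

```python
def to_varchar(value, default="No Title", max_len=255):
    """Coerce a scraped value to a non-empty string of at most `max_len` characters."""
    if value is None:
        return default
    # Collapse runs of whitespace (including newlines) into single spaces
    text = " ".join(str(value).split())
    if not text:
        return default
    # Truncate so the value fits a VARCHAR(max_len) column
    return text[:max_len]
```

A helper such as `extract_title_from_soup` could pass its result through `to_varchar` before returning. Silent truncation is a design choice: it keeps the scrape running, at the cost of losing the tail of very long values.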

In [6]:
urls = ["http://www.techcrunch.com","http://news.ycombinator.com","http://www.ft.com"]

problem_urls = []

for url in urls:
    try:                                               # try the below
        r = requests.get(url)                          # Get the Page
        soup = BeautifulSoup(r.text,'html.parser')     # Parse the page with Beautiful Soup
        title = extract_title_from_soup(soup)          # Call the function to extract the title from the page 
        save = insert_data(url,1,title)                 # Call the function to save the data into our DB
        if not isinstance(save, int):                  # if our save function did not return a row id (an integer), store the url
            problem_urls.append(url)
    except Exception:                                  # for any error above, store the url that caused the problem
        problem_urls.append(url)
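
The loop above records which URLs failed but not why. As a rough sketch, the same fetch, parse, save sequence can be wrapped in a function that takes the three steps as arguments and keeps the reason for each failure. The name `scrape_all` and its parameters are hypothetical; passing the steps in also lets you try the loop with stand-in functions before touching the network or the database.

```python
def scrape_all(urls, fetch, parse, save):
    """Run fetch -> parse -> save for each url; return a list of (url, reason) failures."""
    problems = []
    for url in urls:
        try:
            html = fetch(url)           # e.g. download the page
            title = parse(html)         # e.g. extract the title
            row_id = save(url, title)   # e.g. insert the row, returning its id
            if not isinstance(row_id, int):
                problems.append((url, "save did not return a row id"))
        except Exception as error:
            # Keep the reason so failed urls can be diagnosed later
            problems.append((url, repr(error)))
    return problems


# Possible wiring with the functions from this notebook (not run here):
#   problems = scrape_all(
#       urls,
#       fetch=lambda u: requests.get(u, timeout=10).text,
#       parse=lambda html: extract_title_from_soup(BeautifulSoup(html, "html.parser")),
#       save=lambda u, t: insert_data(u, 1, t),
#   )
```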

## Inspecting the Results of Your Scrape

In [7]:
import pandas as pd
from sqlalchemy import create_engine

In [8]:
engine = create_engine('postgresql://localhost:5432/capstone_scrape')

In [9]:
pd.read_sql("SELECT * FROM scrape", con=engine)

|   | id | url | visited | title |
|---|----|-----|---------|-------|
| 0 | 1 | http://www.techcrunch.com | 1 | TechCrunch – Startup and Technology News |
| 1 | 2 | http://news.ycombinator.com | 1 | Hacker News |
| 2 | 3 | http://www.ft.com | 1 | Financial Times |


### Clear the Table

In [10]:
sql = """DELETE FROM scrape
WHERE id > 0;"""

conn = psycopg2.connect(host="localhost",database="capstone_scrape", user="Tim", password="")
# create a new cursor
cur = conn.cursor()
# execute the DELETE statement
cur.execute(sql)
# commit the changes to the database
conn.commit()
# close communication with the database
cur.close()
conn.close()