In [1]:
from pymongo import MongoClient
import pprint

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Requests sends and receives HTTP requests.
import requests

# Beautiful Soup parses HTML documents in Python.
from bs4 import BeautifulSoup

import json
import time
import copy 
from bson import ObjectId

# Web Scraping


##  Success criteria:

I will be successful today if I can...

1. Identify some use cases for MongoDB as opposed to a traditional RDBMS
2. Create, Read, Update and Delete documents using the mongo shell
3. Use pymongo and requests to pull a website's HTML into a collection
4. Use Beautiful Soup to parse HTML

## Outline

1. *Review* MongoDB and PyMongo
2. *Explain* and *Use* the basic concepts of HTML with regard to fetching data:
    * Connecting to web pages from Python
    * Parsing HTML in Python
3. *Write* code to pull elements from websites using the BeautifulSoup library 
4. *Describe* a typical pipeline for scraping data from the web and apply this process to scrape monster.com for data science jobs in Denver.
5. *Use* public APIs to fetch pre-formatted data using the requests library.

## Resources

* [w3 schools](http://www.w3schools.com/) : HTML tags and their attributes.
* [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

-----

# Review MongoDB and PyMongo

MongoDB is a NoSQL database program, and PyMongo is the Python driver used to talk to it.  Let's first compare SQL to NoSQL. 

## SQL versus NoSQL 

### Example: Structure of a Relational Database versus a Document Oriented Database: 

<center><img src="img/sql_vs_mongo_table.png" style="width: 700px"></center>

![](img/rd_mongo.jpg)


## MongoDB

### What's it about? 

* MongoDB is a document-oriented database, an alternative to RDBMS, used for storing semi-structured data.
* JSON-like objects form the data model, rather than RDBMS tables.
* Schema is optional.
* Sub-optimal for complicated queries.

### Structure of the database.

* MongoDB is made up of databases which contain collections (analogous to tables).
* A collection is made up of documents (analogous to rows or records).
* Each document is a JSON object made up of key-value pairs (analogous to columns).

So an RDBMS defines its columns at the table level, while a document-oriented database defines its fields at the document level.
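
To make the contrast concrete, here is a minimal sketch in plain Python with no database involved. The field names and values are made up for illustration. The table needs every row to fill the same columns, while each document in the collection carries its own set of fields.

```python
import json

# A relational table fixes its columns once, for every row.
table_columns = ("id", "name", "email")
table_rows = [
    (1, "Ada", "ada@example.com"),
    (2, "Grace", None),  # a missing value still occupies the column
]

# A document collection lets each document carry its own fields.
collection = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Grace", "languages": ["COBOL", "FORTRAN"]},
]

for document in collection:
    print(sorted(document.keys()))

# Documents serialize naturally to JSON-like text.
print(json.dumps(collection[1]))
```

Note that the second document has no `email` key at all, rather than an empty `email` column.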

## PyMongo 

In this lecture, we will create a database and collection.  Then we will insert documents into a collection. Let's see how to do this:

In [None]:
# Connect to the MongoDB instance running on this machine
client = MongoClient('localhost', 27017)

In [None]:
db = client['reddit'] # create or access already existing database 

In [None]:
db.list_collection_names() #any collections available? 

In [None]:
comments = db['comments'] # access already existing collection (this would also create a new collection)

In [None]:
for comment in comments.find():  # find documents in collection from this morning
    print(comment)

In [None]:
comments.insert_one(
            {'comment_id':'laugher',
            'body': 'lol'}
)

In [None]:
for comment in comments.find():  # find documents in collection from this morning
    print(comment)

In [None]:
#Delete an extra item if you ran the above insert_one twice
#db.comments.delete_one({'_id': ObjectId('6054e7743794b58c916539eb')})
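
The cells above cover create, read and delete. As a hedged sketch of the full cycle, the function below also shows an update with `update_one` and the `$set` operator. It accepts any PyMongo-style collection, and the `comment_id` value `'demo'` is a made-up identifier that is assumed not to exist in the collection yet.

```python
def crud_walkthrough(collection):
    """Run one create, read, update, delete cycle on a PyMongo-style collection."""
    # Create: insert a single document.
    collection.insert_one({'comment_id': 'demo', 'body': 'lol'})

    # Read: fetch the document back with a filter.
    created = collection.find_one({'comment_id': 'demo'})

    # Update: $set changes only the listed fields.
    collection.update_one({'comment_id': 'demo'}, {'$set': {'body': 'haha'}})
    updated = collection.find_one({'comment_id': 'demo'})

    # Delete: remove the document so the collection is left as we found it.
    collection.delete_one({'comment_id': 'demo'})
    remaining = collection.find_one({'comment_id': 'demo'})

    return created['body'], updated['body'], remaining
```

With a MongoDB server running locally, `crud_walkthrough(comments)` should leave the `comments` collection unchanged once it finishes.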

-----

# *Explain* and *Use* the basic concepts of HTML with regard to fetching data

* Documents on the web are generally written in <span style="text-decoration: underline">**H**yper**T**ext **M**arkup **L**anguage</span>, HTML, which can be natively viewed by browsers, the tool that we use to browse the web.

### HTML

**H**yper**T**ext **M**arkup **L**anguage

A *markup language* (think markdown) that forms the building blocks of all websites. It specifies not just the text of the document but also its organization (into sections, paragraphs, lists and such). It can also control the layout of the document (the font, color, size and such), though that is properly handled with Cascading Style Sheets (CSS).

It consists of opening and closing tags enclosed in angle brackets (like `<html>` and `</html>`) often with more HTML in between.

A minimal HTML document, unfortunately, contains a lot of cruft.  Here's one I got from [https://www.sitepoint.com/a-minimal-html-document/](https://www.sitepoint.com/a-minimal-html-document/).


```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head>
  
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>title</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body>
		
  </body>
</html>
```

The key=value pairs inside of a tag are called attributes. The `<link>` and `<script>` tags aren't necessary, but appear in more or less every HTML document.

* The `<link>` tag points to a **stylesheet**, which controls how different parts of the document are rendered in the browser.  This makes things pretty.
* The `<script>` tag points to a **JavaScript** program.  This allows programmers to add *dynamic behaviour* to an HTML document.
* The `<body>` tag contains the guts of your document.
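
To see tags and attributes from Python's point of view, here is a small sketch that uses the standard library's `html.parser` module (not Beautiful Soup, which comes later) on the `<head>` portion of the document above. The class name `TagPrinter` is just an illustrative choice.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Collect every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (key, value) pairs.
        self.seen.append((tag, dict(attrs)))


snippet = '''
<head>
  <title>title</title>
  <link rel="stylesheet" type="text/css" href="style.css">
  <script type="text/javascript" src="script.js"></script>
</head>
'''

parser = TagPrinter()
parser.feed(snippet)
for tag, attributes in parser.seen:
    print(tag, attributes)
```

Each attribute shows up as a key-value pair, which is why Beautiful Soup later lets us read them with dictionary syntax.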

### Important Tags

```html
<a href="http://www.w3schools.com">A hyperlink to W3Schools.com!</a>

<h1>This is a header!</h1>

<p>This is a paragraph!</p>

<h2>This is a Subheading!</h2>

<table>
  This is a table!
  <tr>
    <th>The header in the first row.</th>
    <th>Another header in the first row.</th>
  </tr>
  <tr>
    <td>An entry in the second row.</td>
    <td>Another entry in the second row.</td>
  </tr>
</table>

<ul>
  This is an unordered list!
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>
<div>Specifies a division of the document, generally with additional attributes specifying layout and behavior.</div>
A <span>span is similar</span> but occurs in the middle of a line.

```

## HTTP Requests

To get data from the web, you need to make an HTTP request.  The two most important request types are:

* GET (queries data, no data is *sent*)
    - used for fetching documents
* POST (updates data, *data must be sent*)
    - used for updating data


Usually HTTP requests are sent by browsers (like Chrome or Safari) but `curl` is a command line program for sending HTTP requests.  It's easy to send a `GET` request to a url.
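
The difference between the two request types can be sketched in Python without sending anything. This example uses the standard library's `urllib` rather than `curl`, and the endpoint is a hypothetical placeholder. It only builds the request objects and asks each one which HTTP method it would use.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only to build (not send) the requests.
url = 'http://example.com/comments'

# No body attached: urllib treats this as a GET.
get_request = Request(url)

# A body attached: urllib treats this as a POST.
form_data = urlencode({'comment_id': 'laugher', 'body': 'lol'}).encode('utf-8')
post_request = Request(url, data=form_data)

print(get_request.get_method())
print(post_request.get_method())
print(post_request.data)
```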

## Requests Library

* The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests within Python.
* The interface is mindbogglingly simple:
    1. Call a request function, most often `requests.get`, with the URL and any optional parameters you'd like passed along with the request.
    2. The returned response object makes the results of the request available via attributes and methods, i.e. we now have a Python object representation of the web page to play with.
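
Before touching the network, it can help to see what a request with optional parameters looks like. The sketch below builds a request and prepares it without sending it. The URL and the parameter names are hypothetical and only there for illustration.

```python
import requests

# Hypothetical URL and parameters, for illustration only.
request = requests.Request(
    'GET',
    'https://example.com/search',
    params={'q': 'data scientist', 'page': 2},
)

# prepare() builds the exact request that would go over the wire.
prepared = request.prepare()
print(prepared.method)
print(prepared.url)
```

The parameters end up encoded in the query string of the URL, which is the same thing `requests.get(url, params=...)` does before sending.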
    
Let's do a simple demo where I get the hypertext from deertier.com:

In [None]:
import requests
deer_tier_url = 'http://deertier.com/Leaderboard/AnyPercentRealTime'
r = requests.get(deer_tier_url)

A status code of `200` means that everything went well.

In [None]:
r.status_code

We can also get the HTML via the `text` attribute.  Below I will use the pprint module to make the printed text readable.

In [None]:
import pprint
pprint.pprint(r.text[:1000])

-----

# *Write* code to pull elements from websites using the BeautifulSoup library 

In order to demo Beautiful Soup, we will first parse a simple webpage in this repo saved to the file name **basic.html**.  First let's read in this file as a string and pass it along to beautiful soup. 

In [None]:
# read in html
with open('basic.html', 'r') as myfile:
    html_str = myfile.read()

In [None]:
# pass to beautiful soup
soup = BeautifulSoup(html_str, 'html.parser')

In [None]:
print(soup.prettify())

####  Beautiful Soup: Tools and Finding tags

First, you can easily find HTML tags in the original document with Beautiful Soup.  One way is to use the tag's name.  For example, below I find the title of our document:

In [None]:
soup.title

A tag may have attribute(s). You can access these attributes like a dictionary:

In [None]:
soup.div

In [None]:
soup.div['class']

Next, let's try to grab the second table. Accessing a tag by name, as in `soup.table`, returns only the first matching tag:

In [None]:
soup.table

If you want to grab all the `<table>` tags you can use the find_all() method: 

In [None]:
tables = soup.find_all('table')

Note that `find_all()` accepts both a tag name and HTML attributes. If you pass a second positional argument without naming an attribute, it is matched against the `class` attribute.

In [None]:
soup.find_all('div','myDiv')
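
As a small sketch of the different ways to spell that search, the example below runs on made-up markup. Only the class name `myDiv` comes from the cell above. The positional form, the `class_` keyword and the `attrs` dictionary should all select the same tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name 'myDiv' comes from the example above.
html = '''
<div class="myDiv" id="first">one</div>
<div class="other">two</div>
<div class="myDiv extra">three</div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

positional = demo_soup.find_all('div', 'myDiv')
keyword = demo_soup.find_all('div', class_='myDiv')
by_attrs = demo_soup.find_all('div', attrs={'class': 'myDiv'})
by_id = demo_soup.find_all('div', id='first')

print([div.text for div in positional])
print(positional == keyword == by_attrs)
print([div.text for div in by_id])
```

A tag with several classes, such as `class="myDiv extra"`, still matches a search for `myDiv`.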

In [None]:
indices = tables[1].find_all('th')
rows = tables[1].find_all('tr')
indices,rows

In [None]:
all_data = []
columns = {}
for index in indices:
    columns[index.text] = None
print(columns)

In [None]:
all_data = []
keys = list(columns.keys())
for i,row in enumerate(rows):
    if i > 0:
        new_row = copy.copy(columns)
        entries = row.find_all('td')
        for j,entry in enumerate(entries):
            new_row[keys[j]]= entry.text
        all_data.append(new_row)
all_data
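
The same idea can be written more compactly by pairing the header names with each row's cells using `zip`. This is a sketch on a made-up table that stands in for the one in `basic.html`, so the column names and values are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical table standing in for the second table in basic.html.
html = '''
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alice</td><td>10</td></tr>
  <tr><td>bob</td><td>7</td></tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').table

headers = [th.text for th in table.find_all('th')]
records = []
for row in table.find_all('tr'):
    cells = [td.text for td in row.find_all('td')]
    if cells:  # the header row has no <td> cells, so it is skipped
        records.append(dict(zip(headers, cells)))

print(records)
```

One difference from the loop above: `zip` stops at the shorter sequence, so a row with missing cells simply lacks those keys instead of holding `None`.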

## *Describe* a typical pipeline for scraping data from the web.

We now have all the tools we need to scrape a webpage! Let's first define the process: 

![](img/web_scraping_procedure.png)
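
In code, the process might be sketched as the loop below: request the raw HTML, save it, parse it, and repeat for the next page. The function name `scrape` and its `fetch`, `store` and `parse` arguments are hypothetical. They are passed in by the caller so that the sketch runs here with simple stand-ins instead of requests, PyMongo and Beautiful Soup.

```python
import time


def scrape(urls, fetch, store, parse, delay=0.0):
    """Fetch each URL, store the raw HTML, then parse it.

    fetch, store and parse are caller-supplied functions (hypothetical
    names), so the same loop works with real libraries or stand-ins.
    """
    results = []
    for url in urls:
        html = fetch(url)                    # request the raw HTML
        store({'url': url, 'html': html})    # save it before parsing
        results.append(parse(html))          # extract the data you need
        time.sleep(delay)                    # pause between requests
    return results


# Stand-ins so the sketch runs without a network or a database.
fake_pages = {'page-1': '<h1>first</h1>', 'page-2': '<h1>second</h1>'}
saved = []
titles = scrape(
    fake_pages,
    fetch=fake_pages.get,
    store=saved.append,
    parse=lambda html: html.replace('<h1>', '').replace('</h1>', ''),
)
print(titles)
print(len(saved))
```

Saving the raw HTML before parsing means a parsing mistake can be fixed later without requesting the page again.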

Now let's scrape the web! 

-----

# Scrape Monster.com for Data Science Jobs! 

## Step 1: Inspect the page you plan to scrape

https://www.monster.com/jobs/search/?q=Data-Scientist&where=Denver__2C-CO


## Step 2: Request the webpage's raw HTML 

In [None]:
url = 'https://www.monster.com/jobs/search/?q=Data-Scientist&where=Denver__2C-CO'
r = requests.get(url)

Next, check the status code to make sure you were successful

In [None]:
r.status_code

In [None]:
pprint.pprint(r.text)

## Step 3: Save the raw HTML into MongoDB

In [None]:
client = MongoClient()
db = client.monster

In [None]:
pages = db.data_science_colorado

In [None]:
pages.insert_one({'html': r.content})

## Step 4: Parse the hypertext to get data with Beautiful Soup

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
print(soup.prettify())

In [None]:
soup.find_all('section','card-content')[1] # gets the info for one job posting

In [None]:
soup.find_all('section','card-content')[1].find('h2','title') # get the tag with the job title and link

In [None]:
#use the .text attribute to strip away the HTML markup
soup.find_all('section','card-content')[1].find('h2','title').text # get job title

In [None]:
# here again we see that using .a gets us into the anchor tag,
# and specifying the href in brackets eliminates the HTML noise
soup.find_all('section','card-content')[1].find('h2','title').a['href'] # get link to job posting

In [None]:
soup.find_all('section','card-content')[1].find('div','company')

In [None]:
soup.find_all('section','card-content')[1].find('div','company').span.text

### Gather information from sub-pages 
Next, I will want to loop through all job postings on the Monster webpage, go to the hyperlink and get more info.

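First I will loop through the job cards on the search results page and collect the link, job title, and company for each posting.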

In [None]:
link = []
job_title = []
company = []
for job in soup.find_all('section','card-content'):
    if job.h2 is not None:    # skips the first 'card-content' section, which does not have a job title
        link.append(job.find('h2','title').a['href'])
        job_title.append(job.find('h2','title').text.rstrip())   # rstrip removes trailing whitespace such as '\n' from the job title
        company.append(job.find('div','company').span.text.rstrip())


job_title
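
An alternative to three parallel lists is one dictionary per posting, which keeps the fields of a job together even when one of them is missing. The sketch below runs on made-up markup that mimics the class names used above. The real Monster page may be structured differently, and the URL and company name are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above;
# the real Monster page may be structured differently.
html = '''
<section class="card-content"><p>header card without a title</p></section>
<section class="card-content">
  <h2 class="title"><a href="https://example.com/job/1">Data Scientist
</a></h2>
  <div class="company"><span>Acme Analytics </span></div>
</section>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

jobs = []
for card in demo_soup.find_all('section', 'card-content'):
    title_tag = card.find('h2', 'title')
    if title_tag is None or title_tag.a is None:
        continue  # not a job posting

    company_tag = card.find('div', 'company')
    company_name = None
    if company_tag is not None and company_tag.span is not None:
        company_name = company_tag.span.text.strip()

    jobs.append({
        'job_title': title_tag.text.strip(),
        'company': company_name,
        'link': title_tag.a['href'],
    })

print(jobs)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame` or inserted into a MongoDB collection.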

## Step 5: Repeat! 
![](img/web_scraping_procedure.png)
Next I want to go through the list of hyperlinks, grab the hypertext from each page, store it in MongoDB (which saves us from having to re-scrape), and parse the HTML again.  That's what's happening below.  From each of the linked webpages I will extract when the job was posted. 

In [None]:
when_post = []

# new collection for the raw HTML of each job posting page
pages = db.data_science_links

# loop through all of the links (hrefs) that were already saved to a list
for job_link in link:
    # step 1: inspect webpage -- since they have similar structure, I can check out one of the links
    # step 2: get the HTML code via requests library
    sub_page = requests.get(job_link)

    # step 3: put HTML into MongoDB
    pages.insert_one({'html': sub_page.content})

    try:
        # step 4: parse with beautiful soup
        sub_soup = BeautifulSoup(sub_page.text, 'html.parser')
        # get the HTML that lists when the job was posted
        posted = sub_soup.find('div',{'name':'value_posted'}).text
        print('When posted:', posted)
    except AttributeError:
        # find() returned None, so this page does not list a posting date
        posted = 'not available'

    # append this info to the when_post list
    when_post.append(posted)

    # If you send too many requests too quickly you can get banned from scraping the site!
    # Pause after every request, even when parsing fails, to try to prevent this from happening.
    time.sleep(2)

In [None]:
db.list_collection_names()

In [None]:
db.data_science_links.count_documents({}), db.data_science_colorado.count_documents({})

In [None]:
#Let's close our connection since we should be done inserting info into our db for now 
#we can query from it later if we want different info

client.close()

In [None]:
df = pd.DataFrame({'job_title': job_title, 'company':company, 'posted': when_post,'link':link})
df.head()

In [None]:
df.to_csv('monster_ds_job_data.csv') # write out pandas DF to csv

# API

* An API is a way for developers to communicate with a certain application against a specific contract
* An API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.

* Finally, let's get some COVID data via an API: https://covidtracking.com/data/api


In [None]:
api_url = 'https://api.covidtracking.com/v1/states/az/daily.json'
r = requests.get(api_url)
r.status_code

In [None]:
r.json()
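
Under the hood, `r.json()` parses the text of the response as JSON, which is roughly what `json.loads(r.text)` would do. The sketch below shows that step on a made-up response body. Only the field name `hospitalizedCurrently` appears in this notebook, and the `state` field and all values are invented for illustration.

```python
import json

# Made-up response body; only hospitalizedCurrently is a field used in this notebook.
response_text = '''
[
  {"state": "AZ", "hospitalizedCurrently": 919},
  {"state": "AZ", "hospitalizedCurrently": null}
]
'''

records = json.loads(response_text)
print(type(records), type(records[0]))
print(records[0]['hospitalizedCurrently'])
print(records[1]['hospitalizedCurrently'])  # JSON null becomes Python None
```

Because the result is a list of dictionaries, it drops into a pandas DataFrame with one row per record.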

That was easy! Let's throw the COVID data into a pandas DataFrame, and see how AZ is doing... 

In [None]:
az_data = pd.DataFrame.from_dict(r.json())
az_data.head()

In [None]:
# I will want to plot hospitalizations, so I will first drop any nulls
az_data = az_data.dropna(subset=['hospitalizedCurrently'])
az_data.head()

In [None]:
fig, ax = plt.subplots()
ax.plot(az_data.index.to_numpy()[::-1], az_data['hospitalizedCurrently'].to_numpy())
ax.set_xlabel('Days Since April 13th 2020')
ax.set_ylabel('# Currently Hospitalized')
ax.set_title('AZ Covid Hospitalizations')