# HCS Workshop 2, Web Scraping

Author: Michael Kielstra

The first thing I needed for this assignment was an interesting dataset to scrape.  I couldn't think of any off the top of my head, and I didn't want to Google "interesting data" because so much of that is depressing these days.  I was stuck.  But then I realized that I wasn't the only person going through this comp process, and that everyone else would be going online to find data sources too.  So I built a scraper that lists everyone who has forked the `wcooper90/HCSWorkshops2020` GitHub repo, finds all the URLs in their Workshop 2 notebooks, and scrapes the largest table from each one.

In [14]:
# Include a URL with a table in it, to test that the code is able to find it once this notebook is committed to the repo.
# 'https://developer.mozilla.org/en-US/docs/Learn/HTML/Tables/Basics'

In [1]:
# import requests package and set up page

import requests
page = requests.get('https://github.com/wcooper90/HCSWorkshops2020/network/members')

In [2]:
# import BeautifulSoup and make a "BeautifulSoup object"
# pip install beautifulsoup4  (or, on Debian/Ubuntu: sudo apt-get install python3-bs4)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
network = soup.find(id='network')
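# Each fork is listed as a link whose text is the repo name; collect the link targets.
# (string= is the current name for the older text= argument.)
repos = [a['href'] for a in network.find_all('a', string='HCSWorkshops2020')]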

In [4]:
import time
def map_process(iterable, process_func):
    results = []
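    for i in iterable:
        # Check each result individually so duplicates within one batch are dropped too.
        for k in process_func(i):
            if k not in results:
                results.append(k)
        time.sleep(0.5)  # Pause between requests to be polite to the server.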
    return results
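
To see what the accumulate-and-deduplicate pattern in `map_process` does without touching the network, here is a minimal sketch. The `fake_pages` dictionary is a hypothetical stand-in for a scraping step, and the delay between requests is left out so the example runs instantly.

```python
def map_process(iterable, process_func):
    """Apply process_func to each item; collect unique results in first-seen order."""
    results = []
    for item in iterable:
        for k in process_func(item):
            if k not in results:
                results.append(k)
    return results

# Hypothetical data: each "page" yields a few links, some of them repeated.
fake_pages = {"a": ["x", "y"], "b": ["y", "z"], "c": []}

# Iterating a dict yields its keys, so process_func receives "a", "b", "c".
links = map_process(fake_pages, lambda page: fake_pages[page])
print(links)
```

The membership test on a list is linear, which is fine for a few dozen forks; a much larger crawl would probably want a set alongside the list.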

In [5]:
import re
def process_repo(repo):
    repopage = requests.get(f'https://github.com{repo}/tree/master/Workshop2')
    reposoup = BeautifulSoup(repopage.content, 'html.parser')
    # Match every link whose text contains ".ipynb" (string= replaces the older text= argument).
    fileslist = reposoup.find_all("a", string=re.compile(r'\.ipynb'))
    return map(lambda a: a['href'].replace('/blob', ''), fileslist)
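
The `/blob` removal is what turns a link on a GitHub file listing into a path that `raw.githubusercontent.com` will serve. A minimal sketch of that conversion follows; the helper name `blob_href_to_raw_url` and the example user and file names are hypothetical, and this variant removes only the first `/blob/` segment in case the rest of the path happens to contain one.

```python
def blob_href_to_raw_url(href):
    """Turn a GitHub '/user/repo/blob/branch/path' href into a raw-content URL."""
    # Replace only the first "/blob/" so a later "/blob/" in the path survives.
    return "https://raw.githubusercontent.com" + href.replace("/blob/", "/", 1)

# Hypothetical href, shaped like the links on a GitHub file listing.
href = "/someuser/HCSWorkshops2020/blob/master/Workshop2/scraper.ipynb"
raw = blob_href_to_raw_url(href)
print(raw)
```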

In [6]:
notebooks = map_process(repos, process_repo)

In [7]:
known_urls = ['https://github.com/wcooper90/HCSWorkshops2020/network/members', 'https://github.com{repo}/tree/master/Workshop2', 'https://raw.githubusercontent.com{notebook}', 'https://beautiful-soup-4.readthedocs.io/en/latest/', 'https://www.dataquest.io/blog/web-scraping-tutorial-python/', 'http://dataquestio.github.io/web-scraping-pages/simple.html', 'https://docs.python.org/3/library/re.html', 'https://stackoverflow.com/questions/47928608/how-to-use-beautifulsoup-to-parse-google-search-results-in-python', 'https://google.com/search?q=']
link_regex = re.compile('http[^\\\\ \'"]+')
def process_notebook(notebook):
    notebookpage = requests.get(f'https://raw.githubusercontent.com{notebook}')
    dataset_urls = []
    for dataset_url in link_regex.findall(notebookpage.text):
        if dataset_url not in known_urls:
            dataset_urls.append(dataset_url)
    return dataset_urls
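
The link regex works because a raw `.ipynb` file is JSON: a URL in a code cell is usually followed by a quote or by an escaped newline (a literal backslash and `n`), and the pattern stops at either. A small sketch, using a made-up `sample` string shaped like a fragment of raw notebook JSON:

```python
import re

# Same pattern as above: "http" followed by anything except backslash, space, or quotes.
link_regex = re.compile('http[^\\\\ \'"]+')

# Made-up fragment of raw notebook JSON; "\\n" here is a literal backslash followed by n.
sample = ('"page = requests.get(\'https://example.com/data\')\\n", '
          '"# see http://example.org/table.html\\n"')

found = link_regex.findall(sample)
print(found)
```

Because only backslashes, spaces, and quotes end a match, a URL written inside prose or a Markdown link keeps any trailing `)` or `,`, so some of the collected URLs may need trimming before they can be fetched.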

In [8]:
dataset_urls = map_process(notebooks, process_notebook)

In [9]:
def rank_table(table):
    # Score a table by its number of data cells; a table with no <td> cells scores 0.
    return len(table.find_all("td"))

def process_dataset_url(dataset_url):
    datasetpage = requests.get(dataset_url)
    datasetsoup = BeautifulSoup(datasetpage.content, 'html.parser')
    tables = datasetsoup.find_all("table")
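    if not tables:
        return []
    # Keep the table with the most cells, on the not unreasonable assumption that this is the table with the actual data.
    best_table = max(tables, key=rank_table)
    # Some pages have no <title>; fall back to the URL so the lookup cannot raise AttributeError.
    title = datasetsoup.title.text if datasetsoup.title else dataset_url
    return [(best_table, title)]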

In [10]:
tables_with_titles = map_process(dataset_urls, process_dataset_url)

In [11]:
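from io import StringIO
import pandas as pd
def process_table_with_title(table_with_title):
    # Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings to read_html.
    df = pd.read_html(StringIO(str(table_with_title[0])))[0]
    return (df, table_with_title[1])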

In [12]:
dataframes_with_titles = map(process_table_with_title, tables_with_titles)

In [13]:
from IPython.display import display
for (df, title) in list(dataframes_with_titles):
    print(title)
    display(df)

Stock Quotes | Stock Charts | Quote Prices | Markets Insider


|  | Name | Price | Unnamed: 2 | % | +/- | Date |
|---|---|---|---|---|---|---|
| 0 | IBM | 132.5 |  | 6.79 % | 8.43 | 11:16:29 AM |
| 1 | Travelers Cos | 115.76 |  | 1.72 % | 1.96 | 11:16:02 AM |
| 2 | American Express | 105.12 |  | 1.68 % | 1.74 | 11:15:54 AM |
| 3 | Intel | 53.52 |  | 1.61 % | 0.85 | 11:16:28 AM |
| 4 | Merck | 81.25 |  | 1.51 % | 1.21 | 11:16:27 AM |
| 5 | Home Depot | 282.66 |  | -0.05 % | -0.13 | 11:14:55 AM |
| 6 | Unitedhealth Gro | 322.54 |  | -0.19 % | -0.63 | 11:16:35 AM |
| 7 | McDonald's | 225.91 |  | -0.25 % | -0.57 | 11:16:00 AM |
| 8 | Verizon Comm | 59.09 |  | -0.87 % | -0.52 | 11:16:35 AM |
| 9 | Amgen | 245.0 |  | -4.92 % | -12.67 | 11:16:21 AM |


National Weather Service


|  | 0 | 1 |
|---|---|---|
| 0 | Humidity | 81% |
| 1 | Wind Speed | NA NA MPH |
| 2 | Barometer |  |
| 3 | Dewpoint | 52°F (11°C) |
| 4 | Visibility |  |
| 5 | Last update | 08 Oct 07:43 AM PDT |
