# HCS Workshop 2, Web Scraping

Author: Michael Kielstra

The first thing I needed for this assignment was an interesting dataset to scrape.  I couldn't think of any off the top of my head, and I didn't want to Google "interesting data" because so much of what turns up is depressing these days.  I was stuck.  But then I realized that I wasn't the only one doing this comp process, and everyone else would be going online and finding data sources too.  So I built a scraper that lists everyone who has forked the `wcooper90/HCSWorkshops2020` GitHub repo, finds all the URLs in their projects, and scrapes each one for tables.

In [None]:
# Include a URL with a table in it, to test that the code is able to find it once this notebook is committed to the repo.
# 'https://developer.mozilla.org/en-US/docs/Learn/HTML/Tables/Basics'

In [2]:
# import requests package and set up page

import requests
page = requests.get('https://github.com/wcooper90/HCSWorkshops2020/network/members')

In [3]:
# import BeautifulSoup and make a "BeautifulSoup object"
# pip install beautifulsoup4   (or, on Debian/Ubuntu: sudo apt-get install python3-bs4)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [4]:
network = soup.find(id='network')
# Each fork is listed as a link whose text is the repo name; keep its site-relative href.
repos = [a['href'] for a in network.find_all('a', string='HCSWorkshops2020')]

In [5]:
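import time
def map_process(iterable, process_func):
    # Apply process_func to each item and flatten the results, skipping duplicates.
    results = []
    for i in iterable:
        for k in process_func(i):
            if k not in results:
                results.append(k)
        time.sleep(0.5)  # pause between requests to go easy on the server
    return results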
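
To make the intended behaviour of `map_process` concrete, here is a minimal sketch of the same idea (apply a function to each item, flatten the results, keep only first occurrences) run on toy data instead of web pages. The name `flat_map_unique` and the `pages` dictionary are hypothetical stand-ins, and the half-second delay is left out so the example runs instantly.

```python
def flat_map_unique(iterable, process_func):
    """Hypothetical helper: flatten process_func's results, keeping first occurrences in order."""
    results = []
    for item in iterable:
        for value in process_func(item):
            if value not in results:
                results.append(value)
    return results

# Toy stand-in for a scraping step: each "page" yields some links.
pages = {'a': ['x', 'y', 'x'], 'b': ['y', 'z']}
links = flat_map_unique(pages, lambda name: pages[name])
print(links)  # ['x', 'y', 'z']
```

The membership test on a list is linear, which is fine for a few dozen forks; a set would be faster but would require every result to be hashable.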

In [6]:
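import re
def process_repo(repo):
    # List the Workshop2 folder of a fork and return the paths of the notebooks in it.
    repopage = requests.get(f'https://github.com{repo}/tree/master/Workshop2')
    reposoup = BeautifulSoup(repopage.content, 'html.parser')
    fileslist = reposoup.find_all("a", string=re.compile(r'\.ipynb'))
    # Drop '/blob' so the path can be appended to the raw.githubusercontent.com host.
    return [a['href'].replace('/blob', '') for a in fileslist]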

In [7]:
notebooks = map_process(repos, process_repo)

In [8]:
known_urls = ['http[^', 'https://github.com/wcooper90/HCSWorkshops2020/network/members', 'https://github.com{repo}/tree/master/Workshop2', 'https://raw.githubusercontent.com{notebook}', 'https://beautiful-soup-4.readthedocs.io/en/latest/', 'https://www.dataquest.io/blog/web-scraping-tutorial-python/', 'http://dataquestio.github.io/web-scraping-pages/simple.html', 'https://docs.python.org/3/library/re.html', 'https://stackoverflow.com/questions/47928608/how-to-use-beautifulsoup-to-parse-google-search-results-in-python', 'https://google.com/search?q=']
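# Match a URL from its scheme up to the first backslash, space, or quote character.
link_regex = re.compile('http[^\\\\ \'"]+')
def process_notebook(notebook):
    # Fetch the raw notebook and collect every URL that is not part of a scraper's own code.
    notebookpage = requests.get(f'https://raw.githubusercontent.com{notebook}')
    dataset_urls = []
    for dataset_url in link_regex.findall(notebookpage.text):
        if dataset_url not in known_urls:
            dataset_urls.append(dataset_url)
    return dataset_urls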
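
As a quick check of what `link_regex` picks up, the sketch below runs the same pattern over a made-up fragment shaped like raw notebook JSON. The `example.com` and `example.org` URLs and the `sample` string are invented for illustration; they are not taken from any real fork.

```python
import re

# Same pattern as in the cell above: 'http' followed by anything
# that is not a backslash, a space, or a quote character.
link_regex = re.compile('http[^\\\\ \'"]+')

# Made-up fragment resembling the raw JSON of a notebook cell.
sample = '"source": ["page = requests.get(\'https://example.com/data\')\\n", "# see http://example.org/table ok"]'
found = link_regex.findall(sample)
print(found)  # ['https://example.com/data', 'http://example.org/table']

# The pattern is deliberately loose, so trailing punctuation is kept.
trailing = link_regex.findall('docs (http://example.org/x) here')
print(trailing)  # ['http://example.org/x)']
```

The second call shows the main limitation: a URL wrapped in parentheses keeps its closing bracket, so some scraped links may fail to load.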

In [9]:
dataset_urls = map_process(notebooks, process_notebook)

In [10]:
def rank_table(table):
    # Score a table by its number of data cells; a table with no <td> scores 0.
    return len(table.find_all("td"))

def process_dataset_url(dataset_url):
    datasetpage = requests.get(dataset_url)
    datasetsoup = BeautifulSoup(datasetpage.content, 'html.parser')
    tables = datasetsoup.find_all("table")
    if not tables:
        return []
    # Return the table with the most cells, on the not unreasonable assumption
    # that this is the table with the actual data.
    best_table = max(tables, key=rank_table)
    # Some pages have no <title>; fall back to the URL so the lookup cannot fail.
    title = datasetsoup.title.text if datasetsoup.title else dataset_url
    return [(best_table, title)]
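
The "most cells wins" heuristic can be tried offline on a small hand-written page. This is only a sketch: the HTML below and its `id` attributes are invented for the demonstration, and the stock figures are borrowed from the sample output purely as filler.

```python
from bs4 import BeautifulSoup

# Invented page with a small navigation table and a larger data table.
html = """
<html><head><title>Demo page</title></head><body>
<table id="nav"><tr><td>Home</td><td>About</td></tr></table>
<table id="data">
  <tr><td>IBM</td><td>132.11</td></tr>
  <tr><td>Intel</td><td>53.31</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
# Pick the table containing the most <td> cells.
best = max(tables, key=lambda t: len(t.find_all('td')))
print(best['id'])  # data
```

One caveat: `find_all` searches nested elements too, so a layout table that wraps the whole page would count every inner cell as its own and win the ranking.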

In [11]:
tables_with_titles = map_process(dataset_urls, process_dataset_url)

In [12]:
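import pandas as pd
from io import StringIO
def process_table_with_title(table_with_title):
    # Unpack the (table, title) pair and parse the table's HTML into a DataFrame.
    table, title = table_with_title
    # Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings.
    df = pd.read_html(StringIO(str(table)))[0]
    return (df, title)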

In [13]:
dataframes_with_titles = map(process_table_with_title, tables_with_titles)

In [14]:
from IPython.display import display
for (df, title) in list(dataframes_with_titles):
    print(title)
    display(df)

Stock Quotes | Stock Charts | Quote Prices | Markets Insider


Unnamed: 0,Name,Price,Unnamed: 2,%,+/-,Date
0,IBM,132.11,,6.48 %,8.04,11:22:31 AM
1,Travelers Cos,115.5,,1.49 %,1.7,11:22:13 AM
2,American Express,104.76,,1.33 %,1.38,11:22:09 AM
3,Intel,53.31,,1.22 %,0.64,11:22:24 AM
4,Merck,81.01,,1.21 %,0.97,11:21:48 AM
5,Home Depot,282.24,,-0.19 %,-0.55,11:22:13 AM
6,Unitedhealth Gro,322.2,,-0.30 %,-0.97,11:21:56 AM
7,McDonald's,225.38,,-0.49 %,-1.1,11:22:07 AM
8,Verizon Comm,59.05,,-0.94 %,-0.56,11:22:28 AM
9,Amgen,244.94,,-4.94 %,-12.73,11:22:31 AM


National Weather Service


Unnamed: 0,0,1
0,Humidity,81%
1,Wind Speed,NA NA MPH
2,Barometer,
3,Dewpoint,52°F (11°C)
4,Visibility,
5,Last update,08 Oct 07:43 AM PDT


HTML table basics - Learn web development | MDN


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Name,Mass (1024kg),Diameter (km),Density (kg/m3),Gravity (m/s2),Length of day (hours),Distance from Sun (106km),Mean temperature (°C),Number of moons,Notes
0,Terrestial planets,Terrestial planets,Mercury,0.33,4879,5427,3.7,4222.6,57.9,167,0,Closest to the Sun
1,Terrestial planets,Terrestial planets,Venus,4.87,12104,5243,8.9,2802.0,108.2,464,0,
2,Terrestial planets,Terrestial planets,Earth,5.97,12756,5514,9.8,24.0,149.6,15,1,Our world
3,Terrestial planets,Terrestial planets,Mars,0.642,6792,3933,3.7,24.7,227.9,-65,2,The red planet
4,Jovian planets,Gas giants,Jupiter,1898.0,142984,1326,23.1,9.9,778.6,-110,67,The largest planet
5,Jovian planets,Gas giants,Saturn,568.0,120536,687,9.0,10.7,1433.5,-140,62,
6,Jovian planets,Ice giants,Uranus,86.8,51118,1271,8.7,17.2,2872.5,-195,27,
7,Jovian planets,Ice giants,Neptune,102.0,49528,1638,11.0,16.1,4495.1,-200,14,
8,Dwarf planets,Dwarf planets,Pluto,0.0146,2370,2095,0.7,153.3,5906.4,-225,5,"Declassified as a planet in 2006, but this rem..."
