# Week 7 Assignment

_McKinney 6.1_

This week has been all about getting information off the internet, both in structured data formats (CSV, JSON, etc.) and in HTML.  For these exercises, we're going to use two practical examples of fetching data from web pages to show how to use Pandas and BeautifulSoup to extract structured information from the web.

---

### 33.1 Parsing a list in HTML

Go to the Banner Health Price Transparency Page: https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency

Notice that there is a list of hospitals grouped by the state they are in.  We want to parse the underlying HTML to create a list of all the hospitals along with the state each one is in.  The result should look like this:

```json
[
    ["Banner - University Medical Center Phoenix", "Arizona"],
    ["Banner - University Medical Center South ", "Arizona"],
    ...
]
```

To examine the underlying HTML code, you can use Chrome, right-click, and choose **Inspect**.

For reference, the documentation for BeautifulSoup is here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
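As a warm-up that doesn't touch the network, the sketch below runs BeautifulSoup on a small made-up HTML fragment. The fragment's layout (an `<h3>` heading followed by a `<ul>`) and the names `SAMPLE_HTML` and `parse_groups` are assumptions for illustration only; the real Banner page may be structured differently, so inspect it before adapting any of this.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each heading names a group, and the list after it holds the members.
SAMPLE_HTML = """
<div class="content">
  <h3>Arizona</h3>
  <ul>
    <li>Hospital A </li>
    <li>Hospital B</li>
  </ul>
  <h3>Colorado</h3>
  <ul>
    <li>Hospital C</li>
  </ul>
</div>
"""


def parse_groups(html):
    """Return [item, heading] pairs for every <li> that follows a heading in html."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item_list in soup.find_all("ul"):
        # find_previous_sibling skips the whitespace text nodes between tags.
        heading = item_list.find_previous_sibling("h3")
        group = heading.get_text(strip=True) if heading else None
        for item in item_list.find_all("li"):
            # strip=True trims stray spaces around the item text.
            pairs.append([item.get_text(strip=True), group])
    return pairs


print(parse_groups(SAMPLE_HTML))
```

Searching backwards with `find_previous_sibling` is one way to pair each list with its heading without counting whitespace nodes by hand.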


In the cell below, create a function called **parse_banner(url)** that takes as its one parameter the URL of the webpage to be parsed.  Make sure you include docstrings and a good test case using the URL provided above.
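One way to meet the docstring-and-test requirement is sketched below on a small offline helper, so the test doesn't depend on the live page. The helper name `tidy_name` is hypothetical and not part of the assignment; the same docstring layout would apply to **parse_banner**.

```python
import doctest


def tidy_name(raw):
    """Collapse runs of whitespace in a scraped name and trim both ends.

    >>> tidy_name('  Banner Baywood   Medical Center ')
    'Banner Baywood Medical Center'
    >>> tidy_name('Page Hospital')
    'Page Hospital'
    """
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(raw.split())


# Run only the examples found in tidy_name's docstring.
doctest.run_docstring_examples(tidy_name, globals(), verbose=True)
```

Because the examples live in the docstring, `doctest` has something to find; a function without `>>>` lines reports no tests at all.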

In [1]:
from bs4 import BeautifulSoup
import requests

# Note that you'll need to fetch the data using the following syntax to include headers
# that make the web server think you're a real web browser.
url = 'https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency'

headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')

In [2]:
response.status_code

200

In [3]:
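def parse_banner(url):
    """Print and return a list of [hospital, state] pairs scraped from the page at url.

    Relies on the module-level ``headers`` dict defined above for the User-Agent.
    """
    results = []
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Take the first div with class col-md-8 as the container for the hospital lists.
    div = soup.find_all('div', {"class":"col-md-8"})[0]
    for hospital_list in div.find_all('ul'):
        # Step back two siblings to reach the state name (the first step back
        # is presumably the whitespace text between the two tags).
        state = hospital_list.previous_sibling.previous_sibling.string
        for hospital in hospital_list.find_all('li'):
            print(hospital.text, state)
            results.append([hospital.text, state])
    return results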

In [4]:
import doctest
doctest.run_docstring_examples(parse_banner, globals(), verbose=True)

Finding tests in NoName


In [5]:
banner = parse_banner('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency')
assert len(banner)==38, 'Length of result should have been 38, but {} returned.'.format(len(banner))
assert banner[0][1]=='Arizona', 'Wrong data found in the first result item: {}'.format(banner[0])

Banner - University Medical Center Phoenix Arizona
Banner - University Medical Center South  Arizona
Banner - University Medical Center Tucson Arizona
Banner Baywood Medical Center  Arizona
Banner Behavioral Health Hospital Arizona
Banner Boswell Medical Center Arizona
Banner Casa Grande Medical Center Arizona
Banner Del E. Webb Medical Center Arizona
Banner Desert Medical Center/Cardon Children's Medical Center   Arizona
Banner Estrella Medical Center Arizona
Banner Gateway Medical Center/Banner MD Anderson Cancer Center Arizona
Banner Goldfield Medical Center   Arizona
Banner Heart Hospital Arizona
Banner Ironwood Medical Center Arizona
Banner Ocotillo Medical Center Arizona
Banner Payson Medical Center Arizona
Banner Rehabilitation Hospitals Arizona
Banner Thunderbird Medical Center Arizona
Page Hospital Arizona
Banner Lassen Medical Center California
Banner Fort Collins Medical Center Colorado
McKee Medical Center Colorado
North Colorado Medical Center Colorado
Sterling Regional Me

---

### 33.2 Using a REST API (from GitHub.com)

Many websites provide something called a REST API to access information from their site programmatically, rather than relying on HTML.  One example is GitHub.com, whose API allows you to do things like "list all the public repositories for a user."

The documentation for GitHub.com's REST API can be found here: https://docs.github.com/en/rest/guides/getting-started-with-the-rest-api

Create a function called **repo_summary(user)** that takes a GitHub.com user name as its parameter and retrieves a list of all the repositories you can see for that user.  The specific documentation for this kind of request can be found here: https://docs.github.com/en/rest/reference/repos#list-repositories-for-a-user. Make sure your function is well documented with a docstring and includes a simple test to verify that you get back 12 repositories when querying for the repositories for user **paulboal**.
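The general shape of such a request can be sketched offline. In the snippet below, `repos_url` builds the endpoint from the user name and `repo_names` pulls the `name` field out of a decoded response. Both helper names and the two-item `SAMPLE_PAYLOAD` are made up for illustration; real responses carry many more fields per repository. GitHub pages this endpoint and returns 30 repositories per page by default, so the sketch asks for `per_page=100`, the documented maximum.

```python
import json
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"


def repos_url(user, per_page=100):
    """Build the 'list repositories for a user' URL for the given user name."""
    query = urlencode({"per_page": per_page})
    return f"{API_ROOT}/users/{quote(user)}/repos?{query}"


def repo_names(payload):
    """Return the repository names from a decoded list-repositories response."""
    return [repository["name"] for repository in payload]


# Made-up stand-in for response.json(); this is not real API output.
SAMPLE_PAYLOAD = json.loads('[{"name": "demo-one"}, {"name": "demo-two"}]')

print(repos_url("paulboal"))
print(repo_names(SAMPLE_PAYLOAD))
```

With `requests`, the live call would then look something like `repo_names(requests.get(repos_url(user)).json())`. Keeping the URL building and the JSON handling separate makes each piece testable without network access.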

I've provided a related example to help you out.

In [22]:
# Example -- this example of code shows how to get basic information on the user paulboal
# For your solution, make sure you meet the requirements in the instructions above.
import requests
import json

response = requests.get('https://api.github.com/users/paulboal')
data = response.json()

print('This information is about {}. His website is {}.'.format(data.get('login'), data.get('blog')))

This information is about paulboal. His website is www.amitechsolutions.com.


In [125]:
# Your code Here
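def repo_summary(user):
    """Print and return the names of the repositories visible for a GitHub user."""
    # Build the URL from the user argument instead of hard-coding one account.
    url = 'https://api.github.com/users/{}/repos'.format(user)
    response = requests.get(url)
    data = response.json()
    names = []
    for repository in data:
        # Skip any name that has already been collected.
        if repository["name"] not in names:
            names.append(repository["name"])
    print(names)
    return names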

In [126]:
repos = repo_summary('paulboal')

['ajaxterm', 'cms_hospital_compare', 'collibra-scripts', 'coronadatascraper', 'hadoop-heuristicsminer', 'hds5210-2021', 'hds5210-2022', 'jupyterhub-nbgrader', 'nppes_demo', 'pexpect-curses', 'scm-products', 'tdwi-accelerate-2017-python']


In [103]:
import requests
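def repo_summary(user):
    '''(str) -> list
    Print and return the names of the repositories visible for a GitHub user.

    The example below expects ``repos`` to already hold the result of
    repo_summary('paulboal') from an earlier cell.

    >>> len(repos)==12
    True
    '''
    link = 'https://api.github.com/users/{}/repos'.format(user)
    api_data = requests.get(link).json()
    # Collect the names first, then print them, so the function returns real data.
    names = [item['name'] for item in api_data]
    for name in names:
        print(f"- {name}")
    return names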

In [104]:
import doctest
doctest.run_docstring_examples(repo_summary, globals(), verbose=True)

Finding tests in NoName
Trying:
    len(repos)==12
Expecting:
    True
ok


In [105]:
repos = repo_summary('paulboal')
assert len(repos)==12, 'Expecting 12, but {} were found'.format(len(repos))

- ajaxterm
- cms_hospital_compare
- collibra-scripts
- coronadatascraper
- hadoop-heuristicsminer
- hds5210-2021
- hds5210-2022
- jupyterhub-nbgrader
- nppes_demo
- pexpect-curses
- scm-products
- tdwi-accelerate-2017-python


---

### 33.3 Find Something of Your Own

Do some web searches and find an HTML page with some data that is relevant to something you're studying.  You can extract and parse that information using either BeautifulSoup or Pandas.  If you're using Pandas, then do something interesting to format and structure your data.  If you're using BeautifulSoup, you'll just need to do the work of parsing the data out of HTML -- that's hard enough!

You don't need to build this as a function.  Just use notebook cells as I've done above.  You will be graded based on _style_.  Use variable names that make sense for your problem / solution. Clean up anything you don't need before you submit your work.

In [175]:
# Your Code Here
import pandas as pd

tables = pd.read_html('https://www.health.harvard.edu/staying-healthy/listing_of_vitamins')

In [176]:
tables[0]

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | VITAMIN | BENEFITS | RECOMMENDED AMOUNT (daily RDA* or daily AI**) | UPPER LIMIT (UL) per day | GOOD FOOD SOURCES | DID YOU KNOW? |
| 1 | RETINOIDS AND CAROTENE (vitamin A; includes re... | Essential for vision Lycopene may lower prosta... | M: 900 mcg (3,000 IU)W: 700 mcg (2,333 IU)Some... | 3,000 mcg (about 10,000 IU) | Sources of retinoids: beef liver, eggs, shrimp... | Many people get too much preformed vitamin A f... |
| 2 | THIAMIN (vitamin B1) | Helps convert food into energy. Needed for hea... | M: 1.2 mg, W: 1.1 mg | Not known | Pork chops, brown rice, ham, soymilk, watermel... | Most nutritious foods have some thiamin. |
| 3 | RIBOFLAVIN (vitamin B2) | Helps convert food into energy. Needed for hea... | M: 1.3 mg, W: 1.1 mg | Not known | Milk, eggs, yogurt, cheese, meats, green leafy... | Most Americans get enough of this nutrient. |
| 4 | NIACIN (vitamin B3, nicotinic acid) | Helps convert food into energy. Essential for ... | M: 16 mg, W: 14 mg | 35 mg | Meat, poultry, fish, fortified and whole grain... | Niacin occurs naturally in food and can also b... |
| 5 | PANTOTHENIC ACID (vitamin B5) | Helps convert food into energy. Helps make lip... | M: 5 mg, W: 5 mg | Not known | Wide variety of nutritious foods, including ch... | Deficiency causes burning feet and other neuro... |
| 6 | PYRIDOXINE (vitamin B6, pyridoxal, pyridoxine,... | Aids in lowering homocysteine levels and may r... | 31–50 years old: M: 1.3 mg, W: 1.3 mg; 51+ yea... | 100 mg | Meat, fish, poultry, legumes, tofu and other s... | Many people don't get enough of this nutrient. |
| 7 | COBALAMIN (vitamin B12) | Aids in lowering homocysteine levels and may l... | M: 2.4 mcg, W: 2.4 mcg | Not known | Meat, poultry, fish, milk, cheese, eggs, forti... | Some people, particularly older adults, are de... |
| 8 | BIOTIN | Helps convert food into energy and synthesize ... | M: 30 mcg, W: 30 mcg | Not known | Many foods, including whole grains, organ meat... | Some is made by bacteria in the gastrointestin... |
| 9 | ASCORBIC ACID (vitamin C) | Foods rich in vitamin C may lower the risk for... | M: 90 mg, W: 75 mg Smokers: Add 35 mg | 2,000 mg | Fruits and fruit juices (especially citrus), p... | Evidence that vitamin C helps reduce colds has... |
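
In the table above, the real column names landed in row 0 while the columns themselves are just numbered 0 to 5. One possible cleanup is to promote that first row to the header. The sketch below shows the idea on a tiny stand-in frame: the `raw` data is abbreviated from the output above, and `promote_header` is a hypothetical helper name. The same call should work on `tables[0]`, though that hasn't been verified against the live page.

```python
import pandas as pd


def promote_header(frame):
    """Return a copy of frame whose first row has become the column labels."""
    # Drop row 0 from the data and renumber the remaining rows from zero.
    promoted = frame.iloc[1:].reset_index(drop=True)
    # Use the values of the original first row as the new column names.
    promoted.columns = frame.iloc[0].tolist()
    return promoted


# Stand-in for tables[0]: the header text sits in row 0, as in the scraped table.
raw = pd.DataFrame(
    [
        ["VITAMIN", "UPPER LIMIT (UL) per day"],
        ["THIAMIN (vitamin B1)", "Not known"],
        ["BIOTIN", "Not known"],
    ]
)

vitamins = promote_header(raw)
print(vitamins.columns.tolist())
print(len(vitamins))
```

Passing `header=0` to `pd.read_html` is another option that may give the same result directly at load time.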


---

## Check your work above

If you didn't get them all correct, take a few minutes to think through those that aren't correct.


## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Follow the instructions on the prompt below to either save and submit your work, or continue working.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

---

In [None]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git add week07_assignment_2.ipynb
    !git commit -a -m "Submitting the week 7 programming exercises"
    !git push
else:
    print('''
    
OK. We can wait.
''')