# Week 7 Assignment

_McKinney 6.1_

This week has been all about getting information off the internet, both in structured data formats (CSV, JSON, etc.) and in HTML.  For these exercises, we're going to use two practical examples of fetching data from web pages to show how to use Pandas and BeautifulSoup to extract structured information from the web.

---

### 33.1 Parsing a list in HTML

Go to the Banner Health Price Transparency Page: https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency

Notice that there is a list of hospitals grouped by the state they are in.  We want to parse the underlying HTML to create a list of all the hospitals along with which state they're in.

```json
[
    ["Banner - University Medical Center Phoenix", "Arizona"],
    ["Banner - University Medical Center South ", "Arizona"],
    ...
]
```

To examine the underlying HTML code, you can use Chrome, right-click, and choose **Inspect**.

For reference, the documentation for BeautifulSoup is here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In the cell below, create a function called **parse_banner(url)** that takes as its one parameter the URL of the webpage to be parsed for links.  Make sure you include docstrings and a good test case using the URL provided above.

In [1]:
#Assignment 33.1 Parsing a list in HTML

#Import Modules 
from bs4 import BeautifulSoup
import requests

#run the header syntax 
headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }
#Note(per teacher): you'll need to fetch the data using the following syntax to include headers that make the web server think you're a real web browser.

#assign url to object & check the site status(see footnotes below...)
pageUrl = requests.get("https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency", headers=headers)
#print(pageUrl.status_code) #successful!

#Preliminary Checks
#check the text
soup = BeautifulSoup(pageUrl.text, 'html.parser') #options: 'lxml', 'html.parser'
# print(soup.prettify()) #too long...noprint

#Let’s see what the type of each element in the list is:
# [type(item) for item in list(soup.children)]



#footnotes (notes to self):
#Note(pageUrl.status_code) a status code of 200 is good; a 3xx code is a redirect, meaning the page may have moved... (see link... https://www.dataquest.io/blog/python-api-tutorial/)
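
To make the status-code footnote concrete, here is a minimal sketch that labels a code by its family using only the standard library. The helper name `describe_status` is made up for illustration; it is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    if 100 <= code < 200:
        family = "informational"
    elif 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirect"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"
    # HTTPStatus raises ValueError for codes it does not know about
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"
    return "{} {} ({})".format(code, phrase, family)

for code in (200, 301, 404, 503):
    print(describe_status(code))
```

With `requests`, calling `raise_for_status()` on the response raises an error for 4xx and 5xx codes, which is often simpler than checking the number by hand.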

In [2]:
#Import Modules 
from bs4 import BeautifulSoup
import requests

#create function
def parse_banner(url):
    """(url)->list
    This function takes as its one parameter the URL of the webpage to be parsed.
    It parses the underlying HTML and returns a list of [hospital, state] pairs for all the hospitals on the page.
    
    >>> len(parse_banner('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency'))
    38
    """
 
    #initialize the list
    hospList = []
    
    #run the header syntax 
    headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }
    #Note(per teacher): you'll need to fetch the data using the following syntax to include headers that make the web server think you're a real web browser.
    
    #first, fetch the page passed in as the url parameter (see footnotes below for the status check)
    page = requests.get(url, headers=headers)
    #print(page.status_code) #successful!

    #next, check the text
    bannerSoup = BeautifulSoup(page.text, 'html.parser') #options: 'lxml', 'html.parser'
    # print(bannerSoup.prettify()) #too long...noprint

    #assign tags to objects
    div_class = bannerSoup.find_all(class_ = "col-md-8")[0] #col-md-8 => the class that contains the hospital pricing info
    div_class_ul = div_class.find_all("ul") #ul => "unordered list"; each ul holds the hospitals for one state
    div_class_ul_li = div_class.find_all("li") #li => "list item"; each li is one specific hospital within a ul

    #scrape the state info
    #---quick check: looking at the html, the state names are listed under the "strong" tag
    #---(find() returns only the first match, so this is just the first state name; it is not used below)
    #state_list = bannerSoup.find_all("strong")[0:6]
    state_list = div_class.find("strong").get_text()
   
    #scrape the hospital info
    #---quick check: looking at the html, the hospital names are listed under the "li" tag
    #---(again only the first match; the loop below collects all of them)
    hospital_list = div_class.find("li").get_text()

    #put everything together
    for hosp_list in div_class_ul:
        state = hosp_list.previous_sibling.previous_sibling.get_text() #used the "previous_sibling" syntax (see footnotes below...)
        for hospital in hosp_list.find_all('li'):
            hospList.append([hospital.text, state])
            
    return hospList
 
#footnotes (notes to self):
#Note(page.status_code) a status code of 200 is good; a 3xx code is a redirect, meaning the page may have moved... (see link... https://www.dataquest.io/blog/python-api-tutorial/)
#***loops 
#states = [st.get_text() for st in state_list] #loop through for all the states
#hospitals = [hsp.get_text() for hsp in div_class_ul_li] #loop through for all the hospitals
#***Sibling info
#used the "previous_sibling" syntax twice: from each <ul> hospital list, the state name is two siblings back
#(the whitespace between tags can count as a sibling of its own), based on tips from the instructor on slack...
#...this (sibling syntax) was the only option that worked! (See link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways)
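
The idea behind the sibling lookup is "pair every hospital with the last state heading seen before it." As a rough sketch, the same pairing can be done with only the standard library's `html.parser` instead of BeautifulSoup. The HTML fragment, the hospital names, and the class name `HeadingListParser` below are all made up for illustration, and the real page's markup may differ.

```python
from html.parser import HTMLParser

# Made-up fragment that mimics a "state heading, then list of hospitals" layout.
SAMPLE_HTML = """
<div class="col-md-8">
  <p><strong>Arizona</strong></p>
  <ul>
    <li>Hospital A</li>
    <li>Hospital B</li>
  </ul>
  <p><strong>Colorado</strong></p>
  <ul>
    <li>Hospital C</li>
  </ul>
</div>
"""

class HeadingListParser(HTMLParser):
    """Pair the text of every <li> with the most recent <strong> text."""

    def __init__(self):
        super().__init__()
        self.current_tag = None    # the strong/li tag we are currently inside, if any
        self.current_state = None  # text of the last <strong> seen
        self.pairs = []            # collected [hospital, state] pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "li"):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip the whitespace between tags
        if self.current_tag == "strong":
            self.current_state = text
        elif self.current_tag == "li":
            self.pairs.append([text, self.current_state])

parser = HeadingListParser()
parser.feed(SAMPLE_HTML)
print(parser.pairs)
```

The whitespace check in `handle_data` covers the same text between tags that, in BeautifulSoup, shows up as extra siblings.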

In [3]:
import doctest
doctest.run_docstring_examples(parse_banner, globals(), verbose=True)

Finding tests in NoName
Trying:
    len(parse_banner('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency'))
Expecting:
    38
ok


In [4]:
banner = parse_banner('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency')
assert len(banner)==38, 'Length of result should have been 38, but {} returned.'.format(len(banner))
assert banner[0][1]=='Arizona', 'Wrong data found in the first result item: {}'.format(banner[0])
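
The tests above depend on the live page, so the count of 38 can change whenever the site is updated. A doctest on a small pure function does not have that problem. The sketch below assumes a hypothetical helper, `count_by_state`, that summarizes [hospital, state] pairs shaped like the output of `parse_banner`; the sample pairs are invented.

```python
import doctest

def count_by_state(pairs):
    """(list)->dict
    Count how many hospitals fall under each state, given [hospital, state] pairs.

    >>> count_by_state([["Hospital A", "Arizona"], ["Hospital B", "Arizona"], ["Hospital C", "Colorado"]])
    {'Arizona': 2, 'Colorado': 1}
    """
    counts = {}
    for _hospital, state in pairs:
        counts[state] = counts.get(state, 0) + 1
    return counts

# Prints nothing when the example in the docstring passes.
doctest.run_docstring_examples(count_by_state, globals())
```

Calling `count_by_state(banner)` on the real result would give a quick per-state breakdown to compare against the page by eye.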

---

### 33.2 Using a REST API (from GitHub.com)

Many websites provide something called a REST API to access information from their site programmatically, rather than relying on HTML.  One example is GitHub.com, whose API allows you to do things like "list all the public repositories for a user."

The documentation for GitHub.com's REST API can be found here: https://docs.github.com/en/rest/guides/getting-started-with-the-rest-api

Create a function called **repo_summary(user)** that takes a GitHub.com user name as its parameter and retrieves a list of all the repositories you can see for that user.  The specific documentation for this kind of request can be found here: https://docs.github.com/en/rest/reference/repos#list-repositories-for-a-user. Make sure your function is well documented with a docstring and includes a simple test to verify that you get back 12 repositories when querying for the repositories for user **paulboal**.

I've provided a related example to help you out.

In [5]:
# Example -- this example of code shows how to get basic information on the user paulboal
# For your solution, make sure you meet the requirements in the instructions above.

import requests

response = requests.get('https://api.github.com/users/paulboal')
data = response.json()

print('This information is about {}. His website is {}.'.format(data.get('login'), data.get('blog')))
print(data)

This information is about paulboal. His website is www.amitechsolutions.com.
{'login': 'paulboal', 'id': 1817916, 'node_id': 'MDQ6VXNlcjE4MTc5MTY=', 'avatar_url': 'https://avatars.githubusercontent.com/u/1817916?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/paulboal', 'html_url': 'https://github.com/paulboal', 'followers_url': 'https://api.github.com/users/paulboal/followers', 'following_url': 'https://api.github.com/users/paulboal/following{/other_user}', 'gists_url': 'https://api.github.com/users/paulboal/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/paulboal/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/paulboal/subscriptions', 'organizations_url': 'https://api.github.com/users/paulboal/orgs', 'repos_url': 'https://api.github.com/users/paulboal/repos', 'events_url': 'https://api.github.com/users/paulboal/events{/privacy}', 'received_events_url': 'https://api.github.com/users/paulboal/received_events', 'type': 'User', 'si

In [6]:
# Your code Here
import requests

def repo_summary(user):
    """(str)->list
    This function takes a GitHub.com user name as its parameter and 
    retrieves a list of the names of all the repositories you can see for that user.
    
    >>> len(repo_summary('paulboal'))
    12
    """
    
    #build the GitHub API link for this user's repositories
    link = "https://api.github.com/users/{}/repos".format(user)

    #request the API link and decode the JSON response (a list of repository records)
    response = requests.get(link)
    repos_data = response.json()
   
    #collect the name of each repository
    repos = [repo['name'] for repo in repos_data]
    
    return repos
    
#footnotes (notes to self)
#Note: need to include ".format(user)" in the link for the tests below to work
#(#options for repos...(users, org); e.g. https://api.github.com/repos/{owner}/{repo})
#general format of the api link is 'https://api.github.com/users/{USERNAME}/repos'    
#See link: https://www.w3schools.com/python/ref_string_format.asp (also see section on "The Placeholders")    
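
One thing `repo_summary` does not handle is pagination. According to the GitHub REST documentation, list endpoints return results in pages (30 per page by default, up to 100 with `per_page`), so a user with more repositories than fit on one page would be undercounted. The sketch below only builds the URL for a given page and pulls names out of an already decoded payload, so it runs without network access. The helper names `repos_url` and `repo_names` are hypothetical, and the sample payload is trimmed down to the one field used here.

```python
from urllib.parse import urlencode

def repos_url(user, page=1, per_page=100):
    """Build the repository-listing URL for one page of results."""
    query = urlencode({"per_page": per_page, "page": page})
    return "https://api.github.com/users/{}/repos?{}".format(user, query)

def repo_names(payload):
    """Pull the 'name' field out of a decoded list of repository records."""
    return [repo["name"] for repo in payload]

# Hypothetical, heavily trimmed payload; real records have many more fields.
sample_payload = [{"name": "ajaxterm"}, {"name": "nppes_demo"}]

print(repos_url("paulboal"))
print(repo_names(sample_payload))
```

A full version would request `repos_url(user, page)` for page 1, 2, 3, and so on, stopping when a page comes back empty.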

In [7]:
import doctest
doctest.run_docstring_examples(repo_summary, globals(), verbose=True)

Finding tests in NoName
Trying:
    len(repo_summary('paulboal'))
Expecting:
    12
ok


In [8]:
repos = repo_summary('paulboal')
assert len(repos)==12, 'Expecting 12, but {} were found'.format(len(repos))

In [9]:
#Test
repo_summary('paulboal')

['ajaxterm',
 'cms_hospital_compare',
 'collibra-scripts',
 'coronadatascraper',
 'hadoop-heuristicsminer',
 'hds5210-2021',
 'hds5210-2022',
 'jupyterhub-nbgrader',
 'nppes_demo',
 'pexpect-curses',
 'scm-products',
 'tdwi-accelerate-2017-python']

---

### 33.3 Find Something of Your Own

Do some web searches and find an HTML page with some data that is interesting to something you're studying.  You can extract and parse that information using either BeautifulSoup or Pandas.  If you're using Pandas, then do something interesting to format and structure your data.  If you're using BeautifulSoup, you'll just need to do the work of parsing the data out of HTML -- that's hard enough!

You don't need to build this as a function.  Just use notebook cells as I've done above.  You will be graded based on _style_.  Use variable names that make sense for your problem / solution. Clean up anything you don't need before you submit your work.

In [10]:
#Title: 10-day Weather Forecast for Des Moines, Iowa
#Description: this code scrapes the 10-day weather forecast from weather.com for the city of Des Moines, Iowa.

#*****************************************************************
#Note: I did not find any information prohibiting the use of this site/data (see licensing info below)

# 3.  Ownership; Limited License.
# A. The Services are owned and operated by TWCPT, but the Services may include elements licensed from or provided by third parties. 
# TWCPT (along with its third party licensors) retains all right, title and interest in and to the Services, including all intellectual property rights in them. 
# The Services contain content (including, without limitation, data, text, software, images, video, graphics, music and sound) which is protected by United States 
# and worldwide copyright, trademark, trade secret, patent and other intellectual property laws, and with the exception of content in the public domain, 
# the rights to the content of the Services under such laws are owned or controlled by TWCPT or licensed from third parties.

# B. Subject to all of the terms and conditions of these Terms of Use, TWCPT grants you a limited, personal, non-exclusive, non-sublicensable, non-transferable 
# license to access and use the Services for your personal, non-commercial use only.  
# While you may access, view, use and display the Services for your personal, non-commercial use, you may not modify, reproduce, distribute,
# publicly display, publicly perform, rent, lease, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, 
# in whole or in part, except as expressly indicated by TWCPT in the Services, as expressly permitted under intellectual property laws, 
# or with the express written permission of TWCPT and any relevant third party owners of intellectual property rights in the content. 
# In the event of any permitted copying, redistribution or publication of protected material, you will not change or delete any author attribution, 
# trademark legend or copyright notice, and no ownership rights will be transferred.  
# Some of the Services may include information sourced from the U.S. National Oceanic and Atmospheric Administration/National Weather Service, 
# such as (without limitation) severe weather alerts, which information is in the public domain and is not subject to copyright protection.  
# TWCPT does not claim to own or control such information.
#*****************************************************************

###### Title: 10-day Weather Forecast for Des Moines, Iowa

In [11]:
# Your Code Here
#Import Modules &Packages
import requests
from bs4 import BeautifulSoup

#assign url to object
url = "https://weather.com/weather/tenday/l/9f4008162d433e9dd2584077ace40c9fc72765c052152218f346dc729206e589"

#request page
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(page.status_code)
#Notes on output: page status (200) is good!

#quick checks
#print(soup.prettify())
#list(soup.children)

#Let’s see what the type for each element in the list is:
print([type(item) for item in list(soup.children)])

#Notes on output:
#The first is a Doctype object, which contains information about the type of the document.
#The second & final item is a Tag object, which contains other nested tags.

200
[<class 'bs4.element.Doctype'>, <class 'bs4.element.Tag'>]


In [12]:
#10 Day Weather Forecast for Des Moines Iowa
ten_day = soup.find(class_="DailyForecast--DisclosureList--msYIJ") 
forecast_items = ten_day.find_all(class_="DaypartDetails--Content--hJ52O")
today = forecast_items[1]
#today

#use loops to iterate over the weekly forecasts...
#first, get weekdays & time of the day (i.e. daytime vs. nighttime)
weekly_tags = ten_day.select(".DailyForecast--DisclosureList--msYIJ .DailyContent--daypartName--1bzYn")
periods = [pt.get_text() for pt in weekly_tags]
print(periods)
print("-----------------------------")

#next, get the weekly temps
weekly_temps = ten_day.select(".DailyForecast--DisclosureList--msYIJ .DailyContent--temp--3d4dn")
temps = [tp.get_text() for tp in weekly_temps]
print(temps)
print("-----------------------------")

#next, get the weekly forecast descriptions
weekly_descr = ten_day.select(".DailyForecast--DisclosureList--msYIJ .DailyContent--narrative--hplRl")
descrs = [dscr.get_text() for dscr in weekly_descr]
#print(descrs)


['Mon 07 | Night', 'Tue 08 | Day', 'Tue 08 | Night', 'Wed 09 | Day', 'Wed 09 | Night', 'Thu 10 | Day', 'Thu 10 | Night', 'Fri 11 | Day', 'Fri 11 | Night', 'Sat 12 | Day', 'Sat 12 | Night', 'Sun 13 | Day', 'Sun 13 | Night', 'Mon 14 | Day', 'Mon 14 | Night', 'Tue 15 | Day', 'Tue 15 | Night', 'Wed 16 | Day', 'Wed 16 | Night', 'Thu 17 | Day', 'Thu 17 | Night', 'Fri 18 | Day', 'Fri 18 | Night', 'Sat 19 | Day', 'Sat 19 | Night', 'Sun 20 | Day', 'Sun 20 | Night', 'Mon 21 | Day', 'Mon 21 | Night']
-----------------------------
['16°', '43°', '22°', '31°', '16°', '22°', '10°', '25°', '2°', '25°', '22°', '46°', '35°', '46°', '26°', '51°', '37°', '61°', '36°', '56°', '34°', '55°', '35°', '55°', '34°', '60°', '39°', '65°', '41°']
-----------------------------


In [13]:
#combine info in a dataframe
import pandas as pd
weather = pd.DataFrame({
    "day_period": periods,
    "temp": temps,
    "forecast":descrs
})
weather

   day_period      temp  forecast
0  Mon 07 | Night  16°   Clear skies. Low 16F. Winds light and variable.
1  Tue 08 | Day    43°   Mostly sunny skies. High 43F. Winds SSW at 10 ...
2  Tue 08 | Night  22°   Mostly cloudy skies early will become partly c...
3  Wed 09 | Day    31°   A mix of clouds and sun early, then becoming c...
4  Wed 09 | Night  16°   Cloudy with snow showers developing after midn...
5  Thu 10 | Day    22°   Variably cloudy with snow showers. High 22F. W...
6  Thu 10 | Night  10°   Mostly cloudy skies. Low near 10F. Winds NW at...
7  Fri 11 | Day    25°   Partly to mostly cloudy. High around 25F. Wind...
8  Fri 11 | Night  2°    Clear skies. Low 2F. Winds NW at 10 to 15 mph.
9  Sat 12 | Day    25°   Generally sunny despite a few afternoon clouds...


---

## Check your work above

If you didn't get them all correct, take a few minutes to think through those that aren't correct.


## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Follow the instructions on the prompt below to either save and submit your work, or continue working.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

---

In [None]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git add week07_assignment_2.ipynb
    !git commit -a -m "Submitting the week 7 programming exercises"
    !git push
else:
    print('''
    
OK. We can wait.
''')