In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab12.ipynb")

<img src="data6.png" style="width: 15%; float: right; padding: 1%; margin-right: 2%;"/>

# Lab 12 – APIs and Prompt Engineering

## Data 6

In [None]:
# Just run this cell
from datascience import *
import numpy as np
import warnings
warnings.simplefilter('ignore')

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 1: Scraping HTML with Beautiful Soup

In this part, you will scrape and analyze this made-up webpage: [Fearless Widget Factory webpage](https://static.decontextualize.com/widgets2016.html). The page concerns the catalog of a famous (made-up) widget company. You'll be answering several questions about this web page.

Credit: This part is adapted from a tutorial and worksheet series, [Scraping HTML with Beautiful Soup](https://github.com/aparrish/dmep-python-intro) and [Web Scraping Worksheet](https://github.com/aparrish/dmep-python-intro/blob/master/web-scraping-worksheet.ipynb), by [Professor Allison Parrish](https://www.decontextualize.com/) at New York University.

---
## [Tutorial] The BeautifulSoup package

This tutorial highlights the most important parts of the [corresponding Data 6 Notes page](https://data6.org/notes/18-html/).
                                                                                                
As mentioned in class, HTML is hard to parse by hand. (Don't even try it.)

Beautiful Soup is a Python library that will parse the HTML for us, and give us some Python objects that we can call methods on to poke at the data contained therein. So instead of working with strings and bytes, we can work with Python objects, methods and data structures.


In [None]:
from bs4 import BeautifulSoup

**Reminder:** You can (and should!) use Developer Tools to understand a page’s HTML.

1. Open the [Fearless Widget Factory webpage](https://static.decontextualize.com/widgets2016.html) in a web browser like Chrome.
1. In Chrome, ctrl-click (or right-click) anywhere on the page and select "Inspect Element." This will open Chrome's Developer Tools.
1. Use Developer Tools to take a look at how `widgets2016.html` is organized. Your screen should look (something) like this:

<img src='widgets.png' width=500>

Instructions for Safari and Firefox are on the Data 6 notes.

Beautiful Soup only parses HTML; it does not download it. Let's use the `requests` library to **scrape**, or download, the HTML of this webpage from the web.


In [None]:
# just run this cell
import requests
html_str = requests.get("http://static.decontextualize.com/widgets2016.html").text
html_str

**Beautiful Soup** is a Python library that **parses HTML** (even poorly formatted HTML) and allows us to extract and manipulate its contents through Python objects and methods, instead of raw strings and bytes.

Note that BeautifulSoup can parse any HTML that is provided as a string. We’ve already gotten an HTML string `html_str` from our previous web request, so now we need to create a Beautiful Soup object from that data:

In [None]:
# just run this cell
from bs4 import BeautifulSoup

document = BeautifulSoup(html_str, "html.parser")
type(document)


The `document` object supports a number of interesting methods that allow us to dig into the contents of the HTML. Primarily what we'll be working with are:

* `Tag` objects, and
* `ResultSet` objects, which are essentially just lists of `Tag` objects.

## [Tutorial] Finding a tag and multiple tags

As we've previously discussed, HTML documents are composed of tags. To represent this, Beautiful Soup has a type of value that represents tags. We can use the `.find()` method of the `BeautifulSoup` object to find a tag that matches a particular tag name. For example:

In [None]:
h1_tag = document.find('h1')
type(h1_tag)

A `Tag` object has several interesting attributes and methods. The `string` attribute of a `Tag` object, for example, returns a string representing that tag's contents:

In [None]:
h1_tag.string

### Finding multiple tags

It's very often the case that we want to find not just one tag that matches particular criteria, but ALL tags matching those criteria. For that, we use the `.find_all()` method of the `BeautifulSoup` object. For example, to find all `h3` tags in the document:

In [None]:
h3_tags = document.find_all('h3')
type(h3_tags)

But what's in the `ResultSet`?

To find out, we can use a loop:

In [None]:
for tag in h3_tags:
    print(tag.string) 

This conveniently gives you a variable, `tag`, that updates with the appropriate value each time you iterate. 
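
To see `.find()` and `.find_all()` side by side without touching the network, here is a minimal sketch on a tiny HTML string. The HTML and the names `toy_html` and `toy_document` are invented for illustration and are not part of the widget page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch (not the widget page).
toy_html = """
<html><body>
  <h1>Fruit Stand</h1>
  <h3>Apples</h3>
  <h3>Pears</h3>
</body></html>
"""

toy_document = BeautifulSoup(toy_html, "html.parser")

# .find() returns only the first matching Tag.
first_h3 = toy_document.find("h3")
print(first_h3.string)

# .find_all() returns a ResultSet, which behaves like a list of Tags.
all_h3 = toy_document.find_all("h3")
h3_strings = []
for tag in all_h3:
    h3_strings.append(tag.string)
print(len(all_h3), h3_strings)

# When nothing matches, .find() returns None instead of raising an error.
missing = toy_document.find("h2")
print(missing)
```

Because `.find()` can return `None`, asking for `.string` on a tag that does not exist raises an error; checking the page in Developer Tools first helps avoid that.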

## [Tutorial] Tag attributes

You can access the attributes of a tag by treating the tag object as though it were a **dictionary**. To get the **value** associated with a particular **attribute**, use the square-bracket syntax, providing the attribute name as the key (a string).

The below cell gets the first link in the document. It does this by getting the `href` attribute of the first `<a>` tag in the document:

In [None]:
# just run this cell
a_tag = document.find('a')
a_tag['href']

Contrast this with the string of the first `<a>` tag, which you can verify in your Developer Tools:

In [None]:
# just run this cell
a_tag.string

This tells us that email addresses can be links too! To make an email address clickable, prefix it with `mailto:` in the `href` attribute.
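
The difference between a tag's attribute and its string can be sketched on a one-line HTML snippet. The snippet and the address `hello@example.com` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
toy_html = '<p>Write to <a href="mailto:hello@example.com">our inbox</a>.</p>'
toy_document = BeautifulSoup(toy_html, "html.parser")

link = toy_document.find("a")

# Square brackets look up an attribute, just like a dictionary key.
link_target = link["href"]

# .string is the text between <a> and </a>, which is what a visitor sees.
link_text = link.string

# .get() also looks up an attribute, but returns None if it is absent
# (square brackets would raise a KeyError instead).
missing_attr = link.get("title")

print(link_target)
print(link_text)
print(missing_attr)
```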

## [Tutorial] Finding tags with a certain attribute or class

Relevant BeautifulSoup documentation: [Searching by CSS class](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class)

You may notice that there are two `<a>` tags:

In [None]:
# just run this cell
for a_tag in document.find_all('a'):
    print(a_tag)

The second tag has a particular **class** attribute. This special attribute helps determine the tag's CSS, which is a web language for describing presentation and styling that we won't talk about in detail.

To get this second tag, we can specify its class. However, because `class` is a reserved word in Python, we must use the keyword argument `class_` instead, an optional argument to `find` or `find_all`:

In [None]:
# just run this cell
document.find_all("a", class_="tel")

Otherwise, for general attributes that are **not** the class attribute, BeautifulSoup lets us specify the attributes as optional keywords:

In [None]:
# just run this cell
document.find_all("a", href="tel:2125559912")
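
As a further illustration of filtering by class and by other attributes, here is a small sketch on invented HTML. The class names `internal` and `external` and the URLs are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
toy_html = """
<ul>
  <li><a class="internal" href="/about">About</a></li>
  <li><a class="external" href="https://example.com">Example</a></li>
  <li><a class="external" href="https://example.org">Another</a></li>
</ul>
"""
toy_document = BeautifulSoup(toy_html, "html.parser")

# Filter by CSS class with the class_ keyword argument.
external_links = toy_document.find_all("a", class_="external")
external_hrefs = []
for link in external_links:
    external_hrefs.append(link["href"])
print(external_hrefs)

# Filter by any other attribute by using its name as a keyword argument.
about_links = toy_document.find_all("a", href="/about")
print(about_links[0].string)

# The attrs dictionary is an alternative spelling of the class filter.
same_links = toy_document.find_all("a", attrs={"class": "external"})
print(len(same_links))
```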

<br></br>
<hr style="border: 1px solid #fdb515;" />

# Question 1

This question explores the Widget page. Use the BeautifulSoup `document` to answer each question.

Before writing code, we strongly suggest you:

* Use the Developer Tools in your web browser to explore and understand which tags are needed. Here's the [Fearless Widget Factory webpage](https://static.decontextualize.com/widgets2016.html) again. 
* Use Developer Tools to figure out what the answer to each question should be, and write it down.
* Then, write the code.

Please—ask questions if you're stuck!

## Question 1a
Assign `h3_tags` to the result set of all `h3` tags on the page, then assign `num_h3_tags` to the number of `h3` tags.

_Hint_: Use `find_all()`.

In [None]:
h3_tags = ...
num_h3_tags = ...

num_h3_tags

In [None]:
grader.check("q1a")

## Question 1b
Assign `telephone_number` to a string of the telephone number displayed beneath the "Widget Catalog" header. This telephone number should be formatted as `'XXX-XXX-XXXX'`.

_Hint_: Figure out which tag stores the telephone number. Which attribute in this tag contains the telephone number string? You shouldn't need to use any string methods—just use BeautifulSoup calls!

In [None]:
telephone_number = ...

telephone_number

In [None]:
grader.check("q1b")

## Question 1c
Use Beautiful Soup to write some code that creates `widget_names`, a **list** of all the widget names on the page. After your code has executed, `widget_names` should evaluate to a list that looks like this (though not necessarily in this order):

```
Skinner Widget
Widget For Furtiveness
Widget For Strawman
Jittery Widget
Silver Widget
Divided Widget
Manicurist Widget
Infinite Widget
Yellow-Tipped Widget
Unshakable Widget
Self-Knowledge Widget
Widget For Cinema
```

_Hints_:
* Is each widget name stored as a tag attribute or as a tag string?
* Is there a specific tag class associated with widget names?
* We suggest iteratively making the final list of strings `widget_names` _after_ you've identified the right list of widget tags. The code structure is below.

In [None]:
widget_name_tags = ...

widget_names = []
for widget_name_tag in widget_name_tags:
    ...

# do not edit below this line
for widget_name in widget_names:
    print(widget_name)

In [None]:
grader.check("q1c")

<br></br>
<hr style="border: 1px solid #fdb515;" />

# Question 2

This question asks you to construct a **list of dictionaries**, where each dictionary contains information about a widget. Let's begin...

As you see from the [Fearless Widget Factory page](https://static.decontextualize.com/widgets2016.html), there are four tables with widget information. Each row of each table contains the widget name, the part number, the price, and how many of this widget are in the warehouse.

## [Tutorial] Find under a specific tag

We will explore the structure of HTML tables more deeply in the project. For now, each widget row is described by a `<tr>` tag with class attribute `"winfo"`:

```
  <tr class="winfo">
    <td class="partno">C1-9476</td>
    <td class="wname">Skinner Widget</td>
    <td class="price">$2.70</td>
    <td class="quantity">512</td>
  </tr>
```

Run the cell below, which uses BeautifulSoup to create the _list_ of tags `widget_info_tags`, then prints out the first tag's HTML. We use `prettify` to display the HTML with line breaks and indentation.

Verify the first tag's HTML matches the HTML we showed above.

In [None]:
# just run this cell
widget_info_tags = document.find_all("tr", class_="winfo")
first_tag = widget_info_tags[0]
print(first_tag.prettify())

As you can see, `first_tag` is a `<tr>` tag that contains nested `<td>` tags corresponding to each column's value in the four-element row.

To get the part number, we can then use `find` on `first_tag` itself to find the appropriate tag that is a child or descendant of the first tag:

In [None]:
# just run this cell. should display "C1-9476"
part_tag = first_tag.find("td", class_="partno")
part_tag.string
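
The same "search inside a tag" idea can be sketched on a made-up table. The class names `book`, `title`, and `year` and the cell values are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
toy_html = """
<table>
  <tr class="book"><td class="title">Book A</td><td class="year">1999</td></tr>
  <tr class="book"><td class="title">Book B</td><td class="year">2005</td></tr>
</table>
"""
toy_document = BeautifulSoup(toy_html, "html.parser")

rows = toy_document.find_all("tr", class_="book")
second_row = rows[1]

# Calling .find() on a Tag searches only inside that tag.
title = second_row.find("td", class_="title").string
year = second_row.find("td", class_="year").string

# Calling .find() on the whole document returns the first match anywhere.
first_title_in_document = toy_document.find("td", class_="title").string

print(title, year, first_title_in_document)
```

Note that `year` is a string here, not a number: everything scraped from HTML starts out as text.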

---
## [Tutorial] Lists vs. Arrays: Appending Elements

To append a single element to a list, you can use the list method `append`. Run the next three cells below:

In [None]:
# just run this cell
my_lst = ["some", "elements", 2]
my_lst

In [None]:
# just run this cell
my_lst.append("in sequence")

In [None]:
# just run this cell
my_lst

Above, the list method `append` returns nothing and directly modifies the original list. This behavior is unlike `np.append`, which returns a new array. Run the cells below for comparison:

In [None]:
# just run this cell
arr = make_array("some", "elements")
np.append(arr, "in sequence")

In [None]:
# just run this cell
# np.append leaves arr unchanged
arr 

To "update" the original array, assign the `np.append` return value to the original array name:

In [None]:
# just run this cell
arr = np.append(arr, "in sequence")
arr
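
The contrast between the two kinds of append can be condensed into one short sketch. It uses `np.array` instead of `make_array` so that it only depends on NumPy, and the variable names are made up for this example.

```python
import numpy as np

# list.append changes the list in place and returns None.
my_list = ["a", "b"]
result = my_list.append("c")
print(result)
print(my_list)

# np.append takes the array first and the new value second.
# It returns a new array and leaves the original unchanged.
arr = np.array(["a", "b"])
new_arr = np.append(arr, "c")
print(arr)
print(new_arr)
```

A common slip is writing `my_list = my_list.append("c")`, which replaces the list with `None`.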

---

## Exercise

This part may take a bit of time. That is okay!

In the cell below, we've made an empty list and assigned it to a variable called `widgets`. Write code that populates this list with dictionaries, one dictionary per widget in the source file. The keys of each dictionary should be `partno`, `wname`, `price`, and `quantity`, and the value for each of the keys should be the value for the corresponding column for each row. After executing the cell, your list should look something like this:

```
[{'partno': 'C1-9476',
  'price': '$2.70',
  'quantity': '512',
  'wname': 'Skinner Widget'},
 {'partno': 'JDJ-32/V',
  'price': '$9.36',
  'quantity': '967',
  'wname': 'Widget For Furtiveness'},
  ...several items omitted...
 {'partno': '5B-941/F',
  'price': '$13.26',
  'quantity': '919',
  'wname': 'Widget For Cinema'}]
```

And this expression:

```
widgets[5]['partno']
```
    
... should evaluate to:

```
LH-74/O
```

_Hint_: You may want to use a `for` loop over all of the `widget_info_tags` to iteratively append to `widgets`.
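
The general pattern of building a list of dictionaries inside a loop can be sketched with plain Python data. The pet data and the keys `name`, `species`, and `age` are invented and unrelated to the widget page.

```python
# Hypothetical rows, invented for this sketch: [name, species, age].
rows = [
    ["Mochi", "cat", "3"],
    ["Rex", "dog", "7"],
]

pets = []
for row in rows:
    # Build one dictionary per row, then add it to the growing list.
    pet = {
        "name": row[0],
        "species": row[1],
        "age": row[2],
    }
    pets.append(pet)

# First pick a dictionary by position, then pick a value by key.
print(pets[1]["species"])
```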

In [None]:
widgets = []

for widget_info in widget_info_tags:
    widget_dict = {
        ...
        ...
        ...
        ...
    } 
    widgets.append(widget_dict)

# don't edit below this line
widgets

In [None]:
# you can also run this cell to double check
widgets[5]['partno']

In [None]:
grader.check("q2")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 2: Genius API
In this part, you will access the http://genius.com/ lyrics database using the Genius API.

Credit: The Genius API tutorial and exercises are adapted from [Intro to Cultural Analytics](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html) by [Professor Melanie Walsh](https://melaniewalsh.org/) at the University of Washington.

Note: The below tutorials and Question 3 are identical to the beginning of Project 2. If you have started Project 2, copy your API token from there to complete Question 3.

---

## [Tutorial] Getting a Genius API key 

This tutorial walks you through obtaining an API Key for the Genius API. This tutorial is adapted from [Intro to Cultural Analytics](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html) by Melanie Walsh and is also available in the [Data 6 Notes](https://data6.org/notes/).

## API Keys

To use the Genius API, you need a special API key, specifically a "Client Access Token", which is kind of like a password. Many APIs require authentication keys to gain access to them. To get your necessary Genius API keys, you need to navigate to the following URL: [https://genius.com/api-clients](https://genius.com/api-clients).

<center><img src="Genius-API.png" width=50% ></center>

You'll be prompted to sign up for [a Genius account](https://genius.com/signup_or_login), which is required to gain API access. Signing up for a Genius account is free and easy. You just need a Genius nickname (which must be one word), an email address, and a password.

Once you're signed in, you should be taken to [https://genius.com/api-clients](https://genius.com/api-clients), where you need to click the button that says "New API Client."

<center><img src="Genius-New-API.png" width=50% ></center>

After clicking "New API Client," you'll be prompted to fill out a short form about the "App" that you need the Genius API for. You only need to fill out "App Name" and "App Website URL."

It doesn't really matter what you type in. You can simply put "Song Lyrics Project" for the "App Name" and the URL for our course website "https://data6.org/fa25/" for the "App Website URL."

When you click "Save," you'll be given a series of API Keys: a "Client ID" and a "Client Secret." To generate your "Client Access Token," which is the API key that we'll be using in this notebook, you need to click "Generate Access Token".


---

# Question 3
Load your API key from the Python file `api_key.py`.

Follow the corresponding instructions in the [Data 6 Notes](https://data6.org/notes/18-html/genius.html) to navigate to the `api_key.py` file and edit it.

After you have updated `api_key.py` (the file within this assignment directory), running the below line should load your API token into `client_access_token`.

In [None]:
# before running this cell, make sure that you have updated api_key.py with your API key
%reload_ext autoreload
%autoreload 2

import api_key
client_access_token = api_key.my_client_access_token

In [None]:
grader.check("q3")

---
## [Tutorial] Making an API request


Making an API request looks a lot like typing a specially-formatted URL. But instead of getting a rendered HTML web page in return, you get some data in return.

There are a few different ways that we can query the Genius API, all of which are discussed in the [Genius API documentation](https://docs.genius.com/#songs-h2). The way we're going to cover in this lesson is [the basic search](https://docs.genius.com/#search-h2), which allows you to get a bunch of Genius data about any artist or songs that you search for:

`http://api.genius.com/search?q={search_term}&access_token={client_access_token}`

Sticking with our Missy Elliott theme/obsession, we're going to search for Genius data about Missy Elliott.

First we're going to assign the string "Missy Elliott" to the variable `search_term`. Then we're going to make an f-string URL that contains the variables `search_term` and `client_access_token`.

In [None]:
search_term = "Missy Elliott"

In [None]:
genius_search_url = f"http://api.genius.com/search?q={search_term}&access_token={client_access_token}"

This URL is basically all we need to make a Genius API request. Want proof? Run the cell below and print this URL, then copy and paste it into a new tab in your web browser.

In [None]:
genius_search_url

It doesn't look pretty, but that's a bunch of Genius data about Missy Elliott!
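
One detail worth noticing is that the search term contains a space. Browsers and `requests` generally encode such characters for us, but the encoding can also be made explicit with the standard library. The sketch below assumes a placeholder token, `NOT_A_REAL_TOKEN`, and makes no web request.

```python
from urllib.parse import urlencode

search_term = "Missy Elliott"
fake_token = "NOT_A_REAL_TOKEN"  # placeholder, not a working token

# The f-string approach keeps the space exactly as typed.
raw_url = f"http://api.genius.com/search?q={search_term}&access_token={fake_token}"
print(raw_url)

# urlencode builds the query string and encodes the space as "+".
params = urlencode({"q": search_term, "access_token": fake_token})
encoded_url = f"http://api.genius.com/search?{params}"
print(encoded_url)
```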

We can programmatically do the same thing by again using the Python library `requests` with this URL. Instead of getting the `.text` of the response, as we did before, we're going to use `.json()`.

[JSON](https://www.w3schools.com/whatis/whatis_json.asp) is a data format that is commonly used by APIs. JSON data can be nested and contains key/value pairs, much like a Python dictionary.
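
To make the JSON-to-dictionary connection concrete, here is a minimal sketch that parses a small JSON string with Python's built-in `json` module. The JSON text is invented for this example. When we call `.json()` on a `requests` response, the response body is parsed in much the same way.

```python
import json

# Hypothetical JSON text, invented for this sketch.
json_text = '{"pet": {"name": "Mochi", "toys": ["plushie", "string"], "age": 3}}'

# json.loads turns JSON text into Python dictionaries, lists, strings, and numbers.
parsed = json.loads(json_text)
print(type(parsed))

# Nested values are reached with repeated square brackets.
first_toy = parsed["pet"]["toys"][0]
age = parsed["pet"]["age"]
print(first_toy, age)
```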

In [None]:
import requests

response = requests.get(genius_search_url)
json_data = response.json()

The JSON data that we get from our Missy Elliott API query looks something like this:

In [None]:
json_data

We can index this data (again, like a Python dictionary) and look at the first "hit" about Missy Elliott from Genius.com.

In [None]:
json_data['response']['hits'][0]

<br></br>
<hr style="border: 1px solid #fdb515;" />

# [Tutorial] Understanding the JSON from the Genius API call

The Genius API's basic search function returns JSON data about **hits**: the search results that match the search term, which for an artist are typically that artist's top songs. There is a lot of information about each hit. This tutorial's goal is to understand the structure of this JSON.


JSON data is much like a **Python dictionary**. We can access values by providing keys. We also note that some of the values are themselves dictionaries, meaning that the `json_data` dictionary is a **nested dictionary**. As an example, here are the two entries in the `json_data` dictionary:

* `meta`: Information about the web request, which is itself a dictionary. An HTTP status of 200 means "OK", i.e., data was successfully returned and stored into the JSON response. We generally ignore this key.
* `response`: The data from the web request, which is itself a dictionary. **We mostly care about this `response`.**

As an example, here is the first “hit” about Missy Elliott from Genius.com pertaining to her hit song, "Work It". Before running the cell, check to see that you understand the multiple layers of square brackets. Refer to the full `json_data` structure above.

```
json_data['response']['hits'][0]
              1          2    3
```


1. `json_data['response']`: Within the `json_data` dictionary, look up the key `"response"` and get a dictionary.
2. `json_data['response']['hits']`: Within this dictionary, look up the key `"hits"` and get a list.
3. `json_data['response']['hits'][0]`: Get the first (zero-th) element of the `hits` list, which is a dictionary.
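
The layer-by-layer reading above can be sketched on a small nested dictionary. The structure and the keys `items`, `details`, and `name` are invented for this example and are not the Genius response format.

```python
# Hypothetical nested data, invented for this sketch.
toy_data = {
    "meta": {"status": 200},
    "response": {
        "items": [
            {"kind": "book", "details": {"name": "Book A", "pages": 100}},
            {"kind": "book", "details": {"name": "Book B", "pages": 250}},
        ]
    },
}

# Peel off one layer at a time and check the type at each step.
step_1 = toy_data["response"]      # a dictionary
step_2 = step_1["items"]           # a list
step_3 = step_2[1]                 # a dictionary (the second item)
step_4 = step_3["details"]["name"] # a string
print(type(step_1), type(step_2), type(step_3), step_4)

# The same lookup written as one chain of square brackets.
chained = toy_data["response"]["items"][1]["details"]["name"]
print(chained)
```

If a chain of brackets raises an error, breaking it into steps like this shows which layer is a list (needs a number) and which is a dictionary (needs a key).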

In [None]:
# just run this cell
json_data['response']['hits'][0]

<br></br>
<hr style="border: 1px solid #fdb515;" />

# Question 4

---
## Question 4a


Genius is a lyrics database. Using the Genius API, we can find the lyrics page URL (i.e., web link) of each song in the provided hits. In the cell below, assign `top_hit_title` to the title of the first “hit” about Missy Elliott. A song’s title is under the key `title`. The first hit is the first element in the `hits` list.

Use square bracket operations to get this value. Your answer should be in the form:

```
top_hit_title = json_data[KEY_1][KEY_2][IND][KEY_3][KEY_4]
```

where you provide strings or numbers for `KEY_1`, `KEY_2`, `IND`, `KEY_3`, and `KEY_4`.


In [None]:
top_hit_title = ...
top_hit_title

In [None]:
grader.check("q4a")

---
## Question 4b

Each hit keeps track of the number of views of its page (i.e., accesses to the hit's URL). In the cell below, assign `top_view_counts` to the page view counts of the first “hit” about Missy Elliott. Hint: Look carefully at the structure of the `json_data` dictionary.

Use square bracket operations to get this value. Your answer should be in the form:

```
top_view_counts = json_data[KEY_1][...]...
```

where you provide a series of square bracket operations, just as in the previous part.

In [None]:
top_view_counts = ...
top_view_counts

In [None]:
grader.check("q4b")

---
## Question 4c

This question may take a bit longer than the previous parts.

Assign `total_view_counts` to the sum of page view counts across **all** Missy Elliott hits. You may want to loop over all of the hits; we have provided a for loop template below.


In [None]:
total_view_counts = ...
for song in json_data['response']['hits']:
    ...

# do not edit this line
total_view_counts

In [None]:
grader.check("q4c")

## Pets of Data 6

Luna says congratulations on completing Lab 12!

<img src="luna.jpeg" width="50%" alt="Cat and cat plushie"/>

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)