In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("project2.ipynb")

<img src="data6.png" style="width: 15%; float: right; padding: 1%; margin-right: 2%;"/>

# Project 2: The Web — HTML and Web APIs

## Data 6

In this project, you will explore text analysis on the web:
* How to use Web APIs to get data (JSON, HTML) from databases like Wikipedia and Genius Lyrics
* How to process and analyze text with structure
* How to parse HTML with BeautifulSoup
* How to build Python f-strings
* Why web scraping has limits


**Notes** on each part:

* [Part 1: Wikipedia and NLP](#Part-1:-Wikipedia-and-NLP) can be completed before the BeautifulSoup/HTML lab, because it only assumes you know how to work with dictionaries.
* [Part 2: Wikipedia and HTML](#Part-2:-Wikipedia-and-HTML) assumes you have some fluency with BeautifulSoup, so we suggest you do the related lab first.
* [Part 3: The Genius API](#Part-3:-The-Genius-API) is doable without the related lab, but do come to office hours if you have issues setting up your Genius API key.

### Project Partners

For projects, you can work with up to one partner. Please see our syllabus for our [partner collaboration policy](https://data6.org/fa25/syllabus/#projects).

If you work with a partner, please **include their name** below. **Additionally, submit exactly one copy of the assignment** by listing your partner in your Gradescope submission.

**Partners**:
* <partner 1 name, partner 1 email>
* <partner 2 name, partner 2 email>

In [None]:
# Run this cell, don't change anything.
from datascience import *
import numpy as np

from IPython.display import HTML, display
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

import warnings
warnings.filterwarnings('ignore')
%reload_ext autoreload
%autoreload 2

from bs4 import BeautifulSoup
import re
import requests

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Wikipedia: Introduction


**Natural language processing (NLP)** is the processing of natural language information by a computer [[Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing)]. The combination of NLP and artificial intelligence underlies the chatbots and AI tools that we see today, but NLP has very humble beginnings. Many text **documents** need to be analyzed, processed, and understood by humans. In other words, **text is also data**.

Furthermore, data involving human language has structures that can't be _fully represented_ by tables (rows, columns), and other rectangular structures. Fortunately, we have the power of dictionaries and HTML to help us represent and analyze text. Let's see!

A Wikipedia article is just words...right? Not quite. Take a look at the two Wikipedia articles that we explore in this project:

* [Data Science](https://en.wikipedia.org/wiki/Data_science)
* [Python (Programming Language)](https://en.wikipedia.org/wiki/Python_(programming_language))

As you can see, there are many other features of this **text data**. There are headers, images, formatting, links, ... the list goes on. In this project, we will explore many different representations of a Wikipedia article:

1. Natural Language Processing structures:
    * One single text string of all words in the article
    * A list of all words in the article
    * A **bag of words**
1. Article HTML and document structure

**The Wikipedia API**: Wikipedia, like many other modern databases, has an API that can be used to extract all sorts of data. The [Wikimedia Action API](https://www.mediawiki.org/wiki/API:Action_API), run by Wikimedia, is what data scientists, software developers, and researchers use to study Wikipedia and other knowledge bases. This API does the heavy lifting of retrieving article data for us and makes it much easier to analyze text and HTML in our Jupyter notebooks. To use this Wikipedia API, we provide a modified URL that can handle direct data requests (as opposed to the two Wikipedia links above, which are meant for web browser requests).

We have constructed all API calls for you already, so no worries on that front—but feel free to ask us if you're curious!

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 1: Wikipedia and NLP

> What are the most important words in the Wikipedia article on [Data Science](https://en.wikipedia.org/wiki/Data_science)?

Let's answer this question using Natural Language Processing techniques.

First, run the cell below to use the API to load the Data Science Wikipedia page into `wiki_text`. (For those curious, this [API Sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=extracts&titles=Data_science&formatversion=2&exlimit=1&explaintext=1&exsectionformat=wiki) was used to construct the URL in the request.)
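
For those curious about how such a request URL is put together, here is a minimal sketch using only Python's standard library. The names `base_url`, `params`, and `built_url` are ours for illustration; the parameter values are copied from the API Sandbox link above. It only builds the URL string and does not contact Wikipedia.

```python
from urllib.parse import urlencode

# Endpoint of the Action API for English Wikipedia.
base_url = "https://en.wikipedia.org/w/api.php"

# Each key-value pair becomes one `key=value` piece of the query string.
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "titles": "Data_science",
    "formatversion": 2,
    "exlimit": 1,
    "explaintext": 1,
    "exsectionformat": "wiki",
}

# Join the endpoint and the encoded parameters with "?".
built_url = base_url + "?" + urlencode(params)
print(built_url)
```

The printed string should match the `url` used in the cell below, piece by piece.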

In [None]:
# just run this cell. feel free to ask us about it.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Data 6 Student/0.0 (https://data6.org/; data6@berkeley.edu)'}

url = "https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=Data_science&formatversion=2&exlimit=1&explaintext=1&exsectionformat=wiki"
response = requests.get(url, headers=headers)
wiki_text = response.json()["query"]["pages"][0]["extract"]
wiki_text

As we can see, `wiki_text` is a very, very long string that has lots of information but might not be very useful to process. How long is it? Well...

In [None]:
# just run this cell
len(wiki_text)

That's over 7500 characters—but not all of them are words! These characters also include whitespace (spaces, line breaks/new lines) and punctuation. In order to understand the most common words in our Wikipedia article, we need a means of separating out words from non-words, and one word from other words.

A string where all words are jumbled together by whitespace will not help us much. Let's see if a list or an array will fare better. 

Next, run the cell below to construct `wiki_words`, a list of words in order of appearance in the article. The cell uses [regular expressions](https://en.wikipedia.org/wiki/Regular_expression), which we don't cover in this course but which are widely used for text parsing. The cell also converts all words to **lowercase**.
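
To get a feel for this kind of splitting before running it on the full article, here is a small sketch on a made-up sentence. The names `sample_text` and `sample_words` are ours. Note one quirk: when the text ends in punctuation, the split leaves an empty string at the end of the list.

```python
import re

sample_text = "Data 6 is cool!"

# r"\W+" matches one or more characters that are not letters, digits, or underscores.
# Lowercase first, then split wherever such a run of "non-word" characters occurs.
sample_words = re.split(r"\W+", sample_text.lower())
print(sample_words)  # ['data', '6', 'is', 'cool', '']
```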

In [None]:
# just run this cell
import re
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    # raw string (r"...") so that \W is read as a regex pattern, not a string escape
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

wiki_words = split_into_words(wiki_text)
len(wiki_words)

There are over 1100 words in our Wikipedia article! We won't print out all of them, but verify for yourself that the first twenty words in the `wiki_words` list match the first twenty words in the original `wiki_text` string:

In [None]:
# just run this cell
wiki_words[:20]

Notice that just from the list above, we can see that the word `"scientific"` appears in our [Data Science](https://en.wikipedia.org/wiki/Data_science) article twice. Let's explore an NLP model called **bag of words** that can help us track these counts.

<hr style="border: 1px solid #fdb515;" />

# Question 1: Bag of Words

A bag of words is a very common NLP representation of text documents like Wikipedia articles. Bag of words is a simple but powerful model involving tracking **word frequencies**. A word's **frequency** in a document is the number of times it appears in a given document.

In this question, you will create a **bag of words** model as a dictionary of word frequencies:
* Each **key** is a word that appears in the document.
* Each **value** is the number of times that word appears.
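
As a small illustration of this structure, here is a hand-written bag of words for a different, made-up document. The names `toy_document` and `toy_word_freqs` are ours. One sanity check that may help: the counts should add up to the total number of words in the document.

```python
toy_document = "to be or not to be"

# Keys are the unique words; values are how many times each word appears.
toy_word_freqs = {"to": 2, "be": 2, "or": 1, "not": 1}

# The counts should sum to the number of words in the document.
total_count = sum(toy_word_freqs.values())
print(total_count)                # 6
print(len(toy_document.split()))  # 6
```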

<!-- BEGIN QUESTION -->

---

## Question 1a

Before we dive into our lengthy Wikipedia document, let's spend some time understanding the meaning of word frequencies.

Consider the document below (consisting of two sentences, but with all punctuation and capitalization removed):

> data 6 is data science data science is cool

Build a bag of words model for this document. In other words, assign `manual_word_freqs` to a dictionary of word frequencies. The keys in this dictionary should be all unique words in this document; the values should be the integer counts of each word.

_Notes_:
* You should **not** use a loop to create this dictionary. Instead, make a new dictionary using the curly bracket syntax.
* We have provided some keys for you, but you should supply their values, and you will need to add some more key-value pairs.
* For the purposes of this project, we assume a **word** is any lowercase alphanumeric string of characters that includes neither whitespace nor punctuation. `"6"` is therefore a word; we have explicitly included it in your dictionary.

In [None]:
manual_word_freqs = {
    "data": ...,
    "6": ..., 
    "is": ...,
    ... # add more key-value pairs here
}
manual_word_freqs

In [None]:
grader.check("q1a")

<!-- END QUESTION -->

Now that we are equipped with this "bag of words" concept, consider what it would mean to write code to build `word_freqs`, a dictionary of word frequencies in the Wikipedia article on Data Science.

As we saw earlier, there are over 1100 words in this article. It is infeasible for us to manually count all the words. Instead, we will use the power of **iteration** to create our dictionary, one word at a time.

**Algorithm**: To build `word_freqs`, iterate over the words in `wiki_words`. For each word encountered, update its count in `word_freqs`. Of course, sometimes we will encounter a new word which does not yet exist in `word_freqs`; when we do so, we must first create an entry in the dictionary, and then increment the count.

For example, suppose `wiki_words` is the list `['apple', 'banana', 'apple']`, and we start out with an empty `word_freqs` dictionary `{}`. We iterate over the words in `wiki_words`:

1. First word: Since `'apple'` doesn't exist in the dictionary, create an entry in `word_freqs` with a count of 0 and increment it by 1. `word_freqs` is now `{'apple': 1}`.
1. Second word: Since `'banana'` doesn't exist in the dictionary, create an entry in `word_freqs` with a count of 0 and increment it by 1. `word_freqs` is now `{'apple': 1, 'banana': 1}`.
1. Third word: Since `'apple'` is in the dictionary, look up the entry and increment it by 1. `word_freqs` is now `{'apple': 2, 'banana': 1}`.

<!-- BEGIN QUESTION -->

---

## Question 1b

Using the previously described algorithm, complete the code below to construct `word_freqs` from `wiki_words`.

_Optional_: If you truly want to be fancy, check out the `get(key, default)` dictionary method in Python, which returns the value for `key` if `key` is in the dictionary and `default` otherwise (see the [Python `get` documentation](https://docs.python.org/3/library/stdtypes.html#dict.get)). Otherwise, this part is totally solvable with square-bracket access of keys, provided the key is in the dictionary to begin with. Our solution doesn't use `get`.
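
Here is a short sketch of how `get` behaves on a made-up dictionary (the name `fruit_counts` is ours). Notice that `get` only looks a key up; it does not add the key to the dictionary.

```python
fruit_counts = {"apple": 2, "banana": 1}

# The key exists, so get returns its value.
apple_count = fruit_counts.get("apple", 0)
# The key is missing, so get returns the default (0 here).
cherry_count = fruit_counts.get("cherry", 0)

print(apple_count)               # 2
print(cherry_count)              # 0
print("cherry" in fruit_counts)  # False: get did not create an entry

# By contrast, fruit_counts["cherry"] would raise a KeyError.
```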

In [None]:
word_freqs = {}

for word in wiki_words:
    ...

# edit the below line as needed to check your work
test_words = ["data", "science", "scientific", "and", "the", "academic"]
for word in test_words:
    print(f"{word}: {word_freqs.get(word, 'not found!')}")

In [None]:
grader.check("q1b")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

---

## Question 1c: Word Frequency Table

While dictionaries can store a variety of key-value pairs, our word frequency dictionary has a very _rigid_ structure, where each (key, value) pair has the same meaning: the number of times a word appears on the Wikipedia page.

Because of this consistent structure, we can therefore **also** store our word frequencies in a **table**, with two columns: `word` and `count`, where each row is a word on the Wikipedia page.

Use your `word_freqs` dictionary to build the `word_freqs_table`, which should have the below structure:

| word | count | 
| --- | --- |
| ... | ... |
| science | 11 |
| ... | ... |

_Hint_: Use the `keys` and `values` dictionary methods and the `with_columns` Table method.
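
The sketch below, on a made-up dictionary (names are ours), shows why `keys` and `values` are safe to use as two parallel columns: both come back in the same order, so the item at position `i` of one matches the item at position `i` of the other. Wrapping each in `list(...)` gives an ordinary list.

```python
fruit_counts = {"apple": 2, "banana": 1, "cherry": 5}

fruit_names = list(fruit_counts.keys())     # the keys, in insertion order
fruit_values = list(fruit_counts.values())  # the values, in the same order

print(fruit_names)   # ['apple', 'banana', 'cherry']
print(fruit_values)  # [2, 1, 5]
```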

In [None]:
word_freqs_table = Table().with_columns(
    ...
) 

word_freqs_table

In [None]:
grader.check("q1c")

<!-- END QUESTION -->


<hr style="border: 1px solid #fdb515;" />

# Question 2: Stop words

As a reminder, our goal in this part of the project is to find the most important words in our Wikipedia article. Now that we've transformed our text data into a tabular representation—and, in particular, designed a variable that measures the "importance" of each word by count—we can use `Table` methods to process and visualize. 

However, notice that this variable measures the "importance" of each word **purely by its frequency**. _Some_ words, while _frequent_, do not seem particularly _"important"_ to the article content:

In [None]:
# just run this cell
word_freqs_table.sort("count", descending=True)

As we can see, there are many words that seem important to writing the English language but are not necessarily important to the topic of Data Science: "and", "the", "of", etc. Such words are called **stop words** [[Wikipedia](https://en.wikipedia.org/wiki/Stop_word)] because processing and analyzing them is auxiliary to the purposes of understanding meaning.

There are many lists of stop words out there. Instead of constructing our own, we will use a reliable [Python package](https://pypi.org/project/stop-words/), `stop-words`:

In [None]:
# just run this cell
!pip install stop-words

We can then use this package to create the list `stop_words`, with stop words in the English language. We've sampled out a few for you; notice that the list includes contractions and some common abbreviations in scientific text.

In [None]:
from stop_words import get_stop_words

# Get English stop words
stop_words = get_stop_words('english')

# five random words
Table().with_columns("word", stop_words).sample(5).column("word")

---
## [Tutorial] Lists vs. Arrays: Appending Elements

To append a single element to a list, you can use the list method `append`. Run the next three cells below:

In [None]:
# just run this cell
my_lst = ["some", "elements", 2]
my_lst

In [None]:
# just run this cell
my_lst.append("in sequence")

In [None]:
# just run this cell
my_lst

Above, the list method `append` returns nothing and directly modifies the original list. This behavior is unlike `np.append`, which returns a new array. Run the cells below for comparison:

In [None]:
# just run this cell
arr = make_array("some", "elements")
np.append(arr, "in sequence")  # array first, then the value to append

In [None]:
# just run this cell
# np.append leaves arr unchanged
arr 

To "update" the original array, assign the `np.append` return value to the original array name:

In [None]:
# just run this cell
arr = np.append(arr, "in sequence")  # array first, then the value to append
arr

<!-- BEGIN QUESTION -->

---

## Question 2a

We can use the `stop_words` list to discern which words in a document are "meaningful" words, and which ones are stop words. Let us first try this approach on a small example. Consider the document below (all lowercase, no punctuation):

> the quick brown fox jumps over the lazy dog

In the cell below, write code that constructs `filtered_words`, a list of words in the provided sentence that are **not** stop words.

_Hints_:
* Use a `for` loop.
* `x not in seq` evaluates to `True` if `x` is not contained in the list `seq`; otherwise, it evaluates to `False`.
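
A quick sketch of `not in` on a made-up list (the name `small_words` is hypothetical and is not the real `stop_words` list):

```python
small_words = ["a", "an", "the"]

is_fox_missing = "fox" not in small_words
is_the_missing = "the" not in small_words

print(is_fox_missing)  # True: "fox" is not in the list
print(is_the_missing)  # False: "the" is in the list
```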

In [None]:
# do not edit these lines
text = "the quick brown fox jumps over the lazy dog"
words = text.split() 

# make edits below this line
filtered_words = []
...
        
print(filtered_words)

In [None]:
grader.check("q2a")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

---

## Question 2b

Use `stop_words` to filter out the rows in `word_freqs_table` that are stop words, and assign this table result to `word_freqs_nostop`. After your code runs, `word_freqs_nostop` should not have any stop words.

_Hint_: We suggest you use the `where` Table method, as opposed to looping over the table. How do you write a custom filter function?

In [None]:
...

word_freqs_nostop = ...
word_freqs_nostop

In [None]:
grader.check("q2b")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<hr style="border: 1px solid #fdb515;" />

# Question 3

Finally, use `word_freqs_nostop` to visualize the frequency counts of the **top twenty most common words** on the [Data Science Wikipedia article](https://en.wikipedia.org/wiki/Data_science#), excluding stop words. Choose an appropriate visualization strategy, then write code that produces your plot.

In [None]:
...

<!-- END QUESTION -->

**Reflection** (no response needed): Based on your diagram, what are the most important words in the Data Science Wikipedia article? Computing, statistics, and data are central—but there is also a key focus on specific domains, disciplines, and _inter_-disciplinary work. Just like this class!

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Part 2: Wikipedia and HTML

The NLP bag of words model we learned above is useful for analyzing semantic meaning of general articles. However, when we work with web data, we also can study the structure of the webpage itself to understand how the page categorizes and organizes its content.

In this part, we will use the BeautifulSoup package to see how key components of the [Python (programming language)](https://en.wikipedia.org/wiki/Python_(programming_language)) article are organized on Wikipedia:

> * What are the key aspects that describe a programming language?
> * What are built-in Python data types?
> * What are the keywords for Python statements?

Unlike before, where we just analyzed the text, we will now analyze the full HTML data from the above Wikipedia article.

Run the cell below to load the HTML of the Python Wikipedia article into the string `wiki_html`. (For those curious, this [API Sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Python_(programming_language)&prop=text&formatversion=2) was used to construct the URL in the request.)

In [None]:
# just run this cell

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':
           'Data 6 Student/0.0 (https://data6.org/; data6@berkeley.edu)'}

url = "https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Python_(programming_language)&prop=text&formatversion=2"
response = requests.get(url,
            headers=headers)
wiki_html = response.json()["parse"]["text"]

# long HTML response, just show the first 200 chars
wiki_html[:200]

Recall that HTML is difficult to parse by itself, so we can and should use existing packages like BeautifulSoup that can effectively parse HTML structure. 

Next, load the HTML string into a BeautifulSoup document `wiki_doc`:

In [None]:
# just run this cell
from bs4 import BeautifulSoup

wiki_doc = BeautifulSoup(wiki_html, "html.parser")
type(wiki_doc)

In [None]:
wiki_doc.find("h2")

Now that we have our BeautifulSoup document, let's get started!

<div class="alert alert-block alert-info">

_**Tip**_:

When completing this part, you will find it useful to keep open a separate browser window with the Wikipedia article and the HTML inspector.
* In a separate tab, navigate [to the Wikipedia article webpage](https://en.wikipedia.org/wiki/Python_(programming_language)).
* Right click (or ctrl-click), then select Inspect/Inspect Element. If you don't see this option, double-check the Data 6 lecture notes for how to adjust your browser settings.
</div>

<hr style="border: 1px solid #fdb515;" />

# Question 4: Headers

Let's tackle the question:

> What are the key aspects that describe a programming language?

## [Tutorial] BeautifulSoup

This tutorial serves as a crash course in BeautifulSoup and a reminder of the salient parts of the lab. However, you should complete the lab before proceeding.
                                                           
While Wikipedia is not the source of all truths, it is a crowdsourced information database. Popular articles like this one have been edited over 5000 times by hundreds or thousands of users! It is reasonable to assume that these users have **broadly agreed on how information on this page should be organized**.
For example, the users have most likely agreed on what information should be included as the headers of the Python article:

```
History
Design philosophy and features
...
External links
```

Inspect the HTML of the page and identify which HTML tag is associated with these headers.

<details>
  <summary>Click for Solution</summary>
    
The `<h2>` tag
    
</details>

Run the cell below which uses the BeautifulSoup method `find` to return the first header tag on the article page:

In [None]:
# just run this cell
first_h2_tag = wiki_doc.find("h2")
first_h2_tag

Run the cell below, which accesses the header string itself, defined as the information between the `<h2>` and `</h2>` delimiters:

In [None]:
# just run this cell
first_h2_tag.string

Note that `find` returns just the first tag. To find all header tags, we can use the `find_all` method:

In [None]:
wiki_doc.find_all("h2")

For better or worse, "External links" appears as a header twice on the article page. Verify this for yourself by going back to your HTML inspector. A Wikipedia contributor should really fix this bug/issue... :-)

<!-- BEGIN QUESTION -->


---

## Exercise

How do we extract the header strings from this list of tags?

In the cell below, use the `wiki_doc` BeautifulSoup document to get a list of all of the headers in the Wikipedia article. Assign this **list of strings** to `headers`.

_Hints_:
* We have provided a `for` loop structure for you. It is up to you to figure out what this loop should iterate over. See the tutorial above.
* In BeautifulSoup, you can access the contents of a tag `tag` with `tag.string`. This is not a function.
* After running your code, the first and last elements of `headers` should be `"History"` and `"External links"`, respectively. Based on our discussion above, `"External links"` should appear twice in your list.
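
One detail about `tag.string` that is easy to miss, sketched here on a made-up HTML snippet (not the Wikipedia page; all names are ours): it returns the text when a tag holds a single piece of text, or a single nested tag that does, and `None` when the tag mixes text with other tags.

```python
from bs4 import BeautifulSoup

toy_html = "<p>plain text</p><p><b>nested</b></p><p>mixed <b>content</b> here</p>"
toy_doc = BeautifulSoup(toy_html, "html.parser")

# Unpack the three <p> tags in document order.
plain_tag, nested_tag, mixed_tag = toy_doc.find_all("p")

print(plain_tag.string)   # plain text
print(nested_tag.string)  # nested
print(mixed_tag.string)   # None
```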

In [None]:
headers = []
for h2_tag in ...:
    ...

headers

In [None]:
grader.check("q4")

<!-- END QUESTION -->


<hr style="border: 1px solid #fdb515;" />

# Question 5: Tables and Python Types

Next let's tackle the question:

> What are built-in Python data types?

The built-in data types are listed in the only table on the Wikipedia article page.

In your HTML inspector, navigate to the singular table on the Wikipedia article page. It should be under the subheading (tag `<h3>`) "Typing."

---

## Question 5a

Before we process this table with BeautifulSoup, let's take a few moments to understand the structure of an HTML table. Answer the multiple-choice questions below. These all have hidden tests.

### Question 5a(i)

What is the `class` attribute of the `<table>` tag? Assign `table_class` to the appropriate choice number.

1. `caption`
1. `table`
1. `wikitable`

In [None]:
table_class = ...

In [None]:
grader.check("q5a_i")

HTML tables are organized in **row order**, meaning that the values of each row are specified before moving on to the next row. (Contrast this with the column order of our datascience `Table`, where we specify values of a column across all rows before moving on to the next column.)
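
To make the contrast concrete, here is a sketch in plain Python with made-up values (not scraped from the page; all names are ours). The same small table is written first row by row, then regrouped column by column.

```python
# Row order: each inner list is one complete row.
rows = [
    ["bool", "immutable"],
    ["list", "mutable"],
    ["str", "immutable"],
]

# Column order: zip(*rows) regroups the values so each list is one complete column.
type_column, mutability_column = [list(column) for column in zip(*rows)]

print(type_column)        # ['bool', 'list', 'str']
print(mutability_column)  # ['immutable', 'mutable', 'immutable']
```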

---

### Question 5a(ii)
Inspect the rows of the HTML table on the Wikipedia article page.

What HTML tag is used to denote a row? Assign `row_tag` to the appropriate choice number.

1. `<tbody>`
1. `<tr>`
1. `<th>`
1. `<td>`

In [None]:
row_tag = ...

In [None]:
grader.check("q5a_ii")


The **header row** specifies the column labels of the HTML table. There is one header row per HTML table; all other rows are, well, rows. Note that all rows, header or not, use the same row tag you found in the previous section.

Each column label is a **header row cell**. On our page, the header row is composed of four header row cells:

```
Type
Mutability
Description
Syntax examples
```

---

### Question 5a(iii)

Inspect these header **row cells** on the Wikipedia article page, and contrast them with other row elements from non-header rows.

Do the **header row cells** have the same **tag** as other row cells? Assign `header_row_same` to `True` or `False` appropriately.

In [None]:
header_row_same = ...

In [None]:
grader.check("q5a_iii")

Within each row, the first row cell is a Python data type. For example, see the HTML for the first non-header row:

```
<tr>
    <td><code>bool</code></td>
    <td>immutable</td>
    <td><a href="...">Boolean value</a></td>
    <td>...</td>
</tr>
```

The text "bool" is formatted as code on the Wikipedia page; the browser knows to display this string as code because of the `<code>` tag.

---

### Question 5a(iv)

Which of the below statements about the "bool" `<code>` tag ("the tag") are true? Assign `bool_relation` to a **list** of integers of correct statements from those below. If no statements are correct, assign `bool_relation` to an empty list.

1. The tag is a child of the first `<td>` tag.
1. The tag is a sibling to the second `<td>` tag (the one with "immutable" as a string).
1. The tag's parent is the `<tr>` row tag.
1. If the tag is a BeautifulSoup tag called `code_tag`, then we can access the string "bool" with `code_tag.string`.

This question does not have hidden tests.

In [None]:
bool_relation = ...

In [None]:
grader.check("q5a_iv")


---

## Question 5b

Now that we better understand HTML table structure, we can use BeautifulSoup to extract the rows of the HTML table.

Complete the code below to get the **first data row** from the HTML table. Note that the first data row is the first row **after** the header row (the header row always comes first). Your code should **return a BeautifulSoup object** (tag) associated with the first data row, i.e., an element from a `find_all` list. We have loaded in the HTML table for you as `table_tag`.

_Hints_:
* What HTML tag is used to denote a row? See your responses to the previous question.
* You can use the `find_all` method on a tag (say, `table_tag`) to find all matching tags under `table_tag`.
* The header row always comes first (but remember zero-indexing).

If your code works correctly, then running the cell below should print out the first row's HTML, similar to what we displayed to you in the previous question.

In [None]:
# do not change this line!
table_tag = wiki_doc.find("table", "wikitable")

# edit this line
first_data_row = ...

# do not edit this line
print(first_data_row.prettify())

In [None]:
grader.check("q5b")


---

## Question 5c

The HTML table we are analyzing is titled, "Summary of Python 3's built-in types." The first column of this table lists each Python data type, formatted to monospaced font with the `<code>` tag. We would like to extract the data types text in this first column.

In the cell below, implement the function `extract_data_type` that takes in a non-header row tag of the HTML table (`<tr>` tag) and returns the data type as a string, with no tags.

For example, passing in `first_data_row` from the previous part should return `"bool"`.

_Hint_: Consider your analysis in an earlier question. How is the `<code>` tag "descended" from the `<tr>` row tag? Is it a child? A child of a child? We've provided some suggestion of structure below. You can also directly find the `<code>` tag if you know what you're doing...!

In [None]:
def extract_data_type(row_tag):
    first_row_cell_tag = ...
    code_tag = ...
    ...

# do not edit below this line
print(extract_data_type(first_data_row))

In [None]:
grader.check("q5c")


---

## Question 5d

Finally, make a list `python_builtin_types` that uses the `extract_data_type` function you defined earlier to get all the Python built-in types listed in the HTML table. You may want to use a `for` loop.

If your code is implemented correctly, running the cell should print out the values in the first column of the HTML table.

In [None]:
python_builtin_types = []

for ... in ...:
    ...

# do not edit below this line
for dtype in python_builtin_types:
    print(dtype)

In [None]:
grader.check("q5d")

<br/><br/><br/>

Congratulations, now you're a **real** web scraping expert! Now is a good time to take a break.


<hr style="border: 1px solid #fdb515;" />

# Question 6: Subheading: "Statements and control flow"

Let's tackle the last question:

> What are the keywords for Python statements?

These keywords are located in a bulleted list under "Statements and control flow," which is a subheading under the heading "Syntax and semantics". In your separate webpage for the [Python (programming language) page](https://en.wikipedia.org/wiki/Python_(programming_language)), navigate to this area and inspect the HTML.

You will see the following HTML structure:

```
...
<div class="mw-heading mw-heading3"><h3 id="Statements_and_control_flow">Statements and control flow</h3>...</div>
<p>Python's ...</p>
<ul>
    <li>...</li>
    ...
    <li>...</li>
</ul>
<p>The assignment statement...</p>
<p>Python does not support...</p>
<div class="mw-heading mw-heading3"><h3 id="Expressions">Expressions</h3>...</div>
...
```

In this question, we would like to do the following:

a. **Get all HTML elements** pertaining to content for the "Statements and control flow" subheading.
                                                   
b. Within the bulleted list, **get all Python code text.** Code-formatted text describes Python statement keywords.

---

## Question 6a

We would like to:

> **Get all HTML elements** pertaining to content for the "Statements and control flow" subheading.

---

### Question 6a(i)

Consider the below screenshot of the Wikipedia article:

<img src='python_subheading_content.png' width='500px' alt="Python subheading screenshot">

Based on the HTML snippet above and your inspection of the HTML document, what are **all HTML elements associated with the content in the provided screenshot, including the subheading itself**?

Assign `subheader_content` to a list of appropriate numbers chosen from the below choices.

1. `<div class="mw-heading mw-heading3"><h3 id="Statements_and_control_flow">Statements and control flow</h3>...</div>`
1. `<p>Python's ...</p>`
1. `<ul><li>...</li>...<li>...</li></ul>`
1. `<p>The assignment statement...</p>`
1. `<p>Python does not support...</p>`
1. `<div class="mw-heading mw-heading3"><h3 id="Expressions">Expressions</h3>...</div>`

<!-- BEGIN QUESTION -->



In [None]:
subheader_content = ...

In [None]:
grader.check("q6a_i")

<!-- END QUESTION -->

Because of the structure of our document, note that all of the content that (visually, to us) looks to be "within" a subheading are actually siblings of the subheading itself, _not_ children.
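
The same relationship can be sketched on a made-up document (not the Wikipedia page; all names are ours). The `<div>` around the first heading contains only the heading; the paragraph and list that visually "belong" to it are the `<div>`'s next siblings.

```python
from bs4 import BeautifulSoup

toy_html = (
    "<section>"
    "<div><h3>First topic</h3></div>"
    "<p>Some text.</p>"
    "<ul><li>item</li></ul>"
    "<div><h3>Second topic</h3></div>"
    "</section>"
)
toy_doc = BeautifulSoup(toy_html, "html.parser")

toy_h3 = toy_doc.find("h3", string="First topic")
toy_div = toy_h3.parent  # the <div> wrapping the heading

# Tags nested inside the <div>: only the heading itself.
child_names = [tag.name for tag in toy_div.find_all(True)]
# Tags that come after the <div> at the same level.
sibling_names = [tag.name for tag in toy_div.find_next_siblings()]

print(child_names)    # ['h3']
print(sibling_names)  # ['p', 'ul', 'div']
```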

See the next two cells for an explanation. Run the first cell below to get the `<div>` tag that has our "Statements and control flow" subheading:

In [None]:
# just run this cell
statements_tag = wiki_doc.find("h3", string="Statements and control flow")
div_tag = statements_tag.find_parent()

Next, run the second cell below, which uses the `find_next_siblings` method to get all tags _after_ `div_tag`. We only print the first 100 characters of each of these tag's HTML for simplicity. But notice that very quickly, the content moves to the next subheading, "Expressions"!

In [None]:
# just run this cell
for tag in div_tag.find_next_siblings()[:4]:
    print(str(tag)[:100], "...")
    print()

<!-- BEGIN QUESTION -->

---

### Question 6a(ii)

In HTML, bulleted lists are denoted by the `<ul>` tag (meaning "unordered list"). Within this tag, each list item is denoted by `<li>`.

In the cell below, use the method `find_next_siblings` on `div_tag` to assign `list_tag` to the `<ul>` bulleted list tag. See the usage above.

If your code works, then running the below cell should display the HTML of each list item `<li>`, separated by newlines `'\n'`.
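
As a sketch of what `.contents` returns, here is a made-up list (not the Wikipedia one; all names are ours). The newline characters between the `<li>` tags show up as their own elements, and they are strings rather than tags.

```python
from bs4 import BeautifulSoup

toy_html = "<ul>\n<li>first</li>\n<li>second</li>\n</ul>"
toy_list = BeautifulSoup(toy_html, "html.parser").find("ul")

# Record the kind of each direct child of the <ul> tag.
child_kinds = [type(child).__name__ for child in toy_list.contents]

for child in toy_list.contents:
    print(type(child).__name__, repr(str(child)))
```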

In [None]:
list_tag = ...

# do not edit below this line
bulleted_list = list_tag.contents
bulleted_list

In [None]:
grader.check("q6a_ii")

<!-- END QUESTION -->

If you accidentally deleted the lines above that create `bulleted_list` and your tests no longer pass, here are the lines again. Copy these lines into your cell above.

```
list_tag = ...

# do not edit below this line
bulleted_list = list_tag.contents
bulleted_list
```

<!-- BEGIN QUESTION -->

---

## Question 6b

We would like to:

> **Get all HTML elements** pertaining to content for the "Statements and control flow" subheading.

In the cell below, get all Python code statements mentioned in the list `bulleted_list`. They are marked by `code` tags:

* The first one is `=`;
* The second is `if`;
* The third is `for` (not `elif`!); 
* and so on.

_Hints_:
* What is the `code_tag`? Try `print`-ing this out by uncommenting the line.
    * Notice that some elements of `bulleted_list` are simply newlines (`"\n"`), but `code_tag` is still assigned to *something* in these "invalid" cases.
* For "valid" cases, try using `code_tag.string`. Will that give you the desired string? (Answer: Yes)

In [None]:
py_statements = []
for tag in bulleted_list:
    code_tag = tag.find("code")
    #print(code_tag) # try uncommenting this line
    ...

# do not edit below this line
print("Python statements:")
print(py_statements)

In [None]:
grader.check("q6b")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

---

## Question 6c

Read the descriptions of these Python keywords in the [Statements and control flow section on Wikipedia](<https://en.wikipedia.org/wiki/Python_(programming_language)#Statements_and_control_flow>).
Based on these descriptions, what is one Python keyword that we haven't learned yet that you would be curious to learn more about?

Assign `unknown_keyword` to one of the keywords in `py_statements` corresponding to a keyword that we haven't learned yet. Pretty much all answers will earn credit.

In [None]:
unknown_keyword = ...

<!-- END QUESTION -->

<br/><br/><br/>

Double congratulations: you've now swum your way out of messy, messy HTML (despite the package moniker "beautiful soup"). It's impressive how far you've come in your string processing ways...!

After taking a break, come back and enjoy some music and image generation with a simpler database—the Genius Lyrics database. We're almost there!

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 3: The Genius API

Let's analyze **Kendrick Lamar's Top Ten Hits** using the Genius database. [Kendrick Lamar](https://en.wikipedia.org/wiki/Kendrick_Lamar) is widely regarded as one of the greatest rappers of all time. 

Our goal for this section:

> Can we generate song image art for Kendrick Lamar's greatest hits?


There are a few different ways to query the Genius API, all of which are discussed in the [Genius API documentation](https://docs.genius.com/). The one we cover here is **basic search**, which returns Genius data about any artist or song that you search for.

<div class="alert alert-block alert-warning">

_**Important Note**_:

Before continuing, **make sure you have a Genius API token saved in `api_key.py`**. To do this, follow the steps on the Data 6 Course Notes: [API Keys](https://data6.org/notes/18-html/genius.html).

We will also cover the above steps in lab. If you get stuck or need help, we recommend you first work on Parts 1 and 2, which are doable without Part 3. Only Part 3 needs the Genius API and an API key.
                 
</div>

After you have updated `api_key.py` (the file within this assignment directory), running the below line should load your API token into `client_access_token`.

In [None]:
# before running this cell, make sure that you have updated api_key.py with your API key
%reload_ext autoreload
%autoreload 2

import api_key
client_access_token = api_key.my_client_access_token


Run the below cell to construct the necessary URL (that will serve as our API call), request data from that URL, and parse the response data as a JSON. (Like with the Wikipedia parts, we've done this part for you, but if you're curious check out the [Genius API documentation](https://docs.genius.com/#/artists-h2)).
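
For the curious, a search URL is just a base address followed by `key=value` query parameters. The sketch below builds one with Python's standard library instead of an f-string. `PLACEHOLDER_TOKEN` is a made-up stand-in for a real token, and no request is actually sent.

```python
from urllib.parse import urlencode

# Query parameters for a basic search. The token here is a placeholder,
# not a real credential.
params = {"q": "Kendrick Lamar", "access_token": "PLACEHOLDER_TOKEN"}

# urlencode joins the pairs with "&" and makes the values URL-safe
# (for example, the space in the search term becomes "+").
example_url = "https://api.genius.com/search?" + urlencode(params)
print(example_url)
```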


In [None]:
# just run this cell
import requests

search_term = "Kendrick Lamar"
genius_search_url = f"https://api.genius.com/search?q={search_term}&access_token={client_access_token}"

response = requests.get(genius_search_url)
json_data = response.json()
json_data

If you get a dictionary with the key-value pair `('error', 'invalid_token')`, then you may not have set your API key correctly. Go back and double check you have correctly edited the `api_key.py` file.
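
One way to make this check explicit is sketched below. The helper name `has_token_error` and both sample dictionaries are made up for illustration; the only detail taken from the text above is the `'error'`/`'invalid_token'` pair.

```python
def has_token_error(data):
    """Return True if a parsed response reports an invalid token."""
    return data.get("error") == "invalid_token"

# Hypothetical stand-ins for parsed JSON responses.
bad_response = {"error": "invalid_token"}
ok_response = {"meta": {"status": 200}, "response": {"hits": []}}

print(has_token_error(bad_response))
print(has_token_error(ok_response))
```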

<hr style="border: 1px solid #fdb515;" />

# Question 7: Understand the Genius JSON

The Genius API's basic search function returns a JSON, which has data about an artist's **hits**, also known as top songs. There is a lot of information about each hit. This question's goal is to understand the structure of this JSON.


JSON data is much like a **Python dictionary**. We can access values by providing keys. Some of the values are themselves dictionaries, meaning that the `json_data` dictionary is a **nested dictionary**. As an example, here are the two entries in the `json_data` dictionary:

* `meta`: Information about the web request, which is itself a dictionary. An HTTP status of 200 means "OK", i.e., data was successfully returned and stored into the JSON response. We generally ignore this key.
* `response`: The data from the web request, which is itself a dictionary. **We mostly care about this `response`.**

As an example, here is the first “hit” for Kendrick Lamar from Genius.com, pertaining to his hit song "Not Like Us". Before running the cell, check to see that you understand the multiple layers of square brackets. Refer to the full `json_data` structure above.

```
json_data['response']['hits'][0]
              1          2    3
```


1. `json_data['response']`: Within the `json_data` dictionary, look up the key `"response"` and get a dictionary.
2. `json_data['response']['hits']`: Within this dictionary, look up the key `"hits"` and get a list.
3. `json_data['response']['hits'][0]`: Get the first (zero-th) element of the `hits` list, which is a dictionary.
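
The same layer-by-layer lookup works on any nested dictionary. Here is a minimal sketch on a made-up structure (the `library` dictionary, its keys, and its values are invented and unrelated to the Genius data):

```python
# A dictionary whose value is a dictionary, which holds a list of dictionaries.
library = {
    "info": {"open": True},
    "catalog": {
        "books": [
            {"title": "Dune", "pages": 412},
            {"title": "Emma", "pages": 474},
        ]
    },
}

step_1 = library["catalog"]                       # a dictionary
step_2 = library["catalog"]["books"]              # a list
step_3 = library["catalog"]["books"][1]           # the second element: a dictionary
step_4 = library["catalog"]["books"][1]["title"]  # a string

print(type(step_1), type(step_2), type(step_3))
print(step_4)
```

Reading the brackets left to right, each pair either looks up a key in a dictionary or picks an index from a list.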

In [None]:
# just run this cell
json_data['response']['hits'][0]

Each "hit" in the JSON data is a dictionary from which we can get the lyrics URL, the page views, and so on.

<!-- BEGIN QUESTION -->


---

## Question 7a: Webpage of a hit

Genius is a lyrics database. Using the Genius API, we can find the lyrics page URL (i.e., web link) of each song in the provided hits. In the cell below, assign `first_hit_url` to the **URL of the first hit** stored somewhere in `json_data`, which is the JSON returned with our web request for Kendrick Lamar hits. This value is `"https://genius.com/Kendrick-lamar-not-like-us-lyrics"`.

Use square bracket operations to get this value. Your answer should be in the form:

```
first_hit_url = json_data[KEY_1][KEY_2][IND][KEY_3][KEY_4]
```

where you provide strings or numbers for `KEY_1`, `KEY_2`, `IND`, `KEY_3`, and `KEY_4`.

In [None]:
first_hit_url = ...
first_hit_url

In [None]:
grader.check("q7a")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->


---

## Question 7b: Page Views of a Hit

Each hit keeps track of the number of views of its page (i.e., accesses to the hit's URL). In the cell below, assign `first_hit_views` to the **number of page views of the first hit** in `json_data`. This value is `16961277`.

Use square bracket operations to get this value. Your answer should be in the form:

```
first_hit_views = json_data[KEY_1][...]...
```

where you provide a series of square bracket operations, just as in the previous part.

In [None]:
first_hit_views = ...
first_hit_views

In [None]:
grader.check("q7b")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->


---

## Question 7c: Image Art of a Hit

Because Genius is a website, which is displayed visually to users on web browsers, it is interested in including many graphics and visuals on popular pages, such as the Genius pages for Kendrick's hit songs. Each hit has an image associated with the song; this is often the album art. As an example, see the square icon in [Kendrick's "HUMBLE." lyrics page](https://genius.com/Kendrick-lamar-humble-lyrics).

Our JSON data in `json_data` also stores this information. Instead of storing the image itself, the JSON has the image's **URL**, which can then be used to load the image itself. The associated key for image URLs is `'song_art_image_url'`.

In the cell below, assign `second_hit_image_url` to the image URL for "HUMBLE." which is the **second** hit listed in `json_data`. Like in previous parts, you should use square bracket operations to get this value.

In [None]:
second_hit_image_url = ...
second_hit_image_url

In [None]:
grader.check("q7c")

<!-- END QUESTION -->

If `second_hit_image_url` is assigned correctly, then running the below cell should display the same song image as displayed on the ["HUMBLE." lyrics page](https://genius.com/Kendrick-lamar-humble-lyrics):

In [None]:
# just run this cell
from IPython.display import Image, display
second_hit_image_url
Image(url=second_hit_image_url, width=200)

<hr style="border: 1px solid #fdb515;" />

# Question 8: Tabulate Song Hits

How do we display images for all song hits in a nicely organized structure? It turns out that we can also use Tables for this purpose—but we will need to resort to some HTML again!

In this question and the next, you will create a Table of hit songs, page views, and HTML for displaying images.

<!-- BEGIN QUESTION -->


---

## Question 8a: Song Titles

Assign `song_titles` to an **array** of strings: the song title of each hit in the web response `json_data`. The key for song titles is `'title_with_featured'`, which also lists featuring artists if relevant.

Hints:
* You may want to use a `for` loop over all of the hits in our web response to iteratively append to `song_titles`. In our provided skeleton, `song` is a JSON/dictionary; what keys does it have?
* See `np.append` in the [Data 6 Python Reference](https://data6.org/notes/reference) and the [Lists vs. Arrays: Appending](#[Tutorial]-Lists-vs.-Arrays:-Appending-Elements) tutorial above.
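
As a reminder of the pattern, here is a minimal sketch on made-up data. The `pets` list and its keys are invented, and `np.array([])` stands in for an empty array like the one in the skeleton below. The key point is that `np.append` returns a new array, so the result has to be assigned back to the name.

```python
import numpy as np

# Made-up list of dictionaries, unrelated to the Genius data.
pets = [{"name": "Luna", "age": 3}, {"name": "Milo", "age": 5}]

ages = np.array([])  # start with an empty array
for pet in pets:
    # np.append does not change `ages` in place; it returns a new array,
    # so we reassign the name to keep the result.
    ages = np.append(ages, pet["age"])

print(ages)
```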


In [None]:
song_titles = make_array()
for song in json_data['response']['hits']:
    ...

# do not edit below this line
for s in song_titles:
    print(s)

In [None]:
grader.check("q8a")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->


---

## Question 8b: Page Views

Assign `page_views` to an **array** of integers: the page view count of each hit in the web response `json_data`. Refer to your responses to previous questions to find the appropriate key.

In [None]:
page_views = make_array()
for song in json_data['response']['hits']:
    ...

# do not edit this line
page_views

In [None]:
grader.check("q8b")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->


---

## Question 8c: The `songs` table

Finally, assign `songs` to a two-column table using the two arrays you created above. The columns should be labeled `Title` and `Views`, in that order.

In [None]:
songs = Table().with_columns(
    "Title", song_titles,
    ...
songs

In [None]:
grader.check("q8c")

<!-- END QUESTION -->

<hr style="border: 1px solid #fdb515;" />

# Question 9: Album Images

In this last part, we will build the HTML needed to render our images inside tables.

## [Tutorial] HTML tag: `<img>`

The `<img>` HTML tag denotes images. It specifies what image to load, the image size, and so on.

Unlike text-based tags, images do not have string contents between open/closing tags (contrast with `<p>` paragraph tags, which enclose text with `<p>` and `</p>`). Instead, all information for an image tag is specified as attributes.

Below, the `image_html` string specifies an `<img>` tag with a `src` source URL attribute and a `width` image width attribute of 200 pixels.

In [None]:
# just run this cell
image_html = "<img src='https://static.decontextualize.com/kitten2.jpg' width='200px'>"
image_html

We can then render this HTML directly using the `HTML` function, part of the `IPython.display` library.

In [None]:
# just run this cell
HTML(image_html)

## [Tutorial] Python f-strings

There is one more variation of strings in Python. **f-strings** (short for formatted string literals) are strings that make it very convenient to include Python names and complex expressions in strings, without having to worry about how many spaces/newlines separate input.

> f-strings are strings prefixed with `f`. Inside the string, expressions can be written within curly braces. When f-strings are evaluated, they evaluate any expressions within them and then create the corresponding string. Read more in the official Python [f-string documentation](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals).

Run the cell below to see an example.

In [None]:
# just run this cell
animals = "foxes"
animal_count = 3
animal_fstr = f"There are {animal_count} {animals}."
animal_fstr

In [None]:
# just run this cell
type(animal_fstr)

Because f-strings also evaluate expressions, any valid expressions are possible within curly braces:

In [None]:
# just run this cell
count_arr = make_array(2, 3, 4)
count_fstr = f"sum({count_arr}) is {np.sum(count_arr)}"
count_fstr

We will use f-strings to construct our `<img>` tags.
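
For instance, an f-string can drop Python values into both the attribute and the contents of an HTML tag. The sketch below does this for a link (`<a>`) tag rather than an image; the helper name `make_link_html` and its arguments are made up for illustration.

```python
def make_link_html(url, label):
    """Build an <a> tag as a string, filling in the URL and the link text."""
    # Single quotes surround the attribute value inside the double-quoted f-string.
    return f"<a href='{url}'>{label}</a>"

link_html = make_link_html("https://data6.org", "Data 6")
print(link_html)
```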

<!-- BEGIN QUESTION -->


---

## Question 9a

Complete the `get_image_html` function below, which takes one argument (an `image_link` string) and returns an `<img>` tag with:

* `src` attribute as the `image_link`, and
* `width` attribute as **200** pixels.

The example provided should give you the same HTML as we shared in the `<img>` tutorial.

In [None]:
def get_image_html(image_link):
    image_html = ...
    return image_html

# do not edit below this line
image_url = "https://static.decontextualize.com/kitten2.jpg"
get_image_html(image_url)

In [None]:
grader.check("q9a")

<!-- END QUESTION -->

With a correct implementation of `get_image_html`, running the below cell should produce the Data 6 logo:

In [None]:
# just run this cell
data6_logo_url = "https://data6.org/assets/data6.png"
print("HTML string:", get_image_html(data6_logo_url))
HTML(get_image_html(data6_logo_url))

<!-- BEGIN QUESTION -->


---

## Question 9b

Assign `album_html` to an **array** of strings: an `<img>` tag for each hit's image art. Some details:

* Refer to your responses to previous questions to find the appropriate key for a song's image URL.
* The elements in `album_html` should be strings of HTML—specifically, the `<img>` tag that would display the song's image.
* The order of elements in `album_html` should correspond to the order of hits in the `songs` table you created earlier.
* Each `<img>` tag should have `width` attribute of 200 pixels. This means you can and should use the `get_image_html` function you defined earlier.

We have left this last coding question relatively open-ended for you...!

In [None]:
...

album_html

In [None]:
grader.check("q9b")

<!-- END QUESTION -->

## [Tutorial] Putting it all together

If all has gone well, running the below cell will create a table with your image art in it. Congratulations!!!
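
The cell below passes `escape=False` when converting the table to HTML. By default, that conversion *escapes* characters such as `<` and `>` so that they are shown as literal text instead of being interpreted as tags. Here is a minimal sketch of what escaping does, using Python's built-in `html` module on a made-up `<img>` string (an illustration of the idea, not the code that runs inside the table conversion):

```python
import html

# Made-up image tag, used only for this illustration.
tag = "<img src='cat.png' width='200px'>"

# Escaping replaces the angle brackets with HTML entities, so a browser
# would show the tag as plain text rather than load an image.
escaped = html.escape(tag, quote=False)
print(escaped)
```

With escaping turned off, the `<img>` strings in the table are left as real tags, so the images are rendered.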

In [None]:
# just run this cell
song_albums = songs.with_columns("Album Cover", album_html)

HTML(song_albums.to_df().to_html(escape=False))

<!-- BEGIN QUESTION -->

<hr style="border: 1px solid #fdb515;" />

You may wonder why we end our project here. After all, Genius is a lyrics database; why aren't we analyzing the content of lyrics by scraping from the Genius.com website?

It turns out that scraping lyrics from Genius.com violates its [Terms of Use](https://genius.com/static/terms):

> Except as expressly authorized by Genius in writing, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Service or the Genius Content, in whole or in part, including for any purposes related to AI machine learning/use on AI platforms, except that the foregoing does not apply to your own User Content (as defined above) that you legally upload to the Service. In connection with your use of the Service you shall not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods, including for any purposes related to AI machine learning/use on AI platforms. 

In 2022, Genius had a large legal battle with Google over song lyrics and copyright, claiming that "Google was using its transcribed lyrics without permission in search results" ([The Verge](https://www.theverge.com/2022/3/11/22973282/google-wins-court-battle-genius-song-lyrics-copyright)). Ultimately, Genius lost the case because the company was not the rights holder to songs—in fact, artists and music labels are often rights holders, though that is another legal battle.

The Terms of Use have most recently been updated to counter AI and machine learning usage. As we will learn, AI models use massive amounts of data, often through scraping the web for databases. By prohibiting AI models from scraping data, Genius can also protect music artists from misuse, reuse, and remixing of their music without their permission.

In this last question, run the below cells to see what happens when we nonetheless try to scrape Kendrick Lamar's "Not Like Us" webpage for lyrics.

Are the lyrics of Kendrick's hit song actually provided to us? If not, what is provided instead?

In [None]:
response = requests.get("https://genius.com/Kendrick-lamar-not-like-us-lyrics")
lyrics_html = response.text
lyrics_html

In [None]:
# just run this cell
from bs4 import BeautifulSoup

lyrics_doc = BeautifulSoup(lyrics_html, "html.parser")
type(lyrics_doc)

---

# Question 10: No more web crawlers

In the cell(s) below, describe what response was returned when you tried to scrape lyrics from Genius's website. You can use the code cell to try looking through the BeautifulSoup object `lyrics_doc` above, or you can just scroll through the `lyrics_html` until you find some artifacts worth noting.

Share your findings as a response no longer than 3-5 sentences.

_Type your answer here, replacing this text._

In [None]:
# TODO insert code here if you'd like

# Done!!!

We hope you have learned a lot from this project! Web scraping and crawling are a huge part of the internet today, and now you have learned that they amount to not much more than strings and HTML soup that, sometimes, is not so beautiful. Congratulations!!!

## Pets of Data 6

Luna says congratulations on completing Project 2! You should be proud :-)

<img src="luna.jpeg" width="50%" alt="Cat and cat plushie"/>

## Submission

Below, you will see two cells. Running the first cell will automatically generate a PDF of all questions that need to be manually graded, and running the second cell will automatically generate a zip with your autograded answers. You are responsible for submitting both the coding portion (the zip) and the written portion (the PDF) to their respective Gradescope portals. **Please save before exporting!**

> **Important: You must correctly assign the pages of your PDF after you submit to the correct gradescope assignment. If your pages are not correctly assigned and/or not in the correct PDF format by the deadline, we reserve the right to award no points for your written work.**

If there are issues with automatically generating the PDF in the first cell, you can try downloading the notebook as a PDF by clicking on `File -> Save and Export Notebook As... -> PDF`. If that doesn't work either, you can manually take screenshots of your answers to the manually graded questions and submit those. Either way, **you are responsible for ensuring your submission follows our requirements; we will NOT be granting regrade requests for submissions that don't follow instructions.**

In [None]:
from otter.export import export_notebook
from os import path
from IPython.display import display, HTML
name = 'project2'
export_notebook(f"{name}.ipynb", filtering=True, pagebreaks=True, exporter_type = "html")
if(path.exists(f'{name}.pdf')):
    display(HTML(f"Download your PDF <a href='{name}.pdf' download>here</a>."))
else:
    print("\n PDF generation failed, please try the other methods described above")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)