# Lab 4 – APIs and Web Scraping


## Instructions
In this homework, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `hw.py` file that is imported into the current notebook.

**Do not change the function names in the `hw.py` file!**
- The functions in the `hw.py` file are how your assignment is graded, and they are graded by their name.


**Tips for working in the notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `hw.py` file. You can write code here, but make sure that all of your real work is in the `hw.py` file.

**Tips for developing in the `hw.py` file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the homework! 
- Always document your code!

### Importing code from `hw.py`

* We import our `hw.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `hw.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `hw.py` in the notebook.
    - `autoreload` is necessary because Python caches imported modules. After the first `import`, later imports of `hw` in the same kernel reuse the module that is already loaded (stored in `sys.modules`) instead of re-reading `hw.py`, so edits to the file would otherwise go unnoticed.
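
The caching behavior described above can be reproduced outside a notebook. The following is a minimal sketch, not a description of how `autoreload` is implemented: it writes a throwaway module (the name `demo_hw` is hypothetical and stands in for `hw.py`), edits it on disk, and compares a plain re-import with `importlib.reload`.

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    module_path = pathlib.Path(tmp) / "demo_hw.py"  # hypothetical stand-in for hw.py
    module_path.write_text("def answer():\n    return 'original'\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_hw
        first = demo_hw.answer()

        # Edit the file on disk, as you would while developing hw.py.
        module_path.write_text("def answer():\n    return 'edited'\n")

        import demo_hw  # already loaded, so the file is not read again
        after_reimport = demo_hw.answer()

        importlib.reload(demo_hw)  # re-executes the module's source
        after_reload = demo_hw.answer()
    finally:
        # Clean up so the demo leaves no trace in the running interpreter.
        sys.path.remove(tmp)
        sys.modules.pop("demo_hw", None)

print(first, after_reimport, after_reload)  # original original edited
```

The `%autoreload 2` magic saves you from calling `importlib.reload` by hand each time you edit `hw.py`.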

In [6]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
from hw import *

In [8]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import time
import requests
import bs4
import lxml

## Question 1 – Practice with HTML Tags 📎

In this question, you won't write any Python code! Instead, you'll create a very basic `.html` file, named `hw04_1.html`, that satisfies the following conditions:

- It must have `<title>` and `<head>` tags.
- It must also have `<body>` tags. Within the `<body>` tags, it must have:
    - At least two headers.
    - At least three images.
        - At least one image must be a local file.
        - At least one image must be linked to an online source.
        - At least one image must have default text for when it cannot be displayed.
    - At least three references (hyperlinks) to different web pages.
    - At least one table with two rows and two columns.
    

Make sure to save your file as `hw04_1.html`, and save it in the same directory as `hw.py`. 
   

***Notes:*** 
- You can write and view basic HTML with a Jupyter Notebook, using either a Markdown cell or by using the `IPython.display.HTML` function (which takes in a string of HTML and renders it).
- If you write your HTML code within a Jupyter Notebook, you should later copy your code into a text editor and save it with the `.html` extension. You could also write your HTML in a text editor directly.
- Be sure to open your final `.html` file in a browser and make sure it looks correct on its own.
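
If you would like a rough, automated sanity check in addition to viewing the file in a browser, one option is to count tags with Python's built-in `html.parser`. This is only a sketch: `TagCounter` is a hypothetical helper, and the `sample` string below is deliberately incomplete (it does not satisfy all of the conditions above). For your own file, you would feed in the contents of `hw04_1.html` instead.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags and record <img> attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.images = []

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        if tag == "img":
            self.images.append(dict(attrs))


# A deliberately small sample; it is NOT a full answer to this question.
sample = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Heading</h1>
    <img src="cat.png" alt="A cat">
    <a href="https://example.com">Example</a>
    <table><tr><td>1</td><td>2</td></tr></table>
  </body>
</html>
"""

counter = TagCounter()
counter.feed(sample)
print(counter.counts["img"], counter.counts["a"], counter.counts["td"])  # 1 1 2
print([img.get("alt") for img in counter.images])  # ['A cat']
```

The `alt` attribute is what supplies the text shown when an image cannot be displayed.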

## Question 2 – Scraping an Online Bookstore 📚

Browse through the following fake online bookstore: http://books.toscrape.com/. This website is meant for toying with scraping.

Your job is to scrape the website, collecting data on all books that have:
- **_at least_ a four-star rating**, and
- **a price _strictly_ less than £50**, and 
- **belong to specific categories** (more details below). 

You will extract the information into a DataFrame that looks like the one below.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>UPC</th>
      <th>Product Type</th>
      <th>Price (excl. tax)</th>
      <th>Price (incl. tax)</th>
      <th>Tax</th>
      <th>Availability</th>
      <th>Number of reviews</th>
      <th>Category</th>
      <th>Rating</th>
      <th>Description</th>
      <th>Title</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>e10e1e165dc8be4a</td>
      <td>Books</td>
      <td>Â£22.60</td>
      <td>Â£22.60</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Default</td>
      <td>Four</td>
      <td>For readers of Laura Hillenbrand's Seabiscuit...</td>
      <td>The Boys in the Boat: Nine Americans...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>c2e46a2ee3b4a322</td>
      <td>Books</td>
      <td>Â£25.27</td>
      <td>Â£25.27</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Romance</td>
      <td>Five</td>
      <td>A Michelin two-star chef at twenty-eight, Violette...</td>
      <td>Chase Me (Paris Nights #2)</td>
    </tr>
    <tr>
      <th>2</th>
      <td>00bfed9e18bb36f3</td>
      <td>Books</td>
      <td>Â£34.53</td>
      <td>Â£34.53</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Romance</td>
      <td>Five</td>
      <td>No matter how busy he keeps himself...</td>
      <td>Black Dust</td>
    </tr>
  </tbody>
</table>

To do so, implement the following functions.

#### `extract_book_links`

Create a function `extract_book_links` that takes in the content of a page that contains book listings as a **string of HTML**, and returns a **list** of URLs of book-specific pages for all books with **_at least_ a four-star rating and a price _strictly_ less than £50**. 
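
The two filtering conditions can be tested separately from the HTML parsing. The sketch below assumes that a rating is available as a word (`'One'` through `'Five'`, matching the `Rating` column in the example DataFrame) and that a price is available as text such as `'£22.60'`. The names `RATING_VALUES`, `parse_price`, and `keep_book` are hypothetical and not part of the starter code.

```python
import re

# Rating words, assumed to match the "Rating" column of the example DataFrame.
RATING_VALUES = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(price_text):
    """Pull the numeric part out of strings such as '£22.60' or 'Â£22.60'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())


def keep_book(rating_word, price_text):
    """True for at least four stars and a price strictly below 50."""
    return RATING_VALUES[rating_word] >= 4 and parse_price(price_text) < 50


print(keep_book("Four", "Â£22.60"))  # True
print(keep_book("Five", "£50.00"))   # False: the price must be strictly below 50
```

Extracting the number with a regular expression sidesteps the encoding debris that can appear in front of the pound sign.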

<br>

#### `get_product_info`

Create a function `get_product_info` that takes in the content of a book-specific page as a **string of HTML**, and a list `categories` of book categories. If the input book is in the list of `categories`, `get_product_info` should return a dictionary corresponding to a row in the example DataFrame shown above (where the keys are the column names and the values are the row values). If the input book is not in the list of `categories`, return `None`.

<br>

#### `scrape_books`

Finally, put everything together. Create a function `scrape_books` that takes in an integer `k` and a list `categories` of book categories. `scrape_books` should use `requests` to scrape the first `k` pages of the bookstore and return a DataFrame of only the books that have 
- **_at least_ a four-star rating**, and
- **a price _strictly_ less than £50**, and
- **a category that is in the list `categories`**.

<br>

***Notes:***
- The first page of the bookstore is at http://books.toscrape.com. Subsequent pages can be found by clicking the "Next" button at the bottom of the page.
- When instantiating `bs4.BeautifulSoup` objects, use the optional argument `features='lxml'` to suppress any warnings.
- Only `scrape_books` needs to make a `GET` request.
- Don't worry about typecasting, i.e. it's fine if `'Number of reviews'` is not stored as type `int`. Also, don't worry if you run into encoding errors in your price columns (as the example DataFrame at the top of this cell contains).
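
Some of the bookkeeping in `scrape_books` can be sketched without making any requests. The listing-page URL pattern below is an assumption based on where the "Next" button appears to lead; confirm it in your browser before relying on it. The helper names (`page_url`, `absolute_link`, `collect_rows`) are hypothetical. The relative link used in the example comes from the `extract_book_links` output shown in this notebook.

```python
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com"


def page_url(page_number):
    """URL of the n-th listing page (ASSUMED pattern; verify in a browser)."""
    return f"{BASE_URL}/catalogue/page-{page_number}.html"


def absolute_link(listing_url, relative_href):
    """Resolve a relative book link against the page it was found on."""
    return urljoin(listing_url, relative_href)


def collect_rows(product_infos):
    """Drop the None entries that mark books outside the requested categories."""
    return [info for info in product_infos if info is not None]


print(page_url(2))
print(absolute_link(page_url(2), "scarlet-the-lunar-chronicles-2_218/index.html"))
print(collect_rows([{"Title": "A"}, None, {"Title": "B"}]))
```

A list of dictionaries like the one returned by `collect_rows` can be passed directly to `pd.DataFrame`.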

In [9]:
requests.get('http://books.toscrape.com/')

<Response [200]>

In [10]:
extract_book_links_fp = os.path.join('data', 'products.html')
extract_book_out = extract_book_links(
    open(extract_book_links_fp, encoding='utf-8').read()
)
extract_book_out





['seven-brief-lessons-on-physics_219/index.html',
 'scarlet-the-lunar-chronicles-2_218/index.html',
 'saga-volume-3-saga-collected-editions-3_216/index.html',
 'running-with-scissors_215/index.html',
 'rise-of-the-rocket-girls-the-women-who-propelled-us-from-missiles-to-the-moon-to-mars_213/index.html',
 'ready-player-one_209/index.html']

In [11]:
categories = ['Default']
get_product_info_fp = os.path.join('data', 'Frankenstein.html')
text = open(get_product_info_fp, encoding='utf-8').read()
soup = bs4.BeautifulSoup(text, features='lxml')
breadcrumb = soup.find('ul', class_='breadcrumb')
category = breadcrumb.find_all('a')
category

[<a href="../../index.html">Home</a>,
 <a href="../category/books_1/index.html">Books</a>,
 <a href="../category/books/default_15/index.html">Default</a>]

In [12]:
# don't delete this cell, but do run it 

# doctest for extract_book_links 
extract_book_links_fp = os.path.join('data', 'products.html')
extract_book_out = extract_book_links(
    open(extract_book_links_fp, encoding='utf-8').read()
)
extract_book_url = 'scarlet-the-lunar-chronicles-2_218/index.html'

# doc tests for get product info
get_product_info_fp = os.path.join('data', 'Frankenstein.html')
get_product_info_out = get_product_info(
    open(get_product_info_fp, encoding='utf-8').read(), ['Default']
)

# doc test for scrape books 
scrape_books_out = scrape_books(1, ['Mystery'])

print(get_product_info_out)

print(scrape_books_out)





{'Title': 'Frankenstein', 'Category': 'Default', 'Rating': 'Two', 'Description': "Mary Shelley began writing Frankenstein when she was only eighteen. At once a Gothic thriller, a passionate romance, and a cautionary tale about the dangers of science, Frankenstein tells the story of committed science student Victor Frankenstein. Obsessed with discovering the cause of generation and life and bestowing animation upon lifeless matter, Frankenstein assembles Mary Shelley began writing Frankenstein when she was only eighteen. At once a Gothic thriller, a passionate romance, and a cautionary tale about the dangers of science, Frankenstein tells the story of committed science student Victor Frankenstein. Obsessed with discovering the cause of generation and life and bestowing animation upon lifeless matter, Frankenstein assembles a human being from stolen body parts but; upon bringing it to life, he recoils in horror at the creature's hideousness. Tormented by isolation and loneliness, the onc

## Question 3 – API Requests 🤑

Let's calculate statistics of your favorite stocks by pulling data from a public API. The API we will work with can be found at https://financialmodelingprep.com/developer/docs/#Stock-Historical-Price. Specifically, we will use the "**Stock Historical Price**" endpoint (search for it at the linked page).

Some relevant definitions:
- Ticker: A short code that refers to a stock. For example, Apple's ticker is AAPL and Ford's ticker is F. 
- Open: The price of a stock at the beginning of a trading day.
- Close: The price of a stock at the end of a trading day.
- Volume: The total number of shares traded in a day.
- Percent change: The difference in price with respect to the original price, as a percentage.

To make requests to the aforementioned API, you will need an API key. In order to get one, you will need to make an account at the website. Once you've signed up, you can use the API key that comes with the free plan. It has a limit of 250 requests per day, which should be more than enough. You will have to encode your API key in the URL that you make requests to; see a complete example of such a request at the right side of the [documentation](https://site.financialmodelingprep.com/developer/docs#Stock-Historical-Price).

Implement the following two functions.

#### `stock_history`

Create a function `stock_history` which takes in a string `ticker` and two integers, `year` and `month`, and returns a DataFrame containing the price history for that stock in that month. Keep all of the attributes that are returned by the API.

***Notes:***
- Read the API documentation if you get stuck!
- [`pd.date_range`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) takes in two dates and returns a sequence of all dates between them, including both endpoints by default. How might this be helpful?
- The [`requests.get`](https://docs.python-requests.org/en/master/user/quickstart/) function returns a Response object, not the data itself. Use the `json` method on the Response object to extract the relevant JSON, as we did in Lecture 15 (you don't need to `import json` to do this).
- You can instantiate a DataFrame using a sequence of dictionaries.
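
One way to find the first and last day of a month, sketched here with the standard library as an alternative to `pd.date_range`, is shown below. The helper name `month_bounds` is hypothetical, and whether the API expects dates in exactly this format is something to confirm in its documentation.

```python
import calendar
import datetime


def month_bounds(year, month):
    """Return the first and last calendar day of a month as ISO date strings."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    start = datetime.date(year, month, 1)
    end = datetime.date(year, month, last_day)
    return start.isoformat(), end.isoformat()


print(*month_bounds(2019, 6))  # 2019-06-01 2019-06-30
```

Because `calendar.monthrange` accounts for leap years, February is handled without special cases.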

<br>

#### `stock_stats`

Create a function `stock_stats` that takes in a DataFrame outputted by `stock_history` and returns a **tuple** of two numbers:
1. The percent change of the stock throughout the month as a **percentage**.
2. An estimate of the total transaction volume **in billions of dollars** for that month.

Both values in the tuple should be **strings** that contain numbers rounded to two decimal places. Add a plus or minus sign in front of the percent change, and make sure that the total transaction volume string ends in a `'B'`.

**To compute the percent change**, use the opening price on the first day of the month as the starting price and the closing price on the last day of the month as the ending price.

**To compute the total transaction volume**, assume that on any given day, the average price of a share is the midpoint of the high and low price for that day.

$$ \text{Estimated Total Transaction Volume (in dollars)} = \text{Volume (number of shares traded)} \times \text{Average Price} $$

For example, suppose there are only three days in March: March 1st, March 2nd, and March 3rd.

If BYND (Beyond Meat) opens at \\$4 on March 1st and closes at \\$5 on March 3rd, its percent change for the month of March is $$\frac{\$5-\$4}{\$4} = +25.00\%$$

Suppose the high and low prices and volumes of BYND on each day are given below.
- March 1st: high \\$5, low \\$3, volume 500 million (0.5 billion)
- March 2nd: high \\$5.5, low \\$2.5, volume 1 billion
- March 3rd: high \\$5.25, low \\$4, volume 500 million (0.5 billion)

Then, the estimated total transaction volume is
$$\frac{\$5 + \$3}{2} \cdot 0.5 B + \frac{\$5.5 + \$2.5}{2} \cdot 1 B + \frac{\$5.25 + \$4}{2} \cdot 0.5 B = 8.3125B$$
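
The BYND example above can be checked with a few lines of plain Python. This is a sketch of the arithmetic and string formatting only; it uses hard-coded numbers from the example rather than a DataFrame, and the variable names are illustrative.

```python
# Opening price on March 1st and closing price on March 3rd, from the example.
first_open, last_close = 4.0, 5.0

# (high, low, shares traded) for March 1st, 2nd, and 3rd.
days = [
    (5.0, 3.0, 500_000_000),
    (5.5, 2.5, 1_000_000_000),
    (5.25, 4.0, 500_000_000),
]

percent_change = (last_close - first_open) / first_open * 100

# Daily dollar volume = midpoint of high and low, times shares traded.
total_dollars = sum((high + low) / 2 * volume for high, low, volume in days)

percent_str = f"{percent_change:+.2f}%"     # "+" forces an explicit sign
volume_str = f"{total_dollars / 1e9:.2f}B"  # convert dollars to billions

print(percent_str, volume_str)  # +25.00% 8.31B
```

The `+` in the format specification prints a sign for both positive and negative values, so no separate `if` is needed.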

In [13]:
history = stock_history('BYND', 2019, 6)

2019-06-01 2019-06-30


In [19]:
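# The API returns rows newest-first (see the printed DataFrame below),
# so the last trading day is row 0 and the first trading day is row -1.
close_price = history['close'].iloc[0]
open_price = history['open'].iloc[-1]
percent_change = (close_price - open_price) / open_price * 100
percent_change_str = f'{percent_change:.2f}%'
if percent_change > 0:
    percent_change_str = '+' + percent_change_str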
percent_change_str

'+54.29%'

In [26]:
average_price = 0.5 * (history['high'] + history['low'])
total_volume = sum(history['volume'] * average_price) / 1e9
total_volume_str = f'{total_volume:.2f}B'
total_volume_str

'33.64B'

In [28]:
# don't delete this cell, but do run it 

# doctest for stock_history
history = stock_history('BYND', 2019, 6)
print(history)
# doctest for stock_stats
stats = stock_stats(history)


print(len(stats[0]), len(stats[1]))

print(float(stats[0][1:-1]) > 30)

print(float(stats[1][:-1]) > 1)

          date    open    high       low   close    adjClose    volume  \
0   2019-06-28  165.30  168.80  159.5500  160.68  160.679993   7315297   
1   2019-06-27  157.31  164.79  155.4500  162.91  162.910004   5719421   
2   2019-06-26  160.10  162.25  153.0200  160.48  160.479996   6378629   
3   2019-06-25  138.50  150.69  138.3425  150.60  150.600006   6682929   
4   2019-06-24  151.88  152.70  138.0000  140.99  140.990005   6538497   
5   2019-06-21  153.54  161.79  150.0000  154.13  154.130005   7474586   
6   2019-06-20  173.00  174.00  163.3000  165.17  165.169998   6660492   
7   2019-06-19  171.37  174.45  162.2500  169.28  169.279999   9451961   
8   2019-06-18  200.00  201.88  160.7000  169.89  169.889999  23966910   
9   2019-06-17  163.18  171.19  160.6111  169.96  169.960007  14626683   
10  2019-06-14  142.01  157.90  141.8000  151.48  151.479996  14964553   
11  2019-06-13  141.52  146.45  134.2500  141.39  141.389999   9474562   
12  2019-06-12  133.99  150.45  131.56

## Congratulations! You're done! 

Submit the following four files to Canvas:
- `hw.py`
- `hw04_1.html`
- `hw04.ipynb`
- The local image you embedded in `hw04_1.html`
