In [3]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# DSC 80: Lab 06

### Due Date: Monday November 8th, 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab*.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
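    - `autoreload` is necessary because Python caches a module in `sys.modules` the first time it is imported. Subsequent imports of `lab` reuse the cached module instead of re-reading `lab.py`. (The compiled bytecode in the directory `__pycache__` only speeds up loading; it is regenerated when the source changes.)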
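
The caching behavior described above can be seen without a notebook. The following is a minimal sketch, assuming a throwaway module named `demo_lab_module` (a hypothetical name, written to a temporary directory): a second `import` keeps serving the old code, while `importlib.reload` picks up the edit. This is roughly what `autoreload` automates before each cell runs.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo independent of __pycache__

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_lab_module.py")  # hypothetical module name
    with open(path, "w") as f:
        f.write("def answer():\n    return 1\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    import demo_lab_module
    first = demo_lab_module.answer()

    # Edit the file on disk, as you would edit lab.py in an editor
    with open(path, "w") as f:
        f.write("def answer():\n    return 2\n")

    import demo_lab_module  # no effect: the module is served from sys.modules
    stale = demo_lab_module.answer()

    importlib.reload(demo_lab_module)  # re-executes the source file
    fresh = demo_lab_module.answer()

    sys.path.remove(tmp)

print(first, stale, fresh)
```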

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
from lab import *

In [6]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import time
import requests
import bs4

# Basic HTML tags practice

**Question 1**

Create a very basic `html` file that satisfies the following properties:

1. Has `<head>` and `<body>` tags.
2. Has a title
3. Inside the body tags:
    * At least two headers
    * At least three images:
        * At least one image must be a local file;
        * At least one image must be linked to online source; 
        * At least one image has to have default text when it cannot be displayed.
    * At least three references (hyperlinks) to different web pages;
    * At least one table with two columns.
    
        
   
4. Save your work as `lab06_1.html` in the same directory as `lab.py`, make sure it loads in the browser and do not forget to submit it.
5. **Do not forget to submit all data files needed to display your page.**

**Note:** You can toy with (basic) HTML in the cells of a notebook, using either a "markdown cell" or by using the `IPython.display.HTML` function. However, be sure to open your saved file in a browser to be sure the page displays properly!

**Note:** If you work within Jupyter Notebook, you can later copy your text into a text editor and save it with the .html extension.
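
Before opening the page in a browser, it can help to count the tags your file actually contains. The sketch below is one possible self-check, not part of the grader: `TagCounter` and `summarize_html` are hypothetical helper names, and the sample HTML is deliberately incomplete rather than a solution.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags and note how many <img> tags carry alt text."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.images_with_alt = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "img" and dict(attrs).get("alt"):
            self.images_with_alt += 1


def summarize_html(html_text):
    """Return (tag counts, number of images with alt text) for an HTML string."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.tags, counter.images_with_alt


# A deliberately incomplete page, used only to exercise the helper
sample = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Heading</h1>
    <img src="cat.png" alt="A cat">
    <a href="https://example.com">A link</a>
  </body>
</html>
"""
tags, with_alt = summarize_html(sample)
print(tags["h1"], tags["img"], tags["a"], tags["table"], with_alt)
```

To check your own file, pass `open('lab06_1.html', encoding='utf-8').read()` to `summarize_html` and compare the counts with the requirements above.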

In [None]:
question1()

In [None]:
grader.check("q1")

# Scraping an Online Bookstore


**Question 2**

Browse through the following fake on-line bookstore: http://books.toscrape.com/. This website is meant for toying with scraping.

Scrape the website, collecting data on all books that have **at least a four-star rating**, cost **under £50**, and belong to the book categories you want. You should collect the data in a dataframe as below (if you get an encoding error in your price columns, like the one you see in the table below, don't worry about it):
<img src="data/bookdata.png">


Do this using the following steps:
1. Create a function `extract_book_links` that takes in the content of a book-listing page (a string of html) and returns a list of urls of the book-detail pages for books with *at least* a four-star rating and a price *under* £50.

2. Create a function `get_product_info` that takes in the content of a book-detail page (a string of html) and a variable `categories`, a list of the book categories you want. If the book belongs to one of those categories, return a dictionary corresponding to a row in the dataframe in the image above (where the keys are the column names and the values are the row values); otherwise, skip the book (i.e., return `None`).

3. Create a function `scrape_books` that takes two arguments: `k`, the number of pages of the bookstore to scrape (starting at the url above and following the 'next' button), and `categories`, a list of the book categories you want. It returns a dataframe of books as in the picture above. (Note: make sure the books returned satisfy the rating and price requirements from part 1.)
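
The links on a listing page are relative, so they need to be resolved against the page they came from before they can be requested. A minimal sketch using the standard library's `urljoin` follows; the value of `next_href` is an assumption about what the 'next' button's `href` looks like, while `book_href` is one of the relative links a listing page yields.

```python
from urllib.parse import urljoin

listing_url = "http://books.toscrape.com/catalogue/page-1.html"
next_href = "page-2.html"  # assumed href of the 'next' button
book_href = "seven-brief-lessons-on-physics_219/index.html"

# urljoin replaces the last path segment of the base with the relative link
next_url = urljoin(listing_url, next_href)
book_url = urljoin(listing_url, book_href)

print(next_url)
print(book_url)
```

Resolving against the current page, rather than concatenating strings onto a fixed prefix, keeps working if a link is given relative to a different directory.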


*Note:* Your function should take under 180 seconds to run through the entire bookstore.

*Note:* Don't worry about type casting (i.e., changing the number of reviews to an int).
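
Even without casting the stored columns, the price filter needs a number to compare against 50. The sketch below shows where the `Â£` encoding artifact comes from (UTF-8 bytes read as Latin-1) and one way to extract the numeric part regardless of the prefix. `parse_price` is a hypothetical helper, not part of the starter code.

```python
import re

# '£' is two bytes in UTF-8; decoding them as Latin-1 yields 'Â£'
mojibake = "£47.82".encode("utf-8").decode("latin-1")


def parse_price(price_text):
    """Extract the numeric part of a price string such as '£47.82' or 'Â£47.82'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())


print(mojibake)
print(parse_price(mojibake), parse_price("£47.82"))
```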

In [47]:
fp = os.path.join('data', 'products.html')
text = open(fp, encoding='utf-8').read()


In [48]:
soup = bs4.BeautifulSoup(text, features='lxml')
books = soup.find_all('article', {'class': 'product_pod'})
result = []
for book in books:
    # Price text is a currency symbol followed by the amount; drop the symbol before converting
    price = book.find('p', {'class': 'price_color'}).text
    # The rating is the second class on the tag, e.g. class="star-rating Four"
    rate = book.select('p[class*="star-rating "]')[0].get('class')[1]
    if float(price[1:]) < 50 and rate in ('Four', 'Five'):
        result.append(book.find('a').get('href'))
result

['seven-brief-lessons-on-physics_219/index.html',
 'scarlet-the-lunar-chronicles-2_218/index.html',
 'saga-volume-3-saga-collected-editions-3_216/index.html',
 'running-with-scissors_215/index.html',
 'rise-of-the-rocket-girls-the-women-who-propelled-us-from-missiles-to-the-moon-to-mars_213/index.html',
 'ready-player-one_209/index.html']

In [101]:
fp = os.path.join('data', 'Frankenstein.html')
out = get_product_info(open(fp, encoding='utf-8').read(), ['Default'])
out

{'Availability': 'In stock (1 available)',
 'Category': 'Default',
 'Description': "Mary Shelley began writing Frankenstein when she was only eighteen. At once a Gothic thriller, a passionate romance, and a cautionary tale about the dangers of science, Frankenstein tells the story of committed science student Victor Frankenstein. Obsessed with discovering the cause of generation and life and bestowing animation upon lifeless matter, Frankenstein assembles Mary Shelley began writing Frankenstein when she was only eighteen. At once a Gothic thriller, a passionate romance, and a cautionary tale about the dangers of science, Frankenstein tells the story of committed science student Victor Frankenstein. Obsessed with discovering the cause of generation and life and bestowing animation upon lifeless matter, Frankenstein assembles a human being from stolen body parts but; upon bringing it to life, he recoils in horror at the creature's hideousness. Tormented by isolation and loneliness, the o

In [65]:
fp = os.path.join('data', 'Frankenstein.html')
text = open(fp, encoding='utf-8').read()
soup = bs4.BeautifulSoup(text, features="lxml")

In [106]:
soup.select('a[href*="/category/books/"]')[0].text

'Default'

In [97]:
k = 1
categories = ['Mystery']

In [99]:
scrape_books(k, categories)

| | Availability | Category | Description | Number of reviews | Price (excl. tax) | Price (incl. tax) | Product Type | Rating | Tax | Title | UPC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In stock (20 available) | Mystery | WICKED above her hipbone, GIRL across her hear... | 0 | Â£47.82 | Â£47.82 | Books | Four | Â£0.00 | Sharp Objects | e00eb4fd7b871a48 |


In [132]:
pages = [f'http://books.toscrape.com/catalogue/page-{num}.html' for num in range(1,k+1)]
book_list = []
book_dict = []
for page in pages:
    page_request = requests.get(page)
    book_list.extend(extract_book_links(page_request.text))
for book in book_list:
    book_request = requests.get('http://books.toscrape.com/catalogue/' + book)
    book_info = get_product_info(book_request.text,categories)
    if book_info is not None:
        book_dict.append(book_info)
pd.DataFrame(book_dict)

| | Availability | Category | Description | Number of reviews | Price (excl. tax) | Price (incl. tax) | Product Type | Rating | Tax | Title | UPC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In stock (20 available) | Mystery | WICKED above her hipbone, GIRL across her hear... | 0 | Â£47.82 | Â£47.82 | Books | Four | Â£0.00 | Sharp Objects | e00eb4fd7b871a48 |


In [133]:
grader.check("q2")

# API Requests

**Question 3**

You trade stocks as a hobby. As an avid pandas coder, you figured it is best to calculate some statistics by pulling data from a public API (https://financialmodelingprep.com/developer/docs/#Stock-Historical-Price). Specifically, "Historical price with change and volume interval".

Some definitions (these are the ones you need to know):
- open: The opening price of a stock at the beginning of a trading day
- close: The closing price of a stock at the end of a trading day
- volume: The total number of shares being traded in a day
- percent change: difference in price with respect to the original price (in percentages)


1. Create a function `stock_history` which takes in the stock code (`ticker`) as a string and `year` and `month` as integers, and returns a dataframe containing the price history for that stock in that month (include all columns).

2. Create a function `stock_stats` that takes in the output dataframe from `stock_history` and outputs the stock price change as a percentage and a rough total transaction volume **in billion dollars** for that month. Assume that, on average, shares are traded at the midpoint of the high and low prices for that day. Return these two values as a tuple in a readable format: keep 2 decimal places for both values and add a plus or minus sign at the front of the percent change.
$$ \text{Total Transaction Volume (in dollars)} = \text{Volume (number of shares traded)} \times \text{Price} $$

*Example*: If \\$BYND opens at \\$80 and closes at \\$120 with a volume of 1 million, its percent change for the day is $(\$120-\$80) \div \$80 = +50.00\%$. Taking \\$80 and \\$120 as the day's low and high as well, the estimated total transaction volume is: $(\$80+\$120) / 2 \times 10^6 = 0.10\text{B}$.
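
The arithmetic and the output format from the example can be reproduced in a few lines. This is only a sketch of the formatting step: `format_stats` is a hypothetical helper, and the open and close prices stand in for the day's low and high, as in the example.

```python
def format_stats(open_price, close_price, dollar_volume):
    """Format the percent change (signed) and dollar volume (in billions)."""
    pct = (close_price - open_price) / open_price * 100
    # '+' in the format spec forces a sign; '.2f' keeps two decimal places
    return f"{pct:+.2f}%", f"{dollar_volume / 1e9:.2f}B"


open_price, close_price, shares = 80, 120, 1_000_000
midpoint = (open_price + close_price) / 2  # stand-in for (low + high) / 2
result = format_stats(open_price, close_price, midpoint * shares)
print(result)  # ('+50.00%', '0.10B')
```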


Hint: [pd.date_range](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html)

*Note:* Make sure you read the API documentation if you get stuck!

*Note 2:* In order to make successful requests, you will need an API key. In order to get one, you will need to sign up to the website. Once signed up, you can use the API key that comes with the free plan. It has a limit of 250 requests per day, which should be more than enough. In the code below, replace `your_key` when making requests.
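
One way to restrict a request to a single month is to compute the month's first and last day and pass them as query parameters. The sketch below uses only the standard library. The `from` and `to` parameter names are an assumption about this API and should be checked against its documentation; `month_bounds` and `build_history_url` are hypothetical helpers.

```python
import calendar
from datetime import date
from urllib.parse import urlencode


def month_bounds(year, month):
    """Return the first and last calendar day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def build_history_url(ticker, year, month, api_key="your_key"):
    """Build a request url for one month of prices (parameter names assumed)."""
    start, end = month_bounds(year, month)
    query = urlencode({
        "from": start.isoformat(),  # assumed parameter name
        "to": end.isoformat(),      # assumed parameter name
        "apikey": api_key,
    })
    return f"https://financialmodelingprep.com/api/v3/historical-price-full/{ticker}?{query}"


example_url = build_history_url("BYND", 2019, 6)
print(example_url)
```

If the API does not support date filtering on your plan, the same bounds can be used to filter the full history after loading it into a dataframe.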

In [215]:
ticker = 'BYND'
year = 2019
month = 6

In [231]:
url = 'https://financialmodelingprep.com/api/v3/historical-price-full/{t}?apikey=your_key'.format(t = ticker)

In [297]:
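# `history` is the dataframe returned by stock_history('BYND', 2019, 6)
df = history.sort_values('date').reset_index(drop=True)
op = df['open'].iloc[0]     # opening price on the first trading day of the month
cl = df['close'].iloc[-1]   # closing price on the last trading day of the month
# Percent change over the month, with an explicit sign and 2 decimal places
PC = f'{(cl - op) / op * 100:+.2f}%'
# Estimated dollar volume: daily midpoint of high and low times shares traded, in billions
TTV = f"{((df['low'] + df['high']) / 2 * df['volume']).sum() / 1e9:.2f}B"
(PC, TTV)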

('+54.29%', '33.64B')

In [106]:
op = history.iloc[-1].open
cl = history.iloc[0].close
(op,cl)

(104.139999, 160.679993)

In [107]:
PC = str(round((cl - op) / op * 100, 2)) + '%'
PC

'54.29%'

In [103]:
history = stock_history('BYND', 2019, 6)
history.sort_values('date').reset_index()

| | index | date | open | high | low | close | adjClose | volume | unadjustedVolume | change | changePercent | vwap | label | changeOverTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 2019-06-03 | 104.139999 | 108.669998 | 95.662003 | 96.160004 | 96.160004 | 8027700.0 | 8027700.0 | -7.98 | -7.663 | 100.164 | June 03, 19 | -0.07663 |
| 1 | 18 | 2019-06-04 | 101.25 | 103.5 | 97.82 | 103.410004 | 103.410004 | 5484900.0 | 5484900.0 | 2.16 | 2.133 | 101.57667 | June 04, 19 | 0.02133 |
| 2 | 17 | 2019-06-05 | 105.5 | 105.5 | 99.639999 | 102.599998 | 102.599998 | 4283500.0 | 4283500.0 | -2.9 | -2.749 | 102.58 | June 05, 19 | -0.02749 |
| 3 | 16 | 2019-06-06 | 102.0 | 102.25 | 98.849998 | 99.5 | 99.5 | 6484000.0 | 6484000.0 | -2.5 | -2.451 | 100.2 | June 06, 19 | -0.02451 |
| 4 | 15 | 2019-06-07 | 130.0 | 149.460007 | 120.760002 | 138.649994 | 138.649994 | 23916700.0 | 23916700.0 | 8.64999 | 6.654 | 136.29 | June 07, 19 | 0.06654 |
| 5 | 14 | 2019-06-10 | 155.699997 | 186.429993 | 147.0 | 168.100006 | 168.100006 | 24986000.0 | 24986000.0 | 12.40001 | 7.964 | 167.17667 | June 10, 19 | 0.07964 |
| 6 | 13 | 2019-06-11 | 145.25 | 150.0 | 125.230003 | 126.040001 | 126.040001 | 15516000.0 | 15516000.0 | -19.21 | -13.225 | 133.75667 | June 11, 19 | -0.13225 |
| 7 | 12 | 2019-06-12 | 133.990005 | 150.449997 | 131.563004 | 141.970001 | 141.970001 | 16918600.0 | 16918600.0 | 7.98 | 5.956 | 141.32767 | June 12, 19 | 0.05956 |
| 8 | 11 | 2019-06-13 | 141.520004 | 146.449997 | 134.25 | 141.389999 | 141.389999 | 9474600.0 | 9474600.0 | -0.13001 | -0.092 | 140.69667 | June 13, 19 | -0.00092 |
| 9 | 10 | 2019-06-14 | 142.009995 | 157.899994 | 141.800003 | 151.479996 | 151.479996 | 14964600.0 | 14964600.0 | 9.47 | 6.669 | 150.39333 | June 14, 19 | 0.06669 |


In [108]:
stock_stats(history)

('+54.29%', '33.64B')

In [298]:
grader.check("q3")

# Comment Threads

**Question 4**

As a hacker, you get your daily dose of tech news on [Hacker News](https://news.ycombinator.com/). The problem is that you don't have internet access on your phone during your morning commute, so you want to save the comment threads of interesting stories beforehand in a flat file format such as csv. You find their API documentation (https://github.com/HackerNews/API) and implement the following task:

1. Write a function `get_comments` that takes `storyid` as a parameter and returns a dataframe of all the comments below the news story. You can ignore 'dead' comments (you will know it when you see it). **Make sure the order of the comments in your dataframe is from top to bottom just as you see on the website**. You are allowed to use loops in this function. Additional requirement: write at least one helper function.

You only want the following information for each comment:
1. `id`: the comment's unique id
2. `by`: the author of the comment
3. `parent`: the unique id of the item the comment replies to
4. `text`: the actual comment
5. `time`: when the comment was created (as a pandas datetime, e.g. via `pd.to_datetime`)
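
The API reports `time` as Unix seconds, so it has to be converted before it reads as a date. The sketch below uses the standard library to show the conversion for one timestamp taken from the API output in this notebook; in pandas, `pd.to_datetime(series, unit='s')` performs the same conversion on a whole column. `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc(unix_seconds):
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


stamp = to_utc(1541400799)
formatted = stamp.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2018-11-05 06:53:19
```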

Hints:
1. Use depth-first-search when traversing the comments tree.
2. https://docs.python.org/3/tutorial/datastructures.html#using-lists-as-stacks.
3. Compare the size of your dataframe with the story's `descendants` attribute (number of comments).

`news_endpoint = "https://hacker-news.firebaseio.com/v0/item/{}.json"`
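
The depth-first traversal from the hints can be tried on an in-memory tree before any requests are made. The following is a minimal sketch with made-up items: `flatten_thread` is a hypothetical helper, and it assumes that skipping a dead comment also skips its replies. Children are pushed in reverse so that the first reply is popped first, which preserves the top-to-bottom order of the page.

```python
def flatten_thread(items, story_id):
    """Return comment ids under `story_id` in page (pre-order) order."""
    ordered = []
    # Reverse so the first child ends up on top of the stack
    stack = list(reversed(items[story_id].get("kids", [])))
    while stack:
        item = items[stack.pop()]
        if item.get("dead"):
            continue  # assumption: replies to a dead comment are skipped too
        ordered.append(item["id"])
        stack.extend(reversed(item.get("kids", [])))
    return ordered


# Made-up items shaped like API responses: story 1 with a small reply tree
items = {
    1: {"id": 1, "kids": [2, 5]},
    2: {"id": 2, "kids": [3, 4]},
    3: {"id": 3},
    4: {"id": 4, "dead": True, "kids": [6]},
    5: {"id": 5},
    6: {"id": 6},
}
order = flatten_thread(items, 1)
print(order)  # [2, 3, 5]
```

In the real function, the dictionary lookup `items[...]` would be replaced by a request to `news_endpoint` for each id.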

In [7]:
storyid = 18344932

In [82]:
root = 'https://hacker-news.firebaseio.com/v0/item/18380397.json?print=pretty'
pd.read_json(root, typ = 'series')

by                                                  valyala
id                                                 18380397
parent                                             18344932
text      TimescaleDB is great for storing time series c...
time                                             1541400799
type                                                comment
dtype: object

In [27]:
root = "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(storyid)
pd.read_json(root, typ = 'series')

by                                                ScottWRobinson
descendants                                                   18
id                                                      18344932
kids           [18380397, 18346406, 18348601, 18346750, 18346...
score                                                         47
time                                                  1540987334
title                        TimescaleDB 1.0 Is Production Ready
type                                                       story
url            https://blog.timescale.com/1-0-enterprise-prod...
dtype: object

In [96]:
out = []
def walk(tree_path, out):
    """Depth-first walk: append every comment below the item at `tree_path` to `out`."""
    tree = pd.read_json(tree_path, typ = 'series')
    # Items without replies have no 'kids' entry, so the recursion stops here
    if 'kids' not in tree.index:
        return tree
    for i in tree['kids']:
        path = "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(i)
        # Record the child before descending so comments stay in page order
        out.append(pd.read_json(path, typ = 'series'))
        walk(path, out)

In [94]:
walk(root, out)

by                                                  valyala
id                                                 18380397
parent                                             18344932
text      TimescaleDB is great for storing time series c...
time                                             1541400799
type                                                comment
dtype: object

In [119]:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df.apply(np.sqrt)  # returns a new dataframe; `df` itself is unchanged
df

In [121]:
get_comments('18344932')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


| | id | by | parent | text | time |
|---|---|---|---|---|---|
| 0 | 18380397 | valyala | 18344932 | TimescaleDB is great for storing time series c... | 2018-11-05 06:53:19 |
| 1 | 18346406 | msiggy | 18344932 | I&#x27;m excited to give this database a try i... | 2018-10-31 15:20:22 |
| 2 | 18348601 | sman393 | 18344932 | Can this be used side by side on normal Postgr... | 2018-10-31 19:29:39 |
| 3 | 18348631 | RobAtticus | 18348601 | Yep, absolutely. Regular PostgreSQL tables coe... | 2018-10-31 19:34:52 |
| 4 | 18348984 | sman393 | 18348631 | Good to hear! how does the current TimescaleDB... | 2018-10-31 20:23:46 |
| 5 | 18349540 | RobAtticus | 18348984 | Not sure I follow exactly what you&#x27;re ask... | 2018-10-31 21:47:20 |
| 6 | 18350673 | sman393 | 18349540 | Alright thanks! I thought I read that Timescal... | 2018-11-01 01:11:59 |
| 7 | 18351061 | RobAtticus | 18350673 | It does not support sharding writes across mul... | 2018-11-01 02:35:03 |
| 8 | 18346750 | zip1234 | 18344932 | How fast is it when it has a TB of data? I rea... | 2018-10-31 15:51:43 |
| 9 | 18347260 | nevi-me | 18346750 | I spent about 8 months writing data to TSDB. I... | 2018-10-31 16:47:34 |


In [122]:
grader.check("q4")

## Congratulations! You're done!

* Submit the lab on Gradescope

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()