Name: Carlos Antonio M. Doble

*Contribution statement*: 

# Homework #2: Data Collection with API and Web Scraping

This homework will help you get used to using `requests` and `beautifulsoup4` for getting data through APIs and web scraping.

Read through the whole notebook first to get an idea of what to expect and then go through each task. Skeleton code is provided for you as a guide.

---
You are allowed to work in groups (maximum of 4) to assist each other with coding problems and to discuss how to handle different kinds of data. _However, you will still submit your notebooks **individually**, and write-ups (captions, answers to non-coding questions) should also be done individually._ Each notebook must have a `markdown cell` at the beginning enumerating the contributions of each member to the homework. If you worked alone, simply state that you worked alone. 

A missing markdown cell with contributions, or any inconsistency between group members' statements that cannot be easily reconciled by asking, will **incur a 5% deduction for the conflicting members**.

---

### Instructions
Read through the whole notebook first to get an idea of what is expected. 

After answering every question, restart your kernel and re-run ALL the cells from top to bottom to ensure that there are no errors. 

### Grading
Each task will have corresponding points. Full points will be given if the task was successfully completed.

The notebooks turned in should be _runnable_. I should be able to re-run your submitted notebooks from top to bottom without any errors. In case an error is encountered (that has nothing to do with FileNotFound), **there will be a 5% deduction from the final score**.

#### Total raw points: 30 points
---

In [1]:
import os 
import csv
import time
import requests
import datetime as dt
import pandas as pd
from bs4 import BeautifulSoup

## Part 1. Downloading data from API

Using the code from the Reddit lab exercise, choose a **different subreddit** that *you* would like to explore.

Retrieve the following fields from posts made between **September 1, 2020 and September 30, 2020**.
- author
- subreddit
- date created 
- number of comments 
- score
- submission title 
- submission description
    
Save the data into a `pandas` `DataFrame`.

### Task 1. Download the data given the time period and the selected fields. (10 pts)

In [19]:
def to_utc(date):
    #Convert a datetime to a Unix timestamp in seconds, treating the given value as UTC.
    #This automates the conversion instead of going to https://www.unixtimeconverter.io/ 
    return int(date.replace(tzinfo=dt.timezone.utc).timestamp())
    
def to_readable_date(timestamp):
    #Convert a Unix timestamp to a Year-Month-Day string, read in UTC rather than local time
    return dt.datetime.fromtimestamp(timestamp, tz=dt.timezone.utc).strftime("%Y-%m-%d")

#Declare start and end of reddit posts to extract 
start_date = dt.datetime.strptime("2020-09-01", "%Y-%m-%d")
end_date = dt.datetime.strptime("2020-09-30", "%Y-%m-%d")

#Create a range of dates to iterate 
#Note: periods is the number of days to generate, counting from the start date 
#We add 2 because (end_date - start_date).days alone would only generate up to September 29. We include October 1 
#since we want the last window to cover the final day, which is September 30 to October 1 
date_range = (pd.date_range(
                start_date, 
                periods=(end_date - start_date).days + 2)
              .tolist())

#prepare the parameters needed to call the API
sort_type="score"
sort="desc"
fields=["author", "subreddit", "title","selftext","score","num_comments","created_utc"]
subreddit = 'DnD'
url = "https://api.pushshift.io/reddit/submission/search/"
results = []
#loop through the dates 
for i, s_date in enumerate(date_range):
    #prevents us from getting an index out of range error
    if i != len(date_range)-1:
        #declare end date 
        e_date = date_range[i+1]
        #call the API
        r = requests.get(url = url, params={
            'after': to_utc(s_date),
            'before': to_utc(e_date),
            'sort_type': sort_type,
            'sort': sort,
            'subreddit': subreddit,
            'fields': fields,
            "size": 500
        })

        #add logs 
        print(f"Doing {s_date.strftime('%Y-%m-%d')} to {e_date.strftime('%Y-%m-%d')}")
        if r.status_code == 200:
            results.append(r.json()['data'])
            print("=====Done")
        else:
            print("=====Skipped")
        #pause for 1 second between calls so that we don't get blocked for abusing the API
        time.sleep(1)

Doing 2020-09-01 to 2020-09-02
=====Done
Doing 2020-09-02 to 2020-09-03
=====Done
Doing 2020-09-03 to 2020-09-04
=====Done
Doing 2020-09-04 to 2020-09-05
=====Done
Doing 2020-09-05 to 2020-09-06
=====Done
Doing 2020-09-06 to 2020-09-07
=====Done
Doing 2020-09-07 to 2020-09-08
=====Done
Doing 2020-09-08 to 2020-09-09
=====Done
Doing 2020-09-09 to 2020-09-10
=====Done
Doing 2020-09-10 to 2020-09-11
=====Done
Doing 2020-09-11 to 2020-09-12
=====Done
Doing 2020-09-12 to 2020-09-13
=====Done
Doing 2020-09-13 to 2020-09-14
=====Done
Doing 2020-09-14 to 2020-09-15
=====Done
Doing 2020-09-15 to 2020-09-16
=====Done
Doing 2020-09-16 to 2020-09-17
=====Done
Doing 2020-09-17 to 2020-09-18
=====Done
Doing 2020-09-18 to 2020-09-19
=====Done
Doing 2020-09-19 to 2020-09-20
=====Done
Doing 2020-09-20 to 2020-09-21
=====Done
Doing 2020-09-21 to 2020-09-22
=====Done
Doing 2020-09-22 to 2020-09-23
=====Done
Doing 2020-09-23 to 2020-09-24
=====Done
Doing 2020-09-24 to 2020-09-25
=====Done
Doing 2020-09-25
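
The day-by-day windowing used in the loop above can also be sketched with timezone-aware datetimes from the standard library, which avoids the `+2` bookkeeping. This is only a minimal sketch, not the notebook's method: `daily_windows` is a hypothetical helper name, and it assumes the API takes `after` and `before` as Unix timestamps in seconds, as the call above does.

```python
import datetime as dt

def daily_windows(start, end):
    """Yield (after, before) Unix timestamps, one pair per UTC day from start to end inclusive."""
    day = dt.timedelta(days=1)
    current = dt.datetime(start.year, start.month, start.day, tzinfo=dt.timezone.utc)
    last = dt.datetime(end.year, end.month, end.day, tzinfo=dt.timezone.utc)
    while current <= last:
        # each window runs from midnight UTC to the following midnight UTC
        yield int(current.timestamp()), int((current + day).timestamp())
        current += day

windows = list(daily_windows(dt.date(2020, 9, 1), dt.date(2020, 9, 30)))
print(len(windows))   # number of one-day windows
print(windows[0])     # (after, before) pair for September 1
```

Because every datetime carries an explicit UTC offset, the timestamps do not depend on the time zone of the machine running the notebook.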

### Task 2. Save the results to a `pandas` `DataFrame`. (3 pts)

In [20]:
# your code for saving data into a DataFrame here
# store it also in a CSV file for use later (maybe in the next assignment!)
flat_list = []
#loop through the reddit results
for sublist in results:
    #guard against a missing sublist. Days without submissions give an empty list, which simply adds nothing
    if sublist is not None:
        #for each dictionary in the sublist add it to the flat list 
        for item in sublist:
            flat_list.append(item)

#pandas has a useful function called from_dict which will convert a list of dictionary objects into a dataframe
df = pd.DataFrame.from_dict(flat_list)
df

| Unnamed: 0 | author | created_utc | num_comments | score | selftext | subreddit | title |
|---|---|---|---|---|---|---|---|
| 0 | KymmaLabeija | 1598951913 | 204 | 49 | | DnD | [OC] [ART] Our party adopted a kobold and dres... |
| 1 | KibblesTasty | 1598952378 | 5 | 31 | | DnD | [OC][Art] Occultist Witch |
| 2 | ClockworkArcana | 1598956511 | 21 | 15 | Hi all! In our campaign, we often take the roa... | DnD | We made a free online tool for randomising tow... |
| 3 | Noferini | 1598961239 | 16 | 15 | | DnD | [Art] Azra Longrose - Half-Elf Rogue |
| 4 | glorycave | 1598957105 | 21 | 9 | | DnD | [Art] Mobius strip battlemap |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2995 | GentleAutumnRain | 1601455407 | 4 | 1 | I'm currently working on a new character for m... | DnD | Need help creating a sorcerer backstory! |
| 2996 | Immortalstar01 | 1601427935 | 8 | 1 | Reading a lot of lore, Asmodi seems to be well... | DnD | How would you envision the cosmos ruled by Asm... |
| 2997 | Seerias | 1601473949 | 0 | 1 | | DnD | HeroForge Update is a big help. (My Photoshop ... |
| 2998 | funk_with_dragons | 1601454971 | 3 | 1 | I thought about a bioengineer or genetic scien... | DnD | What if the artificer could create life |
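
The flattening step can also be written with `itertools.chain`, and the `created_utc` column can be turned into readable dates once the data is in a `DataFrame`. The following is a minimal sketch under stated assumptions: `sample_results` is a made-up stand-in for `results` (the author names are invented), and `created_utc` is assumed to hold seconds since the Unix epoch.

```python
from itertools import chain

import pandas as pd

# Hypothetical stand-in for `results`: one list of post dictionaries per day.
sample_results = [
    [{"author": "a1", "created_utc": 1598951913, "score": 49}],
    [],  # a day with no submissions
    [{"author": "a2", "created_utc": 1601455407, "score": 1}],
]

# chain.from_iterable flattens one level of nesting; empty lists contribute nothing.
flat = list(chain.from_iterable(sample_results))
sample_df = pd.DataFrame(flat)

# Interpret created_utc as seconds since the epoch and keep the result in UTC.
sample_df["created"] = pd.to_datetime(sample_df["created_utc"], unit="s", utc=True)
print(sample_df[["author", "created", "score"]])
```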


### Task 3. How many posts were you able to retrieve? (2 pts)

In [21]:
# use a pandas function or a python function to get the size of your dataframe
len(df)

3000

---
## Part 2. Web Scraping Books

Go to http://books.toscrape.com/ and, using what you have learned, create a CSV file that contains all the books found on the website. The CSV file should contain the following:
- Title
- Price
- Description
- Availability

Code guides have been provided to help you in creating the web scraper. 

In [22]:
base_url = "http://books.toscrape.com/"

### Task 4. Complete the `get_title_links_and_next_page` function. (3 pts)
This function returns 2 things: the **book urls in a page** and the **link to the next page**. 

The idea here is to first collect all the book links available on the website and store them in the `title_links` variable.

In [30]:
def get_title_links_and_next_page(page_url):
    #this is where we store our links to the title 
    list_links = [] 
    #get the html for the url that was given
    page = requests.get(page_url)
    
    #parse the html file for beautifulsoup to query on
    soup = BeautifulSoup(page.text, 'html.parser')
    
    #inspecting the page we notice that the books are placed under 
    #the article tag so we get all articles
    for article in soup.find_all('article'):
        #the article tag has an anchor tag so we find it and get the href
        if "catalogue" not in article.find("a")['href']:
            url = base_url + "catalogue/" + article.find("a")['href']
        else:
            url = base_url + article.find("a")['href']
        #add the title url to our list of titles 
        list_links.append(url)
    
    #try to check if a next button is in the page 
    try:
        next_url = 
    #if none we return None :)     
    except:
        next_url = None

    return (list_links, next_url)
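
One possible way to look for a next-page link is sketched below on inline HTML rather than on the live site. The pager markup and the `li.next a` selector are assumptions for illustration and should be checked against the real page in the browser inspector. `find_next_href` is a hypothetical helper name, not part of the skeleton.

```python
from bs4 import BeautifulSoup

# Hypothetical pager markup; verify the real structure before relying on it.
html_with_next = (
    '<ul class="pager"><li class="current">Page 1 of 50</li>'
    '<li class="next"><a href="catalogue/page-2.html">next</a></li></ul>'
)
html_last_page = (
    '<ul class="pager"><li class="previous">'
    '<a href="page-49.html">previous</a></li></ul>'
)

def find_next_href(html):
    """Return the href of the next-page link, or None when there is no such link."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a")  # None when no element matches
    return link["href"] if link is not None else None

print(find_next_href(html_with_next))
print(find_next_href(html_last_page))
```

Checking for `None` explicitly keeps the "no next page" case separate from unrelated failures, which a bare `except:` would also swallow.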

### Task 5. Complete the link collector. (2 pts)

This code block is your starter scraper. It uses the data returned by the function `get_title_links_and_next_page` to go through all the pages in http://books.toscrape.com/index.html.

Complete the lines marked with `# TODO!!`.

In [None]:
#initial set up to crawl the book links and next page
res = get_title_links_and_next_page('http://books.toscrape.com/index.html')
title_links = res[0]  

#while we get a next page link keep on crawling for book links
while res[1]:
    #there are cases that the word "catalogue" is not in the link so we add it 
    #so that we can crawl properly
    if "catalogue" not in res[1]:
        page_url = base_url + "catalogue/" + res[1]
    else:
        page_url = base_url + res[1]
    res = # TODO!!
    title_links += # TODO!!

title_links  # this should print a list of every book
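
The crawl loop above has to special-case links that lack the `catalogue/` prefix. As an alternative sketch, `urllib.parse.urljoin` resolves each relative link against the page it was found on. The example below runs without network access: `fake_site` and its book slugs are invented, and `crawl` is a hypothetical helper whose `fetch` argument returns relative links, which differs from the contract of `get_title_links_and_next_page`.

```python
from urllib.parse import urljoin

# Hypothetical site map standing in for real HTTP responses:
# page URL -> (relative book hrefs on that page, relative href of the next page or None)
fake_site = {
    "http://books.toscrape.com/index.html": (
        ["catalogue/book-a/index.html"], "catalogue/page-2.html"),
    "http://books.toscrape.com/catalogue/page-2.html": (
        ["book-b/index.html"], "page-3.html"),
    "http://books.toscrape.com/catalogue/page-3.html": (
        ["book-c/index.html"], None),
}

def crawl(start_url, fetch):
    """Follow next-page links from start_url and return absolute book URLs."""
    links = []
    page_url = start_url
    while page_url:
        hrefs, next_href = fetch(page_url)
        # urljoin resolves each href relative to the page that contained it
        links.extend(urljoin(page_url, href) for href in hrefs)
        page_url = urljoin(page_url, next_href) if next_href else None
    return links

all_links = crawl("http://books.toscrape.com/index.html", fake_site.__getitem__)
print(all_links)
```

Passing the fetch function in as an argument makes it possible to try the loop on canned data before pointing it at the real website.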

### Task 6. Complete the functions. (8 pts)
Once you have a list of all the available book links, we can now loop through the links and use the 4 functions `get_title`, `get_price`, `get_description`, `get_availability` to retrieve the book information.

Complete the functions below to get the specific fields from the individual links from `title_links`.

In [None]:
def get_title(soup):
    return 

def get_price(soup):
    return 
    
def get_description(soup):
    return 
    
def get_availability(soup):
    return 


# This is the scraper for each and every HTML page in title_links
book_data = []
for title_link in : 
    page = 
    soup = 
    
    title = get_title(soup)
    price = get_price(soup)
    description = get_description(soup)
    availability = get_availability(soup)
    
    book_data += [[title, price, description, availability.strip()]]
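
The four extraction functions could follow a pattern like the one sketched here, shown on an inline fragment instead of a downloaded page. The tag names, class names and `id` in `sample_page` are assumptions about how a product page might be laid out, and `text_or_none` and `extract_fields` are hypothetical helpers. Inspect a real book page before reusing any selector.

```python
from bs4 import BeautifulSoup

# Hypothetical product-page fragment; class names and ids are assumptions.
sample_page = """
<div class="product_main">
  <h1>Example Book</h1>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
      In stock (22 available)
  </p>
</div>
<div id="product_description"><h2>Product Description</h2></div>
<p>An example description.</p>
"""

def text_or_none(tag):
    """Return a tag's stripped text, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def extract_fields(soup):
    """Pull title, price, description and availability from one parsed page."""
    heading = soup.find(id="product_description")
    # The description paragraph is assumed to follow the heading block as a sibling.
    description = heading.find_next_sibling("p") if heading is not None else None
    return {
        "title": text_or_none(soup.find("h1")),
        "price": text_or_none(soup.find("p", class_="price_color")),
        "description": text_or_none(description),
        "availability": text_or_none(soup.find("p", class_="availability")),
    }

fields = extract_fields(BeautifulSoup(sample_page, "html.parser"))
print(fields)
```

Returning `None` for a missing element keeps one incomplete page from stopping the whole scrape, though the caller then has to handle `None` before calling string methods such as `.strip()`.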

### Task 7. Save the data into a `pandas` `DataFrame`. (2 pts)

Pass the correct data value to convert the collected books into a `DataFrame` and save it to a CSV file.

In [None]:
df = pd.DataFrame(data=)
df.columns = ['title', 'price', 'description', 'availability']
display(df.head())

#save to csv file 
df.
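
As a small illustration of the save step, the sketch below builds a `DataFrame` from made-up rows and writes it to an in-memory buffer instead of a file. The rows in `sample_rows` are invented and only mirror the `[title, price, description, availability]` order used for `book_data`.

```python
import io

import pandas as pd

# Hypothetical rows in the same column order as book_data.
sample_rows = [
    ["Example Book", "£51.77", "An example description.", "In stock (22 available)"],
    ["Another Book", "£12.00", "Another description.", "In stock (3 available)"],
]

books = pd.DataFrame(
    sample_rows, columns=["title", "price", "description", "availability"]
)

# index=False leaves the row index out of the CSV, so reading the data back
# does not produce an extra "Unnamed: 0" column.
buffer = io.StringIO()
books.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```

Replacing `buffer` with a file path in `to_csv` writes the same content to disk.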

<h3><center>= END =</center></h3>