_____________________________________________________________
# Coursework 1 - Python Web Scraper
_____________________________________________________________
**Web scraping** is essentially about downloading structured data from the web, selecting some of it, and passing it to another process. This can be very handy when you have no data available and want to source it from the web, or when you want to keep track of information that appears there (e.g., monitoring news, stock prices, etc.).

In this coursework, you will apply the knowledge you have gained throughout the course to create a web scraper module that you can later use to gather the data that interests you most.
___________
## Install & Import Packages
___________
You will be using `Python 3` and you will also need to install these 2 packages:
- `requests` - for performing the HTTP requests
- `BeautifulSoup4` - for handling the HTML processing

The executable cells below will install these packages to your current conda environment and import all the necessary modules required for this coursework.

`Note` Keep in mind that later you will have to move the imports to their corresponding modules. The installs do not need to be repeated, though; they are permanent (but only for this conda environment).

In [1]:
# # Install a conda package in the current Jupyter kernel
# import sys
# !conda install --yes --prefix {sys.prefix} requests
# !conda install --yes --prefix {sys.prefix} beautifulsoup4

In [2]:
from requests import get
import requests
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import logging
import datetime
import csv
import pandas as pd

___________
## 01. Web Requests
___________
Your first task will be to develop a set of functions that will help you to download the chosen web page content using the `requests` package and its `.get()` method.

Essentially, your developed function `get_url_content(url)` should accept a single url (string) argument and make a `GET` request to that `URL`. If nothing goes wrong, you end up with the raw HTML content for the page you requested, but you should also think of any problems that may arise when making this request (e.g., bad `URL`, remote server down, etc.) and return `None` if so. 

You may also use `with closing(some_function) as resp: ....` in order to ensure that any network resources are freed when they go out of scope in that `with` block. That is good practice that helps to prevent fatal errors and/or network timeouts. 
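As a minimal sketch of what `closing` does, the example below uses a hypothetical `FakeResponse` stand-in (it is not part of `requests`) so that it runs without network access. With a real request, the same pattern would wrap the call to `get(url)`.

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a requests response (illustration only)."""

    def __init__(self, text):
        self.text = text
        self.closed = False

    def close(self):
        # closing() calls this automatically when the with block exits
        self.closed = True


resp = FakeResponse("<html></html>")
with closing(resp) as r:
    body = r.text  # use the resource inside the block

print(resp.closed)  # the resource was released on exit
```

`closing` works with any object that has a `close()` method, which is why it can wrap a response object.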

`Important!` You may prototype with your code here, but, after you are finished, create a module (file) `scrapper.py` and move your code there. Now you can test your solution by importing your `get_url_content` method from the `scrapper` module and calling it with a URL argument:

    from scrapper import get_url_content
    raw_html = get_url_content('https://www.cnbc.com/stocks/')
    len(raw_html)
    600151
    
    no_html = get_url_content('https://realpython.com/blog/nope-not-gonna-find-it')
    no_html is None

`Note` For more info/examples on `requests`, check [this documentation](https://www.w3schools.com/python/module_requests.asp) or [this tutorial](https://realpython.com/python-requests/).

In [3]:
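def get_url_content(url):
    # Get url content as text; return None if the request fails
    try:
        with closing(requests.get(url, timeout=10)) as resp:
            # treat HTTP error codes (e.g. 404) as failures too
            if not resp.ok:
                return None
            return resp.text
    except RequestException:
        return None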
    


In [4]:
# get_url_content("https://www.forexfactory.com/")

___________
## 02. Wrangling HTML With BeautifulSoup
___________
After you retrieve the raw HTML material, you will have to parse it in order to keep only the material that is relevant to you.

For this, you will be using the `BeautifulSoup` library (which you installed earlier). The `BeautifulSoup` constructor parses raw HTML strings and produces an object that mirrors the HTML document’s structure and includes numerous methods for selecting, viewing, and/or manipulating the content. For more info, you may want to check its [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), but generally, the following methods should suffice for the present exercise:

- `find_all(element_tag, ...)`: returns all HTML elements from a webpage that have a specified tag and/or some additional attributes. 
In order to get only the first one, you may use `find()` instead.
    
- `get_text()`: extracts the text from an HTML element.
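A short illustration of these methods, run on a made-up HTML snippet (the markup and class names are assumptions chosen for the example, not taken from any real page):

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = """
<ul>
  <li class="item">First</li>
  <li class="item">Second</li>
  <li class="other">Third</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")                    # every <li> element
class_items = soup.find_all("li", class_="item")   # only <li class="item">
first_item = soup.find("li")                       # first match, or None

# get_text(strip=True) removes surrounding whitespace from each text
texts = [li.get_text(strip=True) for li in class_items]
```

Note that the keyword is `class_` (with a trailing underscore), because `class` is a reserved word in Python.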

You should develop a method `parse_content(raw_html, element_tag, ...)` that:
1. Takes raw `HTML` content as input and passes it to the `BeautifulSoup` constructor.
2. Uses the `find_all()` method to retrieve a list of the elements of your choice (you may add further arguments, e.g. to select by `class`).
3. Uses the `get_text()` method to retrieve the text content of all the elements retrieved in step 2. 

The method should return a list of the texts retrieved from the elements, and an additional list of attributes related to them (e.g., dates, URL links, or anything else that you find useful). 

    You may find a dictionary format more suitable if you decide to store multiple element attributes that do not always exist.
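One possible shape for such a method is sketched below. It follows the signature from the task description; the optional `class_name` argument and the sample markup are assumptions made for illustration, and other designs are equally valid.

```python
from bs4 import BeautifulSoup


def parse_content(raw_html, element_tag, class_name=None):
    """Return (texts, attributes) for all matching elements.

    `class_name` is an optional, hypothetical extra filter.
    """
    soup = BeautifulSoup(raw_html, "html.parser")
    if class_name is None:
        elements = soup.find_all(element_tag)
    else:
        elements = soup.find_all(element_tag, class_=class_name)

    texts = [el.get_text(strip=True) for el in elements]
    # el.attrs is a dict, so attributes that are missing on some
    # elements are simply absent instead of raising an error
    attributes = [dict(el.attrs) for el in elements]
    return texts, attributes


# Made-up sample: the second link has no href attribute
sample = '<a class="story" href="/a">Alpha</a><a class="story">Beta</a>'
texts, attributes = parse_content(sample, "a", "story")
```

Storing one dictionary per element keeps the two lists aligned by position, and `attributes[i].get("href")` returns `None` when an element has no link.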
    
`Note!` - Feel free to use any other `BeautifulSoup` methods if you believe them to fit your approach better.

`How to start?`
Decide upon a webpage and its field(s) that you want to scrape, and move your mouse cursor over any of those fields within the webpage. Click the right mouse button and choose `inspect_element` from the drop-down list. You will now see the content information required to access the relevant fields.
    
Move your code to the `scrapper.py` file, and test your `parse_content()` method as you did in the first exercise.

In [5]:
def parse_content(url, element_tag, class_name):
    # download the page; give up early if the request failed
    data = get_url_content(url)
    if data is None:
        return None
    soup = BeautifulSoup(data, "html.parser")
    # return the first matching element (e.g. the calendar table), or None
    table = soup.find(element_tag, class_=class_name)
    return table

In [6]:
parse_content('https://www.forexfactory.com/calendar?day=apr01.2020', 'table', 'calendar__table')

<table class="calendar__table"> <thead> <tr class="calendar__header--desktop subhead"> <th class="calendar__date">Date</th> <th class="calendar__time"><a href="timezone.php" title="Time Options">9:12am</a></th> <th class="calendar__currency">Currency</th> <th class="calendar__impact">Impact</th> <th class="calendar__event"> </th> <th class="calendar__detail">Detail</th> <th class="calendar__actual">Actual</th> <th class="calendar__forecast">Forecast</th> <th class="calendar__previous">Previous</th> <th class="calendar__graph">Graph</th> </tr> <tr class="calendar__header--mobile subhead"> <th colspan="4"> <a class="calendar__header-time" href="timezone.php" title="Time Options">9:12am</a> </th> <th>Actual</th> </tr> </thead> <tr class="calendar__borderfix borderfix"><td></td></tr> <tr class="calendar__row calendar__row--day-breaker"> <td class="calendar__cell" colspan="10">Wed<span>Apr 1</span></td> </tr> <tr class="calendar__row calendar_row calendar__row--grey calendar__row--new-day n

## 03. Combining Your Methods
Now combine the methods defined above into a single method that allows you to get the text content of the elements of your choice within a given url. 

    `retrieve_text_url(url, element_tag, ...)`
    
As previously, add this method to `scrapper.py` and test it below:
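A minimal sketch of such a combined method is shown below. The `class_name` and `fetch` parameters are assumptions added for illustration: `fetch` is any callable that maps a URL to raw HTML (or `None` on failure), which makes the function easy to try out without network access.

```python
from bs4 import BeautifulSoup


def retrieve_text_url(url, element_tag, class_name=None, fetch=None):
    """Download `url` and return the text of every matching element.

    `fetch` is a hypothetical hook: any callable that maps a URL to raw
    HTML (or None on failure), e.g. get_url_content from scrapper.py.
    """
    if fetch is None:
        # fall back to the function developed in the first exercise
        from scrapper import get_url_content as fetch
    raw_html = fetch(url)
    if raw_html is None:
        return []  # request failed, nothing to parse
    soup = BeautifulSoup(raw_html, "html.parser")
    attrs = {"class": class_name} if class_name else {}
    elements = soup.find_all(element_tag, attrs=attrs)
    return [el.get_text(strip=True) for el in elements]


# Offline usage with a made-up page served from a dictionary
fake_pages = {"https://example.com": "<h2 class='t'>One</h2><h2>Two</h2>"}
titles = retrieve_text_url("https://example.com", "h2", fetch=fake_pages.get)
```

Returning an empty list on failure lets callers loop over the result without a separate `None` check; returning `None` instead would be just as reasonable.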

In [7]:
def setLogger():
    logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename='logs_file',
                    filemode='w')
    console = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)

In [8]:
forcal = []

def append_day_info(startlink):
    table = parse_content(startlink, 'table', 'calendar__table')
    if table is None:
        # page could not be downloaded or has no calendar table
        return
    # do not use the ".calendar__row--grey" css selector (reserved for historical data)
    trs = table.select("tr.calendar__row.calendar_row")
    fields = ["date","time","currency","impact","event","actual","forecast","previous"]
    # some rows do not have a date (cells merged)
    curr_year = startlink[-4:]
    curr_date = ""
    curr_time = ""
    for tr in trs:
        row = {}  # avoid shadowing the built-in `dict`
        # fields may mess up sometimes, see Tue Sep 25 2:45AM French Consumer Spending
        # in that case we append to errors.csv the date time where the error is
        try:
            for field in fields:
                data = tr.select("td.calendar__cell.calendar__{}.{}".format(field,field))[0]
                if field=="date" and data.text.strip()!="":
                    curr_date = data.text.strip()
                elif field=="time" and data.text.strip()!="":
                    # time is sometimes "All Day" or "Day X" (eg. WEF Annual Meetings)
                    if data.text.strip().find("Day")!=-1:
                        curr_time = "12:00am"
                    else:
                        curr_time = data.text.strip()
                elif field=="currency":
                    currency = data.text.strip()
                elif field=="impact":
                    # when impact says "Non-Economic" on mouseover, the relevant
                    # class name is "Holiday", thus we do not use the classname
                    impact = data.find("span")["title"]
                elif field=="event":
                    event = data.text.strip()
                elif field=="actual":
                    actual = data.text.strip()
                elif field=="forecast":
                    forecast = data.text.strip()
                elif field=="previous":
                    previous = data.text.strip()
            # combine year, date cell and time cell into a single timestamp
            date = datetime.datetime.strptime(",".join([curr_year,curr_date,curr_time]),"%Y,%a%b %d,%I:%M%p")
            row["Date"] = date.strftime("%Y-%m-%d %H:%M:%S")
            row["Currency"] = currency
            row["Impact"] = impact
            row["Event"] = event
            row["Actual"] = actual
            row["Forecast"] = forecast
            row["Previous"] = previous
            forcal.append(row)
        except Exception:
            # record where parsing failed instead of aborting the whole day
            with open("errors.csv","a",newline="") as f:
                csv.writer(f).writerow([curr_year,curr_date,curr_time])

In [9]:
append_day_info('https://www.forexfactory.com/calendar?day=apr2.2020')
pd.DataFrame(forcal)

Unnamed: 0,Date,Currency,Impact,Event,Actual,Forecast,Previous
0,2020-04-02 02:00:00,GBP,Low Impact Expected,Nationwide HPI m/m,0.8%,0.0%,0.3%
1,2020-04-02 02:30:00,CHF,Low Impact Expected,CPI m/m,0.1%,0.1%,0.1%
2,2020-04-02 03:00:00,EUR,Low Impact Expected,Spanish Unemployment Change,302.3K,27.7K,-7.8K
3,2020-04-02 04:53:00,EUR,Low Impact Expected,Spanish 10-y Bond Auction,0.69|2.0,,0.66|2.2
4,2020-04-02 04:59:00,EUR,Low Impact Expected,French 10-y Bond Auction,0.04|2.7,,-0.32|2.2
5,2020-04-02 05:00:00,EUR,Low Impact Expected,PPI m/m,-0.6%,-0.3%,0.2%
6,2020-04-02 07:30:00,USD,Low Impact Expected,Challenger Job Cuts y/y,266.9%,,-26.3%
7,2020-04-02 08:30:00,CAD,Medium Impact Expected,Trade Balance,-1.0B,-2.3B,-1.7B
8,2020-04-02 08:30:00,USD,High Impact Expected,Unemployment Claims,6648K,3600K,3307K
9,2020-04-02 08:30:00,USD,Low Impact Expected,Trade Balance,-39.9B,-40.6B,-45.5B
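
The `Date` values in the output above are built by joining the year, the date cell text, and the time cell text, and parsing the result with `strptime`. The sketch below reproduces that step in isolation. The input strings are assumptions: the date cell text is taken to have the weekday and month run together (as in the `Wed<span>Apr 1</span>` markup shown earlier), and the `%a`/`%b` directives assume an English locale.

```python
import datetime

# Assumed inputs: the year taken from the URL, the date cell text with the
# weekday and month run together (e.g. "ThuApr 2"), and the time cell text
curr_year, curr_date, curr_time = "2020", "ThuApr 2", "8:30am"

stamp = datetime.datetime.strptime(
    ",".join([curr_year, curr_date, curr_time]),
    "%Y,%a%b %d,%I:%M%p",
)
formatted = stamp.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2020-04-02 08:30:00
```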


In [10]:
def getEconomicCalendar(startlink,endlink):
    # write to console current status
    logging.info("Scraping data for link: {}".format(startlink))
    append_day_info(startlink)
    
    # exit recursion when the last available link has been reached
    if startlink==endlink:
        logging.info("Successfully retrieved data")
        return
        
    data = get_url_content(startlink)
    soup = BeautifulSoup(data, "html.parser")
    # get the link for the next page and follow it
    
    follow = soup.select("a.calendar__pagination.calendar__pagination--next.next")
    follow = follow[0]["href"]
    getEconomicCalendar(follow,endlink)
# pd.DataFrame(forcal)

In [11]:
getEconomicCalendar("https://www.forexfactory.com/calendar?day=apr2.2020","https://www.forexfactory.com/calendar?day=apr3.2020")

TypeError: object of type 'NoneType' has no len()
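
One plausible cause of this `TypeError` (an assumption, not verified against the live page) is that the `href` of the pagination link is relative. In that case the next download fails, `None` is passed to `BeautifulSoup`, and the comparison `startlink==endlink` can never match an absolute `endlink`. A minimal sketch of resolving such a link against the current page with `urljoin` follows; `resolve_link` and the example `href` value are hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(current_url, href):
    """Hypothetical helper: turn a possibly relative href into a full URL."""
    return urljoin(current_url, href)


current = "https://www.forexfactory.com/calendar?day=apr2.2020"
# assumed form of the relative href found in the pagination link
next_url = resolve_link(current, "calendar?day=apr3.2020")
print(next_url)
```

Links that are already absolute pass through `urljoin` unchanged, so the helper is safe to apply to every link before following it.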

In [None]:
pd.DataFrame(forcal)

## 04. Finalizing Your Module
Now, add a `__main__` section to your `scrapper.py`, allowing you to run the script standalone from the terminal by calling:

    `python scrapper.py url element_tag`

In [None]:
if __name__ == "__main__":
    """
    Run this using the command "python `script_name`.py >> `output_name`.csv"
    """
#     setLogger()
#     getEconomicCalendar("calendar.php?week=jan7.2007","calendar.php?week=dec24.2017")
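
To accept the `url` and `element_tag` arguments from the command line, one option is the standard `argparse` module. The sketch below only covers argument parsing; `parse_args` is a hypothetical helper name and the example values are made up.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments named in the task description."""
    parser = argparse.ArgumentParser(description="Simple web scraper")
    parser.add_argument("url", help="page to download")
    parser.add_argument("element_tag", help="HTML tag to extract, e.g. 'h2'")
    # argv=None makes argparse read the real command line (sys.argv)
    return parser.parse_args(argv)


# Passing a list simulates: python scrapper.py https://example.com h2
args = parse_args(["https://example.com", "h2"])
print(args.url, args.element_tag)
```

Inside `scrapper.py`, the `if __name__ == "__main__":` block would call `parse_args()` without arguments and pass `args.url` and `args.element_tag` on to your scraping method.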

## 05. Scrapper Extensions
Think of a scraper extension that would increase its functionality. Some examples are given below:

    Maybe it would be useful to also retrieve data from the sub-links within your elements? How would you do that?
    
    Or maybe this could be done more efficiently using a different package, e.g., Scrapy?
    
    What if you want to traverse a large list of webpages?
    
    Maybe it is possible to get around the `robot security` problem that may arise when scraping pages like Bloomberg news.
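
For the question about traversing a large list of webpages, one simple approach is a plain loop with a pause between requests. A loop also avoids Python's recursion limit, which a recursive page-follower would hit on very long ranges. The sketch below is one possible design; `scrape_many`, `fetch`, and `delay_seconds` are hypothetical names, and the dictionary stands in for real downloads.

```python
import time


def scrape_many(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable mapping a URL to raw HTML or None
    (for example get_url_content); the names here are hypothetical.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite to the remote server
        html = fetch(url)
        if html is not None:
            results[url] = html  # skip pages that failed to download
    return results


# Offline usage: a dictionary plays the role of the web, "b" fails
pages = {"a": "<p>A</p>", "b": None, "c": "<p>C</p>"}
downloaded = scrape_many(pages, pages.get, delay_seconds=0)
```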