# Data acquisition
- Large amounts of data are available on the net
- Data processing can be automated
    1. explore the data source
    2. analyze raw data gathered from the source
        - find information
        - define rules to extract it
    3. extract and process information from data with a repeated, automated flow
        1. extract information based on rules defined by analysis
        2. store information
        3. process information
    4. schedule execution
        1. wait for the next loop to start
        2. start the next loop

## Web scraping
- Web scraping is data acquisition where the data is extracted from public web sources
- Exploring and processing the data source
    1. create requests to the data source
    2. examine the HTML/JSON contents of the response
        - find information in the structured response
        - define regular expressions/search flows to find information
    3. extract and process information
        1. make request
        2. follow rules to extract information
        3. convert extracted information to the requested format
        4. augment extracted information with additional (administrative) information
        5. store information
        6. process information
        7. generate output reports
    4. schedule the process using Python or OS functionality
        1. create scheduler
        2. set up scheduler
        3. run scheduler
    
Before doing web scraping:
- always check copyright information
- be aware of attack detection mechanisms and their consequences; frequent requests could:
    - cause the host to crash
    - be identified as a DoS attack
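
The two precautions above can be sketched with the standard library only. This is a minimal illustration, not a complete politeness policy: the robots.txt content is a made-up example (a real one would be downloaded from the target host), `example.org` is a placeholder host, and the `Throttle` class and `build_robot_rules` function are hypothetical helpers.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used so the sketch runs without network access
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a rule object that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds):
        self.min_delay_seconds = min_delay_seconds
        self._last_call = None

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            remaining = self.min_delay_seconds - elapsed
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


rules = build_robot_rules(EXAMPLE_ROBOTS_TXT)
print(rules.can_fetch("*", "https://example.org/arfolyamok"))
print(rules.can_fetch("*", "https://example.org/private/data"))

# Fall back to one second if the site does not declare a crawl delay
throttle = Throttle(min_delay_seconds=rules.crawl_delay("*") or 1.0)
throttle.wait()  # the first call returns immediately
```

Calling `throttle.wait()` before every request keeps the request rate below the declared limit, which reduces the risk of overloading the host or being flagged as an attack.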

Data source: [Foreign currency exchange rates of Magyar Nemzeti Bank (Central Bank of Hungary)](https://www.mnb.hu/arfolyamok)

## Data source analysis
To analyse the structure of the data source, use:
- the integrated developer tools of a browser
- HTTP request debugger tools (like Postman)


Main tasks of analysis:
- examine the structure of the HTTP response, then find and identify the required information
- identify the structure (HTML elements) enclosing that information

In [None]:
import sys
print(sys.executable)

## Requesting information from web
The page is requested over the HTTP protocol using the `requests` library.

In [None]:
!{sys.executable} -m pip install requests

In [None]:
import requests

In [None]:
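URL = 'https://www.mnb.hu/arfolyamok'
# A timeout keeps the notebook from hanging if the server does not answer
page = requests.get(URL, timeout=10)
# Show the first 200 bytes of the raw response body
print(page.content[:200])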
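
Before parsing, it can help to confirm that the request succeeded and actually returned HTML. The helper below is a minimal sketch: `looks_like_html` is a hypothetical name, and it relies only on the `status_code` and `headers` attributes that `requests` response objects provide. Stand-in objects are used so the example runs without network access.

```python
from types import SimpleNamespace


def looks_like_html(response):
    """Return True if the response succeeded and declares an HTML body."""
    content_type = response.headers.get("Content-Type", "")
    return response.status_code == 200 and "text/html" in content_type.lower()


# Stand-in objects mimicking the two attributes used above (not real responses)
ok_response = SimpleNamespace(
    status_code=200, headers={"Content-Type": "text/html; charset=utf-8"}
)
json_response = SimpleNamespace(
    status_code=200, headers={"Content-Type": "application/json"}
)
missing_page = SimpleNamespace(
    status_code=404, headers={"Content-Type": "text/html"}
)

print(looks_like_html(ok_response))
print(looks_like_html(json_response))
print(looks_like_html(missing_page))
```

With the real response this would be called as `looks_like_html(page)`. The built-in alternative `page.raise_for_status()` raises an exception for HTTP error codes instead of returning a flag.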

## HTML decoding
To extract information from a static HTML response, the HTML-formatted data has to be parsed first.

In [None]:
!{sys.executable} -m pip install BeautifulSoup4
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')

## Defining rules and flow to extract information
Once the static HTML page is decoded into Python data structures, the elements holding the information can be located by id, tag name, or CSS class.

In [None]:
results = soup.find(id='fd-arg-IsBlind')
print(results)

In [None]:
data_tables = soup.find_all('table', class_='datatable')
print(len(data_tables))

### Finding HTML table rows containing information

In [None]:
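all_rows = []
for table in data_tables:
    # Data rows live inside the table body
    table_body = table.find('tbody')
    if table_body is None:
        continue  # skip tables without an explicit <tbody>
    rows = table_body.find_all('tr')
    all_rows.extend(rows)
    print(len(rows))
print("sum: ", len(all_rows))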
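
The same row search can be tried offline on a small hand-written fragment. The HTML below is an assumption made for illustration: it only imitates the table / tbody / tr / td nesting that the code above expects, the values are invented, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hand-written sample, NOT copied from the real page; values are invented
SAMPLE_HTML = """
<table class="datatable">
  <tbody>
    <tr><td>EUR</td><td>euro</td><td>1</td><td>400,12</td></tr>
    <tr><td>JPY</td><td>yen</td><td>100</td><td>250,50</td></tr>
  </tbody>
</table>
<table class="other">
  <tbody><tr><td>ignored</td></tr></tbody>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

sample_rows = []
# Only tables carrying the "datatable" class are searched
for table in sample_soup.find_all("table", class_="datatable"):
    body = table.find("tbody")
    if body is None:
        continue
    sample_rows.extend(body.find_all("tr"))

# Reduce every row to the stripped text of its cells
cell_texts = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in sample_rows
]
print(cell_texts)
```

Working on a saved or hand-written fragment like this makes it possible to refine the extraction rules without sending repeated requests to the server.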

## Converting and augmenting acquired information (`items` array)
Finding the HTML cells that hold data, then extracting and converting the data to the required format.  
The gathered data is augmented with a timestamp so that changes over time can be processed.

In [None]:
from datetime import datetime

In [None]:
current_date_and_time = datetime.now()
time_string = current_date_and_time.strftime("%Y/%m/%d %H:%M")
acquired_data = {"timestamp": time_string}

items = []
converters = [
    {"property_name": "code", "method": None},
    {"property_name": "name", "method": None},
    {"property_name": "unit", "method": int},
    {"property_name": "value", "method": float}
]
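for row in all_rows:
    cells = row.find_all('td')
    item = {}
    # zip() stops at the shorter sequence, so extra cells cannot cause an IndexError
    for cell, converter in zip(cells, converters):
        # Decimal commas are replaced with decimal points before conversion
        data = cell.text.strip().replace(',', '.')
        method = converter["method"]
        item[converter["property_name"]] = data if method is None else method(data)
    items.append(item)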

acquired_data["items"] = items
print(acquired_data)
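
The "store information" step is not shown above. One possible approach, sketched here under assumptions, is to append every acquisition result as one line of a JSON Lines file. The function names, the file name `exchange_rates.jsonl`, and the snapshot values are all hypothetical; the snapshot only mirrors the shape of `acquired_data`.

```python
import json
import tempfile
from pathlib import Path


def append_snapshot(path, snapshot):
    """Append one acquisition result to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as storage:
        storage.write(json.dumps(snapshot, ensure_ascii=False) + "\n")


def load_snapshots(path):
    """Read back every stored acquisition result, oldest first."""
    with open(path, encoding="utf-8") as storage:
        return [json.loads(line) for line in storage if line.strip()]


# Made-up snapshot with the same shape as acquired_data (not real rates)
example_snapshot = {
    "timestamp": "2024/01/01 12:00",
    "items": [{"code": "EUR", "name": "euro", "unit": 1, "value": 400.12}],
}

# A temporary folder is used so the sketch leaves no file behind
with tempfile.TemporaryDirectory() as folder:
    storage_path = Path(folder) / "exchange_rates.jsonl"
    append_snapshot(storage_path, example_snapshot)
    append_snapshot(storage_path, example_snapshot)
    stored = load_snapshots(storage_path)

print(len(stored))
print(stored[0]["items"][0]["code"])
```

Appending one line per run keeps earlier results intact, so the timestamps can later be used to follow how the values change over time.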

## Schedule the process using Python scheduler

Before a task can be scheduled, it has to be wrapped in a function that the scheduler can call.
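
As a minimal sketch of such a function, the request, extraction, and storage steps can be combined into one callable that takes no arguments. `make_acquisition_job` is a hypothetical helper, and the `fetch`, `extract`, and `store` arguments used here are stand-ins for the real steps. Errors are caught inside the job because an exception raised by a job would otherwise propagate out of `schedule.run_pending()` and end the scheduling loop.

```python
from datetime import datetime


def make_acquisition_job(fetch, extract, store):
    """Combine the acquisition steps into one zero-argument callable."""

    def job():
        started = datetime.now().strftime("%Y/%m/%d %H:%M")
        try:
            raw = fetch()                # 1. make request
            data = extract(raw)          # 2. extract and convert information
            data["timestamp"] = started  # 3. augment with a timestamp
            store(data)                  # 4. store information
            return True
        except Exception as error:
            # Report and carry on: one failed run should not stop later runs
            print(f"{started} acquisition failed: {error}")
            return False

    return job


# Stand-in steps so the sketch runs without network access
collected = []
my_data_acquisition_method = make_acquisition_job(
    fetch=lambda: "<html>stand-in</html>",
    extract=lambda raw: {"items": [], "raw_length": len(raw)},
    store=collected.append,
)

print(my_data_acquisition_method())
print(collected[0]["raw_length"])
```

In the real notebook the stand-ins would be replaced by functions wrapping the request, parsing, and conversion cells shown earlier.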

In [None]:
!{sys.executable} -m pip install schedule
import schedule
import time

### The data acquisition is scheduled to run every 15 minutes

In [None]:
# my_data_acquisition_method must be a function defined beforehand
schedule.every(15).minutes.do(my_data_acquisition_method)

In [None]:
# Runs until the kernel is interrupted
while True:
    # Execute every job whose scheduled time has arrived
    schedule.run_pending()
    time.sleep(1)

# Homework
## Get currency exchange rates from [XE Currency Charts](https://www.xe.com/currencycharts/)
## Modify first acquisition script to get currency/HUF exchange rates
## Combine both data sources into a single storage