In [None]:
### Some generic imports that will be used throughout the workshop
import datetime as dt
import time
from pathlib import Path
import requests

# 1. APIs and Web Scraping

As humans, we interact with websites by **clicking, scrolling, and typing**.
Computers, however, can’t “see” buttons or menus: they need **structured ways to talk** to the web.

<p align="left">
    <img src="data/assets/user_vs_api.png" alt="user_vs_api" width="200"/>
</p>



There are two main approaches:

1. **APIs**, which are the official, structured conversations.
2. **Web Scraping**, where our program imitates a user and reads information directly from web pages.

APIs are like **asking a waiter for a dish**: you make a polite request and get exactly what you ordered.
Scraping is more like **sneaking into the kitchen** to see what’s cooking.

Both are useful, but APIs are preferred when available: they’re cleaner, faster, and far less likely to break when a website changes.

<p align="left">
    <img src="data/assets/api_vs_scraper.png" alt="api_vs_scraper" width="200"/>
</p>


## 1.1 API Calls

APIs (Application Programming Interfaces) are **the backbone of modern applications**.
Every time you open Spotify, check Google Maps, or order an Uber, your device is making dozens of API calls behind the scenes, quietly fetching playlists, traffic data, or driver locations.

<p align="left">
    <img src="data/assets/api_on_web.png" alt="api_on_web" width="200"/>
</p>


### How REST APIs Work

Most web APIs today follow the **REST** design:

* They use standard HTTP methods (**GET**, **POST**, **PUT**, **DELETE**) to read, create, update, or remove data.
* Each resource (for example, “stations” or “departures”) has its own **URL**.
* Requests and responses usually carry their data as **JSON**.

Example:

```
GET https://api.irail.be/liveboard/?station=Leuven&format=json
```

This returns real-time train departures from Leuven Station.
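
To make the anatomy of that request concrete, here is a minimal sketch that builds the same URL from its parts and then splits it again. It uses only the standard library, and the helper name `build_url` is ours, not part of iRail or `requests`. When you pass `params=` to `requests.get`, the same kind of query-string encoding happens for you.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, path, params):
    """Join a base URL, a resource path and a URL-encoded query string."""
    return f"{base}{path}?{urlencode(params)}"


url = build_url(
    "https://api.irail.be",
    "/liveboard/",
    {"station": "Leuven", "format": "json"},
)
print(url)

# Going the other way: split a URL into the resource and its parameters
parts = urlsplit(url)
print(parts.path)              # the resource being requested
print(parse_qs(parts.query))   # the parameters, as a dict of lists
```

The path identifies *which* resource you want, and the query parameters refine *how* you want it (which station, which format).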

**Why this matters:**
For data scientists, APIs provide **direct, structured access to live data**, without relying on messy or unstable web scraping. They’re the “professional” way for machines to communicate.

### 🚀 **Let’s Code**
We’ll make our first API call using the iRail public API (Belgium’s open train data) and visualize how data travels from the web into our Python program.
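
Before calling the live API, it can help to see what happens to a response once it arrives. The sketch below works on a small hand-written payload that is only *shaped like* a liveboard response (the values are illustrative, not real data). It assumes that `time` is a Unix timestamp and `delay` is given in seconds, and the helper names `departures` and `hhmm` are ours; they are not the `fmt_time` and `fmt_delay` helpers from `utils/`.

```python
import datetime as dt
import json

# Hand-written sample, shaped like a liveboard response; values are illustrative only
SAMPLE = json.loads("""
{
  "station": "Leuven",
  "departures": {
    "number": "1",
    "departure": {"station": "Brussels-Central", "time": "1700000000",
                  "delay": "120", "platform": "3"}
  }
}
""")


def departures(payload):
    """Return the departures as a list, even when only a single entry is present."""
    rows = (payload.get("departures") or {}).get("departure", [])
    return [rows] if isinstance(rows, dict) else rows


def hhmm(epoch_seconds, tz=dt.timezone.utc):
    """Convert a Unix timestamp (string or int) to HH:MM in the given timezone."""
    return dt.datetime.fromtimestamp(int(epoch_seconds), tz=tz).strftime("%H:%M")


for row in departures(SAMPLE):
    delay_min = int(row["delay"]) // 60  # assumes the delay is expressed in seconds
    print(hhmm(row["time"]), row["station"], f"+{delay_min} min")
```

Normalising "one item" and "many items" to a list early on keeps the rest of the code free of special cases.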


In [None]:
from utils.data_formatting import fmt_time, fmt_delay

API_BASE = "https://api.irail.be"
### Define a nice User-Agent to identify your application
UA = "irail-liveboard-swiftbar/1.1 (contact: workshop@riskconcile.com)"

# --- Basic config ---
### Either set STATION_ID (preferred) or STATION name
STATION_ORIGIN_ID = "BE.NMBS.008833001" # Leuven
STATION_ORIGIN = "Leuven"
STATION_DEST_ID = "BE.NMBS.008813003" # Brussels-Central
STATION_DEST = "Brussels-Central"
ARRDEP = "departure"
LANG = "en"
MAX_ROWS = 10

### First we will create a generic function to make requests to the iRail API using variable parameters
def _get(path, params):
    headers = {"User-Agent": UA}
    params = dict({"format": "json", "lang": LANG}, **params)
    r = requests.get(f"{API_BASE}{path}", params=params, headers=headers, timeout=15)
    r.raise_for_status()
    return r.json()

### We now define a function to get the liveboard for our station of interest through the Liveboard endpoint
def get_liveboard():
    ## Your code here
    pass


### Extra: We now want to see if a vehicle passes by a certain station (Brussels-Central)
def passes_by(vehicle_id, target):
    ## Your code here
    pass
# -----------------------------------------------------------------

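try:
    data = get_liveboard() or {}  # the stub returns None until it is implemented
except Exception as e:
    print("🚂 iRail: error")
    print("---")
    print(str(e))
    data = {}  # fall back to an empty payload so the rest of the cell still runs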

key = "departures" if ARRDEP == "departure" else "arrivals"
rows = (data.get(key, {}) or {}).get(key[:-1], [])
if isinstance(rows, dict):
    rows = [rows]

# ------- Extra: apply via filter using vehicle endpoint -------
filtered = []
for item in rows:
    vehicle_id = item.get("vehicle")
    try:
        if passes_by(vehicle_id, STATION_DEST):
            filtered.append(item)
    except Exception:
        # If vehicle lookup fails, just skip that item
        pass
    if len(filtered) >= MAX_ROWS:
        break
rows = filtered
# ------------------------------------------------------------
if len(rows) > 0:
    print(f"🚂 Next Departure: {fmt_time(rows[0].get('time'))}")
    print("---")
else:
    print("No matching trains (via Brussels-Central).")

# Render lines
href = "https://irail.be/"
for item in rows:
    ts = item.get("time")
    when = fmt_time(ts) if ts else "??:??"
    dest = item.get("station", "?")
    platform = item.get("platform")
    if isinstance(platform, dict):
        platform = platform.get("name")
    platform = platform or "?"  # covers a missing platform and a dict without a name
    delay = fmt_delay(item.get("delay", 0))
    canceled = str(item.get("canceled", "0")) in ("1", "true", "True")
    status = "❌" if canceled else ("⏱" if delay else "•")
    print(f"{status} {when}  {dest}  (pf {platform}) {delay} | href={href}")

print("---")
print("Refresh now ↻ | refresh=true")

## 1.2 Web Scraping

Not every website offers an API.
Sometimes, the only way to access information is to **read what’s already visible** on the page, just like a human would.
That’s where **web scraping** comes in.

Scraping isn’t hacking: it’s **automated browsing** within fair-use limits.
Our code acts as a mini-browser: it opens pages, looks for patterns in the HTML, and extracts the data we need.

### Static vs Dynamic Pages

* **Static pages**: all text is already in the HTML. → Tools like `BeautifulSoup` work well.
* **Dynamic pages**: content loads via JavaScript after the page opens. → We need a tool like **Selenium** to drive a real browser.
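
To see why static pages are the easy case, here is a small sketch that pulls items out of an HTML string. It deliberately uses the standard library's `html.parser` instead of `BeautifulSoup` so that it runs without extra installs; the HTML snippet and the `dish` class name are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up static HTML: everything we want is already in the markup
HTML = """
<ul>
  <li class="dish">Spaghetti bolognese</li>
  <li class="dish">Vegetable curry</li>
  <li class="note">Prices may vary</li>
</ul>
"""


class DishParser(HTMLParser):
    """Collect the text of every <li class="dish"> element."""

    def __init__(self):
        super().__init__()
        self.in_dish = False
        self.dishes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "dish") in attrs:
            self.in_dish = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_dish = False

    def handle_data(self, data):
        if self.in_dish and data.strip():
            self.dishes.append(data.strip())


parser = DishParser()
parser.feed(HTML)
print(parser.dishes)
```

On a dynamic page the same parser would find nothing, because the `<li>` elements only exist after the browser has run the page's JavaScript.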

### Meet Selenium

Selenium lets Python **control a browser**: open pages, click buttons, fill forms, and extract text once the data has loaded.
It’s also widely used in automated website testing.

<p align="left">
    <img src="data/assets/selenium_workflow.png" alt="selenium_workflow" width="200"/>
</p>


Typical steps:

1. Launch a browser (`webdriver.Chrome()`).
2. Open a URL (`.get(url)`).
3. Wait for elements to load (`WebDriverWait`).
4. Extract text (`find_element`, `find_elements`).
5. Close the browser (`.quit()`).


Main ways to locate an element on a page (the `By` strategies, passed together with a value to `find_element` or `find_elements`):

1. `(By.ID, "id")`
2. `(By.CLASS_NAME, "name")`
3. `(By.CSS_SELECTOR, "#id .class tag")`
4. `(By.XPATH, "//tag[contains(text(),'text')]")`

Waiting with Selenium:

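1. Implicit waits: a general timeout applied to every element lookup (`driver.implicitly_wait(seconds)`)
2. Explicit waits: wait for a specific condition (`WebDriverWait` combined with `expected_conditions`, usually imported as `EC`)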
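
The five steps, a locator and an explicit wait can be put together in one short function. This is a generic sketch, not the solution to the exercise: the function name `page_heading` is ours, and it assumes Selenium 4 with a local Chrome installation that Selenium can find on its own. The imports sit inside the function so that merely defining it does not start a browser.

```python
def page_heading(url: str, timeout: int = 10) -> str:
    """Open `url` in headless Chrome and return the text of its first <h1>."""
    # Imported here so that defining the function needs neither Selenium nor a browser
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)           # 1. launch a browser
    try:
        driver.get(url)                                  # 2. open a URL
        heading = WebDriverWait(driver, timeout).until(  # 3. explicit wait
            EC.presence_of_element_located((By.TAG_NAME, "h1"))
        )
        return heading.text                              # 4. extract text
    finally:
        driver.quit()                                    # 5. always close the browser
```

Calling `page_heading("https://example.com")` would open the page, wait until an `<h1>` is present, and return its text. The `try`/`finally` makes sure the browser is closed even if the wait times out.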


### 🚀 **Let’s Code**
We’ll use Selenium to open a web page, wait for the content to appear, and extract specific information, just like a robot performing your browser actions automatically.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager # installs the right ChromeDriver automatically.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from utils.date_retriever import get_today_be

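### We define the main function which scrapes the Alma menu for a given date
def scrape_alma_menu(headless: bool = True, date: str | None = None) -> dict[str, list[str]]:
    # Resolve the date at call time; a default of get_today_be() would be evaluated only once, at definition
    if date is None:
        date = get_today_be()
    url = f"https://www.alma.be/en/restaurants/alma-3?date={date}"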

    chrome_opts = webdriver.ChromeOptions()
    if headless:
        chrome_opts.add_argument("--headless=new") # use new headless mode
    chrome_opts.add_argument("--no-sandbox") # required for some environments
    chrome_opts.add_argument("--window-size=1280,1200") # set window size to ensure all elements are visible

    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_opts)

    try:
        driver.get(url)

        # Accept cookies if necessary
        try:
            WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, ".js-cookie-accept"))
            ).click()
        except Exception:
            pass

        # Wait for menu to load
        ## Your code here

        menu_list = {}

        # Extract menu items
        ## Your code here
        
        return menu_list

    finally:
        driver.quit()



menu_list = scrape_alma_menu()

for food, items in menu_list.items():
    line = f"{food}: {', '.join(items)}"
    print(line)
    print("-" * len(line))


# 2. Automatic File Observability

As data scientists, we often receive new data files (daily reports, exports, or client uploads) that we need to process again and again.
Instead of manually running a script each time, we can make our computer **watch a folder and react automatically** when new files appear.
That‚Äôs the idea behind *automatic file observability*.

### Why would we want this?

<p align="left">
    <img src="data/assets/pipeline.avif" alt="pipeline" width="150"/>
</p>

File observability lets us:

* **Automate repetitive ingestion tasks** (no more “rerun the notebook”).
* **React in real time** to new data.
* Build the foundation for **event-driven data pipelines**, where workflows trigger on data arrival instead of on a timer.



### How can we do this?

<p align="left">
    <img src="data/assets/watchdog.png" alt="watchdog" width="250"/>
</p>

The [`watchdog`](https://python-watchdog.readthedocs.io/) library lets Python scripts:

1. **Monitor a directory** for file system events (new, changed, deleted files).
2. **Define custom handlers** for what should happen when those events occur.
3. **Run an observer** that keeps listening in the background.
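
The core idea (notice a change, then call a handler) can be shown without any library at all. The sketch below is a stand-in for illustration, not how `watchdog` is used: it compares two snapshots of a folder, whereas `watchdog` subscribes to operating-system notifications and calls handler methods such as `on_created` for you. The helper names `snapshot` and `new_files` are ours.

```python
import tempfile
from pathlib import Path


def snapshot(folder):
    """Return the set of file names currently in the folder."""
    return {p.name for p in Path(folder).iterdir() if p.is_file()}


def new_files(before, after):
    """Files present now that were not there at the previous snapshot."""
    return sorted(after - before)


with tempfile.TemporaryDirectory() as tmp:
    before = snapshot(tmp)
    # Simulate an invoice being dropped into the watched folder
    (Path(tmp) / "invoice_001.pdf").write_bytes(b"%PDF-1.4")
    created = new_files(before, snapshot(tmp))
    for name in created:
        print(f"created: {name}")  # a real handler would react here
```

Polling like this works, but it wastes effort when nothing changes and reacts only as fast as the polling interval. That is the gap an event-based observer fills.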


### Example: Build Your Own Invoice Bot

<p align="left">
    <img src="data/assets/invoice.webp" alt="invoice" width="250"/>
</p>

We’ll use `watchdog` to build a small **Invoice Bot** that:

1. Watches an `invoices/` folder for new PDF invoices.
2. Waits until each new file is ready.
3. Uses a helper (`extract_invoice_data()`) to read key info and appends it to an Excel overview.

Our focus here is the automation logic, the **watching and reacting** part, not the invoice parsing itself (that code lives in `utils/`).
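
Why wait until a file is "ready"? A creation event can fire while the file is still being copied, so reading it immediately may give you half a PDF. One possible approach, sketched below, is to poll the file size until it stops changing. This is an assumption-laden illustration with a made-up name (`wait_until_stable`); it is not the implementation of `wait_until_file_is_ready` from `utils/`.

```python
import os
import tempfile
import time


def wait_until_stable(path, interval=0.05, checks=3, timeout=5.0):
    """Return True once the file size is unchanged for `checks` consecutive polls."""
    deadline = time.monotonic() + timeout
    last_size, stable = -1, 0
    while time.monotonic() < deadline:
        size = os.path.getsize(path)
        # Count consecutive polls with an unchanged size; reset on any change
        stable = stable + 1 if size == last_size else 0
        if stable >= checks:
            return True
        last_size = size
        time.sleep(interval)
    return False  # gave up: the file kept changing until the timeout


# Demo on a throwaway file that is already fully written
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
    f.write(b"%PDF-1.4 demo")
ready = wait_until_stable(f.name)
os.remove(f.name)
print(ready)
```

The `interval`, `checks` and `timeout` values are arbitrary starting points; slow network drives may need more patience.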





### 🚀 Let’s Code!

Let’s now implement our watcher step by step and see how our script can “notice” new invoices the moment they appear 👇


In [None]:
### We will use watchdog to monitor the directory for new files
from watchdog.events import FileSystemEventHandler, FileCreatedEvent
from watchdog.observers import Observer

### We will use custom utility functions to extract invoice data from pdfs and handle
### different file operations. Since they are not the key focus of this workshop, we
### will not go into their implementation details. But if you are curious, feel free
### to check out the code in the utils/ directory.
from utils.invoice_data_extractor import extract_invoice_data
from utils.file_handling import append_data_to_excel_file, wait_until_file_is_ready

### Let's define the directory to watch and the file path to the invoice data excel 
### overview.
WATCHED_DIR = Path("invoices")
INVOICE_DATA_FILE_PATH = WATCHED_DIR / "invoice_data.xlsx"

### Watchdog lets us define custom functionality for different file system events. File
### system events include file creation, modification, deletion, and movement. Here, we
### will define a custom event handler that reacts to new file creation events.
class InvoiceFileHandler(FileSystemEventHandler):
    pass

### Let's define a function to start watching the directory for new files, and act 
### when the defined file system events occur. We will therefore have to use the 
### custom event handler we defined above, as well as watchdog's own observer.
def start_watching():
    ...

### Finally, let's start the file watching process and see it in action.
start_watching()