## Fetch CSO Data Dump
Fetch the [CSO data dump](https://www.data.cso.ie) and store it in the `artifacts/` directory. The dump can serve as a fallback when the CSO service is down or under maintenance.

Let's get started.

### 1. Fetch All CSO-APIs URLs
First, we collect the URL of every table listed by the CSO `ReadCollection` API.

In [None]:
import json
import requests


start_date = "1900-01-01"
collection_url = f"https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadCollection/{start_date}/en"
r = requests.get(collection_url, timeout=60, verify="/tmp/cso_ca_bundle.pem")
print(r.status_code)

if r.status_code == 200:
    print("Request was successful.")
    collection_dict = r.json()
    print(len(collection_dict['link']['item']), "tables found in Data-CSO.")
else:
    print("Request failed.")
    raise Exception(f"Failed to fetch collection: {r.status_code} - {r.text}")

Request was successful.
12435 tables found in Data-CSO.


In [None]:
# save the collections dict as json in artifacts directory (for reference)
with open("../../artifacts/cso_bkp/collection.json", "w") as f:
    json.dump(collection_dict, f, indent=4)

In [70]:
hrefs = [item['href'] for item in collection_dict['link']['item']]

### 2. Fetch Data-Dump from ALL Data-CSO tables
We parallelise the download with multithreading: the work is I/O-bound, so a single core running 32 threads is enough. Note that this cell makes no attempt to rate-limit or back off, so all 32 workers hit the CSO servers at once (a deliberate thundering herd).


In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import time


hrefs = [item['href'] for item in collection_dict['link']['item']]

# Fetch every table concurrently with a thread pool; tqdm provides the progress bar
def fetch_href_data(href: str) -> tuple:
    """Fetch the JSON-stat payload at href and return it with its table ID and a fetch timestamp."""
    r_json = requests.get(href, timeout=60, verify="/tmp/cso_ca_bundle.pem").json()
    table_id = r_json["extension"]["matrix"]
    timestamp = time.strftime("%d-%m-%Y %I:%M:%S %p", time.localtime())

    return r_json, table_id, timestamp

with ThreadPoolExecutor(max_workers=32) as executor:
    futures = {executor.submit(fetch_href_data, href): href for href in hrefs}
    
    cso_dump = {}

    for future in tqdm(as_completed(futures), total=len(futures)):
        r_json, table_id, timestamp = future.result()
        cso_dump[table_id] = {
            "timestamp": timestamp,
            "data": r_json
        }


100%|██████████| 12435/12435 [15:56<00:00, 13.01it/s] 
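
The cell above stops at the first failed request, because `future.result()` re-raises any exception from the worker. A minimal sketch of a more tolerant variant follows, assuming it is acceptable to record failed URLs and retry them later. The names `fetch_all` and `fake_fetch` are hypothetical and not part of the original notebook; a real fetcher would wrap `requests.get(...).json()`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def fetch_all(
    urls: Iterable[str],
    fetch_one: Callable[[str], dict],
    max_workers: int = 8,
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run fetch_one over urls in a thread pool, separating successes from failures."""
    results: Dict[str, dict] = {}
    failures: Dict[str, str] = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going instead of aborting the whole run
                failures[url] = repr(exc)

    return results, failures


# Stand-in fetcher for illustration only (hypothetical, no network access)
def fake_fetch(url: str) -> dict:
    if url.endswith("bad"):
        raise ValueError("simulated failure")
    return {"href": url}


ok, failed = fetch_all(["a", "b", "bad"], fake_fetch, max_workers=2)
print(sorted(ok), sorted(failed))
```

The `failed` mapping can then be fed back into `fetch_all` for a second pass, ideally with a smaller `max_workers` to go easier on the server.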


### 3. Save the CSO-Data dump in SQLite DB
We save the dump in SQLite rather than as a single JSON file to save space: the JSON file would take ~6.2 GB, whereas the same data stored in SQLite takes ~285 MB. Kudos, SQLite.

We use a custom `JSONStatArchiveDB` class to read and write the CSO data in the local archive.
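
The internals of `JSONStatArchiveDB` are not shown in this notebook. To illustrate the general idea of storing compressed JSON payloads in SQLite, here is a minimal sketch that uses only the standard library. It is an assumption-laden stand-in, not the class's actual implementation: the `archive` table schema, the helper names `write_archive` and `read_table`, and the use of `zlib` are all hypothetical choices made for this example.

```python
import json
import sqlite3
import zlib


def write_archive(conn: sqlite3.Connection, dump: dict, level: int = 9) -> None:
    """Store each table's JSON payload as a zlib-compressed BLOB (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive ("
        "table_id TEXT PRIMARY KEY, fetched_at TEXT, payload BLOB)"
    )
    rows = [
        (
            table_id,
            entry["timestamp"],
            zlib.compress(json.dumps(entry["data"]).encode("utf-8"), level),
        )
        for table_id, entry in dump.items()
    ]
    conn.executemany("INSERT OR REPLACE INTO archive VALUES (?, ?, ?)", rows)
    conn.commit()


def read_table(conn: sqlite3.Connection, table_id: str) -> dict:
    """Decompress and decode a single table's payload."""
    row = conn.execute(
        "SELECT payload FROM archive WHERE table_id = ?", (table_id,)
    ).fetchone()
    if row is None:
        raise KeyError(table_id)
    return json.loads(zlib.decompress(row[0]).decode("utf-8"))


# Tiny in-memory demo with made-up data shaped like cso_dump
conn = sqlite3.connect(":memory:")
sample = {
    "DEMO1": {
        "timestamp": "01-01-2024 12:00:00 PM",
        "data": {"value": [0.0] * 1000},
    }
}
write_archive(conn, sample)
restored = read_table(conn, "DEMO1")

raw_size = len(json.dumps(sample["DEMO1"]["data"]).encode("utf-8"))
stored_size = conn.execute("SELECT length(payload) FROM archive").fetchone()[0]
print(raw_size, stored_size)
```

Repetitive payloads such as long runs of identical values compress well, which is one plausible reason a compressed archive ends up much smaller than the raw JSON.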

In [108]:
from pathlib import Path
import os

root = Path().absolute().parents[1]
os.chdir(str(root))

from src.helpers.json_stat_archive_db import JSONStatArchiveDB

In [None]:
# Write the CSO data into the SQLite database
db = JSONStatArchiveDB(compression_level=12)
db.write("artifacts/cso_bkp/cso_archive/jsonstat_archive.sqlite", cso_dump)

### 4. Example: reading a CSO-file from the SQLite DB
To read a single table from the SQLite DB, for example the table ID `PCA23`, run the following:

In [113]:
cso_file = {}

for tid, ds, ts in db.read("artifacts/cso_bkp/cso_archive/jsonstat_archive.sqlite", table_id="PCA23", with_labels=True):
    cso_file = ds

In [None]:
from pyjstat import pyjstat

df = pyjstat.from_json_stat(cso_file)
df[0]

Unnamed: 0,Statistic,Year,Product,value
0,Prodcom Sales 2023,2023,07101010 Iron ores and concentrates. Non-agglo...,0.0
1,Prodcom Sales 2023,2023,07101020 Iron ores and concentrates. Agglomera...,-99999999.0
2,Prodcom Sales 2023,2023,07291100 Copper ores and concentrates (kg),0.0
3,Prodcom Sales 2023,2023,07291200 Nickel ores and concentrates (kg),0.0
4,Prodcom Sales 2023,2023,07291300 Aluminium ores and concentrates (kg),-99999999.0
...,...,...,...,...
7945,Prodcom Sales 2023 (Volume),2023,38322902 Secondary raw materials of lithium (kg),0.0
7946,Prodcom Sales 2023 (Volume),2023,38322903 Secondary raw materials of rare earth...,0.0
7947,Prodcom Sales 2023 (Volume),2023,38322910 Secondary raw materials of other meta...,0.0
7948,Prodcom Sales 2023 (Volume),2023,38322940 Slag sands (kg),0.0


To read ALL the CSO tables from the SQLite DB, run the following:

In [None]:
cso_files = {}

# Assumption: omitting table_id makes read() iterate over every archived table
for tid, ds, ts in db.read("artifacts/cso_bkp/cso_archive/jsonstat_archive.sqlite", with_labels=True):
    cso_files[tid] = {
        "data": ds,
        "timestamp": ts,
    }
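
Since the archive is meant as a fallback for when the CSO service is unavailable, the lookup logic could be sketched roughly as below. This is only an illustration under stated assumptions: `load_table`, `fetch_live`, and `read_archived` are hypothetical names, and the two callables stand in for a live `requests` call and a `db.read(...)` lookup respectively.

```python
from typing import Callable, Tuple


def load_table(
    table_id: str,
    fetch_live: Callable[[str], dict],
    read_archived: Callable[[str], dict],
) -> Tuple[dict, str]:
    """Return (dataset, source), preferring the live API over the local archive."""
    try:
        return fetch_live(table_id), "live"
    except Exception:
        # Any live failure (timeout, HTTP error, bad JSON) falls through to the archive
        return read_archived(table_id), "archive"


# Stand-ins for illustration: the live call always fails, the archive holds one table
def broken_live(table_id: str) -> dict:
    raise ConnectionError("CSO unreachable")


archive = {"PCA23": {"class": "dataset"}}
dataset, source = load_table("PCA23", broken_live, archive.__getitem__)
print(source)
```

Returning the source alongside the dataset lets callers warn users when they are looking at archived rather than live data.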