# API Ingest → AWS S3 (BLS/DataUSA)
This notebook automates the retrieval of BLS productivity data and stores it in Amazon S3. It uses the BLS Public API to retrieve data.
This sync version keeps S3 matched with the BLS website and automatically detects additions, updates, and deletions:
* New files → get added.
* Changed files → get updated.
* Deleted files → get removed.
  
[View Notebook (Foundational Version)](https://github.com/ScottySchmidt/AWS_DataEngineer_API/blob/main/01-ingest-apis-to-s3.ipynb)
The foundational version laid the groundwork for the improved **Sync Version**, which now mirrors the full BLS directory with adds/updates/deletes and supports optional targeted series syncs.

### What's Covered
- **Automated sync** from the BLS data source to S3.
- **No hardcoded file names** – dynamically scrapes the BLS file list.
- **403 error handling** – uses a valid User-Agent to comply with BLS access policy.
- **Cloud-based execution** – runs in Kaggle with secure secret management.
- **Secrets used** – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME, BLS_API_KEY.
- **Duplicate protection** – checks content hashes before uploading.
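As a rough illustration of the "no hardcoded file names" idea, the sketch below pulls file names out of a directory-listing page. The sample HTML is made up for the example and is not copied from the BLS site. `extract_filenames` is a hypothetical helper, not the notebook's own function. It assumes the listing may use uppercase attributes or absolute paths, so it matches case-insensitively and keeps only the last path segment.

```python
import posixpath
import re
from urllib.parse import urlparse

# Hypothetical listing HTML for illustration only.
SAMPLE_HTML = '''
<a href="/pub/time.series/">[To Parent Directory]</a>
<A HREF="/pub/time.series/pr/pr.class">pr.class</A>
<a href="pr.data.0.Current">pr.data.0.Current</a>
<a href="?C=N;O=D">Name</a>
<a href="subdir/">subdir/</a>
'''

HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)
ALLOW = re.compile(r"^[A-Za-z0-9._-]+$")  # same allow-list pattern the notebook uses


def extract_filenames(html: str) -> list[str]:
    """Return plain file names found in href attributes, in page order."""
    names = []
    for href in HREF.findall(html):
        path = urlparse(href).path
        # Skip sort links (no path) and directories (trailing slash).
        if not path or path.endswith("/"):
            continue
        name = posixpath.basename(path)
        if ALLOW.match(name) and name not in names:
            names.append(name)
    return names


print(extract_filenames(SAMPLE_HTML))
```

If a scrape ever returns an empty list, it is probably safer to stop than to continue, since a full sync would then treat every stored file as removed upstream.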

### How It Works
1. Fetch the current list of files from the BLS public directory.
2. Download each file and compare its hash to the version in S3.
3. Upload new or changed files to the configured S3 bucket.
4. Skip unchanged files to save bandwidth and storage.
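The steps above can be sketched as a pure function that decides what to do with each file, with no network or S3 calls. `plan_sync` and its inputs are hypothetical names used only for this illustration. It assumes the upstream file bodies and the MD5 hashes of the stored copies have already been collected.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hash file contents so two versions can be compared cheaply."""
    return hashlib.md5(data).hexdigest()


def plan_sync(upstream: dict[str, bytes], stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Classify each file name as add, update, skip, or delete."""
    plan = {"add": [], "update": [], "skip": [], "delete": []}
    for name, body in sorted(upstream.items()):
        if name not in stored_hashes:
            plan["add"].append(name)        # new upstream file
        elif stored_hashes[name] != md5_hex(body):
            plan["update"].append(name)     # content changed
        else:
            plan["skip"].append(name)       # identical, nothing to upload
    # Anything stored but no longer listed upstream gets removed.
    plan["delete"] = sorted(set(stored_hashes) - set(upstream))
    return plan


# Made-up file contents for the example.
upstream = {"pr.class": b"a", "pr.series": b"new", "pr.txt": b"same"}
stored = {
    "pr.series": md5_hex(b"old"),
    "pr.txt": md5_hex(b"same"),
    "pr.old": md5_hex(b"x"),
}
plan = plan_sync(upstream, stored)
print(plan)
```

Keeping the decision separate from the S3 calls makes the add/update/skip/delete logic easy to check before any object is written or removed.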

## Connect to AWS S3
This notebook uses the following Python packages:  
- boto3  
- requests  
- hashlib and json (standard library)  
- kaggle_secrets, which supplies these secrets: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME, BLS_API_KEY
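Because `kaggle_secrets` only exists inside Kaggle, a small fallback can make the same code usable elsewhere (for example on AWS Lambda, where settings usually come from environment variables). The helper below is a minimal sketch. `get_secret` is a hypothetical name, and the broad `except` is a deliberate simplification for the example.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Kaggle if available, otherwise from the environment."""
    try:
        # Importable only inside a Kaggle session.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Not on Kaggle, or the secret is not attached: fall back to env vars.
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"missing secret: {name}")
        return value


# Example with a throwaway variable; real secret values should never be printed.
os.environ["DEMO_SECRET"] = "example-value"
print(get_secret("DEMO_SECRET") == "example-value")
```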

In [1]:
import boto3
import requests
import hashlib
import json
from kaggle_secrets import UserSecretsClient

# Load AWS and BLS secrets
secrets = UserSecretsClient()
API_KEY = secrets.get_secret("BLS_API_KEY")
AWS_ACCESS_KEY_ID = secrets.get_secret("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = secrets.get_secret("AWS_SECRET_ACCESS_KEY")
AWS_REGION = secrets.get_secret("AWS_REGION")
BUCKET_NAME = secrets.get_secret("BUCKET_NAME")

# Setup AWS session and S3
session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)
s3 = session.client("s3")

# Test connection WITHOUT revealing keys
try:
    response = s3.list_objects_v2(Bucket=BUCKET_NAME)
    num_files = response.get('KeyCount', 0)
    print("S3 connection successful. Bucket contains: ", num_files)
except Exception as e:
    print("S3 connection failed: ", e)

S3 connection successful. Bucket contains:  42


## Fetch BLS Data via API Key
- Authenticate with a registered BLS API key to comply with access policies.  
- Retrieve U.S. inflation data programmatically through the BLS Public API.  
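An HTTP 200 does not by itself mean the request was processed, because the v2 response body carries its own `status` field. The sketch below checks that field and flattens the series into rows. `parse_bls_response` is a hypothetical helper, and the sample payload is a trimmed illustration that reuses two values shown later in this notebook. It assumes a missing observation may arrive as a non-numeric string and maps it to `None`.

```python
def parse_bls_response(payload: dict) -> list[dict]:
    """Validate a BLS v2 response body and flatten it into one row per observation."""
    if payload.get("status") != "REQUEST_SUCCEEDED":
        raise RuntimeError(f"BLS request failed: {payload.get('message')}")
    rows = []
    for series in payload.get("Results", {}).get("series", []):
        for point in series.get("data", []):
            try:
                value = float(point["value"])
            except ValueError:
                value = None  # non-numeric placeholder for a missing observation
            rows.append({
                "series_id": series.get("seriesID"),
                "year": int(point["year"]),
                "period": point["period"],
                "value": value,
            })
    return rows


# Trimmed, illustrative payload in the shape the API returns.
sample = {
    "status": "REQUEST_SUCCEEDED",
    "message": [],
    "Results": {"series": [{
        "seriesID": "CUUR0000SA0",
        "data": [
            {"year": "2024", "period": "M12", "value": "315.605", "footnotes": [{}]},
            {"year": "2024", "period": "M11", "value": "315.493", "footnotes": [{}]},
        ],
    }]},
}
rows = parse_bls_response(sample)
print(rows[0])
```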

In [2]:
# API payload
headers = {'Content-type': 'application/json'}
data = json.dumps({
    "seriesid": ["CUUR0000SA0", "SUUR0000SA0"],  # CPI-U and Chained CPI-U, all items; customize as needed
    "startyear": "2020",
    "endyear": "2024",
    "registrationkey": API_KEY
})

# Make request
response = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    data=data,
    headers=headers
)

if response.status_code == 200:
    results = response.json()
    
    # Save locally
    with open("bls_data.json", "w") as f:
        json.dump(results, f, indent=2)
    
    # Upload JSON file to S3 bucket
    s3.put_object(
        Bucket=BUCKET_NAME,
        Key="bls_data.json",
        Body=json.dumps(results, indent=2)
    )
    
    print("Uploaded bls_data.json to S3.")
else:
    print("Error: ", response.status_code)
    print(response.text)

Uploaded bls_data.json to S3.


## API Data Pipeline – Send Files to S3 (No API Key Needed)
- **No hardcoded file names** – script adapts to new or removed files automatically.  
- **Custom User-Agent** to comply with BLS access rules and avoid 403 errors.  
- **Checks for changes** by comparing file hashes before uploading.  
- **Uploads only when updated**, reducing bandwidth usage and S3 storage costs.  
- **Syncs with source** – handles added or removed files, and avoids re-uploading duplicates.
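The notebook compares hashes by downloading the stored object. One alternative is to compare against the object's ETag, which avoids that download. This is a different technique from the one the notebook uses. It relies on the assumption that the object was written with a single `put_object` call and is not encrypted with SSE-KMS, in which case the ETag is the MD5 of the body. `etag_matches` is a hypothetical helper.

```python
import hashlib


def etag_matches(etag: str, body: bytes) -> bool:
    """Return True when an S3 ETag equals the MD5 of the candidate body."""
    clean = etag.strip('"')      # S3 returns the ETag wrapped in double quotes
    if "-" in clean:
        # Multipart uploads produce "<hash>-<parts>", which is not a plain MD5.
        return False
    return clean.lower() == hashlib.md5(body).hexdigest()


body = b"series_id\tyear\tperiod\tvalue\n"
stored_etag = '"' + hashlib.md5(body).hexdigest() + '"'
print(etag_matches(stored_etag, body))          # same content
print(etag_matches(stored_etag, body + b"x"))   # changed content
```

In practice the ETag would come from `S3.head_object(Bucket=..., Key=...)["ETag"]`. When the check cannot be trusted (multipart or KMS-encrypted objects), falling back to the full download-and-hash comparison is the safer choice.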

In [3]:
import os, re, json, hashlib, requests, boto3, botocore

# AWS / env
S3          = boto3.client("s3", region_name=os.getenv("AWS_REGION", "us-east-1"))
# BUCKET_NAME is reused from cell 1; on Lambda, read it from the environment instead:
# BUCKET_NAME = os.environ["BUCKET_NAME"]
USER_AGENT  = os.getenv("USER_AGENT", "ScottSchmidt/1.0 (email)")
SERIES_IDS  = [s.strip() for s in os.getenv("SERIES_IDS", "").split(",") if s.strip()]
BLS_API_KEY = os.getenv("BLS_API_KEY")  # only needed if SERIES_IDS is set

# Bulk sync (BLS pr/ directory)
BASE_URL   = "https://download.bls.gov/pub/time.series/pr/"
BLS_PREFIX = os.getenv("BLS_PREFIX", "bls/pr/")
ALLOW      = re.compile(r"^[A-Za-z0-9._-]+$")

def md5_text(t: str) -> str:
    return hashlib.md5(t.encode("utf-8")).hexdigest()

def md5_bytes(b: bytes) -> str:
    return hashlib.md5(b).hexdigest()

def list_upstream_files():
    r = requests.get(BASE_URL, headers={"User-Agent": USER_AGENT}, timeout=30)
    r.raise_for_status()
    hrefs = re.findall(r'href="([^"]+)"', r.text)
    return [h for h in hrefs if ALLOW.match(h)]

def list_s3_keys(prefix):
    keys, paginator = [], S3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix=prefix):
        for obj in (page.get("Contents") or []):
            keys.append(obj["Key"])
    return keys

def sync_bulk_file(name: str):
    key = f"{BLS_PREFIX}{name}"
    r = requests.get(BASE_URL + name, headers={"User-Agent": USER_AGENT}, timeout=60)

    if r.status_code == 404:
        try:
            S3.delete_object(Bucket=BUCKET_NAME, Key=key)
            print(f"deleted: {name}")
            return "deleted"
        except botocore.exceptions.ClientError as e:
            if e.response["Error"]["Code"] != "NoSuchKey":
                raise
            return "absent"

    r.raise_for_status()
    content = r.text
    new_hash = md5_text(content)

    try:
        obj = S3.get_object(Bucket=BUCKET_NAME, Key=key)
        if md5_text(obj["Body"].read().decode("utf-8")) == new_hash:
            print(f"no-change: {name}")
            return "unchanged"
        S3.put_object(Bucket=BUCKET_NAME, Key=key, Body=content, ContentType="text/plain")
        print(f"updated: {name}")
        return "updated"
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            S3.put_object(Bucket=BUCKET_NAME, Key=key, Body=content, ContentType="text/plain")
            print(f"added: {name}")
            return "added"
        raise

def run_bulk_full_sync():
    upstream = set(list_upstream_files())
    s3_files = set(k[len(BLS_PREFIX):] for k in list_s3_keys(BLS_PREFIX) if k.startswith(BLS_PREFIX))

    added = updated = unchanged = 0
    for name in sorted(upstream):
        res = sync_bulk_file(name)
        if res == "added": added += 1
        elif res == "updated": updated += 1
        elif res == "unchanged": unchanged += 1

    extras = s3_files - upstream
    deleted = 0
    for name in sorted(extras):
        S3.delete_object(Bucket=BUCKET_NAME, Key=f"{BLS_PREFIX}{name}")
        print(f"deleted-extra: {name}")
        deleted += 1

    return {"added": added, "updated": updated, "unchanged": unchanged, "deleted": deleted}

def run_series_sync(series_ids):
    if not series_ids:
        return {"upserted": 0, "unchanged": 0, "skipped": True}
    if not BLS_API_KEY:
        return {"error": "BLS_API_KEY missing", "upserted": 0, "unchanged": 0}

    upserted = unchanged = 0
    for sid in series_ids:
        print(f"series: {sid}")
        payload = json.dumps({"seriesid": [sid], "registrationkey": BLS_API_KEY})
        r = requests.post(
            "https://api.bls.gov/publicAPI/v2/timeseries/data/",
            data=payload,
            headers={"Content-type": "application/json"},
            timeout=60,
        )
        if r.status_code != 200:
            print(f"series-fail: {sid} {r.status_code}")
            continue

        content = r.text.encode("utf-8")
        key = f"bls/series/{sid}.json"

        try:
            obj = S3.get_object(Bucket=BUCKET_NAME, Key=key)
            if md5_bytes(obj["Body"].read()) == md5_bytes(content):
                print(f"series-no-change: {sid}")
                unchanged += 1
                continue
        except botocore.exceptions.ClientError as e:
            if e.response["Error"]["Code"] != "NoSuchKey":
                raise

        S3.put_object(Bucket=BUCKET_NAME, Key=key, Body=content, ContentType="application/json")
        print(f"series-upserted: {sid}")
        upserted += 1

    return {"upserted": upserted, "unchanged": unchanged}

def lambda_handler(event, context):
    print("start")
    bulk   = run_bulk_full_sync()
    series = run_series_sync(SERIES_IDS)
    summary = {"bulk": bulk, "series": series}
    print(json.dumps(summary))
    print("done")
    return {"statusCode": 200, "body": json.dumps(summary)}


In [4]:
import os, re, hashlib, requests, boto3, botocore
print("Starting..")

# S3 client (defaults to us-east-1 if AWS_REGION not set)
S3 = boto3.client("s3", region_name=os.getenv("AWS_REGION", "us-east-1"))

# CONFIG
PREFIX = os.getenv("BLS_PREFIX", "bls/pr/")  # optional subfolder in S3
BASE   = "https://download.bls.gov/pub/time.series/pr/"
HDRS   = {"User-Agent": "ScottSchmidt/1.0 (scott.schmidt1989@yahoo.com)"}

# Only allow normal file names (skip folders, junk, or query strings)
ALLOW = re.compile(r"^[A-Za-z0-9._-]+$")


def list_source_files():
    """Get the list of all files available in the BLS directory."""
    resp = requests.get(BASE, headers=HDRS, timeout=30)
    resp.raise_for_status()
    hrefs = re.findall(r'href="([^"]+)"', resp.text)
    return [h for h in hrefs if ALLOW.match(h)]


def list_s3_keys(prefix):
    """Get all files currently stored in S3 under our prefix."""
    keys = []
    paginator = S3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def md5_hex(text: str) -> str:
    """Quick way to hash file contents so we can check for changes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sync_one(filename: str):
    """Sync a single file between BLS and S3 (add, update, or delete)."""
    url = BASE + filename
    key = f"{PREFIX}{filename}"

    r = requests.get(url, headers=HDRS, timeout=60)

    # If file is gone upstream, remove it from S3 too
    if r.status_code == 404:
        try:
            S3.delete_object(Bucket=BUCKET_NAME, Key=key)
            print(f"Deleted from S3 (removed at source): {filename}")
        except botocore.exceptions.ClientError as e:
            if e.response["Error"]["Code"] == "NoSuchKey":
                print(f"Already gone in S3: {filename}")
            else:
                raise
        return

    # File exists → compare hashes
    r.raise_for_status()
    content = r.text
    new_hash = md5_hex(content)

    try:
        obj = S3.get_object(Bucket=BUCKET_NAME, Key=key)
        old_hash = md5_hex(obj["Body"].read().decode("utf-8"))
        if new_hash == old_hash:
            print(f"No change: {filename}")
            return
        # Content changed → update S3
        S3.put_object(Bucket=BUCKET_NAME, Key=key, Body=content)
        print(f"Updated: {filename}")
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            # New file → add it to S3
            S3.put_object(Bucket=BUCKET_NAME, Key=key, Body=content)
            print(f"Added: {filename}")
        else:
            raise


def lambda_handler(event, context):
    # 1) List all files from BLS (the source of truth)
    source_files = set(list_source_files())

    # 2) List all files we currently have in S3
    s3_keys = list_s3_keys(PREFIX)
    s3_files = set(k[len(PREFIX):] for k in s3_keys if k.startswith(PREFIX))

    # 3) Add or update anything that exists upstream
    for name in sorted(source_files):
        sync_one(name)

    # 4) Remove anything in S3 that no longer exists upstream
    extras = s3_files - source_files
    for name in sorted(extras):
        key = f"{PREFIX}{name}"
        S3.delete_object(Bucket=BUCKET_NAME, Key=key)
        print(f"Deleted extra (not in source): {name}")

    return {"statusCode": 200, "body": "Full sync complete"}
print("Done.")

Starting..
Done.


## Preview BLS Data from S3
This section retrieves a specific BLS JSON file from Amazon S3, 
converts it into a Pandas DataFrame, reorders the columns, 
and displays the first few rows for inspection.
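Once the rows are loaded, the `year` and `period` strings usually need to be turned into dates before plotting or joining. The sketch below does that in plain Python, using two values from this notebook's output plus one made-up annual row. `to_monthly_points` is a hypothetical helper. It assumes monthly periods are coded `M01` to `M12` and skips `M13`, which BLS uses for annual averages.

```python
from datetime import date


def to_monthly_points(rows):
    """Convert BLS rows into (first-of-month date, float value) pairs, oldest first."""
    points = []
    for row in rows:
        period = row["period"]
        # Keep monthly observations only; M13 is the annual average.
        if not period.startswith("M") or period == "M13":
            continue
        month_start = date(int(row["year"]), int(period[1:]), 1)
        points.append((month_start, float(row["value"])))
    return sorted(points)


sample_rows = [
    {"year": "2024", "period": "M12", "value": "315.605"},
    {"year": "2024", "period": "M11", "value": "315.493"},
    {"year": "2024", "period": "M13", "value": "0"},  # made-up annual row, skipped
]
points = to_monthly_points(sample_rows)
print(points)
```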

In [5]:
import pandas as pd

# Load the file content 
key = "CUUR0000SA0.json"  # alternative series: SUUR0000SA0
obj = s3.get_object(Bucket=BUCKET_NAME, Key=key)
json_content = json.loads(obj['Body'].read().decode('utf-8'))

# Extract data into DataFrame
series_data = json_content['Results']['series'][0]['data']
df = pd.DataFrame(series_data)
df = df[["year", "period", "value", "footnotes"]]
print("DataFrame Shape: ", df.shape)
df.head(20)

DataFrame Shape:  (60, 4)


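|   | year | period | value | footnotes |
|---|------|--------|---------|-----------|
| 0 | 2024 | M12 | 315.605 | [{}] |
| 1 | 2024 | M11 | 315.493 | [{}] |
| 2 | 2024 | M10 | 315.664 | [{}] |
| 3 | 2024 | M09 | 315.301 | [{}] |
| 4 | 2024 | M08 | 314.796 | [{}] |
| 5 | 2024 | M07 | 314.54 | [{}] |
| 6 | 2024 | M06 | 314.175 | [{}] |
| 7 | 2024 | M05 | 314.069 | [{}] |
| 8 | 2024 | M04 | 313.548 | [{}] |
| 9 | 2024 | M03 | 312.332 | [{}] |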
