# API Ingest → AWS S3 (BLS/DataUSA)
This notebook automates the retrieval of BLS productivity data and stores it in Amazon S3. It uses the BLS Public API to retrieve data.
This sync version keeps S3 matched with the BLS website by automatically detecting new, changed, and deleted files:
* New files → get added.
* Changed files → get updated.
* Deleted files → get removed.

### What's Covered
- **Automated sync** from the data API source to S3.
- **No hardcoded file names** – dynamically scrapes the BLS file list.
- **403 error handling** – uses a valid User-Agent to comply with BLS access policy.
- **Cloud-based execution** – runs in Kaggle with secure secret management.
- **Secrets used** – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME, BLS_API_KEY.
- **Duplicate protection** – checks content hashes before uploading.

### How It Works
1. Fetch the current list of files from the BLS public directory.
2. Download each file and compare its hash to the version in S3.
3. Upload new or changed files to the configured S3 bucket.
4. Skip unchanged files to save bandwidth and storage.
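
The add/update/skip/delete decision described above can be sketched as a pure function over two name-to-hash mappings. This is only an illustration of the comparison logic, not the notebook's actual code: `plan_sync` and the file names and hash strings in the example are hypothetical.

```python
def plan_sync(source_hashes, s3_hashes):
    """Classify file names by comparing two {name: content_hash} mappings."""
    plan = {"add": [], "update": [], "skip": [], "delete": []}
    for name, digest in sorted(source_hashes.items()):
        if name not in s3_hashes:
            plan["add"].append(name)       # new upstream file
        elif s3_hashes[name] != digest:
            plan["update"].append(name)    # content changed upstream
        else:
            plan["skip"].append(name)      # identical, nothing to upload
    # Anything in S3 that upstream no longer lists is a deletion candidate
    plan["delete"] = sorted(set(s3_hashes) - set(source_hashes))
    return plan


# Hypothetical names and hashes, for illustration only
example_plan = plan_sync(
    {"pr.txt": "aaa", "pr.series": "bbb", "pr.class": "ccc"},
    {"pr.txt": "aaa", "pr.series": "old", "pr.stale": "ddd"},
)
print(example_plan)
```

Keeping the decision separate from the network calls makes it easy to check the sync rules without touching BLS or S3.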

## Connect to AWS S3
This notebook uses the following Python modules:  
- boto3  
- requests  
- hashlib and json (standard library)  
- kaggle_secrets, which supplies the secrets AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME, BLS_API_KEY

In [1]:
import boto3
import requests
import hashlib
import json
from kaggle_secrets import UserSecretsClient

# Load AWS secrets
secrets = UserSecretsClient()
API_KEY = secrets.get_secret("BLS_API_KEY")
AWS_ACCESS_KEY_ID = secrets.get_secret("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = secrets.get_secret("AWS_SECRET_ACCESS_KEY")
AWS_REGION = secrets.get_secret("AWS_REGION")
BUCKET_NAME = secrets.get_secret("BUCKET_NAME")

# Setup AWS session and S3
session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)
s3 = session.client("s3")

# Test connection WITHOUT revealing keys
try:
    response = s3.list_objects_v2(Bucket=BUCKET_NAME)
    num_files = response.get('KeyCount', 0)
    print("S3 connection successful. Bucket contains: ", num_files)
except Exception as e:
    print("S3 connection failed: ", e)

S3 connection successful. Bucket contains:  40
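
A single `list_objects_v2` call returns at most 1,000 keys, so the `KeyCount` printed above would undercount a larger bucket. A minimal sketch of summing the count across pages follows. `count_objects` is a hypothetical helper and the pages here are stand-in dictionaries, not real S3 responses.

```python
def count_objects(pages):
    """Sum KeyCount across list_objects_v2 response pages."""
    return sum(page.get("KeyCount", 0) for page in pages)


# With boto3 the pages would come from a paginator, for example:
#   pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET_NAME)
fake_pages = [{"KeyCount": 1000}, {"KeyCount": 40}]  # stand-in data
total = count_objects(fake_pages)
print(total)
```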


## Fetch BLS Data via API Key
- Authenticate with a registered BLS API key to comply with access policies.  
- Retrieve U.S. inflation (Consumer Price Index) series programmatically through the BLS Public API.  

In [2]:
# API payload
headers = {'Content-type': 'application/json'}
data = json.dumps({
    "seriesid": ["CUUR0000SA0", "SUUR0000SA0"],  # You can customize this
    "startyear": "2020",
    "endyear": "2024",
    "registrationkey": API_KEY
})

# Make request
response = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    data=data,
    headers=headers
)

if response.status_code == 200:
    results = response.json()
    
    # Save locally
    with open("bls_data.json", "w") as f:
        json.dump(results, f, indent=2)
    
    # Upload JSON file to S3 bucket
    s3.put_object(
        Bucket=BUCKET_NAME,
        Key="bls_data.json",
        Body=json.dumps(results, indent=2)
    )
    
    print("Uploaded bls_data.json to S3.")
else:
    print("Error: ", response.status_code)
    print(response.text)

Uploaded bls_data.json to S3.
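
An HTTP 200 alone may not mean the request was processed: BLS responses carry their own `status` and `message` fields in the JSON body, and a rejected request (for example, over the daily limit) can still arrive with status code 200. The sketch below checks that field before anything is uploaded. `bls_succeeded` and `bls_messages` are hypothetical helpers, and the sample payloads are illustrative, not captured API output.

```python
def bls_succeeded(payload):
    """Return True when the BLS response body reports success."""
    return payload.get("status") == "REQUEST_SUCCEEDED"


def bls_messages(payload):
    """Return any warning or error messages included in the response body."""
    return list(payload.get("message", []))


# Illustrative payloads (assumed shape, not real responses)
ok_payload = {"status": "REQUEST_SUCCEEDED", "message": [], "Results": {"series": []}}
bad_payload = {"status": "REQUEST_NOT_PROCESSED", "message": ["example error text"]}

print(bls_succeeded(ok_payload), bls_messages(ok_payload))
print(bls_succeeded(bad_payload), bls_messages(bad_payload))
```

In the cell above, this check would sit right after `results = response.json()`, so that an error body is never written to S3 as if it were data.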


## API Data Pipeline – Send Files to S3 (No API Key Needed)
- **Custom User-Agent** to comply with BLS access rules and avoid 403 errors.  
- **Checks for changes** by comparing file hashes before uploading.  
- **Uploads only when updated**, reducing bandwidth usage and S3 storage costs.  

In [3]:
import os, re, hashlib, requests, boto3, botocore
print("Starting..")

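# Reuse the authenticated session from cell 1; a bare boto3.client("s3") would
# need credentials in the environment, which Kaggle does not provide by default.
S3 = session.client("s3")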

# -------- CONFIG --------
BUCKET = BUCKET_NAME  # bucket name loaded from Kaggle secrets in cell 1
PREFIX = os.getenv("BLS_PREFIX", "bls/pr/")  # optional subfolder in S3
BASE  = "https://download.bls.gov/pub/time.series/pr/"
HDRS  = {"User-Agent": "ScottSchmidt/1.0 (scott.schmidt1989@yahoo.com)"}
# Keep only real data/listing files (no parent links, no dirs)
ALLOW = re.compile(r"^[A-Za-z0-9._-]+$")

def list_source_files():
    """Scrape the BLS directory listing for file names."""
    resp = requests.get(BASE, headers=HDRS, timeout=30)
    resp.raise_for_status()
    # Extract href="filename" tokens
    hrefs = re.findall(r'href="([^"]+)"', resp.text)
    # Filter to plain files (no slashes, no query)
    return [h for h in hrefs if ALLOW.match(h)]

def list_s3_keys(prefix):
    """List current S3 object keys under a prefix."""
    keys = []
    paginator = S3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def sync_one(filename: str):
    """Download from source, compare with S3, add/update/delete as needed."""
    url = BASE + filename
    key = f"{PREFIX}{filename}"

    r = requests.get(url, headers=HDRS, timeout=60)
    if r.status_code == 404:
        # Source file gone → ensure it's gone in S3
        try:
            S3.delete_object(Bucket=BUCKET, Key=key)
            print(f"Deleted from S3 (gone upstream): {filename}")
        except botocore.exceptions.ClientError as e:
            if e.response["Error"]["Code"] == "NoSuchKey":
                print(f"Not in S3 (already absent): {filename}")
            else:
                raise
        return

    r.raise_for_status()
    content = r.text
    new_hash = md5_hex(content)

    try:
        obj = S3.get_object(Bucket=BUCKET_NAME, Key=key)
        existing = obj["Body"].read().decode("utf-8")
        old_hash = md5_hex(existing)
        if new_hash == old_hash:
            print(f"No change: {filename}")
            return
        # changed
        S3.put_object(Bucket=BUCKET_NAME, Key=key, Body=content)
        print(f"Updated: {filename}")
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            S3.put_object(Bucket=BUCKET, Key=key, Body=content)
            print(f"Added: {filename}")
        else:
            raise

def lambda_handler(event, context):
    # 1) Get authoritative list from BLS
    source_files = set(list_source_files())

    # 2) Current objects in S3 (strip the prefix to compare names)
    s3_keys = list_s3_keys(PREFIX)
    s3_files = set(k[len(PREFIX):] for k in s3_keys if k.startswith(PREFIX))

    # 3) Add/Update everything that exists upstream
    for name in sorted(source_files):
        sync_one(name)

    # 4) Delete anything we have that upstream no longer has
    extras = s3_files - source_files
    for name in sorted(extras):
        key = f"{PREFIX}{name}"
        S3.delete_object(Bucket=BUCKET, Key=key)
        print(f"Deleted extra (not upstream): {name}")

    return {"statusCode": 200, "body": "Full sync complete"}
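# Note: this cell only defines the sync functions. Nothing is synced until
# lambda_handler({}, None) is called.
print("Done.")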


Starting..
Done.
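
`sync_one` downloads the existing S3 object just to hash it. A cheaper comparison is sometimes possible: for objects uploaded with a single `put_object` call and without SSE-KMS or SSE-C encryption, the S3 `ETag` is normally the hex MD5 of the body wrapped in double quotes. The sketch below assumes that condition holds for this bucket. `md5_hex_bytes` and `etag_matches` are hypothetical helpers, and the sample bytes are made up.

```python
import hashlib


def md5_hex_bytes(data: bytes) -> str:
    """Hash raw bytes, avoiding any text decoding differences."""
    return hashlib.md5(data).hexdigest()


def etag_matches(etag: str, data: bytes) -> bool:
    """Compare an S3 ETag (usually quoted) with the MD5 of local bytes."""
    return etag.strip('"') == md5_hex_bytes(data)


# In the notebook the ETag would come from something like:
#   etag = S3.head_object(Bucket=BUCKET_NAME, Key=key)["ETag"]
sample = b"series_id\tyear\tvalue\n"            # made-up file content
quoted_etag = '"' + md5_hex_bytes(sample) + '"'  # simulated ETag
print(etag_matches(quoted_etag, sample))
print(etag_matches(quoted_etag, sample + b"x"))
```

Hashing `r.content` (bytes) rather than `r.text` would be needed for this comparison to line up with what S3 stored.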


## Fetch and Upload BLS Data for Each Series ID
This section goes through each BLS series ID, requests data from the API, 
and uploads it to S3 only if the data has changed since the last upload.

In [4]:
# Series IDs you want to fetch
series_ids = ["CUUR0000SA0", "SUUR0000SA0"]  

# Loop through each series and fetch data from the BLS API
for series_id in series_ids:
    print(f"Fetching {series_id} from BLS API...")

    # Prepare API request payload
    payload = json.dumps({
        "seriesid": [series_id],
        "startyear": "2020",
        "endyear": "2024",
        "registrationkey": API_KEY
    })

    # Send POST request to the BLS API
    response = requests.post(
        "https://api.bls.gov/publicAPI/v2/timeseries/data/",
        data=payload,
        headers={"Content-type": "application/json"}
    )

    #  Skip if API request fails
    if response.status_code != 200:
        print(f"Failed to fetch {series_id}: {response.status_code}")
        continue

    # Hash the response body and build the S3 key
    content = response.text
    hash_new = hashlib.md5(content.encode("utf-8")).hexdigest()
    s3_key = f"{series_id}.json"

    # Check if file already exists in S3 with same content
    try:
        obj = s3.get_object(Bucket=BUCKET_NAME, Key=s3_key)
        existing = obj["Body"].read().decode("utf-8")
        hash_existing = hashlib.md5(existing.encode("utf-8")).hexdigest()

        if hash_existing == hash_new:
            print(f"Skipping {series_id}: unchanged.")
            continue
    except botocore.exceptions.ClientError:
        pass  # File doesn't exist yet, so fall through and upload it

    # Upload to S3
    s3.put_object(Bucket=BUCKET_NAME, Key=s3_key, Body=content)
    print("Uploaded to S3! ", s3_key)

Fetching CUUR0000SA0 from BLS API...
Uploaded to S3!  CUUR0000SA0.json
Fetching SUUR0000SA0 from BLS API...
Uploaded to S3!  SUUR0000SA0.json
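
Both series were uploaded again in the run above. One possible cause is that the raw response text includes per-request metadata (the BLS body carries a `responseTime` field), so its hash differs even when the data is identical. A minimal sketch of hashing only the `Results` section in a canonical form is shown below. `stable_hash` is a hypothetical helper and the sample payloads are illustrative.

```python
import hashlib
import json


def stable_hash(response_text):
    """Hash only the Results section, serialized with sorted keys."""
    payload = json.loads(response_text)
    results = payload.get("Results", {})
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Two illustrative responses that differ only in request metadata
first = json.dumps({"status": "REQUEST_SUCCEEDED", "responseTime": 120,
                    "Results": {"series": [{"seriesID": "CUUR0000SA0", "data": []}]}})
second = json.dumps({"status": "REQUEST_SUCCEEDED", "responseTime": 95,
                     "Results": {"series": [{"seriesID": "CUUR0000SA0", "data": []}]}})
print(stable_hash(first) == stable_hash(second))
```

The same function would have to be applied to the object read back from S3 for the comparison to be meaningful.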


## Preview BLS Data from S3
This section retrieves a specific BLS JSON file from Amazon S3, 
converts it into a Pandas DataFrame, reorders the columns, 
and displays the first few rows for inspection.

In [5]:
import pandas as pd

# Load the file content 
key = "CUUR0000SA0.json" # SUUR0000SA0
obj = s3.get_object(Bucket=BUCKET_NAME, Key=key)
json_content = json.loads(obj['Body'].read().decode('utf-8'))

# Extract data into DataFrame
series_data = json_content['Results']['series'][0]['data']
df = pd.DataFrame(series_data)
df = df[["year", "period", "value", "footnotes"]]
print("DataFrame Shape: ", df.shape)
df.head(20)

DataFrame Shape:  (60, 4)


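| | year | period | value | footnotes |
|---|---|---|---|---|
| 0 | 2024 | M12 | 315.605 | [{}] |
| 1 | 2024 | M11 | 315.493 | [{}] |
| 2 | 2024 | M10 | 315.664 | [{}] |
| 3 | 2024 | M09 | 315.301 | [{}] |
| 4 | 2024 | M08 | 314.796 | [{}] |
| 5 | 2024 | M07 | 314.54 | [{}] |
| 6 | 2024 | M06 | 314.175 | [{}] |
| 7 | 2024 | M05 | 314.069 | [{}] |
| 8 | 2024 | M04 | 313.548 | [{}] |
| 9 | 2024 | M03 | 312.332 | [{}] |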
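
The API returns `year`, `period`, and `value` as strings, so a conversion step is usually needed before plotting or computing changes. The sketch below turns one row into a `(date, float)` pair using only the standard library. `to_observation` is a hypothetical helper. It assumes monthly periods look like `M01` to `M12` and treats `M13` (used by BLS for annual averages) and other non-monthly codes as rows to skip. The `M13` row in the example is made up.

```python
from datetime import date


def to_observation(row):
    """Convert a BLS data row to (first day of month, value), or None."""
    period = row["period"]
    if not period.startswith("M") or period == "M13":
        return None  # not a monthly observation
    return date(int(row["year"]), int(period[1:]), 1), float(row["value"])


rows = [
    {"year": "2024", "period": "M12", "value": "315.605"},  # from the preview above
    {"year": "2024", "period": "M13", "value": "0.0"},      # made-up annual row
]
observations = [obs for obs in map(to_observation, rows) if obs is not None]
print(observations)
```

The same function could be applied to each record in `series_data` before building the DataFrame.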
