# API Ingest & SYNC → AWS S3 
This notebook automates the retrieval of BLS/DataUSA data and stores it in Amazon S3. It uses the BLS Public API to retrieve data.
This sync version keeps S3 matched with the website and automatically detects additions, updates, and deletions:
* New files → get added.
* Changed files → get updated.
* Deleted files → get removed.
  
**[View Notebook (Foundational Version)](https://github.com/ScottySchmidt/AWS_DataEngineer_API/blob/main/01-ingest-apis-to-s3.ipynb)**

This version laid the groundwork for the improved **Sync Version**, which now mirrors the full BLS directory with adds/updates/deletes and supports optional targeted series syncs.

### What's Covered
- **Automated sync** from the data API source to S3.
- **No hardcoded file names** – dynamically scrapes the BLS file list.
- **403 error handling** – uses a valid User-Agent to comply with BLS access policy.
- **Cloud-based execution** – runs in Kaggle with secure secret management.
- **Secrets used** – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME, BLS_API_KEY.
- **Duplicate protection** – checks content hashes before uploading.
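
As a rough illustration of the "no hardcoded file names" point, the sketch below pulls file links out of a directory-listing page using only the standard library. The `LinkCollector` class, the `list_files` helper, the sample HTML, and the `example.org` URL are all hypothetical; the real BLS listing markup may differ, so treat this as a starting point rather than the notebook's actual scraper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_files(listing_html, base_url):
    """Return sorted absolute URLs of the files linked from a listing page."""
    parser = LinkCollector()
    parser.feed(listing_html)
    urls = set()
    for href in parser.hrefs:
        # Skip directory links (such as the parent folder) and query links.
        if href.endswith("/") or "?" in href:
            continue
        urls.add(urljoin(base_url, href))
    return sorted(urls)


# Made-up listing and base URL, for illustration only.
SAMPLE_HTML = (
    '<a href="/pub/">[To Parent Directory]</a> '
    '<a href="/pub/data/a.txt">a.txt</a> '
    '<a href="/pub/data/b.txt">b.txt</a>'
)
print(list_files(SAMPLE_HTML, "https://example.org/pub/data/"))
```

In the notebook, the HTML would presumably come from a `requests.get(...)` call that sends the same User-Agent header used for the API request.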

### How It Works
1. Fetch the current list of files from the BLS public directory.
2. Download each file and compare its hash to the version in S3.
3. Upload new or changed files to the configured S3 bucket.
4. Skip unchanged files to save bandwidth and storage.
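
The steps above boil down to comparing two `{file name: hash}` maps. Here is a minimal sketch of that comparison, assuming the hashes have already been computed for both sides; `plan_sync` and the file names are hypothetical, and the delete branch reflects the sync behavior described at the top of the notebook.

```python
def plan_sync(source_hashes, s3_hashes):
    """Compare {name: hash} maps and return the actions needed to mirror the source."""
    plan = {"add": [], "update": [], "delete": [], "skip": []}
    for name, digest in source_hashes.items():
        if name not in s3_hashes:
            plan["add"].append(name)        # new at the source
        elif s3_hashes[name] != digest:
            plan["update"].append(name)     # content changed
        else:
            plan["skip"].append(name)       # identical, nothing to upload
    # Anything in S3 that the source no longer lists should be removed.
    plan["delete"] = [name for name in s3_hashes if name not in source_hashes]
    for names in plan.values():
        names.sort()
    return plan


# Made-up file names and hashes, for illustration only.
source = {"a.txt": "h1", "b.txt": "h2-new", "c.txt": "h3"}
in_s3 = {"b.txt": "h2-old", "c.txt": "h3", "d.txt": "h4"}
print(plan_sync(source, in_s3))
```

Keeping the planning step separate from the S3 calls makes it easy to print or review the plan before anything is uploaded or deleted.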

## Sync BLS Data to S3
- The script watches for new, changed, or deleted files.
- Sends a custom User-Agent header so BLS doesn’t block the requests.
- Checks whether a file has changed before uploading.
- Only uploads when needed, to save space and time.
- Keeps S3 matched up with the BLS site; API calls use a registered BLS API key.
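
One way to do the "check if a file is different" step is to hash the content and compare it with a hash saved from the last upload. The sketch below assumes the data is JSON-serializable; `content_hash` and `needs_upload` are hypothetical helper names, not functions from the notebook.

```python
import hashlib
import json


def content_hash(data):
    """Return a SHA-256 hex digest of a JSON-serializable object."""
    # Sorted keys and fixed separators make the serialization deterministic,
    # so the same data always produces the same digest.
    body = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def needs_upload(data, stored_hash):
    """True when there is no stored hash or the content differs from it."""
    return stored_hash is None or content_hash(data) != stored_hash


# Made-up payloads, for illustration only.
old = {"value": 1}
print(needs_upload(old, None))                 # nothing stored yet
print(needs_upload(old, content_hash(old)))    # unchanged content
print(needs_upload({"value": 2}, content_hash(old)))  # changed content
```

The stored hash has to live somewhere. One option (an assumption here, not something the notebook confirms) is to pass it as `Metadata={"sha256": digest}` to `put_object` and read it back from the `Metadata` field of a `head_object` response.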

In [1]:
import boto3
import requests
import hashlib
import json
import os
from kaggle_secrets import UserSecretsClient

# Load the BLS API key and AWS secrets from Kaggle
secrets = UserSecretsClient()
API_KEY = secrets.get_secret("BLS_API_KEY")
AWS_ACCESS_KEY_ID = secrets.get_secret("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = secrets.get_secret("AWS_SECRET_ACCESS_KEY")
AWS_REGION = secrets.get_secret("AWS_REGION")
BUCKET_NAME = secrets.get_secret("BUCKET_NAME")

# Setup AWS session and S3
session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)
s3 = session.client("s3")

# Test connection WITHOUT revealing keys
try:
    # list_objects_v2 returns at most 1,000 keys per call, so KeyCount covers the first page only
    response = s3.list_objects_v2(Bucket=BUCKET_NAME)
    num_files = response.get('KeyCount', 0)
    print("S3 connection successful. Bucket contains: ", num_files)
except Exception as e:
    print("S3 connection failed: ", e)

SERIES_IDS = os.getenv("SERIES_IDS", "CUUR0000SA0,SUUR0000SA0")
SERIES_IDS = [s.strip() for s in SERIES_IDS.split(",") if s.strip()]

filename = f"bls_{'-'.join(sorted(SERIES_IDS))}.json"   # ← deterministic, no hardcode
s3_key   = f"bls/api/{filename}"

print("filename:", filename)
print("s3 key:", s3_key)

headers = {
    "Content-Type": "application/json",
    "User-Agent": os.getenv("USER_AGENT", "ScottSchmidt/1.0 (email)")
}
payload = {
    "seriesid": SERIES_IDS,
    "registrationkey": API_KEY
}

resp = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    data=json.dumps(payload),
    headers=headers,
    timeout=60
)
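if resp.status_code != 200:
    raise RuntimeError(f"BLS error {resp.status_code}: {resp.text[:200]}")
data = resp.json()
# The BLS API can answer HTTP 200 with a failure status in the body, so check that too
if data.get("status") != "REQUEST_SUCCEEDED":
    raise RuntimeError(f"BLS request failed: {data.get('message')}")
print("got data")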

with open(filename, "w") as f:
    json.dump(data, f, indent=2)
print("saved:", filename)

# Upload the JSON to the S3 bucket
s3.put_object(
    Bucket=BUCKET_NAME,
    Key=s3_key,
    Body=json.dumps(data, indent=2),
    ContentType="application/json",
)
print("uploaded:", s3_key)

S3 connection successful. Bucket contains:  43
filename: bls_CUUR0000SA0-SUUR0000SA0.json
s3 key: bls/api/bls_CUUR0000SA0-SUUR0000SA0.json
got data
saved: bls_CUUR0000SA0-SUUR0000SA0.json
uploaded: bls/api/bls_CUUR0000SA0-SUUR0000SA0.json


## Preview Synced BLS Data
Load the full set of BLS JSON files from S3 (kept in sync with the source) into a single DataFrame for analysis.
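
Before building a DataFrame, each JSON payload needs to be flattened into rows. Here is a minimal sketch, assuming the BLS API v2 response layout (`Results` → `series` → `data`, with string `year`, `period`, and `value` fields); `series_to_rows` is a hypothetical helper and the sample values are made up.

```python
def series_to_rows(payload):
    """Flatten a BLS API response into one dict per observation."""
    rows = []
    for series in payload.get("Results", {}).get("series", []):
        for obs in series.get("data", []):
            try:
                value = float(obs["value"])
            except (KeyError, ValueError):
                value = None  # keep the row even if the value is missing or non-numeric
            rows.append({
                "series_id": series.get("seriesID"),
                "year": int(obs["year"]),
                "period": obs["period"],
                "value": value,
            })
    return rows


# Made-up payload, for illustration only.
sample = {
    "status": "REQUEST_SUCCEEDED",
    "Results": {"series": [{
        "seriesID": "CUUR0000SA0",
        "data": [
            {"year": "2024", "period": "M02", "value": "101.5"},
            {"year": "2024", "period": "M01", "value": "-"},
        ],
    }]},
}
for row in series_to_rows(sample):
    print(row)
```

The resulting list can be passed straight to `pd.DataFrame(...)`, and rows from several files can be concatenated before the DataFrame is built.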