# API Ingest → AWS S3 (BLS/DataUSA)
This notebook automates the retrieval of BLS productivity data and stores it in Amazon S3.  
It uses the BLS Public API with secure key management and includes duplicate protection to avoid re-uploading unchanged files.

### What's Covered
- **Automated sync** from the BLS data API and public file directory to S3.
- **No hardcoded file names** – dynamically scrapes the BLS file list.
- **403 error handling** – uses a valid User-Agent to comply with BLS access policy.
- **Cloud-based execution** – runs in Kaggle with secure secret management.
- **Secrets used** – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME, BLS_API_KEY.
- **Duplicate protection** – checks content hashes before uploading.
- **Optional scheduling** – can be run daily or weekly with no manual effort.

### How It Works
1. Fetch the current list of files from the BLS public directory.
2. Download each file and compare its hash to the version in S3.
3. Upload new or changed files to the configured S3 bucket.
4. Skip unchanged files to save bandwidth and storage.
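
Step 1 is not shown in the cells below, which hardcode `pr.data.0.Current`. The following is a minimal sketch of how the file list could be extracted, assuming the directory page at `https://download.bls.gov/pub/time.series/pr/` is an HTML listing whose links end in the file names. The helper name `list_bls_files` and the `pr.` prefix filter are illustrative assumptions, not part of the original notebook.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_bls_files(listing_html, prefix="pr."):
    """Return file names linked from a directory listing page.

    Assumption: the last path segment of each link is the file name.
    """
    parser = _LinkCollector()
    parser.feed(listing_html)
    names = []
    for href in parser.hrefs:
        name = href.rstrip("/").rsplit("/", 1)[-1]
        if name.startswith(prefix) and name not in names:
            names.append(name)
    return names
```

In the notebook, `listing_html` would come from a `requests.get(...)` call on the directory URL, sent with the same kind of User-Agent header used in the download cell.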

## Connect to AWS S3
This notebook uses the following Python packages:  
- boto3  
- requests  
- hashlib and json (both in the standard library)  
- kaggle_secrets, which supplies the secrets AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME, and BLS_API_KEY

In [1]:
import boto3
import requests
import hashlib
import json
from kaggle_secrets import UserSecretsClient

# Load the BLS API key and AWS secrets
secrets = UserSecretsClient()
API_KEY = secrets.get_secret("BLS_API_KEY")
AWS_ACCESS_KEY_ID = secrets.get_secret("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = secrets.get_secret("AWS_SECRET_ACCESS_KEY")
AWS_REGION = secrets.get_secret("AWS_REGION")
BUCKET_NAME = secrets.get_secret("BUCKET_NAME")

# Setup AWS session and S3
session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)
s3 = session.client("s3")

# Test connection WITHOUT revealing keys
try:
    response = s3.list_objects_v2(Bucket=BUCKET_NAME)
    num_files = response.get('KeyCount', 0)
    print("S3 connection successful. Bucket contains: ", num_files)
except Exception as e:
    print("S3 connection failed: ", e)

S3 connection successful. Bucket contains:  6
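
Note that `list_objects_v2` returns at most 1,000 keys per call, so a single `KeyCount` undercounts larger buckets. Below is a minimal sketch of summing across pages. `count_keys` is a hypothetical helper, and it accepts any iterable of response pages so it can be tried without AWS access.

```python
def count_keys(pages):
    """Sum KeyCount over an iterable of list_objects_v2 response pages."""
    return sum(page.get("KeyCount", 0) for page in pages)


# Stand-in pages shaped like list_objects_v2 responses (illustrative values).
example_pages = [{"KeyCount": 1000}, {"KeyCount": 6}]
print(count_keys(example_pages))  # 1006
```

With the real client, the pages could be supplied by `s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET_NAME)`.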


## Fetch BLS Data via API Key
- Authenticate with a registered BLS API key to comply with access policies.  
- Retrieve U.S. inflation data programmatically through the BLS Public API.  

In [2]:
# API payload
headers = {'Content-type': 'application/json'}
data = json.dumps({
    "seriesid": ["CUUR0000SA0", "SUUR0000SA0"],  # You can customize this
    "startyear": "2020",
    "endyear": "2024",
    "registrationkey": API_KEY
})

# Make request
response = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    data=data,
    headers=headers
)

if response.status_code == 200:
    results = response.json()
    
    # Save locally
    with open("bls_data.json", "w") as f:
        json.dump(results, f, indent=2)
    
    # Upload JSON file to S3 bucket
    s3.put_object(
        Bucket=BUCKET_NAME,
        Key="bls_data.json",
        Body=json.dumps(results, indent=2)
    )
    
    print("Uploaded bls_data.json to S3.")
else:
    print("Error: ", response.status_code)
    print(response.text)

Uploaded bls_data.json to S3.
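
An HTTP 200 only confirms that the request reached the API. Here is a minimal sketch of a stricter check, assuming the v2 response body carries a top-level `status` field equal to `REQUEST_SUCCEEDED` on success and a `message` list with any warnings or errors. Verify these field values against the BLS API documentation before relying on them. `bls_request_ok` is a hypothetical helper.

```python
def bls_request_ok(results):
    """Return (ok, messages) for a parsed BLS API response dict.

    Assumption: success is signalled by status == "REQUEST_SUCCEEDED"
    together with a non-empty Results["series"] list.
    """
    succeeded = results.get("status") == "REQUEST_SUCCEEDED"
    body = results.get("Results")
    # Results may be missing or not a dict on failure, so guard the lookup.
    series = body.get("series") if isinstance(body, dict) else None
    return succeeded and bool(series), list(results.get("message") or [])
```

In the cell above, the upload could be made conditional on `bls_request_ok(results)[0]`, so that a rejected request is not saved to S3 as if it were data.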


## API Data Pipeline – Send Files to S3 (No API Key Needed)
- **Custom User-Agent** to comply with BLS access rules and avoid 403 errors.  
- **Checks for changes** by comparing file hashes before uploading.  
- **Uploads only when updated**, reducing bandwidth usage and S3 storage costs.  

In [3]:
import hashlib
import requests

# CONFIG 
file_name = "pr.data.0.Current"
bls_url = f"https://download.bls.gov/pub/time.series/pr/{file_name}"
headers = {
    "User-Agent": "ScottSchmidt/1.0 (scott.schmidt1989@yahoo.com)"
}

# Download file from BLS:
response = requests.get(bls_url, headers=headers)

if response.status_code != 200:
    print("Failed to download BLS data:", response.status_code)
else:
    # Report success only after the download has actually completed
    print(file_name, " File Downloaded from BLS")
    content = response.text
    content_hash = hashlib.md5(content.encode("utf-8")).hexdigest()

    try:
        existing_obj = s3.get_object(Bucket=BUCKET_NAME, Key=file_name)
        existing_content = existing_obj['Body'].read().decode("utf-8")
        existing_hash = hashlib.md5(existing_content.encode("utf-8")).hexdigest()

        if content_hash == existing_hash:
            print("No changes detected. Upload skipped.")
        else:
            s3.put_object(Bucket=BUCKET_NAME, Key=file_name, Body=content)
            print("Updated file uploaded to S3.")
    except s3.exceptions.NoSuchKey:
        # File doesn't exist yet
        s3.put_object(Bucket=BUCKET_NAME, Key=file_name, Body=content)
        print("File not found in S3. Uploaded new file.")

pr.data.0.Current  File Downloaded from BLS
No changes detected. Upload skipped.
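
The cell above downloads the existing S3 object in full just to hash it. Below is a sketch of a lighter comparison, assuming the object was written by a single `put_object` call without SSE-KMS encryption, in which case the S3 `ETag` is the hex MD5 of the body. Multipart or KMS-encrypted uploads break this assumption, so a mismatch there means "unknown" rather than "changed". `etag_matches` is a hypothetical helper.

```python
import hashlib


def etag_matches(head_response, body_bytes):
    """Compare the ETag in an S3 head_object response with MD5(body_bytes)."""
    # S3 returns the ETag wrapped in double quotes; strip them first.
    etag = head_response.get("ETag", "").strip('"')
    return etag == hashlib.md5(body_bytes).hexdigest()


# Stand-in for the dict returned by s3.head_object(Bucket=..., Key=...).
body = b"series_id\tyear\tperiod\tvalue\n"
fake_head = {"ETag": '"' + hashlib.md5(body).hexdigest() + '"'}
print(etag_matches(fake_head, body))         # True
print(etag_matches(fake_head, body + b"x"))  # False
```

With the real client this would be called as `etag_matches(s3.head_object(Bucket=BUCKET_NAME, Key=file_name), response.content)`. A missing key surfaces from `head_object` as a `botocore.exceptions.ClientError`, which would need its own handling.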


## Fetch and Upload BLS Data for Each Series ID
This section goes through each BLS series ID, requests data from the API, 
and uploads it to S3 only if the data has changed since the last upload.

In [4]:
# Series IDs you want to fetch
series_ids = ["CUUR0000SA0", "SUUR0000SA0"]  

# Loop through each series and fetch data from the BLS API
for series_id in series_ids:
    print(f"Fetching {series_id} from BLS API...")

    # Prepare API request payload
    payload = json.dumps({
        "seriesid": [series_id],
        "startyear": "2020",
        "endyear": "2024",
        "registrationkey": API_KEY
    })

    # Send POST request to the BLS API
    response = requests.post(
        "https://api.bls.gov/publicAPI/v2/timeseries/data/",
        data=payload,
        headers={"Content-type": "application/json"}
    )

    # Skip if API request fails
    if response.status_code != 200:
        print(f"Failed to fetch {series_id}: {response.status_code}")
        continue

    # Hash the response body and derive the S3 key
    content = response.text
    hash_new = hashlib.md5(content.encode("utf-8")).hexdigest()
    s3_key = f"{series_id}.json"

    # Check if file already exists in S3 with same content
    try:
        obj = s3.get_object(Bucket=BUCKET_NAME, Key=s3_key)
        existing = obj["Body"].read().decode("utf-8")
        hash_existing = hashlib.md5(existing.encode("utf-8")).hexdigest()

        if hash_existing == hash_new:
            print(series_id, " Skipping. Unchanged series_id")
            continue
    except s3.exceptions.NoSuchKey:
        pass  # File doesn't exist yet, so fall through to the upload

    # Upload to S3
    s3.put_object(Bucket=BUCKET_NAME, Key=s3_key, Body=content)
    print("Uploaded to S3! ", s3_key)

Fetching CUUR0000SA0 from BLS API...
Uploaded to S3!  CUUR0000SA0.json
Fetching SUUR0000SA0 from BLS API...
Uploaded to S3!  SUUR0000SA0.json
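
The file cell and the series loop repeat the same compare-then-upload logic. One way to factor it out is sketched below. `upload_if_changed` is a hypothetical helper, not part of the original notebook, and it expects a client that behaves like the boto3 S3 client used above (`get_object`, `put_object`, and `exceptions.NoSuchKey`).

```python
import hashlib


def upload_if_changed(s3_client, bucket, key, content):
    """Upload content (a str) unless S3 already holds identical bytes.

    Returns "uploaded" or "skipped".
    """
    new_bytes = content.encode("utf-8")
    try:
        existing = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    except s3_client.exceptions.NoSuchKey:
        existing = None  # key is not in the bucket yet

    if existing is not None and (
        hashlib.md5(existing).hexdigest() == hashlib.md5(new_bytes).hexdigest()
    ):
        return "skipped"

    s3_client.put_object(Bucket=bucket, Key=key, Body=new_bytes)
    return "uploaded"
```

Inside the loop, the try/except and upload could then collapse to `upload_if_changed(s3, BUCKET_NAME, s3_key, content)`. Only a missing key is treated as "new"; other errors, such as denied access, propagate instead of being silently swallowed.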


## Preview BLS Data from S3
This section retrieves a specific BLS JSON file from Amazon S3, 
converts it into a Pandas DataFrame, reorders the columns, 
and displays the first few rows for inspection.

In [5]:
import pandas as pd

# Load the file content 
key = "CUUR0000SA0.json"  # or "SUUR0000SA0.json"
obj = s3.get_object(Bucket=BUCKET_NAME, Key=key)
json_content = json.loads(obj['Body'].read().decode('utf-8'))

# Extract data into DataFrame
series_data = json_content['Results']['series'][0]['data']
df = pd.DataFrame(series_data)
df = df[["year", "period", "value", "footnotes"]]
print("DataFrame Shape: ", df.shape)
df.head(20)

DataFrame Shape:  (60, 4)


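| | year | period | value | footnotes |
|---|---|---|---|---|
| 0 | 2024 | M12 | 315.605 | [{}] |
| 1 | 2024 | M11 | 315.493 | [{}] |
| 2 | 2024 | M10 | 315.664 | [{}] |
| 3 | 2024 | M09 | 315.301 | [{}] |
| 4 | 2024 | M08 | 314.796 | [{}] |
| 5 | 2024 | M07 | 314.54 | [{}] |
| 6 | 2024 | M06 | 314.175 | [{}] |
| 7 | 2024 | M05 | 314.069 | [{}] |
| 8 | 2024 | M04 | 313.548 | [{}] |
| 9 | 2024 | M03 | 312.332 | [{}] |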
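
For plotting or time-based filtering, the `year` and `period` columns usually need to be combined into real dates. Here is a minimal sketch in plain Python, assuming periods `M01` through `M12` are calendar months and that any other period code (for example an annual average) should be skipped. `parse_bls_rows` is a hypothetical helper that works on the same list of dicts held in `series_data`.

```python
from datetime import date


def parse_bls_rows(rows):
    """Convert BLS monthly rows into (date, float) pairs, oldest first.

    Assumption: only periods M01..M12 are calendar months; others are skipped.
    """
    parsed = []
    for row in rows:
        period = row["period"]
        if len(period) == 3 and period[0] == "M" and period[1:].isdigit():
            month = int(period[1:])
            if 1 <= month <= 12:
                first_of_month = date(int(row["year"]), month, 1)
                parsed.append((first_of_month, float(row["value"])))
    return sorted(parsed)


# Two rows taken from the preview above.
sample = [
    {"year": "2024", "period": "M12", "value": "315.605"},
    {"year": "2024", "period": "M11", "value": "315.493"},
]
print(parse_bls_rows(sample)[0])  # (datetime.date(2024, 11, 1), 315.493)
```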
