# 01_explore_ingest_raw.ipynb

## 1. Objective
This notebook collects Google Shopping search results for a selected set of Raycon-related keywords using SerpAPI. Each API response is captured as a full raw JSON object and stored in the `raycon.raw_google_shopping` table.

The goal of this step is to establish a reproducible **raw data ingestion layer** that feeds the downstream notebooks responsible for JSON flattening, cleaning, product extraction, and competitive insight generation.

## 2. Setup and Configuration

### 2.1 Imports

In [1]:
import pandas as pd

import requests
import json

from sqlalchemy import create_engine, text
from datetime import datetime, timezone

import os
from dotenv import load_dotenv

### 2.2 Connect to Database
Load the database credentials and the SerpAPI key from the environment, then initialize a SQLAlchemy engine.

In [2]:
load_dotenv()

user = os.getenv('PGUSER')
password = os.getenv('PGPASSWORD')
pghost = os.getenv('PGHOST')
pgport = os.getenv('PGPORT')
pgdatabase = os.getenv('PGDATABASE')

serpapi_key = os.getenv('SERPAPI_KEY')

engine = create_engine(
    f'postgresql+psycopg2://{user}:{password}@{pghost}:{pgport}/{pgdatabase}')

# Smoke-test the connection by reading a few rows from the raw table
pd.read_sql("SELECT * FROM raycon.raw_google_shopping LIMIT 3", engine)

Unnamed: 0,id,pulled_at,keyword,page,response_json
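
The connection cell above fails with a fairly opaque error when a variable is missing from `.env`, and a password containing characters such as `@` or `/` would corrupt the URL string. A minimal sketch of one way to guard against both, assuming the same variable names as above; the helper names `missing_env_vars` and `build_pg_url` are hypothetical and not part of this notebook:

```python
import os
from urllib.parse import quote

# Variable names taken from the connection cell; the helpers are illustrative.
PG_VARS = ["PGUSER", "PGPASSWORD", "PGHOST", "PGPORT", "PGDATABASE"]
REQUIRED_VARS = PG_VARS + ["SERPAPI_KEY"]


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


def build_pg_url(env=os.environ):
    """Build the connection URL, percent-encoding the user and password."""
    missing = missing_env_vars(env, PG_VARS)
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # safe="" makes quote() escape every reserved character, including "/"
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@{env['PGHOST']}:{env['PGPORT']}/{env['PGDATABASE']}"
    )
```

The returned string could then be passed to `create_engine` in place of the f-string used above.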


## 3. Define Keywords
These are the Google Shopping search terms we will track daily.
Keywords were chosen using three criteria:
1. High search volume (inferred from Google Trends)
2. Presence of Raycon products in the organic Shopping results
3. Coverage across different consumer intents

Using these criteria ensures we capture Raycon’s visibility across both high-volume and high-intent queries while avoiding keywords that return no meaningful Raycon results.

These keywords will be revisited later in the project if we expand to additional product categories or identify new high-signal terms.

In [3]:
KEYWORDS = [
    'headphones',
    'earbuds',
    'best headphones',
    'best earbuds',
    'wireless headphones',
    'wireless earbuds',
    'bluetooth headphones'
]

## 4. SerpAPI Ingestion Functions
This section defines the three helper functions used in the raw ingestion step:
1. Fetch a Shopping API response
2. Insert the raw payload into Postgres
3. Wrap both steps together for a single keyword

It then runs the keyword-level helper once for every entry in `KEYWORDS`.

### 4.1 API fetch helper

This function sends a single request to SerpAPI’s Google Shopping endpoint for a
given keyword. It standardizes a few request parameters (language, location,
result count) to keep API pulls consistent across days.

The API response is returned as a Python dictionary, which will later be stored
as JSONB in the raw Postgres table.

In [4]:
def fetch_google_shopping_results(keyword: str) -> dict:
    """
    Pull one SerpAPI request using a pre-determined keyword.
    Returns a Python dictionary from the API's JSON result.
    """
    # URL for Google Shopping engine
    url = 'https://serpapi.com/search.json'
    
    # Standardized parameters for consistent pulls across days
    params = {
        "engine": "google_shopping",
        "q": keyword,
        "api_key": serpapi_key,
        "gl": "us",    # geographic location
        "hl": "en",    # language
        "num": 100,    # max number of results to return
    }

    # Send request
    response = requests.get(url, params=params, timeout=30)

    response.raise_for_status()     # Raise immediately on an HTTP error status (4xx/5xx)
    return response.json()
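
The fetch helper raises on the first timeout or HTTP error, which aborts a daily pull on any transient network problem. A minimal sketch of a retry wrapper with exponential backoff follows; `call_with_retries` and its parameters are hypothetical and not part of this notebook:

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, ...
    between attempts. The last failure is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

In this notebook it could wrap the helper as `call_with_retries(fetch_google_shopping_results, kw, retry_on=(requests.exceptions.RequestException,))`. That exception class also covers the `HTTPError` raised by `raise_for_status()`, so client errors such as an invalid API key would be retried as well; narrowing `retry_on` to timeouts and connection errors may be preferable.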

### 4.2 Raw Table Insert Helper
This helper takes a parsed API response (Python dictionary) and writes it to
`raycon.raw_google_shopping`. It records the pull timestamp, the keyword and
page used, and the full JSON payload as JSONB.

The function returns the newly created `id` from the raw table so that later
steps (staging, debugging, backfills) can trace records back to their source.

In [5]:
def insert_raw_google_shopping(keyword: str, response: dict, page: int = 1) -> int:
    """
    Insert one SerpAPI response into raycon.raw_google_shopping.
    Returns the new row's id.
    """
    # Record creation timestamp
    pulled_at = datetime.now(timezone.utc)

    # Convert Python dict to JSON string for Postgres
    json_str = json.dumps(response)

    # Parameterized INSERT statement
    sql = text("""
        INSERT INTO raycon.raw_google_shopping (pulled_at, keyword, page, response_json)
        VALUES (:pulled_at, :keyword, :page, CAST(:response_json AS jsonb))
        RETURNING id;
    """)

    # Run inside a transaction and get the new id
    with engine.begin() as conn:
        new_id = conn.execute(
            sql,
            {
                "pulled_at": pulled_at,
                "keyword": keyword,
                "page": page,                 # always 1 for now
                "response_json": json_str,    # JSONB cast happens in SQL
            },
        ).scalar_one()
    return new_id
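
To exercise the insert pattern without a Postgres instance, the same flow can be rehearsed against an in-memory SQLite database. This is a sketch with swapped-in pieces, not the notebook's method: SQLite has no JSONB type, so a `TEXT` column stands in for it, `cursor.lastrowid` replaces `RETURNING id`, and the column types are assumptions inferred from the INSERT statement above. The names `make_dry_run_db` and `insert_raw_dry_run` are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone


def make_dry_run_db():
    """Create an in-memory SQLite table mirroring the raw table's columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE raw_google_shopping (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            pulled_at TEXT,
            keyword TEXT,
            page INTEGER,
            response_json TEXT  -- stands in for JSONB
        )
        """
    )
    return conn


def insert_raw_dry_run(conn, keyword, response, page=1):
    """Insert one payload with named parameters and return the new row id."""
    pulled_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO raw_google_shopping "
            "(pulled_at, keyword, page, response_json) "
            "VALUES (:pulled_at, :keyword, :page, :response_json)",
            {
                "pulled_at": pulled_at,
                "keyword": keyword,
                "page": page,
                "response_json": json.dumps(response),
            },
        )
    return cur.lastrowid
```

Reading `response_json` back through `json.loads` confirms the payload survives the round trip unchanged.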

### 4.3 Keyword Ingestion Helper
This helper wraps the two earlier steps (API fetch and raw-table insert) into a single operation for one keyword. It:
1. Pulls one page of Google Shopping results from SerpAPI for the given keyword
2. Inserts the raw JSON payload into the `raycon.raw_google_shopping` table
3. Returns both the new row’s ID and the API response (useful for logging, debugging, or saving a sample payload)

This keeps the ingestion loop clean and ties each keyword’s API request to its raw-table entry. Note that `page` is only stored with the row; the fetch helper takes no page argument, so every call requests the first page of results.

In [6]:
def ingest_keyword(keyword: str, page: int = 1) -> tuple[int, dict]:
    """
    Fetch Google Shopping results for one keyword and insert into the raw table.
    
    Returns:
        new_row_id (int): Surrogate key of the inserted raw row.
        response_dict (dict): API response returned by SerpAPI.
    """
    # Step 1: Fetch API response for this keyword
    response = fetch_google_shopping_results(keyword)

    # Step 2: Insert into Postgres raw table
    new_id = insert_raw_google_shopping(keyword, response, page=page)

    return new_id, response

### 4.4 Execute Ingestion for All Keywords
This cell runs the full raw-ingestion step for the current keyword list.

For each keyword in `KEYWORDS`, it:

1. Prints a progress line and calls `ingest_keyword`.
2. Records the new raw-table ID alongside the keyword.
3. Keeps the first API response as a sample to save in Section 5.

The collected keyword/ID pairs are displayed once the loop finishes.

In [7]:
ingestion_results = []
sample_response = None
sample_keyword = None

for i, kw in enumerate(KEYWORDS, start=1):
    print(f"[{i}/{len(KEYWORDS)}] Ingesting {kw!r} ...")

    # Run full ingestion for a single keyword
    new_id, response = ingest_keyword(kw, page=1)
    ingestion_results.append({"keyword": kw, "id": new_id})
    
    # Keep the first response as a sample to save to disk later
    if sample_response is None:
        sample_response = response
        sample_keyword = kw

ingestion_results

[1/7] Ingesting 'headphones' ...
[2/7] Ingesting 'earbuds' ...
[3/7] Ingesting 'best headphones' ...
[4/7] Ingesting 'best earbuds' ...
[5/7] Ingesting 'wireless headphones' ...
[6/7] Ingesting 'wireless earbuds' ...
[7/7] Ingesting 'bluetooth headphones' ...


[{'keyword': 'headphones', 'id': 1},
 {'keyword': 'earbuds', 'id': 2},
 {'keyword': 'best headphones', 'id': 3},
 {'keyword': 'best earbuds', 'id': 4},
 {'keyword': 'wireless headphones', 'id': 5},
 {'keyword': 'wireless earbuds', 'id': 6},
 {'keyword': 'bluetooth headphones', 'id': 7}]
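
As written, the loop stops at the first keyword whose fetch or insert raises, so later keywords are never pulled that day. A minimal sketch of a more tolerant variant that records failures and carries on; `ingest_all` is a hypothetical helper, and `ingest_fn` is assumed to behave like `ingest_keyword` (returning an `(id, response)` pair):

```python
def ingest_all(keywords, ingest_fn):
    """Run ingest_fn for every keyword, collecting successes and failures."""
    succeeded, failed = [], []
    for kw in keywords:
        try:
            new_id, _response = ingest_fn(kw)
        except Exception as exc:  # broad on purpose: record it and move on
            failed.append({"keyword": kw, "error": repr(exc)})
        else:
            succeeded.append({"keyword": kw, "id": new_id})
    return succeeded, failed
```

Calling `ingest_all(KEYWORDS, ingest_keyword)` would return two lists. A non-empty `failed` list shows which keywords need a re-run.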

## 5. Save Sample JSON
This section saves the first API response from the ingestion loop as a reference file.  
The file documents the structure of a SerpAPI response, which is useful both for readers new to the API and for writing the staging
logic in the next notebook.

In [8]:
# Create folder if missing
os.makedirs("../data/samples", exist_ok=True)

# Where the sample will be saved
sample_path = "../data/samples/google_shopping_example.json"

# Write to disk
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample_response, f, indent=2)

print(f"Saved sample response for {sample_keyword!r} to {sample_path}")

Saved sample response for 'headphones' to ../data/samples/google_shopping_example.json
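
Before writing the staging logic, it can help to see which top-level keys the saved payload contains. The sketch below makes no assumptions about SerpAPI's key names and only reports what it finds; `summarize_payload` and `summarize_file` are hypothetical helpers:

```python
import json


def summarize_payload(payload: dict) -> dict:
    """Map each top-level key to its type name, with length for containers."""
    summary = {}
    for key, value in payload.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__}[{len(value)}]"
        else:
            summary[key] = type(value).__name__
    return summary


def summarize_file(path: str) -> dict:
    """Load a saved JSON sample from disk and summarize its top level."""
    with open(path, encoding="utf-8") as f:
        return summarize_payload(json.load(f))
```

Running `summarize_file(sample_path)` on the file saved above gives a quick map of the sections the flattening step will need to handle.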
