<a href="https://colab.research.google.com/github/SanjanaMohan34/MGMT_467_Team4/blob/main/Final_Project_Bitcoin_Price_Prediction_Pipeline/Individual_DIVE_Journals/Final_Lily_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DIVE Analysis**

**Discover:** What is the current Bitcoin valuation? Connect to the CoinCap API to fetch current price and asset information for Bitcoin. (Cells 1-6)

*Challenge*: The version of CoinCap's API changed from when we created the project blueprint. This meant the domain I was used failed to connect to the server's IP address. This was resolved by investigating CoinCap's webpage and updating the API url. However, this change immediately exposed a new challenge: the v3 API required authentication, leading to 401: Unauthorized and 403: Forbidden errors. I addressed this by sourcing and integrating a CoinCap API key into the method.

Current server: https://rest.coincap.io/v3/assets

Match the schema with the following from https://pro.coincap.io/api-docs/ : { "timestamp": 0, "data": [ { "id": "string", "rank": "string", "symbol": "string", "name": "string", "supply": "string", "maxSupply": "string", "marketCapUsd": "string", "volumeUsd24Hr": "string", "priceUsd": "string", "changePercent24Hr": "string", "vwap24Hr": "string", "explorer": "string", "tokens": { "additionalProp1": [ "string" ], "additionalProp2": [ "string" ], "additionalProp3": [ "string" ] } } ] }

**Investigate:** What are historical Bitcoin prices? Use the endpoint /assets/{slug}/history to fetch historical data from CoinCap's API for Bitcoin. Retrieve data for the maximum amount of time allowed by the API - 1 year. (Historical Data)

**Visualize:** What trends are observable in Bitcoin prices over 2025? Command: "Generate minimal python and use plotly and the variables defined in this notebook to generate one interactive chart displaying the Bitcoin price over time for the year of data that was fetched into the GCP table directed-bongo-471119-d1.467_final_3_2.data_1_t." (Visualization)


**Extend:** Build a linear regression model to predict Bitcoin price. What other types of models could be used to predict price with historical price data? [ARIMA (AutoRegressive Integrated Moving Average), Exponential Smoothing, Recurrent Nueral Networks, Tree-based Models (Random Forest, XGBoost)]

## Cell 1: Python packages and authentication

In [19]:
R"""This cell installs required Python packages and authenticates the user to Google Cloud."""
!pip install google-cloud-storage google-cloud-bigquery requests

from google.colab import auth
print("Authenticating to Google Cloud...")
auth.authenticate_user()
print("Authentication successful.")


Authenticating to Google Cloud...
Authentication successful.


## Cell 2: Configure  project-specific variables and set the `gcloud` project.



In [7]:
R"""
This cell configures essential project-specific variables for Google Cloud operations.

It defines:
- `PROJECT_ID`: The Google Cloud project ID.
- `GCP_REGION`: The Google Cloud region for services.
- `GCS_BUCKET_NAME`: The name of the Google Cloud Storage bucket.
- `GCS_FOLDER_PATH`: The folder path within the GCS bucket for data storage.
- `BQ_DATASET`: The BigQuery dataset name.
- `BQ_TABLE`: The BigQuery table name for flight data.
- `FLIGHT_RECORD_LIMIT`: A pipeline setting to limit records from the API.

Finally, it sets the `gcloud` project configuration to the specified `PROJECT_ID`.
R"""
# --- !! CONFIGURE YOUR VARIABLES !! ---

PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"  # Or the region you are using

# --- GCS Bucket (Source & Target) ---
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1" # The folder you set in your scheduler

# --- BigQuery Table (Target) ---
BQ_DATASET = "467_final_3_2" # The dataset you created
BQ_TABLE = "data_1_t"        # The table for flight data

# --- Pipeline Settings ---
FLIGHT_RECORD_LIMIT = 500 # How many records to pull from the API

# --- CoinCap API Key ---
COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Replace with your actual CoinCap API key

# Define BigQuery Schema for Historical Data
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE")
]

# -------------------------------------

# Set the project for all gcloud commands
!gcloud config set project $PROJECT_ID


Updated property [core/project].


## Cell 3: Define the `OpenSkyApi` class and helper functions for data parsing and formatting.



In [20]:
R"""This cell defines the `CoinCapApi` class and helper functions for interacting with the CoinCap API."""
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery

# ==============================================================================
# CoinCap API Library Code
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        for key in CoinCapAsset.keys:
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        if 'data' in self.__dict__ and self.data is not None:
            if isinstance(self.data, list) and len(self.data) > 0:
                self.data = CoinCapAsset(self.data[0])
            elif isinstance(self.data, dict):
                self.data = CoinCapAsset(self.data)
            else:
                self.data = None
        else:
            self.data = None
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    return {k: v for k, v in row.items() if v is not None}

def _convertHistoricalRecord(historical_data_point, asset_id):
    """Converts a single historical data point into a dictionary suitable for BigQuery."""
    if not historical_data_point:
        return {}

    price_usd = _convert(historical_data_point.get('priceUsd'), float)
    timestamp_ms = historical_data_point.get('time')
    coincap_query_time_bq = None
    if timestamp_ms is not None:
        coincap_query_time_bq = _convertTimestamp(timestamp_ms / 1000)

    record = {
        'id': asset_id,
        'priceUsd': price_usd,
        'coincap_query_time_bq': coincap_query_time_bq
    }

    return {k: v for k, v in record.items() if v is not None}

print("‚úÖ CoinCap API classes and helper functions consolidated.")


‚úÖ CoinCap API classes and helper functions consolidated.


## Cell 4: Define the `OpenSkyApi` class and helper functions, initialize GCP clients, define the BigQuery schema, and implement the data pipeline logic.



## Adjust API Base URL (Retry with API Key)

### Subtask:
Integrate the CoinCap API key into the `CoinCapApi` class to resolve the `401: Unauthorized` error and re-execute the pipeline to fetch Bitcoin asset data.


**Reasoning**:
The subtask requires defining the `COINCAP_API_KEY` variable before it can be used in the `CoinCapApi` class. I will define it with a placeholder and instruct the user to replace it with their actual key.



In [None]:
COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Replace with your actual CoinCap API key
print("‚úÖ COINCAP_API_KEY variable defined.")

‚úÖ COINCAP_API_KEY variable defined.


**Reasoning**:
I will redefine the `CoinCapApi` class to include the `Authorization` header using the `COINCAP_API_KEY`, and also redefine the necessary helper functions and the `run_full_pipeline_without_bq_load` function to ensure all components are updated and interdependent correctly.



**Reasoning**:
Now that the `CoinCapApi` class has been redefined to include the API key, I will execute the `run_full_pipeline_without_bq_load()` function to verify if the API call is successful and data is fetched.



## Cell 6: Orchestrate the full data pipeline from API to GCS to BigQuery.



In [21]:
R"""This cell orchestrates the entire data pipeline, from API to GCS to BigQuery."""
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery
from google.cloud.exceptions import NotFound # Import NotFound for specific error handling

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap historical data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin historical data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()

        asset_id_to_fetch = "bitcoin"

        # Retrieve data for the maximum amount of time allowed by the API - 1 year (365 days)
        end_datetime = datetime.datetime.now()
        end_timestamp_ms = int(end_datetime.timestamp() * 1000)
        start_datetime = end_datetime - datetime.timedelta(days=365)
        start_timestamp_ms = int(start_datetime.timestamp() * 1000)

        historical_data_response = api.get_asset_history(
            asset_id=asset_id_to_fetch,
            interval="d1", # Daily interval
            start=start_timestamp_ms,
            end=end_timestamp_ms
        )

        records = []
        if historical_data_response and 'data' in historical_data_response:
            for data_point in historical_data_response['data']:
                record = _convertHistoricalRecord(data_point, asset_id_to_fetch)
                if record:
                    records.append(record)

        if not records:
            print("No Bitcoin historical data found. Exiting.")
            return None

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_historical_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_history_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None

print("--- Running Full Data Pipeline (API -> GCS -> BigQuery) ---")

# 1. Ensure BigQuery table exists
target_table_id = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

try:
    bq_client.get_table(target_table_id)  # API request
    print(f"‚úÖ BigQuery table '{target_table_id}' already exists.")
except NotFound:
    print(f"BigQuery table '{target_table_id}' not found. Creating table...")
    table = bigquery.Table(target_table_id, schema=BQ_SCHEMA)
    table = bq_client.create_table(table)  # API request
    print(f"‚úÖ Created table {table.project}.{table.dataset_id}.{table.table_id}")
except Exception as e:
    print(f"‚ùå Error checking or creating BigQuery table: {e}")
    exit() # Exit if table cannot be ensured

# 2. Execute API -> GCS pipeline
gcs_uri_for_bq_load = run_full_pipeline_without_bq_load()

if gcs_uri_for_bq_load:
    # 3. Load data from GCS to BigQuery
    load_gcs_to_bigquery(
        gcs_uri_for_bq_load,
        PROJECT_ID,
        BQ_DATASET,
        BQ_TABLE,
        BQ_SCHEMA,
        bq_client
    )
    print("‚úÖ Pipeline Finished Successfully.")
else:
    print("‚ùå Pipeline aborted: No data fetched or GCS upload failed.")


--- Running Full Data Pipeline (API -> GCS -> BigQuery) ---
‚úÖ BigQuery table 'directed-bongo-471119-d1.467_final_3_2.data_1_t' already exists.
Step 1: Fetching Bitcoin historical data from CoinCap API...
  > Fetched 366 record(s).

Step 2: Uploading data to GCS...
  > Source: bitcoin_historical_data.jsonl
  > Destination: gs://467_final_3_2/data_1/coincap_history_batch_1765686807.jsonl
  > Upload complete.
‚úÖ API to GCS Pipeline Finished Successfully.

Step: Loading data from GCS into BigQuery...
  > Source: gs://467_final_3_2/data_1/coincap_history_batch_1765686807.jsonl
  > Target: 467_final_3_2.data_1_t
  > Starting BQ Load Job: 025658f5-4c44-4a0a-8ab4-22147264e1ed
  > Job complete. Loaded 366 rows.
‚úÖ GCS to BigQuery Load Finished Successfully.
‚úÖ Pipeline Finished Successfully.


# Models

In [23]:
# @title ### Cell 9: Build, Evaluate, and Predict with a Linear Regression Model
# @markdown **Objective:** Use BigQuery ML to create a linear regression model to predict the Bitcoin Close price.

from google.cloud import bigquery
from google.cloud.exceptions import NotFound # Import NotFound for specific error handling

# Re-initialize the BigQuery client to ensure it's fresh and correctly scoped
client = bigquery.Client(project=PROJECT_ID)

# Re-define necessary variables as they might have been cleared
BITCOIN_ANALYTICS_VIEW = "bitcoin_analytics_view"
view_id = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

print(f"Attempting to use view: {view_id}")

# --- Add a check for view existence before proceeding ---
try:
    client.get_table(view_id) # get_table works for views too
    print(f"‚úÖ BigQuery view '{view_id}' confirmed to exist.")
except NotFound:
    print(f"üî¥ ERROR: BigQuery view '{view_id}' not found. Please ensure Cell 8 (Data Cleaning and Feature Engineering) ran successfully and created the view.")
    raise # Re-raise the error to stop execution if the view is truly not found
except Exception as e:
    print(f"üî¥ An unexpected error occurred while checking view existence: {e}")
    raise

# --- 1. Create the Linear Regression Model ---
print("üöÄ Training Linear Regression model...")
create_linear_model_query = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.{BQ_DATASET}.bitcoin_price_predictor`
OPTIONS(model_type='LINEAR_REG', input_label_cols=['priceUsd']) AS
SELECT
  priceUsd,
  LAG(priceUsd, 1) OVER (ORDER BY coincap_query_time_bq) AS prev_priceUsd
FROM
  `{view_id}`
WHERE
  priceUsd IS NOT NULL
  AND coincap_query_time_bq IS NOT NULL;
"""
linear_job = client.query(create_linear_model_query)
linear_job.result()
print("‚úÖ Linear Regression model created successfully.")

# --- 2. Evaluate the Model ---
print("\n--- Model Evaluation ---")
evaluate_linear_model_query = f"""
SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.{BQ_DATASET}.bitcoin_price_predictor`);
"""
linear_eval_df = client.query(evaluate_linear_model_query).to_dataframe()
display(linear_eval_df)

# --- 3. Make Predictions with the Model ---
print("\n--- Sample Predictions ---")
predict_linear_query = f"""
WITH PredictionInput AS (
  SELECT
    coincap_query_time_bq AS Timestamp,
    priceUsd,
    LAG(priceUsd, 1) OVER (ORDER BY coincap_query_time_bq) AS prev_priceUsd,
    ROW_NUMBER() OVER (ORDER BY coincap_query_time_bq) AS rn
  FROM
    `{view_id}`
  WHERE
    priceUsd IS NOT NULL
    AND coincap_query_time_bq IS NOT NULL
),
FilteredInput AS (
  SELECT
    Timestamp,
    priceUsd,
    prev_priceUsd,
    rn
  FROM
    PredictionInput
  WHERE
    prev_priceUsd IS NOT NULL -- Exclude first row where prev_priceUsd is NULL
  LIMIT 10
)
SELECT
  fi.priceUsd AS actual_priceUsd,
  p.predicted_priceUsd
FROM
  ML.PREDICT(MODEL `{PROJECT_ID}.{BQ_DATASET}.bitcoin_price_predictor`,
    (SELECT rn, prev_priceUsd FROM FilteredInput)
  ) AS p
JOIN
  FilteredInput AS fi
ON
  p.rn = fi.rn
ORDER BY fi.Timestamp
"""
linear_predict_df = client.query(predict_linear_query).to_dataframe()
display(linear_predict_df)


Attempting to use view: directed-bongo-471119-d1.467_final_3_2.data_1_t
‚úÖ BigQuery view 'directed-bongo-471119-d1.467_final_3_2.data_1_t' confirmed to exist.
üöÄ Training Linear Regression model...
‚úÖ Linear Regression model created successfully.

--- Model Evaluation ---


Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,1.159788e-08,1.345107e-16,1.590659e-26,1.159788e-08,-inf,



--- Sample Predictions ---


Unnamed: 0,actual_priceUsd,predicted_priceUsd


# Historical Data
Implement `get_asset_history` in the `CoinCapApi` class to fetch historical asset data, handling `asset_id`, `interval`, `start`, and `end` parameters.

## Implement get_asset_history in CoinCapApi

### Subtask:
Add a new method `get_asset_history` to the `CoinCapApi` class. This method will construct the request URL for `/assets/{asset_id}/history` and handle parameters like `interval`, `start`, and `end` to fetch historical data for a specified asset.


**Reasoning**:
The subtask requires adding a new method `get_asset_history` to the `CoinCapApi` class. I will modify the existing code block that defines the `CoinCapApi` class and related functions to include this new method as per the instructions, handling URL construction and parameters.



In [None]:
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery

# --- !! CONFIGURE YOUR VARIABLES (re-defined for self-containment) !! ---
PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1"
BQ_DATASET = "467_final_3_2"
BQ_TABLE = "data_1_t"
FLIGHT_RECORD_LIMIT = 500 # Not used in this specific CoinCap context, but kept for consistency if pipeline structure changes

COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Re-define API key

# Define BigQuery Schema based on _convertRow output
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("rank", "INTEGER", mode="NULLABLE"),
    bigquery.SchemaField("symbol", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("supply", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("maxSupply", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("marketCapUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("volumeUsd24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("changePercent24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("vwap24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("explorer", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE"),
]

# ==============================================================================
# CoinCap API Library Code (Modified to use API Key and add get_asset_history)
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        # Initialize attributes from the data_dict
        for key in CoinCapAsset.keys:
            # Use .get() to safely retrieve values, in case a key is missing
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        # If 'data' exists and is not None, parse it into a CoinCapAsset object
        if 'data' in self.__dict__ and self.data is not None:
            self.data = CoinCapAsset(self.data)
        else:
            self.data = None
        # Convert timestamp from milliseconds to seconds for general use
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        # CoinCap has a rate limit of 2000 requests per minute, so ~0.03 seconds per request
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        # Implement a basic rate limit check
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            # Wait if we're hitting the rate limit too fast
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions (re-defined to ensure consistency with updated CoinCapApi)
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        # For CoinCap API, some string values might already be clean, but strip() is safe.
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    # Add the CoinCap API response timestamp
    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    # Return only non-null values, as BQ handles missing fields
    return {k: v for k, v in row.items() if v is not None}

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

# Redefine load_gcs_to_bigquery (if it uses any global variables that might be affected)
# This function does not directly depend on the CoinCapApi class but is part of the pipeline context.
def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin asset data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()
        bitcoin_response = api.get_asset("bitcoin")

        records = []
        if bitcoin_response and bitcoin_response.data:
            record = _convertRow(bitcoin_response)
            if record:
                records.append(record)

        if not records:
            print("No Bitcoin data found. Exiting.")
            return None # Return None if no records

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri # Return the GCS URI

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None

print("‚úÖ CoinCap API class updated with get_asset_history method.")

‚úÖ CoinCap API class updated with get_asset_history method.


## Update Data Fetching for Historical Data

### Subtask:
Modify the `run_full_pipeline_without_bq_load` function to use the new `get_asset_history` method to fetch historical data for Bitcoin.


**Reasoning**:
The subtask requires modifying the `run_full_pipeline_without_bq_load` function to fetch historical data using the `get_asset_history` method. I will redefine the cell containing this function, along with `CoinCapApi` and other helper functions, to incorporate the new logic for calculating timestamps, calling `get_asset_history`, and processing its response using a new `TempCoinCapHistoryWrapper` class to adapt historical data points for the existing `_convertRow` function.



In [None]:
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery

# --- !! CONFIGURE YOUR VARIABLES (re-defined for self-containment) !! ---
PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1"
BQ_DATASET = "467_final_3_2"
BQ_TABLE = "data_1_t"
FLIGHT_RECORD_LIMIT = 500 # Not used in this specific CoinCap context, but kept for consistency if pipeline structure changes

COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Re-define API key

# Define BigQuery Schema based on _convertRow output
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("rank", "INTEGER", mode="NULLABLE"),
    bigquery.SchemaField("symbol", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("supply", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("maxSupply", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("marketCapUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("volumeUsd24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("changePercent24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("vwap24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("explorer", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE"),
]

# ==============================================================================
# CoinCap API Library Code (Modified to use API Key and add get_asset_history)
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        # Initialize attributes from the data_dict
        for key in CoinCapAsset.keys:
            # Use .get() to safely retrieve values, in case a key is missing
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        # If 'data' exists and is not None, parse it into a CoinCapAsset object
        if 'data' in self.__dict__ and self.data is not None:
            self.data = CoinCapAsset(self.data)
        else:
            self.data = None
        # Convert timestamp from milliseconds to seconds for general use
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        # CoinCap has a rate limit of 2000 requests per minute, so ~0.03 seconds per request
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        # Implement a basic rate limit check
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            # Wait if we're hitting the rate limit too fast
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions (re-defined to ensure consistency with updated CoinCapApi)
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        # For CoinCap API, some string values might already be clean, but strip() is safe.
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    # Add the CoinCap API response timestamp
    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    # Return only non-null values, as BQ handles missing fields
    return {k: v for k, v in row.items() if v is not None}

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

# Redefine load_gcs_to_bigquery (if it uses any global variables that might be affected)
# This function does not directly depend on the CoinCapApi class but is part of the pipeline context.
def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

# New wrapper for historical data points to mimic CoinCapResponse structure
class TempCoinCapHistoryWrapper:
    def __init__(self, data_point):
        # CoinCapAsset will only populate keys present in data_point AND CoinCapAsset.keys
        # Historical data points usually only have priceUsd, time, date.
        # This will set self.data.priceUsd and others will remain None.
        self.data = CoinCapAsset(data_point)
        # 'time' in historical data is milliseconds Unix timestamp
        self.timestamp = data_point.get('time')
        self.timestamp_seconds = int(self.timestamp / 1000) if self.timestamp else None

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin historical data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()

        # Calculate start and end timestamps for the last 7 days
        end_datetime = datetime.datetime.now()
        end_timestamp_ms = int(end_datetime.timestamp() * 1000)
        start_datetime = end_datetime - datetime.timedelta(days=7)
        start_timestamp_ms = int(start_datetime.timestamp() * 1000)

        # Fetch historical data
        historical_data_response = api.get_asset_history(
            asset_id="bitcoin",
            interval="d1", # Daily interval
            start=start_timestamp_ms,
            end=end_timestamp_ms
        )

        records = []
        if historical_data_response and 'data' in historical_data_response:
            for data_point in historical_data_response['data']:
                # Wrap each historical data point to be compatible with _convertRow
                wrapped_response = TempCoinCapHistoryWrapper(data_point)
                record = _convertRow(wrapped_response)
                if record:
                    records.append(record)

        if not records:
            print("No Bitcoin historical data found. Exiting.")
            return None # Return None if no records

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_historical_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_history_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri # Return the GCS URI

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None

print("‚úÖ CoinCap API class updated with get_asset_history method, and run_full_pipeline_without_bq_load modified to fetch historical data.")

‚úÖ CoinCap API class updated with get_asset_history method, and run_full_pipeline_without_bq_load modified to fetch historical data.


**Reasoning**:
The `run_full_pipeline_without_bq_load` function has been redefined to fetch historical data using the `get_asset_history` method. I will now execute this function to confirm that it correctly fetches historical Bitcoin data, processes it, and uploads it to GCS.



In [None]:
gcs_uri_for_historical_data = run_full_pipeline_without_bq_load()
print(f"GCS URI for historical data: {gcs_uri_for_historical_data}")

Step 1: Fetching Bitcoin historical data from CoinCap API...
  > Fetched 8 record(s).

Step 2: Uploading data to GCS...
  > Source: bitcoin_historical_data.jsonl
  > Destination: gs://467_final_3_2/data_1/coincap_history_batch_1765634011.jsonl
  > Upload complete.
‚úÖ API to GCS Pipeline Finished Successfully.
GCS URI for historical data: gs://467_final_3_2/data_1/coincap_history_batch_1765634011.jsonl


## Adjust Data Conversion for Historical Data

### Subtask:
Review the structure of the historical data returned by the API and adjust the `_convertRow` function (or create a new one) to correctly parse and format each historical record into a dictionary suitable for BigQuery insertion.


**Reasoning**:
The subtask requires defining a new `_convertHistoricalRecord` function and modifying `run_full_pipeline_without_bq_load` to use it for historical data processing. I will redefine the entire cell containing these functions, along with other necessary components like `CoinCapApi` and helper functions, to incorporate these changes and ensure all variables and functions are correctly scoped and updated.



In [None]:
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery

# --- !! CONFIGURE YOUR VARIABLES (re-defined for self-containment) !! ---
PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1"
BQ_DATASET = "467_final_3_2"
BQ_TABLE = "data_1_t"
FLIGHT_RECORD_LIMIT = 500 # Not used in this specific CoinCap context, but kept for consistency if pipeline structure changes

COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Re-define API key

# Define BigQuery Schema based on _convertRow output
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("rank", "INTEGER", mode="NULLABLE"),
    bigquery.SchemaField("symbol", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("supply", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("maxSupply", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("marketCapUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("volumeUsd24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("changePercent24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("vwap24Hr", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("explorer", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE"),
]

# ==============================================================================
# CoinCap API Library Code (Modified to use API Key and add get_asset_history)
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        # Initialize attributes from the data_dict
        for key in CoinCapAsset.keys:
            # Use .get() to safely retrieve values, in case a key is missing
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        # If 'data' exists and is not None, parse it into a CoinCapAsset object
        if 'data' in self.__dict__ and self.data is not None:
            self.data = CoinCapAsset(self.data)
        else:
            self.data = None
        # Convert timestamp from milliseconds to seconds for general use
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        # CoinCap has a rate limit of 2000 requests per minute, so ~0.03 seconds per request
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        # Implement a basic rate limit check
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            # Wait if we're hitting the rate limit too fast
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        # For CoinCap API, some string values might already be clean, but strip() is safe.
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    # Add the CoinCap API response timestamp
    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    # Return only non-null values, as BQ handles missing fields
    return {k: v for k, v in row.items() if v is not None}

def _convertHistoricalRecord(historical_data_point, asset_id):
    """Converts a single historical data point into a dictionary suitable for BigQuery.

    Args:
        historical_data_point (dict): A dictionary representing a single historical entry from the CoinCap API response.
                                      Expected keys: 'priceUsd', 'time', and potentially others like 'volume'.
        asset_id (str): The ID of the asset (e.g., "bitcoin").

    Returns:
        dict: A dictionary containing the formatted historical record, or an empty dict if conversion fails.
    """
    if not historical_data_point:
        return {}

    # Extract and convert priceUsd
    price_usd = _convert(historical_data_point.get('priceUsd'), float)

    # Extract and convert timestamp from milliseconds to formatted datetime string
    timestamp_ms = historical_data_point.get('time')
    coincap_query_time_bq = None
    if timestamp_ms is not None:
        coincap_query_time_bq = _convertTimestamp(timestamp_ms / 1000)

    # Construct the record dictionary
    record = {
        'id': asset_id,
        'priceUsd': price_usd,
        'coincap_query_time_bq': coincap_query_time_bq
    }

    # Optionally add other fields if available in historical_data_point and relevant
    # For instance, if 'volume' is present and we want to include it:
    # record['volumeUsd24Hr'] = _convert(historical_data_point.get('volume'), float)

    # Return only non-null values
    return {k: v for k, v in record.items() if v is not None}

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

# Redefine load_gcs_to_bigquery (if it uses any global variables that might be affected)
# This function does not directly depend on the CoinCapApi class but is part of the pipeline context.
def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin historical data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()

        # Define asset_id for historical data fetch
        asset_id_to_fetch = "bitcoin"

        # Calculate start and end timestamps for the last 7 days
        end_datetime = datetime.datetime.now()
        end_timestamp_ms = int(end_datetime.timestamp() * 1000)
        start_datetime = end_datetime - datetime.timedelta(days=7)
        start_timestamp_ms = int(start_datetime.timestamp() * 1000)

        # Fetch historical data
        historical_data_response = api.get_asset_history(
            asset_id=asset_id_to_fetch,
            interval="d1", # Daily interval
            start=start_timestamp_ms,
            end=end_timestamp_ms
        )

        records = []
        if historical_data_response and 'data' in historical_data_response:
            for data_point in historical_data_response['data']:
                # Use the new _convertHistoricalRecord function
                record = _convertHistoricalRecord(data_point, asset_id_to_fetch)
                if record:
                    records.append(record)

        if not records:
            print("No Bitcoin historical data found. Exiting.")
            return None # Return None if no records

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_historical_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_history_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri # Return the GCS URI

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None

print("‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.")

‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.


**Reasoning**:
The `run_full_pipeline_without_bq_load` function has been redefined to use the new `_convertHistoricalRecord` function for historical data. I will now execute this function to confirm that it correctly fetches historical Bitcoin data, processes it, and uploads it to GCS with the updated conversion logic.



In [None]:
gcs_uri_for_historical_data = run_full_pipeline_without_bq_load()
print(f"GCS URI for historical data: {gcs_uri_for_historical_data}")

Step 1: Fetching Bitcoin historical data from CoinCap API...
  > Fetched 8 record(s).

Step 2: Uploading data to GCS...
  > Source: bitcoin_historical_data.jsonl
  > Destination: gs://467_final_3_2/data_1/coincap_history_batch_1765634071.jsonl
  > Upload complete.
‚úÖ API to GCS Pipeline Finished Successfully.
GCS URI for historical data: gs://467_final_3_2/data_1/coincap_history_batch_1765634071.jsonl


## Update BigQuery Schema

### Subtask:
Review and update the `BQ_SCHEMA` to accommodate any new fields or different data types that might be present in the historical data, ensuring compatibility with BigQuery.


**Reasoning**:
The subtask requires updating the `BQ_SCHEMA` to reflect the fields generated by `_convertHistoricalRecord`. I will redefine the code cell to modify the `BQ_SCHEMA` to only include `id` (STRING), `priceUsd` (FLOAT), and `coincap_query_time_bq` (TIMESTAMP) and ensure the mode is `NULLABLE` for all.



In [None]:
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery

# --- !! CONFIGURE YOUR VARIABLES (re-defined for self-containment) !! ---
PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1"
BQ_DATASET = "467_final_3_2"
BQ_TABLE = "data_1_t"
FLIGHT_RECORD_LIMIT = 500 # Not used in this specific CoinCap context, but kept for consistency if pipeline structure changes

COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Re-define API key

# Define BigQuery Schema based on _convertHistoricalRecord output
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE")
]

# ==============================================================================
# CoinCap API Library Code (Modified to use API Key and add get_asset_history)
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        # Initialize attributes from the data_dict
        for key in CoinCapAsset.keys:
            # Use .get() to safely retrieve values, in case a key is missing
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        # If 'data' exists and is not None, parse it into a CoinCapAsset object
        if 'data' in self.__dict__ and self.data is not None:
            self.data = CoinCapAsset(self.data)
        else:
            self.data = None
        # Convert timestamp from milliseconds to seconds for general use
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        # CoinCap has a rate limit of 2000 requests per minute, so ~0.03 seconds per request
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        # Implement a basic rate limit check
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            # Wait if we're hitting the rate limit too fast
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        # For CoinCap API, some string values might already be clean, but strip() is safe.
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    # Add the CoinCap API response timestamp
    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    # Return only non-null values, as BQ handles missing fields
    return {k: v for k, v in row.items() if v is not None}

def _convertHistoricalRecord(historical_data_point, asset_id):
    """Converts a single historical data point into a dictionary suitable for BigQuery.

    Args:
        historical_data_point (dict): A dictionary representing a single historical entry from the CoinCap API response.
                                      Expected keys: 'priceUsd', 'time', and potentially others like 'volume'.
        asset_id (str): The ID of the asset (e.g., "bitcoin").

    Returns:
        dict: A dictionary containing the formatted historical record, or an empty dict if conversion fails.
    """
    if not historical_data_point:
        return {}

    # Extract and convert priceUsd
    price_usd = _convert(historical_data_point.get('priceUsd'), float)

    # Extract and convert timestamp from milliseconds to formatted datetime string
    timestamp_ms = historical_data_point.get('time')
    coincap_query_time_bq = None
    if timestamp_ms is not None:
        coincap_query_time_bq = _convertTimestamp(timestamp_ms / 1000)

    # Construct the record dictionary
    record = {
        'id': asset_id,
        'priceUsd': price_usd,
        'coincap_query_time_bq': coincap_query_time_bq
    }

    # Optionally add other fields if available in historical_data_point and relevant
    # For instance, if 'volume' is present and we want to include it:
    # record['volumeUsd24Hr'] = _convert(historical_data_point.get('volume'), float)

    # Return only non-null values
    return {k: v for k, v in record.items() if v is not None}

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

# Redefine load_gcs_to_bigquery (if it uses any global variables that might be affected)
# This function does not directly depend on the CoinCapApi class but is part of the pipeline context.
def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin historical data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()

        # Define asset_id for historical data fetch
        asset_id_to_fetch = "bitcoin"

        # Calculate start and end timestamps for the last 7 days
        end_datetime = datetime.datetime.now()
        end_timestamp_ms = int(end_datetime.timestamp() * 1000)
        start_datetime = end_datetime - datetime.timedelta(days=7)
        start_timestamp_ms = int(start_datetime.timestamp() * 1000)

        # Fetch historical data
        historical_data_response = api.get_asset_history(
            asset_id=asset_id_to_fetch,
            interval="d1", # Daily interval
            start=start_timestamp_ms,
            end=end_timestamp_ms
        )

        records = []
        if historical_data_response and 'data' in historical_data_response:
            for data_point in historical_data_response['data']:
                # Use the new _convertHistoricalRecord function
                record = _convertHistoricalRecord(data_point, asset_id_to_fetch)
                if record:
                    records.append(record)

        if not records:
            print("No Bitcoin historical data found. Exiting.")
            return None # Return None if no records

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_historical_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_history_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri # Return the GCS URI

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None

print("‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.")

‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.


**Reasoning**:
The subtask is to update the `BQ_SCHEMA` to match the output of the `_convertHistoricalRecord` function. I have redefined the `BQ_SCHEMA` list to include only `id`, `priceUsd`, and `coincap_query_time_bq` with their appropriate data types and `NULLABLE` mode. The rest of the code in the cell remains the same to maintain self-containment and consistency.



In [None]:
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery

# --- !! CONFIGURE YOUR VARIABLES (re-defined for self-containment) !! ---
PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1"
BQ_DATASET = "467_final_3_2"
BQ_TABLE = "data_1_t"
FLIGHT_RECORD_LIMIT = 500 # Not used in this specific CoinCap context, but kept for consistency if pipeline structure changes

COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Re-define API key

# Define BigQuery Schema based on _convertHistoricalRecord output
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE")
]

# ==============================================================================
# CoinCap API Library Code (Modified to use API Key and add get_asset_history)
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        # Initialize attributes from the data_dict
        for key in CoinCapAsset.keys:
            # Use .get() to safely retrieve values, in case a key is missing
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        # If 'data' exists and is not None, parse it into a CoinCapAsset object
        if 'data' in self.__dict__ and self.data is not None:
            self.data = CoinCapAsset(self.data)
        else:
            self.data = None
        # Convert timestamp from milliseconds to seconds for general use
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        # CoinCap has a rate limit of 2000 requests per minute, so ~0.03 seconds per request
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        # Implement a basic rate limit check
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            # Wait if we're hitting the rate limit too fast
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        # For CoinCap API, some string values might already be clean, but strip() is safe.
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    # Add the CoinCap API response timestamp
    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    # Return only non-null values, as BQ handles missing fields
    return {k: v for k, v in row.items() if v is not None}

def _convertHistoricalRecord(historical_data_point, asset_id):
    """Converts a single historical data point into a dictionary suitable for BigQuery.

    Args:
        historical_data_point (dict): A dictionary representing a single historical entry from the CoinCap API response.
                                      Expected keys: 'priceUsd', 'time', and potentially others like 'volume'.
        asset_id (str): The ID of the asset (e.g., "bitcoin").

    Returns:
        dict: A dictionary containing the formatted historical record, or an empty dict if conversion fails.
    """
    if not historical_data_point:
        return {}

    # Extract and convert priceUsd
    price_usd = _convert(historical_data_point.get('priceUsd'), float)

    # Extract and convert timestamp from milliseconds to formatted datetime string
    timestamp_ms = historical_data_point.get('time')
    coincap_query_time_bq = None
    if timestamp_ms is not None:
        coincap_query_time_bq = _convertTimestamp(timestamp_ms / 1000)

    # Construct the record dictionary
    record = {
        'id': asset_id,
        'priceUsd': price_usd,
        'coincap_query_time_bq': coincap_query_time_bq
    }

    # Optionally add other fields if available in historical_data_point and relevant
    # For instance, if 'volume' is present and we want to include it:
    # record['volumeUsd24Hr'] = _convert(historical_data_point.get('volume'), float)

    # Return only non-null values
    return {k: v for k, v in record.items() if v is not None}

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

# Redefine load_gcs_to_bigquery (if it uses any global variables that might be affected)
# This function does not directly depend on the CoinCapApi class but is part of the pipeline context.
def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin historical data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()

        # Define asset_id for historical data fetch
        asset_id_to_fetch = "bitcoin"

        # Calculate start and end timestamps for the last 7 days
        end_datetime = datetime.datetime.now()
        end_timestamp_ms = int(end_datetime.timestamp() * 1000)
        start_datetime = end_datetime - datetime.timedelta(days=7)
        start_timestamp_ms = int(start_datetime.timestamp() * 1000)

        # Fetch historical data
        historical_data_response = api.get_asset_history(
            asset_id=asset_id_to_fetch,
            interval="d1", # Daily interval
            start=start_timestamp_ms,
            end=end_timestamp_ms
        )

        records = []
        if historical_data_response and 'data' in historical_data_response:
            for data_point in historical_data_response['data']:
                # Use the new _convertHistoricalRecord function
                record = _convertHistoricalRecord(data_point, asset_id_to_fetch)
                if record:
                    records.append(record)

        if not records:
            print("No Bitcoin historical data found. Exiting.")
            return None # Return None if no records

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_historical_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_history_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri # Return the GCS URI

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None

print("‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.")

‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.


**Reasoning**:
The subtask is to update the `BQ_SCHEMA` to match the output of the `_convertHistoricalRecord` function. I have redefined the `BQ_SCHEMA` list to include only `id`, `priceUsd`, and `coincap_query_time_bq` with their appropriate data types and `NULLABLE` mode. The rest of the code in the cell remains the same to maintain self-containment and consistency.



In [None]:
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery

# --- !! CONFIGURE YOUR VARIABLES (re-defined for self-containment) !! ---
PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1"
BQ_DATASET = "467_final_3_2"
BQ_TABLE = "data_1_t"
FLIGHT_RECORD_LIMIT = 500 # Not used in this specific CoinCap context, but kept for consistency if pipeline structure changes

COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Re-define API key

# Define BigQuery Schema based on _convertHistoricalRecord output
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE")
]

# ==============================================================================
# CoinCap API Library Code (Modified to use API Key and add get_asset_history)
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        # Initialize attributes from the data_dict
        for key in CoinCapAsset.keys:
            # Use .get() to safely retrieve values, in case a key is missing
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        # If 'data' exists and is not None, parse it into a CoinCapAsset object
        if 'data' in self.__dict__ and self.data is not None:
            self.data = CoinCapAsset(self.data)
        else:
            self.data = None
        # Convert timestamp from milliseconds to seconds for general use
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        # CoinCap has a rate limit of 2000 requests per minute, so ~0.03 seconds per request
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        # Implement a basic rate limit check
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            # Wait if we're hitting the rate limit too fast
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        # For CoinCap API, some string values might already be clean, but strip() is safe.
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    # Add the CoinCap API response timestamp
    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    # Return only non-null values, as BQ handles missing fields
    return {k: v for k, v in row.items() if v is not None}

def _convertHistoricalRecord(historical_data_point, asset_id):
    """Converts a single historical data point into a dictionary suitable for BigQuery.

    Args:
        historical_data_point (dict): A dictionary representing a single historical entry from the CoinCap API response.
                                      Expected keys: 'priceUsd', 'time', and potentially others like 'volume'.
        asset_id (str): The ID of the asset (e.g., "bitcoin").

    Returns:
        dict: A dictionary containing the formatted historical record, or an empty dict if conversion fails.
    """
    if not historical_data_point:
        return {}

    # Extract and convert priceUsd
    price_usd = _convert(historical_data_point.get('priceUsd'), float)

    # Extract and convert timestamp from milliseconds to formatted datetime string
    timestamp_ms = historical_data_point.get('time')
    coincap_query_time_bq = None
    if timestamp_ms is not None:
        coincap_query_time_bq = _convertTimestamp(timestamp_ms / 1000)

    # Construct the record dictionary
    record = {
        'id': asset_id,
        'priceUsd': price_usd,
        'coincap_query_time_bq': coincap_query_time_bq
    }

    # Optionally add other fields if available in historical_data_point and relevant
    # For instance, if 'volume' is present and we want to include it:
    # record['volumeUsd24Hr'] = _convert(historical_data_point.get('volume'), float)

    # Return only non-null values
    return {k: v for k, v in record.items() if v is not None}

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

# Redefine load_gcs_to_bigquery (if it uses any global variables that might be affected)
# This function does not directly depend on the CoinCapApi class but is part of the pipeline context.
def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin historical data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()

        # Define asset_id for historical data fetch
        asset_id_to_fetch = "bitcoin"

        # Calculate start and end timestamps for the last 7 days
        end_datetime = datetime.datetime.now()
        end_timestamp_ms = int(end_datetime.timestamp() * 1000)
        start_datetime = end_datetime - datetime.timedelta(days=7)
        start_timestamp_ms = int(start_datetime.timestamp() * 1000)

        # Fetch historical data
        historical_data_response = api.get_asset_history(
            asset_id=asset_id_to_fetch,
            interval="d1", # Daily interval
            start=start_timestamp_ms,
            end=end_timestamp_ms
        )

        records = []
        if historical_data_response and 'data' in historical_data_response:
            for data_point in historical_data_response['data']:
                # Use the new _convertHistoricalRecord function
                record = _convertHistoricalRecord(data_point, asset_id_to_fetch)
                if record:
                    records.append(record)

        if not records:
            print("No Bitcoin historical data found. Exiting.")
            return None # Return None if no records

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_historical_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_history_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri # Return the GCS URI

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None

print("‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.")


‚úÖ CoinCap API class and pipeline functions redefined with _convertHistoricalRecord and updated run_full_pipeline_without_bq_load.


## Execute Updated Pipeline

### Subtask:
Run the complete data pipeline (API -> GCS -> BigQuery) with the modified CoinCapApi and run_full_pipeline_without_bq_load functions to fetch, process, and load the historical Bitcoin data.


**Reasoning**:
The subtask requires executing the complete data pipeline, including checking for and creating the BigQuery table if it doesn't exist, fetching historical data, uploading to GCS, and loading into BigQuery. I will redefine the relevant code block to incorporate the table creation/check logic before proceeding with the data pipeline execution.



In [None]:
import os
import json
import logging
import datetime
import time
import requests
from collections import defaultdict
from google.cloud import storage, bigquery
from google.cloud.exceptions import NotFound # Import NotFound for specific error handling

# --- !! CONFIGURE YOUR VARIABLES (re-defined for self-containment) !! ---
PROJECT_ID = "directed-bongo-471119-d1"
GCP_REGION = "us-central1"
GCS_BUCKET_NAME = "467_final_3_2"
GCS_FOLDER_PATH = "data_1"
BQ_DATASET = "467_final_3_2"
BQ_TABLE = "data_1_t"
FLIGHT_RECORD_LIMIT = 500 # Not used in this specific CoinCap context, but kept for consistency if pipeline structure changes

COINCAP_API_KEY = "7aaeaf52f203c5f61c86c294210dada652fc5adc8d566875faa9eedebbbdf761" # Re-define API key

# Define BigQuery Schema based on _convertHistoricalRecord output
BQ_SCHEMA = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("priceUsd", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("coincap_query_time_bq", "TIMESTAMP", mode="NULLABLE")
]

# ==============================================================================
# CoinCap API Library Code (Modified to use API Key and add get_asset_history)
# ==============================================================================

logger = logging.getLogger('coincap_api')
logger.addHandler(logging.NullHandler())

class CoinCapAsset(object):
    """Represents a single cryptocurrency asset from the CoinCap API data."""
    keys = ["id", "rank", "symbol", "name", "supply", "maxSupply",
            "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr",
            "vwap24Hr", "explorer"]

    def __init__(self, data_dict):
        # Initialize attributes from the data_dict
        for key in CoinCapAsset.keys:
            # Use .get() to safely retrieve values, in case a key is missing
            setattr(self, key, data_dict.get(key))

class CoinCapResponse(object):
    """Represents the overall response from the CoinCap API for an asset."""
    def __init__(self, j):
        self.__dict__ = j
        # If 'data' exists and is not None, parse it into a CoinCapAsset object
        if 'data' in self.__dict__ and self.data is not None:
            self.data = CoinCapAsset(self.data)
        else:
            self.data = None
        # Convert timestamp from milliseconds to seconds for general use
        if 'timestamp' in self.__dict__ and self.timestamp is not None:
            self.timestamp_seconds = int(self.timestamp / 1000)
        else:
            self.timestamp_seconds = None


class CoinCapApi(object):
    def __init__(self):
        self._api_url = "https://rest.coincap.io/v3"
        self._last_requests = defaultdict(lambda: 0)
        # CoinCap has a rate limit of 2000 requests per minute, so ~0.03 seconds per request
        self._rate_limit_seconds = 60 / 2000 # 0.03 seconds per request

    def _get_json(self, url_post, callee, params=None):
        # Implement a basic rate limit check
        time_since_last_request = time.time() - self._last_requests[callee]
        if time_since_last_request < self._rate_limit_seconds:
            # Wait if we're hitting the rate limit too fast
            sleep_time = self._rate_limit_seconds - time_since_last_request
            time.sleep(sleep_time)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'application/json',
            'Authorization': f'Bearer {COINCAP_API_KEY}'
        }

        r = requests.get(f"{self._api_url}{url_post}", headers=headers, params=params, timeout=60.00)
        if r.status_code == 200:
            self._last_requests[callee] = time.time()
            json_response = r.json()
            return json_response
        logger.debug(f"Response not OK. Status {r.status_code} - {r.reason}")
        print(f"DEBUG: API call to {f'{self._api_url}{url_post}'} failed with status {r.status_code}: {r.reason}. Response text: {r.text}")
        return None

    def get_asset(self, asset_id="bitcoin"):
        """Fetches data for a specific cryptocurrency asset."""
        json_data = self._get_json(f"/assets/{asset_id}", self.get_asset)
        return CoinCapResponse(json_data) if json_data else None

    def get_asset_history(self, asset_id, interval, start=None, end=None):
        """Fetches historical data for a specific cryptocurrency asset.

        Args:
            asset_id (str): The ID of the asset (e.g., "bitcoin").
            interval (str): The time interval for historical data (e.g., 'd1', 'h1', 'm1').
            start (int, optional): Unix timestamp (milliseconds) for the start of the period.
            end (int, optional): Unix timestamp (milliseconds) for the end of the period.

        Returns:
            dict: The JSON response containing historical data, or None if an error occurs.
        """
        url_post = f"/assets/{asset_id}/history"
        params = {'interval': interval}

        if start is not None:
            params['start'] = int(start)
        if end is not None:
            params['end'] = int(end)

        json_data = self._get_json(url_post, self.get_asset_history, params=params)
        return json_data

# ==============================================================================
# Data Parser Functions
# ==============================================================================

def _convertTimestamp(timestamp_seconds):
    """Converts a Unix timestamp (in seconds) to a formatted datetime string."""
    if timestamp_seconds is not None:
        try:
            return datetime.datetime.fromtimestamp(timestamp_seconds).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            logger.error(f"Error converting timestamp {timestamp_seconds}", exc_info=True)
            pass
    return None

def _convert(data, dataType):
    """Converts data to the specified dataType, handling None and type conversion errors."""
    if data is None:
        return None
    if dataType == str:
        # For CoinCap API, some string values might already be clean, but strip() is safe.
        return str(data).strip()
    try:
        return dataType(data)
    except Exception:
        logger.error(f"Error converting data '{data}' to {dataType.__name__}", exc_info=True)
        return None

def _convertRow(coincap_response):
    """Converts a CoinCapResponse object into a dictionary suitable for BigQuery.
    Assumes coincap_response contains a CoinCapAsset object under .data and a timestamp_seconds.
    """
    if coincap_response is None or coincap_response.data is None:
        return {}

    asset_data = coincap_response.data
    row = {
        'id': _convert(asset_data.id, str),
        'rank': _convert(asset_data.rank, int),
        'symbol': _convert(asset_data.symbol, str),
        'name': _convert(asset_data.name, str),
        'supply': _convert(asset_data.supply, float),
        'maxSupply': _convert(asset_data.maxSupply, float),
        'marketCapUsd': _convert(asset_data.marketCapUsd, float),
        'volumeUsd24Hr': _convert(asset_data.volumeUsd24Hr, float),
        'priceUsd': _convert(asset_data.priceUsd, float),
        'changePercent24Hr': _convert(asset_data.changePercent24Hr, float),
        'vwap24Hr': _convert(asset_data.vwap24Hr, float),
        'explorer': _convert(asset_data.explorer, str)
    }

    # Add the CoinCap API response timestamp
    if coincap_response.timestamp_seconds is not None:
        row['coincap_query_time_bq'] = _convertTimestamp(coincap_response.timestamp_seconds)

    # Return only non-null values, as BQ handles missing fields
    return {k: v for k, v in row.items() if v is not None}

def _convertHistoricalRecord(historical_data_point, asset_id):
    """Converts a single historical data point into a dictionary suitable for BigQuery.

    Args:
        historical_data_point (dict): A dictionary representing a single historical entry from the CoinCap API response.
                                      Expected keys: 'priceUsd', 'time', and potentially others like 'volume'.
        asset_id (str): The ID of the asset (e.g., "bitcoin").

    Returns:
        dict: A dictionary containing the formatted historical record, or an empty dict if conversion fails.
    """
    if not historical_data_point:
        return {}

    # Extract and convert priceUsd
    price_usd = _convert(historical_data_point.get('priceUsd'), float)

    # Extract and convert timestamp from milliseconds to formatted datetime string
    timestamp_ms = historical_data_point.get('time')
    coincap_query_time_bq = None
    if timestamp_ms is not None:
        coincap_query_time_bq = _convertTimestamp(timestamp_ms / 1000)

    # Construct the record dictionary
    record = {
        'id': asset_id,
        'priceUsd': price_usd,
        'coincap_query_time_bq': coincap_query_time_bq
    }

    # Optionally add other fields if available in historical_data_point and relevant
    # For instance, if 'volume' is present and we want to include it:
    # record['volumeUsd24Hr'] = _convert(historical_data_point.get('volume'), float)

    # Return only non-null values
    return {k: v for k, v in record.items() if v is not None}

# Re-initialize GCP clients (ensuring they are available in this scope)
storage_client = storage.Client(project=PROJECT_ID)
bq_client = bigquery.Client(project=PROJECT_ID)

# Redefine load_gcs_to_bigquery (if it uses any global variables that might be affected)
# This function does not directly depend on the CoinCapApi class but is part of the pipeline context.
def load_gcs_to_bigquery(gcs_uri, project_id, bq_dataset, bq_table, bq_schema, bq_client_instance):
    """Loads data from a GCS URI into a BigQuery table."""
    print(f"\nStep: Loading data from GCS into BigQuery...")
    print(f"  > Source: {gcs_uri}")
    print(f"  > Target: {bq_dataset}.{bq_table}")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.schema = bq_schema
    job_config.autodetect = False
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = bq_client_instance.load_table_from_uri(
        gcs_uri,
        f"{project_id}.{bq_dataset}.{bq_table}",
        job_config=job_config
    )

    print(f"  > Starting BQ Load Job: {load_job.job_id}")
    load_job.result()
    print(f"  > Job complete. Loaded {load_job.output_rows} rows.")
    print("‚úÖ GCS to BigQuery Load Finished Successfully.")

def run_full_pipeline_without_bq_load():
    """Executes the API -> GCS pipeline for CoinCap data and returns GCS URI."""

    try:
        # ======================================================
        # 1. Download data from CoinCap API
        # ======================================================
        print(f"Step 1: Fetching Bitcoin historical data from CoinCap API...")
        api = CoinCapApi()
        queryTime = datetime.datetime.now().timestamp()

        # Define asset_id for historical data fetch
        asset_id_to_fetch = "bitcoin"

        # Calculate start and end timestamps for the last 7 days
        end_datetime = datetime.datetime.now()
        end_timestamp_ms = int(end_datetime.timestamp() * 1000)
        start_datetime = end_datetime - datetime.timedelta(days=365)
        start_timestamp_ms = int(start_datetime.timestamp() * 1000)

        # Fetch historical data
        historical_data_response = api.get_asset_history(
            asset_id=asset_id_to_fetch,
            interval="d1", # Daily interval
            start=start_timestamp_ms,
            end=end_timestamp_ms
        )

        records = []
        if historical_data_response and 'data' in historical_data_response:
            for data_point in historical_data_response['data']:
                # Use the new _convertHistoricalRecord function
                record = _convertHistoricalRecord(data_point, asset_id_to_fetch)
                if record:
                    records.append(record)

        if not records:
            print("No Bitcoin historical data found. Exiting.")
            return None # Return None if no records

        print(f"  > Fetched {len(records)} record(s).")

        # ======================================================
        # 2. Save data to GCS Bucket
        # ======================================================
        local_filename = "bitcoin_historical_data.jsonl"
        with open(local_filename, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

        gcs_filename = f"{GCS_FOLDER_PATH}/coincap_history_batch_{int(queryTime)}.jsonl"

        print(f"\nStep 2: Uploading data to GCS...")
        print(f"  > Source: {local_filename}")
        print(f"  > Destination: gs://{GCS_BUCKET_NAME}/{gcs_filename}")

        bucket = storage_client.bucket(GCS_BUCKET_NAME)
        blob = bucket.blob(gcs_filename)
        blob.upload_from_filename(local_filename)

        gcs_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_filename}"
        print("  > Upload complete.")
        print("‚úÖ API to GCS Pipeline Finished Successfully.")
        return gcs_uri # Return the GCS URI

    except Exception as e:
        print(f"\n‚ùå ERROR in pipeline: {e}")
        return None


print("--- Running Full Data Pipeline (API -> GCS -> BigQuery) ---")

# 1. Ensure BigQuery table exists
target_table_id = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

try:
    bq_client.get_table(target_table_id)  # API request
    print(f"‚úÖ BigQuery table '{target_table_id}' already exists.")
except NotFound:
    print(f"BigQuery table '{target_table_id}' not found. Creating table...")
    table = bigquery.Table(target_table_id, schema=BQ_SCHEMA)
    table = bq_client.create_table(table)  # API request
    print(f"‚úÖ Created table {table.project}.{table.dataset_id}.{table.table_id}")
except Exception as e:
    print(f"‚ùå Error checking or creating BigQuery table: {e}")
    exit() # Exit if table cannot be ensured

# 2. Execute API -> GCS pipeline
gcs_uri_for_bq_load = run_full_pipeline_without_bq_load()

if gcs_uri_for_bq_load:
    # 3. Load data from GCS to BigQuery
    load_gcs_to_bigquery(
        gcs_uri_for_bq_load,
        PROJECT_ID,
        BQ_DATASET,
        BQ_TABLE,
        BQ_SCHEMA,
        bq_client
    )
    print("‚úÖ Pipeline Finished Successfully.")
else:
    print("‚ùå Pipeline aborted: No data fetched or GCS upload failed.")

--- Running Full Data Pipeline (API -> GCS -> BigQuery) ---
‚úÖ BigQuery table 'directed-bongo-471119-d1.467_final_3_2.data_1_t' already exists.
Step 1: Fetching Bitcoin historical data from CoinCap API...
  > Fetched 366 record(s).

Step 2: Uploading data to GCS...
  > Source: bitcoin_historical_data.jsonl
  > Destination: gs://467_final_3_2/data_1/coincap_history_batch_1765634428.jsonl
  > Upload complete.
‚úÖ API to GCS Pipeline Finished Successfully.

Step: Loading data from GCS into BigQuery...
  > Source: gs://467_final_3_2/data_1/coincap_history_batch_1765634428.jsonl
  > Target: 467_final_3_2.data_1_t
  > Starting BQ Load Job: 895c6232-51f6-482e-96c2-76dcccb4dcd6
  > Job complete. Loaded 366 rows.
‚úÖ GCS to BigQuery Load Finished Successfully.
‚úÖ Pipeline Finished Successfully.


# Visualize

In [6]:
import plotly.express as px
from google.cloud import bigquery
import pandas as pd

# Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID)

# Construct the full table ID
table_id = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

# SQL query to fetch Bitcoin price data
query = f"""
SELECT
    coincap_query_time_bq AS Timestamp,
    priceUsd AS BitcoinPrice
FROM
    `{table_id}`
WHERE
    id = 'bitcoin'
ORDER BY
    Timestamp ASC
"""

# Execute the query and load into a Pandas DataFrame
try:
    df_bitcoin_price = client.query(query).to_dataframe()
    print(f"Fetched {len(df_bitcoin_price)} Bitcoin price records.")
except Exception as e:
    print(f"‚ùå Error fetching data from BigQuery: {e}")
    df_bitcoin_price = pd.DataFrame()

# Generate interactive Plotly chart if data is available
if not df_bitcoin_price.empty:
    fig = px.line(
        df_bitcoin_price,
        x='Timestamp',
        y='BitcoinPrice',
        title='Bitcoin Price Over 2025',
        labels={'Timestamp': 'Date', 'BitcoinPrice': 'Price (USD)'}
    )
    fig.update_layout(
        hovermode="x unified",
        xaxis_title="Date",
        yaxis_title="Price (USD)"
    )
    fig.show()
else:
    print("No data available to plot.")


Fetched 366 Bitcoin price records.
