### Task 2: A Sample of Owners

#### Overview  
Task 2 extracts a sample of owner transaction records from the Wedge Co-Op dataset in Google BigQuery so that later analysis can run locally on a smaller subset of the data. The process draws a random sample of 400 unique owners, excluding non-owners (identified by card_no == 3), then retrieves every transaction for the sampled owners in batches and writes the results to a single local CSV file. A sample of 400 owners is intended to produce a file of roughly 250MB, small enough to work with comfortably while still retaining the full transaction detail for each sampled owner.
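
The batching step can be illustrated without touching BigQuery. The following is a minimal sketch, assuming a hypothetical helper `make_batches` (not part of the notebook) and placeholder card numbers. It uses a batch size of 150, the value set in the notebook cell, to show how a list of 400 sampled owners is sliced into consecutive batches.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Placeholder card numbers; the real values come from the BigQuery sample.
example_owners = list(range(10000, 10400))  # 400 stand-in owners

batches = make_batches(example_owners, 150)
batch_lengths = [len(b) for b in batches]
print(batch_lengths)  # [150, 150, 100]
```

With these settings the 400 owners are covered by three queries, the last one carrying the remaining 100 card numbers.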

In [5]:
from google.cloud import bigquery
import pandas as pd

# Initialize BigQuery client
client = bigquery.Client()

# Define project and dataset
project_id = "umt-msba"
dataset_id = "transactions"

# Define sample size for owner records (400 owners for approximately 250MB)
sample_size = 400

# Query to sample unique owners, excluding non-owners (card_no == 3)
owner_query = f"""
    WITH unique_owners AS (
        SELECT DISTINCT card_no
        FROM `{project_id}.{dataset_id}.transArchive_*`
        WHERE card_no != 3
    )
    SELECT card_no
    FROM unique_owners
    ORDER BY RAND()
    LIMIT {sample_size}
"""
# Execute the query and load sampled owner data into a DataFrame
sampled_owners_df = client.query(owner_query).to_dataframe()

# Convert the sampled owners to a list of card_no values
owner_list = sampled_owners_df['card_no'].tolist()

# Define batch size for querying transactions
batch_size = 150

def fetch_transactions(owner_batch):
    """
    Fetches transactions for a batch of owners from BigQuery.

    Parameters:
    - owner_batch (list): List of owner card_no values for the batch.

    Returns:
    - DataFrame: DataFrame containing transaction data for the owners in the batch.
    """
    owner_str = ','.join(map(str, owner_batch))
    transaction_query = f"""
        SELECT *
        FROM `{project_id}.{dataset_id}.transArchive_*`
        WHERE card_no IN ({owner_str})
    """
    return client.query(transaction_query).to_dataframe()

# Save the transaction data in batches to avoid memory overload
output_file = 'owner_transactions.csv'
first_write = True

# newline='' stops Python from translating the '\n' line endings on Windows
with open(output_file, 'w', newline='') as f:
    for i in range(0, len(owner_list), batch_size):
        owner_batch = owner_list[i:i + batch_size]
        transaction_df = fetch_transactions(owner_batch)

        # Append this batch to the already-open file handle
        transaction_df.to_csv(f, header=first_write, index=False, lineterminator='\n')
        first_write = False  # Ensure the header is only written once

print(f"Sampled transactions extracted and saved to {output_file}")

Sampled transactions extracted and saved to owner_transactions.csv
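
Once the file is written, a quick check can confirm that it matches the intent described in the overview. The sketch below is an assumption-laden illustration rather than part of the original notebook: `summarize_sample` is a hypothetical helper that reads only the `card_no` column in chunks and reports the row count, the number of distinct owners, whether the non-owner code 3 slipped in, and the file size in megabytes.

```python
import os
import pandas as pd

def summarize_sample(csv_path, chunksize=100_000):
    """Return row count, distinct owner count, and file size for a sample CSV."""
    owners = set()
    n_rows = 0
    # Read only the card_no column, one chunk at a time, to keep memory use low.
    for chunk in pd.read_csv(csv_path, usecols=["card_no"], chunksize=chunksize):
        owners.update(chunk["card_no"].dropna().unique().tolist())
        n_rows += len(chunk)
    return {
        "rows": n_rows,
        "unique_owners": len(owners),
        "contains_non_owner": 3 in owners,  # card_no == 3 marks non-owners
        "size_mb": os.path.getsize(csv_path) / 1e6,
    }

# Only runs when the extract from the cell above is present.
if os.path.exists("owner_transactions.csv"):
    print(summarize_sample("owner_transactions.csv"))
```

If the extraction ran to completion, `unique_owners` should come out at 400 and `contains_non_owner` should be `False`. The reported size depends on which owners were drawn, so it will vary from run to run around the 250MB target.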
