## Task 2: A Sample of Owners
These files are not easy to use in their current chronological arrangement, though loading them into a large system like GBQ will solve a lot of our problems. Nevertheless, it’ll be convenient to have a local sample of owners to work with.

This task asks you to generate a file of owners where the file contains every record for each owner. There will be more than one owner in the file, and I do not want you to include card_no==3, which is the code for non-owners. The size of the sample is up to you, but I’d recommend shooting for a sample that’s around 250 MB. That’s big enough to be rich, but small enough to be fast. Ish.

### Deliverable
A Python script that handles the following tasks:

1. Connects to your GBQ instance.
2. Builds a list of owners.
3. Takes a sample of the owners.
4. Extracts all records associated with those owners and writes them to a local text file.

You’ll submit your code carrying out the steps.

In [12]:
from google.cloud import bigquery
import os
import pandas as pd


In [8]:
# Set up a client
client = bigquery.Client(project = "umt-msba")

In [9]:
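# Build the list of owners, excluding card_no 3 (the code for non-owners)
# DISTINCT already returns one row per card_no, so no GROUP BY is needed
query = """
    SELECT DISTINCT card_no
    FROM `umt-msba.transactions.transArchive_*`
    WHERE card_no != 3;
"""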
# Execute the query
try:
    query_job = client.query(query)  # Start the query job
    df = query_job.to_dataframe()  # Convert the result to a pandas DataFrame

    # View the DataFrame
    print(df.head())  # Show the first 5 rows of the DataFrame

except Exception as e:
    print(f"Query failed: {e}")

   card_no
0  46261.0
1  46428.0
2  47794.0
3  47848.0
4  48313.0


### Sample the Owners

In [None]:
sampled_owners = df.sample(n=500, random_state=1)  # Sample 500 owners; random_state makes the draw reproducible

owners_list = sampled_owners['card_no'].tolist()

# Write query to extract all records for the sampled owners
query = f"""
    SELECT *
    FROM `umt-msba.transactions.transArchive_*`
    WHERE card_no IN ({', '.join(map(str, owners_list))});
"""

# Execute the query to get all records for the sampled owners
try:
    query_job = client.query(query)  # Start the query job
    records_df = query_job.to_dataframe()  # Convert the result to a pandas DataFrame

    # The records are written to disk in the next cell
    print(f"Retrieved {len(records_df)} records for {len(owners_list)} sampled owners")

except Exception as e:
    print(f"Query failed: {e}")
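The query above builds its `IN (...)` list with `', '.join(map(str, owners_list))`. Because `card_no` comes back as a float, a missing value would be rendered as `nan` and is likely to break the SQL. The following is a minimal sketch of a more defensive way to build that list. The helper name `build_owner_in_list` is hypothetical, and it assumes card numbers are whole numbers, as the `46261.0`-style output suggests.

```python
import math

NON_OWNER_CARD_NO = 3  # code for non-owners, per the task description


def build_owner_in_list(card_numbers):
    """Return a comma-separated string of card numbers for a SQL IN (...) clause.

    Hypothetical helper: drops missing values and the non-owner code,
    removes duplicates, and renders each card number as an integer.
    """
    cleaned = set()
    for value in card_numbers:
        # NULL card numbers arrive from pandas as None or NaN
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        card_no = int(value)  # assumes card numbers are whole numbers
        if card_no == NON_OWNER_CARD_NO:
            continue
        cleaned.add(card_no)
    return ", ".join(str(card_no) for card_no in sorted(cleaned))


example = build_owner_in_list([46261.0, 46428.0, float("nan"), 3.0, 46261.0])
print(example)
```

The result could be dropped into the f-string in place of the `join` expression. An empty result would produce `IN ()`, which is not valid SQL, so it is worth checking that the list is non-empty before running the query.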


In [16]:
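# Write the records to a local tab-delimited text file
file_path = 'Data/sampled_owners_records.txt'
os.makedirs(os.path.dirname(file_path), exist_ok=True)  # Create Data/ if it does not exist yet
records_df.to_csv(file_path, index=False, sep='\t')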

# Get and print the size of the file
file_size = os.path.getsize(file_path)
print(f"Records successfully written to {file_path}")
print(f"File size: {file_size / (1024 * 1024):.2f} MB")  # Convert from bytes to MB

Records successfully written to Data/sampled_owners_records.txt
File size: 302.59 MB
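The file came out at about 303 MB for 500 owners, a little above the suggested 250 MB. One rough way to pick a new sample size is to scale the owner count by the ratio of target size to observed size. This is only a sketch: the function name `scale_sample_size` is hypothetical, and it assumes file size grows roughly in proportion to the number of owners, which holds only on average because owners differ a lot in how much they shop.

```python
def scale_sample_size(current_owners, current_mb, target_mb):
    """Estimate how many owners to sample to land near target_mb.

    Assumes file size is roughly proportional to the number of owners.
    """
    return max(1, round(current_owners * target_mb / current_mb))


# 500 owners produced 302.59 MB; aim for about 250 MB
n_owners = scale_sample_size(current_owners=500, current_mb=302.59, target_mb=250)
print(n_owners)
```

The estimate could then be passed as `n` to `df.sample`, after which the extraction is rerun and the file size checked again.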


### Careful!
- Sample from one set of data only: the list of owners (distinct card_no values).
- Do not sample the transaction records themselves. For each sampled card_no, make sure the file has all of its records, not just one. Don't sample by row.
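To illustrate the difference between sampling owners and sampling rows, here is a small sketch on made-up data. The `toy_transactions` frame and its values are invented for illustration and are not taken from the real tables.

```python
import pandas as pd

# Made-up transactions: three owners with 3, 2, and 4 records each
toy_transactions = pd.DataFrame({
    "card_no": [101, 101, 101, 202, 202, 303, 303, 303, 303],
    "total": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

# Owner-level sampling: sample the distinct card numbers, then keep
# every record that belongs to a sampled card number
toy_owners = toy_transactions["card_no"].drop_duplicates()
toy_sampled_owners = toy_owners.sample(n=2, random_state=1)
owner_sample = toy_transactions[toy_transactions["card_no"].isin(toy_sampled_owners)]

# Check: each sampled owner has as many records in the sample as in the full data
full_counts = toy_transactions.groupby("card_no").size()
sample_counts = owner_sample.groupby("card_no").size()
assert (sample_counts == full_counts.loc[sample_counts.index]).all()

# Row-level sampling, by contrast, gives no such guarantee: an owner
# can appear in the sample with only some of their records
row_sample = toy_transactions.sample(n=4, random_state=1)
```

The same record-count check could be run on the real sample by comparing per-owner counts in the local file against counts queried from GBQ.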
