## Task 2: A Sample of Owners

To create a more convenient local sample of owners to work with, this task builds a file containing every record for each sampled owner. It starts from the full list of card numbers, draws a random sample from that list, and extracts all records associated with the sampled owners. It then writes the extracted data to a local text file, which makes further analysis easier.


#### Import Required Libraries

In [1]:
from google.cloud import bigquery
import os
import pandas as pd


#### Configure Google BigQuery Project

In [2]:
# Set up a client
client = bigquery.Client(project = "wedge-project-np")


#### Query for Card Numbers

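This query returns the distinct card numbers in the `card_no` column, excluding `card_no` 3. Rows with no card number are dropped as well, because the comparison `card_no != 3` evaluates to NULL for them.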

In [4]:
query = """
    SELECT DISTINCT card_no
    FROM `umt-msba.transactions.transArchive_*`
    WHERE card_no != 3;
"""
# Execute the query
try:
    query_job = client.query(query)  # Start the query job
    df = query_job.to_dataframe()  # Convert the result to a pandas DataFrame

    # View the DataFrame
    print(df.head())  # Show the first 5 rows of the DataFrame

except Exception as e:
    print(f"Query failed: {e}")

   card_no
0  11769.0
1  21003.0
2  19750.0
3  22112.0
4  14258.0
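
The filter semantics of the query above can be checked offline. The sketch below is a stand-in built on assumptions: it uses Python's built-in `sqlite3` module rather than BigQuery, with a hypothetical `trans` table and made-up card numbers. It illustrates that `SELECT DISTINCT` de-duplicates on its own and that `card_no != 3` drops both card 3 and rows with no card number.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery tables (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (card_no REAL)")

# Hypothetical rows: a duplicate, the excluded card 3, and a missing card number.
rows = [(11769.0,), (11769.0,), (3.0,), (None,), (21003.0,)]
conn.executemany("INSERT INTO trans (card_no) VALUES (?)", rows)

# Same shape as the notebook query, with ORDER BY added for a stable result.
cursor = conn.execute(
    "SELECT DISTINCT card_no FROM trans WHERE card_no != 3 ORDER BY card_no"
)
distinct_cards = [row[0] for row in cursor.fetchall()]
conn.close()

print(distinct_cards)
```

SQLite and BigQuery differ in many ways, so this only demonstrates the standard SQL treatment of `DISTINCT` and NULL comparisons, not the behavior of the real tables.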


#### Sample the Owners

This code block draws a random sample of 500 card numbers from the full `card_no` list and pulls every row associated with the sampled card numbers. It then saves the result to a tab-delimited text file.

In [6]:
sampled_owners = df.sample(n=500, random_state=1)  

owners_list = sampled_owners['card_no'].tolist()

file_path = 'Data/sampled_owners_records.txt'

# Write query to extract all records for the sampled owners
query = f"""
    SELECT *
    FROM `umt-msba.transactions.transArchive_*`
    WHERE card_no IN ({', '.join(map(str, owners_list))});
"""

# Execute the query to get all associated records
try:
    query_job = client.query(query)  # Start the query job
    records_df = query_job.to_dataframe()  # Convert the result to a pandas DataFrame

    # Write the records to a local tab-delimited text file
    records_df.to_csv(file_path, index=False, sep='\t')

    print("Records successfully written to sampled_owners_records.txt")

except Exception as e:
    print(f"Query failed: {e}")

Records successfully written to sampled_owners_records.txt
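
Because `card_no` comes back from the first query as floats, the `IN (...)` list built above contains values such as `11769.0`. One possible variation, sketched here with a hypothetical helper `format_card_list` (not part of the original notebook), renders whole-number card values as integers before joining them. It assumes every card number is a finite whole number; the example values are made up.

```python
def format_card_list(card_numbers):
    """Join card numbers into a comma-separated string for a SQL IN (...) list.

    Hypothetical helper: assumes each value is a finite whole number,
    possibly stored as a float (for example 11769.0).
    """
    formatted = []
    for value in card_numbers:
        as_int = int(value)
        if as_int != value:
            raise ValueError(f"card number is not a whole number: {value!r}")
        formatted.append(str(as_int))
    return ", ".join(formatted)


# Made-up sample values in the same shape as owners_list
example_owners = [11769.0, 21003.0, 19750.0]
in_clause = format_card_list(example_owners)
print(f"WHERE card_no IN ({in_clause})")
```

In the notebook this would take the place of `', '.join(map(str, owners_list))`. The comparison is expected to behave the same either way, though that has not been verified here against the real tables.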


#### Toolbox

In [7]:
###################################################################
####### Tool to Check File Size  ##################################
###################################################################


# Get and print the size of the file
file_size = os.path.getsize(file_path)
print(f"Records successfully written to {file_path}")
print(f"File size: {file_size / (1024 * 1024):.2f} MB")  # Convert from bytes to MB

Records successfully written to Data/sampled_owners_records.txt
File size: 219.72 MB


In [None]:
###################################################################
####### Tool to Test a Card_no  ###################################
###################################################################

card_no_to_filter = 13062

# Filter the DataFrame to include only rows with the specified card_no
filtered_df = records_df[records_df['card_no'] == card_no_to_filter]

# Display the shape and head of the filtered DataFrame
print(filtered_df.shape)
print(filtered_df.head())
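
The single-card check above can be extended to the whole sample. The following is a minimal sketch, assuming a hypothetical helper `check_sample_coverage` (not part of the original notebook) and made-up card numbers. It compares the sampled card numbers with the card numbers that actually appear in the extract.

```python
def check_sample_coverage(sampled_cards, extracted_cards):
    """Compare the sampled card numbers with those present in the extract.

    Returns (missing, unexpected): sampled cards with no records, and
    cards in the extract that were never sampled.
    """
    sampled = set(sampled_cards)
    extracted = set(extracted_cards)
    missing = sorted(sampled - extracted)
    unexpected = sorted(extracted - sampled)
    return missing, unexpected


# Made-up values standing in for owners_list and records_df['card_no']
sampled_example = [11769.0, 21003.0, 19750.0]
extracted_example = [11769.0, 11769.0, 21003.0, 14258.0]

missing, unexpected = check_sample_coverage(sampled_example, extracted_example)
print("Sampled cards with no records:", missing)
print("Cards not in the sample:", unexpected)
```

With the notebook's own objects the call would be `check_sample_coverage(owners_list, records_df['card_no'])`. Both lists would be expected to come back empty, since the sample was drawn from the same tables that the records are extracted from.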