Google BigQuery is a distributed data warehouse built on a serverless architecture. We’ll discuss this framework in class. In this task you’ll upload all Wedge transaction records to Google BigQuery. You’ll want to make sure that the column data types are correctly specified and that you’ve properly handled the null values. 
The requirements for this task change depending on the grade you’re going for. 
Note: this assignment can be done manually or programmatically. Naturally I’d prefer it be done programmatically so that you get more practice, but that’s not required to get full credit. 
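
One way to get the data types and nulls right is to spell both out when reading each file, before anything is uploaded. The sketch below is only an illustration, not the required approach: the three-column sample, the null markers (`NULL` and `\N`), and the column types are all assumptions, so check them against the actual Wedge files.

```python
import io
import pandas as pd

# Hypothetical three-column sample; the real Wedge files have many more columns.
SAMPLE_CSV = """datetime,card_no,total
2010-01-01 09:15:00,12345,3.99
2010-01-01 09:16:30,NULL,\\N
"""

# Assumed null markers and column types; adjust both to match the actual files.
NULL_MARKERS = ["NULL", r"\N"]
COLUMN_TYPES = {"card_no": "Int64", "total": "float64"}


def read_clean_csv(source):
    """Read a CSV, turning null markers into missing values and fixing dtypes."""
    return pd.read_csv(
        source,
        na_values=NULL_MARKERS,   # treated as missing, in addition to pandas' defaults
        dtype=COLUMN_TYPES,       # "Int64" is pandas' nullable integer type
        parse_dates=["datetime"],
    )


df = read_clean_csv(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
print(df.isna().sum())
```

The nullable `Int64` type keeps an integer column as integers even when some rows are missing; with the default `int64`, pandas would fall back to floats as soon as a null appears.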

In [1]:
import pandas as pd
import zipfile
from google.cloud import bigquery
from google.oauth2 import service_account
from google.api_core.exceptions import NotFound
import os



In [5]:

# os.path.join avoids backslash escapes in the string and works on any OS
zip_path = os.path.join('Data', 'wedge-clean-files.zip')  # Replace with your zip file path
extract_path = 'Data'   # Replace with your desired extract path

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

In [4]:
#export GOOGLE_APPLICATION_CREDENTIALS='wedge-project-403222-85fe5b35980b.json'

service_path = ""
service_file = 'wedge-project-403222-80aeb3085a6a.json' # change this to your authentication information  

gbq_proj_id = 'wedge-project-403222'  

# And this should stay the same. 
private_key = service_path + service_file

# Now we pass in our credentials so that Python has permission to access our project.
credentials = service_account.Credentials.from_service_account_file(private_key)

# And finally we establish our connection
client = bigquery.Client(credentials = credentials, project=gbq_proj_id)


In [5]:
# Check if the dataset exists
dataset_id = 'Transactions'
# client.dataset() is deprecated; build the reference directly instead
dataset_ref = bigquery.DatasetReference(client.project, dataset_id)

try:
    client.get_dataset(dataset_ref)
    print(f"Dataset {dataset_id} already exists.")
except NotFound:
    # Create the dataset if it does not exist
    dataset = bigquery.Dataset(dataset_ref)
    dataset = client.create_dataset(dataset)
    print(f"Dataset {dataset_id} created.")

Dataset Transactions already exists.


In [14]:
# BigQuery client
#client = bigquery.Client()

# Path to the directory where files are extracted
files_path = os.path.join('Data', 'clean-files')  # Update this to your path

# Loop through the files and upload each to BigQuery
for filename in os.listdir(files_path):
    if filename.endswith('.csv'):  # Assuming files are in CSV format
 
        file_path = os.path.join(files_path, filename)
        dataframe = pd.read_csv(file_path, low_memory=False)

        #dataframe['matched'] = dataframe['matched'].apply(lambda x: bytes([int(x)]) if not pd.isna(x) else x)

        # The project comes from client.project below; the table is named after the file (minus ".csv")
        dataset_id = 'Transactions'
        table_id = os.path.splitext(filename)[0]

        # Define the full table ID
        table_full_id = f"{client.project}.{dataset_id}.{table_id}"

        # If the table does not exist, it will be created. If it exists, data will be appended.
        job = client.load_table_from_dataframe(dataframe, table_full_id)

        # Wait for the job to complete
        job.result()
        print(f"Uploaded {filename} to {table_full_id}")

Uploaded transArchive_201001_201003_clean.csv to wedge-project-403222.Transactions.transArchive_201001_201003_clean
Uploaded transArchive_201004_201006_clean.csv to wedge-project-403222.Transactions.transArchive_201004_201006_clean
Uploaded transArchive_201007_201009_clean.csv to wedge-project-403222.Transactions.transArchive_201007_201009_clean
Uploaded transArchive_201010_201012_clean.csv to wedge-project-403222.Transactions.transArchive_201010_201012_clean
Uploaded transArchive_201101_201103_clean.csv to wedge-project-403222.Transactions.transArchive_201101_201103_clean
Uploaded transArchive_201104_clean.csv to wedge-project-403222.Transactions.transArchive_201104_clean
Uploaded transArchive_201105_clean.csv to wedge-project-403222.Transactions.transArchive_201105_clean
Uploaded transArchive_201106_clean.csv to wedge-project-403222.Transactions.transArchive_201106_clean
Uploaded transArchive_201107_201109_clean.csv to wedge-project-403222.Transactions.transArchive_201107_201109_clea