## Task 1: Building a Transaction Database in Google BigQuery  
  
#### Overview  
This task programmatically uploads Wedge transaction records to Google BigQuery. The process extracts the transaction data from nested zip archives, standardizes delimiters and column headers, normalizes null values, and checks column data types before loading each file into BigQuery.

##### Step 1: Extract the Main Zip File  

In [3]:
import zipfile
import os
import shutil

def extract_main_zip(main_zip_file, extract_to_folder):
    """
    Extracts the contents of the main zip file to the specified folder.
    
    Parameters:
    - main_zip_file (str): Path to the main zip file.
    - extract_to_folder (str): Folder where the extracted contents will be saved.

    The function handles the following:
    - Skips files that already exist.
    - Creates directories as needed.
    - Extracts each file from the zip archive.
    """
    # Ensure the extraction folder exists
    os.makedirs(extract_to_folder, exist_ok=True)

    # Open the main zip file
    with zipfile.ZipFile(main_zip_file, 'r') as main_zip:
        # Loop through all files in the main zip file
        for zip_info in main_zip.infolist():
            # Create the full output path
            output_file_path = os.path.join(extract_to_folder, zip_info.filename)
            
            # Check if the file or folder already exists
            if os.path.exists(output_file_path):
                print(f"Skipping {zip_info.filename}, already exists.")
                continue

            # Check if it's a directory, then create it without extraction
            if zip_info.is_dir():
                os.makedirs(output_file_path, exist_ok=True)
                print(f"Created directory {output_file_path}")
            else:
                # Create any necessary directories
                os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
                
                # Extract the file
                with main_zip.open(zip_info) as source, open(output_file_path, 'wb') as target:
                    shutil.copyfileobj(source, target)
                print(f"Extracted {zip_info.filename} to {output_file_path}")

# Input definitions: 
main_zip = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/WedgeZipOfZips.zip'
extract_folder = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_main_zip'

extract_main_zip(main_zip, extract_folder)

Skipping transArchive_201001_201003.zip, already exists.
Skipping transArchive_201004_201006.zip, already exists.
Skipping transArchive_201007_201009.zip, already exists.
Skipping transArchive_201010_201012.zip, already exists.
Skipping transArchive_201101_201103.zip, already exists.
Skipping transArchive_201104.zip, already exists.
Skipping transArchive_201105.zip, already exists.
Skipping transArchive_201106.zip, already exists.
Skipping transArchive_201107_201109.zip, already exists.
Skipping transArchive_201110_201112.zip, already exists.
Skipping transArchive_201201_201203.zip, already exists.
Extracted transArchive_201201_201203_inactive.zip to D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_main_zip\transArchive_201201_201203_inactive.zip
Skipping transArchive_201204_201206.zip, already exists.
Extracted transArchive_201204_201206_inactive.zip to D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_main_zip\transArchive_201204_201206_inactive.zip
Skipping 
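
The cell above writes each member with a manual `open(output_file_path, 'wb')`, so the path checks that `ZipFile.extract` performs on member names do not apply. A minimal sketch of a guard that could run before writing, assuming a hypothetical helper named `safe_join` (not part of the original notebook):

```python
import os

def safe_join(base_folder, member_name):
    """Join member_name onto base_folder, refusing paths that escape it."""
    base = os.path.realpath(base_folder)
    target = os.path.realpath(os.path.join(base, member_name))
    try:
        # commonpath raises ValueError on Windows when the drives differ,
        # which also means the target lies outside the base folder
        inside = os.path.commonpath([base, target]) == base
    except ValueError:
        inside = False
    if not inside:
        raise ValueError(f"Unsafe path in archive: {member_name!r}")
    return target
```

In the extraction loop, `safe_join(extract_to_folder, zip_info.filename)` could stand in for the plain `os.path.join` call. For an archive from a trusted source this is a precaution rather than a requirement.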

##### Step 2: Extract the Nested Zip Files

In [4]:
import zipfile
import os

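def extract_all_csvs_to_one_folder(extract_folder, output_folder):
    """
    Extracts every CSV found inside the nested zip files under extract_folder
    into output_folder, skipping CSVs that already exist and zip files that
    cannot be opened.
    """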
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Walk through the extracted folder and look for zip files
    for root, dirs, files in os.walk(extract_folder):
        for file in files:
            if file.endswith('.zip'):
                nested_zip_path = os.path.join(root, file)
                
                # Check if the file is a valid zip file before proceeding
                try:
                    with zipfile.ZipFile(nested_zip_path, 'r') as nested_zip:
                        for zip_info in nested_zip.infolist():
                            if zip_info.filename.endswith('.csv'):
                                output_file_path = os.path.join(output_folder, zip_info.filename)
                                # Check if the CSV file already exists in the output folder
                                if not os.path.exists(output_file_path):
                                    # Extract the CSV if it doesn't already exist
                                    nested_zip.extract(zip_info, output_folder)
                                    print(f"Extracted {zip_info.filename} to {output_folder}")
                                else:
                                    print(f"Skipping {zip_info.filename}, already exists.")
                except zipfile.BadZipFile:
                    print(f"Skipping {nested_zip_path}, not a valid zip file.")

# Example usage
extract_folder = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_main_zip'
output_folder = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files'  

extract_all_csvs_to_one_folder(extract_folder, output_folder)




Skipping transArchive_201001_201003.csv, already exists.
Skipping transArchive_201004_201006.csv, already exists.
Skipping transArchive_201007_201009.csv, already exists.
Skipping transArchive_201010_201012.csv, already exists.
Skipping transArchive_201101_201103.csv, already exists.
Skipping transArchive_201104.csv, already exists.
Skipping transArchive_201105.csv, already exists.
Skipping transArchive_201106.csv, already exists.
Skipping transArchive_201107_201109.csv, already exists.
Skipping transArchive_201110_201112.csv, already exists.
Skipping transArchive_201201_201203.csv, already exists.
Extracted transArchive_201201_201203_inactive.csv to D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files
Skipping transArchive_201204_201206.csv, already exists.
Extracted transArchive_201204_201206_inactive.csv to D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files
Skipping transArchive_201207_201209.csv, already exists.
Extracted transArchive_201207_
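
`ZipFile.extract` keeps any folder structure stored inside the archive, so a CSV saved under a subfolder in a nested zip would land in a subfolder of the output folder rather than directly in it. The archives here appear to hold their CSVs at the top level, so this may never occur. A minimal sketch of a flattening variant, assuming a hypothetical function named `extract_csv_flat` and a throwaway demo archive:

```python
import os
import shutil
import tempfile
import zipfile

def extract_csv_flat(zip_path, output_folder):
    """Extract every CSV in zip_path directly into output_folder (no subfolders)."""
    os.makedirs(output_folder, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path, "r") as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".csv"):
                continue
            # Keep only the file name so folders stored in the archive are dropped
            target = os.path.join(output_folder, os.path.basename(info.filename))
            if os.path.exists(target):
                continue  # Already extracted on an earlier run
            with archive.open(info) as source, open(target, "wb") as dest:
                shutil.copyfileobj(source, dest)
            written.append(target)
    return written

# Self-contained demonstration with a temporary archive (illustrative data only)
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("nested/folder/sample.csv", "a,b\n1,2\n")
flat_files = extract_csv_flat(demo_zip, os.path.join(demo_dir, "out"))
```

One trade-off: two archive members with the same file name in different folders would collide, and only the first would be written.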

##### Step 3: Standardize and Clean the CSV Files

In [5]:
import pandas as pd
import glob
import os

# Path to the reference CSV for column names
good_file_path = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/reference_files/transArchive_201001_201003_clean.csv'

# Load the reference file to use its column names
df_good = pd.read_csv(good_file_path)

# Store the column names of the reference file for comparison
good_column_names = df_good.columns.tolist()

def clean_and_standardize_file(input_file, output_file):
    """
    Cleans and standardizes the input CSV file by handling delimiters, 
    NULL values, and column headers.

    """
    try:
        # Attempt to load the file with automatic delimiter detection
        try:
            df = pd.read_csv(input_file, sep=None, engine='python')
        except pd.errors.ParserError:
            # Fallback to semicolon if automatic detection fails
            df = pd.read_csv(input_file, delimiter=';')
        
        # If column count doesn't match, attempt to load with comma delimiter
        if df.shape[1] != len(good_column_names):
            df = pd.read_csv(input_file, delimiter=',')
        
        # Replace various representations of NULL with NaN/None
        df.replace({"NULL": None, r"\\N": None, r"\N": None}, inplace=True)
        
        # Check if the file's columns match the reference file's columns
        if list(df.columns) != good_column_names:
            # Check if the first row is a header and reorder columns accordingly
            potential_header = df.iloc[0].tolist()
            if set(potential_header) == set(good_column_names):
                df.columns = potential_header
                df = df.iloc[1:].reset_index(drop=True)  # Remove first row (now header)
            else:
                df.columns = good_column_names  # Set correct headers if missing

        # Save the cleaned file with a comma delimiter
        df.to_csv(output_file, index=False, sep=",")
        print(f"File cleaned and saved: {output_file}")

    except pd.errors.EmptyDataError:
        print(f"Error: {input_file} is empty.")
    except pd.errors.ParserError:
        print(f"Error: Could not parse {input_file}.")
    except FileNotFoundError:
        print(f"Error: {input_file} not found.")
    except Exception as e:
        print(f"Error processing {input_file}: {e}")

def process_extracted_csvs(extracted_folder, output_folder):
    """
    Processes all CSV files in the extracted folder, cleaning and saving them
    to the output folder. Skips files that have already been processed.

    """
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)
    
    # Get all CSV files from the extracted folder (recursive search)
    csv_files = glob.glob(f"{extracted_folder}/**/*.csv", recursive=True)
    
    # Process each CSV file
    for csv_file in csv_files:
        output_file = os.path.join(output_folder, os.path.basename(csv_file))
        
        # Skip files that are already processed
        if os.path.exists(output_file):
            print(f"Skipping {csv_file}, already processed.")
            continue
        
        # Clean and standardize the file
        clean_and_standardize_file(csv_file, output_file)

# Example usage
extracted_folder = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files'  # Folder where the extracted CSVs are located
output_folder = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files'  # Folder to save cleaned CSVs

# Process and clean the extracted CSVs
process_extracted_csvs(extracted_folder, output_folder)


Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201001_201003.csv, already processed.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201004_201006.csv, already processed.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201007_201009.csv, already processed.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201010_201012.csv, already processed.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201101_201103.csv, already processed.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201104.csv, already processed.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201105.csv, already processed.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArch

  df = pd.read_csv(input_file, delimiter=';')


File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201207_201209_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201210_201212.csv


  df = pd.read_csv(input_file, delimiter=';')


File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201210_201212_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201301_201303.csv


  df = pd.read_csv(input_file, delimiter=';')


File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201301_201303_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201304_201306.csv


  df = pd.read_csv(input_file, delimiter=';')


File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201304_201306_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201307_201309.csv


  df = pd.read_csv(input_file, delimiter=';')


File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201307_201309_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201310_201312.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201310_201312_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201401_201403.csv


  df = pd.read_csv(input_file, delimiter=';')


File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201401_201403_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201404_201406.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201404_201406_inactive.csv
Error: Could not parse D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/extracted_csv_files\transArchive_201407_201409.csv.
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201407_201409_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201410_201412.csv
File cleaned and saved: D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201410_201412_inactive.csv
File cleaned and saved: D:/WedgeProject/Wedge-Proje
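
`pd.read_csv` treats the first line as a header by default, so when a file has no header its first data row becomes the column names and is then overwritten by `df.columns = good_column_names`. A minimal sketch of checking the layout before loading, using only the standard library; the function name `detect_layout` and the candidate delimiter list are assumptions, not part of the original notebook:

```python
import csv
import io

def detect_layout(sample_text, expected_columns, candidates=(",", ";", "\t")):
    """Return (delimiter, has_header), or (None, False) if no candidate fits."""
    first_line = sample_text.splitlines()[0] if sample_text else ""
    for delimiter in candidates:
        # Parse only the first line with this candidate delimiter
        first_row = next(csv.reader(io.StringIO(first_line), delimiter=delimiter), [])
        if len(first_row) == len(expected_columns):
            # A header is present only if the first row names the expected columns
            has_header = set(first_row) == set(expected_columns)
            return delimiter, has_header
    return None, False
```

With the result in hand, a headerless file could be read with `pd.read_csv(path, sep=delimiter, header=None, names=good_column_names)` so the first data row is kept. The column-count test assumes the first line has no quoted fields containing the delimiter.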

##### Step 4: Uploading to Google BigQuery

In [15]:
import os
import glob
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

def table_exists(client, dataset_id, table_name):
    """Check if a table already exists in BigQuery."""
    try:
        client.get_table(f'{dataset_id}.{table_name}')
        return True
    except NotFound:
        # Table does not exist; other errors (auth, network) are left to propagate
        return False

def upload_csv_to_bigquery(client, dataset_id, table_name, csv_file):
    """Load one CSV file into the given BigQuery table using schema autodetection."""
    try:
        # Configure the load job with schema autodetection
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,  # Skipping header row if the CSV contains headers
            autodetect=True  # Automatically detect schema
        )

        # Load data from CSV into BigQuery
        with open(csv_file, "rb") as source_file:
            load_job = client.load_table_from_file(source_file, f'{dataset_id}.{table_name}', job_config=job_config)

        # Wait for the load job to complete
        load_job.result()

        print(f"Loaded {load_job.output_rows} rows into {dataset_id}.{table_name}")
    
    except Exception as e:
        print(f"Error uploading {csv_file} to BigQuery: {e}")

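def upload_all_csvs(input_folder, dataset_id):
    """Upload every CSV in input_folder to dataset_id, one table per file, skipping existing tables."""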
    # Initialize the BigQuery client
    client = bigquery.Client()

    # Get all CSV files in the input folder
    csv_files = glob.glob(os.path.join(input_folder, '*.csv'))
    
    print(f"Found {len(csv_files)} CSV files to upload.")
    
    # Iterate through each file and upload it
    for csv_file in csv_files:
        # Extract the base name of the CSV file to use as the table name
        table_name = os.path.splitext(os.path.basename(csv_file))[0]
        
        # Check if the table already exists
        if table_exists(client, dataset_id, table_name):
            print(f"Skipping {csv_file}, table {table_name} already exists in BigQuery.")
            continue
        
        # Upload the CSV to BigQuery if the table does not exist
        upload_csv_to_bigquery(client, dataset_id, table_name, csv_file)

# Example usage
input_folder = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files'  # Folder with cleaned CSVs
dataset_id = 'wedgeproject-rileyororke.transaction_tables'  # Your BigQuery dataset

# Upload all CSVs to BigQuery
upload_all_csvs(input_folder, dataset_id)


Found 52 CSV files to upload.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201001_201003.csv, table transArchive_201001_201003 already exists in BigQuery.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201004_201006.csv, table transArchive_201004_201006 already exists in BigQuery.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201007_201009.csv, table transArchive_201007_201009 already exists in BigQuery.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201010_201012.csv, table transArchive_201010_201012 already exists in BigQuery.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files\transArchive_201101_201103.csv, table transArchive_201101_201103 already exists in BigQuery.
Skipping D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_f
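
The upload loop makes one `get_table` request per file to decide whether to skip it. A minimal sketch of making that decision in a single pass once the existing table names are known; the function name `plan_uploads` is an assumption, and the set of existing names would have to be gathered separately, for example from the `table_id` of each item returned by `client.list_tables(dataset_id)`:

```python
import os

def plan_uploads(csv_paths, existing_tables):
    """Split CSV paths into (to_upload, to_skip) lists of (table_name, path)."""
    existing = set(existing_tables)
    to_upload, to_skip = [], []
    for path in sorted(csv_paths):
        # Table name is the file name without its extension
        table_name = os.path.splitext(os.path.basename(path))[0]
        if table_name in existing:
            to_skip.append((table_name, path))
        else:
            to_upload.append((table_name, path))
    return to_upload, to_skip
```

Keeping the planning step free of BigQuery calls also makes it possible to check the skip logic without network access.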

In [14]:
import pandas as pd

# Load the problematic CSV with low_memory=False to handle mixed types better
csv_file = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files/transArchive_201301_201303_inactive.csv'
df = pd.read_csv(csv_file, low_memory=False)

# Check for unexpected values in the 'organic' column
print(f"Unique values before cleaning: {df['organic'].unique()}")

# Convert 'organic' to numeric, coercing errors (which will convert invalid entries to NaN)
df['organic'] = pd.to_numeric(df['organic'], errors='coerce')

# Treat the out-of-range codes -1 and 3 as missing; 1.0 and 0.0 remain floats because the column contains NaN
df['organic'] = df['organic'].replace({1.0: 1, 0.0: 0, -1.0: None, 3.0: None})

# Check for unique values after cleaning
print(f"Unique values after cleaning: {df['organic'].unique()}")

# Save the cleaned file
output_file = 'D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files/transArchive_201301_201303_inactive.csv'
df.to_csv(output_file, index=False)

print(f"Cleaned file saved to {output_file}")


Unique values before cleaning: ['1' '0' nan '-1' '3' ' ' '1.0' '0.0' '-1.0' '3.0']
Unique values after cleaning: [ 1.  0. nan]


  df['organic'] = df['organic'].replace({1.0: 1, 0.0: 0, -1.0: None, 3.0: None})


Cleaned file saved to D:/WedgeProject/Wedge-Project-ADA-Riley-ORorke/data/final_cleaned_csv_files/transArchive_201301_201303_inactive.csv
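
The cleaning rule applied to the `organic` column can also be written as a pure function, which makes it easy to check against the raw values printed before cleaning. This is a minimal sketch; the function name `normalize_organic` is an assumption, and treating -1 and 3 as missing follows the choice made in the cell above:

```python
def normalize_organic(raw):
    """Map a raw 'organic' value to 1, 0, or None (missing or invalid)."""
    if raw is None:
        return None
    try:
        number = float(str(raw).strip())
    except ValueError:
        return None  # Blank strings and non-numeric text
    if number == 1.0:
        return 1
    if number == 0.0:
        return 0
    return None  # Covers NaN and codes such as -1 and 3

# The distinct raw values reported before cleaning
raw_values = ['1', '0', float('nan'), '-1', '3', ' ', '1.0', '0.0', '-1.0', '3.0']
cleaned = [normalize_organic(value) for value in raw_values]
```

Within pandas, converting the cleaned column with `astype('Int64')` (the nullable integer dtype) would likely keep 1 and 0 as integers alongside missing values, instead of the floats shown in the output above.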
