# Basic ETL
The tables will ultimately be uploaded to Snowflake for ELT. Before that, the data must match the schema, which means addressing any null values or data type issues. That is the only concern in this part of the ETL, since the heavier handling will be done via ELT in Snowflake.

#### The files:
_external_vendor_information.csv_
- Vendor_ID VARCHAR
- Vendor_Name VARCHAR
- Service_Type VARCHAR
- Contract_Start_Date DATETIME
- Contract_End_Date DATETIME
- Amount_Paid_YTD FLOAT

_portfolio_holdings.csv_
- Holding_ID VARCHAR
- Security_Name VARCHAR
- Security_Type VARCHAR
- Quantity_Held INT
- Market_Value FLOAT
- Acquisition_Date DATETIME

_program_performance_metrics.xlsx_
- Program_ID VARCHAR
- Program_Name VARCHAR
- Reporting_Period DATETIME
- Participants VARCHAR
- Successful_Outcomes VARCHAR
- Program_Cost FLOAT

_transactions.csv_
- Transaction_ID (unique identifier) VARCHAR
- Transaction_Date DATETIME
- Transaction_Amount FLOAT
- Transaction_Type (Expense, Revenue) VARCHAR
- Department (e.g., Operations, Investments, Community Programs) VARCHAR
- Vendor_Name VARCHAR
- Description VARCHAR

_unclaimed_property_records.csv_
- Property_ID VARCHAR
- Owner_Name VARCHAR
- Property_Type VARCHAR
- Reported_Date DATETIME
- Property_Value FLOAT
- Claim_Status VARCHAR

In [105]:
import polars as pl

# Read in the files
dim_external_vendor_info = pl.read_csv("external_vendor_information.csv")
dim_portfolio_holdings = pl.read_csv("portfolio_holdings.csv")
dim_program_performance_metrics = pl.read_excel("program_performance_metrics.xlsx", sheet_name="Program Metrics")
fact_transactions = pl.read_csv("transactions.csv")
fact_unclaimed_property_records = pl.read_csv("unclaimed_property_records.csv")


In [None]:
# Cast the numeric column so that all values are in FLOAT format
dim_external_vendor_info = dim_external_vendor_info.with_columns(
    pl.col("Amount_Paid_YTD").cast(pl.Float64)
)

In [73]:
# Convert all date columns to the correct date format or Null
dim_external_vendor_info = dim_external_vendor_info.with_columns(
    pl.col("Contract_Start_Date")
    .str.to_date("%Y-%m-%d", strict=False)  # unparseable values become null instead of raising
    .alias("Contract_Start_Date")
)

## Validate column formatting 
Ensure that the first 10 characters of every value in the column match the YYYY-MM-DD pattern, or that the value is null. Any rows that fail this check will need to be handled differently. If the check passes, we can safely strip the time portion and convert the column to a date.

In [74]:
dim_external_vendor_info.with_columns(
    pl.col("Contract_End_Date")
    .str.slice(0, 10)  # Keep only the first 10 characters (the date part)
    .str.contains(r"^\d{4}-\d{2}-\d{2}$")
    .alias("is_valid_date")
).filter(~pl.col("is_valid_date")).height



0
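A count of 0 here only says that every non-null value has the YYYY-MM-DD shape in its first 10 characters; the pattern does not check that the month and day are real. The sketch below uses only the standard library to show the difference on a few made-up strings. The sample values and the helper names `has_date_shape` and `is_real_date` are illustrative and not part of the notebook.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def has_date_shape(value: str) -> bool:
    """Mirror the notebook check: the first 10 characters look like YYYY-MM-DD."""
    return bool(DATE_SHAPE.match(value[:10]))

def is_real_date(value: str) -> bool:
    """Stricter check: the first 10 characters parse as a calendar date."""
    try:
        datetime.strptime(value[:10], "%Y-%m-%d")
        return True
    except ValueError:
        return False

samples = [
    "2024-01-12",                  # plain date
    "2025-09-29 13:32:35.851475",  # timestamp; the slice keeps only the date
    "09/29/2025",                  # wrong layout
    "2024-13-45",                  # right shape, impossible month and day
]

for s in samples:
    print(f"{s!r:32} shape={has_date_shape(s)} real={is_real_date(s)}")
```

The last sample passes the shape check but fails the calendar check, so a shape count of 0 is a necessary condition for a clean conversion rather than a guarantee.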

In [75]:
# Convert all date columns to the correct date format or Null
dim_external_vendor_info = dim_external_vendor_info.with_columns(
    dim_external_vendor_info["Contract_End_Date"]
    .str.slice(0,10) # Slice off everything beyond the date parameters
    .str.to_date("%Y-%m-%d")
    .alias("Contract_End_Date")
)

In [76]:
dim_external_vendor_info.head()

Vendor_ID,Vendor_Name,Service_Type,Contract_Start_Date,Contract_End_Date,Amount_Paid_YTD
str,str,str,date,date,f64
"""VND0001""","""Taylor-Robinson""","""Consulting""",2020-10-28,2025-04-18,89484.49
"""VND0002""","""Mitchell-Sanchez""","""Supplies""",2022-12-05,2024-08-09,75736.47
"""VND0003""","""Hardy-Swanson""","""Maintenance""",2021-09-29,2025-04-18,44447.81
"""VND0004""","""Benjamin-Medina""","""Maintenance""",2023-07-15,2024-11-10,16022.84
"""VND0005""","""Herrera, Campbell and Rios""","""Supplies""",2022-08-21,2025-04-18,88657.4


In [77]:
dim_external_vendor_info.dtypes

[String, String, String, Date, Date, Float64]

Now that the table has the correct data types for the schema, we can save it as a parquet file, ready to pass to Snowflake. 

In [78]:
dim_external_vendor_info.write_parquet("dim_external_vendor_info.parquet")

Next table: rather than repeating these steps by hand, I am going to map the column names that need date or float conversions so that all tables can be processed with a single function.

In [90]:
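def float_validator(df, column_name):
    # Count nulls before and after the cast; the difference is the number of
    # values that could not be coerced to a float.
    null_count_a = df[column_name].null_count()
    # strict=False turns unparseable values into nulls instead of raising
    df = df.with_columns(pl.col(column_name).cast(pl.Float64, strict=False))
    null_count_b = df[column_name].null_count()
    print(f'There were {null_count_b-null_count_a} invalid values that could not be coerced.')
    return df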
        

In [118]:
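def int_validator(df, column_name):
    # Count nulls before and after the cast; the difference is the number of
    # values that could not be coerced to an integer.
    null_count_a = df[column_name].null_count()
    # strict=False turns unparseable values into nulls instead of raising
    df = df.with_columns(pl.col(column_name).cast(pl.Int64, strict=False))
    null_count_b = df[column_name].null_count()
    print(f'There were {null_count_b-null_count_a} invalid values that could not be coerced.')
    return df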

In [106]:
dim_external_vendor_info = float_validator(dim_external_vendor_info, "Amount_Paid_YTD")

There were 0 invalid values that could not be coerced.


In [103]:
def date_converter(df, column_name):
    # Count values whose first 10 characters do not look like YYYY-MM-DD
    # (nulls are not counted as invalid)
    not_valid = df.filter(
        ~pl.col(column_name)
        .str.slice(0, 10)  # Slice off everything beyond the date parameters
        .str.contains(r"^\d{4}-\d{2}-\d{2}$")
    ).height

    # Convert to dates; strict=False turns unparseable values into nulls
    df = df.with_columns(
        pl.col(column_name)
        .str.slice(0, 10)
        .str.to_date("%Y-%m-%d", strict=False)
        .alias(column_name)
    )
    print(f'There were {not_valid} invalid dates formats that could not be coerced.')
    return df
    

In [107]:
dim_external_vendor_info = date_converter(dim_external_vendor_info, "Contract_Start_Date")

There were 0 invalid dates formats that could not be coerced.


In [108]:
dim_external_vendor_info = date_converter(dim_external_vendor_info, "Contract_End_Date")

There were 0 invalid dates formats that could not be coerced.


In [109]:
dim_external_vendor_info.head()

Vendor_ID,Vendor_Name,Service_Type,Contract_Start_Date,Contract_End_Date,Amount_Paid_YTD
str,str,str,date,date,f64
"""VND0001""","""Taylor-Robinson""","""Consulting""",2020-10-28,2025-04-18,89484.49
"""VND0002""","""Mitchell-Sanchez""","""Supplies""",2022-12-05,2024-08-09,75736.47
"""VND0003""","""Hardy-Swanson""","""Maintenance""",2021-09-29,2025-04-18,44447.81
"""VND0004""","""Benjamin-Medina""","""Maintenance""",2023-07-15,2024-11-10,16022.84
"""VND0005""","""Herrera, Campbell and Rios""","""Supplies""",2022-08-21,2025-04-18,88657.4


In [110]:
dim_external_vendor_info.dtypes

[String, String, String, Date, Date, Float64]

Now we can make functions for each of the tables that will open, process, and save.

In [None]:
def external_vendor_processor(path='external_vendor_information.csv'):
    df = pl.read_csv(path)
    df = float_validator(df, "Amount_Paid_YTD")
    df = date_converter(df, "Contract_Start_Date")
    df = date_converter(df, "Contract_End_Date")
    df.write_parquet("dim_external_vendor_info.parquet")
    print(f'File burned into dim_external_vendor_info.parquet') 

In [124]:
external_vendor_processor('external_vendor_information.csv')

There were 0 invalid values that could not be coerced.
There were 0 invalid dates formats that could not be coerced.
There were 0 invalid dates formats that could not be coerced.
File burned into dim_external_vendor_info.parquet


In [117]:
print(dim_portfolio_holdings.columns)
print(dim_portfolio_holdings.dtypes)

['Holding_ID', 'Security_Name', 'Security_Type', 'Quantity_Held', 'Market_Price_Per_Unit', 'Total_Market_Value', 'Acquisition_Date']
[String, String, String, Int64, Float64, Float64, String]
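The printed columns do not line up exactly with the schema listed at the top of the notebook: the list names Market_Value, while the file carries Market_Price_Per_Unit and Total_Market_Value. A quick comparison makes that kind of drift visible before the casts run. The sketch below is plain Python; `schema_drift` is a hypothetical helper, and both lists are copied from this notebook.

```python
def schema_drift(expected: list[str], actual: list[str]) -> dict[str, list[str]]:
    """Report columns missing from the file and columns the schema does not list."""
    return {
        "missing": [c for c in expected if c not in actual],
        "unexpected": [c for c in actual if c not in expected],
    }

# Columns from the schema listed at the top of the notebook
expected_columns = [
    "Holding_ID", "Security_Name", "Security_Type",
    "Quantity_Held", "Market_Value", "Acquisition_Date",
]
# Columns printed from portfolio_holdings.csv
actual_columns = [
    "Holding_ID", "Security_Name", "Security_Type", "Quantity_Held",
    "Market_Price_Per_Unit", "Total_Market_Value", "Acquisition_Date",
]

drift = schema_drift(expected_columns, actual_columns)
print(drift)
```

Running the same comparison against `df.columns` for each file would flag renamed or extra columns before any parquet file is written.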


In [125]:
def portfolio_holdings_processor(path='portfolio_holdings.csv'):
    df = pl.read_csv(path)
    df = int_validator(df, "Quantity_Held")
    df = float_validator(df, "Market_Price_Per_Unit")
    df = float_validator(df, "Total_Market_Value")
    df = date_converter(df, "Acquisition_Date")
    df.write_parquet("dim_portfolio_holdings.parquet")
    print(f'File written to dim_portfolio_holdings.parquet')
    

In [126]:
portfolio_holdings_processor('portfolio_holdings.csv')

There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid dates formats that could not be coerced.
File written to dim_portfolio_holdings.parquet


In [127]:
print(dim_program_performance_metrics.columns)
print(dim_program_performance_metrics.dtypes)

['Reporting_Period', 'Program_Name', 'Program_Cost', 'Participants', 'Successful_Outcomes', 'Program_ID', 'Budget_Allocation', 'Budget_Utilization_Rate', 'Completion_Rate', 'Participant_Satisfaction', 'Cost_per_Participant', 'Cost_per_Successful_Outcome', 'On_Time_Completion']
[String, String, String, Int64, Int64, String, Float64, Float64, String, Float64, Float64, Float64, String]


In [128]:
dim_program_performance_metrics.head()

Reporting_Period,Program_Name,Program_Cost,Participants,Successful_Outcomes,Program_ID,Budget_Allocation,Budget_Utilization_Rate,Completion_Rate,Participant_Satisfaction,Cost_per_Participant,Cost_per_Successful_Outcome,On_Time_Completion
str,str,str,i64,i64,str,f64,f64,str,f64,f64,f64,str
"""2024-06-30 00:00:00""","""Maintenance""","""1999755.17""",225.0,131,"""PRG050""",2972500.0,67.27,"""58.22%""",3.7,8887.8,15265.31,"""No"""
"""2024-03-31 00:00:00""","""Consulting Fees""","""2238505.6""",442.0,397,"""PRG034""",2584400.0,86.62,"""89.82%""",4.0,5064.49,5638.55,"""No"""
"""2023-12-31 00:00:00""","""Software Subscription""","""1282043.08""",920.0,518,"""PRG032""",1826500.0,70.19,"""56.3%""",4.4,1393.53,2474.99,"""No"""
"""2024-03-31 00:00:00""","""Maintenance""","""1930857.38""",683.0,572,"""PRG039""",2586900.0,74.64,"""83.75%""",4.0,2827.02,3375.62,"""No"""
"""2023-12-31 00:00:00""","""Reimbursement""","""720991.95""",,416,"""PRG031""",970445.27,74.29,"""87.39%""",4.3,1514.69,1733.15,"""No"""


In [156]:
def program_performance_processor(path='program_performance_metrics.xlsx'):
    df = pl.read_excel(path, sheet_name="Program Metrics")
    df = float_validator(df, "Program_Cost")
    df = int_validator(df, "Participants")
    df = int_validator(df, "Successful_Outcomes")
    df = float_validator(df, "Budget_Allocation")
    df = float_validator(df, "Budget_Utilization_Rate")
    df = float_validator(df, "Participant_Satisfaction")
    df = float_validator(df, "Cost_per_Participant")
    df = float_validator(df, "Cost_per_Successful_Outcome")
    df.write_parquet("dim_program_performance_metrics.parquet")
    print(f'File written to dim_program_performance_metrics.parquet')


In [157]:
program_performance_processor('program_performance_metrics.xlsx')

There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
There were 0 invalid values that could not be coerced.
File written to dim_program_performance_metrics.parquet


In [151]:
fact_transactions = pl.read_csv("transactions.csv")
fact_transactions.head()

Transaction_ID,Transaction_Date,Transaction_Amount,Transaction_Type,Department,Vendor_Name,Description
str,str,f64,str,str,str,str
"""TXN00003763""","""2024-01-12""",49388.52,"""Expense""","""Administration""","""Conner, Foster and Johnson""","""Software Subscription"""
"""TXN00007941""","""2025-09-29 13:32:35.851475""",52924.48,"""Expense""","""Compliance""","""Cole, Nunez and Harris""","""Consulting Fees"""
"""TXN00007395""","""2025-03-02""",140902.59,"""Expense""","""Administration""","""Tran-Sanchez""","""Office Supplies"""
"""TXN00009310""","""2024-06-22""",60473.14,"""Expense""","""Compliance""","""Garza-Bright""","""Office Supplies"""
"""TXN00009084""","""2025-04-13""",137488.96,"""Expense""","""Administration""","""Serrano-Butler""","""Event Sponsorship"""


In [158]:
def transactions_processor(path='transactions.csv'):
    df = pl.read_csv(path)
    df = date_converter(df, 'Transaction_Date')
    df = float_validator(df, "Transaction_Amount")
    df.write_parquet("transactions.parquet")
    print(f'File written to transactions.parquet')

In [None]:
transactions_processor('transactions.csv')

There were 0 invalid dates formats that could not be coerced.
There were 0 invalid values that could not be coerced.
File written to transactions.parquet


In [161]:
fact_unclaimed_property_records.head()

Property_ID,Owner_Type,Owner_Name,Property_Type,Reported_Date,Property_Value,Claim_Status
str,str,str,str,str,f64,str
"""UP0001""","""Vendor""","""Carroll, Rodriguez and Morgan""","""Safe Deposit Box""","""2018-09-30""",5011.73,"""Unclaimed"""
"""UP0002""","""Individual""","""Kyle Stark""","""Bank Account""","""2022-03-31""",8860.79,"""Unclaimed"""
"""UP0003""","""Individual""","""Kelly Robinson DDS""","""Insurance Claim""","""2022-03-31""",6253.99,"""Unclaimed"""
"""UP0004""","""Individual""","""Dylan Bryant""","""Safe Deposit Box""","""2018-12-31""",7916.36,"""Unclaimed"""
"""UP0005""","""Individual""","""Christopher Chavez""","""Insurance Claim""","""2018-12-31""",443.81,"""Unclaimed"""


In [165]:
def unclaimed_record_processor(path="unclaimed_property_records.csv"):
    df = pl.read_csv(path)
    df = date_converter(df, "Reported_Date")
    df = float_validator(df, "Property_Value")
    df.write_parquet("fact_unclaimed_property_records.parquet")
    print(f'File written to fact_unclaimed_property_records.parquet')


In [166]:
unclaimed_record_processor("unclaimed_property_records.csv")

There were 0 invalid dates formats that could not be coerced.
There were 0 invalid values that could not be coerced.
File written to fact_unclaimed_property_records.parquet
