# Generate Synthetic Data for Treasury Analytics

1. Transactions Table (Largest Dataset, 10,000 records)
Reasoning: Financial transactions typically occur frequently and continuously, resulting in a large volume of records. This dataset would naturally be the largest, as it captures daily operational expenses, revenues, vendor payments, and other financial activities. Transactional data often accumulates rapidly, making it the most extensive dataset.

2. Investment Portfolio Holdings (Second Largest, about 250 records)
Reasoning: Investment portfolios usually contain numerous securities, each with multiple transactions (buys, sells, dividends, interest payments). While not as frequent as daily operational transactions, investment holdings still generate substantial data, especially if historical records and periodic valuations are included.

3. Unclaimed Property Records (Third Largest, 200 records)
Reasoning: Unclaimed property records accumulate steadily but at a slower rate compared to financial transactions or investment activities. These records are typically updated periodically (monthly, quarterly, or annually), resulting in fewer records than the first two datasets.

4. Program Performance Metrics (Smallest Dataset, 20-50 records)
Reasoning: Program performance metrics are usually aggregated data points collected periodically (monthly, quarterly, or annually). They summarize program outcomes, participant counts, and costs, resulting in fewer records compared to transactional or investment data.


In [32]:
import pandas as pd
import numpy as np
from faker import Faker
import random

fake = Faker()

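# Seed NumPy and Python's random module for reproducibility.
# Faker is not seeded here, so generated names and dates still change between runs;
# calling Faker.seed(...) would pin them as well.
np.random.seed(91942)
random.seed(91942)  # returns None, so there is nothing useful to store from this call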

## Generate Dataset 1: ISTO Financial Transactions
Purpose: Analyze spending patterns, identify cost-saving opportunities, and visualize financial trends.

**Schema:**
- Transaction_ID (unique identifier) VARCHAR
- Transaction_Date DATETIME
- Transaction_Amount FLOAT
- Transaction_Type (Expense, Revenue) VARCHAR
- Department (e.g., Operations, Investments, Community Programs) VARCHAR
- Vendor_Name VARCHAR
- Description VARCHAR

In [44]:
# Number of rows
num_rows = 10000
num_unique_vendors = 1159 

# Generate Transaction IDs (formatted as TXN00000001)
transaction_ids = [f'TXN{str(i).zfill(8)}' for i in range(1, num_rows + 1)]

# Generate random dates within a realistic range (e.g., past 2 years)
transaction_dates = [fake.date_between(start_date='-2y', end_date='today') for _ in range(num_rows)]

# Generate transaction amounts (expenses and revenues)
transaction_amounts = np.round(np.random.uniform(50, 150000, num_rows), 2)

# Randomly assign transaction types
transaction_types = np.random.choice(['Expense', 'Revenue'], size=num_rows, p=[0.7, 0.3])

# Departments
departments = ['Operations', 'Investments', 'Community Programs', 'Administration', 'IT', 'Compliance']
department_choices = np.random.choice(departments, size=num_rows)

# Vendor names (using Faker), drawn from a pool much smaller than the number of transactions
unique_vendors = [fake.unique.company() for _ in range(num_unique_vendors)]

# Skew vendor frequencies with Zipf-distributed weights so a few vendors get most of the transactions
vendor_probabilities = np.random.zipf(a=2, size=num_unique_vendors)
vendor_probabilities = vendor_probabilities / vendor_probabilities.sum()

# Assign vendors to transactions based on probabilities
vendor_names = np.random.choice(unique_vendors, size=num_rows, p=vendor_probabilities)

# Transaction descriptions (simple examples)
descriptions_expense = ['Office Supplies', 'Consulting Fees', 'Software Subscription', 'Event Sponsorship', 'Maintenance', 'Travel Expenses']
descriptions_revenue = ['Interest Income', 'Investment Returns', 'Program Fees', 'Grant Funding', 'Reimbursement']

descriptions = [
    random.choice(descriptions_expense) if t == 'Expense' else random.choice(descriptions_revenue)
    for t in transaction_types
]

# Assemble DataFrame
transactions_df = pd.DataFrame({
    'Transaction_ID': transaction_ids,
    'Transaction_Date': transaction_dates,
    'Transaction_Amount': transaction_amounts,
    'Transaction_Type': transaction_types,
    'Department': department_choices,
    'Vendor_Name': vendor_names,
    'Description': descriptions
})

# Preview the dataset
print(transactions_df.head())
print("Total transactions:", len(transactions_df))
print("Unique vendors:", transactions_df.Vendor_Name.nunique())
print("Vendors appearing more than once:", (transactions_df.Vendor_Name.value_counts() > 1).sum())


  Transaction_ID Transaction_Date  Transaction_Amount Transaction_Type  \
0    TXN00000001       2023-06-13            52895.59          Expense   
1    TXN00000002       2024-12-21            99357.18          Expense   
2    TXN00000003       2024-12-29            27593.75          Expense   
3    TXN00000004       2025-02-21           105942.45          Revenue   
4    TXN00000005       2023-12-25            77628.58          Expense   

           Department                Vendor_Name        Description  
0          Operations             Gonzalez-Hines        Maintenance  
1                  IT                Bennett LLC  Event Sponsorship  
2      Administration              Jackson-Smith    Travel Expenses  
3  Community Programs             Hardy and Sons      Grant Funding  
4  Community Programs  Porter, Ramirez and Jones    Office Supplies  
Total transactions: 10000
Unique vendors: 835
Vendors appearing more than once: 504
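The preview reports 835 unique vendors even though 1,159 names were generated. A minimal sketch of why, assuming the same Zipf-weighting idea as the cell above: the weights are so skewed that some vendors have a very small chance of being drawn at all. The seed and the use of a NumPy `Generator` are illustrative assumptions, so the counts will not match the notebook's output.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's

n_vendors, n_rows = 1159, 10000

# Heavy-tailed positive integers, normalised into a probability vector
weights = rng.zipf(a=2, size=n_vendors).astype(float)
probs = weights / weights.sum()

# Draw one vendor index per transaction using those probabilities
picks = rng.choice(n_vendors, size=n_rows, p=probs)
counts = np.bincount(picks, minlength=n_vendors)

vendors_used = int((counts > 0).sum())
top_share = counts.max() / n_rows

print("Vendors that received at least one transaction:", vendors_used)
print(f"Share of transactions going to the busiest vendor: {top_share:.1%}")
```

Vendors with the smallest weights may never be picked, which is the likely reason the unique count lands below the size of the vendor pool.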


This has created a perfect dataframe with no errors or duplicates. This is unrealistic, so let's shake it up.

1. Introduce dupes
2. Add incorrect data types
3. Add logical inconsistencies
4. Give it some outliers!
5. Mess up the vendor names (Inc. vs. Incorporated)
6. Remove a few IDs
7. Spread some NaN love throughout
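Every item in that list picks a random subset of rows with `DataFrame.sample`, so it is worth knowing how `random_state` behaves. The sketch below uses a throwaway DataFrame (an assumption, not the transactions data) to show that an integer seed restarts the random stream on every call, while a shared NumPy `Generator` keeps advancing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(1000)})  # throwaway frame for illustration

# An integer seed restarts the stream on every call: both samples hit the same rows
a = df.sample(n=20, random_state=91942).index
b = df.sample(n=20, random_state=91942).index
same_with_int_seed = a.equals(b)

# A shared Generator advances between calls: reproducible overall, different rows each time
rng = np.random.default_rng(91942)
c = df.sample(n=20, random_state=rng).index
d = df.sample(n=20, random_state=rng).index
same_with_generator = c.equals(d)

print("Same rows with an int seed:", same_with_int_seed)
print("Same rows with a shared Generator:", same_with_generator)
```

With an integer seed, every error type of the same size would land on the same rows, which is rarely what you want when spreading errors around a dataset.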

In [46]:
import pandas as pd
import numpy as np

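np.random.seed(91942)
# random.seed() returns None, so it cannot supply a random_state value.
# Seeded NumPy Generators (accepted by DataFrame.sample in pandas >= 1.1) keep the run
# reproducible while still giving each sample() call a different set of rows.
random_state = np.random.default_rng(91942)
random_state_alt = np.random.default_rng(24919)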

# 1. Introduce duplicate rows
num_duplicates = int(len(transactions_df) * 0.02)  # 2% duplicates
duplicates_df = transactions_df.sample(n=num_duplicates, random_state=random_state)
transactions_df = pd.concat([transactions_df, duplicates_df], ignore_index=True)

# 2. Introduce incorrect data types
num_type_errors = int(len(transactions_df) * 0.02)  # 2% type errors

# 2a. Append ' INVALID' to a sample of dates so they no longer parse
date_error_indices = transactions_df.sample(n=num_type_errors, random_state=random_state_alt).index
transactions_df.loc[date_error_indices, 'Transaction_Date'] = transactions_df.loc[date_error_indices, 'Transaction_Date'].astype(str) + ' INVALID'

# 3. Introduce logical inconsistencies
expense_indices = transactions_df[transactions_df['Transaction_Type'] == 'Expense'].sample(n=num_type_errors, random_state=random_state).index
transactions_df.loc[expense_indices, 'Transaction_Amount'] *= -1  # Negative amounts for expenses

invalid_department_indices = transactions_df.sample(n=num_type_errors, random_state=random_state_alt).index
transactions_df.loc[invalid_department_indices, 'Department'] = 'InvalidDept'

# 4. Introduce outliers (Transaction_Amount is still numeric at this point)
outlier_indices = transactions_df.sample(n=num_type_errors, random_state=random_state).index

# Generate outliers using a log-normal distribution
outlier_values = np.random.lognormal(mean=10, sigma=2, size=num_type_errors)
transactions_df.loc[outlier_indices, 'Transaction_Amount'] = outlier_values

future_date_indices = transactions_df.sample(n=num_type_errors, random_state=random_state_alt).index
transactions_df.loc[future_date_indices, 'Transaction_Date'] = pd.Timestamp('today') + pd.to_timedelta(np.random.randint(30, 365, size=num_type_errors), unit='d')

# 2b. Store a sample of amounts as strings. This runs after the numeric steps so the
# arithmetic in steps 3 and 4 works and the strings are not coerced back to numbers.
type_error_indices = transactions_df.sample(n=num_type_errors, random_state=random_state).index
transactions_df['Transaction_Amount'] = transactions_df['Transaction_Amount'].astype(object)
transactions_df.loc[type_error_indices, 'Transaction_Amount'] = transactions_df.loc[type_error_indices, 'Transaction_Amount'].astype(str)

# 5. Introduce inconsistent vendor names
vendor_indices = transactions_df.sample(n=num_type_errors, random_state=random_state).index
transactions_df.loc[vendor_indices, 'Vendor_Name'] = transactions_df.loc[vendor_indices, 'Vendor_Name'].apply(
    lambda x: x.replace('Inc.', 'Incorporated') if 'Inc.' in x else x + ' Inc.'
)

# 6. Introduce missing or invalid Transaction IDs
missing_id_indices = transactions_df.sample(n=num_type_errors, random_state=random_state).index
transactions_df.loc[missing_id_indices, 'Transaction_ID'] = np.nan

# 7. Introduce missing values (nulls) - needs to be last so it doesn't break the multiplication code
num_nulls = int(len(transactions_df) * 0.03)  # 3% nulls
null_indices_amount = transactions_df.sample(n=num_nulls, random_state=random_state).index
null_indices_department = transactions_df.sample(n=num_nulls, random_state=random_state_alt).index
transactions_df.loc[null_indices_amount, 'Transaction_Amount'] = np.nan
transactions_df.loc[null_indices_department, 'Department'] = np.nan

# Shuffle the DataFrame to mix errors naturally
transactions_df = transactions_df.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Check the resulting DataFrame
print(transactions_df.head())
# print(transactions_df.info())


  Transaction_ID Transaction_Date  Transaction_Amount Transaction_Type  \
0    TXN00001968       2024-04-16           109876.08          Expense   
1    TXN00008799       2023-06-06            47425.61          Expense   
2    TXN00008096       2024-10-30            95756.35          Expense   
3    TXN00004666       2023-07-14            59383.96          Expense   
4    TXN00005871       2024-09-05            28682.69          Expense   

       Department     Vendor_Name            Description  
0              IT  Hardy and Sons      Event Sponsorship  
1  Administration  Summers-Arroyo        Travel Expenses  
2             NaN     Diaz-Nguyen        Consulting Fees  
3  Administration  Hardy and Sons        Office Supplies  
4     Investments  Hardy and Sons  Software Subscription  



In [47]:
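# unique() counts NaN as a value; duplicated() counts rows whose vendor has already
# appeared earlier, so the second figure is a row count, not a number of distinct vendors
print("Number of unique vendors:", len(transactions_df.Vendor_Name.unique()))
print("Number of duplicated vendors:", transactions_df.Vendor_Name.duplicated().sum())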

Number of unique vendors: 939
Number of duplicated vendors: 9465
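To confirm the mess actually landed, a small audit helps. This is a sketch under assumptions: `audit_transactions` and `VALID_DEPARTMENTS` are hypothetical names, and the tiny `sample` frame stands in for `transactions_df` so the block runs on its own.

```python
import numpy as np
import pandas as pd

VALID_DEPARTMENTS = {"Operations", "Investments", "Community Programs",
                     "Administration", "IT", "Compliance"}

def audit_transactions(df: pd.DataFrame) -> dict:
    """Count the kinds of problems injected into the data (hypothetical helper)."""
    amounts = pd.to_numeric(df["Transaction_Amount"], errors="coerce")
    # Parse each value on its own so mixed date formats do not affect one another
    dates = pd.to_datetime(
        df["Transaction_Date"].map(lambda v: pd.to_datetime(str(v), errors="coerce"))
    )
    dept = df["Department"]
    return {
        "duplicate_ids": int(df["Transaction_ID"].dropna().duplicated().sum()),
        "missing_ids": int(df["Transaction_ID"].isna().sum()),
        "missing_amounts": int(df["Transaction_Amount"].isna().sum()),
        "string_amounts": int(df["Transaction_Amount"].map(lambda v: isinstance(v, str)).sum()),
        "negative_expenses": int(((amounts < 0) & (df["Transaction_Type"] == "Expense")).sum()),
        "unparseable_dates": int(dates.isna().sum()),
        "invalid_departments": int((dept.notna() & ~dept.isin(VALID_DEPARTMENTS)).sum()),
    }

# Tiny stand-in for transactions_df with one example of each problem
sample = pd.DataFrame({
    "Transaction_ID": ["TXN1", "TXN2", "TXN2", None],
    "Transaction_Date": ["2024-01-05", "2024-01-06 INVALID", "2024-01-06", "2024-02-01"],
    "Transaction_Amount": [100.0, "250.5", -40.0, np.nan],
    "Transaction_Type": ["Expense", "Revenue", "Expense", "Expense"],
    "Department": ["IT", "InvalidDept", None, "Operations"],
})

report = audit_transactions(sample)
for check, count in report.items():
    print(f"{check}: {count}")
```

Calling `audit_transactions(transactions_df)` on the real frame should work the same way, as long as the column names match the schema above.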


Finally, save the data in a horrible data format: CSV

In [48]:
# Save to CSV and cringe
transactions_df.to_csv('transactions.csv', index=False)
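Part of what makes CSV horrible is that it stores no type information. A minimal sketch, using an in-memory buffer and a two-row stand-in frame (both assumptions, so nothing touches `transactions.csv`), of how dates come back as plain text unless you ask for them:

```python
import io
import pandas as pd

original = pd.DataFrame({
    "Transaction_ID": ["TXN00000001", "TXN00000002"],
    "Transaction_Date": pd.to_datetime(["2024-04-16", "2023-06-06"]),
    "Transaction_Amount": [109876.08, 47425.61],
})

buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Naive read: the date column comes back as text
buffer.seek(0)
naive = pd.read_csv(buffer)

# Explicit read: ask pandas to parse the date column
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["Transaction_Date"])

print(naive.dtypes)
print(typed.dtypes)
```

On the deliberately dirty data, values such as `2024-04-16 INVALID` will not parse, so the column will likely stay as text until those rows are cleaned.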

## Generate Dataset 2: Investment Portfolio Holdings
Purpose: Analyze ISTO's investment portfolio, diversification, and risk management.

**Schema:**
- Holding_ID VARCHAR
- Security_Name VARCHAR
- Security_Type VARCHAR
- Quantity_Held INT
- Market_Price_Per_Unit FLOAT
- Total_Market_Value FLOAT (Quantity_Held × Market_Price_Per_Unit)
- Acquisition_Date DATETIME

In [49]:
import pandas as pd
import numpy as np
from faker import Faker

fake = Faker()
np.random.seed(91942)

# Number of holdings
num_holdings = 257

# Generate Holding IDs
holding_ids = [f'HLD{str(i).zfill(5)}' for i in range(1, num_holdings + 1)]

# Generate realistic security names (e.g., company names)
security_names = [fake.unique.company() for _ in range(num_holdings)]

# Security types distribution
security_types = np.random.choice(
    ['Stock', 'Bond', 'ETF', 'Mutual Fund'],
    size=num_holdings,
    p=[0.5, 0.2, 0.2, 0.1]
)

# Quantity held (realistic quantities)
quantity_held = np.random.randint(100, 5000, size=num_holdings)

# Market value per unit (realistic prices)
market_price_per_unit = np.round(np.random.uniform(10, 500, size=num_holdings), 2)

# Calculate total market value
market_values = np.round(quantity_held * market_price_per_unit, 2)

# Acquisition dates (within past 5 years)
acquisition_dates = pd.to_datetime([fake.date_between(start_date='-5y', end_date='today') for _ in range(num_holdings)])


# Assemble DataFrame
portfolio_holdings_df = pd.DataFrame({
    'Holding_ID': holding_ids,
    'Security_Name': security_names,
    'Security_Type': security_types,
    'Quantity_Held': quantity_held,
    'Market_Price_Per_Unit': market_price_per_unit,
    'Total_Market_Value': market_values,
    'Acquisition_Date': acquisition_dates
})

# Preview the dataset
print(portfolio_holdings_df.head())


  Holding_ID            Security_Name Security_Type  Quantity_Held  \
0   HLD00001   Diaz, Walker and Cline         Stock            544   
1   HLD00002         Johnson and Sons          Bond           1106   
2   HLD00003  Frazier, Shaw and Davis          Bond           1841   
3   HLD00004              Edwards Ltd   Mutual Fund           1427   
4   HLD00005          Valencia-Miller         Stock            805   

   Market_Price_Per_Unit  Total_Market_Value Acquisition_Date  
0                 493.30           268355.20       2021-08-11  
1                  44.65            49382.90       2025-03-07  
2                  53.83            99101.03       2023-03-08  
3                 424.49           605747.23       2022-10-12  
4                  96.97            78060.85       2025-02-06  
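Since the stated purpose is to analyze diversification, here is a quick sketch of portfolio weights by security type. It uses only the five preview rows above, copied into a stand-in frame, so the percentages are illustrative rather than a view of the full portfolio.

```python
import pandas as pd

# The five preview rows, reduced to the two columns needed here
holdings = pd.DataFrame({
    "Security_Type": ["Stock", "Bond", "Bond", "Mutual Fund", "Stock"],
    "Total_Market_Value": [268355.20, 49382.90, 99101.03, 605747.23, 78060.85],
})

# Sum market value per type, largest first
allocation = (
    holdings.groupby("Security_Type")["Total_Market_Value"]
    .sum()
    .sort_values(ascending=False)
)

# Convert totals into portfolio weights
weights = allocation / allocation.sum()
print(weights.round(3))
```

The same two lines should apply to `portfolio_holdings_df` directly, since it carries both columns.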


Now that we have a perfect dataset, let's make it crappy!

In [50]:
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
np.random.seed(91942)
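# random.seed() returns None, so seeded NumPy Generators supply random_state instead
# (DataFrame.sample accepts them in pandas >= 1.1)
random_state = np.random.default_rng(91942)
random_state_alt = np.random.default_rng(24919)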

# 1. Introduce duplicate holdings with slight variations (3%)
num_duplicates = int(len(portfolio_holdings_df) * 0.03)
duplicates_df = portfolio_holdings_df.sample(n=num_duplicates, random_state=random_state).copy()

duplicates_df['Security_Name'] = duplicates_df['Security_Name'].apply(
    lambda x: x + ' Inc.' if 'Inc.' not in x else x.replace('Inc.', 'Incorporated')
)
duplicates_df['Acquisition_Date'] += pd.to_timedelta(
    np.random.randint(-10, 10, size=num_duplicates), unit='d'
)

# 2. Introduce missing values in 'Market_Price_Per_Unit' (2%)
num_nulls = int(len(portfolio_holdings_df) * 0.02)
null_indices = portfolio_holdings_df.sample(n=num_nulls, random_state=random_state).index
portfolio_holdings_df.loc[null_indices, 'Market_Price_Per_Unit'] = np.nan

# 3. Introduce unrealistic market prices (outliers) (2%)
num_outliers = int(len(portfolio_holdings_df) * 0.02)
outlier_indices = portfolio_holdings_df.sample(n=num_outliers, random_state=random_state_alt).index
portfolio_holdings_df.loc[outlier_indices, 'Market_Price_Per_Unit'] *= np.random.choice([0.01, 100], size=num_outliers)
portfolio_holdings_df['Total_Market_Value'] = portfolio_holdings_df['Quantity_Held'] * portfolio_holdings_df['Market_Price_Per_Unit']

# 4. Introduce incorrect security types (misclassification) (3%)
num_misclassified = int(len(portfolio_holdings_df) * 0.03)
misclassified_indices = portfolio_holdings_df.sample(n=num_misclassified, random_state=random_state).index
portfolio_holdings_df.loc[misclassified_indices, 'Security_Type'] = np.random.choice(['Stock', 'Bond', 'ETF', 'Mutual Fund'], size=num_misclassified)

# 5. Introduce future acquisition dates (2%)
num_future_dates = int(len(portfolio_holdings_df) * 0.02)
future_indices = portfolio_holdings_df.sample(n=num_future_dates, random_state=random_state_alt).index
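# normalize() drops the time of day so future dates look like the other date-only values
portfolio_holdings_df.loc[future_indices, 'Acquisition_Date'] = pd.Timestamp('today').normalize() + pd.to_timedelta(np.random.randint(30, 365, size=num_future_dates), unit='d')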

# Combine duplicates back into the original DataFrame
portfolio_holdings_df = pd.concat([portfolio_holdings_df, duplicates_df], ignore_index=True)

# Shuffle the DataFrame to mix errors naturally
portfolio_holdings_df = portfolio_holdings_df.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Check the resulting DataFrame
print(portfolio_holdings_df.head())


  Holding_ID             Security_Name Security_Type  Quantity_Held  \
0   HLD00144               Jackson Inc         Stock           2908   
1   HLD00145          Sandoval-Herrera           ETF           2357   
2   HLD00226                Carter PLC         Stock           2428   
3   HLD00182              Gomez-Harris   Mutual Fund           4352   
4   HLD00239  Owens, Miller and Gibson          Bond           4716   

   Market_Price_Per_Unit  Total_Market_Value Acquisition_Date  
0                 157.64           458417.12       2021-01-09  
1                 487.14          1148188.98       2022-07-04  
2                 201.93           490286.04       2024-11-19  
3                 492.02          2141271.04       2021-01-23  
4                 426.09          2009440.44       2021-11-05  


In [51]:
# Save to CSV and cringe some more
portfolio_holdings_df.to_csv('portfolio_holdings.csv', index=False)
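The duplicates injected above keep their `Holding_ID` but pick up an `Inc.` or `Incorporated` suffix, so an exact-match check on the name misses them. One way to catch them is to compare a normalised name key. This is a sketch: `normalize_name` and its suffix rule are hypothetical, and the three-row frame is a stand-in for `portfolio_holdings_df`.

```python
import re
import pandas as pd

# One or more trailing "Inc", "Inc." or "Incorporated" tokens, with any leading spaces or commas
SUFFIX_PATTERN = re.compile(r"(?:[\s,]*\b(?:incorporated|inc\.?))+\s*$", flags=re.IGNORECASE)

def normalize_name(name: str) -> str:
    """Strip trailing Inc./Incorporated suffixes and lower-case the rest (hypothetical rule)."""
    return SUFFIX_PATTERN.sub("", name.strip()).lower()

holdings = pd.DataFrame({
    "Holding_ID": ["HLD00144", "HLD00144", "HLD00145"],
    "Security_Name": ["Jackson Inc", "Jackson Inc Inc.", "Sandoval-Herrera"],
})

holdings["name_key"] = holdings["Security_Name"].map(normalize_name)

# Rows sharing both the ID and the normalised name are likely the same holding
suspects = holdings[holdings.duplicated(subset=["Holding_ID", "name_key"], keep=False)]
print(suspects)
```

Matching on `Holding_ID` as well as the name key keeps genuinely different companies with similar names from being flagged.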

# Generate Dataset #3: 