# 01. Data Generation & Synthesis
## Overview
In this notebook, we generate a synthetic dataset of labeled financial transactions.
Real-world banking data is highly sensitive and subject to GDPR and banking secrecy laws, so we build a mock dataset whose descriptions mimic the merchant strings found on real bank statements.

### Goal:
Create a dataset with transaction descriptions and their corresponding categories (e.g., Food, Transport, Utilities) to train a classification model.

## Imports

In [1]:
import pandas as pd
import numpy as np
import os

# Seed NumPy's global random state so every run generates the same data
np.random.seed(42)

## Defining Transaction Samples

In [2]:
data_samples = {
    'Food & Drinks': [
        'Starbucks Coffee', 'McDonalds London', 'Silpo Supermarket Kyiv',
        'WOG Cafe', 'Pizza Hut Delivery', 'Uber Eats', 'Novus Shop', 'Bakery Kyiv'
    ],
    'Transport': [
        'Uber Trip', 'Bolt Taxi Kyiv', 'Gas Station WOG', 'Shell Fuel',
        'Train Ticket PKP', 'Parking Zone A', 'Public Transport London', 'Avis Rental'
    ],
    'Entertainment': [
        'Netflix.com', 'Spotify Monthly', 'Steam Games', 'Cinema City',
        'Youtube Premium', 'PlayStation Network', 'Disney Plus', 'Apple Services'
    ],
    'Shopping': [
        'Zara Clothing', 'H&M Online', 'Amazon.com', 'IKEA Furniture',
        'Apple Store', 'Adidas Shop', 'Nike Store', 'Pharmacy Central'
    ],
    'Utilities & Bills': [
        'Kyivstar Mobile', 'Vodafone Bill', 'Electric Utility',
        'Internet Provider', 'Insurance Premium', 'Monthly Rent', 'iCloud Storage'
    ]
}

# Generate 500 records: pick a category uniformly at random, then one description within it
records = []
for _ in range(500):
    category = np.random.choice(list(data_samples.keys()))
    base_desc = np.random.choice(data_samples[category])

    # Append a random 4-digit transaction ID to simulate real-world "noise".
    # Note: randint's upper bound is exclusive, so IDs fall in 1000..9998.
    random_id = np.random.randint(1000, 9999)
    record = f"{base_desc} ID:{random_id}"

    records.append((record, category))

df = pd.DataFrame(records, columns=['description', 'category'])
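
The loop above can also be wrapped in a reusable function. The following is a minimal sketch, not the code this notebook ran: `generate_transactions` and `demo_samples` are hypothetical names, and it uses a local `np.random.default_rng` generator instead of the global seed, so it will not reproduce the exact rows shown later in this notebook.

```python
import numpy as np
import pandas as pd


def generate_transactions(samples, n_records=500, seed=42):
    """Hypothetical helper: build a synthetic (description, category) table."""
    # A local generator keeps the global NumPy random state untouched.
    rng = np.random.default_rng(seed)
    categories = list(samples.keys())
    rows = []
    for _ in range(n_records):
        # Pick a category uniformly, then one base description within it.
        category = categories[rng.integers(len(categories))]
        options = samples[category]
        base_desc = options[rng.integers(len(options))]
        # Upper bound is exclusive, so this covers 1000..9999.
        random_id = rng.integers(1000, 10000)
        rows.append((f"{base_desc} ID:{random_id}", category))
    return pd.DataFrame(rows, columns=["description", "category"])


# Small illustrative input (a subset of the samples defined above).
demo_samples = {
    "Food & Drinks": ["Starbucks Coffee", "WOG Cafe"],
    "Transport": ["Uber Trip", "Shell Fuel"],
}
demo_df = generate_transactions(demo_samples, n_records=20, seed=0)
print(demo_df.head())
```

Passing the seed as an argument makes it easy to create several independent datasets (for example, separate train and test sets) without reseeding the global state.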

## Data Storage

In [3]:
# Create directory if it doesn't exist
output_path = '../data/raw'
os.makedirs(output_path, exist_ok=True)

# Save to CSV
df.to_csv(f'{output_path}/transactions.csv', index=False)
print(f"Data saved to {output_path}/transactions.csv")

Data saved to ../data/raw/transactions.csv
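
A quick way to confirm that a CSV export is intact is to read it back and compare it with the in-memory frame. The sketch below is illustrative only: it writes a tiny hypothetical `sample_df` to a temporary directory rather than touching `../data/raw`.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the generated DataFrame (hypothetical rows).
sample_df = pd.DataFrame(
    {
        "description": ["Uber Trip ID:1234", "Netflix.com ID:5678"],
        "category": ["Transport", "Entertainment"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "transactions.csv")
    sample_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Compare column names, shape, and cell values after the round trip.
round_trip_ok = (
    list(reloaded.columns) == list(sample_df.columns)
    and reloaded.shape == sample_df.shape
    and reloaded.values.tolist() == sample_df.values.tolist()
)
print("Round trip OK:", round_trip_ok)
```

The same comparison can be run against the real file by replacing `sample_df` with `df` and `csv_path` with the path used in the cell above.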


## Data Preview

In [4]:
print("First 5 rows:")
print(df.head())

print("\nCategory Distribution:")
print(df['category'].value_counts())

First 5 rows:
               description           category
0      Apple Store ID:6390           Shopping
1   Apple Services ID:6734      Entertainment
2  Gas Station WOG ID:5426          Transport
3   Apple Services ID:9322      Entertainment
4    Vodafone Bill ID:7949  Utilities & Bills

Category Distribution:
category
Entertainment        115
Shopping             108
Food & Drinks         98
Utilities & Bills     91
Transport             88
Name: count, dtype: int64
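
Since `np.random.choice` picks each of the five categories with equal probability, every count should land near 500 / 5 = 100. As a rough sanity check, the sketch below treats each category count as a binomial variable (an approximation, since the counts are not independent) and measures how many standard deviations each observed count sits from the expected value. The counts are copied from the output above.

```python
import math

# Observed counts copied from the category distribution printed above.
observed = {
    "Entertainment": 115,
    "Shopping": 108,
    "Food & Drinks": 98,
    "Utilities & Bills": 91,
    "Transport": 88,
}

n_records = sum(observed.values())
p = 1 / len(observed)  # uniform probability per category
expected = n_records * p
std_dev = math.sqrt(n_records * p * (1 - p))  # binomial standard deviation

# Distance of each count from the expected value, in standard deviations.
z_scores = {name: (count - expected) / std_dev for name, count in observed.items()}

for name, z in z_scores.items():
    print(f"{name:<18} z = {z:+.2f}")
```

All five counts fall within two standard deviations of the expected value, which is consistent with uniform sampling, so the dataset can be treated as roughly balanced for training.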
