# AIS Project - Financial Transactions

- **Notebook**: Merge.ipynb
- **Contents**: This notebook contains the code to merge all data files into a single dataset

Merging and exporting takes about 40 seconds, thanks to Polars being very fast :)

In [1]:
import polars as pl

def merge_datasets(cards_file: str, users_file: str, transactions_file: str) -> pl.DataFrame:
    """
    Merge three CSV datasets:
    - cards_file: Path to the cards data file, containing `client_id`.
    - users_file: Path to the users data file, containing `id` that matches `client_id` in cards_file.
    - transactions_file: Path to the transactions data file, containing `client_id`.

    Returns:
        A Polars DataFrame containing the merged data.
    """

    # Read the CSV files into Polars DataFrames
    cards_data = pl.read_csv(cards_file)
    users_data = pl.read_csv(users_file)
    transactions_data = pl.read_csv(transactions_file)

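    # Attach user attributes to each card: `client_id` in cards matches `id` in users
    # (Polars joins are inner joins by default)
    merged_cards_users = cards_data.join(users_data, left_on="client_id", right_on="id")

    # Attach transactions by `client_id` (inner join). Only the client is matched here,
    # so every transaction of a client is paired with every card of that client.
    final_merged_data = merged_cards_users.join(transactions_data, on="client_id")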

    return final_merged_data

In [4]:
merged_data = merge_datasets(
    cards_file="./data/cards_data.csv",
    users_file="./data/users_data.csv",
    transactions_file="./data/transactions_data.csv"
)

In [6]:
print(merged_data)

shape: (51_115_337, 37)
┌──────┬───────────┬────────────┬─────────────────┬───┬────────────────┬─────────┬──────┬────────┐
│ id   ┆ client_id ┆ card_brand ┆ card_type       ┆ … ┆ merchant_state ┆ zip     ┆ mcc  ┆ errors │
│ ---  ┆ ---       ┆ ---        ┆ ---             ┆   ┆ ---            ┆ ---     ┆ ---  ┆ ---    │
│ i64  ┆ i64       ┆ str        ┆ str             ┆   ┆ str            ┆ f64     ┆ i64  ┆ str    │
╞══════╪═══════════╪════════════╪═════════════════╪═══╪════════════════╪═════════╪══════╪════════╡
│ 4333 ┆ 1556      ┆ Mastercard ┆ Debit           ┆ … ┆ ND             ┆ 58523.0 ┆ 5499 ┆ null   │
│ 1955 ┆ 1556      ┆ Visa       ┆ Credit          ┆ … ┆ ND             ┆ 58523.0 ┆ 5499 ┆ null   │
│ 2972 ┆ 1556      ┆ Mastercard ┆ Debit (Prepaid) ┆ … ┆ ND             ┆ 58523.0 ┆ 5499 ┆ null   │
│ 412  ┆ 1556      ┆ Amex       ┆ Credit          ┆ … ┆ ND             ┆ 58523.0 ┆ 5499 ┆ null   │
│ 3764 ┆ 561       ┆ Mastercard ┆ Debit           ┆ … ┆ IA             ┆ 52722.0 ┆ 53

In [7]:
merged_data.write_csv("./data/merged/merged_data.csv")
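
The export above assumes that `./data/merged/` already exists; writing into a missing directory will typically fail. A minimal sketch for creating the target directory first, using only the standard library. The helper name `ensure_parent_dir` is hypothetical and not part of the notebook or of Polars.

```python
from pathlib import Path


def ensure_parent_dir(file_path: str) -> Path:
    """Create the parent directory of file_path if needed and return the path."""
    path = Path(file_path)
    # parents=True creates intermediate directories; exist_ok=True makes reruns safe
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Possible usage with the export cell (left commented out so this block has no side effects):
# merged_data.write_csv(ensure_parent_dir("./data/merged/merged_data.csv"))
```

The same helper could be applied to the `./data/sampled/` paths in the sampling cell.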

In [None]:
for n_rows in [100, 1000, 10_000, 100_000]:
    sampled_data = merged_data.sample(n=n_rows)
    sampled_data.write_csv(f"./data/sampled/sampled_data_{n_rows}.csv")