# TP1 – Data Analysis (IFT599 / IFT799)

## 3.3 Sequence Analysis

### 1. Building Transaction Sequences

We used the `transactions_data` table as the base and extended it by joining with:
- `mcc_codes` → to replace merchant category codes with readable descriptions.
- `fraud_labels` → to tag each transaction as fraudulent or not.

We then kept the key columns: `client_id`, `date`, `amount`, `mcc_description`, and `is_fraud`.

Transactions were sorted chronologically within each client.

Finally, we grouped by `client_id` and used a custom function to convert each client's ordered transactions into a list of events.

This gave us sequences like:

`0 → [(Book Stores, $33.96, No), (Eating Places and Restaurants, $7.78, No), (Drinking Places (Alcoholic Beverages), $65.86, No), ...]`

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

In [2]:
# Load transaction data
transactions = pd.read_csv("data/transactions_data.csv")

# Load card information
cards = pd.read_csv("data/cards_data.csv")

# Load user information
users = pd.read_csv("data/users_data.csv")

# Load mcc codes information
with open("data/mcc_codes.json", "r") as f:
    mcc_data = json.load(f)

mcc_codes = pd.DataFrame(list(mcc_data.items()), columns=["code", "description"])

# Load fraud information
with open("data/train_fraud_labels.json", "r") as f:
    fraud_data = json.load(f)

# Extract the inner dictionary
fraud_dict = fraud_data['target']

# Convert to DataFrame
fraud_labels = pd.DataFrame(list(fraud_dict.items()), columns=["transaction_id", "is_fraud"])
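
To make the reshaping step concrete, here is a minimal sketch on a toy dictionary. It assumes the label file has the shape `{"target": {transaction_id: label}}`, as the code above implies, and that labels are the strings `"Yes"` and `"No"` (only `"No"` is visible in this section, so `"Yes"` is an assumption). The third ID and the `is_fraud_flag` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the parsed JSON; the "Yes" entry and its ID are invented.
toy_fraud_data = {"target": {"7477483": "No", "7478861": "No", "7479999": "Yes"}}

# Same reshaping as above: one row per (transaction_id, label) pair.
toy_labels = pd.DataFrame(
    list(toy_fraud_data["target"].items()),
    columns=["transaction_id", "is_fraud"],
)

# Optional (hypothetical column): a boolean flag is easier to aggregate
# than "Yes"/"No" strings.
toy_labels["is_fraud_flag"] = toy_labels["is_fraud"].eq("Yes")

print(toy_labels)
```

Note that JSON object keys are always strings, which is one reason the transaction IDs need a dtype alignment before merging.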

Variables from `transactions` we will use:
- `transaction_id` (for joining with fraud labels later)
- `date` (key: to order the sequence)
- `client_id` (to group by client → one sequence per client)
- `card_id` (if you want sequences at the card level instead of the user level)
- `amount` (behavioral info)
- `mcc` (merchant category → type of event)

Expanded using:
- `mcc_codes`: Join on `transactions.mcc = mcc_codes.code`. Use `description` to make the event human-readable (instead of raw codes).
- `fraud_labels`: Join on `transactions.id = fraud_labels.transaction_id`. Adds `is_fraud` so you can tag events in the sequence.
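
The joins described above can be sanity-checked while they run. The sketch below uses toy frames (the codes and IDs are illustrative, not taken from the dataset) and two standard `DataFrame.merge` parameters: `validate="many_to_one"` to confirm that each MCC code appears only once in the lookup table, and `indicator=True` to see which transactions found no match.

```python
import pandas as pd

# Toy frames; values are illustrative only.
toy_txns = pd.DataFrame({
    "transaction_id": ["1", "2", "3"],
    "mcc": ["5942", "5812", "9999"],  # "9999" has no match on purpose
})
toy_mcc = pd.DataFrame({
    "mcc_code": ["5942", "5812"],
    "mcc_description": ["Book Stores", "Eating Places and Restaurants"],
})

# validate="many_to_one" raises MergeError if mcc_code is not unique,
# which would otherwise silently duplicate transactions.
# indicator=True adds a "_merge" column showing where each row came from.
checked = toy_txns.merge(
    toy_mcc,
    left_on="mcc",
    right_on="mcc_code",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} transactions have no MCC description")
```

With a left join the row count should equal the number of transactions; a larger count is a sign of duplicate keys on the right side.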

In [9]:
# Rename keys before merging
transactions = transactions.rename(columns={"id": "transaction_id"})
mcc_codes = mcc_codes.rename(columns={"code": "mcc_code", "description": "mcc_description"})

In [10]:
# Ensure same dtype for transaction_id
transactions["transaction_id"] = transactions["transaction_id"].astype(str)
fraud_labels["transaction_id"] = fraud_labels["transaction_id"].astype(str)

# Ensure same dtype for mcc
transactions["mcc"] = transactions["mcc"].astype(str)
mcc_codes["mcc_code"] = mcc_codes["mcc_code"].astype(str)

In [13]:
# Merge transactions with MCC codes
# transactions.mcc -> mcc_codes.mcc_code (after the rename above)
merged_df = transactions.merge(mcc_codes, left_on="mcc", right_on="mcc_code", how="left")

# Merge with fraud labels
# both frames now share the key column transaction_id
merged_df = merged_df.merge(fraud_labels, on="transaction_id", how="left")
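
Because the fraud-label merge is a left join, any transaction that is absent from the label file keeps a missing `is_fraud` value. A minimal sketch of how that gap could be measured and made explicit, on a toy frame; the `"Unknown"` fill value and the `is_fraud_filled` column are assumptions for illustration, not the notebook's choice.

```python
import pandas as pd

# Toy merged frame; the None mimics a transaction absent from the label file.
toy_merged = pd.DataFrame({
    "transaction_id": ["1", "2", "3", "4"],
    "is_fraud": ["No", "No", None, "Yes"],
})

# Share of transactions without a label.
missing_share = toy_merged["is_fraud"].isna().mean()
print(f"Unlabeled transactions: {missing_share:.1%}")

# One option: mark the gap explicitly so it is not confused with a genuine "No".
toy_merged["is_fraud_filled"] = toy_merged["is_fraud"].fillna("Unknown")
```

Whether unlabeled events should be kept, dropped, or flagged depends on how the sequences are used later.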

In [14]:
# List columns
print(merged_df.columns.tolist())

['transaction_id', 'date', 'client_id', 'card_id', 'amount', 'use_chip', 'merchant_id', 'merchant_city', 'merchant_state', 'zip', 'mcc', 'errors', 'mcc_code', 'mcc_description', 'is_fraud']


In [16]:
# Keep useful columns for sequences
seq_df = merged_df[[
    "transaction_id",  # transaction ID
    "client_id",       # user
    "date",            # chronological order
    "amount",          # spending amount
    "mcc_description",     # MCC description (event type)
    "is_fraud"         # fraud label
]].copy()

# Convert date column to datetime
seq_df["date"] = pd.to_datetime(seq_df["date"])
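
The `amount` column is stored as text with a currency symbol (for example `$33.96`), so it cannot be summed or compared numerically as-is. A sketch of one possible conversion, on toy values; the negative amount, the thousands-separator handling, and the `amount_num` column name are assumptions about the data, not something shown in this section.

```python
import pandas as pd

# Toy amounts in the same "$33.96" string format; the negative value is an assumed edge case.
toy = pd.DataFrame({"amount": ["$33.96", "$7.78", "$-12.50"]})

# Strip the currency symbol (and any thousands separators), then parse as float.
toy["amount_num"] = pd.to_numeric(
    toy["amount"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",  # unparseable values become NaN instead of raising
)

print(toy.dtypes)
```

Keeping the original string column alongside the numeric one makes it easy to inspect any rows that were coerced to NaN.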

In [17]:
# Sort by client_id + date
seq_df = seq_df.sort_values(by=["client_id", "date"])
seq_df.head()

| | transaction_id | client_id | date | amount | mcc_description | is_fraud |
|---|---|---|---|---|---|---|
| 1795 | 7477483 | 0 | 2010-01-01 13:10:00 | $33.96 | Book Stores | No |
| 2960 | 7478861 | 0 | 2010-01-01 19:39:00 | $7.78 | Eating Places and Restaurants | No |
| 3299 | 7479264 | 0 | 2010-01-01 22:13:00 | $65.86 | Drinking Places (Alcoholic Beverages) | No |
| 4955 | 7481226 | 0 | 2010-01-02 13:08:00 | $55.85 | Lumber and Building Materials | NaN |
| 8801 | 7485861 | 0 | 2010-01-03 15:44:00 | $1.37 | Miscellaneous Food Stores | No |
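
Sorting on `client_id` and `date` alone does not say how two transactions with the same timestamp should be ordered. A sketch of one way to make the order reproducible by adding the transaction ID as a tie-breaker. It assumes IDs are numeric strings, and whether a lower ID really means an earlier transaction is an assumption; the toy IDs and the `txn_num` helper column are invented.

```python
import pandas as pd

# Toy data: client 0 has two transactions in the same minute.
toy = pd.DataFrame({
    "transaction_id": ["12", "9", "10"],
    "client_id": [0, 0, 0],
    "date": pd.to_datetime(
        ["2010-01-01 13:10", "2010-01-01 13:10", "2010-01-01 09:00"]
    ),
})

# transaction_id was cast to str for merging, so sort on a numeric copy;
# plain string order would put "12" before "9".
toy["txn_num"] = pd.to_numeric(toy["transaction_id"])

ordered = toy.sort_values(["client_id", "date", "txn_num"]).reset_index(drop=True)
print(ordered[["transaction_id", "date"]])
```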


In [20]:
# For each client, build a list of tuples (description, amount, is_fraud)
def build_sequence(client_df):
    """Return one client's rows as (mcc_description, amount, is_fraud) tuples, in row order."""
    return list(zip(client_df["mcc_description"], client_df["amount"], client_df["is_fraud"]))

In [21]:
# Group transactions by client_id
grouped_txns = seq_df.groupby("client_id")

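# Get sequences for each group; selecting the event columns explicitly keeps the
# grouping column out of the frames passed to build_sequence
client_sequences = grouped_txns[["mcc_description", "amount", "is_fraud"]].apply(build_sequence)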


In [24]:
# Check a few examples
client_sequences.head()

client_id
0    [(Book Stores, $33.96, No), (Eating Places and...
1    [(Taxicabs and Limousines, $15.09, No), (Drink...
2    [(Wholesale Clubs, $0.72, No), (Drug Stores an...
3    [(Grocery Stores, Supermarkets, $15.64, No), (...
4    [(Betting (including Lottery Tickets, Casinos)...
dtype: object
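
Once the sequences exist, simple per-client summaries are a quick way to check them. The sketch below works on a toy Series with the same `(description, amount, is_fraud)` tuple shape; the values, the `"Yes"` label, and the `n_events` / `n_fraud` column names are assumptions for illustration.

```python
import pandas as pd

# Toy sequences in the same tuple shape as client_sequences; values are illustrative.
toy_sequences = pd.Series({
    0: [("Book Stores", "$33.96", "No"),
        ("Eating Places and Restaurants", "$7.78", "No")],
    1: [("Taxicabs and Limousines", "$15.09", "No"),
        ("Wholesale Clubs", "$0.72", "Yes"),
        ("Grocery Stores, Supermarkets", "$15.64", "No")],
})
toy_sequences.index.name = "client_id"

# Number of events per client.
seq_lengths = toy_sequences.apply(len)

# Number of events labeled "Yes" per client (the "Yes" label value is assumed).
fraud_counts = toy_sequences.apply(
    lambda seq: sum(1 for _, _, label in seq if label == "Yes")
)

summary = pd.DataFrame({"n_events": seq_lengths, "n_fraud": fraud_counts})
print(summary)
```

The sum of `n_events` over all clients should equal the number of rows in `seq_df`, which gives a cheap consistency check on the grouping step.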