# LumaFin - Data Setup and Preparation

This notebook prepares the training data for the LumaFin transaction categorization system.

**What this notebook does:**
1. Clones the LumaFin repository
2. Installs required dependencies
3. Loads the training data shipped with the repository, then filters and splits it
4. Saves prepared data to Google Drive for use in other notebooks

**Runtime:** GPU not required (use CPU runtime)

**Time:** ~10 minutes

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create directories for LumaFin models and data
# (-p also creates the parent LumaFin folder and is safe to re-run)
!mkdir -p /content/drive/MyDrive/LumaFin/models /content/drive/MyDrive/LumaFin/data

## Step 2: Clone Repository and Install Dependencies

In [None]:
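# Clone the repository (skipped if it is already present, so this cell can be re-run)
!test -d /content/LumaFin || git clone https://github.com/LathissKhumar/LumaFin.git /content/LumaFin
%cd /content/LumaFin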

In [None]:
# Install core dependencies
!pip install -q pandas numpy scikit-learn sentence-transformers torch

## Step 3: Check and Load Existing Data

In [None]:
import pandas as pd
import os

# Check if training data exists
data_file = 'data/merged_training.csv'
if not os.path.exists(data_file):
    # Stop here: the later cells need `df`, so continuing would only fail with a NameError
    raise FileNotFoundError(f"❌ Training data not found at {data_file}")

df = pd.read_csv(data_file)
print(f"✅ Found {len(df)} training examples")
print(f"\nColumns: {df.columns.tolist()}")
print("\nCategory distribution:")
print(df['category'].value_counts())
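
The later cells read the `merchant` and `category` columns, so it can help to confirm they are present right after loading. The following is a minimal sketch, not part of the LumaFin repository: `REQUIRED_COLUMNS`, `missing_columns` and the sample frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# Assumption: these are the columns the later cells in this notebook rely on.
REQUIRED_COLUMNS = {"merchant", "category"}

def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df, sorted."""
    return sorted(set(required) - set(df.columns))

# Hypothetical sample that lacks the 'category' column
sample = pd.DataFrame({"merchant": ["Uber"], "amount": [12.5]})
print(missing_columns(sample))
```

On the real data this would be called as `missing_columns(df)`; an empty list means both columns are available.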

## Step 4: Filter and Clean Data

In [None]:
# Remove Uncategorized entries
df_filtered = df[df['category'] != 'Uncategorized'].copy()

# Fall back to the merchant name when the data has no description column
if 'description' not in df_filtered.columns:
    df_filtered['description'] = df_filtered['merchant']

# Keep one row per (merchant, category) pair
df_filtered = df_filtered.drop_duplicates(subset=['merchant', 'category'])

print(f"✅ Filtered: {len(df_filtered)} examples")
print(df_filtered['category'].value_counts())
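
The filtering steps above can be wrapped in one function and tried on a few rows before running them on the full dataset. This is a sketch under assumptions: the function name `clean_transactions` and the sample rows are made up for illustration, and the final `reset_index` is an addition that the cell above does not perform.

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 4 filters to a copy of df."""
    # Drop rows labelled 'Uncategorized'
    out = df[df["category"] != "Uncategorized"].copy()
    # Fall back to the merchant name when there is no description column
    if "description" not in out.columns:
        out["description"] = out["merchant"]
    # Keep one row per (merchant, category) pair
    out = out.drop_duplicates(subset=["merchant", "category"])
    # Not in the original cell: renumber rows so the index is contiguous
    return out.reset_index(drop=True)

# Hypothetical sample: one duplicate pair and one uncategorized row
sample = pd.DataFrame({
    "merchant": ["Starbucks", "Starbucks", "Uber", "Unknown Shop"],
    "category": ["Food", "Food", "Transport", "Uncategorized"],
})
cleaned = clean_transactions(sample)
print(cleaned)
```

Of the four sample rows, two survive: the duplicate `Starbucks` row and the `Uncategorized` row are removed.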

## Step 5: Create Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df_filtered, test_size=0.2, random_state=42, stratify=df_filtered['category']
)

print(f"Training: {len(train_df)}, Test: {len(test_df)}")

train_df.to_csv('data/train.csv', index=False)
test_df.to_csv('data/test.csv', index=False)
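
A stratified split raises a `ValueError` when any category has only one example, which can happen after deduplication. One possible guard, sketched here on made-up data, is to drop such categories before splitting. The helper `drop_rare_categories` and its `min_count` parameter are hypothetical; whether rare categories should be dropped or merged into another label is a modelling decision this notebook does not settle.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def drop_rare_categories(df: pd.DataFrame, min_count: int = 2) -> pd.DataFrame:
    """Keep only rows whose category appears at least min_count times."""
    counts = df["category"].value_counts()
    keep = counts[counts >= min_count].index
    return df[df["category"].isin(keep)]

# Hypothetical sample: two well-populated categories and one singleton
sample = pd.DataFrame({
    "merchant": [f"m{i}" for i in range(11)],
    "category": ["Food"] * 5 + ["Transport"] * 5 + ["Rare"],
})

splittable = drop_rare_categories(sample)
train_s, test_s = train_test_split(
    splittable, test_size=0.2, random_state=42, stratify=splittable["category"]
)
print(f"Training: {len(train_s)}, Test: {len(test_s)}")
```

With the real data, `drop_rare_categories(df_filtered)` would be applied before the `train_test_split` call above.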

## Step 6: Copy to Google Drive

In [None]:
!cp data/train.csv /content/drive/MyDrive/LumaFin/data/
!cp data/test.csv /content/drive/MyDrive/LumaFin/data/
print("✅ Data saved to Google Drive")
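
The cell above prints the success message even if a `cp` command fails, for example when Drive is not mounted. A hedged alternative is to copy in Python and compare row counts afterwards. In this sketch `copy_and_verify` is a hypothetical helper, and a temporary directory stands in for the Drive folder so the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

def copy_and_verify(src, dest_dir) -> int:
    """Copy a CSV into dest_dir and confirm the copy has the same row count."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = Path(shutil.copy(src, dest_dir))  # returns the path of the new file
    n_src, n_dest = len(pd.read_csv(src)), len(pd.read_csv(dest))
    if n_src != n_dest:
        raise IOError(f"Row count mismatch for {dest}: {n_src} != {n_dest}")
    return n_dest

# Demo with a temporary directory standing in for the Drive folder
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "train.csv"
    pd.DataFrame({"merchant": ["a", "b", "c"], "category": ["x", "y", "x"]}).to_csv(src, index=False)
    rows = copy_and_verify(src, Path(tmp) / "drive" / "data")
    print(f"Copied and verified {rows} rows")
```

In this notebook the call would be `copy_and_verify('data/train.csv', '/content/drive/MyDrive/LumaFin/data')`, and likewise for `data/test.csv`.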