# Personal Finance Tracker - ML Pipeline

## Transaction Categorization using Machine Learning

**Author:** Aniket Behera  
**Goal:** Build an ML model to automatically categorize financial transactions

---

### Pipeline Overview:
1. **Data Generation & Exploration**
2. **Feature Engineering**
3. **Model Selection & Training**
4. **Hyperparameter Tuning**
5. **Evaluation & Analysis**
6. **Real-World Testing & Error Analysis**
7. **Deployment Considerations**

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully")

## 1. Data Generation & Exploration

The dataset is synthetic transaction data that mimics real-world patterns, loaded here from `large_transactions.csv`:
- 1600+ transactions over 12 months
- 12 categories (Dining, Groceries, Utilities, etc.)
- Includes recurring transactions (subscriptions, bills, salary)
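
The generation step itself is not shown in this notebook. As a rough illustration only, a generator along the following lines could produce data of a similar shape. The merchant names, the `amount` field, the year, and the `generate_transactions` helper are assumptions for the sketch, not the script that produced `large_transactions.csv`.

```python
import random
from datetime import date, timedelta

# Hypothetical merchant lists; the real dataset uses 12 categories.
MERCHANTS = {
    "Dining": ["Starbucks Coffee", "Pizza Hut", "Local Diner"],
    "Groceries": ["Walmart Grocery", "Whole Foods Market"],
    "Utilities": ["Electric Company Bill", "Water Utility Payment"],
}
# Hypothetical recurring items: (description, category, day of month).
RECURRING = [
    ("Netflix Monthly Subscription", "Entertainment", 5),
    ("Rent Payment", "Housing", 1),
    ("Paycheck Deposit", "Income", 28),
]

def generate_transactions(n_random, year=2024, seed=42):
    """Return a date-sorted list of dicts with date, description, category, amount."""
    rng = random.Random(seed)
    rows = []
    start = date(year, 1, 1)
    # One-off transactions on random days of the year.
    for _ in range(n_random):
        category = rng.choice(list(MERCHANTS))
        rows.append({
            "date": start + timedelta(days=rng.randrange(365)),
            "description": rng.choice(MERCHANTS[category]),
            "category": category,
            "amount": round(rng.uniform(5, 200), 2),
        })
    # Recurring transactions: one per month on a fixed day.
    for description, category, day in RECURRING:
        for month in range(1, 13):
            rows.append({
                "date": date(year, month, day),
                "description": description,
                "category": category,
                "amount": round(rng.uniform(10, 2000), 2),
            })
    return sorted(rows, key=lambda r: r["date"])

transactions = generate_transactions(n_random=100)
n_categories = len({t["category"] for t in transactions})
print(f"{len(transactions)} transactions across {n_categories} categories")
```

The resulting list can be passed to `pd.DataFrame(transactions)` and written out with `to_csv` if a file on disk is needed.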

In [None]:
# Load dataset
df = pd.read_csv('large_transactions.csv')
df['date'] = pd.to_datetime(df['date'])

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Category distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar plot
category_counts = df['category'].value_counts()
category_counts.plot(kind='bar', ax=ax1, color='steelblue', alpha=0.8)
ax1.set_title('Transaction Distribution by Category', fontsize=14, fontweight='bold')
ax1.set_xlabel('Category')
ax1.set_ylabel('Count')
ax1.grid(axis='y', alpha=0.3)
plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45, ha='right')

# Pie chart
category_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%', startangle=90)
ax2.set_title('Category Distribution', fontsize=14, fontweight='bold')
ax2.set_ylabel('')

plt.tight_layout()
plt.show()

print(f"\nNumber of categories: {df['category'].nunique()}")
print(f"Most common category: {category_counts.index[0]} ({category_counts.values[0]} transactions)")

In [None]:
# Sample transactions per category
print("Sample transactions from each category:\n")
# groupby iterates categories in sorted order; random_state keeps the samples reproducible
for category, group in df.groupby('category'):
    samples = group['description'].sample(min(3, len(group)), random_state=42).tolist()
    print(f"{category:20s}: {', '.join(samples)}")

## 2. Feature Engineering

**Why TF-IDF?**
- Transaction descriptions are text data
- TF-IDF converts text to numerical features
- Captures importance of words across categories
- Using `ngram_range=(1, 2)` (unigrams plus bigrams) captures phrases like "gas station", not just "gas"
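
To make the n-gram point concrete, here is a small hand-rolled sketch of how unigrams and bigrams are extracted and how an inverse document frequency weight is computed. It uses the smoothed IDF formula that scikit-learn documents for its default settings, `ln((1 + n) / (1 + df)) + 1`. The `ngrams` and `idf` helpers and the toy descriptions are illustrative assumptions, and the real `TfidfVectorizer` also applies its own tokenization rules and L2 normalization, which this sketch omits.

```python
import math

def ngrams(text, n_min=1, n_max=2):
    """Lowercase, split on whitespace, and return all n-grams in the range."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in set(ngrams(doc)))
    return math.log((1 + n_docs) / (1 + df)) + 1

# Toy descriptions (assumption), chosen so that "gas" is ambiguous.
docs = [
    "Shell Gas Station",
    "Gas Station Fill-up",
    "Natural Gas Utility Bill",
    "Starbucks Coffee Shop",
]

print(ngrams("Gas Station Fill-up"))
for term in ["gas", "gas station", "coffee"]:
    print(f"{term!r}: idf = {idf(term, docs):.3f}")
```

In this toy corpus "gas" occurs in three of the four descriptions, including a utility bill, so it receives a lower weight than the bigram "gas station", which occurs in only two.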

In [None]:
# Explore TF-IDF features
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50)
X_sample = vectorizer.fit_transform(df['description'])

feature_names = vectorizer.get_feature_names_out()
print(f"Number of features extracted: {len(feature_names)}")
print(f"\nFirst 20 features (alphabetical order, not ranked by importance):")
print(feature_names[:20])

## 3. Model Selection & Training

**Why Random Forest?**
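1. Accepts the sparse TF-IDF matrix directly, with no dense conversion
2. Averaging many trees reduces overfitting, provided depth and split parameters are tuned
3. Provides feature importances (which words matter most)
4. No need for feature scaling
5. Can cope with imbalanced classes, particularly when `class_weight='balanced'` is set (not used in this notebook)

**Alternatives considered:**
- Naive Bayes: Fast baseline, but assumes features are conditionally independent
- Logistic Regression: Linear decision boundary, may miss feature interactions
- SVM: Kernel variants train more slowly and are harder to interpret
- Neural Networks: Overkill for a dataset of this size (1600+ transactions)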
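
The notebook does not include a comparison run for these alternatives. One way to check the trade-offs empirically would be to cross-validate each candidate inside the same TF-IDF pipeline. The sketch below does this on a tiny toy dataset, which is an assumption made to keep the example self-contained; the scores it prints say nothing about the real data, and in the notebook `X_train` and `y_train` would take the place of `descriptions` and `labels`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (assumption); use X_train / y_train in the notebook.
descriptions = [
    "Starbucks Coffee", "Pizza Hut Dinner", "Local Diner Lunch", "Coffee Shop Latte",
    "Walmart Grocery", "Whole Foods Market", "Grocery Store Produce", "Market Fresh Grocery",
    "Electric Bill Payment", "Water Utility Bill", "Gas Utility Payment", "Internet Bill",
]
labels = ["Dining"] * 4 + ["Groceries"] * 4 + ["Utilities"] * 4

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, clf in candidates.items():
    # Same feature extraction for every candidate so only the classifier differs.
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipeline, descriptions, labels, cv=2, scoring="f1_weighted")
    results[name] = scores.mean()
    print(f"{name:20s}: mean weighted F1 = {results[name]:.3f}")
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary leaks from the validation fold into training.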

In [None]:
# Prepare data
X = df['description']
y = df['category']

# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"\nClass distribution in training set:")
print(y_train.value_counts())

In [None]:
# Create baseline model
baseline_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])

print("Training baseline model...")
baseline_pipeline.fit(X_train, y_train)

# Evaluate baseline
y_pred_baseline = baseline_pipeline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)

print(f"✓ Baseline model trained")
print(f"  Baseline accuracy: {baseline_accuracy:.4f}")

## 4. Hyperparameter Tuning

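Using GridSearchCV to find optimal parameters:
- `ngram_range`: Should we use single words only, or single words plus two-word phrases?
- `max_features`: How many features to keep?
- `n_estimators`: Number of trees in the forest
- `max_depth`: How deep should trees grow?
- `min_samples_split`: How many samples must a node hold before it can be split?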
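
The cost of an exhaustive search grows multiplicatively with each parameter. The sketch below enumerates the grid used in the search cell with `itertools.product` to show how the number of candidates and fits is derived; the `combinations` and `cv_folds` names are introduced here for illustration only.

```python
from itertools import product

# Same grid as the search cell; keys follow scikit-learn's <step>__<parameter> convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [300, 500],
    "clf__n_estimators": [100, 200],
    "clf__max_depth": [15, 20, None],
    "clf__min_samples_split": [2, 5],
}

# Every combination of values is one candidate configuration.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
cv_folds = 3  # matches cv=3 in the grid search cell

print(f"Candidate configurations: {len(combinations)}")
print(f"Total model fits:         {len(combinations) * cv_folds}")
print(f"First candidate:          {combinations[0]}")
```

With 2 x 2 x 2 x 3 x 2 values this gives 48 candidates and, under 3-fold cross-validation, 144 fits, plus one final refit of the best configuration on the full training set.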

In [None]:
# Hyperparameter grid
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_features': [300, 500],
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [15, 20, None],
    'clf__min_samples_split': [2, 5]
}

print(f"Total combinations to test: {np.prod([len(v) for v in param_grid.values()])}")
print("\nStarting grid search (this may take a few minutes)...")

In [None]:
# Grid search with cross-validation
grid_search = GridSearchCV(
    baseline_pipeline,
    param_grid,
    cv=3,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\n✓ Grid search complete")
print(f"\nBest parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest cross-validation F1 score: {grid_search.best_score_:.4f}")

## 5. Evaluation & Analysis

Comprehensive evaluation using multiple metrics:
- **Accuracy**: Overall correctness
- **Precision**: Of predicted category X, how many were actually X?
- **Recall**: Of all actual category X transactions, how many did we catch?
- **F1 Score**: Harmonic mean of precision and recall
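
As a worked example of these definitions, the sketch below computes each metric from raw counts on six made-up predictions, including the support-weighted F1 that corresponds to the `f1_weighted` scoring used in the grid search. The labels and the `per_class_metrics` helper are illustrative assumptions, not output from the model.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one category, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only.
y_true = ["Dining", "Dining", "Dining", "Groceries", "Groceries", "Utilities"]
y_pred = ["Dining", "Dining", "Groceries", "Groceries", "Dining", "Utilities"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.3f}")

for label in sorted(set(y_true)):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label:10s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")

# Weighted F1: per-class F1 averaged with weights equal to each class's true count.
weighted_f1 = sum(
    per_class_metrics(y_true, y_pred, c)[2] * y_true.count(c) for c in set(y_true)
) / len(y_true)
print(f"Weighted F1: {weighted_f1:.3f}")
```

Weighting by support means frequent categories dominate the score, so a rare category can perform poorly without moving the weighted F1 much; the per-category report is the place to spot that.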

In [None]:
# Get best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Accuracy
test_accuracy = accuracy_score(y_test, y_pred)
train_accuracy = accuracy_score(y_train, best_model.predict(X_train))

print(f"Model Performance:")
print(f"  Training accuracy: {train_accuracy:.4f}")
print(f"  Test accuracy:     {test_accuracy:.4f}")
print(f"  Overfitting gap:   {(train_accuracy - test_accuracy):.4f}")

# Cross-validation on the training set (hyperparameters were tuned on this same data, so the estimate is optimistic)
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"\n5-Fold CV F1 Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

In [None]:
# Classification report
print("\nPer-Category Performance:")
print("="*70)
print(classification_report(y_test, y_pred))

In [None]:
# Confusion matrix (pass labels explicitly so rows/columns match the tick labels)
categories = best_model.classes_
cm = confusion_matrix(y_test, y_pred, labels=categories)

plt.figure(figsize=(14, 12))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=categories, yticklabels=categories,
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Transaction Categorization', fontsize=16, fontweight='bold')
plt.ylabel('True Category', fontsize=12)
plt.xlabel('Predicted Category', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Analysis
print("\nKey Observations:")
print("- Diagonal values (correct predictions) should be highest")
print("- Off-diagonal values show misclassifications")
print("- Can identify which categories are confused with each other")

In [None]:
# Feature importance
vectorizer = best_model.named_steps['tfidf']
classifier = best_model.named_steps['clf']

feature_names = vectorizer.get_feature_names_out()
importances = classifier.feature_importances_

# Get top 20 features
top_indices = np.argsort(importances)[-20:][::-1]
top_features = [(feature_names[i], importances[i]) for i in top_indices]

# Plot
plt.figure(figsize=(12, 6))
features, scores = zip(*top_features)
plt.barh(range(len(features)), scores, color='steelblue', alpha=0.8)
plt.yticks(range(len(features)), features)
plt.xlabel('Importance Score')
plt.title('Top 20 Most Important Features (Keywords)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nMost important keywords for categorization:")
for feature, score in top_features[:10]:
    print(f"  {feature:30s}: {score:.4f}")

## 6. Real-World Testing & Error Analysis

In [None]:
# Test on unseen examples
test_transactions = [
    "Starbucks Coffee Shop",
    "Walmart Grocery Store",
    "Netflix Monthly Subscription",
    "Uber Ride to Airport",
    "Paycheck Deposit",
    "CVS Pharmacy Medicine",
    "Delta Airlines Flight Ticket",
    "Rent Payment Apartment",
    "Amazon Online Shopping",
    "Gas Station Fill-up"
]

predictions = best_model.predict(test_transactions)
probabilities = best_model.predict_proba(test_transactions)
confidences = np.max(probabilities, axis=1)

print("Predictions on unseen transactions:\n")
print(f"{'Description':<35} {'Predicted Category':<20} {'Confidence'}")
print("="*75)
for desc, pred, conf in zip(test_transactions, predictions, confidences):
    print(f"{desc:<35} {pred:<20} {conf:.2%}")

In [None]:
# Error analysis - find misclassifications
mask = (y_test != y_pred)  # boolean Series aligned with the test index
errors = X_test[mask]
true_labels = y_test[mask]
pred_labels = pd.Series(y_pred, index=y_test.index)[mask]

print(f"\nMisclassified transactions: {len(errors)} out of {len(y_test)} ({len(errors)/len(y_test):.1%})")
print("\nSample misclassifications:")
print(f"{'Transaction':<40} {'True':<15} {'Predicted':<15}")
print("="*75)
for desc, true, pred in zip(errors[:10], true_labels[:10], pred_labels[:10]):
    print(f"{desc:<40} {true:<15} {pred:<15}")

## 7. Deployment Considerations

**Production Readiness:**
1. ✓ Model versioning implemented
2. ✓ Metrics tracking for each version
3. ✓ Fallback to rule-based categorization
4. ✓ Confidence scores for active learning
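
The fallback and confidence mechanisms listed above are not shown in this notebook. The following is a minimal sketch of how they might fit together, assuming a fitted model that exposes `predict_proba` and `classes_` as scikit-learn classifiers do. The `RULES` table, the `categorize` helper, the 0.6 threshold, and the `StubModel` stand-in are all hypothetical.

```python
# Hypothetical keyword rules used when the model is not confident enough.
RULES = {"netflix": "Entertainment", "rent": "Housing", "uber": "Transport"}

def rule_based_category(description, default="Other"):
    """Return the category of the first matching keyword, else a default."""
    text = description.lower()
    for keyword, category in RULES.items():
        if keyword in text:
            return category
    return default

def categorize(model, description, threshold=0.6):
    """Use the model when confident; otherwise fall back to rules and flag for review."""
    probabilities = model.predict_proba([description])[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = float(probabilities[best])
    if confidence >= threshold:
        return {"category": model.classes_[best], "confidence": confidence,
                "source": "model", "needs_review": False}
    # Low-confidence predictions are candidates for manual labeling (active learning).
    return {"category": rule_based_category(description), "confidence": confidence,
            "source": "rules", "needs_review": True}

class StubModel:
    """Stand-in for a fitted pipeline such as best_model (illustration only)."""
    classes_ = ["Dining", "Housing"]

    def predict_proba(self, descriptions):
        # Confident only when the text mentions coffee.
        return [[0.9, 0.1] if "coffee" in d.lower() else [0.5, 0.5] for d in descriptions]

model = StubModel()
confident = categorize(model, "Starbucks Coffee Shop")
uncertain = categorize(model, "Rent Payment Apartment")
print(confident)
print(uncertain)
```

Substring rules like these are deliberately crude; the threshold would need to be chosen from the confidence distribution observed on held-out data rather than fixed in advance.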

**Future Improvements:**
- Implement model retraining pipeline (monthly)
- Add drift detection
- Upgrade to transformer models (BERT/DistilBERT)
- Add anomaly detection for fraud
- A/B testing framework

In [None]:
# Model summary
print("\n" + "="*70)
print("FINAL MODEL SUMMARY")
print("="*70)
print(f"Model Type:           Random Forest Classifier")
print(f"Feature Engineering:  TF-IDF, ngram_range={grid_search.best_params_['tfidf__ngram_range']}")
print(f"Training Data:        {len(X_train)} transactions")
print(f"Categories:           {len(categories)}")
print(f"Test Accuracy:        {test_accuracy:.4f}")
print(f"CV F1 Score:          {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print(f"Best Parameters:      {grid_search.best_params_}")
print("="*70)