## Bank Transaction Categorization Model Report

### 1. Introduction

Financial technology applications increasingly rely on automatic categorization of financial transactions to support user budgeting, detect fraud, and provide meaningful insights. This report presents a machine learning pipeline to categorize bank transactions based on textual descriptions, transaction metadata, and user profile attributes. The task combines structured and unstructured data sources, representing a classic multi-modal classification problem.

We aim to predict the transaction category given the transaction description (text), amount (numeric), and user profile information (categorical/boolean). The solution must be accurate, interpretable, and scalable.

---

### 2. Problem Definition and Algorithm

#### 2.1 Task Definition

**Input:**

* Transaction record consisting of:

  * `description` (free text)
  * `amount` (float)
  * `txn_date` (timestamp)
  * `client_id` and related profile fields (boolean/categorical)

**Output:**

* A predicted transaction category from a fixed set (e.g., Groceries, Bills, Entertainment).

**Challenge:**

* Integrating unstructured (text) and structured (numerical, categorical) features
* Handling class imbalance
* Ensuring interpretability for financial use cases

#### 2.2 Algorithm Definition

We use a pipeline-based supervised classifier. The selected model is a **Random Forest Classifier**, chosen for its robustness to noise, its ability to capture non-linear feature interactions, and its ability to work with mixed feature types once they are numerically encoded.

**Pseudocode Overview:**

```text
1. Clean and preprocess the 'description' column:
   - Lowercase, remove punctuation and stopwords
   - Vectorize using TF-IDF (unigrams + bigrams)

2. Process structured fields:
   - Standardize 'amount'
   - Extract date features (month, day_of_week, is_weekend)
   - One-hot encode user profile booleans

3. Concatenate all features into a single feature matrix

4. Train a Random Forest Classifier (n_estimators=50)
   - Use 5-fold stratified cross-validation

5. Evaluate using accuracy, macro F1-score, and the confusion matrix
```
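The steps above can be sketched with scikit-learn's `ColumnTransformer` and `Pipeline`. This is a minimal sketch rather than the project's actual code: the helper names `add_date_features` and `build_pipeline`, the built-in English stopword list, and `random_state=0` are assumptions made for illustration, and the profile column names are passed in by the caller.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: derive month, day_of_week and is_weekend from txn_date."""
    dates = pd.to_datetime(df["txn_date"])
    return df.assign(
        month=dates.dt.month,
        day_of_week=dates.dt.dayofweek,  # Monday=0 ... Sunday=6
        is_weekend=(dates.dt.dayofweek >= 5).astype(int),
    )


def build_pipeline(profile_cols):
    """Steps 1-4: per-column preprocessing, concatenation, Random Forest."""
    features = ColumnTransformer(
        transformers=[
            # Step 1: TfidfVectorizer lowercases by default and its default
            # token pattern ignores punctuation. The English stopword list
            # is an assumption; the report used a manual list.
            ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), "description"),
            # Step 2: standardize the amount.
            ("amount", StandardScaler(), ["amount"]),
            # Step 2: one-hot encode date parts and profile flags.
            ("cat", OneHotEncoder(handle_unknown="ignore"),
             ["month", "day_of_week", "is_weekend", *profile_cols]),
        ]
    )
    # Step 3 happens inside ColumnTransformer, which stacks the blocks side by side.
    return Pipeline([
        ("features", features),
        ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])
```

Wrapping the preprocessing and the classifier in one `Pipeline` means the vectorizer and scaler are refit on each training fold during cross-validation, which avoids leaking statistics from the validation fold.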

**Example:**

* Transaction: "Netflix Payment"
* TF-IDF: Token "netflix" gets high score
* Amount: 13.99 → typical subscription
* Category → Predicted: "Entertainment"

---

### 3. Data Analysis and Preprocessing

#### 3.1 Dataset Overview

* Two datasets:

  * `bank_transaction.csv` with transaction logs
  * `user_profile.csv` with user-level behavior flags
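Combining the two files requires a join. The sketch below assumes that `client_id` appears in both files and that each client has exactly one profile row; neither is confirmed by the report. Small in-memory frames stand in for the CSV files, and the helper name `join_profiles` is hypothetical.

```python
import pandas as pd


def join_profiles(transactions: pd.DataFrame, profiles: pd.DataFrame) -> pd.DataFrame:
    """Attach user-level flags to each transaction via a left join on client_id."""
    return transactions.merge(
        profiles,
        on="client_id",
        how="left",              # keep transactions even if the profile is missing
        validate="many_to_one",  # raise if a client has duplicate profile rows
    )


# Tiny stand-ins for bank_transaction.csv and user_profile.csv (invented values).
transactions = pd.DataFrame({
    "client_id": [1, 1, 2],
    "description": ["Netflix Payment", "ATM Withdrawal", "Uber Trip"],
    "amount": [13.99, 100.0, 8.50],
})
profiles = pd.DataFrame({"client_id": [1, 2], "is_student": [True, False]})

merged = join_profiles(transactions, profiles)
```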

#### 3.2 Preprocessing Steps

**Text (Description):**

* Cleaned with regular expressions (lowercasing, punctuation removal)
* Stopwords removed using a manually defined list
* Vectorized using `TfidfVectorizer` with `ngram_range=(1,2)` so that bigrams capture short phrases
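A minimal sketch of the text cleaning step is shown below. The stopword list and the exact regular expression are assumptions, since the report does not list them, and `clean_description` is a hypothetical helper name.

```python
import re

# Hypothetical hand-picked stopword list; the project's actual list is not given.
STOPWORDS = {"the", "a", "an", "to", "of", "for", "at", "and"}


def clean_description(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_description("Payment to NETFLIX.COM #1234"))
# payment netflix com 1234
```

One way to plug this in is to pass the function as the `preprocessor` argument of `TfidfVectorizer`, which replaces the vectorizer's default lowercasing step while leaving tokenization and n-gram generation unchanged.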

**Numerical:**

* `amount` standardized (zero mean, unit variance)
* Missing values handled by dropping rows with too many empty fields
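The numerical preprocessing could look like the sketch below. The 50% fill threshold and the helper name `drop_sparse_rows` are assumptions, as the report does not say how sparse a row had to be before it was dropped; the data frame is invented for illustration.

```python
import math

import pandas as pd
from sklearn.preprocessing import StandardScaler


def drop_sparse_rows(df: pd.DataFrame, min_filled: float = 0.5) -> pd.DataFrame:
    """Keep rows in which at least `min_filled` of the columns are populated."""
    min_count = math.ceil(min_filled * df.shape[1])
    return df.dropna(thresh=min_count)


raw = pd.DataFrame({
    "description": ["Netflix Payment", None, "Uber Trip", "ATM Withdrawal"],
    "amount": [10.0, None, 20.0, 30.0],
    "is_student": [1, 1, 0, 0],
})
clean = drop_sparse_rows(raw)  # the second row has only 1 of 3 fields filled

# Standardize: subtract the mean and divide by the (population) standard deviation.
scaler = StandardScaler()
clean = clean.assign(amount_scaled=scaler.fit_transform(clean[["amount"]]).ravel())
```

In a real training run the scaler should be fit on the training split only and then applied to validation and test data.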

**Date:**

* `txn_date` used to derive `month`, `day_of_week`, and `is_weekend`

**Categorical/Boolean:**

* Boolean flags (e.g., `is_student`, `has_joint_account`) encoded with `OneHotEncoder`

#### 3.3 Feature Importance

* Top TF-IDF features: "uber", "netflix", "atm"
* Key structured features: `amount`, `is_student`, `day_of_week`

> Feature importances were read from the Random Forest's built-in `feature_importances_` attribute and then visualized.
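A short sketch of ranking features by importance follows. The data is synthetic and the feature names (`signal`, `noise_a`, `noise_b`) are placeholders, so the ranking says nothing about the project's real features; `top_features` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_features(model, feature_names, k=5):
    """Return the k (name, importance) pairs with the largest importances."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:k]
    return [(feature_names[i], float(importances[i])) for i in order]


# Synthetic data: only the first column determines the label.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = top_features(model, ["signal", "noise_a", "noise_b"], k=3)
```

When the model sits behind a `ColumnTransformer`, the matching names can be obtained from its `get_feature_names_out()` method. Impurity-based importances tend to favor features with many distinct values, so permutation importance is a common cross-check.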

---

### 4. Model Selection and Justification

#### 4.1 Alternatives Considered

| Model               | Pros                                    | Cons                                           |
| ------------------- | --------------------------------------- | ---------------------------------------------- |
| Logistic Regression | Simple, interpretable                   | Linear model, weak on non-linear interactions  |
| Random Forest       | Robust to noise, handles mixed features | Slower training                                |
| XGBoost             | High accuracy, efficient                | Requires parameter tuning                      |

#### 4.2 Why Random Forest?

* Handles high-dimensional sparse vectors (TF-IDF)
* No strict assumptions on feature distribution
* Supports ranking of features for explainability
* Strong baseline before moving to deep models

---

### 5. Evaluation and Results

#### 5.1 Metrics

* **Accuracy:** approximately 78%
* **Macro F1-score:** broadly similar across categories, apart from the weaker classes noted in Section 5.3
* **Confusion matrix:** showed confusion between descriptions with overlapping terms (e.g., "Grab" vs. "Uber")
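The three metrics can be computed with `sklearn.metrics` as sketched below. The labels and predictions are a six-row toy example invented for illustration; they are not the project's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["Entertainment", "Food", "Transportation"]

# Toy ground truth and predictions (one Transportation item mislabeled as Food).
y_true = ["Entertainment", "Food", "Transportation", "Transportation", "Food", "Entertainment"]
y_pred = ["Entertainment", "Food", "Food", "Transportation", "Food", "Entertainment"]

accuracy = accuracy_score(y_true, y_pred)
# Macro F1 averages per-class F1 scores, so small classes count as much as large ones.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
print(cm)
```

Off-diagonal cells of the confusion matrix show which category pairs are mixed up, which is how overlaps such as ride-hailing versus food-delivery descriptions become visible.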

#### 5.2 Cross-Validation

* 5-fold stratified cross-validation was used
* Performance was consistent across folds (low variance)
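Stratified cross-validation keeps the class proportions the same in every fold, which matters when categories are imbalanced. The sketch below uses synthetic data with an 80/20 class split; the scoring choice (`f1_macro`), shuffling, and seeds are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, deliberately imbalanced data: 80 samples of class 0, 20 of class 1.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([0] * 80 + [1] * 20)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="f1_macro",
)

# Each validation fold holds the same number of minority-class samples.
fold_minority_counts = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]

print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

The standard deviation of the fold scores is one way to quantify the consistency across folds.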

#### 5.3 Class-Wise Insights

* Strong performance: Entertainment, Food, Transportation
* Weaker performance: Miscellaneous or ambiguous text

---

### 6. Future Development Plans

#### 6.1 Short-Term (1 Month)

* Use `RandomizedSearchCV` for hyperparameter tuning
* Add `merchant_name` field if available
* Implement `SMOTE` for class balancing
* Reduce the TF-IDF memory footprint by keeping features in sparse matrices end to end (e.g., avoiding dense conversion when concatenating feature blocks)
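As an illustration of the planned hyperparameter search, the sketch below runs `RandomizedSearchCV` over a small Random Forest search space. The parameter ranges, `n_iter`, fold count, and scoring metric are assumptions chosen to keep the example fast, and the data is synthetic.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; sensible ranges depend on the real dataset.
param_distributions = {
    "n_estimators": randint(20, 100),   # sampled from [20, 100)
    "max_depth": [None, 10, 20],
    "min_samples_leaf": randint(1, 5),  # sampled from [1, 5)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled configurations
    cv=3,
    scoring="f1_macro",
    random_state=0,
)

# Synthetic data so the example is self-contained.
rng = np.random.default_rng(0)
X = rng.random((90, 4))
y = (X[:, 0] > 0.5).astype(int)

search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Random search samples a fixed number of configurations, so its cost is controlled by `n_iter` rather than by the size of the grid.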

#### 6.2 Long-Term (3 Months)

* Replace TF-IDF with pretrained embeddings (e.g., FastText or BERT)
* Integrate XGBoost or LightGBM with early stopping
* Deploy model as a REST API for real-time inference
* Implement user-specific profile modeling (e.g., personal merchant history)
* Use SHAP for interpretability dashboards

---

### 7. Conclusion

This project built a pipeline that categorizes bank transactions using both unstructured and structured data. Random Forest was chosen for its strong baseline performance and its built-in feature ranking, which supports interpretability. By combining TF-IDF text features with user profile attributes, the model achieved approximately 78% accuracy. Future improvements will focus on feature engineering, pretrained text embeddings, and real-time deployment.
