<a href="https://colab.research.google.com/github/ChaoticSam/transaction-sms-parser/blob/v0.1/sms_parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 💬 SMS Transaction Parser - NLP Pipeline

This project builds an NLP pipeline that parses raw SMS alerts, classifies them, and extracts structured financial information.

---

## 🎯 Objective

Classify SMS messages into categories and extract structured data fields:

- **Classification**: Identify whether the SMS is a:
  - `debit` transaction
  - `credit` transaction
  - `misc` message (neither a debit nor a credit)

- **Entity Extraction**:
  - `amount`: Transaction amount (e.g., ₹5,000)
  - `balance`: Available balance if present (e.g., Avl Bal: ₹10,000)

---

## 🛠️ Project Workflow

### 1. Preprocessing
- Lemmatization using `spaCy`
- Removal of stopwords and punctuation
- Lowercasing text

### 2. Feature Engineering
- `TF-IDF` vectorization (unigrams + bigrams)
- Binary keyword features:
  - `has_fail`, `has_due`, `has_otp`, `has_credit_score`, `has_payment`, `has_credit_card`
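
As a rough illustration of the binary keyword features, the sketch below applies word-boundary patterns to a single message using only the standard library. `KEYWORD_PATTERNS` and `keyword_flags` are hypothetical names introduced here; only three of the six flags are shown, with patterns copied from the training code further down.

```python
import re

# Hypothetical names; the patterns mirror three of the flags used in train_pipeline.
KEYWORD_PATTERNS = {
    "has_fail": r"\b(?:fail|invalid|declined)\b",
    "has_otp": r"\b(?:otp|verify|code|login)\b",
    "has_credit_card": r"\b(?:credit card)\b",
}

def keyword_flags(sms: str) -> dict:
    """Return a 0/1 flag per pattern, matching case-insensitively."""
    return {
        name: int(re.search(pattern, sms, flags=re.IGNORECASE) is not None)
        for name, pattern in KEYWORD_PATTERNS.items()
    }

flags = keyword_flags("Your OTP for login is 123456. Do not share this code.")
print(flags)
```

Because of the `\b` word boundaries, an inflected form such as "failed" does not match the bare keyword `fail`; whether that is desirable depends on the data.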

### 3. Classification Model
- **Algorithm**: `RandomForestClassifier`
- **Input**: TF-IDF + binary features
- **Output**: `debit`, `credit`, or `misc`
- **Metrics**:
  - Accuracy
  - Precision, Recall, F1-score
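
To make the listed metrics concrete, the following sketch computes per-class precision, recall, and F1 from a confusion matrix in plain Python. The matrix values are made up for illustration and are not results from this project; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(len(labels))) - tp  # predicted k, but wrong
        fn = sum(cm[k]) - tp                                  # true k, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Illustrative numbers only (rows: true class, columns: predicted class).
example_cm = [
    [8, 1, 1],  # debit
    [2, 6, 2],  # credit
    [0, 1, 9],  # misc
]
labels = ["debit", "credit", "misc"]
results = per_class_metrics(example_cm, labels)
accuracy = sum(example_cm[i][i] for i in range(len(labels))) / sum(map(sum, example_cm))

for label, m in results.items():
    print(label, {name: round(value, 3) for name, value in m.items()})
print("accuracy", round(accuracy, 3))
```

In the notebook itself these numbers come from scikit-learn's `classification_report` and `confusion_matrix`; the sketch only shows what those reports summarize.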

### 4. Named Entity Recognition (NER)
- Trained a custom `spaCy` model to detect:
  - `AMOUNT` — Transaction amount
  - `BALANCE` — Available balance
- Annotated 150+ real and synthetic SMS messages
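
The annotation format is not shown in this notebook, so the following is only a sketch of one common approach: recording each entity as a `(start, end, label)` character span. The `annotate` helper and the sample message are hypothetical.

```python
def annotate(text: str, entities: dict) -> tuple:
    """Build (text, {"entities": [(start, end, label), ...]}) from literal substrings.

    `entities` maps a label such as "AMOUNT" to the exact substring to tag.
    Only the first occurrence of each substring is used.
    Raises ValueError if a substring is missing or two spans overlap.
    """
    spans = []
    for label, value in entities.items():
        start = text.find(value)
        if start == -1:
            raise ValueError(f"{value!r} not found in text")
        spans.append((start, start + len(value), label))
    spans.sort()
    for (_, prev_end, _), (next_start, _, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            raise ValueError("overlapping entity spans")
    return text, {"entities": spans}

sample = "Rs. 5,000 debited from A/c XX1234. Avl Bal: Rs. 10,000"
text, ann = annotate(sample, {"AMOUNT": "Rs. 5,000", "BALANCE": "Rs. 10,000"})
for start, end, label in ann["entities"]:
    print(label, repr(text[start:end]))
```

spaCy v3's `spacy train` command expects binary `.spacy` files, so spans like these would still need to be converted, for example with `Doc.char_span` and `DocBin`.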

### 5. Inference Pipeline
- Load the actual SMS test dataset (`Data.xlsx`)
- Predict class (`debit`, `credit`, `misc`)
- Extract `amount` and `balance` using trained `spaCy` NER
- Save output to `.csv` or display in notebook

---

## 🧠 How to Run

1. Install dependencies (`openpyxl` is needed by pandas to read `.xlsx` files):
```bash
pip install spacy pandas scikit-learn joblib openpyxl
python -m spacy download en_core_web_sm
```

2. Train the classifier (if needed):
```python
train_pipeline("data/training_data.csv")
```

3. Call the prediction function:
```python
predict_sms("data/Data.csv")
```


In [224]:
import re
import os
import joblib
import pandas as pd
import spacy
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [225]:
# GLOBAL PARAMETERS
SPACY_MODEL = "output/model-best"
TFIDF_PATH = "/content/models/sms_tfidf.pkl"
MODEL_PATH = "/content/models/sms_rf_model.pkl"
LABEL_MAP_PATH = "/content/models/label_map.pkl"

In [226]:
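# SMS PREPROCESSING
_PREPROCESS_NLP = None  # cached so the spaCy model is loaded only once

def preprocess_text(text: str) -> str:
    """Lowercase and lemmatize `text`, dropping stopwords and punctuation."""
    global _PREPROCESS_NLP
    if _PREPROCESS_NLP is None:
        _PREPROCESS_NLP = spacy.load('en_core_web_sm')
    doc = _PREPROCESS_NLP(text if isinstance(text, str) else "")
    tokens = [tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct]
    return " ".join(tokens)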

In [227]:
# ENTITY EXTRACT (AMOUNT AND BALANCE)
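_NER_NLP = None  # cached so the custom NER model is loaded only once

def extract_fields(text: str) -> dict:
    """Return the last AMOUNT and BALANCE entities in `text`, without currency prefix or commas."""
    global _NER_NLP
    if _NER_NLP is None:
        _NER_NLP = spacy.load(SPACY_MODEL)
    doc = _NER_NLP(text)
    amount = None
    balance = None

    for ent in doc.ents:
        # strip() removes any leading/trailing '₹', 'R', 's', '.', or space characters
        if ent.label_ == "AMOUNT":
            amount = ent.text.strip("₹Rs. ").replace(",", "")
        elif ent.label_ == "BALANCE":
            balance = ent.text.strip("₹Rs. ").replace(",", "")
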
    return {"amount": amount, "balance": balance}
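
`str.strip("₹Rs. ")` removes any of those characters from both ends of the entity text, which handles common prefixes but leaves other currency markers such as `INR` in place. A more explicit alternative, sketched here with a hypothetical `normalize_amount` helper, is to pull out the numeric part with a regular expression.

```python
import re
from typing import Optional

# Hypothetical helper: matches digits with optional comma separators and decimals.
_NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def normalize_amount(raw: str) -> Optional[str]:
    """Return the first number in `raw` without commas, or None if there is none."""
    match = _NUMBER_RE.search(raw or "")
    if match is None:
        return None
    return match.group(0).replace(",", "")

for raw in ["₹5,000", "Rs. 10,000.50", "INR 1,25,000", "no amount"]:
    print(repr(raw), "->", normalize_amount(raw))
```

The result stays a string, as in `extract_fields`; converting it to a numeric type is left to the caller.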

In [228]:
# TRAINING PIPELINE FOR CLASSIFICATION

def train_pipeline(data_path: str):
    df = pd.read_excel(data_path) if data_path.lower().endswith(('xls', 'xlsx')) else pd.read_csv(data_path)
    df['clean'] = df['SMS'].apply(preprocess_text)

    # Adding features
    df["has_fail"] = df["SMS"].str.contains(r"\b(?:fail|invalid|declined)\b", case=False).astype(int)
    df["has_due"] = df["SMS"].str.contains(r"\b(?:due|overdue|late|charge|emi)\b", case=False).astype(int)
    df["has_otp"] = df["SMS"].str.contains(r"\b(?:otp|verify|code|login)\b", case=False).astype(int)
    df["has_credit_score"] = df["SMS"].str.contains(r"\b(?:credit score|cibil|score)\b", case=False).astype(int)
    df['has_payment'] = df['SMS'].str.contains(r'\b(?:payment|paid|successfully processed|processed|debited)\b',case=False,regex=True).astype(int)
    df['has_credit_card'] = df['SMS'].str.contains(r'\b(?:credit card)\b',case=False,regex=True).astype(int)

    # Vectorizing text
    tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=3000)
    X_text = tfidf.fit_transform(df['clean'])
    X_bin = df[["has_fail","has_due","has_otp","has_credit_score", "has_payment", "has_credit_card"]].values
    X = hstack([X_text, csr_matrix(X_bin)])

    # Encoding labels
    label_map = {"debit":0, "credit":1, "misc":2}
    y = df["label"].map(label_map)

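    # Hold out 30% of the data before fitting so evaluation uses unseen messages
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

    # Train RandomForest on the training split only
    model = RandomForestClassifier(
        n_estimators=200,
        max_depth=20,
        class_weight="balanced",
        random_state=42
    )
    model.fit(X_train, y_train)

    # Evaluate on the held-out split
    class_ids = list(label_map.values())
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, labels=class_ids, target_names=list(label_map.keys())))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred, labels=class_ids))

    # Refit on the full dataset so the persisted model uses every labelled example
    model.fit(X, y)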

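    # Persist (create the target directories first if they do not exist)
    for path in (TFIDF_PATH, MODEL_PATH, LABEL_MAP_PATH):
        os.makedirs(os.path.dirname(path), exist_ok=True)
    joblib.dump(tfidf, TFIDF_PATH)
    joblib.dump(model, MODEL_PATH)
    joblib.dump(label_map, LABEL_MAP_PATH)
    print("Saved TFIDF, model, and label map to disk.")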

In [229]:
# CLASSIFICATION LOGIC

def predict_sms(data_path: str) -> pd.DataFrame:
    df = pd.read_excel(data_path) if data_path.lower().endswith(('xls', 'xlsx')) else pd.read_csv(data_path)
    df['clean'] = df['SMS'].apply(preprocess_text)

    # Load artifacts
    tfidf = joblib.load(TFIDF_PATH)
    model = joblib.load(MODEL_PATH)
    label_map = joblib.load(LABEL_MAP_PATH)
    reverse_map = {v:k for k,v in label_map.items()}
    df["has_fail"] = df["SMS"].str.contains(r"\b(?:fail|invalid|declined)\b", case=False).astype(int)
    df["has_due"] = df["SMS"].str.contains(r"\b(?:due|overdue|late|charge|emi)\b", case=False).astype(int)
    df["has_otp"] = df["SMS"].str.contains(r"\b(?:otp|verify|code|login)\b", case=False).astype(int)
    df["has_credit_score"] = df["SMS"].str.contains(r"\b(?:credit score|cibil|score)\b", case=False).astype(int)
    df['has_payment'] = df['SMS'].str.contains(r'\b(?:payment|paid|successfully processed|processed|debited)\b',case=False,regex=True).astype(int)
    df['has_credit_card'] = df['SMS'].str.contains(r'\b(?:credit card)\b',case=False,regex=True).astype(int)

    # Features
    X_text = tfidf.transform(df['clean'])
    X_bin  = df[["has_fail","has_due","has_otp","has_credit_score", "has_payment", "has_credit_card"]].values
    X = hstack([X_text, csr_matrix(X_bin)])  # same sparse layout as in training

    # Classification
    pred_ids = model.predict(X)
    df["predicted_label"] = [reverse_map[i] for i in pred_ids]

    # Field extraction only if predicted_label is credit or debit
    def conditional_extract(row):
        if row['predicted_label'] in ['credit', 'debit']:
            # Wrap the dict in a Series so every row expands into the same two columns
            return pd.Series(extract_fields(row['SMS']))
        else:
            return pd.Series({'amount': None, 'balance': None})

    fields = df.apply(conditional_extract, axis=1)
    df = pd.concat([df, fields], axis=1)

    return df[['SMS', 'predicted_label', 'amount', 'balance']]

In [230]:
# Train Classification model
train_pipeline('/content/data/training_data.csv')

# Classify and save the final output
df = predict_sms('/content/data/Data.csv')
os.makedirs("/content/response", exist_ok=True)  # make sure the output folder exists
df.to_csv("/content/response/final_output.csv", index=False)
print("Final output saved to response/final_output.csv")


✅ Final output saved to final_output.csv
