<a href="https://colab.research.google.com/github/ChaoticSam/transaction-sms-parser/blob/v0.1/sms_parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 💬 SMS Transaction Parser - NLP Pipeline

This project builds a robust NLP pipeline to parse raw SMS alerts and extract structured financial information.

---

## 🎯 Objective

Classify SMS messages into categories and extract structured data fields:

- **Classification**: Identify whether the SMS describes a:
  - `debit` transaction
  - `credit` transaction
  - neither of the above (`misc`)

- **Entity Extraction**:
  - `amount`: Transaction amount (e.g., ₹5,000)
  - `balance`: Available balance if present (e.g., Avl Bal: ₹10,000)
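
Extracted amounts arrive as raw text such as `₹5,000` or `rs. 2269.00`. As a minimal sketch of turning such strings into numbers, the hypothetical helper `parse_amount` below (not part of the notebook) picks the first numeric run and drops thousands separators. It assumes the Indian/English format with `,` as the grouping character and `.` as the decimal point.

```python
import re
from typing import Optional

def parse_amount(raw: str) -> Optional[float]:
    """Return the first number found in raw as a float, or None if there is none.

    Hypothetical helper for illustration; assumes ',' groups digits and '.' is the decimal point.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

amount = parse_amount("₹5,000")
balance = parse_amount("Avl Bal: ₹10,000")
```

Because the pattern anchors on digits rather than on a fixed prefix, it behaves the same for `₹`, `Rs.`, `rs` and `inr` prefixes.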

---

## 🛠️ Project Workflow

### 1. Preprocessing
- Lemmatization using `spaCy`
- Removal of stopwords and punctuation
- Lowercasing text

### 2. Feature Engineering
- `TF-IDF` vectorization (unigrams + bigrams)
- Binary keyword features:
  - `has_fail`, `has_due`, `has_otp`, `has_credit_score`, `has_payment`, `has_credit_card`
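
The binary keyword features can be illustrated without pandas. The sketch below reuses the regular expressions from the notebook's training cell; the `keyword_flags` helper and the `KEYWORD_PATTERNS` name are hypothetical and exist only for this example.

```python
import re

# Patterns copied from the notebook's feature-engineering step.
KEYWORD_PATTERNS = {
    "has_fail": r"\b(?:fail|invalid|declined)\b",
    "has_due": r"\b(?:due|overdue|late|charge|emi)\b",
    "has_otp": r"\b(?:otp|verify|code|login)\b",
    "has_credit_score": r"\b(?:credit score|cibil|score)\b",
    "has_payment": r"\b(?:payment|paid|successfully processed|processed|debited)\b",
    "has_credit_card": r"\b(?:credit card)\b",
}

def keyword_flags(sms: str) -> dict:
    """Return a 0/1 flag per pattern (hypothetical helper, not in the notebook)."""
    return {
        name: int(re.search(pattern, sms, flags=re.IGNORECASE) is not None)
        for name, pattern in KEYWORD_PATTERNS.items()
    }

flags = keyword_flags(
    "Payment of Rs. 7094.00 for your SBI Credit Card has been successfully processed."
)
```

Note that the `\b` word boundaries make each keyword match only as a whole word, so `fail` does not fire on `failed`.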

### 3. Classification Model
- **Algorithm**: `RandomForestClassifier`
- **Input**: TF-IDF + binary features
- **Output**: `debit`, `credit`, or `misc`
- **Metrics**:
  - Accuracy
  - Precision, Recall, F1-score
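
To make the metrics concrete, here is a small worked sketch that computes accuracy and per-class precision, recall and F1 by hand. The label lists are made-up toy data, not results from this project, and `per_class_metrics` is a hypothetical helper; the notebook itself relies on scikit-learn's `classification_report`.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
y_true = ["debit", "debit", "credit", "misc", "credit", "misc"]
y_pred = ["debit", "credit", "credit", "misc", "credit", "debit"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
debit_scores = per_class_metrics(y_true, y_pred, "debit")
credit_scores = per_class_metrics(y_true, y_pred, "credit")
```

Per-class scores matter here because an SMS dataset is often dominated by one class, in which case accuracy alone can hide poor recall on the rarer labels.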

### 4. Named Entity Recognition (NER)
- Trained a custom `spaCy` model to detect:
  - `AMOUNT` — Transaction amount
  - `BALANCE` — Available balance
- Annotated ~150+ real and synthetic SMS messages
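
spaCy NER annotations are character offsets of the form `(start, end, label)`, and counting them by hand is error-prone. One possible way to derive them is sketched below; `find_span` is a hypothetical helper that is not part of the notebook, and it assumes the entity text appears verbatim in the message.

```python
def find_span(text: str, value: str, label: str, start_at: int = 0) -> tuple:
    """Return (start, end, label) for the first occurrence of value at or after start_at."""
    start = text.find(value, start_at)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    return (start, start + len(value), label)

text = "rs 750.00 received from sister. total bal rs 4,500.50"
amount = find_span(text, "750.00", "AMOUNT")
# Search after the amount so a repeated number is not matched twice.
balance = find_span(text, "4,500.50", "BALANCE", start_at=amount[1])
annotation = (text, {"entities": [amount, balance]})
```

The resulting tuple has the same shape as the entries in `TRAIN_DATA`, so it could be appended to that list directly.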

### 5. Inference Pipeline
- Load the actual SMS test dataset (`Data.xlsx`)
- Predict class (`debit`, `credit`, `misc`)
- Extract `amount` and `balance` using trained `spaCy` NER
- Save output to `.csv` or display in notebook

---

## 🧠 How to Run

1. Install dependencies:
```bash
pip install spacy pandas scikit-learn joblib
python -m spacy download en_core_web_sm
```
2. Train the classifier (if needed):
```python
train_pipeline("data/training_data.csv")
```
3. Call the prediction function:
```python
predict_sms("data/Data.csv")
```
4. Train spaCy NER (optional, already provided):
```bash
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

After training the spaCy model, a folder named `output` is created with `model-best` inside it. This model can be used directly for AMOUNT/BALANCE extraction.

Training the spaCy model is optional; a trained model is also provided in the repo.



In [None]:
import re
import os
import joblib
import pandas as pd
import spacy
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
# GLOBAL PARAMETERS
SPACY_MODEL = "output/model-best"
TFIDF_PATH = "/content/models/sms_tfidf.pkl"
MODEL_PATH = "/content/models/sms_rf_model.pkl"
LABEL_MAP_PATH = "/content/models/label_map.pkl"

In [None]:
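# SMS PREPROCESSING
_PREPROCESS_NLP = None  # spaCy pipeline, loaded once on first use

def preprocess_text(text: str) -> str:
    """Lemmatize and lowercase text, dropping stopwords and punctuation."""
    global _PREPROCESS_NLP
    if _PREPROCESS_NLP is None:
        # Loading a spaCy model is slow, so avoid reloading it for every SMS.
        _PREPROCESS_NLP = spacy.load('en_core_web_sm')
    doc = _PREPROCESS_NLP(text or "")
    tokens = [tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct]
    return " ".join(tokens)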

In [None]:
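# ENTITY EXTRACTION (AMOUNT AND BALANCE)
_NER_NLP = None  # custom NER model, loaded once on first use

def extract_fields(text: str) -> dict:
    """Extract amount and balance strings using the trained spaCy NER model."""
    global _NER_NLP
    if _NER_NLP is None:
        # Loading the model per call would repeat the disk read for every row.
        _NER_NLP = spacy.load(SPACY_MODEL)
    doc = _NER_NLP(text)
    amount = None
    balance = None

    for ent in doc.ents:
        # Strip currency markers and spaces from both ends, then drop thousands separators.
        # If a label occurs more than once, the last occurrence wins.
        if ent.label_ == "AMOUNT":
            amount = ent.text.strip("₹Rs. ").replace(",", "")
        elif ent.label_ == "BALANCE":
            balance = ent.text.strip("₹Rs. ").replace(",", "")

    return {"amount": amount, "balance": balance}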

In [None]:
# TRAINING PIPELINE FOR CLASSIFICATION

def train_pipeline(data_path: str):
    df = pd.read_excel(data_path) if data_path.lower().endswith(('xls', 'xlsx')) else pd.read_csv(data_path)
    df['clean'] = df['SMS'].apply(preprocess_text)

    # Adding features
    df["has_fail"] = df["SMS"].str.contains(r"\b(?:fail|invalid|declined)\b", case=False).astype(int)
    df["has_due"] = df["SMS"].str.contains(r"\b(?:due|overdue|late|charge|emi)\b", case=False).astype(int)
    df["has_otp"] = df["SMS"].str.contains(r"\b(?:otp|verify|code|login)\b", case=False).astype(int)
    df["has_credit_score"] = df["SMS"].str.contains(r"\b(?:credit score|cibil|score)\b", case=False).astype(int)
    df['has_payment'] = df['SMS'].str.contains(r'\b(?:payment|paid|successfully processed|processed|debited)\b',case=False,regex=True).astype(int)
    df['has_credit_card'] = df['SMS'].str.contains(r'\b(?:credit card)\b',case=False,regex=True).astype(int)

    # Vectorizing text
    tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=3000)
    X_text = tfidf.fit_transform(df['clean'])
    X_bin = df[["has_fail","has_due","has_otp","has_credit_score", "has_payment", "has_credit_card"]].values
    X = hstack([X_text, csr_matrix(X_bin)])

    # # Encoding labels
    label_map = {"debit":0, "credit":1, "misc":2}
    y = df["label"].map(label_map)

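    # Split first so the evaluation uses data the model has not seen
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

    # Training RandomForest on the training split
    model = RandomForestClassifier(
        n_estimators=200,
        max_depth=20,
        class_weight="balanced",
        random_state=42
    )
    model.fit(X_train, y_train)

    # Evaluate on the held-out split
    y_pred = model.predict(X_test)
    label_ids = list(label_map.values())
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, labels=label_ids, target_names=list(label_map.keys())))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred, labels=label_ids))

    # Refit on the full data so the saved model uses every labeled SMS
    model.fit(X, y)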

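    # Persist (create the target directories if they do not exist yet)
    for path in (TFIDF_PATH, MODEL_PATH, LABEL_MAP_PATH):
        os.makedirs(os.path.dirname(path), exist_ok=True)
    joblib.dump(tfidf, TFIDF_PATH)
    joblib.dump(model, MODEL_PATH)
    joblib.dump(label_map, LABEL_MAP_PATH)
    print("Saved TFIDF, model, and label map to disk.")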

In [None]:
# CLASSIFICATION LOGIC

def predict_sms(data_path: str) -> pd.DataFrame:
    df = pd.read_excel(data_path) if data_path.lower().endswith(('xls', 'xlsx')) else pd.read_csv(data_path)
    df['clean'] = df['SMS'].apply(preprocess_text)

    # Load artifacts
    tfidf = joblib.load(TFIDF_PATH)
    model = joblib.load(MODEL_PATH)
    label_map = joblib.load(LABEL_MAP_PATH)
    reverse_map = {v:k for k,v in label_map.items()}
    df["has_fail"] = df["SMS"].str.contains(r"\b(?:fail|invalid|declined)\b", case=False).astype(int)
    df["has_due"] = df["SMS"].str.contains(r"\b(?:due|overdue|late|charge|emi)\b", case=False).astype(int)
    df["has_otp"] = df["SMS"].str.contains(r"\b(?:otp|verify|code|login)\b", case=False).astype(int)
    df["has_credit_score"] = df["SMS"].str.contains(r"\b(?:credit score|cibil|score)\b", case=False).astype(int)
    df['has_payment'] = df['SMS'].str.contains(r'\b(?:payment|paid|successfully processed|processed|debited)\b',case=False,regex=True).astype(int)
    df['has_credit_card'] = df['SMS'].str.contains(r'\b(?:credit card)\b',case=False,regex=True).astype(int)

    # Features
    X_text = tfidf.transform(df['clean'])
    X_bin  = df[["has_fail","has_due","has_otp","has_credit_score", "has_payment", "has_credit_card"]].values
    X = hstack([X_text, csr_matrix(X_bin)])  # same layout as in train_pipeline

    # Classification
    pred_ids = model.predict(X)
    df["predicted_label"] = [reverse_map[i] for i in pred_ids]

    # Field extraction only if predicted_label is credit or debit
    def conditional_extract(row):
        if row['predicted_label'] in ['credit', 'debit']:
            # Wrap the dict in a Series so every row expands into the same two columns
            return pd.Series(extract_fields(row['SMS']))
        else:
            return pd.Series({'amount': None, 'balance': None})

    fields = df.apply(conditional_extract, axis=1)
    df = pd.concat([df, fields], axis=1)

    return df[['SMS', 'predicted_label', 'amount', 'balance']]

In [None]:
 !python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

In [None]:
# Training the spaCy model for entity extraction (this may take a while)

TRAIN_DATA = [
    (
        "rs 2500 debited from your sbi a/c xxxx1234 for txn at amazon.in on 03/07/25. avl bal rs 18,543.",
        {"entities": [(3, 7, "AMOUNT"), (88, 94, "BALANCE")]}
    ),
    (
        "your a/c 375416 is debited with rs 2.00 on 30-08-2019 14:39:02 at pur/www phonepe com/000124305400.avbl bal is rs 252.62.call 18605005555 for dispute",
        {"entities": [(35, 39, "AMOUNT"), (114, 120, "BALANCE")]}
    ),
    (
        "dear customer, your consumer durable loan 1011430007254554 has got bounce due to ecs mandate not processed. rbi seq no-815717",
        {"entities": []}
    ),
    (
        "Dear SBI Cardholder, payment of Rs. 7094.00 for your SBI Credit Card has been successfully processed. ref no : LSM36078572514.",
        {"entities": [(36, 43, "AMOUNT")]}
    ),
    (
        "Use SBI ATMs for better security, convenience & faster complaint resolution. As per RBI directive, more than 3 txns in metro & 5 in non-metro are chargeable",
        {"entities": []}
    ),
    (
        "hi there! you just made a successful payment of rs. 2269.00 at kreditbee for future reference, your order id is kb191006wifaw_nhr98. thanks, paytm team",
        {"entities": [(52, 59, "AMOUNT")]}
    ),
    (
        "Dear BOI customer, A/c XXXXXXXXXXX0612 charged Rs. 1 for non-maintenance of required AQB in preceding qtr. For details visit Branch/website.",
        {"entities": [(51, 52, "AMOUNT")]}
    ),
    (
        "hi lucky, rs.10.00 has been added to your freecharge wallet. updated balance is rs.10.98",
        {"entities": [(13, 18, "AMOUNT"), (83, 88, "BALANCE")]}
    ),
    (
        "inr 2 956 00 credited to your a c no xxxxxxx8956 on 02 01 19 through neft with utr barbd19002300597 by scm garments pvt ltd",
        {"entities": [(4, 11, "AMOUNT")]}
    ),
    (
        "refunded rs 48 to your wallet for your order on payu biz. updated bal rs 48",
        {"entities": [(12, 14, "AMOUNT"), (73, 75, "BALANCE")]}
    ),
    (
    "your idbi account number nn86222 credited with rs.74.05 on 30-08-2019 towards lpg/dbtl subsidy by goi.",
    {"entities": [(50, 55, "AMOUNT")]}
    ),
    (
        "rs 559 00 was spent on your sbi card ending 7049 at darshan biscuit mart on 10 01 19 available credit limit rs 36 485 39",
        {"entities": [(3,9, "AMOUNT"), (112, 120, "BALANCE")]}
    ),
    (
        "Your online request for Net Bkg Login and Transaction Pwds and Transaction Access with SPL Daily Txn limit of Rs 50,000.00 has been processed successfully.",
        {"entities": [(113, 123, "AMOUNT")]}
    ),
    (
        "your a c xxxxx538956 debited inr 2 135 00 on 02 01 19 transferred to investment intermedi a c balance inr 338 89",
        {"entities": [(33, 41, "AMOUNT"), (106, 112, "BALANCE")]}
    ),
    (
        "you paid rs 4282.08 via freecharge at razorpay software pvt ltd (orderid pay_dhpz2pntclhlgb). updated balance is rs 27.92. not you? call 18005727133",
        {"entities": [(12, 19, "AMOUNT"), (116, 122, "BALANCE")]}
    ),
    (
        "your idbi bk a/c. nnnnn79140 debited rs 10329.00 on 26 oct. details: upi/92990881836/. a/c bal rs 9576.69 as of 26 oct 09:08 hrs",
        {"entities": [(40, 48, "AMOUNT"), (98, 105, "BALANCE")]}
    ),
    (
    "E-statement dated 08/12/2017 has been sent. Total amt due of Rs 1984 or Min Amt of Rs 1984 payable by 28/12/2017. SMS ENRS to 5676791 -SBI Card",
    {"entities": [(64, 68, "AMOUNT")]}
    ),
    (
    "rs 1,200.00 deducted for late payment fee on cc xxxx5678. due date was 15/07/25. avl limit rs 25,000",
    {"entities": [(3, 12, "AMOUNT"), (87, 94, "BALANCE")]}
    ),
    (
    "hi user, rs.15.00 cashback credited for mobikwik recharge txn id MW901256783. wallet balance: 35.50",
    {"entities": [(10, 16, "AMOUNT"), ( 97, 102, "BALANCE")]}
    ),
    (
    "inr 5,000.00 transferred to a/c xx7890 via imps ref 9081726354 on 04-07-25 11:45 am. bal inr 12,345.67",
    {"entities": [(4, 12, "AMOUNT"), (93, 102, "BALANCE")]}
    ),
    (
    "alert: a/c xx1234 attempted overdraft of rs 7500 on 03-07-25. current balance -rs 500.75. txn declined",
    {"entities": [(44, 48, "AMOUNT"), (82, 88, "BALANCE")]} # Done
    ),
    (
    "your upi txn 9081273615 paid rs.325.50 to swiggy@icici. upi bal rs 1,200.00 remaining",
    {"entities": [(32, 38, "AMOUNT"), (67, 75, "BALANCE")]}
    ),
    (
    "rs 299.00 monthly charges for sbi savings a/c xx4321. deducted on 01-07-25. balance rs 5,432.10",
    {"entities": [(3, 10, "AMOUNT"), (87, 94, "BALANCE")]}
    ),

    (
    "congrats! rs 50 cashback for phonepe first txn. credit to bank a/c xx9087. bal rs 150.50",
    {"entities": [(12, 15, "AMOUNT"), (82, 88, "BALANCE")]}
    ),
    (
    "a/c no. nn4567 credited rs 25,000.00 (salary) on 05-07-25 ifsc sbin0000123. total bal rs 35,678.90",
    {"entities": [(27, 36, "AMOUNT"), (90, 99, "BALANCE")]}
    ),
    (
    "rs 2,150.50 electricity bill paid via auto debit. a/c xx5678 debited on 02-07-25. bal 8,765.43",
    {"entities": [(3, 11, "AMOUNT"), (86, 94, "BALANCE")]}
    ),
    (
    "upi collect request of rs 1,250 from merchant@upi expired. wallet balance unchanged rs 2,300",
    {"entities": [(26, 31, "AMOUNT"), (87, 92, "BALANCE")]}
    ),
    (
    "your fd maturing on 15-08-25 for a/c xx9012 amount rs 1,00,000.00 will be credited. current bal rs 25,432.10",
    {"entities": [(54, 65, "AMOUNT"), (99, 108, "BALANCE")]}
    ),
    (
    "rs 75 gst charged on sbi credit card txn at amazon. txn id sbi908172635. available limit rs 35,000",
    {"entities": [(3, 5, "AMOUNT"), (92, 98, "BALANCE")]}
    ),
    (
    "refund rs 1,499.00 for cancelled ola ride initiated. will credit in 3-5 days. wallet bal 1,850.50",
    {"entities": [(10, 18, "AMOUNT"), (89, 97, "BALANCE")]}
    ),
    (
    "your sbi quick alert: a/c xx7890 debited rs 450.00 for fastag recharge. time 04-07-25 18:30. bal rs 1,200.50",
    {"entities": [(44, 49, "AMOUNT"), (100, 108, "BALANCE")]}
    ),
    (
    "rs 3,500.00 loan emi auto debit from a/c xx1234. processed on 05-07-25. balance rs 7,890.12",
    {"entities": [(3, 11, "AMOUNT"), (83, 91, "BALANCE")]}
    ),
    (
    "inr 150.00 deducted as sms charges for qtr end mar'25. a/c xx5678. bal inr 12,345.67",
    {"entities": [(4, 10, "AMOUNT"), (75, 84, "BALANCE")]}
    ),
    (
    "rs 2,000.00 deposited via cash at sbi atm dx7890 on 04-07-25 16:45. total bal rs 22,500.50",
    {"entities": [(3, 12, "AMOUNT"), (81, 90, "BALANCE")]}
    ),
    (
    "hi customer, rs.25.00 cashback credited for gpay first txn. updated balance rs.125.00",
    {"entities": [(16, 21, "AMOUNT"), ( 80, 85, "BALANCE")]}
    ),
    (
    "a/c xx4321 credited rs 8,500.00 (refund) from amazon txn id amz908172635. bal rs 10,500.75",
    {"entities": [(23, 31, "AMOUNT"), (81, 90, "BALANCE")]}
    ),
    (
    "rs 750.00 received from sister. total bal rs 4,500.50",
    {"entities": [(3, 9, "AMOUNT"), (45, 53, "BALANCE")]}
    ),
    (
    "rs 3,700.00 maintenance paid. remaining balance rs 1,700.25",
    {"entities": [(3, 11, "AMOUNT"), ( 51, 59, "BALANCE")]}
    ),
    (
    "rs. 449.00 deducted for zee5. bal rs. 2.00",
    {"entities": [(4, 10, "AMOUNT"), (38, 42, "BALANCE")]}
    ),
    (
    "atm cash rs 5000. charges rs 25. net bal 12,975.50",
    {"entities": [(12, 16, "AMOUNT"), (29, 31, "AMOUNT"), (41, 50, "BALANCE")]}
    ),
    (
    "rs 2,650.00 donation paid. account balance rs 8,465.43",
    {"entities": [(3, 11, "AMOUNT"), ( 46, 54, "BALANCE")]}
    ),
    (
    "refund rs 1,950.00 for hotel booking. wallet bal rs 2,250.50",
    {"entities": [(10, 18, "AMOUNT"), ( 52, 60, "BALANCE")]}
    ),
    (
    "rs 1,150.00 charity deduction. available balance rs 3,140.12",
    {"entities": [(3, 11, "AMOUNT"), (52, 60, "BALANCE")]}
    ),
    (
    "a/c xx9012 credited rs 6,500.00 (reward). bal rs 19,678.90",
    {"entities": [(23, 31, "AMOUNT"), ( 49, 58, "BALANCE")]}
    ),
    (
    "rs 1,350.00 debited for landline bill. balance rs 656.78",
    {"entities": [(3, 11, "AMOUNT"), ( 51, 56, "BALANCE")]}
    ),
    (
    "rs. 95.00 service tax. current balance rs. 7,725.67",
    {"entities": [(4, 9, "AMOUNT"), (43, 51, "BALANCE")]}
    ),
    (
    "cashback rs 200 credited. wallet balance rs 725.00",
    {"entities": [(12, 15, "AMOUNT"), (44, 50, "BALANCE")]}
    ),
    (
    "rs 10,500.00 fd maturity. available funds rs 11,265.43",
    {"entities": [(3, 12, "AMOUNT"), ( 46, 54, "BALANCE")]}
    ),
    (
    "a/c xx3456 debited rs 3,000.00 for travel. bal rs 7,090.54",
    {"entities": [(22, 30, "AMOUNT"), ( 51, 58, "BALANCE")]}
    ),
    (
    "rs 1,799.00 spent at lifestyle. wallet balance rs 2,900.75",
    {"entities": [(3, 11, "AMOUNT"), ( 50, 58, "BALANCE")]}
    ),
    (
    "rs 850.00 received from friend. total bal rs 5,350.50",
    {"entities": [(3, 9, "AMOUNT"), ( 45, 53, "BALANCE")]}
    ),
    (
    "rs 3,800.00 society charge paid. remaining balance rs 3,900.25",
    {"entities": [(3, 12, "AMOUNT"), ( 54, 62, "BALANCE")]}
    ),
    (
    "rs. 549.00 deducted for voot. bal rs. 1.00",
    {"entities": [(4, 11, "AMOUNT"), ( 39, 42, "BALANCE")]}
    ),
    ]

from spacy.tokens import DocBin


def adjust_entity_spans(data):
    """Ensure entity spans align with token boundaries."""
    nlp = spacy.blank("en")
    adjusted_data = []

    for text, annot in data:
        doc = nlp.make_doc(text)
        new_entities = []

        for start, end, label in annot.get("entities", []):
            if start >= len(text) or end > len(text):
                continue

            span = doc.char_span(start, end, label=label)
            if span is None:
                # Try to align manually
                start_token = None
                end_token = None
                for token in doc:
                    if token.idx <= start < token.idx + len(token.text):
                        start_token = token
                    if token.idx < end <= token.idx + len(token.text):
                        end_token = token

                if start_token and end_token:
                    new_start = start_token.idx
                    new_end = end_token.idx + len(end_token.text)
                    new_entities.append((new_start, new_end, label))
            else:
                new_entities.append((start, end, label))

        adjusted_data.append((text, {"entities": new_entities}))

    return adjusted_data

# 2. Validate annotations

def validate_annotations(data):
    """Report misaligned entities."""
    nlp = spacy.blank("en")
    misaligned = 0
    for i, (text, annot) in enumerate(data):
        doc = nlp.make_doc(text)
        for start, end, label in annot["entities"]:
            if doc.char_span(start, end, label=label) is None:
                print(f"Misaligned: {text[start:end]} | Span: {start}-{end} | Label: {label}")
                misaligned += 1
    print(f"\nTotal misaligned: {misaligned}")

# 3. Convert to spaCy binary format

def convert_to_spacy(data, out_path):
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, annot in data:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
            if span:
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    doc_bin.to_disk(out_path)
    print(f"Saved {len(data)} samples to {out_path}")

# Run processing
TRAIN_DATA = adjust_entity_spans(TRAIN_DATA)
validate_annotations(TRAIN_DATA)
train_data, dev_data = train_test_split(TRAIN_DATA, test_size=0.2, random_state=42)
convert_to_spacy(train_data, "train.spacy")
convert_to_spacy(dev_data, "dev.spacy")

In [None]:
!python -m spacy train config.cfg \
    --output output \
    --paths.train ./train.spacy \
    --paths.dev ./dev.spacy \
    --gpu-id -1

In [None]:
# Train Classification model
train_pipeline('/content/data/training_data.csv')

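# Classify and save the final output
df = predict_sms('/content/data/Data.csv')
os.makedirs("/content/response", exist_ok=True)  # to_csv does not create missing folders
df.to_csv("/content/response/final_output.csv", index=False)
print("Final output saved to /content/response/final_output.csv")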
