# 📧 Email Intent Classification System

## Objective
The goal of this project is to automatically classify incoming customer emails into meaningful intent categories and route them either to an automated chatbot or a human customer support agent.

The system performs:
- Email preprocessing and normalization
- Order ID and URL extraction
- Spam / phishing detection (rule-based + semantic)
- Intent classification using multilingual embeddings
- Confidence-based routing to human agents


In [1]:
import re

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

data_path = "data/emails-data.csv"
df = pd.read_csv(data_path)
df

Unnamed: 0,id,subject,email_text,language
0,1,TÝŽDEŇ ČAKÁM!!!,UŽ TÝŽDEŇ ČAKÁM NA OBJEDNÁVKU 68247! ČO SA DEJ...,sk
1,2,Gluténová diéta,"Prosím vás, ktoré z vašich proteínov sú bezglu...",sk
2,3,Gainér a vitamíny,"Ahojte, minulý týždeň som objednal gainér a vi...",sk
3,4,Duplicitná objednávka,"Ahoj, vytvoril som 2 objednávky omylom, 66481 ...",sk
4,5,Zmena mena príjemcu,"Ahoj, obj 52936, dá sa zmeniť meno príjemcu? J...",sk
...,...,...,...,...
269,270,Exspirovaná tyčinka!!!,"Objednávka č. 92157, proteínová tyčinka je po ...",sk
270,271,Doručujete cez víkend?,"Ahoj, doručujete zásielky aj v sobotu a nedeľu?",sk
271,272,Pridať poznámku pre kuriéra,"Dobrý večer, objednávka 89174, môžem pridať po...",sk
272,273,Thor Fuel + Vitargo dávkovanie,"Ahoj, ako sa dávkuje Thor Fuel + Vitargo? Koľk...",sk


# Text Preprocessing

Before classification, we:

- Normalize whitespace
- Combine subject and body into a single field
- Extract order IDs using regex
- Extract URLs
- Prepare text for embedding model

This ensures consistent input for downstream models.


In [2]:
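def normalize_text(s: str) -> str:
    # Treat None and NaN (pandas' missing-value marker) as empty text;
    # str(NaN) would otherwise produce the literal string "nan".
    s = "" if pd.isna(s) else str(s)
    s = s.strip()
    # Collapse runs of whitespace (including newlines) into single spaces.
    s = re.sub(r"\s+", " ", s)
    return s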

df["subject_norm"] = df["subject"].apply(normalize_text)
df["email_text_norm"] = df["email_text"].apply(normalize_text)

df["text_for_model"] = (df["subject_norm"] + " " + df["email_text_norm"]).str.strip()
df["text_for_model"]

0      TÝŽDEŇ ČAKÁM!!! UŽ TÝŽDEŇ ČAKÁM NA OBJEDNÁVKU ...
1      Gluténová diéta Prosím vás, ktoré z vašich pro...
2      Gainér a vitamíny Ahojte, minulý týždeň som ob...
3      Duplicitná objednávka Ahoj, vytvoril som 2 obj...
4      Zmena mena príjemcu Ahoj, obj 52936, dá sa zme...
                             ...                        
269    Exspirovaná tyčinka!!! Objednávka č. 92157, pr...
270    Doručujete cez víkend? Ahoj, doručujete zásiel...
271    Pridať poznámku pre kuriéra Dobrý večer, objed...
272    Thor Fuel + Vitargo dávkovanie Ahoj, ako sa dá...
273    Sponzorstvo športovca Dobrý večer, som profesi...
Name: text_for_model, Length: 274, dtype: str

In [3]:
ORDER_ID_RE = re.compile(r"\b\d{4,}\b")
URL_RE = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_order_ids(text: str):
    return ORDER_ID_RE.findall(text)

def extract_first_order_id(text: str):
    ids = extract_order_ids(text)
    return ids[0] if ids else None

def extract_urls(text: str):
    return URL_RE.findall(text)

def extract_emails(text: str):
    return EMAIL_RE.findall(text)

df["order_id"] = df["text_for_model"].apply(extract_first_order_id)
# `",".join([])` is the empty string, so `or None` maps "no matches" to None
# without running each extractor twice.
df["order_ids_all"] = df["text_for_model"].apply(lambda t: ",".join(extract_order_ids(t)) or None)
df["urls"] = df["text_for_model"].apply(lambda t: ",".join(extract_urls(t)) or None)
df["emails"] = df["text_for_model"].apply(lambda t: ",".join(extract_emails(t)) or None)
df["emails"]


0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ... 
269    NaN
270    NaN
271    NaN
272    NaN
273    NaN
Name: emails, Length: 274, dtype: str
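
The extraction patterns above can be exercised on a single made-up message. The sketch below is illustrative only: the sample text is invented, and the de-duplication step is an assumption added here, not part of the notebook. It shows that `\b\d{4,}\b` captures any run of four or more digits, so a year or a number inside a URL is also returned as an "order ID".

```python
import re

# Same patterns as in the notebook cell above.
ORDER_ID_RE = re.compile(r"\b\d{4,}\b")
URL_RE = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)

# Made-up message, for illustration only.
sample = "Order 68247 from 2024 is late, see https://example.com/track?id=68247 please"

order_ids = ORDER_ID_RE.findall(sample)
urls = URL_RE.findall(sample)

# Hypothetical extra step: drop duplicates while keeping first-seen order.
unique_ids = list(dict.fromkeys(order_ids))

print(order_ids)   # includes the year and the number inside the URL
print(urls)
print(unique_ids)
```

Because the notebook takes the first match as `order_id`, a message that mentions a year or phone number before the order number would yield the wrong value; a stricter pattern or a length check may be worth considering.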

# Spam & Phishing Detection

Spam detection is implemented as a hybrid system:

### 1. Rule-Based Detection
- Keyword matching
- High share of capital letters
- Three or more exclamation marks
- Presence of URLs or email addresses

### 2. Semantic Detection
We use multilingual sentence embeddings (E5 model) to compare emails with predefined spam prototypes.

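A message is marked as spam if:
- It is semantically closer to the spam prototypes than to the business intent prototypes
- OR it matches strong rule-based signals

Combining the two signals is meant to catch spam that either check alone would miss. To limit false positives, the semantic check in the code below also requires a minimum similarity to a spam prototype.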
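
The decision rule described above can be sketched as a small pure function. This is a minimal illustration, not the notebook's implementation: the function name `is_spam` and its parameter names are assumptions, while the default thresholds (rule score of 3, similarity floor of 0.82, margin of 0.05) are the values used in the later cells of this notebook.

```python
def is_spam(rule_score: int, spam_sim: float, intent_sim: float,
            rule_threshold: int = 3, sim_floor: float = 0.82,
            sim_margin: float = 0.05) -> bool:
    """Hybrid spam decision: semantic signal OR rule-based signal.

    rule_score -- number of rule-based points collected for the message
    spam_sim   -- highest cosine similarity to any spam prototype
    intent_sim -- highest cosine similarity to any business intent prototype
    """
    # Semantic hit: clearly closer to spam than to any intent, and close in absolute terms.
    semantic_hit = spam_sim > intent_sim + sim_margin and spam_sim > sim_floor
    # Rule hit: enough keyword / URL / punctuation signals.
    rule_hit = rule_score >= rule_threshold
    return semantic_hit or rule_hit


print(is_spam(rule_score=0, spam_sim=0.90, intent_sim=0.80))  # semantic signal only
print(is_spam(rule_score=3, spam_sim=0.10, intent_sim=0.90))  # rule signal only
print(is_spam(rule_score=2, spam_sim=0.80, intent_sim=0.70))  # neither is strong enough
```

Because the two signals are combined with OR, lowering either threshold increases the number of blocked messages; the similarity floor is what keeps short, generic emails from being flagged on the margin alone.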


In [4]:
SPAM_PROTOTYPES = [
    "váš účet", "ucet", "zablokovan", "zablokovaný",
    "overiť identitu", "overit identitu",
    "prihlasovacie údaje", "prihlasovacie udaje",
    "zadať", "zadat",
    "kliknutím", "kliknutim",
    "kliknite tu", "upozornenie",
    "heslo", "potvrdiť účet", "overenie účtu",
    "24 hodín", "okamžitá akcia",
    "lacné", "exkluzívna ponuka", "investuj", "Bitcoin mining",
    "Účet bude zablokovaný, overenie identity, kliknite na link, zadajte prihlasovacie údaje.",
    "Vyhrali ste lotériu, kliknite tu a potvrďte účet.",
    "Exkluzívna ponuka, investuj do Bitcoinu teraz.",
    "Okamžitá akcia, potvrdiť účet do 24 hodín.",


    "your account", "account suspended", "account blocked",
    "verify your account", "verify identity",
    "click here", "confirm your details",
    "login immediately", "urgent action required",
    "security alert", "password reset",
    "update your payment", "payment failed",
    "limited time offer", "exclusive access",
    "invest now", "become millionaire",
    "earn money fast", "risk free investment",
    "guaranteed profit", "crypto investment",
    "wire transfer", "bank verification",
    "tax refund", "claim your prize",
    "free gift", "winner", "lottery",
    "act now", "immediate response required",
    "suspicious activity detected",
    "Your account will be blocked, verify your identity immediately.",
    "Click here to confirm your login details.",
    "Limited time offer, guaranteed profit investment.",
    "Earn money fast, risk free crypto investment.",
    "Security alert, update your payment information now."
]


def spam_score(text: str):
    t = text.lower()
    score = 0
    hits = []
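    for kw in SPAM_PROTOTYPES:
        # Compare in lowercase on both sides; `t` is already lowercased, so
        # capitalized entries such as "Bitcoin mining" could otherwise never match.
        if kw.lower() in t:
            score += 1
            hits.append(kw)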
    urls = extract_urls(text)
    emails = extract_emails(text)
    if urls:
        score += 2
        hits.append("has_url")
    if emails:
        score += 1
        hits.append("has_emails")
    if text.count("!") >= 3:
        score += 1
        hits.append("many_exclamations")
    if sum(1 for c in text if c.isupper()) / max(1, len(text)) > 0.2 and len(text) > 40:
        score += 1
        hits.append("lots_of_caps")
    return score, ",".join(sorted(set(hits))) if hits else None

scores = df["text_for_model"].apply(spam_score)
df["spam_score"] = scores.apply(lambda x: x[0])
df["spam_signals"] = scores.apply(lambda x: x[1])
df["spam_flag"] = df["spam_score"] >= 3


df["final_category"] = None

df.loc[df["spam_flag"], "final_category"] = "Spam / Phishing"

df["final_category"]

0      None
1      None
2      None
3      None
4      None
       ... 
269    None
270    None
271    None
272    None
273    None
Name: final_category, Length: 274, dtype: object

In [5]:
df["final_category"].head()

0    None
1    None
2    None
3    None
4    None
Name: final_category, dtype: object

# Intent Classification (Embedding + Prototype Matching)

We use the multilingual embedding model:

**Model:** intfloat/multilingual-e5-base

Each email is encoded as a vector representation.

We define prototype texts for each business intent category:
- Order Status
- Order Cancel
- Order Modify
- Return / Complaint
- Product Question
- Store / Delivery / Availability
- Cooperation / Partnership

Classification is performed by computing cosine similarity between:
- Email embedding
- Prototype embeddings

The category with the highest similarity score is assigned.
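
The matching step can be illustrated with toy 2-D vectors standing in for real embeddings. This is a sketch under stated assumptions: the vectors and the helper `l2_normalize` are invented for illustration, and the category-level margin at the end is an alternative shown for comparison, not what the notebook computes. It shows that when two prototypes of the same category are both close to an email, the gap between the best and second-best prototype is tiny even though the category is unambiguous.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Toy data (hypothetical): two prototypes for one category, one for another.
proto_labels = ["Order Status", "Order Status", "Product Question"]
protos = l2_normalize(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
emails = l2_normalize(np.array([[0.95, 0.05], [0.1, 0.9]]))

# Cosine similarity of every email against every prototype.
sims = emails @ protos.T

# Nearest-prototype classification.
best_idx = sims.argmax(axis=1)
predicted = [proto_labels[i] for i in best_idx]

# Prototype-level margin: best prototype minus second-best prototype.
proto_sorted = np.sort(sims, axis=1)
proto_margin = proto_sorted[:, -1] - proto_sorted[:, -2]

# Alternative (assumption): take the best score per category, then the margin
# between the best and second-best category.
categories = sorted(set(proto_labels))
label_arr = np.array(proto_labels)
cat_scores = np.stack(
    [sims[:, label_arr == c].max(axis=1) for c in categories], axis=1
)
cat_sorted = np.sort(cat_scores, axis=1)
category_margin = cat_sorted[:, -1] - cat_sorted[:, -2]

print(predicted)
print(proto_margin)
print(category_margin)
```

For the first toy email, the prototype-level margin is close to zero because both "Order Status" prototypes match well, while the category-level margin is large. Which margin to threshold on is a design choice that affects how many emails are routed to a human.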


In [6]:
# Categories + prototypes
CATEGORIES = [
    "Order Status",
    "Order Cancel",
    "Order Modify",
    "Return / Complaint",
    "Product Question",
    "Store / Delivery / Availability",
    "Cooperation / Partnership"
]

PROTOTYPES = {
    "Order Status": [
        "Where is my order? delivery status, tracking number, order delayed, waiting for package, no update.",
        "I am still waiting for my order, it has not arrived yet, package not delivered, when will it arrive?",
        "Objednávka mešká, kde je moja objednávka, stav objednávky, doručenie, kuriér, tracking číslo.",
        "Čakám už týždeň, stále čakám, už týždeň nič, balík neprišiel, neprišlo mi nič.",
        "Objednal som a stále nič, objednávka stále nedorazila, kedy príde balík?",
        "Potrebujem urgentne informáciu o stave objednávky, balík stále neprišiel, čakám príliš dlho."
    ],

    # "Order Cancel": [
    #     "Customer wants to cancel an existing order.",
    #     "Request to cancel order before shipment.",
    #     "Customer wants to fully cancel the order, not modify it.",
    #
    #     "Cancel my order, I ordered the wrong product.",
    #     "I want to cancel my order before shipping.",
    #     "Duplicate order created by mistake, please cancel one.",
    #     "Please cancel my order immediately.",
    #     "Order was a mistake, I need to cancel it.",
    #     "I changed my mind, please cancel the order.",
    #     "Stop my order, I do not want it anymore.",
    #     "Can you cancel my order today?",
    #
    #     "Chcem zrušiť objednávku.",
    #     "Zrušiť objednávku urgentne.",
    #     "Objednávka bola omyl, potrebujem ju zrušiť.",
    #     "Vytvoril som duplicitnú objednávku, jednu zrušte.",
    #     "Nechcem túto objednávku, zrušte ju prosím.",
    #     "Je možné stornovať objednávku?",
    #     "Prosím o zrušenie objednávky.",
    #     "Objednávku nechcem, prosím zrušiť.",
    #     "zrušiť prosím.",
    #     "ZRUŠIŤ 55298.",
    #     " – chcem zrušiť.",
    #     "Zrušte moju  ešte dnes.",
    #     "Stornovať  prosím.",
    #     "Objednal som omylom, zrušte to.",
    # ],
    "Order Cancel": [

        # Business definition
        "Customer wants to permanently cancel the order before delivery.",
        "The order should be stopped and not processed further.",
        "This is full cancellation, not modification of details.",
        "The customer does not want to receive the order anymore.",
        # Trailing comma is required here; without it Python silently
        # concatenates this string with the next one.
        "Customer wants to completely stop the order process.",

        # English
        "Cancel my order before shipping.",
        "I changed my mind, cancel the order.",
        "Duplicate order, please cancel one.",
        "Stop my order immediately.",

        # Slovak
        "Chcem zrušiť objednávku.",
        "Prosím o zrušenie objednávky.",
        "Objednávka bola omyl, zrušte ju.",
        "Nechcem objednávku, prosím zrušiť.",
        "ZRUŠIŤ",
    ],

    # "Order Modify": [
    #     "Customer wants to modify order details.",
    #     "Request to change information in existing order.",
    #
    #     "I need to change my order details.",
    #     "Can I add another product to my order?",
    #     "Please update the shipping address.",
    #     "I entered wrong address, can you fix it?",
    #     "Change delivery details before dispatch.",
    #     "I want to change recipient name.",
    #     "Can I edit my order?",
    #     "Update phone number in my order.",
    #     "Change payment method before shipping.",
    #
    #     "Môžem ešte pridať produkt do objednávky?",
    #     "Zadal som zlú adresu, prosím o opravu.",
    #     "Dá sa zmeniť meno príjemcu?",
    #     "Potrebujem zmeniť spôsob platby.",
    #     "Môžem upraviť objednávku pred odoslaním?",
    #     "Objednávka 67329, môžem pridať produkt?",
    #     "Zmena údajov v objednávke.",
    #     "Prosím o zmenu adresy doručenia.",
    #     "Zmeniť telefónne číslo v objednávke.",
    #     "Chcem upraviť objednávku.",
    #     "Obj 52936 – zmena údajov.",
    #     "Je možné doplniť produkt do objednávky?",
    #     "Zmena mena, zmena priezviska",
    #     "Pridať produkt do objednávky.",
    #     "Chcem pridať produkt.",
    #     "Môžem doplniť produkt?",
    #     "Doplniť tovar do objednávky.",
    #     "Pridanie produktu do existujúcej objednávky.",
    #     "Objednávka 67329 – pridať produkt.",
    #     "Zmeniť adresu doručenia.",
    #     "Opraviť adresu v objednávke.",
    #     "da lo by sa zmenit?",
    #     "preto chcel by som zmenit",
    #     "chcel zmenit"
    #
    # ],
    "Order Modify": [

        # Business definition
        "Customer wants to update information but keep the order active.",
        "The order remains valid but details should be changed.",
        "This is modification, not cancellation.",
        "Customer wants to edit order information before shipment.",
        # Trailing comma is required here; without it Python silently
        # concatenates this string with the next one.
        "Customer wants to keep the order active but change details.",

        # English
        "Change shipping address.",
        "Add another product to my order.",
        "Update recipient name.",
        "Change payment method.",

        # Slovak
        "Zmeniť adresu doručenia.",
        "Pridať produkt do objednávky.",
        "Zmena údajov v objednávke.",
        "Môžem upraviť objednávku?",
    ],

    # "Return / Complaint": [
    #     "Return a product, refund, complaint, missing item, wrong item, damaged product, exchange.",
    #     "I want to return my order and get a refund.",
    #     "The product arrived damaged, I need a replacement.",
    #     "I received the wrong product, please fix this.",
    #     "One item is missing from my package.",
    #     "The package was broken when it arrived.",
    #     "I want to exchange the product for another one.",
    #     "The product is defective, how can I claim a refund?",
    #     "I am not satisfied with the product, I want to return it.",
    #     "The item arrived opened or used.",
    #     "The product has expired.",
    #     "The bottle was leaking inside the box.",
    #     "I would like to file a complaint about my order.",
    #     "Customer reports damaged product.",
    #     "Customer received defective or damaged goods.",
    #     "Complaint about damaged packaging.",
    #     "Product arrived broken or spilled.",
    #     "The product arrived damaged.",
    #     "The package was damaged.",
    #     "Protein container was broken.",
    #     "Item arrived defective.",
    #     "Product spilled inside the box.",
    #
    #     "Produkt prišiel poškodený.",
    #     "Obal bol poškodený.",
    #     "Balík bol poškodený.",
    #     "Proteín je rozsypaný.",
    #     "Tovar je poškodený.",
    #     "Prišlo niečo poškodené.",
    #     "Rozsypaný proteín v balíku.",
    #     "Produkt je rozbitý.",
    #     "Čo mám robiť keď je tovar poškodený?",
    #     "Vrátenie tovaru, reklamácia, chýba produkt, zlá príchuť, prišlo niečo iné, výmena, vrátiť.",
    #     "Prosím o riešenie reklamácie objednávky.",
    #     "Prišiel mi poškodený tovar.",
    #     "Produkt bol rozbitý pri doručení.",
    #     "V balíku chýba jedna položka.",
    #     "Prišlo len jedno balenie namiesto dvoch.",
    #     "Chcem vrátiť celý tovar z objednávky.",
    #     "Môžem tovar vymeniť za inú príchuť?",
    #     "Obal bol poškodený.",
    #     "Produkt je pokazený alebo expirovaný.",
    #     "Zlá veľkosť trička, potrebujem výmenu.",
    #     "Ako postupovať pri reklamácii?",
    #     "Kam mám poslať tovar na vrátenie?",
    #     "Je vrátenie zadarmo alebo platím poštovné?",
    #     "Kedy mi vrátite peniaze?",
    #     "Vrátenie celej objednávky.",
    #     "Objednávka vrátenie, reklamácia tovaru."
    # ],
    "Return / Complaint": [

        # Business definition
        "Customer received the product and reports damage or defect.",
        "Post-delivery issue requiring refund or replacement.",
        "This is about damaged, missing, or wrong item.",
        "Product has already been delivered.",

        # English
        "Product arrived damaged.",
        "Wrong item delivered.",
        "One item missing from package.",
        "I want to return the product.",

        # Slovak
        "Produkt prišiel poškodený.",
        "V balíku chýba produkt.",
        "Zlá príchuť, chcem výmenu.",
        "Chcem vrátiť tovar.",
    ],
    "Product Question": [
        "Product question: ingredients, gluten-free, dosage, differences between supplements, recommendation.",
        "What is the difference between EAA and BCAA?",
        "Which protein is better for muscle gain?",
        "Is this product suitable for beginners?",
        "Does this supplement contain gluten?",
        "Is this product lactose-free?",
        "What is the recommended dosage?",
        "When should I take this supplement?",
        "Is this product safe for daily use?",
        "Which creatine is better?",
        "Does this contain artificial sweeteners?",
        "Is this product vegan?",
        "Can you recommend something for weight loss?",
        "Which product is best for joint pain?",


        "Otázka o produkte: bezgluténové, EAA vs BCAA, kreatín, kolagén, dávkovanie, zloženie.",
        "Aký je rozdiel medzi týmito produktmi?",
        "Ktorý proteín je najlepší na naberanie svalov?",
        "Je tento produkt vhodný pre začiatočníkov?",
        "Obsahuje tento produkt laktózu?",
        "Je tento proteín bez cukru?",
        "Ako sa tento produkt užíva?",
        "Kedy je najlepšie užívať kreatín?",
        "Je tento doplnok vhodný pre ženy?",
        "Máte niečo na chudnutie?",
        "Je tento produkt vhodný pri celiakii?",
        "Obsahuje produkt umelé sladidlá?",
        "Aké sú účinky kolagénu?",
        "Pomôže tento produkt pri bolestiach kĺbov?",
        "Môžete mi odporučiť vhodný doplnok výživy?"
    ],
    "Store / Delivery / Availability": [
        "Do you deliver to Czechia? shipping abroad, store location, pickup, opening hours.",
        "How long does delivery take?",
        "What are the shipping options?",
        "Do you offer express shipping?",
        "Where is your physical store located?",
        "Can I pick up my order in person?",
        "What are your opening hours?",
        "How much is shipping?",
        "Do you ship internationally?",
        "Is cash on delivery available?",
        "What payment methods do you accept?",
        "Can I pay by bank transfer?",
        "How can I track my order?",
        "Do you deliver on weekends?",


        "Doručujete do Česka, predajňa, pobočka, kamenná predajňa, osobný odber, dostupnosť.",
        "Koľko stojí doprava?",
        "Aké sú možnosti dopravy?",
        "Doručujete do zahraničia?",
        "Aké platobné metódy prijímate?",
        "Je možný osobný odber?",
        "Kedy bude objednávka odoslaná?",
        "Ako sledovať zásielku?",
        "Doručujete cez víkend?",
        "Odosielate aj v sobotu?",
        "Aká je otváracia doba predajne?",
        "Koľko dní trvá doručenie?",
        "Môžem nakúpiť telefonicky?",
        "Je tovar skladom?"
    ],
    "Cooperation / Partnership": [
        "Business cooperation inquiry, partnership proposal, influencer collaboration, wholesale or B2B offer.",
        "We would like to discuss a business partnership.",
        "Influencer collaboration proposal.",
        "We are interested in wholesale cooperation.",
        "Marketing partnership opportunity.",
        "Sponsorship request for sports event.",
        "Brand collaboration inquiry.",
        "Can we promote your products?",
        "Affiliate partnership proposal.",
        "B2B cooperation request.",
        "We would like to become a distributor.",
        "Proposal for event sponsorship.",

        "Spolupráca, kooperácia, partnerská ponuka, influencer spolupráca, veľkoobchod, B2B, marketingová spolupráca.",
        "Požiadavka o sponzoring.",
        "Potreboval by som sponzoring na podujatie.",
        "Záujem o obchodnú spoluprácu.",
        "Ponuka partnerskej spolupráce.",
        "Chceli by sme propagovať vaše produkty.",
        "Možnosť veľkoobchodnej spolupráce.",
        "Marketingová ponuka.",
        "Záujem o affiliate spoluprácu.",
        "Sponzorstvo športovca.",
        "Spolupráca s fitness influencerom."
    ]
}

# Embedding

In [7]:
model = SentenceTransformer("intfloat/multilingual-e5-base")

embeddings = model.encode(
    [f"query: {t}" for t in df["text_for_model"]],
    normalize_embeddings=True
)

df

XLMRobertaModel LOAD REPORT from: intfloat/multilingual-e5-base
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.

Unnamed: 0,id,subject,email_text,language,subject_norm,email_text_norm,text_for_model,order_id,order_ids_all,urls,emails,spam_score,spam_signals,spam_flag,final_category
0,1,TÝŽDEŇ ČAKÁM!!!,UŽ TÝŽDEŇ ČAKÁM NA OBJEDNÁVKU 68247! ČO SA DEJ...,sk,TÝŽDEŇ ČAKÁM!!!,UŽ TÝŽDEŇ ČAKÁM NA OBJEDNÁVKU 68247! ČO SA DEJ...,TÝŽDEŇ ČAKÁM!!! UŽ TÝŽDEŇ ČAKÁM NA OBJEDNÁVKU ...,68247,68247,,,2,"lots_of_caps,many_exclamations",False,
1,2,Gluténová diéta,"Prosím vás, ktoré z vašich proteínov sú bezglu...",sk,Gluténová diéta,"Prosím vás, ktoré z vašich proteínov sú bezglu...","Gluténová diéta Prosím vás, ktoré z vašich pro...",,,,,0,,False,
2,3,Gainér a vitamíny,"Ahojte, minulý týždeň som objednal gainér a vi...",sk,Gainér a vitamíny,"Ahojte, minulý týždeň som objednal gainér a vi...","Gainér a vitamíny Ahojte, minulý týždeň som ob...",53471,53471,,,0,,False,
3,4,Duplicitná objednávka,"Ahoj, vytvoril som 2 objednávky omylom, 66481 ...",sk,Duplicitná objednávka,"Ahoj, vytvoril som 2 objednávky omylom, 66481 ...","Duplicitná objednávka Ahoj, vytvoril som 2 obj...",66481,"66481,66482",,,0,,False,
4,5,Zmena mena príjemcu,"Ahoj, obj 52936, dá sa zmeniť meno príjemcu? J...",sk,Zmena mena príjemcu,"Ahoj, obj 52936, dá sa zmeniť meno príjemcu? J...","Zmena mena príjemcu Ahoj, obj 52936, dá sa zme...",52936,52936,,,0,,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
269,270,Exspirovaná tyčinka!!!,"Objednávka č. 92157, proteínová tyčinka je po ...",sk,Exspirovaná tyčinka!!!,"Objednávka č. 92157, proteínová tyčinka je po ...","Exspirovaná tyčinka!!! Objednávka č. 92157, pr...",92157,92157,,,1,many_exclamations,False,
270,271,Doručujete cez víkend?,"Ahoj, doručujete zásielky aj v sobotu a nedeľu?",sk,Doručujete cez víkend?,"Ahoj, doručujete zásielky aj v sobotu a nedeľu?","Doručujete cez víkend? Ahoj, doručujete zásiel...",,,,,0,,False,
271,272,Pridať poznámku pre kuriéra,"Dobrý večer, objednávka 89174, môžem pridať po...",sk,Pridať poznámku pre kuriéra,"Dobrý večer, objednávka 89174, môžem pridať po...","Pridať poznámku pre kuriéra Dobrý večer, objed...",89174,89174,,,0,,False,
272,273,Thor Fuel + Vitargo dávkovanie,"Ahoj, ako sa dávkuje Thor Fuel + Vitargo? Koľk...",sk,Thor Fuel + Vitargo dávkovanie,"Ahoj, ako sa dávkuje Thor Fuel + Vitargo? Koľk...","Thor Fuel + Vitargo dávkovanie Ahoj, ako sa dá...",,,,,0,,False,


In [8]:
# Flatten prototypes
proto_texts = []
proto_labels = []

for category, texts in PROTOTYPES.items():
    for t in texts:
        proto_texts.append(t)
        proto_labels.append(category)

proto_embeddings = model.encode(
    [f"passage: {t}" for t in proto_texts],
    normalize_embeddings=True
)

sims = cosine_similarity(embeddings, proto_embeddings)

best_idx = np.argmax(sims, axis=1)

sorted_sims = np.sort(sims, axis=1)

best_scores = sorted_sims[:, -1]
second_best_scores = sorted_sims[:, -2]

df["proto_score"] = best_scores
df["margin"] = best_scores - second_best_scores

df["proto_category"] = [proto_labels[i] for i in best_idx]

In [12]:
def keyword_override(text):
    text = text.lower()

    # `text` is lowercased above, so only lowercase variants are needed here.
    cancel_verbs = ["zrušiť", "zrusit", "stornovať", "storno", "cancel"]
    modify_verbs = ["zmeniť", "zmenit", "upraviť", "upravit", "pridať", "pridat", "doplniť"]

    if any(v in text for v in cancel_verbs):
        return "Order Cancel"

    if any(v in text for v in modify_verbs):
        return "Order Modify"

    return None

df["override_category"] = df["text_for_model"].apply(keyword_override)

In [14]:
def apply_routing(row):

    if row["margin"] < 0.015:
        return "human"

    if row["override_category"] is not None:
        return row["override_category"]

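    # The prototype-based label is stored in "proto_category";
    # no "centroid_category" column is created anywhere in this notebook.
    return row["proto_category"]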

df["final_category"] = df.apply(apply_routing, axis=1)

In [15]:
spam_proto_embeddings = model.encode(
    [f"passage: {t}" for t in SPAM_PROTOTYPES],
    normalize_embeddings=True
)
spam_sims = cosine_similarity(embeddings, spam_proto_embeddings)

df["spam_similarity"] = spam_sims.max(axis=1)

df["spam_flag_final"] = (
        (
                (df["spam_similarity"] > df["proto_score"] + 0.05) &
                (df["spam_similarity"] > 0.82)
        )
        |
        (df["spam_score"] >= 3)
)

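# Reset to the prototype-based category. Note that this overwrites the keyword
# overrides and "human" labels assigned by apply_routing in the earlier cell.
df["final_category"] = df["proto_category"]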

df.loc[df["spam_flag_final"], "final_category"] = "Spam / Phishing"

In [16]:
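# NOTE: CONF_THRESHOLD is currently unused; the low-confidence flag below is
# based on the margin between the two best prototype similarities instead.
CONF_THRESHOLD = 0.85

df["low_confidence"] = df["margin"] < 0.004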

df["route"] = "bot"

df.loc[(df["low_confidence"] & ~df["spam_flag_final"]), "route"] = "human"

df.loc[df["spam_flag_final"], "route"] = "blocked"

HUMAN_ONLY_CATEGORIES = [
    "Return / Complaint",
    "Cooperation / Partnership",
    "Order Modify"
]

df.loc[df["final_category"].isin(HUMAN_ONLY_CATEGORIES), "route"] = "human"

df["route"].value_counts()

# df

route
human      146
bot        118
blocked     10
Name: count, dtype: int64
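
The routing rules in the cell above can be summarized as a single function with an explicit order of precedence. This is a sketch for illustration: the function name `route_email` and its parameters are assumptions, while the category set and the 0.004 margin threshold are taken from the cell above.

```python
HUMAN_ONLY_CATEGORIES = {
    "Return / Complaint",
    "Cooperation / Partnership",
    "Order Modify",
}


def route_email(category: str, margin: float, is_spam: bool,
                margin_threshold: float = 0.004) -> str:
    """Return "blocked", "human", or "bot" for a single classified email."""
    # 1. Spam is blocked regardless of anything else.
    if is_spam:
        return "blocked"
    # 2. Some categories always go to a human agent.
    if category in HUMAN_ONLY_CATEGORIES:
        return "human"
    # 3. Ambiguous predictions (small similarity margin) go to a human agent.
    if margin < margin_threshold:
        return "human"
    # 4. Everything else is handled by the chatbot.
    return "bot"


print(route_email("Order Status", margin=0.02, is_spam=False))
print(route_email("Order Status", margin=0.001, is_spam=False))
print(route_email("Order Modify", margin=0.05, is_spam=False))
print(route_email("Spam / Phishing", margin=0.05, is_spam=True))
```

Writing the rules as one function makes the precedence visible and easy to unit-test; the vectorized `df.loc[...]` assignments in the notebook produce the same result because spam rows carry the "Spam / Phishing" category, which is not in the human-only set.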

# Final Output

The final dataset includes:

- Message ID
- Original email text
- Extracted order ID (if present)
- Assigned intent category
- Spam flag
- Similarity score of the best-matching prototype (`proto_score`) and a low-confidence flag
- Routing decision (human / bot / blocked)

This structured output is intended for integration with customer support workflows.


In [17]:
export_cols = [
    "id",
    "subject",
    "email_text",
    "final_category",
    "route",
    "proto_score",
    "order_id",
    "order_ids_all",
    "urls",
    "emails",
    "spam_score",
    "spam_signals",
    "spam_flag_final",
    "low_confidence"
]

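import os

df_export = df[export_cols]

# Make sure the output directory exists before writing.
os.makedirs("output-data", exist_ok=True)

# Writing .xlsx requires an Excel engine such as openpyxl to be installed.
df_export.to_excel("output-data/emails_categorized.xlsx", index=False)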

df[export_cols].to_csv(
    "output-data/emails_categorized.csv",
    index=False,
    encoding="utf-8"
)


# System Architecture Overview

Pipeline:

1. Text Normalization
2. Feature Extraction (Order ID, URLs)
3. Semantic Embedding Generation
4. Spam Detection (Rule-based + Semantic)
5. Intent Classification via Prototype Similarity
6. Confidence Estimation
7. Human / Bot Routing
8. Export Structured Dataset
