# Email Classification with PII Masking and API Deployment

## Introduction

In today's fast-paced digital environment, customer support teams often deal with a large volume of emails. Automatically classifying these emails into predefined categories can significantly streamline workflows and improve response times. Additionally, to ensure data privacy and regulatory compliance, it is essential to mask personally identifiable information (PII) before any data processing.

## Objective

The objective of this project is to build an end-to-end email classification system that:
- **Masks personal information (PII)** such as name, email, phone number, etc.
- **Classifies support emails** into categories like Billing Issues, Technical Support, etc.
- **Restores the original data** after classification
- **Exposes the system through an API** for real-time usage

## Problem Scope

The system must:
1. Detect and mask the following PII entities (without using LLMs):
   - Full Name (`full_name`)
   - Email Address (`email`)
   - Phone Number (`phone_number`)
   - Date of Birth (`dob`)
   - Aadhaar Card Number (`aadhar_num`)
   - Credit/Debit Card Number (`credit_debit_no`)
   - CVV Number (`cvv_no`)
   - Expiry Date (`expiry_no`)
2. Classify emails using any suitable model (ML/DL/LLM)
3. Accept user input and return a structured response via a **POST API**

## Workflow

1. Data Loading and Preprocessing  
2. PII Detection and Masking (Regex / NER / Custom NLP)  
3. Email Classification (Model training and prediction)  
4. Demasking and Output Formatting  
5. API Development and Deployment  

---

Let’s begin with the necessary imports and initial setup.

## Environment Setup

Before starting the implementation, we install the required libraries so that the environment is ready for email processing, PII masking, and API deployment.


In [None]:

!pip install fastapi nest-asyncio pyngrok uvicorn
!pip install scikit-learn pandas joblib spacy
!python -m spacy download en_core_web_sm


Collecting fastapi
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting pyngrok
  Downloading pyngrok-7.2.9-py3-none-any.whl.metadata (9.3 kB)
Collecting uvicorn
  Downloading uvicorn-0.34.3-py3-none-any.whl.metadata (6.5 kB)
Collecting starlette<0.47.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.46.2-py3-none-any.whl.metadata (6.2 kB)
Downloading fastapi-0.115.12-py3-none-any.whl (95 kB)
Downloading pyngrok-7.2.9-py3-none-any.whl (25 kB)
Downloading uvicorn-0.34.3-py3-none-any.whl (62 kB)
Downloading starlette-0.46.2-py3-none-any.whl (72 kB)
Installing collected packages: uvicorn, pyngrok, s

### Step 1: Importing the Required Libraries

To begin with, we import all the necessary Python libraries that will support various parts of our project. We use `re` for regular expressions, which will help us detect and mask personal information like emails or phone numbers. `spacy` is imported for natural language processing tasks, especially useful in identifying named entities such as names or dates of birth. For handling and analyzing data, we use `pandas`.

We also import several modules from `scikit-learn` which will be used for building and evaluating our email classification models. These include vectorizers like `TfidfVectorizer`, classifiers such as `MultinomialNB` and `SVC`, and utility functions like `train_test_split` and `classification_report`. `joblib` will help us save and reload our trained models later.

Since this project includes creating an API for real-time interaction, we use `FastAPI` and `BaseModel` to define and deploy the API endpoints. Running an API inside a notebook isn't straightforward, so we use `nest_asyncio` and `pyngrok` to allow asynchronous execution and make the API publicly accessible via a tunnel. Finally, `uvicorn` is included to serve the FastAPI app.

This setup ensures that we have all the tools needed for building a robust pipeline — from data preprocessing and classification to deploying an interactive API.


In [None]:
import re
import sys

import joblib
import nest_asyncio
import pandas as pd
import spacy
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from pyngrok import ngrok
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


### Step 2: Creating the PII Masking Function

In this section, we define a custom function named `mask_pii` which is responsible for identifying and masking Personally Identifiable Information (PII) from a given text. To do this effectively, we first load the small English language model from spaCy, which is well-suited for basic NLP tasks like named entity recognition.

The function works in two stages. First, it uses regular expressions to detect common PII patterns such as email addresses, phone numbers, Aadhaar numbers, credit or debit card numbers, CVV codes, and expiration dates. When such information is found, it is replaced in the text with a label in square brackets (e.g., `[email]` or `[aadhar_num]`), and details about each masked item are recorded — including its position, type, and original content.

In the second stage, we use spaCy’s NLP model to identify named entities in the text, focusing specifically on entities labeled as `PERSON` (which we treat as full names) and `DATE` (interpreted as dates of birth). These entities are also replaced with placeholders like `[full_name]` or `[dob]`. Note that spaCy’s `DATE` label covers any date expression, so dates other than birth dates will also receive the `[dob]` placeholder.

The function returns two items: the updated version of the text with all detected PII masked, and a list containing metadata for each masked item. This approach ensures that sensitive user information is hidden before any further analysis or processing.


In [None]:
# Load the small English NLP model from spaCy
nlp = spacy.load("en_core_web_sm")

# Regex patterns for common types of PII, ordered from most to least specific
# so that a longer match (a 16-digit card) wins over a shorter one (Aadhaar, CVV)
PII_PATTERNS = [
    (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', 'email'),
    (r'\b(?:\d{4}[- ]?){3}\d{4}\b', 'credit_debit_no'),
    (r'\b\d{4} \d{4} \d{4}\b', 'aadhar_num'),
    (r'\b\d{10}\b', 'phone_number'),
    (r'\b(?:0[1-9]|1[0-2])/(?:[0-9]{2}|[0-9]{4})\b', 'expiry_no'),
    (r'\b\d{3}\b', 'cvv_no'),
]

def mask_pii(text):
    # Collect candidate spans as (start, end, label); all offsets refer to
    # the original, unmodified text
    candidates = []

    # Stage 1: regex matches
    for pattern, label in PII_PATTERNS:
        for match in re.finditer(pattern, text):
            candidates.append((match.start(), match.end(), label))

    # Stage 2: spaCy named entities (names and dates)
    for ent in nlp(text).ents:
        if ent.label_ == "PERSON":
            candidates.append((ent.start_char, ent.end_char, "full_name"))
        elif ent.label_ == "DATE":
            candidates.append((ent.start_char, ent.end_char, "dob"))
        # Other entity types are skipped

    # Drop any candidate that overlaps one accepted earlier, so regex matches
    # and more specific patterns take priority
    accepted = []
    for start, end, label in candidates:
        if all(end <= s or start >= e for s, e, _ in accepted):
            accepted.append((start, end, label))
    accepted.sort()

    # Replace from right to left so earlier offsets stay valid
    masked_text = text
    for start, end, label in reversed(accepted):
        masked_text = masked_text[:start] + f"[{label}]" + masked_text[end:]

    # Store masked entity info in order of appearance
    masked_entities = [
        {"position": [start, end], "classification": label, "entity": text[start:end]}
        for start, end, label in accepted
    ]

    return masked_text, masked_entities
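
The regex stage can be tried on its own, without loading spaCy. The following is a minimal sketch under a few assumptions: it uses a reduced, hypothetical pattern list (`SKETCH_PATTERNS`, email and 10-digit phone only), a hypothetical helper name (`mask_with_offsets`), and it assumes matches do not overlap. It records every span against the original text and substitutes placeholders from right to left, so the reported positions stay valid.

```python
import re

# Hypothetical, reduced pattern list for illustration only
SKETCH_PATTERNS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "email"),
    (re.compile(r"\b\d{10}\b"), "phone_number"),
]

def mask_with_offsets(text):
    """Mask regex matches and report spans relative to the original text."""
    spans = []
    for pattern, label in SKETCH_PATTERNS:
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    spans.sort()  # order of appearance; assumes no overlapping matches

    masked = text
    # Replace right to left so the offsets of earlier spans are not shifted
    for start, end, label in reversed(spans):
        masked = masked[:start] + f"[{label}]" + masked[end:]

    entities = [
        {"position": [s, e], "classification": label, "entity": text[s:e]}
        for s, e, label in spans
    ]
    return masked, entities

sample = "Mail me at jane@example.com or call 9876543210."
masked, entities = mask_with_offsets(sample)
print(masked)
for item in entities:
    print(item)
```

Because each `position` refers to the unmodified input, `sample[start:end]` always returns the stored `entity` string.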


Next, we load the dataset. Once the file is read with `pandas.read_csv()`, we perform a basic cleaning step: `dropna()` removes any rows where the `email` or `type` column is missing, so that we only work with complete records during training.

Finally, `df.head()` previews the first few rows of the cleaned dataset so we can verify that the data has been loaded correctly.



In [None]:

# Upload CSV manually via Colab file interface or use path below if already uploaded
df = pd.read_csv("combined_emails_with_natural_pii.csv")
df = df.dropna(subset=["email", "type"])  # drop rows with missing data
df.head()


| Unnamed: 0 | email | type |
|---|---|---|
| 0 | Subject: Unvorhergesehener Absturz der Datenan... | Incident |
| 1 | Subject: Customer Support Inquiry\n\nSeeking i... | Request |
| 2 | Subject: Data Analytics for Investment\n\nI am... | Request |
| 3 | Subject: Krankenhaus-Dienstleistung-Problem\n\... | Incident |
| 4 | Subject: Security\n\nDear Customer Support, I ... | Request |


Now the dataset is first split into training and test sets, where 80% of the data is used to train the model and 20% is reserved for testing its performance. The TfidfVectorizer is then applied to convert the raw email text into numerical features by computing the term frequency-inverse document frequency (TF-IDF) scores for each word, which helps in emphasizing important terms while reducing the weight of commonly occurring ones. The transformed training data is then used to train a Multinomial Naive Bayes classifier, a popular choice for text classification tasks. Once the model is trained, predictions are made on the test data, and the performance is evaluated using accuracy and a classification report that provides precision, recall, and F1-score for each class. Finally, both the trained model and the vectorizer are saved using joblib, allowing them to be reused later in a deployment scenario without needing to retrain.
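
To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It is an illustration, not the library's implementation: tokenization is a plain whitespace split, the function name `tfidf` and the sample documents are made up, and the formula follows what we understand to be the documented `TfidfVectorizer` defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization).

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Made-up documents: "payment" occurs in all three, "invoice" in only one
docs = [
    "invoice payment failed",
    "payment refund request",
    "server crashed payment lost",
]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term:<8} {weight:.3f}")
```

In the first document, "payment" receives a lower weight than "invoice" because it appears in every document and therefore carries less information about the category.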

In [None]:

X_train, X_test, y_train, y_test = train_test_split(df["email"], df["type"], test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_vec, y_train)
y_pred = model.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

joblib.dump(model, "naive_bayes_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")


Accuracy: 0.66875
              precision    recall  f1-score   support

      Change       0.97      0.07      0.13       479
    Incident       0.61      0.99      0.75      1920
     Problem       0.38      0.01      0.02      1009
     Request       0.78      0.91      0.84      1392

    accuracy                           0.67      4800
   macro avg       0.68      0.50      0.44      4800
weighted avg       0.65      0.67      0.56      4800



['tfidf_vectorizer.pkl']

This output summarizes the performance of the trained Naive Bayes model on the test set. The overall accuracy is 66.88%, meaning about two-thirds of the emails were classified correctly. The classification report, however, shows very uneven performance across categories. The model assigns most emails to "Incident": recall for that class is 0.99, but precision is only 0.61, so many emails from other classes are mislabeled as incidents. "Request" is handled reasonably well (F1-score 0.84). "Change" and "Problem" are almost never predicted: their recall is 0.07 and 0.01, giving F1-scores of 0.13 and 0.02, even though the few "Change" predictions the model does make are mostly correct (precision 0.97). This pattern is consistent with class imbalance ("Incident" has 1,920 test examples against 479 for "Change"), although feature limitations may also contribute. Finally, the line `['tfidf_vectorizer.pkl']` is the return value of the last `joblib.dump` call, confirming that the TF-IDF vectorizer was saved to disk.
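
The F1-scores in the report are the harmonic mean of precision and recall, which is why a class with high precision but near-zero recall still scores poorly. The sketch below recomputes them from the rounded precision and recall values printed above; the helper name `f1_from` is ours, and small differences from the report would be possible in general because the inputs are already rounded.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the Naive Bayes report
nb_report = {
    "Change": (0.97, 0.07),
    "Incident": (0.61, 0.99),
    "Problem": (0.38, 0.01),
    "Request": (0.78, 0.91),
}

nb_f1 = {label: f1_from(p, r) for label, (p, r) in nb_report.items()}
for label, score in nb_f1.items():
    print(f"{label:<9} F1 = {score:.2f}")
```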

Now, a Support Vector Machine (SVM) model with a linear kernel is trained on the same TF-IDF vectorized training data. The model is then used to make predictions on the test set. The performance is evaluated using accuracy and a detailed classification report, allowing comparison with the earlier Naive Bayes model. After evaluation, the trained SVM model is saved as svm_model.pkl, and the TF-IDF vectorizer is saved again as tfidf_vectorizer.pkl, ensuring that both components are available for future inference or deployment.

In [None]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_vec, y_train)
y_pred_svm = svm_model.predict(X_test_vec)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

joblib.dump(svm_model, "svm_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

SVM Accuracy: 0.7729166666666667
              precision    recall  f1-score   support

      Change       0.96      0.80      0.87       479
    Incident       0.69      0.91      0.78      1920
     Problem       0.65      0.28      0.39      1009
     Request       0.91      0.94      0.92      1392

    accuracy                           0.77      4800
   macro avg       0.80      0.73      0.74      4800
weighted avg       0.77      0.77      0.75      4800



['tfidf_vectorizer.pkl']

The SVM clearly outperforms the earlier Naive Bayes model: accuracy rises from about 67% to about 77%, and the macro-averaged F1-score from 0.44 to 0.74. "Request" and "Change" are classified reliably (F1-scores of 0.92 and 0.87), and "Incident" keeps high recall (0.91) with moderate precision (0.69). The "Problem" class remains the weak point, with recall of only 0.28, so most emails in this category are still assigned to other classes. Overall, the SVM is better balanced across categories, which makes it the more suitable of the two models for deployment.
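
The two summary rows of the report differ only in how the per-class scores are averaged: the macro average treats every class equally, while the weighted average weights each class by its support. As a worked check, the following sketch recomputes both from the rounded per-class F1-scores and supports in the SVM report above (variable names are ours).

```python
# Per-class F1-scores and supports, copied from the SVM report
svm_f1 = {"Change": 0.87, "Incident": 0.78, "Problem": 0.39, "Request": 0.92}
support = {"Change": 479, "Incident": 1920, "Problem": 1009, "Request": 1392}

# Macro average: plain mean over classes
macro_f1 = sum(svm_f1.values()) / len(svm_f1)

# Weighted average: mean over classes, weighted by number of test examples
total = sum(support.values())
weighted_f1 = sum(svm_f1[label] * support[label] for label in svm_f1) / total

print(f"macro avg F1    = {macro_f1:.2f}")
print(f"weighted avg F1 = {weighted_f1:.2f}")
```

The weighted figure is slightly higher than the macro figure because the weakest class, "Problem", accounts for only 1,009 of the 4,800 test emails.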

Next, the previously trained SVM model and TF-IDF vectorizer are loaded with `joblib`, so the model can be reused without retraining. The function `classify_email_pipeline` then defines the complete processing pipeline for an incoming email. It first calls `mask_pii` to anonymize sensitive personal data in the email (names, dates, contact details), producing both the masked text and a list of masked entities. The masked text is transformed into numerical features with the TF-IDF vectorizer, and this vector is passed to the loaded SVM model, which predicts the email's category (such as "Incident" or "Request"). Finally, the function returns a dictionary containing the original email, the list of masked entities with their details, the masked email, and the predicted category. Note that the classifier above was trained on the raw emails while the pipeline classifies masked text; retraining on masked emails would keep training and inference consistent.

In [None]:
model = joblib.load("svm_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

def classify_email_pipeline(email_text):
    masked_text, masked_entities = mask_pii(email_text)
    vector = vectorizer.transform([masked_text])
    category = model.predict(vector)[0]
    return {
        "input_email_body": email_text,
        "list_of_masked_entities": masked_entities,
        "masked_email": masked_text,
        "category_of_the_email": category
    }
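
The pipeline returns the original email alongside the masked one rather than reconstructing it. If the original text needs to be rebuilt from the masked string (the demasking step in the workflow), one possible approach is sketched below. The function name `demask` is hypothetical, and the sketch assumes that each entity's `position` refers to the original text, that entities do not overlap, and that the email does not already contain literal strings such as `[email]`.

```python
def demask(masked_text, masked_entities):
    """Rebuild the original text from a masked string and its entity list."""
    restored = []
    cursor = 0
    # Walk the placeholders left to right, in order of original position
    for item in sorted(masked_entities, key=lambda e: e["position"][0]):
        placeholder = f"[{item['classification']}]"
        idx = masked_text.find(placeholder, cursor)
        if idx == -1:
            continue  # placeholder not found; leave the text as it is
        restored.append(masked_text[cursor:idx])
        restored.append(item["entity"])
        cursor = idx + len(placeholder)
    restored.append(masked_text[cursor:])
    return "".join(restored)

# Illustrative input in the same shape as the pipeline's output fields
example_masked = "Mail me at [email] or call [phone_number]."
example_entities = [
    {"position": [36, 46], "classification": "phone_number", "entity": "9876543210"},
    {"position": [11, 27], "classification": "email", "entity": "jane@example.com"},
]
print(demask(example_masked, example_entities))
```

Sorting by start position means the entity list does not have to arrive in order of appearance.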


## Step 7: Run FastAPI with `pyngrok` in Colab

In [None]:
!ngrok config add-authtoken  YOUR_NGROK_AUTHTOKEN


Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
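# Alias the notebook's `__main__` module as `main`, so that an import string
# such as "main:app" would resolve to the objects defined in this notebook
sys.modules["main"] = sys.modules["__main__"]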

Finally, we set up and launch a simple API using FastAPI directly from a Jupyter or Google Colab environment. First, an instance of the FastAPI app is created, and `nest_asyncio.apply()` is called so that the server's event loop can run inside the notebook's already running loop. Then a Pydantic data model, `EmailInput`, is defined to validate the structure of incoming data, specifically the `email_body` field that holds the email to classify. A POST endpoint, `/classify`, accepts this data, passes it through `classify_email_pipeline`, and returns the result, including the masked content and the predicted category. To make the locally hosted API reachable over the internet, `ngrok` opens a public tunnel to port 8000 and the public URL is printed. Finally, the app is launched with `uvicorn`, which starts the server and listens for incoming requests, so the email classification model can be queried over HTTP from anywhere.

In [None]:
# Create API app
app = FastAPI()
nest_asyncio.apply()

# Define input format
class EmailInput(BaseModel):
    email_body: str

# Define route
@app.post("/classify")
async def classify_email(data: EmailInput):
    return classify_email_pipeline(data.email_body)

# Open public tunnel with ngrok
public_url = ngrok.connect(8000)
print("Public URL:", public_url)

# Run FastAPI app from this notebook
uvicorn.run(app, host="0.0.0.0", port=8000)


Public URL: NgrokTunnel: "https://2cc1-35-231-75-185.ngrok-free.app" -> "http://localhost:8000"


INFO:     Started server process [236]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO:     2409:4073:29d:d089:258d:be7a:652d:86de:0 - "GET / HTTP/1.1" 404 Not Found
INFO:     2409:4073:29d:d089:258d:be7a:652d:86de:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     157.46.1.77:0 - "GET / HTTP/1.1" 404 Not Found
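
The `404 Not Found` entries in the log come from `GET` requests to `/`, for which no route is defined; the only route is `POST /classify`. A client therefore has to send a JSON body with an `email_body` field to that path. Below is a minimal standard-library sketch of such a client. The helper names are ours, and the base URL is a placeholder that must be replaced with the public URL printed by `ngrok`.

```python
import json
import urllib.request

def build_classify_request(base_url, email_body):
    """Build a POST request for the /classify endpoint."""
    payload = json.dumps({"email_body": email_body}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify_remote(base_url, email_body, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_classify_request(base_url, email_body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Placeholder URL: substitute the tunnel address printed by the notebook
request = build_classify_request(
    "https://example.ngrok-free.app", "Hello, my invoice is wrong."
)
print(request.get_method(), request.full_url)
```

Calling `classify_remote` with the real tunnel URL should return the dictionary produced by `classify_email_pipeline`, with the keys `input_email_body`, `list_of_masked_entities`, `masked_email`, and `category_of_the_email`.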
