
In [None]:
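!pip install -q spacy
# The English model is a separate download; spacy.load fails if it is missing.
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")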


# 🔐 Day 4: Entity Memory + Selective Remapping
This notebook enhances the LLM Privacy Shield by:
- Ensuring repeated entities (like names) are mapped to the same token
- Allowing selective remapping of certain PII (e.g., keep emails masked)

✅ Smarter masking  
✅ Consistent outputs  
✅ More control for users


In [None]:
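import re

def mask_text_with_memory(text, nlp, memory=None):
    """Mask PERSON/ORG/GPE entities with stable tokens such as {{PERSON_1}}.

    `memory` maps entity text -> token. Pass the returned memory back in on
    later calls so the same entity keeps the same token.
    """
    if memory is None:
        memory = {}

    doc = nlp(text)
    token_map = {}
    masked_text = text
    counters = {
        "PERSON": 1,
        "ORG": 1,
        "GPE": 1,
        "EMAIL": 1  # reserved: en_core_web_sm has no EMAIL label, so this needs regex-based detection
    }

    # Resume numbering after tokens already stored in memory, so a new entity
    # never reuses a token that an earlier call handed out.
    for known_token in memory.values():
        match = re.fullmatch(r"\{\{([A-Z]+)_(\d+)\}\}", known_token)
        if match and match.group(1) in counters:
            counters[match.group(1)] = max(counters[match.group(1)], int(match.group(2)) + 1)

    for ent in doc.ents:
        if ent.label_ in counters:
            ent_text = ent.text
            label = ent.label_

            if ent_text in memory:
                token = memory[ent_text]
            else:
                token = f"{{{{{label}_{counters[label]}}}}}"
                memory[ent_text] = token
                counters[label] += 1

            # str.replace masks every occurrence at once, so a repeated
            # entity only needs to be handled the first time it is seen.
            if token not in masked_text:
                masked_text = masked_text.replace(ent_text, token)
                token_map[token] = ent_text

    return masked_text, token_map, memory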
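
`mask_text_with_memory` relies on `str.replace`, which rewrites every occurrence of the entity text, including occurrences inside longer words. One possible alternative is to replace by character offsets. The sketch below is only an illustration: `mask_spans` is a hypothetical helper, and it assumes non-overlapping `(start, end, label)` spans, which with spaCy could be built from `ent.start_char`, `ent.end_char`, and `ent.label_`.

```python
def mask_spans(text, spans):
    """Hypothetical helper: mask non-overlapping (start, end, label) spans.

    Returns (masked_text, token_map). Only the given character ranges are
    replaced, so matching substrings elsewhere in the text are left alone.
    """
    counters = {}   # label -> how many tokens have been issued
    seen = {}       # entity text -> token, so repeats share one token
    pieces = []
    cursor = 0

    for start, end, label in sorted(spans):
        original = text[start:end]
        if original not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[original] = f"{{{{{label}_{counters[label]}}}}}"
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(seen[original])      # token in place of the entity
        cursor = end

    pieces.append(text[cursor:])           # untouched tail
    token_map = {token: original for original, token in seen.items()}
    return "".join(pieces), token_map


# Offsets are written by hand here; with spaCy they would come from doc.ents.
text = "Al met Alan. Also, Al left."
spans = [(0, 2, "PERSON"), (7, 11, "PERSON"), (19, 21, "PERSON")]
masked, token_map = mask_spans(text, spans)
print(masked)
print(token_map)
```

In this example "Alan" and "Also" survive intact, whereas `text.replace("Al", ...)` would have rewritten the "Al" inside both words.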


In [None]:
def remap_output(llm_response, token_map, skip=None):
    """Restore the original entity text in an LLM response.

    Tokens listed in `skip` are left masked.
    """
    if skip is None:
        skip = []

    for token, original in token_map.items():
        if token not in skip:
            llm_response = llm_response.replace(token, original)

    return llm_response
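
`remap_output` skips exact tokens, so the caller has to know each token name in advance. If the goal is to keep a whole entity type masked, one option is to skip by label instead. The following is a minimal sketch, not part of the original pipeline: `remap_output_by_label` and its `skip_labels` parameter are hypothetical, and it assumes every token has the `{{LABEL_n}}` shape used above.

```python
import re


def remap_output_by_label(llm_response, token_map, skip_labels=()):
    """Hypothetical variant of remap_output that skips whole entity types.

    Tokens are assumed to look like {{LABEL_n}}, e.g. {{PERSON_1}}.
    """
    skip_labels = set(skip_labels)
    for token, original in token_map.items():
        match = re.fullmatch(r"\{\{([A-Z]+)_\d+\}\}", token)
        label = match.group(1) if match else None
        # Restore the original text unless this token's label is skipped.
        if label not in skip_labels:
            llm_response = llm_response.replace(token, original)
    return llm_response


token_map = {"{{PERSON_1}}": "John Smith", "{{ORG_1}}": "Acme Corp."}
response = "{{PERSON_1}} signed a contract with {{ORG_1}}"
print(remap_output_by_label(response, token_map, skip_labels=["PERSON"]))
# → {{PERSON_1}} signed a contract with Acme Corp.
```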


In [None]:
text = "John Smith met John Smith at Acme Corp. Then he emailed john.smith@acme.com"

# Mask
masked, token_map, memory = mask_text_with_memory(text, nlp)
print("🔹 Masked Text:\n", masked)
print("🔹 Token Map:", token_map)
print("🔹 Memory:", memory)

# Simulate LLM output using masked text
llm_output = masked + " They signed a contract."

# Remap (without skipping)
final_output = remap_output(llm_output, token_map)
print("\n✅ Final Remapped Output:\n", final_output)

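# Remap (with skip toggle)
# Note: spaCy's NER does not tag the email address, so token_map has no
# {{EMAIL_1}} entry and this call returns the same text as the one above.
final_skipped = remap_output(llm_output, token_map, skip=["{{EMAIL_1}}"])
print("\n🚫 Skipping Email Remap:\n", final_skipped)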


🔹 Masked Text:
 {{PERSON_1}} met {{PERSON_1}} at {{ORG_1}} Then he emailed john.smith@acme.com
🔹 Token Map: {'{{PERSON_1}}': 'John Smith', '{{ORG_1}}': 'Acme Corp.'}
🔹 Memory: {'John Smith': '{{PERSON_1}}', 'Acme Corp.': '{{ORG_1}}'}

✅ Final Remapped Output:
 John Smith met John Smith at Acme Corp. Then he emailed john.smith@acme.com They signed a contract.

🚫 Skipping Email Remap:
 John Smith met John Smith at Acme Corp. Then he emailed john.smith@acme.com They signed a contract.


## ✅ Summary

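This notebook added:
- 🧠 A `memory` dictionary so that a repeated entity always maps to the same token (e.g., every "John Smith" becomes `{{PERSON_1}}`)
- ⚙️ An optional `skip` list that controls which tokens `remap_output` leaves masked
- 🧪 A simple end-to-end test showing how input flows to the final output

Note that email addresses are not masked yet: spaCy's NER does not label them, so the `EMAIL` counter stays unused until regex-based detection is added.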

➡️ You're now ready to wrap this pipeline into an API or app (FastAPI next).
