📚 Glossary Extractor for Localization QA

Extract recurring noun phrases from English source strings to build or enhance your bilingual glossary (EN > IT), with real context examples.

✨ Overview

This tool is designed to support Localization Quality Assurance (LQA) and Translation Memory (TM) management by programmatically identifying candidate terms for glossary inclusion.

It leverages spaCy for NLP-based noun chunking and pandas for filtering, with progress tracking via tqdm. Output terms include real examples from source and target strings, and can be checked against an existing glossary for overlap.
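
The noun-chunking step can be sketched roughly as follows. This is a minimal illustration, not the code shipped in `core`: the function names, the article-stripping rule, and the `min_freq` default are all assumptions made for the example.

```python
from collections import Counter

# Assumption: leading articles are dropped so "The final boss" counts as "final boss".
LEADING_ARTICLES = ("the ", "a ", "an ")

def normalize(chunk_text):
    """Lowercase, collapse whitespace, and strip one leading article."""
    text = " ".join(chunk_text.lower().split())
    for article in LEADING_ARTICLES:
        if text.startswith(article):
            return text[len(article):]
    return text

def count_noun_chunks(docs):
    """Count normalized noun chunks over already-parsed spaCy Doc objects."""
    counts = Counter()
    for doc in docs:
        for chunk in doc.noun_chunks:
            counts[normalize(chunk.text)] += 1
    return counts

def extract_candidates(strings, min_freq=2, model="en_core_web_sm"):
    """Parse EN strings with spaCy and keep chunks seen at least min_freq times."""
    # Imported lazily so the helpers above work without spaCy or tqdm installed.
    import spacy
    from tqdm import tqdm

    nlp = spacy.load(model)
    docs = tqdm(nlp.pipe(strings), total=len(strings))  # progress bar over the parse
    counts = count_noun_chunks(docs)
    return {term: n for term, n in counts.items() if n >= min_freq}
```

Calling `extract_candidates(list_of_en_strings)` requires the model to be installed first, for example with `python -m spacy download en_core_web_sm`.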

🧠 Use Case

Large projects often suffer from inconsistent terminology or missing glossary entries. This tool helps:

- Detect high-frequency candidate terms in EN source text.
- Review them with real EN/IT sentence pairs.
- Identify gaps or mismatches in your current glossary.

Ideal for PMs, linguists, and LQA specialists working on games, apps, or any high-volume localization project.
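
To make the "real EN/IT sentence pairs" idea concrete, here is a small sketch of how context examples for a candidate term might be collected. `find_context` and its `limit` parameter are hypothetical, and apart from the first pair the sample strings are invented for illustration.

```python
import re

def find_context(term, pairs, limit=3):
    """Return up to `limit` (EN, IT) pairs whose EN side contains `term`."""
    # Word boundaries keep "boss" from matching inside "bossy".
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for en, it in pairs:
        if pattern.search(en):
            hits.append((en, it))
            if len(hits) == limit:
                break
    return hits

# Sample data: only the first pair comes from this README.
pairs = [
    ("The final boss is here!", "Il boss finale è qui!"),
    ("Defeat the final boss to win.", "Sconfiggi il boss finale per vincere."),
    ("Open the shop.", "Apri il negozio."),
]

for en, it in find_context("final boss", pairs):
    print(f"{en}  |  {it}")
```

Showing the target string next to the source lets a reviewer see how the term was actually translated before adding it to the glossary.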

⚙️ Features

🔍 Noun phrase extraction from EN source strings
📈 Frequency filtering (customizable)
🔤 Support for multi-word expressions (n-grams)
✅ Check against existing glossary (EN > IT)
🧹 Rejects noisy/invalid terms and logs them
💬 Outputs context-rich examples for human validation
📦 Export to CSV for review or integration into glossaries
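
Frequency filtering and noise rejection could look something like the sketch below. The rejection rules (too short, contains digits, no letters) and the choice to silently drop terms under the threshold are assumptions for illustration; the tool's real rules may differ.

```python
import re

def reject_reason(term):
    """Return why a term looks noisy, or None if it passes (illustrative rules)."""
    if len(term) < 3:
        return "too short"
    if re.search(r"\d", term):
        return "contains digits"
    if not re.search(r"[A-Za-z]", term):
        return "no letters"
    return None

def split_candidates(counts, min_freq=2):
    """Split frequent terms into accepted {term: freq} and rejected {term: reason}."""
    accepted, rejected = {}, {}
    for term, freq in counts.items():
        if freq < min_freq:
            continue  # below the frequency threshold: dropped, not logged
        reason = reject_reason(term)
        if reason is None:
            accepted[term] = freq
        else:
            rejected[term] = reason
    return accepted, rejected

# Hypothetical counts, as a noun-chunk pass might produce them.
counts = {"final boss": 12, "level 10": 9, "%s": 5, "***": 4, "rare term": 1}
accepted, rejected = split_candidates(counts)
print(accepted)
print(rejected)
```

Keeping the reason alongside each rejected term makes the rejection log easy to audit when a useful term is filtered out by mistake.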

📂 Input Format

You will need an Excel file with at least these two columns:

| EN | Italian |
| --- | --- |
| The final boss is here! | Il boss finale è qui! |
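
A file in this shape could be loaded and validated along these lines. This is a sketch only: `missing_columns` and `load_pairs` are hypothetical helpers, and reading `.xlsx` with pandas also needs an engine such as openpyxl installed.

```python
# Column names taken from the example table above.
REQUIRED_COLUMNS = ("EN", "Italian")

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    return [name for name in required if name not in columns]

def load_pairs(path):
    """Read the translation workbook and return a list of (EN, IT) string pairs."""
    import pandas as pd  # imported lazily; only needed when a file is read

    df = pd.read_excel(path)
    missing = missing_columns(df.columns)
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")
    # Rows with an empty source or target cannot serve as context examples.
    df = df.dropna(subset=list(REQUIRED_COLUMNS))
    return list(zip(df["EN"].astype(str), df["Italian"].astype(str)))
```

Failing early on a missing column gives a clearer error than a `KeyError` raised later during extraction.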

Optionally, provide a second glossary file with:

| en Term1 | it Term1 |
| --- | --- |
| final boss | boss finale |
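
Checking candidates against such a glossary might be sketched as below, assuming the glossary has already been loaded into a plain `{EN term: IT term}` dict. The function name and the case-insensitive matching are assumptions, and "health bar" is an invented sample term.

```python
def check_against_glossary(candidates, glossary):
    """Split candidate EN terms into those already covered and those missing."""
    # Match case-insensitively so "Final Boss" finds the "final boss" entry.
    known = {en.strip().lower(): it for en, it in glossary.items()}
    covered, missing = {}, []
    for term in candidates:
        key = term.strip().lower()
        if key in known:
            covered[term] = known[key]
        else:
            missing.append(term)
    return covered, missing

glossary = {"final boss": "boss finale"}
covered, missing = check_against_glossary(["Final Boss", "health bar"], glossary)
print(covered)  # {'Final Boss': 'boss finale'}
print(missing)  # ['health bar']
```

The `missing` list is the set of glossary gaps to review; `covered` shows the existing Italian entry for each known term so mismatches with the context examples can be spotted.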

🏁 How to Use

1. Clone this repo.
2. Install dependencies:
```
pip install -r requirements.txt
```
3. Run the CLI script:
```
python cli#2.py
```
4. Follow the prompts to input:
   - Your translation Excel file
   - Your glossary file
5. View your results:
   - terms_filtered.xlsx -> cleaned candidates
   - rejected_terms.xlsx -> filtered-out noisy terms

📌 Notes

- Language model: spaCy en_core_web_sm
- Current language pair: EN > IT
- Extendable to other pairs with simple tweaks

📬 Contributors

Leoth, Localization QA Specialist @ Moonton
