# Named Entity Recognition (NER) from DOCX using Stanza

This notebook extracts **person names** and **locations** from a `.docx` file using the **Stanza NLP library**.  
It supports both **Ukrainian** and **Russian** text.

---
## Setup Instructions

Before running the notebook, you'll need to install a few Python libraries and prepare your document.

### Install Required Python Libraries

You can install all the required packages by running the following code cell in the notebook:

In [None]:
!pip install stanza python-docx torch==2.5.0 numpy
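
If you want to confirm that the installation worked before moving on, the following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical (it is not part of any of the packages above); it looks up each distribution name and reports `None` for anything that is missing.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment
            versions[name] = None
    return versions


# The names match the ones passed to pip in the cell above
print(installed_versions(['stanza', 'python-docx', 'torch', 'numpy']))
```

Any entry shown as `None` means the install cell should be run again before continuing.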

## 📁 Step 1: Uploading Files

Before executing the next code cell, you must upload a `.docx` transcript that contains the text you want to analyze.

To upload files in **Google Colab**:

- Click on the **folder icon** labeled **Files** on the left sidebar.  
- Click the **Upload icon** to upload your `.docx` file from your computer.  
- Once uploaded, **right-click** the file name and select **"Copy path"**.  
- Paste that path between the quotes in the `input_docx_file` variable below, keeping the `.docx` extension.

You must also name the two output `.txt` files: `output_persons_file` for person names and `output_locations_file` for locations.

⚙️ You can also change the `language` variable to:
- `'uk'` for **Ukrainian**
- `'ru'` for **Russian**

> **Note:** Files uploaded this way will be removed once the session ends. You can ignore any temporary file warnings from Colab.


In [None]:
# Paste the path to your .docx file between the quotes
input_docx_file = '2022-03-16_CUH_U-001.docx'
# Names of the output files (with .txt extension) that the next code cell will create
output_persons_file = 't1.txt'
output_locations_file = 't2.txt'
# Language of the text: 'uk' for Ukrainian, 'ru' for Russian
language = 'uk'
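
A mistyped path or language code only surfaces later as an error from the NER cell, so it can help to check the settings first. The sketch below is one possible check, not part of the original notebook: `check_settings` and `SUPPORTED_LANGUAGES` are hypothetical names, and the rules simply mirror the instructions above (a `.docx` input that exists, `'uk'` or `'ru'` as the language, `.txt` outputs).

```python
from pathlib import Path

# The two language codes this notebook describes
SUPPORTED_LANGUAGES = {'uk', 'ru'}


def check_settings(docx_path, language, output_paths):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    path = Path(docx_path)
    if path.suffix.lower() != '.docx':
        problems.append(f"Input file should end in .docx: {docx_path}")
    elif not path.is_file():
        problems.append(f"Input file not found (was it uploaded?): {docx_path}")
    if language not in SUPPORTED_LANGUAGES:
        problems.append(f"language must be 'uk' or 'ru', got {language!r}")
    for out in output_paths:
        if Path(out).suffix.lower() != '.txt':
            problems.append(f"Output file should end in .txt: {out}")
    return problems
```

In the notebook it could be called as `check_settings(input_docx_file, language, [output_persons_file, output_locations_file])`, printing each returned message before running the next cell.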

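## 🔄 Step 2: Load the Document, Run NER, and Save Results

The next cell reads the `.docx` file and appends a lowercased copy of the text, which may help the model recognize more entities. It then runs the Stanza pipeline on the combined text and writes the unique person names and locations to the two output files, one entity per line.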


In [None]:
from docx import Document
import stanza

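def read_docx(file_path):
    """Return the text of all body paragraphs, one per line."""
    doc = Document(file_path)
    return '\n'.join(para.text for para in doc.paragraphs)

# Read the document, then append a lowercased copy of the whole text
doc_text = read_docx(input_docx_file).strip()
doc_text += '\n' + doc_text.lower()

# Download the models for the chosen language and build the pipeline
stanza.download(language)
nlp = stanza.Pipeline(language)

# Run the pipeline, including NER, over the combined text
doc = nlp(doc_text)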

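# Person label this code expects: 'PERS' for Ukrainian, 'PER' otherwise
person_tag = 'PERS' if language == 'uk' else 'PER'

# Collect each distinct entity string once
unique_persons = {ent.text for ent in doc.ents if ent.type == person_tag}
unique_locations = {ent.text for ent in doc.ents if ent.type == 'LOC'}

# Write one entity per line, sorted alphabetically
with open(output_persons_file, 'w', encoding='utf-8') as persons_file: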
    for person in sorted(unique_persons):
        persons_file.write(person + '\n')

with open(output_locations_file, 'w', encoding='utf-8') as locations_file:
    for location in sorted(unique_locations):
        locations_file.write(location + '\n')

print(f"Unique persons written to: {output_persons_file}")
print(f"Unique locations written to: {output_locations_file}")
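
Because a lowercased copy of the text is appended before NER runs, the same entity can end up in the output twice, once in its original casing and once in lowercase. The following is a minimal sketch of one way to collapse such pairs, assuming that strings differing only by letter case refer to the same entity. The helper `merge_case_variants` is hypothetical and not part of Stanza.

```python
def merge_case_variants(entities):
    """Collapse strings that differ only by letter case into a single entry.

    A variant that is not all-lowercase is preferred, so 'Київ' wins over
    'київ'. If several cased variants exist, the first in sorted order is kept.
    """
    best = {}
    for text in sorted(entities):
        key = text.casefold()
        current = best.get(key)
        # Keep the first variant seen, unless it is all-lowercase
        # and the new one is not
        if current is None or (current == current.lower() and text != text.lower()):
            best[key] = text
    return sorted(best.values())


sample = {'Київ', 'київ', 'Тарас Шевченко', 'тарас шевченко', 'львів'}
print(merge_case_variants(sample))
```

To apply it here, `unique_persons` and `unique_locations` could be passed through `merge_case_variants` before the files are written. Entities that appear only in lowercase, such as `львів` in the sample, are kept as they are.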
