# Textual Analysis and Retrieval System (TARS) - Extracting Text from Word Documents 
This notebook extracts text from a Word document, as outlined in section 3.1 of the accompanying research discussion paper [AI-driven Information Retrieval from Liaison: The Reserve Bank of Australia’s New Tool](https://www.google.com). An example document is contained in the Data folder inside this directory. The code:
1. Parses the DOCX document
2. Iterates over each Paragraph and Table and retrieves its text and style
3. Cleans up text (e.g. removing empty strings)
4. Uses heuristics to assign text as "BODY", "HEADING", "TABLE" or "UNKNOWN"
5. Creates an additional column identifying which heading the text is under.
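
Steps 1 and 2 can be sketched using only the Python standard library. This is a minimal illustration, not the implementation behind `DocXNote`, whose internals are not shown in this notebook. It relies on a DOCX file being a ZIP archive whose main content lives in `word/document.xml`. The helper names `iter_blocks` and `_demo_docx` are hypothetical, and the in-memory archive is a stripped-down stand-in for a real Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def iter_blocks(docx_file):
    """Yield (kind, style_id, text) for each top-level paragraph or table, in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find("w:body", NS)
    for child in body:
        if child.tag == f"{{{W}}}p":
            # The paragraph style is stored as a style ID (e.g. "Heading1"), if present
            style = child.find("w:pPr/w:pStyle", NS)
            style_id = style.get(f"{{{W}}}val") if style is not None else None
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            yield "paragraph", style_id, text
        elif child.tag == f"{{{W}}}tbl":
            # Collect the text of every cell and join the cells with " | "
            cells = [
                "".join(t.text or "" for t in cell.iter(f"{{{W}}}tc") if False) or
                "".join(t.text or "" for t in cell.iter(f"{{{W}}}t"))
                for cell in child.iter(f"{{{W}}}tc")
            ]
            yield "table", None, " | ".join(cells)


def _demo_docx():
    """Build a tiny in-memory archive holding only word/document.xml (demo only)."""
    xml = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Summary</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Demand was </w:t></w:r><w:r><w:t>steady.</w:t></w:r></w:p>'
        '<w:tbl><w:tr>'
        '<w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>'
        '<w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>'
        '</w:tr></w:tbl>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


blocks = list(iter_blocks(_demo_docx()))
for block in blocks:
    print(block)
```

Walking the children of the body element, rather than collecting paragraphs and tables separately, keeps the two kinds of block in their original reading order.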

The output is a dataframe containing text and extracted metadata as outlined above. This text would then be run through enrichment, as is done in `TARS_Enrichment.ipynb`.
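
To make the shape of that output concrete, steps 4 and 5 can be sketched in plain Python. The rules and thresholds below are illustrative assumptions, not the heuristics implemented by `detect_content_type` or `add_last_heading_column` in `TARSutils`. The function names `detect_type` and `add_last_heading`, the row keys, and the sample text are all hypothetical.

```python
def detect_type(text, source):
    """Label a block using simple, assumed heuristics on length, capitalisation and punctuation."""
    stripped = text.strip()
    if source == "table":
        return "TABLE"
    if not stripped:
        return "UNKNOWN"
    words = stripped.split()
    ends_like_sentence = stripped[-1] in ".!?"
    # Assumption: headings are short, start with a capital and lack end punctuation
    if len(words) <= 8 and not ends_like_sentence and stripped[0].isupper():
        return "HEADING"
    if ends_like_sentence:
        return "BODY"
    return "UNKNOWN"


def add_last_heading(rows):
    """Attach the most recent heading to every row (a heading row refers to itself)."""
    last_heading = None
    labelled = []
    for row in rows:
        if row["content_type"] == "HEADING":
            last_heading = row["text"]
        labelled.append({**row, "last_heading": last_heading})
    return labelled


# Made-up sample blocks as (text, source) pairs
sample = [
    ("Business Conditions", "paragraph"),
    ("Demand was steady over the quarter.", "paragraph"),
    ("Q1 | Q2", "table"),
    ("Outlook", "paragraph"),
    ("Firms expect moderate growth.", "paragraph"),
]

rows = [
    {"text": text, "content_type": detect_type(text, source)}
    for text, source in sample
]
rows = add_last_heading(rows)
for row in rows:
    print(row)
```

Carrying the last heading forward row by row gives each piece of body or table text the section context it would otherwise lose once the document is flattened into rows.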

In [None]:
import pandas as pd
from TARSutils import DocXNote, clean_content, detect_content_type, add_last_heading_column

pd.set_option('display.max_colwidth', None)

In [None]:
## Initialise DocX extraction object
docs = DocXNote()

In [None]:
## Load Doc into session and parse content into a dataframe
extracted_content = docs.parse("Data/Example Liaison Summary Note.docx")
## Create file_id column (in practice, this would be assigned programmatically or from a document directory)
extracted_content["file_id"] = 123456

In [None]:
## Check content
extracted_content.head()

In [None]:
## Remove duplicate, empty and outdated rows.
content = clean_content(extracted_content)
content.head()

In [None]:
## Apply heuristics to get the content type, e.g. string length, capitalisation and punctuation traits known to exist in liaison summary notes.
content = detect_content_type(content)
content.head()

In [None]:
## Add a column identifying the most recent heading above each row, using the content type assigned in the previous step
content = add_last_heading_column(content)
content