# Textual Analysis and Retrieval System (TARS) - Data Enrichment  
The following notebook code enriches text using a suite of language models (LMs), as outlined in Section 3.1 of the accompanying research discussion paper [AI-driven Information Retrieval from Liaison: The Reserve Bank of Australia’s New Tool](https://www.google.com). In the paper, the text was extracted from Word documents written about liaison meetings. In this code, however, the data is generated artificially using AI and is imported directly from the CSV file `Data/Example_liaison_data.csv`. For code that extracts text from Word documents, see `TARS_Extraction.ipynb`.

The steps of enrichment are:
1. Import the data (and take a sample if just testing; the full dataset can take hours to days depending on compute power).
2. Import the LMs from Hugging Face (see `TARSml` for model cards).
3. Apply each model to the text and save the output in a list.
4. Merge the list of model outputs into a single "enriched" dataset.

The output of this file, together with the metadata extracted from the Word documents, is the basis for TARS. It can be connected to the TARS dashboard frontend or used to develop the capabilities outlined in Section 4 of the paper.

In [None]:
import pandas as pd
import torch
import TARSutils
import TARSml

## Use GPU if available; otherwise use most of the CPU capacity (check that this is okay for your system)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name())
    device_use = 0
else:
    print(torch.get_num_threads())
    torch.set_num_threads(torch.get_num_threads())
    device_use = "cpu"

In [None]:
IND_COLUMNS = ['top_industry','top_industry_score'] # industry-based topic tags
CAT_COLUMNS = ['top_category','top_score'] # economic/business-based topic tags
SENT_COLUMNS = ['sentiment','sentiment_score'] # sentiment tags
ID_COLS = ['file_id','seq_id','rev_id'] # columns to join on
now = TARSutils.current_datetime() # take timestamp
timestamp = now.strftime("%Y%m%d%H%M")

In [None]:
## Import data
data = pd.read_csv("../Data/Example_liaison_data.csv")

In [None]:
## Check imported data 
print(len(data))
data.head()

## Enrich Data using language models

In [None]:
## remove .head(100) if you would like to run the full history (warning: full history can take many hours depending on system)
to_enrich = data.loc[data['category']=='BODY',ID_COLS + ["text"]].head(100)
results = [data.head(100)]
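
One thing to watch in the cell above: `to_enrich` holds the first 100 `BODY` rows, while the base frame stored in `results` holds the first 100 rows of any category. If the data contains non-`BODY` rows, the two samples can cover different IDs, and (assuming the later join keeps only the rows of the first frame) some enriched rows would not be matched. The sketch below illustrates this on toy data and shows one possible way to build the base frame from the sampled IDs. The toy rows and the `HEADING` category value are invented for illustration.

```python
import pandas as pd

ID_COLS = ["file_id", "seq_id", "rev_id"]

# Toy data (invented): every second row is a heading, so BODY rows are interleaved.
toy = pd.DataFrame({
    "file_id": [1] * 6,
    "seq_id": list(range(6)),
    "rev_id": [0] * 6,
    "category": ["HEADING", "BODY"] * 3,
    "text": [f"row {i}" for i in range(6)],
})

n = 2
to_enrich_toy = toy.loc[toy["category"] == "BODY", ID_COLS + ["text"]].head(n)

# Base frame taken naively: the first n rows of any category.
naive_base = toy.head(n)
# Base frame restricted to the IDs that were actually sampled for enrichment.
aligned_base = toy.merge(to_enrich_toy[ID_COLS], on=ID_COLS, how="inner")

# Count how many sampled rows each base frame can be matched with.
naive_overlap = len(naive_base.merge(to_enrich_toy[ID_COLS], on=ID_COLS))
aligned_overlap = len(aligned_base.merge(to_enrich_toy[ID_COLS], on=ID_COLS))
print(naive_overlap, aligned_overlap)
```

If the example data contains only `BODY` rows, both samples coincide and no change is needed.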

### Category (economic and business topic) tags
Using a zero-shot classification model.
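
The model classes in this notebook are all used the same way: construct with a name and a column prefix, ask `creates_columns()` which columns will be added, then call `enrich(df, ID_COLS)`. As a rough illustration of that interface only, here is a toy stand-in that replaces the zero-shot model with simple keyword matching. `ToyCategoryModel`, its `labels` argument, and the output column names are hypothetical and are not taken from `TARSml`.

```python
import pandas as pd


class ToyCategoryModel:
    """Keyword-based stand-in for a zero-shot tagger (not the TARSml code)."""

    def __init__(self, name, prefix, labels):
        self.name = name
        self.prefix = prefix
        self.labels = labels  # mapping: label -> collection of trigger keywords

    def creates_columns(self):
        # Names of the columns that enrich() adds to its output.
        return [f"{self.prefix}top", f"{self.prefix}top_score"]

    def _score(self, text):
        # Score each label by the share of words that match its keywords.
        words = [w.strip(".,") for w in text.lower().split()]
        scores = {
            label: sum(w in kws for w in words) / max(len(words), 1)
            for label, kws in self.labels.items()
        }
        best = max(scores, key=scores.get)
        return best, scores[best]

    def enrich(self, df, id_cols):
        # Return the ID columns plus one tag and one score per row.
        out = df[id_cols].copy()
        tagged = [self._score(t) for t in df["text"]]
        tag_col, score_col = self.creates_columns()
        out[tag_col] = [label for label, _ in tagged]
        out[score_col] = [score for _, score in tagged]
        return out


toy = pd.DataFrame({
    "file_id": [1, 1],
    "seq_id": [0, 1],
    "rev_id": [0, 0],
    "text": ["Wages rose and hiring slowed.", "Input prices and costs increased."],
})
model = ToyCategoryModel("category", "cat_", {
    "labour": {"wages", "hiring"},
    "prices": {"prices", "costs"},
})
tags = model.enrich(toy, ["file_id", "seq_id", "rev_id"])
print(tags)
```

A real zero-shot model scores each text against candidate labels without task-specific training; only the scoring step differs from this toy, while the surrounding data handling follows the same shape.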

In [None]:
## Initialise model
Cat_mod = TARSml.CategoryModel('category',"cat_",device = device_use)
columns = list(ID_COLS) + list(Cat_mod.creates_columns())
print(f'enriching data with model: {Cat_mod.name}')

In [None]:
## Perform enrichment on the data
enriched = Cat_mod.enrich(to_enrich, ID_COLS)[columns]
enriched.head()

In [None]:
## Save enriched dataframe to results list 
results.append(enriched)

### Industry tags

Also using a zero-shot classification model.

In [None]:
## Initialise model
Ind_mod = TARSml.IndustryModel('industry',"ind_",device = device_use)
columns = list(ID_COLS) + list(Ind_mod.creates_columns())
print(f'enriching data with model: {Ind_mod.name}')

In [None]:
## Perform enrichment on the data
enriched = Ind_mod.enrich(to_enrich, ID_COLS)[columns]
enriched.head()

In [None]:
## Save enriched dataframe to results list 
results.append(enriched)

### Tone/sentiment tags
Using a FinBERT model.

In [None]:
## Initialise model
Sent_mod = TARSml.SentimentModel('sentiment',"sentiment_",device = device_use)
columns = list(ID_COLS) + list(Sent_mod.creates_columns())
print(f'enriching data with model: {Sent_mod.name}')

In [None]:
## Perform enrichment on the data
enriched = Sent_mod.enrich(to_enrich, ID_COLS)[columns]
enriched.head()

In [None]:
## Save enriched dataframe to results list 
results.append(enriched)

### Numerical Extraction
Using a RoBERTa QA model and a zero-shot classification model.
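
How `TARSutils.extract_target_numbers` combines the two models is not shown in this notebook. The sketch below assumes a two-stage flow, a relevance check followed by extraction of a numeric span, and swaps both models for plain-Python stand-ins: keyword matching in place of the zero-shot model and a regular expression in place of the QA model. All function names here are hypothetical.

```python
import re

# Matches figures such as "3", "4.5", "8%" or "3.5 per cent".
NUMBER = re.compile(r"\d+(?:\.\d+)?\s*(?:%|per cent)?")


def is_about(text, target_list):
    # Stand-in for the zero-shot relevance check: plain keyword match.
    words = re.findall(r"[a-z]+", text.lower())
    return any(target in words for target in target_list)


def extract_number(text):
    # Stand-in for the QA step: return the first numeric span, if any.
    match = NUMBER.search(text)
    return match.group(0).strip() if match else None


def extract_target_number(text, target_list):
    # Only look for a number when the text is about the target topic.
    return extract_number(text) if is_about(text, target_list) else None


price_hit = extract_target_number(
    "The firm expects prices to rise by 3.5 per cent.", ["price", "prices"]
)
price_miss = extract_target_number("Staff turnover fell to 8%.", ["price", "prices"])
print(price_hit, price_miss)
```

A regular expression takes the first number it finds regardless of what that number refers to, which is one reason a QA model that reads the surrounding sentence may be preferred for this step.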

In [None]:
## initialise model suite for numerical extraction
qa = TARSml.QA(device = device_use)
Zeroshot = TARSml.Zero(device = device_use)
models = {"QA":qa, "Zeroshot":Zeroshot}

In [None]:
#### PRICE EXTRACT ####
price_extract = TARSutils.extract_target_numbers(to_enrich, ID_COLS, models = models, target = "prices", target_list = ["price","prices"])
price_extract.head()

In [None]:
## Save enriched dataframe to results list 
results.append(price_extract[ID_COLS + ["PricesExtract"]])

In [None]:
#### WAGES EXTRACT ####
wages_extract = TARSutils.extract_target_numbers(to_enrich, ID_COLS, models = models, target = "wages", target_list = ["wage","wages"])
wages_extract.head()

In [None]:
## Save enriched dataframe to results list 
results.append(wages_extract[ID_COLS + ["WagesExtract"]])

In [None]:
## Convert list of dataframes into single joined dataframe
results_all = TARSutils.pd_left_join_all(results, on=ID_COLS)
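
The implementation of `TARSutils.pd_left_join_all` is not shown here. Going by its name and arguments, a plausible minimal version folds the list of dataframes from left to right with successive left joins on the ID columns, so that every row of the first frame is kept. The helper `left_join_all` and the toy frames below are assumptions for illustration.

```python
from functools import reduce

import pandas as pd


def left_join_all(frames, on):
    # Fold the list left to right, keeping every row of the first frame.
    return reduce(lambda left, right: left.merge(right, on=on, how="left"), frames)


# Toy frames (invented): a base frame plus two partial enrichment outputs.
base = pd.DataFrame({"file_id": [1, 1, 2], "seq_id": [0, 1, 0], "text": ["a", "b", "c"]})
tags = pd.DataFrame({"file_id": [1, 2], "seq_id": [0, 0], "cat_top": ["labour", "prices"]})
tone = pd.DataFrame({"file_id": [1], "seq_id": [1], "sentiment": ["positive"]})

joined = left_join_all([base, tags, tone], on=["file_id", "seq_id"])
print(joined)
```

Rows of the base frame with no match in an enrichment output keep missing values in the corresponding columns, which is why non-enriched rows can still appear in the final dataset.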

In [None]:
## Check that all enriched data is joined into one dataframe
results_all

In [None]:
results_all.to_csv("../Data/Example_liaison_data_enriched.csv")