# Build Data set

## About the DataSet

The dataset consists of news articles scraped from a Mexican newspaper. For each news article I extracted entities using the Stanford CoreNLP Python client. I focused only on PERSON, ORGANIZATION, COUNTRY and LOCATION entities.

For each entity of interest I ran a query-string search against the Wikidata knowledge graph, which sometimes returned several candidates for the same entity. The objective is to train a "lightweight" model (lighter than an LLM) that can identify which of the Wikidata candidates is the correct one, given the context in which the entity is mentioned.

I built a dataset of x articles, x entities and y options. The true labels were computed using StabilityAI/StableBeluga-7B LLM. I left the LLM running for several days answering the query:
```
"""
Given this news article:

"{news_text}"

When the article mentions "{entity}", which of the following options is the news article most likely referring to? Provide only one option.
options:

{search_options}
"""
```

in order to get a big enough dataset.
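
As a minimal sketch of how the query above could be filled in for one article and entity: the numbered layout of the options is inferred from the `options_given` column shown later in this notebook, and the helper names `format_options` and `build_prompt` are hypothetical (the actual code lives in `0-ask_stable_beluga.ipynb`).

```python
# Prompt template copied from the query shown above.
PROMPT_TEMPLATE = '''
Given this news article:

"{news_text}"

When the article mentions "{entity}", which of the following options is the news article most likely referring to? Provide only one option.
options:

{search_options}
'''


def format_options(label_descs):
    """Number the candidate descriptions starting at 1 (assumed layout)."""
    return "\n".join(f"{i}. {desc}" for i, desc in enumerate(label_descs, start=1))


def build_prompt(news_text, entity, label_descs):
    """Fill the template for one (article, entity) pair."""
    return PROMPT_TEMPLATE.format(
        news_text=news_text,
        entity=entity,
        search_options=format_options(label_descs),
    )


# Toy inputs, not taken from the real dataset.
prompt = build_prompt(
    news_text="The tournament final was held in Tokyo on Sunday.",
    entity="Tokyo",
    label_descs=[
        "Tokyo, capital and largest city of Japan",
        "University of Tokyo, National university in Tokyo, Japan",
    ],
)
print(prompt)
```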

In [2]:
import numpy as np
import pandas as pd
from transformers import BertTokenizer
import json
import nltk
import re

In [3]:
def print_progress_bar(iteration, total, bar_length=50):
    progress = float(iteration) / float(total)
    arrow = '=' * int(round(progress * bar_length) - 1)
    spaces = ' ' * (bar_length - len(arrow))

    print(f'Progress: [{arrow + spaces}] {int(progress * 100)}%', end='\r')

# Load Data

`sb_disambiguation_result.json` is an 800 MB file that will not be uploaded to GitHub. It contains the teacher observations produced with `StableBeluga-7B`. See the `0-ask_stable_beluga.ipynb` notebook for further details on how this dataset was produced using LLMs.
The StableBeluga process was run on a subset of 315,512 news-entity observations, which we considered large enough for a POC. The process could be left running for all 2,953,563 samples or for a larger subset. It is important to hold out a percentage of the observations for validation and for comparing running times.

This subset of 315,512 news-entity observations will be our training data.
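
One possible way to hold out observations for validation is to split by article title, so that all entities from the same article land on the same side. The sketch below is only an illustration: the function name `is_validation` and the 10% fraction are assumptions, not something this notebook does.

```python
import hashlib


def is_validation(h1, validation_fraction=0.1):
    """Deterministically assign an article title to the validation split."""
    digest = hashlib.sha256(h1.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return bucket < validation_fraction


# Toy titles, standing in for the `h1` column.
titles = [f"article {i}" for i in range(1000)]
validation_titles = [t for t in titles if is_validation(t)]
print(len(validation_titles), "of", len(titles), "titles held out")
```

Because the assignment depends only on the title, it stays stable across runs and across subsets; on a DataFrame it could be applied with `df['h1'].map(is_validation)`.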

Columns:

- `text`: Text of the named entity, for example, Donald Trump.
- `ner`: Type of the named entity, such as PERSON, LOCATION, etc.
- `nerConfidences`: Confidence that the named entity belongs to a given NER type.
- `clean_text`: Text of the named entity, cleaned of special characters and upper-cased.
- `h1`: Title of the news article. Used as the key to merge with the news article information.
- `wikidata_search_entries`: List of all instances of the entity that were found in Wikidata. Only one of those instances corresponds to the correct option.
- `sb_answer`: Stable Beluga's answer to the question:
```
  """
Given this news article:

"{news_text}"

When the article mentions "{entity}", which of the following options is the news article most likely referring to? Provide only one option.
options:

{search_options}
"""
```

`news_with_metadata.json` is a 1.5 GB file that will not be uploaded to GitHub. It contains 117,786 Spanish news articles along with associated metadata that was extracted using other NLP techniques, such as translation and summarization. The data was scraped from the digital newspaper "El Universal".

Relevant Columns:

- `h1`: Title of the news article. Used as the key to merge with the teacher observations.
- `date`: Date and time when the news article was published.
- `author`: Name of the author who published the article.
- `content`: Article text in Spanish.
- `h1_en`: Article title translated to English.
- `content_en`: Article text translated to English.

In [4]:
with open("datasets/sb_disambiguation_result.json", "r") as file:
    sb_results = json.load(file)

In [6]:
# Number of teacher observations
len(sb_results)

315512

In [8]:
sb_results_df = pd.DataFrame.from_dict(sb_results)

In [9]:
sb_results_df.head(5)

Unnamed: 0,text,ner,nerConfidences,clean_text,wikidata_search_entries,h1,sb_answer,options_given,ix
0,Tokyo,CITY,{'LOCATION': 0.9994902892958},TOKYO,"[{'id': 'Q1490', 'display_label': 'Tokyo', 'di...",El Hijo de Dr. Wagner Jr. gana la ‘Global Tag ...,The news article is most likely referring to: ...,"1. Tokyo, capital and largest city of Japan \n...",
1,Duprée,PERSON,{'PERSON': 0.83429111680194},DUPREE,"[{'id': 'Q1013540', 'display_label': 'Dupree',...",El Hijo de Dr. Wagner Jr. gana la ‘Global Tag ...,The news article is most likely referring to: ...,"1. Dupree, city in South Dakota, United States...",
2,Katsuhiko Nakajima,PERSON,{'PERSON': 0.99989643856346},KATSUHIKO NAKAJIMA,"[{'id': 'Q959636', 'display_label': 'Katsuhiko...",El Hijo de Dr. Wagner Jr. gana la ‘Global Tag ...,The news article is most likely referring to: ...,"1. Katsuhiko Nakajima, Japanese professional w...",
3,Ministry of Finance and Public Credit,ORGANIZATION,{'ORGANIZATION': 0.78478234268622},MINISTRY OF FINANCE AND PUBLIC CREDIT,"[{'id': 'Q3062474', 'display_label': 'Ministry...",Estímulos a gasolinas afectan recaudación entr...,The news article is most likely referring to: ...,"1. Ministry of Finance and Public Credit, gove...",
4,Treasury,ORGANIZATION,{'ORGANIZATION': 0.99823209224661},TREASURY,"[{'id': 'Q3277092', 'display_label': 'Departme...",Estímulos a gasolinas afectan recaudación entr...,The news article is most likely referring to: ...,"1. Department of the Treasury, Australian gove...",


In [11]:
sb_results_df.iloc[0].to_dict()

{'text': 'Tokyo',
 'ner': 'CITY',
 'nerConfidences': "{'LOCATION': 0.9994902892958}",
 'clean_text': 'TOKYO',
 'wikidata_search_entries': [{'id': 'Q1490',
   'display_label': 'Tokyo',
   'display_desc': 'capital and largest city of Japan',
   'label': 'Tokyo',
   'desc': 'capital and largest city of Japan',
   'label_desc': 'Tokyo, capital and largest city of Japan',
   'match_type': 'label'},
  {'id': 'Q7473516',
   'display_label': 'Tokyo',
   'display_desc': 'special wards in the eastern part of Tokyo Metropolis in Japan, that used to form a single city',
   'label': 'Tokyo',
   'desc': 'special wards in the eastern part of Tokyo Metropolis in Japan, that used to form a single city',
   'label_desc': 'Tokyo, special wards in the eastern part of Tokyo Metropolis in Japan, that used to form a single city',
   'match_type': 'label'},
  {'id': 'Q7842',
   'display_label': 'University of Tokyo',
   'display_desc': 'National university in Tokyo, Japan',
   'label': 'University of Tokyo',


In [12]:
# Load the news articles data set
path_file = 'datasets/news_with_metadata.json'
with open(path_file, 'r') as jfile:
    processed_news_articles = json.load(jfile)

In [13]:
news_articles_df = pd.DataFrame.from_dict(processed_news_articles)

In [14]:
all_ixs = sb_results_df.reset_index()['index'].unique()

# Prepare Data

In [16]:
# Merge the article text into the stable beluga teacher observations
sb_results_df = pd.merge(
    sb_results_df,
    news_articles_df[['h1', 'content_en']],
    how='left',
    on='h1'
)
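
Since the article title `h1` is the merge key, two articles sharing a title would silently duplicate teacher observations in a left merge. Whether that happens in this data is not shown in the notebook; the sketch below uses toy frames to illustrate one way to detect and guard against it. Keeping the first article per title is an assumption made only for the example.

```python
import pandas as pd

# Toy frames standing in for sb_results_df and news_articles_df.
observations = pd.DataFrame(
    {"h1": ["a", "a", "b"], "text": ["Tokyo", "Dupree", "Treasury"]}
)
articles = pd.DataFrame(
    {"h1": ["a", "b", "b"], "content_en": ["text a", "text b", "text b again"]}
)

# Titles that appear more than once on the article side would duplicate observations.
duplicated_titles = articles.loc[articles["h1"].duplicated(keep=False), "h1"].unique()
print(list(duplicated_titles))

# Keep one article per title (assumption: the first occurrence is the right one).
articles_unique = articles.drop_duplicates(subset="h1", keep="first")

merged = pd.merge(
    observations,
    articles_unique,
    how="left",
    on="h1",
    validate="many_to_one",  # raises MergeError if titles are still duplicated
)
print(len(observations), len(merged))
```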

## Extract index of the best option given by Stable Beluga

In [18]:
regex_pattern = "The news article is most likely referring to:"
sb_results_df['sb_option_given'] = sb_results_df['sb_answer'].str[len(regex_pattern):]
sb_results_df['sb_option_given'] = sb_results_df['sb_option_given'].str.replace('\n', '').str.strip()

In [19]:
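# Capture the option number (digits followed by a literal dot) and the option text.
regex_pattern = r"(\d+\.)(.*)"
sb_results_df[['index_of_option_given', 'sb_option_given']] = sb_results_df['sb_option_given'].str.extract(regex_pattern)
# Remove the dot literally (regex=False), so "." is not treated as a regex wildcard.
sb_results_df['index_of_option_given'] = sb_results_df['index_of_option_given'].str.replace(".", "", regex=False).str.strip()
sb_results_df['sb_option_given'] = sb_results_df['sb_option_given'].str.strip()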
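
To make the extraction step concrete, here is a plain-Python sketch of the same idea applied to a single answer string. The sample answer is made up (the real answers are truncated in the output above), and `parse_answer` is a hypothetical helper. Unlike `str.extract`, it only accepts a number at the start of the answer body.

```python
import re

PREFIX = "The news article is most likely referring to:"
OPTION_PATTERN = re.compile(r"(\d+)\.\s*(.*)", flags=re.DOTALL)


def parse_answer(answer):
    """Return (zero_based_index, option_text), or (None, None) if no number is found."""
    body = answer[len(PREFIX):] if answer.startswith(PREFIX) else answer
    body = body.replace("\n", " ").strip()
    match = OPTION_PATTERN.match(body)
    if match is None:
        return None, None
    # Options are numbered from 1 in the prompt, so shift to a 0-based index.
    return int(match.group(1)) - 1, match.group(2).strip()


sample = "The news article is most likely referring to: \n1. Tokyo, capital and largest city of Japan"
print(parse_answer(sample))
print(parse_answer("I am not sure."))
```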

In [20]:
# see if there were instances where the process couldn't find a numeric best option
sb_results_df[sb_results_df['index_of_option_given'].isnull()]

Unnamed: 0,text,ner,nerConfidences,clean_text,wikidata_search_entries,h1,sb_answer,options_given,ix,content_en,sb_option_given,index_of_option_given


In [22]:
def create_regex_pattern(term):
    """
    Define a function that creates a regex pattern with word boundaries in order to avoid
    false positives when searching for an entity inside a text. Examples of false positives are:
    entity: 'us' (as united states)
    false positive: `museum`, which contains `us` in the text.
    """
    return r'\b' + re.escape(term) + r'\b'

def get_sentences_containing_entity(row):
    """
    Function to be used over a DataFrame.
    ---------
    For an article text which is in the DataFrame column named `content_en`
    extract all sentences that mention an entity contained in the DataFrame column named `text`
    """
    # article text
    content = row['content_en']
    # the entity
    ent = row['text']
    # break text into sentences
    sents_srs = pd.Series(nltk.sent_tokenize(content))
    # create regex pattern with word boundaries
    reg = create_regex_pattern(ent)
    # find sentences containing entity
    sents_containing_ent = sents_srs[
        sents_srs.str.contains(reg, regex=True)
    ].tolist()
    return sents_containing_ent
    
# this is a slow process and could be optimized
sb_results_df['sentence_mentions'] = sb_results_df.apply(
    get_sentences_containing_entity,
    axis=1
)
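
One likely source of the slowness is that the same article is split into sentences once per entity. A possible optimization, sketched below under that assumption, is to cache the split per article text. To keep the example runnable without downloading NLTK models, it uses a naive regex splitter as a stand-in for `nltk.sent_tokenize`; the names `cached_sentences` and `sentences_mentioning` are hypothetical.

```python
import re
from functools import lru_cache


def naive_sent_tokenize(text):
    """Stand-in splitter for this sketch; the notebook uses nltk.sent_tokenize."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


@lru_cache(maxsize=1024)
def cached_sentences(content):
    """Split an article once and reuse the result for every entity in it."""
    return tuple(naive_sent_tokenize(content))


def sentences_mentioning(content, entity):
    """Return the sentences that contain the entity as a whole word."""
    pattern = re.compile(r"\b" + re.escape(entity) + r"\b")
    return [s for s in cached_sentences(content) if pattern.search(s)]


article = "The US team visited a museum. Fans in Tokyo cheered. The US won."
print(sentences_mentioning(article, "US"))
print(sentences_mentioning(article, "Tokyo"))
# The article was split only once, even though two entities were looked up.
print(cached_sentences.cache_info())
```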

In [23]:
sb_results_df['label_desc_options'] = sb_results_df['wikidata_search_entries'].apply(
    lambda x: [i.get('label_desc') for i in x]
)

In [24]:
sb_results_df['sents_len'] = sb_results_df['sentence_mentions'].apply(len)

In [25]:
# drop observations with no sentence mentions
sb_results_df.drop(sb_results_df[sb_results_df['sents_len']==0].index, inplace=True)

In [26]:
sb_results_df = sb_results_df.reset_index(drop=True).reset_index()

In [27]:
# explode by sentence mentions, so that each entity has only one sentence mention
sb_results_df_exp = sb_results_df.set_index(
    ['index']
).explode(
    'sentence_mentions'
).reset_index()
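
A toy illustration of what the explode step does: an entity with two sentence mentions becomes two rows that share the same `index`, which keeps them traceable to the original observation. The values below are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "index": [0, 1],
        "text": ["Tokyo", "Treasury"],
        "sentence_mentions": [["Sentence A.", "Sentence B."], ["Sentence C."]],
    }
)

# Same chain of calls as in the cell above, applied to the toy frame.
exploded = toy.set_index(["index"]).explode("sentence_mentions").reset_index()
print(exploded)
```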

In [28]:
sb_results_df_exp['index_of_option_given'] = (sb_results_df_exp['index_of_option_given'].astype(int) - 1)

In [29]:
final_df = sb_results_df_exp[
    ['index', 'text', 'index_of_option_given', 'sentence_mentions', 'label_desc_options']
].copy()

# Encode Data

In this part, we create the input texts, as well as the labels that will be used to train the BERT model. We make sure that the input text does not exceed 512 tokens (max number of tokens that BERT accepts).

Example of an input text:

Is "Andres Manuel Lopez Obrador" in the context of: "President Andres Manuel Lopez Obrador's call for the next scheduled on 27 November ...", referring to [SEP] "Andrés Manuel López Obrador, President of Mexico since 2018"?

The sentence in which the entity is mentioned might be too long, resulting in input texts with more than 512 tokens. In this case, we shorten the sentence by keeping only the text surrounding the entity mention, and we keep shrinking that window until the input text no longer exceeds 512 tokens.

In [34]:
# Load the tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [30]:
import string

def build_query_for_bert_ned(
    entity_text, sentence_mention, option
):
    """
    Create a query string that will be fed to BERT.
    -------
    - entity_text: str: Named entity.
    - sentence_mention: str: Sentence where the entity is being mentioned (context).
    - option: str: The wikidata option.
    """
    query = " Is '{entity_mention}' in the context of: '{sentence_mention}', referring to [SEP] {option}?".format(
        entity_mention=entity_text,
        sentence_mention=sentence_mention,
        option=option
    )
    return query

def join_text_tokens(tokens):
    """
    Join a tokenized text. Punctuation marks are not preceded by a space, while all other tokens are
    preceded by a space.
    """
    text = ''.join(
        [
            tokens[i] if tokens[i] in string.punctuation else ' ' + tokens[i] for i in range(len(tokens))
        ]
    ).strip()
    return text

def find_surrounding_text(target_word, sentence, n_tokens=5):
    """
    Find text in the vicinity of an entity mention.
    -----------------------
    - target_word: str: Named entity
    - sentence: str: The whole sentence where the entity is being mentioned
    - n_tokens: int: Number of tokens surrounding the `target_word`. n_tokens + target_word + n_tokens
    """
    start_indices = [
        m.start() for m in re.finditer(
            r'\b' + re.escape(target_word) + r'\b', 
            sentence, 
            flags=re.IGNORECASE)
    ]
    
    surrounding_words = []
    for ix in start_indices:
        before_txt = sentence[:ix]
        before_tkns = nltk.word_tokenize(before_txt)
        before_n_tokens = before_tkns[-n_tokens:]
        after_txt = sentence[ix+len(target_word):]
        after_tkns = nltk.word_tokenize(after_txt)
        after_n_tokens = after_tkns[:n_tokens]
        all_tokens = before_n_tokens + [target_word] + after_n_tokens 
        surrounding_txt = join_text_tokens(all_tokens)
        surrounding_words.append(surrounding_txt)
    return surrounding_words

In [32]:
final_dataset = final_df.to_dict(orient='records')

In [37]:
# Create data set of inputs less than 512 tokens

new_final_ds = []
total_exs = len(final_dataset)
for e, example in enumerate(final_dataset):
    print_progress_bar(iteration=e, total=total_exs)
    # For each option, we'll need to format the text in a way that BERT can understand.
    # This usually involves concatenating the mention sentence(s) with the option text,
    # separated by the [SEP] token, and starting with a [CLS] token.
    for i, option in enumerate(example['label_desc_options']):
        new_entry = dict()
        entity_mention = example['text']
        new_entry['entity_mention'] = entity_mention
        sentence_mention = example['sentence_mentions']
        option_ix = example['index_of_option_given']

        # n_tokens surrounding text in case that input text exceed 512 tokens
        n_tokens = 30
        while True: # do this until encoded input text doesn't exceed 512 tokens
            # input text
            query = build_query_for_bert_ned(
                entity_mention, sentence_mention, option
            )
            # try encoding the query
            encoded_dict = tokenizer.encode_plus(
                query,                           # Sentence to encode.
                add_special_tokens = True,       # Add '[CLS]' and '[SEP]'
                #max_length = 512,                # Pad & truncate all sentences.
                #padding='max_length',            # Make sure this applies padding as needed
                #truncation=True,
                #return_attention_mask = True,    # Construct attention masks.
                #return_tensors = 'pt',           # Return pytorch tensors.
            )
            if len(encoded_dict['input_ids'])<=512:
                # encoded query did not exceed 512 tokens
                new_entry['bert_qry'] = query
                new_entry['sentence_mention'] = sentence_mention
                new_entry['option'] = option
                new_entry['label'] = int(i == option_ix)
                break # break the while true
            
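            # encoded query exceeded 512 tokens, keep the while loop

            if n_tokens <= 0:
                # the context cannot be shortened further (the option text alone
                # is too long), so give up on this option instead of looping forever
                break

            # find the text around the entity in order to shorten the input text
            sentence_mentions = find_surrounding_text(
                target_word=entity_mention, 
                sentence=sentence_mention, 
                n_tokens=n_tokens
            )
            sentence_mention = 'None' if len(sentence_mentions)==0 else sentence_mentions[0]
            # decrease the number of tokens around the text by 5 tokens.
            n_tokens-= 5
        
        # keep only options whose query fits within 512 tokens
        if 'bert_qry' in new_entry:
            new_final_ds.append(new_entry)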

Token indices sequence length is longer than the specified maximum sequence length for this model (553 > 512). Running this sequence through the model will result in indexing errors




In [None]:
new_final_df = pd.DataFrame.from_dict(new_final_ds)

In [None]:
# save the data set containing input texts and labels.
new_final_df.to_json("datasets/dataset_for_bert_fine_tune_shortened.json", orient="records")
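
Before fine-tuning, it may be worth checking the saved records. The sketch below counts positive labels per (entity, sentence) pair and computes the share of positive examples. It is only a rough check under stated assumptions: the entries do not store the original observation `index`, and a shortened sentence can differ between options of the same mention, so a group without exactly one positive is a flag to inspect rather than proof of an error. The records here are made up and `positives_per_group` is a hypothetical helper.

```python
from collections import defaultdict


def positives_per_group(records):
    """Count positive labels for each (entity_mention, sentence_mention) pair."""
    counts = defaultdict(int)
    for rec in records:
        key = (rec["entity_mention"], rec["sentence_mention"])
        counts[key] += rec["label"]
    return dict(counts)


# Toy records with the same keys as the entries built above.
toy_records = [
    {"entity_mention": "Tokyo", "sentence_mention": "Fans in Tokyo cheered.",
     "option": "Tokyo, capital and largest city of Japan", "label": 1},
    {"entity_mention": "Tokyo", "sentence_mention": "Fans in Tokyo cheered.",
     "option": "University of Tokyo, National university in Tokyo, Japan", "label": 0},
]

counts = positives_per_group(toy_records)
# Groups that do not have exactly one positive option deserve a closer look.
groups_to_inspect = {k: v for k, v in counts.items() if v != 1}
positive_rate = sum(r["label"] for r in toy_records) / len(toy_records)
print(groups_to_inspect, positive_rate)
```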