# BERTuality

 <p style='text-align: justify;'>In this repository, we explore a way to check the topicality of textual data using the "BERT" language model and to generate data that reflects the current state of the world. Our concept is called BERT-actuality (=BERTuality). It is a pipeline that overcomes the limitations of a pre-trained large language model by collecting up-to-date data and using it to predict the correct, current value for a masked token.
<br><br>
<b>Why is this important?</b> There are over 6.6 million English-language Wikipedia articles on the Internet. In the USA alone, over 5,000 new news articles are published every day. Since not every source is updated constantly, a large amount of outdated information exists. In this context, timeliness is the property of information being relevant to the present. Outdated information no longer corresponds to current real-world conditions because it has been overtaken by the passage of time. In our setting, outdated information is therefore a form of misinformation and poses a danger: using it can lead to misunderstandings and to economic or reputational damage. It is thus of great interest to both individuals and businesses to obtain and use only current information. We describe an approach for keeping the information that BERT provides current.
<br><br>
<b>How do we achieve it?</b> BERTuality generates a current value for the [MASK] token for a sentence of the form "Prime Minister [MASK] is the leader of Japan". In this case, the result is the word "Kishida" (as of February 2023). To predict the word for the [MASK] token, the language model BERT, which was previously sensitized to the current state, is used. In the first step, BERTuality analyzes the outdated information and systematically extracts up-to-date information from various data sources. In the second step, this information is split into individual sentences and transformed into an optimal form for BERT. Procedures are presented that allow BERT to be made aware of timeliness using these sentences. In the third step, BERT is used to generate a correct and up-to-date prediction.</p>


# 1. Dependencies

<p style='text-align: justify;'>To keep this notebook readable and simple, we will keep most of the code out of this notebook and use all the necessary functions from our bertuality package as we go through the chapters. If you need more information about the implementation, you will need to refer to the files in the bertuality package imported here. But for now, we will import all the necessary dependencies:</p>


In [1]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import BertTokenizer, BertForMaskedLM, pipeline
from bertuality import BERTuality
from bertuality import BERTuality_loader
from bertuality import BERTuality_quickstart

import pandas as pd
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 18)

import os

<p style='text-align: justify;'>We also need to define a path to our BERTuality package and load our pre-saved, pre-trained model. This model is identical to "bert-base-uncased". Although models can also be loaded directly from Hugging Face, we have saved a copy in the package files.</p>


In [2]:
path = os.path.join(os.getcwd(), "bertuality")

In [3]:
model = BertForMaskedLM.from_pretrained(os.path.join(path, "model"))
tokenizer = BertTokenizer.from_pretrained(os.path.join(path, "tokenizer"))

# 2. The Basics needed

<p style='text-align: justify;'>The training data of BERT includes books with 800 million words and Wikipedia articles with 2.5 billion English words. In the original research work, the BERT core model is trained with two self-supervised learning tasks, including the Masked Language Model. In this pre-training method, some tokens are masked by [MASK] tokens. The model is then trained to predict the masked tokens by determining the context from the surrounding tokens.
To predict the [MASK] tokens, we use a pipeline from the transformers library. Our goal is to use BERT to make both contextually correct and topical statements about a previously masked [MASK] token.</p>


## 2.1 Problems with pre-trained models

<p style='text-align: justify;'>To take a closer look at the predictions of a pre-trained model, we first need to define a pipeline. This can be done easily with the Hugging Face models and APIs. The pipeline below can be used for so-called masked predictions, where the [MASK] token is predicted by the model.</p>


In [4]:
pre_trained_pipeline = pipeline("fill-mask", model=model, tokenizer=tokenizer)

<p style='text-align: justify;'>Below are the predictions for two [MASK] sentences, each containing a [MASK] token.</p>


In [5]:
# Prediction for our first sentence:

test_sentence_1 = "Paris is the capital of [MASK]."
prediction_1 = pre_trained_pipeline(test_sentence_1)

prediction_1_df = pd.DataFrame(prediction_1)
prediction_1_df[:1]

Unnamed: 0,sequence,score,token,token_str
0,paris is the capital of france.,0.950923,2605,f r a n c e


<p style='text-align: justify;'>The prediction for the first sentence is correct. Since this information is not time-sensitive, it does not change over time. So we can say that BERT "knows" this information from its training data and predicts it correctly.</p>


In [6]:
# Prediction for our second sentence:

test_sentence_2 = "Tim Cook is the CEO of [MASK]."
prediction_2 = pre_trained_pipeline(test_sentence_2)

prediction_2_df = pd.DataFrame(prediction_2)
prediction_2_df[:1]

Unnamed: 0,sequence,score,token,token_str
0,tim cook is the ceo of amazon.,0.007595,9733,a m a z o n


<p style='text-align: justify;'>However, this is not always the case. If a piece of information is time-sensitive, it may become obsolete as time passes. For example, the data used to train BERT may become outdated after training. BERT then has an outdated state of knowledge and cannot make a current prediction. In addition, some information is not included in the training data of the pre-trained BERT model at all, so no up-to-date prediction can be made for it.
<br><br>
To get a correct prediction for time-sensitive information as well, BERT needs to be made aware of updates. In the following, we look at a way to accomplish this.</p>


## 2.2 Approaches to creating sensitivity to topicality

<p style='text-align: justify;'>There are several approaches to making BERT aware of a current context. Essentially, the goal is to "teach" BERT new content that is unknown due to the time and topic constraints of the training data. In our case of actuality awareness, BERT is taught the current state of actuality-related information.
<br><br>
One of these approaches is called Priming and is the idea that exposure to one stimulus may influence a response to a subsequent stimulus, without conscious guidance or intention. The priming effect refers to the positive or negative effect of a rapidly presented stimulus (priming stimulus) on the processing of a second stimulus (target stimulus) that appears shortly after.
<br><br>
Priming can also be used with BERT to influence the prediction of a [MASK] token. To do this, the word or phrase that is to be used to influence BERT is concatenated with the [MASK] phrase. In the case of topicality sensitization, the priming may look like the following:</p>


In [7]:
# Primed prediction for our second sentence with additional information

add_information = "He has been the chief executive officer of Apple Inc. since 2011."

primed_sentence =  test_sentence_2 + " " + add_information
primed_prediction = pre_trained_pipeline(primed_sentence)

primed_prediction_df = pd.DataFrame(primed_prediction)
primed_prediction_df[:1]

Unnamed: 0,sequence,score,token,token_str
0,tim cook is the ceo of apple. he has been the chief executive officer of apple inc. since 2011.,0.941625,6207,a p p l e


<p style='text-align: justify;'>This shows that concatenating additional information (=input sentences) with the [MASK] sentence affects the prediction of the [MASK] token. Without priming, BERT predicted that Tim Cook is the CEO of Amazon. With priming, Apple is predicted. Priming can influence BERT's prediction even through the mere presence of individual words before or after the [MASK] sentence.</p>
<p style='text-align: justify;'>It follows that BERT's predictions can be influenced by adding more information. In the following, we use this principle for our BERTuality pipeline, which is able to predict current information even though the model itself does not have a current knowledge base.</p>

# 3. BERTuality

<p style='text-align: justify;'>In this chapter, we go over the BERTuality pipeline and introduce its structure and the functionality of the individual modules. The following figure gives a schematic overview of the process:</p>


<img src="references/bertuality_pipeline.jpg" width="60%">

<p style='text-align: justify;'><b>Description:</b> First, the masked sentence is passed to the pipeline to extract all relevant keywords from it. Based on the keywords, relevant data is searched in an arbitrary data source. Then, the found input data passes through the text data preparation and query module to filter out optimal input sentences. Based on these prepared input sentences, a suitable and up-to-date token for the [MASK] token is predicted for the masked sentence.</p>


## 3.1 Keyword Extraction

<p style='text-align: justify;'>First, the keywords are extracted from the [MASK] sentence, which will be used in the next steps to search for the relevant information. For an optimal search it is of interest that the list of keywords contains only the most relevant words.
<br><br>
To extract suitable keywords, part-of-speech (POS) tagging is used. POS tagging assigns a POS tag to each word according to its definition and context. The POS tags correspond to the Penn Treebank tag set, which includes tags for word types such as verbs, adjectives, and nouns.</p>
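
<p style='text-align: justify;'>As a rough illustration of tag-based filtering, the sketch below keeps only tokens whose Penn Treebank tag marks a noun or an adjective. The tags are assigned by hand so that the example runs without a tagger. The function name <code>keywords_from_tags</code> and the chosen tag set are assumptions for illustration; the tag selection inside <code>BERTuality.pos_keywords</code> is not shown in this notebook and may differ, so the result need not match the package output.</p>

```python
# Hypothetical sketch: keep tokens whose Penn Treebank tag is in an assumed tag set.
KEPT_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ"}  # assumption, not the package's actual set


def keywords_from_tags(tagged_tokens, kept_tags=KEPT_TAGS):
    """Return the tokens whose tag is in kept_tags, skipping the [MASK] placeholder."""
    return [tok for tok, tag in tagged_tokens if tag in kept_tags and tok != "[MASK]"]


# Hand-assigned tags for illustration only (a real tagger may label words differently).
tagged = [("Prime", "JJ"), ("minister", "NN"), ("[MASK]", "NNP"), ("is", "VBZ"),
          ("the", "DT"), ("leader", "NN"), ("of", "IN"), ("Japan", "NNP"), (".", ".")]

print(keywords_from_tags(tagged))
```

<p style='text-align: justify;'>With this assumed tag set the noun "leader" is kept as well, which shows how strongly the keyword list depends on which tags are allowed.</p>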


In [8]:
# a simple test sentence for exploring the bertuality pipeline

mask_sentence = "Prime minister [MASK] is the leader of Japan."

In [9]:
keywords = BERTuality.pos_keywords(mask_sentence)

print('The keywords of "{}" are {}'.format(mask_sentence, keywords))

The keywords of "Prime minister [MASK] is the leader of Japan." are ['Prime', 'minister', 'Japan']


<p style='text-align: justify;'>Another possibility to extract relevant keywords from the masked sentence would be Named Entity Recognition, but this will not be discussed further here. Due to the very good results for us, the POS method for BERTuality is used in the following.</p>


## 3.2 Loading the data

<p style='text-align: justify;'>As mentioned before, the basic idea of BERTuality is that BERT is primed using temporally current input data to make a correct prediction for the [MASK] token. Thus, the quality of a prediction depends in particular on the data quality of a source. The reason for this is the later extraction of optimal sentences from the data source, which are used as input for BERT. At the moment, the BERTuality pipeline uses three freely available data sources from the Internet.</p>

These are APIs for: 
    <ul>
        <li><b>NewsAPI</b>, which primarily provides short descriptions of articles from over 80,000 sources, mostly one to two sentences in length.</li>
        <li><b>TheGuardian</b>, which offers detailed and longer journalistic articles with many sentences on current topics.</li>
        <li><b>Wikipedia</b>, which serves as a backup in case no current news articles can be found on a particular topic.</li>
    </ul>

<p style='text-align: justify;'>The keywords generated by POS tagging are concatenated into a single string using AND operators in this step. The created string is passed as a search query to a function that coordinates the search across the different APIs. Afterwards, the text data is extracted from all search results of the found sources with the help of web scrapers. The quality of the selected keywords is important here, as suitable sources are searched exclusively on the basis of them. In this step, special care must be taken to ensure that only current data from a specific time period is obtained.</p>
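
<p style='text-align: justify;'>The query construction described above can be sketched in one line. The helper name <code>build_search_query</code> and the exact spacing around the operator are assumptions; the package may format the string differently for each API.</p>

```python
def build_search_query(keywords):
    """Join the keywords with the AND operator into a single query string."""
    return " AND ".join(keywords)


query = build_search_query(["Prime", "minister", "Japan"])
print(query)
```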



In [10]:
# The APIs can be called individually with their own functions, but for simplicity we call the general news_loader function:

textual_data = BERTuality_loader.news_loader(from_date="2023-02-01", 
                                             to_date="2023-03-01", 
                                             keywords=keywords, 
                                             use_NewsAPI=True, 
                                             use_guardian=True, 
                                             use_wikipedia=True)

In [11]:
# Information about NewsAPI:

news_api = textual_data[0]

print(f"Found {len(news_api)} articles for NewsApi:")
      
show_articles = 3   
for i in range(show_articles):
    print(f"\nArticle Nr. {i}:")
    print(news_api[i])

Found 100 articles for NewsApi:

Article Nr. 0:
JAXA second attempt at launching the H3 rocket has ended up becoming a major setback for Japan space ambitions. While the rocket was able to leave the launch pad the country space authorities were forced to activate its flight termination system a few.

Article Nr. 1:
BBC correspondents explain why historic rivals are trying to rebuild trust and who stands to gain.

Article Nr. 2:
The number of births registered in Japan plummeted to another record low last year the latest worrying statistic in a decadeslong decline that the country authorities have failed to reverse despite their extensive efforts.


In [12]:
# Information about The Guardian:

guardian = textual_data[1]

print(f"Found {len(guardian)} articles for The Guardian:")
     
show_articles = 1
show_percentage = 0.2
for i in range(show_articles):
    print(f"\nArticle Nr. {i}:")
    print(guardian[i][:int(len(guardian[i]) * show_percentage)], "...")

Found 21 articles for The Guardian:

Article Nr. 0:
Officials meet in Tokyo to discuss concerns at China cooperation with Russia and Japan military buildup Chinese and Japanese officials met in Tokyo on Wednesday for formal security talks for the first time in four years in a meeting aimed at stabilising increasingly strained relations. In Japan national security strategy released in December China was described as the greatest strategic challenge to Japan peace and security. Both sides expressed concerns at Wednesday meeting. China said it was troubled by Japan military buildup while Tokyo is worried about China suspected use of spy balloons as well as Chinese  ...


In [13]:
# Information about Wikipedia - Wikipedia articles contain only the description 

wikipedia = textual_data[2]

print(f"Found {len(wikipedia)} article for Wikipedia:")
      
show_articles = 1  
for i in range(show_articles):
    print(f"\nArticle Nr. {i}:")
    print(wikipedia[i])

Found 1 article for Wikipedia:

Article Nr. 0:
The Prime Minister of Japan is the chief minister of the government of Japan and the head of the Japanese Cabinet. This is a list of prime ministers of Japan from when the first Japanese prime minister Itō Hirobumi took office in 1885 until the present day. 32 prime ministers under the Meiji Constitution had a mandate from the Emperor. The electoral mandates shown are for the House of Representatives lower house of the Imperial Diet that was not constitutionally guaranteed to have any influence on the appointment of the prime minister. Currently the prime minister under the Constitution of Japan shall be designated from among the members of the National Diet and shall be appointed by the Emperor after being nominated by the National Diet. The incumbent prime minister is Fumio Kishida.


## 3.3 Preparing the data

<p style='text-align: justify;'>In order to use the collected data, it must be brought into a consistent, clean form that is useful for BERT, because contaminated text can affect subsequent predictions for the [MASK] token. The cleanup is done by a pipeline function that consists of a variety of regular expressions, which filter out or transform anomalies and impurities. Regular expressions are well suited to preparing textual data because they make it possible to enforce a uniform text syntax through rules. All impurities that violate these rules, such as markup elements, acronyms, abbreviations, and formatting errors, are either replaced or transformed into a different form. This step is important because only a syntactically clean block of text can be broken down into error-free individual sentences and inserted into a list.</p>

<p style='text-align: justify;'>Note: Due to the structure of the news_loader and for simplification reasons, this step is automatically performed by the text_clean_up function when loading the articles. </p>
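
<p style='text-align: justify;'>To give an idea of what regex-based cleaning looks like, here is a minimal sketch with three rules. The function <code>clean_text_sketch</code> and its rules are illustrative assumptions and not the rule set of the package's text_clean_up function, which is considerably more extensive.</p>

```python
import re


def clean_text_sketch(text):
    """Apply a few illustrative clean-up rules (not the package's actual rule set)."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove markup tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


raw = "<p>Prime Minister Kishida visited   Ukraine.</p>\nMore at https://example.com/news"
print(clean_text_sketch(raw))
```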

<p style='text-align: justify;'>For further preparation, we also need to split the found articles into sentences and merge them into a one-dimensional list. For this step we will use the split and merge functions from our package. </p>
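
<p style='text-align: justify;'>Conceptually, splitting and merging turns a nested structure (sources, each with a list of articles) into one flat list of sentences. The sketch below uses a naive regular-expression splitter as a stand-in for a proper sentence tokenizer; both helper names are hypothetical, and the nested input layout is an assumption based on how the loaded data is indexed above.</p>

```python
import re
from itertools import chain


def split_sentences_sketch(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_sketch(sources):
    """Flatten sources -> articles -> sentences into one flat list of sentences."""
    return list(chain.from_iterable(
        split_sentences_sketch(article)
        for articles in sources
        for article in articles
    ))


sources = [
    ["First article. It has two sentences."],  # articles of one source
    ["Second source, single sentence."],       # articles of another source
]
merged = merge_sketch(sources)
print(len(merged), merged)
```

<p style='text-align: justify;'>A naive splitter like this breaks on abbreviations such as "U. S.", which is one reason why a trained sentence tokenizer is preferable in practice.</p>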


In [14]:
split_data = BERTuality.nltk_sentence_split(textual_data)
merged_data = BERTuality.merge_sentences(split_data)

print(f"Found a total of {len(merged_data)} possible input sentences.")

Found a total of 857 possible input sentences.


In [15]:
# below are all found sentences from our data sources

api_sentences = pd.DataFrame(merged_data, columns=["Merged API Sentences"])
api_sentences

Unnamed: 0,Merged API Sentences
0,JAXA second attempt at launching the H3 rocket has ended up becoming a major setback for Japan s...
1,While the rocket was able to leave the launch pad the country space authorities were forced to a...
2,BBC correspondents explain why historic rivals are trying to rebuild trust and who stands to gain.
3,The number of births registered in Japan plummeted to another record low last year the latest wo...
4,The longrange missile is Pyongyang fourth round of launches in a week ahead of a crucial summit.
...,...
852,This is a list of prime ministers of Japan from when the first Japanese prime minister Itō Hirob...
853,32 prime ministers under the Meiji Constitution had a mandate from the Emperor.
854,The electoral mandates shown are for the House of Representatives lower house of the Imperial Di...
855,Currently the prime minister under the Constitution of Japan shall be designated from among the ...


<p style='text-align: justify;'>In theory, the list of collected sentences could already be passed to BERT at this point to generate a prediction for the [MASK] token for each individual sentence. However, this should not be done, since the sentences from the data sources are still unfiltered. Unfiltered sentences may contain irrelevant information that worsens the prediction and drastically increases the processing time of the pipeline. Therefore, the number of sentences must be reduced to the most suitable ones in the next step of the pipeline.</p>


## 3.4 Finding optimal sentences

<p style='text-align: justify;'>In addition to the cleaning done during data preparation, the query module described in this chapter filters the data for optimal sentences and modifies them if necessary. The goal is to reduce the number of sentences to optimal input sentences. This should both increase performance and improve the quality of the prediction.</p>


### 3.4.1 Extraction Query

<p style='text-align: justify;'>We introduce the extraction query into the query module to remove irrelevant sentences from the dataset obtained so far. Only sentences that contain the keywords are considered. However, a sentence does not have to contain all keywords, since this would filter the data too strictly.</p>

<p style='text-align: justify;'>Instead, subsets of the keywords are formed, where the subset_size variable defines the minimum number of keywords a sentence must contain. A sentence is kept if it contains all keywords of at least one subset, and it may contain variations of a keyword. If a sentence matches none of the subsets, it is removed from the set of input sentences.</p>
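
<p style='text-align: justify;'>The idea can be sketched with <code>itertools.combinations</code>. The two helper names are hypothetical, and the case-insensitive substring test is an assumption that only approximates how keyword variations are matched; the package function additionally takes a tokenizer, so its matching is likely token-based.</p>

```python
from itertools import combinations


def create_subsets_sketch(keywords, subset_size):
    """All keyword combinations with exactly subset_size elements."""
    return [list(combo) for combo in combinations(keywords, subset_size)]


def matches_any_subset(sentence, subsets):
    """True if the sentence contains every keyword of at least one subset.

    Matching is a case-insensitive substring test, so 'Japanese' counts for 'Japan'.
    """
    text = sentence.lower()
    return any(all(kw.lower() in text for kw in subset) for subset in subsets)


keywords = ["Prime", "minister", "Japan"]
subsets = create_subsets_sketch(keywords, subset_size=2)
sentences = [
    "The incumbent prime minister is Fumio Kishida.",
    "BBC correspondents explain why historic rivals are trying to rebuild trust and who stands to gain.",
]
kept = [s for s in sentences if matches_any_subset(s, subsets)]
print(subsets)
print(kept)
```

<p style='text-align: justify;'>Only subsets of exactly subset_size are needed here: a sentence that contains more keywords than that automatically contains one of the smaller subsets as well.</p>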



In [16]:
# define a subset_size

subset_size=2

In [17]:
# and have a look at the created subsets of our keywords

subsets = BERTuality.create_subsets(keywords, subset_size=subset_size)

print(f"Keywords:      {keywords}\n")
for i, j in enumerate(subsets):
    print(f"   {i}. Subset = {j}")

Keywords:      ['Prime', 'minister', 'Japan']

   0. Subset = ['Prime', 'minister']
   1. Subset = ['Prime', 'Japan']
   2. Subset = ['minister', 'Japan']


In [18]:
# call the extraction query with the defined subset_size and filter for more optimal sentences 

extraction_query = BERTuality.filter_for_keyword_subsets(input_sentences=merged_data, 
                                                         keywords=keywords, 
                                                         tokenizer=tokenizer, 
                                                         subset_size=subset_size, 
                                                         duplicates=False)

print("Extraction Query:")
print(f"-> Found a total of {len(extraction_query)} input sentences.")
print(f"-> Reduced the amount of sentences by {100 - len(extraction_query) * 100 / len(merged_data):.2f} %")

Extraction Query:
-> Found a total of 80 input sentences.
-> Reduced the amount of sentences by 90.67 %


In [19]:
extraction_df = pd.DataFrame(extraction_query, columns=["Extraction Query"])
extraction_df

Unnamed: 0,Extraction Query
0,South Korean President Yoon Suk Yeol departed South Korea on Thursday for Tokyo to meet with Pri...
1,The other toasted a close friendship with MoscowOn Tuesday Japan prime minister laid a wreath fo...
2,Japanese Prime Minister Fumio Kishida began a surprise visit to Ukraine early Tuesday hours afte...
3,College football Mexican cola and muffins United Kingdom prime minister has plenty to talk about...
4,Sen. Mike Lee of Utah demanded that Prime Minister Fumio Kishida release a U. S. Navy lieutenant...
...,...
75,The electoral mandates shown are for the House of Representatives lower house of the Imperial Di...
76,Currently the prime minister under the Constitution of Japan shall be designated from among the ...
77,The incumbent prime minister is Fumio Kishida.
78,Prime Miniser Kishida is Japan first postwar leader to enter a war zone.


### 3.4.2 Similarity Query

<p style='text-align: justify;'>We introduce the similarity query into the query module to remove a further portion of the input sentences and thereby improve the prediction for the [MASK] token. Only sentences whose similarity to the [MASK] sentence exceeds a certain threshold are passed on as input to BERT. To determine a similarity score, we need a vector representation of the [MASK] sentence and of each input sentence. Cosine similarity is then computed between the vector of the [MASK] sentence and the vector of the input sentence. This is repeated for every input sentence, yielding a score between -1 and 1 for each. A value of 1 means that the two vectors point in the same direction (as for an identical sentence), 0 means that they are orthogonal (an unrelated sentence), and -1 means that they point in opposite directions. Input sentences with insufficient similarity are filtered out using a predefined threshold. In BERTuality this threshold is called similarity_score (=sim_score), and it should be a small value above 0 when used after the extraction query.</p>
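
<p style='text-align: justify;'>For reference, cosine similarity and the threshold filter can be written in a few lines of plain Python. The toy two-dimensional vectors below are made-up stand-ins for sentence embeddings, <code>similarity_filter_sketch</code> is a hypothetical helper, and keeping scores that are greater than or equal to the threshold is an assumption about the comparison.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_filter_sketch(mask_vec, candidates, sim_score):
    """Keep (score, sentence) pairs whose similarity to mask_vec is at least sim_score."""
    scored = [(cosine_similarity(mask_vec, vec), sent) for sent, vec in candidates]
    return [(score, sent) for score, sent in scored if score >= sim_score]


# Toy vectors standing in for sentence embeddings (made-up numbers).
mask_vec = [1.0, 0.0]
candidates = [("close", [0.9, 0.1]), ("unrelated", [0.0, 1.0]), ("opposite", [-1.0, 0.0])]
print(similarity_filter_sketch(mask_vec, candidates, sim_score=0.25))
```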


In [20]:
# call the similarity query to find similar sentences to our masked sentence

similarity_query_tuple = BERTuality.similarity_filter(mask_sentence, extraction_query, sim_score=0.25, return_tuples=True)

print("Similarity Query:")
print(f"-> Found a total of {len(similarity_query_tuple)} input sentences.")
print(f"-> Further reduced the amount of sentences by {100 - len(similarity_query_tuple) * 100 / len(extraction_query):.2f} %")

Similarity Query:
-> Found a total of 39 input sentences.
-> Further reduced the amount of sentences by 51.25 %


In [21]:
similarity_df = pd.DataFrame(similarity_query_tuple, columns=["Similarity Score", "Similarity Query"])
similarity_df

Unnamed: 0,Similarity Score,Similarity Query
0,0.267162,South Korean President Yoon Suk Yeol departed South Korea on Thursday for Tokyo to meet with Pri...
1,0.291097,The other toasted a close friendship with MoscowOn Tuesday Japan prime minister laid a wreath fo...
2,0.283896,Japanese Prime Minister Fumio Kishida began a surprise visit to Ukraine early Tuesday hours afte...
3,0.261317,College football Mexican cola and muffins United Kingdom prime minister has plenty to talk about...
4,0.437871,Japanese prime minister to show olidarity' with Ukraine in visit that coincides with Chinese lea...
...,...,...
34,0.555677,This is a list of prime ministers of Japan from when the first Japanese prime minister Itō Hirob...
35,0.509661,32 prime ministers under the Meiji Constitution had a mandate from the Emperor.
36,0.510935,Currently the prime minister under the Constitution of Japan shall be designated from among the ...
37,0.487961,The incumbent prime minister is Fumio Kishida.


In [22]:
# collect all sentences in a separate list - needed because we used return_tuples

similarity_query = [i[1] for i in similarity_query_tuple]

### 3.4.3 Focus Query

<p style='text-align: justify;'>We introduce the focus query into the query module to remove contextually unimportant information from sentences that are too long. Since there is usually only one short core information in a sentence that we need for BERTuality, it is important to extract just this core information from the sentence. Any additional information beyond this core may have a negative impact on the prediction result. Therefore, the focus query uses the previously created keywords that must be included in our sentence. We also assume that the information to be predicted must occur near the keywords we are looking for.</p>


<p style='text-align: justify;'>To separate irrelevant from relevant parts of the sentence, we first determine the position of the first occurrence of each keyword in the sentence. The span between the leftmost and rightmost of these positions forms our core information. In the next step, we truncate all words outside this span and classify them as irrelevant for now. As mentioned earlier, the relevant information is located near the keywords, which means that some of it may lie just outside the core. Therefore, after truncation, we append part of the original sentence to the core again. We refer to this appended part as the padding of the focus query. The padding is a value of our choice and specifies the number of words appended to the left and to the right.</p>
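
<p style='text-align: justify;'>The span-and-padding logic can be sketched as follows. The function <code>keyword_focus_sketch</code> is a hypothetical simplification: it matches keywords by case-insensitive prefix (an assumption) and leaves the words unchanged, whereas the package output shown below is additionally lowercased and normalized.</p>

```python
def keyword_focus_sketch(sentence, keywords, padding=2):
    """Cut a sentence down to the span covered by the keywords plus padding words."""
    words = sentence.rstrip(".").split()
    lowered = [w.lower() for w in words]

    # First position of each keyword; prefix matching lets 'japanese' count for 'japan'.
    positions = []
    for kw in keywords:
        for idx, word in enumerate(lowered):
            if word.startswith(kw.lower()):
                positions.append(idx)
                break

    if not positions:
        return sentence  # no keyword found: leave the sentence untouched

    start = max(min(positions) - padding, 0)
    end = min(max(positions) + padding, len(words) - 1)
    return " ".join(words[start:end + 1])


sentence = "Japanese Prime Minister Fumio Kishida began a surprise visit to Ukraine early Tuesday."
focused = keyword_focus_sketch(sentence, ["Prime", "minister", "Japan"], padding=2)
print(focused)
```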


In [23]:
# call the focus query to extract the core information from our input sentences

focus_query = BERTuality.keyword_focus(similarity_query, keywords, padding=2)

In [24]:
sim_len = sum([len(i) for i in similarity_query])
foc_len = sum([len(i) for i in focus_query])

print("Total sum of characters in the Similarity Query:", sim_len)
print("Total sum of characters in the Focus Query:     ", foc_len)
print(f"\nWe achieved a reduction of {sim_len - foc_len} characters in {len(focus_query)} sentences!")

Total sum of characters in the Similarity Query: 5923
Total sum of characters in the Focus Query:      2396

We achieved a reduction of 3527 characters in 39 sentences!


In [25]:
# The following are the optimal input sentences for our predictions with BERT

focus_df = pd.DataFrame(focus_query, columns=["Focus Query"])
focus_df

Unnamed: 0,Focus Query
0,meet with prime minister fumio kishida the first such summit on japan soil in.
1,moscowon tuesday japan prime minister laid a.
2,japan prime minister fumio kishida.
3,united kingdom prime minister has plenty.
4,japan prime minister to show.
...,...
34,list of prime ministers of japan from when the first japanese prime minister itō hirobumi.
35,32 prime minister under the.
36,currently the prime minister under the constitution of japan shall be.
37,the incumbent prime minister is fumio.


## 3.5 Predictions with BERTuality

<p style='text-align: justify;'>Finally, the [MASK] sentence and the actuality-related input sentences created by the BERTuality pipeline are passed to BERT. Ideally, the selected input sentences contain the maximum relevant content of all sentences from the data source.</p>


### 3.5.1 Input Prediction

<p style='text-align: justify;'>Passing the inputs to the fill-mask pipeline generates a prediction for the [MASK] token. This prediction may differ from the one made without any additional input. The reason is the priming effect described in chapter 2.2: BERT is primed with the input sentences produced by the BERTuality pipeline. Throughout the pipeline, our goal is therefore to find the most suitable input sentences, because the more similar an input sentence is to the [MASK] sentence, the higher the probability that it also contains the correct information to be predicted. To generate a prediction, the input sentences from the BERTuality pipeline are passed to BERT together with the [MASK] sentence, in the concatenated input form already explained.</p>

<p style='text-align: justify;'>To catch the case of a non-optimal input sentence, so that no wrong prediction is risked, all found sentences are used as inputs for BERT. Thus, the [MASK] token is predicted once for each of the included input sentences.</p>
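
<p style='text-align: justify;'>Building one primed input per sentence can be sketched as below. The helper <code>build_primed_inputs</code> is hypothetical; placing the [MASK] sentence first follows the priming example in chapter 2.2 and is an assumption about what happens inside the package.</p>

```python
def build_primed_inputs(mask_sentence, input_sentences):
    """Concatenate the masked sentence with each input sentence, one string per input."""
    return [f"{mask_sentence} {sentence}" for sentence in input_sentences]


primed = build_primed_inputs(
    "Prime minister [MASK] is the leader of Japan.",
    ["japan prime minister fumio kishida.", "the incumbent prime minister is fumio."],
)
for text in primed:
    print(text)
```

<p style='text-align: justify;'>Each of these strings contains exactly one [MASK] token, so each yields its own prediction.</p>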


In [26]:
# the make_predictions function is equivalent to the "fill-mask" pipeline from chapter 2, but also returns a DataFrame

input_prediction = BERTuality.make_predictions(mask_sentence, focus_query, model, tokenizer)

In [27]:
# let's have a look at the top prediction for each input sentence

input_prediction_df = input_prediction[["masked", "input", "token1", "score1"]]
input_prediction_df

Unnamed: 0,masked,input,token1,score1
0,Prime minister [MASK] is the leader of Japan.,meet with prime minister fumio kishida the first such summit on japan soil in.,abe,0.432130
1,Prime minister [MASK] is the leader of Japan.,moscowon tuesday japan prime minister laid a.,who,0.477845
2,Prime minister [MASK] is the leader of Japan.,japan prime minister fumio kishida.,he,0.422209
3,Prime minister [MASK] is the leader of Japan.,united kingdom prime minister has plenty.,tanaka,0.211324
4,Prime minister [MASK] is the leader of Japan.,japan prime minister to show.,who,0.815423
...,...,...,...,...
34,Prime minister [MASK] is the leader of Japan.,list of prime ministers of japan from when the first japanese prime minister itō hirobumi.,ito,0.944046
35,Prime minister [MASK] is the leader of Japan.,32 prime minister under the.,who,0.689149
36,Prime minister [MASK] is the leader of Japan.,currently the prime minister under the constitution of japan shall be.,tanaka,0.283642
37,Prime minister [MASK] is the leader of Japan.,the incumbent prime minister is fumio.,tanaka,0.448058


<p style='text-align: justify;'>To select a word from the list of predictions, the scores of all occurrences of the same word are summed. We call this sum the sum_up_score. It is the decision criterion for the [MASK] token: the word with the highest sum_up_score is selected. Nevertheless, the input prediction cannot produce every possible word for the [MASK] token, as the results below show.</p>
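
<p style='text-align: justify;'>The aggregation itself is a simple group-and-sum. The sketch below applies it to the first five (token1, score1) rows of the input prediction table above; the function name <code>sum_up_scores</code> is hypothetical.</p>

```python
from collections import defaultdict


def sum_up_scores(predictions):
    """Sum the scores of identical tokens and rank tokens by their total, highest first."""
    totals = defaultdict(float)
    for token, score in predictions:
        totals[token] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# First five (token1, score1) rows from the input prediction table above.
rows = [("abe", 0.432130), ("who", 0.477845), ("he", 0.422209),
        ("tanaka", 0.211324), ("who", 0.815423)]
ranking = sum_up_scores(rows)
print(ranking[0])
```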


In [28]:
# here you can have a look at a summary of the top predictions

BERTuality.simple_pred_results(input_prediction).head()

Unnamed: 0,Token,Frequency,max_score,min_score,mean_score,sum_up_score
0,who,10,0.815423,0.282997,0.557341,5.573406
1,he,10,0.424347,0.185562,0.347023,3.470225
2,abe,7,0.484453,0.115956,0.337288,2.361017
3,tanaka,7,0.448058,0.147092,0.255719,1.790031
4,making,1,0.979401,0.979401,0.979401,0.979401


### 3.5.2 The Token Problem

<p style='text-align: justify;'>The token problem, which causes some of BERTuality's incorrect predictions, stems from BERT's limited vocabulary and from the restriction to a single [MASK] token. The vocabulary consists of roughly 30,000 WordPieces. WordPieces make it possible to represent a large number of complicated tokens with a relatively compact vocabulary: a token that is not in the vocabulary is formed by concatenating WordPieces. For example, "kishida" can be split into "ki + ##shi + ##da". So that related WordPieces can be reassembled into a token, every WordPiece that continues a token is marked with a leading "##". The token problem arises because our [MASK] sentence contains exactly one [MASK] token, and we do not know in advance whether the searched token is split into WordPieces or into how many. Consequently, the simple input prediction can only predict tokens that consist of a single WordPiece.</p>
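<p style='text-align: justify;'>To make the splitting into WordPieces concrete, here is a minimal sketch of greedy longest-match-first tokenization, the general scheme used by WordPiece tokenizers. The function <code>wordpiece_tokenize</code> and the tiny vocabulary are assumptions for illustration only. The real BERT vocabulary is far larger, and this is not the code of the BERT tokenizer or of BERTuality.</p>

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into WordPieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"ki", "##shi", "##da", "##shida", "japan"}

with_long_piece = wordpiece_tokenize("kishida", toy_vocab)
without_long_piece = wordpiece_tokenize("kishida", toy_vocab - {"##shida"})
print(with_long_piece)     # ['ki', '##shida']
print(without_long_piece)  # ['ki', '##shi', '##da']
```

<p style='text-align: justify;'>In both cases the word needs more than one WordPiece, so a single [MASK] token cannot cover it.</p>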


In [29]:
example = "Kishida"

In [30]:
# the default way of tokenizing

print(example, "gets tokenized into", tokenizer.tokenize(example))

Kishida gets tokenized into ['ki', '##shida']


In [31]:
# the more powerful tokenizer of BERTuality, which breaks tokens down into the smallest form of WordPieces

print(example, "gets tokenized into", BERTuality.better_tokenizer(example, tokenizer))

Kishida gets tokenized into ['ki', '##shi', '##da']


### 3.5.3 WordPiece-Prediction

<p style='text-align: justify;'>The WordPiece prediction is an extension of the input prediction. It addresses the token problem for many texts by making tokens predictable that are not included in the vocabulary. For simplicity, a token that consists of several connected WordPieces is called a WordPiece-Token (=WPT) in the following. The WordPiece prediction chains the predictions of single WordPieces with the help of a blocklist. A blocklist contains modified versions of the original input sentence: each version is created by inserting one WordPiece of the WPT at the position that the WPT occupied in the input sentence. For a WPT made of n WordPieces, this yields a blocklist of length n. Such a blocklist is created for each WPT of a sentence and then passed to BERT for prediction.</p>


In [32]:
# below you can find an example for a blocklist:

sentence = focus_query[0]

# 1. Step: tokenize the sentence:
tokens = BERTuality.better_tokenizer(sentence, tokenizer)

# 2. Step: find every WordPiece position:
every_wp_position = BERTuality.wp_find_positions(tokens)

# 3. Step: create the blocklist for the sentence
wpw_input_sentences = BERTuality.wp_input_sentence_creator(tokens, every_wp_position, tokenizer)    

print(f'Create blocklist for sentence: "{sentence}"')
print('Found positions of WPT:', every_wp_position)

for nr, i in enumerate(every_wp_position):
    print(f'\nBlock Nr. {nr}')
    for j in i:
        print(f'  WP-Position {j}: {tokens[j]}')
        
print('\nSentences for Block Nr. 0')
for i in wpw_input_sentences[0]:
    print(' ', i)

Create blocklist for sentence: "meet with prime minister fumio kishida the first such summit on japan soil in."
Found positions of WPT: [[4, 5, 6], [7, 8, 9]]

Block Nr. 0
  WP-Position 4: fu
  WP-Position 5: ##mi
  WP-Position 6: ##o

Block Nr. 1
  WP-Position 7: ki
  WP-Position 8: ##shi
  WP-Position 9: ##da

Sentences for Block Nr. 0
  meet with prime minister fu the first such summit on japan soil in.
  meet with prime minister mi the first such summit on japan soil in.
  meet with prime minister o the first such summit on japan soil in.
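
<p style='text-align: justify;'>The blocklist construction can also be sketched without the BERTuality helpers. The functions below are hypothetical stand-ins written for this illustration. They are not the code of <code>wp_find_positions</code> or <code>wp_input_sentence_creator</code>, and details such as punctuation handling or the treatment of other WPTs in the same sentence are simplified assumptions.</p>

```python
def find_wpt_positions(tokens):
    """Group the indices of WordPieces that together form one WPT."""
    groups, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##"):
            if not current and i > 0:
                current = [i - 1]  # the piece before the first "##" starts the WPT
            current.append(i)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def join_pieces(tokens):
    """Rejoin WordPieces into words, e.g. ['ki', '##shi', '##da'] -> ['kishida']."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words


def build_block(tokens, group):
    """Create one sentence per WordPiece, placed where the WPT originally stood."""
    block = []
    for pos in group:
        piece = tokens[pos][2:] if tokens[pos].startswith("##") else tokens[pos]
        candidate = tokens[:group[0]] + [piece] + tokens[group[-1] + 1:]
        block.append(" ".join(join_pieces(candidate)))
    return block


# Tokens of a short input sentence, already split into the smallest WordPieces.
tokens = ["the", "incumbent", "prime", "minister", "is", "fu", "##mi", "##o"]

positions = find_wpt_positions(tokens)
block = build_block(tokens, positions[0])
print(positions)
for sentence in block:
    print(" ", sentence)
```

<p style='text-align: justify;'>Each sentence in the block differs only in the WordPiece that stands at the position of the WPT, so BERT sees every piece in the same surrounding context.</p>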


In [33]:
# Now let's move on to the predictions with WordPiece Prediction:

wp_prediction = BERTuality.word_piece_prediction(mask_sentence, focus_query, model, tokenizer, combine=False, threshold=0.8)

100%|██████████████████████████████████████████████████████████████████████████████████| 39/39 [02:53<00:00,  4.44s/it]


In [34]:
wp_prediction_df = wp_prediction[["masked", "input", "token1", "score1"]]
wp_prediction_df

Unnamed: 0,masked,input,token1,score1
0,Prime minister [MASK] is the leader of Japan.,0 meet with prime minister fu the first such summit on japan soil in. 1 meet with prime mi...,fumio,0.997911
1,Prime minister [MASK] is the leader of Japan.,0 meet with prime minister ki the first such summit on japan soil in. 1 meet with prime m...,kishida,0.996877
2,Prime minister [MASK] is the leader of Japan.,meet with prime minister fumio kishida the first such summit on japan soil in.,abe,0.432130
3,Prime minister [MASK] is the leader of Japan.,0 moscow tuesday japan prime minister laid a. 1 on tuesday japan prime minister laid a...,who,0.146832
4,Prime minister [MASK] is the leader of Japan.,moscowon tuesday japan prime minister laid a.,who,0.477845
...,...,...,...,...
88,Prime minister [MASK] is the leader of Japan.,0 the incumbent prime minister is fu. 1 the incumbent prime minister is mi. 2 the incu...,fumio,0.926627
89,Prime minister [MASK] is the leader of Japan.,the incumbent prime minister is fumio.,tanaka,0.448058
90,Prime minister [MASK] is the leader of Japan.,"0 prime mini is japan first postwar. 1 prime ser is japan first postwar. Name: input, dty...",meijijapan,0.166201
91,Prime minister [MASK] is the leader of Japan.,0 prime ki is japan first postwar. 1 prime shi is japan first postwar. 2 prime da is ...,kishitanaka,0.402761


In [35]:
wp_pred_simple = BERTuality.simple_pred_results(wp_prediction)
wp_pred_simple.head()

Unnamed: 0,Token,Frequency,max_score,min_score,mean_score,sum_up_score
0,kishida,15,0.997021,0.871143,0.981703,14.725542
1,fumio,15,0.997911,0.830478,0.976237,14.643549
2,who,13,0.815423,0.146832,0.542155,7.048012
3,he,11,0.424347,0.162035,0.330206,3.632261
4,abe,7,0.484453,0.115956,0.337288,2.361017


<p style='text-align: justify;'>The WordPiece prediction algorithm reconstructs each WPT from the individual predictions of its WordPieces. Because each WordPiece is inserted at the original position of the WPT, it keeps the same context relative to the rest of the sentence, thanks to the positional information within the embeddings.</p>


In [36]:
# let's have a look at our final prediction

print("Predicted [MASK] Token:       ", wp_pred_simple["Token"][0])
print("Final output for our Sentence:", mask_sentence.replace("[MASK]", wp_pred_simple["Token"][0]))

Predicted [MASK] Token:        kishida
Final output for our Sentence: Prime minister kishida is the leader of Japan.


# 4. Result & Playground

<p style='text-align: justify;'><b>Result:</b> BERTuality helps correct misinformation that takes the form of outdated information. In the example above, the WordPiece prediction yields the current value "kishida", whereas the simple input prediction ranks "who" highest. Because acting on misinformation is risky, this makes BERTuality useful wherever current information matters. The performance obtained so far is a starting point for further development of the method using the presented approaches.</p>

<p style='text-align: justify;'>Below this cell you can test BERTuality yourself. A default configuration is provided, but you can also define a custom configuration and experiment with it.</p>


In [37]:
defaults = BERTuality_quickstart.load_default_config()
values, keys = list(defaults.values()), list(defaults.keys())

pd.DataFrame(values, keys, columns=["Values"])

Unnamed: 0,Values
model,bertuality/model
tokenizer,bertuality/tokenizer
from_date,2023-01-26
to_date,2023-04-26
use_NewsAPI,True
use_guardian,True
use_wikipedia,True
subset_size,2
sim_score,0.25
focus_padding,6


In [38]:
# In the default configuration, to_date is always set to today and from_date is set to 90 days in the past.

BERTuality_quickstart.bertuality("Dara Khosrowshahi is the CEO of [MASK].")


Step 1: Load config -------> Done
Step 2: Load latest data --> Done
Step 3: Prepare data ------> Done
Step 4: Start Prediction:


100%|██████████████████████████████████████████████████████████████████████████████████| 21/21 [01:12<00:00,  3.43s/it]


Prediction: Dara Khosrowshahi is the CEO of Uber.
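
<p style='text-align: justify;'>The 90-day window mentioned in the comment above can be reproduced with the standard library. This is a minimal sketch under the assumption that both dates are stored as ISO strings, as in the configuration table. The function <code>default_date_window</code> is hypothetical and not part of <code>BERTuality_quickstart</code>.</p>

```python
from datetime import date, timedelta


def default_date_window(today=None, days_back=90):
    """Return (from_date, to_date) as ISO strings; to_date defaults to today."""
    today = today or date.today()
    from_date = today - timedelta(days=days_back)
    return from_date.isoformat(), today.isoformat()


# Reproduces the window shown in the default configuration table.
window = default_date_window(date(2023, 4, 26))
print(window)  # ('2023-01-26', '2023-04-26')
```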




