## BERTuality
 
In this repository, we explore a way to check whether textual data is still up to date using the "BERT" language model, and to generate up-to-date data where it is not. Our concept is called BERT-actuality (=BERTuality). It is a pipeline that works around a key limitation of a pre-trained language model, namely that its knowledge is frozen at training time, by collecting current data and predicting a correct, current value for the value being queried.


<b>Why is this important?</b> There are over 6.6 million English-language Wikipedia articles on the Internet. In the USA alone, over 5,000 new news articles are published every day. Since not every source is constantly updated, a large amount of outdated information exists. In this context, timeliness is the property of information being relevant to the present. Outdated information no longer matches the current conditions of the real world, so for our purposes it is misinformation. Using it can lead to misunderstandings and to economic or reputational damage. It is therefore of great interest to both individuals and businesses to obtain and use only current information. We describe a best practice for keeping the information that BERT provides current.



<b>How do we achieve it?</b> BERTuality generates a current value for the [MASK] token in a sentence of the form "Prime Minister [MASK] is the leader of Japan". In this case, the result is the word "Kishida" (as of February 2023). The prediction is made by the BERT language model after it has been made aware of the current state of the world. In the first step, BERTuality analyzes the outdated information and systematically extracts up-to-date information from various data sources. In the second step, this information is split into individual sentences and transformed into a form suited to BERT. We present procedures that use these sentences to make BERT aware of timeliness. In the third step, BERT is used to generate a correct and up-to-date prediction.

## 1. Dependencies

To keep this notebook readable and simple, most of the code lives outside it, and we import the functions we need in the chapter that uses them. For now, we import the necessary dependencies:

In [1]:
from bertuality import BERTuality
from bertuality import BERTuality_loader
from bertuality import BERTuality_quickstart

## 2. The Basics needed

## 3. BERTuality

In this chapter, we go over the BERTuality pipeline and introduce its structure and the functionality of its individual modules. A schematic overview of the process follows:

TODO - Import image of pipeline

<b>Description:</b> First, the masked sentence is passed to the pipeline, which extracts all relevant keywords from it. Based on these keywords, relevant data is searched for in any data source. The data found then passes through the textual data preparation and query module, which filters out the best input sentences. Based on these prepared input sentences, a suitable and up-to-date token is predicted for the [MASK] token of the masked sentence.
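
The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not the BERTuality implementation: `select_sentences`, `predict_mask`, and `toy_fill_mask` are hypothetical names, the keyword list is given by hand, the "model" is a simple stand-in for BERT, and combining the per-sentence answers by majority vote is an assumption.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def select_sentences(keywords: List[str], sentences: List[str],
                     min_hits: int = 2, max_input: int = 30) -> List[str]:
    """Keep sentences that mention at least `min_hits` keywords (hypothetical rule)."""
    wanted = {kw.lower() for kw in keywords}
    selected = []
    for sentence in sentences:
        words = {w.lower() for w in re.findall(r"[A-Za-z0-9]+", sentence)}
        if len(wanted & words) >= min_hits:
            selected.append(sentence)
    # Cap the number of inputs handed to the model.
    return selected[:max_input]


def toy_fill_mask(context: str, masked_sentence: str) -> Optional[str]:
    """Stand-in for BERT: first capitalized context word absent from the masked sentence."""
    known = {w.strip(".,").lower() for w in masked_sentence.split()}
    for word in context.split():
        cleaned = word.strip(".,")
        if cleaned[:1].isupper() and cleaned.lower() not in known:
            return cleaned
    return None


def predict_mask(masked_sentence: str, contexts: List[str],
                 fill_mask: Callable[[str, str], Optional[str]]) -> Optional[str]:
    """Ask the model once per context sentence and return the most frequent answer."""
    votes = Counter(fill_mask(context, masked_sentence) for context in contexts)
    votes.pop(None, None)  # ignore contexts that produced no answer
    if not votes:
        return None
    return votes.most_common(1)[0][0]


masked = "Tim Cook is the current CEO of [MASK]."
keywords = ["Tim", "Cook", "CEO"]  # assumed output of the keyword step
sentences = [
    "Apple CEO Tim Cook announced new products.",
    "Tim Cook, CEO of Apple, visited Berlin.",
    "The weather in Berlin was sunny.",
]

contexts = select_sentences(keywords, sentences)
prediction = predict_mask(masked, contexts, toy_fill_mask)
filled = masked.replace("[MASK]", prediction)
print(filled)
```

In the real pipeline the stand-in would be replaced by a masked-language-model call; the sketch only shows how keywords, filtered sentences, and the final prediction relate to each other.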


### 3.1 Keyword Extraction

First, the keywords are extracted from the [MASK] sentence. They are used in the next steps to search for relevant information. For the search to work well, the list of keywords should contain only the most relevant words.
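
As a rough illustration of this step, the sketch below drops the [MASK] token and common stop words and keeps the remaining words in order. The function name `extract_keywords` and the short stop-word list are assumptions made for this example; the extraction method actually used by BERTuality may differ.

```python
import re
from typing import List

# Small illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "in",
    "is", "of", "on", "the", "to", "was", "with",
}


def extract_keywords(masked_sentence: str) -> List[str]:
    """Return the non-stop-word tokens of a masked sentence, in order, without duplicates."""
    # Remove the mask placeholder so it is never treated as a keyword.
    text = masked_sentence.replace("[MASK]", " ")
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9'-]*", text)
    keywords = []
    seen = set()
    for token in tokens:
        key = token.lower()
        if key in STOP_WORDS or key in seen:
            continue
        seen.add(key)
        keywords.append(token)
    return keywords


print(extract_keywords("Tim Cook is the current CEO of [MASK]."))
# prints ['Tim', 'Cook', 'current', 'CEO']
```

A stop-word filter is the simplest possible choice here. It keeps words such as "current" that carry little search value, which is why a stricter selection (for example, keeping only nouns and proper names) may give a more focused keyword list.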

### 3.2 Loading the data

### 3.3 Preparing the data

### 3.4 Finding optimal sentences

### 3.5 Predictions with BERTuality

## 4. Results & Playground

In [2]:
BERTuality_quickstart.load_default_config()

{'model': 'bertuality/model',
 'tokenizer': 'bertuality/tokenizer',
 'from_date': '2023-01-20',
 'to_date': '2023-04-20',
 'use_NewsAPI': True,
 'use_guardian': False,
 'use_wikipedia': True,
 'subset_size': 2,
 'sim_score': 0.3,
 'focus_padding': 6,
 'duplicates': False,
 'extraction': True,
 'similarity': True,
 'focus': True,
 'max_input': 30,
 'threshold': 0.9,
 'only_target_token': True}

In [3]:
BERTuality_quickstart.bertuality("Tim Cook is the current CEO of [MASK].")


Step 1: Load config --> Done
Step 2: Load latest data --> Done
Step 3: Prepare data --> Done
Step 4: Start Prediction:

Tim Cook is the current CEO of [MASK].

    WordPiece Prediction:
    Input Size: 30
    Progress --> 3.33 %
    Progress --> 6.67 %
    Progress --> 10.0 %
    Progress --> 13.33 %
    Progress --> 16.67 %
    Progress --> 20.0 %
    Progress --> 23.33 %
    Progress --> 26.67 %
    Progress --> 30.0 %
    Progress --> 33.33 %
    Progress --> 36.67 %
    Progress --> 40.0 %
    Progress --> 43.33 %
    Progress --> 46.67 %
    Progress --> 50.0 %
    Progress --> 53.33 %
    Progress --> 56.67 %
    Progress --> 60.0 %
    Progress --> 63.33 %
    Progress --> 66.67 %
    Progress --> 70.0 %
    Progress --> 73.33 %
    Progress --> 76.67 %
    Progress --> 80.0 %
    Progress --> 83.33 %
    Progress --> 86.67 %
    Progress --> 90.0 %
    Progress --> 93.33 %
    Progress --> 96.67 %
    Progress --> 100.0 %

Tim Cook is the current CEO of Apple.
