# Is Named Entity Recognition a bottleneck in masking entities?
# Does paraphrasing remove too many wiki pages?

To mask Wikipedia texts, an NER model is used (dslim/bert-base-NER). Of 1000 random Wikipedia articles about personalities, about half are lost before further processing:
-> some are lost during acquisition: wikiquery -> title -> lookup in the dataset (some titles are not found)
-> more are lost when paraphrasing deletes the names
-> more are lost when NER does not detect the names (!this is the bottleneck!)
**TODO** insert graphic from (https://sankeymatic.com/build/)

The wiki text is truncated to 4096 characters for faster processing. 4k characters should be enough to contain several mentions of the person the article is about.
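
A rough check of this assumption could look like the sketch below. The helper name `truncate_and_count` and the example text are made up for illustration; only the 4096-character limit comes from the notes above.

```python
MAX_CHARS = 4096  # limit taken from the notes above


def truncate_and_count(text, name, max_chars=MAX_CHARS):
    """Cut `text` to `max_chars` characters and count literal mentions of `name`."""
    truncated = text[:max_chars]
    return truncated, truncated.count(name)


# Made-up example text, not taken from the dataset.
example = "Ada Lovelace was a mathematician. " * 200
short, mentions = truncate_and_count(example, "Ada Lovelace")
print(len(short), mentions)
```

Running this over the `raw` column would show how many articles keep fewer than, say, two mentions after truncation.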


NER vs Manual vs "String Matching"

NER:             use a model
string matching: use a very simple matching approach
manual:          search and replace by hand
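
To make the "string matching" variant concrete, here is a minimal sketch of what a very simple matching approach might look like. It is an assumption, not the approach actually used in this project: the function name is hypothetical, and the `<mask>` token is borrowed from the fill-mask inputs further down.

```python
import re


def mask_by_string_matching(text, name, mask_token="<mask>"):
    """Replace the full name and each of its parts with `mask_token`."""
    # Longest candidates first, so the full name is tried before its parts.
    candidates = sorted({name, *name.split()}, key=len, reverse=True)
    pattern = "|".join(re.escape(c) for c in candidates)
    return re.sub(rf"\b(?:{pattern})\b", mask_token, text)


masked = mask_by_string_matching(
    "Marie Curie was born in Warsaw. Curie won two Nobel Prizes.",
    "Marie Curie",
)
print(masked)
```

Unlike NER, this only finds spellings that appear in the title, so nicknames and pronouns stay untouched.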


### Open Questions
- [ ] are 4096 characters enough to keep several mentions of the name of the person?
- [ ] how many names are removed by paraphrasing?
- [ ] how many names are removed in NER?
- [ ] paraphrase text as a whole or as single sentences? -> single sentences keep more info, but might remove more names?
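
For the last question, one thing worth measuring is how many sentences mention the name at all, since sentence-wise paraphrasing sees each sentence in isolation. The sketch below uses a naive regex splitter (it will break on abbreviations such as "Dr."); both helper names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_with_name(text, name):
    """Return (number of sentences mentioning `name`, total number of sentences)."""
    sentences = split_sentences(text)
    return sum(name in s for s in sentences), len(sentences)


# Made-up example text.
example_text = (
    "Alan Turing was born in London. He studied at Cambridge. "
    "Turing later worked at Bletchley Park."
)
hits, total = sentences_with_name(example_text, "Turing")
print(hits, total)
```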




### Step 1: How often does the title of the wiki page occur in the document?

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('wiki-dataset-results.csv')

In [None]:
df.shape

In [None]:
# read in 15 examples of paraphrased wiki texts
df = pd.read_csv('15-wiki-samples.csv')

In [None]:
# samples = df.sample(15)
# samples.to_csv('15-wiki-samples.csv')
# samples
df

In [None]:
import re

In [None]:
counts = []
for index, page in df.iterrows():
    title = page['title']
    text = page['raw']
    counts.append(len(re.findall(re.escape(title), text)))
counts

In [None]:
len(df.iloc[0]['raw'])

In [None]:
text = df.iloc[0]['raw']
title = df.iloc[0]['title']

In [None]:
import re

In [None]:
re.findall(re.escape(title), text)
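
Page titles can contain characters that are regex syntax, and some may carry a disambiguation suffix such as "(politician)" that never appears in the running text. The sketch below handles both; the helper name and the assumption about the suffix format are mine, not taken from the dataset.

```python
import re


def count_title_mentions(title, text):
    """Count literal occurrences of a page title in `text`."""
    # Drop a trailing parenthetical such as " (politician)" (assumed title format).
    name = re.sub(r"\s*\([^)]*\)$", "", title)
    # re.escape keeps characters like '.' or '(' from being read as regex syntax.
    return len(re.findall(re.escape(name), text))


# Made-up example, not taken from the dataset.
n = count_title_mentions(
    "John Smith (politician)",
    "John Smith was a politician. John Smith retired.",
)
print(n)
```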

### Step 2: Check whether mwparserfromhell + wikibot is better to work with
No: use SPARQLWrapper to query the persons (the cell below points at the Wikidata endpoint) and keep using the Hugging Face datasets to access cleaned wiki texts.

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

In [None]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

In [None]:
# get the page titles of the queried persons
names = [page['page_titleEN']['value'] for page in results['results']['bindings']]
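
`results` is not defined in the cells shown here; it presumably holds the JSON-converted response of a SPARQL query. The sketch below shows the extraction on a hand-written stand-in that follows the standard SPARQL JSON results layout (`results` -> `bindings` -> one dict per row). The function name and the sample values are illustrative only.

```python
def page_titles(results, variable="page_titleEN"):
    """Collect the values bound to `variable` in a SPARQL JSON result."""
    bindings = results.get("results", {}).get("bindings", [])
    # Rows without the variable (e.g. an OPTIONAL that did not match) are skipped.
    return [row[variable]["value"] for row in bindings if variable in row]


# Hand-written stand-in for a query result; the names are placeholders.
mock_results = {
    "head": {"vars": ["page_titleEN"]},
    "results": {
        "bindings": [
            {"page_titleEN": {"type": "literal", "value": "Ada Lovelace"}},
            {"page_titleEN": {"type": "literal", "value": "Alan Turing"}},
            {},
        ]
    },
}
example_names = page_titles(mock_results)
print(example_names)
```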

In [None]:
from datasets import load_dataset
dataset = load_dataset("wikipedia", "20220301.en", split="train") # the dataset only provides a train split

In [None]:
len(dataset['title'])

In [None]:
from helpers import extract_text
articles = extract_text(dataset, names)

In [None]:
len(articles)
articles[0]

In [None]:
# apply NER to the articles
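
The masking step itself can be sketched without loading the model. The function below assumes entity dicts in the shape the `transformers` token-classification pipeline returns with `aggregation_strategy="simple"` (keys `entity_group`, `word`, `start`, `end`, `score`) and the `PER` label used by dslim/bert-base-NER. The entity list is hand-written for illustration, including the scores.

```python
def mask_person_entities(text, entities, mask_token="<mask>"):
    """Replace every PER span in `text` with `mask_token`."""
    spans = sorted(
        (e["start"], e["end"]) for e in entities if e["entity_group"] == "PER"
    )
    # Replace from right to left so earlier offsets stay valid.
    for start, end in reversed(spans):
        text = text[:start] + mask_token + text[end:]
    return text


# Hand-written stand-in for pipeline output; scores are made up.
example_text = "Marie Curie was born in Warsaw."
example_entities = [
    {"entity_group": "PER", "word": "Marie Curie", "start": 0, "end": 11, "score": 0.99},
    {"entity_group": "LOC", "word": "Warsaw", "start": 24, "end": 30, "score": 0.99},
]
masked_text = mask_person_entities(example_text, example_entities)
print(masked_text)
```

Any mention the model misses stays in the text, which is the loss described at the top of these notes.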

In [None]:
from custom.wiki import query_wiki_persons

In [None]:
query_wiki_persons(5)

## Time executions of fill-mask (array of inputs vs single inputs)

In [1]:
from transformers import pipeline

In [2]:
inputs = ["Hello, I am a text with a <mask>", "Wow, there is <mask> more text.",
          "The largest building in <mask> is the taj mahal.", "Let's see if <mask> can win the f1 championship.",
         "<mask> are the only animals to live in these rough conditions.", "Not many people can say the have seen <mask>."]

In [3]:
fill_mask = pipeline("fill-mask", model="roberta-base", tokenizer='roberta-base', top_k=5)

In [10]:
%timeit -r 30 [fill_mask(x) for x in inputs]

334 ms ± 12.2 ms per loop (mean ± std. dev. of 30 runs, 1 loop each)


In [11]:
%timeit -r 30 fill_mask(inputs)

352 ms ± 22.6 ms per loop (mean ± std. dev. of 30 runs, 1 loop each)
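
The two timings above are close (334 ms vs 352 ms, with standard deviations of 12.2 ms and 22.6 ms), so passing the whole list in one call shows no speed-up here. To repeat the comparison outside IPython, a sketch with the standard `timeit` module follows. `fake_fill_mask` is a stand-in for the real pipeline, and the sketch reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import timeit


def compare_call_styles(fn, inputs, repeat=5, number=1):
    """Return the best time in seconds for per-item calls and for one batched call."""
    per_item = min(
        timeit.repeat(lambda: [fn(x) for x in inputs], repeat=repeat, number=number)
    )
    batched = min(timeit.repeat(lambda: fn(inputs), repeat=repeat, number=number))
    return per_item, batched


def fake_fill_mask(x):
    """Stand-in for the pipeline: accepts a string or a list of strings."""
    return [s.upper() for s in x] if isinstance(x, list) else x.upper()


per_item, batched = compare_call_styles(fake_fill_mask, ["a <mask>", "b <mask>"])
print(f"per item: {per_item:.6f} s, batched: {batched:.6f} s")
```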


### Time executions of paraphrase (array of inputs vs single inputs)

In [12]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [13]:


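import torch

# Use the GPU when available, otherwise fall back to the CPU.
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

def load_model(model_name='tuner007/pegasus_paraphrase'):
    """Load the Pegasus paraphrase model and its tokenizer."""
    print(f"Loading {model_name}")
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
    return model, tokenizer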