# Build and analyze base dataset

Building an entity re-identification model on Wikipedia articles requires a base dataset. It is created in the following steps:

#### Retrieve Base Data
- retrieve a list of all Wikipedia pages about persons
- retrieve the Wikipedia page texts for those persons
- combine both lists into a usable dataset

#### Cleaning
- preprocess texts by:
  - splitting texts into sentences
  - trimming the texts to the maximum length allowed by our model / performance limit
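
The cleaning steps could be sketched as follows. This is a minimal illustration and not the notebook's actual implementation: the function names `split_sentences` and `trim_sentences`, the naive punctuation-based splitting rule, and the character-based limit are all assumptions.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def trim_sentences(sentences: List[str], max_chars: int) -> List[str]:
    """Keep leading sentences while their combined length stays within max_chars."""
    kept, total = [], 0
    for sentence in sentences:
        if total + len(sentence) > max_chars:
            break
        kept.append(sentence)
        total += len(sentence)
    return kept


sentences = split_sentences("Plato was a philosopher. He founded the Academy! Did he write?")
trimmed = trim_sentences(sentences, max_chars=50)
```

A punctuation-based rule splits incorrectly at abbreviations and initials (for example "George W. Bush"), so a dedicated sentence tokenizer is likely a better fit for real articles.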
 
#### Preprocessing
- paraphrase sentences to make predicting masked entities harder, since the paraphrased sentences are unseen by the trained model
- recognize the entity within the text by
  - using regex
  - using NER models
- mask the recognized entities
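
The regex variant of entity recognition and masking could look roughly like this. It is a sketch under assumptions: the function name `mask_entity`, the `<mask>` token, and the rule of matching the full name plus each name part longer than two characters are illustrative choices, not taken from the notebook.

```python
import re

MASK = "<mask>"  # assumed mask token; the real one depends on the model


def mask_entity(text: str, name: str, mask: str = MASK) -> str:
    """Replace the full name and each of its longer parts with the mask token."""
    # Skip short parts such as initials ("W.") to avoid over-masking
    parts = [p for p in name.split() if len(p) > 2]
    # Longest candidates first, so the full name is masked as a single unit
    candidates = sorted({name, *parts}, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(c) for c in candidates) + r")\b"
    )
    return pattern.sub(mask, text)


masked = mask_entity(
    "Douglas Adams was an author. Adams wrote novels.", "Douglas Adams"
)
```

Matching on the page title alone misses pronouns, nicknames, and middle names ("Douglas Noel Adams" would be masked in two pieces), which is one reason to compare this approach against NER models.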

#### Analyzing
- Compare regex vs NER for entity recognition
- Compare paraphrasing vs no paraphrasing in terms of occurrences of the named entity

## Retrieve Base Data

In [11]:
# retrieve full wiki dataset
from datasets import load_dataset
dataset = load_dataset("wikipedia", "20220301.en", split="train") # use train split, as it only has train, no other splits

Reusing dataset wikipedia (/home/nya/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


In [2]:
# retrieve Wikipedia persons: 10 for this demo; 30k is approx. the max. allowed return
from custom.wiki import query_wiki_persons
persons = query_wiki_persons(10)
persons

['George Washington',
 'Pierre Corneille',
 'Kofi Annan',
 'George W. Bush',
 'Douglas Adams',
 'Leonard Cohen',
 'Abel Mutai',
 'Oscar Luigi Scalfaro',
 'Charles Baudelaire',
 'Plato']

In [10]:
# concat datasets
from custom.wiki import extract_text
import pandas as pd

articles = extract_text(dataset, persons)
df = pd.DataFrame(articles)

Unnamed: 0,id,url,title,text
0,11968,https://en.wikipedia.org/wiki/George%20Washington,George Washington,"George Washington (February 22, 1732, 1799) wa..."
1,58193,https://en.wikipedia.org/wiki/Pierre%20Corneille,Pierre Corneille,Pierre Corneille (; 6 June 1606 – 1 October 16...
2,23192127,https://en.wikipedia.org/wiki/Ashton%20Eaton,Ashton Eaton,"Ashton James Eaton (born January 21, 1988) is ..."
3,26260416,https://en.wikipedia.org/wiki/Abel%20Mutai,Abel Mutai,Abel Kiprop Mutai (born 2 October 1988) is a K...
4,33616917,https://en.wikipedia.org/wiki/%C3%89rick%20Bar...,Érick Barrondo,Érick Bernabé Barrondo García (born 14 June 19...
5,22954,https://en.wikipedia.org/wiki/Plato,Plato,Plato ( ; ; 428/427 or 424/423 – 348/347 BC) ...
6,8091,https://en.wikipedia.org/wiki/Douglas%20Adams,Douglas Adams,Douglas Noel Adams (11 March 1952 – 11 May 200...
7,5804,https://en.wikipedia.org/wiki/Charles%20Baudel...,Charles Baudelaire,"Charles Pierre Baudelaire (, ; ; 9 April 1821 ..."
8,2346975,https://en.wikipedia.org/wiki/Ban%20Ki-moon,Ban Ki-moon,Ban Ki-moon (; ; born 13 June 1944) is a South...
9,16844,https://en.wikipedia.org/wiki/Kofi%20Annan,Kofi Annan,Kofi Atta Annan (; 8 April 193818 August 2018)...
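
The source of `custom.wiki.extract_text` is not shown here. One plausible way to combine both lists is to keep only the articles whose title appears in the list of persons. The sketch below assumes this behavior; the name `extract_text_sketch` is hypothetical, and the only thing taken from the output above is that each record has the fields `id`, `url`, `title`, and `text`.

```python
from typing import Dict, Iterable, List


def extract_text_sketch(records: Iterable[Dict], titles: Iterable[str]) -> List[Dict]:
    """Keep the records whose 'title' is one of the requested titles (assumed behavior)."""
    wanted = set(titles)  # set lookup keeps the scan over millions of articles cheap
    return [record for record in records if record["title"] in wanted]


# Tiny stand-in for the Wikipedia dataset, using the same fields as the output above
sample = [
    {"id": "22954", "url": "https://en.wikipedia.org/wiki/Plato", "title": "Plato", "text": "Plato ..."},
    {"id": "1", "url": "https://en.wikipedia.org/wiki/Athens", "title": "Athens", "text": "Athens ..."},
]
articles_sketch = extract_text_sketch(sample, ["Plato", "Kofi Annan"])
```

The resulting list of dicts can be passed to `pd.DataFrame` in the same way as `articles` above.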
