### 1. Data Notebook

##### In this notebook:
1. We collect the text of the selected Wikipedia pages using the BeautifulSoup (bs4) library.
2. We use the nltk sent_tokenize function to split the text body into sentences.
    * The divide_chunks function groups the sentences into two-sentence chunks.
    * Because we cut each Wikipedia page into two-sentence chunks, ambiguity increases. For example, a chunk may start with a pronoun that refers to an entity in the previous chunk.
3. It is almost impossible for the generative extraction model to resolve a pronoun whose referent lies outside the given context. Therefore, we apply a pre-trained coreference resolution model: the xlm_roberta model from the crosslingual_coreference library.
4. We save the resulting data to the file 'data/preprocessed_data_for_extraction.jsonl'.
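
To make the ambiguity problem from step 2 concrete, here is a minimal sketch. It assumes the sentences are already split (the list is typed by hand so that the example runs without nltk), and the `PRONOUNS` set and `starts_with_pronoun` helper are illustrative names that are not part of this notebook.

```python
def divide_chunks(items, n):
    """Yield successive slices of length n (the last one may be shorter)."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

# Hand-split sentences, reusing the small example text from this notebook
sentences = [
    "Fina was born in Bulgaria, but she lives in Utrecht now.",
    "She has a daughter called Iris.",
    "Her daughter is 5 years old.",
    "She is a very good girl.",
]

# Join each pair of sentences into one chunk, as the notebook does
chunks = [" ".join(pair) for pair in divide_chunks(sentences, 2)]

# Illustrative, deliberately incomplete pronoun list (an assumption)
PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their"}

def starts_with_pronoun(chunk):
    """Return True if the first word of the chunk is a pronoun."""
    first_word = chunk.split()[0].lower()
    return first_word in PRONOUNS

for chunk in chunks:
    flag = "ambiguous" if starts_with_pronoun(chunk) else "ok"
    print(f"[{flag}] {chunk}")
```

The second chunk opens with "Her", whose referent (Fina) only appears in the first chunk, so an extraction model that sees the second chunk alone cannot tell who "Her" is.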

In [1]:
!python --version

Python 3.8.16


In [2]:
# Import Modules
from bs4 import BeautifulSoup
import requests
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import re
import json
from crosslingual_coreference import Predictor

[nltk_data] Downloading package punkt to /home/finapolat/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/finapolat/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm


In [3]:
# We'll use this function to chunk the text into smaller pieces
def divide_chunks(items, n):
    # yield successive slices of length n (the last one may be shorter)
    for i in range(0, len(items), n):
        yield items[i:i + n]

In [4]:
# We use predictor for coreference resolution, and choose xlm-roberta for accuracy.
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="xlm_roberta",
)

models/crosslingual-coreference/xlm-roberta-base/model.tar.gz: 851072KB [00:52, 16181.64KB/s]                            
Downloading: 8.68MB [00:00, 15.0MB/s]
Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
# Before we start collecting text, let's see how the coreference resolution model works
text = ("""Fina was born in Bulgaria, but she lives in Utrecht now.
        She has a daughter called Iris. 
        Her daughter is 5 years old.
        She is a very good girl.
        """)	
print(predictor.predict(text)["resolved_text"])

Fina was born in Bulgaria, but Fina lives in Utrecht now.
        Fina has a daughter called Iris. 
        Fina's daughter is 5 years old.
        a daughter called Iris is a very good girl.
        


In [6]:
# The list of the Wikipedia urls we want to collect text from
urls = ["https://en.wikipedia.org/wiki/Adidas",
        "https://en.wikipedia.org/wiki/Zalando",
        "https://en.wikipedia.org/wiki/Phoenix_Pharmahandel",
        "https://en.wikipedia.org/wiki/DATEV",
        "https://en.wikipedia.org/wiki/BASF",
        "https://en.wikipedia.org/wiki/Just_Eat_Takeaway.com",
        "https://en.wikipedia.org/wiki/Syrian_refugee_camps",
        r"https://en.wikipedia.org/wiki/2022%E2%80%93present_Ukrainian_refugee_crisis",
        "https://en.wikipedia.org/wiki/List_of_earthquakes_in_California",
        "https://en.wikipedia.org/wiki/2005_Birmingham_tornado",
        "https://en.wikipedia.org/wiki/Aftermath_of_the_2011_T%C5%8Dhoku_earthquake_and_tsunami",
        "https://en.wikipedia.org/wiki/Ozone_depletion"
        ]

In [7]:
# Fetch URL Content
annotations = []
for ind, url in enumerate(urls):
    annotation_dict = {} # we'll use this to store the annotations
    annotation_dict['Document id'] = ind + 1 # we give each document an id
    annotation_dict['Document url'] = url # we store the url
    page_name = url.split('/')[-1] # we get the page name from the url
    annotation_dict['Document name'] = page_name
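    page = requests.get(url, timeout=30) # do not hang forever on a slow response
    page.raise_for_status() # stop early on HTTP errors instead of parsing an error page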
    page_content = BeautifulSoup(page.text,'html.parser').select('body')[0]
    page_text = []
    for tag in page_content.find_all('p'): # paragraph text lives in <p> tags
        text = tag.text
        text = re.sub(r'\[\d+\]', '', text) # remove numeric citation markers such as [12]
        text = text.replace('\n',  '').replace('\\', '').replace('[citation needed]', '')
        page_text.append(text)
    page_text = ' '.join(page_text).strip()
    page_text = sent_tokenize(page_text) # we tokenize the text into sentences
    page_text = list(divide_chunks(page_text, 2)) # we make two sentences into a chunk
    for chunk_ind, chunk in enumerate(page_text): # separate name so the document index is not shadowed
        annotation_dict['Chunk id'] = chunk_ind + 1
        chunk = ' '.join(chunk)
        chunk = predictor.predict(chunk)["resolved_text"] # here we use the coref model
        annotation_dict['Chunk text'] = chunk
        annotations.append(annotation_dict.copy())
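
The cleaning applied to each paragraph can be tried on its own. Here is a minimal sketch, assuming a made-up sample string; `clean_paragraph` is a hypothetical helper name that wraps the same regex and replacements used in the loop.

```python
import re

def clean_paragraph(text):
    """Apply the same clean-up as the collection loop to one paragraph."""
    text = re.sub(r'\[\d+\]', '', text)  # drop numeric citation markers like [1]
    return text.replace('\n', '').replace('\\', '').replace('[citation needed]', '')

# Made-up sample paragraph, not taken from the collected pages
sample = "The company was founded in Germany.[1][23] It sells shoes[citation needed].\n"
cleaned = clean_paragraph(sample)
print(cleaned)

# Non-numeric footnote markers are left in place by this pattern
print(clean_paragraph("See the note.[a]"))
```

Because the pattern only matches digits inside square brackets, lettered footnote markers such as [a] survive the cleaning and may need a broader pattern if they matter downstream.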

In [8]:
with open('data/preprocessed_data_for_extraction.jsonl', 'w', encoding='utf-8') as f:
    for annotation in annotations:
        # json.dumps returns the JSON string; json.dump writes to f itself and returns None
        f.write(json.dumps(annotation, ensure_ascii=False) + '\n')
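
A JSON Lines file holds one JSON object per line, so a quick way to check the saved file is to read it back line by line. Here is a minimal round-trip sketch, assuming two made-up records and a temporary file rather than the real 'data/' path.

```python
import json
import os
import tempfile

# Made-up records with the same keys as the annotations (illustrative only)
records = [
    {"Document id": 1, "Chunk id": 1, "Chunk text": "Tōhoku is a region."},
    {"Document id": 1, "Chunk id": 2, "Chunk text": "Second chunk."},
]

# Write one JSON object per line to a temporary file
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back: every non-empty line must parse as JSON on its own
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "records loaded")
```

With ensure_ascii=False, non-ASCII characters such as "ō" are stored as-is instead of as \u escapes, which is why the file is opened with encoding='utf-8' for both writing and reading.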