<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/BSSDH2024.ipynb" target="_new"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

# Acquiring plain-text data for the corpus

## Getting the source documents

### Via RSS feeds

Consider the [Europe Media Monitor](https://emm.newsbrief.eu) feeds:

* [Current top 10 stories](https://emm.newsbrief.eu/NewsBrief/clusteredition/en/latest_en.html) (in English) ⇒ [machine-readable feed](https://emm.newsbrief.eu/rss/rss?type=rtn&language=en&duplicates=false) (RSS/XML)
* [Biggest 10 Stories Over Last 24h](https://emm.newsbrief.eu/NewsBrief/clusteredition/en/24hrs_en.html) ⇒ [machine-readable feed](https://emm.newsbrief.eu/rss/rss?type=24hrs&language=en&duplicates=false) (RSS/XML)

The Really Simple Syndication (RSS) standard and its XML format: https://www.w3schools.com/xml/xml_rss.asp

The Python `feedparser` library: https://feedparser.readthedocs.io
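
To make the feed structure concrete, here is a minimal sketch that reads a hand-written RSS 2.0 snippet with the standard-library `xml.etree.ElementTree` module instead of `feedparser`. The sample data and the `item_links` helper are illustrative assumptions, not part of the EMM feeds; real feeds carry more elements per `<item>`.

```python
import xml.etree.ElementTree as ET

# A hand-written, minimal RSS 2.0 document (made-up data, not a real feed).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.org/</link>
    <description>Two made-up stories</description>
    <item>
      <title>First story</title>
      <link>https://news.example.org/a1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://portal.example.com/b2</link>
    </item>
  </channel>
</rss>"""


def item_links(rss_text):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext('link') for item in root.iter('item')]


for link in item_links(SAMPLE_RSS):
    print(link)
```

Each `<item>` corresponds to one entry in `feed.entries` once a feed is parsed with `feedparser`, and its `<link>` is what `entry.link` exposes.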

In [None]:
!pip install feedparser

In [None]:
import feedparser

from urllib.parse import urlparse
from collections import Counter

In [None]:
LANG = 'en'  # TODO: choose another language

In [None]:
url_current = f'https://emm.newsbrief.eu/rss/rss?type=rtn&language={LANG}&duplicates=false'
url_last24h = f'https://emm.newsbrief.eu/rss/rss?type=24hrs&language={LANG}&duplicates=false'

feed = feedparser.parse(url_current)  # TODO: compare to the last 24h feed

LINKS = [entry.link for entry in feed.entries]

for link in LINKS: print(link)
print(len(LINKS))

In [None]:
# Named `portal_filter` to avoid shadowing the built-in `filter` function
portal_filter = 'telegraph.co.uk'  # TODO: adjust to your case

FILTERED_LINKS = [link for link in LINKS if portal_filter in link]

for link in FILTERED_LINKS: print(link)
print(len(FILTERED_LINKS))

#### Data analysis

##### First attempt

In [None]:
portals = [urlparse(link).netloc for link in LINKS]

frequencies = Counter(portals)

for portal, count in frequencies.items():
    print(f'{portal}: {count}')

##### Second attempt

In [None]:
portals = [urlparse(link).netloc for link in LINKS]

frequencies = Counter(portals)

pruned = {portal: count for portal, count in frequencies.items() if count > 1}

for portal, count in Counter(pruned).most_common():
    print(f'{portal}: {count}')

### Via web crawling

In [None]:
# TODO (optionally)

## Extracting useful content

### Rule-based approach

The Python `requests` library: https://requests.readthedocs.io

The Python `beautifulsoup4` library: https://beautiful-soup-4.readthedocs.io

The Python `json` library: https://docs.python.org/3/library/json.html

In [None]:
!pip install requests
!pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup

import requests
import json

Extraction strategies:
1. Try to remove the unnecessary elements, keep the rest.
2. Pick out only the useful elements, ignore the rest.

The set(s) of rules:
1. Common patterns across news portals.
2. Portal-specific patterns.
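
The two extraction strategies can be compared on a tiny hand-written page. The sketch below deliberately uses the standard-library `html.parser` module rather than BeautifulSoup so that it runs without extra installs; the `TextCollector` class, the `collect` helper, and the sample HTML are hypothetical names and data introduced only for illustration.

```python
from html.parser import HTMLParser

# A made-up page with typical clutter around two paragraphs.
SAMPLE_HTML = """
<html><head><title>T</title><style>p {color: red}</style></head>
<body>
<nav>Home | World | Sport</nav>
<h1>Headline</h1>
<p>First paragraph.</p>
<script>var x = 1;</script>
<p>Second paragraph.</p>
<footer>Cookie settings</footer>
</body></html>
"""


class TextCollector(HTMLParser):
    """Collect text, skipping unwanted tags and/or keeping only wanted ones."""

    def __init__(self, skip_tags=(), keep_tags=()):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.keep_tags = set(keep_tags)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.keep_depth = 0  # > 0 while inside a kept element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        if tag in self.keep_tags:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth:
            self.skip_depth -= 1
        if tag in self.keep_tags and self.keep_depth:
            self.keep_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        # When a keep-list is given, ignore text outside the kept elements.
        if self.keep_tags and not self.keep_depth:
            return
        self.chunks.append(text)


def collect(html, **rules):
    parser = TextCollector(**rules)
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Strategy 1: remove the unnecessary elements, keep the rest.
removed = collect(SAMPLE_HTML, skip_tags=('script', 'style', 'nav', 'footer', 'title'))
# Strategy 2: pick out only the useful elements, ignore the rest.
picked = collect(SAMPLE_HTML, keep_tags=('p',))

print(removed)
print(picked)
```

On this sample, the first strategy retains the headline while the second drops it, which hints at the trade-off: removal rules tend to let more through, selection rules tend to miss content that sits outside the expected elements.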

In [None]:
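def extract_plain_text(url):
    """Download a web page and return the text of its <p> elements."""
    # A timeout keeps one unresponsive server from stalling the whole run
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Strategy 2: pick out only the paragraph elements, ignore the rest
    paragraphs = soup.find_all('p')
    text = ' '.join(p.get_text().strip() for p in paragraphs)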
    # TODO: try and compare this instead: text = soup.get_text(separator=' ', strip=True)

    # TODO: do more elaborate filtering and preprocessing

    # TODO: use a specialised library instead of the generic bs4, e.g.:
    #       https://newspaper.readthedocs.io
    #       https://goose3.readthedocs.io
    #       https://trafilatura.readthedocs.io
    #       https://pypi.org/project/news-please/

    return text

See also: https://github.com/LUMII-AILab/NLP_Course/blob/main/notebooks/TextExtraction.ipynb (HTML-to-Text)

In [None]:
dataset = []

for link in FILTERED_LINKS:
    content = extract_plain_text(link)
    print(content[:200], '\n' + '='*200)  # Quick preview, for testing only

    article = {
        'language': LANG,
        'portal': urlparse(link).netloc,
        'link': link,
        'text': content
    }

    dataset.append(article)

with open('corpus.json', 'w', encoding='utf-8') as json_file:
    json.dump(dataset, json_file, ensure_ascii=False, indent=4)

### Zero/few-shot learning approach

In [None]:
# TODO (optionally)

## Some challenges

### Messy HTML source code

In [None]:
# TODO (optionally)

### PDF documents

In [None]:
# TODO (optionally)

See: https://github.com/LUMII-AILab/NLP_Course/blob/main/notebooks/TextExtraction.ipynb (PDF-to-Text)

# Creating an annotated text corpus

## Syntactic parsing

Documentation:
* Available models per language: https://stanfordnlp.github.io/stanza/models.html
* Supported processors and pipelines: https://stanfordnlp.github.io/stanza/pipeline.html
* Data objects: https://stanfordnlp.github.io/stanza/data_objects.html

In [None]:
!pip install stanza

In [None]:
import stanza

In [None]:
stanza.download(LANG)

In [None]:
NLP_PIPE = stanza.Pipeline(lang=LANG, processors='tokenize,mwt,pos,lemma,depparse')

In [None]:
CORPUS = []

with open('corpus.json', 'r', encoding='utf-8') as json_file:
    data = json.load(json_file)

    for article in data:
        CORPUS.append({
            'language': article['language'],
            'portal': article['portal'],
            'link': article['link'],
            'document': NLP_PIPE(article['text'])  # All the NLP happens here!
        })

### CoNLL-U output

Format: https://universaldependencies.org/docs/format.html
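
A CoNLL-U sentence is a block of comment lines starting with `#`, followed by one line per word with ten tab-separated fields, and closed by a blank line. The sketch below reads such a block back into dictionaries; the sample annotation is hand-written for illustration, and `FIELDS` and `read_sentence` are hypothetical names, not part of Stanza.

```python
# The ten CoNLL-U columns, in order (lower-cased here for use as dict keys).
FIELDS = ('id', 'form', 'lemma', 'upos', 'xpos',
          'feats', 'head', 'deprel', 'deps', 'misc')

# A hand-annotated example sentence; '_' marks an unspecified value.
SAMPLE = (
    '# text = Dogs bark.\n'
    '1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n'
    '2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n'
    '3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n'
    '\n'
)


def read_sentence(block):
    """Split one CoNLL-U sentence block into comment lines and word dicts."""
    comments, tokens = [], []
    for line in block.splitlines():
        if not line:
            continue  # the blank line only terminates the sentence
        if line.startswith('#'):
            comments.append(line)
        else:
            tokens.append(dict(zip(FIELDS, line.split('\t'))))
    return comments, tokens


comments, tokens = read_sentence(SAMPLE)
for token in tokens:
    print(token['id'], token['form'], token['upos'], token['head'], token['deprel'])
```

All values stay strings here; `head` is `'0'` for the root word and otherwise the `id` of the governing word.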

In [None]:
with open('corpus.conllu', 'w', encoding='utf-8') as conllu_file:
    for article in CORPUS:

        for s in article['document'].sentences:
            conllu_file.write(f'# text = {s.text}\n')

            for w in s.words:
                conllu_file.write(
                    f'{w.id}\t'
                    f'{w.text}\t'
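                    f'{w.lemma or "_"}\t'
                    f'{w.upos}\t'
                    f'{w.xpos or "_"}\t'  # Some models provide no language-specific tags
                    f'{w.feats or "_"}\t'  # Morphological features, "_" if none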
                    f'{w.head}\t'
                    f'{w.deprel}\t'
                    '_\t'
                    '_\n'
                )

            conllu_file.write("\n")

### VERT output

Format: https://www.sketchengine.eu/my_keywords/vertical/

NoSketch Engine configuration file: https://raw.githubusercontent.com/LUMII-AILab/NLP_Course/main/notebooks/resources/BSSDH2024/corpus.config. Your VERT file must **comply with it**: use the same field names, in the same order.

In [None]:
with open('corpus.vert', 'w', encoding='utf-8') as vert_file:
    for article in CORPUS:
        vert_file.write(f'<doc>\n')
        # TODO: add doc-level metadata: language, portal, link
        # Hint: article["language"], article["portal"], article["link"]

        for s in article['document'].sentences:
            vert_file.write(f'<s>\n')

            for w in s.words:
                vert_file.write(
                    f'{w.text}\t'
                    f'{w.upos}\t'
                    f'{w.lemma}\t'
                    '_\t'
                    f'{s.words[w.head-1].upos if w.head > 0 else "_"}\t'
                    '_\t'
                    '_\n'
                )  # TODO: fill the missing word-level features: dep, dep_head_lemma, dep_head_dep

            vert_file.write("</s>\n")

        vert_file.write("</doc>\n")
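
The head lookup in the cell above works because `head` is a 1-based word index within the sentence, with 0 reserved for the root. The sketch below isolates that lookup; plain dictionaries stand in for Stanza's word objects (an assumption made so the example runs without downloading a model), the annotation is hand-written, and `head_of` is a hypothetical helper.

```python
# Plain dicts standing in for the words of one parsed sentence.
words = [
    {'id': 1, 'text': 'Dogs', 'upos': 'NOUN',  'lemma': 'dog',  'head': 2, 'deprel': 'nsubj'},
    {'id': 2, 'text': 'bark', 'upos': 'VERB',  'lemma': 'bark', 'head': 0, 'deprel': 'root'},
    {'id': 3, 'text': '.',    'upos': 'PUNCT', 'lemma': '.',    'head': 2, 'deprel': 'punct'},
]


def head_of(words, word):
    """Return the syntactic head of `word`, or None if `word` is the root."""
    # `head` is a 1-based word id, so subtract 1 to index the 0-based list.
    return words[word['head'] - 1] if word['head'] > 0 else None


for word in words:
    head = head_of(words, word)
    head_upos = head['upos'] if head else '_'
    print(f"{word['text']}\t{word['upos']}\t{head_upos}")
```

Once the head word is at hand, any of its attributes can be read the same way as its part-of-speech tag.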

Download your VERT file and rename it to `corpus-<language>-<NameSurname>.vert`.

Then copy the renamed file to https://drive.google.com/drive/folders/1aXM_AVDuoyBkc8M6t2tQlbYHvCSS6dYt