# Rīkkopas teksta priekšapstrādei | Toolkits for text preprocessing

**EN**: In this session we will continue to use *Python* modules for working with regular expressions and look at the NLP toolkits [NLTK](https://www.nltk.org/), [spaCy](https://spacy.io/) and [Stanza](https://stanfordnlp.github.io/stanza/), as well as the [Hugging Face](https://github.com/huggingface/tokenizers/) BPE library.

In the following sessions we will see that the multi-purpose toolkits covered here can be used not only for text preprocessing but also for further text analysis.

In [1]:
# Install the packages
!pip install nltk spacy spacy-transformers stanza

# Import the packages
import nltk
import spacy
import stanza



## Darbs ar NLTK datu kopām | Working with NLTK datasets

**LV**: NLTK ietver daudzveidīgu teksta korpusu un leksisko resursu kolekciju. Šos korpusus u.c. resursus iespējams lejuplādēt un strādāt ar tiem, izmantojot `nltk.corpus` pakotni. Vairāk informācijas: https://www.nltk.org/data.html

---

**EN**: NLTK includes a diverse set of preprocessed text corpora and lexical resources that can be downloaded and manipulated using the `nltk.corpus` package. More info: https://www.nltk.org/data.html

In [None]:
# Open the NLTK Downloader CLI
nltk.download()

# Download the Brown corpus (see https://en.wikipedia.org/wiki/Brown_Corpus)
nltk.download('brown')

# Import the Brown corpus
from nltk.corpus import brown

In [None]:
# Get the size of the corpus
total_size = len(brown.words())
print("Total number of words:", total_size)

# List the text categories (genres) in the corpus
for cat in brown.categories():
    doc_per_cat = len(brown.fileids(categories=cat))
    words_per_cat = len(brown.words(categories=cat))
    cat_percentage = words_per_cat / total_size

    print(f'\t{cat:10}\t{doc_per_cat}\t{words_per_cat}\t{cat_percentage:.2%}')

**EN**: Note: compare the genre proportions of the balanced *Brown* corpus (1964!) with the [genre proportions](https://nosketch.korpuss.lv/#wordlist?corpname=LVK2022&tab=attribute&onecolumn=1&wlattr=doc.section&wlminfreq=1&include_nonwords=1&showresults=1) of the balanced Latvian corpus [LVK2022](https://korpuss.lv/id/LVK2022).

In [None]:
# Work with a sub-corpus
print("humor:", brown.fileids(categories='humor'))
print("cr01:", len(brown.words('cr01')), '\n')

# NLTK supports reading a corpus as a raw/annotated text..
print("Raw text with POS tags:", brown.raw('cr01').split('\n\n')[0], '\n')

# ..as well as a list of words, sentences, or paragraphs
print("Words:", brown.words('cr01'))
print("Sentences:", brown.sents('cr01'))
print("Paragraphs:", brown.paras('cr01'))

## Dalīšana tekstvienībās un teikumos | Tokenization and sentence splitting

### RegEx

**EN**: First, let's see how far we can get with token and sentence segmentation using regular expressions alone.

In [None]:
# Let's begin with the built-in re module
import re

# We will need BeautifulSoup again
!pip install beautifulsoup4
from bs4 import BeautifulSoup

In [None]:
# Download some non-NLTK corpora:
!wget -O "Rainis.txt" https://repository.clarin.lv/repository/xmlui/bitstream/handle/20.500.12574/41/rainis_v20180716.txt
!wget -O "Romeo.txt" https://www.gutenberg.org/cache/epub/1112/pg1112.txt

In [None]:
# Alternatively, upload files from your local file system
from google.colab import files
files.upload()

In [27]:
# Read and clean up the downloaded corpora

with open("Romeo.txt", mode='r', encoding='utf-8') as file_en:
    text_en = file_en.read()

with open("Rainis.txt", mode='r', encoding='utf-8') as file_lv:
    text_lv = file_lv.read().split('</doc>')[0]
    text_lv = BeautifulSoup(text_lv, 'html.parser').text

text_en = text_en.strip()
text_lv = text_lv.strip()

In [None]:
# Normalize white spaces (incl. line breaks)
text_en = re.sub(r'\s+', ' ', text_en)
text_lv = re.sub(r'\s+', ' ', text_lv)

print(text_en)
print(text_lv)

In [None]:
# Simplified (and lossy) tokenization

words_en = re.split(r'\W+', text_en)
print(len(words_en), words_en)

words_lv = re.split(r'\W+', text_lv)
print(len(words_lv), words_lv)

words_en = re.findall(r'[A-Za-z]+', text_en) # vs. \w+
print(len(words_en), words_en) # Note: [A-z] would also match [ \ ] ^ _ ` (see https://en.wikipedia.org/wiki/ASCII#Character_set)

words_lv = re.findall('[A-Za-zĀāČčĒēĢģĪīĶķĻļŅņŌōŖŗŠšŪūŽž]+', text_lv)
print(len(words_lv), words_lv)
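
**EN**: A minimal sketch of why the `[A-z]` range is risky: in ASCII, the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`, so the range matches them as well. The sample string is made up for illustration.

```python
import re

# Hypothetical sample containing characters that lie between 'Z' and 'a' in ASCII
sample = "snake_case [note] a^b"

# The single range A-z silently accepts '_', '[', ']' and '^'
loose = re.findall(r'[A-z]+', sample)

# Two explicit ranges match ASCII letters only
strict = re.findall(r'[A-Za-z]+', sample)

print(loose)
print(strict)
```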

**EN**: This way we extract words, but we lose the other tokens and split complex words (e.g. "Covid-19").

Let's try to improve the pattern using the third-party `regex` module, which is compatible with the built-in `re`, may improve speed, and provides extra features such as easy-to-define *Unicode* character classes: https://en.wikipedia.org/wiki/Unicode_character_property#General_Category.
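
**EN**: To get a feel for the Unicode general categories, the standard-library `unicodedata` module can report the category of any character. The sketch below uses two hypothetical helpers, `is_letter` and `is_punct`, that roughly correspond to the `\p{L}` and `\p{P}` classes of the `regex` module.

```python
import unicodedata

def is_letter(ch):
    # Letter categories (Lu, Ll, Lt, Lm, Lo) all start with 'L'
    return unicodedata.category(ch).startswith('L')

def is_punct(ch):
    # Punctuation categories (Pd, Po, Ps, Pe, ...) all start with 'P'
    return unicodedata.category(ch).startswith('P')

# Print the category of a few sample characters
for ch in "āČ-1.":
    print(repr(ch), unicodedata.category(ch), is_letter(ch), is_punct(ch))
```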

In [None]:
!pip install regex

import regex

In [None]:
# Raw strings keep \p from being treated as an (invalid) Python string escape
words_en = regex.findall(r'\p{L}+|\p{P}', text_en)
words_lv = regex.findall(r'\p{L}+|\p{P}', text_lv)

print(text_en + '\n' + ' '.join(words_en) + '\n' + str(len(words_en)) + '\n')
print(text_lv + '\n' + ' '.join(words_lv) + '\n' + str(len(words_lv)) + '\n')

# TODO: wee'l, o'th, peoples', eye-sight, sald-sāpīgs, zil-ziediņi, sarkan-zili-zaļi
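
**EN**: One possible way to approach the TODO above is to allow hyphens and apostrophes inside a word. This is only a sketch using the built-in `re` module; the pattern and the `simple_tokenize` helper are assumptions made for illustration, and a trailing apostrophe (as in "peoples'") is still split off.

```python
import re

# A word is a run of word characters, optionally continued by
# a hyphen or an apostrophe followed by more word characters.
# Anything else that is not whitespace becomes a one-character token.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return word and punctuation tokens, keeping compounds such as 'Covid-19' whole."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Covid-19 and eye-sight; sald-sāpīgs, o'th."))
```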

**LV**: Mēģināsim dalīt tekstu teikumos un rindkopās, izmantojot vienkāršas regulārās izteiksmes.

---

**EN**: An attempt to split a text into sentences and paragraphs using simple regular expressions.

In [None]:
# TODO: Re-load text_en and text_lv

# Naive assumptions:
# (1) the full-stop characters indicate end-of-sentence;
# (2) line breaks indicate end-of-paragraph.

# Sentence splitting
text_en = regex.sub(r'(?<=[.!?])[ ]+', '[SENT]', text_en)
text_lv = regex.sub(r'(?<=[.!?])[ ]+', '[SENT]', text_lv)

# Paragraph splitting
text_en = regex.sub(r'\n+([ ]+\n+)?', '[PARA]', text_en)
text_lv = regex.sub(r'\n+([ ]+\n+)?', '[PARA]', text_lv)

# Normalization of the remaining white spaces
text_en = regex.sub(r'\s+', ' ', text_en)
text_lv = regex.sub(r'\s+', ' ', text_lv)

print(text_en)
print(text_lv)
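
**EN**: The same naive idea can be written so that it returns lists instead of inserting markers. Note that paragraph boundaries have to be detected before white space is normalized, otherwise the line breaks are already gone. The sketch below assumes that a blank line (rather than any line break) ends a paragraph; the helper names and the sample text are hypothetical.

```python
import re

def split_paragraphs(text):
    # Assumption: one or more blank lines separate paragraphs
    paras = re.split(r'\n[ \t]*\n+', text.strip())
    # Normalize the remaining white space inside each paragraph
    return [re.sub(r'\s+', ' ', p).strip() for p in paras if p.strip()]

def split_sentences(paragraph):
    # Naive assumption: '.', '!' or '?' followed by white space ends a sentence
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph) if s]

sample = (
    "First sentence. Second one!\n"
    "Still the same paragraph?\n"
    "\n"
    "New paragraph by Mr. Smith."
)

for para in split_paragraphs(sample):
    print(split_sentences(para))
```

The second paragraph shows the limits of the approach: the abbreviation "Mr." is treated as the end of a sentence.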

### NLTK

**EN**: The NLTK sentence splitter (`sent_tokenize`) and tokenizer (`word_tokenize`) use pre-trained models and heuristics that take into account the complexity and irregularities of natural language text.

Before using these functions, ensure you have downloaded the *Punkt* tokenizer models. This NLTK data package includes a pre-trained model for English and some other languages.

*Punkt* uses an unsupervised algorithm to build a tokenization and sentence splitting model. It must be trained on a large plain-text corpus in the target language.

See https://www.nltk.org/api/nltk.tokenize.punkt.html

In [None]:
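nltk.download('punkt')
# Note: recent NLTK releases look for the 'punkt_tab' package instead; if
# sent_tokenize() raises a LookupError, run nltk.download('punkt_tab')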

# To list the available Punkt models:

import os

punkt_path = os.path.join(nltk.data.find('tokenizers/punkt'), '')
punkt_files = [f for f in os.listdir(punkt_path) if f.endswith('.pickle')]

print([f.replace('.pickle', '') for f in punkt_files])

In [5]:
short_text_en = '''
    Punkt knows that the periods in Mr. Smith and Johann S. Bach
    do not mark sentence boundaries. And sometimes sentences
    start with non-capitalized words. i is a good variable
    name.
    You may copy it, give it away or re-use it under the terms
    of the Project Gutenberg License on-line
    at www.gutenberg.org. If you're not in the US,
    you'll have to check the laws where you are located
    before using this e-Book.
    '''

short_text_lv = '''
    TĀLAS NOSKAŅAS ZILĀ VAKARĀ
    Vēlreiz garā tuvajiem mīļā dzimtenē sirsnīgus sveicienus!
    Daudz simtu jūdžu tāļumā,
    aiz tīreļiem, purviem un siliem,
    guļ mana dzimtene diendusā.
    Tā aizsegta debešiem ziliem,
    zil-saulainiem debešu palagiem
    pret dvesmām un strāvām, un negaisiem...
    Piemērs tapis 2024. gadā. Šis u.c. piemēri.
    '''

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

sents_en = sent_tokenize(short_text_en)
# No Latvian Punkt model ships with NLTK, so the Polish model is used as a stand-in
sents_lv = sent_tokenize(short_text_lv, language='polish')

for s in sents_en:
    s = regex.sub(r'\s+', ' ', s.strip())
    w = word_tokenize(s)
    print(s + '\n' + ' '.join(w) + '\n')

print("-----\n")

for s in sents_lv:
    s = regex.sub(r'\s+', ' ', s.strip())
    w = word_tokenize(s, language='polish')
    print(s + '\n' + ' '.join(w) + '\n')

# TODO: a potential mini-project - train and use a Punkt model for another language

### spaCy

In [2]:
# First, we must download language models trained for the languages of interest.
# See https://spacy.io/usage/models

!python -m spacy download en_core_web_sm
!python -m spacy download lt_core_news_sm
!python -m spacy download xx_sent_ud_sm

nlp_en = spacy.load("en_core_web_sm")
nlp_lt = spacy.load("lt_core_news_sm")
nlp_xx = spacy.load("xx_sent_ud_sm")

# Assumption: LT should be the closest one to LV.
# Compare to the multilingual (XX) models.
# Compare to the large models (lg and trf)!

# PS. The xx_ent_wiki_sm model does not include a component for setting sentence
# boundaries by default - the sentencizer has to be added to the nlp_xx pipeline
# before using it: nlp_xx.add_pipe('sentencizer').
# Still, the xx_sent_ud_sm model is more accurate for LV sentence splitting.


In [None]:
!pip install spacy-transformers
!python -m spacy download en_core_web_trf
nlp_en = spacy.load("en_core_web_trf")

In [None]:
!python -m spacy download lt_core_news_lg
nlp_lt = spacy.load("lt_core_news_lg")

In [None]:
# Pass plain text to the respective language model,
# which returns a processed spaCy document.
doc_en = nlp_en(short_text_en) # TODO: vs. en-trf
doc_lv = nlp_xx(short_text_lv) # TODO: vs. lt-lg

# For each sentence in the EN text, tokenize it and pretty-print
for sent in doc_en.sents:
    sent_text = ' '.join(tok.text for tok in sent)
    print(regex.sub(r'\s+', ' ', sent_text.strip()))

print("\n-----")

# For each sentence in the LV text, tokenize it and pretty-print
for sent in doc_lv.sents:
    sent_text = ' '.join(tok.text for tok in sent)
    print(regex.sub(r'\s+', ' ', sent_text.strip()))

### Stanza

In [None]:
# Get the necessary models
# See https://stanfordnlp.github.io/stanza/performance.html

stanza.download('en')
stanza.download('lv')

In [None]:
# Create NLP pipelines
nlp_en = stanza.Pipeline('en')
nlp_lv = stanza.Pipeline('lv')

In [None]:
# Process the texts
doc_en = nlp_en(short_text_en)
doc_lv = nlp_lv(short_text_lv)

In [None]:
# Get the EN sentences and tokens
for i, sent in enumerate(doc_en.sentences):
    print(i, ' '.join(tok.text for tok in sent.tokens))

In [None]:
# Get the LV sentences and tokens
for i, sent in enumerate(doc_lv.sentences):
    print(i, ' '.join(tok.text for tok in sent.tokens))

## BPE tokenizācija | BPE tokenization

**EN**: The *Hugging Face* BPE tokenization library: https://github.com/huggingface/tokenizers/
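
**EN**: Before using the library, it may help to see the core idea of byte-pair encoding in plain Python: start from single characters and repeatedly merge the most frequent adjacent pair of symbols. This is a simplified sketch, not the Hugging Face implementation; the function names and the toy word list are assumptions made for illustration.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the given pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each word starts as a tuple of single characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Ties are resolved in favour of the pair seen first
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

merges, vocab = learn_bpe("low low low lower lowest newer newer".split(), 3)
print(merges)
print(vocab)
```

Frequent words such as "low" end up as a single symbol after a few merges, while rarer words stay split into smaller pieces.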

In [None]:
!pip install tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

In [None]:
# Initialize a tokenizer with the BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Use a pre-tokenizer to split the input into words
tokenizer.pre_tokenizer = Whitespace()

# Create a trainer for the tokenizer
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])

# Train the tokenizer
tokenizer.train(files=["Rainis.txt"], trainer=trainer)

print("Vocabulary size:", tokenizer.get_vocab_size())

# Save the tokenizer on disk
tokenizer.save("Rainis_BPE_tokenizer.json")

In [None]:
# Load a pre-trained tokenizer
tokenizer = Tokenizer.from_file("Rainis_BPE_tokenizer.json")

test = tokenizer.encode(short_text_lv)
print(test.tokens) # Compare to the Rainis' word frequency list at Korpuss.lv
print(test.ids)

print()

test = tokenizer.encode("Цей текст українською мовою.")
print(test.tokens) # See Rainis.txt
print(test.ids)

## Celmošana | Stemming

In [4]:
# NLTK PorterStemmer for English

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

test_sentence_en = "These boys running were fastest off-road runners when they smiled."

stemmed_tokens = [stemmer.stem(t) for t in word_tokenize(test_sentence_en)]

print(' '.join(stemmed_tokens))

these boy run were fastest off-road runner when they smile .


In [5]:
# NLTK SnowballStemmer supports 15+ languages

from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [None]:
stemmer = SnowballStemmer('english')
stemmed_tokens = [stemmer.stem(t) for t in word_tokenize(test_sentence_en)]
print("EN:", ' '.join(stemmed_tokens))

test_sentence_se = "Dessa fiskar smakar gott."

stemmer = SnowballStemmer('swedish')
stemmed_tokens = [stemmer.stem(t) for t in word_tokenize(test_sentence_se)]
print("SE:", ' '.join(stemmed_tokens))
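
**EN**: To illustrate the principle behind stemming, here is a deliberately crude suffix stripper. It is not the Porter or Snowball algorithm, which apply ordered rule sets with extra conditions; the suffix list and the `naive_stem` helper are hypothetical.

```python
# Hypothetical suffix list, longest suffixes first
SUFFIXES = ("ers", "ing", "er", "ed", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= min_stem:
            return w[:-len(suf)]
    return w

tokens = ["These", "boys", "running", "were", "fastest", "runners", "smiled", "this"]
print(' '.join(naive_stem(t) for t in tokens))
```

Without further rules the stripper both under-stems ("runn" instead of "run") and over-stems ("thi" for "this"), which is what the real stemmers above are designed to reduce.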

## Lemmatizācija | Lemmatization

In [None]:
# Lemmatization with NLTK

nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

lemmatized_tokens = [lemmatizer.lemmatize(w) for w in word_tokenize(test_sentence_en)]

print(test_sentence_en)
print(' '.join(lemmatized_tokens))

# TODO: POS-tagging first...

In [None]:
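# Corresponding POS tags (WordNet tags)
pos_tags = ['s', 'n', 'v', 's', 'a', 'a', 'n', 'r', 's', 'v']
# The 's' tags are incorrect placeholders, used instead of None for tokens
# without a proper WordNet POS. The list has no entry for the final '.',
# so zip() below stops before it.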

for w, p in zip(word_tokenize(test_sentence_en), pos_tags):
    lemma = lemmatizer.lemmatize(w, pos=p)
    print(f"{w:10}\t{lemma}")
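
**EN**: Instead of writing the WordNet tags by hand, they can be derived from the output of a POS tagger. NLTK's `nltk.pos_tag` returns Penn Treebank tags, which can be mapped to WordNet POS letters by their first character. The sketch below shows only the mapping; the `penn_to_wordnet` helper and the hard-coded tagged tokens are assumptions made for illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'n' or 'r')."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('N'):
        return 'n'  # noun
    if tag.startswith('R'):
        return 'r'  # adverb
    # Fall back to noun, which is also the default of WordNetLemmatizer.lemmatize()
    return 'n'

# Hypothetical tagger output for a few tokens of the test sentence
tagged = [("boys", "NNS"), ("running", "VBG"), ("fastest", "JJS"), ("they", "PRP")]

for word, tag in tagged:
    print(f"{word:10}\t{tag}\t{penn_to_wordnet(tag)}")
```

The resulting letter can then be passed as the `pos` argument of `lemmatizer.lemmatize()`.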

In [None]:
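# Lemmatization with spaCy

# Re-load the spaCy pipeline: nlp_en was reassigned to a Stanza pipeline above
nlp_en = spacy.load("en_core_web_sm")
#nlp_xx = spacy.load("xx_sent_ud_sm")

# Process the EN sentence
for token in nlp_en(test_sentence_en):
    print(f"{token.text}\t{token.lemma_}")

print()

# Process the SE sentence
# Note: xx_sent_ud_sm only segments sentences and has no lemmatizer,
# so token.lemma_ is expected to be empty here
for token in nlp_xx(test_sentence_se):
    print(f"{token.text}\t{token.lemma_}")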

In [None]:
# Lemmatization with Stanza

stanza.download('en')
stanza.download('sv')

nlp_en = stanza.Pipeline(lang='en')
nlp_sv = stanza.Pipeline(lang='sv')

In [None]:
doc_en = nlp_en(test_sentence_en)

for s in doc_en.sentences:
    for w in s.words:
        print(f"{w.text}\t{w.lemma}")

In [None]:
doc_sv = nlp_sv(test_sentence_se)

for s in doc_sv.sentences:
    for w in s.words:
        print(f"{w.text}\t{w.lemma}")