# NLTK and other Python toolkits for text processing and normalization

The examples are drawn from the [NLTK book](https://www.nltk.org/book/ch02.html) and from [Anna Rogers's sessions](https://colab.research.google.com/drive/18raZBpdx5tDg3TZ015p99j2rF06qABbW?usp=drive_open#scrollTo=iAbsVVK2v-4w) at the ESSLLI 2019 summer school.

In [None]:
# Package setup
!pip install nltk spacy bpe

# Import NLTK
import nltk

# Download the Brown corpus
nltk.download('brown')

## Working with NLTK corpora

NLTK includes a diverse set of corpora which can be downloaded and processed using the nltk.corpus package.

In [None]:
from nltk.corpus import brown

# List the documents in the Brown corpus
brown.fileids()

In [None]:
# Get the size of the corpus / document
print(len(brown.words()))
print(len(brown.words('cr09')))

In [None]:
# NLTK text corpora support methods to read the corpus as raw/annotated text, a list of words, a list of sentences, or a list of paragraphs.

print("Raw text:")
print(brown.raw('cr09'))

print("Words:")
print(brown.words('cr09'))

print("Sentences:")
print(brown.sents('cr09'))

print("Paragraphs:")
print(brown.paras('cr09'))

## Using regular expressions

In [None]:
# To work with non-NLTK corpora:

!wget -O "Rainis.txt" https://repository.clarin.lv/repository/xmlui/bitstream/handle/20.500.12574/41/rainis_v20180716.txt
!wget -O "Romeo.txt" https://www.gutenberg.org/cache/epub/1112/pg1112.txt

# Alternatively, upload files from your local file system (Google Colab only; uncomment to use)
# from google.colab import files
# files.upload()

# Read both texts; the with-blocks close the files afterwards
with open('Romeo.txt', mode='r', encoding='utf-8') as file_en:
    text_en = file_en.read()
with open('Rainis.txt', mode='r', encoding='utf-8') as file_lv:
    text_lv = file_lv.read()
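
Plain text files can also be wrapped in an NLTK corpus reader, which gives them the same `fileids()`, `raw()` and `words()` interface as the bundled corpora. The sketch below is only an illustration: the two tiny files and their names are made up and written to a temporary directory, and only methods that need no extra NLTK data are called.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical two-file corpus written to a temporary directory
corpus_dir = Path(tempfile.mkdtemp())
(corpus_dir / "a.txt").write_text("Hello world. This is a test.", encoding="utf-8")
(corpus_dir / "b.txt").write_text("Sveika, pasaule!", encoding="utf-8")

# The second argument is a regular expression selecting the files to include
corpus = PlaintextCorpusReader(str(corpus_dir), r".*\.txt", encoding="utf-8")

file_ids = corpus.fileids()
raw_b = corpus.raw("b.txt")
words_a = list(corpus.words("a.txt"))  # default word tokenizer is regex-based

print(file_ids)
print(raw_b)
print(words_a)
```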

In [None]:
# Finding words
# See also: https://docs.python.org/3/library/re.html
import re

# Note: [A-Za-z], not [A-z]; the range A-z also covers the characters [ \ ] ^ _ `
words_en = re.findall(r'[A-Za-z]+', text_en)
print(words_en)

words_lv = re.findall(r'[A-Za-zĀāČčĒēĢģĪīĶķĻļŅņŠšŪūŽž]+', text_lv)
#print(words_lv)
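
A short note on character ranges, since they are easy to get subtly wrong. The sketch below compares three ways of matching "letters" on a made-up sample string: the range `A-z`, explicit ASCII ranges, and the Unicode-aware class `[^\W\d_]`, which avoids listing Latvian letters by hand.

```python
import re

sample = "snake_case [x] ^ `tick` Rīga"

# A-z spans every code point from 'A' (65) to 'z' (122), including [ \ ] ^ _ `
ascii_range = re.findall(r"[A-z]+", sample)

# Explicit ASCII ranges match only unaccented Latin letters
explicit = re.findall(r"[A-Za-z]+", sample)

# [^\W\d_] means "word characters minus digits and underscore", i.e. Unicode letters
unicode_letters = re.findall(r"[^\W\d_]+", sample)

print(ascii_range)      # ['snake_case', '[x]', '^', '`tick`', 'R', 'ga']
print(explicit)         # ['snake', 'case', 'x', 'tick', 'R', 'ga']
print(unicode_letters)  # ['snake', 'case', 'x', 'tick', 'Rīga']
```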

In [None]:
# Simplified splitting into sentences and tokens
words_en = re.split(r'\W+', text_en)
print(words_en)

words_lv = re.split(r'\W+', text_lv)
#print(words_lv)

# A period as the end-of-sentence marker (a naive assumption)
sentences = re.split(r'\.', text_en)
print(sentences)
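
To see why the period assumption is naive, the sketch below runs two regex splitters on a made-up sentence with abbreviations. The second pattern is just one possible refinement, not a recommended solution: it still breaks after "Dr.", which is the kind of case a trained sentence tokenizer is meant to handle.

```python
import re

text = "Dr. Smith arrived at 5 p.m. on Monday. He was late! Was anyone waiting?"

# Naive split: every period ends a sentence, so abbreviations are torn apart
naive = [s.strip() for s in re.split(r"\.", text) if s.strip()]

# Refinement: split on whitespace that follows . ! or ? and precedes an uppercase letter
better = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive)
# ['Dr', 'Smith arrived at 5 p', 'm', 'on Monday', 'He was late! Was anyone waiting?']
print(better)
# ['Dr.', 'Smith arrived at 5 p.m. on Monday.', 'He was late!', 'Was anyone waiting?']
```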

In [None]:
# Substitution
one_liner = re.sub(r'\s+', ' ', text_en)  # collapse every run of whitespace (spaces, tabs, newlines) into one space
print(one_liner)
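
Whitespace normalization can be checked on a small string before running it on a whole book. The sketch below uses a made-up fragment with mixed newlines, tabs and repeated spaces, and shows that `\s+` substitution and `str.split()` followed by `join` give the same result.

```python
import re

raw = "Two households,\n\n  both alike\tin dignity,\r\nIn fair   Verona"

# \s matches spaces, tabs, and all newline variants
one_line = re.sub(r"\s+", " ", raw).strip()

# Equivalent without a regular expression: split on any whitespace, then rejoin
also_one_line = " ".join(raw.split())

print(one_line)  # Two households, both alike in dignity, In fair Verona
```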

## NLTK rīki un bibliotēkas

In [None]:
# Generic RegEx-based tokenization with NLTK
nltk.regexp_tokenize(text_en, r'\w+')
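
The pattern passed to `regexp_tokenize` decides what counts as a token. As a small illustration on a made-up sentence, the sketch below compares the plain `\w+` pattern, a pattern that keeps apostrophes and punctuation, and the `gaps=True` mode, where the pattern describes the separators rather than the tokens.

```python
from nltk.tokenize import regexp_tokenize

sentence = "Romeo's dead, isn't he?"

# Word characters only: apostrophes split words and punctuation is dropped
plain = regexp_tokenize(sentence, r"\w+")

# Keep word-internal apostrophes and emit punctuation marks as separate tokens
with_punct = regexp_tokenize(sentence, r"\w+(?:'\w+)?|[^\w\s]")

# gaps=True: the pattern matches the gaps between tokens (here, whitespace)
by_gaps = regexp_tokenize(sentence, r"\s+", gaps=True)

print(plain)       # ['Romeo', 's', 'dead', 'isn', 't', 'he']
print(with_punct)  # ["Romeo's", 'dead', ',', "isn't", 'he', '?']
print(by_gaps)     # ["Romeo's", 'dead,', "isn't", 'he?']
```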

In [None]:
# Using a pre-trained tokenizer

from nltk.tokenize import sent_tokenize, word_tokenize

# Punkt Sentence Tokenizer
# This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build
# a model for abbreviation words, collocations, and words that start sentences.
# It must be trained on a large collection of plaintext in the target language before it can be used.
# See: https://www.nltk.org/api/nltk.tokenize.punkt.html

nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this resource instead of 'punkt'

nltk_words_en = word_tokenize(text_en)
print(nltk_words_en)

# NLTK ships no Punkt model for Latvian, so the default English model is applied here
nltk_words_lv = word_tokenize(text_lv)
#print(nltk_words_lv)
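
Because Punkt is unsupervised, a model for a language without a pre-trained one can be built from raw text alone. The sketch below is a minimal illustration, assuming a tiny made-up Latvian sample as training data; a real model would need a large corpus (for example the full Rainis text loaded earlier), so the quality of the split here should not be taken as representative.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Made-up training sample, repeated to give the trainer some frequency counts.
# In practice this would be a large plain-text collection in the target language.
train_text = (
    "Šodien līst lietus. Rīt būs saule. "
    "Vakarā mēs iesim uz teātri. Pēc tam brauksim mājās. "
) * 20

# Passing text to the constructor trains the model on it
tokenizer = PunktSentenceTokenizer(train_text)

test_text = "Ziema nemaz tik drīz nebeigsies. Pavasaris vēl ir tālu."
sentences = tokenizer.tokenize(test_text)

for sentence in sentences:
    print(sentence)
```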

## The spaCy library for tokenization

In [None]:
# To perform natural language processing tasks for a given language,
# we must load a language model that has been trained to perform these tasks for the language in question.
!python -m spacy download en_core_web_sm

# Tokenization with Spacy
import spacy

nlp = spacy.load("en_core_web_sm")

# Passing the variable text to the Language object nlp returns a spaCy Doc object
doc = nlp(text_en)

# Print the first 500 tokens
for token in doc[:500]:
    print(token)
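
Tokenization alone does not require a trained model. As a sketch, assuming the installed spaCy version includes Latvian language data under the code `lv`, a blank pipeline created with `spacy.blank` applies only the rule-based tokenizer and can be used on the Latvian text as well.

```python
import spacy

# A blank pipeline contains only the language-specific tokenizer, no trained components
nlp_lv = spacy.blank("lv")

doc_lv = nlp_lv("Ziema nemaz tik drīz nebeigsies.")
tokens = [token.text for token in doc_lv]

print(tokens)
```

Lemmas, part-of-speech tags and other annotations are not available from a blank pipeline; those need a trained model such as `en_core_web_sm`.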

## BPE tokenizer

In [None]:
# BPE tokenizer from https://github.com/soaxelbrooke/python-bpe
from bpe import Encoder

#file=open('raimondspauls.txt', mode='r', encoding='utf-8')
#text=file.read()

# Vocabulary of 200 entries; pct_bpe is the share reserved for byte-pair tokens, the rest are whole words
encoder = Encoder(200, pct_bpe=0.88)
encoder.fit(text_lv.split('\n'))
example = "Ziema nemaz tik drīz nebeigsies ."

print(encoder.tokenize(example))
print(next(encoder.inverse_transform(encoder.transform([example]))))

['__sow', 'zi', 'e', 'ma', '__eow', '__sow', 'n', 'e', 'ma', 'z', '__eow', '__sow', 'ti', 'k', '__eow', '__sow', 'd', 'rī', 'z', '__eow', '__sow', 'n', 'e', 'be', 'i', 'g', 's', 'ie', 's', '__eow', '.']
ziema nemaz tik drīz nebeigsies .
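
To show where subword units like the ones above come from, here is a simplified, from-scratch sketch of byte-pair encoding. It is not the implementation used by the `bpe` package (which also keeps a whole-word vocabulary and the `__sow`/`__eow` markers); the function names and the toy corpus are chosen only for illustration. The idea is to repeatedly merge the most frequent adjacent pair of symbols.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters, with its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Merge the most frequent pair (ties go to the pair seen first)
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
print(merges)                     # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(segment("lowest", merges))  # ['low', 'est']
```

The word "lowest" never occurs in the toy corpus, yet it is segmented into two units learned from other words, which is the main appeal of subword tokenization for morphologically rich languages such as Latvian.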


Various tokenization tools: https://github.com/huggingface/tokenizers/

Moses tokenizer: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

More tools in Python: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

## Stemming

In [None]:
# Porter stemmer
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# A stemmer works on single words, so tokenize the text first
word_list = nltk.word_tokenize(text_en)
stems_output = ' '.join([ps.stem(w) for w in word_list])
print(stems_output)

In [None]:
# NLTK's SnowballStemmer covers over a dozen languages (Latvian is not among them);
# the exact list depends on the installed NLTK version
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)

In [None]:
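stemmer_en = SnowballStemmer('english')
# stem() expects a single word, so tokenize the text and stem each token
print(' '.join(stemmer_en.stem(w) for w in nltk.word_tokenize(text_en)))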
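
Stemmers are easiest to compare word by word. The sketch below contrasts two of the algorithms available through `SnowballStemmer`, the `"english"` stemmer and the original `"porter"` algorithm, on a couple of hand-picked words; neither needs any downloaded data.

```python
from nltk.stem import SnowballStemmer

words = ["running", "generously"]

english = SnowballStemmer("english")  # the Snowball English stemmer
porter = SnowballStemmer("porter")    # the original Porter algorithm

english_stems = [english.stem(w) for w in words]
porter_stems = [porter.stem(w) for w in words]

for word, e_stem, p_stem in zip(words, english_stems, porter_stems):
    print(f"{word}: english={e_stem}, porter={p_stem}")
```

Stems are not guaranteed to be dictionary words: the Porter algorithm reduces "generously" to "gener", while the English Snowball stemmer stops at "generous".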

## Lemmatization tools

In [None]:
# WordNet lemmatizer

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
word_list = nltk.word_tokenize(text_en)
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])

print(lemmatized_output)

In [None]:
# Lemmatization with Spacy
lemmas_spacy = " ".join([token.lemma_ for token in doc])
print(lemmas_spacy)