We recommend not reading the notebooks on GitHub, because visualizations render poorly there and the code cannot be run. Instead, open the notebook in Google Colab using the following link:

<a target="_blank" href="https://colab.research.google.com/github/OlimpiadaAI/szkolenia/blob/edycja1/08_words_ngrams.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Words, tokenization, n-grams

In [2]:
from collections import Counter, defaultdict

import nltk
import numpy as np
from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download('brown')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package brown to
[nltk_data]     /home/witolddrzewakowski/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/witolddrzewakowski/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/witolddrzewakowski/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Tokenization

Modern language models use more advanced tokenization methods, typically subword schemes such as byte-pair encoding, rather than the word-level tokenizers shown below.

In [3]:
example_text = "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place ."
print(word_tokenize(example_text))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', "'s", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', '``', 'that', 'any', 'irregularities', 'took', 'place', '.']


In [4]:
print(nltk.wordpunct_tokenize(example_text))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', "'", 's', 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


In [5]:
print(nltk.regexp_tokenize(example_text, pattern=r'\s+', gaps=True))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
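
The last two tokenizers above can be approximated with the standard library alone, which makes their splitting rules explicit. The sketch below is only an illustration, not how the notebook computes its tokens: it assumes the `\w+|[^\w\s]+` pattern that NLTK documents for `wordpunct_tokenize`, and it uses `str.split()` as a stand-in for `regexp_tokenize` with `pattern=r'\s+'` and `gaps=True`. The names `sample`, `wordpunct_like` and `whitespace_like` are made up for this example.

```python
import re

sample = "Atlanta's recent primary election produced `` no evidence '' ."

# Runs of word characters, or runs of non-space punctuation.
wordpunct_like = re.findall(r"\w+|[^\w\s]+", sample)

# Splitting on whitespace keeps "Atlanta's" as a single token.
whitespace_like = sample.split()

print(wordpunct_like)
print(whitespace_like)
```

The only difference between the two results is how the apostrophe in "Atlanta's" is handled, which is the same difference visible in the outputs above.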


## Word normalization

In [17]:
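def normalize_word(word, mode=0):
    """Normalize a word: 0 = lowercase, 1 = WordNet lemma, 2 = Porter stem."""
    if mode == 0:
        normalized = word.lower()
    elif mode == 1:
        # With no POS argument, WordNetLemmatizer treats the word as a noun.
        lemmatizer = WordNetLemmatizer()
        normalized = lemmatizer.lemmatize(word)
    elif mode == 2:
        stemmer = nltk.PorterStemmer()
        normalized = stemmer.stem(word)
    else:
        raise ValueError(f"unknown mode: {mode}")

    return normalized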


print(f"{'word':<12}{'lower':<12}{'lemma':<12}{'stem':<12}")
print(f"{'---':<12}{'---':<12}{'---':<12}{'---':<12}")
for word in ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs']:
    print(f"{word:<12}", end='')
    for mode in [0, 1, 2]:
        print(f"{normalize_word(word, mode):<12}", end='')
    print()


word        lower       lemma       stem        
---         ---         ---         ---         
The         the         The         the         
quick       quick       quick       quick       
brown       brown       brown       brown       
fox         fox         fox         fox         
jumps       jumps       jump        jump        
over        over        over        over        
the         the         the         the         
lazy        lazy        lazy        lazi        
dogs        dogs        dog         dog         
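
One practical effect of normalization is that surface variants of a word are merged when counting. The following is a minimal sketch using only lowercasing (mode 0 above) on the example words from the table; `raw_counts` and `lower_counts` are names introduced here for illustration.

```python
from collections import Counter

words = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs']

raw_counts = Counter(words)                        # 'The' and 'the' are distinct keys
lower_counts = Counter(w.lower() for w in words)   # both collapse into 'the'

print(len(raw_counts), raw_counts['the'])      # 9 1
print(len(lower_counts), lower_counts['the'])  # 8 2
```

Lemmatization and stemming shrink the vocabulary further in the same way, at the cost of discarding distinctions such as "dog" versus "dogs".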


## Brown corpus

In [7]:
example_brown_words = brown.words()[:20]
print(example_brown_words)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that']


## N-grams


In [8]:
list(ngrams(example_brown_words, 1))

[('The',),
 ('Fulton',),
 ('County',),
 ('Grand',),
 ('Jury',),
 ('said',),
 ('Friday',),
 ('an',),
 ('investigation',),
 ('of',),
 ("Atlanta's",),
 ('recent',),
 ('primary',),
 ('election',),
 ('produced',),
 ('``',),
 ('no',),
 ('evidence',),
 ("''",),
 ('that',)]

In [9]:
list(ngrams(example_brown_words, 2))

[('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's"),
 ("Atlanta's", 'recent'),
 ('recent', 'primary'),
 ('primary', 'election'),
 ('election', 'produced'),
 ('produced', '``'),
 ('``', 'no'),
 ('no', 'evidence'),
 ('evidence', "''"),
 ("''", 'that')]

In [10]:
list(ngrams(example_brown_words, 3))

[('The', 'Fulton', 'County'),
 ('Fulton', 'County', 'Grand'),
 ('County', 'Grand', 'Jury'),
 ('Grand', 'Jury', 'said'),
 ('Jury', 'said', 'Friday'),
 ('said', 'Friday', 'an'),
 ('Friday', 'an', 'investigation'),
 ('an', 'investigation', 'of'),
 ('investigation', 'of', "Atlanta's"),
 ('of', "Atlanta's", 'recent'),
 ("Atlanta's", 'recent', 'primary'),
 ('recent', 'primary', 'election'),
 ('primary', 'election', 'produced'),
 ('election', 'produced', '``'),
 ('produced', '``', 'no'),
 ('``', 'no', 'evidence'),
 ('no', 'evidence', "''"),
 ('evidence', "''", 'that')]
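
For intuition, the same sliding windows can be produced in plain Python. This is a sketch of the idea, not NLTK's implementation, and `simple_ngrams` is a hypothetical helper. A sequence of length N yields N - n + 1 n-grams, which matches the 20, 19 and 18 tuples listed above for the 20 example words.

```python
def simple_ngrams(sequence, n):
    """Return all windows of n consecutive items as tuples."""
    sequence = list(sequence)
    # zip stops at the shortest slice, so len(sequence) - n + 1 windows are produced.
    return list(zip(*(sequence[i:] for i in range(n))))


tokens = ['an', 'investigation', 'of', "Atlanta's", 'recent']
print(simple_ngrams(tokens, 2))
print(len(simple_ngrams(tokens, 3)))  # 5 - 3 + 1 = 3
```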

## The simplest language model

In [11]:
class NGramModel:
    """Predicts the next word from the previous n words using raw counts."""

    def __init__(self, n, words):
        self.n = n
        self.words = self.normalize_words(words)
        self.next_words = self.count_next_words()
        self.probabilities = self.calculate_probabilities()

    def normalize_words(self, words):
        return [word.lower() for word in words]

    def count_next_words(self):
        # Map each n-word prefix to a Counter of the words that follow it.
        next_words = defaultdict(Counter)
        for ngram in ngrams(self.words, self.n + 1):
            prefix = ngram[:-1]
            next_word = ngram[-1]
            next_words[prefix][next_word] += 1
        return next_words

    def calculate_probabilities(self):
        # Turn the counts for each prefix into relative frequencies.
        probabilities = {}
        for prefix, words_count in self.next_words.items():
            words, counts = zip(*words_count.items())
            probabilities[prefix] = (words, np.array(counts) / sum(counts))
        return probabilities

    def generate(self, text, num_words=1):
        out = list(text)
        for _ in range(num_words):
            # Lowercase the prefix so it matches the normalized training words.
            prefix = tuple(word.lower() for word in out[-self.n:])
            next_word = self.sample_word(prefix)
            out.append(str(next_word))
        return " ".join(out)

    def sample_word(self, prefix):
        # Raises KeyError if the prefix never occurred in the training words.
        words, probs = self.probabilities[prefix]
        return np.random.choice(words, p=probs)
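
The probabilities stored by the class are relative frequencies: for a prefix and a candidate word, the estimate is the number of times the word followed that prefix divided by the total number of words seen after that prefix. The toy corpus below is made up for illustration and works through that estimate for n = 1 without NLTK or NumPy; `toy_words`, `followers` and `probs` are names introduced only for this sketch.

```python
from collections import Counter, defaultdict

toy_words = "the cat sat on the mat and the cat slept".split()

# Count which word follows each single-word prefix (n = 1).
followers = defaultdict(Counter)
for prev, nxt in zip(toy_words, toy_words[1:]):
    followers[(prev,)][nxt] += 1

# Divide each count by the prefix total to get P(next | prefix).
probs = {
    prefix: {word: count / sum(counts.values()) for word, count in counts.items()}
    for prefix, counts in followers.items()
}

print(probs[('the',)])  # 'cat' follows 'the' in 2 of 3 cases, 'mat' in 1 of 3
```

A prefix that never appears in the training words has no entry at all, which is why sampling from such a prefix fails instead of returning a low-probability word.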

In [12]:
brown_words = brown.words()[:2000]
model = NGramModel(1, brown_words)
model.generate(['In', 'the', 'best', 'interest'], 40)

'In the best interest of commerce is expected to face is expected for night in expense allowances . georgia house hopper friday , the jury said that none of possible `` there to work toward adjournment . the validity of $50 million to make'

In [13]:
brown_words = brown.words()[:2000]
model = NGramModel(3, brown_words)
model.generate(['In', 'the', 'best', 'interest'], 40)

"In the best interest of both governments '' . merger proposed however , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' . the city purchasing department , the"