# Home Assignment 2 - Natural Language Processing (Majid Sohrabi)

1. Build your own POS tagger for Russian, training it on the disambiguated part of the OpenCorpora corpus. You can choose from a number of methods described in chapters 5 and 6 of the book, such as Unigram/Bigram Taggers, AffixTagger or NaiveBayes/DecisionTree/Maxent Classifiers. Try to achieve the best performance by analyzing the errors that your tagger makes. Compare your performance with pymorphy2.

    opencorpora can be downloaded from here: http://opencorpora.org/?page=downloads

opencorpora documentation:
https://pypi.org/project/opencorpora-tools/0.3/

pymorphy2 documentation:
https://pymorphy2.readthedocs.io/en/latest/user/guide.html

2. Build a named entity classifier for Spanish based on nltk.corpus.conll2002. You can take the code of ConsecutiveNPChunker from section 3.3 of chapter 7 of the book as an example. Use the 'esp.train' file for training, 'esp.testa' for debugging and 'esp.testb' for final testing.

3. Collect the probabilistic grammar from Penn treebank corpus (nltk.corpus.treebank) using nltk.Tree.productions() method. Save the grammar to a file. Create a parser based on the grammar and try to parse some English sentences with it. If the parser fails because of missing words in the vocabulary, replace the words in the sentence with similar words of the same category. After that - does the parser create reasonably correct trees?

Note: NLTK grammar does not allow punctuation marks in category labels and does not allow a category label to begin with a dash. So you have to modify category labels extracted from the treebank to match the rules of NLTK grammar.

4. Write a function that generates a random sentence based on a given probabilistic grammar. You can use grammar.start() to identify the starting symbol of the grammar (normally S) and grammar.productions(lhs) to get all the productions with the same left hand side (lhs). Generate some sentences from the grammar collected in the previous task. Do they look reasonably grammatical?

Based on your experiments with the grammar in tasks 3 and 4 - what are the strong and weak points of the collected grammar?

## Imports

In [None]:
pip install opencorpora-tools



In [None]:
pip install pymorphy2



In [None]:
import nltk
import opencorpora
import pymorphy2 as pmp
import random
from nltk.tokenize import word_tokenize
from nltk.corpus import inaugural
from nltk.corpus import treebank
from nltk.draw.tree import draw_trees
from nltk import Tree
from sklearn.metrics import accuracy_score, recall_score, precision_score
from nltk.corpus import conll2002

## Load dataset

In [None]:
corpus = opencorpora.CorpusReader('annot.opcorpora.no_ambig.xml')

## Task 1:

In [None]:
tagged_sents = corpus.tagged_sents()
words = corpus.words()

size = int(len(tagged_sents) * 0.9)
train_sents = tagged_sents[:size]
test_sents = tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
print(unigram_tagger.evaluate(test_sents))

0.6881507842361935


In [None]:
def type_features(tagged_word):
    # tagged_word is a (token, tag) pair; only the token itself is used as a feature
    return {'word': tagged_word[0]}

featuresets = []
for tag_sen in tagged_sents:
    for word_list in tag_sen:
        # The tag is a comma-separated grammeme string; its first element is the POS
        type_word = word_list[1].split(',')[0]
        featuresets.append((type_features(word_list), type_word))

size = int(len(featuresets) * 0.9)
print('Size of features:', size)

Size of features: 84526


### Split train-test set

In [None]:
train_set, test_set = featuresets[:size], featuresets[size:]
print('First 15 elements of train set:')
print(train_set[:15], '\n')
print('*' * 50)
print('\nFirst 15 elements of test set:')
print(test_set[:15])

First 15 elements of train set:
[({'word': '«'}, 'PNCT'), ({'word': 'Школа'}, 'NOUN'), ({'word': 'злословия'}, 'NOUN'), ({'word': '»'}, 'PNCT'), ({'word': 'учит'}, 'VERB'), ({'word': 'прикусить'}, 'INFN'), ({'word': 'язык'}, 'NOUN'), ({'word': 'Сохранится'}, 'VERB'), ({'word': 'ли'}, 'PRCL'), ({'word': 'градус'}, 'NOUN'), ({'word': 'дискуссии'}, 'NOUN'), ({'word': 'в'}, 'PREP'), ({'word': 'новом'}, 'ADJF'), ({'word': 'сезоне'}, 'NOUN'), ({'word': '?'}, 'PNCT')] 

**************************************************

First 15 elements of test set:
[({'word': 'маленькую'}, 'ADJF'), ({'word': 'квартирку'}, 'NOUN'), ({'word': 'Игнатия'}, 'NOUN'), ({'word': 'Иваныча'}, 'NOUN'), ({'word': 'позвонили'}, 'VERB'), ({'word': '.'}, 'PNCT'), ({'word': '—'}, 'PNCT'), ({'word': 'Дома'}, 'ADVB'), ({'word': '?'}, 'PNCT'), ({'word': 'Встал'}, 'VERB'), ({'word': '?'}, 'PNCT'), ({'word': '..'}, 'PNCT'), ({'word': '—'}, 'PNCT'), ({'word': 'Читал'}, 'VERB'), ({'word': '?'}, 'PNCT')]


### Naive Bayes Classifier

In [None]:
classifier_1 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy for Naive Bayes Classifier:', nltk.classify.accuracy(classifier_1, test_set))

Accuracy for Naive Bayes Classifier: 0.7853492333901193


In [None]:
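# Note: classifier_1 was trained on featuresets[:size], so this dev-test slice
# overlaps the training data and the dev-test accuracy is optimistic.
train_tags = featuresets[20000:]
devtest_tags = featuresets[10000:20000]
test_tags = featuresets[:10000]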

errors = []
for name, tag in devtest_tags:
    guess = classifier_1.classify(name)
    if guess != tag:
        errors.append((tag, guess, name))

print('Length of errors:', len(errors))

Length of errors: 882


In [None]:
print("First 15 elements of errors:")
errors[:15]

First 15 elements of errors:


[('ADVB', 'NOUN', {'word': 'неясно'}),
 ('LATN', 'NOUN', {'word': 'xemyjlb'}),
 ('UNKN', 'NOUN', {'word': 'kn0pkaaaa'}),
 ('UNKN', 'NOUN', {'word': 'egr07'}),
 ('LATN', 'NOUN', {'word': 'jokermedia'}),
 ('UNKN', 'NOUN', {'word': 'гопота'}),
 ('UNKN', 'NOUN', {'word': 'ширялась'}),
 ('UNKN', 'NOUN', {'word': 'zverek300'}),
 ('LATN', 'NOUN', {'word': 'irdis'}),
 ('UNKN', 'NOUN', {'word': 'gleb_tcherkasov'}),
 ('LATN', 'NOUN', {'word': 'dmitrygalkin'}),
 ('UNKN', 'NOUN', {'word': 'dodger_37'}),
 ('ADJS', 'NOUN', {'word': 'признательны'}),
 ('LATN', 'NOUN', {'word': 'iukka'}),
 ('INFN', 'NOUN', {'word': 'бывать'})]
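
Most of the errors above are tokens the classifier has never seen (Latin-script nicknames, tokens with digits, rare word forms), which it labels as NOUN. Since the only feature is the word itself, one possible next step is to add features that generalize to unseen tokens. The function below is a minimal sketch of that idea; the name `richer_features` and all feature names are hypothetical, it takes a plain string rather than the (token, tag) pair used by `type_features`, and whether it actually improves accuracy on this corpus has not been measured here.

```python
def richer_features(word):
    """Hypothetical feature extractor; the feature names are illustrative only."""
    lower = word.lower()
    return {
        'word': lower,                                       # the token itself, case-folded
        'suffix2': lower[-2:],                               # short suffixes often carry
        'suffix3': lower[-3:],                               # Russian inflection information
        'has_latin': any('a' <= ch <= 'z' for ch in lower),  # hint for LATN-like tokens
        'has_digit': any(ch.isdigit() for ch in lower),      # hint for mixed tokens
        'is_capitalized': word[:1].isupper(),
    }


# Tokens taken from the error list above
for w in ['jokermedia', 'egr07', 'бывать']:
    print(w, richer_features(w))
```

To try it, the feature sets would be rebuilt with `richer_features(word_list[0])` in place of `type_features(word_list)` and the classifier retrained on the new training split.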

In [None]:
# Accuracy of the Naive Bayes classifier on the dev-test slice
print('The accuracy of classifier:', nltk.classify.accuracy(classifier_1, devtest_tags))

The accuracy of classifier: 0.9118


### Decision Tree Classifier

In [None]:
classifier_2 = nltk.DecisionTreeClassifier.train(train_set)
print('Accuracy for decision tree classifier:', nltk.classify.accuracy(classifier_2, test_set))

Accuracy for decision tree classifier: 0.7042163543441227


In [None]:
words_type = []
for sent in tagged_sents:
  for word in sent:
    word_split = word[1].split(',') 
    word_type = word_split[0]
    words_type.append(({'' : word[0]}, word_type))

size3 = int(len(words_type) * 0.9)
print(size3)
train_sents3 = words_type[:size3]
test_sents3 = words_type[size3:]

84526


### Comparing with pymorphy2

In [None]:
morph = pmp.MorphAnalyzer()

pred = []
for sent in train_sents3:
  # Each item is ({'': token}, pos); take the token and keep pymorphy2's first parse
  x = morph.parse(list(sent[0].values())[0])
  pred.append(x[0].tag.POS)
print(pred)


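# Drop positions where pymorphy2 returned no POS (None), so the accuracy
# is measured only on the tokens it could label.
ind = []
pred2 = []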
for i in range(len(pred)):
    if pred[i] is not None:
        pred2.append(pred[i])
    else:
        ind.append(i)

def delete(list_obj, indices):
    indices = sorted(indices, reverse = True)
    for index in indices:
        if index < len(list_obj):
            list_obj.pop(index)

labels = [label for dic, label in train_sents3]
print(labels)

delete(labels, ind)

[None, 'NOUN', 'NOUN', None, 'VERB', 'INFN', 'NOUN', 'VERB', 'CONJ', 'NOUN', 'NOUN', 'PREP', 'ADJF', 'NOUN', None, 'ADJF', None, 'NOUN', 'NOUN', None, 'VERB', 'PREP', 'NOUN', 'PREP', 'ADJF', 'NOUN', 'PREP', 'ADJF', 'NOUN', None, 'ADVB', 'NOUN', 'VERB', 'PREP', None, 'NOUN', None, 'PREP', 'NOUN', None, 'PRCL', 'PREP', 'ADJF', 'NOUN', 'PREP', 'NOUN', 'ADVB', 'PRCL', 'PRTF', 'ADJF', 'NOUN', None, 'PRTF', 'PREP', None, 'NOUN', None, None, 'VERB', None, 'PREP', 'ADJF', 'NOUN', None, 'INFN', 'NOUN', 'NOUN', None, 'PREP', 'ADJF', None, 'INFN', 'NOUN', 'NOUN', None, 'PRCL', 'VERB', 'NOUN', 'PREP', 'NOUN', None, 'CONJ', 'PREP', 'NOUN', 'NOUN', None, 'PRTF', 'PREP', 'ADJF', 'NOUN', 'PREP', 'ADJF', 'CONJ', 'PREP', 'NOUN', 'PRTF', 'PREP', None, 'NOUN', 'NOUN', None, None, 'PREP', 'NOUN', 'NOUN', 'CONJ', 'NOUN', 'NOUN', 'VERB', 'INFN', 'ADVB', 'ADJF', 'NOUN', None, 'NOUN', None, 'NOUN', None, 'CONJ', 'ADVB', 'PRCL', 'ADJF', 'NOUN', None, 'ADJF', 'NOUN', 'CONJ', 'NOUN', None, 'NOUN', None, 'NOUN', '

In [None]:
print('Accuracy, pymorphy2: ', accuracy_score(labels, pred2))

Accuracy, pymorphy2:  0.9558443337271217
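
The score above is computed only over the tokens for which pymorphy2 returned a POS, because the `None` predictions were removed first. It is therefore not directly comparable with the classifier accuracies, which cover every token. A minimal sketch of reporting coverage alongside accuracy follows; the helper name `score_with_coverage` is hypothetical and the toy labels are made up for illustration, not taken from the run above.

```python
def score_with_coverage(gold, predicted):
    """Return (coverage, accuracy on covered tokens, accuracy on all tokens)."""
    # Keep only the positions where a prediction exists
    covered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in covered if g == p)
    coverage = len(covered) / len(gold)
    acc_covered = correct / len(covered) if covered else 0.0
    acc_all = correct / len(gold)  # a missing prediction counts as an error
    return coverage, acc_covered, acc_all


# Made-up toy labels, not taken from the corpus run above
gold = ['PNCT', 'NOUN', 'NOUN', 'PNCT', 'VERB']
predicted = [None, 'NOUN', 'ADJF', None, 'VERB']
print(score_with_coverage(gold, predicted))
```

With the notebook's variables this would be called on the unfiltered lists, that is, the gold labels from `train_sents3` and `pred` before any deletion.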


## Task 2:

In [None]:
nltk.download('conll2002')
print('File id for conll2002: ', conll2002.fileids())

[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!
File id for conll2002:  ['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']


### Class of Unigram Chunker

In [None]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [None]:
debug_sents = conll2002.chunked_sents('esp.testa')
test_sents = conll2002.chunked_sents('esp.testb')
train_sents = conll2002.chunked_sents('esp.train')
unigram_chunker = UnigramChunker(train_sents)

print(unigram_chunker.evaluate(debug_sents))

ChunkParse score:
    IOB Accuracy:  86.1%%
    Precision:     46.2%%
    Recall:         4.6%%
    F-Measure:      8.4%%


### Class of Bigram Chunker

In [None]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [None]:
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(debug_sents))

ChunkParse score:
    IOB Accuracy:  86.1%%
    Precision:     47.4%%
    Recall:         7.2%%
    F-Measure:     12.6%%


### Class of Consecutive NP Chunk Tagger

In [None]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (_, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

### Class of Consecutive NP Chunker

In [None]:
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

In [None]:
tagged_sents = [[((w,t),c) for (w,t,c) in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
print('First 15 elements of tagged sentences:')
tagged_sents[:15]

First 15 elements of tagged sentences:


[[(('Melbourne', 'NP'), 'B-LOC'),
  (('(', 'Fpa'), 'O'),
  (('Australia', 'NP'), 'B-LOC'),
  ((')', 'Fpt'), 'O'),
  ((',', 'Fc'), 'O'),
  (('25', 'Z'), 'O'),
  (('may', 'NC'), 'O'),
  (('(', 'Fpa'), 'O'),
  (('EFE', 'NC'), 'B-ORG'),
  ((')', 'Fpt'), 'O'),
  (('.', 'Fp'), 'O')],
 [(('-', 'Fg'), 'O')],
 [(('El', 'DA'), 'O'),
  (('Abogado', 'NC'), 'B-PER'),
  (('General', 'AQ'), 'I-PER'),
  (('del', 'SP'), 'I-PER'),
  (('Estado', 'NC'), 'I-PER'),
  ((',', 'Fc'), 'O'),
  (('Daryl', 'VMI'), 'B-PER'),
  (('Williams', 'NC'), 'I-PER'),
  ((',', 'Fc'), 'O'),
  (('subrayó', 'VMI'), 'O'),
  (('hoy', 'RG'), 'O'),
  (('la', 'DA'), 'O'),
  (('necesidad', 'NC'), 'O'),
  (('de', 'SP'), 'O'),
  (('tomar', 'VMN'), 'O'),
  (('medidas', 'NC'), 'O'),
  (('para', 'SP'), 'O'),
  (('proteger', 'VMN'), 'O'),
  (('al', 'SP'), 'O'),
  (('sistema', 'NC'), 'O'),
  (('judicial', 'AQ'), 'O'),
  (('australiano', 'AQ'), 'O'),
  (('frente', 'RG'), 'O'),
  (('a', 'SP'), 'O'),
  (('una', 'DI'), 'O'),
  (('página', 'NC'),

In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(debug_sents))

ChunkParse score:
    IOB Accuracy:  86.1%%
    Precision:     46.2%%
    Recall:         4.6%%
    F-Measure:      8.4%%


In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(debug_sents))

ChunkParse score:
    IOB Accuracy:  86.6%%
    Precision:     65.3%%
    Recall:        12.4%%
    F-Measure:     20.9%%


In [None]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(debug_sents))

ChunkParse score:
    IOB Accuracy:  90.0%%
    Precision:     35.5%%
    Recall:        56.7%%
    F-Measure:     43.7%%
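
Adding the word feature raises recall sharply while precision drops, so the F-measure is the number to compare across runs. Assuming the score uses the default equal weighting of precision and recall, it is their harmonic mean, which can be checked against the figures above:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 35.5% and recall 56.7% from the run above
print(round(100 * f_measure(0.355, 0.567), 1))  # 43.7
```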


In [None]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  91.7%%
    Precision:     37.8%%
    Recall:        61.5%%
    F-Measure:     46.8%%
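
The feature functions above never use the `history` argument and ignore capitalization, although both are plausible cues for named entities (an `I-PER` tag can only follow a `B-PER` or `I-PER` tag, and Spanish names are usually capitalized). The sketch below shows one possible richer feature set; the name `ner_features` and the feature names are hypothetical, and its effect on the scores has not been measured here.

```python
def ner_features(sentence, i, history):
    """Hypothetical richer feature set for the consecutive tagger."""
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos, prevtag = "<START>", "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
        prevtag = history[i - 1]  # tag already assigned to the previous token
    return {
        "pos": pos,
        "word": word.lower(),
        "prevpos": prevpos,
        "prevword": prevword.lower(),
        "prevtag": prevtag,
        "is_title": word.istitle(),   # e.g. 'Australia'
        "is_upper": word.isupper(),   # e.g. 'EFE'
        "suffix3": word[-3:].lower(),
    }


# First tokens of the training sentence shown earlier
sentence = [("Melbourne", "NP"), ("(", "Fpa"), ("Australia", "NP")]
print(ner_features(sentence, 2, ["B-LOC", "O"]))
```

To evaluate it, the function would have to be assigned to the name `npchunk_features` before constructing `ConsecutiveNPChunker`, since the tagger class looks that name up at call time.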


## Task 3:

In [None]:
nltk.download('treebank')
print(nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[1])

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
(S
  (NP-SBJ (NNP Mr.) (NNP Vinken))
  (VP
    (VBZ is)
    (NP-PRD
      (NP (NN chairman))
      (PP
        (IN of)
        (NP
          (NP (NNP Elsevier) (NNP N.V.))
          (, ,)
          (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))))))
  (. .))
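
The tree above contains category labels such as `,` and `.`, and the treebank also uses labels that start with a dash. According to the note in the task statement, such labels have to be rewritten before the grammar can be written out and read back by NLTK. The following is a minimal sketch of one possible renaming scheme; the function name `sanitize_label` and the replacement names (`COMMA`, `PERIOD`, and so on) are arbitrary choices made for illustration.

```python
import re

# Arbitrary replacement names for labels that consist only of punctuation
PUNCT_NAMES = {
    ",": "COMMA", ".": "PERIOD", ":": "COLON",
    "``": "LQUOTE", "''": "RQUOTE", "$": "DOLLAR", "#": "HASH",
}


def sanitize_label(label):
    """Rewrite a treebank category label so it avoids punctuation and leading dashes."""
    if label in PUNCT_NAMES:
        return PUNCT_NAMES[label]
    cleaned = label.strip("-")                  # '-NONE-' -> 'NONE'
    cleaned = cleaned.replace("$", "_DOLLAR")   # 'PRP$'   -> 'PRP_DOLLAR'
    cleaned = re.sub(r"[^\w-]", "_", cleaned)   # any other punctuation -> '_'
    return cleaned or "UNK"


for label in [",", ".", "-NONE-", "NP-SBJ", "PRP$"]:
    print(repr(label), "->", sanitize_label(label))
```

With NLTK trees, the renaming could presumably be applied by walking `tree.subtrees()` and calling `set_label(sanitize_label(subtree.label()))` on each subtree before collecting productions.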


In [None]:
grammar = nltk.PCFG.fromstring('''
  NP  -> NNS [0.2]
  NP -> JJ NNS [0.3]
  NP -> NP CC NP [0.5]
  NNS -> 'cats' [0.4]
  NNS -> "dogs" [0.2]
  NNS -> "mice" [0.3]
  NNS -> NNS CC NNS [0.1]
  JJ  -> "big" [0.4]
  JJ -> "small" [0.6]
  CC  -> "and" [0.9]
  CC -> "or" [0.1]
  ''')

prob_prod = grammar.productions()
print(prob_prod[1].lhs())

print(prob_prod[1].rhs())

print(prob_prod[1].prob())

print(nltk.corpus.treebank.fileids()[:2])

NP
(JJ, NNS)
0.3
['wsj_0001.mrg', 'wsj_0002.mrg']


In [None]:
prob_prods = []
tree_grammar = []
for item in nltk.corpus.treebank.fileids()[:2]:
    for tree in nltk.corpus.treebank.parsed_sents(item):
        tr = str(tree)
        tree_grammar.append(tr)
        prob_prods += tree.productions()


for i in range(len(tree_grammar)):
    with open('tree_{0}.txt'.format(str(i)), 'w') as f:
        f.writelines(tree_grammar[i])


tree_files =[]
for i in range(0,3):
  x = 'tree_{0}.txt'.format(str(i))
  tree_files.append(x)

print(tree_files)

['tree_0.txt', 'tree_1.txt', 'tree_2.txt']


In [None]:
prob_prods = []
for file in tree_files:
    with open(file, 'r') as f:
        # Rebuild the tree from its bracketed string
        tree = nltk.Tree.fromstring(f.read())
    # Collect the productions used in this tree
    prob_prods += tree.productions()


S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, prob_prods)

print(grammar)

Grammar with 71 productions (start state = S)
    S -> NP-SBJ VP . [0.5]
    NP-SBJ -> NP , ADJP , [0.333333]
    NP -> NNP NNP [0.2]
    NNP -> 'Pierre' [0.0714286]
    NNP -> 'Vinken' [0.142857]
    , -> ',' [1.0]
    ADJP -> NP JJ [1.0]
    NP -> CD NNS [0.133333]
    CD -> '61' [0.333333]
    NNS -> 'years' [1.0]
    JJ -> 'old' [0.285714]
    VP -> MD VP [0.2]
    MD -> 'will' [1.0]
    VP -> VB NP PP-CLR NP-TMP [0.2]
    VB -> 'join' [1.0]
    NP -> DT NN [0.0666667]
    DT -> 'the' [0.4]
    NN -> 'board' [0.142857]
    PP-CLR -> IN NP [1.0]
    IN -> 'as' [0.25]
    NP -> DT JJ NN [0.133333]
    DT -> 'a' [0.4]
    JJ -> 'nonexecutive' [0.285714]
    NN -> 'director' [0.285714]
    NP-TMP -> NNP CD [1.0]
    NNP -> 'Nov.' [0.0714286]
    CD -> '29' [0.333333]
    . -> '.' [1.0]
    NP-SBJ -> NNP NNP [0.333333]
    NNP -> 'Mr.' [0.0714286]
    VP -> VBZ NP-PRD [0.2]
    VBZ -> 'is' [1.0]
    NP-PRD -> NP PP [1.0]
    NP -> NN [0.0666667]
    NN -> 'chairman' [0.285714]
    PP ->
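
The probabilities in brackets are relative frequencies: each production's count divided by the total count of productions with the same left-hand side. The sketch below reproduces that estimate on made-up toy productions written as plain `(lhs, rhs)` tuples, so it runs without NLTK; the helper name `relative_frequencies` is hypothetical.

```python
from collections import Counter


def relative_frequencies(productions):
    """Map each (lhs, rhs) pair to count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(productions)
    lhs_counts = Counter(lhs for lhs, _ in productions)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Made-up toy productions, one entry per occurrence in the (imaginary) trees
toy = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("NN",)),
    ("VP", ("VBZ", "NP")),
]
for (lhs, rhs), prob in relative_frequencies(toy).items():
    print(lhs, "->", " ".join(rhs), f"[{prob:.6f}]")
```

This also explains why a grammar collected from a handful of trees assigns probability 1.0 to many rules: the left-hand side was simply seen with one expansion only.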

In [None]:
parse_grammar = nltk.pchart.InsideChartParser(grammar)
parse_grammar.trace(3)
sents = nltk.corpus.treebank.parsed_sents('wsj_0002.mrg')[:2]

for sent in sents:
    sent_leaves = sent.leaves()
    print(sent_leaves)
    for parse in parse_grammar.parse(sent_leaves):
        print(parse) 

['Rudolph', 'Agnew', ',', '55', 'years', 'old', 'and', 'former', 'chairman', 'of', 'Consolidated', 'Gold', 'Fields', 'PLC', ',', 'was', 'named', '*-1', 'a', 'nonexecutive', 'director', 'of', 'this', 'British', 'industrial', 'conglomerate', '.']
  |[-] . . . . . . . . . . . . . . . . . . . . . . . . . .| [0:1] 'Rudolph' [1.0]
  |. [-] . . . . . . . . . . . . . . . . . . . . . . . . .| [1:2] 'Agnew' [1.0]
  |. . [-] . . . . . . . . . . . . . . . . . . . . . . . .| [2:3] ',' [1.0]
  |. . . [-] . . . . . . . . . . . . . . . . . . . . . . .| [3:4] '55' [1.0]
  |. . . . [-] . . . . . . . . . . . . . . . . . . . . . .| [4:5] 'years' [1.0]
  |. . . . . [-] . . . . . . . . . . . . . . . . . . . . .| [5:6] 'old' [1.0]
  |. . . . . . [-] . . . . . . . . . . . . . . . . . . . .| [6:7] 'and' [1.0]
  |. . . . . . . [-] . . . . . . . . . . . . . . . . . . .| [7:8] 'former' [1.0]
  |. . . . . . . . [-] . . . . . . . . . . . . . . . . . .| [8:9] 'chairman' [1.0]
  |. . . . . . . . . [-] . . . . . . . .
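
For sentences that are not in the collected trees, the parser can only work if every word is a terminal of the grammar (NLTK grammars offer `check_coverage` to test this). The task suggests replacing missing words with known words of the same category. A minimal sketch of that replacement step follows; the helper name `replace_unknown`, the toy lexicon and the substitution table are all made up for illustration, and in practice the lexicon would be collected from the grammar's lexical productions.

```python
def replace_unknown(tokens, lexicon, substitutes):
    """Swap out-of-vocabulary tokens for known stand-ins; report the ones left over."""
    replaced, still_missing = [], []
    for tok in tokens:
        if tok in lexicon:
            replaced.append(tok)
        elif substitutes.get(tok) in lexicon:
            # A hand-picked word of the same category that the grammar knows
            replaced.append(substitutes[tok])
        else:
            replaced.append(tok)
            still_missing.append(tok)
    return replaced, still_missing


# Toy lexicon and hand-written substitutions (illustrative assumptions)
lexicon = {'Mr.', 'Vinken', 'is', 'chairman', 'of', 'the', 'board', '.'}
substitutes = {'Smith': 'Vinken', 'president': 'chairman'}
tokens = ['Mr.', 'Smith', 'is', 'president', 'of', 'the', 'board', '.']
print(replace_unknown(tokens, lexicon, substitutes))
```

Any token reported in the second list still needs a manual substitute before the sentence is handed to the parser.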

## Task 4:

In [None]:
def generate_random_sent(grammar, symb):
    words = []
    # Pick one production for this symbol uniformly at random
    # (the production probabilities are ignored here)
    production = random.choice(grammar.productions(lhs=symb))
    for sym in production.rhs():
        if isinstance(sym, str):
            # Terminals are plain strings
            words.append(sym)
        else:
            # Nonterminals are expanded recursively
            words.extend(generate_random_sent(grammar, sym))
    return words

print(' '.join(generate_random_sent(grammar, grammar.start())))

British group , director British , will was named a board , a Agnew publishing conglomerate former , join this nonexecutive chairman of a PLC publishing group Dutch 29 . .
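
The generator above chooses among productions with `random.choice`, so every expansion of a symbol is equally likely and the probabilities of the grammar play no role. A sketch of sampling in proportion to the rule probabilities is given below. To keep it runnable without NLTK it uses a made-up toy grammar stored in a plain dict (the names `TOY_PCFG` and `generate_weighted` are hypothetical); with the NLTK grammar, the weights would presumably come from each production's `prob()`.

```python
import random

# Made-up toy PCFG: symbol -> list of (right-hand side, probability).
# Any symbol that is not a key of the dict is treated as a terminal.
TOY_PCFG = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.6), (("N",), 0.4)],
    "N": [(("dogs",), 0.5), (("cats",), 0.5)],
    "VP": [(("sleep",), 0.7), (("chase", "NP"), 0.3)],
}


def generate_weighted(grammar, symbol, rng):
    """Expand `symbol`, choosing each production with its stated probability."""
    if symbol not in grammar:
        return [symbol]  # terminal
    expansions = grammar[symbol]
    rhs = rng.choices(
        [r for r, _ in expansions],
        weights=[p for _, p in expansions],
        k=1,
    )[0]
    words = []
    for sym in rhs:
        words.extend(generate_weighted(grammar, sym, rng))
    return words


rng = random.Random(0)  # fixed seed so repeated runs give the same sentences
for _ in range(3):
    print(' '.join(generate_weighted(TOY_PCFG, "S", rng)))
```

The toy grammar is not recursive; a grammar with recursive rules (such as `NP -> NP CC NP`) can produce very deep expansions, so a depth limit may be needed there.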


### Comments:
As we can see, the generated sentences do not look reasonably grammatical. The collected grammar was induced from only a few trees, so it is limited in the new sentences it can generate, and the output makes little sense. This suggests that as we grow the set of trees the grammar is collected from, the quality of the generated sentences should increase.