
1. How does the precision of an n-gram tagger behave as you increase the value of n from one to k, where k > 3 is a value of your choice (depending on the computing resources you have at hand)? You are free to choose your own corpus (it does not have to be Brown, as in the examples).

    - "As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval)." (NLP Python book) So a larger n can increase precision on the words that do get tagged (fewer false positives), but far fewer words will be tagged at all. The taggers below have no backoff, and untagged words count as errors, so the measured accuracy falls sharply as n grows. We can see it here:
    
    uni = nltk.UnigramTagger(train_sents) # Unigram (n=1) accuracy
    print(uni.accuracy(test_sents))

    bi = nltk.BigramTagger(train_sents) # Bigram (n=2) accuracy
    print(bi.accuracy(test_sents))

    tri = nltk.TrigramTagger(train_sents) # Trigram (n=3) accuracy
    print(tri.accuracy(test_sents))

    quad = nltk.NgramTagger(n=4, train=train_sents) # 4-gram accuracy
    print(quad.accuracy(test_sents))

    octa = nltk.NgramTagger(n=9, train=train_sents) # 9-gram accuracy
    print(octa.accuracy(test_sents))

    hund = nltk.NgramTagger(n=100, train=train_sents) # 100-gram accuracy
    print(hund.accuracy(test_sents))

    Accuracy results:
    
    uni = 0.8121200039868434
    
    bi = 0.10206319146815508
    
    tri = 0.0626931127279976
    
    quad = 0.05511811023622047
    
    octa = 0.05372271504036679
    
    hund = 0.05372271504036679


2. What is the effect on that precision when sentence breaks are taken into account versus when they are ignored? (See the section Tagging Across Sentence Boundaries in the Python textbook, Chapter 5.)

    - NLTK taggers are designed to work with lists of sentences, where each sentence is a list of words. At the start of a sentence, t_{n-1} and the preceding tags are set to None. If the taggers instead considered context that crosses the sentence boundary, precision would drop and false positives would increase, because the lexical category that closed the previous sentence has no bearing on the one that begins the next sentence.

    
3. Compare the accuracy and the training time of a non-transformer tagger to a transformer tagger (each one of your own choice) on at least three different corpora (again of your choice).

    - The transformation-based (Brill) tagger took roughly 500x longer to train than the combined n-gram tagger and was only about half as accurate. The combined n-gram tagger achieved 84% accuracy and took no more than 1 second to train, while the Brill-based one achieved 42% accuracy and took over 500 seconds to train.
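
The sentence-boundary point in answer 2 can be illustrated with a small sketch. This is not NLTK's internal code; `tag_histories` is a hypothetical helper, and the tag sequences are made up. It lists the (n-1)-tag history available before each token, once with the history reset to None at every sentence start and once with the sentences flattened into one stream.

```python
from typing import List, Optional, Tuple

History = Tuple[Optional[str], ...]


def tag_histories(tag_seqs: List[List[str]], n: int,
                  respect_boundaries: bool) -> List[History]:
    """Return the (n-1)-tag history that precedes each token.

    With respect_boundaries=True the history is padded with None at the
    start of every sentence; otherwise all sentences are treated as one
    continuous stream, so history leaks across sentence breaks.
    """
    if not respect_boundaries:
        # Flatten all sentences into a single pseudo-sentence.
        tag_seqs = [[tag for seq in tag_seqs for tag in seq]]
    histories: List[History] = []
    for seq in tag_seqs:
        padded: List[Optional[str]] = [None] * (n - 1) + list(seq)
        for i in range(len(seq)):
            histories.append(tuple(padded[i:i + n - 1]))
    return histories


# Two made-up sentences, given only as tag sequences.
toy_tags = [["DT", "NN", "VBD", "."], ["PRP", "VBD", "."]]

with_breaks = tag_histories(toy_tags, n=2, respect_boundaries=True)
without_breaks = tag_histories(toy_tags, n=2, respect_boundaries=False)

# Index 4 is the first token of the second sentence.
print(with_breaks[4], without_breaks[4])
```

With boundaries respected, the first token of the second sentence sees only None as history; with boundaries ignored it sees the "." tag that closed the previous sentence, which is the uninformative context the answer warns about.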

## Question 1

In [40]:
## N-gram implementation

import nltk
from nltk.tag import brill, brill_trainer
from nltk.tag import DefaultTagger
from nltk.corpus import brown


# Obtain data
brown_tagged_sents = brown.tagged_sents(categories='news') #Tagged sentences 
brown_sents = brown.sents(categories='news') #Untagged sentences

#Determine the train/test split - setting training data at 90%
size = int(len(brown_tagged_sents) * 0.9)

#Split the data of tagged sentences
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

uni = nltk.UnigramTagger(train_sents) # Unigram (n=1) accuracy
print(uni.accuracy(test_sents))

bi = nltk.BigramTagger(train_sents) # Bigram (n=2) accuracy
print(bi.accuracy(test_sents))

tri = nltk.TrigramTagger(train_sents) # Trigram (n=3) accuracy
print(tri.accuracy(test_sents))

quad = nltk.NgramTagger(n=4, train=train_sents) # 4-gram accuracy
print(quad.accuracy(test_sents))

octa = nltk.NgramTagger(n=9, train=train_sents) # 9-gram accuracy
print(octa.accuracy(test_sents))

hund = nltk.NgramTagger(n=100, train=train_sents) # 100-gram accuracy
print(hund.accuracy(test_sents))

0.8121200039868434
0.10206319146815508
0.0626931127279976
0.05511811023622047
0.05372271504036679
0.05372271504036679
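
The falling scores above are consistent with the sparse data problem: the longer the context, the less likely it is to have been seen in training. A minimal sketch of that effect, assuming a made-up toy corpus and a hypothetical helper `context_coverage`, is shown here. As a simplification it builds the history from gold tags, whereas a real tagger uses its own predicted tags, so actual coverage would typically be lower still.

```python
from typing import Iterator, List, Optional, Tuple

TaggedSent = List[Tuple[str, str]]
Context = Tuple[Tuple[Optional[str], ...], str]


def contexts(sent: TaggedSent, n: int) -> Iterator[Context]:
    """Yield (previous n-1 tags, current word) for each token.

    The history is padded with None at the start of the sentence.
    """
    padded: List[Optional[str]] = [None] * (n - 1) + [tag for _, tag in sent]
    for i, (word, _) in enumerate(sent):
        yield (tuple(padded[i:i + n - 1]), word)


def context_coverage(train: List[TaggedSent], test: List[TaggedSent],
                     n: int) -> float:
    """Fraction of test contexts that also occur in the training data."""
    seen = {c for sent in train for c in contexts(sent, n)}
    test_contexts = [c for sent in test for c in contexts(sent, n)]
    hits = sum(c in seen for c in test_contexts)
    return hits / len(test_contexts)


# Made-up toy data, only for illustration.
toy_train = [
    [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("dog", "NN"), ("saw", "VBD"), ("me", "PRP")],
]
toy_test = [
    [("the", "DT"), ("dog", "NN"), ("saw", "VBD"), ("the", "DT"), ("dog", "NN")],
]

for n in range(1, 5):
    print(n, context_coverage(toy_train, toy_test, n))
```

Every word of the toy test sentence appears in training, so coverage is complete for n=1, but it shrinks as n grows because the longer tag histories were never observed.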


## Question 3

In [104]:
# Obtain data
brown_tagged_sents = brown.tagged_sents(categories='learned') #Tagged sentences 
brown_sents = brown.sents(categories='learned') #Untagged sentences

print(brown_tagged_sents)
#Determine the train/test split - setting training data at 90%
size = int(len(brown_tagged_sents) * 0.9)

#Split the data of tagged sentences
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

[[('1', 'CD-HL'), ('.', '.-HL')], [('Introduction', 'NN-HL')], ...]


In [98]:
%%time
# Build out a combined n-gram tagger



#Building the combined taggers 
t0 = nltk.DefaultTagger('NN') # Last resort: tag everything as a noun if the n-gram taggers fail
t1 = nltk.UnigramTagger(train_sents, backoff=t0) # Used when the bigram tagger fails
t2 = nltk.BigramTagger(train_sents, backoff=t1) # Used when the trigram tagger fails
t3 = nltk.TrigramTagger(train_sents, backoff=t2) # Main tagger

#Obtain accuracy score
t3.accuracy(test_sents)

CPU times: total: 797 ms
Wall time: 799 ms


0.8394620117327228

In [117]:
object_methods = [method_name for method_name in dir(brown)
                  if callable(getattr(brown, method_name))]
object_methods

['__class__',
 '__delattr__',
 '__dir__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_add',
 '_get_root',
 '_init',
 '_para_block_reader',
 '_resolve',
 '_unload',
 'abspath',
 'abspaths',
 'categories',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'license',
 'open',
 'paras',
 'raw',
 'readme',
 'sents',
 'tagged_paras',
 'tagged_sents',
 'tagged_words',
 'words']

In [99]:
%%time
# Build a transformation-based (Brill) model

# specify a model to train
init_tagger = DefaultTagger('NN') # a default starting point
templates = nltk.tag.brill.nltkdemo18() # use the demo templates

# Define the initial tagger and the templates that the Brill trainer will use
b_trainer = brill_trainer.BrillTaggerTrainer(init_tagger, templates) # https://www.nltk.org/_modules/nltk/tag/brill_trainer.html

#Determine the train/test split - setting training data at 90%
size = int(len(brown_tagged_sents) * 0.9)

#Split the data of tagged sentences
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]


max_r = 200 # just a few rules so that this will not take too long

brill_tagger = b_trainer.train(train_sents, max_rules = max_r) # train the model with the data

CPU times: total: 5min 24s
Wall time: 5min 24s


In [101]:
%%time
brill_tagger.accuracy(test_sents) # accuracy() replaces the deprecated evaluate()



CPU times: total: 312 ms
Wall time: 308 ms


0.463013306624696

In [116]:
import pandas as pd
import matplotlib.pyplot as plt

d = {'dataset': ['Brown - News', 'Brown - Editorial'], 'n_runtime': [1, 0.7], 'n_accuracy': [0.841, 0.839], 'b_runtime': [526, 324], 'b_accuracy': [0.42, 0.46]}
results_df = pd.DataFrame(data=d)
results_df
# results_df.plot.bar(x='dataset', y = ['n_runtime', 'n_accuracy', 'b_runtime', 'b_accuracy'])

|   | dataset | n_runtime | n_accuracy | b_runtime | b_accuracy |
|---|---|---|---|---|---|
| 0 | Brown - News | 1.0 | 0.841 | 526 | 0.42 |
| 1 | Brown - Editorial | 0.7 | 0.839 | 324 | 0.46 |
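
The runtimes in this table were copied by hand from the `%%time` output. One possible way to collect them programmatically is sketched below. `timed` is a hypothetical helper, not part of NLTK, and the stand-in workload only demonstrates the call pattern.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would wrap a training call, e.g.
#   tagger, secs = timed(lambda: nltk.TrigramTagger(train_sents, backoff=t2))
total, secs = timed(lambda: sum(range(1000)))
print(total, secs)
```

Wrapping each training call this way would return the trained tagger together with its training time, so both values could go straight into the results dictionary.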
