## Chapter 6

#### 1. Using the Naive Bayes classifier described in this chapter, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

In [1]:
import nltk
from nltk.corpus import names 
import random


labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
   [(name, 'female') for name in names.words('female.txt')])

random.seed(1) 
random.shuffle(labeled_names)

In [2]:
def get_features(word):
    return {'First_letter': word[0].lower()}

In [3]:
train_names = labeled_names[1000:]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[:500]

In [4]:
train_set = [(get_features(n), gender) for (n, gender) in train_names]
devtest_set = [(get_features(n), gender) for (n, gender) in devtest_names]
test_set = [(get_features(n), gender) for (n, gender) in test_names]

In [5]:
classifier = nltk.NaiveBayesClassifier.train(train_set) 
print(nltk.classify.accuracy(classifier, devtest_set)) 

0.634


In [6]:
def get_features2(word):
    features = {}
    features['First_letter'] = word[0].lower()
    features['Suffix'] = word[-2:].lower()
    return features

In [7]:
train_set2 = [(get_features2(n), gender) for (n, gender) in train_names]
devtest_set2 = [(get_features2(n), gender) for (n, gender) in devtest_names]
test_set2 = [(get_features2(n), gender) for (n, gender) in test_names]

In [8]:
classifier2 = nltk.NaiveBayesClassifier.train(train_set2) 
print(nltk.classify.accuracy(classifier2, devtest_set2)) 

0.808


In [9]:
def get_features3(word):
    features = {}
    features['First_letter'] = word[0].lower()
    features['Suffix'] = word[-2:].lower()
    features['Name_length'] = len(word)
    return features

In [10]:
train_set3 = [(get_features3(n), gender) for (n, gender) in train_names]
devtest_set3 = [(get_features3(n), gender) for (n, gender) in devtest_names]
test_set3 = [(get_features3(n), gender) for (n, gender) in test_names]

In [11]:
classifier3 = nltk.NaiveBayesClassifier.train(train_set3) 
print(nltk.classify.accuracy(classifier3, devtest_set3)) 

0.808


In [12]:
## I tried three feature extractors built from the features I thought would best predict the gender of a name.
## Using the first letter plus the two-letter suffix gave the highest dev-test accuracy (.808).

## Using only the first letter gave an accuracy of .634. Adding name length to the first letter and
## suffix did not improve the accuracy (it stayed at .808).

## I decided to go with get_features2. Now I will check its final performance on the test set.

In [12]:
print(nltk.classify.accuracy(classifier2, test_set2))
## Using get_features2 gave me an accuracy measure of .81 on the test set.

0.81
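
One way to guide the incremental improvements is to list the names the classifier gets wrong on the dev-test set and look for patterns in them. The following is a minimal sketch of that error analysis, not the run recorded above: `toy_train` and `toy_devtest` are hypothetical hand-made lists standing in for `train_names` and `devtest_names`, so the block runs without the Names Corpus.

```python
import nltk

def get_features2(word):
    # Same features as get_features2 above: first letter and two-letter suffix.
    return {'First_letter': word[0].lower(), 'Suffix': word[-2:].lower()}

# Hypothetical toy data (assumption); in the notebook these would be
# train_names and devtest_names.
toy_train = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
             ('Emma', 'female'), ('John', 'male'), ('Mark', 'male'),
             ('Peter', 'male'), ('Frank', 'male')]
toy_devtest = [('Sara', 'female'), ('Jack', 'male'), ('Luca', 'male')]

classifier = nltk.NaiveBayesClassifier.train(
    [(get_features2(name), gender) for name, gender in toy_train])

# Collect every dev-test name whose predicted label differs from the true one.
errors = []
for name, tag in toy_devtest:
    guess = classifier.classify(get_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

for tag, guess, name in sorted(errors):
    print('correct={:<8} guess={:<8} name={}'.format(tag, guess, name))
```

Reading through the misclassified names can suggest new features to try before touching the test set.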


#### 2. Using the movie review document classifier discussed in Chapter 6, Section 1.3 (construct a list of the 2500 most frequent words as features and use the first 150 documents as the test dataset), generate a list of the 10 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

In [13]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category) ## the raw word list of each review; features are extracted later
            for category in movie_reviews.categories() ## 'pos' or 'neg' (the label)
            for fileid in movie_reviews.fileids(category)]

random.seed(1)
random.shuffle(documents)

In [14]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [15]:
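## Note: the exercise asks for the 2500 most frequent words. This cell takes the first 2000 keys of the FreqDist,
## which in NLTK 3 are not guaranteed to be in frequency order; all_words.most_common(2500) would match the exercise.
## The outputs below were produced with the line as written.
word_features = list(all_words)[:2000]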

def document_features(document): 
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [16]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[150:], featuresets[:150]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) 

0.8266666666666667


In [17]:
classifier.show_most_informative_features(10)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.6 : 1.0
         contains(mulan) = True              pos : neg    =      9.0 : 1.0
        contains(seagal) = True              neg : pos    =      8.2 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.9 : 1.0
         contains(awful) = True              neg : pos    =      5.7 : 1.0
         contains(damon) = True              pos : neg    =      5.7 : 1.0
         contains(flynt) = True              pos : neg    =      5.7 : 1.0
          contains(lame) = True              neg : pos    =      5.6 : 1.0
        contains(wasted) = True              neg : pos    =      5.1 : 1.0
    contains(ridiculous) = True              neg : pos    =      5.0 : 1.0


In [None]:
## Six of these 10 features are evaluative words that signal directly whether a reviewer liked a movie:
## outstanding and wonderfully point to positive reviews, while awful, lame, wasted, and ridiculous point to negative ones.
## The other four (mulan, seagal, damon, and flynt) are proper names, which I found surprising.

## My explanation for those words is as follows:
## mulan - Disney released the movie Mulan in 1998; many people liked it and the reviews were mostly positive.
## seagal - Refers to the actor Steven Seagal, whose movies were mostly reviewed negatively.
## damon - Refers to the actor Matt Damon, whom many see as an exceptional actor, so reviewers probably liked the movies he stars in.
## flynt - Refers to the movie The People vs. Larry Flynt, which has many positive reviews and is seen as a great movie.
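
The exercise asks for the most frequent words as features. The sketch below shows one way to select them explicitly with `FreqDist.most_common`, using a hypothetical toy token list (`toy_tokens`) and a small `top_n` in place of the full corpus and 2500; it illustrates the selection step only and is not the run recorded above.

```python
import nltk

# Hypothetical toy tokens (assumption); in the notebook this would be
# movie_reviews.words().
toy_tokens = ['the', 'film', 'the', 'plot', 'film', 'the']
all_words = nltk.FreqDist(w.lower() for w in toy_tokens)

# most_common(n) returns (word, count) pairs sorted by descending count.
top_n = 2  # the exercise uses 2500
word_features = [word for word, count in all_words.most_common(top_n)]

def document_features(document):
    # One boolean feature per selected word: is it present in the document?
    document_words = set(document)
    return {'contains({})'.format(word): (word in document_words)
            for word in word_features}

print(word_features)
print(document_features(['the', 'plot']))
```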

#### 3. Select one of the classification tasks described in this chapter, such as name gender detection, document classification, part-of-speech tagging, or dialog act classification. Using the same training and test data, and the same feature extractor, build three classifiers for the task: a decision tree, a naive Bayes classifier, and a Maximum Entropy classifier. Compare the performance of the three classifiers on your selected task.

In [18]:
import nltk
posts = nltk.corpus.nps_chat.xml_posts()

In [19]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

In [20]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
              for post in posts]

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

In [21]:
NB_classifier = nltk.NaiveBayesClassifier.train(train_set)

In [22]:
print("NaiveBayes: ", nltk.classify.accuracy(NB_classifier, test_set))

NaiveBayes:  0.6685606060606061


In [8]:
DT_classifier = nltk.DecisionTreeClassifier.train(train_set) ## Takes too long to run; interrupted manually

KeyboardInterrupt: 

In [None]:
print("DecisionTree: ", nltk.classify.accuracy(DT_classifier, test_set))

In [6]:
ME_classifier = nltk.MaxentClassifier.train(train_set)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.70805        0.051
             2          -1.24967        0.848
             3          -0.91881        0.884
             4          -0.74746        0.900
             5          -0.63429        0.911
             6          -0.55128        0.921
             7          -0.48745        0.924
             8          -0.43747        0.928
             9          -0.39805        0.932
            10          -0.36667        0.936
            11          -0.34127        0.940
            12          -0.32028        0.943
            13          -0.30259        0.945
            14          -0.28740        0.947
            15          -0.27419        0.949
            16          -0.26256        0.951
            17          -0.25220        0.953
            18          -0.24292        0.954
            19          -0.23453        0.956
 

In [7]:
print("MaximumEntropy_classifier: ", nltk.classify.accuracy(ME_classifier, test_set))

MaximumEntropy_classifier:  0.71875


In [None]:
## The three classifiers (NaiveBayes, DecisionTree, and MaximumEntropy) differ in both accuracy and
## training time.

## First, NaiveBayes trains the fastest but has the lower accuracy of the two classifiers that finished (.669).
## Second, MaximumEntropy has the higher accuracy (.71875) but takes longer to train than NaiveBayes.
## This is because, by default, MaximumEntropy runs 100 iterations to find the set of parameters that maximizes the performance of the classifier.
## Lastly, the DecisionTree classifier ran for so long that I interrupted it (see the KeyboardInterrupt above), so I have no accuracy for it.
## This may be because my computer is ill-equipped to handle it.
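
To compare accuracy and training time side by side, the classifiers can be trained in a loop over the same data. The sketch below assumes hypothetical toy featuresets (`toy_train`, `toy_test`) in the same `contains(word)` style, and it passes `depth_cutoff` and `support_cutoff` to the decision tree as one possible way to bound its training time; whether those settings would let the tree finish on the full NPS Chat data was not tested here.

```python
import time
import nltk

# Hypothetical toy featuresets (assumption), in the same style as
# dialogue_act_features produces.
toy_train = [
    ({'contains(hi)': True}, 'Greet'),
    ({'contains(hello)': True}, 'Greet'),
    ({'contains(hi)': True, 'contains(all)': True}, 'Greet'),
    ({'contains(bye)': True}, 'Bye'),
    ({'contains(bye)': True, 'contains(all)': True}, 'Bye'),
    ({'contains(later)': True}, 'Bye'),
]
toy_test = [({'contains(hi)': True}, 'Greet'),
            ({'contains(bye)': True}, 'Bye')]

# Each entry maps a display name to a function that trains a classifier.
trainers = {
    'NaiveBayes': nltk.NaiveBayesClassifier.train,
    # A shallower tree with a higher support cutoff does less refinement work.
    'DecisionTree': lambda data: nltk.DecisionTreeClassifier.train(
        data, depth_cutoff=5, support_cutoff=2),
}

results = {}
for name, train in trainers.items():
    start = time.perf_counter()
    model = train(toy_train)
    elapsed = time.perf_counter() - start
    results[name] = nltk.classify.accuracy(model, toy_test)
    print('{:<13} accuracy={:.3f} train_time={:.4f}s'.format(
        name, results[name], elapsed))
```

A Maximum Entropy entry could be added to `trainers` in the same way, for example `lambda data: nltk.MaxentClassifier.train(data, trace=0, max_iter=10)`, where `max_iter` caps the number of training iterations.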

#### 4. The NPS Chat Corpus, which was demonstrated in Chapter 2, consists of posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement," "Emotion," "ynQuestion", and "Continuer." We can therefore use this data to build a classifier that can identify the dialogue act types for new instant messaging posts. Build a simple feature extractor that checks what words the post contains. Construct the training and testing data by applying the feature extractor to each post and create a Naïve Bayes classifier. Please print the accuracy of this classifier. Please use the NPS Chat Corpus as the dataset and use 8% of the data as the test data.

In [23]:
posts = nltk.corpus.nps_chat.xml_posts()

In [24]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

In [25]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
              for post in posts]

size = int(len(featuresets) * 0.08)
train_set, test_set = featuresets[size:], featuresets[:size]

In [26]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.676923076923077


In [None]:
#### 5. Given the following confusion matrix, please calculate: a) Accuracy Rate; b) Precision; c) Recall; d) F-Measure.

|                | True: No | True: Yes |
|----------------|----------|-----------|
| Predicted: No  | 104      | 33        |
| Predicted: Yes | 13       | 50        |


In [None]:
## accuracy rate = (True Positive + True Negative)/Number of Observations
## Precision = True Positive / (True Positive + False Positive)
## Recall = True Positive / (True Positive + False Negative)
## F-Measure = 2 * (Precision * Recall)/ (Precision + Recall)

In [27]:
## 5a.
AR = (50 + 104) / (104+33+13+50)
print("Accuracy Rate is", round(AR,3))

Accuracy Rate is 0.77


In [28]:
## 5b.
PR = 50 / (50 + 13)
print("Precision is", round(PR,3))

Precision is 0.794


In [29]:
## 5c.
RC = 50 / (50 + 33)
print("Recall is", round(RC,3))

Recall is 0.602


In [30]:
## 5d.
F1M = 2* (PR*RC) / (PR + RC) 
print("F-Measure is", round(F1M,3))

F-Measure is 0.685
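
The four calculations above can also be collected into one helper. This is a small sketch in which the function name `confusion_metrics` and its argument names are illustrative; "Yes" is treated as the positive class, as in the cells above.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F-measure from the four
    cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f_measure': f_measure}

# Cells taken from the matrix in exercise 5:
# predicted Yes / true Yes = 50 (TP), predicted Yes / true No = 13 (FP),
# predicted No / true Yes = 33 (FN), predicted No / true No = 104 (TN).
metrics = confusion_metrics(tp=50, fp=13, fn=33, tn=104)
for name, value in metrics.items():
    print(name, round(value, 3))
```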


## Chapter 7

#### 6. Write a tag pattern to match noun phrases containing plural head nouns in the following sentence: "Many researchers discussed this project for two weeks." Try to do this by generalizing the tag pattern that handled singular noun phrases too. Please 1) pos-tag this sentence 2) write a tag pattern (i.e. grammar); 3) use RegexpParser to parse the sentence and 4) print out the result containing NP (noun phrases).

In [31]:
doc_sample = "Many researchers discussed this project for two weeks." 

In [32]:
sentences = nltk.sent_tokenize(doc_sample) 
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]

In [33]:
tagged_sentences

[[('Many', 'JJ'),
  ('researchers', 'NNS'),
  ('discussed', 'VBD'),
  ('this', 'DT'),
  ('project', 'NN'),
  ('for', 'IN'),
  ('two', 'CD'),
  ('weeks', 'NNS'),
  ('.', '.')]]

In [34]:
grammar = r"""
    NP: {<NN.*>+} """

In [35]:
cp = nltk.RegexpParser(grammar)

In [36]:
### print(cp.parse(tagged_sentences)) did not work because tagged_sentences is a list of sentences,
### while cp.parse expects a single tagged sentence. So I parse each sentence in the list separately and collect the trees.
chunked = []
for sent in tagged_sentences:
    chunked.append(cp.parse(sent))
print(chunked)

[Tree('S', [('Many', 'JJ'), Tree('NP', [('researchers', 'NNS')]), ('discussed', 'VBD'), ('this', 'DT'), Tree('NP', [('project', 'NN')]), ('for', 'IN'), ('two', 'CD'), Tree('NP', [('weeks', 'NNS')]), ('.', '.')])]
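
The pattern above chunks only the nouns themselves, so "Many", "this", and "two" stay outside the noun phrases. A possible generalization, sketched below, allows optional determiners, adjectives, and cardinal numbers before the head noun. The tagged sentence is copied by hand from the `pos_tag` output above so the block runs without the tagger; the pattern is one option among several, not the only correct answer.

```python
import nltk

# Tags copied from the pos_tag output shown above.
tagged = [('Many', 'JJ'), ('researchers', 'NNS'), ('discussed', 'VBD'),
          ('this', 'DT'), ('project', 'NN'), ('for', 'IN'),
          ('two', 'CD'), ('weeks', 'NNS'), ('.', '.')]

# Optional determiners/adjectives/numbers, then one or more nouns
# (NN.* covers both singular NN and plural NNS).
grammar = r"""
    NP: {<DT|JJ|CD>*<NN.*>+}
    """
cp = nltk.RegexpParser(grammar)
tree = cp.parse(tagged)

# Collect the words of each NP chunk as a string.
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == 'NP')]
print(noun_phrases)
```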


#### 7. Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.

In [37]:
sentence = [("the", "DT"), ("receiving","VBG"), ("end","NN"), ("assistant", "NN"), ("managing", "VBG"), ("editor", "NN")]

In [38]:
grammar = r"""
    NP: {<DT>}   
        {<VBG>}
        {<NN.*>} """

In [39]:
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S
  (NP the/DT)
  (NP receiving/VBG)
  (NP end/NN)
  (NP assistant/NN)
  (NP managing/VBG)
  (NP editor/NN))
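
The grammar above puts every token in its own NP chunk, so the gerund is not grouped with the nouns around it. The sketch below shows one possible single pattern that keeps each example phrase together: an optional determiner, any nouns, a gerund, then one or more nouns. The two phrases are parsed separately here, which is a choice made for this illustration.

```python
import nltk

# Optional determiner, optional preceding nouns, a gerund (VBG),
# then at least one noun.
grammar = r"""
    NP: {<DT>?<NN.*>*<VBG><NN.*>+}
    """
cp = nltk.RegexpParser(grammar)

phrases = [
    [('the', 'DT'), ('receiving', 'VBG'), ('end', 'NN')],
    [('assistant', 'NN'), ('managing', 'VBG'), ('editor', 'NN')],
]

chunked = [cp.parse(phrase) for phrase in phrases]
for tree in chunked:
    print(tree)
```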


#### 8. Use the Brown Corpus and the cascaded chunker that has patterns for noun phrases, prepositional phrases, verb phrases, and clauses to print out all the verb phrases in the Brown Corpus.

In [41]:
grammar = r"""
    NP: {<DT|JJ|NN.*>+}
    PP: {<IN><NP>}     
    VP: {<VB.*><NP|PP|CLAUSE>+$}
    CLAUSE: {<NP><VP>}
    """
cp = nltk.RegexpParser(grammar)

In [42]:
for sent in nltk.corpus.brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == "VP": print(subtree)

(VP Ask/VB-HL (NP jail/NN-HL deputies/NNS-HL))
(VP revolving/VBG-HL (NP fund/NN-HL))
(VP Issue/VB-HL (NP jury/NN-HL subpoenas/NNS-HL))
(VP Nursing/VBG-HL (NP home/NN-HL care/NN-HL))
(VP pay/VB-HL (NP doctors/NNS-HL))
(VP nursing/VBG-HL (NP homes/NNS))
(VP Asks/VBZ-HL (NP research/NN-HL funds/NNS-HL))
(VP Regrets/VBZ-HL (NP attack/NN-HL))
(VP Decries/VBZ-HL (NP joblessness/NN-HL))
(VP Underlying/VBG-HL (NP concern/NN-HL))
(VP bar/VB-HL (NP vehicles/NNS-HL))
(VP loses/VBZ-HL (NP pace/NN-HL))
(VP hits/VBZ-HL (NP homer/NN-HL))
(VP attend/VB-HL (NP races/NNS-HL))
(VP follows/VBZ-HL (NP ceremonies/NNS-HL))
(VP Noted/VBN-HL (NP artist/NN-HL))
(VP Cites/VBZ-HL (NP discrepancies/NNS-HL))
(VP calls/VBZ-HL (NP police/NNS-HL))
(VP held/VBN-HL (NP key/NN-HL))
(VP grant/VB-HL (NP bail/NN-HL))
(VP Held/VBD-HL (NP candle/NN-HL))
(VP Expresses/VBZ-HL (NP thanks/NNS-HL))
(VP Gets/VBZ-HL (NP car/NN-HL number/NN-HL))
(VP Attacks/VBZ-HL (NP officer/NN-HL))
(VP oks/VBZ-HL (NP pact/NN-HL))
(VP report/VB-HL (
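
Printing every verb phrase in the corpus produces a very long output. A possible alternative, sketched below, wraps the same loop in a generator so the verb phrases can be collected or sliced. The helper name `verb_phrases` and the single hand-tagged sentence are assumptions for illustration; the grammar is the one used above.

```python
import nltk

grammar = r"""
    NP: {<DT|JJ|NN.*>+}
    PP: {<IN><NP>}
    VP: {<VB.*><NP|PP|CLAUSE>+$}
    CLAUSE: {<NP><VP>}
    """
cp = nltk.RegexpParser(grammar)

def verb_phrases(tagged_sents, parser):
    # Yield every VP subtree found in each parsed sentence.
    for sent in tagged_sents:
        tree = parser.parse(sent)
        for subtree in tree.subtrees(lambda t: t.label() == 'VP'):
            yield subtree

# Hypothetical hand-tagged sentence standing in for the Brown Corpus.
toy_sents = [[('Mary', 'NN'), ('saw', 'VBD'), ('the', 'DT'), ('cat', 'NN'),
              ('sit', 'VB'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]]

vps = list(verb_phrases(toy_sents, cp))
for vp in vps:
    print(vp)
```

With the Brown Corpus, the same generator could be combined with `itertools.islice(verb_phrases(nltk.corpus.brown.tagged_sents(), cp), 25)` to look at only the first few matches.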

#### 9. The bigram chunker scores about 90% accuracy. Use bigram_chunker.tagger.tag(postags) to examine the results and study its errors. Then experiment with trigram chunking. Are you able to improve the performance any more?

In [43]:
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)
    def parse(self, sentence): 
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
    
postags = sorted(set(pos for sent in train_sents
                    for (word,pos) in sent.leaves()))

In [44]:
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%


In [45]:
print(bigram_chunker.tagger.tag(postags))

[('#', 'B-NP'), ('$', 'I-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'B-NP'), ('DT', 'I-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'B-NP'), ('JJR', 'I-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'B-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'I-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'I-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', None), ('SYM', None), ('TO', None), ('UH', None), ('VB', None), ('VBD', None), ('VBG', None), ('VBN', None), ('VBP', None), ('VBZ', None), ('WDT', None), ('WP', None), ('WP$', None), ('WRB', None), ('``', None)]


In [46]:
class TrigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)
    def parse(self, sentence): 
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
    
trigram_chunker = TrigramChunker(train_sents)
print(trigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.5%
    Recall:        86.8%
    F-Measure:     84.6%


In [47]:
print(trigram_chunker.tagger.tag(postags))

[('#', 'B-NP'), ('$', 'I-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'B-NP'), ('DT', 'I-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'B-NP'), ('JJR', 'I-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'B-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', None), ('POS', None), ('PRP', None), ('PRP$', None), ('RB', None), ('RBR', None), ('RBS', None), ('RP', None), ('SYM', None), ('TO', None), ('UH', None), ('VB', None), ('VBD', None), ('VBG', None), ('VBN', None), ('VBP', None), ('VBZ', None), ('WDT', None), ('WP', None), ('WP$', None), ('WRB', None), ('``', None)]


In [None]:
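## Tagging the sorted list of POS tags shows what the chunker has learned. Most punctuation marks are tagged O
## (outside of NP chunks); the exceptions are # and $, both of which are used as currency markers.
## Because the n-gram taggers use the previous chunk tags as context, the results depend on the artificial (alphabetical)
## order of this list. In this sequence, determiners (DT) and the noun types NNP, NNPS, and NNS are tagged I-NP, and NN is tagged B-NP.
## The None values (from RP onward for the bigram tagger, from PDT onward for the trigram tagger) appear because
## neither tagger has a backoff: once a context was never seen in training, the tagger returns None, and every
## later context then contains None as well.

## The trigram chunker gives almost no improvement: IOB accuracy (93.3%) and recall (86.8%) are unchanged,
## while precision rises from 82.3% to 82.5% and F-measure from 84.5% to 84.6%.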
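
The `None` entries in the tagger outputs above come from contexts the n-gram tagger never saw in training. The sketch below illustrates this on hypothetical toy data (`toy_train`) and shows how passing a unigram tagger as `backoff` fills those gaps. It demonstrates the mechanism only; it makes no claim about how much backoff would change the CoNLL-2000 scores.

```python
import nltk

# Hypothetical toy training data (assumption): one sentence of
# (POS tag, chunk tag) pairs, the same shape BigramChunker trains on.
toy_train = [[('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'O'),
              ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP')]]

# A bigram tagger with no backoff returns None for unseen contexts.
plain_bigram = nltk.BigramTagger(toy_train)

# The same tagger, falling back to a unigram tagger when the bigram
# context is unknown.
unigram = nltk.UnigramTagger(toy_train)
backoff_bigram = nltk.BigramTagger(toy_train, backoff=unigram)

# 'NN' never starts a training sentence, so its bigram context is unseen.
pos_sequence = ['NN', 'DT']
plain_result = plain_bigram.tag(pos_sequence)
backoff_result = backoff_bigram.tag(pos_sequence)

print(plain_result)
print(backoff_result)
```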

#### 10. Explore the Brown Corpus to print out all the FACILITIES (one of the commonly used types of named entities).

In [39]:
sentence = nltk.corpus.brown.tagged_sents()[75]

In [40]:
print(nltk.ne_chunk(sentence))

(S
  A/AT
  veteran/JJ
  Jackson/NP-TL
  County/NN-TL
  legislator/NN
  will/MD
  ask/VB
  the/AT
  (ORGANIZATION Georgia/NP-TL)
  (ORGANIZATION House/NN-TL)
  Monday/NR
  to/TO
  back/VB
  federal/JJ
  aid/NN
  to/IN
  education/NN
  ,/,
  something/PN
  it/PPS
  has/HVZ
  consistently/RB
  opposed/VBN
  in/IN
  the/AT
  past/NN
  ./.)


In [48]:
for sent in nltk.corpus.brown.tagged_sents()[:100000]:
    tree = nltk.ne_chunk(sent)
    for subtree in tree.subtrees():
        if subtree.label() == "FACILITY": print(subtree)

(FACILITY Raymondville/NP)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY Kremlin/NP)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY Franklin/NP-TL)
(FACILITY Kremlin/NP)
(FACILITY Franklin/NP-TL Square/NN-TL)
(FACILITY Pennsylvania/NP-TL Avenue/NN-TL)
(FACILITY Jenks/NP-TL Street/NN-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY Pensacola/NP)
(FACILITY White/JJ-TL Sox/NPS-TL)
(FACILITY Caltech/NP)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY Whi
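
Because the same facility is printed many times, it may be easier to read the result as counts of distinct names. The sketch below assumes a hypothetical helper `count_entities` and two hand-built trees in place of the `nltk.ne_chunk` output, so it runs without the chunker model.

```python
from collections import Counter
from nltk import Tree

def count_entities(trees, label):
    # Count each distinct entity string found under subtrees with this label.
    counts = Counter()
    for tree in trees:
        for subtree in tree.subtrees(lambda t: t.label() == label):
            counts[' '.join(word for word, tag in subtree.leaves())] += 1
    return counts

# Hypothetical hand-built trees (assumption), shaped like ne_chunk output.
toy_trees = [
    Tree('S', [('the', 'AT'),
               Tree('FACILITY', [('White', 'JJ-TL'), ('House', 'NN-TL')]),
               ('said', 'VBD')]),
    Tree('S', [Tree('FACILITY', [('Kremlin', 'NP')]),
               ('and', 'CC'),
               Tree('FACILITY', [('White', 'JJ-TL'), ('House', 'NN-TL')])]),
]

facility_counts = count_entities(toy_trees, 'FACILITY')
for name, count in facility_counts.most_common():
    print(count, name)
```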