# AnTeDe Lab A: Understanding PoS tags 

## Session goal
The goal of this session is to help you become familiar with PoS tags. We begin by importing the fragments of the Brown corpus and of the Wall Street Journal section of the Penn Treebank that are distributed with NLTK.

In [1]:
import nltk
nltk.download('brown')
nltk.download('treebank')
nltk.download('universal_tagset')
from nltk.corpus import brown 
from nltk.corpus import treebank 

[nltk_data] Downloading package brown to /Users/Daniele/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     /Users/Daniele/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/Daniele/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Complete the inner loop of the following function as directed by the comments.

In [2]:
def get_ground_truth_distribution(token, corpus, universal_tagset=False):
    """Count how often each PoS tag is assigned to `token` in `corpus`."""

    # None selects the corpus's native tagset (Penn for treebank, Brown for brown)
    corpus_tagset = 'universal' if universal_tagset else None

    sentences = corpus.tagged_sents(tagset=corpus_tagset)
    untagged_sentences = corpus.sents()

    # this is going to be a dict where each key is a tag
    tag_freq = {}

    # sent is an untagged sentence
    # sentences[i] is the corresponding tagged sentence

    for i, sent in enumerate(untagged_sentences):
        # if the token we're looking for is in sent
        if token in sent:
            # for each (token, tag) tuple in the tagged sentence
            for pair in sentences[i]:

                # pair[0] contains the current token
                # pair[1] contains the corresponding tag

                # increase tag_freq[pair[1]] by one unit
                # careful because tag_freq may not yet have a
                # key corresponding to pair[1]!

# BEGIN_REMOVE
                if pair[0] == token:
                    # .get() supplies 0 the first time a tag is seen
                    tag_freq[pair[1]] = tag_freq.get(pair[1], 0) + 1
# END_REMOVE
    return tag_freq
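
The counting step can also be sketched with `collections.Counter`, which handles the missing-key case for you. This is only an illustration, not part of the exercise: `tag_distribution` and `toy_tagged_sents` are hypothetical names, and the toy sentences are made up to mimic the `(token, tag)` format returned by `tagged_sents()`.

```python
from collections import Counter

# Hypothetical toy data in the (token, tag) format returned by tagged_sents()
toy_tagged_sents = [
    [('He', 'PRP'), ('runs', 'VBZ'), ('fast', 'RB'), ('.', '.')],
    [('A', 'DT'), ('fast', 'JJ'), ('car', 'NN'), ('.', '.')],
    [('Go', 'VB'), ('fast', 'RB'), ('!', '.')],
]

def tag_distribution(token, tagged_sents):
    """Count the tags assigned to `token` across all tagged sentences."""
    return Counter(tag
                   for sent in tagged_sents
                   for word, tag in sent
                   if word == token)

toy_distribution = tag_distribution('fast', toy_tagged_sents)
print(dict(toy_distribution))
```

Because the tagged sentences already contain the tokens, this variant does not need the untagged sentences at all; it should accept `corpus.tagged_sents()` in place of the toy list.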
              

In the following cells, we get the PoS tag distribution of ambiguous tokens such as *fast* and *that* in the Penn Treebank and the Brown corpus, using the universal and the Penn tagsets (for the Penn Treebank) and the Brown tagset (for Brown).

In [9]:
get_ground_truth_distribution(token='fast', corpus=treebank)

{'RB': 2, 'JJ': 1}
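
Raw counts are easier to compare across tokens and corpora once they are turned into proportions. A minimal sketch, assuming a `{tag: count}` dict like the one returned above; `to_relative_frequencies` is a hypothetical helper, not part of the lab code.

```python
def to_relative_frequencies(tag_freq):
    """Convert a {tag: count} dict into a {tag: proportion} dict."""
    total = sum(tag_freq.values())
    if total == 0:
        # the token never occurred, so there is nothing to normalize
        return {}
    return {tag: count / total for tag, count in tag_freq.items()}

fast_counts = {'RB': 2, 'JJ': 1}  # counts copied from the output above
fast_shares = to_relative_frequencies(fast_counts)
for tag, share in fast_shares.items():
    print(f"{tag}: {share:.2f}")
```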

The following function prints the corpus sentences in which a given token carries a given tag.

In [7]:
def get_ground_truth_examples(token, corpus, tag, universal_tagset=False):
    """Print every sentence of `corpus` in which `token` is tagged as `tag`."""

    # None selects the corpus's native tagset
    corpus_tagset = 'universal' if universal_tagset else None

    sentences = corpus.tagged_sents(tagset=corpus_tagset)
    untagged_sentences = corpus.sents()
    count = 0

    for i, sent in enumerate(untagged_sentences):

        if token in sent:
            text = ""
            visualize = False
            for pair in sentences[i]:
                # leave trace elements (tagged '-NONE-') out of the readable text
                if 'NONE' not in pair[1]:
                    text = text + " " + pair[0]
                if (pair[0] == token) and (pair[1] == tag):
                    visualize = True
            if visualize:
                count = count + 1
                print(str(count) + ' ' + text)
                print(str(sentences[i]))
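
If you would rather collect the examples than print them, the same selection logic can be written as a generator. This is a sketch under assumptions: `iter_examples` and `toy_sents_with_traces` are hypothetical names, and the toy sentences are invented to mimic Penn Treebank output, including a `-NONE-` trace element.

```python
def iter_examples(token, tag, tagged_sents):
    """Yield (text, tagged_sentence) for sentences where `token` has `tag`."""
    for tagged_sent in tagged_sents:
        if any(word == token and t == tag for word, t in tagged_sent):
            # drop trace elements (tagged '-NONE-') from the readable text
            words = [word for word, t in tagged_sent if 'NONE' not in t]
            yield ' '.join(words), tagged_sent

# Hypothetical toy data in the (token, tag) format of tagged_sents()
toy_sents_with_traces = [
    [('They', 'PRP'), ('want', 'VBP'), ('*-1', '-NONE-'), ('to', 'TO'),
     ('make', 'VB'), ('fast', 'JJ'), ('trades', 'NNS'), ('.', '.')],
    [('He', 'PRP'), ('runs', 'VBZ'), ('fast', 'RB'), ('.', '.')],
]

examples = list(iter_examples('fast', 'JJ', toy_sents_with_traces))
for number, (text, tagged_sent) in enumerate(examples, start=1):
    print(number, text)
```

Returning the examples makes it possible to slice, count, or sample them before display, which helps with frequent tokens that produce hundreds of matches.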
                          

In [10]:
get_ground_truth_examples(token='fast', corpus=treebank, tag='JJ')

1  The New York Stock Exchange 's attempt to introduce a new portfolio basket is evidence of investors ' desires to make fast and easy transactions of large numbers of shares .
[('The', 'DT'), ('New', 'NNP'), ('York', 'NNP'), ('Stock', 'NNP'), ('Exchange', 'NNP'), ("'s", 'POS'), ('attempt', 'NN'), ('*', '-NONE-'), ('to', 'TO'), ('introduce', 'VB'), ('a', 'DT'), ('new', 'JJ'), ('portfolio', 'NN'), ('basket', 'NN'), ('is', 'VBZ'), ('evidence', 'NN'), ('of', 'IN'), ('investors', 'NNS'), ("'", 'POS'), ('desires', 'NNS'), ('*', '-NONE-'), ('to', 'TO'), ('make', 'VB'), ('fast', 'JJ'), ('and', 'CC'), ('easy', 'JJ'), ('transactions', 'NNS'), ('of', 'IN'), ('large', 'JJ'), ('numbers', 'NNS'), ('of', 'IN'), ('shares', 'NNS'), ('.', '.')]


In [18]:
get_ground_truth_distribution(token='that', corpus=treebank)

{'WDT': 214, 'IN': 513, 'DT': 77, 'RB': 3}
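
One rough way to read this distribution is to ask how often you would be right if you always guessed the most frequent tag for *that*. The short calculation below only uses the counts printed above; the variable names are illustrative.

```python
that_counts = {'WDT': 214, 'IN': 513, 'DT': 77, 'RB': 3}  # copied from the output above

total = sum(that_counts.values())
# the tag with the highest count
majority_tag = max(that_counts, key=that_counts.get)
majority_share = that_counts[majority_tag] / total

print(f"{total} occurrences; most frequent tag: {majority_tag} "
      f"({majority_share:.1%})")
```

On these counts the most frequent tag, `IN`, covers 513 of 807 occurrences, so a little over a third of the occurrences of *that* need context to be tagged correctly.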

In [22]:
get_ground_truth_examples(token='that', corpus=treebank, tag='IN')

1  The finding probably will support those who argue that the U.S. should regulate the class of asbestos including crocidolite more stringently than the common kind of asbestos , chrysotile , found in most schools and other buildings , Dr. Talcott said .
[('The', 'DT'), ('finding', 'NN'), ('probably', 'RB'), ('will', 'MD'), ('support', 'VB'), ('those', 'DT'), ('who', 'WP'), ('*T*-6', '-NONE-'), ('argue', 'VBP'), ('that', 'IN'), ('the', 'DT'), ('U.S.', 'NNP'), ('should', 'MD'), ('regulate', 'VB'), ('the', 'DT'), ('class', 'NN'), ('of', 'IN'), ('asbestos', 'NN'), ('including', 'VBG'), ('crocidolite', 'NN'), ('more', 'RBR'), ('stringently', 'RB'), ('than', 'IN'), ('the', 'DT'), ('common', 'JJ'), ('kind', 'NN'), ('of', 'IN'), ('asbestos', 'NN'), (',', ','), ('chrysotile', 'NN'), (',', ','), ('found', 'VBN'), ('*', '-NONE-'), ('in', 'IN'), ('most', 'JJS'), ('schools', 'NNS'), ('and', 'CC'), ('other', 'JJ'), ('buildings', 'NNS'), (',', ','), ('Dr.', 'NNP'), ('Talcott', 'NNP'), ('said', 'VBD'

31  In another reflection that the growth of the economy is leveling off , the government said that orders for manufactured goods and spending on construction failed to rise in September .
[('In', 'IN'), ('another', 'DT'), ('reflection', 'NN'), ('that', 'IN'), ('the', 'DT'), ('growth', 'NN'), ('of', 'IN'), ('the', 'DT'), ('economy', 'NN'), ('is', 'VBZ'), ('leveling', 'VBG'), ('off', 'RP'), (',', ','), ('the', 'DT'), ('government', 'NN'), ('said', 'VBD'), ('that', 'IN'), ('orders', 'NNS'), ('for', 'IN'), ('manufactured', 'VBN'), ('goods', 'NNS'), ('and', 'CC'), ('spending', 'NN'), ('on', 'IN'), ('construction', 'NN'), ('failed', 'VBD'), ('*-1', '-NONE-'), ('to', 'TO'), ('rise', 'VB'), ('in', 'IN'), ('September', 'NNP'), ('.', '.')]
32  Meanwhile , the National Association of Purchasing Management said its latest survey indicated that the manufacturing economy contracted in October for the sixth consecutive month .
[('Meanwhile', 'RB'), (',', ','), ('the', 'DT'), ('National', 'NNP'), ('A

65  Prosecutors alleged that she was trying to bolster students ' scores to win a bonus under the state 's 1984 Education Improvement Act .
[('Prosecutors', 'NNS'), ('alleged', 'VBD'), ('that', 'IN'), ('she', 'PRP'), ('was', 'VBD'), ('trying', 'VBG'), ('*-1', '-NONE-'), ('to', 'TO'), ('bolster', 'VB'), ('students', 'NNS'), ("'", 'POS'), ('scores', 'NNS'), ('*-1', '-NONE-'), ('to', 'TO'), ('win', 'VB'), ('a', 'DT'), ('bonus', 'NN'), ('under', 'IN'), ('the', 'DT'), ('state', 'NN'), ("'s", 'POS'), ('1984', 'CD'), ('Education', 'NNP'), ('Improvement', 'NNP'), ('Act', 'NNP'), ('.', '.')]
66  A 50-state study released in September by Friends for Education , an Albuquerque , N.M. , school-research group , concluded that `` outright cheating by American educators '' is `` common . ''
[('A', 'DT'), ('50-state', 'JJ'), ('study', 'NN'), ('released', 'VBN'), ('*', '-NONE-'), ('in', 'IN'), ('September', 'NNP'), ('by', 'IN'), ('Friends', 'NNPS'), ('for', 'IN'), ('Education', 'NNP'), (',', ','), ('an

112  The company also disclosed that during that period it offered 10,000 yen , or about $ 70 , for another contract .
[('The', 'DT'), ('company', 'NN'), ('also', 'RB'), ('disclosed', 'VBD'), ('that', 'IN'), ('during', 'IN'), ('that', 'DT'), ('period', 'NN'), ('it', 'PRP'), ('offered', 'VBD'), ('10,000', 'CD'), ('yen', 'NNS'), (',', ','), ('or', 'CC'), ('about', 'RB'), ('$', '$'), ('70', 'CD'), ('*U*', '-NONE-'), (',', ','), ('for', 'IN'), ('another', 'DT'), ('contract', 'NN'), ('.', '.')]
113  Foreigners complain that they have limited access to government procurement in Japan , in part because Japanese companies unfairly undercut them .
[('Foreigners', 'NNS'), ('complain', 'VBP'), ('that', 'IN'), ('they', 'PRP'), ('have', 'VBP'), ('limited', 'JJ'), ('access', 'NN'), ('to', 'TO'), ('government', 'NN'), ('procurement', 'NN'), ('in', 'IN'), ('Japan', 'NNP'), (',', ','), ('in', 'IN'), ('part', 'NN'), ('because', 'IN'), ('Japanese', 'JJ'), ('companies', 'NNS'), ('unfairly', 'RB'), ('under

149  Integra , which owns and operates hotels , said that Hallwood Group Inc. has agreed to exercise any rights that are n't exercised by other shareholders .
[('Integra', 'NNP'), (',', ','), ('which', 'WDT'), ('*T*-174', '-NONE-'), ('owns', 'VBZ'), ('and', 'CC'), ('operates', 'VBZ'), ('hotels', 'NNS'), (',', ','), ('said', 'VBD'), ('that', 'IN'), ('Hallwood', 'NNP'), ('Group', 'NNP'), ('Inc.', 'NNP'), ('has', 'VBZ'), ('agreed', 'VBN'), ('*-1', '-NONE-'), ('to', 'TO'), ('exercise', 'VB'), ('any', 'DT'), ('rights', 'NNS'), ('that', 'WDT'), ('*T*-175', '-NONE-'), ('are', 'VBP'), ("n't", 'RB'), ('exercised', 'VBN'), ('*-2', '-NONE-'), ('by', 'IN'), ('other', 'JJ'), ('shareholders', 'NNS'), ('.', '.')]
150  `` Each day that Congress fails to act ... will cause additional disruption in our borrowing schedule , possibly resulting in higher interest costs to the taxpayer , '' Treasury Secretary Nicholas Brady said in a speech prepared for delivery last night to a group of bankers .
[('``', '`

187  `` I live in hopes that the ringers themselves will be drawn into that fuller life . ''
[('``', '``'), ('I', 'PRP'), ('live', 'VBP'), ('in', 'IN'), ('hopes', 'NNS'), ('that', 'IN'), ('the', 'DT'), ('ringers', 'NNS'), ('themselves', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('drawn', 'VBN'), ('*-139', '-NONE-'), ('into', 'IN'), ('that', 'DT'), ('fuller', 'JJR'), ('life', 'NN'), ('.', '.'), ("''", "''")]
188  Says Mr. Baldwin , `` We recognize that we may no longer have as high a priority in church life and experience . ''
[('Says', 'VBZ'), ('*ICH*-1', '-NONE-'), ('Mr.', 'NNP'), ('Baldwin', 'NNP'), (',', ','), ('``', '``'), ('We', 'PRP'), ('recognize', 'VBP'), ('that', 'IN'), ('we', 'PRP'), ('may', 'MD'), ('no', 'RB'), ('longer', 'RBR'), ('have', 'VBP'), ('as', 'RB'), ('high', 'JJ'), ('a', 'DT'), ('priority', 'NN'), ('in', 'IN'), ('church', 'NN'), ('life', 'NN'), ('and', 'CC'), ('experience', 'NN'), ('.', '.'), ("''", "''")]
189  One survey says that of the 100,000 trained bellringers i

227  And though the size of the loan guarantees approved yesterday is significant , recent experience with a similar program in Central America indicates that it could take several years before the new Polish government can fully use the aid effectively .
[('And', 'CC'), ('though', 'IN'), ('the', 'DT'), ('size', 'NN'), ('of', 'IN'), ('the', 'DT'), ('loan', 'NN'), ('guarantees', 'NNS'), ('approved', 'VBN'), ('*', '-NONE-'), ('yesterday', 'NN'), ('is', 'VBZ'), ('significant', 'JJ'), (',', ','), ('recent', 'JJ'), ('experience', 'NN'), ('with', 'IN'), ('a', 'DT'), ('similar', 'JJ'), ('program', 'NN'), ('in', 'IN'), ('Central', 'NNP'), ('America', 'NNP'), ('indicates', 'VBZ'), ('that', 'IN'), ('it', 'PRP'), ('could', 'MD'), ('take', 'VB'), ('several', 'JJ'), ('years', 'NNS'), ('before', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('Polish', 'JJ'), ('government', 'NN'), ('can', 'MD'), ('fully', 'RB'), ('use', 'VB'), ('the', 'DT'), ('aid', 'NN'), ('effectively', 'RB'), ('.', '.')]
228  The potential

272  Moreover , the framers believed that the nation needed a unitary executive with the independence and resources to perform the executive functions that the Confederation Congress had performed poorly under the Articles of Confederation .
[('Moreover', 'RB'), (',', ','), ('the', 'DT'), ('framers', 'NNS'), ('believed', 'VBD'), ('that', 'IN'), ('the', 'DT'), ('nation', 'NN'), ('needed', 'VBD'), ('a', 'DT'), ('unitary', 'JJ'), ('executive', 'NN'), ('with', 'IN'), ('the', 'DT'), ('independence', 'NN'), ('and', 'CC'), ('resources', 'NNS'), ('0', '-NONE-'), ('*', '-NONE-'), ('to', 'TO'), ('perform', 'VB'), ('the', 'DT'), ('executive', 'NN'), ('functions', 'NNS'), ('that', 'IN'), ('the', 'DT'), ('Confederation', 'NNP'), ('Congress', 'NNP'), ('had', 'VBD'), ('performed', 'VBN'), ('*T*-2', '-NONE-'), ('poorly', 'RB'), ('under', 'IN'), ('the', 'DT'), ('Articles', 'NNPS'), ('of', 'IN'), ('Confederation', 'NNP'), ('*T*-1', '-NONE-'), ('.', '.')]
273  The language of the appropriations rider imp

305  Some Democrats , led by Rep. Jack Brooks -LRB- D. , Texas -RRB- , unsuccessfully opposed the measure because they fear that the fees may not fully make up for the budget cuts .
[('Some', 'RB'), ('Democrats', 'NNPS'), (',', ','), ('led', 'VBN'), ('*', '-NONE-'), ('by', 'IN'), ('Rep.', 'NNP'), ('Jack', 'NNP'), ('Brooks', 'NNP'), ('-LRB-', '-LRB-'), ('D.', 'NNP'), (',', ','), ('Texas', 'NNP'), ('-RRB-', '-RRB-'), (',', ','), ('unsuccessfully', 'RB'), ('opposed', 'VBD'), ('the', 'DT'), ('measure', 'NN'), ('because', 'IN'), ('they', 'PRP'), ('fear', 'VBP'), ('that', 'IN'), ('the', 'DT'), ('fees', 'NNS'), ('may', 'MD'), ('not', 'RB'), ('fully', 'RB'), ('make', 'VB'), ('up', 'RP'), ('for', 'IN'), ('the', 'DT'), ('budget', 'NN'), ('cuts', 'NNS'), ('.', '.')]
306  Proponents of the funding arrangement predict that , based on recent filing levels of more than 2,000 a year , the fees will yield at least $ 40 million this fiscal year , or $ 10 million more than the budget cuts .
[('Proponents

361  There were concerns early in the day that Wall Street 's sharp gains on Tuesday were overdone and due for a reversal .
[('There', 'EX'), ('were', 'VBD'), ('concerns', 'NNS'), ('*ICH*-1', '-NONE-'), ('early', 'RB'), ('in', 'IN'), ('the', 'DT'), ('day', 'NN'), ('that', 'IN'), ('Wall', 'NNP'), ('Street', 'NNP'), ("'s", 'POS'), ('sharp', 'JJ'), ('gains', 'NNS'), ('on', 'IN'), ('Tuesday', 'NNP'), ('were', 'VBD'), ('overdone', 'VBN'), ('and', 'CC'), ('due', 'JJ'), ('for', 'IN'), ('a', 'DT'), ('reversal', 'NN'), ('.', '.')]
362  Dealers said institutions were still largely hugging the sidelines on fears that the market 's recent technical rally might prove fragile .
[('Dealers', 'NNS'), ('said', 'VBD'), ('0', '-NONE-'), ('institutions', 'NNS'), ('were', 'VBD'), ('still', 'RB'), ('largely', 'RB'), ('hugging', 'VBG'), ('the', 'DT'), ('sidelines', 'NNS'), ('on', 'IN'), ('fears', 'NNS'), ('that', 'IN'), ('the', 'DT'), ('market', 'NN'), ("'s", 'POS'), ('recent', 'JJ'), ('technical', 'JJ'), ('

410  The October survey of corporate purchasing managers , as expected , provided evidence that economic growth remains subdued .
[('The', 'DT'), ('October', 'NNP'), ('survey', 'NN'), ('of', 'IN'), ('corporate', 'JJ'), ('purchasing', 'VBG'), ('managers', 'NNS'), (',', ','), ('as', 'IN'), ('*', '-NONE-'), ('expected', 'VBN'), (',', ','), ('provided', 'VBN'), ('evidence', 'NN'), ('that', 'IN'), ('economic', 'JJ'), ('growth', 'NN'), ('remains', 'VBZ'), ('subdued', 'VBN'), ('.', '.')]
411  An index of economic activity drawn from the survey stood last month at 47.6 % ; a reading above 50 % would have indicated that the manufacturing sector was improving .
[('An', 'DT'), ('index', 'NN'), ('of', 'IN'), ('economic', 'JJ'), ('activity', 'NN'), ('drawn', 'VBN'), ('*', '-NONE-'), ('from', 'IN'), ('the', 'DT'), ('survey', 'NN'), ('stood', 'VBD'), ('last', 'JJ'), ('month', 'NN'), ('at', 'IN'), ('47.6', 'CD'), ('%', 'NN'), (';', ':'), ('a', 'DT'), ('reading', 'NN'), ('above', 'IN'), ('50', 'CD'), (

449  And some grain analysts are predicting that corn prices might gyrate this month as exporters scrounge to find enough of the crop to meet their obligations to the Soviets .
[('And', 'CC'), ('some', 'DT'), ('grain', 'NN'), ('analysts', 'NNS'), ('are', 'VBP'), ('predicting', 'VBG'), ('that', 'IN'), ('corn', 'NN'), ('prices', 'NNS'), ('might', 'MD'), ('gyrate', 'VB'), ('this', 'DT'), ('month', 'NN'), ('as', 'IN'), ('exporters', 'NNS'), ('scrounge', 'VBP'), ('*-1', '-NONE-'), ('to', 'TO'), ('find', 'VB'), ('enough', 'RB'), ('of', 'IN'), ('the', 'DT'), ('crop', 'NN'), ('0', '-NONE-'), ('*T*-2', '-NONE-'), ('to', 'TO'), ('meet', 'VB'), ('their', 'PRP$'), ('obligations', 'NNS'), ('to', 'TO'), ('the', 'DT'), ('Soviets', 'NNPS'), ('.', '.')]
450  Because of persistent dry weather in the northern Plains , the water level on the upper section of the Mississippi River is so low that many river operators are already trimming the number of barges their tows push at one time .
[('Because', 'IN'),
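
Note that `get_ground_truth_examples` hides trace elements by looking for `NONE` in the tag. With `universal_tagset=True` the traces are tagged `X` instead, so they show up in the printed text (for example `*-1` and `0` in the universal-tagset output for *that*). A possible workaround is sketched below; `readable_text` and `TRACE_TAGS` are hypothetical names, and treating every `X` token as a trace is an assumption, since `X` is a catch-all category that may also cover tokens that are not traces.

```python
# Assumption: under the universal tagset, trace elements carry the tag 'X'
TRACE_TAGS = {'-NONE-', 'X'}

def readable_text(tagged_sent, trace_tags=TRACE_TAGS):
    """Join the tokens of a tagged sentence, skipping trace elements."""
    return ' '.join(word for word, tag in tagged_sent if tag not in trace_tags)

# First tokens of the universal-tagset example for 'that', copied from its output
fragment = [('While', 'ADP'), ('*-1', 'X'), ('acknowledging', 'VERB'),
            ('0', 'X'), ('one', 'NUM'), ('month', 'NOUN'), ("'s", 'PRT'),
            ('figures', 'NOUN')]

fragment_text = readable_text(fragment)
print(fragment_text)
```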

In [15]:
get_ground_truth_examples(token='that', corpus=treebank, tag='ADV', universal_tagset=True)

1  While *-1 acknowledging 0 one month 's figures do n't prove a trend , Mr. Bretz said , `` It does lead you to suspect 0 imports are going down , or at least not increasing that much . ''
[('While', 'ADP'), ('*-1', 'X'), ('acknowledging', 'VERB'), ('0', 'X'), ('one', 'NUM'), ('month', 'NOUN'), ("'s", 'PRT'), ('figures', 'NOUN'), ('do', 'VERB'), ("n't", 'ADV'), ('prove', 'VERB'), ('a', 'DET'), ('trend', 'NOUN'), (',', '.'), ('Mr.', 'NOUN'), ('Bretz', 'NOUN'), ('said', 'VERB'), (',', '.'), ('``', '.'), ('It', 'PRON'), ('does', 'VERB'), ('lead', 'VERB'), ('you', 'PRON'), ('to', 'PRT'), ('suspect', 'VERB'), ('0', 'X'), ('imports', 'NOUN'), ('are', 'VERB'), ('going', 'VERB'), ('down', 'ADV'), (',', '.'), ('or', 'CONJ'), ('at', 'ADP'), ('least', 'ADJ'), ('not', 'ADV'), ('increasing', 'VERB'), ('that', 'ADV'), ('much', 'ADV'), ('.', '.'), ("''", '.')]
2  * Winning a bonus for a third year was n't that important to her , Mrs. Yeargin insists 0 *T*-1 .
[('*', 'X'), ('Winning', 'VERB'), ('a', 