# AnTeDe Lab B: using PoS taggers 

## Session goal

The goal of this session is to help you familiarize yourself with PoS tagging. We'll be using NLTK, Stanza, and spaCy.
For spaCy, in addition to *pip install spacy*, you'll need to run *python -m spacy download en_core_web_sm*.


In [2]:
#! pip install stanza
import stanza


from nltk.tag import PerceptronTagger
from nltk.tokenize import word_tokenize

from nltk import download
download('averaged_perceptron_tagger')
download('punkt')

import spacy
!python -m spacy download en_core_web_sm

text='I really like this class. This lab is going to be fun.'
spacy_analyzer = spacy.load('en_core_web_sm')

stanza.download('en')
stanza_pipeline = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')


def run_stanza(text):
    
    pairs=[]    
    doc = stanza_pipeline(text)
    for sent in doc.sentences:
        for word in sent.words:
            pairs.append((word.text, word.xpos))
    return pairs        

def run_spacy(text):
    
    doc = spacy_analyzer(text)
    # Use token.text so the pairs hold plain strings, like the other two taggers
    return [(token.text, token.tag_) for token in doc]

def run_nltk (text):
    
    tagger = PerceptronTagger()
    return tagger.tag(word_tokenize(text))

       


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\adria\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adria\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')



Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-07 11:16:36 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2022-04-07 11:17:36 INFO: Finished downloading models and saved to C:\Users\adria\stanza_resources.
2022-04-07 11:17:36 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |

2022-04-07 11:17:36 INFO: Use device: cpu
2022-04-07 11:17:36 INFO: Loading: tokenize
2022-04-07 11:17:37 INFO: Loading: pos
2022-04-07 11:17:37 INFO: Loading: lemma
2022-04-07 11:17:37 INFO: Done loading processors!


In [3]:
def visualize_pos_results (text):
    
    stanza_pairs = run_stanza(text)    
    spacy_pairs = run_spacy(text)   
    nltk_pairs = run_nltk(text) 

    if len(stanza_pairs)==len(spacy_pairs)==len(nltk_pairs):
        tokens = [x[0] for x in stanza_pairs]
        stanza_tags = [x[1] for x in stanza_pairs]
        spacy_tags = [x[1] for x in spacy_pairs]
        nltk_tags = [x[1] for x in nltk_pairs]

        import pandas as pd    
        df=pd.DataFrame(columns = ['tokens','Stanza', 'NLTK', 'Spacy'])   
        df['tokens']=tokens
        df['Stanza']=stanza_tags
        df['NLTK']=nltk_tags
        df['Spacy']=spacy_tags

        print (df)

    else:
        print ('-'*30)
        print ('Stanza')
        print (stanza_pairs)
        print ('-'*30)
        print ('NLTK')
        print (nltk_pairs)
        print ('-'*30)
        print ('Spacy')
        print (spacy_pairs)
        
        

In [6]:
from nltk.data import load
from nltk import download
download('tagsets')
tagdict = load('help/tagsets/upenn_tagset.pickle')
tagdict['WDT'][0]

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\adria\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


'WH-determiner'

In [5]:
sentence = "Traffic congestion in the Shire is getting worse. After we landed at Baggins international airport, we got stuck on the ring road around Hobbiton."
visualize_pos_results(sentence)


           tokens Stanza NLTK Spacy
0         Traffic     NN   JJ    NN
1      congestion     NN   NN    NN
2              in     IN   IN    IN
3             the     DT   DT    DT
4           Shire    NNP  NNP   NNP
5              is    VBZ  VBZ   VBZ
6         getting    VBG  VBG   VBG
7           worse    JJR  JJR   JJR
8               .      .    .     .
9           After     IN   IN    IN
10             we    PRP  PRP   PRP
11         landed    VBD  VBD   VBD
12             at     IN   IN    IN
13        Baggins    NNP  NNP   NNP
14  international     JJ   JJ    JJ
15        airport     NN   NN    NN
16              ,      ,    ,     ,
17             we    PRP  PRP   PRP
18            got    VBD  VBD   VBD
19          stuck    VBN  VBN   VBN
20             on     IN   IN    IN
21            the     DT   DT    DT
22           ring     NN   NN    NN
23           road     NN   NN    NN
24         around     IN   IN    IN
25       Hobbiton    NNP  NNP   NNP
26              .      .    

In [8]:
sentence_1='Back me up.'
visualize_pos_results(sentence_1)

sentence_2='I asked them to back me up.'
visualize_pos_results(sentence_2)

sentence_3='Watch your back.'
visualize_pos_results(sentence_3)

  tokens Stanza NLTK Spacy
0   Back     VB   RB    RB
1     me    PRP  PRP   PRP
2     up     RP   RP    RP
3      .      .    .     .
  tokens Stanza NLTK Spacy
0      I    PRP  PRP   PRP
1  asked    VBD  VBD   VBD
2   them    PRP  PRP   PRP
3     to     TO   TO    TO
4   back     VB   VB    VB
5     me    PRP  PRP   PRP
6     up     RP   RP    RP
7      .      .    .     .
  tokens Stanza  NLTK Spacy
0  Watch     VB    VB    VB
1   your   PRP$  PRP$  PRP$
2   back     NN    NN    NN
3      .      .     .     .
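The tables above are easier to scan if only the rows where the taggers disagree are kept. The following is a minimal sketch of that idea, not part of the original lab: `find_disagreements` is a hypothetical helper, and the input is hard-coded from the `Back me up.` table so the block runs without any tagger installed.

```python
def find_disagreements(tagged):
    """Return (token, {tagger: tag}) for every position where the tags differ.

    `tagged` maps a tagger name to its list of (token, tag) pairs.
    All lists must have the same length, i.e. the tokenizers must agree.
    """
    lengths = {len(pairs) for pairs in tagged.values()}
    if len(lengths) != 1:
        raise ValueError("tokenizations differ; cannot compare position by position")

    names = list(tagged)
    rows = []
    # zip(*) walks the three lists in parallel, one token position at a time
    for position in zip(*(tagged[name] for name in names)):
        tags = {name: pair[1] for name, pair in zip(names, position)}
        if len(set(tags.values())) > 1:
            # Report the token as seen by the first tagger in the mapping
            rows.append((position[0][0], tags))
    return rows


# Pairs copied from the 'Back me up.' output table
back_me_up = {
    "Stanza": [("Back", "VB"), ("me", "PRP"), ("up", "RP"), (".", ".")],
    "NLTK": [("Back", "RB"), ("me", "PRP"), ("up", "RP"), (".", ".")],
    "Spacy": [("Back", "RB"), ("me", "PRP"), ("up", "RP"), (".", ".")],
}

disagreements = find_disagreements(back_me_up)
print(disagreements)
```

In the notebook, the same helper could be fed the results of `run_stanza`, `run_nltk`, and `run_spacy` directly, as long as all three produce the same number of tokens.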


**When** can take several different PoS tags, depending on its role in the sentence.

In [9]:
sentences=['When did you last go to Bern?',   # interrogative adverb
'Raise your hand when you\'re finished']  # conjunction

for sentence in sentences:
    visualize_pos_results(sentence)  # prints the table; it returns nothing

  tokens Stanza NLTK Spacy
0   When    WRB  WRB   WRB
1    did    VBD  VBD   VBD
2    you    PRP  PRP   PRP
3   last     VB   JJ    RB
4     go     VB   VB    VB
5     to     IN   TO    IN
6   Bern    NNP  NNP   NNP
7      ?      .    .     .
     tokens Stanza  NLTK Spacy
0     Raise     VB    VB    VB
1      your   PRP$  PRP$  PRP$
2      hand     NN    NN    NN
3      when    WRB   WRB   WRB
4       you    PRP   PRP   PRP
5       're    VBP   VBP   VBP
6  finished    VBN   VBN   VBN
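One way to summarize tables like these is the share of tokens on which each pair of taggers assigns the same tag. The sketch below is an illustration under stated assumptions: `pairwise_agreement` is a hypothetical helper, and the tag lists are copied by hand from the `When did you last go to Bern?` table. Agreement only says how similar two taggers are; it does not say which of them is right.

```python
from itertools import combinations


def pairwise_agreement(tags_by_tagger):
    """Return {(tagger_a, tagger_b): fraction of positions with identical tags}."""
    result = {}
    for a, b in combinations(list(tags_by_tagger), 2):
        tags_a, tags_b = tags_by_tagger[a], tags_by_tagger[b]
        if len(tags_a) != len(tags_b) or not tags_a:
            raise ValueError("tag lists must be non-empty and of equal length")
        matches = sum(x == y for x, y in zip(tags_a, tags_b))
        result[(a, b)] = matches / len(tags_a)
    return result


# Tags copied from the 'When did you last go to Bern?' output table
when_question = {
    "Stanza": ["WRB", "VBD", "PRP", "VB", "VB", "IN", "NNP", "."],
    "NLTK": ["WRB", "VBD", "PRP", "JJ", "VB", "TO", "NNP", "."],
    "Spacy": ["WRB", "VBD", "PRP", "RB", "VB", "IN", "NNP", "."],
}

agreement = pairwise_agreement(when_question)
for pair, share in agreement.items():
    print(pair, share)
```

For this sentence, the three taggers all differ on `last`, and NLTK alone tags `to` as `TO`, which is why Stanza and Spacy come out closest to each other.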


In [10]:
tagdict['WRB'][0]

'Wh-adverb'

In [11]:
tagdict['PRP$'][0]

'pronoun, possessive'
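Looking tags up one cell at a time gets tedious. A small helper can describe several tags in one call; this is only a sketch, and `describe_tags` is a hypothetical name. It assumes the tag set maps each tag to a tuple whose first element is the description, which is how `tagdict` is indexed above. To keep the block runnable without NLTK, it uses a stand-in dictionary holding only the three descriptions shown in this notebook, with the second tuple element left empty as a placeholder.

```python
def describe_tags(tags, tagset):
    """Map each tag to its description, or 'unknown tag' if it is not in the tag set."""
    return {
        tag: tagset[tag][0] if tag in tagset else "unknown tag"
        for tag in tags
    }


# Stand-in for tagdict: descriptions copied from the outputs above,
# second tuple element is an empty placeholder (assumption)
sample_tagset = {
    "WDT": ("WH-determiner", ""),
    "WRB": ("Wh-adverb", ""),
    "PRP$": ("pronoun, possessive", ""),
}

descriptions = describe_tags(["WRB", "PRP$", "XYZ"], sample_tagset)
print(descriptions)
```

In the notebook itself, passing `tagdict` instead of `sample_tagset` should work the same way, given the indexing used in the cells above.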

What's happening in the following examples? Which PoS tagger does better?

In [12]:
sentences=['An experienced man should always man the ship',
'Never has so much been owed to so many by so few',
           'A nation will not survive morally or economically \
when so few have so much and so many have so little']

for sentence in sentences:
    visualize_pos_results(sentence)  # prints the table; it returns nothing

        tokens Stanza NLTK Spacy
0           An     DT   DT    DT
1  experienced     JJ   JJ    JJ
2          man     NN   NN    NN
3       should     MD   MD    MD
4       always     RB   RB    RB
5          man     VB   NN    VB
6          the     DT   DT    DT
7         ship     NN   NN    NN
   tokens Stanza NLTK Spacy
0   Never     RB   RB    RB
1     has    VBZ  VBZ   VBZ
2      so     RB   RB    RB
3    much     RB   JJ    RB
4    been    VBN  VBN   VBN
5    owed    VBN  VBN   VBN
6      to     IN   TO    IN
7      so     RB   RB    RB
8    many     JJ   JJ    JJ
9      by     IN   IN    IN
10     so     RB   IN    RB
11    few     JJ   JJ    JJ
          tokens Stanza NLTK Spacy
0              A     DT   DT    DT
1         nation     NN   NN    NN
2           will     MD   MD    MD
3            not     RB   RB    RB
4        survive     VB   VB    VB
5        morally     RB   RB    RB
6             or     CC   CC    CC
7   economically     RB   RB    RB
8           when    WRB 

What's happening in the following examples? 

In [13]:
sentences = [
    'That much is true.',
    'I don\'t know that much.', 
    'That\'s not that interesting.',
    'How much is true?',
    'Many of them are here.',
    'This is mine.',
    'Everyone knows',
    ]
for sentence in sentences:
    visualize_pos_results(sentence)  # prints the table; it returns nothing

  tokens Stanza NLTK Spacy
0   That     DT   DT    DT
1   much     RB   JJ    JJ
2     is    VBZ  VBZ   VBZ
3   true     JJ   JJ    JJ
4      .      .    .     .
  tokens Stanza NLTK Spacy
0      I    PRP  PRP   PRP
1     do    VBP  VBP   VBP
2    n't     RB   RB    RB
3   know     VB   VB    VB
4   that     RB   RB    DT
5   much     JJ   JJ    JJ
6      .      .    .     .
        tokens Stanza NLTK Spacy
0         That     DT   DT    DT
1           's    VBZ  VBZ   VBZ
2          not     RB   RB    RB
3         that     RB   IN    RB
4  interesting     JJ  VBG    JJ
5            .      .    .     .
  tokens Stanza NLTK Spacy
0    How    WRB  WRB   WRB
1   much     JJ   JJ    JJ
2     is    VBZ  VBZ   VBZ
3   true     JJ   JJ    JJ
4      ?      .    .     .
  tokens Stanza NLTK Spacy
0   Many     JJ   JJ    JJ
1     of     IN   IN    IN
2   them    PRP  PRP   PRP
3    are    VBP  VBP   VBP
4   here     RB   RB    RB
5      .      .    .     .
  tokens Stanza NLTK Spacy
0   This     