<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/20_Part_of_Speech_Patterns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Patterns and Parts of Speech

Exploring bigrams and parts of speech demonstrates that words do not occur randomly in text. Some words usually occur before or after other words, and words with certain parts of speech tend to occur before or after other parts of speech.

In this notebook we will do two things. First, we will look at bigrams based not on words but on parts of speech, and think about whether these bigrams tell us anything about word order and syntax in English.

Then, we will consider the role that preprocessing might have on part of speech tagging, particularly when it comes to stopword removal.

## Load in some data

Let's load in some data from The Current to start with.

In [None]:
# don't play myrtles in gardens as a means to protect pōhutukawa
# load the TP010 data to the notebook environment
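# -P saves the file into ./the-current/, which is where the next cell reads it from
!wget -P ./the-current 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp010.txt'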

Do some minimal cleaning and processing of the data to get a set of tokens. We will then create a list of each individual comment.

In [1]:
# read in the entire file
tp010 = open('./the-current/tp010.txt', encoding='utf8').read().rstrip()

# remove any punctuation
import re
punctuation = '[#.,!\'"-]'
tp010 = re.sub(pattern = punctuation, repl = '', string = tp010)

# extract the comments
tp010_comments = [comment.split('\t')[1] for comment in tp010.split('\n')]

In [2]:
# check the data
tp010_comments[:8]

['you shoaud plant more nativ tree',
 'i think that is a great idea',
 'that itis a great idea to intagrate more myrtles nto our gardens',
 'i dont know what a myrtle is',
 'because it is going to help plants',
 'because i have already planted a myrtle',
 'because it is going to help our plants',
 'Becaus it will be great to save the world']

For each comment, let's tokenize then part of speech tag the comment. First we need to download nltk resources.



In [3]:
# download tokenizer resources
import nltk
nltk.download(['punkt' ,'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
# tag and tokenize in one go
tp010_tagged = [nltk.pos_tag(nltk.word_tokenize(comment)) for comment in tp010_comments]

In [6]:
tp010_tagged[:3]

[[('you', 'PRP'),
  ('shoaud', 'VBP'),
  ('plant', 'NN'),
  ('more', 'RBR'),
  ('nativ', 'JJ'),
  ('tree', 'NN')],
 [('i', 'NN'),
  ('think', 'VBP'),
  ('that', 'WDT'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('great', 'JJ'),
  ('idea', 'NN')],
 [('that', 'DT'),
  ('itis', 'VBZ'),
  ('a', 'DT'),
  ('great', 'JJ'),
  ('idea', 'NN'),
  ('to', 'TO'),
  ('intagrate', 'VB'),
  ('more', 'JJR'),
  ('myrtles', 'NNS'),
  ('nto', 'IN'),
  ('our', 'PRP$'),
  ('gardens', 'NNS')]]

## **The goal is to create bigrams of the POS tags**

Now, create a new list that is the bigrams of the pos tags only. How can we do this?

1. Create a new variable named `tp010_tags`
2. Set the value of this variable to be a list comprehension, which loops through each comment, and then each (word,tag) pair for each comment, returning only the tag. This template gets you 75% of the way there:

`[[... for (..., ...) in comment] for comment in tp010_tagged]`

3. Look at the first 3 items in `tp010_tags`. You should see this output:

```
[['PRP', 'VBP', 'NN', 'RBR', 'JJ', 'NN'],
 ['NN', 'VBP', 'WDT', 'VBZ', 'DT', 'JJ', 'NN'],
 ['DT',
  'VBZ',
  'DT',
  'JJ',
  'NN',
  'TO',
  'VB',
  'JJR',
  'NNS',
  'IN',
  'PRP$',
  'NNS']]
```

In [7]:
# create tp010 tags here
tp010_tags = [[tag for (word,tag) in comment] for comment in tp010_tagged]

tp010_tags[:3]

[['PRP', 'VBP', 'NN', 'RBR', 'JJ', 'NN'],
 ['NN', 'VBP', 'WDT', 'VBZ', 'DT', 'JJ', 'NN'],
 ['DT',
  'VBZ',
  'DT',
  'JJ',
  'NN',
  'TO',
  'VB',
  'JJR',
  'NNS',
  'IN',
  'PRP$',
  'NNS']]

4. Create a new variable named `tp010_tag_bigrams`
5. Set the value of this variable to be a list comprehension, which returns the results of `nltk.bigrams()` for each comment in `tp010_tags`. Wrap the results of `nltk.bigrams()` inside `list()`
6. Look at the first 3 values in the list. You should see something like this:

```
[[('PRP', 'VBP'), ('VBP', 'NN'), ('NN', 'RBR'), ('RBR', 'JJ'), ('JJ', 'NN')],
 [('NN', 'VBP'),
  ('VBP', 'WDT'),
  ('WDT', 'VBZ'),
  ('VBZ', 'DT'),
  ('DT', 'JJ'),
  ('JJ', 'NN')],
 [('DT', 'VBZ'),
  ('VBZ', 'DT'),
  ('DT', 'JJ'),
  ('JJ', 'NN'),
  ('NN', 'TO'),
  ('TO', 'VB'),
  ('VB', 'JJR'),
  ('JJR', 'NNS'),
  ('NNS', 'IN'),
  ('IN', 'PRP$'),
  ('PRP$', 'NNS')]]
```

In [8]:
tp010_tag_bigrams = [list(nltk.bigrams(comment)) for comment in tp010_tags]
tp010_tag_bigrams[:3]

[[('PRP', 'VBP'), ('VBP', 'NN'), ('NN', 'RBR'), ('RBR', 'JJ'), ('JJ', 'NN')],
 [('NN', 'VBP'),
  ('VBP', 'WDT'),
  ('WDT', 'VBZ'),
  ('VBZ', 'DT'),
  ('DT', 'JJ'),
  ('JJ', 'NN')],
 [('DT', 'VBZ'),
  ('VBZ', 'DT'),
  ('DT', 'JJ'),
  ('JJ', 'NN'),
  ('NN', 'TO'),
  ('TO', 'VB'),
  ('VB', 'JJR'),
  ('JJR', 'NNS'),
  ('NNS', 'IN'),
  ('IN', 'PRP$'),
  ('PRP$', 'NNS')]]

Flatten the list of bigrams into a single list called `tp010_combined`

You can use a for loop or a list comprehension to do this. [Try the top answer here.](https://stackoverflow.com/questions/952914/how-do-i-make-a-flat-list-out-of-a-list-of-lists)


The first five items of the results should look like this:

```
[('PRP', 'VBP'), ('VBP', 'NN'), ('NN', 'RBR'), ('RBR', 'JJ'), ('JJ', 'NN')]
```
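As an alternative to a nested loop or comprehension, the standard library's `itertools.chain.from_iterable` also flattens one level of nesting. The sketch below is a minimal illustration on a hand-typed stand-in list (`toy_tag_bigrams` is a made-up name, not part of the notebook data), so it runs without the TP010 file.

```python
from itertools import chain

# Hand-typed stand-in for tp010_tag_bigrams: one list of tag bigrams per comment.
toy_tag_bigrams = [
    [('PRP', 'VBP'), ('VBP', 'NN')],
    [('NN', 'VBP'), ('VBP', 'WDT'), ('WDT', 'VBZ')],
    [],  # a one-word comment produces no bigrams
]

# chain.from_iterable walks each inner list in turn and yields its items,
# so the result is a single flat list of (tag, tag) pairs.
toy_combined = list(chain.from_iterable(toy_tag_bigrams))

print(toy_combined)
```

Either approach should give the same flat list; the comprehension keeps everything in one visible expression, while `chain.from_iterable` avoids writing the two `for` clauses by hand.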


In [17]:
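# make tp010_combined here
# (the loop variable is named `bigrams` rather than `list`, so it does not shadow the built-in)
tp010_combined = [pair for bigrams in tp010_tag_bigrams for pair in bigrams]


# the equivalent for loop:
# tp010_combined = []
# for bigrams in tp010_tag_bigrams:
#     for pair in bigrams:
#         tp010_combined.append(pair)

tp010_combined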

[('PRP', 'VBP'),
 ('VBP', 'NN'),
 ('NN', 'RBR'),
 ('RBR', 'JJ'),
 ('JJ', 'NN'),
 ('NN', 'VBP'),
 ('VBP', 'WDT'),
 ('WDT', 'VBZ'),
 ('VBZ', 'DT'),
 ('DT', 'JJ'),
 ('JJ', 'NN'),
 ('DT', 'VBZ'),
 ('VBZ', 'DT'),
 ('DT', 'JJ'),
 ('JJ', 'NN'),
 ('NN', 'TO'),
 ('TO', 'VB'),
 ('VB', 'JJR'),
 ('JJR', 'NNS'),
 ('NNS', 'IN'),
 ('IN', 'PRP$'),
 ('PRP$', 'NNS'),
 ('JJ', 'NN'),
 ('NN', 'VBP'),
 ('VBP', 'WP'),
 ('WP', 'DT'),
 ('DT', 'NN'),
 ('NN', 'VBZ'),
 ('IN', 'PRP'),
 ('PRP', 'VBZ'),
 ('VBZ', 'VBG'),
 ('VBG', 'TO'),
 ('TO', 'VB'),
 ('VB', 'NNS'),
 ('IN', 'NNS'),
 ('NNS', 'VBP'),
 ('VBP', 'RB'),
 ('RB', 'VBN'),
 ('VBN', 'DT'),
 ('DT', 'NN'),
 ('IN', 'PRP'),
 ('PRP', 'VBZ'),
 ('VBZ', 'VBG'),
 ('VBG', 'TO'),
 ('TO', 'VB'),
 ('VB', 'PRP$'),
 ('PRP$', 'NNS'),
 ('IN', 'PRP'),
 ('PRP', 'MD'),
 ('MD', 'VB'),
 ('VB', 'JJ'),
 ('JJ', 'TO'),
 ('TO', 'VB'),
 ('VB', 'DT'),
 ('DT', 'NN'),
 ('WP', 'IN'),
 ('IN', 'NN'),
 ('NN', 'VBZ'),
 ('VBZ', 'JJ'),
 ('JJ', 'IN'),
 ('IN', 'PRP'),
 ('PRP', 'VBP'),
 ('VBP', 'RB')

Now that you have a list of POS tag bigrams, run a `FreqDist` to see which ones are most frequent. Look at the top ten bigrams using `.most_common(10)`. What do you see in the results? What sorts of words could fill each slot? What does this say about the patterning of syntax in English?




In [18]:
# make fdist here.
nltk.FreqDist(tp010_combined)

FreqDist({('DT', 'NN'): 99, ('TO', 'VB'): 94, ('PRP', 'VBP'): 91, ('JJ', 'NN'): 91, ('NNP', 'NNP'): 81, ('MD', 'VB'): 69, ('NN', 'VBP'): 65, ('IN', 'PRP'): 60, ('NN', 'NN'): 57, ('PRP', 'MD'): 52, ...})
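To read off ranked counts rather than the summary printout, `.most_common()` returns a list of `(item, count)` pairs. `nltk.FreqDist` is built on Python's `collections.Counter`, so the sketch below uses a plain `Counter` on a small hand-typed list (`toy_combined` is an illustrative stand-in, not the TP010 data) to show the shape of the result.

```python
from collections import Counter

# Hand-typed stand-in for a flat list of POS tag bigrams.
toy_combined = [
    ('DT', 'NN'), ('JJ', 'NN'), ('DT', 'NN'),
    ('TO', 'VB'), ('DT', 'NN'), ('TO', 'VB'),
]

toy_counts = Counter(toy_combined)

# most_common(n) returns the n highest counts as ((tag1, tag2), count) pairs.
top_two = toy_counts.most_common(2)

for (first_tag, second_tag), count in top_two:
    print(first_tag, second_tag, count)
```

With the real data, calling `.most_common(10)` on the `FreqDist` should return the same kind of list, which is easier to loop over than the printed summary.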

## **Stability of patterns**

Try replicating this analysis on different English data. Will you find the same bigram patterns? Discuss the results in terms of what this means for linguistic analysis and patterns in language.
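One way to make the comparison repeatable is to wrap the counting steps in a function and call it once per data set. The sketch below is only an outline under some assumptions: `tag_bigram_counts` is a hypothetical helper name, the two samples are hand-tagged toy sentences rather than tagger output, and `zip(tags, tags[1:])` stands in for `nltk.bigrams()` so the example runs without NLTK.

```python
from collections import Counter

def tag_bigram_counts(tagged_sentences):
    """Count POS tag bigrams within each sentence, never across sentence boundaries."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for (_, tag) in sentence]
        # zip(tags, tags[1:]) pairs each tag with the tag that follows it
        counts.update(zip(tags, tags[1:]))
    return counts

# Two tiny hand-tagged samples (the tags are typed in by hand for illustration).
sample_a = [
    [('the', 'DT'), ('dog', 'NN'), ('ate', 'VBD'), ('the', 'DT'), ('ball', 'NN')],
]
sample_b = [
    [('a', 'DT'), ('cat', 'NN'), ('slept', 'VBD')],
    [('the', 'DT'), ('old', 'JJ'), ('cat', 'NN'), ('slept', 'VBD')],
]

counts_a = tag_bigram_counts(sample_a)
counts_b = tag_bigram_counts(sample_b)

# Tag bigrams that occur in both samples.
shared = set(counts_a) & set(counts_b)
print(sorted(shared))
```

For a real comparison you would pass in the output of `nltk.pos_tag()` for each text and compare the top of each ranking, keeping in mind that raw counts depend on text length.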

# Pre-processing and parts of speech

Ok, so we have some idea about the patterning of language and how words occur in certain slots (at least in English!).

Take a moment to consider: what effects might different preprocessing steps have on part of speech tagging? In particular, what effect would stopword removal have?




In [19]:
# import NLTK list of English stopwords
nltk.download('stopwords')

from nltk.corpus import stopwords
sw = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
# save the two examples as strings
ex1 = 'dog ate ball'
ex2 = 'the dog ate the ball'

In [21]:
# pos tags for ex1
nltk.pos_tag(nltk.word_tokenize(ex1))

[('dog', 'NN'), ('ate', 'NN'), ('ball', 'NN')]

In [22]:
# pos tags for ex2
nltk.pos_tag(nltk.word_tokenize(ex2))

[('the', 'DT'), ('dog', 'NN'), ('ate', 'VBD'), ('the', 'DT'), ('ball', 'NN')]

### **Discuss**

Why is the tagger more accurate for ex2 compared to ex1? Why is the tagger counting everything as a noun in ex1?

### **Your Turn**

Select a short text of your choice (at least a paragraph in length). Use `nltk.pos_tag()` to pos tag two versions of the text: one using the raw text, and one using the raw text with stopwords removed.

Consider your results and discuss this question:

- What happens to the accuracy of the tags when you remove the stopwords? What is the reason for this? Your answer will likely be very similar to the example above...
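One detail to watch when removing stopwords: NLTK's English stopword list is all lowercase, so a plain membership test leaves capitalised stopwords in place. The sketch below contrasts a case-sensitive and a case-insensitive filter. `toy_stopwords` is a hand-picked stand-in for the full list and `tokens` is typed in by hand, so the example runs without any downloads.

```python
# Hand-picked stand-in for the lowercase NLTK English stopword list.
toy_stopwords = {'i', 'am', 'not', 'a', 'you', 'have', 'been'}

# Hand-typed tokens for "You have been reported. I am not a bot."
tokens = ['You', 'have', 'been', 'reported', '.', 'I', 'am', 'not', 'a', 'bot', '.']

# Case-sensitive test: 'You' and 'I' are not equal to 'you' and 'i', so they stay.
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# Case-insensitive test: lowercase each token only for the comparison,
# keeping the original spelling of the tokens that survive.
case_insensitive = [t for t in tokens if t.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Whichever filter you choose, the tagger then sees word sequences that no longer resemble ordinary English, which is the point of the exercise.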

In [24]:
copy_pasta = """ 
You have been reported.

I am not a bot. I am a Volunteer Reddit moderator. I do not have mod powers but my reports are taken seriously and those who get on my bad side tend to get banned in under 24 hours. I have numerous rules, which you may read in my post history, but 1 is the most important rule of all

• I am an officer in training, and I expect to be treated the same way I would be with my uniform and badge.

Watch your back and get used to this face kiddo, you’ll be seeing a lot of it.
"""

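# drop stopwords; the test is case-sensitive and `sw` is all lowercase,
# so capitalised words such as 'You' and 'I' are kept
cleared = ' '.join([word for word in copy_pasta.split() if word not in sw])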

rawPOS = nltk.pos_tag(nltk.word_tokenize(copy_pasta))
clearedPOS = nltk.pos_tag(nltk.word_tokenize(cleared))

In [25]:
clearedPOS

[('You', 'PRP'),
 ('reported', 'VBD'),
 ('.', '.'),
 ('I', 'PRP'),
 ('bot', 'VBP'),
 ('.', '.'),
 ('I', 'PRP'),
 ('Volunteer', 'VBP'),
 ('Reddit', 'NNP'),
 ('moderator', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('mod', 'VBP'),
 ('powers', 'NNS'),
 ('reports', 'NNS'),
 ('taken', 'VBN'),
 ('seriously', 'RB'),
 ('get', 'VB'),
 ('bad', 'JJ'),
 ('side', 'NN'),
 ('tend', 'VBP'),
 ('get', 'NN'),
 ('banned', 'VBN'),
 ('24', 'CD'),
 ('hours', 'NNS'),
 ('.', '.'),
 ('I', 'PRP'),
 ('numerous', 'JJ'),
 ('rules', 'NNS'),
 (',', ','),
 ('may', 'MD'),
 ('read', 'VB'),
 ('post', 'NN'),
 ('history', 'NN'),
 (',', ','),
 ('1', 'CD'),
 ('important', 'JJ'),
 ('rule', 'NN'),
 ('•', 'IN'),
 ('I', 'PRP'),
 ('officer', 'NN'),
 ('training', 'NN'),
 (',', ','),
 ('I', 'PRP'),
 ('expect', 'VBP'),
 ('treated', 'JJ'),
 ('way', 'NN'),
 ('I', 'PRP'),
 ('would', 'MD'),
 ('uniform', 'VB'),
 ('badge', 'NN'),
 ('.', '.'),
 ('Watch', 'VB'),
 ('back', 'RB'),
 ('get', 'VB'),
 ('used', 'VBN'),
 ('face', 'NN'),
 ('kiddo', 'NN'),


In [26]:
rawPOS

[('You', 'PRP'),
 ('have', 'VBP'),
 ('been', 'VBN'),
 ('reported', 'VBN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('not', 'RB'),
 ('a', 'DT'),
 ('bot', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('a', 'DT'),
 ('Volunteer', 'NNP'),
 ('Reddit', 'NNP'),
 ('moderator', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('have', 'VB'),
 ('mod', 'JJ'),
 ('powers', 'NNS'),
 ('but', 'CC'),
 ('my', 'PRP$'),
 ('reports', 'NNS'),
 ('are', 'VBP'),
 ('taken', 'VBN'),
 ('seriously', 'RB'),
 ('and', 'CC'),
 ('those', 'DT'),
 ('who', 'WP'),
 ('get', 'VBP'),
 ('on', 'IN'),
 ('my', 'PRP$'),
 ('bad', 'JJ'),
 ('side', 'NN'),
 ('tend', 'NN'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('banned', 'VBN'),
 ('in', 'IN'),
 ('under', 'IN'),
 ('24', 'CD'),
 ('hours', 'NNS'),
 ('.', '.'),
 ('I', 'PRP'),
 ('have', 'VBP'),
 ('numerous', 'JJ'),
 ('rules', 'NNS'),
 (',', ','),
 ('which', 'WDT'),
 ('you', 'PRP'),
 ('may', 'MD'),
 ('read', 'VB'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('post', 'NN'),
 ('

# **Finding ambiguous words**

Can a word be ambiguous based on its part of speech? Can some words be used as nouns *and* verbs? If a word can be used either as a noun or a verb, the word by itself is ambiguous and needs some sort of co-text in order to determine the actual part of speech.

An easy example to demonstrate this is

- a comb : noun
- to comb: verb

In both cases, the word that comes before strongly predicts (or even dictates) the part of speech for the word.

`a` is a determiner, so the `comb` that follows it is a noun. `to` is often a preposition, as in `from work to school`, but here it is part of the infinitive form of the verb `comb`.

This small example should help reinforce how patterns in language, based on both word form *and* part of speech, can be exploited by linguists.
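To make the idea concrete, the sketch below records which tag comes immediately before each use of a target word, grouped by the tag the target word itself received. The tagged tokens are assigned by hand for illustration (they are not tagger output), and `context` is just an illustrative variable name.

```python
from collections import Counter, defaultdict

# Hand-tagged tokens for "a comb", "to comb", "the comb".
tagged = [
    ('a', 'DT'), ('comb', 'NN'),
    ('to', 'TO'), ('comb', 'VB'),
    ('the', 'DT'), ('comb', 'NN'),
]

target = 'comb'

# context[tag_of_target][tag_of_previous_word] -> count
context = defaultdict(Counter)

# Pair every token with the token that follows it.
for (prev_word, prev_tag), (word, tag) in zip(tagged, tagged[1:]):
    if word == target:
        context[tag][prev_tag] += 1

for tag, previous_tags in context.items():
    print(target, tag, dict(previous_tags))
```

In this toy data the noun uses all follow a determiner and the verb use follows `to`, which is the kind of regularity a tagger can exploit.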


In [27]:
droids = nltk.FreqDist(nltk.pos_tag(nltk.word_tokenize('Where is my comb? Please comb the desert for droids!')))

# how can a comb be a noun and a verb?!
for key in droids.keys():
  if 'comb' in key:
    print(key, droids[key])

('comb', 'NN') 1
('comb', 'VBZ') 1



Let's explore this following the example from NLTK to find ambiguous words.

In [28]:
# we need the brown corpus
nltk.download('book')
from nltk.corpus import brown

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming

In [29]:
# Make a conditional frequency distribution of Brown POS tags
# using the universal tags, and lowercase the words
brown_news_tagged = brown.tagged_words(categories = 'news', tagset = 'universal')

brown_cfd = nltk.ConditionalFreqDist((word.lower(), tag)
                                for (word, tag) in brown_news_tagged)

After making the CFD, loop through each word (the words are the keys of the dictionary and are returned by `brown_cfd.conditions()`).

Then, checking the `len()` of a word's entry lets you see how many tags are associated with that word (each distinct tag adds 1 to the `len()`). This method thus allows you to locate words used with many different POS tags.
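If the structure of a conditional frequency distribution feels opaque, it can help to think of it as a dictionary that maps each word to its own counter of tags. The sketch below imitates that structure with `defaultdict(Counter)` from the standard library on a few hand-tagged tokens. It is an analogy under that assumption, not NLTK's actual implementation, and `toy_cfd` is an illustrative name.

```python
from collections import Counter, defaultdict

# Hand-tagged tokens using universal-style tags (typed in for illustration).
toy_tagged = [
    ('the', 'DET'), ('open', 'ADJ'), ('door', 'NOUN'),
    ('They', 'PRON'), ('open', 'VERB'),
    ('The', 'DET'), ('open', 'NOUN'),
]

# word -> Counter of tags, lowercasing the word as in the Brown example.
toy_cfd = defaultdict(Counter)
for word, tag in toy_tagged:
    toy_cfd[word.lower()][tag] += 1

# len() of a Counter is the number of distinct tags seen for that word.
for word in sorted(toy_cfd):
    print(word, len(toy_cfd[word]), dict(toy_cfd[word]))

# Words seen with more than one tag.
ambiguous = [word for word in sorted(toy_cfd) if len(toy_cfd[word]) > 1]
print(ambiguous)
```

Note that `the` is counted twice but still has a `len()` of 1, because `len()` counts distinct tags rather than occurrences.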



In [30]:
# find all words associated with more than a certain number of pos tags
for word in sorted(brown_cfd.conditions()):
  # if the entry has more than three POS tags
  if len(brown_cfd[word]) > 3:

    # get just the tag, not the frequency
    tags = [tag for (tag, _) in brown_cfd[word].most_common()]

    print(word, ' '.join(tags))

best ADJ ADV VERB NOUN
close ADV ADJ VERB NOUN
open ADJ VERB NOUN ADV
present ADJ ADV NOUN VERB
that ADP DET PRON ADV


In [39]:
brown_cfd['best']

FreqDist({'ADJ': 28, 'ADV': 1, 'VERB': 1, 'NOUN': 1})

## **Your Turn**

If we have time, spend it now loading in your own text(s) and tagging them for part of speech.

- Then, run some frequency distributions and conditional frequency distributions.
- Can you find the most frequent nouns, verbs, etc.?
- Can you find ambiguous words?
- Can you find particular types of ambiguous words, such as words which are ambiguous between nouns and verbs?
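As a possible starting point for the last two questions, the sketch below counts the most frequent nouns and then picks out words that were tagged as both a noun and a verb. It assumes your text has already been turned into a list of `(word, tag)` pairs with universal-style tags; the `toy_tagged` list here is hand-typed for illustration, and the variable names are arbitrary.

```python
from collections import Counter, defaultdict

# Hand-tagged stand-in for your own tagged text.
toy_tagged = [
    ('a', 'DET'), ('comb', 'NOUN'), ('to', 'PRT'), ('comb', 'VERB'),
    ('the', 'DET'), ('comb', 'NOUN'),
    ('the', 'DET'), ('best', 'ADJ'), ('plan', 'NOUN'),
    ('we', 'PRON'), ('plan', 'VERB'), ('best', 'ADV'),
]

# Most frequent nouns: count only the words whose tag is NOUN.
noun_counts = Counter(word for word, tag in toy_tagged if tag == 'NOUN')
print(noun_counts.most_common(2))

# word -> Counter of the tags it was seen with.
tags_by_word = defaultdict(Counter)
for word, tag in toy_tagged:
    tags_by_word[word.lower()][tag] += 1

# Keep words whose tag set includes both NOUN and VERB.
wanted = {'NOUN', 'VERB'}
noun_verb = sorted(word for word, tags in tags_by_word.items() if wanted <= set(tags))
print(noun_verb)
```

Changing `wanted` to another pair of tags, such as `{'ADJ', 'ADV'}`, gives a different type of ambiguity to look for.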