In [None]:
import os

## Morphological analysis

Morphological analysis involves studying the structure and formation of words. Some techniques used in morphological analysis are stemming, lemmatization and POS-tagging. **POS-tagging** is the process of assigning part-of-speech tags (noun, verb, adjective etc.) to tokens (words). This is done automatically by a POS-tagger.

### StanfordPOSTagger

A POS-tagger is a program that tags words in raw text, indicating their part-of-speech. [StanfordPOSTagger](https://nlp.stanford.edu/software/tagger.html) is widely used in NLP tasks. Please download the full version, then unzip the archive and search for *stanford-postagger.jar*. To import the tagger use:

In [None]:
from nltk.tag.stanford import StanfordPOSTagger

Then create a tagger instance. We will use *english-bidirectional-distsim.tagger*.

In [None]:
java_path = "C:\\Program Files\\Java\\jdk-19\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path

path_to_model = "stanford-postagger-full-2020-11-17/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-full-2020-11-17/stanford-postagger.jar"
tagger = StanfordPOSTagger(path_to_model, path_to_jar)

To compute POS tags we use the *tag* method. Note that before we POS-tag a text, we must tokenize it first.

In [None]:
tagger.tag(["I","saw","my","cat", "playing", "with", "a", "dog", "."])

[('I', 'PRP'),
 ('saw', 'VBD'),
 ('my', 'PRP$'),
 ('cat', 'NN'),
 ('playing', 'VBG'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('dog', 'NN'),
 ('.', '.')]

You can see here [what each tag means](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

In case you receive an error while trying to use the tagger, read here: https://stackoverflow.com/questions/34692987/cant-make-stanford-pos-tagger-working-in-nltk
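
The token list passed to `tagger.tag` above was typed by hand. As a rough sketch of what the tokenization step does, the regular expression below splits a sentence into word and punctuation tokens. It is a crude stand-in, not NLTK's `word_tokenize`, and the name `simple_tokenize` is made up for this illustration.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters becomes one token,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I saw my cat playing with a dog.")
print(tokens)
```

For this sentence the result matches the hand-written list, but contractions, abbreviations and hyphenated words need a real tokenizer.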

### NLTK tagger

You can also use NLTK's own POS tagger, although the Stanford POS tagger is often reported to be more accurate.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
nltk.pos_tag(["I","saw","my","cat", "playing", "with", "a", "dog", "."])

[('I', 'PRP'),
 ('saw', 'VBD'),
 ('my', 'PRP$'),
 ('cat', 'JJ'),
 ('playing', 'NN'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('dog', 'NN'),
 ('.', '.')]
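
The two taggers do not agree on every token of this sentence. A minimal sketch for listing the disagreements, using the two outputs copied from the cells above (the helper name `tag_differences` is an assumption made for this example):

```python
stanford_tags = [('I', 'PRP'), ('saw', 'VBD'), ('my', 'PRP$'), ('cat', 'NN'),
                 ('playing', 'VBG'), ('with', 'IN'), ('a', 'DT'), ('dog', 'NN'),
                 ('.', '.')]
nltk_tags = [('I', 'PRP'), ('saw', 'VBD'), ('my', 'PRP$'), ('cat', 'JJ'),
             ('playing', 'NN'), ('with', 'IN'), ('a', 'DT'), ('dog', 'NN'),
             ('.', '.')]

def tag_differences(first, second):
    # Assumes both lists cover the same tokens in the same order.
    return [(word, tag1, tag2)
            for (word, tag1), (_, tag2) in zip(first, second)
            if tag1 != tag2]

differences = tag_differences(stanford_tags, nltk_tags)
for word, stanford_tag, nltk_tag in differences:
    print(f"{word}: Stanford={stanford_tag}, NLTK={nltk_tag}")
```

Here the disagreements are "cat" (NN vs. JJ) and "playing" (VBG vs. NN).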

You can look up the meaning of a tag with *nltk.help.upenn_tagset(tag)*:

In [None]:
nltk.download('tagsets_json')

[nltk_data] Downloading package tagsets_json to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets_json.zip.


True

In [None]:
nltk.help.upenn_tagset('PRP$')

PRP$: pronoun, possessive
    her his mine my our ours their thy your


## Syntax analysis (also known as parsing)
Syntax analysis is used to obtain the structure of a sentence and the connections between its tokens (words).

### Parsing with Context-Free Grammars

A parser processes input sentences according to the productions of a **grammar**, which is a set of rules used to describe all possible sentences in a language. For example, the production NP -> Det N says that a noun phrase can be a determiner followed by a noun. A grammar is "context-free" when every production has a single nonterminal on its left-hand side, so a rule can be applied without considering the surrounding symbols.
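
To make "a set of rules that describes all possible sentences" concrete, here is a small sketch that enumerates every sentence a tiny grammar can produce. The grammar, its dict representation and the `expand` helper are assumptions made for this illustration. It only works for grammars without recursion, since otherwise the language is infinite.

```python
from itertools import product

# Toy grammar (illustration only): each nonterminal maps to a list of
# right-hand sides; any symbol that is not a key is a terminal word.
toy_grammar = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N": [["cat"], ["dog"]],
    "V": [["chased"]],
}

def expand(symbol, grammar):
    """Yield every word sequence that `symbol` can derive."""
    if symbol not in grammar:
        yield [symbol]
        return
    for rhs in grammar[symbol]:
        # Combine every expansion of each symbol on the right-hand side.
        options = [list(expand(s, grammar)) for s in rhs]
        for parts in product(*options):
            yield [word for part in parts for word in part]

sentences = [" ".join(words) for words in expand("S", toy_grammar)]
print(len(sentences))
print(sentences[0])
```

With two determiners and two nouns there are 4 noun phrases, so the grammar describes 4 x 4 = 16 sentences.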

By convention, the left-hand side of the first production is the start symbol of the grammar, typically S. All parse trees must have this symbol as their root label.

Note that a production like VP -> V NP | V NP PP is an abbreviation for the two productions VP -> V NP and VP -> V NP PP.
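
The abbreviation can be undone mechanically. A minimal sketch, assuming rules are written as plain strings and that no quoted terminal contains the "|" character (the helper name `split_alternatives` is made up for this example):

```python
def split_alternatives(rule):
    # Split "LHS -> A | B" into ["LHS -> A", "LHS -> B"].
    lhs, rhs = rule.split("->")
    return [f"{lhs.strip()} -> {alt.strip()}" for alt in rhs.split("|")]

productions = split_alternatives("VP -> V NP | V NP PP")
print(productions)
```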

#### Recursive Descent Parsing

This is a top-down parser: it builds the parse tree from the start symbol downwards, expanding the tree nodes in a depth-first manner and backtracking to try another production whenever the current one does not match the input.

In [None]:
gram = nltk.CFG.fromstring("""  S -> NP VP | TO VB
VP -> V NP | V NP PP | V PP
PP -> P NP
V -> "caught" | "ate" | "likes" | "like" | "chase" | "go"
NP -> Det N | Det N PP | PRP
Det -> "the" | "a" | "an" | "my" | "some"
N -> "mice" | "cat" | "dog" |  "school"
P -> "in" | "to" | "on"
TO -> "to"
VB -> "like" | "go" | "catch" | "eat" | "chase"
PRP -> "I"  """)

# Note that a grammar needs terminal symbols (e.g. Det -> "the" | "a" | "an" | "my" | "some")

sent=["I", "like", "my", "school"]
rdp = nltk.RecursiveDescentParser(gram)
for tree in rdp.parse(sent):
    print(tree)

(S (NP (PRP I)) (VP (V like) (NP (Det my) (N school))))
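
To see the backtracking without the GUI app, the following pure-Python sketch mimics the idea on a subset of the grammar above. It is not NLTK's implementation: the dict representation, the tuple-based trees and all function names are assumptions made for this illustration.

```python
# Subset of the grammar above: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"], ["V", "PP"]],
    "PP": [["P", "NP"]],
    "NP": [["Det", "N"], ["Det", "N", "PP"], ["PRP"]],
    "V": [["like"], ["go"]],
    "Det": [["my"], ["the"]],
    "N": [["school"], ["cat"]],
    "P": [["in"], ["to"]],
    "PRP": [["I"]],
}

def parse_symbol(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` matches tokens[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each production in order (top-down)
        for children, end in parse_sequence(rhs, tokens, pos):
            yield (symbol, *children), end

def parse_sequence(symbols, tokens, pos):
    """Yield (children, next_pos) for matching all `symbols` left to right."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse_symbol(symbols[0], tokens, pos):
        # When this inner loop is exhausted, the outer loop moves on to the
        # next alternative: that is the backtracking step.
        for rest, end in parse_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end

def parse(tokens):
    # Keep only the trees that consume the whole sentence.
    return [tree for tree, end in parse_symbol("S", tokens, 0)
            if end == len(tokens)]

trees = parse(["I", "like", "my", "school"])
print(trees)
```

The sketch first tries NP -> Det N for "I", fails, and falls through to NP -> PRP. A left-recursive production such as VP -> VP NP would make `parse_symbol` call itself forever at the same position, which is why recursive descent cannot handle left recursion.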


You can observe the way it works by using the app provided by nltk:

In [None]:
nltk.app.rdparser() # does not work in colab

#### Shift Reduce Parsing (bottom-up parsing)

Note that this parser does not implement any backtracking, so it is not guaranteed to find a parse for a text, even if one exists. Furthermore, it will only find at most one parse, even if more parses exist. On the other hand, Shift-Reduce parsing can deal with productions displaying left recursion (e.g. VP -> VP NP), unlike Recursive Descent parsing.

In [None]:
srp = nltk.ShiftReduceParser(gram)
sent=["I", "like", "my", "school"]
for tree in srp.parse(sent):
  print(tree)

(S (NP (PRP I)) (VP (V like) (NP (Det my) (N school))))
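
The lack of backtracking can be illustrated with a small greedy shift-reduce sketch in plain Python. It is a simplification and not NLTK's implementation: the lexicon/rule split, the tuple-based trees and the function names are assumptions made for this example.

```python
# Word categories and phrase rules, adapted from the grammar above.
LEXICON = {"I": "PRP", "like": "V", "my": "Det", "the": "Det",
           "school": "N", "cat": "N", "in": "P"}
RULES = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")), ("VP", ("V", "NP", "PP")),
    ("PP", ("P", "NP")),
    ("NP", ("Det", "N")), ("NP", ("Det", "N", "PP")), ("NP", ("PRP",)),
]

def reduce_once(stack):
    """Apply the first rule whose right-hand side matches the stack top."""
    for lhs, rhs in RULES:
        n = len(rhs)
        if len(stack) >= n and tuple(t[0] for t in stack[-n:]) == rhs:
            stack[-n:] = [(lhs, *stack[-n:])]
            return True
    return False

def shift_reduce(tokens):
    stack = []
    for word in tokens:
        stack.append((LEXICON[word], word))  # shift, with lexical lookup
        while reduce_once(stack):            # reduce greedily, never undo
            pass
    # Success only if everything collapsed into a single S tree.
    if len(stack) == 1 and stack[0][0] == "S":
        return stack[0]
    return None

ok = shift_reduce(["I", "like", "my", "school"])
stuck = shift_reduce(["I", "like", "the", "cat", "in", "my", "school"])
print(ok)
print(stuck)
```

The second sentence has a valid parse under these rules (for example with VP -> V NP PP), but the greedy strategy reduces "I like the cat" to S too early and ends with S and PP on the stack, so it returns `None`.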


In [None]:
nltk.app.srparser()

## Exercises

1. Choose a Wikipedia article. You will download and access the article using this Python module: [wikipedia](https://pypi.org/project/wikipedia/). Use the property *content* to extract the text. Print the title of the chosen article and the first N=200 words from the article to verify that all works well. Print the POS-tagging for the first N=20 sentences. You can use nltk's word_tokenize.

2. Create a function that receives a part of speech tag and returns a list with all the words from the text (can be given as a parameter too) that represent that part of speech. Create a function that receives a list of POS tags and returns a list with words having any of the given POS tags (use the first function in implementing the second one).

3. Use the function above to print all the nouns (note that there are multiple tags for nouns), and, respectively all the verbs (corresponding to all verb tags). Also, print the percentage of content words (noun+verbs) from the entire text.

4. Print a table of four columns. The columns will be separated with the character "|". The head of the table will be: **Original word | POS | Simple lemmatization | Lemmatization with POS**. The table will compare the results of lemmatization (WordNetLemmatizer) without giving the part of speech and the lemmatization with the given part of speech for each word. The table must contain only words that give different results for the two lemmatizations (for example, the word "running": without POS, the result will always be "running", but with pos="v" it is "run"). The table will contain the results for the first N sentences from the text (each row corresponding to a word). Try to print only distinct results inside the table (for example, if a word has two occurrences inside the text and matches the requirements for appearing in the table, it should have only one corresponding row).

5. Print a graphic showing the number of words for each part of speech. If there are too many different parts of speech, you can print only those with a higher number of corresponding words.

6. Create your own grammar with different productions and terminal symbols. Apply recursive descent parsing on a sentence with at least 5 different parts of speech and a tree of at least level 4.

7. Apply shift reduce parsing on the same sentence and check programmatically if the two trees are equal. Find a sentence with equal trees and a sentence with different results (we consider the trees different even when one of the parsers finds no solution but the other does).

# Each exercise is done in a separate cell

In [None]:
import os
import wikipedia as wiki
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag.stanford import StanfordPOSTagger
java_path = "C:\\Program Files\\Java\\jdk-19\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path

path_to_model = "/Users/raul/Downloads/stanford-postagger-full-2020-11-17/models/english-bidirectional-distsim.tagger"
path_to_jar = "/Users/raul/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger-4.2.0.jar"
tagger = StanfordPOSTagger(path_to_model, path_to_jar)

try:
    article = wiki.page("bbrain")
    content = article.content
    # Exercise 1 also asks for the title of the chosen article.
    print(f"Title: {article.title}\n")
    words = word_tokenize(content)
    first_200_words = " ".join(words[:200])
    print(f"First 200 words of the article:\n{first_200_words}\n")
    sentences = sent_tokenize(content)
    first_20_sentences = sentences[:20]

    print("Tag the first 20 sentences:\n")
    for i, sent in enumerate(first_20_sentences, 1):
        tokens = word_tokenize(sent)
        tagged_tokens = tagger.tag(tokens)
        print(f"Sentence {i}: {tagged_tokens}\n")

except wiki.exceptions.PageError:
    print("Error: Wikipedia page bbrain not found.")
except wiki.exceptions.DisambiguationError as e:
    print(f"Disambiguation error: {e.options}")

In [None]:
def get_words_by_pos(text, pos_tag_filter):
    words = word_tokenize(text)  
    tagged_words = tagger.tag(words)  
    return [word for word, tag in tagged_words if tag == pos_tag_filter]

def get_words_by_pos_list(text, pos_tags_filter):
    words = []
    for pos_tag_filter in pos_tags_filter:
        words.extend(get_words_by_pos(text, pos_tag_filter))
    return words

get_words_by_pos("The brain is an organ that serves as the center of the nervous system in all vertebrate and most invertebrate animals", "NN")
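
Note that `get_words_by_pos_list` runs the tagger once per requested tag, so the same text is tagged again for every tag. As a possible alternative (a sketch only; `filter_tagged` is a hypothetical helper, not part of the exercise statement), the text can be tagged once and the resulting pairs filtered against a set of tags:

```python
def filter_tagged(tagged_words, wanted_tags):
    # `tagged_words` is a list of (word, tag) pairs, like tagger.tag(...) returns.
    wanted = set(wanted_tags)
    return [word for word, tag in tagged_words if tag in wanted]

# Tagged output copied from the Stanford tagger example earlier in this notebook.
tagged = [('I', 'PRP'), ('saw', 'VBD'), ('my', 'PRP$'), ('cat', 'NN'),
          ('playing', 'VBG'), ('with', 'IN'), ('a', 'DT'), ('dog', 'NN'),
          ('.', '.')]

example_nouns = filter_tagged(tagged, ["NN", "NNS", "NNP", "NNPS"])
example_verbs = filter_tagged(tagged, ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"])
print(example_nouns, example_verbs)
```

Unlike the per-tag version, this keeps the words in text order instead of grouping them by tag.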

In [None]:
noun_tags = ["NN", "NNS", "NNP", "NNPS"] 
verb_tags = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]


try:
    article = wiki.page("bbrain")
    content = article.content
    sentences = sent_tokenize(content) 
    first_20_sentences = " ".join(sentences[:20])
    nouns = get_words_by_pos_list(first_20_sentences, noun_tags)
    verbs = get_words_by_pos_list(first_20_sentences, verb_tags)
    # Exercise 3 asks for the nouns and verbs themselves, not only the percentage.
    print(f"Nouns: {nouns}")
    print(f"Verbs: {verbs}")
    content_word_count = len(nouns) + len(verbs)
    words = word_tokenize(first_20_sentences)
    total_words = len(words) if words else 1
    percentage = (content_word_count / total_words) * 100
    print(f"Percentage of content words (nouns + verbs): {percentage:.2f}%")
except wiki.exceptions.PageError:
    print(f"Error: Wikipedia page bbrain not found.")
except wiki.exceptions.DisambiguationError as e:
    print(f"Disambiguation error: {e.options}")

In [None]:
import wikipedia as wiki
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag.stanford import StanfordPOSTagger
from nltk.stem import WordNetLemmatizer
import os 
from nltk.corpus import wordnet

java_path = "C:\\Program Files\\Java\\jdk-19\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path

path_to_model = "/Users/raul/Downloads/stanford-postagger-full-2020-11-17/models/english-bidirectional-distsim.tagger"
path_to_jar = "/Users/raul/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger-4.2.0.jar"
tagger = StanfordPOSTagger(path_to_model, path_to_jar)

print(f"{'Original Word':<{15}} | {'POS':<{5}} | {'Simple Lemmatization':<{20}} | {'Lemmatization with POS':<{15}}")
print(f"-" * 71)

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
try:
    article = wiki.page("bbrain")
    content = article.content
    sentences = sent_tokenize(content) 
    first_20_sentences = " ".join(sentences[:20])
    # Use the first 20 sentences computed above, as the exercise asks.
    words = list(map(str.lower, word_tokenize(first_20_sentences)))
    taggs = tagger.tag(words)
    lem = WordNetLemmatizer()
    seen = set()
    for i, word in enumerate(words):
        tag = taggs[i][1]
        simple = lem.lemmatize(word)
        with_pos = lem.lemmatize(word, pos=get_wordnet_pos(tag))
        # Print only distinct words whose two lemmatizations differ.
        if simple != with_pos and word not in seen:
            print(f"{word:<15} | {tag:<5} | {simple:<20} | {with_pos:<15}")
            seen.add(word)

except wiki.exceptions.PageError:
    print(f"Error: Wikipedia page bbrain not found.")
except wiki.exceptions.DisambiguationError as e:
    print(f"Disambiguation error: {e.options}")

In [None]:
from collections import Counter
import matplotlib.pyplot as plt
counter = Counter(second for _, second in taggs)
counter = {k: v for k, v in counter.items() if v >= 2}

labels, counts = zip(*counter.items())

plt.bar(labels, counts, color='red')
plt.xlabel("POS tag")
plt.ylabel("Number of words")
plt.show()
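
If the chart gets crowded, one option is to keep only the most frequent tags instead of filtering by a fixed minimum count. A minimal sketch using `Counter.most_common`, shown on the small tagged sentence from earlier in this notebook rather than on the article (the variable names are assumptions made for this example):

```python
from collections import Counter

# Tagged output copied from the Stanford tagger example above.
tagged = [('I', 'PRP'), ('saw', 'VBD'), ('my', 'PRP$'), ('cat', 'NN'),
          ('playing', 'VBG'), ('with', 'IN'), ('a', 'DT'), ('dog', 'NN'),
          ('.', '.')]

tag_counts = Counter(tag for _, tag in tagged)
top_tags = tag_counts.most_common(3)  # list of (tag, count), highest count first
print(top_tags)
```

The labels and counts in `top_tags` can then be unpacked with `zip(*top_tags)` and passed to `plt.bar` in the same way as above.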

Exercise 6 - was not done =))