## Lab 5. Exploring Shallow Parsers of Text

In this practicum, we'll delve into shallow text parsing, specifically part-of-speech tagging and chunking. Today, we'll mainly use existing taggers and chunkers.

Complete all exercises and submit under "Lab 5: PoS tagging and Chunking Examples": https://utexas.instructure.com/courses/1382133/assignments/6627267?module_item_id=13596161

## 1. Part of Speech Tagging

Part of speech (POS) tagging in natural language processing (NLP) is the process of assigning grammatical categories or labels (parts of speech) to words in a text corpus. These labels classify words based on their syntactic and grammatical roles within a sentence.

POS tagging is a fundamental task in NLP and serves various purposes in language processing and understanding. Some applications of POS tagging are:

1. **Text Analysis**: POS tagging helps break down a text into its grammatical components, facilitating further analysis.

2. **Information Retrieval**: It aids in retrieving documents or sentences containing specific parts of speech.

3. **Sentiment Analysis**: Identifying adjectives and adverbs helps determine the sentiment or tone of a text.

4. **Machine Translation**: POS tags can guide the translation of words and phrases in different languages.

5. **Speech Recognition**: It assists in converting spoken language into written text by identifying parts of speech.

6. **Search Engine Optimization (SEO)**: Knowing the parts of speech in web content can help optimize it for search engines.

7. **Grammar Checking**: POS tagging can be used in grammar-checking tools to highlight errors or suggest improvements.

8. **Named Entity Recognition (NER)**: It can help identify proper nouns and entities within a text.

9. **Syntax Parsing**: POS tags are crucial for building parse trees and understanding the syntactic structure of sentences.

10. **Question Answering**: Identifying nouns and verbs in a question can help find relevant answers in a text corpus.

11. **Text Summarization**: POS tagging assists in summarizing texts by identifying important content words.

... and many more

Let's explore some existing POS taggers. (Note: you may need to install libraries to use certain taggers.)


### 1.1. PoS Tagging with Natural Language Toolkit Library (NLTK)

NLTK provides a range of taggers, along with documentation for them. Please refer to https://www.nltk.org/book/ch05.html for more details.

**Always remember to tokenize before applying tagging / parsing**. Stemming and lemmatization are not required (why?)
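
To see why tokenization comes first, compare plain whitespace splitting with a tokenizer. The sketch below uses a simple regular expression as a crude stand-in for `word_tokenize` (an assumption for illustration only; NLTK's tokenizer handles many more cases, such as contractions and URLs).

```python
import re

sentence = "The conference, which was held in New York, had over 500 attendees (including students)."

# Whitespace splitting leaves punctuation glued to words ("York,", "(including"),
# so a tagger would see word forms it cannot look up reliably.
whitespace_tokens = sentence.split()

# Crude regex tokenizer (illustrative assumption, NOT what NLTK does internally):
# runs of word characters, or any single non-space punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `York,` and `students).` each arrive at the tagger as a single unknown "word"; after tokenization, the words and punctuation marks can be tagged separately.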


In [1]:
import nltk
from nltk.tokenize import word_tokenize

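def download_nltk_dataset(dataset_name):
  # Download an NLTK dataset only if it is not already available locally.
  # Caveat: nltk.data.find() expects a resource path such as
  # "tokenizers/punkt", so a bare package name like "punkt" is not found
  # and the download branch runs anyway. That is harmless, because
  # nltk.download() skips packages that are already up to date.
  try:
    nltk.data.find(dataset_name)
  except LookupError:
    nltk.download(dataset_name)
    print(f"Downloaded {dataset_name}")
  else:
    print(f"{dataset_name} is already downloaded")

# Usage example:
download_nltk_dataset("punkt")  # tokenizer models used by word_tokenize
download_nltk_dataset("averaged_perceptron_tagger")  # model used by nltk.pos_tag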

some_sentences =[
    "The conference, which was held in New York, had over 500 attendees (including students).",
    "She said, 'I'll meet you at the park... if it doesn't rain.'",
    "The URL for the website is https://www.example.com, and you can reach me at john.doe@email.com."
    ]

for sentence in some_sentences:
  print(sentence)
  words = word_tokenize(sentence)
  print(words)
  tags = nltk.pos_tag(words)
  print(tags)



[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Downloaded punkt
Downloaded averaged_perceptron_tagger
The conference, which was held in New York, had over 500 attendees (including students).
['The', 'conference', ',', 'which', 'was', 'held', 'in', 'New', 'York', ',', 'had', 'over', '500', 'attendees', '(', 'including', 'students', ')', '.']
[('The', 'DT'), ('conference', 'NN'), (',', ','), ('which', 'WDT'), ('was', 'VBD'), ('held', 'VBN'), ('in', 'IN'), ('New', 'NNP'), ('York', 'NNP'), (',', ','), ('had', 'VBD'), ('over', 'IN'), ('500', 'CD'), ('attendees', 'NNS'), ('(', '('), ('including', 'VBG'), ('students', 'NNS'), (')', ')'), ('.', '.')]
She said, 'I'll meet you at the park... if it doesn't rain.'
['She', 'said', ',', "'", 'I', "'ll", 'meet', 'you', 'at', 'the', 'park', '...', 'if', 'it', 'does', "n't", 'rain', '.', "'"]
[('She', 'PRP'), ('said', 'VBD'), (',', ','), ("'", "''"), ('I', 'PRP'), ("'ll", 'MD'), ('meet', 'VB'), ('you', 'PRP'), ('at', 'IN'), ('the', 'DT'), ('park', 'NN'), ('...', ':'), ('if', 'IN'), ('it', 'PRP'), (
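
The tagged output is a list of `(word, tag)` pairs, which makes it easy to post-process. As a small illustration of the NER-style application mentioned earlier, the sketch below merges consecutive proper-noun tags into candidate names. The helper name `proper_noun_spans` is hypothetical, and the input pairs are copied from the first sentence's output above.

```python
def proper_noun_spans(tagged_tokens):
    """Join runs of consecutive NNP/NNPS tokens into multi-word names."""
    spans, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            # A non-proper-noun token closes the current run.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# (word, tag) pairs taken from the nltk.pos_tag output shown above.
tagged = [("The", "DT"), ("conference", "NN"), (",", ","), ("which", "WDT"),
          ("was", "VBD"), ("held", "VBN"), ("in", "IN"), ("New", "NNP"),
          ("York", "NNP"), (",", ","), ("had", "VBD")]

print(proper_noun_spans(tagged))
```

This is only a heuristic: it finds names that the tagger marked as proper nouns, and it cannot tell a person from a place.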

### 1.2. Print the default POS tagset: The Penn Treebank Tagset



In [2]:
import nltk

# Download the Penn Treebank POS tagset (if not already downloaded)
nltk.download('tagsets')

# Print the list of Penn Treebank POS tags
print("Penn Treebank POS Tags:")
nltk.help.upenn_tagset()

Penn Treebank POS Tags:
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corpori

[nltk_data] Downloading package tagsets to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


### 1.3. Using spaCy Library's POS taggers

spaCy's default pipelines are neural models. The small English pipeline used below, `en_core_web_sm`, relies on a convolutional neural network (CNN) encoder; transformer-based pipelines are also available.

spaCy provides both coarse (high-level, Universal POS) tags via `token.pos_` and fine (Penn Treebank style for English) tags via `token.tag_`.
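
To make the coarse/fine distinction concrete, here is a minimal sketch that collapses Penn Treebank tags into coarse labels by prefix. The mapping table and the `coarse_tag` helper are hand-written assumptions for illustration, not spaCy's actual mapping; spaCy's tagger also uses context that a prefix lookup cannot see.

```python
# Simplified, hand-written prefix table (an assumption, not spaCy's mapping).
# Order matters: "NNP" must be checked before the shorter prefix "NN".
COARSE_BY_PREFIX = [
    ("NNP", "PROPN"),
    ("NN", "NOUN"),
    ("VB", "VERB"),
    ("JJ", "ADJ"),
    ("RB", "ADV"),
    ("PRP", "PRON"),
    ("DT", "DET"),
    ("IN", "ADP"),
    ("CD", "NUM"),
    ("MD", "AUX"),
]

def coarse_tag(fine_tag):
    """Map a fine-grained tag to a coarse label using the prefix table."""
    for prefix, coarse in COARSE_BY_PREFIX:
        if fine_tag.startswith(prefix):
            return coarse
    # Tags that do not start with a letter (",", ".", "-LRB-") are punctuation here.
    if not fine_tag[:1].isalpha():
        return "PUNCT"
    return "X"  # anything the table does not cover

fine_tags = ['DT', 'NN', ',', 'WDT', 'VBD', 'VBN', 'IN', 'NNP', 'NNP', ',',
             'VBD', 'IN', 'CD', 'NNS', '-LRB-', 'VBG', 'NNS', '-RRB-', '.']
print([coarse_tag(t) for t in fine_tags])
```

Comparing this with the coarse tags spaCy prints for the first sentence below shows where a lookup falls short: spaCy labels "which" as `PRON` and the auxiliary "was" as `AUX`, while the table yields `X` and `VERB`.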

In [3]:
import spacy

# Load a pre-trained spaCy pipeline (install it first with
# `python -m spacy download en_core_web_sm`). The pipeline performs
# tokenization, tagging and parsing.
nlp = spacy.load("en_core_web_sm")

some_sentences =[
    "The conference, which was held in New York, had over 500 attendees (including students).",
    "She said, 'I'll meet you at the park... if it doesn't rain.'",
    "The URL for the website is https://www.example.com, and you can reach me at john.doe@email.com."
    ]

for sent in some_sentences:
  print("****************************************")
  print(sent)
  processed_sent = nlp(sent)
  # See tokenization results
  print("Tokens:")
  tokens = [token.text for token in processed_sent]
  print(tokens)

  # Coarse tagging
  print("Coarse tags:")
  tags = [token.pos_ for token in processed_sent]
  print(tags)

  # Fine tagging
  print("Fine tags:")
  tags = [token.tag_ for token in processed_sent]
  print(tags)

****************************************
The conference, which was held in New York, had over 500 attendees (including students).
Tokens:
['The', 'conference', ',', 'which', 'was', 'held', 'in', 'New', 'York', ',', 'had', 'over', '500', 'attendees', '(', 'including', 'students', ')', '.']
Coarse tags:
['DET', 'NOUN', 'PUNCT', 'PRON', 'AUX', 'VERB', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'VERB', 'ADP', 'NUM', 'NOUN', 'PUNCT', 'VERB', 'NOUN', 'PUNCT', 'PUNCT']
Fine tags:
['DT', 'NN', ',', 'WDT', 'VBD', 'VBN', 'IN', 'NNP', 'NNP', ',', 'VBD', 'IN', 'CD', 'NNS', '-LRB-', 'VBG', 'NNS', '-RRB-', '.']
****************************************
She said, 'I'll meet you at the park... if it doesn't rain.'
Tokens:
['She', 'said', ',', "'", 'I', "'ll", 'meet', 'you', 'at', 'the', 'park', '...', 'if', 'it', 'does', "n't", 'rain', '.', "'"]
Coarse tags:
['PRON', 'VERB', 'PUNCT', 'PUNCT', 'PRON', 'AUX', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN', 'PUNCT', 'SCONJ', 'PRON', 'AUX', 'PART', 'VERB', 'PUNCT', 'PUNCT']


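### 1.4. Multilingual PoS Tagging using spaCy
Let's download a model for another language first. From the command line, run `python -m spacy download de_core_news_sm` to get the German pipeline used below. (spaCy's multi-language pipeline `xx_ent_wiki_sm` is trained for named entity recognition and does not include a POS tagger.)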



In [4]:
!python -m spacy download de_core_news_sm

Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


Let's tag some GERMAN text now.

In [5]:
import spacy

# Load the German model
nlp = spacy.load("de_core_news_sm")

# Define a sentence in a different language (here, German)
sentence = "Dies ist ein Beispiel für PoS-Tagging in mehreren Sprachen."

# Process the sentence with the German model
doc = nlp(sentence)

# Extract PoS tags
tags = [(token.text, token.tag_) for token in doc]
print(tags)

[('Dies', 'PDS'), ('ist', 'VAFIN'), ('ein', 'ART'), ('Beispiel', 'NN'), ('für', 'APPR'), ('PoS-Tagging', 'NE'), ('in', 'APPR'), ('mehreren', 'PIAT'), ('Sprachen', 'NN'), ('.', '$.')]


## Exercise E1. Analyze Spanish Wikipedia Data

1. Copy and paste the first paragraph of https://es.wikipedia.org/wiki/Pen%C3%A9lope_Cruz (the Spanish Wikipedia page for the popular actress Penélope Cruz).
2. Perform part-of-speech tagging using spaCy's PoS tagger.
[Hint: you will have to use spaCy's `es_core_news_sm` model, which has been developed for Spanish.]
3. Calculate the frequency or percentage of each POS tag category in the text and print the results.

In [6]:
!python -m spacy download es_core_news_sm

Collecting es-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.7.0/es_core_news_sm-3.7.0-py3-none-any.whl (12.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [7]:
span_sentences = ["Penélope Cruz Sánchez (Alcobendas, 28 de abril de 1974) les una actriz y modelo española.", 
                  "En 2006 fue la primera actriz española candidata a los Premios Óscar y a los Globos de Oro en la categoría de mejor actriz protagonista, por su papel en la película española Volver, dirigida por el cineasta español Pedro Almodóvar;",
                  "en esa ocasión no obtuvo el Óscar, pero en 2008 se convirtió en la primera actriz española en conseguir el Óscar como mejor actriz de reparto gracias a la película Vicky Cristina Barcelona dirigida por Woody Allen."
                  ]

s_nlp = spacy.load("es_core_news_sm")

for sent in span_sentences:
  print("****************************************")
  print(sent)
  processed_sent = s_nlp(sent)
  # See tokenization results
  print("Tokens:")
  tokens = [token.text for token in processed_sent]
  print(tokens)

  # Coarse tagging
  print("Coarse tags:")
  tags = [token.pos_ for token in processed_sent]
  print(tags)

  # Fine tagging
  print("Fine tags:")
  tags = [token.tag_ for token in processed_sent]
  print(tags)

****************************************
Penélope Cruz Sánchez (Alcobendas, 28 de abril de 1974) les una actriz y modelo española.
Tokens:
['Penélope', 'Cruz', 'Sánchez', '(', 'Alcobendas', ',', '28', 'de', 'abril', 'de', '1974', ')', 'les', 'una', 'actriz', 'y', 'modelo', 'española', '.']
Coarse tags:
['PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'NUM', 'ADP', 'NOUN', 'ADP', 'NUM', 'PUNCT', 'PRON', 'DET', 'NOUN', 'CCONJ', 'NOUN', 'ADJ', 'PUNCT']
Fine tags:
['PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'NUM', 'ADP', 'NOUN', 'ADP', 'NUM', 'PUNCT', 'PRON', 'DET', 'NOUN', 'CCONJ', 'NOUN', 'ADJ', 'PUNCT']
****************************************
En 2006 fue la primera actriz española candidata a los Premios Óscar y a los Globos de Oro en la categoría de mejor actriz protagonista, por su papel en la película española Volver, dirigida por el cineasta español Pedro Almodóvar;
Tokens:
['En', '2006', 'fue', 'la', 'primera', 'actriz', 'española', 'candidata', 'a', 'los', 'Premi
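
Step 3 of the exercise asks for the frequency or percentage of each tag. One possible sketch, using only the standard library, counts tags with `collections.Counter`. The helper name `pos_distribution` is hypothetical, and the sample list is the coarse-tag output of the first sentence above; for the full paragraph you would pass `[token.pos_ for token in s_nlp(text)]` instead.

```python
from collections import Counter

def pos_distribution(tags):
    """Return {tag: (count, percentage)} ordered from most to least frequent."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (count, 100.0 * count / total)
            for tag, count in counts.most_common()}

# Coarse tags of the first Spanish sentence, copied from the output above.
coarse_tags = ['PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'NUM',
               'ADP', 'NOUN', 'ADP', 'NUM', 'PUNCT', 'PRON', 'DET', 'NOUN',
               'CCONJ', 'NOUN', 'ADJ', 'PUNCT']

for tag, (count, pct) in pos_distribution(coarse_tags).items():
    print(f"{tag:<6} {count:>3} {pct:5.1f}%")
```

Whether punctuation should count toward the total is a design choice; filtering out `PUNCT` before counting gives percentages over words only.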

## 2. Shallow Parsing: Chunking

In NLP, chunking is the process of extracting meaningful phrases (chunks) from a sentence or text. These phrases are typically noun phrases (NP), verb phrases (VP), or other grammatical units that convey useful information about the text.

Applications of chunking include (but are not limited to):

1. **Information Extraction:** Chunking is often used in information extraction tasks to identify and extract relevant information from unstructured text data. For example, in news articles, chunking can be used to extract names of people, organizations, locations, and other key entities.

2. **Keyphrase Extraction:** Keyphrase extraction via chunking can be used to identify and extract the most important phrases or sentences from a document, aiding in the creation of concise and informative document summaries.

3. **Search Engine Optimization (SEO):** For SEO purposes, chunking can help identify and extract important keywords and phrases from web content. These chunks can be used for keyword analysis and optimization.

4. **Content Extraction:** In web scraping and content extraction applications, chunking can be used to locate and extract specific pieces of information from HTML documents or other structured text formats.

5. **Academic Research and Literature Analysis:** In academic research, keyphrase extraction can help researchers quickly identify the main topics and contributions of research papers.

6. **Legal Document Analysis:** In legal documents, chunking and keyphrase extraction can help identify relevant sections or clauses of interest.

7. **Patent Analysis and Search:** Keyphrase extraction is commonly used in patent analysis to identify the essential elements of a patent application.

### 2.1. Rule-Based Chunking with the Help of POS Tags

In [8]:
import nltk
from nltk.chunk import RegexpParser
from nltk.corpus import stopwords

nltk.download('stopwords')


# Sample text
text = "Natural language processing (NLP) is a subfield of artificial intelligence (AI)."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)

# Define a chunking grammar for noun phrases (NP)
grammar = r"NP: {<DT>?<JJ>*<NN>}"

# Create a chunk parser with the grammar
chunk_parser = RegexpParser(grammar)

# Apply chunking to the part-of-speech tagged text
tree = chunk_parser.parse(pos_tags)

# Extract noun phrases
noun_phrases = [" ".join(leaf[0] for leaf in subtree.leaves()) for subtree in tree.subtrees() if subtree.label() == 'NP']

# Drop any extracted phrase that consists solely of a stopword
stop_words = set(stopwords.words("english"))
filtered_noun_phrases = [phrase for phrase in noun_phrases if phrase.lower() not in stop_words]

print(filtered_noun_phrases)

['Natural language', 'processing', 'a subfield', 'artificial intelligence']


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
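
To clarify what the grammar `NP: {<DT>?<JJ>*<NN>}` does, the sketch below re-implements that single rule in plain Python: an optional determiner, any number of adjectives, then exactly one `NN`. The function name `chunk_noun_phrases` is hypothetical, and the tags in `tagged` are hand-assigned assumptions chosen to be consistent with the output above, not a recorded `nltk.pos_tag` result. `RegexpParser` itself is more general than this.

```python
def chunk_noun_phrases(tagged):
    """Left-to-right matcher for the tag pattern <DT>?<JJ>*<NN>."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":      # exactly one singular noun
            phrases.append(" ".join(word for word, _ in tagged[i:j + 1]))
            i = j + 1                           # continue after the chunk
        else:
            i += 1                              # no chunk starts here
    return phrases

# Hand-assigned tags (assumption) for the sample sentence used above.
tagged = [("Natural", "JJ"), ("language", "NN"), ("processing", "NN"),
          ("(", "("), ("NLP", "NNP"), (")", ")"), ("is", "VBZ"),
          ("a", "DT"), ("subfield", "NN"), ("of", "IN"),
          ("artificial", "JJ"), ("intelligence", "NN"),
          ("(", "("), ("AI", "NNP"), (")", ")"), (".", ".")]

print(chunk_noun_phrases(tagged))
```

Because the rule ends in a single `<NN>`, "Natural language" and "processing" come out as two separate chunks, and the `NNP` tokens "NLP" and "AI" are skipped. Changing the last element to `<NN.*>+` in the NLTK grammar is one way to capture longer noun sequences.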


### 2.2. Chunking using pre-trained models via spaCy



In [9]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample sentence
sentence = "John and Mary are eating pizza in the restaurant"

# Process the sentence using spaCy
doc = nlp(sentence)

# Iterate over the noun chunks (base noun phrases) found by the pipeline
for chunk in doc.noun_chunks:
    print(chunk.text)


John
Mary
pizza
the restaurant


## Exercise E2: Verb phrase extraction
A verb phrase is a syntactic unit that consists of one or more verbs and any accompanying elements, such as objects, complements, adverbs, or prepositional phrases.

Here's the general structure of a verb phrase:

`Verb Phrase = Verb + (Object/Complement/Adverb/Prepositional Phrase)`

Examples:

1. "I ate rice" -> "ate rice"
2. "I ate rice with a fork" -> "ate rice with a fork"
3. "I ate rice with salt and pepper" -> "ate rice with salt and pepper"
4. "I ate rice and slept" -> "ate rice"

For the list of sentences below, identify the verb phrases using the rule-based approach from Section 2.1.

```
list_of_sents = [
  "John lives in New York.",
  "J.K. Rowling wrote the book Harry Potter.",
  "Tom Hanks acted in Forrest Gump."
  ]
```

Expected outputs:
1. `lives in New York.`
2. `wrote the book Harry Potter.`
3. `acted in Forrest Gump.`



In [17]:
# Exercise 2

list_of_sents = [
  "John lives in New York.",
  "J.K. Rowling wrote the book Harry Potter.",
  "Tom Hanks acted in Forrest Gump."
  ]

def e2(sentence):
  # Relies on nltk and RegexpParser imported in the Section 2.1 cell.
  # Tokenize the text
  tokens = nltk.word_tokenize(sentence)
  # Part-of-speech tagging
  pos_tags = nltk.pos_tag(tokens)
  # Define a chunking grammar for verb phrases (VP):
  #   rule 1: verb + optional determiner + adjectives + one or more nouns
  #   rule 2: verb + preposition + optional determiner + adjectives + nouns
  grammar = r"""
    VP: {<VB.*><DT>?<JJ>*<NN.*>+}
        {<VB.*><IN><DT>?<JJ>*<NN.*>+}
"""
  # Create a chunk parser with the grammar
  chunk_parser = RegexpParser(grammar)
  # Apply chunking to the part-of-speech tagged text
  tree = chunk_parser.parse(pos_tags)
  # Extract verb phrases
  verb_phrases = [" ".join(leaf[0] for leaf in subtree.leaves()) for subtree in tree.subtrees() if subtree.label() == 'VP']
  # The stopword filter and the spaCy noun-chunk loop carried over from the
  # earlier cells were dropped: a multi-word VP never equals a single
  # stopword, and doc.noun_chunks only yields Span objects, so the loop's
  # `not isinstance(chunk, Span)` branch never printed anything.
  print(verb_phrases)

for sent in list_of_sents:
   e2(sent)

['lives in New York']
['wrote the book Harry Potter']
['acted in Forrest Gump']
