## Lab6_Parsing_and_Named_Entity_Identification

In this practicum, we'll delve into text parsing, specifically dependency parsing and named entity identification.

Complete all exercises and submit under "Lab 6: Parsing_and_Named_Entity_Identification": https://utexas.instructure.com/courses/1382133/assignments/6627276

# 1. Dependency Parsing Example using spaCy


In [1]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

# Example sentence
sentence = "Dependency parsing helps in understanding sentence structure."

# Process the sentence
doc = nlp(sentence)

# Print the dependency parse tree
for token in doc:
  print(f"{token.text} --({token.dep_})--> {token.head.text}")


Dependency --(compound)--> parsing
parsing --(nsubj)--> helps
helps --(ROOT)--> helps
in --(prep)--> helps
understanding --(pcomp)--> in
sentence --(compound)--> structure
structure --(dobj)--> understanding
. --(punct)--> helps
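
The arcs printed above can also be read as a tree rooted at `helps`. As a minimal sketch, the same structure can be rebuilt and indented without running the parser. The `arcs` list is copied by hand from the output above, and `children_of` and `render` are illustrative helper names, not spaCy APIs. Keying on the token text only works here because no word repeats in this sentence; real code would key on token positions.

```python
from collections import defaultdict

# (token, dependency label, head) copied from the parser output above
arcs = [
    ("Dependency", "compound", "parsing"),
    ("parsing", "nsubj", "helps"),
    ("helps", "ROOT", "helps"),
    ("in", "prep", "helps"),
    ("understanding", "pcomp", "in"),
    ("sentence", "compound", "structure"),
    ("structure", "dobj", "understanding"),
    (".", "punct", "helps"),
]

def children_of(arcs):
    """Map each head to its dependents; the ROOT arc points at itself and is skipped."""
    children = defaultdict(list)
    for token, dep, head in arcs:
        if dep != "ROOT":
            children[head].append((token, dep))
    return children

def render(head, children, depth=0):
    """Return indented lines for the subtree under `head`, depth-first."""
    lines = []
    for token, dep in children.get(head, []):
        lines.append("  " * depth + f"{token} ({dep})")
        lines.extend(render(token, children, depth + 1))
    return lines

root = next(token for token, dep, _ in arcs if dep == "ROOT")
tree_lines = [f"{root} (ROOT)"] + render(root, children_of(arcs), depth=1)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one step away from the root, so `sentence` ends up four levels below `helps`.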


### 1.1. Explaining Dependency Parse Labels


In [2]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Function to print dependency relation definitions
def print_dependency_definitions():
  # Iterate over all dependency labels and print their definitions
  for label in sorted(nlp.get_pipe("parser").labels):
    print(f"{label}: {spacy.explain(label)}")

# Print dependency relation definitions
print_dependency_definitions()


ROOT: root
acl: clausal modifier of noun (adjectival clause)
acomp: adjectival complement
advcl: adverbial clause modifier
advmod: adverbial modifier
agent: agent
amod: adjectival modifier
appos: appositional modifier
attr: attribute
aux: auxiliary
auxpass: auxiliary (passive)
case: case marking
cc: coordinating conjunction
ccomp: clausal complement
compound: compound
conj: conjunct
csubj: clausal subject
csubjpass: clausal subject (passive)
dative: dative
dep: unclassified dependent
det: determiner
dobj: direct object
expl: expletive
intj: interjection
mark: marker
meta: meta modifier
neg: negation modifier
nmod: modifier of nominal
npadvmod: noun phrase as adverbial modifier
nsubj: nominal subject
nsubjpass: nominal subject (passive)
nummod: numeric modifier
oprd: object predicate
parataxis: parataxis
pcomp: complement of preposition
pobj: object of preposition
poss: possession modifier
preconj: pre-correlative conjunction
predet: None
prep: prepositional modifier
prt: particle
punct



# 2. Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing (NLP) technique that involves identifying and classifying named entities (such as names of people, organizations, locations, dates, and more).

A common named entity tag set used in Named Entity Recognition (NER) includes the following categories:

1. PERSON: Names of individuals, including first and last names.
2. ORGANIZATION: Names of companies, institutions, and organizations.
3. LOCATION: Names of places, such as cities, countries, and geographical locations.
4. DATE: Absolute or relative date expressions, including specific dates, months, years, and periods.
5. TIME: Specific times and time-related expressions.
6. PERCENT: Percentage values.
7. MONEY: Monetary values and currency references.
8. QUANTITY: Measurements or quantities with units, such as distances and weights.
9. ORDINAL: Ordinal numbers (e.g., first, second, third).
10. CARDINAL: Cardinal numbers (e.g., one, two, three).
11. EVENT: Names of events, meetings, and occurrences.
12. ARTIFACT: Names of products and inventions.
13. WORK_OF_ART: Names of artistic works, such as books, movies, and paintings.
14. LANGUAGE: Names of languages.
15. NORP: Nationalities, religious and political groups.

This is just a standard set of named entity categories, and in practice, NER systems can be customized to include additional categories or adapt to specific domains or languages as needed.
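
The category names in the list above are generic, while each model ships its own label inventory; the spaCy example that follows prints `ORG` and `GPE`, for instance. Below is a minimal sketch of a lookup between the two, assuming the OntoNotes-style labels used by spaCy's English pipelines. The mapping itself and the names `GENERIC_TO_SPACY`, `SPACY_TO_GENERIC`, and `to_generic` are illustrative choices, not part of spaCy.

```python
# Assumed correspondence between the generic names above and spaCy's English labels.
GENERIC_TO_SPACY = {
    "PERSON": ["PERSON"],
    "ORGANIZATION": ["ORG"],
    "LOCATION": ["GPE", "LOC"],  # spaCy separates geopolitical entities from other locations
    "ARTIFACT": ["PRODUCT"],
}

# Invert the table so a model label can be looked up directly.
SPACY_TO_GENERIC = {
    spacy_label: generic
    for generic, spacy_labels in GENERIC_TO_SPACY.items()
    for spacy_label in spacy_labels
}

def to_generic(spacy_label):
    """Return the generic category for a spaCy label; unmapped labels pass through unchanged."""
    return SPACY_TO_GENERIC.get(spacy_label, spacy_label)

for label in ["ORG", "GPE", "DATE"]:
    print(label, "->", to_generic(label))
```

Labels that already share a name in both inventories, such as `DATE`, fall through the lookup unchanged.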

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. is headquartered in Cupertino, California."
doc = nlp(text)

for ent in doc.ents:
  print(ent.text, ent.label_)


Apple Inc. ORG
Cupertino GPE
California GPE


# 3. Putting Everything Together to Extract Knowledge from Raw Text

Knowledge extraction from text refers to the process of automatically identifying and extracting structured information or knowledge from unstructured text sources such as articles, documents, or web pages. This extracted knowledge can then be organized and represented in a structured format that is suitable for analysis, storage, and further processing.

One common way to represent knowledge extracted from text is through triples. Triples are basic units of structured information consisting of three parts: a subject, a predicate, and an object. This representation is often referred to as a "subject-predicate-object" or "entity-relationship-entity" structure.

Here's how triples work:

**Subject:** The entity or concept being described or referenced.

**Predicate:** The relationship or attribute that connects the subject to the object.

**Object:** The value or entity that is related to the subject by the predicate.

For example, consider the following sentence:

`John teaches at UT Austin`

In this sentence, we can extract the following triple:

- Subject: John
- Predicate: teach at
- Object: UT Austin

This can be simply written as `<John, teach at, UT Austin>`

Triples allow for the representation of various types of knowledge, including relationships between entities, attributes of entities, events, facts, and more. They provide a structured way to organize and store information extracted from text, enabling easier analysis, inference, and integration with other data sources. Triple-based representations are commonly used in knowledge graphs, semantic web technologies, and natural language processing applications for tasks such as information retrieval, question answering, and knowledge base construction.
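
A triple maps naturally onto a small immutable record. The following is a minimal sketch, assuming a hypothetical `Triple` type and `match` helper (neither comes from a library), that stores the example above and retrieves triples by any combination of known fields. The second fact is invented purely for illustration.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples whose fields equal every argument that is not None."""
    pattern = (subject, predicate, obj)
    return [
        t for t in triples
        if all(want is None or want == have for want, have in zip(pattern, t))
    ]

knowledge = [
    Triple("John", "teach at", "UT Austin"),
    Triple("Mary", "teach at", "UT Austin"),  # hypothetical extra fact for illustration
]

# "Who teaches at UT Austin?" leaves the subject open.
who_teaches = match(knowledge, predicate="teach at", obj="UT Austin")
print([t.subject for t in who_teaches])
```

Leaving a field as `None` plays the role of the unknown part of a question, which is the same idea the question-answering section relies on later.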


Below are some functions to extract triples using dependency parsing, POS tagging, and named entity recognition.



In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

def extract_triples(sentence):
  doc = nlp(sentence)
  triples = []
  for token in doc:
    # Identify the SUBJECT (a token with an nsubj* label) and its head node in the graph.
    # If the head has the coarse-grained POS category VERB, we treat it as the PREDICATE.
    # A child of the head whose label contains "obj" (e.g. dobj) becomes the OBJECT.
    #print (token.text, token.dep_)
    if token.dep_.startswith("nsubj") and token.head.pos_ == "VERB":
      subject = get_entity_or_noun_phrase(token)
      predicate = token.head.lemma_
      obj = None
      for child in token.head.children:
        if "obj" in child.dep_:
          obj = get_entity_or_noun_phrase(child)
          break
        elif "prep" in child.dep_:
          # we merge the preposition to the predicate
          predicate = predicate+" " +child.text
          obj = " ".join([c.text for c in child.subtree if c.text != child.text])
          break
      #lowercase everything
      if subject is not None:
        subject = subject.lower()
      else:
        subject = "None"
      if predicate is not None:
        predicate = predicate.lower()
      else:
        predicate = "None"
      if obj is not None:
        obj = obj.lower()
      else:
        obj = "None"
      triples.append((subject, predicate, obj))
  return triples

# Function to get entity or noun phrase
def get_entity_or_noun_phrase(token):
  if token.ent_type_:
    return token.text
  else:
    return " ".join([child.text for child in token.subtree])

### 3.1. Processing Documents and Extracting Tuples

For processing documents, we need to extract sentences from documents. Let's work on a hypothetical document:

```
document = "Barack Obama was born in Hawaii. Obama served as the 44th president of the United States."
```

In [5]:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')

document = "Barack Obama was born in Hawaii. Obama served as the 44th president of the United States."

sentences = sent_tokenize(document)
all_triples = []
for sentence in sentences:
  triples = extract_triples(sentence)
  all_triples.extend(triples)

print (all_triples)


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

[('obama', 'bear in', 'hawaii'), ('obama', 'serve as', 'the 44th president of the united states')]


## 4. Using the knowledge extracted to answer questions

We can parse questions in the same way as we parsed sentences in the document. Assuming that questions represent incomplete knowledge tuples, we can take incomplete tuples and search within our knowledge base to find the missing portions.


In [6]:
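def find_answer_from_triples(question_triple, knowledge_triples):
  # Match two slots of the question against a stored triple and return the third.
  # Matching is by substring; if several triples match, the last one wins.
  answer = None
  for knowledge_triple in knowledge_triples:
    if question_triple[0] in knowledge_triple[0] and question_triple[1] in knowledge_triple[1]:
      answer = knowledge_triple[2]  # subject and predicate known -> return object
    elif question_triple[0] in knowledge_triple[0] and question_triple[2] in knowledge_triple[2]:
      answer = knowledge_triple[1]  # subject and object known -> return predicate
    elif question_triple[1] in knowledge_triple[1] and question_triple[2] in knowledge_triple[2]:
      answer = knowledge_triple[0]  # predicate and object known -> return subject
  return answer

def answer_question(question, knowledge_triples):
  question_triples = extract_triples(question)
  if not question_triples:
    # No subject-verb pattern was found, so there is nothing to look up
    print ("Sorry, I don't have an answer to that question.")
    return
  question_triple = question_triples[0] # only one triple from one question
  print ("Question triples", question_triple)
  answer = find_answer_from_triples(question_triple, knowledge_triples)
  if answer is None:
    answer = "Sorry, I don't have an answer to that question."
  print (answer)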

Now let's try to answer some questions.

In [7]:
questions = [
    "Where was Barack Obama born?",
    "Who served as the president of the United States?",
    "What did Barack Obama do?",
    "Who was born in Hawaii?",
]
for question in questions:
  answer_question(question,all_triples)

Question triples ('obama', 'bear', 'None')
hawaii
Question triples ('who', 'serve as', 'the president of the united states')
Sorry, I don't have an answer to that question.
Question triples ('obama', 'do', 'what')
Sorry, I don't have an answer to that question.
Question triples ('who', 'bear in', 'hawaii')
obama


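As the output shows, only two of the four questions are answered. The second fails because `'the president of the united states'` is not a substring of `'the 44th president of the united states'`, and the third because the predicate `'do'` matches no stored predicate. We would probably need some fuzzy matching techniques to answer all questions.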
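
One possible direction, sketched under stated assumptions, is to treat question words and the `"None"` placeholder as wildcards and to compare the remaining slots by token overlap (a simple form of fuzzy matching) instead of substring containment. The helper names `slot_score` and `fuzzy_answer`, the wildcard list, and the 0.75 threshold are illustrative choices rather than part of the lab code. The `knowledge` list is copied from the triples printed in section 3.1.

```python
# Slots that carry no information and should be filled from the knowledge base
WILDCARDS = {"who", "what", "where", "when", "which", "none"}

def slot_score(question_slot, knowledge_slot):
    """Fraction of question tokens that also occur in the knowledge slot (0.0 to 1.0)."""
    q_tokens = question_slot.lower().split()
    k_tokens = set(knowledge_slot.lower().split())
    if not q_tokens:
        return 0.0
    return sum(tok in k_tokens for tok in q_tokens) / len(q_tokens)

def fuzzy_answer(question_triple, knowledge_triples, threshold=0.75):
    """Return the open slot of the best-scoring triple, or None if nothing clears the threshold."""
    known = [i for i, slot in enumerate(question_triple) if slot.lower() not in WILDCARDS]
    unknown = [i for i in range(3) if i not in known]
    if not known or not unknown:
        return None
    best_score, best_answer = 0.0, None
    for triple in knowledge_triples:
        # Average the overlap over the slots the question actually specifies
        score = sum(slot_score(question_triple[i], triple[i]) for i in known) / len(known)
        if score >= threshold and score > best_score:
            best_score, best_answer = score, triple[unknown[0]]
    return best_answer

knowledge = [
    ("obama", "bear in", "hawaii"),
    ("obama", "serve as", "the 44th president of the united states"),
]

print(fuzzy_answer(("who", "serve as", "the president of the united states"), knowledge))
print(fuzzy_answer(("obama", "do", "what"), knowledge))
```

With these settings the "Who served as the president" question resolves to `obama`, because every token of the question's object occurs in the stored object. "What did Barack Obama do?" still returns `None`, since `do` overlaps with neither stored predicate; answering it would need a different rule, not a looser threshold.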

## E1. Exercise: Identifying Entities from News

- Visit a news website of your choice. Copy and paste a few sentences from an article (preferably one containing a lot of names) into a variable `document`.

- Process the text (clean up a bit if required) and put all unique sentences in a list.

- Perform NER on the sentences using spaCy and print the unique entities (persons, locations, organizations, etc.).

- Try repeating the same exercise on a French news website https://www.lefigaro.fr/. You will have to download the `fr_core_news_sm` model.

- Write down your observations.

**Optional Exercise 2 (not graded)**
Can you extract some knowledge tuples from the same English news website by using NER, POS tagging, and/or dependency parsing, following our example code? What kind of amendments did you need to make to the existing rules to extract better tuples?

**Optional Exercise 3 (not graded)**
Can you build a question answering system, following the above examples? Feel free to formulate a variety of questions and check the accuracy of your system (qualitatively).



In [8]:
document = "Google Chrome is getting a new AI writing generator today. At its core, this Gemin-powered tool is essentially the existing \"Help me write\" feature from Gmail, but extended to the entire web and powered by one of Google's latest Gemini AI models. The company first announced this new tool in January and it remains in its 'experimental' phase, meaning you must explicitly enable it. To get started, head to the Chrome settings menu and look for the 'Experimental AI' page. From there, you can easily enable the new writing feature, as well as Google's new automatic tab organizer (which I haven't found particularly useful or smart so far) and the new Chrome theme manager. For now, the AI writer is only available in English on Windows, Mac and Linux. After that, right-click on any text field and select 'Help me write.' You can use this to write something completely now Gemini can also rewrite existing text."
print(document)

Google Chrome is getting a new AI writing generator today. At its core, this Gemin-powered tool is essentially the existing "Help me write" feature from Gmail, but extended to the entire web and powered by one of Google's latest Gemini AI models. The company first announced this new tool in January and it remains in its 'experimental' phase, meaning you must explicitly enable it. To get started, head to the Chrome settings menu and look for the 'Experimental AI' page. From there, you can easily enable the new writing feature, as well as Google's new automatic tab organizer (which I haven't found particularly useful or smart so far) and the new Chrome theme manager. For now, the AI writer is only available in English on Windows, Mac and Linux. After that, right-click on any text field and select 'Help me write.' You can use this to write something completely now Gemini can also rewrite existing text.


In [9]:
# process text
document = document.lower()

list_sentences = document.split(".")
list_sentences

['google chrome is getting a new ai writing generator today',
 ' at its core, this gemin-powered tool is essentially the existing "help me write" feature from gmail, but extended to the entire web and powered by one of google\'s latest gemini ai models',
 " the company first announced this new tool in january and it remains in its 'experimental' phase, meaning you must explicitly enable it",
 " to get started, head to the chrome settings menu and look for the 'experimental ai' page",
 " from there, you can easily enable the new writing feature, as well as google's new automatic tab organizer (which i haven't found particularly useful or smart so far) and the new chrome theme manager",
 ' for now, the ai writer is only available in english on windows, mac and linux',
 " after that, right-click on any text field and select 'help me write",
 "' you can use this to write something completely now gemini can also rewrite existing text",
 '']
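
Two details of the preprocessing above may affect the results. Splitting on `.` leaves an empty trailing string and detaches the closing quote of 'Help me write.', and lowercasing removes capitalization, which statistical NER models often use as a cue. Below is a minimal alternative sketch using only the standard library. The regular expression and the helper name `split_sentences` are assumptions for illustration, not a robust sentence tokenizer: abbreviations such as "Inc." would still be split incorrectly.

```python
import re

# Split on whitespace that follows ., ! or ?, optionally with a closing quote in between.
_BOUNDARY = re.compile(r"(?:(?<=[.!?])|(?<=[.!?]['\"]))\s+")

def split_sentences(text):
    """Return sentences with original casing, without empty pieces,
    and without exact duplicates (first-seen order is preserved)."""
    pieces = (piece.strip() for piece in _BOUNDARY.split(text))
    return list(dict.fromkeys(piece for piece in pieces if piece))

# The last three sentences of the article text used in this exercise
sample = (
    "For now, the AI writer is only available in English on Windows, Mac and Linux. "
    "After that, right-click on any text field and select 'Help me write.' "
    "You can use this to write something completely now Gemini can also rewrite existing text."
)

sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Because the casing is kept, the resulting sentences can be passed to `nlp(...)` unchanged.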

In [10]:
nlp = spacy.load("en_core_web_sm")

for sentence in list_sentences:
  doc = nlp(sentence)

  for ent in doc.ents:
    print(ent.text, ent.label_)

google ORG
today DATE
one CARDINAL
google ORG
january DATE
google ORG
english LANGUAGE
mac PERSON
linux PERSON


In [11]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [12]:
# Repeat the NER exercise on French text
french_document = "En déplacement à Rennes à l’usine de production des peluches des JO, la ministre des Sports a rendu hommage à l’attaquant du PSG, un «héros des dernières années». «Une page est en train de se tourner», a déclaré jeudi au micro de RMC Amélie Oudéa-Castéra, à propos du départ de Kylian Mbappé du PSG. Après un bref mais mouvementé passage au ministère de l’Education nationale, «AOC» est de retour au ministère des Sports à temps plein et se trouvait à Rennes pour visiter l’usine de production des peluches des Jeux olympiques."
nlp = spacy.load("fr_core_news_sm")
list_fr_sentences = french_document.split(".")
for sentence in list_fr_sentences:
  doc = nlp(sentence)
  for ent in doc.ents:
    print(ent.text, ent.label_)

Rennes LOC
l’ LOC
JO MISC
ministre des Sports ORG
l’ LOC
PSG ORG
RMC ORG
Amélie Oudéa-Castéra PER
Kylian Mbappé MISC
PSG ORG
l’Education ORG
ministère des Sports ORG
Rennes LOC
l’ ORG
Jeux olympiques MISC


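Compared with the English output, the French output contains more repeated entities (Rennes and PSG each appear twice, and the fragment `l’` appears three times) and several MISC labels, which the English output does not have. The French model also appears to use a coarser label set: only LOC, MISC, ORG, and PER occur here.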
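
The exercise asks for unique entities, whereas the loops above print every mention. Below is a minimal sketch of the deduplication step, assuming the mentions have already been collected as `(text, label)` pairs, for example with `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `unique_entities` is illustrative, and the `mentions` list is a hand-copied subset of the French output.

```python
from collections import defaultdict

def unique_entities(mentions):
    """Group entity texts by label, keeping each text once in first-seen order."""
    grouped = defaultdict(dict)
    for text, label in mentions:
        grouped[label][text.strip()] = None  # dict keys act as an ordered set
    return {label: list(texts) for label, texts in grouped.items()}

# A few mentions copied from the French output above
mentions = [
    ("Rennes", "LOC"), ("PSG", "ORG"), ("RMC", "ORG"),
    ("Amélie Oudéa-Castéra", "PER"), ("PSG", "ORG"), ("Rennes", "LOC"),
]

unique = unique_entities(mentions)
print(unique)
```

Grouping by label also makes it easier to compare the English and French results side by side.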