# WMIR practice lesson on Spacy

### Objective
Use the Spacy framework to extract relevant information from sentences for enriching the representations.  
Train several models and compare them with and without the enriched representations.

#### Author
Claudiu Daniel Hromei, April 2023.  
hromei@ing.uniroma2.it

# Introduction

[SpaCy](spacy.io) is a free, open-source library for Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.  
Some Features:
- **Tokenization**: Segmenting text into words, punctuations marks etc.
- **Part-of-speech (POS) Tagging**: Assigning word types to tokens, like verb or noun.
- **Dependency Parsing**: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- **Lemmatization**: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
- **Sentence Boundary Detection (SBD)**: Finding and segmenting individual sentences.
- **Named Entity Recognition (NER)**: Labelling named “real-world” objects, like persons, companies or locations.
- **Entity Linking (EL)**: Disambiguating textual entities to unique identifiers in a knowledge base.
- **Similarity**: Comparing words, text spans and documents and how similar they are to each other.
- **Text Classification**: Assigning categories or labels to a whole document, or parts of a document.
- **Rule-based Matching**: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
- **Training**: Updating and improving a statistical model’s predictions.
- **Serialization**: Saving objects to files or byte strings.

# Required Libraries

In [1]:
import pandas as pd

from IPython.display import display, HTML

In [2]:
# option to print all the value of cells in DataFrames
pd.set_option("max_colwidth", None)

### Install spacy and download the english pipeline

In [3]:
# install the spacy module
!pip install spacy

# download the english pipeline here
# 'it_core_news_sm' for italian texts
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 46.7 MB/s eta 0:00:00
[38;5;2m[+] Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
import spacy
from spacy import displacy
import warnings
warnings.filterwarnings("ignore")

# Annotation example

In [5]:
input_string = "In 1982, Mark drove his car from Los Angeles to Las Vegas until 5 of july"
nlp = spacy.load('en_core_web_sm')

In [6]:
def print_annotation(input_string):
    doc = nlp(input_string)
    
    df = pd.DataFrame({
        "id": [],
        "token": [],
        "lemma": [],
        "tag": [],
        "entity": [],
        "dependency": [],
        "head id": []
    })

    for sent in doc.sents:
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0
            else:
                head_idx = doc[i].head.i+1
            if head_idx == i + 1:
                head_idx = 0

            entity_tag = word.ent_type_
            if len(entity_tag) == 0:
                entity_tag = "O"
            
            word_obj = {"id": str(i+1), "token": str(word), "lemma": word.lemma_, "tag": word.tag_, "entity": entity_tag,
                                    "dependency": word.dep_, "head id": str(head_idx)}
            df = df.append(word_obj, ignore_index=True)
    display(df)

In [7]:
print_annotation(input_string) #Entity: O-->Not an entity Head id: how many dependency there are in the graph containing that word (how amny arrows to reach the root)

Unnamed: 0,id,token,lemma,tag,entity,dependency,head id
0,1,In,in,IN,O,prep,5
1,2,1982,1982,CD,DATE,pobj,1
2,3,",",",",",",O,punct,5
3,4,Mark,Mark,NNP,PERSON,nsubj,5
4,5,drove,drive,VBD,O,ROOT,0
5,6,his,his,PRP$,O,poss,7
6,7,car,car,NN,O,dobj,5
7,8,from,from,IN,O,prep,5
8,9,Los,Los,NNP,GPE,compound,10
9,10,Angeles,Angeles,NNP,GPE,pobj,8


In [8]:
def visualize_annotation(input_string, style="dep"):
    doc = nlp(input_string)
    # style can be either "dep" or "ent"
    displacy.render(doc, style=style, jupyter=True, options={"distance": 100}) #distance default 140, visualization metric
    #Jupyter=true --> use no server

In [9]:
visualize_annotation(input_string, style="dep")

In [10]:
visualize_annotation(input_string, style="ent")

In [11]:
visualize_annotation(input_string, style="span")

# Information extraction

Get information about a particular word in a given string.

In [12]:
def get_word_annotation(input_string, word_string):
    doc = nlp(input_string)
    
    words = []
    for sent in doc.sents:
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0
            else:
                head_idx = doc[i].head.i+1
            if head_idx == i + 1:
                head_idx = 0

            entity_tag = word.ent_type_
            if len(entity_tag) == 0:
                entity_tag = "O"
            
            word_obj = {"id": i+1, "token": str(word), "lemma": word.lemma_, "tag": word.tag_, "entity": entity_tag,
                                    "dependency": word.dep_, "head id": head_idx}
            words.append(word_obj)
    
    for word in words:
        if word["token"] == word_string:
            return word
    
    return None

In [13]:
print(get_word_annotation(input_string, "Mark"))
print(get_word_annotation(input_string, "1982"))

{'id': 4, 'token': 'Mark', 'lemma': 'Mark', 'tag': 'NNP', 'entity': 'PERSON', 'dependency': 'nsubj', 'head id': 5}
{'id': 2, 'token': '1982', 'lemma': '1982', 'tag': 'CD', 'entity': 'DATE', 'dependency': 'pobj', 'head id': 1}


### Exercise 1: Search for relations

Define a method that takes in input a sentence (`input_string`) and the name of a relation (`relation_string`), parses with spacy the input and returns the words (the `objects`! not the strings) involved in the relation. If the relation is not present, return an empty array.

```
def search_relation(input_string, relation_string):
    return word_obj_list
```

In [14]:
def get_word_relation(input_string, relation_String):
    doc = nlp(input_string)
    
    words = []
    ret = []
    for sent in doc.sents:
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0
            else:
                head_idx = doc[i].head.i+1
            if head_idx == i + 1:
                head_idx = 0

            entity_tag = word.ent_type_
            if len(entity_tag) == 0:
                entity_tag = "O"
            
            word_obj = {"id": i+1, "token": str(word), "lemma": word.lemma_, "tag": word.tag_, "entity": entity_tag,
                                    "dependency": word.dep_, "head id": head_idx}
            words.append(word_obj)
    
    for word in words:
        if word["dependency"] == relation_String:
            ret.append(word)
    if(ret == []):
        ret = None
    return ret

In [15]:
get_word_relation(input_string,"prep")

[{'id': 1,
  'token': 'In',
  'lemma': 'in',
  'tag': 'IN',
  'entity': 'O',
  'dependency': 'prep',
  'head id': 5},
 {'id': 8,
  'token': 'from',
  'lemma': 'from',
  'tag': 'IN',
  'entity': 'O',
  'dependency': 'prep',
  'head id': 5},
 {'id': 11,
  'token': 'to',
  'lemma': 'to',
  'tag': 'IN',
  'entity': 'O',
  'dependency': 'prep',
  'head id': 5},
 {'id': 14,
  'token': 'until',
  'lemma': 'until',
  'tag': 'IN',
  'entity': 'O',
  'dependency': 'prep',
  'head id': 5},
 {'id': 16,
  'token': 'of',
  'lemma': 'of',
  'tag': 'IN',
  'entity': 'DATE',
  'dependency': 'prep',
  'head id': 15}]

In [16]:
def get_word_relation_2(input_string, relation_String):
    doc = nlp(input_string)
    
    words = []
    ret = []
    for sent in doc.sents:
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0
            else:
                head_idx = doc[i].head.i+1
            if head_idx == i + 1:
                head_idx = 0

            entity_tag = word.ent_type_
            if len(entity_tag) == 0:
                entity_tag = "O"
            
            word_obj = {"id": i+1, "token": str(word), "lemma": word.lemma_, "tag": word.tag_, "entity": entity_tag,
                                    "dependency": word.dep_, "head id": head_idx}
            words.append(word_obj)
    
    for word in words:
        if word["dependency"] == relation_String:
            l = []
            for w in words:
                if(word["head id"] == w["id"]):
                    l.append(w["lemma"])
                    break
            l.append(word["lemma"])
            l.append(word["dependency"])
            ret.append(l)
            
    if(not ret):
        ret = None
    return ret

In [17]:
get_word_relation_2(input_string,"compound")

[['Angeles', 'Los', 'compound'], ['Vegas', 'Las', 'compound']]

In [18]:
print_annotation(input_string)

Unnamed: 0,id,token,lemma,tag,entity,dependency,head id
0,1,In,in,IN,O,prep,5
1,2,1982,1982,CD,DATE,pobj,1
2,3,",",",",",",O,punct,5
3,4,Mark,Mark,NNP,PERSON,nsubj,5
4,5,drove,drive,VBD,O,ROOT,0
5,6,his,his,PRP$,O,poss,7
6,7,car,car,NN,O,dobj,5
7,8,from,from,IN,O,prep,5
8,9,Los,Los,NNP,GPE,compound,10
9,10,Angeles,Angeles,NNP,GPE,pobj,8


### Exercise 2: Search for entities

Define a method that takes in input a sentence (`input_string`) and the name of an entity type (`entity_type_string`), parses with spacy the input and returns the words (the `objects`! not the strings) described by that entity. If the entity type is not present, return an empty array.

```
def search_entity(input_string, entity_type_string):
    return word_obj_list
```

In [19]:
def search_entity(input_string, entity_type_string):
    doc = nlp(input_string)
    
    words = []
    ret = []
    for sent in doc.sents:
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0
            else:
                head_idx = doc[i].head.i+1
            if head_idx == i + 1:
                head_idx = 0

            entity_tag = word.ent_type_
            if len(entity_tag) == 0:
                entity_tag = "O"
            
            word_obj = {"id": i+1, "token": str(word), "lemma": word.lemma_, "tag": word.tag_, "entity": entity_tag,
                                    "dependency": word.dep_, "head id": head_idx}
            words.append(word_obj)
    
    i=0
    while(i<len(words)):
        res = []
        while(i<len(words) and words[i]["entity"] == entity_type_string):
            res.append(words[i]["lemma"])
            i=i+1
        if(res):
            ret.append(" ".join(res))
        i=i+1
            
    if(not ret):
        ret = None
                          
    return ret

In [20]:
search_entity(input_string,"GPE")

['Los Angeles', 'Las Vegas']

In [21]:
visualize_annotation(input_string, style="ent")

In [22]:
search_entity(input_string,"DATE")

['1982', '5 of july']

### Exercise 3: Enriching the sentences

For every sentence in the QuestionClassification dataset, extract the `subject-verb` relation and the `verb-object` relation. Add these couples to the original input, divided by the `#`:

- Sentence: '*What is the full form of .com?*' 
- `subject-verb`: *What is*  
- `verb-object`: *is the full form* 
- Enriched sentence: '*What is the full form of .com? # What is # is the full form*'  

Store the enriched sentences in a new dataframe and train a classifier (SVM, NB, Rocchio..) and evaluate it.

In [23]:
from sklearn.datasets import fetch_20newsgroups
# Load the Reuters dataset
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))


In [24]:
print_annotation(newsgroups_train.data[0])

Unnamed: 0,id,token,lemma,tag,entity,dependency,head id
0,1,I,I,PRP,O,nsubj,3
1,2,was,be,VBD,O,aux,3
2,3,wondering,wonder,VBG,O,ROOT,0
3,4,if,if,IN,O,mark,9
4,5,anyone,anyone,NN,O,nsubj,9
...,...,...,...,...,...,...,...
111,37,please,please,UH,O,intj,36
112,38,e,e,NN,O,dobj,39
113,39,-,-,NN,O,dobj,33
114,40,mail,mail,NN,O,dobj,22


In [28]:
nlps = []
for i in newsgroups_train.data:
    nlps.append(nlp(i))
                

In [31]:
target = nlps[0]
print_annotation(target)

Unnamed: 0,id,token,lemma,tag,entity,dependency,head id
0,1,I,I,PRP,O,nsubj,3
1,2,was,be,VBD,O,aux,3
2,3,wondering,wonder,VBG,O,ROOT,0
3,4,if,if,IN,O,mark,9
4,5,anyone,anyone,NN,O,nsubj,9
...,...,...,...,...,...,...,...
111,37,please,please,UH,O,intj,36
112,38,e,e,NN,O,dobj,39
113,39,-,-,NN,O,dobj,33
114,40,mail,mail,NN,O,dobj,22


In [37]:
words = []

for sent in target.sents:
    for i, word in enumerate(sent):
        if word.head is word:
            head_idx = 0
        else:
            head_idx = target[i].head.i+1
        if head_idx == i + 1:
            head_idx = 0

        entity_tag = word.ent_type_
        if len(entity_tag) == 0:
            entity_tag = "O"
            
        word_obj = {"id": i+1, "token": str(word), "lemma": word.lemma_, "tag": word.tag_, "entity": entity_tag,
                                    "dependency": word.dep_, "head id": head_idx}
        words.append(word_obj)

In [40]:
words

[{'id': 1,
  'token': 'I',
  'lemma': 'I',
  'tag': 'PRP',
  'entity': 'O',
  'dependency': 'nsubj',
  'head id': 3},
 {'id': 2,
  'token': 'was',
  'lemma': 'be',
  'tag': 'VBD',
  'entity': 'O',
  'dependency': 'aux',
  'head id': 3},
 {'id': 3,
  'token': 'wondering',
  'lemma': 'wonder',
  'tag': 'VBG',
  'entity': 'O',
  'dependency': 'ROOT',
  'head id': 0},
 {'id': 4,
  'token': 'if',
  'lemma': 'if',
  'tag': 'IN',
  'entity': 'O',
  'dependency': 'mark',
  'head id': 9},
 {'id': 5,
  'token': 'anyone',
  'lemma': 'anyone',
  'tag': 'NN',
  'entity': 'O',
  'dependency': 'nsubj',
  'head id': 9},
 {'id': 6,
  'token': 'out',
  'lemma': 'out',
  'tag': 'RB',
  'entity': 'O',
  'dependency': 'advmod',
  'head id': 7},
 {'id': 7,
  'token': 'there',
  'lemma': 'there',
  'tag': 'RB',
  'entity': 'O',
  'dependency': 'advmod',
  'head id': 5},
 {'id': 8,
  'token': 'could',
  'lemma': 'could',
  'tag': 'MD',
  'entity': 'O',
  'dependency': 'aux',
  'head id': 9},
 {'id': 9,
  'tok

In [44]:
for word in words:
    if "VB" in word["tag"]:
        print(word["token"])

was
wondering
enlighten
saw
was
looked
be
was
called
were
was
is
know
tellme
is
made
have
looking


In [50]:
visualize_annotation(target)

In [46]:
visualize_annotation("What is the full form of .com?")

Thanks to arrows in the graph, then I'll retrive everything that is connected with the verb for each verb in the sentence