# SPACY BASICS

In this lab we will learn to use the spaCy API to annotate text. Many of the concepts seen in this lab are explained in detail in the spaCy 101 guide:

https://spacy.io/usage/spacy-101 

The installation page lets you configure the spaCy setup (language, annotators, etc.) that you need:

https://spacy.io/usage 


In [1]:
# Install Spacy and learn about Token and Sentence objects

!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-no

# ASSIGNMENT 1

Install the language modules of your choice. 

Read the documentation at https://spacy.io/usage and choose the language modules that you would like to install, according to your interests.  
  + TODO: Install the language module(s).
  + TODO: Try different language module versions for one language and compare the results obtained.

In [2]:
# TODO install other language modules of your choice following the https://spacy.io/usage
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Loading the language modules

The nlp object is the loaded language pipeline (an instance of spaCy's Language class). Throughout this tutorial, nlp refers to the pipeline loaded from the language package or packages of your choice. In the following steps we will use spaCy to process a string and a text file.

In [3]:
import spacy

# Use the GPU if one is available; otherwise spaCy keeps running on the CPU
spacy.prefer_gpu()
# TODO load the installed language module
nlp = spacy.load("en_core_web_sm")
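
A pipeline always starts with a tokenizer, and any further components (tagger, parser, NER and so on) run after it. As a minimal sketch, the snippet below builds a blank English pipeline and checks what it contains. The blank pipeline and the names blank_nlp and blank_doc are assumptions made for illustration, so that no trained package is needed; calling pipe_names on the en_core_web_sm pipeline loaded above should list its trained components instead.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components (illustrative assumption)
blank_nlp = spacy.blank("en")

# Language code and names of the components that run after the tokenizer
print(blank_nlp.lang, blank_nlp.pipe_names)

# Tokenization works even though the component list is empty
blank_doc = blank_nlp("Washington University, which is located in Missouri, is named after George Washington.")
print(len(blank_doc), [token.text for token in blank_doc])
```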



In [4]:
doc = nlp("Washington University, which is located in Missouri, is named after George Washington.")
print(doc)

Washington University, which is located in Missouri, is named after George Washington.


# ASSIGNMENT 2

When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

+ TODO: print the tokens in the Doc object. You should get something like the output below.
+ TODO: print the description of each tag (see morphology example, below)
+ TODO: print the entities recognized by iterating over the Doc object (scroll down past the morphology output to see an example).

In [5]:
# TODO add your code here to print the tokens in the Doc object

for word in doc:
  print(word)

Washington
University
,
which
is
located
in
Missouri
,
is
named
after
George
Washington
.


+ TODO: print the two entities containing "Washington"


In [6]:
# A slice of the Doc for "Washington University"
span = doc[0:2]

print(span)


# A slice of the Doc for "George Washington" (without the ".")

span2 = doc[-3:-1] #same as span2 = doc[12:14] 

print(span2)


Washington University
George Washington


In [7]:
# TODO obtain number of sentences
print("Total number of sentences:", len(list(doc.sents)))

Total number of sentences: 1


In [8]:
# morphology and syntax
for token in doc:
    # Get the token text, coarse POS tag, fine-grained tag, lemma and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    # Left-align each value in a fixed-width column
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{token_lemma:<20}{token_dep:<20}")

Washington  PROPN     NNP       Washington          compound            
University  PROPN     NNP       University          nsubjpass           
,           PUNCT     ,         ,                   punct               
which       PRON      WDT       which               nsubjpass           
is          AUX       VBZ       be                  auxpass             
located     VERB      VBN       locate              relcl               
in          ADP       IN        in                  prep                
Missouri    PROPN     NNP       Missouri            pobj                
,           PUNCT     ,         ,                   punct               
is          AUX       VBZ       be                  auxpass             
named       VERB      VBN       name                ROOT                
after       ADP       IN        after               prep                
George      PROPN     NNP       George              compound            
Washington  PROPN     NNP       Washington         

In [9]:
# morphology and syntax
# TODO modify the code above to print the description of each tag
for token in doc:
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    # Left-align each value in a fixed-width column
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{token_lemma:<20}{token_dep:<20}")

Washington  PROPN     NNP       Washington          compound            
University  PROPN     NNP       University          nsubjpass           
,           PUNCT     ,         ,                   punct               
which       PRON      WDT       which               nsubjpass           
is          AUX       VBZ       be                  auxpass             
located     VERB      VBN       locate              relcl               
in          ADP       IN        in                  prep                
Missouri    PROPN     NNP       Missouri            pobj                
,           PUNCT     ,         ,                   punct               
is          AUX       VBZ       be                  auxpass             
named       VERB      VBN       name                ROOT                
after       ADP       IN        after               prep                
George      PROPN     NNP       George              compound            
Washington  PROPN     NNP       Washington         
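
One possible way to obtain the description of a tag is spacy.explain, which looks a label up in spaCy's glossary and returns None for labels it does not know. The sketch below runs it on a few labels copied from the output above; the names labels and descriptions are illustrative. Inside the token loop, the same call could be written as spacy.explain(token.tag_).

```python
import spacy

# Labels copied from the output above: a coarse POS tag, two fine-grained tags, a dependency label
labels = ["PROPN", "NNP", "VBZ", "nsubjpass"]

# spacy.explain needs no loaded pipeline; it only reads the built-in glossary
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<12}{description}")
```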

In [10]:
# TODO Iterate over the predicted entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Washington University ORG
Missouri GPE
George Washington PERSON


In [11]:
# TODO modify the code above to iterate over the predicted entities at token level (IOB2 entities)
# For now, this loop prints each token's POS tag, dependency label and syntactic head
for token in doc:
    print(token.text, '---->', token.pos_, '---->', token.dep_, '---->', token.head.text)


Washington ----> PROPN ----> compound ----> University
University ----> PROPN ----> nsubjpass ----> named
, ----> PUNCT ----> punct ----> University
which ----> PRON ----> nsubjpass ----> located
is ----> AUX ----> auxpass ----> located
located ----> VERB ----> relcl ----> University
in ----> ADP ----> prep ----> located
Missouri ----> PROPN ----> pobj ----> in
, ----> PUNCT ----> punct ----> University
is ----> AUX ----> auxpass ----> named
named ----> VERB ----> ROOT ----> named
after ----> ADP ----> prep ----> named
George ----> PROPN ----> compound ----> Washington
Washington ----> PROPN ----> pobj ----> after
. ----> PUNCT ----> punct ----> named
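
In the IOB2 scheme, the first token of an entity is tagged B-LABEL, the following tokens of the same entity I-LABEL, and every other token O. The sketch below derives these tags in plain Python from the tokens and entity spans printed earlier; the helper to_iob2 and the hard-coded lists are illustrative assumptions, not part of spaCy. With a trained pipeline, the same information is exposed per token through token.ent_iob_ and token.ent_type_.

```python
# Tokens and entity spans copied from the outputs above (token offsets, end exclusive)
tokens = ["Washington", "University", ",", "which", "is", "located", "in",
          "Missouri", ",", "is", "named", "after", "George", "Washington", "."]
entities = [(0, 2, "ORG"), (7, 8, "GPE"), (12, 14, "PERSON")]


def to_iob2(n_tokens, spans):
    """Return one IOB2 tag per token for non-overlapping (start, end, label) spans."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


iob_tags = to_iob2(len(tokens), entities)
for token, tag in zip(tokens, iob_tags):
    print(f"{token:<12}{tag}")
```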


In [12]:
# easy feature extraction
for token in doc:
  print (token, token.idx, token.text_with_ws, 
         token.is_alpha, token.is_punct, token.is_space,
         token.shape_, token.is_stop)

Washington 0 Washington  True False False Xxxxx False
University 11 University True False False Xxxxx False
, 21 ,  False True False , False
which 23 which  True False False xxxx True
is 29 is  True False False xx True
located 32 located  True False False xxxx False
in 40 in  True False False xx True
Missouri 43 Missouri True False False Xxxxx False
, 51 ,  False True False , False
is 53 is  True False False xx True
named 56 named  True False False xxxx False
after 62 after  True False False xxxx True
George 68 George  True False False Xxxxx False
Washington 75 Washington True False False Xxxxx False
. 85 . False True False . False


In [13]:
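# stopwords available for English
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
# Not displayed: only the last expression of a cell is echoed
len(spacy_stopwords)
# STOP_WORDS is a set, so which ten words are shown can differ between runs
for stop_word in list(spacy_stopwords)[:10]:
  print(stop_word)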

latter
even
such
themselves
throughout
as
if
hence
twelve
what


# ASSIGNMENT 3

+ TODO: Remove stopwords from doc
+ TODO: print only the verbs, 3rd person singular present and the proper singular nouns

In [14]:
# TODO remove stopwords
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)
for stop_word in list(spacy_stopwords)[:10]:
  print(stop_word)

latter
even
such
themselves
throughout
as
if
hence
twelve
what
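
One way to remove stop words from a Doc is to filter on the token.is_stop flag shown in the feature extraction cell. The sketch below also drops punctuation through token.is_punct. It uses a blank English pipeline, which is an assumption made here because both flags are lexical attributes that need no trained component; the names stop_nlp, stop_doc and content_words are illustrative.

```python
import spacy

# Blank English pipeline (assumption: sufficient, since is_stop and is_punct are lexical attributes)
stop_nlp = spacy.blank("en")
stop_doc = stop_nlp("Washington University, which is located in Missouri, is named after George Washington.")

# Keep only the tokens that are neither stop words nor punctuation
content_words = [token.text for token in stop_doc
                 if not token.is_stop and not token.is_punct]
print(content_words)
```

A Doc cannot be edited in place this way, so the result is a plain list of strings rather than a new Doc.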


In [15]:
# TODO print only verbs, 3rd person singular present and proper singular nouns
nouns = []
verbs = []

prop_n = "NNP"  # fine-grained tag: noun, proper singular
verb_3sg = "VBZ"  # fine-grained tag: verb, 3rd person singular present

for token in doc:
  if token.tag_ == prop_n:
    nouns.append(token)
  elif token.tag_ == verb_3sg:
    verbs.append(token)

print(nouns)
print(verbs)

[Washington, University, Missouri, George, Washington]
[is, is]


# ASSIGNMENT 4 (BONUS 1)

Visualizations with spaCy. Check the documentation at https://spacy.io/usage/visualizers and render the dependencies and NER annotations, like so:



In [16]:
from spacy import displacy

# These settings only apply to the dependency visualizer
dep_options = {"compact": False, "bg": "black",
               "color": "yellow", "font": "Source Sans Pro", "distance": 110}

# Highlight the named entities, then draw the dependency tree
displacy.render(doc, style='ent', jupyter=True)
displacy.render(doc, style='dep', jupyter=True, options=dep_options)
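
Outside a notebook, displacy.render can return the markup as a string when jupyter=False is passed, and the string can then be saved to a file. The sketch below uses displaCy's manual mode, where the entities are supplied by hand as character offsets, so that it runs without a trained pipeline. The offsets follow the token.idx values printed earlier; the dictionary name example and the file name entities.html are illustrative assumptions.

```python
from spacy import displacy

# Entities given by hand as character offsets (manual mode), so no trained pipeline is needed
example = {
    "text": "Washington University, which is located in Missouri, is named after George Washington.",
    "ents": [
        {"start": 0, "end": 21, "label": "ORG"},
        {"start": 43, "end": 51, "label": "GPE"},
        {"start": 68, "end": 85, "label": "PERSON"},
    ],
    "title": None,
}

# jupyter=False makes render() return the HTML markup instead of displaying it
html = displacy.render(example, style="ent", manual=True, jupyter=False)

# Hypothetical output file; open it in a browser to see the highlighted entities
with open("entities.html", "w", encoding="utf-8") as f:
    f.write(html)
```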

# ASSIGNMENT 5 (BONUS 2)

In this task you will be annotating a movie review at document and sentence level.

1. Open the file '/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/resources/movie-review.txt'
2. Predict and print the annotations seen previously (POS, NER, lemmas, etc.) for each sentence in the document. Use at least two language modules for one language of your interest: the most basic and the most advanced.
3. Visualize the results.
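
Working at sentence level means iterating over doc.sents, where each sentence is a Span whose tokens carry the same attributes as the tokens of a Doc. The sketch below shows only the iteration pattern. It assumes a blank English pipeline with the rule-based sentencizer component in place of a trained package, and the two-sentence review is made up for illustration; with a trained pipeline, the inner loop could print token.pos_, token.lemma_ and the other annotations.

```python
import spacy

# Blank pipeline plus a rule-based sentence splitter (assumption: stands in for a trained package)
review_nlp = spacy.blank("en")
review_nlp.add_pipe("sentencizer")

# Made-up review text, used only for illustration
review = "The plot was predictable. The acting, however, was superb."
review_doc = review_nlp(review)

sentences = list(review_doc.sents)
for number, sent in enumerate(sentences, start=1):
    # Each sentence is a Span, so it can be iterated token by token
    print(number, sent.text, [token.text for token in sent])
```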



In [17]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


In [23]:
# TODO add code here
file_name = '/content/drive/MyDrive/2022-ILTAPP/resources/guardian.txt'

# Read the whole file; keep the path and the text in separate variables
with open(file_name) as f:
  text = f.read()

doc = nlp(text)

print(text)
displacy.render(doc, style='ent', jupyter=True)

# Print text, coarse POS tag, fine-grained tag, lemma and dependency label for each token
for token in doc:
  token_text = token.text
  token_pos = token.pos_
  token_tag = token.tag_
  token_lemma = token.lemma_
  token_dep = token.dep_
  print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{token_lemma:<20}{token_dep:<20}")


Twelve years after the fall of the Taliban, Afghanistan is heading for a near-record opium crop as instability pushes up the amount of land planted with illegal but lucrative poppies, according to a bleak UN report.

The rapid growth of poppy farming as western troops head home reflects particularly badly on Britain, which was designated "lead nation" for counter-narcotics work over a decade ago.

"Poppy cultivation is not only expected to expand in areas where it already existed in 2012 … but also in new areas or areas where poppy cultivation was stopped," the Afghanistan Opium Winter Risk Assessment found.

The growth in opium cultivation reflects both spreading instability and concerns about the future. Farmers are more likely to plant the deadly crop in areas of high violence or where they have not received any agricultural aid, the report said.

Opium traders are often happy to provide seeds, fertilisers and even advance payments to encourage crops, leaving farmers who do not have

Twelve      NUM       CD        twelve              nummod              
years       NOUN      NNS       year                npadvmod            
after       ADP       IN        after               prep                
the         DET       DT        the                 det                 
fall        NOUN      NN        fall                pobj                
of          ADP       IN        of                  prep                
the         DET       DT        the                 det                 
Taliban     PROPN     NNP       Taliban             pobj                
,           PUNCT     ,         ,                   punct               
Afghanistan PROPN     NNP       Afghanistan         nsubj               
is          AUX       VBZ       be                  aux                 
heading     VERB      VBG       head                ROOT                
for         ADP       IN        for                 prep                
a           DET       DT        a                  