# Linguistic processing with spaCy

As we have seen, we first need to import spaCy and load a trained pipeline:

In [1]:
# Import spacy
import spacy

# Load English language pipeline and store in variable `nlp`:
nlp = spacy.load("en_core_web_sm")

## Sentence-level processing

### Sentence segmentation

In [2]:
# Text to process (it contains two sentences):
txt = "Mr. Smith and Mr. Jones went to the shops. They bought a notebook."

# Process the text using the trained pipeline loaded above:
doc = nlp(txt)

# To retrieve sentences from a `doc` object, use the `sents` attribute:
print([sent.text for sent in doc.sents])

['Mr. Smith and Mr. Jones went to the shops.', 'They bought a notebook.']


## Token-level processing

### Tokenization

With a for-loop:

In [3]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Access the text of each token in a for-loop:
for token in doc:
    print(token.text)

Mr.
Smith
and
Mr.
Jones
went
to
the
shops
and
they
bought
a
notebook
.


With a list comprehension:

In [4]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Access the text of each token with a list comprehension:
print([token.text for token in doc])

['Mr.', 'Smith', 'and', 'Mr.', 'Jones', 'went', 'to', 'the', 'shops', 'and', 'they', 'bought', 'a', 'notebook', '.']


### Lemmatization

In [5]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Access the lemma of each token:
print([token.lemma_ for token in doc])

['Mr.', 'Smith', 'and', 'Mr.', 'Jones', 'go', 'to', 'the', 'shop', 'and', 'they', 'buy', 'a', 'notebook', '.']
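
To see which tokens the lemmatizer actually changed, one option is to pair each token with its lemma and keep only the pairs that differ. The sketch below hard-codes the two lists printed above so that it runs without loading the pipeline; the variable names `tokens`, `lemmas` and `changed` are illustrative.

```python
# Token texts and lemmas, copied from the outputs printed above:
tokens = ['Mr.', 'Smith', 'and', 'Mr.', 'Jones', 'went', 'to', 'the',
          'shops', 'and', 'they', 'bought', 'a', 'notebook', '.']
lemmas = ['Mr.', 'Smith', 'and', 'Mr.', 'Jones', 'go', 'to', 'the',
          'shop', 'and', 'they', 'buy', 'a', 'notebook', '.']

# Pair each token with its lemma and keep only the pairs where they differ:
changed = [(token, lemma) for token, lemma in zip(tokens, lemmas) if token != lemma]
print(changed)
```

With a `doc` object, the same comparison can be written as `[(token.text, token.lemma_) for token in doc if token.text != token.lemma_]`.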


### Part-of-speech tagging

In [6]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Get the part-of-speech of each token:
print([token.pos_ for token in doc])

['PROPN', 'PROPN', 'CCONJ', 'PROPN', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN', 'CCONJ', 'PRON', 'VERB', 'DET', 'NOUN', 'PUNCT']


In [7]:
# If we want more than one layer of information per token, we can group them in a
# tuple (i.e. enclosed in parentheses). For example, to get the text and
# part-of-speech of each token:
print([(token.text, token.pos_) for token in doc])

[('Mr.', 'PROPN'), ('Smith', 'PROPN'), ('and', 'CCONJ'), ('Mr.', 'PROPN'), ('Jones', 'PROPN'), ('went', 'VERB'), ('to', 'ADP'), ('the', 'DET'), ('shops', 'NOUN'), ('and', 'CCONJ'), ('they', 'PRON'), ('bought', 'VERB'), ('a', 'DET'), ('notebook', 'NOUN'), ('.', 'PUNCT')]


In [8]:
# As we have already seen, list comprehensions can include conditions.
# For example, get only the text of the tokens that are nouns:
print([token.text for token in doc if token.pos_ == "NOUN"])

['shops', 'notebook']


In [9]:
# Or get only the verbs:
print([token.text for token in doc if token.pos_ == "VERB"])

['went', 'bought']


In [10]:
# Instead of printing the result, we can store it in a new variable:
list_of_verbs = [token.text for token in doc if token.pos_ == "VERB"]
print(list_of_verbs)

['went', 'bought']
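
Once the tags are in a list, they can be summarised with standard Python tools. As a minimal sketch, the block below counts how often each part-of-speech tag occurs using `collections.Counter`. The tag list is copied from the part-of-speech output printed earlier so that the example runs on its own; `pos_tags` and `tag_counts` are illustrative names.

```python
from collections import Counter

# Part-of-speech tags copied from the output printed earlier:
pos_tags = ['PROPN', 'PROPN', 'CCONJ', 'PROPN', 'PROPN', 'VERB', 'ADP', 'DET',
            'NOUN', 'CCONJ', 'PRON', 'VERB', 'DET', 'NOUN', 'PUNCT']

# Count how many times each tag occurs:
tag_counts = Counter(pos_tags)

# Print the tags from most to least frequent:
for tag, count in tag_counts.most_common():
    print(tag, count)
```

With a `doc` object, the counter can be built directly with `Counter(token.pos_ for token in doc)`.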


✏️ **Exercises:**

In [11]:
# 1. Print the verbatim text, lemma, and part-of-speech of each token.
# 
# Type your code here:



In [12]:
# 2. Define a new function that takes a sentence and an nlp object as input,
# and returns the adjectives in the text (as a list). Then try it with the
# following sentences:
# 
# * "It was a bright cold day in April, and the clocks were striking thirteen."
# * "April is the Cruellest Month."
# * "Twas brillig, and the slithy toves did gyre and gimble in the wabe; all mimsy were the borogoves, and the mome raths outgrabe."
# * ... and any other sentence you'd like to try!
# 
# Type your code here:



👀 **Suggested solutions:**

In [13]:
# Exercise 1:

print([(token.text, token.lemma_, token.pos_) for token in doc])

[('Mr.', 'Mr.', 'PROPN'), ('Smith', 'Smith', 'PROPN'), ('and', 'and', 'CCONJ'), ('Mr.', 'Mr.', 'PROPN'), ('Jones', 'Jones', 'PROPN'), ('went', 'go', 'VERB'), ('to', 'to', 'ADP'), ('the', 'the', 'DET'), ('shops', 'shop', 'NOUN'), ('and', 'and', 'CCONJ'), ('they', 'they', 'PRON'), ('bought', 'buy', 'VERB'), ('a', 'a', 'DET'), ('notebook', 'notebook', 'NOUN'), ('.', '.', 'PUNCT')]


In [14]:
# Exercise 2:

def get_adjectives(text, nlp):
    processed_text = nlp(text)
    list_of_adjectives = [token.text for token in processed_text if token.pos_ == "ADJ"]
    return list_of_adjectives

text = "April is the cruellest month."
get_adjectives(text, nlp)

['cruellest']

### Dependency parsing

In [15]:
# Pro tip: you can visualize the dependency parsing like this:

from spacy import displacy

txt = "A trifling incident thus served to settle a victory."
doc = nlp(txt)

displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})

In [16]:
# Given a sentence:
txt = "A trifling incident thus served to settle a victory."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# The text, the dependency and the head token for each token in the doc, stored in `parsing`:
parsing = [(token.text, token.dep_, token.head.text) for token in doc]
print(parsing)

[('A', 'det', 'incident'), ('trifling', 'compound', 'incident'), ('incident', 'nsubj', 'served'), ('thus', 'advmod', 'served'), ('served', 'ROOT', 'served'), ('to', 'aux', 'settle'), ('settle', 'xcomp', 'served'), ('a', 'det', 'victory'), ('victory', 'dobj', 'settle'), ('.', 'punct', 'served')]
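
The flat list of triples can be easier to read when regrouped by head. The following is a small sketch, not part of spaCy itself, that finds the root and collects the dependents of each head from the `parsing` output above (hard-coded here so the block runs on its own). The names `root` and `dependents` are illustrative. Grouping by token text is a simplification that would merge repeated words; with a real `doc`, `token.children` gives the dependents of a token directly.

```python
from collections import defaultdict

# (text, dependency label, head text) triples copied from the output above:
parsing = [('A', 'det', 'incident'), ('trifling', 'compound', 'incident'),
           ('incident', 'nsubj', 'served'), ('thus', 'advmod', 'served'),
           ('served', 'ROOT', 'served'), ('to', 'aux', 'settle'),
           ('settle', 'xcomp', 'served'), ('a', 'det', 'victory'),
           ('victory', 'dobj', 'settle'), ('.', 'punct', 'served')]

# The root is the token labelled ROOT (its head is itself):
root = next(text for text, dep, head in parsing if dep == "ROOT")

# Group every non-root token under the text of its head:
dependents = defaultdict(list)
for text, dep, head in parsing:
    if dep != "ROOT":
        dependents[head].append((text, dep))

print("Root:", root)
for head, children in dependents.items():
    print(head, "<-", children)
```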


# Processing text in a dataframe

First let's have a look at our data [here](data/LwM-nlp-animacy-annotations-machines19thC.tsv).

Before any preprocessing or analysis begins, make sure you understand your data. The preprocessing steps you choose (and the analyses you are able to carry out!) depend on the characteristics of your dataset.

We will be using **pandas** to work with tabular data:

In [17]:
# Import the pandas library
import pandas as pd

Load the dataset:

In [18]:
# We can read the TSV file using the pandas library. The resulting object is called
# a dataframe, which we store in the variable `df`:
df = pd.read_csv("data/LwM-nlp-animacy-annotations-machines19thC.tsv", sep="\t", index_col=0)

Have a look at your data:

In [19]:
# Let's print the first rows of our dataframe:
df.head()

Unnamed: 0,Date,Sentence,SentenceCtxt,SentenceId,TargetExpression,animacy,humanness
0,1891,Poetic RMS OF THE CITY OF MANCHESTEI legends m...,"That there was a Roman camp on Castlefield, wi...",002732647_02_180_7,engine,0.0,0.0
1,1880,"Immured in a convent, debarred from life-givin...",and sounds of His creation. [SEP] Immured in a...,003176106_01_278_2,machines,1.0,0.0
2,1871,"Still, before these moral stumps are removed, ...","The re moval of the former, however, is much e...",001028082_01_132_8,machine,1.0,0.0
3,1893,100 shows a Cornish ***boiler*** improperly se...,Fig. [SEP] 100 shows a Cornish ***boiler*** im...,002757962_01_184_16,boiler,0.0,1.0
4,1883,Laws are too apt to be regarded as if intended...,"What a deal of time, trouble, expense, and mis...",002932979_02_121_10,machinery,0.0,0.0


In [20]:
def process_text(text):
    """
    Function that takes a text and an nlp pipeline, and
    returns a list of the verbs in the text.
    
    Args:
        text: The text that will be processed.
        nlp: A spacy pipeline.
    
    Returns:
        A list of verbs in the text.
    """
    doc = nlp(text)
    processed = [x.text for x in doc if x.pos_ == "VERB"]
    return processed

In [None]:
# Apply the process_text() function to a column in the dataframe. This returns a new
# Series and leaves `df` unchanged:
df['SentenceCtxt'].apply(process_text)

In [21]:
# Apply the process_text() function to the 'SentenceCtxt' column:
df['processed'] = df['SentenceCtxt'].apply(process_text)

In [22]:
# Check the first rows: a new column has appeared, with our processing!
df.head()

Unnamed: 0,Date,Sentence,SentenceCtxt,SentenceId,TargetExpression,animacy,humanness,processed
0,1891,Poetic RMS OF THE CITY OF MANCHESTEI legends m...,"That there was a Roman camp on Castlefield, wi...",002732647_02_180_7,engine,0.0,0.0,"[was, is, occupied, asserted, claim, inhabited..."
1,1880,"Immured in a convent, debarred from life-givin...",and sounds of His creation. [SEP] Immured in a...,003176106_01_278_2,machines,1.0,0.0,"[sounds, debarred, giving, cease, living, feel..."
2,1871,"Still, before these moral stumps are removed, ...","The re moval of the former, however, is much e...",001028082_01_132_8,machine,1.0,0.0,"[removed, return, removed, stumping, have, do,..."
3,1893,100 shows a Cornish ***boiler*** improperly se...,Fig. [SEP] 100 shows a Cornish ***boiler*** im...,002757962_01_184_16,boiler,0.0,1.0,"[shows, seated, cause, applied]"
4,1883,Laws are too apt to be regarded as if intended...,"What a deal of time, trouble, expense, and mis...",002932979_02_121_10,machinery,0.0,0.0,"[saved, considered, legislating, legislating, ..."


In [23]:
# Store the dataframe with the added column:
df.to_csv("data/animacy_processed.tsv", sep="\t")
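
One thing to keep in mind when saving the dataframe: a TSV file only holds text, so a column of Python lists such as `processed` is written as the string representation of each list, and it comes back as strings when the file is read again. Below is a minimal sketch of turning such a string back into a list with the standard library's `ast.literal_eval`. The example cell mirrors the fourth row of the output above, and `cell` and `verbs` are illustrative names.

```python
import ast

# What one cell of the `processed` column looks like after a round trip
# through a TSV file: a string that merely looks like a list.
cell = "['shows', 'seated', 'cause', 'applied']"

# Safely parse the string back into a real Python list:
verbs = ast.literal_eval(cell)

print(type(cell).__name__, "->", type(verbs).__name__)
print(verbs)
```

Assuming the saved file has been read back into a dataframe called `df`, something like `df['processed'] = df['processed'].apply(ast.literal_eval)` would convert the whole column.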

✏️ **Exercises:**

In [24]:
# Using the SentenceCtxt column in the animacy dataset, write a function that returns a
# list of lemma forms instead of the words. Type your code here:



In [25]:
# Using the SentenceCtxt column in the animacy dataset, write a function that returns a
# list of named entities instead of the words. Type your code here:



In [26]:
# Using the SentenceCtxt column in the animacy dataset, write a function "count_tokens"
# and a function "count_lemmas", and add one column for the total number of tokens in
# SentenceCtxt and another one for the number of unique lemmas. Tip: remember that you
# can convert a `list` into a `set` to get rid of duplicates.
# 
# Type your code here:

