# Linguistic processing with spaCy

As we have seen, we first need to import spaCy and
load a trained pipeline:

In [None]:
# Import spacy
import spacy

# Load English language pipeline and store in variable `nlp`:
nlp = spacy.load("en_core_web_sm")

## Sentence-level processing

### Sentence segmentation

In [None]:
# Sentence to process:
txt = "Mr. Smith and Mr. Jones went to the shops. They bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# To retrieve sentences from a `doc` object, use the `sents` attribute:
print([sent.text for sent in doc.sents])

## Token-level processing

### Tokenization

In [None]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Access the text of each token, either in a for-loop:
for token in doc:
    print(token.text)

In [None]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# ... or with a list-comprehension:
print([token.text for token in doc])

### Lemmatization

In [None]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Access the lemma of each token:
print([token.lemma_ for token in doc])

### Part-of-speech tagging

In [None]:
# Given a sentence:
txt = "Mr. Smith and Mr. Jones went to the shops and they bought a notebook."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Get the part-of-speech of each token:
print([token.pos_ for token in doc])

In [None]:
# Get the text and part-of-speech of each token:
print([(token.text, token.pos_) for token in doc])

In [None]:
# Get only the nouns:
print([token.text for token in doc if token.pos_ == "NOUN"])

In [None]:
# Get only the verbs:
print([token.text for token in doc if token.pos_ == "VERB"])

### Dependency parsing

In [None]:
# Given a sentence:
txt = "A trifling incident thus served to settle a victory."

# Process the sentence using the trained pipeline that has been loaded before:
doc = nlp(txt)

# Verbatim token content:
print([token.text for token in doc])

# Syntactic dependency relation:
print([token.dep_ for token in doc])

# The head of a dependency relation:
print([token.head.text for token in doc])

In [None]:
# Pro tip: you can visualize the dependency parse like this:

from spacy import displacy

txt = "A trifling incident thus served to settle a victory."
doc = nlp(txt)

displacy.render(doc, style='dep', jupyter=True, options={'distance': 160})

# Processing text in a dataframe

First, let's have a look at our data [here](XXXXXXX).

Before any preprocessing or analysis can begin, make sure you understand your data. Which preprocessing steps you choose (and which analyses you are able to carry out!) depends on the characteristics of your dataset.

We will be using **pandas** to work with tabular data:

In [None]:
# Import the pandas library
import pandas as pd

Load the dataset:

In [None]:
# We can read the TSV file using the pandas library. The resulting object is called
# a dataframe, which we store in the variable `df`:
df = pd.read_csv("data/animacy.tsv", sep="\t")

Have a look at your data:

In [None]:
# Let's print the first rows of our dataframe:
df.head()
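
Beyond `head()`, a few other pandas calls can help you get a feel for a dataset before processing it. The sketch below uses a small made-up dataframe rather than the real file: only the `TextSnippet` column name comes from this notebook, while the `label` column and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataset. Only the `TextSnippet` column name comes
# from this notebook; the `label` column and all values are made up.
toy_df = pd.DataFrame(
    {
        "TextSnippet": ["The machine was working.", None, "They bought a notebook."],
        "label": [1, 0, 0],
    }
)

# Number of rows and columns:
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# Column names:
print(list(toy_df.columns))

# Number of missing values in each column:
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Length in characters of each non-missing text snippet:
snippet_lengths = toy_df["TextSnippet"].dropna().str.len()
print(snippet_lengths.describe())
```

Missing values are worth checking early: they are not strings, so they would probably need to be dropped or filled before the texts are passed to a spaCy pipeline.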

In [None]:
def process_text(text, nlp):
    """
    Function that takes a text and an nlp pipeline, and
    returns a list of the verbs in the text.
    
    Args:
        text: The text that will be processed.
        nlp: A spacy pipeline.
    
    Returns:
        A list of verbs in the text.
    """
    doc = nlp(text)
    processed = [x.text for x in doc if x.pos_ == "VERB"]
    return processed

In [None]:
# Load a spaCy pipeline:
nlp = spacy.load("en_core_web_sm")

# Apply the process_text() function to the 'TextSnippet' column:
df['processed'] = df['TextSnippet'].apply(lambda x: process_text(x, nlp))

In [None]:
df.head()
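
Calling the pipeline once per row with `apply` works, but for larger datasets spaCy's `nlp.pipe()` method, which processes texts in batches, tends to be faster. The sketch below is only an illustration under stated assumptions: it uses a blank English pipeline (tokenizer only, stored in the hypothetical variable `nlp_blank`) so that it runs without a downloaded model, and two made-up example texts.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) keeps this sketch
# runnable without a downloaded model. It has no tagger, so it cannot be used
# to filter by part of speech; use the pipeline loaded above for that.
nlp_blank = spacy.blank("en")

# Made-up example texts standing in for a column of the dataframe:
texts = ["Mr. Smith went to the shops.", "They bought a notebook."]

# nlp.pipe() takes an iterable of texts and yields one `doc` per text,
# in the same order as the input:
tokens_per_text = [[token.text for token in doc] for doc in nlp_blank.pipe(texts)]
print(tokens_per_text)
```

Because the results come back in input order, a list built this way has one entry per text and could be assigned to a new dataframe column in the same way as the output of `apply`.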

In [None]:
# Store the dataframe with the added column:
df.to_csv("data/animacy_processed.tsv", sep="\t")
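
One thing to keep in mind when storing a column of lists in a text file: each list is written out as a string, so it comes back as a string when the file is read again. The sketch below shows one possible way to recover the lists with `ast.literal_eval` from the standard library. It uses a made-up dataframe and an in-memory buffer instead of a file on disk, and passes `index=False` so that the row index is not written as an extra column.

```python
import ast
import io

import pandas as pd

# Toy dataframe with a column of lists; the values are made up.
toy_df = pd.DataFrame({"processed": [["went", "bought"], []]})

# Write to an in-memory buffer instead of a file on disk:
buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# Read the data back. The lists are now plain strings:
reloaded = pd.read_csv(buffer, sep="\t")
first_value_type = type(reloaded["processed"][0])
print(first_value_type)

# Convert each string back into a Python list:
reloaded["processed"] = reloaded["processed"].apply(ast.literal_eval)
print(reloaded["processed"].tolist())
```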

✏️ **Exercises:**

In [None]:
# Using the TextSnippet column in the animacy dataset, write a function that returns a
# list of lemma forms instead of the words. Type your code here:



In [None]:
# Using the TextSnippet column in the animacy dataset, write a function that returns a
# list of named entities instead of the words. Type your code here:



In [None]:
# Using the TextSnippet column in the animacy dataset, write a function "count_tokens"
# and a function "count_lemmas", and add one column for the total number of tokens in
# TextSnippet and another one for the number of unique lemmas. Tip: remember that you
# can convert a `list` into a `set` to get rid of duplicates.
# 
# Type your code here:

