# Analysing text with ```spaCy```

You're likely to encounter several different NLP frameworks. Among the most widely used are:

- ```NLTK``` (Natural Language Toolkit, old-school)
- ```UDPipe``` (Neural network based, fast and light, but not super accurate)
- ```CoreNLP``` and ```stanza``` (Created by the team at Stanford; academically robust)
- ```spaCy``` (production-ready, well-documented, state-of-the-art)

We'll be working with ```spaCy``` in this module, primarily because it's easy to use and scales well. Even an off-the-shelf ```spaCy``` model gives us tokens, part-of-speech tags, dependency relations, and named entities. See the [spaCy documentation on linguistic features](https://spacy.io/usage/linguistic-features) for more details.

The first thing we need to do is install ```spaCy``` and the language model that we want to use.

From the command line, run the setup script `./setup.sh` to install the requirements, or run these commands yourself:
```bash
pip install spacy pandas
python -m spacy download en_core_web_md
```

## Initializing ```spaCy```

The first thing we need to do is import ```spaCy``` __and__ load the language model that we want to use.

Note that each language needs its own language model.

In [1]:
# load the trained pipeline and assign it to nlp
import spacy
nlp = spacy.load("en_core_web_md")

With the model now loaded, we can begin to do some very simple NLP tasks.

Here, we load a spaCy pipeline and assign it to the variable ```nlp```. This is the NLP pipeline that will do all our heavy lifting, using the trained model we've specified.

Below, you can see what the pipeline does with a bit of sample text. Passing text to the ```nlp``` object returns a ```Doc```, which gives us access to a range of properties, including tokens (words), parts of speech, named entities, and so on. Here we look at two of them: tokens and entities. These objects, in turn, have their own attributes and methods. A full outline can be found in the spaCy docs.

For each token object, we can access the token itself (```token.text```), its part-of-speech tag (```token.pos_```), and its grammatical dependency relation to other tokens (```token.dep_```).


In [2]:
# a single sentence example
input_string = "My name is Ross and I have family in New York City."

In [3]:
# create a new Doc object
doc = nlp(input_string)

In [4]:
print(doc)

My name is Ross and I have family in New York City.


## Tokens

In [5]:
# tokenizing text
for token in doc:
    print(token.text)

My
name
is
Ross
and
I
have
family
in
New
York
City
.


## Index, text and POS-tag

In [6]:
# find parts-of-speech
for token in doc:
    print(token.i, token.text, token.pos_)

0 My PRON
1 name NOUN
2 is AUX
3 Ross PROPN
4 and CCONJ
5 I PRON
6 have VERB
7 family NOUN
8 in ADP
9 New PROPN
10 York PROPN
11 City PROPN
12 . PUNCT


## NER labels
Named entities can span several tokens, so we access them through ```doc.ents``` rather than by iterating over the tokens directly:

In [7]:
# extract named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

Ross PERSON
New York City GPE
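
Tag and label abbreviations such as `PROPN` or `GPE` are not always self-explanatory. As a small sketch, `spacy.explain` can look up a short description for a label; it needs no trained model, and the `labels` list below is simply a selection taken from the outputs above.

```python
import spacy

# Labels copied from the POS and entity outputs above.
labels = ["PROPN", "ADP", "CCONJ", "PERSON", "GPE"]

# spacy.explain returns a short description string,
# or None if the label is not in spaCy's glossary.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<8}{description}")
```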


## Count distribution of linguistic features

### Create doc object

In [8]:
# load a text file
import os
# corresponds to: ../../data/au-history/au-history.txt
filename = os.path.join("..", "..", "data", "au-history", "au-history.txt")
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()

In [9]:
nlp = spacy.load("en_core_web_md")
# The text has been formatted (manually) with double line breaks between headers and paragraphs.
# One could also have done this with preprocessing if there had been many more documents!
# With the double line breaks, we tell spacy to split sentences on those.
nlp.add_pipe("sentencizer", config={"punct_chars": ["\n\n"]}, before="parser")

<spacy.pipeline.sentencizer.Sentencizer at 0x11b750c50>
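
To see what this sentencizer configuration does in isolation, here is a minimal sketch on a short made-up string. It assumes a blank English pipeline (`spacy.blank("en")`) instead of the trained model, so there is no parser and the `before="parser"` argument is left out; the names `nlp_blank`, `sample`, and `chunks` are illustrative only.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components,
# so no model download is needed for this check.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer", config={"punct_chars": ["\n\n"]})

# Hypothetical sample: a header and two paragraphs separated by blank lines.
sample = "Header\n\nFirst paragraph. It has two sentences.\n\nSecond paragraph."

# Strip the trailing line breaks that stay attached to each chunk.
chunks = [sent.text.strip() for sent in nlp_blank(sample).sents]

for chunk in chunks:
    print(repr(chunk))
```

In this blank pipeline, full stops inside a block no longer end a sentence, because `punct_chars` replaces the default punctuation list instead of extending it.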

In [10]:
# create a doc object
doc = nlp(text)

In [11]:
for sent in doc.sents:
    print(sent.text)
    print('-' * 50)

History


--------------------------------------------------
AU will soon be celebrating its 100th anniversary. Since its foundation in 1928, the university has evolved from 78 students to approx. 38,000 students today.


--------------------------------------------------
1928
Establishment
After a long, persistent struggle and strong unity among the citizens of Aarhus, university education finally opened its doors.

‘
--------------------------------------------------
University Teaching in Jutland’ started up with 78 students in rented premises, where they were taught French, English, German, Danish and introductory philosophy.


--------------------------------------------------
1933
The first university building is inaugurated
On 11 September 1933, King Christian X inaugurated the first university building. The building was designed by Kay Fisker and C. F. Møller, and its design became the template for the buildings in the University Park.
text
In 1934, the grass surrounding the bu

### Entities

In [12]:
# create empty list
entities = []

# add each entity to list
for ent in doc.ents:
    entities.append((ent.text, ent.label_))

In [13]:
entities[:10]

[('100th', 'ORDINAL'),
 ('1928', 'DATE'),
 ('78', 'CARDINAL'),
 ('38,000', 'CARDINAL'),
 ('today', 'DATE'),
 ('1928', 'DATE'),
 ('Aarhus', 'NORP'),
 ('University Teaching in Jutland', 'ORG'),
 ('78', 'CARDINAL'),
 ('French', 'LANGUAGE')]

In [14]:
print(set(entities))

{('the Danish University of Education', 'ORG'), ('C. F. Møller', 'PERSON'), ('the Department of Clinical Medicine', 'ORG'), ('the Department of Biomedicine', 'ORG'), ('1965-1977', 'DATE'), ('2018', 'DATE'), ('20', 'CARDINAL'), ('about 34,000', 'CARDINAL'), ('2020', 'DATE'), ('Technical Sciences', 'ORG'), ('Aarhus University’s', 'ORG'), ('2012', 'DATE'), ('The Institute of Business and Technology', 'ORG'), ('Five', 'CARDINAL'), ('another Nobel Prize', 'WORK_OF_ART'), ('French', 'LANGUAGE'), ('Dale T. Mortensen', 'PERSON'), ('Danish', 'NORP'), ('one', 'CARDINAL'), ('German', 'NORP'), ('first', 'ORDINAL'), ('Navitas', 'ORG'), ('the Aarhus School of Business', 'ORG'), ('the Danish Institute of Agricultural Sciences', 'ORG'), ('1933', 'DATE'), ('King Christian X', 'NORP'), ('many years', 'DATE'), ('1935-1942', 'DATE'), ('English', 'LANGUAGE'), ('Gestapo', 'ORG'), ('1992', 'DATE'), ('78', 'CARDINAL'), ('4 and 5', 'DATE'), ('the International Centre', 'FAC'), ('the Faculty of Medical Sciences
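
A list of `(text, label)` pairs also makes it easy to count how often each entity label occurs. The sketch below uses `collections.Counter` on the first ten pairs copied from the output above; it uses the separate name `sample_entities` so that it does not overwrite the full `entities` list.

```python
from collections import Counter

# The first ten (text, label) pairs, copied from the output above.
sample_entities = [
    ("100th", "ORDINAL"), ("1928", "DATE"), ("78", "CARDINAL"),
    ("38,000", "CARDINAL"), ("today", "DATE"), ("1928", "DATE"),
    ("Aarhus", "NORP"), ("University Teaching in Jutland", "ORG"),
    ("78", "CARDINAL"), ("French", "LANGUAGE"),
]

# Count how often each entity label occurs.
label_counts = Counter(label for _, label in sample_entities)

for label, count in label_counts.most_common():
    print(label, count)
```

Applying the same `Counter` to the full `entities` list gives the label distribution for the whole document.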

### Adjective frequency

In [15]:
# count number of adjectives
adjective_count = 0
for token in doc:
    if token.pos_ == "ADJ":
        adjective_count += 1

In [16]:
adjective_count

34

In [17]:
len(doc)

799

In [18]:
# find the relative frequency per 100 tokens
percent = (adjective_count/len(doc)) * 100

In [19]:
round(percent, 2)

4.26
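
The same calculation can be applied to every part-of-speech tag at once. The following is a minimal sketch: `relative_frequencies` is a hypothetical helper, and the tags are those of the sample sentence, copied from the POS-tag output earlier in this section.

```python
from collections import Counter

# POS tags of the sample sentence, copied from the POS-tag output above.
pos_tags = ["PRON", "NOUN", "AUX", "PROPN", "CCONJ", "PRON", "VERB",
            "NOUN", "ADP", "PROPN", "PROPN", "PROPN", "PUNCT"]


def relative_frequencies(tags, per=100):
    """Return the occurrences of each tag per `per` tokens, rounded to 2 decimals."""
    counts = Counter(tags)
    total = len(tags)
    return {tag: round(count / total * per, 2) for tag, count in counts.items()}


frequencies = relative_frequencies(pos_tags)

# Print the tags from most to least frequent.
for tag, value in sorted(frequencies.items(), key=lambda item: -item[1]):
    print(tag, value)
```

With a real ```Doc```, the list of tags could be built with `[token.pos_ for token in doc]`.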

## Creating neater outputs using ```pandas```

At the moment, all of our output from ```spaCy``` is held in Python lists. If we want to save it, it makes sense to use a more portable format, such as CSV or JSON.

One very easy way to do this with Python is by using the dataframe library ```pandas```.

In [20]:
import pandas as pd

In [21]:
# create a new Doc object from the sample sentence
doc = nlp(input_string)

In [22]:
annotations = []
for token in doc:
    annotations.append((token.text, token.pos_))

In [23]:
annotations

[('My', 'PRON'),
 ('name', 'NOUN'),
 ('is', 'AUX'),
 ('Ross', 'PROPN'),
 ('and', 'CCONJ'),
 ('I', 'PRON'),
 ('have', 'VERB'),
 ('family', 'NOUN'),
 ('in', 'ADP'),
 ('New', 'PROPN'),
 ('York', 'PROPN'),
 ('City', 'PROPN'),
 ('.', 'PUNCT')]

In [24]:
# spaCy doc to pandas dataframe
data = pd.DataFrame(annotations,
                    columns=["Token", "POS-tag"])

In [25]:
data

| | Token | POS-tag |
|---|---|---|
| 0 | My | PRON |
| 1 | name | NOUN |
| 2 | is | AUX |
| 3 | Ross | PROPN |
| 4 | and | CCONJ |
| 5 | I | PRON |
| 6 | have | VERB |
| 7 | family | NOUN |
| 8 | in | ADP |
| 9 | New | PROPN |


In [26]:
# save dataframe
output_path = os.path.join("output", "annotations.csv")
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # create the output folder if it does not exist
data.to_csv(output_path)
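
By default, `to_csv` also writes the dataframe index as a first column without a header, which shows up as `Unnamed: 0` when the file is read back. The sketch below shows the difference that `index=False` makes; it writes to an in-memory buffer instead of a file, and uses only the first three rows of the annotations as sample data.

```python
import io

import pandas as pd

# First three rows of the annotations shown above.
sample = pd.DataFrame([("My", "PRON"), ("name", "NOUN"), ("is", "AUX")],
                      columns=["Token", "POS-tag"])

# Default behaviour: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False leaves the index out, so only the two real columns come back.
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index is a design choice: it is redundant here because it only repeats the row number, but it may be worth keeping when the index carries meaning.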