## Introduction to ```spaCy```

There are a number of different NLP frameworks that you're likely to encounter. The most popular and widely used of these are:

- ```NLTK``` (Natural Language Toolkit, old-school)
- ```UDPipe``` (Neural network based, fast and light, but not super accurate)
- ```CoreNLP``` and ```stanza``` (Created by the team at Stanford; academically robust)
- ```spaCy``` (production-ready, well-documented, state-of-the-art)

We'll be working with ```spaCy``` in this module, primarily because it's easy and intuitive, and also scales well.

The first thing we need to do is install ```spaCy``` and download the language model that we want to use.

```
$ source ./lang101/bin/activate
$ pip install pandas
$ pip install spacy 
$ python -m spacy download en_core_web_sm
$ deactivate 
```
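To confirm that the installation worked, a quick check along the following lines can be run from inside the activated environment. This is only a minimal sketch: the variable name model_installed is illustrative, and it assumes the model name used in the commands above.

```python
import spacy
from spacy.util import is_package

# Report the installed spaCy version
print("spaCy version:", spacy.__version__)

# is_package() returns True if an installed package with this name is found
model_installed = is_package("en_core_web_sm")
print("en_core_web_sm installed:", model_installed)
```

If the second line prints False, the download step most likely ran outside the virtual environment.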

## Initializing ```spaCy```

The first thing we need to do is import ```spaCy``` __and__ load the language model that we want to use.

Note that if you want to work with a different language, you need to load a language model trained for that language.

In [4]:
# create a spacy NLP object
import spacy
nlp = spacy.load("en_core_web_sm")
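
As a rough illustration of the note on other languages, the sketch below builds tokenizer-only pipelines with spacy.blank, which needs no model download. The German sentence and the variable names are made up for the example; a blank pipeline has no tagger or parser, so it only shows that each language brings its own pipeline.

```python
import spacy

# spacy.blank() builds a pipeline containing only the language-specific
# tokenizer, so nothing has to be downloaded first.
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

# Each pipeline records the language it was built for
print(nlp_en.lang, nlp_de.lang)

# Each pipeline applies the tokenization rules of its own language
german_doc = nlp_de("Das ist ein Testsatz.")
german_tokens = [token.text for token in german_doc]
print(german_tokens)
```

A trained German pipeline would instead be downloaded and loaded by name, for example de_core_news_sm, in the same way as en_core_web_sm.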

With the model now loaded, we can begin to do some very simple NLP tasks.

Here, we create a spaCy object and assign it to the variable ```nlp```. This is the NLP pipeline that will do all our heavy lifting, using the trained model we've specified.

Below, you can see what the pipeline does with a bit of sample text. Passing text to the nlp object returns a Doc object that gives us access to a range of properties, including tokens (words), parts of speech, named entities, and so on. Here we'll focus on two of them, tokens and entities. These objects, in turn, have their own attributes and methods attached to them. A full outline of what is available can be found in the spaCy docs.

In this case, for all token objects, let's return the token itself (token.text); its part-of-speech tag (token.pos_); and the grammatical dependency relations between the tokens (token.dep_).


In [59]:
doc = nlp("This is a test sentence written in English. You're a good worker.")

__Tokenize__

In [60]:
for token in doc:
    print(token.text)

This
is
a
test
sentence
written
in
English
.
You
're
a
good
worker
.


__Trying some more attributes__

In [62]:
for token in doc:
    print(token.i, token.text, token.lemma_)

0 This this
1 is be
2 a a
3 test test
4 sentence sentence
5 written write
6 in in
7 English English
8 . .
9 You -PRON-
10 're be
11 a a
12 good good
13 worker worker
14 . .


## Count distribution of linguistic features

__Create doc object__

In [None]:
with open("example.txt", "r", encoding="utf-8") as file:
    text = file.read()

In [64]:
doc = nlp(text)

In [68]:
# Count the tokens tagged as adjectives
adjective_count = 0
for token in doc:
    if token.pos_ == "ADJ":
        adjective_count += 1

__Relative frequency__

In [73]:
relative_freq = adjective_count / len(doc) * 1000

In [136]:
print(f"This text has a relative frequency of {int(relative_freq)} adjectives per 1000 tokens")

This text has a relative frequency of 73 adjectives per 1000 tokens
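
The same calculation can be wrapped in a small helper so it can be reused for other features. This is a sketch only: relative_frequency is a hypothetical function name, and the numbers in the example are made up rather than taken from example.txt.

```python
def relative_frequency(count, total, per=1000):
    """Return the rate of a feature per 'per' tokens (1000 by default)."""
    if total == 0:
        # An empty document has no meaningful rate
        return 0.0
    return count * per / total


# Made-up numbers: 12 adjectives in a 150-token document
freq = relative_frequency(12, 150)
print(freq)  # 80.0

# int() truncates while round() rounds to the nearest integer
print(int(79.6), round(79.6))  # 79 80
```

Because int() drops the fractional part, the figure printed above may be slightly lower than the rounded value would be.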
