# Linguistics Overview

A variety of tools exist to support the linguistic analysis of texts.

In Python, these include:

- [`nltk`](https://www.nltk.org/), the Natural Language Toolkit (NLTK), a rich and powerful toolkit for analysing texts that includes a wide range of reference texts and corpora;
- [`spaCy`](https://spacy.io/), an "industrial strength natural language processing toolkit" that provides easy to use, yet fast and powerful, language processing features.

## Natural Language Toolkit (NLTK)

NLTK is a long-lived Python project that was arguably the dominant NLP toolkit for many years.

In [2]:
%%capture
try:
    import nltk
except ImportError:
    # Install the package only if it is not already available
    %pip install nltk

In [6]:
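import nltk

# word_tokenize() needs the Punkt tokenizer models; recent NLTK
# releases name this resource 'punkt_tab' rather than 'punkt'
nltk.download('punkt')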

sentence = """This is demonstration of how
the "NLTK" package can tokenise a sentence; simple, but effective."""

tokens = nltk.word_tokenize(sentence)
tokens

['This',
 'is',
 'demonstration',
 'of',
 'how',
 'the',
 '``',
 'NLTK',
 "''",
 'package',
 'can',
 'tokenise',
 'a',
 'sentence',
 ';',
 'simple',
 ',',
 'but',
 'effective',
 '.']
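
Note that the tokeniser has rendered the double quotes around NLTK as the Penn Treebank-style tokens `` and '', and that each punctuation mark is returned as a token in its own right. If only the word tokens are of interest, the punctuation can be filtered out. The following is a minimal sketch that uses only the standard library and a copy of the token list shown above; the `is_word` helper is a hypothetical name introduced here for illustration and is not part of NLTK.

```python
# Copy of the token list produced by nltk.word_tokenize() above
tokens = ['This', 'is', 'demonstration', 'of', 'how', 'the', '``', 'NLTK',
          "''", 'package', 'can', 'tokenise', 'a', 'sentence', ';',
          'simple', ',', 'but', 'effective', '.']


def is_word(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# Keep only the tokens that look like words
words = [t for t in tokens if is_word(t)]
print(words)
```

Testing for at least one alphanumeric character, rather than comparing against a fixed list of punctuation marks, means the quote tokens are dropped without needing to be listed explicitly.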

The tokens can also be tagged according to the part of speech (POS) they represent:

In [8]:
nltk.download('averaged_perceptron_tagger')

nltk.pos_tag(tokens)

[('This', 'DT'),
 ('is', 'VBZ'),
 ('demonstration', 'NN'),
 ('of', 'IN'),
 ('how', 'WRB'),
 ('the', 'DT'),
 ('``', '``'),
 ('NLTK', 'NNP'),
 ("''", "''"),
 ('package', 'NN'),
 ('can', 'MD'),
 ('tokenise', 'VB'),
 ('a', 'DT'),
 ('sentence', 'NN'),
 (';', ':'),
 ('simple', 'NN'),
 (',', ','),
 ('but', 'CC'),
 ('effective', 'JJ'),
 ('.', '.')]
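
The tags follow the Penn Treebank convention: for example, `DT` marks a determiner, `NN` a singular noun and `VBZ` a third person singular present tense verb. Because the result is a plain list of `(token, tag)` pairs, it can be regrouped with ordinary Python. The following is a minimal sketch that works on a copy of the output above rather than calling the tagger again; the variable names are illustrative only.

```python
from collections import defaultdict

# Copy of the (token, tag) pairs returned by nltk.pos_tag() above
tagged = [('This', 'DT'), ('is', 'VBZ'), ('demonstration', 'NN'),
          ('of', 'IN'), ('how', 'WRB'), ('the', 'DT'), ('``', '``'),
          ('NLTK', 'NNP'), ("''", "''"), ('package', 'NN'), ('can', 'MD'),
          ('tokenise', 'VB'), ('a', 'DT'), ('sentence', 'NN'), (';', ':'),
          ('simple', 'NN'), (',', ','), ('but', 'CC'), ('effective', 'JJ'),
          ('.', '.')]

# Build a lookup from each tag to the tokens that received it
by_tag = defaultdict(list)
for token, tag in tagged:
    by_tag[tag].append(token)

# Tags starting with "NN" cover the noun tags (NN, NNS, NNP, NNPS)
nouns = [token for token, tag in tagged if tag.startswith('NN')]

print(nouns)
print(by_tag['DT'])
```

Note that `simple` is listed among the nouns here. The tagger is a statistical model, so its choices will not always match those a human annotator would make.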

## `spaCy`

With its growing ecosystem of plugins and its simplicity of use, the `spaCy` package provides an effective toolkit for working with natural language texts in an increasing number of languages.

Another attractive feature when authoring rich texts is the good range of visualisers it provides.

In [12]:
%%capture
try:
    import spacy
except ImportError:
    # Install the package only if it is not already available
    %pip install spacy

In [39]:
text = """
The spaCy package uses a range of pretrained models to parse provided sentences.

Named entity recognition allows names such as John Smith, Managing Director of FooBar Ltd., to be easily
extracted from a text.
"""

Consider the following text:

In [40]:
print(text)


The spaCy package uses a range of pretrained models to parse provided sentences.

Named entity recognition allows names such as John Smith, Managing Director of FooBar Ltd., to be easily
extracted from a text.



We can parse the document and display any named entities it contains:

In [42]:
# Load English tokenizer, tagger, parser and NER
# (the model must be installed first, for example with:
#  python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Parse the document
doc = nlp(text)

# Extract entities
for entity in doc.ents:
    print(entity.text, entity.label_)

John Smith PERSON
FooBar Ltd. ORG
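
Each entity in `doc.ents` exposes its text as `.text` and its label as `.label_`, as used in the loop above. When a document contains many entities, it can be convenient to collect them by label. The following is a minimal sketch of one way to do that; `group_entities` is a hypothetical helper, not a spaCy function, and stand-in objects are used so that the example runs without a model being loaded.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(entities):
    """Group entity texts by label.

    Accepts any iterable of objects with `.text` and `.label_`
    attributes, such as spaCy's `doc.ents`.
    """
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.label_].append(entity.text)
    return dict(grouped)


# Stand-ins mirroring the two entities printed above;
# in the notebook, call group_entities(doc.ents) instead.
example_ents = [
    SimpleNamespace(text="John Smith", label_="PERSON"),
    SimpleNamespace(text="FooBar Ltd.", label_="ORG"),
]

print(group_entities(example_ents))
```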


Visualisation tools exist that can highlight named entities within a text:

In [43]:
from spacy import displacy

displacy.render(doc, style="ent")

The `spaCy` package can also be used to diagram the dependency relations between the words in a sentence, along with their parts of speech.

In [34]:
doc = nlp("The cat sat on the mat.")

displacy.render(doc, style="dep")

A more compact view is also available:

In [37]:
displacy.render(doc, style="dep", options={"compact": True})