# Linguistic Analysis

Let's look at the different levels of linguistic analysis. We will use the `spacy` library for many of the analyses. First, we need to import it and load the appropriate model (here, the one for English):

In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

## Usage

We can now call `nlp()` as a function on any text. By default, it will perform a number of analyses:
- tokenization
- sentence splitting
- lemmatization
- part of speech tagging
- dependency parsing
- named entity recognition

To speed up processing, we can disable the components we do not need when loading the model. The tokenizer always runs and cannot be disabled:
```
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])
```
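
Disabling components is one way to save time. Another, sketched below, is to hand many texts to `nlp.pipe()` at once and keep the resulting `Doc` objects, rather than calling `nlp()` again for every analysis. The sketch assumes a tokenizer-only pipeline from `spacy.blank('en')`, purely so that it runs without a downloaded model; the same pattern should apply to `en_core_web_sm`. The variable names are made up for illustration.

```python
import spacy

# assumption: a blank English pipeline (tokenizer only), so no model download is needed
blank_nlp = spacy.blank('en')

sample_texts = ['Call me Ishmael .', 'This is my substitute for pistol and ball .']

# nlp.pipe() takes an iterable of texts and yields one Doc per text
docs = list(blank_nlp.pipe(sample_texts))

# the stored Doc objects can be reused without running the pipeline again
token_counts = [len(doc) for doc in docs]
first_words = [doc[0].text for doc in docs]

print(token_counts)
print(first_words)
```

Keeping the `Doc` objects in a list trades memory for time, which is usually a good deal for a few hundred short lines.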


Calling `nlp()` on a text returns a `Doc` object. Iterating over it yields the tokens, and its `sents` property yields the sentences. Each token has a range of attributes (see [here](https://spacy.io/api/token#attributes)). We will use a few of them in the following:

- `text`: the actual word
- `lemma_`: the dictionary entry of a word
- `pos_`: the part of speech
- `dep_`: the dependency relation
- `is_punct`: check whether word is punctuation
- `is_stop`: check whether word is a stop word
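
To get a feel for these attributes, the sketch below prints a few of them for one sentence from the text. It assumes a tokenizer-only pipeline from `spacy.blank('en')` so that it runs without a model. For that reason it only shows `text`, `is_punct`, and `is_stop`; `lemma_`, `pos_`, and `dep_` need the trained components of a model such as `en_core_web_sm`.

```python
import spacy

# assumption: tokenizer-only pipeline, enough for the lexical attributes shown here
blank_nlp = spacy.blank('en')
doc = blank_nlp('It is a way I have of driving off the spleen .')

# collect three attributes per token
rows = [(token.text, token.is_punct, token.is_stop) for token in doc]
for text, is_punct, is_stop in rows:
    print(f'{text:10} punct={is_punct!s:6} stop={is_stop}')

# a common use: drop punctuation and stop words
content_words = [
    token.text for token in doc
    if not token.is_punct and not token.is_stop
]
print(content_words)
```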

First, we need some text. Let's take *Moby Dick* from Project Gutenberg, as a list of strings. The file is in the data/ folder. To use the code here, make sure that folder is in the same parent folder as this folder.

You will notice that the text is already tokenized by separating words with spaces. In other data sets, this is likely not the case.

In [5]:
# read the file; strip() removes leading and trailing whitespace from each line
with open('Moby Dick.txt', encoding='utf8') as text_file:
    documents = [line.strip() for line in text_file]

# let's take a look
print(documents[1:10])

['Call me Ishmael .', 'Some years ago -- never mind how long precisely -- having little or no money in my purse , and nothing particular to interest me on shore , I thought I would sail about a little and see the watery part of the world .', 'It is a way I have of driving off the spleen and regulating the circulation .', "Whenever I find myself growing grim about the mouth ; whenever it is a damp , drizzly November in my soul ; whenever I find myself involuntarily pausing before coffin warehouses , and bringing up the rear of every funeral I meet ; and especially whenever my hypos get such an upper hand of me , that it requires a strong moral principle to prevent me from deliberately stepping into the street , and methodically knocking people ' s hats off -- then , I account it high time to get to sea as soon as I can .", 'This is my substitute for pistol and ball .', 'With a philosophical flourish Cato throws himself upon his sword ; I quietly take to the ship .', 'There is nothing su

## Tokenization
Tokenization splits a text into individual tokens (words and punctuation marks). In this file the tokens are already separated by spaces, but each line is still a single string, so we need to turn it into a list of tokens. (On raw text, `spacy` works out the token boundaries itself.)

In [6]:
# this creates a list of lists of tokens (one list per sentence)
tokens = [
    [token.text for token in nlp(sentence)] 
    for sentence in documents[:100] # iterate over first 100 documents
]

print(tokens)

[['Loomings', '.'], ['Call', 'me', 'Ishmael', '.'], ['Some', 'years', 'ago', '--', 'never', 'mind', 'how', 'long', 'precisely', '--', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', ',', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', ',', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '.'], ['It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', '.'], ['Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', ';', 'whenever', 'it', 'is', 'a', 'damp', ',', 'drizzly', 'November', 'in', 'my', 'soul', ';', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', ',', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'I', 'meet', ';', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such', 'an', 'upper', 'hand', 'of', 'me', ',', 't
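
Because the *Moby Dick* file is pre-tokenized, splitting on spaces would give much the same result here. On raw text the two approaches differ, as this small sketch suggests. The example sentence is made up, and `spacy.blank('en')` is assumed so that only the tokenizer runs.

```python
import spacy

# assumption: tokenizer-only pipeline, no model download needed
blank_nlp = spacy.blank('en')

# hypothetical raw text that has not been pre-tokenized
raw_text = "Don't panic, Mr. Smith!"

whitespace_tokens = raw_text.split()
spacy_tokens = [token.text for token in blank_nlp(raw_text)]

print(whitespace_tokens)
print(spacy_tokens)
```

The whitespace split keeps punctuation glued to the words (`'panic,'`), while the `spacy` tokenizer separates it into tokens of its own.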

## Lemmatization
We want to get the dictionary form of each word, to reduce variation.

In [7]:
lemmas = [
    [token.lemma_ for token in nlp(sentence)]
    for sentence in documents[:100]
]

# let's compare the two versions
print(tokens[2])
print()
print(lemmas[2])

['Some', 'years', 'ago', '--', 'never', 'mind', 'how', 'long', 'precisely', '--', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', ',', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', ',', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '.']

['some', 'year', 'ago', '--', 'never', 'mind', 'how', 'long', 'precisely', '--', 'have', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', ',', 'and', 'nothing', 'particular', 'to', 'interest', 'I', 'on', 'shore', ',', 'I', 'think', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '.']


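You'll notice that inflected forms have been reduced to their dictionary form (e.g., *years* -> *year*, *having* -> *have*, *thought* -> *think*), and that object pronouns (e.g., *me*) have been mapped to their subject form (*I*). Capitalized words such as *Some* have also been lowercased.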

## Stemming
A more aggressive way of removing variation is *stemming*. `spacy` does not provide stemming, but the `nltk` package does. Stemming uses a set of rules to strip word endings. Since those endings are language-specific, we need to specify the language.
This process is much slower than `spacy`'s lemmatizer.

In [8]:
from nltk import SnowballStemmer

# initialize English stemmer
stemmer = SnowballStemmer('english')

stems = [
    [stemmer.stem(token) for token in sentence]
    for sentence in tokens
]
print(stems[2])

['some', 'year', 'ago', '--', 'never', 'mind', 'how', 'long', 'precis', '--', 'have', 'littl', 'or', 'no', 'money', 'in', 'my', 'purs', ',', 'and', 'noth', 'particular', 'to', 'interest', 'me', 'on', 'shore', ',', 'i', 'thought', 'i', 'would', 'sail', 'about', 'a', 'littl', 'and', 'see', 'the', 'wateri', 'part', 'of', 'the', 'world', '.']


You'll notice that some of the results do not look like words anymore, e.g., *precis* instead of *precisely*.
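
The point of stemming is that different forms of a word collapse onto one string, which shrinks the vocabulary. A minimal sketch of that effect, using a hand-picked list of word forms (an assumption for illustration, not taken from the text):

```python
from nltk import SnowballStemmer

stemmer = SnowballStemmer('english')

# hypothetical word forms chosen for illustration
word_forms = ['sail', 'sails', 'sailed', 'sailing']

# map each form to its stem and collect the distinct stems
stems_by_word = {word: stemmer.stem(word) for word in word_forms}
unique_stems = set(stems_by_word.values())

print(stems_by_word)
print(len(word_forms), 'word forms ->', len(unique_stems), 'stem(s)')
```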

## Parts of speech
We can extract the part of speech for every word with the `pos_` attribute.

In [9]:
pos = [
    [token.pos_ for token in nlp(sentence)] 
    for sentence in documents[:100]
]
print(tokens[1])
print()
print(pos[1])

['Call', 'me', 'Ishmael', '.']

['VERB', 'PRON', 'PROPN', 'PUNCT']
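
One common use of these tags is to keep only content words. The sketch below does that for the sentence above, with the tokens and tags copied from the printed output so that it runs without a model. Which tags count as "content" is an assumption made here, not a `spacy` setting.

```python
# tokens and tags copied from the output above
sentence_tokens = ['Call', 'me', 'Ishmael', '.']
sentence_pos = ['VERB', 'PRON', 'PROPN', 'PUNCT']

# assumption: these tags are treated as content words for this illustration
content_tags = {'NOUN', 'PROPN', 'VERB', 'ADJ', 'ADV'}

# keep a token only if its tag is in the chosen set
content_words = [
    token for token, tag in zip(sentence_tokens, sentence_pos)
    if tag in content_tags
]
print(content_words)  # ['Call', 'Ishmael']
```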


## Named Entities
Named entity recognition finds spans of text that refer to entities such as people, organizations, or dates, and assigns each one a semantic type. We use the `ents` property of the `Doc` object, which contains only the named entities (NEs):

In [10]:
nlp('John gave a book to Mary and Celia in Cardiff').ents

(John, Mary, Celia)

For each entity, we can get its text and its NE label:

In [11]:
entities = [[(entity.text, entity.label_) 
             for entity in nlp(sentence).ents]
            for sentence in documents[:10]]

entities

[[],
 [],
 [('Some years ago', 'DATE')],
 [],
 [('November', 'DATE')],
 [],
 [('Cato', 'ORG')],
 [],
 [],
 [('Indian', 'NORP')]]
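
To summarize which entity types occur, the labels can be counted. The sketch below does so for the ten sentences above, with the entity lists copied from the printed output so that it runs without a model.

```python
from collections import Counter

# entity lists copied from the output above (one list per sentence)
entities = [
    [], [], [('Some years ago', 'DATE')], [], [('November', 'DATE')],
    [], [('Cato', 'ORG')], [], [], [('Indian', 'NORP')],
]

# flatten the nested lists and count each label
label_counts = Counter(
    label
    for sentence_entities in entities
    for _, label in sentence_entities
)
print(label_counts.most_common())  # [('DATE', 2), ('ORG', 1), ('NORP', 1)]
```

The labels are model predictions and can be wrong: here *Cato*, a person, is tagged `ORG`, and in the earlier example *Cardiff* was not recognized at all.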

## Parsing
For each word, we can extract the word it is grammatically related to (its **head** word), plus the type of the **dependency** relation.
<img src='parse.png' width=400px />
The `spacy` parser creates properties for each word to track those elements:

In [12]:
parses = [
    [(word.text, word.head.text, word.dep_) for word in nlp(sentence)] 
    for sentence in documents[:10]
]
parses

[[('Loomings', 'Loomings', 'ROOT'), ('.', 'Loomings', 'punct')],
 [('Call', 'Call', 'ROOT'),
  ('me', 'Call', 'dobj'),
  ('Ishmael', 'Call', 'oprd'),
  ('.', 'Call', 'punct')],
 [('Some', 'years', 'det'),
  ('years', 'ago', 'npadvmod'),
  ('ago', 'mind', 'advmod'),
  ('--', 'mind', 'punct'),
  ('never', 'mind', 'neg'),
  ('mind', 'mind', 'ROOT'),
  ('how', 'long', 'advmod'),
  ('long', 'precisely', 'advmod'),
  ('precisely', 'mind', 'advmod'),
  ('--', 'mind', 'punct'),
  ('having', 'mind', 'advcl'),
  ('little', 'money', 'amod'),
  ('or', 'little', 'cc'),
  ('no', 'little', 'conj'),
  ('money', 'having', 'dobj'),
  ('in', 'money', 'prep'),
  ('my', 'purse', 'poss'),
  ('purse', 'in', 'pobj'),
  (',', 'having', 'punct'),
  ('and', 'having', 'cc'),
  ('nothing', 'interest', 'nsubj'),
  ('particular', 'nothing', 'amod'),
  ('to', 'interest', 'aux'),
  ('interest', 'having', 'conj'),
  ('me', 'interest', 'dobj'),
  ('on', 'interest', 'prep'),
  ('shore', 'on', 'pobj'),
  (',', 'thought', 
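
The triples become easier to read when the dependents are grouped under their head word. The sketch below does this for the second sentence, with the triples copied from the output above so that it runs without a model. Keying on the word text is a simplification assumed here; in a sentence with repeated words, the token index (`token.i`) would be a safer key.

```python
from collections import defaultdict

# (word, head, relation) triples copied from the second sentence in the output above
parse = [
    ('Call', 'Call', 'ROOT'),
    ('me', 'Call', 'dobj'),
    ('Ishmael', 'Call', 'oprd'),
    ('.', 'Call', 'punct'),
]

# the root is the word labeled ROOT (it is its own head)
root = next(word for word, head, dep in parse if dep == 'ROOT')

# group every other word under its head
children = defaultdict(list)
for word, head, dep in parse:
    if dep != 'ROOT':
        children[head].append((dep, word))

print(root)            # Call
print(children[root])  # [('dobj', 'me'), ('oprd', 'Ishmael'), ('punct', '.')]
```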

Instead of doing this on a word-by-word basis, we can work with larger chunks, the **noun phrases**. This ignores any parts of the sentence that do not "hang off" a noun of some sort. The `Doc` object has a special `noun_chunks` property for those:

In [13]:
noun_parse = [
    [(chunk.text, chunk.root.head.text, chunk.root.dep_) for chunk in nlp(sentence).noun_chunks]
    for sentence in documents[:100]
]
noun_parse

[[('Loomings', 'Loomings', 'ROOT')],
 [('me', 'Call', 'dobj'), ('Ishmael', 'Call', 'oprd')],
 [('little or no money', 'having', 'dobj'),
  ('my purse', 'in', 'pobj'),
  ('nothing', 'interest', 'nsubj'),
  ('me', 'interest', 'dobj'),
  ('shore', 'on', 'pobj'),
  ('I', 'thought', 'nsubj'),
  ('I', 'sail', 'nsubj'),
  ('the watery part', 'see', 'dobj'),
  ('the world', 'of', 'pobj')],
 [('It', 'is', 'nsubj'),
  ('a way', 'is', 'attr'),
  ('I', 'have', 'nsubj'),
  ('the spleen', 'off', 'pobj'),
  ('the circulation', 'regulating', 'dobj')],
 [('I', 'find', 'nsubj'),
  ('myself', 'growing', 'nsubj'),
  ('the mouth', 'about', 'pobj'),
  ('it', 'is', 'nsubj'),
  ('a damp', 'is', 'attr'),
  ('my soul', 'in', 'pobj'),
  ('I', 'find', 'nsubj'),
  ('myself', 'pausing', 'nsubj'),
  ('coffin warehouses', 'before', 'pobj'),
  ('the rear', 'bringing', 'dobj'),
  ('every funeral', 'of', 'pobj'),
  ('I', 'meet', 'nsubj'),
  ('my hypos', 'get', 'nsubj'),
  ('such an upper hand', 'get', 'dobj'),
  ('me', 