# Advanced Text Analysis with SpaCy and Scikit-Learn

This notebook was originally prepared for the workshop [Advanced Text Analysis with SpaCy and Scikit-Learn](http://dhweek.nycdh.org/event/advanced-text-analysis-with-spacy-and-scikit-learn/), presented as part of NYCDH Week 2017. Here, we try out features of the SpaCy library for natural language processing. We also use some text analysis techniques from the Scikit-Learn library. 

## Installation

Installing this software is easiest on a Linux-like system. If you're not already running Linux, you can easily download a distribution and copy it to a USB disk, which you can then boot from. I recommend getting [DH-USB](https://github.com/DH-Box/dh-usb), a Linux-based operating system made for the Digital Humanities. DH-USB already has all of this software installed. 

If you have a different Linux-like system (including, to greater or lesser degrees, Ubuntu, MacOS, Cygwin, and Bash for Windows), you should be able to run these commands to install SpaCy, Scikit-Learn, Pandas, and the other required libraries. Ete3, a library for tree visualization, is optional. 

```bash
sudo pip install spacy scikit-learn pandas matplotlib seaborn ete3
```

Note that if your system has Python 2 as the default, instead of Python 3, you might have to run `pip3` instead of `pip`, and `python3` instead of `python`. 

Now download the SpaCy data with this command: 

```bash
python -m spacy.en.download all
```

To get my sent2tree library and all the sample data, simply `git clone` the repository where this notebook lives: 

```bash
git clone https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017.git
```

In [None]:
import spacy
import pandas as pd
import numpy as np
from collections import Counter
from glob import glob
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Display plots in this notebook, instead of externally, 
# and make them large by default. 
from pylab import rcParams
rcParams['figure.figsize'] = 16, 8
%matplotlib inline

# The following are optional dependencies. 
# Feel free to comment these out. 
# Sent2tree uses the sent2tree.py module in this repository. 
from sent2tree import sentenceTree
import ete3 
import seaborn

In [None]:
# This command might take a little while. 
nlp = spacy.load('en')

The sample data is the script of the 1975 film _Monty Python and the Holy Grail_, taken from the NLTK Book corpus, and the Project Gutenberg edition of Jane Austen's novel _Pride and Prejudice_. 

In [None]:
grail_raw = open('grail.txt').read()
pride_raw = open('pride.txt').read()

In [None]:
# Parse the texts. These commands might take a little while. 
grail = nlp(grail_raw)
pride = nlp(pride_raw)

# Exploring the Document

Each SpaCy document is already tokenized into words, which are accessible by indexing, slicing, or iterating over the document: 

In [None]:
pride[0]

In [None]:
pride[:10]

You can also iterate over the sentences. `doc.sents` is a generator object, so we can use `next()`: 

In [None]:
next(pride.sents)

Or you can force it into a list, and then do things with it: 

In [None]:
prideSents = list(pride.sents)
prideSents[0]
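
The difference between a generator and a list can be sketched without SpaCy at all. The `sentences` function below is a hypothetical stand-in that splits naively on periods; it is not how SpaCy segments sentences, and it only illustrates how `next()` and `list()` consume a generator.

```python
def sentences(text):
    """Yield sentences lazily, splitting naively on periods (illustration only)."""
    for chunk in text.split('.'):
        chunk = chunk.strip()
        if chunk:
            yield chunk

toy = 'It is a truth. Universally acknowledged. That a single man.'

sentGen = sentences(toy)
first = next(sentGen)            # consumes the first sentence
rest = list(sentGen)             # only the remaining sentences are left
allSents = list(sentences(toy))  # a fresh generator yields everything again
```

Once a generator has been partly consumed, forcing it into a list returns only what is left, which is why it is convenient to store the full list under its own name.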

For example, let's find the longest sentence(s) in _Pride and Prejudice_: 

In [None]:
prideSentenceLengths = [len(sent) for sent in prideSents]
# Compute the maximum once, rather than once per sentence. 
longest = max(prideSentenceLengths)
[sent for sent in prideSents if len(sent) == longest]

## Exploring Words

Each word has a crazy number of properties: 

In [None]:
pride[4]

In [None]:
[prop for prop in dir(pride[4]) if not prop.startswith('_')]

Using just the indices (`.i`), we can make a lexical dispersion plot showing where a given word occurs in the novel. (This is just the SpaCy equivalent of the lexical dispersion plot from the NLTK Book, chapter 1.) 

In [None]:
pride[4].i

In [None]:
def locations(needle, haystack): 
    """ 
    Make a list of locations, bin those into a histogram, 
    and finally put it into a Pandas Series object so that we
    can later make it into a DataFrame. 
    """
    # Fix the histogram range to the whole document, so that 
    # every word is binned against the same 50 segments. 
    return pd.Series(np.histogram(
        [word.i for word in haystack 
         if word.text.lower() == needle], 
        bins=50, range=(0, len(haystack)))[0])

In [None]:
# I have no idea why I have to keep running this. 
rcParams['figure.figsize'] = 16, 8

pd.DataFrame(
    {name: locations(name.lower(), pride) 
     for name in ['Elizabeth', 'Darcy', 'Jane', 'Bennet']}
).plot(subplots=True)

See if you can tell which characters end up getting together at the end, just based on this plot. 
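
The binning behind a dispersion plot can be sketched in plain Python. This is a minimal illustration, assuming a made-up token list and a hypothetical helper called `binPositions`; it divides the text into equal segments and counts how many occurrences of a word fall into each.

```python
def binPositions(positions, total, bins=5):
    """Count how many token positions fall into each of `bins` equal segments."""
    counts = [0] * bins
    for pos in positions:
        # Integer arithmetic maps a position in [0, total) to a bin index.
        index = min(pos * bins // total, bins - 1)
        counts[index] += 1
    return counts

# A made-up token list, for illustration only.
tokens = ('the knights who say ni demand a shrubbery '
          'or the knights will say ni again').split()

niPositions = [i for i, token in enumerate(tokens) if token == 'ni']
niBins = binPositions(niPositions, len(tokens))
knightBins = binPositions(
    [i for i, token in enumerate(tokens) if token == 'knights'], len(tokens))
```

Because every word is binned against the same segments of the whole text, the resulting rows can be compared directly, one above the other.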

# Exploring Named Entities

Named entities can be accessed through `doc.ents`. Let's find all the types of named entities from _Monty Python and the Holy Grail_: 

In [None]:
set([w.label_ for w in grail.ents])

What about those that are works of art? 

In [None]:
[ent for ent in grail.ents if ent.label_ == 'WORK_OF_ART']

Place names? 

In [None]:
[ent for ent in grail.ents if ent.label_ == 'GPE']

Organizations? 

In [None]:
set([ent.string.strip() for ent in grail.ents if ent.label_ == 'ORG'])

How about groups of people? 

In [None]:
set([ent.string.strip() for ent in grail.ents if ent.label_ == 'NORP'])

"French" here refers to French _people_, not the French language. We can verify that by getting all the sentences in which this particular type of entity occurs: 

In [None]:
frenchPeople = [ent for ent in grail.ents if ent.label_ == 'NORP' and ent.string.strip() == 'French']
[ent.sent for ent in frenchPeople]
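
Rather than filtering once per label, the entities could also be grouped by label in a single pass. The sketch below uses hand-written `(text, label)` pairs in place of real SpaCy entities, so the names and labels are illustrative assumptions, not parser output.

```python
from collections import defaultdict

# Hypothetical (text, label) pairs standing in for the items of doc.ents.
toyEntities = [('Camelot', 'GPE'), ('French', 'NORP'), ('Arthur', 'PERSON'),
               ('England', 'GPE'), ('French', 'NORP')]

# Collect the distinct entity texts under each label.
byLabel = defaultdict(set)
for text, label in toyEntities:
    byLabel[label].add(text)

labels = sorted(byLabel)
```

With real data, the pairs would come from something like `(ent.string.strip(), ent.label_)` for each entity in `grail.ents`.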

# Parts of Speech

First, let's get the noun chunks: 

In [None]:
list(pride.noun_chunks)

In [None]:
# Make a quick-and-dirty lookup table of POS IDs, 
# since the default representation of a POS is numeric. 
tagDict = {w.pos: w.pos_ for w in pride} 

What's the distribution of parts of speech in these two texts? 

In [None]:
grailPOS = pd.Series(grail.count_by(spacy.attrs.POS))/len(grail)
pridePOS = pd.Series(pride.count_by(spacy.attrs.POS))/len(pride)

rcParams['figure.figsize'] = 16, 8
df = pd.DataFrame([grailPOS, pridePOS], index=['Grail', 'Pride'])
df.columns = [tagDict[column] for column in df.columns]
df.T.plot(kind='bar')
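
The arithmetic in the cell above amounts to counting numeric tag IDs, dividing by the document length, and translating the IDs back into names. Here is a minimal sketch of that idea, with made-up IDs that are not SpaCy's actual values:

```python
from collections import Counter

# Hypothetical numeric IDs standing in for SpaCy's internal POS IDs.
toyTagNames = {1: 'NOUN', 2: 'VERB', 3: 'PUNCT'}
toyTagIds = [1, 2, 1, 3, 1, 3, 2, 1]

# Count each ID, which is roughly what doc.count_by() returns.
counts = Counter(toyTagIds)

# Divide by the number of tokens and swap the IDs for readable names.
proportions = {toyTagNames[tagId]: n / len(toyTagIds)
               for tagId, n in counts.items()}
```

Dividing by the length is what makes a short film script and a long novel comparable on the same chart.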

Now we can see, for instance, what the most common punctuation marks might be. 

In [None]:
pridePunct = [w for w in pride if w.pos_ == 'PUNCT']
Counter([w.string.strip() for w in pridePunct]).most_common(10)

In [None]:
grailPunct = [w for w in grail if w.pos_ == 'PUNCT']
Counter([w.string.strip() for w in grailPunct]).most_common(10)

Let's try this on the level of a sentence. First, let's get all the sentences in which Sir Robin is explicitly mentioned: 

In [None]:
robinSents = [sent for sent in grail.sents if 'Sir Robin' in sent.string]
robinSents

Now let's analyze just one of these sentences.

In [None]:
r2 = robinSents[2]
r2

Let's look at the tags and parts of speech: 

In [None]:
for word in r2: 
    print(word, word.tag_, word.pos_)

# Dependency Parsing

Now let's analyze the structure of the sentence. 

This sentence has lots of properties: 

In [None]:
[prop for prop in dir(r2) if not prop.startswith('_')]

To drill down into the sentence, we can start with the root: 

In [None]:
r2.root

That root has children: 

In [None]:
list(r2.root.children)

Let's see all of the children for each word:  

In [None]:
for word in r2: 
    print(word, ': ', str(list(word.children)))

This is very messy-looking, so let's create a nicer visualization. Here I'll be using a class I wrote called `sentenceTree`, available in the `sent2tree` module in this repository. It converts a SpaCy span (a sentence or other grammatical fragment) into a tree that can be read by `ete3`, a library for handling trees that can also render them as diagrams. 

In [None]:
st = sentenceTree(r2)
t, ts = st.render()
t.render('%%inline', tree_style=ts)
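
If `ete3` isn't installed, a plain-text tree is a workable fallback. The sketch below is not the `sentenceTree` class; it is a hypothetical recursive helper, run on a hand-built dictionary of children rather than on real parser output.

```python
def treeLines(word, children, depth=0):
    """Return one indented line per word, walking down the tree depth-first."""
    lines = ['  ' * depth + word]
    for child in children.get(word, []):
        lines.extend(treeLines(child, children, depth + 1))
    return lines

# Hand-built head-to-children mapping, for illustration only.
toyChildren = {'ran': ['Robin', 'away'], 'Robin': ['Sir', 'brave']}

lines = treeLines('ran', toyChildren)
print('\n'.join(lines))
```

With a real sentence, the same recursion could start at `r2.root` and follow each token's `.children`.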

You can already see how useful this might be. Since adjectives are typically children of the things they describe, we can get approximations for adjectives that describe characters. How is Sir Robin described? 

In [None]:
for sent in robinSents: 
    for word in sent: 
        if 'Robin' in word.string: 
            for child in word.children: 
                if child.pos_ == 'ADJ':
                    print(child)

Looks like we shouldn't always trust syntactic insight! Now let's do something similar for _Pride and Prejudice_. First, we'll use named entity extraction to get a list of the most frequently mentioned characters:  

In [None]:
Counter([w.string.strip() for w in pride.ents if w.label_ == 'PERSON']).most_common(10)

Now we can write a function that looks at the direct children of each mention of a character's name, and collects any adjectives among them: 

In [None]:
def adjectivesDescribingCharacters(text, character):
    sents = [sent for sent in text.sents if character in sent.string]
    adjectives = []
    for sent in sents: 
        for word in sent: 
            if character in word.string:
                for child in word.children: 
                    if child.pos_ == 'ADJ': 
                        adjectives.append(child.string.strip())
    return Counter(adjectives).most_common(10)

We'll try it on Mr. Darcy: 

In [None]:
adjectivesDescribingCharacters(pride, 'Darcy')

Now let's do the same sort of thing, but look for associated verbs. First, let's get all the sentences in which Elizabeth is mentioned:  

In [None]:
elizabethSentences = [sent for sent in pride.sents if 'Elizabeth' in sent.string]

And we can peek at one of them: 

In [None]:
elizabethSentences[3]

In [None]:
st = sentenceTree(elizabethSentences[3])
t, ts = st.render()
t.render('%%inline', tree_style=ts)

We want the verb associated with Elizabeth, _remained_, not the root verb of the sentence, _walked_, which is associated with Mr. Darcy. So let's write a function that will walk up the dependency tree from a character's name until we get to the first verb. We'll use lemmas instead of the conjugated forms to collapse _remain_, _remains_, and _remained_ into one verb: _remain_. 

In [None]:
def verbsForCharacters(text, character):
    sents = [sent for sent in text.sents if character in sent.string]
    charWords = []
    for sent in sents: 
        for word in sent: 
            if character in word.string: 
                charWords.append(word)
    charVerbs = []
    for word in charWords: 
        # Start walking up the list of ancestors 
        # until we get to the first verb, then stop. 
        for ancestor in word.ancestors: 
            if ancestor.pos_.startswith('V'): 
                charVerbs.append(ancestor.lemma_.strip())
                break
    return Counter(charVerbs).most_common(20)

In [None]:
elizabethVerbs = verbsForCharacters(pride, 'Elizabeth')
elizabethVerbs
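
To make the walk up the tree concrete, here is a small sketch that doesn't need SpaCy. The `Node` class and the sentence are hypothetical and hand-built; only the idea of following heads upward until the first verb is taken from the function above.

```python
class Node:
    """A toy token that knows its part of speech and its syntactic head."""
    def __init__(self, text, pos, head=None):
        self.text, self.pos, self.head = text, pos, head

    def ancestors(self):
        # Yield the head, then the head's head, and so on up to the root.
        node = self.head
        while node is not None:
            yield node
            node = node.head

def firstVerbAbove(node):
    """Return the text of the nearest verb ancestor, or None if there is none."""
    for ancestor in node.ancestors():
        if ancestor.pos == 'VERB':
            return ancestor.text
    return None

# Hand-built structure for "Darcy walked off while Elizabeth remained".
walked = Node('walked', 'VERB')
darcy = Node('Darcy', 'PROPN', head=walked)
remained = Node('remained', 'VERB', head=walked)
elizabeth = Node('Elizabeth', 'PROPN', head=remained)

elizabethVerb = firstVerbAbove(elizabeth)
darcyVerb = firstVerbAbove(darcy)
```

Stopping at the first verb matters here: if the walk continued to the root, _walked_ would be counted for Elizabeth as well.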

In [None]:
darcyVerbs = verbsForCharacters(pride, 'Darcy')
janeVerbs = verbsForCharacters(pride, 'Jane')

We can now merge these counts into a single table, and then we can visualize it with Pandas. 

In [None]:
def verbsToMatrix(verbCounts): 
    """ 
    Takes verb counts given by verbsForCharacters 
    and makes Pandas Series out of them, suitable for combination in 
    a DataFrame. 
    """
    return pd.Series({t[0]: t[1] for t in verbCounts})

verbsDF = pd.DataFrame({'Elizabeth': verbsToMatrix(elizabethVerbs), 
                        'Darcy': verbsToMatrix(darcyVerbs), 
                        'Jane': verbsToMatrix(janeVerbs)}).fillna(0)
verbsDF.plot(kind='bar', figsize=(14,4))

# Probabilities

SpaCy has a table of log probabilities for English words, and these are automatically associated with each word once we parse the document. Because they are logarithms, the values are negative, and the rarer the word, the lower the number. Let's see what the distribution is like: 

In [None]:
probabilities = [word.prob for word in grail] 
pd.Series(probabilities).hist()
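
To get a feel for what these numbers mean, here is a rough sketch that computes log probabilities from raw counts in a tiny made-up corpus. It assumes natural logarithms and no smoothing, so it only approximates the idea behind SpaCy's figures, which are estimated from a much larger corpus.

```python
import math
from collections import Counter

# A tiny made-up corpus, for illustration only.
toyCorpus = 'the the the the knight knight shrubbery'.split()

counts = Counter(toyCorpus)
total = sum(counts.values())

# Log of each word's relative frequency: always negative, lower means rarer.
logProbs = {word: math.log(count / total) for word, count in counts.items()}

# Keep the words whose probability is below an arbitrary threshold of 0.2.
rare = [word for word, logProb in logProbs.items() if logProb < math.log(0.2)]
```

Filtering on a log probability threshold is the same move as filtering on a probability threshold, just expressed on a scale where very small numbers are easier to handle.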

Let's peek at some of the improbable words for _Monty Python and the Holy Grail_. 

In [None]:
set([word.string.strip().lower() for word in grail if word.prob < -19])

Now we can do some rudimentary information extraction by counting the improbable words: 

In [None]:
Counter([word.string.strip().lower() 
         for word in grail 
         if word.prob < -19.5]).most_common(20)

What are those words for _Pride and Prejudice_? 

In [None]:
Counter([word.string.strip().lower() 
         for word in pride 
         if word.prob < -19.5 
         and word.is_alpha
         and word.pos_ != 'PROPN'] # This time, let's ignore proper nouns.
       ).most_common(20)

We can do this with ngrams, too, with some fancy Python magic:

In [None]:
def ngrams(doc, n): 
    doc = [word for word in doc 
           if word.is_alpha # Get rid of punctuation
           if not word.string.isupper()] # Get rid of all-caps speaker headings
    return list(zip(*[doc[i:] for i in range(n)]))
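
The `zip` trick is easier to see on a plain list of strings. This sketch uses a hypothetical `ngramsOfList` helper and skips the filtering step: it makes `n` copies of the list, each shifted one position further, and zips them together so that each tuple holds `n` consecutive items.

```python
def ngramsOfList(items, n):
    """Return all runs of n consecutive items as tuples."""
    # items[0:], items[1:], ... items[n-1:] are the shifted copies;
    # zip stops at the shortest one, so no partial n-grams are produced.
    return list(zip(*[items[i:] for i in range(n)]))

words = ['bring', 'out', 'your', 'dead']

trigrams = ngramsOfList(words, 3)
bigrams = ngramsOfList(words, 2)
tooLong = ngramsOfList(words, 5)
```

Asking for n-grams longer than the list simply returns an empty list, since the last shifted copy is empty.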

In [None]:
grailGrams = set(ngrams(grail, 3))

In [None]:
for gram in grailGrams: 
    if sum([word.prob for word in gram]) < -40: 
        print(gram)

In [None]:
for gram in set(ngrams(pride, 3)): 
    if sum([word.prob for word in gram]) < -40: 
        print(gram)

# Word Embeddings (Word Vectors)

Word embeddings (word vectors) are numeric representations of words, usually generated via dimensionality reduction on a word cooccurrence matrix for a large corpus. The vectors SpaCy uses are the [GloVe](http://nlp.stanford.edu/projects/glove/) vectors, Stanford's Global Vectors for Word Representation. These vectors can be used to calculate semantic similarity between words and documents.

In [None]:
coconut, africanSwallow, europeanSwallow, horse = nlp('coconut'), nlp('African Swallow'), nlp('European Swallow'), nlp('horse')

In [None]:
coconut.similarity(horse)

In [None]:
africanSwallow.similarity(horse)

In [None]:
africanSwallow.similarity(europeanSwallow)
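
Similarity scores like these are typically cosine similarities between vectors. As a minimal sketch, assuming short made-up vectors in place of real 300-dimensional embeddings, the calculation looks like this:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    normA = math.sqrt(sum(x * x for x in a))
    normB = math.sqrt(sum(y * y for y in b))
    return dot / (normA * normB)

# Made-up vectors, for illustration only.
sameDirection = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
unrelated = cosine([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, and vectors at right angles score 0.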

Let's look at vectors for _Pride and Prejudice_. First, let's get the first 150 nouns:

In [None]:
prideNouns = [word for word in pride if word.pos_ == 'NOUN'][:150]

Now let's get vectors and labels for each of them: 

In [None]:
prideNounVecs = [word.vector for word in prideNouns]
prideNounLabels = [word.string.strip() for word in prideNouns]

In [None]:
prideNounVecs[0].shape

A single vector is 300-dimensional, so in order to plot it in 2D, it might help to reduce the dimensionality to the most meaningful dimensions. We can use Scikit-Learn to perform truncated singular value decomposition for latent semantic analysis (LSA). 

In [None]:
lsa = TruncatedSVD(n_components=2)
lsaOut = lsa.fit_transform(prideNounVecs)
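
For intuition about what the reduction does, here is a sketch using NumPy's exact SVD on a tiny made-up matrix. `TruncatedSVD` uses a randomized solver by default, so its output may differ in sign and in small numerical details; the numbers below are illustrative assumptions only.

```python
import numpy as np

# Four hypothetical 3-dimensional "word vectors", one per row.
toyVecs = np.array([[2.0, 0.0, 0.1],
                    [4.0, 0.1, 0.0],
                    [0.0, 3.0, 0.1],
                    [0.1, 6.0, 0.0]])

# Factor the matrix; the singular values in S come back largest first.
U, S, Vt = np.linalg.svd(toyVecs, full_matrices=False)

# Keep the two strongest components to get 2D coordinates for each row.
k = 2
reduced = U[:, :k] * S[:k]

# Equivalently, project the rows onto the top k right singular vectors.
projected = toyVecs @ Vt[:k].T
```

Each row of `reduced` is a 2D point that can be passed straight to a scatter plot.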

Plot the results in a scatter plot: 

In [None]:
xs, ys = lsaOut[:,0], lsaOut[:,1]
for i in range(len(xs)): 
    plt.scatter(xs[i], ys[i])
    plt.annotate(prideNounLabels[i], (xs[i], ys[i]))

# Document Vectorization

This section uses a non-semantic technique for vectorizing documents: a simple bag-of-words model. We won't need any of the fancy features of SpaCy for this, just Scikit-Learn. We'll use a subset of the Inaugural Address Corpus that contains 20th- and 21st-century inaugural addresses. 

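First, we'll vectorize the corpus using Scikit-Learn's `TfidfVectorizer` class. This creates a matrix of term frequencies, one row per document, with each row scaled to unit length by the vectorizer's default L2 normalization. (It doesn't actually use TF-IDF weighting, since we're turning that off with `use_idf=False` in the options below.)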

In [None]:
tfidf = TfidfVectorizer(input='filename', decode_error='ignore', use_idf=False)
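
With IDF weighting turned off, what remains is roughly a normalized word count. The sketch below imitates that for a single made-up document in plain Python. The `l2TermFrequencies` helper is hypothetical, and it splits on whitespace, which is cruder than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def l2TermFrequencies(text):
    """Count words, then scale the counts so the vector has length 1."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(count * count for count in counts.values()))
    return {word: count / norm for word, count in counts.items()}

# A made-up document, for illustration only.
tf = l2TermFrequencies('we the people we the we')

# The squared weights of a unit-length vector sum to 1.
squaredLength = sum(weight * weight for weight in tf.values())
```

Scaling every document to the same length keeps long addresses from dominating the comparison simply because they contain more words.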

In [None]:
inauguralFilenames = sorted(glob('inaugural/*'))

# Make labels by removing the directory name and .txt extension: 
labels = [filename.split('/')[1] for filename in inauguralFilenames]
labels = [filename.split('.')[0] for filename in labels]

# While we're at it, let's make a list of the lengths, so we can use them to plot dot sizes. 
lengths = [len(open(filename, errors='ignore').read())/100 for filename in inauguralFilenames]

# Add a manually compiled list of presidential party affiliations, 
# So that we can use this to color our dots. 
parties = 'rrrbbrrrbbbbbrrbbrrbrrrbbrrbr'

In [None]:
tfidfOut = tfidf.fit_transform(inauguralFilenames)

In [None]:
tfidfOut.shape

In [None]:
# TruncatedSVD works directly on sparse matrices, so no conversion is needed. 
lsaOut = lsa.fit_transform(tfidfOut)

In [None]:
xs, ys = lsaOut[:,0], lsaOut[:,1]
for i in range(len(xs)): 
    plt.scatter(xs[i], ys[i], c=parties[i], s=lengths[i], alpha=0.5)
    plt.annotate(labels[i], (xs[i], ys[i]))

# Average Sentence Lengths

Let's load the Inaugural Address documents into SpaCy to analyze things like average sentence length. SpaCy makes this really easy. 

In [None]:
inaugural = [nlp(open(doc, errors='ignore').read()) for doc in inauguralFilenames]

In [None]:
sentLengths = [ np.mean([len(sent) for sent in doc.sents]) for doc in inaugural ]

In [None]:
pd.Series(sentLengths, index=labels).plot(kind='bar')

# Term Frequency Distributions

This sort of thing you've probably already seen in the NLTK book, but it's made even easier in SpaCy. We're simply going to count the occurrences of words and divide by the total number of words in the document. 

In [None]:
inauguralSeries = [pd.Series(Counter(   
                    [word.string.strip().lower() 
                     for word in doc]))/len(doc) 
                     for doc in inaugural]

In [None]:
seriesDict = {label: series for label, series in zip(labels, inauguralSeries)}

In [None]:
inauguralDf = pd.DataFrame(seriesDict).T.fillna(0)

In [None]:
inauguralDf[['america', 'world']].plot(kind='bar')

In [None]:
americaWorldRatio = inauguralDf['america']/inauguralDf['world']
americaWorldRatio.plot(kind='bar')

We can also compare the addresses with one another. `doc.similarity()` compares the word vectors of two documents, so the cell below builds a table of pairwise similarity scores, which we can then plot as a heatmap: 

In [None]:
similarities = [ [doc.similarity(other) for other in inaugural] for doc in inaugural ]
similaritiesDf = pd.DataFrame(similarities, columns=labels, index=labels)

In [None]:
# Requires the Seaborn library. 
rcParams['figure.figsize'] = 16, 8
seaborn.heatmap(similaritiesDf)

# Exercises

1. Extract all the events from _Pride and Prejudice_. 
2. Make a lexical dispersion plot of the word "ni" in _Monty Python and the Holy Grail_. What does this tell us? 
3. Find the shortest sentence in any inaugural address from our corpus.
4. Find the president who used the lowest proportion of adjectives (or nouns, or verbs) in his inaugural address. 
5. Find which of Charles Dickens's novels (or those of any other author) are the most semantically similar. 

# Learn More

 - [SpaCy Homepage](https://spacy.io/)
 - [Pycon: NLP in 10 Lines of Code](https://github.com/cytora/pycon-nlp-in-10-lines)
 - [What You Can Learn About Food By Analyzing a Million Yelp Reviews](http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb)
 - [Other Tutorials Listed on Spacy.io](https://spacy.io/docs/usage/tutorials)
 
# See Also

 - [Textacy, higher-level NLP based on SpaCy](https://github.com/chartbeat-labs/textacy)