<h1 align="center">Text Analysis with Python</h1>

## Instructors
- Scott Bailey 
- Vincent Tompkins

## Learning objectives

Develop practical knowledge of an end-to-end workflow for text analysis in Python using two specific libraries: spaCy and textacy.

- Import data
- Clean/preprocess text data
- Analyze single documents
- Analyze a full corpus


## Topics

- Document Tokenization
- Part-of-Speech (POS) Tagging
- Named-Entity Recognition (NER)
- Corpus Analysis and Vectorization

## Setup

Clicking the "Open in Colab" button, which appears after you open the GitHub link above, creates a temporary copy of the notebook in the Google Colaboratory environment. If you then click the "Copy to Drive" button in the menu bar, the notebook is attached to your own user account, so you can edit it however you like, including taking notes directly in the notebook.

## Zoom etiquette

Please make sure that your mic is muted during the workshop.

## Questions during the workshop

During the workshop, we have a second instructor who will be monitoring chat on Zoom. Please feel free to ask questions by chat throughout the workshop. Our second instructor will answer as able, and will aggregate questions with answers that might help everyone. 

At the end of each section of the workshop, the primary instructor will answer aggregated and new questions as time permits. If we aren't able to get to your question during the workshop, please follow up with us afterward. 

## Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code in an interactive way. They're quickly becoming a standard way of putting together data, code, and written explanations or visualizations into a single document and sharing that. There are a lot of ways that you can run Jupyter notebooks, including just locally on your computer, but we've decided to use Google's Colaboratory notebook platform for this workshop.  Colaboratory is “a Google research project created to help disseminate machine learning education and research.”  If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we're happy to help. Feel free to drop by our walk-in consulting or schedule an appointment with us.

https://go.ncsu.edu/dvs-request

## Environment
If you would prefer to use Anaconda or your own local installation of Python or Jupyter Notebooks, for this workshop you will need an environment with the following packages installed and available:
- `spacy`
- `textacy`

Please note that we will not have time during the workshop to support you with problems related to a local environment, and we do recommend using the Colaboratory notebooks if you are at all unsure.

# Document-level Analysis with `spaCy`

Let's start by learning how spaCy works, and using it to start analyzing a single textual document. We'll work with some sample data throughout, but talk through importing larger corpora later in the workshop. 

For now, we'll start with imports, setting up the model, and working with a short text. 

In [1]:
import spacy

spaCy uses pre-trained neural network models to process text. Here we're going to download and use a medium-sized English multi-task CNN, which has high accuracy for part-of-speech tagging and entity recognition, and includes word vectors.

In [None]:
!python -m spacy download en_core_web_md

In [2]:
# Once we've installed the model, we can import it like any other Python library
import en_core_web_md

In [3]:
# Load the language model
nlp = en_core_web_md.load()

In [4]:
# From H.G. Wells's A Short History of the World, Project Gutenberg
text = """Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity.  {111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized.  He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days.  His collection has been unearthed and is
perhaps the most precious store of historical material in the
world.  The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes.  He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions.  But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there.  This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians.  They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire.  Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor.  {112} He came up against Babylon, there was a battle
outside the walls, and the gates of the city were opened to him
(538 B.C.).  His soldiers entered the city without fighting.  The
crown prince Belshazzar, the son of Nabonidus, was feasting, the
Bible relates, when a hand appeared and wrote in letters of fire
upon the wall these mystical words: _"Mene, Mene, Tekel,
Upharsin,"_ which was interpreted by the prophet Daniel, whom he
summoned to read the riddle, as "God has numbered thy kingdom and
finished it; thou art weighed in the balance and found wanting and
thy kingdom is given to the Medes and Persians."  Possibly the
priests of Bel Marduk knew something about that writing on the
wall.  Belshazzar was killed that night, says the Bible.
Nabonidus was taken prisoner, and the occupation of the city was
so peaceful that the services of Bel Marduk continued without
intermission."""

In [5]:
doc = nlp(text)

Once we pass the text into the NLP model, spaCy processes the entire text and makes many features available.

## Tokenization

The doc created by spaCy immediately provides access to the word-level tokens of the text.

In [6]:
for token in doc[:15]:
  print(token)

Even
under
the
Assyrian
monarchs
and
especially
under


Sardanapalus
,
Babylon
had
been
a


Each of these tokens has a number of properties, and we'll look a bit more closely at this in a minute when we think about preprocessing texts, but let's continue our quick tour. 

spaCy also automatically provides sentence-level tokenization.

In [7]:
for sent in doc.sents:
    print(sent.text + "\n--\n")

Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity.  
--

{111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized.  
--

He made a library, a library not of paper but of

--

the clay tablets that were used for writing in Mesopotamia since

--

early Sumerian days.  
--

His collection has been unearthed and is
perhaps the most precious store of historical material in the
world.  
--

The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes.  
--

He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon
--

I
--

he commemorated the
fact by inscriptions.  
--

But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there.  
--

This device was to be practised quite successfully by the
Rom

We can collect both words and sentences into standard Python data structures.

In [8]:
sentences = [sent.text for sent in doc.sents]
sentences

['Even under the Assyrian monarchs and especially under\nSardanapalus, Babylon had been a scene of great intellectual\nactivity.  ',
 '{111} Sardanapalus, though an Assyrian, had been quite\nBabylon-ized.  ',
 'He made a library, a library not of paper but of\n',
 'the clay tablets that were used for writing in Mesopotamia since\n',
 'early Sumerian days.  ',
 'His collection has been unearthed and is\nperhaps the most precious store of historical material in the\nworld.  ',
 'The last of the Chaldean line of Babylonian monarchs,\nNabonidus, had even keener literary tastes.  ',
 'He patronized\nantiquarian researches, and when a date was worked out by his\ninvestigators for the accession of Sargon',
 'I',
 'he commemorated the\nfact by inscriptions.  ',
 'But there were many signs of disunion in\nhis empire, and he sought to centralize it by bringing a number of\nthe various local gods to Babylon and setting up temples to them\nthere.  ',
 'This device was to be practised quite success

In [9]:
words = [token.text for token in doc]
words[:30]

['Even',
 'under',
 'the',
 'Assyrian',
 'monarchs',
 'and',
 'especially',
 'under',
 '\n',
 'Sardanapalus',
 ',',
 'Babylon',
 'had',
 'been',
 'a',
 'scene',
 'of',
 'great',
 'intellectual',
 '\n',
 'activity',
 '.',
 ' ',
 '{',
 '111',
 '}',
 'Sardanapalus',
 ',',
 'though',
 'an']

### Filtering Tokens

Let's start with cleaning the text and counting to see what we can learn.

In [10]:
# One of the common things we do in text analysis is to remove punctuation
no_punct = [token for token in doc if not token.is_punct]
for token in no_punct[:20]:
  print(token.text, token.is_punct)

Even False
under False
the False
Assyrian False
monarchs False
and False
especially False
under False

 False
Sardanapalus False
Babylon False
had False
been False
a False
scene False
of False
great False
intellectual False

 False
activity False


In [11]:
# This has worked, but left in new line characters and spaces
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]
for token in no_punct_or_space[:30]:
  print(token.text)

Even
under
the
Assyrian
monarchs
and
especially
under
Sardanapalus
Babylon
had
been
a
scene
of
great
intellectual
activity
111
Sardanapalus
though
an
Assyrian
had
been
quite
Babylon
ized
He
made


In [12]:
# Let's say we also want to remove numbers, and lowercase everything
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]
lower_alpha[:30]

['even',
 'under',
 'the',
 'assyrian',
 'monarchs',
 'and',
 'especially',
 'under',
 'sardanapalus',
 'babylon',
 'had',
 'been',
 'a',
 'scene',
 'of',
 'great',
 'intellectual',
 'activity',
 'sardanapalus',
 'though',
 'an',
 'assyrian',
 'had',
 'been',
 'quite',
 'babylon',
 'ized',
 'he',
 'made',
 'a']

One other common bit of preprocessing is to remove stopwords, that is, the common words in a language that don't convey the information that we are looking for in our analysis. For example, if we looked for the most common words in a text, we would want to remove stopwords so that we don't only get words such as 'a,' 'the,' and 'and.'

In [13]:
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]

['assyrian',
 'monarchs',
 'especially',
 'sardanapalus',
 'babylon',
 'scene',
 'great',
 'intellectual',
 'activity',
 'sardanapalus',
 'assyrian',
 'babylon',
 'ized',
 'library',
 'library',
 'paper',
 'clay',
 'tablets',
 'writing',
 'mesopotamia',
 'early',
 'sumerian',
 'days',
 'collection',
 'unearthed',
 'precious',
 'store',
 'historical',
 'material',
 'world']

For this piece, we've used spaCy's built-in stopword list, which is used to create the property `is_stop` for each token. There's a good chance you would want to create custom stopword lists though, especially if you're working with historical text.

In [14]:
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["assyrian", "babylon"]

custom_clean = [token.lower_ for token in doc if token.lower_ not in custom_stopwords]
custom_clean

['even',
 'under',
 'the',
 'monarchs',
 'and',
 'especially',
 'under',
 '\n',
 'sardanapalus',
 ',',
 'had',
 'been',
 'a',
 'scene',
 'of',
 'great',
 'intellectual',
 '\n',
 'activity',
 '.',
 ' ',
 '{',
 '111',
 '}',
 'sardanapalus',
 ',',
 'though',
 'an',
 ',',
 'had',
 'been',
 'quite',
 '\n',
 '-',
 'ized',
 '.',
 ' ',
 'he',
 'made',
 'a',
 'library',
 ',',
 'a',
 'library',
 'not',
 'of',
 'paper',
 'but',
 'of',
 '\n',
 'the',
 'clay',
 'tablets',
 'that',
 'were',
 'used',
 'for',
 'writing',
 'in',
 'mesopotamia',
 'since',
 '\n',
 'early',
 'sumerian',
 'days',
 '.',
 ' ',
 'his',
 'collection',
 'has',
 'been',
 'unearthed',
 'and',
 'is',
 '\n',
 'perhaps',
 'the',
 'most',
 'precious',
 'store',
 'of',
 'historical',
 'material',
 'in',
 'the',
 '\n',
 'world',
 '.',
 ' ',
 'the',
 'last',
 'of',
 'the',
 'chaldean',
 'line',
 'of',
 'babylonian',
 'monarchs',
 ',',
 '\n',
 'nabonidus',
 ',',
 'had',
 'even',
 'keener',
 'literary',
 'tastes',
 '.',
 ' ',
 'he',
 'pat

At this point, we have a list of lower-cased tokens that doesn't contain punctuation, whitespace, numbers, or stopwords. Depending on our analysis, we may or may not want to do this much cleaning. But it is good to understand how much we can do just with spaCy.
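The custom stopword list above can also be combined with the other filters in a single pass. The following is a minimal sketch that uses plain strings rather than spaCy tokens, so it runs without a model. `sample_tokens` and `builtin_stopwords` are small hypothetical stand-ins for `doc` and spaCy's stopword list, not the real thing.

```python
# Hypothetical stand-ins: a few token strings from the passage and a tiny
# stopword set playing the role of spaCy's built-in list.
sample_tokens = ["Even", "under", "the", "Assyrian", "monarchs", ",", "\n",
                 "Babylon", "had", "been", "a", "scene", "{", "111", "}"]
builtin_stopwords = {"even", "under", "the", "had", "been", "a"}
custom_stopwords = {"assyrian", "babylon"}

# A set union gives one collection to test membership against.
all_stopwords = builtin_stopwords | custom_stopwords

# Keep alphabetic tokens only (drops punctuation, whitespace, and numbers),
# lowercase them, and drop anything in the combined stopword set.
custom_clean = [
    tok.lower()
    for tok in sample_tokens
    if tok.isalpha() and tok.lower() not in all_stopwords
]
print(custom_clean)
```

With real spaCy tokens, the same idea would likely be written with the attributes used earlier, for example `token.is_alpha and not token.is_stop and token.lower_ not in custom_stopwords`.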

### Counting Tokens

Let's then look at what we can do now that we have lists of tokens cleaned to different degrees. We can start with just counting.

In [15]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

Number of tokens in document:  477
Number of tokens in cleaned document:  175
Number of unique tokens in cleaned document:  147


In [16]:
from collections import Counter

In [17]:
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)

[('the', 36),
 ('\n', 35),
 (',', 26),
 ('of', 20),
 ('.', 16),
 (' ', 14),
 ('and', 13),
 ('in', 9),
 ('a', 8),
 ('was', 8),
 ('to', 8),
 ('he', 6),
 ('by', 6),
 ('babylon', 5),
 ('had', 4),
 ('that', 4),
 ('his', 4),
 ('nabonidus', 4),
 ('it', 4),
 ('"', 4)]

In [18]:
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)

[('babylon', 5),
 ('nabonidus', 4),
 ('bel', 3),
 ('marduk', 3),
 ('city', 3),
 ('assyrian', 2),
 ('monarchs', 2),
 ('sardanapalus', 2),
 ('library', 2),
 ('writing', 2),
 ('empire', 2),
 ('god', 2),
 ('found', 2),
 ('cyrus', 2),
 ('belshazzar', 2),
 ('bible', 2),
 ('wall', 2),
 ('mene', 2),
 ('thy', 2),
 ('kingdom', 2)]

**Question:** Why do we have to use a list comprehension for the non-clean doc while we can just pass a variable directly for the cleaned set of tokens?

## Part-of-Speech Tagging

Let's turn to the other aspects of the text that spaCy exposes for us. Depending on what questions we might have about the text, these will be more or less helpful. 

We'll start with parts of speech. 

In [19]:
# spaCy provides two levels of POS tagging. Here's the more general.
for token in doc[:30]:
  print(token.text, token.pos_)

Even ADV
under ADP
the DET
Assyrian ADJ
monarchs NOUN
and CCONJ
especially ADV
under ADP

 SPACE
Sardanapalus PROPN
, PUNCT
Babylon PROPN
had AUX
been AUX
a DET
scene NOUN
of ADP
great ADJ
intellectual ADJ

 SPACE
activity NOUN
. PUNCT
  SPACE
{ PUNCT
111 NUM
} PUNCT
Sardanapalus PROPN
, PUNCT
though SCONJ
an DET


In [20]:
# We also have the more specific Penn Treebank tags.
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:30]:
  print(token.text, token.tag_)

Even RB
under IN
the DT
Assyrian JJ
monarchs NN
and CC
especially RB
under IN

 _SP
Sardanapalus NNP
, ,
Babylon NNP
had VBD
been VBN
a DT
scene NN
of IN
great JJ
intellectual JJ

 _SP
activity NN
. .
  _SP
{ -LRB-
111 CD
} -RRB-
Sardanapalus NNP
, ,
though IN
an DT


We can group tokens by these tags in order to understand the distribution of parts of speech throughout the text.

In [21]:
nouns = [token for token in doc if token.pos_ == "NOUN"]
verbs = [token for token in doc if token.pos_ == "VERB"]
proper_nouns = [token for token in doc if token.pos_ == "PROPN"]
adjectives = [token for token in doc if token.pos_ == "ADJ"]
adverbs = [token for token in doc if token.pos_ == "ADV"]

In [22]:
pos_counts = {
    "nouns": len(nouns),
    "verbs": len(verbs),
    "proper_nouns": len(proper_nouns),
    "adjectives": len(adjectives),
    "adverbs": len(adverbs) 
}

pos_counts

{'nouns': 67, 'verbs': 39, 'proper_nouns': 47, 'adjectives': 24, 'adverbs': 14}
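Building one list per tag works, but the same tallies can be collected in one pass with `Counter`. This is a minimal sketch, assuming a hand-written list of `(text, pos)` pairs copied from the first rows of the tagging output above. The `tagged` list is a hypothetical stand-in for iterating over `doc`.

```python
from collections import Counter

# Hypothetical stand-in for (token.text, token.pos_) pairs, copied from the
# first few rows of the tagging output above (whitespace tokens omitted).
tagged = [("Even", "ADV"), ("under", "ADP"), ("the", "DET"),
          ("Assyrian", "ADJ"), ("monarchs", "NOUN"), ("and", "CCONJ"),
          ("especially", "ADV"), ("under", "ADP"), ("Sardanapalus", "PROPN"),
          (",", "PUNCT"), ("Babylon", "PROPN")]

# Count every tag in one pass instead of building one list per tag.
pos_counter = Counter(pos for _, pos in tagged)
print(pos_counter.most_common(3))
```

On the real document, the equivalent would be `Counter(token.pos_ for token in doc)`, which covers every tag spaCy assigned rather than only the five chosen above.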

spaCy also provides full dependency parsing, but we're going to leave that alone for the moment. We'll turn instead to named entity recognition. 

## Named-Entity Recognition

https://spacy.io/api/annotation#named-entities

In [23]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Assyrian NORP
Babylon ORG
111 CARDINAL
Sardanapalus NORP
Assyrian NORP
Babylon PRODUCT
Mesopotamia GPE
early Sumerian days DATE
the
world LOC
Chaldean NORP
Babylonian NORP
Sargon PERSON
Babylon WORK_OF_ART
Romans NORP
Babylon PRODUCT
Bel Marduk PERSON
Babylonians NORP
Cyrus ORG
Persian NORP
the
adjacent Median Empire ORG
Croesus ORG
Lydia GPE
Eastern Asia LOC
112 CARDINAL
Babylon ORG
538 CARDINAL
B.C. GPE
Belshazzar PERSON
Nabonidus NORP
Bible WORK_OF_ART
_"Mene, Mene, PERSON
Tekel GPE
Upharsin PERSON
Daniel PERSON
Medes PERSON
Persians NORP
Bel Marduk PERSON
Belshazzar PERSON
night TIME
Bible WORK_OF_ART
Nabonidus PERSON
Bel Marduk PERSON


What if we only care about geo-political entities or locations?

In [24]:
ent_filtered = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ["GPE", "LOC"]]
ent_filtered

[('Mesopotamia', 'GPE'),
 ('the\nworld', 'LOC'),
 ('Lydia', 'GPE'),
 ('Eastern Asia', 'LOC'),
 ('B.C.', 'GPE'),
 ('Tekel', 'GPE')]

### Visualizing Parses

spaCy also has a nice built-in visualizer.

In [25]:
from spacy import displacy

In [26]:
displacy.render(doc, style="ent", jupyter=True)

### Activity

Pick either a particular part of speech or a named entity type, and write code to determine the most common words of that type. 

# Corpus-level Analysis with `textacy`

Let's shift to thinking about a whole corpus rather than a single document.

In doing so, we could keep working with spaCy directly if the features that it exposes help us answer the research questions we are asking. 

Instead, though, we're going to take advantage of textacy, a library built on spaCy that adds features, including a notion of a corpus and built-in analytics for it.

In [27]:
!pip install textacy



## Generating Corpora

We'll use some of the data that is included in textacy as our corpus. You could absolutely import data otherwise, whether through reading in plain text or XML files, or pulling text data and metadata from a CSV file.

In [28]:
import textacy
import textacy.datasets

In [29]:
# We'll work with some Supreme Court cases: https://chartbeat-labs.github.io/textacy/_modules/textacy/datasets/supreme_court.html
data = textacy.datasets.SupremeCourt()

In [30]:
data.download()

What we have here is a collection of Supreme Court decisions, both full text and metadata. 

Let's look at a single one to see what we have.

In [31]:
single = list(data.texts(limit=1))[0]
single[:200]

'[ Halliburton Oil Well Cementing Co. v. Walker Mr.Earl Babcock, of Duncan, Okl. (Harry C. Robb, of Washington, D.C., on the brief), for petitioner.\n Mr. Harold W. Mattingly, of Los Angeles, Cal., for '

Let's go ahead and pull a full set of texts with metadata. To keep it a bit more manageable time-wise, we'll only collect 100 of the records.

In [32]:
records = data.records(limit=100)

# Records here is a generator - we can look at the first record by passing it to the next function.
# Note that next() consumes that record, so 99 of the 100 remain for the corpus below.
next(records)

('[ Halliburton Oil Well Cementing Co. v. Walker Mr.Earl Babcock, of Duncan, Okl. (Harry C. Robb, of Washington, D.C., on the brief), for petitioner.\n Mr. Harold W. Mattingly, of Los Angeles, Cal., for respondents.\n Mr. Justice BLACK delivered the opinion of the Court.\n Cranford P. Walker, owner of Patent No. 2,156,519, and the other respondents, licensees under the patent, brought this suit in a federal district court alleging that petitioner, Halliburton Oil Well Cementing Company, had infringed certain of the claims of the Walker patent. The district court held the claims in issue valid and infringed by Halliburton. The circuit court of appeals affirmed, 9 Cir., 146 F.2d 817, and denied Halliburton\'s petition for rehearing. 149 F.2d 896. Petitioner\'s application to this Court for certiorari urged, among other grounds, that the claims held valid failed to make the \'full, clear, concise, and exact\' description of the alleged invention required by Rev.Stat. 4888C. 33, 35 U.S.C.A

textacy includes the idea of a corpus, while spaCy only has an idea of a single document, though you can compose documents in standard Python data structures. Every corpus takes some texts or text plus metadata, along with a language model.

In [33]:
corpus = textacy.Corpus(nlp, data=records)

In [34]:
corpus

Corpus(99 docs, 610586 tokens)
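The corpus holds 99 docs rather than 100 because the earlier `next(records)` call already consumed one record from the generator. The sketch below reproduces that behavior with a hypothetical `fake_records` generator. It stands in for `data.records(limit=100)` and does not touch textacy.

```python
def fake_records(limit):
    """Hypothetical stand-in for data.records(limit=...): yields (text, meta) lazily."""
    for i in range(limit):
        yield (f"text of case {i}", {"case_id": i})

records = fake_records(limit=100)

# Peeking with next() consumes the first record.
first = next(records)

# Whatever is built from the generator afterwards sees only the remainder.
remaining = list(records)
print(first[1]["case_id"], len(remaining))  # prints: 0 99
```

To peek without losing a record, one option is to recreate the generator before building the corpus. Another is to materialize it with `list(...)` first.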

In [35]:
[doc._.preview for doc in corpus[:5]]

['Doc(5222 tokens: "Rehearing Denied Dec. 16, 1946. See . Mr.Claude...")',
 'Doc(3616 tokens: "Rehearing Denied Dec. 16, 1946  See .  Appeal f...")',
 'Doc(9169 tokens: "Mr. Walter J. Cummings, Jr., of Washington, D.C...")',
 'Doc(1423 tokens: "Mr.A. Devitt Vaneck, of Washington, D.C., for p...")',
 'Doc(7624 tokens: "Action by Richfield Oil Corporation against Sta...")']

We can see that the type of each item in the corpus is a `Doc` - this is effectively a spaCy doc with all of the calculated features. Textacy does give you some capacity to work with those features through its API, and also exposes new features, such as ngrams and ranking algorithms for single documents. We'll come back to these once we work a bit at the corpus level.

We can filter this corpus based on metadata once we make it.

In [36]:
# Here we'll find all the cases where the number of justices voting in the majority was greater than 6. 
recent = [doc for doc in corpus.get(lambda doc: doc._.meta["n_maj_votes"] > 6)]
len(recent)

64

In [37]:
recent[0]._.preview

'Doc(7624 tokens: "Action by Richfield Oil Corporation against Sta...")'
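The filter above amounts to applying a predicate function to each document's metadata and keeping the matches. A minimal sketch of that pattern in plain Python follows. The `cases` list and its values are hypothetical, and only the `n_maj_votes` key is taken from the cell above.

```python
# Hypothetical (text, metadata) pairs standing in for documents in a corpus.
cases = [
    ("case A text", {"n_maj_votes": 9}),
    ("case B text", {"n_maj_votes": 5}),
    ("case C text", {"n_maj_votes": 7}),
]

def match(case):
    """Predicate: True when more than 6 justices voted in the majority."""
    _, meta = case
    return meta["n_maj_votes"] > 6

# Keep only the cases for which the predicate holds.
large_majority = [case for case in cases if match(case)]
print([text for text, _ in large_majority])
```

Any metadata field can be used the same way, as long as the predicate returns `True` or `False` for each document.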

## Analyzing the Corpus

Let's look at what we get out of the box from textacy once we've built a corpus.

In [38]:
print("number of documents: ", corpus.n_docs)
print("number of sentences: ", corpus.n_sents)
print("number of tokens: ", corpus.n_tokens)

number of documents:  99
number of sentences:  27506
number of tokens:  610586


In [39]:
# We'll pass as_strings so that the results we look at will give us strings rather than unique ids.
counts = corpus.word_counts(as_strings=True)

Notice that, by default, the `word_counts` function is doing a certain amount of cleaning for you: https://chartbeat-labs.github.io/textacy/api_reference/lang_doc_corpus.html#textacy.corpus.Corpus.word_counts 

In [40]:
sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20]

[('-PRON-', 15179),
 ('v.', 3067),
 ('Court', 2150),
 ('case', 2093),
 ('court', 1758),
 ('Act', 1740),
 ('States', 1675),
 ('United', 1606),
 ('state', 1573),
 ('Co.', 1226),
 ('order', 1132),
 ('law', 1108),
 ('Commission', 1028),
 ('Footnote', 949),
 ('right', 886),
 ('Congress', 881),
 ('power', 824),
 ('tax', 786),
 ('act', 783),
 ('employee', 765)]

For an explanation of `-PRON-`, see https://spacy.io/api/annotation#lemmatization. Basically it's spaCy's way of lemmatizing pronouns. 

In [41]:
word_doc_counts = corpus.word_doc_counts(weighting="freq", smooth_idf=True, filter_stops=True, as_strings=True)

In [42]:
sorted(word_doc_counts.items(), key=lambda x:x[1], reverse=True)[:20]

[('Mr.', 1.0),
 ('Court', 1.0),
 ('-PRON-', 1.0),
 ('opinion', 0.98989898989899),
 ('case', 0.98989898989899),
 ('v.', 0.98989898989899),
 ('Justice', 0.9696969696969697),
 ('deliver', 0.9696969696969697),
 ('hold', 0.9494949494949495),
 ('fact', 0.9494949494949495),
 ('question', 0.9393939393939394),
 ('court', 0.9292929292929293),
 ('United', 0.9292929292929293),
 ('order', 0.9292929292929293),
 ('grant', 0.9191919191919192),
 ('provide', 0.9191919191919192),
 ('footnote', 0.9191919191919192),
 ('Act', 0.9090909090909091),
 ('States', 0.9090909090909091),
 ('state', 0.8888888888888888)]

We should note that these are not tf-idf values, which are term frequencies for individual docs weighted by inverse document frequency. With `weighting="freq"`, each value is the fraction of docs in the corpus that contain the word (for example, 0.9899 is 98 of the 99 docs). We still get a sense of which words are most widespread across the corpus and, if document frequency is a proxy for importance, which seem most important in the context of the corpus.
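To make the document-frequency idea concrete, here is a minimal sketch that computes, for a tiny hypothetical corpus, the fraction of documents containing each term. It illustrates the calculation only and is not textacy's implementation. The `docs` list is invented for the example.

```python
# Hypothetical toy corpus: each document is already reduced to a list of terms.
docs = [
    ["court", "tax", "state"],
    ["court", "tax", "tax", "export"],
    ["court", "order"],
]

# Document count: in how many documents does each term appear?
# Using set(doc) means repeated terms within one document count once.
doc_counts = {}
for doc in docs:
    for term in set(doc):
        doc_counts[term] = doc_counts.get(term, 0) + 1

# Dividing by the number of documents gives a fraction between 0 and 1.
doc_freqs = {term: n / len(docs) for term, n in doc_counts.items()}
print(sorted(doc_freqs.items(), key=lambda x: x[1], reverse=True))
```

Note that "tax" appears twice in the second document but still contributes only once to its document count, which is what distinguishes document frequency from term frequency.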

Textacy provides access to different algorithms that can be run on docs, such as TextRank for keyword extraction. We'll start by working on a single doc, and then look at how we might scale up to thinking about the corpus.

In [43]:
import textacy.ke

In [44]:
key_terms_textrank = textacy.ke.textrank(corpus[4])
key_terms_textrank

[('California sale tax', 0.012333388296664348),
 ('tax export', 0.0113673143549449),
 ('state tax invalid', 0.010884733369090502),
 ('Holmes v. Jennison', 0.010712873077734912),
 ('retail sale tax', 0.010381265077279931),
 ('uniform sale tax', 0.010034512222859084),
 ('tax sale', 0.009769600024372289),
 ('constitutional tax immunity', 0.00937754247776007),
 ('California tax', 0.009330806968843836),
 ('federal tax o baseball bat', 0.009297520907650668)]

For comparison, we'll take a look at another algorithm, Yake. 

In [45]:
key_terms_yake = textacy.ke.yake(corpus[4])
key_terms_yake

[('California Supreme Court', 0.01890085790764054),
 ('United States', 0.04633964392576428),
 ('Commerce Clause', 0.053538515965004016),
 ('Export Clause', 0.05606165454483531),
 ('tax', 0.07142028238377197),
 ('Constitution', 0.08323314407834871),
 ('Co.', 0.0888678555608659),
 ('export', 0.09368242887042529),
 ('State Court', 0.09888574865447539),
 ('article', 0.09952920567197214)]

Let's think about aggregating keywords over part of the corpus.

In [46]:
# Note: despite the variable name, this cell uses the Yake algorithm, not TextRank
key_terms_textrank_corpus = [textacy.ke.yake(doc) for doc in corpus[:20]]

In [47]:
key_terms_textrank_corpus

[[('United States', 0.013154845765439517),
  ('Mann Act', 0.05148040959149859),
  ('Slave Traffic Act', 0.07313546995499376),
  ('Caminetti', 0.07925194451279186),
  ('Congress', 0.08002616916707488),
  ('Mr. Justice', 0.08175846469121331),
  ('Caminetti case', 0.09646113486901169),
  ('Cir', 0.0996889079810463),
  ('Court', 0.10790782584908168),
  ('purpose', 0.12892137615080948)],
 [('Interstate Commerce Commission', 0.026838954477381077),
  ('Interstate Commerce Act', 0.04745347005938014),
  ('Champlin', 0.07097931896052062),
  ('Pipe Line Cases', 0.0737254136675763),
  ('District Court', 0.08233568972929342),
  ('Champlin Refining Company', 0.09036724824601859),
  ('Uncle Sam', 0.10496819842317137),
  ('common carrier', 0.10666587546030443),
  ('line', 0.11671977019899822),
  ('pipe', 0.1257009603399611)],
 [('United States', 0.005430451306863621),
  ('indian title', 0.013906626654310701),
  ('original indian title', 0.014864654569450124),
  ('indian claim', 0.02876756897005112),
 

In [48]:
flat_list = [item for sublist in key_terms_textrank_corpus for item in sublist]
flat_list

[('United States', 0.013154845765439517),
 ('Mann Act', 0.05148040959149859),
 ('Slave Traffic Act', 0.07313546995499376),
 ('Caminetti', 0.07925194451279186),
 ('Congress', 0.08002616916707488),
 ('Mr. Justice', 0.08175846469121331),
 ('Caminetti case', 0.09646113486901169),
 ('Cir', 0.0996889079810463),
 ('Court', 0.10790782584908168),
 ('purpose', 0.12892137615080948),
 ('Interstate Commerce Commission', 0.026838954477381077),
 ('Interstate Commerce Act', 0.04745347005938014),
 ('Champlin', 0.07097931896052062),
 ('Pipe Line Cases', 0.0737254136675763),
 ('District Court', 0.08233568972929342),
 ('Champlin Refining Company', 0.09036724824601859),
 ('Uncle Sam', 0.10496819842317137),
 ('common carrier', 0.10666587546030443),
 ('line', 0.11671977019899822),
 ('pipe', 0.1257009603399611),
 ('United States', 0.005430451306863621),
 ('indian title', 0.013906626654310701),
 ('original indian title', 0.014864654569450124),
 ('indian claim', 0.02876756897005112),
 ('indian land', 0.03067224

In [56]:
# we now have a flat list of tuples, but let's shift to a flat list of just the keys in order to 
# count the most common keys
flat_list_keys = [k for k,v in flat_list]
flat_list_keys

['United States',
 'Mann Act',
 'Slave Traffic Act',
 'Caminetti',
 'Congress',
 'Mr. Justice',
 'Caminetti case',
 'Cir',
 'Court',
 'purpose',
 'Interstate Commerce Commission',
 'Interstate Commerce Act',
 'Champlin',
 'Pipe Line Cases',
 'District Court',
 'Champlin Refining Company',
 'Uncle Sam',
 'common carrier',
 'line',
 'pipe',
 'United States',
 'indian title',
 'original indian title',
 'indian claim',
 'indian land',
 'indian tribe',
 'indian right',
 'Indians',
 'Shoshone Indians',
 'Congress',
 'Government',
 'A. Devitt Vaneck',
 'respondent',
 'Court',
 'delay',
 'contract',
 'Crook',
 'Rice',
 'United States',
 'work',
 'California Supreme Court',
 'United States',
 'Commerce Clause',
 'Export Clause',
 'tax',
 'Constitution',
 'Co.',
 'export',
 'State Court',
 'article',
 'Electric Bond',
 'Commission',
 'Trade Commission Report',
 'Share',
 'Exchange Commission',
 'Share Company',
 'Share Co.',
 'Federal Trade Commission',
 'American Co.',
 'Share system',
 'Circui

In [57]:
keyword_counter = Counter(flat_list_keys)
keyword_counter.most_common(20)

[('United States', 11),
 ('Court', 7),
 ('District Court', 5),
 ('Circuit Court', 4),
 ('Congress', 3),
 ('Co.', 3),
 ('Act', 3),
 ('Mr. Justice', 2),
 ('Government', 2),
 ('tax', 2),
 ('Commission', 2),
 ('Appeals', 2),
 ('order', 2),
 ('court', 2),
 ('case', 2),
 ('Samuels', 2),
 ('board', 2),
 ('panel', 2),
 ('Irving S. Shapiro', 2),
 ('Mann Act', 1)]

### Activity
Let's combine a few different pieces. Try filtering the corpus on some metadata to construct a sub-corpus. Then use one of the textacy keyword algorithms to determine the most common keywords across your subcorpus. 

## Keyword in Context

One thing that researchers often find helpful in working with text is simply seeing keywords in context. 

In [50]:
for doc in corpus[:20]:
  textacy.text_utils.KWIC(doc.text, "agriculture")

authority between the courts and the Secretary of  Agriculture . These become relevant to the enforcement of Milk
at the validity of the demand by the Secretary of  Agriculture  may be contested in an enforcement proceeding und
ay resist a claim against him by the Secretary of  Agriculture , made according to the procedure defined in the A
 also defined in the Act, before the Secretary of  Agriculture . The answer is found on a fair reading of the Agr
es a handler to challenge before the Secretary of  Agriculture  his order 'or any obligation imposed in connectio
challenged, the determination of the Secretary of  Agriculture , after hearing, is final but only 'if in accordan
hich gives the handler access to the Secretary of  Agriculture  for administrative relief and opportunity for jud
r, or delay the United States or the Secretary of  Agriculture  from obtaining relief' under 8a(6). It is only wh
oceedings were instituted before the Secretary of  Agriculture  and, apparently, are awa
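The idea behind a keyword-in-context view is simple enough to sketch by hand: find each match and slice a fixed window of characters on either side. The `kwic` function below is a hypothetical helper written for illustration, not textacy's implementation, and the `sample` text is invented.

```python
import re

def kwic(text, keyword, window=30):
    """Return (left, match, right) context tuples for each case-insensitive match."""
    results = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        results.append((left, m.group(), right))
    return results

sample = ("The order of the Secretary of Agriculture is final. "
          "Funds for agriculture were later reduced.")
for left, match, right in kwic(sample, "agriculture", window=20):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} | {match} | {right}")
```

Returning tuples rather than printing directly makes it easy to collect matches across many documents and sort or count them afterwards.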

## Vectorization

Let's continue with corpus-level analysis by taking advantage of textacy's vectorizer class, which wraps functionality from scikit-learn. We could work directly in scikit-learn, but learning one library and being able to do a great deal with it keeps the mental overhead low.

We'll create a vectorizer, sticking with the default term-frequency weighting but discarding words that appear in fewer than 3 documents or in more than 95% of documents. We'll also limit our features to the top 500 words according to document frequency. This means our feature set, or columns, will have a higher degree of representation across the corpus. We could vectorize according to tf-idf as well.

In [51]:
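import textacy.vsm

# Keep terms found in at least 3 documents and at most 95% of documents, capped at 500 terms
vectorizer = textacy.vsm.Vectorizer(min_df=3, max_df=.95, max_n_terms=500)
# Lazily yield each document's terms as strings, asking textacy to filter
# punctuation, stop words, and numbers
tokenized_corpus = (doc._.to_terms_list(ngrams=1, as_strings=True,
                                        filter_punct=True,
                                        filter_stops=True,
                                        filter_nums=True
                                        ) for doc in corpus)
# Learn the vocabulary and build the document-term matrix in one step
dtm = vectorizer.fit_transform(tokenized_corpus)
dtm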

<99x500 sparse matrix of type '<class 'numpy.int32'>'
	with 24971 stored elements in Compressed Sparse Row format>
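
To see what the `min_df`, `max_df`, and `max_n_terms` settings are doing, here is a rough sketch of the same idea in plain Python. The `build_dtm` helper and the toy documents are hypothetical, and the sketch builds a dense list of lists rather than the sparse matrix that textacy and scikit-learn return, so treat it as an illustration of the filtering logic only.

```python
from collections import Counter

def build_dtm(tokenized_docs, min_df=1, max_df=1.0, max_n_terms=None):
    """Build a dense document-term count matrix from lists of tokens.

    min_df: minimum number of documents a term must appear in (a count).
    max_df: maximum fraction of documents a term may appear in.
    max_n_terms: keep only the terms with the highest document frequency.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df]
    if max_n_terms is not None:
        kept = sorted(kept, key=lambda t: (-doc_freq[t], t))[:max_n_terms]
    terms = sorted(kept)
    index = {term: i for i, term in enumerate(terms)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(terms)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return terms, matrix

# Toy documents, already tokenized (made up for illustration).
docs = [
    ["court", "tax", "court", "order"],
    ["court", "tax", "land"],
    ["court", "order", "patent"],
    ["court", "land", "tax"],
]
terms, matrix = build_dtm(docs, min_df=2, max_df=0.95)
print(terms)
for row in matrix:
    print(row)
```

In this toy example, "court" is dropped because it appears in every document, and "patent" is dropped because it appears in only one.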

We now have a matrix representation of our corpus, where rows are documents and columns (or features) are words from the corpus. The value at any given point is the number of times that the word appears in that document. Once we have a document-term matrix, we can do a few different things with it within textacy, or we can pass it to algorithms in scikit-learn or other libraries.
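
Raw counts are only one way to fill in those values. As a sketch of the tf-idf alternative mentioned earlier, the example below reweights a small count matrix using one common textbook variant, tf × log(N / df). The `tfidf` helper and the toy counts are hypothetical, and libraries typically use smoothed or normalized variants, so the numbers they produce will differ.

```python
import math

def tfidf(matrix):
    """Reweight a dense count matrix (rows = documents) with tf * log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency of each column: number of rows with a nonzero count.
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    # Terms found in every document get an idf of log(1) = 0.
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    return [[count * idf[j] for j, count in enumerate(row)] for row in matrix]

# Toy counts: 3 documents (rows) by 3 terms (columns), made up for illustration.
counts = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 0, 0],
]
weighted = tfidf(counts)
for row in weighted:
    print([round(value, 3) for value in row])
```

The first column appears in every document, so its weight drops to zero: under this weighting, a term that occurs everywhere tells us nothing about what distinguishes one document from another.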

In [52]:
# Let's first look at some of the terms
vectorizer.terms_list[:20]

['$',
 '1',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '18',
 '2',
 '28',
 '3',
 '4',
 '49',
 '5',
 '50',
 '6',
 '7',
 '8']

We can see that we are still getting terms that ought to have been filtered out, such as numbers and punctuation. In the future, we would want to clean the terms more thoroughly before vectorizing.
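
One simple way to do that extra cleanup is to filter each document's term list before it reaches the vectorizer. The sketch below is a minimal example under that assumption; `keep_term` is a hypothetical helper, and the sample list mixes a few of the terms shown above with some ordinary words.

```python
def keep_term(term):
    """Return True for terms that contain at least one letter and no digits."""
    has_letter = any(ch.isalpha() for ch in term)
    has_digit = any(ch.isdigit() for ch in term)
    return has_letter and not has_digit

# A small sample of terms for illustration.
raw_terms = ["$", "1", "10", "28", "court", "Co.", "tax", "United States"]
clean_terms = [t for t in raw_terms if keep_term(t)]
print(clean_terms)
```

A rule like this keeps abbreviations such as "Co." while dropping bare numbers and symbols. Whether that is the right rule depends on your research question, since in some corpora the numbers are meaningful.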

## Topic Modeling

Let's look quickly at one example of what we can do with a vectorized corpus. Topic modeling is very popular for semantic exploration of texts, and there are numerous implementations. Textacy uses implementations from scikit-learn.

Our corpus is rather small for topic modeling, but we'll go ahead just to see how it's done.

In [53]:
import textacy.tm

In [54]:
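# Set up an LDA topic model with 10 topics and fit it to the document-term matrix
model = textacy.tm.TopicModel("lda", n_topics=10)
model.fit(dtm)
# Each row is a document; each column is that document's weight for one topic
doc_topic_matrix = model.transform(dtm)
doc_topic_matrix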

array([[8.54527031e-01, 9.93312834e-05, 9.93343763e-05, 9.93325316e-05,
        1.43390413e-01, 1.38724040e-03, 9.93247920e-05, 9.93296181e-05,
        9.93286754e-05, 9.93339376e-05],
       [1.18233576e-04, 1.18232186e-04, 1.18227867e-04, 1.18228420e-04,
        1.18239670e-04, 1.18235261e-04, 1.18224544e-04, 1.18234067e-04,
        1.18229806e-04, 9.98935915e-01],
       [9.61105145e-01, 4.68476090e-05, 4.68482572e-05, 4.68494069e-05,
        4.68479850e-05, 4.68506001e-05, 3.85200621e-02, 4.68498615e-05,
        4.68496757e-05, 4.68493313e-05],
       [3.62427237e-04, 4.25212257e-01, 3.62427860e-04, 3.62399819e-04,
        3.62407394e-04, 5.71888450e-01, 3.62385025e-04, 3.62423311e-04,
        3.62408833e-04, 3.62413303e-04],
       [1.13773022e-01, 6.52063138e-05, 2.42076658e-02, 6.52084422e-05,
        7.89396867e-01, 6.86494683e-05, 6.52029686e-05, 7.22277627e-02,
        6.52071092e-05, 6.52083364e-05],
       [3.63996174e-05, 3.63992308e-05, 3.64005600e-05, 3.63998187e-05,
   

In [55]:
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
  print("topic", topic_idx, ":", "   ".join(top_terms))

topic 0 : States   United   United States   act   land   Act   title   claim   law   Congress
topic 1 : employee   work   hour   Act   safety   Labor   service   time   operation   Commission
topic 2 : court   petitioner   party   rule   Rule   fact   statement   action   time   trial
topic 3 : court   Board   law   search   States   state   New   United   York   State
topic 4 : tax   state   commerce   Co.   interstate   State   school   religious   New   Footnote
topic 5 : United   States   United States   court   Act   order   Government   Footnote   contempt   criminal
topic 6 : Indians   land   order   Act   Congress   finding   States   President   United States   United
topic 7 : court   rate   patent   Co.   price   judgment   state   question   decision   law
topic 8 : Illinois   claim   judgment   lien   Co.   state   Missouri   creditor   asset   State
topic 9 : Act   Commission   order   Congress   state   regulation   political   power   employee   section
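
A common next step is to ask which topic dominates each document. Here is a small sketch that does this for the first five rows of the document-topic matrix printed above, with the values rounded to three decimals and typed in by hand so the example runs on its own. The variable names are made up for illustration.

```python
# First five rows of the document-topic matrix shown above, rounded to 3 decimals.
doc_topic_rows = [
    [0.855, 0.000, 0.000, 0.000, 0.143, 0.001, 0.000, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.999],
    [0.961, 0.000, 0.000, 0.000, 0.000, 0.000, 0.039, 0.000, 0.000, 0.000],
    [0.000, 0.425, 0.000, 0.000, 0.000, 0.572, 0.000, 0.000, 0.000, 0.000],
    [0.114, 0.000, 0.024, 0.000, 0.789, 0.000, 0.000, 0.072, 0.000, 0.000],
]

# For each document, find the index of the topic with the largest weight.
dominant_topics = [max(range(len(row)), key=row.__getitem__)
                   for row in doc_topic_rows]

for doc_idx, topic_idx in enumerate(dominant_topics):
    weight = doc_topic_rows[doc_idx][topic_idx]
    print("doc", doc_idx, "-> topic", topic_idx, "with weight", weight)
```

With the full NumPy array from the notebook, `doc_topic_matrix.argmax(axis=1)` gives the same kind of result in one call. You can then compare each document's dominant topic against the top terms listed above to judge whether the topics make sense.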


## Evaluation survey
Please spend one minute answering these questions. Your feedback helps us improve future workshops.

https://go.ncsu.edu/dvs-eval

## Credits

Originally written by Scott Bailey and co-taught with Simon Wiles at Stanford Libraries. 