
# Week 7 - Information Extraction
This week, we move from arbitrary textual classification to the use of computation and linguistic models to parse precise claims from documents. Rather than focusing on simply the ideas in a corpus, here we focus on understanding and extracting its precise claims. This process involves a sequential pipeline of classifying and structuring tokens from text, each of which generates potentially useful data for the content analyst. Steps in this process, which we examine in this notebook, include: 1) tagging words by their part of speech (POS) to reveal the linguistic role they play in the sentence (e.g., Verb, Noun, Adjective, etc.); 2) tagging words as named entities (NER) such as places or organizations; 3) structuring or "parsing" sentences into nested phrases that are local to, describe or depend on one another; and 4) extracting informational claims from those phrases, like the Subject-Verb-Object (SVO) triples we extract here. While much of this can be done directly in the python package NLTK that we introduced in week 2, here we use NLTK bindings to the Stanford NLP group's open software, written in Java. Try typing a sentence into the online version here to get a sense of its potential. It is superior in performance to NLTK's implementations, but takes time to run, and so for these exercises we will parse and extract information for a very small text corpus. Of course, for final projects that draw on these tools, we encourage you to install the software on your own machines or shared servers at the university (RCC, SSRC) in order to perform these operations on much more text.

For this notebook we will be using the following packages:

In [2]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+git://github.com/Computational-Content-Analysis-2018/lucem_illud.git

#All these packages need to be installed from pip
#For NLP
import nltk

import numpy as np #For arrays
import pandas #Gives us DataFrames
import matplotlib.pyplot as plt #For graphics
import seaborn #Makes the graphics look nicer

#Displays the graphs
import graphviz #You also need to install the command line graphviz

#These are from the standard library
import os.path
import zipfile
import subprocess
import io
import tempfile

%matplotlib inline

You need to run this once to download everything, you will also need Java 1.8+ if you are using Windows or MacOS.

In [2]:
lucem_illud.setupStanfordNLP()

Starting downloads, this will take 5-10 minutes
../stanford-NLP/parser already exists, skipping download
../stanford-NLP/ner already exists, skipping download
../stanford-NLP/postagger already exists, skipping download
../stanford-NLP/core already exists, skipping download
Done setting up the Stanford NLP collection



We need to have stanford-NLP setup before importing, so we are doing the import here. IF you have stanford-NLP working, you can import at the beginning like you would with any other library.

In [3]:
import lucem_illud.stanford as stanford

The StanfordTokenizer will be deprecated in version 3.2.5.
Please use [91mnltk.tag.corenlp.CoreNLPPOSTagger[0m or [91mnltk.tag.corenlp.CoreNLPNERTagger[0m instead.
  super(StanfordNERTagger, self).__init__(*args, **kwargs)
The StanfordTokenizer will be deprecated in version 3.2.5.
Please use [91mnltk.tag.corenlp.CoreNLPPOSTagger[0m or [91mnltk.tag.corenlp.CoreNLPNERTagger[0m instead.
  super(StanfordPOSTagger, self).__init__(*args, **kwargs)


Open Information Extraction is a module packaged within the Stanford Core NLP package, but it is not yet supported by nltk. As a result, we have defining our own lucem_illud function that runs the Stanford Core NLP java code right here. For other projects, it is often useful to use Java or other programs (in C, C++) within a python workflow, and this is an example. stanford.openIE() takes in a string or list of strings and then produces as output all the subject, verb, object (SVO) triples Stanford Corenlp can find, as a DataFrame. You can do this through links to the Stanford Core NLP project that we provide here, or play with their interface directly (in the penultimate cell of this notebook), which produces data in "pretty graphics" like this example parsing of the first sentence in the "Shooting of Trayvon Martin" Wikipedia article:

First, we will illustrate these tools on some very short examples:

In [7]:
text = ['I saw the elephant in my pajamas.', 'The quick brown fox jumped over the lazy dog.', 'While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.', 'Trayvon Benjamin Martin was an African American from Miami Gardens, Florida, who, at 17 years old, was fatally shot by George Zimmerman, a neighborhood watch volunteer, in Sanford, Florida.', 'Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo']
tokenized_text = [nltk.word_tokenize(t) for t in text]
print('\n'.join(text))

I saw the elephant in my pajamas.
The quick brown fox jumped over the lazy dog.
While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.
Trayvon Benjamin Martin was an African American from Miami Gardens, Florida, who, at 17 years old, was fatally shot by George Zimmerman, a neighborhood watch volunteer, in Sanford, Florida.
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo


In [13]:
print(tokenized_text[2])

['While', 'in', 'France', ',', 'Christine', 'Lagarde', 'discussed', 'short-term', 'stimulus', 'efforts', 'in', 'a', 'recent', 'interview', 'with', 'the', 'Wall', 'Street', 'Journal', '.']


Part-of-Speech (POS) tagging
In POS tagging, we classify each word by its semantic role in a sentence. The Stanford POS tagger uses the Penn Treebank tag set to POS tag words from input sentences. As discussed in the second assignment, this is a relatively precise tagset, which allows more informative tags, and also more opportunities to err :-).

#.	Tag	Description
1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
16.	PDT	Predeterminer
17.	POS	Possessive ending
18.	PRP	Personal pronoun
19.	PRP\$	Possessive pronoun
20.	RB	Adverb
21.	RBR	Adverb, comparative
22.	RBS	Adverb, superlative
23.	RP	Particle
24.	SYM	Symbol
25.	TO	to
26.	UH	Interjection
27.	VB	Verb, base form
28.	VBD	Verb, past tense
29.	VBG	Verb, gerund or present participle
30.	VBN	Verb, past participle
31.	VBP	Verb, non-3rd person singular present
32.	VBZ	Verb, 3rd person singular present
33.	WDT	Wh-determiner
34.	WP	Wh-pronoun
35.	WP$	Possessive wh-pronoun
36.	WRB	Wh-adverb

In [12]:
pos_sents = stanford.postTagger.tag_sents(tokenized_text)
print(pos_sents)

[[('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('elephant', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('pajamas', 'NNS'), ('.', '.')], [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')], [('While', 'IN'), ('in', 'IN'), ('France', 'NNP'), (',', ','), ('Christine', 'NNP'), ('Lagarde', 'NNP'), ('discussed', 'VBD'), ('short-term', 'JJ'), ('stimulus', 'NN'), ('efforts', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('recent', 'JJ'), ('interview', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Wall', 'NNP'), ('Street', 'NNP'), ('Journal', 'NNP'), ('.', '.')], [('Trayvon', 'NNP'), ('Benjamin', 'NNP'), ('Martin', 'NNP'), ('was', 'VBD'), ('an', 'DT'), ('African', 'NNP'), ('American', 'NNP'), ('from', 'IN'), ('Miami', 'NNP'), ('Gardens', 'NNP'), (',', ','), ('Florida', 'NNP'), (',', ','), ('who', 'WP'), (',', ','), ('at', 'IN'), ('17', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('was', 'VBD'), ('fatally'

This looks quite good. Now we will try POS tagging with a somewhat larger corpus. We consider a few of the top posts from the reddit data we used last week.

In [12]:
redditDF = pandas.read_csv('../data/reddit.csv', index_col=0)

FileNotFoundError: File b'../data/reddit.csv' does not exist

Grabbing the 10 highest scoring posts and tokenizing the sentences. Once again, notice that we aren't going to do any kind of stemming this week (although semantic normalization may be performed where we translate synonyms into the same focal word).

In [None]:
redditTopScores = redditDF.sort_values('score')[-10:]
redditTopScores['sentences'] = redditTopScores['text'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
redditTopScores.index = range(len(redditTopScores) - 1, -1,-1) #Reindex to make things nice in the future
redditTopScores[-5:]

In [None]:
redditTopScores['POS_sents'] = redditTopScores['sentences'].apply(lambda x: stanford.postTagger.tag_sents(x))

In [None]:
redditTopScores['POS_sents']


And count the number of NN (nouns)

In [None]:
countTarget = 'NN'
targetCounts = {}
for entry in redditTopScores['POS_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != countTarget:
                continue
            elif ent in targetCounts:
                targetCounts[ent] += 1
            else:
                targetCounts[ent] = 1
sortedTargets = sorted(targetCounts.items(), key = lambda x: x[1], reverse = True)
sortedTargets[:20]

What about the number of top verbs (VB)?

In [None]:
countTarget = 'VB'
targetCounts = {}
for entry in redditTopScores['POS_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != countTarget:
                continue
            elif ent in targetCounts:
                targetCounts[ent] += 1
            else:
                targetCounts[ent] = 1
sortedTargets = sorted(targetCounts.items(), key = lambda x: x[1], reverse = True)
sortedTargets[:20]

What about the adjectives that modify the word, "computer"?

In [None]:

NTarget = 'JJ'
Word = 'computer'
NResults = set()
for entry in redditTopScores['POS_sents']:
    for sentence in entry:
        for (ent1, kind1),(ent2,kind2) in zip(sentence[:-1], sentence[1:]):
            if (kind1,ent2.lower())==(NTarget,Word):
                NResults.add(ent1)
            else:
                continue

print(NResults)

Evaluating POS tagger
We can check the POS tagger by running it on a manually tagged corpus and identifying a reasonable error metric.

In [None]:
treeBank = nltk.corpus.treebank
treeBank.tagged_sents()[0]

In [None]:
treeBank.sents()[0]

In [None]:
stanfordTags = stanford.postTagger.tag_sents(treeBank.sents()[:30])

And compare the two



In [1]:
NumDiffs = 0
for sentIndex in range(len(stanfordTags)):
    for wordIndex in range(len(stanfordTags[sentIndex])):
        if stanfordTags[sentIndex][wordIndex][1] != treeBank.tagged_sents()[sentIndex][wordIndex][1]:
            if treeBank.tagged_sents()[sentIndex][wordIndex][1] != '-NONE-':
                print("Word: {}  \tStanford: {}\tTreebank: {}".format(stanfordTags[sentIndex][wordIndex][0], stanfordTags[sentIndex][wordIndex][1], treeBank.tagged_sents()[sentIndex][wordIndex][1]))
                NumDiffs += 1
total = sum([len(s) for s in stanfordTags])
print("The Precision is {:.3f}%".format((total-NumDiffs)/total * 100))

NameError: name 'stanfordTags' is not defined

So we can see that the stanford POS tagger is quite good. Nevertheless, for a 20 word sentence, we only have a 66% chance ($1-.96^{20}$) of tagging (and later parsing) it correctly.


*Exercise 1*
In the cells immediately following, perform POS tagging on a meaningful (but modest) subset of a corpus associated with your final project. Examine the list of words associated with at least three different parts of speech. Consider conditional frequencies (e.g., adjectives associated with nouns of interest or adverbs with verbs of interest). What do these distributions suggest about your corpus?

In [3]:
#Load dataframe 
plos_df = pandas.read_pickle('data/plos_analysis/plos_normalized_sents_sample.pk1')

In [3]:
plos_df

Unnamed: 0,Article Contents,Copyright Year,Journal Title,Titles,tokenized_text,word_counts,normalized_tokens,normalized_tokens_count,tokenized_sents,normalized_sents
0,The study of animal communication is a complex...,2011,PLoS ONE,UV-Deprived Coloration Reduces Success in Mate...,"[The, study, of, animal, communication, is, a,...",2239,"[studi, anim, commun, complex, scienc, address...",1055,"[[The, study, of, animal, communication, is, a...","[[study, animal, communication, complex, scien..."
1,Aneurysms in general represent a Damocles swor...,2017,PLOS ONE,Metabolomic profiling of ascending thoracic ao...,"[Aneurysms, in, general, represent, a, Damocle...",5547,"[aneurysm, gener, repres, damocl, sword, class...",2663,"[[Aneurysms, in, general, represent, a, Damocl...","[[aneurysms, general, represent, damocles, swo..."
2,Prognostic information about life expectancy i...,2013,PLoS ONE,Predictive Value of a Profile of Routine Blood...,"[Prognostic, information, about, life, expecta...",4275,"[prognost, inform, life, expect, older, peopl,...",2011,"[[Prognostic, information, about, life, expect...","[[prognostic, information, life, expectancy, o..."
3,Interleukin (IL)-23 has been associated with t...,2017,PLOS ONE,Continuous IL-23 stimulation drives ILC3 deple...,"[Interleukin, (, IL, ), -23, has, been, associ...",5091,"[interleukin, il, ha, associ, develop, sever, ...",2413,"[[Interleukin, (, IL, ), -23, has, been, assoc...","[[interleukin, il, associated, development, se..."
4,Labor represents a stress test for the fetus. ...,2014,PLoS ONE,Assessment of Coupling between Trans-Abdominal...,"[Labor, represents, a, stress, test, for, the,...",3149,"[labor, repres, stress, test, fetu, inde, feta...",1543,"[[Labor, represents, a, stress, test, for, the...","[[labor, represents, stress, test, fetus], [in..."
5,Competition has long been recognized as a crit...,2014,PLoS ONE,Seaweed-Coral Interactions: Variance in Seawee...,"[Competition, has, long, been, recognized, as,...",6629,"[competit, ha, long, recogn, critic, process, ...",3029,"[[Competition, has, long, been, recognized, as...","[[competition, long, recognized, critical, pro..."
6,There were errors in the legend of Figure 11. ...,2012,PLoS ONE,Correction: The Zinc Dyshomeostasis Hypothesis...,"[There, were, errors, in, the, legend, of, Fig...",338,"[error, legend, figur, correct, figur, legend,...",169,"[[There, were, errors, in, the, legend, of, Fi...","[[errors, legend, figure], [correct, figure, l..."
7,Phylogeographic studies leverage spatial and g...,2017,PLOS Biology,A latitudinal phylogeographic diversity gradie...,"[Phylogeographic, studies, leverage, spatial, ...",11241,"[phylogeograph, studi, leverag, spatial, genet...",5706,"[[Phylogeographic, studies, leverage, spatial,...","[[phylogeographic, studies, leverage, spatial,..."
8,Insulin-like Growth Factor-1 (IGF-1) is a pote...,2012,PLoS ONE,E-Peptides Control Bioavailability of IGF-1,"[Insulin-like, Growth, Factor-1, (, IGF-1, ), ...",5290,"[growth, potent, peptid, factor, involv, broad...",2492,"[[Insulin-like, Growth, Factor-1, (, IGF-1, ),...","[[growth, potent, peptide, factor, involved, b..."
9,Despite the expansive development of targeted ...,2014,PLoS ONE,Advancements in the Development of HIF-1α-Acti...,"[Despite, the, expansive, development, of, tar...",6722,"[despit, expans, develop, target, cancer, ther...",3053,"[[Despite, the, expansive, development, of, ta...","[[despite, expansive, development, targeted, c..."


In [1]:
plos_df['normalized_sents'].iloc[0]

NameError: name 'plos_df' is not defined

In [11]:
pos_sents = stanford.postTagger.tag_sents(plos_df['tokenized_text'].iloc[0])
print(pos_sents)

[[('T', 'NN'), ('h', 'NN'), ('e', 'SYM')], [('s', 'NN'), ('t', 'NN'), ('u', 'NN'), ('d', 'NN'), ('y', 'NN')], [('o', 'NN'), ('f', 'SYM')], [('a', 'DT'), ('n', 'NN'), ('i', 'FW'), ('m', 'NN'), ('a', 'DT'), ('l', 'NN')], [('c', 'NN'), ('o', 'NN'), ('m', 'NN'), ('m', 'NN'), ('u', 'NN'), ('n', 'NN'), ('i', 'FW'), ('c', 'NN'), ('a', 'DT'), ('t', 'NN'), ('i', 'FW'), ('o', 'NN'), ('n', 'NN')], [('i', 'LS'), ('s', 'NN')], [('a', 'DT')], [('c', 'NN'), ('o', 'NN'), ('m', 'NN'), ('p', 'NN'), ('l', 'NN'), ('e', 'SYM'), ('x', 'NN')], [('s', 'NN'), ('c', 'NN'), ('i', 'FW'), ('e', 'LS'), ('n', 'NN'), ('c', 'NN'), ('e', 'SYM')], [('a', 'DT'), ('d', 'NN'), ('d', 'NN'), ('r', 'NN'), ('e', 'SYM'), ('s', 'NN'), ('s', 'VBZ'), ('i', 'LS'), ('n', 'NN'), ('g', 'NN')], [('a', 'DT')], [('w', 'FW'), ('i', 'FW'), ('d', 'FW'), ('e', 'SYM')], [('r', 'LS'), ('a', 'DT'), ('n', 'NN'), ('g', 'NN'), ('e', 'SYM')], [('o', 'NN'), ('f', 'SYM')], [('m', 'NN'), ('u', 'NN'), ('l', 'NN'), ('t', 'NN'), ('i', 'LS'), ('-', ':'), 