# Intro to Natural Language Processing with Python

## Info
- Scott Bailey (CIDR), *scottbailey@stanford.edu*
- Javier de la Rosa (CIDR), *versae@stanford.edu*


## What are we covering today?
- What is NLP?
- Options for NLP in Python
- Tokenization
- Part of Speech Tagging
- Named Entity Recognition
- Word transformations
- Readability indices

## Goals

By the end of the workshop, we hope you'll have a basic understanding of natural language processing, and enough familiarity with one NLP package, spaCy, to perform basic NLP tasks like tokenization and part-of-speech tagging. Through analyzing presidential speeches, we also hope you'll understand how these basic tasks open up a number of possibilities for textual analysis, such as readability indices.

## What is NLP?

NLP stands for Natural Language Processing, and it involves a huge variety of tasks, such as:
- Automatic summarization.
- Coreference resolution.
- Discourse analysis.
- Machine translation.
- Morphological segmentation.
- Named entity recognition.
- Natural language understanding.
- Part-of-speech tagging.
- Parsing.
- Question answering.
- Relationship extraction.
- Sentiment analysis.
- Speech recognition.
- Topic segmentation.
- Word segmentation.
- Word sense disambiguation.
- Information retrieval.
- Information extraction.
- Speech processing.

One of the key ideas is to be able to process text without reading it.

## NLP in Python

Python ships with a mature regular expression library (`re`), which is a basic building block of natural language processing. However, more advanced tasks need dedicated libraries. Until recently, the Natural Language Toolkit, abbreviated as `NLTK`, was the only practical choice in the Python ecosystem. Now, though, there are a number of choices based on different technologies and approaches.
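As a small illustration of what regular expressions alone can do, here is a minimal sketch that pulls quantities out of a short excerpt from the Clinton speech used later in this workshop. The pattern and the variable names `excerpt` and `quantities` are illustrative assumptions, not part of the workshop code.

```python
import re

# Short excerpt taken from the opening of the speech analyzed below.
excerpt = ("We begin the new century with over 20 million new jobs; "
           "the fastest economic growth in more than 30 years; "
           "the lowest unemployment rates in 30 years")

# Capture each run of digits together with the word that follows it.
quantities = re.findall(r"(\d+)\s+(\w+)", excerpt)
print(quantities)
```

A pattern like this knows nothing about grammar or meaning, which is why tasks such as tagging and entity recognition need the libraries described next.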

We'll use a relatively recent option called `spaCy`. It is much faster than NLTK because it is written in Cython, a Python-like language that compiles to C and is optimized for speed.

Both of these libraries are complex, and wrappers exist to simplify their APIs. The two most popular are `TextBlob`, for NLTK and CLiPS Pattern, and `textacy`, for spaCy. In this workshop we will be using spaCy, with a touch of textacy thrown in at the very end.

In [14]:
!pip install spacy



In [15]:
import spacy

In [16]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/lisa/anaconda3/lib/python3.7/site-packages/en_core_web_sm -->
/Users/lisa/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [17]:
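# Note: the 'en' shortcut works in spaCy 2.x; in spaCy 3.x, load the model
# by its full name instead, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')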

In [18]:
# helper functions
import requests

def get_text(url):
    return requests.get(url).text

def get_speech(url):
    page = get_text(url)
    full_text = page.split('\n')
    return " ".join(full_text[2:])

In [19]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/clinton2000.txt"
clinton_speech = get_speech(clinton_url)
clinton_speech

'Mr. Speaker, Mr. Vice President, members of Congress, honored guests, my fellow Americans:  We are fortunate to be alive at this moment in history. Never before has our nation enjoyed, at once, so much prosperity and social progress with so little internal crisis and so few external threats. Never before have we had such a blessed opportunity and, therefore, such a profound obligation to build the more perfect Union of our Founders’ dreams.  We begin the new century with over 20 million new jobs; the fastest economic growth in more than 30 years; the lowest unemployment rates in 30 years; the lowest poverty rates in 20 years; the lowest African-American and Hispanic unemployment rates on record; the first back-to-back surpluses in 42 years; and next month, America will achieve the longest period of economic growth in our entire history. We have built a new economy.  And our economic revolution has been matched by a revival of the American spirit: crime down by 20 percent, to its lowes

In [7]:
doc = nlp(clinton_speech)

## Tokenization

In NLP, the act of splitting text is called tokenization, and each of the individual chunks is called a token. Therefore, we can talk about word tokenization or sentence tokenization depending on what it is that we need to divide the text into.
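To make the idea concrete before turning to spaCy, here is a minimal sketch of word and sentence tokenization using only Python's built-in `re` module. The regular expressions and the helper names `word_tokenize` and `sentence_tokenize` are assumptions made for illustration; this is not how spaCy tokenizes.

```python
import re

def word_tokenize(text):
    # Hypothetical helper: a token is either a run of letters, digits,
    # and apostrophes, or a single punctuation character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def sentence_tokenize(text):
    # Hypothetical helper: split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "We have built a new economy. Mr. Speaker, thank you!"
print(word_tokenize(sample))
print(sentence_tokenize(sample))
```

The naive sentence splitter breaks after "Mr." and reports three sentences instead of two. Handling cases like this is one reason to rely on a trained pipeline rather than hand-written rules.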

In [None]:
# word level
for token in doc:
    print(token.text)

In [None]:
# sentence level
for sent in doc.sents:
    print(sent)

In [None]:
# noun phrases
for phrase in doc.noun_chunks:
    print(phrase)

## Part of Speech Tagging

spaCy also performs part-of-speech tagging out of the box: it labels each token with its grammatical category, such as noun, verb, or adjective.

In [20]:
# simple part of speech tag
for token in doc:
    print(token.text, token.pos_)

Mr. PROPN
Speaker PROPN
, PUNCT
Mr. PROPN
Vice PROPN
President PROPN
, PUNCT
members NOUN
of ADP
Congress PROPN
, PUNCT
honored VERB
guests NOUN
, PUNCT
my DET
fellow ADJ
Americans PROPN
: PUNCT
  SPACE
We PRON
are VERB
fortunate ADJ
to PART
be VERB
alive ADJ
at ADP
this DET
moment NOUN
in ADP
history NOUN
. PUNCT
Never ADV
before ADV
has VERB
our DET
nation NOUN
enjoyed VERB
, PUNCT
at ADP
once ADV
, PUNCT
so ADV
much ADJ
prosperity NOUN
and CCONJ
social ADJ
progress NOUN
with ADP
so ADV
little ADJ
internal ADJ
crisis NOUN
and CCONJ
so ADV
few ADJ
external ADJ
threats NOUN
. PUNCT
Never ADV
before ADV
have VERB
we PRON
had VERB
such DET
a DET
blessed ADJ
opportunity NOUN
and CCONJ
, PUNCT
therefore ADV
, PUNCT
such DET
a DET
profound ADJ
obligation NOUN
to PART
build VERB
the DET
more ADV
perfect ADJ
Union PROPN
of ADP
our DET
Founders NOUN
’ PART
dreams NOUN
. PUNCT
  SPACE
We PRON
begin VERB
the DET
new ADJ
century NOUN
with ADP
over ADP
20 NUM
million NUM
new ADJ
jobs NOUN
; PUNCT


on ADP
Earth PROPN
. PUNCT
We PRON
will VERB
pay VERB
off PART
our DET
national ADJ
debt NOUN
for ADP
the DET
first ADJ
time NOUN
since ADP
1835 NUM
. PUNCT
* PUNCT
We PRON
will VERB
bring VERB
prosperity NOUN
to ADP
every DET
American ADJ
community NOUN
. PUNCT
We PRON
will VERB
reverse VERB
the DET
course NOUN
of ADP
climate NOUN
change NOUN
and CCONJ
leave VERB
a DET
safer ADJ
, PUNCT
cleaner ADJ
planet NOUN
. PUNCT
America PROPN
will VERB
lead VERB
the DET
world NOUN
toward ADP
shared VERB
peace NOUN
and CCONJ
prosperity NOUN
and CCONJ
the DET
far ADJ
frontiers NOUN
of ADP
science NOUN
and CCONJ
technology NOUN
. PUNCT
And CCONJ
we PRON
will VERB
become VERB
at ADP
last ADJ
what PRON
our DET
Founders NOUN
pledged VERB
us PRON
to PART
be VERB
so ADV
long ADV
ago ADV
: PUNCT
  SPACE
* PUNCT
White PROPN
House PROPN
correction NOUN
. PUNCT
  SPACE
One NUM
nation NOUN
, PUNCT
under ADP
God PROPN
, PUNCT
indivisible ADJ
, PUNCT
with ADP
liberty NOUN
and CCONJ
justice NOUN
for ADP
all DET

move VERB
a DET
long ADJ
way NOUN
toward ADP
making VERB
sure ADJ
every DET
child NOUN
starts VERB
school NOUN
ready ADJ
to PART
learn VERB
and CCONJ
graduates NOUN
ready ADJ
to PART
succeed VERB
. PUNCT
  SPACE
We PRON
also ADV
need VERB
a DET
21st ADJ
century NOUN
revolution NOUN
to PART
reward VERB
work NOUN
and CCONJ
strengthen VERB
families NOUN
by ADP
giving VERB
every DET
parent NOUN
the DET
tools NOUN
to PART
succeed VERB
at ADP
work NOUN
and CCONJ
at ADP
the DET
most ADV
important ADJ
work NOUN
of ADP
all DET
, PUNCT
raising VERB
children NOUN
. PUNCT
That DET
means VERB
making VERB
sure ADJ
every DET
family NOUN
has VERB
health NOUN
care NOUN
and CCONJ
the DET
support NOUN
to PART
care VERB
for ADP
aging VERB
parents NOUN
, PUNCT
the DET
tools NOUN
to PART
bring VERB
their DET
children NOUN
up ADV
right ADV
, PUNCT
and CCONJ
that ADP
no DET
child NOUN
grows VERB
up PART
in ADP
poverty NOUN
. PUNCT
  SPACE
From ADP
my DET
first ADJ
days NOUN
as ADP
President PROPN
, PUNCT
we P

being VERB
here ADV
tonight NOUN
. PUNCT
Stand VERB
up PART
, PUNCT
Carlos PROPN
. PUNCT
[ PUNCT
Applause NOUN
] PUNCT
Thank VERB
you PRON
. PUNCT
  SPACE
If ADP
there ADV
is VERB
any DET
single ADJ
issue NOUN
on ADP
which DET
we PRON
should VERB
be VERB
able ADJ
to PART
reach VERB
across ADP
party NOUN
lines NOUN
, PUNCT
it PRON
is VERB
in ADP
our DET
common ADJ
commitment NOUN
to PART
reward VERB
work NOUN
and CCONJ
strengthen VERB
families NOUN
. PUNCT
Just ADV
remember VERB
what PRON
we PRON
did VERB
last ADJ
year NOUN
. PUNCT
We PRON
came VERB
together ADV
to PART
help VERB
people NOUN
with ADP
disabilities NOUN
keep VERB
their DET
health NOUN
insurance NOUN
when ADV
they PRON
go VERB
to ADP
work NOUN
. PUNCT
And CCONJ
I PRON
thank VERB
you PRON
for ADP
that DET
. PUNCT
Thanks NOUN
to ADP
overwhelming ADJ
bipartisan ADJ
support NOUN
from ADP
this DET
Congress PROPN
, PUNCT
we PRON
have VERB
improved VERB
foster ADJ
care NOUN
. PUNCT
We PRON
’ve VERB
helped VERB
those DET
young ADJ

put VERB
them PRON
to PART
work VERB
. PUNCT
For ADP
business NOUN
, PUNCT
it PRON
’s PROPN
the DET
smart ADJ
thing NOUN
to PART
do VERB
. PUNCT
For ADP
America PROPN
, PUNCT
it PRON
’s PROPN
the DET
right ADJ
thing NOUN
to PART
do VERB
. PUNCT
And CCONJ
let VERB
me PRON
ask VERB
you PRON
something NOUN
: PUNCT
If ADP
we PRON
do VERB
n’t ADV
do VERB
this DET
now ADV
, PUNCT
when ADV
in ADP
the DET
wide ADJ
world NOUN
will VERB
we PRON
ever ADV
get VERB
around PART
to ADP
it PRON
? PUNCT
  SPACE
So ADV
I PRON
ask VERB
Congress PROPN
to PART
give VERB
businesses NOUN
the DET
same ADJ
incentives NOUN
to PART
invest VERB
in ADP
America PROPN
’s PROPN
new ADJ
markets NOUN
they PRON
now ADV
have VERB
to PART
invest VERB
in ADP
markets NOUN
overseas ADV
. PUNCT
Tonight NOUN
I PRON
propose VERB
a DET
large ADJ
new ADJ
markets NOUN
tax NOUN
credit NOUN
and CCONJ
other ADJ
incentives NOUN
to PART
spur VERB
$ SYM
22 NUM
billion NUM
in ADP
private ADJ
- PUNCT
sector NOUN
capital NOUN
to PART
creat

every DET
conflict NOUN
or CCONJ
stop VERB
every DET
outrage NOUN
. PUNCT
But CCONJ
where ADV
our DET
interests NOUN
are VERB
at ADP
stake NOUN
and CCONJ
we PRON
can VERB
make VERB
a DET
difference NOUN
, PUNCT
we PRON
should VERB
be VERB
, PUNCT
and CCONJ
we PRON
must VERB
be VERB
, PUNCT
peacemakers NOUN
. PUNCT
  SPACE
We PRON
should VERB
be VERB
proud ADJ
of ADP
our DET
role NOUN
in ADP
bringing VERB
the DET
Middle PROPN
East PROPN
closer ADV
to ADP
a DET
lasting VERB
peace NOUN
, PUNCT
building VERB
peace NOUN
in ADP
Northern PROPN
Ireland PROPN
, PUNCT
working VERB
for ADP
peace NOUN
in ADP
East PROPN
Timor PROPN
and CCONJ
Africa PROPN
, PUNCT
promoting VERB
reconciliation NOUN
between ADP
Greece PROPN
and CCONJ
Turkey PROPN
and CCONJ
in ADP
Cyprus PROPN
, PUNCT
working VERB
to PART
defuse VERB
these DET
crises NOUN
between ADP
India PROPN
and CCONJ
Pakistan PROPN
, PUNCT
in ADP
defending VERB
human ADJ
rights NOUN
and CCONJ
religious ADJ
freedom NOUN
. PUNCT
And CCONJ
we PRON
sh

TB PROPN
, PUNCT
and CCONJ
AIDS PROPN
. PUNCT
I PRON
ask VERB
the DET
private ADJ
sector NOUN
and CCONJ
our DET
partners NOUN
around ADP
the DET
world NOUN
to PART
join VERB
us PRON
in ADP
embracing VERB
this DET
cause NOUN
. PUNCT
We PRON
can VERB
save VERB
millions NOUN
of ADP
lives NOUN
together ADV
, PUNCT
and CCONJ
we PRON
ought VERB
to PART
do VERB
it PRON
. PUNCT
  SPACE
I PRON
also ADV
want VERB
to PART
mention VERB
our DET
final ADJ
challenge NOUN
, PUNCT
which DET
, PUNCT
as ADP
always ADV
, PUNCT
is VERB
the DET
most ADV
important ADJ
. PUNCT
I PRON
ask VERB
you PRON
to PART
pass VERB
a DET
national ADJ
security NOUN
budget NOUN
that DET
keeps VERB
our DET
military NOUN
the DET
best ADV
trained VERB
and CCONJ
best ADV
equipped VERB
in ADP
the DET
world NOUN
, PUNCT
with ADP
heightened VERB
readiness NOUN
and CCONJ
21st ADJ
century NOUN
weapons NOUN
, PUNCT
which DET
raises VERB
salaries NOUN
for ADP
our DET
service NOUN
men NOUN
and CCONJ
women NOUN
, PUNCT
which DET
protect

blueprint NOUN
of ADP
life NOUN
. PUNCT
It PRON
is VERB
important ADJ
for ADP
all DET
our DET
fellow ADJ
Americans PROPN
to PART
recognize VERB
that DET
federal ADJ
tax NOUN
dollars NOUN
have VERB
funded VERB
much ADJ
of ADP
this DET
research NOUN
and CCONJ
that ADP
this DET
and CCONJ
other ADJ
wise ADJ
investments NOUN
in ADP
science NOUN
are VERB
leading VERB
to ADP
a DET
revolution NOUN
in ADP
our DET
ability NOUN
to PART
detect VERB
, PUNCT
treat VERB
, PUNCT
and CCONJ
prevent VERB
disease NOUN
. PUNCT
  SPACE
For ADP
example NOUN
, PUNCT
researchers NOUN
have VERB
identified VERB
genes NOUN
that DET
cause VERB
Parkinson PROPN
’s PROPN
, PUNCT
diabetes NOUN
, PUNCT
and CCONJ
certain ADJ
kinds NOUN
of ADP
cancer NOUN
. PUNCT
They PRON
are VERB
designing VERB
precision NOUN
therapies NOUN
that DET
will VERB
block VERB
the DET
harmful ADJ
effect NOUN
of ADP
these DET
genes NOUN
for ADP
good NOUN
. PUNCT
Researchers NOUN
already ADV
are VERB
using VERB
this DET
new ADJ
technique NOUN
t

hope NOUN
and CCONJ
expectation NOUN
and CCONJ
excitement NOUN
for ADP
our DET
nation NOUN
. PUNCT
But CCONJ
tonight NOUN
is VERB
very ADV
special ADJ
, PUNCT
because ADP
we PRON
stand VERB
on ADP
the DET
mountaintop NOUN
of ADP
a DET
new ADJ
millennium NOUN
. PUNCT
Behind ADP
us PRON
we PRON
can VERB
look VERB
back ADV
and CCONJ
see VERB
the DET
great ADJ
expanse NOUN
of ADP
American ADJ
achievement NOUN
, PUNCT
and CCONJ
before ADP
us PRON
we PRON
can VERB
see VERB
even ADV
greater ADJ
, PUNCT
grander NOUN
frontiers NOUN
of ADP
possibility NOUN
. PUNCT
We PRON
should VERB
, PUNCT
all DET
of ADP
us PRON
, PUNCT
be VERB
filled VERB
with ADP
gratitude NOUN
and CCONJ
humility NOUN
for ADP
our DET
present ADJ
progress NOUN
and CCONJ
prosperity NOUN
. PUNCT
We PRON
should VERB
be VERB
filled VERB
with ADP
awe NOUN
and CCONJ
joy NOUN
at ADP
what PRON
lies VERB
over ADP
the DET
horizon NOUN
. PUNCT
And CCONJ
we PRON
should VERB
be VERB
filled VERB
with ADP
absolute ADJ
determination NOUN
to 

In [None]:
# detailed tag
# For what these tags mean, you might check out http://www.clips.ua.ac.be/pages/mbsp-tags
for token in doc:
    print(token.text, token.tag_)

In [26]:
# syntactic dependency
for token in doc:
    print(token.text, token.dep_)

Mr. compound
Speaker ROOT
, punct
Mr. compound
Vice compound
President appos
, punct
members appos
of prep
Congress pobj
, punct
honored amod
guests appos
, punct
my poss
fellow amod
Americans appos
: punct
  
We nsubj
are ROOT
fortunate acomp
to aux
be xcomp
alive acomp
at prep
this det
moment pobj
in prep
history pobj
. punct
Never neg
before advmod
has aux
our poss
nation nsubj
enjoyed ROOT
, punct
at prep
once pcomp
, punct
so advmod
much amod
prosperity pobj
and cc
social amod
progress conj
with prep
so advmod
little amod
internal amod
crisis pobj
and cc
so advmod
few amod
external amod
threats conj
. punct
Never neg
before advmod
have aux
we nsubj
had ROOT
such predet
a det
blessed amod
opportunity dobj
and cc
, punct
therefore advmod
, punct
such predet
a det
profound amod
obligation appos
to aux
build acl
the det
more advmod
perfect amod
Union dobj
of prep
our poss
Founders compound
’ compound
dreams pobj
. punct
  
We nsubj
begin ccomp
the det
new amod
century dobj
with prep
o

after conj
- punct
school pobj
, punct
the det
best advmod
trained amod
teachers appos
in prep
the det
classroom pobj
, punct
and cc
college compound
opportunities conj
for prep
all predet
our poss
children pobj
. punct
  
For prep
seven nummod
years pobj
now advmod
, punct
we nsubj
’ve appos
worked ROOT
hard advmod
to aux
improve advcl
our poss
schools dobj
, punct
with prep
opportunity pobj
and cc
responsibility conj
, punct
investing advcl
more advmod
but cc
demanding conj
more advmod
in prep
turn pobj
. punct
Reading advcl
, punct
math conj
, punct
college compound
entrance compound
scores nsubj
are ROOT
up advmod
. punct
Some nsubj
of prep
the det
most advmod
impressive amod
gains pobj
are ROOT
in prep
schools pobj
in prep
very advmod
poor amod
neighborhoods pobj
. punct
  
But cc
all det
successful amod
schools nsubj
have aux
followed ROOT
the det
same amod
proven amod
formula dobj
: punct
higher amod
standards appos
, punct
more amod
accountability conj
, punct
and cc
extra amod

  
Lifesaving csubj
drugs dobj
are ROOT
an det
indispensable amod
part attr
of prep
modern amod
medicine pobj
. punct
No det
one nsubj
creating acl
a det
Medicare compound
program dobj
today npadvmod
would aux
even advmod
think ROOT
of prep
excluding pcomp
coverage dobj
for prep
prescription compound
drugs pobj
. punct
Yet advmod
more amod
than quantmod
three nsubj
in prep
five pobj
of prep
our poss
seniors pobj
now advmod
lack ROOT
dependable amod
drug compound
coverage dobj
which nsubj
can aux
lengthen relcl
and cc
enrich conj
their poss
lives dobj
. punct
Millions nsubj
of prep
older amod
Americans pobj
, punct
who nsubj
need relcl
prescription compound
drugs dobj
the det
most advmod
, punct
pay ROOT
the det
highest amod
prices dobj
for prep
them pobj
. punct
In prep
good amod
conscience pobj
, punct
we nsubj
can aux
not neg
let ROOT
another det
year nsubj
pass ccomp
without prep
extending pcomp
to prep
all det
our poss
seniors pobj
this det
lifeline dobj
of prep
affordable amod
pre

the det
idea dobj
behind prep
the det
Individual compound
Development compound
Accounts pobj
, punct
the det
IDAs appos
. punct
I nsubj
ask ROOT
you dobj
to aux
take xcomp
that det
idea dobj
to prep
a det
new amod
level pobj
, punct
with prep
new amod
retirement compound
savings compound
accounts pobj
that nsubj
enable relcl
every det
low amod
and cc
moderate conj
income compound
family dobj
in prep
America pobj
to aux
save relcl
for prep
retirement pobj
, punct
a det
first amod
home appos
, punct
a det
medical amod
emergency conj
, punct
or cc
a det
college compound
education conj
. punct
I nsubj
propose ROOT
to aux
match xcomp
their poss
contributions dobj
, punct
however advmod
small amod
, punct
dollar dep
for prep
dollar pobj
, punct
every det
year npadvmod
they nsubj
save relcl
. punct
And cc
I nsubj
propose ROOT
to aux
give xcomp
a det
major amod
new amod
tax compound
credit dobj
to dative
any det
small amod
business pobj
that nsubj
will aux
provide relcl
a det
meaningful amod
p

steps conj
to aux
keep advcl
guns dobj
out prep
of prep
the det
wrong amod
hands pobj
, punct
to aux
keep advcl
our poss
children dobj
safe oprd
. punct
  
You nsubj
know parataxis
, punct
every det
parent nsubj
I nsubj
know parataxis
worries ROOT
about prep
the det
impact pobj
of prep
violence pobj
in prep
the det
media pobj
on prep
their poss
children pobj
. punct
I nsubj
want ROOT
to aux
begin xcomp
by prep
thanking pcomp
the det
entertainment compound
industry dobj
for prep
accepting pcomp
my poss
challenge dobj
to aux
put advcl
voluntary amod
ratings dobj
on prep
TV compound
programs pobj
and cc
video nmod
and cc
Internet conj
games conj
. punct
But cc
frankly advmod
, punct
the det
ratings nsubj
are ROOT
too advmod
numerous acomp
, punct
diverse conj
, punct
and cc
confusing conj
to aux
be xcomp
really advmod
useful acomp
to prep
parents pobj
. punct
So advmod
tonight npadvmod
I nsubj
ask ROOT
the det
industry dobj
to aux
accept xcomp
the det
First compound
Lady compound
’s compo

has aux
changed ccomp
in prep
the det
past amod
decade pobj
: punct
5,000 nummod
former amod
Soviet amod
nuclear amod
weapons nsubj
taken acl
out prep
of prep
commission pobj
; punct
Russian amod
soldiers nsubj
actually advmod
serving ccomp
with prep
ours pobj
in prep
the det
Balkans pobj
; punct
Russian amod
people nsubj
electing ROOT
their poss
leaders dobj
for prep
the det
first amod
time pobj
in prep
1,000 nummod
years pobj
; punct
and cc
in conj
China pobj
, punct
an det
economy appos
more advmod
open amod
to prep
the det
world pobj
than prep
ever advmod
before pcomp
. punct
  
Of advmod
course advmod
, punct
no det
one nsubj
, punct
not neg
a det
single amod
person appos
in prep
this det
chamber pobj
tonight npadvmod
can aux
know ROOT
for prep
sure amod
what det
direction dobj
these det
great amod
nations nsubj
will aux
take ccomp
. punct
But cc
we nsubj
do aux
know ROOT
for prep
sure amod
that mark
we nsubj
can aux
choose ccomp
what dobj
we nsubj
do ccomp
. punct
And cc
we nsubj

what dobj
you nsubj
did pcomp
and cc
ask conj
you dobj
to aux
stay xcomp
the det
course dobj
. punct
  
I nsubj
also advmod
want ROOT
to aux
say xcomp
that mark
America nsubj
must aux
help ccomp
more amod
nations dobj
to aux
break xcomp
the det
bonds dobj
of prep
disease pobj
. punct
Last amod
year npadvmod
in prep
Africa pobj
, punct
10 nummod
times quantmod
as advmod
many amod
people nsubj
died ROOT
from prep
AIDS pobj
as mark
were auxpass
killed advcl
in prep
wars—10 amod
times pobj
. punct
The det
budget nsubj
I nsubj
give relcl
you dative
invests ROOT
$ quantmod
150 compound
million npadvmod
more advmod
in prep
the det
fight pobj
against prep
this pobj
and cc
other amod
infectious amod
killers conj
. punct
And cc
today npadvmod
I nsubj
propose ROOT
a det
tax compound
credit dobj
to aux
speed xcomp
the det
development dobj
of prep
vaccines pobj
for prep
diseases pobj
like prep
malaria pobj
, punct
TB conj
, punct
and cc
AIDS conj
. punct
I nsubj
ask ROOT
the det
private amod
sector

droughts conj
will aux
become ROOT
more advmod
frequent acomp
, punct
coastal amod
areas nsubj
will aux
flood conj
, punct
and cc
economies nsubjpass
will aux
be auxpass
disrupted conj
. punct
That nsubj
is aux
going ROOT
to aux
happen xcomp
, punct
unless mark
we nsubj
act advcl
. punct
  
Many amod
people nsubj
in prep
the det
United compound
States pobj
, punct
some det
people appos
in prep
this det
chamber pobj
, punct
and cc
lots conj
of prep
folks pobj
around prep
the det
world pobj
still advmod
believe ROOT
you nsubj
can aux
not neg
cut ccomp
greenhouse compound
gas compound
emissions dobj
without prep
slowing pcomp
economic amod
growth dobj
. punct
In prep
the det
industrial amod
age pobj
, punct
that nsubj
may aux
well advmod
have aux
been ROOT
true acomp
. punct
But cc
in prep
this det
digital amod
economy pobj
, punct
it nsubj
is ROOT
not neg
true acomp
anymore advmod
. punct
New amod
technologies nsubj
make ROOT
it nsubj
possible ccomp
to aux
cut advcl
harmful amod
emission

fully advmod
participate acl
in prep
our poss
community pobj
. punct
That nsubj
’s ROOT
why advmod
I nsubj
recommend ccomp
spending xcomp
more dobj
to aux
teach advcl
them dative
civics dobj
and cc
English conj
. punct
And cc
since prep
everybody pobj
in prep
our poss
community compound
counts pobj
, punct
we nsubj
’ve appos
got ROOT
to aux
make xcomp
sure ccomp
everyone nsubjpass
is auxpass
counted ccomp
in prep
this det
year pobj
’s compound
census pobj
. punct
  
Within prep
10 nummod
years pobj
— ROOT
just advmod
10 nummod
years npadvmod
— ROOT
there expl
will aux
be ccomp
no det
majority compound
race attr
in prep
our poss
largest amod
state pobj
of prep
California pobj
. punct
In prep
a quantmod
little quantmod
more amod
than quantmod
50 nummod
years pobj
, punct
there expl
will aux
be ROOT
no det
majority compound
race attr
in prep
America pobj
. punct
In prep
a det
more advmod
interconnected amod
world pobj
, punct
this det
diversity nsubj
can aux
be ROOT
our poss
greatest amod

In [27]:
# visualizing the sentence
from spacy import displacy

In [22]:
first_sent = list(doc.sents)[0]
first_sent

Mr. Speaker, Mr. Vice President, members of Congress, honored guests, my fellow Americans:  

In [23]:
single_doc = nlp(str(first_sent))
options = {"compact": True, 'bg': '#09a3d5',
           'color': 'white', 'font': 'Source Sans Pro'}
displacy.render(single_doc, style="dep", jupyter=True, options=options)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `count_chars(text)` that receives `text` and returns the total number of characters ignoring spaces and punctuation marks. For example, `count_chars("Well, I am not 30 years old.")` should return `20`.
<br/>
* **Hint**: You could count the characters in the words.*
</p>
</div>

In [None]:
# Solution using two functions, one to get just words without punct, one to get chars
def return_words(doc):
    return [token.text for token in doc if token.pos_ != 'PUNCT']

def count_chars(words):
    return sum(len(w) for w in words)

# count_chars("Well, I am not 30 years old.")
words = return_words(nlp("Well, I am not 30 years old."))
count_chars(words)
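As a cross-check that does not depend on the tagger, the same count can be sketched in plain Python by keeping only letters and digits. The helper name `count_chars_plain` is an assumption for illustration and is not part of the workshop solution.

```python
def count_chars_plain(text):
    # Hypothetical helper: count only alphanumeric characters, which skips
    # spaces and punctuation without needing a spaCy Doc.
    return sum(1 for ch in text if ch.isalnum())

print(count_chars_plain("Well, I am not 30 years old."))
```

If the two approaches disagree on a text, the difference usually comes from how the tokenizer treats symbols, whitespace tokens, or contractions.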

## Named Entity Recognition 

In [9]:
# https://spacy.io/api/annotation#named-entities
# trained on OntoNotes corpus
for ent in doc.ents:
    print(ent.text, ent.label_)

Speaker PERSON
Congress ORG
Americans NORP
Union of our Founders ORG
the new century DATE
over 20 million CARDINAL
more than 30 years DATE
30 years DATE
20 years DATE
African-American NORP
Hispanic NORP
first ORDINAL
42 years DATE
next month DATE
America GPE
American NORP
20 percent PERCENT
25 years DATE
seven years in a row DATE
30 percent PERCENT
half CARDINAL
30 years DATE
Americans NORP
Union ORG
American NORP
Eight years ago DATE
Americans NORP
the year 2000 DATE
America GPE
Americans NORP
Americans NORP
40 years DATE
100,000 CARDINAL
Brady PERSON
half a million CARDINAL
20 million CARDINAL
Americans NORP
150,000 CARDINAL
Americans NORP
AmeriCorps ORG
1992 DATE
Today DATE
America GPE
Americans NORP
the 21st century DATE
a 21st century DATE
American NORP
the last century DATE
Theodore Roosevelt PERSON
one CARDINAL
tonight TIME
America GPE
America GPE
Americans NORP
America GPE
Earth LOC
first ORDINAL
1835 DATE
American NORP
America GPE
White House ORG
One CARDINAL
this year DATE
th

In [None]:
# If you're working on tokens, you can still access entity type
# Notice, though, that multi-word entities are broken up here because we're iterating over tokens
# https://spacy.io/api/annotation#named-entities
for token in doc:
    if token.ent_type_ != '':
        print(token.text, token.ent_type_, "----------", spacy.explain(token.ent_type_))

In [24]:
# spacy comes with built in entity visualization
displacy.render(single_doc, style="ent", jupyter=True)

In [25]:
next_sent = list(doc.sents)[3]
next_doc = nlp(str(next_sent))
displacy.render(next_doc, style="ent", jupyter=True)

It is possible to train your own entity recognition model, and to train other types of models in spaCy, but you need sufficient labeled data to make it work well.

## Word transformations

In [None]:
# lemmas
for token in doc:
    print(token.text, token.lemma_)

In [None]:
doc1 = nlp('here are octopi')
for token in doc1:
    print(token.lemma_)

In [None]:
doc1 = nlp('There have been many mice and geese surrounding the pond.')
for token in doc1:
    print(token, token.lemma_)

In [None]:
# say we just want to lemmatize verbs; note that the detailed tag "VBP"
# only matches non-third-person singular present-tense verbs
for token in doc:
    if token.tag_ == "VBP":
        print(token.text, token.lemma_)

In [None]:
# If you're using the simple part of speech instead of the tags
for token in doc:
    if token.pos_ == "VERB":
        print(token.text, token.lemma_)

In [None]:
# lowercasing
for token in doc:
    print(token.text, token.lower_)

## Counting

In [None]:
from collections import Counter

In [None]:
sample_sents = "One fish, two fish, red fish, blue fish. One is less than two."

In [None]:
# Create a spacy doc
new_doc = nlp(sample_sents)

# Create a list of the words without the punctuation
words = [token.text for token in new_doc if token.pos_ != 'PUNCT']
words

In [None]:
counter = Counter(words)

In [None]:
counter.most_common(10)

In [None]:
counter["fish"]
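One thing to keep in mind is that `Counter` treats "One" and "one" as different keys, so lowercasing before counting can change the totals. The sketch below shows this on the same sample, using a simple regular expression as a stand-in for the spaCy tokens; the names `raw_counts` and `lower_counts` are assumptions for illustration.

```python
import re
from collections import Counter

sample_sents = "One fish, two fish, red fish, blue fish. One is less than two."

# Stand-in for the spaCy word list above: keep alphabetic runs only.
words = re.findall(r"[A-Za-z]+", sample_sents)

raw_counts = Counter(words)                       # case-sensitive keys
lower_counts = Counter(w.lower() for w in words)  # case-folded keys

print(raw_counts["fish"])
print(raw_counts["one"], lower_counts["one"])
```

With spaCy, the equivalent choice would be to count `token.lower_` or `token.lemma_` instead of `token.text`, depending on what should be treated as the same word.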

## Sentiment Analysis

Right now, spaCy doesn't include a model for sentiment analysis. According to comments on the spaCy GitHub repo, Explosion, the developers of spaCy, plan to offer sentiment models as part of their commercial offerings.

They have put out examples for how to do sentiment analysis: 
- https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py
- https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py

Both of these use some form of deep learning with neural networks.


<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Let's define the lexicon of a person as the set of different words she uses to speak. Write a function `get_lexicon(text, n)` that receives `text` and `n` and returns the lemmas of nouns, verbs, and adjectives that are used at least `n` times.
<br/>
</p>
</div>

In [None]:
def get_lexicon(text, n):
    doc = nlp(text)
    
    # return a list of words that have the correct part of speech    
    words = [token.lemma_ for token in doc if token.pos_ in ["NOUN", "ADJ", "VERB"]]
    # count the words     
    counter = Counter(words)
    # keep only the words used at least n times
    filtered_words = [word for word in counter if counter[word] >= n]
    return sorted(filtered_words)
    
get_lexicon(clinton_speech, 30)

## Readability indices

Readability indices are ways of assessing how easy or complex it is to read a particular text based on the words and sentences it has. They usually output scores that correlate with grade levels.

A couple of indices that are relatively easy to calculate are the Automated Readability Index (ARI) and the Coleman-Liau Index (CL):

$$
ARI = 4.71\frac{chars}{words}+0.5\frac{words}{sentences}-21.43
$$
$$ CL = 0.0588\left(\frac{letters}{words}\times 100\right) - 0.296\left(\frac{sentences}{words}\times 100\right) - 15.8 $$


https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

https://en.wikipedia.org/wiki/Automated_readability_index
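As a worked sketch of the two formulas, the functions below compute each index directly from raw counts, reading the Coleman-Liau terms as letters per 100 words and sentences per 100 words. The function names and the example counts are made up for illustration; they are not measurements from any of the speeches.

```python
def ari_from_counts(chars, words, sentences):
    # Automated Readability Index from raw counts.
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

def coleman_liau_from_counts(letters, words, sentences):
    # Coleman-Liau uses rates per 100 words rather than raw ratios.
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# Made-up counts: 450 characters, 100 words, 5 sentences.
print(ari_from_counts(450, 100, 5))
print(coleman_liau_from_counts(450, 100, 5))
```

With these counts, the ARI works out to 4.71 × 4.5 + 0.5 × 20 − 21.43 = 9.765 and the Coleman-Liau score to 0.0588 × 450 − 0.296 × 5 − 15.8 = 9.18, up to floating-point rounding. Both indices place this hypothetical text at roughly a ninth-grade level.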

In [None]:
# problem: the tokens in spaCy include punctuation. To get this right, we should remove punctuation
# we then have to make sure our functions handle lists of words rather than spaCy doc objects

def coleman_liau_index(doc, words):
    return (0.0588 * letters_per_100(words)) - (0.296 * sentences_per_100(doc, words)) - 15.8

def count_chars(words):
    return sum(len(w) for w in words)

def sentences_per_100(doc, words):
    return (len(list(doc.sents)) / len(words)) * 100

def letters_per_100(words):
    return (count_chars(words) / len(words)) * 100

In [None]:
# To get just the words, without punctuation tokens
def return_words(doc):
    return [token.text for token in doc if token.pos_ != 'PUNCT']

In [None]:
fancy_doc = nlp("Regional ontology, clearly defined by Heidegger, equals, if not surpasses, the earlier work of Heidegger's own mentor, Husserl")
fancy_words = return_words(fancy_doc)
fancy_words

In [None]:
coleman_liau_index(fancy_doc, fancy_words)

In [None]:
doc = nlp(clinton_speech)
clinton_speech_words = return_words(doc)
coleman_liau_index(doc, clinton_speech_words)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `auto_readability_index(doc)` that receives a spaCy `Doc` and returns the Automated Readability Index (ARI) score as defined above.
<br/>
* **Hint**: Feel free to use functions we've defined before.*
   
</p>
</div>

In [None]:
def auto_readability_index(doc):
    words = return_words(doc)
    num_chars = count_chars(words)
    num_words = len(words)
    num_sentences = len(list(doc.sents))
    return (4.71 * (num_chars / num_words)) + (0.5 * (num_words / num_sentences)) - 21.43

In [None]:
auto_readability_index(fancy_doc)

In [None]:
auto_readability_index(doc)

In [None]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/clinton2000.txt"
bush_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/bush2008.txt"
obama_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/obama2016.txt"
trump_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/trump.txt"

In [None]:
clinton_speech = get_speech(clinton_url)
bush_speech = get_speech(bush_url)
obama_speech = get_speech(obama_url)
trump_speech = get_speech(trump_url)

In [None]:
speeches = {
    "clinton": nlp(clinton_speech),
    "bush": nlp(bush_speech),
    "obama": nlp(obama_speech),
    "trump": nlp(trump_speech),
}

In [None]:
print("Name", "Chars", "Words", "Unique", "Sentences", sep="\t")
for speaker, speech in speeches.items():
    words = return_words(speech)
    print(speaker, count_chars(words), len(words), len(set(words)), len(list(speech.sents)), sep="\t")

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `avg_sentence_length(doc)` that receives a spaCy `Doc` and returns the average number of words per sentence in the doc. You might need to use our `return_words` function.
</p>
</div>

In [None]:
# average sentence length
def avg_sentence_length(doc):
    return sum(len(return_words(s)) for s in doc.sents) / len(list(doc.sents))

In [None]:
for speaker, speech in speeches.items():
    print(speaker, avg_sentence_length(speech))

We might stop to ask why Obama's speech seems to have shorter sentences. Is it a deliberate rhetorical choice? Or could it be an issue with the data itself?

In this case, if we look closely at the txt file, we can see that the transcription of the speech included the word 'applause' as a one-word sentence throughout the text. Let's see what happens if we filter that out.
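
Transcripts often carry more than one kind of annotation, so a regular expression can be a handy alternative to an exact string replacement. The sketch below is only an illustration: the `ANNOTATION` pattern, the `strip_annotations` helper, and the sample sentence are made up here, and the pattern would need adjusting to whatever markers a given transcript actually contains.

```python
import re

# Hypothetical pattern: a parenthesized audience reaction followed by a period,
# such as "(Applause.)" or "(Laughter.)". Extend the alternatives as needed.
ANNOTATION = re.compile(r"\((?:Applause|Laughter)\.\)")

def strip_annotations(text):
    """Remove transcript annotations and collapse the double spaces left behind."""
    without = ANNOTATION.sub("", text)
    # Only runs of spaces are collapsed, so paragraph breaks (newlines) survive
    return re.sub(r" {2,}", " ", without).strip()

sample = "We can do this. (Applause.) And we will. (Laughter.) Thank you."
cleaned = strip_annotations(sample)
print(cleaned)
```

Because the annotations disappear before the text reaches `nlp`, they can no longer be counted as one-word sentences.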

In [None]:
obama_clean_speech = obama_speech.replace("(Applause.)", "")

In [None]:
# Let's compare lengths of the texts. We should see a difference.

len(obama_speech), len(obama_clean_speech)

In [None]:
# Now let's recheck the average sentence length of Obama's speech.
avg_sentence_length(nlp(obama_clean_speech))

In [None]:
speeches = {
    "clinton": nlp(clinton_speech),
    "bush": nlp(bush_speech),
    "obama": nlp(obama_clean_speech),
    "trump": nlp(trump_speech),
}

Let's write a quick function to get the most common words used by each speaker.

In [None]:
from collections import Counter  # harmless if already imported earlier

def most_common_words(doc, n):
    words = return_words(doc)
    c = Counter(words)
    return c.most_common(n)

In [None]:
for speaker, speech in speeches.items():
    print(speaker, most_common_words(speech, 10))

You can see quickly that we need to remove some of these most common words. To do this, we'll use common lists of stopwords.
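
To see what stopword filtering does before applying it to the speeches, here is a minimal sketch on a toy sentence. The `toy_stopwords` set and the sentence are invented for illustration; real stoplists, such as the one spaCy ships, are much longer.

```python
from collections import Counter

# A tiny, made-up stoplist for illustration only
toy_stopwords = {"the", "of", "and", "to", "a", "we", "our"}

tokens = "We the people of the nation owe our thanks to the people".split()

# Compare lowercase forms so that "We" and "we" are treated alike
content_words = [t for t in tokens if t.lower() not in toy_stopwords]

raw_counts = Counter(tokens)
filtered_counts = Counter(content_words)

print(raw_counts.most_common(2))
print(filtered_counts.most_common(2))
```

Without filtering, function words like "the" dominate the counts; after filtering, the words that carry the topic rise to the top.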

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)

In [None]:
# To make sure we've got all the punctuation out and to remove some contractions, we'll use a custom stoplist
custom_stopwords = [',', '-', '.', '’s', ' ', '(', ')', '--', '---', 'n’t', ';', "'s", "'ve", "  ", "’ve"]

In [None]:
def most_common_words(doc, n):
    # Compare strings with !=; `is not` tests object identity, not equality
    words = [token.text for token in doc if token.pos_ != 'PUNCT'
             and token.lower_ not in STOP_WORDS and token.text not in custom_stopwords]
    c = Counter(words)
    return c.most_common(n)

In [None]:
for speaker, speech in speeches.items():
    print(speaker, ": ", most_common_words(speech, 10), "\n")

This sort of exploratory work is often the first step in figuring out how to clean a text for text analysis. 

Let's assess the lexical richness, defined as the ratio of the number of unique words to the total number of words.
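
The ratio is sensitive to how "unique" is decided. The sketch below uses a hypothetical helper, `type_token_ratio`, on a hand-made word list to show that lowercasing the words first lowers the number of unique forms and therefore the score.

```python
def type_token_ratio(words):
    """Unique words divided by total words; 0.0 for an empty list."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)

words = ["The", "cat", "saw", "the", "other", "cat"]

# "The" and "the" count as two different words here
case_sensitive = type_token_ratio(words)

# After lowercasing they collapse into a single word
case_insensitive = type_token_ratio([w.lower() for w in words])

print(case_sensitive, case_insensitive)
```

Keep in mind that this ratio also tends to fall as a text gets longer, since common words repeat, so comparisons between speeches of very different lengths should be read with care.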

In [None]:
def lexical_richness(doc):
    words = return_words(doc)
    return len(set(words)) / len(words)

In [None]:
for speaker, speech in speeches.items():
    print(speaker, lexical_richness(speech))

Let's look at the readability scores for all four speeches now.

For the Automated Readability Index, you can get the appropriate grade level here: https://en.wikipedia.org/wiki/Automated_readability_index

In [None]:
for speaker, speech in speeches.items():
    words = return_words(speech)
    print(speaker, "ARI:", auto_readability_index(speech), "CL:", coleman_liau_index(speech, words))

To get some comparison, let's also look at some stats calculated through Textacy. We'll see the ARI and CL scores, which use the same formulas we used. However, you might notice that the scores are different. To understand why, you have to dig into the source code for Textacy, where you'll find that it filters out punctuation in creating the word list, which affects the number of characters. It also lowercases the punctuation-filtered words before creating the set of unique words, decreasing that number as well compared to how we calculated it here. These changes affect both the ARI and CL scores.
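
To make the effect of counting choices concrete, the sketch below applies the same ARI and CL formulas to a hand-tokenized toy text, once counting punctuation tokens as words and once leaving them out. The helper names (`ari`, `cli`, `is_punct`, `counts`) and the token list are assumptions made for this illustration; this is not how textacy is implemented, only a demonstration that the scores move when the character and word counts change.

```python
import string

def ari(chars, words, sentences):
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

def cli(chars, words, sentences):
    letters_per_100 = chars / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

def is_punct(token):
    # A token counts as punctuation when every character is a punctuation mark
    return all(ch in string.punctuation for ch in token)

def counts(tokens, keep_punct):
    kept = [t for t in tokens if keep_punct or not is_punct(t)]
    return sum(len(t) for t in kept), len(kept)

# Hand-tokenized toy text made of two sentences
tokens = ["Ask", "not", "what", "your", "country", "can", "do", "for", "you", ".",
          "Ask", "what", "you", "can", "do", "for", "your", "country", "."]
n_sentences = 2

results = {}
for keep_punct in (True, False):
    n_chars, n_words = counts(tokens, keep_punct)
    results[keep_punct] = {
        "chars": n_chars,
        "words": n_words,
        "ARI": ari(n_chars, n_words, n_sentences),
        "CL": cli(n_chars, n_words, n_sentences),
    }
    print("with punctuation" if keep_punct else "without punctuation", results[keep_punct])
```

The sentence count is identical in both runs, yet both scores shift, because each formula divides by the word count and the punctuation tokens change both the character and word totals.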

In [None]:
!pip install textacy

In [None]:
import textacy

In [None]:
# https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
# https://en.wikipedia.org/wiki/Automated_readability_index
txt_speeches = [clinton_speech, bush_speech, obama_clean_speech, trump_speech]
corpus = textacy.Corpus('en', txt_speeches)
for doc in corpus:
    stats = textacy.text_stats.TextStats(doc)
    print({
        "ARI": stats.automated_readability_index,
        "CL": stats.coleman_liau_index,
        "stats": stats.basic_counts
    })

Why do we have such a significant difference in the CL scores? Let's look quickly at the textacy implementation: https://github.com/chartbeat-labs/textacy/blob/5927d539dd989c090f8a0b0c06ba40bb204fce82/textacy/text_stats.py#L277

In [None]:
print("Name", "Chars", "Words", "Unique", "Sentences", sep="\t")
for speaker, speech in speeches.items():
    words = return_words(speech)
    print(speaker, count_chars(words), len(words), len(set(words)), len(list(speech.sents)), sep="\t")

In [None]:
# clinton, bush, obama, trump
for doc in corpus:
    stats = textacy.text_stats.TextStats(doc)
    print({
        "stats": stats.basic_counts
    })

Post-workshop eval:

https://stanforduniversity.qualtrics.com/jfe/form/SV_aaZ76OCnWDqQbuR