# Exploring spaCy
Elena Cimino
e.cimino@pitt.edu

## Goal
The goal of this notebook is to explore the Python library spaCy and compare it with NLTK. My project would benefit from POS tagging and lemmatization. I already have experience with NLTK, but spaCy is new to me, so I want to see how the two compare on my data.

Note: This was originally done in the Jupyter notebook [exploring_balc.ipynb](https://nbviewer.jupyter.org/github/Data-Science-for-Linguists-2019/ESL-Article-Acquisition/blob/master/exploratory-analysis/BALC_clean.ipynb) but has since been moved here.

### Table of Contents:
1. [Setup](#setup): setting up the notebook, reading in files, loading libraries
2. [Tokenization](#token): comparing spaCy's tokenization to NLTK's
3. [Lemmatizing](#lemma): comparing spaCy's lemmatizing to NLTK's
4. [Ambiguity](#ambiguity): comparing how spaCy and NLTK handle ambiguity
5. [Conclusion](#conclusion): summarizing what this notebook found

<a id='setup'></a>
## Set-up
Loading in libraries, etc.

In [1]:
import spacy
# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs the full package name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

import pandas as pd

In [2]:
%pprint   # turn off pretty printing

Pretty printing has been turned OFF


In [3]:
cepa_df = pd.read_pickle('../private/cepa1.pkl')
cepa_df.head(3)

| Unnamed: 0 | Filename | Level | Original_Text | Normalized_Essay | Revised_Essay | tokens | token_count | TTR | Guiraud |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 200607296 | 3 | \t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you... | Now I tell you why my worst holiday ever in th... | Now I tell you why my worst holiday ever in th... | [Now, I, tell, you, why, my, worst, holiday, e... | 207 | 0.492754 | 7.08949 |
| 1 | 200607457 | 4 | \t\t\t\tCEPA 4 200607457\n\n\n\n ... | My worst holiday Last year I have just had the... | My worst holiday Last year I have just had the... | [My, worst, holiday, Last, year, I, have, just... | 180 | 0.572222 | 7.677167 |
| 2 | 200600487 | 5 | \t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i... | Every body in this life have a favourite posse... | Every body in this life have a favourite posse... | [Every, body, in, this, life, have, a, favouri... | 229 | 0.445415 | 6.74035 |


<a id='token'></a>
## Exploring tokenization
Because spaCy is new to me, let's explore it first and then compare it with NLTK.

In [4]:
y = cepa_df.Revised_Essay[0]
doc = nlp(y)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)

Now now ADV RB
I -PRON- PRON PRP
tell tell VERB VBP
you -PRON- PRON PRP
why why ADV WRB
my -PRON- ADJ PRP$
worst bad ADJ JJS
holiday holiday NOUN NN
ever ever ADV RB
in in ADP IN
the the DET DT
last last ADJ JJ
summer summer NOUN NN
I -PRON- PRON PRP
wented went VERB VBD
withe withe VERB VBP
my -PRON- ADJ PRP$
family family NOUN NN
in in ADP IN
the the DET DT
India india PROPN NNP
and and CCONJ CC
this this DET DT
story story NOUN NN
I -PRON- PRON PRP
will will VERB MD
tell tell VERB VB
you -PRON- PRON PRP
what what NOUN WP
happened happen VERB VBD
for for ADP IN
the the DET DT
short short ADJ JJ
story story NOUN NN
when when ADV WRB
I -PRON- PRON PRP
go go VERB VBP
the the DET DT
first first ADJ JJ
the the DET DT
weathe weathe NOUN NN
is be VERB VBZ
very very ADV RB
very very ADV RB
rain rain NOUN NN
now now ADV RB
bady bady VERB VBP
for for ADP IN
the the DET DT
children child NOUN NNS
play play VERB VB
out out PART RP
when when ADV WRB
I -PRON- PRON PRP
go go VERB VBP
in in ADP IN
t

In [5]:
z = cepa_df.Revised_Essay[1]
doc = nlp(z)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)

My -PRON- ADJ PRP$
worst bad ADJ JJS
holiday holiday NOUN NN
Last last ADJ JJ
year year NOUN NN
I -PRON- PRON PRP
have have VERB VBP
just just ADV RB
had have VERB VBN
the the DET DT
worst bad ADJ JJS
holiday holiday NOUN NN
ever ever ADV RB
. . PUNCT .
It -PRON- PRON PRP
was be VERB VBD
too too ADV RB
board board NOUN NN
. . PUNCT .
My -PRON- ADJ PRP$
Lonely lonely ADJ JJ
sister sister NOUN NN
had have VERB VBD
got get VERB VBN
married marry VERB VBN
. . PUNCT .
She -PRON- PRON PRP
was be VERB VBD
making make VERB VBG
me -PRON- PRON PRP
Laugh laugh PROPN NNP
and and CCONJ CC
play play VERB VB
with with ADP IN
me -PRON- PRON PRP
. . PUNCT .
But but CCONJ CC
now now ADV RB
I`m i`m VERB VB
alone alone ADV RB
with with ADP IN
my -PRON- ADJ PRP$
male male ADJ JJ
brothers brother NOUN NNS
. . PUNCT .
I -PRON- PRON PRP
ca can VERB MD
nt not ADV RB
stand stand VERB VB
them -PRON- PRON PRP
they -PRON- PRON PRP
are be VERB VBP
too too ADV RB
noisy noisy ADJ JJ
. . PUNCT .
In in ADP IN
the the D

Because the essays have already been tokenized with NLTK, we can just look at the POS tags for those existing tokens.

In [6]:
# Now looking at NLTK: tag the pre-tokenized first essay
print(nltk.pos_tag(cepa_df.tokens[0]))

[('Now', 'RB'), ('I', 'PRP'), ('tell', 'VBP'), ('you', 'PRP'), ('why', 'WRB'), ('my', 'PRP$'), ('worst', 'JJS'), ('holiday', 'NN'), ('ever', 'RB'), ('in', 'IN'), ('the', 'DT'), ('last', 'JJ'), ('summer', 'NN'), ('I', 'PRP'), ('wented', 'VBD'), ('withe', 'JJ'), ('my', 'PRP$'), ('family', 'NN'), ('in', 'IN'), ('the', 'DT'), ('India', 'NNP'), ('and', 'CC'), ('this', 'DT'), ('story', 'NN'), ('I', 'PRP'), ('will', 'MD'), ('tell', 'VB'), ('you', 'PRP'), ('what', 'WDT'), ('happened', 'VBD'), ('for', 'IN'), ('the', 'DT'), ('short', 'JJ'), ('story', 'NN'), ('when', 'WRB'), ('I', 'PRP'), ('go', 'VBP'), ('the', 'DT'), ('first', 'JJ'), ('the', 'DT'), ('weathe', 'NN'), ('is', 'VBZ'), ('very', 'RB'), ('very', 'RB'), ('rain', 'RB'), ('now', 'RB'), ('bady', 'VBZ'), ('for', 'IN'), ('the', 'DT'), ('children', 'NNS'), ('play', 'VBP'), ('out', 'RP'), ('when', 'WRB'), ('I', 'PRP'), ('go', 'VBP'), ('in', 'IN'), ('the', 'DT'), ('hotel', 'NN'), ('all', 'DT'), ('may', 'MD'), ('family', 'NN'), ('was', 'VBD'), (

In [7]:
# Tag the pre-tokenized second essay
print(nltk.pos_tag(cepa_df.tokens[1]))

[('My', 'PRP$'), ('worst', 'JJS'), ('holiday', 'NN'), ('Last', 'JJ'), ('year', 'NN'), ('I', 'PRP'), ('have', 'VBP'), ('just', 'RB'), ('had', 'VBN'), ('the', 'DT'), ('worst', 'JJS'), ('holiday', 'NN'), ('ever', 'RB'), ('.', '.'), ('It', 'PRP'), ('was', 'VBD'), ('too', 'RB'), ('board', 'NN'), ('.', '.'), ('My', 'NNP'), ('Lonely', 'RB'), ('sister', 'NN'), ('had', 'VBD'), ('got', 'VBN'), ('married', 'VBN'), ('.', '.'), ('She', 'PRP'), ('was', 'VBD'), ('making', 'VBG'), ('me', 'PRP'), ('Laugh', 'NNP'), ('and', 'CC'), ('play', 'NN'), ('with', 'IN'), ('me', 'PRP'), ('.', '.'), ('But', 'CC'), ('now', 'RB'), ('I', 'PRP'), ('`', '``'), ('m', 'VB'), ('alone', 'RB'), ('with', 'IN'), ('my', 'PRP$'), ('male', 'NN'), ('brothers', 'NNS'), ('.', '.'), ('I', 'PRP'), ('cant', 'VBP'), ('stand', 'VBP'), ('them', 'PRP'), ('they', 'PRP'), ('are', 'VBP'), ('too', 'RB'), ('noisy', 'JJ'), ('.', '.'), ('In', 'IN'), ('the', 'DT'), ('Spring', 'NN'), ('holiday', 'NN'), ('my', 'PRP$'), ('brothers', 'NNS'), ('and', '
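Rather than comparing the two tokenizations by eye, the token sequences can be aligned automatically. The sketch below uses the standard library's `difflib.SequenceMatcher`; the helper name `tokenization_differences` is hypothetical, and the token fragments are copied by hand from the essay 1 outputs above rather than recomputed.

```python
from difflib import SequenceMatcher


def tokenization_differences(tokens_a, tokens_b):
    """Return (span_a, span_b) pairs where two token sequences disagree."""
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    differences = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            differences.append((tokens_a[a_start:a_end], tokens_b[b_start:b_end]))
    return differences


# Token fragments copied from the essay 1 outputs above.
spacy_tokens = ["But", "now", "I`m", "alone", "with", "my", "male", "brothers"]
nltk_tokens = ["But", "now", "I", "`", "m", "alone", "with", "my", "male", "brothers"]

for spacy_span, nltk_span in tokenization_differences(spacy_tokens, nltk_tokens):
    print(spacy_span, "vs", nltk_span)
# → ['I`m'] vs ['I', '`', 'm']
```

On the full essays, the same helper could be called with `[token.text for token in doc]` and `cepa_df.tokens[1]`.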

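The tokenization and tagging look pretty comparable across the two, which is great. There are a few differences: spaCy keeps 'I`m' as a single token where the NLTK tokens are 'I', '`', and 'm', and spaCy splits 'cant' into 'ca' and 'nt' where the NLTK tokens have 'cant'. The taggers also disagree on a handful of tokens, mostly misspelled or ungrammatical ones (for example, 'withe' is VBP for spaCy but JJ for NLTK, and 'rain' is NN for spaCy but RB for NLTK). What about lemmatizing?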

<a id='lemma'></a>
## Lemmatization
Now, let's look at some of the lemmatizing.

In [8]:
doc = nlp(y)
for token in doc:
    print(token.text, token.lemma_)

Now now
I -PRON-
tell tell
you -PRON-
why why
my -PRON-
worst bad
holiday holiday
ever ever
in in
the the
last last
summer summer
I -PRON-
wented went
withe withe
my -PRON-
family family
in in
the the
India india
and and
this this
story story
I -PRON-
will will
tell tell
you -PRON-
what what
happened happen
for for
the the
short short
story story
when when
I -PRON-
go go
the the
first first
the the
weathe weathe
is be
very very
very very
rain rain
now now
bady bady
for for
the the
children child
play play
out out
when when
I -PRON-
go go
in in
the the
hotel hotel
all all
may may
family family
was be
have have
the the
headk headk
in in
there there
and and
all all
was be
sleep sleep
put put
for for
my -PRON-
I -PRON-
can;t can;t
sleep sleep
because because
I -PRON-
not not
love love
the the
area area
in in
the the
morning morning
all all
the the
my -PRON-
family family
weak weak
up up
and and
going go
irant irant
but but
is be
the the
strees stree
, ,
children child
and and
the the
food 

In [9]:
for token in cepa_df.tokens[0]:
    print(token, lemmatizer.lemmatize(token))

Now Now
I I
tell tell
you you
why why
my my
worst worst
holiday holiday
ever ever
in in
the the
last last
summer summer
I I
wented wented
withe withe
my my
family family
in in
the the
India India
and and
this this
story story
I I
will will
tell tell
you you
what what
happened happened
for for
the the
short short
story story
when when
I I
go go
the the
first first
the the
weathe weathe
is is
very very
very very
rain rain
now now
bady bady
for for
the the
children child
play play
out out
when when
I I
go go
in in
the the
hotel hotel
all all
may may
family family
was wa
have have
the the
headk headk
in in
there there
and and
all all
was wa
sleep sleep
put put
for for
my my
I I
can can
; ;
t t
sleep sleep
because because
I I
not not
love love
the the
area area
in in
the the
morning morning
all all
the the
my my
family family
weak weak
up up
and and
going going
irant irant
but but
is is
the the
strees strees
, ,
children child
and and
the the
food food
is is
very very
dearty dearty
earia ea

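In this instance, spaCy seems to outperform NLTK a bit. For example, it lemmatizes the word 'wented' as 'went', whereas NLTK leaves it as 'wented'. Additionally, for the word 'strees', both NLTK and spaCy tagged it as a plural common noun ('NNS'), but only spaCy then lemmatized it as 'stree'; NLTK left it unchanged. NLTK also lemmatizes 'was' (past tense of 'be') as 'wa' instead of 'be'. Part of this gap comes from how NLTK is called here: `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verbs such as 'was', 'happened', and 'going' are not reduced to their base forms.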
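For a fairer comparison, NLTK's lemmatizer can be given each token's part of speech, since `lemmatize` assumes a noun by default. The following is a minimal sketch, not what this notebook ran: the helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the mapping collapses the Penn Treebank tags to WordNet's four POS letters.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (token, penn_tag) pairs, passing the mapped POS to the lemmatizer.

    `lemmatizer` is any object with a lemmatize(word, pos=...) method,
    such as nltk.stem.WordNetLemmatizer.
    """
    return [
        (token, lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag)))
        for token, tag in tagged_tokens
    ]
```

With the objects from the set-up cell, this could be called as `lemmatize_tagged(nltk.pos_tag(cepa_df.tokens[0]), lemmatizer)`. With the verb POS supplied, WordNet should return 'be' for 'was' rather than 'wa'.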

One thing that is rather annoying about spaCy's lemmatizing is that all pronouns are lemmatized as '-PRON-', whereas NLTK keeps each pronoun as its own entry. So anyone interested in examining pronouns may want to either avoid spaCy or write a function that falls back to the token's text whenever the lemma is '-PRON-'.
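A workaround along those lines might look like the sketch below. The helper name `lemma_or_text` is hypothetical, and lowercasing the fallback is an assumption made to match spaCy's lowercased lemmas (for example 'India' becoming 'india' above).

```python
PRON_PLACEHOLDER = "-PRON-"


def lemma_or_text(text, lemma, placeholder=PRON_PLACEHOLDER):
    """Return the lemma, or the lowercased token text if the lemma is the pronoun placeholder."""
    return text.lower() if lemma == placeholder else lemma


# (text, lemma) pairs copied from the spaCy output above.
pairs = [("I", "-PRON-"), ("tell", "tell"), ("you", "-PRON-"), ("children", "child")]
print([lemma_or_text(text, lemma) for text, lemma in pairs])
# → ['i', 'tell', 'you', 'child']
```

On a spaCy `doc`, the same function could be applied as `[lemma_or_text(token.text, token.lemma_) for token in doc]`.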

Overall, I'm pretty happy with spaCy so far! It's also convenient that everything you need is in one library: once you import it and load the model for your target language, it's streamlined and fairly easy to use.

Let's keep chugging along though!

<a id='ambiguity'></a>
## Ambiguity
Here, I'll test two small sentences to see the difference in how spaCy and NLTK deal with ambiguity, and again compare the tagging and lemmatizing.

In [10]:
# Make my own example, to test ambiguity
test = "I like to bow and look at bows on presents."
test2 = "I wented to a store on Fifth Avenue."

# Spacy
t = nlp(test)
for tok in t:
    print(tok, tok.tag_, tok.lemma_)
    
t = nlp(test2)
for tok in t:
    print(tok, tok.tag_, tok.lemma_)

I PRP -PRON-
like VBP like
to TO to
bow VB bow
and CC and
look VB look
at IN at
bows NNS bow
on IN on
presents NNS present
. . .
I PRP -PRON-
wented VBD went
to IN to
a DT a
store NN store
on IN on
Fifth NNP fifth
Avenue NNP avenue
. . .


In [11]:
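# NLTK
for tok in nltk.word_tokenize(test):
    print(tok, lemmatizer.lemmatize(tok))
# Note: str.split() leaves the final period attached ('presents.'),
# so the tagger sees different tokens than word_tokenize produces above
print(nltk.tag.pos_tag(test.split()))

for tok in nltk.word_tokenize(test2):
    print(tok, lemmatizer.lemmatize(tok))
# Last expression in the cell, so the notebook displays it without print()
nltk.tag.pos_tag(test2.split())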

I I
like like
to to
bow bow
and and
look look
at at
bows bow
on on
presents present
. .
[('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('bow', 'VB'), ('and', 'CC'), ('look', 'VB'), ('at', 'IN'), ('bows', 'NNS'), ('on', 'IN'), ('presents.', 'NN')]
I I
wented wented
to to
a a
store store
on on
Fifth Fifth
Avenue Avenue
. .


[('I', 'PRP'), ('wented', 'VBD'), ('to', 'TO'), ('a', 'DT'), ('store', 'NN'), ('on', 'IN'), ('Fifth', 'NNP'), ('Avenue.', 'NNP')]

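They both deal with ambiguity fairly well, and again the tagging seems comparable across both. The one mismatch, NLTK tagging 'presents.' as NN where spaCy has NNS, likely comes from tagging the output of `split()`, which leaves the period attached to the word. spaCy is also a lot faster than NLTK, although this notebook does not measure it.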
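The speed difference could be checked with a small timing helper. This is a rough sketch using only the standard library; the name `best_time` is hypothetical, and taking the fastest of several runs is one common way to reduce noise, not the only one.

```python
import time


def best_time(func, *args, repeats=3):
    """Return the fastest wall-clock time in seconds over several runs of func(*args)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)
```

With the objects defined earlier, `best_time(nlp, test)` could be compared against `best_time(lambda s: nltk.pos_tag(nltk.word_tokenize(s)), test)`. A single short sentence is a small sample, so timing a batch of essays would give a more trustworthy picture.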

<a id='conclusion'></a>
## Conclusion

Both NLTK and spaCy have reliable POS tagging, but from what I have seen, spaCy has a somewhat better lemmatizer than NLTK. It's also faster, which is something to consider since I'm going to be using this on an entire corpus. It's a bit annoying that spaCy lemmatizes _any and all_ pronouns as -PRON-, but I won't be looking at pronouns in this project, so it's not a dealbreaker for me. For the task of lemmatizing, I'll use spaCy!
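For the whole corpus, spaCy's `nlp.pipe` processes texts in batches, which is generally faster than calling `nlp` on one essay at a time. A minimal sketch follows; the function name `lemmatize_corpus` and the batch size of 50 are illustrative assumptions, not settings used in this notebook.

```python
def lemmatize_corpus(nlp, texts, batch_size=50):
    """Return one list of lemmas per text, streaming the texts through nlp.pipe."""
    return [
        [token.lemma_ for token in doc]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]
```

Applied to the DataFrame from the set-up section, this could look like `cepa_df['lemmas'] = lemmatize_corpus(nlp, cepa_df.Revised_Essay)`, where the `lemmas` column name is hypothetical.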