# NLP with spaCy

## Setting up spaCy
1. pip install spacy
2. python -m spacy.en.download

The second command downloads the data for the English model, which includes the parser, tagger, vocabulary, and word vectors.

In [2]:
# import spaCy and load the English model
import spacy

nlp = spacy.load('en')

## 1. Process Text

A single call to nlp runs tokenization, sentence recognition, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition in one pass.

In [5]:
# parse the text into a Doc object
doc = nlp(u"I went to school this morning, but it's Sunday. School is closed! Silly me =)")

Each token is an object with many attributes. An attribute whose name ends in an underscore (for example, lemma_) returns the string representation, while the same attribute without the underscore (lemma) returns an integer ID into spaCy's vocabulary. Some of the available attributes:
* orth (the original token text)
* lower
* lemma
* shape
* prefix
* suffix
* prob (log probability, based on counts from a 3-billion-word corpus)
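
To make the string/ID distinction concrete, here is a minimal plain-Python sketch of the idea. `ToyStringStore` is a hypothetical class written only for this illustration. It is not spaCy's implementation, and the ID values spaCy assigns will differ.

```python
class ToyStringStore(object):
    """Hypothetical two-way table between strings and integer IDs."""

    def __init__(self):
        self._id_of = {}      # string -> integer ID
        self._strings = []    # integer ID -> string

    def add(self, text):
        # Assign the next free ID the first time a string is seen.
        if text not in self._id_of:
            self._id_of[text] = len(self._strings)
            self._strings.append(text)
        return self._id_of[text]

    def __getitem__(self, key):
        # Integer keys return the string; string keys return the ID.
        if isinstance(key, int):
            return self._strings[key]
        return self._id_of[key]


store = ToyStringStore()
words = u"School is closed ! School".split()
ids = [store.add(word) for word in words]

print(ids)            # [0, 1, 2, 3, 0]: the repeated string shares one ID
print(store[ids[2]])  # closed
```

Storing each distinct string once and passing integers around is what makes the non-underscore attributes cheap to compare, while the underscore variants are the convenient form for printing.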


### Get Tokens

In [7]:
# print all the tokens
print([token for token in doc])

[I, went, to, school, this, morning, ,, but, it, 's, Sunday, ., School, is, closed, !, Silly, me, =)]


In [9]:
# the doc also supports len(), indexing, and slicing
print('Number of tokens: {}'.format(len(doc)))
print('First token: {}'.format(doc[0]))
print('Last token: {}'.format(doc[-1]))
print('Tokens 2 through 4: {}'.format(doc[1:4]))

Number of tokens: 19
First token: I
Last token: =)
Tokens 2 through 4: went to school


### Get Sentences

In [10]:
# print all the sentences
for sent in doc.sents:
    print(sent)

I went to school this morning, but it's Sunday.
School is closed!
Silly me =)


### Remove Stop Words

In [13]:
# print all of the stop words in the document
print([token for token in doc if token.is_stop])

[I, to, this, but, it, is, me]


In [25]:
print(type(doc))

<type 'spacy.tokens.doc.Doc'>


In [None]:
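# remove the stop words, keeping the Token objects so their attributes stay available
doc_1 = [token for token in doc if not token.is_stop]
# join the remaining token texts for display
print(u' '.join(token.orth_ for token in doc_1))
# lemma (as a string) of the first remaining token
print(doc_1[0].lemma_)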
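
The cell above has no recorded output. As a rough illustration of what stop-word removal leaves behind, the sketch below repeats the filtering in plain Python, without spaCy. The token list is copied from the tokenizer output shown earlier, and `stop_words` is an assumption: a tiny hand-made set containing only the stop words reported above, not spaCy's full stop list.

```python
# Token texts copied from the tokenizer output shown earlier.
tokens = [u"I", u"went", u"to", u"school", u"this", u"morning", u",", u"but",
          u"it", u"'s", u"Sunday", u".", u"School", u"is", u"closed", u"!",
          u"Silly", u"me", u"=)"]

# Assumption: a minimal stop list limited to the stop words reported above.
stop_words = {u"i", u"to", u"this", u"but", u"it", u"is", u"me"}

# Compare in lower case so that "I" matches "i".
kept = [t for t in tokens if t.lower() not in stop_words]
text_without_stops = u" ".join(kept)

print(text_without_stops)
# went school morning , 's Sunday . School closed ! Silly =)
```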

### Remove Punctuation

In [15]:
# print all punctuation in the document
print([token for token in doc if token.is_punct])

[,, ., !]
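
The cell above only lists the punctuation tokens. A minimal sketch of actually dropping them, in plain Python, could look like the following. The helper `is_all_punct` is hypothetical and relies on Unicode character categories from the standard library; it is a stand-in for illustration, not how spaCy computes `is_punct`. With spaCy itself, filtering on `not token.is_punct` would be the direct equivalent.

```python
import unicodedata

# Token texts copied from the tokenizer output shown earlier.
tokens = [u"I", u"went", u"to", u"school", u"this", u"morning", u",", u"but",
          u"it", u"'s", u"Sunday", u".", u"School", u"is", u"closed", u"!",
          u"Silly", u"me", u"=)"]


def is_all_punct(text):
    # True when every character falls in a Unicode punctuation category (P*).
    return all(unicodedata.category(ch).startswith("P") for ch in text)


punct = [t for t in tokens if is_all_punct(t)]
no_punct = [t for t in tokens if not is_all_punct(t)]

print(punct)
print(u" ".join(no_punct))
```

Under this rule the emoticon `=)` is kept, because `=` is classified as a math symbol rather than punctuation, which agrees with the output listed above.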


### Lemmatize

In [11]:
lemmas = [token.lemma_ for token in doc]
print(lemmas)

[u'i', u'go', u'to', u'school', u'this', u'morning', u',', u'but', u'it', u"'", u'sunday', u'.', u'school', u'be', u'closed', u'!', u'silly', u'me', u'=)']
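
One reason to lemmatize is that different surface forms collapse onto a single key, which makes counting more meaningful. The sketch below is a small illustration using only the standard library; the `lemmas` list is copied from the output above, and `lemma_counts` is a name introduced here for the example.

```python
from collections import Counter

# Lemma list copied from the output above.
lemmas = [u'i', u'go', u'to', u'school', u'this', u'morning', u',', u'but',
          u'it', u"'", u'sunday', u'.', u'school', u'be', u'closed', u'!',
          u'silly', u'me', u'=)']

lemma_counts = Counter(lemmas)

# "school" and "School" in the original text share the lemma "school".
print(lemma_counts[u'school'])  # 2
```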
