# Exercise 03: Splitting sentences and PoS annotation

Let's start with a simple paragraph, copied from the course description:

In [None]:
text = """
Increasingly, customers send text to interact or leave comments, 
which provides a wealth of data for text mining.  That’s a great 
starting point for developing custom search, content recommenders, 
and even AI applications.
"""
repr(text)

Notice how there are explicit *line breaks* in the text. Let's write some code to flow the paragraph without any line breaks:

In [None]:
# Strip each line, then join the lines with single spaces
text = " ".join(line.strip() for line in text.split("\n")).strip()
repr(text)
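
To see what each step of that one-liner does, here is a minimal sketch on a smaller string. The helper name `flow_paragraph` and the sample string are made up for illustration; only the standard library is used.

```python
def flow_paragraph(raw: str) -> str:
    """Join the lines of `raw` into one line separated by single spaces."""
    # Split on newlines; a leading or trailing newline yields empty strings
    lines = raw.split("\n")
    # Trim stray whitespace at the edges of every line
    stripped = [line.strip() for line in lines]
    # Rejoin with spaces, then trim the spaces left by the empty edge lines
    return " ".join(stripped).strip()


sample = "\nfirst line, \nsecond line\n"
flowed = flow_paragraph(sample)
print(repr(flowed))  # 'first line, second line'
```

Note that an empty line in the middle of the text would leave a double space behind. If that matters, `" ".join(raw.split())` collapses every run of whitespace instead, including double spaces within a line.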

Now we can use [spaCy](https://spacy.io/) to *split* the paragraph into sentences:

In [None]:
import spacy

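# Load the small English pipeline; install it first with
# `python -m spacy download en_core_web_sm` (the old "en" shortcut
# was removed in spaCy 3)
nlp = spacy.load("en_core_web_sm")
# The parser runs by default, so no `parse=True` argument is needed
doc = nlp(text)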

for span in doc.sents:
    print("> ", span)
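
If the trained pipeline is unavailable, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below assumes spaCy 3.x, where components are added by name; the variable names and the sample string are illustrative only, and the boundaries it finds may differ from those of the parser-based approach.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components,
# so no model download is needed (assumes spaCy 3.x)
rule_nlp = spacy.blank("en")
# The sentencizer marks sentence boundaries after punctuation such as "." or "?"
rule_nlp.add_pipe("sentencizer")

sample_doc = rule_nlp("Customers send text. That is a starting point for text mining.")
sentences = [span.text for span in sample_doc.sents]

for sentence in sentences:
    print("> ", sentence)
```

Because it only looks at punctuation, this approach is fast but can be fooled by abbreviations; the parser-based split is usually the safer choice for real text.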

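Next we take each sentence and *annotate* its tokens with part-of-speech (PoS) tags. spaCy exposes two tag sets: `token.tag_` is the fine-grained tag and `token.pos_` is the coarse-grained one: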

In [None]:
for span in doc.sents:
    for token in span:
        # token.i is the token's index within the whole document
        print(token.i, token.text, token.tag_, token.pos_)

Given these part-of-speech annotations, we can *lemmatize* nouns and verbs to get their root forms. Lemmatizing also singularizes plural nouns, so "comments" becomes "comment":

In [None]:
for span in doc.sents:
    for token in span:
        # token.lemma_ holds the root form of the token
        print(token.i, token.text, token.tag_, token.pos_, token.lemma_)

We can also look up synonyms and definitions for each word, using *synsets* from [WordNet](https://wordnet.princeton.edu/). Note that `spaCy` is designed to be an *opinionated* API, and it leaves out much of what `WordNet` offers. However, we can use [TextBlob](http://textblob.readthedocs.io/) instead:

In [None]:
from textblob import Word

w = Word("comments")

for synset, definition in zip(w.get_synsets(), w.define()):
    print(synset, definition)