## Sentence Tokenization

In NLP, tokenization means breaking a text into individual components. The most familiar form is word tokenization, which splits a text into words and punctuation marks. There are, however, other forms, such as sentence tokenization. **Sentence tokenization** follows the same idea as word tokenization, except that instead of breaking a text into individual words and punctuation, we break it into individual sentences.

If you are familiar with Python, you may know the built-in string method split(), which splits a text on whitespace (the default) or on a separator string passed as an argument, e.g. split("."). A common practice (without NLP frameworks) is to split a text into sentences by simply calling split("."), but this is ill-advised. Consider the example below:

In [12]:
text = "Mary J. Watson is known for his writing skills. She is also a good dancer."

In [13]:
# Now, let's use the split method to split the text string on periods.
new = text.split(".")
print(new)

['Mary J', ' Watson is known for his writing skills', ' She is also a good dancer', '']


As we can see, splitting on periods separates the two sentences, but it also splits the name in the first sentence at the initial "J.", strips the periods, and leaves an empty string at the end. The very punctuation that makes texts easier to read greatly hinders our ability to split sentences with a simple rule.
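
A slightly more careful rule does not solve the problem either. The sketch below uses Python's standard `re` module to split only where a period, question mark, or exclamation mark is followed by whitespace. The pattern and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

text = "Mary J. Watson is known for his writing skills. She is also a good dancer."

# Split on whitespace that directly follows ., ! or ? (the punctuation is kept).
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(sentence)
```

The periods are now preserved and the trailing empty string is gone, but the initial "J." is still treated as the end of a sentence, so the text comes back as three pieces instead of two.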

For this reason, another method is needed. This is where sentence tokenization comes into play. In order to see how sentence tokenization differs, let's begin with our first spaCy usage.

For my use case, I will be using the setup commands below. You can find the appropriate commands for your own setup on the spaCy installation page by entering your platform and use case details:
[spaCy Installation](https://spacy.io/usage)
```
pip install -U pip setuptools wheel
pip install -U 'spacy[transformers,lookups]'
python -m spacy download en_core_web_sm
```

We'll begin with spaCy and see how we can perform tokenization. First, we need to load an NLP model object. To do this, we use the spacy.load() function.
This takes one argument, the name of the model to load. We will use the small English model.

In [24]:
import spacy
nlp = spacy.load("en_core_web_sm")

With the nlp object created, we can use it to parse a text.
To do this, we create a doc object.
This object will contain a lot of data on the text.

In [25]:
doc = nlp(text)
print(doc)

Mary J. Watson is known for his writing skills. She is also a good dancer.


In [26]:
doc?

Type:           Doc
String form:    Mary J. Watson is known for his writing skills. She is also a good dancer.
Length:         17
File:           ~/.local/lib/python3.10/site-packages/spacy/tokens/doc.cpython-310-x86_64-linux-gnu.so
Docstring:
Doc(Vocab vocab, words=None, spaces=None, user_data=None, *, tags=None, pos=None, morphs=None, lemmas=None, heads=None, deps=None, sent_starts=None, ents=None)
A sequence of Token objects. Access sentences and named entities, export
    annotations to numpy arrays, losslessly serialize to compressed binary
    strings. The `Doc` object holds an array of `TokenC` structs. The
    Python-level `Token` and `Span` objects are views of this array, i.e.
    they don't own the data themselves.

    EXAMPLE:
        Construction 1
        >>> doc = nlp(u'Some text')

        Construction 2
        >>> from spacy.tokens import Doc
        >>> doc = Doc(nlp.vocab, words=["hello", "world", "!"], sp

While this looks identical to the "text" string above, it is quite different. To demonstrate this, let us use the sentence tokenizer.

In [19]:
for sent in doc.sents:
    print (sent)

Mary J. Watson is known for his writing skills.
She is also a good dancer.


Here, the spaCy sentence tokenizer generates the desired output: a text correctly broken into sentences, with the initial in "Mary J. Watson" left intact. This shows why using an NLP framework, even for a basic task, is not only easier, but essential.
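
If sentence boundaries are all that is needed, the sentences can also be collected into a list of plain strings. The sketch below makes an assumption that differs from the example above: instead of the `en_core_web_sm` model, it uses a blank English pipeline with spaCy's rule-based `sentencizer` component, which needs no model download. The variable names are illustrative.

```python
import spacy

text = "Mary J. Watson is known for his writing skills. She is also a good dancer."

# Assumption: a blank English pipeline (tokenizer only) plus the rule-based
# sentencizer, rather than the trained en_core_web_sm model used above.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules(text)

# Each sentence is a Span; .text gives the underlying string.
sentences = [sent.text for sent in doc_rules.sents]
print(len(sentences))
for sentence in sentences:
    print(sentence)
```

The trained model predicts boundaries as part of its analysis of the text, whereas the sentencizer relies only on punctuation tokens, so the two approaches may disagree on harder texts.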

### Named Entity Recognition 

The objective of this series is named entity recognition (NER). Here, I'd like to demonstrate how to perform basic NER via spaCy. Again, we will iterate over the doc object as we did above, but instead of iterating over doc.sents, we will iterate over doc.ents. For now, I simply want to print each entity's text (the string itself) and its corresponding label (note the _ after label, which returns the label as a string rather than an integer ID). I will explain this process in much greater detail later.

In [20]:
for ent in doc.ents:
    print (ent.text, ent.label_)


Mary J. Watson PERSON


### Parts Of Speech 


spaCy can also identify the part of speech of each token. In the output below, we can see two vital pieces of information for each token: the string and the corresponding part-of-speech (pos) tag. For a complete list of the pos labels, see the spaCy documentation (https://spacy.io/api/annotation#pos-tagging).
Here, PROPN is a proper noun, AUX is an auxiliary verb, ADJ is an adjective, etc.

In [27]:
for token in doc:
    print(token.text, token.pos_)

Mary PROPN
J. PROPN
Watson PROPN
is AUX
known VERB
for ADP
his PRON
writing NOUN
skills NOUN
. PUNCT
She PRON
is AUX
also ADV
a DET
good ADJ
dancer NOUN
. PUNCT


### Extracting Nouns and Noun Chunks
Often when working with a text, we need to extract nouns and noun chunks. There are a few different ways to do this via spaCy. To extract noun chunks (a noun together with the words that describe it), we can use the doc.noun_chunks attribute.

In [28]:
for chunk in doc.noun_chunks:
    print(chunk.text)

Mary J. Watson
his writing skills
She
a good dancer


### Extracting Verbs and Verb Phrases
In order to extract all verbs, we can leverage the POS tagger's output in spaCy. We can write a for loop that iterates over all tokens in the doc object and prints only those whose POS tag is either "VERB" or "AUX". These are the two POS tags used to identify tokens in a sentence that function as verbs.

In [29]:
verbs = ["VERB", "AUX"]
for token in doc:
    if token.pos_ in verbs:
        print (token.text, token.pos_)

is AUX
known VERB
is AUX
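
The same filter can be written as a small reusable function. In the sketch below, `extract_verbs` is a hypothetical helper name, and the Doc is built by hand from words and POS tags printed earlier (using the `Doc(vocab, words=..., pos=...)` constructor shown in the docstring above) so that it runs without loading a model. With the notebook's own `doc` object, the function can be called directly.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

VERB_TAGS = {"VERB", "AUX"}

def extract_verbs(doc):
    """Return (text, pos) pairs for tokens tagged as VERB or AUX."""
    return [(token.text, token.pos_) for token in doc if token.pos_ in VERB_TAGS]

# Assumption: words and tags copied from the tagger output shown earlier.
words = ["She", "is", "also", "a", "good", "dancer", "."]
pos = ["PRON", "AUX", "ADV", "DET", "ADJ", "NOUN", "PUNCT"]
example_doc = Doc(Vocab(), words=words, pos=pos)

print(extract_verbs(example_doc))
```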


### Lemmatization

**Lemmatization** is an essential component in most NLP frameworks, though libraries approach it differently. While some libraries, such as NLTK, provide stemmers that strip suffixes to find word stems, spaCy finds word lemmas, the dictionary forms of words. The two are technically a little different, but both seek to reduce words to a common base form. To find lemmas via spaCy, we use the same process as we did for finding a word's part of speech: iterating over the tokens in the doc object.

In [30]:
for token in doc:
    print(token.text, token.lemma_)

Mary Mary
J. J.
Watson Watson
is be
known know
for for
his his
writing writing
skills skill
. .
She she
is be
also also
a a
good good
dancer dancer
. .


Note that most words remain the same, but notice in particular that "is" becomes "be" and "known" becomes "know". These are the respective lemmas for these verbs. Notice the same effect on nouns: "skills", a plural, is reduced to "skill".
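
To make the difference between a stem and a lemma concrete, the sketch below compares a deliberately crude suffix-stripping function with the lemmas printed above. `naive_stem` is a hypothetical toy written for this illustration, not the algorithm of any particular library, and the `spacy_lemmas` dictionary simply copies four rows from the output above.

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lemmas copied from the spaCy output above.
spacy_lemmas = {"is": "be", "known": "know", "writing": "writing", "skills": "skill"}

for word, lemma in spacy_lemmas.items():
    print(f"{word:8} stem={naive_stem(word):8} lemma={lemma}")
```

The toy stemmer and the lemmatizer agree on "skills", but the stemmer leaves "is" and "known" untouched and cuts "writing" down to "writ", which is not a word. A lemmatizer instead maps each token to its dictionary form, which is why "is" becomes "be".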