# **Tokenization**

Tokenization is the process of splitting textual input data, e.g., a sentence or a paragraph, into meaningful units [[1]](#scrollTo=op-j6UywUt5i).

Word tokenization frequently uses word pieces such as word stems, prefixes,
and suffixes (e.g., “-ation”).
Tokenization is used to transform the raw input text of an NLP model into units that the model can process more meaningfully [[1]](#scrollTo=op-j6UywUt5i).
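
To make the idea of word pieces concrete, here is a minimal sketch that uses only Python's standard library, not spaCy. The regular expression, the suffix list, and the minimum stem length are assumptions chosen for illustration; real tokenizers use far more elaborate rules or learned vocabularies.

```python
import re

# Hypothetical, hand-picked suffix list for illustration only.
SUFFIXES = ("ation", "ing", "ly", "s")


def split_words(text):
    """Split text into word and punctuation units with a simple regex."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_suffix(word):
    """Split off the first matching suffix if the stem keeps at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return [word[: -len(suffix)], "-" + suffix]
    return [word]


words = split_words("Tokenization helps, clearly.")
pieces = [piece for word in words for piece in split_suffix(word)]

print(words)   # ['Tokenization', 'helps', ',', 'clearly', '.']
print(pieces)  # ['Tokeniz', '-ation', 'help', '-s', ',', 'clear', '-ly', '.']
```

The first list shows word-level units, the second shows the same words broken into a stem and a suffix piece where one of the assumed suffixes matches.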

This notebook shows examples of the following types of tokenization:
- Word Tokenization
- Sentence Tokenization

## **Word tokenization**

In this section we will use spaCy for word tokenization.

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. According to [[3]](#scrollTo=op-j6UywUt5i), it can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. For example, it supports tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=op-j6UywUt5i). For more information about spaCy, please refer to [[2]](#scrollTo=op-j6UywUt5i).



For word tokenization, we will apply the following steps:
1. Import spaCy library
2. Load the language model (English)
3. Create a spaCy document
4. Access the word tokens by iterating over the document
5. Print the word tokens

### Import spaCy library

In [None]:
# Import spaCy library to process the text
import spacy

### Load language model

In [None]:
# Load the "en_core_web_sm" English language model with spaCy.
## If the model is not installed yet, download it first: python -m spacy download en_core_web_sm
## It is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities [4].
## It is optimized for CPU. Its components are: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer [5].
sp = spacy.load('en_core_web_sm')

### Create spaCy document and tokenize text

When we call the loaded pipeline on a text, spaCy first tokenizes the text to produce a Doc object, which is then processed by the remaining pipeline components. The following figure, based on [[6]](#scrollTo=op-j6UywUt5i), shows the processing pipeline that turns a given text into a Doc object:

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [None]:
# Create a sample spaCy document
## During the document creation process, spaCy automatically tokenizes the given text.
## Then it saves the tokenized text in the document.
doc = sp('I am non-vegetarian, send me the menu at abs-xyz@gmail.com. "They are going to U.K. and then to the U.S.A"')

### Print word tokens

In [None]:
# Print each token in the document:
for token in doc:
    print(token.text)

I
am
non
-
vegetarian
,
send
me
the
menu
at
abs-xyz@gmail.com
.
"
They
are
going
to
U.K.
and
then
to
the
U.S.A
"
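
For comparison, the following minimal sketch splits the same text with Python's built-in "str.split()" instead of spaCy. It is only meant to illustrate what whitespace splitting alone produces; the variable names are chosen for this example.

```python
text = 'I am non-vegetarian, send me the menu at abs-xyz@gmail.com. "They are going to U.K. and then to the U.S.A"'

# str.split() without arguments splits on runs of whitespace only,
# so punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

print(len(whitespace_tokens))  # 19
print(whitespace_tokens[2])    # non-vegetarian,
print(whitespace_tokens[8])    # abs-xyz@gmail.com.
print(whitespace_tokens[9])    # "They
```

Whitespace splitting yields 19 units with punctuation still attached, whereas the spaCy output above contains 25 tokens with commas, periods, quotation marks, and the hyphen separated into their own tokens.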


## **Sentence tokenization**

If we want to split a given text into individual sentences, we use sentence tokenization [[1]](#scrollTo=op-j6UywUt5i). We have already created a document with spaCy and will reuse it for sentence tokenization.

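As explained, spaCy automatically performs the tokenization process during document creation. For sentence tokenization, we simply use the "doc.sents" property of the Doc object, which iterates over the sentences once a pipeline component has set the sentence boundaries. In the "en_core_web_sm" pipeline loaded above, the dependency parser sets these boundaries by default. spaCy also provides the rule-based "Sentencizer" component as an alternative that does not require a trained parser. For more details about the "Sentencizer" class, please refer to [[7]](#scrollTo=op-j6UywUt5i).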
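
As a minimal sketch of how the "Sentencizer" component can be used on its own, the following example adds it to a blank English pipeline. It assumes spaCy v3 is installed; the variable names and the sample text are chosen for illustration, and the pipeline here is separate from the "sp" pipeline used in the rest of this notebook.

```python
import spacy

# Assumption: spaCy v3 is installed. A blank pipeline contains only the tokenizer,
# so no trained model needs to be downloaded for this sketch.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

sample_doc = nlp("This is one sentence. This is another one.")
sentences = [sentence.text for sentence in sample_doc.sents]

print(sentences)  # ['This is one sentence.', 'This is another one.']
```

Because the Sentencizer relies on punctuation rules rather than a trained parser, its sentence boundaries can differ from those of "en_core_web_sm" on more complex text.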

### Print tokenized sentences

In [None]:
# Use the "doc.sents" property for sentence tokenization.
## It iterates over the sentences in the document.
## To mark where sentences begin, spaCy stores a boolean value on each token (token.is_sent_start).
## This value is True for the first token of each sentence and False for all other tokens.
for sentence in doc.sents:
    print(sentence)

I am non-vegetarian, send me the menu at abs-xyz@gmail.com.
"They are going to U.K. and then to the U.S.A"
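
Note that the output above keeps "U.K." inside the second sentence. The following sketch, which uses only Python's standard library and not spaCy, shows what a naive rule would do instead. The rule "a sentence ends at a period followed by whitespace" is an assumption made purely for illustration.

```python
import re

text = 'I am non-vegetarian, send me the menu at abs-xyz@gmail.com. "They are going to U.K. and then to the U.S.A"'

# Naive rule (illustration only): split at whitespace that directly follows a period.
naive_sentences = re.split(r"(?<=\.)\s+", text)

for sentence in naive_sentences:
    print(sentence)

print(len(naive_sentences))  # 3
```

The naive rule produces three pieces because it also splits after the abbreviation "U.K.", which is why sentence tokenization generally needs more than a simple punctuation rule.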


## References

- [1] NLP and Computer Vision_DLMAINLPCV01 Course Book
- [2] https://spacy.io/
- [3] https://spacy.io/usage/spacy-101
- [4] https://spacy.io/models
- [5] https://spacy.io/models/en
- [6] https://spacy.io/usage/processing-pipelines
- [7] https://spacy.io/api/sentencizer

Copyright © 2022 IU International University of Applied Sciences