# **Tokenization**

## **1- Introduction**
In this section, we demonstrate how to implement some NLP tasks using spaCy.
In many cases, tokenization is the first step in NLP tasks.


spaCy is one of the most widely used frameworks for NLP. It can be used to implement tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction.

For more information about spaCy, please refer to [[1]](#scrollTo=op-j6UywUt5i).

### **Content**
This notebook shows basic examples for the following topics:
- Word Tokenization
- Sentence Tokenization

## **2- Word-Tokenization**

For word tokenization, we will follow these steps:
* Import the spaCy library
* Load the language model (English)
* Create a spaCy document
* Access the word tokens by iterating over the document object
* Print the word tokens


In [None]:
# Import the spaCy library to process the text
## spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
## It is designed specifically for production use and helps you build applications that process and "understand" large volumes of text.
## It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [2].
import spacy

# Load the "en_core_web_sm" English language model with spaCy
## It is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities [3].
## It is optimized for CPU, and its components are: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer [4].
sp = spacy.load('en_core_web_sm')
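
If the model package is not installed, `spacy.load` raises an `OSError`. The model is usually installed with `python -m spacy download en_core_web_sm`. The following is a minimal sketch of a defensive loader, assuming that a blank English pipeline (tokenizer only) is an acceptable fallback. The helper name `load_english_pipeline` is hypothetical and not part of spaCy.

```python
import spacy


def load_english_pipeline(name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English pipeline.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    try:
        return spacy.load(name)
    except OSError:
        # Raised when the model package cannot be found.
        # A blank pipeline still contains the English tokenizer,
        # but no tagger, parser, or named entity recognizer.
        return spacy.blank("en")


nlp = load_english_pipeline()
print(nlp.lang)        # language code of the loaded pipeline
print(nlp.pipe_names)  # empty list if the fallback was used
```

The fallback is enough for word tokenization, but anything that depends on trained components, such as the parser, would need the real model.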

Now we will create a sample document.

When you call the loaded pipeline (here `sp`) on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which together are referred to as the processing pipeline.

You can see the processing pipeline in the following picture:

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)
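
The two stages described above (tokenization first, then the pipeline components) can be observed separately. The sketch below assumes spaCy v3 and uses a blank English pipeline, so no trained model is needed. The component `record_length` and the list `seen_lengths` are made up for this illustration.

```python
import spacy
from spacy.language import Language

seen_lengths = []  # filled by the custom component below


@Language.component("record_length")  # hypothetical component name
def record_length(doc):
    # A pipeline component receives the tokenized Doc and must return it.
    seen_lengths.append(len(doc))
    return doc


nlp = spacy.blank("en")  # tokenizer only, empty pipeline
text = "spaCy builds a Doc first."

# Stage 1: tokenization only, no pipeline components are run.
tokens_only = nlp.make_doc(text)
print([t.text for t in tokens_only])

# Stage 2: register a component, then call the pipeline on the text.
nlp.add_pipe("record_length")
print(nlp.pipe_names)

doc = nlp(text)      # tokenizes, then runs every component in order
print(seen_lengths)  # one entry, added by the nlp(text) call
```

`make_doc` never triggers the component, which is why only the call to `nlp(text)` leaves an entry in `seen_lengths`.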

In [None]:
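# Create a sample document by calling the loaded pipeline on a string
doc = sp('I am non-vegetarian, send me the menu at abs-xyz@gmail.com. "They are going to U.K. and the to the U.S.A"')

# Iterate over the Doc object and print the text of each token
for token in doc:
    print(token.text)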

I
am
non
-
vegetarian
,
send
me
the
menu
at
abs-xyz@gmail.com
.
"
They
are
going
to
U.K.
and
the
to
the
U.S.A
"

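The output shows that the tokenizer splits `non-vegetarian` at the hyphen but keeps the e-mail address as one token. Each token also carries attributes that help with filtering. The following sketch assumes a blank English pipeline, which should apply the same English tokenization rules without needing the trained model. It uses only the first sentence of the sample text.

```python
import spacy

nlp = spacy.blank("en")  # the tokenizer alone is enough for these attributes
doc = nlp("I am non-vegetarian, send me the menu at abs-xyz@gmail.com.")

for token in doc:
    # token.i is the token index, token.idx the character offset in the text
    print(token.i, token.idx, repr(token.text), token.is_punct, token.like_email)

# Filter tokens using their lexical attributes
emails = [token.text for token in doc if token.like_email]
words = [token.text for token in doc if not token.is_punct]
print(emails)
print(words)
```

Filtering on `is_punct` is one simple way to keep only word-like tokens for later processing steps.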

## **3- Sentence-Tokenization**

In [None]:
# Print each sentence of the document 'doc' (doc.sents yields one span per sentence)
for sentence in doc.sents:
    print(sentence)

I am non-vegetarian, send me the menu at abs-xyz@gmail.com.
"They are going to U.K. and the to the U.S.A"

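In the cell above, `doc.sents` relies on sentence boundaries set by the trained pipeline. As an alternative sketch (a different technique, not the one used above), spaCy v3 also provides a rule-based `sentencizer` component that splits on sentence-final punctuation and works on a blank pipeline. The example text is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based splitting on punctuation such as . ! ?

doc = nlp("Tokenization comes first. Sentence splitting builds on the tokens!")

for sent in doc.sents:
    # Each sentence is a Span; start and end are token indices into the Doc.
    print(sent.start, sent.end, sent.text)

sentences = [sent.text for sent in doc.sents]
```

The rule-based approach is fast and needs no model, but it may split less accurately than a trained component on text with unusual punctuation.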

## **4- References**

- [1] https://spacy.io/
- [2] https://spacy.io/usage/spacy-101
- [3] https://spacy.io/models
- [4] https://spacy.io/models/en

Copyright © 2021 IU International University of Applied Sciences