### Python | Tokenize text using TextBlob

**TextBlob** is a Python library that offers a simple API for common natural language processing (NLP) tasks. It is built on top of NLTK.

Install TextBlob and download its corpora by running the following commands in a terminal:

pip install -U textblob

python -m textblob.download_corpora
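
Before moving on, it can help to confirm that the packages are importable from the interpreter (or notebook kernel) you are actually using. The snippet below is a minimal sketch that relies only on the standard library; the helper name `is_installed` is hypothetical and is not part of TextBlob or NLTK.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return find_spec(module_name) is not None


# Report whether each library used in this tutorial is available.
for name in ("textblob", "nltk"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing right after installation inside a notebook, restarting the kernel is usually the first thing to try.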

In [1]:
pip install -U textblob

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
Installing collected packages: textblob
Successfully installed textblob-0.18.0.post0
Note: you may need to restart the kernel to use updated packages.


In [3]:
!python -m textblob.download_corpora

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping co

### Some terms that will be used frequently:

- Corpus – A body of text. The plural is corpora.
- Lexicon – Words and their meanings.
- Token – Each “entity” produced when text is split up according to a set of rules. For example, each word is a token when a sentence is “tokenized” into words. Each sentence can also be a token, if you tokenize a paragraph into sentences.

In short, tokenizing means splitting a body of text into sentences and words.
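
To make the idea concrete, here is a minimal sketch of sentence and word tokenization using only regular expressions from the standard library. The function names `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical, and the rules are deliberately simplistic: this is not how TextBlob or NLTK tokenize internally, and it will mis-split abbreviations such as "Dr." or "e.g.".

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_word_tokenize(sentence):
    """Return runs of letters/digits, allowing inner hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence)


sample = "Tokenizing is simple. It splits text into sentences, and sentences into words!"
sentences = naive_sent_tokenize(sample)
print(sentences)
print([naive_word_tokenize(s) for s in sentences])
```

Each sentence is a token of the paragraph, and each word is a token of its sentence. The libraries covered in this tutorial apply far more robust rules to do the same job.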

In [4]:
# Import the TextBlob class from the textblob package
from textblob import TextBlob

text = ("Natural language processing (NLP) is a field " +
	"of computer science, artificial intelligence " +
	"and computational linguistics concerned with " +
	"the interactions between computers and human " +
	"(natural) languages, and, in particular, " +
	"concerned with programming computers to " +
	"fruitfully process large natural language " +
	"corpora. Challenges in natural language " +
	"processing frequently involve natural " +
	"language understanding, natural language " +
	"generation (frequently from formal, machine" +
	"-readable logical forms), connecting language " +
	"and machine perception, managing human-" +
	"computer dialog systems, or some combination " +
	"thereof.")

# create a TextBlob object
blob_object = TextBlob(text)

# tokenize paragraph into words.
print(" Word Tokenize :\n", blob_object.words)

# tokenize paragraph into sentences.
print("\n Sentence Tokenize :\n", blob_object.sentences)


 Word Tokenize :
 ['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machine-readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human-computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']

 Sentence Tokenize :
 [Sentence("Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions

## Tokenize text using NLTK in Python

To run the program below, the Natural Language Toolkit (NLTK) must be installed on your system.
NLTK is a large toolkit that supports most stages of a natural language processing (NLP) workflow.
To install NLTK, run the following commands in your terminal:

- pip install nltk (on Linux or macOS, prefix with sudo only for a system-wide install; sudo is not available on Windows)
- Then start the Python shell by typing python in your terminal
- Type import nltk
- nltk.download('all')

In [6]:
!pip install nltk



In [7]:
# Import NLTK's sentence and word tokenizer functions
from nltk.tokenize import sent_tokenize, word_tokenize
  
text = "Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof."
  
print(sent_tokenize(text)) 
print(word_tokenize(text)) 


['Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.', 'Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.']
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 't
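
Comparing the two outputs, NLTK's `word_tokenize` keeps punctuation marks such as `(`, `)` and `,` as separate tokens, whereas TextBlob's `.words` omits them. If you want NLTK-style tokens without punctuation, one option is to filter them afterwards. The sketch below uses only the standard library; `drop_punctuation` is a hypothetical helper (not part of either library), and the token list is copied from the start of the NLTK output above.

```python
import string


def drop_punctuation(tokens):
    """Drop tokens made up entirely of ASCII punctuation characters."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


# First few tokens from the word_tokenize output shown above.
nltk_style = ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a',
              'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence']

print(drop_punctuation(nltk_style))
```

Tokens that merely contain punctuation, such as `machine-readable`, are kept because only tokens made up entirely of punctuation are removed.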