# Tokenization Using NLTK

In this notebook we will look at tokenization using nltk (the Natural Language Toolkit).

Tokenization is the process of segmenting a string of input characters into smaller units called "tokens", such as words, punctuation marks, or sentences.

For additional information see: https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization 
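
To make the idea concrete before turning to nltk, here is a minimal sketch of a tokenizer built only on Python's standard `re` module. The helper name `simple_tokenize` and the sample sentence are made up for illustration, and this is not how nltk tokenizes text; it only shows what "segmenting a string into tokens" means.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokens are words, and punctuation!")
print(tokens)
# ['Tokens', 'are', 'words', ',', 'and', 'punctuation', '!']
```

A single regular expression like this handles simple text, but it knows nothing about contractions, emoticons, or abbreviations, which is where the nltk tokenizers below come in.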

## Libraries and Resources used

-  Python 3
-  nltk

## Note:

For installation of the necessary resources and libraries, please refer to their respective home pages to find installation steps for your operating system.


Written in February 2017

## Importing Libraries

We will first import the required libraries for the notebook. 

In [1]:
# Import the word tokenizer from nltk
from nltk.tokenize import word_tokenize

# Import tweet tokenizer from nltk
from nltk.tokenize import TweetTokenizer

# Import multi-word tokenizer from nltk
from nltk.tokenize import MWETokenizer

# Import the sentence tokenizer from nltk (it uses the Punkt models from the nltk data package)
import nltk.data
from nltk.tokenize import sent_tokenize

## Word Tokenization

Using nltk we will tokenize a string into a list of "tokens", one for each word or punctuation mark.

In [2]:
# Create the example sentence
exampleSentence = "I am an example sentence to showcase how this process works."

# Tokenize the sentence
tokens = word_tokenize(exampleSentence)

# Print the result
print(tokens)

['I', 'am', 'an', 'example', 'sentence', 'to', 'showcase', 'how', 'this', 'process', 'works', '.']


As you can see, the tokenizer turned each word and punctuation mark of the string (sentence) into its own element and stored them in a list.

### Note:

The resulting tokens are stored in a Python list and can be accessed, manipulated, or altered like any other list.
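
As a small illustration of treating the tokens as an ordinary list, the sketch below indexes, slices, filters, and counts them. The token list is copied from the output above so the example runs without nltk; the variable names are chosen for illustration only.

```python
from collections import Counter
import string

# The token list printed by the word tokenizer above
tokens = ['I', 'am', 'an', 'example', 'sentence', 'to', 'showcase',
          'how', 'this', 'process', 'works', '.']

# Index and slice like any other list
first_token = tokens[0]      # 'I'
last_three = tokens[-3:]     # ['process', 'works', '.']

# Drop tokens made up only of punctuation and lowercase the rest
words = [t.lower() for t in tokens
         if not all(ch in string.punctuation for ch in t)]

# Count how often each word occurs
counts = Counter(words)

print(first_token, last_three)
print(len(words), "words after removing punctuation")
print(counts["example"])
```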

## Tweet Tokenization

Now we will look at a different method of tokenization: the nltk tweet tokenizer. Although it is similar to the previous word tokenizer, there are some differences between the two. One difference is that this method keeps common emoticons, like ":)", together as single tokens. We will showcase these differences below.

In [3]:
# Initialize the tokenizer
tweetTokenizer = TweetTokenizer()

# Sample sentence
tweetString = "This is a sample tweet :) #LivingTheTokenDream @ImaginaryFriend!!!!"

# Store the results
results = tweetTokenizer.tokenize(tweetString)

# Print the results
print(results)

# You can remove mentions by initializing the tokenizer with different parameters
nohandleTokenizer = TweetTokenizer(strip_handles=True)

# Store the results
results_nohandle = nohandleTokenizer.tokenize(tweetString)

# Print the results
print(results_nohandle)

['This', 'is', 'a', 'sample', 'tweet', ':)', '#LivingTheTokenDream', '@ImaginaryFriend', '!', '!', '!']
['This', 'is', 'a', 'sample', 'tweet', ':)', '#LivingTheTokenDream', '!', '!', '!']


As you can see, the tokenizer recognizes certain characteristics that may not appear outside a tweet. Now we will show what happens when you use the word tokenizer for the previous example tweet sentence.

In [4]:
# Sample sentence
tweetString = "This is a sample tweet :) #LivingTheTokenDream @ImaginaryFriend!!!!"

# Word tokenizer
wordResult = word_tokenize(tweetString)

# Tweet tokenizer
tweetTokenizer = TweetTokenizer()
tweetResult = tweetTokenizer.tokenize(tweetString)

# Print the two results
print(wordResult)
print(tweetResult)

['This', 'is', 'a', 'sample', 'tweet', ':', ')', '#', 'LivingTheTokenDream', '@', 'ImaginaryFriend', '!', '!', '!', '!']
['This', 'is', 'a', 'sample', 'tweet', ':)', '#LivingTheTokenDream', '@ImaginaryFriend', '!', '!', '!']


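In the example above you can see that the word tokenizer split "#LivingTheTokenDream" and "@ImaginaryFriend" into a separate symbol and word, while the tweet tokenizer kept each of them as a single token. The same holds for the ":)" emoticon. Note also that the tweet tokenizer output contains three "!" tokens although the input ends with four.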
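
Besides `strip_handles`, the `TweetTokenizer` constructor also accepts `reduce_len` and `preserve_case`. The sketch below combines `strip_handles` with `reduce_len`, which shortens long runs of a repeated character. The sample tweet is the one used in the nltk documentation for this tokenizer, not one from this notebook.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles removes @mentions; reduce_len shortens long character runs
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"
tokens = tokenizer.tokenize(tweet)
print(tokens)
# [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
```

Reducing repeated characters can help when variants such as "waaay" and "waaaaay" should be treated as the same token in later processing.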

## Tokenizing Multiword Expressions

This tokenization method merges sequences of tokens into single tokens, and lets the user define which word combinations should be merged. For example, "a lot" can be treated as one token rather than two.

### Note:

The multiword tokenizer works on a list of tokens, so the text must already be tokenized (for example with the word tokenizer) before it is passed in.

In [5]:
# Initialize the tokenizer by declaring some multiwords that I want to treat as one word
# Also declare how to separate the two words (in this case "a" and "lot" become "a+lot")
MultiWordTokenizer = MWETokenizer([('a', 'lot')], separator='+')

# Adding additional words post initialization
MultiWordTokenizer.add_mwe(('new', 'word'))

# Tokenize the test sentence into words first
testSentence = word_tokenize("This proves that there is a lot of new word out there")

# Merge the declared multiwords and print the results
results = MultiWordTokenizer.tokenize(testSentence)
print(results)

['This', 'proves', 'that', 'there', 'is', 'a+lot', 'of', 'new+word', 'out', 'there']
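
The following sketch, adapted from the example in the nltk documentation for `MWETokenizer`, shows two further details: when no `separator` is given, merged tokens are joined with an underscore, and expressions longer than two words are supported. Here `str.split()` stands in for a word tokenizer to keep the example short; that substitution is an assumption made for illustration, not part of the notebook above.

```python
from nltk.tokenize import MWETokenizer

# With no separator argument, merged tokens are joined with "_"
tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
tokenizer.add_mwe(('in', 'spite', 'of'))

# str.split() stands in for a real word tokenizer here
pre_tokenized = 'In a little or a little bit or a lot in spite of'.split()
merged = tokenizer.tokenize(pre_tokenized)
print(merged)
# ['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']

# An incomplete expression is left untouched
partial = tokenizer.tokenize('This is a test in spite'.split())
print(partial)
# ['This', 'is', 'a', 'test', 'in', 'spite']
```

Note that "a little bit" is merged as a whole rather than as "a_little" followed by "bit", so the longest declared match is used.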


## Tokenizing by Sentence

In the previous sections we discussed how to tokenize a sentence into its different components. Now we will show how to tokenize a string by sentences instead of words.

In [6]:
# Declare a bunch of sentences
largeText = "This string will have a varied amount of sentences!! \
Some will be proper. others will not be, as it did not capitalize the first letter of the sentence"

# Tokenize using the sentence tokenizer
normalSentenceTokenize = sent_tokenize(largeText)

# Print Results
print(normalSentenceTokenize)

['This string will have a varied amount of sentences!!', 'Some will be proper.', 'others will not be, as it did not capitalize the first letter of the sentence']
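
To see why a dedicated sentence tokenizer is useful, it can help to compare it with a naive rule. The sketch below is not how `sent_tokenize` works; `naive_sent_split` is a hypothetical helper that simply splits after ".", "!" or "?" when whitespace follows, and the sample text is made up.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Smith arrived. He was late."
print(naive_sent_split(sample))
# ['Dr.', 'Smith arrived.', 'He was late.']
```

The naive rule breaks the text after the abbreviation "Dr.", which is the kind of case a trained sentence tokenizer is meant to handle better.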


## Loading in a Text File

Now we will provide a short example of how to load a text file for tokenization. This is useful in practice, as a corpus is often stored in text files.

In [7]:
# Read the text file (the with block closes the file once it has been read)
with open("dummytext.txt") as f:
    text = f.read()

# Tokenize the text
tokens = word_tokenize(text)

# Print the results
print(tokens)

['You', 'can', 'even', 'load', 'in', 'from', 'a', 'text', 'file', '.', 'And', 'tokenize', 'it', 'too', '!']
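
For larger files it may be preferable to tokenize line by line instead of reading everything at once. The sketch below makes several assumptions for the sake of a self-contained example: it writes its own temporary file with made-up content instead of using `dummytext.txt`, and it uses nltk's regex-based `wordpunct_tokenize` in place of `word_tokenize`, so its results can differ from the word tokenizer on text such as contractions.

```python
import os
import tempfile
from nltk.tokenize import wordpunct_tokenize

# Create a small stand-in file (hypothetical content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("You can load text from a file.\nAnd tokenize it line by line!\n")
    path = tmp.name

# Tokenize one line at a time so the whole file never has to sit in memory
all_tokens = []
with open(path, encoding="utf-8") as handle:
    for line in handle:
        all_tokens.extend(wordpunct_tokenize(line))

# Clean up the temporary file
os.remove(path)

print(all_tokens)
# ['You', 'can', 'load', 'text', 'from', 'a', 'file', '.', 'And', 'tokenize', 'it', 'line', 'by', 'line', '!']
```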


## Conclusion

We have gone over some of the tokenization methods available in the nltk library for tokenizing words, tweets, multiword expressions, and sentences. The nltk library offers a variety of further options for tokenizing text. For additional information regarding tokenization using nltk, see the nltk documentation at: http://www.nltk.org/api/nltk.tokenize.html
