# Tokenisation

The notebook contains four types of tokenisation techniques:
1. Word tokenisation
2. Sentence tokenisation
3. Tweet tokenisation
4. Custom tokenisation using regular expressions

### 1. Word tokenisation

In [1]:
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print(document)

At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.


Tokenising on whitespace using Python's `str.split()`

In [2]:
print(document.split())

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God.']


Tokenising using NLTK's word tokeniser

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
from nltk.tokenize import word_tokenize
words = word_tokenize(document)

In [5]:
print(words)

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself', '.', 'It', 'looks', 'like', 'religious', 'mania', ',', 'and', 'he', "'ll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God', '.']


NLTK's word tokeniser not only splits on whitespace but also separates punctuation marks and splits contractions such as "he'll" into "he" and "'ll". On the other hand, it does not split "o'clock" and treats it as a single token.
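
To see what separating punctuation from words involves, here is a minimal sketch that uses only Python's `re` module. It is not NLTK's algorithm: the pattern and the `simple_word_tokenise` helper are illustrative assumptions, and unlike `word_tokenize` this version keeps contractions such as "he'll" in one piece.

```python
import re

document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."

# Hypothetical pattern: a word (optionally with internal apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_word_tokenise(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_word_tokenise(document)
print(tokens)
print(len(tokens))
```

Compared with `document.split()`, the full stops and the comma now come out as their own tokens instead of staying attached to "myself", "mania" and "God".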

### 2. Sentence tokeniser

Tokenising a document into sentences means splitting it at sentence boundaries, which are usually marked by a period ('.'). Let's use NLTK's sentence tokeniser.

In [6]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)

In [7]:
print(sentences)

["At nine o'clock I visited him myself.", "It looks like religious mania, and he'll soon think that he himself is God."]
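
Splitting on periods alone is fragile, which is why a dedicated sentence tokeniser is useful. The sketch below uses an invented example text and a naive rule written with Python's `re` module (neither comes from NLTK) to show how abbreviations produce false sentence breaks.

```python
import re

# Invented example text containing abbreviations that end in a period
text = "Dr. Smith arrived at 9 a.m. sharp. He left soon after."

# Naive rule (an assumption for illustration): a sentence ends at
# '.', '!' or '?' when that character is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule splits after "Dr." and "a.m." as well as at the real sentence boundary, so it returns four pieces for a text that has two sentences. A trained sentence tokeniser is designed to reduce this kind of error.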


### 3. Tweet tokeniser

A problem with the word tokeniser is that it fails to correctly tokenise emoticons and other special character sequences, such as words with hashtags. Emojis and emoticons are common these days and people use them all the time.

In [8]:
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

In [9]:
print(word_tokenize(message))

['i', 'recently', 'watched', 'this', 'show', 'called', 'mindhunters', ':', ')', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<', '3', '.', '#', 'bingewatching', '#', 'nothingtodo', '😎']


The word tokeniser breaks the emoticon '<3' into '<' and '3', and ':)' into ':' and ')', which is something that we don't want. Emojis and emoticons have their own significance in areas like sentiment analysis, where a happy face or a sad face alone can prove to be a really good predictor of the sentiment. Similarly, the hashtags are broken into two tokens. A hashtag is used for searching specific topics or photos in social media apps such as Instagram and Facebook. So there, you want to keep the hashtag as is.

Let's use the tweet tokeniser of nltk to tokenise this message.

In [10]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

In [11]:
tknzr.tokenize(message)

['i',
 'recently',
 'watched',
 'this',
 'show',
 'called',
 'mindhunters',
 ':)',
 '.',
 'i',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'it',
 'was',
 'gr8',
 '<3',
 '.',
 '#bingewatching',
 '#nothingtodo',
 '😎']

As you can see, it handles all the emojis and the hashtags pretty well.
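
NLTK's tweet tokeniser is built on regular expressions. As a rough illustration of the idea, the sketch below tries the more specific alternatives (emoticons, hashtags) before the general ones. The pattern is a heavily simplified assumption written for this message, not the pattern NLTK uses.

```python
import re

message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

# Hypothetical, simplified pattern; alternatives are tried left to right.
TWEET_PATTERN = re.compile(
    r"""
      <3 | [:;=][-']?[)(DPp]   # a few common emoticons
    | \#\w+                    # hashtags
    | \w+                      # words and numbers
    | [^\w\s]                  # any other single non-space character (e.g. an emoji)
    """,
    re.VERBOSE,
)

tweet_tokens = TWEET_PATTERN.findall(message)
print(tweet_tokens)
```

Because the emoticon and hashtag alternatives come first, ':)', '<3' and '#bingewatching' each stay in one piece, while each emoji falls through to the last alternative and becomes a token of its own.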

### 4. Regular expression tokeniser

NLTK also provides a tokeniser that takes a regular expression and returns the tokens that match that pattern.

Let's look at how you can use the regular expression tokeniser.

In [12]:
from nltk.tokenize import regexp_tokenize
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"
pattern = r"#[\w]+"  # raw string, so that \w is not treated as a string escape

In [13]:
regexp_tokenize(message, pattern)

['#bingewatching', '#nothingtodo']
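
In its default mode, `regexp_tokenize` returns the substrings that match the pattern, much like `re.findall` from the standard library. The sketch below uses plain `re` on an invented tweet to show the same idea for hashtags and mentions; the tweet text and variable names are assumptions made for illustration.

```python
import re

# Invented example tweet
tweet = "loved #mindhunter, thanks @netflix for the #binge_watching weekend"

# \w matches letters, digits and underscores, so each match stops
# at the first space or punctuation mark.
hashtags = re.findall(r"#\w+", tweet)
mentions = re.findall(r"@\w+", tweet)

print(hashtags)
print(mentions)
```

Writing the pattern as a raw string (`r"..."`) keeps the backslash in `\w` intact, which avoids invalid-escape warnings in recent Python versions.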

###### Write a piece of code that breaks a given sentence into words and stores them in a list. Then print the list as well as the length of the list. Use the NLTK tokeniser to tokenise words.

Sample input:

"I love pasta"

Expected output:

3

In [14]:
from nltk.tokenize import word_tokenize
sentence = input()

# tokenise sentence into words
words = word_tokenize(sentence) # write your code here

# print length - don't change the following piece of code
print(len(words))

I love pasta
3


###### Write a piece of code that breaks a given sentence into words and stores them in a list. Then remove the stop words from this list and then print the list as well as the length of the list. Again, use the NLTK tokeniser to do this.

Sample input: 

“Education is the most powerful weapon that you can use to change the world”

Expected output: 

6

In [15]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
sentence = input()

# change sentence to lowercase
sentence = sentence.lower() # write code here

# tokenise sentence into words
words = word_tokenize(sentence) # write code here

# extract nltk stop word list (needs nltk.download('stopwords') to have been run once)
# a separate name is used so that the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english')) # write code here

# remove stop words
no_stops = [x for x in words if x not in stop_words] # write code here

# print length - don't change the following piece of code
print(len(no_stops))

Education is the most powerful weapon that you can use to change the world
6
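
The filtering step itself does not depend on NLTK. The sketch below reproduces it with whitespace splitting and a small hand-picked stop list. `SMALL_STOP_LIST` is a hypothetical stand-in used only for illustration; NLTK's English stop word list is much longer.

```python
sentence = "Education is the most powerful weapon that you can use to change the world"

# Hypothetical stop list for this example only (not NLTK's list).
# A set gives fast membership checks.
SMALL_STOP_LIST = {"is", "the", "most", "that", "you", "can", "to"}

# Lowercase first so that "The" and "the" are treated alike
words = sentence.lower().split()

# Keep only the words that are not in the stop list
content_words = [w for w in words if w not in SMALL_STOP_LIST]

print(content_words)
print(len(content_words))
```

Lowercasing before filtering matters because stop lists are typically stored in lowercase, so a capitalised "The" would otherwise slip through.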


###### Write Python code using the NLTK library that breaks a given piece of text containing multiple sentences into individual sentences. Finally, print the total number of sentences in the text.

Sample input: 
Develop a passion for your learning. If you do, you’ll never cease to grow.

Expected output:

2

In [16]:
from nltk.tokenize import sent_tokenize
text = input()

# change text to lowercase
text = text.lower()

# tokenise text into sentences
sentences = sent_tokenize(text) # write code here

# print length - don't change the following piece of code
print(len(sentences))

Develop a passion for your learning. If you do, you’ll never cease to grow.
2


###### Use NLTK’s regex tokeniser to extract all the mentions from a given tweet and then print the total number of mentions. A mention consists of an ‘@’ symbol followed by a username containing letters, numbers or underscores.

Sample tweet:
So excited to be a part of machine learning and artificial intelligence program made by @upgrad and @iiitb

Expected output:
2 (because there are two mentions - ‘@upgrad’ and ‘@iiitb’ )

In [17]:
from nltk.tokenize import regexp_tokenize
text = input()

# change text to lowercase
text = text.lower() # write code here

# pattern to extract mentions (raw string, so that \w is not treated as a string escape)
pattern = r"@[\w]+" # write regex pattern here

# extract mentions by using regex tokeniser
mentions = regexp_tokenize(text, pattern) # write code here

# print length - don't change the following piece of code
print(len(mentions))

So excited to be a part of machine learning and artificial intelligence program made by @upgrad and @iiitb
2
