
# Tokenisation

The notebook contains four types of tokenisation techniques:
1. Word tokenisation
2. Sentence tokenisation
3. Tweet tokenisation
4. Custom tokenisation using regular expressions

### 1. Word tokenisation

In [6]:
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print(document)

At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.


Tokenising on whitespace using Python

In [7]:
print(document.split())

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God.']
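Notice that splitting on whitespace leaves punctuation glued to the words ('myself.', 'mania,', 'God.'). The following is a minimal sketch, using only the standard library `re` module, of how a regular expression can separate punctuation from words. The pattern is an illustrative assumption and not the rule set of any particular library.

```python
import re

document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."

# Illustrative pattern (an assumption, not a library's actual rules):
#   \w+(?:'\w+)?  a word, optionally followed by an apostrophe part (o'clock, he'll)
#   [^\w\s]       any single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", document)

print(tokens)
print(len(tokens))  # 24: the 21 whitespace-separated words plus 3 punctuation marks
```

Unlike a plain `split()`, this keeps 'myself' and '.' as separate tokens, while contractions such as "he'll" stay in one piece.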


Tokenising using the NLTK word tokeniser

In [8]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
from nltk.tokenize import word_tokenize
words = word_tokenize(document)

In [10]:
print(words)

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself', '.', 'It', 'looks', 'like', 'religious', 'mania', ',', 'and', 'he', "'ll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God', '.']


NLTK's word tokeniser not only splits on whitespace but also separates punctuation and breaks contractions such as "he'll" into "he" and "'ll". On the other hand, it doesn't break "o'clock" and treats it as a single token.

### 2. Sentence tokeniser

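Sentence tokenisation splits a document into its sentences. Splitting on the period ('.') alone is not enough, because periods also appear in abbreviations and numbers. Let's use the NLTK sentence tokeniser.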

In [11]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)

In [12]:
print(sentences)

["At nine o'clock I visited him myself.", "It looks like religious mania, and he'll soon think that he himself is God."]
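To see why a plain split on periods is fragile, here is a minimal sketch using only the standard library. The sample text is made up for illustration, and the splitting rule is a deliberately naive assumption, not how a trained sentence tokeniser works.

```python
import re

# Made-up sample text that starts with an abbreviation.
text = "Dr. Seward visited him at nine. He seemed calm."

# Naive rule: a sentence ends at a period that is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)

print(naive_sentences)
# ['Dr.', 'Seward visited him at nine.', 'He seemed calm.']
```

The abbreviation 'Dr.' is wrongly treated as a sentence of its own. Handling cases like this is one reason to prefer a dedicated sentence tokeniser over a hand-written split.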


### 3. Tweet tokeniser

A problem with the word tokeniser is that it fails to correctly tokenise emoticons and other special constructs such as hashtags. Emojis and emoticons are common these days and people use them all the time.

In [13]:
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

In [14]:
print(word_tokenize(message))

['i', 'recently', 'watched', 'this', 'show', 'called', 'mindhunters', ':', ')', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<', '3', '.', '#', 'bingewatching', '#', 'nothingtodo', '😎']


The word tokeniser breaks the emoticon '<3' into '<' and '3', and ':)' into ':' and ')', which is something that we don't want. Emojis and emoticons have their own significance in areas like sentiment analysis, where a happy face or a sad face alone can prove to be a really good predictor of the sentiment. Similarly, each hashtag is broken into two tokens: '#' and the word that follows it. A hashtag is used for searching specific topics or photos in social media apps such as Instagram and Facebook, so there you want to keep the hashtag as is.

Let's use the tweet tokeniser of nltk to tokenise this message.

In [15]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

In [16]:
tknzr.tokenize(message)

['i',
 'recently',
 'watched',
 'this',
 'show',
 'called',
 'mindhunters',
 ':)',
 '.',
 'i',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'it',
 'was',
 'gr8',
 '<3',
 '.',
 '#bingewatching',
 '#nothingtodo',
 '😎']

As you can see, it keeps the emoticons, emojis and hashtags intact.

### 4. Custom tokenisation using regular expressions

NLTK also provides a tokeniser that takes a regular expression and returns the tokens that match the pattern.

Let's look at how you can use the regular expression tokeniser.

In [17]:
from nltk.tokenize import regexp_tokenize
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"
pattern = r"#\w+"  # raw string; matches '#' followed by one or more word characters

In [18]:
regexp_tokenize(message, pattern)

['#bingewatching', '#nothingtodo']
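A single pattern can also cover several token types at once. The sketch below uses `re.findall` from the standard library as a stand-in for a regex tokeniser; the pattern is a simplified assumption written for this one message (it only knows a handful of emoticons) and is not the pattern used by NLTK's tweet tokeniser.

```python
import re

message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

# Alternatives are tried from left to right, so the specific rules
# (emoticons, hashtags) come before the generic ones.
pattern = r"""
    <3 | [:;]-?[)(DP]     # a few emoticons (far from exhaustive)
  | \#\w+                 # hashtags
  | \w+                   # words and numbers
  | [^\w\s]               # any other single non-space character
"""

tokens = re.findall(pattern, message, flags=re.VERBOSE)
print(tokens)
print(len(tokens))  # 23
```

On this message the result matches the tweet tokeniser output shown earlier. The ordering of the alternatives is the design choice that matters: if the generic punctuation rule came first, '<3' would again be split into '<' and '3'.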

In [20]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [21]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sentence = '“Education is the most powerful weapon that you can use to change the world”'

# change sentence to lowercase
sentence = sentence.lower()

# tokenise sentence into words
words = word_tokenize(sentence)

# extract the NLTK English stop word list (named stop_words so that the
# imported stopwords module is not overwritten)
stop_words = stopwords.words('english')

# remove stop words
no_stops = [word for word in words if word not in stop_words]

# print length - don't change the following piece of code
print(len(no_stops))

8
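The count of 8 appears to include the two quotation marks, which are tokens but not words. Here is a minimal sketch of filtering them out as well. The token list is written by hand on the assumption that the curly quotes come out as separate tokens, and the small stop list is a hypothetical stand-in for the NLTK one.

```python
# Tokens written by hand (assumption: each curly quote is its own token).
words = ['“', 'education', 'is', 'the', 'most', 'powerful', 'weapon', 'that',
         'you', 'can', 'use', 'to', 'change', 'the', 'world', '”']

# Hypothetical mini stop list; a set gives fast membership tests.
stop_words = {'is', 'the', 'most', 'that', 'you', 'can', 'to'}

# Remove stop words only.
no_stops = [w for w in words if w not in stop_words]

# Additionally keep only purely alphabetic tokens, which drops the quotes.
content_words = [w for w in no_stops if w.isalpha()]

print(len(no_stops))   # 8
print(content_words)   # ['education', 'powerful', 'weapon', 'use', 'change', 'world']
```

Whether punctuation should count as a token depends on the task, so treat the `isalpha()` filter as one option rather than a rule.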


In [22]:
from nltk.tokenize import sent_tokenize
text = 'Develop a passion for your learning. If you do, you’ll never cease to grow.'

# change text to lowercase
text = text.lower()

# tokenise text into sentences
sentences = sent_tokenize(text)

# print length - don't change the following piece of code
print(len(sentences))

2


In [24]:
from nltk.tokenize import regexp_tokenize
text = 'So excited to be a part of machine learning and artificial intelligence program made by @upgrad and @iiitb'

# change text to lowercase
text = text.lower()

# pattern to extract mentions: '@' followed by letters, digits or underscores
pattern = r"@[a-zA-Z0-9_]+"

# extract mentions by using regex tokeniser
mentions = regexp_tokenize(text, pattern)

# print length - don't change the following piece of code
print(len(mentions))

2
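One thing to watch with a mention pattern like this is that it also matches the part of an email address after the '@'. The sketch below uses `re.findall` from the standard library on a made-up sentence; the stricter pattern with a negative lookbehind is an illustrative assumption, not part of the original exercise.

```python
import re

# Made-up text containing two mentions and one email address.
text = "thanks @upgrad and @iiitb, mail me at someone@example.com"

# Loose pattern: any '@' followed by letters, digits or underscores.
loose = re.findall(r"@[a-zA-Z0-9_]+", text)

# Stricter pattern: the '@' must not be preceded by a word character.
strict = re.findall(r"(?<!\w)@[a-zA-Z0-9_]+", text)

print(loose)   # ['@upgrad', '@iiitb', '@example']
print(strict)  # ['@upgrad', '@iiitb']
```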


In [31]:
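d1 = 'there was one place on my ankle that was itching'
d2 = 'but we did not scratch it'
d3 = 'and then my ear began to itch'
d4 = 'and next my back'

# tokenise each document on whitespace and pool all the tokens
s = d1.split() + d2.split() + d3.split() + d4.split()
print(sorted(s))  # all tokens in alphabetical order
print(len(s))     # total number of tokens
print(set(s))     # unique tokens (set order is arbitrary)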


['and', 'and', 'ankle', 'back', 'began', 'but', 'did', 'ear', 'it', 'itch', 'itching', 'my', 'my', 'my', 'next', 'not', 'on', 'one', 'place', 'scratch', 'that', 'then', 'there', 'to', 'was', 'was', 'we']
27
{'my', 'back', 'that', 'there', 'began', 'and', 'ear', 'place', 'was', 'itching', 'itch', 'one', 'scratch', 'it', 'ankle', 'not', 'on', 'next', 'we', 'to', 'but', 'did', 'then'}
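The set above is the vocabulary of the four documents. If you also want to know how often each token occurs, `collections.Counter` from the standard library is a convenient option. This is a minimal sketch on the same four documents; the variable names are illustrative.

```python
from collections import Counter

docs = [
    'there was one place on my ankle that was itching',
    'but we did not scratch it',
    'and then my ear began to itch',
    'and next my back',
]

# Whitespace-tokenise every document and pool the tokens.
tokens = [word for doc in docs for word in doc.split()]

# Count occurrences; the keys of the counter form the vocabulary.
counts = Counter(tokens)
vocabulary = sorted(counts)

print(len(tokens))            # 27 tokens in total
print(len(vocabulary))        # 23 unique tokens
print(counts.most_common(1))  # [('my', 3)]
```

Note that 'itch' and 'itching' are counted as different tokens here, since plain tokenisation does not reduce words to a common root.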
