# Tokenization Techniques
> Tokenization is a fundamental technique in natural language processing (NLP) and computational linguistics that breaks a text document or a sentence into individual tokens. Each token represents a meaningful unit, such as a word or punctuation mark, and serves as the basic building block for downstream NLP tasks.

#### The tokenization techniques covered here are:

1. `Word Tokenization:` This technique breaks the text into individual words. The most straightforward approach is to split the text on whitespace characters such as spaces and tabs.

2. `Sentence Tokenization:` In sentence tokenization, the text is divided into individual sentences. This is typically done by identifying sentence boundaries using punctuation marks like periods, question marks, and exclamation marks.

3. `wordpunct tokenization:` This is a tokenization method provided by the NLTK (Natural Language Toolkit) library in Python. It tokenizes text so that words and punctuation marks become separate tokens.

4. `TextBlob's word tokenization:` It treats whitespace and punctuation marks as token boundaries and provides a convenient way to tokenize text without writing regular expressions by hand. Note that TextBlob's word tokenization may not handle certain cases, such as contractions or specialized punctuation, as accurately as more advanced tokenization techniques.
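
As a minimal sketch of the first two ideas in the list, the snippet below splits a sample string on whitespace and then on sentence-ending punctuation using only the standard library. The sample text and the function names `whitespace_tokens` and `naive_sentences` are illustrative assumptions, not part of NLTK or TextBlob, and the regular expression is a simplification that would mis-split abbreviations such as "Dr.".

```python
import re

def whitespace_tokens(text):
    """Split on runs of whitespace (spaces, tabs, newlines)."""
    return text.split()

def naive_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

sample = "Tokenization is simple. Is it always?\tNot quite!"
print(whitespace_tokens(sample))
print(naive_sentences(sample))
```

With a plain whitespace split, punctuation stays attached to the neighbouring word (`simple.`, `always?`), which is one reason dedicated tokenizers are usually preferred.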

## Import Required libraries

In [18]:
import pandas as pd
import numpy as np

import re
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used below
nltk.download('stopwords')
# WordNet data is needed before applying the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

from pickle import dump
from pickle import load

sns.set_style('whitegrid')
plt.style.use('bmh')

import warnings
warnings.filterwarnings('ignore')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# for HD visualizations
%config InlineBackend.figure_format='retina'

[nltk_data] Downloading package stopwords to C:\Users\GUDLA
[nltk_data]     RAGUWING\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package wordnet to C:\Users\GUDLA
[nltk_data]     RAGUWING\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package omw-1.4 to C:\Users\GUDLA
[nltk_data]     RAGUWING\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

[nltk_data] Downloading package punkt to C:\Users\GUDLA
[nltk_data]     RAGUWING\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Note:
* In this notebook, we first load the cleaned CSV file obtained after performing EDA on the original CSV file, and then apply several tokenization techniques to see how they work.

In [9]:
df_cleaned = pd.read_csv(r"C:\Users\GUDLA RAGUWING\Data Science Course\Internship_Project\cleaned_df.csv")

In [10]:
df_cleaned.head()

Unnamed: 0.1,Unnamed: 0,id,qid1,question1,qid2,question2,is_duplicate,Question_similarity
0,0,0,1,What is the step by step guide to invest in sh...,2,What is the step by step guide to invest in sh...,0,Not Similar
1,1,1,3,What is the story of Kohinoor (Koh-i-Noor) Dia...,4,What would happen if the Indian government sto...,0,Not Similar
2,2,2,5,How can I increase the speed of my internet co...,6,How can Internet speed be increased by hacking...,0,Not Similar
3,3,3,7,Why am I mentally very lonely? How can I solve...,8,Find the remainder when [math]23^{24}[/math] i...,0,Not Similar
4,4,4,9,"Which one dissolve in water quikly sugar, salt...",10,Which fish would survive in salt water?,0,Not Similar


## Applying word_tokenize (Word Tokenizer)

In [2]:
# Build the stop word set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words("english"))

def preprocess(raw_text):
    # Replace special characters and digits with spaces
    sentence = re.sub("[^a-zA-Z]", " ", raw_text)

    # Convert to lower case
    sentence = sentence.lower()

    # Tokenize into words
    tokens = word_tokenize(sentence)

    # Remove stop words
    clean_tokens = [t for t in tokens if t not in STOP_WORDS]

    # Return the cleaned text and its token count
    return pd.Series([" ".join(clean_tokens), len(clean_tokens)])
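
The cleaning steps in `preprocess` can be sketched without NLTK as follows. This is an illustration under stated assumptions: `DEMO_STOP_WORDS` is a tiny hypothetical stop word set standing in for `stopwords.words("english")`, and `str.split` stands in for `word_tokenize`, which is adequate here only because the regular expression has already replaced everything except letters with spaces.

```python
import re

# Hypothetical, deliberately tiny stop word set (not NLTK's list).
DEMO_STOP_WORDS = {"what", "is", "the", "by", "to", "in"}

def preprocess_sketch(raw_text):
    # Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    # Lower-case, then split on whitespace
    tokens = letters_only.lower().split()
    # Keep tokens that are not stop words (set lookup is O(1) on average)
    clean_tokens = [t for t in tokens if t not in DEMO_STOP_WORDS]
    return " ".join(clean_tokens), len(clean_tokens)

print(preprocess_sketch("What is the step-by-step guide to invest in 2024?"))
```

Holding the stop words in a `set` that is built once keeps each membership test cheap, which matters when the function is applied to hundreds of thousands of rows.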


In [14]:
from tqdm import tqdm
# tqdm.pandas() registers progress_apply on pandas Series and DataFrame objects,
# so long-running operations show a progress bar and an estimated time to completion.
tqdm.pandas()

In [26]:
temp_df = df_cleaned['question1'].progress_apply(lambda x: preprocess(x))

temp_df.head()

100%|█████████████████████████████████████████████████████████████████████████| 404290/404290 [19:16<00:00, 349.73it/s]


Unnamed: 0,0,1
0,step step guide invest share market india,7
1,story kohinoor koh noor diamond,5
2,increase speed internet connection using vpn,6
3,mentally lonely solve,3
4,one dissolve water quikly sugar salt methane c...,10


In [27]:
temp_df1 = df_cleaned['question2'].progress_apply(lambda x: preprocess(x))

temp_df1.head()

100%|█████████████████████████████████████████████████████████████████████████| 404290/404290 [19:44<00:00, 341.28it/s]


Unnamed: 0,0,1
0,step step guide invest share market,6
1,would happen indian government stole kohinoor ...,10
2,internet speed increased hacking dns,5
3,find remainder math math divided,5
4,fish would survive salt water,5


In [31]:
Word_tokenize = pd.concat([temp_df,temp_df1], axis = 1)

In [32]:
Word_tokenize.columns = ['Text_Wordtokenize_Q1','Text_len_Q1','Text_Wordtokenize_Q2','Text_len_Q2']

In [33]:
Word_tokenize.head()

Unnamed: 0,Text_Wordtokenize_Q1,Text_len_Q1,Text_Wordtokenize_Q2,Text_len_Q2
0,step step guide invest share market india,7,step step guide invest share market,6
1,story kohinoor koh noor diamond,5,would happen indian government stole kohinoor ...,10
2,increase speed internet connection using vpn,6,internet speed increased hacking dns,5
3,mentally lonely solve,3,find remainder math math divided,5
4,one dissolve water quikly sugar salt methane c...,10,fish would survive salt water,5


## Applying sent_tokenize (Sentence Tokenizer)

In [25]:
def preprocess(raw_text):
    # Removing special characters and digits
    sentence = re.sub("[^a-zA-Z]", " ", raw_text)
    
    # change sentence to lower case
    sentence = sentence.lower()
    
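    # tokenize into sentences (punctuation was already replaced by spaces above,
    # so no sentence boundaries remain and each text yields a single sentence)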
    tokens = sent_tokenize(sentence)
    
    return pd.Series([" ".join(tokens), len(tokens)])
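
Because the regular expression in this `preprocess` replaces `.`, `?` and `!` with spaces before `sent_tokenize` runs, the tokenizer has no boundary markers left to work with, which is consistent with the sentence count of 1 in every row of the output. The sketch below illustrates the effect with a naive regex splitter standing in for `sent_tokenize`; the sample text and the helper name `naive_sentences` are assumptions made for illustration.

```python
import re

def naive_sentences(text):
    # Stand-in for sent_tokenize: split after '.', '?' or '!' plus whitespace
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

raw = "Why is the sky blue? How far away is the sun?"

# Splitting the raw text finds both sentences
print(len(naive_sentences(raw)))

# Stripping non-letters first removes the boundaries
stripped = re.sub("[^a-zA-Z]", " ", raw).lower()
print(len(naive_sentences(stripped)))
```

If per-row sentence counts are the goal, one option is to run the sentence tokenizer on the raw text and clean each sentence afterwards.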


In [26]:
temp_df2 = df_cleaned['question1'].progress_apply(lambda x: preprocess(x))

temp_df2.head()

100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [00:59<00:00, 6750.52it/s]


Unnamed: 0,0,1
0,what is the step by step guide to invest in sh...,1
1,what is the story of kohinoor koh i noor dia...,1
2,how can i increase the speed of my internet co...,1
3,why am i mentally very lonely how can i solve it,1
4,which one dissolve in water quikly sugar salt...,1


In [31]:
temp_df3 = df_cleaned['question2'].progress_apply(lambda x: preprocess(x))

temp_df3.head()

100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [01:01<00:00, 6577.32it/s]


Unnamed: 0,0,1
0,what is the step by step guide to invest in sh...,1
1,what would happen if the indian government sto...,1
2,how can internet speed be increased by hacking...,1
3,find the remainder when math math i...,1
4,which fish would survive in salt water,1


In [32]:
Sentence_tokenize = pd.concat([temp_df2,temp_df3], axis = 1)

In [33]:
Sentence_tokenize.columns = ['Text_sentencetokenize_Q1','Text_len_Q1','Text_sentencetokenize_Q2','Text_len_Q2']

In [34]:
Sentence_tokenize.head()

Unnamed: 0,Text_sentencetokenize_Q1,Text_len_Q1,Text_sentencetokenize_Q2,Text_len_Q2
0,what is the step by step guide to invest in sh...,1,what is the step by step guide to invest in sh...,1
1,what is the story of kohinoor koh i noor dia...,1,what would happen if the indian government sto...,1
2,how can i increase the speed of my internet co...,1,how can internet speed be increased by hacking...,1
3,why am i mentally very lonely how can i solve it,1,find the remainder when math math i...,1
4,which one dissolve in water quikly sugar salt...,1,which fish would survive in salt water,1


## Applying wordpunct_tokenize (Punctuation-based Tokenizer)
* This tokenizer splits text into runs of word characters and runs of punctuation, so punctuation marks become separate tokens.
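
NLTK implements `wordpunct_tokenize` as a regular-expression tokenizer with the pattern `\w+|[^\w\s]+`, so its behaviour can be sketched with the standard library alone. The function name `wordpunct_sketch` and the sample sentence are illustrative assumptions.

```python
import re

def wordpunct_sketch(text):
    # Runs of word characters, or runs of non-word, non-space characters
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_sketch("Can't wait (really)!"))
```

Two things are visible in the result: the contraction is split around the apostrophe, and adjacent punctuation marks such as `)!` stay together as one token.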

In [19]:
def preprocess(raw_text):
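    # Replace characters outside the class with spaces. Note that ';-_' is a range
    # (';' through '_'), so '[', ']', '^', '<', '>', '=' and '@' are kept and '-' is not.
    sentence = re.sub("[^?.,!():;-_a-zA-Z]", " ", raw_text)
    
    # change sentence to lower case
    sentence = sentence.lower()
    
    # tokenize into words and punctuation marks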
    tokens = wordpunct_tokenize(sentence)
    
    return pd.Series([" ".join(tokens), len(tokens)])
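
One detail of the character class used in this function is worth checking: inside `[...]`, the sequence `;-_` is read as a range from `;` to `_`, not as three literal characters. The sketch below, using made-up sample text, shows which characters survive, together with one possible variant (an assumption, not the notebook's code) that places `-` last so that it is matched literally.

```python
import re

original = "[^?.,!():;-_a-zA-Z]"   # ';-_' is a range: ; < = > ? @ A-Z [ \ ] ^ _
variant = "[^?.,!():;_a-zA-Z-]"    # '-' placed last, so it is a literal hyphen

sample = "Koh-i-Noor [math]2^3[/math] <b>"

# Brackets, '^', '<' and '>' are kept; hyphens are replaced by spaces
print(re.sub(original, " ", sample).split())

# Hyphens are kept; brackets, '^', '<' and '>' are replaced by spaces
print(re.sub(variant, " ", sample).split())
```

This explains why tokens such as `[`, `]` and `^` appear in the tokenized output while hyphenated words are split apart.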


In [21]:
temp_df2 = df_cleaned['question1'].progress_apply(lambda x: preprocess(x))

temp_df2.head()

100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [00:50<00:00, 7994.02it/s]


Unnamed: 0,0,1
0,what is the step by step guide to invest in sh...,15
1,what is the story of kohinoor ( koh i noor ) d...,13
2,how can i increase the speed of my internet co...,15
3,why am i mentally very lonely ? how can i solv...,13
4,"which one dissolve in water quikly sugar , sal...",16


In [22]:
temp_df3 = df_cleaned['question2'].progress_apply(lambda x: preprocess(x))

temp_df3.head()

100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [00:58<00:00, 6953.09it/s]


Unnamed: 0,0,1
0,what is the step by step guide to invest in sh...,13
1,what would happen if the indian government sto...,18
2,how can internet speed be increased by hacking...,11
3,find the remainder when [ math ] ^ [ math ] is...,16
4,which fish would survive in salt water ?,8


In [23]:
Punct_Tokenize = pd.concat([temp_df2,temp_df3], axis = 1)
Punct_Tokenize.columns = ['Text_puncttokenize_Q1','Text_len_Q1','Text_puncttokenize_Q2','Text_len_Q2']

In [24]:
Punct_Tokenize.head()

Unnamed: 0,Text_puncttokenize_Q1,Text_len_Q1,Text_puncttokenize_Q2,Text_len_Q2
0,what is the step by step guide to invest in sh...,15,what is the step by step guide to invest in sh...,13
1,what is the story of kohinoor ( koh i noor ) d...,13,what would happen if the indian government sto...,18
2,how can i increase the speed of my internet co...,15,how can internet speed be increased by hacking...,11
3,why am i mentally very lonely ? how can i solv...,13,find the remainder when [ math ] ^ [ math ] is...,16
4,"which one dissolve in water quikly sugar , sal...",16,which fish would survive in salt water ?,8


## Observations:
* Unlike the word_tokenize output above, where punctuation was removed during cleaning, the wordpunct_tokenize output keeps punctuation marks such as `?`, `,`, `(` and `)` as separate tokens, which contributes to the higher token counts.

## Applying TextBlob Word Tokenize

In [37]:
#!pip install -U textblob
#!python3 -m textblob.download_corpora

In [38]:
from textblob import TextBlob

> TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

In [39]:
def preproce(raw_text):
    # Removing special characters and digits
    sentence = re.sub("[^?.,!():;-_a-zA-Z]", " ", raw_text)
    
    # change sentence to lower case
    text = sentence.lower()
    # tokenize into words
    blob_object = TextBlob(text)
    text_words = blob_object.words
    
    return pd.Series([" ".join(text_words), len(text_words)])


In [40]:
temp_df4 = df_cleaned['question1'].progress_apply(lambda x: preproce(x))

temp_df4.head()

100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [02:24<00:00, 2792.82it/s]


Unnamed: 0,0,1
0,what is the step by step guide to invest in sh...,14
1,what is the story of kohinoor koh i noor diamond,10
2,how can i increase the speed of my internet co...,14
3,why am i mentally very lonely how can i solve it,11
4,which one dissolve in water quikly sugar salt ...,13


In [41]:
temp_df5 = df_cleaned['question2'].progress_apply(lambda x: preproce(x))

temp_df5.head()

100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [02:31<00:00, 2664.02it/s]


Unnamed: 0,0,1
0,what is the step by step guide to invest in sh...,12
1,what would happen if the indian government sto...,15
2,how can internet speed be increased by hacking...,10
3,find the remainder when math math is divided by,9
4,which fish would survive in salt water,7


In [42]:
Textblob_tokenize = pd.concat([temp_df4,temp_df5], axis = 1)

In [43]:
Textblob_tokenize.columns = ['Text_TextBlobtokenize_Q1','Text_len_Q1','Text_TextBlobtokenize_Q2','Text_len_Q2']

In [44]:
Textblob_tokenize.head()

Unnamed: 0,Text_TextBlobtokenize_Q1,Text_len_Q1,Text_TextBlobtokenize_Q2,Text_len_Q2
0,what is the step by step guide to invest in sh...,14,what is the step by step guide to invest in sh...,12
1,what is the story of kohinoor koh i noor diamond,10,what would happen if the indian government sto...,15
2,how can i increase the speed of my internet co...,14,how can internet speed be increased by hacking...,10
3,why am i mentally very lonely how can i solve it,11,find the remainder when math math is divided by,9
4,which one dissolve in water quikly sugar salt ...,13,which fish would survive in salt water,7


### Observations:
* The TextBlob tokenizer drops punctuation and special characters from its word list. It also has rules for English contractions, although they do not come into play here because the cleaning regex replaces apostrophes with spaces before tokenization.
* English contractions are shortened forms of words or phrases in which omitted letters or sounds are replaced with an apostrophe ('), giving a more informal, conversational style. E.g. can't for cannot.
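
Contraction handling only matters if the apostrophe reaches the tokenizer. In `preproce`, the cleaning regex replaces it with a space first, as this standard-library sketch with a made-up sample sentence shows (`str.split` is used here only to make the resulting pieces visible; it is not TextBlob's tokenizer):

```python
import re

sample = "I can't find it"

# Same character class as in preproce: the apostrophe is not in the class,
# so it is replaced by a space before any tokenizer sees the text
cleaned = re.sub("[^?.,!():;-_a-zA-Z]", " ", sample).lower()
print(cleaned.split())
```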

## Conclusion:
> The choice of tokenizer for a specific NLP project depends on the problem statement and on the text-to-numeric representation (for example, word embeddings) that will be used, so choose carefully.