# Applying Different Tokenizers to Text

## Importing Packages

In [12]:
import pandas as pd
import numpy as np
import re  # standard-library module; NLTK's RegexpTokenizer compiles its pattern with it
import nltk

## Reading Dataset

In [13]:
# Read the tweet data from CSV
df = pd.read_csv('tweets_01-08-2021.csv')

## Defining a Function for Displaying Tokens

In [14]:
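def display_tokens(texts, tokens_collect):
  """Print each text followed by its tokens, separated by '|'."""
  # Loop variable no longer shadows the function parameter.
  for text, tokens in zip(texts, tokens_collect):
    print("Text:   ", text)
    print("Tokens: ", *tokens, sep="|")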

## Defining the Number of Tweets to Display

In [15]:
# Number of tweets to display; every tweet in the DataFrame is still tokenized
num = 10

## Applying Different Tokenizers on Text for Tweets

### Tweet Tokenizer

In [16]:
print("Tokenizing Using Tweet Tokenizer: ")
tokenizer = nltk.tokenize.TweetTokenizer()
df['tokens'] = df['text'].apply(tokenizer.tokenize)
display_tokens(list(df['text'])[:num],list(df['tokens'])[:num])

Tokenizing Using Tweet Tokenizer: 
Text:    Republicans and Democrats have both created our economic problems.
Tokens: |Republicans|and|Democrats|have|both|created|our|economic|problems|.
Text:    I was thrilled to be back in the Great city of Charlotte, North Carolina with thousands of hardworking American Patriots who love our Country, cherish our values, respect our laws, and always put AMERICA FIRST! Thank you for a wonderful evening!! #KAG2020 https://t.co/dNJZfRsl9y
Tokens: |I|was|thrilled|to|be|back|in|the|Great|city|of|Charlotte|,|North|Carolina|with|thousands|of|hardworking|American|Patriots|who|love|our|Country|,|cherish|our|values|,|respect|our|laws|,|and|always|put|AMERICA|FIRST|!|Thank|you|for|a|wonderful|evening|!|!|#KAG2020|https://t.co/dNJZfRsl9y
Text:    RT @CBS_Herridge: READ: Letter to surveillance court obtained by CBS News questions where there will be further disciplinary action and cho…
Tokens: |RT|@CBS_Herridge|:|READ|:|Letter|to|surveillance|court|obtained|by|C
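
The run above uses `TweetTokenizer` with its default settings, which keep handles and repeated characters intact. The constructor also accepts `preserve_case`, `reduce_len`, and `strip_handles`. The following is a minimal sketch of two of those options; the sample string is illustrative and does not come from the dataset.

```python
from nltk.tokenize import TweetTokenizer

# Illustrative sample, not a row from tweets_01-08-2021.csv.
sample = "@remy: This is waaaaayyyy too much for you!!!!!!"

# Default settings: handles are kept and repeated characters are untouched.
default_tokens = TweetTokenizer().tokenize(sample)

# strip_handles removes @mentions; reduce_len shortens runs of three or
# more identical characters down to three.
clean_tokens = TweetTokenizer(strip_handles=True, reduce_len=True).tokenize(sample)

print(*default_tokens, sep="|")
print(*clean_tokens, sep="|")
```

Whether stripping handles is appropriate depends on the downstream task; for retweet analysis, for example, the handle after `RT` may be worth keeping.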

### NLTK's Word Tokenizer

#### Downloading PUNKT Dependency

In [17]:
nltk.download('punkt') 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [18]:
print("\n\nTokenizing Using NLTK's Recommended Word Tokenizer: ")
df['tokens'] = df['text'].apply(nltk.tokenize.word_tokenize)
display_tokens(list(df['text'])[:num],list(df['tokens'])[:num])



Tokenizing Using NLTK's Recommended Word Tokenizer: 
Text:    Republicans and Democrats have both created our economic problems.
Tokens: |Republicans|and|Democrats|have|both|created|our|economic|problems|.
Text:    I was thrilled to be back in the Great city of Charlotte, North Carolina with thousands of hardworking American Patriots who love our Country, cherish our values, respect our laws, and always put AMERICA FIRST! Thank you for a wonderful evening!! #KAG2020 https://t.co/dNJZfRsl9y
Tokens: |I|was|thrilled|to|be|back|in|the|Great|city|of|Charlotte|,|North|Carolina|with|thousands|of|hardworking|American|Patriots|who|love|our|Country|,|cherish|our|values|,|respect|our|laws|,|and|always|put|AMERICA|FIRST|!|Thank|you|for|a|wonderful|evening|!|!|#|KAG2020|https|:|//t.co/dNJZfRsl9y
Text:    RT @CBS_Herridge: READ: Letter to surveillance court obtained by CBS News questions where there will be further disciplinary action and cho…
Tokens: |RT|@|CBS_Herridge|:|READ|:|Letter|to|surveill

### Regular-Expression Tokenizer

#### Defining Regular Expression for Tokenization

In [19]:
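# Token pattern: three alternatives, tried in order at each position
RE_TOKEN = re.compile(r"""
               ( [#]?[@\w'’\.\-\:]*\w      # words, hashtags and handles; must end in a word character
               | [:;<]\-?[\)\(3]           # simple emoticons built from eyes, optional nose, mouth
               | [\U0001F100-\U0001FFFF]   # one character in U+1F100..U+1FFFF (covers most emoji)
               )
               """, re.VERBOSE)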

In [20]:
print("\n\nTokenizing Using Regex Tokenizer: ")
tokenizer = nltk.tokenize.RegexpTokenizer(RE_TOKEN.pattern, flags=re.VERBOSE)
df['tokens'] = df['text'].apply(tokenizer.tokenize)
display_tokens(list(df['text'])[:num],list(df['tokens'])[:num])



Tokenizing Using Regex Tokenizer: 
Text:    Republicans and Democrats have both created our economic problems.
Tokens: |Republicans|and|Democrats|have|both|created|our|economic|problems
Text:    I was thrilled to be back in the Great city of Charlotte, North Carolina with thousands of hardworking American Patriots who love our Country, cherish our values, respect our laws, and always put AMERICA FIRST! Thank you for a wonderful evening!! #KAG2020 https://t.co/dNJZfRsl9y
Tokens: |I|was|thrilled|to|be|back|in|the|Great|city|of|Charlotte|North|Carolina|with|thousands|of|hardworking|American|Patriots|who|love|our|Country|cherish|our|values|respect|our|laws|and|always|put|AMERICA|FIRST|Thank|you|for|a|wonderful|evening|#KAG2020|https|t.co|dNJZfRsl9y
Text:    RT @CBS_Herridge: READ: Letter to surveillance court obtained by CBS News questions where there will be further disciplinary action and cho…
Tokens: |RT|@CBS_Herridge|READ|Letter|to|surveillance|court|obtained|by|CBS|News|questions|wh
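
The output above has no standalone punctuation, and the URL is split into `https`, `t.co`, and `dNJZfRsl9y`. Both effects follow from the first alternative of the pattern, which must end in a word character and does not allow `/`. The sketch below reproduces this on a short string. It uses the standard-library `re` module directly (an assumption made to keep the example dependency-free), and the sample string is made up for illustration.

```python
import re

# Same pattern as in the notebook, compiled with the standard-library `re`.
RE_TOKEN = re.compile(r"""
    ( [#]?[@\w'’\.\-\:]*\w
    | [:;<]\-?[\)\(3]
    | [\U0001F100-\U0001FFFF]
    )
    """, re.VERBOSE)

# Illustrative sample: punctuation, a hashtag, a URL, an emoticon, an emoji.
sample = "Great evening!! #KAG2020 https://t.co/dNJZfRsl9y :-) \U0001F600"

# With a single capturing group, findall returns the captured tokens.
tokens = RE_TOKEN.findall(sample)
print(*tokens, sep="|")
```

The `!!` and the `://` are skipped because no alternative matches them, while `:-)` is caught by the emoticon alternative and the emoji by the code-point range.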

### Toktok Tokenizer

In [21]:
print("\n\nTokenizing Using Toktok Tokenizer: ")
tokenizer = nltk.tokenize.ToktokTokenizer()
df['tokens'] = df['text'].apply(tokenizer.tokenize)
display_tokens(list(df['text'])[:num],list(df['tokens'])[:num])



Tokenizing Using Toktok Tokenizer: 
Text:    Republicans and Democrats have both created our economic problems.
Tokens: |Republicans|and|Democrats|have|both|created|our|economic|problems|.
Text:    I was thrilled to be back in the Great city of Charlotte, North Carolina with thousands of hardworking American Patriots who love our Country, cherish our values, respect our laws, and always put AMERICA FIRST! Thank you for a wonderful evening!! #KAG2020 https://t.co/dNJZfRsl9y
Tokens: |I|was|thrilled|to|be|back|in|the|Great|city|of|Charlotte|,|North|Carolina|with|thousands|of|hardworking|American|Patriots|who|love|our|Country|,|cherish|our|values|,|respect|our|laws|,|and|always|put|AMERICA|FIRST|!|Thank|you|for|a|wonderful|evening|!|!|#KAG2020|https://t.co/dNJZfRsl9y
Text:    RT @CBS_Herridge: READ: Letter to surveillance court obtained by CBS News questions where there will be further disciplinary action and cho…
Tokens: |RT|@CBS_Herridge|:|READ|:|Letter|to|surveillance|court|obtained|b
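
Because each cell overwrites `df['tokens']`, comparing tokenizers means scrolling between outputs. A small helper that runs several tokenizers on the same string makes the differences easier to see. This is a sketch only: `compare_tokenizers` is a hypothetical name, and the two stand-in tokenizers use the standard library so the example runs without NLTK data. Any callable that maps a string to a list of tokens, such as the `tokenize` method of the NLTK tokenizers used in this notebook, could be passed in the same way.

```python
import re
from typing import Callable, Dict, List


def compare_tokenizers(
    text: str, tokenizers: Dict[str, Callable[[str], List[str]]]
) -> Dict[str, List[str]]:
    """Run every tokenizer on the same text and collect the results by name."""
    return {name: tokenize(text) for name, tokenize in tokenizers.items()}


# Stand-in tokenizers (hypothetical, standard library only).
tokenizers = {
    "whitespace": str.split,                     # split on runs of whitespace
    "word_chars": re.compile(r"\w+").findall,    # keep only runs of word characters
}

sample = "AMERICA FIRST! #KAG2020"
results = compare_tokenizers(sample, tokenizers)

for name, tokens in results.items():
    print(f"{name:<12}", "|".join(tokens))
```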