# A guide to creating a vocabulary in PyTorch


### Anyone moving from Keras to PyTorch for NLP tasks may find it hard to get started. This notebook explains how to create a vocabulary and tokenize text.

### There are several ways to create a vocabulary for a text corpus in Python. torchtext, PyTorch's text library, offers two: the vocab factory function and the build_vocab_from_iterator function.

## Creating Vocabulary using build_vocab_from_iterator and generator function

In [None]:
!wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-GB,en-US;q=0.9,en;q=0.8" --header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggle-data-sets/888171/1507821/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20211026%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20211026T063304Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=23d1bbb56d79ff30daeac9db46b22f4ab32fa74f7674264aedb5bd8eed0da799eaf76849bba3830432d390edaaf299554e390464fb525070a1d40ff213ff0ed346631762a8126598e99ee622f13eb1ed7babd9383fb5e979966b0880d855c917ff5888549a2bd14f597765412252f2fe470620c23229683cc15e8f0d1836330612373301ff7bdef4b6d3f8d23ebc9023a8040ebaaef308416245707faa6e2e284fa85159c3d9507a7a117cc1ef543b0b15aceffd15293887a4014f9ba74d0e06d466fe8b50e9a113c0b9bc69d94c5ad055807cda9ab72c4d1d805f7882fb06554175743159b50ba390a8ef327987072bf4a277eacf501233c9abb14fc40da6b6" -c -O 'archive.zip'

--2021-10-26 06:33:40--  https://storage.googleapis.com/kaggle-data-sets/888171/1507821/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20211026%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20211026T063304Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=23d1bbb56d79ff30daeac9db46b22f4ab32fa74f7674264aedb5bd8eed0da799eaf76849bba3830432d390edaaf299554e390464fb525070a1d40ff213ff0ed346631762a8126598e99ee622f13eb1ed7babd9383fb5e979966b0880d855c917ff5888549a2bd14f597765412252f2fe470620c23229683cc15e8f0d1836330612373301ff7bdef4b6d3f8d23ebc9023a8040ebaaef308416245707faa6e2e284fa85159c3d9507a7a117cc1ef543b0b15aceffd15293887a4014f9ba74d0e06d466fe8b50e9a113c0b9bc69d94c5ad055807cda9ab72c4d1d805f7882fb06554175743159b50ba390a8ef327987072bf4a277eacf501233c9abb14fc40da6b6
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.79.128, 108.177.119.128, 108.177.127.128, ...
Connecting to storag

In [None]:
!unzip "/content/archive.zip" -d "/content/data"

Archive:  /content/archive.zip
  inflating: /content/data/SMS_test.csv  
  inflating: /content/data/SMS_train.csv  


In [70]:
import pandas as pd 
import string
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from collections import Counter, OrderedDict
from torchtext.vocab import vocab

In [None]:
df = pd.read_csv("/content/data/SMS_train.csv",  encoding= 'unicode_escape')

In [None]:
df.head()

Unnamed: 0,S. No.,Message_body,Label
0,1,Rofl. Its true to its name,Non-Spam
1,2,The guy did some bitching but I acted like i'd...,Non-Spam
2,3,"Pity, * was in mood for that. So...any other s...",Non-Spam
3,4,Will ü b going to esplanade fr home?,Non-Spam
4,5,This is the 2nd time we have tried 2 contact u...,Spam


### I'm using a spam classification dataset to explain these concepts.

1. build_vocab_from_iterator and generator 

In [None]:

# Basic preprocessing steps (lowercasing and punctuation removal) are enough to explain tokenization

In [None]:
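punct = string.punctuation

def preprocess_text(text):
  text = text.lower()
  # str.translate returns a new string, so the result must be assigned back.
  # NOTE: the outputs recorded below were produced without this assignment,
  # so they still contain punctuation.
  text = text.translate(str.maketrans('', '', punct))                     # removes ASCII punctuation
  return text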
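
Python strings are immutable, so `str.translate` returns a new string instead of changing `text` in place. The sketch below shows the pattern of keeping that result. The helper name `strip_punctuation` and the sample message are only illustrative, and `string.punctuation` covers ASCII punctuation only, so characters such as `£` are left untouched.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Lowercase the text and drop ASCII punctuation (hypothetical helper)."""
    text = text.lower()
    # translate() returns a new string, so return (or reassign) the result.
    return text.translate(PUNCT_TABLE)

sample = "Rofl. Its true to its name"
cleaned = strip_punctuation(sample)
print(cleaned)  # rofl its true to its name
```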


In [45]:
processed_text_data = df.Message_body.apply(preprocess_text)

In [46]:
processed_text_data.head()

0                           rofl. its true to its name
1    the guy did some bitching but i acted like i'd...
2    pity, * was in mood for that. so...any other s...
3                 will ü b going to esplanade fr home?
4    this is the 2nd time we have tried 2 contact u...
Name: Message_body, dtype: object

In [None]:
# Let's build a generator with yield that returns the tokens of one line at a time.

def token_generator(text):
  for line in text:
    yield line.strip().split()                                       # returns list of tokens for each line 
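
A generator can be consumed only once, which is presumably why the cells in this notebook call `token_generator(...)` afresh each time they need the tokens. A small sketch with made-up sample lines illustrates this:

```python
def token_generator(lines):
    """Yield a list of whitespace-separated tokens for each line."""
    for line in lines:
        yield line.strip().split()

sample_lines = ["rofl its true", "huh y lei"]  # made-up sample data

gen = token_generator(sample_lines)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # the generator is exhausted, nothing is left

print(first_pass)   # [['rofl', 'its', 'true'], ['huh', 'y', 'lei']]
print(second_pass)  # []
```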


In [47]:
for line in token_generator(processed_text_data):
  print(line)

['rofl.', 'its', 'true', 'to', 'its', 'name']
['the', 'guy', 'did', 'some', 'bitching', 'but', 'i', 'acted', 'like', "i'd", 'be', 'interested', 'in', 'buying', 'something', 'else', 'next', 'week', 'and', 'he', 'gave', 'it', 'to', 'us', 'for', 'free']
['pity,', '*', 'was', 'in', 'mood', 'for', 'that.', 'so...any', 'other', 'suggestions?']
['will', 'ü', 'b', 'going', 'to', 'esplanade', 'fr', 'home?']
['this', 'is', 'the', '2nd', 'time', 'we', 'have', 'tried', '2', 'contact', 'u.', 'u', 'have', 'won', 'the', '£750', 'pound', 'prize.', '2', 'claim', 'is', 'easy,', 'call', '087187272008', 'now1!', 'only', '10p', 'per', 'minute.', 'bt-national-rate.']
['reminder', 'from', 'o2:', 'to', 'get', '2.50', 'pounds', 'free', 'call', 'credit', 'and', 'details', 'of', 'great', 'offers', 'pls', 'reply', '2', 'this', 'text', 'with', 'your', 'valid', 'name,', 'house', 'no', 'and', 'postcode']
['huh', 'y', 'lei...']
['why', "don't", 'you', 'wait', "'til", 'at', 'least', 'wednesday', 'to', 'see', 'if', 'yo

In [48]:
# use that generator/iterator in build_vocab_from_iterator

vocabulary = build_vocab_from_iterator(token_generator(processed_text_data))           # this will construct the Vocabulary and return vocab object
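
Conceptually, building a vocabulary amounts to counting tokens and giving each distinct token an integer index. A rough pure-Python sketch of that idea follows. `build_index` is a hypothetical helper, and its ordering rule (most frequent first, ties broken alphabetically) is an assumption made for illustration, not a statement about torchtext internals.

```python
from collections import Counter

def build_index(token_lists, min_freq=1):
    """Map each token to an integer index (hypothetical stand-in)."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Assumed ordering: descending frequency, then alphabetical.
    ordered = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    kept = [tok for tok, freq in ordered if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(kept)}

docs = [["to", "its", "true", "to"], ["its", "to", "name"]]  # made-up data
stoi = build_index(docs)
print(stoi)  # {'to': 0, 'its': 1, 'name': 2, 'true': 3}
```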


In [49]:
# we can inspect the vocabulary with the get_stoi() method, which returns the token-to-index mapping

vocabulary.get_stoi()

{'ü...': 4260,
 '£900': 4258,
 '£800': 4257,
 '£125': 4253,
 '£1.50perweeksub.': 4250,
 'zeros.': 4247,
 'yummy': 4245,
 'younger': 4241,
 'youdoing': 4239,
 'you:-)': 4238,
 'yo': 4234,
 'yijue?': 4232,
 'yest': 4229,
 'yes-910': 4226,
 'yes-762': 4225,
 'yes-440': 4224,
 'yer': 4221,
 'yay.': 4214,
 'yards': 4212,
 'yar...': 4211,
 'yar': 4210,
 'ya,': 4207,
 'y.': 4206,
 'xxxx': 4205,
 "xin's": 4202,
 'xchat.': 4201,
 'www.sms.ac/u/natalie2k9': 4197,
 'www.sms.ac/u/nat27081980': 4196,
 'www.sms.ac/u/goldviking': 4194,
 'www.santacalling.com': 4192,
 'www.ringtones.co.uk,': 4191,
 'www.ldew.com1win150ppmx3age16subscription': 4190,
 'www.ldew.com1win150ppmx3age16': 4189,
 'www.gamb.tv': 4187,
 'www.comuk.net': 4184,
 'www.cashbin.co.uk': 4182,
 'wuld': 4180,
 'wt': 4179,
 'wrote': 4178,
 'wrnog..': 4177,
 'wrkin?': 4176,
 'ym': 4233,
 'wrenching': 4174,
 'worries': 4168,
 'worms': 4166,
 'world.': 4164,
 'working': 4162,
 'work..': 4161,
 'worc': 4157,
 'woot!': 4156,
 'woot': 4155,
 

In [50]:
len(vocabulary.get_stoi())           

4261

In [51]:
# we can set a constraint such as min_freq to drop rare words, since they carry little information

In [52]:
vocabulary = build_vocab_from_iterator(token_generator(processed_text_data), min_freq=10)     # keeps only tokens that occur at least 10 times in the whole corpus

In [53]:
len(vocabulary.get_stoi())                               # the length of the vocabulary has dropped drastically

205
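
The frequency that `min_freq` is compared against is the total number of occurrences of a token in the corpus, which is not the same as the number of messages containing it. The sketch below contrasts the two counts on made-up data. The names `token_freq` and `doc_freq` are illustrative only.

```python
from collections import Counter

# Made-up tokenized messages.
docs = [["free", "free", "free", "call"], ["call", "now"], ["call", "me"]]

token_freq = Counter()  # total occurrences across the corpus
doc_freq = Counter()    # number of messages containing the token
for tokens in docs:
    token_freq.update(tokens)
    doc_freq.update(set(tokens))

print(token_freq["free"], doc_freq["free"])  # 3 1
print(token_freq["call"], doc_freq["call"])  # 3 3

# Filtering on total occurrences keeps both tokens.
min_freq = 3
kept = sorted(tok for tok, n in token_freq.items() if n >= min_freq)
print(kept)  # ['call', 'free']
```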

In [57]:
# we should tokenize the sentence before passing it to the vocabulary object


tokenizer = get_tokenizer('basic_english')                                        # the whitespace split used above would also work, but this tokenizer handles punctuation better
tokens = tokenizer("hello there what is going on")
print(" tokenizer " ,tokens)
vocabulary.set_default_index(-1)                                                  # index returned for out-of-vocabulary tokens
print(" converting tokens to numerics using vocab ", vocabulary(tokens))


 tokenizer  ['hello', 'there', 'what', 'is', 'going', 'on']
 converting tokens to numerics using vocab  [-1, 103, 44, 7, 73, 25]
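
A default index of -1 is fine for a plain lookup, but embedding layers such as `torch.nn.Embedding` expect non-negative indices, so a common alternative is to reserve a dedicated unknown-token entry. A minimal sketch of both options with a plain dictionary follows. The `<unk>` entry, the `lookup` helper and the indices are made up for illustration.

```python
# Made-up token-to-index mapping with a reserved unknown-token entry.
stoi = {"<unk>": 0, "there": 1, "what": 2, "is": 3}

def lookup(tokens, stoi, default_index):
    """Return the index of each token, or default_index if it is unknown."""
    return [stoi.get(tok, default_index) for tok in tokens]

tokens = ["hello", "there", "what", "is"]
with_minus_one = lookup(tokens, stoi, -1)
with_unk = lookup(tokens, stoi, stoi["<unk>"])

print(with_minus_one)  # [-1, 1, 2, 3]
print(with_unk)        # [0, 1, 2, 3]
```

With torchtext, the same idea would likely mean passing `specials=["<unk>"]` to `build_vocab_from_iterator` and using that token's index as the default.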


In [62]:
# now we can convert every sentence into token indices using this vocabulary.

def text_numeric(tokens):
  return vocabulary(tokens)                      # list of tokens -> list of indices

def tokenize_text(text):
  return tokenizer(text)                         # string -> list of tokens


In [66]:
final_tokenized_text = []

for text in processed_text_data:
  text = tokenize_text(text)
  numeric_tokens = text_numeric(text)  
  final_tokenized_text.append(numeric_tokens)

In [67]:
print(" Tokens for text ", final_tokenized_text[:5])

 Tokens for text  [[-1, 55, 43, -1, 0, 43, -1], [4, -1, 124, 80, -1, 32, 1, -1, 54, 1, -1, 98, 33, -1, 6, -1, -1, -1, 147, 151, 8, 65, -1, 15, 0, 139, 11, 46], [-1, -1, -1, 41, 6, -1, 11, 27, 55, 21, 55, 55, 55, 130, 148, -1, 64], [22, 72, 164, 73, 0, -1, -1, 132, 64], [38, 7, 4, -1, 63, 34, 14, -1, 17, 142, 5, 55, 5, 14, 180, 4, -1, -1, 197, 55, 17, 97, 7, -1, -1, 12, -1, -1, 183, 49, -1, -1, -1, 55, -1, 55]]
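
Besides the min_freq cutoff, one likely reason for the many -1 entries is a tokenizer mismatch: the vocabulary was built from whitespace-split tokens such as 'rofl.', while the lookup uses a tokenizer that separates punctuation from words. The sketch below reproduces the effect with a simplified regex tokenizer, which is only a stand-in and not the actual basic_english implementation.

```python
import re

def whitespace_tokens(text):
    """Tokenize the way the vocabulary was built: split on whitespace."""
    return text.strip().split()

def punct_aware_tokens(text):
    """Simplified stand-in: words (with apostrophes) or single punctuation marks."""
    return re.findall(r"[\w']+|[^\w\s]", text)

text = "rofl. its true to its name"
vocab_tokens = set(whitespace_tokens(text))   # contains 'rofl.' but not 'rofl'
lookup_tokens = punct_aware_tokens(text)      # contains 'rofl' and '.'

missing = [tok for tok in lookup_tokens if tok not in vocab_tokens]
print(lookup_tokens)  # ['rofl', '.', 'its', 'true', 'to', 'its', 'name']
print(missing)        # ['rofl', '.']
```

Using the same tokenizer for building the vocabulary and for the lookup avoids this kind of mismatch.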


### This is how we can convert text to token indices. We can continue with further processing such as padding and embedding.
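
As a rough idea of the padding step, the sketch below right-pads index lists to a common length in plain Python. `pad_sequences` is a hypothetical helper, and the pad value 0 is an assumption: in the vocabulary above index 0 belongs to a real token, so a dedicated padding index would be needed in practice.

```python
def pad_sequences(sequences, pad_value=0, max_len=None):
    """Right-pad (or truncate) each sequence to the same length."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_len])
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded

batch = [[4, 7, 1], [22, 72], [38]]  # made-up token indices
padded = pad_sequences(batch, pad_value=0)
print(padded)  # [[4, 7, 1], [22, 72, 0], [38, 0, 0]]
```

When working with tensors, `torch.nn.utils.rnn.pad_sequence` covers the same need.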

## Creating Vocabulary using vocab, Counter and OrderedDict

In [68]:
# preparing list of all tokens for creating a frequency dictionary 
 
tokens = []

for line in token_generator(processed_text_data):
  tokens.extend(line)

print(tokens)    




In [71]:
counter = Counter(tokens)                                                          # token -> frequency mapping

ordered_tuples = OrderedDict(counter)                                               # ordered dict of token -> frequency, in first-seen order (not sorted by frequency)

vocabulary = vocab(ordered_tuples)                                                  # returns a vocabulary object; indices follow the order of the dict
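
Wrapping a `Counter` in an `OrderedDict` keeps tokens in first-seen order, which is why the indices from this method differ from those of the previous one. If frequency order is wanted, one option is `Counter.most_common()`, as in the sketch below on made-up tokens.

```python
from collections import Counter, OrderedDict

tokens = ["its", "to", "its", "to", "to", "name"]  # made-up tokens
counter = Counter(tokens)

first_seen = OrderedDict(counter)                  # insertion (first-seen) order
by_frequency = OrderedDict(counter.most_common())  # most frequent first

print(list(first_seen))    # ['its', 'to', 'name']
print(list(by_frequency))  # ['to', 'its', 'name']
```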

In [72]:
# we can use the vocabulary just like above 

vocabulary.get_stoi()

{'finish?': 4260,
 'pending': 4259,
 'cleaning': 4258,
 "couldn't": 4254,
 '82277.': 4251,
 'canname': 4248,
 'recorder': 4247,
 'quiz!': 4245,
 'bottom...': 4242,
 'anyway...': 4239,
 'wear': 4237,
 'nvm...': 4236,
 'showr': 4234,
 'fuckin': 4233,
 'upon!': 4232,
 'mns': 4227,
 'prabha': 4225,
 'ennal': 4224,
 'prakasam': 4223,
 'sneham"': 4222,
 'neekunna': 4220,
 'irulinae': 4219,
 'presnts': 4215,
 'dollar': 4212,
 '10am-9pm': 4211,
 '169': 4209,
 'sunday': 4205,
 'evening.': 4204,
 'galileo': 4203,
 'grand': 4200,
 'hella': 4198,
 'gay.': 4195,
 'move': 4193,
 'ü...': 4191,
 'ho.': 4189,
 'tv..he': 4187,
 'vijaykanth': 4186,
 'distance': 4183,
 'keeping': 4182,
 'havent.': 4179,
 'poly/true/pix/ringtones/games': 4176,
 'ripped': 4173,
 'anythiing': 4181,
 'keypad.': 4172,
 'din': 4170,
 'story': 4169,
 'persons': 4168,
 'chennai?': 4167,
 '!!!': 4165,
 'legs': 4164,
 'gaps': 4160,
 'fill': 4159,
 'btwn': 4158,
 'created': 4156,
 'password': 4154,
 'thru': 4152,
 'passwords,atm/sms

In [73]:
len(vocabulary.get_stoi())    

4261

In [76]:
#setting default index for oov tokens 
vocabulary.set_default_index(-1)

We can see that the vocabulary has the same length (4261 tokens) with both methods.

In [77]:
# now we can convert every sentence into token indices using this vocabulary.

def text_numeric(tokens):
  return vocabulary(tokens)                      # list of tokens -> list of indices

def tokenize_text(text):
  return tokenizer(text)                         # string -> list of tokens


In [78]:
final_tokenized_text = []

for text in processed_text_data:
  text = tokenize_text(text)
  numeric_tokens = text_numeric(text)  
  final_tokenized_text.append(numeric_tokens)

In [79]:
print(" Tokens for text ", final_tokenized_text[:5])

 Tokens for text  [[-1, 104, 1, 2, 3, 1, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13, 11, 1437, 490, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 3, 27, 28, 29], [-1, 724, 31, 32, 17, 33, 28, 151, 104, 122, 104, 104, 104, 946, 36, -1, 543], [38, 39, 40, 41, 3, 42, 43, 795, 543], [45, 46, 5, 47, 48, 49, 50, 51, 52, 53, 55, 104, 55, 50, 56, 5, 57, 58, 750, 104, 52, 60, 46, 591, 724, 62, 63, -1, 1714, 65, 66, 67, 2640, 104, 3234, 104]]
