In [12]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import nltk

nltk.download('punkt_tab')
nltk.download('stopwords')

corpus='''India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[25] is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.'''

corpus = corpus.replace("[25]","").replace(")","")
print(corpus)

India, officially the Republic of India (Hindi: Bhārat Gaṇarājya, is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


2. **Stop Words Removal**

Using NLTK, we tokenize the text and filter out stop words (e.g., "the", "is") and short tokens:

In [13]:
words = []
for word in word_tokenize(corpus):
  if word.lower() not in stopwords.words('english') and (len(word) >= 2):
    words.append(word.lower())

print('Filtered words: ',words)

Filtered words:  ['india', 'officially', 'republic', 'india', 'hindi', 'bhārat', 'gaṇarājya', 'country', 'south', 'asia', 'seventh-largest', 'country', 'area', 'second-most', 'populous', 'country', 'populous', 'democracy', 'world', 'bounded', 'indian', 'ocean', 'south', 'arabian', 'sea', 'southwest', 'bay', 'bengal', 'southeast', 'shares', 'land', 'borders', 'pakistan', 'west', 'china', 'nepal', 'bhutan', 'north', 'bangladesh', 'myanmar', 'east', 'indian', 'ocean', 'india', 'vicinity', 'sri', 'lanka', 'maldives', 'andaman', 'nicobar', 'islands', 'share', 'maritime', 'border', 'thailand', 'myanmar', 'indonesia']
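
The loop above calls `stopwords.words('english')` for every token, and that call returns a list, so each membership test is a linear scan. A minimal sketch of the same filter with the stop words held in a set follows. To stay self-contained it uses a tiny hand-written stop-word set and `str.split` instead of NLTK; the names `SAMPLE_STOP_WORDS` and `filter_tokens` are illustrative assumptions, not part of the notebook.

```python
# Hypothetical, tiny stop-word set for illustration only.
# In the notebook this would be set(stopwords.words('english')).
SAMPLE_STOP_WORDS = {"the", "is", "a", "in", "of", "it", "and", "by"}

def filter_tokens(tokens, stop_words):
    """Lowercase tokens, dropping stop words and tokens shorter than 2 characters."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered not in stop_words and len(token) >= 2:
            kept.append(lowered)
    return kept

# str.split is a rough stand-in for word_tokenize in this sketch.
tokens = "It is the seventh-largest country by area".split()
filtered = filter_tokens(tokens, SAMPLE_STOP_WORDS)
print(filtered)
```

Building the set once and reusing it keeps the filtering condition identical while avoiding repeated list scans.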


3. **Building Vocabulary**

A unique vocabulary is created from the filtered words:

In [14]:
vocab = list(set(words))  # Remove duplicates using set
print("Vocabulary Size:", len(vocab))  # Output: 48
print("Sample Vocabulary:", vocab[:5])

Vocabulary Size: 48
Sample Vocabulary: ['populous', 'officially', 'land', 'republic', 'bhutan']
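
A Python `set` has no guaranteed order, and string hashing is randomized per interpreter process by default, so `list(set(words))` can list the vocabulary in a different order after a restart. If a reproducible order matters, one option is to sort the unique words. The sketch below uses a hand-made `sample_words` list, which is an assumption for illustration only.

```python
# Hand-made word list standing in for the filtered words (assumption).
sample_words = ["india", "country", "south", "asia", "country", "india"]

# sorted() returns a list in alphabetical order, identical on every run.
stable_vocab = sorted(set(sample_words))

print("Vocabulary Size:", len(stable_vocab))
print("Vocabulary:", stable_vocab)
```

With a sorted vocabulary, the numbers assigned in the next step stay the same between runs.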


4. **Creating Encoders and Decoders**

Two dictionaries are built to map words to numbers (encoding) and numbers to words (decoding):

In [15]:
num = 1
word_to_num = {}
num_to_word = {}

for word in vocab:
    word_to_num[word] = num
    num_to_word[num] = word
    num += 1

print("Word-to-Number:", word_to_num['world'])  # number assigned to 'world'; can vary between runs because set order is not fixed
print("Number-to-Word:", num_to_word[24])       # word assigned to 24; can vary between runs

Word-to-Number: 17
Number-to-Word: bounded
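
The same two dictionaries can be sketched more compactly with `enumerate`, which also makes it easy to check that decoding reverses encoding. The `sample_vocab` list and the `sample_`-prefixed dictionary names are illustrative assumptions, not the notebook's actual vocabulary.

```python
# Small stand-in vocabulary (assumption for this sketch).
sample_vocab = ["asia", "country", "india", "south"]

# Start numbering at 1 to match the loop above.
sample_word_to_num = {word: num for num, word in enumerate(sample_vocab, start=1)}
sample_num_to_word = {num: word for word, num in sample_word_to_num.items()}

print(sample_word_to_num["india"])
print(sample_num_to_word[3])

# Round trip: every word maps to a number that maps back to the same word.
round_trip_ok = all(
    sample_num_to_word[sample_word_to_num[w]] == w for w in sample_vocab
)
print(round_trip_ok)
```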


# **Text-Encoding and Decoding**

We have already seen basic encoding and decoding of individual words. Here we encode and decode whole sentences.

First, let us use sentence tokenization to print each sentence.

In [19]:

for sent in sent_tokenize(corpus):
  print(sent)

India, officially the Republic of India (Hindi: Bhārat Gaṇarājya, is a country in South Asia.
It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world.
Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.
In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.


To get each sentence as a list of tokens, we can apply `word_tokenize` to every sentence:

In [20]:
for sent in sent_tokenize(corpus):
  print(word_tokenize(sent))

['India', ',', 'officially', 'the', 'Republic', 'of', 'India', '(', 'Hindi', ':', 'Bhārat', 'Gaṇarājya', ',', 'is', 'a', 'country', 'in', 'South', 'Asia', '.']
['It', 'is', 'the', 'seventh-largest', 'country', 'by', 'area', ',', 'the', 'second-most', 'populous', 'country', ',', 'and', 'the', 'most', 'populous', 'democracy', 'in', 'the', 'world', '.']
['Bounded', 'by', 'the', 'Indian', 'Ocean', 'on', 'the', 'south', ',', 'the', 'Arabian', 'Sea', 'on', 'the', 'southwest', ',', 'and', 'the', 'Bay', 'of', 'Bengal', 'on', 'the', 'southeast', ',', 'it', 'shares', 'land', 'borders', 'with', 'Pakistan', 'to', 'the', 'west', ';', '[', 'f', ']', 'China', ',', 'Nepal', ',', 'and', 'Bhutan', 'to', 'the', 'north', ';', 'and', 'Bangladesh', 'and', 'Myanmar', 'to', 'the', 'east', '.']
['In', 'the', 'Indian', 'Ocean', ',', 'India', 'is', 'in', 'the', 'vicinity', 'of', 'Sri', 'Lanka', 'and', 'the', 'Maldives', ';', 'its', 'Andaman', 'and', 'Nicobar', 'Islands', 'share', 'a', 'maritime', 'border', 'with

Each sentence is now printed as a list of tokens, but the lists still contain stop words and punctuation. Let's remove them and print the sentences again.

In [23]:
for sent in sent_tokenize(corpus):
  for word in word_tokenize(sent):
    if (word.lower() not in stopwords.words('english')) and (len(word)>=2):
      print(word,end=' ')
  print()

India officially Republic India Hindi Bhārat Gaṇarājya country South Asia 
seventh-largest country area second-most populous country populous democracy world 
Bounded Indian Ocean south Arabian Sea southwest Bay Bengal southeast shares land borders Pakistan west China Nepal Bhutan north Bangladesh Myanmar east 
Indian Ocean India vicinity Sri Lanka Maldives Andaman Nicobar Islands share maritime border Thailand Myanmar Indonesia 
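
The cell above only prints the filtered words. To keep them for later steps, one option is to collect them in a list of lists, one inner list per sentence. The sketch below is self-contained: the regular expressions are rough stand-ins for `sent_tokenize` and `word_tokenize`, and `SAMPLE_STOP_WORDS` is a tiny hand-written set, all assumptions made for illustration.

```python
import re

text = "India is a country in South Asia. It is the seventh-largest country by area."
SAMPLE_STOP_WORDS = {"is", "a", "in", "it", "the", "by"}  # hypothetical subset

# Rough stand-in for sent_tokenize: split after sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)

filtered_sentences = []
for sentence in sentences:
    # Rough stand-in for word_tokenize: runs of word characters and hyphens.
    tokens = re.findall(r"[\w-]+", sentence)
    kept = [t for t in tokens if t.lower() not in SAMPLE_STOP_WORDS and len(t) >= 2]
    filtered_sentences.append(kept)

print(filtered_sentences)
```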


Next we want to print each word's encoded number alongside the word itself. Before that, let's rebuild the `word_to_num` and `num_to_word` dictionaries, exactly as we did previously.

In [25]:
words=[]
for word in word_tokenize(corpus):
    if (word.lower() not in stopwords.words('english')) and (len(word)>=2):
        words.append(word.lower())

vocab=list(set(words))
len(vocab)

num=1
word_to_num={}
num_to_word={}
for word in vocab:
    word_to_num[word]=num
    num_to_word[num]=word
    num+=1

Now let us do the encoding.

In [26]:
for sent in sent_tokenize(corpus):
    for word in word_tokenize(sent):
        if (word.lower() not in stopwords.words('english')) and (len(word)>=2):
            print(word,end=' ')
            print(word_to_num[word.lower()],end=' ')
    print()

India 15 officially 2 Republic 4 India 15 Hindi 28 Bhārat 25 Gaṇarājya 42 country 35 South 48 Asia 13 
seventh-largest 9 country 35 area 16 second-most 12 populous 1 country 35 populous 1 democracy 21 world 17 
Bounded 24 Indian 6 Ocean 38 south 48 Arabian 43 Sea 7 southwest 44 Bay 46 Bengal 45 southeast 18 shares 33 land 3 borders 26 Pakistan 29 west 47 China 14 Nepal 8 Bhutan 5 north 10 Bangladesh 27 Myanmar 37 east 31 
Indian 6 Ocean 38 India 15 vicinity 39 Sri 40 Lanka 30 Maldives 19 Andaman 32 Nicobar 23 Islands 22 share 41 maritime 20 border 34 Thailand 11 Myanmar 37 Indonesia 36 


Each word is now followed by its encoded number. For example, in this run India is encoded as 15 and Republic as 4.

To keep only the encoded numbers, as one list per sentence:

In [27]:
data=[]
for sent in sent_tokenize(corpus):
    temp=[]
    for word in word_tokenize(sent):
        if (word.lower() not in stopwords.words('english')) and (len(word)>=2):
            #print(word,end=' ')
            temp.append(word_to_num[word.lower()])
    print(temp)
    data.append(temp)
    print()

[15, 2, 4, 15, 28, 25, 42, 35, 48, 13]

[9, 35, 16, 12, 1, 35, 1, 21, 17]

[24, 6, 38, 48, 43, 7, 44, 46, 45, 18, 33, 3, 26, 29, 47, 14, 8, 5, 10, 27, 37, 31]

[6, 38, 15, 39, 40, 30, 19, 32, 23, 22, 41, 20, 34, 11, 37, 36]



This is the encoded form of the whole corpus. We encode text this way because machine learning and deep learning models operate on numbers, not raw text.
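
The encoding loop can also be wrapped in a small function so it is reusable for new sentences. This is a minimal sketch: `encode_sentence`, `sample_word_to_num`, and `sample_sentences` are hypothetical names and data, and the tokens are assumed to be already filtered.

```python
def encode_sentence(tokens, word_to_num):
    """Map each token to its number; tokens are lowercased before lookup."""
    return [word_to_num[token.lower()] for token in tokens]

# Stand-in mapping and pre-filtered sentences (assumptions for this sketch).
sample_word_to_num = {"asia": 1, "country": 2, "india": 3, "south": 4}
sample_sentences = [["India", "country", "South", "Asia"], ["country", "India"]]

encoded = [encode_sentence(s, sample_word_to_num) for s in sample_sentences]
print(encoded)
```

Note that a word missing from the dictionary raises a `KeyError` here, just as it would in the loop above, so the vocabulary must be built from the same filtered tokens that are being encoded.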





---



Now let us see how decoding works. The encoded sentences were appended to the list `data`, so it holds the entire encoded corpus. Let us print it once to check.

In [28]:
for sent in data:
    print(sent)

[15, 2, 4, 15, 28, 25, 42, 35, 48, 13]
[9, 35, 16, 12, 1, 35, 1, 21, 17]
[24, 6, 38, 48, 43, 7, 44, 46, 45, 18, 33, 3, 26, 29, 47, 14, 8, 5, 10, 27, 37, 31]
[6, 38, 15, 39, 40, 30, 19, 32, 23, 22, 41, 20, 34, 11, 37, 36]


In [29]:
# Decode each sentence using the data variable.
for sent in data:
    for word in sent:
        print(num_to_word[word],end=' ')
    print()

india officially republic india hindi bhārat gaṇarājya country south asia 
seventh-largest country area second-most populous country populous democracy world 
bounded indian ocean south arabian sea southwest bay bengal southeast shares land borders pakistan west china nepal bhutan north bangladesh myanmar east 
indian ocean india vicinity sri lanka maldives andaman nicobar islands share maritime border thailand myanmar indonesia 
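
As the output shows, decoding returns the lowercased, filtered words rather than the original text: stop words, punctuation, and capitalization are not stored in the encoding, so they cannot be recovered. A minimal sketch of decoding as a function, using a hypothetical `sample_num_to_word` mapping and `decode_sentence` helper:

```python
# Stand-in mapping (assumption for this sketch).
sample_num_to_word = {1: "asia", 2: "country", 3: "india", 4: "south"}

def decode_sentence(numbers, num_to_word):
    """Turn a list of numbers back into a space-separated string of words."""
    return " ".join(num_to_word[n] for n in numbers)

sample_encoded = [[3, 2, 4, 1], [2, 3]]
decoded = [decode_sentence(s, sample_num_to_word) for s in sample_encoded]
for line in decoded:
    print(line)
```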




---



# **Text Encoding - Decoding | Without Removing Stop Words**

This time we do the same encoding and decoding, but without removing the stop words. We will still remove the punctuation marks, however. Let us get started by importing the libraries and loading the corpus.

In [30]:
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords

corpus='''India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[25] is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.'''

corpus = corpus.replace("[25]" , "")
corpus = corpus.replace("[f]" , "")
corpus = corpus.replace(")" , "")

We can remove punctuation marks by checking ASCII values. In the tokenized output, each punctuation mark in this corpus appears as its own single-character token. So if a single-character token's ASCII value falls outside the ranges 65 to 90 (A-Z) and 97 to 122 (a-z), we drop it as a special character. Tokens longer than one character are always kept. Let us see how to do that in Python.






In [31]:
words=[]
for word in word_tokenize(corpus):
    if(len(word)==1):
        if((ord(word)>=97 and ord(word)<=122) or (ord(word)>=65 and ord(word)<=90)):
            words.append(word.lower())
    else:
        words.append(word.lower())
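
The same rule can be sketched with a small helper, alongside an alternative based on `str.isalpha()`. The difference is that the ASCII range check rejects a single non-ASCII letter such as `ā`, while `isalpha()` accepts it. The helper name `is_ascii_letter` and the `sample_tokens` list are assumptions made for illustration.

```python
def is_ascii_letter(ch):
    """True if ch is a single character in A-Z or a-z (ASCII 65-90 or 97-122)."""
    return len(ch) == 1 and ("A" <= ch <= "Z" or "a" <= ch <= "z")

# Hand-made tokens standing in for word_tokenize output (assumption).
sample_tokens = ["India", ",", "a", "(", "Bhārat", ";", "ā", "."]

# Same rule as the cell above: multi-character tokens always pass,
# single characters are kept only if they are ASCII letters.
kept_ascii = [t.lower() for t in sample_tokens if len(t) > 1 or is_ascii_letter(t)]

# Alternative: str.isalpha() also accepts non-ASCII letters such as 'ā'.
kept_unicode = [t.lower() for t in sample_tokens if len(t) > 1 or t.isalpha()]

print(kept_ascii)
print(kept_unicode)
```

Either way, only single-character tokens are tested, so any multi-character punctuation token would still pass through this rule.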

Let's create vocab and see how many words we have this time.

In [32]:
vocab=list(set(words))
print(len(vocab))

61


Last time the vocabulary had 48 words; this time it has 61, because the stop words are now included. Most of the remaining code is the same, with a few changes. Let us first create the `word_to_num` and `num_to_word` dictionaries.

In [33]:
num=1
word_to_num={}
num_to_word={}
for word in vocab:
    word_to_num[word]=num
    num_to_word[num]=word
    num+=1

**Encoding**

We encode the same way as before, but replace the stop-word condition with the ASCII-value condition.

In [34]:
data=[]
for sent in sent_tokenize(corpus):
    temp=[]
    for word in word_tokenize(sent):
        if(len(word)==1):
            if((ord(word)>=97 and ord(word)<=122) or (ord(word)>=65 and ord(word)<=90)):
                temp.append(word_to_num[word.lower()])
        else:
            temp.append(word_to_num[word.lower()])
    data.append(temp)
print(data)

[[18, 2, 35, 4, 41, 18, 37, 29, 55, 9, 21, 46, 11, 61, 16], [60, 9, 35, 10, 46, 42, 19, 35, 15, 1, 46, 12, 35, 50, 1, 25, 11, 35, 20], [28, 42, 35, 6, 51, 30, 35, 61, 35, 56, 7, 30, 35, 57, 12, 35, 59, 41, 58, 30, 35, 22, 60, 44, 3, 31, 47, 38, 32, 35, 36, 17, 8, 12, 5, 32, 35, 13, 12, 33, 12, 49, 32, 35, 40], [11, 35, 6, 51, 18, 9, 11, 35, 52, 41, 53, 39, 12, 35, 23, 34, 43, 12, 27, 26, 54, 21, 24, 45, 47, 14, 49, 12, 48]]


**Decoding**

In [35]:
for sent in data:
    for word in sent:
        print(num_to_word[word],end=' ')
    print()

india officially the republic of india hindi bhārat gaṇarājya is a country in south asia 
it is the seventh-largest country by area the second-most populous country and the most populous democracy in the world 
bounded by the indian ocean on the south the arabian sea on the southwest and the bay of bengal on the southeast it shares land borders with pakistan to the west china nepal and bhutan to the north and bangladesh and myanmar to the east 
in the indian ocean india is in the vicinity of sri lanka and the maldives its andaman and nicobar islands share a maritime border with thailand myanmar and indonesia 
