# **Index - Word and Word - Index**

# **Word to Index**
Word to Index is one of the vectorization strategies in NLP. We cannot feed words directly into a model for prediction. We first have to convert them into a language the machine can understand, and that is NUMBERS! In this post I implement the Word 2 Index strategy: every word is assigned an index, and that index is then used to build a one-hot vector for the word.

In [84]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
import numpy as np

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [0]:
words=brown.words(categories='news')

I am using the words from the news category of the Brown corpus, which is available through the NLTK package.

In [0]:
word_freq = nltk.FreqDist(words)

Now I have created a frequency distribution of the words. Let's see how many times the words "The" and "is" are used in the corpus.

In [87]:
print(word_freq['The'])
print(word_freq['is'])

806
732


We know that "The" and "the" are the same word, but the machine treats them as two separate words.

In [88]:
print(word_freq['The'])
print(word_freq['the'])

806
5580


If we used this frequency distribution in our model, it would not work properly, since "The", "the" and "THE" are all counted separately even though semantically they are the same. So it's important to make the machine understand that these words are the same, and that can be done by converting all the words to one format. Let's convert all the words to lower case.


In [0]:
lower_words = [w.lower() for w in words ]

In [92]:
print(lower_words[:10])

['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of']


As we saw above, all the words are now in the same format. Let's build the frequency distribution on these words.

In [0]:
word_freq = nltk.FreqDist(lower_words)

In [94]:
print(word_freq['The'])
print(word_freq['the'])

0
6386


Now we see that "The", "the" and "THE" are all treated the same by the machine: the count for "The" is 0 because every word is now lower case, and the count for "the" is 806 + 5580 = 6386. We are heading in the right direction. Now let's convert these words to indices. We saw above that the first 10 words are **'the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of'**, and we want to convert these words into the numbers [0,1,2,3,4,5,6,7,8,9].

In [0]:
word_2_index = [[j,i] for i,j in enumerate(word_freq)]

  

In [97]:
word_2_index[:10]

[['the', 0],
 ['fulton', 1],
 ['county', 2],
 ['grand', 3],
 ['jury', 4],
 ['said', 5],
 ['friday', 6],
 ['an', 7],
 ['investigation', 8],
 ['of', 9]]

Let's create a dictionary from word_2_index so that we can access it easily.

In [0]:
w = dict(word_2_index)


In [100]:
print(w['the'], w['fulton'], w['county'], w['grand'])

0 1 2 3
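The same kind of mapping can also be built directly with dictionary comprehensions. The sketch below uses a small made-up token list instead of the Brown corpus, so the names `tokens`, `word_to_index` and `index_to_word` are illustrative only.

```python
# Toy token list standing in for lower_words (an assumption for illustration).
tokens = ['the', 'jury', 'said', 'the', 'jury', 'met']

# dict.fromkeys keeps only the first occurrence of each word, in order.
unique_words = list(dict.fromkeys(tokens))

# Map each unique word to its position in that list.
word_to_index = {word: i for i, word in enumerate(unique_words)}

# The reverse lookup decodes an index back into a word.
index_to_word = {i: word for word, i in word_to_index.items()}

print(word_to_index)
print(index_to_word[2])
```

Building both dictionaries from the same list keeps the two directions consistent with each other.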


Looking at the above output we can see that the words are converted to numbers. Now let's see how a whole sentence is converted into numbers.

In [102]:
sentence= 'which is the best place'
sentence= sentence.split()
vector = [w[l] for l in sentence]
print(sentence)
print(vector)

['which', 'is', 'the', 'best', 'place']
[33, 136, 0, 117, 23]


We aren't done yet! When we pass input to a neural network, it goes through various multiplication operations, so we will convert all these words into ONE-HOT VECTORS: vectors as long as the vocabulary, with a 1 at the word's index and 0 everywhere else.
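To see what those multiplication operations do with a one-hot vector, here is a minimal sketch with toy sizes. The `weights` matrix is a hypothetical stand-in for the first layer of a network, not something from this post's model.

```python
import numpy as np

vocab_size, hidden_dim = 5, 3  # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, hidden_dim))  # hypothetical layer weights

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1

# Multiplying a one-hot vector by the weight matrix picks out a single row.
selected = one_hot @ weights
print(np.allclose(selected, weights[word_index]))
```

In other words, the one-hot vector acts as a selector, so each word gets its own row of weights and no word is treated as numerically "larger" than another.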

In [0]:

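one_hot_vector = []
vocab_size = len(w)  # number of unique words, 13112 for this vocabulary
for i in w:
  a = np.zeros(vocab_size)
  a[w[i]] = 1  # put a 1 at the index of word i
  one_hot_vector.append(a)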

Let's now see how the sentence **this is the** is converted to one-hot vectors that can be fed into an ML model for further modelling.

In [106]:
s="this is the"
s= s.split()
X = [one_hot_vector[w[v]] for v in s]
X

[array([0., 0., 0., ..., 0., 0., 0.]),
 array([0., 0., 0., ..., 0., 0., 0.]),
 array([1., 0., 0., ..., 0., 0., 0.])]
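Building one vector for every vocabulary word up front stores 13112 vectors of length 13112, which is roughly 1.4 GB with NumPy's default float64. A lighter option, sketched here with a toy vocabulary, is to create each vector only when it is needed. The helper `one_hot` and the toy `word_to_index` are hypothetical names used for illustration.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a vector of length vocab_size with a 1 at position index."""
    vec = np.zeros(vocab_size)
    vec[index] = 1
    return vec

# Toy vocabulary (an assumption; the vocabulary in this post is much larger).
word_to_index = {'the': 0, 'is': 1, 'this': 2, 'place': 3}

sentence = 'this is the'.split()
X = [one_hot(word_to_index[word], len(word_to_index)) for word in sentence]
print(np.array(X))
```

Only the vectors for the words in the sentence are ever created, so memory use grows with the sentence length rather than with the square of the vocabulary size.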

In [0]:
index_2_word = [[i,j] for i,j in enumerate(word_freq)]
index_2_word = dict(index_2_word)

Now suppose we want to convert these one-hot vectors back into WORDS. Let's see how it's done.

In [116]:
for i in range(len(X)):
  print(index_2_word[np.argmax(X[i])])
 

this
is
the
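The decoding loop can also be written without an explicit loop over rows by stacking the vectors into a matrix. This is a small sketch on a toy vocabulary; `index_to_word` and the matrix `X` below are made up for illustration.

```python
import numpy as np

index_to_word = {0: 'the', 1: 'is', 2: 'this', 3: 'place'}  # toy vocabulary

# Rows are the one-hot vectors for "this is the" under the toy vocabulary.
X = np.array([[0., 0., 1., 0.],
              [0., 1., 0., 0.],
              [1., 0., 0., 0.]])

# argmax along axis 1 returns the position of the 1 in each row.
indices = np.argmax(X, axis=1)
decoded = [index_to_word[int(i)] for i in indices]
print(decoded)
```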


What if we have words that are not in the vocabulary? For that we can create an UNKNOWN tag and map all such words to it!
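One possible way to do this is sketched below on a toy corpus. The tag name `<UNK>`, the choice of index 0 for it, and the helper `encode` are all assumptions made for illustration, not part of the notebook above.

```python
UNK = '<UNK>'  # hypothetical name for the unknown-word tag

tokens = ['the', 'jury', 'said', 'the', 'jury', 'met']  # toy corpus

# Reserve index 0 for the unknown tag, then add words in first-seen order.
vocab = [UNK] + list(dict.fromkeys(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}

def encode(sentence):
    """Lowercase and split a sentence, mapping unseen words to the UNK index."""
    return [word_to_index.get(word, word_to_index[UNK])
            for word in sentence.lower().split()]

print(encode('The jury said hello'))
```

Using `dict.get` with a default avoids the `KeyError` that a plain `w[word]` lookup raises for a word that was never seen in the corpus.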