# Accessing and Using Lexicon

### The lexicon is now stored in a .pb file; the next step is to load it and put it to use

In [1]:
import lexicon_proto_file_pb2 as lexproto
lexicon_pb_path = r"./lexicon.pb"
lexicon = lexproto.Lexicon()

In [2]:
%%time
with open(lexicon_pb_path, 'rb') as file:
    protobufdata = file.read()
lexicon.ParseFromString(protobufdata)

CPU times: user 15 ms, sys: 360 µs, total: 15.4 ms
Wall time: 15.2 ms


1987612

### Time to make relevant functions

- The lexicon has been created, but helper functions are still needed to use it in a meaningful way.
- Two functions are defined for that purpose:

| Function | Purpose |
| --- | --- |
| GetWordID | Returns the word ID for a given word |
| UpdateLexicon | Writes the lexicon back to the lexicon.pb file after new words have been added |

In [3]:
def UpdateLexicon(lexicon, filepath):
    with open(filepath, 'wb') as f:
        f.write(lexicon.SerializeToString())
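Because `UpdateLexicon` overwrites lexicon.pb in place, an interrupted write could leave a truncated file behind. One possible safeguard, shown here only as a sketch, is to write to a temporary file in the same directory and then swap it into place. The helper name `save_bytes_atomically` is hypothetical and not part of the notebook.

```python
import os
import tempfile


def save_bytes_atomically(payload: bytes, filepath: str) -> None:
    """Write payload to a temp file next to filepath, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(filepath))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        # os.replace swaps the file in one step when both paths
        # are on the same filesystem.
        os.replace(tmp_path, filepath)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In this notebook the call would presumably look like `save_bytes_atomically(lexicon.SerializeToString(), lexicon_pb_path)`.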

### Some important steps to take for GetWordID

- The word arrives in raw form, so it is looked up in several forms.
- The following forms are checked in order, and the first one found in the list gives the ID:
     - The word itself
     - The word lemmatized as a noun
     - The word lemmatized as a verb
     - The word lemmatized as an adjective
     - The word lemmatized as an adverb
- If none of these forms is found, the raw word is appended to the list and receives a new ID.

In [4]:
%%time
lexicon_list = list(lexicon.wordlist)

CPU times: user 9.18 ms, sys: 3.69 ms, total: 12.9 ms
Wall time: 12.8 ms


In [5]:
%%time
list(lexicon.wordlist).index("think")

CPU times: user 8.23 ms, sys: 4.04 ms, total: 12.3 ms
Wall time: 12.4 ms


9092

In [6]:
%%time
lexicon_list.index("think")

CPU times: user 160 µs, sys: 0 ns, total: 160 µs
Wall time: 163 µs


9092

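**A prebuilt Python list should therefore always be kept around: converting `lexicon.wordlist` on every lookup took about 12.4 ms, against about 163 µs for searching the existing list.**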

In [7]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("thinks", "v")



'think'

In [8]:
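def GetWordID(word):
    """Return the ID of `word`, adding it to lexicon_list if it is unknown."""
    # 1. Try the raw word first.
    try:
        return lexicon_list.index(word)
    except ValueError:
        pass

    # 2. Try the lemma for each part of speech: noun, verb, adjective, adverb.
    for pos in ['n', 'v', 'a', 'r']:
        lemmatized_word = lemmatizer.lemmatize(word, pos)
        try:
            return lexicon_list.index(lemmatized_word)
        except ValueError:
            pass

    # 3. Unknown word: append it and use its position as the new ID.
    #    Only the in-memory list changes here; lexicon.wordlist is not updated.
    word_id = len(lexicon_list)
    lexicon_list.append(word)
    return word_id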
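`GetWordID` appends unknown words to `lexicon_list` only, so `UpdateLexicon` would not persist them unless they are copied back into `lexicon.wordlist` first. The sketch below shows one way this could be done. It assumes that `wordlist` is a repeated string field (which supports `len()` and `extend()`), and that `lexicon_list` was built from it and has only been appended to since. The helper `sync_new_words` is hypothetical, and a `SimpleNamespace` stands in for the protobuf message so the example runs on its own.

```python
from types import SimpleNamespace


def sync_new_words(lexicon, lexicon_list):
    """Copy words that exist only in lexicon_list into lexicon.wordlist.

    Returns the number of words that were added.
    """
    already_stored = len(lexicon.wordlist)
    new_words = lexicon_list[already_stored:]
    lexicon.wordlist.extend(new_words)
    return len(new_words)


# Stand-in for the protobuf Lexicon message (assumption for illustration).
demo_lexicon = SimpleNamespace(wordlist=["run", "think"])
demo_list = ["run", "think", "Arsalan"]

added = sync_new_words(demo_lexicon, demo_list)
print(added, demo_lexicon.wordlist)
```

With the real message, `sync_new_words(lexicon, lexicon_list)` would be called just before `UpdateLexicon(lexicon, lexicon_pb_path)`.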

In [27]:
%%time
GetWordID("thinks")

CPU times: user 1.37 ms, sys: 71 µs, total: 1.44 ms
Wall time: 1.45 ms


9092

In [28]:
%%time
GetWordID("think")

CPU times: user 97 µs, sys: 5 µs, total: 102 µs
Wall time: 105 µs


9092

In [26]:
%%time
GetWordID("Arsalan")

CPU times: user 13.2 ms, sys: 0 ns, total: 13.2 ms
Wall time: 13.2 ms


147703

In [29]:
%%time
GetWordID("Arsalan")

CPU times: user 1.61 ms, sys: 84 µs, total: 1.7 ms
Wall time: 1.7 ms


147703

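This indicates that words not yet in the list take the most time: a miss scans the whole list up to five times (once for the raw word and once per lemmatized form) before the word is appended. The first lookup of "Arsalan" took 13.2 ms.

Once the word has been added, the repeat lookup drops to 1.7 ms, so on an actual article, where words recur, the total time might not be that long.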

### To check our implementation, we will use the word list from the NLTK `words` corpus

In [10]:
from nltk.corpus import words as wordcorpus
words_all = wordcorpus.words()
len(words_all)

236736

In [16]:
%%time

word_ids = [GetWordID(word.lower()) for word in lexicon_list[:100000]]

CPU times: user 27.4 s, sys: 0 ns, total: 27.4 s
Wall time: 27.4 s


In [35]:
%%time

word_ids = [GetWordID(word.lower()) for word in lexicon_list[0:500]]

CPU times: user 1.1 ms, sys: 7 µs, total: 1.1 ms
Wall time: 1.11 ms


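The results are much slower than expected: looking up the first 100,000 lexicon entries took 27.4 s, while the first 500 took only 1.11 ms. Because `lexicon_list.index` scans the list from the start, a lookup gets slower the deeper the word sits in the list. At this rate, building the forward_index might take hours.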
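A common way to avoid the linear scan is to keep a dictionary from word to ID alongside the list, since dictionary lookups take roughly constant time on average. The sketch below is one possible variant, not the notebook's implementation, and no timings were measured for it. The names `build_index` and `get_word_id_fast` are hypothetical, and a toy lemmatizer stands in for `lemmatizer.lemmatize` so the example runs without NLTK data.

```python
def build_index(words):
    """Map each word to its first position in the list."""
    index = {}
    for position, word in enumerate(words):
        # setdefault keeps the first position, matching list.index.
        index.setdefault(word, position)
    return index


def get_word_id_fast(word, words, index, lemmatize):
    """Same lookup order as GetWordID, using a dict instead of list.index."""
    if word in index:
        return index[word]
    # Noun, verb, adjective, adverb, in that order.
    for pos in ("n", "v", "a", "r"):
        lemma = lemmatize(word, pos)
        if lemma in index:
            return index[lemma]
    # Unknown word: append it and record its new ID in the index.
    index[word] = len(words)
    words.append(word)
    return index[word]


def toy_lemmatize(word, pos):
    """Toy stand-in for a real lemmatizer: strips a verb's trailing 's'."""
    if pos == "v" and word.endswith("s"):
        return word[:-1]
    return word


demo_words = ["run", "think"]
demo_index = build_index(demo_words)
print(get_word_id_fast("thinks", demo_words, demo_index, toy_lemmatize))   # 1
print(get_word_id_fast("Arsalan", demo_words, demo_index, toy_lemmatize))  # 2
```

In the notebook, the index would be built once with `build_index(lexicon_list)`, and `lemmatizer.lemmatize` would be passed as the `lemmatize` argument.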