## Issues with the previous word lexicon

**Too slow**
- The previous word lexicon is a Python list that locates words with `list.index`, which is a linear scan. Each lookup is therefore slow, and the cost multiplies when a word has to be looked up several times to cover its adjective, verb, noun and adverb forms.
- For a better implementation, we will consider both hash tables and tries, and check their memory consumption as well.

#### Checking the previous implementation:

In [1]:
from previousimplementation import wordlexicon as oldWL
import time
import sys



In [2]:
def get_time(func, *args):
    """Call func(*args) once and print the elapsed wall-clock time in ms."""
    start_time = time.time()
    func(*args)
    end_time = time.time()
    print(f"Process took: {(end_time-start_time)*1000} ms")

In [3]:
get_time(print, "hello")

hello
Process took: 0.019073486328125 ms


In [4]:
lexicon_list = oldWL.lexicon_list

In [5]:
print(lexicon_list[:10])
print(lexicon_list[400:410])
print(lexicon_list[240000:240010])
print(len(lexicon_list))

['that', 'an', 'most', 'don', 'now', 'with', 'where', 'yourselves', 'how', 'few']
['crested', 'openhanded', 'pursy', 'pulley', 'unbelt', 'dismantle', 'idolater', 'immunocompetence', 'iconolatry', 'disappoint']
['microprobe', 'elaliite', 'elkinstantonite', 'tschauner', 'phosphides', 'gey', 'wdbj', 'whiteboyd', 'bechtel', 'skloot']
240956


In [6]:
for word in ["microprobe", "thought", "some"]:
    get_time(lexicon_list.index, word)

Process took: 2.5370121002197266 ms
Process took: 0.5304813385009766 ms
Process took: 0.0016689300537109375 ms


**The lookup time clearly grows linearly with the word's position in the list**
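
The spread in these timings matches how `list.index` works: it compares the target against each element from the front until it finds a match. A minimal sketch of that behavior follows. The helper `linear_index` and the toy word list are illustrative only and are not part of the old lexicon code.

```python
def linear_index(items, target):
    """Mimic list.index with a front-to-back scan, counting comparisons."""
    comparisons = 0
    for position, item in enumerate(items):
        comparisons += 1
        if item == target:
            return position, comparisons
    raise ValueError(f"{target!r} is not in list")


# Toy data standing in for the real lexicon list.
words = ["that", "an", "most", "thought", "microprobe"]
for word in ["that", "thought", "microprobe"]:
    position, comparisons = linear_index(words, word)
    print(f"{word}: index {position} after {comparisons} comparisons")
```

The number of comparisons equals the position plus one, so a word near the end of a 240956-entry list costs far more than one near the front.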

In [7]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [8]:
for word in ["thought", "think"]:
    # "pos" avoids shadowing the built-in type(); tags: verb, noun, adverb, adjective
    for pos in ["v", "n", "r", "a"]:
        print(f"Checking time constraint with word: {word} and type: {pos}")
        lemmatizer.lemmatize(word, pos)

Checking time constraint with word: thought and type: v
Checking time constraint with word: thought and type: n
Checking time constraint with word: thought and type: r
Checking time constraint with word: thought and type: a
Checking time constraint with word: think and type: v
Checking time constraint with word: think and type: n
Checking time constraint with word: think and type: r
Checking time constraint with word: think and type: a


The lemmatizer calls return without any noticeable delay, so we can use the lemmatizer in our word lexicon.
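
The cell above prints a message per call but does not record a duration. To put a number on it, one option is to average over many calls with `time.perf_counter`. This is a rough sketch: `average_call_ms` is a hypothetical helper, and `str.upper` stands in for `lemmatizer.lemmatize` so that the block runs without NLTK data.

```python
import time


def average_call_ms(func, *args, repeats=10_000):
    """Return the mean wall-clock time of func(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / repeats


# Stand-in workload; with NLTK installed one could call
# average_call_ms(lemmatizer.lemmatize, "thought", "v") instead.
mean_ms = average_call_ms(str.upper, "thought")
print(f"Mean time per call: {mean_ms:.6f} ms")
```

Averaging over many repeats smooths out timer resolution, which matters when a single call takes only a few microseconds.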

In [9]:
lexicon_pb = oldWL.lexicon

In [10]:
print(sys.getsizeof(lexicon_list))
print(sys.getsizeof(lexicon_pb))

1927704
80


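The Python lexicon list object takes about 1.9 MB in memory. `sys.getsizeof` is shallow, so this figure excludes the strings the list points to, and the 80 bytes reported for `lexicon_pb` most likely cover only its outer object as well.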

**Hence, currently memory is not an issue, but time efficiency is**

We will implement two data structures to deal with this: 
- Hash Tables
- A Trie

In [11]:
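class TrieNode:
    def __init__(self):
        self.children = {}           # maps a character to the next TrieNode
        self.is_end_of_word = False
        self.index = -1              # lexicon index of the word ending here, -1 if none

class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.size = 0                # number of distinct words stored

    def insert(self, word, index):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        if not node.is_end_of_word:
            # Count each distinct word once, even if it is inserted again
            self.size += 1
        node.is_end_of_word = True
        node.index = index

    def get_index(self, word):
        """Return the stored index for word, or -1 if word is not in the trie."""
        node = self.root
        for char in word:
            if char not in node.children:
                return -1
            node = node.children[char]
        return node.index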

In [12]:
demoTrie = Trie()

In [13]:
demoTrie.insert("word", 3)
demoTrie.insert("wording", 4)
demoTrie.get_index("word")

3
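
One detail worth spelling out: `get_index` only returns a stored index for complete words. A prefix such as "wor" ends on a node whose `index` is still `-1`, and a string that runs off the trie returns `-1` early. The sketch below uses a compact copy of the classes so that it runs on its own.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.index = -1  # -1 means "no word ends here"


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, index):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.index = index

    def get_index(self, word):
        node = self.root
        for char in word:
            node = node.children.get(char)
            if node is None:
                return -1
        return node.index


demo = Trie()
demo.insert("word", 3)
demo.insert("wording", 4)
print(demo.get_index("wording"))  # 4: a stored word
print(demo.get_index("wor"))      # -1: only a prefix of stored words
print(demo.get_index("wordy"))    # -1: leaves the trie at "y"
```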

In [20]:

punc = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''

In [14]:
lexiconTrie = Trie()
for index, word in enumerate(lexicon_list):
    lexiconTrie.insert(word, index)

In [15]:
for word in ["microprobe", "thought", "some"]:
    get_time(lexicon_list.index, word)

Process took: 2.8772354125976562 ms
Process took: 0.8378028869628906 ms
Process took: 0.0019073486328125 ms


In [21]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [23]:
"microp*robe".translate(str.maketrans('', '', string.punctuation))

'microprobe'

In [27]:
removed_punctuations = [word.translate(str.maketrans('', '', string.punctuation)) for word in lexicon_list]

In [28]:
len(removed_punctuations)

240956

In [29]:
len(list(set(removed_punctuations)))

236102
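
After stripping punctuation the list has 236102 distinct strings for 240956 entries, a difference of 4854. Since `Trie.insert` overwrites the index when the same string is inserted twice, it can help to see which entries collide. A minimal sketch on a made-up word list follows. The sample words and the `find_collisions` name are illustrative only.

```python
import string
from collections import defaultdict

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def find_collisions(words):
    """Map each stripped form to the original entries that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[word.translate(_STRIP_PUNCT)].append(word)
    # Keep only the stripped forms produced by more than one entry.
    return {key: group for key, group in groups.items() if len(group) > 1}


sample = ["don't", "dont", "co-op", "coop", "microprobe"]
print(find_collisions(sample))
# {'dont': ["don't", 'dont'], 'coop': ['co-op', 'coop']}
```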

In [16]:
for word in ["microprobe", "thought", "some"]:
    get_time(lexiconTrie.get_index, word)

Process took: 0.00667572021484375 ms
Process took: 0.00286102294921875 ms
Process took: 0.0019073486328125 ms


**A huge win: trie lookups take a few microseconds regardless of where the word sits in the list :)**

In [17]:
sys.getsizeof(lexiconTrie)

48
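
The 48 bytes reported here cover only the `Trie` object itself, not the nodes it points to, because `sys.getsizeof` does not follow references. A rough sketch of a deeper estimate is shown below. `deep_getsizeof` is a hypothetical helper that only follows `__dict__`, dicts, lists, tuples and sets, so its result is an approximation rather than an exact figure.

```python
import sys


def deep_getsizeof(obj):
    """Approximate total size in bytes of obj and everything it references."""
    seen = set()      # ids of objects already counted
    stack = [obj]     # explicit stack instead of recursion
    total = 0
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        total += sys.getsizeof(current)
        if isinstance(current, dict):
            stack.extend(current.keys())
            stack.extend(current.values())
        elif isinstance(current, (list, tuple, set, frozenset)):
            stack.extend(current)
        elif hasattr(current, "__dict__"):
            stack.append(vars(current))
    return total


class Node:
    """Tiny stand-in for TrieNode."""
    def __init__(self):
        self.children = {}


root = Node()
root.children["w"] = Node()
root.children["w"].children["o"] = Node()

print(sys.getsizeof(root), deep_getsizeof(root))
```

The second number grows with every node added, while the first stays constant, which is why the shallow figure says little about the trie's real footprint.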

In [18]:
import pickle

In [19]:
with open("lexiconTrie.pkl", "wb") as file:
    pickle.dump(lexiconTrie, file)

**The pickled trie (`lexiconTrie.pkl`) is about 30 MB now**

We will skip implementing the hash table for now.
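
For reference, Python's built-in `dict` is already a hash table, so a hash-based variant would not need a hand-written hash function. A minimal sketch, assuming a list shaped like `lexicon_list` (the toy list and the `build_index` name are illustrative):

```python
def build_index(words):
    """Map each word to the position of its first occurrence."""
    index = {}
    for position, word in enumerate(words):
        # setdefault keeps the first position, matching list.index
        index.setdefault(word, position)
    return index


toy_lexicon = ["that", "an", "most", "thought", "an"]
word_to_id = build_index(toy_lexicon)

print(word_to_id.get("thought", -1))  # 3
print(word_to_id.get("an", -1))       # 1
print(word_to_id.get("missing", -1))  # -1
```

Lookups in a dict take roughly constant time on average, so this would be a natural baseline to compare the trie against.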

### Making a function to lemmatize each word and get the word ID afterwards
We will copy the previous function

In [76]:
%%time
with open("lexiconTrie.pkl", "rb") as file:
    lexTrie = pickle.load(file)

CPU times: user 4.59 s, sys: 26.2 ms, total: 4.62 s
Wall time: 4.58 s


In [81]:
%%time
for i in range(1000000):
    lexTrie.get_index("check")

CPU times: user 253 ms, sys: 0 ns, total: 253 ms
Wall time: 252 ms


In [1]:
from trie import Trie, TrieNode

In [2]:
%%time
from wordlexicon import GetLexiconSize, return_wordID



CPU times: user 2.63 s, sys: 1.87 s, total: 4.5 s
Wall time: 2.4 s


In [6]:
get_time(return_wordID, "runner")

Process took: 0.010728836059570312 ms


In [10]:
%%time
for word in ["each", "thing", "thought", "brothers", "brother", "good", "better", "best"]:
    print(return_wordID(word))

172
43379
48510
117019
117019
24242
99721
147351
CPU times: user 57 µs, sys: 32 µs, total: 89 µs
Wall time: 89.9 µs


In [8]:
%%time
return_wordID("brothers")

CPU times: user 19 µs, sys: 10 µs, total: 29 µs
Wall time: 31.5 µs


117019

In [12]:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [14]:
lemmatizer.lemmatize("better", "a")

'good'
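
The results above hint at a gap: `lemmatizer.lemmatize("better", "a")` returns 'good', yet `return_wordID` gave "good", "better" and "best" three different IDs, which suggests the part of speech is not being varied there. The body of `return_wordID` is not shown, so the following is only a hypothetical sketch of a lookup that tries several part-of-speech tags. `lookup_word_id`, the toy index and the stub lemmatizer are all assumptions made so that the block runs without NLTK.

```python
POS_TAGS = ("n", "v", "a", "r")  # noun, verb, adjective, adverb


def lookup_word_id(word, index, lemmatize, pos_tags=POS_TAGS):
    """Return the ID of the first changed lemma found in index,
    else the word's own ID, else -1."""
    word = word.lower()
    for pos in pos_tags:
        lemma = lemmatize(word, pos)
        if lemma != word and lemma in index:
            return index[lemma]
    # No tag produced a different lemma that is in the index.
    return index.get(word, -1)


# Stub standing in for WordNetLemmatizer.lemmatize(word, pos).
_STUB_LEMMAS = {("better", "a"): "good", ("brothers", "n"): "brother"}


def stub_lemmatize(word, pos):
    return _STUB_LEMMAS.get((word, pos), word)


# Toy dict standing in for the real lexicon index.
toy_index = {"good": 0, "better": 1, "brother": 2}

for word in ["good", "better", "brothers", "unknown"]:
    print(word, lookup_word_id(word, toy_index, stub_lemmatize))
```

With NLTK available, `lemmatizer.lemmatize` could be passed in place of the stub, since it accepts a word and a part-of-speech tag. The order of `POS_TAGS` decides which lemma wins when several tags change the word, so it is a design choice rather than a given.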