## Building a lexicon that suits our needs

**Our lexicon needs a list of every vocabulary word we will use. However, there are a few things to take into consideration:**

- We need to make sure our lexicon includes only lemmatized words.
- Because of this, whenever we look up a word, we have to lemmatize it first as well.
- This is hard, because lemmatizing correctly requires knowing which part of speech that specific word is.
- A better alternative would be to give the different forms of a word the same wordID.
- Achieving this with a single list can be very hard.
- We need a better data structure that lets us assign wordIDs to the different forms of a word.
- The method to do that is still being worked on :)

In [8]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to /home/arsal4an/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.corpus import words as wordcorpus
import lexicon_proto_file_pb2 as lexproto
stop_words = set(stopwords.words('english'))

### Method 1:

- One method is to keep 4-5 parallel lists of the same size, each storing a different form of the words.
- A given index refers to the same base word in every list, but each list holds a different form of it:

| Index | List1 | List2 | List3 | List4 |
| --- | --- | --- | --- | --- |
| 0 | think | thinks | thought | thinking |
| 1 | run | runs | ran | running |
| 2 | super | "" | "" | "" |
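
The parallel-list layout in the table above could be sketched roughly as follows. This is only an illustration with toy data: the names `form_lists` and `find_word_id` are hypothetical and do not appear elsewhere in this notebook.

```python
# Parallel lists mirroring the table above; "" marks a form that does not exist.
list1 = ["think", "run", "super"]
list2 = ["thinks", "runs", ""]
list3 = ["thought", "ran", ""]
list4 = ["thinking", "running", ""]
form_lists = [list1, list2, list3, list4]


def find_word_id(word):
    """Return the shared index of `word` in any of the lists, or None if absent."""
    if not word:
        # Never match the "" placeholders
        return None
    word = word.lower()
    for forms in form_lists:
        if word in forms:
            return forms.index(word)
    return None


print(find_word_id("ran"))       # same index as "run"
print(find_word_id("thinking"))  # same index as "think"
print(find_word_id("jump"))      # not in any list
```

Every form of a word resolves to the same index, which is what makes the index usable as a wordID. The linear scan is kept for clarity only; with a lexicon of this size it would be slow.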


#### First of all, we will make our List1

In order to make our first list, we will get 4 types of words:

- Verbs
- Nouns
- Adverbs
- Adjectives


We will write a function that gets each kind of word from WordNet in its simple form.
The loop that calls it also removes the stopwords from each list.

In [5]:
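def get_simple_forms(pos):
    """Return every lowercased WordNet lemma name for the given part of speech."""
    words = set()
    # all_synsets can be iterated directly, so no list() wrapper is needed
    for synset in wordnet.all_synsets(pos=pos):
        for lemma in synset.lemmas():
            words.add(lemma.name().lower())
    return list(words)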

# Get the 4 types of words: verbs, nouns, adjectives, adverbs
simple_all = []
for pos in ['v', 'n', 'a', 'r']:
    # Get a list of all words for this part of speech
    word_list = get_simple_forms(pos)
    # Make sure all the stopwords are removed
    word_list = [word for word in word_list if word not in stop_words]
    # No lemmatization step is needed: WordNet lemma names are already in simple form
    simple_all.append(word_list)
    print(len(word_list))




11517
117751
21443
4442


In [6]:
all_words = []
for i in range(4):
    print(simple_all[i][:5])
    all_words += simple_all[i]

print("Size of words before removing duplicates: ", len(all_words))
all_words = list(set(all_words))
print("Size of words after removing duplicates: ", len(all_words))


['overemphasise', 'trigger_off', 'hang_together', 'ski_jump', 'grub_up']
['islam', 'amphibious_assault', 'genus_fabiana', 'renal_colic', 'bragger']
['court-ordered', 'slim', 'virtuoso', 'uncarpeted', 'muddled']
['extensively', 'vigilantly', 'especially', 'exquisitely', 'head_over_heels']
Size of words before removing duplicates:  155153
Size of words after removing duplicates:  147229


### Now we will get a list of all possible words

In [9]:
all_words_extra = wordcorpus.words()
print(all_words_extra[:10])
print(len(all_words_extra))


['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron']
236736


We will now use our lemmatizer to check which form matches any word from the list.

## Scrapping Method 1 as it is overcomplicated

A simpler method is to keep the original list. When a new word comes in, we run the lemmatizer with each of the 4 parts of speech to get up to 4 candidate forms, and then look for any of them in the list.

We will try the 4 parts of speech in this order: n -> a -> v -> r (noun, adjective, verb, adverb)
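
A minimal sketch of this lookup, assuming the lemmatizer is passed in as a callable that takes `(word, pos)`. The helper names `build_index`, `lookup_word_id` and `toy_lemmatize` are hypothetical, and the tiny lemmatizer below is only a stand-in so the example runs without any NLTK data.

```python
POS_ORDER = ("n", "a", "v", "r")  # noun -> adjective -> verb -> adverb


def build_index(lexicon):
    """Map each word to its position (wordID) in the lexicon list."""
    return {word: i for i, word in enumerate(lexicon)}


def lookup_word_id(word, index, lemmatize):
    """Return the wordID of the first lemmatized form found in the lexicon, else None."""
    word = word.lower()
    for pos in POS_ORDER:
        lemma = lemmatize(word, pos)
        if lemma in index:
            return index[lemma]
    return None


# Stand-in lemmatizer for illustration only (not NLTK's behaviour in general)
def toy_lemmatize(word, pos):
    table = {("running", "v"): "run", ("thoughts", "n"): "thought"}
    return table.get((word, pos), word)


index = build_index(["the", "run", "thought"])
print(lookup_word_id("Running", index, toy_lemmatize))   # found via the verb form
print(lookup_word_id("thoughts", index, toy_lemmatize))  # found via the noun form
print(lookup_word_id("qwerty", index, toy_lemmatize))    # not in the lexicon
```

In this notebook the callable would presumably be `WordNetLemmatizer().lemmatize`, with the index built from the final word list. Using a dictionary for the index avoids scanning the whole list on every lookup.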

**Hence, our lexicon is final; we just need to add the stopwords at the start.**

In [10]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [11]:
print(len(stop_words))
print(len(all_words))

179
147229


**Adding the 179 stop words at the start of this list**

We need to separate these words as they occur very frequently in documents.
Completely removing them is not feasible and a better option would be to add them to the start and reduce t

In [12]:
all_words = list(stop_words) + all_words
len(all_words)

147408
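
One practical consequence of putting the stop words first is that they occupy the lowest wordIDs, so checking whether an ID belongs to a stop word becomes a single comparison. The sketch below illustrates this with toy data; `prepend_stop_words` and `is_stop_word_id` are hypothetical helper names, not functions defined in this notebook.

```python
def prepend_stop_words(stop_words, words):
    """Return (lexicon, num_stop_words) with the stop words occupying the lowest IDs."""
    stop_list = list(stop_words)
    # Skip any stop word still present in `words` so that no entry appears twice
    lexicon = stop_list + [w for w in words if w not in stop_words]
    return lexicon, len(stop_list)


def is_stop_word_id(word_id, num_stop_words):
    """A wordID refers to a stop word exactly when it falls in the leading block."""
    return 0 <= word_id < num_stop_words


# Toy data for illustration only
lexicon, num_stop = prepend_stop_words({"a", "the"}, ["run", "think"])
print(len(lexicon), num_stop)
print(is_stop_word_id(lexicon.index("the"), num_stop))
print(is_stop_word_id(lexicon.index("run"), num_stop))
```

Note that converting a set to a list gives no guaranteed order, so the exact ID of each stop word can differ between runs; only the fact that they fill the first block of IDs is stable.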

#### I will now save this data to a Protocol Buffers file, to keep it safe and to use it later

In [13]:
proto_wordlist = lexproto.Lexicon()
proto_wordlist.wordlist.extend(all_words)


fileData = proto_wordlist.SerializeToString()
print(fileData)
with open("lexicon.pb", "wb") as file:
    file.write(fileData)



In [26]:
proto_wordlist.wordlist[16]

'didn'
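
To use the saved lexicon later, the file can be read back with the message's `ParseFromString` method. The sketch below assumes the message class is passed in as a parameter and exposes a repeated `wordlist` field, as the `Lexicon` message appears to; `load_wordlist` is a hypothetical helper name.

```python
def load_wordlist(path, message_cls):
    """Read a serialized message from `path` and return its `wordlist` as a Python list."""
    message = message_cls()
    with open(path, "rb") as file:
        # ParseFromString fills the message from the raw bytes written earlier
        message.ParseFromString(file.read())
    return list(message.wordlist)
```

In this notebook the call would presumably look like `load_wordlist("lexicon.pb", lexproto.Lexicon)`, and the returned list should match `all_words` element for element.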