# <center> NATURAL LANGUAGE PROCESSING </center>
## <center> Word Sense Disambiguation </center>
### <center> K Nidhi Sharma, 2148041 </center>

### Introduction

Word sense disambiguation (WSD) is the task of computationally identifying which meaning of a word is intended in a given context. It is an active area of research in Natural Language Processing and Machine Learning. WSD addresses the ambiguity that arises because a word can have different meanings in different contexts.
For example, consider the following sentences:

    •	I need to pass the bar exam to become a lawyer.

    •	I am going to eat the energy bar every morning. 

The word "bar" in the first sentence refers to the legal profession, represented in India by the Bar Council of India, and in the second sentence it refers to a food item. This ambiguity is hard for a machine to detect and resolve: detecting that a word is ambiguous is the first problem, and resolving it to the correct sense is the second.


### About the data 

The ambiguous word chosen for the study is "bar". It is considered ambiguous because it has nineteen different meanings. The two contextual meanings considered for the study are the Bar Council of India and an energy bar. The implementation uses two data files and a user-supplied query.

    •	barcouncil.txt: this text file describes the Bar Council of India, a statutory body created by Parliament to regulate and represent the Indian bar. The file is plain text and 1010 words long.

    •	energybar.txt: this text file describes energy bars. An energy bar is a supplemental food that is a concentrated source of energy. The file is plain text and 1000 words long.



### Importing libraries

In [1]:
import nltk
import codecs
from nltk.tokenize import PunktSentenceTokenizer,sent_tokenize, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer

### Stop words removal and lemmatization

Stopword removal: stopwords are high-frequency words in a language that contribute little to the topic of a sentence, so the function defined here removes them.

Lemmatization: the same function also lemmatizes each remaining token. Lemmatization reduces a word to its dictionary base form (lemma), which is itself a morphologically valid word.


In [2]:
def simpleFilter(sentence):
    """Tokenize a sentence, drop English stopwords, and lemmatize the rest."""

    filtered_sent = []  # cleaned tokens

    lemmatizer = WordNetLemmatizer()

    stop_words = set(stopwords.words("english"))

    words = word_tokenize(sentence)  # split the sentence into tokens

    for w in words:  # visit every token

        if w not in stop_words:  # keep only tokens that are not stopwords

            filtered_sent.append(lemmatizer.lemmatize(w))  # store the lemma

    return filtered_sent  # list of cleaned tokens

### Similarity check

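Similarity is computed between the tokens of the query and the tokens of each reference text using WordNet synsets. The score is the Wu-Palmer similarity, which is based on the depth of the two senses in the WordNet hierarchy and of their closest common ancestor, and is returned on a scale of 0 to 1. Only the first noun sense (`.n.01`) of each token is compared, and a token without such a sense scores 0. A large volume of data is required for the similarity check to give accurate results.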
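The Wu-Palmer score is commonly written as 2 × depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest ancestor shared by senses a and b. The sketch below computes that quantity on a tiny hand-made taxonomy so the arithmetic is visible. The taxonomy and the names `PARENT`, `path_to_root`, `depth` and `wu_palmer` are illustrative assumptions, not WordNet data or NLTK functions, and depth is counted with the root at depth 1.

```python
# Toy hypernym tree: child -> parent. Hypothetical data, not taken from WordNet.
PARENT = {
    "entity": None,
    "food": "entity",
    "organization": "entity",
    "snack": "food",
    "energy_bar": "snack",
    "fruit": "food",
    "council": "organization",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a
    # is the deepest common ancestor (least common subsumer).
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("energy_bar", "fruit"))    # the two share "food"
print(wu_palmer("energy_bar", "council"))  # the two share only the root
```

Senses that meet deep in the tree score close to 1, while senses that meet only at the root score close to 0.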

In [3]:
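def simlilarityCheck(word1, word2):
    """Wu-Palmer similarity between the first noun senses of two words."""

    word1 = word1 + ".n.01"
    word2 = word2 + ".n.01"
    try:
        w1 = wordnet.synset(word1)
        w2 = wordnet.synset(word2)

        # wup_similarity may return None when no common ancestor is found;
        # fall back to 0 so callers can always add the result.
        return w1.wup_similarity(w2) or 0

    except Exception:
        # Typically raised when a token has no first noun sense in WordNet.
        return 0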

### Synonyms creator

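synonymsCreator is a simple helper that collects the lemma names of every WordNet synset of the given input word. The returned list can contain duplicates as well as the input word itself.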

In [4]:
def synonymsCreator(word):
    synonyms = []

    for syn in wordnet.synsets(word):
        for i in syn.lemmas():
            synonyms.append(i.name())

    return synonyms

### Filtered Sentences

In [5]:
# Remove stopwords, stem and lemmatize each token, and append its synonyms.
# Returns the new tokenised list.
def filteredSentence(sentence):

    filtered_sent = []
    lemmatizer = WordNetLemmatizer()  # maps a word to its dictionary form
    ps = PorterStemmer()  # strips suffixes to obtain the word stem

    stop_words = set(stopwords.words("english"))
    words = word_tokenize(sentence)

    for w in words:
        if w not in stop_words:
            # Stem first, then lemmatize the stem.
            filtered_sent.append(lemmatizer.lemmatize(ps.stem(w)))
            # Add every WordNet synonym of the original token.
            for i in synonymsCreator(w):
                filtered_sent.append(i)
    return filtered_sent
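
The effect of appending synonyms can be seen on a toy example. The sketch below is an illustration only: the `SYNONYMS` table is a hand-written stand-in for `synonymsCreator`, and `expand` and `overlap` are hypothetical helper names. It mirrors the structure of `filteredSentence` with stemming omitted, and shows that two tokens that never match literally can still produce an exact match once both lists are expanded.

```python
# Hand-written stand-in for WordNet lookups (illustrative, not real WordNet output).
SYNONYMS = {
    "lawyer": ["lawyer", "attorney"],
    "advocate": ["advocate", "counsel", "attorney"],
}


def expand(tokens):
    """Return each token followed by its synonyms from the toy table."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


def overlap(query_tokens, corpus_tokens):
    """Count exact matches over every (query token, corpus token) pair."""
    return sum(1 for q in query_tokens for c in corpus_tokens if q == c)


query = ["lawyer"]
corpus = ["advocate"]

print(overlap(query, corpus))                  # literal tokens only
print(overlap(expand(query), expand(corpus)))  # after synonym expansion
```

Here the two words meet through the shared entry "attorney", which is the kind of indirect match the expanded token lists are meant to capture.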

### Disambiguation

Once all the functions are defined, the application is given two data files: first barcouncil.txt, which contains sentences about “bar” in the sense of the Bar Council of India, and second energybar.txt, which contains sentences about the energy bar.

sent1 stores the lower-cased contents of energybar.txt, sent2 stores the lower-cased contents of barcouncil.txt, and sent3 stores the user query. Next, the texts are filtered and compared using the functions explained above: the similarity scores and the exact-match counts are added up for each reference text, and the larger total decides whether the query refers to the bar council or to the energy bar. The code as written compares these raw totals without dividing by the number of comparisons; the two files are of nearly equal length (1000 and 1010 words), which limits the effect of this.


In [6]:
if __name__ == '__main__':

    # Load both reference texts once, lower-cased; `with` closes the files.
    with open("energybar.txt", "r", encoding="utf-8") as f:
        sent1 = f.read().lower()
    with open("barcouncil.txt", "r", encoding="utf-8") as f:
        sent2 = f.read().lower()

    # The reference texts never change, so filter them once outside the loop.
    simple_sent1 = simpleFilter(sent1)
    simple_sent2 = simpleFilter(sent2)
    filtered_sent1 = filteredSentence(sent1)
    filtered_sent2 = filteredSentence(sent2)

    while True:

        sent3 = input("Enter Query: ").lower()
        if sent3 == "end":  # stop before classifying the sentinel itself
            break

        # Part 1: accumulate WordNet similarity between query and corpus tokens.
        sent31_similarity = 0
        sent32_similarity = 0
        for i in simpleFilter(sent3):
            for j in simple_sent1:
                sent31_similarity += simlilarityCheck(i, j)
            for j in simple_sent2:
                sent32_similarity += simlilarityCheck(i, j)

        # Part 2: count exact matches between the synonym-expanded token lists.
        sent1_count = 0
        sent2_count = 0
        for i in filteredSentence(sent3):
            for j in filtered_sent1:
                if i == j:
                    sent1_count += 1
            for j in filtered_sent2:
                if i == j:
                    sent2_count += 1

        # The corpus with the larger combined total wins.
        if (sent1_count + sent31_similarity) > (sent2_count + sent32_similarity):
            print("---- Energy bar ----")
        else:
            print("---- Bar council ----")

    print("\nTERMINATED")

Enter Query: energy bar is a supplement food I usually prefer while working out in the gym or while jogging because it has all the required nutrients that boosts ups the energy. Hence, i recommend these bars.
---- Energy bar ----
Enter Query: The bar council of India holds a very high position in the India law. It has the power to certify a person as an advocate if they pass the bar exam. This council is considered to a rigid council in India. Hence, to pass bar is an achievement. 
---- Bar council ----
Enter Query: end

TERMINATED
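
Because the loop sums similarity over every pair of query and corpus tokens, a longer reference text can accumulate a larger total simply by having more tokens. One way to guard against that is to divide each total by the number of pairs compared. The sketch below is a possible variant under that assumption, not the code that produced the output shown here; `average_similarity`, `pick_sense` and the toy `exact_match` scorer are hypothetical names.

```python
def average_similarity(query_tokens, corpus_tokens, similarity):
    """Mean pairwise similarity between query and corpus tokens (0 if no pairs)."""
    pairs = len(query_tokens) * len(corpus_tokens)
    if pairs == 0:
        return 0.0
    total = sum(similarity(q, c) for q in query_tokens for c in corpus_tokens)
    return total / pairs


def pick_sense(query_tokens, corpora, similarity):
    """Return the label of the corpus with the highest mean similarity."""
    scores = {
        label: average_similarity(query_tokens, tokens, similarity)
        for label, tokens in corpora.items()
    }
    return max(scores, key=scores.get)


def exact_match(a, b):
    """Toy scorer: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0


# Tiny made-up token lists standing in for the two filtered reference texts.
corpora = {
    "Energy bar": ["energy", "snack", "protein"],
    "Bar council": ["council", "law", "advocate", "india", "energy", "court"],
}
print(pick_sense(["energy", "snack"], corpora, exact_match))
```

In the notebook, `simlilarityCheck` could be passed as `similarity`, with the outputs of `simpleFilter` as the token lists.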


### Inference

To check that both senses are handled, one query related to each context was passed. Long queries were used so that each one carries enough context words to test the application.

    •	Query 1: energy bar is a supplement food I usually prefer while working out in the gym or while jogging because it has all the required nutrients that boosts ups the energy. Hence, i recommend these bars.  This query concerns the energy bar.

    •	Query 2: The bar council of India holds a very high position in the India law. It has the power to certify a person as an advocate if they pass the bar exam. This council is considered to a rigid council in India. Hence, to pass bar is an achievement. This query concerns the bar council.
    
   **Both queries were classified correctly: the application distinguished between the two senses of "bar".**
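
Two queries give only an anecdotal check. A more systematic test would run a labelled set of queries through the classifier and report the fraction classified correctly. The sketch below shows the shape of such a harness; the `accuracy` helper, the `keyword_classifier` stand-in and the example queries are all hypothetical and are not results from this study.

```python
def accuracy(classifier, labelled_queries):
    """Fraction of (query, expected_label) pairs the classifier gets right."""
    if not labelled_queries:
        raise ValueError("need at least one labelled query")
    correct = sum(1 for query, expected in labelled_queries
                  if classifier(query) == expected)
    return correct / len(labelled_queries)


def keyword_classifier(query):
    """Stand-in classifier so the harness can run without the corpora."""
    legal_words = {"council", "advocate", "lawyer", "exam", "law"}
    tokens = set(query.lower().split())
    return "Bar council" if tokens & legal_words else "Energy bar"


# Hypothetical test queries, written for illustration only.
labelled_queries = [
    ("I eat an energy bar after the gym", "Energy bar"),
    ("She passed the bar exam last year", "Bar council"),
    ("The bar council certifies every advocate", "Bar council"),
]

print(accuracy(keyword_classifier, labelled_queries))
```

To evaluate the notebook's own method, the classification logic from the main loop would need to be wrapped in a function that takes a query and returns a label, and passed in place of `keyword_classifier`.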


### Reference

https://towardsdatascience.com/a-simple-word-sense-disambiguation-application-3ca645c56357