# Semantics and Pragmatics, KIK-LG103

## Lab session 2, Part 1
### Word Sense Disambiguation with the Lesk algorithm

---

### Section 0: Review

In this section you can quickly review what we did in the last session. The section consists of example code and there are no exercises. Feel free to skip this if you are comfortable with strings, lists and sets. 


#### 0.1 Strings and lists

- Splitting strings to lists
- Iteration through lists
- Calculating the length of a list

In [None]:
# Assign the sentence to a variable
sentence_as_string = "this is just another string"
print(sentence_as_string)

# Split string on whitespace, the result is a list
sentence_as_list = sentence_as_string.split()
print(sentence_as_list)

# len-function returns the length of the list
amt_tokens = len(sentence_as_list)
print("The sentence has", amt_tokens, "tokens.")

# Iterate through the list and print every element (token)
for token in sentence_as_list:
    print(token)

#### 0.2 Sets

- Set as a collection of unique elements

In [None]:
sentence = "the cat saw the dog"
tokens_in_sentence = sentence.split()

# set-function takes in a list and returns a set
types_in_sentence = set(tokens_in_sentence)
print("The set-function results in a:", type(types_in_sentence), "data structure.")

# Pay attention to how the outputs of the prints differ
print(tokens_in_sentence)
print(types_in_sentence)

# len also works with sets
amt_types = len(types_in_sentence)
print("The string has", amt_types, "types.")


#### 0.3 Functions & Flow control

- function: input $\rightarrow$ process $\rightarrow$ output
- if-else statement
- booleans

In [None]:
def fun(word, list_of_words):
    # the 'x in list' construct returns True if x
    # is in list and False otherwise
    if word in list_of_words:
        print("'" + word + "' is in the list.")
    else:
        print("'" + word + "' is not in the list.")
        
list1 = ["bananas", "excalibur", "fun"]

fun("bananas", list1)
fun("pineapple", list1)

---
## Section 1: Objects and methods
---
In this section we offer a brief introduction to two new programming concepts, **objects** and **methods**. Objects and methods are fundamental in Python programming (or at least the style of programming we learn in this course), and in fact you have already worked with both of them. Let's start with a familiar example:

In [None]:
sentence = "This is a sentence" # This assignment creates a string object...

tokens = sentence.split()       # ...and here we use a split-method
                                # the object offers
print(sentence)
print(tokens)

Whenever we assign a string to some variable, we actually create a *string object* with the specified character sequence as its contents. So what is the difference between an object and the plain old string we have been talking about all along? An object is simply an instance of some pre-defined data structure. The definition of the structure is called a *class*. We won't have to define our own classes, but keep the term in mind. In other words, the class is a template for creating objects of that type: it specifies what all its instances have in common and leaves the details to the individual objects. For example, the definition of the string class says that the contents are a sequence of characters, but the specific sequence is decided only when we create a string object.
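As a small illustration of the difference between a class and its objects, the sketch below creates two string objects and inspects them with the built-in `type` and `isinstance` functions. The variable names are only illustrative.

```python
# Two different string objects created from the same class, str
greeting = "hello"
farewell = "goodbye"

# Both objects are instances of the same class...
print(type(greeting))
print(type(farewell))
print(isinstance(greeting, str))

# ...but each object holds its own character sequence
print(greeting == farewell)
```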

As we saw in the code cell above, `split()` is actually a method of the string class. A method is a special kind of function tied to a specific class, and it is called with dot-notation (`object.method()`). The distinction between methods and functions is subtle. You might think of a function as a machine: you feed your data to it and collect the result afterwards. With a method, you instead tell the object holding the data what you want from it, and the object gives you the result. In principle, nothing stops you from implementing your own `split` function that takes a string as an argument and returns a list, and using that instead of the `split` method.
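To make the contrast concrete, here is a minimal sketch of what a function version of `split` could look like. The name `split_on_whitespace` is hypothetical, and the code is only an illustration, not how Python implements the method internally.

```python
def split_on_whitespace(text):
    """Hypothetical function counterpart of the str.split() method."""
    tokens = []
    current = ""
    for character in text:
        if character.isspace():
            # A whitespace character ends the current token, if there is one
            if current != "":
                tokens.append(current)
                current = ""
        else:
            current = current + character
    # Add the last token if the string did not end in whitespace
    if current != "":
        tokens.append(current)
    return tokens

sentence = "This is a sentence"
print(split_on_whitespace(sentence))  # function: the data is passed in as an argument
print(sentence.split())               # method: called on the string object itself
```

Both calls should give the same list of tokens. The only difference is who is in charge: the function receives the string, whereas the method belongs to it.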

Fortunately, we have a great example of a task we can implement both as a function and a method. In the last lab session we implemented a function `ends_with(string1, string2)` that takes two arguments and returns a `boolean` indicating whether `string1` ends with `string2`. You can find the definition below. 

Implementing the `ends_with` function was a good illustration, but in fact unnecessary work. The Python string class already supplies a method `endswith()` that takes one string argument and returns a boolean indicating whether the string on which we are calling the method ends with the argument string.

In [None]:
def ends_with(string1, string2):
    string_end = string1[-len(string2):]
    return string_end == string2

# Here we use the function we defined above
print(ends_with("string", "ing"))

# Here we use the 'endswith' method Python strings supply
s = "string"
print(s.endswith("ing"))

In the rest of this lab we will use more complex classes and methods supplied by WordNet. Keep these ideas in mind as you work with the code.

---
## Section 2: Accessing word senses in WordNet
---
To use WordNet in Python we need to import it. The idiomatic way of doing this is to abbreviate 'wordnet' to 'wn' using the `as` keyword. This is done simply for convenience. After the import, we can access the contents of WordNet with dot-notation (`wn.some_function()`). Note that `wn` does not represent a single word or sense. For the purposes of this lab you can think of it as a toolbox of WordNet-related functions and classes (strictly speaking, NLTK implements it as a corpus reader object, so `wn.synsets()` is technically a method).

In [None]:
from nltk.corpus import wordnet as wn
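The `as` keyword can rename anything you import, not just WordNet. A minimal sketch using a standard-library module instead (the short name `p` and the file path are made up for illustration):

```python
# Import the standard-library 'path' module from the 'os' package
# under the shorter name 'p'
from os import path as p

# The functions of the imported module are reached with dot-notation
print(p.basename("folder/file.txt"))
```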

In this section we will only cover one small part of the WordNet module, the `Synset` class. This is enough for implementing the Lesk algorithm. In part 2 of this lab and in the home assignment we will go into more detail with WordNet.

So what is the **Synset** (synonym set) class? As we mentioned in the previous section, classes are pre-defined data structures. A `Synset` represents one specific sense or meaning, and contains the sense's definition, examples of the sense and all the words (lemmas) that can have the sense. Let's go through a few examples to see `Synset` objects in action.

First of all, to get all the senses (Synsets) of a word, we can use the function `wn.synsets(word)`. The function returns a list of `Synset` objects for the word.

In [None]:
synsets = wn.synsets("dog")
print(synsets)

Let's pick the first Synset from the list and see how we can access the definition and other things about it. The `name`-method returns the identifier of the sense. The method `definition` returns a string containing the definition of the sense, and the `examples`-method returns a list of example sentences illustrating its usage.

In [None]:
first_ss = wn.synsets("dog")[0]

print(first_ss.name())
print(first_ss.definition())
print(first_ss.examples())

---

**Ex 2.1.** Write a `for`-loop that iterates through the Synsets returned by the `wn.synsets`-function for a word of your choice. For each sense, print out the name and the definition of the sense.

---

In [None]:
# Iterate through word senses here

---
## Section 3: The Lesk algorithm
---
We now have all the necessary building blocks for the actual algorithm. Here is a pseudocode/plain text representation for running Simplified Lesk on a single word:

```
target = target word for disambiguation                                  (1)
context = context words for disambiguation                                |

for all senses of target word                                            (2)
    words_definition = words from the definition of the sense            (3)
    words_examples = words from the example sentences of the sense        |
    
    overlap = calculate vocabulary overlap between                       (4)
              context and definition/examples                             |
    
    if overlap is higher than before                                     (5)
        remember current sense as the best option                         |
```

---

**Ex 3.1** Let's implement the algorithm one step at a time. The numbers on the right hand side of the pseudocode mark the lines corresponding to each step of this exercise.

---

**Step 1.** In the first two lines of our pseudocode we initialize the target word we want to disambiguate as well as the context we will use. Your first task is to initialize these two things. Use the sentence "*time flies like an arrow*" and disambiguate the word "*time*". In the code cell below, initialize a variable `target` to contain the string "time". Also initialize a variable `context` that contains the context words as a Python set. You can find examples of how to do this in the review section.


**Step 2.** The next step in the algorithm is to go through all the senses of the target word. Remember, you can access all the senses (synsets) of the word with `wn.synsets(word)`. The function returns a list containing all the synset objects. The `for`-loop for iterating through the senses is already given. To make sure you got this right, you can print out the synset objects. The output should look something like this:

    Synset('time.n.01')
    Synset('time.n.02')
        .  .  .
            
**Step 3.** You now need to gather the words in the sense definition and in the examples of the sense, so that we can compare them to the context words. Recall the two methods of the Synset object we saw before, `definition()` and `examples()`. We have already extracted the words from the examples for you, so only the definition is left. The code also combines the two sets of words using the `union` method offered by the set class.

**Step 4.** In the fourth step you need to calculate the overlap between the context and the definition/examples. The set class offers another convenient method, `a.intersection(b)`, that returns the set of elements shared by the sets `a` and `b`. Once you have the set of common words, you can count them with the familiar `len`-function.
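The two set methods used in Steps 3 and 4 can be tried out on small hand-made sets first. The sketch below uses made-up words rather than WordNet data, so all variable names and contents are illustrative assumptions.

```python
# Toy sets of words (made up for illustration, not taken from WordNet)
definition_words = {"a", "financial", "institution"}
example_words = {"he", "went", "to", "a", "bank"}
context_words = {"she", "works", "at", "a", "bank"}

# union: every word that occurs in at least one of the two sets
gloss_words = definition_words.union(example_words)

# intersection: only the words that occur in both sets
shared_words = context_words.intersection(gloss_words)

print(sorted(shared_words))
print(len(shared_words))
```

Neither method changes the original sets. Each returns a new set, which is why the result is assigned to a new variable.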

**Step 5.** In the final step you need to compare this overlap to the previous overlaps. For this we initialized the variables `best_sense` and `best_overlap` for you at the start of the code cell. If the current overlap is greater than `best_overlap`, assign the current sense to the `best_sense` variable and update the `best_overlap` variable to keep track of the best one so far. Use the supplied `if`-statement for this.
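The bookkeeping in Step 5 is a common pattern for finding the highest-scoring item in a loop. A minimal sketch with made-up scores follows. The names `scores`, `best_word` and `best_score` are illustrative and not part of the exercise.

```python
# Made-up scores for illustration
scores = {"apple": 2, "banana": 5, "cherry": 5, "date": 1}

best_word = None
best_score = -1

for word in scores:
    score = scores[word]
    # Only a strictly higher score replaces the current best,
    # so the first of several equally good items is kept
    if score > best_score:
        best_word = word
        best_score = score

print(best_word, best_score)
```

Starting `best_score` at -1 guarantees that the first item always becomes the initial best, even if its score is 0.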


You can now uncomment the `print`-statement on the final line of the code cell to check the correctness of your algorithm. The output of the cell should be:

    time time.v.05 adjust so that a force is applied and an action occurs at the desired time
    
Feel free to play around with different target words and sentences by changing the variables `target` and `sentence`. How well does the algorithm work?

In [None]:
# We need these variables in the fifth step.
# Don't worry about them until then.
best_sense = None
best_overlap = -1


# Step 1. TODO: Initialize target word and context here.
# 'context' should be a set containing the words in 'sentence'
target = ""
sentence = "time flies like an arrow"
context = set()


# Step 2. TODO: Get the list of senses using wn.synsets()
senses = []

# Iterate through all the senses using a for-loop
for sense in senses:
    # You can print out the 'sense' objects here to 
    # check correctness of step 2. You can comment this
    # out when you are finished with this step to declutter
    # the output
    print(sense)
    

    # Step 3. TODO: Retrieve the definition of the 
    # sense under consideration and turn it into a list
    definition_as_string = ""
    definition_as_list = []
    # Step 3. TODO: Represent the words as a set
    words_in_definition = set()
    
    # Here we give you the words in the examples
    words_in_examples = set(" ".join(sense.examples()).split())
    # And here we combine the two sets of words
    words_in_both = words_in_definition.union(words_in_examples)
    
    
    # Step 4. TODO: Use the 'intersection' method here to get the
    # common words in 'context' and 'words_in_both'. See how 
    # the 'union' method was used above. The intersection
    # works in a similar way.
    words_overlapping = set() 
    
    # Step 4. TODO: Use the 'len' function to calculate the number of
    # overlapping words and assign the number to the variable 'overlap'
    overlap = -1
    
    
    if overlap > best_overlap:
        # Step 5 TODO: Update these accordingly
        best_sense = None
        best_overlap = -1
    
    
# Uncomment the print-statement below when you are done with
# steps 1-5 to see if you implemented the algorithm correctly.

# print(target, best_sense.name(), best_sense.definition())

You can now move on to Part 2 of this lab.