# Instructions for WSD annotation

In this assignment, we'll ask you to perform annotation for a Word Sense Disambiguation task.  

## What is Word Sense Disambiguation (WSD)?
WSD is an NLP task in which a classifier must decide which sense of a "polysemous" word is intended in a given context. A polysemous word is a word with more than one sense. For example, consider the word "bank" in these two sentences:

A) The **bank** was out of money.

B) The river **bank** was flooded.

In sentence A, "bank" is a financial institution, while in sentence B "bank" means the sloping land alongside a river. Distinguishing between different word senses given a context is a challenging task for computers. The annotations that you provide in this project will allow us to train a model to perform better!
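To make the task concrete, here is a minimal sketch of a simplified Lesk-style heuristic, which picks the sense whose definition shares the most words with the sentence. It is only an illustration and not the model that will be trained on your annotations; the sense names, glosses, stopword list, and function names below are all made up for this example.

```python
# Toy sense inventory for "bank". The glosses are illustrative only and are
# not taken from the assignment data.
SENSES = {
    "financial institution": "an institution that accepts deposits and lends money",
    "river bank": "the sloping land alongside a river or stream",
}

# Small hand-picked stopword list (an assumption made for this sketch).
STOPWORDS = {"the", "a", "an", "of", "or", "and", "that", "was", "is"}


def content_words(text):
    """Lowercase the text, strip basic punctuation, and drop stopwords."""
    tokens = (token.strip(".,!?").lower() for token in text.split())
    return {token for token in tokens if token and token not in STOPWORDS}


def pick_sense(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda name: len(context & content_words(senses[name])))


print(pick_sense("The bank was out of money."))   # financial institution
print(pick_sense("The river bank was flooded."))  # river bank
```

A word-overlap heuristic like this fails as soon as the sentence and the definition share no vocabulary, which is one reason human annotations are needed.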

## What should I do?
In each round, you will first read some example sentences from a [Wikihow](https://www.wikihow.com/Main-Page) article. You will then be shown a list of possible definitions of a highlighted polysemous word and asked to choose the most appropriate definition among the different senses.

Here is an example: 

=============================================

#### Please read the example sentences from a Wikihow article:
 

Title:  How to Remove an Effect in Final Cut Pro

...Double-click directly on the clip to enlarge and display it in the Viewer **window.**...

...The Viewer **window** is an area in the top middle section of your project session that allows you to preview your edits.... 

...Select the "Audio" or "Video" button at the top of the Inspector **window.**... 

<br>
 
#### Which of the following describes the word "window"  best?

**0**: window: opening in a wall, door, roof or vehicle that allows the passage of light.

**1**: window: visual area containing some kind of user interface.

=============================================

Ideally, you should select 1 rather than 0, because the sentences refer to a window in a software interface.

## Some notes before you start
1. A CSV file `annotation_ID.csv` will be generated in your **Google Drive's root directory**. Please upload this file to Gradescope when you finish annotating.
2. If you want to **continue your previous work**, make sure `annotation_ID.csv` is in your Google Drive's root directory before running `annotation()`. The program will then ask whether you want to continue.
3. If you are not sure what to choose, you can click the **hyperlink** on each option to get more information.
4. If you think there are multiple correct answers, you can enter a comma-separated list (e.g. `0, 2, 3`).
5. **[IMPORTANT]** If your performance is poor, the program may stop you from further annotation.
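As an illustration of note 4, the sketch below shows one way a comma-separated answer could be turned into a list of option numbers and checked against the number of options on screen. The function name `parse_choices` and its `num_options` parameter are hypothetical and written only for this example; the notebook's own parsing code may behave differently.

```python
import re


def parse_choices(answer, num_options):
    """Return the selected option numbers, or None if the answer is not valid.

    `answer` is the raw text typed by the annotator, e.g. "0, 2, 3".
    `num_options` is the number of definitions shown (options 0..num_options-1).
    """
    # Split on commas, tolerating spaces around them, and drop empty pieces.
    parts = [part for part in re.split(r"\s*,\s*", answer.strip()) if part]
    if not parts or not all(part.isdecimal() for part in parts):
        return None
    choices = [int(part) for part in parts]
    if any(choice >= num_options for choice in choices):
        return None
    return choices


print(parse_choices("0, 2, 3", num_options=4))  # [0, 2, 3]
print(parse_choices("1", num_options=2))        # [1]
print(parse_choices("0, 7", num_options=4))     # None (7 is not on the list)
```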

# Input your information

Before you begin, we'll ask for two pieces of information.
1. We'll ask you to enter your ID number.
2. The Colab notebook will generate a link that lets you sign in to your Google Drive and copy a code that grants the notebook permission to save a file in your Google Drive folder. The notebook saves your annotation file to your Google Drive when you're done working, so that you can continue later and upload the file to Gradescope.

Please input your ID (numbers only).

In [None]:
print("Please input your ID number")
id = input()

Access to your Google Drive is required to read and save your annotation results (`annotation_ID.csv`) in your Drive's root directory.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Initialize the notebook
Please run this block! You don't have to expand it.

In [None]:
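# `id` shadows Python's built-in `id` function, so the name always exists;
# check that it holds the string entered above instead of catching NameError.
if not isinstance(id, str) or len(id) == 0:
    raise ValueError('please input id.')

if not id.isnumeric():
    raise ValueError('please input valid id(number ONLY).')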


import pickle
import random
import csv
import re
import os
from os import listdir
from os.path import isfile, join
from IPython.display import clear_output, HTML, display, Markdown
import datetime

def comma2list(text):
  # Drop whitespace after commas, then split: "0, 2, 3" -> ['0', '2', '3'].
  text = re.sub(r',\s+', ',', text)
  return text.split(',')
  
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

!wget https://osf.io/p26by/download --output-document=df_wikihow.csv
!wget https://osf.io/agtrn/download --output-document=nodes_desc.pickle
!wget https://osf.io/aumwh/download --output-document=word_nodes.pickle
!wget https://osf.io/jgakd/download --output-document=control.csv

dict_wikihow = {}
dict_wikihow['title'] = []
dict_wikihow['url'] = []
dict_wikihow['text'] = []
dict_wikihow['target'] = []

with open('df_wikihow.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for i_row, row in enumerate(reader):
        if i_row != 0:
            dict_wikihow['title'].append(row[0])
            dict_wikihow['url'].append(row[1])
            dict_wikihow['text'].append(row[2])
            dict_wikihow['target'].append(row[3])

with open('word_nodes.pickle', 'rb') as handle:
    word_nodes = pickle.load(handle)
    
with open('nodes_desc.pickle', 'rb') as handle:
    nodes_desc = pickle.load(handle)

control_word = []
control_qnode = []
control_url =[]

with open('control.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for i_row, row in enumerate(reader):
        if i_row != 0:
            control_url.append(row[0])
            control_qnode.append(row[1])
            control_word.append(row[2])

k = len(dict_wikihow['title'])

In [None]:
def annotation():
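  idx_list = list(range(k))
  controls = list(range(30))  # indices of the control questions (first 30 rows of control.csv)
  control_acc = []
  answered = 0
  errors = 0
  try:
      # Seed the shuffle with the ID so each annotator gets the same order in every session.
      random.Random(str(id)).shuffle(idx_list)
  except Exception:
      print('please input id')
      return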

  path = '/content/drive/My Drive/'
  onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
  if 'annotation_'+id+'.csv' not in onlyfiles:
      with open(path+'annotation_'+id+'.csv','a') as fd:
          writer = csv.writer(fd)
          writer.writerow(['word', 'qnode', 'url','id','start_time','end_time'])
          # print('generated file: annotation_'+id+'.csv')
  else:
      with open(path+'annotation_'+id+'.csv','r') as f:
          reader = csv.reader(f, delimiter=',')
          prev_progress = len(list(reader))-1
      condition = False
      while not condition:
          print('annotation_'+id+'.csv was found in the directory.')
          print('\x1b[1;36m'+str(prev_progress)+'\x1b[0m'+' questions have been answered.')
          print('Do you want to continue your previous progress?')
          print()
          print('\x1b[1;36m'+'0'+'\x1b[0m'+': continue')
          print('\x1b[1;36m'+'1'+'\x1b[0m'+': start from beginning')
          print('\x1b[1;36m'+'x'+'\x1b[0m'+': EXIT PROGRAM')
          print()
          user_input = input('Your answer:')
          if user_input == '0':
              answered = prev_progress
              idx_list = idx_list[answered:]
              condition = True
              clear_output(wait = True)
          elif user_input == '1':
              os.remove(path+'annotation_'+id+'.csv')
              with open(path+'annotation_'+id+'.csv','a') as fd:
                  writer = csv.writer(fd)
                  writer.writerow(['word', 'qnode', 'url','id','start_time','end_time'])
              condition = True
              clear_output(wait = True)
          elif user_input == 'x':
                print()
                clear_output(wait = True)
                print('Thank you for your participation.')
                print('you answered: ',answered, " questions")
                return
          else:
                clear_output(wait = True)
                print("invalid input, please try again.")
                print()

  
  while True:
    if answered%10 == 0:
        control_idx = random.choice(controls)
        control_url_idx = dict_wikihow['url'].index(control_url[control_idx])
        rand_title = dict_wikihow['title'][control_url_idx]
        rand_url = dict_wikihow['url'][control_url_idx]
        rand_text = dict_wikihow['text'][control_url_idx]
        rand_target = control_word[control_idx]
        control_indicator = True

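    else:
        if not idx_list:
            # Every question has been shown; stop instead of raising IndexError.
            print('All questions have been answered. Thank you for your participation.')
            return
        idx = idx_list[0]
        idx_list = idx_list[1:]
        rand_title = dict_wikihow['title'][idx]
        rand_url = dict_wikihow['url'][idx]
        rand_text = dict_wikihow['text'][idx]
        rand_target = dict_wikihow['target'][idx]
        control_indicator = False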
    
    target_list = rand_target.split(', ')
    for word in target_list:
        start_time = str(datetime.datetime.now())
        nodes = []
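        try:
            nodes = nodes + word_nodes[word]
        except Exception:
            # The word has no entry in word_nodes.
            pass
        # Reset for every word so a value from a previous word is never reused.
        singular = False
        try:
            # NOTE: `p` is not defined anywhere in this notebook. The call looks like
            # inflect's engine().singular_noun (an assumption); until `p` is defined,
            # the resulting NameError is swallowed and no singular form is looked up.
            if word[0].islower():
                singular = p.singular_noun(word.lower())
            if singular and singular != word:
                nodes = nodes + word_nodes[singular]
        except Exception:
            pass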
        nodes = list(set(nodes))
        descs = [(nodes_desc[node], node) for node in nodes]
        grid_size = 100
        if len(descs)>0:  
            descs_grid_list = [descs[i*grid_size:((i+1)*grid_size)] for i in range(int((len(descs)-1)/grid_size)+1)]
            sent_list = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', rand_text)
            sent_list = [sent for sent in sent_list if word in sent.lower()]
            solved = False
            grid_idx = 0
            while not solved:
                print('currently answered: ',answered, " questions")
                print()
                print('============================================================')
                print()
                print('Please read the example sentences from a Wikihow article:')
                print()
                print('Title: ',rand_title)
                print()
                for sentence in random.sample(sent_list,k=min(5,len(sent_list))):
                    sentence = sentence.split()
                    sentence = ['\x1b[1;31m'+w+'\x1b[0m' if word in w.lower() else w for w in sentence]
                    sentence = ' '.join(sentence)
                    print('...'+sentence+'...')
                print()
                print('------------------------------------------------------------')
                print()
                print('Which of the following describes the word','\x1b[1;31m'+word+'\x1b[0m',' best?\n(If more than one choice applies, you can write a comma-separated list of numbers.)')
                print()
                descs_grid = descs_grid_list[grid_idx]
                if grid_idx>0:
                    print('\x1b[1;36m'+"p"+'\x1b[0m'+": ** PREVIOUS PAGE **")
                for idx_desc,desc in enumerate(descs_grid):
                    if str(desc[1])[0] == 'Q':
                        display(Markdown("{num}: [{word}]({url}): {desc}".format(num = str(idx_desc),word = word, url = 'https://www.wikidata.org/wiki/'+str(desc[1]),desc = desc[0])))
                    else:
                        display(Markdown("{num}: [{word}]({url}): {desc}".format(num = str(idx_desc),word = word, url = 'https://www.wikidata.org/wiki/Property:'+str(desc[1]),desc = desc[0])))
                    # print('\x1b[1;36m'+str(idx_desc)+'\x1b[0m'+":",word+': '+desc[0])
                print('\x1b[1;36m'+str(idx_desc+1)+'\x1b[0m'+':','NO ANSWER')
                if grid_idx<len(descs_grid_list)-1:
                    print('\x1b[1;36m'+'n'+'\x1b[0m'+': ** NEXT PAGE **')
                print()
                print('\x1b[1;36m'+'x'+'\x1b[0m'+': EXIT PROGRAM')
                print()
                print()
                print()
                user_input = input('Your answer:')
                if user_input == 'p':
                    grid_idx = grid_idx -1
                    clear_output(wait = True)
                    continue
                elif user_input == 'n':
                    grid_idx = grid_idx + 1
                    clear_output(wait = True)
                    continue
                elif user_input == 'x':
                    print()
                    clear_output(wait = True)
                    print('Thank you for your participation.')
                    print('you answered: ',answered, " questions")
                    return
                elif sum([x not in [str(num) for num in range(idx_desc+1)] for x in comma2list(user_input)])==0:
                    # picked an answer/ list of answers
                    solved = True
                    clear_output(wait = True)
                    print('answer recorded!')
                    answered += 1
                    with open(path+'annotation_'+id+'.csv','a') as fd:
                        writer = csv.writer(fd)
                        answer_list_input = ','.join([descs_grid[int(x)][1] for x in comma2list(user_input)])
                        writer.writerow([word, answer_list_input, rand_url, id, start_time, str(datetime.datetime.now())])
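                    if control_indicator:
                        # Compare whole qnode IDs; iterating over the joined string
                        # would test single characters instead.
                        if any(qnode in control_qnode[control_idx] for qnode in answer_list_input.split(',')):
                            control_acc.append(1)
                        else:
                            control_acc.append(0)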
                elif user_input == str(idx_desc+1):
                    # no answer
                    solved = True
                    clear_output(wait = True)
                    print('answer recorded!')
                    answered += 1
                    with open(path+'annotation_'+id+'.csv','a') as fd:
                        writer = csv.writer(fd)
                        writer.writerow([word, None, rand_url, id, start_time, str(datetime.datetime.now())])
                    if control_indicator:
                        control_acc.append(0)
                else:
                    clear_output(wait = True)
                    print("invalid input, please try again.")
                    


# Annotate

Please run the code below to start annotation. 

If you don't want to do all of your annotations at once, you can choose 'x' to exit the program. This will save your progress. You can come back to the Colab notebook later and run it again to resume.
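Resuming works because every answer is appended to the CSV file as soon as it is recorded, so the number of rows after the header tells the program how far you got. The sketch below illustrates that idea on a temporary file. The helper name `count_answered`, the demo file name, and the placeholder row values are assumptions made for this example; only the column names come from the notebook.

```python
import csv
import os
import tempfile

# Column names written by the notebook when it creates the annotation file.
HEADER = ["word", "qnode", "url", "id", "start_time", "end_time"]


def count_answered(csv_path):
    """Count saved answers: every row after the header is one answered question."""
    if not os.path.isfile(csv_path):
        return 0
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    return max(len(rows) - 1, 0)


# Demo on a throwaway file rather than a real Google Drive path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "annotation_demo.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        # Placeholder values, not real annotation data.
        writer.writerow(["window", "Q_EXAMPLE", "https://www.wikihow.com/Main-Page", "0", "t0", "t1"])
    demo_count = count_answered(demo_path)
    missing_count = count_answered(os.path.join(tmp, "missing.csv"))

print(demo_count, missing_count)  # 1 0
```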

In [None]:
annotation()