# COLX 563 Lab Assignment 4: Slot filling
## Assignment Objectives

In this lab, you will build an end-to-end system for basic (binary) intent recognition and slot filling in the context of a dialogue system. This is a team assignment: you will work in teams of 3, and you have nearly complete freedom with regard to your solution, with a few restrictions mentioned below.

## Getting Started

Add imports below.

In [1]:
#provided code
from sklearn.metrics import accuracy_score

For this lab, you'll be working with version 2.2 of the MultiWOZ dataset of goal-oriented dialogues. You can look at the full corpus [here](https://github.com/budzianowski/multiwoz/tree/master/data/MultiWOZ_2.2). The original corpus has an impressively detailed annotation involving multiple turns and multiple goals. We have simplified it to just the initiating request (first turn), with two possible intents and the corresponding slots for those intents. Download the data from [github](https://github.ubc.ca/MDS-CL-2023-24/COLX_563_adv-semantics_students/tree/master/data/lab4/Multiwoz.zip), unzip it into a directory outside of your lab repo, and change the path below.

In [2]:
#provided code
# woz_directory ="/Users/nicol/COLX_563/Labs/Data/Lab4/"
woz_directory ="./data/MultiWOZ/"

## Tidy Submission
rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the instructions

## Inspecting the data

Let's look at corresponding pairs of utterances and answers from the training portion of our corpus:

In [3]:
# Print the first 20 utterance/answer pairs from the training set
with open(woz_directory + "WOZ_train_utt.txt") as f1, open(woz_directory + "WOZ_train_ans.txt") as f2:
    for _ in range(20):
        print("Question: ", f1.readline().strip())
        print("Answer: ", f2.readline().strip())
        print("------")

Question:  Guten Tag, I am staying overnight in Cambridge and need a place to sleep. I need free parking and internet.
Answer:  find_hotel|hotel-area=centre|hotel-internet=yes|hotel-parking=yes
------
Question:  Hi there! Can you give me some info on Cityroomz?
Answer:  find_hotel|hotel-name=cityroomz
------
Question:  I am looking for a hotel named alyesbray lodge guest house.
Answer:  find_hotel|hotel-name=alyesbray lodge guest house
------
Question:  I am looking for a restaurant. I would like something cheap that has Chinese food.
Answer:  find_restaurant|restaurant-food=chinese|restaurant-pricerange=cheap
------
Question:  I'm looking for an expensive restaurant in the centre if you could help me.
Answer:  find_restaurant|restaurant-area=centre|restaurant-pricerange=expensive
------
Question:  I'm looking for a places to go and see during my upcoming trip to Cambridge.
Answer:  find_hotel
------
Question:  Yeah, could you recommend a good gastropub?
Answer:  find_restaurant|restau

In [12]:
import pandas as pd

dev4_df = pd.read_csv("./data/dev_step4.csv",index_col=0)
dev_predictions = list(dev4_df['pred_answer_raw'])
# dev_predictions

Each utterance consists of a request for information about either hotels or restaurants. The answer starts with the intent (either find_restaurant or find_hotel) and then lists the slots that have been filled in based on the utterance. Your goal is to generate this string of intent and slots based purely on the utterance. A few things to note:

* Not all slots are filled in, and sometimes there are no slots filled in at all (but there is always an intent).
* There is a fixed number of slots for each intent, and when they are filled in, they always appear in a particular order.
* The slot values sometimes, but not always, correspond to what appears in the utterance. For example, a mention of wanting wifi in the request becomes hotel-internet=yes.
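To make the answer format concrete, here is a minimal sketch of splitting an answer string into its intent and slots and then rebuilding it. The helper names `parse_answer` and `format_answer` are hypothetical, and the alphabetical slot ordering in `format_answer` is only an assumption that happens to be consistent with the examples printed above; you should verify the real ordering against the training set.

```python
def parse_answer(answer):
    """Split an answer string into (intent, {slot_name: slot_value})."""
    intent, *slot_parts = answer.strip().split("|")
    slots = {}
    for part in slot_parts:
        # Split on the first "=" only, in case a value itself contains "=".
        name, _, value = part.partition("=")
        slots[name] = value
    return intent, slots


def format_answer(intent, slots):
    """Rebuild the answer string.

    ASSUMPTION: slots are emitted in alphabetical order of slot name.
    Check this against the training answers before relying on it.
    """
    parts = [intent] + [f"{name}={slots[name]}" for name in sorted(slots)]
    return "|".join(parts)


example = "find_hotel|hotel-area=centre|hotel-internet=yes|hotel-parking=yes"
intent, slots = parse_answer(example)
print(intent, slots)
print(format_answer(intent, slots) == example)
```

An answer with no slots, such as `find_hotel`, parses to the intent plus an empty dictionary.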

We will be evaluating based on exact duplication of the entire output string, so before you start coding a solution, you should look carefully at examples in the training set and make sure you understand all the different components of the output and how they relate to the input utterance. In particular, you should identify the various constituent parts of the task, and judge which are likely to be easy and which are likely to be more difficult.
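Because a single wrong slot makes the whole string wrong, it can help to break the exact-match score into its parts when judging which components are easy or hard. The sketch below is one possible diagnostic, not part of the official evaluation: `score_breakdown` is a hypothetical helper, and the predictions in the demo are made up for illustration.

```python
def score_breakdown(predictions, gold):
    """Return exact-match accuracy, intent accuracy, and slot-pair P/R/F1."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    exact = intent_hits = 0
    true_pos = n_pred = n_gold = 0
    for pred, ref in zip(predictions, gold):
        exact += pred == ref
        pred_intent, *pred_slots = pred.split("|")
        ref_intent, *ref_slots = ref.split("|")
        intent_hits += pred_intent == ref_intent
        # Compare "name=value" pairs as sets, ignoring their order.
        pred_set, ref_set = set(pred_slots), set(ref_slots)
        true_pos += len(pred_set & ref_set)
        n_pred += len(pred_set)
        n_gold += len(ref_set)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "exact_match": exact / len(gold),
        "intent_accuracy": intent_hits / len(gold),
        "slot_precision": precision,
        "slot_recall": recall,
        "slot_f1": f1,
    }


# Gold answers taken from the training examples; predictions are invented.
gold = [
    "find_hotel|hotel-name=cityroomz",
    "find_restaurant|restaurant-food=chinese|restaurant-pricerange=cheap",
]
predictions = [
    "find_hotel|hotel-name=cityroomz",
    "find_restaurant|restaurant-food=chinese",  # misses the price range
]
print(score_breakdown(predictions, gold))
```

In this toy case the intent is always right, but one missing slot halves the exact-match score, which is the kind of gap this breakdown is meant to expose.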

In [13]:
# code above should create dev_predictions
with open(woz_directory + "WOZ_dev_ans.txt") as f:
    dev_correct = [answer.strip() for answer in f.readlines()]
    
assert accuracy_score(dev_predictions, dev_correct) > 0.7
print(accuracy_score(dev_predictions, dev_correct))

0.7336561743341404


## Solution
rubric={accuracy:10,quality:5,efficiency:3, raw:2}

You will build a system that, when provided with an utterance, predicts the appropriate intent and slots in the format used in the provided answers. This is an open-ended problem and you may solve it however you like, with the following restrictions:

* Your solution should include at least one of the token-level prediction models used in Labs 1-3 of this course, i.e. you should make use of a CRF, an LSTM, or a BERT model. You may use multiple models. You can also use ChatGPT if you want...
* You may use basic NLP tools (tokenizer, POS, parser) and unsupervised resources such as word embeddings, but you should NOT use an existing NER system, or any additional labeled data for this task.
* Your solution should be appropriately decomposed into parts, and documented. This is a complex enough problem that you should have several functions. You may wrap things up into a single class if you like, but you don't have to.
* Use the provided assert to test `dev_predictions`, the output of your complete model on the dev set. You will need to pass the assert to get full accuracy points.
* Though you may use dev *accuracy* to guide the development of your model, you should not look at either utterances or answers for the dev (or the test) set when developing your model. Limit your inspection of the data (e.g. for the purposes of error analysis) to the training set.
* You will get 2 bonus marks if you try a model or feature not implemented by any other team.
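One way to feed a token-level model is to turn each training answer into BIO tags over the utterance tokens. The following is a minimal sketch under stated assumptions, not a prescribed approach: `tokenize` and `bio_tags` are hypothetical helpers, the regex tokenizer is a simple stand-in for whatever tokenizer you choose, and only slot values that appear verbatim (case-insensitively) in the utterance get tagged.

```python
import re


def tokenize(text):
    """Lowercase and split into word and punctuation tokens (a simple stand-in)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def bio_tags(utterance, slots):
    """Tag tokens with B-/I- labels for slot values found verbatim in the utterance."""
    tokens = tokenize(utterance)
    tags = ["O"] * len(tokens)
    for name, value in slots.items():
        value_tokens = tokenize(value)
        n = len(value_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_free = all(tag == "O" for tag in tags[start:start + n])
            if tokens[start:start + n] == value_tokens and span_free:
                tags[start] = f"B-{name}"
                tags[start + 1:start + n] = [f"I-{name}"] * (n - 1)
                break  # tag only the first untagged match
    return tokens, tags


tokens, tags = bio_tags(
    "I am looking for a hotel named alyesbray lodge guest house.",
    {"hotel-name": "alyesbray lodge guest house"},
)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Slots whose values are not stated literally, such as hotel-internet=yes for a request mentioning internet, produce no span here, so they would need a different treatment (for example a separate classifier or rules).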

Other things to consider:

* You may want to build "standard" (non-sequential) ML classifiers for some aspects of this problem, but you don't have to!
* You may want to use appropriate lexicons. You can build them yourself, or find some.
* Rather than using statistical classifiers, you may want to use rule-based methods to solve some of the problems you're facing.
* You should probably do regular error analysis. Some kind of cross-validation in the training set is a good approach for this, or you can create another (inspectable) internal dev set by splitting up the training set.
* If you're looking for just a little bit more performance, don't forget to tune your hyperparameters!
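As an illustration of the internal dev set idea, here is a minimal sketch that holds out part of the training data for error analysis. The function name `split_train`, the 10% default fraction, and the seed are all arbitrary assumptions; the toy strings stand in for lines read from the training files.

```python
import random


def split_train(utterances, answers, dev_fraction=0.1, seed=563):
    """Shuffle paired examples and split them into (train, internal_dev) lists."""
    if len(utterances) != len(answers):
        raise ValueError("utterances and answers must be the same length")
    pairs = list(zip(utterances, answers))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(pairs)
    n_dev = max(1, int(len(pairs) * dev_fraction))
    return pairs[n_dev:], pairs[:n_dev]


# Toy placeholders; in practice these would be the training utterances and answers.
utterances = [f"utterance {i}" for i in range(20)]
answers = [f"answer {i}" for i in range(20)]
train_pairs, internal_dev_pairs = split_train(utterances, answers, dev_fraction=0.2)
print(len(train_pairs), len(internal_dev_pairs))
```

Since this held-out portion comes from the training set, you are free to inspect its utterances and answers, unlike the real dev and test sets.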

## Report
rubric={raw:2,reasoning:3,writing:1}

Describe your system, and discuss your thinking about particular choices and any experiments you tried. Please talk about things you tried that didn't work, or things you thought of doing but didn't. Finally, discuss how each group member contributed to the project. As usual, there is an expectation that every group member will have made some significant contribution to the project.

This is a chance to practice scientific writing.  Provide an introduction, a description of your methods (ones that worked and didn't), and an overview of your results and error analysis.  The more practice you get at this kind of writing, the easier it becomes.

## Submit to Kaggle 
rubric={accuracy:2}

Run your system over the test data, and submit the result (in the same format as the train/dev answers) to the Kaggle competition. The competition is hosted [here](https://www.kaggle.com/t/206f9684bc5d4142a359065ff5d53365). To get full points, you need to beat the public baseline.  Provide your team name in your report.


## Exercise: Kaggle competition (Optional)
rubric={raw:2}

As a team, compete to get the best result in the task. Since there are only 8 teams, the distribution of marks is a bit different than usual: only the top 3 groups will get bonus points. As usual, the rankings will be based on the score on the private leaderboard (remember: doing well in an in-class Kaggle competition is something that can go on a resume...):


- 1st place: 2
- 2nd place: 1
- 3rd place: 0.5

To put things in perspective, 3 years ago, the best results were about 84%. 2 years ago, the best team got to 86.5%. Last year, the best was only 86.3%. Can you beat them? By how much?