# Week 09: Word Sense Disambiguation

This week, we introduce a hot topic in Natural Language Processing: *Word Sense Disambiguation (WSD)*.  
Many words in natural languages are ambiguous. For example, the word *[party](https://dictionary.cambridge.org/dictionary/english/party)* can refer to 1) a social gathering (派對), 2) a political organization (政黨), or 3) an entity in law (當事人；⋯⋯方).  
As humans, we can tell these meanings apart easily, but can a machine do the same? This is what WSD aims for.  

## Introduction

### tl;dr
You have to
1. preprocess the data,
2. (stage 1) generate a small training dataset from the given collocation seeds,
3. (stage 1) train a weak model on that small dataset,
4. (stage 2) use the weak model to generate more labeled data,
5. (stage 2) train your final model, and
6. evaluate your model on the testing data (requirement: accuracy > 0.7).

### Concept

In [Lesk's assumption](https://en.wikipedia.org/wiki/Lesk_algorithm), each word has only one sense when it appears in the same collocation.  
For example, if *party* shows up with the word *court* (法庭), most likely the sense of this *party* is the 3rd one: an entity in law (當事人；⋯⋯方).  
However, we are not implementing Lesk's algorithm this week. Instead, we will combine his assumption with [Yarowsky's](https://en.wikipedia.org/wiki/Yarowsky_algorithm) *bootstrap technique* .  

You are given some pre-defined collocations of the word *party*, also called *seeds*, along with the sense each collocation belongs to.  
With the given seeds, you can generate a small set of labeled data by rule. With this small set, we can then train a small model with limited accuracy.  
This first classifier might not perform well on the whole dataset, but it is already good enough to label more data with reasonable reliability. With the newly labeled training data, we can train another, more robust sense-classification model, which is the one aimed at the real WSD task.  
This process of training on a small dataset, generating more labeled data, and then improving the model is called *[bootstrapping](https://www.mastersindatascience.org/learning/introduction-to-machine-learning-algorithms/bootstrapping/)*.  
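
The two stages can be sketched on toy data before touching the real dataset. The snippet below is only an illustration under loud assumptions: the sentences, the seed words, the stop list, the word-counting "model", and the `0.6` threshold are all made up here, and the real assignment uses embeddings and a neural classifier instead.

```python
from collections import Counter, defaultdict

# Toy data; sentences, seeds, stop list and threshold are invented for illustration.
sentences = [
    "the court asked each party to sign",
    "the party won the election by forming a coalition",
    "each party must sign the contract",
    "the party won seats in the election",
]
seeds = {2: ["coalition"], 3: ["court"]}
stop = {"the", "party", "a", "to", "in", "by"}

def label_by_seed(tokens):
    """Return the first sense whose seed word occurs in the tokens, or None."""
    for sense, words in seeds.items():
        if any(w in tokens for w in words):
            return sense
    return None

def train(labeled):
    """A deliberately weak 'model': count how often each word co-occurs with a sense."""
    counts = defaultdict(Counter)
    for tokens, sense in labeled:
        counts[sense].update(tokens)
    return counts

def predict(model, tokens):
    """Return (best sense, share of the total score given to that sense)."""
    scores = {s: sum(c[t] for t in tokens) for s, c in model.items()}
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, (scores[best] / total if total else 0.0)

tokenized = [[t for t in s.split() if t not in stop] for s in sentences]

# Stage 1: label by rule, then train the weak model on the small labeled set.
stage1 = [(t, label_by_seed(t)) for t in tokenized if label_by_seed(t) is not None]
weak_model = train(stage1)

# Stage 2: let the weak model label everything, keep only confident predictions,
# and train the final model on the enlarged set.
stage2 = []
for t in tokenized:
    sense, confidence = predict(weak_model, t)
    if confidence > 0.6:
        stage2.append((t, sense))
final_model = train(stage2)

print(len(stage1), len(stage2))
```

Only two of the four toy sentences contain a seed word, yet the weak model can label the remaining two because they share words such as *sign* or *election* with the seed-labeled ones.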


<a name="I.-Data-preparation"></a>
## I. Data preparation

First things first. To make natural language understandable to machines, we have to transform sentences into embeddings.  
So here are four things to do:

1. load data
2. preprocess the sentences
3. transform sentences into embeddings
4. pad the sentences to the same length

To keep the task simple and easy to understand, we will only work on a single word, *party*.  
Three senses of *party* are defined below with their corresponding `sense id`s. 

In [1]:
SENSE = {
    1: 'a social event at which a group of people meet to talk, eat, drink, dance, etc.', # 派對
    2: 'an organization of people with particular political beliefs', # 政黨
    3: 'a single entity which can be identified as one for the purposes of the law' # （法庭）當事人；⋯⋯方
}

### 1. Load data

The data is a set of sentences containing the word *party*, all extracted from Wikipedia. Each sentence is guaranteed to be unique. 

In [2]:
import os

In [3]:
with open(os.path.join('data', 'party.train.txt'), 'r') as f:
    data = f.read().strip().split('\n')

# this dict maps sentence_id to the sentence itself
pure_data = {sent_id: text for sent_id, text in [line.split('\t', 1) for line in data]}

Let's see what the data looks like.

In [4]:
for sent_id, sentence in pure_data.items():
    if int(sent_id) > 1003: break
    print(f'{sent_id}: {sentence}')

1001: A naked party, also known as nude party, is a party where the participants are required to be nude.
1002: The town center bears the hallmarks of a typical migration-accepting Turkish rural town, with traditional structures coexisting with a collection of concrete apartment blocks providing public housing, as well as amenities such as basic shopping and fast-food restaurants, and essential infrastructure but little in the way of culture except for cinemas and large rooms hired out for wedding parties.
1003: Elections Alberta oversees the creation of political parties and riding associations, compiles election statistics on ridings, and collects financial statements from party candidates and riding associations.


In [5]:
# a look up table from sentence to id
id_mapper = {v: k for k, v in pure_data.items()}
# a table for id to embedding; we will deal with this later
processed_data = {}

We define two samples here to check the preprocessing as we code.

In [6]:
samples = [
    'Adnan Al-Hakim (died May 26, 1990) was the leader of the Najjadeh Party, an Arab nationalist party in Lebanon, for more than 30 years.',
    'A block party or street party is a party in which many members of a single community congregate, either to observe an event of some importance or simply for mutual enjoyment.'
]

### 2. Preprocess the sentences 

<font color="red">[TODO]</font> Define your preprocessing function to transform a sentence into tokens here.  

\-

<small>
*hint: If you can't get a high accuracy in the final result, you may want to come back and modify your preprocessing here.<br/>
*hint: Think about what words are useful and what are useless when distinguishing a sense.
</small>

In [7]:
import re
def preprocess(text):
    # [ TODO ]
    token = re.findall(r"[\w]+", text.lower())
    
    return token
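
One possible direction for the hints above is to drop words that are unlikely to help tell senses apart. The sketch below is an assumption rather than the reference solution: the stop list is a small hypothetical one, and removing the target word itself is a design choice you may or may not want.

```python
import re

# Hypothetical stop list for illustration; extend or replace it as needed.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "or", "and", "to", "for"}
# The target word appears in every sentence, so it likely carries little signal.
TARGET = {"party", "parties"}

def preprocess_filtered(text):
    """Lowercase, tokenize, and drop stop words, the target word and pure numbers."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens
            if t not in STOPWORDS and t not in TARGET and not t.isdigit()]

print(preprocess_filtered("A party is a social gathering."))
# ['social', 'gathering']
```

Fewer uninformative tokens also means shorter sequences after padding, which may make the classifier easier to train.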

In [8]:
sent_tokens = [preprocess(sent) for sent in samples]
sent_tokens[0][:5]

['adnan', 'al', 'hakim', 'died', 'may']

### 3. Transform sentences into embeddings

For simplicity, we still use word2vec here, so you can copy and paste your code from the previous week.  
This is not required; you can skip word2vec if you want to train an embedding model along with the classifier.  

<small>\*Download w2v: [Google Code Archive](https://code.google.com/archive/p/word2vec/#Pretrained-word-and-phrase-vectors)</small>

In [9]:
import numpy as np
from gensim.models import KeyedVectors

In [10]:
w2v = KeyedVectors.load_word2vec_format(
        os.path.join('data', 'GoogleNews-vectors-negative300.bin'), 
        binary = True
        )

In [11]:
def to_embedding(tokens):
    # [ TODO ]
    # look up each token in word2vec, skipping out-of-vocabulary tokens
    result = []
    for t in tokens:
        if t in w2v:
            result.append(w2v[t])
    
    return result

In [12]:
embeddings = [to_embedding(tokens) for tokens in sent_tokens]
embeddings[0][:5]

[array([-3.93066406e-02,  1.86523438e-01,  3.44238281e-02,  4.27246094e-02,
         1.34765625e-01,  2.26562500e-01, -1.61132812e-01, -1.58203125e-01,
        -3.19824219e-02,  4.12109375e-01, -2.96875000e-01, -2.45117188e-01,
        -1.44531250e-01,  1.70898438e-01, -1.95312500e-01,  8.39843750e-02,
        -9.76562500e-02, -1.05957031e-01,  3.96484375e-01,  1.08886719e-01,
        -4.21875000e-01, -2.36328125e-01,  1.37695312e-01,  1.86523438e-01,
        -3.39355469e-02,  2.23388672e-02,  6.59179688e-02,  1.55273438e-01,
        -3.32031250e-02, -3.92578125e-01, -4.61425781e-02,  1.53320312e-01,
         4.06250000e-01,  8.49609375e-02, -3.57421875e-01,  3.32641602e-03,
         2.13623047e-02,  2.57812500e-01,  9.52148438e-02, -2.51953125e-01,
        -7.47070312e-02,  2.91015625e-01, -1.26953125e-01, -2.23388672e-02,
         5.56640625e-02, -7.47070312e-02, -1.03759766e-02, -1.38671875e-01,
        -3.16406250e-01,  5.12695312e-02, -6.29882812e-02,  3.20434570e-03,
         4.6

### 4. Pad the sentences to the same length

The input size of the model is fixed, but sentence lengths vary.  
An intuitive solution is to stuff dummy values into the arrays until they share the same size; this is called *padding*.  
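
If you would rather pad with plain numpy, the idea can be sketched as below. The helper name `pad_pre` and its `dim` parameter are hypothetical; the sketch pads with zero vectors at the front and, for sentences longer than the width, keeps only the last tokens.

```python
import numpy as np

def pad_pre(embeddings, width=None, dim=300):
    """Left-pad (and left-truncate) each sentence to `width` vectors of size `dim`."""
    if width is None:
        # no width given: use the longest sentence
        width = max(len(e) for e in embeddings)
    padded = np.zeros((len(embeddings), width, dim), dtype="float32")
    for i, emb in enumerate(embeddings):
        emb = np.asarray(emb, dtype="float32")[-width:]  # keep the last `width` tokens
        if len(emb) > 0:
            padded[i, width - len(emb):] = emb           # zeros stay at the front
    return padded

# Two fake "sentences" with 2 and 3 tokens, using 4-dimensional vectors.
demo = [np.ones((2, 4)), np.ones((3, 4))]
out = pad_pre(demo, dim=4)
print(out.shape)  # (2, 3, 4)
```

Padding at the front keeps the real tokens next to the end of the sequence, which is the part a recurrent layer reads last.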

<small>*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences">tf.keras.preprocessing.sequence.pad_sequences</a></small>

In [13]:
import numpy as np
import tensorflow as tf
# tensorflow's padding helper; you can also pad with plain numpy if you prefer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
seed = 1233
tf.random.set_seed(seed)
np.random.seed(seed)

In [15]:
def add_padding(embeddings, padding_width=None):
    # [ TODO ]
    # Pad all embeddings to padding_width, or detect it automatically when it's not given
    # ps. tensorflow's `pad_sequences` can detect that for you

    return pad_sequences(embeddings, maxlen=padding_width, dtype="float32")
    

In [16]:
emb_padded = add_padding(embeddings)
emb_padded[0]

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.06396484, -0.25585938, -0.08447266, ...,  0.02746582,
         0.06494141,  0.06201172],
       [-0.07666016, -0.10400391, -0.00175476, ..., -0.01965332,
        -0.03442383,  0.0007515 ],
       [-0.12695312,  0.20898438, -0.10644531, ...,  0.13476562,
         0.01879883, -0.1484375 ]], dtype=float32)

In [17]:
print(len(embeddings[0]), len(embeddings[1]))
print(emb_padded[0].shape, emb_padded[1].shape)

19 25
(25, 300) (25, 300)


You should see that the embedding of the shorter sentence is padded with zero vectors, and that both embeddings now have the same length.

In [18]:
# record the width for the future use.
PADDING_WIDTH = emb_padded[0].shape[0]
print(PADDING_WIDTH)

25


### 5. all-in-one

Define a function to set up the pipeline, and transform all sentences into embeddings!  

<small>\*Your embedding shape might not be the same as ours, because your preprocessing may differ. </small>

In [19]:
def process_text(sentences, padding=None):
    result = [preprocess(sentence) for sentence in sentences]
    result = [to_embedding(sentence) for sentence in result]
    result = add_padding(result, padding)
    
    return result

In [20]:
X = process_text(pure_data.values())

In [21]:
X[0] # should be an embedding with padding

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.12890625, -0.18261719,  0.10351562, ..., -0.07714844,
        -0.11572266, -0.02832031],
       [-0.22851562, -0.08837891,  0.12792969, ..., -0.21289062,
         0.18847656, -0.14550781],
       [ 0.21582031, -0.12207031,  0.09765625, ..., -0.06201172,
        -0.17089844,  0.02563477]], dtype=float32)

In [22]:
X.shape # should be (637, *, 300), * depends on your preprocessing

(637, 89, 300)

Let's use a dictionary to store all embeddings with their sentence_id.

In [23]:
processed_data = { 
    sent_id: embedding for sent_id, embedding in zip(pure_data, X) 
}

In [24]:
print(pure_data['1001'])
processed_data['1001']

A naked party, also known as nude party, is a party where the participants are required to be nude.


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.12890625, -0.18261719,  0.10351562, ..., -0.07714844,
        -0.11572266, -0.02832031],
       [-0.22851562, -0.08837891,  0.12792969, ..., -0.21289062,
         0.18847656, -0.14550781],
       [ 0.21582031, -0.12207031,  0.09765625, ..., -0.06201172,
        -0.17089844,  0.02563477]], dtype=float32)

## II. First stage

After preprocessing the training data, now we are going to train our first-stage model!  

According to the method described at the beginning, we can train a simple model on a smaller dataset, and this dataset can be generated by rule from seeds.  

### Steps

1. Prepare the training data
2. Encode labels
3. Split training and validation sets
4. Build classifier
5. Train

### 1. Prepare the training data

Given the seed collocations, you can add a sentence to the training data, with a label, if that sentence contains one of the collocations.  
For example, we can say <i>"A party is a **social** gathering."</i> should be labeled with the first sense, because it contains the keyword *social*. Hence, your training data will include this sentence with the label `1`.  

Don't worry about false positives for now.  
If the seeds are generally good enough, the model will learn to ignore the wrongly labeled data by itself. (Though yes, you can get better results if you deal with it beforehand.)
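
One simple way to deal with some false positives beforehand is to skip sentences that match seeds of more than one sense. The sketch below is an assumption, not the required approach; `seed_label` is a hypothetical helper, and the seed words are copied from the `SEEDS` cell below so that the block runs on its own.

```python
import re

# Same seed words as the notebook's SEEDS; repeated so the sketch is self-contained.
seeds = {1: ["social", "events"], 2: ["system", "coalition"], 3: ["court", "law"]}

def seed_label(sentence, seeds=seeds):
    """Return the single sense whose seeds match, or None if zero or several senses match."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    matched = [sense for sense, words in seeds.items() if tokens & set(words)]
    return matched[0] if len(matched) == 1 else None

print(seed_label("A party is a social gathering."))              # 1
print(seed_label("The court ruled on the social club's party."))  # None
```

This does not catch every wrong label (a sentence about a *social liberal party* still gets sense 1), so it trades a slightly smaller training set for somewhat cleaner labels.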

In [25]:
SEEDS = {
    1: ['social', 'events'],
    2: ['system', 'coalition'],
    3: ['court', 'law']
}

<font color="red">[TODO]</font> Get the initial training data from the given seeds.  

In [26]:
# [TODO]
indice, first_X, first_Y = [], [], [] # sentence id of selected samples, selected sentences, detected labels
for sent_id, sentence in pure_data.items():
    tokens = preprocess(sentence)
    for clas in SEEDS:
        # label the sentence with the first sense whose seed word appears in it
        if any(t in SEEDS[clas] for t in tokens):
            indice.append(sent_id)
            first_X.append(to_embedding(tokens))
            first_Y.append(clas)
            break
first_X = add_padding(first_X)

Examine training data.  
The labels might not be 100% correct, but it should look reasonable.  

In [27]:
for i in range(5):
    print(pure_data[indice[i]])
    print(f' -> {first_Y[i]}: {SENSE[first_Y[i]]}')
    print()

From these social conventions derive in turn also the variants worn on related occasions of varying solemnity, such as formal political, diplomatic, and academic events, in addition to certain parties including award ceremonies, balls, fraternal orders, high school proms, etc.
 -> 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

The Free-minded People's Party () or Radical People's Party was a social liberal party in the German Empire, founded as a result of the split of the German Free-minded Party in 1893.
 -> 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

Typically, a party has the right to object in court to a line of questioning or at the introduction of a particular piece of evidence.
 -> 3: a single entity which can be identified as one for the purposes of the law

Dizzy bat is commonly played at parties, colleges and universities, bars, and other drinking festivities such as a tailgate party at sporting eve

Transform X and Y into numpy arrays for future use.

In [28]:
first_X = np.array(first_X)
first_Y = np.array(first_Y)
first_X.shape

(178, 89, 300)

### 2. Encode labels

The labels are currently categorical: `1`, `2`, and `3`. However, it's hard to teach a machine to produce this kind of answer directly.  
Most of the time, a machine learning model outputs a *numeric probability*, like `0.329`, rather than a categorical result.  
That's why we encode each label as a vector of values between 0 and 1, so that the model can output the probability of each answer.  

Here we suggest one-hot encoding, which is suitable for categorical classification.  
So the label `2` will look like
```
 Sense 1, Sense 2, Sense 3
[      0,       1,       0]
```
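
As a minimal numpy sketch of this encoding (one of several equivalent ways, with made-up labels), indexing an identity matrix by the shifted labels gives the one-hot rows, and `argmax` recovers the sense ids:

```python
import numpy as np

labels = np.array([1, 3, 2])   # made-up sense ids; they start from 1
num_classes = 3

# Row k of the identity matrix is the one-hot vector for column k,
# so shift the labels to 0-based column indices first.
onehot = np.eye(num_classes, dtype="float32")[labels - 1]
print(onehot)

# Decoding: the position of the largest value, shifted back to 1-based ids.
decoded = onehot.argmax(axis=1) + 1
print(decoded)
```

The `- 1` and `+ 1` shifts are easy to forget; mixing them up silently assigns every sentence to the wrong sense.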

*<small><a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">Why One-Hot Encode Data in Machine Learning?</a></small>

In [29]:
# if you prefer tensorflow
from tensorflow import one_hot
# or if you don't like tensorflow
from sklearn.preprocessing import OneHotEncoder

<font color="red">[TODO]</font> one-hot encode `first_Y`

<small>
*<a href="https://www.tensorflow.org/api_docs/python/tf/one_hot">tf.one_hot</a><br/>
*<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">sklearn.preprocessing.OneHotEncoder</a>
</small>

In [30]:
# [TODO]
onehot_Y = np.zeros((len(first_Y), 3))
onehot_Y[np.arange(len(first_Y)), first_Y - 1] = 1

In [31]:
onehot_Y[:5]

array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

### 3. Prepare training and validation set

Split the dataset into a training set and a validation set.  
The reason for splitting is that you don't want the model to see, while it is still learning, the data you'll use to test it.

Machines are very smart; sometimes a model just *memorizes* the answers rather than *learning* them. Even if the model achieves perfect accuracy on data it has seen, it might still fail miserably when facing the cruel, real world. *(heh)*  
That's why we need a validation set. We reserve a portion of the data that the model never trains on, and use it to check whether the model has really learned something.
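
If you prefer not to use sklearn, a shuffled split can be sketched with numpy alone. `shuffle_split` is a hypothetical helper, and its rounding of the validation size is an assumption, so the exact counts may differ by one from what other libraries produce.

```python
import numpy as np

def shuffle_split(X, Y, val_ratio=0.25, seed=0):
    """Shuffle X and Y with the same permutation, then cut off a validation part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # shuffled row indices
    n_val = int(round(len(X) * val_ratio))   # size of the validation set
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], Y[train_idx], Y[val_idx]

# Toy arrays: row i of X is [2i, 2i + 1] and its label is i.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10)
X_tr, X_va, Y_tr, Y_va = shuffle_split(X_demo, Y_demo, val_ratio=0.3)
print(X_tr.shape, X_va.shape)  # (7, 2) (3, 2)
```

Using one permutation for both arrays is what keeps each sentence paired with its own label after shuffling.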

<small>*<a href="https://tarangshah.com/blog/2017-12-03/train-validation-and-test-sets/">Train, Validation and Test Sets</a></small>

In [32]:
# if you prefer sklearn
from sklearn.model_selection import train_test_split
# or if you don't like sklearn. **Remember to shuffle your data before splitting.**

In [33]:
X_train, X_val, Y_train, Y_val = train_test_split(
    first_X, onehot_Y,
    test_size=0.25,   # [TODO] How much data you want to use as the validation set
    shuffle=True
)

In [34]:
print(X_train.shape, X_val.shape, Y_train.shape, Y_val.shape)

(133, 89, 300) (45, 89, 300) (133, 3) (45, 3)


### 4. Build your multi-class classifier 

Now the data is all prepared.  
Let's build a model to learn from it!  

Note that, unlike last week, your output dimension should be the number of categories, rather than `2` .  

\-

<small>
*Although tensorflow is used below, you can always change it to any other framework you are familiar with. <br/>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers">tf.keras.layers</a>
</small>

In [35]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Bidirectional#, and all the other layers you may use
from tensorflow.keras import layers


In [36]:
_, PADDING_WIDTH, EMBEDDING_DIM = X_train.shape
OUTPUT_CATEGORY = len(SENSE)

print(PADDING_WIDTH, EMBEDDING_DIM, OUTPUT_CATEGORY)

89 300 3


In [37]:
def weighted_class(Y):
    class_1, class_2, class_3, total, weight = 0, 0, 0, len(Y), {}
    for i in Y:
        if i.argmax() == 0:
            class_1 += 1
        elif i.argmax() == 1:
            class_2 += 1
        elif i.argmax() == 2:
            class_3 += 1
    print("class_1 = {}, class_2 = {}, class_3 = {}".format(class_1, class_2, class_3))
    wei = [total / class_1 + 1e-6, total / class_2 + 1e-6, total / class_3 + 1e-6]
    wei = [w / sum(wei) for w in wei]
    for i in range(3):
        weight[i] = wei[i]
    print(wei)
    return weight

weight = weighted_class(onehot_Y)

class_1 = 37, class_2 = 91, class_3 = 50
[0.46585436920141016, 0.1894133724124783, 0.34473225838611155]


<font color="red">[TODO]</font> Build a classifier

In [38]:
model_1 = Sequential()

# [TODO]
model_1.add(LSTM(60, input_shape=(PADDING_WIDTH, EMBEDDING_DIM), activation="selu"))
model_1.add(layers.BatchNormalization())
model_1.add(Dense(50, activation="selu"))
model_1.add(layers.BatchNormalization())
model_1.add(layers.Dropout(0.5))
model_1.add(Dense(3, activation="softmax"))

print(model_1.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 60)                86640     
_________________________________________________________________
batch_normalization (BatchNo (None, 60)                240       
_________________________________________________________________
dense (Dense)                (None, 50)                3050      
_________________________________________________________________
batch_normalization_1 (Batch (None, 50)                200       
_________________________________________________________________
dropout (Dropout)            (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 153       
Total params: 90,283
Trainable params: 90,063
Non-trainable params: 220
__________________________________________________

Time to choose the optimizer and loss function.  

A loss function measures how wrong your model's answers are (the lower the better), while an optimizer tells the model how to improve itself.  
But seriously, we are not asking you to fine-tune these parameters. That belongs in a Machine Learning class, not an NLP class, so if you cannot pass the baseline, check your preprocessing first. Something might have gone wrong there.  

\-

<small>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile">tf.keras.model#compile</a> <br/>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/optimizers">tf.keras.optimizers</a> <br/>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/losses">tf.keras.losses</a>
</small>

<font color="red">[TODO]</font> Compile your model

In [39]:
# [TODO]
model_1.compile(optimizer='Adam',
                # loss=tf.losses.CategoricalCrossentropy(from_logits=True), 
                loss=tf.losses.CategoricalCrossentropy(from_logits=False), 
                metrics=["accuracy"])

### 5. Train 

Time to train your model!  

You should always prevent the model from overfitting, so take validation accuracy into consideration and choose your epoch number wisely.  
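
One common way to choose the epoch number is to watch the validation loss and stop once it no longer improves. The sketch below shows the idea in plain Python; `best_epoch` and the loss values are made up for illustration, and with Keras the list would come from `history.history["val_loss"]` after calling `fit` with `validation_data`.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss, giving up
    after the loss fails to improve for `patience` consecutive epochs."""
    best, best_idx, waited = float("inf"), 0, 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_idx, waited = loss, i, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:                # no improvement for too long
                break
    return best_idx

# Made-up validation losses, one per epoch.
val_losses = [0.95, 0.80, 0.72, 0.75, 0.78, 0.60]
print(best_epoch(val_losses))  # 3
```

Note that the scan stops at epoch 5 and never sees the later drop to `0.60`; a larger `patience` trades extra training time for a chance to find such late improvements. Keras also ships a `tf.keras.callbacks.EarlyStopping` callback built on the same idea.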

<small>*<a href="https://www.ibm.com/cloud/learn/overfitting">What is Overfitting?</a></small>

<font color="red">[TODO]</font> Train and tune your model

In [40]:
history = model_1.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=10, batch_size=5, class_weight=weight)
# [TODO] how many iterations you want to run
# initial_epoch = ?    # set this if you're continuing previous training

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [41]:
pre = model_1.predict(first_X)
_ = weighted_class(pre)

class_1 = 37, class_2 = 94, class_3 = 47
[0.4585365495650765, 0.1804878485828936, 0.36097560185202987]


In [42]:
# example of continued training

# history = model_1.fit(
#     X_train, Y_train, 
#     validation_data=(X_val, Y_val),
#     epochs = 10,          # how many iterations you want to run
#     initial_epoch = 7     # set this if you're continuing previous training
# )

### 6. Examine your model

Let's see how well your model does.  

In [43]:
testcases = [
    # 1
    'A block party or street party is a party in which many members of a single community congregate, either to observe an event of some importance or simply for mutual enjoyment.',
    'A party is a social gathering.',
    # 2
    'Ukraine has a multi-party system, with numerous parties in which often not a single party has a chance of gaining power alone, and parties must work with each other to form coalition governments.',
    'Serbia has a multi-party system, with numerous parties in which no one party often has a chance of gaining power alone, and parties must work with each other to form coalition governments.',
    # 3
    'In a civil lawsuit, a nominal party is one named as a party on the record of an action, but having no interest in the action.',
]

In [44]:
# you must specify the padding width here, since the input size of model should always be the same
test_X = process_text(testcases, padding = PADDING_WIDTH)

In [45]:
predictions = model_1.predict(test_X)

In [46]:
predictions[0]

array([0.85232174, 0.11062728, 0.03705095], dtype=float32)

#### What does the result mean?

As you can see, a list of floats is generated. Since we used one-hot encoding when preparing the training data, each number is the score of the corresponding category.  
```
 Sense 1, Sense 2, Sense 3
[   0.85,    0.11,    0.04]
```
You can read these values as the probability of each column, that is, of each category. Hence, the predicted label is the one with the highest probability, which is Sense 1 for this sample.  

Now let's get all the predicted labels from these probabilities.  

In [47]:
for idx, result in enumerate(predictions):
    predict_id = result.argmax() # select the index of the maximum value
    sense_id = predict_id + 1    # sense_id starts from 1
    print(testcases[idx])
    print(f'-> Sense {sense_id} (prob={result[predict_id]:.2f}): {SENSE[sense_id]}')
    print()

A block party or street party is a party in which many members of a single community congregate, either to observe an event of some importance or simply for mutual enjoyment.
-> Sense 1 (prob=0.85): a social event at which a group of people meet to talk, eat, drink, dance, etc.

A party is a social gathering.
-> Sense 1 (prob=0.97): a social event at which a group of people meet to talk, eat, drink, dance, etc.

Ukraine has a multi-party system, with numerous parties in which often not a single party has a chance of gaining power alone, and parties must work with each other to form coalition governments.
-> Sense 2 (prob=0.96): an organization of people with particular political beliefs

Serbia has a multi-party system, with numerous parties in which no one party often has a chance of gaining power alone, and parties must work with each other to form coalition governments.
-> Sense 2 (prob=0.95): an organization of people with particular political beliefs

In a civil lawsuit, a nominal

Again, the labels might not be 100% correct, but they should look reasonable.  

## III. Second stage

The previous model might not be enough for real-world use; another model with better ability is needed.  

<small>*Most of this section is the same as the previous one, so you can reuse your code from above.</small>

### 1. Prepare the training data 

The model from the previous section is weak, yet it still has learned some valuable knowledge.  
Let's ask that model to label more training data for us!

In [48]:
# Get the probabilities on the whole dataset
predictions = model_1.predict(np.array(list(processed_data.values())))


<font color="red">[TODO]</font> Get the labels of all data, and reserve only those labels with high probabilities.

In [49]:
THRESHOLD = 0.7  # you may want to change this :)
indice, second_X, second_Y = [], [], [] # sentence id of selected samples, selected sentences, detected labels

for sent_id, result in zip(processed_data, predictions):
    # [TODO]
    index = int(result.argmax())
    if result[index] > THRESHOLD:
        indice.append(sent_id)
        second_X.append(processed_data[sent_id])
        second_Y.append(index + 1)  # sense ids start from 1

Observe the size of the selected data and the quality of the labels.  
You might want to go back and modify your preprocessing, your first model, or the threshold until you get better training data.
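
To get a feeling for the threshold, it can help to see how many samples survive at different values and how they spread over the senses. The sketch below uses made-up probabilities; `select_confident` is a hypothetical helper, and in the notebook you would pass the real `predictions` array instead.

```python
import numpy as np

def select_confident(probs, threshold):
    """Return (row indices, 1-based sense ids) for rows whose top probability exceeds the threshold."""
    probs = np.asarray(probs)
    top = probs.max(axis=1)                  # confidence of the best sense per row
    keep = np.flatnonzero(top > threshold)   # rows that are confident enough
    return keep, probs[keep].argmax(axis=1) + 1

# Made-up probabilities for four sentences.
probs = [[0.85, 0.11, 0.04],
         [0.40, 0.35, 0.25],
         [0.10, 0.75, 0.15],
         [0.05, 0.30, 0.65]]

for threshold in (0.5, 0.7, 0.8):
    keep, senses = select_confident(probs, threshold)
    per_sense = np.bincount(senses, minlength=4)[1:]   # counts for senses 1..3
    print(threshold, len(keep), per_sense)
```

A higher threshold gives cleaner labels but fewer samples, and it can starve a rare sense entirely, so it is worth checking the per-sense counts and not only the total.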

In [50]:
for i in range(5):
    print(pure_data[indice[i]])
    print(f' -> {second_Y[i]}: {SENSE[second_Y[i]]}')
    print()

A naked party, also known as nude party, is a party where the participants are required to be nude.
 -> 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

The town center bears the hallmarks of a typical migration-accepting Turkish rural town, with traditional structures coexisting with a collection of concrete apartment blocks providing public housing, as well as amenities such as basic shopping and fast-food restaurants, and essential infrastructure but little in the way of culture except for cinemas and large rooms hired out for wedding parties.
 -> 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

Elections Alberta oversees the creation of political parties and riding associations, compiles election statistics on ridings, and collects financial statements from party candidates and riding associations.
 -> 2: an organization of people with particular political beliefs

A group of characters can join together to form 

In [51]:
second_X = np.array(second_X)
second_Y = np.array(second_Y)
second_X.shape

(470, 89, 300)

### 2. Encode labels 

<font color="red">[TODO]</font> one-hot encode `second_Y`

In [52]:
# [TODO]
onehot_Y_second = np.zeros((len(second_Y), 3))
onehot_Y_second[np.arange(len(second_Y)), second_Y - 1] = 1

In [53]:
onehot_Y_second[:3]
weight_second = weighted_class(onehot_Y_second)

class_1 = 148, class_2 = 256, class_3 = 66
[0.2617424888877947, 0.1513199111596212, 0.586937599952584]


### 3. Prepare training and validation set

In [54]:
X_train, X_val, Y_train, Y_val = train_test_split(
    second_X, onehot_Y_second,
    test_size = 0.25,    # [TODO] How much data you want to use as the validation set
    shuffle = True
)

In [55]:
X_train.shape

(352, 89, 300)

### 4. Build model

In [56]:
# the number comes from previous setting
print(PADDING_WIDTH, EMBEDDING_DIM, OUTPUT_CATEGORY)

89 300 3


<font color="red">[TODO]</font> Build your second model

<small>*This model can be different from the previous one.</small>

In [57]:
model_2 = Sequential()

# [TODO]
model_2.add(layers.LSTM(100, input_shape=(PADDING_WIDTH, EMBEDDING_DIM), activation="selu"))
model_2.add(layers.BatchNormalization())
model_2.add(layers.Dense(80, activation="selu"))
model_2.add(layers.BatchNormalization())
model_2.add(layers.Dropout(0.5))
model_2.add(layers.Dense(3, activation="softmax"))

print(model_2.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 100)               160400    
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 80)                8080      
_________________________________________________________________
batch_normalization_3 (Batch (None, 80)                320       
_________________________________________________________________
dropout_1 (Dropout)          (None, 80)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 243       
Total params: 169,443
Trainable params: 169,083
Non-trainable params: 360
______________________________________________

<font color="red">[TODO]</font> Compile your model

In [58]:
# [TODO]
model_2.compile(optimizer='Adam',
                # loss=tf.losses.CategoricalCrossentropy(from_logits=True), 
                loss=tf.losses.CategoricalCrossentropy(from_logits=False), 
                metrics=["accuracy"])

### 5. Train model

<font color="red">[TODO]</font> Train it!

In [59]:
history = model_2.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=10, batch_size=20, class_weight=weight_second)
# [TODO] how many iterations you want to run
# initial_epoch = ?  # set this if you're continuing previous training

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### 6. Examine the result

In [60]:
testcases = [
    # 1
    'Green Beer Day (GBD) is a day-long party, where celebrants drink beer dyed green with artificial coloring or natural processes.',
    'When the siblings grew up, they held parties and introduced the tradition to friends while in college, and the tradition began to spread.',
    # 2
    'Politicians from the two main parties tend to win elections when not confronted by strong challengers from their own party (in which cases their traditional opponents tend to win).',
    'After the general election on 22 March 1992, five parties (Rassadorn, Justice Unity, Social Action, Thai Citizen, Chart Thai) designated Suchinda as the prime minister.',
    # 3
    'Typically, a party has the right to object in court to a line of questioning or at the introduction of a particular piece of evidence.',
    'In the practice of law, judicial estoppel (also known as estoppel by inconsistent positions) is an estoppel that precludes a party from taking a position in a case that is contrary to a position it has taken in earlier legal proceedings.'
]

In [61]:
# you must specify the padding width! 
test_X = process_text(testcases, padding=PADDING_WIDTH)

In [62]:
predictions = model_2.predict(test_X)

In [63]:
for idx, result in enumerate(predictions):
    predict_id = result.argmax()
    sense_id = predict_id + 1    # sense_id starts from 1
    print(testcases[idx])
    print(f'-> Sense {sense_id} (prob={result[predict_id]:.2f}): {SENSE[sense_id]}')
    print()

Green Beer Day (GBD) is a day-long party, where celebrants drink beer dyed green with artificial coloring or natural processes.
-> Sense 1 (prob=0.97): a social event at which a group of people meet to talk, eat, drink, dance, etc.

When the siblings grew up, they held parties and introduced the tradition to friends while in college, and the tradition began to spread.
-> Sense 1 (prob=1.00): a social event at which a group of people meet to talk, eat, drink, dance, etc.

Politicians from the two main parties tend to win elections when not confronted by strong challengers from their own party (in which cases their traditional opponents tend to win).
-> Sense 2 (prob=0.93): an organization of people with particular political beliefs

After the general election on 22 March 1992, five parties (Rassadorn, Justice Unity, Social Action, Thai Citizen, Chart Thai) designated Suchinda as the prime minister.
-> Sense 1 (prob=0.58): a social event at which a group of people meet to talk, eat, drin

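Yet again, the labels might not be 100% correct, but they should still look reasonable. For instance, the fourth test case above is about political parties, yet it is assigned Sense 1, though only with a low probability of 0.58.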

## IV. Evaluation

We have our model built! It's time to see how good it is on the testing dataset.  
Get the predictions from the final model and examine the results.  

In [64]:
with open(os.path.join('data', 'party.test.txt'), 'r') as f:
    data = f.read().strip().split('\n')

# this dict maps sentence_id to the sentence itself
test_data = {sent_id: text for sent_id, text in [line.split('\t', 1) for line in data]}

In [65]:
for idx, (sent_id, sentence) in enumerate(test_data.items()):
    if idx > 3: break
    print(f'{sent_id}: {sentence}')

1638: Patent ambiguity is that ambiguity which is apparent on the face of an instrument to any one perusing it, even if unacquainted with the circumstances of the parties.
1639: Smith played at parties, juke joints, and fish fries.
1640: Turkey has a multi-party system, with two or three strong parties and often a fourth party that is electorally successful.
1641: The Christian Liberation Movement ( or simply MCL) is a Cuban dissident party advocating political change in Cuba.


<font color="red">[TODO]</font> Get the labels of testing data.  

Be sure to keep the sentence id, because you will need it when requesting your accuracy.  
Recommended format of `final_predictions`: 
```
{ sent_id: sense_id }
```
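
A minimal sketch of producing that format, assuming the model returns one row of three probabilities per sentence. The helper `to_sense_ids` and the probability values are made up for illustration.

```python
def to_sense_ids(sent_ids, prob_rows):
    """Map each sentence id to the 1-based sense id with the highest probability."""
    predictions = {}
    for sent_id, probs in zip(sent_ids, prob_rows):
        # Index of the largest probability (0-based).
        best_index = max(range(len(probs)), key=lambda i: probs[i])
        predictions[sent_id] = best_index + 1   # sense ids start from 1
    return predictions

sent_ids = ['1638', '1639']
prob_rows = [
    [0.10, 0.20, 0.70],   # made-up probabilities
    [0.90, 0.05, 0.05],
]
example_predictions = to_sense_ids(sent_ids, prob_rows)
print(example_predictions)
```

Keeping plain Python `int` values (rather than NumPy integers) in the dict matters later, because the standard `json` module cannot serialize NumPy integer types.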

In [66]:
final_predictions = {}
# [TODO]
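for sent_id, sentence in test_data.items():
    # process_text expects a list of sentences, so wrap the single sentence
    processed = process_text([sentence], padding=PADDING_WIDTH)
    # predict returns one probability row per input; take the only row
    prediction = model_2.predict(processed)[0]
    # argmax is 0-based, while sense ids start from 1
    final_predictions[sent_id] = int(prediction.argmax() + 1)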

In [67]:
for idx, (sent_id, pred) in enumerate(final_predictions.items()):
    if idx > 5: break
        
    print(test_data[sent_id])
    print(f'-> Sense {pred}: {SENSE[pred]}')
    print()

Patent ambiguity is that ambiguity which is apparent on the face of an instrument to any one perusing it, even if unacquainted with the circumstances of the parties.
-> Sense 3: a single entity which can be identified as one for the purposes of the law

Smith played at parties, juke joints, and fish fries.
-> Sense 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

Turkey has a multi-party system, with two or three strong parties and often a fourth party that is electorally successful.
-> Sense 2: an organization of people with particular political beliefs

The Christian Liberation Movement ( or simply MCL) is a Cuban dissident party advocating political change in Cuba.
-> Sense 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

Greens Party () was a green liberal party in Turkey.
-> Sense 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

Under the Constitution of North Korea, all citize

### Get your accuracy

Send your predictions in JSON format to our server, and we will calculate the accuracy for you.  
The format should be:
```
{ sentence_id: sense_id }
```
For example:
```
{
    1001: 1,
    1002: 1,
    ...
}
```
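
JSON object keys are always strings, so whether your dict uses integer or string sentence ids, `json.dumps` produces string keys. A small check with the standard library, using the made-up ids from the example above:

```python
import json

predictions = {1001: 1, 1002: 1}   # integer keys, as in the example above

payload = json.dumps(predictions)  # integer keys are converted to strings
restored = json.loads(payload)     # keys come back as strings

print(payload)
print(restored)
```

If you read the predictions back from JSON, remember to look them up with string ids.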

In [68]:
import json
import requests

In [69]:
data = json.dumps(final_predictions)
ret = requests.post('http://jedi.nlplab.cc:4500/check', json = {'data': data})

In [70]:
if not ret.ok:
    print('Something went wrong :o')
print(ret.json())

{'accuracy': 0.7285714285714285, 'comment': ['Well done!']}
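
How the server scores your submission is not shown here. As a rough sketch, accuracy is usually the fraction of sentences whose predicted sense matches the gold label. The `accuracy` helper, the gold labels, and the predictions below are all hypothetical.

```python
def accuracy(predictions, gold):
    """Fraction of gold-labeled sentences whose predicted sense matches the label."""
    if not gold:
        return 0.0
    # A sentence missing from predictions counts as wrong.
    correct = sum(1 for sent_id, sense in gold.items()
                  if predictions.get(sent_id) == sense)
    return correct / len(gold)

gold = {'a': 1, 'b': 2, 'c': 3, 'd': 2}    # made-up gold labels
preds = {'a': 1, 'b': 2, 'c': 1, 'd': 2}   # made-up predictions
score = accuracy(preds, gold)
print(score)
```

If you hand-label a few sentences yourself, a helper like this gives a quick sanity check before you submit.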


**REQUIREMENT**  
**Your accuracy should be <u>higher than 0.70</u> to get the full points.**

But do note that your assignment is mostly scored on your implementation, not just on the accuracy.  
So even if you brute-force our server and get 100% accuracy, you still won't get your points if your code doesn't make sense to the TA.

## TA's note

Congratulations! You've finished the assignment this week.  
Don't forget to <b>[make an appointment with the TA](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=1902646609) to demo/explain your implementation <u>before <font color="red">11/18 15:30</font></u></b> .  
Also make sure you submit your {student_id}.ipynb to [eeclass](https://eeclass.nthu.edu.tw/course/homework/4615).

Please note that <font color="red">we will announce our final project on 11/18</font>. Again, **we strongly suggest you join and listen** .  
Two Ph.D. students will introduce the selected topics in class and give you some guidelines on how to approach your project.  
Also, we will have a team-matching session at the end of the class, in which you may want to participate to find teammates.