## Modelling a supervised Guesser

Modelling the unsupervised guesser and creating a synthetic dataset made me realize I could attempt a supervised Guesser as well. There will be a lot of tradeoffs in terms of parallelization, pretraining, and clarity, but it may yield better performance in return.

We can use word2vec once more to get semantic context features.

In [1]:
import word2vec_loader as wv_loader

limit = 200_000
print(f"Loading {limit} keys")
google_news_wv = wv_loader.load_word2vec_keyedvectors(wv_loader.GOOGLE_NEWS_PATH_NAME, limit)

Loading 200000 keys


Let's take the synthetic dataset and make a training and test split.

In [2]:
import synthetic_datamuse as sd
import numpy as np
import pandas

meaning_df = pandas.read_csv(sd.MEANING_CSV_PATH)
triggerword_df = pandas.read_csv(sd.TRIGGERWORD_CSV_PATH)

# together there are about 70,000 samples
# keeping 80% for training and 20% for testing gives a split roughly comparable to MNIST's
split_ratio = 0.8
meaning_split_index = int(len(meaning_df) * split_ratio)
triggerword_split_index = int(len(triggerword_df) * split_ratio)

meaning_train, meaning_test = meaning_df[:meaning_split_index], meaning_df[meaning_split_index:]
triggerword_train, triggerword_test = triggerword_df[:triggerword_split_index], triggerword_df[triggerword_split_index:]


train_df, test_df = pandas.concat([meaning_train, triggerword_train], axis=0), pandas.concat([meaning_test, triggerword_test], axis=0)
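The split above takes the first 80% of rows from each CSV. If the rows happen to be stored in some meaningful order, a shuffled split may be safer. The following is a minimal sketch of that idea, not part of the original notebook: `shuffled_split_indices` is a hypothetical helper, and the seed is an arbitrary choice for reproducibility.

```python
import numpy as np

def shuffled_split_indices(n_rows, split_ratio=0.8, seed=0):
    """Return (train_idx, test_idx) as disjoint, shuffled row positions.

    Hypothetical helper: the notebook itself slices rows in file order.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)      # random ordering of 0..n_rows-1
    cut = int(n_rows * split_ratio)      # same cut rule as the notebook
    return order[:cut], order[cut:]

train_idx, test_idx = shuffled_split_indices(10)
# With a DataFrame this would be used as, for example:
#   meaning_train, meaning_test = meaning_df.iloc[train_idx], meaning_df.iloc[test_idx]
```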

A naive way to encode each sample would be to lay out the four keyword embeddings followed by the three clue embeddings, in order.

In [3]:
import torch

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

def features_to_tensor(features):
    words_to_vecs = features.map(wv_loader.official_keyword_to_word).map(google_news_wv.__getitem__).map(normalize)
    return torch.from_numpy(np.array(words_to_vecs.tolist()).transpose()).contiguous()

# initialize DataLoader
train_features, train_target = train_df.drop('code_index', axis=1), train_df['code_index']

# DataLoader accepts a sequence of (input, label) pairs; if this doesn't fit in RAM, switch to a custom Dataset class
training_data = [(features_to_tensor(train_features.iloc[i]), train_target.iloc[i]) for i in range(len(train_features))]
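To make the layout concrete, here is a small sketch of the shape this encoding produces for one sample. It uses random unit vectors as stand-ins for the word2vec embeddings (an assumption made so the example runs without the model file), ordered as four keywords followed by three clues.

```python
import numpy as np

EMBED_DIM = 300  # dimensionality of the Google News word2vec vectors

rng = np.random.default_rng(0)

# Stand-ins for 4 keyword and 3 clue embeddings, one row per word
word_vecs = rng.normal(size=(7, EMBED_DIM)).astype(np.float32)
word_vecs /= np.linalg.norm(word_vecs, axis=1, keepdims=True)  # unit length

# Transpose so the embedding features become channels and the words a sequence
encoded = np.ascontiguousarray(word_vecs.T)
print(encoded.shape)  # (300, 7)
```

Each sample is therefore a 300-channel sequence of length 7, which is the layout `Conv1d` expects once a batch dimension is added.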



We can use the negative log probability as the loss, because we already found it to be a good metric for ranking guesses with the unsupervised guesser. An architecture built from sigmoids and convolutions should help the model come up with features resembling probabilistic quantities.

In [4]:
# adapted from MNIST example https://github.com/nicknochnack/PyTorchin15
import os

train_dataloader = torch.utils.data.DataLoader(training_data, batch_size=64, shuffle=True)

class ClueClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=300, out_channels=64, kernel_size=3),
            torch.nn.Sigmoid(),
            torch.nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3),
            torch.nn.Sigmoid(),
            torch.nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3),
            torch.nn.Sigmoid(),
            torch.nn.Flatten(),
            torch.nn.Linear(128, 24), # 24 outputs = 4 permute 3 codes
            torch.nn.LogSoftmax(1) 
        )

    def forward(self, x):
        return self.model(x)
    
    
# instance, loss, optimizer
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

classifier = ClueClassifier().to(device)
optimizer  = torch.optim.Adam(classifier.parameters(), lr=1e-3)

# maximizing log-likelihood is a decent metric for this classification - see unsupervised notebook for more
loss_func = torch.nn.NLLLoss()

# train loop
model_path = "model_state.pt"
if not os.path.exists(model_path):
    for epoch in range(10):
        for X, y in train_dataloader: # loop through batches
            X, y = X.to(device), y.to(device)
            yhat = classifier(X)
            loss = loss_func(yhat, y)

            # backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch} loss: {loss.item()}")

    with open(model_path, 'wb') as f:
        torch.save(classifier.state_dict(), f)
else:
    with open(model_path, 'rb') as f:
        classifier.load_state_dict(torch.load(f))
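The `Linear(128, 24)` layer relies on the sequence length collapsing to 1 after the three convolutions. A short sketch of that arithmetic, using the output-length formula from the PyTorch `Conv1d` documentation (`conv1d_out_len` is a hypothetical helper written for this illustration):

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution, per the Conv1d documentation."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

length = 7  # 4 keywords + 3 clues
lengths = [length]
for _ in range(3):  # three Conv1d layers with kernel_size=3
    length = conv1d_out_len(length, kernel_size=3)
    lengths.append(length)

print(lengths)  # [7, 5, 3, 1]

# The last conv has 128 output channels, so Flatten yields 128 * 1 features
flattened_features = 128 * lengths[-1]
```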

In [5]:
test_features, test_target = test_df.drop('code_index', axis=1), test_df['code_index']

tensor, label = (features_to_tensor(test_features.iloc[0]), test_target.iloc[0])
output = classifier(tensor.unsqueeze(0).to(device))  # add a batch dimension and match the model's device
print(output)
print(torch.argmax(output), label)

tensor([[-3.1712, -3.1706, -3.1718, -3.1832, -3.1815, -3.1901, -3.1832, -3.1753,
         -3.1788, -3.1841, -3.1648, -3.1786, -3.1819, -3.1888, -3.1787, -3.1809,
         -3.1755, -3.1800, -3.1751, -3.1780, -3.1790, -3.1611, -3.1837, -3.1779]],
       grad_fn=<LogSoftmaxBackward0>)
tensor(21) 0


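For each training epoch, the loss stayed around 3.17 and even rose at times. That is roughly ln 24 ≈ 3.18, the loss of guessing uniformly over the 24 codes, and the nearly identical log probabilities above confirm it. As we can see, our model is having trouble learning.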
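As a quick check on that number, here is a minimal sketch of the loss a classifier would get by assigning equal probability to every one of the 24 codes. `uniform_nll` is a hypothetical helper, not something from the notebook.

```python
import math

NUM_CLASSES = 24  # 4 permute 3 possible codes

def uniform_nll(num_classes):
    """Negative log-likelihood of the true class under a uniform prediction."""
    return -math.log(1.0 / num_classes)

baseline = uniform_nll(NUM_CLASSES)
print(f"uniform baseline loss: {baseline:.4f}")  # uniform baseline loss: 3.1781
```

A training loss that hovers at this value suggests the network is not yet doing better than chance.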

We are asking a lot of it. Not only are we asking it to come up with a maximum log probability guess, but we are implicitly asking it to learn the context of the game. Unlike the unsupervised guesser, our net starts out knowing nothing about which keywords and clue words are unrelated, or that the clue words should be related to some of the keywords. Maybe there is an input format which will reflect this better.

What if instead of thinking of the input as a sequence

```
K1 K2 K3 K4 C1 C2 C3
```

we thought of it as a matrix

```
      C1      C2      C3
K1   K1C1    K1C2    K1C3
K2   K2C1    K2C2    K2C3
K3   K3C1    K3C2    K3C3
K4   K4C1    K4C2    K4C3
```

This encodes the input as keyword-clue pairs that the classifier may have an easier time learning associations for. The matrix is also relatively small (although we have many channels for embedding features), so it may make sense to opt for a more fully-connected architecture.
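One way such a matrix could be built is sketched below. The text does not say how a keyword and a clue are combined into a cell, so this assumes an elementwise product of the two normalized embeddings, which keeps 300 channels per cell and sums to the cosine similarity. `pair_matrix` is a hypothetical helper, and random vectors stand in for the word2vec embeddings.

```python
import numpy as np

EMBED_DIM = 300  # word2vec dimensionality used in this notebook

def pair_matrix(keyword_vecs, clue_vecs):
    """Combine every keyword with every clue into a (EMBED_DIM, 4, 3) array.

    Assumption: cell (i, j) is the elementwise product of keyword i and clue j.
    Concatenation or a difference would be equally plausible choices.
    """
    # (4, 1, D) * (1, 3, D) broadcasts to (4, 3, D)
    pairs = keyword_vecs[:, None, :] * clue_vecs[None, :, :]
    return np.transpose(pairs, (2, 0, 1))  # channels first: (D, 4, 3)

rng = np.random.default_rng(0)
keywords = rng.normal(size=(4, EMBED_DIM))
clues = rng.normal(size=(3, EMBED_DIM))
keywords /= np.linalg.norm(keywords, axis=1, keepdims=True)
clues /= np.linalg.norm(clues, axis=1, keepdims=True)

features = pair_matrix(keywords, clues)
similarity = features.sum(axis=0)  # (4, 3) cosine similarities, one per cell
```

With this choice, summing over the channel axis recovers the keyword-clue cosine similarities, so the association signal is present in the input from the start rather than something the network has to discover.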

In [6]:
# X, y = next(iter(train_dataloader))

# print(X.shape)
# out = torch.nn.Conv1d(in_channels=300, out_channels=64, kernel_size=3)(X)
# print(out.shape)
# out = torch.nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3)(out)
# print(out.shape)
# out = torch.nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3)(out)
# print(out.shape)
# out = torch.nn.Flatten()(out)
# print(out.shape)
# out = torch.nn.Linear(128, 24)(out)
# print(out.shape)