<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

Before we start running our own Python code, install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package built primarily on top of PyTorch
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets
* [`evaluate`](https://huggingface.co/docs/evaluate/index) is a library of performance metrics (such as accuracy)

All three of these packages will be used in this course.

In [52]:
!pip3 install -q transformers datasets evaluate


(Above, the `!` at the start of the line tells the notebook to run the line as an operating system command rather than Python code, and the `-q` argument to `pip` runs the command in "quiet" mode, with less output.)

---

# Get and prepare data

*   Let us work with the IMDB dataset of movie review sentiment
*   25,000 positive reviews
*   25,000 negative reviews
*   50,000 unlabeled reviews (which we discard for the time being)


In [3]:
from pprint import pprint #pprint => pretty-print, I use it occasionally throughout the notebook
import datasets
dset=datasets.load_dataset("imdb")
pprint(dset)


{'test': Dataset({
    features: ['text', 'label'],
    num_rows: 25000
}),
 'train': Dataset({
    features: ['text', 'label'],
    num_rows: 25000
}),
 'unsupervised': Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})}


In [4]:
dset=dset.shuffle() #This is never a bad idea, datasets may have ordering to them, which is not what we want
del dset["unsupervised"] #Delete the unlabeled part of the dataset, we don't need it for anything

In [7]:
pprint(dset['train'][0]['text'])
print(dset['train'][0]['label'])

("I hate how this movie has absolutely no creative input. I know they're going "
 "for realism, but to be frank I just don't want realism. Realism is boring. "
 "If I want to see daily life, I'll uhm, live. Tell me an interesting story "
 "and we'll talk. I can deal with the low production values, hell I'm a sucker "
 'for low production values, but at least work in some good ideas. The '
 'direction only goes as far as grabbing a camcorder and walking around a bit, '
 "but obviously I'm supposed to dig that because it makes stuff so much more "
 'realistic. Hitchcock used to say drama was essentially life with the dull '
 'bits cut out. I can only conclude this is not drama, not by a long shot. We '
 'get to see Rosetta walking to someplace, Rosetta working in a bakery, '
 'Rosetta eating a waffle, Rosetta carrying around bags of far, Rosetta '
 "walking back home, Rosetta walking someplace...it's just not that "
 "entertaining. There isn't really a deeper meaning either. I got so bor

## Tokenize and vectorize data
         
*   We need to achieve two complementary tasks
*   **Tokenize**: split the text into units which can be interpreted as features (most likely words in this case)
*   **Vectorize**: build the feature vector
*   Since this is NLP, vectorizing here means listing the non-zero elements of the feature vector, or in other words the indices of the rows in the embedding matrix
*   A traditional and well-tested way is to use sklearn's feature extraction package
*   `CountVectorizer` is most likely what we want here, but for other NLP work the `TfidfVectorizer` is also very handy
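
To make the two steps concrete, here is a minimal pure-Python sketch of a binary bag-of-words vectorizer. It only approximates what `CountVectorizer(binary=True)` does: the tokenization rule (lowercase, tokens of two or more word characters) and the alphabetical index assignment are assumptions chosen to resemble the sklearn defaults, and the helper names `tokenize`, `build_vocabulary`, and `vectorize` are hypothetical.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters
    # (an assumption that roughly mimics CountVectorizer's defaults)
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(texts):
    # Give every distinct token an index, in alphabetical order
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def vectorize(text, vocabulary):
    # Binary bag of words: sorted indices of the known tokens that occur
    return sorted({vocabulary[tok] for tok in tokenize(text) if tok in vocabulary})

toy_texts = ["A great movie", "A boring movie, really boring"]
toy_vocab = build_vocabulary(toy_texts)
toy_ids = vectorize("really great", toy_vocab)
print(toy_vocab)  # {'boring': 0, 'great': 1, 'movie': 2, 'really': 3}
print(toy_ids)    # [1, 3]
```

Note that word order and repetition are lost: only the set of vocabulary items present in the text survives.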



In [10]:
import sklearn.feature_extraction

vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=20000)

texts=[ex["text"] for ex in dset["train"]] #get a list of all texts from the training data
vectorizer.fit(texts) #"Trains" the vectorizer, i.e. builds its vocabulary


# Building the feature vectors

* This is super-easy with the vectorizer
* It produces a sparse matrix of the non-zero elements

In [21]:
def vectorize_example(ex):
    vectorized=vectorizer.transform([ex["text"]]) # [...] because the vectorizer expects a list/iterable over inputs, not one input
    non_zero_features=vectorized.nonzero()[1] #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features+=1 #feature index 0 will have a special meaning
                         # so let us not produce it by adding +1 to everything
    return {"input_ids":non_zero_features} 

print(vectorize_example(dset["train"][0]))

{'input_ids': array([  320,   860,   887,  1157,  1207,  1300,  1495,  1533,  1545,
        1702,  1737,  1980,  1988,  2226,  2231,  2592,  2604,  2625,
        2686,  2711,  2847,  3591,  3793,  4260,  4476,  4516,  4635,
        4724,  5112,  5164,  5429,  5529,  5658,  5775,  5869,  6141,
        6254,  6663,  7127,  7259,  7627,  7775,  7777,  7801,  7839,
        7856,  8047,  8289,  8303,  8443,  8604,  8657,  8780,  8905,
        8929,  9085,  9314,  9428,  9602,  9619,  9630,  9890, 10090,
       10328, 10449, 10561, 10576, 10628, 10642, 10696, 10708, 10890,
       11176, 11185, 11681, 11721, 11762, 11778, 12134, 12202, 12212,
       12328, 12363, 12437, 12445, 12577, 12658, 13295, 13805, 14087,
       14310, 14338, 14340, 14349, 14491, 15461, 15682, 16017, 16478,
       16542, 16548, 16922, 17075, 17197, 17305, 17412, 17643, 17781,
       17893, 17897, 17929, 17944, 17968, 18115, 18580, 18963, 19013,
       19324, 19354, 19398, 19446, 19712, 19773, 19779], dtype=int32)}


In [24]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

vectorized=vectorize_example(dset["train"][0]) #vectorize the first training example again
idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) #invert the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    words.append(idx2word[idx-1]) ## It is easy to forget we moved all by +1
pprint(", ".join(words)) #This is now the bag of words representation of the document

('absolutely, an, and, around, as, at, back, bags, bakery, be, because, bit, '
 'bits, bored, boring, business, but, by, camcorder, can, carrying, come, '
 'conclude, creative, cut, daily, deal, deeper, dig, direction, don, drama, '
 'dull, eating, either, entertaining, essentially, far, for, frank, get, goes, '
 'going, good, got, grabbing, guess, has, hate, hell, hitchcock, home, how, '
 'ideas, if, in, input, interesting, is, isn, it, just, know, least, life, '
 'live, ll, long, looking, love, low, makes, me, meaning, more, most, movie, '
 'much, no, not, nothing, obviously, of, on, only, out, overrated, plain, '
 'production, quality, re, realism, realistic, really, reflections, say, see, '
 'shot, so, some, someplace, started, story, stuff, sucker, supposed, talk, '
 'tell, that, the, there, they, this, to, uhm, used, values, walking, want, '
 'was, we, with, work, working')


# Tokenizing / vectorizing the whole dataset

* The datasets library allows us to efficiently map() a function across the whole dataset
* Can run in parallel

**Note**: confusingly, and unlike the Python `map` function, the [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function _updates_ each example with the values returned by the function, keeping the existing values. Here, the call adds the returned values (here `input_ids`) to each example while also keeping the original `text` and `label` values. The result is returned as a new dataset, which is why we assign it to `dset_tokenized`.
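
The merging behaviour can be illustrated with plain dictionaries. This is only a sketch of the semantics, not how the `datasets` library is implemented; `map_examples` and `add_length` are hypothetical names.

```python
def map_examples(examples, function):
    # Build NEW examples: the original fields plus whatever `function` returns
    return [{**ex, **function(ex)} for ex in examples]

def add_length(ex):
    # Returns only the new field; the caller merges it into the example
    return {"n_chars": len(ex["text"])}

toy_examples = [{"text": "good", "label": 1}, {"text": "awful", "label": 0}]
toy_mapped = map_examples(toy_examples, add_length)
print(toy_mapped[0])  # {'text': 'good', 'label': 1, 'n_chars': 4}
```

Python's built-in `map` would instead give back only the `{"n_chars": ...}` dictionaries, dropping `text` and `label`.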


In [25]:
# Apply the tokenizer to the whole dataset using .map()
dset_tokenized = dset.map(vectorize_example,num_proc=4)
pprint(dset_tokenized["train"][0])


{'input_ids': [320,
               860,
               887,
               1157,
               1207,
               1300,
               1495,
               1533,
               1545,
               1702,
               1737,
               1980,
               1988,
               2226,
               2231,
               2592,
               2604,
               2625,
               2686,
               2711,
               2847,
               3591,
               3793,
               4260,
               4476,
               4516,
               4635,
               4724,
               5112,
               5164,
               5429,
               5529,
               5658,
               5775,
               5869,
               6141,
               6254,
               6663,
               7127,
               7259,
               7627,
               7775,
               7777,
               7801,
               7839,
               7856,
               8047,
               8

## Input encoding for MLP

* Our `input_ids` are an array containing the indices of the tokens found in the text
* These correspond to row indices of the embedding matrix in the model


# Batching and padding

* When working with neural networks, one rarely trains one example at a time
* Instead, examples are processed a batch at a time
* There are two important reasons for this:
  1. No batching is too slow (GPU parallelization cannot kick in across examples)
  2. The gradients are averaged across the whole batch and applied only once, i.e. batching acts as a regularizer
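
The averaging in the second point can be sketched on a toy problem. The example below assumes a one-parameter model `w*x` with squared-error loss, which is not the model used in this notebook; `example_gradient` and `batch_step` are hypothetical helpers.

```python
def example_gradient(w, x, y):
    # Derivative of the squared error (w*x - y)**2 with respect to w
    return 2 * (w * x - y) * x

def batch_step(w, batch, learning_rate):
    # One update per batch: average the per-example gradients, apply once
    grads = [example_gradient(w, x, y) for x, y in batch]
    mean_grad = sum(grads) / len(grads)
    return w - learning_rate * mean_grad

toy_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs with y = 2*x
w_after = batch_step(0.0, toy_batch, learning_rate=0.03)
print(w_after)  # approximately 0.56, one step towards the true value 2.0
```

A single noisy example has less influence on the update, because its gradient is averaged with the others in the batch.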


# Padding

* In order to build a batch as a 2D array of (example, seq) (see below), we need to fit together examples of different length!
* Solution: pad the shorter examples with zeroes to the length of the longest example in the batch
* Make sure that zero is understood as padding value rather than a (hypothetical) feature with index 0
* This is best shown by example, it is in the end easier than it may sound
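
Before the tensor version, here is the same idea as a minimal sketch on plain Python lists. The helper `pad_batch` is hypothetical and is not used elsewhere in the notebook.

```python
def pad_batch(sequences, pad_value=0):
    # Pad every sequence on the right up to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

toy_padded = pad_batch([[5, 9, 12], [7]])
print(toy_padded)  # [[5, 9, 12], [7, 0, 0]]
```

Because every real feature index was shifted by +1 during vectorization, a 0 in the padded batch can only ever mean padding.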

In [44]:
import torch
#Build a batch from 2 examples, with padding (collator() is defined in the Collator section below)
batch=collator([dset_tokenized["train"][2],dset_tokenized["train"][7]])
print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
pprint(batch["labels"])
pprint(batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 101])
tensor([0, 1])
tensor([[  727,   826,   860,   887,  1005,  1115,  1207,  1702,  1764,  1897,
          2625,  2711,  2878,  2889,  3480,  3494,  3653,  3987,  4174,  4418,
          4918,  5372,  5414,  5746,  5750,  5857,  5942,  6514,  6604,  6614,
          6856,  6878,  6986,  7127,  7176,  7767,  8012,  8128,  8224,  8322,
          8476,  8776,  8898,  9085,  9363,  9428,  9602,  9630, 10259, 10475,
         10696, 10890, 10970, 11035, 11176, 11181, 11681, 11694, 11766, 11867,
         12295, 12363, 12566, 12993, 13028, 13626, 13801, 14349, 14825, 14844,
         15695, 15919, 16017, 16020, 16174, 16316, 16359, 16478, 17412, 17637,
         17885, 17893, 17897, 17907, 17929, 17944, 17954, 17968, 17972, 17982,
         18018, 18108, 18115, 18192, 19000, 19071, 19440, 19462, 19609, 19768,
         19805],
        [  419,   860,  1207,  2199,  2604,  2625,  2686,  3053,  3282,  5167,
          5403,  5425,  5

# Collator

* This is simply a function which takes a list of examples and builds a training batch out of them
* Just as examples are dictionaries holding the data, batches are dictionaries too
* The only difference is that in a batch, all data tensors have one extra dimension; that's all there is to it

In [42]:
def collator(list_of_examples):
    batch={"labels":torch.tensor(list(ex["label"] for ex in list_of_examples))} #this is easy, labels are made into a single tensor
    #the harder part is now to pad the examples, as they are of different length
    tensors=[]
    max_len=max(len(example["input_ids"]) for example in list_of_examples) #this is the longest example in the batch
    #everything needs to be padded to fit in length the longest example
    #(so we can build a single tensor out of it)
    for example in list_of_examples:
        ids=torch.tensor(example["input_ids"]) #pick the input ids
        # pad(what,(from_left, from_right)) <- this is how we call the stock pad function
        padded=torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])) #pad by max - current length, pads with zero by default
        tensors.append(padded) #accumulate the padded ids
    batch["input_ids"]=torch.vstack(tensors) #now that we have all of them the same length, a simple vstack() stacks them up
    return batch #...and that's all there is to it


# Build the MLP model

* Now that all of our data is in shape, we can build the model
* That is luckily quite easy in this case

The model class in its simplest form has `__init__()` which instantiates the layers and `forward()` which implements the actual computation. For more information on these, please see the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

In [45]:
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # Note! uniform_() also overwrites the padding row, so zero it out again;
        # padding_idx=0 then keeps it at pure zeros, since it receives no gradient updates
        self.embedding.weight.data[0].zero_()
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model

        
    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None): #receives a padded batch of input_ids and, optionally, the labels
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)
        
        #2) apply non-linearity
        # (batch,embedding_dim) -> (batch,embedding_dim)
        projected=torch.tanh(embedded_summed) #Note how non-linearity is applied here and not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. 2 in our case)
        logits=self.output(projected)

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)
        
        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()
mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=20,nlabels=2)


In [47]:
# And we can make a model
mlp=MLP(mlp_config)
fake_batch=collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) #** expands input_ids and labels as parameters of the call

(tensor(0.7160, grad_fn=<NllLossBackward0>), tensor([[0.1685, 0.2454],
         [0.1834, 0.1965]], grad_fn=<AddmmBackward0>))
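
The three steps of `forward()` can also be traced by hand on a single example. The sketch below uses plain Python and made-up toy parameters (`toy_embedding`, `toy_output_weights`, `toy_output_bias` are all hypothetical, as is `toy_forward`), so it mirrors the computation only in spirit. It shows why a padding row of pure zeros matters: appending padding ids does not change the result.

```python
import math

# Hypothetical toy parameters: 3 embedding rows of width 2, row 0 is padding
toy_embedding = [[0.0, 0.0], [0.5, -0.2], [0.1, 0.3]]
toy_output_weights = [[1.0, -1.0], [0.5, 2.0]]  # one row of weights per label
toy_output_bias = [0.0, 0.1]

def toy_forward(input_ids):
    # 1) sum the embeddings of all ids, dimension by dimension
    summed = [sum(toy_embedding[i][d] for i in input_ids) for d in range(2)]
    # 2) apply the non-linearity
    hidden = [math.tanh(v) for v in summed]
    # 3) linear output layer: one logit per label
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(toy_output_weights, toy_output_bias)]

logits_unpadded = toy_forward([1, 2])
logits_padded = toy_forward([1, 2, 0, 0])
print(logits_unpadded == logits_padded)  # True
```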

# Train the model

We will use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class for training

* Loads of arguments that control the training
* Configurable metrics to evaluate performance
* Data collator builds the batches
* Early stopping callback stops when eval loss no longer improves
* Model load/save
* Good foundation for later deep learning course
  

First, let's create a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments) object to specify hyperparameters and various other settings for training. 

Printing this simple dataclass object will show not only the values we set, but also the defaults for all other arguments. Don't worry if you don't understand what all of these do! Many are not relevant to us here, and you can find the details in [`Trainer` documentation](https://huggingface.co/docs/transformers/main_classes/trainer) if you are interested.

In [49]:
# Set training arguments
# their names are mostly self-explanatory
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=False,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ign

Next, let's create a metric for evaluating performance during and after training. We can use the convenience function [`evaluate.load`](https://huggingface.co/docs/evaluate/index) to load one of many pre-made metrics and wrap this for use by the trainer.

As the task is simple binary classification and our data is evenly balanced (50:50), we can comfortably use the basic `accuracy` metric, defined as the proportion of correctly predicted labels out of all labels.
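
As a minimal sketch of that definition, here is accuracy computed by hand in plain Python on made-up logits. The helpers `argmax` and `toy_accuracy` are hypothetical and only illustrate the calculation.

```python
def argmax(row):
    # Index of the largest value, i.e. the "winning" label
    return max(range(len(row)), key=row.__getitem__)

def toy_accuracy(logits, labels):
    # Proportion of examples whose predicted label equals the true label
    predictions = [argmax(row) for row in logits]
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

toy_logits = [[0.2, 0.9], [1.5, -0.3], [0.1, 0.4], [0.0, 2.0]]
toy_labels = [1, 0, 0, 1]
print(toy_accuracy(toy_logits, toy_labels))  # 0.75
```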

In [54]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)


We can then create the `Trainer` and train the model by invoking the [`Trainer.train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In addition to the model, the settings passed in through the `TrainingArguments` object created above (`trainer_args`), the data, and the metric defined above, we create and pass the following to the `Trainer`:

* [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator): groups input into batches
* [`EarlyStoppingCallback`](https://huggingface.co/docs/transformers/main_classes/callback#transformers.EarlyStoppingCallback): stops training when performance stops improving

In [55]:
# Make a new model  
mlp = MLP(mlp_config)


# Argument gives the patience before early stopping, counted in evaluations
# i.e. training is stopped when the evaluation loss fails to improve
# a certain number of times in a row
early_stopping = transformers.EarlyStoppingCallback(5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()



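| Step | Training Loss | Validation Loss | Accuracy |
|-----:|--------------:|----------------:|---------:|
| 500 | 0.4938 | 0.388805 | 0.874 |
| 1000 | 0.2883 | 0.307788 | 0.888 |
| 1500 | 0.2099 | 0.28281 | 0.891 |
| 2000 | 0.1676 | 0.277067 | 0.885 |
| 2500 | 0.138 | 0.279414 | 0.886 |
| 3000 | 0.1153 | 0.286287 | 0.888 |
| 3500 | 0.0966 | 0.297495 | 0.891 |
| 4000 | 0.0814 | 0.311186 | 0.887 |
| 4500 | 0.0695 | 0.327691 | 0.88 |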


TrainOutput(global_step=4500, training_loss=0.1844968982272678, metrics={'train_runtime': 71.3528, 'train_samples_per_second': 35878.035, 'train_steps_per_second': 280.297, 'total_flos': 61730226432.0, 'train_loss': 0.1844968982272678, 'epoch': 22.96})
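
Training stopped at step 4500 rather than running to `max_steps`, which is consistent with a patience of 5 evaluations. The sketch below replays the validation losses logged above through a simplified patience counter. It is a stand-in for illustration, not the actual `EarlyStoppingCallback` implementation, and `early_stop_index` is a hypothetical helper.

```python
def early_stop_index(eval_losses, patience):
    # Return the index of the evaluation at which training would stop,
    # or None if the loss keeps improving often enough
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best, bad_evals = loss, 0  # improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Validation losses from the training log, one per 500 steps
reported_losses = [0.388805, 0.307788, 0.28281, 0.277067, 0.279414,
                   0.286287, 0.297495, 0.311186, 0.327691]
stop_at = early_stop_index(reported_losses, patience=5)
stop_step = (stop_at + 1) * 500
print(stop_step)  # 4500
```

The best validation loss was reached at step 2000, and the five evaluations that followed all failed to improve on it.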

We can then evaluate the trained model on a given dataset (here the full test set, not just the 1,000-example subset used during training) by calling [`Trainer.evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate):

In [56]:
eval_results = trainer.evaluate(dset_tokenized["test"])

print(eval_results)

{'eval_loss': 0.2788609266281128, 'eval_accuracy': 0.88804, 'eval_runtime': 10.3315, 'eval_samples_per_second': 2419.791, 'eval_steps_per_second': 302.474, 'epoch': 22.96}


# Save the model for later use

* You can save it with `trainer.save_model()`
* You can load it with `MLP.from_pretrained()`


In [57]:
trainer.save_model("mlp-imdb")

# Check save/load

In [58]:
mlp2=MLP.from_pretrained("mlp-imdb")

In [59]:
trainer = transformers.Trainer(
    model=mlp2,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["test"],
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

In [60]:
eval_results = trainer.evaluate(dset_tokenized["test"])
print(eval_results)
print('Accuracy:', eval_results['eval_accuracy'])

{'eval_loss': 0.2788609266281128, 'eval_accuracy': 0.88804, 'eval_runtime': 5.9314, 'eval_samples_per_second': 4214.887, 'eval_steps_per_second': 526.861}
Accuracy: 0.88804


# Extra time left?

* Read through the TrainingArguments documentation, try to understand at least some parts of it https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
* Read through Torch tensor operations, try to understand at least some parts of it: https://pytorch.org/docs/stable/tensors.html
* Run the model with different parameters (hidden layer width, learning rate, etc), how much do the results change?


# What has the model learned?

* The embeddings should have some meaning to them
* Similar features should have similar embeddings

In [62]:
# Grab the embedding matrix out of the trained model
# and drop the first row (padding 0)
# then we can treat the embeddings as vectors
# and maybe compare them to each other
# ha ha this below took some googling
weights=mlp.embedding.weight.detach().cpu().numpy()
weights=weights[1:,:]

In [64]:
import sklearn.metrics #explicit import for sklearn.metrics.pairwise used below

qry_idx=vectorizer.vocabulary_["great"] #index of the embedding of "great"

#calculate the distance of the "great" embedding to all other embeddings
distance_to_qry=sklearn.metrics.pairwise.euclidean_distances(weights[qry_idx:qry_idx+1,:],weights)
nearest_neighbors=np.argsort(distance_to_qry) #indices of words nearest to "great"
for nearest in nearest_neighbors[0,:20]:
    print(idx2word[nearest])
# This works great!

great
wonderfully
superb
refreshing
rare
amazing
loved
perfect
enjoyable
gem
wonderful
fantastic
favorite
today
underrated
incredible
perfectly
funniest
enjoyed
noir


* The embeddings indeed seem to reflect the task
* There is a meaning to them

# Feature weights

*   A typical "old-school" way to approach the classification would be a simple linear model, like a linear SVM
*   Under such a model, each feature (word) would have a single weight
*   And the classification would simply be based on the sum of these weights
*   In the context of this task, "positive" words would get a high weight, "negative" words would get a low weight
*   It is in fact quite easy to reconfigure the MLP model to work more or less like this and this effect can be replicated
*   I will leave that as an exercise for you
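
To illustrate the idea of a linear bag-of-words classifier (not the MLP reconfiguration itself, which remains the exercise), here is a minimal sketch in plain Python. The weights in `toy_weights` are made up for illustration, whereas a real linear model would learn them from data; `linear_score` and `classify` are hypothetical helpers.

```python
# Hypothetical per-word weights; a real linear model would learn these
toy_weights = {"great": 1.2, "wonderful": 0.9, "boring": -1.1, "lousy": -1.4}

def linear_score(words, weights, bias=0.0):
    # Sum the weights of the distinct words present (binary bag of words);
    # words without a weight contribute nothing
    return bias + sum(weights.get(w, 0.0) for w in set(words))

def classify(words, weights):
    # Positive total -> label 1 (positive review), otherwise label 0
    return 1 if linear_score(words, weights) > 0 else 0

pos = classify("a great and wonderful film".split(), toy_weights)
neg = classify("boring lousy plot great cast".split(), toy_weights)
print(pos, neg)  # 1 0
```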

