# Surprise similarity for text classification
Install the `surprise_similarity` package with:
 `pip install surprise_similarity`

# Word similarity
In this simple example we'll compare the cosine similarity and the surprise similarity of a few words to the word `dog`. This is a toy example that shows the difference between the two scores. It also demonstrates the `SurpriseSimilarity.rank_documents` method, which can be used for document ranking and retrieval in general. Finally, we show how to fine-tune the underlying sentence-transformer model to reproduce desired similarities.

In [1]:
import surprise_similarity
similarity = surprise_similarity.SurpriseSimilarity()



## Vocabulary
For this example we'll use the vocabulary given by `english_words_alpha_set`, which ships with the `english-words` package (https://pypi.org/project/english-words/).

In [2]:
from english_words import english_words_alpha_set
vocabulary = list(english_words_alpha_set)
print(f'There are {len(vocabulary)} words in the vocabulary.')

There are 25474 words in the vocabulary.


## Cosine Ranking
Start by ranking the words in the vocabulary by their similarity to the word `dog`. By setting `surprise_weight = 0` we see the ranking based on the cosine similarity score.

In [3]:
dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=0,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df.head()


Batches: 100%|██████████| 797/797 [00:07<00:00, 105.53it/s]


|  | documents | dog |
|---|---|---|
| 6596 | dog | 1.0 |
| 1584 | canine | 0.952441 |
| 11666 | pup | 0.935472 |
| 19924 | animal | 0.929151 |
| 20715 | pooch | 0.928523 |


Great, but let's focus on a small subset of words that highlights the differences between the cosine and surprise similarity scores:

In [4]:
my_dogs_name = 'Jude'
example_words = ['the', 'potato', 'my', 'Alsatian', 'furry', 'puppyish', my_dogs_name]
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]

|  | documents | dog |
|---|---|---|
| 648 | the | 0.852129 |
| 19796 | potato | 0.850233 |
| 25127 | my | 0.850161 |
| 11001 | Alsatian | 0.849615 |
| 2049 | furry | 0.833391 |
| 3254 | puppyish | 0.829942 |
| 14978 | Jude | 0.783955 |


Cosine similarity ranks the word ‘potato’ as more closely related to the word ‘dog’ than the word ‘Alsatian’.
This may be counter-intuitive since ‘Alsatian’ refers to a specific breed of dog, also known as the
German Shepherd, while ‘potato’ is a starchy, tuberous crop and is not related to dogs at all.

Unsurprisingly, the embedding model does not know the name of my dog.
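
For reference, the cosine score used in this section is the normalized dot product of two embedding vectors. The sketch below is a minimal plain-Python illustration, not the package's implementation: the function name `cosine_similarity` and the 3-dimensional vectors are hypothetical stand-ins for real sentence-transformer embeddings, which have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings" (assumption: made up for illustration).
dog = [1.0, 2.0, 0.0]
pup = [2.0, 4.0, 0.0]     # same direction as dog, different length
other = [-2.0, 1.0, 0.0]  # orthogonal to dog

print(cosine_similarity(dog, pup))    # close to 1: only direction matters
print(cosine_similarity(dog, other))  # 0 for orthogonal vectors
```

Because the score depends only on the angle between vectors, a word whose embedding points in a "typical" direction can score highly against many unrelated words, which is the effect seen with `the`, `my`, and `potato` above.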

## Surprise ranking
Now we'll set `surprise_weight = 1` and see how the surprise similarity score ranks our example words:

In [5]:
dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=1,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]

|  | documents | dog |
|---|---|---|
| 11001 | Alsatian | 0.999322 |
| 2049 | furry | 0.994495 |
| 3254 | puppyish | 0.98784 |
| 19796 | potato | 0.986203 |
| 648 | the | 0.979125 |
| 25127 | my | 0.973552 |
| 14978 | Jude | 0.646892 |


Great! The surprise similarity accounts for the fact that words like `the`, `my`, and apparently `potato` have a high similarity with many other words (probably because they appear in many different contexts in the pre-training corpus) and adjusts the score accordingly. All three now rank below `Alsatian`, `furry`, and `puppyish`.

However, the model still does not know my dog's name: `Jude` remains last, with a score of 0.646892.
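
To make the adjustment described above concrete, here is a toy sketch of one way such a score could work: compare a raw cosine value with the background distribution of similarities for the same word, and report how unusual it is via a normal CDF. This is an assumption for illustration only. The function name `surprise_score`, the choice of background population, the Gaussian model, and all numbers are hypothetical, and the package's actual computation may differ.

```python
import math
import statistics


def surprise_score(similarity, background):
    """Normal-CDF of the z-score of `similarity` within `background`.

    `background` is a list of cosine similarities between the same word
    and other words (assumption: the choice of population is ours).
    """
    mean = statistics.mean(background)
    std = statistics.pstdev(background)
    z = (similarity - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Made-up background similarities, not taken from any model.
generic_background = [0.83, 0.85, 0.87]   # a word that is close to everything
specific_background = [0.78, 0.80, 0.82]  # a word that is close to few things

same_cosine = 0.85  # both words have the same raw cosine to the query
print(surprise_score(same_cosine, generic_background))   # about 0.5: unremarkable
print(surprise_score(same_cosine, specific_background))  # near 1: surprising
```

Under this sketch, the same raw cosine of 0.85 is unremarkable for a word that is similar to everything, but stands out for a word whose similarities are usually lower.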

# Fine-tuning
Now I'd like to fine-tune the underlying embedding model to know about my dog, Jude.

In [6]:
similarity.train(keys=["dog", 'pet'], queries=[my_dogs_name, my_dogs_name], min_its=30, lr_factor=1)

Training on 2 examples...

Training time: 0:07min (30 iterations, F1: 1.0)


Now check the rankings of the example words again. For cosine similarity the fine-tuning works: `Jude` climbs from last place (0.783955) to second place among the example words (0.923478).

In [7]:
dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=0,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]

Batches: 100%|██████████| 796/796 [00:07<00:00, 105.04it/s]


|  | documents | dog |
|---|---|---|
| 19796 | potato | 0.925398 |
| 14978 | Jude | 0.923478 |
| 25127 | my | 0.921206 |
| 648 | the | 0.915926 |
| 11001 | Alsatian | 0.909983 |
| 2049 | furry | 0.909034 |
| 3254 | puppyish | 0.904946 |


Likewise, under surprise similarity `Jude` is now much more similar to `dog` (0.999295, up from 0.646892). Unexpectedly, however, training has also increased the relative surprise similarity of `potato` to `dog`: it now ranks second among the example words.

In [8]:
dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=1,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]

|  | documents | dog |
|---|---|---|
| 11001 | Alsatian | 0.999363 |
| 19796 | potato | 0.999309 |
| 2049 | furry | 0.999302 |
| 14978 | Jude | 0.999295 |
| 3254 | puppyish | 0.999283 |
| 648 | the | 0.999091 |
| 25127 | my | 0.998881 |
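
The effect of fine-tuning is easiest to see as a change in rank rather than in raw score. The sketch below copies the surprise-similarity scores from the two tables (cells In [5] and In [8]) and compares positions among the seven example words only, not within the full vocabulary. The helper name `ranks` is ours.

```python
# Surprise-similarity scores to `dog`, copied from the tables above.
before = {'Alsatian': 0.999322, 'furry': 0.994495, 'puppyish': 0.98784,
          'potato': 0.986203, 'the': 0.979125, 'my': 0.973552,
          'Jude': 0.646892}
after = {'Alsatian': 0.999363, 'potato': 0.999309, 'furry': 0.999302,
         'Jude': 0.999295, 'puppyish': 0.999283, 'the': 0.999091,
         'my': 0.998881}


def ranks(scores):
    """Map each word to its 1-based position when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {word: position for position, word in enumerate(ordered, start=1)}


rank_before = ranks(before)
rank_after = ranks(after)
for word in before:
    print(f'{word}: {rank_before[word]} -> {rank_after[word]}')
```

Among these words, `Jude` moves from seventh to fourth and `potato` from fourth to second, while `Alsatian` stays first.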
