# Surprise similarity for text classification
Install the `surprise_similarity` package with:
 `pip install surprise_similarity`

# Word similarity
In this simple example we compare the cosine similarity and the surprise similarity of a few words to the word `dog`. It is a toy example that shows how the two scores differ, and it also demonstrates the `SurpriseSimilarity.rank_documents` method, which can be used for document ranking and retrieval in general. Finally, we show how to fine-tune the underlying sentence-transformer model to reproduce desired similarities.

In [19]:
import surprise_similarity
similarity = surprise_similarity.SurpriseSimilarity()

## Vocabulary
For this example we'll use the vocabulary given by `english_words_alpha_set`, which ships with the `english-words` package (https://pypi.org/project/english-words/).

`pip install english-words==1.1.0`

In [20]:
from english_words import english_words_alpha_set
vocabulary = list(english_words_alpha_set)
print(f'There are {len(vocabulary)} words in the vocabulary.')

There are 25474 words in the vocabulary.


## Cosine Ranking
Start by ranking the words in the vocabulary by their similarity to the word `dog`. By setting `surprise_weight = 0` we see the ranking based on the cosine similarity score.

In [21]:
my_dogs_name = 'Jude'

dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=0,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df.head()



Unnamed: 0,documents,dog
19923,dog,1.0
13908,canine,0.952441
14719,pup,0.935472
9724,animal,0.929151
402,pooch,0.928523


Great, but let's focus on a small subset of words that highlights the differences between the cosine and surprise similarity scores.

In [22]:
my_dogs_name = 'Jude'
example_words = ['the', 'potato', 'my', 'Alsatian', 'furry', 'puppyish', my_dogs_name]
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]

Unnamed: 0,documents,dog
1231,the,0.852129
6091,potato,0.850233
20634,my,0.850161
15945,Alsatian,0.849615
9775,furry,0.833391
12785,puppyish,0.829942
8224,Jude,0.783955


Cosine similarity ranks the word ‘potato’ as more closely related to the word ‘dog’ than the word ‘Alsatian’.
This may be counter-intuitive since ‘Alsatian’ refers to a specific breed of dog, also known as the
German Shepherd, while ‘potato’ is a starchy, tuberous crop and is not related to dogs at all.

Unsurprisingly, the embedding model does not know the name of my dog.
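
For reference, the cosine score used in this section can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the package's implementation: the helper names `cosine_similarity` and `rank_by_cosine` and the 3-dimensional embeddings are made up for the example, whereas the real scores come from a sentence-transformer model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs):
    """Return (document, score) pairs sorted from most to least similar."""
    scores = {doc: cosine_similarity(query_vec, vec) for doc, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy embeddings, invented purely for illustration.
toy_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "pup": [0.9, 0.1, 0.0],
    "potato": [0.1, 0.9, 0.2],
}

ranking = rank_by_cosine(toy_embeddings["dog"], toy_embeddings)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

Because the score depends only on the angle between two vectors, a word whose embedding points in a "generic" direction can score highly against many unrelated words.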

## Surprise ranking
Now we'll set `surprise_weight = 1` and see how the surprise similarity score ranks our example words:

In [23]:
dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=1,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]

Unnamed: 0,documents,dog
15945,Alsatian,0.999324
9775,furry,0.99527
12785,puppyish,0.987276
6091,potato,0.987208
1231,the,0.980269
20634,my,0.974568
8224,Jude,0.641516


Great! The surprise similarity takes into account the fact that words like `the`, `my`, and apparently `potato` have a high similarity with many other words (probably because they appear in many different contexts in the pre-training corpus) and adjusts the similarity score accordingly.

However, the model still does not know about my dog's name.
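
The adjustment described above can be sketched as follows. This is a minimal illustration under a stated assumption, not the package's code: it assumes the surprise score is the probability, under a normal approximation, that a randomly chosen vocabulary word is less similar to the document than the query is. The function names and all numbers below are hypothetical.

```python
import math
from statistics import mean, pstdev

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def surprise_score(sim_to_query, sims_to_vocabulary):
    """Assumed form of the score: how unusually high is the document's
    similarity to the query, compared with its similarity to everything else?"""
    mu = mean(sims_to_vocabulary)
    sigma = pstdev(sims_to_vocabulary)
    if sigma == 0:
        return 0.5  # no spread in the background, so nothing is surprising
    return normal_cdf((sim_to_query - mu) / sigma)

# Hypothetical numbers: both words have cosine similarity 0.85 with `dog`.
# `the` is similar to almost every word; `Alsatian` is similar to few.
the_vs_vocabulary = [0.84, 0.85, 0.86, 0.85, 0.85]
alsatian_vs_vocabulary = [0.55, 0.60, 0.65, 0.60, 0.60]

the_score = surprise_score(0.85, the_vs_vocabulary)
alsatian_score = surprise_score(0.85, alsatian_vs_vocabulary)
print(f"the: {the_score:.3f}, Alsatian: {alsatian_score:.3f}")
```

With identical cosine scores, the word with a high background similarity ends up near the middle of the scale, while the word with a low background similarity ends up near the top.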

# Fine-tuning
Now I'd like to fine-tune the underlying embedding model to know about my dog, Jude.

In [24]:
# Pair the keys `dog` and `pet` with the dog's name; `potato` is paired with no query (None).
similarity.train(keys=['dog', 'pet', 'potato'], queries=[my_dogs_name, my_dogs_name, None], min_its=1)

Training on 6 examples...

Training time: 0:04min (14 iterations, F1: 1.0)


Now check the rankings of the example words again. For cosine similarity, the fine-tuning works and `Jude` climbs the similarity ladder.

In [25]:
dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=0,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]


Unnamed: 0,documents,dog
8224,Jude,0.842758
20634,my,0.80466
1231,the,0.799246
15945,Alsatian,0.797201
9775,furry,0.779097
12785,puppyish,0.775524
6091,potato,0.445254


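Likewise, for surprise similarity `Jude` is now the example word most similar to `dog`. Training has also sharply reduced the surprise similarity of `potato` to `dog`, from 0.987208 to 0.000353, presumably because `potato` was passed as a key with no matching query (`None`) in the training call.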

In [26]:
dog_similarity_df = similarity.rank_documents(queries=['dog'],
                                documents=vocabulary,
                                surprise_weight=1,
                                sample_num_cutoff=None,
                                normalize_raw_similarity=False,
                                )
dog_similarity_df[dog_similarity_df['documents'].isin(example_words)]

Unnamed: 0,documents,dog
8224,Jude,0.999323
15945,Alsatian,0.997128
9775,furry,0.973405
12785,puppyish,0.964916
20634,my,0.944466
1231,the,0.906733
6091,potato,0.000353
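
To make the effect of fine-tuning easier to read, the short sketch below compares the rank of each example word before and after training. The scores are copied from the two surprise-similarity tables in this notebook; the helper name `ranks` is hypothetical and exists only for this illustration.

```python
# Surprise similarity to `dog`, copied from the tables before and after fine-tuning.
before = {
    "Alsatian": 0.999324, "furry": 0.99527, "puppyish": 0.987276,
    "potato": 0.987208, "the": 0.980269, "my": 0.974568, "Jude": 0.641516,
}
after = {
    "Jude": 0.999323, "Alsatian": 0.997128, "furry": 0.973405,
    "puppyish": 0.964916, "my": 0.944466, "the": 0.906733, "potato": 0.000353,
}

def ranks(scores):
    """Map each word to its 1-based position when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {word: position for position, word in enumerate(ordered, start=1)}

rank_before = ranks(before)
rank_after = ranks(after)

for word in before:
    print(f"{word:>9}: rank {rank_before[word]} -> {rank_after[word]}")
```

Among these seven words, `Jude` moves from last place to first, while `potato` moves from fourth place to last.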
