First setup: what does the model do, and how does it work?

In [1]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

#Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173748e-02 -4.28515524e-02 -1.56286340e-02  1.40537247e-02
  3.95537578e-02  1.21796273e-01  2.94333920e-02 -3.17524076e-02
  3.54959816e-02 -7.93139860e-02  1.75878499e-02 -4.04369980e-02
  4.97259349e-02  2.54912488e-02 -7.18700886e-02  8.14968571e-02
  1.47073052e-03  4.79627326e-02 -4.50336188e-02 -9.92174745e-02
 -2.81769857e-02  6.45046160e-02  4.44670394e-02 -4.76217009e-02
 -3.52952369e-02  4.38671596e-02 -5.28566092e-02  4.33036505e-04
  1.01921476e-01  1.64072067e-02  3.26996371e-02 -3.45986634e-02
  1.21339252e-02  7.94870928e-02  4.58343141e-03  1.57778300e-02
 -9.68204997e-03  2.87625641e-02 -5.05806319e-02 -1.55793764e-02
 -2.87906770e-02 -9.62280855e-03  3.15556899e-02  2.27348879e-02
  8.71449560e-02 -3.85027118e-02 -8.84718224e-02 -8.75498727e-03
 -2.12343428e-02  2.08923519e-02 -9.02078003e-02 -5.25732227e-02
 -1.05638849e-02  2.88310796e-02 -1.61455162e-02  6.17836276e-03
 -1.23234

With a pretrained model we can measure how similar two sentences are.

In [2]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

#Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.6153]])
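What a cosine similarity score measures can be sketched in plain Python. This is only an illustration of the formula, not the library's implementation; the vectors are hypothetical three-dimensional stand-ins for the 384-dimensional embeddings that all-MiniLM-L6-v2 produces.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made up for illustration)
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1, so similarity is close to 1
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1, so similarity is close to 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

A score near 1 means the vectors point in the same direction, a score near 0 means they are unrelated.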


Similarity test with our own input. The similarity is lower than for the example sentences.

In [3]:
emb1 = model.encode("This is a rotten apple")
emb2 = model.encode("Do these apples have blotches")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.4199]])


The code below builds a top 5 of the most similar sentence pairs and prints their cosine similarity scores.

In [4]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.'
          ]

#Encode all sentences
embeddings = model.encode(sentences)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474
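The nested loops in the cell above visit every unordered pair (i, j) with i < j exactly once. The same idea can be sketched with `itertools.combinations`; the similarity matrix here is a hypothetical 3x3 example with made-up values, not model output.

```python
from itertools import combinations

# Hypothetical symmetric similarity matrix for three sentences (values made up)
sim = [
    [1.00, 0.75, 0.10],
    [0.75, 1.00, 0.25],
    [0.10, 0.25, 1.00],
]

# Every unordered pair (i, j) with i < j, scored once
pairs = [(sim[i][j], i, j) for i, j in combinations(range(len(sim)), 2)]

# Highest similarity first
pairs.sort(key=lambda p: p[0], reverse=True)

for score, i, j in pairs:
    print(f"{i} \t {j} \t {score:.4f}")
```

Skipping the diagonal and the lower triangle avoids comparing a sentence with itself and listing each pair twice.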


Below we use our own sentences to see whether the pattern the model picks up on becomes clearer. It strongly appears that sentences sharing several keywords, or variations of them (singular/plural), score highest.

In [5]:
sentences = ["This apple has rot.",
          "A healthy apple doesn't have blotches.",
          "Damaged fruits are used to make apple butter.",
          "Some of the apple diseases are: rot, blotch and scab.",
          "An apples grows on a tree.",
          "Fallen fruit might be spoiled.",
          "Apples don't drive cars.",
          "Pears and apples can't be compared, but can be compeared",
          "Some apples with scabs do not disqualify the complete batch."
          ]

#Encode all sentences
embeddings = model.encode(sentences)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
A healthy apple doesn't have blotches. 	 Some of the apple diseases are: rot, blotch and scab. 	 0.7044
This apple has rot. 	 Some of the apple diseases are: rot, blotch and scab. 	 0.6864
A healthy apple doesn't have blotches. 	 Some apples with scabs do not disqualify the complete batch. 	 0.6364
Some of the apple diseases are: rot, blotch and scab. 	 Some apples with scabs do not disqualify the complete batch. 	 0.6289
This apple has rot. 	 A healthy apple doesn't have blotches. 	 0.6095
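One rough way to probe the keyword hypothesis would be to compare the embedding scores with a purely lexical overlap score. The sketch below uses Jaccard overlap on word sets; the `tokens` and `jaccard` helpers are hypothetical and the plural handling is deliberately crude, so this is an assumption-laden illustration rather than a proper linguistic analysis.

```python
import re

def tokens(sentence):
    """Lowercased word set; a trailing 's' is stripped as a crude singular/plural merge."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}

def jaccard(a, b):
    """Shared words divided by all distinct words in the two sentences."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

s1 = "This apple has rot."
s2 = "Some of the apple diseases are: rot, blotch and scab."
s3 = "Apples don't drive cars."

print(jaccard(s1, s2))  # shares 'apple' and 'rot'
print(jaccard(s1, s3))  # shares only 'apple'
```

If the lexical ranking and the embedding ranking agree on most pairs, that would support the idea that shared keywords drive the scores; where they disagree, the model is picking up on something beyond word overlap.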


In [7]:
from sentence_transformers import SentenceTransformer, models
from torch import nn

word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [8]:
from sentence_transformers import SentenceTransformer, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
   InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

In [9]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')                   # -> Similarity: tensor([[0.5627, 0.5645]])
# model_original = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') -> Similarity: tensor([[0.5472, 0.6330]])

query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its finacial district'])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))

Similarity: tensor([[0.5627, 0.5645]])
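This cell uses `util.dot_score` where earlier cells use `util.cos_sim`. The two scores coincide only when the vectors have unit length, which the sketch below illustrates with hypothetical two-dimensional vectors. Whether a given model normalizes its embeddings is something to check per model; it is not shown in this notebook.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

# Hypothetical vectors, not real model output
query = [3.0, 4.0]
passage = [4.0, 3.0]

raw_dot = dot(query, passage)                          # depends on vector length
unit_dot = dot(normalize(query), normalize(passage))   # length factored out
print(raw_dot, unit_dot, cosine(query, passage))
```

On unnormalized vectors the dot product rewards longer vectors, so the two measures can rank passages differently.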


Use argmax() to select the highest scoring value. Find the location of this value and print it as answer to the question.

In [14]:
model = SentenceTransformer('all-MiniLM-L6-v2') 

query = 'How many pears or apples'
query_embedding = model.encode(query, convert_to_tensor=True)

# `embeddings` and `sentences` still hold the apple sentences from cell In [5]
answer_array = util.dot_score(query_embedding, embeddings)

print("Similarity:", answer_array)

# Index of the highest score in the similarity tensor
answer_location = answer_array.argmax()

# The sentence at that index is the best match for the query
print(sentences[answer_location.item()])

Similarity: tensor([[0.3636, 0.3556, 0.4080, 0.3773, 0.4912, 0.3326, 0.4199, 0.6447, 0.3668]])
Pears and apples can't be compared, but can be compeared
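The argmax step can be checked by hand on the scores printed above. This sketch copies those nine values into a plain list and finds the index of the largest one without any library.

```python
# Scores copied from the "Similarity:" output above
scores = [0.3636, 0.3556, 0.4080, 0.3773, 0.4912, 0.3326, 0.4199, 0.6447, 0.3668]

# argmax: the index whose score is largest
best_index = max(range(len(scores)), key=lambda i: scores[i])

print(best_index, scores[best_index])
```

The index is zero-based, so it can be used directly to look up the matching entry in `sentences`.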


Transfer learning model choice: all-MiniLM-L6-v2 is 5 times faster than the best model (all-mpnet-base-v2) and still gives good quality. A model built for symmetric semantic search (sss) is the best fit here, because the question and the answer are both relatively short and concise; <i>sss</i> looks for similar questions.

In [11]:
"""
This is a simple application for sentence embeddings: semantic search

We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.

This script outputs for various queries the top 5 most similar sentences in the corpus.
"""
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    # Alternatively, util.semantic_search performs cosine similarity + top-k in one call:
    # hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    # hits = hits[0]      # Get the hits for the first query
    # for hit in hits:
    #     print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))





Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a 

<strong>With util.semantic_search:</strong><br>
<ul>
<li><strong>Query:</strong> A man is eating pasta.</li><br>
<li>A man is eating food. (Score: 0.7035)</li>
<li>A man is eating a piece of bread. (Score: 0.5272)</li>
<li>A man is riding a horse. (Score: 0.1889)</li>
<li>A man is riding a white horse on an enclosed ground. (Score: 0.1047)</li>
<li>A cheetah is running behind its prey. (Score: 0.0980)</li><br>
<li><strong>Query:</strong> Someone in a gorilla costume is playing a set of drums.</li><br>
<li>A monkey is playing drums. (Score: 0.6433)</li>
<li>A woman is playing violin. (Score: 0.2564)</li>
<li>A man is riding a horse. (Score: 0.1389)</li>
<li>A man is riding a white horse on an enclosed ground. (Score: 0.1191)</li>
<li>A cheetah is running behind its prey. (Score: 0.1080)</li><br>
<li><strong>Query:</strong> A cheetah chases prey on across a field.</li><br>
<li>A cheetah is running behind its prey. (Score: 0.8253)</li>
<li>A man is eating food. (Score: 0.1399)</li>
<li>A monkey is playing drums. (Score: 0.1292)</li>
<li>A man is riding a white horse on an enclosed ground. (Score: 0.1097)</li>
<li>A man is riding a horse. (Score: 0.0650)</li>
</ul>

<strong>With cosine-similarity and torch.topk:</strong><br>
<ul>
<li><strong>Query:</strong> A man is eating pasta.</li><br>
<li><i>Top 5 most similar sentences in corpus:</i></li>
<li>A man is eating food. (Score: 0.7035)</li>
<li>A man is eating a piece of bread. (Score: 0.5272)</li>
<li>A man is riding a horse. (Score: 0.1889)</li>
<li>A man is riding a white horse on an enclosed ground. (Score: 0.1047)</li>
<li>A cheetah is running behind its prey. (Score: 0.0980)</li><br>

<li><strong>Query:</strong> Someone in a gorilla costume is playing a set of drums.</li><br>
<li><i>Top 5 most similar sentences in corpus:</i></li>
<li>A monkey is playing drums. (Score: 0.6433)</li>
<li>A woman is playing violin. (Score: 0.2564)</li>
<li>A man is riding a horse. (Score: 0.1389)</li>
<li>A man is riding a white horse on an enclosed ground. (Score: 0.1191)</li>
<li>A cheetah is running behind its prey. (Score: 0.1080)</li><br>

<li><strong>Query:</strong> A cheetah chases prey on across a field.</li><br>
<li><i>Top 5 most similar sentences in corpus:</i></li>
<li>A cheetah is running behind its prey. (Score: 0.8253)</li>
<li>A man is eating food. (Score: 0.1399)</li>
<li>A monkey is playing drums. (Score: 0.1292)</li>
<li>A man is riding a white horse on an enclosed ground. (Score: 0.1097)</li>
<li>A man is riding a horse. (Score: 0.0650)</li>
</ul>

<strong>The results of <i>cosine-similarity/torch.topk</i> and <i>util.semantic_search</i> are identical.</strong>

Let's see what happens when we use our own questions and answers about our quality control.<br>
With <i>top_k = min(1, len(corpus))</i> we could select only the best answer.
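How the choice of `top_k` narrows the result list can be sketched without torch. The `top_k` helper and the scores below are hypothetical; `heapq.nlargest` stands in for `torch.topk` purely for illustration.

```python
import heapq

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    k = min(k, len(scores))  # never ask for more results than there are sentences
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Hypothetical scores of one query against a six-sentence corpus
scores = [0.70, 0.53, 0.02, 0.19, 0.03, 0.10]

for idx, score in top_k(scores, 3):
    print(idx, f"{score:.4f}")

# With k = 1 only the single best answer remains
print(top_k(scores, 1))
```

Setting k to 1 turns the ranked list into a single best answer, which is only useful when the top score is clearly separated from the rest.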

In [15]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['This batch contains 2 blotched apples.',
          'The percentage of healthy apples is 97.3%',
          'The complete batch consists of 80 apples.',
          '98%\ of the apples is healthy.',
          '4 Rotten apples spoil the batch.',
          'A batch can consist of healthy, rotten, blotched or scabbed apples',
          'Apples are a healthy fruit.',
          'The batch is categorized as Class 1.',
          'The batch has been rejected.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['What is the quality of this batch?', 
           'How many scabbed apples are in this batch?', 
           'What is the percentage of healthy apples?']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))





Query: What is the quality of this batch?

Top 5 most similar sentences in corpus:
The batch has been rejected. (Score: 0.4820)




Query: How many scabbed apples are in this batch?

Top 5 most similar sentences in corpus:
A batch can consist of healthy, rotten, blotched or scabbed apples (Score: 0.7868)




Query: What is the percentage of healthy apples?

Top 5 most similar sentences in corpus:
The percentage of healthy apples is 97.3% (Score: 0.9325)


<h3><strong>These are the results of the first test:</strong></h3>

<ul>
<li><strong>Query:</strong> What is the quality of this batch?</li><br>

<li><i>Top 5 most similar sentences in corpus:</i></li><br>
<li>The batch has been rejected. (Score: 0.4820) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:MediumSeaGreen;">Great! The score is low, though...</span></li>
<li>This batch contains 2 blotched apples. (Score: 0.4491) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:DodgerBlue;">Says something about quality, but not specifically enough.</span></li>
<li>A batch can consist of healthy, rotten, blotched or scabbed apples (Score: 0.4385) &nbsp;&nbsp;&nbsp;&nbsp;<span style=color:Tomato> Does not say what the quality is, only how it is determined.</span></li>
<li>The complete batch consists of 80 apples. (Score: 0.4252) &nbsp;&nbsp;&nbsp;&nbsp;<span style=color:Tomato> Quantity, not quality.</span></li>
<li>The batch is categorized as Class 1. (Score: 0.3697) &nbsp;&nbsp;&nbsp;&nbsp;<span style=color:Tomato> This one should score much higher.</span></li>

======================

<li><strong>Query:</strong> How many scabbed apples are in this batch?</li><br>

<li><i>Top 5 most similar sentences in corpus:</i></li><br>
<li>A batch can consist of healthy, rotten, blotched or scabbed apples (Score: 0.7868) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:DodgerBlue;">Only says that the batch <i>can</i> also contain 'scabbed' apples.</span></li>
<li>This batch contains 2 blotched apples. (Score: 0.7694) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:DodgerBlue;">'Blotched', not 'scabbed'.</span></li>
<li>The complete batch consists of 80 apples. (Score: 0.7507) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:DodgerBlue;">Says something about the total, nothing about 'scabbed' apples.</span></li>
<li>4 Rotten apples spoil the batch. (Score: 0.6157) &nbsp;&nbsp;&nbsp;&nbsp; <span style=color:Tomato> 'Rotten', not 'scabbed'.</span></li>
<li>98% of the apples is healthy. (Score: 0.5437) &nbsp;&nbsp;&nbsp;&nbsp; <span style=color:Tomato> 'Healthy', not 'scabbed'.</span></li>

======================

<li><strong>Query:</strong> What is the percentage of healthy apples?</li><br>

<li><i>Top 5 most similar sentences in corpus:</i></li><br>
<li>The percentage of healthy apples is 97.3% (Score: 0.9325) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:MediumSeaGreen;">Great!</span></li>
<li>98% of the apples is healthy. (Score: 0.9070) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:MediumSeaGreen;">Great!</span></li>
<li>Apples are a healthy fruit. (Score: 0.7947) &nbsp;&nbsp;&nbsp;&nbsp; <span style="color:DodgerBlue;">Says nothing about the percentage.</span></li>
<li>A batch can consist of healthy, rotten, blotched or scabbed apples (Score: 0.6077) &nbsp;&nbsp;&nbsp;&nbsp; <span style=color:Tomato> Says nothing about the percentage.</span></li>
<li>The complete batch consists of 80 apples. (Score: 0.6046) &nbsp;&nbsp;&nbsp;&nbsp; <span style=color:Tomato> Says nothing about healthy apples or the percentage.</span></li>
</ul>

<h1><strong>What now?</strong></h1>

How can we make our own questions and answers match each other better?

In [13]:
# sentence-transformers==1.0.4, torch==1.7.0.
# import random
from collections import defaultdict
from sentence_transformers import SentenceTransformer, SentencesDataset
from sentence_transformers.losses import TripletLoss
from sentence_transformers.readers import LabelSentenceReader, InputExample
from torch.utils.data import DataLoader

# Load pre-trained model. The example this is based on used the original Sentence-BERT;
# here we try 'all-MiniLM-L6-v2', the model used earlier. Does that work too?
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Set up data for fine-tuning 
sentence_reader = LabelSentenceReader(folder='C:/MakeAIWork2/projects/apple_disease_classification/notebooks/')
data_list = sentence_reader.get_examples(filename='sbert_first.tsv')
triplets = triplets_from_labeled_dataset(input_examples=data_list)
finetune_data = SentencesDataset(examples=triplets, model=sbert_model)
finetune_dataloader = DataLoader(finetune_data, shuffle=True, batch_size=16)

# Initialize triplet loss
loss = TripletLoss(model=sbert_model)

# Fine-tune the model
sbert_model.fit(train_objectives=[(finetune_dataloader, loss)], epochs=4,output_path='all-MiniLM-L6-v2-disease')

NameError: name 'triplets_from_labeled_dataset' is not defined
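The traceback shows that `triplets_from_labeled_dataset` is never defined or imported in this notebook. What such a helper might do can be sketched in plain Python: for every labeled sentence, pick a positive with the same label and a negative with a different label. The function name `make_triplets`, the labels and the example data are all hypothetical, since the contents of sbert_first.tsv are not shown.

```python
import random
from collections import defaultdict

def make_triplets(labeled_sentences, seed=0):
    """Build (anchor, positive, negative) sentence triplets from (label, sentence) pairs."""
    rng = random.Random(seed)

    # Group sentences by their label
    by_label = defaultdict(list)
    for label, sentence in labeled_sentences:
        by_label[label].append(sentence)

    triplets = []
    for label, anchor in labeled_sentences:
        positives = [s for s in by_label[label] if s != anchor]
        negatives = [s for other, group in by_label.items() if other != label for s in group]
        if not positives or not negatives:
            continue  # need a same-label partner and a different-label sentence
        triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets

# Hypothetical labeled data (made up for illustration)
data = [
    ("rot", "This apple has rot."),
    ("rot", "4 Rotten apples spoil the batch."),
    ("healthy", "Apples are a healthy fruit."),
    ("healthy", "The percentage of healthy apples is 97.3%"),
]

for anchor, positive, negative in make_triplets(data):
    print(anchor, "|", positive, "|", negative)
```

To feed triplets like these to `TripletLoss`, each tuple would presumably be wrapped as `InputExample(texts=[anchor, positive, negative])`, the same class used in cell In [8]; that step is left out so the sketch runs without the library.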