<a href="https://colab.research.google.com/github/PeerChristensen/SemanticSearchWine/blob/main/transformer_models_embedding_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic wine search

In [1]:
from google.colab import drive
#from fastcore.all import *
#from fastai.vision.all import *

drive.mount('/content/drive',force_remount=True)

#path = Path('/content/drive/MyDrive/WineSearch')

Mounted at /content/drive


## Hugging Face API example

In [3]:
# Write the Hugging Face API token to a file and copy it to Drive.
# Use your own token here; never commit a real token to a shared notebook.
!echo "<YOUR_HF_API_TOKEN>" > hg_api_token.txt
!cp hg_api_token.txt /content/drive/MyDrive/

In [9]:
import json
import requests
from pathlib import Path

# Read the token and drop the trailing newline written by echo
api_token = Path('hg_api_token.txt').read_text().strip()

API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/msmarco-distilbert-base-tas-b"
API_URL2 = "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2"
headers = {"Authorization": f"Bearer {api_token}"}

def query(payload):
    """POST a JSON payload to the hosted inference endpoint and return the parsed JSON reply."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

data = query(
    {
        "inputs": {
            "source_sentence": "That is a happy person",
            "sentences": [
                "That is a happy dog",
                "That is a very happy person",
                "That is a unhappy person"
            ]
        }
    }
)

data

[0.8534060120582581, 0.9814601540565491, 0.8614112138748169]
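
The endpoint returns one similarity score per candidate sentence, in the same order as the `sentences` list. As a minimal sketch of how the reply above could be read, the snippet below pairs each sentence with its score and sorts them. The names `candidates`, `scores` and `ranked` are illustrative only; the values are copied from the output above.

```python
# Candidate sentences and the scores returned above, in matching order
candidates = [
    "That is a happy dog",
    "That is a very happy person",
    "That is a unhappy person",
]
scores = [0.8534060120582581, 0.9814601540565491, 0.8614112138748169]

# Pair each sentence with its score and sort from most to least similar
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

for sentence, score in ranked:
    print(f"{score:.4f}  {sentence}")
```

Note that the negated sentence still scores above the "dog" sentence, so these scores seem to reflect topical closeness more than agreement in meaning.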

## Embed reviews

In [5]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/WineSearch/wine_reviews.csv")
df.head(2)

Unnamed: 0,id,title,description,variety,country,province,points,price,link
0,0,Nicosia 2013 Vulkà Bianco (Etna),"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",White Blend,Italy,Sicily & Sardinia,87,,"<a href=https://www.wine-searcher.com/find/Nicosia+2013+Vulkà+Bianco++Etna target=""_blank"">Find it!</a>"
1,1,Quinta dos Avidagos 2011 Avidagos Red (Douro),"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.",Portuguese Red,Portugal,Douro,87,15.0,"<a href=https://www.wine-searcher.com/find/Quinta+dos+Avidagos+2011+Avidagos+Red+Douro target=""_blank"">Find it!</a>"


In [6]:
len(df)

118840

In [None]:
!pip install -U sentence_transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-v3")

In [11]:
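from numpy import save

# Encode every review description; encode() is given a plain list of strings
embeddings = model.encode(df.description.tolist(), show_progress_bar=True)

# Cache the embeddings on Drive so the encoding step does not have to be repeated
save('drive/MyDrive/WineSearch/embeddings_msmarco-MiniLM-L-6-v3.npy', embeddings)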


In [12]:
from numpy import load

embeddings = load('drive/MyDrive/WineSearch/embeddings_msmarco-MiniLM-L-6-v3.npy')
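
Because the search cells below use the row position of an embedding to look up `df.description`, a cached file only makes sense if it has exactly one row per review. The sketch below shows one way to check this at load time. `load_aligned` is a hypothetical helper, not part of NumPy or sentence-transformers, and the demo uses a tiny temporary file rather than the real embeddings.

```python
import os
import tempfile

import numpy as np


def load_aligned(path, n_rows):
    """Hypothetical helper: load a .npy matrix and verify it has one row per record."""
    emb = np.load(path)
    if emb.ndim != 2 or emb.shape[0] != n_rows:
        raise ValueError(f"expected a 2-D array with {n_rows} rows, got shape {emb.shape}")
    return emb


# Demo on a small stand-in file (3 "reviews", 4-dimensional vectors)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_embeddings.npy")
    np.save(demo_path, np.zeros((3, 4), dtype=np.float32))
    demo = load_aligned(demo_path, n_rows=3)

print(demo.shape)
```

In the notebook, the call would presumably look like `load_aligned('drive/MyDrive/WineSearch/embeddings_msmarco-MiniLM-L-6-v3.npy', len(df))`.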

In [33]:
from torch import topk
top_k = 5

query = "Full-bodied with notes of red berries"
query_embedding = model.encode(query) #, convert_to_tensor=True

# Score the query against every review with cosine similarity,
# then use torch.topk to pick the top_k highest scores
cos_scores = util.cos_sim(query_embedding, embeddings)[0]
top_results = topk(cos_scores, k=top_k)

# Embedding rows are positional, so look up the review text with .iloc
for score, idx in zip(top_results.values, top_results.indices):
  print(f"{int(idx)}, Score: {round(float(score), 4)}, \nText: {df.description.iloc[int(idx)]}\n")

47208, Score: 0.7035, 
Text: This is full bodied and supple, with attractive red-berry fruit, but also displays a slightly acrid, smoky note. Decant vigorously for best results.

87325, Score: 0.6132, 
Text: Bright cherry and blueberry notes combine with hints of leather, earth, crushed pepper and grapeseed on the nose of this bottling. It's relatively light in body, showing peppery verve on the palate plus blueberry-elderberry fruit and bay-leaf herbals.

42739, Score: 0.6089, 
Text: A robust blend of two popular varieties, this full-bodied red is basically dry and fairly tannic. It's enormously rich and extracted in blackberries and cherries, with notes of bacon, beef jerky, tobacco and scads of peppery spices. Drink now.

59243, Score: 0.6052, 
Text: Fleshy blueberries are met with soft vanilla notes on this full-fruited yet light-footed red. On the palate the bursting blueberry notes are even more appetizing and joined by pleasant pepper with just a hint of tobacco. There is someth
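
For readers who want to see what the cosine-similarity and top-k steps compute, here is a minimal NumPy-only sketch of the same idea. It is not the implementation behind `util.cos_sim` or `torch.topk`; `top_k_cosine` and the toy vectors are made up for illustration.

```python
import numpy as np


def top_k_cosine(query_vec, corpus, k=5):
    """Return (row index, cosine score) pairs for the k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    corpus = np.asarray(corpus, dtype=np.float32)

    # Scale the query and every corpus row to unit length
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)

    # For unit vectors, the dot product equals the cosine similarity
    scores = c @ q

    # Sort by descending score and keep the first k positions
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy data: three 2-D "embeddings" and a query pointing mostly along the first axis
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.1])
results = top_k_cosine(toy_query, toy_corpus, k=2)
print(results)
```

With the real data, `top_k_cosine(query_embedding, embeddings, k=top_k)` should yield the same ranking as the cell above, up to floating-point differences.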

## msmarco-distilbert-base-v4

In [None]:
model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v4")
# Encode every review description; encode() is given a plain list of strings
embeddings = model.encode(df.description.tolist(), show_progress_bar=True)
save('drive/MyDrive/WineSearch/embeddings_msmarco-distilbert-base-v4.npy', embeddings)


In [35]:
embeddings = load('drive/MyDrive/WineSearch/embeddings_msmarco-distilbert-base-v4.npy')

from torch import topk
top_k = 5

query = "Full-bodied with notes of red berries"
query_embedding = model.encode(query) #, convert_to_tensor=True

# Score the query against every review with cosine similarity,
# then use torch.topk to pick the top_k highest scores
cos_scores = util.cos_sim(query_embedding, embeddings)[0]
top_results = topk(cos_scores, k=top_k)

# Embedding rows are positional, so look up the review text with .iloc
for score, idx in zip(top_results.values, top_results.indices):
  print(f"{int(idx)}, Score: {round(float(score), 4)}, \nText: {df.description.iloc[int(idx)]}\n")

87325, Score: 0.5721, 
Text: Bright cherry and blueberry notes combine with hints of leather, earth, crushed pepper and grapeseed on the nose of this bottling. It's relatively light in body, showing peppery verve on the palate plus blueberry-elderberry fruit and bay-leaf herbals.

59243, Score: 0.5713, 
Text: Fleshy blueberries are met with soft vanilla notes on this full-fruited yet light-footed red. On the palate the bursting blueberry notes are even more appetizing and joined by pleasant pepper with just a hint of tobacco. There is something rather seductive about these fruit-driven reds that deliver more than simple fruit. Utterly delicious, very elegant and dangerously drinkable.

63916, Score: 0.57, 
Text: Bright and concentrated at the same time, with strong notes of black berries and cassis backed by softer notes of yellow flowers, almond skin, rosemary and pepper. The mouth is lightweight but full, with soft acids and a short but clean finish. Very approachable; drink now.


## all-mpnet-base-v2

In [None]:
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
# Encode every review description; encode() is given a plain list of strings
embeddings = model.encode(df.description.tolist(), show_progress_bar=True)
save('drive/MyDrive/WineSearch/embeddings_all-mpnet-base-v2.npy', embeddings)


In [38]:
embeddings = load('drive/MyDrive/WineSearch/embeddings_all-mpnet-base-v2.npy')

from torch import topk
top_k = 5

query = "Full-bodied with notes of red berries"
query_embedding = model.encode(query) #, convert_to_tensor=True

# Score the query against every review with cosine similarity,
# then use torch.topk to pick the top_k highest scores
cos_scores = util.cos_sim(query_embedding, embeddings)[0]
top_results = topk(cos_scores, k=top_k)

# Embedding rows are positional, so look up the review text with .iloc
for score, idx in zip(top_results.values, top_results.indices):
  print(f"{int(idx)}, Score: {round(float(score), 4)}, \nText: {df.description.iloc[int(idx)]}\n")

71371, Score: 0.809, 
Text: Full-bodied and heavy, with a soft texture framing red and black currant, leather, licorice and spice flavors. Would benefit from greater liveliness. Drink now.

47208, Score: 0.795, 
Text: This is full bodied and supple, with attractive red-berry fruit, but also displays a slightly acrid, smoky note. Decant vigorously for best results.

10599, Score: 0.791, 
Text: Broadly fruity, with a palate of mixed red berries and cherries. Light and balanced, this is well-made, of medium length, and ready to drink right away.

91801, Score: 0.7752, 
Text: This is a full-bodied, fruity selection that's packed with a red berry flavor, soft tannins and a delicious, forward and bright character. It is already ready to drink.

51513, Score: 0.7746, 
Text: There is lovely richness here, with fresh acidity and red berry fruit flavors. The texture is round, lightly chewy, with some wood and final stalky fruit.
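
One rough way to compare the three models is to check how much their result lists overlap for the same query. The sketch below computes the Jaccard overlap of the review indices printed above. The `top_ids` lists are copied from the visible outputs, which appear to be cut short for the first two models, so the numbers are only indicative. `jaccard` is an illustrative helper, not a library function.

```python
from itertools import combinations

# Review indices as they appear in the printed outputs above
# (the first two lists are shorter than top_k because those outputs are incomplete)
top_ids = {
    "msmarco-MiniLM-L-6-v3": [47208, 87325, 42739, 59243],
    "msmarco-distilbert-base-v4": [87325, 59243, 63916],
    "all-mpnet-base-v2": [71371, 47208, 10599, 91801, 51513],
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two id lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Overlap for every pair of models
overlaps = {
    (m1, m2): jaccard(top_ids[m1], top_ids[m2])
    for m1, m2 in combinations(top_ids, 2)
}

for (m1, m2), value in overlaps.items():
    print(f"{m1} vs {m2}: {value:.3f}")
```

On these lists the two msmarco models share two reviews, while all-mpnet-base-v2 shares only one review with msmarco-MiniLM-L-6-v3 and none with msmarco-distilbert-base-v4.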

