# Sentence embeddings
We will mainly use `sentence-transformers`, which is a dedicated package from Hugging Face 🤗. 

Relevant documentation
- Semantic textual similarity https://www.sbert.net/docs/usage/semantic_textual_similarity.html
- Semantic search https://www.sbert.net/examples/applications/semantic-search/README.html

In [None]:
!pip install -U sentence-transformers faiss-cpu langchain  "unstructured[md]" openai nest-asyncio streamlit

### From word embeddings to sentence embeddings

In [7]:
import nest_asyncio
nest_asyncio.apply()

In [8]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
embedding

  from .autonotebook import tqdm as notebook_tqdm


array([[-1.76214680e-01,  1.20601490e-01, -2.93624043e-01,
        -2.29858160e-01, -8.22923556e-02,  2.37709701e-01,
         3.39984596e-01, -7.80964196e-01,  1.18127435e-01,
         1.63373843e-01, -1.37715429e-01,  2.40282565e-01,
         4.25125778e-01,  1.72417641e-01,  1.05280034e-01,
         5.18164277e-01,  6.22214526e-02,  3.99285913e-01,
        -1.81652635e-01, -5.85578501e-01,  4.49724011e-02,
        -1.72750384e-01, -2.68443584e-01, -1.47386163e-01,
        -1.89217702e-01,  1.92150414e-01, -3.83842826e-01,
        -3.96007091e-01,  4.30648834e-01, -3.15320015e-01,
         3.65949929e-01,  6.05159178e-02,  3.57325375e-01,
         1.59736529e-01, -3.00983638e-01,  2.63250142e-01,
        -3.94310504e-01,  1.84855461e-01, -3.99549633e-01,
        -2.67889559e-01, -5.45117497e-01, -3.13404575e-02,
        -4.30644214e-01,  1.33278072e-01, -1.74793854e-01,
        -4.35465217e-01, -4.77379173e-01,  7.12554380e-02,
        -7.37003982e-02,  5.69136977e-01, -2.82579482e-0

See, a sentence embedding is just a vector, just like a word embedding. That means we can also calculate similarities in a similar way:

In [9]:

# Two lists of sentences
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome!']

sentences2 = ['The dog plays in the garden',
              'My plants look a bit sick, could it be bitrot?',
              'The new movie is so great!']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2853
A man is playing guitar 		 My plants look a bit sick, could it be bitrot? 		 Score: -0.0119
The new movie is awesome! 		 The new movie is so great! 		 Score: 0.9463


## Semantic search and retrieval

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.


![title](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png
)

Instead of trying to build a semantic search engine from first principles, we'll use `langchain`. 

## [Don't run this again] Crawl the Vlerick website using Apify

The following code crawls the Vlerick website so we have some text to model. It's just example code. 

Langchain supports more than 100 integrations, so depending on where you find interesting data you'll need to use something else.

In [None]:
# from langchain.utilities import ApifyWrapper
# import os

# os.environ["APIFY_API_TOKEN"] = ""

# apify = ApifyWrapper()
# # Call the Actor to obtain text from the crawled webpages
# loader = apify.call_actor(
#     actor_id="apify/website-content-crawler",
#     run_input={
#         "startUrls": [{"url": "https://www.vlerick.com/en/"}]
#     },
#     dataset_mapping_function=lambda item: Document(
#         page_content=item["text"] or "", metadata={"source": item["url"]}
#     ),
# )


## Create new vector store and embed all documents

In [3]:
from langchain.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

In [10]:
# Let's load all documents
# Adapt this code to your own source of data.

from langchain.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint

file_path='./vlerick_website_scrape/2023-12-24-vlerick-website.json'
data = json.loads(Path(file_path).read_text())


In [11]:
data

[{'url': 'https://www.vlerick.com/en/',
  'crawl': {'loadedUrl': 'https://www.vlerick.com/en/',
   'loadedTime': '2023-12-24T10:11:54.596Z',
   'referrerUrl': 'https://www.vlerick.com/en/',
   'depth': 0,
   'httpStatusCode': 200},
  'metadata': {'canonicalUrl': 'https://www.vlerick.com/en/',
   'title': 'Vlerick Business School | Vlerick Business School',
   'description': 'A place where entrepreneurial dreams are born and game-changing ideas become reality',
   'author': None,
   'keywords': None,
   'languageCode': 'en'},
  'screenshotUrl': None,
  'text': 'LIVE LEARN LEAP\nDiscover our programmes\nMBAs, Masters and executive programmes\nTAKE YOUR NEXT LEAP WITH A LEADING EUROPEAN BUSINESS SCHOOL \nThe world needs entrepreneurial leaders – and this is where you get the cutting-edge learning and insight to have even greater impact. Vlerick is a top-ranked, triple accredited business school, located in Brussels. Together, we’ll make business a force for good.\n25,560\nalumni in 99 cou

In [129]:
# source: https://python.langchain.com/docs/expression_language/cookbook/retrieval


text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents + gitbookloader + confluence)
documents[0]

Created a chunk of size 3347, which is longer than the specified 1000
Created a chunk of size 1119, which is longer than the specified 1000
Created a chunk of size 1070, which is longer than the specified 1000
Created a chunk of size 1066, which is longer than the specified 1000
Created a chunk of size 1584, which is longer than the specified 1000
Created a chunk of size 17816, which is longer than the specified 1000
Created a chunk of size 1670, which is longer than the specified 1000
Created a chunk of size 3916, which is longer than the specified 1000
Created a chunk of size 1378, which is longer than the specified 1000
Created a chunk of size 2250, which is longer than the specified 1000
Created a chunk of size 6484, which is longer than the specified 1000
Created a chunk of size 1768, which is longer than the specified 1000
Created a chunk of size 1069, which is longer than the specified 1000
Created a chunk of size 2267, which is longer than the specified 1000
Created a chunk of 

Document(page_content='QUESTION: If API is supported as Input Source, please indicate any possible pre-requirement (if any), \nANSWER: For a complete description of the Metamaze REST API, please see https://app.metamaze.eu/docs/index.html', metadata={'source': 'rfpgpt/resources/faq/question_170.md'})

In [10]:
%%time

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# to test, use query_result = embeddings.embed_query("My text")

if False: # change to True if you want to (re)create your store   
    vectorstore = FAISS.from_documents(
        documents, embedding=embeddings
    )
    # store because this is slow
    vectorstore.save_local("vectorstore") 

CPU times: user 183 ms, sys: 102 ms, total: 284 ms
Wall time: 316 ms


In [11]:
import os
os.getcwd()


'/Users/jospolfliet/src/vlerick'

In [12]:
vectorstore = FAISS.load_local("vectorstore", embeddings)
vectorstore.index

<faiss.swigfaiss.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x2b3329ec0> >

In [27]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough, RunnableMap, RunnableSequence
from langchain.chat_models import ChatOpenAI
from operator import itemgetter

retriever = vectorstore.as_retriever(k=8)


In [None]:
import streamlit as st


## Show classification based on Sentence Embeddings