# Sentence embeddings
We will mainly use `sentence-transformers`, which is a dedicated package from Hugging Face 🤗. 

Relevant documentation
- Semantic textual similarity https://www.sbert.net/docs/usage/semantic_textual_similarity.html
- Semantic search https://www.sbert.net/examples/applications/semantic-search/README.html

In [None]:
!pip install -U sentence-transformers faiss-cpu langchain langchain-community "unstructured[all-docs]" openai nest-asyncio streamlit jq

### From word embeddings to sentence embeddings

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
embedding

See, a sentence embedding is just a vector, just like a word embedding. That means we can also calculate similarities in a similar way:

In [None]:

# Two lists of sentences
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome!']

sentences2 = ['The dog plays in the garden',
              'My plants look a bit sick, could it be bitrot?',
              'The new movie is so great!']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

## Semantic search and retrieval

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.


![title](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png
)

Instead of trying to build a semantic search engine from first principles, we'll use `langchain`. 

## [Don't run this again] Crawl the Vlerick website using Apify

The following code crawls the Vlerick website so we have some text to model. It's just example code. 

Langchain supports more than 100 integrations, so depending on where you find interesting data you'll need to use something else.

In [None]:
# from langchain.utilities import ApifyWrapper
# import os

# os.environ["APIFY_API_TOKEN"] = ""

# apify = ApifyWrapper()
# # Call the Actor to obtain text from the crawled webpages
# loader = apify.call_actor(
#     actor_id="apify/website-content-crawler",
#     run_input={
#         "startUrls": [{"url": "https://www.vlerick.com/en/"}]
#     },
#     dataset_mapping_function=lambda item: Document(
#         page_content=item["text"] or "", metadata={"source": item["url"]}
#     ),
# )


## Create new vector store and embed all documents
Source: https://python.langchain.com/docs/expression_language/cookbook/retrieval

In [1]:
# Let's load all documents
# Adapt this code to your own source of data.

from langchain_community.document_loaders import DirectoryLoader
from pathlib import Path
from pprint import pprint

from langchain.document_loaders import TextLoader, DirectoryLoader, JSONLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

In [4]:
%time

loader = DirectoryLoader('DataFiles', silent_errors=True)
course_docs = loader.load()

print(f"Number of documents {len(course_docs)}")

CPU times: total: 0 ns
Wall time: 0 ns


  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


Number of documents 1


In [8]:
course_docs[0].metadata

{'source': 'DataFiles\\Mémoire - Mathieu Demarets - MASTER.pdf'}

In [None]:

# from langchain_community.document_loaders import ApifyDatasetLoader
# from langchain_community.document_loaders.base import Document

# loader = ApifyDatasetLoader(
#     dataset_id="RcArHfVs80xOg9IKs",
#     dataset_mapping_function=lambda dataset_item: Document(
#         page_content=dataset_item["text"], metadata={"source": dataset_item["url"]}
#     ),
# )
# website_docs = loader.load()
# print(f"Number of documents {len(website_docs)}")
# website_docs = [doc for doc in website_docs if not doc.page_content.startswith("Your choice regarding cookies on this site")]
# print(f"Number of non-trivial documents {len(website_docs)}")
# website_docs[5:7]

In [None]:
# text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
# documents = text_splitter.split_documents(course_docs + website_docs)
# documents[0]
# print(f"Number of chunks {len(documents)}")

In [12]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(course_docs)
documents[0]
print(f"Number of chunks {len(documents)}")

Created a chunk of size 1627, which is longer than the specified 1000
Created a chunk of size 1801, which is longer than the specified 1000
Created a chunk of size 1447, which is longer than the specified 1000
Created a chunk of size 1279, which is longer than the specified 1000
Created a chunk of size 1569, which is longer than the specified 1000
Created a chunk of size 1100, which is longer than the specified 1000
Created a chunk of size 1122, which is longer than the specified 1000
Created a chunk of size 1082, which is longer than the specified 1000
Created a chunk of size 1250, which is longer than the specified 1000
Created a chunk of size 1027, which is longer than the specified 1000
Created a chunk of size 1393, which is longer than the specified 1000
Created a chunk of size 1214, which is longer than the specified 1000
Created a chunk of size 1122, which is longer than the specified 1000
Created a chunk of size 1390, which is longer than the specified 1000
Created a chunk of s

Number of chunks 352


### Embed into a vector store - and cache the results
We got a decent store of data loaded into memory now. Next thing we need to do is calculate sentence embeddings. 
We'll use simple, reasonably fast embeddings that we can calculate locally withouting requiring an expensive GPU or cloud service like OpenAI's GPTx.

In [26]:
%%time

# embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings = HuggingFaceEmbeddings(model_name="dangvantuan/sentence-camembert-large")

# query_result = embeddings.embed_query("What is the difference between a data scientist and a data engineer?")

if True: # change to True if you want to (re)create your store   
    vectorstore = FAISS.from_documents(
        documents, embedding=embeddings
    )
    # store because this is slow
    vectorstore.save_local("vectorstore") 

.gitattributes: 100%|██████████| 1.23k/1.23k [00:00<?, ?B/s]
README.md: 100%|██████████| 5.25k/5.25k [00:00<?, ?B/s]
config.json: 100%|██████████| 683/683 [00:00<00:00, 685kB/s]
model.safetensors:   6%|▌         | 83.9M/1.35G [00:05<01:29, 14.1MB/s]


KeyboardInterrupt: 

In [17]:
vectorstore = FAISS.load_local("vectorstore", embeddings)
vectorstore.index

<faiss.swigfaiss.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x00000222C2B1EB80> >

In [21]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.chat_models import ChatOpenAI
from operator import itemgetter

retriever = vectorstore.as_retriever(k=1)


In [25]:
def q(s):
    results = retriever.get_relevant_documents(s)
    for doc in results:
        print("#"*100)
        print(doc.metadata["source"])
        print("#"*100)
        print(doc.page_content)
print(q("Quelle sont les taux de passage pas décile socio-économiques pour l'examen d'entrée?"))

####################################################################################################
DataFiles\Mémoire - Mathieu Demarets - MASTER.pdf
####################################################################################################
2.3.4 : Impact du milieu socio-économique d’origine sur les risques concur- rents

Afin d’augmenter la précision et la fiabilité de nos conclusions, il faut pallier la petite taille d’échantillon après l’examen d’entrée. En effet, il est dérisoire de réaliser des comparaisons pour les risques concurrents par décile d’ISE alors que nous n’avons que 45 étudiants dans le premier quartile d’ISE à l’admission après l’examen d’entrée.
####################################################################################################
DataFiles\Mémoire - Mathieu Demarets - MASTER.pdf
####################################################################################################
En cumulant l’effet que l’examen d’entrée a eu sur le mix socia

In [None]:
q("what type of awards does vlerick give")