# Sentence embeddings
We will mainly use `sentence-transformers`, which is a dedicated package from Hugging Face ü§ó. 

Relevant documentation
- Semantic textual similarity https://www.sbert.net/docs/usage/semantic_textual_similarity.html
- Semantic search https://www.sbert.net/examples/applications/semantic-search/README.html

In [None]:
# !pip install -U sentence-transformers faiss-cpu langchain langchain-community "unstructured[all-docs]" openai nest-asyncio streamlit jq

### From word embeddings to sentence embeddings

In [6]:
import nest_asyncio
nest_asyncio.apply()

In [7]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2') # https://www.sbert.net/docs/pretrained_models.html

# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
embedding

array([[-1.76214680e-01,  1.20601490e-01, -2.93624043e-01,
        -2.29858160e-01, -8.22923556e-02,  2.37709701e-01,
         3.39984596e-01, -7.80964196e-01,  1.18127435e-01,
         1.63373843e-01, -1.37715429e-01,  2.40282565e-01,
         4.25125778e-01,  1.72417641e-01,  1.05280034e-01,
         5.18164277e-01,  6.22214526e-02,  3.99285913e-01,
        -1.81652635e-01, -5.85578501e-01,  4.49724011e-02,
        -1.72750384e-01, -2.68443584e-01, -1.47386163e-01,
        -1.89217702e-01,  1.92150414e-01, -3.83842826e-01,
        -3.96007091e-01,  4.30648834e-01, -3.15320015e-01,
         3.65949929e-01,  6.05159178e-02,  3.57325375e-01,
         1.59736529e-01, -3.00983638e-01,  2.63250142e-01,
        -3.94310504e-01,  1.84855461e-01, -3.99549633e-01,
        -2.67889559e-01, -5.45117497e-01, -3.13404575e-02,
        -4.30644214e-01,  1.33278072e-01, -1.74793854e-01,
        -4.35465217e-01, -4.77379173e-01,  7.12554380e-02,
        -7.37003982e-02,  5.69136977e-01, -2.82579482e-0

In [8]:
embedding.shape

(1, 384)

See, a sentence embedding is just a vector, just like a word embedding. That means we can also calculate similarities in a similar way:

In [11]:

# Two lists of sentences - source https://www.sbert.net/
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome!']

sentences2 = ['The dog plays in the garden',
              'My plants look a bit sick, could it be bitrot?',
              'The film I just saw really sucked']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2853
A man is playing guitar 		 My plants look a bit sick, could it be bitrot? 		 Score: -0.0119
The new movie is awesome! 		 The film I just saw really sucked 		 Score: 0.3771


## Semantic search and retrieval

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.


![title](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png
)

Instead of trying to build a semantic search engine from first principles, we'll use `langchain`. 

## [Don't run this again] Crawl the Vlerick website using Apify

The following code crawls the Vlerick website so we have some text to model. It's just example code. 

Langchain supports more than 100 integrations, so depending on where you find interesting data you'll need to use something else.

In [None]:
# from langchain.utilities import ApifyWrapper
# import os

# os.environ["APIFY_API_TOKEN"] = ""

# apify = ApifyWrapper()
# # Call the Actor to obtain text from the crawled webpages
# loader = apify.call_actor(
#     actor_id="apify/website-content-crawler",
#     run_input={
#         "startUrls": [{"url": "https://www.vlerick.com/en/"}]
#     },
#     dataset_mapping_function=lambda item: Document(
#         page_content=item["text"] or "", metadata={"source": item["url"]}
#     ),
# )


## Create new vector store and embed all documents
Source: https://python.langchain.com/docs/expression_language/cookbook/retrieval

In [13]:
# Let's load all documents
# Adapt this code to your own source of data.

from langchain_community.document_loaders import DirectoryLoader
from pathlib import Path
from pprint import pprint

from langchain.document_loaders import TextLoader, DirectoryLoader, JSONLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

### Source 1: MAI 2023 dump

In [14]:
%time

loader = DirectoryLoader('/Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/', silent_errors=True)
course_docs = loader.load()

print(f"Number of documents {len(course_docs)}")

CPU times: user 3 ¬µs, sys: 2 ¬µs, total: 5 ¬µs
Wall time: 8.11 ¬µs


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Error loading file /Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/People Analytics/~$Enagement_Data_Vlerick.xlsx: Excel file format cannot be determined, you must specify an engine manually.
Error loading file /Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/People Analytics/~$Fair_Pay_Vlerick.xlsx: Excel file format cannot be determined, you must specify an engine manually.
Error loading file /Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/Sustainable AI/~$anscription interview Brainjar.docx: Package not found at '/Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/Sustainable AI/~$anscription interview Brainjar.docx'
Error loading file /Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/Sustainable AI/~$

Number of documents 68


In [15]:
course_docs[9:12]

[Document(page_content='D E L O I T T E B E L G I U M\n\nModule 3: Strategic Workforce Planning\n\n1 6 T H O F N O V E M B E R 2 0 2 2\n\n1 | Copyright ¬© 2022 Deloitte Development LLC. All rights reserved.\n\nHere today\n\nThomas Pensaert Strategic Workforce Intelligence\n\nSenior Manager\n\nReward Consultant\n\nData Miner - Predictive Maintenance\n\nHR Report (SAP BO)\n\nBa c k g r ou n d\n\nMaster-after-Master in Computational Statistics\n\nPost-Graduate in Big Data\n\nMaster In Industrial Psychology\n\nStrategic Workforce Intelligence (People Analytics)\n\n2\n\nS W P O V E R V I E W\n\nThe Workforce Planning spectrum\n\nVarious types of Workforce Planning exist, each with their own focus and goals. SWP focuses on the long term, takes external driving forces into account and takes a less detailed approach to planning.\n\nWorkforce Management\n\nHeadcount Planning\n\nOperational Planning\n\nStrategic Workforce Planning\n\nSchedule staffing supply to short-term forecasted demand.\n\nM

### Source 2: Vlerick website

In [16]:

from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_community.document_loaders.base import Document

loader = ApifyDatasetLoader(
    dataset_id="RcArHfVs80xOg9IKs",
    dataset_mapping_function=lambda dataset_item: Document(
        page_content=dataset_item["text"], metadata={"source": dataset_item["url"]}
    ),
)
website_docs = loader.load()
print(f"Number of documents {len(website_docs)}")
website_docs = [doc for doc in website_docs if not doc.page_content.startswith("Your choice regarding cookies on this site")]
print(f"Number of non-trivial documents {len(website_docs)}")
website_docs[5:7]

Number of documents 1221
Number of non-trivial documents 679


[Document(page_content='Why give to Vlerick?\nBy giving back to Vlerick, you‚Äôll provide direct support for pioneering entrepreneurship, help drive the School‚Äôs strategic projects or contribute to scholarships that will attract bright minds from around the world with the capacity to change that world for the better.\nGive back with your class\nThe Vlerick experience is life-changing. Very often, the bonds you forge here end up lasting a lifetime. \nPUB90 was the first to set up a class donation. ‚ÄúIn the short term, we want to offer support to cover the cost of living for a promising student. In the long term, we envisage contributing to the Scholarship Fund. Vlerick has given us so much. Not just knowledge and insights but also friendship, laughter and a sense of purpose.‚Äù\nDo you want to reunite with your classmates and reconnect with the place that shaped who you are today? Contact us to celebrate a milestone reunion.\nMake a difference to our collective future\n‚ÄúSocial entr

In [17]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(course_docs + website_docs)
documents[0]
print(f"Number of chunks {len(documents)}")

Created a chunk of size 2143, which is longer than the specified 1000
Created a chunk of size 1935, which is longer than the specified 1000
Created a chunk of size 1566, which is longer than the specified 1000
Created a chunk of size 2234, which is longer than the specified 1000
Created a chunk of size 1251, which is longer than the specified 1000
Created a chunk of size 2234, which is longer than the specified 1000
Created a chunk of size 36468, which is longer than the specified 1000
Created a chunk of size 1731, which is longer than the specified 1000
Created a chunk of size 2680, which is longer than the specified 1000


Number of chunks 1918


### Embed into a vector store - and cache the results
We got a decent store of data loaded into memory now. Next thing we need to do is calculate sentence embeddings. 
We'll use simple, reasonably fast embeddings that we can calculate locally withouting requiring an expensive GPU or cloud service like OpenAI's GPTx.

In [18]:
%%time

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# to test, use query_result = embeddings.embed_query("My text")

if True: # change to True if you want to (re)create your store   
    vectorstore = FAISS.from_documents(
        documents, embedding=embeddings
    )
    # store because this is slow
    vectorstore.save_local("vectorstore") 

CPU times: user 1min 16s, sys: 25.5 s, total: 1min 42s
Wall time: 31.3 s


In [19]:
vectorstore = FAISS.load_local("vectorstore", embeddings)
vectorstore.index

<faiss.swigfaiss.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x103c2dd40> >

In [20]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.chat_models import ChatOpenAI
from operator import itemgetter

retriever = vectorstore.as_retriever(k=8)


In [21]:
def q(s):
    results = retriever.get_relevant_documents(s)
    for doc in results:
        print("#"*100)
        print(doc.metadata["source"])
        print("#"*100)
        print(doc.page_content)
q("stochastic gradient descent")

####################################################################################################
/Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/Deep_Learning/MAI01-neural networks - handouts.pdf
####################################################################################################
ùúïùúÄ ùúïùúÄ ùúïùë§4 = ùúïùë¶ ùúïùëú4 ùúïùë§5 ùúïùë§4 ùúïùëú4

ùúïùë¶ ùúïùë§5 ùúïùë§4 ùúïùëú3

ùúïùúÄ ùúïùë§3 = ùúïùë§5 ùúïùëú4 ùúïùëú4 ùúïùë§4

ùúïùúÄ ùúïùë¶

ùúïùë¶ ùúïùë§5

ùúïùë§4 ùúïùëú3

ùúïùëú3 ùúïùë§3

ùúïùë§3 ùúïùëú2

ùúïùëú2 ùúïùë§2

ùúïùúÄ ùúïùë§5 = ùúïùë§5 ùúïùëú4 ùúïùëú3 ùúïùë§3

ùúïùúÄ ùúïùë¶ ùúïùëú4 ùúïùë§4

ùúïùë¶ ùúïùë§5

¬© Prof. dr. Philippe Baecke

KEY ELEMENTS OF NEURAL NETWORKS

Gradient descent: ‚ñ™

In reality, loss landscape may not be smooth

w2

Gradient descent

Loss

Source: https://www.cs.umd.edu/~tomg/projects/landscapes/

w1

¬© Prof. dr. Philippe Baecke

KEY ELEMENTS OF NE

In [23]:
q("what type of prizes does vlerick give")

####################################################################################################
https://www.vlerick.com/en/insights/tim-van-hauwermeiren-and-pieter-loose-win-the-vlerick-award-2022/
####################################################################################################
During an award show on Wednesday 15 June at the Handelsbeurs in Ghent, the winners of the 21st edition of the Vlerick Award were announced in the presence of numerous business leaders and alumni. The Vlerick Award is presented annually by Vlerick Business School as a tribute to two successful Vlerick alumni who, as entrepreneurs and business leaders, are at the helm of fast-growing organisations.
The Vlerick Enterprising Leader Award went to Tim Van Hauwermeiren, co-founder and CEO of argenx. Pieter Loose, CEO of Ekopak, received the Vlerick Venture Award.
The Vlerick Enterprising Leader Award is bestowed by Vlerick Business School on a business leader and Vlerick alumnus with a convinc