# Sentence embeddings
We will mainly use `sentence-transformers`, which is a dedicated package from Hugging Face 🤗. 

Relevant documentation
- Semantic textual similarity https://www.sbert.net/docs/usage/semantic_textual_similarity.html
- Semantic search https://www.sbert.net/examples/applications/semantic-search/README.html

In [None]:
# !pip install -U sentence-transformers faiss-cpu langchain langchain-community "unstructured[all-docs]" openai nest-asyncio streamlit jq

### From word embeddings to sentence embeddings

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [3]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2') # https://www.sbert.net/docs/pretrained_models.html

# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
embedding

array([[-1.76214159e-01,  1.20600946e-01, -2.93624103e-01,
        -2.29858249e-01, -8.22925270e-02,  2.37709418e-01,
         3.39985251e-01, -7.80964315e-01,  1.18127592e-01,
         1.63373768e-01, -1.37715191e-01,  2.40282565e-01,
         4.25125331e-01,  1.72417864e-01,  1.05279565e-01,
         5.18164277e-01,  6.22217394e-02,  3.99286211e-01,
        -1.81652650e-01, -5.85578680e-01,  4.49721254e-02,
        -1.72750533e-01, -2.68443465e-01, -1.47385836e-01,
        -1.89217985e-01,  1.92150563e-01, -3.83842617e-01,
        -3.96007031e-01,  4.30648953e-01, -3.15319538e-01,
         3.65949690e-01,  6.05157800e-02,  3.57325613e-01,
         1.59736484e-01, -3.00984204e-01,  2.63250291e-01,
        -3.94311011e-01,  1.84855521e-01, -3.99549007e-01,
        -2.67889529e-01, -5.45117259e-01, -3.13403197e-02,
        -4.30643976e-01,  1.33278221e-01, -1.74793825e-01,
        -4.35465395e-01, -4.77379024e-01,  7.12557212e-02,
        -7.37001002e-02,  5.69136739e-01, -2.82579124e-0

In [4]:
embedding.shape

(1, 384)

See, a sentence embedding is just a vector (here with 384 dimensions), exactly like a word embedding. That means we can compute similarities between sentences the same way:

In [9]:

# Two lists of sentences - source https://www.sbert.net/
sentences1 = ['The new movie is awesome!']

sentences2 = ['The dog plays in the garden',
              'My plants look a bit sick, could it be bitrot?',
              'The film I just saw really sucked',
              'The film I just saw is just really good',
              'The film I just saw deserves 10 oscars'
              ]

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the pairs with their scores
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("{} \t\t {} \t\t Score: {:.3f}".format(sentences1[i], sentences2[j], cosine_scores[i][j]))

The new movie is awesome! 		 The dog plays in the garden 		 Score: 0.112
The new movie is awesome! 		 My plants look a bit sick, could it be bitrot? 		 Score: -0.116
The new movie is awesome! 		 The film I just saw really sucked 		 Score: 0.377
The new movie is awesome! 		 The film I just saw is just really good 		 Score: 0.611
The new movie is awesome! 		 The film I just saw deserves 10 oscars 		 Score: 0.376
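
To make these scores less of a black box, here is a minimal plain-Python sketch of the cosine similarity that `util.cos_sim` computes. The three-dimensional vectors are made-up toy values, not real model outputs, and `cosine_similarity` is a hypothetical helper name chosen for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # a scaled copy of query
orthogonal = [0.0, 0.0, 3.0]      # shares no component with query
opposite = [-1.0, -2.0, 0.0]      # points the other way

print(cosine_similarity(query, same_direction))  # close to 1
print(cosine_similarity(query, orthogonal))      # close to 0
print(cosine_similarity(query, opposite))        # close to -1
```

Only the direction of the vectors matters, not their length, which is why the scaled copy scores (almost exactly) 1. Scores always fall between -1 and 1, matching the range seen in the output above.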


## Semantic search and retrieval

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.


![Semantic search](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)
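
The idea described above can be sketched as a brute-force search: score every corpus vector against the query vector and keep the best few. This is a minimal illustration in plain Python, assuming tiny hand-written two-dimensional "embeddings" instead of real model output. `semantic_search` here is a hypothetical helper written for this sketch, not the library function of the same name.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (corpus_index, score) pairs for the top_k most similar entries."""
    scored = ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    best = heapq.nlargest(top_k, scored)  # highest scores first
    return [(idx, score) for score, idx in best]


# Hypothetical 2-d vectors; the model used earlier produces 384 dimensions.
corpus = ["text about films", "text about gardening", "text about cinema"]
corpus_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
query_vec = [1.0, 0.0]  # pretend embedding of a movie-related query

for idx, score in semantic_search(query_vec, corpus_vecs, top_k=2):
    print(f"{score:.3f}  {corpus[idx]}")
```

Scoring every entry is fine for small corpora. Vector stores such as FAISS exist to make the same lookup efficient at larger scale.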

Instead of trying to build a semantic search engine from first principles, we'll use `langchain`. 

## [Don't run this again] Crawl the Vlerick website using Apify

The following code crawls the Vlerick website so we have some text to model. It's just example code. 

LangChain supports more than 100 integrations, so depending on where your data lives you will need a different loader than the one shown here.

In [None]:
# from langchain_community.utilities import ApifyWrapper
# from langchain_core.documents import Document  # used by dataset_mapping_function below
# import os

# os.environ["APIFY_API_TOKEN"] = ""

# apify = ApifyWrapper()
# # Call the Actor to obtain text from the crawled webpages
# loader = apify.call_actor(
#     actor_id="apify/website-content-crawler",
#     run_input={
#         "startUrls": [{"url": "https://www.vlerick.com/en/"}]
#     },
#     dataset_mapping_function=lambda item: Document(
#         page_content=item["text"] or "", metadata={"source": item["url"]}
#     ),
# )


## Create new vector store and embed all documents
Source: https://python.langchain.com/docs/expression_language/cookbook/retrieval

In [10]:
# Let's load all documents
# Adapt this code to your own source of data.

from pathlib import Path
from pprint import pprint

# Loaders, embeddings and vector stores live in langchain_community;
# each name is imported once.
from langchain_community.document_loaders import DirectoryLoader, JSONLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter

### Source 1: MAI 2023 dump

In [13]:
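%%time
# The cell magic %%time times the whole cell. A bare %time line times an empty
# statement, which is why the timing recorded below is only a few microseconds.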

loader = DirectoryLoader('example data/MAI-2023 dump/', silent_errors=True)
course_docs = loader.load()

print(f"Number of documents {len(course_docs)}")

CPU times: user 2 μs, sys: 9 μs, total: 11 μs
Wall time: 13.1 μs


Error loading file example data/MAI-2023 dump/Introduction To Business Statistics/WP2 Probability - Annotation.pdf: empty_like method already has a different docstring
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Number of documents 62


In [14]:
course_docs[9:12]

[Document(metadata={'source': 'example data/MAI-2023 dump/People Analytics/Module 4- From Descriptive2Diagnostic.pdf'}, page_content='Module 4 From Descriptive to Diagnostics\n\nM1: Introduction\n\nToday\n\n1.\n\nIntroduction:\n\na. Human Resources: from 1930 to tomorrow b. People Analytics: creating business value\n\n2. From Operational to Descriptive a. KPIs & Surveys: exercise 1 b. Masterclass 1: An gentle introduction to Psychometrics\n\n3. Strategic Workforce Planning: Thomas Pensaert 4. From Descriptive to Diagnostics: a. Case study: A Retention Model b. Masterclass 2: A gentle introduction to Organisational Network Analysis\n\n5. From Diagnostics to Predictive/Prescriptive: a. Fair Pay b. Presentation 4 groups and evaluation\n\nPart 1 Diagnostics: A Retention Case\n\nM4: From Descriptive to Diagnostics\n\nWhy did something happen?\n\nQuery Why did it happen?\n\nDiscovery Where should we look?\n\nS E V I T C E J B O\n\nS K S A T D N A S N O I T C A\n\nT U P T U O\n\nS E L O R\n\n

### Source 2: Vlerick website

In [19]:

from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id="T0s7afek7lckeNDKO",
    dataset_mapping_function=lambda dataset_item: Document(
        page_content=dataset_item["text"], metadata={"source": dataset_item["url"]}
    ),
)

website_docs = loader.load()
print(f"Number of documents {len(website_docs)}")
website_docs = [doc for doc in website_docs if not doc.page_content.startswith("Your choice regarding cookies on this site")]
print(f"Number of non-trivial documents {len(website_docs)}")
website_docs[5:7]

Number of documents 21
Number of non-trivial documents 21


[Document(metadata={'source': 'https://www.vlerick.com/en/alumni/'}, page_content='Vlerick Alumni | Vlerick Business School\n25,704\nalumni across the globe\nInspiring stories\nNetworking and activities\nThere are so many opportunities to share memories, experiences and ideas with your fellow alumni. We offer events all year round, including our flagship Global Alumni Winter Reunion, as well as our exclusive clubs just for alumni. Find out how to get together and grow together.\nLifelong learning\nOur alumni never lose their thirst for knowledge. We offer valuable opportunities to deepen your insight, broaden your understanding and discover new perspectives, so you can stay ahead of the curve in your own field – or branch out into new ones. Discover opportunities for lifelong learning.\nCareer and business resources\nWhether you’re looking for your next role or want to hire one of our talented alumni, our career coaching, recruitment platform and other resources are designed to help yo

In [17]:
# Split on blank lines (the default separator) and merge the pieces into chunks of about 1,000 characters.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(course_docs + website_docs)
print(f"Number of chunks {len(documents)}")

Created a chunk of size 2143, which is longer than the specified 1000
Created a chunk of size 1935, which is longer than the specified 1000
Created a chunk of size 1566, which is longer than the specified 1000
Created a chunk of size 2234, which is longer than the specified 1000
Created a chunk of size 1251, which is longer than the specified 1000
Created a chunk of size 2234, which is longer than the specified 1000
Created a chunk of size 36468, which is longer than the specified 1000
Created a chunk of size 1731, which is longer than the specified 1000
Created a chunk of size 2680, which is longer than the specified 1000


Number of chunks 1918
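
The "Created a chunk of size ..." warnings above are consistent with how `CharacterTextSplitter` works: it cuts only at its separator (a blank line by default), so a passage that contains no separator stays longer than `chunk_size`. The sketch below is a simplified plain-Python illustration of that behaviour, plus one possible fallback that hard-cuts oversized pieces. It is not LangChain's implementation, and `split_on_separator` and `hard_cut` are hypothetical helper names.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size cannot be cut here, so it stays oversized.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def hard_cut(chunks, chunk_size):
    """Fallback: slice every chunk into pieces of at most chunk_size characters."""
    result = []
    for chunk in chunks:
        for start in range(0, len(chunk), chunk_size):
            result.append(chunk[start:start + chunk_size])
    return result


# Two short paragraphs followed by one long run of text without blank lines.
text = "short one\n\nshort two\n\n" + "x" * 50
chunks = split_on_separator(text, chunk_size=25)
print([len(c) for c in chunks])                             # [20, 50]
print([len(c) for c in hard_cut(chunks, chunk_size=25)])    # [20, 25, 25]
```

In LangChain, `RecursiveCharacterTextSplitter` addresses the same problem by trying progressively finer separators. It may be worth a try if oversized chunks (such as the 36,468-character one above) hurt retrieval quality.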


### Embed into a vector store - and cache the results
We now have a decent store of data loaded into memory. The next step is to calculate sentence embeddings.
We'll use simple, reasonably fast embeddings that we can calculate locally without requiring an expensive GPU or a cloud service like OpenAI's GPTx.

In [18]:
%%time

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# to test, use query_result = embeddings.embed_query("My text")

if True:  # set to False to skip rebuilding and reuse the store already saved on disk
    vectorstore = FAISS.from_documents(
        documents, embedding=embeddings
    )
    # store because this is slow
    vectorstore.save_local("vectorstore") 

CPU times: user 1min 16s, sys: 25.5 s, total: 1min 42s
Wall time: 31.3 s
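
The cell above builds the store once and saves it, so later sessions can load it instead of recomputing the embeddings. Below is a minimal plain-Python sketch of that build-once, reuse-later pattern. A JSON file stands in for the FAISS store and a hypothetical `expensive_build` function stands in for the embedding step; neither is part of LangChain.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_path, build_fn, rebuild=False):
    """Return the cached result if it exists, otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not rebuild:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = build_fn()
    cache_path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []  # records how often the slow step actually runs


def expensive_build():
    """Hypothetical stand-in for the slow embedding step."""
    calls.append(1)
    return {"vectors": [[0.1, 0.2], [0.3, 0.4]]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "store.json"
    first = load_or_build(path, expensive_build)                 # builds and saves
    second = load_or_build(path, expensive_build)                # loads from disk
    third = load_or_build(path, expensive_build, rebuild=True)   # forced rebuild

print(len(calls))  # the slow step ran twice: first build and forced rebuild
```

The `rebuild` flag plays the same role as the `if True:` switch in the cell above: flip it when the underlying documents change.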


In [19]:
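# Note: recent langchain-community releases require passing
# allow_dangerous_deserialization=True here; only do that for stores you created yourself.
vectorstore = FAISS.load_local("vectorstore", embeddings)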
vectorstore.index

<faiss.swigfaiss.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x103c2dd40> >

In [20]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.chat_models import ChatOpenAI
from operator import itemgetter

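# The number of results is set through search_kwargs; a bare k=8 is not picked up by the retriever.
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})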


In [21]:
def q(s):
    """Print the source and text of each document retrieved for the query string s."""
    results = retriever.get_relevant_documents(s)
    for doc in results:
        print("#"*100)
        print(doc.metadata["source"])
        print("#"*100)
        print(doc.page_content)
q("stochastic gradient descent")

####################################################################################################
/Users/jospolfliet/src/vlerick/DATA/MAI-2023 dump/Deep_Learning/MAI01-neural networks - handouts.pdf
####################################################################################################
𝜕𝜀 𝜕𝜀 𝜕𝑤4 = 𝜕𝑦 𝜕𝑜4 𝜕𝑤5 𝜕𝑤4 𝜕𝑜4

𝜕𝑦 𝜕𝑤5 𝜕𝑤4 𝜕𝑜3

𝜕𝜀 𝜕𝑤3 = 𝜕𝑤5 𝜕𝑜4 𝜕𝑜4 𝜕𝑤4

𝜕𝜀 𝜕𝑦

𝜕𝑦 𝜕𝑤5

𝜕𝑤4 𝜕𝑜3

𝜕𝑜3 𝜕𝑤3

𝜕𝑤3 𝜕𝑜2

𝜕𝑜2 𝜕𝑤2

𝜕𝜀 𝜕𝑤5 = 𝜕𝑤5 𝜕𝑜4 𝜕𝑜3 𝜕𝑤3

𝜕𝜀 𝜕𝑦 𝜕𝑜4 𝜕𝑤4

𝜕𝑦 𝜕𝑤5

© Prof. dr. Philippe Baecke

KEY ELEMENTS OF NEURAL NETWORKS

Gradient descent: ▪

In reality, loss landscape may not be smooth

w2

Gradient descent

Loss

Source: https://www.cs.umd.edu/~tomg/projects/landscapes/

w1

© Prof. dr. Philippe Baecke

KEY ELEMENTS OF NEURAL NETWORKS

Learning rate: = hyperparameter that determines how much to change the weights in response to the estimated error each time the model is updated ▪ Needs to be chosen well:

© Prof. dr. Philippe Baecke

KEY ELEMENTS OF NEURAL NETWORKS

Optimi

In [23]:
q("what type of prizes does vlerick give")

####################################################################################################
https://www.vlerick.com/en/insights/tim-van-hauwermeiren-and-pieter-loose-win-the-vlerick-award-2022/
####################################################################################################
During an award show on Wednesday 15 June at the Handelsbeurs in Ghent, the winners of the 21st edition of the Vlerick Award were announced in the presence of numerous business leaders and alumni. The Vlerick Award is presented annually by Vlerick Business School as a tribute to two successful Vlerick alumni who, as entrepreneurs and business leaders, are at the helm of fast-growing organisations.
The Vlerick Enterprising Leader Award went to Tim Van Hauwermeiren, co-founder and CEO of argenx. Pieter Loose, CEO of Ekopak, received the Vlerick Venture Award.
The Vlerick Enterprising Leader Award is bestowed by Vlerick Business School on a business leader and Vlerick alumnus with a convinc