# Sentence embeddings
We will mainly use `sentence-transformers`, a dedicated package from Hugging Face 🤗 for computing sentence embeddings.

Relevant documentation:
- [Semantic textual similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Semantic search](https://www.sbert.net/examples/applications/semantic-search/README.html)

In [37]:
!pip freeze | grep lang

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


langchain==0.0.344
langchain-core==0.0.8
langdetect==1.0.9
langsmith==0.0.67


In [34]:
!pip install -U sentence-transformers faiss-cpu langchain "unstructured[md]" openai nest-asyncio streamlit


Collecting langchain
  Downloading langchain-0.0.344-py3-none-any.whl.metadata (16 kB)
Collecting openai
  Downloading openai-1.3.7-py3-none-any.whl.metadata (17 kB)
Collecting streamlit
  Downloading streamlit-1.29.0-py2.py3-none-any.whl.metadata (8.2 kB)
Collecting unstructured[md]
  Downloading unstructured-0.11.2-py3-none-any.whl.metadata (25 kB)
Collecting langchain-core<0.1,>=0.0.8 (from langchain)
  Downloading langchain_core-0.0.8-py3-none-any.whl.metadata (750 bytes)
Collecting altair<6,>=4.0 (from streamlit)
  Downloading altair-5.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting cachetools<6,>=4.0 (from streamlit)
  Downloading cachetools-5.3.2-py3-none-any.whl.metadata (5.2 kB)
Collecting importlib-metadata<7,>=1.4 (from streamlit)
  Downloading importlib_metadata-6.9.0-py3-none-any.whl.metadata (4.9 kB)
Collecting pandas<3,>=1.3.0 (from streamlit)
  Downloading pandas-2.1.3-cp311-cp311-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting protobuf<5,>=3.20 (from streamlit)
  D

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [2]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
embedding


array([[-1.76214680e-01,  1.20601490e-01, -2.93624043e-01,
        -2.29858160e-01, -8.22923556e-02,  2.37709701e-01,
         3.39984596e-01, -7.80964196e-01,  1.18127435e-01,
         1.63373843e-01, -1.37715429e-01,  2.40282565e-01,
         4.25125778e-01,  1.72417641e-01,  1.05280034e-01,
         5.18164277e-01,  6.22214526e-02,  3.99285913e-01,
        -1.81652635e-01, -5.85578501e-01,  4.49724011e-02,
        -1.72750384e-01, -2.68443584e-01, -1.47386163e-01,
        -1.89217702e-01,  1.92150414e-01, -3.83842826e-01,
        -3.96007091e-01,  4.30648834e-01, -3.15320015e-01,
         3.65949929e-01,  6.05159178e-02,  3.57325375e-01,
         1.59736529e-01, -3.00983638e-01,  2.63250142e-01,
        -3.94310504e-01,  1.84855461e-01, -3.99549633e-01,
        -2.67889559e-01, -5.45117497e-01, -3.13404575e-02,
        -4.30644214e-01,  1.33278072e-01, -1.74793854e-01,
        -4.35465217e-01, -4.77379173e-01,  7.12554380e-02,
        -7.37003982e-02,  5.69136977e-01, -2.82579482e-0

As you can see, a sentence embedding is just a vector, like a word embedding. That means we can compute the similarity between two sentences the same way we do for words, for example with cosine similarity:

In [None]:

# Two lists of sentences
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome!']

sentences2 = ['The dog plays in the garden',
              'My plants look a bit sick, could it be bitrot?',
              'The new movie is so great!']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute the cosine similarity between every sentence in list 1 and every sentence in list 2
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output each aligned pair (sentences1[i], sentences2[i]) with its score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
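
To make the similarity score less of a black box, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real embeddings come from `model.encode` and have a few hundred dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
v_movie_1 = [0.9, 0.1, 0.0]
v_movie_2 = [0.8, 0.2, 0.1]
v_plants = [0.0, 0.2, 0.9]

print(f"similar pair:   {cosine_similarity(v_movie_1, v_movie_2):.4f}")
print(f"unrelated pair: {cosine_similarity(v_movie_1, v_plants):.4f}")
```

Vectors that point in nearly the same direction score close to 1, while vectors with little in common score close to 0.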

## Semantic search and retrieval

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.


![title](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)
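
The idea described above can be sketched in a few lines of plain Python. This is an illustration only: the corpus vectors and the query vector are hypothetical values standing in for the output of `model.encode`, and a real system would use an optimized index rather than a linear scan. The `sentence-transformers` package also ships a `util.semantic_search` helper for the same task.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (corpus_index, cosine_score) pairs for the top_k closest entries."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        c = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

corpus = ["A man is playing guitar", "The new movie is awesome!", "The cat sits outside"]
# Hypothetical embeddings, one per corpus entry
corpus_vecs = [[0.1, 0.9, 0.1], [0.9, 0.1, 0.0], [0.1, 0.1, 0.9]]
# Hypothetical embedding of the query "The new movie is so great!"
query_vec = [0.8, 0.2, 0.1]

for idx, score in semantic_search(query_vec, corpus_vecs):
    print(f"{score:.4f}  {corpus[idx]}")
```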

Instead of trying to build a semantic search engine from first principles, we'll use `langchain`. 

In [111]:
%%time
from langchain.document_loaders import GitbookLoader
gitbookloader = GitbookLoader("https://docs.app.metamaze.eu", load_all_paths=True).load()
gitbookloader[0]



Document(page_content='❔\nWhat is Metamaze?\nOn a mission to liberate mankind from repetitive document and e-mail processing.\nMetamaze is a platform for building semi-automated flows for processing any type of document or e-mail. Metamaze enables companies to automate large parts of repetitive data entry and validation tasks. \nBy using Metamaze, companies can \nautomate 50 to 98% of the manual work. \nThis leads to\nimproved employee well-being \nlower labor costs\nmore time for value-adding activities\nthe unlocking of new data and insights\nAdapts to your process, and your documents \nAdaptive IDP platforms like Metamaze are flexible systems that adapt to \nyour process\n and learn through \nfully integrated human feedback\n. Metamaze is not a rigid off-the-shelf ‘one-size fits no-one’ solution. \nYou can automate any document type you want, including fully custom document types.\nYou can start from scratch with as little as 10 examples, or start fine-tuning an existing model from 

In [128]:
%%time
import os
from langchain.document_loaders import ConfluenceLoader

# Read credentials from the environment instead of hardcoding them in the notebook.
# CONFLUENCE_USERNAME and CONFLUENCE_API_KEY are assumed variable names; set them before running.
loader = ConfluenceLoader(
    url="https://metamaze.atlassian.net/wiki",
    username=os.environ["CONFLUENCE_USERNAME"],
    api_key=os.environ["CONFLUENCE_API_KEY"],
)
confluence = loader.load(space_key="DO", include_attachments=False)
confluence[12]

CPU times: user 11.6 s, sys: 1.72 s, total: 13.3 s
Wall time: 2min 3s


Document(page_content='', metadata={'title': 'Documentation', 'id': '925499519', 'source': 'https://metamaze.atlassian.net/wiki/spaces/DO/pages/925499519/Documentation'})

In [None]:
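# Load all Markdown files under rfpgpt/resources/.
# Splitting into chunks, embedding and storing in the vector store happen in the cells below.
from langchain.document_loaders import DirectoryLoader
raw_documents = DirectoryLoader('rfpgpt/resources/', glob="**/*.md").load()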

## Create new vector store and embed all documents

In [3]:
from langchain.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

In [129]:
# source: https://python.langchain.com/docs/expression_language/cookbook/retrieval


text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents + gitbookloader + confluence)
documents[0]

Created a chunk of size 3347, which is longer than the specified 1000
Created a chunk of size 1119, which is longer than the specified 1000
Created a chunk of size 1070, which is longer than the specified 1000
Created a chunk of size 1066, which is longer than the specified 1000
Created a chunk of size 1584, which is longer than the specified 1000
Created a chunk of size 17816, which is longer than the specified 1000
Created a chunk of size 1670, which is longer than the specified 1000
Created a chunk of size 3916, which is longer than the specified 1000
Created a chunk of size 1378, which is longer than the specified 1000
Created a chunk of size 2250, which is longer than the specified 1000
Created a chunk of size 6484, which is longer than the specified 1000
Created a chunk of size 1768, which is longer than the specified 1000
Created a chunk of size 1069, which is longer than the specified 1000
Created a chunk of size 2267, which is longer than the specified 1000
Created a chunk of 

Document(page_content='QUESTION: If API is supported as Input Source, please indicate any possible pre-requirement (if any), \nANSWER: For a complete description of the Metamaze REST API, please see https://app.metamaze.eu/docs/index.html', metadata={'source': 'rfpgpt/resources/faq/question_170.md'})
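
The "longer than the specified 1000" warnings above are what you would expect from a splitter that only cuts at a separator: a piece with no separator inside it cannot be made smaller. The sketch below illustrates that behaviour in plain Python. It is not LangChain's implementation, and the `"\n\n"` separator is an assumption about the default of `CharacterTextSplitter`.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Split text at the separator, then merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Still fits: keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = piece
        if len(piece) > chunk_size:
            # No separator inside this piece, so it stays oversized
            print(f"Oversized piece of {len(piece)} characters kept whole")
    if current:
        chunks.append(current)
    return chunks

sample = "a" * 30 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
for chunk in split_on_separator(sample, chunk_size=25):
    print(len(chunk), repr(chunk[:12]))
```

If oversized chunks are a problem for retrieval quality, a splitter that falls back to finer separators, such as LangChain's `RecursiveCharacterTextSplitter`, may be worth trying.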

In [10]:
%%time

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# to test, use query_result = embeddings.embed_query("My text")

if False: # change to True if you want to (re)create your store   
    vectorstore = FAISS.from_documents(
        documents, embedding=embeddings
    )
    # store because this is slow
    vectorstore.save_local("vectorstore") 

CPU times: user 183 ms, sys: 102 ms, total: 284 ms
Wall time: 316 ms


In [11]:
import os
os.getcwd()


'/Users/jospolfliet/src/vlerick'

In [12]:
vectorstore = FAISS.load_local("vectorstore", embeddings)
vectorstore.index

<faiss.swigfaiss.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x2b3329ec0> >

In [27]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough, RunnableMap, RunnableSequence
from langchain.chat_models import ChatOpenAI
from operator import itemgetter

# The number of documents to retrieve is passed through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

template = """Answer the question using only information from the following, related previous answers or context:
# CONTEXT:
{context}
# INSTRUCTIONS:
- Replace any mentions of "provider", "supplier" or similar with "Metamaze"
- Replace any mentions of "AXA", "AG Insurance", "KBC", or other potential client names with "Client"
# QUESTION: 
{question}
"""
prompt = ChatPromptTemplate.from_template(template)


In [28]:
model = ChatOpenAI(model_name="gpt-4")

In [31]:
rag_chain_from_docs = (
    {
        "context": lambda input: input["documents"],
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | StrOutputParser()
)
rag_chain_with_source = RunnableMap(
    {"documents": retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: input["documents"],
    "answer": rag_chain_from_docs,
}

def q(s):
    """Run the RAG chain on question `s` and print the retrieved sources and the answer."""
    result = rag_chain_with_source.invoke(s)
    print("## Reference data")
    for doc in result['documents']:
        # The source path or URL is stored in the document metadata
        print(f"({doc.metadata.get('source', 'unknown source')})")
        print(doc.page_content)
        print("-------")
    print(f"\n ## Answer:\n\n{result['answer']}")
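
To see what the chain does without any LangChain machinery, here is a minimal sketch of the same data flow: retrieve documents, paste them into the prompt, call the model, and return both the documents and the answer. Everything in it is hypothetical: `retrieve` ranks by simple word overlap instead of embeddings, and `stub_llm` stands in for the chat model.

```python
def retrieve(question, store, k=2):
    """Hypothetical retriever: rank stored texts by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        store,
        key=lambda text: len(q_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, documents):
    """Fill a simplified version of the prompt template with the retrieved context."""
    context = "\n---\n".join(documents)
    return f"# CONTEXT:\n{context}\n# QUESTION:\n{question}\n"

def answer(question, store, llm):
    """Retrieve, build the prompt, call the model, and keep the sources."""
    documents = retrieve(question, store)
    prompt = build_prompt(question, documents)
    return {"documents": documents, "answer": llm(prompt)}

def stub_llm(prompt):
    # Stand-in for a real chat model; only reports how much context it received
    return f"(stub answer based on {prompt.count('---') + 1} context passages)"

store = [
    "Penetration tests are performed at least annually.",
    "Invoices are processed automatically.",
    "Vulnerability assessments are part of the security policy.",
]
result = answer("Are penetration tests performed regularly?", store, stub_llm)
print(result["documents"])
print(result["answer"])
```

Returning the documents next to the answer is what makes it possible to show the reference data the answer was based on.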

In [33]:
q("""Does the provider perform regular vulnerability assessments / penetration tests to determine security gaps? 
""")

## Reference data
Penetration Testing Policy Penetration Testing is a legal, authorized simulated attack performed in order to evaluate the security controls of the Metamaze application and infrastructure and of the Metamaze organization while identifying the exploitable vulnerabilities as well as its strengths, enabling a full risk assessment to be completed. This policy specifies the penetration testing process steps in order to maximise the business value and minimise the security risks. Scope This policy applies throughout the organization and it is particularly relevant to the Engineering department. The objective of the penetration testing is to detect the security weaknesses that could be used by externals to gain unauthorized access to information. The testing will be performed At least on a annual basis When major releases are planned that significantly impact authorization, authentication or data tenancy. In this case, the penetration test should be performed before the relea

In [None]:
import streamlit as st


## Make Apify crawl the Vlerick website

In [None]:
import os
from langchain.docstore.document import Document
from langchain.utilities import ApifyWrapper

# Never commit real tokens to a notebook; set them in your environment instead.
# os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
# os.environ["APIFY_API_TOKEN"] = "Your Apify API token"

apify = ApifyWrapper()
# Call the Actor to obtain text from the crawled webpages
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://www.vlerick.com/en/"}]
    },
    # Map each crawled item to a LangChain Document, keeping the page URL as its source
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
