# Semantic Search with Embeddings on the Formula Student Rules

This notebook demonstrates how to use Cohere's embedding API to run a semantic search over the Formula Student Rules using text embeddings.
<br>
The implementation is based on this notebook [1]. It can be used as a standalone search, or as the first preprocessing step of a retrieval-augmented generation (RAG) pipeline.
<br>
<br>
[1] https://github.com/cohere-ai/notebooks/blob/main/notebooks/Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb?ref=txt.cohere.com

In [None]:
# Let's install cohere and HF datasets
!pip install cohere datasets

In [None]:
import torch
import cohere

# Add your Cohere API key from https://dashboard.cohere.com/api-keys
# A trial key is enough for this example
co = cohere.Client("")
# Basic embedding model
MODEL = "embed-english-v2.0"
# Load at most 2000 documents and their embeddings
max_docs = 2000

Prepare the data by splitting the rules text into chunks. Each chunk is a pair of a title (a rule number or an abbreviation) and the text that belongs to it.

In [None]:
import re
# find all titles and texts in this format:
# """
#    AIP
#
# Anti Intrusion Plate
# """
#
# or in this format:
# """
# A 1.2.6
# 
# Vehicles of both classes can take part in an additional Driverless Cup (DC).
# """

# Read the rules text from data/Rules_2024_v1.1.txt
with open('data/Rules_2024_v1.1.txt', 'r') as f:
    rules = f.read()

# Define a regex pattern to match titles and texts
pattern = re.compile(r'([A-Z0-9\s\.-]+)\n\n((?:.*?\n)+?)(?=\n[A-Z0-9\s\.-]+|$)', re.DOTALL)

# Find all matches in the text
matches = pattern.findall(rules)

# Strip whitespace and keep a pair if its title or its text is longer than 3 characters
filtered_matches = [(title.strip(), text.strip()) for title, text in matches if len(title.strip()) > 3 or len(text.strip()) > 3]

titles = [title for title, text in filtered_matches]
texts = [text for title, text in filtered_matches]
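
To see what the pattern in the cell above extracts, it can help to run it on a small string first. The sketch below uses a made-up sample that assumes the rules file separates a title from its text with one blank line, as in the two formats shown in the comments. The second rule text is invented for illustration and is not taken from the rules.

```python
import re

# Same pattern as in the cell above
pattern = re.compile(r'([A-Z0-9\s\.-]+)\n\n((?:.*?\n)+?)(?=\n[A-Z0-9\s\.-]+|$)', re.DOTALL)

# Made-up sample in the layout the rules file is assumed to have
sample = (
    "A 1.2.6\n"
    "\n"
    "Vehicles of both classes can take part in an additional Driverless Cup (DC).\n"
    "\n"
    "A 1.2.7\n"
    "\n"
    "Second rule, wrapped\n"
    "over two lines.\n"
)

# Strip surrounding whitespace, as the notebook does
pairs = [(title.strip(), text.strip()) for title, text in pattern.findall(sample)]
for title, text in pairs:
    print(repr(title), "->", repr(text))
```

A blank line ends a chunk, so a text that wraps over several lines stays together as long as it contains no blank line.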

Use the next cell to generate new embeddings. This is normally not necessary, since they can be loaded from the pickle file.

In [16]:
import pickle

# Embed titles and texts with the given embedding model
texts_embedded = co.embed(texts=texts, model=MODEL)
titles_embedded = co.embed(texts=titles, model=MODEL)
print(texts_embedded.embeddings[0][:5]) # Let's check embeddings for the first text

# create a list of json objects with {id, title, text, emb}
json_docs = []
for i in range(len(texts)):
    json_docs.append({'id': i, 'title': titles[i], 'text': texts[i], 'emb': texts_embedded.embeddings[i]})
    

# Store the documents and their embeddings in a pickle file;
# the with-block closes the file automatically
with open('data/Rules_2024_v1.1_embedding.pkl', 'wb') as file:
    pickle.dump(json_docs, file)

[2.5546875, 0.9472656, 0.8847656, -0.52978516, 1.1357422]
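
Since the embeddings rarely need to be regenerated, one option is to wrap the step in a small helper that only rebuilds them when the pickle file is missing. This is a minimal sketch, not part of the original notebook: `load_or_create`, `fake_build` and the temporary path are hypothetical names, and `fake_build` stands in for the `co.embed` call so the example runs without an API key.

```python
import os
import pickle
import tempfile

def load_or_create(path, build_docs):
    """Return the pickled docs at `path`; build and save them first if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    docs = build_docs()
    with open(path, "wb") as f:
        pickle.dump(docs, f)
    return docs

# Hypothetical stand-in for the embedding step; records how often it runs
calls = []
def fake_build():
    calls.append(1)
    return [{"id": 0, "title": "A 1.2.6", "text": "example", "emb": [0.1, 0.2]}]

cache_path = os.path.join(tempfile.mkdtemp(), "embedding.pkl")
first = load_or_create(cache_path, fake_build)   # builds the docs and writes the file
second = load_or_create(cache_path, fake_build)  # reads the file, no rebuild
print(len(calls), first == second)
```

In the notebook, `build_docs` would be a function that calls `co.embed` and assembles `json_docs`, so the API is only hit when no cached file exists.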


Create a tensor with the given embeddings and use it to search for similar documents.

In [28]:
import pickle

# Load the documents and their embeddings from the pickle file;
# the with-block closes the file automatically
with open('data/Rules_2024_v1.1_embedding.pkl', 'rb') as file:
    json_docs = pickle.load(file)


docs = []
doc_embeddings = []
for doc in json_docs:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        print("Too many documents, breaking")
        break

doc_embeddings = torch.tensor(doc_embeddings)

To search, we embed the query, then get the nearest neighbors to its embedding (using dot product).

In [18]:
def calculateQuery(co, query, model):
    # Embed the query with the model passed in by the caller
    response = co.embed(texts=[query], model=model)
    query_embedding = response.embeddings
    query_embedding = torch.tensor(query_embedding)

    # Compute dot score between query embedding and document embeddings
    dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
    top_k = torch.topk(dot_scores, k=3)

    # Print results
    print("Query:", query)
    print("\nSimilar rules:")
    for doc_id in top_k.indices[0].tolist():
        print(docs[doc_id]['title'])
        print(docs[doc_id]['text'], "\n")
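
The ranking step can be tried in isolation, without the API. The sketch below is a plain-Python stand-in for the `torch.mm` and `torch.topk` calls in the function above. The three-dimensional vectors and the document labels are made up for illustration, and `dot` and `top_k_by_dot` are hypothetical helper names.

```python
# Toy 3-dimensional "embeddings"; real ones come from co.embed and are much longer
toy_docs = ["scatter shield", "gas cylinder", "brake pedal"]
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 0.9],
]
toy_query = [1.0, 0.2, 0.0]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot(query_vec, doc_vecs, k):
    """Return the indices of the k highest dot-product scores, plus all scores."""
    scores = [dot(query_vec, vec) for vec in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

top_ids, scores = top_k_by_dot(toy_query, toy_embeddings, k=2)
for i in top_ids:
    print(toy_docs[i], round(scores[i], 2))
```

The dot product is not normalized by vector length, so embeddings with a larger norm tend to score higher. Cosine similarity is a common alternative when that is not wanted.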


Now we can test the search with a query:

In [27]:
# Define the query; calculateQuery embeds it and prints the 3 most similar rules
query = """
What is the minimum required thickness for the scatter shield of the motor?
"""   

calculateQuery(co, query, MODEL)

Query: 
What is the minimum required thickness for the scatter shield of the motor?


Similar rules:
T 7.3.4
The tractive electric motor(s) must have a housing or separate scatter shield from nonperforated 2 mm aluminium alloy 6061-T6 or equivalent. The scatter shield may be split into
two equal sections, each 1 mm thick. 

T 9.1.9
Gas cylinders/tanks and their pressure regulators must be shielded from the driver. The
shields must be steel or aluminium with a minimum thickness of 1 mm. 

T 7.3.2
Exposed rotating final drivetrain parts, such as gears, clutches, chains and belts must be fitted
with scatter shields. Scatter shields and their mountings must: 

