# Wikipedia Semantic Search with Cohere Embedding Archives
This notebook contains starter code for simple [semantic search](https://txt.cohere.ai/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://txt.cohere.ai/embeddings-archives-wikipedia/) published by Cohere. These archives contain embeddings of Wikipedia in multiple languages. The code below loads the English archive (`Cohere/wikipedia-22-12-en-embeddings`); a [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings) archive is also published.

source: https://github.com/cohere-ai/notebooks/blob/main/notebooks/Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb

In [1]:
# Let's install cohere and HuggingFace datasets
!pip install cohere datasets

Collecting cohere
  Downloading cohere-4.3.1-py3-none-any.whl (32 kB)
Installing collected packages: cohere
Successfully installed cohere-4.3.1


Let's now download up to 10,000 records from the English Wikipedia embeddings archive so we can search them afterwards.

In [27]:
from datasets import load_dataset
import torch
import cohere

# Add your Cohere API key from www.cohere.com (never commit a real key to a shared notebook)
co = cohere.Client("<YOUR_API_KEY>")

# Load at most 10,000 documents and their embeddings
max_docs = 10000
docs_stream = load_dataset("Cohere/wikipedia-22-12-en-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

Now, `doc_embeddings` holds the embeddings of the first 10,000 documents in the dataset. Each document is represented as an [embedding vector](https://txt.cohere.ai/sentence-word-embeddings/) of 768 values.

In [31]:
# This is the first batch of 10,000 documents
print(doc_embeddings.shape)
print(type(doc_embeddings))
print(doc_embeddings)

torch.Size([10000, 768])
<class 'torch.Tensor'>
tensor([[ 0.2866, -0.0318,  0.0667,  ...,  0.0602, -0.2362, -0.0712],
        [-0.0969,  0.1619, -0.0980,  ...,  0.2985,  0.0398,  0.0344],
        [ 0.1302,  0.2657,  0.4018,  ...,  0.2803, -0.0346,  0.1877],
        ...,
        [ 0.1199, -0.5237,  0.1080,  ...,  0.1375, -0.1667, -0.3578],
        [-0.1171,  0.1487, -0.5787,  ..., -0.3205, -0.4541,  0.0279],
        [ 0.0129, -0.1160,  0.0472,  ..., -0.1739, -0.1747, -0.3018]])


In [29]:
# Try to get the size of the 10,000 embeddings in bytes.
# Note: sys.getsizeof only measures the Python tensor object, not the embedding data it holds.
import sys
print(f'The size of the object is {sys.getsizeof(doc_embeddings)} bytes')

The size of the object is 72 bytes


In [30]:
# TODO: get the size of the embeddings
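
One possible way to fill in this step, offered as a sketch rather than the notebook's own method: multiply the number of elements in the tensor by the bytes each element occupies. The helper name `tensor_nbytes` and the `stand_in` tensor are assumptions made for illustration; `stand_in` only mimics the shape of `doc_embeddings` and assumes the default `float32` dtype that `torch.tensor` gives to Python floats.

```python
import torch

def tensor_nbytes(t: torch.Tensor) -> int:
    """Bytes used by the tensor's elements (excludes Python object overhead)."""
    return t.element_size() * t.nelement()

# Hypothetical stand-in with the same shape as doc_embeddings (10,000 x 768, float32)
stand_in = torch.zeros(10000, 768)

size_bytes = tensor_nbytes(stand_in)
print(f"{size_bytes} bytes ({size_bytes / 1e6:.2f} MB)")
```

Under those assumptions the data comes to 10,000 × 768 × 4 bytes, roughly 30.72 MB. In the notebook itself, the same helper could be called on `doc_embeddings` directly.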

We can now search these vectors for any query we want. For this toy example, we'll ask a question about Wikipedia, since we know the Wikipedia page is included in the first 10,000 documents we loaded here.

To search, we embed the query, then get the nearest neighbors to its embedding (using dot product).

In [22]:
# Get the query, then embed it
query = 'Who founded Wikipedia'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings 
query_embedding = torch.tensor(query_embedding)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")


Query: Who founded Wikipedia
Wikipedia
Various collaborative online encyclopedias were attempted before the start of Wikipedia, but with limited success. Wikipedia began as a complementary project for Nupedia, a free online English-language encyclopedia project whose articles were written by experts and reviewed under a formal process. It was founded on March 9, 2000, under the ownership of Bomis, a web portal company. Its main figures were Bomis CEO Jimmy Wales and Larry Sanger, editor-in-chief for Nupedia and later Wikipedia. Nupedia was initially licensed under its own Nupedia Open Content License, but before Wikipedia was founded, Nupedia switched to the GNU Free Documentation License at the urging of Richard Stallman. Wales is credited with defining the goal of making a publicly editable encyclopedia, while Sanger is credited with the strategy of using a wiki to reach that goal. On January 10, 2001, Sanger proposed on the Nupedia mailing list to create a wiki as a "feeder" project

This prints the three passages whose embeddings score highest against the query. To retrieve more results, increase the `k` value passed to `torch.topk`. The question in this simple demo is about Wikipedia because we know that the Wikipedia page is among the documents in this subset of the archive.
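
To make it easier to experiment with different values of `k`, the search step could be wrapped in a small helper. The sketch below is illustrative only: `top_k_docs` is a hypothetical name, and the toy vectors are made up so the example runs without an API key. It also clamps `k` to the number of documents, because `torch.topk` raises an error when `k` exceeds the size of the dimension being searched.

```python
import torch

def top_k_docs(query_embedding: torch.Tensor, doc_embeddings: torch.Tensor, k: int = 3):
    """Return (indices, scores) of the k documents with the highest dot product."""
    # (1, dim) @ (dim, n_docs) -> (1, n_docs): one score per document
    scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
    # Clamp k so it never exceeds the number of documents
    k = min(k, doc_embeddings.shape[0])
    top = torch.topk(scores, k=k)
    return top.indices[0].tolist(), top.values[0].tolist()

# Toy data (made up for illustration): four 3-dimensional "document" vectors
toy_docs = torch.tensor([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = torch.tensor([[0.9, 0.1, 0.0]])

for k in (1, 3, 10):
    indices, scores = top_k_docs(toy_query, toy_docs, k=k)
    print(f"k={k}: document indices {indices}")
```

With the real data, the returned indices can be used to look up `docs[doc_id]['title']` and `docs[doc_id]['text']`, as in the search cell above.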