In [None]:
# Install Cohere for embeddings, UMAP to reduce embeddings to 2 dimensions,
# Altair for visualization, and Annoy for approximate nearest neighbor search
!pip install cohere umap-learn altair annoy datasets tqdm

Get your Cohere API key by [signing up here](https://os.cohere.ai/register). Paste it in the cell below.

In [2]:
#@title Import libraries (Run this cell to execute required code) {display-mode: "form"}

import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

You'll need your API key for this next cell. [Sign up to Cohere](https://os.cohere.ai/) and get one if you haven't yet.

In [3]:
# Paste your API key here. Remember not to share it publicly
api_key = ''

# Create the Cohere client with your API key
co = cohere.Client(api_key)

In [4]:

# Load the claims dataset from a CSV file
df = pd.read_csv('/content/claim_dataset.csv')

# Preview the data to ensure it has loaded correctly
df.head(10)

Unnamed: 0,veracity,Claim_by,Claim
0,FALSE,Claim via Social Media,“The CEO of FTX is the daughter of the chairman of the SEC.”
1,BLATANT LIE,Claim by The Highwire with Del Bigtree,Covid-19 vaccines are responsible for soaring RSV cases.
2,FALSE,Claim via Social Media,Billionaire Bill Gates is on the Maricopa County Board of Supervisors.
3,FALSE,Claim via Social Media,It’s proof of “voter fraud” that it takes some states days to count 2 million votes while Florida results were available on election night.
4,TRUE,Claim via Social Media,Ted Cruz Book Cover Features Soviet-Era Stalingrad Statue.
5,BLATANT LIE,(International,A Malaysian Physician Was ‘Put To Death Under The Nuremberg Codes For Giving The Bioweapon Vaccine And Killing A Patient’ In 2022
6,FALSE,Claim via the Gateway Pundit,U.S. tax dollars sent to help Ukraine were laundered back by cryptocurrency firm FTX to help Democrats in midterms.
7,BLATANT LIE,Claim via Social Media,"“if a human is injected with a GMO it becomes a patented piece of property from the government,” “they take away all your rights with mRNAs”"
8,FALSE,Claim via Social Media,Allegheny County has more Democratic Mail-In Ballot Applications than Philadelphia County
9,FALSE,Claim by Donald Trump (D),Former President Jimmy Carter said “don’t ever use” mail in ballots because “they can be so easily corrupted.”


## 2. Embed the archive
The next step is to embed the text of the claims.

![embedding archive texts](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-embed-text-archive.png)

Embedding a thousand texts of this length should take about fifteen seconds.

In [5]:
# Get the embeddings
embeds = co.embed(texts=list(df['Claim']),
                  model="large",
                  truncate="LEFT").embeddings

In [6]:
# Check the dimensions of the embeddings
embeds = np.array(embeds)
embeds.shape

(2172, 4096)

In [10]:
type(embeds[0,0])

numpy.float64

In [12]:
embeds.tofile('embeds.dat')

In [15]:
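# tofile() stores only the raw values, so restore the 2-D shape before comparing
np.array_equal(embeds, np.fromfile('embeds.dat', dtype=np.float64).reshape(embeds.shape))

True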
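
A raw `tofile` dump keeps only the values, not the shape or dtype, so the array has to be reshaped after `np.fromfile`. As an alternative sketch (not part of the original notebook), `np.save` and `np.load` keep that metadata in the file header. The array `demo_embeds` and the file name below are made-up stand-ins for the real embeddings.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embeddings: a small random float64 matrix (hypothetical data)
rng = np.random.default_rng(0)
demo_embeds = rng.normal(size=(5, 8))

# np.save writes the shape and dtype into the .npy file header
path = os.path.join(tempfile.gettempdir(), "demo_embeds.npy")
np.save(path, demo_embeds)

# np.load restores the array with its original 2-D shape
restored = np.load(path)
roundtrip_ok = restored.shape == demo_embeds.shape and np.array_equal(restored, demo_embeds)
print(roundtrip_ok)
```

With the real data, the same two calls would be applied to `embeds` instead of `demo_embeds`.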

## 3. Search using an index and nearest neighbor search
![Building the search index from the embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-index.png)
Let's now use [Annoy](https://github.com/spotify/annoy) to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include [Faiss](https://github.com/facebookresearch/faiss), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), and [PyNNDescent](https://github.com/lmcinnes/pynndescent)).

After building the index, we can use it to retrieve the nearest neighbors either of existing claims (section 3.1), or of new queries that we embed (section 3.2).

In [16]:
# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')

True

### 3.1. Find the neighbors of an example from the dataset
If we're only interested in the distances between claims already in the dataset (no outside queries), a simple approach is to compare every pair of embeddings directly. With the index already built, though, we can simply ask Annoy for the ten nearest neighbors of a chosen example.
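
The brute-force comparison of every pair of embeddings can be sketched with NumPy alone. This is an illustration on a tiny made-up matrix, not the notebook's method: the helper names `pairwise_cosine` and `nearest` and the `toy` vectors are assumptions introduced here.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return the matrix of cosine similarities between all rows."""
    # Normalize each row so that a dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def nearest(sim_matrix, item_id, k):
    """Return the ids of the k rows most similar to row item_id."""
    row = sim_matrix[item_id].copy()
    # Exclude the item itself before ranking
    row[item_id] = -np.inf
    return np.argsort(-row)[:k]

# Four hypothetical 2-D "embeddings"
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

sim = pairwise_cosine(toy)
print(nearest(sim, item_id=0, k=2))
```

The similarity matrix has one entry per pair of texts, so its memory use grows quadratically with the size of the dataset. That is the cost an approximate index avoids.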

In [17]:
# Choose an example (we'll retrieve others similar to it)
example_id = 92

# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id,10,
                                                include_distances=True)
# Format and print the text and distances
results = pd.DataFrame(data={'Claim': df.iloc[similar_item_ids[0]]['Claim'], 
                             'distance': similar_item_ids[1]}).drop(example_id)

print(f"Claim:'{df.iloc[example_id]['Claim']}'\nNearest neighbors:")
results

Claim:' A photo shows David DePape filming the Jan. 6 Capitol attack.'
Nearest neighbors:


Unnamed: 0,Claim,distance
591,"A video shows antifa inside the Capitol building with floor plans and dressed as Trump supporters on Jan. 6, proving the attack was a setup.",0.942165
908,There Were Armed Insurrectionists at the Capitol on January 6.,0.960456
1844,"Recently released Jan. 6 footage shows “these cops using massive amounts of force against unarmed Trump supporters, including women.”",0.98046
747,Why was nobody arrested inside of the Capitol on January 6th if a crime was being committed.,1.018702
877,"purportedly shows a statement from former President Donald Trump implying he destroyed evidence related to the Jan. 6, 2021 attack on the U.S. Capitol.",1.020593
1407,NPR published an article about the January 6 2021 storming of the US capitol before it happened.,1.02089
690,"“Antifa was already inside the Capitol with floor plans dressed as Trump supporters” on Jan. 6, 2021, proving the attack “was 100% a setup.”",1.031029
1413,purportedly shows a man being arrested for wearing a shirt with the phrase “We the people” on it.,1.034307
900,"Image shows Paul Pelosi, bruised, in a booking mugshot.",1.054174
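
The `distance` column can be read as a cosine similarity. According to the Annoy README, the `'angular'` metric is the Euclidean distance between normalized vectors, which equals sqrt(2(1 - cos)), so cos = 1 - d²/2. The sketch below checks that identity on two made-up vectors and converts two distances from the table above; the helper `angular_to_cosine` is a name introduced here.

```python
import numpy as np

def angular_to_cosine(distance):
    """Convert an Annoy 'angular' distance to cosine similarity."""
    distance = np.asarray(distance, dtype=float)
    return 1.0 - distance ** 2 / 2.0

# Check the identity on two explicit (hypothetical) vectors
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
cos_uv = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
# Euclidean distance between the unit-normalized vectors
angular = np.linalg.norm(u / np.linalg.norm(u) - v / np.linalg.norm(v))
print(np.isclose(angular_to_cosine(angular), cos_uv))

# First and last distances from the table above
print(angular_to_cosine([0.942165, 1.054174]))
```

A distance of 0 corresponds to identical directions, and a distance of sqrt(2) (about 1.41) corresponds to orthogonal vectors.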


### 3.2. Find the neighbors of a user query
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

In [18]:
query = "Iranian protesters set fire to Ayatollah Khomeini's house?"

# Get the query's embedding
query_embed = co.embed(texts=[query],
                  model="large",
                  truncate="LEFT").embeddings

# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(query_embed[0],10,
                                                include_distances=True)
# Format the results
results = pd.DataFrame(data={'Claim': df.iloc[similar_item_ids[0]]['Claim'], 
                             'distance': similar_item_ids[1]})


print(f"Query:'{query}'\nNearest neighbors:")
results

Query:'Iranian protesters set fire to Ayatollah Khomeini's house?'
Nearest neighbors:


Unnamed: 0,Claim,distance
17,"In Iran, 15,000 protesters were sentenced to death.",0.971598
293,"“[T]he Iran police publish[ed] a videoclip show[ing] that her [Mahsa Amini’s] death was due to a heart attack, proving the false news broadcasted [sic] by the Western media and the anti-Iranian press.”",1.066752
1014,"In the immediate wake of a fire and vandalism at the Madison headquarters of an anti-abortion group, Democrats had not condemned “activists who are engaging in this repugnant illegal activity.”",1.08665
585,Food shortages planned by the government AND crops being burned.,1.098786
209,Flag made of Iranian women’s chopped hair during September 2022 protest.,1.103626
1031,"purportedly shows a massive crowd adhering to Muslim prayer in the streets of Paris, France.",1.110822
88,The Pelosis “are refusing to turn over surveillance video of their home.”,1.111557
1680,"“They call” Jan. 6 “an insurrection,” but “were FBI agents used as political agitators?”",1.113612
372,"Following protests in Grand Rapids, Michigan, in 2020, Democratic U.S. House candidate Hillary Scholten “dismissed the destruction and praised the rioters.”",1.115913
267,"Doug Mastriano said, Iranian President Ali Khamenei has the right idea of how Women should be treated.",1.123345
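
Several of the neighbors above are only loosely related to the query. One possible way to suppress weak matches is to discard results beyond a distance cutoff. The sketch below assumes the `(ids, distances)` tuple that Annoy returns with `include_distances=True`; the helper `filter_by_distance` and the cutoff value are hypothetical and would need tuning on real queries.

```python
def filter_by_distance(neighbors, max_distance):
    """Keep only (id, distance) pairs at or below max_distance."""
    ids, distances = neighbors
    return [(i, d) for i, d in zip(ids, distances) if d <= max_distance]

# Same structure as Annoy's return value with include_distances=True;
# ids and distances copied from the first three rows of the results above
neighbors = ([17, 293, 1014], [0.971598, 1.066752, 1.08665])

MAX_DISTANCE = 1.0  # hypothetical cutoff, not a recommended value
kept = filter_by_distance(neighbors, MAX_DISTANCE)
print(kept)
```

An empty result is then a signal that the archive probably contains no claim close to the query.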


## 4. Visualizing the archive
Finally, let's plot all the claims on a 2D chart so you can visualize the semantic similarities in this dataset!

In [20]:
#@title Plot the archive {display-mode: "form"}

# UMAP reduces the embeddings from 4096 dimensions to 2 so that we can plot them
reducer = umap.UMAP(n_neighbors=20) 
umap_embeds = reducer.fit_transform(embeds)
# Prepare the data to plot an interactive visualization
# using Altair
df_explore = pd.DataFrame(data={'Claim': df['Claim']})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

# Plot
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x', scale=alt.Scale(zero=False)),
    y=alt.Y('y', scale=alt.Scale(zero=False)),
    tooltip=['Claim']
).properties(
    width=700,
    height=400
)
chart.interactive()

Hover over the points to read the text. Do you see patterns in the clustered points, such as similar claims or claims about similar topics?

This concludes this introductory guide to semantic search using sentence embeddings. As you continue building a search product, additional considerations arise, such as dealing with long texts or fine-tuning the embeddings for a specific use case.


We can’t wait to see what you start building! Share your projects or find support at [community.cohere.ai](https://community.cohere.ai).
