**In this notebook we create text embeddings, that is, we convert text into vectors. We then store the vectors in a vector index or vector database and run a vector search over them.**

In [None]:
!pip install faiss-cpu==1.7.4 chromadb

**What are FAISS and ChromaDB?**

**Ans: FAISS (Facebook AI Similarity Search) is an open-source library developed by Facebook AI Research, used primarily for efficient similarity search and clustering of large collections of vectors. FAISS supports several similarity measures, including L2 (Euclidean) distance and inner product; cosine similarity is obtained by taking the inner product of L2-normalized vectors.**
**Vector libraries are often sufficient for small, static datasets. Because a vector library is not a full-fledged database, it typically does not offer full CRUD (create, read, update, delete) support. Vector libraries are easy to use, lightweight, and fast. Examples of vector libraries are FAISS, ScaNN, Annoy, and HNSW.**
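
To make the relationship between these similarity measures concrete, here is a minimal sketch in plain Python (FAISS is not needed to run it). The vectors `a` and `b` are made up, and the helper names are hypothetical. It checks that, for unit-length vectors, the inner product equals cosine similarity and the squared L2 distance equals 2 - 2 * cosine.

```python
import math

def inner_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length (hypothetical helper, for illustration)."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """Inner product divided by the product of the two vector lengths."""
    return inner_product(u, v) / math.sqrt(inner_product(u, u) * inner_product(v, v))

def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

# Made-up example vectors (assumption: not taken from the dataset)
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos_ab = cosine_similarity(a, b)          # cosine on the raw vectors
ip_unit = inner_product(a_unit, b_unit)   # inner product after normalization
sq_l2_unit = squared_l2(a_unit, b_unit)   # squared L2 after normalization

print(cos_ab, ip_unit, sq_l2_unit)
```

This is why normalizing embeddings before adding them to an inner-product index gives cosine-similarity scores, and why ranking unit vectors by L2 distance produces the same order as ranking them by cosine similarity.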

**ChromaDB is an open-source vector database designed to store vector embeddings to develop and build large language model applications. The database makes it simpler to store knowledge, skills, and facts for LLM applications.**



## Step 1: Reading the Data

In [None]:
import pandas as pd
df = pd.read_csv("/content/labelled_newscatcher_dataset.csv",sep=';')
df['id'] = df.index
df

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4
...,...,...,...,...,...,...,...
108769,NATION,https://www.vanguardngr.com/2020/08/pdp-govern...,vanguardngr.com,2020-08-08 02:40:00,PDP governors’ forum urges security agencies t...,en,108769
108770,BUSINESS,https://www.patentlyapple.com/patently-apple/2...,patentlyapple.com,2020-08-08 01:27:12,"In Q2-20, Apple Dominated the Premium Smartpho...",en,108770
108771,HEALTH,https://www.belfastlive.co.uk/news/health/coro...,belfastlive.co.uk,2020-08-12 17:01:00,Coronavirus Northern Ireland: Full breakdown s...,en,108771
108772,ENTERTAINMENT,https://www.thenews.com.pk/latest/696364-paul-...,thenews.com.pk,2020-08-05 04:59:00,Paul McCartney details post-Beatles distress a...,en,108772


**The code below takes the first 1,000 rows of the DataFrame and builds a list of InputExample objects from the 'title' column using the example_create_fn function. InputExample objects are structured containers that the sentence_transformers library uses as input for sentence embedding models.**

In [None]:
from sentence_transformers import InputExample

# Work with the first 1,000 rows to keep the example fast
df_subset = df.head(1000)

def example_create_fn(doc1: str) -> InputExample:
    # Wrap a single title string in an InputExample
    return InputExample(texts=[doc1])

examples = df_subset.apply(lambda x: example_create_fn(x['title']), axis=1).tolist()

In [None]:
examples[:3]

[<sentence_transformers.readers.InputExample.InputExample at 0x78458d570340>,
 <sentence_transformers.readers.InputExample.InputExample at 0x78458d5703a0>,
 <sentence_transformers.readers.InputExample.InputExample at 0x78458d570400>]

## Step 2: Text into Embedding Vectors

**In this code snippet, we are using the "sentence_transformers" library to create sentence embeddings for the 'title' column of the DataFrame df_subset using a pretrained model called "all-MiniLM-L6-v2."**

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    cache_folder="/content/Cache Folder")
faiss_title_embedding = model.encode(df_subset.title.values.tolist())

In [None]:
len(faiss_title_embedding), len(faiss_title_embedding[0])

(1000, 384)

In [None]:
faiss_title_embedding

array([[-0.11270548,  0.04076543,  0.02181416, ..., -0.01874594,
        -0.03136874,  0.0682483 ],
       [-0.02187165, -0.03349995,  0.073218  , ...,  0.0336232 ,
        -0.0056389 , -0.00630978],
       [ 0.01608383,  0.00279444, -0.0150442 , ..., -0.00706244,
         0.00905905, -0.02835054],
       ...,
       [ 0.01506921,  0.04583016, -0.06114504, ..., -0.07814188,
        -0.08025025,  0.01337819],
       [-0.0708223 ,  0.00643823,  0.00809321, ..., -0.05520815,
        -0.03652043,  0.07594123],
       [-0.06321976,  0.04461519, -0.07385813, ...,  0.06559424,
         0.03276766,  0.09070992]], dtype=float32)

## Step 3: Saving Embedding Vectors to a FAISS Index

**The code below builds a FAISS index for efficient similarity search over the sentence embeddings and their unique IDs. The embeddings are L2-normalized and stored in an inner-product index, so the search scores are cosine similarities. This is useful for finding similar content items in a large dataset efficiently.**

In [None]:
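import numpy as np
import faiss

# Keep 'id' both as the index and as a column so rows can be looked up by ID later
df_to_index = df_subset.set_index(["id"], drop=False)
# FAISS expects 64-bit integer IDs
id_index = np.array(df_to_index.id.values).flatten().astype("int64")

# Normalize a copy in place so that inner product equals cosine similarity
content_encoded_normalized = faiss_title_embedding.copy()
faiss.normalize_L2(content_encoded_normalized)

# Flat (exact) inner-product index, wrapped so each vector carries its own ID
index_content = faiss.IndexIDMap(faiss.IndexFlatIP(len(faiss_title_embedding[0])))
index_content.add_with_ids(content_encoded_normalized, id_index)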

## Step 4: Search for relevant documents

**The search_content function takes a query, encodes it into a vector, and uses the FAISS index to find the k most similar items by ID. It returns a DataFrame containing the matching rows and their similarity scores.**

In [None]:
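def search_content(query, df_to_index, k=3):
    # Encode and normalize the query the same way as the indexed vectors
    query_vector = model.encode([query])
    faiss.normalize_L2(query_vector)

    # search() returns (similarities, ids), each of shape (1, k)
    top_k = index_content.search(query_vector, k)
    ids = top_k[1][0].tolist()
    similarities = top_k[0][0].tolist()
    # Copy so that adding a column does not modify df_to_index
    results = df_to_index.loc[ids].copy()
    results["similarities"] = similarities
    return results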

In [None]:
display(search_content("animal",df_to_index))

id,topic,link,domain,published_date,title,lang,id,similarities
176,TECHNOLOGY,https://www.pushsquare.com/news/2020/08/random...,pushsquare.com,2020-08-03 16:30:00,Random: You Can Pick Up and Pet Cats in Assass...,en,176,0.391902
975,HEALTH,https://www.news-medical.net/news/20200813/Res...,news-medical.net,2020-08-13 05:18:00,Researchers explore social behavior of animals...,en,975,0.376784
99,TECHNOLOGY,https://www.gematsu.com/2020/08/ghostwire-toky...,gematsu.com,2020-08-07 16:43:13,Ghostwire: Tokyo confirms dog petting,en,99,0.344059
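
The index-and-search steps above can be pictured with a toy brute-force version in plain Python. This is only a sketch under stated assumptions: `TinyFlatIPIndex` is a hypothetical class, not the FAISS API, the vectors are made up, and normalization happens inside the class for brevity (in the notebook it is done beforehand with `faiss.normalize_L2`). It mirrors the idea of an exact inner-product search that returns external IDs together with their scores.

```python
import heapq
import math

class TinyFlatIPIndex:
    """Toy exact inner-product index with external IDs (illustration only)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []  # stored unit-length vectors
        self.ids = []      # external ID for each stored vector

    @staticmethod
    def _normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def add_with_ids(self, vectors, ids):
        """Store each vector (normalized) next to its external ID."""
        for v, i in zip(vectors, ids):
            if len(v) != self.dim:
                raise ValueError("vector has the wrong dimension")
            self.vectors.append(self._normalize(v))
            self.ids.append(i)

    def search(self, query, k):
        """Score every stored vector against the query and keep the top k."""
        q = self._normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), i)
            for v, i in zip(self.vectors, self.ids)
        ]
        top = heapq.nlargest(k, scored)  # highest similarity first
        return [s for s, _ in top], [i for _, i in top]

# Made-up 2-dimensional vectors with arbitrary IDs
index = TinyFlatIPIndex(dim=2)
index.add_with_ids(
    vectors=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    ids=[10, 20, 30],
)

scores, ids = index.search([2.0, 0.1], k=2)
print(ids, scores)
```

Here the two nearest IDs are 10 and 30, in that order. A flat index compares the query with every stored vector, so results are exact but the cost grows linearly with the number of vectors.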


## Now we will use a vector database to store the vectors and search them

In [None]:
import chromadb
chroma_client = chromadb.Client()

In [None]:
collection_name = "News"

collection= chroma_client.create_collection(name=collection_name)


In [None]:
# If the collection already exists, delete it, then create a fresh one
existing_names = [c.name for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)
collection = chroma_client.create_collection(name=collection_name)


In [None]:
import pandas as pd
df = pd.read_csv("/content/labelled_newscatcher_dataset.csv",sep=';')
df['id'] = df.index
df_subset = df.head(1000)

In [None]:
df_subset.head()

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


In [None]:
collection.add(
    documents= df_subset["title"][:1000].tolist(),
    metadatas= [{"topic": topic} for topic in df_subset["topic"][:1000].tolist()],
    ids=[f"id{x}" for x in range(1000)]
)

In [None]:
import json
results = collection.query(
    query_texts=["space"],
    n_results=10
)
print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id811",
            "id72",
            "id735",
            "id157",
            "id846",
            "id7",
            "id104",
            "id527",
            "id797",
            "id122"
        ]
    ],
    "distances": [
        [
            0.8878770470619202,
            1.2250351905822754,
            1.2487094402313232,
            1.2891318798065186,
            1.2929279804229736,
            1.3089773654937744,
            1.3210983276367188,
            1.3542897701263428,
            1.358769416809082,
            1.3604464530944824
        ]
    ],
    "metadatas": [
        [
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "TECHNOLOGY"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic

### When to use FAISS library or vector database chromadb?
**The choice between a vector database and Faiss depends on the complexity of your data and queries, the need for metadata handling, and the scalability requirements of your application. If your primary focus is on similarity search and you have large numerical vectors, Faiss is a strong choice. If your application involves diverse data types, complex queries, and metadata, a vector database may be more suitable.**
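
As a rough illustration of what metadata handling and CRUD add on top of plain similarity search, here is a minimal in-memory sketch in plain Python. `TinyVectorStore` is a hypothetical class, not the ChromaDB API; the vectors and topics are made up, and it ranks by squared L2 distance (smaller means closer).

```python
class TinyVectorStore:
    """Toy in-memory vector store with metadata and CRUD (illustration only)."""

    def __init__(self):
        self.records = {}  # id -> {"vector": [...], "metadata": {...}}

    def add(self, id_, vector, metadata):
        if id_ in self.records:
            raise KeyError(f"{id_} already exists")
        self.records[id_] = {"vector": list(vector), "metadata": dict(metadata)}

    def get(self, id_):
        return self.records[id_]

    def update(self, id_, metadata):
        # Merge new metadata into the existing record
        self.records[id_]["metadata"].update(metadata)

    def delete(self, id_):
        del self.records[id_]

    def query(self, vector, n_results, where=None):
        """Return the IDs of the closest records that match the metadata filter."""
        where = where or {}
        candidates = [
            (id_, rec) for id_, rec in self.records.items()
            if all(rec["metadata"].get(key) == value for key, value in where.items())
        ]

        def squared_l2(rec):
            return sum((a - b) ** 2 for a, b in zip(vector, rec["vector"]))

        ranked = sorted(candidates, key=lambda item: squared_l2(item[1]))
        return [id_ for id_, _ in ranked[:n_results]]

store = TinyVectorStore()
store.add("id0", [0.0, 1.0], {"topic": "SCIENCE"})
store.add("id1", [1.0, 0.0], {"topic": "TECHNOLOGY"})
store.add("id2", [0.9, 0.1], {"topic": "SCIENCE"})

# Same query with and without a metadata filter
unfiltered = store.query([1.0, 0.0], n_results=2)
science_only = store.query([1.0, 0.0], n_results=2, where={"topic": "SCIENCE"})

# Update and delete records, then query again
store.update("id2", {"topic": "TECHNOLOGY"})
store.delete("id0")
after_changes = store.query([1.0, 0.0], n_results=2, where={"topic": "SCIENCE"})

print(unfiltered, science_only, after_changes)
```

With a bare vector index, the filtering and bookkeeping shown here would have to be maintained separately by the application, for example in a DataFrame keyed by ID.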

#### Vector databases like ChromaDB support CRUD (create, read, update, delete)


In [None]:
#### Filter Statement
collection.query(
    query_texts=["space"],
    where={"topic": "SCIENCE"},
    n_results=10
)

{'ids': [['id811',
   'id735',
   'id157',
   'id846',
   'id7',
   'id104',
   'id527',
   'id797',
   'id122',
   'id823']],
 'distances': [[0.8878770470619202,
   1.2487094402313232,
   1.2891318798065186,
   1.2929279804229736,
   1.3089773654937744,
   1.3210983276367188,
   1.3542897701263428,
   1.358769416809082,
   1.3604464530944824,
   1.363983392715454]],
 'metadatas': [[{'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'}]],
 'embeddings': None,
 'documents': [['The scramble for space at Earth’s outer limits',
   'Outrage after NASA ‘goes woke’ and renames ‘insensitive’ space objects',
   'NASA astronauts "This is an extraordinary day to be in space ..." shoot music videos in the orbit',
   'Land of a billion faces',
   'Orbital space tourism set for rebirth in 2021',
   'Tonight off

### Update data in collection

In [None]:
collection.delete(ids=['id0'])

In [None]:
collection.get(ids=['id2'])


{'ids': ['id2'],
 'embeddings': None,
 'metadatas': [{'topic': 'SCIENCE'}],

In [None]:
collection.update(ids=["id2"], metadatas=[{"topic" : "TECHNOLOGY"}])

In [None]:
collection.get(ids=['id2'])

{'ids': ['id2'],
 'embeddings': None,
 'metadatas': [{'topic': 'TECHNOLOGY'}],