<a href="https://colab.research.google.com/github/Muntasir2179/vector-database-learning/blob/main/VD_Qdrant_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Qdrant

* Vector Database
* Open Source
* An alternative to the Pinecone database
* Managed Services

https://qdrant.tech/

## Setup

* Set up a free 1 GB cluster in the cloud service
* The vector database is persistent
* The database is reachable through a URL
* Data is accessible through simple APIs

In [1]:
# https://41340d3a-8de6-41e4-a9a4-b60b28a9d142.us-east4-0.gcp.cloud.qdrant.io:6333
# ntBnTQKMNfWL7JHCcrPNiyLyJhh5xmZPEPDYXtNZQ41UjHgcdMVhxQ

In [2]:
!pip install qdrant_client==1.7.2 langchain==0.1.4 sentence-transformers==2.3.1 langchain-community==0.0.16

Collecting qdrant_client==1.7.2
  Downloading qdrant_client-1.7.2-py3-none-any.whl (206 kB)
Collecting langchain==0.1.4
  Downloading langchain-0.1.4-py3-none-any.whl (803 kB)
Collecting sentence-transformers==2.3.1
  Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
Collecting langchain-community==0.0.16
  Downloading langchain_community-0.0.16-py3-none-any.whl (1.6 MB)
Collecting grpcio-tools>=1.41.0 (from qdrant_client==1.7.2)
  Downloading grpcio_tools-1.60.1-cp310-cp310

In [3]:
import os
import qdrant_client

from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import HuggingFaceEmbeddings
# from langchain.embeddings.huggingface import HuggingFaceEmbeddings

In [4]:
# store the Qdrant cluster URL and API key in environment variables for the client
os.environ['QDRANT_HOST'] = 'https://41340d3a-8de6-41e4-a9a4-b60b28a9d142.us-east4-0.gcp.cloud.qdrant.io:6333'
os.environ['QDRANT_API_KEY'] = 'ntBnTQKMNfWL7JHCcrPNiyLyJhh5xmZPEPDYXtNZQ41UjHgcdMVhxQ'

In [5]:
client = qdrant_client.QdrantClient(
    url=os.getenv('QDRANT_HOST'),
    api_key=os.getenv('QDRANT_API_KEY')
)
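
The client above reads its settings from environment variables. A minimal sketch of a fail-fast check follows, assuming a hypothetical helper named `load_qdrant_settings` (it is not part of `qdrant_client`) that reports a missing variable before any network call is attempted:

```python
import os


def load_qdrant_settings(env=os.environ):
    """Return the Qdrant URL and API key, or raise if either one is unset.

    Hypothetical helper for illustration; not part of qdrant_client.
    """
    values = {}
    for name in ("QDRANT_HOST", "QDRANT_API_KEY"):
        value = env.get(name)
        if not value:
            raise KeyError(f"environment variable {name} is not set")
        values[name] = value
    # Keys are named after the keyword arguments used in the cell above.
    return {"url": values["QDRANT_HOST"], "api_key": values["QDRANT_API_KEY"]}
```

The returned dictionary uses the same `url` and `api_key` keyword names as the constructor call above, so it could be unpacked into `qdrant_client.QdrantClient(**load_qdrant_settings())`.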

In [6]:
client.delete_collection(collection_name="collection1")

True

In [7]:
from qdrant_client.http import models

# creating a collection (A database with vectors)
# client.recreate_collection() -> deletes the collection if it already exists, then creates it again
client.create_collection(
    collection_name="collection1",
    vectors_config=models.VectorParams(size=768,
                                       distance=models.Distance.COSINE)
)

True
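
`Distance.COSINE` makes the collection compare vectors by the angle between them rather than by their length, and `size=768` has to match the dimension of the embeddings stored later. As an illustration only (plain Python, not Qdrant's implementation), cosine similarity can be computed like this:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Two vectors that point in the same direction score 1.0 regardless of their lengths, and orthogonal vectors score 0.0.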

In [8]:
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='collection1')])

## Now let's create a vector store for docs

In [9]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

vector_store = Qdrant(
    client=client,
    collection_name='collection1',
    embeddings=embeddings
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## It's time to store some embeddings into our vector store

A few text files were uploaded manually to the Colab files section. They are loaded below, split into chunks, and converted into embeddings.

In [10]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

loader = TextLoader("Transfer Learning.txt")
documents = loader.load()

# with open("Transfer Learning.txt") as f:
#   documents = f.read()

# text_splitter = CharacterTextSplitter(
#     separator="\n",
#     chunk_size=500,
#     chunk_overlap=90,
#     length_function=len
# )

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n",],
    chunk_size=500,
    chunk_overlap=90,
    length_function=len
)

chunks = text_splitter.split_documents(documents)
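
`chunk_size` caps the length of each chunk and `chunk_overlap` controls how many characters consecutive chunks share. The sketch below is a deliberately simplified fixed-width sliding window, written only to show how the two parameters interact. `RecursiveCharacterTextSplitter` itself splits on the given separators first and then merges the pieces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is made up for this example.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=90):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters.

    Simplified illustration; not the algorithm used by LangChain's splitters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With the values used above, each window advances by 500 - 90 = 410 characters, so the last 90 characters of one chunk reappear at the start of the next.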

In [11]:
chunks[0].page_content

'Infinite numbers of real-world applications use Machine Learning (ML) techniques to develop potentially the best data available for the users. Transfer learning (TL), one of the categories under ML, has received much attention from the research communities in the past few years. Traditional ML algorithms perform under the assumption that a model uses limited data distribution to train and test samples. These conventional methods predict target tasks undemanding and are applied to small data distribution. However, this issue conceivably is resolved using TL. TL is acknowledged for its connectivity among the additional testing and training samples resulting in faster output with efficient results. This paper contributes to the domain and scope of TL, citing situational use based on their periods and a few of its applications. The paper provides an in-depth focus on the techniques; Inductive TL, Transductive TL, Unsupervised TL, which consists of sample selection, and domain adaptation, 

In [12]:
len(chunks)

16

In [13]:
docs = [doc.page_content for doc in chunks]

In [14]:
len(docs)

16

In [15]:
ids = list(range(1, len(chunks) + 1))  # one integer ID per chunk, starting at 1
ids

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

In [16]:
# vector_store.add_documents(chunks)
vector_store.add_texts(texts=docs, ids=ids)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

## Retrieving vectors by ID

In [18]:
client.retrieve(
    collection_name="collection1",
    ids=[2, 5],  # multiple ids can be passed
    # with_vectors=True # the default is False
)

[Record(id=5, payload={'metadata': None, 'page_content': '\nAccording to Matt, he defines TL, a category under ML is when the reuse of pre-existing models to solve current challenges. He also acknowledges that TL is a technique employed to train models together. The concepts of pre-existing training data are utilized to enhance the performance of the ongoing challenge, so the solution need not have to be developed from scratch. Similarly, Daipanja also aligns with the above definition of TL. He further uses the comparison between the traditional ML approach where the data were isolated based on specific tasks, and each challenge was developed from scratch, with limited knowledge to acknowledge one another. Now, however, the TL, the acknowledgment of previous data; trained models for the current training models have been comparatively enhanced and emphasized. An article by Yoshua et al. defines TL as the technique that trains current models with trained models of previous similar relate

## Now let's query the data

### Using FAISS to perform similarity search

In [19]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [20]:
from langchain_community.vectorstores import FAISS

In [21]:
db = FAISS.from_documents(chunks, embeddings)
retriever = db.as_retriever(search_type="similarity_score_threshold",  # or search_type="mmr"
                            search_kwargs={"score_threshold": 0.2,  # threshold score
                                           "k": 2})  # top similar 2 results will be returned
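
With `search_type="similarity_score_threshold"`, the retriever is configured to return at most `k` documents and to leave out any whose relevance score falls below `score_threshold`. A plain-Python sketch of that selection step, assuming the scores have already been computed (the function name and the sample data are invented for illustration; this is not LangChain's code):

```python
def top_k_above_threshold(scored_docs, k=2, score_threshold=0.2):
    """Keep (doc, score) pairs scoring at least score_threshold, best first, at most k."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Invented scores, purely to show the behaviour.
scored = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.3)]
best = top_k_above_threshold(scored)
```

Here `"b"` is dropped by the threshold and `"d"` is cut by `k`, so only the two highest-scoring documents remain.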

In [22]:
response = retriever.get_relevant_documents("what is layer freezing in Transfer Learning")
response

[Document(page_content='\nPeople can hardly afford the luxury of investing resources in data gathering in today’s world since they are rare, inaccessible, often expensive, and difficult to compile. As a result, most people found a better means of data collection: one of the ways is to transfer knowledge between the tasks. This philosophy has inspired Transfer Learning(TL): to improve data gathering and learn in machine learning (ML) using the data compiled before it has been introduced. Most of the algorithms of ML are to predict future outcomes, which are traditionally in the interest of addressing tasks in isolation. Whereas TL does the otherwise, it bridges the data from the source and targets the task to find a solution, perhaps a better one.', metadata={'source': 'Transfer Learning.txt'}),
 Document(page_content='\nTL aims to improve understanding of the current task by relating it to other tasks performed at different periods but through a related source domain. Figure 1 explains

In [23]:
len(response)

2

### Similarity search using Qdrant object

**🔑Note:** To run the `similarity_search()` function, the following package versions need to be pinned:

```
qdrant_client==1.7.2
langchain==0.1.4
sentence-transformers==2.3.1
langchain-community==0.0.16
```

In [24]:
results = vector_store.similarity_search(query="what is layer freezing in Transfer Learning", k=2)

In [25]:
results

[Document(page_content='\nPeople can hardly afford the luxury of investing resources in data gathering in today’s world since they are rare, inaccessible, often expensive, and difficult to compile. As a result, most people found a better means of data collection: one of the ways is to transfer knowledge between the tasks. This philosophy has inspired Transfer Learning(TL): to improve data gathering and learn in machine learning (ML) using the data compiled before it has been introduced. Most of the algorithms of ML are to predict future outcomes, which are traditionally in the interest of addressing tasks in isolation. Whereas TL does the otherwise, it bridges the data from the source and targets the task to find a solution, perhaps a better one.'),
 Document(page_content='\nTL aims to improve understanding of the current task by relating it to other tasks performed at different periods but through a related source domain. Figure 1 explains the improvement brought by using the TL strat

## Counting points (vectors/entries)

In [36]:
count_result = client.count(collection_name="collection1")
print(count_result.count)

16


In [32]:
# count only the points whose payload matches a filter; exact=False requests an approximate count
client.count(collection_name="collection1",
             count_filter=models.Filter(
                 must=[
                     models.FieldCondition(key="", match=models.MatchValue(
                         value='''
                         TL aims to improve understanding of the current task by relating
                          it to other tasks performed at different periods but through a
                          related source domain. Figure 1 explains the improvement brought
                          by using the TL strategy in ML. It enhances learning by creating
                          a relation between previous tasks and the target task, providing
                          logical, faster, and better solutions. TL attempts to provide an
                          efficient manner of learning and communication between the source
                          task and the target task, making learning debatable [3].In addition,
                          TL is most applicable when there is a limited supply of target
                          training data. The strategic use of TL is that not only among the
                          performed(ing) task itself but somewhat beyond and across other
                          tasks. However, the relationship between source and target task is
                          sometimes not compatible. If the user transfers the testing and
                          training samples, it decreases the target task’s performance; such
                          a situation is a negative transfer and vice versa.
                         '''
                        )
                     )
                 ]
             ),
             exact=False)

CountResult(count=8)
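
A `must` filter keeps only the points whose payload satisfies every listed condition, and `count` reports how many remain. The sketch below mimics that logic on an in-memory list of payloads. It is a toy under stated assumptions (exact-match conditions on top-level keys only, invented sample payloads shaped like the `page_content`/`metadata` payload shown earlier) and does not reflect how Qdrant evaluates filters internally or how it approximates counts.

```python
def count_matching(payloads, must):
    """Count payloads for which every (key, value) condition in `must` matches exactly."""
    return sum(
        1
        for payload in payloads
        if all(payload.get(key) == value for key, value in must)
    )


# Invented payloads for illustration.
payloads = [
    {"page_content": "alpha", "metadata": None},
    {"page_content": "beta", "metadata": None},
    {"page_content": "alpha", "metadata": {"source": "x"}},
]
```

For example, `count_matching(payloads, [("page_content", "alpha")])` counts the two payloads whose text is `"alpha"`, and adding a second condition narrows the result further.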

## Deleting specific points by ID

In [38]:
client.delete(
    collection_name="collection1",
    points_selector=models.PointIdsList(
        points=[5]
    )
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

In [39]:
client.count(collection_name="collection1")

CountResult(count=15)
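
Deleting by ID removes only the listed points, which is why the count drops from 16 to 15 after point 5 is deleted. The same bookkeeping can be sketched with a plain dictionary standing in for the collection. The store, its contents, and the helper name `delete_points` are all invented for this example, and skipping unknown IDs is a choice made for this toy rather than a statement about Qdrant's behaviour.

```python
def delete_points(store, ids):
    """Remove the given IDs from a dict-based store and return how many were removed."""
    removed = 0
    for point_id in ids:
        if point_id in store:
            del store[point_id]
            removed += 1
    return removed


# Sixteen placeholder entries keyed 1..16, mirroring the IDs used above.
store = {i: f"chunk {i}" for i in range(1, 17)}
removed = delete_points(store, [5])
```

After the call, `len(store)` is 15 and ID 5 is no longer present.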