<a href="https://colab.research.google.com/github/Muntasir2179/vector-database-learning/blob/main/VD_Qdrant_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Qdrant

* Vector Database
* Open Source
* An alternative to Pinecone database
* Managed Services

https://qdrant.tech/

## Setup

* Setup 1GB free cluster in cloud service
* Vector database persistent
* Database available from URL
* Data available via simple APIs

In [1]:
# https://b47a5326-877a-4a90-be9a-731e13f863db.us-east4-0.gcp.cloud.qdrant.io
api_key = "eNFl4P8wMsGqS3ApS2-5R1Eutrs6Pnunq5ATti_R0xISqur-ztGCDA"

In [2]:
!pip install qdrant_client openai tiktoken langchain sentence-transformers

Collecting qdrant_client
  Downloading qdrant_client-1.7.1-py3-none-any.whl (205 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m205.9/205.9 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting openai
  Downloading openai-1.10.0-py3-none-any.whl (225 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m225.1/225.1 kB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tiktoken
  Downloading tiktoken-0.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m41.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting langchain
  Downloading langchain-0.1.4-py3-none-any.whl (803 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m803.6/803.6 kB[0m [31m29.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sentence-transformers
  Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
import os
import qdrant_client

from langchain.vectorstores import Qdrant
from langchain.embeddings import HuggingFaceEmbeddings

In [4]:
# create a qdrant client
os.environ['QDRANT_HOST'] = 'https://b47a5326-877a-4a90-be9a-731e13f863db.us-east4-0.gcp.cloud.qdrant.io'
os.environ['QDRANT_API_KEY'] = 'eNFl4P8wMsGqS3ApS2-5R1Eutrs6Pnunq5ATti_R0xISqur-ztGCDA'

In [5]:
client = qdrant_client.QdrantClient(
    url=os.getenv('QDRANT_HOST'),
    api_key=os.getenv('QDRANT_API_KEY')
)

In [9]:
from qdrant_client.http import models

# creating a collection (A database with vectors)
# client.recreate_collection() -> does not create collection if it exists and but creates if not exists
client.create_collection(
    collection_name="collection1",
    vectors_config=models.VectorParams(size=768,
                                       distance=models.Distance.COSINE)
)

True

In [10]:
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='collection1')])

## Now let's create a vector store for docs

In [11]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

vector_store = Qdrant(
    client=client,
    collection_name='collection1',
    embeddings=embeddings
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

## It's time to store some embeddings into our vector store

Manually uploaded some text files into the colab files sections to load the texts and convert into embeddings.

In [13]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader("Transfer Learning.txt")
documents = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=90,
    length_function=len
)

chunks = text_splitter.split_documents(documents)



In [14]:
chunks[0].page_content

'Infinite numbers of real-world applications use Machine Learning (ML) techniques to develop potentially the best data available for the users. Transfer learning (TL), one of the categories under ML, has received much attention from the research communities in the past few years. Traditional ML algorithms perform under the assumption that a model uses limited data distribution to train and test samples. These conventional methods predict target tasks undemanding and are applied to small data distribution. However, this issue conceivably is resolved using TL. TL is acknowledged for its connectivity among the additional testing and training samples resulting in faster output with efficient results. This paper contributes to the domain and scope of TL, citing situational use based on their periods and a few of its applications. The paper provides an in-depth focus on the techniques; Inductive TL, Transductive TL, Unsupervised TL, which consists of sample selection, and domain adaptation, 

In [17]:
len(chunks)

16

In [16]:
vector_store.add_documents(chunks)

['daa34e0644b44a3e97af334476a417c9',
 '9c0d50e4d7ec40699ec2fd8c2de7df61',
 '052905a7e3c0442bbbaa6859ef386b78',
 'cf57b2b31f484f23961b01027b3ded40',
 '283db38167e943e9b6e865857c69e480',
 '39960577192d460e8da87c23c0b99c06',
 '6b72db7b77e2481c87d93f96f63b854c',
 'a95fa2d8d78845319c031e05fec78b56',
 '362e17a9e1174add8d415a7cb642dc3b',
 '7e10facf07c144e281db2c064f4e2eea',
 '0ef5381062c3458da1d5dae6d342910a',
 '64d4b1f1f119421cb5dafda41d01a8ec',
 'aa0ecb33ac4d4c5697eb1de29859973d',
 'ad5e6927d21546428c32e83b027dafb0',
 '563518ab895d4810b6c35b570f62f3f6',
 'bb4f06ccbc4c402ea9bc5926db464453']

## Now let's query the data

In [19]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.5/85.5 MB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [29]:
from langchain_community.vectorstores import FAISS

In [31]:
db = FAISS.from_documents(chunks, embeddings)
retriever = db.as_retriever(search_type="similarity_score_threshold",  # or search_type="mmr"
                            search_kwargs={"score_threshold": 0.2,  # threshold score
                                           "k": 2})  # top similar 2 results will be returned

In [32]:
response = retriever.get_relevant_documents("what is layer freezing in Transfer Learning")
response

[Document(page_content='People can hardly afford the luxury of investing resources in data gathering in today’s world since they are rare, inaccessible, often expensive, and difficult to compile. As a result, most people found a better means of data collection: one of the ways is to transfer knowledge between the tasks. This philosophy has inspired Transfer Learning(TL): to improve data gathering and learn in machine learning (ML) using the data compiled before it has been introduced. Most of the algorithms of ML are to predict future outcomes, which are traditionally in the interest of addressing tasks in isolation. Whereas TL does the otherwise, it bridges the data from the source and targets the task to find a solution, perhaps a better one.', metadata={'source': 'Transfer Learning.txt'}),
 Document(page_content='TL aims to improve understanding of the current task by relating it to other tasks performed at different periods but through a related source domain. Figure 1 explains the

In [33]:
len(response)

2

In [35]:
vector_store.similarity_search(query="what is layer freezing in Transfer Learning", k=2)

[Document(page_content='People can hardly afford the luxury of investing resources in data gathering in today’s world since they are rare, inaccessible, often expensive, and difficult to compile. As a result, most people found a better means of data collection: one of the ways is to transfer knowledge between the tasks. This philosophy has inspired Transfer Learning(TL): to improve data gathering and learn in machine learning (ML) using the data compiled before it has been introduced. Most of the algorithms of ML are to predict future outcomes, which are traditionally in the interest of addressing tasks in isolation. Whereas TL does the otherwise, it bridges the data from the source and targets the task to find a solution, perhaps a better one.', metadata={'source': 'Transfer Learning.txt'}),
 Document(page_content='TL aims to improve understanding of the current task by relating it to other tasks performed at different periods but through a related source domain. Figure 1 explains the