<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/2-Vector%20Databases%20with%20LLMs/2_1_Vector_Databases_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>2.1-Vector Databases with LLMs</h2>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)
__________
Models: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Colab environment: CPU.

Keys:
* Vector Database.
* ChromaDB.
* RAG
* Embeddings.

Article related: [Harness the Power of Vector Databases: Influencing Language Models with Personalized Information.](https://medium.com/towards-artificial-intelligence/harness-the-power-of-vector-databases-influencing-language-models-with-personalized-information-ab2f995f09ba)
__________


If you are running this notebook on Colab you may need a High-RAM environment, depending on the model used.

If you don't have a Colab Pro account you can run this notebook on Kaggle, since its free tier gives you more memory.

Here you have a version of this notebook that uses a Dolly 3B model and can be run on Kaggle: [Vector Databases with LLMs-Kaggle Version](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/2-Vector%20Databases%20with%20LLMs/how-to-use-a-embedding-database-with-a-llm-from-hf.ipynb)
__________



In this notebook you will see how to use an embedding database to store the information that you want to pass to a large language model so that it takes it into account in its responses.

The information could be your own documents, or the content of a business knowledge base.

I have prepared the notebook to work with three different Kaggle datasets, so it is easy to run tests with different data.

![RAG Structure](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_2-7.jpg?raw=true)


# Import Libraries.
To start, it is necessary to install some Python packages.

* **sentence transformers**. This library is needed to transform sentences into fixed-length vectors, also known as embeddings.

* **chromadb**. This is our vector database. ChromaDB is open source and easy to use, and it is one of the most widely used vector databases for storing embeddings.
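
To make the idea of a fixed-length vector concrete, here is a toy sketch that does not use sentence-transformers at all. It counts words from a tiny hand-picked vocabulary (an assumption made only for illustration) and compares the resulting vectors with cosine similarity. Real embedding models learn dense vectors that capture meaning rather than word counts, but the workflow has the same shape: text in, vector out, compare vectors.

```python
import math

# Hypothetical toy vocabulary; a real embedding model learns its own features.
VOCAB = ["laptop", "computer", "football", "match", "price"]


def toy_embed(sentence: str) -> list[float]:
    """Map a sentence to a fixed-length vector of word counts over VOCAB."""
    words = sentence.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = toy_embed("cheap laptop price")
doc_tech = toy_embed("new laptop computer price announced")
doc_sport = toy_embed("football match ends in a draw")

print(len(query))                           # every vector has len(VOCAB) entries
print(cosine_similarity(query, doc_tech))   # shares 'laptop' and 'price'
print(cosine_similarity(query, doc_sport))  # shares no vocabulary words
```

The technology headline scores higher than the sports one, which is the same kind of ranking the vector database performs later with real embeddings.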

In [None]:
!pip install -q transformers

In [None]:
!pip install -q sentence-transformers
#!pip install -q xformers==0.0.23
!pip install -q chromadb

I'm sure you know the next two packages, NumPy and Pandas, perhaps the most used Python libraries.

NumPy is a powerful library for numerical computing.

Pandas is a library for data manipulation.

In [None]:
import numpy as np
import pandas as pd

# Load the Dataset
As you will see, the notebook is ready to work with three different datasets. Just uncomment the lines of the dataset you want to use.

I selected datasets containing news. Two of them have just a brief description of each news item, but the other contains the full text.

As you are working in a memory-limited environment with just a few GB of memory available, I limited the number of news items with the variable MAX_NEWS.

The name of the field containing the text of the news item is stored in the variable *DOCUMENT* and the metadata in *TOPIC*.

# Copy Kaggle Dataset
I used the kotartemiy/topic-labeled-news-dataset; it is the only dataset used in this notebook.
https://www.kaggle.com/datasets/kotartemiy/topic-labeled-news-dataset

Artem Burgara. (2020). R vs. Python: Topic Labeled News Dataset. Retrieved December 2023, from https://www.kaggle.com/discussions/general/46091.

But you can use other datasets; I encourage you to try at least one of these:
* https://www.kaggle.com/datasets/gpreda/bbc-news
* https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
!pip install kaggle

In [None]:
import os
#This directory should contain your kaggle.json file with your key
# os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
os.environ['KAGGLE_CONFIG_DIR'] = '/workspaces/Large-Language-Model-Notebooks-Course'

In [None]:
!kaggle datasets download -d kotartemiy/topic-labeled-news-dataset

In [None]:
import zipfile

# Define the path to your zip file
# file_path = '/content/topic-labeled-news-dataset.zip'
file_path = '/workspaces/Large-Language-Model-Notebooks-Course/2-Vector Databases with LLMs/topic-labeled-news-dataset.zip'

In [None]:
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall('/workspaces/Large-Language-Model-Notebooks-Course/kaggle')

# Loading the Dataset

Although I've used a single dataset for the notebook, I've set it up to make it easy to test with different datasets available on Kaggle.

As we are working in a free and limited environment, where we can use just 30 GB of memory, I limited the number of news items with the variable MAX_NEWS.

The name of the field containing the text of the news item is stored in the variable DOCUMENT and the metadata in TOPIC.


In [None]:
import numpy as np
import pandas as pd
news = pd.read_csv('/workspaces/Large-Language-Model-Notebooks-Course/kaggle/labelled_newscatcher_dataset.csv', sep=';')
MAX_NEWS = 1000
DOCUMENT="title"
TOPIC="topic"

#Just in case you want to try with a different Dataset.
#news = pd.read_csv('/content/drive/MyDrive/kaggle/bbc_news.csv')
#MAX_NEWS = 1000
#DOCUMENT="description"
#TOPIC="title"

#news = pd.read_csv('/content/drive/MyDrive/kaggle/mit-ai-news-published-till-2023/articles.csv')
#MAX_NEWS = 100
#DOCUMENT="Article Body"
#TOPIC="Article Header"
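
The commented alternatives above can also be kept in a small lookup table, so that switching datasets means changing a single key. This is only a sketch of that idea: the dictionary `DATASETS`, its short keys, and the helper `dataset_settings` are hypothetical names, while the paths, column names, and limits are the ones used in this cell.

```python
# Hypothetical lookup table built from the three configurations in this cell.
DATASETS = {
    "newscatcher": {
        "path": "/workspaces/Large-Language-Model-Notebooks-Course/kaggle/labelled_newscatcher_dataset.csv",
        "sep": ";",
        "max_news": 1000,
        "document": "title",
        "topic": "topic",
    },
    "bbc": {
        "path": "/content/drive/MyDrive/kaggle/bbc_news.csv",
        "sep": ",",
        "max_news": 1000,
        "document": "description",
        "topic": "title",
    },
    "mit": {
        "path": "/content/drive/MyDrive/kaggle/mit-ai-news-published-till-2023/articles.csv",
        "sep": ",",
        "max_news": 100,
        "document": "Article Body",
        "topic": "Article Header",
    },
}


def dataset_settings(name: str) -> dict:
    """Return the settings for one dataset, with a readable error for typos."""
    if name not in DATASETS:
        raise ValueError(f"Unknown dataset {name!r}; choose one of {sorted(DATASETS)}")
    return DATASETS[name]


cfg = dataset_settings("newscatcher")
MAX_NEWS, DOCUMENT, TOPIC = cfg["max_news"], cfg["document"], cfg["topic"]
# news = pd.read_csv(cfg["path"], sep=cfg["sep"])  # needs pandas and the downloaded file
print(MAX_NEWS, DOCUMENT, TOPIC)
```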

ChromaDB requires that the data has a unique identifier. You can achieve it with the statement below, which creates a new column called **id** at the end of the DataFrame.

In [None]:
news["id"] = news.index
news.head(3)

In [None]:
#Because it is just an example we select a small portion of the news.
subset_news = news.head(MAX_NEWS)
subset_news.shape

# Import and configure the Vector Database
You are going to use ChromaDB, one of the most popular open-source embedding databases.

First you need to import ChromaDB, and after that import the **Settings** class from the **chromadb.config** module. This class lets you change the settings of the ChromaDB system and customize its behavior.

In [None]:
# ! pip install pysqlite3-binary
# Fix for a known upstream bug
import chromadb
from chromadb.config import Settings

Now you need to create the settings object by calling the Settings class imported previously. The object is stored in the variable **settings_chroma**.

You need to provide two parameters. They belong to the old configuration style, which is kept commented out in the next cell; newer ChromaDB versions only need the storage path passed to **PersistentClient**.

* **chroma_db_impl**. Here you specify the database implementation and the format used to store the data. I chose **duckdb** because of its high performance. It operates primarily in memory and is fully compatible with SQL. The **parquet** storage format is a good fit for tabular data, with good compression rates and performance.

* **persist_directory**. It contains the directory where the data will be stored. It is possible to work without a directory, keeping the data in memory without persistence, but some cloud providers or platforms like Kaggle don't support that.

In [None]:
#OLD VERSION
#settings_chroma = Settings(chroma_db_impl="duckdb+parquet",
#                          persist_directory='./input')
#chroma_client = chromadb.Client(settings_chroma)

# The settings below can be used to replace the default embeddings
# settings = Settings(
#     chroma_db_impl="duckdb+parquet",
#     persist_directory="./data",
#     anonymized_telemetry=False,
#     embedding_function=custom_embedding  # use a custom embedding function
# )

# chroma_client = chromadb.Client(settings)

#NEW VERSION >= 0.4.0
chroma_client = chromadb.PersistentClient(path="/workspaces/Large-Language-Model-Notebooks-Course/chromadb")

# Filling and Querying the ChromaDB Database
The data in ChromaDB is stored in collections. If the collection already exists, it is necessary to delete it first.

In the next lines, the collection is created by calling the ***create_collection*** function of the ***chroma_client*** created above.

In [None]:
from datetime import datetime

In [None]:
collection_name = "news_collection"+datetime.now().strftime("%Y-%m-%d")
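# Compare against every existing collection, not just the first one.
# Depending on the ChromaDB version, list_collections() returns objects or plain names.
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)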

collection = chroma_client.create_collection(name=collection_name)
print(collection_name)

It's time to add the data to the collection. When calling the ***add*** function you should provide at least ***documents***, ***metadatas*** and ***ids***.
* **documents** stores the full news text; remember that it is contained in a different column for each dataset.
* In **metadatas** we can provide a list of topics.
* In **ids** a unique identifier for each row must be provided. It MUST be unique! I'm creating the IDs from a simple range, one per row.

In [None]:
collection.add(
    documents=subset_news[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
    # One id per row, even if the dataset has fewer than MAX_NEWS rows.
    ids=[f"id{x}" for x in range(len(subset_news))],
)
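
If you raise MAX_NEWS considerably, a single `add` call can become very large, and some ChromaDB versions reject batches above an internal maximum size. A minimal sketch of adding the data in chunks follows. The helper `chunked` and the batch size of 500 are assumptions for illustration, not something ChromaDB prescribes, and the stand-in lists replace the real columns so the sketch runs on its own.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in data; in the notebook these would be
# subset_news[DOCUMENT].tolist() and the matching ids.
documents = [f"headline {i}" for i in range(1200)]
ids = [f"id{i}" for i in range(len(documents))]

BATCH_SIZE = 500  # assumed value, tune it for your environment
for doc_batch, id_batch in zip(chunked(documents, BATCH_SIZE), chunked(ids, BATCH_SIZE)):
    # collection.add(documents=doc_batch, ids=id_batch)  # uncomment inside the notebook
    print(len(doc_batch), id_batch[0], id_batch[-1])
```

Slicing both lists with the same batch size keeps every document aligned with its id.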

In [None]:
results = collection.query(query_texts=["laptop"], n_results=10 )

print(results)
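
The printed dictionary is easier to read once it is unpacked. `query` returns one inner list per query text, so with a single query the hits live at index 0 of each field. The sketch below uses a small hand-written dictionary with that nesting (the values are made up, not real output) and a hypothetical helper, `rows_from_results`.

```python
def rows_from_results(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the parallel lists of one query into a list of per-hit dictionaries."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]
    return [
        {"id": i, "document": d, "metadata": m, "distance": dist}
        for i, d, m, dist in zip(ids, docs, metas, dists)
    ]


# Made-up example with the same nesting as a ChromaDB query result.
fake_results = {
    "ids": [["id12", "id7"]],
    "documents": [["Laptop sales rise", "New tablet announced"]],
    "metadatas": [[{"topic": "TECHNOLOGY"}, {"topic": "TECHNOLOGY"}]],
    "distances": [[0.82, 1.05]],
}

for row in rows_from_results(fake_results):
    print(f'{row["distance"]:.2f}  {row["id"]}  {row["document"]}')
```

In the notebook you would call `rows_from_results(results)`; a smaller distance means the document is closer to the query text.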

# Vector MAP

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [None]:
getado = collection.get(ids="id141",
                       include=["documents", "embeddings"])

In [None]:
word_vectors = getado["embeddings"]
word_list = getado["documents"]
# word_vectors
word_list
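
The imports above suggest projecting the stored embeddings down to two dimensions so they can be drawn on a map. A minimal sketch of that projection follows. It uses random vectors as a stand-in for real embeddings (the dimension 384 is an assumption, chosen to match what ChromaDB's default embedding model is believed to produce), so the resulting points carry no meaning. In the notebook you would fetch the embeddings of several ids and pass those vectors instead; PCA with two components needs at least two vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 20 random vectors of an assumed dimension 384.
rng = np.random.default_rng(seed=0)
fake_embeddings = rng.normal(size=(20, 384))

# Reduce every vector to 2 coordinates that keep as much variance as possible.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(fake_embeddings)

print(points_2d.shape)  # one (x, y) pair per vector
# To draw the map: plt.scatter(points_2d[:, 0], points_2d[:, 1]); plt.show()
```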

Once the information is in the database you can query it and ask for data that matches your needs. The search is done on the content of the documents. It doesn't look for the exact word or phrase; the results are based on the similarity between the search terms and the content of the documents.

The metadata is not used in the search, but it can be used for filtering or refining the results after the initial search.
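
As a sketch of that post-search filtering, the helper below keeps only the hits whose metadata matches a given value. The function name `filter_hits` and the sample data are hypothetical. ChromaDB's `query` also accepts a `where` argument that applies a metadata filter on the database side, which is usually preferable when you know the filter in advance.

```python
def filter_hits(results: dict, key: str, value: str) -> list[str]:
    """Return documents of the first query whose metadata[key] equals value."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    return [doc for doc, meta in zip(documents, metadatas) if meta.get(key) == value]


# Made-up result with the same nesting as a ChromaDB query result.
sample = {
    "documents": [["Laptop sales rise", "Team wins the cup", "New phone launched"]],
    "metadatas": [[{"topic": "TECHNOLOGY"}, {"topic": "SPORTS"}, {"topic": "TECHNOLOGY"}]],
}

print(filter_hits(sample, "topic", "TECHNOLOGY"))

# Database-side alternative inside the notebook:
# collection.query(query_texts=["laptop"], n_results=10, where={"topic": "TECHNOLOGY"})
```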

# Loading the model and creating the prompt
TRANSFORMERS!!
Time to use the **transformers** library, the most famous library from [Hugging Face](https://huggingface.co/) for working with language models.

We are importing:
* **AutoTokenizer**: a utility class for tokenizing text inputs so that they are compatible with various pre-trained language models.
* **AutoModelForCausalLM**: it provides an interface to pre-trained language models specifically designed for language generation tasks using causal language modeling (e.g., GPT models), or the model used in this notebook, ***TinyLlama-1.1B-Chat-v1.0***.
* **pipeline**: provides a simple interface for performing various natural language processing (NLP) tasks, such as text generation (our case) or text classification.

The model I have selected is [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0), which is one of the smartest small language models. Even so, it still has 1.1 billion parameters.

Please feel free to test [different models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending); you need to search for NLP models trained for text generation. My recommendation is to choose "small" models, or we will run out of memory on Kaggle.

In [None]:
!pip install -q einops

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)



The next step is to initialize the pipeline using the objects created above.

The model's response is limited to 256 tokens. For this project I'm not interested in longer responses, but it can easily be extended to whatever length you want.

By setting ***device_map*** to ***auto*** we instruct the pipeline to automatically select the most appropriate device, CPU or GPU, for the text generation.

In [None]:
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",
)

## Creating the extended prompt
To create the prompt you can use the result of querying the vector database and the sentence introduced by the user.

The prompt has two parts: the **relevant context**, which is the information retrieved from the database, and the **user's question**.

You only need to join the two parts together to create the prompt sent to the model.

You can limit the length of the context passed to the model, because you can run into memory problems with one of the datasets, which contains really large texts in the document field.

In [None]:
question = "Can I buy a new Toshiba laptop?"
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
#context = context[0:5120]
prompt_template = f"""
Relevant context: {context}
Considering the relevant context, answer the question.
Question: {question}
Answer: """
prompt_template
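
The same prompt construction can be wrapped in a function so the context cap is always applied. This is a sketch, not the notebook's required structure: `build_prompt` and its parameter names are hypothetical, the default cap of 5120 characters comes from the commented line in the cell above, and the two headlines are made-up stand-ins for `results["documents"][0]`.

```python
def build_prompt(question: str, documents: list[str], max_context_chars: int = 5120) -> str:
    """Join the retrieved documents, cap the context length, and add the question."""
    context = " ".join(f"#{doc}" for doc in documents)
    context = context[:max_context_chars]  # crude character cap, may cut mid-word
    return (
        f"\nRelevant context: {context}\n"
        "Considering the relevant context, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )


# Made-up headlines standing in for the documents returned by the query.
docs = ["Toshiba exits the laptop business", "Best budget laptops this year"]
prompt = build_prompt("Can I buy a new Toshiba laptop?", docs, max_context_chars=40)
print(prompt)
```

The cap counts characters, not tokens, so it only approximates how much text the model actually receives.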

Now all that remains is to send the prompt to the model and wait for its response!


In [None]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])
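
By default the text-generation pipeline returns the prompt followed by the completion, so the printed text repeats the whole context. A small sketch for keeping only the answer follows. `extract_answer` is a hypothetical helper and the strings are made-up examples. Passing `return_full_text=False` when calling the pipeline is an alternative that avoids the post-processing.

```python
def extract_answer(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


# Made-up strings standing in for prompt_template and the pipeline output.
example_prompt = "Question: Can I buy a new Toshiba laptop?\nAnswer: "
example_output = example_prompt + "Based on the context, it seems unlikely."

print(extract_answer(example_output, example_prompt))
# In the notebook: extract_answer(lm_response[0]["generated_text"], prompt_template)
```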

__________


# Connecting to a ChromaDB existing collection

In [None]:
!pip install chromadb

In [None]:
import chromadb
chroma_client_2 = chromadb.PersistentClient(path="/workspaces/Large-Language-Model-Notebooks-Course/chromadb")

In [None]:
collection2 = chroma_client_2.get_collection(name=collection_name)
# Query the collection that was just reopened, not the one created earlier.
results2 = collection2.query(query_texts=["laptop"], n_results=10 )


In [None]:
print(results2)

# Conclusions
A very short notebook, but with a lot of content.

You have used a vector database to store information. You then retrieved it and used it to create an extended prompt, which you sent to one of the newer large language models available on Hugging Face.

The model has returned a response taking into account the context that you passed to it in the prompt.

This way of working with language models is very powerful.

It is possible to make the model use our information without the need for fine-tuning. This technique has some significant advantages over fine-tuning: no training run is needed, and the information can be updated just by changing the content of the database.