## Running inference on .gguf models with a vector DB

I used LangChain, llama.cpp, and FAISS.

This notebook shows how to run inference with a vector DB (FAISS).

### Requirements

Python 3.12 is not currently compatible. Make sure you have an earlier Python version installed.
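
A quick way to confirm the interpreter version before installing anything is sketched below. The helper name `is_supported_python` is hypothetical, and it only checks the upper bound mentioned above (older than 3.12); it does not know which older versions each package actually supports.

```python
import sys


def is_supported_python(version_info=sys.version_info):
    """Return True when the interpreter is older than Python 3.12."""
    return tuple(version_info[:2]) < (3, 12)


if not is_supported_python():
    # Only a warning: the notebook itself does not enforce this check.
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "the packages below may fail to install. Consider an earlier 3.x release."
    )
```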

In [1]:
%pip install -U gpt4all chromadb langchainhub sentence-transformers faiss-cpu langchain

Collecting gpt4all
  Using cached gpt4all-2.0.2-py3-none-macosx_10_15_universal2.whl.metadata (892 bytes)
Collecting chromadb
  Using cached chromadb-0.4.18-py3-none-any.whl.metadata (7.4 kB)
Collecting langchainhub
  Using cached langchainhub-0.1.14-py3-none-any.whl.metadata (478 bytes)
Collecting sentence-transformers
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp39-cp39-macosx_11_0_arm64.whl (2.7 MB)
Collecting langchain
  Using cached langchain-0.0.346-py3-none-any.whl.metadata (16 kB)
Collecting requests (from gpt4all)
  Using cached requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from gpt4all)
  Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  

### Install llama-cpp-python

If you are using an Apple silicon Mac, run the command in the next cell. Otherwise, install llama-cpp-python with the build options that match your platform.

In [3]:
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.20.tar.gz (8.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
  Downloading typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
  Downloading numpy-1.26.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (
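
If you are unsure whether the Metal build above applies to your machine, a small check along these lines may help. It is only a sketch: `is_apple_silicon` is a hypothetical helper, and it assumes a native arm64 Python. An x86_64 Python running under Rosetta would report `x86_64` and be treated as "not Apple silicon".

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True for macOS ("Darwin") running on an arm64 CPU."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


if is_apple_silicon():
    print("Apple silicon detected: the Metal build command above applies.")
else:
    print("Not Apple silicon: install llama-cpp-python with the options for your platform.")
```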

### Import data

In [1]:
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import CharacterTextSplitter

loader = CSVLoader("./cleaned_data_knowledge.csv")
documents = loader.load()

# Load the data, split the text into chunks of a fixed size, and join the pieces with the separator
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    separator="\n",
)
texts = text_splitter.split_documents(documents)

print(len(texts))

286
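
To make the splitting step more concrete, here is a simplified pure-Python approximation of what a separator-based splitter with `chunk_overlap=0` does: split on the separator, then greedily merge pieces until adding another would exceed `chunk_size`. This is not LangChain's implementation, and `split_text` is a hypothetical name used only for illustration.

```python
def split_text(text, chunk_size=1000, separator="\n"):
    """Greedy splitter: split on the separator, then merge pieces up to chunk_size characters."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new chunk
            # (an oversized single piece is kept whole rather than cut).
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo_chunks = split_text("aaaa\nbbbb\ncccc", chunk_size=9)
print(demo_chunks)
```

With `chunk_size=9`, the first two lines fit together in one chunk and the third starts a new one.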


### Embed imported data

In [2]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Load the embedding model
embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")

# Embed the document texts and build a FAISS index from them
index = FAISS.from_documents(
    documents=texts,
    embedding=embeddings,
)

# Save the index locally as faiss_db
index.save_local("faiss_db")
# Load the index back from the local faiss_db directory
docsearch = FAISS.load_local("faiss_db", embeddings)
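
Conceptually, searching the index means finding the stored vectors closest to the query vector. The sketch below does this by brute force with L2 distance on toy 2-D vectors. It assumes a flat, exhaustive index (which I believe is what LangChain's FAISS wrapper builds by default) and does not use FAISS itself; `l2_distance`, `nearest`, and the toy data are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, stored_vecs, k=2):
    """Return the indices of the k stored vectors closest to query_vec."""
    ranked = sorted(
        range(len(stored_vecs)),
        key=lambda i: l2_distance(query_vec, stored_vecs[i]),
    )
    return ranked[:k]


# Toy "embeddings"; real ones from the embedding model have far more dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
top2 = nearest([0.9, 0.1], toy_vectors, k=2)
print(top2)
```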



### Import model

In [10]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

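n_gpu_layers = 1
# Stream generated tokens to stdout; a distinct name avoids shadowing the CallbackManager class
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    # model_path: location of the model downloaded to the local machine
    model_path="/Volumes/Jinho/AIDoc_test_models/llama-2-7b-pubmed-qa-211k.gguf",
    temperature=0.0,
    top_p=1,
    max_tokens=8192,
    verbose=True,
    # n_ctx: maximum context length the model can process at once
    n_ctx=4096,
    n_gpu_layers=n_gpu_layers,
    callback_manager=callback_manager,
)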

llama_model_loader: loaded meta data with 15 key-value pairs and 291 tensors from /Volumes/Jinho/AIDoc_test_models/llama-2-7b-pubmed-qa-211k.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q8_0     [ 1
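
Note that `max_tokens=8192` is larger than `n_ctx=4096`. Assuming the prompt and the generated tokens have to share the same context window, the room left for generation is roughly `n_ctx` minus the prompt length. The arithmetic is sketched below. `generation_budget` is a hypothetical helper, not part of llama-cpp-python, and the prompt sizes (2893 and 168 tokens) are taken from the timing logs later in this notebook.

```python
def generation_budget(prompt_tokens, n_ctx=4096, max_tokens=8192):
    """Tokens that can still be generated once the prompt occupies part of the context."""
    remaining = max(n_ctx - prompt_tokens, 0)
    return min(remaining, max_tokens)


# A retrieval-augmented prompt (2893 tokens) vs. a plain prompt (168 tokens).
print(generation_budget(2893))
print(generation_budget(168))
```

Under this assumption the effective limit is set by `n_ctx`, so the larger `max_tokens` value never comes into play.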

Build a chain that uses the model without the vector DB. This is the `llm_chain` object.

In [4]:

from langchain import PromptTemplate, LLMChain


template = """
### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{question}

### Response:
"""
prompt = PromptTemplate(template=template, input_variables=["question"])



llm_chain = LLMChain(prompt=prompt, llm=llm)
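
To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format`, which should be equivalent to `PromptTemplate`'s default f-string-style formatting for a simple template like this one. `render_prompt`, `template_sketch`, and the sample question are hypothetical.

```python
template_sketch = """
### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{question}

### Response:
"""


def render_prompt(question):
    """Substitute the question into the {question} placeholder."""
    return template_sketch.format(question=question)


rendered = render_prompt("What is the most likely diagnosis?")
print(rendered)
```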



Build a chain that uses the model together with the vector DB. This is the `qa` object.

In [5]:
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.chains import RetrievalQA

# Create an embeddings filter with a similarity threshold of 0.7
# Candidate texts are embedded and kept only if they meet the threshold
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.70,
)
# Create the compression retriever
compression_retriever = ContextualCompressionRetriever(
    # Use embeddings_filter as the compressor
    base_compressor=embeddings_filter,
    # The base retriever finds texts similar to the search query
    base_retriever=docsearch.as_retriever(),
)
# Call the RetrievalQA.from_chain_type class method to create the question-answering object
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=compression_retriever)
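
The idea behind the similarity threshold can be sketched in a few lines: compute the cosine similarity between the query vector and each retrieved document vector, and keep only documents scoring above 0.70. This is an illustration on toy vectors, not the `EmbeddingsFilter` source; the helper names are hypothetical and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_by_similarity(query_vec, doc_vecs, threshold=0.70):
    """Return the indices of documents whose similarity to the query exceeds the threshold."""
    return [
        i for i, vec in enumerate(doc_vecs)
        if cosine_similarity(query_vec, vec) > threshold
    ]


toy_docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
kept = filter_by_similarity([1.0, 0.0], toy_docs)
print(kept)
```

Here the first document is identical in direction to the query, the second sits at 45 degrees (similarity about 0.707), and the third is orthogonal, so only the first two pass.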

Test run without vector DB

In [None]:
prompt = """A 16-year-old girl presents to the emergency department with a 3-day history of abdominal pain. She describes the onset as being initially at the umbilical region and then gradually migrating to the right lower quadrant (RLQ). The pain has escalated in severity, and she currently rates it as 7 on the Numeric Rating Scale (NRS), noting that it has become severe enough to hinder her movements. She denies consuming any unusual foods recently. She also reports feelings of nausea. On examination, there is tenderness elicited upon palpation of the RLQ. What is the most likely diagnosis? Answer only one most likely diagnosis and do not say anything else."""


response = llm_chain.run(prompt)
print(response)

Test run with vector DB

In [None]:
prompt = """A 16-year-old girl presents to the emergency department with a 3-day history of abdominal pain. She describes the onset as being initially at the umbilical region and then gradually migrating to the right lower quadrant (RLQ). The pain has escalated in severity, and she currently rates it as 7 on the Numeric Rating Scale (NRS), noting that it has become severe enough to hinder her movements. She denies consuming any unusual foods recently. She also reports feelings of nausea. On examination, there is tenderness elicited upon palpation of the RLQ. What is the most likely diagnosis? Answer only one most likely diagnosis and do not say anything else."""

response = qa.run(prompt)

print(response)
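
For intuition about `chain_type="stuff"`: the retrieved passages are concatenated and placed into a single prompt together with the question. The sketch below shows the idea only. The prompt wording and the `stuff_prompt` helper are hypothetical, and LangChain's real template text differs.

```python
def stuff_prompt(question, retrieved_texts, separator="\n\n"):
    """Concatenate all retrieved passages into one context block ahead of the question."""
    context = separator.join(retrieved_texts)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


example = stuff_prompt(
    "What is the most likely diagnosis?",
    ["Passage one about abdominal pain.", "Passage two about the right lower quadrant."],
)
print(example)
```

Because every retrieved passage is included in full, prompts built this way are much longer than the bare question, which is consistent with the large prompt token counts in the timing logs for the vector DB runs.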

### Save the result with vector DB in .csv format

In [6]:
import pandas as pd

# Read the '20qa.csv' file
testData = pd.read_csv('./20qa.csv', encoding='utf-8')

# Initialize an empty DataFrame with the specified columns
columns = ['question', 'answer', 'response']
result = pd.DataFrame(columns=columns)

# Append rows to the DataFrame for each question-answer pair
for i in range(len(testData)):
    prompt = testData['question'][i]
    response = qa.run(prompt)  # Make sure 'qa.run' is a valid function or method
    new_row = pd.DataFrame([[prompt, testData['answer'][i], response]], columns=columns)
    result = pd.concat([result, new_row], ignore_index=True)

# Save the DataFrame to a CSV file
result.to_csv("./7b_llama_rag.csv", index=False, encoding='utf-8')



llama_print_timings:        load time =   14198.28 ms
llama_print_timings:      sample time =       1.21 ms /     4 runs   (    0.30 ms per token,  3319.50 tokens per second)
llama_print_timings: prompt eval time =   76546.26 ms /  2893 tokens (   26.46 ms per token,    37.79 tokens per second)
llama_print_timings:        eval time =     179.45 ms /     3 runs   (   59.82 ms per token,    16.72 tokens per second)
llama_print_timings:       total time =   77536.64 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =   14198.28 ms
llama_print_timings:      sample time =       0.67 ms /     4 runs   (    0.17 ms per token,  5943.54 tokens per second)
llama_print_timings: prompt eval time =   53750.06 ms /  2207 tokens (   24.35 ms per token,    41.06 tokens per second)
llama_print_timings:        eval time =     176.74 ms /     3 runs   (   58.91 ms per token,    16.97 tokens per second)
llama_print_timings:       total time =   54273.98 ms
Llama.generate: prefix-
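
The loop above rebuilds the DataFrame with `pd.concat` on every iteration. A possibly simpler pattern is to collect plain dictionaries first and create the DataFrame once at the end. The sketch below shows only the collection step in pure Python; `collect_rows` is a hypothetical helper and `fake_answer` is a stand-in for `qa.run` so the example runs without a model.

```python
def collect_rows(questions, answers, answer_fn):
    """Build one dict per question with the reference answer and the model response."""
    rows = []
    for question, answer in zip(questions, answers):
        rows.append({
            "question": question,
            "answer": answer,
            "response": answer_fn(question),
        })
    return rows


def fake_answer(question):
    # Stand-in for qa.run(question); returns a fixed string per question.
    return "stub response for: " + question


rows = collect_rows(["q1", "q2"], ["a1", "a2"], fake_answer)
print(rows)
```

In the notebook this could be used as `rows = collect_rows(testData['question'], testData['answer'], qa.run)` followed by `pd.DataFrame(rows, columns=columns).to_csv(...)`.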

### Save the result without vector DB in .csv format

In [11]:
import pandas as pd

# Read the '20qa.csv' file
testData = pd.read_csv('./20qa.csv', encoding='utf-8')

# Initialize an empty DataFrame with the specified columns
columns = ['question', 'answer', 'response']
result = pd.DataFrame(columns=columns)

# Append rows to the DataFrame for each question-answer pair
for i in range(len(testData)):
    prompt = testData['question'][i]
    response = llm_chain.run(prompt)  # Run the chain that does not use the vector DB
    new_row = pd.DataFrame([[prompt, testData['answer'][i], response]], columns=columns)
    result = pd.concat([result, new_row], ignore_index=True)

# Save the DataFrame to a CSV file
result.to_csv("./7b_pubmed.csv", index=False, encoding='utf-8')


Llama.generate: prefix-match hit

llama_print_timings:        load time =   14198.28 ms
llama_print_timings:      sample time =       4.19 ms /    18 runs   (    0.23 ms per token,  4301.08 tokens per second)
llama_print_timings: prompt eval time =    8194.79 ms /   168 tokens (   48.78 ms per token,    20.50 tokens per second)
llama_print_timings:        eval time =     844.47 ms /    18 runs   (   46.91 ms per token,    21.32 tokens per second)
llama_print_timings:       total time =    9096.18 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =   14198.28 ms
llama_print_timings:      sample time =       4.18 ms /    23 runs   (    0.18 ms per token,  5502.39 tokens per second)
llama_print_timings: prompt eval time =    2774.28 ms /   138 tokens (   20.10 ms per token,    49.74 tokens per second)
llama_print_timings:        eval time =    1026.92 ms /    22 runs   (   46.68 ms per token,    21.42 tokens per second)
llama_print_timings:       total time =    3
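
To compare the runs with and without the vector DB, the prompt sizes and throughput can be pulled out of the `llama_print_timings` lines. The sketch below parses the "prompt eval time" lines with a regular expression. The pattern is written against the log format shown in this notebook and may need adjusting for other llama.cpp versions; `prompt_eval_stats` is a hypothetical helper.

```python
import re

# Matches e.g. "prompt eval time =   76546.26 ms /  2893 tokens"
PROMPT_EVAL = re.compile(r"prompt eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")


def prompt_eval_stats(log_text):
    """Return (prompt_tokens, tokens_per_second) for each prompt eval line in the log."""
    stats = []
    for ms, tokens in PROMPT_EVAL.findall(log_text):
        seconds = float(ms) / 1000.0
        stats.append((int(tokens), int(tokens) / seconds))
    return stats


# Two lines copied from the outputs above: one with the vector DB, one without.
sample_log = """
llama_print_timings: prompt eval time =   76546.26 ms /  2893 tokens (   26.46 ms per token,    37.79 tokens per second)
llama_print_timings: prompt eval time =    8194.79 ms /   168 tokens (   48.78 ms per token,    20.50 tokens per second)
"""
stats = prompt_eval_stats(sample_log)
for tokens, tps in stats:
    print(f"{tokens} prompt tokens at {tps:.2f} tokens/s")
```

In these two samples the retrieval-augmented prompt is roughly 17 times longer than the plain one (2893 vs. 168 tokens), which accounts for most of the difference in total time.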