<a href="https://colab.research.google.com/github/GaryPython/Cathay_LLM/blob/main/R4/R4_Langchain_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain
!pip install -U langchain-community
!pip install chromadb
!pip install sentence_transformers




# R4: Efficient Model Serving
- Vector databases
- Quantization

In [None]:
from langchain_core.documents import Document
from langchain.vectorstores import Chroma
#from langchain_chroma import Chroma
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

import re
import chromadb
from pprint import pprint

import pandas as pd
from sentence_transformers import SentenceTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.metrics.pairwise import cosine_similarity

import torch
from transformers import BitsAndBytesConfig
from transformers import LlamaForCausalLM



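## Basic vector database operations
Chroma is a vector database for building AI applications on top of vector embeddings. Embeddings can represent text and images, and soon audio and video as well.

### Create the DB
A collection is where you store embeddings, documents, and any additional metadata. You give each collection a name, which plays a role similar to a database name in a relational database such as MySQL.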

In [None]:
# Create a Chroma Client
chroma_client = chromadb.PersistentClient(path="document_store")
# Create a collection
collection = chroma_client.get_or_create_collection(name="collection_name")

In [None]:
# Delete the collection
#chroma_client.delete_collection(name="collection_name")

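### Import data
Here, `documents` holds the content of your data. Metadata is information about how the data is organized, its fields, and their relationships; in short, metadata is data about data, and it can hold anything you define, such as a chapter. `ids` are the unique identifiers used as the index.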

In [None]:
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges",
        "This is a document about mango",
        "This is a document about apple",
    ],
    metadatas=[{"chapter": "1", "verse": "a"},
          {"chapter": "1", "verse": "a"},
          {"chapter": "2", "verse": "a"},
          {"chapter": "2", "verse": "a"}],
    ids=["id1", "id2", "id3", "id4"]
)
pprint(collection.get())

{'data': None,
 'documents': ['This is a document about pineapple',
               'This is a document about oranges',
               'This is a document about mango',
               'This is a document about apple'],
 'embeddings': None,
 'ids': ['id1', 'id2', 'id3', 'id4'],
 'included': ['metadatas', 'documents'],
 'metadatas': [{'chapter': '1', 'verse': 'a'},
               {'chapter': '1', 'verse': 'a'},
               {'chapter': '2', 'verse': 'a'},
               {'chapter': '2', 'verse': 'a'}],
 'uris': None}


#### Load the DB
Load the previously persisted DB. When the document set is large, this avoids recomputing the embeddings every time.

In [None]:
client2 = chromadb.PersistentClient(path="document_store")
collection2 = client2.get_or_create_collection(name="collection_name")
pprint(collection2.get())

{'data': None,
 'documents': ['This is a document about pineapple',
               'This is a document about oranges',
               'This is a document about mango',
               'This is a document about apple'],
 'embeddings': None,
 'ids': ['id1', 'id2', 'id3', 'id4'],
 'included': ['metadatas', 'documents'],
 'metadatas': [{'chapter': '1', 'verse': 'a'},
               {'chapter': '1', 'verse': 'a'},
               {'chapter': '2', 'verse': 'a'},
               {'chapter': '2', 'verse': 'a'}],
 'uris': None}


#### Query data
Retrieve documents ranked by their similarity to the query.

In [None]:
results = collection2.query(
    query_texts=["This is a query document about hawaii"], # Chroma will embed this for you
    n_results=4 # how many results to return
)
pprint(results)

{'data': None,
 'distances': [[1.0404008937271816,
                1.1399504747618734,
                1.2430800215233073,
                1.3259602282234746]],
 'documents': [['This is a document about pineapple',
                'This is a document about mango',
                'This is a document about oranges',
                'This is a document about apple']],
 'embeddings': None,
 'ids': [['id1', 'id3', 'id2', 'id4']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'chapter': '1', 'verse': 'a'},
                {'chapter': '2', 'verse': 'a'},
                {'chapter': '1', 'verse': 'a'},
                {'chapter': '2', 'verse': 'a'}]],
 'uris': None}
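
One way to read the `distances` above, assuming (the notebook does not state this) that the collection uses Chroma's default `l2` space, which reports squared Euclidean distance, and that the default embedding function returns unit-length vectors. Under those assumptions, squared L2 distance and cosine similarity are related by d² = 2 − 2·cos. The vectors below are made-up toy values, not real embeddings, and the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_unit(a, b):
    """Cosine similarity; a plain dot product is enough for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def distance_to_cosine(d2):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1 - d2 / 2

# Toy vectors (assumed values for illustration only)
query = normalize([0.2, 0.9, 0.1])
doc = normalize([0.5, 0.4, 0.7])

d2 = squared_l2(query, doc)
cos = cosine_unit(query, doc)
print(d2, 2 - 2 * cos)  # the two numbers agree up to rounding
```

Under the same assumptions, the smallest distance reported above (about 1.04) would correspond to a cosine similarity of roughly 0.48.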


#### Add data
As operational needs evolve, new documents can be added to an existing database at any time.

In [None]:
collection2.add(
    documents=["This is a document about plum",
          "This is a document about cherry"],
    metadatas=[{"chapter": "3", "verse": "b"},
          {"chapter": "3", "verse": "b"}],
    ids=["id5", "id6"]
)

In [None]:
collection2.get()

{'ids': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
 'embeddings': None,
 'metadatas': [{'chapter': '1', 'verse': 'a'},
  {'chapter': '1', 'verse': 'a'},
  {'chapter': '2', 'verse': 'a'},
  {'chapter': '2', 'verse': 'a'},
  {'chapter': '3', 'verse': 'b'},
  {'chapter': '3', 'verse': 'b'}],
 'documents': ['This is a document about pineapple',
  'This is a document about oranges',
  'This is a document about mango',
  'This is a document about apple',
  'This is a document about plum',
  'This is a document about cherry'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

#### Query a specific subset of the data

In [None]:
# Filter by metadata
collection2.query(
    query_texts=["This is a query document about hawaii"],
    n_results=10,
    where={"verse": "a"}
)



{'ids': [['id1', 'id3', 'id2', 'id4']],
 'distances': [[1.0404008937271816,
   1.1399504747618734,
   1.2430800215233073,
   1.3259602282234746]],
 'metadatas': [[{'chapter': '1', 'verse': 'a'},
   {'chapter': '2', 'verse': 'a'},
   {'chapter': '1', 'verse': 'a'},
   {'chapter': '2', 'verse': 'a'}]],
 'embeddings': None,
 'documents': [['This is a document about pineapple',
   'This is a document about mango',
   'This is a document about oranges',
   'This is a document about apple']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

In [None]:
# Only match documents whose text contains a given string
collection2.query(
    query_texts=["This is a query document about hawaii"],
    n_results=10,
    where_document={"$contains":"p"}
)



{'ids': [['id1', 'id5', 'id4']],
 'distances': [[1.0404008937271816, 1.2933018376352365, 1.3259602282234746]],
 'metadatas': [[{'chapter': '1', 'verse': 'a'},
   {'chapter': '3', 'verse': 'b'},
   {'chapter': '2', 'verse': 'a'}]],
 'embeddings': None,
 'documents': [['This is a document about pineapple',
   'This is a document about plum',
   'This is a document about apple']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}
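
To make the two filters concrete, here is a plain-Python sketch of which documents they select, ignoring ranking. `matches_where` and `matches_where_document` are hypothetical helpers written for illustration, not part of the Chroma API. They cover only simple equality and `$contains`, and they assume `$contains` is a case-sensitive substring match.

```python
# The six documents stored in the collection at this point, keyed by id
docs = {
    "id1": ("This is a document about pineapple", {"chapter": "1", "verse": "a"}),
    "id2": ("This is a document about oranges", {"chapter": "1", "verse": "a"}),
    "id3": ("This is a document about mango", {"chapter": "2", "verse": "a"}),
    "id4": ("This is a document about apple", {"chapter": "2", "verse": "a"}),
    "id5": ("This is a document about plum", {"chapter": "3", "verse": "b"}),
    "id6": ("This is a document about cherry", {"chapter": "3", "verse": "b"}),
}

def matches_where(metadata, where):
    """Hypothetical stand-in for `where`: every key must equal the given value."""
    return all(metadata.get(key) == value for key, value in where.items())

def matches_where_document(text, where_document):
    """Hypothetical stand-in for `where_document` with a `$contains` condition."""
    return where_document["$contains"] in text

by_metadata = [doc_id for doc_id, (_, meta) in docs.items()
               if matches_where(meta, {"verse": "a"})]
by_text = [doc_id for doc_id, (text, _) in docs.items()
           if matches_where_document(text, {"$contains": "p"})]

print(by_metadata)  # ['id1', 'id2', 'id3', 'id4']
print(by_text)      # ['id1', 'id4', 'id5']
```

The query results above contain the same ids, ordered by distance rather than by insertion order.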

#### Delete documents

In [None]:
collection2.get()

{'ids': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
 'embeddings': None,
 'metadatas': [{'chapter': '1', 'verse': 'a'},
  {'chapter': '1', 'verse': 'a'},
  {'chapter': '2', 'verse': 'a'},
  {'chapter': '2', 'verse': 'a'},
  {'chapter': '3', 'verse': 'b'},
  {'chapter': '3', 'verse': 'b'}],
 'documents': ['This is a document about pineapple',
  'This is a document about oranges',
  'This is a document about mango',
  'This is a document about apple',
  'This is a document about plum',
  'This is a document about cherry'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [None]:
collection2.delete(
    where={"verse": {"$eq": "b"}}, # Match documents whose metadata field "verse" equals "b"
)

In [None]:
collection2.get()

{'ids': ['id1', 'id2', 'id3', 'id4'],
 'embeddings': None,
 'metadatas': [{'chapter': '1', 'verse': 'a'},
  {'chapter': '1', 'verse': 'a'},
  {'chapter': '2', 'verse': 'a'},
  {'chapter': '2', 'verse': 'a'}],
 'documents': ['This is a document about pineapple',
  'This is a document about oranges',
  'This is a document about mango',
  'This is a document about apple'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

## Quantization
- Embeddings may be challenging to scale up, which leads to expensive solutions and high latencies. Currently, many state-of-the-art models produce embeddings with 1024 dimensions, each of which is encoded in float32, i.e., they require 4 bytes per dimension. To perform retrieval over 50 million vectors, you would therefore need around 200GB of memory. This tends to require complex and costly solutions at scale.
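
The 200GB figure follows from simple arithmetic, sketched below. The helper name `index_size_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the 1-bit case corresponds to the `binary` precision used in the sample code of this section. Index overhead is ignored.

```python
def index_size_gb(num_vectors, dims, bits_per_dim):
    """Raw storage for the vectors alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_vectors * dims * bits_per_dim / 8
    return total_bytes / 1e9

float32_gb = index_size_gb(50_000_000, 1024, 32)  # 4 bytes per dimension
binary_gb = index_size_gb(50_000_000, 1024, 1)    # 1 bit per dimension

print(f"float32: {float32_gb:.1f} GB")  # float32: 204.8 GB
print(f"binary:  {binary_gb:.1f} GB")   # binary:  6.4 GB
```

Going from 32 bits to 1 bit per dimension is a 32x reduction in raw vector storage.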

#### Sample code

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2") # Load the SentenceTransformer model

corpus = ["I am driving to the lake.", "It is a beautiful day."] # Test sentences
embeddings = model.encode(corpus) # Encode the sentences as float32 vectors

binary_embeddings = model.encode(corpus, precision="binary") # Encode the sentences as binary (bit-packed) vectors


In [None]:
print(embeddings.shape)   # Shape of the embeddings
print(embeddings.nbytes)  # Total size in bytes
print(embeddings.dtype)   # Data type

(2, 384)
3072
float32


In [None]:
print(binary_embeddings.shape)   # Shape of the embeddings
print(binary_embeddings.nbytes)  # Total size in bytes
print(binary_embeddings.dtype)   # Data type

(2, 48)
96
int8
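
The change from `(2, 384)` float32 to `(2, 48)` int8 is consistent with reducing each dimension to one bit and packing 8 bits into each byte (384 / 8 = 48), which also explains the drop from 3072 to 96 bytes. The pure-Python sketch below illustrates that idea. Thresholding at zero, packing the first dimension into the highest bit, and shifting by 128 into the signed int8 range are assumptions about the library's convention, and `binary_quantize` is a hypothetical helper.

```python
def binary_quantize(vector):
    """Map a float vector to packed sign bits stored as signed bytes."""
    bits = [1 if x > 0 else 0 for x in vector]
    bits += [0] * (-len(bits) % 8)  # pad to a whole number of bytes
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit  # first dimension ends up in the highest bit
        packed.append(byte - 128)     # shift 0..255 into the int8 range -128..127
    return packed

# 384 made-up values standing in for one all-MiniLM-L6-v2 embedding
toy = [0.3, -0.1, 0.7, -0.2, -0.5, 0.9, 0.0, 0.4] * 48
codes = binary_quantize(toy)

print(len(toy), len(codes))  # 384 48
print(codes[0])              # 37, i.e. bits 10100101 = 165, minus 128
```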


#### Text classification example

In [None]:
df = pd.read_parquet('https://huggingface.co/datasets/stanfordnlp/imdb/resolve/main/plain_text/train-00000-of-00001.parquet')
df = df[['text', 'label']]
df

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
| ... | ... | ... |
| 24995 | A hit at the time but now better categorised a... | 1 |
| 24996 | I love this movie like no other. Another time ... | 1 |
| 24997 | This film and it's sequel Barry Mckenzie holds... | 1 |
| 24998 | 'The Adventures Of Barry McKenzie' started lif... | 1 |


In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2") # Load the SentenceTransformer model

In [None]:
corpus = df['text'].tolist() # IMDB review texts
embeddings = model.encode(corpus, show_progress_bar=True) # Encode the reviews as float32 vectors
binary_embeddings = model.encode(corpus, precision="binary", show_progress_bar=True) # Encode the reviews as binary (bit-packed) vectors


In [None]:
clf = LogisticRegression(max_iter=1000, random_state=0) # Logistic regression with up to 1000 iterations and a fixed random seed

In [None]:
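cross_validate(
    clf, # Logistic regression model
    embeddings,   # Features: float32 sentence embeddings
    df['label'].tolist(), # Labels
    scoring='accuracy', # Use accuracy as the evaluation metric
    cv=5, # 5-fold cross-validation
    n_jobs=-1, # Use all available CPU cores
    return_train_score=True) # Also report training-set scores

# fit_time: training time (seconds)
# score_time: scoring time (seconds)
# test_score: accuracy on the held-out fold
# train_score: accuracy on the training folds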


{'fit_time': array([1.58138824, 1.58836603, 1.22450542, 1.37860584, 1.0299952 ]),
 'score_time': array([0.00962305, 0.0086689 , 0.01398921, 0.01703763, 0.00746751]),
 'test_score': array([0.807 , 0.8038, 0.8016, 0.7988, 0.8006]),
 'train_score': array([0.8211 , 0.81955, 0.8213 , 0.8231 , 0.82345])}

In [None]:
# Switch to the binary embeddings
cross_validate(clf, binary_embeddings, df['label'].tolist(), scoring='accuracy', cv=5, n_jobs=-1, return_train_score=True)

{'fit_time': array([0.11667776, 0.10767174, 0.1108346 , 0.11503792, 0.06895471]),
 'score_time': array([0.0044024 , 0.00298166, 0.00308776, 0.00441623, 0.00183916]),
 'test_score': array([0.6598, 0.6604, 0.637 , 0.6506, 0.6476]),
 'train_score': array([0.65905, 0.6562 , 0.6631 , 0.6591 , 0.66025])}
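
Note that `binary_embeddings` holds packed bytes: each int8 feature encodes eight sign bits at once, so the classifier above sees one number per byte rather than one feature per bit. One option, not evaluated in this notebook, is to unpack the bytes back into 0/1 features before fitting. The sketch below assumes the packing convention of an offset of 128 with the first dimension in the highest bit; `unpack_bits` is a hypothetical helper.

```python
def unpack_bits(packed_row):
    """Expand packed int8 codes into a flat list of 0/1 features."""
    bits = []
    for value in packed_row:
        byte = value + 128  # undo the int8 shift, back to 0..255
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    return bits

# Three made-up packed bytes standing in for part of one embedding
row = [37, -128, 127]
features = unpack_bits(row)

print(len(features))  # 24
print(features[:8])   # [1, 0, 1, 0, 0, 1, 0, 1]
```

Whether this closes any of the accuracy gap above would need to be measured. The results reported in this section use the packed representation.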

### Quantize embeddings and store them in a vector database

In [None]:
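# Initialize the embedding model
embedding_func = HuggingFaceEmbeddings(
    model_name="infgrad/stella-base-zh-v3-1792d",
    encode_kwargs={"normalize_embeddings": True}) # Produce unit-length (normalized) embeddings, which makes comparisons more consistent without changing the vectors' relative directions

# Embed two Chinese news headlines about the M3 MacBook Air launch
a = embedding_func.embed_query('突襲式發表！蘋果推 2 款 M3 MacBook Air，強調 AI 、遊戲效能皆強化')
b = embedding_func.embed_query('蘋果最新M3版MacBook Air突襲登場！6亮點下放1技術不漲價 M2版還降3000元')

# Compute the cosine similarity
cosine_similarity([a], [b])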


array([[0.90356264]])

In [None]:
# Initialize the embedding model
embedding_func = HuggingFaceEmbeddings(
    model_name="infgrad/stella-base-zh-v3-1792d",
    encode_kwargs={"precision":"binary"}) # Produce binary embeddings, which saves storage but may reduce accuracy

# Embed the same two Chinese news headlines about the M3 MacBook Air launch
a = embedding_func.embed_query('突襲式發表！蘋果推 2 款 M3 MacBook Air，強調 AI 、遊戲效能皆強化')
b = embedding_func.embed_query('蘋果最新M3版MacBook Air突襲登場！6亮點下放1技術不漲價 M2版還降3000元')

# Compute the cosine similarity
cosine_similarity([a], [b])



array([[0.72331135]])
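
Cosine similarity over packed bytes treats each byte as a number, so the value above is a different quantity from the cosine similarity computed on float vectors. Binary embeddings are more commonly compared by Hamming distance, the number of bit positions in which two codes differ. The sketch below assumes the codes are int8 bytes offset by 128, as in the binary output shown earlier in this section; `hamming_distance` and `hamming_similarity` are hypothetical helpers.

```python
def hamming_distance(a, b):
    """Count differing bits between two packed int8 code vectors."""
    total = 0
    for x, y in zip(a, b):
        diff = (x + 128) ^ (y + 128)  # back to 0..255, then XOR marks differing bits
        total += bin(diff).count("1")
    return total

def hamming_similarity(a, b):
    """Fraction of matching bits, in the range 0..1."""
    n_bits = 8 * len(a)
    return 1 - hamming_distance(a, b) / n_bits

# Two made-up 3-byte codes that differ in exactly two bits
code_a = [37, -128, 127]
code_b = [37, -127, 126]

print(hamming_distance(code_a, code_b))    # 2
print(hamming_similarity(code_a, code_b))  # 22 of 24 bits match
```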

In [None]:
url = "https://www.bnext.com.tw/article/76864/what-is-the-meaning-of-llm"

loader = WebBaseLoader(url)
news_docs = loader.load()
news_docs[0].page_content = re.sub(r'\n\s+', '', news_docs[0].page_content)  # raw string avoids the invalid "\s" escape warning

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20)
texts_chunks = text_splitter.split_documents(news_docs)
pprint(texts_chunks)

[Document(metadata={'source': 'https://www.bnext.com.tw/article/76864/what-is-the-meaning-of-llm', 'title': 'LLM是什麼？跟AI的關聯為何？大型語言模型要面對什麼挑戰？一文看懂|數位時代 BusinessNext', 'description': 'LLM（大型語言模型）是一種深度學習模型，它能從大量的文章、影音、書籍中學習單詞和句子之間的關係，然後回答問題、翻譯、生成文本。', 'language': 'zh-Hant-TW'}, page_content='LLM是什麼？跟AI的關聯為何？大型語言模型要面對什麼挑戰？一文看懂|數位時代 BusinessNextABOUT US廣告合作內容授權新聞最新新聞'),
 Document(metadata={'source': 'https://www.bnext.com.tw/article/76864/what-is-the-meaning-of-llm', 'title': 'LLM是什麼？跟AI的關聯為何？大型語言模型要面對什麼挑戰？一文看懂|數位時代 BusinessNext', 'description': 'LLM（大型語言模型）是一種深度學習模型，它能從大量的文章、影音、書籍中學習單詞和句子之間的關係，然後回答問題、翻譯、生成文本。', 'language': 'zh-Hant-TW'}, page_content='熱門圖解前端科技產業應用數位生活服務消費企業職場時事焦點AI與大數據5G通訊電動車／交通科技物聯網區塊鏈能源環保醫療生技半導體與電子產業資訊安全智慧製造雲端運算與服務智慧城市遊戲／電競3C生活影音／新媒體教育／人文金融科技新零售服務創新創新創業商業經營行銷與MARTECH職場／工作術程式開發深度專題\n影音新聞\n專家觀點社群未來商務創業小聚Web3+活動\n課程\n雜誌登入\n/\n註冊熱門\n新聞\n專題'),
 Document(metadata={'source': 'https://www.bnext.com.tw/article/76864/what-is-the-meaning-of-llm', 'title': 'LLM是什麼？跟AI的關聯為何？大型語言模型要面對什麼
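
To illustrate what `chunk_size=200` and `chunk_overlap=20` mean, here is a simplified fixed-window splitter. It is not the algorithm that `RecursiveCharacterTextSplitter` uses, which also tries to break on separators such as newlines and can therefore produce shorter chunks. `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share characters.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=20):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# 500 made-up characters standing in for a loaded page
text = "".join(str(i % 10) for i in range(500))
chunks = sliding_chunks(text)

print([len(c) for c in chunks])            # [200, 200, 140]
print(chunks[0][-20:] == chunks[1][:20])   # True: 20 characters are shared
```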

In [None]:
# load it into Chroma
db = Chroma.from_documents(texts_chunks, embedding_func)

# query it
query = "什麼是 LLM 模型？"  # "What is an LLM?"
docs = db.similarity_search_with_score(query)
docs[0]

(Document(metadata={'description': 'LLM（大型語言模型）是一種深度學習模型，它能從大量的文章、影音、書籍中學習單詞和句子之間的關係，然後回答問題、翻譯、生成文本。', 'language': 'zh-Hant-TW', 'source': 'https://www.bnext.com.tw/article/76864/what-is-the-meaning-of-llm', 'title': 'LLM是什麼？跟AI的關聯為何？大型語言模型要面對什麼挑戰？一文看懂|數位時代 BusinessNext'}, page_content='LLM（大型語言模型）是什麼？'),
 800762.0)

In [None]:
for doc, score in docs:
    # Chroma returns a distance here, so lower scores mean closer matches
    print(f"Document: {doc}\nDistance score: {score}\n")

Document: page_content='LLM（大型語言模型）是什麼？' metadata={'description': 'LLM（大型語言模型）是一種深度學習模型，它能從大量的文章、影音、書籍中學習單詞和句子之間的關係，然後回答問題、翻譯、生成文本。', 'language': 'zh-Hant-TW', 'source': 'https://www.bnext.com.tw/article/76864/what-is-the-meaning-of-llm', 'title': 'LLM是什麼？跟AI的關聯為何？大型語言模型要面對什麼挑戰？一文看懂|數位時代 BusinessNext'}
Distance score: 800762.0

Document: page_content='Language Model,大型語言模型）是什麼嗎？LLM是一種深度學習模型，透過吸收海量的文本數據學習知識。它能從大量的文章、影音、書籍中學習單詞和句子之間的關係，然後回答問題、翻譯、生成文本。除了作為聊天機器人，它也被廣泛運用在醫療、開發軟體和服務業，經常出現在日常生活中。想知道它的運作原理、優點與挑戰和其他實際應用？一起來看看這篇文章吧！' metadata={'description': 'LLM（大型語言模型）是一種深度學習模型，它能從大量的文章、影音、書籍中學習單詞和句子之間的關係，然後回答問題、翻譯、生成文本。', 'language': 'zh-Hant-TW', 'source': 'https://www.bnext.com.tw/article/76864/what-is-the-meaning-of-llm', 'title': 'LLM是什麼？跟AI的關聯為何？大型語言模型要面對什麼挑戰？一文看懂|數位時代 BusinessNext'}
Distance score: 1085566.0

Document: page_content='LLM 如何運作？用途是什麼？
大型語言模型的工作原理是獲取大量的文本數據，從中學習單詞和句子之間的關係，訓練完畢後可用來分析現有文字的情感與意義或生成新的文本。而且隨著人工智慧的發展，模型能消化的數據集也越來越大，如此大量的文本使用無監督學習輸入人工智慧演算法進行訓練，當它被給予一個數據集而沒有明確的指令要如何處理它時，模型會自己學習單詞以及單詞和語句之間的關係與背後的概念。' 

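## Summary
- Quantization speeds things up but also costs accuracy; whether the trade-off is worthwhile depends on the project's requirements.
- For this reason, many other quantization techniques have since been developed that try to gain speed without giving up too much accuracy.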