# Graph RAG with LlamaIndex

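## Table of contents
- [Overview](#overview)
- [References](#references)
- [Checks](#checks)
- [Tutorial](#tutorial)
  - [Preparation](#preparation)
    - [Variables used](#variables-used)
    - [Installation](#installation)
    - [Loading libraries](#loading-libraries)
  - [OpenAI](#openai)
    - [Additional installation](#additional-installation)
    - [Additional library imports](#additional-library-imports)
    - [Checking the API key](#checking-the-api-key)
  - [KnowledgeGraphIndex](#knowledgegraphindex)
    - [Configuring the LLM](#configuring-the-llm)
    - [Building the index](#building-the-index)
    - [The retrieval part of RAG](#the-retrieval-part-of-rag)
  - [PropertyGraphIndex](#propertygraphindex)
    - [Preparation](#preparation)
    - [Building the index](#building-the-index)
    - [The retrieval part of RAG](#the-retrieval-part-of-rag)
    - [Persisting and reloading](#persisting-and-reloading)
    - [Using ChromaDB for storing](#using-chromadb-for-storing)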

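## Overview
- Walk through the official LlamaIndex documentation to confirm basic usage.
- The code here should keep working until LlamaIndex introduces breaking changes.
- After a breaking change, consult the official documentation for the version in use (the primary source).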

## References

RAG for LLMs (LLMのRAG) - .NET 開発基盤部会 Wiki  
https://dotnetdevelopmentinfrastructure.osscons.jp/index.php?LLM%E3%81%AERAG
- Splitting the knowledge
- Embedding the knowledge
- Query Input
- Query Embedding
- Information Retrieval
- Information Generation
- Answer Delivery

LlamaIndex - .NET 開発基盤部会 Wiki  
https://dotnetdevelopmentinfrastructure.osscons.jp/index.php?LlamaIndex
- Loading
- Indexing
- Storing
- Querying
- Evaluation
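
The RAG steps listed above (splitting, embedding, query embedding, retrieval) can be sketched end to end with only the standard library. This is a toy illustration, not how LlamaIndex implements them: the "embedding" is a plain word count, the splitter works on whitespace, and the function names (`split_into_chunks`, `embed`, `retrieve`) and the sample text are made up for this example. The generation step is omitted because it needs an LLM.

```python
import math
from collections import Counter


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a stand-in for a real model)."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical two-sentence knowledge base
knowledge = (
    "Interleaf made software for creating documents. "
    "Viaweb built an online store builder."
)
chunks = split_into_chunks(knowledge, chunk_size=6)
best = retrieve("Who built an online store builder?", chunks)
print(best[0])
```

In a real pipeline the retrieved chunk would be placed into the LLM prompt as context for the answer.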

## Checks

In [1]:
#!pip list

In [2]:
#%env

## Tutorial
This tutorial assumes OpenAI; Ollama is covered separately.

### Preparation

#### Variables used

In [3]:
DATA_DIR = "./llamaindex/data/paul_graham_essay"
PERSIST_DIR = "./llamaindex/storage/paul_graham_essay_pgi"
CHROMA_DIR = "./llamaindex/chroma_db/paul_graham_essay_pgi"

#### Installation

##### New installs
For graph visualization:

```bash
!pip install pyvis
```

#### Loading libraries

In [4]:
from llama_index.core import SimpleDirectoryReader, Settings, StorageContext
from llama_index.core.graph_stores import SimpleGraphStore

### OpenAI

#### Additional installation

In [5]:
# Not needed (llama-index-llms-openai appears to be installed as a dependency)
# !pip install llama-index-llms-openai

#### Additional library imports

In [6]:
from llama_index.llms.openai import OpenAI

#### Checking the API key
To prepare, log in to OpenAI, create an API key, register a payment card, and add credit.

```Python
import os
print(os.environ['OPENAI_API_KEY'])
```
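
Printing the full key leaves it in the notebook output, which is easy to commit or share by accident. As a hedged alternative, the sketch below only reports whether the variable is set and shows its last four characters. The helper name `describe_api_key` is made up for this example.

```python
import os


def describe_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Report whether the variable is set, showing only its last 4 characters."""
    value = os.environ.get(name)
    if not value:
        return f"{name} is not set"
    return f"{name} is set (ends with ...{value[-4:]})"


print(describe_api_key())
```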

### KnowledgeGraphIndex

#### Configuring the LLM

In [7]:
# define LLM
# NOTE: the note carried over from the upstream demo referred to text-davinci-002;
# gpt-3.5-turbo is used here
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
# chunk size is measured in tokens
Settings.chunk_size = 512

#### Building the index
The index for graph search.

##### Loading

In [8]:
documents = SimpleDirectoryReader(DATA_DIR).load_data()

##### Indexing & Storing
- The code below uses `KnowledgeGraphIndex`.
- `SimpleGraphStore` keeps the graph in memory; it is written to disk only when the storage context is persisted.

In [9]:
from llama_index.core import KnowledgeGraphIndex

graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# NOTE: can take a while!
index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=2,
    storage_context=storage_context,
    show_progress=True,
)

Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22.76it/s]
Processing nodes: 100%|██████████████████████████████████████████████████████████████████████████████| 64/64 [00:45<00:00,  1.41it/s]
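
On the memory-versus-disk question for the graph store: an in-memory store holds triplets in a plain mapping and touches the disk only when asked to persist. The sketch below illustrates that behavior with a hypothetical `ToyGraphStore`. It is not LlamaIndex's `SimpleGraphStore`, and its method names and JSON layout are assumptions made for this example.

```python
import json
import os
import tempfile


class ToyGraphStore:
    """Hypothetical in-memory triplet store; not LlamaIndex's SimpleGraphStore."""

    def __init__(self) -> None:
        # subject -> list of [predicate, object] pairs, held only in memory
        self.rel_map: dict[str, list[list[str]]] = {}

    def upsert_triplet(self, subj: str, rel: str, obj: str) -> None:
        """Add a triplet unless the same one is already stored."""
        pairs = self.rel_map.setdefault(subj, [])
        if [rel, obj] not in pairs:
            pairs.append([rel, obj])

    def persist(self, path: str) -> None:
        """Write the in-memory map to disk as JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.rel_map, f)

    @classmethod
    def from_persist_path(cls, path: str) -> "ToyGraphStore":
        """Rebuild a store from a file written by persist()."""
        store = cls()
        with open(path, encoding="utf-8") as f:
            store.rel_map = json.load(f)
        return store


store = ToyGraphStore()
store.upsert_triplet("Interleaf", "Made", "Software")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph_store.json")
    exists_before = os.path.exists(path)  # nothing on disk until persist()
    store.persist(path)
    reloaded = ToyGraphStore.from_persist_path(path)

print(exists_before, reloaded.rel_map)
```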


#### The retrieval part of RAG

##### Querying

In [10]:
query_engine = index.as_query_engine(
    include_text=False, response_mode="tree_summarize"
)
response = query_engine.query(
    "Tell me more about Interleaf",
)
print(response)

Interleaf was a software that made use of a scripting language and was inspired by Emacs. It taught useful things and had smart people working on it. However, it eventually got crushed by Moore's law.


##### Evaluation

###### nodes

In [11]:
for node in response.source_nodes:
    print("score ", node.score)
    print("id_", node.id_)
    #print("file_name", node.metadata["file_name"])
    print("text", node.text)
    print("------------------------------------------------------")

score  1000.0
id_ 572a0574-26ec-4668-9b37-9136375cc8b5
text The following are knowledge sequence in max depth 2 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...`
['Interleaf', 'Made', 'Software']
['Software', 'Mention in', 'Footnotes']
['Software', 'Was', 'One of the best general-purpose site builders']
['Software', 'Was', "Raison d'etre"]
['Interleaf', 'Added', 'Scripting language']
['Interleaf', 'Added', 'Scripting language']
['Interleaf', 'Inspired by', 'Emacs']
['Interleaf', 'Taught', 'Useful things']
['Interleaf', 'Got crushed by', "Moore's law"]
['Interleaf', 'Had', 'Smart people']
------------------------------------------------------
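
The node text above describes a "knowledge sequence in max depth 2": triplets reachable within two hops of the query entity. A depth-limited walk over a triplet list reproduces the idea. This is a minimal sketch, not LlamaIndex's retriever; the function names are made up, and the last triplet in the sample data is hypothetical, added only to show what a third hop would look like.

```python
from collections import defaultdict

triplets = [
    ("Interleaf", "Made", "Software"),
    ("Software", "Mention in", "Footnotes"),
    ("Interleaf", "Inspired by", "Emacs"),
    ("Footnotes", "Appear in", "Essay"),  # hypothetical third hop
]


def build_adjacency(triplets):
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = defaultdict(list)
    for subj, rel, obj in triplets:
        adjacency[subj].append((rel, obj))
    return adjacency


def collect_triplets(adjacency, start, max_depth=2):
    """Breadth-first walk from start, returning triplets within max_depth hops."""
    found = []
    frontier = [start]
    visited = {start}
    for _ in range(max_depth):
        next_frontier = []
        for subj in frontier:
            for rel, obj in adjacency.get(subj, []):
                found.append((subj, rel, obj))
                if obj not in visited:
                    visited.add(obj)
                    next_frontier.append(obj)
        frontier = next_frontier
    return found


adjacency = build_adjacency(triplets)
for triplet in collect_triplets(adjacency, "Interleaf", max_depth=2):
    print(list(triplet))
```

With `max_depth=2` the walk stops at `Footnotes`, so the hypothetical third-hop triplet is not returned.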


###### networkx_graph

In [12]:
from pyvis.network import Network

g = index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)
net.show("./llamaindex/KnowledgeGraphIndex.html")

./llamaindex/KnowledgeGraphIndex.html


### PropertyGraphIndex

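#### Preparation

##### Event-loop workaround
`nest_asyncio.apply()` patches asyncio to allow nested event loops. Jupyter already runs an event loop, so async calls made inside LlamaIndex can otherwise fail.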

In [13]:
import nest_asyncio
nest_asyncio.apply()
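
To see the problem this patch works around, the standard-library sketch below tries to start a second event loop while one is already running, which is the situation inside a notebook. Run it as a plain script without `nest_asyncio` applied; the function names are made up for this example.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    """Try to start a second event loop while one is already running."""
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without the patch, the nested `asyncio.run()` call raises `RuntimeError` instead of running `inner()`.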

##### Models

In [14]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
embed_model=OpenAIEmbedding(model_name="text-embedding-3-small")

#### Building the index
The index for graph search.

##### Loading

In [15]:
documents = SimpleDirectoryReader(DATA_DIR).load_data()

##### Indexing & Storing
- The code below uses `PropertyGraphIndex`.
- An embedding model is set up separately from the LLM and passed in explicitly.

In [16]:
from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex.from_documents(
    documents,
    llm=llm,
    embed_model=embed_model,
    show_progress=True,
)

Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 24.11it/s]
Extracting paths from text: 100%|████████████████████████████████████████████████████████████████████| 64/64 [00:23<00:00,  2.74it/s]
Extracting implicit paths: 100%|██████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 26319.78it/s]
Generating embeddings: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.71s/it]
Generating embeddings: 100%|█████████████████████████████████████████████████████████████████████████| 14/14 [00:03<00:00,  4.35it/s]


#### The retrieval part of RAG

##### Querying

In [17]:
query_engine = index.as_query_engine(
    include_text=True,
)
response = query_engine.query("What happened at Interleaf and Viaweb?")

print(str(response))

Interleaf had smart people and built impressive technology but got crushed by Moore's Law. Viaweb started because the founder needed money, had negative net worth, and became a model company that worked via the web. They needed founders, called themselves a company, and made consulting services.


##### Evaluation

###### retriever
Retrieve without the source text and display the nodes.

In [18]:
retriever = index.as_retriever(
    include_text=False,  # omit the source text (the default is True)
)

nodes = retriever.retrieve("What happened at Interleaf and Viaweb?")

for node in nodes:
    print(node.text)

Viaweb -> Needed -> Something
Interleaf -> Was -> On the way down
Viaweb -> Owed -> Government
Viaweb -> Had -> Negative net worth
Viaweb -> Became -> Model
Viaweb -> Worked via -> Web
Postscript file -> Created by -> Viaweb
Interleaf -> Wanted -> Lisp hacker
Viaweb -> Needed -> Ourselves
Viaweb -> Called -> Company
Viaweb -> Gave -> 10%
Interleaf -> Added -> Dialect of lisp
Interleaf -> Built -> Impressive technology
Interleaf -> Was on -> Way down
Interleaf -> Added -> Scripting language
Viaweb -> Started -> Because i needed the money
Interleaf -> Got crushed by -> Moore's law
Viaweb -> Called -> New company
Dan giffin -> Worked for -> Viaweb
Viaweb -> Seemed -> Lame
Viaweb -> Needed -> Founders
Interleaf -> Had -> Smart people
Interleaf -> Made -> Software
Interleaf -> Made -> Scripting language
Code editor -> Was in -> Viaweb
Viaweb -> Made -> Consulting
Interleaf -> Had -> Added
Viaweb logo -> Had been -> White v on red circle
Viaweb stock -> Was -> Valuable
I -> Learned -> Useful
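
Each retrieved node is a plain `subject -> predicate -> object` string, so the result is easy to post-process. The sketch below parses a few of the lines above into tuples and groups them by subject. It assumes that no part of a triplet contains `->` itself, and the helper name `parse_triplet` is made up for this example.

```python
from collections import defaultdict

lines = [
    "Viaweb -> Had -> Negative net worth",
    "Interleaf -> Got crushed by -> Moore's law",
    "Interleaf -> Had -> Smart people",
]


def parse_triplet(line: str) -> tuple[str, str, str]:
    """Split 'subject -> predicate -> object' into its three parts."""
    subj, rel, obj = (part.strip() for part in line.split("->"))
    return subj, rel, obj


# Group (predicate, object) pairs under their subject
by_subject = defaultdict(list)
for line in lines:
    subj, rel, obj = parse_triplet(line)
    by_subject[subj].append((rel, obj))

for subj, facts in by_subject.items():
    print(subj, facts)
```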

###### networkx_graph
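`PropertyGraphIndex` does not appear to provide `get_networkx_graph()`. The property graph store has a `save_networkx_graph()` method that may serve the same purpose; check it against the installed version.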

###### query
Display the nodes used by the query, including the source text.

In [19]:
for node in response.source_nodes:
    print(node.text)

Here are some facts extracted from the provided text:

Viaweb -> Needed -> Something
Interleaf -> Was -> On the way down
Viaweb -> Owed -> Government
Viaweb -> Had -> Negative net worth
Viaweb -> Became -> Model
Viaweb -> Worked via -> Web
Interleaf -> Wanted -> Lisp hacker
Viaweb -> Needed -> Ourselves
Viaweb -> Called -> Company
Viaweb -> Gave -> 10%
Interleaf -> Added -> Dialect of lisp
Interleaf -> Built -> Impressive technology
Interleaf -> Was on -> Way down
Interleaf -> Added -> Scripting language
Viaweb -> Started -> Because i needed the money
Interleaf -> Got crushed by -> Moore's law
Viaweb -> Called -> New company
Viaweb -> Seemed -> Lame
Viaweb -> Needed -> Founders
Interleaf -> Had -> Smart people
Interleaf -> Made -> Software
Interleaf -> Made -> Scripting language
Code editor -> Was in -> Viaweb
Viaweb -> Made -> Consulting
Interleaf -> Had -> Added

[5] Interleaf was one of many companies that had smart people and built impressive technology, and yet got crushed by Moor

#### Persisting and reloading
Persistence is optional. The procedure is the same as for VectorStoreIndex, though it seems to differ from the one for KnowledgeGraphIndex.

In [20]:
import os.path
from llama_index.core import (
    StorageContext,
    load_index_from_storage,
)

# check if storage already exists
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader(DATA_DIR).load_data()
    index = PropertyGraphIndex.from_documents(
        documents,
        llm=llm,
        embed_model=embed_model,
        show_progress=True,
    )
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

# Either way we can now query the index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)

The author skipped a step in the evolution of computers, going straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.
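
The cell above follows a build-or-load pattern: pay for extraction and embedding once, then reuse the persisted result. The standard-library sketch below shows the same control flow with a stand-in for the expensive step. `build_index`, `load_or_build`, and the `index.json` file name are all hypothetical.

```python
import json
import os
import tempfile


def build_index() -> dict:
    """Stand-in for the expensive step (LLM extraction and embedding)."""
    return {"triplets": [["Interleaf", "Made", "Software"]]}


def load_or_build(persist_dir: str) -> tuple[dict, bool]:
    """Return (index, built), where built is True if the index was created now."""
    path = os.path.join(persist_dir, "index.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f), False
    index = build_index()
    os.makedirs(persist_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, True


with tempfile.TemporaryDirectory() as tmp:
    persist_dir = os.path.join(tmp, "storage")
    first, built_first = load_or_build(persist_dir)
    second, built_second = load_or_build(persist_dir)

print(built_first, built_second)
```

The sketch checks for the persisted file, not just the directory, so an empty or partially written directory is not mistaken for a valid store.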


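#### Using ChromaDB for storing
As of this writing, the code in this section has not been confirmed to work. Two keyword arguments are easy to get wrong: `ChromaVectorStore` takes its collection as `chroma_collection`, and `PropertyGraphIndex` takes its graph store as `property_graph_store`.

##### Additional installation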

```bash
!pip install chromadb
!pip install llama-index-vector-stores-chroma
```

##### Additional library imports

In [21]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.graph_stores import SimplePropertyGraphStore
from llama_index.core import StorageContext

##### Running step by step

###### Loading

In [22]:
# load some documents
documents = SimpleDirectoryReader(DATA_DIR).load_data()

###### Settings
Use Chroma DB as the vector store in the storage context.

In [None]:
# initialize client, setting path to save data
db = chromadb.PersistentClient(path=CHROMA_DIR)

# create collection
collection = db.get_or_create_collection("property_graph_index")

###### Indexing & Storing

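```Python
# First run
index = PropertyGraphIndex.from_documents(
    documents,
    llm=llm,
    embed_model=embed_model,
    # PropertyGraphIndex takes its graph store as property_graph_store
    property_graph_store=SimplePropertyGraphStore(),
    # ChromaVectorStore takes its collection as chroma_collection
    vector_store=ChromaVectorStore(chroma_collection=collection),
    show_progress=True,
)

# Persist the graph store; Chroma keeps the vectors under CHROMA_DIR
index.storage_context.persist(persist_dir=PERSIST_DIR)
```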

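```Python
# Subsequent runs
index = PropertyGraphIndex.from_existing(
    property_graph_store=SimplePropertyGraphStore.from_persist_dir(PERSIST_DIR),
    vector_store=ChromaVectorStore(chroma_collection=collection),
    llm=llm,
    # use the same embedding model as at build time so query vectors match
    embed_model=embed_model,
    show_progress=True,
)
```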

###### Querying

```Python
# Either way we can now query the index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
```