# Chat with Your CSV File

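In this chapter, students will practice storing a CSV file on Google Drive, loading it with the `CSVLoader` provided by LangChain, obtaining embeddings from OpenAI, and finally answering questions about the data, first with a RetrievalQA chain and then with an Agent.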

## Install Packages



In [None]:
!pip install openai
!pip install tiktoken
!pip install langchain
!pip install chromadb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.4-py3-none-any.whl (70 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting aiosignal>=1.1.2
  Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.man

## Environment Setup

This chapter uses an OpenAI API key.

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "sk-XXXX"  # replace with your own key, which looks like sk-XXXX
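
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming you would rather be prompted for the key, the hypothetical helper `ensure_api_key` below keeps an existing `OPENAI_API_KEY` and otherwise asks for one with the standard-library `getpass`. The helper name and its parameters are illustrative and are not part of OpenAI or LangChain.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, prompting only if it is missing.

    Hypothetical helper for this notebook; `env` and `prompt` are injectable
    so the function can be exercised without real user input.
    """
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value so it does not appear in the cell output.
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        env[name] = key
    return key
```

Calling `ensure_api_key()` in place of the direct assignment prompts at most once per session, because the value is stored back into the environment.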

## Connect to Google Drive

In [None]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


## Load the CSV File with LangChain's Wrapper Class

Pokémon CSV: [file link](https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6#file-pokemon-csv)

Reference: [LangChain CSV loader](https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/csv.html)

In [None]:
from langchain.document_loaders.csv_loader import CSVLoader

In [None]:
loader = CSVLoader(file_path='/content/gdrive/My Drive/pokemon.csv')

In [None]:
print(loader)

<langchain.document_loaders.csv_loader.CSVLoader object at 0x7fdfa18b0a60>


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

In [None]:
data = loader.load_and_split(splitter)

In [None]:
data[0]

Document(page_content='#: 1\nName: Bulbasaur\nType 1: Grass\nType 2: Poison\nTotal: 318\nHP: 45\nAttack: 49\nDefense: 49\nSp. Atk: 65\nSp. Def: 65\nSpeed: 45\nGeneration: 1\nLegendary: False', metadata={'source': '/content/gdrive/My Drive/pokemon.csv', 'row': 0})

In [None]:
data[1]

Document(page_content='#: 2\nName: Ivysaur\nType 1: Grass\nType 2: Poison\nTotal: 405\nHP: 60\nAttack: 62\nDefense: 63\nSp. Atk: 80\nSp. Def: 80\nSpeed: 60\nGeneration: 1\nLegendary: False', metadata={'source': '/content/gdrive/My Drive/pokemon.csv', 'row': 1})
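
The `page_content` shown above follows a simple pattern: one `column: value` line per field, with the source path and row index kept in the metadata. The sketch below reproduces that pattern with the standard-library `csv` module so the format can be inspected without LangChain. `rows_to_documents` is a hypothetical helper, not the `CSVLoader` implementation, and the two sample rows are copied from the outputs above.

```python
import csv
import io

# Header and two rows, copied from the Document outputs shown in this notebook.
SAMPLE_CSV = """#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
"""


def rows_to_documents(csv_text, source="pokemon.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    docs = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": index}})
    return docs


docs = rows_to_documents(SAMPLE_CSV)
print(docs[0]["page_content"])
```

Each row is far shorter than the `chunk_size=1000` passed to the splitter, which is consistent with `data[0]` and `data[1]` each holding exactly one whole row.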

## Load Embeddings

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

In [None]:
# Create the OpenAI embeddings client; vectors are requested later, when documents are indexed
embeddings = OpenAIEmbeddings()

## Store in a Vector Database - Chroma

In [None]:
from langchain.vectorstores import Chroma

In [None]:
db = Chroma.from_documents(data, embeddings)



In [None]:
db

<langchain.vectorstores.chroma.Chroma at 0x7fdf93a6ea60>
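
Conceptually, a vector store keeps one embedding per document and answers a query by returning the documents whose embeddings lie closest to the query embedding. The toy below illustrates the idea with hand-written 3-dimensional vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration only; Chroma's actual index and distance metric depend on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the names of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings.
toy_store = [
    ("Bulbasaur", [0.9, 0.1, 0.0]),
    ("Ivysaur", [0.8, 0.2, 0.1]),
    ("Pikachu", [0.0, 0.1, 0.9]),
]
query_vec = [0.1, 0.0, 1.0]
print(top_k(query_vec, toy_store, k=1))
```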

## Build a RetrievalQA Chain

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

In [None]:
retriever = db.as_retriever()

In [None]:
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever, input_key="question")

In [None]:
query = "What is Pikachu stats?"
response = qa({"question": query})
print(response['question'])
print(response['result'])

What is Pikachu stats?
 Pikachu has a total stat of 320, HP of 35, Attack of 55, Defense of 40, Sp. Atk of 50, Sp. Def of 50, and Speed of 90. It is a Generation 1 Pokémon and is not legendary.


In [None]:
query = "What are the columns in the document?"
response = qa({"question": query})
print(response['question'])
print(response['result'])

What are the columns in the document?
 The columns are Name, Type 1, Type 2, Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation, and Legendary.


In [None]:
query = "What is the most powerful Pokemon in terms of HP? And Who own it?"
response = qa({"question": query})
print(response['question'])
print(response['result'])

What is the most powerful Pokemon in terms of HP? And Who own it?
 MewtwoMega Mewtwo X is the most powerful Pokemon in terms of HP with a total of 106 HP. It is owned by Generation 1 and is a Legendary Pokemon.
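
The answer above is not reliable: the agent run later in this notebook finds Blissey with 255 HP, and the dataset has no column describing an owner. A likely cause is that the `stuff` chain only sees the few rows the retriever returns, so a question that needs a scan over every row cannot be answered from that context. The sketch below shows the effect with HP values taken from outputs in this notebook; the `retrieved` list is a hypothetical retrieval result, not what Chroma actually returned.

```python
# HP values taken from outputs shown in this notebook.
hp_by_name = {
    "Bulbasaur": 45,
    "Ivysaur": 60,
    "Pikachu": 35,
    "MewtwoMega Mewtwo X": 106,
    "Blissey": 255,
}

# Hypothetical: the handful of rows a retriever might hand to the LLM.
retrieved = ["MewtwoMega Mewtwo X", "Pikachu", "Ivysaur"]

# The LLM can only compare rows present in its context...
best_in_context = max(retrieved, key=hp_by_name.get)
# ...while the true answer requires looking at every row.
best_overall = max(hp_by_name, key=hp_by_name.get)

print(best_in_context, "vs", best_overall)
```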


## Use an Agent

In [None]:
from langchain.agents import create_csv_agent
from langchain.llms import OpenAI

In [None]:
agent = create_csv_agent(OpenAI(temperature=0), '/content/gdrive/My Drive/pokemon.csv', verbose=True)

In [None]:
agent.run("What is Pikachu stats?")

> Entering new AgentExecutor chain...
Thought: I need to find Pikachu's stats in the dataframe.
Action: python_repl_ast
Action Input: df[df['Name'] == 'Pikachu']
Observation:
     #     Name    Type 1 Type 2  Total  HP  Attack  Defense  Sp. Atk  \
30  25  Pikachu  Electric    NaN    320  35      55       40       50

    Sp. Def  Speed  Generation  Legendary
30       50     90           1      False
Thought: I now know the final answer
Final Answer: Pikachu has a total stat of 320, HP of 35, Attack of 55, Defense of 40, Sp. Atk of 50, Sp. Def of 50, and Speed of 90.

> Finished chain.


'Pikachu has a total stat of 320, HP of 35, Attack of 55, Defense of 40, Sp. Atk of 50, Sp. Def of 50, and Speed of 90.'

In [None]:
agent.run("how many rows are there?")

> Entering new AgentExecutor chain...
Thought: I need to count the number of rows
Action: python_repl_ast
Action Input: len(df)
Observation: 800
Thought: I now know the final answer
Final Answer: There are 800 rows.

> Finished chain.


'There are 800 rows.'

In [None]:
agent.run("What is the most powerful Pokemon in terms of HP? And Who own it?")

> Entering new AgentExecutor chain...
Thought: I need to find the Pokemon with the highest HP value.
Action: python_repl_ast
Action Input: df[df['HP'] == df['HP'].max()]
Observation:
       #     Name  Type 1 Type 2  Total   HP  Attack  Defense  Sp. Atk  \
261  242  Blissey  Normal    NaN    540  255      10       10       75

     Sp. Def  Speed  Generation  Legendary
261      135     55           2      False
Thought: I now know the final answer
Final Answer: The most powerful Pokemon in terms of HP is Blissey with 255 HP.

> Finished chain.


'The most powerful Pokemon in terms of HP is Blissey with 255 HP.'
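
Each trace above shows the agent writing a pandas expression and executing it, so the answer is computed from the table rather than inferred from retrieved text. As a rough sketch of what those three expressions do, the block below runs equivalent operations in plain Python on three rows copied from this notebook's outputs. It uses the standard-library `csv` module instead of pandas as a simplifying assumption, so the row count is 3 rather than 800.

```python
import csv
import io

# Three rows copied from outputs in this notebook (an empty "Type 2" stands for NaN).
SAMPLE_CSV = """#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
25,Pikachu,Electric,,320,35,55,40,50,50,90,1,False
242,Blissey,Normal,,540,255,10,10,75,135,55,2,False
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Equivalent of: df[df['Name'] == 'Pikachu']
pikachu_rows = [row for row in rows if row["Name"] == "Pikachu"]

# Equivalent of: len(df)
row_count = len(rows)

# Equivalent of: df[df['HP'] == df['HP'].max()]
max_hp = max(int(row["HP"]) for row in rows)
highest_hp_names = [row["Name"] for row in rows if int(row["HP"]) == max_hp]

print(pikachu_rows[0]["Speed"], row_count, highest_hp_names)
```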