<a href="https://colab.research.google.com/github/HUJameson/Colab/blob/main/aillm_0202.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [35]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir("/content/drive/My Drive/Colab Notebooks")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
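# The cells below use the legacy top-level `llama_index` imports (GPTVectorStoreIndex, ServiceContext),
# so an older llama_index release may be required for them to run unchanged.
!pip install openai llama_index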

In [None]:
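import os

import openai
from sk_utils import read_sk  # appears to be a local helper that returns the OpenAI secret key

# Set the key through os.environ; `%env NAME=value` echoes the value into the cell output.
os.environ["OPENAI_API_KEY"] = read_sk()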

In [41]:
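from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

openai.api_key = os.environ.get("OPENAI_API_KEY")

# Load every file under ./data/mr_fujino and build an embedding-based vector index.
documents = SimpleDirectoryReader('./data/mr_fujino').load_data()
index = GPTVectorStoreIndex.from_documents(documents)

# persist() takes a directory, so despite the .json suffix this path is created as a folder.
index.storage_context.persist('./data/index_mr_fujino.json')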

In [42]:
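from llama_index import StorageContext, load_index_from_storage

# Reload the persisted index from disk instead of re-embedding the documents.
storage_context = StorageContext.from_defaults(persist_dir="./data/index_mr_fujino.json")

index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
# Query, kept in Chinese to match the source text:
# "Who was Lu Xun's teacher when he studied medicine in Japan?"
response = query_engine.query("鲁迅先生在日本学习医学的老师是谁？")
print(response)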


鲁迅先生在日本学习医学的老师是藤野先生。
(Translation: Lu Xun's teacher when he studied medicine in Japan was Mr. Fujino.)


In [43]:
# Query: "Where did Lu Xun go to study medicine?"
response = query_engine.query("鲁迅先生去哪里学的医学？")
print(response)

鲁迅先生在仙台学习医学。
(Translation: Lu Xun studied medicine in Sendai.)
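
The two queries above go through a vector index, which answers by embedding the question and retrieving the most similar chunks before calling the LLM. The following is a minimal sketch of that retrieval step only. It assumes a toy character-bigram embedding in place of a real embedding model, and the sample chunks (English paraphrases of the answers above plus one unrelated filler line) and the helper names `embed`, `cosine`, and `top_k` are hypothetical, not part of llama_index.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: character-bigram counts (a stand-in for a real embedding model)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=1):
    """Return the k chunks whose embeddings are closest to the query embedding."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical sample chunks, for illustration only.
chunks = [
    "Lu Xun's medical teacher in Japan was Mr. Fujino.",
    "Lu Xun studied medicine in Sendai.",
    "An unrelated note about receipts.",
]
best = top_k("Where did Lu Xun study medicine?", chunks, k=1)
print(best)
```

In a real index the retrieved chunks are then passed to the LLM as context; only the ranking step is shown here.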


In [None]:
!pip install spacy
!python -m spacy download zh_core_web_sm

In [45]:
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import SpacyTextSplitter
from llama_index import GPTListIndex, LLMPredictor, ServiceContext
from llama_index.node_parser import SimpleNodeParser

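# Chat model used to synthesize answers; temperature=0 keeps the output as repeatable as possible.
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", max_tokens=1024))

# Split with spaCy's Chinese pipeline so chunks break on sentence boundaries (up to 2048 characters each).
text_splitter = SpacyTextSplitter(pipeline="zh_core_web_sm", chunk_size=2048)
parser = SimpleNodeParser(text_splitter=text_splitter)
documents = SimpleDirectoryReader('./data/mr_fujino').load_data()
nodes = parser.get_nodes_from_documents(documents)

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

# A list index keeps the nodes in order; a query visits every node, which suits whole-document summaries.
list_index = GPTListIndex(nodes=nodes, service_context=service_context)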



In [46]:
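query_engine = list_index.as_query_engine()
# Query: "The following was written by Lu Xun in the first person ('I'); please summarize it in Chinese:"
response = query_engine.query("下面鲁迅先生以第一人称‘我’写的内容，请你用中文总结一下:")
print(response)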



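鲁迅先生以第一人称‘我’写的内容主要是关于他对一个人的怀念和敬佩。他提到了这个人对他的热心希望和不倦的教诲，以及他对中国医学和学术的期望。他认为这个人的性格伟大，虽然并不为许多人所知道。他还提到了他曾经订阅的讲义，但不幸在搬家时丢失了一部分书籍，其中也包括了这些讲义。他责成运送局去寻找，但没有收到回信。他仍然保留着这个人的照片，并且每当夜间疲倦时，看到照片上的面容，他感到良心发现并增加了勇气。

(Translation: What Lu Xun writes in the first person is mainly about his remembrance of and admiration for one person. He mentions this person's ardent hopes for him and tireless teaching, as well as his expectations for Chinese medicine and scholarship. He considers this person's character great, although it is not known to many. He also mentions the lecture notes he once subscribed to, but unfortunately some books were lost during a move, including these notes. He charged the shipping office with finding them but received no reply. He still keeps this person's photo, and whenever he is tired at night, seeing the face in the photo stirs his conscience and adds to his courage.)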
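
The summary above comes from a list index, which feeds every chunk to the LLM in order. One way to picture this is a "refine" loop: split the text at sentence boundaries, then update a running answer chunk by chunk. The sketch below assumes that pattern and may differ from what llama_index actually does internally. `pack_sentences`, `refine_over_chunks`, and `stub_llm` are hypothetical names, the sample sentences are invented for illustration, and the stub only keeps the first sentence of each chunk instead of calling a model.

```python
from typing import Callable, List


def pack_sentences(sentences: List[str], chunk_size: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def refine_over_chunks(question: str, chunks: List[str],
                       llm: Callable[[str, str, str], str]) -> str:
    """Visit every chunk in order, letting the LLM update the running answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(question, chunk, answer)
    return answer


def stub_llm(question: str, chunk: str, previous: str) -> str:
    """Hypothetical stand-in for a chat model: append the chunk's first sentence."""
    first_sentence = chunk.split(". ")[0] + "."
    return (previous + " " + first_sentence).strip()


# Invented sample sentences, for illustration only.
sentences = [
    "He taught me. ",
    "He corrected my notes. ",
    "I kept his photo. ",
    "It gives me courage. ",
]
chunks = pack_sentences(sentences, chunk_size=40)
summary = refine_over_chunks("Summarize the text.", chunks, stub_llm)
print(len(chunks), summary)
```

Because every chunk is visited, the cost of such a query grows with document length, which is the trade-off against the top-k retrieval of a vector index.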


In [None]:
!pip install torch transformers sentencepiece Pillow

In [56]:
from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex
from llama_index.readers.file.base import DEFAULT_FILE_READER_CLS, ImageReader
from llama_index.response.notebook_utils import display_response, display_image
from llama_index.indices.query.query_transform.base import ImageOutputQueryTransform
from llama_index.query_engine import TransformQueryEngine


image_reader = ImageReader(keep_image=True, parse_text=True)
# Copy the default extension-to-reader mapping instead of mutating the shared module-level dict.
file_extractor = {
    **DEFAULT_FILE_READER_CLS,
    ".jpg": image_reader,
    ".png": image_reader,
    ".jpeg": image_reader,
}

# NOTE: we add filename as metadata for all documents
def filename_fn(filename):
    return {'file_name': filename}

receipt_reader = SimpleDirectoryReader(
    input_dir='./data/receipts',
    file_extractor=file_extractor,
    file_metadata=filename_fn,
)
receipt_documents = receipt_reader.load_data()
# Alternative: omit file_extractor above to rely on SimpleDirectoryReader's default readers.
receipts_index = GPTVectorStoreIndex.from_documents(receipt_documents)

query_engine = receipts_index.as_query_engine()
# Wrap the engine so the query also asks for an HTML <img> tag (width 400) alongside the answer.
query_engine = TransformQueryEngine(query_engine, query_transform=ImageOutputQueryTransform(width=400))
receipts_response = query_engine.query(
    "When was the last time I went to McDonald's and how much did I spend?",
)

display_response(receipts_response)

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.


**`Final Response:`** The last time you went to McDonald's was on March 10, 2018 at 07:39:12 PM. You spent a total of $26.15. 

Here is an image with a HTML <img/> tag with a width of 400:
<image src="data/img.jpg" width="400" />

In [58]:
# Inspect the raw text that ImageReader (parse_text=True) extracted from a single receipt image.
output_image = image_reader.load_data('./data/receipts/1100-receipt.jpg')
print(output_image)

[ImageDocument(id_='bf2d9c5d-4593-482b-88bf-d89f9bed60ea', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='c9b4349366e6c268b86b445752c20ee000a9f84a49be7e0161ac8bebecf21cb6', text="<s_menu><s_nm> Story</s_nm><s_num> 16725 Stony Platin Rd</s_nm><s_num> Store#:</s_nm><s_num> 3659</s_num><s_price> 700-418-8362</s_price><sep/><s_nm> Welcome to all day breakfast dormist O Md Donald's</s_nm><s_num> 192</s_num><s_price> 192</s_price><sep/><s_nm> QTY ITEM</s_nm><s_num> OTAL</s_num><s_unitprice> 03/10/2018</s_unitprice><s_cnt> 1</s_cnt><s_price> 07:39:12 PM</s_price><sep/><s_nm> Delivery</s_nm><s_cnt> 1</s_cnt><s_price> 0.00</s_price><sep/><s_nm> 10 McNuggets EVM</s_nm><s_cnt> 1</s_cnt><s_price> 10.29</s_price><sep/><s_nm> Barbeque Sauce</s_nm><s_cnt> 1</s_cnt><s_price> 1</s_price><sep/><s_nm> Barbeque Sauce</s_nm><s_num> 1</s_cnt><s_price> 0.40</s_price><sep/><s_nm> L Coke</s_nm><s_cnt> 1</s_cnt><s_price> 0.40</s_price><sep/><