# Problem

##### Region = eu-west-1
##### Jumpstart Version = 2.0.4
##### Meta Llama-2-13b instance type = ml.g5.12xlarge
##### llama-index==0.8.35

## Pip install the following packages
llama-index==0.8.35

pypdf==3.16.4

transformers==4.34.1

torchvision==0.16.0

langchain==0.0.317

langsmith==0.0.49

openai==0.28.0
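
To confirm that the notebook kernel really has the versions pinned above, one option is a small check built on the standard-library `importlib.metadata`. This is a minimal sketch and not part of the original notebook: the `PINNED` mapping simply copies the list above, and the helper names `installed_version` and `find_mismatches` are made up for illustration.

```python
from importlib import metadata

# Copied from the pip list above.
PINNED = {
    "llama-index": "0.8.35",
    "pypdf": "3.16.4",
    "transformers": "4.34.1",
    "torchvision": "0.16.0",
    "langchain": "0.0.317",
    "langsmith": "0.0.49",
    "openai": "0.28.0",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned: dict, lookup=installed_version) -> dict:
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package that does not match its pin.
    for package, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{package}: wanted {wanted}, found {found}")
```

An empty result means the environment matches the pins. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.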

In [1]:
from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex, download_loader
from llama_index.llms import OpenAI
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

import nest_asyncio

import logging
import sys
import json
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint

from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

from llama_index import StorageContext, load_index_from_storage
from IPython.display import Markdown, display
from llama_index.prompts import PromptTemplate

logging.basicConfig(stream=sys.stdout, level=logging.INFO)  # Change INFO to DEBUG if you want more extensive logging
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

nest_asyncio.apply()

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

# S3Reader = download_loader("S3Reader")
# loader = S3Reader(bucket='llm-sql', prefix='ec2_qbr_priv/financials-llama-index/data/')

loader = SimpleDirectoryReader('./data', recursive=True, exclude_hidden=True)

fin_docs = loader.load_data()

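class ContentHandlerForTextGeneration(LLMContentHandler):
    """Translate between LangChain prompts and the endpoint's JSON chat format."""

    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Wrap the prompt in nested lists: one dialog containing one user message.
        input_str = json.dumps({
            "inputs": [[{"role": "user", "content": prompt}]],
            "parameters": model_kwargs,
        })
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Despite the annotation, `output` is read as a file-like body here.
        # This assumes the body holds one complete JSON document.
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generation"]["content"]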

parameters = {
    "max_new_tokens": 1024,
    "temperature": 0.1,
}
region = "eu-west-1"

endpoint_name = "meta-textgeneration-llama-2-13b-f-2024-04-11-10-59-12-609"
# endpoint_name= "huggingface-pytorch-tgi-inference-2024-04-11-11-15-17-687"
content_handler = ContentHandlerForTextGeneration()
llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=region,
    model_kwargs=parameters,
    endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
    content_handler=content_handler,
)

# ------------------------------------------------------------------------------------------
storage_directory = "index"
# chunk_size: size of the chunks (nodes) that LlamaIndex splits documents into at indexing time
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=600,
                                               embed_model="local",
                                               callback_manager=callback_manager)

# Build the index
index = VectorStoreIndex.from_documents(fin_docs, service_context=service_context, show_progress=True)

# Persist the index to disk
index.storage_context.persist(persist_dir=storage_directory)

# Reload the persisted index from disk
storage_context = StorageContext.from_defaults(persist_dir=storage_directory)
index = load_index_from_storage(storage_context, service_context=service_context)

query_engine = index.as_query_engine(service_context=service_context,
                                     similarity_top_k=3)
response = query_engine.query("Give me a summary of the document")
display(Markdown(f"<b>{response}</b>"))

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Parsing documents into nodes:   0%|          | 0/2 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/60 [00:00<?, ?it/s]

**********
Trace: index_construction
    |_node_parsing ->  0.143187 seconds
      |_chunking ->  0.0648 seconds
      |_chunking ->  0.061799 seconds
    |_embedding ->  2.762972 seconds
    |_embedding ->  1.892869 seconds
    |_embedding ->  1.494365 seconds
    |_embedding ->  1.544892 seconds
    |_embedding ->  1.509577 seconds
    |_embedding ->  1.540883 seconds
**********
INFO:llama_index.indices.loading:Loading all indices.
Loading all indices.
**********
Trace: index_construction
**********
**********
Trace: query
    |_query ->  4.297448 seconds
      |_retrieve ->  0.024519 seconds
        |_embedding ->  0.021604 seconds
      |_synthesize ->  4.272468 seconds
        |_templating ->  2.9e-05 seconds
        |_llm ->  4.266244 seconds
**********


<b> Sure! Based on the context information provided, here's a summary of the document:

The document is a personal essay by Paul Graham, a well-known computer programmer and investor. He describes his experiences working at Interleaf, a software company, and later co-founding Y Combinator, a successful startup accelerator. Graham reflects on what he learned during his time at Interleaf, including the importance of being the "entry-level" option and the dangers of prestige. He also discusses the challenges of working on Hacker News, a news aggregator for startup founders, and how it became a significant source of stress for him. The essay concludes with Graham's thoughts on the value of learning from one's mistakes and the importance of working hard.</b>
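
The request and response shapes that `ContentHandlerForTextGeneration` in the first cell relies on can be checked offline, without calling the endpoint. The sketch below mirrors its two methods as plain functions so it runs without LangChain. The names `build_payload` and `parse_response` are hypothetical, and `fake_body` is an invented response shaped the way `transform_output` expects, not a captured endpoint reply.

```python
import io
import json


def build_payload(prompt: str, model_kwargs: dict) -> bytes:
    """Mirror of transform_input: wrap the prompt as a single-turn chat dialog."""
    body = {
        "inputs": [[{"role": "user", "content": prompt}]],
        "parameters": model_kwargs,
    }
    return json.dumps(body).encode("utf-8")


def parse_response(output) -> str:
    """Mirror of transform_output: read a file-like body and extract the reply text."""
    response_json = json.loads(output.read().decode("utf-8"))
    return response_json[0]["generation"]["content"]


# Hypothetical response body (an assumption, not real endpoint output).
# io.BytesIO stands in for the file-like object that exposes .read().
fake_body = io.BytesIO(
    json.dumps([{"generation": {"role": "assistant", "content": " Sure!"}}]).encode("utf-8")
)

payload = json.loads(build_payload("Hi", {"max_new_tokens": 1024, "temperature": 0.1}))
reply = parse_response(fake_body)

print(payload["inputs"][0][0])  # the single user message
print(repr(reply))              # the extracted reply text
```

Note that `parse_response` calls `.read()` and parses the whole body as one JSON document, which matches the non-streaming call above. A body delivered in pieces would not satisfy that assumption.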

In [2]:
response

Response(response=' Sure! Based on the context information provided, here\'s a summary of the document:\n\nThe document is a personal essay by Paul Graham, a well-known computer programmer and investor. He describes his experiences working at Interleaf, a software company, and later co-founding Y Combinator, a successful startup accelerator. Graham reflects on what he learned during his time at Interleaf, including the importance of being the "entry-level" option and the dangers of prestige. He also discusses the challenges of working on Hacker News, a news aggregator for startup founders, and how it became a significant source of stress for him. The essay concludes with Graham\'s thoughts on the value of learning from one\'s mistakes and the importance of working hard.', source_nodes=[NodeWithScore(node=TextNode(id_='3ff7642d-dfcb-419e-8ba4-6cf738ebf1f0', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURC

In [3]:
response.get_formatted_sources()

'> Source (Doc id: 3ff7642d-dfcb-419e-8ba4-6cf738ebf1f0): This one was reasonably fast, because it was compiled into Scheme. To test this new Arc, I wrote ...\n\n> Source (Doc id: e6b285f2-4331-4f3a-aef9-ee68dc846733): This one was reasonably fast, because it was compiled into Scheme. To test this new Arc, I wrote ...\n\n> Source (Doc id: 4954b032-9929-44ae-a784-27818631cb17): But Interleaf still had a few years to live yet. [5]\n\nInterleaf had done something pretty bold. I...'

In [4]:
query_engine = index.as_query_engine(service_context=service_context,
                                     similarity_top_k=3, streaming=True)

In [6]:
response = query_engine.query("Give me a summary of the document")
response.print_response_stream()

**********
Trace: query
    |_query ->  0.027361 seconds
      |_retrieve ->  0.021727 seconds
        |_embedding ->  0.018827 seconds
      |_synthesize ->  0.005437 seconds
        |_templating ->  2.7e-05 seconds
        |_llm ->  0.0 seconds
    |_llm ->  0.0 seconds
**********


Exception in thread Thread-7 (wrapped_llm_predict):
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/langchain/llms/sagemaker_endpoint.py", line 342, in _call
    resp = json.loads(line)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/threading.py", line 1016, i