# Building a high-level sample

In [25]:
import os
from llama_index.callbacks import LlamaDebugHandler, CallbackManager
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI
from llama_index import VectorStoreIndex, SimpleDirectoryReader, OpenAIEmbedding, PromptHelper, ServiceContext, \
    StorageContext, load_index_from_storage


# Set up the OpenAI API key

In [26]:
OPENAI_API_KEY = ""
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
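Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead, assuming the key has already been exported as `OPENAI_API_KEY`; the helper name `resolve_api_key` is hypothetical and not part of any library:

```python
import os


def resolve_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key


# Example with an explicit mapping (placeholder value, not a real key):
print(resolve_api_key({"OPENAI_API_KEY": "sk-placeholder"}))
```

Called without arguments, the helper reads `os.environ`, so an empty key fails early instead of at the first API request.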

# Set up the LLM

In [27]:
# Define the ChatGPT model
OPENAI_MODEL = "gpt-3.5-turbo-16k"

# ChatGPT completion options
OPENAI_COMPLETION_OPTIONS = {
    "temperature": 0.1,  # sampling randomness (0 to 2); lower values give more deterministic output
    "max_tokens": 1000,  # maximum number of tokens the LLM may generate in a response
    "top_p": 1,  # nucleus sampling: sample only from tokens within this cumulative probability mass
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "request_timeout": 60.0,
}

# Base prompt
LLM_BASE_PROMPT = ""
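The options dictionary is defined here, but the later `OpenAI(model=OPENAI_MODEL)` call does not receive it. A minimal sketch of how the dictionary could be split before being handed to the wrapper. It assumes, without confirmation from this notebook, that the installed `OpenAI` class accepts `temperature` and `max_tokens` as named arguments and forwards everything else through `additional_kwargs`; `split_options` and `CONSTRUCTOR_KEYS` are hypothetical names:

```python
OPENAI_COMPLETION_OPTIONS = {
    "temperature": 0.1,
    "max_tokens": 1000,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "request_timeout": 60.0,
}

# Assumption: these two names are accepted directly by the wrapper's constructor.
CONSTRUCTOR_KEYS = ("temperature", "max_tokens")


def split_options(options, constructor_keys=CONSTRUCTOR_KEYS):
    """Split options into (constructor kwargs, remaining pass-through kwargs)."""
    direct = {k: v for k, v in options.items() if k in constructor_keys}
    extra = {k: v for k, v in options.items() if k not in constructor_keys}
    return direct, extra


direct, extra = split_options(OPENAI_COMPLETION_OPTIONS)
print(direct)
print(extra)

# Hypothetical use, not executed here:
# llm = OpenAI(model=OPENAI_MODEL, additional_kwargs=extra, **direct)
```

Whether `request_timeout` belongs in the pass-through group depends on the installed library versions, so check the wrapper's signature before relying on this split.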


# Create service context

In [28]:
# Note: OPENAI_COMPLETION_OPTIONS is not passed here, so the wrapper's defaults apply
llm = OpenAI(model=OPENAI_MODEL)
embed_model = OpenAIEmbedding()

# Set up the node parser
node_parser = SimpleNodeParser.from_defaults(
    chunk_size=1024,
    chunk_overlap=20
)

prompt_helper = PromptHelper(
    context_window=4096,
    num_output=256,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=None
)

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
    prompt_helper=prompt_helper,
    system_prompt=LLM_BASE_PROMPT,
    callback_manager=callback_manager
)   
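To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size windowing over a token list. It is an illustration only: the real `SimpleNodeParser` works on text with a tokenizer and may respect sentence boundaries, and `chunk_with_overlap` is a hypothetical helper. The last lines show the token budget implied by the `PromptHelper` values above.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks


tokens = list(range(10))
for chunk in chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
# [0, 1, 2, 3]
# [3, 4, 5, 6]
# [6, 7, 8, 9]

# Tokens left for the prompt template plus retrieved context once the
# answer's share (num_output) is reserved from the context window.
context_window, num_output = 4096, 256
print(context_window - num_output)  # 3840
```

Each window repeats the last `chunk_overlap` tokens of the previous one, so text that straddles a boundary still appears whole in at least one chunk.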

# Indexing

In [29]:
if not os.path.exists('./storage'):
    # Read the files in the data directory
    documents = SimpleDirectoryReader('data').load_data()
    # Embed the documents and index them in a vector store
    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context
    )

    # Persist the index (to ./storage by default) so the documents do not have to be embedded again
    index.storage_context.persist()
else:
    # Load the persisted index
    storage_context = StorageContext.from_defaults(persist_dir='./storage')
    index = load_index_from_storage(
        storage_context=storage_context,
        service_context=service_context,
    )
 

Advanced encoding /SymbolSetEncoding not implemented yet


**********
Trace: index_construction
    |_CBEventType.EMBEDDING ->  2.329824 seconds
    |_CBEventType.EMBEDDING ->  1.592508 seconds
**********
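The indexing cell follows a build-once-then-load pattern: the expensive embedding step runs only when no persisted copy exists. A library-free sketch of the same pattern, using JSON files in a temporary directory; `load_or_build` and `fake_build` are hypothetical names, and the vectors are made up:

```python
import json
import os
import tempfile


def load_or_build(persist_path, build):
    """Return cached data from persist_path, building and saving it on the first call."""
    if os.path.exists(persist_path):
        with open(persist_path, "r", encoding="utf-8") as fh:
            return json.load(fh), "loaded"
    data = build()
    os.makedirs(os.path.dirname(persist_path) or ".", exist_ok=True)
    with open(persist_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data, "built"


def fake_build():
    # Stand-in for the expensive embedding step; the vectors are made up.
    return {"doc-1": [0.1, 0.2], "doc-2": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "storage", "index.json")
    first = load_or_build(path, fake_build)
    second = load_or_build(path, fake_build)
    print(first[1], second[1])  # built loaded
```

As in the notebook cell, the existence check does not notice changes to the source files, so the persisted directory has to be deleted to force a rebuild.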


# Creating an LLM engine
An index can be queried in several modes: simple querying, chat, and streaming chat.
Each of these modes can also be used asynchronously.
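A sketch of the asynchronous style, using a stub so that it runs without an API key. `StubChatEngine` and `ask_all` are hypothetical; the stub only mirrors the `achat` coroutine that chat engines expose, and its echo reply is made up:

```python
import asyncio


class StubChatEngine:
    """Hypothetical stand-in so the example runs without an API key."""

    async def achat(self, message):
        await asyncio.sleep(0.01)  # simulate network latency
        return f"echo: {message}"


async def ask_all(engine, questions):
    # Run the requests concurrently; gather returns results in input order.
    return await asyncio.gather(*(engine.achat(q) for q in questions))


answers = asyncio.run(ask_all(StubChatEngine(), ["first", "second"]))
print(answers)  # ['echo: first', 'echo: second']
```

Inside Jupyter an event loop is already running, so `await ask_all(...)` would be used in place of `asyncio.run(...)`. A real chat engine also keeps conversation history, so concurrent calls are probably better suited to independent questions than to a single ongoing conversation.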

In [None]:
# as_chat_engine() returns a chat engine; the name query_engine is kept for the cells below
query_engine = index.as_chat_engine()

In [30]:
response = query_engine.chat("Give me unit structure of Introduction to big data")
print(response)

**********
Trace: chat
    |_CBEventType.AGENT_STEP ->  12.036483 seconds
      |_CBEventType.LLM ->  2.180296 seconds
      |_CBEventType.FUNCTION_CALL ->  5.638281 seconds
        |_CBEventType.QUERY ->  5.638281 seconds
          |_CBEventType.RETRIEVE ->  0.283716 seconds
            |_CBEventType.EMBEDDING ->  0.27217 seconds
          |_CBEventType.SYNTHESIZE ->  5.354565 seconds
            |_CBEventType.TEMPLATING ->  0.001 seconds
            |_CBEventType.LLM ->  5.350564 seconds
      |_CBEventType.LLM ->  4.215815 seconds
**********
The unit structure of the Introduction to Big Data chapter includes the following sections:

1.0 Objectives
1.1 Introduction to Big Data
1.2 Characteristics of Data and Big Data
1.3 Evolution of Big Data
1.4 Definition of Big Data
1.5 Challenges with big data
1.6 Why Big data?
1.7 Data Warehouse environment
1.8 Traditional Business Intelligence versus Big Data
1.9 State of Practice in Analytics
1.10 Key roles for New Big Data Ecosystems
1.11 Exa

In [31]:
response = query_engine.chat("What is main objective of Big Data")
print(response)

**********
Trace: chat
    |_CBEventType.AGENT_STEP ->  11.03429 seconds
      |_CBEventType.LLM ->  3.76138 seconds
      |_CBEventType.FUNCTION_CALL ->  2.769968 seconds
        |_CBEventType.QUERY ->  2.769968 seconds
          |_CBEventType.RETRIEVE ->  0.375452 seconds
            |_CBEventType.EMBEDDING ->  0.363945 seconds
          |_CBEventType.SYNTHESIZE ->  2.394516 seconds
            |_CBEventType.TEMPLATING ->  0.0 seconds
            |_CBEventType.LLM ->  2.390011 seconds
      |_CBEventType.LLM ->  4.500869 seconds
**********
The main objective of Big Data is to help organizations derive new value and create a competitive advantage from their most valuable asset: information. It aims to drive efficiency, improve quality, and provide personalized products and services, leading to enhanced customer satisfaction and profitability. Big Data analytics also enable organizations to explore new avenues of investigation and gain deeper insights that were not previously possible.

In [32]:
response = query_engine.chat("What is Key roles of the new big data ecosystems")
print(response)

**********
Trace: chat
    |_CBEventType.AGENT_STEP ->  27.617718 seconds
      |_CBEventType.LLM ->  1.550344 seconds
      |_CBEventType.FUNCTION_CALL ->  6.65849 seconds
        |_CBEventType.QUERY ->  6.65849 seconds
          |_CBEventType.RETRIEVE ->  0.482441 seconds
            |_CBEventType.EMBEDDING ->  0.469905 seconds
          |_CBEventType.SYNTHESIZE ->  6.176049 seconds
            |_CBEventType.TEMPLATING ->  0.0 seconds
            |_CBEventType.LLM ->  6.171545 seconds
      |_CBEventType.LLM ->  19.407831 seconds
**********
The key roles of the new big data ecosystems include:

1. Deep Analytical Talent: These individuals possess strong analytical skills and technical expertise. They have advanced training in quantitative disciplines such as mathematics, statistics, and machine learning. They are capable of handling raw, unstructured data and applying complex analytical techniques at massive scales.

2. Data Savvy Professionals: This group has a basic knowledge of st

In [None]:
# Building 