In [None]:
%pip install llama-index llama-index-llms-huggingface llama-index-embeddings-huggingface transformers accelerate bitsandbytes llama-index-readers-web

## Setup

### Data

In [1]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
filename_fn = lambda filename: {"file_name": filename}

In [2]:
# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data", file_metadata=filename_fn
).load_data()

### LLM


In [3]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


def messages_to_prompt(messages):
    """Render chat messages in the Zephyr chat format."""
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # ensure we start with a system prompt, insert a blank one if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # add the final assistant tag so the model continues as the assistant
    prompt = prompt + "<|assistant|>\n"

    return prompt


llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
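    # temperature/top_k/top_p only take effect in transformers when sampling is enabled
    generate_kwargs={"do_sample": True, "temperature": 0.7, "top_k": 50, "top_p": 0.95},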
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)
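
As a quick sanity check of the prompt format, `messages_to_prompt` can be run on stand-in messages without loading the model. This is a minimal sketch: `Msg` is a hypothetical substitute for LlamaIndex's `ChatMessage` (only `role` and `content` are used), and the function body is repeated so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    """Hypothetical stand-in for ChatMessage: only role and content are needed."""
    role: str
    content: str


def messages_to_prompt(messages):
    # same logic as the cell above, repeated so this snippet is self-contained
    prompt = ""
    for message in messages:
        if message.role in ("system", "user", "assistant"):
            prompt += f"<|{message.role}|>\n{message.content}</s>\n"
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    return prompt + "<|assistant|>\n"


query = "What is a transformer?"
rendered = messages_to_prompt([Msg(role="user", content=query)])

# The same layout as query_wrapper_prompt with {query_str} filled in
wrapper = "<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"
print(rendered == wrapper.format(query_str=query))
```

A single user message therefore yields the same string as `query_wrapper_prompt`, so the chat and completion paths should feed the model consistently formatted prompts.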





In [4]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")



In [None]:
%pip install llama-index-vector-stores-faiss

In [5]:
%pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


### Index Setup

In [10]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)
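
Note that the index above uses LlamaIndex's default in-memory vector store, so the FAISS packages installed earlier are not exercised. The sketch below shows one way they could be wired in. It assumes the `FaissVectorStore` and `StorageContext` classes from the installed packages, and a 384-dimensional embedding, which is the output size of `BAAI/bge-small-en-v1.5`. The helper name `build_faiss_index` is hypothetical.

```python
EMBED_DIM = 384  # embedding size of BAAI/bge-small-en-v1.5


def build_faiss_index(documents, dim: int = EMBED_DIM):
    """Build a VectorStoreIndex backed by a flat (exact) L2 FAISS index."""
    # Imports are local so the helper can be defined before the packages are installed.
    import faiss
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.faiss import FaissVectorStore

    faiss_index = faiss.IndexFlatL2(dim)
    vector_store = FaissVectorStore(faiss_index=faiss_index)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    return VectorStoreIndex.from_documents(
        documents, storage_context=storage_context
    )


# Usage (requires the `documents` loaded in the Data section):
# vector_index = build_faiss_index(documents)
```

The dimension passed to `faiss.IndexFlatL2` must match the embedding model; a different embedding model would need a different `dim`.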

### Helpful Imports / Logging

In [11]:
from llama_index.core.response.notebook_utils import display_response

## Basic Query Engine

### Compact (default)

In [12]:
query_engine = vector_index.as_query_engine(response_mode="compact")

response = query_engine.query("What is a Transformer?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** The Transformer is a neural network architecture proposed in the paper "Attention is All You Need" by Vaswani et al. (2017). It is a sequence-to-sequence model that uses self-attention and point-wise, fully connected layers for both the encoder and decoder. The Transformer follows an overall architecture with stacked self-attention and feed-forward layers, and employs residual connections and layer normalization. It achieves state-of-the-art results in machine translation tasks, outperforming previous models while requiring fewer training resources. The Transformer's success is attributed to its ability to learn the relative importance of different positions in a sequence without the need for recurrence or convolution.
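
The idea behind `compact` mode is to pack the retrieved chunks into as few prompts as possible before calling the LLM, which is consistent with the single `pad_token_id` notice above (one generation call). The following is a conceptual sketch only, not LlamaIndex's implementation: `pack_chunks` is a hypothetical helper, and it budgets by characters, whereas the library budgets by tokens against `context_window`.

```python
from typing import List


def pack_chunks(chunks: List[str], max_chars: int, sep: str = "\n\n") -> List[str]:
    """Greedily merge consecutive chunks so each packed chunk fits in max_chars.

    A single chunk longer than max_chars is kept as-is in this sketch.
    """
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + sep + chunk
        if current and len(candidate) > max_chars:
            # adding this chunk would overflow: close the current pack
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


packed = pack_chunks(["aaaa", "bbbb", "cccc"], max_chars=10)
print(len(packed))  # number of LLM calls a compact-style pass would need
```

Each packed chunk then costs one LLM call, so fewer and larger prompts mean fewer generations than handling every retrieved chunk separately.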

### Refine

In [13]:
query_engine = vector_index.as_query_engine(response_mode="refine")

response = query_engine.query("What is prompt engineering?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** Prompt engineering is a technique in the field of large language models (LLMs) and multimodal language models (MMLMs) that aims to maximize the utility and accuracy of these models by designing effective prompts, or input instructions, to guide their behavior and output. This involves both foundational and advanced methodologies, such as Chain of Thought, Self-consistency, and Generated Knowledge, that significantly enhance model performance.

The foundational methods of prompt engineering emphasize the importance of clear and precise instructions, role-prompting, and iterative attempts to optimize outputs. These methods ensure that the model understands the context and intent of the input and generates accurate and relevant responses.

Advanced methodologies, such as Chain of Thought, Self-consistency, and Generated Knowledge, guide the models in generating high-quality content. Chain of Thought involves breaking down complex tasks into smaller subtasks and guiding the model through each step. Self-consistency ensures that the model's output is consistent with its previous responses, making it more reliable and trustworthy. Generated Knowledge involves training the model on a large corpus of text to generate new knowledge and insights.

Prompt
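
The two `pad_token_id` notices above reflect how `refine` works: one LLM call per retrieved chunk, where each call after the first sees the previous answer and may revise it. The sketch below illustrates that loop with a stub in place of the model. The prompt wording and the names `refine_answer` and `fake_llm` are assumptions for illustration, not LlamaIndex's actual templates or API.

```python
from typing import Callable, List


def refine_answer(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then revise the answer once per later chunk."""
    answer = ""
    for i, chunk in enumerate(chunks):
        if i == 0:
            prompt = f"Context:\n{chunk}\n\nQuestion: {query}\nAnswer:"
        else:
            prompt = (
                f"Question: {query}\n"
                f"Existing answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer if the new context helps:"
            )
        answer = llm(prompt)
    return answer


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # stub model: records the prompt and returns a numbered draft
    calls.append(prompt)
    return f"draft-{len(calls)}"


final = refine_answer("What is prompt engineering?", ["chunk A", "chunk B"], fake_llm)
print(final, len(calls))
```

Because the calls are sequential, latency grows with the number of retrieved chunks, in exchange for each chunk being considered in full.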

### Tree Summarize

In [14]:
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("What is prompt engineering?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** Prompt engineering is a technique in the field of artificial intelligence that involves providing clear and precise instructions, role-prompting, and iterative attempts to optimize outputs to guide large language models (LLMs) and multimodal language models (MMLMs) in generating high-quality content. It extends to numerous disciplines and has facilitated the creation of robust feature extractors using LLMs, improving their efficacy in tasks such as defect detection and classification. Advanced methodologies such as Chain of Thought, Self-consistency, and Generated Knowledge are introduced to guide models in generating high-quality content. The efficacy of various prompt methods is assessed through both subjective and objective evaluations, ensuring a robust analysis of their effectiveness. The security implications of prompt engineering are also addressed, identifying common vulnerabilities in LLMs and proposing strategies to enhance security through adversarial training and robust prompt design. Prompt engineering has a broad impact across diverse fields such as education, content creation, computer programming, and reasoning tasks.