<a href="https://colab.research.google.com/github/Adnan525/2048python/blob/master/stablelm_zephyr_3b_feature_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Test for stablelm-zephyr-3b

[Model card on Hugging Face](https://huggingface.co/stabilityai/stablelm-zephyr-3b).

Here's the [release post](https://stability.ai/news/stablelm-zephyr-3b-stability-llm).

In [1]:
!pip install llama-index llama-index-llms-huggingface llama-index-embeddings-huggingface transformers accelerate bitsandbytes llama-index-readers-web

Collecting llama-index
  Downloading llama_index-0.10.36-py3-none-any.whl (6.8 kB)
Collecting llama-index-llms-huggingface
  Downloading llama_index_llms_huggingface-0.2.0-py3-none-any.whl (10 kB)
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.2.0-py3-none-any.whl (7.1 kB)
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Collecting llama-index-readers-web
  Downloading llama_index_readers_web-0.1.14-py3-none-any.whl (69 kB)

## Setup

### Data

In [2]:
!mkdir -p "data/java/"

In [4]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("data/java").load_data()
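The `mkdir` cell only creates an empty directory, so the source files still have to be uploaded or copied into `data/java` before loading; the reader typically raises an error when it finds no files. A minimal sketch of a pre-flight check follows. `list_input_files` is a hypothetical helper written for this example and is not part of LlamaIndex.

```python
from pathlib import Path


def list_input_files(directory: str) -> list[Path]:
    """Return the regular files directly inside `directory`, sorted by name."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.is_file())


files = list_input_files("data/java")
if not files:
    print("data/java is empty or missing: add documents before calling load_data()")
else:
    print(f"Found {len(files)} file(s): {[f.name for f in files]}")
```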

### LLM

This should run on a T4 GPU instance on the Colab free tier.
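A rough back-of-the-envelope check of why 4-bit loading should fit. The sketch assumes the roughly 5.59 GB `model.safetensors` checkpoint stores weights as 16-bit floats and that the T4 has a nominal 16 GB of memory. It ignores activations, the KV cache, and quantization overhead, so the numbers are estimates only.

```python
# Rough weight-memory estimate; every figure here is an approximation.
CHECKPOINT_BYTES = 5.59e9      # approximate size of model.safetensors
BYTES_PER_PARAM_16BIT = 2      # assumption: checkpoint stored in 16-bit floats
T4_MEMORY_BYTES = 16e9         # assumption: nominal T4 memory

n_params = CHECKPOINT_BYTES / BYTES_PER_PARAM_16BIT
weights_4bit_bytes = n_params * 0.5   # 4 bits = half a byte per parameter

print(f"~{n_params / 1e9:.1f}B parameters")
print(f"~{weights_4bit_bytes / 1e9:.1f} GB of weights at 4-bit")
print(f"weights fit on a T4: {weights_4bit_bytes < T4_MEMORY_BYTES}")
```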

In [5]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


def messages_to_prompt(messages):
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}<|endoftext|>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}<|endoftext|>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}<|endoftext|>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|system|>\n"):
    prompt = "<|system|>\n<|endoftext|>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"

  return prompt


llm = HuggingFaceLLM(
    model_name="stabilityai/stablelm-zephyr-3b",
    tokenizer_name="stabilityai/stablelm-zephyr-3b",
    query_wrapper_prompt=PromptTemplate("<|system|>\n<|endoftext|>\n<|user|>\n{query_str}<|endoftext|>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    # note: transformers applies temperature only when sampling (do_sample=True) is enabled
    generate_kwargs={"temperature": 0.8},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
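To see what `messages_to_prompt` produces, the sketch below restates the same formatting logic in compact form and runs it on a hypothetical `Message` dataclass. That class is a stand-in for LlamaIndex's chat message objects and carries only `role` and `content`. The function name `format_zephyr_prompt` is also made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in exposing the two attributes the formatter reads."""
    role: str
    content: str


def format_zephyr_prompt(messages):
    """Same logic as messages_to_prompt, restated so this block runs on its own."""
    prompt = ""
    for m in messages:
        if m.role in ("system", "user", "assistant"):
            prompt += f"<|{m.role}|>\n{m.content}<|endoftext|>\n"
    # Prepend an empty system turn if none was supplied.
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n<|endoftext|>\n" + prompt
    # Leave the assistant turn open for the model to complete.
    return prompt + "<|assistant|>\n"


print(format_zephyr_prompt([Message("user", "What is a JVM?")]))
```

With a single user message, the result matches the shape of the `query_wrapper_prompt` template: an empty system turn, the user turn, then an open assistant turn.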


In [6]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


### Index Setup

In [7]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)
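Conceptually, a vector index embeds each chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The toy sketch below illustrates the idea with hand-made 3-dimensional vectors and cosine similarity. It is not how `VectorStoreIndex` is implemented, and `top_k_chunks`, the chunk ids, and the sample vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings for three made-up chunks.
chunks = {
    "intro": [0.9, 0.1, 0.0],
    "generics": [0.1, 0.9, 0.1],
    "afterword": [0.8, 0.2, 0.1],
}
print(top_k_chunks([1.0, 0.0, 0.0], chunks))
```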

In [8]:
from llama_index.core import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)

### Helpful Imports / Logging

In [9]:
from llama_index.core.response.notebook_utils import display_response

In [10]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
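Note that `basicConfig` already installs a stdout handler on the root logger, so adding a second `StreamHandler` means each record is emitted twice. A small sketch that makes this visible follows. `force=True` (available since Python 3.8) is used here only so the example starts from a clean root logger, and `handler_count` is a name introduced for the example.

```python
import logging
import sys

# Start from a clean root logger, then repeat the two calls from the logging cell.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

handler_count = len(logging.getLogger().handlers)
print(f"root handlers: {handler_count}")

# Each handler writes this record once, so it shows up twice on stdout.
logging.info("hello from the root logger")
```

Dropping the `addHandler` line is one way to get each message only once.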

## Basic Query Engine

### Compact (default)

In [17]:
query_engine = vector_index.as_query_engine(response_mode="compact")

response = query_engine.query("the author suggested the next big revolution will be in what?")

display_response(response)

Token indices sequence length is longer than the specified maximum sequence length for this model (2126 > 2048). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


**`Final Response:`** The author suggested the next big revolution will be in genetic engineering.
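The tokenizer warning in this output is worth a quick sanity check. The sketch below compares the prompt length from the warning (2126 tokens) with the 2048-token limit the tokenizer reports and with the budget implied by the `HuggingFaceLLM` settings (`context_window=3900`, `max_new_tokens=256`). Whether the model handles prompts above 2048 tokens reliably is not established here. The arithmetic only shows where the numbers disagree, and the simple subtraction is an approximation of how the prompt budget is derived.

```python
CONTEXT_WINDOW = 3900        # from the HuggingFaceLLM configuration
MAX_NEW_TOKENS = 256         # from the HuggingFaceLLM configuration
TOKENIZER_MAX_LENGTH = 2048  # limit reported in the tokenizer warning
PROMPT_TOKENS = 2126         # prompt length reported in the tokenizer warning

# Approximate room left for the prompt once generation is reserved.
prompt_budget = CONTEXT_WINDOW - MAX_NEW_TOKENS
overshoot = PROMPT_TOKENS - TOKENIZER_MAX_LENGTH

print(f"prompt budget under configured window: {prompt_budget}")
print(f"fits configured window: {PROMPT_TOKENS <= prompt_budget}")
print(f"fits tokenizer limit:   {PROMPT_TOKENS <= TOKENIZER_MAX_LENGTH}")
print(f"tokens over tokenizer limit: {overshoot}")
```

If the warning is a concern, lowering `context_window` toward the tokenizer limit should make the compact mode pack less text into each prompt.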

### Refine

In [18]:
query_engine = vector_index.as_query_engine(response_mode="refine")

response = query_engine.query("the author suggested the next big revolution will be in what?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


**`Final Response:`** The author suggested that the next big revolution will be in genetic engineering. This is in line with the context provided, which discusses the inspiration for the cover design of the book and the author's perspective on Java as an attempt to elevate the programmer from being just a mechanic to a "software craftsman." While the context is not directly related to the topic of revolutions, it highlights the author's perspective on the evolution of technology and the importance of being a skilled craftsman in the digital age.
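The two `pad_token_id` lines in the refine output suggest the model was called twice, once per retrieved chunk, whereas the compact run shows a single call. A toy sketch of that difference follows. It uses a hypothetical `fake_llm` that only records calls, and a word count as a crude stand-in for a token budget; the real library works on tokens and prompt templates.

```python
calls = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model: records the prompt, returns a tag."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


def refine(question, chunks):
    """One call per chunk; each call sees the previous answer plus one new chunk."""
    answer = ""
    for chunk in chunks:
        answer = fake_llm(f"Q: {question}\nPrevious: {answer}\nContext: {chunk}")
    return answer


def compact(question, chunks, budget_words=50):
    """Pack chunks into as few prompts as the budget allows, then refine over those."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}".strip()
        if current and len(candidate.split()) > budget_words:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return refine(question, packed)


chunks = ["chunk one " * 5, "chunk two " * 5]  # two 10-word chunks

refine("q", chunks)
refine_calls = len(calls)
calls.clear()

compact("q", chunks)
compact_calls = len(calls)

print(f"refine calls: {refine_calls}, compact calls: {compact_calls}")
```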

### Tree Summarize

In [19]:
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("the author suggested the next big revolution will be in what?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


**`Final Response:`** The author suggested the next big revolution will be in genetic engineering.
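Tree summarize builds the answer bottom-up: chunks are summarized in groups, and the group summaries are summarized again until one remains. A toy sketch follows, with a hypothetical `fake_summarize` in place of the model. The fan-out of 2 is an arbitrary choice for illustration, and the library's own grouping is assumed to depend on how much text fits in a prompt.

```python
def fake_summarize(texts):
    """Hypothetical stand-in for an LLM call: joins the inputs and tags the result."""
    return "S(" + " + ".join(texts) + ")"


def tree_summarize(chunks, fan_out=2):
    """Summarize groups of `fan_out` texts repeatedly until one summary remains."""
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        level = [
            fake_summarize(level[i:i + fan_out])
            for i in range(0, len(level), fan_out)
        ]
        if len(level) == 1:
            return level[0]


print(tree_summarize(["a", "b", "c"]))
```

The single `pad_token_id` line in the run above is consistent with the retrieved chunks fitting into one prompt, in which case the tree has only one level.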