# Semantic Chunking

In [1]:
# !pip install llama-index-embeddings-huggingface
# !p1p install llama-index-embeddings-instructor
# !pip install llama-index-llms-huggingface
# !pip install llama-index-llms-huggingface-api
# !pip install matplotlib

### HuggingFace embeddings

In [8]:
### https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/

In [22]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="HuggingFaceH4/zephyr-7b-alpha")
#embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

No sentence-transformers model found with name HuggingFaceH4/zephyr-7b-alpha. Creating a new one with mean pooling.


config.json:   0%|          | 0.00/628 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

In [2]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

768
[-0.014177391305565834, -0.016448240727186203, 0.00847985502332449, -0.035727839916944504, -0.025835426524281502]


### Semantic Chunking

In [3]:
### https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/

In [4]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


--2024-07-16 08:43:00--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘pg_essay.txt’


2024-07-16 08:43:00 (9.30 MB/s) - ‘pg_essay.txt’ saved [75042/75042]



In [5]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=["pg_essay.txt"]).load_data()

In [6]:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)

In [7]:
embed_model = embed_model
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

# also baseline splitter
base_splitter = SentenceSplitter(chunk_size=512)

In [8]:
nodes = splitter.get_nodes_from_documents(documents)

In [9]:
print(nodes[1].get_content())

I was puzzled by the 1401. 


In [10]:
print(nodes[2].get_content())

I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.

With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1]


In [11]:
print(nodes[3].get_content())

And moreover this was something you could make a living doing. Not as easily as you could by writing software, of course, but I thought if you were really industrious and lived really cheaply, it had to be possible to make enough to survive. And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding.

I had always liked looking at paintings. Could I make them? 


In [12]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://artifactory-eo.corp.qbe.com/artifactory/api/pypi/gl-pypi-remote/simple
Collecting llama-index-llms-huggingface
  Downloading https://artifactory-eo.corp.qbe.com/artifactory/api/pypi/gl-pypi-remote/packages/packages/b3/93/ea8f03548ac84b050c16b1fc769f6dd59b2f79795b90400d4eb2e59f265d/llama_index_llms_huggingface-0.2.4-py3-none-any.whl (11 kB)
Collecting text-generation<0.8.0,>=0.7.0 (from llama-index-llms-huggingface)
  Downloading https://artifactory-eo.corp.qbe.com/artifactory/api/pypi/gl-pypi-remote/packages/packages/7b/79/8fc351fd919a41287243c998a47692c7eb0fa5acded13db0080f2c6f1852/text_generation-0.7.0-py3-none-any.whl (12 kB)
Collecting accelerate>=0.21.0 (from transformers[torch]<5.0.0,>=4.37.0->llama-index-llms-huggingface)
  Using cached https://artifactory-eo.corp.qbe.com/artifactory/api/pypi/gl-pypi-remote/packages/packages/e4/74/564f621699b049b0358f7ad83d7437f8219a5d6efb69

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://artifactory-eo.corp.qbe.com/artifactory/api/pypi/gl-pypi-remote/simple
Collecting llama-index-llms-huggingface-api
  Downloading https://artifactory-eo.corp.qbe.com/artifactory/api/pypi/gl-pypi-remote/packages/packages/cf/97/1db63177240317aad0704c8bec84284a039db6c498a60a910f4878906fee/llama_index_llms_huggingface_api-0.1.0-py3-none-any.whl (5.0 kB)
Installing collected packages: llama-index-llms-huggingface-api
Successfully installed llama-index-llms-huggingface-api-0.1.0


In [18]:
base_nodes = base_splitter.get_nodes_from_documents(documents)

In [13]:
from llama_index.llms.huggingface import HuggingFaceLLM




In [15]:
llm = HuggingFaceLLM(model_name="HuggingFaceH4/zephyr-7b-alpha")

config.json:   0%|          | 0.00/628 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/816M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


tokenizer_config.json:   0%|          | 0.00/264 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The model `HuggingFaceH4/zephyr-7b-alpha` and tokenizer `StabilityAI/stablelm-tuned-alpha-3b` are different, please ensure that they are compatible.


In [16]:
vector_index = VectorStoreIndex(nodes, embed_model=embed_model)
query_engine = vector_index.as_query_engine(llm=llm)

In [20]:
base_vector_index = VectorStoreIndex(base_nodes, embed_model=embed_model)
base_query_engine = base_vector_index.as_query_engine(llm=llm)

In [21]:
response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [246,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [246,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [246,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [246,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [246,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [246,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
