### Issues with RecursiveCharacterTextSplitter
- It splits text by separators and length limits only, with no awareness of meaning, so a chunk may hold just a single sentence
- If a single paragraph covers multiple topics, it is split by size and overlap rather than by context
- What we need instead are chunks split by context, where each chunk carries its own distinct meaning
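
To make the problem concrete, here is a minimal sketch of size-and-overlap splitting. It is a simplified illustration, not the actual `RecursiveCharacterTextSplitter` implementation; the helper `fixed_size_chunks`, the sample text, and the sizes are assumptions chosen for the demo.

```python
def fixed_size_chunks(text, chunk_size, overlap):
    """Cut text into windows of chunk_size characters that overlap by `overlap`.

    Hypothetical helper for illustration only: it looks at length, never at meaning.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Two unrelated topics in one short paragraph
sample = "AI is transforming industries. I love pizza."
size_chunks = fixed_size_chunks(sample, chunk_size=20, overlap=5)

for idx, chunk in enumerate(size_chunks):
    print(f"Chunk {idx+1}: {chunk!r}")
```

With these settings, one of the windows straddles the boundary between the AI sentence and the pizza sentence, which is exactly the mixing of contexts described above.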

## Semantic Chunking: It provides contextually rich, logically separated chunks

### Steps 
- Document Segmentation (based on sentence or paragraph)
- Embeddings
- Semantic similarity search (cosine similarity on embeddings)
- Merging adjacent sentences whose cosine similarity is at or above a threshold
- Forming chunks
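
The steps above can be sketched without any model, using hand-made 2D vectors in place of real embeddings. This is a toy illustration under stated assumptions: `cosine_sim`, `merge_by_similarity`, and the toy vectors are hypothetical names and values introduced here, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def merge_by_similarity(sentences, vectors, threshold):
    """Merge each sentence into the current chunk while it stays
    similar enough to the sentence right before it."""
    if not sentences:
        return []
    chunks = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_sim(vectors[i], vectors[i - 1]) >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks


# Toy "embeddings": the first two point one way, the last two another (assumed values)
toy_sentences = ["A1", "A2", "B1", "B2"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = merge_by_similarity(toy_sentences, toy_vectors, threshold=0.4)
print(toy_chunks)
```

Only neighbouring sentences are compared, so the result depends on sentence order and on the chosen threshold.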

In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

In [6]:
model = SentenceTransformer('all-MiniLM-L6-v2')
text = """
AI is transforming industries.
Machine learning is a subset of AI.
I love pizza.
It tastes great.
"""

In [8]:
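# Split the text into sentences (one per non-empty line)
sentences = [s.strip() for s in text.split("\n") if s.strip()]

# Compute one embedding vector per sentence
embeddings = model.encode(sentences)

# Initialize parameters
threshold = 0.4  # minimum cosine similarity to keep a sentence in the current chunk
chunks = []
current_chunk = [sentences[0]]

# Similarity-based chunking: compare each sentence with the one before it
for i in range(1, len(sentences)):
    sim = cosine_similarity([embeddings[i]], [embeddings[i-1]])[0][0]
    if sim >= threshold:
        # Similar enough: extend the current chunk
        current_chunk.append(sentences[i])
    else:
        # Topic shift: close the current chunk and start a new one
        chunks.append(" ".join(current_chunk))
        current_chunk = [sentences[i]]

# Flush the last chunk
chunks.append(" ".join(current_chunk))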

for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}: {chunk}\n")


Chunk 1: AI is transforming industries. Machine learning is a subset of AI.

Chunk 2: I love pizza. It tastes great.

