# DBS II - Project 2

## Heehwan Soul, 885941
## Mehmet Görkem Basar, 921637

## Step 1 - Load the PDF Documents

In [1]:
from langchain.document_loaders import PyPDFLoader

pdf_path1 = "./data/data_science.pdf"
pdf_path2 = "./data/angewandte_mathematik.pdf"

loader1 = PyPDFLoader(pdf_path1)
loader2 = PyPDFLoader(pdf_path2)

# Load the documents
documents1 = loader1.load()
documents2 = loader2.load()

# Combine the loaded documents
documents = documents1 + documents2

# Extract text from the combined documents
pdf_text = ' '.join([doc.page_content for doc in documents])
print(f"Total pages loaded: {len(documents)}")

print(pdf_text)


Total pages loaded: 104
 
 
 
 
Master Data Science  
Modulhandbuch  
 
 
 
 
Gesamtansprechpartner: Dekan des FBVI  
Prof. Dr. -Ing. Wolfgang Kesseler  
fb06@beuth -hochschule.de  
 
Ansprechpartner Studiengang:  
Prof. Dr. Stefan Edlich  
sedlich@beuth -hochschule.de  
Stand:  24.11.2020 
  
 2 
 Modul -
Nr. Name  Verantwortliche  Seite  
M01 Mathematische Modelle  Prof. Müller  3 
M02 Fortgeschrittene Softwaretechnik  Prof. Edlich  5 
M03 Statistical Computing  Prof. Downie  7 
M04 Praxis der Data Science Programmierung  Prof. Biessmann  9 
M05 Computer Science für Big Data  Prof. Graupner  10 
M06 Business Intelligence und Verantwortung  Prof. Löser  11 
    
M07 Visualisierung von Daten  Prof. Grömping  13 
M08 Regression  Prof. Grömping  14 
M09 Machine Learning I Prof. Downie  16 
M10 Anwendung 1: Data Science Workflow  / Applications  Prof. Biessmann  17 
M11 Wahlpflichtmodul I FB VI / FB II  19 
    
M12 Machine Learning II Prof. Downie  20 
M13 Anwendung 2: Urbane Technologie

## Step 2 - Define Two Embedder Models

In [2]:
from sentence_transformers import SentenceTransformer
# Load the models
# "Short" model: max_seq_length is 256 tokens
model_short = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
tokenizer_short = model_short.tokenizer
MAX_TOKENS_SHORT = model_short.max_seq_length  # 256 tokens

# "Long" model: max_seq_length is 512 tokens
model_long = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b", device="cuda")
tokenizer_long = model_long.tokenizer
MAX_TOKENS_LONG = model_long.max_seq_length  # 512 tokens

print(f"Max supported token length with short model: {MAX_TOKENS_SHORT}")
print(f"Max supported token length with long model: {MAX_TOKENS_LONG}")



Max supported token length with short model: 256
Max supported token length with long model: 512


## Step 3 - Chunking and Embedding for Both Models

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
import re
import numpy as np

### 1-1. First Model (all-MiniLM-L6-v2) using Naive Chunking Method

In [4]:
# Function to count tokens
def token_count(text, tokenizer):
    return len(tokenizer.encode(text))

# Splitter that measures chunk length in tokens of the short model's tokenizer
text_splitter_short = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200,
    chunk_overlap=20,
    length_function=lambda x: token_count(x, tokenizer_short),
)

# Example of a chunk
naive_chunks_short = text_splitter_short.split_documents(documents)
for chunk in naive_chunks_short[10:11]:
    print(chunk.page_content+ "\n")

4 
 • Mehrdimensionale Zufallsvariable (Zufallsvektoren), Kovarianz 
und Korrelation, Abhängigkeit und Unabhängigkeit, lineare 
Transformationen v on Zufallsvektoren  
• Spezielle Verteilungen, insbesondere:Gleichverteilung, 
Binomialverteilung, Poissonverteilung, Normalverteilung, 
Exponentialverteilung, mehrdimensionale Normalverteilung  
 
Literatur  Bosch, K.: Elementare Einführung in die Wahrscheinlichkeitsrechnung, 
Vieweg



In [5]:
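# Count tokens so we can check whether a chunk exceeds the model's maximum sequence length.
# Note: tokenize() does not add the special tokens ([CLS], [SEP]) that encode() adds.
def check_input_length_short(text):
    tokens = tokenizer_short.tokenize(text)
    return len(tokens)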

for chunk in naive_chunks_short:
    length = check_input_length_short(chunk.page_content)
    if length > MAX_TOKENS_SHORT:
        print(f"Chunk length {length} exceeds max length of {MAX_TOKENS_SHORT}")

In [6]:
# full content of the document as a list of strings
content_short = [chunk.page_content for chunk in naive_chunks_short] 
print(f'first chunk: character len:{len(content_short[0])} tokens:{check_input_length_short(content_short[0])} \n')
print('--------------------- \n')
print(f'{content_short[0]}\n\n')
print(f'last chunk: character len:{len(content_short[-1])} tokens:{check_input_length_short(content_short[-1])} \n')
print('--------------------- \n')
print(content_short[-1])

first chunk: character len:258 tokens:82 

--------------------- 

Master Data Science  
Modulhandbuch  
 
 
 
 
Gesamtansprechpartner: Dekan des FBVI  
Prof. Dr. -Ing. Wolfgang Kesseler  
fb06@beuth -hochschule.de  
 
Ansprechpartner Studiengang:  
Prof. Dr. Stefan Edlich  
sedlich@beuth -hochschule.de  
Stand:  24.11.2020


last chunk: character len:56 tokens:15 

--------------------- 

Modulhandbuch Bachelor Studiengang Mathematik  68 von 68


In [7]:
# generate embedding
naive_vectors_short = model_short.encode(
    content_short,
    show_progress_bar=True,
)
naive_vectors_short.shape # (n_chunks, embedding_dim)

Batches:   0%|          | 0/14 [00:00<?, ?it/s]

(444, 384)

### 1-2. First Model (all-MiniLM-L6-v2) using Semantic Chunking Method

In [8]:
model = model_short
max_tokens = MAX_TOKENS_SHORT
tokenizer = tokenizer_short

def check_input_length(text):
    tokens = tokenizer.tokenize(text)
    return len(tokens)

def split_long_chunk_recursive(text, chunk, start_index, index):
    single_sentences_list = _split_sentences(text)
    if check_input_length(chunk) <= max_tokens:
        return [chunk]

    # A single sentence can be longer than the max sequence length
    # (this happens with all-MiniLM-L6-v2, whose limit is 256 tokens).
    # It cannot be split further at sentence level, so it is dropped.
    if start_index == index:
        print("start_index = index")
        return []
    
    # Split at the middle sentence
    split_index = (index + start_index) // 2
      
    # Rebuild the two halves from the sentence list
    first_part = ' '.join(single_sentences_list[start_index:split_index+1])
    second_part = ' '.join(single_sentences_list[split_index+1:index+1])
    
    # Recursively split both halves if needed
    return split_long_chunk_recursive(text, first_part, start_index, split_index) + split_long_chunk_recursive(text, second_part, split_index+1, index)


# Main chunking function
def chunk_text(text):
    # Split the input text into individual sentences.
    single_sentences_list = _split_sentences(text)

    for single_sentence in single_sentences_list:
        if check_input_length(single_sentence) > max_tokens:
            print("single setence is bigger than max tokens")
    # Combine adjacent sentences to form a context window around each sentence.
    combined_sentences = _combine_sentences(single_sentences_list)
    
    # Convert the combined sentences into vector representations using a neural network model.
    embeddings = convert_to_vector(combined_sentences)
    
    # Calculate the cosine distances between consecutive combined sentence embeddings to measure similarity.
    distances = _calculate_cosine_distances(embeddings)
    
    # Determine the threshold distance for identifying breakpoints based on the 80th percentile of all distances.
    breakpoint_percentile_threshold = 80
    breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
    # Find all indices where the distance exceeds the calculated threshold, indicating a potential chunk breakpoint.
    indices_above_thresh = [i for i, distance in enumerate(distances) if distance > breakpoint_distance_threshold]
    # Initialize the list of chunks and a variable to track the start of the next chunk.
    chunks = []
    start_index = 0
    # Loop through the identified breakpoints and create chunks accordingly.
    for index in indices_above_thresh:
        chunk = ' '.join(single_sentences_list[start_index:index+1])
        if check_input_length(chunk) <= max_tokens:
            chunks.append(chunk)
        else:
            print(f"Chunk too long; splitting recursively...")
            long_chunks = split_long_chunk_recursive(text, chunk, start_index, index)
            chunks.extend(long_chunks)
        start_index = index + 1
    
    # If there are any sentences left after the last breakpoint, add them as the final chunk.
    # Note: this final chunk is not length-checked, so it may exceed max_tokens.
    if start_index < len(single_sentences_list):
        chunk = ' '.join(single_sentences_list[start_index:])
        chunks.append(chunk)
    
    # Return the list of text chunks.
    return chunks

def _split_sentences(text):
    # Use regular expressions to split the text into sentences based on punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.?!])\s+', text)
    return sentences
    
# Function to combine sentences for contextual embeddings
def _combine_sentences(sentences):
    # Create a buffer by combining each sentence with its previous and next sentence to provide a wider context.
    combined_sentences = []
    for i in range(len(sentences)):
        combined_sentence = sentences[i]
        if i > 0:
            combined_sentence = sentences[i-1] + ' ' + combined_sentence
        if i < len(sentences) - 1:
            combined_sentence += ' ' + sentences[i+1]
        combined_sentences.append(combined_sentence)
    return combined_sentences

def convert_to_vector(texts):
    # Try to generate embeddings for a list of texts using a pre-trained model and handle any exceptions.
    try:
        embeddings = model.encode(texts)
        return embeddings
    except Exception as e:
        print("An error occurred:", e)
        return np.array([])  # Return an empty array in case of an error

# Function to calculate cosine distances between embeddings
def _calculate_cosine_distances(embeddings):
    similarity_matrix = cosine_similarity(embeddings)
    
    consecutive_similarities = np.diag(similarity_matrix, k=1)
    
    # Calculate the distances (1 - similarity)
    distances = 1 - consecutive_similarities
    
    return distances


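# Main Section
# Note: content_short holds the naive chunks; chunks from the same page may
# overlap by up to 20 tokens, so the joined text repeats those overlaps.
full_pdf = " ".join(content_short)

chunks_semantic_short = chunk_text(full_pdf)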

# Output the chunks
print(f"Number of semantic chunks: {len(chunks_semantic_short)}")
print(f"Number of naive chunks: {len(content_short)}")
print("\n---\n")
for chunk in chunks_semantic_short[:5]:  # Display first 5 semantic chunks
    print(chunk + "\n---\n")

Token indices sequence length is longer than the specified maximum sequence length for this model (511 > 256). Running this sequence through the model will result in indexing errors


single setence is bigger than max tokens

In [9]:
# To check if the length of input is bigger than the number of max tokens
for chunk in chunks_semantic_short:
    length = check_input_length_short(chunk)
    if length > model_short.max_seq_length:
        print(f"Chunk length {length} exceeds max length of {model_short.max_seq_length}")

In [10]:
print(f'first chunk: character len:{len(chunks_semantic_short[0])} tokens:{check_input_length_short(chunks_semantic_short[0])} \n')
print('--------------------- \n')
print(f'{chunks_semantic_short[0]}\n\n')
print(f'last chunk: character len:{len(chunks_semantic_short[-1])} tokens:{check_input_length_short(chunks_semantic_short[-1])} \n')
print('--------------------- \n')
print(chunks_semantic_short[-1])

first chunk: character len:95 tokens:26 

--------------------- 

Master Data Science  
Modulhandbuch  
 
 
 
 
Gesamtansprechpartner: Dekan des FBVI  
Prof. Dr.


last chunk: character len:233 tokens:70 

--------------------- 

Jähich: Topologie. Springer  
Weitere Hinweise  Dieses Modul wird auf Deutsch angeboten  und gehört zu den Profil richtung en „Computer Vision“  und „Industrielle Mathematik“ . Modulhandbuch Bachelor Studiengang Mathematik  68 von 68


In [11]:
# Generate Embeddings for the Chunks
semantic_vectors_short = model_short.encode(
    chunks_semantic_short,
    show_progress_bar=True,
)
semantic_vectors_short.shape

Batches:   0%|          | 0/13 [00:00<?, ?it/s]

(385, 384)

### 1-3. Discussion About the Semantic Chunking Method

1. With semantic chunking, a chunk can end up longer than the model's maximum sequence length, so `chunk_text()` checks the token count of every chunk. If a chunk is too long, `split_long_chunk_recursive()` is called, which halves the chunk at its middle sentence, recursively, until every part fits within the maximum sequence length.

2. **Limitation**:
The smallest unit of this semantic chunking method is a sentence, but a single sentence can itself exceed the model's maximum sequence length. For example, with all-MiniLM-L6-v2 (maximum sequence length 256), a few sentences were longer than the limit. We decided to drop these sentences. This could be improved, for example by keeping only the first 256 tokens of such a sentence instead of deleting it.

3. **Limitation**:
Any sentences left after the last breakpoint are added as the final chunk without a length check, so this final chunk can exceed the maximum sequence length. We wanted to add a check here; this part can be improved as well.
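
As a hedged sketch of how the over-long-sentence limitation could be handled without deleting text, the helper below packs the words of a sentence greedily into pieces that each fit a token budget. It is not part of the project code: `split_oversized_sentence` and `count_tokens` are hypothetical names, and the whitespace counter in the usage line is only a stand-in for a real tokenizer-based count such as `check_input_length`.

```python
from typing import Callable, List


def split_oversized_sentence(
    sentence: str,
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[str]:
    """Greedily pack whitespace-separated words into pieces within max_tokens.

    A single word that alone exceeds max_tokens is kept as its own piece,
    so the caller may still need to truncate in that rare case.
    """
    pieces: List[str] = []
    current: List[str] = []
    for word in sentence.split():
        candidate = current + [word]
        # Close the current piece once adding the next word would overflow it.
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            pieces.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        pieces.append(" ".join(current))
    return pieces


# Stand-in token counter (assumption): one token per whitespace-separated word.
demo_pieces = split_oversized_sentence("a b c d e", lambda s: len(s.split()), 2)
```

The same helper could also be applied to the final chunk after the last breakpoint, since that chunk is currently appended without a length check.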

### 2-1. Second Model (sentence-transformers/msmarco-distilbert-base-tas-b) using Naive Chunking Method

In [12]:
# Splitter that measures chunk length in tokens of the long model's tokenizer
text_splitter_long = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=50,
    length_function=lambda x: token_count(x, tokenizer_long),
)

# Example of a chunk
naive_chunks_long = text_splitter_long.split_documents(documents)
for chunk in naive_chunks_long[10:11]:
    print(chunk.page_content+ "\n")

7 
  
Datenfeld  Erklärung  
Modulnummer  M03 
Titel  Statistical Computing / Statistical Computing  
Leistungspunkte  5 LP 
Workload  2 SWS SU und 1 SWS Ü  
(51 h Präsenz / 99 h Selbststudium)  
Verwendbarkeit  Eigener Studiengang, Anerkennung für andere Studiengänge gemäß 
Rahmenstudien - und -prüfungsordnung  
Lerngebiet  Fachspezifische Vertiefung  
Lernziele / Kompetenzen  Die Studierenden erarbeiten sich statistische Kenntnisse direkt im 
Zusammenspiel der Bearbeitung datenanalytischer Fragestellungen, 
der Anwendung statistischer Methoden in geeigneter Statistiksoftware 
sowie der theoretischen Formulierung der verwendeten Methoden. Die 
Lehrveranstaltung vermittelt dabei  fundierte Kenntnisse in der 
statistischen Programmierung.  
Voraussetzungen  Empfehlung: Mathematische Grundlagen aus einem Bachelor  
Niveaustufe  1. Studienplansemester  
Lehr- und Lernform  Seminaristischer Unterricht und Übung am Rechner  
Status  Pflichtmodul  
Häufigkeit des Angebotes  Wintersemester  


In [13]:
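# Count tokens so we can check whether a chunk exceeds the model's maximum sequence length.
# Note: tokenize() does not add the special tokens ([CLS], [SEP]) that encode() adds.
def check_input_length_long(text):
    tokens = tokenizer_long.tokenize(text)
    return len(tokens)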

for chunk in naive_chunks_long:
    length = check_input_length_long(chunk.page_content)
    if length > MAX_TOKENS_LONG:
        print(f"Chunk length {length} exceeds max length of {MAX_TOKENS_LONG}")

In [14]:
# full content of the document as a list of strings
content_long = [chunk.page_content for chunk in naive_chunks_long] 
print(f'first chunk: character len:{len(content_long[0])} tokens:{check_input_length_long(content_long[0])} \n')
print('--------------------- \n')
print(f'{content_long[0]}\n\n')
print(f'last chunk: character len:{len(content_long[-1])} tokens:{check_input_length_long(content_long[-1])} \n')
print('--------------------- \n')
print(content_long[-1])

first chunk: character len:258 tokens:82 

--------------------- 

Master Data Science  
Modulhandbuch  
 
 
 
 
Gesamtansprechpartner: Dekan des FBVI  
Prof. Dr. -Ing. Wolfgang Kesseler  
fb06@beuth -hochschule.de  
 
Ansprechpartner Studiengang:  
Prof. Dr. Stefan Edlich  
sedlich@beuth -hochschule.de  
Stand:  24.11.2020


last chunk: character len:56 tokens:15 

--------------------- 

Modulhandbuch Bachelor Studiengang Mathematik  68 von 68


In [15]:
# generate embedding
naive_vectors_long = model_long.encode(
    content_long,
    show_progress_bar=True,
)
naive_vectors_long.shape # (n_chunks, embedding_dim)

Batches:   0%|          | 0/7 [00:00<?, ?it/s]

(197, 768)

### 2-2. Second Model (sentence-transformers/msmarco-distilbert-base-tas-b) using Semantic Chunking Method

In [16]:
model = model_long
max_tokens = MAX_TOKENS_LONG
tokenizer = tokenizer_long

def check_input_length(text):
    tokens = tokenizer.tokenize(text)
    return len(tokens)

def split_long_chunk_recursive(text, chunk, start_index, index):
    single_sentences_list = _split_sentences(text)
    if check_input_length(chunk) <= max_tokens:
        return [chunk]

    # A single sentence can be longer than the model's max sequence length.
    # It cannot be split further at sentence level, so it is dropped.
    if start_index == index:
        print("start_index = index")
        return []
    
    # Split at the middle sentence
    split_index = (index + start_index) // 2
      
    # Rebuild the two halves from the sentence list
    first_part = ' '.join(single_sentences_list[start_index:split_index+1])
    second_part = ' '.join(single_sentences_list[split_index+1:index+1])
    
    # Recursively split both halves if needed
    return split_long_chunk_recursive(text, first_part, start_index, split_index) + split_long_chunk_recursive(text, second_part, split_index+1, index)


# Main chunking function
def chunk_text(text):
    # Split the input text into individual sentences.
    single_sentences_list = _split_sentences(text)

    for single_sentence in single_sentences_list:
        if check_input_length(single_sentence) > max_tokens:
            print("single setence is bigger than max tokens")
    # Combine adjacent sentences to form a context window around each sentence.
    combined_sentences = _combine_sentences(single_sentences_list)
    
    # Convert the combined sentences into vector representations using a neural network model.
    embeddings = convert_to_vector(combined_sentences)
    
    # Calculate the cosine distances between consecutive combined sentence embeddings to measure similarity.
    distances = _calculate_cosine_distances(embeddings)
    
    # Determine the threshold distance for identifying breakpoints based on the 80th percentile of all distances.
    breakpoint_percentile_threshold = 80
    breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
    # Find all indices where the distance exceeds the calculated threshold, indicating a potential chunk breakpoint.
    indices_above_thresh = [i for i, distance in enumerate(distances) if distance > breakpoint_distance_threshold]
    # Initialize the list of chunks and a variable to track the start of the next chunk.
    chunks = []
    start_index = 0
    # Loop through the identified breakpoints and create chunks accordingly.
    for index in indices_above_thresh:
        chunk = ' '.join(single_sentences_list[start_index:index+1])
        if check_input_length(chunk) <= max_tokens:
            chunks.append(chunk)
        else:
            print(f"Chunk too long; splitting recursively...")
            long_chunks = split_long_chunk_recursive(text, chunk, start_index, index)
            chunks.extend(long_chunks)
        start_index = index + 1
    
    # If there are any sentences left after the last breakpoint, add them as the final chunk.
    # Note: this final chunk is not length-checked, so it may exceed max_tokens.
    if start_index < len(single_sentences_list):
        chunk = ' '.join(single_sentences_list[start_index:])
        chunks.append(chunk)
    
    # Return the list of text chunks.
    return chunks

def _split_sentences(text):
    # Use regular expressions to split the text into sentences based on punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.?!])\s+', text)
    return sentences
    
# Function to combine sentences for contextual embeddings
def _combine_sentences(sentences):
    # Create a buffer by combining each sentence with its previous and next sentence to provide a wider context.
    combined_sentences = []
    for i in range(len(sentences)):
        combined_sentence = sentences[i]
        if i > 0:
            combined_sentence = sentences[i-1] + ' ' + combined_sentence
        if i < len(sentences) - 1:
            combined_sentence += ' ' + sentences[i+1]
        combined_sentences.append(combined_sentence)
    return combined_sentences

def convert_to_vector(texts):
    # Try to generate embeddings for a list of texts using a pre-trained model and handle any exceptions.
    try:
        embeddings = model.encode(texts)
        return embeddings
    except Exception as e:
        print("An error occurred:", e)
        return np.array([])  # Return an empty array in case of an error

# Function to calculate cosine distances between embeddings
def _calculate_cosine_distances(embeddings):
    similarity_matrix = cosine_similarity(embeddings)
    
    consecutive_similarities = np.diag(similarity_matrix, k=1)
    
    # Calculate the distances (1 - similarity)
    distances = 1 - consecutive_similarities
    
    return distances


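# Main Section
# Note: this reuses content_short (the short model's naive chunks), not content_long,
# so both semantic runs start from the same joined text.
full_pdf = " ".join(content_short)

chunks_semantic_long = chunk_text(full_pdf)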

# Output the chunks
print(f"Number of semantic chunks: {len(chunks_semantic_long)}")
print(f"Number of naive chunks: {len(content_long)}")
print("\n---\n")
for chunk in chunks_semantic_long[:5]:  # Display first 5 semantic chunks
    print(chunk + "\n---\n")

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


Chunk too long; splitting recursively...

In [17]:
# To check if the length of input is bigger than the number of max tokens
for chunk in chunks_semantic_long:
    length = check_input_length_long(chunk)
    if length > model_long.max_seq_length:
        print(f"Chunk length {length} exceeds max length of {model_long.max_seq_length}")

In [18]:
print(f'first chunk: character len:{len(chunks_semantic_long[0])} tokens:{check_input_length_long(chunks_semantic_long[0])} \n')
print('--------------------- \n')
print(f'{chunks_semantic_long[0]}\n\n')
print(f'last chunk: character len:{len(chunks_semantic_long[-1])} tokens:{check_input_length_long(chunks_semantic_long[-1])} \n')
print('--------------------- \n')
print(chunks_semantic_long[-1])

first chunk: character len:95 tokens:26 

--------------------- 

Master Data Science  
Modulhandbuch  
 
 
 
 
Gesamtansprechpartner: Dekan des FBVI  
Prof. Dr.


last chunk: character len:233 tokens:70 

--------------------- 

Jähich: Topologie. Springer  
Weitere Hinweise  Dieses Modul wird auf Deutsch angeboten  und gehört zu den Profil richtung en „Computer Vision“  und „Industrielle Mathematik“ . Modulhandbuch Bachelor Studiengang Mathematik  68 von 68


In [19]:
# Generate Embeddings for the Chunks
semantic_vectors_long = model_long.encode(
    chunks_semantic_long,
    show_progress_bar=True,
)
semantic_vectors_long.shape

Batches:   0%|          | 0/9 [00:00<?, ?it/s]

(261, 768)

In [20]:
# Check the shapes of the four embedding matrices
print(f"Shape of Embedding vectors with naive chunking method (short model): {naive_vectors_short.shape}")
print(f"Shape of Embedding vectors with semantic chunking method (short model): {semantic_vectors_short.shape}")
print(f"Shape of Embedding vectors with naive chunking method (long model): {naive_vectors_long.shape}")
print(f"Shape of Embedding vectors with semantic chunking method (long model): {semantic_vectors_long.shape}")


Shape of Embedding vectors with naive chunking method (short model): (444, 384)
Shape of Embedding vectors with semantic chunking method (short model): (385, 384)
Shape of Embedding vectors with naive chunking method (long model): (197, 768)
Shape of Embedding vectors with semantic chunking method (long model): (261, 768)


## Step 4 - Upload the Embeddings to Qdrant

In [21]:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

qdrant_host = 'http://dsm.bht-berlin.de:6333'  # Qdrant REST endpoint; the web dashboard is at http://dsm.bht-berlin.de:6333/dashboard
port = 6333                     # standard port for Qdrant

# Initialize your Qdrant client
client = QdrantClient(url=qdrant_host, port=port)


# Helper function to upload data to Qdrant
def upload_to_Qdrant(content_vector, content_text, collection_name):

    # Prepare the payload (metadata) for each text chunk
    payload = list(map(lambda text: {"text": text}, content_text))

    # Recreate the Qdrant collection with the vector size and dot-product similarity (Distance.DOT)
    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=content_vector.shape[1], distance=Distance.DOT),
    )

    # Upload vectors and payloads to Qdrant
    client.upload_collection(
        collection_name=collection_name,
        vectors=content_vector,
        payload=payload,
        ids=None  # Vector IDs will be assigned automatically
    )

    print(f"Embeddings uploaded successfully to Qdrant for collection: {collection_name}")

# Upload data with naive chunking method (short model)
upload_to_Qdrant(naive_vectors_short, content_short, "Basar_Soul_short_model_naive_method")

# Upload data with semantic chunking method (short model)
upload_to_Qdrant(semantic_vectors_short, chunks_semantic_short, "Basar_Soul_short_model_semantic_method")

# Upload data with naive chunking method (long model)
upload_to_Qdrant(naive_vectors_long, content_long, "Basar_Soul_long_model_naive_method")

# Upload data with semantic chunking method (long model)
upload_to_Qdrant(semantic_vectors_long, chunks_semantic_long, "Basar_Soul_long_model_semantic_method")



Embeddings uploaded successfully to Qdrant for collection: Basar_Soul_short_model_naive_method
Embeddings uploaded successfully to Qdrant for collection: Basar_Soul_short_model_semantic_method
Embeddings uploaded successfully to Qdrant for collection: Basar_Soul_long_model_naive_method
Embeddings uploaded successfully to Qdrant for collection: Basar_Soul_long_model_semantic_method


In [None]:
### Use the same model for retrieval as for indexing; the model names must match for a fair comparison

# Comparison of Embedding Models and Chunking Strategies in Vector Search with Qdrant

Here we compare different chunking techniques and embedding models for text-based vector searches using Qdrant. Specifically, we utilize two sentence transformers: a short 256-token model (all-MiniLM-L6-v2) and a longer 512-token model (msmarco-distilbert-base-tas-b). Each model is evaluated using two text chunking methods:

**Naive Chunking Method:** This method splits the text into chunks of bounded size without considering semantic coherence. We use the RecursiveCharacterTextSplitter, which prefers to break at paragraph, line, and word boundaries, with a chunk size of 200 tokens and an overlap of 20 tokens for the short model (500 and 50 for the long model); the overlap maintains context between neighboring chunks.
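
To make the size and overlap parameters concrete, here is a simplified sketch of fixed-size chunking over a token list. It is an illustration only: `window_chunks` is a hypothetical helper, and the RecursiveCharacterTextSplitter used in this project additionally tries to break at separators rather than at exact token offsets.

```python
from typing import List, Sequence


def window_chunks(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Cut tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy example: 10 tokens, windows of 4 with 1 token of overlap.
demo_windows = window_chunks(list("abcdefghij"), chunk_size=4, overlap=1)
```

With a chunk size of 200 and an overlap of 20, each window in this simplified scheme would advance by 180 tokens.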

**Semantic Chunking Method:** This method splits the text into semantically coherent chunks by analyzing the content's structure and meaning. The process involves:

- Sentence Splitting: The text is first split into individual sentences using regular expressions.
- Contextual Sentence Combining: Each sentence is combined with its neighboring sentences to provide additional context, forming combined sentences.
- Embedding Generation: The combined sentences are converted into vector representations using the embedding model.
- Cosine Distance Calculation: Cosine distances between consecutive combined sentence embeddings are calculated to measure semantic similarity.
- Breakpoint Determination: Breakpoints are identified where the cosine distance exceeds a certain threshold (e.g., the 80th percentile of all distances), indicating potential chunk boundaries.
- Recursive Splitting: If a chunk exceeds the model's maximum token length, it is recursively halved at its middle sentence until every part fits; a single sentence that is still too long is dropped.
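
The distance and breakpoint steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the notebook's exact code: `find_breakpoints` is a hypothetical name, no embedding row is all zeros, and only consecutive pairs are compared instead of building the full similarity matrix.

```python
from typing import List

import numpy as np


def find_breakpoints(embeddings: np.ndarray, percentile: float = 80.0) -> List[int]:
    """Return indices i where the cosine distance between row i and row i+1
    exceeds the given percentile of all consecutive distances."""
    # Normalize rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    consecutive_similarities = np.sum(normed[:-1] * normed[1:], axis=1)
    distances = 1.0 - consecutive_similarities
    threshold = np.percentile(distances, percentile)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy embeddings: the topic changes between the second and third "sentence".
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
demo_breakpoints = find_breakpoints(toy)
```

A breakpoint at index `i` means that the chunk ends with sentence `i` and the next chunk starts at sentence `i + 1`.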


For each combination of embedding model and chunking method, we embed the data and upload it to Qdrant. We then perform vector searches for given queries and analyze the search results based on the relevance scores returned.
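
As a rough picture of what the search step returns, the sketch below ranks vectors by dot product with the query (the collections are created with `Distance.DOT`) and keeps at most `limit` hits whose score reaches `score_threshold`. It is a brute-force stand-in written for illustration, not Qdrant's implementation; `top_k_dot` is a hypothetical helper.

```python
from typing import List, Tuple

import numpy as np


def top_k_dot(
    query: np.ndarray,
    vectors: np.ndarray,
    limit: int = 5,
    score_threshold: float = 0.5,
) -> List[Tuple[int, float]]:
    """Return (row index, score) pairs sorted by descending dot product."""
    scores = vectors @ query
    order = np.argsort(-scores)[:limit]  # best `limit` candidates first
    # Drop candidates whose score is below the threshold.
    return [(int(i), float(scores[i])) for i in order if scores[i] >= score_threshold]


# Toy data: three stored vectors and a query aligned with the first one.
stored = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
demo_hits = top_k_dot(np.array([1.0, 0.0]), stored)
```

For embeddings of unit length, the dot product equals the cosine similarity, so a threshold of 0.5 then has the same meaning under both metrics.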

The goal is to compare the performance of different embedding models and chunking strategies by evaluating the search results and their respective relevance scores.

### Naive chunking - short model (256 tokens)

In [22]:
question = "What is learning optimization about?"

# Encode the query using the same embedding model used for the naive chunks
question_vectorized = model_short.encode(question)

# Perform the search in Qdrant using the naive chunking method
search_results = client.search(
    collection_name="Basar_Soul_short_model_naive_method",  # The collection name for naive chunks with the short model
    query_vector=question_vectorized,
    query_filter=None,  # No additional filters
    limit=5,  # Return the top 5 most relevant results
    score_threshold=0.5,  # Only consider results with a score of 0.5 or higher
)

# Check if any results meet the score threshold
if not search_results or all(result.score < 0.5 for result in search_results):
    print("Could not answer the question")
else:
    # Process and print the search results
    for result in search_results:
        print(f"Document: {result.payload['text']}\nScore: {result.score}\n---")


Document: anzuwenden. Die Unterschiede zwischen deterministischer und 
stochasitscher optimierung und die Wic htigkeit von robusten Lösungen 
werden klar. In eigener Implementierung werden Lernkonzepte auf 
Optimierung angewendet und state -of-the-art Werkzeuge verwendet.  
Voraussetzungen  Empfehlung: Machine Learning I  
Niveaustufe  3. Studienplansemester  
Lehr- und Lernform  Seminaristischer Unterricht und Übung am Rechner  
Status  Wahlpflichtmodul  
Häufigkeit des Angebotes  nach Bedarf
Score: 0.5195446
---
Document: 36 
 Datenfeld  Erklärung  
Modulnummer  WP06  
Titel  Learning Optimization / Learning Optimization  
Leistungspunkte  5 LP 
Workload  4 SWS Ü  
(68 h Präsenz / 82 h Selbststudium)  
Lerngebiet  Fachspezifische Vertiefung  
Lernziele / Kompetenzen  Die Studierenden lernen die wichtigsten Optimierungsmodelle und –
methoden kennen. Sie werden damit in die Lage versetzt, die 
passende Lösung für die praktische Anwendung auszuwählen und
Score: 0.51228446
---


### Semantic chunking - short model (256 tokens)

In [23]:
question = "What is learning optimization about?"

# Encode the query using the same embedding model used for the semantic chunks
question_vectorized = model_short.encode(question)

# Perform the search in Qdrant using the semantic chunking method
search_results = client.search(
    collection_name="Basar_Soul_short_model_semantic_method",  # The collection name for semantic chunks with the short model
    query_vector=question_vectorized,
    query_filter=None,  
    limit=5,  
    score_threshold=0.5,  # Filter out results with a score less than 0.5
)

# Check if results are below the score threshold
if not search_results or all(result.score < 0.5 for result in search_results):
    print("Could not answer the question")
else:
    for result in search_results:
        print(f"Document: {result.payload['text']}\nScore: {result.score}\n---")


Could not answer the question


### Naive chunking - long model (512 tokens)

In [24]:
question = "What is learning optimization about?"

# Encode the query using the long embedding model
question_vectorized = model_long.encode(question)

# Perform the search in Qdrant using the naive chunking method with the long model
search_results = client.search(
    collection_name="Basar_Soul_long_model_naive_method",  # The collection name for naive chunks with the long model
    query_vector=question_vectorized,
    query_filter=None,  
    limit=5,  
    score_threshold=0.5,  # Filter out results with a score less than 0.5
)

# Check if results are below the score threshold
if not search_results or all(result.score < 0.5 for result in search_results):
    print("Could not answer the question")
else:
    for result in search_results:
        print(f"Document: {result.payload['text']}\nScore: {result.score}\n---")


Document: 36 
 Datenfeld  Erklärung  
Modulnummer  WP06  
Titel  Learning Optimization / Learning Optimization  
Leistungspunkte  5 LP 
Workload  4 SWS Ü  
(68 h Präsenz / 82 h Selbststudium)  
Lerngebiet  Fachspezifische Vertiefung  
Lernziele / Kompetenzen  Die Studierenden lernen die wichtigsten Optimierungsmodelle und –
methoden kennen. Sie werden damit in die Lage versetzt, die 
passende Lösung für die praktische Anwendung auszuwählen und 
anzuwenden. Die Unterschiede zwischen deterministischer und 
stochasitscher optimierung und die Wic htigkeit von robusten Lösungen 
werden klar. In eigener Implementierung werden Lernkonzepte auf 
Optimierung angewendet und state -of-the-art Werkzeuge verwendet.  
Voraussetzungen  Empfehlung: Machine Learning I  
Niveaustufe  3. Studienplansemester  
Lehr- und Lernform  Seminaristischer Unterricht und Übung am Rechner  
Status  Wahlpflichtmodul  
Häufigkeit des Angebotes  nach Bedarf  
Prüfungsform / 
Voraussetzung für die 
Vergabe von 
Leistung

### Semantic chunking - long model (512 tokens)

In [25]:
question = "What is learning optimization about?"

# Encode the query using the long embedding model
question_vectorized = model_long.encode(question)

# Perform the search in Qdrant using the semantic chunking method with the long model
search_results = client.search(
    collection_name="Basar_Soul_long_model_semantic_method",  # The collection name for semantic chunks with the long model
    query_vector=question_vectorized,
    query_filter=None,
    limit=5,
    score_threshold=0.5,  # Filter out results with a score less than 0.5
)

# Check if results are below the score threshold
if not search_results or all(result.score < 0.5 for result in search_results):
    print("Could not answer the question")
else:
    for result in search_results:
        print(f"Document: {result.payload['text']}\nScore: {result.score}\n---")


Document: Machine Learning plus Intelligent Optimization, LIONlab, University of Trento, Italy, 2014. Dorigo, M., Stützle, T.: Ant Colony Optimization, MIT Press, 2004.
Score: 96.81333
---
Document: Biessmann  31 
WP03  Advances in Machine Learning  Prof. Gers  33 
WP04  Learning from Images  Prof. Hildebrand  34 
WP05  Stichprobenverfahren und Versuchsplanung  Prof. Grömping  35 
WP06  Learning Optimization  Prof. Winter  36 3 
 Datenfeld  Erklärung  
Modulnummer  M01 
Titel  Mathematische Modelle / Mathematical Models  
Leistungspunkte  5 LP 
Workload  4 SWS SU  
(68 h Präsenz / 82 h Selbststudium)  
Verwendbarkeit  Eigener Studiengang, Anerkennung für andere Studiengänge gemäß 
Rahmenstudien - und -prüfungsordnung  
Lerngebiet  Fachspezifische Vertiefung  
Lernziele / Kompetenzen  Die mathematischen Kenntnisse der Studierenden in den Gebieten der 
linearen Algebra, Analysis reeller Funktionen und der linearen Algebra, Analysis reeller Funktionen und der 
Wahrscheinlichkeitsrechnung 

# Summary of Findings
### Naive Chunking vs. Semantic Chunking:

**Short Model (256 tokens, 384 dimensions):**
- Naive chunking returned relevant documents with moderate scores (~0.51).
- Semantic chunking did not retrieve any documents above the score threshold.

**Long Model (512 tokens, 768 dimensions):**
- Naive chunking returned highly relevant documents with high scores (>90).
- Semantic chunking provided the highest scores, indicating better relevance and semantic alignment.

### Effectiveness of Embedding Models:

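- The long embedding model returned much higher raw relevance scores than the short model (above 90 versus about 0.5). Scores above 1 cannot be cosine similarities, so the two models' scores are on different scales, and the gap in raw scores is not by itself a direct measure of retrieval quality.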
- The ability to handle longer sequences allowed for better context capture, especially when combined with semantic chunking.

### Impact of Chunking Strategies:

**Semantic Chunking:**
- More effective with the long model due to its capacity to process longer, semantically rich chunks.
- Returned no results above the 0.5 score threshold with the short model, likely due to its 256-token input limit.

**Naive Chunking:**
- Provided acceptable results but may lack semantic coherence compared to semantic chunking.


### Conclusion

The combination of semantic chunking and the long embedding model (msmarco-distilbert-base-tas-b) yielded the most relevant search results. This suggests that:

- Longer embedding models enhance the ability to capture semantic relationships within the text, especially when dealing with longer or more complex documents.
- Semantic chunking is more effective when the embedding model can handle longer input sequences, as it preserves the semantic integrity of the content.