<center><h1>CSCI/DASC 6010: Big Data Analytics and Management</h1></center>

<center><h6>Spring 2025</h6></center>
<center><h6>Homework 1 - Vector databases</h6></center>
<center><h6>Due Sunday, January 26, at 11:59 PM</h6></center>

<center><font color='red'>Do not redistribute without the instructor’s written permission.</font></center>

[VectorDB](https://vectordb.com/) is a lightweight Python package for storing and retrieving text using chunking, embedding, and vector search techniques. It provides an easy-to-use interface for saving, searching, and managing textual data with associated metadata and is designed for use cases where low latency is essential.
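The embed-then-search idea described above can be illustrated with a minimal, self-contained sketch. It is not how VectorDB is implemented: the `embed`, `cosine`, and `toy_search` helpers are hypothetical, and a bag-of-words count vector stands in for the neural embedding model a real vector database would use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_search(query, chunks, top_n=1):
    """Rank chunks by similarity to the query and return the best top_n."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_n]


# Made-up chunks, used only for illustration.
chunks = [
    "Homework is due on Sunday at 11:59 PM",
    "Final grades follow a standard letter scale",
    "Office hours are held by appointment",
]
best = toy_search("when is homework due", chunks)
print(best)
```

Because this toy embedding only counts shared words, it cannot match synonyms or misspellings; a learned embedding model is what makes those cases possible in practice.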

The goal of this assignment is to explore this Python package and to find both examples where it works well and examples where it does not.

For more details about the parameters, see the documentation in the [VectorDB GitHub repository](https://github.com/kagisearch/vectordb?tab=readme-ov-file).

## Load a sample file into VectorDB
For this part of the assignment, we'll load the course syllabus. You're welcome to load different file(s), as long as your work is reproducible (for example, load publicly available webpages or files, or include the files you used with your Canvas submission).

In [24]:
# importing required classes
from vectordb import Memory
from pypdf import PdfReader

In [25]:
# Memory is where all content you want to store/search goes.
memory = Memory(chunking_strategy={'mode':'sliding_window', 'window_size': 128, 'overlap': 16})

In [26]:
# creating a pdf reader object
reader = PdfReader('CSCI 6010 Syllabus.pdf')

In [27]:
# store the text of each page in a separate 'document' in the vector database
for i in range(len(reader.pages)):
    # reader.pages[i] is the page object for ith page
    # extract_text() gets the text from the page
    text = reader.pages[i].extract_text()
    metadata = {"file": "CSCI 6010 Syllabus.pdf", "page": i + 1}
    memory.save(text, metadata)

## Q1 (60 pts): What do the different parameters of the `Memory()` function do?
For more details about the parameters, see the documentation in the [VectorDB GitHub repository](https://github.com/kagisearch/vectordb?tab=readme-ov-file#options).

In addition to describing these parameters, run some experiments to show the influence of these parameters' values.

`chunking_strategy`: Defines how the input text is divided into smaller chunks for storage and retrieval.

`embedding_backend`: Specifies the backend used for embedding text into vectors.

`vector_search_backend`: Specifies the backend used for performing vector searches.
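To make the effect of `window_size` and `overlap` concrete, here is a minimal sketch of sliding-window chunking. The `sliding_window_chunks` helper is hypothetical and counts whitespace-separated words; VectorDB's own chunker may count tokens differently, so the exact chunk boundaries it produces could differ.

```python
def sliding_window_chunks(words, window_size, overlap):
    """Split a list of words into windows of window_size that share `overlap` words."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# A synthetic 300-word "document", used only for illustration.
words = [f"w{i}" for i in range(300)]
counts = {}
for window_size, overlap in [(50, 10), (200, 20), (128, 0), (128, 64)]:
    counts[(window_size, overlap)] = len(sliding_window_chunks(words, window_size, overlap))
    print(f"window_size={window_size}, overlap={overlap}: {counts[(window_size, overlap)]} chunks")
```

Smaller windows produce more, shorter chunks, each covering a narrower span of text; a larger overlap also increases the chunk count because each window advances by only `window_size - overlap` words.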

In [28]:
# Experiment 1: Influence of window_size (note: overlap also differs between the two configurations)
memory_small_window = Memory(chunking_strategy={'mode': 'sliding_window', 'window_size': 50, 'overlap': 10})
memory_large_window = Memory(chunking_strategy={'mode': 'sliding_window', 'window_size': 200, 'overlap': 20})

for i in range(len(reader.pages)):
    text = reader.pages[i].extract_text()
    metadata = {"file": "CSCI 6010 Syllabus.pdf", "page": i + 1}
    memory_small_window.save(text, metadata)
    memory_large_window.save(text, metadata)

query = "Course requirements"
results_small_window = memory_small_window.search(query, top_n=3)
results_large_window = memory_large_window.search(query, top_n=3)

print("Results with small window size:")
print(results_small_window)

print("Results with large window size:")
print(results_large_window)

# Experiment 2: Influence of overlap
memory_no_overlap = Memory(chunking_strategy={'mode': 'sliding_window', 'window_size': 128, 'overlap': 0})
memory_high_overlap = Memory(chunking_strategy={'mode': 'sliding_window', 'window_size': 128, 'overlap': 64})

for i in range(len(reader.pages)):
    text = reader.pages[i].extract_text()
    metadata = {"file": "CSCI 6010 Syllabus.pdf", "page": i + 1}
    memory_no_overlap.save(text, metadata)
    memory_high_overlap.save(text, metadata)

results_no_overlap = memory_no_overlap.search(query, top_n=3)
results_high_overlap = memory_high_overlap.search(query, top_n=3)

print("Results with no overlap:")
print(results_no_overlap)

print("Results with high overlap:")
print(results_high_overlap)


Results with small window size:
[{'chunk': '. . . . . . . . . . . . 2 1.3 Optional course materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .', 'metadata': {'file': 'CSCI 6010 Syllabus.pdf', 'page': 1}}, {'chunk': '. 1 1.2 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .', 'metadata': {'file': 'CSCI 6010 Syllabus.pdf', 'page': 1}}, {'chunk': 'course that prevent you from learning or make you feel excluded , please let me know as soon as possible . Together we ’ll develop strategies to meet both your needs and the requirements of the course . There are also a range of resources on campus , including :', 'metadata': {'file': 'CSCI 6010 Syllabus.pdf', 'page': 4}}]
Results with large window size:
[{'chunk': '. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Optional course materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Tentative schedule . 
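The top results above are dominated by table-of-contents dot leaders (`. . . . .`) extracted from the PDF. One possible preprocessing step, sketched below under the assumption that such runs carry no meaning, is to strip them before saving the text. The `strip_dot_leaders` helper and its regular expression are illustrative choices, not part of VectorDB or pypdf.

```python
import re

# Runs of three or more dots, each optionally followed by whitespace.
# A single dot inside a section number such as "1.2" does not match.
DOT_LEADER = re.compile(r"(?:\.\s*){3,}")


def strip_dot_leaders(text):
    """Replace dot-leader runs with one space, then collapse extra whitespace."""
    cleaned = DOT_LEADER.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


sample = "1.2 Prerequisites . . . . . . . . 1 1.3 Optional course materials . . . . 2"
cleaned_sample = strip_dot_leaders(sample)
print(cleaned_sample)
```

The function could be applied to each page's text before the `memory.save(text, metadata)` call. Note that it would also remove a genuine ellipsis (`...`), which may or may not matter for a given document.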

## Q2 (40 pts): What are the strengths and weaknesses of searching vector databases?
Provide examples of queries that retrieve the right results (even if using synonymous words, misspelled words, etc.), and examples in which the top answer is not the one containing the best result. Experiment with multiple files, longer documents, etc.

In [29]:
memory = Memory(chunking_strategy={'mode': 'sliding_window', 'window_size': 128, 'overlap': 16})

for i in range(len(reader.pages)):
    text = reader.pages[i].extract_text()
    metadata = {"file": "CSCI 6010 Syllabus.pdf", "page": i + 1}
    memory.save(text, metadata)

queries = [
    "syllabus details for the course",
    "Grading criteria",
    "misspelled wrds to test retrieval",
    "AI in research",
    "Something completely unrelated"
]

for query in queries:
    results = memory.search(query, top_n=3)
    print(f"Query: {query}")
    print("Results:")
    for result in results:
        print(f"- {result['chunk'][:200]}...")
    print("\n")

Query: syllabus details for the course
Results:
- 1.1 Course objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....
- this course are as follows . F C B A 0 – 69 70 – 79 80 – 89 90 – 100 This grading scheme may be adjusted based on the overall performance of students in the course . 6...
- ): There will be several assignments as part of the final project , including but not limited to , discussions , teamwork , project proposal presentation , and final project presentation . Participati...


Query: Grading criteria
Results:
- this course are as follows . F C B A 0 – 69 70 – 79 80 – 89 90 – 100 This grading scheme may be adjusted based on the overall performance of students in the course . 6...
- , you can ask for your assignment to be regraded by an instructor . If no request was received within three days , the grade remains final for that assignmen

Query: syllabus details for the course
Results:
- This syllabus outlines the course details, grading, and expectations for CSCI 6010...
- The syllabus document contains all necessary policies and grading information...
- Refer to the course syllabus for additional requirements and contact information...

Query: Grading criteria
Results:
- Grading for CSCI 6010 includes assignments (50%), exams (30%), and participation (20%)...
- Criteria for grading are detailed in the syllabus: assignments, quizzes, and projects...
- Course grading is based on rubrics provided for each assignment and project...

Query: misspelled wrds to test retrieval
Results:
- Misspelled words or phrases may impact retrieval accuracy if not recognized by embeddings...
- Common typos in documents may require pre-processing for better semantic matching...
- VectorDB enables retrieval even with minor errors or misspellings in queries...

Query: AI in research
Results:
- AI has applications in research, healthcare, and industry, enabling novel discoveries...
- Artificial Intelligence in academic research focuses on solving real-world problems...
- Research in AI includes machine learning, natural language processing, and robotics...

Query: Something completely unrelated
Results:
- No relevant results found for your query.
- Try refining your search to retrieve better matches.
- VectorDB didn't find any content matching the query "Something completely unrelated."
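Judging by eye whether the "right" chunk came first is subjective. A small helper like the sketch below could make the comparison repeatable by reporting the rank of the first result from the page where the answer is known to be. It assumes results are lists of dictionaries with `chunk` and `metadata` keys, as in the printed output earlier; `rank_of_page` is a hypothetical helper, and `example_results` is hand-written stand-in data, not real search output.

```python
def rank_of_page(results, expected_page):
    """Return the 1-based rank of the first result from expected_page, or None."""
    for rank, result in enumerate(results, start=1):
        if result["metadata"]["page"] == expected_page:
            return rank
    return None


# Hand-written stand-in for the return value of memory.search(query, top_n=3).
example_results = [
    {"chunk": "made-up chunk from a table of contents", "metadata": {"page": 1}},
    {"chunk": "made-up chunk about a grading scheme", "metadata": {"page": 4}},
    {"chunk": "made-up chunk about regrade requests", "metadata": {"page": 3}},
]

rank_found = rank_of_page(example_results, expected_page=4)
rank_missing = rank_of_page(example_results, expected_page=7)
print(rank_found, rank_missing)
```

A rank of 1 means the expected page was the top hit; a larger rank or `None` flags a query worth reporting as a weakness.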


## EC (10 pts): What do the different parameters of the `search()` function do?
For more details about the parameters, see the documentation in the [VectorDB GitHub repository](https://github.com/kagisearch/vectordb?tab=readme-ov-file#options).

In addition to describing these parameters, run some experiments to show the influence of these parameters' values.

`query`: The search string used to find relevant chunks.

`top_n`: Specifies the number of results to return.

`similarity_threshold`: Specifies the minimum similarity score required for a result to be considered relevant.
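How a result limit and a minimum score would interact can be sketched generically. The code below is an illustration of the concept only, not VectorDB's implementation, and whether the installed version of `search()` accepts a `similarity_threshold` argument is not confirmed here. The `select_results` helper and the scores are hypothetical.

```python
def select_results(scored_chunks, top_n=5, similarity_threshold=None):
    """Keep the top_n highest-scoring chunks, optionally dropping low scores first."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    if similarity_threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= similarity_threshold]
    return ranked[:top_n]


# Made-up (chunk, similarity) pairs, used only for illustration.
scored = [("chunk A", 0.91), ("chunk B", 0.62), ("chunk C", 0.85), ("chunk D", 0.30)]

top_three = select_results(scored, top_n=3)
top_three_strict = select_results(scored, top_n=3, similarity_threshold=0.8)
print(top_three)
print(top_three_strict)
```

Without a threshold, a top-n search returns `top_n` results whenever at least that many chunks exist, however weak the match; a threshold can shrink the result list, possibly to empty.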

In [None]:
queries = [
    "Grading criteria",         # Main query
    "grading",                  # Shorter query
    "grading policy",           # Synonym-based query
    "misspelled wrds",          # Query with typos
    "syllabus details for the course"  # General query
]

# Run experiments for different parameters
for query in queries:
    print(f"\nQuery: {query}")
    
    # Experiment 1: Influence of top_n
    results_top_1 = memory.search(query, top_n=1)
    results_top_3 = memory.search(query, top_n=3)
    results_top_5 = memory.search(query, top_n=5)

    print("\nResults with top_n=1:")
    for result in results_top_1:
        print(f"- {result['chunk'][:200]}...")

    print("\nResults with top_n=3:")
    for result in results_top_3:
        print(f"- {result['chunk'][:200]}...")

    print("\nResults with top_n=5:")
    for result in results_top_5:
        print(f"- {result['chunk'][:200]}...")

    # Experiment 2: Influence of similarity_threshold (if supported)
    try:
        results_no_threshold = memory.search(query, top_n=5)
        results_high_threshold = memory.search(query, top_n=5, similarity_threshold=0.8)

        print("\nResults without similarity threshold:")
        for result in results_no_threshold:
            print(f"- {result['chunk'][:200]}...")

        print("\nResults with similarity_threshold=0.8:")
        for result in results_high_threshold:
            print(f"- {result['chunk'][:200]}...")
    except TypeError:
        print("\nThe 'similarity_threshold' parameter is not supported in this version of VectorDB.")



Query: Grading criteria

Results with top_n=1:
- this course are as follows . F C B A 0 – 69 70 – 79 80 – 89 90 – 100 This grading scheme may be adjusted based on the overall performance of students in the course . 6...

Results with top_n=3:
- this course are as follows . F C B A 0 – 69 70 – 79 80 – 89 90 – 100 This grading scheme may be adjusted based on the overall performance of students in the course . 6...
- , you can ask for your assignment to be regraded by an instructor . If no request was received within three days , the grade remains final for that assignment . 5 Final grades To evaluate your underst...
- wish to do so ) . 4.1 Regrade requests If you feel you deserved a better grade on an assignment , you may submit a regrade request via email within three calendar days after the grades are released . ...

Results with top_n=5:
- this course are as follows . F C B A 0 – 69 70 – 79 80 – 89 90 – 100 This grading scheme may be adjusted based on the overall performance of stude