# Overview

## Purpose
This project demonstrates how to create, save, load, and update a vector store built from embeddings of the text files in a corpus. It also shows how to replace text within a file, update the corresponding vector in the vector store, and verify the update. Each operation runs as its own small, modular step.

## Components
1. **FlexiAI Initialization**: Initialize the FlexiAI client and configure logging.
2. **Corpus Handling**: Read and manage text files from a specified directory.
3. **Embedding Management**: Generate, save, load, and map embeddings for the corpus.
4. **Text Processing**: Perform various text-based operations, including similarity search, clustering, semantic search, and question answering.
5. **Vector Store Update**: Replace specific text in a file, update the corresponding vector in the vector store, and verify the updates.
6. **Result Truncation**: Ensure output remains concise by truncating long text outputs for readability.

## Workflow

### Step 1: Initialization
- **Setup Logging**: Configure logging for tracking the process, with customizable levels for root, file, and console.
- **Initialize FlexiAI**: Create an instance of the `FlexiAI` class, initializing various managers including the `LocalVectorStoreManager`.
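
As a rough illustration of the logging step, the sketch below configures the root logger with optional file and console handlers using only the standard library. The parameter names mirror the `setup_logging` call used later in this notebook, but the function name `setup_logging_sketch`, the `log_file` parameter, and the log format are assumptions made for this example; the actual FlexiAI implementation may differ.

```python
import logging
import sys
from pathlib import Path


def setup_logging_sketch(
    root_level=logging.INFO,
    file_level=logging.INFO,
    console_level=logging.INFO,
    enable_file_logging=True,
    enable_console_logging=False,
    log_file="logs/app.log",  # hypothetical default path, not taken from FlexiAI
):
    """Attach optional file and console handlers to the root logger."""
    root = logging.getLogger()
    root.setLevel(root_level)

    # Drop handlers left over from earlier calls so that re-running a
    # notebook cell does not duplicate every log line.
    for handler in list(root.handlers):
        root.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    if enable_file_logging:
        Path(log_file).parent.mkdir(parents=True, exist_ok=True)
        file_handler = logging.FileHandler(log_file, encoding="utf-8")
        file_handler.setLevel(file_level)
        file_handler.setFormatter(formatter)
        root.addHandler(file_handler)

    if enable_console_logging:
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.setLevel(console_level)
        console_handler.setFormatter(formatter)
        root.addHandler(console_handler)

    return root
```

Keeping separate levels per handler makes it possible, for example, to write `INFO` messages to the file while showing only warnings on the console.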

### Step 2: Corpus Handling
- **Read Corpus**: Use the `read_corpus_from_directory` method to read all text files from the specified directory and store their contents and file paths.
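
A minimal sketch of what reading the corpus might look like, assuming the result is a list of `(file_path, content)` tuples, which is how the notebook unpacks it further down. The helper name `read_corpus` and the `*.txt` filter are assumptions for illustration; this is not the `read_corpus_from_directory` implementation.

```python
from pathlib import Path


def read_corpus(directory):
    """Return a list of (file_path, content) tuples for each .txt file in `directory`."""
    corpus = []
    # Sort the paths so that positions stay stable between runs; the vector
    # store metadata later maps positions to file paths.
    for path in sorted(Path(directory).glob("*.txt")):
        corpus.append((str(path), path.read_text(encoding="utf-8")))
    return corpus
```

The two parallel lists used in the rest of the workflow can then be derived with `file_paths = [p for p, _ in corpus]` and `texts = [c for _, c in corpus]`.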

### Step 3: Embedding Management
- **Create and Save Embeddings**: Generate embeddings for the read text files using the `create_embeddings_for_faiss` method, and save these embeddings into a FAISS index along with corresponding metadata.
- **Load and Map Vector Store**: Load the previously saved FAISS index and metadata using the `load_vector_store` method, and map the vector store for further operations.
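
The metadata saved with the index maps each vector position to the file it came from. How FlexiAI persists that mapping is not shown in this overview, so the sketch below is only one possible approach: a JSON sidecar file next to the index. The helper names and the `.meta.json` suffix are hypothetical. The point it illustrates is that JSON stores integer keys as strings, so they have to be converted back on load.

```python
import json
from pathlib import Path


def save_metadata(metadata, index_path):
    """Write the position -> file path mapping next to the index file."""
    sidecar = Path(str(index_path) + ".meta.json")  # hypothetical naming scheme
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar


def load_metadata(index_path):
    """Read the mapping back, restoring the integer keys that JSON turned into strings."""
    sidecar = Path(str(index_path) + ".meta.json")
    raw = json.loads(sidecar.read_text(encoding="utf-8"))
    return {int(position): file_path for position, file_path in raw.items()}
```

Without the `int(...)` conversion, a lookup such as `metadata[0]` would fail after a reload because the stored key would be `"0"`.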

### Step 4: Text Processing
- **Text Similarity Search**: Calculate the similarity between an input text and all texts in the corpus, then identify and print the most similar text. The output is truncated to 300 characters for readability.
- **Clustering Texts**: Cluster the texts into a specified number of clusters and print out each cluster. Each text in the clusters is truncated to 300 characters.
- **Semantic Search**: Perform a semantic search for a query and find the most relevant document in the corpus. The result is truncated to 300 characters.
- **Question Answering**: Find the sentence in the corpus that is most similar to a given question and print it as the answer. The output is truncated to 300 characters. Robust question answering needs more preprocessing and system work, but a simple alternative is to retrieve the most relevant sentence or passage and pass it to an AI model at run time, so that the model answers the question using your own data. In other words, reinforce the model with your knowledge.
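
The similarity search, semantic search, and question answering steps above all rank candidates by cosine similarity and truncate the winner for display. A dependency-free sketch of that shared logic is shown below. The function names are made up for this example, and the notebook itself computes the same quantity with NumPy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity to anything
    return dot / (norm_a * norm_b)


def most_similar(query_vec, vectors):
    """Return (position, score) of the stored vector closest to the query."""
    scores = [cosine_similarity(query_vec, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def truncate(text, limit=300):
    """Shorten long text for display and mark the cut with an ellipsis."""
    return text[:limit] + "..." if len(text) > limit else text
```

With embeddings for the corpus in `vectors`, `most_similar(query_vec, vectors)` gives the position of the best match, and `truncate(texts[position])` produces the 300-character preview.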

### Step 5: Vector Store Update
- **Replace Text in File**: Define the target file and text to be replaced. Use the `replace_text_in_file_and_update_vector_store` method to perform the replacement and update the corresponding vector in the vector store.
- **Save Updated Vector Store**: Save the updated vector store to a specified location.
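
To make the update step concrete, the sketch below replaces text in a file and refreshes the vector stored for that file. It is an illustration under simplifying assumptions, not the `replace_text_in_file_and_update_vector_store` implementation: a plain Python list stands in for the index, `metadata` maps list positions to file paths, and `embed` is any callable that turns text into a list of floats. All of these names are hypothetical.

```python
from pathlib import Path


def replace_and_reembed(file_path, old_text, new_text, vectors, metadata, embed):
    """Replace `old_text` in the file and overwrite the vector stored for that file.

    Returns the position of the updated vector.
    """
    path = Path(file_path)
    content = path.read_text(encoding="utf-8")
    if old_text not in content:
        raise ValueError(f"{old_text!r} not found in {file_path}")

    updated = content.replace(old_text, new_text)
    path.write_text(updated, encoding="utf-8")

    # Find which stored vector belongs to this file.
    position = next((i for i, p in metadata.items() if Path(p) == path), None)
    if position is None:
        raise KeyError(f"{file_path} is not tracked in the metadata")

    # Re-embed the whole updated document and replace the stale vector.
    vectors[position] = embed(updated)
    return position
```

Failing early when the old text is missing avoids silently re-embedding an unchanged file.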

### Step 6: Search and Verification
- **Search for Old Text**: Use the `search_for_text_in_vector_store` method to search for the old text in the updated vector store.
- **Verify Replacement**: Check if the old text is still present in the vector store and print the result to confirm successful replacement.
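
The verification compares a search score against a threshold, so it matters whether that score is a similarity (higher means closer) or a distance (lower means closer). This overview does not say which one `search_for_text_in_vector_store` returns. As a hedged illustration: FAISS `IndexFlatL2` reports squared L2 distances, and for unit-length vectors a squared distance `d2` corresponds to a cosine similarity of `1 - d2 / 2`. The sketch below applies that conversion in pure Python; the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_from_squared_l2(d2):
    """For unit vectors, ||a - b||^2 = 2 - 2*cos, hence cos = 1 - d2 / 2."""
    return 1.0 - d2 / 2.0


def old_text_still_present(old_vec, stored_vectors, threshold=0.9):
    """Return (found, best_similarity) for the old text's embedding."""
    query = normalize(old_vec)
    best = max(
        (similarity_from_squared_l2(squared_l2(query, normalize(v))) for v in stored_vectors),
        default=-1.0,  # an empty store cannot contain the old text
    )
    return best >= threshold, best
```

If the search already returns cosine similarities, the threshold comparison can be applied directly and no conversion is needed.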

## Use Cases
1. **Reading Text Files**: Ensure that all text files in a specified directory are correctly read and stored.
2. **Creating and Saving Embeddings**: Verify that embeddings are correctly created for the text files and saved into a FAISS index.
3. **Loading and Mapping Vector Store**: Test the ability to load a previously saved FAISS index and metadata, and map the vector store for further operations.
4. **Text Similarity and Clustering**: Validate the functionality to perform text similarity searches and clustering, ensuring that results are truncated for readability.
5. **Semantic Search and Question Answering**: Ensure the system can perform semantic searches and answer questions based on the corpus, with outputs truncated for readability.
6. **Text Replacement in File**: Validate the ability to replace specific text within a file and update the corresponding vector in the vector store.
7. **Searching and Verifying Text in Vector Store**: Confirm the ability to search for specific text embeddings and verify the absence of replaced text in the updated vector store.


Setup Directory

In [1]:
import sys
import os

# Check current working directory
current_dir = os.getcwd()
print(f"Current Directory: {current_dir}")

# Change to your project root directory
project_root = '/home/razvansavin/Proiecte/flexiai'
os.chdir(project_root)
print(f"Changed Directory to: {os.getcwd()}")

# Add project root directory to sys.path
sys.path.append(project_root)
print("Project root added to sys.path")

Current Directory: /home/razvansavin/Proiecte/flexiai/examples/Code examples
Changed Directory to: /home/razvansavin/Proiecte/flexiai
Project root added to sys.path


In [3]:
%pip install nltk
%pip install faiss-cpu

import nltk
nltk.download('punkt')

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package punkt to
[nltk_data]     /home/razvansavin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
import nltk
print(nltk.__version__)
import faiss
print(faiss.__version__)


3.8.1
1.8.0


Initializing FlexiAI Client, Logging, and Setting Up Paths and Parameters

In [3]:
import sys
import os
import logging
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from flexiai.core.flexiai_client import FlexiAI
from flexiai.config.logging_config import setup_logging

# Initialize logging and FlexiAI client
setup_logging(
    root_level=logging.INFO,
    file_level=logging.INFO,
    console_level=logging.INFO,
    enable_file_logging=True,
    enable_console_logging=False,
)
flexiai = FlexiAI()
logger = flexiai.logger
local_vector_store_manager = flexiai.local_vector_store_manager

# Define paths and initial parameters
corpus_directory = 'user_flexiai_rag/data/corpus'
vector_store_path = "user_flexiai_rag/data/vectors_store/vector_store.index"
target_file = 'user_flexiai_rag/data/corpus/machine_learning.txt'
query_text = 'I love cooking and traveling.'
new_text = 'Machine learning is a subset of artificial intelligence focused on building systems that can learn from historical data, identify patterns, and make logical decisions with little to no human intervention.'
save_path = "user_flexiai_rag/data/vectors_store/updated_vector_store_after_replacement.index"



Current working directory: /home/razvansavin/Proiecte/flexiai
Log directory '/home/razvansavin/Proiecte/flexiai/logs' created/exists.


Step 1: Read the Corpus

In [4]:
# Read all text files from the corpus directory and extract their content and file paths.
corpus = local_vector_store_manager.read_corpus_from_directory(corpus_directory)
texts = [content for _, content in corpus]
file_paths = [file_path for file_path, _ in corpus]
print()





Step 2: Create and Save Embeddings

In [5]:
# Generate embeddings for each text and save them in a vector store.
index, successful_texts = local_vector_store_manager.embedding_manager.create_embeddings_for_faiss(
    texts, model="text-embedding-ada-002", chunk_size=1000
)
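# Map each vector position to its source file. This assumes every text was embedded
# successfully; if any text was skipped, positions and file paths would no longer line up.
metadata = {i: file_paths[i] for i in range(len(successful_texts))}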
local_vector_store_manager.save_vector_store(index, vector_store_path, metadata)
print(100*'=')




Step 3: Load and Map Vector Store

In [6]:
# Load the previously saved vector store and map it for future use.
loaded_index, loaded_metadata = local_vector_store_manager.load_vector_store(vector_store_path)
local_vector_store_manager.map_vector_store(loaded_index)
print(100*'=')

Vector 0: [-0.00994757  0.00371048  0.01932266 -0.01129005  0.00739952]...
Vector 1: [-0.01725795  0.00876854  0.01848795  0.00226284  0.00612579]...
Vector 2: [-0.01158029  0.00336338  0.0123597  -0.02616081 -0.0063383 ]...
Vector 3: [-0.00664699 -0.00770192  0.00927453 -0.01479309 -0.00318138]...
Vector 4: [ 5.0776620e-05 -6.4961184e-03  1.1508176e-02 -2.6941761e-02
 -1.2037743e-02]...
Vector 5: [ 0.00940533 -0.00943127 -0.00508456 -0.01683684 -0.02007872]...


Step 4: Text Similarity and Search

In [7]:
# Calculate the similarity between the input text and all texts in the corpus.
input_text = "The scientific study of probability is a modern development of mathematics."
input_embedding = local_vector_store_manager.embedding_manager.create_embeddings(input_text)
embeddings = [local_vector_store_manager.embedding_manager.create_embeddings(text) for text in texts]
similarities = [np.dot(input_embedding, emb) / (np.linalg.norm(input_embedding) * np.linalg.norm(emb)) for emb in embeddings]
most_similar_index = np.argmax(similarities)
print("\n--- Text Similarity and Search ---")
most_similar_text = texts[most_similar_index]
# Truncate the output to 300 characters
truncated_text = most_similar_text[:300] + '...' if len(most_similar_text) > 300 else most_similar_text
print(f"Most similar text: {truncated_text}")
print(100*'=')



--- Text Similarity and Search ---
Most similar text: https://en.wikipedia.org/wiki/Probability

Probability is a numerical description of how likely an event is to occur or how likely it is that a proposition is true.  Probability is a number between 0 and 1, where, roughly  speaking, 0 indicates impossibility and 1 indicates certainty. The higher the...


Step 5: Clustering Texts

In [8]:
# Cluster the texts into a specified number of clusters and print out each cluster.
embeddings = [local_vector_store_manager.embedding_manager.create_embeddings(text) for text in texts]
scaler = StandardScaler()
scaled_embeddings = scaler.fit_transform(embeddings)
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(scaled_embeddings)
print("\n--- Clustering Texts ---")
print(100*'=')
for i in range(3):
    cluster_texts = [texts[j] for j in range(len(texts)) if clusters[j] == i]
    print(f"Cluster {i}:")
    for text in cluster_texts:
        # Truncate each text to 300 characters
        truncated_text = text[:300] + '...' if len(text) > 300 else text
        print(f" - {truncated_text}")
    print(100*'=')



--- Clustering Texts ---
Cluster 0:
 - https://en.wikipedia.org/wiki/Neural_network

Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being progr...
 - https://en.wikipedia.org/wiki/Machine_learning

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training ...
 - https://en.wikipedia.org/wiki/Artificial_intelligence

In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals. Leading AI textbooks define the field as...
 - https://en.wikipedia.org/wiki/Probabili

Step 6: Semantic Search

In [9]:
# Perform a semantic search for the query and find the most relevant document.
query = "How do neurons connect in a neural network?"
query_embedding = local_vector_store_manager.embedding_manager.create_embeddings(query)
doc_embeddings = [local_vector_store_manager.embedding_manager.create_embeddings(doc) for doc in texts]
similarities = [np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb)) for emb in doc_embeddings]
most_relevant_index = np.argmax(similarities)
print("\n--- Semantic Search ---")
most_relevant_document = texts[most_relevant_index]
# Truncate the output to 300 characters
truncated_document = most_relevant_document[:300] + '...' if len(most_relevant_document) > 300 else most_relevant_document
print(f"Most relevant document: {truncated_document}")
print(100*'=')



--- Semantic Search ---
Most relevant document: https://en.wikipedia.org/wiki/Neural_network

Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being progr...


Step 7: Question Answering

In [10]:
# Define the question to be answered
question = "What are the types of supervised learning?"

# Step 1: Find the most relevant document in the corpus
context_embeddings = [local_vector_store_manager.embedding_manager.create_embeddings(doc) for doc in texts]
question_embedding = local_vector_store_manager.embedding_manager.create_embeddings(question)
similarities = [np.dot(question_embedding, emb) / (np.linalg.norm(question_embedding) * np.linalg.norm(emb)) for emb in context_embeddings]
most_relevant_doc_index = np.argmax(similarities)

# Step 2: Extract the relevant document
most_relevant_document = texts[most_relevant_doc_index]

# Step 3: Split the document into sentences
import nltk
nltk.download('punkt')
sentences = nltk.sent_tokenize(most_relevant_document)

# Step 4: Compute the similarity of each sentence with the question
sentence_embeddings = [local_vector_store_manager.embedding_manager.create_embeddings(sentence) for sentence in sentences]
sentence_similarities = [np.dot(question_embedding, emb) / (np.linalg.norm(question_embedding) * np.linalg.norm(emb)) for emb in sentence_embeddings]

# Step 5: Find the sentence with the highest similarity score
most_relevant_sentence_index = np.argmax(sentence_similarities)
most_relevant_sentence = sentences[most_relevant_sentence_index]

# Step 6: Truncate the output to 300 characters
truncated_answer = most_relevant_sentence[:300] + '...' if len(most_relevant_sentence) > 300 else most_relevant_sentence

# Print the answer
print("\n--- Question Answering ---")
print(f"Answer: {truncated_answer}")
print(100 * '=')


# If you want to build something similar at no cost, this project may help: https://github.com/SavinRazvan/questions



[nltk_data] Downloading package punkt to
[nltk_data]     /home/razvansavin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



--- Question Answering ---
Answer: Types of supervised learning algorithms include Active learning , classification and regression.


Step 8: Replace Text and Update Vector Store

In [11]:
# Replace specific text in a target file and update the vector store accordingly.
updated_index, new_embedding = local_vector_store_manager.replace_text_in_file_and_update_vector_store(
    target_file, query_text, new_text, loaded_index, loaded_metadata, save_path
)
print("\n--- Replacing Text and Updating Vector Store ---")
print("Text replacement and vector store update completed.")
print(100*'=')


Old sentence: I love cooking and traveling.
Old embedding (first 5 elements): [-0.01158029  0.00336338  0.0123597  -0.02616081 -0.0063383 ]

Similarity before replacement: 0.7280
New sentence: Machine learning is a subset of artificial intelligence focused on building systems that can learn from historical data, identify patterns, and make logical decisions with little to no human intervention.
New embedding (first 5 elements): [-0.03284407 -0.00081092  0.00623034 -0.01714912 -0.01093131]

Similarity after replacement: 1.0000
Updated vector store after replacing text in the file

--- Replacing Text and Updating Vector Store ---
Text replacement and vector store update completed.


Step 9: Search for Old Sentence in Updated Vector Store

In [12]:
# Search for the old sentence in the updated vector store to ensure the replacement was successful.
indices, distances = local_vector_store_manager.search_for_text_in_vector_store(query_text, updated_index)
print("\n--- Search for Old Sentence in Updated Vector Store ---")
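# The check below treats the returned values as similarity scores (higher means closer).
# Adjust the threshold, or invert the comparison, if the search returns distances instead.
if len(indices) > 0 and distances[0] > 0.9: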
    print(f"Old sentence found in the vector store with similarity: {distances[0]:.4f}")
else:
    print("Old sentence not found in the vector store, replacement was successful.")
print(100*'=')



--- Search for Old Sentence in Updated Vector Store ---
Old sentence not found in the vector store, replacement was successful.
