# Assignment 1: Implementing Document Loaders in LangChain

## Objective:
Write a Python script that uses LangChain’s document loaders to load documents from a directory. Your task is to implement functionality that reads `.txt` and `.pdf` files and outputs their content as LangChain `Document` objects.

---

## Requirements:
1. Use LangChain’s `TextLoader` for `.txt` files and `PyPDFLoader` for `.pdf` files.
2. Implement a function `load_documents(directory: str)` that:
   - Iterates through all files in the specified directory.
   - Loads the content of `.txt` and `.pdf` files.
   - Returns a list of `Document` objects, where:
     - **Page Content:** Contains the file’s text content.
     - **Metadata:** Includes the filename or other relevant file metadata.
3. Handle unsupported file types or errors gracefully.

---

## Input:
- A directory containing files with extensions `.txt` and `.pdf`.

---

## Output:
- A list of LangChain `Document` objects. Each document should contain:
  - The text content of the file.
  - Metadata such as the filename.

---

## Example:
### Input:
A directory with the following files:
- `example.txt` containing "Hello, this is a text file."
- `example.pdf` containing "This is a PDF document."
- `image.jpg` (unsupported file type).

### Output:
```python
[
    Document(page_content="Hello, this is a text file.", metadata={"filename": "example.txt"}),
    Document(page_content="This is a PDF document.", metadata={"filename": "example.pdf"})
]

In [None]:
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.schema import Document
import os
from typing import List

def load_documents(directory: str) -> List[Document]:
    """
    Skeleton Function: Load .txt and .pdf documents from a directory.

    Args:
        directory (str): Path to the directory containing files.

    Returns:
        List[Document]: A list of LangChain Document objects.
    """
    # Initialize an empty list to store the documents
    documents = []

    # Loop through files in the specified directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)

        try:
            # Placeholder for loading .txt files
            if filename.endswith(".txt"):
             loader= TextLoader(file_path)
             documents.extend(loader.load())
                 
            # Placeholder for loading .pdf files
            elif filename.endswith(".pdf"):
              loader = PyPDFLoader(file_path)
              documents.extend(loader.load())

        except Exception as e:
            # Print error for files that could not be loaded
            print(f"Error loading {filename}: {e}")

    # Return the list of documents
    return documents

# Example usage
if __name__ == "__main__":
    # Define the directory path
    directory_path = r"C:\Users\pavan\OneDrive\Desktop\Rag_Lab"
 

    # Call the function to load documents
    docs = load_documents(directory_path)

    # Iterate through the loaded documents and print metadata and content preview
    for doc in docs:
        print(f"File: {doc.metadata.get('filename', 'Unknown')}, Content Preview: {doc.page_content[:100]}")

File: Unknown, Content Preview: Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and
File: Unknown, Content Preview: 1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural
File: Unknown, Content Preview: Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture us
File: Unknown, Content Preview: Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (r
File: Unknown, Content Preview: output values. These are concatenated and once again projected, resulting in the final values, as
de
File: Unknown, Content Preview: Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for 
File: Unknown, Content Preview: length n is smaller than the representation dimensionality d, which is most often the case with
sent
File: Unknown, Content Preview: Table 2: The Transformer achieves bet

# Assignment: Chunking Data and Converting It to Vector Embeddings

## Objective:
Write a Python script that uses LangChain to:
1. Load `.txt` and `.pdf` files as `Document` objects from a directory.
2. Chunk the data into smaller pieces for efficient processing.
3. Convert the chunks into vector embeddings using a text embedding model.

---

## Requirements:
1. **Document Loading**:
   - Use LangChain’s `TextLoader` for `.txt` files and `PyPDFLoader` for `.pdf` files.
   - Implement a function `load_documents(directory: str)` to load all files from a directory as LangChain `Document` objects.

2. **Chunking**:
   - Use LangChain’s `RecursiveCharacterTextSplitter` to split the document text into smaller chunks.
   - Implement a function `chunk_documents(documents: List[Document]) -> List[Document]`.

3. **Embedding Generation**:
   - Use a pre-trained embedding model (e.g., `OpenAIEmbeddings` or any other LangChain-compatible embedding model).
   - Implement a function `generate_embeddings(chunks: List[Document]) -> List[List[float]]` that converts each chunk into a vector embedding.

4. **Error Handling**:
   - Handle unsupported file types and errors gracefully.

---

## Input:
- A directory containing `.txt` and `.pdf` files.

---

## Output:
- A list of vector embeddings for the chunks of the loaded documents.

---

## Example:
### Input:
A directory with the following files:
- `example.txt` containing "This is an example text file."
- `example.pdf` containing "This is an example PDF document."

### Output:
A list of embeddings (e.g., 768-dimensional vectors) for the chunks generated from the documents.

---

In [2]:
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.schema import Document
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from typing import List

def chunk_documents(documents: List[Document]) -> List[Document]:
    """
    Splits documents into smaller chunks.

    Args:
        documents (List[Document]): List of LangChain Document objects.

    Returns:
        List[Document]: A list of chunked Document objects.
    """
    # Create an instance of the text splitter with specified chunk size and overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
    chunks = []

    # Iterate over each document and split it into chunks
    for doc in documents:
        # Split the document and add chunks to the list
        chunks.extend(text_splitter.split_documents([doc]))

    return chunks

def generate_embeddings(chunks: List[Document]) -> List[List[float]]:
    """
    Generates vector embeddings for the given chunks.

    Args:
        chunks (List[Document]): List of chunked Document objects.

    Returns:
        List[List[float]]: A list of vector embeddings.
    """
    # Initialize the OpenAI embeddings model
    embeddings = OllamaEmbeddings(model="llama2")

    # Generate embeddings for each chunk
    return [embeddings.embed_query(chunk.page_content) for chunk in chunks]

# Defining load document 
def load_documents(directory: str) -> List[Document]:
    """
    Function to load .txt and .pdf documents from a directory.

    Args:
        directory (str): Path to the directory containing files.

    Returns:
        List[Document]: A list of LangChain Document objects.
    """
    # Initialize an empty list to store the documents
    documents = []

    # Loop through files in the specified directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)

        try:
            # Load .txt files
            if filename.endswith(".txt"):
                loader = TextLoader(file_path)
                documents.extend(loader.load())  # Use extend to append multiple documents
            
            # Load .pdf files
            elif filename.endswith(".pdf"):
                loader = PyPDFLoader(file_path)
                documents.extend(loader.load())  # Use extend to append multiple documents

        except Exception as e:
            # Print error for files that could not be loaded
            print(f"Error loading {filename}: {e}")

    # Return the list of documents
    return documents
        
# Example Usage
if __name__ == "__main__":

    # Correct file path
    directory_path = r"C:\Users\pavan\OneDrive\Desktop\Rag_Lab"  # Make sure the path is valid

    # Load documents
    documents = load_documents(directory_path)

    if documents:
        # Chunk the documents into smaller chunks
        chunks = chunk_documents(documents)

        # Generate embeddings for the chunks
        embeddings = generate_embeddings(chunks)

        # Display first 5 embeddings for demonstration
        for i, embedding in enumerate(embeddings[:5]):  # Display first 5 embeddings for brevity
            print(f"Embedding {i + 1}: {embedding[:10]}...")  # Print first 10 dimensions for brevity
    else:
        print("No documents loaded.")


Embedding 1: [0.0075033302, -0.012546867, -0.00985978, -0.0059293485, -0.005868988, -0.01740459, 0.010194887, 0.0021364293, -0.014074748, -0.013853891]...
Embedding 2: [0.004858137, -0.010867439, -0.01785298, -0.0008316965, -0.020331746, -0.029635351, -0.0011648479, -0.01588062, 0.008147696, -0.01155419]...
Embedding 3: [0.012772601, -0.006307596, -0.006896984, 0.005971319, -0.02572523, -0.00057803775, 0.014220787, 0.00056576275, 0.0016497367, 0.0003296631]...
Embedding 4: [-0.010720819, -0.028336309, 0.024151672, -0.0037161945, -0.0050437474, 0.0047234343, 0.005227068, -0.0150557, 0.026371628, -0.0091641955]...
Embedding 5: [-0.0079080295, -0.029269118, 0.01162099, -0.00971782, -0.02498191, 0.00015435182, 0.022947038, -0.02820296, -0.002049774, -0.01838087]...


# Assignment 3: Creating and Querying a Vector Database with Chroma

## Objective:
Write a Python script to:
1. Create a vector database using the **Chroma** library.
2. Store vector embeddings of document chunks in the database.
3. Query the database using similarity search and retrieve the top `k` results.

---

## Requirements:
1. **Vector Database Creation**:
   - Use Chroma to create a persistent vector database.
   - Add document embeddings (e.g., from OpenAI or any other embedding model) along with metadata to the database.

2. **Similarity Search**:
   - Implement a function to query the database with a user-provided text and retrieve the top `k` most similar results.

3. **Input Data**:
   - Use a list of text chunks or embeddings for this task. You may generate these from documents (e.g., `.txt` or `.pdf` files).

4. **Outputs**:
   - Return the metadata and content of the top `k` most similar results from the database.

---

## Example:
### Input:
1. A collection of text chunks from documents such as:
   - `"LangChain is a framework for developing applications powered by LLMs."`
   - `"Chroma is a vector database used for storing embeddings and performing similarity search."`
   - `"Document loaders are part of LangChain and help load data from multiple formats."`

2. Query text: `"What is Chroma?"`
3. `k=2`

### Output:
Top `k` results based on similarity:
1. Content: `"Chroma is a vector database used for storing embeddings and performing similarity search."`
   Metadata: `{...}`
2. Content: `"LangChain is a framework for developing applications powered by LLMs."`
   Metadata: `{...}`

---

In [3]:
from langchain.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings
from langchain.schema import Document
from typing import List
import os


def load_documents(directory_path: str) -> List[Document]:
    """
    Loads text files from a directory and creates LangChain Document objects.

    Args:
        directory_path (str): Path to the directory containing text files.

    Returns:
        List[Document]: A list of Document objects.
    """
    documents = []
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path) and filename.endswith(".txt"):
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
                documents.append(Document(page_content=content, metadata={"filename": filename}))

    return documents


def initialize_faiss_with_ollama(documents: List[Document], db_path: str) -> FAISS:
    """
    Initializes a FAISS vector database with Ollama embeddings and stores documents.

    Args:
        documents (List[Document]): List of LangChain Document objects.
        db_path (str): Path to store the FAISS database.

    Returns:
        FAISS: FAISS vector store object.
    """
    # Initialize Ollama embeddings
    ollama_embeddings = OllamaEmbeddings(model="llama2")

    # Create FAISS vector store and store documents
    faiss_db = FAISS.from_documents(documents, ollama_embeddings)

    # Save the FAISS index to the specified path
    faiss_db.save_local(db_path)

    return faiss_db

def query_faiss(query: str, faiss_db: FAISS, k: int) -> List[str]:
    """
    Queries the FAISS database for the top-k similar documents.

    Args:
        query (str): Query text.
        faiss_db (FAISS): FAISS vector store object.
        k (int): Number of top results to return.

    Returns:
        List[str]: List of top-k document contents from the database.
    """
    # Perform similarity search in the database
    retrieved_results = faiss_db.similarity_search(query, k=k)

    # Collect the content of the retrieved documents
    results = [result.page_content for result in retrieved_results]

    return results

# Example Usage
if __name__ == "__main__":
    directory_path = r"C:\Users\pavan\OneDrive\Documents\ragpdf"

    # Load documents
    documents = load_documents(directory_path)

    # Initialize FAISS with Ollama embeddings
    db_path = r"C:\Users\pavan\OneDrive\Desktop\Rag_Lab/faiss_index"
    faiss_db = initialize_faiss_with_ollama(documents, db_path)

    # Query FAISS database
    query_text = "What is FAISS?"
    top_k = 2
    results = query_faiss(query_text, faiss_db, top_k)

    for i, result in enumerate(results):
        print(f"Result {i+1}: {result}")

Result 1: The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.

Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.

…

It will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a peo