# Assignment 1: Implementing Document Loaders in LangChain

## Objective:
Write a Python script that uses LangChain’s document loaders to load documents from a directory. Your task is to implement functionality that reads `.txt` and `.pdf` files and outputs their content as LangChain `Document` objects.

---

## Requirements:
1. Use LangChain’s `TextLoader` for `.txt` files and `PyPDFLoader` for `.pdf` files.
2. Implement a function `load_documents(directory: str)` that:
   - Iterates through all files in the specified directory.
   - Loads the content of `.txt` and `.pdf` files.
   - Returns a list of `Document` objects, where:
     - **Page Content:** Contains the file’s text content.
     - **Metadata:** Includes the filename or other relevant file metadata.
3. Handle unsupported file types or errors gracefully.
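
One way to keep requirement 3 manageable is to map file extensions to loader classes and treat anything outside that map as unsupported. The sketch below assumes only that each loader is constructed from a file path and exposes a `load()` method returning objects with a `metadata` dict, which matches how `TextLoader` and `PyPDFLoader` are used in this assignment. The names `load_with_dispatch` and `loader_factories` are hypothetical, chosen for illustration.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def load_with_dispatch(
    directory: str,
    loader_factories: Dict[str, Callable[[str], Any]],
) -> Tuple[List[Any], List[str]]:
    """Load every supported file in `directory`; report the rest as skipped.

    `loader_factories` maps a lowercase extension (e.g. ".txt") to a callable
    that builds a loader from a file path. This is an illustrative helper,
    not part of LangChain.
    """
    documents: List[Any] = []
    skipped: List[str] = []

    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories

        factory = loader_factories.get(path.suffix.lower())
        if factory is None:
            skipped.append(path.name)  # unsupported extension
            continue

        try:
            loaded = factory(str(path)).load()
        except Exception as exc:
            # A corrupt or unreadable file should not abort the whole run
            print(f"Error loading {path.name}: {exc}")
            skipped.append(path.name)
            continue

        # Record the filename next to whatever metadata the loader set
        for doc in loaded:
            doc.metadata["filename"] = path.name
        documents.extend(loaded)

    return documents, skipped
```

With LangChain installed, the mapping would be `{".txt": TextLoader, ".pdf": PyPDFLoader}`; a file such as `image.jpg` would then end up in the skipped list instead of raising an error.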

---

## Input:
- A directory containing files with extensions `.txt` and `.pdf`.

---

## Output:
- A list of LangChain `Document` objects. Each document should contain:
  - The text content of the file.
  - Metadata such as the filename.

---

## Example:
### Input:
A directory with the following files:
- `example.txt` containing "Hello, this is a text file."
- `example.pdf` containing "This is a PDF document."
- `image.jpg` (unsupported file type).

### Output:
```python
[
    Document(page_content="Hello, this is a text file.", metadata={"filename": "example.txt"}),
    Document(page_content="This is a PDF document.", metadata={"filename": "example.pdf"})
]
```

In [None]:
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_core.documents import Document
import os
from typing import List

def load_documents(directory: str) -> List[Document]:
    """
    Load .txt and .pdf documents from a directory.

    Args:
        directory (str): Path to the directory containing files.

    Returns:
        List[Document]: A list of LangChain Document objects.
    """
    # Initialize an empty list to store the documents
    documents = []

    # Loop through files in the specified directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)

        try:
            # Pick a loader based on the file extension (case-insensitive)
            if filename.lower().endswith(".txt"):
                loader = TextLoader(file_path)
            elif filename.lower().endswith(".pdf"):
                # PyPDFLoader returns one Document per page
                loader = PyPDFLoader(file_path)
            else:
                # Unsupported file type (or a subdirectory): skip it
                print(f"Skipping unsupported file: {filename}")
                continue

            loaded_docs = loader.load()

            # Record the filename alongside the loader's own metadata
            for doc in loaded_docs:
                doc.metadata["filename"] = filename
            documents.extend(loaded_docs)

        except Exception as e:
            # Report files that could not be loaded and keep going
            print(f"Error loading {filename}: {e}")

    # Return the list of documents
    return documents

# Example usage
if __name__ == "__main__":
    # Define the directory path
    directory_path = "File path"

    # Call the function to load documents
    docs = load_documents(directory_path)

    print(f"Loaded {len(docs)} document(s)")

    # Print each document's metadata and a preview of its content
    for doc in docs:
        print(f"Document(metadata={doc.metadata}, content preview={doc.page_content[:200]!r})")


# Assignment 2: Chunking Data and Converting It to Vector Embeddings

## Objective:
Write a Python script that uses LangChain to:
1. Load `.txt` and `.pdf` files as `Document` objects from a directory.
2. Chunk the data into smaller pieces for efficient processing.
3. Convert the chunks into vector embeddings using a text embedding model.

---

## Requirements:
1. **Document Loading**:
   - Use LangChain’s `TextLoader` for `.txt` files and `PyPDFLoader` for `.pdf` files.
   - Implement a function `load_documents(directory: str)` to load all files from a directory as LangChain `Document` objects.

2. **Chunking**:
   - Use LangChain’s `RecursiveCharacterTextSplitter` to split the document text into smaller chunks.
   - Implement a function `chunk_documents(documents: List[Document]) -> List[Document]`.

3. **Embedding Generation**:
   - Use a pre-trained embedding model (e.g., `OpenAIEmbeddings` or any other LangChain-compatible embedding model).
   - Implement a function `generate_embeddings(chunks: List[Document]) -> List[List[float]]` that converts each chunk into a vector embedding.

4. **Error Handling**:
   - Handle unsupported file types and errors gracefully.
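
The `chunk_size` and `chunk_overlap` parameters are easier to reason about with a stripped-down splitter. The sketch below is a plain fixed-window splitter written for illustration; it is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraph and line breaks before falling back to individual characters. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> List[str]:
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so content cut at
    a boundary still appears intact in one of the neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the effect `chunk_overlap=50` has at a larger scale in the assignment's splitter settings.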

---

## Input:
- A directory containing `.txt` and `.pdf` files.

---

## Output:
- A list of vector embeddings for the chunks of the loaded documents.

---

## Example:
### Input:
A directory with the following files:
- `example.txt` containing "This is an example text file."
- `example.pdf` containing "This is an example PDF document."

### Output:
A list of embeddings (e.g., 1536-dimensional vectors with OpenAI's `text-embedding-ada-002`; the dimension depends on the embedding model) for the chunks generated from the documents.
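
LangChain embedding models share an interface with `embed_documents(texts)` for batches of passages and `embed_query(text)` for a single search query. A minimal sketch of the embedding step written against that interface, so that any compatible model can be passed in, might look like this. The function name `embed_chunks` and the `embedder` and `batch_size` parameters are assumptions added for illustration; the dimension check simply guards against inconsistent output.

```python
from typing import Any, List, Sequence


def embed_chunks(chunks: Sequence[Any], embedder: Any, batch_size: int = 64) -> List[List[float]]:
    """Embed the page_content of each chunk in batches.

    `embedder` is any object with an `embed_documents(texts)` method that
    returns one vector per input text.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    texts = [chunk.page_content for chunk in chunks]
    vectors: List[List[float]] = []

    # Send the texts in batches rather than one request per chunk
    for start in range(0, len(texts), batch_size):
        vectors.extend(embedder.embed_documents(texts[start:start + batch_size]))

    # Every vector from a single model should have the same dimension
    dimensions = {len(vector) for vector in vectors}
    if len(dimensions) > 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dimensions)}")

    return vectors
```

Passing the model in as an argument also makes it straightforward to swap `OpenAIEmbeddings` for another LangChain-compatible model without touching the function.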

---

In [None]:
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import os
from typing import List



def load_documents(directory: str) -> List[Document]:
    """
    Load .txt and .pdf documents from a directory.

    Args:
        directory (str): Path to the directory containing files.

    Returns:
        List[Document]: A list of LangChain Document objects.
    """
    # Initialize an empty list to store the documents
    documents = []

    # Loop through files in the specified directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)

        try:
            # Pick a loader based on the file extension (case-insensitive)
            if filename.lower().endswith(".txt"):
                loader = TextLoader(file_path)
            elif filename.lower().endswith(".pdf"):
                # PyPDFLoader returns one Document per page
                loader = PyPDFLoader(file_path)
            else:
                # Unsupported file type (or a subdirectory): skip it
                print(f"Skipping unsupported file: {filename}")
                continue

            loaded_docs = loader.load()

            # Record the filename alongside the loader's own metadata
            for doc in loaded_docs:
                doc.metadata["filename"] = filename
            documents.extend(loaded_docs)

        except Exception as e:
            # Report files that could not be loaded and keep going
            print(f"Error loading {filename}: {e}")

    # Return the list of documents
    return documents



def chunk_documents(documents: List[Document]) -> List[Document]:
    """
    Splits documents into smaller chunks.

    Args:
        documents (List[Document]): List of LangChain Document objects.

    Returns:
        List[Document]: A list of chunked Document objects.
    """
    # Create an instance of the text splitter with specified chunk size and overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)

    # split_documents accepts the whole list and copies each document's
    # metadata onto the chunks it produces
    return text_splitter.split_documents(documents)



def generate_embeddings(chunks: List[Document]) -> List[List[float]]:
    """
    Generates vector embeddings for the given chunks.

    Args:
        chunks (List[Document]): List of chunked Document objects.

    Returns:
        List[List[float]]: A list of vector embeddings.
    """
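    # Initialize the OpenAI embeddings model. The API key is read from the
    # OPENAI_API_KEY environment variable rather than hard-coded in the script.
    embeddings = OpenAIEmbeddings()
    # Embed all chunks in one batched call; embed_query is meant for search queries
    return embeddings.embed_documents([chunk.page_content for chunk in chunks])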




# Example Usage
if __name__ == "__main__":
    # Sample directory path where documents are stored
    directory_path = "File path"

    # Load documents (reuses load_documents from Assignment 1)
    documents = load_documents(directory_path)

    # Chunk the documents into smaller chunks
    chunks = chunk_documents(documents)
    

    # Generate embeddings for the chunks
    embeddings = generate_embeddings(chunks)

    # Display the first 5 embeddings, truncated to their first 10 dimensions
    for i, embedding in enumerate(embeddings[:5]):
        print(f"Embedding {i + 1}: {embedding[:10]}...")

# Assignment 3: Creating and Querying a Vector Database with FAISS

## Objective:
Write a Python script to:
1. Create a vector database using the **FAISS** library.
2. Store vector embeddings of document chunks in the database.
3. Query the database using similarity search and retrieve the top `k` results.

---

## Requirements:
1. **Vector Database Creation**:
   - Use FAISS to build a vector index and persist it to disk.
   - Add document embeddings (e.g., from OpenAI or any other embedding model) along with metadata to the database.

2. **Similarity Search**:
   - Implement a function to query the database with a user-provided text and retrieve the top `k` most similar results.

3. **Input Data**:
   - Use a list of text chunks or embeddings for this task. You may generate these from documents (e.g., `.txt` or `.pdf` files).

4. **Outputs**:
   - Return the metadata and content of the top `k` most similar results from the database.
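
Conceptually, a similarity search embeds the query, measures its distance to every stored vector, and returns the `k` closest entries. The brute-force sketch below does this with Euclidean distance in plain Python; FAISS performs the same kind of lookup with optimized index structures, and the metric it uses depends on how the index is configured. The function name `top_k_nearest` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def top_k_nearest(
    query_vector: Sequence[float],
    stored_vectors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k stored vectors closest to the query."""
    if k < 1:
        raise ValueError("k must be at least 1")

    # Distance from the query to every stored vector
    scored = [
        (math.dist(query_vector, vector), index)
        for index, vector in enumerate(stored_vectors)
    ]
    scored.sort()  # smallest distance first; ties broken by index

    return [(index, distance) for distance, index in scored[:k]]


print(top_k_nearest([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]], k=2))
# → [(1, 1.0), (2, 2.0)]
```

The returned indices would then be used to look up the matching chunks and their metadata, which is the bookkeeping a vector store handles for you.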

---

## Example:
### Input:
1. A collection of text chunks from documents such as:
   - `"LangChain is a framework for developing applications powered by LLMs."`
   - `"FAISS is a vector database used for storing embeddings and performing similarity search."`
   - `"Document loaders are part of LangChain and help load data from multiple formats."`

2. Query text: `"What is Attention?"`
3. `k=2`

### Output:
Top `k` results based on similarity:
1. Content: `"FAISS is a vector database used for storing embeddings and performing similarity search."`
   Metadata: `{...}`
2. Content: `"LangChain is a framework for developing applications powered by LLMs."`
   Metadata: `{...}`
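
To produce output in the shape shown above, the retrieved documents can be rendered with both their content and metadata. This sketch assumes each result exposes `page_content` and `metadata` attributes, as LangChain `Document` objects do; `format_results` is a hypothetical helper.

```python
from typing import Any, Iterable, List


def format_results(results: Iterable[Any]) -> List[str]:
    """Render ranked search results as numbered content and metadata lines."""
    lines: List[str] = []
    for rank, doc in enumerate(results, start=1):
        lines.append(f'{rank}. Content: "{doc.page_content}"')
        lines.append(f"   Metadata: {doc.metadata}")
    return lines
```

Calling `print("\n".join(format_results(docs)))` on the documents returned by a similarity search would list them in rank order.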

---

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_core.documents import Document
from typing import List
import os


def load_documents(directory: str) -> List[Document]:
    """
    Load .txt and .pdf documents from a directory.

    Args:
        directory (str): Path to the directory containing files.

    Returns:
        List[Document]: A list of LangChain Document objects.
    """
    documents = []

    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)

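        try:
            # Pick a loader based on the file extension (case-insensitive)
            if filename.lower().endswith(".txt"):
                loader = TextLoader(file_path)
            elif filename.lower().endswith(".pdf"):
                # PyPDFLoader returns one Document per page
                loader = PyPDFLoader(file_path)
            else:
                # Unsupported file type (or a subdirectory): skip it
                print(f"Skipping unsupported file: {filename}")
                continue

            loaded_docs = loader.load()

            # Record the filename alongside the loader's own metadata
            for doc in loaded_docs:
                doc.metadata["filename"] = filename
            documents.extend(loaded_docs)

        except Exception as e:
            # Report files that could not be loaded and keep going
            print(f"Error loading {filename}: {e}")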

    return documents


def initialize_faiss(documents: List[Document], db_path: str) -> FAISS:
    """
    Initializes a FAISS vector database and stores documents.

    Args:
        documents (List[Document]): List of LangChain Document objects.
        db_path (str): Path to store the FAISS database.

    Returns:
        FAISS: FAISS vector store object.
    """
    # Initialize Ollama embeddings
    ollama_embeddings = OllamaEmbeddings(model="llama2")

    # Create FAISS vector store and add documents
    vectorstore = FAISS.from_documents(documents, ollama_embeddings)
    
    # Save the vector store
    vectorstore.save_local(db_path)

    return vectorstore


def query_database(query: str, vectorstore: FAISS, k: int) -> List[Document]:
    """
    Queries the FAISS database for the top-k similar documents.

    Args:
        query (str): Query text.
        vectorstore (FAISS): FAISS vector store object.
        k (int): Number of top results to return.

    Returns:
        List[Document]: The top-k most similar documents, each carrying its
            content (page_content) and metadata.
    """
    # Perform similarity search; results come back ordered from most to least similar
    return vectorstore.similarity_search(query, k=k)


# Example Usage
if __name__ == "__main__":
    # Define the path to the directory where your documents are stored
    directory_path = "File path"

    # Load documents
    documents = load_documents(directory_path)

    # Initialize FAISS with the documents
    db_path = "./vector_db"
    vectorstore = initialize_faiss(documents, db_path)

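    # Reload the saved index; recent langchain_community releases require this
    # explicit opt-in because the index metadata is stored with pickle
    vectorstore = FAISS.load_local(
        db_path, OllamaEmbeddings(model="llama2"), allow_dangerous_deserialization=True
    )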

    # Define the query and retrieve top-k results
    query_text = "What is attention?"
    top_k = 2
    results = query_database(query_text, vectorstore, top_k)

    # Print out the results
    for i, result in enumerate(results):
        print(f"Result {i+1}: {result}")