# LangChainDocQuery

LangChainDocQuery is a document querying system built with LangChain. It combines OpenAI's language models and embeddings with a Cassandra-backed vector store (Astra DB). Users enter natural language questions and get back an answer together with the most relevant text snippets from a PDF document.

## Features

- Extract text from PDF documents
- Split text into manageable chunks
- Generate embeddings for text chunks
- Store embeddings in a Cassandra-backed vector store
- Query the stored text using natural language questions
- Retrieve and display relevant text snippets based on semantic similarity
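
To make the last feature concrete, here is a minimal sketch of ranking chunks by semantic similarity. It assumes cosine similarity as the metric and uses hand-written three-dimensional vectors in place of real embeddings. The helper names (`cosine_similarity`, `top_k`) and the toy data are illustrative and are not part of this repository; the actual vector store may score results differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (score, chunk) pairs for the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk)
              for chunk, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)[:k]


# Hypothetical toy "embeddings"; real embedding vectors have far more dimensions.
toy_store = {
    "Winter is coming.": [0.9, 0.1, 0.0],
    "The wall is 700 feet tall.": [0.1, 0.9, 0.1],
    "Dragons breathe fire.": [0.0, 0.2, 0.9],
}

results = top_k([0.8, 0.2, 0.1], toy_store, k=2)
for score, chunk in results:
    print(f"[{score:.4f}] {chunk}")
```

In the real pipeline the embedding model produces the vectors and the vector store performs the search, so a helper like this is only useful for building intuition.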

## Installation

1. Clone the repository:
    ```sh
    git clone https://github.com/yourusername/LangChainDocQuery.git
    cd LangChainDocQuery
    ```

2. Install the required dependencies:
    ```sh
    pip install -r requirements.txt
    ```

3. Set up your environment variables:
    - `ASTRA_DB_APPLICATION_TOKEN`: Your Astra DB application token
    - `ASTRA_DB_ID`: Your Astra DB ID
    - `OPENAI_API_KEY`: Your OpenAI API key
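
Before running the scripts, it can help to confirm that all three variables are visible to Python. The following is a small sketch that uses only the standard library; the helper name `missing_env_vars` is hypothetical and does not exist in this repository.

```python
import os

REQUIRED_VARS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY")


def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.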

## Usage

1. Place your PDF file in the project directory and update the file path in the script (`pdf_path` in the example code).
2. Run the script to extract text from the PDF, generate embeddings, and store them in the Cassandra-backed vector store:
    ```sh
    python main.py
    ```

3. Start the querying interface to input natural language questions and retrieve relevant text snippets:
    ```sh
    python query_interface.py
    ```
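
The indexing step splits the extracted text into overlapping chunks before embedding them. The sketch below illustrates the idea with a plain fixed-size sliding window. It is a simplification for illustration, not LangChain's `CharacterTextSplitter` implementation (which also splits on a separator), and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 200  # 2,000 characters of placeholder text
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.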

## Example Code

```python
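# Imports assume the legacy `langchain` package layout and PyPDF2;
# newer LangChain releases move some of these into separate packages.
import os

import cassio
from PyPDF2 import PdfReader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.cassandra import Cassandra

# Read the credentials from the environment variables listed under Installation
ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
ASTRA_DB_ID = os.environ["ASTRA_DB_ID"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Path of the PDF file to index
pdf_path = 'GOT.pdf'

# Extract the text of every page (extract_text() may return nothing for image-only pages)
raw_text = ''
for page in PdfReader(pdf_path).pages:
    content = page.extract_text()
    if content:
        raw_text += content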

# Initialize the connection to your database
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

# Create LangChain embedding and LLM objects
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Create the Cassandra vector store; session=None and keyspace=None
# fall back to the connection set up by cassio.init() above
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

# Split the text using CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

# Add the first 50 chunks to the vector store (widen or remove the slice to index more of the document)
astra_vector_store.add_texts(texts[:50])

# Initialize the vector store index wrapper
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

# Start the query loop
prompt = "\nEnter the question (or q to quit): "
while True:
    q_text = input(prompt).strip()

    if q_text.lower() == 'q':
        break
    if not q_text:
        # Ignore empty input and ask again
        continue
    prompt = "\nEnter another question (or q to quit): "

    # Ask the LLM to answer using the chunks retrieved from the vector store
    answer = astra_vector_index.query(q_text, llm=llm).strip()
    print(f'\nQuestion: "{q_text}"')
    print(f'Answer: "{answer}"\n')

    # Show the four closest chunks with their similarity scores
    print("Documents by relevance:")
    for doc, score in astra_vector_store.similarity_search_with_score(q_text, k=4):
        print(f'    [{score:.4f}] "{doc.page_content[:80]}..."')
