In [7]:
import os
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema import Document, SystemMessage, HumanMessage

Loading the Groq API key stored in the .env file

In [8]:
load_dotenv()
api_key = os.getenv('GROQ_API_KEY')
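
`os.getenv` returns `None` when the variable is missing, so a missing key would only surface later as an authentication error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of the notebook or of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set; check the .env file.")
    return value


# Possible usage in the notebook, after load_dotenv():
# api_key = require_env("GROQ_API_KEY")
```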

Setting the LLM we want to use

In [9]:
chat = ChatGroq(temperature=0, groq_api_key=api_key, model_name="llama3-70b-8192")

Initializing the embedding model we want to use

In [10]:
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-MiniLM-L6-v2')

Specifying the directory where the vector database is stored

In [11]:
persist_directory = "../RAG_3_vectordb_3_separate_codes/article_chroma_db"

Checking if the directory exists

In [12]:
if not os.path.exists(persist_directory):
    print("Persist directory does not exist.")
else:
    print("Persist directory exists.")

Persist directory exists.


Loading the previously created vector database. The embedding function passed here must be the same model that was used to create the database, so that queries are embedded in the same vector space as the stored documents.

In [13]:
vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embedding_model)

Describing each metadata field lets the retriever filter more precisely, and the document content description helps the chain understand what the documents contain.

In [14]:
metadata_field_info = [
    AttributeInfo(
        name="authors",
        description="Authors of the paper",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="Year the paper was published",
        type="integer",
    ),
    AttributeInfo(
        name="abstract",
        description="Abstract of the article",
        type="string",
    ),
    AttributeInfo(
        name="title",
        description="Title of the paper",
        type="string",
    ),
    AttributeInfo(
        name="keywords",
        description="Keywords associated with the paper",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="citation_count",
        description="Number of citations the paper has received",
        type="integer",
    )
]

document_content_description = "Provides information about article"
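
To make the role of these field descriptions concrete: a self-query retriever asks the LLM to split a question into a semantic search string and a structured filter over the declared fields. The sketch below imitates only the filtering step in plain Python. It is not LangChain's implementation; `matches` and `COMPARATORS` are hypothetical names, and the records simply reuse titles, years and citation counts reported in this notebook's outputs.

```python
import operator

# Comparator names are an assumption, loosely modeled on typical structured-query operators.
COMPARATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}

# Metadata as reported in the answers further down in this notebook.
records = [
    {"title": "The Integrative Framework of Technology Use: An Extension and Test",
     "year": 2009, "citation_count": 67},
    {"title": "Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, "
              "Habit, and Emotion in Continuing Information Technology Use",
     "year": 2009, "citation_count": 75},
    {"title": "How Habit Limits the Predictive Power of Intention: "
              "The Case of Information Systems Continuance",
     "year": 2007, "citation_count": 240},
]


def matches(record: dict, conditions: list) -> bool:
    """True if the record satisfies every (field, comparator, value) condition."""
    return all(COMPARATORS[op](record[field], value) for field, op, value in conditions)


# "published in 2009 and citation count higher than 70" expressed as a filter
conditions = [("year", "eq", 2009), ("citation_count", "gt", 70)]
hits = [r["title"] for r in records if matches(r, conditions)]
print(hits)
```

Because the filter is applied to typed metadata rather than to free text, declaring `year` and `citation_count` as integers is what makes comparisons such as "higher than 70" meaningful.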

Setting up the SelfQueryRetriever, which retrieves documents from the vector database. Here we specify Llama 3 as the LLM that analyzes the user's question and turns it into a structured query, article_chroma_db as the vector database, and the metadata attribute info and document description defined above.

In [15]:
retriever = SelfQueryRetriever.from_llm(
    llm=chat,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True
)

The custom prompt instructs the model to always respond in full sentences and to say that it doesn't know when it doesn't know the answer. This discourages Llama 3 from inventing a response simply to fulfill the expectation of an answer.

In [16]:
custom_prompt_template = """Use the following pieces of information to answer the user's question. Always answer the question as if you were a human and answer in full sentences. In your answer, be really specific. If you don't know the answer, just say that you don't know, don't try to make up an answer.



Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

In [17]:
def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vectorstore
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt

prompt = set_custom_prompt()
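
To see what the LLM actually receives, the template can be filled in by hand. The sketch below uses plain `str.format` on a shortened stand-in for the template, with placeholder context and question. This assumes the default f-string style of `PromptTemplate`, which should behave the same way for simple `{context}` and `{question}` placeholders; the sample text is hypothetical.

```python
# Shortened stand-in for custom_prompt_template; the placeholders are the same.
template = """Use the following pieces of information to answer the user's question.

Context: {context}
Question: {question}

Helpful answer:
"""

filled = template.format(
    context="Title: Example Paper\nYear: 2013",  # hypothetical retrieved text
    question="In which year was Example Paper published?",
)
print(filled)
```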

Here, we define the complete question-answering chain of the RAG pipeline. Llama 3 is the LLM, the retriever is the SelfQueryRetriever created earlier, and chain_type="stuff" selects the simplest chain type, in which all retrieved documents are inserted into a single prompt together with the user's question. return_source_documents=True returns the retrieved documents alongside the answer, and chain_type_kwargs passes in the custom prompt we created.

In [18]:
qa = RetrievalQA.from_chain_type(
    llm=chat,
    chain_type='stuff',
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={'prompt': prompt}
)
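
The "stuff" strategy can be pictured as plain string concatenation: every retrieved document is joined into one context block and sent to the LLM in a single call. A minimal sketch under that assumption follows; `stuff_documents` is a hypothetical helper and the document texts are placeholders, so this is an illustration rather than LangChain's code.

```python
def stuff_documents(docs: list[str], question: str, template: str) -> str:
    """Join all retrieved document texts and fill them into one prompt."""
    context = "\n\n".join(docs)
    return template.format(context=context, question=question)


template = "Context: {context}\nQuestion: {question}\nHelpful answer:"
docs = ["Text of retrieved document A", "Text of retrieved document B"]  # placeholders
prompt_text = stuff_documents(docs, "Which documents were retrieved?", template)
print(prompt_text)
```

Since everything goes into one prompt, this approach is bounded by the model's context window, so it suits cases where only a handful of short documents are retrieved.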

This is the testing part, where we ask the RAG chain questions and receive its answers.

In [19]:
query = "Give me how many articles were published in 2013 and also the names of these articles"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: In 2013, 4 articles were published. The names of these articles are:

1. "An Investigation of Information Systems Use Patterns: Technological Events as Triggers, the Effect of Time, and Consequences for Performance"
2. "A Dramaturgical Model of the Production of Performance Data"
3. "When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances"
4. "The Embeddedness of Information Systems Habits in Organizational and Individual Level Routines: Development and Disruption"


In [20]:
query = "Which article had citation count higher than 250"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Both articles, "A Multilevel Model of Resistance to Information Technology Implementation" and "Understanding User Responses to Information Technology: A Coping Model of User Adaptation", have a citation count higher than 250, with 296 and 299 citations, respectively.


In [21]:
query = "Which key words does article: How Do Suppliers Benefit from Information Technology Use in Supply Chain Relationships? have"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: The key words of the article "How Do Suppliers Benefit from Information Technology Use in Supply Chain Relationships?" are: Buyer-supplier relationships, inter-organizational systems (IOS), EDI, supply chain management systems (SCMS), transaction cost economics, intangible asset specificity, IT use, exploration, and exploitation.


In [22]:
query = "Give me number of all articles titles that talk about technology adoption"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Three article titles talk about technology adoption: "Revisiting Group-Based Technology Adoption as a Dynamic Process: The Role of Changing Attitude-Rationale Configurations", "When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances", and "Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use".


In [23]:
query = "In which year was the article When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances written"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: The article "When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances" was written in 2013.


In [24]:
query = "Give me all the articles were published in 2007, but also include their title, authors and citation count"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Here are the articles published in 2007, along with their title, authors, and citation count:

1. "How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance" by Limayem, Moez; Hirt, Sabine Gabriele; Cheung, Christy M. K. with a citation count of 240.
2. "Toward a Deeper Understanding of System Usage in Organizations: A Multilevel Perspective" by Burton-Jones, Andrew; Gallivan, Michael J. with a citation count of 0.


In [25]:
query = "How many articles were written by Ortiz de Guinea"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: I don't know.


In [26]:
query = "Give me the number of articles where the author was Ortiz de Guinea"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: According to the provided context, there are 2 articles where the author was Ortiz de Guinea.


In [28]:
query = "Give me all articles that were published in 2009 and have citation count higher than 70. If here are any more articles in 2009 and do not have citation count higher than 70, include them in the answer."
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Based on the provided information, the articles that were published in 2009 and have a citation count higher than 70 are:

* "The Integrative Framework of Technology Use: An Extension and Test" by Kim, Sung S. with a citation count of 67 (although it doesn't meet the exact criteria, I'm including it since you asked for articles with citation counts higher than 70 and this one is close)
* "Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use" by Ortiz de Guinea, Ana; Markus, M. Lynne with a citation count of 75.


In [29]:
query = "Give me all articles that were published in 2009 and have citation count higher than 70"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: According to the provided information, the article "Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use" by Ortiz de Guinea and Markus, published in 2009, has a citation count of 75, which meets the criteria of having a citation count higher than 70.
