In [1]:
import os
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema import Document, SystemMessage, HumanMessage

Loading the GROQ API key that is stored in the .env file

In [2]:
load_dotenv()

True

Reading the API key and setting the LLM we want to use

In [3]:
api_key = os.getenv('GROQ_API_KEY')
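`os.getenv` returns `None` when the variable is missing, so a missing key may only show up as a less obvious error further down. A minimal sketch of failing early is shown below; `require_env` is a hypothetical helper and is not part of the original notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string
        raise RuntimeError(f"Environment variable {name} is not set; check the .env file.")
    return value
```

In this notebook it could be used as `api_key = require_env("GROQ_API_KEY")` after `load_dotenv()`.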

In [4]:
chat = ChatGroq(temperature=0, groq_api_key=api_key, model_name="llama3-70b-8192")

Initializing the embedding model we want to use

In [5]:
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-MiniLM-L6-v2')



Specifying the directory where the vector database is stored

In [6]:
persist_directory = '../RAG_identical_metadata_page_content/all_info_in_page_content_chroma_db_MISQ'

Checking if the directory exists

In [7]:
if not os.path.exists(persist_directory):
    print("Persist directory does not exist.")
else:
    print("Persist directory exists.")

Persist directory exists.


Loading the previously created vector database. Passing the embedding function ensures that queries are embedded with the same model that was used to build the database.

In [8]:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding_model)

In [9]:
num_documents = vectordb._collection.count()  
print(f"Number of documents in the vector store: {num_documents}")

Number of documents in the vector store: 1829


Describing each metadata field (name, description, type) allows more precise filtering, and the document content description helps the chain understand what each document contains.

In [10]:
metadata_field_info = [
    AttributeInfo(
        name="article_id",
        description="id of an article",
        type="string",
    ),
    AttributeInfo(
        name="para_id",
        description="Paragraph ID of the section",
        type="string",
    ),
    AttributeInfo(
        name="last_section_title",
        description="Title of the section the paragraph belongs to; multiple paragraphs can share the same section title",
        type="string",
    ),
    AttributeInfo(
        name="ent_id",
        description="Entities mentioned in the paragraph",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="label",
        description="Labels associated with the paragraph",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="authors",
        description="Authors of the article",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="Year of publication",
        type="int",
    ),
    AttributeInfo(
        name="title",
        description="Title of the article",
        type="string",
    ),
    AttributeInfo(
        name="keywords",
        description="Keywords associated with the article",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="citation_count",
        description="Number of citations the article has received",
        type="int",
    )
]
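The declared types are only useful for filtering if they match what is actually stored. The sketch below compares a subset of the declarations above with the metadata values printed further down for one retrieved paragraph, where `article_id` appears as an integer. The helper `find_type_mismatches` and the `PYTHON_TYPES` mapping are hypothetical names introduced for illustration; this is a sanity check under those assumptions, not part of LangChain.

```python
# Declared types copied from metadata_field_info above (subset)
declared_types = {
    "article_id": "string",
    "authors": "string",
    "year": "int",
    "citation_count": "int",
}

# Metadata values as printed for one retrieved paragraph in this notebook (subset)
sample_metadata = {
    "article_id": 12686,
    "authors": "Vieira da Cunha, João",
    "citation_count": 2,
}

# Assumed mapping from the declared type names to Python types
PYTHON_TYPES = {"string": str, "int": int}


def find_type_mismatches(declared, metadata):
    """Return {field: (declared_type, actual_type)} where the stored value disagrees."""
    mismatches = {}
    for field, value in metadata.items():
        expected = PYTHON_TYPES.get(declared.get(field))
        if expected is not None and not isinstance(value, expected):
            mismatches[field] = (declared[field], type(value).__name__)
    return mismatches


print(find_type_mismatches(declared_types, sample_metadata))
# {'article_id': ('string', 'int')}
```

A mismatch like this one may cause an equality filter on `article_id` to compare a string with an integer, so it is worth checking before relying on that field for filtering.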


In [11]:
document_content_description = "each document consists of a paragraph from an article"

Setting up the SelfQueryRetriever that is used for retrieving documents from the vector database. In this chain we specify Llama 3 as the LLM, which analyzes the user's prompt and structures it into a query (a search string plus an optional metadata filter), and all_info_in_page_content_chroma_db_MISQ as the vector database. We also provide the attribute info about the metadata and the document description.

In [12]:
retriever = SelfQueryRetriever.from_llm(
    llm=chat,
    vectorstore=vectordb,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info
)
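To make the idea of a structured query concrete, here is a toy illustration of its two parts: a free-text query for the semantic search and a condition on the metadata. `ToyStructuredQuery`, `eq`, and `filter_records` are hypothetical names for this sketch only; it does not reproduce LangChain's internal classes, and the semantic ranking step is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyStructuredQuery:
    query: str                               # free text for the semantic search
    metadata_filter: Callable[[dict], bool]  # condition on a paragraph's metadata


def eq(field, value):
    """Build a predicate that is true when metadata[field] equals value."""
    return lambda metadata: metadata.get(field) == value


def filter_records(structured_query, records):
    """Keep only records whose metadata satisfies the filter (ranking omitted)."""
    return [r for r in records if structured_query.metadata_filter(r["metadata"])]


# Placeholder records, not taken from the real database
records = [
    {"text": "paragraph A", "metadata": {"year": 2013}},
    {"text": "paragraph B", "metadata": {"year": 2007}},
]
sq = ToyStructuredQuery(query="article names", metadata_filter=eq("year", 2013))
print([r["text"] for r in filter_records(sq, records)])  # ['paragraph A']
```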

Testing the output of SelfQueryRetriever

In [13]:
query = "How is the data collection method in article named When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances."
results = retriever.get_relevant_documents(query)
print(results)



[Document(page_content='\n        Title: When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances\n        Authors: Leonardi, Paul M.\n        Year: 2013\n        Article ID: 7383\n        Paragraph ID: 7383_29\n        Last Section Title: Data Collection\n        Entity ID: data collection method, survey\n        Label: COLLECTION_METHOD\n        Keywords: Technology implementation, organizational change, advice networks, feature use, affordances, frames\n        Citation Count: 97\n        \n        Content:\n        I conducted field observations about three related activities: the work of crashworthiness engineers before CrashLab was implemented, the activities of developers, trainers, and managers during the implementation process, and the work of engineers after CrashLab was implemented. During the periods in which I was a resident in Safety I utilized three primary data sources: observations made of informants at 

In [14]:
query = "Give me names of articles were written in year 2013?"
retrieved_docs = retriever.get_relevant_documents(query)
for doc in retrieved_docs:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\n")

Content: 
        Title: A Dramaturgical Model of the Production of Performance Data
        Authors: Vieira da Cunha, João
        Year: 2013
        Article ID: 12686
        Paragraph ID: 12686_135
        Last Section Title: I feel like a captain in the eastern front [ in World
        Entity ID: armed conflict
        Label: TOPIC
        Keywords: Management information systems, production of performance data, performance monitoring, implementation of information technology, ethnography
        Citation Count: 2
        
        Content:
        War II] reporting to a faraway command center saying that everything is going as planned when, in reality, my soldiers are being slaughtered. 
        
Metadata: {'article_id': 12686, 'authors': 'Vieira da Cunha, João', 'citation_count': 2, 'ent_id': 'armed conflict', 'keywords': 'Management information systems, production of performance data, performance monitoring, implementation of information technology, ethnography', 'label': 'TOPIC'

We create a custom prompt that instructs the model to answer only from the provided context and to say that it doesn't know if it cannot find the answer. This approach is meant to prevent Llama 3 from generating arbitrary responses simply to fulfill the expectation of an answer.

In [16]:
custom_prompt_template = """You will be provided context information and also metadata. Use the following pieces of information to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer. Only use information from the datasource.
Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

In [17]:
def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vectorstore
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt

prompt = set_custom_prompt()

Here, we define the complete question-answering chain of the RAG pipeline. We specify Llama 3 as the LLM and the SelfQueryRetriever created earlier as the retriever. chain_type="stuff" selects the simplest chain: all retrieved documents are inserted ("stuffed") into a single prompt together with the user's question, and the LLM answers from that prompt alone. Finally, we ensure that the custom prompt we created is used.

In [18]:
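qa = RetrievalQA.from_chain_type(
    llm=chat,
    chain_type='stuff',
    retriever=retriever,
    return_source_documents=True,
    # Pass the custom prompt; without this the chain falls back to its default prompt
    chain_type_kwargs={'prompt': prompt}
)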
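The following toy sketch shows roughly what a "stuff" chain does with the retrieved paragraphs before calling the LLM. `build_stuffed_prompt` is a hypothetical helper, the template is shortened, and the blank-line separator is an assumption; this is an illustration, not LangChain's implementation.

```python
# Shortened template for illustration only
template = "Context: {context}\nQuestion: {question}\nHelpful answer:"


def build_stuffed_prompt(template, paragraphs, question, separator="\n\n"):
    """Join all retrieved paragraphs into one context string and fill the template."""
    context = separator.join(paragraphs)
    return template.format(context=context, question=question)


prompt_text = build_stuffed_prompt(
    template,
    ["First retrieved paragraph.", "Second retrieved paragraph."],
    "Which method was used?",
)
print(prompt_text)
```

Because the LLM sees only the paragraphs that were retrieved, questions that need the whole corpus (for example, counting all articles from a given year) can at best be answered for the handful of paragraphs in the context.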

Testing of RAG

In [22]:
query = "Give me names of articles were written in year 2013?"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Based on the provided context, there is only one article written in 2013:

1. "A Dramaturgical Model of the Production of Performance Data" by João Vieira da Cunha.


In [23]:
query = "Give me how many articles were published in 2013 and also the names of these articles?"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Based on the provided context, I can only see one article published in 2013, which is:

* "A Dramaturgical Model of the Production of Performance Data" by João Vieira da Cunha.

I don't have information about other articles published in 2013. The provided context only contains information about this specific article.


Actual answer: 4 articles were published in 2013 (based on an SQL query).
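The SQL query behind this reference count is not shown in the notebook. As a rough sketch of how such a count could be obtained when the data is stored one row per paragraph, the example below uses an in-memory SQLite database. The table name `paragraphs`, its columns, and all rows are hypothetical placeholders, not the real MISQ data.

```python
import sqlite3

# In-memory database with a hypothetical one-row-per-paragraph schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paragraphs (article_id INTEGER, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO paragraphs VALUES (?, ?, ?)",
    [
        (1, "Placeholder article one", 2013),
        (1, "Placeholder article one", 2013),  # second paragraph of the same article
        (2, "Placeholder article two", 2013),
        (3, "Placeholder article three", 2007),
    ],
)

# DISTINCT collapses the paragraphs of one article into a single row
rows = conn.execute(
    "SELECT DISTINCT article_id, title FROM paragraphs WHERE year = ? ORDER BY article_id",
    (2013,),
).fetchall()
conn.close()

print(len(rows), rows)
# 2 [(1, 'Placeholder article one'), (2, 'Placeholder article two')]
```

Unlike the RAG chain, a query like this scans every row, which is why it can serve as ground truth for counting questions.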

In [25]:
query = "Which articles had citation count higher than 250?"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: According to the provided context, all the articles mentioned have a citation count of 299, which is higher than 250.


There are only 2 articles with a citation count higher than 250 (based on an SQL query).
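An answer like this suggests that the retrieved paragraphs may all come from the same article. Since the chain was created with `return_source_documents=True`, the retrieved documents are available under `result["source_documents"]` and can be grouped by article. The sketch below uses stand-in objects so it runs on its own; `paragraphs_per_article` is a hypothetical helper, and it assumes that each document's metadata contains a `title` key.

```python
from collections import Counter
from types import SimpleNamespace


def paragraphs_per_article(source_documents):
    """Count retrieved paragraphs per article title, based on each document's metadata."""
    return Counter(doc.metadata.get("title", "<unknown>") for doc in source_documents)


# Stand-in objects with a .metadata attribute; the titles are placeholders
stub_docs = [
    SimpleNamespace(metadata={"title": "Article A"}),
    SimpleNamespace(metadata={"title": "Article A"}),
    SimpleNamespace(metadata={"title": "Article B"}),
]
counts = paragraphs_per_article(stub_docs)
print(len(counts), dict(counts))  # 2 {'Article A': 2, 'Article B': 1}
```

In the notebook, the call would be `paragraphs_per_article(result["source_documents"])`. If only one title appears, the LLM had no way to see the other qualifying articles.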

In [26]:
query = "Give me all the articles were published in 2007, but also include their title, authors and citation count?"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Based on the provided context, there is only one article that meets the criteria:

* Title: How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance
* Authors: Limayem, Moez; Hirt, Sabine Gabriele; Cheung, Christy M. K.
* Year: 2007
* Citation Count: 240


The answer is missing one article. The full answer that we should receive is:

"How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance" by Limayem, Moez; Hirt, Sabine Gabriele; Cheung, Christy M. K. with a citation count of 240.

"Toward a Deeper Understanding of System Usage in Organizations: A Multilevel Perspective" by Burton-Jones, Andrew; Gallivan, Michael J. with a citation count of 0. (Based on an SQL query, rephrased as sentences.)


In [28]:
query = "Is case study mentioned in article: Understanding User Revisions When Using Information System Features: Adaptive System Use and Triggers. if it is not mentioned, which kind of study was used?"
result = qa({"query": query})
print("Answer:", result["result"])

OutputParserException: Parsing text
```json
{
    "query": "case study",
    "filter": "or(eq(\"title\", \"Understanding User Revisions When Using Information System Features: Adaptive System Use and Triggers\"), NO_FILTER)"
}
```

Please let me know if this is correct or if I need to make any adjustments.
 raised following error:
Unexpected token Token('RPAR', ')') at line 1, column 131.
Expected one of: 
	* LPAR
Previous tokens: [Token('CNAME', 'NO_FILTER')]


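The error occurs because the LLM generated a malformed filter: it placed NO_FILTER inside an or(...) expression, which the query parser cannot parse, so the metadata filter could not be applied.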
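One possible way to keep such a parsing failure from aborting retrieval is to fall back to a plain similarity search. The sketch below is generic and self-contained; `retrieve_with_fallback` and the two stand-in functions are hypothetical. In this notebook, `primary` could be `retriever.get_relevant_documents`, `fallback` could be `vectordb.similarity_search`, and `errors` could be narrowed to the `OutputParserException` shown in the traceback.

```python
def retrieve_with_fallback(primary, fallback, query, errors=(Exception,)):
    """Try the primary retrieval function; on a listed error, use the fallback instead."""
    try:
        return primary(query), "primary"
    except errors as exc:
        print(f"Primary retrieval failed ({type(exc).__name__}); using fallback.")
        return fallback(query), "fallback"


# Stand-ins for demonstration only
def failing_primary(query):
    raise ValueError("could not parse filter")


def plain_search(query):
    return [f"paragraph matching: {query}"]


docs, source = retrieve_with_fallback(failing_primary, plain_search, "case study")
print(source, docs)  # fallback ['paragraph matching: case study']
```

The fallback loses the metadata filter, so its results may include paragraphs from other articles; returning the label alongside the documents makes it possible to flag such answers.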

In [27]:
query = "Which groups of people were involved in the survey in article: How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: According to the provided context, the group of people involved in the survey were business students at a university in Hong Kong.


This is a correct answer.

In [29]:
query = "Was PLS used in An Alternative to Methodological Individualism: A Non-Reductionist Approach to Studying Technology Adoption by Groups"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: According to the provided context, yes, PLS (Partial Least Squares) analysis was mentioned as one of the keywords in the article "An Alternative to Methodological Individualism: A Non-Reductionist Approach to Studying Technology Adoption by Groups" by Sarker and Valacich (2010).


The model appears to find PLS in the keywords rather than in the actual paragraph text (based on MISQ IS Use Curation).

In [30]:
query = "Which theory was mentioned in this article: Revisiting Group-Based Technology Adoption as a Dynamic Process: The Role of Changing Attitude-Rationale Configurations"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Based on the provided context, there is no specific theory mentioned in the article. The keywords and content suggest that the article is focused on technology adoption, collective adoption, diversity, distribution, group valence, and process view, but it does not explicitly mention a specific theory.


In [31]:
query = "How was study conduted in article: The Integrative Framework of Technology Use: An Extension and Test."
result = qa({"query": query})
print("Answer:", result["result"])

Answer: According to the provided context, the study was conducted using secondary data from past research. A thorough literature review was performed to identify past studies that conducted a TPB-based three-wave panel study and reported data in the form of correlations. Three datasets from Venkatesh et al. (2000) and Morris et al. (2005) were identified and used to test the competing models. The use of secondary data was expected to minimize subjective biases that theory developers may acquire in the course of their data collection.


The answer is not exact but contextually correct, because it mentions that the study was based on data from past research. Reference answer: "Three secondary data sets examined from prior longitudinal surveys. Analysis performed with LISREL." The RAG answer comes close to this (based on MISQ IS Use Curation).

In [32]:
query = "How Many articles in 2013 use case study"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: Based on the provided context, I can identify three articles published in 2013:

1. "A Dramaturgical Model of the Production of Performance Data" by Vieira da Cunha, João
2. "When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances" by Leonardi, Paul M.
3. "An Investigation of Information Systems Use Patterns: Technological Events as Triggers, the Effect of Time, and Consequences for Performance" by Ortiz de Guinea, Ana; Webster, Jane

Among these articles, two of them use case study or qualitative methods:

1. "A Dramaturgical Model of the Production of Performance Data" uses ethnography and qualitative interviews.
2. "When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances" uses qualitative observational study and qualitative coding.

The third article, "An Investigation of Information Systems Use Patterns: Technological Events as Triggers, the 

Only one article published in 2013 seems to use a case study (based on MISQ IS Use Curation), so this answer is incorrect.

Results:

Overall, as a result of combining all column information into one vector database, the RAG system tends to hallucinate frequently. It manages to provide correct answers for some qualitative questions about a single article, but it often fails to retrieve accurate information for others. For quantitative questions that require counting or filtering across articles (for example, how many articles were published in a given year), it produces incomplete or incorrect answers. Based on these findings, we have decided to take a different approach.