In [1]:
import os
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


Load the Groq API key. The key is stored in a .env file so it is not exposed in the notebook.

In [2]:
load_dotenv()

True

In [3]:
api_key = os.getenv('GROQ_API_KEY')
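
os.getenv returns None when the variable is missing, and the error then only shows up later when the chat client is created. A minimal sketch of a fail-fast check follows. `require_env` is a hypothetical helper written for this notebook, not part of python-dotenv or langchain_groq.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper: `env` defaults to os.environ and is only a
    parameter so the function can be tried out with a plain dict.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to the .env file.")
    return value
```

After load_dotenv(), this could be used as `api_key = require_env("GROQ_API_KEY")`.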

In [4]:
chat = ChatGroq(temperature=0, groq_api_key=api_key, model_name="llama3-70b-8192")

Load the embedding model.

In [5]:
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-MiniLM-L6-v2')

  from tqdm.autonotebook import tqdm, trange
comet_ml is installed but `COMET_API_KEY` is not set.


Set the path to the persist directory. This should be the directory where you stored your vector database.

In [6]:
persist_directory = r"C:\Users\andyu\OneDrive\Počítač\Text, Web and Social Media Analytics Lab\Rag_project\RAG_multiple_vector_stores\article_chroma_db_MISQ"

This is just a check. Sometimes I had a problem where the directory could not be found.

In [7]:
if not os.path.exists(persist_directory):
    print("Persist directory does not exist.")
else:
    print("Persist directory exists.")

Persist directory exists.
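
A slightly stricter version of this check, as a sketch: it raises instead of printing, so a wrong path stops the notebook before any query runs. `check_persist_directory` is a hypothetical helper, and treating an empty directory as an error is my assumption (a persisted store is expected to contain files).

```python
from pathlib import Path


def check_persist_directory(path):
    """Return the directory as a Path, or raise if it is missing or empty."""
    directory = Path(path)
    if not directory.is_dir():
        raise FileNotFoundError(f"Persist directory not found: {directory}")
    # Assumption: a persisted vector store always leaves files on disk.
    if not any(directory.iterdir()):
        raise FileNotFoundError(f"Persist directory is empty: {directory}")
    return directory
```

In this notebook it could be called as `check_persist_directory(persist_directory)` before connecting to the database.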


Connect to the already created ChromaDB and set it up as a retriever (our source of information).

In [8]:
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding_model)

In [20]:
num_documents = vectordb._collection.count()  
print(f"Number of documents in the vector store: {num_documents}")

Number of documents in the vector store: 19


In [60]:
retriever = vectordb.as_retriever()
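
What a retriever does can be sketched in plain Python: embed the query, score every stored vector against it, and return the best k texts. This is a toy illustration and not how Chroma is implemented. The two-dimensional vectors are made up, and cosine similarity is an assumption (the store may use a different distance). To my knowledge, as_retriever() without arguments returns the 4 most similar chunks per query, which is why k defaults to 4 here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=4):
    """Return the k texts whose vectors are most similar to the query.

    `documents` is a list of (text, vector) pairs.
    """
    ranked = sorted(
        documents,
        key=lambda pair: cosine_similarity(query_vector, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors, only to show the ranking.
toy_documents = [
    ("software innovation", [1.0, 0.1]),
    ("data science", [0.1, 1.0]),
    ("IT adoption", [0.7, 0.7]),
]
print(top_k([1.0, 0.0], toy_documents, k=2))
```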

An attempt at a custom prompt template. The goal is that when the information is not found in the vector database, the LLM says so instead of giving us made-up answers.

In [72]:
custom_prompt_template = """Use the following pieces of information to answer the user's question. Always answer the question as if you were a human. If you don't know the answer, just say that you don't know, don't try to make up an answer.



Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

In [73]:
def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vectorstore
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt

prompt = set_custom_prompt()
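
To see what the model actually receives, the placeholders can be filled by hand. This sketch uses Python's str.format as a stand-in, on the assumption that PromptTemplate uses f-string style placeholders by default. The template is shortened and the context line is an invented example.

```python
# Shortened stand-in for custom_prompt_template, for illustration only.
template = """Context: {context}
Question: {question}
Helpful answer:
"""

filled = template.format(
    context="Title: Essence: facilitating software innovation",
    question="Which article talks about software innovation?",
)
print(filled)
```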

Setting up the QA chain, which binds everything together. The query is used to find similar entries in the vector database, the retrieved text is inserted into the custom prompt as context, and the LLM answers the question while following that prompt.

In [74]:
qa = RetrievalQA.from_chain_type(
    llm=chat,
    chain_type='stuff',
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={'prompt': prompt}
)
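
The 'stuff' chain type can be sketched as four steps: retrieve, join the retrieved texts into one context string, fill the prompt, call the model. This is a simplified illustration with fake stand-ins, not LangChain's code. `stuff_answer`, `fake_retrieve` and `fake_llm` are hypothetical names, and the blank-line separator between documents is an assumption.

```python
def stuff_answer(question, retrieve, llm, template):
    """Retrieve texts, stuff them into one prompt, and ask the model once."""
    documents = retrieve(question)
    # Assumption: retrieved texts are joined with a blank line between them.
    context = "\n\n".join(documents)
    prompt_text = template.format(context=context, question=question)
    return {"result": llm(prompt_text), "source_documents": documents}


def fake_retrieve(question):
    """Stand-in for the retriever; returns invented texts."""
    return ["Title: Article A. Citations: 10", "Title: Article B. Citations: 3"]


def fake_llm(prompt_text):
    """Stand-in for the chat model; only reports the prompt size."""
    return f"(model reply to a prompt of {len(prompt_text)} characters)"


template = "Context: {context}\nQuestion: {question}\nHelpful answer:"
result = stuff_answer(
    "Which article has the most citations?", fake_retrieve, fake_llm, template
)
print(result["result"])
```

Because everything is stuffed into a single prompt, the model can only use what the retriever returned for that query.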

In [68]:
query = "Article whit highest citation count"
result = qa({"query": query})

In [69]:
print("Answer:", result["result"])

Answer: The article with the highest citation count is "A Multilevel Model of Resistance to Information Technology Implementation" by Lapointe and Rivard, with a citation count of 296.
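
Since the chain was created with return_source_documents=True, the result also carries the retrieved documents under "source_documents", which helps to check where an answer came from. A sketch of a small formatter follows. `Document` here is a stand-in dataclass with the same two attributes I rely on (page_content and metadata), `summarize_sources` is a hypothetical helper, and the example data is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the retrieved document objects (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result, preview_chars=60):
    """Return one short line per source document in a QA result dict."""
    lines = []
    for index, doc in enumerate(result.get("source_documents", []), start=1):
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{index}] {preview} | metadata: {doc.metadata}")
    return lines


# Invented example in the same shape as the chain's result.
example = {
    "result": "some answer",
    "source_documents": [Document("Title:   Article A", {"year": 2005})],
}
for line in summarize_sources(example):
    print(line)
```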


In [75]:
query = "How many articles were published in 2005"
result = qa({"query": query})
print("Answer:", result["result"])

Answer: 1
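
Counting questions are a weak fit for retrieval: the model only sees the few chunks the retriever returned, not all 19 entries, so the count above is not verified against the whole store. A sketch of counting directly over metadata instead. The "year" key is hypothetical. In this notebook the list might come from `vectordb.get()["metadatas"]`, assuming the store was built with such a field.

```python
from collections import Counter


def count_by_year(metadatas, year_key="year"):
    """Count entries per publication year; entries without the key are skipped."""
    return Counter(
        str(meta[year_key]) for meta in metadatas if meta and year_key in meta
    )


# Invented metadata, only to show the shape of the input.
example_metadatas = [{"year": 2005}, {"year": 2010}, {"year": 2005}, {}]
print(count_by_year(example_metadatas)["2005"])
```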


Trying out queries to see whether we get answers based on our vector database.

In [16]:
query = "Give me articles that talk about software innovation."
result = qa({"query": query})
print("Answer:", result["result"])

Answer: The article "Essence: facilitating software innovation" talks about software innovation.


In [17]:
query = "What is article with the name Data Science for Scoial Good about."
result = qa({"query": query})
print("Answer:", result["result"])

Answer: The article "Data Science for Social Good" is about the diminishing emphasis on social good challenges in data science research, and presents a framework for "data science for social good" research that considers the interplay between relevant data science research genres, social good challenges, and different levels of sociotechnical abstraction.


In [18]:
import textwrap

In [19]:
query = "What is the article with the name Data Science for Social Good about."
result = qa({"query": query})
wrapped_answer = textwrap.fill(result["result"], width=80)
print("Answer:\n", wrapped_answer)

Answer:
 The article "Data Science for Social Good" is about presenting a framework for
"data science for social good" (DSSG) research that considers the interplay
between relevant data science research genres, social good challenges, and
different levels of sociotechnical abstraction, and highlighting the lack of
research focus on social good challenges in the field of data science.
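
textwrap.fill treats the whole answer as one paragraph, so line breaks in the model's reply (for example between numbered items) are turned into spaces. A sketch of a variant that wraps each line separately and keeps blank lines. `wrap_preserving_breaks` is a hypothetical helper.

```python
import textwrap


def wrap_preserving_breaks(text, width=80):
    """Wrap each original line on its own so existing line breaks survive."""
    wrapped = []
    for line in text.splitlines():
        # An empty line is kept as a paragraph break.
        wrapped.append(textwrap.fill(line, width=width) if line.strip() else "")
    return "\n".join(wrapped)
```

It could replace the fill call as `wrapped_answer = wrap_preserving_breaks(result["result"], width=80)`.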


In [20]:
query = "Give me full abstract of article Data Science for Social Good."
result = qa({"query": query})
wrapped_answer = textwrap.fill(result["result"], width=80)
print("Answer:\n", wrapped_answer)

Answer:
 Here is the abstract of the article "Data Science for Social Good":  Data
science has been described as the fourth paradigm of scientific discovery. The
latest wave of data science research, pertaining to machine learning and
artificial intelligence (AI), is growing exponentially and garnering millions of
annual citations. However, this growth has been accompanied by a diminishing
emphasis on social good challenges-our analysis reveals that the proportion of
data science research focusing on social good is less than it has ever been. At
the same time, the proliferation of machine learning and generative AI has
sparked debates about the sociotechnical prospects and challenges associated
with data science for human flourishing, organizations, and society. Against
this backdrop, we present a framework for "data science for social good" (DSSG)
research that considers the interplay between relevant data science research
genres, social good challenges, and different levels of sociot

In [21]:
query = "Give me an name of an article that talks about  research that considers the interplay between relevant data science research genres, social good challenges, and different levels of sociotechnical abstraction, and highlighting the lack of research focuon social good challenges in the field of data science."
result = qa({"query": query})
wrapped_answer = textwrap.fill(result["result"], width=80)
print("Answer:\n", wrapped_answer)

Answer:
 Data Science for Social Good


In [23]:
query = "How similar are the these two articles:Essence: facilitating software innovation and Data Science for Social Good."
result = qa({"query": query})
wrapped_answer = textwrap.fill(result["result"], width=80)
print("Answer:\n", wrapped_answer)

Answer:
 These two articles appear to be quite different in terms of their topics and
focus areas. The first article, "Essence: facilitating software innovation",
focuses on software development and innovation, while the second article, "Data
Science for Social Good", discusses the application of data science for social
good and the lack of research in this area. There is no obvious connection or
similarity between the two articles.


In [26]:
query = "Give me 3 articles that talk about the similar topic as this one and also tell me why is it similar Essence: facilitating software innovation"
result = qa({"query": query})
wrapped_answer = textwrap.fill(result["result"], width=80)
print("Answer:\n", wrapped_answer)

Answer:
 Here are three articles that talk about similar topics as "Essence: facilitating
software innovation":  1. "Architectures in context: on the evolution of
business, application software, and ICT platform architectures" - This article
is similar because it also discusses software development and innovation, albeit
from a different perspective, focusing on the evolution of architectures.  2.
"From Placebo to Panacea: Studying the Diffusion of IT Management Techniques
with Ambiguous Efficiencies: The Case of Capability Maturity Model" - This
article is similar because it also explores the diffusion of IT management
techniques, which is related to software innovation.  3. "The antecedents and
consequents of user perceptions in information technology adoption" - This
article is similar because it also examines the adoption of information
technology innovations, which is closely related to software innovation.  These
articles are similar because they all discuss aspects of software d