In [2]:
import os
import duckdb
from sentence_transformers import SentenceTransformer
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document



Connecting to DuckDB and selecting only 3 columns, with the query also limiting the number of rows to 500

In [4]:
db_path = '../duck_db/isrecon_AIS11.duckdb'

with duckdb.connect(database=db_path, read_only=True) as conn:
    query = 'SELECT article_id, title, abstract FROM papers LIMIT 500'
    df = conn.execute(query).fetchdf()

In [5]:
df.head(5)

Unnamed: 0,article_id,title,abstract
0,1,Examining interdependence between product user...,Firm-sponsored online user communities have be...
1,2,Computer support for strategic organizational ...,While information systems continue to be promo...
2,3,Essence: facilitating software innovation,This paper suggests ways to facilitate creativ...
3,4,The dark side of data ecosystems: A longitudin...,Data are often vividly depicted as strategic a...
4,5,Symbolic Action Research in Information System...,An essay is presented as an introduction to th...


Concatenate the title and abstract columns into a single list of strings. We do this because the vector database does not store data in a tabular format: each entry is one piece of text together with its vector, not a row of separate columns.

In [6]:
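# Fill missing values before concatenating; otherwise a row with a missing
# title or abstract produces NaN instead of a string.
texts = (df['title'].fillna('') + ' ' + df['abstract'].fillna('')).str.strip().tolist()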
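
As far as I know, pandas propagates missing values through string concatenation, so a row without an abstract ends up with no text at all unless the nulls are handled first. Below is a minimal sketch of a null-safe way to build the text in plain Python. The helper `build_text` and the sample rows are hypothetical and only illustrate the idea; they are not part of the notebook's pipeline.

```python
from typing import Optional


def build_text(title: Optional[str], abstract: Optional[str]) -> str:
    """Join title and abstract, skipping missing or blank parts."""
    parts = []
    for value in (title, abstract):
        # None, NaN (a float) and blank strings are all treated as missing.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return " ".join(parts)


# Toy rows: the second one has no abstract, like the "Editorial Notes" entries.
rows = [
    ("Some title", "Some abstract"),
    ("Editorial Notes", None),
]
texts_safe = [build_text(title, abstract) for title, abstract in rows]
print(texts_safe)
```

With this approach a row without an abstract still yields its title as text instead of a missing value.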

In [7]:
print(texts)



Creating a persist directory. This is where the vector database will be stored.

In [13]:
persist_directory = 'chroma_db'

Checking for null values, because they cause errors when we create the vector database

In [8]:
for _, row in df.iterrows():
    article_id = row['article_id']
    title = row['title']
    abstract = row['abstract']
    
    if article_id is None:
        print(f"None value found for article_id in row with title={title} and abstract={abstract}")
    if title is None:
        print(f"None value found for title in row with article_id={article_id} and abstract={abstract}")
    if abstract is None:
        print(f"None value found for abstract in row with article_id={article_id} and title={title}")

None value found for abstract in row with article_id=113 and title=Editors' Preface
None value found for abstract in row with article_id=124 and title=Editorial Notes
None value found for abstract in row with article_id=125 and title=Editorial Notes
None value found for abstract in row with article_id=127 and title=Editorial Notes
None value found for abstract in row with article_id=128 and title=Editorial Notes
None value found for abstract in row with article_id=129 and title=Editorial Notes
None value found for abstract in row with article_id=247 and title=Letting living intelligence put the artificial version in its place
None value found for abstract in row with article_id=249 and title=An introduction to qualitative research
None value found for abstract in row with article_id=250 and title=Software process improvement: Concepts and practices
None value found for abstract in row with article_id=251 and title=Handbook of Action Research Participative Inquiry and Practice
None valu

In [9]:
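# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated and may not modify df in newer pandas versions.
df['article_id'] = df['article_id'].fillna('Unknown article_id')
df['title'] = df['title'].fillna('No title available')
df['abstract'] = df['abstract'].fillna('No abstract available')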

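(I still have to check this part; I am not sure it is working properly.)
Creating the Document objects. The page content is the concatenated text, and the metadata (article ID, title, abstract) is stored alongside each vector, so it is returned with the search results and can be used to filter them.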

In [10]:
documents = [
    Document(page_content=text, metadata={'id': row['article_id'], 'title': row['title'], 'abstract': row['abstract']})
    for text, (_, row) in zip(texts, df.iterrows())
]
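
One way to check this step before building the vector database is to scan the metadata for values the store is likely to reject. The sketch below assumes that Chroma only accepts `str`, `int`, `float` and `bool` metadata values (this is my understanding and may depend on the installed version). The helper `find_bad_metadata` and the sample dictionaries are hypothetical.

```python
# Assumption: these are the metadata value types the vector store accepts.
ALLOWED_TYPES = (str, int, float, bool)


def find_bad_metadata(metadata_list):
    """Return (index, key, value) for every value that is missing or of an unsupported type."""
    problems = []
    for index, metadata in enumerate(metadata_list):
        for key, value in metadata.items():
            # NaN is a float, so it needs its own check (NaN != NaN).
            is_nan = isinstance(value, float) and value != value
            if not isinstance(value, ALLOWED_TYPES) or is_nan:
                problems.append((index, key, value))
    return problems


# Toy metadata: the first entry has a missing abstract.
sample = [
    {"id": 124, "title": "Editorial Notes", "abstract": None},
    {"id": 1, "title": "Some title", "abstract": "Some abstract"},
]
problems = find_bad_metadata(sample)
print(problems)
```

In the notebook this could be called as `find_bad_metadata([doc.metadata for doc in documents])`; an empty list would suggest the metadata is safe to store.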

In [12]:
for doc in documents:
    print(f"Document ID: {doc.metadata['id']}")
    print(f"Title: {doc.metadata['title']}")
    print(f"Abstract: {doc.metadata['abstract']}")
    print(f"Content: {doc.page_content}")
    print("\n" + "-"*50 + "\n")

Document ID: 1
Title: Examining interdependence between product users and employees in online user communities: The role of employee-generated content
Abstract: Firm-sponsored online user communities have become product innovation and support hubs of strategic importance to firms. Product users and host firm employees comprise the participants of firm-sponsored online user communities. The online user community provides a forum wherein the product users and firm employees discuss questions, problems or issues resulting from the use of host firms’ products. Extant research on online user communities has largely focused on either product users or employees and has examined the various dynamics that ensue from each entity’s community participation. This paper seeks to investigate the interdependence between the two entities in the communities and, in particular, how product users’ reading of employee-generated content influences subsequent knowledge contribution by product users as well a

Setting up our embedding model. We specify that the model we are using is sentence-transformers/paraphrase-MiniLM-L6-v2

In [37]:
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-MiniLM-L6-v2')

Combining it all together and storing everything in the vector database. Here we set what gets embedded (documents), which model does the embedding (embedding_model), and where the vector database is stored (persist_directory).

In [38]:
vectordb = Chroma.from_documents(documents=documents, 
                                 embedding=embedding_model,
                                 persist_directory=persist_directory,
                                 collection_name="title_abstract_chroma_db")

In [39]:
vectordb.persist()
vectordb = None



Connecting to the created vector database

In [57]:
vectordb = Chroma(persist_directory=persist_directory, 
                  embedding_function=embedding_model,
                  collection_name="title_abstract_chroma_db")

Setting it up as a retriever, i.e. the source of our information

In [58]:
retriever = vectordb.as_retriever()

Defining how many results the query should return. This step tests whether we can connect to the created vector database and use it as a retriever. No LLM is used in this part.

In [59]:
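def query_vectordb(query, top_k=1):
    # The number of results is configured on the retriever via search_kwargs;
    # a k argument passed to get_relevant_documents does not limit the results.
    top_k_retriever = vectordb.as_retriever(search_kwargs={'k': top_k})
    results = top_k_retriever.get_relevant_documents(query)
    return results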
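
To make clearer what "top k results" means here, below is a toy sketch of similarity retrieval in plain Python: every document has a vector, the query is turned into a vector too, and the k closest documents are returned. The vectors, the document names and the helper functions are made up for illustration. Cosine similarity is used as an example metric; the metric and index Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real model embeddings have many more dimensions.
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
nearest = top_k([0.9, 0.1, 0.0], toy_vectors, k=2)
print(nearest)
```

A real vector database does the same thing conceptually, but uses an approximate index so it does not have to compare the query against every stored vector.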

In [60]:
query = "AD blockers"
results = query_vectordb(query)

In [61]:
print(results)

