In [1]:
import os
import duckdb
from sentence_transformers import SentenceTransformer
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document



We connect to DuckDB and select the columns to use for the vector database. We only use papers from the MISQ journal that are mentioned in the article "MISQ Research Curation on IS Use".

In [2]:
db_path = '../duck_db/isrecon_AIS11.duckdb'

In [3]:
with duckdb.connect(database=db_path, read_only=True) as conn:
    query = '''SELECT article_id,authors, year, title, journal, abstract, keywords, citation_count FROM papers
    WHERE title IN (
    'Technology Adaptation: The Case of a Computer-Supported Inter-organizational Virtual Team',
    'How Do Suppliers Benefit from Information Technology Use in Supply Chain Relationships?',
    'A Multilevel Model of Resistance to Information Technology Implementation',
    'Understanding User Responses to Information Technology: A Coping Model of User Adaptation',
    'A Comprehensive Conceputalization of the Post-Adoptive Behaviors Associated with IT-Enabled Work Systems',
    'Information Technology and the Performance of the Customer Service Process: A Resource-Based Analysis',
    'Toward a Deeper Understanding of System Usage in Organizations: A Multilevel Perspective',
    'How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance',
    'Predicting Different Conceptualizations of System Use: The Competing Roles of Behavioral Intention, Facilitating Conditions, and Behavioral Expectation',
    'The Integrative Framework of Technology Use: An Extension and Test',
    'Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use',
    'An Alternative to Methodological Individualism: A Non-Reductionist Approach to Studying Technology Adoption by Groups',
    'Capturing Bottom-Up Information Technology Use Processes: A Complex Adaptive Systems Model',
    'Understanding User Revisions When Using Information System Features: Adaptive System Use and Triggers',
    'Interfirm IT Capability Profiles and Communications for Cocreating Relational Value: Evidence from the Logistics Industry',
    'A Dramaturgical Model of the Production of Performance Data',
    'The Embeddedness of Information Systems Habits in Organizational and Individual Level Routines: Development and Disruption',
    'When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances',
    'Nature and Nurture: The Impact of Automaticity and the Structuration of Communication on Virtual Team Behavior and Performance',
    'An Investigation of Information Systems Use Patterns: Technological Events as Triggers, the Effect of Time, and Consequences for Performance',
    'Toward Generalizable Sociomaterial Inquiry: A Computational Approach for Zooming In and Out of Sociomaterial Routines',
    'Coping with Information Technology: Mixed Emotions, Vacillation, and Nonconforming Use Patterns',
    'Information Technology Use as a Learning Mechanism: The Impact of IT Use on Knowledge Transfer Effectiveness, Absorptive Capacity, and Franchisee Performance',
    'ICT, Intermediaries, and the Transformation of Gendered Power Structures',
    'Multiplex Appropriation in Complex Systems Implementation: The Case of Brazil''s Correspondent Banking System',
    'Revisiting Group-Based Technology Adoption as a Dynamic Process: The Role of Changing Attitude-Rationale Configurations',
    'Capturing the Complexity of Malleable IT Use: Adaptive Structuration Theory for Individuals',
    'A Temporally Situated Self-Agency Theory of Information Technology Reinvention'
);'''
    df_article = conn.execute(query).fetchdf()
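
The title filter is written out in full inside the SQL string, and the same list is needed again for the other tables. A minimal sketch of one way to keep the titles in a single Python list and pass them as query parameters. The helper name `build_title_query` and the shortened `CURATION_TITLES` list are hypothetical, and it assumes the connection's `execute` accepts `?` placeholders with a list of parameters.

```python
def build_title_query(columns, table, titles):
    """Return (sql, params) selecting `columns` from `table` for the given titles."""
    # One "?" placeholder per title; the values travel separately from the SQL text,
    # so apostrophes in titles need no manual doubling.
    placeholders = ", ".join("?" for _ in titles)
    sql = f"SELECT {', '.join(columns)} FROM {table} WHERE title IN ({placeholders})"
    return sql, list(titles)


# Shortened list for illustration only; the notebook uses the full set of titles.
CURATION_TITLES = [
    "A Dramaturgical Model of the Production of Performance Data",
    "Multiplex Appropriation in Complex Systems Implementation: The Case of Brazil's Correspondent Banking System",
]

sql, params = build_title_query(["article_id", "title"], "papers", CURATION_TITLES)
print(sql)
# With an open connection this could then be run as:
# df_article = conn.execute(sql, params).fetchdf()
```

Keeping the list in one place also avoids the titles drifting apart between queries.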

Checking the values in the dataframe

In [4]:
print(df_article.head())

   article_id                                     authors  year  \
0         926        Bartelt, Valerie L.; Dennis, Alan R.  2014   
1        1033          Beaudry, Anne; Pinsonneault, Alain  2005   
2        1658  Burton-Jones, Andrew; Gallivan, Michael J.  2007   
3        6541                                Kim, Sung S.  2009   
4        7061           Lapointe, Liette; Rivard, Suzanne  2005   

                                               title  \
0  Nature and Nurture: The Impact of Automaticity...   
1  Understanding User Responses to Information Te...   
2  Toward a Deeper Understanding of System Usage ...   
3  The Integrative Framework of Technology Use: A...   
4  A Multilevel Model of Resistance to Informatio...   

                                    journal  \
0  Management Information Systems Quarterly   
1  Management Information Systems Quarterly   
2  Management Information Systems Quarterly   
3  Management Information Systems Quarterly   
4  Management Information

In [5]:
print(df_article.shape)

(19, 8)
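
The query lists 28 titles but the dataframe has 19 rows, so not every title found an exact match in `papers`. A small sketch for listing the titles that did not match. The helper name `missing_titles` and the sample data are made up for illustration.

```python
def missing_titles(requested, found):
    """Return the requested titles that are absent from `found`, in request order."""
    found_set = set(found)
    return [title for title in requested if title not in found_set]


# Stand-in data; in the notebook `requested` would be the list of titles from the
# query and `found` would be df_article['title'].
requested = ["Paper A", "Paper B", "Paper C"]
found = ["Paper A", "Paper C"]
print(missing_titles(requested, found))
```

The comparison is exact, so differences in spelling, case or punctuation between the query and the database would show up here.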


In [6]:
null_counts = df_article.isnull().sum()

Null values have to be handled before creating the vector database because they cause errors.

In [7]:
print(null_counts)

article_id        0
authors           0
year              0
title             0
journal           0
abstract          0
keywords          0
citation_count    0
dtype: int64


We see that there are no missing values, but in other tests we had some null values, so we fill them anyway as a safeguard for future runs.

In [8]:
df_article = df_article.fillna('Information not provided in the source DB')
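
Filling every column with a string means that a numeric column with a gap ends up holding mixed types. An alternative sketch is to clean each row's dict at the point where the metadata is built. The helper name `clean_metadata` is hypothetical, and the assumption that the vector store accepts only `str`, `int`, `float` and `bool` metadata values should be checked against the installed Chroma version.

```python
PLACEHOLDER = "Information not provided in the source DB"


def clean_metadata(record, placeholder=PLACEHOLDER):
    """Replace missing values and coerce unsupported types to strings."""
    cleaned = {}
    for key, value in record.items():
        # None, or a float NaN (NaN is the only float not equal to itself)
        if value is None or (isinstance(value, float) and value != value):
            cleaned[key] = placeholder
        elif isinstance(value, (str, int, float, bool)):
            cleaned[key] = value
        else:
            cleaned[key] = str(value)
    return cleaned


row = {"article_id": 926, "keywords": None, "citation_count": float("nan")}
print(clean_metadata(row))
```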

In this step we concatenate the columns of the dataframe into a single string per row, which becomes the page content. We do this because we cannot use tabular data as input to the embedding model; the tabular data has to be converted into text.

First we create an empty list that will hold the concatenated rows.
The loop iterates over each row in the dataframe and joins its values, each prefixed with its column name, into one string.
Each string is appended to the list, so we end up with one document per dataframe row.

In [9]:
def concatenate_with_headers(df):
    concatenated_rows = []
    for index, row in df.iterrows():
        concatenated_row = " ".join([f"{col}: {row[col]}" for col in df.columns])
        concatenated_rows.append(concatenated_row)
    return concatenated_rows
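
To illustrate what the function produces, here is the same formatting applied to a plain dict standing in for one dataframe row. The helper name `concatenate_record` is hypothetical, and the values are taken from the dataframe preview earlier in the notebook.

```python
def concatenate_record(record):
    """Join 'column: value' pairs of one row into a single string."""
    return " ".join(f"{col}: {value}" for col, value in record.items())


row = {"article_id": 6541, "authors": "Kim, Sung S.", "year": 2009}
print(concatenate_record(row))
# article_id: 6541 authors: Kim, Sung S. year: 2009
```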

In [16]:
documents = [Document(page_content=row) for row in concatenate_with_headers(df_article)]

In [17]:
for doc in documents[:5]:
    print(doc.page_content)

article_id: 926 authors: Bartelt, Valerie L.; Dennis, Alan R. year: 2014 title: Nature and Nurture: The Impact of Automaticity and the Structuration of Communication on Virtual Team Behavior and Performance journal: Management Information Systems Quarterly abstract: Much prior research on virtual teams has examined the impact of the features and capabilities of different communication tools (the nature of communication) on team performance. In this paper, we examine how the social structures (i.e., genre rules) that emerge around different communication tools (the nurture of communication) can be as important in influencing performance. During habitual use situations, team members enact genre rules associated with communication tools without conscious thought via automaticity. These genre rules influence how teams interact and ultimately how well they perform. We conducted an experimental study to examine the impact of different genre rules that have developed for two communication too

In [10]:
documents = [
    Document(page_content=row_content, metadata=row.to_dict())
    for row_content, (_, row) in zip(concatenate_with_headers(df_article), df_article.iterrows())
]

In [11]:
for doc in documents[:5]:
    print("Page Content:", doc.page_content)
    print("Metadata:", doc.metadata)

Page Content: article_id: 926 authors: Bartelt, Valerie L.; Dennis, Alan R. year: 2014 title: Nature and Nurture: The Impact of Automaticity and the Structuration of Communication on Virtual Team Behavior and Performance journal: Management Information Systems Quarterly abstract: Much prior research on virtual teams has examined the impact of the features and capabilities of different communication tools (the nature of communication) on team performance. In this paper, we examine how the social structures (i.e., genre rules) that emerge around different communication tools (the nurture of communication) can be as important in influencing performance. During habitual use situations, team members enact genre rules associated with communication tools without conscious thought via automaticity. These genre rules influence how teams interact and ultimately how well they perform. We conducted an experimental study to examine the impact of different genre rules that have developed for two com

Creating a persist directory where the vector database will be stored

In [12]:
persist_directory = '../RAG_multiple_vector_stores/article_chroma_db_MISQ'

The Document objects created above combine the concatenated text (page content) with the metadata associated with it.

We use a sentence-transformers model as our embedding model.

In [3]:
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-MiniLM-L6-v2')

In this step we create the vector database. It takes the documents, which are the source of information and get embedded and stored in the database, and the embedding model, which is the sentence-transformers model. We store the database in the persist directory, so it lives on disk and can be accessed later without recreating it from scratch.

In [14]:
vectordb_articles = Chroma.from_documents(documents=documents, 
                                 embedding=embedding_model,
                                 persist_directory=persist_directory)
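
Since the store is written to disk, a later session can reopen it instead of embedding everything again. A hedged sketch, assuming the same `langchain` import paths as at the top of this notebook. The helper name `load_article_store` is hypothetical, and the store has to be opened with the same embedding model it was built with.

```python
def load_article_store(persist_directory,
                       model_name="sentence-transformers/paraphrase-MiniLM-L6-v2"):
    """Reopen a persisted Chroma store with the embedding model used to build it."""
    # Imported lazily so that defining the helper does not load the model libraries.
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma

    embedding = HuggingFaceEmbeddings(model_name=model_name)
    return Chroma(persist_directory=persist_directory, embedding_function=embedding)


# Example (not executed here):
# vectordb_articles = load_article_store('../RAG_multiple_vector_stores/article_chroma_db_MISQ')
```

Depending on the installed Chroma version, an explicit `persist()` call on the store may be needed after creation before the data is fully written to disk.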

In [15]:
retriever = vectordb_articles.as_retriever()
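
Conceptually, the retriever embeds the query and returns the stored documents whose vectors are closest to it. A toy sketch of that idea with hand-made three-dimensional vectors and cosine similarity. The vectors, texts and function names are invented for illustration; the real store uses the model's embeddings and its own index and distance metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, store, k=1):
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


store = [
    ("paper about habit", [0.9, 0.1, 0.0]),
    ("paper about virtual teams", [0.1, 0.9, 0.0]),
    ("paper about supply chains", [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.1], store, k=2))
```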

In [18]:
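def query_vectordb(query, top_k=1):
    # Query the store directly so that top_k is honoured; depending on the LangChain
    # version, extra keyword arguments passed to retriever.invoke() may be ignored.
    results = vectordb_articles.similarity_search(query, k=top_k)
    return results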

In [25]:
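def query_vectordb(query, top_k=1):
    # Perform a semantic search. The store is queried directly so that top_k is
    # honoured; depending on the LangChain version, extra keyword arguments passed
    # to retriever.invoke() may be ignored.
    results = vectordb_articles.similarity_search(query, k=top_k)
    
    # Example of handling numerical info: among the retrieved documents only,
    # return the one with the highest citation count
    if "highest citation count" in query.lower():
        highest_citation_doc = max(results, key=lambda doc: doc.metadata.get('citation_count', 0))
        return [highest_citation_doc]
    
    return results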
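
The citation-count branch compares metadata values directly. If a missing count was filled with the placeholder string, `max` would compare an `int` with a `str` and raise a `TypeError`. A sketch of a more defensive key function follows. The `Doc` class is a stand-in for LangChain's `Document`, and the helper names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with page content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def citation_count(doc):
    """Return the citation count as an int, or -1 when it is missing or not numeric."""
    try:
        return int(doc.metadata.get("citation_count"))
    except (TypeError, ValueError):
        return -1


def most_cited(docs):
    """Pick the document with the highest usable citation count."""
    return max(docs, key=citation_count)


docs = [
    Doc("paper A", {"citation_count": 120}),
    Doc("paper B", {"citation_count": "Information not provided in the source DB"}),
    Doc("paper C", {"citation_count": 340}),
]
print(most_cited(docs).page_content)  # paper C
```

Note that this still ranks only the documents that the semantic search returned, not the whole collection.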

In [26]:
query = "article where author is Ortiz de Guinea, Ana"
results = query_vectordb(query)
print(results)

[Document(page_content="article_id: 9462 authors: Ortiz de Guinea, Ana; Markus, M. Lynne year: 2009 title: Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use journal: Management Information Systems Quarterly abstract: One of the most welcome recent developments in Information Systems scholarship has been the growing interest in individuals' continuing use of information technology well after initial adoption, known in the literature as IT usage, IT continuance, and post-adoptive IT usage. In this essay, we explore the theoretical underpinnings of IS research on continuing IT use. Although the IS literature on continuing IT use emphasizes the role of habitual behavior that does not require conscious behavioral intention, it does so in a way that largely remains faithful to the theoretical tradition of planned behavior and reasoned action. However, a close reading of reference literatures on automatic behavior

We take similar steps for the sentences table, where each full article is split into sentences (one row = one sentence).

In [4]:
with duckdb.connect(database=db_path, read_only=True) as conn:
    query = '''SELECT para_id, last_section_title  FROM sentences
                JOIN papers ON sentences.article_id = papers.article_id
                WHERE title IN (
                'Technology Adaptation: The Case of a Computer-Supported Inter-organizational Virtual Team',
                'How Do Suppliers Benefit from Information Technology Use in Supply Chain Relationships?',
                'A Multilevel Model of Resistance to Information Technology Implementation',
                'Understanding User Responses to Information Technology: A Coping Model of User Adaptation',
                'A Comprehensive Conceputalization of the Post-Adoptive Behaviors Associated with IT-Enabled Work Systems',
                'Information Technology and the Performance of the Customer Service Process: A Resource-Based Analysis',
                'Toward a Deeper Understanding of System Usage in Organizations: A Multilevel Perspective',
                'How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance',
                'Predicting Different Conceptualizations of System Use: The Competing Roles of Behavioral Intention, Facilitating Conditions, and Behavioral Expectation',
                'The Integrative Framework of Technology Use: An Extension and Test',
                'Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use',
                'An Alternative to Methodological Individualism: A Non-Reductionist Approach to Studying Technology Adoption by Groups',
                'Capturing Bottom-Up Information Technology Use Processes: A Complex Adaptive Systems Model',
                'Understanding User Revisions When Using Information System Features: Adaptive System Use and Triggers',
                'Interfirm IT Capability Profiles and Communications for Cocreating Relational Value: Evidence from the Logistics Industry',
                'A Dramaturgical Model of the Production of Performance Data',
                'The Embeddedness of Information Systems Habits in Organizational and Individual Level Routines: Development and Disruption',
                'When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances',
                'Nature and Nurture: The Impact of Automaticity and the Structuration of Communication on Virtual Team Behavior and Performance',
                'An Investigation of Information Systems Use Patterns: Technological Events as Triggers, the Effect of Time, and Consequences for Performance',
                'Toward Generalizable Sociomaterial Inquiry: A Computational Approach for Zooming In and Out of Sociomaterial Routines',
                'Coping with Information Technology: Mixed Emotions, Vacillation, and Nonconforming Use Patterns',
                'Information Technology Use as a Learning Mechanism: The Impact of IT Use on Knowledge Transfer Effectiveness, Absorptive Capacity, and Franchisee Performance',
                'ICT, Intermediaries, and the Transformation of Gendered Power Structures',
                'Multiplex Appropriation in Complex Systems Implementation: The Case of Brazil''s Correspondent Banking System',
                'Revisiting Group-Based Technology Adoption as a Dynamic Process: The Role of Changing Attitude-Rationale Configurations',
                'Capturing the Complexity of Malleable IT Use: Adaptive Structuration Theory for Individuals',
                'A Temporally Situated Self-Agency Theory of Information Technology Reinvention'
                );'''
    df_sentences= conn.execute(query).fetchdf()

In [5]:
print(df_sentences.head())

  para_id last_section_title
0   926_0               None
1   926_1           Abstract
2   926_1           Abstract
3   926_1           Abstract
4   926_1           Abstract


In [6]:
df_sentences = df_sentences.drop_duplicates(subset=['para_id'])
print(df_sentences.head())

   para_id last_section_title
0    926_0               None
1    926_1           Abstract
10   926_2     Introduction 1
11   926_3     Introduction 1
14   926_4     Introduction 1
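
Because the sentences table has one row per sentence, each `para_id` appears once per sentence, and `drop_duplicates` keeps the first row for each. A plain-Python sketch of the same idea, assuming the section title is the same for all sentences of a paragraph. The helper name `first_per_key` and the sample rows are illustrative.

```python
def first_per_key(rows, key):
    """Keep only the first row seen for each value of `key`, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


rows = [
    {"para_id": "926_0", "last_section_title": None},
    {"para_id": "926_1", "last_section_title": "Abstract"},
    {"para_id": "926_1", "last_section_title": "Abstract"},
    {"para_id": "926_2", "last_section_title": "Introduction 1"},
]
print([r["para_id"] for r in first_per_key(rows, "para_id")])
# ['926_0', '926_1', '926_2']
```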


In [7]:
with duckdb.connect(database=db_path, read_only=True) as conn:
    query = '''SELECT title, para_id, paragraph FROM paragraphs
                JOIN papers ON paragraphs.article_id = papers.article_id
                WHERE title IN (
                'Technology Adaptation: The Case of a Computer-Supported Inter-organizational Virtual Team',
                'How Do Suppliers Benefit from Information Technology Use in Supply Chain Relationships?',
                'A Multilevel Model of Resistance to Information Technology Implementation',
                'Understanding User Responses to Information Technology: A Coping Model of User Adaptation',
                'A Comprehensive Conceputalization of the Post-Adoptive Behaviors Associated with IT-Enabled Work Systems',
                'Information Technology and the Performance of the Customer Service Process: A Resource-Based Analysis',
                'Toward a Deeper Understanding of System Usage in Organizations: A Multilevel Perspective',
                'How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance',
                'Predicting Different Conceptualizations of System Use: The Competing Roles of Behavioral Intention, Facilitating Conditions, and Behavioral Expectation',
                'The Integrative Framework of Technology Use: An Extension and Test',
                'Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use',
                'An Alternative to Methodological Individualism: A Non-Reductionist Approach to Studying Technology Adoption by Groups',
                'Capturing Bottom-Up Information Technology Use Processes: A Complex Adaptive Systems Model',
                'Understanding User Revisions When Using Information System Features: Adaptive System Use and Triggers',
                'Interfirm IT Capability Profiles and Communications for Cocreating Relational Value: Evidence from the Logistics Industry',
                'A Dramaturgical Model of the Production of Performance Data',
                'The Embeddedness of Information Systems Habits in Organizational and Individual Level Routines: Development and Disruption',
                'When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances',
                'Nature and Nurture: The Impact of Automaticity and the Structuration of Communication on Virtual Team Behavior and Performance',
                'An Investigation of Information Systems Use Patterns: Technological Events as Triggers, the Effect of Time, and Consequences for Performance',
                'Toward Generalizable Sociomaterial Inquiry: A Computational Approach for Zooming In and Out of Sociomaterial Routines',
                'Coping with Information Technology: Mixed Emotions, Vacillation, and Nonconforming Use Patterns',
                'Information Technology Use as a Learning Mechanism: The Impact of IT Use on Knowledge Transfer Effectiveness, Absorptive Capacity, and Franchisee Performance',
                'ICT, Intermediaries, and the Transformation of Gendered Power Structures',
                'Multiplex Appropriation in Complex Systems Implementation: The Case of Brazil''s Correspondent Banking System',
                'Revisiting Group-Based Technology Adoption as a Dynamic Process: The Role of Changing Attitude-Rationale Configurations',
                'Capturing the Complexity of Malleable IT Use: Adaptive Structuration Theory for Individuals',
                'A Temporally Situated Self-Agency Theory of Information Technology Reinvention'
                );'''
    df_paragraphs= conn.execute(query).fetchdf()

In [8]:
print(df_paragraphs.head())

                                               title para_id  \
0  The Embeddedness of Information Systems Habits...  9957_0   
1  The Embeddedness of Information Systems Habits...  9957_1   
2  The Embeddedness of Information Systems Habits...  9957_3   
3  The Embeddedness of Information Systems Habits...  9957_4   
4  The Embeddedness of Information Systems Habits...  9957_5   

                                           paragraph  
0  THE EMBEDDEDNESS OF INFORMATION SYSTEMS HABITS...  
1  Despite recent interest in studying informatio...  
2  The psychological construct of habit has attra...  
3  Further, within organizations, IS use is almos...  
4  The objective of the current paper is to contr...  


In [9]:
merged_df = df_sentences.merge(df_paragraphs, on=['para_id'])

In [10]:
print(merged_df.head())

  para_id last_section_title  \
0   926_0               None   
1   926_1           Abstract   
2   926_3     Introduction 1   
3   926_4     Introduction 1   
4   926_5     Introduction 1   

                                               title  \
0  Nature and Nurture: The Impact of Automaticity...   
1  Nature and Nurture: The Impact of Automaticity...   
2  Nature and Nurture: The Impact of Automaticity...   
3  Nature and Nurture: The Impact of Automaticity...   
4  Nature and Nurture: The Impact of Automaticity...   

                                           paragraph  
0  NATURE AND NURTURE: THE IMPACT OF AUTOMATICITY...  
1  Much prior research on virtual teams has exami...  
2  Prior research has argued-and demonstrated emp...  
3  In this paper, we argue that nurture has an eq...  
4  Genre rules are like many other social structu...  
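
`merge` performs an inner join by default, so a `para_id` that exists in only one of the two dataframes is dropped without warning. In the output above, `926_2` from `df_sentences` is no longer among the first rows. A small sketch for checking which keys fail to match. The helper name `unmatched_keys` and the id lists are stand-ins.

```python
def unmatched_keys(left_keys, right_keys):
    """Return (keys only in left, keys only in right), each sorted."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)


# Stand-in ids; in the notebook these would be df_sentences['para_id']
# and df_paragraphs['para_id'].
sentence_ids = ["926_0", "926_1", "926_2", "926_3"]
paragraph_ids = ["926_0", "926_1", "926_3"]

only_sentences, only_paragraphs = unmatched_keys(sentence_ids, paragraph_ids)
print(only_sentences)   # ['926_2']
print(only_paragraphs)  # []
```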


In [11]:
print(merged_df.sort_values(by='para_id').head(15))

      para_id  last_section_title  \
773    1033_0                None   
774    1033_1            Abstract   
833  1033_101       A) and Bill B   
834  1033_102       A) and Bill B   
835  1033_104     Data Collection   
836  1033_105     Data Collection   
837  1033_107     Data Collection   
838  1033_108     Data Collection   
839  1033_109     Data Collection   
780   1033_11  The Coping Process   
840  1033_111       Data Analysis   
841  1033_112       Data Analysis   
842  1033_113       Data Analysis   
843  1033_115             Results   
844  1033_116             Results   

                                                 title  \
773  Understanding User Responses to Information Te...   
774  Understanding User Responses to Information Te...   
833  Understanding User Responses to Information Te...   
834  Understanding User Responses to Information Te...   
835  Understanding User Responses to Information Te...   
836  Understanding User Responses to Information Te...   
8
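
Sorting by `para_id` compares strings, which is why `1033_101` is listed before `1033_11` above. A sketch of a numeric sort key, assuming every `para_id` has the form `<article>_<paragraph>` with two integers, as in the outputs shown. The helper name `para_sort_key` is hypothetical.

```python
def para_sort_key(para_id):
    """Split '1033_11' into (1033, 11) so ids sort numerically."""
    article, paragraph = para_id.split("_")
    return int(article), int(paragraph)


ids = ["1033_101", "1033_11", "1033_0", "926_3", "1033_1"]
print(sorted(ids))                     # string order
print(sorted(ids, key=para_sort_key))  # numeric order
```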

In [12]:
with duckdb.connect(database=db_path, read_only=True) as conn:
    query = '''SELECT para_id, ent_id, level_3 FROM entities
                JOIN papers ON entities.article_id = papers.article_id
                WHERE title IN (
                'Technology Adaptation: The Case of a Computer-Supported Inter-organizational Virtual Team',
                'How Do Suppliers Benefit from Information Technology Use in Supply Chain Relationships?',
                'A Multilevel Model of Resistance to Information Technology Implementation',
                'Understanding User Responses to Information Technology: A Coping Model of User Adaptation',
                'A Comprehensive Conceputalization of the Post-Adoptive Behaviors Associated with IT-Enabled Work Systems',
                'Information Technology and the Performance of the Customer Service Process: A Resource-Based Analysis',
                'Toward a Deeper Understanding of System Usage in Organizations: A Multilevel Perspective',
                'How Habit Limits the Predictive Power of Intention: The Case of Information Systems Continuance',
                'Predicting Different Conceptualizations of System Use: The Competing Roles of Behavioral Intention, Facilitating Conditions, and Behavioral Expectation',
                'The Integrative Framework of Technology Use: An Extension and Test',
                'Why Break the Habit of a Lifetime? Rethinking the Roles of Intention, Habit, and Emotion in Continuing Information Technology Use',
                'An Alternative to Methodological Individualism: A Non-Reductionist Approach to Studying Technology Adoption by Groups',
                'Capturing Bottom-Up Information Technology Use Processes: A Complex Adaptive Systems Model',
                'Understanding User Revisions When Using Information System Features: Adaptive System Use and Triggers',
                'Interfirm IT Capability Profiles and Communications for Cocreating Relational Value: Evidence from the Logistics Industry',
                'A Dramaturgical Model of the Production of Performance Data',
                'The Embeddedness of Information Systems Habits in Organizational and Individual Level Routines: Development and Disruption',
                'When Does Technology Use Enable Network Change in Organizations? A Comparative Study of Feature Use and Shared Affordances',
                'Nature and Nurture: The Impact of Automaticity and the Structuration of Communication on Virtual Team Behavior and Performance',
                'An Investigation of Information Systems Use Patterns: Technological Events as Triggers, the Effect of Time, and Consequences for Performance',
                'Toward Generalizable Sociomaterial Inquiry: A Computational Approach for Zooming In and Out of Sociomaterial Routines',
                'Coping with Information Technology: Mixed Emotions, Vacillation, and Nonconforming Use Patterns',
                'Information Technology Use as a Learning Mechanism: The Impact of IT Use on Knowledge Transfer Effectiveness, Absorptive Capacity, and Franchisee Performance',
                'ICT, Intermediaries, and the Transformation of Gendered Power Structures',
                'Multiplex Appropriation in Complex Systems Implementation: The Case of Brazil''s Correspondent Banking System',
                'Revisiting Group-Based Technology Adoption as a Dynamic Process: The Role of Changing Attitude-Rationale Configurations',
                'Capturing the Complexity of Malleable IT Use: Adaptive Structuration Theory for Individuals',
                'A Temporally Situated Self-Agency Theory of Information Technology Reinvention'
                );'''
    df_entities= conn.execute(query).fetchdf()

In [13]:
print(df_entities.head())

  para_id                         ent_id        level_3
0  1033_0                  IS technology  IS technology
1  1033_1                  IS technology  IS technology
2  1033_1              theoretical model          model
3  1033_1                  IS technology  IS technology
4  1033_1  theory of bounded rationality   named theory


In [14]:
df_entities = df_entities.groupby('para_id').agg({
    'ent_id': lambda x: ', '.join(sorted(set(x))),
    'level_3': lambda x: ', '.join(sorted(set(x)))
}).reset_index()

In [15]:
print(df_entities.head())

    para_id                                             ent_id  \
0    1033_0                                      IS technology   
1    1033_1  IS technology, United States, individual level...   
2  1033_100                                   banking industry   
3  1033_101  banking industry, database system, individual ...   
4  1033_102  banking industry, database system, electronic ...   

                                             level_3  
0                                      IS technology  
1  IS technology, IS topic, geographic names, lev...  
2                                    economic sector  
3       IS technology, economic sector, study object  
4    IS technology, economic sector, research method  
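
The aggregation collapses all entity rows of a paragraph into one comma-separated string of unique, sorted values. The same logic in plain Python, using only the first five entity rows printed earlier, so the result for `1033_1` is shorter than in the real output. The helper name `aggregate_entities` is hypothetical.

```python
from collections import defaultdict


def aggregate_entities(rows):
    """Map each para_id to a ', '-joined string of its unique, sorted entities."""
    grouped = defaultdict(set)
    for para_id, ent_id in rows:
        grouped[para_id].add(ent_id)
    return {pid: ", ".join(sorted(ents)) for pid, ents in grouped.items()}


rows = [
    ("1033_0", "IS technology"),
    ("1033_1", "IS technology"),
    ("1033_1", "theoretical model"),
    ("1033_1", "IS technology"),
    ("1033_1", "theory of bounded rationality"),
]
print(aggregate_entities(rows)["1033_1"])
# IS technology, theoretical model, theory of bounded rationality
```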


In [16]:
merged_df = merged_df.merge(df_entities, on=['para_id'])

In [17]:
print(merged_df.head())

  para_id last_section_title  \
0   926_0               None   
1   926_1           Abstract   
2   926_3     Introduction 1   
3   926_4     Introduction 1   
4   926_5     Introduction 1   

                                               title  \
0  Nature and Nurture: The Impact of Automaticity...   
1  Nature and Nurture: The Impact of Automaticity...   
2  Nature and Nurture: The Impact of Automaticity...   
3  Nature and Nurture: The Impact of Automaticity...   
4  Nature and Nurture: The Impact of Automaticity...   

                                           paragraph  \
0  NATURE AND NURTURE: THE IMPACT OF AUTOMATICITY...   
1  Much prior research on virtual teams has exami...   
2  Prior research has argued-and demonstrated emp...   
3  In this paper, we argue that nurture has an eq...   
4  Genre rules are like many other social structu...   

                                              ent_id  \
0       IT supported collaboration, virtual teamwork   
1  IT supported colla

In [18]:
print(merged_df.shape)

(1829, 6)


In [36]:
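# Readable names go into a separate dataframe (merged_df_display, a new name); merged_df keeps the original column names that the following cells select by.
merged_df_display = merged_df.rename(columns={'title':'Title of the article', 'last_section_title': 'Title of the section', 'paragraph': 'Paragraph', 'ent_id': 'Entity from text', 'level_3': 'Entity more general'})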

In [19]:
merged_df = merged_df[['para_id','title', 'last_section_title', 'paragraph', 'ent_id', 'level_3']]

In [20]:
null_counts = merged_df.isnull().sum()

In [21]:
print(null_counts)

para_id                0
title                  0
last_section_title    22
paragraph              0
ent_id                 0
level_3                0
dtype: int64


In [22]:
merged_df = merged_df.fillna('No section information')

In [23]:
null_counts = merged_df.isnull().sum()

In [24]:
print(null_counts)

para_id               0
title                 0
last_section_title    0
paragraph             0
ent_id                0
level_3               0
dtype: int64


In [25]:
def concatenate_with_headers(df):
    concatenated_rows = []
    for index, row in df.iterrows():
        concatenated_row = " ".join([f"{col}: {row[col]}" for col in df.columns])
        concatenated_rows.append(concatenated_row)
    return concatenated_rows

In [27]:
documents2 = [
    Document(
        page_content=row['paragraph'], 
        metadata={
            'para_id': row['para_id'],
            'title': row['title'],
            'last_section_title': row['last_section_title'],
            'ent_id': row['ent_id'],
            'level_3': row['level_3']
        }
    )
    for _, row in merged_df.iterrows()
]

In [28]:
for doc in documents2[:5]:
    print("Page Content:", doc.page_content)
    print("Metadata:", doc.metadata)

Page Content: NATURE AND NURTURE: THE IMPACT OF AUTOMATICITY AND THE STRUCTURATION OF COMMUNICATION ON VIRTUAL TEAM BEHAVIOR AND PERFORMANCE 
Metadata: {'para_id': '926_0', 'title': 'Nature and Nurture: The Impact of Automaticity and the Structuration of Communication on Virtual Team Behavior and Performance', 'last_section_title': 'No section information', 'ent_id': 'IT supported collaboration, virtual teamwork', 'level_3': 'IS technology, IS topic'}
Page Content: Much prior research on virtual teams has examined the impact of the features and capabilities of different communication tools (the nature of communication) on team performance. In this paper, we examine how the social structures (i.e., genre rules) that emerge around different communication tools (the nurture of communication) can be as important in influencing performance. During habitual use situations, team members enact genre rules associated with communication tools without conscious thought via automaticity. These gen

In [29]:
persist_directory2 = '../RAG_multiple_vector_stores/paragraphs_chroma_db_MISQ'

In [30]:
vectordb_sentences = Chroma.from_documents(documents=documents2, 
                                 embedding=embedding_model,
                                 persist_directory=persist_directory2)
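
Because each paragraph carries its section title in the metadata, a search can be restricted to one section. A hedged sketch, assuming the store's `similarity_search` accepts a `filter` dict on metadata fields, as the LangChain Chroma wrapper does to my knowledge. The helper name `search_in_section` is hypothetical, and `FakeStore` is a stand-in that only echoes the call so the example runs without a database.

```python
def search_in_section(store, query, section, k=3):
    """Semantic search limited to paragraphs from one section."""
    return store.similarity_search(query, k=k, filter={"last_section_title": section})


class FakeStore:
    """Stand-in that records the call instead of searching."""

    def similarity_search(self, query, k=4, filter=None):
        return [f"query={query!r} k={k} filter={filter}"]


print(search_in_section(FakeStore(), "habit and continued IT use", "Abstract", k=2))
# With the real store:
# search_in_section(vectordb_sentences, "habit and continued IT use", "Abstract")
```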