## Loading in Multiple Documents

My current solution is drag and drop, but I want to preload data from our Notion and be able to query it.

In [2]:
import os
import sys

from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import RecursiveCharacterTextSplitter

data_dir = './../data/'

loader = DirectoryLoader(data_dir, glob="*.txt")

In [42]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
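
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size`. The helper name `sliding_chunks` is hypothetical and only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
# chunks == ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the role the 100-character overlap plays in the splitter configured above.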

In [43]:
docs = loader.load()

In [44]:
docs[0].page_content

"Title: 10/19/23 Weekly Meeting\n\nCreated On: 2023-10-20T05:19:00.000Z\n\nTh, Oct 19, 2023 | Weekly Meetings - HarmonyHaven Hackathon Attendees: Emma Myint Kristy Deng Susan Micah Fleming  Action items from the last meeting:\n\nUpdates on wireframes, color schemes, icons, illustrations, and fonts.\n\nDecisions on KPIs (Key Performance Indicators) to be added to the dashboard wireframe.\n\nProgress on the Journal Demo.\n\nResearch related to the remote virtual hub and additional data points supporting the HarmonyHaven Solution.\n\nProgress on the Presentation for Demo Day Draft Ongoing Action Items Kristy will schedule the upcoming meeting on Vowel and share the link via Skype before the meeting on Oct. 24th. Kristy, Brett, and Micah will collaborate on the Dashboard content, focusing on statistics, metrics, and related details. The team needs to finalize their choice between the two primary Nav Bar options by the end of the day on Friday, Oct. 20th. Micah is responsible for updating t

In [45]:
# Split every loaded document; docs_list holds one list of chunks per document.
docs_list = []
for doc in docs:
    chunks = text_splitter.split_text(doc.page_content)
    docs_list.append(chunks)

In [49]:
docs_list

[['Title: 10/19/23 Weekly Meeting\n\nCreated On: 2023-10-20T05:19:00.000Z\n\nTh, Oct 19, 2023 | Weekly Meetings - HarmonyHaven Hackathon Attendees: Emma Myint Kristy Deng Susan Micah Fleming  Action items from the last meeting:\n\nUpdates on wireframes, color schemes, icons, illustrations, and fonts.\n\nDecisions on KPIs (Key Performance Indicators) to be added to the dashboard wireframe.\n\nProgress on the Journal Demo.\n\nResearch related to the remote virtual hub and additional data points supporting the HarmonyHaven Solution.',
  'Progress on the Presentation for Demo Day Draft Ongoing Action Items Kristy will schedule the upcoming meeting on Vowel and share the link via Skype before the meeting on Oct. 24th. Kristy, Brett, and Micah will collaborate on the Dashboard content, focusing on statistics, metrics, and related details. The team needs to finalize their choice between the two primary Nav Bar options by the end of the day on Friday, Oct. 20th. Micah is responsible for updati

In [50]:
# Flatten the per-document chunk lists into a single list of chunks.
texts2 = [item for sublist in docs_list for item in sublist]

In [53]:
texts2

['Title: 10/19/23 Weekly Meeting\n\nCreated On: 2023-10-20T05:19:00.000Z\n\nTh, Oct 19, 2023 | Weekly Meetings - HarmonyHaven Hackathon Attendees: Emma Myint Kristy Deng Susan Micah Fleming  Action items from the last meeting:\n\nUpdates on wireframes, color schemes, icons, illustrations, and fonts.\n\nDecisions on KPIs (Key Performance Indicators) to be added to the dashboard wireframe.\n\nProgress on the Journal Demo.\n\nResearch related to the remote virtual hub and additional data points supporting the HarmonyHaven Solution.',
 'Progress on the Presentation for Demo Day Draft Ongoing Action Items Kristy will schedule the upcoming meeting on Vowel and share the link via Skype before the meeting on Oct. 24th. Kristy, Brett, and Micah will collaborate on the Dashboard content, focusing on statistics, metrics, and related details. The team needs to finalize their choice between the two primary Nav Bar options by the end of the day on Friday, Oct. 20th. Micah is responsible for updating

In [56]:
text_metadata = [{"source": f"{i+1}-pl"} for i in range(len(texts2))]

In [57]:
text_metadata

[{'source': '1-pl'},
 {'source': '2-pl'},
 {'source': '3-pl'},
 {'source': '4-pl'},
 {'source': '5-pl'},
 {'source': '6-pl'}]
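
The numbered `source` values above identify a chunk's position but not the file it came from. Below is a minimal sketch of one alternative. It assumes each document can be reduced to a plain `(source, text)` pair and that any callable returning a list of strings can stand in for `text_splitter.split_text`. The names `chunk_with_sources` and `fake_split`, and the sample file names, are hypothetical.

```python
from typing import Callable, Iterable


def chunk_with_sources(
    docs: Iterable[tuple[str, str]],
    split: Callable[[str], list[str]],
) -> tuple[list[str], list[dict]]:
    """Flatten all chunks into one list and build a parallel metadata list."""
    texts, metadatas = [], []
    for source, content in docs:
        for i, chunk in enumerate(split(content)):
            texts.append(chunk)
            # Record both the originating file and the chunk's position in it.
            metadatas.append({"source": source, "chunk": i})
    return texts, metadatas


def fake_split(text: str) -> list[str]:
    """Stand-in splitter for this example only: break on blank lines."""
    return [part for part in text.split("\n\n") if part]


example_docs = [
    ("meeting_a.txt", "Agenda\n\nAction items"),
    ("meeting_b.txt", "Notes"),
]
example_texts, example_metadatas = chunk_with_sources(example_docs, fake_split)
```

With real loader output, the pairs could be built from each document's `metadata` and `page_content`, assuming the loader fills in a `source` entry there.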

In [22]:
file_dir = './../data/10_24_23WeeklyMeeting.txt'

In [7]:
with open(file_dir,'r') as f:
    file = f.read()

In [36]:
type(file)

str

In [14]:
texts = text_splitter.split_text(file)

In [19]:
metadatas = [{"source": f"{i}-pl"} for i in range(len(texts))]

In [20]:
metadatas

[{'source': '0-pl'}, {'source': '1-pl'}, {'source': '2-pl'}]

## Using Chroma for CSV Files

In [1]:
from langchain.document_loaders.csv_loader import CSVLoader

In [5]:
loader = CSVLoader(file_path=data_dir+'test_db.csv')

data = loader.load()

In [6]:
data

[Document(page_content=': 0\nrecord_id: 1\ndatabase_id: 82ad3921-0b18-4e75-9441-c93357718bf0\nDescription: Document mostly geared towards presentations, but good thoughts we could use on the benefit of memorable visualizations and stories.\nLink: https://www.wilmu.edu/edtech/documents/the-science-of-effective-presenations---prezi-vs-powerpoint.pdf\nTitle: The Science of Effective Presentations', metadata={'source': './../data/test_db.csv', 'row': 0}),
 Document(page_content=': 1\nrecord_id: 2\ndatabase_id: 82ad3921-0b18-4e75-9441-c93357718bf0\nDescription: Focused towards project-based visualization, but if we want to add that feature we could convert some of these ideas into a garden representation\nLink: https://www.getrodeo.io/blog/how-to-visualize-project-progress\nTitle: How to Visualize Project Progress', metadata={'source': './../data/test_db.csv', 'row': 1}),
 Document(page_content=': 2\nrecord_id: 3\ndatabase_id: 82ad3921-0b18-4e75-9441-c93357718bf0\nDescription: Study driven 

In [7]:
loader = DirectoryLoader(data_dir, glob='*.csv', loader_cls=CSVLoader)

documents = loader.load()

In [8]:
documents

[Document(page_content=': 0\nrecord_id: 1\ndatabase_id: 82ad3921-0b18-4e75-9441-c93357718bf0\nDescription: Document mostly geared towards presentations, but good thoughts we could use on the benefit of memorable visualizations and stories.\nLink: https://www.wilmu.edu/edtech/documents/the-science-of-effective-presenations---prezi-vs-powerpoint.pdf\nTitle: The Science of Effective Presentations', metadata={'source': '../data/test_db.csv', 'row': 0}),
 Document(page_content=': 1\nrecord_id: 2\ndatabase_id: 82ad3921-0b18-4e75-9441-c93357718bf0\nDescription: Focused towards project-based visualization, but if we want to add that feature we could convert some of these ideas into a garden representation\nLink: https://www.getrodeo.io/blog/how-to-visualize-project-progress\nTitle: How to Visualize Project Progress', metadata={'source': '../data/test_db.csv', 'row': 1}),
 Document(page_content=': 2\nrecord_id: 3\ndatabase_id: 82ad3921-0b18-4e75-9441-c93357718bf0\nDescription: Study driven by t

I was expecting this to load as one document with multiple rows, but I do see the value in parsing each row out as its own document. Using the same DirectoryLoader also works to capture multiple CSVs, so I'll go with that process.
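
The one-document-per-row behaviour seen above can be approximated with the standard library. This is a sketch of the idea, not `CSVLoader`'s actual implementation: each row becomes a block of `column: value` lines, with the file path and row index kept as metadata. The function name `rows_to_records` and the sample data are made up for illustration.

```python
import csv
import io


def rows_to_records(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a text block plus metadata, one record per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row_index, row in enumerate(reader):
        # Join "column: value" pairs line by line, similar to the output above.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        records.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return records


sample_csv = "record_id,Title\n1,First title\n2,Second title\n"
records = rows_to_records(sample_csv, source="sample.csv")
```

Keeping rows separate means a query can return just the matching record together with its row number, rather than the whole table.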

To start, I'll work on adding the build CSV process to my ingest file.
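
The ingest file is not shown in this notebook, so the following is only a guess at what a build CSV step might look like, written with the standard library rather than LangChain. The function name `build_csv_records`, its parameters, and the demo files are all assumptions.

```python
import csv
import tempfile
from pathlib import Path


def build_csv_records(data_dir: str, pattern: str = "*.csv") -> list[dict]:
    """Collect one record per row from every CSV file matching pattern."""
    records = []
    for path in sorted(Path(data_dir).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            for row_index, row in enumerate(csv.DictReader(handle)):
                content = "\n".join(f"{key}: {value}" for key, value in row.items())
                records.append(
                    {
                        "page_content": content,
                        "metadata": {"source": str(path), "row": row_index},
                    }
                )
    return records


# Demonstrate on a throwaway directory holding two small CSV files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("Title,Link\nOne,http://example.com/1\n", encoding="utf-8")
    Path(tmp, "b.csv").write_text(
        "Title,Link\nTwo,http://example.com/2\nThree,http://example.com/3\n",
        encoding="utf-8",
    )
    demo_records = build_csv_records(tmp)
```

Sorting the paths keeps the record order stable between runs, which makes it easier to compare one ingest against the next.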