## Set Up the Environment

In [1]:
%run setup.ipynb

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that enables the loading of data as documents from a pre-configured source.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature, which allows data to be loaded into memory gradually as needed.
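
The difference between the two methods can be sketched without LangChain at all. The snippet below is a minimal illustration, not LangChain's implementation: `SimpleDocument` and `LineLoader` are hypothetical names, and the loader simply treats each non-empty line of a string as one document. `lazy_load` is a generator that yields documents one at a time, while `load` collects them into a list.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader: each non-empty line of a string becomes a document."""

    def __init__(self, text: str, source: str = "in-memory"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield one document at a time so the caller controls memory use.
        for line_number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield SimpleDocument(
                    page_content=line.strip(),
                    metadata={"source": self.source, "line": line_number},
                )

    def load(self) -> List[SimpleDocument]:
        # Eager variant: materialize everything produced by lazy_load.
        return list(self.lazy_load())


loader = LineLoader("first line\n\nsecond line\n")
all_docs = loader.load()              # every document, held in memory at once
first_doc = next(loader.lazy_load())  # only the first document is built
```

Building `load` on top of `lazy_load` keeps the two code paths consistent, so the choice between them is purely about memory use.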

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).


### JSON Loader

[JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

[JSON Lines](https://jsonlines.org/) is a file format where each line is a valid JSON value.
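
As a quick illustration of the format, the following sketch parses a small JSON Lines string using only the standard `json` module. The sample records are made up for this example; each non-empty line is decoded independently.

```python
import json

# Three made-up records, one JSON value per line.
jsonl_text = (
    '{"sender_name": "User A", "content": "Hello"}\n'
    '{"sender_name": "User B", "content": "Hi there"}\n'
    '{"sender_name": "User A", "content": "Bye"}\n'
)

# Decode every non-empty line on its own; calling json.loads on the
# whole string would fail because it is not a single JSON value.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

senders = [record["sender_name"] for record in records]
print(senders)
```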

LangChain implements a [JSONLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html) to convert JSON and JSONL data into LangChain `Document` objects. It uses a specified [`jq` schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files, allowing for the extraction of specific fields into the content and metadata of the LangChain Document.

`JSONLoader` relies on the `jq` Python package. See the [jq manual](https://jqlang.github.io/jq/manual/) for detailed documentation of the `jq` syntax.
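
To build intuition for the `jq` expressions used in the rest of this section, the sketch below shows roughly what three of them select, written as plain Python over an already-parsed dictionary. It only approximates `jq` semantics on a made-up record and does not call the `jq` package.

```python
# Made-up, already-parsed JSON data shaped like a small chat export.
chat = {
    "title": "Sample chat",
    "messages": [
        {"sender_name": "User A", "content": "Hello"},
        {"sender_name": "User B", "content": "Hi there"},
    ],
}

# '.' selects the whole document as a single value.
whole_document = [chat]

# '.messages[]' emits each element of the "messages" array.
each_message = list(chat["messages"])

# '.messages[].content' emits the "content" field of each element
# (jq yields null, here None, when an object lacks the key).
each_content = [message.get("content") for message in chat["messages"]]

print(len(whole_document), len(each_message), each_content)
```

Each emitted value becomes one `Document`, which is why `jq_schema='.'` yields a single document for a regular JSON file while `.messages[]` yields one per message.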

In [2]:
import json

# Sample chat export: a conversation between two users
data = {
    'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_meeting.jpg'},
    'is_still_participant': True,
    'joinable_mode': {'link': '', 'mode': 1},
    'magic_words': [],
    'messages': [
        {'content': 'See you soon!',
         'sender_name': 'User B',
         'timestamp_ms': 1675597571851},
        {'content': 'Thanks for the update! See you then.',
         'sender_name': 'User A',
         'timestamp_ms': 1675597435669},
        {'content': 'Actually, the green one is sold out.',
         'sender_name': 'User B',
         'timestamp_ms': 1675596277579},
        {'content': 'I was hoping to purchase the green one!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595140251},
        {'content': 'I’m really interested in the green one, not the red!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595109305},
        {'content': 'Here’s the $150 for it.',
         'sender_name': 'User B',
         'timestamp_ms': 1675595068468},
        {'photos': [{'creation_timestamp': 1675595059,
                     'uri': 'image_of_the_item.jpg'}],
         'sender_name': 'User B',
         'timestamp_ms': 1675595060730},
        {'content': 'It typically sells for at least $200 online',
         'sender_name': 'User B',
         'timestamp_ms': 1675595045152},
        {'content': 'How much are you asking?',
         'sender_name': 'User A',
         'timestamp_ms': 1675594799696},
        {'content': 'Good morning! $50 is far too low.',
         'sender_name': 'User B',
         'timestamp_ms': 1675577876645},
        {'content': 'Hello! I’m interested in the item you posted. I can offer $50. Let me know if that works for you. Thanks!',
         'sender_name': 'User A',
         'timestamp_ms': 1675549022673}
    ],
    'participants': [{'name': 'User A'}, {'name': 'User B'}],
    'thread_path': 'inbox/User A and User B chat',
    'title': 'User A and User B chat'
}

# Save the sample data to a JSON file
with open('../../docs/chat_data.json', 'w') as file:
    json.dump(data, file, indent=4)


In [3]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path="../../docs/chat_data.json",
                    jq_schema='.',
                    text_content=False)
docs = loader.load()

In [4]:
len(docs)

1

In [5]:
print(docs[0].page_content)
print(docs[0].metadata)

{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos": [{"creation_timestamp": 1675595059, "uri": "image_of_the_item.jpg"}], "sender_name": "User B", "timestamp_ms": 1675595060730}, {"content": "It typically sell

In [6]:
print(docs[0])

page_content='{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos": [{"creation_timestamp": 1675595059, "uri": "image_of_the_item.jpg"}], "sender_name": "User B", "timestamp_ms": 1675595060730}, {"content": "It 

In [7]:
docs

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 1}, page_content='{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos":

Suppose we want to extract the values under the `messages` key of the JSON data, one document per message:

In [8]:
loader = JSONLoader(
    file_path='../../docs/chat_data.json',
    jq_schema='.messages[]',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 1}, page_content='{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 2}, page_content='{"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 3}, page_content='{"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 4}, page_content='{"content": "I was hoping to purchase the green one!", "sender_nam

Suppose we want to extract only the `content` field of each entry under the `messages` key:

In [9]:
loader = JSONLoader(
    file_path='../../docs/chat_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 1}, page_content='See you soon!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 2}, page_content='Thanks for the update! See you then.'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 3}, page_content='Actually, the green one is sold out.'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 4}, page_content='I was hoping to purchase the green one!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/chat_data.json', 'seq_num': 5}, page_content='I’m really interested in the green one, not the red!'),
 Document(metadata={'sou

#### Basic JSON Loading
Before configuring a loader, it helps to inspect the raw file with the standard `json` module to see which keys are available:

In [10]:
from pprint import pprint

In [11]:
file_path = '../../docs/facebook_chat.json'
with open(file_path, "r") as file:
    data = json.load(file)

pprint(data)

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              {'content': 'I thought you were selling the blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595140251},
              {'content': 'Im not interested in this bag. Im interested in the '
                          'blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595109305},
   

#### Using JSONLoader for Structured Retrieval:
Use `jq_schema` to select only the required fields from the JSON structure (schema-based retrieval).

In [12]:
loader = JSONLoader(
    file_path='../../docs/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 1}, page_content='Bye!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 2}, page_content='Oh no worries! Bye'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 3}, page_content='No Im sorry it was my mistake, the blue one is not for sale'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 4}, page_content='I thought you were selling the blue one!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 5}, page_content='Im not interested in this bag. Im interested in the blue one!')

#### Processing JSON Lines (JSONL):
Set `json_lines=True` to load files in which each line is a separate JSON value; `jq_schema` is then applied to each line.

In [13]:
# Example - JSON (Processing JSON Lines)

loader = JSONLoader(
    file_path='../../docs/facebook_chat_messages.jsonl',
    jq_schema=".",
    text_content=False,
    json_lines=True
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 1}, page_content='{"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 2}, page_content='{"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no worries! Bye"}'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 3}, page_content='{"sender_name": "User 2", "timestamp_ms": 1675596277579, "content": "No Im sorry it was my mistake, the blue one is not for sale"}')]


In [14]:
# Example - JSON Lines (extract only the sender name from each line)

loader = JSONLoader(
    file_path='../../docs/facebook_chat_messages.jsonl',
    jq_schema='.sender_name',
    text_content=False,
    json_lines=True
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 1}, page_content='User 2'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 2}, page_content='User 1'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 3}, page_content='User 2')]


In [15]:
# Example - JSON (Use jq_schema='.' and content_key for simpler extraction)

loader = JSONLoader(
    file_path='../../docs/facebook_chat_messages.jsonl',
    jq_schema='.',
    content_key="sender_name",
    text_content=False,
    json_lines=True
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 1}, page_content='User 2'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 2}, page_content='User 1'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat_messages.jsonl', 'seq_num': 3}, page_content='User 2')]


#### Adding Metadata from JSON:
Pass a custom `metadata_func` to copy additional fields from each record into the document metadata, which adds context and traceability.
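
The sketch below exercises such a function on its own, without a loader, to show its contract: it receives one extracted record plus the default metadata and returns the updated dictionary. The function name, the sample records, and the starting metadata values are made up for illustration; judging by the outputs in this section, the loader supplies `source` and `seq_num` by default.

```python
def add_sender_metadata(record: dict, metadata: dict) -> dict:
    """Copy selected fields from a JSON record into the document metadata."""
    # dict.get returns None instead of raising when a field is missing.
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


# Made-up record and default metadata, for illustration only.
complete_record = {"sender_name": "User A", "timestamp_ms": 1, "content": "Hello"}
partial_record = {"content": "No sender or timestamp here"}

enriched = add_sender_metadata(complete_record, {"source": "example.json", "seq_num": 1})
sparse = add_sender_metadata(partial_record, {"source": "example.json", "seq_num": 2})

print(enriched)
print(sparse)
```

Using `record.get(...)` keeps the function tolerant of records that lack a field, at the cost of storing `None` in the metadata for those records.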

In [16]:
# Example - JSON (Adding Metadata from JSON)

def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata

loader = JSONLoader(
    file_path='../../docs/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func # Add metadata from JSON
)

data = loader.load()
pprint(data)

[Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}, page_content='Bye!'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}, page_content='Oh no worries! Bye'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}, page_content='No Im sorry it was my mistake, the blue one is not for sale'),
 Document(metadata={'source': '/Users/sourav.banerjee/Documents/My Codebases/GenerativAI_Demystified/RAG/docs/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}, page_content='I thought you were selling the blue one!'),