## Set Up the Environment

In [1]:
%run setup.ipynb

## Document Loaders

Document loaders import data from various sources into LangChain as `Document` objects. A `Document` holds a piece of text (`page_content`) along with its associated `metadata`.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that reads data from its configured source and returns it as a list of `Document` objects.
- **Lazy Load Option:** Some loaders also support lazy loading (`lazy_load`), which yields documents one at a time so the full data set does not have to sit in memory at once.
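
The difference between the two methods can be sketched without any LangChain dependency. The `Document` dataclass and `ToyLineLoader` below are hypothetical stand-ins written only for this illustration; they mirror the general pattern (an eager `load` built on top of a generator-based `lazy_load`) and are not LangChain's own classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyLineLoader:
    """Hypothetical loader: one Document per non-empty line of a string."""

    def __init__(self, text: str, source: str = "in-memory"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Generator: each Document is built only when the caller asks for it.
        for line_no, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield Document(line.strip(), {"source": self.source, "line": line_no})

    def load(self) -> List[Document]:
        # Eager variant: drain the generator into a list held in memory.
        return list(self.lazy_load())


toy_loader = ToyLineLoader("first line\n\nsecond line")
docs = toy_loader.load()                    # all documents at once
first = next(toy_loader.lazy_load())        # only the first document is built
```

With a large source, iterating over `lazy_load()` lets each document be processed and discarded before the next one is read, whereas `load()` keeps every document in the returned list.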

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).


### Microsoft Office Document Loaders

The Microsoft Office suite of productivity software includes Microsoft Word, Excel, PowerPoint, Outlook, and OneNote. It is available on Windows and macOS, as well as on Android and iOS.

[Unstructured.io](https://docs.unstructured.io/open-source/introduction/overview) provides partitioning functions for a variety of MS Office document formats. See the [partitioning documentation](https://docs.unstructured.io/open-source/core-functionality/partitioning) for the full list.

Here we will use LangChain's [`UnstructuredWordDocumentLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader.html) to load data from an MS Word document.

In [32]:
from pprint import pprint

In [33]:
doc_path = '../../docs/Intel Strategy.docx'

In [34]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader(doc_path)
data = loader.load()

In [36]:
print(len(data))

1


In [39]:
data

[Document(metadata={'source': '../../docs/Intel Strategy.docx'}, page_content="Intel Strategy\n\nOver the last few years, Intel, one of the world’s biggest chipmakers, has been transitioning towards a more datacentric approach than PC-centric. This is a welcome move, not only for the company but for innovation in technology as well, says Pat Gelsinger, CEO, Intel. According to him, the changing times as well as strides in innovation have placed Intel in a position where it can leverage the “superpowers” to make the world of computing better sustainable and far superior to the present scenario.\n\nThe Superpowers\n\nPervasive connectivity, Ubiquitous compute, AI and Cloud-to-Edge Infrastructure -- the four superpowers that will bolster Intel’s footprints into the future, will also play a key role in transforming the world of computing in any device.\n\n“Each of these superpowers is impressive on its own, but when they come together, that’s magic. If you’re not applying AI to every one o

In [42]:
for elem in data:
    print(elem)
    print("-"*100)

page_content='Intel Strategy

Over the last few years, Intel, one of the world’s biggest chipmakers, has been transitioning towards a more datacentric approach than PC-centric. This is a welcome move, not only for the company but for innovation in technology as well, says Pat Gelsinger, CEO, Intel. According to him, the changing times as well as strides in innovation have placed Intel in a position where it can leverage the “superpowers” to make the world of computing better sustainable and far superior to the present scenario.

The Superpowers

Pervasive connectivity, Ubiquitous compute, AI and Cloud-to-Edge Infrastructure -- the four superpowers that will bolster Intel’s footprints into the future, will also play a key role in transforming the world of computing in any device.

“Each of these superpowers is impressive on its own, but when they come together, that’s magic. If you’re not applying AI to every one of your business processes, you’re falling behind. We’re seeing this acros

In [12]:
print(data[0].page_content)

Intel Strategy

Over the last few years, Intel, one of the world’s biggest chipmakers, has been transitioning towards a more datacentric approach than PC-centric. This is a welcome move, not only for the company but for innovation in technology as well, says Pat Gelsinger, CEO, Intel. According to him, the changing times as well as strides in innovation have placed Intel in a position where it can leverage the “superpowers” to make the world of computing better sustainable and far superior to the present scenario.

The Superpowers

Pervasive connectivity, Ubiquitous compute, AI and Cloud-to-Edge Infrastructure -- the four superpowers that will bolster Intel’s footprints into the future, will also play a key role in transforming the world of computing in any device.

“Each of these superpowers is impressive on its own, but when they come together, that’s magic. If you’re not applying AI to every one of your business processes, you’re falling behind. We’re seeing this across every indust

### Load Word Doc with Complex Parsing and Section-Based Chunks

In [43]:
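loader = UnstructuredWordDocumentLoader(doc_path,
                                        strategy='fast',
                                        chunking_strategy='by_title',  # chunk along section (title) boundaries
                                        max_characters=3000,  # hard upper limit on chunk size, in characters
                                        new_after_n_chars=2500,  # soft limit: preferred chunk size, in characters
                                        mode='elements')  # one Document per chunk instead of a single Document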
data = loader.load()
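
To make the interaction between `chunking_strategy="by_title"`, `max_characters`, and `new_after_n_chars` more concrete, here is a deliberately simplified sketch of section-based chunking in plain Python. It is not Unstructured's implementation: the `(category, text)` element tuples and the `chunk_by_title` function are assumptions made for illustration, oversized single elements are not split, and every title starts a new chunk. The real library appears to also merge small adjacent sections, which would explain why the first chunk printed in this notebook spans two headings.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # hypothetical (category, text), e.g. ("Title", "The Superpowers")


def chunk_by_title(elements: List[Element],
                   max_characters: int = 3000,
                   new_after_n_chars: int = 2500) -> List[str]:
    """Group element texts into chunks, starting a new chunk at every title."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0  # length of the current chunk once joined with "\n\n"

    def flush() -> None:
        nonlocal current, size
        if current:
            chunks.append("\n\n".join(current))
        current, size = [], 0

    for category, text in elements:
        added = len(text) + (2 if current else 0)  # 2 accounts for the "\n\n" joiner
        starts_section = category == "Title"
        soft_limit_hit = size >= new_after_n_chars       # preferred size already reached
        hard_limit_hit = size + added > max_characters   # adding would exceed the maximum
        if current and (starts_section or soft_limit_hit or hard_limit_hit):
            flush()
            added = len(text)  # first element of a new chunk has no joiner
        current.append(text)
        size += added

    flush()
    return chunks


sample_elements = [
    ("Title", "Intro"),
    ("NarrativeText", "a" * 20),
    ("Title", "Next"),
    ("NarrativeText", "b" * 20),
]
sample_chunks = chunk_by_title(sample_elements)
```

In this sketch the soft limit closes a chunk as soon as it has grown to the preferred size, while the hard limit prevents an element from being appended when the result would be too long; a title boundary closes the chunk regardless of size.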

In [44]:
len(data)

4

In [46]:
pprint(data)

[Document(metadata={'source': '../../docs/Intel Strategy.docx', 'emphasized_text_contents': ['The Superpowers', 'Pervasive Connectivity', 'Ubiquitous compute'], 'emphasized_text_tags': ['b', 'b', 'b'], 'file_directory': '../../docs', 'filename': 'Intel Strategy.docx', 'last_modified': '2025-05-30T10:16:46', 'orig_elements': 'eJzNV2tvGzcW/SuEPnUBjVZvS95PhZGmBorUQJxdLLqFwSHvjAjPkFOSI2VS7H/fc8mRH4kDNNgCNWDIEp/3ce65h7/8PqGGWrLxzujJpZhcVLTdyB0V2/lqV6zVThX75XxZzPfVrrqQ5YVc0mQqJi1FqWWU2PP7RMlItfPDnaYuHjA0x4rKNHSnjScVMcVnz2Z/x592KkzGeStb4plrG6kR76Png4YZlnzkJY0M8a512lSGknWwZFPMN8VqfruYXy62l+vt5L9YGOlj/PIcPiIOXbrhgx2NNJ9I3/Jy7PvceVrtd3ovqViXe4mP5bYoq8W8qPZyvljL5Xqx2b5e538+khfxQIJ3iopOYiDpw1SkC6bCWRKuSitOzjf6P/1yvtgHUZq6JuxQB9O18p54y0FinMgKWGWDicZZY2sR3Ul6HYQUrfMkOAYK4fNGCdl13kl1wPnSipurYpyYiduDCcLwphM1yrWEzUeaCusibGoGUblsN+Y6aQdR9jGNGWvdUfLd+CoiqYN1jasHAeNwFFwKcgjiRkbxlpoAA8lPxdWbn0eXZ+J7pZzX2XJxMO003wML6zRoWgrn0/h/gMUaQ8Y+vfwgjyS6Bq7qfDDPS9G5HBhxOhCCYRBBeN4Q0iBrSjelEKvQd+Q7d0Jk04BmazjSj7ngxLD7f

In [48]:
print(data[0])

page_content='Intel Strategy

Over the last few years, Intel, one of the world’s biggest chipmakers, has been transitioning towards a more datacentric approach than PC-centric. This is a welcome move, not only for the company but for innovation in technology as well, says Pat Gelsinger, CEO, Intel. According to him, the changing times as well as strides in innovation have placed Intel in a position where it can leverage the “superpowers” to make the world of computing better sustainable and far superior to the present scenario.

The Superpowers

Pervasive connectivity, Ubiquitous compute, AI and Cloud-to-Edge Infrastructure -- the four superpowers that will bolster Intel’s footprints into the future, will also play a key role in transforming the world of computing in any device.

“Each of these superpowers is impressive on its own, but when they come together, that’s magic. If you’re not applying AI to every one of your business processes, you’re falling behind. We’re seeing this acros

In [49]:
print(data[0].page_content)

Intel Strategy

Over the last few years, Intel, one of the world’s biggest chipmakers, has been transitioning towards a more datacentric approach than PC-centric. This is a welcome move, not only for the company but for innovation in technology as well, says Pat Gelsinger, CEO, Intel. According to him, the changing times as well as strides in innovation have placed Intel in a position where it can leverage the “superpowers” to make the world of computing better sustainable and far superior to the present scenario.

The Superpowers

Pervasive connectivity, Ubiquitous compute, AI and Cloud-to-Edge Infrastructure -- the four superpowers that will bolster Intel’s footprints into the future, will also play a key role in transforming the world of computing in any device.

“Each of these superpowers is impressive on its own, but when they come together, that’s magic. If you’re not applying AI to every one of your business processes, you’re falling behind. We’re seeing this across every indust

In [50]:
print(data[0].metadata)

{'source': '../../docs/Intel Strategy.docx', 'emphasized_text_contents': ['The Superpowers', 'Pervasive Connectivity', 'Ubiquitous compute'], 'emphasized_text_tags': ['b', 'b', 'b'], 'file_directory': '../../docs', 'filename': 'Intel Strategy.docx', 'last_modified': '2025-05-30T10:16:46', 'orig_elements': 'eJzNV2tvGzcW/SuEPnUBjVZvS95PhZGmBorUQJxdLLqFwSHvjAjPkFOSI2VS7H/fc8mRH4kDNNgCNWDIEp/3ce65h7/8PqGGWrLxzujJpZhcVLTdyB0V2/lqV6zVThX75XxZzPfVrrqQ5YVc0mQqJi1FqWWU2PP7RMlItfPDnaYuHjA0x4rKNHSnjScVMcVnz2Z/x592KkzGeStb4plrG6kR76Png4YZlnzkJY0M8a512lSGknWwZFPMN8VqfruYXy62l+vt5L9YGOlj/PIcPiIOXbrhgx2NNJ9I3/Jy7PvceVrtd3ovqViXe4mP5bYoq8W8qPZyvljL5Xqx2b5e538+khfxQIJ3iopOYiDpw1SkC6bCWRKuSitOzjf6P/1yvtgHUZq6JuxQB9O18p54y0FinMgKWGWDicZZY2sR3Ul6HYQUrfMkOAYK4fNGCdl13kl1wPnSipurYpyYiduDCcLwphM1yrWEzUeaCusibGoGUblsN+Y6aQdR9jGNGWvdUfLd+CoiqYN1jasHAeNwFFwKcgjiRkbxlpoAA8lPxdWbn0eXZ+J7pZzX2XJxMO003wML6zRoWgrn0/h/gMUaQ8Y+vfwgjyS6Bq7qfDDPS9G5HBhxOhCCYRBBeN4Q0iBrSjelEKvQd+Q7d0Jk04BmazjSj7ngxLD7fWS7SooRuQx9iNJYWTYk

In [52]:
print(data[0].metadata["emphasized_text_contents"])

['The Superpowers', 'Pervasive Connectivity', 'Ubiquitous compute']
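
The `orig_elements` metadata value printed earlier is an opaque string starting with `eJ`, which is what base64-encoded zlib data typically looks like. Assuming the field holds zlib-compressed JSON that has been base64-encoded (an assumption, not something this notebook confirms), the round trip can be sketched with the standard library alone. The payload below is made up for illustration; the truncated strings shown above cannot be decoded as printed.

```python
import base64
import json
import zlib


def encode_elements(elements: list) -> str:
    """Serialize to JSON, zlib-compress, then base64-encode (assumed format)."""
    raw = json.dumps(elements).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_elements(blob: str) -> list:
    """Reverse of encode_elements: base64-decode, decompress, parse JSON."""
    raw = zlib.decompress(base64.b64decode(blob))
    return json.loads(raw.decode("utf-8"))


# Made-up payload for illustration; not taken from the Word document.
sample = [{"type": "Title", "text": "The Superpowers"}]
blob = encode_elements(sample)
restored = decode_elements(blob)
```

If the assumption holds, applying `decode_elements` to the full, untruncated `data[0].metadata["orig_elements"]` string would recover the individual elements that were combined into that chunk.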
