# 📄 LangChain Basics – Document Loading & Preparation

This notebook demonstrates a basic LangChain pipeline focused on preparing unstructured `.md` documents for downstream processing.

## ✅ What’s Included

### 🔧 Environment Setup
- Used `dotenv` to load the OpenAI API key from a `.env` file, keeping the key out of the notebook itself.
- Stored the model name `gpt-4o-mini` in a `MODEL` constant for later use.

### 📂 Document Loading
- Loaded multiple markdown files from subfolders inside a `knowledge-base` directory using `DirectoryLoader`.
- Used `TextLoader` with UTF-8 encoding.

### 🏷️ Metadata Tagging
- Tagged each document with its corresponding folder name under the `"doc_type"` key in metadata.

### ✂️ Text Splitting
- Split the loaded documents with `CharacterTextSplitter` (`chunk_size=1000`, `chunk_overlap=200`) into smaller chunks suitable for embedding or retrieval tasks.

---

This notebook sets the stage for building a retrieval-augmented generation (RAG) pipeline by preparing the documents and their metadata correctly.


In [1]:
import os
import glob
from dotenv import load_dotenv
import gradio as gr
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

In [2]:
MODEL = "gpt-4o-mini"
#db_name = "vector_db"

In [3]:
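# Load variables from .env; override=True lets .env values replace any already set in the environment.
load_dotenv(override=True)
# Fall back to a placeholder string when no key is found; replace it with a real key before calling the API.
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')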
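
Because the fallback in the cell above is only a placeholder string, it can help to confirm that a real key was loaded before any API call is made. The following is a minimal sketch, not part of the original notebook: the helper name `describe_key` and the `PLACEHOLDER` constant are hypothetical, and it assumes that a real key never equals the placeholder text.

```python
import os

# Assumption: this matches the fallback string used in the cell above.
PLACEHOLDER = "your-key-if-not-using-env"

def describe_key(value):
    """Return a short status string without revealing the key itself."""
    if not value or value == PLACEHOLDER:
        return "missing"
    # Report only the length so the secret never appears in notebook output.
    return f"set ({len(value)} characters)"

print("OPENAI_API_KEY:", describe_key(os.environ.get("OPENAI_API_KEY")))
```

Printing only the length keeps the key out of saved notebook outputs, which matters if the notebook is later shared or committed.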

In [4]:
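# Each subfolder of knowledge-base/ holds one document type.
folders = glob.glob("knowledge-base/*")
# Passed through to TextLoader so every file is read as UTF-8.
text_loader_kwargs = {'encoding': 'utf-8'}
folders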

['knowledge-base\\company',
 'knowledge-base\\contracts',
 'knowledge-base\\employees',
 'knowledge-base\\products']
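
The backslashes in this output indicate the notebook was run on Windows; on Linux or macOS the same call would return forward-slash paths. As a small illustration (an addition, not part of the original notebook), `pathlib`'s pure path classes show that the final path component, which the next cell uses as the document type, is the same in both spellings:

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same folder as glob would spell it on Windows and on Linux/macOS.
windows_style = PureWindowsPath("knowledge-base\\company")
posix_style = PurePosixPath("knowledge-base/company")

# .name is the final path component, i.e. the folder name used as doc_type.
print(windows_style.name)
print(posix_style.name)
```

Inside the notebook, `pathlib.Path(folder).name` gives the same result as `os.path.basename(folder)` on whichever platform the notebook runs.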

In [7]:
documents = []
for folder in folders:
    # The folder name (e.g. "company") becomes the document type.
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(
        folder,
        glob="**/*.md",
        loader_cls=TextLoader,
        loader_kwargs=text_loader_kwargs,
    )
    folder_docs = loader.load()
    for doc in folder_docs:
        # Tag each document so chunks can later be traced back to their type.
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

In [8]:
len(documents)

31
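
A total count alone does not show whether every folder contributed documents. One way to check is to count documents per `doc_type`. The sketch below is an illustrative addition: `count_by_doc_type` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without LangChain installed.

```python
from collections import Counter
from types import SimpleNamespace

def count_by_doc_type(docs):
    """Count documents per value of metadata['doc_type']."""
    return Counter(doc.metadata["doc_type"] for doc in docs)

# Stand-in objects for illustration only; in the notebook, pass `documents` instead.
sample_docs = [
    SimpleNamespace(metadata={"doc_type": "products"}),
    SimpleNamespace(metadata={"doc_type": "products"}),
    SimpleNamespace(metadata={"doc_type": "company"}),
]
counts = count_by_doc_type(sample_docs)
print(dict(counts))
```

The helper only relies on each object having a `metadata` dictionary, so it should work equally on `documents` and, later, on `chunks`.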

In [10]:
documents[30]

Document(metadata={'source': 'knowledge-base\\products\\Rellm.md', 'doc_type': 'products'}, page_content="# Product Summary\n\n# Rellm: AI-Powered Enterprise Reinsurance Solution\n\n## Summary\n\nRellm is an innovative enterprise reinsurance product developed by Insurellm, designed to transform the way reinsurance companies operate. Harnessing the power of artificial intelligence, Rellm offers an advanced platform that redefines risk management, enhances decision-making processes, and optimizes operational efficiencies within the reinsurance industry. With seamless integrations and robust analytics, Rellm enables insurers to proactively manage their portfolios and respond to market dynamics with agility.\n\n## Features\n\n### AI-Driven Analytics\nRellm utilizes cutting-edge AI algorithms to provide predictive insights into risk exposures, enabling users to forecast trends and make informed decisions. Its real-time data analysis empowers reinsurance professionals with actionable intelli

In [11]:
# Split on blank lines, targeting about 1000 characters per chunk with 200 characters of overlap.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000
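
This warning is expected with a separator-based splitter: by default `CharacterTextSplitter` only cuts at its separator (a blank line), so a single paragraph longer than `chunk_size` is kept whole. The sketch below is a simplified illustration of that behavior, not LangChain's implementation; `split_on_separator` is a hypothetical function and it ignores overlap entirely.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedy separator-based chunking (simplified; ignores overlap)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut, even when it is longer than chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "short paragraph" + "\n\n" + "x" * 30 + "\n\n" + "another short one"
chunks = split_on_separator(text, chunk_size=20)
print([len(c) for c in chunks])  # [15, 30, 17]
```

The middle chunk has 30 characters despite `chunk_size=20`, which mirrors the 1088-character chunk reported above. If strict size limits matter, a splitter that falls back to finer separators may be a better fit.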


In [12]:
len(chunks)

123

In [13]:
chunks[4]

Document(metadata={'source': 'knowledge-base\\contracts\\Contract with Apex Reinsurance for Rellm.md', 'doc_type': 'contracts'}, page_content='## Renewal\n\n1. **Automatic Renewal**: This Agreement will automatically renew for successive one-year terms unless either party provides a written notice of intent to terminate at least thirty (30) days prior to the expiration of the current term.\n\n2. **Renewal Pricing**: Upon renewal, the pricing may be subject to adjustment by the Provider. The Provider will give a minimum of sixty (60) days’ notice of any changes in pricing.\n\n## Features\n\n1. **AI-Driven Analytics**: The Rellm platform will utilize AI algorithms to provide predictive insights into risk exposures, allowing the Client to make informed decisions with real-time data analysis.\n\n2. **Seamless Integrations**: The architecture of Rellm allows for easy integration with existing systems used by the Client, including policy management and claims processing.')

In [14]:
chunks[5]

Document(metadata={'source': 'knowledge-base\\contracts\\Contract with Apex Reinsurance for Rellm.md', 'doc_type': 'contracts'}, page_content="2. **Seamless Integrations**: The architecture of Rellm allows for easy integration with existing systems used by the Client, including policy management and claims processing.\n\n3. **Customizable Dashboard**: The dashboard will be tailored to display metrics specific to the Client's operational needs, enhancing productivity and facilitating more efficient data access.\n\n4. **Regulatory Compliance**: The solution will include compliance tracking features to assist the Client in maintaining adherence to relevant regulations.\n\n5. **Dedicated Client Portal**: A portal for the Client will facilitate real-time communication and document sharing, ensuring seamless collaboration throughout the partnership.\n\n## Support\n\n1. **Technical Support**: Provider shall offer dedicated technical support to the Client via phone, email, and a ticketing syst
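
The two chunks above share the "Seamless Integrations" item: it ends `chunks[4]` and begins `chunks[5]`, which is the effect of `chunk_overlap`. To inspect such overlaps programmatically, one option is to find the longest suffix of one chunk that is also a prefix of the next. This is a minimal sketch with a hypothetical helper `shared_overlap` and shortened stand-in strings rather than the real chunk text.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""

# Shortened stand-ins for two consecutive chunks.
first = "## Features\n\n1. Analytics\n\n2. Integrations"
second = "2. Integrations\n\n3. Dashboard"
print(repr(shared_overlap(first, second)))  # '2. Integrations'
```

In the notebook, the equivalent call would be `shared_overlap(chunks[4].page_content, chunks[5].page_content)`, which would be expected to return the repeated list item.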

In [18]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: products, company, contracts, employees


In [19]:
for chunk in chunks:
    if 'CEO' in chunk.page_content:
        print(chunk)
        print("___________________")

page_content='3. **Regular Updates:** Insurellm will offer ongoing updates and enhancements to the Homellm platform, including new features and security improvements.

4. **Feedback Implementation:** Insurellm will actively solicit feedback from GreenValley Insurance to ensure Homellm continues to meet their evolving needs.

---

**Signatures:**

_________________________________  
**[Name]**  
**Title**: CEO  
**Insurellm, Inc.**

_________________________________  
**[Name]**  
**Title**: COO  
**GreenValley Insurance, LLC**  

---

This agreement represents the complete understanding of both parties regarding the use of the Homellm product and supersedes any prior agreements or communications.' metadata={'source': 'knowledge-base\\contracts\\Contract with GreenValley Insurance for Homellm.md', 'doc_type': 'contracts'}
___________________
page_content='## Support

1. **Customer Support**: Velocity Auto Solutions will have access to Insurellm’s customer support team via email or chatb