# Populate MinIO Bucket with a list of URLs to Partition Text Datasets using Unstructured-IO

The script fetches text from a list of URLs, processes it, and uploads it to a MinIO bucket. It sanitizes each URL into an object name, partitions the web content into text elements, and cleans up the temporary files it creates. Errors are caught and reported per URL, so one failed request does not stop the run.

In [None]:
import requests
from minio import Minio
import os
import tempfile
import re
from unstructured.partition.auto import partition
import io

def sanitize_url_to_object_name(url):
    clean_url = re.sub(r'^https?://', '', url)
    clean_url = re.sub(r'[^\w\-_\.]', '_', clean_url)
    return clean_url[:250] + '.txt'

def prepare_text_for_tokenization(text):
    # Simple placeholder for text cleaning logic
    clean_text = re.sub(r'\s+', ' ', text).strip()
    return clean_text

minio_client = Minio("cda-DESKTOP:9000", access_key="cda_cdaprod", secret_key="cda_cdaprod", secure=False)
bucket_name = "cda-datasets"

urls = [
    "https://nanonets.com/blog/langchain/amp/",
    "https://www.sitepoint.com/langchain-python-complete-guide/",
    "https://medium.com/@aisagescribe/langchain-101-a-comprehensive-introduction-guide-7a5db81afa49",
    "https://blog.min.io/minio-langchain-tool",
    "https://quickaitutorial.com/langgraph-create-your-hyper-ai-agent/",
    "https://python.langchain.com/docs/langserve",
    "https://python.langchain.com/docs/expression_language/interface",
    "https://blog.min.io/minio-langchain-tool",
    "https://python.langchain.com/docs/langgraph",
    "https://www.33rdsquare.com/langchain/",
    "https://medium.com/widle-studio/building-ai-solutions-with-langchain-and-node-js-a-comprehensive-guide-widle-studio-4812753aedff",
    "https://blog.min.io/",
    "https://sanity.cdaprod.dev/",
]


if not minio_client.bucket_exists(bucket_name):
    minio_client.make_bucket(bucket_name)

for url in urls:
    try:
        response = requests.get(url, timeout=30)  # Avoid hanging on an unresponsive server
        response.raise_for_status()  # Raises an error for bad status

        html_content = io.BytesIO(response.content)
        partitioned_elements = partition(file=html_content, content_type="text/html")
        combined_text = ""

        for element in partitioned_elements:
            if hasattr(element, 'text'):
                combined_text += element.text + "\n\n"

        combined_text = prepare_text_for_tokenization(combined_text)
        object_name = sanitize_url_to_object_name(url)

        # delete=False keeps the file on disk after the with-block closes it,
        # so fput_object can read it by path; it is removed manually below.
        with tempfile.NamedTemporaryFile(delete=False, mode="w", encoding="utf-8", suffix=".txt") as tmp_file:
            tmp_file.write(combined_text)
            tmp_file_path = tmp_file.name

        try:
            minio_client.fput_object(bucket_name, object_name, tmp_file_path)
            print(f"OK - Successfully uploaded {object_name}")
        finally:
            os.remove(tmp_file_path)  # Always delete the temporary file


    except requests.RequestException as e:
        print(f"Request error for {url}: {e}")
    except Exception as e:
        print(f"Error processing {url}: {e}")

# Hydrate Weaviate with Text Datasets stored in MinIO Bucket

#### Key Points:

- This script executes each step directly, without defining classes or functions. It downloads the `.txt` files from MinIO, partitions them with Unstructured to extract their text, and stores that text in Weaviate.
- Error handling, logging, and detailed text extraction logic are omitted for brevity but should be considered in a production environment.
- Adjust the bucket_name, MinIO, and Weaviate endpoints and credentials as necessary for your setup.

In [1]:
from minio import Minio
import weaviate
from unstructured.partition.auto import partition
from pathlib import Path
import os

# MinIO setup
minio_client = Minio(
    "192.168.0.25:9000",
    access_key="cda_cdaprod",
    secret_key="cda_cdaprod",
    secure=False
)
bucket_name = "cda-datasets"

# Weaviate setup
client = weaviate.Client("http://192.168.0.25:8080")

# List and download the text files from MinIO
for obj in minio_client.list_objects(bucket_name, recursive=True):
    if obj.object_name.endswith('.txt'):
        file_path = obj.object_name
        minio_client.fget_object(bucket_name, obj.object_name, file_path)
        
        # Process each text file with Unstructured
        elements = partition(filename=file_path)
        
        # Extract text (simplified logic, replace with actual extraction logic)
        extracted_texts = [e.text for e in elements if hasattr(e, 'text')]
        text_content = "\n".join(extracted_texts)
        
        # Store the extracted text in Weaviate (simplified to just storing the content as text)
        data_object = {
            "source": obj.object_name,
            "content": text_content
        }
        
        # Insert data into Weaviate, assuming 'Document' class exists
        client.data_object.create(data_object, "Document")
        
        # Optional: Clean up by removing the downloaded file
        os.remove(file_path)

            Consider upgrading to the new and improved v4 client instead!
            See here for usage: https://weaviate.io/developers/weaviate/client-libraries/python
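
The hydration cell above assumes that a `Document` class already exists in Weaviate. The notebook does not show how that class was defined, so the following is only a minimal sketch of one way to create it with the v3 Python client used here. The property types, the description, and the `text2vec-transformers` vectorizer are assumptions, and `ensure_document_class` is a hypothetical helper name.

```python
# Hypothetical class definition; the notebook does not show the real schema.
DOCUMENT_CLASS = {
    "class": "Document",
    "description": "Plain-text documents loaded from a MinIO bucket",
    # Assumption: near_text queries need a text vectorizer module enabled on
    # the server; substitute whichever module your instance runs.
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "source", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
    ],
}

def ensure_document_class(client, class_obj=DOCUMENT_CLASS):
    """Create the class if the schema does not already contain it.

    `client` is expected to be a v3 `weaviate.Client`.
    Returns True if the class was created, False if it already existed.
    """
    existing = {c["class"] for c in client.schema.get().get("classes", [])}
    if class_obj["class"] in existing:
        return False
    client.schema.create_class(class_obj)
    return True
```

Calling `ensure_document_class(client)` once before the loop would make the cell safe to run against a fresh Weaviate instance.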
            


# Query the Weaviate Results

#### Query #1
This will return all documents in the Document class, displaying their source and content.

In [8]:
# Query #1
query_result = client.query.get(
    "Document",
    ["source", "content"]
).do()

for result in query_result['data']['Get']['Document']:
    print(f"Source: {result['source']}")
    print(f"Content: {result['content']}\n")

Source: blog.min.io_author_david-cannan_.txt
Content: Topics All Architect's Guide Operator's Guide Best Practices AI/ML Modern Data Lakes Performance Kubernetes Integrations Benchmarks Security Multicloud Try the Erasure Code Calculator to configure your usable capacity Try Now Developing Langchain Agents with the MinIO SDK for LLM Tool-Use David Cannan David Cannan on AI/ML Explore Langchain’s LLM Tool-Use and leverage Langgraph for monitoring MinIO’s S3 Object Store. This guide walks you through developing custom conversational AI agents and creating powerful OpenAI LLM chains for efficient data management and enhanced application functionality. Read more... Powering AI/ML workflows with GitOps Automation David Cannan David Cannan on AI/ML Explore the fusion of GitOps, MinIO, Weaviate, and Python in AI development for unparalleled automation and innovation. This combination offers a solid foundation for creating scalable, efficient, and automated AI solutions, propelling project
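
The loop in Query #1 indexes straight into `query_result['data']['Get']['Document']`, which can fail with a `KeyError` or `TypeError` when Weaviate returns a GraphQL error instead of data. A small defensive helper, sketched below, is one way to handle that. `extract_documents` is a hypothetical name, and the sketch assumes errors are reported under a top-level `errors` key, as is usual for GraphQL responses.

```python
def extract_documents(query_result, class_name="Document"):
    """Return the list of result objects, or raise if the response has errors."""
    if query_result.get("errors"):
        raise RuntimeError(f"Weaviate query failed: {query_result['errors']}")
    data = query_result.get("data") or {}
    get_section = data.get("Get") or {}
    return get_section.get(class_name) or []

# Example with a response shaped like the one printed above (content shortened).
sample = {"data": {"Get": {"Document": [
    {"source": "blog.min.io_author_david-cannan_.txt", "content": "Topics All ..."}
]}}}

for doc in extract_documents(sample):
    # Print a short preview rather than the whole document body.
    print(f"Source: {doc['source']}")
    print(f"Content: {doc['content'][:80]}\n")
```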

#### Query #2

This approach is suited to text properties in Weaviate. It uses vector search to find documents that are semantically related to the query text, even if they don't contain the exact words. `with_near_text` requires a text vectorizer module to be enabled on the Weaviate instance.

In [10]:
# Query #2
query_result = client.query.get(
    "Document",
    ["source", "content"]
).with_near_text({
    "concepts": ["some text you're interested in"],
    "certainty": 0.5
}).do()

for result in query_result['data']['Get']['Document']:
    print(f"Source: {result['source']}")
    print(f"Content: {result['content']}\n")

Source: blog.min.io_.txt
Content: Topics All Architect's Guide Operator's Guide Best Practices AI/ML Modern Data Lakes Performance Kubernetes Integrations Benchmarks Security Multicloud Building Modern Data Architectures with Iceberg, Tabular and MinIO Brenna Buuck Brenna Buuck on Modern Data Lakes Explore modern data architecture with Iceberg, Tabular, and MinIO. Learn to seamlessly integrate structured and unstructured data, optimize AI/ML workloads, and build a high-performance, cloud-native data lake. Read more... Developing Langchain Agents with the MinIO SDK for LLM Tool-Use David Cannan David Cannan on AI/ML Explore Langchain’s LLM Tool-Use and leverage Langgraph for monitoring MinIO’s S3 Object Store. This guide walks you through developing custom conversational AI agents and creating powerful OpenAI LLM chains for efficient data management and enhanced application functionality. Read more... Prefix vs Folder AJ AJ on Object Storage How you ever wondered how object storage 
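
Query #2 passes `"certainty": 0.5`. For cosine distance, Weaviate's documentation relates the two values as `certainty = 1 - distance / 2`. The sketch below converts a certainty threshold into the equivalent cosine distance and cosine similarity. It assumes the `Document` class uses cosine distance, which is Weaviate's default but is not confirmed by the notebook; the helper names are illustrative.

```python
def certainty_to_cosine_distance(certainty):
    # Weaviate defines certainty = 1 - distance / 2 for cosine distance,
    # where cosine distance ranges from 0 (identical) to 2 (opposite).
    return 2.0 * (1.0 - certainty)

def certainty_to_cosine_similarity(certainty):
    # cosine distance = 1 - cosine similarity
    return 1.0 - certainty_to_cosine_distance(certainty)

for c in (0.5, 0.7, 0.9):
    print(c, certainty_to_cosine_distance(c), certainty_to_cosine_similarity(c))
```

Under that assumption, a certainty of 0.5 corresponds to a cosine similarity of 0, so any document whose vector is not pointing away from the query qualifies. That is a permissive threshold; raising it narrows the results.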

# Combining the MinIO Dataset Population with the Weaviate Hydration Method (v1)
This script does the following:

1. Defines helper functions to sanitize URLs and prepare text for tokenization.
2. Sets up clients for both MinIO and Weaviate.
3. Processes a list of URLs by fetching their content, using the Unstructured library to partition the content into text elements, and storing this text in a MinIO bucket.
4. Lists the text files stored in the MinIO bucket, partitions each one again to extract its text, and stores the content in Weaviate. The second partitioning pass may be redundant, since the files were already processed before upload.

This combined script handles the entire workflow from fetching URL content to storing processed text in Weaviate, making it a comprehensive solution. Remember to replace placeholder URLs and configurations with your actual data and settings.
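
Since the stored objects are already plain text, the second partitioning pass in step 4 could be skipped by reading each object directly. The following is a minimal sketch of that alternative, not what the combined script does. `read_text_object` is a hypothetical helper, and it assumes the objects are UTF-8 encoded, as they are when written by the upload step.

```python
def read_text_object(minio_client, bucket_name, object_name):
    """Return the UTF-8 text of an object without writing it to local disk."""
    response = minio_client.get_object(bucket_name, object_name)
    try:
        return response.read().decode("utf-8")
    finally:
        # The MinIO client documentation asks callers to close the response
        # and release the connection once the body has been read.
        response.close()
        response.release_conn()
```

In the Weaviate loop, `text_content = read_text_object(minio_client, bucket_name, obj.object_name)` could then stand in for the `fget_object`, `partition`, and `os.remove` calls.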

In [11]:
import requests
from minio import Minio
import weaviate
import os
import tempfile
import re
from unstructured.partition.auto import partition
import io

# Setup for MinIO and Weaviate
minio_client = Minio("192.168.0.25:9000", access_key="cda_cdaprod", secret_key="cda_cdaprod", secure=False)
client = weaviate.Client("http://192.168.0.25:8080")
bucket_name = "testtesttest"

def sanitize_url_to_object_name(url):
    clean_url = re.sub(r'^https?://', '', url)
    clean_url = re.sub(r'[^\w\-_\.]', '_', clean_url)
    return clean_url[:250] + '.txt'

def prepare_text_for_tokenization(text):
    clean_text = re.sub(r'\s+', ' ', text).strip()
    return clean_text

urls = [
    "https://nanonets.com/blog/langchain/amp/",
    "https://www.sitepoint.com/langchain-python-complete-guide/",
    "https://medium.com/@aisagescribe/langchain-101-a-comprehensive-introduction-guide-7a5db81afa49",
    "https://blog.min.io/minio-langchain-tool",
    "https://quickaitutorial.com/langgraph-create-your-hyper-ai-agent/",
    "https://python.langchain.com/docs/langserve",
    "https://python.langchain.com/docs/expression_language/interface",
    "https://blog.min.io/minio-langchain-tool",
    "https://python.langchain.com/docs/langgraph",
    "https://www.33rdsquare.com/langchain/",
    "https://medium.com/widle-studio/building-ai-solutions-with-langchain-and-node-js-a-comprehensive-guide-widle-studio-4812753aedff",
    "https://blog.min.io/",
    "https://sanity.cdaprod.dev/",
]

# Ensure the bucket exists
if not minio_client.bucket_exists(bucket_name):
    minio_client.make_bucket(bucket_name)

# Process URLs: Fetch content, partition, and store in MinIO
for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        html_content = io.BytesIO(response.content)
        elements = partition(file=html_content, content_type="text/html")
        combined_text = "\n".join([e.text for e in elements if hasattr(e, 'text')])
        combined_text = prepare_text_for_tokenization(combined_text)
        object_name = sanitize_url_to_object_name(url)
        # Temporary storage
        with tempfile.NamedTemporaryFile(delete=False, mode="w", encoding="utf-8", suffix=".txt") as tmp_file:
            tmp_file.write(combined_text)
            tmp_file_path = tmp_file.name
        # Upload to MinIO and remove the temporary file
        minio_client.fput_object(bucket_name, object_name, tmp_file_path)
        os.remove(tmp_file_path)

# List, process, and store in Weaviate
for obj in minio_client.list_objects(bucket_name, recursive=True):
    if obj.object_name.endswith('.txt'):
        file_path = obj.object_name
        minio_client.fget_object(bucket_name, obj.object_name, file_path)
        elements = partition(filename=file_path)
        text_content = "\n".join([e.text for e in elements if hasattr(e, 'text')])
        # Store in Weaviate
        data_object = {"source": obj.object_name, "content": text_content}
        client.data_object.create(data_object, "Document")
        os.remove(file_path)  # Clean up




# Combining the MinIO Dataset Population with the Weaviate Hydration Method (v2)
### This enhanced version provides clear feedback at each stage:
- initializing clients
- processing URLs
- storing data in MinIO
- inserting documents into Weaviate

It’s designed to keep you informed about the script’s progress and any issues encountered along the way.
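
The v2 script reports progress with `print`. As a hedged alternative, the same feedback can go through Python's standard `logging` module, which adds severity levels and makes it easy to count failures at the end of a run. The helper names `run_step` and `summarize` are hypothetical and do not appear in the notebook.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("hydrate")

def run_step(label, func, *args):
    """Run one pipeline step, log the outcome, and report success as a bool."""
    try:
        func(*args)
    except Exception as exc:  # broad on purpose, mirroring the notebook
        log.error("%s failed: %s", label, exc)
        return False
    log.info("%s succeeded", label)
    return True

def summarize(outcomes):
    """Count successes and failures from a {label: bool} mapping."""
    ok = sum(1 for passed in outcomes.values() if passed)
    return ok, len(outcomes) - ok
```

A run could collect `outcomes[url] = run_step(f"fetch {url}", process_url, url)` for each URL, where `process_url` stands for whatever function wraps the fetch and upload, and then log the totals from `summarize(outcomes)`.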

In [13]:
import requests
from minio import Minio
import weaviate
import os
import tempfile
import re
from unstructured.partition.auto import partition
import io

# Setup for MinIO and Weaviate
minio_client = Minio("192.168.0.25:9000", access_key="cda_cdaprod", secret_key="cda_cdaprod", secure=False)
print("MinIO client initialized.")

client = weaviate.Client("http://192.168.0.25:8080")
print("Weaviate client initialized.")

bucket_name = "cda-datasets"

def sanitize_url_to_object_name(url):
    clean_url = re.sub(r'^https?://', '', url)
    clean_url = re.sub(r'[^\w\-_\.]', '_', clean_url)
    return clean_url[:250] + '.txt'

def prepare_text_for_tokenization(text):
    clean_text = re.sub(r'\s+', ' ', text).strip()
    return clean_text

urls = [
    "https://nanonets.com/blog/langchain/amp/",
    "https://www.sitepoint.com/langchain-python-complete-guide/",
    "https://medium.com/@aisagescribe/langchain-101-a-comprehensive-introduction-guide-7a5db81afa49",
    "https://blog.min.io/minio-langchain-tool",
    "https://quickaitutorial.com/langgraph-create-your-hyper-ai-agent/",
    "https://python.langchain.com/docs/langserve",
    "https://python.langchain.com/docs/expression_language/interface",
    "https://blog.min.io/minio-langchain-tool",
    "https://python.langchain.com/docs/langgraph",
    "https://www.33rdsquare.com/langchain/",
    "https://medium.com/widle-studio/building-ai-solutions-with-langchain-and-node-js-a-comprehensive-guide-widle-studio-4812753aedff",
    "https://blog.min.io/",
    "https://sanity.cdaprod.dev/",
]

if not minio_client.bucket_exists(bucket_name):
    minio_client.make_bucket(bucket_name)
    print(f"Bucket '{bucket_name}' created.")

for url in urls:
    print(f"Fetching URL: {url}")
    try:
        response = requests.get(url)
        response.raise_for_status()  # Check for HTTP issues

        html_content = io.BytesIO(response.content)
        elements = partition(file=html_content, content_type="text/html")
        combined_text = "\n".join([e.text for e in elements if hasattr(e, 'text')])
        combined_text = prepare_text_for_tokenization(combined_text)
        object_name = sanitize_url_to_object_name(url)

        with tempfile.NamedTemporaryFile(delete=False, mode="w", encoding="utf-8", suffix=".txt") as tmp_file:
            tmp_file.write(combined_text)
            tmp_file_path = tmp_file.name
        
        minio_client.fput_object(bucket_name, object_name, tmp_file_path)
        print(f"Stored '{object_name}' in MinIO bucket '{bucket_name}'.")
        os.remove(tmp_file_path)  # Clean up

    except requests.RequestException as e:
        print(f"Failed to fetch URL {url}: {e}")
    except Exception as e:
        print(f"Error processing {url}: {e}")

for obj in minio_client.list_objects(bucket_name, recursive=True):
    if obj.object_name.endswith('.txt'):
        print(f"Processing document: {obj.object_name}")
        file_path = obj.object_name
        minio_client.fget_object(bucket_name, obj.object_name, file_path)
        
        elements = partition(filename=file_path)
        text_content = "\n".join([e.text for e in elements if hasattr(e, 'text')])
        
        data_object = {"source": obj.object_name, "content": text_content}
        client.data_object.create(data_object, "Document")
        print(f"Inserted document '{obj.object_name}' into Weaviate.")
        
        os.remove(file_path)

MinIO client initialized.
Weaviate client initialized.
Fetching URL: https://nanonets.com/blog/langchain/amp/




Stored 'nanonets.com_blog_langchain_amp_.txt' in MinIO bucket 'cda-datasets'.
Fetching URL: https://www.sitepoint.com/langchain-python-complete-guide/
Stored 'www.sitepoint.com_langchain-python-complete-guide_.txt' in MinIO bucket 'cda-datasets'.
Fetching URL: https://medium.com/@aisagescribe/langchain-101-a-comprehensive-introduction-guide-7a5db81afa49
Stored 'medium.com__aisagescribe_langchain-101-a-comprehensive-introduction-guide-7a5db81afa49.txt' in MinIO bucket 'cda-datasets'.
Fetching URL: https://blog.min.io/minio-langchain-tool
Stored 'blog.min.io_minio-langchain-tool.txt' in MinIO bucket 'cda-datasets'.
Fetching URL: https://quickaitutorial.com/langgraph-create-your-hyper-ai-agent/
Stored 'quickaitutorial.com_langgraph-create-your-hyper-ai-agent_.txt' in MinIO bucket 'cda-datasets'.
Fetching URL: https://python.langchain.com/docs/langserve
Stored 'python.langchain.com_docs_langserve.txt' in MinIO bucket 'cda-datasets'.
Fetching URL: https://python.langchain.com/docs/expressio

---

# Markdown Graveyard

---

# Contextual Stuff

## Integrating Cloud Storage and Knowledge Graphs for Enhanced Data Management

Managing and processing unstructured data efficiently matters to businesses and researchers alike. This article explores a proof of concept (POC) that integrates cloud object storage with a knowledge graph to streamline that work. Using MinIO, Weaviate, and the Unstructured text-processing library, the POC fetches web pages, converts them to plain text, and makes that text searchable.

## Overview of the Solution

The POC uses a Python script to fetch text from a list of URLs, process it, and store it in MinIO, an open-source object storage service. It then indexes the stored text in Weaviate, an open-source vector database that serves as the knowledge graph in this setup. The steps below follow the data from acquisition through transformation to storage.

### Step 1: Setting Up MinIO and Weaviate Clients

The initial step involves initializing clients for both MinIO and Weaviate, connecting to their respective servers. This setup is crucial for enabling subsequent operations like data storage and indexing.

```python
minio_client = Minio("192.168.0.25:9000", access_key="cda_cdaprod", secret_key="cda_cdaprod", secure=False)
client = weaviate.Client("http://192.168.0.25:8080")
```

### Step 2: Data Fetching and Pre-processing

The script fetches each URL, uses the Unstructured library's `partition` function to extract the text elements from the HTML, and collapses whitespace in the result. This converts raw HTML into plain text that is ready for storage.
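
The whitespace clean-up in this step is small enough to show on its own. The sketch below repeats the `prepare_text_for_tokenization` helper from the notebook and applies it to a made-up input string.

```python
import re

def prepare_text_for_tokenization(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single
    # space and trim the ends, as the notebook's helper does.
    return re.sub(r'\s+', ' ', text).strip()

raw = "Developing Langchain Agents\n\n   with the MinIO SDK\t for LLM Tool-Use  "
print(prepare_text_for_tokenization(raw))
# Developing Langchain Agents with the MinIO SDK for LLM Tool-Use
```

One consequence is that the line breaks inserted between partitioned elements are collapsed as well, so each stored document ends up as a single line of text.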

### Step 3: Storage in MinIO

Once the data is processed, it's stored in a specified bucket in MinIO. This involves sanitizing the URL to a valid object name, writing the processed text to a temporary file, and then uploading this file to MinIO. This step demonstrates the flexibility and ease of using MinIO for handling large volumes of data.
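
The temporary file is not strictly required. As a sketch of an alternative to the approach described above, the MinIO Python client's `put_object` accepts a readable stream plus its length, so the text can be uploaded from memory. `upload_text` is a hypothetical helper name, and the content type is an assumption.

```python
import io

def upload_text(minio_client, bucket_name, object_name, text):
    """Upload a string to MinIO from memory instead of a temporary file."""
    payload = text.encode("utf-8")
    minio_client.put_object(
        bucket_name,
        object_name,
        io.BytesIO(payload),
        length=len(payload),
        content_type="text/plain; charset=utf-8",  # assumed content type
    )
    return len(payload)
```

This removes the need for manual file clean-up, at the cost of holding the whole document in memory, which is reasonable for web pages of this size.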

### Step 4: Indexing in Weaviate

After storing the documents in MinIO, the script indexes them in Weaviate. It downloads each stored `.txt` object, partitions it again with Unstructured, and creates a `Document` data object holding the object name (`source`) and the text (`content`). Once indexed, the documents can be retrieved with the queries shown earlier.
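
Running the indexing step more than once creates duplicate objects, because Weaviate assigns a fresh id to each object created without one. One way to avoid that, sketched below under the assumption that the v3 Python client is in use, is to derive a deterministic UUID from the object name and skip objects that already exist. `object_uuid` and `insert_once` are hypothetical helpers, not part of the original script.

```python
import uuid

def object_uuid(object_name):
    # Derive a stable UUID from the object name so the same document always
    # maps to the same Weaviate id. The namespace choice is arbitrary.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, object_name))

def insert_once(client, object_name, text_content, class_name="Document"):
    """Create the object unless one with the same derived id already exists."""
    doc_id = object_uuid(object_name)
    if client.data_object.exists(doc_id, class_name=class_name):
        return False
    client.data_object.create(
        {"source": object_name, "content": text_content},
        class_name,
        uuid=doc_id,
    )
    return True
```

With this in place, re-running the pipeline against the same bucket leaves existing documents untouched instead of inserting them again.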

## Benefits and Applications

The integration of MinIO and Weaviate offers numerous benefits, including scalable storage, efficient data retrieval, and the ability to handle unstructured data effectively. This POC illustrates not just a technical implementation but also a strategic approach to managing data in a way that enhances accessibility, searchability, and usability.

Such a system could be invaluable for organizations dealing with large datasets, researchers requiring efficient data retrieval methods, or businesses looking to implement advanced data analysis and management solutions.

## Conclusion

This proof of concept highlights the synergy between cloud storage and knowledge graphs, offering a glimpse into the future of data management. By leveraging the strengths of MinIO for storage and Weaviate for data indexing and retrieval, organizations can achieve a more streamlined and efficient data management process. This POC not only demonstrates the technical feasibility of such an integration but also underscores the potential benefits for businesses and researchers in managing and analyzing data at scale.