<a href="https://colab.research.google.com/github/Sprayer1122/Groclake_Tutorials/blob/main/Groclake_Datalake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Groclake DataLake Colab Notebook

DataLake is a centralized platform for storing, managing, and retrieving large volumes of structured and unstructured data, such as documents, files, and metadata. It supports efficient data operations and enables advanced processing, retrieval, and analysis, making it ideal for AI/ML applications and businesses needing to manage complex datasets at scale.



#### **Step 1: Install Required Libraries**
Install the Groclake library to interact with its APIs and manage DataLakes, along with python-dotenv, which can load environment variables from a `.env` file.

In [None]:
!pip install groclake
!pip install python-dotenv

#### **Step 2: Import Required Modules and Set Up Environment Variables**

In [None]:
import os
from groclake.cataloglake import Cataloglake
from groclake.modellake import Modellake
from groclake.datalake import Datalake
from groclake.vectorlake import Vectorlake

# Environment variable setup
GROCLAKE_API_KEY = 'your_groclake_api_key'
GROCLAKE_ACCOUNT_ID = 'your_groclake_account_id'

os.environ['GROCLAKE_API_KEY'] = GROCLAKE_API_KEY
os.environ['GROCLAKE_ACCOUNT_ID'] = GROCLAKE_ACCOUNT_ID

# Initialize Groclake client instances (only data_lake is used in the steps below)
model_lake = Modellake()
data_lake = Datalake()
vector_lake = Vectorlake()
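
The cell above writes placeholder credentials into the environment, and it is easy to forget to replace them. As a minimal sketch, the check below fails early when a variable is unset or still holds a `your_...` placeholder. The helper name `require_env` and the placeholder test are assumptions made for illustration; they are not part of the Groclake library.

```python
import os


def require_env(*names: str) -> dict:
    """Return the named environment variables.

    Hypothetical helper: raises if a variable is unset, empty, or still
    holds one of this notebook's 'your_...' placeholder values.
    """
    values = {}
    for name in names:
        value = os.environ.get(name, "")
        # Treat empty values and 'your_...' placeholders as missing.
        if not value or value.startswith("your_"):
            raise RuntimeError(f"Set {name} before creating Groclake clients")
        values[name] = value
    return values
```

Calling `require_env("GROCLAKE_API_KEY", "GROCLAKE_ACCOUNT_ID")` before creating the clients would surface a missing credential immediately, rather than as an API error in a later step.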

#### **Step 3: Initialize Groclake Instances**

In [None]:
# Create the DataLake client (repeated from Step 2 so this cell can be re-run on its own)
data_lake = Datalake()

#### **Step 4: Create a New DataLake**
Create a new DataLake, which serves as a storage and processing environment for data.

In [None]:
try:
    data_create = data_lake.create()
    print("DataLake Created Successfully:", data_create)
    # Store the DataLake ID for further operations
    datalake_id = data_create["datalake_id"]
except Exception as e:
    print("Error creating DataLake:", str(e))

#### **Step 5: Push a Document to the DataLake**
Push a document (in this case, a URL) to the created DataLake for storage and processing.

Instructions to Generate a Publicly Accessible Link:

Google Drive:

1. Upload your file (e.g., a PDF) to Google Drive.
2. Right-click on the uploaded file and select "Get Link".
3. Change the permissions:
    - Click on the drop-down next to "Restricted".
    - Select "Anyone with the link".
    - Ensure the access level is set to "Viewer".
4. Copy the generated link.
5. Convert the shared link to a direct download link:

    Original link:
    https://drive.google.com/file/d/1cOYyJ5RuTjLph6Hjx_tAhGN_xH74tBtr/view

    Direct download link:
    https://drive.google.com/uc?export=download&id=1cOYyJ5RuTjLph6Hjx_tAhGN_xH74tBtr
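
The conversion in the last step can also be done in code. The sketch below extracts the file ID from a share link of the `/file/d/<id>/view` form shown above and builds the direct download URL. The function name `drive_direct_link` is hypothetical, and the pattern covers only this one link shape.

```python
import re

# Matches the file ID in share links of the form
# https://drive.google.com/file/d/<FILE_ID>/view
_DRIVE_FILE_ID = re.compile(r"drive\.google\.com/file/d/([A-Za-z0-9_-]+)")


def drive_direct_link(share_url: str) -> str:
    """Convert a Google Drive share link to a direct download link.

    Hypothetical helper; other Drive link formats are not handled.
    """
    match = _DRIVE_FILE_ID.search(share_url)
    if match is None:
        raise ValueError(f"Not a recognized Google Drive file link: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


print(drive_direct_link(
    "https://drive.google.com/file/d/1cOYyJ5RuTjLph6Hjx_tAhGN_xH74tBtr/view"
))
```

The printed URL matches the direct download link given in the instructions above.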

Dropbox:

1. Upload your file to Dropbox.
2. Right-click on the uploaded file and select "Copy Dropbox Link".
3. Ensure the link is set to public access.
4. Modify the link if needed (replace dl=0 with dl=1 at the end of the URL for direct download).
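
The `dl=0` to `dl=1` change in the last step can be scripted as well. This is a minimal sketch using only the standard library: it rewrites the query string so that `dl=1` is set and any other parameters are kept. The function name `dropbox_direct_link` and the example URL are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dropbox_direct_link(share_url: str) -> str:
    """Return the share link with dl=1 so that it downloads the file.

    Hypothetical helper; it only rewrites the 'dl' query parameter.
    """
    parts = urlsplit(share_url)
    # Drop any existing dl parameter and keep every other parameter as-is.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "dl"
    ]
    query.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example with a made-up share link.
print(dropbox_direct_link("https://www.dropbox.com/s/abc123/report.pdf?dl=0"))
```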

OneDrive:

1. Upload your file to OneDrive.
2. Right-click on the uploaded file and select "Share".
3. Choose "Anyone with the link" and set the access level to "View".
4. Copy the link.

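Make sure the link you push is a direct download link, one that returns the file itself rather than a preview page. A share link can usually be converted with a small change to the URL, as in the Google Drive and Dropbox instructions above; for other hosts, an assistant such as ChatGPT can help.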

In [None]:
try:
    payload_push = {
        "datalake_id": datalake_id,  # Specify the target DataLake
        "document_type": "url",    # Document type can be 'url', 'text', etc.
        "document_data": "https://plotch.ai/upload/plotch.pdf"  # URL of the document to push
    }

    # Example of a Google Drive direct download link that could be used as document_data:
    # https://drive.google.com/uc?export=download&id=1cOYyJ5RuTjLph6Hjx_tAhGN_xH74tBtr

    # Push the document to the DataLake
    data_push = data_lake.push(payload_push)
    print("Response from push:", data_push)

    # Extract the document_id from the response, which will be used for retrieval
    if "document_id" in data_push:
        document_id = data_push["document_id"]
        print("Document ID:", document_id)
    else:
        print("Error: 'document_id' not found in the response.")

except Exception as e:
    print("Error pushing document:", str(e))

The document_id uniquely identifies the pushed document and links it to the DataLake. It is needed for the fetch call in Step 6.

#### **Step 6: Fetch the Document in Chunks**
Define the payload to fetch the document in chunked format, then fetch it.

In [None]:
payload_fetch = {
    "document_id": document_id,  # Specify the document to fetch
    "datalake_id": datalake_id,  # Specify the DataLake where the document resides
    "fetch_format": "chunk",   # Fetching in chunks allows partial retrieval for large files
    "chunk_size": "500"         # Define the size of each chunk (in bytes or characters)
}

try:
    # Fetch the document from the DataLake
    data_fetch = data_lake.fetch(payload_fetch)
    print("Document Fetched Successfully:\n", data_fetch)

    # When fetching in chunks, the document is divided into manageable pieces for processing
except Exception as e:
    print("Error fetching document:", str(e))

#### **Step 7: Display the Fetched Data**
Iterate over the chunks of the fetched document and display each one for easy readability.

In [None]:
try:
    document_chunks = data_fetch.get("document_data", [])  # Retrieve document data in chunks
    for idx, chunk in enumerate(document_chunks):
        print(f"Chunk {idx + 1}:")
        print(chunk)  # Display each chunk of the document
        print("-" * 50)  # Separator for clarity

    # Chunked fetching is especially useful for large documents, as it prevents memory overload
except Exception as e:
    print("Error processing fetched data:", str(e))

Question: Why are we fetching in chunks?
1. Improved Performance
2. Customizable Fetching
3. Avoiding Memory Overload
4. Partial Data Processing

Also, LLMs have a token limit per request, so a large document has to be passed in pieces that stay under it.
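
To illustrate the token-limit point, the sketch below groups consecutive chunks into batches that stay under a size budget, so that each batch could be sent to an LLM separately. It rests on a few assumptions: the chunks are plain strings, a character count stands in for a real token count, and the name `batch_chunks` and the sample data are hypothetical.

```python
from typing import Iterable, List


def batch_chunks(chunks: Iterable[str], max_chars: int) -> List[List[str]]:
    """Group consecutive chunks so each batch holds at most max_chars characters.

    A single chunk longer than max_chars is placed in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for chunk in chunks:
        # Start a new batch when adding this chunk would exceed the budget.
        if current and current_len + len(chunk) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(chunk)
        current_len += len(chunk)
    if current:
        batches.append(current)
    return batches


# Illustrative data standing in for the chunks returned by the fetch call.
sample_chunks = ["a" * 500, "b" * 500, "c" * 300, "d" * 500]
for batch in batch_chunks(sample_chunks, max_chars=1200):
    print([len(chunk) for chunk in batch])
```

With a real model, the character budget would be replaced by a count from that model's tokenizer; the grouping logic stays the same.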