## Python Jupyter Notebook Recipe: Weaviate + Box Integration with Cohere LLM

Author: Alexander Novotny from Box

This notebook demonstrates how to:
1. Authenticate with Box using a developer token via the Box Python SDK (`box-sdk-gen`).
2. Retrieve files from a specified Box folder, using Box's text representations.
3. Generate embeddings for the file content using Cohere.
4. Store the embeddings and metadata in Weaviate.
5. Implement a Q&A service that queries the content using Weaviate’s vector search and Cohere’s language model.

### Prerequisites
- A Box account with a custom application and developer token (you can generate one in the Box Developer Console).
- A Weaviate cloud instance.
- A Cohere API key (sign up at https://cohere.ai/).
- A Box folder ID containing supported files (e.g., `.txt`, `.pdf`, `.docx`).

### Notes
- **Box Folder ID**: Find this in the Box web interface URL (e.g., `https://app.box.com/folder/12345` → `12345`).
- **File Types**: Only files with a supported extension (e.g., `.pdf`, `.docx`) are processed. Adjust `SUPPORTED_TEXT_FILE_TYPES` as needed.
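
As a small convenience, the folder ID can be parsed out of a pasted folder URL instead of being copied by hand. The sketch below assumes the URL follows the `https://app.box.com/folder/<id>` pattern shown above; `folder_id_from_url` is a hypothetical helper and not part of the Box SDK.

```python
import re


def folder_id_from_url(url: str) -> str:
    """Return the numeric folder ID from a Box web URL (hypothetical helper)."""
    match = re.search(r"/folder/(\d+)", url)
    if match is None:
        raise ValueError(f"No folder ID found in URL: {url!r}")
    return match.group(1)


print(folder_id_from_url("https://app.box.com/folder/12345"))  # prints 12345
```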

### Step 1: Install Dependencies
First, install the required Python packages in your Jupyter environment.

In [None]:
# %pip installs into the running kernel's environment. To use a virtual
# environment, create and activate it before starting Jupyter.
%pip install weaviate-client box-sdk-gen requests

### Step 2: Import Libraries
Import the necessary libraries for Box, Weaviate, and Cohere.

In [84]:
import weaviate
from weaviate.auth import AuthApiKey
from box_sdk_gen import BoxClient, BoxDeveloperTokenAuth
import re
import requests

### Step 3: Authentication
Set up authentication for Box, Weaviate, and Cohere.

In [None]:
# Box Developer Token (replace with your own)
BOX_DEVELOPER_TOKEN = 'DEVELOPER_TOKEN'
FOLDER_ID = 'BOX_FOLDER_ID'

# Weaviate Instance URL and API Key (replace with your own)
WEAVIATE_URL = 'WEAVIATE_URL'
WEAVIATE_API_KEY = 'WEAVIATE_ADMIN_KEY'  # Optional, depending on setup

# Cohere API Key (replace with your own)
COHERE_API_KEY = 'COHERE_API_KEY'


def main(box_token: str, weaviate_url: str, weaviate_api_key: str, cohere_api_key: str):
    # Initialize Box Client
    auth: BoxDeveloperTokenAuth = BoxDeveloperTokenAuth(token=box_token)
    box_client: BoxClient = BoxClient(auth=auth)
    
    # Initialize Weaviate Client for WCS
    weaviate_client = weaviate.connect_to_wcs(
        cluster_url=weaviate_url,
        auth_credentials=AuthApiKey(weaviate_api_key) if weaviate_api_key else None,
        headers={"X-Cohere-Api-Key": cohere_api_key}
    )
    
    # Return clients for use in subsequent steps
    return box_client, weaviate_client

# Call main to initialize clients
box_client, weaviate_client = main(
    BOX_DEVELOPER_TOKEN, WEAVIATE_URL, WEAVIATE_API_KEY, COHERE_API_KEY
)
print("Clients initialized successfully.")

Clients initialized successfully.
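
Hardcoding tokens in a notebook makes them easy to leak when the notebook is shared. One alternative is to read them from environment variables. This is a minimal sketch: `read_setting` is a hypothetical helper, and the variable names (`DEMO_BOX_FOLDER_ID`, `DEMO_WEAVIATE_API_KEY`) are illustrative assumptions rather than names required by Box, Weaviate, or Cohere.

```python
import os
from typing import Optional


def read_setting(name: str, default: Optional[str] = None) -> str:
    """Return an environment variable, or raise if it is missing and has no default."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Set in-process for demonstration only; normally exported in the shell
# before Jupyter is started.
os.environ["DEMO_BOX_FOLDER_ID"] = "12345"

demo_folder_id = read_setting("DEMO_BOX_FOLDER_ID")
demo_weaviate_key = read_setting("DEMO_WEAVIATE_API_KEY", default="")  # optional setting
print(demo_folder_id)
```

The values returned this way could then be passed to `main(...)` in place of the placeholder constants.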


### Step 4: Define Weaviate Schema
Create a schema in Weaviate to store document embeddings and metadata. We’ll use Cohere’s `text2vec-cohere` vectorizer.

In [None]:
from weaviate.classes.config import Property, DataType, Configure

# Check if the "Documents" collection already exists
if not weaviate_client.collections.exists("Documents"):
    # Create the collection explicitly
    weaviate_client.collections.create(
        name="Documents",
        generative_config=Configure.Generative.cohere(),
        properties=[
            Property(name="file_id", data_type=DataType.TEXT, skip_vectorization=True),
            Property(name="file_name", data_type=DataType.TEXT, skip_vectorization=True),
            Property(name="chunk_index", data_type=DataType.INT, skip_vectorization=True),
            Property(name="content", data_type=DataType.TEXT),  # Vectorized by default
            Property(name="created_date", data_type=DataType.TEXT, skip_vectorization=True),
        ],
        vectorizer_config=Configure.Vectorizer.text2vec_cohere()
    )
    print("Schema 'Documents' created successfully.")
else:
    print("Schema 'Documents' already exists.")

### Step 5: Retrieve Files from Box
Define a function to fetch files from a specified Box folder.

In [None]:
# Supported file types for text representation
SUPPORTED_TEXT_FILE_TYPES = {
    ".doc", ".docx", ".pdf", ".txt", ".html", ".md", ".json", ".xml",
    ".ppt", ".pptx", ".key",
    ".xls", ".xlsx", ".csv"
}

def is_supported_file_type(file_name):
    """Check if the file's extension is in the supported list (case-insensitive)."""
    return file_name.lower().endswith(tuple(SUPPORTED_TEXT_FILE_TYPES))

def get_files_in_folder(client, folder_id):
    """Retrieve supported files from a Box folder (only the first page of folder items is read)."""
    items = client.folders.get_folder_items(folder_id)
    file_objects = []
    for item in items.entries:
        if item.type == 'file' and is_supported_file_type(item.name):
            file_objects.append(client.files.get_file_by_id(item.id))
    return file_objects

files = get_files_in_folder(box_client, FOLDER_ID)
print(f"Found {len(files)} supported files in folder {FOLDER_ID}.")
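
A folder listing is returned one page at a time, so a large folder needs more than one request. The sketch below shows offset-based paging in isolation, using a stand-in fetch function so it runs without a Box connection. `iter_all_entries` and `fake_fetch` are hypothetical names, and the page size of 100 is an assumption.

```python
from typing import Callable, Iterator, List, Sequence


def iter_all_entries(
    fetch_page: Callable[[int, int], Sequence], page_size: int = 100
) -> Iterator:
    """Yield entries page by page until a short (or empty) page signals the end."""
    offset = 0
    while True:
        entries = fetch_page(offset, page_size)
        yield from entries
        if len(entries) < page_size:
            break
        offset += page_size


# Stand-in for a folder that holds 250 items
fake_folder = [f"file_{i}.txt" for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[str]:
    return fake_folder[offset:offset + limit]


all_items = list(iter_all_entries(fake_fetch, page_size=100))
print(len(all_items))  # prints 250
```

In the notebook, `fetch_page` could wrap `client.folders.get_folder_items(folder_id, offset=offset, limit=limit)` and return its `entries` (or an empty list when `entries` is missing). This assumes the SDK method accepts `offset` and `limit` keyword arguments, mirroring the Box folder items endpoint, so check the signature in your installed version.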

### Step 6: Extract Text and Generate Chunks
Extract text from files and prepare data for Weaviate. The extracted text is cleaned up and then split into chunks of 4,000 characters, with a 200-character overlap between consecutive chunks.

In [None]:
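# Normalize whitespace and strip extraction debris from the raw text
def clean_up_text(content: str) -> str:
    """Clean up the extracted text content."""
    # Rejoin words that were hyphenated across a line break
    content = re.sub(r'(\w+)-\n(\w+)', r'\1\2', content)
    # Turn remaining line breaks into spaces so adjacent words do not run together
    content = content.replace("\n", " ")
    unwanted_patterns = [
        "  —", "——————————", "—————————", "—————",
        r'\\u[\dA-Fa-f]{4}', r'\uf075', r'\uf0b7'
    ]
    for pattern in unwanted_patterns:
        content = re.sub(pattern, "", content)
    # Tighten spacing around hyphens, then collapse runs of whitespace
    content = re.sub(r'(\w)\s*-\s*(\w)', r'\1-\2', content)
    content = re.sub(r'\s+', ' ', content)
    return content.strip()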

import time

# Fetch a file's extracted-text representation from Box
def get_file_text_content(file, max_retries=5, delay=5):
    """Get text content from a file's representation with retry logic."""
    # The x-rep-hints header asks Box to include the extracted_text representation
    special_client = box_client.with_extra_headers(extra_headers={"x-rep-hints": "[extracted_text]", "x-box-ai-library": "weaviate"})
    for attempt in range(max_retries):
        # Keep `file` intact: this response only carries the requested fields
        file_reps = special_client.files.get_file_by_id(file.id, fields=["representations"])
        entries = (file_reps.representations.entries if file_reps.representations else None) or []
        for rep in entries:
            if rep.representation == "extracted_text" and rep.content and rep.content.url_template:
                download_url = rep.content.url_template.replace("{+asset_path}", "")
                # Send the token in a header rather than in the URL
                response = requests.get(download_url, headers={"Authorization": f"Bearer {box_client.auth.token}"})
                response.raise_for_status()
                # Any other success status (e.g. 202) is treated as "not ready yet"
                if response.status_code == 200:
                    return clean_up_text(response.text)
        print(f"Text representation not ready for file {file.id} (attempt {attempt + 1}/{max_retries})")
        if attempt < max_retries - 1:
            time.sleep(delay)  # Wait before asking again
    raise ValueError(f"Text representation not ready for {file.name} after {max_retries} attempts.")

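# Split text into fixed-size, overlapping chunks
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into chunks with specified size and overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid emitting a redundant tail
        start = end - overlap
    return chunks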

# Extract text from each file and yield one record per chunk
def extract_text_and_generate_embeddings(files):
    """Extract text from supported files and yield chunked data."""
    for file in files:
        try:
            text = get_file_text_content(file)
            chunks = chunk_text(text, chunk_size=4000, overlap=200)
            for i, chunk in enumerate(chunks):
                yield {
                    "file_id": file.id,
                    "file_name": file.name,
                    "chunk_index": i,
                    "content": chunk,
                    "created_date": file.created_at
                }
        except Exception as e:
            print(f"Error processing {file.name}: {e}")

# Extract data from files
data = list(extract_text_and_generate_embeddings(files))
print(f"Prepared {len(data)} chunks from {len(files)} files.")
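
To make the effect of the overlap concrete, here is the same sliding-window idea at a much smaller scale. This is an illustrative sketch with a hypothetical `sliding_chunks` function and toy parameters (`chunk_size=10`, `overlap=3`), not the notebook's own chunker. Each chunk starts `chunk_size - overlap` characters after the previous one, so neighbouring chunks share their last and first three characters.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the window already reached the end of the text
        start = end - overlap
    return chunks


sample = "abcdefghijklmnopqrstuvwxy"  # 25 characters
for chunk in sliding_chunks(sample, chunk_size=10, overlap=3):
    print(chunk)
# prints:
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxy
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.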

### Step 7: Import Data into Weaviate
Batch import the data into Weaviate, where Cohere’s vectorizer will automatically generate embeddings.

In [None]:
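# Function to import data into Weaviate
def import_data_to_weaviate(data):
    """Import chunked data into Weaviate."""
    collection = weaviate_client.collections.get("Documents")
    with collection.batch.dynamic() as batch:
        for item in data:
            batch.add_object(properties=item)
    # Objects that could not be imported (e.g. vectorizer errors) are collected here
    failed = collection.batch.failed_objects
    if failed:
        print(f"{len(failed)} chunks failed to import. First failure: {failed[0]}")
    print(f"Imported {len(data) - len(failed)} of {len(data)} chunks into Weaviate.")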

# Import the data
import_data_to_weaviate(data)
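
Running the import cell a second time adds the same chunks again, because each object receives a new random ID. One possible way around this is to derive a stable ID from the file ID and chunk index and pass it as the `uuid` argument of `batch.add_object`. The sketch below uses only the standard library; `chunk_uuid` and the namespace string are assumptions made for illustration, and how Weaviate treats an object whose ID already exists should be confirmed against its documentation.

```python
import uuid

# Arbitrary fixed namespace for this recipe (an assumption; any constant UUID works)
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "box-weaviate-recipe")


def chunk_uuid(file_id: str, chunk_index: int) -> str:
    """Return the same UUID every time for a given file ID and chunk index."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{file_id}:{chunk_index}"))


first = chunk_uuid("12345", 0)
again = chunk_uuid("12345", 0)
other = chunk_uuid("12345", 1)
print(first == again, first == other)  # prints True False
```

Inside the import loop this would look like `batch.add_object(properties=item, uuid=chunk_uuid(item["file_id"], item["chunk_index"]))`.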

### Step 8: Search and Generate
Ask a question and get a response generated from the most relevant imported chunks.

In [None]:
# Define your query here (update this variable as needed)
query = "INSERT A QUESTION HERE BASED ON THE CONTENT IN THE FOLDER"

# Get the Documents collection
documents = weaviate_client.collections.get("Documents")

# Perform a near-text search and generate a single grouped response
gen_response = documents.generate.near_text(
    query=query,
    limit=5,  # Retrieve top 5 relevant chunks
    grouped_task=f"Using the following content chunks from Box documentation, provide a single answer to the question: '{query}'\n\n"
                 "Answer:",
    grouped_properties=["content"], 
    return_properties=["content", "file_name", "chunk_index"]
)

# Print the generated response
if gen_response.generated:
    print("Generated Response:")
    print(gen_response.generated.strip())
    print("\nRelevant Chunks Used:")
    for obj in gen_response.objects:
        print(f"File: {obj.properties['file_name']} (Chunk {obj.properties['chunk_index']}): {obj.properties['content'][:100]}...")
else:
    print("No response generated. Check query or data.")