# Setting Up Workspaces in Argilla

This tutorial shows how to set up and manage workspaces in Argilla using the default credentials on a fresh installation, and then how to run a small Extralit extraction pipeline inside that workspace. It walks through the following steps:

1. Connecting to Argilla with default credentials 🔑
2. Creating your first workspace 🏗️
3. Listing available workspaces 📋
4. Adding PDF documents to the workspace 📄
5. Creating and uploading a schema 📊
6. Running PDF preprocessing 🔍
7. Running LLM extractions 🤖
8. Checking the extraction results ✅

![Argilla Workspace Management](https://raw.githubusercontent.com/argilla-io/argilla/main/docs/assets/argilla_workspace_management.png)

## Introduction

A **workspace** is a space inside your Argilla instance where authorized users can collaborate on datasets. Workspaces are accessible through both the Python SDK and the UI. When you first install Argilla, you'll need to create workspaces to organize your data and user access.

For more details on workspace management, refer to the [Extralit documentation](https://docs.extralit.ai/latest/admin_guide/workspace/).

Let's get started!


## 1. Connecting to Argilla

First, we need to import the Argilla library and connect to our instance using the default credentials.

In [1]:
import argilla as rg
import extralit as ex
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
import os
import tempfile
from pathlib import Path

# Connect to Argilla using default credentials
client = rg.Argilla(api_url="http://localhost:6900/", api_key='argilla.apikey')

print(f"Successfully connected to Argilla at {client.api_url}")

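If this cell fails with `ConnectError: [Errno 61] Connection refused`, nothing is listening at `http://localhost:6900/`. Start the Argilla server (or correct `api_url`) and run the cell again.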
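
A `Connection refused` error is easier to diagnose if the script checks that the server is reachable before creating the client. The sketch below shows one way to do that with only the standard library. The helper names `connection_settings` and `is_reachable` are hypothetical, and the environment variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are an assumption about how your deployment is configured; the fallback values are the defaults used in this tutorial.

```python
import os
import socket
from urllib.parse import urlparse

# Defaults used throughout this tutorial (fresh local installation).
DEFAULT_API_URL = "http://localhost:6900/"
DEFAULT_API_KEY = "argilla.apikey"


def connection_settings(env=None):
    """Read the API URL and key from a mapping, falling back to the defaults.

    The variable names are an assumption; adjust them to your deployment.
    """
    env = os.environ if env is None else env
    return {
        "api_url": env.get("ARGILLA_API_URL", DEFAULT_API_URL),
        "api_key": env.get("ARGILLA_API_KEY", DEFAULT_API_KEY),
    }


def is_reachable(api_url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(api_url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

With these helpers, the connection cell could call `rg.Argilla(**connection_settings())` only after `is_reachable(...)` returns `True`, and print a clear hint otherwise.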

## 2. Creating Your First Workspace

After connecting to Argilla, let's create our first workspace. We'll define a new `Workspace` object and call the `create()` method.

In [None]:
# Define a new workspace
workspace_name = "test-workspace"
new_workspace = rg.Workspace(name=workspace_name, client=client)

# Create the workspace
created_workspace = new_workspace.create()

print(f"Workspace '{workspace_name}' created successfully with ID: {created_workspace.id}")

Workspace 'test-workspace' created successfully with ID: df91e2b9-c712-4474-af82-891478f23d40
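
Re-running the creation cell is likely to fail once the workspace exists, so a get-or-create helper can make the notebook safer to re-run. The sketch below assumes that `client.workspaces(name)` returns `None` when no workspace has that name, as in the Argilla 2.x SDK; verify this against your installed version. The helper name `get_or_create_workspace` is hypothetical, and the workspace class is passed in as an argument so that the function itself does not need a running server.

```python
def get_or_create_workspace(client, name, workspace_cls):
    """Return the workspace called `name`, creating it only if it is missing."""
    # Assumption: looking up an unknown name returns None rather than raising.
    existing = client.workspaces(name)
    if existing is not None:
        return existing
    # Same construction and create() call as in the cell above.
    return workspace_cls(name=name, client=client).create()


# Intended use in this tutorial (requires a running server):
# workspace = get_or_create_workspace(client, workspace_name, rg.Workspace)
```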


## 3. Listing Available Workspaces

Now, let's check all the workspaces available in our Argilla instance.

In [9]:
# List all workspaces
workspaces = client.workspaces

# Display workspace information
print(f"Total number of workspaces: {len(workspaces)}\n")

# In a notebook, this will display a table with workspace information
workspaces

Total number of workspaces: 1



| name | id | updated_at |
| --- | --- | --- |
| test-workspace | df91e2b9-c712-4474-af82-891478f23d40 | 2025-04-16T01:27:58.174591 |
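
Outside a notebook the table is not rendered, so it can help to turn the listing into plain text. The sketch below assumes only that the collection is iterable and that each workspace exposes `name` and `id` attributes, as the output above suggests; `workspace_rows` and `format_rows` are hypothetical helpers.

```python
def workspace_rows(workspaces):
    """Collect (name, id) pairs, sorted by name, from an iterable of workspaces."""
    return sorted((ws.name, str(ws.id)) for ws in workspaces)


def format_rows(rows):
    """Render the pairs as left-aligned text, one workspace per line."""
    width = max((len(name) for name, _ in rows), default=0)
    return "\n".join(f"{name.ljust(width)}  {ws_id}" for name, ws_id in rows)


# Intended use in this tutorial (requires a running server):
# print(format_rows(workspace_rows(client.workspaces)))
```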


## 4. Adding PDF Documents to the Workspace

Let's add two PDF documents to our workspace. For this tutorial, we'll create temporary PDF files. In a real-world scenario, you'd use actual scientific papers.

In [10]:
# In a real-world scenario, you would use actual PDFs. Here we'll create temp files
# Define the paths for our temporary PDF files
temp_dir = tempfile.gettempdir()
pdf_file1 = Path(temp_dir) / "sample1.pdf"
pdf_file2 = Path(temp_dir) / "sample2.pdf"

# Write placeholder files that contain only a PDF header; they are not valid PDFs and stand in for your actual documents
with open(pdf_file1, "wb") as f:
    f.write(b"%PDF-1.5\n%Example Document 1")
    
with open(pdf_file2, "wb") as f:
    f.write(b"%PDF-1.5\n%Example Document 2")

# Create a reference dataframe with metadata for the PDFs
references_df = pd.DataFrame({
    "reference": ["smith2023first", "johnson2022analysis"],
    "file_path": [str(pdf_file1), str(pdf_file2)],
    "title": ["Study on Sample Data", "Analysis of Experimental Results"],
    "authors": ["Smith, J.", "Johnson, A."],
    "year": [2023, 2022]
})

# Save the dataframe to a temporary CSV file
references_csv = Path(temp_dir) / "references.csv"
references_df.to_csv(references_csv, index=False)

print(f"Created reference CSV at {references_csv}")
references_df

Created reference CSV at /tmp/references.csv


|  | reference | file_path | title | authors | year |
| --- | --- | --- | --- | --- | --- |
| 0 | smith2023first | /tmp/sample1.pdf | Study on Sample Data | Smith, J. | 2023 |
| 1 | johnson2022analysis | /tmp/sample2.pdf | Analysis of Experimental Results | Johnson, A. | 2022 |
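
Before importing, it is worth checking that the reference table is internally consistent, since a missing file or a duplicated key is easier to catch here than after an import. The following sketch uses only the standard library. It assumes the CSV has the `reference` and `file_path` columns created above; the helper name `check_references` and the specific checks are illustrative and not part of Extralit.

```python
import csv
from pathlib import Path


def check_references(csv_path):
    """Return a list of human-readable problems found in a reference CSV."""
    problems = []
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # The header is line 1 of the file, so the first data row is line 2.
        for row_number, row in enumerate(csv.DictReader(handle), start=2):
            reference = (row.get("reference") or "").strip()
            file_path = Path(row.get("file_path") or "")

            if not reference:
                problems.append(f"row {row_number}: empty reference")
            elif reference in seen:
                problems.append(f"row {row_number}: duplicate reference {reference!r}")
            seen.add(reference)

            if not file_path.is_file():
                problems.append(f"row {row_number}: file not found: {file_path}")
            else:
                # Only the first bytes are read, so large PDFs stay cheap to check.
                with open(file_path, "rb") as pdf:
                    if pdf.read(5) != b"%PDF-":
                        problems.append(f"row {row_number}: not a PDF header: {file_path}")
    return problems
```

An empty list means the table passed every check; otherwise each entry names the offending row, for example when called as `check_references(references_csv)`.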


In [None]:
new_workspace

Workspace(id=UUID('df91e2b9-c712-4474-af82-891478f23d40') inserted_at=datetime.datetime(2025, 4, 16, 1, 27, 58, 174591) updated_at=datetime.datetime(2025, 4, 16, 1, 27, 58, 174591) name='test-workspace')

In [None]:
# Import the documents into the workspace
# For demonstration purposes, we'll use the extralit client directly
# Initialize the extralit client with the same credentials
extralit_client = ex.Extralit(api_url="http://localhost:6900/", api_key='argilla.apikey')

# Import the documents
result = extralit_client.import_documents(
    workspace=workspace_name,
    papers=str(references_csv),
    metadatas=["title", "authors", "year"]
)

print(f"Imported {len(result)} documents into workspace '{workspace_name}'")

## 5. Creating and Uploading a Schema

Now, let's create a simple schema to define the structure of the data we want to extract from our documents.

In [None]:
# Define a simple schema using Pandera
class Publication(pa.DataFrameModel):
    """
    General information about the publication, extracted once per paper.
    """
    reference: Index[str] = pa.Field(unique=True, check_name=True)
    title: Series[str] = pa.Field()
    authors: Series[str] = pa.Field()
    publication_year: Series[int] = pa.Field(ge=1900, le=2100)
    doi: Series[str] = pa.Field(nullable=True)
    
    class Config:
        singleton = {'enabled': True}  # Indicates this is a document-level schema

# Define a second schema for experimental data
class ExperimentalData(pa.DataFrameModel):
    """
    Experimental data extracted from the paper, may appear multiple times.
    """
    experiment_id: Series[str] = pa.Field()
    sample_size: Series[int] = pa.Field(gt=0)
    study_type: Series[str] = pa.Field()
    result_value: Series[float] = pa.Field()
    significance: Series[float] = pa.Field(le=1.0, ge=0.0)

# Create a schema structure object
from extralit.extraction.models.schema import SchemaStructure

# Save schemas to a temporary JSON file
schema_file = Path(temp_dir) / "schemas.json"
schema_structure = SchemaStructure(schemas={"Publication": Publication, "ExperimentalData": ExperimentalData})
schema_structure.to_json(schema_file)

print(f"Created schema file at {schema_file}")

In [None]:
# Upload the schema to the workspace
result = extralit_client.upload_schemas(
    workspace=workspace_name,
    schemas=str(schema_file)
)

print(f"Uploaded schemas to workspace '{workspace_name}'")

## 6. Running PDF Preprocessing

Next, let's run the PDF preprocessing step to extract text and table content from our documents.

In [None]:
# Run PDF preprocessing
from extralit.preprocessing.pdf import process_pdfs

# Get the references from our dataframe
references = references_df["reference"].tolist()

# Run the preprocessing step
preprocessing_result = process_pdfs(
    workspace=workspace_name,
    references=references,
    text_ocr=["default"],  # Using the default text OCR model
    table_ocr=["default"],  # Using the default table OCR model
    output_dataset="PDF_Preprocessing_Results"
)

print(f"Preprocessing completed for {len(preprocessing_result)} documents")

## 7. Running LLM Extractions

Finally, let's run the LLM extraction step to extract structured data according to our schema.

In [None]:
# Run LLM extractions
from extralit.extraction.llm import extract_data

# Run the extraction step
extraction_result = extract_data(
    workspace=workspace_name,
    references=references,
    output_dataset="Data_Extraction_Results"
)

print(f"LLM extractions completed for {len(extraction_result)} documents")

## 8. Checking Extraction Results

Let's check the results of our extractions by listing the datasets created and viewing the extracted data.

In [None]:
# List datasets in the workspace
datasets = extralit_client.list_datasets(workspace=workspace_name)
print(f"Datasets in workspace '{workspace_name}':\n")
for dataset in datasets:
    print(f"- {dataset.name} ({dataset.id})")

In [None]:
# Export the extracted data
extracted_data = extralit_client.export_data(
    workspace=workspace_name,
    output="temp_output.csv"  # This will save the data to a CSV file
)

# Display the extracted data
if isinstance(extracted_data, dict):
    for schema_name, data_df in extracted_data.items():
        print(f"\nExtracted data for schema '{schema_name}':\n")
        display(data_df)
else:
    print("\nExtracted data:")
    display(extracted_data)
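
The tutorial leaves its scratch files in the system temporary directory. A small cleanup step, sketched below with the standard library, removes them once you are done. The helper name `remove_files` is hypothetical; the names in the usage comment are the variables and files created earlier in this tutorial.

```python
from pathlib import Path


def remove_files(paths):
    """Delete each existing file in `paths` and return the ones that were removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed


# Intended use at the end of this tutorial:
# remove_files([pdf_file1, pdf_file2, references_csv, schema_file, "temp_output.csv"])
```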

## Conclusion

Congratulations! You've exercised the main features of Extralit using the default credentials on a fresh install. You have:

1. Connected to Argilla with default credentials
2. Created a workspace
3. Listed the available workspaces
4. Added PDF documents
5. Created and uploaded a schema
6. Run PDF preprocessing
7. Run LLM extractions
8. Checked the extraction results

This workflow demonstrates the basic process of using Extralit for data extraction from scientific papers. In a real-world scenario, you would upload actual scientific papers and create more complex schemas tailored to your specific data extraction needs.

For more detailed information, refer to the [Extralit documentation](https://docs.extralit.ai/latest/).