<a href="https://colab.research.google.com/github/JasperLS/toolbox/blob/main/dC_upload_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Prerequisites**

First, add your deepset Cloud API key and choose the workspace where you want to upload the files.

In [None]:
API_KEY = "<API_KEY>"
WORKSPACE = "<WORKSPACE_NAME>"
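If you'd rather not paste the key into the notebook, one option is to read both values from environment variables and fall back to the placeholders. This is a minimal sketch: the helper `read_setting` and the variable names `DC_API_KEY` and `DC_WORKSPACE` are hypothetical choices for illustration, not names defined by the SDK.

```python
import os


def read_setting(env_name: str, fallback: str) -> str:
    """Return the environment variable's value, or `fallback` if it is unset or empty."""
    value = os.environ.get(env_name, "").strip()
    return value or fallback


# Hypothetical variable names; set them in your runtime before running this cell.
API_KEY = read_setting("DC_API_KEY", "<API_KEY>")
WORKSPACE = read_setting("DC_WORKSPACE", "<WORKSPACE_NAME>")
```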


# **Install SDK**

Now, let's install the SDK package.

In [None]:
pip install deepset-cloud-sdk

The following dependency is required for Colab only:

In [None]:
pip install nest_asyncio

Let's make sure we have all the necessary imports ready:

In [None]:
import datetime  # standard library; no need to import it through the SDK's private module
import json
import os
from pathlib import Path

from deepset_cloud_sdk.workflows.async_client.files import DeepsetCloudFile, upload, upload_texts

# **Create example files**

This creates some example TXT files with metadata and simple text as contents. deepset Cloud SDK doesn't need any special folder structure.

In [None]:
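def create_files_and_meta():
    """Build 100 in-memory example files, each with a URL and a timestamp as metadata."""
    files = []
    for i in range(100):
        files.append(DeepsetCloudFile(
            name=f"file_{i}.txt",  # keep the .txt extension so the file type is recognized
            text=f"this is file {i}",
            meta={
                "url": f"https://example.com/{i}",
                "timestamp": datetime.datetime.now().timestamp(),
            },
        ))
    return files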

In [None]:
# see example files:

create_files_and_meta()[:2]

In [None]:
# only required for uploading from storage
os.makedirs("Files", exist_ok=True)  # safe to re-run if the folder already exists

In [None]:
# only required for uploading from storage
for file in create_files_and_meta():
    # Write the text file and its metadata sidecar (<file name>.meta.json)
    with open(f"Files/{file.name}", "w") as f:
        f.write(file.text)
    with open(f"Files/{file.name}.meta.json", "w") as f:
        json.dump(file.meta, f)
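The loop above stores each file's metadata in a sidecar named `<file name>.meta.json` in the same folder. As a quick sanity check before uploading from storage, the sketch below lists files that have no sidecar. The helper name `files_missing_meta` is hypothetical and not part of the SDK.

```python
from pathlib import Path


def files_missing_meta(folder: Path) -> list[str]:
    """Return names of non-metadata files in `folder` that lack a `<name>.meta.json` sidecar."""
    missing = []
    for path in sorted(folder.iterdir()):
        # Skip subfolders and the sidecar files themselves
        if not path.is_file() or path.name.endswith(".meta.json"):
            continue
        if not (folder / f"{path.name}.meta.json").exists():
            missing.append(path.name)
    return missing
```

After running the cells above, `files_missing_meta(Path("Files"))` should return an empty list.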

# **Upload files**

This section shows you different options for uploading your files: from memory, using generators, or from storage.

## **Upload from memory**

In [None]:
files = create_files_and_meta()
await upload_texts(
    files=files,
    blocking=True,  # optional, by default True
    timeout_s=300,  # optional, by default 300
    api_key=API_KEY,
    workspace_name=WORKSPACE
)

## **Upload from memory using generators**

If you're using a generator instead of a list, set `blocking` to `False`. This hides the ingestion status. Note that it can take a few minutes for all the uploaded files to show up in deepset Cloud. The more files you upload, the longer it takes.

For generators, the time estimate shown in the upload progress bar is also somewhat inaccurate. To disable the output summary, pass `show_progress=False`.
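A generator yields files lazily and has no known length up front, so a consumer can only process items as they arrive. To illustrate that consumption pattern, here is a minimal plain-Python sketch that reads a generator in fixed-size batches without loading everything into memory. It is for illustration only and does not describe how the SDK works internally; the helper `batched` and the batch size are assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items, pulling from `items` only as needed."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def numbers():
    """Small example generator standing in for a stream of files."""
    for i in range(5):
        yield i


for batch in batched(numbers(), 2):
    print(batch)
```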

In [None]:
import time
import uuid

def generate_files_and_meta():
    for _ in range(10):
        time.sleep(1)  # simulate a blocking task
        file_id = uuid.uuid4()
        now = datetime.datetime.now()
        print(f"[{now:%H:%M:%S}]::: generated file file_{file_id}.txt")
        yield DeepsetCloudFile(
            name=f"file_{file_id}.txt",
            text=f"this is file {file_id}",
            meta={
                "url": f"https://example.com/{file_id}",
                "timestamp": now.timestamp(),
            },
        )

await upload_texts(
    files=generate_files_and_meta(),
    blocking=False,  # must set to False for generators
    timeout_s=300,  # optional, by default 300
    api_key=API_KEY,
    workspace_name=WORKSPACE,
    show_progress=True,
)

## **Upload from storage**

In this example, you upload the files from the local `Files` folder created earlier.


In [None]:
await upload(
    paths=[Path("./Files")],
    blocking=True,  # optional, by default True
    timeout_s=300,  # optional, by default 300
    api_key=API_KEY,
    workspace_name=WORKSPACE
)


# **Show an upload summary**

1. Get the session details. The `list_upload_sessions` function returns an asynchronous generator of all sessions. Its first result gives you the most recently created session.

2. Use the `get_upload_session` function to get a summary of the session. This includes information such as the number of failed files and the number of files that finished processing.
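The first step relies on pulling just one result from an asynchronous generator. The sketch below shows that pattern with a stand-in generator so it runs without an API key. `fake_list_sessions` and `first_batch` are hypothetical names, and the assumption that results arrive as lists is made here for illustration.

```python
import asyncio
from typing import AsyncIterator


async def fake_list_sessions(batch_size: int = 2) -> AsyncIterator[list[str]]:
    """Hypothetical stand-in that yields session IDs in batches."""
    session_ids = ["s1", "s2", "s3"]
    for start in range(0, len(session_ids), batch_size):
        yield session_ids[start:start + batch_size]


async def first_batch(batches: AsyncIterator[list[str]]) -> list[str]:
    """Fetch only the first batch; later batches are never requested."""
    # On Python 3.10+, the built-in `anext(batches)` is equivalent.
    return await batches.__anext__()


# In a notebook cell you could run: await first_batch(fake_list_sessions())
```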

In [None]:
from deepset_cloud_sdk.workflows.async_client.files import get_upload_session, list_upload_sessions

sessions = list_upload_sessions(
    api_key=API_KEY,
    workspace_name=WORKSPACE,
    batch_size=100
)

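# The first result is a batch (list) of sessions; `anext` is a built-in from Python 3.10 on
session_batch = await anext(sessions)
print(session_batch)

details = await get_upload_session(
    session_id=session_batch[-1].session_id,
    api_key=API_KEY,
    workspace_name=WORKSPACE
)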
details

## **Get session details**

This is not currently implemented in the SDK, but you can directly invoke the API to get details of the status of files uploaded during the session.

This endpoint tells you whether the files were successfully processed into deepset Cloud.

In [None]:
pip install httpx

In [None]:
import httpx

session_id = str(details.session_id)

ingestion_status = "FINISHED"  # can be FAILED, FINISHED, or PENDING
limit = 100
page_number = 1

endpoint = f"https://api.cloud.deepset.ai/api/v1/workspaces/{WORKSPACE}/upload_sessions/{session_id}/files"
# Let httpx build and URL-encode the query string
params = {
    "ingestion_status": ingestion_status,
    "limit": limit,
    "page_number": page_number,
}
headers = {
    "accept": "application/json",
    "authorization": f"Bearer {API_KEY}",
}
response = httpx.get(endpoint, params=params, headers=headers)
response.raise_for_status()  # fail loudly on HTTP errors

response.json()

{'data': [], 'has_more': False, 'total': 0}
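
The response is paginated: it carries a `data` list plus `has_more` and `total` fields, and the request takes a `page_number`. A minimal sketch of walking through all pages could look like the following. It assumes that `has_more` is true whenever another page exists and that pages are numbered from 1, as in the request above. `iter_all_files` and `fetch_page` are hypothetical names; `fetch_page` stands for any function that returns the parsed JSON for a given page.

```python
from typing import Callable, Iterator


def iter_all_files(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every entry in `data` across pages until `has_more` is false."""
    page_number = 1
    while True:
        page = fetch_page(page_number)
        yield from page["data"]
        if not page["has_more"]:
            break
        page_number += 1


# Stand-in pages so the sketch runs without network access
fake_pages = {
    1: {"data": [{"name": "file_0.txt"}], "has_more": True, "total": 2},
    2: {"data": [{"name": "file_1.txt"}], "has_more": False, "total": 2},
}

for entry in iter_all_files(lambda page_number: fake_pages[page_number]):
    print(entry["name"])
```

To use it against the real endpoint, `fetch_page` would wrap the `httpx.get` call above and return `response.json()` for the requested page.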