# AI Document Labeler
This project tackles messy filenames. It uses IBM Granite to read the text of ambiguously named documents (e.g. ``Document(1).pdf``, ``WhatsApp Document 2025-06-28(2).pdf``, ``IMPORTANT.pdf``, ``1706.03762v1.pdf``) and generate identifiable, human-readable names for them.

In [1]:
!pip install "git+https://github.com/ibm-granite-community/utils.git" \
docling langchain langchain_community ibm_watsonx_ai langchain_ibm

Collecting git+https://github.com/ibm-granite-community/utils.git
  Cloning https://github.com/ibm-granite-community/utils.git to /tmp/pip-req-build-66rgqyu8
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils.git /tmp/pip-req-build-66rgqyu8
  Resolved https://github.com/ibm-granite-community/utils.git to commit 02416d22ebbeb050763453a44a17ff7d2fc72aa8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
import os
from langchain_ibm import WatsonxLLM
from ibm_watsonx_ai import Credentials, APIClient
from ibm_granite_community.notebook_utils import get_env_var


location = "us-south" # region required for the hackathon; change to your own
api_key = get_env_var("WATSON_API_KEY") # retrieve API key from environment variables/secrets

credentials = Credentials(
    url=f"https://{location}.ml.cloud.ibm.com",
    api_key=api_key,
)
project_id = "a65229f0-74e5-460a-bfab-a75a8f1780d5" # change to your own project id
client = APIClient(credentials=credentials, project_id=project_id)

WATSON_API_KEY loaded from Google Colab secret.


## Converting PDFs to Docling Documents

In [6]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    generate_picture_images=False,
)
format_options = {
    InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
}
converter = DocumentConverter(format_options=format_options)

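sources = []
path = "documents/"
dirlist = os.listdir(path)
print(dirlist)
# Keep only PDF files, matching the extension case-insensitively.
for name in dirlist:
  if name.lower().endswith(".pdf"):
    sources.append(os.path.join(path, name))
print(sources)
# Convert each PDF once and keep the resulting Docling document, keyed by source path.
conversions = { source: converter.convert(source=source).document for source in sources }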

['AR_2020_WEB2.pdf', '2506.11928v1.pdf', 'AI-and-Automation-Unpacked-Hackathon-June-2025.pdf', 'shattered-1.pdf', 's1cmm0dh6593mt7xassj.pdf', 'robert_rules.pdf', '2405.04324v1.pdf', 'woot15-paper-adamsky.pdf', 'bxdhvkbsupo1rzkem6he.pdf']
['documents/AR_2020_WEB2.pdf', 'documents/2506.11928v1.pdf', 'documents/AI-and-Automation-Unpacked-Hackathon-June-2025.pdf', 'documents/shattered-1.pdf', 'documents/s1cmm0dh6593mt7xassj.pdf', 'documents/robert_rules.pdf', 'documents/2405.04324v1.pdf', 'documents/woot15-paper-adamsky.pdf', 'documents/bxdhvkbsupo1rzkem6he.pdf']
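
The file discovery step above can also be written as a small reusable helper. This is a minimal sketch, not part of the original notebook: `find_pdfs` is a hypothetical name, and it returns the paths in sorted order, whereas `os.listdir` returns them in arbitrary order.

```python
from pathlib import Path

def find_pdfs(directory):
    """Return the sorted paths (as strings) of the PDF files directly inside `directory`."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        # Match the extension case-insensitively and ignore subdirectories.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Under these assumptions, `sources = find_pdfs("documents/")` would stand in for the loop above.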



Initialize Granite.

In [7]:
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
llm = WatsonxLLM(
    model_id="ibm/granite-3-3-8b-instruct",
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.TEMPERATURE: 0,
        GenParams.MIN_NEW_TOKENS: 5,
        GenParams.MAX_NEW_TOKENS: 250,
        GenParams.STOP_SEQUENCES: ["\n"],
    },
)

Test if it's working.

In [8]:
template = "You are a personal assistant. Respond to the following user interactions in natural language; one sentence per response.\n\n"

print(llm.invoke(template+"How do I watch an MP4?"))




To watch an MP4 file, you can use various media players such as VLC, Windows Media Player, or QuickTime. Simply locate the file on your device, double-click it, and the video should start playing.



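Pass the first 12,000 characters of each document's Markdown export to the model and ask it to extract a title, or to generate one if the document has none. These titles become the new filenames.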

In [31]:
template = "Extract the title of the following document. If it does not have a title, generate a concise title for the document, intuitive of the document's content. Respond only with the title.\n\n"
titles = []
for source in sources:
  # Export the converted document to Markdown and keep only the first 12,000 characters.
  convdoc = conversions[source].export_to_markdown()[:12000]
  print(f"Now processing: {conversions[source].name}: {convdoc[:70]}...")
  titles.append(llm.invoke(template+"[Document]"+convdoc+"[/Document]"))

Now processing: AR_2020_WEB2: <!-- image -->

<!-- image -->

bridging the gap between poverty and p...
Now processing: 2506.11928v1: ## LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competit...
Now processing: AI-and-Automation-Unpacked-Hackathon-June-2025: IBM TechXchange

## AI &amp; Automation Unpacked Hackathon

hack

## A...
Now processing: shattered-1: ## SHAttered

The first concrete collision attack against SHA-1 https:...
Now processing: s1cmm0dh6593mt7xassj: ## IBM watsonx Hackathons Why participate?

## Grow generative AI skil...
Now processing: robert_rules: <!-- image -->

## ROBERT'S RULES OF POKER

Version 11

By Robert Ciaf...
Now processing: 2405.04324v1: ## Granite Code Models: A Family of Open Foundation Models for Code In...
Now processing: woot15-paper-adamsky: <!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- i...
Now processing: bxdhvkbsupo1rzkem6he: ## IBM TechXchange Dev Day: Virtual Agents

23 January 2025 11 AM 6 PM...
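
Several of the excerpts above begin with `<!-- image -->` placeholders, which spend part of the 12,000-character budget on non-text. One possible refinement is to drop those placeholder lines before truncating. The sketch below assumes a hypothetical helper named `build_excerpt`; whether it improves the generated titles has not been tested here.

```python
import re

# Matches lines that contain only the "<!-- image -->" placeholder seen in the excerpts.
IMAGE_PLACEHOLDER = re.compile(r"^[ \t]*<!-- image -->[ \t]*$", re.MULTILINE)

def build_excerpt(markdown, max_chars=12000):
    """Drop image placeholder lines, collapse runs of blank lines, then truncate."""
    text = IMAGE_PLACEHOLDER.sub("", markdown)
    # Removing placeholders leaves runs of empty lines; reduce them to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text[:max_chars]
```

In the loop above, `convdoc = build_excerpt(conversions[source].export_to_markdown())` would replace the plain slice.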


In [32]:
import re

# Characters that are illegal in filenames.
illegal_chars = r'[<>:"/\\|?*]'

for i in range(len(titles)):
  # Strip leading and trailing whitespace.
  titles[i] = titles[i].strip()

  # Replace filename-illegal characters with underscores.
  titles[i] = re.sub(illegal_chars, '_', titles[i])

  # Trim the title to 251 characters so that, with the extension, it fits in 255.
  titles[i] = titles[i][:251]+".pdf"
titles

["Bridging the Gap Between Poverty and Prosperity_ Midwest Food Bank's 2020 Annual Report.pdf",
 'LiveCodeBench Pro_ How Do Olympiad Medalists Judge LLMs in Competitive Programming_.pdf',
 'IBM TechXchange AI & Automation Unpacked Hackathon Guide.pdf',
 '## SHAttered_ The First Concrete Collision Attack Against SHA-1.pdf',
 '_IBM watsonx Hackathons_ Hands-on Experience and Prizes for Generative AI Innovation_.pdf',
 "Robert's Rules of Poker_ Version 11 by Bob Ciaffone.pdf",
 'Granite Code Models_ A Family of Open Foundation Models for Code Intelligence.pdf',
 '_The Role of Artificial Intelligence in Modern Healthcare_.pdf',
 'IBM TechXchange Dev Day_ Virtual Agents - How to Use the Event Platform.pdf']
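
The cleanup steps above can be collected into one function. The sketch below uses a hypothetical name, `sanitize_title`, and adds one step that the cell above does not perform: it removes a leading Markdown heading marker such as the `## ` that survived in the SHAttered title. Treat that extra step as an assumption about what a clean filename should look like.

```python
import re

# Same set of filename-illegal characters as in the cell above.
ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*]')

def sanitize_title(title, extension=".pdf", max_length=255):
    """Turn a model-generated title into a filename of at most `max_length` characters."""
    # Assumption (not in the original cell): drop a leading heading marker like "## ".
    cleaned = re.sub(r"^\s*#+\s*", "", title).strip()
    cleaned = ILLEGAL_CHARS.sub("_", cleaned)
    # Leave room for the extension within max_length.
    return cleaned[: max_length - len(extension)] + extension
```

With this helper, the loop reduces to `titles = [sanitize_title(t) for t in titles]`.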

Back up the original filenames, then rename the files.

In [33]:
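import json
# Map each new path to its original path so the renames can be reverted.
backup = {}
for source, title in zip(sources, titles):
  backup[path+title] = source

# Save the mapping before renaming so it survives a failure partway through.
with open("backup.json", 'w') as json_file:
    json.dump(backup, json_file, indent=4)

for new_name, original_name in backup.items():
  os.rename(original_name, new_name)

print("Files renamed successfully. Backup saved as backup.json.")

Files renamed successfully. Backup saved as backup.json.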
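
If two documents receive the same generated title, the rename step has a collision to deal with: on POSIX systems `os.rename` silently replaces an existing file, and on Windows it raises `FileExistsError`. A minimal sketch of one way to avoid this, assuming a hypothetical helper `unique_path` that appends a counter to the name:

```python
import os

def unique_path(path):
    """Return `path`, or a variant with " (n)" before the extension if `path` already exists."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    counter = 1
    # Increase the counter until an unused name is found.
    while os.path.exists(f"{stem} ({counter}){ext}"):
        counter += 1
    return f"{stem} ({counter}){ext}"
```

Calling `unique_path(path+title)` immediately before each rename, and storing the returned path in the backup mapping, would keep both files.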


## Recovery Tool

In [38]:
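def recover(backupfile="backup.json"):
  # Ask for confirmation before undoing the renames.
  if input("Are you sure you want to revert filename changes? (y/n)").strip().lower() == 'y':
    with open(backupfile, 'r') as json_file:
      backup = json.load(json_file)
    # Each key is a new path and each value is the original path.
    for new_name, original_name in backup.items():
      os.rename(new_name, original_name)
recover()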

Are you sure you want to revert filename changes? (y/n)y
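
The recovery step assumes every renamed file is still where the backup says it is. A more defensive, non-interactive variant could skip entries it cannot restore and report them. This is a sketch under that assumption, with a hypothetical name `revert_renames`:

```python
import json
import os

def revert_renames(backup_file="backup.json"):
    """Undo the renames recorded in `backup_file` and return the entries that were skipped."""
    with open(backup_file, "r") as json_file:
        backup = json.load(json_file)
    skipped = []
    for new_name, original_name in backup.items():
        # Skip entries whose renamed file is missing or whose original name is taken again.
        if not os.path.exists(new_name) or os.path.exists(original_name):
            skipped.append(new_name)
            continue
        os.rename(new_name, original_name)
    return skipped
```

An empty return value means every file in the backup was restored to its original name.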
