In [6]:
from google.colab import drive
drive.mount('/content/drive') # No need to change

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
import os
os.chdir("/content/drive/MyDrive/Final_Year_Project/FYP_data/dataset_for_donut")
!ls

test  train  validation


**Before you start training, it is strongly recommended that you read through https://github.com/clovaai/donut?tab=readme-ov-file to understand the file structure that Donut expects and the arguments it takes for training and testing.**

**The cell below checks that Colab points to the directory where your train, validation, and test data are stored. Note: this data lives in our shared drive, so there is no need to duplicate it. You should not need to change the path, because it is the same for everyone.**

In [9]:
# Use an absolute path: os.chdir above already moved the working directory,
# so the relative path "drive/MyDrive/..." no longer resolves.
data_path = "/content/drive/MyDrive/Final_Year_Project/FYP_data/dataset_for_donut"
!ls "$data_path"
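
The directory check can also be done in plain Python, which makes it possible to confirm each split folder is complete rather than just present. The sketch below assumes the layout described in the Donut README (one folder per split, each containing a `metadata.jsonl` next to the images); the function name `check_donut_layout` is hypothetical and not part of Donut.

```python
from pathlib import Path

# Assumption: split names and the metadata.jsonl file name follow the Donut README.
SPLITS = ("train", "validation", "test")


def check_donut_layout(root):
    """Return a list of human-readable problems found under `root` (empty if none)."""
    root = Path(root)
    problems = []
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing split folder: {split_dir}")
            continue
        metadata = split_dir / "metadata.jsonl"
        if not metadata.is_file():
            problems.append(f"missing metadata file: {metadata}")
    return problems
```

Calling `check_donut_layout("/content/drive/MyDrive/Final_Year_Project/FYP_data/dataset_for_donut")` should return an empty list when all three splits are in place.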


## Faster GPUs

Users who have purchased one of Colab's paid plans have access to faster GPUs and more memory. You can upgrade your notebook's GPU settings in `Runtime > Change runtime type` in the menu to select from several accelerator options, subject to availability.

The free of charge version of Colab grants access to Nvidia's T4 GPUs subject to quota restrictions and availability.

You can see what GPU you've been assigned at any time by executing the following cell. If the execution result of running the code cell below is "Not connected to a GPU", you can change the runtime by going to `Runtime > Change runtime type` in the menu to enable a GPU accelerator, and then re-execute the code cell.







In [10]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat May 17 16:13:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             44W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
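
The cell above relies on IPython's `!` syntax, so it only works inside a notebook. As a rough equivalent for a plain Python script, the sketch below calls `nvidia-smi` through `subprocess` and checks the exit code instead of searching the output for the word "failed". The function name `gpu_summary` is hypothetical.

```python
import shutil
import subprocess


def gpu_summary():
    """Return nvidia-smi's report, or a fallback message when no GPU is reachable."""
    # No nvidia-smi binary on PATH usually means no NVIDIA driver in this runtime.
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # A non-zero exit code means the tool could not talk to a GPU.
    if result.returncode != 0:
        return "Not connected to a GPU"
    return result.stdout


print(gpu_summary())
```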
                                                

## More memory

Users who have purchased one of Colab's paid plans have access to high-memory VMs when they are available. More powerful GPUs are always offered with high-memory VMs.



You can see how much memory you have available at any time by running the following code cell. If the result of running the cell is "Not using a high-RAM runtime", you can enable a high-RAM runtime via `Runtime > Change runtime type` in the menu, then select High-RAM in the Runtime shape toggle button. Afterwards, re-execute the code cell.


In [11]:
import psutil

ram_gb = psutil.virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 89.6 gigabytes of available RAM

You are using a high-RAM runtime!


**Install dependencies**

In [12]:
!pip install -U datasets
!pip install donut-python
!pip install torch
!pip install torchvision
!pip install transformers==4.25.1
!pip install pytorch-lightning==1.6.4
!pip install timm==0.5.4
!pip install gradio

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 491.5/491.5 kB 10.6 MB/s eta 0:00:00
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 kB 18.2 MB/s eta 0:00:00
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency r

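The install log above is cut off at a pip dependency resolver error, so the versions that ended up installed may differ from the pins in the install cell. A minimal sketch for checking this, assuming Python 3.8+ (for `importlib.metadata`); the helper name `check_pins` is hypothetical, and the pins are copied from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell; adjust if that cell changes.
PINNED = {
    "transformers": "4.25.1",
    "pytorch-lightning": "1.6.4",
    "timm": "0.5.4",
}


def check_pins(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_pins(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={installed} pinned={PINNED[name]} {status}")
```
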
**Change the working directory to the donut-master folder. It is also in the shared Google Drive, so you do not need to duplicate it. You should not need to change the path, because it is the same for everyone.**

In [13]:
%cd /content/drive/MyDrive/Final_Year_Project/donut-master

/content/drive/.shortcut-targets-by-id/1r1yroXj6UTjfQOfOqTW0t6cABXP7P6iL/donut-master


In [14]:
# Test whether your working directory is correct.
!ls

app.py	 dataset  lightning_module.py  __pycache__  setup.py  train.py
config	 donut	  misc		       README.md    synthdog
content  LICENSE  NOTICE	       result	    test.py


**The cell below trains the model.
See https://github.com/clovaai/donut?tab=readme-ov-file for documentation on the arguments.
Note: passing a Hugging Face model ID such as "naver-clova-ix/donut-base" to --pretrained_model_name_or_path pulls a pretrained model directly from Hugging Face (the cell below uses "naver-clova-ix/donut-base-finetuned-cord-v2"). This is only necessary for the first training run, since that run starts from the public checkpoint. For subsequent runs, point this argument at our saved weights instead of Hugging Face; the saved weights appear under --result_path, in a folder named after your --exp_version, once training has finished. Also update --exp_version (for example "v1", "v2", "v3", or any other name) every time someone starts a training run, so that we keep a version history. All of these versions appear in the folder given by --result_path.**

In [15]:
!python train.py --config config/train_cord.yaml \
                --pretrained_model_name_or_path "naver-clova-ix/donut-base-finetuned-cord-v2" \
                --dataset_name_or_paths '["/content/drive/MyDrive/Final_Year_Project/FYP_data/dataset_for_donut"]' \
                --result_path "/content/drive/MyDrive/Final_Year_Project/FYP_data/donut_output" \
                --exp_version "Training_6"

Streaming output truncated to the last 5000 lines.
Normed ED: 0.2540045766590389
Epoch 32: 100% 59/59 [02:07<00:00,  2.17s/it, loss=0.0129, exp_name=train_cord, exp_version=Training_6]
Prediction: <s_merchant>ORION MSC</s_merchant><s_date>16 May 2023</s_date><s_recipient>TEX INVOICE</s_recipient><s_menu><s_nm>Ad-Hoc Maintenance Service</s_nm><s_cnt>1.00</s_cnt><s_unitprice>1,000.00</s_unitprice><s_itemsubtotal>1,000.00</s_itemsubtotal></s_menu><s_subtotal><s_subtotal_price>1,000.00</s_subtotal_price><s_tax_price>60.00</s_tax_price></s_subtotal><s_total><s_total_price>1,000.00</s_total_price></s_total>
Epoch 32: 100% 59/59 [02:08<00:00,  2.18s/it, loss=0.0129, exp_name=train_cord, exp_version=Training_6]
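
Because the training arguments change between runs (a new `--exp_version` each time, and a local checkpoint instead of a Hugging Face ID after the first run), it can help to assemble the command in one place. The sketch below only builds the argument list using the flags shown in the training cell; the helper name `build_train_command` and its defaults are hypothetical.

```python
import json
import shlex


def build_train_command(dataset_dir, result_path, exp_version,
                        pretrained="naver-clova-ix/donut-base",
                        config="config/train_cord.yaml"):
    """Assemble the train.py argument list used in the training cell."""
    return [
        "python", "train.py",
        "--config", config,
        "--pretrained_model_name_or_path", pretrained,
        # train.py receives the dataset paths as a JSON-style list in one string.
        "--dataset_name_or_paths", json.dumps([dataset_dir]),
        "--result_path", result_path,
        "--exp_version", exp_version,
    ]


# First run: start from the Hugging Face checkpoint (the default above).
first = build_train_command("/data/dataset_for_donut", "/data/donut_output", "v1")
# Later run: start from the weights saved by the previous version.
later = build_train_command("/data/dataset_for_donut", "/data/donut_output", "v2",
                            pretrained="/data/donut_output/train_cord/v1")
print(shlex.join(later))
```

The `/data/...` paths are placeholders; in this notebook they would be the shared-drive paths used in the cell above.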

In [16]:
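# Optional memory cleanup (left disabled). Uncomment to reduce CUDA allocator
# fragmentation and release cached GPU memory held by this notebook process.
# import os
# import gc
# import torch

# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# gc.collect()
# torch.cuda.empty_cache()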

**The cell below tests the model. See https://github.com/clovaai/donut?tab=readme-ov-file for documentation on the arguments used.
Tip: The --pretrained_model_name_or_path for test.py is actually the folder that train.py wrote to: its --result_path, then the experiment name (train_cord here), then its --exp_version. In other words, it is where your saved weights are.**

In [17]:

!python test.py --dataset_name_or_path /content/drive/MyDrive/Final_Year_Project/FYP_data/dataset_for_donut \
                --pretrained_model_name_or_path /content/drive/MyDrive/Final_Year_Project/FYP_data/donut_output/train_cord/Training_6 \
                --save_path /content/drive/MyDrive/Final_Year_Project/FYP_data/donut_output/test_output/train_output6.json

2025-05-17 17:23:28.362137: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-17 17:23:28.379996: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747502608.401460   26121 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747502608.407890   26121 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-17 17:23:28.429182: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr
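
The relationship between the training arguments and the checkpoint folder passed to test.py can be written down as a small helper. This is a sketch under one assumption: that the middle path component is the experiment name reported in the training log (`exp_name=train_cord`), which appears to come from the config file name. The function name `checkpoint_dir` is hypothetical.

```python
from pathlib import PurePosixPath


def checkpoint_dir(result_path, exp_name, exp_version):
    """Build the folder where train.py is assumed to have saved its weights."""
    return str(PurePosixPath(result_path) / exp_name / exp_version)


path = checkpoint_dir(
    "/content/drive/MyDrive/Final_Year_Project/FYP_data/donut_output",
    "train_cord",   # exp_name as shown in the training log
    "Training_6",   # the --exp_version passed to train.py
)
print(path)
```

The printed path matches the `--pretrained_model_name_or_path` value used in the test cell above.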

**EXTRA: The code below checks whether your jsonl file has any malformed lines before you pass it to the training or testing code.**

In [18]:
# import json

# with open('/content/drive/MyDrive/Final_Year_Project/FYP_data/dataset_for_donut/train/metadata.jsonl', 'r') as f:
#     for i, line in enumerate(f, 1):
#         try:
#             json.loads(line)
#         except json.JSONDecodeError as e:
#             print(f"Error on line {i}: {e}")
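
A slightly stricter variant of the check above is sketched here. Besides JSON syntax, it verifies that each record has the `file_name` and `ground_truth` keys and that `ground_truth` is itself a JSON string containing `gt_parse` or `gt_parses`. Those field names are taken from the Donut README and should be treated as an assumption about what the training code needs; the function name `validate_metadata` is hypothetical.

```python
import json


def validate_metadata(path):
    """Return a list of (line_number, message) for every problem found in the file."""
    errors = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                errors.append((i, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((i, f"invalid JSON: {e}"))
                continue
            if not isinstance(record, dict):
                errors.append((i, "record is not a JSON object"))
                continue
            for key in ("file_name", "ground_truth"):
                if key not in record:
                    errors.append((i, f"missing key: {key}"))
            gt = record.get("ground_truth")
            if isinstance(gt, str):
                # Assumption: ground_truth is a JSON document encoded as a string.
                try:
                    parsed = json.loads(gt)
                except json.JSONDecodeError as e:
                    errors.append((i, f"ground_truth is not valid JSON: {e}"))
                    continue
                if not isinstance(parsed, dict) or not ({"gt_parse", "gt_parses"} & set(parsed)):
                    errors.append((i, "ground_truth has no gt_parse or gt_parses"))
    return errors
```

For example, `validate_metadata(".../train/metadata.jsonl")` returns an empty list for a clean file, and one `(line_number, message)` pair per problem otherwise.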