# DocAI: Preparing and Exporting a Labeled Dataset

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.


## Description

When labeled documents are exported from a Document AI processor, each one is saved in its own parent folder. The Python code in this guide generates a CSV file that maps each labeled document to its parent folder name. The first column of the CSV file contains the parent folder name, and the second column contains the name of the file within that folder.


## Prerequisites
* A basic knowledge of Python.
* A Python environment, such as a Jupyter notebook on Vertex AI.
* Permission on the Google Cloud project to access the files and to use [Document AI](https://cloud.google.com/document-ai/docs/overview).


## Steps to prepare labeled dataset

### 1. Create Processor

* Create a DocAI Processor as per your requirement.
* Further reading: [Creating and managing processors](https://cloud.google.com/document-ai/docs/create-processor#create-processor)


<img src="./Images/create_processor.png" width=800 height=400></img>

### 2. Dataset Location

Once the processor is created, set the dataset location in the Train tab. The dataset folder at that location must be empty.


<img src="./Images/create_dataset.png" width=800 height=400></img>

### 3. Import Dataset
Import the PDF files that need to be labeled.

<img src="./Images/import_documents_1.png" width=800 height=400></img>
<img src="./Images/import_documents_2.png" width=800 height=400></img>

### 4. Add Schema

Click on the button “Edit Schema” and add the required entities that need to be labeled in the dataset.


<img src="./Images/edit_schema.png" width=800 height=400></img>

<img src="./Images/edit_schema_1.png" width=800 height=400></img>

### 5. Label the dataset
Label the entities and click on the button “Mark as labeled”.


<img src="./Images/labeled_data.png" width=800 height=400></img>

### 6. Export the Dataset

<img src="./Images/export_dataset.png" width=800 height=400></img>

#### The dataset that has been exported is stored in individual folders, as shown below.


<img src="./Images/individual_folders.png" width=800 height=400></img>

* **Note**: Once the dataset is exported, each labeled JSON file is stored in its own folder with a unique, randomly generated name. Follow the next steps to map each folder name to its file name and to move all the files from the individual folders into a single folder.


### 7. Run the code

### Code to map the folder name

Replace the values of `project_id`, `bucket_name`, and `folder_name` with your own:
* `project_id`: the ID of your Google Cloud project
* `bucket_name`: the name of the Cloud Storage bucket that holds the exported dataset
* `folder_name`: the path of the export folder inside the bucket, without the bucket name as a prefix
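
The split between `bucket_name` and `folder_name` is easy to get wrong when starting from a full `gs://` URI. The following is a minimal sketch, not part of the original tool, of a hypothetical helper `split_gcs_uri` that separates a URI into the two values. The example URI is made up for illustration.

```python
def split_gcs_uri(gcs_uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket_name, folder_name).

    Hypothetical helper for illustration; not part of the original tool.
    """
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}: {gcs_uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the folder path
    bucket_name, _, folder_name = gcs_uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError(f"No bucket name found in {gcs_uri!r}")
    return bucket_name, folder_name


# Example with a made-up URI
bucket_name, folder_name = split_gcs_uri("gs://my-bucket/exports/labeled/")
print(bucket_name)  # my-bucket
print(folder_name)  # exports/labeled/
```

The two returned values can then be assigned to `bucket_name` and `folder_name` in the cell that follows.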


In [None]:
import csv

from google.cloud import storage

# Modify the parameters below as per your requirements
project_id = "<your-project-id>"
bucket_name = "<your-bucket-name>"
folder_name = "<your-folder-name>"

storage_client = storage.Client(project=project_id)
blobs = storage_client.list_blobs(bucket_name, prefix=folder_name)

# The with block closes report.csv even if listing the bucket fails
with open("report.csv", "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Folder Name", "File Name"])

    for blob in blobs:
        parts = blob.name.split("/")
        # Skip folder placeholder objects (names ending in "/") and
        # objects that have no parent folder
        if len(parts) < 2 or not parts[-1]:
            continue
        # The last two path components are the parent folder and the file
        csv_writer.writerow([parts[-2], parts[-1]])

print("Done")

The above code generates a CSV file named `report.csv`. The first column contains the name of the parent folder, and the second column contains the name of the file within that parent folder.
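
Once `report.csv` exists, it can be loaded back to look up which folder a file came from, which is useful after the files have been moved into a single folder. The following is a minimal sketch under the assumption that the CSV has the `Folder Name` and `File Name` header row written above; `load_folder_mapping` is a hypothetical helper name, not part of the original tool.

```python
import csv
from collections import defaultdict


def load_folder_mapping(csv_path: str) -> dict[str, list[str]]:
    """Return {file name: [parent folder names]} read from the report CSV.

    A list is kept per file name because the same file name could,
    in principle, appear under more than one parent folder.
    """
    mapping = defaultdict(list)
    with open(csv_path, newline="") as csv_file:
        # DictReader uses the header row as the keys of each row
        for row in csv.DictReader(csv_file):
            mapping[row["File Name"]].append(row["Folder Name"])
    return dict(mapping)
```

Calling `load_folder_mapping("report.csv")` in the same working directory returns the mapping as a plain dictionary.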

### 8. Command to copy files from multiple subfolders into a single folder

The following `gsutil` command copies the JSON files from all subfolders into a single folder.

Replace `<source-folder>` and `<destination-folder>` with your own paths and run the command in Cloud Shell.


In [None]:
gsutil -m cp gs://<source-folder>/*/*.json gs://<destination-folder>
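
Because the command above flattens all subfolders into one destination, two files that share a name in different subfolders would be written to the same destination object, and the later copy would overwrite the earlier one. The following is a minimal sketch of a check that can be run on the object names before copying. `find_name_collisions` is a hypothetical helper, the object names are made up, and the check looks at `.json` objects at any depth, which is a simplification of the `*/*.json` pattern.

```python
from collections import defaultdict


def find_name_collisions(object_names: list[str]) -> dict[str, list[str]]:
    """Return {file name: [object names]} for file names used more than once.

    Hypothetical pre-flight check; only .json objects are considered,
    loosely mirroring the *.json pattern in the gsutil command.
    """
    by_file_name = defaultdict(list)
    for name in object_names:
        # The file name is the part after the last "/"
        file_name = name.rsplit("/", 1)[-1]
        if file_name.endswith(".json"):
            by_file_name[file_name].append(name)
    return {f: names for f, names in by_file_name.items() if len(names) > 1}


# Made-up object names for illustration
names = [
    "export/1234/invoice_a.json",
    "export/5678/invoice_a.json",
    "export/9012/invoice_b.json",
]
print(find_name_collisions(names))
# {'invoice_a.json': ['export/1234/invoice_a.json', 'export/5678/invoice_a.json']}
```

An empty result means no two `.json` objects in the list share a file name.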