# Extracting Embedded Links from PDF Files

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description
This document shows how to extract hyperlinks from PDF files stored in a Google Cloud Storage (GCS) bucket and save them, along with associated information (page number and hyperlink text), to an Excel workbook. The workbook contains one worksheet per PDF file, named after the file (the last 30 characters of the file name without its extension). Each row of a worksheet represents one hyperlink found in that PDF, with its page number and text.

## Prerequisites

1. Access to a Vertex AI notebook or Google Colab
2. Python
3. Access to the Google Cloud Storage bucket that contains the PDF files

## Step-by-Step Procedure

### 1. Install the required libraries

In [11]:
!pip install PyMuPDF
!pip install xlsxwriter
!pip install google-cloud-storage


In [4]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

--2024-06-12 09:22:12--  https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29735 (29K) [text/plain]
Saving to: ‘utilities.py’


2024-06-12 09:22:12 (18.7 MB/s) - ‘utilities.py’ saved [29735/29735]



### 2. Import the required libraries/packages

In [7]:
from google.cloud import storage
from utilities import file_names
import io
import xlsxwriter
import fitz
import os

### 3. Input Details

<ul>
    <li><b>input_path : </b>The GCS folder path that contains the input PDF files, in the form <code>gs://bucket_name/path_to_folders/</code></li>
</ul>

In [9]:
input_path = "gs://bucket_name/path_to_folders/"
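The code in step 4 takes the bucket name from this path by splitting it on `/`. As a minimal sketch of the same idea, the helper below separates a `gs://` path into its bucket and folder prefix using only the standard library. The name `split_gcs_path` is hypothetical and is not part of `utilities.py`.

```python
from urllib.parse import urlparse


def split_gcs_path(gcs_path):
    """Split a gs:// path into (bucket_name, object_prefix)."""
    parsed = urlparse(gcs_path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Expected a gs://bucket/... path, got: {gcs_path!r}")
    # The "host" part of the URL is the bucket name; the remainder,
    # without its leading slash, is the folder prefix inside the bucket.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, prefix = split_gcs_path("gs://bucket_name/path_to_folders/")
print(bucket_name)  # bucket_name
print(prefix)       # path_to_folders/
```

Validating the scheme up front gives a clear error for a mistyped path instead of a failure later during the bucket lookup.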

### 4. Execute the code

In [10]:
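def extract_hyperlinks_from_gcs(bucket_name, pdf_paths, excel_file):
    """Write the hyperlinks of each PDF in a GCS bucket to one worksheet per PDF."""
    client = storage.Client()
    # Look up the bucket once instead of once per PDF.
    bucket = client.get_bucket(bucket_name)
    workbook = xlsxwriter.Workbook(excel_file)

    for pdf_path in pdf_paths:
        # Download the PDF into memory and open it with PyMuPDF.
        blob = bucket.blob(pdf_path)
        pdf_data = blob.download_as_bytes()  # download_as_string() is deprecated
        pdf_content = io.BytesIO(pdf_data)
        reader = fitz.open(stream=pdf_content, filetype="pdf")

        # Excel limits worksheet names to 31 characters, so keep the last 30
        # characters of the file name (without its extension).
        worksheet_name = os.path.splitext(os.path.basename(pdf_path))[0][-30:]
        worksheet = workbook.add_worksheet(worksheet_name)

        # Header row
        row = 0
        col = 0
        worksheet.write(row, col, "Page Number")
        worksheet.write(row, col + 1, "HyperlinkText")
        worksheet.write(row, col + 2, "Hyperlink")
        row += 1

        # Page numbers are written 1-based so they match a PDF viewer.
        for page_num, page in enumerate(reader, start=1):
            for link in page.links():
                # Only external links carry a "uri" key; internal links are skipped.
                if "uri" in link:
                    uri = link["uri"]
                    # "from" is the clickable rectangle; read the text inside it.
                    link_rect = fitz.Rect(link["from"])
                    text = page.get_textbox(link_rect)
                    worksheet.write(row, col, page_num)
                    worksheet.write(row, col + 1, text)
                    worksheet.write(row, col + 2, uri)
                    row += 1
        reader.close()
    workbook.close()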


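# Example usage
# The bucket name is the third "/"-separated part of "gs://bucket_name/...".
input_storage_bucket_name = input_path.split("/")[2]
# file_names() returns a tuple; its second element is a dict whose values
# are used here as the object paths inside the bucket.
pdf_files = file_names(input_path)[1].values()
# Keep only the PDF paths found in the bucket.
pdf_paths = [path for path in pdf_files if path.lower().endswith(".pdf")]
excel_file = "hyperlinks_combined.xlsx"
extract_hyperlinks_from_gcs(input_storage_bucket_name, pdf_paths, excel_file)
print(f"Hyperlinks saved to '{excel_file}'.")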

Hyperlinks saved to 'hyperlinks_combined.xlsx'.
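The function above names each worksheet after the last 30 characters of the PDF file name. Excel also rejects the characters `[ ] : * ? / \` in worksheet names and requires names to be unique within a workbook (compared case-insensitively), so two PDFs with the same file name in different folders, or a file name containing one of those characters, would cause `add_worksheet` to fail. The sketch below shows one possible way to derive a safe name. `safe_sheet_name` is a hypothetical helper, not part of the original tool, and it does not cover every Excel rule (for example, names starting or ending with an apostrophe).

```python
import os
import re

# Characters that Excel does not allow in worksheet names.
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME_LEN = 31


def safe_sheet_name(pdf_path, used_names):
    """Return a worksheet name that is valid in Excel and not yet in used_names."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    base = _INVALID_SHEET_CHARS.sub("_", stem)[-MAX_SHEET_NAME_LEN:] or "sheet"
    name = base
    counter = 1
    # Excel compares worksheet names case-insensitively, so track lowercase names.
    while name.lower() in used_names:
        suffix = f"_{counter}"
        name = base[: MAX_SHEET_NAME_LEN - len(suffix)] + suffix
        counter += 1
    used_names.add(name.lower())
    return name


used = set()
print(safe_sheet_name("folder/report[v2].pdf", used))  # report_v2_
print(safe_sheet_name("other/report[v2].pdf", used))   # report_v2__1
```

To try it in the function above, one would create `used = set()` before the loop and pass `safe_sheet_name(pdf_path, used)` to `workbook.add_worksheet`.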


### 5. Output

The extracted hyperlinks, along with their page numbers and hyperlink text, are saved to the Excel workbook `hyperlinks_combined.xlsx`. The workbook contains one worksheet per PDF file, named after the file, and each row represents one hyperlink found in that PDF.
<br><hr>

<b>Output File example</b><br><br>
<img src="./images/Excel_image.png">