# Rasterize Unsupported Document Types

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Objective

This document provides code to convert unsupported document formats (raw text, `.txt` files, and Word `.docx` files) to PDF, a supported format.

## Prerequisites
* Access to a Vertex AI notebook or Google Colab
* Python

## Step-by-Step Procedure

### 1. Raw text or .txt file to PDF

#### 1.1. Importing Required Modules

In [None]:
# installing required libraries
!pip install reportlab
!pip install pypandoc
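
Depending on the release, the `pypandoc` package may not bundle the `pandoc` executable, and pandoc itself relies on an external PDF engine to write PDF output. The following is a minimal sketch of a pre-flight check that uses only the standard library. The helper name `check_conversion_tools` is hypothetical, and checking for `pdflatex` assumes pandoc's default PDF engine is the one in use.

```python
import shutil
from typing import Dict, Sequence


def check_conversion_tools(
    tools: Sequence[str] = ("pandoc", "pdflatex"),
) -> Dict[str, bool]:
    """Report which external executables can be found on PATH.

    The default tool names are assumptions: "pandoc" is the converter that
    pypandoc wraps, and "pdflatex" is pandoc's default PDF engine.
    """
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in check_conversion_tools().items():
        print(f"{tool}: {'found' if found else 'missing'}")
```

If a tool is reported as missing, install it with the system package manager before running the Word-to-PDF step.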

In [None]:
# Only the canvas module is needed to draw text onto PDF pages
from reportlab.pdfgen import canvas

#### 1.2. Run the code

In [None]:
def create_pdf_from_text(raw_text: str, output_pdf_path: str) -> None:
    """
    Creates a PDF from the given raw text.

    Args:
        raw_text (str): The text to be included in the PDF.
        output_pdf_path (str): The output file path for the generated PDF.

    Returns:
        None
    """
    # Define the page size in points (1 point = 1/72 inch); reportlab does not work in pixels
    width, height = 1024, 1024

    # Create a canvas
    c = canvas.Canvas(output_pdf_path, pagesize=(width, height))

    # Set font properties
    font_name = "Courier"  # Monospaced font
    font_size = 12
    line_height = font_size * 1.5  # Row spacing

    # Set font
    c.setFont(font_name, font_size)

    # Starting position
    x = 50
    y = height - 50

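    # Draw the text line by line, starting a new page when the bottom margin is reached.
    # The check runs before drawing so no new page is started after the last line.
    for line in raw_text.splitlines():
        if y < 50:
            c.showPage()
            # showPage() resets the graphics state, so set the font again
            c.setFont(font_name, font_size)
            y = height - 50
        c.drawString(x, y, line)
        y -= line_height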

    # Save the PDF
    c.save()


# Example usage
raw_text = """This is an example of raw text.
It will be rendered in a PDF using a monospaced font.
Each line is spaced 1.5 times the font size.
The canvas size is 1024x1024 points."""

output_pdf_path = "output.pdf"
create_pdf_from_text(raw_text, output_pdf_path)
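
# Example usage for .txt files: read the file, then convert its contents.
# Assumes raw_data.txt exists in the working directory and is UTF-8 encoded.
with open("raw_data.txt", "r", encoding="utf-8") as file:
    raw_text_from_file = file.read()
output_pdf_path = "output_txt.pdf"
create_pdf_from_text(raw_text_from_file, output_pdf_path)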
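
`create_pdf_from_text` draws each input line with a single `drawString` call, so a line wider than the page runs past the right edge. Below is a minimal sketch of one way to pre-wrap the text before passing it in. It relies on Courier being monospaced, with every glyph 0.6 times the font size wide. The helper name `wrap_for_courier` and the equal right-hand margin of 50 points are assumptions, not part of the original code.

```python
import textwrap
from typing import List

# Every Courier glyph is 600/1000 of the font size wide
COURIER_CHAR_WIDTH_EM = 0.6


def wrap_for_courier(
    raw_text: str,
    page_width: float = 1024,
    margin: float = 50,
    font_size: float = 12,
) -> List[str]:
    """Split text into lines that fit between the left and right margins."""
    usable_width = page_width - 2 * margin
    max_chars = max(1, int(usable_width / (font_size * COURIER_CHAR_WIDTH_EM)))

    wrapped: List[str] = []
    for line in raw_text.splitlines():
        # textwrap.wrap returns [] for blank lines; keep them as empty rows
        wrapped.extend(textwrap.wrap(line, width=max_chars) or [""])
    return wrapped
```

The wrapped lines can then be joined and rendered, for example `create_pdf_from_text("\n".join(wrap_for_courier(raw_text)), "output.pdf")`. Note that `textwrap` expands tabs and replaces other whitespace characters with spaces, so column-aligned text may shift.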

### 2. Word file to PDF

#### 2.1. Importing Required Modules

In [None]:
import pypandoc

#### 2.2. Run the code

In [None]:
def convert_docx_to_pdf(input_docx_path: str, output_pdf_path: str) -> None:
    """
    Converts a DOCX file to a PDF using Pandoc.

    Args:
        input_docx_path (str): The file path of the input DOCX file to be converted.
        output_pdf_path (str): The file path where the output PDF should be saved.

    Returns:
        None
    """
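    # Pandoc writes PDF through an external PDF engine (pdflatex by default),
    # so pandoc and a LaTeX distribution must be installed on the system.
    pypandoc.convert_file(input_docx_path, "pdf", outputfile=output_pdf_path)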


# Example usage
convert_docx_to_pdf("1.docx", "output_docx.pdf")
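
The conversion above fails inside pandoc if the input path is wrong, which can produce a hard-to-read error. The following is a hedged sketch of a wrapper that validates the input first and passes an explicit PDF engine to pandoc through `extra_args`. The function name `convert_docx_to_pdf_checked` is hypothetical, and the choice of `xelatex` (which handles Unicode text natively) is an assumption that requires that engine to be installed.

```python
from pathlib import Path


def convert_docx_to_pdf_checked(
    input_docx_path: str,
    output_pdf_path: str,
    pdf_engine: str = "xelatex",  # assumed engine; must be installed separately
) -> None:
    """Validate the input path, then convert DOCX to PDF with pandoc."""
    src = Path(input_docx_path)
    if src.suffix.lower() != ".docx":
        raise ValueError(f"Expected a .docx file, got: {src.name}")
    if not src.is_file():
        raise FileNotFoundError(f"Input file not found: {src}")

    # Imported here so the validation above works even without pypandoc installed
    import pypandoc

    pypandoc.convert_file(
        str(src),
        "pdf",
        outputfile=str(output_pdf_path),
        extra_args=[f"--pdf-engine={pdf_engine}"],
    )
```

Example usage: `convert_docx_to_pdf_checked("1.docx", "output_docx.pdf")`.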

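### 3. Output
Running the code converts each input document to PDF and writes it to the given output path (`output.pdf`, `output_txt.pdf`, and `output_docx.pdf` in the examples above).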
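
As a quick sanity check on the generated files, the sketch below tests whether a file starts with the `%PDF-` header that PDF files begin with. The helper name `looks_like_pdf` is hypothetical, and the check confirms only the header, not that the rest of the file is valid.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF header bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(5) == b"%PDF-"
```

For example, `looks_like_pdf("output.pdf")` can be called after `create_pdf_from_text` to confirm that a file was written.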