extract images from PDF #60

ghosttie · 2022-03-08T23:51:01Z

Some of the PDF files that my application processes were created by scanners, so they're basically a PDF containing nothing but one image per page.

I would like to extract the images so that I can deal with them as images rather than PDFs. I don't want to use GetImage to convert the whole page to an image because this will include the margins around the image.

Looking at the source code, it looks like Docnet includes the PDFium calls required to extract images from PDF files:

docnet/src/Docnet.Core/Bindings/PdfiumWrapper.cs

Lines 3211 to 3214 in 728e6c9

    
    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFImageObj_GetBitmap")]
    internal static extern IntPtr FPDFImageObjGetBitmap(IntPtr image_object);

docnet/src/Docnet.Core/Bindings/PdfiumWrapper.cs

Lines 1869 to 1887 in 728e6c9

    
    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetBuffer")]
    internal static extern IntPtr FPDFBitmapGetBuffer(IntPtr bitmap);

    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetWidth")]
    internal static extern int FPDFBitmapGetWidth(IntPtr bitmap);

    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetHeight")]
    internal static extern int FPDFBitmapGetHeight(IntPtr bitmap);

    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetStride")]
    internal static extern int FPDFBitmapGetStride(IntPtr bitmap);

It would be great if these calls were exposed so that image extraction could be done through Docnet's public API.
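To make the request concrete, here is a rough sketch of the kind of extraction I have in mind. It is not Docnet code: `PdfiumImageSketch`, `RawImage`, `ExtractImages` and `PackRows` are hypothetical names, and the P/Invoke declarations are redeclared because Docnet's bindings are internal. It assumes the PDFium library is already initialised and that `page` is a valid page handle from `FPDF_LoadPage`; how to obtain that handle through Docnet is not covered here. The pixel format is not inspected either, so a real implementation would also need `FPDFBitmap_GetFormat`.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical container for one extracted bitmap (not a Docnet type).
public sealed class RawImage
{
    public int Width, Height, Stride;
    public byte[] Pixels; // Stride * Height bytes, rows may include padding
}

// Hypothetical helper; assumes PDFium is initialised and "pdfium" is loadable.
public static class PdfiumImageSketch
{
    private const int FPDF_PAGEOBJ_IMAGE = 3; // image object type in fpdf_edit.h
    private const string Lib = "pdfium";
    private const CallingConvention Cc = CallingConvention.Cdecl;

    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFPage_CountObjects")]
    private static extern int FPDFPageCountObjects(IntPtr page);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFPage_GetObject")]
    private static extern IntPtr FPDFPageGetObject(IntPtr page, int index);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFPageObj_GetType")]
    private static extern int FPDFPageObjGetType(IntPtr pageObject);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFImageObj_GetBitmap")]
    private static extern IntPtr FPDFImageObjGetBitmap(IntPtr imageObject);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFBitmap_GetBuffer")]
    private static extern IntPtr FPDFBitmapGetBuffer(IntPtr bitmap);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFBitmap_GetWidth")]
    private static extern int FPDFBitmapGetWidth(IntPtr bitmap);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFBitmap_GetHeight")]
    private static extern int FPDFBitmapGetHeight(IntPtr bitmap);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFBitmap_GetStride")]
    private static extern int FPDFBitmapGetStride(IntPtr bitmap);
    [DllImport(Lib, CallingConvention = Cc, EntryPoint = "FPDFBitmap_Destroy")]
    private static extern void FPDFBitmapDestroy(IntPtr bitmap);

    // Copies every image object on the page into managed memory.
    public static List<RawImage> ExtractImages(IntPtr page)
    {
        var images = new List<RawImage>();
        int count = FPDFPageCountObjects(page);
        for (int i = 0; i < count; i++)
        {
            IntPtr obj = FPDFPageGetObject(page, i);
            if (obj == IntPtr.Zero || FPDFPageObjGetType(obj) != FPDF_PAGEOBJ_IMAGE)
                continue; // skip text, paths, and other non-image objects

            IntPtr bitmap = FPDFImageObjGetBitmap(obj);
            if (bitmap == IntPtr.Zero)
                continue;
            try
            {
                int height = FPDFBitmapGetHeight(bitmap);
                int stride = FPDFBitmapGetStride(bitmap);
                var pixels = new byte[stride * height];
                Marshal.Copy(FPDFBitmapGetBuffer(bitmap), pixels, 0, pixels.Length);
                images.Add(new RawImage
                {
                    Width = FPDFBitmapGetWidth(bitmap),
                    Height = height,
                    Stride = stride,
                    Pixels = pixels
                });
            }
            finally
            {
                FPDFBitmapDestroy(bitmap); // the caller owns the returned bitmap
            }
        }
        return images;
    }

    // Removes per-row padding so each row is exactly width * bytesPerPixel bytes.
    public static byte[] PackRows(byte[] strided, int width, int height,
                                  int stride, int bytesPerPixel)
    {
        int rowBytes = width * bytesPerPixel;
        if (rowBytes > stride)
            throw new ArgumentException("Row is wider than the stride.");
        var packed = new byte[rowBytes * height];
        for (int y = 0; y < height; y++)
            Buffer.BlockCopy(strided, y * stride, packed, y * rowBytes, rowBytes);
        return packed;
    }
}
```

`PackRows` is separate because the stride can be larger than the visible row width, and most image encoders expect tightly packed rows. The `bytesPerPixel` value has to come from the bitmap's actual format, which this sketch leaves to the caller.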

Modest-as · 2022-04-24T12:48:48Z

Duplicate of #39. We want to expose a way to detect whether pages contain embedded images and to extract them.

Modest-as closed this as completed Apr 24, 2022
