How to parse embedded files (OLE objects) in pptx/docx #644

Open
hong1997 opened this issue Dec 3, 2019 · 5 comments

@hong1997 hong1997 commented Dec 3, 2019

How can I parse embedded files (OLE objects) in pptx/docx files?
Most of them are OLE objects, such as object1.bin.
Is there a good way to parse them?
After unpacking the OLE objects, I found several different internal layouts:
[Four screenshots showing the contents of different unpacked OLE objects]

I haven't found a good general way to do this.
I checked the source code of the Tika parser, and it extracts them with a rule-based approach...

@adamshakhabov adamshakhabov commented Dec 8, 2019

Use the following code to get the OLE objects from the first slide of a presentation:

using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;

public static IEnumerable<DocumentFormat.OpenXml.Presentation.GraphicFrame> GetOleObjects(string pptxFilePath)
{
    using (var doc = PresentationDocument.Open(pptxFilePath, false))
    {
        // Get the first slide
        var sld = doc.PresentationPart.SlideParts.First().Slide;
        // OLE objects are stored in graphic frame elements
        var oleFrames = new List<DocumentFormat.OpenXml.Presentation.GraphicFrame>();
        foreach (var frame in sld.CommonSlideData.ShapeTree.OfType<DocumentFormat.OpenXml.Presentation.GraphicFrame>())
        {
            if (frame.Descendants<DocumentFormat.OpenXml.Presentation.OleObject>().Any())
            {
                oleFrames.Add(frame);
            }
        }

        return oleFrames;
    }
}

@hong1997 hong1997 commented Dec 8, 2019

Hi adamshakhabov, thanks for your reply! As far as I know, the OLE objects are stored in the embedded object parts (X.MainDocumentPart.EmbeddedObjectParts), and I am asking for a way to parse an OLE object, not just to get it.
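
For readers who land here with the same question, a first step before any real parsing is to find out which container format an embedded part uses. The sketch below is only an illustration and not part of the Open XML SDK: `PayloadKind`, `EmbeddedPayloadSniffer`, and `Classify` are hypothetical names. It assumes that a part's stream starts either with the OLE compound file signature or with a ZIP local file header, and reports anything else as unknown.

```csharp
using System;
using System.IO;

// Hypothetical names for illustration; none of this is part of the Open XML SDK.
public enum PayloadKind { OleCompoundFile, ZipPackage, Unknown }

public static class EmbeddedPayloadSniffer
{
    // Signature at the start of an OLE compound file (structured storage).
    private static readonly byte[] CompoundFileMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of a ZIP local file header ("PK" followed by 0x03 0x04).
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // Reads up to 8 bytes from the current position and compares them
    // against the known signatures. The stream is not rewound afterwards.
    public static PayloadKind Classify(Stream stream)
    {
        var header = new byte[8];
        int read = 0;

        // Stream.Read may return fewer bytes than requested, so loop
        // until the buffer is full or the stream ends.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        if (StartsWith(header, read, CompoundFileMagic)) return PayloadKind.OleCompoundFile;
        if (StartsWith(header, read, ZipMagic)) return PayloadKind.ZipPackage;
        return PayloadKind.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (buffer[i] != magic[i]) return false;
        }
        return true;
    }
}
```

With the SDK, the stream would presumably come from calling `GetStream()` on each part in `MainDocumentPart.EmbeddedObjectParts`. This only identifies the outer container. Reading the storages and streams inside a compound file still needs a separate compound file reader, and the streams present vary with the application that created the object.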

@adamshakhabov adamshakhabov commented Dec 8, 2019

Hi @hong1997!

I don't think the Open XML SDK has a specific method for reading an OLEObject element (parsing its properties). Can you say more precisely which feature of the OLEObject you are trying to parse?

Also, it would help if you attached a pptx file that contains this kind of OLEObject.

@ThomasBarnekow ThomasBarnekow commented Dec 8, 2019

@hong1997 and @adamshakhabov, GitHub issues are not the place to ask and discuss questions regarding Open XML SDK library usage. You should ask usage-related questions on stackoverflow.com, where you will already find a large number of questions and answers tagged with openxml or openxml-sdk.

In this specific case, another user already asked about how he could extract OLE-embedded files from Word documents, and I provided an accepted answer.

@hong1997 hong1997 commented Dec 8, 2019

@ThomasBarnekow, thanks for the info, I will close the issue. However, the answer you provided handles only one kind of OLE structure. You can see from my description that only the last kind of OLE object can be handled by the class you provided.
