diff --git a/docs/source/topic/converting_pdf_to_text.rst b/docs/source/topic/converting_pdf_to_text.rst
index acb678bd..5194b114 100644
--- a/docs/source/topic/converting_pdf_to_text.rst
+++ b/docs/source/topic/converting_pdf_to_text.rst
@@ -11,7 +11,7 @@ the characters and their placement.
 This makes extracting meaningful pieces of text from PDF files difficult. The
 characters that compose a paragraph are no different from those that compose
 the table, the page footer or the description of a figure. Unlike
-other documents formats, like a `.txt` file or a word document, the PDF format
+other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.
 
 A PDF document does consists of a collection of objects that together describe
@@ -29,10 +29,10 @@ PDFMiner attempts to reconstruct some of those structures by using heuristics
 on the positioning of characters. This works well for sentences and
 paragraphs because meaningful groups of nearby characters can be made.
 
-The layout analysis consist of three different stages: it groups characters
+The layout analysis consists of three different stages: it groups characters
 into words and lines, then it groups lines into boxes and finally it groups
 textboxes hierarchically. These stages are discussed in the following
-sections. The resulting output of the layout analysis is an ordered hierarchy
+sections. The resulting output of the layout analysis is an ordered hierarchy
 of layout objects on a PDF page.
 
 .. figure:: ../_static/layout_analysis_output.png
@@ -48,8 +48,8 @@ Grouping characters into words and lines
 
 The first step in going from characters to text is to group characters in a
 meaningful way. Each character has an x-coordinate and a y-coordinate for its
-bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
-.six uses these bounding boxes to decide which characters belong together.
+bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six
+uses these bounding boxes to decide which characters belong together.
 
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
@@ -74,7 +74,7 @@ relative to the maximum width or height of the new character. Having a smaller
 least be smaller than the `char_margin` otherwise none of the characters will
 be separated by a space.
 
-The result of this stage is a list of lines. Each line consists a list of
+The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
 originate from the PDF file, or inserted `LTAnno` characters that represent
 spaces between words or newlines at the end of each line.
@@ -91,14 +91,14 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes is closer together
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.
 
 .. raw:: html
   :file: ../_static/layout_analysis_group_lines.html
 
-The result of this stage is a list of text boxes. Each box consist of a list
+The result of this stage is a list of text boxes. Each box consists of a list
 of lines.
 
 Grouping textboxes hierarchically