
Error String index out of range: -1 in PDFLayoutTextStripper #39

Open
Jaumexr opened this issue Feb 21, 2020 · 2 comments


@Jaumexr

Jaumexr commented Feb 21, 2020

Hi,
I have this code, with the attached PDF to test.
public void doStrip() {
    try {
        PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("D:/escaner/errorsPDFBOX/AN20-0149-0602201842.pdf"), "r"));
        pdfParser.parse();
        String string;
        // try-with-resources closes the document even if getText throws
        try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
            PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
            string = pdfTextStripper.getText(pdDocument);
        }
        // the writer is flushed and closed automatically
        try (BufferedWriter writer = Files.newBufferedWriter(FileSystems.getDefault().getPath("D:/escaner", "fichero.txt"), Charset.forName("UTF-8"))) {
            writer.write(string);
        }
    } catch (InvalidPasswordException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

AN20-0149-0602201842.pdf
I get this exception:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(String.java:658)
    at com.sagedillepasa.gestion.TextLine.isSpaceCharacterAtIndex(PDFLayoutTextStripper.java:269)
    at com.sagedillepasa.gestion.TextLine.getNextValidIndex(PDFLayoutTextStripper.java:283)
    at com.sagedillepasa.gestion.TextLine.computeIndexForCharacter(PDFLayoutTextStripper.java:263)
    at com.sagedillepasa.gestion.TextLine.writeCharacterAtIndex(PDFLayoutTextStripper.java:229)
    at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeLine(PDFLayoutTextStripper.java:127)
    at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeTextPositionList(PDFLayoutTextStripper.java:157)
    at com.sagedillepasa.gestion.PDFLayoutTextStripper.iterateThroughTextList(PDFLayoutTextStripper.java:152)
    at com.sagedillepasa.gestion.PDFLayoutTextStripper.writePage(PDFLayoutTextStripper.java:96)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
    at com.sagedillepasa.gestion.PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:80)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
    at com.sagedillepasa.gestion.test.doStrip(test.java:44)
    at com.sagedillepasa.gestion.test.main(test.java:61)

@jenka13all

I have the exact same issue with the example code - it doesn't work.

@Athou

Athou commented Jan 20, 2022

I'm encountering the same issue.
The exception seems to happen because index is 0 in getNextValidIndex, so isSpaceCharacterAtIndex is called with index - 1, which is -1.
Changing the condition to !isCharacterPartOfPreviousWord && index > 0 && this.isSpaceCharacterAtIndex(index - 1) seems to fix the issue.
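
To illustrate why the extra index > 0 check helps, here is a minimal, self-contained sketch. It is not the library's actual code: the class TextLineSketch and the methods shouldShiftUnguarded and shouldShiftGuarded are hypothetical names made up for this example, and the body of isSpaceCharacterAtIndex is only assumed from the stack trace (it ends in String.charAt). The sketch just shows that && short-circuits, so the lookup at index - 1 is never evaluated when index is 0.

```java
// Hypothetical stand-in for the TextLine class in PDFLayoutTextStripper.
class TextLineSketch {
    private final String line;

    TextLineSketch(String line) {
        this.line = line;
    }

    // Assumed shape of the failing method: charAt throws for a negative index.
    boolean isSpaceCharacterAtIndex(int index) {
        return line.charAt(index) == ' ';
    }

    // Unguarded condition: looks at index - 1 even when index is 0.
    boolean shouldShiftUnguarded(int index, boolean isCharacterPartOfPreviousWord) {
        return !isCharacterPartOfPreviousWord && isSpaceCharacterAtIndex(index - 1);
    }

    // Guarded condition from the comment above: index > 0 is checked first,
    // so isSpaceCharacterAtIndex(-1) is never reached.
    boolean shouldShiftGuarded(int index, boolean isCharacterPartOfPreviousWord) {
        return !isCharacterPartOfPreviousWord && index > 0 && isSpaceCharacterAtIndex(index - 1);
    }

    public static void main(String[] args) {
        TextLineSketch textLine = new TextLineSketch("  abc");
        try {
            textLine.shouldShiftUnguarded(0, false);
        } catch (StringIndexOutOfBoundsException e) {
            // The exact message text depends on the Java version.
            System.out.println("unguarded check threw: " + e.getMessage());
        }
        System.out.println("guarded check at index 0: " + textLine.shouldShiftGuarded(0, false));
        System.out.println("guarded check at index 2: " + textLine.shouldShiftGuarded(2, false));
    }
}
```

Whether treating index 0 as "no preceding space" gives the right layout for every PDF is a separate question; the guard only removes the out-of-range lookup.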
