# String comparison

**Goal:** Extract the text from a PDF and compare it against a clean copy of the same text, in order to benchmark different PDF extraction techniques in Python.

We compare document to document to see where the extracted text differs from the reference. The test file, from schoolshooters.info, is "2014 NaBITA Whitepaper Text with Graphics." It seems representative of many scholarly-article PDFs because it uses graphics, footnotes, bulleted lists, etc. The baseline version of the file has been converted to .txt.

We will compare four PDF text extraction programs:
1. PyPDF2
2. PDFMiner.six
3. Grobid
4. PyMuPDF

PyMuPDF and PDFMiner.six performed similarly well on Levenshtein distance, cosine similarity, and tf-idf similarity, though PDFMiner.six took far longer to run (about 2.5 s versus 0.04 s). PyPDF2 and Grobid also scored above 0.98 on cosine and tf-idf similarity, despite much larger Levenshtein distances (3,692 and 34,829).

In [1]:
import pandas as pd
import os
import Levenshtein
import html
import PyPDF2
import datetime

### Establishing baseline .txt file for comparison

In [26]:
now = datetime.datetime.now()

# Read the clean reference text; the with-block closes the file afterwards.
with open(r'c:\users\j\desktop\projects\sia\medium\2014-NaBITA-Whitepaper-Text-with-Graphics.txt', "r", encoding="utf8") as f:
    baseline = f.read()

stop = datetime.datetime.now()
base_time = stop - now

### PyPDF2 text extraction

In [27]:
now = datetime.datetime.now()

with open(r'c:\users\j\desktop\projects\sia\medium\2014-NaBITA-Whitepaper-Text-with-Graphics.pdf','rb') as pdf_file:
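    # PdfFileReader, getNumPages, getPage and extractText are the PyPDF2 1.x names;
    # PyPDF2 3.0 removed them in favor of PdfReader, len(reader.pages), reader.pages[i] and page.extract_text().
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()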
    pypdf_test = ''
    for page_number in range(number_of_pages): 
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        pypdf_test += page_content
        
stop = datetime.datetime.now()
pypdf_time = stop - now

### PDFMiner.six text extraction

In [29]:
now = datetime.datetime.now()

def pdf_to_txt(path):
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open(path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    text = str(output_string.getvalue())
    return text

file = r'c:\users\j\desktop\projects\sia\medium\2014-NaBITA-Whitepaper-Text-with-Graphics.pdf'

pdfminersix_test = pdf_to_txt(file)

stop = datetime.datetime.now()
pdfminersix_time = stop - now

### Grobid text extraction

In [33]:
now = datetime.datetime.now()

def xml_parse(xml_file):
    import xml.etree.ElementTree as ET
    tree = ET.parse(xml_file) 
    root = tree.getroot()
    
    string = ET.tostring(root, encoding='utf8').decode('utf8')
    
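    from bs4 import BeautifulSoup
    # No parser is named, so BeautifulSoup picks the best one installed.
    soup = BeautifulSoup(string)
    # Collect every text node; this includes the XML declaration and the TEI header.
    pageText = soup.findAll(text=True)
    raw_text = ' '.join(pageText)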
        
    return raw_text

# The TEI XML was produced by Grobid beforehand, so grobid_time covers only the parsing step.
file = r'c:\users\j\desktop\projects\sia\medium\2014-NaBITA-Whitepaper-Text-with-Graphics.tei.xml'

grobid_test = xml_parse(file)

stop = datetime.datetime.now()
grobid_time = stop - now
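
The Grobid text above is built from every text node in the TEI file, so the XML declaration and header metadata end up in the string that is compared against the baseline. As a rough sketch of an alternative, the body text alone could be pulled out with the standard library. The helper name `tei_body_text` and the miniature TEI document are hypothetical; the only assumption about the real file is that it uses the standard TEI namespace. Switching to this would change the scores reported in this notebook.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"  # standard TEI namespace (assumed for the real file)

def tei_body_text(root):
    """Return whitespace-normalized text from the <body> of a parsed TEI tree."""
    body = root.find(f".//{TEI_NS}body")
    if body is None:  # no <body> found: fall back to the whole document
        body = root
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(body.itertext()).split())

# Hypothetical miniature TEI document, not taken from the real Grobid output.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc><titleStmt><title>Header title</title></titleStmt></fileDesc></teiHeader>"
    "<text><body><div><head>Introduction</head><p>The Tool provides a rubric.</p></div></body></text>"
    "</TEI>"
)

body_text = tei_body_text(ET.fromstring(sample))
print(body_text)
```

For a file on disk, the same helper would take `ET.parse(path).getroot()`.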

### PyMuPDF text extraction

In [34]:
now = datetime.datetime.now()

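import fitz  # PyMuPDF's import name
fname = r'c:\users\j\desktop\projects\sia\medium\2014-NaBITA-Whitepaper-Text-with-Graphics.pdf'
doc = fitz.open(fname)
text = ''
for page in doc:
    # getText() is the older camelCase name; newer PyMuPDF releases call it get_text().
    text += page.getText()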

pymupdf_test = text

stop = datetime.datetime.now()
pymupdf_time = stop - now
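
The extraction cells above each time a single run with `datetime.datetime.now()`, and the timed span sometimes includes imports and function definitions. A minimal sketch of a steadier measurement follows, assuming a hypothetical helper `timed` that repeats the call and keeps the best wall-clock time from `time.perf_counter()`. It is not how the times in this notebook were measured.

```python
import time

def timed(func, *args, repeats=3):
    """Call func(*args) `repeats` times; return (last result, best seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in workload; a real use would pass an extraction function and a path.
text, seconds = timed(str.upper, "threat assessment")
print(text, f"{seconds:.6f}s")
```

With the functions defined above, a call such as `timed(pdf_to_txt, file)` would time only the extraction itself.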

# Testing Levenshtein Distance

In [20]:
base_base = Levenshtein.distance(baseline, baseline)
base_pypdf = Levenshtein.distance(baseline, pypdf_test)
base_pdfminersix = Levenshtein.distance(baseline, pdfminersix_test)
base_grobid = Levenshtein.distance(baseline, grobid_test)
base_pymupdf = Levenshtein.distance(baseline, pymupdf_test)

print("Levenshtein distance results are as follows: ")
print("Baseline: " + str(base_base))
print("Pypdf: " + str(base_pypdf))
print("PDFminer_six: " + str(base_pdfminersix))
print("Grobid: " + str(base_grobid))
print("Pymupdf: " + str(base_pymupdf))

Levenshtein distance results are as follows: 
Baseline: 0
Pypdf: 3692
PDFminer_six: 1706
Grobid: 34829
Pymupdf: 1949
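
Raw Levenshtein distances grow with document length, which makes them hard to read on their own. One common option, sketched here as an assumption rather than something this notebook does, is to divide by the length of the longer string to get a similarity between 0 and 1. The pure-Python `levenshtein` function is a stand-in for `Levenshtein.distance` so the example runs without the package; it is far too slow for whole documents.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

# A stray line break, as in the PyPDF2 output, counts as one edit.
clean = "THREAT ASSESSMENT"
broken = "THREAT \nASSESSMENT"
print(levenshtein(clean, broken), normalized_similarity(clean, broken))
```

Because every extra space or line break is an edit, layout differences alone can account for a large share of the distance.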


# Cosine Similarity

In [21]:
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
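from nltk.corpus import stopwords
# Keep the stop-word list under its own name so the nltk module is not shadowed.
stop_words = set(stopwords.words('english'))

def clean_string(text):
    # Drop punctuation character by character, lowercase, then remove stop words.
    text = ''.join([char for char in text if char not in string.punctuation])
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text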

corpora = [baseline, pypdf_test, pdfminersix_test, grobid_test, pymupdf_test]

corpora = list(map(clean_string, corpora))

vectorizer = CountVectorizer().fit_transform(corpora)
vectors = vectorizer.toarray()

cosim = cosine_similarity(vectors)

def cosine_sim_vectors(vec1, vec2):
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]

print("Cosine similarity for PyPDF2:")
print(cosine_sim_vectors(vectors[0], vectors[1]))
print("Cosine similarity for PDFminer_six:")
print(cosine_sim_vectors(vectors[0], vectors[2]))
print("Cosine similarity for Grobid:")
print(cosine_sim_vectors(vectors[0], vectors[3]))
print("Cosine similarity for PyMuPdf:")
print(cosine_sim_vectors(vectors[0], vectors[4]))

Cosine similarity for PyPDF2:
0.980466940137725
Cosine similarity for PDFminer_six:
0.9998833769688046
Cosine similarity for Grobid:
0.987242547178377
Cosine similarity for PyMuPdf:
0.9998833769688046
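
Cosine similarity on word counts ignores word order and whitespace, which is one plausible reason Grobid scores 0.987 here despite a Levenshtein distance of 34,829. The sketch below illustrates that property with a pure-Python stand-in for the `CountVectorizer` and `cosine_similarity` calls above. The function name `cosine_bow` and the whitespace tokenization are assumptions made for the example.

```python
import math
from collections import Counter

def cosine_bow(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

original = "threat assessment in the campus setting"
shuffled = "setting campus the in assessment threat"   # same words, reversed order
truncated = "threat assessment in the"                 # two words missing

print(cosine_bow(original, shuffled))   # word order does not matter
print(cosine_bow(original, truncated))  # missing words lower the score
```

A high cosine score therefore says the right words were recovered, not that they appear in the right order or layout.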


# Tf-idf Similarity

In [22]:
def tfidf_string_similarity_test(correct_string,test_string):
    from sklearn.feature_extraction.text import TfidfVectorizer
    two_strings = [correct_string,test_string]
    vect = TfidfVectorizer(min_df=1, stop_words="english")
    tfidf = vect.fit_transform(two_strings)
    # Rows are L2-normalized, so the product of the two rows is their cosine similarity.
    pairwise_similarity = tfidf * tfidf.T
    arr = pairwise_similarity.toarray()
    return arr[0][1]

In [23]:
print("Tfidf similarity for PyPDF2:")
print(tfidf_string_similarity_test(baseline, pypdf_test))
print("Tfidf similarity for PDFminer_six:")
print(tfidf_string_similarity_test(baseline, pdfminersix_test))
print("Tfidf similarity for Grobid:")
print(tfidf_string_similarity_test(baseline, grobid_test))
print("Tfidf similarity for PyMuPdf:")
print(tfidf_string_similarity_test(baseline, pymupdf_test))

Tfidf similarity for PyPDF2:
0.9807448546540203
Tfidf similarity for PDFminer_six:
0.9999051630168795
Tfidf similarity for Grobid:
0.9854546183829214
Tfidf similarity for PyMuPdf:
0.9999051630168795
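
Each tf-idf score above comes from a corpus of only two documents, the baseline and one extraction. As a sketch of what that implies, the function below applies the smoothed idf formula that scikit-learn documents as the `TfidfVectorizer` default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The name `tfidf_cosine`, the whitespace tokenization, and the lack of stop-word removal are simplifications for illustration, so it will not reproduce the numbers above.

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b):
    """Tf-idf cosine similarity for a two-document corpus (smoothed idf)."""
    counts = [Counter(doc_a.lower().split()), Counter(doc_b.lower().split())]
    vocab = set(counts[0]) | set(counts[1])
    n_docs = 2
    # Smoothed idf: terms found in both documents get 1.0, terms in only one get more.
    idf = {t: math.log((1 + n_docs) / (1 + sum(t in c for c in counts))) + 1
           for t in vocab}
    vecs = [{t: c[t] * idf[t] for t in c} for c in counts]
    dot = sum(vecs[0][t] * vecs[1][t] for t in vecs[0].keys() & vecs[1].keys())
    norm = math.sqrt(sum(v * v for v in vecs[0].values())
                     * sum(v * v for v in vecs[1].values()))
    return dot / norm if norm else 0.0

same = tfidf_cosine("campus threat assessment", "campus threat assessment")
differ = tfidf_cosine("campus threat assessment", "campus threat setting")
print(same, differ)
```

In a two-document corpus, a term that appears in only one of the texts gets a larger weight than a shared term, so words that an extractor garbles or drops pull the score down a little more than they would under plain word counts.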


In [24]:
data = {'Time to extract text': [base_time, pypdf_time, pdfminersix_time, grobid_time, pymupdf_time],
        'Levenshtein distance':  [base_base, base_pypdf, base_pdfminersix, base_grobid, base_pymupdf],
        'Cosine similarity': [cosine_sim_vectors(vectors[0], vectors[0]), cosine_sim_vectors(vectors[0], vectors[1]), cosine_sim_vectors(vectors[0], vectors[2]), cosine_sim_vectors(vectors[0], vectors[3]), cosine_sim_vectors(vectors[0], vectors[4])],
        'Tf-idf similarity': [tfidf_string_similarity_test(baseline, baseline), tfidf_string_similarity_test(baseline, pypdf_test), tfidf_string_similarity_test(baseline, pdfminersix_test), tfidf_string_similarity_test(baseline, grobid_test), tfidf_string_similarity_test(baseline, pymupdf_test)]
        }
df = pd.DataFrame(data, index=['Baseline_string', 'PyPdf', 'PdfMiner_six', 'Grobid', 'PyMuPdf'])

### Results Table

In [25]:
df

| | Time to extract text | Levenshtein distance | Cosine similarity | Tf-idf similarity |
|---|---|---|---|---|
| Baseline_string | 0 days 00:00:00.001965 | 0 | 1.0 | 1.0 |
| PyPdf | 0 days 00:00:00.998059 | 3692 | 0.980467 | 0.980745 |
| PdfMiner_six | 0 days 00:00:02.489001 | 1706 | 0.999883 | 0.999905 |
| Grobid | 0 days 00:00:00.037036 | 34829 | 0.987243 | 0.985455 |
| PyMuPdf | 0 days 00:00:00.041996 | 1949 | 0.999883 | 0.999905 |


## Baseline txt file

In [30]:
print(baseline)

THREAT ASSESSMENT IN THE CAMPUS SETTING 

THE NABITA 2014 WHITEPAPER 

BY: BRETT A. SOKOLOW, J.D. 
W. SCOTT LEWIS, J.D. 
SAUNDRA K. SCHUSTER, J.D. 
DANIEL C. SWINTON, J.D., ED.D. 

AND 
BRIAN J. VAN BRUNT, ED.D. 

This Threat Assessment Tool is being shared as a free resource to update the 
2009 Whitepaper published by the National Behavioral Intervention Team 
Association (NaBITA). 

Additional copies are available for free at www.nabita.org 

© NABITA 2014. 


THREAT ASSESSMENT IN THE CAMPUS SETTING 

Introduction 

The NaBITA Threat Assessment Tool (“Tool”) was first introduced in 2009. The Tool provides a rubric for 
behavioral and risk evaluation and helps create a common language for Behavioral Intervention Teams 
(“BITs”). It now commands respect as the tool most commonly used by campus behavioral intervention and 
threat assessment teams across the United States (Bennett & Lengerich, 2011; Van Brunt et al, 2012). 
Given the prominence it has achieved, we at NaBITA are mindful 

## PyPDF2 output text

In [31]:
print(pypdf_test)

  THREAT 
ASSESSMENT IN THE 
CAMPUS 
SETTING
  THE NABITA
 2014 WHITEPAPER
   BY: BRETT A. SOKOLOW
, J.D.
 W. SCOTT 
LEWIS
, J.D.
 SAUNDRA K. SCHUSTER, J.D.
 DANIEL 
C. SWINTON
, J.D.
, ED.D.
 AND BRIAN 
J. VAN BRUNT, ED.D.       This Threat Assessment Tool is being shared as a free 
resource to update the 
 2009 White
paper 
published by the
 National Behavioral Intervention Team 
Association (NaBITA).
  Additional copies are available for free at 
www.nabita.org
  © NABITA
 2014.  2 THREAT 
ASSESSMENT IN THE 
CAMPUS 
SETTING
 Introduction
  The NaBITA Threat Assessment Tool 
(ÒToolÓ) 
was first introduced in 2009
. The Tool provides a rubric 
for 
behavioral and risk 
evaluation 
and helps create 
a common language for 
Behavioral Intervention Teams 
(ÒBIT
sÓ). It now commands respect as the tool most commonly used by campus behavioral intervention and 
threat assessment teams
 across the United States
 (Bennett & Lengerich, 2011; Van Brunt et al, 2012)
. Given the prominence it has 

## PDFMiner.six output text

In [32]:
print(pdfminersix_test)

 
 

 

 
 

 

 

 

 

THREAT ASSESSMENT IN THE CAMPUS SETTING 

THE NABITA 2014 WHITEPAPER 

BY: BRETT A. SOKOLOW, J.D. 

W. SCOTT LEWIS, J.D. 

SAUNDRA K. SCHUSTER, J.D. 
DANIEL C. SWINTON, J.D., ED.D. 

AND 

BRIAN J. VAN BRUNT, ED.D. 

 
 

 
 

This Threat Assessment Tool is being shared as a free resource to update the  

2009 Whitepaper published by the National Behavioral Intervention Team 

Association (NaBITA). 

Additional copies are available for free at www.nabita.org 

© NABITA 2014. 

THREAT ASSESSMENT IN THE CAMPUS SETTING 

Introduction 
 
The NaBITA Threat Assessment Tool (“Tool”) was first introduced in 2009. The Tool provides a rubric for 
behavioral and risk evaluation and helps create a common language for Behavioral Intervention Teams 
(“BITs”). It now commands respect as the tool most commonly used by campus behavioral intervention and 
threat assessment teams across the United States (Bennett & Lengerich, 2011; Van Brunt et al, 2012). 
Given the prominence i

## Grobid output text

In [35]:
print(grobid_test)

xml version='1.0' encoding='utf8'? 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 National Behavioral Intervention Team Association (NaBITA) 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 GROBID - A machine learning software for extracting information from scholarly documents 
 
 
 
 
 
 
 
 
 
 
 Introduction The NaBITA Threat Assessment Tool ("Tool") was first introduced in 2009. The Tool provides a rubric for behavioral and risk evaluation and helps create a common language for Behavioral Intervention Teams  ("BITs") . It now commands respect as the tool most commonly used by campus behavioral intervention and threat assessment teams across the United States  (Bennett & Lengerich, 2011; Van Brunt et al, 2012) . Given the prominence it has achieved, we at NaBITA are mindful of our ongoing obligation to update the tool, to validate it, and assure that it continues to reflect best practices. While our trainings and our literature describing the use and application of the tool have evolved, this marks the first substant

## PyMuPDF output text

In [36]:
print(pymupdf_test)

 
 
THREAT ASSESSMENT IN THE CAMPUS SETTING 
 
THE NABITA 2014 WHITEPAPER 
 
 
BY: BRETT A. SOKOLOW, J.D. 
W. SCOTT LEWIS, J.D. 
SAUNDRA K. SCHUSTER, J.D. 
DANIEL C. SWINTON, J.D., ED.D. 
AND 
BRIAN J. VAN BRUNT, ED.D. 
 
 
 
 
 
 
This Threat Assessment Tool is being shared as a free resource to update the  
2009 Whitepaper published by the National Behavioral Intervention Team 
Association (NaBITA). 
 
Additional copies are available for free at www.nabita.org 
 
© NABITA 2014. 
 
2 
THREAT ASSESSMENT IN THE CAMPUS SETTING 
Introduction 
 
The NaBITA Threat Assessment Tool (“Tool”) was first introduced in 2009. The Tool provides a rubric for 
behavioral and risk evaluation and helps create a common language for Behavioral Intervention Teams 
(“BITs”). It now commands respect as the tool most commonly used by campus behavioral intervention and 
threat assessment teams across the United States (Bennett & Lengerich, 2011; Van Brunt et al, 2012). 
Given the prominence it has achieved, we