This file provides a handful of functions that will be useful for cleaning text that contains ligatures. A ligature is a single glyph that combines several letters, and it is often captured in a strange way when text is scraped from a PDF. Here is an example.

In [1]:
ligature_text = "The oﬃcer noted the eﬃciency of the ﬂame-retardant ﬂatboat. He was ﬂabbergasted by the ﬁrst ﬁgure on the manuscript, which was ﬁnished with a ﬂourish in a speciﬁc style."

In [None]:
print(ligature_text[5])
#Here we see that the character at index 5 (the 6th character) is a single glyph standing for 3 letters [ffi].

ﬃ
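
To confirm that this really is one code point rather than three letters, we can look up its length, code point, and Unicode name. This is a small sketch using only the standard library; the variable name `ch` is just for illustration.

```python
import unicodedata

ch = "\ufb03"  # the "ffi" ligature, written as an escape so it is unambiguous

print(len(ch))               # a ligature is a single code point
print(hex(ord(ch)))          # its code point in hexadecimal
print(unicodedata.name(ch))  # its official Unicode name
```

The name reported for this character is LATIN SMALL LIGATURE FFI, which is why a plain string comparison against "ffi" would fail on scraped text.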


This is not too bad. Let's find an example of more frustrating text to work with. Let's take one of our articles located at "https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659318/RCA-2022-00097_-_Legal_settlement_Workers'_Compensation_claim_of_Steven_Fogarty.pdf?1647659318"


In [15]:
test_url="https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659318/RCA-2022-00097_-_Legal_settlement_Workers'_Compensation_claim_of_Steven_Fogarty.pdf?1647659318"

#Let's apply a simple scraper to it.

import requests
import pymupdf
from requests.exceptions import HTTPError
def pdf_scraper(url):
    if url=="No articles found":
        return "No articles found"

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/118.0.0.0 Safari/537.36"
        )
    }#We will try to imitate a person browsing the web by appearing as if we are from a particular browser.

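    try:
        response = requests.get(url, headers=headers)#We set our headers to attempt spoofing.
        if response.status_code != 200:
            return None #Any non-200 response is treated as "nothing to scrape".

    except HTTPError as e:
        #Note: requests.get only raises HTTPError if response.raise_for_status() is called, so this branch is just a safeguard.
        return f"Access denied ({e})"
    except Exception as e:
        return f"Request failed ({e})"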
    pdf = response.content
    doc = pymupdf.open("pdf", pdf)
    text_parts = []
    if len(doc)>20:
        text_parts.append("Too_Long:Over 20 Pages") #If the document is over 20 pages, flag it and scrape only the first 20 pages.
    for k in range(min(len(doc), 20)):
        text_parts.append(doc[k].get_text())
    text = "".join(text_parts)
    #if len(text)>200:
    #    for z in range(min(len(doc), 10)):
    #        page=doc[z]
    #        pix=page.get_pixmap()
    #        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    #        text= pytesseract.image_to_string(img)
    #        text_parts.append(text)
    #    doc.close()
    #    text="".join(text_parts)
    #The commented-out code above would apply OCR. But we are currently interested in seeing the effects of ligatures, which show up in plain text scraping.
    if len(text)==0:
        return "No text found"
    return text


In [16]:
q=pdf_scraper(test_url)

In [None]:
q
#Let's take a preview

"Home \uf105Legislative File 2022-00098 \uf105RCA\nLegal se\x01lement: Workers' Compensa\x02on claim of Steven Fogarty (RCA-2022-00097)\nORIGINATING DEPARTMENT\nFinance & Property Services\nTo Commi\x01ee(s)\n#\nCommi\x01ee Name\nMee\x02ng Date\n1\nPolicy & Government Oversight Commi\x01ee\nFeb 7, 2022\nLEAD STAFF:\nEmily Ann Colby\nPRESENTED BY:\nEmily Ann Colby\nAc\x02on Item(s)\n#\nFile Type\nSubcategory\nItem Descrip\x02on\n1\nAc\x02on\nSe\x01lement\nApproving the workers' compensa\x02on claim of Steven Fogarty by payment of $175,000\nover three years to Steven Fogarty and his a\x01orney, Meuser Law Firm, and authorizing\nthe City A\x01orney's Oﬃce to execute any documents necessary to eﬀectuate the\nse\x01lement.\nWard / Neighborhood / Address\n#\nWard\nNeighborhood\nAddress\n1.\nNot Applicable\nBackground Analysis\nA City of Minneapolis employee sustained work-related injuries. The par\x02es reached a tenta\x02ve se\x01lement by payment of $175,000 over three years from fund\n069

We see that the "ti" from "Workers' Compensation" is expressed as \x02. When we print the data, we get a rather confusing result.

In [18]:
print(q)

Home Legislative File 2022-00098 RCA
Legal selement: Workers' Compensaon claim of Steven Fogarty (RCA-2022-00097)
ORIGINATING DEPARTMENT
Finance & Property Services
To Commiee(s)
#
Commiee Name
Meeng Date
1
Policy & Government Oversight Commiee
Feb 7, 2022
LEAD STAFF:
Emily Ann Colby
PRESENTED BY:
Emily Ann Colby
Acon Item(s)
#
File Type
Subcategory
Item Descripon
1
Acon
Selement
Approving the workers' compensaon claim of Steven Fogarty by payment of $175,000
over three years to Steven Fogarty and his aorney, Meuser Law Firm, and authorizing
the City Aorney's Oﬃce to execute any documents necessary to eﬀectuate the
selement.
Ward / Neighborhood / Address
#
Ward
Neighborhood
Address
1.
Not Applicable
Background Analysis
A City of Minneapolis employee sustained work-related injuries. The pares reached a tentave selement by payment of $175,000 over three years from fund
06930-1450100-789401-145400. Risk Managment believes this selement is in the best interests of the C

Our \x02 does not print as anything readable: depending on the environment, it shows up as a placeholder block or as nothing at all. The problem is even worse than ligatures. \x01 and \x02 are control characters, and \uf105 lies in the Unicode Private Use Area (PUA), where the meaning of a code point depends entirely on the font. Different organizations may use different fonts, which leads to non-uniform encoding. With that in mind, we cannot assume that \x02 always maps to "ti". However, let's start by making use of unicodedata.normalize to help with some ligatures.
Refer to https://docs.python.org/3/library/unicodedata.html for further documentation.
Here is an example.

In [20]:
import unicodedata
q=unicodedata.normalize('NFKC', "Workers' Compensa\x02on claim")

In [21]:
q

"Workers' Compensa\x02on claim"

In [22]:
print(q)

Workers' Compensaon claim
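
Normalization leaves \x02 untouched because it is a control character with no compatibility decomposition. Before building a mapping by hand, it can help to list which characters of this kind a scraped text contains. The following is a minimal sketch and not part of the original workflow: `find_suspect_chars` is a hypothetical helper that counts control characters (Unicode category `Cc`, ignoring ordinary whitespace) and private-use characters (category `Co`).

```python
import unicodedata
from collections import Counter

def find_suspect_chars(text):
    """Count control (Cc) and private-use (Co) characters, skipping normal whitespace."""
    suspects = Counter()
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace is also category Cc; leave it alone
        if unicodedata.category(ch) in ("Cc", "Co"):
            suspects[ch] += 1
    return suspects

# A short sample in the style of the scraped text above.
sample = "Home \uf105RCA\nLegal se\x01lement: Workers' Compensa\x02on claim"
for ch, count in find_suspect_chars(sample).items():
    print(repr(ch), count)
```

Each character reported this way still has to be matched to its intended text by checking the original PDF, since the same code can stand for different letters in different fonts.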


In [23]:
import unicodedata
q=unicodedata.normalize('NFKC', ligature_text[5])

In [24]:
q

'ffi'

In [26]:
ligature_text[5]

'ﬃ'

In [25]:
print(q)

ffi
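
The same call can be applied to a whole string rather than one character at a time. The sketch below uses a shortened version of the example sentence, with the ligatures written as escapes (`\ufb03` for "ffi", `\ufb02` for "fl") so that the input is unambiguous.

```python
import unicodedata

ligature_text = "The o\ufb03cer noted the e\ufb03ciency of the \ufb02ame-retardant \ufb02atboat."
clean = unicodedata.normalize("NFKC", ligature_text)

print(clean)
print(len(ligature_text), len(clean))  # the cleaned string is longer: each ligature expands
```

One caveat: NFKC also rewrites other compatibility characters (for example, a superscript "²" becomes "2"), so it is worth checking that this is acceptable for the text at hand.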


As we see, normalization works well for true ligatures, but it does nothing for documents with an unusual mapping like this one. What we can do here is create a dictionary that maps each problem character to its intended text, and then replace the characters accordingly.

In [33]:
replacements={
    "\x02":"ti",
    "\x01":"tt",
}
test_string="Workers' Compensa\x02on claim"

revised_string=test_string

for bad, good in replacements.items():
    revised_string = revised_string.replace(bad, good)
revised_string

"Workers' Compensation claim"

In [34]:
print(revised_string)

Workers' Compensation claim
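
As an alternative to the loop, the same dictionary can be applied in one pass with `str.translate`. This is only a sketch of an equivalent approach, and it assumes every key in the dictionary is a single character, which is what `str.maketrans` requires when given a dict.

```python
replacements = {
    "\x02": "ti",
    "\x01": "tt",
}

# Build a translation table; keys must be single characters, values may be longer strings.
table = str.maketrans(replacements)

test_string = "Workers' Compensa\x02on claim, se\x01lement"
revised_string = test_string.translate(table)
print(revised_string)
```

If a key ever needs to be longer than one character, the loop with `str.replace` is the version to keep.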


In [35]:
z=pdf_scraper("https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/1462/attachments/original/1638637587/RCA-2021-00523_-_Legal_Settlement_Workers'_Compensation_claim_of_Sokham_Klann.pdf?1638637587")

In [38]:
z

"Home \uf105Legislative File 2021-00545 \uf105RCA\nLegal Se\x01lement: Workers' Compensa\x02on claim of Sokham Klann\n(RCA-2021-00523)\nORIGINATING DEPARTMENT\nFinance & Property Services\nTo Commi\x01ee(s)\n#\nCommi\x01ee Name\nMee\x02ng Date\n1\nPolicy & Government Oversight\nCommi\x01ee\nMay 12, 2021\nLEAD\nSTAFF:\nEmily Ann Colby\nPRESENTED\nBY:\nEmily Ann Colby\nAc\x02on Item(s)\n#\nFile Type\nSubcategory\nItem Descrip\x02on\n1\nAc\x02on\nSe\x01lement\nApproving the Workers' Compensa\x02on claim of\nSokham Klann by payment of $180,000 over three\nyears to Mr. Klann and his a\x01orneys Meuser Law\nFirm and authorizing the City A\x01orney's Oﬃce to\nexecute any documents necessary to eﬀectuate\nthe se\x01lement.\nBackground Analysis\nFormer MPD oﬃcer sustained admi\x01ed work-related injuries.  The par\x02es reached a tenta\x02ve\nRCA-2021-00523 - Legal Settlement: Workers' Compensation claim of ...\nhttps://lims.minneapolismn.gov/RCA/7932\n1 of 2\n12/4/2021, 12:33 AM\nse\x01lement 

Let's try the same replacements on this similar article.

In [36]:
revized=z
for bad, good in replacements.items():
    revized=revized.replace(bad, good)

In [37]:
revized

"Home \uf105Legislative File 2021-00545 \uf105RCA\nLegal Settlement: Workers' Compensation claim of Sokham Klann\n(RCA-2021-00523)\nORIGINATING DEPARTMENT\nFinance & Property Services\nTo Committee(s)\n#\nCommittee Name\nMeeting Date\n1\nPolicy & Government Oversight\nCommittee\nMay 12, 2021\nLEAD\nSTAFF:\nEmily Ann Colby\nPRESENTED\nBY:\nEmily Ann Colby\nAction Item(s)\n#\nFile Type\nSubcategory\nItem Description\n1\nAction\nSettlement\nApproving the Workers' Compensation claim of\nSokham Klann by payment of $180,000 over three\nyears to Mr. Klann and his attorneys Meuser Law\nFirm and authorizing the City Attorney's Oﬃce to\nexecute any documents necessary to eﬀectuate\nthe settlement.\nBackground Analysis\nFormer MPD oﬃcer sustained admitted work-related injuries.  The parties reached a tentative\nRCA-2021-00523 - Legal Settlement: Workers' Compensation claim of ...\nhttps://lims.minneapolismn.gov/RCA/7932\n1 of 2\n12/4/2021, 12:33 AM\nsettlement by payment of $180,000 over three ye

In [39]:
print(revized)

Home Legislative File 2021-00545 RCA
Legal Settlement: Workers' Compensation claim of Sokham Klann
(RCA-2021-00523)
ORIGINATING DEPARTMENT
Finance & Property Services
To Committee(s)
#
Committee Name
Meeting Date
1
Policy & Government Oversight
Committee
May 12, 2021
LEAD
STAFF:
Emily Ann Colby
PRESENTED
BY:
Emily Ann Colby
Action Item(s)
#
File Type
Subcategory
Item Description
1
Action
Settlement
Approving the Workers' Compensation claim of
Sokham Klann by payment of $180,000 over three
years to Mr. Klann and his attorneys Meuser Law
Firm and authorizing the City Attorney's Oﬃce to
execute any documents necessary to eﬀectuate
the settlement.
Background Analysis
Former MPD oﬃcer sustained admitted work-related injuries.  The parties reached a tentative
RCA-2021-00523 - Legal Settlement: Workers' Compensation claim of ...
https://lims.minneapolismn.gov/RCA/7932
1 of 2
12/4/2021, 12:33 AM
settlement by payment of $180,000 over three years from fund
06930-1450100-78901-145400. Risk Man

Pretty good. Checking against the original document, we see that \uf105 maps to something like ">", so let's play with that idea.

In [40]:
replacements={
    "\x02":"ti",
    "\x01":"tt",
    "\uf105":">"
}
for bad, good in replacements.items():
    revized=revized.replace(bad, good)

In [42]:
revized

"Home >Legislative File 2021-00545 >RCA\nLegal Settlement: Workers' Compensation claim of Sokham Klann\n(RCA-2021-00523)\nORIGINATING DEPARTMENT\nFinance & Property Services\nTo Committee(s)\n#\nCommittee Name\nMeeting Date\n1\nPolicy & Government Oversight\nCommittee\nMay 12, 2021\nLEAD\nSTAFF:\nEmily Ann Colby\nPRESENTED\nBY:\nEmily Ann Colby\nAction Item(s)\n#\nFile Type\nSubcategory\nItem Description\n1\nAction\nSettlement\nApproving the Workers' Compensation claim of\nSokham Klann by payment of $180,000 over three\nyears to Mr. Klann and his attorneys Meuser Law\nFirm and authorizing the City Attorney's Oﬃce to\nexecute any documents necessary to eﬀectuate\nthe settlement.\nBackground Analysis\nFormer MPD oﬃcer sustained admitted work-related injuries.  The parties reached a tentative\nRCA-2021-00523 - Legal Settlement: Workers' Compensation claim of ...\nhttps://lims.minneapolismn.gov/RCA/7932\n1 of 2\n12/4/2021, 12:33 AM\nsettlement by payment of $180,000 over three years from f

In [41]:
print(revized)

Home >Legislative File 2021-00545 >RCA
Legal Settlement: Workers' Compensation claim of Sokham Klann
(RCA-2021-00523)
ORIGINATING DEPARTMENT
Finance & Property Services
To Committee(s)
#
Committee Name
Meeting Date
1
Policy & Government Oversight
Committee
May 12, 2021
LEAD
STAFF:
Emily Ann Colby
PRESENTED
BY:
Emily Ann Colby
Action Item(s)
#
File Type
Subcategory
Item Description
1
Action
Settlement
Approving the Workers' Compensation claim of
Sokham Klann by payment of $180,000 over three
years to Mr. Klann and his attorneys Meuser Law
Firm and authorizing the City Attorney's Oﬃce to
execute any documents necessary to eﬀectuate
the settlement.
Background Analysis
Former MPD oﬃcer sustained admitted work-related injuries.  The parties reached a tentative
RCA-2021-00523 - Legal Settlement: Workers' Compensation claim of ...
https://lims.minneapolismn.gov/RCA/7932
1 of 2
12/4/2021, 12:33 AM
settlement by payment of $180,000 over three years from fund
06930-1450100-78901-145400. Risk Man

I suppose all that's left to do is to write a function that takes in a string and returns a ligature-free result.

In [43]:
import unicodedata
def text_cleaner(text):
    norm_text=unicodedata.normalize('NFKC', text)
    #First pass. This deals with the most common ligatures.
    replacements={
        "\x02":"ti",
        "\x01":"tt",
        "\uf105":">"
    }#Second pass. This mapping will, of course, change and grow with the texts you are working with.
    final_text=norm_text
    for bad, good in replacements.items():
        final_text=final_text.replace(bad, good)
    return final_text
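
Since different sources may map the same code to different letters, one possible extension is to let the caller pass source-specific mappings on top of the defaults. The sketch below is an assumption about how that could look, not part of the function above: `text_cleaner_with_overrides`, `DEFAULT_REPLACEMENTS`, and the `extra_replacements` parameter are hypothetical names, and the "fi" mapping in the last line is an invented example.

```python
import unicodedata

# Default mapping taken from the examples above; treated here as a starting point only.
DEFAULT_REPLACEMENTS = {"\x02": "ti", "\x01": "tt", "\uf105": ">"}

def text_cleaner_with_overrides(text, extra_replacements=None):
    """Normalize ligatures, then apply default plus caller-supplied mappings."""
    mapping = dict(DEFAULT_REPLACEMENTS)
    if extra_replacements:
        mapping.update(extra_replacements)  # source-specific entries override defaults
    cleaned = unicodedata.normalize("NFKC", text)
    for bad, good in mapping.items():
        cleaned = cleaned.replace(bad, good)
    return cleaned

raw = "Home \uf105RCA\nLegal se\x01lement: the City A\x01orney's O\ufb03ce, Ac\x02on Item"
print(text_cleaner_with_overrides(raw))

# A hypothetical source whose font uses \x02 for "fi" instead of "ti":
print(text_cleaner_with_overrides("\x02nal", {"\x02": "fi"}))
```

Keeping the defaults in one place and the exceptions per source avoids editing the function each time a new document family turns up.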
