# cat-AI-log. An AI-based product group allocation system

Capstone project.

Sebastian Thomas @ neue fische Bootcamp Data Science<br />
(datascience at sebastianthomas dot de)

# Mining 1: PDF mining of IFA dosage forms

We extract the text of a PDF published by Informationsstelle für Arzneispezialitäten (IFA) to obtain the official list of dosage forms and their abbreviations.

## Origin

The pdf was downloaded from https://www.ifaffm.de/mandanten/1/documents/02_ifa_anbieter/richtlinien/IFA-Richtlinien_Darreichungsformen.pdf.

## Imports

### Modules, classes and functions

In [None]:
# string input-output
from io import StringIO

# regular expressions
import re

# pdfs
# installation: pip install pdfminer.six==20191020
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

# data
import pandas as pd

### Helpers

In [None]:
# inspired by https://stackoverflow.com/questions/56494070/how-to-use-pdfminer-six-with-python-3
def extract_text(path, pagenos=None):
    """Return the text of the PDF at `path`.

    `pagenos` is an optional collection of zero-based page indices;
    if it is None, all pages are processed.
    """
    with StringIO() as string_io:
        resource_manager = PDFResourceManager()
        with TextConverter(resource_manager, string_io, laparams=LAParams()) as text_converter:
            with open(path, 'rb') as pdf_file:
                page_interpreter = PDFPageInterpreter(resource_manager, device=text_converter)
                for page in PDFPage.get_pages(pdf_file, pagenos=pagenos):
                    page_interpreter.process_page(page)
                # read the buffer before the StringIO is closed
                text = string_io.getvalue()
    return text

## PDF mining

In [None]:
# extract text from the PDF one page at a time
# (page indices are zero-based, so range(2, 8) covers pages 3 to 8)
text = ''

for idx in range(2, 8):
    page = extract_text('../data/IFA-Richtlinien_Darreichungsformen.pdf', [idx])
    # remove headers, footers and other page furniture
    text += re.sub(r'IFA-Darreichungsformen|INFORMATION|3\.  Darreichungsformentabelle'
                   r'|\nErweiterung[\ \w\d\.\:]*\n|\n09\.09\.2019[\ \w\–\n]*\x0c$',
                   '', page)

# manual fix: put 'PSE' on a line of its own so that the line split below
# picks it up as a separate token
text = text.replace('PSE', '\n PSE')
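To make the clean-up step easier to follow, here is a minimal sketch that applies the same regular expression to a single made-up page string. The sample text and the names `HEADER_PATTERN`, `sample_page` and `cleaned` are assumptions for illustration; the string is not copied from the IFA PDF.

```python
import re

# Same pattern as in the cell above, split over two lines for readability.
HEADER_PATTERN = (r'IFA-Darreichungsformen|INFORMATION|3\.  Darreichungsformentabelle'
                  r'|\nErweiterung[\ \w\d\.\:]*\n|\n09\.09\.2019[\ \w\–\n]*\x0c$')

# Made-up page text: a header line, one abbreviation, one dosage form,
# and a footer that ends with a form feed (\x0c).
sample_page = 'IFA-Darreichungsformen\nXXX\nBeispielform\n09.09.2019 Seite 3\n\x0c'

# The header and the footer are removed; the table content is kept.
cleaned = re.sub(HEADER_PATTERN, '', sample_page)
print(repr(cleaned))
```

Note that the footer alternative is anchored with `$`, so it only matches at the very end of the string. This is presumably why the pages are extracted and cleaned one at a time rather than in a single call.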

In [None]:
# extract abbreviations and dosage forms from text
tokens = [token.strip() for token in text.split('\n') if token not in ['', ' ']]

abbreviations = []
dosage_forms = []
for token in tokens:
    if len(token) == 3 and token.isupper():
        abbreviations.append(token)
    else:
        dosage_forms.append(token)
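The classification rule above (a token of exactly three uppercase characters is treated as an abbreviation, everything else as part of a dosage form) can be tried out on a small sample. The helper name `split_tokens` and the placeholder text are hypothetical and only serve this sketch; they are not taken from the PDF.

```python
def split_tokens(text):
    """Split text into lines and sort them into abbreviations and dosage forms."""
    tokens = [token.strip() for token in text.split('\n') if token not in ['', ' ']]
    abbreviations = []
    dosage_forms = []
    for token in tokens:
        # three uppercase characters -> abbreviation, anything else -> dosage form
        if len(token) == 3 and token.isupper():
            abbreviations.append(token)
        else:
            dosage_forms.append(token)
    return abbreviations, dosage_forms


# Placeholder text with empty and blank lines mixed in.
sample_text = 'ABC\nErste Beispielform\n \nXYZ\n\nZweite Beispielform\n'
sample_abbreviations, sample_forms = split_tokens(sample_text)
print(sample_abbreviations)
print(sample_forms)
```

The two lists are paired purely by position, so a dosage form that spans several lines produces extra entries and shifts every later pair. The manual cleaning in the next cell addresses this.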

In [None]:
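# manual cleaning of dosage forms that are spread over several lines:
# each index marks an entry whose continuation is stored in the next entry;
# for forms spanning three or more lines the later index is listed first
# (e.g. 163 before 162), so that the tail is merged before the head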
indices = [54, 62, 69, 88, 117, 163, 162, 165, 167, 169, 172, 171, 175, 174, 179, 178, 183, 186, 185, 189, 188,
           194, 193, 192, 191, 241, 243]

for idx in indices:
    dosage_forms[idx] += ' ' + dosage_forms[idx + 1]

for idx in sorted(indices, reverse=True):
    dosage_forms.pop(idx + 1)

# manual cleaning of a dosage form that contains runs of consecutive spaces
dosage_forms[215] = re.sub(r'\ \ +', ' ', dosage_forms[215])
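The order of the indices matters when a dosage form spans three or more lines. The following sketch wraps the two loops in a hypothetical helper, `merge_continuation_lines`, and runs it on made-up data to show the difference; neither the helper name nor the sample entries come from the original notebook.

```python
def merge_continuation_lines(forms, indices):
    """Append forms[idx + 1] to forms[idx] for each idx, then drop the merged lines."""
    merged = list(forms)  # work on a copy, leave the input untouched
    for idx in indices:
        merged[idx] += ' ' + merged[idx + 1]
    # pop from the back so that earlier positions are not shifted
    for idx in sorted(indices, reverse=True):
        merged.pop(idx + 1)
    return merged


# 'Form B' is spread over three lines (positions 1, 2 and 3).
sample_forms = ['Form A', 'Form B,', 'zweite Zeile,', 'dritte Zeile', 'Form C']

# Later index first: the tail is merged into position 2 before position 1 absorbs it.
tail_first = merge_continuation_lines(sample_forms, [2, 1])

# Earlier index first: position 1 absorbs position 2 before the tail was attached,
# so the third line is lost.
head_first = merge_continuation_lines(sample_forms, [1, 2])

print(tail_first)
print(head_first)
```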

## Save data

The extracted abbreviation/dosage form pairs are saved to a semicolon-separated CSV file.
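Since abbreviations and dosage forms are paired by position only, a quick consistency check before saving may be worthwhile. The helper `check_pairs` below is a hypothetical addition, not part of the original notebook: it raises an error if the two lists differ in length and returns any abbreviation that occurs more than once.

```python
from collections import Counter


def check_pairs(abbreviations, dosage_forms):
    """Raise if the lists differ in length; return duplicated abbreviations."""
    if len(abbreviations) != len(dosage_forms):
        raise ValueError(
            f'{len(abbreviations)} abbreviations but {len(dosage_forms)} dosage forms'
        )
    counts = Counter(abbreviations)
    return sorted(abbr for abbr, count in counts.items() if count > 1)


# Placeholder data for illustration only.
duplicates = check_pairs(['ABC', 'XYZ', 'ABC'], ['Form 1', 'Form 2', 'Form 3'])
print(duplicates)
```

In the notebook, the call would be `check_pairs(abbreviations, dosage_forms)`. An empty result does not prove that every pair is correct, since compensating errors can leave the lengths equal, so a spot check against the PDF remains advisable.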

In [None]:
# save abbreviation/dosage form pairs to a CSV file
pd.Series(dosage_forms, index=pd.Index(abbreviations, name='abbreviation'),
          name='dosage form').to_csv('../data/dosage_forms_ifa.csv', sep=';')