# Import the source data
The data is provided as a directory tree that is three levels deep; the following listing shows only the first level.
``` bash
fiete@ubu:~/Documents/studium/analyse_semi_und_unstrukturierter_daten$ tree -d -L 1 CAPTUM
CAPTUM
├── Allergic Diseases
├── ANA
├── Angioedema
├── anti-FcεRI
├── Antihistamine
├── Anti-IgE
├── anti-TPO IgE ratio
├── ASST
├── Basophil
├── BAT
├── BHRA
├── CRP
├── Cyclosporine
├── D-Dimer
├── dsDNA
├── Duration
├── Eosinophil
├── IL-24
├── Omalizumab
├── Severity
├── Thyroglobulin
├── Total IgE
└── TPO
```

To work further with the source data, it is useful to have a list of file paths for the pdfs. The following creates a list of all pdf files in the `CAPTUM` source folder.

In [30]:
import os

path = './CAPTUM'

pdf_filepaths = []
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        if name.endswith('.pdf'):
            pdf_filepaths.append(os.path.join(root, name))

pdf_filepaths[:5]

['./CAPTUM/CRP/ANA/Asero 2017.pdf',
 './CAPTUM/CRP/ANA/Magen 2015.pdf',
 './CAPTUM/CRP/Severity/Kolkhir 2017 .pdf',
 './CAPTUM/CRP/Severity/Baek 2014.pdf',
 './CAPTUM/CRP/Severity/Kasperska-Zajac 2015.pdf']
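
As an alternative to `os.walk`, the same list can be built with `pathlib`. The following is a minimal sketch and not part of the original notebook: the helper name `find_pdfs` is hypothetical, and matching the suffix case-insensitively (so that `.PDF` is found as well) is an assumption that goes beyond the cell above.

```python
from pathlib import Path


def find_pdfs(root):
    """Return the paths of all PDF files below root, sorted for reproducibility."""
    # rglob('*') walks the tree recursively; the suffix check is
    # case-insensitive (an assumption, the cell above matches '.pdf' only)
    return sorted(
        str(p) for p in Path(root).rglob('*')
        if p.is_file() and p.suffix.lower() == '.pdf'
    )
```

It could be called as `find_pdfs('./CAPTUM')`. Note that `pathlib` normalises the paths, so they are returned without the leading `./`.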

## Check data for duplicate entries
We can identify duplicate pdfs by computing the checksum of each file and then counting the unique values. So let us define the checksum function `get_checksum()`:

In [31]:
# https://stackoverflow.com/questions/16874598/how-do-i-calculate-the-md5-checksum-of-a-file-in-python#16876405
import hashlib

def get_checksum(filepath: str) -> str:
    # open the file in binary mode; the with-block closes it again
    with open(filepath, 'rb') as file_to_check:
        # read the whole file into memory
        data = file_to_check.read()
        # hash the contents with MD5 and return the hex digest
        return hashlib.md5(data).hexdigest()

# check that it works
file_one, file_one_copy, file_two = "./pdf_1.pdf", "./pdf_1 copy.pdf", "./pdf_2.pdf"
assert get_checksum(file_one) == get_checksum(file_one_copy), "should be equal"
assert get_checksum(file_one) != get_checksum(file_two), "should not be equal"
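
`get_checksum()` reads each file into memory in one go, which is fine for papers of a few megabytes. For larger files, the hash can be fed in chunks instead. A minimal sketch, assuming a hypothetical helper name `get_checksum_chunked` and an arbitrary chunk size of 1 MiB; it should return the same digest as `get_checksum()`:

```python
import hashlib


def get_checksum_chunked(filepath, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it fully into memory."""
    md5 = hashlib.md5()
    with open(filepath, 'rb') as file_to_check:
        # iter() with a sentinel keeps calling read() until it returns b''
        for chunk in iter(lambda: file_to_check.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```

MD5 is used here only to detect identical files, not for any security purpose.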

Then we can create a pandas dataframe from the list of file paths and add a `checksum` column that is computed with our `get_checksum()` function.

In [32]:
import pandas as pd
df = pd.DataFrame(pdf_filepaths, columns = ['filepath'])
df['checksum'] = df['filepath'].apply(get_checksum)
df

Unnamed: 0,filepath,checksum
0,./CAPTUM/CRP/ANA/Asero 2017.pdf,2fad223ae2232cb9e855d3ece9e34b72
1,./CAPTUM/CRP/ANA/Magen 2015.pdf,c721aaea67a47811324b3c860dde612b
2,./CAPTUM/CRP/Severity/Kolkhir 2017 .pdf,aed2cb292fdffefe2a319b9d7e517bb3
3,./CAPTUM/CRP/Severity/Baek 2014.pdf,989e3eca08259c9a898acc551473f55f
4,./CAPTUM/CRP/Severity/Kasperska-Zajac 2015.pdf,2ed156f4fd5cfa00198f3f6f590940e0
...,...,...
1042,./CAPTUM/Omalizumab/Cyclosporine/Rosenblum 202...,fb22292adf8f35656fde0e54dc0cee51
1043,./CAPTUM/Omalizumab/Cyclosporine/Gimenez Arnau...,6a5635468c99716fc18b91b7b6ebaeaf
1044,./CAPTUM/Omalizumab/Cyclosporine/Koski 2017.pdf,6cfd7540663be0f6d7fb72f776339b71
1045,./CAPTUM/Omalizumab/Cyclosporine/Ke 2017.pdf,849adffe6101df0a030cf425f661e1ed


In the final step, we can analyse the results. Only 464 of the 1047 files are unique, so the available data is less than half as large as it initially appears.

In [33]:
print('Total number of pdfs: {}'.format(df['checksum'].count()))
print('Total number of unique pdfs: {}'.format(df['checksum'].nunique()))
df['checksum']


Total number of pdfs: 1047
Total number of unique pdfs: 464


0       2fad223ae2232cb9e855d3ece9e34b72
1       c721aaea67a47811324b3c860dde612b
2       aed2cb292fdffefe2a319b9d7e517bb3
3       989e3eca08259c9a898acc551473f55f
4       2ed156f4fd5cfa00198f3f6f590940e0
                      ...               
1042    fb22292adf8f35656fde0e54dc0cee51
1043    6a5635468c99716fc18b91b7b6ebaeaf
1044    6cfd7540663be0f6d7fb72f776339b71
1045    849adffe6101df0a030cf425f661e1ed
1046    f13be81ffbff55e031a34ef81d43cbff
Name: checksum, Length: 1047, dtype: object
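
To see which files are copies of each other, the paths can be grouped by checksum. The following sketch uses only the standard library; the function name `group_duplicates` and the sample pairs are hypothetical and not taken from the data.

```python
from collections import defaultdict


def group_duplicates(path_checksum_pairs):
    """Map each checksum that occurs more than once to the list of its file paths."""
    groups = defaultdict(list)
    for path, checksum in path_checksum_pairs:
        groups[checksum].append(path)
    # keep only checksums shared by at least two files
    return {c: paths for c, paths in groups.items() if len(paths) > 1}


# made-up example values, not real checksums from the CAPTUM folder
pairs = [('a/x.pdf', '111'), ('b/x.pdf', '111'), ('c/y.pdf', '222')]
print(group_duplicates(pairs))
# {'111': ['a/x.pdf', 'b/x.pdf']}
```

With the dataframe from above, the pairs could be passed as `zip(df['filepath'], df['checksum'])`.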

Now we create a dataframe of unique PDFs by dropping all rows with a duplicate checksum (the first occurrence of each file is kept).

In [34]:
# .copy() avoids a SettingWithCopyWarning when columns are added later
df_unique = df.drop_duplicates(subset=['checksum']).copy()
df_unique.head()

Unnamed: 0,filepath,checksum
0,./CAPTUM/CRP/ANA/Asero 2017.pdf,2fad223ae2232cb9e855d3ece9e34b72
1,./CAPTUM/CRP/ANA/Magen 2015.pdf,c721aaea67a47811324b3c860dde612b
2,./CAPTUM/CRP/Severity/Kolkhir 2017 .pdf,aed2cb292fdffefe2a319b9d7e517bb3
3,./CAPTUM/CRP/Severity/Baek 2014.pdf,989e3eca08259c9a898acc551473f55f
4,./CAPTUM/CRP/Severity/Kasperska-Zajac 2015.pdf,2ed156f4fd5cfa00198f3f6f590940e0


# Extracting the text
The next step is to read the text from the pdfs. We can try two different approaches to this problem:
- [Using pdfminer.six](#using-pdfminersix)
- [Using Optical Character Recognition (OCR)](#using-optical-character-recognition-ocr)

## Using pdfminer.six
First we import the necessary modules (more on pdfminer.six [here](https://pdfminersix.readthedocs.io/en/latest/)) and define a function that extracts the text.

In [35]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def convert_pdf_to_string(file_path):
    # collects the text of all pages in an in-memory buffer
    output_string = StringIO()
    with open(file_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    return output_string.getvalue()

Now we iterate over each individual file and use pdfminer to get the file content. You can run either one of the two cells below.

In [36]:
# replace with True if this should be run
if False:
    # write the content to a new column of the dataframe and save as csv
    # this takes between five and ten minutes
    df_unique['text'] = df_unique['filepath'].apply(convert_pdf_to_string)
    df_unique.to_csv('captum.csv')

In [37]:
# Remove Output Path if it exists, then create new
import shutil
out_dir = "./out/"
if os.path.exists(out_dir):
    shutil.rmtree(out_dir)
os.makedirs(out_dir)

# Alternative approach to cell above,
# replace with True if this should be run
if False:
    # Iterate over each individual file; enumerate() counts the processed files,
    # because the dataframe index has gaps after dropping duplicates
    for count, (index, row) in enumerate(df_unique.iterrows(), start=1):
        path = row['filepath']
        checksum = row['checksum']

        content = convert_pdf_to_string(path)

        out_file_path = os.path.join(out_dir, checksum + ".txt")

        with open(out_file_path, "w", encoding="utf-8") as text_file:
            text_file.write(content)

        if count % 10 == 0:
            print("{}/{} done".format(count, len(df_unique)))
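
Some PDFs may be damaged or encrypted, and a single exception would then abort the whole run. One way to guard against this is sketched below: a hypothetical wrapper `extract_all` that receives the extraction function as a parameter (for example `convert_pdf_to_string`) and collects failures instead of raising. Catching every exception type is an assumption made for brevity.

```python
def extract_all(filepaths, extractor):
    """Apply extractor to every path; return (texts, errors), both keyed by path."""
    texts, errors = {}, {}
    for path in filepaths:
        try:
            texts[path] = extractor(path)
        except Exception as exc:  # broad on purpose: keep going on any failure
            errors[path] = repr(exc)
    return texts, errors
```

In this notebook it could be called as `texts, errors = extract_all(df_unique['filepath'], convert_pdf_to_string)`, after which `errors` lists the files that need a closer look.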

## Using Optical Character Recognition (OCR)
As an alternative, we can apply Optical Character Recognition (OCR) to the PDFs, which may give a better result.

### Saving the pdfs as images

In [39]:
import pytesseract
from PIL import Image
from pdf2image import convert_from_path

# https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
# https://pdf2image.readthedocs.io/en/latest/reference.html
# the page count is returned as well, so the PDF does not have to be opened again later
def save_as_image(filepath):
    pages = convert_from_path(filepath)
    for p, page in enumerate(pages):
        path = filepath[:-4] + '_' + str(p) + '.jpg'
        # only save if file does not exist
        if not os.path.isfile(path):
            page.save(path, 'JPEG')
    return len(pages)

if True:
    df_unique['number_of_pages'] = df_unique['filepath'].apply(save_as_image)

df_unique.to_csv('captum.csv')
df_unique.head()


In [10]:
from pytesseract import image_to_string

def text_from_ocr(filepath, number_of_pages):
    text = ''
    for page in range(number_of_pages):
        text += image_to_string(filepath[:-4] + '_' + str(page) + '.jpg')
    return text

if (False):
    df_unique['text'] = df_unique.apply(lambda row: text_from_ocr(row['filepath'], row['number_of_pages']), axis=1)
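
Both `save_as_image()` and `text_from_ocr()` build the image path by slicing off the last four characters, which silently assumes a `.pdf` extension. A small sketch of a shared helper based on `os.path.splitext` follows; the name `page_image_path` is hypothetical.

```python
import os


def page_image_path(pdf_path, page):
    """Return the JPEG path for a given zero-based page of a PDF."""
    # splitext() separates the extension regardless of its length
    stem, _ext = os.path.splitext(pdf_path)
    return '{}_{}.jpg'.format(stem, page)


print(page_image_path('./CAPTUM/CRP/ANA/Asero 2017.pdf', 0))
# ./CAPTUM/CRP/ANA/Asero 2017_0.jpg
```

For paths that end in `.pdf`, the result is identical to the slicing used in the cells above, so images that were already saved would still be found.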

### Running OCR on the images

In [17]:
from multiprocessing import cpu_count, Pool
import numpy as np

def text_from_ocr(df) -> pd.DataFrame:
    # runs OCR on the page images of every PDF in this chunk of the dataframe
    for index, row in df.iterrows():
        print('entered row ' + str(index))
        text = ''
        for page in range(row['number_of_pages']):
            text += image_to_string(row['filepath'][:-4] + '_' + str(page) + '.jpg')
        print('row {}: {} pages, {} characters'.format(index, row['number_of_pages'], len(text)))
        df.loc[index, 'text'] = text
    return df

# https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1#6028
def parallelize_dataframe(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

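# index_col=0 restores the index that to_csv() wrote as the first column
df_unique = pd.read_csv('captum.csv', index_col=0)

# This will take more than two and a half hours
# os.sched_getaffinity() is only available on some Unix platforms (e.g. Linux)
df_unique = parallelize_dataframe(df_unique, text_from_ocr, len(os.sched_getaffinity(0)))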

In [36]:
df_unique.to_csv('captum.csv')
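
Since `pytesseract` hands each image to the external `tesseract` program, threads may be an alternative to processes here, which would avoid splitting and re-concatenating the dataframe. This is an assumption and was not measured in this notebook. A minimal sketch with `concurrent.futures`: `map_in_threads` is a hypothetical helper, and the `len` call merely stands in for the OCR function.

```python
from concurrent.futures import ThreadPoolExecutor


def map_in_threads(func, items, max_workers=4):
    """Apply func to every item using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


# stand-in workload; in the notebook, func would wrap image_to_string
print(map_in_threads(len, ['a', 'bb', 'ccc']))
# [1, 2, 3]
```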