# Import the source data
The data is provided as a directory that is three levels deep (the third level is omitted in the following listing).
``` bash
fiete@ubu:~/Documents/studium/analyse_semi_und_unstrukturierter_daten$ tree -d -L 1 CAPTUM
CAPTUM
├── Allergic Diseases
├── ANA
├── Angioedema
├── anti-FcεRI
├── Antihistamine
├── Anti-IgE
├── anti-TPO IgE ratio
├── ASST
├── Basophil
├── BAT
├── BHRA
├── CRP
├── Cyclosporine
├── D-Dimer
├── dsDNA
├── Duration
├── Eosinophil
├── IL-24
├── Omalizumab
├── Severity
├── Thyroglobulin
├── Total IgE
└── TPO
```

To work further with the source data, it is useful to have a list of file paths for the PDFs. The following creates a list of the paths of all files in the `CAPTUM` source folder.

In [62]:
import os

source_dir = './CAPTUM'

# walk the directory tree bottom-up and collect the path of every file
pdf_filepaths = []
for root, directories, files in os.walk(source_dir, topdown=False):
    for name in files:
        pdf_filepaths.append(os.path.join(root, name))

pdf_filepaths[:5]

['./CAPTUM/CRP/ANA/Asero 2017.pdf',
 './CAPTUM/CRP/ANA/Magen 2015.pdf',
 './CAPTUM/CRP/Severity/Kolkhir 2017 .pdf',
 './CAPTUM/CRP/Severity/Baek 2014.pdf',
 './CAPTUM/CRP/Severity/Kasperska-Zajac 2015.pdf']
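
The walk above collects every file it finds, regardless of extension. If the source folder might also contain other file types, the list could be restricted to PDFs. The following is a minimal sketch using `pathlib`; the helper name `list_pdf_filepaths` is an assumption made for illustration and is not part of the original notebook.

```python
from pathlib import Path


def list_pdf_filepaths(source_dir: str) -> list[str]:
    """Return the sorted paths of all files below source_dir with a .pdf suffix.

    Hypothetical helper, not part of the original notebook.
    """
    return sorted(
        str(p)
        for p in Path(source_dir).rglob('*')
        # compare the suffix case-insensitively so 'X.PDF' is kept as well
        if p.is_file() and p.suffix.lower() == '.pdf'
    )
```

It would be called as `pdf_filepaths = list_pdf_filepaths('./CAPTUM')`. Note that `pathlib` normalizes paths, so the results would start with `CAPTUM/` rather than `./CAPTUM/`.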

## Check data for duplicate entries

In [90]:
# https://stackoverflow.com/questions/16874598/how-do-i-calculate-the-md5-checksum-of-a-file-in-python#16876405
import hashlib

def get_checksum(filepath: str) -> str:
    """Return the MD5 hex digest of the file's contents."""
    # the with-block closes the file once its contents have been read
    with open(filepath, 'rb') as file_to_check:
        # read the whole file into memory
        data = file_to_check.read()
    # hash the contents and return the digest as a hex string
    return hashlib.md5(data).hexdigest()

# check that it works
file_one, file_two, file_three = "./Arik yilmaz 2017.pdf", "./Arik yilmaz 2017 (copy).pdf", "./Bruno 2001.pdf"
assert get_checksum(file_one) == get_checksum(file_two), "should be equal"
assert get_checksum(file_one) != get_checksum(file_three), "should not be equal"
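
`get_checksum` reads each file into memory in one go, which is usually fine for PDFs of ordinary size. If very large files are a concern, the hash can be fed in chunks instead. The following is a sketch of that variant; the name `get_checksum_chunked` and the default `chunk_size` of 1 MiB are assumptions chosen for illustration.

```python
import hashlib


def get_checksum_chunked(filepath: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it chunk by chunk.

    Hypothetical variant of get_checksum; chunk_size is an arbitrary default.
    """
    digest = hashlib.md5()
    with open(filepath, 'rb') as file_to_check:
        # iter() with a sentinel keeps calling read() until it returns b''
        for chunk in iter(lambda: file_to_check.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Both variants produce the same digest for the same file, so they can be used interchangeably.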

In [76]:
import pandas as pd
df = pd.DataFrame(pdf_filepaths, columns = ['filepath'])
df

| | filepath |
|---|---|
| 0 | ./CAPTUM/CRP/ANA/Asero 2017.pdf |
| 1 | ./CAPTUM/CRP/ANA/Magen 2015.pdf |
| 2 | ./CAPTUM/CRP/Severity/Kolkhir 2017 .pdf |
| 3 | ./CAPTUM/CRP/Severity/Baek 2014.pdf |
| 4 | ./CAPTUM/CRP/Severity/Kasperska-Zajac 2015.pdf |
| ... | ... |
| 1054 | ./CAPTUM/Omalizumab/Cyclosporine/Rosenblum 202... |
| 1055 | ./CAPTUM/Omalizumab/Cyclosporine/Gimenez Arnau... |
| 1056 | ./CAPTUM/Omalizumab/Cyclosporine/Koski 2017.pdf |
| 1057 | ./CAPTUM/Omalizumab/Cyclosporine/Ke 2017.pdf |
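
With a list of file paths and a checksum function available, one way to find duplicates is to group the paths by checksum and keep only the groups with more than one member. The following is a minimal sketch using only the standard library. `find_duplicates` is a hypothetical helper, not part of the original notebook, and `get_checksum` is repeated so that the block is self-contained.

```python
import hashlib
from collections import defaultdict


def get_checksum(filepath: str) -> str:
    """Return the MD5 hex digest of the file's contents."""
    with open(filepath, 'rb') as file_to_check:
        return hashlib.md5(file_to_check.read()).hexdigest()


def find_duplicates(filepaths: list[str]) -> dict[str, list[str]]:
    """Map each checksum shared by two or more files to the paths of those files.

    Hypothetical helper for illustration.
    """
    by_checksum = defaultdict(list)
    for filepath in filepaths:
        by_checksum[get_checksum(filepath)].append(filepath)
    # keep only checksums that occur more than once
    return {
        checksum: paths
        for checksum, paths in by_checksum.items()
        if len(paths) > 1
    }
```

It would be called as `duplicates = find_duplicates(pdf_filepaths)`. Matching checksums indicate files with identical bytes, so this approach would not detect the same article saved as two differently encoded PDFs.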
