# Import the source data
The data is provided as a directory tree that is three levels deep (the third level is omitted in the following listing).
``` bash
fiete@ubu:~/Documents/studium/analyse_semi_und_unstrukturierter_daten$ tree -d -L 1 CAPTUM
CAPTUM
├── Allergic Diseases
├── ANA
├── Angioedema
├── anti-FcεRI
├── Antihistamine
├── Anti-IgE
├── anti-TPO IgE ratio
├── ASST
├── Basophil
├── BAT
├── BHRA
├── CRP
├── Cyclosporine
├── D-Dimer
├── dsDNA
├── Duration
├── Eosinophil
├── IL-24
├── Omalizumab
├── Severity
├── Thyroglobulin
├── Total IgE
└── TPO
```

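To work with the source data, it helps to have a list of file paths for the PDFs. The following cell walks the `CAPTUM` source folder and collects the path of every file it finds. Note that it does not filter by extension, so non-PDF files such as `desktop.ini` end up in the list as well.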

In [3]:
import os

path = './CAPTUM'

pdf_filepaths = []
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        pdf_filepaths.append(os.path.join(root, name))

pdf_filepaths[:5]

['./CAPTUM\\Allergic Diseases\\Angioedema\\Arik yilmaz 2017.pdf',
 './CAPTUM\\Allergic Diseases\\Angioedema\\Bruno 2001.pdf',
 './CAPTUM\\Allergic Diseases\\Angioedema\\Cousin 2016.pdf',
 './CAPTUM\\Allergic Diseases\\Angioedema\\Faisant 2016.pdf',
 './CAPTUM\\Allergic Diseases\\Angioedema\\Kahveci 2020.pdf']
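
Because the walk above collects every file, a variant that keeps only PDFs may be preferable. The following is a minimal sketch, assuming the extension check should be case-insensitive. The helper name `list_pdf_filepaths` is hypothetical and not part of the original notebook.

```python
import os
from typing import List


def list_pdf_filepaths(root_dir: str) -> List[str]:
    """Return the sorted paths of all files below root_dir that end in .pdf."""
    pdf_paths = []
    for root, _directories, files in os.walk(root_dir):
        for name in files:
            # Assumption: '.PDF' and '.pdf' should both count as PDF files
            if name.lower().endswith('.pdf'):
                pdf_paths.append(os.path.join(root, name))
    return sorted(pdf_paths)
```

Calling `list_pdf_filepaths('./CAPTUM')` would then skip files such as `desktop.ini`. Note that the file counts reported later in this section are based on the unfiltered list.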

## Check data for duplicate entries
We can identify duplicate PDFs by computing a checksum for each file and then counting the unique values. To do so, we first define the checksum function `get_checksum()`:

In [4]:
# https://stackoverflow.com/questions/16874598/how-do-i-calculate-the-md5-checksum-of-a-file-in-python#16876405
import hashlib

def get_checksum(filepath: str) -> str:
    # Read the whole file in binary mode and return the MD5 hex digest of its contents
    with open(filepath, 'rb') as file_to_check:
        data = file_to_check.read()
        return hashlib.md5(data).hexdigest()

# check that it works
file_one, file_two, file_three = "./Arik yilmaz 2017.pdf", "./Arik yilmaz 2017 (copy).pdf", "./Bruno 2001.pdf"
assert get_checksum(file_one) == get_checksum(file_two), "should be equal"
assert get_checksum(file_one) != get_checksum(file_three), "should not be equal"
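
`get_checksum()` loads each file fully into memory, which is fine for typical papers. For very large files, the digest can instead be updated chunk by chunk. The following is a sketch under that assumption. The name `get_checksum_chunked` and the 1 MiB default chunk size are illustrative choices, not part of the original notebook.

```python
import hashlib


def get_checksum_chunked(filepath: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(filepath, 'rb') as file_to_check:
        # iter() with a sentinel keeps calling read() until it returns b'' (end of file)
        for chunk in iter(lambda: file_to_check.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Both functions should return the same digest for the same file, so either one can be passed to `apply()`.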

Then we can create a pandas DataFrame from the list of file paths and add a checksum column that is computed with our `get_checksum()` function.

In [9]:
import pandas as pd
df = pd.DataFrame(pdf_filepaths, columns = ['filepath'])
df['checksum'] = df['filepath'].apply(get_checksum)
df

| | filepath | checksum |
|---|---|---|
| 0 | ./CAPTUM\Allergic Diseases\Angioedema\Arik yil... | ad656fbed80a09bc5a842a528cbcfa5d |
| 1 | ./CAPTUM\Allergic Diseases\Angioedema\Bruno 20... | 6e0337369eae48049f7f080b48ca3af9 |
| 2 | ./CAPTUM\Allergic Diseases\Angioedema\Cousin 2... | b53f40ffe6c949eb06d7098d66e10fca |
| 3 | ./CAPTUM\Allergic Diseases\Angioedema\Faisant ... | 2882866de4e0ad21634941674bd81fe4 |
| 4 | ./CAPTUM\Allergic Diseases\Angioedema\Kahveci ... | 6f1b0a59e73bedae5a5250aa82500c26 |
| ... | ... | ... |
| 1055 | ./CAPTUM\TPO\Thyroglobulin\Sanchez 2020.pdf | 7374451f8e1a341658d500b5577074e0 |
| 1056 | ./CAPTUM\TPO\Thyroglobulin\Silvares 2017.pdf | c85a07aeb5807151c5479d1c2c219a0f |
| 1057 | ./CAPTUM\TPO\Thyroglobulin\Wan 2012.pdf | c4f1d27a5e3bcc7f023c0d180376e0d7 |
| 1058 | ./CAPTUM\desktop.ini | 15478b340a8362bb79fd2a6ea0dde1a0 |


In the final step, we can analyse the results. Only 467 of the 1060 files have a unique checksum, so the available data is in reality less than half as large as it initially appears.

In [10]:
print('Total number of pdfs: {}'.format(df['checksum'].count()))
print('Total number of unique pdfs: {}'.format(len(df['checksum'].unique())))
df['checksum']


Total number of pdfs: 1060
Total number of unique pdfs: 467


0       ad656fbed80a09bc5a842a528cbcfa5d
1       6e0337369eae48049f7f080b48ca3af9
2       b53f40ffe6c949eb06d7098d66e10fca
3       2882866de4e0ad21634941674bd81fe4
4       6f1b0a59e73bedae5a5250aa82500c26
                      ...               
1055    7374451f8e1a341658d500b5577074e0
1056    c85a07aeb5807151c5479d1c2c219a0f
1057    c4f1d27a5e3bcc7f023c0d180376e0d7
1058    15478b340a8362bb79fd2a6ea0dde1a0
1059    f13be81ffbff55e031a34ef81d43cbff
Name: checksum, Length: 1060, dtype: object
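
To see which files are copies of each other, the rows can be grouped by checksum. The following is a minimal sketch on toy data. The DataFrame `toy` and its paths and checksums are made up for illustration and only mirror the column names of `df`.

```python
import pandas as pd

# Toy data standing in for the real `df`; paths and checksums are invented.
toy = pd.DataFrame({
    'filepath': ['a/x.pdf', 'b/x.pdf', 'a/y.pdf', 'c/x.pdf'],
    'checksum': ['111', '111', '222', '111'],
})

# Keep every row whose checksum occurs more than once ...
duplicated_rows = toy[toy.duplicated('checksum', keep=False)]
# ... and collect the file paths that share each checksum.
duplicate_groups = duplicated_rows.groupby('checksum')['filepath'].apply(list)
print(duplicate_groups)
```

Applied to the real `df`, the same two statements would list, for each repeated checksum, all folders in which that file appears.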

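Now we create a DataFrame that contains only the unique PDFs by keeping the first row for each checksum.

In [11]:
# Keep the first occurrence of each checksum and renumber the rows
df_unique = df.drop_duplicates(subset='checksum').reset_index(drop=True)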

The next step is to read the text from the PDFs. We use AWS Textract for this. To install the client, add `boto3` to your environment:
`pip install boto3`

In [19]:
import boto3

# TODO: iterate over all paths; for now only the first file is processed
path = df_unique.iloc[0]['filepath']

# Read document content
with open(path, 'rb') as pdf:
    imageBytes = bytearray(pdf.read())
    
# Amazon Textract client
textract = boto3.client(    
    'textract', 
    region_name='<region>', 
    aws_access_key_id='<key_id>', 
    aws_secret_access_key='<key>'
)

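# Call Amazon Textract (synchronous API; it is documented for single-page input,
# multi-page PDFs need the asynchronous API with the file stored in S3)
response = textract.detect_document_text(Document={'Bytes': imageBytes})

# print(response)

# Print the detected text lines in blue (ANSI escape codes)
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')

# TODO: find other ways to work with the result, e.g. read tables or extract
# the abstract, contributors, and body text.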
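
As a first step towards working with the result, the `LINE` blocks can be collected per page instead of printed. The following is a sketch that only relies on the `Blocks`, `BlockType`, `Text`, and `Page` keys of the response. The helper name `lines_by_page` and the `fake_response` dictionary are hypothetical, and the latter is hand-written rather than real Textract output.

```python
from collections import defaultdict
from typing import Dict, List


def lines_by_page(response: dict) -> Dict[int, List[str]]:
    """Group the text of all LINE blocks by page number."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Assumption: a missing "Page" key means a single-page document
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


# Hand-written stand-in for a Textract response (not real output)
fake_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Title of the paper"},
        {"BlockType": "WORD", "Page": 1, "Text": "Title"},
        {"BlockType": "LINE", "Page": 1, "Text": "Abstract"},
    ]
}

for page, lines in lines_by_page(fake_response).items():
    print(page, "\n".join(lines))
```

Joining the lines per page yields plain text that later steps can search for sections such as the abstract.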