# Import the source data
The data is provided as a directory that is three levels deep (the third level is omitted in the following listing).
``` bash
fiete@ubu:~/Documents/studium/analyse_semi_und_unstrukturierter_daten$ tree -d -L 1 CAPTUM
CAPTUM
├── Allergic Diseases
├── ANA
├── Angioedema
├── anti-FcεRI
├── Antihistamine
├── Anti-IgE
├── anti-TPO IgE ratio
├── ASST
├── Basophil
├── BAT
├── BHRA
├── CRP
├── Cyclosporine
├── D-Dimer
├── dsDNA
├── Duration
├── Eosinophil
├── IL-24
├── Omalizumab
├── Severity
├── Thyroglobulin
├── Total IgE
└── TPO
```

To work with the source data, it is useful to have a list of file paths for the PDFs. The following code walks the `CAPTUM` source folder and collects the path of every file it finds.

In [1]:
import os

path = './CAPTUM'

pdf_filepaths = []
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        pdf_filepaths.append(os.path.join(root, name))

pdf_filepaths[:5]

['./CAPTUM/CRP/ANA/Asero 2017.pdf',
 './CAPTUM/CRP/ANA/Magen 2015.pdf',
 './CAPTUM/CRP/Severity/Kolkhir 2017 .pdf',
 './CAPTUM/CRP/Severity/Baek 2014.pdf',
 './CAPTUM/CRP/Severity/Kasperska-Zajac 2015.pdf']
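
Note that the loop above collects every file under `CAPTUM`, whatever its extension. If the folder might also contain other files (for example hidden system files), a filter on the file suffix could be added. A minimal sketch, assuming a hypothetical helper `list_pdf_files()` that is not part of the original notebook:

```python
from pathlib import Path


def list_pdf_files(root: str) -> list[str]:
    """Return the sorted paths of all files below root whose suffix is .pdf (case-insensitive)."""
    return sorted(
        str(p) for p in Path(root).rglob('*')
        if p.is_file() and p.suffix.lower() == '.pdf'
    )
```

It could be used as `pdf_filepaths = list_pdf_files('./CAPTUM')`. Whether this changes the file count depends on what the folder actually contains.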

## Check data for duplicate entries
We can identify duplicate PDFs by computing the MD5 checksum of each file and then counting the unique values. First, we define the checksum function `get_checksum()`:

In [12]:
# https://stackoverflow.com/questions/16874598/how-do-i-calculate-the-md5-checksum-of-a-file-in-python#16876405
import hashlib

def get_checksum(filepath: str) -> str:
    # Open the file in binary mode, read it and compute the MD5 of its contents
    with open(filepath, 'rb') as file_to_check:
        # read contents of the file
        data = file_to_check.read()
        # hash the contents and return the digest as a hex string
        return hashlib.md5(data).hexdigest()

# check that it works
file_one, file_one_copy, file_two = "./pdf_1.pdf", "./pdf_1 copy.pdf", "./pdf_2.pdf"
assert get_checksum(file_one) == get_checksum(file_one_copy), "should be equal"
assert get_checksum(file_one) != get_checksum(file_two), "should not be equal"
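
`get_checksum()` reads each file into memory in one go, which is fine for typical papers. For very large files, the hash can instead be updated chunk by chunk. A minimal sketch; the function name `get_checksum_chunked()` and the 1 MiB chunk size are assumptions for illustration, and it should produce the same digest as `get_checksum()`:

```python
import hashlib


def get_checksum_chunked(filepath: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file without loading it fully into memory."""
    md5 = hashlib.md5()
    with open(filepath, 'rb') as file_to_check:
        # iter() with a sentinel keeps calling read() until it returns b'' (end of file)
        for chunk in iter(lambda: file_to_check.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```

MD5 is adequate for spotting identical files, but it is not suitable for security purposes; `hashlib.sha256` could be swapped in the same way if that matters.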

Then we create a pandas DataFrame from the list of file paths and add a `checksum` column that is computed with our `get_checksum()` function.

In [3]:
import pandas as pd
df = pd.DataFrame(pdf_filepaths, columns = ['filepath'])
df['checksum'] = df['filepath'].apply(get_checksum)
df

| | filepath | checksum |
|---|---|---|
| 0 | ./CAPTUM/CRP/ANA/Asero 2017.pdf | 2fad223ae2232cb9e855d3ece9e34b72 |
| 1 | ./CAPTUM/CRP/ANA/Magen 2015.pdf | c721aaea67a47811324b3c860dde612b |
| 2 | ./CAPTUM/CRP/Severity/Kolkhir 2017 .pdf | aed2cb292fdffefe2a319b9d7e517bb3 |
| 3 | ./CAPTUM/CRP/Severity/Baek 2014.pdf | 989e3eca08259c9a898acc551473f55f |
| 4 | ./CAPTUM/CRP/Severity/Kasperska-Zajac 2015.pdf | 2ed156f4fd5cfa00198f3f6f590940e0 |
| ... | ... | ... |
| 1054 | ./CAPTUM/Omalizumab/Cyclosporine/Rosenblum 202... | fb22292adf8f35656fde0e54dc0cee51 |
| 1055 | ./CAPTUM/Omalizumab/Cyclosporine/Gimenez Arnau... | 6a5635468c99716fc18b91b7b6ebaeaf |
| 1056 | ./CAPTUM/Omalizumab/Cyclosporine/Koski 2017.pdf | 6cfd7540663be0f6d7fb72f776339b71 |
| 1057 | ./CAPTUM/Omalizumab/Cyclosporine/Ke 2017.pdf | 849adffe6101df0a030cf425f661e1ed |


In the final step, we analyse the result. Only 466 of the 1059 files are unique, so the available data is less than half as large as it initially appears.

In [5]:
print('Total number of pdfs: {}'.format(df['checksum'].count()))
print('Total number of unique pdfs: {}'.format(len(df['checksum'].unique())))
df['checksum']


Total number of pdfs: 1059
Total number of unique pdfs: 466


0       2fad223ae2232cb9e855d3ece9e34b72
1       c721aaea67a47811324b3c860dde612b
2       aed2cb292fdffefe2a319b9d7e517bb3
3       989e3eca08259c9a898acc551473f55f
4       2ed156f4fd5cfa00198f3f6f590940e0
                      ...               
1054    fb22292adf8f35656fde0e54dc0cee51
1055    6a5635468c99716fc18b91b7b6ebaeaf
1056    6cfd7540663be0f6d7fb72f776339b71
1057    849adffe6101df0a030cf425f661e1ed
1058    f13be81ffbff55e031a34ef81d43cbff
Name: checksum, Length: 1059, dtype: object
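
To see which files share a checksum, the rows can be grouped by the `checksum` column. The sketch below uses a small made-up DataFrame (the paths and checksums are placeholders, not taken from the data set) with the same two columns as `df`:

```python
import pandas as pd

# Hypothetical example data with the same columns as df
example = pd.DataFrame({
    'filepath': ['./a/one.pdf', './b/one_copy.pdf', './a/two.pdf'],
    'checksum': ['aaa', 'aaa', 'bbb'],
})

# Mark every row whose checksum occurs more than once (keep=False flags all copies)
is_duplicate = example.duplicated(subset='checksum', keep=False)

# Collect the file paths that share each duplicated checksum
duplicate_groups = (
    example[is_duplicate]
    .groupby('checksum')['filepath']
    .apply(list)
)
print(duplicate_groups)
```

Applied to `df` instead of `example`, this would list, for each repeated checksum, the folders in which the same PDF appears.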

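Now we create a DataFrame that contains each unique PDF only once, keeping the first file path found for each checksum.

In [7]:
# keep the first occurrence of every checksum and drop the remaining duplicates
df_unique = df.drop_duplicates(subset='checksum').reset_index(drop=True)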

The next step is to read the text from the PDFs.
We use AWS Textract through the boto3 client. The cell below also imports `dotenv`, so both packages need to be installed in your environment:
pip install boto3 python-dotenv

In [27]:
import boto3
from dotenv import load_dotenv

# Load .env file
load_dotenv()

# TODO: iterate over all paths; for now only the first PDF is processed
path = df_unique.iloc[0]['filepath']

# Read document content
with open(path, 'rb') as pdf:
    imageBytes = bytearray(pdf.read())
    
# Amazon Textract client
textract = boto3.client(    
    'textract', 
    region_name=os.getenv('REGION_NAME'), 
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'), 
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
)

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

# Print each detected line of text in blue (ANSI escape codes)
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')

# TODO: Find different ways to interact with the result
# (read tables, abstract, contributors, body text, ...).

InvalidRegionError: Provided region_name '<region>' doesn't match a supported format.
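
The `InvalidRegionError` above suggests that `REGION_NAME` in the `.env` file still contains the placeholder `<region>` rather than a real AWS region such as `eu-central-1`. A check before creating the client can make this kind of misconfiguration easier to spot. A minimal sketch, assuming a hypothetical helper `read_required_env()` and assuming that unfilled placeholders are written in angle brackets:

```python
import os


def read_required_env(names, environ=os.environ):
    """Return a dict of the requested variables, raising if one is missing or a placeholder."""
    values = {}
    for name in names:
        value = environ.get(name, '').strip()
        # Assumption: unfilled template values look like '<region>'
        if not value or (value.startswith('<') and value.endswith('>')):
            raise ValueError(
                f'Environment variable {name} is not set to a real value: {value!r}'
            )
        values[name] = value
    return values
```

It could be called as `read_required_env(['REGION_NAME', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'])` after `load_dotenv()` and before `boto3.client(...)`, so that the notebook fails with a clear message before any request is made.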