# **OCR DXA Reports Using AWS Textract**

**<h2>Overview</h2>**<br/>
This notebook extracts text and data from <strong>DICOM report images</strong> (in my case, DXA reports). <br/>


### **Phase 1 - Read DICOM files**

> First, we need to install *pydicom* to read DICOM files. *pydicom* is a pure Python package for working with DICOM files. <br/>We will read selected elements from each DICOM dataset and write them to a .csv file. <br/>For more information about *pydicom*, click [here](https://pydicom.github.io/pydicom/stable/old/getting_started.html)

In [None]:
# change working directory
import os
os.chdir('F:/')

In [None]:
%run C:/Users/mchoi/Desktop/dicom/read_dicom.py --infile=DXA --outfile=DXA.csv
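
The `read_dicom.py` script itself is not shown in this notebook. As a rough idea of what such a step could look like, here is a minimal sketch that walks a folder, reads each file's header with `pydicom.dcmread`, and writes a few elements to a CSV. The function names and the `FIELDS` list are assumptions made for illustration, not the contents of the original script.

```python
import csv
from pathlib import Path

# Hypothetical selection of DICOM keywords; the original script's list is not shown.
FIELDS = ["PatientID", "StudyDate", "SeriesDescription", "ProtocolName"]


def dataset_to_row(filename, ds, fields=FIELDS):
    """Build one CSV row from a dataset-like object, using '' for missing elements."""
    row = {"filename": str(filename)}
    for name in fields:
        row[name] = str(getattr(ds, name, ""))
    return row


def write_header_csv(indir, outfile, fields=FIELDS):
    """Walk indir for .dcm files and write the selected elements to outfile."""
    import pydicom  # imported here so dataset_to_row can be used without pydicom

    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["filename", *fields])
        writer.writeheader()
        for path in sorted(Path(indir).rglob("*.dcm")):
            # Skip the pixel data: only the header elements are needed here.
            ds = pydicom.dcmread(str(path), stop_before_pixels=True)
            writer.writerow(dataset_to_row(path, ds, fields))
```

Calling `write_header_csv("DXA", "DXA.csv")` would then play the role of the `%run` cell above, under the assumptions stated.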

### **Phase 2 - Convert DCM to PNG**

> Second, we need to convert the .dcm files to .png format. Let's import some libraries.



In [None]:
import pandas as pd
import pydicom
import numpy as np
import matplotlib.pyplot as plt
import cv2
import concurrent.futures

from tqdm.notebook import tqdm
from pydicom.pixel_data_handlers.util import apply_color_lut

Load the CSV file we made in <strong>Phase 1</strong> into a pandas DataFrame. Group the rows by our criteria, 'SeriesDescription' and 'ProtocolName' (the code below groups by 'ProtocolName' only and keeps the two-column variant commented out), and create one directory per group. The converted .png files are saved into those directories.

In [None]:
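def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def make_imgs(key, group):
    """Convert every file in one group to PNG, inside a directory named after the group key."""
    # path = os.path.join(key[0], key[1])  # use this when grouping by two columns
    path = key
    os.makedirs(path, exist_ok=True)

    # Convert the group's files in chunks of 2000, one worker thread per chunk.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(chunk_process, lst, path) for lst in chunks(group['filename'].values, 2000)]
        for idx, future in enumerate(concurrent.futures.as_completed(futures)):
            res = future.result()  # re-raises any exception from the worker
            print(f"chunk_process Job#{idx}, result: {res}")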
            

In [None]:
def chunk_process(lst, path):
    """Convert each DICOM file in lst to a PNG in path, skipping files already converted."""
    already = set(os.listdir(path))
    for f in tqdm(lst):
        filename = f.replace("\\", "_").replace(".dcm", ".png")
        if filename in already:
            continue
        des = os.path.join(path, filename)
        ds = pydicom.dcmread(f)
        arr = ds.pixel_array
        photometric = ds.PhotometricInterpretation
        if 'RGB' in photometric:
            plt.imsave(des, arr)
        elif 'PALETTE' in photometric:
            rgb = apply_color_lut(arr, ds)
            # OpenCV writes channels in BGR order, so convert from RGB first.
            cv2.imwrite(des, cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))
        elif 'MONOCHROME' in photometric:
            plt.imsave(des, arr, cmap='gray')
        else:
            # Unsupported photometric interpretation: report it and move on.
            print(f'{f}: {photometric}')

In [None]:
data = pd.read_csv("C:/Users/mchoi/Desktop/1234/DXA.csv", encoding='utf-8')
#data = pd.read_excel('2013.xlsx', header= 0)
# df = data[['filename','SeriesDescription','ProtocolName']]
df = data[['filename','ProtocolName']]
# df['SeriesDescription'] = df['SeriesDescription'].fillna("None")
df = df[df['ProtocolName'] == 'Left Femur']

# grouped = df.groupby(['SeriesDescription', 'ProtocolName'])
grouped = df.groupby('ProtocolName')
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(make_imgs, key, group) for key,group in grouped]
    for idx, future in enumerate(concurrent.futures.as_completed(futures)):
        res = future.result()
        print("make_imgs Job#", idx, "result ", res)
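
To see where a converted file ends up before running the full conversion, the naming rule used in `chunk_process` can be reproduced on its own. This is a small sketch; `png_destination` is a hypothetical helper name and the input path is made up.

```python
import os


def png_destination(dcm_path, protocol_name):
    """Return the PNG path for a DICOM file: <protocol_name>/<flattened name>.png"""
    # Backslashes in the source path become underscores, so the folder
    # structure is folded into the file name.
    flat_name = dcm_path.replace("\\", "_").replace(".dcm", ".png")
    return os.path.join(protocol_name, flat_name)


# Hypothetical input path, written the way Windows would store it in the CSV.
print(png_destination("DXA\\0001\\report.dcm", "Left Femur"))
```

Flattening the path this way keeps names unique when different source folders contain files with the same base name.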

### **Phase 3 - Extract text from Image using AWS Textract**


> Let's extract our data quickly with Textract. We will have Textract detect text synchronously by calling the 'DetectDocumentText' API operation. The results are returned as a JSON structure.


In [None]:
import os
from tqdm.notebook import tqdm
import boto3
import json

In [None]:
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

In [None]:
def do_textract(lst):
    # Create the client once per chunk rather than once per file.
    client = boto3.client(service_name='textract', region_name='us-east-1')

    for filename in tqdm(lst):
        with open(filename, 'rb') as file:
            imageBytes = file.read()

        try:
            # Detect text in the document
            response = client.detect_document_text(Document={'Bytes': imageBytes})
            with open('./json/' + filename.split('\\')[-1] + '.json', mode='w', encoding='utf-8') as f:
                json.dump(response, f)
        except Exception as e:
            # Report the failing file and the reason, then continue with the next one.
            print(f'{filename}: {e}')

In [None]:
import concurrent.futures


file_dir = 'F:/Left Femur'

os.chdir(file_dir)
print(f'Now in {os.getcwd()}')
os.makedirs('json', exist_ok=True)

filelst = os.listdir(file_dir)
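pnglst = [f for f in filelst if f.endswith('.png')]
# Strip the '.json' suffix so the names match the PNG file names.
jsonlst = [j[:-5] for j in os.listdir(os.path.join(file_dir,'json'))]
# Only process images that do not have a JSON result yet.
target = list(set(pnglst)-set(jsonlst))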

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(do_textract, lst) for lst in chunks(target, 1050)]
    for idx, future in enumerate(concurrent.futures.as_completed(futures)):
        res = future.result()
        print("Processed Job ", idx, "result ", res)
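
Each saved JSON file holds a `Blocks` list in which every entry has a `BlockType` (`PAGE`, `LINE` or `WORD`); `LINE` and `WORD` blocks also carry `Text` and `Confidence`. As a minimal sketch of reading the results back, the helpers below pull out the detected lines. The function names and the sample values are made up for illustration.

```python
import json


def lines_from_response(response, min_confidence=0.0):
    """Return (text, confidence) for every LINE block in a DetectDocumentText response."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def load_lines(json_path, min_confidence=0.0):
    """Load one saved response from disk and return its lines."""
    with open(json_path, encoding="utf-8") as f:
        return lines_from_response(json.load(f), min_confidence)


# Tiny hand-written response in the same shape (values are made up).
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Neck 0.812", "Confidence": 99.1},
    {"BlockType": "WORD", "Text": "Neck", "Confidence": 99.5},
]}
print(lines_from_response(sample))
```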


### **Phase 4 - Export Tabular Data into a CSV File**


> We don't need all of the text from the images, only the data stored in tables. Textract returns the location of each line and word, so we will use those locations to extract the essential data. You can skip this step if you used the 'AnalyzeDocument' API in the previous step instead of 'DetectDocumentText'.

In [None]:
%run C:/Users/mchoi/Desktop/dicom/json_parser.py --path="F:/Left Femur/json"
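
`json_parser.py` is not shown here, so the following is only a sketch of one way to use the locations: group `WORD` blocks into table rows by the vertical centre of their bounding boxes, then order each row from left to right. Textract reports `BoundingBox` values (`Left`, `Top`, `Width`, `Height`) as fractions of the page size. The function name and the `tolerance` default are assumptions.

```python
def group_into_rows(blocks, tolerance=0.01):
    """Cluster WORD blocks with similar vertical centres into rows of text."""

    def center_y(block):
        box = block["Geometry"]["BoundingBox"]
        return box["Top"] + box["Height"] / 2

    def left(block):
        return block["Geometry"]["BoundingBox"]["Left"]

    words = sorted(
        (b for b in blocks if b.get("BlockType") == "WORD"), key=center_y
    )
    rows = []
    for word in words:
        # Start a new row when the word sits clearly below the previous one.
        if rows and abs(center_y(word) - center_y(rows[-1][-1])) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w["Text"] for w in sorted(row, key=left)] for row in rows]


def _word(text, left, top):
    """Build a minimal WORD block for the demo below (made-up geometry)."""
    box = {"Left": left, "Top": top, "Width": 0.1, "Height": 0.02}
    return {"BlockType": "WORD", "Text": text, "Geometry": {"BoundingBox": box}}


demo_blocks = [
    _word("0.812", 0.4, 0.101), _word("Neck", 0.1, 0.100),
    _word("Total", 0.1, 0.150), _word("0.901", 0.4, 0.149),
]
print(group_into_rows(demo_blocks))
```

The tolerance needs tuning per report layout: too small splits one table row in two, too large merges neighbouring rows.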

## Intermission
OCR results are not 100% accurate.
Open the output.csv file in Google Sheets and check the values.
You may get an error in the next step if this quality check isn't done properly.
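
Part of this check can be automated. The sketch below flags cells that should be numeric but do not parse as numbers, which catches common OCR slips such as a letter `O` in place of a zero. The column names in the demo are assumptions, since the layout of output.csv is not shown here.

```python
import csv


def find_suspect_cells(rows, numeric_columns):
    """Return (row_number, column, value) for cells that are not valid numbers.

    rows is any iterable of dicts, for example a csv.DictReader.
    Row numbers start at 2 because row 1 of the file is the header.
    """
    suspects = []
    for row_number, row in enumerate(rows, start=2):
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            try:
                float(value)
            except ValueError:
                suspects.append((row_number, column, value))
    return suspects


# Made-up rows; with a real file, pass csv.DictReader(open_file) instead.
demo_rows = [
    {"Region": "Neck", "BMD": "0.812"},
    {"Region": "Total", "BMD": "O.901"},  # letter O instead of zero
]
print(find_suspect_cells(demo_rows, ["BMD"]))
```

A script like this only narrows down where to look; values that are numeric but wrong still need a manual check against the image.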

### **Phase 5 - Reshape data based on column of the CSV File**


> Read json/output.csv into a pandas DataFrame. We are going to reshape the data based on the 'Region' column.

In [None]:
%run C:/Users/mchoi/Desktop/dicom/csv_parser.py --infile "F:/Left Femur/json/output.csv"
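
`csv_parser.py` is not included in this notebook. One plausible reading of "reshape based on the 'Region' column" is a long-to-wide pivot, so that each image gets one row and each region gets its own column. The sketch below uses `DataFrame.pivot` on made-up data; the `filename` and `BMD` column names are assumptions.

```python
import pandas as pd

# Made-up long-format rows; 'filename' and 'BMD' are assumed column names.
long_df = pd.DataFrame({
    "filename": ["a.png", "a.png", "b.png", "b.png"],
    "Region": ["Neck", "Total", "Neck", "Total"],
    "BMD": [0.812, 0.901, 0.700, 0.765],
})

# One row per image, one column per region.
wide = long_df.pivot(index="filename", columns="Region", values="BMD")
wide.columns = [f"{region}_BMD" for region in wide.columns]
wide = wide.reset_index()
print(wide)
```

`pivot` raises an error when the same image and region pair appears twice, which is one way an unchecked OCR mistake in the 'Region' column can surface at this step.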