## Document Conversion, Image Extraction and Selection

### Document Conversion
To convert documents to JSON files with deepsearch, which is the first step toward extracting the images, you need to register [here](https://ds4sd.github.io/deepsearch-toolkit/guide/configuration/) and then run:

`$ deepsearch profile config
Host: https://deepsearch-experience.res.ibm.com
Username: name@example.com
Api key:`

In [None]:
import deepsearch as ds

In [None]:
from deepsearch.cps.client.api import CpsApi

api = CpsApi.from_env()
print([(p.name, p.key) for p in api.projects.list()])

In [None]:
PATH_DOCS = "PATH OR LINK TO PDF.pdf"
PROJ_KEY = ""
RESULT_DIR = "PATH TO OUTPUT FOLDER"

# for online documents use urls= and for local files use source_path
documents = ds.convert_documents(api=api, proj_key=PROJ_KEY, urls=PATH_DOCS)

documents.download_all(result_dir=RESULT_DIR)

This cell produces a JSON file describing the extraction output for the submitted PDF. The "figures" section, which carries the bounding-box (bbox) information, is the one we will use to extract the images we are interested in.

`{"_name": "s41746-023-00881-0.pdf", "_type": "pdf-document", ...,  "figures": [{"bounding-box": {"max": [6.9888954, 241.64507, 595.276, 734.6084], "min": [47.72477340698242, 243.7888793945312, 555.028564453125, 733.34814453125]}, "cells": {"data": [], "header": ["x0", "y0", "x1", "y1", "font", "text"]}, "confidence": 0.865432858467102, "created_by": "high_conf_pred", "prov": [{"bbox": [47.72477340698242, 243.7888793945312, 555.028564453125, 733.34814453125], "page": 2, "span": [0, 0]}], ....`
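Before cropping, it can help to list what the "figures" section actually contains. The following is a minimal sketch, not part of the original notebook: `list_figures` and the `min_confidence` threshold are hypothetical, the `sample` document is a hand-written stand-in that only uses keys visible in the excerpt above, and the bbox is assumed to be read as `[x0, y0, x1, y1]` with 1-based page numbers, which is how the cropping code in the next section interprets it.

```python
import json

# Hand-written stand-in for a converted document (values are illustrative only).
sample = json.loads("""
{
  "_name": "example.pdf",
  "figures": [
    {"confidence": 0.87,
     "prov": [{"bbox": [47.7, 243.8, 555.0, 733.3], "page": 2, "span": [0, 0]}]},
    {"confidence": 0.42,
     "prov": [{"bbox": [60.0, 100.0, 300.0, 250.0], "page": 5, "span": [0, 0]}]}
  ]
}
""")


def list_figures(doc, min_confidence=0.0):
    """Return (page, bbox) pairs for figures at or above min_confidence."""
    found = []
    for fig in doc.get("figures", []):
        if fig.get("confidence", 0.0) < min_confidence:
            continue
        prov = fig["prov"][0]  # first provenance entry of the figure
        found.append((prov["page"], prov["bbox"]))
    return found


all_figs = list_figures(sample)
confident = list_figures(sample, min_confidence=0.5)
```

Filtering on `confidence` is optional; whether a threshold is useful depends on how many false detections the conversion produces for your documents.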

### Image Extraction

In [15]:
from io import BytesIO

from PyPDF2 import PdfReader, PdfWriter


class PdfFileWriterWithStreamAttribute(PdfWriter):
    """PdfWriter that also carries an in-memory `stream` attribute."""

    def __init__(self):
        super().__init__()
        self.stream = BytesIO()

In [30]:
import json
import os
import glob

JSON_FILE = "PATH TO JSON.json"
PDF_FILE = "PATH TO PDF.pdf"
CROP_DIR = "PATH TO SAVE CROPPED IMAGES FOLDER"

with open(JSON_FILE) as json_file:
    jsonData = json.load(json_file)
# For multiple PDFs, loop over the files in a directory (e.g. with glob).
figures = jsonData["figures"]

The next cell reads each figure's bounding box and exports one cropped PDF file per figure found.

In [None]:
reader = PdfReader(PDF_FILE)
if os.path.exists(JSON_FILE):
    with open(JSON_FILE) as json_file:
        data = json.load(json_file)
    pdf_stem = os.path.splitext(os.path.basename(PDF_FILE))[0]
    # Iterate over every figure listed in the JSON.
    for image_num, figure in enumerate(data["figures"]):
        cors = figure["prov"][0]["bbox"]  # [x0, y0, x1, y1]
        page_num = figure["prov"][0]["page"] - 1  # JSON pages are 1-based
        page = reader.pages[page_num]
        writer = PdfFileWriterWithStreamAttribute()
        page.cropbox.upper_right = (cors[2], cors[3])
        page.cropbox.lower_left = (cors[0], cors[1])
        writer.add_page(page)
        # Include the figure index so that several figures on one page
        # do not overwrite each other's output file.
        out_name = f"{pdf_stem}_cropped_page_{page_num + 1}_fig_{image_num}.pdf"
        with open(os.path.join(CROP_DIR, out_name), "wb") as outstream:
            writer.write(outstream)

### Image Selection

Note that to train this model you can request **DermEducation**, a set of dermatology images used for educational purposes. **DermEducation** contains 2708 images in total: 461 non-skin images and 2247 skin images (1932 with Fitzpatrick skin type (FST) I-IV and 315 with FST V-VI).

- $X$ will contain the features to be used: Histogram of Oriented Gradients (HoG) descriptors plus the mean and standard deviation of the image channels in the CIELAB (24) color space.

- $y$ will contain the binary labels: skin or non-skin image.
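To make the channel-statistics part of $X$ concrete, here is a minimal sketch of computing the per-channel mean and standard deviation of an image. It assumes the image has already been converted to CIELAB (for example with `skimage.color.rgb2lab`) and is stored as an H x W x 3 array. `channel_stats` is a hypothetical helper written for illustration; it is not the function in `utils.preprocessing_utils`, and it leaves out the HoG descriptors.

```python
import numpy as np


def channel_stats(image_lab: np.ndarray) -> np.ndarray:
    """Return [mean_L, mean_a, mean_b, std_L, std_a, std_b] for an H x W x 3 array."""
    # Flatten the spatial dimensions so each row is one pixel.
    pixels = image_lab.reshape(-1, image_lab.shape[-1]).astype(np.float64)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])


# Tiny synthetic 2 x 2 "image" with hand-picked channel values (illustrative only).
demo = np.array(
    [[[50, 10, 20], [70, 10, 40]],
     [[50, 10, 20], [70, 10, 40]]],
    dtype=np.float64,
)
stats = channel_stats(demo)
```

For this synthetic input the means are 60, 10 and 30 and the (population) standard deviations are 10, 0 and 10. In a real pipeline these six values would be concatenated with the HoG descriptor to form one row of $X$.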

#### 1. Feature Extraction

In [None]:
from utils.preprocessing_utils import preprocesing_masks_for_classification

_, _, features_skin = preprocesing_masks_for_classification(images_w_skin, ita=False)
# NOTE: `images_wo_skin` is an assumed name for the collection of non-skin images.
_, _, features_non_skin = preprocesing_masks_for_classification(
    images_wo_skin, ita=False
)
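The training cell below expects a feature matrix `X` and a label vector `y`, which the notebook does not build explicitly. One way to assemble them from the two feature sets is sketched here. It assumes that each feature set is array-like with shape (n_images, n_features) and that the label convention is 1 for skin and 0 for non-skin; both the helper `build_dataset` and that convention are assumptions made for illustration.

```python
import numpy as np


def build_dataset(features_skin, features_non_skin):
    """Stack both feature sets into X and create matching binary labels y."""
    skin = np.asarray(features_skin, dtype=np.float64)
    non_skin = np.asarray(features_non_skin, dtype=np.float64)
    X = np.vstack([skin, non_skin])
    # Assumed convention: 1 = skin image, 0 = non-skin image.
    y = np.concatenate(
        [np.ones(len(skin), dtype=int), np.zeros(len(non_skin), dtype=int)]
    )
    return X, y


# Synthetic stand-ins for the real feature vectors (illustrative only).
demo_skin = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
demo_non_skin = [[0.9, 0.8]]
X_demo, y_demo = build_dataset(demo_skin, demo_non_skin)
```

With the real data, the call would be `X, y = build_dataset(features_skin, features_non_skin)`.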

#### 2. Train Binary XGBoost with feature vectors

In [None]:
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# X is the feature matrix and y the binary labels described above.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

data_dmatrix = xgb.DMatrix(data=X_scaled, label=y)
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "eta": 0.03,
    "subsample": 0.5,
}
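# With metrics="error", the cross-validation results report the classification error per boosting round.
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3, metrics="error", seed=42)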