# PDF Cleaning Example

The first step is to import the `pdf_cleaning` module (aliased as `pc`). We then define the directory containing the initial pdfs (`input_dir`), the directory for the page images obtained by splitting them (`split_dir`), and the directory for the cleaned pdfs (`output_dir`). We also choose the model to use (`model_dir`) and where to store the file structure JSON (`dictionnary_dir`).

In [2]:
# ! pip install pdf_cleaning@git+https://github.com/MathieuDemarets/pdf-cleaning

import pdf_cleaning as pc

input_dir = 'example/input_pdf'
output_dir = 'example/cleaned_pdf'
split_dir = 'example/split_jpg'
model_dir = 'models/cleaner_x.pt'
dictionnary_dir = 'example/pdf_jpg.json'
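
Before running the pipeline, it can help to confirm that the inputs exist. The snippet below is a minimal sketch using only the standard library; `missing_paths` is a hypothetical helper, not part of `pdf_cleaning`. It checks only the input directory and the model file, on the assumption that the other locations are created by the later steps.

```python
from pathlib import Path


def missing_paths(*paths):
    """Return the given paths that do not exist on disk, as strings.

    Hypothetical helper for illustration; not part of pdf_cleaning.
    """
    return [str(p) for p in paths if not Path(p).exists()]


# Assumption: only the input pdf directory and the model weights must pre-exist.
missing = missing_paths('example/input_pdf', 'models/cleaner_x.pt')
if missing:
    print('Missing before running the pipeline:', missing)
```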

## Preparation

We can then use the first three functions (each is documented):
1. `create_pdf2jpg` to create a dictionary with the file structure information necessary to run the next steps.
2. `init_pdf2jpg` to crawl the input directory and identify the pdf files to convert.
3. `transform_pdf_to_jpg` to split the pdf files into images (one per page).

In [2]:
pc.create_pdf2jpg(dictionnary_dir, input_dir, split_dir, verbose=False)
pc.init_pdf2jpg(dictionnary_dir, verbose=False)
pc.transform_pdf_to_jpg(dictionnary_dir)

   > armbrust-cidr21.pdf converted to jpg (pages: 9)
> pdf to jpg dictionnary saved
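
To see what was written to the file structure JSON, it can be loaded with the standard `json` module. This is only a sketch: the internal layout of the file is not described here, so the hypothetical helper `summarize_json` makes no assumption about it and simply reports the top-level keys.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return the sorted top-level keys of a JSON object, or the length of a JSON array.

    Hypothetical helper for illustration; not part of pdf_cleaning.
    """
    with open(path, encoding='utf-8') as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        return sorted(data.keys())
    return len(data)


# Same location as `dictionnary_dir` above; skipped if the file is absent.
json_path = Path('example/pdf_jpg.json')
if json_path.exists():
    print(summarize_json(json_path))
```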


## Prediction

Now that our pdfs have been prepared, we use computer vision to identify the features of interest with `identify_chunks_to_clean`. Once the bounding boxes have been identified and stored in the `predictions` DataFrame, `clean_pdf` uses them to remove the corresponding regions from the initial pdf.

In [4]:
predictions = pc.identify_chunks_to_clean(model_dir, dictionnary_dir, conf=0.25)
pc.clean_pdf(input_dir, output_dir, predictions, thresholds=0.25)

# alternative_thresholds = {
#     "figure": 0.55,
#     "table": 0.60,
#     "title": 0.4,
#     "header": 0.55,
#     "footnote": 0.175,
#     "reference": 0.50
# }

>>> predictions
> armbrust-cidr21.pdf started


   > armbrust-cidr21_0.jpg predicted
   > armbrust-cidr21_1.jpg predicted
   > armbrust-cidr21_2.jpg predicted
   > armbrust-cidr21_3.jpg predicted
   > armbrust-cidr21_4.jpg predicted
   > armbrust-cidr21_5.jpg predicted
   > armbrust-cidr21_6.jpg predicted
   > armbrust-cidr21_7.jpg predicted
> armbrust-cidr21.pdf predicted
{'title': 0.25, 'reference': 0.25, 'footnote': 0.25, 'figure': 0.25, 'header': 0.25}
> Output directory created
>>> cleaning
> armbrust-cidr21.pdf started
   > modifying page 0
   > modifying page 1
   > modifying page 2
   > modifying page 3
   > modifying page 4
   > modifying page 5
   > modifying page 6
   > modifying page 7
> armbrust-cidr21.pdf cleaned
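
To illustrate what a per-class threshold does, here is a minimal sketch on toy data. It is not the library's implementation: the column names `name` and `confidence`, the helper `filter_by_threshold`, and the fallback value for classes without an entry are all assumptions made for the example.

```python
import pandas as pd


def filter_by_threshold(boxes, thresholds, default=0.25):
    """Keep the rows whose confidence reaches the threshold of their class.

    Classes missing from `thresholds` fall back to `default`.
    Hypothetical helper; the column names are assumptions.
    """
    limits = boxes['name'].map(thresholds).fillna(default)
    return boxes[boxes['confidence'] >= limits]


# Toy detections, not real model output.
toy = pd.DataFrame({
    'name': ['figure', 'title', 'footnote', 'header'],
    'confidence': [0.50, 0.45, 0.20, 0.30],
})
alternative_thresholds = {'figure': 0.55, 'title': 0.4, 'footnote': 0.175}

kept = filter_by_threshold(toy, alternative_thresholds)
print(kept['name'].tolist())  # ['title', 'footnote', 'header']
```

With a single threshold of 0.25 the `figure` box would be kept as well; raising the threshold for one class makes the cleaning more conservative for that class only.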


If we want more insight into how the cleaning was done, we can use `visualize_boxes` to draw the bounding boxes on the pdf before cleaning (as shown in the README file).

In [1]:
pc.visualize_boxes(model_dir, dictionnary_dir, output_dir, file='armbrust-cidr21.pdf', conf=0.25)

Finally, we can remove the temporary image splits with `remove_split`. We can remove them for a single pdf or for all of them, and we can also choose to remove the initial pdfs.

In [3]:
pc.remove_split(dictionnary_dir, "all")

> all images removed
