# Generating the Dataset

To validate Scanipy, we generate a dataset of Markdown files from the LaTeX sources and PDFs of papers found on [arXiv](https://arxiv.org/).

First, install the dependencies:

```powershell
pip install -r requirements_data.txt

```

## 1. Get PDF and TeX files from arXiv

In [None]:
from access_arxiv import choose_topic

# Choose the topic and the number of papers
df = choose_topic('sea', 1000)

In [None]:
from get_tar_pdf import downloading_tar_pdf

downloading_tar_pdf(df)

## 2. Convert the TeX files to HTML

We use [engrafo](https://github.com/arxiv-vanity/engrafo) to convert the main LaTeX file of each arXiv paper into a responsive HTML web page, using [LaTeXML](https://github.com/brucemiller/LaTeXML).

We run engrafo through its Docker image. First, run the following in PowerShell:

```powershell
# Assuming your main .tex files are located in the following directory
# (use an absolute path, since it is mounted as a Docker volume)
$TEX_DIR = "path\to\tex\files"

# Get a list of all main .tex files in the subdirectories
$mainTexFiles = Get-ChildItem -Path $TEX_DIR -Filter "*.tex" -File -Recurse | Where-Object { $_.Name -eq ($_.Directory.Name + ".tex") }

# Specify the output folder
$outputFolder = "html"

# Loop through main .tex files and run the Docker command for each
foreach ($mainTexFile in $mainTexFiles) {
    # Extract the arxiv_id from the directory name
    $arxiv_id = $mainTexFile.Directory.Name

    # Create the output folder for the current arxiv_id if it doesn't exist
    $outputDir = Join-Path -Path $TEX_DIR -ChildPath "$outputFolder\$arxiv_id"
    if (-not (Test-Path -Path $outputDir -PathType Container)) {
        New-Item -Path $outputDir -ItemType Directory -Force
    }
    
    # Run the Docker command to convert .tex to .html, save it in the output folder
    # Call docker directly so that paths containing spaces are passed intact
    docker run --volume "${TEX_DIR}:/workdir" -w /workdir arxivvanity/engrafo:latest engrafo "${arxiv_id}/${arxiv_id}.tex" "${outputFolder}/${arxiv_id}/"
}

```

The HTML files, along with the images, the CSS file and the JS file, are written to your output folder.
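Since the conversion can fail for individual papers, it may help to list which arXiv ids still lack an HTML output. The sketch below mirrors the folder layout used by the PowerShell script (`<id>/<id>.tex` as input, `html/<id>/` as output). The function name `find_missing_html` is hypothetical, and it assumes that a successful conversion leaves at least one `.html` file in the output folder.

```python
from pathlib import Path


def find_missing_html(tex_dir, output_folder="html"):
    """Return arXiv ids whose main .tex file has no HTML output yet."""
    tex_dir = Path(tex_dir)
    missing = []
    for paper_dir in sorted(p for p in tex_dir.iterdir() if p.is_dir()):
        arxiv_id = paper_dir.name
        # Same rule as the PowerShell script: the main file is <id>/<id>.tex
        if not (paper_dir / f"{arxiv_id}.tex").is_file():
            continue
        out_dir = tex_dir / output_folder / arxiv_id
        # Assumption: a successful conversion leaves at least one .html file
        if not out_dir.is_dir() or not any(out_dir.glob("*.html")):
            missing.append(arxiv_id)
    return missing


# Example usage (the path is a placeholder):
# print(find_missing_html(r"path\to\tex\files"))
```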

## 3. Extract figures, captions, tables and section titles from the PDF files

Clone the [pdffigures2](https://github.com/allenai/pdffigures2) repository. It is a Scala-based project that extracts figures, captions, tables and section titles from scholarly documents.

```powershell
git clone https://github.com/allenai/pdffigures2.git

```

Apply the bug fixes from this [commit](https://github.com/allenai/pdffigures2/commit/d7abe4c5210893e9104fe55707ba4b40eaf6a245).

Clone the [almond](https://github.com/almond-sh/almond) repository so that you can use it as a [Scala](https://scala-lang.org/) kernel for [Jupyter](https://jupyter.org/).

```powershell
git clone https://github.com/almond-sh/almond.git
cd almond
./mill -i jupyterFast

```

A Scala kernel should open in Jupyter. Then, in the shell, run:

```powershell
cd pdffigures2
sbt "runMain org.allenai.pdffigures2.FigureExtractorBatchCli path\to\pdffiles -s stat_file.json -m path\to\save\images -d path\to\save\data"

```

The images, figure objects and statistics are written to the output locations given above. A binary JAR file is also saved.
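To get a quick overview of what was extracted, the JSON data files can be summarised with a few lines of Python. This is only a sketch: it assumes that each data file holds a JSON list of objects and that each object has a `figType` field (for example `Figure` or `Table`). Check one of your files before relying on it. The function name `count_extracted_items` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_extracted_items(data_dir):
    """Count extracted items per type across all JSON files in data_dir."""
    counts = Counter()
    for json_file in sorted(Path(data_dir).glob("*.json")):
        with open(json_file, encoding="utf-8") as f:
            items = json.load(f)
        # Assumption: each data file holds a list of figure/table objects
        if not isinstance(items, list):
            continue
        for item in items:
            # Assumption: the item type is stored under the "figType" key
            if isinstance(item, dict):
                counts[item.get("figType", "unknown")] += 1
            else:
                counts["unknown"] += 1
    return counts


# Example usage (the path is a placeholder):
# print(count_extracted_items(r"path\to\save\data"))
```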

## 4. Convert the HTML files to Markdown

To produce the Markdown files with the dataset tools from [Nougat](https://github.com/facebookresearch/nougat), install the necessary dependencies:

```powershell
pip install "nougat-ocr[dataset]"

```

Create an environment variable with the path to the binary jar file generated in the previous step.

In [1]:
import os

# Set the PDFFIGURES_PATH environment variable
os.environ["PDFFIGURES_PATH"] = '../pdffigures2/target/scala-2.12/pdffigures2_2.12-0.1.0.jar'

Note that the HTML, PDF and JSON files corresponding to the same arXiv id must have the same base name.
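A naming mismatch is easy to miss, so it can be worth checking the three folders before running the split. The sketch below compares file names without their extensions and reports, for each arXiv id, which kinds of file are missing. It assumes a flat layout with one `.html`, `.pdf` and `.json` file per paper directly inside each folder, which may not match your setup. The function name `unmatched_ids` is hypothetical.

```python
from pathlib import Path


def unmatched_ids(html_dir, pdf_dir, json_dir):
    """Map each arXiv id to the file kinds it is missing, if any."""
    # Assumption: the files sit directly in their folders (flat layout)
    stems = {
        "html": {p.stem for p in Path(html_dir).glob("*.html")},
        "pdf": {p.stem for p in Path(pdf_dir).glob("*.pdf")},
        "json": {p.stem for p in Path(json_dir).glob("*.json")},
    }
    all_ids = set().union(*stems.values())
    report = {}
    for arxiv_id in sorted(all_ids):
        missing = [kind for kind, names in stems.items() if arxiv_id not in names]
        if missing:
            report[arxiv_id] = missing
    return report


# Example usage (the paths are placeholders):
# print(unmatched_ids("path/to/html", "path/to/pdf", "path/to/json"))
```

An empty dictionary means every id found has all three files.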

In [None]:
!python -m nougat.dataset.split_htmls_to_pages --html "path\to\html\files" --pdfs "path\to\pdf\files" --out "path\to\output" --figure "path\to\json\data\files"

In [None]:
!python -m nougat.dataset.create_index --dir "path\to\output" --out index.jsonl

The final output is a JSON Lines file (`index.jsonl`) containing the image paths, Markdown text and meta information.
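To inspect the result, the index can be read back record by record. This is a minimal sketch that assumes the file holds one JSON object per line, as the `.jsonl` extension suggests. It makes no assumption about the key names, and the function name `read_index` is hypothetical.

```python
import json


def read_index(index_path, limit=3):
    """Return the first `limit` records of a JSON Lines index file."""
    records = []
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= limit:
                break
    return records


# Example usage: show which keys each of the first records contains
# for record in read_index("index.jsonl"):
#     print(sorted(record))
```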