# Quickstart for meta-extractor
This notebook demonstrates how to install and use the `meta-extractor` package.
It also contains examples of how to train a gemma3 model using your own data.
This notebook uses the provided example data located in the `example-data/` folder.

We suggest you find your own data to try out the training process; the example data is provided for convenience.
For more information, see the README.md file.

## 1. Install the package (editable mode for development)


In [1]:
!pip install -e ../.

Looking in indexes: https://devpi.dbccloud.dk/dbc/packages
Obtaining file:///home/kwc/github/meta-extractor
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: meta-extractor
  Building editable for meta-extractor (pyproject.toml) ... done
  Created wheel for meta-extractor: filename=meta_extractor-1.0-0.editable-py3-none-any.whl size=15083 sha256=5759e9240d81810eb1c85b7e904048cd94327feeef842dc2a152ccee30540169
  Stored in directory: /tmp/pip-ephem-wheel-cache-gg6xbbo1/wheels/ef/6d/e4/cd89217db362638842078720f675d5d51fc9ac5e80fcbb4e84
Successfully built meta-extractor
Installing collected packages: meta-extractor
  Attempting uninstall: meta-extractor
    Found existing installation: meta-extractor 1.0
    Uninstalling meta-ex

## 2. Check the CLI entry point works

In [2]:
!pdf2text --help

usage: pdf2text [-h] -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s]

options:
  -h, --help            show this help message and exit
  -i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                        path to directory where pdfs are stored.
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        path to directory where output text files will be
                        stored.
  -s, --short           only look at the first 3 pages instead of the first 13
                        pages of the pdf


## 3. Extract text from example PDF
If you have your own PDFs, you can use the `pdf2text` command to extract text from them.
Here we use the example PDFs provided in the `example-data/pdfs` folder.


In [23]:
!pdf2text -i ../src/meta_extractor/training_gemma3/example-data/pdfs -o output/

Processing status for every 100th file:
Processing file 1: example-11.pdf
Number of PDF files with errors: 0
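
If you would rather script the extraction from Python than call it with `!`, a minimal sketch could look like the following. It only uses the `-i`, `-o` and `-s` flags listed in the `pdf2text --help` output above; the helper names `build_pdf2text_command` and `run_pdf2text` are hypothetical and not part of the package.

```python
import shutil
import subprocess


def build_pdf2text_command(input_dir, output_dir, short=False):
    """Build the pdf2text argument list (hypothetical helper, not part of the package)."""
    command = ["pdf2text", "-i", str(input_dir), "-o", str(output_dir)]
    if short:
        # -s: only look at the first 3 pages of each PDF (see the --help output)
        command.append("-s")
    return command


def run_pdf2text(input_dir, output_dir, short=False):
    """Run pdf2text; raise if the CLI is missing or exits with a non-zero status."""
    if shutil.which("pdf2text") is None:
        raise FileNotFoundError("pdf2text not found on PATH; install the package first")
    subprocess.run(build_pdf2text_command(input_dir, output_dir, short), check=True)
```

Calling `run_pdf2text("../src/meta_extractor/training_gemma3/example-data/pdfs", "output/")` should then be equivalent to the cell above, assuming the package is installed in the active environment.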


### Check text extraction output

In [None]:
!ls output
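
Listing the folder only confirms that files exist. To look at what was actually extracted, a small sketch like the one below can help. It assumes the extracted files are UTF-8 text with a `.txt` extension, which matches the `example-11.txt` file this notebook produces; `preview_text_files` is a hypothetical helper name.

```python
from pathlib import Path


def preview_text_files(output_dir, max_chars=200):
    """Return {filename: first max_chars characters} for each .txt file in output_dir."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return {}
    previews = {}
    for path in sorted(directory.glob("*.txt")):
        # errors="replace" keeps the preview going if a file has stray bytes
        text = path.read_text(encoding="utf-8", errors="replace")
        previews[path.name] = text[:max_chars]
    return previews


if __name__ == "__main__":
    for name, snippet in preview_text_files("output").items():
        print(f"--- {name} ---")
        print(snippet)
```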

## 4. Train a gemma3 model using the example data
When we trained the model, we had 10,000 pairs of PDFs and corresponding metadata. We extracted the text from the PDFs using the `pdf2text` command above.
We then prepared the training data in the format required by gemma3.
The following will show you how to do this using the example data provided.

### 4.1 Prepare training data
The input conversations needed for fine-tuning gemma3 must be in JSONL format. You can combine the extracted text and the metadata to create the training and test conversations like this:

In [27]:
!build_prompt -t ../src/meta_extractor/training_gemma3/example-data/texts/train/ -m ../src/meta_extractor/training_gemma3/example-data/metadata/train/ -p ../src/meta_extractor/data/prompt_production.json -o output/conversations_train.jsonl

In [29]:
!build_prompt -t ../src/meta_extractor/training_gemma3/example-data/texts/test/ -m ../src/meta_extractor/training_gemma3/example-data/metadata/test/ -p ../src/meta_extractor/data/prompt_production.json -o output/conversations_test.jsonl

In [33]:
!ls output

conversations_test.jsonl  conversations_train.jsonl  example-11.txt


You should now have two files in the output folder: `conversations_train.jsonl` and `conversations_test.jsonl`.
These files contain the conversations used to fine-tune the gemma3 model and to evaluate it afterwards.
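
Before starting a training run, it can be worth checking that both files are well-formed JSONL. The sketch below only verifies that each non-empty line parses as JSON and counts the records; it makes no assumption about the fields inside each conversation, since the schema is not described here. `count_jsonl_records` is a hypothetical helper name.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count the records in a JSONL file, raising ValueError on a malformed line."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from exc
            count += 1
    return count


if __name__ == "__main__":
    for name in ("conversations_train.jsonl", "conversations_test.jsonl"):
        file_path = Path("output") / name
        if file_path.exists():
            print(name, count_jsonl_records(file_path))
```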