add new gt metadata yml files #143

Open: wants to merge 9 commits into base: master

@tboenig tboenig commented Mar 20, 2024

No description provided.

@tboenig
Contributor Author

tboenig commented Apr 27, 2024

Hello, could you please merge this PR? Thank you.

@bertsky bertsky left a comment

IMO it would be better if the titles were prefixed with OCR-D, so they line up in the catalog and are immediately recognizable.

If at all possible, it would probably be better from HTR United's side if all the gt_structure entries (except gt_structure_text) were combined into one dataset here.
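
A minimal sketch of how the suggested title prefix could be applied consistently, for example in a script that generates the metadata files. The helper name `prefix_title` is hypothetical and is not part of any OCR-D or HTR-United tooling; the function only assumes that titles are plain strings.

```python
def prefix_title(title: str, prefix: str = "OCR-D") -> str:
    """Return the title with the prefix prepended, unless it is already there."""
    # Keep the operation idempotent so that re-running a generator
    # does not produce "OCR-D OCR-D gt_structure_1_1".
    if title.startswith(prefix + " "):
        return title
    return f"{prefix} {title}"
```

Applied to `gt_structure_1_1`, this yields the title proposed in the suggestions that follow.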

@@ -0,0 +1,50 @@
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_1_1

Suggested change
- title: gt_structure_1_1
+ title: OCR-D gt_structure_1_1

metric: regions
citation-file-link: https://github.com/OCR-D/gt_structure_1_1/blob/main/CITATION.cff
transcription-guidelines: >-
OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html

this was a carriage return instead of a newline.

Suggested change
- OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html
+ OCR-D-GT-Guideline, Part: Structure Ground Truth
+ https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html
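
As an illustration of the carriage-return issue noted in this comment, the following is a rough sketch of how stray carriage returns could be detected and normalized before a file is committed. The function names are hypothetical, and the sketch works on plain text only; it does not parse YAML.

```python
import re


def find_lone_carriage_returns(text: str) -> list[int]:
    """Return the offsets of carriage returns that are not part of a CRLF pair."""
    return [match.start() for match in re.finditer(r"\r(?!\n)", text)]


def normalize_line_breaks(text: str) -> str:
    """Convert CRLF pairs and lone carriage returns into plain newlines."""
    # Handle CRLF first so each pair becomes a single newline,
    # then convert any remaining lone carriage returns.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

A check like `find_lone_carriage_returns` could run before the metadata file is written, with `normalize_line_breaks` as the repair step.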

@@ -0,0 +1,51 @@
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_1_2

Suggested change
- title: gt_structure_1_2
+ title: OCR-D gt_structure_1_2

@@ -0,0 +1,50 @@
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_1_3

Suggested change
- title: gt_structure_1_3
+ title: OCR-D gt_structure_1_3

metric: regions
citation-file-link: https://github.com/OCR-D/gt_structure_1_3/blob/main/CITATION.cff
transcription-guidelines: >-
OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html

same here.

Suggested change
- OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html
+ OCR-D-GT-Guideline, Part: Structure Ground Truth
+ https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html

metric: regions
citation-file-link: https://github.com/OCR-D/gt_structure_5_1/blob/main/CITATION.cff
transcription-guidelines: >-
OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html

Suggested change
- OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html
+ OCR-D-GT-Guideline, Part: Structure Ground Truth
+ https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html

@@ -0,0 +1,50 @@
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_5_2

Suggested change
- title: gt_structure_5_2
+ title: OCR-D gt_structure_5_2

metric: regions
citation-file-link: https://github.com/OCR-D/gt_structure_5_2/blob/main/CITATION.cff
transcription-guidelines: >-
OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html

Suggested change
- OCR-D-GT-Guideline, Part: Structure Ground Truth https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html
+ OCR-D-GT-Guideline, Part: Structure Ground Truth
+ https://ocr-d.de/en/gt-guidelines/trans/structur_gt.html

@@ -0,0 +1,51 @@
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_5_3

Suggested change
- title: gt_structure_5_3
+ title: OCR-D gt_structure_5_3

@@ -0,0 +1,54 @@
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_text

Suggested change
- title: gt_structure_text
+ title: OCR-D gt_structure_text

@PonteIneptique
Member

Dear @tboenig and @bertsky,
I am very happy that the collaboration is going forward, but I must say I am a bit lost about the extent of the dataset, what seems like duplication of corpora (??). Could you provide a little more insight?

Also: be careful with the Goth script code. It is for the alphabet of the Gothic language (and not runes, as I said: https://en.wikipedia.org/wiki/Gothic_alphabet). I think you mean Latf.
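
To make the distinction between the script codes concrete, here is a small sketch with a hand-picked excerpt of ISO 15924 codes. The table is deliberately incomplete and the helper `describe_script` is hypothetical; a real check would consult the full registry.

```python
# Excerpt of ISO 15924 four-letter script codes (not the full registry).
SCRIPT_NAMES = {
    "Goth": "Gothic",
    "Latf": "Latin (Fraktur variant)",
    "Latn": "Latin",
    "Runr": "Runic",
}


def describe_script(code: str) -> str:
    """Look up a script code in the excerpt above."""
    return SCRIPT_NAMES.get(code, "unknown (not in this excerpt)")
```

Looking up `Goth` and `Latf` side by side shows why the former is the wrong code for material printed in Fraktur.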

@bertsky bertsky left a comment

typo?

@@ -0,0 +1,51 @@
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_5_3
url: https://github.com/OCR-D/tboenig/gt_structure_5_3

Suggested change
- url: https://github.com/OCR-D/tboenig/gt_structure_5_3
+ url: https://github.com/OCR-D/gt_structure_5_3

@bertsky

bertsky commented May 21, 2024

@PonteIneptique

but I must say I am a bit lost about the extent of the dataset, what seems like duplication of corpora (??)

What do you mean by duplication? The various entries are subcorpora of https://github.com/OCR-D/gt_structure_all, which is basically a subcorpus of deutschestextarchiv.de split into smaller chunks.

I proposed aggregating them into a single dataset here. But since the metadata.yml files are generated via CI on our side (for each repo independently), that might be difficult to achieve...

@tboenig
Contributor Author

tboenig commented May 21, 2024

Hello @PonteIneptique,

Thank you for the rigorous check of the data records. I have changed Goth to Latf.

@bertsky,

https://github.com/OCR-D/gt_structure_all is a metarepo that links all datasets.

Perhaps a future version of HTR-United could consider how such metarepos are represented in the catalog.

I suggest publishing the datasets in the catalog first. Improvements can always be made in a second or later step.

Of course, the metadata/data must be correct. Thank you again for the check.

All the best,
tboenig

@PonteIneptique
Member

Dear both,
Given that nothing differentiates each repository except its name (same authors, same language, same scripts, etc.), and given that their names are non-semantic, I would probably refuse such a "massive" push (19 files) for usability reasons. The meta-repo, however, is completely welcome.

@bertsky

bertsky commented May 22, 2024

@PonteIneptique – understood, @tboenig is already working on a solution.

@PonteIneptique
Member

Thank you for your understanding :)
