Include based_on_id in the weekly exports #2325

Closed
trang opened this issue May 15, 2020 · 6 comments · Fixed by #2397

Member

trang commented May 15, 2020

Our users sometimes ask for statistics about original sentences, but it's not easy for someone without database access to help answer these questions, because the based_on_id field is not something we export in our CSV files.

A recent example is mramosch on the Wall:

Is there an easy way to find out how many of the 503.559 german sentences are originals and how many are translations?

And after I provided the number of original German sentences:

Actually I would really like to see the sources for the translations of the remaining 420.000 sentences.

Contributor

agrodet commented May 16, 2020

Just to clarify the scope of this issue: are we talking only about extracting in the sense of exporting a file, or also about adding a way to search for them on Tatoeba?

Member Author

trang commented May 16, 2020

I'm personally satisfied if the based_on_id information is included in the weekly exports. If someone wants to implement something in Tatoeba, they are free to suggest it. I don't think it is strictly necessary as a Tatoeba feature, but if someone can design a good feature, then I'm fine with it being implemented.

trang added the enhancement label on May 17, 2020
Member

jiru commented May 17, 2020

As a user of Tatoeba, I can see how useful it would be to have this as a search filter. Original sentences typically feature broader vocabulary and more traits that are specific to the language. Let’s keep the scope of this present issue to the weekly exports, and use #2159 for the equivalent search filter.

jiru changed the title from "It's not easy for users to extract stats about original sentences" to "Include based_on_id in the weelky exports" on May 17, 2020
AndiPersti changed the title from "Include based_on_id in the weelky exports" to "Include based_on_id in the weekly exports" on May 28, 2020
Contributor

ftumsh commented Jun 11, 2020

Would I be correct in thinking that 'weekly reports' means the sql dump in weekly_exports.sql?
Also, should I only add the column to queries that use the sentences table or should I investigate joining the sentences table to other queries so as to retrieve the based_on_id?

Member Author

trang commented Jun 13, 2020

@ftumsh

Would I be correct in thinking that 'weekly reports' means the sql dump in weekly_exports.sql?

Yes, that's correct.

should I only add the column to queries that use the sentences table

If we add a column to an existing file, then sentence_details.csv would be the best candidate. I don't think it would make much sense to add it to sentences.csv (because this file is supposed to contain only the bare minimum data on sentences) or sentences_CC0.csv (because this file is about CC0).

However, adding a column to an existing CSV could be disruptive for those who have built an automated process around the modified file. It could also mean additional, unnecessary data for them to download.

The safest option would be to create a new file. I'm not sure how to name it though. But the file could just contain the columns id and based_on_id of the sentences table.
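As a rough illustration of the two-column file suggested above, here is a minimal Python sketch. It is not the project's export code: it uses an in-memory SQLite table as a stand-in for the real sentences table, and the function name `export_based_on`, the tab delimiter, and the `\N` marker for NULL are all assumptions made for the example.

```python
import io
import sqlite3


def export_based_on(conn, out):
    """Write one line per sentence: id <TAB> based_on_id."""
    rows = conn.execute("SELECT id, based_on_id FROM sentences ORDER BY id")
    for sentence_id, based_on_id in rows:
        # Assumption: NULL values are written as \N, the way MySQL dumps do.
        value = r"\N" if based_on_id is None else str(based_on_id)
        out.write(f"{sentence_id}\t{value}\n")


# Hypothetical stand-in for the sentences table (the real schema has more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT, based_on_id INTEGER)"
)
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?)",
    [(1, "Hallo.", 0), (2, "Hello.", 1), (3, "Salut.", None)],
)

buf = io.StringIO()
export_based_on(conn, buf)
exported = buf.getvalue()
```

Keeping this in a separate file means existing consumers of sentences.csv and sentence_details.csv see no change in column layout.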

Contributor

ftumsh commented Jun 15, 2020

I have created a new file, sentences_based_on_id.csv, along with the bzip of it.
The file just contains the columns id and based_on_id of the sentences table.
I've pushed it to my fork.
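For someone without database access, a file like this could be used to answer the original question about how many sentences are originals. The sketch below is hedged: it assumes the file is tab-separated, and it assumes that a based_on_id of 0 marks an original sentence, a positive value is the id of the source sentence, and `\N` means unknown. None of these conventions are stated in this thread, and `count_by_origin` is a hypothetical helper name.

```python
import io


def count_by_origin(lines):
    """Count sentences by origin from 'id<TAB>based_on_id' lines."""
    counts = {"original": 0, "translation": 0, "unknown": 0}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _sentence_id, based_on_id = line.split("\t")
        if based_on_id == r"\N":
            counts["unknown"] += 1      # assumption: \N means unknown
        elif based_on_id == "0":
            counts["original"] += 1     # assumption: 0 means original
        else:
            counts["translation"] += 1  # assumption: id of the source sentence
    return counts


# Made-up sample data standing in for sentences_based_on_id.csv.
sample = io.StringIO("1\t0\n2\t1\n3\t\\N\n4\t1\n")
counts = count_by_origin(sample)
```

To restrict the count to one language, the ids would first need to be matched against a file that carries the language of each sentence, such as sentences.csv.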

trang linked a pull request on Jun 30, 2020 that will close this issue
trang added this to In progress in Kodoeba #1 on Jun 30, 2020
jiru closed this as completed in #2397 on Jul 1, 2020
Kodoeba #1 automation moved this from In progress to Done Jul 1, 2020
4 participants