Europarl #25

Closed
StellaAthena opened this issue Sep 6, 2020 · 5 comments
Labels
dataset A dataset that has been approved to go in the Pile.

Comments

@StellaAthena
Member

StellaAthena commented Sep 6, 2020

Transcripts from EU Parliament meetings from 1996 to 2011. Contains approximately 4.5 GB of text.

Languages: French, Italian, Spanish, Portuguese, Romanian, English, Dutch, German, Danish, Swedish, Bulgarian, Czech, Polish, Slovak, Slovene, Finnish, Hungarian, Estonian, Latvian, Lithuanian, and Greek.

Link: www.statmt.org/europarl/

@StellaAthena
Member Author

Temporarily closing while we finish version 1.

@StellaAthena StellaAthena added the dataset A dataset that has been approved to go in the Pile. label Sep 16, 2020
@StellaAthena StellaAthena reopened this Sep 17, 2020
@thoppe
Contributor

thoppe commented Sep 21, 2020

I could pull this, clean it up, and look at how it's organized if we are still interested. The parallel texts in many languages are interesting too. For v1, though, do we still want to keep all languages in?

@StellaAthena
Member Author

StellaAthena commented Sep 21, 2020 via email

@thoppe
Contributor

thoppe commented Sep 22, 2020

Starting the processing on this. For reference, the data file is 1.5GB but it takes over 14 hours to download from the main site.

@thoppe
Contributor

thoppe commented Sep 22, 2020

This is complete. The processing code is here

https://github.com/thoppe/The-Pile-EuroParl

with the temporary download link here https://drive.google.com/file/d/15kQ6jAGHsI3ZrA0ibXGuTmzGdib9NA63/view?usp=sharing

  ✔ Saved to EuroParliamentProceedings_1996_2011.jsonl
  ℹ Saved 187,072 articles
  ℹ Uncompressed filesize   4,941,430,389
  ℹ Compressed filesize     1,475,803,930
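
For anyone who wants to sanity-check the download against the figures above, here is a minimal sketch. It assumes only that the file is standard JSON Lines (one JSON object per line); the helper names `summarize_jsonl` and `compression_ratio` are made up for illustration and are not part of the processing repo. It makes no assumption about which fields each record contains.

```python
import json
import os

# Figures copied from the log output above.
REPORTED_ARTICLES = 187_072
REPORTED_UNCOMPRESSED = 4_941_430_389  # bytes
REPORTED_COMPRESSED = 1_475_803_930    # bytes


def summarize_jsonl(path):
    """Return (record_count, size_in_bytes) for a JSON Lines file.

    Hypothetical helper: assumes one JSON object per line and skips blank lines.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            json.loads(line)  # raises ValueError if the line is not valid JSON
            count += 1
    return count, os.path.getsize(path)


def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Uncompressed size divided by compressed size."""
    return uncompressed_bytes / compressed_bytes


if __name__ == "__main__":
    ratio = compression_ratio(REPORTED_UNCOMPRESSED, REPORTED_COMPRESSED)
    mean_size = REPORTED_UNCOMPRESSED / REPORTED_ARTICLES
    print(f"compression ratio: {ratio:.2f}x")
    print(f"mean bytes per article: {mean_size:,.0f}")
```

Calling `summarize_jsonl("EuroParliamentProceedings_1996_2011.jsonl")` on the decompressed file should return a record count and byte size that can be compared with the reported 187,072 articles and 4,941,430,389 bytes.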

Once incorporated, this issue can be closed and moved to the completed section.
