Corpora

French
English
Italian
German
Russian
Multilingual

French

French News Article || Format: json
Description: A collection of news articles written in French.
Insurance Reviews France || Format: csv
Description: User reviews on mutual health insurance in France.

English

Cornell Movie-Dialogs Corpus || Format: txt
Description: This corpus contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts.
Elsevier OA CC-BY Corpus || Format: json
Description: A corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier’s journals, presented as the first cross-discipline research dataset at this scale to support NLP and ML research.
A dataset of English plaintext jokes || Format: json
Description: There are about 208,000 jokes in this database, scraped from three sources.
I make no claim on ownership of these files, nor do I necessarily endorse the jokes in them. This dataset is provided for research purposes (see the License section of the dataset's own documentation).
Tripadvisor Comments || Format: various
Description: Metadata includes: Author, Content, Date, Number of Reader, Number of Helpful Judgment, Overall rating, Value aspect rating, Rooms aspect rating, Location aspect rating, Cleanliness aspect rating, Check in/front desk aspect rating, Service aspect rating and Business Service aspect rating. Ratings range from 0 to 5 stars, and -1 indicates that the aspect rating is missing in the original HTML file. NB: download the JSON version.
SMS Spam Collection Dataset || Format: csv
Description: The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 English SMS messages, each tagged as ham (legitimate) or spam.
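
As a first sanity check on a labelled csv corpus such as the SMS Spam Collection, it can help to count how many messages carry each tag. The sketch below is only an illustration: it reads a tiny inline sample instead of the real file, and the column names `label` and `text` are assumptions, so check the header of the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample shaped like a ham/spam corpus.
# The column names "label" and "text" are assumptions, not the real header.
SAMPLE_CSV = """label,text
ham,See you at lunch?
spam,WINNER! Claim your free prize now
ham,"Ok, call me later"
"""


def count_labels(csv_file, label_column="label"):
    """Count how many rows carry each value of the label column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# io.StringIO stands in for an opened file; with a real download you
# would pass open(path, newline="", encoding=...) instead.
counts = count_labels(io.StringIO(SAMPLE_CSV))
print(counts["ham"], counts["spam"])
```

With a downloaded file, the same function can be called on the opened file object; the encoding and the label column name may need adjusting to match that file.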

Italian

Hottest Italian Tweets || Format: csv
Description: An archive of Twitter posts about open data with at least 10 retweets or 10 favorites. NB: you need to log in to download it (a Google account can be used).
Italian News Articles || Format: json
Description: A collection of news articles written in Italian.

German

German Recipes || Format: json
Description: This dataset contains 12,190 German recipes with metadata crawled from chefkoch.de.
Student reviews & recommendations of german universities || Format: csv
Description: 220k+ reviews for German universities.
German News Article || Format: json
Description: A collection of news articles written in German.

Russian

Russian Twitter Corpus || Format: csv
Description: A corpus of short texts in Russian based on Twitter posts. Suitable for training a language model for social media or short texts, and for training classifiers for sentiment analysis and text toxicity.
NB: download only the positive/negative data.

Multilingual

Europarl
Description: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
German-English_WordAlignment
Description: A set of manually aligned datasets in German, English and Turkish.
Japanese-English Bilingual Corpus
Description: A precise and large-scale corpus containing about 500,000 pairs of manually translated sentences.
English, Chinese and French proverbs || Format: csv

Nolanogenn/unior_nlp_lab
