-
French News Article || Format: json
Description: A collection of news articles written in French.
-
Insurance Reviews France || Format: csv
Description: User reviews of mutual health insurance in France.
-
Cornell Movie-Dialogs Corpus || Format: txt
Description: This corpus contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts.
-
Elsevier OA CC-BY Corpus || Format: json
Description: A corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier's journals, representing the first cross-discipline research data at this scale to support NLP and ML research.
-
A dataset of English plaintext jokes || Format: json
Description: There are about 208,000 jokes in this database, scraped from three sources.
I make no claim on ownership of these files, nor do I necessarily endorse the jokes in them. This dataset is provided for research purposes (see License section below).
-
Tripadvisor Comments || Format: various
Description: Metadata includes: Author, Content, Date, Number of Reader, Number of Helpful Judgment, Overall rating, Value aspect rating, Rooms aspect rating, Location aspect rating, Cleanliness aspect rating, Check in/front desk aspect rating, Service aspect rating and Business Service aspect rating. Ratings range from 0 to 5 stars, and -1 indicates that the aspect rating is missing in the original HTML file.
NB: download the JSON version.
-
SMS Spam Collection Dataset || Format: csv
Description: The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 English SMS messages, each tagged as ham (legitimate) or spam.
-
Hottest Italian Tweets || Format: csv
Description: An archive of Twitter posts about open data with at least 10 retweets or 10 favorites.
NB: you must log in to download it (a Google account can be used).
-
Italian News Articles || Format: json
Description: A collection of news articles written in Italian.
-
German Recipes || Format: json
Description: This dataset contains 12,190 German recipes with metadata crawled from chefkoch.de.
-
Student reviews & recommendations of german universities || Format: csv
Description: 220k+ reviews of German universities.
-
German News Article || Format: json
Description: A collection of news articles written in German.
-
Russian Twitter Corpus || Format: csv
Description: A corpus of short Russian texts based on Twitter posts. Suitable for training a language model for social media or short texts, and for training classifiers for sentiment analysis and text toxicity.
NB: download only the positive/negative sets.
-
Europarl
Description: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
-
German-English_WordAlignment
Description: A set of manually aligned datasets in German, English and Turkish.
-
Japanese-English Bilingual Corpus
Description: A precise, large-scale corpus containing about 500,000 pairs of manually translated sentences.
-
English, Chinese and French proverbs || Format: csv