Nolanogenn/unior_nlp_lab

Corpora

  1. French
  2. English
  3. Italian
  4. German
  5. Russian
  6. Multilingual

French

  1. French News Article || Format: json
    Description: A collection of news articles written in French.

  2. Insurance Reviews France || Format: csv
    Description: User reviews of mutual health insurance in France.

English

  1. Cornell Movie-Dialogs Corpus || Format: txt
    Description: A large, metadata-rich collection of fictional conversations extracted from raw movie scripts.

  2. Elsevier OA CC-BY Corpus || Format: json
    Description: A corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier’s journals, representing the first cross-discipline research data at this scale to support NLP and ML research.

  3. A dataset of English plaintext jokes || Format: json
    Description: About 208,000 jokes scraped from three sources.
    From the dataset's author: "I make no claim on ownership of these files, nor do I necessarily endorse the jokes in them. This dataset is provided for research purposes" (see the dataset's own License section).

  4. Tripadvisor Comments || Format: various
    Description: Metadata includes: Author, Content, Date, Number of Reader, Number of Helpful Judgment, Overall rating, Value aspect rating, Rooms aspect rating, Location aspect rating, Cleanliness aspect rating, Check in/front desk aspect rating, Service aspect rating and Business Service aspect rating. Ratings range from 0 to 5 stars, and -1 indicates that the aspect rating is missing in the original HTML file. NB: download the JSON version.

  5. SMS Spam Collection Dataset || Format: csv
    Description: The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 English SMS messages, each tagged as ham (legitimate) or spam.
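
The Tripadvisor Comments entry above uses -1 to mark a missing aspect rating, so those values should be dropped before averaging. The following is a minimal sketch of that filtering, not the dataset's official loader: the record layout, the field names (`Overall`, `Cleanliness`, `Service`) and the helper `mean_rating` are hypothetical and may not match the downloaded JSON.

```python
from statistics import mean

# Sentinel for a missing aspect rating, as described in the Tripadvisor entry.
MISSING = -1


def mean_rating(reviews, aspect):
    """Average one aspect rating, skipping records where it is -1 or absent."""
    values = [r[aspect] for r in reviews if r.get(aspect, MISSING) != MISSING]
    return mean(values) if values else None


# Hypothetical records for illustration; the real field names may differ.
reviews = [
    {"Overall": 5, "Cleanliness": 4},
    {"Overall": 3, "Cleanliness": -1},  # cleanliness rating is missing
    {"Overall": 4},                     # cleanliness key is absent
]

print(mean_rating(reviews, "Overall"))
print(mean_rating(reviews, "Cleanliness"))
print(mean_rating(reviews, "Service"))
```

Returning `None` when no valid rating exists keeps "no data" distinct from a genuine 0-star rating.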


Italian

  1. Hottest Italian Tweets || Format: csv
    Description: An archive of Twitter posts about open data with at least 10 retweets or 10 favorites. NB: you must log in to download it (a Google account can be used).

  2. Italian News Articles || Format: json
    Description: A collection of news articles written in Italian.


German

  1. German Recipes || Format: json
    Description: This dataset contains 12,190 German recipes with metadata, crawled from chefkoch.de.

  2. Student reviews & recommendations of german universities || Format: csv
    Description: 220k+ reviews of German universities.

  3. German News Article || Format: json
    Description: A collection of news articles written in German.


Russian

  1. Russian Twitter Corpus || Format: csv
    Description: A corpus of short Russian texts based on Twitter posts. Suitable for training a language model for social media or short texts, and for training classifiers for sentiment analysis and text toxicity.
    NB: download only the positive/negative data.

Multilingual

  1. Europarl
    Description: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

  2. German-English_WordAlignment
    Description: A set of manually aligned datasets in German, English and Turkish.

  3. Japanese-English Bilingual Corpus
    Description: A precise, large-scale corpus containing about 500,000 pairs of manually translated sentences.

  4. English, Chinese and French proverbs || Format: csv

About

Italian-language course on NLP with Python
