Awesome Deep Learning for Natural Language Processing (NLP)

Contents

Courses

  1. CS224d: Deep Learning for Natural Language Processing from Stanford
  2. Neural Networks for NLP from Carnegie Mellon University
  3. Deep Learning for Natural Language Processing from University of Oxford and DeepMind

Books

  1. Neural Network Methods in Natural Language Processing by Yoav Goldberg and Graeme Hirst
  2. Deep Learning in Natural Language Processing by Li Deng and Yang Liu
  3. Natural Language Processing in Action by Hobson Lane, Cole Howard, and Hannes Hapke

Tutorials

  1. Deep Learning for Natural Language Processing (without Magic)
  2. A Primer on Neural Network Models for Natural Language Processing
  3. Deep Learning for Natural Language Processing: Theory and Practice (Tutorial)
  4. TensorFlow Tutorials
  5. Practical Neural Networks for NLP from EMNLP 2016 using DyNet framework
  6. Recurrent Neural Networks with Word Embeddings
  7. LSTM Networks for Sentiment Analysis
  8. TensorFlow demo using the Large Movie Review Dataset
  9. LSTMVis: Visual Analysis for Recurrent Neural Networks

Talks

  1. Ali Ghodsi's lecture on word2vec part 1 and part 2
  2. Richard Socher's talk on sentiment analysis, question answering, and sentence-image embeddings
  3. Deep Learning, an interactive introduction for NLP-ers
  4. Deep Natural Language Understanding
  5. Deep Learning Summer School, Montreal 2016 - Includes state-of-the-art language modeling.

Frameworks

  1. Keras - The Python Deep Learning library. Emphasizes user friendliness, modularity, easy extensibility, and Pythonic design.
  2. TensorFlow - A cross-platform, general purpose Machine Intelligence library with Python and C++ API.
  3. Gensim: Topic modeling for humans - A Python package that includes word2vec and doc2vec implementations.
  4. DyNet - The Dynamic Neural Network Toolkit. Designed to "work well with networks that have dynamic structures that change for every training instance".
  5. Google’s original word2vec implementation
  6. Deeplearning4j’s NLP framework - Java implementation.
  7. deepnl - A Python library for NLP based on Deep Learning neural network architecture.
  8. PyTorch - A deep learning framework that puts Python first. "Tensors and Dynamic neural networks in Python with strong GPU acceleration."

Papers

  1. Deep or shallow, NLP is breaking out - General overview of how Deep Learning is impacting NLP.
  2. Natural Language Processing from Research at Google - Not all Deep Learning (but mostly).
  3. Distributed Representations of Words and Phrases and their Compositionality - The original word2vec paper.
  4. word2vec Parameter Learning Explained
  5. Distributed Representations of Sentences and Documents
  6. Context Dependent Recurrent Neural Network Language Model
  7. Translation Modeling with Bidirectional Recurrent Neural Networks
  8. Contextual LSTM (CLSTM) models for Large scale NLP tasks
  9. LSTM Neural Networks for Language Modeling
  10. Exploring the Limits of Language Modeling
  11. Conversational Contextual Cues - Models context and participants in conversations.
  12. Sequence to sequence learning with neural networks
  13. Efficient Estimation of Word Representations in Vector Space
  14. Learning Character-level Representations for Part-of-Speech Tagging
  15. Representation Learning for Text-level Discourse Parsing
  16. Fast and Robust Neural Network Joint Models for Statistical Machine Translation
  17. Parsing With Compositional Vector Grammars
  18. Smart Reply: Automated Response Suggestion for Email
  19. Neural Architectures for Named Entity Recognition - State-of-the-art performance in NER with bidirectional LSTMs combined with a sequential conditional random field layer, and with transition-based parsing with stack LSTMs.
  20. GloVe: Global Vectors for Word Representation - A "count-based"/co-occurrence model to learn word embeddings.
  21. Grammar as a Foreign Language - State-of-the-art syntactic constituency parsing using generic sequence-to-sequence approach.
  22. Skip-Thought Vectors - "unsupervised learning of a generic, distributed sentence encoder"

Blog Posts

  1. the morning paper: The amazing power of word vectors - Overview of word vectors.
  2. Word embeddings in 2017: Trends and future directions
  3. Deep Learning, NLP, and Representations
  4. The Unreasonable Effectiveness of Recurrent Neural Networks
  5. Neural Language Modeling From Scratch
  6. Machine Learning for Emoji Trends
  7. Teaching Robots to Feel: Emoji & Deep Learning
  8. Computational Linguistics and Deep Learning - Opinion piece on how Deep Learning fits into the broader picture of text processing.
  9. Deep Learning NLP Best Practices

Researchers

  1. Christopher Manning
  2. Ali Ghodsi
  3. Richard Socher
  4. Yoshua Bengio

Datasets

  1. Dataset from "One Billion Word Language Modeling Benchmark" - Almost 1B words, already pre-processed text.
  2. Stanford Sentiment Treebank - Fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences.
  3. Quora Question Pairs Dataset - Identify question pairs that have the same intent.

Specific Areas

Deep Learning for NLP

Stanford Natural Language Processing
Intro NLP course with videos. It covers no deep learning, but it is a good primer for traditional NLP, covering topics such as sentence segmentation, word tokenization, word normalization, n-grams, named entity recognition, and part-of-speech tagging. Currently not available.

Stanford CS 224D: Deep Learning for NLP class
Richard Socher. (2016) Class with syllabus, and slides.
Videos: 2015 lectures / 2016 lectures

A Primer on Neural Network Models for Natural Language Processing
Yoav Goldberg. Submitted 9/2015, published 11/2016. A 75-page summary of the state of the art.

Oxford Deep Learning for NLP class
Phil Blunsom. (2017) Class by the DeepMind NLP Group.
Lecture slides, videos, and practicals: Github Repository

Word Vectors

Resources about word vectors, aka word embeddings, and distributed representations for words.
Word vectors are numeric representations of words where similar words have similar vectors. Word vectors are often used as input to deep learning systems. This process is sometimes called pretraining.
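
A minimal sketch of the idea using Gensim's KeyedVectors (assuming some pretrained word2vec-format vector file; the filename below is only a placeholder):

```python
# Load pretrained vectors and check that similar words get similar vectors.
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format pretrained vectors will do.
vectors = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

print(vectors["king"].shape)                 # a dense numeric vector, e.g. (300,)
print(vectors.similarity("king", "queen"))   # expected to be relatively high
print(vectors.similarity("king", "banana"))  # expected to be much lower
```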

A neural probabilistic language model.
Bengio 2003. Seminal paper on word vectors.


Efficient Estimation of Word Representations in Vector Space
Mikolov et al. 2013. Word2Vec generates word vectors in an unsupervised way by attempting to predict words from a corpus. Describes Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models for learning word vectors.
Skip-gram takes the center word and predicts the outside words; it works better for large datasets.
CBOW takes the outside words and predicts the center word; it works better for smaller datasets.
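
A minimal training sketch with Gensim's Word2Vec implementation (Gensim 4 API; the toy corpus and parameters are illustrative only). The sg flag switches between the two models described above:

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects Skip-gram (center word predicts context words);
# sg=0, the default, selects CBOW (context words predict the center word).
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"][:5])  # first few dimensions of the learned vector
```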

Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al. 2013. Learns vectors for phrases such as "New York Times." Includes optimizations for skip-gram: hierarchical softmax, negative sampling, and subsampling of frequent words (i.e. frequent words like "the" are skipped periodically to speed up training and improve vectors for less frequently used words).

Linguistic Regularities in Continuous Space Word Representations
Mikolov et al. 2013. Performs well on word similarity and analogy tasks. Expands on the famous example: King – Man + Woman = Queen
Word2Vec source code
Word2Vec tutorial in TensorFlow
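
A minimal sketch of the analogy query with Gensim (again assuming pretrained word2vec-format vectors; the path is a placeholder):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# King - Man + Woman ~= Queen, expressed as a most_similar query.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```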

word2vec Parameter Learning Explained
Rong 2014

Articles explaining word2vec: Deep Learning, NLP, and Representations and The amazing power of word vectors


GloVe: Global vectors for word representation
Pennington, Socher, Manning. 2014. Creates word vectors and relates word2vec to matrix factorizations. The evaluation section led to some controversy, raised by Yoav Goldberg.
Glove source code and training data
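
A minimal sketch of loading pretrained GloVe vectors, which are distributed as plain text with one word followed by its vector components per line (the filename refers to one of the published GloVe downloads and is only an example):

```python
import numpy as np

def load_glove(path):
    """Read a plain-text GloVe file into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove("glove.6B.100d.txt")  # example filename from the GloVe downloads
print(glove["king"].shape)               # (100,)
```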


Enriching Word Vectors with Subword Information
Bojanowski, Grave, Joulin, Mikolov 2016
FastText Code
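
A minimal sketch of subword vectors with Gensim's FastText implementation (toy corpus and sizes are illustrative only). Because vectors are composed from character n-grams, even words unseen during training receive a vector:

```python
from gensim.models import FastText

corpus = [
    ["machine", "translation", "is", "hard"],
    ["neural", "machine", "translation", "works"],
]

# min_n / max_n control the character n-gram lengths used for subword vectors.
model = FastText(sentences=corpus, vector_size=32, window=3, min_count=1, min_n=3, max_n=5)

print("translations" in model.wv.key_to_index)  # False: never seen in training
print(model.wv["translations"][:5])             # still gets a subword-based vector
```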

Sentiment Analysis

Thought vectors are numeric representations for sentences, paragraphs, and documents. This concept is used for many text classification tasks such as sentiment analysis.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Socher et al. 2013. Introduces Recursive Neural Tensor Network and dataset: "sentiment treebank." Includes demo site. Uses a parse tree.

Distributed Representations of Sentences and Documents
Le, Mikolov. 2014. Introduces Paragraph Vector (also known as paragraph2vec): a paragraph vector is learned jointly with word vectors (concatenated or averaged with them) to create vectors for sentences, paragraphs and documents. Doesn't use a parse tree.
Implemented in gensim. See doc2vec tutorial
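
A minimal sketch of Paragraph Vector training and inference with Gensim's Doc2Vec (toy documents and parameters are illustrative only):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "movie", "was", "great"], tags=["doc0"]),
    TaggedDocument(words=["the", "film", "was", "terrible"], tags=["doc1"]),
]

model = Doc2Vec(documents=docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for unseen text; such vectors can feed a downstream
# classifier, e.g. for sentiment analysis.
vec = model.infer_vector(["an", "excellent", "movie"])
print(vec.shape)  # (50,)
```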

Deep Recursive Neural Networks for Compositionality in Language
Irsoy & Cardie. 2014. Uses Deep Recursive Neural Networks. Uses a parse tree.

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Tai et al. 2015 Introduces Tree LSTM. Uses a parse tree.

Semi-supervised Sequence Learning
Dai, Le 2015
Approach: "We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder..."
Result: "With pretraining, we are able to train long short term memory recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups."

Bag of Tricks for Efficient Text Classification
Joulin, Grave, Bojanowski, Mikolov 2016 Facebook AI Research.
"Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation."
FastText blog
FastText Code
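
A minimal sketch of supervised classification with the official fastText Python bindings (assumes a training file in fastText's format, one example per line such as "__label__positive this movie was great"; the filename and hyperparameters are placeholders):

```python
import fasttext

# Train a supervised classifier on a labeled text file.
model = fasttext.train_supervised(input="train.txt", epoch=5, lr=0.5, wordNgrams=2)

# Returns the predicted label(s) and their probabilities.
print(model.predict("this movie was great"))
```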

Neural Machine Translation

In 2014, neural machine translation (NMT) performance became comparable to state-of-the-art statistical machine translation (SMT).

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (abstract)
Cho et al. 2014. Breakthrough deep learning paper on machine translation. Introduces the basic sequence-to-sequence model, which includes two RNNs: an encoder for the input and a decoder for the output.
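
A minimal PyTorch sketch of the encoder-decoder idea, two RNNs where the encoder's final hidden state initializes the decoder (a toy illustration of the architecture, not the paper's exact model):

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128

embed = nn.Embedding(VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRU(EMB, HID, batch_first=True)
out_proj = nn.Linear(HID, VOCAB)

src = torch.randint(0, VOCAB, (2, 7))  # batch of source token ids
tgt = torch.randint(0, VOCAB, (2, 5))  # batch of target token ids (teacher forcing)

# The encoder compresses the source sentence into a final hidden state...
_, h = encoder(embed(src))
# ...which initializes the decoder that generates the output sequence.
dec_out, _ = decoder(embed(tgt), h)
logits = out_proj(dec_out)             # (2, 5, VOCAB) scores over the target vocabulary
print(logits.shape)
```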

Neural Machine Translation by jointly learning to align and translate (abstract)
Bahdanau, Cho, Bengio 2014.
Implements attention mechanism. "Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated"
Result: "comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation."
English to French Demo
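
A minimal PyTorch sketch of additive (Bahdanau-style) attention: the decoder scores every encoder position against its current state and takes a weighted average, instead of relying on one fixed-length vector (shapes and layer sizes are illustrative only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HID = 128
W_enc, W_dec, v = nn.Linear(HID, HID), nn.Linear(HID, HID), nn.Linear(HID, 1)

enc_states = torch.randn(2, 7, HID)  # encoder states for 7 source positions
dec_state = torch.randn(2, HID)      # current decoder state

# Score each source position against the current decoder state.
scores = v(torch.tanh(W_enc(enc_states) + W_dec(dec_state).unsqueeze(1)))  # (2, 7, 1)
weights = F.softmax(scores, dim=1)            # attention weights over source positions
context = (weights * enc_states).sum(dim=1)   # (2, HID) context vector for this step
print(context.shape)
```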

On Using Very Large Target Vocabulary for Neural Machine Translation
Jean, Cho, Memisevic, Bengio 2014.
"we try replacing each [UNK] token with the aligned source word or its most likely translation determined by another word alignment model."
Result: English -> German BLEU score = 21.59 (target vocabulary of 50,000)

Sequence to Sequence Learning with Neural Networks
Sutskever, Vinyals, Le 2014. (nips presentation). Uses seq2seq to generate translations.
Result: English -> French BLEU score = 34.8 (WMT’14 dataset)
A key contribution is improvements from reversing the source sentences.
seq2seq tutorial in TensorFlow.

Addressing the Rare Word Problem in Neural Machine Translation (abstract)
Luong, Sutskever, Le, Vinyals, Zaremba 2014
Replace UNK words with dictionary lookup.
Result: English -> French BLEU score = 37.5.

Effective Approaches to Attention-based Neural Machine Translation
Luong, Pham, Manning. 2015
2 models of attention: global and local.
Result: English -> German 25.9 BLEU points

Context-Dependent Word Representation for Neural Machine Translation
Choi, Cho, Bengio 2016
"we propose to contextualize the word embedding vectors using a nonlinear bag-of-words representation of the source sentence."
"we propose to represent special tokens (such as numbers, proper nouns and acronyms) with typed symbols to facilitate translating those words that are not well-suited to be translated via continuous vectors."

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Wu et al. 2016
blog post
"WMT’14 English-to-French, our single model scores 38.95 BLEU"
"WMT’14 English-to-German, our single model scores 24.17 BLEU"

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
Johnson et al. 2016
blog post
Translations between untrained language pairs.

Google has started rolling out NMT to its production system, and it's a significant improvement.

Convolutional Sequence to Sequence Learning
Gehring et al. 2017. Facebook AI Research blog post.
Architecture: Convolutional sequence to sequence (ConvS2S).
Results: "We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU."

Facebook is transitioning entirely to neural machine translation

Transformer: A Novel Neural Network Architecture for Language Understanding
Architecture: Transformer, a T2T model introduced by Google in Attention Is All You Need.
Results: "we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks."
T2T Source code
T2T blog post

DeepL Translator claims to outperform competitors but doesn't disclose their architecture. "Specific details of our network architecture will not be published at this time. DeepL Translator is based on a single, non-ensemble model."

Image Captioning

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Xu et al. 2015. Creates captions by feeding the image into a CNN, which feeds into the hidden state of an RNN that generates the caption. At each time step the RNN outputs the next word and the next location to pay attention to, via a probability over grid locations. Uses two types of attention: soft and hard. Soft attention uses gradient descent and backprop and is deterministic. Hard attention selects the element with the highest probability; it uses reinforcement learning rather than backprop and is stochastic.

Open source implementation in TensorFlow

Conversation modeling / Dialog

Neural Responding Machine for Short-Text Conversation
Shang et al. 2015. Uses the Neural Responding Machine. Trained on a Weibo dataset. Achieves one-round conversations with 75% appropriate responses.

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Sordoni et al. 2015. Generates responses to tweets.
Uses Recurrent Neural Network Language Model (RLM) architecture of (Mikolov et al., 2010). source code: RNNLM Toolkit

Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Serban, Sordoni, Bengio et al. 2015. Extends hierarchical recurrent encoder-decoder neural network (HRED).

Attention with Intention for a Neural Network Conversation Model
Yao et al. 2015 Architecture is three recurrent networks: an encoder, an intention network and a decoder.

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
Serban, Sordoni, Lowe, Charlin, Pineau, Courville, Bengio 2016
Proposes novel architecture: VHRED. Latent Variable Hierarchical Recurrent Encoder-Decoder
Compares favorably against LSTM and HRED.


A Neural Conversation Model
Vinyals, Le 2015. Uses LSTM RNNs to generate conversational responses. Uses the seq2seq framework. Seq2Seq was originally designed for machine translation; here it "translates" a single sentence (up to around 79 words) into a single-sentence response, and has no memory of previous dialog exchanges. Used in Google's Smart Reply feature for Inbox.

Incorporating Copying Mechanism in Sequence-to-Sequence Learning
Gu et al. 2016 Proposes CopyNet, builds on seq2seq.

A Persona-Based Neural Conversation Model
Li et al. 2016 Proposes persona-based models for handling the issue of speaker consistency in neural response generation. Builds on seq2seq.

Deep Reinforcement Learning for Dialogue Generation
Li et al. 2016. Uses reinforcement learning to generate diverse responses. Trains 2 agents to chat with each other. Builds on seq2seq.

Adversarial Learning for Neural Dialogue Generation
Li et al. 2017.
Source code for Li papers


Deep learning for chatbots
Article summarizing the state of the art and challenges for chatbots.
Deep learning for chatbots, part 2
Implements a retrieval-based dialog agent using a dual encoder LSTM with TensorFlow, based on the Ubuntu dataset [paper]; includes source code.

ParlAI - A framework for training and evaluating AI models on a variety of openly available dialog datasets. Released by Facebook.

Memory and Attention Models

An attention mechanism allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector. - Attention and Memory in Deep Learning and NLP

Memory Networks Weston et al. 2014, and End-To-End Memory Networks Sukhbaatar et al. 2015.
Memory networks are implemented in MemNN. Attempts to solve the task of reasoning with attention and memory.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Weston 2015. Classifies QA tasks like single factoid, yes/no etc. Extends memory networks.
Evaluating prerequisite qualities for learning end to end dialog systems
Dodge et al. 2015. Tests Memory Networks on 4 tasks, including a Reddit dialog task.
See Jason Weston lecture on MemNN
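
A minimal sketch of the memory read step used in end-to-end memory networks: a query addresses embedded memory slots by dot-product similarity and retrieves a weighted sum (the tensors here are random placeholders):

```python
import torch
import torch.nn.functional as F

EMB = 64
memory = torch.randn(10, EMB)  # 10 embedded memory slots (e.g. story sentences)
query = torch.randn(EMB)       # embedded question

attn = F.softmax(memory @ query, dim=0)  # address memory by similarity to the query
read = attn @ memory                     # (EMB,) retrieved memory vector
print(read.shape)
```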

Neural Turing Machines
Graves, Wayne, Danihelka 2014.
"We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples." Olah and Carter blog on NTM

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets
Joulin, Mikolov 2015. Stack RNN source code and blog post

Reasoning, Attention and Memory (RAM) workshop at NIPS 2015. Slides included.

Deep Learning for NLP resources

State of the art resources for NLP sequence modeling tasks such as machine translation, image captioning, and dialog.

My notes on neural networks, RNNs, and LSTMs

NLP-DATASETS -- all you can eat and burn on your GPUs 😅


Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most of the data here is raw, unstructured text; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.

Datasets

  • Apache Software Foundation Public Mail Archives: all publicly available Apache Software Foundation mail archives as of July 11, 2011 (200 GB)

  • Blog Authorship Corpus: consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. 681,288 posts and over 140 million words. (298 MB)

  • Amazon Fine Food Reviews [Kaggle]: consists of 568,454 food reviews Amazon users left up to October 2012. Paper. (240 MB)

  • Amazon Reviews: Stanford collection of 35 million Amazon reviews. (11 GB)

  • ArXiv: all the papers on arXiv as full text (270 GB) + source files (190 GB).

  • ASAP Automated Essay Scoring [Kaggle]: For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. (100 MB)

  • ASAP Short Answer Scoring [Kaggle]: Each of the data sets was generated from a single prompt. Selected responses have an average length of 50 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students primarily in Grade 10. All responses were hand graded and were double-scored. (35 MB)

  • Classification of political social media: Social media messages from politicians classified by content. (4 MB)

  • CLiPS Stylometry Investigation (CSI) Corpus: a yearly expanded corpus of student texts in two genres: essays and reviews. The purpose of this corpus lies primarily in stylometric research, but other applications are possible. (on request)

  • ClueWeb09 FACC: ClueWeb09 with Freebase annotations (72 GB)

  • ClueWeb11 FACC: ClueWeb11 with Freebase annotations (92 GB)

  • Common Crawl Corpus: web crawl data composed of over 5 billion web pages (541 TB)

  • Cornell Movie Dialog Corpus: contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, 617 movies (9.5 MB)

  • Corporate messaging: A data categorization job concerning what corporations actually talk about on social media. Contributors were asked to classify statements as information (objective statements about the company or its activities), dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.). (600 KB)

  • Crosswikis: English-phrase-to-associated-Wikipedia-article database. Paper. (11 GB)

  • DBpedia: a community effort to extract structured information from Wikipedia and to make this information available on the Web (17 GB)

  • Death Row: last words of every inmate executed since 1984 online (HTML table)

  • Del.icio.us: 1.25 million bookmarks on delicious.com

  • Disasters on social media: 10,000 tweets with annotations whether the tweet referred to a disaster event (2 MB).

  • Economic News Article Tone and Relevance: News articles judged if relevant to the US economy and, if so, what the tone of the article was. Dates range from 1951 to 2014. (12 MB)

  • Enron Email Data: consists of 1,227,255 emails with 493,384 attachments covering 151 custodians (210 GB)

  • Event Registry: Free tool that gives real time access to news articles by 100.000 news publishers worldwide. Has API. (query tool)

  • Examiner.com - Spam Clickbait News Headlines [Kaggle]: 3 Million crowdsourced News headlines published by now defunct clickbait website The Examiner from 2010 to 2015. (200 MB)

  • Federal Contracts from the Federal Procurement Data Center (USASpending.gov): data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov (180 GB)

  • Flickr Personal Taxonomies: Tree dataset of personal tags (40 MB)

  • Freebase Data Dump: data dump of all the current facts and assertions in Freebase (26 GB)

  • Freebase Simple Topic Dump: data dump of the basic identifying facts about every topic in Freebase (5 GB)

  • Freebase Quad Dump: data dump of all the current facts and assertions in Freebase (35 GB)

  • GigaOM Wordpress Challenge [Kaggle]: blog posts, meta data, user likes (1.5 GB)

  • Google Books Ngrams: available also in hadoop format on amazon s3 (2.2 TB)

  • Google Web 5gram: contains English word n-grams and their observed frequency counts (24 GB)

  • Gutenberg Ebook List: annotated list of ebooks (2 MB)

  • Hansards text chunks of Canadian Parliament: 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. (82 MB)

  • Harvard Library: over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials. (4 GB)

  • Hate speech identification: Contributors viewed short text and identified if it a) contained hate speech, b) was offensive but without hate speech, or c) was not offensive at all. Contains nearly 15K rows with three contributor judgments per text string. (3 MB)

  • Hillary Clinton Emails [Kaggle]: nearly 7,000 pages of Clinton's heavily redacted emails (12 MB)

  • Home Depot Product Search Relevance [Kaggle]: contains a number of products and real customer search terms from Home Depot's website. The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters. (65 MB)

  • Identifying key phrases in text: Question/Answer pairs + context; context was judged if relevant to question/answer. (8 MB)

  • Jeopardy: archive of 216,930 past Jeopardy questions (53 MB)

  • 200k English plaintext jokes: archive of 208,000 plaintext jokes from various sources.

  • Machine Translation of European Languages: (612 MB)

  • Material Safety Datasheets: 230,000 Material Safety Data Sheets. (3 GB)

  • Million News Headlines - ABC Australia [Kaggle]: 1.3 Million News headlines published by ABC News Australia from 2003 to 2017. (56 MB)

  • MCTest: a freely available set of 660 stories and associated questions intended for research on the machine comprehension of text; for question answering (1 MB)

  • NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free for all Universities and non-profit organizations. Need to sign and send form to obtain. (on request)

  • News Headlines of India - Times of India [Kaggle]: 2.7 Million News Headlines with category published by Times of India from 2001 to 2017. (185 MB)

  • News article / Wikipedia page pairings: Contributors read a short article and were asked which of two Wikipedia articles it matched most closely. (6 MB)

  • NIPS2015 Papers (version 2) [Kaggle]: full text of all NIPS2015 papers (335 MB)

  • NYTimes Facebook Data: all the NYTimes facebook posts (5 MB)

  • One Week of Global News Feeds [Kaggle]: News Event Dataset of 1.4 Million Articles published globally in 20 languages over one week of August 2017. (115 MB)

  • Objective truths of sentences/concept pairs: Contributors read a sentence with two concepts. For example “a dog is a kind of animal” or “captain can have the same meaning as master.” They were then asked if the sentence could be true and ranked it on a 1-5 scale. (700 KB)

  • Open Library Data Dumps: dump of all revisions of all the records in Open Library. (16 GB)

  • Personae Corpus: collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays by 145 different students. (on request)

  • Reddit Comments: every publicly available reddit comment as of july 2015. 1.7 billion comments (250 GB)

  • Reddit Comments (May ‘15) [Kaggle]: subset of above dataset (8 GB)

  • Reddit Submission Corpus: all publicly available Reddit submissions from January 2006 to August 31, 2015. (42 GB)

  • Reuters Corpus: a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community. Need to sign an agreement and send it by post to obtain. (2.5 GB)

  • SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)

  • SMS Spam Collection: 5,574 real, non-encoded English SMS messages, tagged as legitimate (ham) or spam. (200 KB)

  • SouthparkData: .csv files containing script information including: season, episode, character, & line. (3.6 MB)

  • Stackoverflow: 7.3 million stackoverflow questions + other stackexchanges (query tool)

  • Twitter Cheng-Caverlee-Lee Scrape: Tweets from September 2009 - January 2010, geolocated. (400 MB)

  • Twitter New England Patriots Deflategate sentiment: Before the 2015 Super Bowl, there was a great deal of chatter around deflated footballs and whether the Patriots cheated. This data set looks at Twitter sentiment on important days during the scandal to gauge public sentiment about the whole ordeal. (2 MB)

  • Twitter Progressive issues sentiment analysis: tweets regarding a variety of left-leaning issues like legalization of abortion, feminism, Hillary Clinton, etc. classified if the tweets in question were for, against, or neutral on the issue (with an option for none of the above). (600 KB)

  • Twitter Sentiment140: Tweets related to brands/keywords. Website includes papers and research ideas. (77 MB)

  • Twitter sentiment analysis: Self-driving cars: contributors read tweets and classified them as very positive, slightly positive, neutral, slightly negative, or very negative. They were also asked to mark if the tweet was not relevant to self-driving cars. (1 MB)

  • Twitter Tokyo Geolocated Tweets: 200K tweets from Tokyo. (47 MB)

  • Twitter UK Geolocated Tweets: 170K tweets from UK. (47 MB)

  • Twitter USA Geolocated Tweets: 200k tweets from the US (45MB)

  • Twitter US Airline Sentiment [Kaggle]: A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service"). (2.5 MB)

  • U.S. economic performance based on news articles: News articles headlines and excerpts ranked as whether relevant to U.S. economy. (5 MB)

  • Urban Dictionary Words and Definitions [Kaggle]: Cleaned CSV corpus of 2.6 Million of all Urban Dictionary words, definitions, authors, votes as of May 2016. (238 MB)

  • Wesbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)

  • Wesbury Lab Wikipedia Corpus: snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP. (1.8 GB)

  • Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)

  • Wikipedia XML Data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500 GB)

  • Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)

  • Yahoo! Answers consisting of questions asked in French: Subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. (3.8 GB)

  • Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from a 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)

  • Yahoo! HTML Forms Extracted from Publicly Available Webpages: contains a small sample of pages with complex HTML forms; 2.67 million complex forms in total. (50+ GB)

  • Yahoo! Metadata Extracted from Publicly Available Web Pages: 100 million triples of RDF data (2 GB)

  • Yahoo N-Gram Representations: This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word and sentence similarity tasks, which are common in NLP research. (2.6 GB)

  • Yahoo! N-Grams, version 2.0: n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12000 news-oriented sites (12 GB)

  • Yahoo! Search Logs with Relevance Judgments: Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 GB)

  • Yahoo! Semantically Annotated Snapshot of the English Wikipedia: English Wikipedia dated from 2006-11-04 processed with a number of publicly-available NLP tools. 1,490,688 entries. (6 GB)

  • Yelp: including restaurant rankings and 2.2M reviews (on request)

  • Youtube: 1.7 million YouTube video descriptions (torrent)

Sources

Miscellaneous

  1. word2vec analogy demo

Contributing

Have anything in mind that you think is awesome and would fit in this list? Feel free to send a pull request!


License

CC0

To the extent possible under law, Tarry Singh has waived all copyright and related or neighboring rights to this work.