
Mukhopadhyay/OpenData


Open Data ❤️

PRs Welcome




Open Data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. (Source: Wikipedia)






📊 OpenData Websites

| Name | Description | URL |
| --- | --- | --- |
| CDC Open Data | The Centers for Disease Control and Prevention (CDC) is the national public health agency of the United States. | data.cdc.gov |
| Data.world | Data.world is the enterprise data catalog for the modern data stack. | data.world |
| Five Thirty Eight | FiveThirtyEight is a website using data and evidence to advance public knowledge. This is their open data portal sharing the data and code behind some of their articles and graphics. | data.fivethirtyeight.com |
| GENESIS-ONLINE | The German Federal Statistical Office is the first point of contact for official data on society, the economy, the environment and the state. | www-genesis.destatis.de |
| Kaggle | Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. | www.kaggle.com |
| Project Gutenberg | Project Gutenberg is an online library of free eBooks. Books are available in most common file formats. | www.gutenberg.org |
| Registry of Open Data on AWS | This registry exists to help people discover and share datasets that are available via AWS resources. | registry.opendata.aws |
| Science On a Sphere | Science On a Sphere is a room-sized, global display system that projects visualizations of planetary data onto a six-foot-diameter sphere to help illustrate Earth System science to people of all ages. | sos.noaa.gov |
| Stanford Large Network Dataset collection | The SNAP library has been actively developed since 2004 and is growing organically as a result of their research pursuits in the analysis of large social and information networks. | snap.stanford.edu |
| Stanford Open Data | Portal for Stanford Open Data. | stanfordopendata.org |
| The World Bank | The World Bank is an international financial institution that provides loans and grants to the governments of low- and middle-income countries for the purpose of pursuing capital projects. | datacatalog.worldbank.org |
| U.S. Census Bureau | The United States census is a census that is legally mandated by the U.S. Constitution. | data.census.gov |
| U.S. Department of Commerce | Open Data by the U.S. Department of Commerce. | data.commerce.gov |
| U.S. Education Open Data | Data Profiles from the U.S. Department of Education. | data.ed.gov |
| U.S. Transportation Open Data | Open data from the Department of Transportation of the United States of America. | data.transportation.gov |
| UCI ML Repository | The UCI ML repository is a collection of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. | archive.ics.uci.edu |
| UNICEF | UNICEF, also known as the United Nations Children's Fund, is a United Nations agency responsible for providing humanitarian and developmental aid to children worldwide. | data.unicef.org |
| World Health Organization | The World Health Organization (WHO) is a specialized agency of the United Nations responsible for international public health. | www.who.int |
| Yelp | The Yelp Open Dataset is a subset of their businesses, reviews, and user data for use in personal, educational and academic purposes. | www.yelp.com |




| Name | Description | URL |
| --- | --- | --- |
| 20 Newsgroups | A collection featuring 20,000 documents that cover 20 newsgroups and subjects. | qwone.com |
| Amazon question/answer data | This dataset contains question and answer data from Amazon, totaling around 1.4 million answered questions. | jmcauley.ucsd.edu |
| ArXiv | This massive 270 GB dataset features all arXiv research papers in full text. | arxiv.org |
| Enron Email Dataset | This dataset contains 500,000+ messages of Enron officials' emails and is especially of use for anyone looking to expand their understanding of the inner workings of email tools. | www.cs.cmu.edu |
| Google Books Ngrams | A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. | aws.amazon.com |
| IMDB Reviews | A dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. | ai.stanford.edu |
| Machine Translation of Various Languages | This dataset consists of training data for four European languages. | statmt.org |
| Multi-Domain Sentiment Dataset | A massive variety of Amazon products along with their corresponding reviews. | www.cs.jhu.edu |
| Reuters News Dataset | Originally appearing in 1987, this dataset has been labeled, indexed, and compiled for use in machine learning. | archive.ics.uci.edu |
| Sentiment140 | Sentiment140 allows you to discover the sentiment of a brand, product or topic on Twitter. | help.sentiment140.com |
| Stanford Sentiment Treebank | Dataset for training a model to identify sentiment in longer phrases, built from 10,000+ Rotten Tomatoes reviews. | nlp.stanford.edu |
| The WikiQA Corpus | This publicly available Q&A dataset was initially compiled to aid open-domain question answering research. | www.microsoft.com |
| Twenty Newsgroups Dataset | This data set consists of 20,000 messages taken from 20 newsgroups. | archive.ics.uci.edu |
| Twitter US Airline Sentiment | Analyze how travelers in February 2015 expressed their feelings on Twitter. | www.kaggle.com |
| UCI's Spambase Data set | This dataset was created by a team at HP (Hewlett-Packard) to help create a spam filter. It contains a large collection of emails previously labeled as spam by users. | archive.ics.uci.edu |
| Wikipedia Links Data | This Google dataset contains approximately 13 million documents, each containing at least one hyperlink to an English Wikipedia page. | code.google.com |
| WordNet | WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. | wordnet.princeton.edu |
| Yelp Open Dataset | This Yelp dataset features 8.5M+ reviews of over 160,000 businesses. It also has 200,000+ pictures and spans 8 major metropolitan areas. | www.yelp.com |
| YouTubers-saying-things | Dataset containing popular YouTubers' video subtitles. | www.kaggle.com |




| Name | Description | URL |
| --- | --- | --- |
| CIFAR-10 | The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. | www.cs.toronto.edu |
| COCO (Common Objects in Context) | COCO is a large-scale object detection, segmentation, and captioning dataset. | cocodataset.org |
| Fashion-MNIST | Fashion-MNIST is a dataset consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. | github.com |
| ImageNet | ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. | www.image-net.org |
| Open Images Dataset | Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. | storage.googleapis.com |
| SVHN (Street View House Number) | SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting. | ufldl.stanford.edu |
| VisualQA | VQA is a new dataset containing open-ended questions about images. The questions require an understanding of vision, language and commonsense knowledge to answer. | visualqa.org |
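
As a rough planning aid, the image counts and dimensions quoted for CIFAR-10 and Fashion-MNIST above can be turned into an uncompressed in-memory size. This is only a sketch: the helper `raw_size_bytes` is a hypothetical name, and one byte per channel value (uint8) is an assumption, not something the dataset pages in this list state.

```python
def raw_size_bytes(num_images: int, height: int, width: int, channels: int) -> int:
    """Uncompressed size, assuming one byte (uint8) per channel value."""
    return num_images * height * width * channels


# CIFAR-10: 60,000 color images of 32x32 pixels (3 channels assumed for color).
cifar10 = raw_size_bytes(60_000, 32, 32, 3)

# Fashion-MNIST: 60,000 training + 10,000 test grayscale images of 28x28 pixels.
fashion_mnist = raw_size_bytes(60_000 + 10_000, 28, 28, 1)

print(f"CIFAR-10:      {cifar10 / 1e6:.1f} MB")        # 184.3 MB
print(f"Fashion-MNIST: {fashion_mnist / 1e6:.1f} MB")  # 54.9 MB
```

Both sets fit comfortably in memory on a typical laptop under these assumptions; the files you actually download may differ in size because of compression and label data.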




| Name | Description | URL |
| --- | --- | --- |
| Ballroom | This dataset provides information on ballroom dancing. Characteristic excerpts of many dance styles are provided in real audio format, along with their tempi. | mtg.upf.edu |
| FMA (Free Music Archive) | A Dataset for Music Analysis. | github.com |
| Free Spoken Digit Dataset | A free audio dataset of spoken digits. Think MNIST for audio. | github.com |
| LibriSpeech | LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. | www.openslr.org |
| Urban Sound 8K Dataset | This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes. | urbansounddataset.weebly.com |
| VoxCeleb | This is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. | www.robots.ox.ac.uk |
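
The LibriSpeech figures above (about 1000 hours at 16 kHz) give a quick estimate of how much decoded audio the corpus represents. The sketch below assumes 16-bit mono PCM, which is not stated in this list, and `pcm_size_bytes` is a hypothetical helper name.

```python
def pcm_size_bytes(hours: float, sample_rate_hz: int,
                   bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of uncompressed PCM audio; 16-bit mono is assumed by default."""
    total_samples = int(hours * 3600 * sample_rate_hz)
    return total_samples * bytes_per_sample * channels


# Roughly 1000 hours of 16 kHz speech, per the LibriSpeech description.
librispeech = pcm_size_bytes(hours=1000, sample_rate_hz=16_000)

print(f"LibriSpeech (decoded, approx.): {librispeech / 1e9:.1f} GB")  # 115.2 GB
```

The download itself may be considerably smaller if the audio is distributed in a compressed format, so treat this as an upper-bound estimate for decoded audio rather than a download size.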