@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. natural-questions Public

    Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 1k 156

  2. conceptual-captions Public

    Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

    Shell 541 28

  3. Objectron Public

    Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

    Jupyter Notebook 2.3k 260

  4. wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    1.1k 45

  5. paws Public

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

    Python 557 55

  6. dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 575 130

Repositories

Showing 10 of 169 repositories
  • sanpo_dataset Public
    Python 41 Apache-2.0 2 3 3 Updated Jun 27, 2025
  • cultural_familiarity_annotations Public

    The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.

    0 Apache-2.0 0 0 0 Updated Jun 27, 2025
  • common-crawl-domain-names Public

    Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").

    18 MIT 2 0 0 Updated Jun 16, 2025
  • rag_conflicts Public

    CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding conflict type label, and, for specific types, the ground-truth answer.

    4 Apache-2.0 0 1 0 Updated Jun 11, 2025
  • wit-retrieval Public
    4 0 1 0 Updated Jun 5, 2025
  • Amplify_SSA Public

    An annotated dataset of 8,091 adversarial queries in seven Sub-Saharan African languages.

    Jupyter Notebook 0 1 0 0 Updated May 1, 2025
  • egotempo Public
    Jupyter Notebook 22 CC-BY-4.0 0 3 0 Updated Apr 26, 2025
  • artydiqa Public

    ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA where models find answer spans or identify unanswerable questions, and a QG task involving formulating questions from context and answer pairs.

    0 0 0 0 Updated Apr 23, 2025
  • web-images Public

    Images gathered from the Internet in 2023, with some accompanying metadata.

    HTML 2 2 0 0 Updated Mar 19, 2025
  • screen_qa Public

    The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It is intended for training and evaluating models capable of screen content understanding via question answering.

    Python 119 CC-BY-4.0 9 3 0 Updated Feb 7, 2025
