Home
Joan Giner edited this page Sep 23, 2022 · 1 revision
(Version 0.0.5)
DescribeML is a VSCode language plugin to describe machine-learning datasets.
Full examples of the language are available in the project's public repository.
- Title: STRING. The public title of the dataset.
- Unique-identifier: ID. Machine-readable unique identifier of the dataset.
- Version: ID. The version of the dataset.
- Dates:
  - Created: DATE. The date when the dataset was initially created.
  - Modified: DATE. The date when the dataset was last modified.
  - Published: DATE. The publication date of the dataset.
Example
Dates: Release Date: 10-08-20 Modified Date: 10-08-20 Published Date: 10-08-20
- Citation: The citation of the dataset. Choose between a raw citation and a structured format.
  - Raw Citation: STRING. Raw citation of the dataset as text, BibTeX, or an equivalent format.
  - OR:
    - Title: STRING. The title of the dataset.
    - Authors: STRING. The authors of the dataset.
    - Year: DATE. The year of the dataset.
    - Journal/Conference: STRING. The journal or conference in which the dataset was published.
    - Publisher: STRING. The publisher of the dataset.
    - URL: URL. The URL of the dataset.
    - DOI: ID. The DOI of the dataset.
    - ISBN: ID. The ISBN of the dataset.
Example:
Citation: Title: "SIIM-ISIC 2020 Challenge Dataset. International Skin Imaging Collaboration" Year: 2020 Publisher: "International Skin Imaging Collaboration" DOI: "doi.org/10.34970/2020-ds01" Url: "https://www.kaggle.com/c/siim-isic-melanoma-classification"
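The two citation forms carry the same information, so a structured citation can be rendered as a raw one. The following is a minimal Python sketch of that idea, not part of DescribeML: the `StructuredCitation` class, the `to_bibtex` helper, and the choice of the `@misc` entry type are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructuredCitation:
    """Hypothetical container mirroring some of the structured citation fields."""
    title: str
    year: int
    publisher: Optional[str] = None
    doi: Optional[str] = None
    url: Optional[str] = None


def to_bibtex(key: str, citation: StructuredCitation) -> str:
    """Render the citation as a BibTeX @misc entry, skipping empty fields."""
    fields = {
        "title": citation.title,
        "year": str(citation.year),
        "publisher": citation.publisher,
        "doi": citation.doi,
        "url": citation.url,
    }
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@misc{{{key},\n{body}\n}}"


if __name__ == "__main__":
    example = StructuredCitation(
        title="SIIM-ISIC 2020 Challenge Dataset",
        year=2020,
        publisher="International Skin Imaging Collaboration",
        doi="doi.org/10.34970/2020-ds01",
    )
    print(to_bibtex("isic2020", example))
```

The resulting string could then be pasted into the Raw Citation field if a single text citation is preferred.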
- Description: The description of the dataset.
  - Description: STRING. Textual description of the dataset.
  - OR:
    - Purposes: STRING. For what purposes was the dataset created?
    - Tasks: TASKS ENUMERATE. List of ML tasks the dataset is intended for. The autocomplete feature will guide you through the options.
    - Gaps: STRING. Which gaps does the dataset aim to fill?
  - Areas: ID. A list of areas of the dataset.
  - Tags: ID, ... A list of tags of the dataset.
Example:
Description: Purposes: Purposes: "The 2020 SIIM-ISIC Melanoma" Tasks: [classification] Gaps: "As the leading healthcare organization for informatics in medical imaging..." Areas: HealthCare Tags: Images Melanoma diagnosis SkinImage
- Applications: Summarize the applications of the dataset.
  - Past Uses: STRING. Summarize the past uses of the dataset.
  - Recommended uses: STRING. Summarize the recommended uses of the dataset.
  - Non-recommended uses: STRING. Summarize the non-recommended uses of the dataset.
  - Benchmarking: Benchmarking of the dataset.
    - Task: TASKS ENUMERATE. Task to benchmark. The autocomplete feature will guide you through the options.
    - Metric: Metric to benchmark.
      - F1: NUMBER. F1 score.
      - Accuracy: NUMBER. Accuracy score.
      - Precision: NUMBER. Precision score.
      - Recall: NUMBER. Recall score.
    - Reference: STRING. Source of the benchmark.
Example
Applications: Past Uses: "The 2020 SIIM-ISIC Melanoma Classification... " Recommended: "Identify melanoma in lesion images." "Predict incidence of melanoma in a population." Non-recommended: "Due to low population prevalence and challenges with access." Benchmarking: Task: Language-model [ Model: "ModelExample" Metrics:[ F1: 81 Accuracy: 81 Precision: 81 Recall: 81 ] Reference: "https://www.kaggle.com/c/siim-isic-melanoma-classification/leaderboard" ]
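The four benchmarking metrics can all be derived from the confusion counts of a binary classifier. The sketch below shows the standard definitions in Python; the function name `classification_metrics` and its argument names are hypothetical, and the values are returned as fractions in [0, 1], so they would need to be multiplied by 100 to match the percentage style used in the example above.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from binary confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Undefined ratios (zero denominator) are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "F1": f1,
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
    }


if __name__ == "__main__":
    print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```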
- Distribution: Summarize the distribution of the dataset.
  - Is public?: BOOL. Indicate whether the dataset is publicly available.
  - Licenses: LICENCES ENUMERATE. List of standard licenses; use others if none fits your case: The Montreal Data License, Creative Commons, CC0: Public Domain, ...
    - Rights(stand-alone): ENUMERATE. Montreal Data License stand-alone rights: Access | Tagging | Distribute | Re-Represent
    - Rights(with models): ENUMERATE. Montreal Data License model-related rights: Benchmark | Research | Publish | Internal Use | Output Commercialization | Model Commercialization
  - Credits/Attribution Notice: STRING. Who needs to be credited when using the dataset.
  - Designated Third Parties: STRING. Third parties in charge of licensing and distribution issues.
  - Additional Conditions: STRING. Other conditions specified by the authors.
Example
Distribution: Licences: CC BY 3.0 (Attribution 3.0 Unported) Rights(stand-alone): Access Rights(with models): Benchmark Additional Conditions "In addition to the CC-BY-NC license, the dataset is governed by the ISIC Terms of Use ... "
- Authoring: Authoring of the dataset.
  - Authors: Authors of the dataset.
    - Name: STRING. Name of the author.
    - Email: EMAIL. Email of the author.
  - Funders: Funders of the dataset.
    - Name: STRING. Name of the funder.
    - Type: ENUMERATE. Type of the funder: private | public | mixed
    - Grantor: STRING. Grantor of the dataset.
    - Grant ID: ID. Machine-readable identifier of the grant.
  - Maintainers: Maintainers of the dataset.
    - Name: STRING. Name of the maintainer.
    - Email: EMAIL. Email of the maintainer.
  - Erratum?: STRING. Is there any erratum?
  - Data retention: STRING. Indicate any data retention policy.
  - Version lifecycle: STRING. Describe the planned version lifecycle.
  - Contribution guidelines: STRING. Are there any contribution guidelines?
Example:
Authoring: Authors: Name Skin_Imaging_Collaboration_ISIC email emailo@emailo.com [...] Funders: Name The_University_of_Queensland type mixed grantor "National Health and Medical Research Council (NHMRC) – Centre of Research Excellence Scheme" grantId: APP1099021 [...] Erratum?: "There is no erratum known" Contribution guidelines: "No contribution guidelines provided"
- Rationale: STRING. Provide a composition rationale.
- Total Size: NUMBER. Total number of tuples in the dataset.
- Instances: A composition description of each instance of the dataset.
  - Instance: ID. Machine-readable name of the instance.
  - Size: NUMBER. Size of the instance.
  - Description: STRING. Description of the instance.
  - Type: ENUMERATE. Type of the instance: Record-Data | Time-Series | Ordered | Graph | Other
  - Attribute Number: NUMBER. Number of attributes.
  - Attributes: Description of each attribute of the instance.
    - attribute: ID. Machine-readable name of the attribute.
    - Description: STRING. Description of the attribute.
    - Associated label: Labels. Reference to a label declared in a labeling process (complete the provenance part first).
    - unique values: NUMBER. Number of unique values of the attribute.
    - ofType: ENUMERATE. Type of the attribute: Categorical | Nominal
    - If ofType is Categorical:
      - Statistics: Statistics of the attribute.
        - Unique: NUMBER. Unique tuples (without duplicates).
        - Unique Percentage: NUMBER. Percentage of unique tuples.
        - Missing Values: NUMBER. Number of missing values.
        - Completeness: NUMBER. Completeness of the attribute.
        - Mode: STRING. Mode of the attribute.
        - First Rows: [0: ROW1, ...]. First rows of the attribute.
        - Min-leght: NUMBER. Minimum length of the attribute values.
        - Max-lenght: NUMBER. Maximum length of the attribute values.
        - Median-lenght: NUMBER. Median length of the attribute values.
        - Lenght-histogram: STRING. Histogram of the lengths of the attribute values.
        - Chi-Squared: Chi-squared test of the attribute.
          - statistic: Statistic of the chi-squared analysis.
          - p-value: p-value of the chi-squared analysis.
        - Binary attribute: BOOL. Is it a binary attribute?
          - Symmetry: ENUMERATE. Symmetryc | Asymmetryc
          - Attribute Sparsity: NUMBER. How sparse is the binary attribute?
        - Categoric Distribution: ["CATEGORY": "NUMBER"%, ...]. Categoric distribution of the attribute.
Example
attribute: beningnant_malignant description: 'Type of the melanoma' label: skinLabel count: 33126 ofType: Categorical Statistics: Missing Values: 0 Completeness: 100 Chi-Squared: p-value: 0 Categoric Distribution: [ "beningnant": 80%, "malignant": 20% ]
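Most of the categorical statistics can be computed directly from the raw column before they are written into the description. The Python sketch below is one possible way to do that and is not part of DescribeML. It assumes that missing values are represented as `None`, that "Unique" means the number of distinct non-missing values, and that percentages are expressed on a 0 to 100 scale; the function name `categorical_statistics` is hypothetical.

```python
from collections import Counter


def categorical_statistics(values: list) -> dict:
    """Summarize a categorical column; None marks a missing value (assumption)."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "Unique": len(counts),
        "Unique Percentage": 100 * len(counts) / total if total else 0.0,
        "Missing Values": total - len(present),
        "Completeness": 100 * len(present) / total if total else 0.0,
        "Mode": counts.most_common(1)[0][0] if counts else None,
        # Share of each category among the non-missing values, in percent.
        "Categoric Distribution": {
            category: 100 * n / len(present) for category, n in counts.items()
        },
    }


if __name__ == "__main__":
    column = ["benign", "benign", "benign", "malignant", None]
    for name, value in categorical_statistics(column).items():
        print(f"{name}: {value}")
```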
    - If ofType is Nominal:
      - Statistics: Statistics of the attribute.
        - Mean: NUMBER. Mean of the attribute.
        - Median: NUMBER. Median of the attribute.
        - Mode: NUMBER. Mode of the attribute.
        - Minimmum: NUMBER. Minimum of the attribute.
        - Maximmum: NUMBER. Maximum of the attribute.
        - Quartiles: [Q1:NUMBER, ...]. Quartiles of the attribute.
        - IQR: NUMBER. Interquartile range of the attribute.
Example
attribute: acidity description: 'wine acidity mesure' count: 33126 ofType: Numerical Statistics: Mean: 4 Median: 4.1 Standard Desviation: 0.2 Minimmum: 5 Maximmum: 87 Quartiles: Q1:17 Q2:27 Q3:30 Q4:30 IQR: 1.2
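The numerical statistics map closely onto Python's standard `statistics` module. The sketch below shows one way to compute them; the function name `numerical_statistics` and the lowercase result keys are hypothetical, and the "inclusive" quantile method is an assumption, since the documentation does not say which quartile definition is expected.

```python
import statistics


def numerical_statistics(values: list) -> dict:
    """Summarize a numerical column with the standard library.

    Quartiles use the "inclusive" method, which treats the data as the
    whole population (an assumption; other definitions give other cut points).
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "minimum": min(values),
        "maximum": max(values),
        "quartiles": {"Q1": q1, "Q2": q2, "Q3": q3},
        "iqr": q3 - q1,  # interquartile range
    }


if __name__ == "__main__":
    print(numerical_statistics([1, 2, 3, 4, 5]))
```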
  - Statistics: Statistics of the instance.
    - Correlations: Correlation of the instance; choose one calculation type.
      - Pearson: [INDEX:"NUMBER", ...]. Pearson correlation of the instance.
      - Spearman: [INDEX:"NUMBER", ...]. Spearman correlation of the instance.
      - Kendall: [INDEX:"NUMBER", ...]. Kendall correlation of the instance.
      - Cramers: [INDEX:"NUMBER", ...]. Cramers correlation of the instance.
      - Phi-k: [INDEX:"NUMBER", ...]. Phi-k correlation of the instance.
    - Pair Correlation: Between [ATTRIBUTE], and [ATTRIBUTE]. Points out a relevant pair correlation between two declared attributes.
    - Quality Metrics: General quality metrics of the instance.
      - Sparsity: NUMBER. Sparsity of the instance.
      - Completeness: NUMBER. Completeness of the instance.
      - Class balance: STRING. Class balance of the instance.
      - Noisy labels: STRING. Noisy labels of the instance.
Example:
Statistics: Correlations: Spearman: ['1': 0.2, '2':0.3, '3':0.4, '4':0.5, '5':0.6, '6':0.7, '7':0.8, '8':0.9] Pair Correlation: between ImageId and diagnosis between age and external source From: "National statistical office" Rationale: "The age average is similar to the Nevada state age average due to national statistical office average of 2022 of Nevada" Quality Metrics: Completeness: 100
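To make the difference between two of the calculation types concrete, the following Python sketch computes Pearson and Spearman coefficients for a pair of attributes. It is an illustration only: the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, and Spearman is implemented here as the Pearson correlation of average ranks, which is the usual textbook definition.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def average_ranks(values: list) -> list:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs: list, ys: list) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    ages = [1, 2, 3, 4]
    sizes = [1, 4, 9, 16]  # monotonic but not linear in ages
    print(pearson(ages, sizes), spearman(ages, sizes))
```

Pearson measures linear association, while Spearman only depends on the ordering of the values, so a monotonic but non-linear relation scores higher under Spearman.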
  - Consistency Rules: Set the consistency rules of your dataset.
    - Rule: OCLExpression. OCL expression of the rule.
Example:
Consistency rules: inv: skinImages : (age >= 0)
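An invariant such as the one above states a condition that every record of the instance must satisfy. As an illustration of what that means for the data, the Python sketch below checks the same `age >= 0` condition against a few rows. It does not evaluate OCL: the `check_invariant` helper and the sample `skin_images` rows are hypothetical, and the rule is hand-translated into a Python predicate.

```python
def check_invariant(rows: list, predicate) -> list:
    """Return the indices of the rows that violate the given predicate."""
    return [index for index, row in enumerate(rows) if not predicate(row)]


# Hypothetical sample of the skinImages instance.
skin_images = [{"age": 45}, {"age": -1}, {"age": 0}]

# Hand-written equivalent of: inv: skinImages : (age >= 0)
violations = check_invariant(skin_images, lambda row: row["age"] >= 0)

if __name__ == "__main__":
    print("Rows violating the rule:", violations)
```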
- Dependencies: Dependencies of the rule.
  - Description: STRING. Description of the dependencies.
  - Links: URL. Link to the dependency artifact.
- Instances relation: Relation: ID attribute: [ATTRIBUTE] is related to [INSTANCE]. Relation between instances.
- Curation Rationale: STRING. Provide a provenance rationale.
- Gathering Processes:
  - Process: ID. Machine-readable name of the process.
  - Description: STRING. Description of the process.
  - When data was collected: STRING. Date when the process was performed.
  - How data was collected: STRING. How the data was collected.
  - Is language data: Set the speech situation.
    - Language: STRING. Language of the data.
    - Time and place: STRING. Time and place of the speech.
    - Modality: ENUMERATE. Modality of the speech: spoken/signed | written
    - Type: ENUMERATE. Type of the speech: scripted/edited | spontaneous
    - Syncrony: ENUMERATE. Synchrony of the speech: synchronous | asynchronous
    - Inteded Audience: STRING. Intended audience of the speech.
  - Social Issues: [SOCIAL ISSUES]. Relation of the gathering process to an already declared social issue instance.
  - Source: Source of the data.
    - Source: ID. Machine-readable name of the source.
    - Description: STRING. Description of the source.
    - Noise: STRING. Description of the source's noise.
    - Links: URL. Link to the source artifact.
  - Process Demographics:
    - Age: NUMBER. Median age of the participants.
    - Gender: STRING. Gender ratio of the participants.
    - Country/Region: STRING. Country/Region of the participants.
    - Race/Ethnicity: STRING. Race or ethnicity of the participants.
    - Native Langugage: STRING. Native language of the participants.
    - Socioeconomic status: STRING. Socioeconomic status of the participants.
    - Number of speakers represented: NUMBER. Number of participants.
    - Precense of disorders in speech: STRING. Presence of speech disorders among the participants.
    - Training in linguistics/other relevant disciplines: STRING. Explain the training of the participants.
  - Gathering Team: Team in charge of gathering the data.
    - Who collects the data: STRING. Who collects the data.
    - Type: ENUMERATE. Internal | External | Contractors | Crowdsourcing
    - Demographics: Demographics of the gathering team.
      - Age: NUMBER. Median age of the team members.
      - Gender: STRING. Gender ratio of the team members.
      - Country/Region: STRING. Country/Region of the team members.
      - Race/Ethnicity: STRING. Race or ethnicity of the team members.
      - Native Langugage: STRING. Native language of the team members.
      - Socioeconomic status: STRING. Socioeconomic status of the team members.
      - Training in linguistics/other relevant disciplines: STRING. Explain the training of the team members.
  - Gathering Requirements: Requirement: STRING, ...
Example:
Data Provenance: Curation Rationale: "The curation process have been conducted by several health institutions... " Gathering Processes: Process: GatheringProcess1 Description: "The sources are: the Melanoma Institute Australia and the ..." Source: GeneralHospital1 Description: 'Source Description' Noise: "Inconsistent lighting in images may alter skin type" "Duplicates:..." Related Instances: skinImages How data is collected: Manual Human Curator When data was collected: Range: 1998 - 2019 Process Demographics: Country/Region: 'Australia' [...] Gathering Team: Who collects the data: "A team of dermatologists and pathologists" Type Internal Gather Requirements: Requirement: "We queried clinical imaging databases across the six centers to generate a ..."
- LabelingProcesses:
  - Labeling process: ID. Machine-readable name of the labeling process.
  - Description: STRING. Description of the labeling process.
  - Type: ENUMERATE. 'Bounding boxes' | 'Lines and splines' | 'Semantinc Segmentation' | '3D cuboids' | 'Polygonal segmentation' | 'Landmark and key-point' | 'Image and video annotations' | 'Entity annotation' | 'Content and textual categorization'
  - Labels: Labels of the labeling process.
    - Label: ID. Machine-readable name of the label.
    - Description: STRING. Description of the label.
    - Mapping: [ATTRIBUTE,...]. Relate a label to attributes already declared in the documentation.
  - Labeling Team:
    - Who collects the data: STRING. Who labels the data.
    - Type: ENUMERATE. Internal | External | Contractors | Crowdsourcing
    - Demographics: Demographics of the labeling team.
      - Age: NUMBER. Median age of the team members.
      - Gender: STRING. Gender ratio of the team members.
      - Country/Region: STRING. Country/Region of the team members.
      - Race/Ethnicity: STRING. Race or ethnicity of the team members.
      - Native Langugage: STRING. Native language of the team members.
      - Socioeconomic status: STRING. Socioeconomic status of the team members.
      - Number of speakers represented: NUMBER. Number of team members.
      - Precense of disorders in speech: STRING. Presence of speech disorders among the team members.
      - Training in linguistics/other relevant disciplines: STRING. Explain the training of the team members.
  - Infrastructure: Infrastructure used to annotate the data.
    - Tool: STRING. Tool used to annotate the data.
    - Platform: STRING. Platform where the tool works.
    - Version: STRING. Version of the tool and platform.
    - Language: STRING. Language of the tool.
    - Comments: STRING. Provide comments about the tool.
  - Validation: Validation methods to ensure annotation quality.
    - Validation Methods: STRING. Validation method used.
    - Validation Dates: STRING. Dates when the annotations were validated.
    - Golden Questions: Golden questions posed to the annotators.
      - Question: STRING. Textual question.
      - Inter-annotation agreement: NUMBER. Inter-annotation agreement for each question. Low values mean low confidence in the annotation.
    - Validation Requirements: Requirement: STRING, ... Provide comments about the validation tool.
  - Labeling Requirements: Requirement: STRING, ...
Example:
LabelingProcesses: Labeling process: skinLabeling Description: "Medical staff looking at the data and images and annotating the diagnosis" Type: Image and video annotations Labels: Label: skinLabel Description: "marked as beningnant or malignant" Mapping: beningnant_malignant Labeling Team: Who collects the data: "Internal Medical staff" Type Internal Country/Region: "Australia" Label Requirements: Requirement: "1) Images containing any potentially identifying features, such as jewelry
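The Inter-annotation agreement field of a golden question expects a single number, but the documentation does not prescribe how to compute it. One common choice for two annotators is Cohen's kappa, sketched below in Python as an assumption rather than as the measure DescribeML requires; the function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same label by chance.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["malignant", "malignant", "benign", "benign"]
    annotator_2 = ["malignant", "benign", "benign", "benign"]
    print(cohen_kappa(annotator_1, annotator_2))
```

A low value here corresponds to the low confidence in the annotation mentioned in the field description.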
- Preprocesses: Preprocessing steps applied to the data.
  - Preprocess: ID. Machine-readable name of the preprocess.
  - Type: ENUMERATE. Type of preprocess applied: 'Missing Values' | 'Data Augmentation' | 'Outlier Filtering' | 'Remove Duplicates' | 'Data reduction' | 'Sampling' | 'Data Normalization' | 'Others'
  - Description: STRING. Description of the preprocess.
  - Social Issues: [SOCIAL ISSUES]. Relation of the preprocess to an already declared social issue instance.
Social Concerns
- Rationale: STRING. Rationale of the social concerns of the dataset.
- Social Issues: Social issues identified from the data.
  - Social Issue: ID. Machine-readable name of the social issue.
  - IssueType: ENUMERATE. Type of social concern: 'Privacy' | 'Bias' | 'Sensitive Data' | 'Social Impact'
  - Description: STRING. Description of the social issue.
  - Related Attributes: attribute: [ATTRIBUTE]. Attributes related to the social issue.
  - Instace belong to people:
    - Have sensitive attributes? [Attribute], ... List of sensitive attributes.
    - Are there protected groups? ENUMERATE. Yes | No | Unknown
    - Might be offensive? STRING. Is there offensive content in the dataset?
Examples
Social Concerns: Rationale: 'Dataset may not be representative of the real world data, and the cavenience sample is not representative of general incidence of melanoma' Social Issue: raceRepresentative IssueType: Bias Description: "Dataset is not representative with respect to darker skin types" Related Attributes: attribute: ImageId
For any related question, please contact the authors at: jginermi@uoc.edu