# MMS Dataset Card

> Dataset Card for https://huggingface.co/datasets/Brand24/mms



## Ease of use

One of the key ideas behind our library of datasets was to prioritize ease of use for researchers. Recognizing the importance of accessibility and convenience, we chose HuggingFace to store and distribute the datasets. HuggingFace provides a user-friendly interface and a wide range of tools and resources, making it easy for researchers to access and use the datasets.

To further enhance usability, we gathered all the necessary citations for the datasets included in our library. Unifying the citations simplifies and speeds up citation generation for researchers who use our datasets, and it reduces the time and effort required to acknowledge the datasets' sources properly.

However, it is essential to note that while we have taken steps to streamline the citation process, researchers should still independently verify the licenses of the datasets, especially if they intend to use them for purposes beyond strict academic research. Ensuring compliance with licensing requirements is crucial to maintaining ethical and legal data use standards.

Overall, our goal in creating this unified corpus of datasets is to accelerate academic sentiment analysis research. By providing a comprehensive collection of high-quality datasets and making them easy to access, we aim to support researchers in exploring and advancing sentiment analysis techniques and methodologies.

### Data ready to slice and dice and train a model

Our dataset is designed to be versatile: researchers can slice and dice the data for training and modeling according to their specific needs. Drawing from the field of linguistic typology, which examines the characteristics of languages, we have incorporated various linguistic features into our dataset selection process. These features include the text itself, sentiment labels, the original dataset source, domain, language, language family, genus, the presence or absence of definite and indefinite articles, the number of cases, word order, negative morphemes, polar questions, the position of negative morphemes, prefixing vs. suffixing, coding of nominal plurals, and grammatical genders. We offer these features as filtering options in our library, so researchers can easily access datasets that match their desired linguistic typology criteria.

For instance, researchers can download datasets specific to Slavic languages with interrogative word order for polar questions or datasets from the Afro-Asiatic language family without morphological case-marking. This flexibility lets researchers tailor their analyses and models to their linguistic interests and research questions.
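As a minimal sketch of this kind of facet selection, the snippet below filters a handful of made-up rows by `Genus` and `Polar questions`. The rows and the helper name `is_slavic_with_interrogative_order` are hypothetical; only the column names and category values are taken from the dataset sample shown in this card.

```python
# Hypothetical sample rows: only the column names and category values
# mirror the dataset; the texts are placeholders.
rows = [
    {"text": "sample A", "Genus": "Germanic", "Polar questions": "interrogative word order"},
    {"text": "sample B", "Genus": "Slavic", "Polar questions": "question particle"},
    {"text": "sample C", "Genus": "Slavic", "Polar questions": "interrogative word order"},
]


def is_slavic_with_interrogative_order(row):
    """Return True for rows from Slavic languages that form polar
    questions with interrogative word order."""
    return (
        row["Genus"] == "Slavic"
        and row["Polar questions"] == "interrogative word order"
    )


# Keep only the rows that match both typological criteria.
selected = [row for row in rows if is_slavic_with_interrogative_order(row)]
print([row["text"] for row in selected])
```

Because `datasets.Dataset.filter` accepts a function that receives each row as a dictionary, the same predicate should also work on the loaded dataset, for example `mms_dataset["train"].filter(is_slavic_with_interrogative_order)`.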

More use cases are on the next page.

In [None]:
#| eval: false
import datasets

mms_dataset = datasets.load_dataset("Brand24/mms")
mms_dataset_df = mms_dataset["train"].to_pandas()


All features in the dataset:

In [None]:
#| eval: false
mms_dataset_df.sample(5)

| | _id | text | label | original_dataset | domain | language | Family | Genus | Definite articles | Indefinite articles | Number of cases | Order of subject, object, verb | Negative morphemes | Polar questions | Position of negative word wrt SOV | Prefixing vs suffixing | Coding of nominal plurality | Grammatical genders | cleanlab_self_confidence | label_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4885253 | 4885253 | 遅いレビューですが、他の方の同様な状況を見かけたので投稿しました。 購入してしばらくすると移... | 0 | ja_multilan_amazon | reviews | ja | Japanese | Japanese | no article | indefinite word distinct from one | 8-9 | SOV | negative affix | question particle | MorphNeg | strongly suffixing | plural suffix | no grammatical gender | 0.3962 | negative |
| 5370394 | 5370394 | Дети в школу собирались: мылись... брились... ... | 1 | ru_twitter_sentiment | social_media | ru | Indo-European | Slavic | no article | no article | 6-7 | SVO | negative particle | question particle | SNegVO | strongly suffixing | plural suffix | masculine, feminine, neuter | 0.30718 | neutral |
| 3711934 | 3711934 | Would you LIKE to see cornerback Champ Bailey ... | 1 | en_semeval_2017 | mixed | en | Indo-European | Germanic | definite word distinct from demonstrative | indefinite word distinct from one | 2 | SVO | negative particle | interrogative word order | SNegVO | strongly suffixing | plural suffix | no grammatical gender | 0.768998 | neutral |
| 5635649 | 5635649 | @GeekHotness @uroshdemz moi???I pray not to! #... | 1 | sr_twitter_sentiment | social_media | sr | Indo-European | Slavic | no article | no article | 5 | SVO | negative particle | question particle | other | strongly suffixing | plural suffix | masculine, feminine, neuter | 0.623751 | neutral |
| 1448834 | 1448834 | Manche Menschen sehen ohne Schminke echt komis... | 0 | de_sb10k | social_media | de | Indo-European | Germanic | definite word distinct from demonstrative | indefinite word same as one | 4 | no dominant order | negative particle | interrogative word order | more than one position | strongly suffixing | plural suffix | masculine, feminine, neuter | 0.674396 | negative |


### Linguistic Typology

The field of language typology focuses on studying the similarities and differences among languages. These differences can be categorized into phonological (sounds), syntactic (structures), lexical (vocabulary), and theoretical aspects. Linguistic typology analyzes the current state of languages, contrasting with genealogical linguistics, which examines historical relationships between languages.

Genealogical linguistics studies language families and genera. A language family consists of languages that share a common ancestral language, while genera are branches within a language family. The Indo-European family, for example, includes genera such as Slavic, Romance, Germanic, and Indic. Over 7000 languages are categorized into approximately 150 language families, with Indo-European, Sino-Tibetan, Turkic, Afro-Asiatic, Nilo-Saharan, Niger-Congo, and Eskimo-Aleut being some of the largest families.

Within linguistic typology, languages are described using various linguistic features. Our work focuses on sentiment classification, and we selected ten typological features relevant to it. The list below first describes the seven general dataset fields and then the ten typological features:

- `text`: The actual text of the sample. It is of type string and contains the text samples or sentences for sentiment analysis.
- `label`: The sentiment label of the text sample. It is of type ClassLabel and has three possible values: negative, neutral, and positive. These labels indicate the sentiment or emotional polarity associated with the text.
- `original_dataset`: The name or identifier of the original dataset from which the text sample was extracted. It is of type string.
- `domain`: The domain or topic of the source dataset. It is of type string and provides context regarding the subject matter of the text samples.
- `language`: The language in which the text is written. It is of type string.
- `Family`: The language family to which the language belongs. It is of type string.
- `Genus`: The genus, or branch within a language family, to which the language belongs. It is of type string.
- `Definite articles`: Half of the languages do not use the definite article, which signals uniqueness or definiteness of a concept.
- `Indefinite articles`: Half of the languages do not use the indefinite article; some languages use a separate article or the numeral "one."
- `Number of cases`: Languages vary greatly in the number of morphological cases used.
- `Order of subject, object, verb`: Different languages have different word orderings, with variations like SOV, SVO, VSO, VOS, OVS, and OSV.
- `Negative morphemes`: Negative morphemes indicate clausal negation in declarative sentences.
- `Polar questions`: Questions with yes/no answers, which can be formed using question particles, interrogative morphology, interrogative word order, or intonation.
- `Position of negative word wrt SOV`: The position of the negative morpheme can vary relative to the subject, object, and verb.
- `Prefixing vs suffixing`: Languages differ in their use of prefixes and suffixes in inflectional morphology.
- `Coding of nominal plurality`: Plurals can be expressed through morphological changes or the use of plurality indicator morphemes.
- `Grammatical genders`: Languages vary in the number of grammatical genders used, or may not use the concept at all.

These language features are available as filtering options in our library. Users can download specific facets of the collection, such as datasets in Slavic languages with interrogative word order for polar questions or datasets from the Afro-Asiatic language family without morphological case-marking.
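Before selecting a facet, it can help to check how many rows each combination of features holds. The following is a rough sketch that counts rows per (`Family`, `domain`) pair; the rows are made up for illustration, and only the column names and category values follow the dataset.

```python
from collections import Counter

# Hypothetical rows: only the column names and category values follow the dataset.
rows = [
    {"Family": "Indo-European", "Genus": "Germanic", "domain": "social_media"},
    {"Family": "Indo-European", "Genus": "Germanic", "domain": "social_media"},
    {"Family": "Indo-European", "Genus": "Slavic", "domain": "social_media"},
    {"Family": "Indo-European", "Genus": "Germanic", "domain": "reviews"},
    {"Family": "Afro-Asiatic", "Genus": "Semitic", "domain": "news"},
]

# Count how many rows fall into each (Family, domain) facet.
facet_sizes = Counter((row["Family"], row["domain"]) for row in rows)

# List facets from largest to smallest to spot sparse combinations.
for facet, size in facet_sizes.most_common():
    print(facet, size)
```

On a pandas DataFrame such as `mms_dataset_df`, a comparable count could be obtained with `mms_dataset_df.groupby(["Family", "domain"]).size()`.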

## Datasheets for Datasets 

The datasheets provide detailed information about the datasets, including data collection methods, annotation guidelines, and potential biases. They also specify the intended uses and potential limitations of the datasets.

The initial pool of sentiment datasets was gathered through an extensive search using sources such as Google Scholar, GitHub repositories, and the HuggingFace datasets library. This search yielded a total of **345** datasets.

To ensure the quality of the datasets, a set of quality assurance criteria was applied to manually filter the initial pool of datasets. The following criteria were used:

1. **Strong Annotations**: Datasets containing weak annotations, such as labels based on emoji occurrence or automatically generated through classification by machine learning models, were rejected. This decision was made to minimize the presence of noise in the datasets, ensuring higher quality annotations.
2. **Well-Defined Annotation Protocol**: Datasets without sufficient information about the annotation protocol, including whether the annotation was done manually or automatically and the number of annotators involved, were rejected. This step aimed to avoid merging datasets with contradicting annotation instructions, ensuring consistency across the selected datasets.
3. **Numerical Ratings**: Datasets with numerical ratings were accepted. Specifically, Likert-type 5-point scales were mapped onto three-class sentiment labels. Ratings 1 and 2 were mapped to "negative," rating 3 was mapped to "neutral," and ratings 4 and 5 were mapped to "positive." This mapping allowed for consistent sentiment labeling across the datasets.
4. **Three Classes Only**: Datasets annotated with binary sentiment labels were rejected. The decision to focus on datasets with three sentiment classes (negative, neutral, and positive) was made based on the unsatisfactory performance of binary sentiment labeling in three-class settings.
5. **Monolingual Datasets**: In cases where a dataset contained samples in multiple languages, it was divided into independent datasets for each constituent language. This approach ensured that the corpus includes separate datasets for different languages, allowing for targeted analysis and evaluation.
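
The rating-to-label mapping from criterion 3 can be sketched as a small function. The function name `likert_to_sentiment` is hypothetical and is not part of the published preprocessing code; only the mapping itself comes from the criteria above.

```python
def likert_to_sentiment(rating: int) -> str:
    """Map a 5-point Likert rating to a three-class sentiment label.

    Ratings 1-2 -> "negative", 3 -> "neutral", 4-5 -> "positive".
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {rating!r}")
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


# Show the label assigned to each possible rating.
for rating in range(1, 6):
    print(rating, likert_to_sentiment(rating))
```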

By applying these quality assurance criteria, we were able to filter the initial pool of sentiment datasets and select a final set of **79** datasets that met the specified standards for inclusion in the multilingual corpus.

In [None]:
#| eval: false
f"We cover {mms_dataset_df.original_dataset.nunique()} datasets in {mms_dataset_df.language.nunique()} languages."

'We cover 79 datasets in 27 languages.'

In [None]:
#| eval: false
#| hide
labels = mms_dataset["train"].features["label"].names
labels

['negative', 'neutral', 'positive']

In [None]:
#| eval: false
#| hide
mms_dataset_df["label_name"] = mms_dataset_df["label"].apply(lambda x: labels[x])

In [None]:
#| eval: false
f"The classes that we cover: {mms_dataset_df.label_name.unique()}"

"The classes that we cover: ['positive' 'neutral' 'negative']"

## Limitations

Although our collection is the largest public collection of multilingual sentiment datasets, it still covers only 27 languages. The collection is highly biased towards the Indo-European family of languages, English in particular. We attribute this bias to the general culture of scientific publishing and its enforcement of English as the primary carrier of scientific discovery. Our work's main potential negative social impact is that models developed and trained using the provided datasets may still perform better for the major languages. This could further perpetuate the existing language disparities and inequality in sentiment analysis capabilities across different languages. Addressing this limitation and working towards more equitable representation and performance across languages is crucial to avoid reinforcing language biases and the potential marginalization of underrepresented languages. The ethical implications of such disparities should be thoroughly discussed and considered.


![Data Quality](images/quality.png)

An important limitation of our dataset collection is a significant variance in sample quality across datasets and languages. The figure above presents the distribution of the self-confidence label-quality score for each data point, computed with **cleanlab** [@northcutt2021confident]. The distribution of quality is skewed in favor of popular languages, with low-resource languages suffering from data quality issues. A related limitation is caused by an unequal distribution of data modalities across languages. For instance, our benchmark clearly shows that all models universally underperform when tested on Portuguese datasets. This is the direct result of the fact that data points for Portuguese almost exclusively represent the domain of social media. As a consequence, some combinations of filtering facets in our dataset collection produce very little data (e.g., asking for social media data in the Germanic genus of Indo-European languages will produce a significantly larger dataset than asking for news data representing Afro-Asiatic languages).
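
To make the quality score concrete: a self-confidence score is the probability that a classifier assigns to the label a sample was actually given, so low values flag labels the model finds doubtful. The sketch below computes it by hand for three made-up samples and keeps those above a cut-off. The predicted probabilities and the `THRESHOLD` value are assumptions for illustration; in the dataset, the precomputed score is stored in the `cleanlab_self_confidence` column.

```python
LABELS = ["negative", "neutral", "positive"]


def self_confidence(pred_probs, label):
    """Return the predicted probability of the label the sample was given."""
    return pred_probs[label]


# Hypothetical samples: `label` indexes LABELS, `pred_probs` are made-up
# classifier outputs over the three classes.
examples = [
    {"label": 0, "pred_probs": [0.80, 0.15, 0.05]},
    {"label": 1, "pred_probs": [0.60, 0.30, 0.10]},
    {"label": 2, "pred_probs": [0.05, 0.25, 0.70]},
]

THRESHOLD = 0.5  # hypothetical cut-off, not a value recommended by the authors

scores = [self_confidence(e["pred_probs"], e["label"]) for e in examples]

# Keep only the samples whose given label the classifier finds plausible.
kept = [e for e, score in zip(examples, scores) if score >= THRESHOLD]

for example, score in zip(examples, scores):
    print(LABELS[example["label"]], score)
```

On the DataFrame loaded earlier, a similar filter might look like `mms_dataset_df[mms_dataset_df["cleanlab_self_confidence"] >= THRESHOLD]`.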


Finally, we acknowledge the lack of internal coherence of annotation protocols between datasets and languages. We have enforced strict quality criteria and rejected all datasets published without the annotation protocol, but we were unable, for obvious reasons, to unify annotation guidelines. The annotation of sentiment expressions and the assignment of sentiment labels are heavily subjective and, at the same time, influenced by cultural and linguistic features. Unfortunately, it is possible that semantically similar utterances will be assigned conflicting labels if they come from different datasets or modalities. 

## Datasets