CompLex-ZH

This repository accompanies our paper CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese.

Dataset

Mandarin dataset

Please go to the folder /data/mandarin

mandarin.tsv: contains the rating data. Column "word" is the target word, "sentence" is the context sentence, "index" is the position of the target word in the context (starting from 1), "source" is the source of the sentence, and "set" is the split the current word belongs to: training, validation or test.
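As a rough illustration of how the columns described above could be read, here is a minimal sketch that uses only the Python standard library. The helper names (read_rows, group_by_set, target_at_index) and the two sample rows are made up for this example and are not part of the repository. It also assumes that "index" is a 1-based character offset into "sentence"; if the file counts tokens instead, the check would need to be adapted.

```python
import csv
import io
from collections import defaultdict


def read_rows(handle):
    """Parse tab-separated rows (with a header line) into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by_set(rows):
    """Group rows by the value of their "set" column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["set"]].append(row)
    return dict(groups)


def target_at_index(row):
    """Return True if "word" occurs in "sentence" at the 1-based "index".

    Assumption: "index" is a character offset, not a token offset.
    """
    start = int(row["index"]) - 1  # convert 1-based position to 0-based
    return row["sentence"].startswith(row["word"], start)


# Made-up sample rows that mirror the documented column names only.
SAMPLE = (
    "word\tsentence\tindex\tsource\tset\n"
    "学习\t我喜欢学习中文\t4\tdemo\ttraining\n"
    "天气\t今天天气很好\t3\tdemo\ttest\n"
)

rows = read_rows(io.StringIO(SAMPLE))
splits = group_by_set(rows)
for name, members in splits.items():
    print(name, [r["word"] for r in members])
```

To try this on the real file, the same read_rows helper could be given a handle opened with open("data/mandarin/mandarin.tsv", encoding="utf-8", newline=""), assuming the script is run from the repository root.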

mandarin_emb.json: contains ratings with CINO word embeddings

mandarin_vocab_info.csv: contains handcrafted features, including strokes, log frequency and word length.

Cantonese dataset

The Cantonese dataset is available on request. Please contact us at lani.qiu@connect.polyu.hk.

Code

Please refer to regressor.py

Citation

If you refer to our paper or use our dataset, please cite it as follows:

@inproceedings{qiu-etal-2024-complex,
    title = "{C}omp{L}ex-{ZH}: A New Dataset for Lexical Complexity Prediction in {M}andarin and {C}antonese",
    author = "Qiu, Le  and
      Guo, Shanyue  and
      Wong, Tak-Sum  and
      Chersoni, Emmanuele  and
      Lee, John  and
      Huang, Chu-Ren",
    editor = "Shardlow, Matthew  and
      Saggion, Horacio  and
      Alva-Manchego, Fernando  and
      Zampieri, Marcos  and
      North, Kai  and
      {\v{S}}tajner, Sanja  and
      Stodden, Regina",
    booktitle = "Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.tsar-1.3/",
    doi = "10.18653/v1/2024.tsar-1.3",
    pages = "20--26"
}

Disclaimer

The dataset provided may contain various types of content, including sensitive material such as political opinions, adult themes, and other potentially controversial topics.

By accessing or using this dataset, you acknowledge and agree to the following:

Sensitive Content: The dataset may include content that is inappropriate for certain audiences or may not align with personal or organizational values.
No Endorsement: The opinions and statements within the dataset do not reflect the views or positions of the creators or distributors of this dataset. We do not endorse or take responsibility for any content contained within.
No Liability: The creators of this dataset are not liable for any consequences arising from the use or interpretation of the content.
Responsible Use: Users are encouraged to analyze the dataset responsibly and ethically, considering the potential implications of the sensitive content it may contain.
Compliance: It is the user's responsibility to comply with any applicable laws and regulations regarding the use and distribution of data that may contain sensitive material.

By proceeding to use the dataset, you confirm that you understand and accept these terms.
