A Project Using Data from the Lancaster Corpus of Mandarin Chinese

Data-Science-for-Linguists/Study_of_the_LCMC

Kyle Landin ktl14@pitt.edu 12/15/2017

A Study of the Lancaster Corpus of Mandarin Chinese

Summary:

This project takes a closer look at the data in the Lancaster Corpus of Mandarin Chinese (LCMC). The corpus is distributed in .xml format, which I believe can be difficult to read or use for linguists who lack the technical skills or software to work with .xml files. For this reason, I believe that transforming the corpus data into dataframes, or something similarly accessible, would be worthwhile. As it is now, the LCMC divides its data into two separate categories: Chinese characters and Pinyin (the romanization of Chinese). One goal of this project is to put the two side by side so that the pinyin is accessible alongside the characters it corresponds to. The LCMC also includes a very in-depth catalogue of tags that mark each word's part of speech. Another goal is to include the part of speech for each word alongside its characters and pinyin.
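The side-by-side layout described above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the two XML snippets are made-up stand-ins for one sentence from the character and pinyin versions, and the element and attribute names (`s`, `w`, `c`, `POS`), the tag values, and the pinyin notation are all assumptions about the corpus format. The sketch pairs tokens by position, which only works if both versions split the sentence into the same tokens.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified snippets standing in for one sentence from the
# character version and the pinyin version of the corpus. The element names,
# POS values, and pinyin spelling are illustrative assumptions.
CHARACTER_XML = (
    '<s n="0001"><w POS="r">我</w><w POS="v">喜欢</w>'
    '<w POS="n">音乐</w><c POS="ew">。</c></s>'
)
PINYIN_XML = (
    '<s n="0001"><w POS="r">wo3</w><w POS="v">xi3huan1</w>'
    '<w POS="n">yin1yue4</w><c POS="ew">。</c></s>'
)


def align_sentence(character_xml, pinyin_xml):
    """Pair each character token with the pinyin token at the same position."""
    char_tokens = list(ET.fromstring(character_xml))
    pinyin_tokens = list(ET.fromstring(pinyin_xml))

    # Positional pairing is only valid when both versions have the same
    # number of tokens, so fail loudly instead of silently misaligning.
    if len(char_tokens) != len(pinyin_tokens):
        raise ValueError("token counts differ; cannot align by position")

    rows = []
    for char_tok, pinyin_tok in zip(char_tokens, pinyin_tokens):
        rows.append(
            {
                "characters": char_tok.text,
                "pinyin": pinyin_tok.text,
                "pos": char_tok.get("POS"),  # part-of-speech tag attribute
            }
        )
    return rows


rows = align_sentence(CHARACTER_XML, PINYIN_XML)
for row in rows:
    print(row["characters"], row["pinyin"], row["pos"], sep="\t")
```

A list of dictionaries like `rows` can be handed to `pandas.DataFrame(rows)` to get one dataframe row per token, with characters, pinyin, and part of speech as columns.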

The Lancaster Corpus of Mandarin Chinese is designed as a Chinese match for the FLOB and FROWN corpora of modern British and American English. It contains 15 text categories, ranging from news to fiction to government documents. A link to the download is provided here.

Project Directory

  • .gitignore: A file used to ignore specific files within the local repo.
  • README.md: A project summary, a link to download the files for the LCMC corpus, and a directory of the files associated with this project.
  • LICENSE.md: The License used for this project.
  • LICENSE_notes.md: Justification for the chosen license.
  • project_plan.md: My initial plans for this project.
  • progress_report.md: A list of my updates throughout my project.
  • final_report.md: My final thoughts on and analysis of the data in the LCMC.
  • LCMC_Compiled_Data.ipynb: The final form of my code with compiled lists, a dataframe of unique words in the LCMC, and my data collection.
  • LCMC_Compiled_Data.md: The markdown version of my final code.
  • 2474: All of the data from the Lancaster Corpus of Mandarin Chinese.
  • images: .png files of any graphs or pictures produced by the code.
  • Presentation: My class presentation slides and the demonstration code I used.
  • Previous_Code: Practice and discarded code from early in the project cycle.

You can view my visitor's log by clicking here.
