A Study of the Lancaster Corpus of Mandarin Chinese

Kyle Landin ktl14@pitt.edu 12/15/2017

Summary:

This project takes a closer look at the data in the Lancaster Corpus of Mandarin Chinese (LCMC). The corpus is distributed in .xml format, which I believe can be difficult to read or use for linguists who lack the technical skills or software to work with .xml files. For this reason, I believe that transforming the corpus data into dataframes or a similarly accessible format would be worthwhile. As it stands, the LCMC divides its data into two separate categories: Chinese characters and Pinyin (the romanization of Chinese). One goal of this project is to put the two side by side so that the pinyin is accessible alongside the character it corresponds to. The LCMC also includes a very in-depth catalogue of tags for each word, marking its part of speech. Another goal is to include the part of speech for each word alongside the characters and pinyin.
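As a rough illustration of the side-by-side idea, here is a minimal sketch and not the code from the project notebook. It assumes that the character and pinyin files share a parallel layout in which sentences are `<s n="...">` elements and words are `<w POS="...">` elements. Those element and attribute names, the tiny sample strings, and the `align_tokens` helper are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature samples. The element and attribute names
# (<s>, <w>, n, POS) are assumptions about the corpus layout.
CHAR_XML = (
    '<text><s n="0001">'
    '<w POS="r">我</w><w POS="v">喜欢</w><w POS="n">音乐</w>'
    '</s></text>'
)
PINYIN_XML = (
    '<text><s n="0001">'
    '<w POS="r">wo3</w><w POS="v">xi3huan1</w><w POS="n">yin1yue4</w>'
    '</s></text>'
)


def align_tokens(char_xml, pinyin_xml):
    """Pair each character token with its pinyin and part-of-speech tag."""
    char_root = ET.fromstring(char_xml)
    pinyin_root = ET.fromstring(pinyin_xml)
    rows = []
    # Walk the two files sentence by sentence, in document order.
    for char_s, pin_s in zip(char_root.iter("s"), pinyin_root.iter("s")):
        char_words = char_s.findall("w")
        pin_words = pin_s.findall("w")
        if len(char_words) != len(pin_words):
            continue  # skip sentences whose tokens do not line up
        for char_w, pin_w in zip(char_words, pin_words):
            rows.append({
                "sentence": char_s.get("n"),
                "characters": char_w.text,
                "pinyin": pin_w.text,
                "pos": char_w.get("POS"),
            })
    return rows


rows = align_tokens(CHAR_XML, PINYIN_XML)
for row in rows:
    print(row)
```

If pandas is installed, a list of dictionaries like `rows` can be passed to `pandas.DataFrame(rows)` to get one row per word with columns for characters, pinyin, and part of speech.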

The Lancaster Corpus of Mandarin Chinese is a corpus designed as a match for the FLOB and FROWN corpora for modern British and American English. It contains 15 categories ranging from news to fiction to government documents. A link to the download is provided here.

Project Directory

.gitignore: A file used to ignore specific files within the local repo.
README.md: A project summary, a provided link to download the associated files for the LCMC corpus, and a directory of the files associated with this project.
LICENSE.md: The License used for this project.
LICENSE_notes.md: Justification for the chosen license.
project_plan.md: My initial plans for this project.
progress_report.md: A list of my updates throughout my project.
final_report.md: My final thoughts on and analysis of the data in the LCMC.
LCMC_Compiled_Data.ipynb: The final form of my code with compiled lists, a dataframe of unique words in the LCMC, and my data collection.
LCMC_Compiled_Data.md: The markdown version of my final code.
2474: All of the data from the Lancaster Corpus of Mandarin Chinese.
images: .png files of any graphs or pictures produced by the code.
Presentation: My class presentation slides and the demonstration code I used.
Previous_Code: Practice and discarded code from early in the project cycle.

You can visit my visitor's log by clicking here.
