esl-syntactic-analysis

Tianyi Zheng, tiz65@pitt.edu, May 1, 2022

For my final report for this project, see final_report.md (nbviewer version)

For fellow LING 1340 students, here's a link to my guestbook.

This final project for LING 1340 (Data Science for Linguists) analyzes written data from ESL learners and identifies differences in their syntax based on native language (L1) and proficiency level. The goal is to determine how quantitative measures of syntactic complexity differ between more advanced and less advanced learners, and whether those measures differ between learners with different L1s.

The project processes and analyzes written ESL samples from the PELIC dataset (Juffs, Han, & Naismith, 2020) using the TAASSC program (Kyle, 2016) with its SCA features (Lu, 2010).

For an overview of the goals of the project, see project-plan.md. For an overview of its development, see progress-report.md. The Jupyter notebooks, which contain the data cleaning and analysis, should be approached in the following order:

Data samples for both PELIC and TAASSC used in the analysis can be found under data_samples/.

Repo Contents

./
|--data_samples/
|  |--pelic-sample.csv          # First 100 rows of PELIC dataset
|  |--taassc.csv                # Clause complexity measures for pelic-sample.csv
|  |--taassc_sca.csv            # SCA complexity measures for pelic-sample.csv
|--data-overview.ipynb          # Initial exploratory data analysis
|--final_report.md              # Final report for project
|--final-analysis.ipynb         # Final data visualization and analysis
|--LICENSE.md                   # License for project
|--prepare-final-data.ipynb     # Preparation of final dataset
|--presentation.pdf             # Project presentation
|--progress-report.md           # Progress reports for project
|--project-plan.md              # Initial description of project plans
|--README.md                    # This README file
|--taassc-prep.ipynb            # Exploratory data analysis of TAASSC output

Glossary

ESL: English as a Second Language
L1: First language
PELIC: The Pitt English Language Institute Corpus (PELIC) is a dataset of written samples from ESL students at the English Language Institute at the University of Pittsburgh (Pitt).
TAASSC: The Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC) is a program developed by K. Kyle (see references) that calculates numerical measures of syntactic complexity.
T-unit: A T-unit is a generalization of a sentence. More specifically, a T-unit consists of an independent clause and all of its associated dependent clauses:
- "Because the sentence only has one independent clause, it has one T-unit." has one T-unit.
- "This is a compound sentence because it has two independent clauses; as a result, it has two T-units." has two T-units.
SCA: The Syntactic Complexity Analyzer (SCA) is a program developed by X. Lu (see references) that calculates numerical measures of syntactic complexity. TAASSC builds upon SCA by including all of the measures from SCA as well as many new measures.
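To make the T-unit definition concrete, here is a minimal sketch of one length-based measure, mean length of T-unit (words per T-unit), which is among the measures described in Lu (2010). It assumes the text has already been split into T-units by hand and counts words by splitting on whitespace; the function name is hypothetical, and TAASSC and SCA rely on a syntactic parser rather than this shortcut, so their counts may differ.

```python
def mean_length_of_t_unit(t_units):
    """Return the average number of whitespace-separated words per T-unit.

    Illustrative only: assumes `t_units` is a list of strings that were
    segmented into T-units beforehand (here, by hand).
    """
    if not t_units:
        raise ValueError("need at least one T-unit")
    word_counts = [len(t_unit.split()) for t_unit in t_units]
    return sum(word_counts) / len(word_counts)


# The second glossary example, split by hand into its two T-units.
t_units = [
    "This is a compound sentence because it has two independent clauses",
    "as a result, it has two T-units",
]

mlt = mean_length_of_t_unit(t_units)
print(mlt)  # (11 words + 7 words) / 2 T-units = 9.0
```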

Proficiency Level Codes
`level_id`	Level Description	CEFR Level
2	Pre-Intermediate	A2 - B1
3	Intermediate	B1
4	Upper-Intermediate	B1 - B2
5	Advanced	B2 - C1
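When working with the data, it can help to translate the numeric `level_id` codes into readable labels. The following is a small sketch that copies the table above into a dictionary; the names `PROFICIENCY_LEVELS` and `describe_level` are hypothetical and do not come from the project notebooks.

```python
# level_id -> (level description, CEFR range), copied from the table above.
PROFICIENCY_LEVELS = {
    2: ("Pre-Intermediate", "A2 - B1"),
    3: ("Intermediate", "B1"),
    4: ("Upper-Intermediate", "B1 - B2"),
    5: ("Advanced", "B2 - C1"),
}


def describe_level(level_id):
    """Return a readable label for a level_id, e.g. for plot legends."""
    try:
        description, cefr = PROFICIENCY_LEVELS[level_id]
    except KeyError:
        raise ValueError(f"unknown level_id: {level_id!r}") from None
    return f"{description} (CEFR {cefr})"


print(describe_level(4))
```

Raising an error for unknown codes, rather than returning a default label, makes unexpected values in the data visible early.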

References

Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Institute Corpus (PELIC) [Data set]. http://doi.org/10.5281/zenodo.3991977

Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication. (Doctoral dissertation).

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474-496.

License

This project is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
