John's term project, looking at syntactic scrambling in Persian. May change depending on available resources.

Data-Science-for-Linguists-2019/Scrambling-in-English-to-Persian-Subtitles

Scrambling in Persian based on a Subtitle Corpus

Hello, viewer! My name is John R. Starr, and I am an undergraduate student at the University of Pittsburgh. I am double majoring in English Writing (Poetry) and Linguistics, and the latter is the reason this repo exists in the first place. This repo is the term project for my Data Science for Linguists course. Now, let's dive in, shall we?

Description of Project

Persian is an SOV language that is prone to scrambling to SVO in a variety of contexts. This project examines SOV, SVO, and other, more complex word orders in search of possible explanations for scrambling. In particular, it analyzes how scrambling affects Persian similes and complex sentences.
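To make the SOV/SVO distinction concrete, here is a minimal Python sketch that reads a word order off a sequence of role labels. It is only an illustration, not the method this project uses: the function name `word_order` and the labels `S`, `O`, `V`, and `X` are hypothetical, and the labels themselves would have to come from a parser or from manual annotation.

```python
def word_order(roles):
    """Return the order in which S, O, and V first appear, e.g. 'SOV'.

    `roles` is a list of per-token role labels (hypothetical scheme):
    'S' = subject, 'O' = object, 'V' = verb; any other label is ignored.
    """
    order = []
    for role in roles:
        if role in ("S", "O", "V") and role not in order:
            order.append(role)
    return "".join(order)


# Canonical order: من کتاب را دیدم ("I saw the book"), with را labeled 'X'.
canonical = word_order(["S", "O", "X", "V"])

# Scrambled order: the verb precedes the object.
scrambled = word_order(["S", "V", "O", "X"])

print(canonical, scrambled)
```

Counting the strings this function returns over many annotated sentences would give a rough picture of how often each order occurs.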

This project draws its data from the Tehran English-Persian Parallel Corpus (TEP), a compilation of over 550,000 subtitles in both English and Persian. The corpus consists of two large .xml files, one for each language, with easy-to-read indices that align the two files. It is available for free download here and can be used for any research and/or non-commercial purposes.
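Because the two files share indices, pairing an English subtitle with its Persian counterpart can be done by matching on those indices. The Python sketch below shows one way to do this, assuming each sentence is an element with an identifying attribute. The tag name `s`, the attribute name `id`, and the toy XML strings are assumptions for illustration, so check them against the actual corpus files before use.

```python
import xml.etree.ElementTree as ET


def load_sentences(xml_text, tag="s", id_attr="id"):
    """Map each sentence index to its text.

    The tag and attribute names are assumptions, not the corpus's
    documented schema.
    """
    root = ET.fromstring(xml_text)
    return {
        el.get(id_attr): "".join(el.itertext()).strip()
        for el in root.iter(tag)
        if el.get(id_attr) is not None
    }


def align(en_xml, fa_xml):
    """Return (index, English, Persian) triples for indices found in both files."""
    en = load_sentences(en_xml)
    fa = load_sentences(fa_xml)
    return [(i, en[i], fa[i]) for i in en if i in fa]


# Toy stand-ins for the two corpus files (hypothetical structure).
EN_SAMPLE = '<doc><s id="1">I saw the book.</s><s id="2">Hello.</s></doc>'
FA_SAMPLE = '<doc><s id="1">من کتاب را دیدم.</s><s id="3">خداحافظ.</s></doc>'

pairs = align(EN_SAMPLE, FA_SAMPLE)
for index, english, persian in pairs:
    print(index, english, persian)
```

For the real files, which are large, reading them incrementally with `xml.etree.ElementTree.iterparse` may be preferable to loading each file into memory at once.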

Comments on my project from my peers can be found here.

Repo Directory:

References: M. T. Pilevar, H. Faili, and A. H. Pilevar, “TEP: Tehran English-Persian Parallel Corpus”, in Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2011).
