Scrambling in Persian based on a Subtitle Corpus

Hello, viewer! My name is John R. Starr, and I am an undergraduate student at the University of Pittsburgh. I am double majoring in English Writing (Poetry) and Linguistics, the latter of which is the reason that this repo exists in the first place. This repo is the term project for my Data Science for Linguists course. Now, let's dive in, shall we?

Description of Project

Persian is an SOV language that is prone to scrambling to SVO in a variety of contexts. This project examines SOV, SVO, and other, more complex word orders, hoping to find some possible explanations for scrambling. In particular, it analyzes how Persian similes and complex sentences are affected by scrambling.
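To make the word-order labels concrete, here is a minimal sketch of how a label such as SOV or SVO could be read off a sequence of chunks. The role names (`SUBJ`, `OBJ`, `VERB`, `PP`) and the `word_order` function are hypothetical and chosen only for illustration; they are not the tag set or the code used in the notebooks.

```python
# Hypothetical role labels, assumed for illustration only.
ROLE_LETTERS = {"SUBJ": "S", "OBJ": "O", "VERB": "V"}


def word_order(chunk_roles):
    """Return a label such as 'SOV' from an ordered list of chunk roles.

    Roles that are not subject, object, or verb are skipped, and only the
    first occurrence of each core role is counted.
    """
    order = ""
    for role in chunk_roles:
        letter = ROLE_LETTERS.get(role)
        if letter is not None and letter not in order:
            order += letter
    return order


print(word_order(["SUBJ", "OBJ", "VERB"]))        # SOV (canonical order)
print(word_order(["SUBJ", "VERB", "OBJ"]))        # SVO (scrambled order)
print(word_order(["SUBJ", "PP", "OBJ", "VERB"]))  # SOV (PP is ignored)
```

A real pipeline would need to decide which noun phrase is the subject and which is the object before this step; the sketch assumes that decision has already been made.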

The data for this project comes from the Tehran English-Persian Parallel Corpus (TEP), a compilation of over 550,000 subtitles in both English and Persian. The corpus consists of two large .xml files, one for each language, with easy-to-read indices that align the two files. The corpus is available for free download here and can be used for any research and/or non-commercial purposes.
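As a rough illustration of how two index-aligned .xml files can be paired up, the sketch below parses two small in-memory samples and joins them on a shared index. The element name `s`, the attribute name `id`, and the sample sentences are assumptions made for this example; check them against the actual corpus files before reusing the code.

```python
import xml.etree.ElementTree as ET

# Tiny stand-ins for the two corpus files. The <s id="..."> layout is an
# assumption made for this sketch, not a description of the real files.
EN_SAMPLE = '<corpus><s id="1">Hello.</s><s id="2">Where are you?</s></corpus>'
FA_SAMPLE = '<corpus><s id="1">سلام.</s><s id="2">کجایی؟</s></corpus>'


def read_sentences(xml_text, tag="s", id_attr="id"):
    """Map each sentence index to its text for one language."""
    root = ET.fromstring(xml_text)
    return {
        el.get(id_attr): (el.text or "").strip()
        for el in root.iter(tag)
        if el.get(id_attr) is not None
    }


def align(en_xml, fa_xml):
    """Pair English and Persian lines that share the same index."""
    en = read_sentences(en_xml)
    fa = read_sentences(fa_xml)
    return [(idx, en[idx], fa[idx]) for idx in en if idx in fa]


pairs = align(EN_SAMPLE, FA_SAMPLE)
for idx, english, persian in pairs:
    print(idx, english, persian)
```

For the full corpus, `ET.parse(path).getroot()` could replace `ET.fromstring`, and the resulting list of tuples can be passed to a DataFrame constructor for further analysis.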

Comments on my project from my peers can be found here.

Repo Directory:

Notebooks:
- 1. Data Summary focuses on the structure of the data itself. NB Version
- 2. Tagging, Chunking POS-tags the Persian and English data and chunks the Persian data. NB Version
- 3. Chunking English, as the name suggests, chunks the English data. This is a separate file due to some limitations of running Anaconda on Windows. NB Version
- 4. Generalizing Chunks is where word orders are derived from the chunks. NB Version
- 5. Data Analysis navigates the data for three different structures, looking to make sense of why the word orders may be the way they are. NB Version

Project Information:
- Project Plan outlines the preliminary ideas for this project.
- Progress Report shows the process of building this repo throughout the semester.
- Information on Persian provides some basic information about Persian.
- Final Report sums up the project as a whole.

Images contains the image files used in my notebooks.

Data Samples contains an example DataFrame constructed from the files, along with a text document listing all the changes that I made to the original file.

References: M. T. Pilevar, H. Faili, and A. H. Pilevar, "TEP: Tehran English-Persian Parallel Corpus", in Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2011).

