# Data Analysis
#### John R. Starr; jrs294@pitt.edu
Time for some analysis! What's on the schedule for today?

### Table of Contents
- [Section 1: Loading in Files](#Section-1:-Loading-in-Files)
- [Section 2: Examining SOV Data](#Section-2:-Examining-SOV-Data)
- [Section 3: Examining SVO Data](#Section-3:-Examining-SVO-Data)
- [Section 4: Examining Other Data](#Section-4:-Examining-Other-Data)
- [Section 5: Conclusions](#Section-5:-Conclusions)

### Section 1: Loading in Files

As usual, we'll start by loading the modules and the DF:

In [1]:
import nltk
import numpy as np
import pandas as pd

In [2]:
# Releasing all output:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
df = pd.read_pickle('only_ordered_final.pkl')

A breakdown on what we've got:

In [4]:
df.head()

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
4,5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{stop, please}","{داريد, خواهش, ميکنم, نگه, دست}","[(دست, N), (نگه, N), (داريد, V), (خواهش, Ne), ...",[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [...,"[(stop, JJ), (please, NN), (stop, VB)]","[[(stop, JJ), (please, NN)], [[('stop', 'VB')]]]",SVO
5,6,you have a week evans then well burn the house,اوانز تو فقط يک هفته وقت داري وگرنه خونتو خواه...,"[you, have, a, week, evans, then, well, burn, ...","[اوانز, تو, فقط, يک, هفته, وقت, داري, وگرنه, خ...",10,11,"{then, the, well, burn, house, evans, week, yo...","{وقت, وگرنه, خواهيم, سوزوند, داري, خونتو, فقط,...","[(اوانز, Ne), (تو, PRO), (فقط, ADV), (يک, Ne),...",[اوانز تو NP] [فقط يک هفته وقت ADVP] [داري VP]...,"[(you, PRP), (have, VBP), (a, DT), (week, NN),...","[(you, PRP), [[('have', 'VBP')], [('a', 'DT'),...",SVO
8,9,god damn it put that down,لعنت به تو اونو بذار زمين,"[god, damn, it, put, that, down]","[لعنت, به, تو, اونو, بذار, زمين]",6,6,"{damn, put, that, it, god, down}","{بذار, اونو, لعنت, تو, زمين, به}","[(لعنت, N), (به, P), (تو, PRO), (اونو, PRO), (...",[لعنت NP] [به PP] [تو NP] [اونو NP] [بذار VP] ...,"[(god, NN), (damn, VBZ), (it, PRP), (put, VBD)...","[[(god, NN)], [[('damn', 'VBZ')]], (it, PRP), ...",SOV
10,11,its the last feed weve got,اين آخرين علوفه اي بود که ما داشتيم,"[its, the, last, feed, weve, got]","[اين, آخرين, علوفه, اي, بود, که, ما, داشتيم]",6,8,"{feed, its, the, weve, got, last}","{داشتيم, بود, اي, آخرين, ما, اين, که, علوفه}","[(اين, Ne), (آخرين, NUM), (علوفه, N), (اي, N),...",[اين آخرين علوفه NP] [اي NP] [بود VP] که [ما N...,"[(its, PRP$), (the, DT), (last, JJ), (feed, NN...","[(its, PRP$), [(the, DT), (last, JJ), (feed, N...",SOV
13,14,herds over the ridge by now you go get cleaned up,گله را آوردم بيرون الان تو برو اونجا را تميز کن,"[herds, over, the, ridge, by, now, you, go, ge...","[گله, را, آوردم, بيرون, الان, تو, برو, اونجا, ...",11,11,"{the, by, over, up, get, now, go, ridge, you, ...","{گله, بيرون, اونجا, تو, آوردم, برو, الان, تميز...","[(گله, N), (را, POSTP), (آوردم, V), (بيرون, CO...",[گله NP] [را POSTP] [آوردم VP] بيرون [الان ADV...,"[(herds, NNS), (over, IN), (the, DT), (ridge, ...","[(herds, NNS), [[('over', 'IN')], [('the', 'DT...",SVO


Next, we're going to reset the index in place, to make the DF a little easier to navigate!

In [5]:
df.reset_index(drop = True, inplace = True)

What are some basic statistics about our data?

In [6]:
print(len(df))
print()
print(df['Word_Order_Final'].value_counts())

189561

SOV    120607
SVO     68954
Name: Word_Order_Final, dtype: int64
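
As a quick cross-check on the split, here is a minimal sketch of getting the shares straight from `value_counts`. The toy Series named `orders` is a stand-in rebuilt from the counts printed above, not the real DF, and the variable names are my own.

```python
import pandas as pd

# Hypothetical stand-in rebuilt from the counts printed above (not the real DF)
orders = pd.Series(['SOV'] * 120607 + ['SVO'] * 68954, name='Word_Order_Final')

# normalize=True returns fractions of the total instead of raw counts
shares = orders.value_counts(normalize=True)

# How many SOV rows there are for every SVO row
ratio = (orders == 'SOV').sum() / (orders == 'SVO').sum()

print(shares.round(3))
print('SOV-to-SVO ratio:', round(ratio, 2))
```

On the real data the same two calls would run on `df['Word_Order_Final']` directly.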


Well, we have nearly twice as many SOV sentences as SVO sentences -- this isn't a bad thing, since Persian is underlyingly SOV. Let's separate the different orderings into their own respective DFs and then examine some of the structures that we find in both.

In [7]:
#  Sorting by the values
df.sort_values(by=['Word_Order_Final'], inplace = True)

In [8]:
# Creating new DFs; .copy() makes each one independent of df
sov_df = df.loc[df.Word_Order_Final == 'SOV'].copy()
svo_df = df.loc[df.Word_Order_Final == 'SVO'].copy()

In [9]:
# Making sure they are the proper lengths
print(len(sov_df))
print(len(svo_df))

120607
68954


Cool! Let's start with the SOV data first.

### Section 2: Examining SOV Data
Just a snippet of what we're working with:

In [10]:
sov_df.head()

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
94780,276464,i will never sell you my land your land,من هيچوقت زمينم را نميفروشم زمينت را,"[i, will, never, sell, you, my, land, your, land]","[من, هيچوقت, زمينم, را, نميفروشم, زمينت, را]",9,7,"{will, land, your, my, sell, never, you, i}","{هيچوقت, زمينت, نميفروشم, من, زمينم, را}","[(من, PRO), (هيچوقت, CONJ), (زمينم, N), (را, P...",[من NP] هيچوقت [زمينم NP] [را POSTP] [نميفروشم...,"[(i, NN), (will, MD), (never, RB), (sell, VB),...","[[(i, NN)], (will, MD), (never, RB), [[('sell'...",SOV
157247,459494,i'm going to cut him down. don't.,ميارمش پايين اين كارو نكن,"[i, 'm, going, to, cut, him, down, ., do, n't, .]","[ميارمش, پايين, اين, كارو, نكن]",11,5,"{to, going, cut, 'm, him, n't, do, i, ., down}","{نكن, پايين, اين, ميارمش, كارو}","[(ميارمش, N), (پايين, Pe), (اين, Ne), (كارو, N...",[ميارمش NP] [پايين PP] [اين كارو NP] [نكن VP],"[(i, NN), ('m, VBP), (going, VBG), (to, TO), (...","[[(i, NN)], [[(""'m"", 'VBP')]], [[('going', 'VB...",SOV
89398,261678,oh my gosh i know i cant beiieve it,اوه خداي من ميدونم ، نميتونم باور کنم,"[oh, my, gosh, i, know, i, cant, beiieve, it]","[اوه, خداي, من, ميدونم, ،, نميتونم, باور, کنم]",9,8,"{my, it, gosh, beiieve, know, i, cant, oh}","{باور, خداي, ،, نميتونم, من, کنم, ميدونم, اوه}","[(اوه, RES), (خداي, Pe), (من, PRO), (ميدونم, V...",[اوه NP] [خداي من NP] [ميدونم VP] ، [نميتونم V...,"[(oh, UH), (my, PRP$), (gosh, NN), (i, NN), (k...","[(oh, UH), (my, PRP$), [(gosh, NN), (i, NN)], ...",SOV
157249,459500,danielle don't.,دانيل ، اين كارو نكن,"[danielle, do, n't, .]","[دانيل, ،, اين, كارو, نكن]",4,5,"{do, ., danielle, n't}","{نكن, ،, دانيل, اين, كارو}","[(دانيل, N), (،, PUNC), (اين, Ne), (كارو, N), ...",[دانيل NP] ، [اين كارو NP] [نكن VP],"[(danielle, NNS), (do, VBP), (n't, RB), (., .)]","[(danielle, NNS), [[('do', 'VBP')]], (n't, RB)...",SOV
89396,261676,hey you guys what this mean,هي ، شماها معني اين چيه,"[hey, you, guys, what, this, mean]","[هي, ،, شماها, معني, اين, چيه]",6,6,"{this, what, guys, mean, you, hey}","{،, معني, شماها, چيه, هي, اين}","[(هي, N), (،, PUNC), (شماها, PRO), (معني, N), ...",[هي NP] ، [شماها NP] [معني اين VP] چيه,"[(hey, NN), (you, PRP), (guys, VBP), (what, WP...","[[(hey, NN)], (you, PRP), [[('guys', 'VBP')]],...",SOV


In [11]:
# Making the indexes easier to work with
sov_df.reset_index(drop = True, inplace = True)

Just out of curiosity, I want to see if there's any big difference between the average lengths of the English and Persian SOV sentences:

In [12]:
avg_eng_len = []
for ln in sov_df['Eng_Len']:
    avg_eng_len.append(ln)
print('The average English sentence length in the SOV file is...', (sum(avg_eng_len))/len(sov_df))

The average English sentence length in the SOV file is... 8.797441276211165


In [13]:
avg_far_len = []
for ln in sov_df['Far_Len']:
    avg_far_len.append(ln)
print('The average Persian sentence length in the SOV file is...', (sum(avg_far_len))/len(sov_df))

The average Persian sentence length in the SOV file is... 8.894118915154179


Not too big of a difference, it seems. I figured this wouldn't have much of an impact, but it was worth at least checking.
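
The two averages above can also be computed in one vectorized call. This is a minimal sketch on a toy DF (the name `toy` is mine); its length values are copied from the first three SOV rows shown earlier, so the numbers it prints are not the corpus averages.

```python
import pandas as pd

# Toy rows: Eng_Len / Far_Len values copied from the first three SOV rows shown above
toy = pd.DataFrame({'Eng_Len': [9, 11, 9], 'Far_Len': [7, 5, 8]})

# One call replaces both append-and-sum loops
avg_lens = toy[['Eng_Len', 'Far_Len']].mean()

print(avg_lens)
```

Swapping `toy` for `sov_df` should reproduce the figures printed by the loops.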

Let's see how sentences involving the phrase "is like" fared! This phrase is a common (and easily searchable) way of looking for similes.

In [14]:
is_like_bool = []
for line in sov_df['Eng']:
    if 'is like' in line:
        is_like_bool.append(True)
    else:
        is_like_bool.append(False)
has_like = pd.Series(is_like_bool)

In [15]:
like_sov_df = sov_df[has_like]
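
The loop-and-mask filter used here can be sketched more compactly with `Series.str.contains`. The toy DF below is a stand-in: its first two sentences come from the tables above and the third is made up to show a side effect of substring matching. The stricter word-boundary variant is my own suggestion, not something the analysis above uses.

```python
import pandas as pd

# Toy stand-in for sov_df: first two sentences from the tables above, third is made up
toy = pd.DataFrame({'Eng': ['capricorns castle is like a fivestar hotel',
                            'god damn it put that down',
                            'this likeness is uncanny']})

# Plain substring test, equivalent to `'is like' in line`
substring_mask = toy['Eng'].str.contains('is like', regex=False)

# Stricter variant: \b requires "is" and "like" to be whole words
word_mask = toy['Eng'].str.contains(r'\bis like\b', regex=True)

print(substring_mask.tolist())
print(word_mask.tolist())

# The mask shares the DF's index, so it can be used to filter directly
like_toy = toy[word_mask]
```

The made-up third sentence matches the substring test because "this likeness" contains the characters "is like", which is the kind of row the word-boundary version filters out.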

In [16]:
print(len(like_sov_df))
like_sov_df.head(3)

56


Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
14003,272686,capricorns castle is like a fivestar hotel,قلعه كاپريكورن مثل يک هتل 5 ستاره است,"[capricorns, castle, is, like, a, fivestar, ho...","[قلعه, كاپريكورن, مثل, يک, هتل, 5, ستاره, است]",7,8,"{fivestar, hotel, is, castle, like, a, caprico...","{5, است, يک, ستاره, كاپريكورن, مثل, هتل, قلعه}","[(قلعه, Ne), (كاپريكورن, N), (مثل, ADVe), (يک,...",[قلعه كاپريكورن NP] [مثل PP] [يک هتل NP] [5 ست...,"[(capricorns, NNS), (castle, NN), (is, VBZ), (...","[(capricorns, NNS), [(castle, NN)], [[('is', '...",SOV
17108,206774,i know what it is like to try and scrape by wi...,من ميدونم كه زندگي بدون پشتيبان چه جون كندني ميشه,"[i, know, what, it, is, like, to, try, and, sc...","[من, ميدونم, كه, زندگي, بدون, پشتيبان, چه, جون...",14,10,"{and, by, to, without, scrape, it, is, what, l...","{زندگي, ميشه, چه, بدون, من, ميدونم, كه, پشتيبا...","[(من, PRO), (ميدونم, Ne), (كه, DET), (زندگي, N...",[من NP] [ميدونم كه زندگي NP] [بدون PP] [پشتيبا...,"[(i, NN), (know, VBP), (what, WP), (it, PRP), ...","[[(i, NN)], [[('know', 'VBP')]], (what, WP), (...",SOV
17691,204305,having an ulcer is like having a burglar alarm...,يه چيز حسابي حول و حوش زخم معده م انگار يک دزد...,"[having, an, ulcer, is, like, having, a, burgl...","[يه, چيز, حسابي, حول, و, حوش, زخم, معده, م, ان...",13,19,"{burglar, having, is, like, off, inside, ulcer...","{بياد, و, زخم, وجود, انگار, صدا, يک, حول, در, ...","[(يه, N), (چيز, ADV), (حسابي, N), (حول, N), (و...",[يه NP] [چيز ADVP] [حسابي NP] [حول PP] [و حوش ...,"[(having, VBG), (an, DT), (ulcer, NN), (is, VB...","[[[('having', 'VBG')], [('an', 'DT'), ('ulcer'...",SOV


This is a really small DF, with only 56 entries. Perhaps this is because "is like" is more common as a written phrase, rather than a spoken one.

Regardless of the size of the DF, it is clear to see some patterns. Persian's functional equivalents to the words "like" and "that" are "مثل" and "که" respectively. If you look closely at the data, nearly all the Persian sentences in this dataset use those words! And, conveniently enough, you see that those same sentences maintain the SOV structure that we anticipated before.

Let's look at some of the sentences in this set:

In [17]:
for sample in like_sov_df['Far_Chunks'].sample(5):
    print(sample)
    print()

[اين NP] [مثل PP] [سرپرست هست، NP] [کسي VP] که [سخت ADVP] [کار مي NP] [کنه VP] تا [شمشير NP] [درست ADJP] [کنه VP] .

[اینکه NP] [دنیا NP] [مثل PP] [یک کاتالوگ NP] [باشه VP]

[تسليم شدن NP] [به PP] [بو NP] [يه VP] [او NP] [مانند PP] [تسليم شدن يه NP] [هان NP] [است VP]

[مطالعه طبیعت NP] [این طور NP] [است VP]

[قلعه كاپريكورن NP] [مثل PP] [يک هتل NP] [5 ستاره NP] [است VP]



Most of these sentences use مثل and که to produce similes, but still maintain the SOV structure; the first, second, and fifth sentences, which use these functional words, mean the following:
- 1 "He is like a guardian, someone that works to protect completely with a sword."
- 2 "This world is like a catalog"
- 5 "Capricorn's castle is like a five-star hotel"

However, the third and fourth sentences do not use either مثل or که to produce idiomatic expressions, preferring instead the path of direct metaphor over similes. It will be interesting to see if this is a trend that carries over to the other word orders.

In the meantime, let's search the Persian data for these two functional words and see if there's anything interesting that can be drawn. Unfortunately, I could not figure out a way to specify what a "simile" context would be, so I just raw-searched for مثل and که. The following few examples will hopefully shed some light on their usage.

In [18]:
# Creating a DF for مثل
مثل_bool = []
for line in sov_df['Far']:
    if 'مثل' in line:
        مثل_bool.append(True)
    else:
        مثل_bool.append(False)
has_مثل = pd.Series(مثل_bool)

In [19]:
مثل_sov_df = sov_df[مثل_bool]

Again, let's examine what the data looks like and then check out some example sentences:

In [20]:
مثل_sov_df.tail(3)
len(مثل_sov_df)

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
120366,110932,kids you like a famiiy say no to where he is ...,بچهها ، شما مثل يک خانواده ميمونيد نگو مثل خان...,"[kids, you, like, a, famiiy, say, no, to, wher...","[بچهها, ،, شما, مثل, يک, خانواده, ميمونيد, نگو...",12,14,"{to, he, say, is, famiiy, like, kids, no, wher...","{ميشه, ،, خانواده, يک, ديو, شما, ناراحت, مثل, ...","[(بچهها, N), (،, PUNC), (شما, PRO), (مثل, ADVe...",[بچهها NP] ، [شما NP] [مثل PP] [يک خانواده ميم...,"[(kids, NNS), (you, PRP), (like, IN), (a, DT),...","[(kids, NNS), (you, PRP), [[('like', 'IN')], [...",SOV
120422,128020,been leading me like a thread throughout the...,مرا مثل قلابي در ميان جهان آويزان کرده,"[been, leading, me, like, a, thread, throughou...","[مرا, مثل, قلابي, در, ميان, جهان, آويزان, کرده]",9,8,"{the, been, throughout, thread, world, like, a...","{آويزان, در, مثل, جهان, ميان, قلابي, کرده, مرا}","[(مرا, PRO), (مثل, ADVe), (قلابي, N), (در, P),...",[مرا NP] [مثل PP] [قلابي NP] [در PP] [ميان جها...,"[(been, VBN), (leading, VBG), (me, PRP), (like...","[[[('been', 'VBN')]], [[('leading', 'VBG')]], ...",SOV
120423,128019,had not its volatile needlelike sword,لطا فتش را مثل شمشير از د ست داده,"[had, not, its, volatile, needlelike, sword]","[لطا, فتش, را, مثل, شمشير, از, د, ست, داده]",6,9,"{its, volatile, sword, needlelike, had, not}","{د, از, داده, شمشير, لطا, مثل, ست, فتش, را}","[(لطا, N), (فتش, N), (را, POSTP), (مثل, ADVe),...",[لطا NP] [فتش NP] [را POSTP] [مثل PP] [شمشير N...,"[(had, VBD), (not, RB), (its, PRP$), (volatile...","[[[('had', 'VBD')]], (not, RB), (its, PRP$), [...",SOV


1599

There are a _lot_ more entries for this one, compared to the "is like" DF. That being said, this may be because مثل is often used as a connector in common speech.
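
Part of the larger count may also come from the search itself: `'مثل' in line` is a substring test, so it also matches longer words that merely contain those letters, such as مثلثي ("triangular"). Here is a minimal sketch of a token-level check using a tokenized column like `Far_Tok`. The toy DF is a stand-in: row 0 is the start of a sentence from the table above, and row 1 is a made-up two-word phrase.

```python
import pandas as pd

WORD = 'مثل'

# Toy stand-in for sov_df: row 0 is the start of a sentence from the table above,
# row 1 is a made-up phrase whose second word only contains WORD as a substring
toy = pd.DataFrame({
    'Far': ['مرا مثل قلابي', 'آرايش مثلثي'],
    'Far_Tok': [['مرا', 'مثل', 'قلابي'], ['آرايش', 'مثلثي']],
})

# Substring test: same behavior as `WORD in line`
substring_mask = toy['Far'].str.contains(WORD, regex=False)

# Token test: WORD must be one of the row's tokens
token_mask = toy['Far_Tok'].apply(lambda toks: WORD in toks)

print(substring_mask.tolist())
print(token_mask.tolist())
```

Comparing `substring_mask.sum()` with `token_mask.sum()` on the real DF would show how many of the hits are whole-word uses.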

In [21]:
for sample in مثل_sov_df['Far_Chunks'].sample(5):
    print(sample)
    print()

: [من NP] [مثل PP] [همه NP] [به PP] [چهار دليل NP] [دكتر ADJP] [شدم VP]

و [آيا ما NP] [را POSTP] [مثل PP] [هميشه تقديس NP] [خواهي VP] كرد

[کتابها NP] ، [مثل PP] [يک پناهگاه NP] [براي VP] [روياهاي PP] [ما NP] [هستند VP]

[لشگر سرخ NP] ! ، [به PP] [لشگر سفيد، NP] [با PP] [آرايش مثلثي NP] [حمله کنيد VP]

[اون ور NP] [مثل PP] [تو NP] [در PP] [قلب من NP] [هست VP]



In these instances of مثل, "like" is used less as a simile and more as an interjection, in a similar fashion to the stereotype of speech patterns for Californians -- the first sentence translates to "I was like, 'I have four reasons to become a doctor.'" What does که look like? We'll follow the usual pattern for creating a DF off a Series for که.

In [22]:
# Creating a new DF around که
که_bool = []
for line in sov_df['Far']:
    if ' که ' in line:
        که_bool.append(True)
    else:
        که_bool.append(False)
has_که = pd.Series(که_bool)

In [23]:
که_sov_df = sov_df[که_bool]

In [24]:
که_sov_df.head()
len(که_sov_df)

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
8,261660,it s you oh my god it s everyone at here,اين تويي ، اوه خداي من اين هرکسي هست که اينجا هست,"[it, s, you, oh, my, god, it, s, everyone, at,...","[اين, تويي, ،, اوه, خداي, من, اين, هرکسي, هست,...",11,12,"{here, my, it, s, god, at, you, everyone, oh}","{هست, خداي, ،, من, هرکسي, اين, که, اوه, تويي, ...","[(اين, Ne), (تويي, N), (،, PUNC), (اوه, RES), ...",[اين تويي NP] ، [اوه خداي من NP] [اين هرکسي NP...,"[(it, PRP), (s, VBZ), (you, PRP), (oh, VBP), (...","[(it, PRP), [[('s', 'VBZ')]], (you, PRP), [[('...",SOV
10,261656,i think we shouid get together and taik about ...,من فکر ميکنم ما بايد با هم بمونيم وراجع به چيز...,"[i, think, we, shouid, get, together, and, tai...","[من, فکر, ميکنم, ما, بايد, با, هم, بمونيم, ورا...",14,17,"{and, to, think, together, need, about, we, wh...","{بايد, بديم, وراجع, ما, بمونيم, من, کنيم, ميکن...","[(من, PRO), (فکر, Ne), (ميکنم, Ne), (ما, PRO),...",[من NP] [فکر ميکنم ما NP] [بايد VP] [با PP] [ه...,"[(i, JJ), (think, VBP), (we, PRP), (shouid, VB...","[[(i, JJ)], [[('think', 'VBP')]], (we, PRP), [...",SOV
14,261648,it s aii i need to remember,اين همه چيزيه که ميخوام به ياد بيارم,"[it, s, aii, i, need, to, remember]","[اين, همه, چيزيه, که, ميخوام, به, ياد, بيارم]",7,8,"{to, need, it, aii, remember, s, i}","{ميخوام, ياد, بيارم, اين, که, چيزيه, همه, به}","[(اين, DET), (همه, DET), (چيزيه, N), (که, CONJ...",[اين همه چيزيه NP] که [ميخوام NP] [به PP] [ياد...,"[(it, PRP), (s, VBD), (aii, NN), (i, NNS), (ne...","[(it, PRP), [[('s', 'VBD')], [('aii', 'NN')]],...",SOV
16,261640,dont we have a speciai bottie of something spa...,يه بطري مخصوص از چيزي که براي مهمون هامون برق ...,"[dont, we, have, a, speciai, bottie, of, somet...","[يه, بطري, مخصوص, از, چيزي, که, براي, مهمون, ه...",12,12,"{bottie, of, for, dont, speciai, we, guests, s...","{از, هامون, مخصوص, مهمون, يه, نداريم, برق, که,...","[(يه, CONJ), (بطري, Ne), (مخصوص, AJ), (از, P),...",[يه بطري مخصوص NP] [از PP] [چيزي NP] که [براي ...,"[(dont, NN), (we, PRP), (have, VBP), (a, DT), ...","[[(dont, NN)], (we, PRP), [[('have', 'VBP')], ...",SOV
21,261627,what i hope is that we can go on iiving together,چيزي که من اميدوارم اينه که ما ميتونيم با هم ز...,"[what, i, hope, is, that, we, can, go, on, iiv...","[چيزي, که, من, اميدوارم, اينه, که, ما, ميتونيم...",11,12,"{can, iiving, together, on, we, that, is, what...","{زندگي, اينه, ما, من, کنيم, ميتونيم, با, هم, ک...","[(چيزي, N), (که, CONJ), (من, PRO), (اميدوارم, ...",[چيزي NP] که [من NP] [اميدوارم VP] [اينه NP] ک...,"[(what, WP), (i, NN), (hope, VBP), (is, VBZ), ...","[(what, WP), [(i, NN)], [[('hope', 'VBP')]], [...",SOV


16074

This one is _much, much_ bigger than either of the two previous DFs, but that's to be expected, since که is one of the most common conjunctive words. What do some of the sentences look like?

In [25]:
for sample in که_sov_df['Far_Chunks'].sample(5):
    print(sample)
    print()

[آيا آنها NP] [سر حرفشون NP] [ميايستند VP] وقتي که [راي آوردند VP]

[ميتوني NP] [به PP] [چيزي NP] که [دوست داري VP] ، [ادامه بدي NP]

[شوری NP] ، [وقتی ADVP] که [غذا NP] [سرد ADJP] [میشود VP] [به PP] [خوبی NP] [احساس میشود VP]

[پيش خودمون ميمونه NP] ، [اونجا ADVP] [يک NP] [کاميونه VP] که [توش NP] [دوربينه ADVP] [توي لندن NP] [تو NP] [جودي VP] دينچ

[وقتي NP] که [شاه مرد اون NP] [شاهه بعدي VP] ميشه



Unfortunately, که is not used abstractly whatsoever in these sentences. In an ideal world, I would narrow the environments that که appears in to only those within similes and metaphors; due to the limitations of my coding abilities, I do not believe that I could have accomplished such a task within the time provided.
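
One possible way to narrow the environment, offered only as a rough sketch: keep rows whose chunk string has a مثل chunk tagged PP followed later by a free-standing که. This pattern is a heuristic of my own, not a linguistic definition of a simile, and the toy DF and names are assumptions. Both chunk strings are copied from output above, with the first one cut short after its first که.

```python
import pandas as pd

# Toy stand-in for the Far_Chunks column; both strings are copied from output above
# (the first is cut short after its first که)
toy = pd.DataFrame({'Far_Chunks': [
    '[اين NP] [مثل PP] [سرپرست هست، NP] [کسي VP] که [سخت ADVP]',
    '[ميتوني NP] [به PP] [چيزي NP] که [دوست داري VP] ، [ادامه بدي NP]',
]})

# Heuristic (an assumption): a مثل chunk tagged PP, then later a که set off by whitespace
SIMILE_KE = r'\[مثل PP\].*\sکه\s'

in_simile = toy['Far_Chunks'].str.contains(SIMILE_KE, regex=True)

print(in_simile.tolist())
```

A filter like this would miss similes built on other comparison words (مانند or شبيه, for example), so it is better read as a lower bound than as a full count.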

Regardless, let's move on to the SVO data to see if there's anything meaningful that can be drawn from it.

### Section 3: Examining SVO Data
Similar to what we did for the SOV data, let's flash a little bit of what we're working with, then modify it in an identical way so as to keep things consistent:

In [26]:
svo_df.head()

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
143410,416676,- i'd get going now. - ok see you next time.,- من ديگه بايد بريم- باشه، بعدا مي بينمت,"[-, i, 'd, get, going, now, ., -, ok, see, you...","[-, من, ديگه, بايد, بريم-, باشه،, بعدا, مي, بي...",14,9,"{you, ok, going, -, next, get, now, 'd, see, i...","{بايد, -, من, بعدا, بريم-, ديگه, مي, باشه،, بي...","[(-, PUNC), (من, PRO), (ديگه, N), (بايد, V), (...",- [من NP] [ديگه بايد VP] [بريم- PP] [باشه، بعد...,"[(-, :), (i, NN), ('d, MD), (get, VB), (going,...","[(-, :), [(i, NN)], ('d, MD), [[('get', 'VB')]...",SVO
162921,476284,it is fortunate that i followed mother's words...,خيلي خوب شد كه به حرفهاي مادر گوش كردم و به جن...,"[it, is, fortunate, that, i, followed, mother,...","[خيلي, خوب, شد, كه, به, حرفهاي, مادر, گوش, كرد...",14,13,"{and, went, followed, that, battle, into, it, ...","{جنگ, گوش, رفتم, مادر, كردم, خيلي, و, به, خوب,...","[(خيلي, N), (خوب, AJ), (شد, V), (كه, CONJ), (ب...",[خيلي NP] [خوب ADJP] [شد VP] كه [به PP] [حرفها...,"[(it, PRP), (is, VBZ), (fortunate, JJ), (that,...","[(it, PRP), [[('is', 'VBZ')], [('fortunate', '...",SVO
169165,495157,leaving buyeo to establish a new nation means ...,ترك بويه يو براي تاسيس يك طايفه ي جديد !معنيش ...,"[leaving, buyeo, to, establish, a, new, nation...","[ترك, بويه, يو, براي, تاسيس, يك, طايفه, ي, جدي...",17,19,"{'re, enemy, you, going, buyeo, to, !, establi...","{يك, يو, !, معنيش, اينه, باشي, خواي, ي, تاسيس,...","[(ترك, N), (بويه, V), (يو, CONJ), (براي, P), (...",[ترك NP] [بويه VP] يو [براي PP] [تاسيس يك طايف...,"[(leaving, VBG), (buyeo, NN), (to, TO), (estab...","[[[('leaving', 'VBG')], [('buyeo', 'NN')]], (t...",SVO
185577,544000,chunghae boasts the strongest navy in the sout...,کشتي هاي کوچک چانگ هي قوي ترين نيروي دريايي جن...,"[chunghae, boasts, the, strongest, navy, in, t...","[کشتي, هاي, کوچک, چانگ, هي, قوي, ترين, نيروي, ...",10,16,"{the, boasts, navy, strongest, in, seas, chung...","{ترين, کشتي, رو, چانگ, هاي, غربي, هي, ميده, کو...","[(کشتي, Ne), (هاي, Ne), (کوچک, AJe), (چانگ, N)...",[کشتي هاي کوچک چانگ هي قوي ترين NP] [نيروي VP]...,"[(chunghae, NN), (boasts, VBZ), (the, DT), (st...","[[(chunghae, NN)], [[('boasts', 'VBZ')], [('th...",SVO
185704,544253,we must take yi do hyung to the imperial city ...,ما باید همین حالا یی دو-یانگ را به شهر سلطنتی ...,"[we, must, take, yi, do, hyung, to, the, imper...","[ما, باید, همین, حالا, یی, دو-یانگ, را, به, شه...",13,11,"{city, the, must, right, to, take, we, hyung, ...","{را, حالا, یی, ما, باید, شهر, سلطنتی, دو-یانگ,...","[(ما, PRO), (باید, V), (همین, DET), (حالا, ADV...",[ما NP] [باید VP] [همین حالا ADVP] [یی NP] [دو...,"[(we, PRP), (must, MD), (take, VB), (yi, NN), ...","[(we, PRP), (must, MD), [[('take', 'VB')], [('...",SVO


In [27]:
# Making the indexes easier to work with
svo_df.reset_index(drop = True, inplace = True)

In [28]:
# Calculating the average sentence length (in tokens) for the English translations of the SVO sentences
avg_eng_len2 = []
for ln in svo_df['Eng_Len']:
    avg_eng_len2.append(ln)
print('The average English sentence length in the SVO file is...', (sum(avg_eng_len2))/len(svo_df))

The average English sentence length in the SVO file is... 9.187980392725585


In [29]:
# Calculating the average sentence length (in tokens) for the Persian translations of the SVO sentences
avg_far_len2 = []
for ln in svo_df['Far_Len']:
    avg_far_len2.append(ln)
print('The average Persian sentence length in the SVO file is...', (sum(avg_far_len2))/len(svo_df))

The average Persian sentence length in the SVO file is... 9.0338341503031


Just as we searched through the English SOV sentences for "is like", let's do the same for the SVO sentences!

In [30]:
is_like_bool2 = []
for line in svo_df['Eng']:
    if 'is like' in line:
        is_like_bool2.append(True)
    else:
        is_like_bool2.append(False)
has_like2 = pd.Series(is_like_bool2)

In [56]:
len(svo_df)

68954

In [31]:
like_svo_df = svo_df[has_like2]

In [32]:
print(len(like_svo_df))
like_svo_df.head()

26


Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
3931,552576,the enemy you must fight is like a huge maelst...,اون دشمني که شما ميخواين باهاش بجنگين مثل يک گ...,"[the, enemy, you, must, fight, is, like, a, hu...","[اون, دشمني, که, شما, ميخواين, باهاش, بجنگين, ...",11,11,"{the, must, maelstrom, enemy, fight, is, huge,...","{دشمني, اون, يک, شما, بزرگه, ميخواين, که, مثل,...","[(اون, DET), (دشمني, V), (که, CONJ), (شما, PRO...",[اون NP] [دشمني VP] که [شما NP] [ميخواين ADJP]...,"[(the, DT), (enemy, NN), (you, PRP), (must, MD...","[[(the, DT), (enemy, NN)], (you, PRP), (must, ...",SVO
15324,497782,i heard that the coalition is like brothers of...,من شنيدم که اين گروه مثل برادر هاي خوني هستند,"[i, heard, that, the, coalition, is, like, bro...","[من, شنيدم, که, اين, گروه, مثل, برادر, هاي, خو...",12,10,"{the, blood, of, that, is, like, one, brothers...","{خوني, من, شنيدم, هاي, گروه, هستند, برادر, اين...","[(من, PRO), (شنيدم, V), (که, CONJ), (اين, Ne),...",[من NP] [شنيدم VP] که [اين گروه NP] [مثل PP] [...,"[(i, RB), (heard, VBP), (that, IN), (the, DT),...","[(i, RB), [[('heard', 'VBP')], [(P that/IN), (...",SVO
16471,434104,rounds is like being on a game show.,اين ويزيتها شبيه اغاز يه مسابقه تلوزيونيه,"[rounds, is, like, being, on, a, game, show, .]","[اين, ويزيتها, شبيه, اغاز, يه, مسابقه, تلوزيونيه]",9,7,"{rounds, game, being, on, is, like, show, a, .}","{مسابقه, شبيه, يه, ويزيتها, اين, اغاز, تلوزيونيه}","[(اين, DET), (ويزيتها, N), (شبيه, V), (اغاز, N...",[اين ويزيتها NP] [شبيه VP] [اغاز يه مسابقه تلو...,"[(rounds, NNS), (is, VBZ), (like, IN), (being,...","[(rounds, NNS), [[('is', 'VBZ')]], [(like, IN)...",SVO
16518,434767,that song is like a virus.,اين آهنگه مثله خوره ميمونه,"[that, song, is, like, a, virus, .]","[اين, آهنگه, مثله, خوره, ميمونه]",7,5,"{virus, that, is, like, song, a, .}","{ميمونه, خوره, اين, مثله, آهنگه}","[(اين, N), (آهنگه, V), (مثله, Ne), (خوره, Ne),...",[اين NP] [آهنگه VP] [مثله خوره ميمونه NP],"[(that, DT), (song, NN), (is, VBZ), (like, IN)...","[[(that, DT), (song, NN)], [[('is', 'VBZ')], [...",SVO
19827,129997,your wife is like to reap a proper man,همسر شما خوش‌اقبال است كه دوباره به شوهري دلخو...,"[your, wife, is, like, to, reap, a, proper, man]","[همسر, شما, خوش‌اقبال, است, كه, دوباره, به, شو...",9,11,"{to, proper, your, reap, is, like, wife, a, man}","{است, همسر, خوش‌اقبال, مي‌يابد, دوباره, شما, د...","[(همسر, Ne), (شما, PRO), (خوش‌اقبال, AJ), (است...",[همسر شما NP] [خوش‌اقبال ADJP] [است VP] كه [دو...,"[(your, PRP$), (wife, NN), (is, VBZ), (like, I...","[(your, PRP$), [(wife, NN)], [[('is', 'VBZ')]]...",SVO


Proportionally, there are fewer "is like" sentences in the SVO data than there are in the SOV data:
- 56/120607 = 0.046% for the SOV data
- 26/68954 = 0.038% for the SVO data
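
The arithmetic behind this comparison, as a small sketch using the counts printed by the cells above (56 and 26 hits, out of 120607 and 68954 rows). The dictionary layout and names are my own.

```python
# Counts printed above: (rows containing "is like", total rows)
counts = {'SOV': (56, 120607), 'SVO': (26, 68954)}

# Hit rate per word order, as a percentage
rates = {order: hits / total * 100 for order, (hits, total) in counts.items()}

for order, pct in rates.items():
    print(f'{order}: {pct:.3f}%')
```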

As mentioned before, مثل and که are the functional words that often stand in for "is like" in the Persian translations. However, only three of the sentences in this DF use those functional words! The rest still convey the idea of a simile, but convert it to a more directly metaphorical approach. This approach may have an influence on why the structure of the translation is SVO! However, this is definitely not a conclusion I can make quite yet -- there's much, _much_ more to do! For now, let's look at how مثل and که are used!

In [33]:
# Creating the مثل analysis
مثل_bool2 = []
for line in svo_df['Far']:
    if 'مثل' in line:
        مثل_bool2.append(True)
    else:
        مثل_bool2.append(False)
has_مثل2 = pd.Series(مثل_bool2)

In [34]:
مثل_svo_df = svo_df[مثل_bool2]

In [35]:
مثل_svo_df.tail(3)
len(مثل_svo_df)

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
68888,270971,you above all people shouldnt be one to dismis...,آدم هاي مثل تو نبايد کسي باشندکه نفوذ گذشته را...,"[you, above, all, people, shouldnt, be, one, t...","[آدم, هاي, مثل, تو, نبايد, کسي, باشندکه, نفوذ,...",14,13,"{dismiss, the, to, influence, of, shouldnt, pa...","{از, کسي, نفوذ, هاي, تو, دست, باشندکه, بدهند, ...","[(آدم, N), (هاي, V), (مثل, ADVe), (تو, PRO), (...",[آدم NP] [هاي VP] [مثل PP] [تو NP] [نبايد VP] ...,"[(you, PRP), (above, IN), (all, DT), (people, ...","[(you, PRP), [[('above', 'IN')], [('all', 'DT'...",SVO
68898,270891,few realize that mary was descended from kings...,عده کمي ميدونند که مريمثل شوهرش از نسل پادشاها...,"[few, realize, that, mary, was, descended, fro...","[عده, کمي, ميدونند, که, مريمثل, شوهرش, از, نسل...",13,10,"{her, mary, kings, as, husband, descended, tha...","{پادشاهان, از, کمي, نسل, شوهرش, بوده, که, مريم...","[(عده, N), (کمي, AJ), (ميدونند, V), (که, CONJ)...",[عده NP] [کمي ADJP] [ميدونند VP] که [مريمثل شو...,"[(few, JJ), (realize, NN), (that, IN), (mary, ...","[[(few, JJ), (realize, NN)], [[('that', 'IN')]...",SVO
68899,270889,peter i see you contending against a woman li...,پيتر ، ميبينم که تو عليهيک زن مثل يک حريف ميجنگي,"[peter, i, see, you, contending, against, a, w...","[پيتر, ،, ميبينم, که, تو, عليهيک, زن, مثل, يک,...",11,11,"{contending, you, adversary, against, peter, l...","{،, عليهيک, ميبينم, زن, يک, تو, پيتر, حريف, که...","[(پيتر, N), (،, PUNC), (ميبينم, V), (که, CONJ)...",[پيتر NP] ، [ميبينم VP] که [تو NP] [عليهيک زن ...,"[(peter, NN), (i, NN), (see, VBP), (you, PRP),...","[[(peter, NN), (i, NN)], [[('see', 'VBP')]], (...",SVO


1599

What do some sentences look like?

In [36]:
for sample in مثل_svo_df['Far_Chunks'].sample(5):
    print(sample)
    print()

[هرمويي NP] که [بيرون بمونه VP] [مثل PP] [خنجريه NP] که [تو NP] [قلب NP] [شهيد VP] [ما NP] [فرو ميره ADVP]

[اون مرده NP] [خوبيه VP] يک [چيزي مثله تو NP]

[سر من NP] [داره VP] [مثل PP] [گرد NP] ، [تو NP] [طوفان شن ميچرخه NP] [بايد VP] [برم VP]

[شما NP] [نميتونيد VP] [تک تک مردم NP] [رومثل PP] [ما NP] [متقاعد ADJP] [کنيد VP]

[اونا NP] [نمي VP] [توننمثل ايول مي ايول پيش NP] [بيني PP] [کنن NP] .



The odd thing about this data is that the chunker does not always assign مثل to the same kind of phrase. Sometimes, as in the first, third, and fourth sentences, مثل is treated as a PP (though the fourth sentence has a slightly modified version of the word). The second and fifth sentences mark مثل as part of an NP. This irregularity between chunks is an example of a limitation of the chunker.
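
To see how often each label occurs, the chunk strings can be parsed with a regular expression. This is a minimal sketch assuming every chunk has the shape `[words LABEL]` seen in the output above; the helper `labels_for` is a hypothetical name of mine, and the two chunk strings are copied from the sampled output.

```python
import re

# Two chunk strings copied from the sampled output above
chunks = [
    '[اون مرده NP] [خوبيه VP] يک [چيزي مثله تو NP]',
    '[شما NP] [نميتونيد VP] [تک تک مردم NP] [رومثل PP] [ما NP] [متقاعد ADJP] [کنيد VP]',
]

# Each chunk looks like "[words LABEL]"; capture the words and the label separately
CHUNK_RE = re.compile(r'\[([^\[\]]+) ([A-Z]+)\]')

def labels_for(word, chunk_string):
    """Return the label of every chunk whose text contains `word`."""
    return [label for text, label in CHUNK_RE.findall(chunk_string) if word in text]

for chunk_string in chunks:
    print(labels_for('مثل', chunk_string))
```

Feeding every row of `Far_Chunks` through `labels_for` and tallying the results with `collections.Counter` would give the PP-versus-NP split for the whole DF.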

The sentences themselves do not contain any metaphorical uses of مثل, although it is clear that they do exist in some fashion (as shown by the previous search for "is like"). Let's see if any interesting points can be drawn from looking at که.

In [37]:
# Looking at که, again!
که_bool2 = []
for line in svo_df['Far']:
    if ' که ' in line:
        که_bool2.append(True)
    else:
        که_bool2.append(False)
has_که2 = pd.Series(که_bool2)

In [38]:
که_svo_df = svo_df[که_bool2]

In [39]:
که_svo_df.head()
len(که_svo_df)

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
40,553945,i know reform is difficult as long as there ar...,من ميدونم که تا زماني که صاحب منصبان هستند اصل...,"[i, know, reform, is, difficult, as, long, as,...","[من, ميدونم, که, تا, زماني, که, صاحب, منصبان, ...",11,13,"{long, as, difficult, are, is, nobles, know, r...","{زماني, مشکل, است, کردن, تا, من, ميدونم, هستند...","[(من, PRO), (ميدونم, V), (که, CONJ), (تا, P), ...",[من NP] [ميدونم VP] که [تا PP] [زماني NP] که [...,"[(i, NN), (know, VBP), (reform, NN), (is, VBZ)...","[[(i, NN)], [[('know', 'VBP')], [('reform', 'N...",SVO
50,554238,you said you wished me to be happy...,تو گفتي که ميخواهي من شاد باشم,"[you, said, you, wished, me, to, be, happy, ...]","[تو, گفتي, که, ميخواهي, من, شاد, باشم]",9,7,"{to, said, wished, happy, be, you, ..., me}","{باشم, من, تو, گفتي, شاد, که, ميخواهي}","[(تو, PRO), (گفتي, V), (که, CONJ), (ميخواهي, N...",[تو NP] [گفتي VP] که [ميخواهي من NP] [شاد ADJP...,"[(you, PRP), (said, VBD), (you, PRP), (wished,...","[(you, PRP), [[('said', 'VBD')]], (you, PRP), ...",SVO
67,485809,i put everything l had in this transaction!,هر چي که داشتم تو اين معامله گذاشتم,"[i, put, everything, l, had, in, this, transac...","[هر, چي, که, داشتم, تو, اين, معامله, گذاشتم]",9,8,"{transaction, put, this, !, l, everything, in,...","{تو, اين, که, گذاشتم, چي, هر, داشتم, معامله}","[(هر, DET), (چي, N), (که, CONJ), (داشتم, V), (...",[هر چي NP] که [داشتم VP] [تو NP] [اين معامله N...,"[(i, NN), (put, VBD), (everything, NN), (l, NN...","[[(i, NN)], [[('put', 'VBD')], [('everything',...",SVO
83,554000,i believe you will protect her,من ميدونم که تو از اون مراقبت ميکني,"[i, believe, you, will, protect, her]","[من, ميدونم, که, تو, از, اون, مراقبت, ميکني]",6,8,"{her, will, believe, protect, you, i}","{اون, از, من, تو, ميدونم, که, مراقبت, ميکني}","[(من, PRO), (ميدونم, V), (که, CONJ), (تو, PRO)...",[من NP] [ميدونم VP] که [تو NP] [از PP] [اون NP...,"[(i, NN), (believe, VBP), (you, PRP), (will, M...","[[(i, NN)], [[('believe', 'VBP')]], (you, PRP)...",SVO
84,553929,i was forced to commit a grave wrong.,من مجبور شدم که اين کار اشتباه رو انجام بدم,"[i, was, forced, to, commit, a, grave, wrong, .]","[من, مجبور, شدم, که, اين, کار, اشتباه, رو, انج...",9,10,"{wrong, to, was, commit, grave, a, i, ., forced}","{مجبور, من, رو, کار, شدم, انجام, اين, که, اشتب...","[(من, PRO), (مجبور, AJ), (شدم, V), (که, CONJ),...",[من NP] [مجبور ADJP] [شدم VP] که [اين کار NP] ...,"[(i, NN), (was, VBD), (forced, VBN), (to, TO),...","[[(i, NN)], [[('was', 'VBD')]], [[('forced', '...",SVO


11227

Proportionally, که is used much more in SVO sentences. I expected this, since the chunker does not always include که in a chunk. Because که usually appears right after the verb but is often left outside the chunks, a simple SV or OV sentence may be read as a complex sentence. This is a result of not searching for a specific context of که when creating this DF.
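To put a number on "proportionally", a minimal sketch along these lines could compare the share of که sentences in each word order. The helper `share_containing` and the two toy Series are made up for illustration (they stand in for the `Far` columns of the SOV and SVO DFs). It also matches که as a whitespace-separated token rather than with a substring test, so the counts it gives may differ slightly from the ones above.

```python
import pandas as pd

def share_containing(sentences: pd.Series, token: str) -> float:
    """Fraction of sentences that contain `token` as a whitespace-separated word."""
    # Splitting on whitespace means a sentence-initial or sentence-final token also counts.
    has_token = sentences.apply(lambda s: token in s.split())
    return has_token.mean()

# Toy stand-ins for the SOV and SVO 'Far' columns (hypothetical data).
toy_sov = pd.Series(['من کتاب را خواندم', 'او که آمد رفتم', 'تو رفتي'])
toy_svo = pd.Series(['من ميدونم که تو رفتي', 'تو گفتي که من شادم'])

print('SOV share:', share_containing(toy_sov, 'که'))
print('SVO share:', share_containing(toy_svo, 'که'))
```

Comparing shares rather than raw counts keeps the difference in DF sizes from driving the result.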

What do some of the SVO sentences using که look like?

In [40]:
for sample in که_svo_df['Far_Chunks'].sample(5):
    print(sample)
    print()

[گارد سلطنتي NP] [تقاضا دارن VP] که [دوشيزه ADVP] [سو NP] [سئو-نو NP] [رو POSTP] [ملاقات کنند VP] .

[تو NP] که [ديدي VP] [من تو NP] [واگن اولي NP] [هستم VP] [چرا ADVP] [رفتي VP] [تو NP] [دومي‌ نشستي VP]

[من NP] [میدونم VP] که [شما NP] [در PP] [مورد این NP] ، [احساس خوبی NP] [ندارید VP] [به PP] [دلیل اتفاقی NP] که [دفعه NP] [پیش افتاد VP]

[پنجاه هزار تا NP] [دارم VP] که [ميگه ميتوني کمکم NP] [کني VP]

[نميتونم ژنرالي NP] [را POSTP] [پيدا کنم VP] که [جرات مواجه NP] [با PP] [هيتلر NP] [را POSTP] [داشته VP] [باشه VP]



Again, these sentences don't provide much support for که being a predictor of any particular word ordering. An interesting pattern can be found in the second sentence, which roughly translates to "You that I saw for the first time...", using an embedded syntactic structure that follows something like "NP that VP... VP". Since که is not tagged, the regular expressions limited this to the initial "NP VP", but the real structure would just be "NP".
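One way to look at the context of که more directly would be to record which chunk label sits immediately before it. The sketch below is only an assumption about how that could be done: `labels_before_ke` and the regular expression are mine, and they rely on the `[words LABEL]` format seen in `Far_Chunks`, with که left outside the brackets. The two sample strings are shortened from the output above.

```python
import re
from collections import Counter

KE = 'که'
# A chunk label (NP, VP, ...) closing a bracket right before a free-standing که.
BEFORE_KE = re.compile(r'([A-Z]+)\]\s*' + re.escape(KE) + r'(?!\S)')

def labels_before_ke(chunk_string):
    """Return the label of the chunk preceding each unchunked که."""
    return BEFORE_KE.findall(chunk_string)

samples = [
    '[تو NP] که [ديدي VP] [من تو NP] [واگن اولي NP] [هستم VP]',
    '[پنجاه هزار تا NP] [دارم VP] که [ميگه ميتوني کمکم NP] [کني VP]',
]

# Tally how often که follows an NP versus a VP.
tally = Counter(label for s in samples for label in labels_before_ke(s))
print(tally)
```

A که that follows an NP is a candidate for the embedded "NP that VP" structure, while a که after a VP looks more like a complement clause. The chunker's errors would still carry over into any such tally.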

These two word-orders didn't tell us very much, unfortunately -- what might the "Other" data say (pardon my pun)?

### Section 4: Examining Other Data
Let's load in the Other data and see what it has to offer:

In [41]:
other_df = pd.read_pickle('only_other_df.pkl')

In [42]:
other_df.head()
len(other_df)

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
0,1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{breathing, raspy}","{خر, صداي}","[(صداي, NUM), (خر, Ne), (خر, N)]",[صداي خر خر NP],"[(raspy, NN), (breathing, NN)]","[[(raspy, NN), (breathing, NN)]]",Other
1,2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر},"[(پدر, N)]",[پدر NP],"[(dad, NN)]","[[(dad, NN)]]",Other
2,3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{its, the, wind, maybe}","{باشه, صداي, باد, شايد}","[(شايد, Ne), (صداي, AJ), (باد, V), (باشه, V)]",[شايد صداي NP] [باد VP] [باشه VP],"[(maybe, RB), (its, PRP$), (the, DT), (wind, NN)]","[(maybe, RB), (its, PRP$), [(the, DT), (wind, ...",Other
3,4,no,نه,[no],[نه],1,1,{no},{نه},"[(نه, ADV)]",نه,"[(no, DT)]","[[(no, DT)]]",Other
6,7,william,ويليام,[william],[ويليام],1,1,{william},{ويليام},"[(ويليام, N)]",[ويليام NP],"[(william, NN)]","[[(william, NN)]]",Other


422525

In [43]:
# Making the indexes easier to work with
other_df.reset_index(drop = True, inplace = True)

In [44]:
# Checking average English sentence length
print('The average English sentence length in the other file is...', other_df['Eng_Len'].sum() / len(other_df))

The average English sentence length in the other file is... 5.751316490148512


In [45]:
# Checking average Persian sentence length
print('The average Persian sentence length in the other file is...', other_df['Far_Len'].sum() / len(other_df))

The average Persian sentence length in the other file is... 5.347882373824034


Whoa! The average sentence length for both English and Persian is _much_ shorter than in the SOV and SVO data! Of course, this is probably a result of the many one-word and two-word lines that fill this file.
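The hunch about short lines could be checked with something like the sketch below. It is not part of the original analysis: `short_line_share` is a hypothetical helper, and the toy lengths stand in for `other_df['Eng_Len']`.

```python
import pandas as pd

def short_line_share(lengths: pd.Series, max_len: int = 2) -> float:
    """Fraction of lines whose token count is at most `max_len`."""
    return (lengths <= max_len).mean()

# Toy stand-in for other_df['Eng_Len'] (hypothetical values).
toy_eng_len = pd.Series([2, 1, 4, 1, 1, 11, 15])

print('Share of one- and two-word lines:', short_line_share(toy_eng_len))
print('Mean length, all lines:', toy_eng_len.mean())
print('Mean length, longer lines only:', toy_eng_len[toy_eng_len > 2].mean())
```

If the mean over the longer lines comes out close to the SOV and SVO averages, the short lines would explain most of the drop.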

Again, let's check out the "is like" structure.

In [46]:
is_like_bool3 = []
for line in other_df['Eng']:
    if 'is like' in line:
        is_like_bool3.append(True)
    else:
        is_like_bool3.append(False)
has_like3 = pd.Series(is_like_bool3)

In [47]:
like_other_df = other_df[has_like3]

In [48]:
print(len(like_other_df))
like_other_df.head()

103


Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
3979,5923,it is like a completely new life,اين مثل يک زندگيه جديده,"[it, is, like, a, completely, new, life]","[اين, مثل, يک, زندگيه, جديده]",7,5,"{it, life, like, is, new, a, completely}","{يک, زندگيه, اين, مثل, جديده}","[(اين, N), (مثل, ADVe), (يک, NUM), (زندگيه, Ne...",[اين NP] [مثل PP] [يک زندگيه جديده NP],"[(it, PRP), (is, VBZ), (like, IN), (a, DT), (c...","[(it, PRP), [[('is', 'VBZ')], [(P like/IN), (N...",Other
11532,17526,its not no dear a tune is like:,نداره :نه عزيزم يک اهنگ هست مثله,"[its, not, no, dear, a, tune, is, like, :]","[نداره, :, نه, عزيزم, يک, اهنگ, هست, مثله]",9,8,"{its, dear, is, :, like, no, a, tune, not}","{هست, نداره, يک, عزيزم, اهنگ, نه, مثله, :}","[(نداره, V), (:, PUNC), (نه, ADV), (عزيزم, N),...",[نداره VP] : [نه ADVP] [عزيزم NP] [يک اهنگ NP]...,"[(its, PRP$), (not, RB), (no, DT), (dear, NN),...","[(its, PRP$), (not, RB), [(no, DT), (dear, NN)...",Other
30363,46011,that is like saying because someone gave thei...,به خاطر اين كه يك نفر به چيزي فكر كرده كه اون ...,"[that, is, like, saying, because, someone, gav...","[به, خاطر, اين, كه, يك, نفر, به, چيزي, فكر, كر...",9,14,"{their, someone, saying, that, is, like, atten...","{اون, فكر, يك, كرده, خاطر, اين, كه, نفر, نميخو...","[(به, P), (خاطر, Ne), (اين, N), (كه, V), (يك, ...",[به PP] [خاطر اين NP] [كه VP] [يك نفر NP] [به ...,"[(that, DT), (is, VBZ), (like, IN), (saying, V...","[[(that, DT)], [[('is', 'VBZ')]], [(like, IN)]...",Other
40178,60759,like this like his face all red his eyes buggi...,مثل اينتمام صورتش قرمز شد چشماش از حدقه ، داشت...,"[like, this, like, his, face, all, red, his, e...","[مثل, اينتمام, صورتش, قرمز, شد, چشماش, از, حدق...",11,12,"{buggin, his, this, red, like, face, out, eyes...","{قرمز, از, حدقه, ،, اينتمام, بيرون, داشت, ميزد...","[(مثل, ADVe), (اينتمام, Ne), (صورتش, N), (قرمز...",[مثل PP] [اينتمام صورتش NP] [قرمز ADJP] [شد VP...,"[(like, IN), (this, DT), (like, IN), (his, PRP...","[[[('like', 'IN')], [('this', 'DT')]], [(like,...",Other
40249,60871,arlene you ve forgotten what hanging out with...,آرلين ؛ يادت رفته گردش رفتن با جوليا جنگلي چطوريه,"[arlene, you, ve, forgotten, what, hanging, ou...","[آرلين, ؛, يادت, رفته, گردش, رفتن, با, جوليا, ...",12,10,"{arlene, is, julia, what, ve, like, with, hang...","{؛, جوليا, گردش, آرلين, يادت, چطوريه, با, رفتن...","[(آرلين, N), (؛, PUNC), (يادت, N), (رفته, V), ...",[آرلين NP] ؛ [يادت رفته گردش رفتن NP] [با PP] ...,"[(arlene, NN), (you, PRP), (ve, VBP), (forgott...","[[(arlene, NN)], (you, PRP), [[('ve', 'VBP')]]...",Other


Not a ton of choices here, with only 103 entries. Judging by the five sentences in the head, "is like" seems to introduce an explanatory comparison rather than a metaphor or simile. What about مثل?
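One caveat on the count: a plain substring test also matches "this like", which seems to be why the "like this like his face..." row shows up in the head. A hedged alternative is a word-boundary regex with `Series.str.contains`. The toy Series below is illustrative, and its third sentence is made up.

```python
import pandas as pd

# Toy stand-in for other_df['Eng']; the last sentence is invented for illustration.
toy_eng = pd.Series([
    'it is like a completely new life',
    'like this like his face all red',
    'you forgot what she is like',
])

# Plain substring test, equivalent to `'is like' in line`.
substring_hits = toy_eng.str.contains('is like', regex=False)

# Word-boundary test: "is" must be a word on its own.
word_hits = toy_eng.str.contains(r'\bis like\b', regex=True)

print(substring_hits.tolist())
print(word_hits.tolist())
```

With the word-boundary version, the 103 entries would probably shrink a little, but the remaining rows should all contain the actual phrase.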

In [49]:
مثل_bool3 = []
for line in other_df['Far']:
    if 'مثل' in line:
        مثل_bool3.append(True)
    else:
        مثل_bool3.append(False)
has_مثل3 = pd.Series(مثل_bool3)

In [50]:
مثل_other_df = other_df[مثل_bool3]

In [51]:
مثل_other_df.tail(3)
len(مثل_other_df)

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
421777,611338,wifely,مثل زوجه,[wifely],"[مثل, زوجه]",1,2,{wifely},"{زوجه, مثل}","[(مثل, ADVe), (زوجه, N)]",[مثل PP] [زوجه NP],"[(wifely, RB)]","[(wifely, RB)]",Other
421952,611513,wiredrawn,مثل سيم,[wiredrawn],"[مثل, سيم]",1,2,{wiredrawn},"{سيم, مثل}","[(مثل, ADVe), (سيم, N)]",[مثل PP] [سيم NP],"[(wiredrawn, NN)]","[[(wiredrawn, NN)]]",Other
422043,611605,womanlike,مثل زن,[womanlike],"[مثل, زن]",1,2,{womanlike},"{زن, مثل}","[(مثل, ADVe), (زن, N)]",[مثل PP] [زن NP],"[(womanlike, NN)]","[[(womanlike, NN)]]",Other


1599

Funnily enough, it almost looks like the tail is alphabetized, but it's not! Things work out weirdly sometimes. That being said, this tail demonstrates a common feature of Persian morphology: compounds that incorporate function words. So the word for "womanlike" directly translates to "like woman", the word for "wifely" translates to "like wife", and the word for "wiredrawn" translates to "like wire".
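To pull out just these "like X" pairs, a filter along the following lines might work. This is a sketch under my own assumptions: a one-word English entry paired with a two-token Persian entry that starts with مثل. The toy DF only mimics the relevant columns.

```python
import pandas as pd

# Toy stand-in for a few rows of other_df (only the columns needed here).
toy = pd.DataFrame({
    'Eng': ['wifely', 'womanlike', 'it is like a completely new life'],
    'Far_Tok': [['مثل', 'زوجه'], ['مثل', 'زن'], ['اين', 'مثل', 'يک', 'زندگيه', 'جديده']],
    'Eng_Len': [1, 1, 7],
    'Far_Len': [2, 2, 5],
})

# True where the first Persian token is مثل.
starts_with_mesl = toy['Far_Tok'].apply(lambda toks: len(toks) > 0 and toks[0] == 'مثل')

# One English word rendered as a two-token "like X" phrase.
like_x = toy[(toy['Eng_Len'] == 1) & (toy['Far_Len'] == 2) & starts_with_mesl]
print(like_x['Eng'].tolist())
```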

What do که sentences look like?

In [52]:
که_bool3 = []
for line in other_df['Far']:
    if ' که ' in line:
        که_bool3.append(True)
    else:
        که_bool3.append(False)
has_که = pd.Series(که_bool3)

In [53]:
که_other_df = other_df[که_bool3]

In [54]:
که_other_df.head()
len(که_other_df)

Unnamed: 0,ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order_Final
26,49,well i heard about a balled up whore named ch...,خوب من درباره يک جنده گندي که اسمش چارلي پرينس...,"[well, i, heard, about, a, balled, up, whore, ...","[خوب, من, درباره, يک, جنده, گندي, که, اسمش, چا...",11,11,"{i, up, balled, well, about, named, princess, ...","{جنده, من, شنيدم, درباره, يک, خوب, گندي, که, پ...","[(خوب, ADV), (من, PRO), (درباره, Pe), (يک, Ne)...",[خوب ADJP] [من NP] [درباره PP] [يک جنده NP] [گ...,"[(well, RB), (i, RB), (heard, VBP), (about, IN...","[(well, RB), (i, RB), [[('heard', 'VBP')], [(P...",Other
31,55,well will you look at all this you all spared...,خوب يک نگاه به اينا ميکني همه چيزهايي که داري ...,"[well, will, you, look, at, all, this, you, al...","[خوب, يک, نگاه, به, اينا, ميکني, همه, چيزهايي,...",15,13,"{will, well, this, expense, no, byron, at, you...","{نداره, نگاه, داري, يک, هيچ, خوب, مصرفي, چيزها...","[(خوب, ADV), (يک, NUM), (نگاه, N), (به, P), (ا...",[خوب ADJP] [يک نگاه NP] [به PP] [اينا NP] [ميک...,"[(well, RB), (will, MD), (you, PRP), (look, VB...","[(well, RB), (will, MD), (you, PRP), [[('look'...",Other
32,56,ive got to say thought its probably cheaper j...,من بايد بگم که احتمالا بي ارزشتر از اونيه که م...,"[ive, got, to, say, thought, its, probably, ch...","[من, بايد, بگم, که, احتمالا, بي, ارزشتر, از, ا...",16,15,"{the, damn, cheaper, its, to, probably, say, r...","{بايد, از, بدزدم, من, بي, اونيه, که, ارزشتر, ا...","[(من, PRO), (بايد, V), (بگم, V), (که, CONJ), (...",[من NP] [بايد VP] [بگم VP] که [احتمالا ADVP] [...,"[(ive, JJ), (got, VBD), (to, TO), (say, VB), (...","[[(ive, JJ)], [[('got', 'VBD')]], (to, TO), [[...",Other
42,70,well tommy it seems that there was a pinkerto...,خوب تامي به نظر ميرسه که داخل اون دليجان يک پي...,"[well, tommy, it, seems, that, there, was, a, ...","[خوب, تامي, به, نظر, ميرسه, که, داخل, اون, دلي...",16,16,"{coach, well, quite, that, was, it, inside, se...","{ميرسه, اون, داخل, يود, دليجان, پينکرتون, تامي...","[(خوب, ADV), (تامي, N), (به, P), (نظر, Ne), (م...",[خوب ADJP] [تامي NP] [به PP] [نظر ميرسه NP] که...,"[(well, RB), (tommy, IN), (it, PRP), (seems, V...","[(well, RB), [(tommy, IN)], (it, PRP), [[('see...",Other
43,71,now i know what charlie told you because we ...,الان ميدونم که چارلي بهت چي گفته براي اينکه ما...,"[now, i, know, what, charlie, told, you, becau...","[الان, ميدونم, که, چارلي, بهت, چي, گفته, براي,...",18,18,"{but, charlie, this, dont, we, rules, what, to...","{بهت, چندتا, قانون, ما, نداريم, تو, گروه, چارل...","[(الان, ADV), (ميدونم, V), (که, CONJ), (چارلي,...",[الان ADVP] [ميدونم VP] که [چارلي NP] [بهت PP]...,"[(now, RB), (i, VBZ), (know, VBP), (what, WP),...","[(now, RB), [[('i', 'VBZ')]], [[('know', 'VBP'...",Other


24652

Again, the head here does not show much promise. With nearly 25,000 entries in this DF, it would be extremely difficult to pin down the relationship between که and "Other" word order by reading through examples.
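A more manageable route might be to hand-label a small, reproducible random sample instead of the full DF. The sketch below assumes nothing beyond pandas' `DataFrame.sample`. The helper `annotation_sample` and the `Ke_Function` column are hypothetical names, and the toy DF stands in for the که DF.

```python
import pandas as pd

def annotation_sample(df, n, seed=0):
    """Draw n rows reproducibly and add an empty column for hand labels."""
    picked = df.sample(n=min(n, len(df)), random_state=seed).copy()
    picked['Ke_Function'] = ''  # hypothetical column, to be filled in by hand
    return picked

# Toy stand-in for the که DF.
toy = pd.DataFrame({'ID': range(1, 11), 'Eng': [f'sentence {i}' for i in range(1, 11)]})

first = annotation_sample(toy, n=3, seed=42)
second = annotation_sample(toy, n=3, seed=42)

# The same seed gives the same rows, so the labeled sample can be regenerated later.
print(first.index.equals(second.index))
```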

### Section 5: Conclusions

Unfortunately, my data was not set up well enough to support a good analysis of grammatical functions. This comes down to a few issues, most notably an unreliable chunker and my limited understanding of regular expressions.

Some general conclusions are hinted at, but are by no means confirmed:
- Sentences that contain the phrase "is like" do carry over their simile components, especially in SOV sentences
- SVO sentences do not appear to use مثل ("like") for metaphors in its usual equivalent position
- "که" needs to  

On an unrelated note, I will be here this summer, and would like to review what I could have done to improve my project. I think I bit off more than I could chew, and I'd like to take a second look after the semester has concluded.