# Generalizing Chunks
#### John R. Starr; jrs294@pitt.edu
Now that all of the data is POS-tagged and chunked, it's time to generalize the chunks into categories: SOV, SVO, and an "extraneous" category, EX (labeled "Other" in the code below), which will probably be filled by mis-parsed/mis-chunked data. At the most basic level, I intend to examine the number of noun phrases before the verb. If there are two, we'll label the sentence SOV; a single noun phrase before the verb with another one after it suggests SVO.

NOTE: I understand that the chunked data I have is by no means perfect; this is one of the limitations of my project.

All right, let's load in the usual stuff:

In [1]:
import nltk
import numpy as np
import pandas as pd

In [2]:
full_df = pd.read_pickle('tagged_chunked_df.pkl')

In [3]:
full_df.head()

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{breathing, raspy}","{صداي, خر}","[(صداي, NUM), (خر, Ne), (خر, N)]",[صداي خر خر NP],"[(raspy, NN), (breathing, NN)]","[[(raspy, NN), (breathing, NN)]]"
2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر},"[(پدر, N)]",[پدر NP],"[(dad, NN)]","[[(dad, NN)]]"
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{wind, maybe, its, the}","{باشه, صداي, شايد, باد}","[(شايد, Ne), (صداي, AJ), (باد, V), (باشه, V)]",[شايد صداي NP] [باد VP] [باشه VP],"[(maybe, RB), (its, PRP$), (the, DT), (wind, NN)]","[(maybe, RB), (its, PRP$), [(the, DT), (wind, ..."
4,no,نه,[no],[نه],1,1,{no},{نه},"[(نه, ADV)]",نه,"[(no, DT)]","[[(no, DT)]]"
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{stop, please}","{داريد, ميکنم, خواهش, دست, نگه}","[(دست, N), (نگه, N), (داريد, V), (خواهش, Ne), ...",[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [...,"[(stop, JJ), (please, NN), (stop, VB)]","[[(stop, JJ), (please, NN)], [[('stop', 'VB')]]]"


Rather than working on the full DF, let's create a smaller one so that our functions run faster:

In [4]:
small_df = full_df.iloc[:100]
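
A side note on this slice: because `iloc[:100]` selects part of `full_df`, assigning a new column to the result later in this notebook triggers pandas' `SettingWithCopyWarning`. A minimal sketch of one common way to avoid it, using `.copy()`; the names `toy_full`, `toy_small`, and `N_Chunks` are hypothetical stand-ins, not part of the project data:

```python
import pandas as pd

# Toy stand-in for full_df (hypothetical data, not the project's corpus).
toy_full = pd.DataFrame(
    {'Far_Chunks': ['[a NP] [b VP]', '[c NP] [d NP] [e VP]', 'f']}
)

# .copy() makes the subset an independent DataFrame, so adding a column
# does not raise the warning and does not touch toy_full.
toy_small = toy_full.iloc[:2].copy()

# Count the opening brackets as a rough number of chunks per sentence.
toy_small['N_Chunks'] = toy_small['Far_Chunks'].str.count(r'\[')

print(toy_small)
print('N_Chunks' in toy_full.columns)
```

The same idea would apply here as `full_df.iloc[:100].copy()`, assuming the extra memory for 100 rows is not a concern.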

Let's see what type of data we're working with. Because we're looking for the abnormalities in Persian word order, we only need to focus on that data. So, what's it all look like?

In [5]:
# Gathering some information on the Persian data:
for item in small_df.Far_Chunks.iloc[:5]:
    print(item)
    print(item[0])
    print()

[صداي خر خر NP]
[

[پدر NP]
[

[شايد صداي NP] [باد VP] [باشه VP]
[

نه
ن

[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [نگه داريد VP]
[



Ah.. it seems that our data is in string format...! Indexing with item[0] returns a single character ([ or ن) rather than the first chunk. This makes things a little more annoying to navigate. If only the data were in tuple form! Oh well... Let's try using RegEx to search through this and get the word order that we want.
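
For reference, the bracketed strings could be turned into tuples with a small helper. This is only a sketch: it assumes every chunk looks like `[words LABEL]` with an uppercase ASCII label, and the names `CHUNK_RE` and `parse_chunks` are hypothetical, not part of the project code.

```python
import re

# Matches one bracketed chunk such as "[باد VP]":
# the words, some whitespace, then the uppercase label.
CHUNK_RE = re.compile(r'\[([^\[\]]+?)\s+([A-Z]+)\]')

def parse_chunks(chunk_string):
    """Return a list of (words, label) tuples, one per bracketed chunk.

    Tokens outside brackets (unchunked words) are skipped.
    """
    return [(words.strip(), label)
            for words, label in CHUNK_RE.findall(chunk_string)]

example = '[شايد صداي NP] [باد VP] [باشه VP]'
for words, label in parse_chunks(example):
    print(label, words)
```

Unchunked tokens such as نه are dropped by this helper, so it is not a lossless conversion.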

NOTE: Persian reads from right-to-left; however, the chunker flips the sentence backward so that it can be read "left-to-right" and can be more easily compared with languages like English. If this is confusing now, I'll explain more in my presentation.

In [6]:
# Importing...
import re

Let's try a test regex that looks for SVO word-order:

In [7]:
# SVO
# Goal: one NP, then no other NP before a VP, then an NP (presumably the object).
# Caveat: [^NP] and [^VP] are character classes. They exclude the single
# letters N/P and V/P, not the sequences "NP" and "VP".
SVO_test = []
for x in small_df.Far_Chunks:
    if re.findall('(NP)[^NP]*VP[^VP]*(NP)+[^VP]*', x):
        SVO_test.append(x)
len(SVO_test)

33
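
One detail of the pattern above is worth spelling out: `[^NP]` is a negated character class, so it rejects any single `N` or `P` rather than the two-letter sequence `NP`. The sketch below illustrates the difference and shows a negative lookahead as one possible way to express "no `NP` sequence". The strings are made-up ASCII stand-ins, and `NO_NP` is a hypothetical name.

```python
import re

# A negated character class excludes single characters, not a sequence.
class_ok = bool(re.fullmatch(r'[^NP]*', '] [x ADJ'))     # no "N" or "P" at all
class_pp = bool(re.fullmatch(r'[^NP]*', '] [x PP] '))    # contains "P"

# "Any text that does not contain the sequence NP":
# a negative lookahead is checked before each consumed character.
NO_NP = r'(?:(?!NP).)*'
look_pp = bool(re.fullmatch(NO_NP, '] [x PP] '))         # "PP" is allowed
look_np = bool(re.fullmatch(NO_NP, '] [x NP] '))         # "NP" is not

print(class_ok, class_pp, look_pp, look_np)
```

In practice this means the original pattern cannot skip over a `PP`, `ADVP`, or `ADJP` chunk sitting between the NP and the VP, since each of those labels contains a `P`.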

What kind of sentences are we looking at right now?

In [8]:
SVO_test[10:20]

['[مارک اول اينکه من NP] [ميخوام شما NP] [پسرهارو NP] [با PP] [خودم NP] [ببرم VP] و [ما NP] [گله NP] [را POSTP] [جمع NP] [خواهيم کرد VP] [بعد ADVP] [من NP] [به PP] [شهر ميرم NP]',
 '[من NP] [به PP] [هلندر NP] [ميگم VP] که [اينو NP] [درستش ADJP] [کنه VP]',
 '[من NP] [بهش NP] [ميگم VP] [پول يک اسطبل NP] [جديدو بده VP]',
 '[شايد ما NP] [بايد VP] [بهش PP] [تير NP] [اندازي VP] کنيم [همونجور NP] که [ويل ميگه NP]',
 '[تو NP] [بايد VP] [ميگذاشتي من NP] [علوفه هارو NP] [نگه دارم VP]',
 '[يه NP] [روزي VP] که [مثل PP] [من NP] [شدي VP] [تو NP] [اونوقت ميفهمي NP]',
 '[خوب ADJP] [يک نگاه NP] [به PP] [اينا NP] [ميکني VP] [همه چيزهايي NP] که [داري PP] [هيچ مصرفي NP] [نداره VP]',
 '[هيچ چيزي NP] [عوض نميشه VP] اگر [من NP] [زنده ADJP] [بمونم VP]',
 '[ميخوام NP] که [پول هارو NP] [بذارين زمين NP] [بريد VP] [عقب بهشون ADJP] [بگو VP] [برن NP] [عقب ADVP] [يا اين مرد ميميره NP]',
 '[الان ADVP] [ميدونم VP] که [چارلي NP] [بهت PP] [چي گفته NP] [براي VP] اينکه [ما NP] چيزي [تو NP] [اين گروه NP] [نداريم VP] [بجز P

Unfortunately, my chunker isn't the best. There are some clear errors in its labeling. For example, the first sentence above has [ميخوام شما] labeled as an "NP", even though "ميخوام" (the colloquial spelling of "میخواهم") is a verb. It also doesn't take into account colloquial spellings of certain verbs [برم]...! The reason behind this is that Persian varies greatly between its spoken and written forms.

All things considered, though, this looks pretty good! The patterns I want are all in the data, as best the chunker allows...!

Now, let's check out SOV sentences. I predict that there will be more of these:

In [9]:
# SOV
SOV_test = []
for x in small_df.Far_Chunks[:100]:
    if re.findall('(NP)[^VP]*(NP)+(VP)*[^NP]', x):
        SOV_test.append(x)
len(SOV_test)

38

In [10]:
SOV_test[10:20]

['[تو NP] [ميخواي NP] [به PP] [مارشال NP] [بگي VP] که [اون آدما چيکار کردن NP]',
 '[مارک اول اينکه من NP] [ميخوام شما NP] [پسرهارو NP] [با PP] [خودم NP] [ببرم VP] و [ما NP] [گله NP] [را POSTP] [جمع NP] [خواهيم کرد VP] [بعد ADVP] [من NP] [به PP] [شهر ميرم NP]',
 '[من NP] [بهش NP] [ميگم VP] [پول يک اسطبل NP] [جديدو بده VP]',
 '[شايد ما NP] [بايد VP] [بهش PP] [تير NP] [اندازي VP] کنيم [همونجور NP] که [ويل ميگه NP]',
 '[دور NP] [تا PP] [دورش NP] [از PP] [آهنه يک NP] [چيزايي بالاش NP] [دوتا شاتگان زن و يک تيربار NP]',
 '[تو NP] [بايد VP] [ميگذاشتي من NP] [علوفه هارو NP] [نگه دارم VP]',
 '[يه NP] [روزي VP] که [مثل PP] [من NP] [شدي VP] [تو NP] [اونوقت ميفهمي NP]',
 '[اون تويي NP] [خانم کوچولو NP]',
 '[من NP] [بايد VP] [بگم VP] که [احتمالا ADVP] [بي PP] [ارزشتر NP] [از PP] [اونيه NP] که [من NP] [بخوام VP] [چيزي ازش بدزدم VP]',
 '[من NP] [تورو NP] [نميخوام بکشم VP]']
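
A caveat about the SOV pattern: `(VP)*` means "zero or more", so the verb is optional and any two adjacent NP chunks are enough for a match. The sketch below checks this against one sentence from the output above and then tries a stricter pattern. `STRICT` is a hypothetical tightening that assumes labels are the only uppercase ASCII text in the string; it is not the pattern used in this notebook.

```python
import re

SOV_PATTERN = '(NP)[^VP]*(NP)+(VP)*[^NP]'   # pattern from the cell above

# Taken from the SOV output above: two noun phrases and no verb phrase.
no_verb = '[اون تويي NP] [خانم کوچولو NP]'
# Also from the SOV output: two noun phrases followed by a verb phrase.
with_verb = '[من NP] [تورو NP] [نميخوام بکشم VP]'

loose_no_verb = bool(re.search(SOV_PATTERN, no_verb))
loose_with_verb = bool(re.search(SOV_PATTERN, with_verb))

# Require NP, NP, VP as three consecutive chunks; [^A-Z]* lets only
# non-label text (brackets, spaces, Persian words) sit between them.
STRICT = r'NP\][^A-Z]*NP\][^A-Z]*VP\]'
strict_no_verb = bool(re.search(STRICT, no_verb))
strict_with_verb = bool(re.search(STRICT, with_verb))

print(loose_no_verb, loose_with_verb)
print(strict_no_verb, strict_with_verb)
```

The stricter pattern would miss SOV sentences with a PP or ADVP between the object and the verb, so it trades recall for precision.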

This looks a _lot_ better than the SVO output, which isn't unexpected -- since Persian tends towards SOV word order, it makes sense that the data here is much cleaner. Now let's build a function that will do all this automatically...! We're going to look for SVO, SOV, and "Other" word order, with "Other" covering the sentence fragments or other strange speech patterns that appear in everyday language.

Building the function:

In [11]:
def gen_word_order(text):
    if re.findall('(NP)[^NP]*VP[^VP]*(NP)+[^VP]*', text):
        return 'SVO'
    elif re.findall('(NP)[^VP]*(NP)+(VP)*[^NP]', text):
        return 'SOV'
    else:
        return 'Other'

How's it apply to our small_df?

In [12]:
small_df['Word_Order'] = small_df['Far_Chunks'].apply(gen_word_order)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [13]:
small_df.tail(10)

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks,Word_Order
91,by mr ben wade himself,بوسيله خود بن ويد,"[by, mr, ben, wade, himself]","[بوسيله, خود, بن, ويد]",5,4,"{mr, by, ben, wade, himself}","{خود, بوسيله, ويد, بن}","[(بوسيله, Ne), (خود, PRO), (بن, Ne), (ويد, N)]",[بوسيله خود بن ويد NP],"[(by, IN), (mr, JJ), (ben, NN), (wade, VBD), (...","[[[('by', 'IN')], [('mr', 'JJ'), ('ben', 'NN')...",Other
92,how did you know it was wade,تو از کجا فهميدي که اون ويد بود,"[how, did, you, know, it, was, wade]","[تو, از, کجا, فهميدي, که, اون, ويد, بود]",7,8,"{did, was, how, know, wade, you, it}","{از, بود, کجا, که, تو, فهميدي, ويد, اون}","[(تو, PRO), (از, P), (کجا, ADV), (فهميدي, V), ...",[تو NP] [از PP] [کجا NP] [فهميدي VP] که [اون و...,"[(how, WRB), (did, VBD), (you, PRP), (know, VB...","[(how, WRB), [[('did', 'VBD')]], (you, PRP), [...",SVO
93,its been him the last 21 times marshal,اون خودش بود توي 21 بار قبلي مارشال,"[its, been, him, the, last, 21, times, marshal]","[اون, خودش, بود, توي, 21, بار, قبلي, مارشال]",8,8,"{last, its, the, 21, marshal, him, been, times}","{بود, 21, مارشال, خودش, قبلي, توي, بار, اون}","[(اون, PRO), (خودش, PRO), (بود, V), (توي, N), ...",[اون NP] [خودش NP] [بود VP] [توي NP] [21 بار A...,"[(its, PRP$), (been, VBN), (him, PRP), (the, D...","[(its, PRP$), [[('been', 'VBN')]], (him, PRP),...",SVO
94,i saw a mexican sharpshooter and an apache,من يک تير انداز ماهر مکزيکي و يک آپاچي باهاش ديدم,"[i, saw, a, mexican, sharpshooter, and, an, ap...","[من, يک, تير, انداز, ماهر, مکزيکي, و, يک, آپاچ...",8,11,"{apache, a, sharpshooter, i, mexican, and, an,...","{يک, آپاچي, انداز, و, تير, مکزيکي, باهاش, من, ...","[(من, PRO), (يک, NUM), (تير, Ne), (انداز, AJe)...",[من NP] [يک تير انداز ماهر NP] [مکزيکي VP] و [...,"[(i, NN), (saw, VBD), (a, DT), (mexican, JJ), ...","[[(i, NN)], [[('saw', 'VBD')], [('a', 'DT'), (...",SVO
95,did you see the hand of god whats that,تو دست خدا را ديدي اون چيه ديگه,"[did, you, see, the, hand, of, god, whats, that]","[تو, دست, خدا, را, ديدي, اون, چيه, ديگه]",9,8,"{hand, the, did, whats, you, god, that, of, see}","{تو, ديگه, چيه, ديدي, را, خدا, دست, اون}","[(تو, PRO), (دست, Ne), (خدا, N), (را, POSTP), ...",[تو NP] [دست خدا NP] [را POSTP] [ديدي PP] [اون...,"[(did, VBD), (you, PRP), (see, VB), (the, DT),...","[[[('did', 'VBD')]], (you, PRP), [[('see', 'VB...",SOV
96,his pistol why the hell didnt you do something,هفت تيرش چرا تو هيچ کاري نکردي,"[his, pistol, why, the, hell, didnt, you, do, ...","[هفت, تيرش, چرا, تو, هيچ, کاري, نکردي]",9,7,"{the, his, something, didnt, pistol, you, why,...","{چرا, نکردي, تو, هيچ, تيرش, هفت, کاري}","[(هفت, NUM), (تيرش, N), (چرا, ADV), (تو, PRO),...",[هفت تيرش NP] [چرا ADVP] [تو NP] [هيچ کاري NP]...,"[(his, PRP$), (pistol, NN), (why, WRB), (the, ...","[(his, PRP$), [(pistol, NN)], (why, WRB), [(th...",SVO
97,they had a lot of weapons mr and they were sh...,اونا کلي اسلحه داشتن آقا و اونا گلوله شليک ميکردن,"[they, had, a, lot, of, weapons, mr, and, they...","[اونا, کلي, اسلحه, داشتن, آقا, و, اونا, گلوله,...",12,10,"{a, shooting, lot, mr, had, were, they, and, b...","{اسلحه, و, اونا, شليک, کلي, ميکردن, گلوله, داش...","[(اونا, PRO), (کلي, CONJ), (اسلحه, N), (داشتن,...",[اونا NP] کلي [اسلحه داشتن آقا و اونا گلوله شل...,"[(they, PRP), (had, VBD), (a, DT), (lot, NN), ...","[(they, PRP), [[('had', 'VBD')], [('a', 'DT'),...",SOV
98,lets go were wasting time,بريم داريم وقت تلف ميکنيم,"[lets, go, were, wasting, time]","[بريم, داريم, وقت, تلف, ميکنيم]",5,5,"{lets, time, were, wasting, go}","{وقت, بريم, تلف, ميکنيم, داريم}","[(بريم, N), (داريم, V), (وقت, Ne), (تلف, Ne), ...",[بريم NP] [داريم VP] [وقت تلف ميکنيم NP],"[(lets, NNS), (go, VBP), (were, VBD), (wasting...","[(lets, NNS), [[('go', 'VBP')]], [[('were', 'V...",SVO
99,where are you from anyway,درهرصورت تو کجايي هستي,"[where, are, you, from, anyway]","[درهرصورت, تو, کجايي, هستي]",5,4,"{where, are, anyway, from, you}","{هستي, درهرصورت, کجايي, تو}","[(درهرصورت, CONJ), (تو, PRO), (کجايي, V), (هست...",[درهرصورت ADVP] [تو NP] [کجايي VP] هستي,"[(where, WRB), (are, VBP), (you, PRP), (from, ...","[(where, WRB), [[('are', 'VBP')]], (you, PRP),...",Other
100,lets go com on boys,بريم بياين بچهها,"[lets, go, com, on, boys]","[بريم, بياين, بچهها]",5,3,"{lets, com, on, boys, go}","{بچهها, بياين, بريم}","[(بريم, N), (بياين, V), (بچهها, N)]",[بريم بياين VP] بچهها,"[(lets, NNS), (go, VBP), (com, VB), (on, IN), ...","[(lets, NNS), [[('go', 'VBP')]], [[('com', 'VB...",Other


What are the value counts of everything we've seen thus far?

In [14]:
small_df['Word_Order'].value_counts()

Other    50
SVO      33
SOV      17
Name: Word_Order, dtype: int64

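Hmmm... these numbers seem a bit funky...! We got 38 SOV matches when we searched through the small_df directly, but only 17 when we applied the function to our DF. Because gen_word_order checks the SVO pattern first, any sentence that matches both patterns gets labeled SVO, so I wonder if the regexps are stepping on each other's toes? Let's reverse the order of the rules and see what happens: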

In [15]:
def gen_word_order2(text):
    if re.findall('(NP)[^VP]*(NP)+(VP)*[^NP]', text):
        return 'SOV'
    elif re.findall('(NP)[^NP]*VP[^VP]*(NP)+[^VP]*', text):
        return 'SVO'
    else:
        return 'Other'

In [16]:
small_df['Word_Order2'] = small_df['Far_Chunks'].apply(gen_word_order2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [17]:
small_df['Word_Order2'].value_counts()

Other    50
SOV      38
SVO      12
Name: Word_Order2, dtype: int64

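Yes, the regular expressions most definitely _are_ treading on each other...! With the SOV rule checked first, all 38 SOV matches come back and SVO drops from 33 to 12, so 21 sentences match both patterns (33 + 38 - 21 = 50 matched sentences). It is interesting to note that the regular expressions match only 50% of the data; we'll see how this scales as we work up to the full_df.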
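
To see the overlap directly rather than inferring it from two runs, each pattern can be tested independently and the combination reported. A minimal sketch, with `match_flags` as a hypothetical helper and three example strings copied from the outputs above:

```python
import re
from collections import Counter

SVO_PATTERN = '(NP)[^NP]*VP[^VP]*(NP)+[^VP]*'
SOV_PATTERN = '(NP)[^VP]*(NP)+(VP)*[^NP]'

def match_flags(text):
    """Report which patterns match, so neither rule can shadow the other."""
    svo = bool(re.search(SVO_PATTERN, text))
    sov = bool(re.search(SOV_PATTERN, text))
    if svo and sov:
        return 'Both'
    if svo:
        return 'SVO only'
    if sov:
        return 'SOV only'
    return 'Other'

examples = [
    # Listed in both the SVO and the SOV output above.
    '[من NP] [بهش NP] [ميگم VP] [پول يک اسطبل NP] [جديدو بده VP]',
    # Listed in the SOV output above.
    '[من NP] [تورو NP] [نميخوام بکشم VP]',
    # A single noun phrase.
    '[پدر NP]',
]
print(Counter(match_flags(s) for s in examples))
```

Applied to a DataFrame column with `.apply(match_flags)`, the `Both` count would show how many rows depend on rule order.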

In [None]:
full_df['Word_Order'] = full_df['Far_Chunks'].apply(gen_word_order)

In [None]:
full_df['Word_Order'].value_counts()
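
As an alternative to the two overlapping patterns, the idea from the introduction (counting noun phrases before the verb) could be sketched on the label sequence alone. This is an untested assumption about how the categories might be assigned, not the method used above; `labels` and `nps_before_first_verb` are hypothetical names.

```python
import re

def labels(chunk_string):
    """Extract the chunk labels in order, e.g. ['NP', 'NP', 'VP']."""
    return re.findall(r'\s([A-Z]+)\]', chunk_string)

def nps_before_first_verb(chunk_string):
    """Count NP chunks that precede the first VP; None if there is no VP."""
    seq = labels(chunk_string)
    if 'VP' not in seq:
        return None
    return seq[:seq.index('VP')].count('NP')

# Two sentences from the outputs above, plus a verbless fragment.
for s in ['[من NP] [تورو NP] [نميخوام بکشم VP]',
          '[بريم NP] [داريم VP] [وقت تلف ميکنيم NP]',
          '[پدر NP]']:
    print(labels(s), nps_before_first_verb(s))
```

Because each sentence yields a single count, the categories derived from it cannot overlap, though the result still inherits every mislabeled chunk from the chunker.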