# Generalizing Chunks
#### John R. Starr; jrs294@pitt.edu
Now that we have all of the data POS-tagged and chunked, it's time to generalize the chunks into categories: SOV, SVO, and an "extraneous" category EX (which will probably be filled by mis-parsed/mis-chunked data). At the most basic level, I intend to examine the number of noun phrases before the verb. If there are two, we'll treat the sentence as SOV; if there is one, with another noun phrase after the verb, we'll treat it as SVO.
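
As a rough illustration of that basic idea, here is a minimal sketch that counts noun-phrase tags before the first verb-phrase tag. It assumes the chunks have already been reduced to a whitespace-separated tag string such as `'NP NP VP'`; the helper name `count_nps_before_verb` is hypothetical and not part of the notebook.

```python
def count_nps_before_verb(tag_string):
    """Count NP tags that occur before the first VP tag.

    Assumes `tag_string` is a whitespace-separated sequence of chunk tags.
    Returns None when the string contains no VP at all.
    """
    tags = tag_string.split()  # split() also tolerates repeated spaces
    if 'VP' not in tags:
        return None
    first_vp = tags.index('VP')
    return tags[:first_vp].count('NP')


print(count_nps_before_verb('NP NP VP'))  # two NPs precede the verb
print(count_nps_before_verb('NP VP NP'))  # one NP precedes the verb
```

A count alone cannot say which noun phrase is the subject and which is the object, so this is only a first approximation.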

NOTE: I understand that the chunked data I have is by no means perfect; this is one of the limitations of my project.

All right, let's load in the usual stuff:

In [1]:
import nltk
import numpy as np
import pandas as pd
import re

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
full_df = pd.read_pickle('tagged_chunked_df.pkl')

In [4]:
full_df.head()

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{breathing, raspy}","{خر, صداي}","[(صداي, NUM), (خر, Ne), (خر, N)]",[صداي خر خر NP],"[(raspy, NN), (breathing, NN)]","[[(raspy, NN), (breathing, NN)]]"
2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر},"[(پدر, N)]",[پدر NP],"[(dad, NN)]","[[(dad, NN)]]"
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{wind, its, maybe, the}","{باشه, باد, صداي, شايد}","[(شايد, Ne), (صداي, AJ), (باد, V), (باشه, V)]",[شايد صداي NP] [باد VP] [باشه VP],"[(maybe, RB), (its, PRP$), (the, DT), (wind, NN)]","[(maybe, RB), (its, PRP$), [(the, DT), (wind, ..."
4,no,نه,[no],[نه],1,1,{no},{نه},"[(نه, ADV)]",نه,"[(no, DT)]","[[(no, DT)]]"
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{please, stop}","{ميکنم, نگه, خواهش, دست, داريد}","[(دست, N), (نگه, N), (داريد, V), (خواهش, Ne), ...",[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [...,"[(stop, JJ), (please, NN), (stop, VB)]","[[(stop, JJ), (please, NN)], [[('stop', 'VB')]]]"


Rather than working on the full DF, let's create a smaller one that we can run functions on faster...:

In [5]:
small_df = full_df.iloc[:100].copy()  # copy, so adding columns later doesn't trigger SettingWithCopyWarning

Let's see what type of data we're working with. Because we're looking for the abnormalities in Persian word order, we only need to focus on that data. So, what's it all look like?

In [6]:
# Gathering some information on the Persian data:
for item in small_df.Far_Chunks.iloc[:5]:
    print(item)
    print(item[0])
    print()

[صداي خر خر NP]
[

[پدر NP]
[

[شايد صداي NP] [باد VP] [باشه VP]
[

نه
ن

[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [نگه داريد VP]
[



Ah, it seems that our data is in string format: `item[0]` gives back the first character (`[`) rather than the first chunk. This makes things a little more annoying to navigate. If only the data were in tuple form! Oh well. Let's try using regex to search through this and get the word order that we want.
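
For reference, the strings could also be turned into tuple form with a small regex. This is only a sketch: it assumes every chunk looks like `[word word TAG]` with an upper-case tag as the last item, and the names `CHUNK_RE` and `parse_chunks` are hypothetical.

```python
import re

# One bracketed chunk: some words, a space, then an upper-case tag.
CHUNK_RE = re.compile(r'\[([^\[\]]+) ([A-Z]+)\]')


def parse_chunks(chunk_string):
    """Return a list of (words, tag) tuples for each bracketed chunk."""
    return [(words.split(), tag) for words, tag in CHUNK_RE.findall(chunk_string)]


sample = '[شايد صداي NP] [باد VP] [باشه VP]'
for words, tag in parse_chunks(sample):
    print(tag, words)
```

Unbracketed items (such as the bare `نه` in row 4) produce an empty list under this assumption.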

NOTE: Persian reads from right-to-left; however, the chunker flips the sentence backward so that it can be read "left-to-right" and can be more easily compared with languages like English. If this is confusing now, I'll explain more in my presentation.

# Removing Persian characters, getting stats on word orders

In [7]:
# Test thing
test_gen = small_df.Far_Chunks.apply(lambda x: re.findall(r' [A-Z]+P\]', x))

test_gen.head()
len(test_gen)

ID
1                      [ NP]]
2                      [ NP]]
3          [ NP],  VP],  VP]]
4                          []
5    [ NP],  VP],  NP],  VP]]
Name: Far_Chunks, dtype: object

100

In [8]:
# Removing extra stuff:
test_gen = test_gen.apply(str).apply(lambda x: re.sub(r"[,\[\]\']", '', x)).apply(lambda x: x.strip())
test_gen.head()

ID
1                NP
2                NP
3        NP  VP  VP
4                  
5    NP  VP  NP  VP
Name: Far_Chunks, dtype: object
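
The same result can probably be reached in one pass by capturing only the tag and joining the matches, which avoids the `str()` round trip. A minimal sketch, assuming the same `[... TAG]` layout as above; `chunk_tags` is a hypothetical helper name, and the sample strings use placeholder words.

```python
import re


def chunk_tags(chunk_string):
    """Return the chunk tags as one single-space-separated string."""
    # The capture group keeps just the tag, dropping the space and bracket.
    return ' '.join(re.findall(r' ([A-Z]+P)\]', chunk_string))


print(chunk_tags('[w1 w2 NP] [w3 VP] [w4 VP]'))
print(repr(chunk_tags('w5')))  # no bracketed chunk, so an empty string
```

Note that this version separates tags with a single space, whereas the cleaned strings above keep two.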

In [9]:
# Whole thing
full_gen = full_df.Far_Chunks.apply(lambda x: re.findall(r' [A-Z]+P\]', x))
full_gen.head()
full_gen.tail()
len(full_gen)

ID
1                      [ NP]]
2                      [ NP]]
3          [ NP],  VP],  VP]]
4                          []
5    [ NP],  VP],  NP],  VP]]
Name: Far_Chunks, dtype: object

ID
612082                  [ NP]]
612083                  [ NP]]
612084    [ NP],  ADJP],  VP]]
612085                  [ NP]]
612086                  [ NP]]
Name: Far_Chunks, dtype: object

612086

In [10]:
# Removing brackets and lists
full_gen = full_gen.apply(str).apply(lambda x: re.sub(r"[,\[\]\']", '', x)).apply(lambda x: x.strip())

In [11]:
from collections import Counter

chunk_tags_freq = Counter(full_gen)
chunk_tags_freq.most_common(15)

[('NP', 97417),
 ('NP  VP', 36201),
 ('NP  NP', 17290),
 ('VP', 14023),
 ('NP  NP  VP', 13694),
 ('NP  PP  NP  VP', 13342),
 ('NP  PP  NP', 11081),
 ('PP  NP', 8256),
 ('NP  VP  VP', 8138),
 ('PP  NP  VP', 7168),
 ('NP  VP  NP', 6901),
 ('NP  VP  NP  VP', 6462),
 ('NP  ADJP  VP', 6022),
 ('', 4384),
 ('ADVP  VP', 4176)]

## VISUALIZATION?

So, many of these entries have complex structures that include multiple NPs and VPs. Unfortunately, the strings do not have punctuation, so I can't split the sentences up on that. Admittedly, this is kind of giving me a headache.

But still, time to fix things!! Thanks to some help from Professor Han, I have tried something else that works better. Here are some sample strings to test on:

In [55]:
s1 = 'NP VP'
s2 = 'NP VP NP'
s3 = 'NP NP VP'
s4 = 'NP NP VP NP'
s5 = 'NP ADVP VP NP'
s6 = 'NP ADVP NP VP'
s7 = 'NP ADVP PP VP NP'
s8 = 'NP ADVP PP NP VP'
s9 = 'NP ADVP PP VP ADVP NP'

First we'll define the building blocks for the chunk tags that the patterns need:

In [56]:
# Refine for pp!!
# NOTE: `np` here rebinds the name used for `import numpy as np` in In [1].
np = r'NP'
vp = r'VP'
advp = r'ADVP'
pp = r'PP'

Now to build the SVO and SOV word order patterns:

In [57]:
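# Matching for SVO
# NOTE: "[" + advp + pp + " ]" is a character class (any of A, D, V, P, space), not
# "ADVP or PP", so the 'VP' inside 'ADVP' can fill the verb slot (see s6 below).
svo_pat = re.compile(r""+ np +" (["+ advp + pp +" ])*" + vp + " (["+ advp + pp +" ])*" + np)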

In [58]:
# Testing match
if svo_pat.match(s1):
    print('s1')
if svo_pat.match(s2):
    print('s2')
if svo_pat.match(s3):
    print('s3')
if svo_pat.match(s4):
    print('s4')
if svo_pat.match(s5):
    print('s5')
if svo_pat.match(s6):
    print('s6')

s2
s5
s6
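
The `s6` hit is worth a second look, since `NP ADVP NP VP` is verb-final. A small sketch comparing the pattern as it expands from the concatenation above with a hypothetical alternation-based variant (`alternation` is my own name, and the variant is an assumption about what was intended, not the notebook's pattern):

```python
import re

# What the string concatenation in the SVO cell expands to.
char_class = re.compile(r'NP ([ADVPPP ])*VP ([ADVPPP ])*NP')

# Hypothetical variant: zero or more whole ADVP/PP tags, each followed by a space.
alternation = re.compile(r'NP (?:(?:ADVP|PP) )*VP (?:(?:ADVP|PP) )*NP')

for s in ['NP VP NP', 'NP ADVP VP NP', 'NP ADVP NP VP', 'NP ADVP PP VP ADVP NP']:
    print(s, bool(char_class.match(s)), bool(alternation.match(s)))
```

With the character class, the `AD` of `ADVP` is consumed as filler and its `VP` is read as the verb, so `NP ADVP NP` looks like SVO. Switching to the alternation would change the counts reported later, so it is shown here only for comparison.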


In [59]:
# Matching for SOV
sov_pat = re.compile( r""+ np + ' ' + np + " " + vp)

In [60]:
# Testing match
if sov_pat.match(s1):
    print('s1')
if sov_pat.match(s2):
    print('s2')
if sov_pat.match(s3):
    print('s3')
if sov_pat.match(s4):
    print('s4')
if sov_pat.match(s5):
    print('s5')
if sov_pat.match(s6):
    print('s6')

s3
s4


Well, let's see what changes when we use these two patterns in a labeling function instead:

In [44]:
def gen_word_order3(text):
    if svo_pat.match(text):
        return 'SVO'
    elif sov_pat.match(text):
        return 'SOV'
    else:
        return 'Other'

KeyError: 'Other'

In [None]:
small_df['Word_Order3'] = small_df['Far_Chunks'].apply(gen_word_order3)
small_df['Word_Order3'].value_counts()

And if I flip the function? Last time, this was a good way of checking whether or not the ordering of the if/elif/else statements was affecting the numbers:

In [None]:
def gen_word_order4(text):
    if sov_pat.match(text):
        return 'SOV'
    elif svo_pat.match(text):
        return 'SVO'
    else:
        return 'Other'

In [None]:
small_df['Word_Order4'] = small_df['Far_Chunks'].apply(gen_word_order4)

In [None]:
small_df['Word_Order4'].value_counts()

Well, at least this one doesn't have any cross-over! And, as expected, there are more SOV structures.
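
Another way to check for cross-over, independent of the if/elif ordering, is to list the strings that both patterns accept. A minimal sketch on the nine sample strings, with the two patterns written out literally as they expand from the cells above:

```python
import re

# Literal expansions of the SVO and SOV patterns compiled earlier.
svo_pat = re.compile(r'NP ([ADVPPP ])*VP ([ADVPPP ])*NP')
sov_pat = re.compile(r'NP NP VP')

samples = [
    'NP VP', 'NP VP NP', 'NP NP VP', 'NP NP VP NP', 'NP ADVP VP NP',
    'NP ADVP NP VP', 'NP ADVP PP VP NP', 'NP ADVP PP NP VP',
    'NP ADVP PP VP ADVP NP',
]

# Strings accepted by both patterns would be labeled differently
# depending on which branch is tested first.
both = [s for s in samples if svo_pat.match(s) and sov_pat.match(s)]
print(both)
```

An empty list here only covers these samples; it says nothing about the full data.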

Unfortunately, this function ignores 75% of the small_df. That's a _lot_ of data to be losing. We'll need to see how this scales up when applied to the full DF.

A quick refresher on what our full DF looks like:

In [None]:
full_df.head()

And applying our word_order function to it:

In [None]:
def gen_word_order_final(text):
    if sov_pat.match(text):
        return 'SOV'
    elif svo_pat.match(text):
        return 'SVO'
    else:
        return 'Other'

In [None]:
full_df['Word_Order'] = full_df['Far_Chunks'].apply(gen_word_order_final)

In [None]:
print(full_df['Word_Order'].value_counts())
print()
print(len(full_df))

In [None]:
count_df = pd.DataFrame({'Word_Order': [535371, 52756 , 23959]},
                        index=['Other', 'SOV', 'SVO'])
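
As a quick check on these counts, the share of each label can be computed directly. This sketch uses plain Python and the three numbers entered above; the variable names are my own.

```python
counts = {'Other': 535371, 'SOV': 52756, 'SVO': 23959}

total = sum(counts.values())  # should equal len(full_df)
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(f'{label}: {share:.1%}')
```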

In [None]:
plot = count_df.plot.pie(y='Word_Order', figsize=(5, 5))
plot

Well, this is _not_ ideal at all. Over 87% of the data is labeled as "Other". There are a few possible explanations for this:
- The chunker mis-chunked the data, giving incorrect structures
- The regexes are too restrictive in what they select (they need editing)
- The patterns do not take into account the pro-drop nature of Persian, so they do not recognize OV structures (with a dropped subject) as SOV (create a separate OV category?)
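
To make the last idea concrete, here is a rough sketch of a labeler with an extra category for a single noun phrase followed by a clause-final verb. Everything in it is an assumption for illustration: the names `MOD`, `ov_pat`, and `label_word_order` are hypothetical, the patterns use whole-tag alternation rather than the patterns compiled earlier, and the input is assumed to be single-space-separated tags.

```python
import re

# Zero or more ADVP/PP tags, each followed by a space.
MOD = r'(?:(?:ADVP|PP) )*'

sov_pat = re.compile(rf'NP {MOD}NP {MOD}VP')
svo_pat = re.compile(rf'NP {MOD}VP {MOD}NP')
# Hypothetical pro-drop pattern: exactly one NP, then a sentence-final VP.
ov_pat = re.compile(rf'{MOD}NP {MOD}VP$')


def label_word_order(tags):
    """Label a tag string as SOV, SVO, OV, or Other (sketch only)."""
    if sov_pat.match(tags):
        return 'SOV'
    if svo_pat.match(tags):
        return 'SVO'
    if ov_pat.match(tags):
        return 'OV'
    return 'Other'


for s in ['NP NP VP', 'NP VP NP', 'NP VP', 'PP NP VP', 'NP VP VP', 'VP']:
    print(s, '->', label_word_order(s))
```

One caveat: chunk tags alone cannot tell whether the lone noun phrase is an object or a subject of an intransitive verb, so an "OV" label from this sketch would really mean "one NP before a final verb".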

Due to time constraints, I don't think I'll be able to fix this right now, but hopefully by the time I present on Tuesday I'll have a better grasp on my data...! For now, let's separate the structures that are properly labeled, pickle them out, then perform data analysis in another notebook.

First, I'll create a list of Boolean values that will distinguish between the "Other" and ordered cases.

In [None]:
# True for rows labeled SOV or SVO, False for 'Other'
booleans = [order != 'Other' for order in full_df['Word_Order']]

In [None]:
# Checking the length
len(booleans)

Now we'll turn this list into a Series so that we can use it to filter the DF:

In [None]:
is_ordered = pd.Series(booleans)

Wait! The index for full_df starts at 1, not 0, while the new Series is indexed from 0, and boolean indexing lines the two up by index label. Let's reset that:

In [None]:
full_df.reset_index(inplace = True)

Now to create the "ordered only" DF:

In [None]:
ordered_only_df = full_df[is_ordered]
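
For comparison, a mask built directly from the column shares the DataFrame's index, so no list, no separate Series, and no index reset should be needed. A minimal sketch on a toy frame (the `toy` data is made up for illustration):

```python
import pandas as pd

# Toy frame whose index starts at 1, like full_df before reset_index.
toy = pd.DataFrame({'Word_Order': ['SOV', 'Other', 'SVO']}, index=[1, 2, 3])

# Comparing the column yields a boolean Series with the same index as toy.
is_ordered = toy['Word_Order'] != 'Other'

ordered_only = toy[is_ordered]
other_only = toy[~is_ordered]
print(ordered_only)
print(other_only)
```

This keeps the original ID labels on the filtered rows instead of moving them into a column.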

In [None]:
ordered_only_df.head()

Pickle time!

In [None]:
ordered_only_df.to_pickle('ordered_only_df.pkl')

In [None]:
other_only_df = full_df[full_df['Word_Order'] == 'Other']
len(other_only_df)

In [None]:
other_only_df.to_pickle('other_only_df.pkl')

And that's it for now...! Time for some data analysis!