# Chunking English Files
#### John R. Starr; jrs294@pitt.edu
Since I had trouble running my various parsers and chunkers for the previous progress report (due to some technical limitations of my laptop), I will load the pickle file that I created earlier rather than add to the previous Jupyter notebook, which can be found [here](https://github.com/Data-Science-for-Linguists-2019/Scrambling-in-English-to-Persian-Subtitles/blob/master/tagging_chunking.ipynb).

So, let's chunk the English files!!

In [1]:
from nltk.chunk.regexp import RegexpParser
import nltk
import pandas as pd
import numpy as np

I've been experimenting with various parsers for English, but this one seems to work the best so far:

In [2]:
parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>* <P>*} # NP -> (DT) JJ* NN* P*
    P: {<IN>}           # Preposition
    V: {<V.*>}          # Verb
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
    ''')
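
To make the cascade easier to follow, here is a minimal sketch that runs the same grammar on a hand-tagged toy sentence. The sentence and the names `demo_parser`, `tagged`, `demo_tree`, and `top_labels` are my own illustrations and are not taken from the corpus.

```python
from nltk.chunk.regexp import RegexpParser

# Same grammar as in the cell above, repeated so this sketch runs on its own.
demo_parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>* <P>*} # NP
    P: {<IN>}           # Preposition
    V: {<V.*>}          # Verb
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
    ''')

# Hand-tagged toy sentence (hypothetical, not from the subtitle data).
tagged = [('the', 'DT'), ('old', 'JJ'), ('dog', 'NN'), ('sat', 'VBD'),
          ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

demo_tree = demo_parser.parse(tagged)
print(demo_tree)

# Labels of the top-level chunks, in sentence order.
top_labels = [child.label() for child in demo_tree if hasattr(child, 'label')]
print(top_labels)
```

The stages are applied in the order they are written, each one working on the output of the previous stage. Here that should give an NP for "the old dog" and a VP that wraps the verb plus a PP for "on the mat". As far as I can tell, this ordering also means the `<P>*` part of the NP rule has nothing to match on a single pass, since no `P` chunks exist yet when the NP stage runs.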

Let's load the pickle file...!

In [3]:
full_df = pd.read_pickle('full_df_MOD.pkl')
full_df.head(10)

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{raspy, breathing}","{خر, صداي}","[(صداي, NUM), (خر, Ne), (خر, N)]",[صداي خر خر NP],"[(raspy, NN), (breathing, NN)]"
2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر},"[(پدر, N)]",[پدر NP],"[(dad, NN)]"
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{maybe, its, the, wind}","{باد, صداي, شايد, باشه}","[(شايد, Ne), (صداي, AJ), (باد, V), (باشه, V)]",[شايد صداي NP] [باد VP] [باشه VP],"[(maybe, RB), (its, PRP$), (the, DT), (wind, NN)]"
4,no,نه,[no],[نه],1,1,{no},{نه},"[(نه, ADV)]",نه,"[(no, DT)]"
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{please, stop}","{ميکنم, داريد, نگه, خواهش, دست}","[(دست, N), (نگه, N), (داريد, V), (خواهش, Ne), ...",[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [...,"[(stop, JJ), (please, NN), (stop, VB)]"
6,you have a week evans then well burn the house,اوانز تو فقط يک هفته وقت داري وگرنه خونتو خواه...,"[you, have, a, week, evans, then, well, burn, ...","[اوانز, تو, فقط, يک, هفته, وقت, داري, وگرنه, خ...",10,11,"{burn, then, the, house, well, evans, you, wee...","{خواهيم, هفته, وقت, تو, يک, فقط, وگرنه, داري, ...","[(اوانز, Ne), (تو, PRO), (فقط, ADV), (يک, Ne),...",[اوانز تو NP] [فقط يک هفته وقت ADVP] [داري VP]...,"[(you, PRP), (have, VBP), (a, DT), (week, NN),..."
7,william,ويليام,[william],[ويليام],1,1,{william},{ويليام},"[(ويليام, N)]",[ويليام NP],"[(william, NN)]"
8,god damn it william,لعنتي ويليام 8,"[god, damn, it, william]","[لعنتي, ويليام, 8]",4,3,"{god, it, william, damn}","{ويليام, 8, لعنتي}","[(لعنتي, Ne), (ويليام, N), (8, PUNC)]",[لعنتي ويليام NP] 8,"[(god, NN), (damn, VBZ), (it, PRP), (william, ..."
9,god damn it put that down,لعنت به تو اونو بذار زمين,"[god, damn, it, put, that, down]","[لعنت, به, تو, اونو, بذار, زمين]",6,6,"{damn, god, it, that, put, down}","{اونو, به, بذار, زمين, تو, لعنت}","[(لعنت, N), (به, P), (تو, PRO), (اونو, PRO), (...",[لعنت NP] [به PP] [تو NP] [اونو NP] [بذار VP] ...,"[(god, NN), (damn, VBZ), (it, PRP), (put, VBD)..."
10,let go,بذار برم,"[let, go]","[بذار, برم]",2,2,"{go, let}","{بذار, برم}","[(بذار, V), (برم, V)]",[بذار VP] [برم VP],"[(let, NN), (go, VB)]"


Let's see an example parse, using the ninth row ("god damn it put that down"):

In [4]:
print(parser.parse(full_df.Eng_POS.iloc[8]))

(S
  (NP god/NN)
  (VP (V damn/VBZ))
  it/PRP
  (VP (V put/VBD))
  (P that/IN)
  down/RP)
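
A tree like the one above can also be queried by label. The following is a small sketch that rebuilds the printed tree from its bracketed form and collects the words under a given label. The helper `chunks_with_label` is a hypothetical name of mine. One difference to keep in mind: the trees returned by `parser.parse` have `(word, tag)` tuples as leaves, while the tree rebuilt here from a string has `'word/TAG'` strings.

```python
from nltk.tree import Tree

# Rebuild the tree printed above from its bracketed string form.
example = Tree.fromstring(
    '(S (NP god/NN) (VP (V damn/VBZ)) it/PRP (VP (V put/VBD)) (P that/IN) down/RP)'
)

def chunks_with_label(tree, label):
    """Return the words of every subtree that carries the given label."""
    found = []
    for subtree in tree.subtrees(lambda t: t.label() == label):
        # Leaves are 'word/TAG' strings here; keep only the word part.
        words = [leaf.rsplit('/', 1)[0] for leaf in subtree.leaves()]
        found.append(' '.join(words))
    return found

np_chunks = chunks_with_label(example, 'NP')
vp_chunks = chunks_with_label(example, 'VP')
print(np_chunks)  # ['god']
print(vp_chunks)  # ['damn', 'put']
```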


In [5]:
full_df['Eng_Chunks'] = full_df['Eng_POS'].apply(parser.parse)
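
The `Eng_Chunks` column now holds nested `Tree` objects, whereas `Far_Chunks` holds flat bracketed strings such as `[پدر NP]`. If a comparable flat form is ever needed for English, something along these lines might work. This is only a sketch on a toy DataFrame: `flat_chunks`, `toy_parser`, `toy_df`, and the `Eng_Chunks_Flat` column are hypothetical names, and only top-level chunks are rendered.

```python
import pandas as pd
from nltk.chunk.regexp import RegexpParser
from nltk.tree import Tree

# Same grammar as above, repeated so this sketch runs on its own.
toy_parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>* <P>*} # NP
    P: {<IN>}           # Preposition
    V: {<V.*>}          # Verb
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
    ''')

def flat_chunks(tree):
    """Render only the top-level chunks, e.g. '[god NP] [damn VP] it'."""
    parts = []
    for child in tree:
        if isinstance(child, Tree):
            words = ' '.join(word for word, _tag in child.leaves())
            parts.append(f'[{words} {child.label()}]')
        else:
            # Unchunked tokens are plain (word, tag) tuples.
            parts.append(child[0])
    return ' '.join(parts)

# Two tagged rows copied from the Eng_POS column shown earlier.
toy_df = pd.DataFrame({'Eng_POS': [
    [('dad', 'NN')],
    [('god', 'NN'), ('damn', 'VBZ'), ('it', 'PRP'), ('william', 'NN')],
]})
toy_df['Eng_Chunks'] = toy_df['Eng_POS'].apply(toy_parser.parse)
toy_df['Eng_Chunks_Flat'] = toy_df['Eng_Chunks'].apply(flat_chunks)
flat = toy_df['Eng_Chunks_Flat'].tolist()
print(flat)
```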

In [6]:
full_df.head(10)

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS,Eng_Chunks
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{raspy, breathing}","{خر, صداي}","[(صداي, NUM), (خر, Ne), (خر, N)]",[صداي خر خر NP],"[(raspy, NN), (breathing, NN)]","[[(raspy, NN), (breathing, NN)]]"
2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر},"[(پدر, N)]",[پدر NP],"[(dad, NN)]","[[(dad, NN)]]"
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{maybe, its, the, wind}","{باد, صداي, شايد, باشه}","[(شايد, Ne), (صداي, AJ), (باد, V), (باشه, V)]",[شايد صداي NP] [باد VP] [باشه VP],"[(maybe, RB), (its, PRP$), (the, DT), (wind, NN)]","[(maybe, RB), (its, PRP$), [(the, DT), (wind, ..."
4,no,نه,[no],[نه],1,1,{no},{نه},"[(نه, ADV)]",نه,"[(no, DT)]","[[(no, DT)]]"
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{please, stop}","{ميکنم, داريد, نگه, خواهش, دست}","[(دست, N), (نگه, N), (داريد, V), (خواهش, Ne), ...",[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [...,"[(stop, JJ), (please, NN), (stop, VB)]","[[(stop, JJ), (please, NN)], [[('stop', 'VB')]]]"
6,you have a week evans then well burn the house,اوانز تو فقط يک هفته وقت داري وگرنه خونتو خواه...,"[you, have, a, week, evans, then, well, burn, ...","[اوانز, تو, فقط, يک, هفته, وقت, داري, وگرنه, خ...",10,11,"{burn, then, the, house, well, evans, you, wee...","{خواهيم, هفته, وقت, تو, يک, فقط, وگرنه, داري, ...","[(اوانز, Ne), (تو, PRO), (فقط, ADV), (يک, Ne),...",[اوانز تو NP] [فقط يک هفته وقت ADVP] [داري VP]...,"[(you, PRP), (have, VBP), (a, DT), (week, NN),...","[(you, PRP), [[('have', 'VBP')], [('a', 'DT'),..."
7,william,ويليام,[william],[ويليام],1,1,{william},{ويليام},"[(ويليام, N)]",[ويليام NP],"[(william, NN)]","[[(william, NN)]]"
8,god damn it william,لعنتي ويليام 8,"[god, damn, it, william]","[لعنتي, ويليام, 8]",4,3,"{god, it, william, damn}","{ويليام, 8, لعنتي}","[(لعنتي, Ne), (ويليام, N), (8, PUNC)]",[لعنتي ويليام NP] 8,"[(god, NN), (damn, VBZ), (it, PRP), (william, ...","[[(god, NN)], [[('damn', 'VBZ')]], (it, PRP), ..."
9,god damn it put that down,لعنت به تو اونو بذار زمين,"[god, damn, it, put, that, down]","[لعنت, به, تو, اونو, بذار, زمين]",6,6,"{damn, god, it, that, put, down}","{اونو, به, بذار, زمين, تو, لعنت}","[(لعنت, N), (به, P), (تو, PRO), (اونو, PRO), (...",[لعنت NP] [به PP] [تو NP] [اونو NP] [بذار VP] ...,"[(god, NN), (damn, VBZ), (it, PRP), (put, VBD)...","[[(god, NN)], [[('damn', 'VBZ')]], (it, PRP), ..."
10,let go,بذار برم,"[let, go]","[بذار, برم]",2,2,"{go, let}","{بذار, برم}","[(بذار, V), (برم, V)]",[بذار VP] [برم VP],"[(let, NN), (go, VB)]","[[(let, NN)], [[('go', 'VB')]]]"


In [8]:
full_df.to_pickle('tagged_chunked_df.pkl')
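
Pickling is convenient here because the chunk trees are Python objects, and a plain-text format would presumably flatten them into strings. The sketch below checks that a `Tree` stored in a DataFrame survives a pickle round trip. It writes to a temporary directory, and the file name `demo_chunked_df.pkl` and the one-row DataFrame are made up for illustration.

```python
import os
import tempfile

import pandas as pd
from nltk.tree import Tree

# A one-row stand-in for the real DataFrame (hypothetical data).
tree = Tree('S', [Tree('NP', [('dad', 'NN')])])
small_df = pd.DataFrame({'Eng': ['dad'], 'Eng_Chunks': [tree]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_chunked_df.pkl')  # hypothetical file name
    small_df.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored cell should still be a Tree, equal to the original.
restored_tree = restored['Eng_Chunks'].iloc[0]
print(type(restored_tree).__name__, restored_tree == tree)
```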