# POS Tagger and Shallow Parsing
#### John R. Starr; jrs294@pitt.edu
In this file, I will apply my POS tagger to my data. Following this, I will use a shallow parser to determine the grammatical structure for both English and Persian sentences. 

This is a separate file because my laptop runs Windows, and I could not get this code to work there. After a lot of struggle and problem-solving, Dan suggested I try Colab. Thankfully, Colab works! I uploaded the necessary files, ran the code that wasn't working on my laptop, and was able to POS-tag and shallow-parse my data.

I am using the Hazm library to process my Persian text; information about this library can be found [here](https://github.com/sobhe/hazm).

In [1]:
!pip install hazm



From here on, I follow the general pipeline that the creators outline in their repository.

### POS-tagging and Chunking Persian Text

In [0]:
# Importing the necessary modules
from hazm import *

In [3]:
# Building our tagger
tagger = POSTagger(model='postagger.model')
tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))

[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می\u200cخوانیم', 'V')]

In [0]:
# Building our chunker
chunker = Chunker(model='chunker.model')

In [5]:
# Test sentence for the chunker
tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
tree2brackets(chunker.parse(tagged))

'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'

It works! Sweet. Let's read in our pickled DF from our data_summary file and tag our Persian sentences.

In [0]:
import pandas as pd

In [7]:
# Reading in the file
full_df = pd.read_pickle('full_df.pkl')
full_df.head()

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{raspy, breathing}","{صداي, خر}"
2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر}
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{wind, maybe, the, its}","{باشه, صداي, شايد, باد}"
4,no,نه,[no],[نه],1,1,{no},{نه}
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{please, stop}","{خواهش, نگه, داريد, ميکنم, دست}"


In [0]:
# Creating our POS column for Farsi
full_df['Far_POS'] = full_df['Far_Tok'].apply(tagger.tag)
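
One thing that may be worth checking before tagging: the tokens in `Far_Tok` appear to use the Arabic letter ي (U+064A) where standard Persian text uses ی (U+06CC). A tagger trained on normalized Persian text may treat such tokens as unknown words, though I have not verified that this affects the tags here. Hazm ships a `Normalizer` class for this kind of cleanup; the sketch below is only a standard-library stand-in that shows the idea, and `normalize_chars` is a hypothetical helper name.

```python
# Map Arabic-script letters to their Persian counterparts.
# This is a stand-in sketch, not Hazm's own normalizer.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})

def normalize_chars(tokens):
    """Return a new token list with Arabic yeh/kaf replaced by Persian forms."""
    return [token.translate(ARABIC_TO_PERSIAN) for token in tokens]

# Two tokens from row 3 of the table, written as escapes so the code points are explicit.
sample = ["\u0634\u0627\u064a\u062f", "\u0628\u0627\u062f"]
print(normalize_chars(sample))
```

If this turns out to matter, the tagging cell could become `full_df['Far_Tok'].apply(normalize_chars).apply(tagger.tag)`; whether the tags actually improve would need to be checked against the data.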

In [0]:
# Creating our Chunks column based on the POS column
full_df['Far_Chunks'] = full_df['Far_POS'].apply(chunker.parse).apply(tree2brackets)
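
The bracket strings produced by `tree2brackets` are easy to read but awkward to count or compare. A minimal sketch of splitting them back into (words, label) pairs is shown here; `parse_brackets` is a hypothetical helper, not part of Hazm, and it assumes the `[word word LABEL]` format seen in the test output above, with unchunked tokens appearing bare.

```python
import re

# Matches one "[word word ... LABEL]" group in a tree2brackets-style string.
CHUNK_RE = re.compile(r"\[([^\[\]]+)\s([^\s\[\]]+)\]")

def parse_brackets(bracketed):
    """Split a bracket string into (words, label) pairs; bare tokens get label None."""
    chunks = []
    pos = 0
    for match in CHUNK_RE.finditer(bracketed):
        # Tokens between bracket groups were left unchunked.
        for token in bracketed[pos:match.start()].split():
            chunks.append(([token], None))
        chunks.append((match.group(1).split(), match.group(2)))
        pos = match.end()
    # Any trailing tokens after the last bracket group.
    for token in bracketed[pos:].split():
        chunks.append(([token], None))
    return chunks

example = "[کتاب خواندن NP] [را POSTP] [دوست داریم VP]"
print([label for _words, label in parse_brackets(example)])
```

Applied with `full_df['Far_Chunks'].apply(parse_brackets)`, this would give a column of lists that can be counted by label during the analysis.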

In [10]:
full_df.head()

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{raspy, breathing}","{صداي, خر}","[(صداي, NUM), (خر, Ne), (خر, N)]",[صداي خر خر NP]
2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر},"[(پدر, N)]",[پدر NP]
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{wind, maybe, the, its}","{باشه, صداي, شايد, باد}","[(شايد, Ne), (صداي, AJ), (باد, V), (باشه, V)]",[شايد صداي NP] [باد VP] [باشه VP]
4,no,نه,[no],[نه],1,1,{no},{نه},"[(نه, ADV)]",نه
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{please, stop}","{خواهش, نگه, داريد, ميکنم, دست}","[(دست, N), (نگه, N), (داريد, V), (خواهش, Ne), ...",[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [...


Sweet! It looks like we've got all the Persian tagging and chunking in order. Now we need to do the same for the English text, using nltk's modules. These may not be the best modules, but they are the only ones that I can seem to get to work... I intend to schedule at least a few OH appointments so that I can get a better POS tagger and chunker for English up and running.

Anyway, let's import nltk and download all the necessary files.

### POS-tagging and Chunking English Text

In [11]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [12]:
# Checking the default POS tagger in nltk:
nltk.pos_tag('maybe the wind is causing it. i hate it'.split())

[('maybe', 'RB'),
 ('the', 'DT'),
 ('wind', 'NN'),
 ('is', 'VBZ'),
 ('causing', 'VBG'),
 ('it.', 'NN'),
 ('i', 'NN'),
 ('hate', 'VBP'),
 ('it', 'PRP')]
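
Note that `.split()` leaves the period attached to the word, so the tagger sees `it.` as a single token, which is probably why it comes back as `NN`. NLTK's `word_tokenize` handles this but needs an additional tokenizer model download; as a rough stand-in, here is a small regex-based sketch. `simple_tokenize` is a hypothetical helper and only approximates a real tokenizer.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+(?:'\w+)? keeps contractions such as "don't" in one piece;
    # [^\w\s] matches one punctuation character at a time.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize('maybe the wind is causing it. i hate it')
print(tokens)
```

The resulting list can be passed to `nltk.pos_tag` in place of the `.split()` output. The `Eng_Tok` column in the DataFrame is already tokenized, so this only matters for ad hoc test sentences like the one above.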

In [13]:
# Seeing how chunk.ne_chunk works...
tagged_sentence = nltk.pos_tag('maybe the wind is causing it. i hate it'.split())  # POS-tagging the sentence
result = nltk.chunk.ne_chunk(tagged_sentence)  # chunking the sentence
print(result)

(S
  maybe/RB
  the/DT
  wind/NN
  is/VBZ
  causing/VBG
  it./NN
  i/NN
  hate/VBP
  it/PRP)


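So this didn't really work: `ne_chunk` is a named-entity chunker, so it only groups things like names of people, places, and organizations, and this sentence has none. What about this parser?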

In [0]:
# Second Parser off of NLTK
from nltk.parse import ParserI
from nltk.chunk.api import ChunkParserI

In [15]:
p = ChunkParserI()
p.parse(tagged_sentence)

NotImplementedError: ignored

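Hmm... this didn't work either: `ChunkParserI` is only an abstract interface, so calling `parse` on it directly raises `NotImplementedError`. I need to get a real chunker working (expect a few trips to OHs!)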
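
One option that needs no extra model download is `nltk.RegexpParser`, which chunks a tagged sentence using a hand-written grammar over POS tags. The sketch below is not something I have run on the full data; the grammar is a toy assumption (simple noun and verb groups only), and `to_brackets` is a hypothetical helper that mimics the bracket format of Hazm's `tree2brackets` so the English and Persian columns would look alike.

```python
import nltk

# Toy chunk grammar (an assumption, not a tuned one).
grammar = r"""
  NP: {<DT>?<JJ.*>*<NN.*>+}   # optional determiner, adjectives, one or more nouns
  VP: {<VB.*>+}               # one or more verb forms
"""
chunk_parser = nltk.RegexpParser(grammar)

# The tagged sentence from the pos_tag output above, copied by hand.
tagged_sentence = [('maybe', 'RB'), ('the', 'DT'), ('wind', 'NN'),
                   ('is', 'VBZ'), ('causing', 'VBG'), ('it.', 'NN'),
                   ('i', 'NN'), ('hate', 'VBP'), ('it', 'PRP')]

def to_brackets(tree):
    """Render a chunk tree as '[words LABEL]' groups; unchunked words stay bare."""
    parts = []
    for node in tree:
        if isinstance(node, nltk.Tree):
            words = " ".join(word for word, _tag in node.leaves())
            parts.append(f"[{words} {node.label()}]")
        else:
            parts.append(node[0])  # node is a (word, tag) pair
    return " ".join(parts)

tree = chunk_parser.parse(tagged_sentence)
brackets = to_brackets(tree)
print(brackets)
```

If this works well enough, an `Eng_Chunks` column could be built from `Eng_POS` in the same way as `Far_Chunks`, for example with `.apply(chunk_parser.parse).apply(to_brackets)`. The grammar would almost certainly need more rules (prepositions, pronouns) to be comparable to the Persian chunker.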

For now, since I have to hand something in, let's just add the Eng_POS column and pickle what we have to a new pkl file that we can use in our analysis.

In [0]:
full_df['Eng_POS'] = full_df['Eng_Tok'].apply(nltk.pos_tag)

In [17]:
full_df.head()

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len,Eng_Types,Far_Types,Far_POS,Far_Chunks,Eng_POS
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3,"{raspy, breathing}","{صداي, خر}","[(صداي, NUM), (خر, Ne), (خر, N)]",[صداي خر خر NP],"[(raspy, NN), (breathing, NN)]"
2,dad,پدر,[dad],[پدر],1,1,{dad},{پدر},"[(پدر, N)]",[پدر NP],"[(dad, NN)]"
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4,"{wind, maybe, the, its}","{باشه, صداي, شايد, باد}","[(شايد, Ne), (صداي, AJ), (باد, V), (باشه, V)]",[شايد صداي NP] [باد VP] [باشه VP],"[(maybe, RB), (its, PRP$), (the, DT), (wind, NN)]"
4,no,نه,[no],[نه],1,1,{no},{نه},"[(نه, ADV)]",نه,"[(no, DT)]"
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8,"{please, stop}","{خواهش, نگه, داريد, ميکنم, دست}","[(دست, N), (نگه, N), (داريد, V), (خواهش, Ne), ...",[دست NP] [نگه داريد VP] [خواهش ميکنم دست NP] [...,"[(stop, JJ), (please, NN), (stop, VB)]"


Let's pickle it:

In [0]:
full_df.to_pickle('full_df_MOD.pkl')

That's it for now! Time for some analysis.

#### NOTE
I am very frustrated with myself and my project and will be coming in to OH to figure things out and recenter.