# Starr Project
#### John R. Starr; jrs294@pitt.edu

The data is split into two folders/files, en/TEP.xml and fa/TEP.xml. Let's import what we need to properly search through this data:

In [1]:
import pandas as pd
import numpy as np
import nltk
import postagger
import xml.etree.ElementTree as ET
from lxml import etree

Building a parser that works for XML:

In [2]:
parser_test = etree.XMLParser()
tree_eng = ET.parse('TEP/raw/en/TEP.xml', parser = parser_test)

XMLSyntaxError: Char 0x0 out of allowed range, line 86381, column 25 (<string>, line 86381)

Hmmm... After some preliminary attempts at building trees, I found that some of my data contains corrupted characters (or at least something along those lines). One example of these encoding errors is the following sentence: `<s id="86377">simple caf oronary . freak show choked to death .</s>` When the file is opened in Notepad++, the gap between "caf" and "oronary" shows up as the abbreviation NUL highlighted in black. This matches the `Char 0x0` in the error above: a null character, which XML does not allow. Other problems occur later in the dataset. I am still in the process of figuring out what all of these characters are, but for now, we'll set the parser to recover from errors:

In [3]:
parser_full = etree.XMLParser(recover = True)
tree_eng = ET.parse('TEP/raw/en/TEP.xml', parser = parser_full)
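
To pin down where the NUL characters sit before deciding how to handle them, the file can be scanned as raw bytes. This is only a minimal sketch: `find_nul_lines` is a helper name introduced here for illustration, and the sample bytes are made-up stand-ins for the real file.

```python
import io

def find_nul_lines(stream, max_hits=10):
    """Return (line_number, column) pairs for NUL bytes in a binary stream."""
    hits = []
    for line_no, line in enumerate(stream, start=1):
        col = line.find(b"\x00")
        while col != -1:
            hits.append((line_no, col + 1))  # report 1-based byte columns
            if len(hits) >= max_hits:
                return hits
            col = line.find(b"\x00", col + 1)
    return hits

# Tiny in-memory stand-in for the real file (hypothetical content).
sample = io.BytesIO(b'<s id="1">ok .</s>\n<s id="2">caf\x00oronary .</s>\n')
print(find_nul_lines(sample))  # [(2, 14)]
```

On the real data, the same function could be called on a file opened in binary mode, e.g. `open('TEP/raw/en/TEP.xml', 'rb')`. The columns are byte offsets, so they may not match the column that the parser reports.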

Now, let's build the root and see how our data is structured. I've looked through the XML file and noticed that there is a bit of a heading that looks like this (I have added spaces between the greater/less than symbols so that it remains visible in this file):

< ?xml version="1.0" encoding="utf-8"? >
< letsmt version="1.0" >
< head >< /head >
    
After that comes the body element, which contains the sentence (s) elements. So, we'll use .findall() with the path ./body/s to start where we want, and hopefully we'll be able to get an idea of what our data looks like:

In [4]:
root_eng = tree_eng.getroot()
root_eng.items()
for item in root_eng.findall('./body/s')[:5]:
    print(item.text)
    #print(dir(item))
    print(item.values())
    print(item.items())

raspy breathing .
['1']
[('id', '1')]
dad .
['2']
[('id', '2')]
maybe its the wind .
['3']
[('id', '3')]
no .
['4']
[('id', '4')]
stop please stop .
['5']
[('id', '5')]


It looks like each sentence carries an ID attribute that identifies its text. This works well, as we'll be able to match up the IDs between the two files to combine them! Let's create a test dictionary in which the key is the ID (as an integer) and the value is the text, with commas and the sentence-final " ." removed:

In [5]:
eng_lines_test = {}
for item in root_eng.findall('./body/s')[:5]:
    # Key: the sentence's id attribute as an int; value: text without commas or the final " ."
    eng_lines_test[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')

In [6]:
eng_lines_test.keys()

dict_keys([1, 2, 3, 4, 5])

In [7]:
eng_lines_test.values()

dict_values(['raspy breathing', 'dad', 'maybe its the wind', 'no', 'stop please stop'])
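
Since the same ID-to-text mapping is needed for both languages, it could be wrapped in a small helper. The sketch below is one possible version, not necessarily how the final notebook will do it: `lines_by_id` and `sample_root` are names introduced here for illustration, and the sample XML is a two-sentence stand-in built with the standard-library parser.

```python
import xml.etree.ElementTree as ET

def lines_by_id(root, limit=None):
    """Map each sentence's integer id to its text, minus commas and ' .'."""
    sentences = root.findall('./body/s')[:limit]
    return {
        # (s.text or '') guards against empty <s/> elements; this guard is
        # an assumption added here and is not in the cells above.
        int(s.get('id')): (s.text or '').replace(',', '').replace(' .', '')
        for s in sentences
    }

# Two-sentence stand-in that mimics the structure of TEP.xml.
sample_root = ET.fromstring(
    '<letsmt version="1.0"><head></head><body>'
    '<s id="1">raspy breathing .</s><s id="2">dad .</s>'
    '</body></letsmt>'
)
print(lines_by_id(sample_root))  # {1: 'raspy breathing', 2: 'dad'}
```

With the real trees, `lines_by_id(root_eng, limit=5)` would play the role of the test dictionary, and leaving out `limit` would cover the whole file.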

Awesome! Let's do the same for the Farsi text as well!

In [8]:
tree_far = ET.parse('TEP/raw/fa/TEP.xml', parser = parser_full)

In [9]:
far_lines_test = {}
root_far = tree_far.getroot()
for item in root_far.findall('./body/s')[:5]:
    # Same mapping as for English: int id -> text without commas or the final " ."
    far_lines_test[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')

In [10]:
far_lines_test.keys()

dict_keys([1, 2, 3, 4, 5])

In [11]:
far_lines_test.values()

dict_values(['صداي خر خر', 'پدر', 'شايد صداي باد باشه', 'نه', 'دست نگه داريد خواهش ميکنم دست نگه داريد'])

All in the clear! Now, let's combine the two into a single DataFrame object!

In [12]:
test_DF = pd.Series(eng_lines_test).to_frame('eng').join(pd.Series(far_lines_test).to_frame('far'), how='outer')
test_DF

| | eng | far |
|---|---|---|
| 1 | raspy breathing | صداي خر خر |
| 2 | dad | پدر |
| 3 | maybe its the wind | شايد صداي باد باشه |
| 4 | no | نه |
| 5 | stop please stop | دست نگه داريد خواهش ميکنم دست نگه داريد |


It worked! Yay! Let's try the whole thing now:

In [13]:
eng_lines = {}
for item in root_eng.findall('./body/s'):
    eng_lines[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')
print(len(eng_lines))

86377


In [14]:
far_lines = {}
for item in root_far.findall('./body/s'):
    far_lines[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')
print(len(far_lines))

612086


Wait a second... these numbers don't line up! What's going on here? The two XML files are the same length. Let's see what the DF would look like, plus a general description of the files:

In [15]:
sample_full_DF = pd.Series(eng_lines).to_frame('Eng').join(pd.Series(far_lines).to_frame('Far'), how='outer')
sample_full_DF.index.name = 'ID'
sample_full_DF

| ID | Eng | Far |
|---|---|---|
| 1 | raspy breathing | صداي خر خر |
| 2 | dad | پدر |
| 3 | maybe its the wind | شايد صداي باد باشه |
| 4 | no | نه |
| 5 | stop please stop | دست نگه داريد خواهش ميکنم دست نگه داريد |
| 6 | you have a week evans then well burn the house | اوانز تو فقط يک هفته وقت داري وگرنه خونتو خواه... |
| 7 | william | ويليام |
| 8 | god damn it william | لعنتي ويليام 8 |
| 9 | god damn it put that down | لعنت به تو اونو بذار زمين |
| 10 | let go | بذار برم |


In [16]:
sample_full_DF.describe()

| | Eng | Far |
|---|---|---|
| count | 86377 | 612086 |
| unique | 80554 | 583981 |
| top | come on | بله |
| freq | 28 | 1596 |
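
A quick way to see where the two sides diverge is to compare the ID sets directly. The sketch below uses two small stand-in dictionaries (`far_demo` and `eng_demo`, hypothetical names and data) rather than the real ones, and `missing_ids` is a helper introduced here for illustration.

```python
def missing_ids(reference, other):
    """Return the sorted keys that are in `reference` but not in `other`."""
    # dict.keys() views support set operations, so "-" gives the difference.
    return sorted(reference.keys() - other.keys())

# Stand-in data: the "English" side stops after ID 2.
far_demo = {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
eng_demo = {1: 'w', 2: 'x'}

gaps = missing_ids(far_demo, eng_demo)
print(len(gaps), gaps[0])  # 2 3
```

Applied to the real dictionaries, `missing_ids(far_lines, eng_lines)` should list every ID that has a Farsi sentence but no English one, and its first element would mark where the English side stops.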


It seems that there are _significantly_ fewer English sentences: 86377, compared with 612086 in Farsi. Since the English count stops at 86377, let's examine the problem area by looking at sentences 86375 to 86400:

In [17]:
sample_full_DF.loc[86375:86400,"Eng":"Far"]

| ID | Eng | Far |
|---|---|---|
| 86375 | hands off | دست نزنيد |
| 86376 | if you had your game faces on at the academy ... | اگه شما هم توي دانشكده بوديد ، مي فهميديد |
| 86377 | simple caf | اون خفه شده |
| 86378 | | سوالي داريد |
| 86379 | | آزمايش از دهان نشون ميده گوشت خوك خورده |
| 86380 | | در راهرو هم هست |
| 86381 | | باشه ، بيا اين كارو بكنيم ، زود باش عزيزم |
| 86382 | | هيچ علامتي از آسيب روي گلويش وجود نداره |
| 86383 | | علائم درگيري نشون ميده كه اون خفه نشده |
| 86384 | | اونو خفه كردند |

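It seems that the "weird" character that we thought we had worked around has actually cut off everything after it: the English text stops at "simple caf" in sentence 86377, right where the NUL character sits, and every later English value is missing. Going to have to fix that for the next progress report! For now, let's keep only the rows that have an English sentence, so that the lines up to that point can be prepared for tagging.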

In [18]:
sample_full_DF = sample_full_DF[sample_full_DF.Eng.notnull()]
sample_full_DF

| ID | Eng | Far |
|---|---|---|
| 1 | raspy breathing | صداي خر خر |
| 2 | dad | پدر |
| 3 | maybe its the wind | شايد صداي باد باشه |
| 4 | no | نه |
| 5 | stop please stop | دست نگه داريد خواهش ميکنم دست نگه داريد |
| 6 | you have a week evans then well burn the house | اوانز تو فقط يک هفته وقت داري وگرنه خونتو خواه... |
| 7 | william | ويليام |
| 8 | god damn it william | لعنتي ويليام 8 |
| 9 | god damn it put that down | لعنت به تو اونو بذار زمين |
| 10 | let go | بذار برم |


#### What Needs to Be Done Still
- Figure out the encoding error for my data (I will be working on this over the weekend and will probably stop by Office Hours on Monday)
- Finalize and execute search criteria.
- POS-tag the sentences and process them through a shallow parser (Parsivar).
- Analysis!
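
For the first item in the list above, one possible starting point is to strip the control bytes that XML 1.0 forbids before parsing, instead of relying on the parser's recovery mode. This is only a sketch under assumptions: it assumes the file really is UTF-8 (as its declaration says), it only handles control bytes such as NUL and not the other problems later in the dataset, and `strip_invalid_xml_bytes`, `INVALID_XML_BYTES` and the sample bytes are introduced here for illustration. It uses the standard-library parser on an in-memory sample.

```python
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids: everything below 0x20 except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
INVALID_XML_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))

def strip_invalid_xml_bytes(data):
    """Delete forbidden control bytes from raw (assumed UTF-8) XML bytes."""
    # In UTF-8, bytes below 0x80 never occur inside multi-byte characters,
    # so deleting these bytes cannot damage the Farsi text.
    return data.translate(None, INVALID_XML_BYTES)

# Made-up sample with a NUL in the same spot as sentence 86377.
raw = b'<body><s id="1">simple caf\x00oronary .</s><s id="2">next .</s></body>'
root = ET.fromstring(strip_invalid_xml_bytes(raw))
print([s.text for s in root.findall('s')])  # ['simple caforonary .', 'next .']
```

Note that deleting the byte joins the text on either side of it ("caforonary"). Whether the NUL should be dropped, replaced with a space, or restored to some original character is still an open question for the real data.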