# Data Summary
#### John R. Starr; jrs294@pitt.edu

The data is split across two files, en/TEP.xml (English) and fa/TEP.xml (Farsi). Let's import what we need to search through this data:

In [1]:
import pandas as pd
import numpy as np
import nltk
# import postagger
import xml.etree.ElementTree as ET
from lxml import etree

In [2]:
import os
os.getcwd()

'C:\\Users\\16starjo\\Documents\\Data_Science\\Scrambling-in-English-to-Persian-Subtitles'

Building a parser that works for XML:

In [3]:
parser_test = etree.XMLParser()
tree_eng = ET.parse('Private/TEP/raw/en/TEP.xml', parser = parser_test)

XMLSyntaxError: Char 0x0 out of allowed range, line 86381, column 25 (<string>, line 86381)

Hmmm... After some preliminary attempts at building trees, I found that the data contains some corrupted characters. One example of these encoding errors is in the following sentence: <s id="86377">simple caf oronary . freak show choked to death .</s> When the file is opened in Notepad++, the gap between "caf" and "oronary" is displayed as NUL highlighted in black, i.e. a null character, which matches the `Char 0x0` in the parser error above. Other problems occur later in the dataset.
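
As a sketch of how such characters could be located without opening the file in an editor, the snippet below scans raw bytes for NUL (0x00) and reports 1-based line numbers. The function name `find_nul_lines` and the sample bytes are illustrative assumptions, not part of this project's code.

```python
def find_nul_lines(data: bytes) -> list:
    """Return 1-based line numbers of lines containing a NUL byte."""
    return [
        lineno
        for lineno, line in enumerate(data.split(b"\n"), start=1)
        if b"\x00" in line
    ]


# Made-up sample: the second line mimics the corrupted sentence above.
sample = (
    b'<s id="1">an intact line .</s>\n'
    b'<s id="2">simple caf\x00oronary .</s>\n'
)
nul_lines = find_nul_lines(sample)
print(nul_lines)
```

For the real file, the bytes would come from reading it in binary mode, e.g. `open(path, 'rb').read()`.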

In order to get my parser to work, I've hand-modified the dataset. Any modifications that I made can be found in the data_modifications.txt file in this directory. The name of this edited file is 'TEP_mod.xml' and will be used for the remainder of this project.

In [4]:
parser_full = etree.XMLParser(recover = True)
tree_eng = ET.parse('Private/TEP/raw/en/TEP_mod.xml', parser = parser_full)

Now, let's build the root and see how our data is structured. Looking through the XML file, I noticed that it begins with a short header that looks like this (I have added spaces inside the angle brackets so that the tags remain visible in this file):

< ?xml version="1.0" encoding="utf-8"? >
< letsmt version="1.0" >
< head >< /head >
    
After that comes the body element, which contains the sentences. So, we'll use .findall() with a path that starts where we want it to start, and hopefully we'll be able to get an idea of what our data looks like:

In [5]:
root_eng = tree_eng.getroot()
root_eng.items()
for item in root_eng.findall('./body/s')[:5]:
    print(item.text)
    #print(dir(item))
    print(item.values())
    print(item.items())

raspy breathing .
['1']
[('id', '1')]
dad .
['2']
[('id', '2')]
maybe its the wind .
['3']
[('id', '3')]
no .
['4']
[('id', '4')]
stop please stop .
['5']
[('id', '5')]
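
To make the path expression concrete, here is a minimal sketch on an in-memory document that mimics the layout described above (a `letsmt` root, an empty `head`, and `s` elements under `body`). The sample string is made up for illustration, and the standard-library parser is used instead of lxml so the snippet has no extra dependencies.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the TEP layout.
sample_xml = """<letsmt version="1.0">
<head></head>
<body>
<s id="1">raspy breathing .</s>
<s id="2">dad .</s>
</body>
</letsmt>"""

root = ET.fromstring(sample_xml)

# './body/s' selects every <s> that is a direct child of <body>.
pairs = [(s.get("id"), s.text) for s in root.findall("./body/s")]
print(pairs)
```

Here `s.get("id")` reads the attribute value directly as a string, which avoids any string cleanup on the output of `.values()`.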


It looks like each sentence carries an ID attribute that identifies its text. This works well, as we'll be able to match up the IDs between the two files to combine them! Let's create a test dictionary in which the key is the ID and the value is the text, stripping out extraneous characters:

In [6]:
eng_lines_test = {}
for item in root_eng.findall('./body/s')[:5]:
    # Key: numeric sentence ID; value: text with commas and ' .' removed.
    eng_lines_test[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')

In [7]:
eng_lines_test.keys()

dict_keys([1, 2, 3, 4, 5])

In [8]:
eng_lines_test.values()

dict_values(['raspy breathing', 'dad', 'maybe its the wind', 'no', 'stop please stop'])

Awesome! Let's do the same for the Farsi text as well!

In [9]:
tree_far = ET.parse('Private/TEP/raw/fa/TEP.xml', parser = parser_full)

In [10]:
far_lines_test = {}
root_far = tree_far.getroot()
for item in root_far.findall('./body/s')[:5]:
    # Key: numeric sentence ID; value: text with commas and ' .' removed.
    far_lines_test[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')

In [11]:
far_lines_test.keys()

dict_keys([1, 2, 3, 4, 5])

In [12]:
far_lines_test.values()

dict_values(['صداي خر خر', 'پدر', 'شايد صداي باد باشه', 'نه', 'دست نگه داريد خواهش ميکنم دست نگه داريد'])

All in the clear! Now, let's combine the two into a single DataFrame object!

In [13]:
test_DF = pd.Series(eng_lines_test).to_frame('eng').join(pd.Series(far_lines_test).to_frame('far'), how='outer')
test_DF

,eng,far
1,raspy breathing,صداي خر خر
2,dad,پدر
3,maybe its the wind,شايد صداي باد باشه
4,no,نه
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد


It worked! Yay! Let's try the whole thing now:

In [14]:
eng_lines = {}
for item in root_eng.findall('./body/s'):
    # Key: numeric sentence ID; value: text with commas and ' .' removed.
    eng_lines[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')
print(len(eng_lines))

612086


In [15]:
far_lines = {}
for item in root_far.findall('./body/s'):
    # Key: numeric sentence ID; value: text with commas and ' .' removed.
    far_lines[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')
print(len(far_lines))

612086


Cool! Both files have the same number of lines. Let's see what the full DataFrame looks like, plus a general description of its contents:

In [16]:
full_df = pd.Series(eng_lines).to_frame('Eng').join(pd.Series(far_lines).to_frame('Far'), how='outer')
full_df.index.name = 'ID'
full_df

ID,Eng,Far
1,raspy breathing,صداي خر خر
2,dad,پدر
3,maybe its the wind,شايد صداي باد باشه
4,no,نه
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد
6,you have a week evans then well burn the house,اوانز تو فقط يک هفته وقت داري وگرنه خونتو خواه...
7,william,ويليام
8,god damn it william,لعنتي ويليام 8
9,god damn it put that down,لعنت به تو اونو بذار زمين
10,let go,بذار برم
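
Equal counts do not by themselves guarantee that both files use the same IDs. A minimal sketch of an explicit check follows, using small made-up dictionaries in place of `eng_lines` and `far_lines`; the helper name `id_mismatches` is hypothetical.

```python
def id_mismatches(left: dict, right: dict) -> tuple:
    """Return (IDs only in left, IDs only in right), each sorted."""
    left_only = sorted(left.keys() - right.keys())
    right_only = sorted(right.keys() - left.keys())
    return left_only, right_only


# Toy stand-ins for eng_lines and far_lines.
eng_toy = {1: "raspy breathing", 2: "dad", 3: "no"}
far_toy = {1: "صداي خر خر", 2: "پدر", 4: "نه"}

mismatches = id_mismatches(eng_toy, far_toy)
print(mismatches)
```

With an outer join, any ID present in only one file would appear as a row with a missing value in the other column, so two empty lists here mean the join introduces no gaps.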


In [17]:
full_df.describe()

,Eng,Far
count,612086,612086
unique,520291,583981
top,yes.,بله
freq,1262,1596
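
The most frequent English value above is `yes.` with its period intact. One possible explanation, offered here as an assumption rather than something verified against the corpus, is that the cleaning step only removes a period that is preceded by a space. The sketch below reproduces that cleaning logic in a hypothetical helper `clean_line`:

```python
def clean_line(text: str) -> str:
    """Mirror the notebook's cleaning: drop commas and ' .' sequences."""
    return text.replace(",", "").replace(" .", "")


# 'yes.' has no space before the period, so it passes through unchanged.
cleaned = [clean_line(t) for t in ["dad .", "stop please stop .", "yes."]]
print(cleaned)
```

If that is what is happening, stripping trailing periods regardless of spacing would merge `yes.` with `yes`, which would change the `unique`, `top`, and `freq` figures.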


In [18]:
# TOKENIZE, PERHAPS ELIMINATE SINGLE WORDS?

In [19]:
# Constructing Token columns
full_df['Eng_Tok'] = full_df['Eng'].apply(nltk.word_tokenize)
full_df['Far_Tok'] = full_df['Far'].apply(nltk.word_tokenize)

In [20]:
# Constructing Len columns
full_df['Eng_Len'] = full_df['Eng_Tok'].apply(len)
full_df['Far_Len'] = full_df['Far_Tok'].apply(len)

In [21]:
full_df.head()

ID,Eng,Far,Eng_Tok,Far_Tok,Eng_Len,Far_Len
1,raspy breathing,صداي خر خر,"[raspy, breathing]","[صداي, خر, خر]",2,3
2,dad,پدر,[dad],[پدر],1,1
3,maybe its the wind,شايد صداي باد باشه,"[maybe, its, the, wind]","[شايد, صداي, باد, باشه]",4,4
4,no,نه,[no],[نه],1,1
5,stop please stop,دست نگه داريد خواهش ميکنم دست نگه داريد,"[stop, please, stop]","[دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد]",3,8


In [22]:
# Seeing how many one-word lines there are in English, Farsi, and both.
eng_1word = [x for x in full_df['Eng_Len'] if x == 1]
print(len(eng_1word))

far_1word = [x for x in full_df['Far_Len'] if x == 1]
print(len(far_1word))

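# Rows where both the English and the Farsi line are a single token.
# This count is not in the recorded output below: the original cell
# printed len(far_1word) a second time instead.
both_1word = full_df[(full_df['Eng_Len'] == 1) & (full_df['Far_Len'] == 1)]
print(len(both_1word))

59483
35506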
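
The three counts can be sketched on toy data with plain Python. The token lists below are made up; the point is that the "both" count needs the two length conditions combined row by row, so it can never exceed either single-language count.

```python
# Toy (English tokens, Farsi tokens) pairs standing in for DataFrame rows.
rows = [
    (["dad"], ["پدر"]),
    (["no"], ["نه"]),
    (["let", "go"], ["بذار", "برم"]),
    (["william"], ["لعنتي", "ويليام"]),
]

eng_1 = sum(1 for eng, far in rows if len(eng) == 1)
far_1 = sum(1 for eng, far in rows if len(far) == 1)
both_1 = sum(1 for eng, far in rows if len(eng) == 1 and len(far) == 1)
print(eng_1, far_1, both_1)
```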


In [23]:
full_df.to_pickle('full_df.pkl')

#### What Needs to Be Done Still
- Finalize and execute search criteria.
- POS-tag the sentences and run them through a shallow parser (Parsivar).
- Analysis!