# Data Summary
#### John R. Starr; jrs294@pitt.edu

The data is split into two folders/files, en/TEP.xml and fa/TEP.xml. Let's import what we need to properly search through this data:

In [1]:
import pandas as pd
import numpy as np
import nltk
import xml.etree.ElementTree as ET
from lxml import etree

Making sure that we're in the correct directory:

In [2]:
import os
os.getcwd()

'C:\\Users\\16starjo\\Documents\\Data_Science\\Scrambling-in-English-to-Persian-Subtitles'

First, we need to build a parser that works for XML. I found documentation on etree.XMLParser() [here](https://lxml.de/api/lxml.etree.XMLParser-class.html). After some preliminary efforts in building trees, I found that some of my data contains corrupted characters (or at least something along those lines). One of these encoding errors appears in the following sentence: `<s id="86377">simple caf oronary . freak show choked to death .</s>` When opened in Notepad++, the gap between "caf" and "oronary" is displayed as the abbreviation NUL highlighted in black, which is how Notepad++ shows a null character. Other problems occur later in the dataset.

In order to get my parser to work, I've hand-modified the dataset. Any modifications that I made can be found in the data_modifications.txt file [here](https://github.com/Data-Science-for-Linguists-2019/Scrambling-in-English-to-Persian-Subtitles/blob/master/data_modifications.txt). The name of this edited file is 'TEP_mod.xml' and will be used for the remainder of this project.
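As an alternative to editing by hand, characters like the NUL above could in principle be stripped programmatically before parsing. The following is only a sketch of that idea, not what was done for this project: the helper name `strip_illegal_xml_chars` and the sample string are hypothetical, and the pattern covers only the control characters that XML 1.0 does not allow.

```python
import re

# Control characters that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), newline (0x0A) and carriage return (0x0D).
ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_illegal_xml_chars(text):
    """Replace each XML-illegal control character with a single space."""
    return ILLEGAL_XML_CHARS.sub(' ', text)

# Hypothetical sample line containing a NUL character
sample = '<s id="86377">simple caf\x00oronary . freak show choked to death .</s>'
cleaned = strip_illegal_xml_chars(sample)
print(cleaned)
```

Any automatic clean-up like this would still need to be checked against the list in data_modifications.txt, since not every problem in the file is necessarily a control character.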

In [4]:
# Creating a new parser
parser_full = etree.XMLParser(recover = True)
tree_eng = ET.parse('Private/TEP/raw/en/TEP_mod.xml', parser = parser_full)

Now, let's build the root and see how our data is structured. I've looked through the XML file and noticed that there is a bit of a heading that looks like this (I have added spaces between the greater/less than symbols so that it remains visible in this file):

< ?xml version="1.0" encoding="utf-8"? >
< letsmt version="1.0" >
< head >< /head >
    
After that, we have the body element, which contains the sentences, each in its own s element. So, we'll use .findall() to start at the sentences, and hopefully we'll be able to get an idea of what our data looks like:

In [5]:
root_eng = tree_eng.getroot()
root_eng.items()
for item in root_eng.findall('./body/s')[:5]:
    print(item.text)
    #print(dir(item))
    print(item.values())
    print(item.items())

raspy breathing .
['1']
[('id', '1')]
dad .
['2']
[('id', '2')]
maybe its the wind .
['3']
[('id', '3')]
no .
['4']
[('id', '4')]
stop please stop .
['5']
[('id', '5')]
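To make the structure concrete, here is a small self-contained sketch that uses the standard library's ElementTree on a miniature document with the same layout. The sample XML is a stand-in I wrote for illustration, not an excerpt of the real file.

```python
import xml.etree.ElementTree as ET

# A miniature stand-in for TEP.xml with the same layout: letsmt > body > s
sample_xml = """<letsmt version="1.0">
<head></head>
<body>
<s id="1">raspy breathing .</s>
<s id="2">dad .</s>
</body>
</letsmt>"""

root = ET.fromstring(sample_xml)

# './body/s' selects every <s> element that is a direct child of <body>
sentences = root.findall('./body/s')
for s in sentences:
    # .get('id') returns the attribute value as a string; .text is the sentence
    print(s.get('id'), s.text)
```

Reading the attribute with `.get('id')` gives the ID directly, without going through the list returned by `.values()`.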


It looks like each sentence carries an ID attribute that identifies its text. This works well, as we'll be able to match up the IDs between the two files to combine them! Let's create a test dictionary in which the key is the ID and the value is the text, with commas and sentence-final periods stripped out:

In [6]:
eng_lines_test = {}
for item in root_eng.findall('./body/s')[:5]:
    # Key: the integer ID; value: the text with commas and ' .' removed
    eng_lines_test[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')

In [7]:
eng_lines_test.keys()

dict_keys([1, 2, 3, 4, 5])

In [8]:
eng_lines_test.values()

dict_values(['raspy breathing', 'dad', 'maybe its the wind', 'no', 'stop please stop'])
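The same dictionary-building step can be wrapped in a small function. This is a sketch under a few assumptions: the name `lines_by_id` is hypothetical, the sample document is made up, and the guard for empty elements is an extra precaution (an `s` element without text has `.text` set to `None`, which would make `.replace()` fail).

```python
import xml.etree.ElementTree as ET

def lines_by_id(root):
    """Map each sentence ID (as int) to its text, minus commas and ' .'."""
    lines = {}
    for s in root.findall('./body/s'):
        text = s.text or ''  # an empty <s> element has text None
        lines[int(s.get('id'))] = text.replace(',', '').replace(' .', '')
    return lines

# Made-up sample with one empty sentence to exercise the guard
sample_root = ET.fromstring(
    '<letsmt><body>'
    '<s id="1">raspy breathing .</s>'
    '<s id="2">dad .</s>'
    '<s id="3"></s>'
    '</body></letsmt>'
)
sample_lines = lines_by_id(sample_root)
print(sample_lines)
```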

Awesome! Let's do the same for the Farsi text as well!

In [9]:
tree_far = ET.parse('Private/TEP/raw/fa/TEP.xml', parser = parser_full)

In [10]:
far_lines_test = {}
root_far = tree_far.getroot()
for item in root_far.findall('./body/s')[:5]:
    # Same cleaning as for English: integer ID as key, stripped text as value
    far_lines_test[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')

In [11]:
far_lines_test.keys()

dict_keys([1, 2, 3, 4, 5])

In [12]:
far_lines_test.values()

dict_values(['صداي خر خر', 'پدر', 'شايد صداي باد باشه', 'نه', 'دست نگه داريد خواهش ميکنم دست نگه داريد'])

All in the clear! Now, let's combine the two into a single DataFrame object!

In [13]:
test_DF = pd.Series(eng_lines_test).to_frame('eng').join(pd.Series(far_lines_test).to_frame('far'), how='outer')
test_DF

| | eng | far |
|---|---|---|
| 1 | raspy breathing | صداي خر خر |
| 2 | dad | پدر |
| 3 | maybe its the wind | شايد صداي باد باشه |
| 4 | no | نه |
| 5 | stop please stop | دست نگه داريد خواهش ميکنم دست نگه داريد |


It worked! Yay! Let's apply this methodology for both of the files in full, rather than the little pieces we've been testing:

In [14]:
eng_lines = {}
for item in root_eng.findall('./body/s'):
    eng_lines[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')
print(len(eng_lines))

612086


In [15]:
far_lines = {}
for item in root_far.findall('./body/s'):
    far_lines[int(item.get('id'))] = item.text.replace(',', '').replace(' .', '')
print(len(far_lines))

612086
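Matching lengths are a good sign, though they do not by themselves show that both files use exactly the same IDs. A quick way to check is to compare the key sets. The sketch below uses toy dictionaries, and the helper name `id_mismatches` is hypothetical.

```python
def id_mismatches(left, right):
    """Return the IDs found in only one of the two dictionaries."""
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    return only_left, only_right

# Toy data: ID 3 exists only in English, ID 4 only in Farsi
toy_eng = {1: 'dad', 2: 'no', 3: 'let go'}
toy_far = {1: 'پدر', 2: 'نه', 4: 'بذار برم'}

only_eng, only_far = id_mismatches(toy_eng, toy_far)
print(only_eng, only_far)
```

Two empty lists would mean that every ID appears in both dictionaries, so an outer join would produce no missing values.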


Cool! We have the same numbers. Let's see what the DF would look like, and then add some more information that might be useful to use in the future.

In [16]:
full_df = pd.Series(eng_lines).to_frame('Eng').join(pd.Series(far_lines).to_frame('Far'), how='outer')
full_df.index.name = 'ID'
full_df

| ID | Eng | Far |
|---|---|---|
| 1 | raspy breathing | صداي خر خر |
| 2 | dad | پدر |
| 3 | maybe its the wind | شايد صداي باد باشه |
| 4 | no | نه |
| 5 | stop please stop | دست نگه داريد خواهش ميکنم دست نگه داريد |
| 6 | you have a week evans then well burn the house | اوانز تو فقط يک هفته وقت داري وگرنه خونتو خواه... |
| 7 | william | ويليام |
| 8 | god damn it william | لعنتي ويليام 8 |
| 9 | god damn it put that down | لعنت به تو اونو بذار زمين |
| 10 | let go | بذار برم |


In [17]:
full_df.describe()

| | Eng | Far |
|---|---|---|
| count | 612086 | 612086 |
| unique | 520291 | 583981 |
| top | yes. | بله |
| freq | 1262 | 1596 |


Some interesting things to point out:
- It appears that the subtitles include non-spoken acts of communication, such as "raspy breathing" or "music playing". I'm not entirely sure how to remove this data, as it is not marked in any particular way.
- Also, poor William in the beginning! It seems that he's having a tough time...
- "Yes" and its Persian translation "بله" are the most common lines, but they do not occur at the same frequency (1,262 vs. 1,596). This may be because it is common for Persian to repeat "بله" when speaking casually.
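One more detail in the table above: the top English value is `yes.` with the period still attached, which suggests that `.replace(' .', '')` misses periods that are not preceded by a space. Below is a hedged sketch of a stricter clean-up. The helper name `strip_final_period` is hypothetical, and unlike the original cleaning it only removes a period at the end of the line.

```python
import re

def strip_final_period(line):
    """Drop commas, collapse whitespace, and remove a line-final period."""
    # Replace commas with spaces, then squeeze runs of whitespace
    line = ' '.join(line.replace(',', ' ').split())
    # Remove a final period, with or without a space in front of it
    return re.sub(r'\s*\.$', '', line)

for raw in ['yes.', 'no .', 'stop , please stop .']:
    print(repr(strip_final_period(raw)))
```

Applying something like this before building the DataFrame would change the counts reported by `describe()`, so the figures above would need to be regenerated.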

Back to the data. Let's create three more columns for each language: tokens, token count (or length), and types. This will help us navigate the data when performing analysis:

In [18]:
# Constructing Token columns
full_df['Eng_Tok'] = full_df['Eng'].apply(nltk.word_tokenize)
full_df['Far_Tok'] = full_df['Far'].apply(nltk.word_tokenize)

In [19]:
# Constructing Len columns
full_df['Eng_Len'] = full_df['Eng_Tok'].apply(len)
full_df['Far_Len'] = full_df['Far_Tok'].apply(len)

In [20]:
# Constructing Type columns
full_df['Eng_Types'] = full_df['Eng_Tok'].apply(set)
full_df['Far_Types'] = full_df['Far_Tok'].apply(set)

Seeing our resulting DF:

In [21]:
full_df.head()

| ID | Eng | Far | Eng_Tok | Far_Tok | Eng_Len | Far_Len | Eng_Types | Far_Types |
|---|---|---|---|---|---|---|---|---|
| 1 | raspy breathing | صداي خر خر | [raspy, breathing] | [صداي, خر, خر] | 2 | 3 | {breathing, raspy} | {صداي, خر} |
| 2 | dad | پدر | [dad] | [پدر] | 1 | 1 | {dad} | {پدر} |
| 3 | maybe its the wind | شايد صداي باد باشه | [maybe, its, the, wind] | [شايد, صداي, باد, باشه] | 4 | 4 | {wind, the, maybe, its} | {شايد, باشه, صداي, باد} |
| 4 | no | نه | [no] | [نه] | 1 | 1 | {no} | {نه} |
| 5 | stop please stop | دست نگه داريد خواهش ميکنم دست نگه داريد | [stop, please, stop] | [دست, نگه, داريد, خواهش, ميکنم, دست, نگه, داريد] | 3 | 8 | {stop, please} | {نگه, دست, داريد, ميکنم, خواهش} |
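The relationship between the three new columns is easy to see on a single line. The sketch below uses plain whitespace splitting as a stand-in for `nltk.word_tokenize` (an assumption that only holds because these lines have already had their punctuation stripped), and the function name `describe_line` is hypothetical.

```python
def describe_line(line):
    """Return (tokens, token count, types) for one whitespace-tokenized line."""
    tokens = line.split()  # stand-in for nltk.word_tokenize
    return tokens, len(tokens), set(tokens)

tokens, length, types = describe_line('stop please stop')
print(tokens, length, sorted(types))

# The same function works on the Persian side
far_tokens, far_length, far_types = describe_line('صداي خر خر')
print(far_length, len(far_types))
```

The token count includes repeats, while the type set keeps each distinct word once, which is why row 5 has 3 English tokens but only 2 types.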


How many one-word lines are there? If a significant portion of my data consists of one-word lines, then it will be pretty challenging to get results analyzing the syntax of the subtitles. I'm going to predict that there will be more Persian one-word lines, as you don't need to include the subject in Persian and can simply utter a verb. 

In [22]:
# Seeing how many one-word lines there are in English and in Farsi.
eng_1word = full_df[full_df['Eng_Len'] == 1]
print(len(eng_1word))

far_1word = full_df[full_df['Far_Len'] == 1]
print(len(far_1word))

# Rows where the English and the Farsi line are both one word (kept for later analysis)
both_1word = full_df[(full_df['Eng_Len'] == 1) & (full_df['Far_Len'] == 1)]

# Overall length of file:
len(full_df)

59483
35506


612086

Interesting! It looks like my hypothesis was wrong: English has 59,483 one-word lines, while Persian has only 35,506. How much the two sets overlap, that is, how many lines are one word in both languages, is something I will investigate further in my data analysis.
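Counting the lines that are one word in both languages means testing both lengths on the same row. Here is a minimal sketch on made-up length pairs (the values are illustrative, not taken from the corpus):

```python
# Toy (English length, Farsi length) pairs standing in for Eng_Len / Far_Len
toy_lengths = [(1, 1), (1, 2), (3, 1), (1, 1), (4, 4)]

eng_one = sum(1 for eng, far in toy_lengths if eng == 1)
far_one = sum(1 for eng, far in toy_lengths if far == 1)
# The joint count applies both conditions to the same row
both_one = sum(1 for eng, far in toy_lengths if eng == 1 and far == 1)

print(eng_one, far_one, both_one)
```

In this toy data the joint count is smaller than either single-language count, so equal totals in two columns would not by themselves show that the same rows are involved.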

One-word lines make up just under 10% of the English lines and just under 6% of the Persian lines. That is not a large portion of my data, but I am considering removing them because they do not help me with my analysis. I will be meeting with one of the instructors to get a second opinion on this matter.

Let's pickle the data as a whole and move it to my private folder, pickle a small portion of the data, get a little more information on the data, and then move on to POS tagging and shallow parsing!

In [23]:
full_df.to_pickle('full_df.pkl') # This will be put in my private folder. 

In [24]:
# Seeing general information about the DF
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 612086 entries, 1 to 612086
Data columns (total 8 columns):
Eng          612086 non-null object
Far          612086 non-null object
Eng_Tok      612086 non-null object
Far_Tok      612086 non-null object
Eng_Len      612086 non-null int64
Far_Len      612086 non-null int64
Eng_Types    612086 non-null object
Far_Types    612086 non-null object
dtypes: int64(2), object(6)
memory usage: 62.0+ MB


Well, we don't have any null values! This is good (and expected). What's the average sentence length for each language?

In [25]:
full_df.Eng_Len.value_counts()

6     60403
1     59483
7     58117
5     57950
8     54154
4     52814
9     47822
10    40727
3     40255
2     36142
11    32389
12    24742
13    18157
14    12105
15     7419
16     4401
17     2397
18     1293
19      618
20      312
21      178
22      100
23       52
24       16
25       15
26        9
27        6
28        5
29        3
31        1
34        1
Name: Eng_Len, dtype: int64

In [26]:
full_df.Far_Len.value_counts()

2     67818
5     60178
6     60088
4     56988
3     55920
7     55866
8     49803
9     42446
1     35506
10    34563
11    27501
12    20660
13    15037
14    10603
15     7179
16     4809
17     2958
18     1709
19     1052
20      641
21      353
22      174
23       94
24       51
25       35
26       25
27        9
28        6
30        6
29        5
32        1
31        1
34        1
Name: Far_Len, dtype: int64

In [27]:
# Average English sentence length
eng_len_avg = full_df['Eng_Len'].mean()
print(eng_len_avg)

6.738687047244995


In [28]:
# Average Farsi sentence length
far_len_avg = full_df['Far_Len'].mean()
print(far_len_avg)

6.46187953980323


The two languages are comparable in average length: about 6.74 tokens per line in English and about 6.46 in Persian. The value counts above point the same way, since both distributions are concentrated below roughly 10 tokens and tail off quickly after that.
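The link between the value counts and the averages can be made explicit: the mean is the count-weighted average of the lengths. Here is a small sketch using the English counts printed above (the dictionary is copied from that output, and the variable names are illustrative):

```python
# English value counts from the output above: {line length: number of lines}
eng_len_counts = {
    1: 59483, 2: 36142, 3: 40255, 4: 52814, 5: 57950, 6: 60403,
    7: 58117, 8: 54154, 9: 47822, 10: 40727, 11: 32389, 12: 24742,
    13: 18157, 14: 12105, 15: 7419, 16: 4401, 17: 2397, 18: 1293,
    19: 618, 20: 312, 21: 178, 22: 100, 23: 52, 24: 16, 25: 15,
    26: 9, 27: 6, 28: 5, 29: 3, 31: 1, 34: 1,
}

total_lines = sum(eng_len_counts.values())
total_tokens = sum(length * count for length, count in eng_len_counts.items())

# Weighted mean: total tokens divided by total lines
weighted_mean = total_tokens / total_lines
print(total_lines, weighted_mean)
```

The total comes out to the 612086 lines in the DataFrame, and the weighted mean agrees with the English average computed earlier.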

I think that's it for data exploration and summarization. Let's move to tagging and parsing the data!