Elena Cimino, e.cimino@pitt.edu, 2019.02.12 - 2019.02.17

# The Corpus: 
- BALC (the BUiD Arab Learner Corpus) is a corpus compiled by researchers at the British University in Dubai. It contains writing by native Arabic speakers (L1 Arabic) in the United Arab Emirates who speak or have learned English as a Foreign Language (EFL).
- The corpus contains 1,865 handwritten texts.
- All texts are available as plain text documents; some are also accompanied by jpeg/png image files of the essays.
- There are several sources for the texts.

# The Layout: 
- Each individual source has its own subdirectory, which holds the files from that source.
- A "CEPA Images" folder holds the images and plain-text versions of the CEPA essays.
- A "total" folder holds every text in the corpus.
- Therefore, each text in the corpus should be available in its individual folder as well as the "total" folder.
- Additionally, each text from the CEPA examination should be available in _three_ different locations: 
    - The "CEPA Images" folder;
    - The individual "CEPA_#" folder; 
    - The "total" folder.
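
The three-location claim above can be checked mechanically. What follows is a minimal sketch, not part of the original exploration: `essay_id` and `missing_by_location` are hypothetical helper names, and the sketch assumes that every CEPA file name contains a nine-digit essay ID (e.g. `200604231`). The paths in `example` are toy values, not real corpus contents.

```python
import re

# Assumption: each CEPA file name contains a nine-digit essay ID.
ID_PATTERN = re.compile(r'\d{9}')

def essay_id(path):
    """Return the nine-digit ID found in the file name part of a path, or None."""
    name = path.rsplit('/', 1)[-1]
    match = ID_PATTERN.search(name)
    return match.group() if match else None

def missing_by_location(locations):
    """For each location, list the IDs that appear elsewhere but not there.

    `locations` maps a label (e.g. "total") to a list of file paths.
    """
    ids = {label: {essay_id(p) for p in paths} - {None}
           for label, paths in locations.items()}
    everything = set().union(*ids.values())
    return {label: sorted(everything - found) for label, found in ids.items()}

# Toy paths for illustration only, not real corpus contents.
example = {
    "CEPA Images": ["CEPA Images/1/200600001.txt"],
    "CEPA_#": ["cepa1_em/CEPA 1 200600001.txt", "cepa2_em/CEPA 2 200600002.txt"],
    "total": ["total/CEPA 1 200600001.txt", "total/CEPA 2 200600002.txt"],
}
print(missing_by_location(example))
```

Because the comparison is by ID only, an essay filed under two different levels collapses into a single entry here; level information has to be checked separately.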
    
# The Exploration: 
- Double-checking the format of the files in the different folders and checking the file counts.
- I'm mainly interested in the CEPA files right now, since their proficiency levels (1-6, with 1 being the lowest perceived proficiency level and 6 the highest) can be compared against the proficiency levels found in the other corpus I'm looking at, the Pitt ELI Corpus.

In [1]:
import glob
import re

%pprint            # to turn off pretty printing
cor_dir = "private/BUiD Arab Learner Corpus v.1/"

Pretty printing has been turned OFF


In [2]:
# globbing the "CEPA Images" folder, one sub-folder per proficiency level
cepa_im_1 = glob.glob(cor_dir+'CEPA Images/1/*.txt')
cepa_im_2 = glob.glob(cor_dir+'CEPA Images/2/*.txt')
cepa_im_3 = glob.glob(cor_dir+'CEPA Images/3/*.txt')
cepa_im_4 = glob.glob(cor_dir+'CEPA Images/4/*.txt')
cepa_im_5 = glob.glob(cor_dir+'CEPA Images/5/*.txt')
cepa_im_6 = glob.glob(cor_dir+'CEPA Images/6/*.txt')
cepa_im = cepa_im_1 + cepa_im_2 + cepa_im_3 + cepa_im_4 + cepa_im_5 + cepa_im_6
print(len(cepa_im))
cepa_im[:5]

938


['private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200604231.txt', 'private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200602456.txt', 'private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200600490.txt', 'private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200605503.txt', 'private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200604379.txt']

In [4]:
len(cepa_im_1)

189

In [5]:
len(cepa_im_2)

124

In [6]:
len(cepa_im_3)

155

In [7]:
len(cepa_im_4)

261

In [8]:
len(cepa_im_5)

125

In [9]:
len(cepa_im_6)

84
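
The six near-identical `glob` calls and `len` checks above could also be written as a single loop. This is only a sketch, assuming the same `CEPA Images/<level>/` layout; `glob_by_level` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import glob
import os

def glob_by_level(images_dir, levels=range(1, 7)):
    """Map each proficiency level to the .txt paths in its numbered sub-folder."""
    return {level: glob.glob(os.path.join(images_dir, str(level), '*.txt'))
            for level in levels}

# Path as used earlier in this notebook; an empty result just means
# the folder is not present on this machine.
by_level = glob_by_level("private/BUiD Arab Learner Corpus v.1/CEPA Images")
for level, paths in by_level.items():
    print(level, len(paths))
print("all levels:", sum(len(paths) for paths in by_level.values()))
```

Keeping the results in a dictionary keyed by level means the per-level counts and the overall total come from the same object.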

In [10]:
'200609914' in cepa_im_5

False

Not a fabulous way to name files -- identifying a file may depend on the folder structure rather than on the filename itself. I'll double-check whether any files share a name.
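
A side note on the membership test in the previous cell: it returns `False` whether or not the essay exists, because `cepa_im_5` holds full paths rather than bare IDs. Below is a minimal sketch of a check against the file stems instead. `stems` is a hypothetical helper, and the paths are made up for illustration.

```python
import os

def stems(paths):
    """Return the set of file names with directory and extension stripped."""
    return {os.path.splitext(os.path.basename(p))[0] for p in paths}

# Made-up paths in the same shape as the glob results above.
paths = [
    "private/BUiD Arab Learner Corpus v.1/CEPA Images/5/200600001.txt",
    "private/BUiD Arab Learner Corpus v.1/CEPA Images/5/200600002.txt",
]
print('200600001' in paths)         # False: the list holds full paths
print('200600001' in stems(paths))  # True: compared against bare IDs
```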

In [3]:
# Making a list of the short-hand names of the cepa_images folder
nums = []
for file in cepa_im:
    start = file.rindex('/')+1
    name = file[start:-4]
    nums.append(name)

# Counting the number of occurrences of each short-hand filename from nums
import collections
c = collections.Counter(nums)

# collecting files with more than 1 occurrence
more = [x for x in c if c[x] > 1]

# Seeing how many files are in "more"
print(len(more))

# oh, there's just one, let's see which file it is and how many occurrences there are
print(more)
c['200602841']

1
['200602841']


2

In [4]:
repeat = []
for f in cepa_im:
    if '200602841' in f:
        repeat.append(f)
repeat

['private/BUiD Arab Learner Corpus v.1/CEPA Images/4/200602841.txt', 'private/BUiD Arab Learner Corpus v.1/CEPA Images/5/200602841.txt']

In [5]:
for essay in cepa_im:
    if '200602841' in essay:
        # the context manager closes the file even if reading fails
        with open(essay) as f:
            txt = f.read()
        print(essay+":")
        print(txt)
        print()

private/BUiD Arab Learner Corpus v.1/CEPA Images/4/200602841.txt:
The best movie  that I have recently is (The Dafenshi Code) and one of the
leading roles is played by a famouse novie star ( gorge clonie ) he plays the hero.
The movie is about a super vikan that tries to speak the dafenshi book from the 
vadicane musiam but he misses up and gets tailled by a cope which tries
toprevint he from using a code inthe book that is said to cause
the end of the world (Armagedon) in the end good triamps over evil
and every thing goes back to normal.I liked the movie because it`s 
about a mistrey that no one knew how to solve accept the evil vikan
and because it`s based on history and real events,but I was abite
disapoint ed because the ending was to easy to pradict I like the endings
to be so twisted and unpradictable that you have to watch the movie 2 or
3 times to under stand what really happined.Never the less I enjoyed
the movie I liked it so nuch that I  saw it 6 limes in one week.

private

Well, that will do it. "CEPA-4-200602841" and "CEPA-5-200602841" are identical files. 
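
Comparing two printouts by eye is error-prone, so the claim can also be confirmed directly. Here is a minimal sketch using the standard library's `filecmp.cmp` with `shallow=False`, which compares the actual file contents. `same_content` is a hypothetical helper name, and the demonstration uses throwaway files rather than the corpus.

```python
import filecmp
import os
import tempfile

def same_content(path_a, path_b):
    """Return True if the two files have byte-for-byte identical contents."""
    return filecmp.cmp(path_a, path_b, shallow=False)

# Self-contained demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    for path in (a, b):
        with open(path, "w") as f:
            f.write("same essay text\n")
    print(same_content(a, b))
```

In this notebook the call would be `same_content(repeat[0], repeat[1])`.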

Let's check out the other two folders and see what's going on with that.

In [6]:
# globbing each cepa folder 
# folders with "_em" suffix have been coded for student corrections
cepa_1 = glob.glob(cor_dir+'cepa1_em/*.txt')
cepa_2 = glob.glob(cor_dir+'cepa2_em/*.txt')
cepa_3 = glob.glob(cor_dir+'cepa3_em/*.txt')
cepa_4 = glob.glob(cor_dir+'cepa4_em/*.txt')
cepa_5 = glob.glob(cor_dir+'cepa5/*.txt')
cepa_6 = glob.glob(cor_dir+'cepa6/*.txt')
cepa_sub = cepa_1 + cepa_2 + cepa_3 + cepa_4 + cepa_5 + cepa_6
print(len(cepa_sub))
cepa_sub[:5]

1676


['private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200611825.txt', 'private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200606381.txt', 'private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200603548.txt', 'private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200610508.txt', 'private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200603206.txt']

In [7]:
# globbing "total" folder
cepa_total = [f for f in glob.glob(cor_dir+"total/*.txt") if "CEPA" in f or "Cepa" in f or "cepa" in f or "CEPa" in f]
print(len(cepa_total))
cepa_total[:5]

1676


['private/BUiD Arab Learner Corpus v.1/total/CEPA 3 200607296.txt', 'private/BUiD Arab Learner Corpus v.1/total/CEPA 4 200607457.txt', 'private/BUiD Arab Learner Corpus v.1/total/CEPA 5 200600487.txt', 'private/BUiD Arab Learner Corpus v.1/total/CEPA 4 200608016.txt', 'private/BUiD Arab Learner Corpus v.1/total/CEPA 1 200611825.txt']

#### What we know so far: 
Across the board, the CEPA Images folder has the fewest files (938). The "total" folder and the individual CEPA sub-folders have the same number of CEPA files (1,676 each), which is good. Both the total and individual CEPA folders have student corrections coded with HTML tags by the researchers.

Let's probe and see if we can spot where the differences are.
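
One way to narrow down where the counts diverge is to tally files per proficiency level in each location and compare the tallies. This is a sketch under assumptions: the level is taken to be the digit after `CEPA` in names like `CEPA 3 200607296.txt`, or the numbered parent folder for files under `CEPA Images`. `level_of` and `count_levels` are hypothetical helpers, and the paths are toy values.

```python
import collections
import re

def level_of(path):
    """Return the proficiency level (1-6) encoded in a CEPA path, or None."""
    folder, _, name = path.rpartition('/')
    # Assumption: sub-folder and "total" files are named "CEPA <level> <id>.txt".
    match = re.match(r'cepa\s+([1-6])\s', name, re.IGNORECASE)
    if match:
        return int(match.group(1))
    # Assumption: "CEPA Images" files sit in a folder named after the level.
    parent = folder.rpartition('/')[2]
    if parent in {'1', '2', '3', '4', '5', '6'}:
        return int(parent)
    return None

def count_levels(paths):
    """Count how many paths fall under each level."""
    return collections.Counter(level_of(p) for p in paths)

# Toy paths for illustration only.
images = ["CEPA Images/4/200600001.txt", "CEPA Images/5/200600001.txt"]
total = ["total/CEPA 5 200600001.txt", "total/Cepa 5 200600002.txt"]
print(count_levels(images))
print(count_levels(total))
```

A level whose tally differs between two locations is the first place to look for missing or extra files.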

In [8]:
def rename_file(file):
    """Replace each run of whitespace in a file name with a hyphen."""
    return re.sub(r'\s+', '-', file)

def read_texts(files):
    """Map each short file name (no directory, no '.txt') to the file's text."""
    texts = {}
    for file in files:
        with open(file) as f:
            txt = f.read()
        file = rename_file(file)
        start = file.rindex('/')+1
        name = file[start:-4]
        texts[name] = txt
    return texts

sub_files = read_texts(cepa_sub)
cep_total_files = read_texts(cepa_total)

Let's quickly double-check that neither folder has a stray file that's missing from the other.

In [9]:
m = [k for k in sub_files.keys() if k not in cep_total_files.keys()]
m

[]

In [10]:
w = [k for k in cep_total_files.keys() if k not in sub_files.keys()]
w

[]
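
Matching names do not guarantee matching contents, so as a further check the texts stored under each shared key can be compared as well. Below is a minimal sketch with a hypothetical helper, `differing_keys`, shown on toy dictionaries rather than on the real `sub_files` and `cep_total_files`.

```python
def differing_keys(first, second):
    """Return the sorted keys present in both dicts whose values differ."""
    shared = first.keys() & second.keys()
    return sorted(k for k in shared if first[k] != second[k])

# Toy data standing in for the two name-to-text dictionaries.
a = {"CEPA-1-200600001": "same text", "CEPA-2-200600002": "one version"}
b = {"CEPA-1-200600001": "same text", "CEPA-2-200600002": "another version"}
print(differing_keys(a, b))   # ['CEPA-2-200600002']
```

On the real data, `differing_keys(sub_files, cep_total_files)` returning an empty list would mean the two folders agree on the text of every essay.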

Now let's look into the repeated file in these two folders.

In [11]:
for f in sub_files.keys():
    if '200602841' in f :
        print(sub_files[f])

﻿				CEPA 5 200602841



The best movie  that I have recently is (The Dafenshi Code) and one of the
leading roles is played by a famouse movie star ( gorge clonie ) he plays the hero.
The movie is about a super villan that tries to steall the dafenshi book from the  vadicane musiam but he misses up and gets tailled by a cope which tries toprevint he from using a code in the book that is said to cause the end of the world (Armagedon) in the end good triamps over evil
and every thing goes back to normal.I liked the movie because it`s 
about a mistrey that no one knew how to solve accept the evil villan
and because it`s based on history and real events,but I was abite
disapoint ed because the ending was to easy to pradict I like the endings
to be so twisted and unpradictable that you have to watch the movie 2 or
3 times to under stand what really happined.Never the less I enjoyed
the movie I liked it so nuch that I  saw it 6 limes in one week.



In [12]:
for f in cep_total_files.keys():
    if '200602841' in f :
        print(cep_total_files[f])   # read from the "total" dict, not sub_files

﻿				CEPA 5 200602841



The best movie  that I have recently is (The Dafenshi Code) and one of the
leading roles is played by a famouse movie star ( gorge clonie ) he plays the hero.
The movie is about a super villan that tries to steall the dafenshi book from the  vadicane musiam but he misses up and gets tailled by a cope which tries toprevint he from using a code in the book that is said to cause the end of the world (Armagedon) in the end good triamps over evil
and every thing goes back to normal.I liked the movie because it`s 
about a mistrey that no one knew how to solve accept the evil villan
and because it`s based on history and real events,but I was abite
disapoint ed because the ending was to easy to pradict I like the endings
to be so twisted and unpradictable that you have to watch the movie 2 or
3 times to under stand what really happined.Never the less I enjoyed
the movie I liked it so nuch that I  saw it 6 limes in one week.



So it looks like the level 4 copy of the repeated file was thrown out: both folders keep only "CEPA 5 200602841".

In [None]:
total