Elena Cimino, e.cimino@pitt.edu, 2019.02.12 - 2019.02.20

# The Corpus: 
- BALC (the BUiD Arab Learner Corpus) is a corpus compiled by researchers at the British University in Dubai. It contains the writing of native Arabic speakers (L1 Arabic) in the United Arab Emirates who have learned English as a Foreign Language (EFL).
- The corpus has 1,865 texts that are handwritten. 
- All texts are available as plain text documents; some are also accompanied by jpeg image files of the essays.
- There are several sources for the texts.

# The Layout: 
- The corpus directory contains one subfolder per source, each holding that source's files.
- There is a "CEPA Images" folder that has the images and text files for the CEPA essays.
- There is also a "total" folder that holds every text in the corpus.
- Therefore, each text in the corpus should be available in its source's folder as well as in the "total" folder.
- Additionally, each text from the CEPA examination should be available in _three_ different locations: 
    - The "CEPA Images" folder;
    - The individual "CEPA_#" folder; 
    - The "total" folder.
- Folders with an "\_em" suffix hold texts that are tagged for student corrections
    - CEPA levels 1 - 4 are tagged this way
    - `<i>...</i>` = insertion
    - `<o>...</o>` = emphasized/bolding
    - `<x>...</x>` = cross out
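
The three tags above can be resolved with a couple of regular expressions. This is only a sketch: the function name `resolve_corrections` and the `keep_crossed_out` switch are my own, and whether crossed-out material should be dropped or kept depends on what the analysis needs.

```python
import re

def resolve_corrections(text, keep_crossed_out=False):
    """Remove student-correction tags from a tagged essay.

    keep_crossed_out=False drops everything inside <x>...</x>;
    keep_crossed_out=True keeps the crossed-out characters and removes only the tags.
    Inserted (<i>) and emphasized (<o>) material is always kept.
    """
    if keep_crossed_out:
        text = re.sub(r'</?x>', '', text)
    else:
        text = re.sub(r'<x>.*?</x>', '', text, flags=re.S)
    return re.sub(r'</?[io]>', '', text)

# Fragment of a tagged level-2 essay
sample = 'I like is pl<x>a</x>ce becuse is kwit and  li<x>k</x>e the s<x>a</x>e'
final = resolve_corrections(sample)
as_written = resolve_corrections(sample, keep_crossed_out=True)
print(final)
print(as_written)
```

With `keep_crossed_out=True` the result matches the style of an untagged transcription, which is handy when comparing against plain-text versions of the same essay.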
    
# The Exploration: 
- Double-checking the format of the files in the different folders
- Double-checking number of files and looking for repeated files
- I'm mainly interested in the CEPA files right now, since the proficiency levels (1-6, with 1 being the lowest perceived proficiency level and 6 the highest) can be compared against the proficiency levels found in the other corpus I'm looking at: Pitt ELI Corpus

In [1]:
import glob
import re
import nltk

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%pprint            # to turn off pretty printing

cor_dir = "../private/BUiD Arab Learner Corpus v.1/"

Pretty printing has been turned OFF


Let's look at the CEPA Images folder first. This folder has both .txt and .jpg files of the CEPA essays in the corpus.  

##### CEPA Images folder

In [2]:
# globbing "CEPA images folder"
# Text files
cepa_im_1 = glob.glob(cor_dir+'CEPA Images/1/*.txt')
cepa_im_2 = glob.glob(cor_dir+'CEPA Images/2/*.txt')
cepa_im_3 = glob.glob(cor_dir+'CEPA Images/3/*.txt')
cepa_im_4 = glob.glob(cor_dir+'CEPA Images/4/*.txt')
cepa_im_5 = glob.glob(cor_dir+'CEPA Images/5/*.txt')
cepa_im_6 = glob.glob(cor_dir+'CEPA Images/6/*.txt')
cepa_im = cepa_im_1 + cepa_im_2 + cepa_im_3 + cepa_im_4 + cepa_im_5 + cepa_im_6

# JPG files
cepa_pics_1 = glob.glob(cor_dir+'CEPA Images/1/*.jpg')
cepa_pics_2 = glob.glob(cor_dir+'CEPA Images/2/*.jpg')
cepa_pics_3 = glob.glob(cor_dir+'CEPA Images/3/*.jpg')
cepa_pics_4 = glob.glob(cor_dir+'CEPA Images/4/*.jpg')
cepa_pics_5 = glob.glob(cor_dir+'CEPA Images/5/*.jpg')
cepa_pics_6 = glob.glob(cor_dir+'CEPA Images/6/*.jpg')
cepa_pics = cepa_pics_1 + cepa_pics_2 + cepa_pics_3 + cepa_pics_4 + cepa_pics_5 + cepa_pics_6

# Do they have the same number of files?
len(cepa_im)
len(cepa_pics)
cepa_im[:5]

938

1750

['../private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200604231.txt', '../private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200602456.txt', '../private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200600490.txt', '../private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200605503.txt', '../private/BUiD Arab Learner Corpus v.1/CEPA Images/1/200604379.txt']
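
The twelve near-identical `glob` calls above could be folded into one helper. A minimal sketch, assuming the same `<folder>/<level>/*.<ext>` layout; `glob_by_level` is a hypothetical name, and the demo runs on a throwaway directory rather than the real corpus.

```python
import glob
import os
import tempfile

def glob_by_level(base_dir, ext, levels=range(1, 7)):
    """Map each level number to the files with extension `ext` in base_dir/<level>/."""
    return {lvl: sorted(glob.glob(os.path.join(base_dir, str(lvl), '*.' + ext)))
            for lvl in levels}

# Tiny stand-in for the "CEPA Images" folder (made-up file names, not corpus data)
with tempfile.TemporaryDirectory() as demo_dir:
    layout = {1: ['200600001.txt', '200600001.jpg'], 2: ['200600002.jpg']}
    for lvl, names in layout.items():
        os.makedirs(os.path.join(demo_dir, str(lvl)))
        for name in names:
            open(os.path.join(demo_dir, str(lvl), name), 'w').close()

    txt = glob_by_level(demo_dir, 'txt', levels=[1, 2])
    jpg = glob_by_level(demo_dir, 'jpg', levels=[1, 2])
    txt_counts = {lvl: len(files) for lvl, files in txt.items()}
    jpg_counts = {lvl: len(files) for lvl, files in jpg.items()}
    print(txt_counts, jpg_counts)
```

On the real data this would be called as something like `glob_by_level(cor_dir + 'CEPA Images', 'txt')`, and the per-level counts fall out of the same dictionary.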

In [3]:
# How many text files are there in each level?
len(cepa_im_1)
len(cepa_im_2)
len(cepa_im_3)
len(cepa_im_4)
len(cepa_im_5)
len(cepa_im_6)

189

124

155

261

125

84

In [4]:
# How many jpg files are there in each level?
len(cepa_pics_1)
len(cepa_pics_2)
len(cepa_pics_3)
len(cepa_pics_4)
len(cepa_pics_5)
len(cepa_pics_6)

300

300

300

298

300

252

Let's check and see if any files repeat.

In [5]:
im_pics = []
for fn in cepa_pics:
    x = re.findall(r'\d{9}', fn)
    if len(x) > 0:
        im_pics.append(x[0])
len(im_pics)
im_pics[:5]

im_texts = []
for fn in cepa_im:
    x = re.findall(r'\d{9}', fn)
    if len(x) > 0:
        im_texts.append(x[0])
len(im_texts)
im_texts[:5]

1750

['200610622', '200612021', '200603474', '200612380', '200606487']

938

['200604231', '200602456', '200600490', '200605503', '200604379']

In [6]:
imt_fd = nltk.FreqDist(im_texts)
imp_fd = nltk.FreqDist(im_pics)

len([f for f in imt_fd if imt_fd[f] > 1])
len([f for f in imp_fd if imp_fd[f] > 1])

1

0

In [7]:
# Oh, there's just one, and it's among the text files. Let's see which file it is and how many occurrences there are.
imt_fd

FreqDist({'200602841': 2, '200604231': 1, '200602456': 1, '200600490': 1, '200605503': 1, '200604379': 1, '200608556': 1, '200602047': 1, '200604608': 1, '200604620': 1, ...})
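
The same duplicate check can be done without NLTK, using `collections.Counter` from the standard library. A small sketch; `repeated_ids` is a hypothetical helper and the paths are shortened, illustrative stand-ins for the globbed file names.

```python
import re
from collections import Counter

def repeated_ids(paths):
    """Return {essay_id: count} for every 9-digit ID that occurs in more than one path."""
    ids = []
    for p in paths:
        m = re.search(r'\d{9}', p)
        if m:
            ids.append(m.group())
    return {i: n for i, n in Counter(ids).items() if n > 1}

# Illustrative paths only
demo_paths = [
    'CEPA Images/4/200602841.txt',
    'CEPA Images/5/200602841.txt',
    'CEPA Images/1/200604231.txt',
]
dupes = repeated_ids(demo_paths)
print(dupes)
```

Returning only the repeated IDs avoids having to scan the full frequency table by eye.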

In [8]:
# Let's loop through and get the names and contents of each file
for file in cepa_im:
    if '200602841' in file:
        f = open(file)
        txt = f.read()
        f.close()
        print(file+":", txt)
        print()

../private/BUiD Arab Learner Corpus v.1/CEPA Images/4/200602841.txt: The best movie  that I have recently is (The Dafenshi Code) and one of the
leading roles is played by a famouse novie star ( gorge clonie ) he plays the hero.
The movie is about a super vikan that tries to speak the dafenshi book from the 
vadicane musiam but he misses up and gets tailled by a cope which tries
toprevint he from using a code inthe book that is said to cause
the end of the world (Armagedon) in the end good triamps over evil
and every thing goes back to normal.I liked the movie because it`s 
about a mistrey that no one knew how to solve accept the evil vikan
and because it`s based on history and real events,but I was abite
disapoint ed because the ending was to easy to pradict I like the endings
to be so twisted and unpradictable that you have to watch the movie 2 or
3 times to under stand what really happined.Never the less I enjoyed
the movie I liked it so nuch that I  saw it 6 limes in one week.

../p

In [9]:
for f in cepa_pics:
    if '200602841' in f:
        print(f)

../private/BUiD Arab Learner Corpus v.1/CEPA Images/5/200602841.jpg


So, there are more pictures of essays than there are text files. In both cases level 6 has the fewest files, which isn't surprising. However, the levels are far more evenly represented among the .jpg files than among the text files. Not only that, but we're losing _a lot_ of data using just the text files: 3 out of the 6 proficiency levels (2, 5, and 6) have less than half as many text files as .jpg files, and level 3 is barely above half. Finally, one essay (200602841) appears twice among the text files; the two copies are identical but sit in different proficiency levels: 4 and 5. Thanks to the .jpgs, we know it should be a level 5.
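
To put numbers on how much is lost, here is a quick calculation of text-file coverage per level, using the counts printed in the cells above. Only the variable names are new.

```python
# Counts copied from the len(...) outputs above
txt_counts = {1: 189, 2: 124, 3: 155, 4: 261, 5: 125, 6: 84}
jpg_counts = {1: 300, 2: 300, 3: 300, 4: 298, 5: 300, 6: 252}

# Share of each level's .jpg essays that also exist as .txt
coverage = {lvl: txt_counts[lvl] / jpg_counts[lvl] for lvl in txt_counts}
for lvl, share in coverage.items():
    print(f"level {lvl}: {txt_counts[lvl]:>3} / {jpg_counts[lvl]} = {share:.0%}")

under_half = [lvl for lvl, share in coverage.items() if share < 0.5]
print("levels with fewer than half of their essays as text:", under_half)
```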

Let's check out the other two folders and see what's going on with that. Do they have the missing files or not?

##### Other CEPA file folders

In [10]:
# globbing each *separate* cepa folder 
cepa_sub_1 = glob.glob(cor_dir+'cepa1_em/*.txt')
cepa_sub_2 = glob.glob(cor_dir+'cepa2_em/*.txt')
cepa_sub_3 = glob.glob(cor_dir+'cepa3_em/*.txt')
cepa_sub_4 = glob.glob(cor_dir+'cepa4_em/*.txt')
cepa_sub_5 = glob.glob(cor_dir+'cepa5/*.txt')
cepa_sub_6 = glob.glob(cor_dir+'cepa6/*.txt')
cepa_sub = cepa_sub_1 + cepa_sub_2 + cepa_sub_3 + cepa_sub_4 + cepa_sub_5 + cepa_sub_6

# Checking number of files and first 5
len(cepa_sub)
cepa_sub[:5]

1676

['../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200611825.txt', '../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200606381.txt', '../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200603548.txt', '../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200610508.txt', '../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200603206.txt']

In [11]:
len(cepa_sub_1)
len(cepa_sub_2)
len(cepa_sub_3)
len(cepa_sub_4)
len(cepa_sub_5)
len(cepa_sub_6)

284

293

251

297

299

252

In [12]:
# globbing "total" folder
total = glob.glob(cor_dir+"total/*.txt")

cepa = re.compile(r'CEPA', flags=re.I)   # compile the pattern once, outside the loop
cepa_total = []
for file in total:
    if cepa.search(file):
        cepa_total.append(file)

len(cepa_total)

1676

In [13]:
cepa_tot_str = " ".join(cepa_total)
len(re.findall(r'\s+1\s+', cepa_tot_str))
len(re.findall(r'\s+2\s+', cepa_tot_str))
len(re.findall(r'\s+3\s+', cepa_tot_str))
len(re.findall(r'\s+4\s+', cepa_tot_str))
len(re.findall(r'\s+5\s+', cepa_tot_str))
len(re.findall(r'\s+6\s+', cepa_tot_str))

# Weird. Looks like we're losing two files for level 6, but everything else is the same -- including total files.
# Let's investigate further

284

293

251

297

299

250

In [14]:
# Maybe the two missing files don't have a level number after them?
re.findall(r'CEPA \d{9}', cepa_tot_str, flags=re.I)

# No they don't. Let's look for them in cepa_sub

['CEPA 200621158', 'CEPA 200619773']
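
Rather than counting ` 1 `, ` 2 `, ... in one joined string, the level and ID can be pulled out of each name with a single pattern whose level digit is optional. This is a sketch: `NAME_RE` and `parse_name` are my own names, and the pattern assumes names of the form `CEPA [level] <9-digit ID>` as seen in the outputs above.

```python
import re
from collections import Counter

# Optional single-digit level, then the 9-digit essay ID
NAME_RE = re.compile(r'CEPA\s+(?:(\d)\s+)?(\d{9})', flags=re.I)

def parse_name(path):
    """Return (level, essay_id); level is None when the name carries no level digit."""
    m = NAME_RE.search(path)
    if m is None:
        return None
    level = int(m.group(1)) if m.group(1) else None
    return level, m.group(2)

demo = [
    'total/CEPA 6 200621158.txt',
    'total/CEPA 200621158.txt',      # no level number
    'total/CEPA 2  200612324.txt',   # double space before the ID
]
parsed = [parse_name(p) for p in demo]
level_counts = Counter(level for level, _ in parsed)
print(parsed)
print(level_counts)
```

Names without a level show up under the `None` key, so they are counted explicitly instead of silently dropping out of the per-level totals.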

In [15]:
for fn in cepa_sub:
    # both files are printed the same way, so test for either ID at once
    if '200621158' in fn or '200619773' in fn:
        with open(fn) as f:
            txt = f.read()
        print(fn+":")
        print(txt)
        print()

../private/BUiD Arab Learner Corpus v.1/cepa6/CEPA 200621158.txt:
﻿				CEPA 200621158



			      The worst weekend ever!



Last month I had the worst weekend ever in my life.
I just came back from school, it was Wednesday at 1:10 pm, the moment I entered the house my father informed me of the death of a member of our family. We took off to Abu Dhabi to attend the funrel and stayed there untill 9:00 pm, then we went back home and I was really tired and sad. I went to bed the moment we reached home. The next day when I woke up I felt sick, so I went to the doctor who told me to stay in bed all day. I felt a lot better the next day but I didn’t have any chance to have some fun because I had to study for a math test on Saturday. It was a very  rough weekend but that’s how it turned up to be. I’m  hoping the next weekends are a lot more fun and having better news.


../private/BUiD Arab Learner Corpus v.1/cepa6/CEPA 200619773.txt:
﻿				CEPA 200619773



Summer  vacation, 2005, Australia. 

In [16]:
'../private/BUiD Arab Learner Corpus v.1/total/CEPA 6 200619773.txt' in cepa_total
'../private/BUiD Arab Learner Corpus v.1/total/CEPA 6 200621158.txt' in cepa_total

True

True

At least we know they're the same files. We can check the .jpg images in the CEPA Images folder to see where they're actually supposed to be.

In [17]:
# Going to repeat the counter procedure from above 

# First with subfolder...
nums_sub = []
for fn in cepa_sub:
    x = re.findall(r'\d{9}', fn)
    if len(x) > 0:
        nums_sub.append(x[0])
    
# ... then for total folder
nums_tot = []
for fn in cepa_total:
    x = re.findall(r'\d{9}', fn)
    if len(x) > 0:
        nums_tot.append(x[0])

In [18]:
len(nums_sub)
len(nums_tot)

1671

1671

In [19]:
# Counting number of occurrences of each file name
sub_fd = nltk.FreqDist(nums_sub)
tot_fd = nltk.FreqDist(nums_tot)

# Now checking numbers
rep_s = [f for f in sub_fd if sub_fd[f] > 1]   # IDs repeated in the subfolders
rep_t = [f for f in tot_fd if tot_fd[f] > 1]   # IDs repeated in the total folder

len(rep_s)
len(rep_t)

# Compare as sets, since the two folders may be globbed in a different order
set(rep_s) == set(rep_t)

# Each has 12 repeating files, but at least they're the same files. The 12 include the two files found above without a level number, '200619773' and '200621158'.

12

12

True

In [20]:
# Let's look at the repeated files
rep_s

['200607856', '200607880', '200607857', '200607777', '200607875', '200607861', '200607902', '200607910', '200612324', '200611115', '200621158', '200619773']

In [21]:
# Now let's match the repeats to the full file names
for item in cepa_sub:
    if any(search in item for search in rep_s):
        print("Repeated file:", item)

Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607856.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607880.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607857.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607777.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607875.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607861.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607902.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa1_em/CEPA 1 200607910.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa2_em/CEPA 2 200612324.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa2_em/CEPA 2 200607856.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa2_em/CEPA 2 200607857.txt
Repeated file: ../private/BUiD Arab Learner Corpus v.1/cepa2_em/C

In [22]:
# Where does the image file say to place them?
for item in cepa_pics:
    if any(search in item for search in rep_s):
        print("Placement, per images:", item)

Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607777.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607856.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607880.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607857.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200612324.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607902.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607861.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607875.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200607910.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/3/200611115.jpg
Placement, per images: ../private/BUiD Arab Learner Corpus v.1/CEPA Images/6/200621158.jpg
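
In the output shown, 200619773 never appears, so that essay seems to have no .jpg to arbitrate its level. Set operations make this kind of gap easy to list. A sketch, with `compare_ids` as a hypothetical helper and deliberately tiny ID lists (a subset of the IDs above, not the full data):

```python
def compare_ids(ids_a, ids_b):
    """Split two ID collections into what is only in a, only in b, and in both."""
    a, b = set(ids_a), set(ids_b)
    return {'only_a': sorted(a - b), 'only_b': sorted(b - a), 'both': sorted(a & b)}

# Toy subsets: repeated text IDs vs. the IDs that showed up among the images
text_ids = ['200611115', '200621158', '200619773']
image_ids = ['200611115', '200621158']

result = compare_ids(text_ids, image_ids)
print(result)
```

Run on the full `nums_sub` and `im_pics` lists, `only_b` would list the essays that exist only as images, and `only_a` the texts with no image at all.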

Well, this is fun. Both the separate CEPA subfolders (cepa_sub) and the CEPA files in the total folder have the same number of files (1676). This is 74 fewer files than the .jpg in CEPA Images, but still more than the .txt in CEPA Images. __However__, several essays have been repeated.

- In two different folders under the same ID:
    - 200607856 (1 & 2) -- .jpg is in __2__
    - 200607880 (1 & 2) -- .jpg is in __2__
    - 200607857 (1 & 2) -- .jpg is in __2__
    - 200607777 (1 & 2) -- .jpg is in __2__
    - 200607875 (1 & 2) -- .jpg is in __2__
    - 200607861 (1 & 2) -- .jpg is in __2__
    - 200607902 (1 & 2) -- .jpg is in __2__
    - 200607910 (1 & 2) -- .jpg is in __2__
- In the same proficiency folder but with 2 different name conventions
    - 200612324 (`CEPA 2 200612324` and `CEPA 2  200612324`)
    - 200611115 (`CEPA 3 200611115` and `CEPA 3  200611115`)
    - 200621158 (`CEPA 200621158` and `CEPA 6 200621158`)
    - 200619773 (`CEPA 200619773` and `CEPA 6 200619773`)
    
Not great.
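
The two kinds of repetition listed above can be told apart automatically by grouping file names by ID and looking at how many distinct folders each ID lands in. A sketch; `classify_repeats` and its labels are hypothetical, and the demo paths are shortened versions of names from the summary above.

```python
import os
import re
from collections import defaultdict

def classify_repeats(paths):
    """Group paths by 9-digit essay ID and label how each repeated ID is repeated."""
    by_id = defaultdict(list)
    for p in paths:
        m = re.search(r'\d{9}', p)
        if m:
            folder, name = os.path.split(p)
            by_id[m.group()].append((folder, name))
    report = {}
    for essay_id, hits in by_id.items():
        if len(hits) < 2:
            continue  # not repeated
        folders = {folder for folder, _ in hits}
        if len(folders) > 1:
            report[essay_id] = 'across folders'
        else:
            report[essay_id] = 'same folder, different names'
    return report

demo = [
    'cepa1_em/CEPA 1 200607856.txt',
    'cepa2_em/CEPA 2 200607856.txt',
    'cepa6/CEPA 200621158.txt',
    'cepa6/CEPA 6 200621158.txt',
    'cepa1_em/CEPA 1 200611825.txt',
]
report = classify_repeats(demo)
print(report)
```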

#### Now let's look into some of these repeated files
Let's dig deeper and see what we can find...

In [23]:
sub_files = {}
for file in cepa_sub:
    f = open(file)
    txt = f.read()
    f.close()
    start = file.rindex('/')+1
    name = file[start:-4]
    sub_files[name] = txt
    
cep_total_files = {}
for file in cepa_total:
    f = open(file)
    txt = f.read()
    f.close()
    start = file.rindex('/')+1
    name = file[start:-4]
    cep_total_files[name] = txt

In [24]:
len(sub_files.keys())
len(cep_total_files.keys())

# Both dicts still have 1676 entries, so no file names collided when building them

1676

1676

In [25]:
for f in sub_files.keys():
    if '200602841' in f :
        print(sub_files[f])

﻿				CEPA 5 200602841



The best movie  that I have recently is (The Dafenshi Code) and one of the
leading roles is played by a famouse movie star ( gorge clonie ) he plays the hero.
The movie is about a super villan that tries to steall the dafenshi book from the  vadicane musiam but he misses up and gets tailled by a cope which tries toprevint he from using a code in the book that is said to cause the end of the world (Armagedon) in the end good triamps over evil
and every thing goes back to normal.I liked the movie because it`s 
about a mistrey that no one knew how to solve accept the evil villan
and because it`s based on history and real events,but I was abite
disapoint ed because the ending was to easy to pradict I like the endings
to be so twisted and unpradictable that you have to watch the movie 2 or
3 times to under stand what really happined.Never the less I enjoyed
the movie I liked it so nuch that I  saw it 6 limes in one week.



In [26]:
for f in cep_total_files.keys():
    if '200602841' in f :
        print(cep_total_files[f])

﻿				CEPA 5 200602841



The best movie  that I have recently is (The Dafenshi Code) and one of the
leading roles is played by a famouse movie star ( gorge clonie ) he plays the hero.
The movie is about a super villan that tries to steall the dafenshi book from the  vadicane musiam but he misses up and gets tailled by a cope which tries toprevint he from using a code in the book that is said to cause the end of the world (Armagedon) in the end good triamps over evil
and every thing goes back to normal.I liked the movie because it`s 
about a mistrey that no one knew how to solve accept the evil villan
and because it`s based on history and real events,but I was abite
disapoint ed because the ending was to easy to pradict I like the endings
to be so twisted and unpradictable that you have to watch the movie 2 or
3 times to under stand what really happined.Never the less I enjoyed
the movie I liked it so nuch that I  saw it 6 limes in one week.



So it looks like the level "4" repeated file was thrown out, which aligns with the observation from the .jpg files in CEPA Images.

In [27]:
for f in cep_total_files.keys():
    if '200607875' in f :
        print(cep_total_files[f])

﻿				CEPA 1 200607875



on day way have to go to the Abu Dudai by my fathera and mather and sistere and brater why have Aba Debai its not good becouse my breter it’s acsdend by car I’ts not good becous Abudadai it’s a bad

﻿				CEPA 2 200607875



on day way have to go to the Abu Dudai by my fathera and mather and sistere and brater why have Aba Debai its not good becouse my breter it’s acsdend by car I’ts not good becous Abudadai it’s a bad



In [28]:
for f in cep_total_files.keys():
    if '200607857' in f :
        print(cep_total_files[f])

﻿				CEPA 1 200607857



The perfect holiday is verey important sapject in our life. In summer holiday I taravel to the tarche. I taravel to my faimly. I saw the talfrek and the park. I abavsayou to taravel thes cantre. I look forward to hearing from yours.

﻿				CEPA 2 200607857



The perfect holiday is verey important sapject in our life. In summer holiday I taravel to the tarche. I taravel to my faimly. I saw the talfrek and the park. I abavsayou to taravel thes cantre. I look forward to hearing from yours.



In [29]:
for f in cep_total_files.keys():
    if '200607910' in f :
        print(cep_total_files[f])

﻿				CEPA 2 200607910



I think writing about Imagive you have just had the worst holiday ever! is very important becouce aplays an important rolx in our Life. No an adoubt that Imagive you have just had the worst holidayev! I the is no good the worst holiday ever. you went to zoo and park. They you went there with many friends and sister, father, mather and family. She eat many feed and druing milk. The go to the park foraday, and play a football and tennis and play with friend. I gave many apple and mant. In go to the zoo meany cat and cow in the zoo. I eat meay cats in the zoo. My freads play with my cat open the dour in the cat go in the open the drua. Is very import holiday the is it was so bad and friend go the doctor in the morining it very eary in the day. I hope reader I gave a bout just had the worst holiday ever! index.

﻿				CEPA 1 200607910



I think writing about Imagive you have just had the worst holiday ever! is very important becouce aplays an important rolx in our 

In [30]:
cep_total_files['CEPA 2 200606678']
cep_total_files["CEPA 2 200607910"]

'\ufeff\t\t\t\tCEPA 2 200606678\n\n\n\nMy most beautiful place is a Al mimzer , thir is in Dubai , I can see there a s<x>a</x>e p<x>e</x>pals and famil is go to wok neer the sae, I like is pl<x>a</x>ce becuse is kwit and  li<x>k</x>e the s<x>a</x>e ev<x>a</x>ry we<x>a</x>kied <o> I</o> go the Al Mimzer , and the yng like is Ploce .\n'

'\ufeff\t\t\t\tCEPA 2 200607910\n\n\n\nI think writing about Imagive you have just had the worst holiday ever! is very important becouce aplays an important rolx in our Life. No an adoubt that Imagive you have just had the worst holidayev! I the is no good the worst holiday ever. you went to zoo and park. They you went there with many friends and sister, father, mather and family. She eat many feed and druing milk. The go to the park foraday, and play a football and tennis and play with friend. I gave many apple and mant. In go to the zoo meany cat and cow in the zoo. I eat meay cats in the zoo. My freads play with my cat open the dour in the cat go in the open the drua. Is very import holiday the is it was so bad and friend go the doctor in the morining it very eary in the day. I hope reader I gave a bout just had the worst holiday ever! index.\n'

Okay... the \ufeff at the start is a byte order mark (BOM), so it's an encoding thing. I can fix that next time around. The HTML-style tags in the 200606678 file are the student correction tags.
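
One way to deal with the leading `\ufeff` is to open the files with Python's `utf-8-sig` codec, which consumes a UTF-8 byte order mark if one is present. The sketch below assumes the corpus files are UTF-8 with a BOM (I haven't verified the encoding) and demonstrates the difference on a temporary file with placeholder content.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, 'demo_essay.txt')

    # Writing with utf-8-sig puts a BOM at the start of the file
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('\t\t\t\tCEPA 2 200606678\n\n\n\nplaceholder essay text\n')

    with open(path, encoding='utf-8') as f:
        raw = f.read()      # BOM survives as '\ufeff'
    with open(path, encoding='utf-8-sig') as f:
        clean = f.read()    # BOM is stripped by the codec

print(repr(raw[:6]))
print(repr(clean[:6]))
```

Files without a BOM read the same under `utf-8-sig` as under `utf-8`, so it should be safe to use for the untagged files as well.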

In [31]:
# Let's compare some of these files to the counterparts in the CEPA Images folder
f = open('../private/BUiD Arab Learner Corpus v.1/CEPA Images/2/200606678.txt')
txt = f.read()
f.close()
txt

sub_files['CEPA 2 200606678']
# so, no steps taken to do anything about student corrections (even crossing out) in the Images folder -- sub_files seems to be identical to total

'My most beautiful place is a Al mimzer , thir is in Dubai , I can see there a sae pepals and famil is go to wok neer the sae, I like is place becuse is kwit and  like the sae evary weakied I go the Al Mimzer , and the yng like is ploce .'

'\ufeff\t\t\t\tCEPA 2 200606678\n\n\n\nMy most beautiful place is a Al mimzer , thir is in Dubai , I can see there a s<x>a</x>e p<x>e</x>pals and famil is go to wok neer the sae, I like is pl<x>a</x>ce becuse is kwit and  li<x>k</x>e the s<x>a</x>e ev<x>a</x>ry we<x>a</x>kied <o> I</o> go the Al Mimzer , and the yng like is Ploce .\n'

In [32]:
f = open('../private/BUiD Arab Learner Corpus v.1/CEPA Images/3/200606782.txt')
txt = f.read()
f.close()
txt

# vs...
cep_total_files['CEPA 3 200606782']

# vs 
sub_files['CEPA 3 200606782']

'I go withe my frineds to usa  and  i  go to  hilten  hotiel and i go to   walad Disny  and to the Deths with myfrind and i go  to   see  amaths  for footbaull amarican in the club.\nther was   a wonderful plac .\ni see a good pepulc  ina mavica. and  igo to  san francesco in is  a begear  valage in usa . thear are a may  peple in the valage   and    i see  a  dolphen and  clad reacing car itis for drges vease  it   is   very  beatful  the ar  a fauters car in  the club.\nand   in  see  many restrants in  the  usa  and ther are a  cinestuwen thear ar many cincy in usa and many dooges  and agood  persnts.\nand  i  see  in  usa  many  pepoles  its good peplets and  i go  to  the cinma   it  is  very bag  and many peples  in the cinma.\nwin i  bak  to my  contrey  UAE  i remamber that and  in  tell   my  famile  abowt USA amd i teu may frinfs    in the  UAE about  USA and the good peple  in   .USA .\nTher  wakand  its   good  becwas  the pepols  in   usa  it  in a  good pepols.'

'\ufeff        CEPA 3 200606782\n\n\n\n\nI go withe my Frineds to USA  and  i  go to  hilton  hotiel and i go to   walad Disny  and to the Deths with myFrind and i go  to   see  amaths  for footbaull amarican in the club.\nth<o>a</o>r was   a wonderful plac .\ni see a good pepole  ina mavica. and  igo to  san fran cesco in is  a begear  valage in usa . thear are a mny  peple i<x>n</x>  the valage   and    i see  a  dolphen and  clad reacing car itis for drgey vease  i<o>t</o>  i<x>s</x>   very  beatful  the ar  a Fasters car in  the club.\nand   in  see  many restrants in  the  usa  and ther are a  cinestuwen thear ar many cincy in usa and many dooges  and agood  persnts.\nand  i  see  in  USA  many  pepoles  its good peplets and  i go  to  the cinma   it  is  very bag  and many peples  in the cinma.\nwin i  bak  to my  contrey  UAE  i remamber that and  in  tell   my  famile  about USA amd i tell may frinfs in the UAE about  USA and the good peple  in   .US<o>a</o> .\nTher  wakand  it

'\ufeff        CEPA 3 200606782\n\n\n\n\nI go withe my Frineds to USA  and  i  go to  hilton  hotiel and i go to   walad Disny  and to the Deths with myFrind and i go  to   see  amaths  for footbaull amarican in the club.\nth<o>a</o>r was   a wonderful plac .\ni see a good pepole  ina mavica. and  igo to  san fran cesco in is  a begear  valage in usa . thear are a mny  peple i<x>n</x>  the valage   and    i see  a  dolphen and  clad reacing car itis for drgey vease  i<o>t</o>  i<x>s</x>   very  beatful  the ar  a Fasters car in  the club.\nand   in  see  many restrants in  the  usa  and ther are a  cinestuwen thear ar many cincy in usa and many dooges  and agood  persnts.\nand  i  see  in  USA  many  pepoles  its good peplets and  i go  to  the cinma   it  is  very bag  and many peples  in the cinma.\nwin i  bak  to my  contrey  UAE  i remamber that and  in  tell   my  famile  about USA amd i tell may frinfs in the UAE about  USA and the good peple  in   .US<o>a</o> .\nTher  wakand  it

So there are some inconsistencies in capitalization (and a few spellings) here as well...
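
Spotting these differences by eye is error-prone, so here is a sketch of a word-level comparison with `difflib` from the standard library. The helper names are my own, the header pattern assumes the `CEPA [level] <ID>` first line seen above, and the two strings are just the opening words of the 200606782 outputs.

```python
import difflib
import re

def strip_tags_and_header(text):
    """Remove the BOM, the 'CEPA [level] <id>' header, and correction tags (content is kept)."""
    text = text.lstrip('\ufeff')
    text = re.sub(r'^\s*CEPA\s+\d?\s*\d{9}\s*', '', text)
    return re.sub(r'</?[iox]>', '', text)

def word_differences(a, b):
    """Return (words_in_a, words_in_b) pairs for every stretch where the two texts differ."""
    wa, wb = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, wa, wb, autojunk=False)
    return [(' '.join(wa[i1:i2]), ' '.join(wb[j1:j2]))
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != 'equal']

images_txt = 'I go withe my frineds to usa  and  i  go to  hilten  hotiel'
tagged_txt = ('\ufeff        CEPA 3 200606782\n\n\n\n\n'
              'I go withe my Frineds to USA  and  i  go to  hilton  hotiel')

diffs = word_differences(images_txt, strip_tags_and_header(tagged_txt))
print(diffs)
```

Splitting on whitespace means differences in spacing are ignored, which suits these transcriptions since the spacing is very irregular.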

#### What we learned: 
The CEPA Images folder has the fewest CEPA text files (938), but it does have the full set of essays as pictures (1,750). The text files in this folder are untagged. Next we have the subfolders for each CEPA level, where levels 1-4 are tagged for student corrections. These seem to be the same as the CEPA files in the total folder. Both of these have many more texts (1,676) than the CEPA Images text count, but they still do not have all of the essays. The levels are also more evenly balanced with these higher text counts.

There are some naming convention concerns, mainly repeated files across and within proficiency levels in the CEPA folders. 
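
One possible way to settle the repeated files is to let the .jpg placement decide the level, as was done by eye for 200602841. This is only a sketch of that idea, not a finished cleaning step: `reconcile` is a hypothetical function, and the ID `200600000` in the demo is made up to show the unresolved case.

```python
from collections import defaultdict

def reconcile(text_entries, jpg_level):
    """Pick one level per essay ID, preferring the level its .jpg is filed under.

    text_entries: iterable of (level or None, essay_id) taken from text file names
    jpg_level:    dict mapping essay_id -> level of the folder holding its .jpg
    """
    levels_seen = defaultdict(set)
    for level, essay_id in text_entries:
        levels_seen[essay_id].add(level)

    resolved, unresolved = {}, {}
    for essay_id, levels in levels_seen.items():
        if essay_id in jpg_level:
            resolved[essay_id] = jpg_level[essay_id]   # the image decides
        elif len(levels) == 1 and None not in levels:
            resolved[essay_id] = next(iter(levels))    # only one candidate level
        else:
            unresolved[essay_id] = levels              # needs a manual look
    return resolved, unresolved

demo_texts = [
    (1, '200607856'), (2, '200607856'),      # repeated across levels
    (None, '200621158'), (6, '200621158'),   # repeated with and without a level
    (1, '200611825'),                        # not repeated
    (4, '200600000'), (5, '200600000'),      # made-up ID with no image
]
demo_jpgs = {'200607856': 2, '200621158': 6}

resolved, unresolved = reconcile(demo_texts, demo_jpgs)
print(resolved)
print(unresolved)
```

Anything left in `unresolved` has conflicting or missing levels and no image, so it would still have to be checked by hand.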