# BALC cleaning
2019.02.08 — 2019.02.17

## Summary of code
- Creating a corpus dictionary (`corpus_dict`) in which `corpus_dict[essay]` returns the text of the essay
- Setting up a balc_df, which includes:
    - Filenames
    - Original Text
- Cleaning up the text for balc_df by making:
    - A "Normalized" Text with standardized tagging (if applicable)
    - A "Revised" Essay, which removes student correction tags and removes crossed out text
- Building up balc_df with: 
    - Tokenized essays (built off of the revised essay)
    - Token counts
    - Type-token ratio (TTR)

### Initial set-up

In [1]:
import pandas as pd
import numpy as np
import nltk
import glob
import re

%pprint      # turn off pretty printing

# Creating a path for the corpus. All folders in the corpus have spaces in them -- 
# some file names do as well. Keep in mind!
cor_dir = "private/BUiD Arab Learner Corpus v.1/total/"

Pretty printing has been turned OFF


In [2]:
# Creating a UDF that returns a file name with whitespace replaced by hyphens, to be used when building the dictionary of files
def rename_file(file):
    """Replace each run of whitespace in a file name with a single hyphen."""
    return re.sub(r'\s+', '-', file)

# Testing the UDF
rename_file("/folder name/folder name 2/CEPA 1 0123456789")

'/folder-name/folder-name-2/CEPA-1-0123456789'

In [3]:
# All right - let's do this with the corpus
# 'utf-8-sig' is used so that a leading \ufeff (byte-order mark) is not included in the text
corpus = glob.glob(cor_dir+'*.txt')
corpus_dict = {}
for file in corpus:
    with open(file, encoding='utf-8-sig') as f:   # the file is closed automatically
        txt = f.read()
    file = rename_file(file)
    start = file.rindex('/')+1     # index just past the last directory separator
    name = file[start:-4]          # drop the directory and the '.txt' extension
    corpus_dict[name] = txt
    
# checking number of keys
len(corpus_dict.keys())

# good, we're not missing anything

1856
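
For comparison, the same loading step can be sketched with `pathlib`, which takes care of the directory separator and the extension without manual slicing. This is a minimal sketch, not the code that produced the output above; `load_corpus` is a hypothetical helper name.

```python
import re
from pathlib import Path

def load_corpus(directory):
    """Map each .txt file's space-free stem to its text (hypothetical helper)."""
    texts = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # 'utf-8-sig' drops a leading byte-order mark (\ufeff) if one is present
        text = path.read_text(encoding="utf-8-sig")
        # path.stem is the file name without its directory or extension
        key = re.sub(r"\s+", "-", path.stem)
        texts[key] = text
    return texts
```

Calling `load_corpus(cor_dir)` should give the same keys as `corpus_dict`, assuming every file in the folder ends in `.txt`.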

In [4]:
corpus_dict.keys()

# Looking good

dict_keys(['CEPA-3-200607296', 'CEPA-4-200607457', 'CEPA-5-200600487', 'CEPA-4-200608016', 'CEPA-1-200611825', 'CEPA-1-200606381', 'CEPA-5-200608959', 'Taiseer-86', 'CEPA-5-200600322', 'Taiseer-79', 'CEPA-1-200603548', 'CEPA-3-200611351', 'CEPA-5-200600450', 'AS17e', 'CEPA-3-200607269', 'Taiseer-45', 'CEPA-5-200603617', 'CEPA-5-200607471', 'CEPA-1-200610508', 'CEPA-1-200603206', 'CEPA-3-200611379', 'CEPA-5-200601014', 'Taiseer-51', 'AK9e', 'FZ10e', 'CEPA-3-200611190', 'CEPA-2-200606353', 'CEPA-2-200606421', 'CEPA-2-200612454', 'CEPA-2-200611582', 'CEPA-2-200601400', 'CEPA-2-200601414', 'CEPA-2-200607065', 'CEPA-2-200612468', 'CEPA-4-200607682', 'CEPA-6-200620020', 'CEPA-6-200621470', 'CEPA-1-200601954', 'CEPA-4-200605888', 'CEPA-4-200607721', 'CEPA-4-200607735', 'CEPA-3-200611966', 'CEPA-6-200620801', 'CEPA-5-200607908', 'CEPA-6-200620815', 'CEPA-1-200610124', 'CEPA-1-200600323', 'CEPA-1-200611548', 'CEPA-4-200605877', 'CEPA-5-200607707', 'CEPA-5-200601376', 'US-10', 'CEPA-5-200605104'

In [5]:
corpus_dict["CEPA-1-200601970"]

'\t\t\t\tCEPA 1 200601970\n\n\n\nYou have just had the perfect holiday you went Yaman There go <o>my  father and my brather. You</o> saw and did hadr<o>a</o> mot and sanaa moll. It was so wonderful m<o>y</o> famely\n'

In [6]:
corpus_dict["CEPA-5-200600215"]

'\t\t\t\tCEPA 5 200600215\n\n\n\nLast summer holiday was the worst holiday I have ever had. It was bad holiday because evrythings happened suddenly and without any prepairing.\nLast summer holiday, my family decided to spend  the holiday in India, so my father booked us a tickets to India. We all prepaired the bags for travelling on Tuesday.\nWe were on the airport befor one hour of plane flying. We were told that the plane had something wrong and it would be late. We waited for three hours in the airport. Then, we flew to India. It took us three hours. When we arrived we started looking for taxi for a long time. After that we found small bus to take us to the hotel. \nIn the way of the hotel, I saw many dogs in the street and I was afraid. In the entrence of the hotel there were many poor children with dirty cats. We spent this day in hotel because of  the bad weather.\nNext day, I suggested going to the park. My father bought the lunch and took us to the park. We played in some games

In [7]:
corpus_dict["CEPA-1-200600677"]

'\t\t\t\tCEPA 1 200600677\n\n\n\nshe have just ha<o>d</o> the perfect. went go to the Da<o>bia</o>. Dabia is the dviring, and is vere <o>Fantastc</o> went my Frinds and Famly. He will go in the Ibn batota, and swimming. The swimming Dabia saw vere Fantastc and dviring. and it’s going in the <o>cin</o>ema Dabia. \n'

In [8]:
corpus_dict["AK17e"]

'\t\t\nName\nEbeeid    Mubarak\nSchool\nAl   Khaleej\nGrade\nGrade  12\n\n\n\nThe  <x> l </x> mother<o> s </o>  are all of life.  We must h<o>a </o> ve a mother in all home.  I have a mother in my home.  My mothe<o> r </o> <x> l </x> is  beutiful woman be<o> o</o> cuse <x> she </x> her  bite  me  to  make  a  good  thinks.  I  m<o> u</o> st  now not forg<o> g</o> eten  a   poit <x> w </x>    he  was  said  a  mother is  <x> la </x> same  a  school  w<o> h</o>en  you  make  you  make  avery  good   people.\n\n\n'

#### Overview:
- It appears that all of the files have some kind of heading before the actual text
    - This will affect measures that depend on tokenization and token counts, so we'll need to remove this information
    - The information isn't standard across all of the files. The headings may or may not include the following information: 
        - The file name
        - The student's name
        - The student's school
        - The student's grade 
        - The perceived proficiency of the student's essay 
- Coding of student corrections:
    - Not all files have student correction codes.
    - The coding of student corrections is not uniform. I expected this from some preliminary exploration I did when I first downloaded the corpus. Here is a summary of known issues, some of which can be seen in the files above: 
        - Some files use "↑", whereas others use HTML-style tags `<o> ... </o>`, `<i> ... </i>`, and `<x> ... </x>`
        - The tagging of correction codes is not uniform, e.g. `fr<o om > </o>` vs. `thi<o> s</o>`: in the first example the `<o>` tag itself _includes_ the part of the word it marks, whereas in the second the marked text sits between the opening and closing tags
        - The spacing of tagging varies, such as: 
            - `<x>and they are bad layer on the com</x>`, where the cross-out tagging is flush to the words
            - `or<x> e </x>`, where `e` is the text crossed out, but the opening tag is attached to the previous word and the crossed-out text is padded with spaces inside the tags
            - `and  <x> goo </x> very`, where the tags are separated from the crossed-out text by a space on each side
            - `<o> s</o> ociety`, where one letter of a word is being emphasized, but the letter _is not_ connected to the word it is a part of
            - `<o> A</o>lso`, where the letter emphasized _is_ connected to the word it belongs to
            - `c<o> are</o>`, where the opening tag is attached to the word, but the emphasized letters are separated from it by a space
            - `orp<o> han</o>  s`, which separates the word "orphans"
            - `m<o>y</o>` where everything is flush
            - ` There go <o>my  father and my brather. You</o> saw` an entire phrase (that spans two sentences!) being emphasized
        - There are probably more issues that will surface later, but these are the ones to take into account for now. 
- In a pd.DataFrame, I would like: 
    - An `Original_Essay` column, that reads in the original text
        - Maybe get rid of heading though?
    - A "Normalized" Essay column, that has a more (or completely) uniform tagging system 
        - `<o>` -> `_`
        - `<i>` -> `^`
        - `<x>` doesn't have to change, I don't think
        - This is where issues raised earlier will really crop up
    - A `Cleaned_Essay` column that removes the tagging, as well as deletes words that correction codes indicate students crossed out
    - POS-tag
    - token count, average sentence length, the usual stuff

### Making balc_df

In [9]:
# Creating a UDF that will return the text of a filename from corpus_dict
def get_text(file):
    return corpus_dict[file]

In [10]:
# Setting up balc_df with one row per file
balc_df = pd.DataFrame(list(corpus_dict.keys()), columns=["Filename"])
print(balc_df.shape)
balc_df.head()

# good. it has the right number of files.

(1856, 1)


Unnamed: 0,Filename
0,CEPA-3-200607296
1,CEPA-4-200607457
2,CEPA-5-200600487
3,CEPA-4-200608016
4,CEPA-1-200611825


In [11]:
balc_df['Original_Text'] = balc_df.Filename.apply(get_text)
balc_df.head()

Unnamed: 0,Filename,Original_Text
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...


In [12]:
# How many essays do we have with '↑'
balc_df.Original_Text.str.contains('↑').value_counts()

# 22

False    1834
True       22
Name: Original_Text, dtype: int64

In [13]:
# What are they
balc_df[balc_df.Original_Text.str.contains('↑')]

Unnamed: 0,Filename,Original_Text
51,US-10,US-10\n\n\n\n\nThe third differences between t...
66,T-12,\t\t\t\tTaiseer - 12\n\n\nI am is Ali ↑i...
168,US-13,US-13\n\n\n\n\t\t\t\t\nName of University\nUn...
187,T-11,\t\t\t\tTaiseer - 11\n\n\nsmoking is a ve...
432,US-28,US-28 \n\n\n\n\t\t\t\t\nName of Univers...
433,US-14,US-14\n\n\n\nAnd when she become at the last t...
480,US-6,US-6\n\n\n\t\t\t\nName of University\nUnivers...
616,US-5,US-5\n\n\t\t\t\t\nName of University\nUnivers...
823,US-1,US-1\n\n\n\t\t\t\t\nName of University\nUniver...
1182,US-9,US-9\n\n\n\n\n\t\n\t\t\t\t\t\nName of Univers...


In [14]:
# Let's look at one
balc_df.Original_Text[66]

'\t\t\t\tTaiseer    -   12\n\n\nI am is Ali  ↑in the stop smoking in the cigarette.  I t’s help smokers quit smoking.  I relley somoking in the Large dangerous results.  L   They  my friend somking       beacous  help  quit.      My  father  and  mouther           I advies didn’t  friend  they  smoking.  I t’s   ↑ in the Gulf smoke 2  50%   to  youth.  I  advies  to  smoking   quit.  You  should  lost  of  cigarette  advertisements.  Let’s  in  the  semoking dangerous healthy.  My           father to in the would  2 % 95  in the cigarett.  I t’s  countery  Large               Breoblam  didn’t               smoking.  I they                in the people speres ↑in the  Smoking.  Smoking kills in the cigarette.  If I were you, I would                 grent father in the smoking.   Pleaes             stop smoking beacous dangerous  in the would.  pleaes to her me                     in the would quit smoking. \n\n\n'

##### Cleaning data up
- Remove excessive whitespace
- Remove things like curly quotes 
- Standardize tagging
    - `<o>...</o>` -> `_`
    - `<i>...</i>` -> `^`
    - `↑` -> `^`

In [15]:
def clean_text(txt):
    """Collapses runs of whitespace, removes a backtick at the very start of the text, and replaces curly quotes with straight ones."""
    txt = re.sub(r'[\n\t ]+', ' ', txt)
    txt = re.sub(r'^`', '', txt)
    txt = txt.replace('“','"').replace('”','"').replace("’", "'")
    txt = txt.strip()
    return txt

def retag(essay):
    """Replaces tags for student emphasis (<o>, </o>) with '_' and removes any unnecessary spaces between emphasized 
    letters in words. Replaces tags for student insertions (<i>, </i>) with '^' and removes any unnecessary spaces 
    between letters in inserted words. Repairs opening tags that researchers left without a closing '>' (e.g. '<x To' -> '<x>To')."""
    essay = re.sub(r'<i +', '<i>', essay)
    essay = re.sub(r'<x +', '<x>', essay)
    essay = re.sub(r'<o +', '<o>', essay)
    essay = re.sub(r'<o>(\s)?', '_', essay)
    essay = re.sub(r'(\s)?<\/o>(\s)?', '_', essay)
    essay = re.sub(r'<i>(\s)?', '^', essay)
    essay = re.sub(r'(\s)?<\/i>', '^', essay)
    essay = re.sub(r' >', ' ', essay)
    essay = re.sub(r'__', '_ _', essay)
    essay = essay.replace("↑", "^")
    return essay

def un_head(txt):
    """Removes headers from text that include the file name, as well as student's names, grades, schools, etc."""
    cepa = re.compile(r'^C(EPA|EPa|epa)')      # detects cepa headers
    tai = re.compile(r'^Taiseer')              # detects taiseer headers
    tai2 = re.compile(r'^T-\d+')               # detects tai_em headers
    us = re.compile(r'^US-\d+')                # detects University of Sharjah headers
    header = re.compile(r'^Name')              # detects everything else
    if cepa.search(txt):
        txt = re.sub(r'CEPA.*?\d{2,}( ?-)?', '', txt, flags=re.I)
    elif tai.search(txt):
        txt = re.sub(r'Taiseer( -)? \d+', '', txt)
    elif tai2.search(txt):
        txt = re.sub(r'T-\d+', '', txt)
    elif us.search(txt):
        txt = re.sub(r'US-\d+.*?\d+', '', txt)
    else:
        matchobj = header.search(txt)
        if matchobj:
            txt = re.sub(r'Name.*?\d+( - \d)?', '', txt)
    return txt

def normalize_essay(txt):
    """Cleans whitespace and quotes, standardizes correction tags, then removes the header."""
    txt = clean_text(txt)
    txt = retag(txt)
    txt = un_head(txt)
    txt = txt.strip()
    return txt

In [16]:
balc_df.Original_Text[137]

'\t\t\t\tTaiseer 85\n\n\n\nI need to xxx<i>someone</i> <x>to</x> for help me. It’s very ergient my father  <x>so</x> is died. After my finshed my school. I see my father in the bed, becuaus  It’s was <x>sm</x> a very good smoker. xxx And It was died. smoking is a very xxx harmful hapits. In advertisements people see a man xxx who <x>happy</x>. teneeger saw the man and trust of thim. It’s some results for smoking. But do you now <x>a</x> the least dangerous result’s. of cours you now. It is xxx vary hot in your bady. or smoking couss to the most dangerous result’s. It’s cancer. Did you naw cancer it’s died desises. many of people like cigarette but in they life, he don’t <x>happy</x>. I belive smoking it’s very bad xxx e hapet. But some people  <x>dont</x>don’t agreey with me, sey why? Becouse xxx xxx he <x>say</x> sey I xxx xxx enjoy after I take only <x>oen</x> one cigarette. In my openin I coll him xxx a creezy people. How you smok and you must be adied. Finally It’s a slow suiced!  

In [17]:
un_head(clean_text(balc_df.Original_Text[137]))

# we still have the 'xxx'. Take out?

" I need to xxx<i>someone</i> <x>to</x> for help me. It's very ergient my father <x>so</x> is died. After my finshed my school. I see my father in the bed, becuaus It's was <x>sm</x> a very good smoker. xxx And It was died. smoking is a very xxx harmful hapits. In advertisements people see a man xxx who <x>happy</x>. teneeger saw the man and trust of thim. It's some results for smoking. But do you now <x>a</x> the least dangerous result's. of cours you now. It is xxx vary hot in your bady. or smoking couss to the most dangerous result's. It's cancer. Did you naw cancer it's died desises. many of people like cigarette but in they life, he don't <x>happy</x>. I belive smoking it's very bad xxx e hapet. But some people <x>dont</x>don't agreey with me, sey why? Becouse xxx xxx he <x>say</x> sey I xxx xxx enjoy after I take only <x>oen</x> one cigarette. In my openin I coll him xxx a creezy people. How you smok and you must be adied. Finally It's a slow suiced!"

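If the `xxx` placeholders are to be taken out, one option is a small extra step like the one below. This is only a sketch: `drop_placeholders` is a hypothetical helper, and it assumes a placeholder is a run of three or more `x` characters that is not part of a longer word. What the marker stands for is not documented here, so whether to remove it remains an open question.

```python
import re

# Assumption: a placeholder is 3+ 'x' characters with no letter directly before or after
PLACEHOLDER = re.compile(r"(?<![A-Za-z])x{3,}(?![A-Za-z])")

def drop_placeholders(txt):
    """Remove standalone 'xxx' runs and collapse the spaces they leave behind."""
    txt = PLACEHOLDER.sub(" ", txt)
    return re.sub(r" {2,}", " ", txt).strip()

print(drop_placeholders("I need to xxx<i>someone</i> <x>to</x> for help"))
```

The single `x` inside `<x>` and `</x>` is left alone because the pattern requires at least three `x` characters in a row.
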
In [18]:
balc_df[balc_df.Filename.str.contains('US')]

Unnamed: 0,Filename,Original_Text
51,US-10,US-10\n\n\n\n\nThe third differences between t...
81,US-11,US-11\n\n\n\n\n\t\nName of University\nUniver...
168,US-13,US-13\n\n\n\n\t\t\t\t\nName of University\nUn...
208,US-12,"US-12\n\n\n\n\n\nOn the other hand, males are ..."
284,US-16,US-16\n\n\n\nName of University\nUniversity o...
328,US-17,US-17\n\n\n\n\n\n\nAs to my parents mother t...
401,US-15,US-15\n\n\t\t\t\t\nName of University\nUniver...
404,US-29,US-29\n\n\n\n\t\t\t\t\nName of University\nUn...
432,US-28,US-28 \n\n\n\n\t\t\t\t\nName of Univers...
433,US-14,US-14\n\n\n\nAnd when she become at the last t...


In [19]:
balc_df.Original_Text[51]

'US-10\n\n\n\n\nThe third differences between teacher (A) and  (B) is the Final exam.  The teacher (A) always put very difficult exams and her exams always need a lot of time.  I remmeber that day when I saw her final exam.  It was a very diffecult  and I took a lot of time to anwser it.  But the teacher (B) always put a very easy Exams and I could finish from it in a short time.  I remmeber when the teacher (B) said : “I hate to put difficult  exams, because I know ↑that you  need makes to pass.\n\nIn short, there was a lot of differences between them.  They have different ways in teaching style, but I was like the both ways.  I was know these ways may be difficult, but it will help me to get high marks.\n\n'

In [20]:
un_head(clean_text(balc_df.Original_Text[51]))

' The third differences between teacher (A) and (B) is the Final exam. The teacher (A) always put very difficult exams and her exams always need a lot of time. I remmeber that day when I saw her final exam. It was a very diffecult and I took a lot of time to anwser it. But the teacher (B) always put a very easy Exams and I could finish from it in a short time. I remmeber when the teacher (B) said : "I hate to put difficult exams, because I know ↑that you need makes to pass. In short, there was a lot of differences between them. They have different ways in teaching style, but I was like the both ways. I was know these ways may be difficult, but it will help me to get high marks.'

In [21]:
balc_df.Original_Text[284]

'US-16\n\n\n\nName of  University\nUniversity of Sharjah\nName of  Student\nEman Al Hajri\nStudent  ID\n20320719\n\n\nEvery country has its own traditional custom costume  that  it is proud of.  To me is a part of not all traiditonal costumes are  people  have different oppiniouns  opiniouns to wards these kinds of costoms  customs.   To me, I don’t like most of my traditional customs as they have  are not comfortable to me and I can’t  xxx  have my freedom with it.   Even though my opinion is ve  negative towards  traditional  chlo  clothes still my paren   parents think that I’m wrong and they are proud of it..\n\nThe customs  clothes  like traditional  over in the UAE  have lots to do with gold, loose with sparkling things on and its un comnfortable because you can’t have your freedom with it, as xx your not free to have sit the way you like or to play with this kind of expensive clothes without getting it dirty as a child.  I remember ones when I was 8 years old in Eid, I used to p

In [22]:
un_head(clean_text(balc_df.Original_Text[284]))

" Every country has its own traditional custom costume that it is proud of. To me is a part of not all traiditonal costumes are people have different oppiniouns opiniouns to wards these kinds of costoms customs. To me, I don't like most of my traditional customs as they have are not comfortable to me and I can't xxx have my freedom with it. Even though my opinion is ve negative towards traditional chlo clothes still my paren parents think that I'm wrong and they are proud of it.. The customs clothes like traditional over in the UAE have lots to do with gold, loose with sparkling things on and its un comnfortable because you can't have your freedom with it, as xx your not free to have sit the way you like or to play with this kind of expensive clothes without getting it dirty as a child. I remember ones when I was 8 years old in Eid, I used to play with my cousins cusins who in with sand mud and our clothes were so dirty and looked horrible, my mum started shouting and scolding scolded 

In [23]:
balc_df.Original_Text[5]

'\t\t\t\tCEPA 1 200606381\n\n\n\nThe most beautiful place you hnow  he see because place very cood on bag on the people and Whoes there\n'

In [24]:
clean_text(balc_df.Original_Text[5])

'CEPA 1 200606381 The most beautiful place you hnow he see because place very cood on bag on the people and Whoes there'

In [25]:
balc_df.Original_Text[1510]

'Name\nAbdullah  Lee\nSchool\nAl   Khaleej\nGrade\nGrade  11\n\n\n\nMany people lost  <i> their </i>  <x> re  pe  the </x> father  and mother because of many <o> r</o> easons and become orphan.  They  need  many  helps  to  pass  their  problems  <x> such  asit </x>.   They lost <o> t</o>hier  father  and mother for many reasons s<x> h</x>uch as waves  and exedants. <x To fell good  > </x> Our  relegion  told  us  to  tereat  the orphans kindly t<o> o </o>  win the paradise.  To do that we must be generous to give them w<o> h</o> at they need.   <o>W </o>e must  build shelters to save them <x> from </x>.  We  should forgive them when they do mistakes.  If we want to be <x> the most  </x> in a good situation we should learn the orphan people b<o> y </o> the schools.  We should improve their skills by let them to play<x> and the imp  </x> and have their time in fun.  Finally we must <x> see save them fromen </x>.   look after them becase they will be in the futur the men who will save ou

In [26]:
un_head(clean_text(balc_df.Original_Text[1510]))

' Many people lost <i> their </i> <x> re pe the </x> father and mother because of many <o> r</o> easons and become orphan. They need many helps to pass their problems <x> such asit </x>. They lost <o> t</o>hier father and mother for many reasons s<x> h</x>uch as waves and exedants. <x To fell good > </x> Our relegion told us to tereat the orphans kindly t<o> o </o> win the paradise. To do that we must be generous to give them w<o> h</o> at they need. <o>W </o>e must build shelters to save them <x> from </x>. We should forgive them when they do mistakes. If we want to be <x> the most </x> in a good situation we should learn the orphan people b<o> y </o> the schools. We should improve their skills by let them to play<x> and the imp </x> and have their time in fun. Finally we must <x> see save them fromen </x>. look after them becase they will be in the futur the men who will save our familiees and society.'

In [27]:
balc_df.Original_Text[94]

'\t\t\nName\nEbeeid    Mubarak\nSchool\nAl   Khaleej\nGrade\nGrade  12\n\n\n\nThe  <x> l </x> mother<o> s </o>  are all of life.  We must h<o>a </o> ve a mother in all home.  I have a mother in my home.  My mothe<o> r </o> <x> l </x> is  beutiful woman be<o> o</o> cuse <x> she </x> her  bite  me  to  make  a  good  thinks.  I  m<o> u</o> st  now not forg<o> g</o> eten  a   poit <x> w </x>    he  was  said  a  mother is  <x> la </x> same  a  school  w<o> h</o>en  you  make  you  make  avery  good   people.\n\n\n'

In [28]:
un_head(clean_text(balc_df.Original_Text[94]))

' The <x> l </x> mother<o> s </o> are all of life. We must h<o>a </o> ve a mother in all home. I have a mother in my home. My mothe<o> r </o> <x> l </x> is beutiful woman be<o> o</o> cuse <x> she </x> her bite me to make a good thinks. I m<o> u</o> st now not forg<o> g</o> eten a poit <x> w </x> he was said a mother is <x> la </x> same a school w<o> h</o>en you make you make avery good people.'

Checking `normalize_essay`

In [29]:
normalize_essay(balc_df.Original_Text[5])

'The most beautiful place you hnow he see because place very cood on bag on the people and Whoes there'

In [30]:
normalize_essay(balc_df.Original_Text[4])

'you go in the oman just had the perfect holiday Describe it where in the oman holiday. The you went there with in the Father.'

In [31]:
normalize_essay(balc_df.Original_Text[94])

'The <x> l </x> mother_s_are all of life. We must h_a_ve a mother in all home. I have a mother in my home. My mothe_r_<x> l </x> is beutiful woman be_o_cuse <x> she </x> her bite me to make a good thinks. I m_u_st now not forg_g_eten a poit <x> w </x> he was said a mother is <x> la </x> same a school w_h_en you make you make avery good people.'

In [32]:
normalize_essay(balc_df.Original_Text[1510])

'Many people lost ^their^ <x> re pe the </x> father and mother because of many _r_easons and become orphan. They need many helps to pass their problems <x> such asit </x>. They lost _t_hier father and mother for many reasons s<x> h</x>uch as waves and exedants. <x>To fell good  </x> Our relegion told us to tereat the orphans kindly t_o_win the paradise. To do that we must be generous to give them w_h_at they need. _W_e must build shelters to save them <x> from </x>. We should forgive them when they do mistakes. If we want to be <x> the most </x> in a good situation we should learn the orphan people b_y_the schools. We should improve their skills by let them to play<x> and the imp </x> and have their time in fun. Finally we must <x> see save them fromen </x>. look after them becase they will be in the futur the men who will save our familiees and society.'

- UDF `retag` over-corrects: it joins some words that should stay separate (e.g. `t_o_win`, `b_y_the`). However, iterations of `retag` that leave the spaces around the tags in place would split up words like "orphans" and "reasons". 
    - Words that are connected by underscores will have to be manually checked and cleaned.
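
To make that manual check easier, the underscore-joined forms can be collected in one place first. The sketch below counts every whitespace-delimited token that contains an underscore; `underscore_tokens` is a hypothetical helper and is not part of the cleaning functions above.

```python
from collections import Counter

def underscore_tokens(essays):
    """Count whitespace-delimited tokens that contain an underscore."""
    counts = Counter()
    for essay in essays:
        counts.update(tok for tok in essay.split() if "_" in tok)
    return counts

sample = ["kindly t_o_win the paradise", "people b_y_the schools and many _r_easons"]
for token, n in underscore_tokens(sample).most_common():
    print(token, n)
```

Running it over the normalized essays gives a list to review by hand: forms like `_r_easons` are fine, while forms like `t_o_win` need a space restored.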

In [33]:
balc_df.Original_Text[1510]

'Name\nAbdullah  Lee\nSchool\nAl   Khaleej\nGrade\nGrade  11\n\n\n\nMany people lost  <i> their </i>  <x> re  pe  the </x> father  and mother because of many <o> r</o> easons and become orphan.  They  need  many  helps  to  pass  their  problems  <x> such  asit </x>.   They lost <o> t</o>hier  father  and mother for many reasons s<x> h</x>uch as waves  and exedants. <x To fell good  > </x> Our  relegion  told  us  to  tereat  the orphans kindly t<o> o </o>  win the paradise.  To do that we must be generous to give them w<o> h</o> at they need.   <o>W </o>e must  build shelters to save them <x> from </x>.  We  should forgive them when they do mistakes.  If we want to be <x> the most  </x> in a good situation we should learn the orphan people b<o> y </o> the schools.  We should improve their skills by let them to play<x> and the imp  </x> and have their time in fun.  Finally we must <x> see save them fromen </x>.   look after them becase they will be in the futur the men who will save ou

In [34]:
normalize_essay(balc_df.Original_Text[1510])

'Many people lost ^their^ <x> re pe the </x> father and mother because of many _r_easons and become orphan. They need many helps to pass their problems <x> such asit </x>. They lost _t_hier father and mother for many reasons s<x> h</x>uch as waves and exedants. <x>To fell good  </x> Our relegion told us to tereat the orphans kindly t_o_win the paradise. To do that we must be generous to give them w_h_at they need. _W_e must build shelters to save them <x> from </x>. We should forgive them when they do mistakes. If we want to be <x> the most </x> in a good situation we should learn the orphan people b_y_the schools. We should improve their skills by let them to play<x> and the imp </x> and have their time in fun. Finally we must <x> see save them fromen </x>. look after them becase they will be in the futur the men who will save our familiees and society.'

In [35]:
normalize_essay(balc_df.Original_Text[1])

'My worst holiday Last year I have just had the worst holiday ever. It was too board . My Lonely sister had got married . She was making me Laugh and play with me. But now I`m alone with my male brothers. I cant stand them they are too noisy . In the Spring holiday my brothers and I travelled to Aust_r_alia with our parents. It was really great and nice place, but I didn`t enjoyed it because no girls were with me . I asked my cousin to come with us but she refused that because she joined a sports club .There was alot of st_r_ange animals. My brother was taking photoes for the animals and I was too board. I was walking like a sick person. I hated my self and prayed to go back to our country to see my sister. I missed her so much. After few day we came back home then I became the happiest person in the world.'

##### Building balc_df up
- Normalizing essays with UDFs created in previous section
- Making "Revised_essay", which removes all tagging
- Tokenize
- Token counts
- TTR
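
The remaining steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the final implementation: `revise_essay`, `tokenize`, and `ttr` are hypothetical names, a simple regex tokenizer stands in for an `nltk` tokenizer, punctuation marks are counted as tokens, and types are compared in lowercase.

```python
import re

def revise_essay(normalized):
    """Drop crossed-out text and the remaining correction markers."""
    txt = re.sub(r"<x>.*?</x>", "", normalized)   # delete crossed-out spans entirely
    txt = re.sub(r"[_^]", "", txt)                # drop emphasis/insertion markers, keep the letters
    return re.sub(r"\s+", " ", txt).strip()

def tokenize(txt):
    # Assumption: words (with an optional apostrophe part) and single punctuation marks
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", txt)

def ttr(tokens):
    """Type-token ratio over lowercased tokens; 0.0 for an empty essay."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)

revised = revise_essay("lost ^their^ <x> re pe the </x> father for many _r_easons")
tokens = tokenize(revised)
print(revised, len(tokens), ttr(tokens))
```

Note that `revise_essay` also removes any literal underscore a student typed (as in `Al_ain`), so those cases would need to be handled before this step.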

In [36]:
balc_df['Normalized_Essay'] = balc_df.Original_Text.apply(normalize_essay)

In [37]:
balc_df.head()

Unnamed: 0,Filename,Original_Text,Normalized_Essay
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...,Every body in this life have a favourite posse...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...,you go in the oman just had the perfect holida...


In [38]:
# Let's take a quick look at underscores that are connected to something.
balc_df[balc_df.Normalized_Essay.str.contains(r'_[A-Za-z]+')]

Unnamed: 0,Filename,Original_Text,Normalized_Essay
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...
11,CEPA-3-200611351,\t\t\t\tCEPA 3 200611351\n\n\n\nAl_ain mall is...,Al_ain mall is the most beautiful place. First...
19,CEPA-1-200603206,\n\t\t\t\tCEPA 1 200603206\n\n\n\n Topi...,Topic _A_I will writing this a Prerf in far_me...
20,CEPA-3-200611379,\t\t\t\tCEPA 3 200611379\n\n\n\n\nI spend my w...,I spend my weekend in Dubai. I go with my fami...
23,AK9e,\t\t\t\nName\nSaid Afoos\nSchool\nAl Khalee...,you should drink becase it is good for your st...
24,FZ10e,\t\nName\nAmena Mal Allah \nSchool\nFatima A...,Mothers are symblols of love' _m_ercy ahd hope...
25,CEPA-3-200611190,\t\t\t\tCEPA 3 200611190\n\n\n\nThere are many...,There are many things make the worst holiday f...
26,CEPA-2-200606353,\t\t\t\tCEPA 2 200606353\n\n\n\nI m Love the U...,I m Love the UAE the UAE I am very very Love t...


481 rows returned. Some will probably be okay, and some will probably need to be separated. Let's take a closer look.

In [39]:
balc_df.Normalized_Essay[0]

"Now I tell you why my worst holiday ever in the last summer I wented withe my family in the India and this story I will tell you what happened for the short story when I go the first the weathe is very very rain now bady for the children play out when I go in the hotel all may family was have the headk in there and all was sleep put for my I can;t sleep because I not love the area in the morning all the my family weak up and going i<x>n the reast</x>rant but is the strees,children and the food is very dearty earia I not liked becuse is not nice area so darty and people there is not nice all there have not happy only sawted for my sister whem she take some flower for the mam and also when you there you see what I means maby some body liked go ther but for I didn't liked And for what happand I tell for my family I wanted to go in my country I didn't _l_ike her fainally I wanted tell for hem why your earea like thes you shold clean and your people so nerves..."

In [40]:
balc_df.Original_Text[1852]

'\t\t\t\tCEPA 4 200608027\n\n\n\n\t\t\t\tThe perfect holiday\n\n\nSummer holiday is the best holidays in a year. There are many things you can enjoy in summer holiday like travel, make a long time for a fun, reading many books to save information. In the last summer I enjoy traveled with my family to Australia, Malaysia and Thailand.\nIn Australia I enjoy traveled to Sydeny and Gold Coust. Sydeny is th<o>b</o> big city and the best city in the Australia. There are many place we enjoyed like zoo, musum and shopping centers. However, Gold Coust is the fun city that we enjoyed many big parks like sea worlds and Dream worlds. In addition the weathe was very cold and windy that we shold were a jus<o>t</o> and take umbrala. Also, we visited Maloysia. Malaysia is a very wonderful country. They are many trees, rivers and green mountain. There are a lot of rainning and the warm weather. However there are many place to fun and shop. And I shold taken an umbrella all the time. The third place us 

In [41]:
balc_df.Normalized_Essay[1852]

'The perfect holiday Summer holiday is the best holidays in a year. There are many things you can enjoy in summer holiday like travel, make a long time for a fun, reading many books to save information. In the last summer I enjoy traveled with my family to Australia, Malaysia and Thailand. In Australia I enjoy traveled to Sydeny and Gold Coust. Sydeny is th_b_big city and the best city in the Australia. There are many place we enjoyed like zoo, musum and shopping centers. However, Gold Coust is the fun city that we enjoyed many big parks like sea worlds and Dream worlds. In addition the weathe was very cold and windy that we shold were a jus_t_and take umbrala. Also, we visited Maloysia. Malaysia is a very wonderful country. They are many trees, rivers and green mountain. There are a lot of rainning and the warm weather. However there are many place to fun and shop. And I shold taken an umbrella all the time. The third place us visited are bangkok the city of Thailand that many Arabs t

In [42]:
print(balc_df.Original_Text[1851])
balc_df.Normalized_Essay[1851]

				CEPA 2 200606606



        Topic A 
        

In the summer I went with my f<x>ri</x>nds to the Mounten .
I see there a Fantactec place. I see a very big Mountens . Me and t<x>w</x>o with my frinedd clame The one oFthis Mounten and thae some buteful <o> imaeg bih</o>and the Mountens there are sea. My f<o>r</o>in<o>ed</o> go to swimme in the sea and divend there . he want to chac some Fish but he dont can . after that go to are stor and to ead a Food after fines .th<o>ae</o> go to mack some descaver. after that we all back to howe.



'Topic A In the summer I went with my f<x>ri</x>nds to the Mounten . I see there a Fantactec place. I see a very big Mountens . Me and t<x>w</x>o with my frinedd clame The one oFthis Mounten and thae some buteful _imaeg bih_and the Mountens there are sea. My f_r_in_ed_go to swimme in the sea and divend there . he want to chac some Fish but he dont can . after that go to are stor and to ead a Food after fines .th_ae_go to mack some descaver. after that we all back to howe.'

In [43]:
# UDF to remove tags
def remove_tags(text):
    """Remove student-correction tags from a Normalized_Essay.

    Text enclosed in <x>...</x> (crossed out by the student) is deleted along with the tags.
    The _ and ^ markers are stripped, but the text they enclose is kept.
    """
    text = re.sub(r'<x>.*?</x>', '', text)          # delete crossed-out text together with its <x> tags
    text = text.replace('_', '').replace('^', '')   # remove emphasis and insertion markers (_ and ^)
    return text
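
A quick sanity check of `remove_tags` on a short string may help here. The sample below is made up for illustration, pieced together from tag patterns seen in the essays above; it is not a row from the corpus. The function body is repeated so the snippet runs on its own.

```python
import re

def remove_tags(text):
    # Same logic as the UDF above, repeated so this snippet is self-contained
    text = re.sub(r'<x>.*?</x>', '', text)
    text = text.replace('_', '').replace('^', '')
    return text

# Invented sample combining the three marker types
sample = "My f<x>ri</x>nds lost ^their^ father. My f_r_in_ed_go to swimme"
cleaned = remove_tags(sample)
print(cleaned)
# My fnds lost their father. My frinedgo to swimme
```

Note the last part: because the closing underscore sits directly against the next word, deleting it fuses `ed` and `go` into `frinedgo`.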

In [44]:
# Apply remove_tags to create "Revised_Essay"
# Tokenize Revised_Essay
# Get token count
balc_df['Revised_Essay'] = balc_df.Normalized_Essay.apply(remove_tags)
balc_df['tokens'] = balc_df.Revised_Essay.apply(nltk.word_tokenize)
balc_df['token_count'] = balc_df.tokens.map(len)
balc_df.sample(20)

# Still need to address the connected underscore issue: stripping the markers can fuse neighbouring words (e.g. "jus_t_and")

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
1001,CEPA-3-200607405,\t\t\t\tCEPA 3 200607405\n\n\n\nI have just ha...,I have just had the perfect holiday . I travel...,I have just had the perfect holiday . I travel...,"[I, have, just, had, the, perfect, holiday, .,...",69
675,CEPA-5-200600551,\t\t\t\tCEPA 5 200600551\n\n\n\nI have just ha...,I have just had the worst holiday ever. I went...,I have just had the worst holiday ever. I went...,"[I, have, just, had, the, worst, holiday, ever...",186
1674,CEPA-5-200601146,\t\t\t\tCEPA 5 200601146\n\n\n\n The wo...,The worst holiday ever The last past holiday w...,The worst holiday ever The last past holiday w...,"[The, worst, holiday, ever, The, last, past, h...",228
1724,CEPA-2-200606605,\t\t\t\tCEPA 2 200606605\n\n\n\nI love play fo...,I love play football . I play with my friend ....,I love play football . I play with my friend ....,"[I, love, play, football, ., I, play, with, my...",54
1081,CEPA-2-200612429,\t\t\t\tCEPA 2 200612429\n\n\n\n Topic ...,Topic A I'm very think is. Ima you hav just ha...,Topic A I'm very think is. Ima you hav just ha...,"[Topic, A, I, 'm, very, think, is, ., Ima, you...",96
1300,CEPA-3-200612435,\t\t\t\tCEPA 3 200612435\n\n\n\nI like many pl...,I like many places but I like Al-Ain very well...,I like many places but I like Al-Ain very well...,"[I, like, many, places, but, I, like, Al-Ain, ...",74
653,CEPA-6-200619945,\t\t\t\tCEPA 6 200619945\n\n\n\nThere are so m...,There are so many things I like from my person...,There are so many things I like from my person...,"[There, are, so, many, things, I, like, from, ...",304
304,AK20e,Name\nOsama Zakaria\nSchool\nAl Khaleej\nGra...,Helping the orphan is every body's responsibil...,Helping the orphan is every body's responsibil...,"[Helping, the, orphan, is, every, body, 's, re...",63
998,CEPA-4-200608134,\t\t\t\tCEPA 4 200608134\n\n\n\nEvery body l...,"Every body like holidays and vecaition, becaus...","Every body like holidays and vecaition, becaus...","[Every, body, like, holidays, and, vecaition, ...",167
851,CEPA-3-200611323,\t\t\t\tCEPA 3 200611323\n\n\n\nI’m verry happ...,I'm verry happy to write about have just had t...,I'm verry happy to write about have just had t...,"[I, 'm, verry, happy, to, write, about, have, ...",239


In [45]:
balc_df.iloc[[243], :]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
243,Taiseer-53,\t\t\t\tTaiseer 53\n\n\n\n<x>When you help som...,<x>When you help someone to quit smoking you a...,s much has been said about the dangers XXX of...,"[s, much, has, been, said, about, the, dangers...",172


In [46]:
balc_df.Original_Text[243]

'\t\t\t\tTaiseer 53\n\n\n\n<x>When you help someone to quit smoking you almost certianly</x> s much has been said about the dangers XXX of smoking to our health. Bad breath stained teeth loss of taste and mouth sores are the least dangerous reaults. More serious results are XXX gum infection and damage to bones and tissues underteeth However the most terrible results are fatal diseases such as cancer and heart diseases Inspit of their awareness of the dangers of smoking smokers donot quit for many reasons firstly they are not always of the dangers to their healthy unfortunately they often are safe from dangerous diseases. Scondly it is easier to start smoking than it is to quit. It taskes a long time to stop this bad habit thirdly smokers becomesalves to the habit and are too weak to quit. Fourthly medicines that can help smokers to quit are expensive. Nevertheless if asmoker really wants to give up he can do it.\nTeengers are easy victims of the cigartising industry, They will be the 

In [47]:
# checking to make sure there are no "cepa" tokens included from incorrect/incomplete heading removal
balc_df[balc_df.tokens.apply(lambda x: 'cepa' in [y.lower() for y in x])]

# the rows returned are essays that talk about CEPA

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
6,CEPA-5-200608959,\t\t\t\tCEPA 5 200608959\n\n\n\nI have just ha...,I have just had the perfect and the best holid...,I have just had the perfect and the best holid...,"[I, have, just, had, the, perfect, and, the, b...",277
147,CEPA-2-200612443,\t\t\t\tCEPA 2 200612443\n\n\n\nI’m writing To...,I'm writing Topic B to Describe the best film ...,I'm writing Topic B to Describe the best film ...,"[I, 'm, writing, Topic, B, to, Describe, the, ...",310
273,CEPA-3-200611155,\t\t\t\tCEPA 3 200611155\n\n\n\nIn the last su...,"In the last summer holiday, I hade the worest ...","In the last summer holiday, I hade the worest ...","[In, the, last, summer, holiday, ,, I, hade, t...",264
906,CEPA-2-200612584,\t\t\t\tCEPA 2 200612584\n\n\n\nThese weekend ...,These weekend the worst. Because I stayd at ho...,These weekend the worst. Because I stayd at ho...,"[These, weekend, the, worst, ., Because, I, st...",52
1370,CEPA-3-200611303,\t\t\t\tCEPA 3 200611303\n\n\n\nThere are many...,There are many qualites the worst weekend bad....,There are many qualites the worst weekend bad....,"[There, are, many, qualites, the, worst, weeke...",199
1557,CEPA-1-200604402,\t\t\t\tCEPA 1 200604402\n\n\n\n<x>me</x> <o>m...,<x>me</x> _my you went ar_wous. went <x>my</x>...,my you went arwous. went and fraund Yor ne...,"[my, you, went, arwous, ., went, and, fraund, ...",43


In [48]:
un_head(clean_text(balc_df.Original_Text[273]))

" In the last summer holiday, I hade the worest holidy ever in my live, because many thing its had not fun. Firstly, the place want to respect my holidy, it's not very nice in my opinion, but my father and some brother, they need to goto this place, also my some relative liked went to same place, only my and my mother did'nt like went, Now I said the place all my family them went its Salala in Oman. Secoundly, why me did'nt like we to slala becaus we went <x>By</x> <i>by</i> car put me do like set in the ar in long time because I feel asto mich, and same time hideich, d'so the driver the car his my father and he drive ver slowly, and I ome of timemy uncil in onther car he bushe m<x>y</x> father car xxx so we spent tow oure to correct the car in the desart at night. Thirtly in the Salala its had many animals and nuts, and me leated all of animal just only only xxxone of like it's the bird. Finaly, I like in thise hoildy summer the sea in the Salal because it very very nice, and me alrea

In [49]:
balc_df[balc_df.Normalized_Essay.str.contains(r'^ ?-')]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count


In [50]:
balc_df.Original_Text[1763]

'\t\t\t\tCEPA 4 200607711\n\n\n\n- Holiday is the most important time in our life , so we decided to spend it in a useful thing.\n\n I like very much summer holiday because we usealy trare in that holidy . In Last summer holiday .I think I spend all my time  in useful things . For example   I travel with my family to the (thiland)  it was a wonderful country and the weather was beautiful . I enjoy my time their . We went  to a aplace that name ( poct )  it was a wonderful place I ever see. I same forst with all that trees .  Then we went to the (pangkok) it the Capital  of this country .I think it is the economic center of the Thiland .After 3 weeks we returned back to my lovely country. After that I take a  computer courses  for two week to improve me skill. The summer in dubai is so  wonderful because  the summer surprises will strats.\nWe go to the modhich city to play and enjoy our time. \nI like shopping very much so  I spend all my time in malls like (  city center, Burjoman and 

In [51]:
balc_df.Normalized_Essay[1763]

'Holiday is the most important time in our life , so we decided to spend it in a useful thing. I like very much summer holiday because we usealy trare in that holidy . In Last summer holiday .I think I spend all my time in useful things . For example I travel with my family to the (thiland) it was a wonderful country and the weather was beautiful . I enjoy my time their . We went to a aplace that name ( poct ) it was a wonderful place I ever see. I same forst with all that trees . Then we went to the (pangkok) it the Capital of this country .I think it is the economic center of the Thiland .After 3 weeks we returned back to my lovely country. After that I take a computer courses for two week to improve me skill. The summer in dubai is so wonderful because the summer surprises will strats. We go to the modhich city to play and enjoy our time. I like shopping very much so I spend all my time in malls like ( city center, Burjoman and merkato ) . I think holiday last summer it was the perf

In [52]:
# Let's get TTR
def get_TTR(toks):
    """Return the type-token ratio (TTR) of a list of tokens.

    All tokens are lowercased; punctuation tokens are included.
    TTR = number of unique lowercased tokens / total number of tokens.
    """
    all_toks = [x.lower() for x in toks]
    if len(all_toks) == 0:   # one file has 0 tokens; avoid dividing by zero
        return 0
    return len(set(all_toks)) / len(all_toks)

In [53]:
# checking UDF
foo = "My name is Elena and I am a Junior Linguist. I study applied linguistics, and i want to be a linguist."
get_TTR(foo.split())

0.7619047619047619

In [54]:
# checking UDF by hand
lil = [x.lower() for x in foo.split()]
len(set(lil))/len(lil)

0.7619047619047619

In [55]:
# checking UDF on an empty string (split first, so the UDF receives a list of tokens)
foo = ""
get_TTR(foo.split())

0

In [56]:
balc_df['TTR'] = balc_df.tokens.apply(get_TTR)

In [57]:
balc_df.head()

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count,TTR
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...,Now I tell you why my worst holiday ever in th...,"[Now, I, tell, you, why, my, worst, holiday, e...",207,0.492754
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...,My worst holiday Last year I have just had the...,"[My, worst, holiday, Last, year, I, have, just...",180,0.572222
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...,Every body in this life have a favourite posse...,Every body in this life have a favourite posse...,"[Every, body, in, this, life, have, a, favouri...",229,0.445415
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...,Every body have a lot ofpossessions in this li...,"[Every, body, have, a, lot, ofpossessions, in,...",156,0.608974
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...,you go in the oman just had the perfect holida...,you go in the oman just had the perfect holida...,"[you, go, in, the, oman, just, had, the, perfe...",27,0.62963


In [58]:
# The one file with no tokens
balc_df[balc_df.token_count == 0]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count,TTR
1188,CEPA-1-200604511,\t\t\t\tCEPA 1 200604511\n\n\n\n\n,,,[],0,0.0


In [59]:
# "I t's"... Was messed up in the original text as well. Probably messes up the token count
balc_df.Normalized_Essay[66]

"I am is Ali ^in the stop smoking in the cigarette. I t's help smokers quit smoking. I relley somoking in the Large dangerous results. L They my friend somking beacous help quit. My father and mouther I advies didn't friend they smoking. I t's ^ in the Gulf smoke 2 50% to youth. I advies to smoking quit. You should lost of cigarette advertisements. Let's in the semoking dangerous healthy. My father to in the would 2 % 95 in the cigarett. I t's countery Large Breoblam didn't smoking. I they in the people speres ^in the Smoking. Smoking kills in the cigarette. If I were you, I would grent father in the smoking. Pleaes stop smoking beacous dangerous in the would. pleaes to her me in the would quit smoking."

In [60]:
balc_df.tokens[66]
# yeah. "I t's", which should really be "It's", becomes ["I", "t", "'s"]

['I', 'am', 'is', 'Ali', 'in', 'the', 'stop', 'smoking', 'in', 'the', 'cigarette', '.', 'I', 't', "'s", 'help', 'smokers', 'quit', 'smoking', '.', 'I', 'relley', 'somoking', 'in', 'the', 'Large', 'dangerous', 'results', '.', 'L', 'They', 'my', 'friend', 'somking', 'beacous', 'help', 'quit', '.', 'My', 'father', 'and', 'mouther', 'I', 'advies', 'did', "n't", 'friend', 'they', 'smoking', '.', 'I', 't', "'s", 'in', 'the', 'Gulf', 'smoke', '2', '50', '%', 'to', 'youth', '.', 'I', 'advies', 'to', 'smoking', 'quit', '.', 'You', 'should', 'lost', 'of', 'cigarette', 'advertisements', '.', 'Let', "'s", 'in', 'the', 'semoking', 'dangerous', 'healthy', '.', 'My', 'father', 'to', 'in', 'the', 'would', '2', '%', '95', 'in', 'the', 'cigarett', '.', 'I', 't', "'s", 'countery', 'Large', 'Breoblam', 'did', "n't", 'smoking', '.', 'I', 'they', 'in', 'the', 'people', 'speres', 'in', 'the', 'Smoking', '.', 'Smoking', 'kills', 'in', 'the', 'cigarette', '.', 'If', 'I', 'were', 'you', ',', 'I', 'would', 'gren
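
One possible way to handle this, sketched only as an assumption and not applied anywhere in this notebook, is to rejoin the split contraction before tokenizing. `join_split_its` is a hypothetical helper. It targets only the exact pattern `I t's` and would also rewrite any genuine occurrence of that sequence, so the affected essays should be checked by hand before using it.

```python
import re

# Matches a capital "I", one space, then "t's" as whole words (case-sensitive)
SPLIT_ITS = re.compile(r"\bI t's\b")

def join_split_its(text):
    """Hypothetical helper: rewrite the split form "I t's" as "It's"."""
    return SPLIT_ITS.sub("It's", text)

# Fragments taken from Normalized_Essay[66] above
sample = "I t's help smokers quit smoking. I t's ^ in the Gulf smoke"
fixed = join_split_its(sample)
print(fixed)
# It's help smokers quit smoking. It's ^ in the Gulf smoke
```

After this step, a tokenizer would see `It's` instead of the three separate pieces `I`, `t`, `'s`.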

In [61]:
balc_df[balc_df.Original_Text.str.contains('phans')]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count,TTR
364,AK4e,\nName\nMahmood Hanawi\nSchool\nAl Khaleej\...,<x> M </x> Many people says that <x> op </x> o...,Many people says that orphan children are us...,"[Many, people, says, that, orphan, children, a...",209,0.45933
630,FZ16e,Name\nMaytha Khamis\nSchool\nFatima Al Zahna\n...,There are many orphans in the world and helpin...,There are many orphans in the world and helpin...,"[There, are, many, orphans, in, the, world, an...",144,0.576389
1304,FZ19e,Name\nKholod Al Audi\nSchool\nFatima Al Za...,I once heard someone quot<x> e</x>ing the word...,I once heard someone quoting the words of a wi...,"[I, once, heard, someone, quoting, the, words,...",327,0.513761
1510,AK19e,Name\nAbdullah Lee\nSchool\nAl Khaleej\nGra...,Many people lost ^their^ <x> re pe the </x> fa...,Many people lost their father and mother beca...,"[Many, people, lost, their, father, and, mothe...",142,0.549296


In [62]:
balc_df.Original_Text[364]

'\nName\nMahmood  Hanawi\nSchool\nAl   Khaleej\nGrade\nGrade 11\n\n\n<x> M </x> Many people says that <x> op </x> orphan children are useless  <x>and they are bad layer on the com</x> and  they are bad persons in the society, but this idea is so wrong, because there is many great persons in the life was<x> op </x> orp<o> han</o>  s,  but the were great  leaders and  <x> goo </x> very  good <x> po </x>  people.\n\nUsually  o<o> r</o>phans  are weak and miserable, because there is no one take  c<o> are</o> of them, and if no one cared about them, they may die or  suffer, or even beco<o> m</o>e  crimenals.  <x> and  thiev</x>  And  to stop thi<o> s</o>  , we should help them, to imp<o> r</o> ove  <x> there</x>   their  life,  and  sup<o> por</o>  t  them  with  basic  needs,  such  as  food  <x> and </x>,  clothes,  and  place  to  live  in  and  we  should  be  generous  with  them,  and  we  should  be  show  them  murcy  and  <x>under</x>   wed  so  should  understan  <x> there  passio

In [63]:
def womp(text):
    """Experiment: replace underscore markers with spaces instead of deleting them."""
    text = text.replace("_", " ")
    return text
balc_df['Attempt'] = balc_df.Normalized_Essay.apply(womp)
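
Replacing each underscore with a space keeps neighbouring words apart, but it also splits words where the markers sit inside a word (`holid_a_y` would become `holid a y`). The underscores appear to come from the `<o>` tags in `Original_Text`, so one alternative, sketched here purely as an assumption, is to unwrap those tags in the original text, where the spacing around them is still intact. `unwrap_o_tags` is a hypothetical helper, not part of this notebook. Stripping the whitespace inside the tag is a guess based on cases like `o<o> r</o>phans`.

```python
import re

# <o>...</o> with a non-greedy body; DOTALL in case a tag spans a line break
O_TAG = re.compile(r'<o>(.*?)</o>', flags=re.DOTALL)

def unwrap_o_tags(text):
    """Hypothetical helper: drop <o>...</o> tags but keep the enclosed text.

    Whitespace inside the tag is stripped (an assumption); whitespace
    outside the tag is left exactly as it was.
    """
    return O_TAG.sub(lambda m: m.group(1).strip(), text)

# Fragments copied from Original_Text rows shown above
samples = [
    "we shold were a jus<o>t</o> and take umbrala",
    "Usually  o<o> r</o>phans  are weak",
]
results = [unwrap_o_tags(s) for s in samples]
for r in results:
    print(r)
# we shold were a just and take umbrala
# Usually  orphans  are weak
```

This keeps `just and` as two words, because the space after the closing tag is never touched.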

In [64]:
balc_df.head()

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count,TTR,Attempt
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...,Now I tell you why my worst holiday ever in th...,"[Now, I, tell, you, why, my, worst, holiday, e...",207,0.492754,Now I tell you why my worst holiday ever in th...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...,My worst holiday Last year I have just had the...,"[My, worst, holiday, Last, year, I, have, just...",180,0.572222,My worst holiday Last year I have just had the...
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...,Every body in this life have a favourite posse...,Every body in this life have a favourite posse...,"[Every, body, in, this, life, have, a, favouri...",229,0.445415,Every body in this life have a favourite posse...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...,Every body have a lot ofpossessions in this li...,"[Every, body, have, a, lot, ofpossessions, in,...",156,0.608974,Every body have a lot ofpossessions in this li...
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...,you go in the oman just had the perfect holida...,you go in the oman just had the perfect holida...,"[you, go, in, the, oman, just, had, the, perfe...",27,0.62963,you go in the oman just had the perfect holida...


In [93]:
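def get_prof(fn):
    """Return the CEPA proficiency level (the digit between hyphens) from a filename, or "N/A" for non-CEPA files."""
    if fn.startswith('C'):
        num = re.findall(r'-\d-', fn)   # raw string, so \d is not treated as a string escape
        num = ''.join(num)
        num = num.replace('-', '')
        return num
    return "N/A"
balc_df[balc_df.Filename.str.contains(r'^C')]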

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count,TTR,Attempt
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...,Now I tell you why my worst holiday ever in th...,"[Now, I, tell, you, why, my, worst, holiday, e...",207,0.492754,Now I tell you why my worst holiday ever in th...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...,My worst holiday Last year I have just had the...,"[My, worst, holiday, Last, year, I, have, just...",180,0.572222,My worst holiday Last year I have just had the...
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...,Every body in this life have a favourite posse...,Every body in this life have a favourite posse...,"[Every, body, in, this, life, have, a, favouri...",229,0.445415,Every body in this life have a favourite posse...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...,Every body have a lot ofpossessions in this li...,"[Every, body, have, a, lot, ofpossessions, in,...",156,0.608974,Every body have a lot ofpossessions in this li...
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...,you go in the oman just had the perfect holida...,you go in the oman just had the perfect holida...,"[you, go, in, the, oman, just, had, the, perfe...",27,0.629630,you go in the oman just had the perfect holida...
5,CEPA-1-200606381,\t\t\t\tCEPA 1 200606381\n\n\n\nThe most beaut...,The most beautiful place you hnow he see becau...,The most beautiful place you hnow he see becau...,"[The, most, beautiful, place, you, hnow, he, s...",20,0.850000,The most beautiful place you hnow he see becau...
6,CEPA-5-200608959,\t\t\t\tCEPA 5 200608959\n\n\n\nI have just ha...,I have just had the perfect and the best holid...,I have just had the perfect and the best holid...,"[I, have, just, had, the, perfect, and, the, b...",277,0.440433,I have just had the perfect and the best holid...
8,CEPA-5-200600322,\t\t\t\tCEPA 5 200600322\n\n\n\nOne of my most...,One of my most enjoyable activities is Ice ska...,One of my most enjoyable activities is Ice ska...,"[One, of, my, most, enjoyable, activities, is,...",132,0.621212,One of my most enjoyable activities is Ice ska...
10,CEPA-1-200603548,\t\t\t\tCEPA 1 200603548\n\n\n\nthe smar the h...,the smar the holde go famli and frened and the...,the smar the holde go famli and frened and the...,"[the, smar, the, holde, go, famli, and, frened...",54,0.611111,the smar the holde go famli and frened and the...
11,CEPA-3-200611351,\t\t\t\tCEPA 3 200611351\n\n\n\nAl_ain mall is...,Al_ain mall is the most beautiful place. First...,Alain mall is the most beautiful place. Firstl...,"[Alain, mall, is, the, most, beautiful, place,...",149,0.449664,Al ain mall is the most beautiful place. First...


In [94]:
get_prof(balc_df.Filename[31])

'2'

In [95]:
balc_df['Proficiency'] = balc_df.Filename.apply(get_prof)

In [96]:
balc_df.head()

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count,TTR,Attempt,Proficiency
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...,Now I tell you why my worst holiday ever in th...,"[Now, I, tell, you, why, my, worst, holiday, e...",207,0.492754,Now I tell you why my worst holiday ever in th...,3
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...,My worst holiday Last year I have just had the...,"[My, worst, holiday, Last, year, I, have, just...",180,0.572222,My worst holiday Last year I have just had the...,4
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...,Every body in this life have a favourite posse...,Every body in this life have a favourite posse...,"[Every, body, in, this, life, have, a, favouri...",229,0.445415,Every body in this life have a favourite posse...,5
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...,Every body have a lot ofpossessions in this li...,"[Every, body, have, a, lot, ofpossessions, in,...",156,0.608974,Every body have a lot ofpossessions in this li...,4
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...,you go in the oman just had the perfect holida...,you go in the oman just had the perfect holida...,"[you, go, in, the, oman, just, had, the, perfe...",27,0.62963,you go in the oman just had the perfect holida...,1


In [100]:
balc_df.Filename.describe()

count                 1856
unique                1856
top       CEPA-6-200620556
freq                     1
Name: Filename, dtype: object

In [102]:
balc_df.Filename.str.contains('200609914').value_counts()

False    1855
True        1
Name: Filename, dtype: int64

In [97]:
balc_df.Proficiency.value_counts()

5      299
4      297
2      292
1      284
3      250
6      250
N/A    182
         2
Name: Proficiency, dtype: int64
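
The two rows with a blank `Proficiency` follow from how `get_prof` is written: for a filename that starts with `C` but has no `-digit-` segment, `re.findall` returns an empty list, which joins to an empty string. A stricter variant is sketched below as an assumption, not as what this notebook does. It anchors on the `CEPA-` prefix and falls back to `"N/A"` when no level is found. `get_prof_strict` and the third filename are made up for illustration.

```python
import re

# A single digit directly after the "CEPA-" prefix, followed by a hyphen
CEPA_LEVEL = re.compile(r'^CEPA-(\d)-')

def get_prof_strict(fn):
    """Hypothetical variant of get_prof: return the level digit or "N/A"."""
    m = CEPA_LEVEL.match(fn)
    return m.group(1) if m else "N/A"

names = [
    "CEPA-3-200607296",   # regular CEPA filename from the corpus
    "Taiseer-86",         # non-CEPA filename from the corpus
    "CEPA-200600000",     # invented name with no level segment
]
levels = [get_prof_strict(n) for n in names]
print(levels)
# ['3', 'N/A', 'N/A']
```

With the existing column, `balc_df[balc_df.Proficiency == '']` should list the two affected rows so their filenames can be inspected directly.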

In [98]:
balc_df[balc_df.Filename.str.contains(r'^C')].sample(50)

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count,TTR,Attempt,Proficiency
1102,CEPA-1-200602561,\t\t\t\tCEPA 1 200602561\n\n\n\nI go the Duba...,I go the Dubai I going to th fathir in mothere...,I go the Dubai I going to th fathir in mothere...,"[I, go, the, Dubai, I, going, to, th, fathir, ...",22,0.636364,I go the Dubai I going to th fathir in mothere...,1
1170,CEPA-4-200605572,\t\t\t\tCEPA 4 200605572\n\n\n\nlast mounth I ...,last mounth I went to Al -Ain city with my fam...,last mounth I went to Al -Ain city with my fam...,"[last, mounth, I, went, to, Al, -Ain, city, wi...",254,0.448819,last mounth I went to Al -Ain city with my fam...,4
1722,CEPA-6-200609899,\t\t\t\tCEPA 6 200609899\n\n\n\nMy favarite ac...,My favarite activity is rollerblading . Whenev...,My favarite activity is rollerblading . Whenev...,"[My, favarite, activity, is, rollerblading, .,...",205,0.497561,My favarite activity is rollerblading . Whenev...,6
1554,CEPA-3-200607316,\t\t\t\tCEPA 3 200607316\n\n\n\nI am very like...,I am very like to go the party and pailying wi...,I am very like to go the party and pailying wi...,"[I, am, very, like, to, go, the, party, and, p...",146,0.465753,I am very like to go the party and pailying wi...,3
1830,CEPA-3-20067753,\t\t\t\tCEPA 3 20067753\n\n\n\nIn sutardy I wi...,"In sutardy I will perfact happy holiday,..I we...","In sutardy I will perfact happy holiday,..I we...","[In, sutardy, I, will, perfact, happy, holiday...",177,0.491525,"In sutardy I will perfact happy holiday,..I we...",3
758,CEPA-6-200620051,\t\t\t\tCEPA 6 200620051\n\n\n\n\t\t\t\tAn ama...,An amazing weekend Last week my friends and I ...,An amazing weekend Last week my friends and I ...,"[An, amazing, weekend, Last, week, my, friends...",302,0.44702,An amazing weekend Last week my friends and I ...,6
250,CEPA-2-200612126,\t\t\t\tCEPA 2 200612126\n\n\nI have just had ...,I have just had the worst holiday ever! I went...,I have just had the worst holiday ever! I went...,"[I, have, just, had, the, worst, holiday, ever...",70,0.628571,I have just had the worst holiday ever! I went...,2
202,CEPA-4-200607509,\t\t\t\t\n\t\t\t\tCEPA 4 200607509\n\n\n\nThe ...,The best holid_a_y I haveit with my Family sin...,The best holiday I haveit with my Family sinse...,"[The, best, holiday, I, haveit, with, my, Fami...",140,0.6,The best holid a y I haveit with my Family sin...,4
1643,CEPA-4-200607909,\t\t\t\tCEPA 4 200607909\n\n\n\n\t\t\t\tTopic ...,"Topic A I went to bankok with my family, it wa...","Topic A I went to bankok with my family, it wa...","[Topic, A, I, went, to, bankok, with, my, fami...",226,0.482301,"Topic A I went to bankok with my family, it wa...",4
1840,CEPA-1-200610505,\t\t\t\tCEPA 1 200610505\n\n\n\nWhere you went...,"Where you went, who you went there with, what ...","Where you went, who you went there with, what ...","[Where, you, went, ,, who, you, went, there, w...",199,0.482412,"Where you went, who you went there with, what ...",1


In [99]:
balc_df.Original_Text[1783]

'\t\t\t\tCEPA 1 200604411\n\n\n\nThe ar<o>e</o> simr it flmatd . the you went smie . who you went there will about that topic. the my below about minute. The are the smiwwing iodos could are write essay. the are may the page for notes.The are diodet sheel whated. The are happened is my hioldy the smiwwing\n'