# BALC cleaning
2019.02.08 — 2019.02.17

## Summary of code
- Creating a corpus dictionary (`corpus_dict`) in which `corpus_dict[essay]` returns the text of the essay
- Setting up `balc_df`, a DataFrame that includes the filename and text of each essay in the corpus
- Cleaning up the essays by removing headers from the essays and normalizing the coding of student corrections (when available)

### Initial set-up

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import glob
import re

%pprint      # turn off pretty printing

# Creating a path for the corpus. All folders in the corpus have spaces in them -- 
# some file names do as well. Keep in mind!
cor_dir = "private/BUiD Arab Learner Corpus v.1/total/"

Pretty printing has been turned OFF


In [2]:
# Creating a UDF that returns a file name with its spaces replaced by hyphens, to be used during dictionary formation of files
def rename_file(file):
    """Replaces each run of whitespace in a file name with a hyphen."""
    return re.sub(r'\s+', '-', file)

# Testing the UDF
rename_file("/folder name/folder name 2/CEPA 1 0123456789")

'/folder-name/folder-name-2/CEPA-1-0123456789'

##### Reading in and checking files

In [3]:
# All right - let's do this with the corpus
# utf-8-sig strips the byte order mark (\ufeff) if a file starts with one
corpus = glob.glob(cor_dir+'*.txt')
corpus_dict = {}
for file in corpus:
    with open(file, encoding='utf-8-sig') as f:
        txt = f.read()
    file = rename_file(file)
    start = file.rindex('/')+1
    name = file[start:-4]      # drop the directory and the '.txt' extension
    corpus_dict[name] = txt
    
# checking number of keys
len(corpus_dict.keys())

# good, we're not missing anything

1856
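The count check above relies on eyeballing the number of keys. Here is a minimal sketch of a stricter version, assuming a hypothetical helper `build_corpus_dict` (not part of this notebook) that raises an error if two files collapse to the same key after renaming:

```python
import glob
import os
import re


def build_corpus_dict(corpus_dir):
    """Read every .txt file in corpus_dir into a {name: text} dictionary.

    Hypothetical helper, not part of the original notebook. Whitespace runs in
    each file name are replaced with '-', and a ValueError is raised if two
    files end up with the same key.
    """
    texts = {}
    for path in sorted(glob.glob(os.path.join(corpus_dir, "*.txt"))):
        base = os.path.splitext(os.path.basename(path))[0]
        name = re.sub(r"\s+", "-", base)
        if name in texts:
            raise ValueError(f"duplicate key after renaming: {name}")
        # utf-8-sig drops a leading byte order mark (\ufeff) if present
        with open(path, encoding="utf-8-sig") as handle:
            texts[name] = handle.read()
    return texts
```

Using `os.path` instead of `rindex('/')` also keeps the key extraction independent of the path separator.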

In [4]:
corpus_dict.keys()

dict_keys(['CEPA-3-200607296', 'CEPA-4-200607457', 'CEPA-5-200600487', 'CEPA-4-200608016', 'CEPA-1-200611825', 'CEPA-1-200606381', 'CEPA-5-200608959', 'Taiseer-86', 'CEPA-5-200600322', 'Taiseer-79', 'CEPA-1-200603548', 'CEPA-3-200611351', 'CEPA-5-200600450', 'AS17e', 'CEPA-3-200607269', 'Taiseer-45', 'CEPA-5-200603617', 'CEPA-5-200607471', 'CEPA-1-200610508', 'CEPA-1-200603206', 'CEPA-3-200611379', 'CEPA-5-200601014', 'Taiseer-51', 'AK9e', 'FZ10e', 'CEPA-3-200611190', 'CEPA-2-200606353', 'CEPA-2-200606421', 'CEPA-2-200612454', 'CEPA-2-200611582', 'CEPA-2-200601400', 'CEPA-2-200601414', 'CEPA-2-200607065', 'CEPA-2-200612468', 'CEPA-4-200607682', 'CEPA-6-200620020', 'CEPA-6-200621470', 'CEPA-1-200601954', 'CEPA-4-200605888', 'CEPA-4-200607721', 'CEPA-4-200607735', 'CEPA-3-200611966', 'CEPA-6-200620801', 'CEPA-5-200607908', 'CEPA-6-200620815', 'CEPA-1-200610124', 'CEPA-1-200600323', 'CEPA-1-200611548', 'CEPA-4-200605877', 'CEPA-5-200607707', 'CEPA-5-200601376', 'US-10', 'CEPA-5-200605104'

In [5]:
corpus_dict["CEPA-1-200601970"]

'\t\t\t\tCEPA 1 200601970\n\n\n\nYou have just had the perfect holiday you went Yaman There go <o>my  father and my brather. You</o> saw and did hadr<o>a</o> mot and sanaa moll. It was so wonderful m<o>y</o> famely\n'

In [6]:
corpus_dict["CEPA-5-200600215"]

'\t\t\t\tCEPA 5 200600215\n\n\n\nLast summer holiday was the worst holiday I have ever had. It was bad holiday because evrythings happened suddenly and without any prepairing.\nLast summer holiday, my family decided to spend  the holiday in India, so my father booked us a tickets to India. We all prepaired the bags for travelling on Tuesday.\nWe were on the airport befor one hour of plane flying. We were told that the plane had something wrong and it would be late. We waited for three hours in the airport. Then, we flew to India. It took us three hours. When we arrived we started looking for taxi for a long time. After that we found small bus to take us to the hotel. \nIn the way of the hotel, I saw many dogs in the street and I was afraid. In the entrence of the hotel there were many poor children with dirty cats. We spent this day in hotel because of  the bad weather.\nNext day, I suggested going to the park. My father bought the lunch and took us to the park. We played in some games

In [7]:
corpus_dict["CEPA-1-200600677"]

'\t\t\t\tCEPA 1 200600677\n\n\n\nshe have just ha<o>d</o> the perfect. went go to the Da<o>bia</o>. Dabia is the dviring, and is vere <o>Fantastc</o> went my Frinds and Famly. He will go in the Ibn batota, and swimming. The swimming Dabia saw vere Fantastc and dviring. and it’s going in the <o>cin</o>ema Dabia. \n'

In [8]:
corpus_dict["AK17e"]

'\t\t\nName\nEbeeid    Mubarak\nSchool\nAl   Khaleej\nGrade\nGrade  12\n\n\n\nThe  <x> l </x> mother<o> s </o>  are all of life.  We must h<o>a </o> ve a mother in all home.  I have a mother in my home.  My mothe<o> r </o> <x> l </x> is  beutiful woman be<o> o</o> cuse <x> she </x> her  bite  me  to  make  a  good  thinks.  I  m<o> u</o> st  now not forg<o> g</o> eten  a   poit <x> w </x>    he  was  said  a  mother is  <x> la </x> same  a  school  w<o> h</o>en  you  make  you  make  avery  good   people.\n\n\n'

#### Overview:
- It appears that all of the files have some kind of heading before the actual text
    - This will affect measures that depend on tokenization and token counts, so we'll need to remove this information
    - The information isn't standard across all of the files. The headings may or may not include the following information: 
        - The file name
        - The student's name
        - The student's school
        - The student's grade 
        - The perceived proficiency of the student's essay 
- Coding of student corrections:
    - Not all files have student correction codes.
    - The coding of student corrections is not uniform. I knew to expect this from some preliminary exploration I did when I first downloaded the corpus, but here is a summary of known issues, which can be seen in the files above: 
        - Some files use "↑" whereas others use HTML-style tags `<o> ... </o>`, `<i> ... </i>`, and `<x> ... </x>`
        - The tagging of correction codes is not uniform, e.g. `fr<o om > </o>` vs `thi<o> s</o>`: in the first example the opening `<o>` tag itself _contains_ the part of the word being tagged, and in the second the tagged letter sits between the tags
        - The spacing of tagging varies, such as: 
            - `<x>and they are bad layer on the com</x>`, where the cross-out tags are flush with the words
            - `or<x> e </x>`, where `e` is the thing crossed out, but the opening tag is attached to the previous word and the closing tag comes after a space
            - `and  <x> goo </x> very`, where the tags are separated by a space from each side of the thing being crossed out 
            - `<o> s</o> ociety`, where one letter of a word is emphasized, but the letter _is not_ connected to the word it is a part of
            - `<o> A</o>lso`, where the emphasized letter _is_ connected to the word it belongs to
            - `c<o> are</o>`, where the opening tag is attached to the start of the word, but the emphasized letters are separated from it by a space
            - `orp<o> han</o>  s`, which splits the word "orphans"
            - `m<o>y</o>`, where everything is flush
            - ` There go <o>my  father and my brather. You</o> saw`, where an entire phrase (that spans two sentences!) is emphasized
        - There are probably more issues that will come up later, but these are the ones to take into account for now. 
- In a pd.DataFrame, I would like: 
    - An `Original_Essay` column, that reads in the original text
        - Maybe get rid of heading though?
    - A `Normalized_Essay` column, that has a more (or completely) uniform tagging system 
        - `<o>` -> `_`
        - `<i>` -> `^`
        - `<x>` doesn't have to change, I don't think
        - This is where the issues raised earlier will really crop up
    - A `Cleaned_Essay` column that removes the tagging, and also deletes words that the correction codes indicate students crossed out
    - POS tags
    - token count, average sentence length, the usual stuff
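The tag variants listed above can be surveyed before writing any clean-up code. This is a rough sketch, not the method used in the rest of this notebook: `TAG_PATTERN` and `survey_tags` are hypothetical names, and the pattern is an assumption based only on the variants noted so far. It counts well-formed tags, malformed opening tags (like `<x To fell good  >`), and "↑" markers in a single essay.

```python
import re
from collections import Counter

# Matches anything that looks like an opening or closing o/i/x tag, including
# malformed openings such as "<x To fell good  >". Assumed pattern, based only
# on the variants listed in the overview.
TAG_PATTERN = re.compile(r"<\s*(/?)\s*([oix])\b([^<>]*)>")


def survey_tags(text):
    """Count well-formed and malformed o/i/x tags, plus arrows, in one essay."""
    counts = Counter()
    for match in TAG_PATTERN.finditer(text):
        closing, name, extra = match.groups()
        if closing:
            kind = "close"
        elif extra.strip():
            # Text between the tag name and '>' means the tag was not closed properly
            kind = "malformed_open"
        else:
            kind = "open"
        counts[(name, kind)] += 1
    # The arrow is the other insertion marker mentioned in the overview
    counts[("arrow", "marker")] = text.count("↑")
    return counts
```

Summing these counters over `corpus_dict.values()` would give a quick picture of how common each variant is before deciding how much normalization is worth automating.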

##### Putting a DataFrame together

In [9]:
# Creating a UDF that will return the text of a filename from corpus_dict
def get_text(file):
    return corpus_dict[file]

In [10]:
# Setting up the DataFrame, starting with one row per filename
balc_df = pd.DataFrame(list(corpus_dict.keys()), columns=["Filename"])
print(balc_df.shape)
balc_df.head()

# good. it has the right number of files.

(1856, 1)


Unnamed: 0,Filename
0,CEPA-3-200607296
1,CEPA-4-200607457
2,CEPA-5-200600487
3,CEPA-4-200608016
4,CEPA-1-200611825


In [11]:
balc_df['Original_Text'] = balc_df.Filename.apply(get_text)
balc_df.head()

Unnamed: 0,Filename,Original_Text
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...


##### Attempts at cleaning up a bit 

In [12]:
def clean_text(txt):
    txt = re.sub(r'[\n\t ]+', ' ', txt)
    txt = re.sub(r'`', ' ', txt)
    txt = txt.strip()
    return txt

def retag(essay):
    """Replaces tags for student emphasis (<o>, </o>) with '_' and removes any unnecessary spaces between emphasized 
    letters in words. Replaces tags for student insertions (<i>, </i>) with ^ and removes any unnecessary spaces 
    between letters in inserted words. Closes any student tags that were left open by researchers."""
    essay = re.sub(r'\<i +', '<i>', essay)
    essay = re.sub(r'<x +', '<x>', essay)
    essay = re.sub(r'<o +', '<o>', essay)
    essay = re.sub(r'<o>(\s)?', '_', essay)
    essay = re.sub(r'(\s)?<\/o>(\s)?', '_', essay)
    essay = re.sub(r'<i>(\s)?', '^', essay)
    essay = re.sub(r'(\s)?<\/i>', '^', essay)
    essay = re.sub(r' >', ' ', essay)
    essay = re.sub(r'__', '_ _', essay)
    return essay

def un_head(txt):
    """Removes headers from text that include the file name, as well as students' names, grades, schools, etc."""
    cepa = re.compile(r'^C(EPA|EPa|epa)')      # detects CEPA headers
    tai = re.compile(r'^Taiseer')              # detects Taiseer headers
    tai2 = re.compile(r'^T-\d+')               # detects tai_em headers
    us = re.compile(r'^US-\d+')                # detects University of Sharjah headers
    header = re.compile(r'^Name')              # detects everything else
    if cepa.search(txt):
        txt = re.sub(r'CEPA.*?\d{2,}( ?-)?', '', txt, flags=re.I)
    elif tai.search(txt):
        txt = re.sub(r'Taiseer( -)? \d+', '', txt)
    elif tai2.search(txt):
        txt = re.sub(r'T-\d+', '', txt)
    elif us.search(txt):
        txt = re.sub(r'US-\d+.*?\d+', '', txt)
    else:
        matchobj = header.search(txt)
        if matchobj:
            txt = re.sub(r'Name.*?\d+( - \d)?', '', txt)
    return txt

def normalize_essay(txt):
    txt = clean_text(txt)
    txt = retag(txt)
    txt = un_head(txt)
    txt = txt.strip()
    return txt

In [13]:
balc_df.Original_Text[137]

'\t\t\t\tTaiseer 85\n\n\n\nI need to xxx<i>someone</i> <x>to</x> for help me. It’s very ergient my father  <x>so</x> is died. After my finshed my school. I see my father in the bed, becuaus  It’s was <x>sm</x> a very good smoker. xxx And It was died. smoking is a very xxx harmful hapits. In advertisements people see a man xxx who <x>happy</x>. teneeger saw the man and trust of thim. It’s some results for smoking. But do you now <x>a</x> the least dangerous result’s. of cours you now. It is xxx vary hot in your bady. or smoking couss to the most dangerous result’s. It’s cancer. Did you naw cancer it’s died desises. many of people like cigarette but in they life, he don’t <x>happy</x>. I belive smoking it’s very bad xxx e hapet. But some people  <x>dont</x>don’t agreey with me, sey why? Becouse xxx xxx he <x>say</x> sey I xxx xxx enjoy after I take only <x>oen</x> one cigarette. In my openin I coll him xxx a creezy people. How you smok and you must be adied. Finally It’s a slow suiced!  

In [14]:
un_head(clean_text(balc_df.Original_Text[137]))

' I need to xxx<i>someone</i> <x>to</x> for help me. It’s very ergient my father <x>so</x> is died. After my finshed my school. I see my father in the bed, becuaus It’s was <x>sm</x> a very good smoker. xxx And It was died. smoking is a very xxx harmful hapits. In advertisements people see a man xxx who <x>happy</x>. teneeger saw the man and trust of thim. It’s some results for smoking. But do you now <x>a</x> the least dangerous result’s. of cours you now. It is xxx vary hot in your bady. or smoking couss to the most dangerous result’s. It’s cancer. Did you naw cancer it’s died desises. many of people like cigarette but in they life, he don’t <x>happy</x>. I belive smoking it’s very bad xxx e hapet. But some people <x>dont</x>don’t agreey with me, sey why? Becouse xxx xxx he <x>say</x> sey I xxx xxx enjoy after I take only <x>oen</x> one cigarette. In my openin I coll him xxx a creezy people. How you smok and you must be adied. Finally It’s a slow suiced!'

In [15]:
balc_df[balc_df.Filename.str.contains('US')]

Unnamed: 0,Filename,Original_Text
51,US-10,US-10\n\n\n\n\nThe third differences between t...
81,US-11,US-11\n\n\n\n\n\t\nName of University\nUniver...
168,US-13,US-13\n\n\n\n\t\t\t\t\nName of University\nUn...
208,US-12,"US-12\n\n\n\n\n\nOn the other hand, males are ..."
284,US-16,US-16\n\n\n\nName of University\nUniversity o...
328,US-17,US-17\n\n\n\n\n\n\nAs to my parents mother t...
401,US-15,US-15\n\n\t\t\t\t\nName of University\nUniver...
404,US-29,US-29\n\n\n\n\t\t\t\t\nName of University\nUn...
432,US-28,US-28 \n\n\n\n\t\t\t\t\nName of Univers...
433,US-14,US-14\n\n\n\nAnd when she become at the last t...


In [16]:
balc_df.Original_Text[51]

'US-10\n\n\n\n\nThe third differences between teacher (A) and  (B) is the Final exam.  The teacher (A) always put very difficult exams and her exams always need a lot of time.  I remmeber that day when I saw her final exam.  It was a very diffecult  and I took a lot of time to anwser it.  But the teacher (B) always put a very easy Exams and I could finish from it in a short time.  I remmeber when the teacher (B) said : “I hate to put difficult  exams, because I know ↑that you  need makes to pass.\n\nIn short, there was a lot of differences between them.  They have different ways in teaching style, but I was like the both ways.  I was know these ways may be difficult, but it will help me to get high marks.\n\n'

In [17]:
un_head(clean_text(balc_df.Original_Text[51]))

' The third differences between teacher (A) and (B) is the Final exam. The teacher (A) always put very difficult exams and her exams always need a lot of time. I remmeber that day when I saw her final exam. It was a very diffecult and I took a lot of time to anwser it. But the teacher (B) always put a very easy Exams and I could finish from it in a short time. I remmeber when the teacher (B) said : “I hate to put difficult exams, because I know ↑that you need makes to pass. In short, there was a lot of differences between them. They have different ways in teaching style, but I was like the both ways. I was know these ways may be difficult, but it will help me to get high marks.'
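The output above still contains `↑that`, since the arrow marker is not touched by `retag`. One possible way to bring it in line with the `^` convention is sketched below. This is only an illustration: `mark_arrow_insertions` is a hypothetical helper, and it assumes the arrow marks just the single word that follows it, which I have not verified against the corpus documentation.

```python
import re


def mark_arrow_insertions(text):
    """Rewrite '↑word' as '^word^'.

    Assumption: the arrow marks only the single word that follows it. This is
    a hypothetical helper, not part of the notebook's retag().
    """
    return re.sub(r"↑\s*(\w+)", r"^\1^", text)
```

If the arrow turns out to mark longer insertions, this would under-tag them, so it is worth checking a few essays by hand first.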

In [18]:
balc_df.Original_Text[284]

'US-16\n\n\n\nName of  University\nUniversity of Sharjah\nName of  Student\nEman Al Hajri\nStudent  ID\n20320719\n\n\nEvery country has its own traditional custom costume  that  it is proud of.  To me is a part of not all traiditonal costumes are  people  have different oppiniouns  opiniouns to wards these kinds of costoms  customs.   To me, I don’t like most of my traditional customs as they have  are not comfortable to me and I can’t  xxx  have my freedom with it.   Even though my opinion is ve  negative towards  traditional  chlo  clothes still my paren   parents think that I’m wrong and they are proud of it..\n\nThe customs  clothes  like traditional  over in the UAE  have lots to do with gold, loose with sparkling things on and its un comnfortable because you can’t have your freedom with it, as xx your not free to have sit the way you like or to play with this kind of expensive clothes without getting it dirty as a child.  I remember ones when I was 8 years old in Eid, I used to p

In [19]:
un_head(clean_text(balc_df.Original_Text[284]))

' Every country has its own traditional custom costume that it is proud of. To me is a part of not all traiditonal costumes are people have different oppiniouns opiniouns to wards these kinds of costoms customs. To me, I don’t like most of my traditional customs as they have are not comfortable to me and I can’t xxx have my freedom with it. Even though my opinion is ve negative towards traditional chlo clothes still my paren parents think that I’m wrong and they are proud of it.. The customs clothes like traditional over in the UAE have lots to do with gold, loose with sparkling things on and its un comnfortable because you can’t have your freedom with it, as xx your not free to have sit the way you like or to play with this kind of expensive clothes without getting it dirty as a child. I remember ones when I was 8 years old in Eid, I used to play with my cousins cusins who in with sand mud and our clothes were so dirty and looked horrible, my mum started shouting and scolding scolded 

Let's make sure these UDFs work before applying them to the DataFrame.

In [20]:
balc_df.Original_Text[5]

'\t\t\t\tCEPA 1 200606381\n\n\n\nThe most beautiful place you hnow  he see because place very cood on bag on the people and Whoes there\n'

In [21]:
clean_text(balc_df.Original_Text[5])

'CEPA 1 200606381 The most beautiful place you hnow he see because place very cood on bag on the people and Whoes there'

In [22]:
balc_df.Original_Text[1510]

'Name\nAbdullah  Lee\nSchool\nAl   Khaleej\nGrade\nGrade  11\n\n\n\nMany people lost  <i> their </i>  <x> re  pe  the </x> father  and mother because of many <o> r</o> easons and become orphan.  They  need  many  helps  to  pass  their  problems  <x> such  asit </x>.   They lost <o> t</o>hier  father  and mother for many reasons s<x> h</x>uch as waves  and exedants. <x To fell good  > </x> Our  relegion  told  us  to  tereat  the orphans kindly t<o> o </o>  win the paradise.  To do that we must be generous to give them w<o> h</o> at they need.   <o>W </o>e must  build shelters to save them <x> from </x>.  We  should forgive them when they do mistakes.  If we want to be <x> the most  </x> in a good situation we should learn the orphan people b<o> y </o> the schools.  We should improve their skills by let them to play<x> and the imp  </x> and have their time in fun.  Finally we must <x> see save them fromen </x>.   look after them becase they will be in the futur the men who will save ou

In [23]:
balc_df.Original_Text[94]

'\t\t\nName\nEbeeid    Mubarak\nSchool\nAl   Khaleej\nGrade\nGrade  12\n\n\n\nThe  <x> l </x> mother<o> s </o>  are all of life.  We must h<o>a </o> ve a mother in all home.  I have a mother in my home.  My mothe<o> r </o> <x> l </x> is  beutiful woman be<o> o</o> cuse <x> she </x> her  bite  me  to  make  a  good  thinks.  I  m<o> u</o> st  now not forg<o> g</o> eten  a   poit <x> w </x>    he  was  said  a  mother is  <x> la </x> same  a  school  w<o> h</o>en  you  make  you  make  avery  good   people.\n\n\n'

Checking `un_head`

In [24]:
un_head(clean_text(balc_df.Original_Text[5]))

' The most beautiful place you hnow he see because place very cood on bag on the people and Whoes there'

In [25]:
un_head(clean_text(balc_df.Original_Text[1510]))

' Many people lost <i> their </i> <x> re pe the </x> father and mother because of many <o> r</o> easons and become orphan. They need many helps to pass their problems <x> such asit </x>. They lost <o> t</o>hier father and mother for many reasons s<x> h</x>uch as waves and exedants. <x To fell good > </x> Our relegion told us to tereat the orphans kindly t<o> o </o> win the paradise. To do that we must be generous to give them w<o> h</o> at they need. <o>W </o>e must build shelters to save them <x> from </x>. We should forgive them when they do mistakes. If we want to be <x> the most </x> in a good situation we should learn the orphan people b<o> y </o> the schools. We should improve their skills by let them to play<x> and the imp </x> and have their time in fun. Finally we must <x> see save them fromen </x>. look after them becase they will be in the futur the men who will save our familiees and society.'

In [26]:
un_head(clean_text(balc_df.Original_Text[94]))

' The <x> l </x> mother<o> s </o> are all of life. We must h<o>a </o> ve a mother in all home. I have a mother in my home. My mothe<o> r </o> <x> l </x> is beutiful woman be<o> o</o> cuse <x> she </x> her bite me to make a good thinks. I m<o> u</o> st now not forg<o> g</o> eten a poit <x> w </x> he was said a mother is <x> la </x> same a school w<o> h</o>en you make you make avery good people.'
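Checking three essays by eye will not catch every header variant. As a rough sketch for spot-checking the whole corpus, the hypothetical helper `find_residual_headers` below lists essays whose text still begins like a header after header removal. The list of header words in `HEADER_START` is an assumption drawn from the files inspected so far.

```python
import re

# Words that start a header in the files inspected so far (assumed list).
HEADER_START = re.compile(r"^\s*(CEPA|Taiseer|T-\d+|US-\d+|Name\b)", re.IGNORECASE)


def find_residual_headers(essays):
    """Return names of essays whose text still begins like a header.

    `essays` maps a file name to text that has already been through header
    removal. Hypothetical helper for spot-checking, not part of the notebook.
    """
    return [name for name, text in essays.items() if HEADER_START.search(text)]
```

An essay that legitimately begins with one of these words would be flagged too, so the result is a list to review rather than a list of confirmed failures.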

Checking `normalize_essay`

In [27]:
normalize_essay(balc_df.Original_Text[5])

'The most beautiful place you hnow he see because place very cood on bag on the people and Whoes there'

In [28]:
normalize_essay(balc_df.Original_Text[4])

'you go in the oman just had the perfect holiday Describe it where in the oman holiday. The you went there with in the Father.'

In [29]:
normalize_essay(balc_df.Original_Text[94])

'The <x> l </x> mother_s_are all of life. We must h_a_ve a mother in all home. I have a mother in my home. My mothe_r_<x> l </x> is beutiful woman be_o_cuse <x> she </x> her bite me to make a good thinks. I m_u_st now not forg_g_eten a poit <x> w </x> he was said a mother is <x> la </x> same a school w_h_en you make you make avery good people.'

In [30]:
normalize_essay(balc_df.Original_Text[1510])

'Many people lost ^their^ <x> re pe the </x> father and mother because of many _r_easons and become orphan. They need many helps to pass their problems <x> such asit </x>. They lost _t_hier father and mother for many reasons s<x> h</x>uch as waves and exedants. <x>To fell good  </x> Our relegion told us to tereat the orphans kindly t_o_win the paradise. To do that we must be generous to give them w_h_at they need. _W_e must build shelters to save them <x> from </x>. We should forgive them when they do mistakes. If we want to be <x> the most </x> in a good situation we should learn the orphan people b_y_the schools. We should improve their skills by let them to play<x> and the imp </x> and have their time in fun. Finally we must <x> see save them fromen </x>. look after them becase they will be in the futur the men who will save our familiees and society.'

Words that are connected by underscores will have to be manually checked and cleaned.
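To make that manual check more manageable, the candidate words can be pulled out of each essay first. This is a minimal sketch using a simple heuristic of my own (a letter, an underscore, then another letter inside one token); `underscore_candidates` is a hypothetical helper, and it cannot tell a correct word like `_W_e` from a glued pair like `t_o_win`.

```python
import re

# A letter on both sides of an underscore, e.g. 't_o_win' or 'bih_and'.
# Assumption: these are the spots where retag() may have glued two words.
JOINED = re.compile(r"[A-Za-z]_[A-Za-z]")


def underscore_candidates(text):
    """Return whitespace-separated tokens that contain letter_letter."""
    return [token for token in text.split() if JOINED.search(token)]
```

Applied to a normalized essay, this gives a short list of tokens to look at instead of the full text.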

In [31]:
balc_df.Original_Text[1510]

'Name\nAbdullah  Lee\nSchool\nAl   Khaleej\nGrade\nGrade  11\n\n\n\nMany people lost  <i> their </i>  <x> re  pe  the </x> father  and mother because of many <o> r</o> easons and become orphan.  They  need  many  helps  to  pass  their  problems  <x> such  asit </x>.   They lost <o> t</o>hier  father  and mother for many reasons s<x> h</x>uch as waves  and exedants. <x To fell good  > </x> Our  relegion  told  us  to  tereat  the orphans kindly t<o> o </o>  win the paradise.  To do that we must be generous to give them w<o> h</o> at they need.   <o>W </o>e must  build shelters to save them <x> from </x>.  We  should forgive them when they do mistakes.  If we want to be <x> the most  </x> in a good situation we should learn the orphan people b<o> y </o> the schools.  We should improve their skills by let them to play<x> and the imp  </x> and have their time in fun.  Finally we must <x> see save them fromen </x>.   look after them becase they will be in the futur the men who will save ou

In [32]:
normalize_essay(balc_df.Original_Text[1510])

'Many people lost ^their^ <x> re pe the </x> father and mother because of many _r_easons and become orphan. They need many helps to pass their problems <x> such asit </x>. They lost _t_hier father and mother for many reasons s<x> h</x>uch as waves and exedants. <x>To fell good  </x> Our relegion told us to tereat the orphans kindly t_o_win the paradise. To do that we must be generous to give them w_h_at they need. _W_e must build shelters to save them <x> from </x>. We should forgive them when they do mistakes. If we want to be <x> the most </x> in a good situation we should learn the orphan people b_y_the schools. We should improve their skills by let them to play<x> and the imp </x> and have their time in fun. Finally we must <x> see save them fromen </x>. look after them becase they will be in the futur the men who will save our familiees and society.'

In [33]:
normalize_essay(balc_df.Original_Text[1])

'My worst holiday Last year I have just had the worst holiday ever. It was too board . My Lonely sister had got married . She was making me Laugh and play with me. But now I m alone with my male brothers. I cant stand them they are too noisy . In the Spring holiday my brothers and I travelled to Aust_r_alia with our parents. It was really great and nice place, but I didn t enjoyed it because no girls were with me . I asked my cousin to come with us but she refused that because she joined a sports club .There was alot of st_r_ange animals. My brother was taking photoes for the animals and I was too board. I was walking like a sick person. I hated my self and prayed to go back to our country to see my sister. I missed her so much. After few day we came back home then I became the happiest person in the world.'

In [34]:
balc_df['Normalized_Essay'] = balc_df.Original_Text.apply(normalize_essay)

In [35]:
balc_df.head()

Unnamed: 0,Filename,Original_Text,Normalized_Essay
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...
2,CEPA-5-200600487,\t\t\t\tCEPA 5 200600487\n\n\n\n\nEvery body i...,Every body in this life have a favourite posse...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...
4,CEPA-1-200611825,\t\t\t\tCEPA 1 200611825\n\n\n\nyou go in the ...,you go in the oman just had the perfect holida...


In [36]:
balc_df[balc_df.Normalized_Essay.str.contains(r'_[A-Za-z]+')]

Unnamed: 0,Filename,Original_Text,Normalized_Essay
0,CEPA-3-200607296,\t\t\t\tCEPA 3 200607296\n\n\n\nNow I tell you...,Now I tell you why my worst holiday ever in th...
1,CEPA-4-200607457,\t\t\t\tCEPA 4 200607457\n\n\n\n ...,My worst holiday Last year I have just had the...
3,CEPA-4-200608016,\t\t\t\tCEPA 4 200608016\n\n\n\nEvery body hav...,Every body have a lot ofpossessions in this li...
11,CEPA-3-200611351,\t\t\t\tCEPA 3 200611351\n\n\n\nAl_ain mall is...,Al_ain mall is the most beautiful place. First...
19,CEPA-1-200603206,\n\t\t\t\tCEPA 1 200603206\n\n\n\n Topi...,Topic _A_I will writing this a Prerf in far_me...
20,CEPA-3-200611379,\t\t\t\tCEPA 3 200611379\n\n\n\n\nI spend my w...,I spend my weekend in Dubai. I go with my fami...
23,AK9e,\t\t\t\nName\nSaid Afoos\nSchool\nAl Khalee...,you should drink becase it is good for your st...
24,FZ10e,\t\nName\nAmena Mal Allah \nSchool\nFatima A...,Mothers are symblols of love’ _m_ercy ahd hope...
25,CEPA-3-200611190,\t\t\t\tCEPA 3 200611190\n\n\n\nThere are many...,There are many things make the worst holiday f...
26,CEPA-2-200606353,\t\t\t\tCEPA 2 200606353\n\n\n\nI m Love the U...,I m Love the UAE the UAE I am very very Love t...


This returns 481 rows. Some will probably be okay, and some will probably need to be separated. Let's take a closer look.
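With this many essays to review, it may help to start with the ones that have the most tagged spans. The sketch below assumes a hypothetical helper `rank_by_tag_count` that works on a plain `{name: text}` dictionary and reuses the same `_[A-Za-z]+` pattern as the cell above. It counts pattern hits, not emphasized spans, so a single `_b_big` contributes two.

```python
import re


def rank_by_tag_count(essays, pattern=r"_[A-Za-z]+"):
    """Return (name, hits) pairs for essays with at least one hit, most hits first.

    Hypothetical helper, not part of the notebook. `essays` maps a file name
    to its normalized text.
    """
    counts = {name: len(re.findall(pattern, text)) for name, text in essays.items()}
    tagged = {name: hits for name, hits in counts.items() if hits > 0}
    return sorted(tagged.items(), key=lambda item: item[1], reverse=True)
```

The length of the returned list should equal the number of rows that `str.contains` selects with the same pattern.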

In [37]:
balc_df.Normalized_Essay[0]

"Now I tell you why my worst holiday ever in the last summer I wented withe my family in the India and this story I will tell you what happened for the short story when I go the first the weathe is very very rain now bady for the children play out when I go in the hotel all may family was have the headk in there and all was sleep put for my I can;t sleep because I not love the area in the morning all the my family weak up and going i<x>n the reast</x>rant but is the strees,children and the food is very dearty earia I not liked becuse is not nice area so darty and people there is not nice all there have not happy only sawted for my sister whem she take some flower for the mam and also when you there you see what I means maby some body liked go ther but for I didn't liked And for what happand I tell for my family I wanted to go in my country I didn't _l_ike her fainally I wanted tell for hem why your earea like thes you shold clean and your people so nerves..."

In [38]:
balc_df.Original_Text[1852]

'\t\t\t\tCEPA 4 200608027\n\n\n\n\t\t\t\tThe perfect holiday\n\n\nSummer holiday is the best holidays in a year. There are many things you can enjoy in summer holiday like travel, make a long time for a fun, reading many books to save information. In the last summer I enjoy traveled with my family to Australia, Malaysia and Thailand.\nIn Australia I enjoy traveled to Sydeny and Gold Coust. Sydeny is th<o>b</o> big city and the best city in the Australia. There are many place we enjoyed like zoo, musum and shopping centers. However, Gold Coust is the fun city that we enjoyed many big parks like sea worlds and Dream worlds. In addition the weathe was very cold and windy that we shold were a jus<o>t</o> and take umbrala. Also, we visited Maloysia. Malaysia is a very wonderful country. They are many trees, rivers and green mountain. There are a lot of rainning and the warm weather. However there are many place to fun and shop. And I shold taken an umbrella all the time. The third place us 

In [39]:
balc_df.Normalized_Essay[1852]

'The perfect holiday Summer holiday is the best holidays in a year. There are many things you can enjoy in summer holiday like travel, make a long time for a fun, reading many books to save information. In the last summer I enjoy traveled with my family to Australia, Malaysia and Thailand. In Australia I enjoy traveled to Sydeny and Gold Coust. Sydeny is th_b_big city and the best city in the Australia. There are many place we enjoyed like zoo, musum and shopping centers. However, Gold Coust is the fun city that we enjoyed many big parks like sea worlds and Dream worlds. In addition the weathe was very cold and windy that we shold were a jus_t_and take umbrala. Also, we visited Maloysia. Malaysia is a very wonderful country. They are many trees, rivers and green mountain. There are a lot of rainning and the warm weather. However there are many place to fun and shop. And I shold taken an umbrella all the time. The third place us visited are bangkok the city of Thailand that many Arabs t

In [40]:
print(balc_df.Original_Text[1851])
balc_df.Normalized_Essay[1851]

				CEPA 2 200606606



        Topic A 
        

In the summer I went with my f<x>ri</x>nds to the Mounten .
I see there a Fantactec place. I see a very big Mountens . Me and t<x>w</x>o with my frinedd clame The one oFthis Mounten and thae some buteful <o> imaeg bih</o>and the Mountens there are sea. My f<o>r</o>in<o>ed</o> go to swimme in the sea and divend there . he want to chac some Fish but he dont can . after that go to are stor and to ead a Food after fines .th<o>ae</o> go to mack some descaver. after that we all back to howe.



'Topic A In the summer I went with my f<x>ri</x>nds to the Mounten . I see there a Fantactec place. I see a very big Mountens . Me and t<x>w</x>o with my frinedd clame The one oFthis Mounten and thae some buteful _imaeg bih_and the Mountens there are sea. My f_r_in_ed_go to swimme in the sea and divend there . he want to chac some Fish but he dont can . after that go to are stor and to ead a Food after fines .th_ae_go to mack some descaver. after that we all back to howe.'

In [41]:
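def remove_tags(text):
    """Strip correction markup from a Normalized_Essay string.

    Anything inside <x>...</x> (text the student crossed out) is deleted
    together with the tags. The '_' and '^' markers used in Normalized_Essay
    are removed, but the text they mark is kept.
    """
    # Non-greedy match so that each <x>...</x> pair is removed separately
    text = re.sub(r'<x>.*?</x>', '', text)
    # Drop the marker characters themselves
    text = text.replace('_', '').replace('^', '')
    return text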
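
A quick sanity check of `remove_tags` on a short string may help here. The sample below is made up from fragments of essay 1851 above; it is only a sketch of the expected behaviour, not part of the original pipeline. Note that dropping `_` without adding a space fuses neighbouring words, which is the same effect visible in tokens such as `zoomthere` in the sampled rows.

```python
import re

def remove_tags(text):
    """Same logic as the notebook's remove_tags, repeated so this cell runs on its own."""
    text = re.sub(r'<x>.*?</x>', '', text)
    return text.replace('_', '').replace('^', '')

# Made-up sample built from fragments of essay 1851
sample = 'I went with my f<x>ri</x>nds . My f_r_in_ed_go to swimme'
cleaned = remove_tags(sample)
print(cleaned)  # I went with my fnds . My frinedgo to swimme
```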

In [42]:
balc_df['Revised_Essay'] = balc_df.Normalized_Essay.apply(remove_tags)
balc_df['tokens'] = balc_df.Revised_Essay.apply(nltk.word_tokenize)
balc_df['token_count'] = balc_df.tokens.map(len)
balc_df.sample(20)

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
60,CEPA-1-200600121,\t\t\t\tCEPA 1 200600121\n\n\n\nI’m going and ...,I’m going and _zoom_there _with_fimi_l_you see...,I’m going and zoomthere withfimilyou see cat a...,"[I, ’, m, going, and, zoomthere, withfimilyou,...",33
1099,CEPA-5-200604298,\t\t\t\tCEPA 5 200604298\n\n\n\nMy best Film i...,"My best Film is""Dhoom"". It is an Indian film. ...","My best Film is""Dhoom"". It is an Indian film. ...","[My, best, Film, is, '', Dhoom, '', ., It, is,...",247
1439,CEPA-6-200624886,\t\t\t\tCEPA 6 200624886\n\n\n\n Last week I h...,Last week I had the best holiday ever. My frie...,Last week I had the best holiday ever. My frie...,"[Last, week, I, had, the, best, holiday, ever,...",239
329,CEPA-5-200602678,\t\t\t\tCEPA 5 200602678\n\n\n\nMy worst holid...,My worst holiday was in 1999 when we went to O...,My worst holiday was in 1999 when we went to O...,"[My, worst, holiday, was, in, 1999, when, we, ...",180
1057,CEPA-1-200601334,\t\t\t\tCEPA 1 200601334\n\n\n\nThe pest I hav...,The pest I have just had the worst. I went in ...,The pest I have just had the worst. I went in ...,"[The, pest, I, have, just, had, the, worst, .,...",33
842,CEPA-2-200604052,\t\t\t\tCEPA 2 200604052\n\n\n\n<o>My</o> Hoil...,_My_HoilyDay in saummr is so fantaistk is <x>I...,MyHoilyDay in saummr is so fantaistk is go wi...,"[MyHoilyDay, in, saummr, is, so, fantaistk, is...",46
559,CEPA-4-200607958,\t\t\t\tCEPA 4 200607958\n\n\n\nWhen I imagine...,When I imagine we have the perfect holiday ^we...,When I imagine we have the perfect holiday we ...,"[When, I, imagine, we, have, the, perfect, hol...",222
1696,CEPA-4-200607854,\t\t\t\tCEPA 4 200607854\n\n\n\nIn the last ye...,In the last year I have jast had the worst hol...,In the last year I have jast had the worst hol...,"[In, the, last, year, I, have, jast, had, the,...",143
1122,CEPA-5-200601134,\t\t\t\tCEPA 5 200601134\n\n\n\nLast summer my...,Last summer my family and I went on a holiday ...,Last summer my family and I went on a holiday ...,"[Last, summer, my, family, and, I, went, on, a...",213
1664,CEPA-6-200620403,\t\t\t\tCEPA 6 200620403\n\n\n\nSummer had end...,"Summer had end, and it was time to go back hom...","Summer had end, and it was time to go back hom...","[Summer, had, end, ,, and, it, was, time, to, ...",231
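
The first sampled row shows `I’m` split into the tokens `I`, `’`, `m`, so the curly apostrophe is being treated as a separate symbol. One possible fix, sketched below, is to map curly quotes to their straight equivalents before tokenizing. `normalize_quotes` and `QUOTE_MAP` are hypothetical names that do not appear in the notebook, and whether this normalization suits the later analysis is an assumption.

```python
# Hypothetical helper: map curly quotes to straight ones before tokenizing
QUOTE_MAP = str.maketrans({
    '\u2019': "'",   # right single quote, used as an apostrophe
    '\u2018': "'",   # left single quote
    '\u201c': '"',   # left double quote
    '\u201d': '"',   # right double quote
})

def normalize_quotes(text):
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(QUOTE_MAP)

print(normalize_quotes('I’m going and it’s fun'))  # I'm going and it's fun
```

If adopted, it could be applied to `Revised_Essay` before the `nltk.word_tokenize` step, for example with `balc_df.Revised_Essay.apply(normalize_quotes)`.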


In [43]:
balc_df.iloc[[544], :]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
544,AS9e,Name\nNoof Ebeed\nGrade\n11\n\n\n\n\nI like t...,I like to be <x> ll </x> teacher and it is my ...,I like to be teacher and it is my future job....,"[I, like, to, be, teacher, and, it, is, my, fu...",166


In [44]:
balc_df.Original_Text[544]

'Name\nNoof  Ebeed\nGrade\n11\n\n\n\n\nI like to be <x> ll </x> teacher  and  it  is  my  future  job.  I love to be teacher of math in the future bec<i> ause </i> I want to sarvel my country.  I will still study for 5 <o>&</o>  \nyears becuse  <x> my  ambi  </x> I dre<o> a</o>m  to be the best teche of math in the world  <o> an </o>and  I  will study  in  U.A.E.  University.\n\nE<o> v</o> ery  thing  in  the  world  have  advantages  and  dis advant<o> a</o>  \nges  <i> like </i> the  job  of <x> of  the  tt  </x> teacher  have  a  lot  of  advantages  like  I  want  to  help  my  count<o> <r/o>y,  I  want  all  my  studint  to  be  shouler  in  the  future  and  best  people  an<o> d </o> many  advantages.  Dis  advantages  like  is  x  children  <x> how </x> whoes  make  problems  <x> and </x> and  come  l<o> ate</o>r  to  <x> shool </x> school.  I  think  this  job  is  the  best  choice  bec<o>a </o>use <x> it  is</x>  th<o> e </o> math  is  the  mother  of  all  world,  and,  job

In [45]:
balc_df.iloc[[243], :]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
243,Taiseer-53,\t\t\t\tTaiseer 53\n\n\n\n<x>When you help som...,<x>When you help someone to quit smoking you a...,s much has been said about the dangers XXX of...,"[s, much, has, been, said, about, the, dangers...",172


In [46]:
balc_df.Original_Text[243]

'\t\t\t\tTaiseer 53\n\n\n\n<x>When you help someone to quit smoking you almost certianly</x> s much has been said about the dangers XXX of smoking to our health. Bad breath stained teeth loss of taste and mouth sores are the least dangerous reaults. More serious results are XXX gum infection and damage to bones and tissues underteeth However the most terrible results are fatal diseases such as cancer and heart diseases Inspit of their awareness of the dangers of smoking smokers donot quit for many reasons firstly they are not always of the dangers to their healthy unfortunately they often are safe from dangerous diseases. Scondly it is easier to start smoking than it is to quit. It taskes a long time to stop this bad habit thirdly smokers becomesalves to the habit and are too weak to quit. Fourthly medicines that can help smokers to quit are expensive. Nevertheless if asmoker really wants to give up he can do it.\nTeengers are easy victims of the cigartising industry, They will be the 

In [47]:
balc_df[balc_df.tokens.apply(lambda x: 'cepa' in [y.lower() for y in x])]
# Note: curly apostrophes (’) may be problematic later
# CEPA-1-200605193: heading was not removed from the original text because of a '!'
# CEPA-3-200611685: heading was not removed from the original text because of the spelling 'CEPa'

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
6,CEPA-5-200608959,\t\t\t\tCEPA 5 200608959\n\n\n\nI have just ha...,I have just had the perfect and the best holid...,I have just had the perfect and the best holid...,"[I, have, just, had, the, perfect, and, the, b...",277
147,CEPA-2-200612443,\t\t\t\tCEPA 2 200612443\n\n\n\nI’m writing To...,I’m writing Topic B to Describe the best film ...,I’m writing Topic B to Describe the best film ...,"[I, ’, m, writing, Topic, B, to, Describe, the...",325
273,CEPA-3-200611155,\t\t\t\tCEPA 3 200611155\n\n\n\nIn the last su...,"In the last summer holiday, I hade the worest ...","In the last summer holiday, I hade the worest ...","[In, the, last, summer, holiday, ,, I, hade, t...",275
906,CEPA-2-200612584,\t\t\t\tCEPA 2 200612584\n\n\n\nThese weekend ...,These weekend the worst. Because I stayd at ho...,These weekend the worst. Because I stayd at ho...,"[These, weekend, the, worst, ., Because, I, st...",54
1370,CEPA-3-200611303,\t\t\t\tCEPA 3 200611303\n\n\n\nThere are many...,There are many qualites the worst weekend bad....,There are many qualites the worst weekend bad....,"[There, are, many, qualites, the, worst, weeke...",205
1557,CEPA-1-200604402,\t\t\t\tCEPA 1 200604402\n\n\n\n<x>me</x> <o>m...,<x>me</x> _my you went ar_wous. went <x>my</x>...,my you went arwous. went and fraund Yor ne...,"[my, you, went, arwous, ., went, and, fraund, ...",43
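
The two headings that were missed (a stray `!` and the spelling `CEPa`) suggest a more tolerant header pattern. The sketch below is not the notebook's `un_head`; `strip_cepa_header` and `HEADER` are hypothetical names, and the exact position of the `!` in the real file is an assumption, so the pattern simply allows non-word characters between the parts of the heading.

```python
import re

# Hypothetical pattern: 'CEPA' in any case, then two numbers, allowing stray
# punctuation between the parts and an optional '!' after the ID
HEADER = re.compile(r'^\s*cepa\W*\d+\W*\d+[ \t!]*\s*', flags=re.IGNORECASE)

def strip_cepa_header(text):
    """Remove one leading 'CEPA <level> <id>' heading if present."""
    return HEADER.sub('', text, count=1)

print(strip_cepa_header('\t\t\t\tCEPa 3 200611685\n\n\n\nIn the last'))  # In the last
```

The pattern is anchored at the start of the string and stops at the first non-whitespace character after the ID, so an essay that opens with a tag such as `<x>` is left intact.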


In [48]:
un_head(clean_text(balc_df.Original_Text[273]))

' In the last summer holiday, I hade the worest holidy ever in my live, because many thing its had not fun. Firstly, the place want to respect my holidy, it’s not very nice in my opinion, but my father and some brother, they need to goto this place, also my some relative liked went to same place, only my and my mother did’nt like went, Now I said the place all my family them went its Salala in Oman. Secoundly, why me did’nt like we to slala becaus we went <x>By</x> <i>by</i> car put me do like set in the ar in long time because I feel asto mich, and same time hideich, d’so the driver the car his my father and he drive ver slowly, and I ome of timemy uncil in onther car he bushe m<x>y</x> father car xxx so we spent tow oure to correct the car in the desart at night. Thirtly in the Salala its had many animals and nuts, and me leated all of animal just only only xxxone of like it’s the bird. Finaly, I like in thise hoildy summer the sea in the Salal because it very very nice, and me alrea

In [49]:
balc_df.Original_Text[1799]

'Name\nNoor Waleed\nSchool\nFatima Al Zahna\nGrade\n10  -  4\n\n\n\n\n\nThere are different kind of food, so you must choose the healthy food to have a healthy body.   There are some simple advices to have a healthy body.  First <i> of all  </i> you must eat <i> food with  </i> less  fats.  Never miss your breakfast.  Drink a lot of water.  If you want to go shopping you must eat some thing before shopping, because if you are hungry there you will <o> b</o>uy meals from the restau<x> l</x>rants.  Don’t drink water while you are eating.  Choose healthy choises like salad if you go to a restaurant with your friends.  Don’t starve yourself.  Eat a light snacks when you are hungry.  Never eat in front of T.V.  Do some exercise.  Always buy fresh healthy food and put it at home.  Don’t eat alot of sweets and sugar.  Finally, you must eat From different nutrients to have a beautiful and healthy body.\n\n'

In [50]:
un_head(clean_text(balc_df.Original_Text[1799]))

' There are different kind of food, so you must choose the healthy food to have a healthy body. There are some simple advices to have a healthy body. First <i> of all </i> you must eat <i> food with </i> less fats. Never miss your breakfast. Drink a lot of water. If you want to go shopping you must eat some thing before shopping, because if you are hungry there you will <o> b</o>uy meals from the restau<x> l</x>rants. Don’t drink water while you are eating. Choose healthy choises like salad if you go to a restaurant with your friends. Don’t starve yourself. Eat a light snacks when you are hungry. Never eat in front of T.V. Do some exercise. Always buy fresh healthy food and put it at home. Don’t eat alot of sweets and sugar. Finally, you must eat From different nutrients to have a beautiful and healthy body.'

In [51]:
balc_df.Original_Text[1739]

'Name\nHosam  Zakhbai\nGrade\n12\n\n\n\nThe  modern  lif<o>e </o> is very dangerour  <x> about t  </x> <o> w </o>  hith  the  past  in  the past we don’t have diseases x  and  many  <x> t  </x> people  not  s<x>e </x> ick  and  they  are  not  mor  fat  <x> beac  </x> because  they  eat  fresh  food  and  fresh  meat  <x> and  eh  d   </x> and  they  <x> v </x> every day working  to  give  <o> m</o> any.  <x> an </x>  <x> t</x> They  not  sei<o>c </o>k  to  eat  fresh  food.\n\nbut  now  ad<o> a</o>y  ma<o>n </o>y  people  sick  <x>p </x><o> b</o>ecause  the  food  is  not  good  and  th<o>e </o>y  many  <x> be </x> people  get  <x> beging </x> \nbeing  overweight  <x> the  fi  because  they  </x> because  they  eat  very  much  food  <x> t </x> <o>d </o>nd  eating  fast  food  and  wuthing  T.V  very  much  and  eat  foo<o> d </o> <x> wh  whith  </x> whith  wathin  T.V  and  play<o> in</o>g   <x> o</x> vidio  games  <o> a</o>nd  the<o> y </o>  <o> x </o> not  doing  sport  games  and 

In [52]:
un_head(clean_text(balc_df.Original_Text[1739]))

' The modern lif<o>e </o> is very dangerour <x> about t </x> <o> w </o> hith the past in the past we don’t have diseases x and many <x> t </x> people not s<x>e </x> ick and they are not mor fat <x> beac </x> because they eat fresh food and fresh meat <x> and eh d </x> and they <x> v </x> every day working to give <o> m</o> any. <x> an </x> <x> t</x> They not sei<o>c </o>k to eat fresh food. but now ad<o> a</o>y ma<o>n </o>y people sick <x>p </x><o> b</o>ecause the food is not good and th<o>e </o>y many <x> be </x> people get <x> beging </x> being overweight <x> the fi because they </x> because they eat very much food <x> t </x> <o>d </o>nd eating fast food and wuthing T.V very much and eat foo<o> d </o> <x> wh whith </x> whith wathin T.V and play<o> in</o>g <x> o</x> vidio games <o> a</o>nd the<o> y </o> <o> x </o> not doing sport games and x va<o>n </o>ning to <x> loses </x> loss waight.'

In [53]:
balc_df.Original_Text[187]

'\t\t\t\tTaiseer    -   11\n\n\nsmoking is a very bad hapit.  smoking causes bad breath steined il  teeth, Loss of tast and mouth sores.  In spite of their awareness of the dangers of smoke, smokers do not quit for many reasons ; firest  they are not aware of the dangers of smoking, second it is esier to start then to quit  and finally the become slaves to the h habit and are too weak to quit.  It is dengres.  If you smoke your will children will smoker, too.  It is true these that body smoking burans ther are body and money weak to quil quit so why not stop smoking and star to xx xx living.  why not stop smoking and a  start z  zzzz  liaefystail.  If you stop smoking Your cheldran dnot ↑will  do  not  smoke  and  x  do not came  deises  decesis.  You  chould  should be not agean smoke.\n\n\n'

In [54]:
un_head(clean_text(balc_df.Original_Text[187]))

' smoking is a very bad hapit. smoking causes bad breath steined il teeth, Loss of tast and mouth sores. In spite of their awareness of the dangers of smoke, smokers do not quit for many reasons ; firest they are not aware of the dangers of smoking, second it is esier to start then to quit and finally the become slaves to the h habit and are too weak to quit. It is dengres. If you smoke your will children will smoker, too. It is true these that body smoking burans ther are body and money weak to quil quit so why not stop smoking and star to xx xx living. why not stop smoking and a start z zzzz liaefystail. If you stop smoking Your cheldran dnot ↑will do not smoke and x do not came deises decesis. You chould should be not agean smoke.'

In [55]:
balc_df.Original_Text[1482]

'Name\n Khlood  Mousa\nSchool\nFatima Al Zahna\nGrade\n10  -  4\n\n\n\n<x>m </x> My Mother sacrifice me.  I Love <x> ma </x> mather.  <x> had  </x> \nhe cooked feed.  I am stady principle<o> s </o>.   \n\n'

In [56]:
un_head(clean_text(balc_df.Original_Text[1482]))

' <x>m </x> My Mother sacrifice me. I Love <x> ma </x> mather. <x> had </x> he cooked feed. I am stady principle<o> s </o>.'

In [57]:
balc_df.Original_Text[544]

'Name\nNoof  Ebeed\nGrade\n11\n\n\n\n\nI like to be <x> ll </x> teacher  and  it  is  my  future  job.  I love to be teacher of math in the future bec<i> ause </i> I want to sarvel my country.  I will still study for 5 <o>&</o>  \nyears becuse  <x> my  ambi  </x> I dre<o> a</o>m  to be the best teche of math in the world  <o> an </o>and  I  will study  in  U.A.E.  University.\n\nE<o> v</o> ery  thing  in  the  world  have  advantages  and  dis advant<o> a</o>  \nges  <i> like </i> the  job  of <x> of  the  tt  </x> teacher  have  a  lot  of  advantages  like  I  want  to  help  my  count<o> <r/o>y,  I  want  all  my  studint  to  be  shouler  in  the  future  and  best  people  an<o> d </o> many  advantages.  Dis  advantages  like  is  x  children  <x> how </x> whoes  make  problems  <x> and </x> and  come  l<o> ate</o>r  to  <x> shool </x> school.  I  think  this  job  is  the  best  choice  bec<o>a </o>use <x> it  is</x>  th<o> e </o> math  is  the  mother  of  all  world,  and,  job

In [58]:
un_head(clean_text(balc_df.Original_Text[544]))

' I like to be <x> ll </x> teacher and it is my future job. I love to be teacher of math in the future bec<i> ause </i> I want to sarvel my country. I will still study for 5 <o>&</o> years becuse <x> my ambi </x> I dre<o> a</o>m to be the best teche of math in the world <o> an </o>and I will study in U.A.E. University. E<o> v</o> ery thing in the world have advantages and dis advant<o> a</o> ges <i> like </i> the job of <x> of the tt </x> teacher have a lot of advantages like I want to help my count<o> <r/o>y, I want all my studint to be shouler in the future and best people an<o> d </o> many advantages. Dis advantages like is x children <x> how </x> whoes make problems <x> and </x> and come l<o> ate</o>r to <x> shool </x> school. I think this job is the best choice bec<o>a </o>use <x> it is</x> th<o> e </o> math is the mother of all world, and, jobs an<o> d </o> everything in this time and now a days xx like <x> jo </x> jobs want th<o> ex </o> maths this is cause <x> want </x> w<o> h<

In [59]:
balc_df[balc_df.Normalized_Essay.str.contains(r'^ ?-')]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
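
The empty result means no normalized essay starts with a dash, optionally preceded by one space. To make the pattern concrete, here is a small sketch using plain `re` on made-up strings; for regex patterns, `str.contains` applies `re.search` semantics, so the same matches are expected.

```python
import re

LEADING_DASH = re.compile(r'^ ?-')

# Made-up strings modelled on the opening of essay 1763
samples = [
    '- Holiday is the most',            # dash at the very start
    ' - Holiday is the most',           # one leading space, then a dash
    'Holiday is the most - important',  # dash in the middle only
]
flags = [bool(LEADING_DASH.search(s)) for s in samples]
print(flags)  # [True, True, False]
```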


In [60]:
balc_df.Original_Text[1763]

'\t\t\t\tCEPA 4 200607711\n\n\n\n- Holiday is the most important time in our life , so we decided to spend it in a useful thing.\n\n I like very much summer holiday because we usealy trare in that holidy . In Last summer holiday .I think I spend all my time  in useful things . For example   I travel with my family to the (thiland)  it was a wonderful country and the weather was beautiful . I enjoy my time their . We went  to a aplace that name ( poct )  it was a wonderful place I ever see. I same forst with all that trees .  Then we went to the (pangkok) it the Capital  of this country .I think it is the economic center of the Thiland .After 3 weeks we returned back to my lovely country. After that I take a  computer courses  for two week to improve me skill. The summer in dubai is so  wonderful because  the summer surprises will strats.\nWe go to the modhich city to play and enjoy our time. \nI like shopping very much so  I spend all my time in malls like (  city center, Burjoman and 

In [61]:
balc_df.Normalized_Essay[1763]

'Holiday is the most important time in our life , so we decided to spend it in a useful thing. I like very much summer holiday because we usealy trare in that holidy . In Last summer holiday .I think I spend all my time in useful things . For example I travel with my family to the (thiland) it was a wonderful country and the weather was beautiful . I enjoy my time their . We went to a aplace that name ( poct ) it was a wonderful place I ever see. I same forst with all that trees . Then we went to the (pangkok) it the Capital of this country .I think it is the economic center of the Thiland .After 3 weeks we returned back to my lovely country. After that I take a computer courses for two week to improve me skill. The summer in dubai is so wonderful because the summer surprises will strats. We go to the modhich city to play and enjoy our time. I like shopping very much so I spend all my time in malls like ( city center, Burjoman and merkato ) . I think holiday last summer it was the perf

In [62]:
balc_df[balc_df.Normalized_Essay.str.contains('↑')]

Unnamed: 0,Filename,Original_Text,Normalized_Essay,Revised_Essay,tokens,token_count
51,US-10,US-10\n\n\n\n\nThe third differences between t...,The third differences between teacher (A) and ...,The third differences between teacher (A) and ...,"[The, third, differences, between, teacher, (,...",162
66,T-12,\t\t\t\tTaiseer - 12\n\n\nI am is Ali ↑i...,I am is Ali ↑in the stop smoking in the cigare...,I am is Ali ↑in the stop smoking in the cigare...,"[I, am, is, Ali, ↑in, the, stop, smoking, in, ...",162
168,US-13,US-13\n\n\n\n\t\t\t\t\nName of University\nUn...,A Grade of A is to have excellent A Grade of A...,A Grade of A is to have excellent A Grade of A...,"[A, Grade, of, A, is, to, have, excellent, A, ...",334
187,T-11,\t\t\t\tTaiseer - 11\n\n\nsmoking is a ve...,smoking is a very bad hapit. smoking causes ba...,smoking is a very bad hapit. smoking causes ba...,"[smoking, is, a, very, bad, hapit, ., smoking,...",159
432,US-28,US-28 \n\n\n\n\t\t\t\t\nName of Univers...,Male and Female ↑In general have many differen...,Male and Female ↑In general have many differen...,"[Male, and, Female, ↑In, general, have, many, ...",278
433,US-14,US-14\n\n\n\nAnd when she become at the last t...,And when she become at the last time she didn’...,And when she become at the last time she didn’...,"[And, when, she, become, at, the, last, time, ...",316
480,US-6,US-6\n\n\n\t\t\t\nName of University\nUnivers...,As we all know that man is stronger and he can...,As we all know that man is stronger and he can...,"[As, we, all, know, that, man, is, stronger, a...",283
616,US-5,US-5\n\n\t\t\t\t\nName of University\nUnivers...,In my styding life I have tought from many dif...,In my styding life I have tought from many dif...,"[In, my, styding, life, I, have, tought, from,...",328
823,US-1,US-1\n\n\n\t\t\t\t\nName of University\nUniver...,Driving is very interested hobby. In the past ...,Driving is very interested hobby. In the past ...,"[Driving, is, very, interested, hobby, ., In, ...",248
1182,US-9,US-9\n\n\n\n\n\t\n\t\t\t\t\t\nName of Univers...,In our life we can see a lot of different styl...,In our life we can see a lot of different styl...,"[In, our, life, we, can, see, a, lot, of, diff...",246
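
The `↑` character survives normalization and ends up glued to the following word in the tokens (for example `↑in`). Assuming the arrow marks a point where the student inserted text and can simply be dropped, one way to handle it is sketched below. `strip_insert_arrows` is a hypothetical helper, and the interpretation of `↑` is an assumption rather than something the corpus documentation confirms here.

```python
import re

def strip_insert_arrows(text):
    """Replace each '↑' and any whitespace around it with a single space."""
    return re.sub(r'\s*↑\s*', ' ', text).strip()

print(strip_insert_arrows('I am is Ali ↑in the stop smoking'))  # I am is Ali in the stop smoking
```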


In [69]:
balc_df.Original_Text[66]

'\t\t\t\tTaiseer    -   12\n\n\nI am is Ali  ↑in the stop smoking in the cigarette.  I t’s help smokers quit smoking.  I relley somoking in the Large dangerous results.  L   They  my friend somking       beacous  help  quit.      My  father  and  mouther           I advies didn’t  friend  they  smoking.  I t’s   ↑ in the Gulf smoke 2  50%   to  youth.  I  advies  to  smoking   quit.  You  should  lost  of  cigarette  advertisements.  Let’s  in  the  semoking dangerous healthy.  My           father to in the would  2 % 95  in the cigarett.  I t’s  countery  Large               Breoblam  didn’t               smoking.  I they                in the people speres ↑in the  Smoking.  Smoking kills in the cigarette.  If I were you, I would                 grent father in the smoking.   Pleaes             stop smoking beacous dangerous  in the would.  pleaes to her me                     in the would quit smoking. \n\n\n'