
# 2_PELIC dataset


## Finding occurrences of 'key' family items in the PELIC corpus

**Note:** The pickle file containing the written section of the PELIC corpus used in this notebook is not currently publicly available. It will be made publicly available in the summer of 2020 and can then be downloaded from the [Pitt ELI Data Mining Group GitHub page](https://github.com/ELI-Data-Mining-Group/Pitt-ELI-Corpus).

#### Sections of the notebook
- [Initial setup](#Initial-setup)
- [Spelling correction](#Spelling-correction)
- [Key families in PELIC](#Key-families-in-PELIC)
- [Key family forms](#Key-family-forms)
- [Narrowing dataset](#Narrowing-dataset)

In [1]:
# Importing necessary modules
import pandas as pd
import pprint
import pickle as pkl
import csv

# Setting preferred notebook format
%pprint # Turn off pretty printing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # Shows all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

Pretty printing has been turned OFF


In [2]:
# Reading in necessary files

# COCA key families data frame (opened in a with-block so the file handle is closed)
with open('coca_key.pkl', 'rb') as f:
    coca_key = pkl.load(f)
coca_key.head()

# PELIC dataframe (written texts)
pelic = pd.read_pickle('pelic_df.pkl')
pelic = pelic.drop(['question_id', 'user_file_id', 'text_preanon', 'toks_re', 'toks_re_len'], axis=1)
pelic.head()

,family,forms,form_POS,form_in_midfreq,len_form_in_midfreq,lemma_in_midfreq,lemma_POS,len_lemma_in_midfreq,core_POS,core_POS_only,len_core_POS_only,most_freq_lemma,lemma_rank
25,back,"[back, backed, backer, backers, backing, backs...","[(back, rr), (back, nn), (back, vv), (back, jj...","[(back, vv), (backing, nn), (backward, rr), (b...",4,"[(backward, rr), (back, vv), (backing, nn), (b...","[(back, jj), (back, rr), (backing, nn), (backw...",4,"[(backward, rr), (back, vv), (backing, nn), (b...","[rr, vv, nn, rr]",4,"(back, rr)",109
242,nation,"[nation, national, nationalisation, nationalis...","[(nation, nn), (national, jj), (national, nn),...","[(nationalism, nn), (nationalist, jj), (nation...",6,"[(nationwide, rr), (nationalist, nn), (nationa...","[(nationwide, rr), (nationalize, vv), (nation,...",6,"[(nationwide, rr), (nationalist, nn), (nationa...","[jj, jj, nn, rr]",4,"(national, jj)",220
253,open,"[open, opened, opener, openers, opening, openi...","[(open, jj), (open, vv), (open, rr), (opened, ...","[(open, rr), (opener, nn), (openly, rr), (open...",5,"[(open, rr), (reopen, vv), (openly, rr), (open...","[(open, jj), (reopening, nn), (open, rr), (uno...",5,"[(open, rr), (reopen, vv), (openly, rr), (open...","[rr, vv, rr, nn, nn]",5,"(open, vv)",342
72,continue,"[continue, continual, continually, continuance...","[(continue, vv), (continual, jj), (continually...","[(continually, rr), (continued, jj), (continui...",6,"[(continuing, jj), (continuity, nn), (continua...","[(continuing, jj), (continuity, nn), (continua...",6,"[(continuing, jj), (continuity, nn), (continua...","[jj, nn, rr, rr, jj, jj]",6,"(continue, vv)",350
140,expect,"[expect, expectancies, expectancy, expectant, ...","[(expect, vv), (expectancies, nn), (expectancy...","[(expectancy, nn), (expected, jj), (unexpected...",4,"[(unexpected, jj), (unexpectedly, rr), (expect...","[(expectation, nn), (unexpectedly, rr), (unexp...",4,"[(unexpected, jj), (unexpectedly, rr), (expect...","[jj, rr, jj, nn]",4,"(expect, vv)",434


answer_id,anon_id,text,class_code,level_id,native_language,gender,version,toks_nltk,toks_pos,lemmas
1,eq0,I met my friend Nife while I was studying in a...,g,4,Arabic,Male,1,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, PRP), (met, VBD), (my, PRP$), (friend, NN...","[i, meet, my, friend, nife, while, i, be, stud..."
2,am8,"Ten years ago, I met a women on the train betw...",g,4,Thai,Female,1,"[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, CD), (years, NNS), (ago, RB), (,, ,), (...","[ten, year, ago, ,, i, meet, a, woman, on, the..."
3,dk5,In my country we usually don't use tea bags. F...,w,4,Turkish,Female,1,"[In, my, country, we, usually, do, n't, use, t...","[(In, IN), (my, PRP$), (country, NN), (we, PRP...","[in, my, country, we, usually, do, n't, use, t..."
4,dk5,I organized the instructions by time.,w,4,Turkish,Female,1,"[I, organized, the, instructions, by, time, .]","[(I, PRP), (organized, VBD), (the, DT), (instr...","[i, organize, the, instruction, by, time, .]"
5,ad1,"First, prepare a port, loose tea, and cup.\nSe...",w,4,Korean,Female,1,"[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, RB), (,, ,), (prepare, VB), (a, DT), ...","[first, ,, prepare, a, port, ,, loose, tea, ,,..."


### Spelling correction
Spelling in the PELIC texts has not been corrected, in order to minimize data manipulation. However, as spelling accuracy is not the component of lexical depth being investigated here, misspellings of key words will be corrected so that such occurrences are included in the final dataset.

In [3]:
# Reading in misspell_dict
with open('misspell_dict.csv') as f:
    reader = csv.reader(f, skipinitialspace=True)
    misspell_dict = dict(reader)

**Note:** This dictionary includes many misspellings (the dictionary keys) and their correct equivalents (the dictionary values), derived from multiple sources. Only a fraction of the spelling mistakes in PELIC are addressed using this dictionary, but all misspellings and corrections of key words have been manually added. As with the PELIC pickle file, the misspell_dict will be publicly available in the summer of 2020.
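The lookup described above can be illustrated with a minimal sketch. The two-entry dictionary and the helper name `correct_tokens` are hypothetical stand-ins, not the real `misspell_dict` or the notebook's functions; the misspellings are taken from the sample rows shown later in the notebook, and the correction given for `finaly` is an assumption.

```python
# Hypothetical miniature of misspell_dict: misspelling (key) -> correction (value)
toy_misspell_dict = {"finaly": "finally", "coract": "correct"}

def correct_tokens(tokens, mapping):
    """Return a new token list with known misspellings replaced.

    The lookup is case-insensitive; tokens not in the mapping are kept unchanged.
    """
    return [mapping.get(tok.lower(), tok) for tok in tokens]

tokens = ["Finaly", "the", "answer", "is", "coract", "."]
corrected = correct_tokens(tokens, toy_misspell_dict)
print(corrected)
```

Note that a corrected token takes the form stored in the dictionary, so any capitalization of the original misspelling is lost, while uncorrected tokens keep their original case.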

In [4]:
# Creating function for identifying the errors in a text
def Errors_in_text(tokenized_text):
    error_list = []
    for word in tokenized_text:
        if word.lower() in misspell_dict:
            error_list.append(word.lower())
    return error_list

In [5]:
# Creating function for replacing the errors in a text
def CorrectSpelling(tokenized_text):
    # Case-insensitive lookup: tokens found in misspell_dict are replaced, all others are kept as-is
    return [misspell_dict.get(word.lower(), word) for word in tokenized_text]

In [6]:
# Creating 'errors_in_text' column with list of the errors which are also in our dictionary
pelic['errors_in_text'] = pelic.toks_nltk.map(Errors_in_text)

In [7]:
# Creating 'number of errors' column for descriptive statistics
pelic['len_errors_in_text'] = [len(x) for x in pelic['errors_in_text']]

In [8]:
pelic.head()

answer_id,anon_id,text,class_code,level_id,native_language,gender,version,toks_nltk,toks_pos,lemmas,errors_in_text,len_errors_in_text
1,eq0,I met my friend Nife while I was studying in a...,g,4,Arabic,Male,1,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, PRP), (met, VBD), (my, PRP$), (friend, NN...","[i, meet, my, friend, nife, while, i, be, stud...",[],0
2,am8,"Ten years ago, I met a women on the train betw...",g,4,Thai,Female,1,"[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, CD), (years, NNS), (ago, RB), (,, ,), (...","[ten, year, ago, ,, i, meet, a, woman, on, the...",[agin],1
3,dk5,In my country we usually don't use tea bags. F...,w,4,Turkish,Female,1,"[In, my, country, we, usually, do, n't, use, t...","[(In, IN), (my, PRP$), (country, NN), (we, PRP...","[in, my, country, we, usually, do, n't, use, t...",[],0
4,dk5,I organized the instructions by time.,w,4,Turkish,Female,1,"[I, organized, the, instructions, by, time, .]","[(I, PRP), (organized, VBD), (the, DT), (instr...","[i, organize, the, instruction, by, time, .]",[],0
5,ad1,"First, prepare a port, loose tea, and cup.\nSe...",w,4,Korean,Female,1,"[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, RB), (,, ,), (prepare, VB), (a, DT), ...","[first, ,, prepare, a, port, ,, loose, tea, ,,...",[],0


In [9]:
# Creating column with corrected tokens
pelic['toks_corrected'] = pelic.toks_nltk.map(CorrectSpelling)

In [10]:
# Counting the number of texts in which tokens have been corrected (and checking this number)
texts_with_errors = pelic.loc[pelic.len_errors_in_text != 0]
len(texts_with_errors)
len(texts_with_errors.loc[texts_with_errors.toks_nltk != texts_with_errors.toks_corrected])

2724

2724

In [11]:
# Also adding a corrected lemmas column
pelic['lem_errors'] = pelic.lemmas.map(Errors_in_text)
pelic['len_lem_errors'] = [len(x) for x in pelic['lem_errors']]
pelic['lems_corrected'] = pelic.lemmas.map(CorrectSpelling)

In [12]:
# Slightly fewer texts with lemma errors, as lemmatization collapsed some items
texts_with_lem_errors = pelic.loc[pelic.len_lem_errors != 0]
len(texts_with_lem_errors)
# Compare the corrected lemmas against the original lemmas (not against toks_nltk)
len(texts_with_lem_errors.loc[texts_with_lem_errors.lemmas != texts_with_lem_errors.lems_corrected])

2628

2628

In [13]:
# Also need to add a corrected toks_pos column (only the token is corrected; the POS tag is kept as-is)
# The lookup is lower-cased to match Errors_in_text and CorrectSpelling
pelic['toks_pos_errors'] = pelic['toks_pos'].apply(lambda x: [i for i in x if i[0].lower() in misspell_dict])
pelic['len_toks_pos_errors'] = [len(x) for x in pelic['toks_pos_errors']]
pelic['toks_pos_corrected'] = pelic['toks_pos'].apply(lambda x: [(misspell_dict.get(tok.lower(), tok), tag) for tok, tag in x])

In [14]:
# Re-pickle the updated pelic df for later use
pelic.to_pickle("pelic_df.pkl")

### Key families in PELIC

In [15]:
# Creating a list of the key family head lemmas
key_lemma_list = coca_key.family.tolist()

In [16]:
# Seeing how many PELIC texts have at least one of the key family head words
mask1 = pelic.lems_corrected.apply(lambda x: any(item in x for item in key_lemma_list))
pelic_key = pelic[mask1]

In [17]:
# Checking how many texts include a key lemma
print('PELIC texts with key lemmas:',len(pelic_key))
print('Total PELIC texts:',len(pelic))
print('Percentage of PELIC texts with key lemmas:',round(len(pelic_key)/len(pelic)*100,2))

PELIC texts with key lemmas: 7223
Total PELIC texts: 46239
Percentage of PELIC texts with key lemmas: 15.62


In [18]:
# Checking number of key lemmas in/not in PELIC
peliclemmas_list = pelic.lems_corrected.tolist() #make a list of all lemmas in PELIC
peliclemmas_list = [x for y in peliclemmas_list for x in y] #flatten the list
in_pelic = [x for x in key_lemma_list if x in peliclemmas_list]
not_in_pelic = [x for x in key_lemma_list if x not in peliclemmas_list]

print('Number of key lemmas used in PELIC:',len(in_pelic))
print('Number of key lemmas NOT used in PELIC:',len(not_in_pelic))

# Students used all of the key lemmas in the list (26 according to the output)

Number of key lemmas used in PELIC: 26
Number of key lemmas NOT used in PELIC: 0


### Key family forms

In [19]:
# Creating a set of the key family forms
key_family_forms_list = {x for y in coca_key.form_POS.tolist() for x in y} # a set of (form, POS) tuples, flattened across families
len(key_family_forms_list)

403

In [20]:
# Need to change the POS tags in PELIC 'toks_pos_corrected' to match the COCA tags (easier than the reverse)

# First simplify by reducing the tags to two characters
pelic.toks_pos_corrected = pelic.toks_pos_corrected.apply(lambda x: [(i[0],i[1][0:2]) for i in x])
quick_POS_dict = {'JJ':'jj','NN':'nn','RB':'rr', 'VB':'vv'}

# Then replace the ones being analyzed and lower case the words (ignoring all others for now)
pelic['toks_pos_corrected'] = pelic.toks_pos_corrected.apply\
(lambda x: [(i[0].lower(),quick_POS_dict[i[1]]) if i[1] in quick_POS_dict else i for i in x])
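As an illustration of the tag conversion above, the following sketch applies the same two steps (truncate the Penn Treebank tag to two characters, then map the four analyzed classes to COCA-style tags) to a single toy sentence. The helper name `simplify_tags` and the toy input are assumptions made for the example; the mapping mirrors `quick_POS_dict`.

```python
# Penn Treebank two-character prefix -> COCA-style tag (same pairs as quick_POS_dict)
pos_map = {"JJ": "jj", "NN": "nn", "RB": "rr", "VB": "vv"}

def simplify_tags(tagged_tokens):
    """Truncate each tag to two characters and convert the analyzed classes.

    Words in the four analyzed classes are lower-cased; all other
    (word, truncated_tag) pairs are passed through unchanged.
    """
    result = []
    for word, tag in tagged_tokens:
        prefix = tag[:2]
        if prefix in pos_map:
            result.append((word.lower(), pos_map[prefix]))
        else:
            result.append((word, prefix))
    return result

toy_sentence = [("Last", "JJ"), ("week", "NN"), ("I", "PRP"), ("planned", "VBD")]
print(simplify_tags(toy_sentence))
# [('last', 'jj'), ('week', 'nn'), ('I', 'PR'), ('planned', 'vv')]
```

Because only the four analyzed classes are lower-cased, tokens with other tags (such as `I` above) keep their original capitalization and a truncated Penn tag.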

In [21]:
# Checking how many texts include a key family form
mask2 = pelic.toks_pos_corrected.apply(lambda x: any(item in key_family_forms_list for item in x))
pelic_key_forms = pelic[mask2].copy() # Copy so that columns can be added without a SettingWithCopyWarning
len(pelic_key_forms)

8715

In [22]:
# Next, checking the total number of instances in which one of these forms is used (not just the number of texts)

# Creating a column in pelic_key_forms with the forms used from coca_key.form_POS
pelic_key_forms['forms_in_text'] = pelic_key_forms['toks_pos_corrected'].apply(lambda x: [i for i in x if i in key_family_forms_list])

# Creating a column with the number of the forms found in the above column
pelic_key_forms['len_forms'] = [len(x) for x in pelic_key_forms['forms_in_text']]


In [23]:
# Total number of occurrences of key family forms
pelic_key_forms['len_forms'].sum()

15811
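The difference between the number of texts containing a key form and the total number of occurrences can be seen in a small pure-Python sketch. The set of key forms and the three toy texts below are hypothetical; the point is that matching is done on the whole `(form, POS)` tuple, so the same word with a different tag does not count.

```python
# Hypothetical subset of key family forms as (form, POS) tuples
key_forms = {("open", "vv"), ("select", "vv"), ("correct", "jj")}

# Toy texts keyed by a made-up answer_id
texts = {
    1: [("open", "vv"), ("the", "DT"), ("file", "nn")],
    2: [("a", "DT"), ("cup", "nn")],
    3: [("select", "vv"), ("the", "DT"), ("correct", "jj"), ("correct", "vv")],
}

# Key forms found in each text; ("correct", "vv") is not in this toy set, so it is skipped
forms_in_text = {tid: [t for t in toks if t in key_forms] for tid, toks in texts.items()}

# Texts with at least one key form, and the total number of occurrences across them
texts_with_forms = {tid: forms for tid, forms in forms_in_text.items() if forms}
total_occurrences = sum(len(forms) for forms in texts_with_forms.values())

print(len(texts_with_forms), total_occurrences)
# 2 3
```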

### Narrowing dataset
- For valid natural production, we will include only free written production, i.e. texts from writing classes and not from reading or grammar classes, where students might have copied the word.
- Only the first versions of texts will be included to avoid duplicates and corrections made due to teacher feedback.
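The two criteria above amount to two successive boolean filters. The sketch below shows them on a toy dataframe; the rows and counts are invented for illustration, and only the column names (`version`, `class_code`, `len_forms`) and the class codes `w` and `g` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for pelic_key_forms (values are made up)
toy = pd.DataFrame(
    {
        "version": [1, 2, 1, 1],
        "class_code": ["w", "w", "g", "w"],
        "len_forms": [2, 2, 1, 3],
    },
    index=pd.Index([10, 11, 12, 13], name="answer_id"),
)

# Step 1: keep only first versions, dropping revised drafts
first_versions = toy.loc[toy["version"] == 1]

# Step 2: keep only texts from writing classes (class code 'w')
writing_only = first_versions.loc[first_versions["class_code"] == "w"]

print("Texts kept:", len(writing_only))
print("Key form occurrences kept:", writing_only["len_forms"].sum())
```

Reporting both the number of texts and the summed `len_forms` after each filter makes it easy to see how much data each criterion removes.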

In [24]:
# Keeping only version 1 of texts
pelic_key_forms = pelic_key_forms.loc[pelic_key_forms.version == 1]
print('New number of texts:',len(pelic_key_forms))
print('New number of tokens:',pelic_key_forms['len_forms'].sum())

New number of texts: 7570
New number of tokens: 13658


In [25]:
# Keeping only writing texts (class code 'w')
pelic_key_forms = pelic_key_forms.loc[pelic_key_forms.class_code == 'w']
print('New number of texts:',len(pelic_key_forms))
print('New number of tokens:',pelic_key_forms['len_forms'].sum())

New number of texts: 3775
New number of tokens: 8326


In [26]:
# Dropping the version and class_code columns which are no longer relevant
pelic_key_forms = pelic_key_forms.drop(['class_code','version'], axis=1)

In [27]:
# Pickling the dataframe for later use
pelic_key_forms.to_pickle("pelic_key_forms.pkl")

In [28]:
pelic_key_forms.head()

answer_id,anon_id,text,level_id,native_language,gender,toks_nltk,toks_pos,lemmas,errors_in_text,len_errors_in_text,toks_corrected,lem_errors,len_lem_errors,lems_corrected,toks_pos_errors,len_toks_pos_errors,toks_pos_corrected,forms_in_text,len_forms
3,dk5,In my country we usually don't use tea bags. F...,4,Turkish,Female,"[In, my, country, we, usually, do, n't, use, t...","[(In, IN), (my, PRP$), (country, NN), (we, PRP...","[in, my, country, we, usually, do, n't, use, t...",[],0,"[In, my, country, we, usually, do, n't, use, t...",[],0,"[in, my, country, we, usually, do, n't, use, t...",[],0,"[(In, IN), (my, PR), (country, nn), (we, PR), ...","[(heat, nn)]",1
25,gc5,Last week I planned to go paintball match' but...,4,Turkish,Male,"[Last, week, I, planned, to, go, paintball, ma...","[(Last, JJ), (week, NN), (I, PRP), (planned, V...","[last, week, i, plan, to, go, paintball, match...",[],0,"[Last, week, I, planned, to, go, paintball, ma...",[],0,"[last, week, i, plan, to, go, paintball, match...",[],0,"[(last, jj), (week, nn), (I, PR), (planned, vv...","[(accepted, vv)]",1
30,er4,when you want to enjoy drinking a tea you have...,4,Arabic,Male,"[when, you, want, to, enjoy, drinking, a, tea,...","[(when, WRB), (you, PRP), (want, VBP), (to, TO...","[when, you, want, to, enjoy, drink, a, tea, yo...","[coract, finaly]",2,"[when, you, want, to, enjoy, drinking, a, tea,...","[coract, finaly]",2,"[when, you, want, to, enjoy, drink, a, tea, yo...","[(coract, JJ), (finaly, NN)]",2,"[(when, WR), (you, PR), (want, vv), (to, TO), ...","[(correct, jj)]",1
97,ea3,Here are a few instructions for typing in Engl...,4,Korean,Male,"[Here, are, a, few, instructions, for, typing,...","[(Here, RB), (are, VBP), (a, DT), (few, JJ), (...","[here, be, a, few, instruction, for, type, in,...",[],0,"[Here, are, a, few, instructions, for, typing,...",[],0,"[here, be, a, few, instruction, for, type, in,...",[],0,"[(here, rr), (are, vv), (a, DT), (few, jj), (i...","[(open, vv), (select, vv)]",2
111,cz7,Every one like a special kind of food. For me ...,4,Arabic,Male,"[Every, one, like, a, special, kind, of, food,...","[(Every, DT), (one, CD), (like, IN), (a, DT), ...","[every, one, like, a, special, kind, of, food,...",[],0,"[Every, one, like, a, special, kind, of, food,...",[],0,"[every, one, like, a, special, kind, of, food,...",[],0,"[(Every, DT), (one, CD), (like, IN), (a, DT), ...","[(special, jj), (specially, rr)]",2


### Next notebook: 3_Concordances.ipynb