# 1_COCA dataset


## Building the COCA dataset of 'key' word families

Key family characteristics:
- Frequency information based on COCA 100k list, available at https://www.wordfrequency.info/100k.asp
- Mid-frequency lemmas, i.e., in the K3-K9 frequency bands
- Word families with 4+ mid-frequency derivations and at least one derivation in each major word class (noun, verb, adjective, adverb)
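
The K-bands above group ranked lemmas in blocks of 1,000 (K1 covers ranks 1-1000, K2 covers 1001-2000, and so on). A minimal sketch of that mapping, assuming 1-based ranks; the helper names `frequency_band` and `is_mid_frequency` are illustrative and are not used elsewhere in this notebook:

```python
def frequency_band(rank: int) -> int:
    """Return the 1,000-item band number (1 for K1, 2 for K2, ...) of a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return (rank - 1) // 1000 + 1


def is_mid_frequency(rank: int, low: int = 3, high: int = 9) -> bool:
    """True if the rank falls within the K3-K9 bands, i.e. ranks 2001-9000."""
    return low <= frequency_band(rank) <= high


# Boundary ranks on either side of the mid-frequency range
for rank in (2000, 2001, 9000, 9001):
    print(rank, frequency_band(rank), is_mid_frequency(rank))
```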

#### Sections of the notebook
- [Initial setup](#Initial-setup)
- [Preparing COCA dataframe](#Preparing-COCA-dataframe)
- [POS tags](#POS-tags)
- [Lemma information](#Lemma-information)
- [Mid-frequency items](#Mid-frequency-items)
- [Word families](#Word-families)
- [Checking derivations](#Checking-derivations)
- [Dataset narrowing](#Dataset-narrowing)

### Initial setup

In [1]:
# Import necessary modules
import pandas as pd
import pprint
import pickle as pkl
import re
import random

# Set preferred notebook format
%pprint # turn off pretty printing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', 999)

Pretty printing has been turned OFF


### Preparing COCA dataframe

In [2]:
# Read in necessary file
coca = pd.read_csv('COCA_frequency_info.txt', skiprows=2, encoding="utf8", sep='\t', na_filter=False)
coca.head()

Unnamed: 0,ID,w1,L1,c1,pc,spelling,coca,pcoca,pbnc,psoap,ph3,ph2,ph1,pc1,pc2,pc3,pc4,pc5,pb1,pb2,pb3,pb4,pb5,pb6,pb7,tpcoca,tpbnc,tpsoap,tph3,tph2,tph1,tpc1,tpc2,tpc3,tpc4,tpc5,tpb1,tpb2,tpb3,tpb4,tpb5,tpb6,tpb7,bnc,fs,fh3,fh2,fh1,fc1,fc2,fc3,fc4,fc5,fb1,fb2,fb3,fb4,fb5,fb6,fb7,tcoca,tbnc,tsoap,th3,th2,th1,tc1,tc2,tc3,tc4,tc5,tb1,tb2,tb3,tb4,tb5,tb6,tb7,Unnamed: 78
0,1,the,the,at,0.11,,25131726,54124.71,59717.97,21403.42,59363.87,63479.96,65266.92,46393.26,53301.68,53775.83,53613.78,63981.74,41050.47,52467.51,58832.77,60670.3,68926.84,73581.36,67229.24,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5971797,2140342,5797074,7579669,8545851,4433575,4820039,5138750,4917319,5826573,409013,834722,427243,635001,1136961,1128125,1400732,188643,4045,21985,42556,39575,10875,38159,19219,53139,56903,21223,904,463,210,518,533,501,916,
1,2,and,and,cc,0.08,,12368293,26636.86,25808.34,17677.41,26260.07,28577.44,33417.48,26089.72,25756.04,26458.18,24577.22,30346.6,26107.37,26803.23,26337.68,22236.16,27659.4,26888.33,28883.92,1.0,1.0,1.0,0.99,0.98,1.0,1.0,1.0,1.0,0.99,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2580834,1767741,2564381,3412219,4375583,2493266,2329103,2528310,2254160,2763549,260125,426421,191264,232733,456247,412243,601801,188462,4045,21985,42264,38849,10873,38146,19153,53129,56812,21222,904,463,210,518,533,501,916,
2,3,of,of,ii,0.01,,11971724,25782.79,30086.53,10067.32,27505.53,32184.17,37182.86,21502.94,19640.22,25872.17,23814.03,38260.97,17505.61,21465.1,26809.46,25441.17,37980.66,45030.26,34311.38,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.99,1.0,0.99,1.0,0.99,1.0,1.0,1.0,1.0,1.0,1.0,3008653,1006732,2686004,3842872,4868611,2054930,1776053,2472312,2184162,3484281,174420,341495,194690,266278,626498,690389,714883,188408,4035,21985,42470,39429,10870,38144,19106,53114,56823,21221,894,463,210,518,533,501,916,
3,4,a,a,at,0.06,,10327063,22240.78,20853.49,15632.54,22557.53,21357.82,20346.25,21403.59,22960.81,24395.42,23734.44,18637.48,19811.69,22470.99,23270.21,23642.27,20335.03,20755.41,22095.87,1.0,1.0,1.0,0.99,0.99,1.0,1.0,0.99,1.0,0.99,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2085349,1563254,2202816,2550178,2664076,2045436,2076332,2331195,2176862,1697244,197397,357498,168988,247450,335430,318215,460371,188277,4042,21985,42289,39059,10847,38138,19121,53117,56680,21221,903,463,210,517,533,501,915,
4,5,in,in,ii,0.09,,8035789,17306.2,18307.46,6702.35,17055.21,17335.58,17775.94,15433.91,13194.97,17503.23,18490.76,21952.5,12382.49,13318.05,16743.76,19122.49,21928.16,24739.06,20770.32,0.99,1.0,1.0,0.99,0.98,1.0,0.99,0.99,1.0,0.99,1.0,0.99,1.0,1.0,1.0,1.0,1.0,1.0,1830746,670235,1665496,2069913,2327528,1474943,1193213,1672586,1695925,1999131,123375,211881,121593,200144,361709,379291,432753,188167,4035,21985,42242,38991,10863,38126,19043,53041,56740,21217,895,463,210,518,533,501,915,


#### IMPORTANT NOTE: The frequency information is a licensed dataset and is not publicly available. As such, this notebook will not run unless the dataset has been purchased.

In [3]:
# Narrow down to only the relevant columns (explicit copy so later edits never touch `coca`)
coca_df = coca[['ID', 'w1', 'L1', 'c1', 'coca','fc1']].copy()

# Rename columns
coca_df.columns = ['rank', 'word', 'lemma', 'POS', 'freq','spoken_freq']

# Set rank as index
coca_df = coca_df.set_index('rank')

In [4]:
# Creating column of written data (total freq - spoken freq)
coca_df['written_freq'] = coca_df.freq - coca_df.spoken_freq

# And then dropping freq and spoken_freq columns as they are no longer needed
coca_df = coca_df.drop(['freq','spoken_freq'], axis=1)

# And simplifying the 'written_freq' column name to just 'word_freq' - from here on, all 'freq' references
# are to the written texts only.
coca_df.columns = ['word', 'lemma', 'POS', 'word_freq']

In [5]:
# Re-rank according to word_freq (i.e. written frequency)
coca_df = coca_df.sort_values(by=['word_freq'],ascending=False)
coca_df = coca_df.reset_index(drop=True)

In [6]:
# The last item ('self' as jj) has a total freq of 112 but a spoken freq of 376, giving a negative written freq.
# This appears to be a glitch in the COCA spreadsheet.
coca_df.loc[coca_df.word == 'self']

Unnamed: 0,word,lemma,POS,word_freq
2319,self,self,nn1,15731
100813,self,self,jj,-264
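
Because written frequency is computed by subtraction, any row whose spoken count exceeds its total count ends up negative. A small sketch for flagging such rows, using plain tuples instead of the dataframe so that it runs on its own; `flag_negative_written` is a hypothetical helper, and the two sample rows use the counts reported in this notebook for 'the' and for 'self' (jj):

```python
def flag_negative_written(rows):
    """Return (word, POS, written_freq) for every row where spoken exceeds total."""
    flagged = []
    for word, pos, total, spoken in rows:
        written = total - spoken
        if written < 0:
            flagged.append((word, pos, written))
    return flagged


# (word, POS, total freq, spoken freq)
rows = [
    ("the", "at", 25131726, 4433575),
    ("self", "jj", 112, 376),
]
print(flag_negative_written(rows))
```

Whether to drop, clip, or keep such rows is a separate decision; the sketch only surfaces them.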


In [7]:
coca_df.head()

Unnamed: 0,word,lemma,POS,word_freq
0,the,the,at,20698151
1,of,of,ii,9916794
2,and,and,cc,9875027
3,a,a,at,8281627
4,in,in,ii,6560846


### POS tags

The expert-speaker corpus (COCA) and the learner corpus (PELIC) are tagged with different tagsets. COCA uses the [CLAWS 7 tagset](http://ucrel.lancs.ac.uk/claws7tags.html), whereas PELIC is tagged with NLTK's tagger, which uses the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).  
To address this issue and to simplify mapping, a simplified version of the CLAWS 7 tagset is used here, collating groups where appropriate.
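
The simplification can be sketched as a single function: keep the first two characters of each CLAWS 7 tag and fold every verb prefix into `vv`. The names `simplify_claws_tag` and `VERB_PREFIXES` are illustrative only; the notebook itself performs the same two steps with a list comprehension and a mapping dictionary.

```python
# Two-letter prefixes that the notebook's mapping dictionary collapses into 'vv'
VERB_PREFIXES = {"vv", "vb", "vd", "vh", "vm"}


def simplify_claws_tag(tag: str) -> str:
    """Reduce a CLAWS 7 tag to a two-letter class, merging all verb types into 'vv'."""
    prefix = tag[:2].lower()
    return "vv" if prefix in VERB_PREFIXES else prefix


for tag in ("nn1", "nn2", "jjr", "vvg", "vbdz", "vm", "appge"):
    print(tag, "->", simplify_claws_tag(tag))
```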

In [8]:
# POS Used in COCA: List of all possible COCA POS tags
coca_df['POS'].apply(pd.Series).stack().value_counts()

jj       29604
nn1      27154
nn2      15906
vv0       5219
vvd       5093
vvn       4873
vvg       4593
vvz       3547
rr        3152
jjr        365
jjt        304
mc         237
uh         170
ii         146
md          93
cs          56
pp          54
mf          52
pn          34
rrr         25
dd          22
da          17
rrt         17
vm          15
appge       13
cc           7
at           5
to           4
vh0          3
db           3
vhz          2
ge           2
vdg          2
vbr          2
vdz          2
vmk          2
xx           2
vbn          1
vbdz         1
vhd          1
vhn          1
vhg          1
vbz          1
vvgk         1
vbm          1
vv           1
vdd          1
ex           1
vbdr         1
vb0          1
vdn          1
vbg          1
at1          1
vd0          1
dtype: int64

In [9]:
# Simplify the POS column by keeping only the first two letters, 
# e.g. so that different types of nouns will all be 'nn'.
coca_df['POS'] = [x[0:2] for x in coca_df.POS]

In [10]:
# Further simplifying by combining the verb types (vv, vm, vb, vh, vd) into 'vv' via a POS mapping dictionary
coca_pos_map_dict = {'nn': 'nn', 'jj': 'jj', 'vv': 'vv', 'rr': 'rr',
                     'mc': 'mc', 'uh': 'uh', 'ii': 'ii', 'md': 'md',
                     'cs': 'cs', 'pp': 'pp', 'mf': 'mf', 'pn': 'pn',
                     'dd': 'dd', 'da': 'da', 'vm': 'vv', 'ap': 'ap',
                     'vb': 'vv', 'vh': 'vv','cc': 'cc', 'vd': 'vv',
                     'at': 'at', 'to': 'to', 'db':'db', 'xx': 'xx',
                     'ge': 'ge', 'ex': 'ex'}

In [11]:
# Applying that dictionary to the POS column
coca_df.POS = coca_df.POS.map(coca_pos_map_dict)

In [12]:
# Checking the resulting simplified POS values
coca_df['POS'].apply(pd.Series).stack().value_counts()

nn    43060
jj    30273
vv    23368
rr     3194
mc      237
uh      170
ii      146
md       93
cs       56
pp       54
mf       52
pn       34
dd       22
da       17
ap       13
cc        7
at        6
to        4
db        3
ge        2
xx        2
ex        1
dtype: int64

### Lemma information

In [13]:
# Adding word_POS and lemma_POS columns
coca_df['word_POS'] = list(zip(coca_df.word, coca_df.POS))
coca_df['lemma_POS'] = list(zip(coca_df.lemma, coca_df.POS))

In [14]:
coca_df.head()

Unnamed: 0,word,lemma,POS,word_freq,word_POS,lemma_POS
0,the,the,at,20698151,"(the, at)","(the, at)"
1,of,of,ii,9916794,"(of, ii)","(of, ii)"
2,and,and,cc,9875027,"(and, cc)","(and, cc)"
3,a,a,at,8281627,"(a, at)","(a, at)"
4,in,in,ii,6560846,"(in, ii)","(in, ii)"


#### NOTE: Lemmas rather than words are the principal counting unit.
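
Counting by lemma means that every inflected form sharing a (lemma, POS) pair contributes to one total. A minimal sketch of that aggregation in plain Python; the function name `lemma_frequencies` is hypothetical and the counts are invented purely for illustration:

```python
from collections import defaultdict


def lemma_frequencies(rows):
    """Sum word frequencies over all forms that share the same (lemma, POS) pair."""
    totals = defaultdict(int)
    for word, lemma, pos, freq in rows:
        totals[(lemma, pos)] += freq
    return dict(totals)


# (word, lemma, POS, word_freq) with made-up counts
rows = [
    ("walk", "walk", "vv", 10),
    ("walked", "walk", "vv", 5),
    ("walks", "walk", "vv", 2),
    ("walk", "walk", "nn", 3),
]
print(lemma_frequencies(rows))
```

Note that the verb and noun readings of 'walk' stay separate because the POS is part of the key.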

In [15]:
# Grouping the lemmas by their raw frequencies, summing the same lemmas and sorting in descending order
lemma_freq = coca_df.groupby(['lemma', 'POS'])['word_freq'].sum().sort_values(ascending=False)
lemma_freq[0:5]
lemma_freq[-5:]

lemma  POS
the    at     20698151
be     vv     10782841
of     ii      9916794
and    cc      9875027
a      at      9523432
Name: word_freq, dtype: int64

lemma          POS
hand-off       jj       2
newsdesk       nn       1
uncapitalized  jj       1
puzzlemaster   nn       0
self           jj    -264
Name: word_freq, dtype: int64

In [16]:
# Creating a dictionary of the (lemma, POS) tuples and their lemma frequencies
lemma_dict = lemma_freq.to_dict()
random.sample(list(lemma_dict.items()), 5) # Random sample of the dictionary to check (list() needed: sample() requires a sequence)

[(('trouper', 'nn'), 72), (('minestrone', 'nn'), 103), (('lebensraum', 'nn'), 22), (('export-driven', 'jj'), 24), (('insect-eating', 'jj'), 19)]

In [17]:
# Pickling the lemma dict for future use and checking that it loads correctly
a = lemma_dict

with open('lemma_dict.pkl', 'wb') as handle:
    pkl.dump(a, handle, protocol=pkl.HIGHEST_PROTOCOL)

with open('lemma_dict.pkl', 'rb') as handle:
    b = pkl.load(handle)

print(a == b)

True


In [18]:
# Adding lemma_POS and lemma_freq columns based on the above
# (zip keeps each (lemma, POS) pair intact, unlike joining and re-splitting on whitespace)
coca_df['lemma_POS'] = list(zip(coca_df.lemma, coca_df.POS))
coca_df['lemma_freq'] = coca_df.lemma_POS.map(lemma_dict)

In [19]:
# And a word_POS column too for later use
coca_df['word_POS'] = list(zip(coca_df.word, coca_df.POS))

In [20]:
# Reorder columns
coca_df = coca_df[['word','POS','word_freq','word_POS','lemma','lemma_POS','lemma_freq']]

In [21]:
coca_df.head()

Unnamed: 0,word,POS,word_freq,word_POS,lemma,lemma_POS,lemma_freq
0,the,at,20698151,"(the, at)",the,"(the, at)",20698151
1,of,ii,9916794,"(of, ii)",of,"(of, ii)",9916794
2,and,cc,9875027,"(and, cc)",and,"(and, cc)",9875027
3,a,at,8281627,"(a, at)",a,"(a, at)",9523432
4,in,ii,6560846,"(in, ii)",in,"(in, ii)",6560846


In [22]:
# Pickling dataframe for later use
coca_df.to_pickle("coca_df.pkl")

### Mid-frequency items
Narrowing the coca_df to only the items in the K3-K9 frequency bands, based on **lemma** frequencies.

In [23]:
mid_freq = coca_df.sort_values(by=['lemma_freq'], ascending = False) # First, sort rows by lemma freq
mid_freq.reset_index(inplace = True) #reset index since word rank not important
mid_freq = mid_freq.drop(['index','word','word_POS','POS','word_freq'], axis =1) #remove word-related columns
mid_freq = mid_freq.drop_duplicates(subset ="lemma_POS", keep='first') # Then, drop duplicates (1 row per lemma)
mid_freq.reset_index(inplace = True, drop = True) #reset index again to fix numbering
mid_freq.head()

Unnamed: 0,lemma,lemma_POS,lemma_freq
0,the,"(the, at)",20698151
1,be,"(be, vv)",10782841
2,of,"(of, ii)",9916794
3,and,"(and, cc)",9875027
4,a,"(a, at)",9523432


In [24]:
# Narrow to the K3-K9 bands: the 7,000 lemmas at 0-based index positions 2001-9000
mid_freq = mid_freq.iloc[2001:9001,]
len(mid_freq)
mid_freq

7000

Unnamed: 0,lemma,lemma_POS,lemma_freq
2001,touch,"(touch, nn)",18185
2002,scholar,"(scholar, nn)",18184
2003,wonderful,"(wonderful, jj)",18161
2004,ride,"(ride, nn)",18154
2005,teaspoon,"(teaspoon, nn)",18150
...,...,...,...
8996,diver,"(diver, nn)",2016
8997,spill,"(spill, nn)",2016
8998,insult,"(insult, vv)",2015
8999,sole,"(sole, nn)",2015


### Word families

In [25]:
# write out txt file of the lemmas to categorize into word families 
#pd.Series(mid_freq.lemma).to_csv('mid_freq_lemmas.txt', sep='\t', index=False, header=False)

To find which forms belong to each word family, the ['familizer' function](https://lextutor.ca/familizer/) at lextutor.ca is used, producing the txt file that is read in below.  

**NOTE:** It is necessary to check the txt file manually as there may be missing line breaks, in this case: _writewronged_ and _withdrewaccurate_. The export line above is therefore commented out; run it only the first time.
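
Part of the manual check can be automated by looking for tokens that are not known words but split cleanly into two known words. This is a rough sketch under stated assumptions: `find_fused_tokens` is a hypothetical helper, and the small `vocabulary` set stands in for a real word list (for example, the words in `coca_df`).

```python
def find_fused_tokens(tokens, vocabulary):
    """Map each unknown token to the ways it splits into two known words."""
    suspects = {}
    for token in tokens:
        if token in vocabulary:
            continue
        splits = [
            (token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in vocabulary and token[i:] in vocabulary
        ]
        if splits:
            suspects[token] = splits
    return suspects


# Toy stand-in vocabulary and tokens, built around the two cases noted above
vocabulary = {"withdraw", "withdrew", "accurate", "write", "wronged"}
tokens = ["withdraw", "withdrewaccurate", "write", "writewronged"]
print(find_fused_tokens(tokens, vocabulary))
```

A check like this only narrows down candidates; genuine compounds can also split into two known words, so flagged tokens still need a manual look.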

In [26]:
# Reading in the new txt file as a dataframe
mid_freq_fams = pd.read_csv('mid_freq_families.txt', encoding = "ISO-8859-1", names=["forms"])
mid_freq_fams.head()

Unnamed: 0,forms
0,abandon abandoned abandoning abandonment aban...
1,able abilities ability abler ablest ably inab...
2,abnormal abnormalities abnormality abnormally
3,aboard
4,abolish abolished abolishes abolishing


In [27]:
# clean up above dataframe
mid_freq_fams.forms = [x.split() for x in mid_freq_fams.forms] #make lemma families into lists
mid_freq_fams['family'] = [x[0] for x in mid_freq_fams.forms] #create column with head lemma from COCA index
mid_freq_fams.head()

Unnamed: 0,forms,family
0,"[abandon, abandoned, abandoning, abandonment, ...",abandon
1,"[able, abilities, ability, abler, ablest, ably...",able
2,"[abnormal, abnormalities, abnormality, abnorma...",abnormal
3,[aboard],aboard
4,"[abolish, abolished, abolishes, abolishing]",abolish


In [28]:
# Applying function to make a new column with POS for all the forms in each lemma family
# Need to use COCA 100k list look up to decide POS

# Making a COCA dict keyed by word_POS tuple, with the word as its value
coca_pos_dict = pd.Series(coca_df.word.values,coca_df.word_POS).to_dict()

# And inverting this dict so that each word maps to a list of all its word_POS tuples,
# i.e. combining the entries for homonyms
inv_coca_pos_dict = {}
for k, v in coca_pos_dict.items():
    inv_coca_pos_dict.setdefault(v, []).append(k)
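
The inverted lookup maps each word to every (word, POS) tuple it occurs with, so that homonyms keep all of their tags. A standalone sketch of the same idea with `collections.defaultdict`; `group_tags_by_word` is an illustrative name, and the sample pairs are taken from the family table shown further down:

```python
from collections import defaultdict


def group_tags_by_word(word_pos_pairs):
    """Build a word -> [(word, POS), ...] lookup from a flat list of pairs."""
    lookup = defaultdict(list)
    for word, pos in word_pos_pairs:
        lookup[word].append((word, pos))
    return dict(lookup)


pairs = [("abandon", "vv"), ("abandon", "nn"), ("aboard", "ii"), ("aboard", "rr"), ("abolish", "vv")]
lookup = group_tags_by_word(pairs)
print(lookup["abandon"])
print(lookup["abolish"])
```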

In [29]:
# Creating the function looking up the word_POS given a word
def find_COCA_POS_from_word(word_list):
    word_and_POS_list = []
    for word in word_list:
        if word in inv_coca_pos_dict:
            word_and_POS_list.append(inv_coca_pos_dict[word])
    return word_and_POS_list

# Applying the function to create a new column in the dataframe
mid_freq_fams['form_POS'] = mid_freq_fams.forms.apply(find_COCA_POS_from_word)

In [30]:
# Flattening the above column
mid_freq_fams['form_POS'] = mid_freq_fams['form_POS'].apply(lambda i:[x for y in i for x in y])

In [31]:
# Rearranging columns for clarity
mid_freq_fams = mid_freq_fams[['family', 'forms', 'form_POS']]
mid_freq_fams.head()

Unnamed: 0,family,forms,form_POS
0,abandon,"[abandon, abandoned, abandoning, abandonment, ...","[(abandon, vv), (abandon, nn), (abandoned, vv)..."
1,able,"[able, abilities, ability, abler, ablest, ably...","[(able, jj), (abilities, nn), (ability, nn), (..."
2,abnormal,"[abnormal, abnormalities, abnormality, abnorma...","[(abnormal, jj), (abnormalities, nn), (abnorma..."
3,aboard,[aboard],"[(aboard, ii), (aboard, rr)]"
4,abolish,"[abolish, abolished, abolishes, abolishing]","[(abolish, vv), (abolished, vv), (abolishes, v..."


### Checking derivations

In [32]:
# Creating columns showing different parts of speech and how many there are
mid_freq_fams['unique_POS'] = [list(set(dict(x).values())) for x in mid_freq_fams['form_POS']] #only unique POS
mid_freq_fams['POS_len'] = [len(x) for x in mid_freq_fams['unique_POS']] #number of different POS
mid_freq_fams.head()

#NOTE: This does not take into account how many different forms of the same POS there may be, 
# e.g. how many adj forms in a lemma family.

Unnamed: 0,family,forms,form_POS,unique_POS,POS_len
0,abandon,"[abandon, abandoned, abandoning, abandonment, ...","[(abandon, vv), (abandon, nn), (abandoned, vv)...","[nn, jj, vv]",3
1,able,"[able, abilities, ability, abler, ablest, ably...","[(able, jj), (abilities, nn), (ability, nn), (...","[nn, jj, rr]",3
2,abnormal,"[abnormal, abnormalities, abnormality, abnorma...","[(abnormal, jj), (abnormalities, nn), (abnorma...","[nn, jj, rr]",3
3,aboard,[aboard],"[(aboard, ii), (aboard, rr)]",[rr],1
4,abolish,"[abolish, abolished, abolishes, abolishing]","[(abolish, vv), (abolished, vv), (abolishes, v...",[vv],1
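
One detail of the cell above: `dict(x)` keeps a single POS per distinct form, so a form listed under two tags contributes only the last one. This is why 'aboard', tagged both `ii` and `rr`, shows `[rr]` with a length of 1. The sketch below contrasts that behaviour with a set comprehension over the pairs, which retains every tag. The alternative is shown for comparison only; adopting it would change the counts reported in this notebook.

```python
pairs = [("aboard", "ii"), ("aboard", "rr")]

# As in the notebook: duplicate keys collapse, so only the last tag per form survives
via_dict = sorted(set(dict(pairs).values()))

# Alternative: collect the tag from every pair
via_set = sorted({pos for _, pos in pairs})

print(via_dict)  # ['rr']
print(via_set)   # ['ii', 'rr']
```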


In [33]:
# Removing all form_POS except for nn, jj, rr, vv (nouns, adjectives, adverbs, verbs)
all_POS = list(set([x for y in mid_freq_fams.unique_POS for x in y]))
all_POS
removal = [x for x in all_POS if x!='nn' and x!='jj' and x!='rr' and x!='vv']
removal
mid_freq_fams['core_POS'] = mid_freq_fams['form_POS'].apply(lambda x: [i for i in x if i[1] not in removal])

['ii', 'nn', 'pp', 'pn', 'at', 'mf', 'da', 'rr', 'md', 'cs', 'jj', 'uh', 'ap', 'dd', 'db', 'mc', 'cc', 'vv']

['ii', 'pp', 'pn', 'at', 'mf', 'da', 'md', 'cs', 'uh', 'ap', 'dd', 'db', 'mc', 'cc']

In [34]:
# Create columns again showing different CORE parts of speech and how many there are
mid_freq_fams['core_unique_POS'] = [list(set(dict(x).values())) for x in mid_freq_fams['core_POS']] #only unique POS
mid_freq_fams['core_POS_len'] = [len(x) for x in mid_freq_fams['core_unique_POS']] #number of different POS

In [35]:
# Checking how many mid-freq families have forms in four word classes
mid_freq_fams.core_POS_len.value_counts()

3    1486
1    1472
2    1325
4     399
0      32
Name: core_POS_len, dtype: int64

In [36]:
# Creating dataframe of the 4-derivation subset of mid-frequency items
deriv4 = mid_freq_fams.loc[mid_freq_fams.core_POS_len == 4].reset_index(inplace = False, drop = True)

In [37]:
# And also narrowing to just those families with no forms outside the four major word classes: nn, jj, rr, vv
(deriv4.form_POS == deriv4.core_POS).value_counts()
deriv4 = deriv4.loc[deriv4.form_POS == deriv4.core_POS,:]
len(deriv4)

True     386
False     13
dtype: int64

386

In [38]:
# Dropping redundant columns
deriv4 = deriv4.drop(['unique_POS', 'POS_len', 'core_unique_POS', 'core_POS','core_POS_len'], axis=1)
deriv4.head()

Unnamed: 0,family,forms,form_POS
0,abstract,"[abstract, abstracted, abstractedly, abstracti...","[(abstract, jj), (abstract, nn), (abstract, vv..."
1,accept,"[accept, acceptability, acceptable, acceptably...","[(accept, vv), (acceptability, nn), (acceptabl..."
2,accuse,"[accuse, accusation, accusations, accused, acc...","[(accuse, vv), (accusation, nn), (accusations,..."
4,admire,"[admire, admirable, admirably, admiration, adm...","[(admire, vv), (admirable, jj), (admirably, rr..."
5,advise,"[advise, advisability, advisable, advisably, a...","[(advise, vv), (advisability, nn), (advisable,..."


### Dataset narrowing
After the first round of COCA analysis, 386 lemma families proved too large a dataset, so it is narrowed further using the following criteria:
- a minimum of 4 mid-frequency derivations (though not necessarily all of nn, jj, rr, vv)
- no forms among the 100 most common lemmas (as a lemma like 'time' is so frequent as to skew all data)
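
Taken together, the two criteria amount to a simple filter over per-family summaries. A minimal sketch, assuming each family is summarised by its number of mid-frequency lemmas and the rank of its most frequent lemma; `select_key_families` and the dictionary keys are hypothetical names. The values for 'time' and 'art' are those shown in this notebook, while 'example_family' is invented.

```python
def select_key_families(families, min_midfreq=4, top_n_excluded=100):
    """Keep families with enough mid-frequency lemmas and no lemma in the top N ranks."""
    return [
        fam["family"]
        for fam in families
        if fam["midfreq_count"] >= min_midfreq and fam["top_lemma_rank"] > top_n_excluded
    ]


families = [
    {"family": "time", "midfreq_count": 4, "top_lemma_rank": 51},   # fails the rank criterion
    {"family": "art", "midfreq_count": 1, "top_lemma_rank": 2},     # fails both criteria
    {"family": "example_family", "midfreq_count": 5, "top_lemma_rank": 1500},  # invented values
]
print(select_key_families(families))
```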

In [39]:
# Checking how many items from each deriv4 family are in mid_freq

# Create new column with only the lemma family items which are in the mid_freq too
mid_freq_list = sorted(mid_freq.lemma_POS.to_list())
deriv4['form_in_midfreq'] = deriv4.form_POS.apply(lambda x: [i for i in x if i in mid_freq_list])

In [40]:
# Checking number of different form_in_midfreq for each of the families
deriv4['len_form_in_midfreq'] = [len(x) for x in deriv4['form_in_midfreq']]
deriv4.len_form_in_midfreq.value_counts()

1    142
2    130
3     71
4     31
5      8
6      2
9      1
0      1
Name: len_form_in_midfreq, dtype: int64

In [41]:
# Make a dictionary from coca_df mapping word_POS to lemma_POS - give it a word_POS and it returns the lemma_POS
word_POS_lemma_POS_dict = pd.Series(coca_df.lemma_POS.values,coca_df.word_POS).to_dict()

In [42]:
# Create a new column of lemma_in_midfreq - should be nearly the same or exactly the same as form_in_midfreq
# but with inflections combined
deriv4['lemma_in_midfreq'] = deriv4['form_in_midfreq'].apply(lambda i:list(set([word_POS_lemma_POS_dict[x] for x in i])))

In [43]:
# And use same idea to calculate lemma_POS column (necessary for later data analysis)
deriv4['lemma_POS'] = deriv4['form_POS'].apply(lambda i:list(set([word_POS_lemma_POS_dict[x] for x in i])))

In [44]:
deriv4['len_lemma_in_midfreq'] = [len(x) for x in deriv4['lemma_in_midfreq']]
deriv4.head()

Unnamed: 0,family,forms,form_POS,form_in_midfreq,len_form_in_midfreq,lemma_in_midfreq,lemma_POS,len_lemma_in_midfreq
0,abstract,"[abstract, abstracted, abstractedly, abstracti...","[(abstract, jj), (abstract, nn), (abstract, vv...","[(abstract, jj), (abstraction, nn)]",2,"[(abstraction, nn), (abstract, jj)]","[(abstract, nn), (abstractness, nn), (abstract...",2
1,accept,"[accept, acceptability, acceptable, acceptably...","[(accept, vv), (acceptability, nn), (acceptabl...","[(acceptable, jj), (acceptance, nn), (accepted...",4,"[(accepted, jj), (acceptance, nn), (unacceptab...","[(accepted, jj), (unacceptability, nn), (unacc...",4
2,accuse,"[accuse, accusation, accusations, accused, acc...","[(accuse, vv), (accusation, nn), (accusations,...","[(accuse, vv), (accusation, nn)]",2,"[(accusation, nn), (accuse, vv)]","[(accusation, nn), (accuser, nn), (accusingly,...",2
4,admire,"[admire, admirable, admirably, admiration, adm...","[(admire, vv), (admirable, jj), (admirably, rr...","[(admire, vv), (admiration, nn)]",2,"[(admiration, nn), (admire, vv)]","[(admiringly, rr), (admirer, nn), (admiration,...",2
5,advise,"[advise, advisability, advisable, advisably, a...","[(advise, vv), (advisability, nn), (advisable,...","[(advise, vv), (adviser, nn), (advisor, nn), (...",4,"[(adviser, nn), (advise, vv), (advisory, jj), ...","[(advisable, jj), (advisement, nn), (inadvisab...",4


In [45]:
# Removing all lemma_POS except for jj, nn, rr, vv
all_POS = list(set([x for y in mid_freq_fams.unique_POS for x in y]))
removal = [x for x in all_POS if x!='nn' and x!='jj' and x!='rr' and x!='vv']
deriv4['core_POS'] = deriv4['lemma_in_midfreq'].apply(lambda x: [i for i in x if i[1] not in removal])

In [46]:
# Check how many families have 4+ mid-freq derivations
deriv4['core_POS_only'] = [list(dict(x).values()) for x in deriv4['core_POS']]
deriv4['len_core_POS_only'] = [len(x) for x in deriv4.core_POS_only]
len(deriv4.loc[deriv4.len_core_POS_only >= 4]) # This is a reasonable number of families for analysis

29

In [47]:
# Removing any items with lemma in top 100 most frequent lemmas

# Create ranking of lemmas
lemma_ranking = pd.DataFrame.from_dict(lemma_dict, orient = 'index')
lemma_ranking = lemma_ranking.reset_index(drop=False)
lemma_ranking.index += 1
lemma_ranking = lemma_ranking.rename(columns={"index": "lemma", 0: "freq"})
lemma_ranking.head()

Unnamed: 0,lemma,freq
1,"(the, at)",20698151
2,"(be, vv)",10782841
3,"(of, ii)",9916794
4,"(and, cc)",9875027
5,"(a, at)",9523432


In [48]:
# Create lemma_rank dict and column
lemma_rank_dict = dict(zip(lemma_ranking.lemma,lemma_ranking.index))

In [49]:
# Find the most frequent lemma form in each family
deriv4['most_freq_lemma'] = deriv4.lemma_POS.apply(lambda row: [(x,lemma_dict[x]) for x in row])
deriv4['most_freq_lemma'] = deriv4['most_freq_lemma'].apply(lambda row: sorted(row,key=lambda x: x[1],reverse=True))
deriv4['most_freq_lemma'] = [x[0][0] for x in deriv4['most_freq_lemma']]

deriv4['lemma_rank'] = deriv4.most_freq_lemma.map(lemma_rank_dict)
deriv4 = deriv4.sort_values(by='lemma_rank')

In [50]:
deriv4.head()

Unnamed: 0,family,forms,form_POS,form_in_midfreq,len_form_in_midfreq,lemma_in_midfreq,lemma_POS,len_lemma_in_midfreq,core_POS,core_POS_only,len_core_POS_only,most_freq_lemma,lemma_rank
18,art,"[art, artist, artistic, artistically, artistri...","[(art, nn), (art, vv), (artist, nn), (artistic...","[(artistic, jj)]",1,"[(artistic, jj)]","[(artistry, nn), (be, vv), (art, nn), (artisti...",1,"[(artistic, jj)]",[jj],1,"(be, vv)",2
312,say,"[say, said, saying, sayings, says, sez, unsaid...","[(say, vv), (say, nn), (say, rr), (said, vv), ...","[(say, nn)]",1,"[(say, nn)]","[(say, nn), (saying, nn), (unsayable, jj), (sa...",1,"[(say, nn)]",[nn],1,"(say, vv)",22
394,will,"[will, ll, unwilling, unwillingly, unwillingne...","[(will, vv), (will, nn), (unwilling, jj), (unw...","[(unwilling, jj), (willingness, nn)]",2,"[(unwilling, jj), (willingness, nn)]","[(willingly, rr), (willing, jj), (unwillingly,...",2,"[(unwilling, jj), (willingness, nn)]","[jj, nn]",2,"(will, vv)",47
370,time,"[time, anytime, timed, timeless, timelessness,...","[(time, nn), (time, vv), (anytime, rr), (timed...","[(time, vv), (anytime, rr), (timely, jj), (tim...",4,"[(timing, nn), (time, vv), (anytime, rr), (tim...","[(timeliness, nn), (timing, nn), (anytime, rr)...",4,"[(timing, nn), (time, vv), (anytime, rr), (tim...","[nn, vv, rr, jj]",4,"(time, nn)",51
207,know,"[know, dunno, knew, knowable, knowed, knowing,...","[(know, vv), (know, nn), (knew, vv), (knowable...","[(known, jj), (unknown, jj)]",2,"[(unknown, jj), (known, jj)]","[(unknowing, jj), (knowingly, rr), (unknown, n...",2,"[(unknown, jj), (known, jj)]","[jj, jj]",2,"(know, vv)",61


In [51]:
# Final narrowing based on above criteria to create dataframe of COCA key families
coca_key = deriv4.loc[deriv4.len_core_POS_only >= 4]
coca_key = coca_key.loc[coca_key.lemma_rank > 100]

### Outlier: 'Reason'
After further analysis at a later stage when compiling the concordance, it was discovered that the lemma 'reason' (nn) was an outlier, accounting for 27.8% of the total dataset. As can be seen below, this is likely due to a number of factors:
- the discrepancy between the frequency of the noun form (lemma rank 413) and the other most common forms.
- its inclusion in task prompts (e.g. 'give reasons...') and utility in the genre of argumentative essays.

As such, the 'reason' family has been excluded from analysis since the likelihood of any other form being used, or an inaccurate form being used, is minimal for learners in this context.

In [52]:
lemma_ranking.loc[lemma_ranking.lemma == ('reason','nn')]
lemma_ranking.loc[lemma_ranking.lemma == ('reason','vv')]
lemma_ranking.loc[lemma_ranking.lemma == ('reasonable','jj')]

Unnamed: 0,lemma,freq
413,"(reason, nn)",91350


Unnamed: 0,lemma,freq
6617,"(reason, vv)",3394


Unnamed: 0,lemma,freq
2915,"(reasonable, jj)",11469


In [53]:
# Academic frequency per million words for the 'reason' and 'reasonable' lemmas
reason = pd.read_csv('COCA_frequency_info.txt', skiprows=2, encoding="utf8", sep='\t', na_filter=False)
reason = reason[['ID', 'w1', 'L1', 'c1', 'coca','fc1','pc5']]
reason.columns = ['rank', 'word', 'lemma', 'POS', 'freq','spoken_freq','acad_per_M']
reason = reason.set_index('rank')
reason.loc[reason.lemma == 'reason']
reason.loc[reason.lemma == 'reasonable']

rank,word,lemma,POS,freq,spoken_freq,acad_per_M
500,reason,reason,nn1,81997,21441,170.43
1084,reasons,reason,nn2,39930,9136,143.35
12119,reason,reason,vv0,2046,285,6.35
15216,reasoned,reason,vvd,1410,45,4.5
46064,reasons,reason,vvz,154,13,0.33
48612,reasoned,reason,vvn,134,7,0.64


rank,word,lemma,POS,freq,spoken_freq,acad_per_M
2867,reasonable,reasonable,jj,14739,3270,51.61
46692,reasonable,reasonable,rr,149,25,0.38


In [54]:
# Removing the 'reason' family from the key families dataset
coca_key = coca_key.loc[coca_key.family != 'reason']

In [55]:
len(coca_key)
sorted(coca_key.family)

26

['accept', 'advise', 'back', 'collaborate', 'compete', 'confuse', 'construct', 'continue', 'correct', 'embarrass', 'equal', 'excite', 'expect', 'frustrate', 'heat', 'infect', 'intense', 'nation', 'open', 'precede', 'predict', 'select', 'special', 'structure', 'vary', 'wide']

In [56]:
# Total number of word types
sum([len(x) for x in coca_key.form_POS])

# Total number of lemma types
sum([len(x) for x in coca_key.lemma_POS])

# Total number of mid-freq lemma types
sum([len(x) for x in coca_key.lemma_in_midfreq])

403

262

120

In [57]:
coca_key.to_pickle("coca_key.pkl")

### Next notebook: 2_PELIC_dataset.ipynb