# PELIC concordancing tutorial<a name="top"></a>

This notebook provides a short example of the type of linguistic investigation that can be carried out with the data in PELIC. The focus of the investigation is a set of verbs which are important indicators of syntactic complexity (described in more detail in the [`Background`](#Background) section of this notebook). The subsequent tutorial has two aims:
1. to present a straightforward and replicable way of accessing and processing the corpus data necessary to answer genuine research questions, using tools from the [`Pitt ELI Toolkit (pelitk)`](https://github.com/ELI-Data-Mining-Group/pelitk)
2. to demonstrate how to build a concordance list and dataframe using the PELIC data


#### Sections of the notebook
- [Background](#Background)
- [Initial setup](#Initial-setup)
- [Building a concordance list](#Building-a-concordance-list)
- [Summary](#Summary)

## Background
This tutorial is based on work by Dr. Alan Juffs and Dr. Na-Rae Han (2019), which investigates the development of syntactic complexity in learners' writing by focusing on a set of key words. In this tutorial we analyze a selection of these key words: nine verbs which we would expect to be followed, to varying degrees, by either a noun phrase (NP) or a complement clause (CP):

_consider, suggest, explain, realize, admit, deny, conclude, recommend, suppose_  

For example, with _suppose_ we would expect a CP but not an NP, i.e.:
1. _Sam supposed the answer was correct. (CP)_ √ 
2. _Sam supposed the answer. (NP)_ X  

In contrast, with _conclude_ both options are acceptable: 
1. _Andrea concluded her speech with a joke. (NP)_ √ 
2. _Andrea concluded that he was telling the truth. (CP)_ √

However, because the use of a CP requires greater syntactic complexity, we hypothesize that learners will underuse CP constructions with these verbs compared to expert speakers, especially at lower levels of proficiency. Therefore, by analyzing occurrences of these verbs and their syntactic patterns, we can address the following research questions:

1. With verbs that allow for both CP and NP constructions, do learners show a preference for NP constructions compared to expert speakers?
2. To what extent do factors such as first language and verb frequency affect learners' choice of constructions with these verbs?

For a more detailed discussion of this work, please see the [slides](https://github.com/ELI-Data-Mining-Group/Pitt-ELI-Corpus/blob/master/AAAL-2019-FREQ-Mar-12.pdf) or the [abstract](https://aaal.confex.com/aaal/2019/meetingapp.cgi/Session/1553) from the conference where this work was presented.

## Initial setup

In [1]:
# Import necessary modules

from pelitk import conc
import pandas as pd
import pprint
import pickle as pkl
import operator
import csv
from more_itertools import unique_everseen
from ast import literal_eval


# Set preferred notebook format

%pprint # turn off pretty printing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # Show all output
pd.set_option('display.max_columns', 999) # Allow viewing of all columns in dataframe
pd.options.mode.chained_assignment = None # Turn off SettingWithCopy warning

Pretty printing has been turned OFF


In [2]:
pelic_df = pd.read_csv("../PELIC_compiled.csv", index_col = 'answer_id', # answer_id is unique
                      dtype = {'level_id':'object','question_id':'object','version':'object'}, #str not ints
                               converters={'tokens':literal_eval,'tok_POS':literal_eval,'lemmas':literal_eval,
                                          'lemma_POS': literal_eval,}) # Read in as lists
pelic_df.info()
pelic_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46230 entries, 1 to 48420
Data columns (total 13 columns):
anon_id        46230 non-null object
L1             46230 non-null object
gender         46230 non-null object
level_id       46230 non-null object
class_id       46230 non-null object
question_id    46230 non-null object
version        46230 non-null object
text_len       46230 non-null int64
text           46230 non-null object
tokens         46230 non-null object
tok_POS        46230 non-null object
lemmas         46230 non-null object
lemma_POS      46230 non-null object
dtypes: int64(1), object(12)
memory usage: 4.9+ MB


answer_id,anon_id,L1,gender,level_id,class_id,question_id,version,text_len,text,tokens,tok_POS,lemmas,lemma_POS
1,eq0,Arabic,Male,4,g,5,1,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, PRP), (met, VBD), (my, PRP$), (friend, NN...","[i, meet, my, friend, nife, while, i, be, stud...","[(i, PRP), (meet, VBD), (my, PRP$), (friend, N..."
2,am8,Thai,Female,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, CD), (years, NNS), (ago, RB), (,, ,), (...","[ten, year, ago, ,, i, meet, a, woman, on, the...","[(ten, CD), (year, NNS), (ago, RB), (,, ,), (i..."
3,dk5,Turkish,Female,4,w,12,1,64,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, IN), (my, PRP$), (country, NN), (we, PRP...","[in, my, country, we, usually, do, not, use, t...","[(in, IN), (my, PRP$), (country, NN), (we, PRP..."
4,dk5,Turkish,Female,4,w,13,1,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, PRP), (organized, VBD), (the, DT), (instr...","[i, organize, the, instruction, by, time, .]","[(i, PRP), (organize, VBD), (the, DT), (instru..."
5,ad1,Korean,Female,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, RB), (,, ,), (prepare, VB), (a, DT), ...","[first, ,, prepare, a, port, ,, loose, tea, ,,...","[(first, RB), (,, ,), (prepare, VB), (a, DT), ..."


**Note:** Here we have read in the PELIC_compiled csv. To see how this csv was created from the raw data files, please see the [build_PELIC_compiled.ipynb](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/PELIC_compiled.csv) notebook.
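
The `converters` argument matters because list-valued columns such as `tokens` are stored in the csv as strings. A minimal sketch with a made-up, in-memory csv (the toy data and the `toy_` names are illustrative assumptions, not part of PELIC) shows the effect of `literal_eval` and of reading `level_id` as `object`:

```python
import io
from ast import literal_eval

import pandas as pd

# Toy csv with one row; the tokens column is a stringified Python list
toy_csv = io.StringIO(
    "answer_id,level_id,tokens\n"
    "1,4,\"['I', 'met', 'my', 'friend']\"\n"
)

toy_df = pd.read_csv(
    toy_csv,
    index_col="answer_id",
    dtype={"level_id": "object"},         # keep the level as a string, not an int
    converters={"tokens": literal_eval},  # parse the string back into a list
)

print(type(toy_df.loc[1, "tokens"]))    # a list rather than a str
print(repr(toy_df.loc[1, "level_id"]))  # a string, which is why later filters compare with '2'
```

Without the converter, each cell in `tokens` would be a single string, and applying a token-based function to it would iterate over characters rather than words.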

### Narrowing the dataset
- For valid natural production, we may want to include only free written production, i.e. from writing classes and not reading or grammar classes.
- Only the first versions of texts will be included to avoid duplicates and corrections made due to teacher feedback.
- Only levels 3, 4, and 5 will be included, as there are very few level 2 classes, which would limit the usefulness of any later analysis of proficiency as a factor.

In [3]:
# Keep only writing texts (class id 'w')

pelic_df = pelic_df.loc[pelic_df.class_id == 'w']
print('New number of texts:',len(pelic_df))
print('New number of tokens:',pelic_df['text_len'].sum())

New number of texts: 14873
New number of tokens: 2654899


In [4]:
# Keep only version 1 of texts

pelic_df = pelic_df.loc[pelic_df.version == '1']
print('New number of texts:',len(pelic_df))
print('New number of tokens:',pelic_df['text_len'].sum())

New number of texts: 12981
New number of tokens: 2256633


In [5]:
# Remove level 2 students

texts_per_level = pelic_df.level_id.value_counts()
texts_per_level
pelic_df = pelic_df.loc[pelic_df.level_id != '2']
print('New number of texts:',len(pelic_df))
print('New number of tokens:',pelic_df['text_len'].sum())

4    4888
5    4273
3    3463
2     357
Name: level_id, dtype: int64

New number of texts: 12624
New number of tokens: 2252791
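
The three filters above can also be combined into a single boolean mask. The sketch below uses a small invented dataframe (`toy_df` and its values are hypothetical) with the same column names and string-typed ids as `pelic_df`:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "class_id": ["w", "g", "w", "w"],
        "version": ["1", "1", "2", "1"],
        "level_id": ["4", "4", "5", "2"],
        "text_len": [120, 80, 200, 40],
    },
    index=pd.Index([1, 2, 3, 4], name="answer_id"),
)

# Writing class, first version, and not level 2
mask = (
    (toy_df["class_id"] == "w")
    & (toy_df["version"] == "1")
    & (toy_df["level_id"] != "2")
)
filtered = toy_df.loc[mask]

print("Texts kept:", len(filtered))
print("Tokens kept:", filtered["text_len"].sum())
```

Applying the filters one at a time, as in the cells above, has the advantage of reporting how many texts each step removes.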


## Building a concordance list
This section shows how to use the concordance function from [`pelitk`](https://github.com/ELI-Data-Mining-Group/pelitk) to create a concordance list for the set of verbs being analyzed. These concordances are stored in a dataframe containing other useful identifying information, but can also be printed as a stand-alone list.

#### Creating the verb list
First, we can create a list of the nine verbs we want to find concordances for.  
**Note:** This is a list of the lemma forms, so that _consider_ also includes inflections like _considers,_ _considered,_ etc.

In [6]:
verbs = ['consider', 'suggest', 'explain', 'realize', 'admit', 'deny', 'conclude', 'recommend', 'suppose']

#### Concordancing a list of items
The concordance function takes one word or tuple as the 'node' argument, as seen in the example below.

In [7]:
example_text = 'Andrea concluded her speech with a joke. Andrea concluded that he was telling the truth.'
example_tok_text = ['Andrea', 'concluded', 'her', 'speech', 'with', 'a', 'joke', '.', 'Andrea',
                    'concluded', 'that', 'he', 'was', 'telling', 'the', 'truth', '.']

%pprint
conc.concordance(example_tok_text,'concluded',5,pretty=True)
# Update to use lex.tokenize once it's ready

Pretty printing has been turned ON


['                                  Andrea  concluded   her speech with a joke                  ',
 '                    with a joke . Andrea  concluded   that he was telling the                 ']
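
To make clear what a concordance line contains, here is a minimal pure-Python sketch of a key-word-in-context search. It is not the `pelitk` implementation; the function name `kwic` and its `span` parameter are assumptions made for illustration.

```python
def kwic(tokens, node, span=5):
    """Return a (left context, node, right context) tuple for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines


tokens = ["Andrea", "concluded", "her", "speech", "with", "a", "joke", ".", "Andrea",
          "concluded", "that", "he", "was", "telling", "the", "truth", "."]

lines = kwic(tokens, "concluded")
for line in lines:
    print(line)
```

Each tuple holds up to `span` tokens on either side of the node; a pretty-printing step would then pad these strings so that the node words line up in a column.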

However, as we want to find concordances for a **list** of node words, an additional function needs to be created.

In [8]:
# Create the get_concs function, which builds a concordance line for each occurrence of any item in our list

def get_concs(tok_text, forms_list):
    conclist = []
    for x in tok_text:
        if x.lower() in forms_list:
            conclist.append(conc.concordance(tok_text, x, 5))
    return [x for y in list(unique_everseen(conclist)) for x in y]

Testing out this function on the same example text, we see that it returns concordance lines for _concluded_ and _speech_.

In [9]:
example_forms_list = ['concluded','speech']

get_concs(example_tok_text, example_forms_list)

#The 'prettify' function can also be applied if desired:
conc.prettify(get_concs(example_tok_text, example_forms_list))

[('    Andrea', 'concluded', 'her speech with a joke'),
 ('with a joke . Andrea', 'concluded', 'that he was telling the'),
 ('  Andrea concluded her', 'speech', 'with a joke . Andrea')]

['                                  Andrea  concluded   her speech with a joke                  ',
 '                    with a joke . Andrea  concluded   that he was telling the                 ',
 '                    Andrea concluded her    speech    with a joke . Andrea                    ']
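
Because `get_concs` calls the concordance function once per matching token, a form that occurs twice in a text yields the same set of lines twice; `unique_everseen` removes the repeats while keeping their order. The sketch below imitates that step in plain Python (`dedupe_keep_order` is a hypothetical helper, not part of `more_itertools` or `pelitk`):

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    unique = []
    for item in items:
        if item not in unique:  # list membership also works for unhashable items
            unique.append(item)
    return unique


# Both occurrences of 'concluded' trigger a search that returns both lines
lines_for_concluded = [
    ("Andrea", "concluded", "her speech with a joke"),
    ("with a joke . Andrea", "concluded", "that he was telling the"),
]
conclist = [lines_for_concluded, lines_for_concluded]

# Deduplicate the groups, then flatten them into a single list of lines
flat = [line for group in dedupe_keep_order(conclist) for line in group]
print(len(flat))
```

Without the deduplication step, the flattened list would contain every line once per occurrence of its form in the text.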

We can now apply this new function to our entire dataframe, creating a new `concordance` column which will include all the concordances for the verbs in our list which appear in each text.

In [10]:
pelic_df['concordance'] = pelic_df['lemmas'].apply(lambda x: get_concs(x,verbs))

In [11]:
# Check how many texts in PELIC contain our target items

print('Total number of texts:',len(pelic_df))
print('Number of texts containing at least one lemma from verb list',len(pelic_df.loc[~pelic_df.concordance.str.len().eq(0)]))
print('Percentage of texts containing lemma from verb list:', 
      round((len(pelic_df.loc[~pelic_df.concordance.str.len().eq(0)])/len(pelic_df))*100,2),'%')
pelic_df.loc[~pelic_df.concordance.str.len().eq(0)].sample(5) # Sample of five rows

Total number of texts: 12624
Number of texts containing at least one lemma from verb list 1856
Percentage of texts containing lemma from verb list: 14.7 %


answer_id,anon_id,L1,gender,level_id,class_id,question_id,version,text_len,text,tokens,tok_POS,lemmas,lemma_POS,concordance
29379,gj2,Arabic,Male,5,w,3899,1,467,Favorite Restaurant\n My friend graduated from...,"[Favorite, Restaurant, My, friend, graduated, ...","[(Favorite, NNP), (Restaurant, NNP), (My, NNP)...","[favorite, restaurant, my, friend, graduate, f...","[(favorite, NNP), (restaurant, NNP), (my, NNP)...","[(, he call me and, suggest, two restaurant th..."
45035,fw7,Chinese,Female,4,w,5717,1,484,\n Topic: Online Learning \n\n Nowadays is a m...,"[Topic, :, Online, Learning, Nowadays, is, a, ...","[(Topic, NN), (:, :), (Online, NNP), (Learning...","[topic, :, online, ##NO-MATCHING-POS##, nowada...","[(topic, NN), (:, :), (online, NNP), (##NO-MAT...","[(. next , let us, consider, that if student t..."
23696,ck4,Arabic,Male,4,w,3072,1,825,"Nowadays, prices and life expenses have becom...","[Nowadays, ,, prices, and, life, expenses, hav...","[(Nowadays, NNS), (,, ,), (prices, NNS), (and,...","[nowadays, ,, price, and, life, expense, have,...","[(nowadays, NNS), (,, ,), (price, NNS), (and, ...","[(for another offer and he, realize, he have l..."
26580,ay3,Japanese,Female,5,w,3466,1,609,The Difference between Macintosh and Windows\n...,"[The, Difference, between, Macintosh, and, Win...","[(The, DT), (Difference, NNP), (between, IN), ...","[the, difference, between, macintosh, and, win...","[(the, DT), (difference, NNP), (between, IN), ...","[(, it be possible to, conclude, the macintosh..."
40665,ct4,Arabic,Male,4,w,5189,1,57,1. Chefs enjouy preparing usual meals.\n\n2. T...,"[1, ., Chefs, enjouy, preparing, usual, meals,...","[(1, CD), (., .), (Chefs, NNP), (enjouy, VBD),...","[1, ., chef, enjouy, prepare, usual, meal, ., ...","[(1, CD), (., .), (chef, NNP), (enjouy, VBD), ...","[(. the head of state, consider, form a allian..."
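
The mask `~pelic_df.concordance.str.len().eq(0)` keeps rows whose `concordance` list is non-empty. A small sketch on invented data (`toy_df` is hypothetical) shows the same selection written as `.str.len() > 0`, computed once and reused:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"concordance": [[("i", "realize", "that")], [], [("we", "admit", "it")]]},
    index=pd.Index([10, 11, 12], name="answer_id"),
)

# True where the list of concordance lines has at least one entry
has_conc = toy_df["concordance"].str.len() > 0

print("Texts with a match:", has_conc.sum())
print("Percentage:", round(has_conc.mean() * 100, 2), "%")
print(toy_df.loc[has_conc])
```

Storing the mask in a variable avoids recomputing the list lengths for each count, percentage, and selection.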


Since we are only interested in texts containing lemmas from our verb list, we will remove all other rows.

In [12]:
verbs_df = pelic_df.loc[~pelic_df.concordance.str.len().eq(0)]

#### Labelling the concordance lines
To be able to easily refer back to concordance lines later, it is useful to attach identifying information to each one, e.g. what the node word is and where it can be found in the text (i.e. the index).

In [13]:
# Create function to label each concordance with the offset, i.e. the node and its index in the text

def get_offset(tok_list, forms):
    new_list =  [x.lower() for x in tok_list.copy()] # lower case all tokens in text
    new_forms = forms.copy()
    return [x for x in list(enumerate(new_list)) if x[1] in new_forms] # enumerate function finds index of each token in the text

In [14]:
# Create an offset column by applying the above function

verbs_df['offset'] = verbs_df.lemmas.apply(lambda x: get_offset(x,verbs))
verbs_df.head()

answer_id,anon_id,L1,gender,level_id,class_id,question_id,version,text_len,text,tokens,tok_POS,lemmas,lemma_POS,concordance,offset
34,bf0,Arabic,Male,4,w,10,1,171,How to fail a test\n\n There are several ways ...,"[How, to, fail, a, test, There, are, several, ...","[(How, WRB), (to, TO), (fail, VB), (a, DT), (t...","[how, to, fail, a, test, there, be, several, w...","[(how, WRB), (to, TO), (fail, VB), (a, DT), (t...","[(with you then , i, recommend, you to cheat i...","[(114, recommend)]"
111,cz7,Arabic,Male,4,w,15,1,51,Every one like a special kind of food. For me ...,"[Every, one, like, a, special, kind, of, food,...","[(Every, DT), (one, CD), (like, IN), (a, DT), ...","[every, one, like, a, special, kind, of, food,...","[(every, DT), (one, CD), (like, IN), (a, DT), ...","[(specially kabsah and i will, explain, how to...","[(22, explain)]"
119,cs3,Japanese,Male,4,w,6,1,235,It's difficult to succeed in getting a higher ...,"[It, 's, difficult, to, succeed, in, getting, ...","[(It, PRP), ('s, VBZ), (difficult, JJ), (to, T...","[it, 's, difficult, to, succeed, in, get, a, h...","[(it, PRP), ('s, VBZ), (difficult, JJ), (to, T...","[(. but he would not, realize, that it 's the ...","[(200, realize), (231, explain)]"
133,az2,Korean,Male,5,w,17,1,130,"When I was in Germany, I met a friend who was ...","[When, I, was, in, Germany, ,, I, met, a, frie...","[(When, WRB), (I, PRP), (was, VBD), (in, IN), ...","[when, i, be, in, germany, ,, i, meet, a, frie...","[(when, WRB), (i, PRP), (be, VBD), (in, IN), (...","[(about my country . i, realize, i be wrong wh...","[(113, realize)]"
152,dj0,Korean,Female,5,w,4,1,299,There are many qualities of a good neighbor in...,"[There, are, many, qualities, of, a, good, nei...","[(There, EX), (are, VBP), (many, JJ), (qualiti...","[there, be, many, quality, of, a, good, neighb...","[(there, EX), (be, VBP), (many, JJ), (quality,...","[(bad neighbor once , i, realize, a big instru...","[(308, realize), (321, consider)]"
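
One use of the offset is to get back to the surrounding context directly from the lemma list. The sketch below shows the idea on the earlier example sentence; `context_window` is a hypothetical helper and the lemma list is written out by hand for illustration:

```python
def context_window(lemmas, index, span=5):
    """Return the lemmas within `span` positions of `index`, node included."""
    return lemmas[max(0, index - span):index + span + 1]


lemmas = ["andrea", "conclude", "her", "speech", "with", "a", "joke", ".",
          "andrea", "conclude", "that", "he", "be", "tell", "the", "truth", "."]
forms = ["conclude"]

# Same idea as get_offset: pair each matching lemma with its position
offsets = [(i, lemma) for i, lemma in enumerate(lemmas) if lemma in forms]
print(offsets)

# Recover the context of the second match from its index
index, node = offsets[1]
window = context_window(lemmas, index)
print(" ".join(window))
```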


In [15]:
# For clarity, we want to sort the concordance and offset columns alphabetically by the node word

# Sort offset column
for x in verbs_df['offset']:
    x.sort(key = operator.itemgetter(1))
    
    
# Sort concordance column
verbs_df['concordance'] = [sorted(x, key=lambda x: x[1]) for x in verbs_df.concordance]

In [16]:
verbs_df.head(1)

answer_id,anon_id,L1,gender,level_id,class_id,question_id,version,text_len,text,tokens,tok_POS,lemmas,lemma_POS,concordance,offset
34,bf0,Arabic,Male,4,w,10,1,171,How to fail a test\n\n There are several ways ...,"[How, to, fail, a, test, There, are, several, ...","[(How, WRB), (to, TO), (fail, VB), (a, DT), (t...","[how, to, fail, a, test, there, be, several, w...","[(how, WRB), (to, TO), (fail, VB), (a, DT), (t...","[(with you then , i, recommend, you to cheat i...","[(114, recommend)]"
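
Sorting both columns by node word does more than tidy the display. Judging from the cells above, `get_concs` groups concordance lines by form, while `get_offset` lists matches in text order, so the two lists can be ordered differently when forms interleave in a text. Because Python's sort is stable, sorting both by node word lines them up again, which matters if the two lists are later paired item by item. A sketch with invented values:

```python
from operator import itemgetter

# Text order: realize, explain, realize
offsets = [(10, "realize"), (20, "explain"), (30, "realize")]

# Grouped by form: both 'realize' lines first, then 'explain'
concs = [
    ("left a", "realize", "right a"),
    ("left c", "realize", "right c"),
    ("left b", "explain", "right b"),
]

# Pairing the unsorted lists mismatches the second and third items
unsorted_pairs = list(zip(offsets, concs))
print([off[1] == conc[1] for off, conc in unsorted_pairs])

# A stable sort by node word restores the alignment
offsets_sorted = sorted(offsets, key=itemgetter(1))
concs_sorted = sorted(concs, key=itemgetter(1))
sorted_pairs = list(zip(offsets_sorted, concs_sorted))
print([off[1] == conc[1] for off, conc in sorted_pairs])
```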


#### Creating the conc_df
Up until now, the dataframe has had each **text** as a row. However, we may want to have each **concordance** as a row.

In [17]:
# Create the new dataframe based on verbs_df

# Keep only most relevant columns and reset the index 
conc_df = verbs_df[['level_id','L1','gender','offset','concordance','text']].reset_index()
conc_df.head(2)

,answer_id,level_id,L1,gender,offset,concordance,text
0,34,4,Arabic,Male,"[(114, recommend)]","[(with you then , i, recommend, you to cheat i...",How to fail a test\n\n There are several ways ...
1,111,4,Arabic,Male,"[(22, explain)]","[(specially kabsah and i will, explain, how to...",Every one like a special kind of food. For me ...


**Note:** Depending on the desired analysis, we could keep other columns too, e.g. `question_id` to analyze the prompts or `text_len` to consider the impact of text length. For clarity, here we keep only the essential information.

In [18]:
# Create a new column with tuples of the offset and the concordance and delete old offset and concordance columns
conc_df['offset_conc'] = list(zip(conc_df.offset, conc_df.concordance)) # Zip together two columns
conc_df['offset_conc'] = [list(zip(x[0],x[1])) for x in conc_df.offset_conc] # Zip together the items in each row
conc_df = conc_df.drop(['offset','concordance'], axis = 1)

conc_df.head(2)

,answer_id,level_id,L1,gender,text,offset_conc
0,34,4,Arabic,Male,How to fail a test\n\n There are several ways ...,"[((114, recommend), (with you then , i, recomm..."
1,111,4,Arabic,Male,Every one like a special kind of food. For me ...,"[((22, explain), (specially kabsah and i will,..."


In [19]:
# 'Explode' the dataframe so that each item in the 'offset_conc' column becomes its own row
conc_df = conc_df.explode('offset_conc')
conc_df = conc_df.reset_index(drop=True)

In [20]:
# Re-split offset_conc into two columns, then drop it
conc_df['offset'] = [x[0] for x in conc_df.offset_conc]
conc_df['concordance'] = [x[1] for x in conc_df.offset_conc] # NOTE: commas are around the node word (not from Ss)
conc_df = conc_df.drop(['offset_conc'], axis = 1)
conc_df.head()

,answer_id,level_id,L1,gender,text,offset,concordance
0,34,4,Arabic,Male,How to fail a test\n\n There are several ways ...,"(114, recommend)","(with you then , i, recommend, you to cheat in..."
1,111,4,Arabic,Male,Every one like a special kind of food. For me ...,"(22, explain)","(specially kabsah and i will, explain, how to ..."
2,119,4,Japanese,Male,It's difficult to succeed in getting a higher ...,"(231, explain)","(high education . 2 ., explain, a difficulty a..."
3,119,4,Japanese,Male,It's difficult to succeed in getting a higher ...,"(200, realize)","(. but he would not, realize, that it 's the c..."
4,133,5,Korean,Male,"When I was in Germany, I met a friend who was ...","(113, realize)","(about my country . i, realize, i be wrong whe..."
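
The zip, explode, and re-split steps can be followed end to end on a small invented dataframe (`toy_df` and its values are hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "answer_id": [119, 133],
        "offset": [[(231, "explain"), (200, "realize")], [(113, "realize")]],
        "concordance": [
            [("left a", "explain", "right a"), ("left b", "realize", "right b")],
            [("left c", "realize", "right c")],
        ],
    }
)

# Pair each offset with its concordance line, row by row
toy_df["offset_conc"] = [
    list(zip(offsets, concs))
    for offsets, concs in zip(toy_df["offset"], toy_df["concordance"])
]
toy_df = toy_df.drop(columns=["offset", "concordance"])

# One row per (offset, concordance) pair
toy_df = toy_df.explode("offset_conc").reset_index(drop=True)

# Split the pairs back into two columns
toy_df["offset"] = [pair[0] for pair in toy_df["offset_conc"]]
toy_df["concordance"] = [pair[1] for pair in toy_df["offset_conc"]]
toy_df = toy_df.drop(columns="offset_conc")

print(toy_df)
```

Zipping before exploding keeps each offset attached to its own concordance line, so the two cannot drift apart when the rows are expanded.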


In [21]:
# Sort dataframe by node word, then answer_id, then offset number
conc_df['offset_num'] = [x[0] for x in conc_df.offset]
conc_df['offset_word'] = [x[1] for x in conc_df.offset]
conc_df = conc_df.sort_values(by = ['offset_word', 'answer_id','offset_num']).reset_index(drop=True)
conc_df = conc_df.drop(['offset_num','offset_word'], axis = 1)

In [22]:
# Dropping duplicates - from manual checking, it was found that some texts have two versions back to back, 
# some might be repetitions of task prompts, and some are mislabelled as version 1.
len(conc_df)
conc_df = conc_df.drop_duplicates(subset='concordance', keep="first")
conc_df.head()
len(conc_df)

2795

,answer_id,level_id,L1,gender,text,offset,concordance
0,1551,5,Chinese,Male,"In Chinese saying, students are ""like warms in...","(442, admit)","(them . most of us, admit, that this behavior ..."
1,1736,4,Korean,Female,Cats and dogs are the most popular pets in the...,"(312, admit)","(their master . when it, admit, someone as a m..."
2,1736,4,Korean,Female,Cats and dogs are the most popular pets in the...,"(384, admit)","(us that cat do not, admit, its master . there..."
5,2534,4,Arabic,Male,Writing in my language makes me do my best to ...,"(134, admit)","(english , i have to, admit, the new vocabular..."
6,3830,4,Korean,Female,\nWriting 4P\n\n31 Jan 07\n\nThe Problem and S...,"(283, admit)","(to understanding each other and, admit, the f..."


2418
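
Note that the index in the output above skips from 2 to 5: `drop_duplicates` keeps the original row labels of the rows that survive. The sketch below, on invented data (`toy_df` is hypothetical), shows this behavior and how `reset_index(drop=True)` renumbers the rows if a continuous index is wanted:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "answer_id": [1, 1, 2, 3],
        "concordance": [
            ("we all", "admit", "that it"),
            ("we all", "admit", "that it"),    # same line repeated within a text
            ("i must", "admit", "the fault"),
            ("we all", "admit", "that it"),    # same line in a different text
        ],
    }
)

deduped = toy_df.drop_duplicates(subset="concordance", keep="first")
print(list(deduped.index))      # original labels survive, so gaps appear

renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))   # continuous again
```

Because only the `concordance` column is compared, an identical line in a different text is dropped as well, which is worth keeping in mind when interpreting counts per text.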

#### Concordance csv file
It may be useful to create a csv file of the conc_df dataframe, e.g. to use in annotations, to share, or to use with other programs. Here, we create a simple csv containing just the identifying information and concordances.

In [23]:
# Zip desired columns together to prepare the concordance csv
conc_csv = zip(conc_df.answer_id, conc_df.offset, conc_df.concordance)
conc_csv = list(conc_csv)

In [24]:
# Run the prettify function to prepare the csv for printing
conc_csv = [(x[0], x[1], conc.prettify([x[2]])) for x in conc_csv]

In [25]:
len(conc_csv)
conc_csv[:5]

2418

[(1551,
  (442, 'admit'),
  ['                       them . most of us    admit     that this behavior will lead            ']),
 (1736,
  (312, 'admit'),
  ['                  their master . when it    admit     someone as a master ,                   ']),
 (1736,
  (384, 'admit'),
  ['                      us that cat do not    admit     its master . there be                   ']),
 (2534,
  (134, 'admit'),
  ['                     english , i have to    admit     the new vocabulary word be              ']),
 (3830,
  (283, 'admit'),
  ['         to understanding each other and    admit     the fault each of them                  '])]

In [26]:
# Or alternatively a list with only the concordance lines
conc_simple_csv = [x[2] for x in conc_csv]
conc_simple_csv[:5]

[['                       them . most of us    admit     that this behavior will lead            '],
 ['                  their master . when it    admit     someone as a master ,                   '],
 ['                      us that cat do not    admit     its master . there be                   '],
 ['                     english , i have to    admit     the new vocabulary word be              '],
 ['         to understanding each other and    admit     the fault each of them                  ']]
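
The cells above prepare the rows but stop short of writing a file. One possible way to write them with the standard-library `csv` module is sketched below; the column names and the file name `concordances.csv` are assumptions, and an in-memory buffer stands in for the file:

```python
import csv
import io

# Two rows in the (answer_id, offset, concordance) shape used above
rows = [
    (1551, (442, "admit"), ("them . most of us", "admit", "that this behavior will lead")),
    (1736, (312, "admit"), ("their master . when it", "admit", "someone as a master ,")),
]

buffer = io.StringIO()  # stands in for open("concordances.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["answer_id", "offset", "node", "left", "right"])
for answer_id, (offset_num, node), (left, _, right) in rows:
    writer.writerow([answer_id, offset_num, node, left, right])

print(buffer.getvalue())
```

Keeping the left context, node, and right context in separate columns lets spreadsheet software sort on either side of the node; the `csv` module quotes any field that contains a comma.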

## Summary

In this notebook, we have accomplished the following:
- narrowed PELIC down to just those texts containing the verbs we are interested in
- organized these data into three useful formats:
    - a dataframe where each row is a text containing the key verbs
    - a dataframe where each row is one concordance line
    - a csv in a traditional concordance format

These data could then be analyzed qualitatively by reading the concordance lines vertically and looking for patterns. Further quantitative analysis can also be conducted, considering factors like level, L1, gender, etc., or extracting patterns related to these node words. In the case of our original research questions, the next step would be to parse the sentences we have identified, in order to determine whether NPs or CPs follow the verbs.

## Contact
If you have any questions about this tutorial, or more generally about using PELIC, please contact Ben Naismith at bnaismith@pitt.edu. Thank you for visiting.

[Back to top](#top)