## Data analysis
--- 
`NEW CONTINUING` notebook, picking up where the [data_curation_cont](../notebooks/data_curation_cont.ipynb) notebook left off.

Data processing pipeline: 
- [`data_curation.ipynb`](../notebooks/data_curation.ipynb)
- [`data_curation_cont.ipynb`](../notebooks/data_curation_cont.ipynb)
-  `data_analysis.ipynb` << You are here.

In [3]:
# loading required libraries
import nltk, pickle, pprint, csv, re, pylangacq
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# pretty printing for readability
cp = pprint.PrettyPrinter(compact=True, sort_dicts=True)

# loading data from last notebook (context managers close the files after reading)
with open("../data/Lcorpus_cont.pkl", 'rb') as f:
    Lcorpus = pickle.load(f)
with open("../data/Ncorpus_cont.pkl", 'rb') as f:
    Ncorpus = pickle.load(f)

According to R. Brown (1973), the study that served as the starting point of acquisition order research, English morphemes are acquired in the L1 in the following order:

| Rank        | Morpheme    |
| ----------- | ----------- |
| 1   | Present progressive (*-ing*)    |
| 2/3   | *in, on*       |
| 4   | Plural (*-s*)  |
| 5   | Past irregular      |
| 6   | Possessive (*-'s*)   |
| 7  | Uncontractible copula (*is, am, are*)   |
| 8  | Articles (*a, the*)   |
| 9   | Past regular (*-ed*)      |
| 10   | Third person singular (*-s*)     |
| 11   | Third person irregular     |
| 12   | Uncontractible auxiliary (*is, am, are*)  |
| 13  | Contractible copula  |
| 14  | Contractible auxiliary   |

This project will not analyze all of these, but I will attempt to cover most of them.

To begin the analysis, I'll extract and count instances of particular morphemes from each text. First, I'll test this out on a single row using the present progressive (verb suffix *-ing*). The MOR annotation scheme for the TalkBank corpora can be found [here](https://talkbank.org/manuals/MOR.html#_Toc65933281).

In [4]:
Ncorpus.head(1)

Unnamed: 0,Filename,Participant,Age,Tokens,POS,Morphemes
0,03\03a.cha,11312/c-00020713-1,3;01,"[., when, he's, sleeping, ,, ., and, his, frog...","[None, conj, pro:sub, aux, part, cm, ., coord,...","[None, when, he, be&3S, sleep-PRESP, cm, , and..."


In [5]:
# -PRESP is the TalkBank MOR annotation for a verb in the present progressive
pattern = r'\w*-PRESP\b'
# sample row
presp_test = Ncorpus.Morphemes[0]
# find all present progressive morphemes
presps = re.findall(pattern, ' '.join(str(x) for x in presp_test))
print(presps, '\ncount:', len(presps))

['sleep-PRESP', 'get-PRESP', 'stand-PRESP', 'run-PRESP'] 
count: 4


The first participant in our data frame, age 3 years and 1 month, used the present progressive 4 times: 'sleeping', 'getting', 'standing', and 'running'.

Next, I'll wrap this in a function and apply it to the rest of the data.

In [6]:
def get_presp(x):
    pattern = r'\w*-PRESP\b'
    presps = re.findall(pattern, ' '.join(str(y) for y in x))
    return presps

In [7]:
# native speaker corpus 
Ncorpus['PresP_Count'] = Ncorpus.Morphemes.apply(get_presp).str.len()
Ncorpus.head()

Unnamed: 0,Filename,Participant,Age,Tokens,POS,Morphemes,PresP_Count
0,03\03a.cha,11312/c-00020713-1,3;01,"[., when, he's, sleeping, ,, ., and, his, frog...","[None, conj, pro:sub, aux, part, cm, ., coord,...","[None, when, he, be&3S, sleep-PRESP, cm, , and...",4
1,03\03b.cha,11312/c-00020714-1,3;04,"[they're, looking, at, it, ., and, there's, a,...","[pro:sub, aux, part, prep, pro:per, ., coord, ...","[they, be&PRES, look-PRESP, at, it, , and, the...",6
2,03\03c.cha,11312/c-00020715-1,3;04,"[there's, a, frog, in, there, ., he's, in, the...","[pro:exist, cop, det:art, n, prep, adv, ., pro...","[there, be&3S, a, frog, in, there, , he, be&3S...",8
3,03\03d.cha,11312/c-00020716-1,3;05,"[a, frog, a, person, ., a, person, ., a, boot,...","[det:art, n, det:art, n, ., det:art, n, ., det...","[a, frog, a, person, , a, person, , a, boot, ,...",23
4,03\03e.cha,11312/c-00020717-1,3;08,"[., there's, a, dog, ., and, there's, a, frog,...","[None, pro:exist, cop, det:art, n, ., coord, p...","[None, there, be&3S, a, dog, , and, there, be&...",3


In [8]:
# learner corpus
Lcorpus['PresP_Count'] = Lcorpus.Morphemes.apply(get_presp).str.len()
Lcorpus.head()

Unnamed: 0,Filename,Participant,Anon_ID,L1,Age,Education,Years_Learn,Years_Env,Tokens,POS,Morphemes,PresP_Count
0,Vercellotti\1060_3G1.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[my, topic, is, describe, your, favorite, meal...","[det:poss, n, cop, v, det:poss, adj, n, prep, ...","[my, topic, be&3S, describe, your, favorite, m...",5
1,Vercellotti\1060_3G2.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, transportation, ., in, this, ...","[det:art, n, cop, n, ., prep, det:dem, n, qn, ...","[the, topic, be&3S, transport&dv-ATION, , in, ...",0
2,Vercellotti\1060_3G3.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, someone, I, admire, ., I'll, ...","[det:art, n, cop, pro:indef, pro:sub, v, ., pr...","[the, topic, be&3S, someone, I, admire, , I, w...",0
3,Vercellotti\1060_4P1.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, talking, about, a, problem, i...","[det:art, n, aux, part, prep, det:art, n, prep...","[the, topic, be&3S, talk-PRESP, about, a, prob...",4
4,Vercellotti\1060_4P2.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, talk, about, something, I, re...","[det:art, n, cop, v, adv, pro:indef, pro:sub, ...","[the, topic, be&3S, talk, about, something, I,...",3


Next, I'll do the same for the other important morphemes.

In [9]:
# in
def get_in(x):
    pattern = r'\bin\b'
    ins = re.findall(pattern, ' '.join(str(y) for y in x))
    return ins

# adding data to the data frames
Ncorpus['In_Count'] = Ncorpus.Morphemes.apply(get_in).str.len()
Lcorpus['In_Count'] = Lcorpus.Morphemes.apply(get_in).str.len()

In [10]:
# on
def get_on(x):
    pattern = r'\bon\b'
    ons = re.findall(pattern, ' '.join(str(y) for y in x))
    return ons

# adding data to the data frames
Ncorpus['On_Count'] = Ncorpus.Morphemes.apply(get_on).str.len()
Lcorpus['On_Count'] = Lcorpus.Morphemes.apply(get_on).str.len()

In [11]:
# past irregular
def get_pastirr(x):
    pattern = r'\w*&PAST\b'
    pastirr = re.findall(pattern, ' '.join(str(y) for y in x))
    return pastirr
# adding data to the data frames
Ncorpus['PastIrr_Count'] = Ncorpus.Morphemes.apply(get_pastirr).str.len()
Lcorpus['PastIrr_Count'] = Lcorpus.Morphemes.apply(get_pastirr).str.len()

In [12]:
# possessives
def get_poss(x):
    pattern = r'\w*-POSS\b'
    poss = re.findall(pattern, ' '.join(str(y) for y in x))
    return poss

# adding data to the data frames
Ncorpus['Poss_Count'] = Ncorpus.Morphemes.apply(get_poss).str.len()
Lcorpus['Poss_Count'] = Lcorpus.Morphemes.apply(get_poss).str.len()

In [13]:
# copula
def get_cop(x):
    pattern = r'cop'
    cops = re.findall(pattern, ' '.join(str(y) for y in x))
    return cops

# adding data to the data frames
Ncorpus['Cop_Count'] = Ncorpus.POS.apply(get_cop).str.len()
Lcorpus['Cop_Count'] = Lcorpus.POS.apply(get_cop).str.len()

In [14]:
# articles
def get_art(x):
    pattern = r'det:art'
    arts = re.findall(pattern, ' '.join(str(y) for y in x))
    return arts

# adding data to the data frames
Ncorpus['Art_Count'] = Ncorpus.POS.apply(get_art).str.len()
Lcorpus['Art_Count'] = Lcorpus.POS.apply(get_art).str.len()

In [15]:
# past regular
def get_pastreg(x):
    pattern = r'\w*-PAST\b'
    pastreg = re.findall(pattern, ' '.join(str(y) for y in x))
    return pastreg

# adding data to the data frames
Ncorpus['PastReg_Count'] = Ncorpus.Morphemes.apply(get_pastreg).str.len()
Lcorpus['PastReg_Count'] = Lcorpus.Morphemes.apply(get_pastreg).str.len()

In [16]:
# third person singular
def get_tps(x):
    pattern = r'\w*-3S\b'
    tps = re.findall(pattern, ' '.join(str(y) for y in x))
    return tps

# adding data to the data frames
Ncorpus['3PS_Count'] = Ncorpus.Morphemes.apply(get_tps).str.len()
Lcorpus['3PS_Count'] = Lcorpus.Morphemes.apply(get_tps).str.len()

In [17]:
# third person irregular
def get_tpirr(x):
    pattern = r'\w*&3S\b' 
    tpirr = re.findall(pattern, ' '.join(str(y) for y in x))
    return tpirr

# adding data to the data frames
Ncorpus['3PIrr_Count'] = Ncorpus.Morphemes.apply(get_tpirr).str.len()
Lcorpus['3PIrr_Count'] = Lcorpus.Morphemes.apply(get_tpirr).str.len()

In [18]:
# auxiliary
def get_aux(x):
    pattern = r'aux'
    aux = re.findall(pattern, ' '.join(str(y) for y in x))
    return aux

# adding data to the data frames
Ncorpus['Aux_Count'] = Ncorpus.POS.apply(get_aux).str.len()
Lcorpus['Aux_Count'] = Lcorpus.POS.apply(get_aux).str.len()
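
The counting cells above all follow one template: join the annotations into a string, run a regular expression, and count the matches. As a minimal sketch of how that template could be parameterized, the block below uses a hypothetical pattern table (`MORPHEME_PATTERNS`) and a hypothetical helper (`count_matches`); neither name appears in the notebook, and the sample list is made up for illustration. Only the patterns applied to the `Morphemes` column are included.

```python
import re

# Hypothetical pattern table; the regular expressions are copied from the cells above.
MORPHEME_PATTERNS = {
    "PresP": r"\w*-PRESP\b",    # present progressive
    "PastIrr": r"\w*&PAST\b",   # past irregular
    "PastReg": r"\w*-PAST\b",   # past regular
    "Poss": r"\w*-POSS\b",      # possessive
    "3PS": r"\w*-3S\b",         # third person singular
    "3PIrr": r"\w*&3S\b",       # third person irregular
}

def count_matches(items, pattern):
    """Join a list of annotations with spaces and count regex matches."""
    text = " ".join(str(item) for item in items)
    return len(re.findall(pattern, text))

# Made-up sample in the style of the Morphemes column
sample = [None, "when", "he", "be&3S", "sleep-PRESP", "cm", "and", "walk-PAST", "go&PAST"]
counts = {name: count_matches(sample, pat) for name, pat in MORPHEME_PATTERNS.items()}
print(counts)
```

On the data frames, something like `Ncorpus.Morphemes.apply(count_matches, pattern=MORPHEME_PATTERNS['PresP'])` should reproduce `PresP_Count`. The same helper would apply to the `POS` column for the `cop`, `det:art`, and `aux` patterns.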

Morpheme counting is complete. Now for some means: by age for the native speaker corpus and by years of study for the learner corpus.

In [19]:
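# L1 corpus
# ages are 'years;months' strings; the float is a sortable label (3.1 means 3;10), not decimal years
Ncorpus['Age'] = Ncorpus['Age'].str.replace(';', '.', regex=False).astype(float)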
Ncorpus[['Age', 'PresP_Count', 'In_Count', 'On_Count', 'PastIrr_Count', 
        'Poss_Count', 'Cop_Count', 'Art_Count', 'PastReg_Count', 
        '3PS_Count', '3PIrr_Count', 'Aux_Count']].groupby("Age").mean()

Age,PresP_Count,In_Count,On_Count,PastIrr_Count,Poss_Count,Cop_Count,Art_Count,PastReg_Count,3PS_Count,3PIrr_Count,Aux_Count
3.01,4.0,0.0,2.0,4.0,0.0,1.0,5.0,1.0,0.0,5.0,3.0
3.04,7.0,4.5,0.0,0.5,0.5,5.5,7.0,0.5,0.0,7.0,2.0
3.05,23.0,5.0,6.0,12.0,0.0,12.0,42.0,0.0,2.0,11.0,10.0
3.08,3.0,4.0,0.0,4.0,0.0,6.0,9.0,1.0,0.0,10.0,4.0
3.09,8.25,3.0,1.25,5.75,0.25,4.75,20.5,5.75,0.5,11.25,8.75
3.1,7.0,3.0,0.0,2.5,0.0,2.5,18.0,1.5,1.0,8.5,8.0
3.11,4.5,3.5,2.5,10.0,0.5,1.5,14.0,3.0,0.0,1.0,1.0
4.04,2.0,3.0,2.0,16.0,0.0,5.0,35.0,11.0,0.0,1.0,4.0
4.06,4.5,5.0,0.5,4.5,0.0,3.0,17.0,6.0,3.0,6.0,5.5
4.07,9.333333,3.333333,4.0,10.0,0.666667,4.333333,18.333333,5.666667,4.0,6.666667,6.0
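
Note that the `Age` values in this table are years;months labels written with a decimal point, so 3.1 stands for 3;10, not 3.1 years. If a true numeric age is needed later (for plotting or correlation, say), one option is to convert the original strings to months. The sketch below assumes ages of the form `years;months`, optionally followed by `.days`; the function name `age_to_months` is hypothetical.

```python
def age_to_months(age):
    """Convert a 'years;months' string such as '3;01' to total months.

    An optional '.days' suffix (e.g. '3;01.15') is ignored.
    """
    years, rest = age.split(";")
    months = rest.split(".")[0]
    return int(years) * 12 + int(months or 0)

ages = ["3;01", "3;10", "4;07"]
months = [age_to_months(a) for a in ages]
print(months)
```

This would have to be applied to the `Age` column before it is converted to float.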


In [20]:
# L2 corpus
Lcorpus['Years_Learn'] = pd.Categorical(Lcorpus['Years_Learn'],
                                        ['less than 1 year', '1-2 years', '3-5 years',
                                         'more than 5 years'])
Lcorpus[['Years_Learn', 'PresP_Count', 'In_Count', 'On_Count', 'PastIrr_Count', 
        'Poss_Count', 'Cop_Count', 'Art_Count', 'PastReg_Count', 
        '3PS_Count', '3PIrr_Count', 'Aux_Count']].groupby("Years_Learn").mean()

Years_Learn,PresP_Count,In_Count,On_Count,PastIrr_Count,Poss_Count,Cop_Count,Art_Count,PastReg_Count,3PS_Count,3PIrr_Count,Aux_Count
less than 1 year,1.72,3.48,0.2,3.92,0.04,6.72,8.4,1.36,0.32,4.6,1.96
1-2 years,1.645161,3.096774,0.516129,2.645161,0.0,4.806452,6.806452,0.806452,0.451613,3.83871,1.806452
3-5 years,2.043478,3.434783,0.130435,2.565217,0.130435,6.478261,9.26087,0.695652,0.391304,5.0,1.73913
more than 5 years,1.870968,3.344086,0.430108,1.946237,0.053763,5.397849,9.021505,0.784946,0.591398,4.322581,1.634409


These data frames are not the easiest to parse. 

A brief glance at the native speaker corpus suggests that the occurrence of some morphemes increases with age, in a way that loosely follows the proposed acquisition order.

However, the learner corpus doesn't appear to demonstrate any similar pattern at this stage.

Mean morpheme counts alone are also not necessarily informative, since a longer text offers more opportunities to use any given morpheme. These values need to be normalized so that they are not affected by other factors such as the length of the text.
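
One common way to normalize is to express each count as a rate per 100 word tokens. The sketch below is only an illustration under stated assumptions: the helper names (`word_count`, `rate_per_100`) are hypothetical, the token list is made up, and punctuation-only tokens and `None` are excluded from the denominator, which is a design choice rather than something the notebook specifies.

```python
def word_count(tokens):
    """Count tokens containing at least one letter or digit (skips '.', ',' and None)."""
    return sum(
        1 for tok in tokens
        if tok is not None and any(ch.isalnum() for ch in str(tok))
    )

def rate_per_100(count, tokens):
    """Return occurrences per 100 word tokens, or 0.0 for an empty text."""
    n_words = word_count(tokens)
    if n_words == 0:
        return 0.0
    return 100.0 * count / n_words

# Made-up token list in the style of the Tokens column
tokens = [".", "when", "he's", "sleeping", ",", ".", "and", "his", "frog", "ran", "away"]
presp_rate = rate_per_100(1, tokens)
print(word_count(tokens), presp_rate)
```

On the data frames this could be applied row by row, for example `Ncorpus.apply(lambda row: rate_per_100(row['PresP_Count'], row['Tokens']), axis=1)`, before taking group means.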

Some visualizations should also help make sense of this. To be continued.