## Data analysis
--- 
`NEW CONTINUING` notebook, picking up where the [data_curation_cont](../notebooks/data_curation_cont.ipynb) notebook left off.

Data processing pipeline: 
- [`data_curation.ipynb`](../notebooks/data_curation.ipynb)
- [`data_curation_cont.ipynb`](../notebooks/data_curation_cont.ipynb)
- `data_analysis.ipynb` << You are here.

In [14]:
# loading required libraries
import nltk, pickle, pprint, csv, re, pylangacq
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# pretty printing for readability
cp = pprint.PrettyPrinter(compact=True, sort_dicts=True)

# loading data from last notebook (context managers close the files after reading)
with open("../data/Lcorpus_cont.pkl", 'rb') as f:
    Lcorpus = pickle.load(f)
with open("../data/Ncorpus_cont.pkl", 'rb') as f:
    Ncorpus = pickle.load(f)

To begin the analysis, I'll extract and count instances of particular morphemes in each text. First, I'll test the approach on a single row using the present progressive (verb suffix *-ing*).

In [87]:
Ncorpus.head(1)

Unnamed: 0,Filename,Participant,Age,Tokens,POS,Morphemes
0,03\03a.cha,11312/c-00020713-1,3;01,"[., when, he's, sleeping, ,, ., and, his, frog...","[None, conj, pro:sub, aux, part, cm, ., coord,...","[None, when, he, be&3S, sleep-PRESP, cm, , and..."


In [95]:
# -PRESP is the TalkBank annotation for a present participle (the -ing form used in the present progressive)
pattern = r'\w*-PRESP\b'
# sample row
presp_test = Ncorpus.Morphemes[0]
# find all present progressive morphemes
presps = re.findall(pattern, ' '.join(str(x) for x in presp_test))
print(presps, '\ncount:', len(presps))

['sleep-PRESP', 'get-PRESP', 'stand-PRESP', 'run-PRESP'] 
count: 4


The first participant in our data frame, age 3 years and 1 month, used the present progressive 4 times: 'sleeping', 'getting', 'standing', and 'running'.
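To make the matching step easier to follow, here is a minimal sketch on a made-up morpheme list (the list is hypothetical, not taken from the corpus). It shows why each element is passed through `str()` before joining: the `Morphemes` lists can contain `None`, which `join` would otherwise reject.

```python
import re

# Hypothetical morpheme list, shaped like one cell of the Morphemes column
morphemes = [None, "when", "he", "be&3S", "sleep-PRESP", "cm", "", "and", "frog-PL", "run-PRESP"]

# str() turns None into the string "None", so join() does not raise a TypeError
joined = " ".join(str(m) for m in morphemes)

# zero or more word characters, then the literal tag, then a word boundary
pattern = r"\w*-PRESP\b"
matches = re.findall(pattern, joined)

print(matches, "\ncount:", len(matches))
```

Only the two `-PRESP` items are returned; `frog-PL` and `be&3S` are ignored because they do not carry the tag.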

Next, I'll wrap this in a function and apply it to the rest of the data.

In [91]:
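def get_presp(x):
    """Return all -PRESP morphemes found in a list of morphemes."""
    pattern = r'\w*-PRESP\b'
    # str() handles None entries so the list can be joined into one string
    presps = re.findall(pattern, ' '.join(str(y) for y in x))
    return presps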

In [103]:
# native speaker corpus 
Ncorpus['PresP'] = Ncorpus.Morphemes.apply(get_presp)
Ncorpus['PresP_Count'] = Ncorpus['PresP'].str.len()
Ncorpus.head()

Unnamed: 0,Filename,Participant,Age,Tokens,POS,Morphemes,PresP,PresP_Count
0,03\03a.cha,11312/c-00020713-1,3;01,"[., when, he's, sleeping, ,, ., and, his, frog...","[None, conj, pro:sub, aux, part, cm, ., coord,...","[None, when, he, be&3S, sleep-PRESP, cm, , and...","[sleep-PRESP, get-PRESP, stand-PRESP, run-PRESP]",4
1,03\03b.cha,11312/c-00020714-1,3;04,"[they're, looking, at, it, ., and, there's, a,...","[pro:sub, aux, part, prep, pro:per, ., coord, ...","[they, be&PRES, look-PRESP, at, it, , and, the...","[look-PRESP, look-PRESP, get-PRESP, climb-PRES...",6
2,03\03c.cha,11312/c-00020715-1,3;04,"[there's, a, frog, in, there, ., he's, in, the...","[pro:exist, cop, det:art, n, prep, adv, ., pro...","[there, be&3S, a, frog, in, there, , he, be&3S...","[go-PRESP, go-PRESP, stand-PRESP, go-PRESP, sp...",8
3,03\03d.cha,11312/c-00020716-1,3;05,"[a, frog, a, person, ., a, person, ., a, boot,...","[det:art, n, det:art, n, ., det:art, n, ., det...","[a, frog, a, person, , a, person, , a, boot, ,...","[try-PRESP, try-PRESP, try-PRESP, try-PRESP, t...",23
4,03\03e.cha,11312/c-00020717-1,3;08,"[., there's, a, dog, ., and, there's, a, frog,...","[None, pro:exist, cop, det:art, n, ., coord, p...","[None, there, be&3S, a, dog, , and, there, be&...","[go-PRESP, call-PRESP, run-PRESP]",3


In [109]:
# learner corpus
Lcorpus['PresP'] = Lcorpus.Morphemes.apply(get_presp)
Lcorpus['PresP_Count'] = Lcorpus['PresP'].str.len()
Lcorpus.head()

Unnamed: 0,Filename,Participant,Anon_ID,L1,Age,Education,Years_Learn,Years_Env,Tokens,POS,Morphemes,PresP,PresP_Count
0,Vercellotti\1060_3G1.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[my, topic, is, describe, your, favorite, meal...","[det:poss, n, cop, v, det:poss, adj, n, prep, ...","[my, topic, be&3S, describe, your, favorite, m...","[eat-PRESP, eat-PRESP, eat-PRESP, try-PRESP, e...",5
1,Vercellotti\1060_3G2.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, transportation, ., in, this, ...","[det:art, n, cop, n, ., prep, det:dem, n, qn, ...","[the, topic, be&3S, transport&dv-ATION, , in, ...",[],0
2,Vercellotti\1060_3G3.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, someone, I, admire, ., I'll, ...","[det:art, n, cop, pro:indef, pro:sub, v, ., pr...","[the, topic, be&3S, someone, I, admire, , I, w...",[],0
3,Vercellotti\1060_4P1.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, talking, about, a, problem, i...","[det:art, n, aux, part, prep, det:art, n, prep...","[the, topic, be&3S, talk-PRESP, about, a, prob...","[talk-PRESP, look-PRESP, cause-PRESP, look-PRESP]",4
4,Vercellotti\1060_4P2.cha,1060,fm5,Arabic,19.0,level4,more than 5 years,less than 1 year,"[the, topic, is, talk, about, something, I, re...","[det:art, n, cop, v, adv, pro:indef, pro:sub, ...","[the, topic, be&3S, talk, about, something, I,...","[distribute-PRESP, face-PRESP, study-PRESP]",3


I'll do the same for other important morphemes: plural (noun suffixes *-s*, *-es*), irregular past (past tense verbs with irregular forms), and possessive (noun suffix *'s*).

In [115]:
# plural
def get_plural(x):
    pattern = r'\w*-PL\b'
    plurals = re.findall(pattern, ' '.join(str(y) for y in x))
    return plurals

# adding data to the data frames
Ncorpus['Plural'] = Ncorpus.Morphemes.apply(get_plural)
Ncorpus['Plural_Count'] = Ncorpus['Plural'].str.len()

Lcorpus['Plural'] = Lcorpus.Morphemes.apply(get_plural)
Lcorpus['Plural_Count'] = Lcorpus['Plural'].str.len()

In [119]:
# past irregular (fused forms are marked with '&' rather than the suffix marker '-')
def get_pastirr(x):
    pattern = r'\w*&PAST\b'
    pastirr = re.findall(pattern, ' '.join(str(y) for y in x))
    return pastirr

# adding data to the data frames
Ncorpus['PastIrr'] = Ncorpus.Morphemes.apply(get_pastirr)
Ncorpus['PastIrr_Count'] = Ncorpus['PastIrr'].str.len()

Lcorpus['PastIrr'] = Lcorpus.Morphemes.apply(get_pastirr)
Lcorpus['PastIrr_Count'] = Lcorpus['PastIrr'].str.len()

In [122]:
# possessives
def get_poss(x):
    pattern = r'\w*-POSS\b'
    poss = re.findall(pattern, ' '.join(str(y) for y in x))
    return poss

# adding data to the data frames
Ncorpus['Poss'] = Ncorpus.Morphemes.apply(get_poss)
Ncorpus['Poss_Count'] = Ncorpus['Poss'].str.len()

Lcorpus['Poss'] = Lcorpus.Morphemes.apply(get_poss)
Lcorpus['Poss_Count'] = Lcorpus['Poss'].str.len()
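The four extraction functions above differ only in the tag they search for, so they could be folded into one helper. The following is a sketch of that idea, not the notebook's actual code: `find_tagged`, the `tags` mapping, and the `toy` data frame are hypothetical names and data introduced for illustration.

```python
import re
import pandas as pd

def find_tagged(morphemes, tag):
    """Return every morpheme in the list that carries the given tag, e.g. '-PL' or '&PAST'."""
    # re.escape keeps the tag literal in case it contains regex metacharacters
    pattern = r"\w*" + re.escape(tag) + r"\b"
    return re.findall(pattern, " ".join(str(m) for m in morphemes))

# Made-up rows standing in for the Morphemes column
toy = pd.DataFrame({"Morphemes": [
    ["the", "dog-PL", "run-PRESP", "and", "fall&PAST"],
    ["mom-POSS", "hat", "be&PAST&13S", "there"],
]})

# Column name -> tag, mirroring the four morphemes extracted above
tags = {"PresP": "-PRESP", "Plural": "-PL", "PastIrr": "&PAST", "Poss": "-POSS"}

for col, tag in tags.items():
    # t=tag binds the current tag to the lambda on each loop iteration
    toy[col] = toy["Morphemes"].apply(lambda m, t=tag: find_tagged(m, t))
    toy[col + "_Count"] = toy[col].str.len()

print(toy[["PresP_Count", "Plural_Count", "PastIrr_Count", "Poss_Count"]])
```

Under this approach the same loop could be run once per corpus, and adding another morpheme would mean adding one entry to `tags`. Note that `be&PAST&13S` is returned as `be&PAST`, because the word boundary after `PAST` ends the match.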