## Data curation continued
--- 
`NEW CONTINUING` script: picks up where the first [data_curation](../notebooks/data_curation.ipynb) notebook left off.

Data processing pipeline: 
- [`data_curation.ipynb`](../notebooks/data_curation.ipynb)
- `data_curation_cont.ipynb` (this notebook)

__Script header__

In [7]:
# loading required libraries
import nltk, pickle, pprint, csv, pylangacq
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# pretty printing for readability
cp = pprint.PrettyPrinter(compact=True, sort_dicts=True)

# loading data from last notebook (context managers close the files after reading)
with open("../data/Lcorpus.pkl", 'rb') as fh:
    Lcorpus = pickle.load(fh)
with open("../data/Ncorpus.pkl", 'rb') as fh:
    Ncorpus = pickle.load(fh)

---
### Overview

This notebook continues my data curation efforts after the first [progress report](../progress_report.md). The last notebook surfaced an issue: the learner corpus may not be appropriate, or may not contain sufficient data, for my analysis. My first step is therefore to explore that data further to see whether it will suffice. After that, I will begin to investigate specific morphemes that are salient in the texts.

--- 

In [8]:
Lcorpus.head()

| | Filename | Proficiency | Age | L1 | Age_Exposure | Years_Study | Text |
|---|---|---|---|---|---|---|---|
| 0 | DE_SP_B1_26_13_13_TM | B1 | 26 | German | 8 | 13.0 | One day Tommy found a frog in a forest and bro... |
| 1 | DE_SP_B1_19_11_13_RN | B1 | 19 | German | 10 | 11.0 | One day a little boy called John uh with his d... |
| 2 | DE_SP_B1_21_12_13_SE | B1 | 21 | German | 9 | 12.0 | One day a boy was sitting in his room / uh he ... |
| 3 | DE_SP_B1_22_15_13_LF | B1 | 22 | German | 7 | 15.0 | Uh one day a little boy and his dog are watchi... |
| 4 | DE_SP_B1_33_10_14_JR | B1 | 33 | German | 10 | 10.0 | Ok this story is about toch uh Charles Chaplin... |


In [52]:
Lcorpus_7 = Lcorpus[Lcorpus.Years_Study <= 7]
print(len(Lcorpus))
print(len(Lcorpus_7))

350
14


Only 14 of the 350 learners in this corpus have been studying English for 7 years or less. Most of the data therefore comes from learners who are likely past the stages of acquisition at which they would be acquiring basic morphemes.

In [66]:
Lcorpus_7.head()

| | Filename | Proficiency | Age | L1 | Age_Exposure | Years_Study | Text |
|---|---|---|---|---|---|---|---|
| 180 | ES_SP_A2_50_6_14_MJRC | A2 | 50 | Spanish | 38 | 6.0 | uh well uh this video / this story is about uh... |
| 181 | ES_SP_A2_18_3_14_PAMM | A2 | 18 | Spanish | 15 | 3.0 | hhh uh Charles Chaplin hhh was walking and smo... |
| 182 | ES_SP_A2_19_2_13_ERO | A2 | 19 | Spanish | 17 | 2.0 | uh one day hhh they boy and her / his dog hhh ... |
| 185 | ES_SP_A2_26_3_14_SM | A2 | 26 | Spanish | 11 | 3.0 | hi in this video we can look uh at a you can w... |
| 189 | ES_SP_A2_23_4_14_B | A2 | 23 | Spanish | 16 | 4.0 | hello / my name is / ryan 'n' / 'n' this video... |


In [60]:
Lcorpus_7.L1.nunique()

1

Additionally, they all share an L1, Spanish. This sample is not likely to provide an adequate amount of data for my analysis. It would also be restricted to a single L1, which isn't ideal for making generalizations about all English language learners, especially since Spanish is similar to English in quite a few aspects of morphology.
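The two checks above (subset size and number of L1s) can be folded into one helper that is easy to rerun with a different cutoff. This is a minimal sketch on a toy frame: `summarize_subset` and the toy rows are hypothetical, and only the column names `Years_Study` and `L1` come from `Lcorpus`.

```python
import pandas as pd

def summarize_subset(df, max_years):
    """Return size, share, and L1 counts for learners with at most max_years of study."""
    subset = df[df["Years_Study"] <= max_years]
    return {
        "n": len(subset),
        "share": len(subset) / len(df),
        "L1_counts": subset["L1"].value_counts().to_dict(),
    }

# toy stand-in for Lcorpus (made-up rows, same column names)
toy = pd.DataFrame({
    "L1": ["German", "German", "Spanish", "Spanish"],
    "Years_Study": [13.0, 11.0, 6.0, 3.0],
})
summary = summarize_subset(toy, max_years=7)
print(summary)
```

Called as `summarize_subset(Lcorpus, 7)`, this should reproduce the figures above: 14 of 350 learners, all with Spanish as L1.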

### Plan B: import new SLABank data

While it would have been interesting to leverage two corpus sources for this analysis, the drawbacks of CORFL were too substantial. It has proven challenging to find additional spoken, transcribed corpora of L2 English that are _freely_ available online, so I will resort to using additional corpora from the TalkBank family.

These corpora come from SLABank rather than CHILDES. In addition, this time I will compile data from multiple corpora rather than a single one in order to include a variety of L1s.

Importing a sample of the data using `PyLangAcq`

In [96]:
# setting path
path = '../data/SLABank/BELC'

BELC_narr = pylangacq.read_chat(path, 'narratives') # creating a reader object

In [74]:
print(type(BELC_narr))
print(BELC_narr.n_files()) # info about this object

<class 'pylangacq.chat.Reader'>
168


In [75]:
cp.pprint(BELC_narr.headers()[0]) # checking stored metadata for first CHAT file

{'Comment': 'pronounced as /lif/',
 'Languages': ['spa', 'cat', 'eng'],
 'Media': '9139riza_1, audio, unlinked',
 'PID': '11312/t-00016825-1',
 'Participants': {'SUB': {'age': '10;09.00',
                          'corpus': 'BELC',
                          'custom': '',
                          'education': '',
                          'group': '1A',
                          'language': 'eng',
                          'name': '9139RIZA',
                          'role': 'Subject',
                          'ses': '',
                          'sex': 'female'}},
 'Transcriber': 'Mireia',
 'UTF8': ''}


In [93]:
BELC_narr.headers()[0]['Participants']

{'SUB': {'name': '9139RIZA',
  'language': 'eng',
  'corpus': 'BELC',
  'age': '10;09.00',
  'sex': 'female',
  'group': '1A',
  'ses': '',
  'role': 'Subject',
  'education': '',
  'custom': ''}}

One major issue: no metadata is stored about the _length of time the subject has spent learning English_, which is independent of age and crucial for my analysis. The `group` variable might be based on proficiency.

Luckily, according to the [author's introduction to the data](https://slabank.talkbank.org/access/English/BELC.html), groups do indeed correspond to the age at which the subjects began instruction in English. Together with the age variable, this is what I need in order to calculate years spent learning English.
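A minimal sketch of that calculation, assuming ages keep the `years;months.days` shape seen in the header above (`'10;09.00'`). The helper names `parse_chat_age` and `years_learning` are hypothetical, and the onset age of 8 in the example is purely illustrative, not a value taken from the BELC documentation.

```python
import re

def parse_chat_age(age):
    """Convert a CHAT age string 'years;months.days' to decimal years, or None if unusable."""
    match = re.fullmatch(r"(\d+);(\d{1,2})?\.?(\d{1,2})?", age.strip())
    if match is None:
        return None
    years, months, days = (int(g) if g else 0 for g in match.groups())
    # days are converted approximately; months and days default to 0 when missing
    return years + months / 12 + days / 365.25

def years_learning(age, onset_age):
    """Years of English study: current age minus the age at which instruction began."""
    decimal_age = parse_chat_age(age)
    if decimal_age is None:
        return None
    return decimal_age - onset_age

print(parse_chat_age("10;09.00"))
print(years_learning("10;09.00", onset_age=8))  # onset_age=8 is an illustrative value
```

Returning `None` for blank or malformed ages keeps files with missing metadata from raising errors or being silently treated as age 0.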

In [94]:
group_dict = {}  # dictionary to reference

In [76]:
BELC_narr.tokens()[:20] # preview tokens

[Token(word='okay', pos='co', mor='okay', gra=Gra(dep=1, head=2, rel='COM')),
 Token(word='hm', pos='phon', mor='hm', gra=Gra(dep=2, head=5, rel='LINK')),
 Token(word='the', pos='det:art', mor='the', gra=Gra(dep=3, head=4, rel='DET')),
 Token(word='story', pos='n', mor='story', gra=Gra(dep=4, head=5, rel='SUBJ')),
 Token(word='starts', pos='v', mor='start-3S', gra=Gra(dep=5, head=0, rel='ROOT')),
 Token(word='with', pos='prep', mor='with', gra=Gra(dep=6, head=5, rel='JCT')),
 Token(word='a', pos='det:art', mor='a', gra=Gra(dep=7, head=9, rel='DET')),
 Token(word='poor', pos='adj', mor='poor', gra=Gra(dep=8, head=9, rel='MOD')),
 Token(word='woman', pos='n', mor='woman', gra=Gra(dep=9, head=6, rel='POBJ')),
 Token(word='dressing', pos='part', mor='dress-PRESP', gra=Gra(dep=10, head=5, rel='XJCT')),
 Token(word='in', pos='prep', mor='in', gra=Gra(dep=11, head=10, rel='JCT')),
 Token(word='rucks', pos='neo', mor='rucks', gra=Gra(dep=12, head=11, rel='POBJ')),
 Token(word='and', pos='coord
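The `mor` field in this preview already carries the inflectional information I am after (`start-3S`, `dress-PRESP`). As a rough sketch of how those codes could be tallied, the block below counts whatever follows a hyphen in `mor`. The `Token` namedtuple is a hypothetical stand-in holding only three of the fields shown above, `count_suffixes` is my own helper name, and splitting on `-` is a simplification: forms that mark inflection some other way are not counted.

```python
from collections import Counter, namedtuple

# hypothetical stand-in for pylangacq's Token, with only some of the fields shown above
Token = namedtuple("Token", ["word", "pos", "mor"])

def count_suffixes(tokens):
    """Tally the suffix codes that follow '-' in each token's mor field."""
    counts = Counter()
    for token in tokens:
        if token.mor and "-" in token.mor:
            # 'start-3S' -> stem 'start', suffix codes ['3S']
            _stem, *suffixes = token.mor.split("-")
            counts.update(suffixes)
    return counts

sample = [
    Token("story", "n", "story"),
    Token("starts", "v", "start-3S"),
    Token("dressing", "part", "dress-PRESP"),
]
suffix_counts = count_suffixes(sample)
print(suffix_counts)
```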

In [119]:
for p in BELC_narr.headers():
    print(p['Participants'].keys()) # checking keys to access data

dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB', 'INV'])
dict_keys(['SUB'])
dict_keys(['SUB', 'INV'])
dict_keys(['PAR', 'I
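Printing one line per file makes the rare cases (such as the `PAR` code at the end) easy to miss. A tally of key combinations is quicker to scan. This is a minimal sketch: `tally_participant_codes` is a hypothetical helper, and the toy headers only imitate the shape of `BELC_narr.headers()`.

```python
from collections import Counter

def tally_participant_codes(headers):
    """Count how many files use each combination of participant codes."""
    return Counter(tuple(sorted(h["Participants"])) for h in headers)

# toy headers in the same shape as BELC_narr.headers() (made-up values)
toy_headers = [
    {"Participants": {"SUB": {}}},
    {"Participants": {"SUB": {}, "INV": {}}},
    {"Participants": {"INV": {}, "SUB": {}}},
    {"Participants": {"PAR": {}, "INV": {}}},
]
code_counts = tally_participant_codes(toy_headers)
for codes, n_files in code_counts.most_common():
    print(codes, n_files)
```

Sorting the keys before counting means `SUB`/`INV` and `INV`/`SUB` are treated as the same combination.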

Time to build a dataframe. 

In [175]:
# initiating empty lists
file_path_list = []
participant_list = []
age_list = []
group_list = []
corpus_list = []
tokens_list = []

# read each folder into a Reader object and collect one entry per CHAT file
for folder in ['narratives', 'narratives-2014']:
    BELCcorpus = pylangacq.read_chat(path, folder)
    for f in BELCcorpus:
        header = f.headers()[0]
        file_path = f.file_paths()[0].split('/')[3]
        participant = header['PID']
        # the learner is coded as 'SUB' in some files and 'PAR' in others
        if 'SUB' in header['Participants']:
            learner = header['Participants']['SUB']
        elif 'PAR' in header['Participants']:
            learner = header['Participants']['PAR']
        else:
            # neither code present: record blanks rather than reuse the previous file's values
            learner = {}
        # appending values to lists
        file_path_list.append(file_path)
        participant_list.append(participant)
        age_list.append(learner.get('age', ''))
        group_list.append(learner.get('group', ''))
        corpus_list.append(learner.get('corpus', ''))
        tokens_list.append(f.tokens())

In [176]:
# building the dataframe
BELCcorpus_df = pd.DataFrame({'Filename': file_path_list,
                              'Corpus': corpus_list,
                              'Participant': participant_list,
                              'Age': age_list,
                              'Group': group_list,
                              'Tokens': tokens_list})

In [177]:
BELCcorpus_df.head()

| | Filename | Corpus | Participant | Age | Group | Tokens |
|---|---|---|---|---|---|---|
| 0 | `BELC\narratives-2014\time1time2\9139riza_1.cha` | BELC | 11312/t-00016825-1 | 10;09.00 | 1A | `[Token(word='okay', pos='co', mor='okay', gra=...` |
| 1 | `BELC\narratives-2014\time1time2\9139riza_2.cha` | BELC | 11312/t-00016826-1 | | EL | `[Token(word='Rita', pos='n:prop', mor='Rita', ...` |
| 2 | `BELC\narratives-2014\time1time2\9144nugo_1.cha` | BELC | 11312/t-00016827-1 | 10;09. | | `[Token(word='okay', pos='co', mor='okay', gra=...` |
| 3 | `BELC\narratives-2014\time1time2\9144nugo_2.cha` | BELC | 11312/t-00016828-1 | | | `[Token(word='Núria', pos='n:prop', mor='Núria'...` |
| 4 | `BELC\narratives-2014\time1time2\9148mira_1.cha` | BELC | 11312/t-00016829-1 | 17;09.00 | 4A | `[Token(word='okay', pos='co', mor='okay', gra=...` |


The dataframe covering the two narrative speaking tasks from the BELC corpus is now built. Let's make sure it is tidy.
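One tidiness check that may be worth running first is whether any file was read twice. The two folder names share the prefix `narratives`, and the first rows above already come from `narratives-2014`, so the first read may have picked up both folders. I have not verified how `read_chat` interprets its second argument, so this is a precaution rather than a confirmed problem. The sketch below uses a hypothetical helper, `duplicated_files`, on made-up rows. It checks only the `Participant` column, because the lists in `Tokens` cannot be hashed for a whole-row comparison.

```python
import pandas as pd

def duplicated_files(df, key="Participant"):
    """Return the rows whose key value occurs more than once, sorted by key."""
    mask = df.duplicated(subset=[key], keep=False)
    return df[mask].sort_values(key)

# toy stand-in for BELCcorpus_df (made-up values, same column names)
toy = pd.DataFrame({
    "Filename": ["a_1.cha", "a_2.cha", "a_1.cha"],
    "Participant": ["pid-1", "pid-2", "pid-1"],
    "Tokens": [["okay"], ["Rita"], ["okay"]],
})
dupes = duplicated_files(toy)
print(len(dupes), "rows share a Participant value with another row")
```

If the real frame does contain repeats, `BELCcorpus_df.drop_duplicates(subset=['Participant'])` would keep the first copy of each.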

In [178]:
BELCcorpus_df.info() # says there are no null values...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Filename     210 non-null    object
 1   Corpus       210 non-null    object
 2   Participant  210 non-null    object
 3   Age          210 non-null    object
 4   Group        210 non-null    object
 5   Tokens       210 non-null    object
dtypes: object(6)
memory usage: 10.0+ KB


In [179]:
BELCcorpus_df.Group.value_counts() # but the group variable looks funky

      52
2A    37
1A    27
3A    26
4A    21
EL    16
LL    10
1B    10
2B    10
2      1
Name: Group, dtype: int64
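The 52 unlabeled rows explain why `info()` reported no nulls: blank header fields arrive as empty strings, which pandas counts as non-null. A minimal sketch of one way to make them visible, using a hypothetical helper `blanks_to_nan` on a toy frame with the same column names:

```python
import numpy as np
import pandas as pd

def blanks_to_nan(df, columns):
    """Return a copy of df where empty or whitespace-only strings in columns become NaN."""
    cleaned = df.copy()
    cleaned[columns] = cleaned[columns].replace(r"^\s*$", np.nan, regex=True)
    return cleaned

# toy stand-in for BELCcorpus_df (values modeled on the rows shown earlier)
toy = pd.DataFrame({
    "Age": ["10;09.00", "", "10;09."],
    "Group": ["1A", "EL", ""],
})
cleaned = blanks_to_nan(toy, ["Age", "Group"])
print(cleaned.isna().sum())
```

After this step, `info()` and `isna()` report the gaps directly, and `value_counts()` no longer shows an unnamed category.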

These group names don't match the description on the BELC corpus page on TalkBank, and 52 files have no group label at all. I will have to look into this further to make sense of it in order to calculate the learners' years studying English.

BELC contains data from Spanish and Catalan speakers. Let's import more data from additional corpora to represent a variety of L1s.