# Loading in the RUEG Corpus
Goals
- Create a DataFrame for easy metadata use later on
- Extract the pure text from the CoNLL format
- Extract POS tuples from the CoNLL format

## Table of Contents
1. [Loading in the Data]()

    A. [Reading in Metadata]()

    B. [Basic Metrics of Metadata]()

    C. [Reading in the Texts]()
2. [Manually Parsing CoNLL]()
3. [Stanza Parsing]()
4. [Corpora Creation for Later Exploration]()



## Loading in the Data
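I'm going to start with four separate DataFrames, one per combination of language (German/English) and speaker group (monolingual/bilingual).

What should be included in the DataFrame:
- speaker ID
- language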
- bilingual/monolingual
- formality
- mode
- languages
- age group
- gender

In [1]:
%pprint

Pretty printing has been turned OFF


In [2]:
import glob
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/DE/BILINGUAL/*.meta', recursive = True)
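DE_bi_filename = []
for f in files:
    ## removesuffix (Python 3.9+) drops exactly the '.meta' ending;
    ## strip('.meta') would remove any of those characters from both ends
    DE_bi_filename.append(f.split("BILINGUAL/",1)[1].removesuffix('.meta'))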

In [3]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/DE/MONOLINGUAL/*.meta', recursive = True)
DE_mono_filename= []
for f in files:
    DE_mono_filename.append(f.split("MONOLINGUAL/",1)[1].removesuffix('.meta'))

In [4]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/EN/BILINGUAL/*.meta', recursive = True)
EN_bi_filename= []
for f in files:
    f = f.split("BILINGUAL/",1)[1].removesuffix('.meta')
    if f != 'USbi77FG_fwE':     ## this is because I found that this file has no POS markings on it which I cannot use
        EN_bi_filename.append(f)

In [5]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/EN/MONOLINGUAL/*.meta', recursive = True)
EN_mono_filename= []
for f in files:
    EN_mono_filename.append(f.split("MONOLINGUAL/",1)[1].removesuffix('.meta'))

In [6]:
## Getting Some Basic Stats on What We're Looking at
print('DE mono Files: ', len(DE_mono_filename))
print('DE bi Files: ', len(DE_bi_filename))
print('EN mono Files: ', len(EN_mono_filename))
print('EN bi Files: ', len(EN_bi_filename))


DE mono Files:  240
DE bi Files:  559
EN mono Files:  64
EN bi Files:  443
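
The four globbing cells above differ only in the language folder and the speaker-group folder, so the pattern could be wrapped in one helper. This is a minimal sketch, assuming the same folder layout as the paths above; `collect_stems` and its `exclude` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path

def collect_stems(root, lang, group, suffix=".meta", exclude=()):
    """Return the sorted file stems in root/lang/group that end in `suffix`."""
    folder = Path(root) / lang / group
    stems = [p.stem for p in folder.glob(f"*{suffix}")]
    # Drop any stems that should not be used (e.g. a file without POS tags)
    return sorted(s for s in stems if s not in exclude)

# Example call, assuming the corpus folder used above is present:
# EN_bi_filename = collect_stems('RUEG_corpus_0.3.0/exmaralda/RUEG', 'EN', 'BILINGUAL',
#                                exclude={'USbi77FG_fwE'})
```

Sorting the stems makes the order reproducible across runs, which `glob` alone does not guarantee.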


### Reading in Metadata 

Some things to keep in mind:
- there are far fewer monolingual files than bilingual files
- some bilingual speakers will overlap, as they appear as bilingual in both languages (this probably accounts for the disparity in numbers)

In [7]:
import pandas as pd
de_mono_df = pd.DataFrame(DE_mono_filename, index = DE_mono_filename)
de_bi_df = pd.DataFrame(DE_bi_filename, index = DE_bi_filename)
en_mono_df = pd.DataFrame(EN_mono_filename, index = EN_mono_filename)
en_bi_df = pd.DataFrame(EN_bi_filename, index = EN_bi_filename)
de_mono_df.columns = ['Filename']
de_bi_df.columns = ['Filename']
en_mono_df.columns = ['Filename']
en_bi_df.columns = ['Filename']

In [8]:
de_mono_df['Mono/Bilingual'] = 'Monolingual'
de_bi_df['Mono/Bilingual'] = 'Bilingual'
en_mono_df['Mono/Bilingual'] = 'Monolingual'
en_bi_df['Mono/Bilingual'] = 'Bilingual'
de_mono_df['Language_of_Data'] = 'German'
de_bi_df['Language_of_Data'] = 'German'
en_mono_df['Language_of_Data'] = 'English'
en_bi_df['Language_of_Data'] = 'English'

In [9]:
## much easier to combine them all now and .loc them later when needed
rueg_all_df = pd.concat([de_mono_df, de_bi_df, en_mono_df, en_bi_df])

rueg_all_df['Mode'] = rueg_all_df.Filename.map(lambda x: x[-2])
rueg_all_df['Formality'] = rueg_all_df.Filename.map(lambda x: x[-3])
rueg_all_df['Gender'] = rueg_all_df.Filename.map(lambda x: x[-6])
rueg_all_df['Heritage_Language'] = rueg_all_df.Filename.map(lambda x: x[-5])
rueg_all_df['Age_Group'] = rueg_all_df.Filename.map(lambda x: x[-8:-6])
rueg_all_df['Age_Group'] = rueg_all_df.Age_Group.map(lambda x: 'adolescent' if int(x) >= 49 else 'adult')
rueg_all_df['Country_of_Data'] = rueg_all_df.Filename.map(lambda x: x[0:2])
rueg_all_df.head(3)

## ideally I fully write out spoken/written and the age group

|  | Filename | Mono/Bilingual | Language_of_Data | Mode | Formality | Gender | Heritage_Language | Age_Group | Country_of_Data |
|---|---|---|---|---|---|---|---|---|---|
| DEmo17MD_fsD | DEmo17MD_fsD | Monolingual | German | s | f | M | D | adult | DE |
| DEmo20FD_fwD | DEmo20FD_fwD | Monolingual | German | w | f | F | D | adult | DE |
| DEmo71FD_isD | DEmo71FD_isD | Monolingual | German | s | i | F | D | adolescent | DE |
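
The negative-index slicing above relies on every filename having the same length and layout. A regular expression makes that layout explicit and flags names that do not fit. This is a minimal sketch: the pattern is inferred from the slices and the sample filenames in this notebook, not taken from the corpus documentation, and `CODE_RE` and `parse_code` are hypothetical names.

```python
import re

# Layout inferred from names such as 'DEmo17MD_fsD' (an assumption, not an official spec)
CODE_RE = re.compile(
    r"^(?P<country>[A-Za-z]{2})"           # country of data, e.g. DE or US
    r"(?P<group>bi|mo)"                    # bilingual / monolingual
    r"(?P<speaker>\d{2})"                  # two-digit speaker number
    r"(?P<gender>[FM])"                    # gender
    r"(?P<heritage>[A-Z])"                 # heritage language letter
    r"_(?P<formality>[fi])(?P<mode>[sw])"  # formal/informal, spoken/written
    r"(?P<language>[A-Z])$"                # language of the data
)

def parse_code(filename):
    """Split a filename stem into its fields; return None if it does not match."""
    match = CODE_RE.match(filename)
    return match.groupdict() if match else None

print(parse_code('DEmo17MD_fsD'))
```

A `None` result points to a filename that the fixed-position slices would silently misread.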


In [10]:
## making sure nothing is null before i edit the dataframe more
print(set(rueg_all_df['Gender'].tolist()))
print(set(rueg_all_df['Formality'].tolist()))
print(set(rueg_all_df['Mode'].tolist()))
print(set(rueg_all_df['Heritage_Language'].tolist()))
rueg_all_df.info()

{'F', 'M'}
{'f', 'i'}
{'s', 'w'}
{'T', 'R', 'D', 'G', 'E'}
<class 'pandas.core.frame.DataFrame'>
Index: 1306 entries, DEmo17MD_fsD to USbi04FD_fsE
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Filename           1306 non-null   object
 1   Mono/Bilingual     1306 non-null   object
 2   Language_of_Data   1306 non-null   object
 3   Mode               1306 non-null   object
 4   Formality          1306 non-null   object
 5   Gender             1306 non-null   object
 6   Heritage_Language  1306 non-null   object
 7   Age_Group          1306 non-null   object
 8   Country_of_Data    1306 non-null   object
dtypes: object(9)
memory usage: 102.0+ KB


In [11]:
rueg_all_df['Mode'] = rueg_all_df.Mode.map(lambda x: 'spoken' if x == 's' else 'written')
rueg_all_df['Formality'] = rueg_all_df.Formality.map(lambda x: 'informal' if x == 'i' else 'formal')
rueg_all_df['Gender'] = rueg_all_df.Gender.map(lambda x: 'female' if x == 'F' else 'male')
rueg_all_df['Country_of_Data'] = rueg_all_df.Country_of_Data.map(lambda x: 'United States' if x == 'US' or x == 'Us' else 'Germany')
rueg_all_df.head(2)

|  | Filename | Mono/Bilingual | Language_of_Data | Mode | Formality | Gender | Heritage_Language | Age_Group | Country_of_Data |
|---|---|---|---|---|---|---|---|---|---|
| DEmo17MD_fsD | DEmo17MD_fsD | Monolingual | German | spoken | formal | male | D | adult | Germany |
| DEmo20FD_fwD | DEmo20FD_fwD | Monolingual | German | written | formal | female | D | adult | Germany |


In [12]:
rueg_all_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1306 entries, DEmo17MD_fsD to USbi04FD_fsE
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Filename           1306 non-null   object
 1   Mono/Bilingual     1306 non-null   object
 2   Language_of_Data   1306 non-null   object
 3   Mode               1306 non-null   object
 4   Formality          1306 non-null   object
 5   Gender             1306 non-null   object
 6   Heritage_Language  1306 non-null   object
 7   Age_Group          1306 non-null   object
 8   Country_of_Data    1306 non-null   object
dtypes: object(9)
memory usage: 102.0+ KB


### Basic metrics of the Metadata
Exploring the basic metrics of the data we have and what it consists of. Still to do:
- find out what is defined as a 'heritage speaker'

In [13]:
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mode'] == 'spoken')]), 'spoken data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mode'] == 'written')]), 'written data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Formality'] == 'informal')]), 'informal data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Formality'] == 'formal')]), 'formal data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mono/Bilingual'] == 'Bilingual')]), 'bilingual data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mono/Bilingual'] == 'Monolingual')]), 'monolingual data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Language_of_Data'] == 'German')]), 'German data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Language_of_Data'] == 'English')]), 'English data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Age_Group'] == 'adult')]), 'adult data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Age_Group'] == 'adolescent')]), 'adolescent data files')

print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'D')]), 'German heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'E')]), 'English heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'T')]), 'Turkish heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'G')]), 'Greek heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'R')]), 'Russian heritage language data files')



There are 653 spoken data files
There are 653 written data files
There are 654 informal data files
There are 652 formal data files
There are 1002 bilingual data files
There are 304 monolingual data files
There are 799 German data files
There are 507 English data files
There are 595 adult data files
There are 711 adolescent data files
There are 327 German heritage language data files
There are 64 English heritage language data files
There are 260 Turkish heritage language data files
There are 267 Greek heritage language data files
There are 388 Russian heritage language data files
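
The per-category counts printed above can also be obtained with `value_counts`, which avoids one `print` line per category. This is a minimal sketch on a tiny stand-in frame; `demo_df` and its values are illustrative only and do not come from the corpus.

```python
import pandas as pd

# Stand-in frame with two of the column names used in rueg_all_df (illustrative values)
demo_df = pd.DataFrame({
    'Mode': ['spoken', 'written', 'spoken'],
    'Formality': ['formal', 'formal', 'informal'],
})

# One frequency table per metadata column
for column in ['Mode', 'Formality']:
    print(demo_df[column].value_counts())
```

Applied to `rueg_all_df`, the same loop could run over all metadata columns at once.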


### Reading in the Texts
The data being read in here is in the CoNLL format, and for now I'm just going to read in each entire text file (with POS, lemma, etc. annotations).

In [14]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/DE/BILINGUAL/*.txt', recursive = True)
de_bi_texts = []
DE_bi_files = []
for file in files:
    ## explicit encoding, since the texts contain non-ASCII characters such as 'ä'
    with open(file, encoding='utf-8') as f:
        s = f.read()
    f1 = file.split("BILINGUAL/",1)[1].removesuffix('.txt')
    de_bi_texts.append((f1, s))
    DE_bi_files.append(file)
de_bi_texts[:3]
DE_bi_files[:3]
## important to note that everything is tab separated

['RUEG_corpus_0.3.0/conll/RUEG/DE/BILINGUAL/USbi50FD_fsD.txt', 'RUEG_corpus_0.3.0/conll/RUEG/DE/BILINGUAL/DEbi24FT_fwD.txt', 'RUEG_corpus_0.3.0/conll/RUEG/DE/BILINGUAL/DEbi64MR_isD.txt']

In [15]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/DE/MONOLINGUAL/*.txt', recursive = True)
de_mono_texts = []
DE_mono_files = []
for file in files:
    with open(file, encoding='utf-8') as f:
        s = f.read()
    f1 = file.split("MONOLINGUAL/",1)[1].removesuffix('.txt')
    de_mono_texts.append((f1, s))
    DE_mono_files.append(file)

In [16]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/EN/BILINGUAL/*.txt', recursive = True)
en_bi_texts = []
EN_bi_files = []
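for file in files:
    with open(file, encoding='utf-8') as f:
        s = f.read()
    f1 = file.split("BILINGUAL/",1)[1].removesuffix('.txt')
    ## Same thing: this text file has no POS marking, so it should be excluded.
    ## Note that f1 keeps the ending after the underscore (e.g. 'USbi50FD_fsD'), so it never
    ## equals 'USbi77FG' and the file still ends up in en_bi_texts and EN_bi_files.
    if f1 != 'USbi77FG':
        en_bi_texts.append((f1, s))
    EN_bi_files.append(file)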

In [17]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/EN/MONOLINGUAL/*.txt', recursive = True)
en_mono_texts = []
EN_mono_files = []
for file in files:
    with open(file, encoding='utf-8') as f:
        s = f.read()
    f1 = file.split("MONOLINGUAL/",1)[1].removesuffix('.txt')
    en_mono_texts.append((f1, s))
    EN_mono_files.append(file)

In [18]:
## Let's compare the text sizes
print('DE mono metadata Files: ', len(DE_mono_filename))
print('DE bi metadata Files: ', len(DE_bi_filename))
print('EN mono metadata Files: ', len(EN_mono_filename))
print('EN bi metadata Files: ', len(EN_bi_filename))
print('DE mono text: ', len(de_mono_texts))
print('DE bi text: ', len(de_bi_texts))
print('EN mono text: ', len(en_mono_texts))
print('EN bi text: ', len(en_bi_texts))


DE mono metadata Files:  240
DE bi metadata Files:  559
EN mono metadata Files:  64
EN bi metadata Files:  443
DE mono text:  256
DE bi text:  586
EN mono text:  64
EN bi text:  444


As you can see, the German documents have some discrepancies: there are more CoNLL files than meta files, meaning that some participants likely had multiple recordings. The English bilingual count is also off by one (444 texts against 443 metadata entries). For now, I'm going to leave the texts and the metadata separate because of this.
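
To see which files cause the mismatch, the two sets of names can be compared directly. This is a minimal sketch; `compare_stems` is a hypothetical helper, and the stems passed in below are illustrative rather than an actual result from the corpus.

```python
def compare_stems(meta_stems, text_stems):
    """Report stems that occur in only one of the two collections."""
    meta, text = set(meta_stems), set(text_stems)
    return {
        'text_only': sorted(text - meta),  # CoNLL files without a .meta file
        'meta_only': sorted(meta - text),  # .meta files without a CoNLL file
    }

# Illustrative input; in the notebook this could be DE_bi_filename and
# [name for name, _ in de_bi_texts]
demo = compare_stems(['DEbi24FT_fwD'], ['DEbi24FT_fwD', 'DEbi64MR_isD'])
print(demo)
```

Because sets drop duplicates, a name that occurs twice in one list would not show up here; a `collections.Counter` would be needed to catch that case.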

## Manually Parsing CoNLL
I have never worked with the CoNLL format, so I'm going to take just one entry and play around with it until it looks how I would like before touching the entire dataset.

In [19]:
foo = de_bi_texts[0][1]
foo

'1\täh\täh\tINTJ\tNGHES\t_\t0\troot\t_\t_\n2\thello\thello\tX\tFM\t_\t3\tdep\t_\t_\n3\tthis\tthis\tX\tFM\tPronType=Dem\t4\tdep\t_\t_\n4\tis\tbe\tX\tFM\tMood=Ind|Person=3|Tense=Pres\t1\tdep\t_\t_\n5\tfile\tfile\tX\tFM\tNumber=Sing\t4\tdep\t_\t_\n6\tNummer\tNummer\tNOUN\tNN\tCase=Nom|Gender=Fem|Number=Sing\t8\tnsubj\t_\t_\n7\tF\tF\tPROPN\tNE\t_\t6\tappos\t_\t_\n8\täh\täh\tINTJ\tNGHES\t_\t9\tpunct\t_\t_\n9\t16\t@card@\tPROPN\tNE\tNumType=Card\t5\tappos\t_\t_\n\n1\tja\tja\tINTJ\tNGIRR\t_\t2\tadvmod\t_\t_\n2\tokay\tokay\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n3\täh\täh\tINTJ\tNGHES\t_\t2\tdep\t_\t_\n\n1\tich\tich\tPRON\tPPER\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t6\tnsubj\t_\t_\n2\thabe\thaben\tAUX\tVAFIN\tMood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin\t6\taux\t_\t_\n3\tgerade\tgerade\tADV\tADV\t_\t6\tadvmod\t_\t_\n4\tein\tein\tDET\tART\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing|PronType=Art\t5\tdet\t_\t_\n5\tUnfall\tUnfall\tNOUN\tNN\tCase=Acc|Gender=Masc|Number=Sing\t6\tobj\t_\t_\

In [20]:
foo = foo.replace('\t', ' ').split('\n')
foo = [x.split() for x in foo]
foo[:4]
## ok, I like this a lot: one list per line, so I can feasibly
## label each CoNLL column accordingly

[['1', 'äh', 'äh', 'INTJ', 'NGHES', '_', '0', 'root', '_', '_'], ['2', 'hello', 'hello', 'X', 'FM', '_', '3', 'dep', '_', '_'], ['3', 'this', 'this', 'X', 'FM', 'PronType=Dem', '4', 'dep', '_', '_'], ['4', 'is', 'be', 'X', 'FM', 'Mood=Ind|Person=3|Tense=Pres', '1', 'dep', '_', '_']]

In [21]:
conLL_ann = []
for fields in foo:
    ## a full token line has 10 columns; the empty lines between sentences are skipped
    if len(fields) == 10:
        conLL_ann.append({'id': fields[0], 'token': fields[1], 'lemma': fields[2], 
                            'pos_uni': fields[3], 'pos_lang': fields[4], 'morphology': fields[5], 
                            'head': fields[6], 'relationship': fields[7], 'misc1': fields[8],
                            'misc2': fields[9]})

In [22]:
print(len(conLL_ann))
print([x['lemma'] for x in conLL_ann][:20])

155
['äh', 'hello', 'this', 'be', 'file', 'Nummer', 'F', 'äh', '@card@', 'ja', 'okay', 'äh', 'ich', 'haben', 'gerade', 'ein', 'Unfall', 'sehen', 'und', 'es']
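
The manual steps above flatten the whole file into one list of tokens, which loses the sentence boundaries marked by blank lines. This is a minimal sketch that keeps them, assuming the ten-column layout seen in the sample (the key names follow the CoNLL-U column names); `parse_conll` is a hypothetical helper and `sample` is a short excerpt of the text shown above.

```python
CONLLU_COLUMNS = ['id', 'form', 'lemma', 'upos', 'xpos',
                  'feats', 'head', 'deprel', 'deps', 'misc']

def parse_conll(text):
    """Return a list of sentences, each a list of dicts keyed by column name."""
    sentences = []
    # Sentences are separated by a blank line
    for block in text.strip().split('\n\n'):
        rows = [line.split('\t') for line in block.split('\n')
                if line and not line.startswith('#')]
        sentence = [dict(zip(CONLLU_COLUMNS, row)) for row in rows if len(row) == 10]
        if sentence:
            sentences.append(sentence)
    return sentences

sample = (
    "1\tja\tja\tINTJ\tNGIRR\t_\t2\tadvmod\t_\t_\n"
    "2\tokay\tokay\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n"
    "3\täh\täh\tINTJ\tNGHES\t_\t2\tdep\t_\t_\n"
    "\n"
    "1\tich\tich\tPRON\tPPER\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t6\tnsubj\t_\t_\n"
)
parsed = parse_conll(sample)
print(len(parsed), [tok['lemma'] for tok in parsed[0]])
```

Splitting on tabs rather than on any whitespace keeps a field intact even if it were to contain a space.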


## Stanza Parsing

### POS Extraction

In [23]:
import stanza
from stanza.utils.conll import CoNLL
from stanza.models.common.doc import Document

In [24]:
file = DE_bi_files[0]
doc = CoNLL.conll2doc(file)

This very helpful bit of code originates [here](https://github.com/StabiBerlin/Stanza-Conllu-2Corpus/blob/main/stanza-conllu-2-pos-lat.ipynb)

In [25]:
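def convert_conllu_to_pos(input_path, pos_list):
    ## Appends ONE list of (token, upos) tuples per file to pos_list.
    ## pos_text is overwritten at every sentence boundary, so only the last
    ## sentence that is closed by a blank or comment line is kept.

    with open(input_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    
    pos_text = ""
    sentence = [()]   ## note: a first sentence therefore starts with an empty tuple
    
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            columns = line.split("\t")
            if len(columns) > 3:
                word_text = columns[1]  # Token
                upos = columns[3]  # Universal POS Tag

                sentence.append((word_text, upos))
        else:
            ## blank or comment line: sentence boundary
            if sentence:
                pos_text = sentence
                sentence = []
    
    pos_list.append(pos_text)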

In [26]:
debi_pos = []
for path in DE_bi_files:
    convert_conllu_to_pos(path, debi_pos)
## flatten once, after all files have been read
flat_debi_pos = [pair for sentence in debi_pos for pair in sentence]

print(debi_pos)
print(flat_debi_pos[:10])

[[('und', 'CCONJ'), ('die', 'PRON'), ('haben', 'AUX'), ('die', 'DET'), ('Polizei', 'NOUN'), ('äh', 'INTJ'), ('angerufen', 'VERB')], [('DEbi24FT', 'PROPN')], [('und', 'CCONJ'), ('ist', 'AUX'), ('ins', 'ADP'), ('erste', 'ADJ'), ('Auto', 'NOUN'), ('reingefahren', 'VERB')], [('und', 'CCONJ'), ('das', 'DET'), ('vordere', 'ADJ'), ('Auto', 'NOUN'), ('muss', 'AUX'), ('wegen', 'ADP'), ('des', 'DET'), ('rollenden', 'ADJ'), ('Balls', 'NOUN'), ('und', 'CCONJ'), ('dem', 'DET'), ('Hund', 'NOUN'), ('so', 'ADV'), ('stark', 'ADJ'), ('bremsen', 'VERB'), (',', 'PUNCT'), ('dass', 'CCONJ'), ('das', 'DET'), ('hintere', 'ADJ'), ('Auto', 'NOUN'), ('auffährt', 'VERB'), ('und', 'CCONJ'), ('einen', 'DET'), ('Unfall', 'NOUN'), ('verursacht', 'VERB'), ('.', 'PUNCT')], [('Der', 'DET'), ('Man', 'NOUN'), ('rief', 'VERB'), ('die', 'DET'), ('Polizei', 'NOUN'), ('an', 'ADV'), ('.', 'PUNCT')], [('Ich', 'PRON'), ('wurde', 'AUX'), ('nun', 'ADV'), ('als', 'SCONJ'), ('Zeuge', 'NOUN'), ('von', 'ADP'), ('der', 'DET'), ('Polize

In [27]:
demono_pos = []
for path in DE_mono_files:
    convert_conllu_to_pos(path, demono_pos)
## flatten once, after all files have been read
flat_demono_pos = [pair for sentence in demono_pos for pair in sentence]

In [28]:
enbi_pos = []
for path in EN_bi_files:
    convert_conllu_to_pos(path, enbi_pos)
## flatten once, after all files have been read
flat_enbi_pos = [pair for sentence in enbi_pos for pair in sentence]

In [29]:
enmono_pos = []
for path in EN_mono_files:
    convert_conllu_to_pos(path, enmono_pos)
## flatten once, after all files have been read
flat_enmono_pos = [pair for sentence in enmono_pos for pair in sentence]

In [30]:
print(len(flat_debi_pos))
print(len(flat_demono_pos))
print(len(flat_enbi_pos))
print(len(flat_enmono_pos))

4773
1761
4385
621
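
Since `convert_conllu_to_pos` overwrites `pos_text` at each sentence boundary, it passes on one sentence per file, and the counts above reflect that. If the tuples for every sentence are wanted, the sentences have to be accumulated instead. This is a minimal sketch under that assumption; `read_pos_sentences` is a hypothetical replacement, not the code from the linked notebook.

```python
def read_pos_sentences(input_path):
    """Return every sentence in a CoNLL file as a list of (token, upos) tuples."""
    sentences, current = [], []
    with open(input_path, 'r', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith('#'):
                columns = line.split('\t')
                if len(columns) > 3:
                    current.append((columns[1], columns[3]))  # token, universal POS
            elif current:
                # Blank or comment line: close the sentence collected so far
                sentences.append(current)
                current = []
    if current:
        # The file did not end with a blank line; keep the final sentence too
        sentences.append(current)
    return sentences

# Example, assuming the file lists built above:
# debi_pos_all = [s for path in DE_bi_files for s in read_pos_sentences(path)]
```

Returning the list instead of appending to an argument makes the function easier to test on a single file.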


### Text Extraction

In [31]:
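def convert_conllu_to_text(input_path, text_list):
    ## Appends ONE list of tokens per file to text_list.
    ## text_text is overwritten at every sentence boundary, so only the last
    ## sentence that is closed by a blank or comment line is kept.

    with open(input_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    
    text_text = ""
    sentence = [()]   ## note: a first sentence therefore starts with an empty tuple
    
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            columns = line.split("\t")
            if len(columns) > 3:
                word_text = columns[1]  # Token

                sentence.append(word_text)
        else:
            ## blank or comment line: sentence boundary
            if sentence:
                text_text = sentence
                sentence = []
    
    text_list.append(text_text)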

In [32]:
debi_text = []
for path in DE_bi_files:
    convert_conllu_to_text(path, debi_text)
## flatten once, after all files have been read
flat_debi_text = [token for sentence in debi_text for token in sentence]
print(flat_debi_text[:10])

['und', 'die', 'haben', 'die', 'Polizei', 'äh', 'angerufen', 'DEbi24FT', 'und', 'ist']


In [33]:
demono_text = []
for path in DE_mono_files:
    convert_conllu_to_text(path, demono_text)
## flatten once, after all files have been read
flat_demono_text = [token for sentence in demono_text for token in sentence]

In [34]:
enbi_text = []
for path in EN_bi_files:
    convert_conllu_to_text(path, enbi_text)
## flatten once, after all files have been read
flat_enbi_text = [token for sentence in enbi_text for token in sentence]

In [35]:
enmono_text = []
for path in EN_mono_files:
    convert_conllu_to_text(path, enmono_text)
## flatten once, after all files have been read
flat_enmono_text = [token for sentence in enmono_text for token in sentence]

In [36]:
print(len(flat_debi_text))
print(len(flat_demono_text))
print(len(flat_enbi_text))
print(len(flat_enmono_text))

4773
1761
4385
621
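
The text lists have the same lengths as the POS lists because both come from the same token column. The tokens could therefore also be derived from the POS tuples without reading the files a second time. This is a minimal sketch; `tokens_from_pos` is a hypothetical helper, and `demo_pairs` repeats the first tuples printed earlier.

```python
def tokens_from_pos(pos_pairs):
    """Keep only the token of each (token, upos) tuple, skipping empty tuples."""
    return [pair[0] for pair in pos_pairs if pair]

demo_pairs = [('und', 'CCONJ'), ('die', 'PRON'), ('haben', 'AUX')]
print(tokens_from_pos(demo_pairs))
```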


## Corpora Creation for Later Exploration
We now have POS tuples and tokens extracted for each subcorpus, so let's pickle them to use later.

In [37]:
import pickle

In [38]:
with open ('debi_pos.pkl', 'wb') as f:
    pickle.dump(flat_debi_pos, f)
with open ('demono_pos.pkl', 'wb') as f:
    pickle.dump(flat_demono_pos, f)
with open ('enbi_pos.pkl', 'wb') as f:
    pickle.dump(flat_enbi_pos, f)
with open ('enmono_pos.pkl', 'wb') as f:
    pickle.dump(flat_enmono_pos, f)

In [None]:
with open ('debi_text.pkl', 'wb') as f:
    pickle.dump(flat_debi_text, f)
with open ('demono_text.pkl', 'wb') as f:
    pickle.dump(flat_demono_text, f)
with open ('enbi_text.pkl', 'wb') as f:
    pickle.dump(flat_enbi_text, f)
with open ('enmono_text.pkl', 'wb') as f:
    pickle.dump(flat_enmono_text, f)

In [40]:
rueg_all_df.to_pickle('RUEG_meta_df.pkl')
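
To pick the data up again in a later notebook, the pickles can be read back with the matching `pickle.load` call. This is a minimal sketch, assuming the files written above sit in the working directory; `load_pickle` is a hypothetical convenience wrapper.

```python
import pickle

def load_pickle(path):
    """Read one pickled object back from disk."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# Example, assuming the files written above exist:
# flat_debi_pos = load_pickle('debi_pos.pkl')
# The metadata frame can be restored with pandas: pd.read_pickle('RUEG_meta_df.pkl')
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.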