# Loading in the RUEG Corpus
Goal: create a DataFrame that makes the data easy to use later on

## Table of Contents
1. [Loading in the Data]()

    A. [Reading in Metadata]()

    B. [Basic Metrics of Metadata]()

    C. [Reading in the Texts]()
2. [Manually Parsing CoNLL]()
3. [Stanza Parsing]()
4. [Practice spaCy Parsing CoNLL]()
5. [Attempted spaCy Parsing CoNLL]()
6. [Data Cleaning]()
7. [spaCy Parsing for Real]()
8. [Corpora Creation for Later Exploration]()



## Loading in the Data
I'm going to start with four separate DataFrames.

What to include in the DataFrame:
- speaker ID
- language
- bilingual/monolingual
- formality
- mode
- languages
- age group
- gender

In [None]:
%pprint

In [None]:
import glob
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/DE/BILINGUAL/*.meta', recursive = True)
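DE_bi_filename = []
for f in files:
    ## removesuffix (Python 3.9+) drops only the extension; strip('.meta') would
    ## strip any of the characters '.', 'm', 'e', 't', 'a' from both ends
    DE_bi_filename.append(f.split("BILINGUAL/",1)[1].removesuffix('.meta'))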

In [None]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/DE/MONOLINGUAL/*.meta', recursive = True)
DE_mono_filename= []
for f in files:
    DE_mono_filename.append(f.split("MONOLINGUAL/",1)[1].removesuffix('.meta'))   ## removesuffix drops only the extension

In [None]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/EN/BILINGUAL/*.meta', recursive = True)
EN_bi_filename= []
for f in files:
    f = f.split("BILINGUAL/",1)[1].removesuffix('.meta')   ## removesuffix drops only the extension
    if f != 'USbi77FG_fwE':     ## this file has no POS markings, so I cannot use it
        EN_bi_filename.append(f)

In [None]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/EN/MONOLINGUAL/*.meta', recursive = True)
EN_mono_filename= []
for f in files:
    EN_mono_filename.append(f.split("MONOLINGUAL/",1)[1].removesuffix('.meta'))   ## removesuffix drops only the extension

In [None]:
## Getting Some Basic Stats on What We're Looking at
print('DE mono Files: ', len(DE_mono_filename))
print('DE bi Files: ', len(DE_bi_filename))
print('EN mono Files: ', len(EN_mono_filename))
print('EN bi Files: ', len(EN_bi_filename))


### Reading in Metadata 

Some things to keep in mind:
- there are far fewer monolingual speakers than bilingual speakers
- some bilingual speakers will overlap, since they appear as bilingual in both languages (this probably accounts for the disparity in numbers)

In [None]:
import pandas as pd
de_mono_df = pd.DataFrame(DE_mono_filename, index = DE_mono_filename)
de_bi_df = pd.DataFrame(DE_bi_filename, index = DE_bi_filename)
en_mono_df = pd.DataFrame(EN_mono_filename, index = EN_mono_filename)
en_bi_df = pd.DataFrame(EN_bi_filename, index = EN_bi_filename)
de_mono_df.columns = ['Filename']
de_bi_df.columns = ['Filename']
en_mono_df.columns = ['Filename']
en_bi_df.columns = ['Filename']

In [None]:
de_mono_df['Mono/Bilingual'] = 'Monolingual'
de_bi_df['Mono/Bilingual'] = 'Bilingual'
en_mono_df['Mono/Bilingual'] = 'Monolingual'
en_bi_df['Mono/Bilingual'] = 'Bilingual'
de_mono_df['Language_of_Data'] = 'German'
de_bi_df['Language_of_Data'] = 'German'
en_mono_df['Language_of_Data'] = 'English'
en_bi_df['Language_of_Data'] = 'English'

In [None]:
## much easier to combine them all now and .loc them later when needed
rueg_all_df = pd.concat([de_mono_df, de_bi_df, en_mono_df, en_bi_df])

rueg_all_df['Mode'] = rueg_all_df.Filename.map(lambda x: x[-2])
rueg_all_df['Formality'] = rueg_all_df.Filename.map(lambda x: x[-3])
rueg_all_df['Gender'] = rueg_all_df.Filename.map(lambda x: x[-6])
rueg_all_df['Heritage_Language'] = rueg_all_df.Filename.map(lambda x: x[-5])
rueg_all_df['Age_Group'] = rueg_all_df.Filename.map(lambda x: x[-8:-6])
rueg_all_df['Age_Group'] = rueg_all_df.Age_Group.map(lambda x: 'adolescent' if int(x) >= 49 else 'adult')
rueg_all_df['Country_of_Data'] = rueg_all_df.Filename.map(lambda x: x[0:2])
rueg_all_df.head(3)

## ideally I fully write out spoken/written and the age group
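
The slicing above depends on every filename having exactly the same layout. As a hedged alternative, here is a minimal sketch that reads the same fields with a named-group regex, so a filename that does not fit is reported as `None` instead of being sliced silently. The pattern and the name `parse_rueg_filename` are my own assumptions, inferred only from the index positions used above and from the example `USbi77FG_fwE`; this is not an official description of the RUEG naming scheme.

```python
import re

# Assumed layout (inferred from the slicing above, not from RUEG documentation):
# country(2) + group(2) + speaker number(2) + gender(1) + heritage language(1)
# + "_" + formality(1) + mode(1) + one trailing letter whose meaning is not covered here.
FILENAME_PATTERN = re.compile(
    r"^(?P<country>[A-Za-z]{2})"
    r"(?P<group>[a-z]{2})"
    r"(?P<speaker_number>\d{2})"
    r"(?P<gender>[A-Z])"
    r"(?P<heritage_language>[A-Z])"
    r"_(?P<formality>[a-z])"
    r"(?P<mode>[a-z])"
    r"(?P<suffix>[A-Z])$"
)

def parse_rueg_filename(name):
    """Return the filename fields as a dict, or None if the name does not fit the assumed layout."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None

fields = parse_rueg_filename("USbi77FG_fwE")
print(fields)
```

Names that return `None` would be worth inspecting by hand before the columns are built.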

In [None]:
## making sure nothing is null before I edit the DataFrame more
print(set(rueg_all_df['Gender'].tolist()))
print(set(rueg_all_df['Formality'].tolist()))
print(set(rueg_all_df['Mode'].tolist()))
print(set(rueg_all_df['Heritage_Language'].tolist()))
rueg_all_df.info()

In [None]:
rueg_all_df['Mode'] = rueg_all_df.Mode.map(lambda x: 'spoken' if x == 's' else 'written')
rueg_all_df['Formality'] = rueg_all_df.Formality.map(lambda x: 'informal' if x == 'i' else 'formal')
rueg_all_df['Gender'] = rueg_all_df.Gender.map(lambda x: 'female' if x == 'F' else 'male')
rueg_all_df['Country_of_Data'] = rueg_all_df.Country_of_Data.map(lambda x: 'United States' if x == 'US' or x == 'Us' else 'Germany')
rueg_all_df.head(2)

In [None]:
rueg_all_df.info()

### Basic metrics of the Metadata
Exploring the basic metrics of data we have and what it consists of
- find out what is defined as a 'heritage speaker'

In [None]:
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mode'] == 'spoken')]), 'spoken data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mode'] == 'written')]), 'written data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Formality'] == 'informal')]), 'informal data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Formality'] == 'formal')]), 'formal data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mono/Bilingual'] == 'Bilingual')]), 'bilingual data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mono/Bilingual'] == 'Monolingual')]), 'monolingual data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Language_of_Data'] == 'German')]), 'German data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Language_of_Data'] == 'English')]), 'English data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Age_Group'] == 'adult')]), 'adult data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Age_Group'] == 'adolescent')]), 'adolescent data files')

print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'D')]), 'German heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'E')]), 'English heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'T')]), 'Turkish heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'G')]), 'Greek heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'R')]), 'Russian heritage language data files')
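
The same counts can probably be read off more compactly with pandas' `value_counts` and `crosstab`. The sketch below uses a small toy DataFrame (the rows are invented for illustration, they are not RUEG data) with two of the column names from above.

```python
import pandas as pd

# Toy rows standing in for rueg_all_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Mode": ["spoken", "written", "spoken", "spoken"],
    "Formality": ["formal", "formal", "informal", "formal"],
})

# One line per column replaces one print statement per category.
mode_counts = toy_df["Mode"].value_counts()
print(mode_counts)

# A cross-tabulation shows how two columns combine.
mode_by_formality = pd.crosstab(toy_df["Mode"], toy_df["Formality"])
print(mode_by_formality)
```

On the real data, `rueg_all_df['Heritage_Language'].value_counts()` would also surface any unexpected codes without listing each one by hand.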



### Reading in the Texts
The data being read in right now is in the CoNLL format, and for now I'm just going to read in each entire text file (with POS, lemma, etc. annotations).

In [None]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/DE/BILINGUAL/*.txt', recursive = True)
de_bi_texts = []
DE_bi_files = []
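for file in files:
    ## the with-block closes the file even if reading fails
    with open(file, encoding='utf-8') as f:
        s = f.read()
    ## removesuffix drops only the extension; strip('.txt') strips characters from both ends
    f1 = file.split("BILINGUAL/",1)[1].removesuffix('.txt')
    de_bi_texts.append((f1, s))
    DE_bi_files.append(file)
de_bi_texts[:3]
DE_bi_files[:3]
## important to note that everything is tab separated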

In [None]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/DE/MONOLINGUAL/*.txt', recursive = True)
de_mono_texts = []
DE_mono_files = []
for file in files:
    with open(file, encoding='utf-8') as f:
        s = f.read()
    f1 = file.split("MONOLINGUAL/",1)[1].removesuffix('.txt')
    de_mono_texts.append((f1, s))
    DE_mono_files.append(file)

In [None]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/EN/BILINGUAL/*.txt', recursive = True)
en_bi_texts = []
EN_bi_files = []
for file in files:
    with open(file, encoding='utf-8') as f:
        s = f.read()
    f1 = file.split("BILINGUAL/",1)[1].removesuffix('.txt')
    ## Same thing: this text file has no POS marking, so it is excluded.
    ## The comparison needs the full name; 'USbi77FG' alone never matches f1.
    if f1 != 'USbi77FG_fwE':
        en_bi_texts.append((f1, s))
        EN_bi_files.append(file)

In [None]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/EN/MONOLINGUAL/*.txt', recursive = True)
en_mono_texts = []
EN_mono_files = []
for file in files:
    with open(file, encoding='utf-8') as f:
        s = f.read()
    f1 = file.split("MONOLINGUAL/",1)[1].removesuffix('.txt')
    en_mono_texts.append((f1, s))
    EN_mono_files.append(file)

In [None]:
## Let's compare the number of metadata files with the number of text files
print('DE mono metadata Files: ', len(DE_mono_filename))
print('DE bi metadata Files: ', len(DE_bi_filename))
print('EN mono metadata Files: ', len(EN_mono_filename))
print('EN bi metadata Files: ', len(EN_bi_filename))
print('DE mono text: ', len(de_mono_texts))
print('DE bi text: ', len(de_bi_texts))
print('EN mono text: ', len(en_mono_texts))
print('EN bi text: ', len(en_bi_texts))


As you can see, the German documents have some discrepancies: there are more CoNLL files than meta files, which likely means that some participants had multiple recordings. For now, I'm going to leave these two DataFrames separate because of this.
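
One way to see exactly which files cause the mismatch is a set difference between the two lists of IDs. This is a minimal sketch with invented toy IDs (they are placeholders, not real RUEG filenames); on the real data the inputs would be something like `DE_bi_filename` and the first element of each tuple in `de_bi_texts`.

```python
# Toy IDs for illustration only; they are not taken from the corpus.
meta_ids = ["DEbi01FT_fsD", "DEbi02MT_iwD"]
conll_ids = ["DEbi01FT_fsD", "DEbi02MT_iwD", "DEbi03FR_fsD"]

# IDs that have a CoNLL file but no metadata file, and the other way around.
only_in_conll = sorted(set(conll_ids) - set(meta_ids))
only_in_meta = sorted(set(meta_ids) - set(conll_ids))

print("CoNLL without metadata:", only_in_conll)
print("Metadata without CoNLL:", only_in_meta)
```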

## Manually Parsing CoNLL
I have never worked with the CoNLL format, so I'm going to take just one entry and play around with it until it looks the way I want before touching the entire dataset.

In [None]:
foo = de_bi_texts[0][1]
foo

In [None]:
foo = foo.replace('\t', ' ').split('\n')
foo = [x.split() for x in foo]
foo[:4]
## ok, I like this a lot: each line becomes its own list, so I can feasibly
## label each CoNLL annotation accordingly

In [None]:
conLL_ann = []
for lines in foo:
    if len(lines) == 10:
        conLL_ann.append({'id': lines[0], 'token': lines[1], 'lemma': lines[2], 
                            'pos_uni': lines[3], 'pos_lang': lines[4], 'morphology': lines[5], 
                            'head': lines[6], 'relationship': lines[7], 'misc1': lines[8],
                            'misc2': lines[9]})

In [None]:
print(len(conLL_ann))
print([x['lemma'] for x in conLL_ann][:20])
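
The manual approach above flattens the whole file into one list of tokens. As a rough sketch of a variant that keeps sentence boundaries, the function below splits on tabs only and starts a new sentence at each blank line. The function name, the field list constant, and the sample string are all made up for illustration; the field names simply mirror the dictionary keys used above.

```python
# Field names mirror the keys used in the manual parsing cell above.
CONLL_FIELDS = ["id", "token", "lemma", "pos_uni", "pos_lang",
                "morphology", "head", "relationship", "misc1", "misc2"]

def parse_conll_text(text):
    """Return a list of sentences, each a list of dicts with one dict per token line."""
    sentences, current = [], []
    for line in text.split("\n"):
        if not line.strip():
            # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        columns = line.split("\t")
        if len(columns) == len(CONLL_FIELDS):
            current.append(dict(zip(CONLL_FIELDS, columns)))
    if current:
        sentences.append(current)  # last sentence when the text has no trailing blank line
    return sentences

# Invented sample, not taken from the corpus.
sample = (
    "1\tDas\tder\tPRON\tPDS\t_\t2\tnsubj\t_\t_\n"
    "2\tgeht\tgehen\tVERB\tVVFIN\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tJa\tja\tINTJ\tITJ\t_\t0\troot\t_\t_\n"
)
parsed = parse_conll_text(sample)
print(len(parsed), [tok["lemma"] for tok in parsed[0]])
```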

## Stanza Parsing

In [None]:
import stanza
from stanza.utils.conll import CoNLL
from stanza.models.common.doc import Document

In [None]:
file = DE_bi_files[0]
doc = CoNLL.conll2doc(file)

In [None]:
doc

This very helpful bit of code originates [here](https://github.com/StabiBerlin/Stanza-Conllu-2Corpus/blob/main/stanza-conllu-2-pos-lat.ipynb)

In [None]:
def convert_conllu_to_pos(input_path, pos_list):
    """Append one list of (token, UPOS) tuples for the file at input_path to pos_list."""
    tokens = []

    with open(input_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            # skip blank lines (sentence breaks) and comment lines
            if not line or line.startswith("#"):
                continue
            columns = line.split("\t")
            if len(columns) > 3:
                word_text = columns[1]  # Token
                upos = columns[3]  # Universal POS Tag
                tokens.append((word_text, upos))

    # the earlier version overwrote its result at every sentence break, so only
    # one sentence per file survived; this keeps every token in the file
    pos_list.append(tokens)

In [None]:
debi_pos = []
for file in DE_bi_files:
    convert_conllu_to_pos(file, debi_pos)
## flatten once, after all files are read, into a single list of (token, UPOS) tuples
flat_debi_pos = [pair for doc_pairs in debi_pos for pair in doc_pairs]
print(flat_debi_pos[:10])

In [None]:
## shows up like a list of dictionaries for each token, very similar to the manual parsing attempt
## but it's really long so I'm not going to print it
## debi_con_str[0]

In [None]:
demono_con = []
for file in DE_mono_files:   ## German monolingual files, not DE_bi_files
    doc = CoNLL.conll2doc(file)
    demono_con.append(doc)

In [None]:
enbi_con = []
for file in EN_bi_files:   ## English bilingual files, not DE_bi_files
    doc = CoNLL.conll2doc(file)
    enbi_con.append(doc)

In [None]:
enmono_con = []
for file in EN_mono_files:   ## English monolingual files, not DE_bi_files
    doc = CoNLL.conll2doc(file)
    enmono_con.append(doc)

In [None]:
type(enmono_con)

## Practice spaCy Parsing CoNLL
It will be better to use an actual CoNLL parser so that all the rich syntactic information about dependency trees isn't lost.

In [None]:
import spacy

In [None]:
from spacy_conll import init_parser
from spacy_conll.parser import ConllParser

from spacy import displacy
engconllparser = ConllParser(init_parser("en_core_web_sm", "spacy"))

In [None]:
connebidemo = en_bi_texts[20][1]
print((connebidemo[:962]))

In [None]:
connebidemo = connebidemo[:-1]   ## drop the final character (an extra trailing newline)

In [None]:
nlp = init_parser("en_core_web_sm", "spacy", include_headers=False)
parser = ConllParser(nlp)
connebidemo2 = parser.parse_conll_text_as_spacy(connebidemo)
for sent in connebidemo2.sents:
    print(sent._.conll_pd)
    # displacy.render(sent, style='dep', options={"compact": True})
    # (renders the sentences as trees, but takes up a LOT of screen space)
    for word in sent[:2]:
        print(word, word.lemma_, word.pos_, word.dep_)
    print()

In [None]:
connebidemo2._.conll_str[:100]

In [None]:
## trying it on the german, but we need a different (german) pipeline for this
conndbidemo = de_bi_texts[9][1]
conndbidemo[-10:]
## sooo pesky

In [None]:
conndbidemo = conndbidemo[:-1]   ## drop the final character (an extra trailing newline)
dnlp = init_parser("de_core_news_sm", "spacy", include_headers=False)
dparser = ConllParser(dnlp)
conndbidemo2 = dparser.parse_conll_text_as_spacy(conndbidemo)

In [None]:
conndbidemo2._.conll_str

### Pause
Firstly, I want to thank Na-Rae for helping with spacy_conll. The spacy_conll library is a little temperamental and rages against an extra newline character at the end of a text. What is not pictured is the hours Na-Rae and I spent trying to figure out what wasn't working until she found the cause.

Secondly, I know that my first bit of parsing by hand is redundant and will not be used, but it gave some useful information about the documents regardless, because there are some irregular documents in here that I'm sure spacy_conll will throw a fit about. 

All this being said, it's finally time to work on spacy-parsing all the texts.
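
The trailing-newline fix described above can be isolated in a small helper. This is a minimal sketch under the assumption that the only problem is a run of extra newlines at the very end of the text; `trim_trailing_newlines` is a name made up here, and the function does nothing beyond repeating the trimming loop used in this notebook.

```python
def trim_trailing_newlines(conll_text):
    """Remove trailing newlines one at a time until the text no longer ends in a blank line."""
    while conll_text[-2:] == "\n\n":
        conll_text = conll_text[:-1]
    return conll_text

# Invented sample: one token line followed by too many newlines.
print(repr(trim_trailing_newlines("1\tHello\thello\n\n\n")))
```

Keeping the trimming in one place means the English and German parsing functions cannot drift apart on this step.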

## Attempted spaCy Parsing CoNLL

In [None]:
## English spaCy Parser
import re
nlp = init_parser("en_core_web_sm", "spacy", include_headers=False)
parser = ConllParser(nlp)
def parseEnTexts(constr, conlist):
    # trim extra trailing newlines; this also covers cases where the end is \n\n\n
    while constr[-2:] == '\n\n':
        constr = constr[:-1]

    # skip texts whose first token has "_" in the POS column; returning early
    # avoids using an undefined variable when nothing was parsed
    if re.match(r'\d+\t\w+\t\w+\t_', constr) is not None:
        return

    doc = parser.parse_conll_text_as_spacy(constr)
    for sent in doc.sents:
        conlist.append(sent._.conll_str)

In [None]:
## German spaCy Parser
dnlp = init_parser("de_core_news_sm", "spacy", include_headers=False)
dparser = ConllParser(dnlp)
def parseDeTexts(constr, conlist):
    # trim extra trailing newlines
    while constr[-2:] == '\n\n':
        constr = constr[:-1]
    # skip texts whose first token has "_" in the POS column
    if re.match(r'\d+\t\w+\t\w+\t_', constr) is not None:
        return
    # use the German parser (dparser), not the English one
    doc = dparser.parse_conll_text_as_spacy(constr)
    for sent in doc.sents:
        conlist.append(sent._.conll_str)

    # if [re.match(r'\d+\t\w+\t\w+\t_', x )for x in constr.splitlines()] != None:
    #     pass
    # else:
    #     constr2 = parser.parse_conll_text_as_spacy(constr)


In [None]:
[x[1] for x in en_bi_texts][:3]

In [None]:
# en_bi_texts = [x[1] for x in en_bi_texts]
# en_mono_texts = [x[1] for x in en_mono_texts]
# de_bi_texts = [x[1] for x in de_bi_texts]
# de_mono_texts = [x[1] for x in de_mono_texts]

In [None]:
# ebi_con_str = []
# [parseEnTexts(x, ebi_con_str) for x in en_bi_texts]

This causes an error that says:

`pos` value "_" is not a valid Universal Dependencies tag. Non-UD tags should use the `tag` property.

That's definitely a problem, but let's see which other corpora have problems before we move on to cleaning the CoNLL.

In [None]:
[x[1] for x in en_mono_texts][:3]

In [None]:
testlist = []
parseEnTexts(en_mono_texts[1][1], testlist)
testlist

In [None]:
emo_con_str = []
[parseEnTexts(x[1], emo_con_str) for x in en_mono_texts][:3]

## shows up as none, but that's not really an issue

In [None]:
## just one sentence as opposed to the whole text
emo_con_str[0]

In [None]:
print(emo_con_str[0])

Looks like we will not need to do any cleaning for the English monolingual data! That's great, so let's move on to the German data.

In [None]:
de_bi_texts[1][0]

In [None]:
#debi_con_str = []
#[parseDeTexts(x, debi_con_str) for x in de_bi_texts]


## same issue as before with the English bilingual data

In [None]:
#demo_con_str = []
#[parseDeTexts(x, demo_con_str) for x in de_mono_texts]

## again, same issues. Onto cleaning

## Data Cleaning

As we saw with the manual parsing, and given that many of these texts have an extra newline character, we're going to have to clean up some documents before creating the corpora to use for analysis later.

Here are the problem sets that need cleaning:
- English Bilingual
- German Bilingual
- German Monolingual

We got a hint of what was wrong in the earlier manual parsing, so now it's time to find the actual errors and fix them.

In [None]:
[x[1] for x in en_bi_texts][:3]

In [None]:
enbi_con_str = []
## English texts need the English parser (parseEnTexts), not the German one
[parseEnTexts(x[1], enbi_con_str) for x in en_bi_texts][:3]
## used to produce an error

In [None]:
len(enbi_con_str)
## so line 4716 was the breaking point: the POS for the whole file was all _ so it was excluded

After some investigating, I have found the file that is to blame: USbi77FG_fwE.txt
For some reason, it has no POS markings. I believe this is the only file that is broken in this way. For this reason, whenever I read in the corpus from now on I will exclude this file and everything associated with it (metadata, audio, etc.).
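
A check like the one below could flag such files before any parser sees them. It is only a sketch: it assumes the universal POS tag sits in the fourth tab-separated column (index 3) and treats a text as unannotated when every token has `_` there. The function name and the sample strings are invented for illustration.

```python
def lacks_pos_annotation(conll_text, upos_column=3):
    """Return True if the text has token lines and every one of them has "_" as its UPOS value."""
    upos_values = []
    for line in conll_text.split("\n"):
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence breaks and comments
        columns = line.split("\t")
        if len(columns) > upos_column:
            upos_values.append(columns[upos_column])
    return bool(upos_values) and all(value == "_" for value in upos_values)

# Invented samples, not taken from the corpus.
unannotated = "1\tso\t_\t_\t_\t_\t_\t_\t_\t_\n2\tyeah\t_\t_\t_\t_\t_\t_\t_\t_\n"
annotated = "1\tso\tso\tADV\tRB\t_\t0\troot\t_\t_\n"

print(lacks_pos_annotation(unannotated), lacks_pos_annotation(annotated))
```

Applied to the `(name, text)` tuples read in earlier, something like `[name for name, text in en_bi_texts if lacks_pos_annotation(text)]` would list the affected files by name.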

In [None]:
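debi_con_str = []
bad_debi = []
## loop over the texts themselves; looping over the empty debi_con_str does nothing
for x in de_bi_texts:
    try:
        parseDeTexts(x[1], debi_con_str)
    except ValueError:
        bad_debi.append(x[0])   ## remember the name of the file that failed
#[parseDeTexts(x[1], debi_con_str) for x in de_bi_texts]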

In [None]:
bad_debi

In [None]:
len(debi_con_str)
## so line 80 is the issue

In [None]:
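debi_strs = []
for _, text in de_bi_texts:   ## each entry is a (name, text) tuple
    while text[-2:] == '\n\n':
        text = text[:-1]
    ## sentences are separated by a blank line
    for sent in text.split('\n\n'):
        debi_strs.append(sent)
len(debi_strs)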

In [None]:
print(debi_strs[80])

In [None]:
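## assumption: `pos` is not defined anywhere above, so it is rebuilt here with
## the same regex the parse functions use to spot a "_" in the POS column
pos = [re.match(r'\d+\t\w+\t\w+\t_', s) for s in debi_strs]
no_pos = [m for m in pos if m is not None]
print(len(no_pos))
for m in no_pos:
    print(m)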

In [None]:
debi_strs[77][-2:]

In [None]:
print(debi_strs[346])
print(debi_strs[347])
print(debi_strs[348])
## it doesn't look like this error is as simple as a text with missing POS (which is a good thing!)
## but more investigation is required!

In [None]:
demo_con_str = []
[parseDeTexts(x[1], demo_con_str) for x in de_mono_texts]   ## x is a (name, text) tuple, so pass the text

In [None]:
len(demo_con_str)

In [None]:
demo_strs = []
for _, text in de_mono_texts:   ## each entry is a (name, text) tuple
    while text[-2:] == '\n\n':
        text = text[:-1]
    for sent in text.split('\n\n'):
        demo_strs.append(sent)
len(demo_strs)

In [None]:
print(demo_strs[992])
print(demo_strs[993])
print(demo_strs[994])
print(demo_strs[995])
## (the \n looks pesky but it's likely a document break, which should be fine)

## Corpora Creation for Later Exploration
We finally have all our sentences parsed. Let's take one final look before pickling them for use in the exploration of the data.

In [None]:
print(len(emo_con_str))
print(emo_con_str[0])

In [None]:
print(len(enbi_con_str))
print(enbi_con_str[0])
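
For the pickling step itself, a minimal sketch could look like the following. It writes to a temporary directory so that it runs anywhere; the file name `emo_con_str.pkl` and the sample sentence are placeholders I made up, and the real notebook would write the actual lists to a path of its choosing.

```python
import os
import pickle
import tempfile

# Invented stand-in for a list such as emo_con_str.
sentences = ["1\tHello\thello\tINTJ\tUH\t_\t0\tROOT\t_\t_"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "emo_con_str.pkl")  # placeholder file name
    # write the list to disk ...
    with open(path, "wb") as handle:
        pickle.dump(sentences, handle)
    # ... and read it back to confirm the round trip
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == sentences)
```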