# Loading in the RUEG Corpus
Goal: Create a DataFrame that makes the data easy to use later on

## Table of Contents
1. [Loading in the Data]()

    A. [Reading in Metadata]()

    B. [Basic Metrics of Metadata]()

    C. [Reading in the Texts]()
2. [Manually Parsing ConLL]()
3. [Practice Spacy Parsing ConLL]()
4. [Attempted Spacy Parsing ConLL]()
5. [Data Cleaning]()
6. [Spacy Parsing for Real]()
7. [Corpora Creation for Later Exploration]()



## Loading in the Data
I'm going to start with four separate dataframes

What to include in the DataFrame:
- speaker ID
- language
- bilingual/monolingual
- formality
- mode
- languages
- age group
- gender
- raw audio
- transcription
- POS transcription (although I don't know whether I'll be using it)

In [463]:
%pprint

Pretty printing has been turned OFF


In [464]:
import glob
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/DE/BILINGUAL/*.meta', recursive = True)
DE_bi_filename= []
for f in files:
    ## slice off the extension; str.strip('.meta') removes a set of characters, not the suffix
    DE_bi_filename.append(f.split("BILINGUAL/",1)[1][:-len('.meta')])
DE_bi_filename[:3]

['DEbi65FR_isD', 'DEbi83FR_iwD', 'DEbi52FR_isD']
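
A note on trimming extensions from paths: `str.strip('.meta')` treats its argument as a set of characters to remove from both ends, not as a literal suffix. Names that start and end with uppercase letters, like the ones here, come through unharmed, but other names would not. A minimal sketch of a more general approach, assuming only the standard library (the helper name `file_stem` and the lowercase example name are made up for illustration):

```python
from pathlib import PurePosixPath

def file_stem(path):
    """Return the file name without its directory or final extension."""
    return PurePosixPath(path).stem

# str.strip removes any of the characters '.', 'm', 'e', 't', 'a' from both ends,
# so this invented lowercase name loses far more than its extension.
risky = 'meta_data.meta'.strip('.meta')
safe = file_stem('some/dir/meta_data.meta')
print(repr(risky), repr(safe))
```

For the corpus paths, `file_stem(f)` would replace both the `split` on the folder name and the extension trimming.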

In [465]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/DE/MONOLINGUAL/*.meta', recursive = True)
DE_mono_filename= []
for f in files:
    DE_mono_filename.append(f.split("MONOLINGUAL/",1)[1][:-len('.meta')])

In [466]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/EN/BILINGUAL/*.meta', recursive = True)
EN_bi_filename= []
for f in files:
    f = f.split("BILINGUAL/",1)[1][:-len('.meta')]
    if f != 'USbi77FG_fwE':     ## skip this file: it has no POS annotations, so I cannot use it
        EN_bi_filename.append(f)

In [467]:
files = glob.glob('RUEG_corpus_0.3.0/exmaralda/RUEG/EN/MONOLINGUAL/*.meta', recursive = True)
EN_mono_filename= []
for f in files:
    EN_mono_filename.append(f.split("MONOLINGUAL/",1)[1][:-len('.meta')])

In [468]:
## Getting Some Basic Stats on What We're Looking at
print('DE mono Files: ', len(DE_mono_filename))
print('DE bi Files: ', len(DE_bi_filename))
print('EN mono Files: ', len(EN_mono_filename))
print('EN bi Files: ', len(EN_bi_filename))


DE mono Files:  240
DE bi Files:  559
EN mono Files:  64
EN bi Files:  443


### Reading in Metadata 

Some things to keep in mind:
- there are far fewer monolingual speakers than bilingual speakers
- some bilingual speakers will overlap, as they appear as bilingual in both languages (which probably accounts for the disparity in numbers)

In [469]:
import pandas as pd
de_mono_df = pd.DataFrame(DE_mono_filename, index = DE_mono_filename)
de_bi_df = pd.DataFrame(DE_bi_filename, index = DE_bi_filename)
en_mono_df = pd.DataFrame(EN_mono_filename, index = EN_mono_filename)
en_bi_df = pd.DataFrame(EN_bi_filename, index = EN_bi_filename)
de_mono_df.columns = ['Filename']
de_bi_df.columns = ['Filename']
en_mono_df.columns = ['Filename']
en_bi_df.columns = ['Filename']

In [470]:
de_mono_df['Mono/Bilingual'] = 'Monolingual'
de_bi_df['Mono/Bilingual'] = 'Bilingual'
en_mono_df['Mono/Bilingual'] = 'Monolingual'
en_bi_df['Mono/Bilingual'] = 'Bilingual'
de_mono_df['Language_of_Data'] = 'German'
de_bi_df['Language_of_Data'] = 'German'
en_mono_df['Language_of_Data'] = 'English'
en_bi_df['Language_of_Data'] = 'English'

In [471]:
## much easier to combine them all now and .loc them later when needed
rueg_all_df = pd.concat([de_mono_df, de_bi_df, en_mono_df, en_bi_df])

rueg_all_df['Mode'] = rueg_all_df.Filename.map(lambda x: x[-2])
rueg_all_df['Formality'] = rueg_all_df.Filename.map(lambda x: x[-3])
rueg_all_df['Gender'] = rueg_all_df.Filename.map(lambda x: x[-6])
rueg_all_df['Heritage_Language'] = rueg_all_df.Filename.map(lambda x: x[-5])
rueg_all_df['Age_Group'] = rueg_all_df.Filename.map(lambda x: x[-8:-6])
rueg_all_df['Age_Group'] = rueg_all_df.Age_Group.map(lambda x: 'adolescent' if int(x) >= 49 else 'adult')
rueg_all_df['Country_of_Data'] = rueg_all_df.Filename.map(lambda x: x[0:2])
rueg_all_df.head(3)

## ideally I fully write out spoken/written and the age group

| | Filename | Mono/Bilingual | Language_of_Data | Mode | Formality | Gender | Heritage_Language | Age_Group | Country_of_Data |
|---|---|---|---|---|---|---|---|---|---|
| DEmo17MD_fsD | DEmo17MD_fsD | Monolingual | German | s | f | M | D | adult | DE |
| DEmo20FD_fwD | DEmo20FD_fwD | Monolingual | German | w | f | F | D | adult | DE |
| DEmo71FD_isD | DEmo71FD_isD | Monolingual | German | s | i | F | D | adolescent | DE |
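
The positional slicing above assumes every filename follows the same layout. As a hedged sketch, the layout can be spelled out in a regular expression so that a name that does not fit raises an error instead of yielding wrong metadata. The pattern is inferred from names such as `DEmo17MD_fsD` and is an assumption, not an official RUEG specification; `FILENAME_PATTERN` and `parse_filename` are made-up names.

```python
import re

# Assumed layout, inferred from names such as 'DEmo17MD_fsD':
# country, bi/mo, two-digit speaker number, gender, heritage language,
# underscore, formality, mode, language of the recording.
FILENAME_PATTERN = re.compile(
    r'^(?P<country>[A-Za-z]{2})'
    r'(?P<group>bi|mo)'
    r'(?P<speaker>\d{2})'
    r'(?P<gender>[FM])'
    r'(?P<heritage>[A-Z])'
    r'_(?P<formality>[fi])'
    r'(?P<mode>[sw])'
    r'(?P<language>[A-Z])$'
)

def parse_filename(name):
    """Split a filename into its metadata codes, or raise ValueError."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f'unexpected filename layout: {name!r}')
    return match.groupdict()

print(parse_filename('DEmo17MD_fsD'))
```

Running this over all filenames first would surface any irregular name before the columns are built.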


In [472]:
## making sure nothing is null before i edit the dataframe more
print(set(rueg_all_df['Gender'].tolist()))
print(set(rueg_all_df['Formality'].tolist()))
print(set(rueg_all_df['Mode'].tolist()))
print(set(rueg_all_df['Heritage_Language'].tolist()))
rueg_all_df.info()

{'F', 'M'}
{'i', 'f'}
{'s', 'w'}
{'T', 'R', 'G', 'E', 'D'}
<class 'pandas.core.frame.DataFrame'>
Index: 1306 entries, DEmo17MD_fsD to USbi04FD_fsE
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Filename           1306 non-null   object
 1   Mono/Bilingual     1306 non-null   object
 2   Language_of_Data   1306 non-null   object
 3   Mode               1306 non-null   object
 4   Formality          1306 non-null   object
 5   Gender             1306 non-null   object
 6   Heritage_Language  1306 non-null   object
 7   Age_Group          1306 non-null   object
 8   Country_of_Data    1306 non-null   object
dtypes: object(9)
memory usage: 102.0+ KB


In [473]:
rueg_all_df['Mode'] = rueg_all_df.Mode.map(lambda x: 'spoken' if x == 's' else 'written')
rueg_all_df['Formality'] = rueg_all_df.Formality.map(lambda x: 'informal' if x == 'i' else 'formal')
rueg_all_df['Gender'] = rueg_all_df.Gender.map(lambda x: 'female' if x == 'F' else 'male')
rueg_all_df['Country_of_Data'] = rueg_all_df.Country_of_Data.map(lambda x: 'United States' if x == 'US' or x == 'Us' else 'Germany')
rueg_all_df.head(2)

| | Filename | Mono/Bilingual | Language_of_Data | Mode | Formality | Gender | Heritage_Language | Age_Group | Country_of_Data |
|---|---|---|---|---|---|---|---|---|---|
| DEmo17MD_fsD | DEmo17MD_fsD | Monolingual | German | spoken | formal | male | D | adult | Germany |
| DEmo20FD_fwD | DEmo20FD_fwD | Monolingual | German | written | formal | female | D | adult | Germany |
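
One thing to keep in mind with the lambdas above: the `else` branch gives the fallback label to any unexpected code (for example, every country code other than `US`/`Us` becomes Germany). The set check in the previous cell guards against that here. A small sketch of a stricter alternative uses explicit lookup tables that fail loudly; the names `MODE_LABELS`, `FORMALITY_LABELS`, `GENDER_LABELS` and `expand_code` are hypothetical.

```python
MODE_LABELS = {'s': 'spoken', 'w': 'written'}
FORMALITY_LABELS = {'i': 'informal', 'f': 'formal'}
GENDER_LABELS = {'F': 'female', 'M': 'male'}

def expand_code(code, labels):
    """Look up the full label for a one-letter code, rejecting unknown codes."""
    try:
        return labels[code]
    except KeyError:
        raise ValueError(
            f'unknown code {code!r}; expected one of {sorted(labels)}'
        ) from None

print(expand_code('s', MODE_LABELS), expand_code('F', GENDER_LABELS))
```

pandas' `Series.map` also accepts a dictionary directly, in which case unmapped codes become `NaN` and show up in `.info()`.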


In [474]:
rueg_all_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1306 entries, DEmo17MD_fsD to USbi04FD_fsE
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Filename           1306 non-null   object
 1   Mono/Bilingual     1306 non-null   object
 2   Language_of_Data   1306 non-null   object
 3   Mode               1306 non-null   object
 4   Formality          1306 non-null   object
 5   Gender             1306 non-null   object
 6   Heritage_Language  1306 non-null   object
 7   Age_Group          1306 non-null   object
 8   Country_of_Data    1306 non-null   object
dtypes: object(9)
memory usage: 102.0+ KB


### Basic metrics of the Metadata
Exploring the basic metrics of data we have and what it consists of
- find out what is defined as a 'heritage speaker'

In [475]:
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mode'] == 'spoken')]), 'spoken data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mode'] == 'written')]), 'written data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Formality'] == 'informal')]), 'informal data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Formality'] == 'formal')]), 'formal data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mono/Bilingual'] == 'Bilingual')]), 'bilingual data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Mono/Bilingual'] == 'Monolingual')]), 'monolingual data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Language_of_Data'] == 'German')]), 'German data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Language_of_Data'] == 'English')]), 'English data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Age_Group'] == 'adult')]), 'adult data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Age_Group'] == 'adolescent')]), 'adolescent data files')

print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'D')]), 'German heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'E')]), 'English heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'T')]), 'Turkish heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'G')]), 'Greek heritage language data files')
print('There are', len(rueg_all_df.loc[(rueg_all_df['Heritage_Language'] == 'R')]), 'Russian heritage language data files')



There are 653 spoken data files
There are 653 written data files
There are 654 informal data files
There are 652 formal data files
There are 1002 bilingual data files
There are 304 monolingual data files
There are 799 German data files
There are 507 English data files
There are 595 adult data files
There are 711 adolescent data files
There are 327 German heritage language data files
There are 64 English heritage language data files
There are 260 Turkish heritage language data files
There are 267 Greek heritage language data files
There are 388 Russian heritage language data files
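
The per-category counts above can also be produced in a loop rather than one `print` per value. A minimal sketch with `collections.Counter`, using three invented rows in place of the real dataframe (the helper name `tally` is made up):

```python
from collections import Counter

def tally(records, field):
    """Count how often each value of `field` occurs across the records."""
    return Counter(record[field] for record in records)

# Invented toy rows standing in for the metadata dataframe.
rows = [
    {'Mode': 'spoken', 'Formality': 'formal'},
    {'Mode': 'written', 'Formality': 'formal'},
    {'Mode': 'spoken', 'Formality': 'informal'},
]

for field in ('Mode', 'Formality'):
    for value, count in tally(rows, field).most_common():
        print(f'There are {count} {value} data files')
```

On the dataframe itself, `rueg_all_df['Mode'].value_counts()` gives the same kind of tally for a single column.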


### Reading in the Texts
The data format being read in right now is the CoNLL format, and for now I'm just going to read in the entire text file (with POS, lemma, etc. annotations)

In [476]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/DE/BILINGUAL/*.txt', recursive = True)
de_bi_texts = []
for file in files:
    with open(file) as f:       ## the with-block closes the file even if reading fails
        s = f.read()
    de_bi_texts.append((file.split("BILINGUAL/",1)[1][:-len('.txt')], s))
de_bi_texts[:3]
## important to note that everything is tab separated

[('USbi50FD_fsD', '1\täh\täh\tINTJ\tNGHES\t_\t0\troot\t_\t_\n2\thello\thello\tX\tFM\t_\t3\tdep\t_\t_\n3\tthis\tthis\tX\tFM\tPronType=Dem\t4\tdep\t_\t_\n4\tis\tbe\tX\tFM\tMood=Ind|Person=3|Tense=Pres\t1\tdep\t_\t_\n5\tfile\tfile\tX\tFM\tNumber=Sing\t4\tdep\t_\t_\n6\tNummer\tNummer\tNOUN\tNN\tCase=Nom|Gender=Fem|Number=Sing\t8\tnsubj\t_\t_\n7\tF\tF\tPROPN\tNE\t_\t6\tappos\t_\t_\n8\täh\täh\tINTJ\tNGHES\t_\t9\tpunct\t_\t_\n9\t16\t@card@\tPROPN\tNE\tNumType=Card\t5\tappos\t_\t_\n\n1\tja\tja\tINTJ\tNGIRR\t_\t2\tadvmod\t_\t_\n2\tokay\tokay\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n3\täh\täh\tINTJ\tNGHES\t_\t2\tdep\t_\t_\n\n1\tich\tich\tPRON\tPPER\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t6\tnsubj\t_\t_\n2\thabe\thaben\tAUX\tVAFIN\tMood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin\t6\taux\t_\t_\n3\tgerade\tgerade\tADV\tADV\t_\t6\tadvmod\t_\t_\n4\tein\tein\tDET\tART\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing|PronType=Art\t5\tdet\t_\t_\n5\tUnfall\tUnfall\tNOUN\tNN\tCase=Acc|Gender=Masc|Number=S

In [477]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/DE/MONOLINGUAL/*.txt', recursive = True)
de_mono_texts = []
for file in files:
    with open(file) as f:
        s = f.read()
    de_mono_texts.append((file.split("MONOLINGUAL/",1)[1][:-len('.txt')], s))

In [478]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/EN/BILINGUAL/*.txt', recursive = True)
en_bi_texts = []
for file in files:
    with open(file) as f:
        s = f.read()
    name = file.split("BILINGUAL/",1)[1][:-len('.txt')]
    if name != 'USbi77FG_fwE':          ## same as before: this file has no POS annotations, so it is excluded
        en_bi_texts.append((name, s))
len(en_bi_texts)

443

In [479]:
files = glob.glob('RUEG_corpus_0.3.0/conll/RUEG/EN/MONOLINGUAL/*.txt', recursive = True)
en_mono_texts = []
for file in files:
    with open(file) as f:
        s = f.read()
    en_mono_texts.append((file.split("MONOLINGUAL/",1)[1][:-len('.txt')], s))
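
The four reading loops differ only in the folder and in the one excluded file, so they could share a helper. A hedged sketch with `pathlib`: the function name `read_text_files`, its parameters, and the UTF-8 encoding are assumptions rather than anything the corpus documents. It also sorts the files, which the `glob` calls above do not.

```python
from pathlib import Path

def read_text_files(directory, pattern='*.txt', skip=()):
    """Return (stem, text) pairs for matching files, sorted by file name."""
    pairs = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.stem in skip:
            continue
        # Assumption: the files are UTF-8 encoded.
        pairs.append((path.stem, path.read_text(encoding='utf-8')))
    return pairs
```

A call such as `read_text_files('RUEG_corpus_0.3.0/conll/RUEG/EN/BILINGUAL', skip={'USbi77FG_fwE'})` would then stand in for the English bilingual loop.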

In [480]:
debi_text_df = pd.DataFrame(data = de_bi_texts)
debi_text_df.columns = ['Filename', 'Raw_ConLL']
debi_text_df = debi_text_df.set_index('Filename')
debi_text_df[:3]

Unnamed: 0_level_0,Raw_ConLL
Filename,Unnamed: 1_level_1
USbi50FD_fsD,1\täh\täh\tINTJ\tNGHES\t_\t0\troot\t_\t_\n2\th...
DEbi24FT_fwD,1\t﻿Sehr\t﻿Sehr\tADV\tADV\t_\t3\tadvmod\t_\t_\...
DEbi64MR_isD,1\thi\thi\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n2\td...


In [481]:
demono_text_df = pd.DataFrame(data = de_mono_texts)
demono_text_df.columns = ['Filename', 'Raw_ConLL']
demono_text_df = demono_text_df.set_index('Filename')
demono_text_df[:3]

Unnamed: 0_level_0,Raw_ConLL
Filename,Unnamed: 1_level_1
DEmo47MD_isD,1\thi\thi\tINTJ\tNGIRR\t_\t2\tpunct\t_\t_\n2\t...
DEmo45FD_iwD,1\tIch\tich\tPRON\tPPER\tCase=Nom|Number=Sing|...
DEmo22FD_fwD,1\t﻿Zeugenaussage\tZeugenaussage\tNOUN\tNN\tCa...


In [482]:
enbi_text_df = pd.DataFrame(data = en_bi_texts)
enbi_text_df.columns = ['Filename', 'Raw_ConLL']
enbi_text_df = enbi_text_df.set_index('Filename')
enbi_text_df[:3]

Unnamed: 0_level_0,Raw_ConLL
Filename,Unnamed: 1_level_1
USbi01FG_iwE,1\they\tHey\tINTJ\tITJ\t_\t2\tdiscourse\t_\t_\...
USbi13FR_fsE,1\thello\thello\tINTJ\tITJ\t_\t0\troot\t_\t_\n...
USbi08FR_fwE,1\t﻿I\ti\tPROPN\tPNP\tNumber=Sing|Person=1|Pro...


In [483]:
enmono_text_df = pd.DataFrame(data = en_mono_texts)
enmono_text_df.columns = ['Filename', 'Raw_ConLL']
enmono_text_df = enmono_text_df.set_index('Filename')
enmono_text_df[:3]

Unnamed: 0_level_0,Raw_ConLL
Filename,Unnamed: 1_level_1
USmo70ME_isE,1\tso\tso\tINTJ\tITJ\t_\t11\tdiscourse\t_\t_\n...
USmo66FE_iwE,1\tHi\thi\tINTJ\tITJ\tDegree=Pos|NumType=Ord\t...
USmo01FE_fwE,1\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|Pron...


In [484]:
## Let's compare the text sizes
print('DE mono metadata Files: ', len(DE_mono_filename))
print('DE bi metadata Files: ', len(DE_bi_filename))
print('EN mono metadata Files: ', len(EN_mono_filename))
print('EN bi metadata Files: ', len(EN_bi_filename))
print('DE mono text: ', len(de_mono_texts))
print('DE bi text: ', len(de_bi_texts))
print('EN mono text: ', len(en_mono_texts))
print('EN bi text: ', len(en_bi_texts))


DE mono metadata Files:  240
DE bi metadata Files:  559
EN mono metadata Files:  64
EN bi metadata Files:  443
DE mono text:  256
DE bi text:  586
EN mono text:  64
EN bi text:  443


As you can see, the German documents have some discrepancies: there are more CoNLL files than meta files, which likely means that some participants had multiple recordings. For now, I'm going to keep these two dataframes separate because of this.
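
One way to check that hypothesis is to compare the two sets of names directly and see which text files have no metadata counterpart. A minimal sketch with set differences; the helper name `compare_names` and the toy names are invented for illustration.

```python
def compare_names(meta_names, text_names):
    """Report names present in only one of the two collections."""
    meta, text = set(meta_names), set(text_names)
    return {
        'text_only': sorted(text - meta),
        'meta_only': sorted(meta - text),
    }

# Invented toy names standing in for the real filename lists.
meta_toy = ['DEbi01FT_fsD', 'DEbi02MR_iwD']
text_toy = ['DEbi01FT_fsD', 'DEbi02MR_iwD', 'DEbi03FG_fsD']
print(compare_names(meta_toy, text_toy))
```

With the real data, the second argument would be the first element of each tuple in `de_bi_texts` or `de_mono_texts`.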

## Manually Parsing ConLL
I have never worked with the CoNLL format, so I'm going to take just one entry and play around with it until it looks the way I want before touching the entire dataset.

In [485]:
foo = de_bi_texts[0][1]
foo

'1\täh\täh\tINTJ\tNGHES\t_\t0\troot\t_\t_\n2\thello\thello\tX\tFM\t_\t3\tdep\t_\t_\n3\tthis\tthis\tX\tFM\tPronType=Dem\t4\tdep\t_\t_\n4\tis\tbe\tX\tFM\tMood=Ind|Person=3|Tense=Pres\t1\tdep\t_\t_\n5\tfile\tfile\tX\tFM\tNumber=Sing\t4\tdep\t_\t_\n6\tNummer\tNummer\tNOUN\tNN\tCase=Nom|Gender=Fem|Number=Sing\t8\tnsubj\t_\t_\n7\tF\tF\tPROPN\tNE\t_\t6\tappos\t_\t_\n8\täh\täh\tINTJ\tNGHES\t_\t9\tpunct\t_\t_\n9\t16\t@card@\tPROPN\tNE\tNumType=Card\t5\tappos\t_\t_\n\n1\tja\tja\tINTJ\tNGIRR\t_\t2\tadvmod\t_\t_\n2\tokay\tokay\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n3\täh\täh\tINTJ\tNGHES\t_\t2\tdep\t_\t_\n\n1\tich\tich\tPRON\tPPER\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t6\tnsubj\t_\t_\n2\thabe\thaben\tAUX\tVAFIN\tMood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin\t6\taux\t_\t_\n3\tgerade\tgerade\tADV\tADV\t_\t6\tadvmod\t_\t_\n4\tein\tein\tDET\tART\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing|PronType=Art\t5\tdet\t_\t_\n5\tUnfall\tUnfall\tNOUN\tNN\tCase=Acc|Gender=Masc|Number=Sing\t6\tobj\t_\t_\

In [486]:
foo = foo.replace('\t', ' ').split('\n')
foo = [x.split() for x in foo]
foo[:4]
## ok I like this list a lot with a list in each line and I can feasibly
## mark each conLL annotation accordingly

[['1', 'äh', 'äh', 'INTJ', 'NGHES', '_', '0', 'root', '_', '_'], ['2', 'hello', 'hello', 'X', 'FM', '_', '3', 'dep', '_', '_'], ['3', 'this', 'this', 'X', 'FM', 'PronType=Dem', '4', 'dep', '_', '_'], ['4', 'is', 'be', 'X', 'FM', 'Mood=Ind|Person=3|Tense=Pres', '1', 'dep', '_', '_']]

In [487]:
conLL_ann = []
for lines in foo:
    if len(lines) == 10:
        conLL_ann.append({'id': lines[0], 'token': lines[1], 'lemma': lines[2], 
                            'pos_uni': lines[3], 'pos_lang': lines[4], 'morphology': lines[5], 
                            'head': lines[6], 'relationship': lines[7], 'misc1': lines[8],
                            'misc2': lines[9]})

In [488]:
print(len(conLL_ann))
print([x['lemma'] for x in conLL_ann][:20])

155
['äh', 'hello', 'this', 'be', 'file', 'Nummer', 'F', 'äh', '@card@', 'ja', 'okay', 'äh', 'ich', 'haben', 'gerade', 'ein', 'Unfall', 'sehen', 'und', 'es']
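
The manual parse above flattens the whole file into one list, so sentence boundaries are lost, and splitting on any whitespace would mis-split a field that contains a space. A hedged sketch that keeps sentences apart and splits on tabs only. The field names follow the column header that spacy_conll prints (ID, FORM, LEMMA, and so on), `parse_conllu` is a made-up name, and the sample rows are copied from the outputs in this notebook.

```python
CONLLU_FIELDS = ('id', 'form', 'lemma', 'upos', 'xpos',
                 'feats', 'head', 'deprel', 'deps', 'misc')

def parse_conllu(text):
    """Return a list of sentences, each a list of token dictionaries."""
    sentences = []
    # Blank lines separate sentences; strip outer newlines first.
    for block in text.strip('\n').split('\n\n'):
        rows = [dict(zip(CONLLU_FIELDS, line.split('\t')))
                for line in block.split('\n')
                if line and not line.startswith('#')]
        if rows:
            sentences.append(rows)
    return sentences

sample = (
    '1\tja\tja\tINTJ\tNGIRR\t_\t2\tadvmod\t_\t_\n'
    '2\tokay\tokay\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n'
    '3\täh\täh\tINTJ\tNGHES\t_\t2\tdep\t_\t_\n'
    '\n'
    '1\they\tHey\tINTJ\tITJ\t_\t2\tdiscourse\t_\t_\n'
    '2\tEleni\tEleni\tNOUN\tNP0\t_\t0\troot\t_\t_\n'
    '\n'
)
sentences = parse_conllu(sample)
print(len(sentences), [token['lemma'] for token in sentences[0]])
```

Note that `zip` silently truncates rows with fewer than ten fields, so irregular lines would still need a separate check.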


## Practice Spacy Parsing ConLL
It will be better to use an actual CoNLL parser so that all the rich syntactic information about dependency trees isn't lost

In [489]:
import spacy

In [490]:
from spacy_conll import init_parser
from spacy_conll.parser import ConllParser

from spacy import displacy
engconllparser = ConllParser(init_parser("en_core_web_sm", "spacy"))

In [491]:
connebidemo = en_bi_texts[0][1]
print(connebidemo)

1	hey	Hey	INTJ	ITJ	_	2	discourse	_	_
2	Eleni	Eleni	NOUN	NP0	_	0	root	_	_
3	,	,	PUNCT	PUN	_	2	punct	_	_

1	let	let	VERB	VVB	Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin	0	root	_	_
2	me	i	PROPN	PNP	Number=Sing|Person=1|PronType=Prs	1	obj	_	_
3	tell	tell	VERB	VVI	Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	1	xcomp	_	_
4	you	you	PROPN	PNP	Person=2|PronType=Prs	6	nsubj	_	_
5	what	what	DET	DTQ	Number=Sing|PronType=Rel	6	det	_	_
6	happened	happen	VERB	VVD	Mood=Ind|Person=3|Tense=Past|VerbForm=Fin	3	ccomp	_	_
7	today	today	ADV	AV0	_	6	advmod	_	_
8	!	!	PUNCT	SENT	_	6	punct	_	_

1	I	I	PROPN	PNP	Number=Sing|Person=1|PronType=Prs	2	nsubj	_	_
2	saw	see	VERB	VVD	Mood=Ind|Person=3|Tense=Past|VerbForm=Fin	0	root	_	_
3	a	a	DET	AT0	Definite=Ind|Number=Sing|PronType=Art	6	det	_	_
4	minor	minor	ADJ	AJ0	Degree=Pos	6	amod	_	_
5	car	car	NOUN	NN1	Number=Sing	6	compound	_	_
6	accident	accident	NOUN	NN1	Number=Sing	2	obj	_	_
7	when	when	SCONJ	CJS	_	11	mark	_	_
8	either	either	ADV	AV0	_	11	adv

In [492]:
connebidemo = connebidemo[:-1]    ## drop the final newline; spacy_conll does not accept the extra one

In [493]:
nlp = init_parser("en_core_web_sm", "spacy", include_headers=False)
parser = ConllParser(nlp)
connebidemo2 = parser.parse_conll_text_as_spacy(connebidemo)
for sent_id, sent in enumerate(connebidemo2.sents, 1):
        print(sent._.conll_pd)
        #displacy.render(sent, style='dep', options={"compact":True})  #renders the sentences into trees, just takes up
                                                                       #a LOT of screen space   
        for word in sent:
            print(word, word.lemma_, word.pos_, word.dep_)
        print()

   ID   FORM  LEMMA   UPOS XPOS FEATS  HEAD     DEPREL DEPS MISC
0   1    hey    Hey   INTJ  ITJ     _     2  discourse    _    _
1   2  Eleni  Eleni   NOUN  NP0     _     0       ROOT    _    _
2   3      ,      ,  PUNCT  PUN     _     2      punct    _    _
hey Hey INTJ discourse
Eleni Eleni NOUN ROOT
, , PUNCT punct

   ID      FORM   LEMMA   UPOS  XPOS  \
0   1       let     let   VERB   VVB   
1   2        me       i  PROPN   PNP   
2   3      tell    tell   VERB   VVI   
3   4       you     you  PROPN   PNP   
4   5      what    what    DET   DTQ   
5   6  happened  happen   VERB   VVD   
6   7     today   today    ADV   AV0   
7   8         !       !  PUNCT  SENT   

                                               FEATS  HEAD  DEPREL DEPS MISC  
0  Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbF...     0    ROOT    _    _  
1                  Number=Sing|Person=1|PronType=Prs     1     obj    _    _  
2  Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbF...     1   xcomp    _    _  
3

In [494]:
connebidemo2._.conll_str


'1\they\tHey\tINTJ\tITJ\t_\t2\tdiscourse\t_\t_\n2\tEleni\tEleni\tNOUN\tNP0\t_\t0\tROOT\t_\t_\n3\t,\t,\tPUNCT\tPUN\t_\t2\tpunct\t_\t_\n\n1\tlet\tlet\tVERB\tVVB\tMood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin\t0\tROOT\t_\t_\n2\tme\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t1\tobj\t_\t_\n3\ttell\ttell\tVERB\tVVI\tMood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin\t1\txcomp\t_\t_\n4\tyou\tyou\tPROPN\tPNP\tPerson=2|PronType=Prs\t6\tnsubj\t_\t_\n5\twhat\twhat\tDET\tDTQ\tNumber=Sing|PronType=Rel\t6\tdet\t_\t_\n6\thappened\thappen\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t3\tccomp\t_\t_\n7\ttoday\ttoday\tADV\tAV0\t_\t6\tadvmod\t_\t_\n8\t!\t!\tPUNCT\tSENT\t_\t6\tpunct\t_\t_\n\n1\tI\tI\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t2\tnsubj\t_\t_\n2\tsaw\tsee\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t0\tROOT\t_\t_\n3\ta\ta\tDET\tAT0\tDefinite=Ind|Number=Sing|PronType=Art\t6\tdet\t_\t_\n4\tminor\tminor\tADJ\tAJ0\tDegree=Pos\t6\tamod\t_\t_\n5\tcar\tcar\

In [495]:
## trying it on the german, but we need a different (german) pipeline for this
conndbidemo = de_bi_texts[0][1]
conndbidemo[-10:]
## sooo pesky

'root\t_\t_\n\n'

In [496]:
conndbidemo = conndbidemo[:-1]    ## drop one of the two trailing newlines before parsing
dnlp = init_parser("de_core_news_sm", "spacy", include_headers=False)
dparser = ConllParser(dnlp)
conndbidemo2 = dparser.parse_conll_text_as_spacy(conndbidemo)
for sent_id, sent in enumerate(conndbidemo2.sents, 1):
        print(sent._.conll_pd)
        for word in sent:
            print(word, word.lemma_, word.pos_, word.dep_)
        print()

   ID    FORM   LEMMA   UPOS   XPOS                            FEATS  HEAD  \
0   1      äh      äh   INTJ  NGHES                                _     0   
1   2   hello   hello      X     FM                                _     3   
2   3    this    this      X     FM                     PronType=Dem     4   
3   4      is      be      X     FM     Mood=Ind|Person=3|Tense=Pres     1   
4   5    file    file      X     FM                      Number=Sing     4   
5   6  Nummer  Nummer   NOUN     NN  Case=Nom|Gender=Fem|Number=Sing     8   
6   7       F       F  PROPN     NE                                _     6   
7   8      äh      äh   INTJ  NGHES                                _     9   
8   9      16  @card@  PROPN     NE                     NumType=Card     5   

  DEPREL DEPS MISC  
0   ROOT    _    _  
1    dep    _    _  
2    dep    _    _  
3    dep    _    _  
4    dep    _    _  
5  nsubj    _    _  
6  appos    _    _  
7  punct    _    _  
8  appos    _    _  
äh äh INT

In [497]:
conndbidemo2._.conll_str

'1\täh\täh\tINTJ\tNGHES\t_\t0\tROOT\t_\t_\n2\thello\thello\tX\tFM\t_\t3\tdep\t_\t_\n3\tthis\tthis\tX\tFM\tPronType=Dem\t4\tdep\t_\t_\n4\tis\tbe\tX\tFM\tMood=Ind|Person=3|Tense=Pres\t1\tdep\t_\t_\n5\tfile\tfile\tX\tFM\tNumber=Sing\t4\tdep\t_\t_\n6\tNummer\tNummer\tNOUN\tNN\tCase=Nom|Gender=Fem|Number=Sing\t8\tnsubj\t_\t_\n7\tF\tF\tPROPN\tNE\t_\t6\tappos\t_\t_\n8\täh\täh\tINTJ\tNGHES\t_\t9\tpunct\t_\t_\n9\t16\t@card@\tPROPN\tNE\tNumType=Card\t5\tappos\t_\t_\n\n1\tja\tja\tINTJ\tNGIRR\t_\t2\tadvmod\t_\t_\n2\tokay\tokay\tINTJ\tNGIRR\t_\t0\tROOT\t_\t_\n3\täh\täh\tINTJ\tNGHES\t_\t2\tdep\t_\t_\n\n1\tich\tich\tPRON\tPPER\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t6\tnsubj\t_\t_\n2\thabe\thaben\tAUX\tVAFIN\tMood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin\t6\taux\t_\t_\n3\tgerade\tgerade\tADV\tADV\t_\t6\tadvmod\t_\t_\n4\tein\tein\tDET\tART\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing|PronType=Art\t5\tdet\t_\t_\n5\tUnfall\tUnfall\tNOUN\tNN\tCase=Acc|Gender=Masc|Number=Sing\t6\tobj\t_\t_\

### Pause
Firstly, I want to thank Na-Rae for helping with the spacy_conll things. The spacy_conll library is a little temperamental and rages against an extra newline character at the end of a text. What is not pictured is the hours Na-Rae and I spent trying to figure out what wasn't working until she figured it out. 

Secondly, I know that my first bit of parsing by hand is redundant and will not be used, but it gave some useful information about the documents regardless, because there are some irregular documents in here that I'm sure spacy_conll will throw a fit about. 

All this being said, it's finally time to work on spacy-parsing all the texts.

## Attempted Spacy Parsing ConLL

In [498]:
## English Spacy Parser 
nlp = init_parser("en_core_web_sm", "spacy", include_headers=False)
parser = ConllParser(nlp)
def parseEnTexts(constr, conlist):
    while constr[-2:] == '\n\n':      # this should also cover cases where the end could be \n\n\n
        constr = constr[:(len(constr)-1)]
    constr2 = parser.parse_conll_text_as_spacy(constr)
    for sent_id, sent in enumerate(constr2.sents, 1):
        conlist.append(sent._.conll_str)

In [499]:
## German Spacy Parser
dnlp = init_parser("de_core_news_sm", "spacy", include_headers=False)
dparser = ConllParser(dnlp)
def parseDeTexts(constr, conlist):
    while constr[-2:] == '\n\n':
        constr = constr[:(len(constr)-1)]
    constr2 = dparser.parse_conll_text_as_spacy(constr)
    for sent_id, sent in enumerate(constr2.sents, 1):
        conlist.append(sent._.conll_str)
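
The `while` loops above trim repeated trailing newlines one character at a time. A minimal equivalent sketch using `str.rstrip` leaves at most one final newline and changes nothing else (the helper name `trim_trailing_newlines` is hypothetical):

```python
def trim_trailing_newlines(text):
    """Collapse a run of trailing newlines into a single one."""
    stripped = text.rstrip('\n')
    # Only put a newline back if the original actually ended with one.
    return stripped + '\n' if len(stripped) < len(text) else stripped

print(repr(trim_trailing_newlines('root\t_\t_\n\n')))
```

Using one shared helper in both parse functions would also keep the English and German versions from drifting apart.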

In [500]:
en_bi_texts[:3]

[('USbi01FG_iwE', '1\they\tHey\tINTJ\tITJ\t_\t2\tdiscourse\t_\t_\n2\tEleni\tEleni\tNOUN\tNP0\t_\t0\troot\t_\t_\n3\t,\t,\tPUNCT\tPUN\t_\t2\tpunct\t_\t_\n\n1\tlet\tlet\tVERB\tVVB\tMood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_\n2\tme\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t1\tobj\t_\t_\n3\ttell\ttell\tVERB\tVVI\tMood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin\t1\txcomp\t_\t_\n4\tyou\tyou\tPROPN\tPNP\tPerson=2|PronType=Prs\t6\tnsubj\t_\t_\n5\twhat\twhat\tDET\tDTQ\tNumber=Sing|PronType=Rel\t6\tdet\t_\t_\n6\thappened\thappen\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t3\tccomp\t_\t_\n7\ttoday\ttoday\tADV\tAV0\t_\t6\tadvmod\t_\t_\n8\t!\t!\tPUNCT\tSENT\t_\t6\tpunct\t_\t_\n\n1\tI\tI\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t2\tnsubj\t_\t_\n2\tsaw\tsee\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t0\troot\t_\t_\n3\ta\ta\tDET\tAT0\tDefinite=Ind|Number=Sing|PronType=Art\t6\tdet\t_\t_\n4\tminor\tminor\tADJ\tAJ0\tDegree=Pos\t6\tamod\t

In [501]:
en_bi_texts = [x[1] for x in en_bi_texts]
en_mono_texts = [x[1] for x in en_mono_texts]
de_bi_texts = [x[1] for x in de_bi_texts]
de_mono_texts = [x[1] for x in de_mono_texts]

In [502]:
# ebi_con_str = []
# [parseEnTexts(x, ebi_con_str) for x in en_bi_texts]

This causes an error that says:

`pos` value "_" is not a valid Universal Dependencies tag. Non-UD tags should use the `tag` property.

That's definitely a problem, but let's see which other corpora have problems before we move on to cleaning the CoNLL
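
Since the error complains about a `pos` value of `_`, one way to locate the offending rows is to scan the fourth column of every line against the 17 Universal Dependencies POS tags. This is a diagnostic sketch, not the cleaning approach used later; `UD_UPOS` and `find_invalid_upos` are made-up names, and the second sample row is invented to trigger the check.

```python
UD_UPOS = frozenset({
    'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
    'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X',
})

def find_invalid_upos(text):
    """Return (line number, UPOS value, line) for every suspicious row."""
    problems = []
    for line_no, line in enumerate(text.split('\n'), start=1):
        if not line.strip():
            continue  # blank lines only separate sentences
        fields = line.split('\t')
        # Rows without exactly ten fields are reported with UPOS None.
        upos = fields[3] if len(fields) == 10 else None
        if upos not in UD_UPOS:
            problems.append((line_no, upos, line))
    return problems

sample = (
    '1\thello\thello\tINTJ\tITJ\t_\t0\troot\t_\t_\n'
    '2\tfoo\tfoo\t_\t_\t_\t1\tdep\t_\t_\n'   # invented row with a missing tag
)
print(find_invalid_upos(sample))
```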

In [503]:
en_mono_texts[:3]

['1\tso\tso\tINTJ\tITJ\t_\t11\tdiscourse\t_\t_\n2\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t11\tnsubj\t_\t_\n3\tjust\tjust\tADV\tAV0\t_\t11\tadvmod\t_\t_\n4\tso\tso\tINTJ\tITJ\t_\t11\tdiscourse\t_\t_\n5\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t11\tnsubj\t_\t_\n6\twas\tbe\tAUX\tVBD\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t11\tcop\t_\t_\n7\tjust\tjust\tADV\tAV0\tDegree=Pos\t11\tadvmod\t_\t_\n8\tin\tin\tADP\tPRP\t_\t11\tcase\t_\t_\n9\ta\ta\tDET\tAT0\tDefinite=Ind|Number=Sing|PronType=Art\t11\tdet\t_\t_\n10\tparking\tparking\tNOUN\tNN1\tNumber=Sing\t11\tcompound\t_\t_\n11\tlot\tlot\tNOUN\tNN1\t_\t0\troot\t_\t_\n\n1\tand\tand\tCCONJ\tCJC\t_\t3\tcc\t_\t_\n2\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t3\tnsubj\t_\t_\n3\tsaw\tsee\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t0\troot\t_\t_\n4\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t5\tnsubj\t_\t_\n5\tsaw\tsee\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t3\tccomp\

In [504]:
testlist = []
parseEnTexts(en_mono_texts[0], testlist)
testlist

['1\tso\tso\tINTJ\tITJ\t_\t11\tdiscourse\t_\t_\n2\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t11\tnsubj\t_\t_\n3\tjust\tjust\tADV\tAV0\t_\t11\tadvmod\t_\t_\n4\tso\tso\tINTJ\tITJ\t_\t11\tdiscourse\t_\t_\n5\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t11\tnsubj\t_\t_\n6\twas\tbe\tAUX\tVBD\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t11\tcop\t_\t_\n7\tjust\tjust\tADV\tAV0\tDegree=Pos\t11\tadvmod\t_\t_\n8\tin\tin\tADP\tPRP\t_\t11\tcase\t_\t_\n9\ta\ta\tDET\tAT0\tDefinite=Ind|Number=Sing|PronType=Art\t11\tdet\t_\t_\n10\tparking\tparking\tNOUN\tNN1\tNumber=Sing\t11\tcompound\t_\t_\n11\tlot\tlot\tNOUN\tNN1\t_\t0\tROOT\t_\t_\n', '1\tand\tand\tCCONJ\tCJC\t_\t3\tcc\t_\t_\n2\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t3\tnsubj\t_\t_\n3\tsaw\tsee\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t0\tROOT\t_\t_\n4\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t5\tnsubj\t_\t_\n5\tsaw\tsee\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t3\tccom

In [505]:
emo_con_str = []
[parseEnTexts(x, emo_con_str) for x in en_mono_texts]

## shows up as none, but that's not really an issue

[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
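
The list of `None` values appears because the list comprehension is used only for its side effect of appending to another list. A small sketch of an alternative collects the results in an ordinary loop and returns them. Here `toy_parse` is a stand-in for the spaCy-based parser and simply splits on blank lines, and `split_sentences` is a hypothetical helper name.

```python
def toy_parse(text):
    """Stand-in parser: one string per blank-line-separated block."""
    return [block for block in text.split('\n\n') if block]

def split_sentences(parse, texts):
    """Apply `parse` to every text and return all sentences in one list."""
    sentences = []
    for text in texts:
        sentences.extend(parse(text))
    return sentences

print(split_sentences(toy_parse, ['a\n\nb', 'c']))
```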

In [506]:
## just one sentence as opposed to the whole text
emo_con_str[0]

'1\tso\tso\tINTJ\tITJ\t_\t11\tdiscourse\t_\t_\n2\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t11\tnsubj\t_\t_\n3\tjust\tjust\tADV\tAV0\t_\t11\tadvmod\t_\t_\n4\tso\tso\tINTJ\tITJ\t_\t11\tdiscourse\t_\t_\n5\tI\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t11\tnsubj\t_\t_\n6\twas\tbe\tAUX\tVBD\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t11\tcop\t_\t_\n7\tjust\tjust\tADV\tAV0\tDegree=Pos\t11\tadvmod\t_\t_\n8\tin\tin\tADP\tPRP\t_\t11\tcase\t_\t_\n9\ta\ta\tDET\tAT0\tDefinite=Ind|Number=Sing|PronType=Art\t11\tdet\t_\t_\n10\tparking\tparking\tNOUN\tNN1\tNumber=Sing\t11\tcompound\t_\t_\n11\tlot\tlot\tNOUN\tNN1\t_\t0\tROOT\t_\t_\n'

In [507]:
print(emo_con_str[0])

1	so	so	INTJ	ITJ	_	11	discourse	_	_
2	I	i	PROPN	PNP	Number=Sing|Person=1|PronType=Prs	11	nsubj	_	_
3	just	just	ADV	AV0	_	11	advmod	_	_
4	so	so	INTJ	ITJ	_	11	discourse	_	_
5	I	i	PROPN	PNP	Number=Sing|Person=1|PronType=Prs	11	nsubj	_	_
6	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	11	cop	_	_
7	just	just	ADV	AV0	Degree=Pos	11	advmod	_	_
8	in	in	ADP	PRP	_	11	case	_	_
9	a	a	DET	AT0	Definite=Ind|Number=Sing|PronType=Art	11	det	_	_
10	parking	parking	NOUN	NN1	Number=Sing	11	compound	_	_
11	lot	lot	NOUN	NN1	_	0	ROOT	_	_



Looks like we will not need to do any cleaning for the English monolingual data! That's great, so let's move on to the German data.

In [508]:
de_bi_texts[0]

'1\täh\täh\tINTJ\tNGHES\t_\t0\troot\t_\t_\n2\thello\thello\tX\tFM\t_\t3\tdep\t_\t_\n3\tthis\tthis\tX\tFM\tPronType=Dem\t4\tdep\t_\t_\n4\tis\tbe\tX\tFM\tMood=Ind|Person=3|Tense=Pres\t1\tdep\t_\t_\n5\tfile\tfile\tX\tFM\tNumber=Sing\t4\tdep\t_\t_\n6\tNummer\tNummer\tNOUN\tNN\tCase=Nom|Gender=Fem|Number=Sing\t8\tnsubj\t_\t_\n7\tF\tF\tPROPN\tNE\t_\t6\tappos\t_\t_\n8\täh\täh\tINTJ\tNGHES\t_\t9\tpunct\t_\t_\n9\t16\t@card@\tPROPN\tNE\tNumType=Card\t5\tappos\t_\t_\n\n1\tja\tja\tINTJ\tNGIRR\t_\t2\tadvmod\t_\t_\n2\tokay\tokay\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n3\täh\täh\tINTJ\tNGHES\t_\t2\tdep\t_\t_\n\n1\tich\tich\tPRON\tPPER\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t6\tnsubj\t_\t_\n2\thabe\thaben\tAUX\tVAFIN\tMood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin\t6\taux\t_\t_\n3\tgerade\tgerade\tADV\tADV\t_\t6\tadvmod\t_\t_\n4\tein\tein\tDET\tART\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing|PronType=Art\t5\tdet\t_\t_\n5\tUnfall\tUnfall\tNOUN\tNN\tCase=Acc|Gender=Masc|Number=Sing\t6\tobj\t_\t_\

In [509]:
#debi_con_str = []
#[parseDeTexts(x, debi_con_str) for x in de_bi_texts]


## same issue as before with the English bilingual data

In [510]:
#demo_con_str = []
#[parseDeTexts(x, demo_con_str) for x in de_mono_texts]

## again, same issues. Onto cleaning

## Data Cleaning

As we saw with the manual parsing and with the fact that many of these texts have an extra newline character, we're going to have to clean up some documents before creating the corpora to use for analysis later

These are the data sets that need cleaning:
- English Bilingual
- German Bilingual
- German Monolingual

We got a hint of what was wrong in the earlier manual parsing, so now it's time to find the actual errors and fix them.
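Since the extra newline characters are part of the problem, a small helper for splitting a document into sentence blocks may be useful. This is only a minimal sketch: `split_conll_blocks` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes sentences are separated by blank lines, as in the raw texts shown above.

```python
def split_conll_blocks(text):
    """Split one CoNLL-style document into sentence blocks.

    Blocks are assumed to be separated by blank lines. Leading and trailing
    newlines are removed from each block, and empty blocks are skipped.
    """
    blocks = []
    for block in text.split("\n\n"):
        block = block.strip("\n")
        if block:
            blocks.append(block)
    return blocks


# Two one-token sentences followed by extra trailing newlines
sample = (
    "1\tja\tja\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n\n"
    "1\tokay\tokay\tINTJ\tNGIRR\t_\t0\troot\t_\t_\n\n\n"
)
blocks = split_conll_blocks(sample)
print(len(blocks))
```

Unlike trimming only the end of the document, this also drops a stray newline that ends up at the start or end of an individual block.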

In [511]:
en_bi_texts[:3]

['1\they\tHey\tINTJ\tITJ\t_\t2\tdiscourse\t_\t_\n2\tEleni\tEleni\tNOUN\tNP0\t_\t0\troot\t_\t_\n3\t,\t,\tPUNCT\tPUN\t_\t2\tpunct\t_\t_\n\n1\tlet\tlet\tVERB\tVVB\tMood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_\n2\tme\ti\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t1\tobj\t_\t_\n3\ttell\ttell\tVERB\tVVI\tMood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin\t1\txcomp\t_\t_\n4\tyou\tyou\tPROPN\tPNP\tPerson=2|PronType=Prs\t6\tnsubj\t_\t_\n5\twhat\twhat\tDET\tDTQ\tNumber=Sing|PronType=Rel\t6\tdet\t_\t_\n6\thappened\thappen\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t3\tccomp\t_\t_\n7\ttoday\ttoday\tADV\tAV0\t_\t6\tadvmod\t_\t_\n8\t!\t!\tPUNCT\tSENT\t_\t6\tpunct\t_\t_\n\n1\tI\tI\tPROPN\tPNP\tNumber=Sing|Person=1|PronType=Prs\t2\tnsubj\t_\t_\n2\tsaw\tsee\tVERB\tVVD\tMood=Ind|Person=3|Tense=Past|VerbForm=Fin\t0\troot\t_\t_\n3\ta\ta\tDET\tAT0\tDefinite=Ind|Number=Sing|PronType=Art\t6\tdet\t_\t_\n4\tminor\tminor\tADJ\tAJ0\tDegree=Pos\t6\tamod\t_\t_\n5\tcar\tcar

In [512]:
enbi_con_str = []
[parseDeTexts(x, enbi_con_str) for x in en_bi_texts]
#used to produce error

[None, None, None, ...]

In [513]:
len(enbi_con_str)
## so line 4716 was the breaking point - the POS for the whole file was all _ so it was excluded

5365

After some investigating, I have found the file that is to blame: USbi77FG_fwE.txt
For some reason, it has no POS markings. I believe this is the only file that is broken in this way. For this reason, from now on I will exclude this file, along with all of its associated files (metadata, audio, etc.), whenever I read in the corpus.
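One way to keep that exclusion in a single place is to filter on the file stem, which is shared by the text, metadata, and audio files. The sketch below is an assumption rather than the code used in this notebook: `list_stems` and `EXCLUDED_STEMS` are hypothetical names, and the demo runs on made-up files in a temporary directory.

```python
import tempfile
from pathlib import Path

# Stems (filenames without extension) of files known to be unusable
EXCLUDED_STEMS = frozenset({"USbi77FG_fwE"})


def list_stems(directory, pattern="*.meta", excluded=EXCLUDED_STEMS):
    """Return the sorted stems of files matching `pattern`, minus exclusions."""
    return sorted(
        path.stem
        for path in Path(directory).glob(pattern)
        if path.stem not in excluded
    )


# Demo on a temporary directory; "USbi01FG_fsE" is a made-up example name
with tempfile.TemporaryDirectory() as tmp:
    for name in ("USbi01FG_fsE.meta", "USbi77FG_fwE.meta", "USbi01FG_fsE.txt"):
        (Path(tmp) / name).touch()
    stems = list_stems(tmp)

print(stems)
```

`Path.stem` removes only the final extension, whereas `str.strip('.meta')` removes any of those characters from both ends of the name, so the stem is the safer choice if a speaker ID ever starts or ends with one of them.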

In [514]:
debi_con_str = []
[parseDeTexts(x, debi_con_str) for x in de_bi_texts]

ValueError: [E1021] `pos` value "_" is not a valid Universal Dependencies tag. Non-UD tags should use the `tag` property.

In [515]:
len(debi_con_str)
## so line 80 is the issue

80

In [523]:
debi_strs = []
for x in de_bi_texts:
    # trim extra trailing newlines until at most one is left
    while x[-2:] == '\n\n':
        x = x[:-1]
    # sentences are separated by blank lines
    debi_strs.extend(x.split('\n\n'))
len(debi_strs)

9017

In [535]:
#print(debi_strs[79])
print(debi_strs[80])
#print(debi_strs[81])
## it doesn't look like this error is as simple as a text with its POS missing (which is a good thing!)
## but more investigation is required!

1	ja	ja	INTJ	NGIRR	_	7	advmod	_	_
2	schönen	schöne	ADJ	ADJA	Degree=Pos	4	amod	_	_
3	guten	gute	ADJ	ADJA	Degree=Pos	4	amod	_	_
4	Tag	Tag	NOUN	NN	_	7	nmod	_	_
5	DEbi17MR	DEbi17MR	PROPN	NE	_	4	appos	_	_
6	mein	mein	DET	PPOSAT	Person=1|Poss=Yes|PronType=Prs	7	det:poss	_	_
7	Name	Name	NOUN	NN	_	0	root	_	_
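One way to continue the investigation is to scan every sentence block for token rows whose POS column is `_`, since that is the value the `ValueError` complains about. This is a minimal sketch: `rows_missing_upos` and `blocks_missing_upos` are hypothetical helpers, it assumes the ten tab-separated columns seen in the printed sentences (POS in the fourth), and `bad_row` is a made-up row for illustration.

```python
def rows_missing_upos(block):
    """Return the token rows of one sentence block whose POS column is '_'."""
    missing = []
    for row in block.split("\n"):
        cols = row.split("\t")
        # Rows are expected to have 10 tab-separated columns; POS is the 4th
        if len(cols) == 10 and cols[3] == "_":
            missing.append(row)
    return missing


def blocks_missing_upos(blocks):
    """Return the indices of blocks with at least one row lacking a POS tag."""
    return [i for i, block in enumerate(blocks) if rows_missing_upos(block)]


good_row = "1\tja\tja\tINTJ\tNGIRR\t_\t0\troot\t_\t_"
bad_row = "2\tokay\tokay\t_\t_\t_\t1\tdep\t_\t_"  # made-up row without a POS tag
blocks = [good_row, good_row + "\n" + bad_row]
print(blocks_missing_upos(blocks))
```

Calling `blocks_missing_upos(debi_strs)` would then give the indices worth printing, rather than relying on the length of the partially filled result list.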


In [525]:
demo_con_str = []
[parseDeTexts(x, demo_con_str) for x in de_mono_texts]

ValueError: [E1021] `pos` value "_" is not a valid Universal Dependencies tag. Non-UD tags should use the `tag` property.

In [526]:
len(demo_con_str)

994

In [527]:
demo_strs = []
for x in de_mono_texts:
    # trim extra trailing newlines until at most one is left
    while x[-2:] == '\n\n':
        x = x[:-1]
    # sentences are separated by blank lines
    demo_strs.extend(x.split('\n\n'))
len(demo_strs)

4135

In [528]:
print(demo_strs[992])
print(demo_strs[993])
print(demo_strs[994])
print(demo_strs[995])
## (the \n looks pesky but it's likely a document break, which should be fine)

1	und	und	CCONJ	KON	_	6	cc	_	_
2	das	d	PRON	PDS	Case=Nom|PronType=Dem	6	nsubj	_	_
3	war	sein	AUX	VAFIN	Mood=Ind|Person=3|Tense=Past|VerbForm=Fin	6	aux	_	_
4	es	es	PRON	PPER	Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs	6	obj	_	_
5	auch	auch	ADV	PTKIFG	_	6	advmod	_	_
6	schon	schon	ADV	PTKMWL	_	0	root	_	_
1	vielen	viel	DET	PIAT	Degree=Pos	2	det	_	_
2	Dank	Dank	NOUN	NN	_	0	root	_	_

1	hallo	hallo	INTJ	NGIRR	_	2	punct	_	_
2	Toni	Toni	PROPN	NE	_	0	root	_	_
1	ich	ich	PRON	PPER	Case=Nom|Number=Sing|Person=1|PronType=Prs	4	nsubj	_	_
2	war	sein	AUX	VAFIN	Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin	4	aux	_	_
3	gerade	gerade	ADV	ADV	_	4	advmod	_	_
4	äh	äh	INTJ	NGHES	Degree=Pos	0	root	_	_
5	auf	auf	ADP	APPR	_	7	case	_	_
6	einem	ein	DET	ART	Case=Dat|Definite=Ind|Gender=Masc,Neut|Number=Sing|PronType=Art	7	det	_	_
7	Parkplatz	Parkplatz	NOUN	NN	Case=Dat|Gender=Masc,Neut|Number=Sing	4	obl	_	_


## Corpora Creation for Later Exploration
We finally have all our sentences parsed. Let's take one final look before pickling them for use in the data exploration.
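For reference, the pickling step that comes after this final look could be sketched roughly as follows. `save_corpus`, `load_corpus`, and the file name `sample_corpus.pkl` are hypothetical, and the demo uses a short list of plain strings rather than the real parsed sentences.

```python
import pickle
import tempfile
from pathlib import Path


def save_corpus(sentences, path):
    """Write a list of parsed sentences to `path` as a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(sentences, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_corpus(path):
    """Read a list of parsed sentences back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip on stand-in data in a temporary directory
sample = ["hallo Toni", "vielen Dank"]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample_corpus.pkl"
    save_corpus(sample, target)
    restored = load_corpus(target)

print(restored == sample)
```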

In [None]:
print(len(emo_con_str))
print(emo_con_str[0])

In [None]:
print(len(enbi_con_str))
print(enbi_con_str[0])