# Exploring the RUEG Corpus
Goals

## Table of Contents
1. [Unigram Exploration]()

    A. [Loading in the Data]()
2. [Taking in Some Basic Stats]()
3. [Exploring with the POS]()

## Unigram Exploration

### Loading in the Data
Load the pickle files created in [this](https://github.com/Data-Science-for-Linguists-2025/DEU-ENG-Mono-and-Billingual-Speakers/blob/main/LoadingRUEGData.ipynb) Jupyter notebook and poke around a bit.

In [1]:
%pprint

Pretty printing has been turned OFF


In [2]:
import pickle
import nltk

In [3]:
# Load the (word, POS tag) pairs for each of the four partitions
with open('debi_pos.pkl', 'rb') as file:
    DE_bi_pos = pickle.load(file)
with open('demono_pos.pkl', 'rb') as file:
    DE_mono_pos = pickle.load(file)
with open('enbi_pos.pkl', 'rb') as file:
    EN_bi_pos = pickle.load(file)
with open('enmono_pos.pkl', 'rb') as file:
    EN_mono_pos = pickle.load(file)

In [4]:
# Load the plain token lists for the same four partitions
with open('debi_text.pkl', 'rb') as file:
    DE_bi_tokens = pickle.load(file)
with open('demono_text.pkl', 'rb') as file:
    DE_mono_tokens = pickle.load(file)
with open('enbi_text.pkl', 'rb') as file:
    EN_bi_tokens = pickle.load(file)
with open('enmono_text.pkl', 'rb') as file:
    EN_mono_tokens = pickle.load(file)

In [5]:
print(len(DE_bi_pos))
print(len(DE_mono_pos))
print(len(EN_bi_pos))
print(len(EN_mono_pos))

print(len(DE_bi_tokens))
print(len(DE_mono_tokens))
print(len(EN_bi_tokens))
print(len(EN_mono_tokens))

4773
1761
4385
621
4773
1761
4385
621


In [6]:
DE_bi_pos[:10]

[('und', 'CCONJ'), ('die', 'PRON'), ('haben', 'AUX'), ('die', 'DET'), ('Polizei', 'NOUN'), ('äh', 'INTJ'), ('angerufen', 'VERB'), ('DEbi24FT', 'PROPN'), ('und', 'CCONJ'), ('ist', 'AUX')]

In [7]:
DE_bi_tokens[:10]

['und', 'die', 'haben', 'die', 'Polizei', 'äh', 'angerufen', 'DEbi24FT', 'und', 'ist']

Unfortunately, after investigating the POS tags, I found some non-UPOS tags in the German sets (for example `NN` and `VAFIN`, which look like STTS tags). This likely comes either from a parsing error in Stanza or from incorrect marking in the source text itself (the POS tagging was automatic, not done by hand in EXMARaLDA).

For our purposes, I decided to simply exclude these instances from the data. They will not be helpful, and there really isn't another solution here.
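An alternative to listing each stray tag by hand is to keep only the pairs whose tag belongs to the 17 Universal POS tags. The sketch below illustrates that idea on a toy list and is not the filtering this notebook actually uses; `UPOS_TAGS`, `split_by_upos`, and `toy_pairs` are hypothetical names introduced only for this example.

```python
from collections import Counter

# The 17 Universal POS tags defined by Universal Dependencies
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def split_by_upos(pairs):
    """Return (pairs with a UPOS tag, Counter of the tags that are not UPOS)."""
    kept = [(word, tag) for word, tag in pairs if tag in UPOS_TAGS]
    stray = Counter(tag for _, tag in pairs if tag not in UPOS_TAGS)
    return kept, stray

# Toy data in the same (word, tag) shape as DE_bi_pos; invented for illustration
toy_pairs = [("und", "CCONJ"), ("Polizei", "NN"), ("ist", "VAFIN"),
             ("Auto", "NOUN"), ("x", "_")]

kept, stray = split_by_upos(toy_pairs)
print(kept)   # only the pairs whose tag is a UPOS tag
print(stray)  # which non-UPOS tags were dropped, and how often
```

The `stray` counter also shows how many items each unexpected tag accounts for, which helps when judging whether dropping them is safe.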

In [8]:
DE_bi_pos = [x for x in DE_bi_pos if x[1] not in ['NE', '_', '$.']]
debi_postags = [x[1] for x in DE_bi_pos]
debitagfd = nltk.FreqDist(debi_postags)
print(debitagfd.most_common())

[('NOUN', 702), ('DET', 618), ('VERB', 558), ('ADV', 550), ('PRON', 448), ('AUX', 354), ('CCONJ', 341), ('ADP', 292), ('ADJ', 291), ('PUNCT', 251), ('INTJ', 175), ('PROPN', 51), ('SYM', 29), ('PART', 23), ('NUM', 22), ('X', 11), ('SCONJ', 7)]


In [9]:
DE_mono_pos = [x for x in DE_mono_pos if x[1] not in ['PPER', 'VAFIN', 'KON', 'PIAT', 'NN']]
demono_postags = [x[1] for x in DE_mono_pos]
demonotagfd = nltk.FreqDist(demono_postags)
print(demonotagfd.most_common())

[('NOUN', 251), ('DET', 210), ('ADV', 207), ('VERB', 189), ('PRON', 176), ('AUX', 127), ('CCONJ', 119), ('ADP', 115), ('ADJ', 113), ('PUNCT', 100), ('INTJ', 92), ('SYM', 20), ('PROPN', 17), ('NUM', 8), ('PART', 7), ('SCONJ', 3)]


In [10]:
EN_bi_pos = [x for x in EN_bi_pos if x[1] != '_']
enbi_postags = [x[1] for x in EN_bi_pos]
enbitagfd = nltk.FreqDist(enbi_postags)
print(enbitagfd.most_common())

[('NOUN', 685), ('VERB', 675), ('DET', 673), ('ADP', 420), ('PROPN', 330), ('CCONJ', 268), ('AUX', 251), ('ADV', 234), ('ADJ', 212), ('PUNCT', 200), ('PART', 119), ('SCONJ', 112), ('PRON', 91), ('INTJ', 75), ('NUM', 36)]


In [11]:
enmono_postags = [x[1] for x in EN_mono_pos]
enmonotagfd = nltk.FreqDist(enmono_postags)
print(enmonotagfd.most_common())
len(set(enmono_postags))

[('DET', 102), ('VERB', 99), ('NOUN', 92), ('ADP', 61), ('CCONJ', 44), ('PROPN', 43), ('ADJ', 34), ('AUX', 33), ('ADV', 31), ('PUNCT', 21), ('PART', 19), ('PRON', 17), ('SCONJ', 17), ('INTJ', 7), ('NUM', 1)]


15

These distributions are a little harder to compare because the sizes of the sets are quite different. The issue is similar to the one with TTR: it is hard to compare type-token ratios across texts of very different sizes, because frequent words such as stop words keep repeating as a text grows, so longer texts tend to have lower ratios. With set sizes this different (the English monolingual set in particular is much smaller than the others), comparisons are harder to make. I will do the best I can, but this is crucial to keep in mind whenever comparing the four partitions.
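To make the TTR point concrete, here is a minimal sketch on invented token lists. It shows the raw ratio dropping for the longer sample, and one simple workaround: truncating every sample to the length of the shortest one before computing the ratio. The function names and the toy data are assumptions for illustration, and truncation is only one of several possible ways to control for size.

```python
def type_token_ratio(tokens):
    """Distinct lowercased tokens divided by total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)

def ttr_at_equal_size(samples, size=None):
    """TTR of each sample after truncating all samples to the same length."""
    if size is None:
        size = min(len(tokens) for tokens in samples.values())
    return {name: type_token_ratio(tokens[:size])
            for name, tokens in samples.items()}

# Toy token lists standing in for two partitions of different size (invented data)
toy = {
    "short": ["the", "car", "hit", "the", "dog"],
    "long": ["the", "car", "hit", "the", "dog", "and", "the", "car",
             "stopped", "and", "the", "dog", "ran"],
}

raw = {name: type_token_ratio(tokens) for name, tokens in toy.items()}
equal = ttr_at_equal_size(toy)
print(raw)    # the longer sample has the lower raw ratio
print(equal)  # at equal size, the two samples come out the same here
```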

That being said, it's still fair to say that nouns, determiners, and verbs are at the top in all four sets. What is interesting is the greater use of adverbs in the German sets compared to the greater use of adpositions in the English sets. At this point, it is hard to see how bilingual speakers resemble or differ from monolingual speakers; minute differences may be difficult to see with the human eye and will likely require some kind of machine learning.
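Because the sets differ in size, the adverb and adposition observation is easier to check with proportions than with raw counts. The sketch below uses the ADV and ADP counts printed above; the totals are my own sums of each printed distribution (after filtering), and the names `adv_adp`, `totals`, and `shares` are hypothetical.

```python
# (ADV count, ADP count) per partition, copied from the printed tag distributions
adv_adp = {
    "DE_bi": (550, 292),
    "DE_mono": (207, 115),
    "EN_bi": (234, 420),
    "EN_mono": (31, 61),
}

# Total tags per partition, obtained by summing each printed distribution
totals = {"DE_bi": 4723, "DE_mono": 1754, "EN_bi": 4381, "EN_mono": 621}

def shares(name):
    """Return the ADV and ADP counts as proportions of all tags in a partition."""
    adv, adp = adv_adp[name]
    return adv / totals[name], adp / totals[name]

for name in adv_adp:
    adv_share, adp_share = shares(name)
    print(f"{name:8s} ADV {adv_share:.3f}  ADP {adp_share:.3f}")
```

In the notebook itself, the totals could be read from the frequency distributions (for example `debitagfd.N()`) instead of being typed in by hand.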

In [12]:
debiposfd = nltk.FreqDist(DE_bi_pos)
print(debiposfd.most_common(20))

[(('und', 'CCONJ'), 221), (('.', 'PUNCT'), 175), (('die', 'DET'), 155), (('der', 'DET'), 120), (('Polizei', 'NOUN'), 118), (('Auto', 'NOUN'), 94), (('dann', 'ADV'), 81), (('ist', 'AUX'), 75), (('äh', 'INTJ'), 60), (('das', 'DET'), 60), (('es', 'PRON'), 59), ((',', 'PUNCT'), 54), (('hat', 'AUX'), 54), (('ich', 'PRON'), 52), (('dem', 'DET'), 49), (('ja', 'INTJ'), 47), (('haben', 'AUX'), 44), (('den', 'DET'), 43), (('war', 'AUX'), 41), (('ein', 'DET'), 38)]


In [13]:
demonoposfd = nltk.FreqDist(DE_mono_pos)
print(demonoposfd.most_common(20))

[(('und', 'CCONJ'), 76), (('.', 'PUNCT'), 66), (('die', 'DET'), 37), (('der', 'DET'), 35), (('ist', 'AUX'), 31), (('dann', 'ADV'), 29), (('dem', 'DET'), 29), (('den', 'DET'), 27), (('ja', 'INTJ'), 26), (('Auto', 'NOUN'), 25), (('ich', 'PRON'), 25), ((',', 'PUNCT'), 24), (('war', 'AUX'), 23), (('das', 'PRON'), 23), (('mit', 'ADP'), 22), (('Polizei', 'NOUN'), 19), (('nicht', 'INTJ'), 19), (('es', 'PRON'), 18), (('auf', 'ADP'), 18), (('das', 'DET'), 17)]


In [14]:
enbiposfd = nltk.FreqDist(EN_bi_pos)
print(enbiposfd.most_common(20))

[(('the', 'DET'), 428), (('and', 'CCONJ'), 220), (('.', 'PUNCT'), 158), (('car', 'NOUN'), 144), (('to', 'PART'), 89), (('was', 'AUX'), 87), (('of', 'ADP'), 83), (('it', 'PROPN'), 78), (('called', 'VERB'), 72), (('they', 'PROPN'), 51), (('in', 'ADP'), 50), (('then', 'ADV'), 50), (('him', 'PROPN'), 50), (('I', 'PROPN'), 45), (('911', 'NOUN'), 45), (('behind', 'ADP'), 45), (('police', 'NOUN'), 44), (('hit', 'VERB'), 40), (('to', 'ADP'), 40), (('a', 'DET'), 39)]


In [15]:
enmonoposfd = nltk.FreqDist(EN_mono_pos)
print(enmonoposfd.most_common(20))

[(('the', 'DET'), 62), (('and', 'CCONJ'), 39), (('car', 'NOUN'), 35), (('.', 'PUNCT'), 19), (('it', 'PROPN'), 17), (('was', 'AUX'), 16), (('to', 'PART'), 13), (('other', 'ADJ'), 11), (('one', 'PRON'), 9), (('of', 'ADP'), 9), (('rear-ended', 'VERB'), 9), (('him', 'PROPN'), 9), (('hit', 'VERB'), 9), (('into', 'ADP'), 8), (('then', 'ADV'), 8), (('they', 'PROPN'), 8), (('behind', 'ADP'), 8), (('stopped', 'VERB'), 7), (('that', 'DET'), 7), (('in', 'ADP'), 7)]


#### What are we seeing here?
Let me explain what is happening in these lists. First, it's important to recognize that some of these recordings come from a task in which participants watched a video of a car crash and were asked to describe it as if they had witnessed it, which is why words like 'Polizei' (English: police), 'Auto', and 'car' are so common.

Secondly, the distribution of stop words differs somewhat between the languages. For one, the most common word in both German sets is 'und' (English: and), while the most common word in both English sets is 'the'. This is likely because German has several different forms of 'the' (der, die, das, den, dem, des), so the counts are spread across these forms. The bilingual sets are fairly comparable because they are similar in size, and 'und' and 'and' show very similar usage in them (221 and 220 occurrences).
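The idea that German spreads 'the' across several forms can be checked roughly with the counts already printed for the two bilingual sets. This is only a sketch: it uses the DET readings that appear in the top-20 list (so 'des' and any pronoun readings are left out), the totals are my own sums of the printed tag distributions, and the variable names are hypothetical.

```python
# Counts copied from the top-20 lists printed above for the bilingual sets
de_bi_articles = {"die": 155, "der": 120, "das": 60, "dem": 49, "den": 43}  # DET readings only
en_bi_the = 428
de_bi_und, en_bi_and = 221, 220

# Total tags per set after filtering, summed from the printed tag distributions
de_bi_total, en_bi_total = 4723, 4381

de_articles_total = sum(de_bi_articles.values())
print(de_articles_total, en_bi_the)  # 427 428

# Shares of all tokens, so the slightly different set sizes do not matter
print(f"German definite articles: {de_articles_total / de_bi_total:.3f}")
print(f"English 'the':            {en_bi_the / en_bi_total:.3f}")
print(f"'und': {de_bi_und / de_bi_total:.3f}  'and': {en_bi_and / en_bi_total:.3f}")
```

Summed together, the German article forms come out almost level with 'the' in raw counts, which fits the explanation that no single German form can reach the top of the list on its own.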

Additionally, some of these texts are transcriptions of *spoken* audio, so the punctuation is tricky to analyze and will not matter much for this analysis anyway.