# Exploring the RUEG Corpus
Goals

## Table of Contents
1. [Unigram Exploration]()

    A. [Loading in the Data]()
2. [Taking in Some Basic Stats]()
3. [Exploring with the POS]()

## Unigram Exploration

### Loading in the Data
Load the pickle files created in [this](https://github.com/Data-Science-for-Linguists-2025/DEU-ENG-Mono-and-Billingual-Speakers/blob/main/LoadingRUEGData.ipynb) Jupyter notebook and poke around a bit.

In [83]:
%pprint

Pretty printing has been turned OFF


In [170]:
import pickle
import nltk
import sklearn
import pandas as pd
import numpy as np

In [85]:
with open ('debi_pos.pkl', 'rb') as file:
    DE_bi_pos = pickle.load(file)
with open ('demono_pos.pkl', 'rb') as file:
    DE_mono_pos = pickle.load(file)
with open ('enbi_pos.pkl', 'rb') as file:
    EN_bi_pos = pickle.load(file)
with open ('enmono_pos.pkl', 'rb') as file:
    EN_mono_pos = pickle.load(file)

In [86]:
with open ('debi_text.pkl', 'rb') as file:
    DE_bi_tokens = pickle.load(file)
with open ('demono_text.pkl', 'rb') as file:
    DE_mono_tokens = pickle.load(file)
with open ('enbi_text.pkl', 'rb') as file:
    EN_bi_tokens = pickle.load(file)
with open ('enmono_text.pkl', 'rb') as file:
    EN_mono_tokens = pickle.load(file)

In [None]:
print(len(DE_bi_pos))
print(len(DE_mono_pos))
print(len(EN_bi_pos))
print(len(EN_mono_pos))

print(len(DE_bi_tokens))
print(len(DE_mono_tokens))
print(len(EN_bi_tokens))
print(len(EN_mono_tokens))

## they are the same- good!

4773
1761
4385
621
4773
1761
4385
621


In [88]:
DE_bi_pos[:10]

[('und', 'CCONJ'), ('die', 'PRON'), ('haben', 'AUX'), ('die', 'DET'), ('Polizei', 'NOUN'), ('äh', 'INTJ'), ('angerufen', 'VERB'), ('DEbi24FT', 'PROPN'), ('und', 'CCONJ'), ('ist', 'AUX')]

In [89]:
DE_bi_tokens[:10]

['und', 'die', 'haben', 'die', 'Polizei', 'äh', 'angerufen', 'DEbi24FT', 'und', 'ist']

Unfortunately, after investigating some POS tags, it turns out that some non-UPOS tags are included in the German sets. This likely comes from incorrect parsing by stanza, or from incorrect marking in the actual text (it was automatic POS tagging, not done by hand with EXMARaLDA).

For our purposes, I decided to just exclude these instances from the data here. They will not be helpful, and there really isn't another solution.
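
One way to express this exclusion step, shown here as a minimal sketch rather than the code used in this notebook, is to keep only tags that belong to the Universal POS tag set instead of listing the stray tags per partition. The names `UPOS_TAGS` and `keep_upos` are hypothetical. Note that `X` is itself a valid UPOS tag ("other"), so the sketch takes a `drop` argument to exclude it as well, mirroring the bilingual German filter in this notebook.

```python
# The 17 Universal POS tags (UD v2).
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def keep_upos(pairs, drop=frozenset({"X"})):
    """Keep (token, tag) pairs whose tag is a UPOS tag and is not in `drop`."""
    allowed = UPOS_TAGS - set(drop)
    return [(tok, tag) for tok, tag in pairs if tag in allowed]

# Toy input: the first three pairs come from DE_bi_pos[:10]; the last three
# are made-up examples carrying the kinds of tags that get excluded.
sample = [("und", "CCONJ"), ("die", "PRON"), ("haben", "AUX"),
          ("Polizei", "NN"), (".", "$."), ("äh", "X")]
cleaned = keep_upos(sample)
print(cleaned)
```

A whitelist like this would also catch stray tags that were not spotted by hand, at the cost of hiding which tags were actually removed.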

In [90]:
DE_bi_pos = [x for x in DE_bi_pos if x[1] not in ['NE', '_', '$.', 'X']]
debi_postags = [x[1] for x in DE_bi_pos]
debitagfd = nltk.FreqDist(debi_postags)
print(debitagfd.most_common())

[('NOUN', 702), ('DET', 618), ('VERB', 558), ('ADV', 550), ('PRON', 448), ('AUX', 354), ('CCONJ', 341), ('ADP', 292), ('ADJ', 291), ('PUNCT', 251), ('INTJ', 175), ('PROPN', 51), ('SYM', 29), ('PART', 23), ('NUM', 22), ('SCONJ', 7)]


In [91]:
DE_mono_pos = [x for x in DE_mono_pos if x[1] not in ['PPER', 'VAFIN', 'KON', 'PIAT', 'NN']]
demono_postags = [x[1] for x in DE_mono_pos]
demonotagfd = nltk.FreqDist(demono_postags)
print(demonotagfd.most_common())

[('NOUN', 251), ('DET', 210), ('ADV', 207), ('VERB', 189), ('PRON', 176), ('AUX', 127), ('CCONJ', 119), ('ADP', 115), ('ADJ', 113), ('PUNCT', 100), ('INTJ', 92), ('SYM', 20), ('PROPN', 17), ('NUM', 8), ('PART', 7), ('SCONJ', 3)]


In [92]:
EN_bi_pos = [x for x in EN_bi_pos if x[1] != '_']
enbi_postags = [x[1] for x in EN_bi_pos]
enbitagfd = nltk.FreqDist(enbi_postags)
print(enbitagfd.most_common())

[('NOUN', 685), ('VERB', 675), ('DET', 673), ('ADP', 420), ('PROPN', 330), ('CCONJ', 268), ('AUX', 251), ('ADV', 234), ('ADJ', 212), ('PUNCT', 200), ('PART', 119), ('SCONJ', 112), ('PRON', 91), ('INTJ', 75), ('NUM', 36)]


In [93]:
enmono_postags = [x[1] for x in EN_mono_pos]
enmonotagfd = nltk.FreqDist(enmono_postags)
print(enmonotagfd.most_common())
len(set(enmono_postags))

[('DET', 102), ('VERB', 99), ('NOUN', 92), ('ADP', 61), ('CCONJ', 44), ('PROPN', 43), ('ADJ', 34), ('AUX', 33), ('ADV', 31), ('PUNCT', 21), ('PART', 19), ('PRON', 17), ('SCONJ', 17), ('INTJ', 7), ('NUM', 1)]


15

These are a little harder to compare, because we know that the sizes of the texts are pretty different. A related issue is the one that affects TTR: it is hard to compare type-token ratios when text sizes are vastly different, because stop words make up a larger proportion of longer texts. Since these text sizes are so different (especially the English monolingual set), comparison may be harder. I will do the best I can, but this is crucial to keep in mind whenever comparing the four partitions.

That being said, it's still fair to say that nouns, determiners, and verbs are at the top in all sets. What is interesting is the greater use of adverbs by German speakers compared to the greater use of adpositions in English. At this point, it is hard to see the similarities between bilingual and monolingual speakers; minute differences may be difficult to see with the human eye and will require some kind of machine learning.
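
Since raw counts are hard to compare across partitions of such different sizes, one option is to turn each tag count into a share of its partition. The sketch below is illustrative only: `relative_freqs` is a hypothetical helper, and the counts are copied from the two monolingual outputs above. Because `nltk.FreqDist` is a subclass of `collections.Counter`, the same function should also accept the `FreqDist` objects built in the cells above.

```python
from collections import Counter

def relative_freqs(counts):
    """Map each key to its share of the total count."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

# Tag counts copied from the monolingual outputs above.
de_mono = Counter({"NOUN": 251, "DET": 210, "ADV": 207, "VERB": 189,
                   "PRON": 176, "AUX": 127, "CCONJ": 119, "ADP": 115,
                   "ADJ": 113, "PUNCT": 100, "INTJ": 92, "SYM": 20,
                   "PROPN": 17, "NUM": 8, "PART": 7, "SCONJ": 3})
en_mono = Counter({"DET": 102, "VERB": 99, "NOUN": 92, "ADP": 61,
                   "CCONJ": 44, "PROPN": 43, "ADJ": 34, "AUX": 33,
                   "ADV": 31, "PUNCT": 21, "PART": 19, "PRON": 17,
                   "SCONJ": 17, "INTJ": 7, "NUM": 1})

de_rel = relative_freqs(de_mono)
en_rel = relative_freqs(en_mono)
for tag in ("ADV", "ADP"):
    print(f"{tag}: DE mono {de_rel[tag]:.3f}, EN mono {en_rel[tag]:.3f}")
```

Shares make the adverb/adposition contrast comparable across partitions, although very small partitions still give noisy estimates.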

In [94]:
debiposfd = nltk.FreqDist(DE_bi_pos)
print(debiposfd.most_common(20))

[(('und', 'CCONJ'), 221), (('.', 'PUNCT'), 175), (('die', 'DET'), 155), (('der', 'DET'), 120), (('Polizei', 'NOUN'), 118), (('Auto', 'NOUN'), 94), (('dann', 'ADV'), 81), (('ist', 'AUX'), 75), (('äh', 'INTJ'), 60), (('das', 'DET'), 60), (('es', 'PRON'), 59), ((',', 'PUNCT'), 54), (('hat', 'AUX'), 54), (('ich', 'PRON'), 52), (('dem', 'DET'), 49), (('ja', 'INTJ'), 47), (('haben', 'AUX'), 44), (('den', 'DET'), 43), (('war', 'AUX'), 41), (('ein', 'DET'), 38)]


In [95]:
demonoposfd = nltk.FreqDist(DE_mono_pos)
print(demonoposfd.most_common(20))

[(('und', 'CCONJ'), 76), (('.', 'PUNCT'), 66), (('die', 'DET'), 37), (('der', 'DET'), 35), (('ist', 'AUX'), 31), (('dann', 'ADV'), 29), (('dem', 'DET'), 29), (('den', 'DET'), 27), (('ja', 'INTJ'), 26), (('Auto', 'NOUN'), 25), (('ich', 'PRON'), 25), ((',', 'PUNCT'), 24), (('war', 'AUX'), 23), (('das', 'PRON'), 23), (('mit', 'ADP'), 22), (('Polizei', 'NOUN'), 19), (('nicht', 'INTJ'), 19), (('es', 'PRON'), 18), (('auf', 'ADP'), 18), (('das', 'DET'), 17)]


In [96]:
enbiposfd = nltk.FreqDist(EN_bi_pos)
print(enbiposfd.most_common(20))

[(('the', 'DET'), 428), (('and', 'CCONJ'), 220), (('.', 'PUNCT'), 158), (('car', 'NOUN'), 144), (('to', 'PART'), 89), (('was', 'AUX'), 87), (('of', 'ADP'), 83), (('it', 'PROPN'), 78), (('called', 'VERB'), 72), (('they', 'PROPN'), 51), (('in', 'ADP'), 50), (('then', 'ADV'), 50), (('him', 'PROPN'), 50), (('I', 'PROPN'), 45), (('911', 'NOUN'), 45), (('behind', 'ADP'), 45), (('police', 'NOUN'), 44), (('hit', 'VERB'), 40), (('to', 'ADP'), 40), (('a', 'DET'), 39)]


In [111]:
enmonoposfd = nltk.FreqDist(EN_mono_pos)
print(enmonoposfd.most_common(20))

[(('the', 'DET'), 62), (('and', 'CCONJ'), 39), (('car', 'NOUN'), 35), (('.', 'PUNCT'), 19), (('it', 'PROPN'), 17), (('was', 'AUX'), 16), (('to', 'PART'), 13), (('other', 'ADJ'), 11), (('one', 'PRON'), 9), (('of', 'ADP'), 9), (('rear-ended', 'VERB'), 9), (('him', 'PROPN'), 9), (('hit', 'VERB'), 9), (('into', 'ADP'), 8), (('then', 'ADV'), 8), (('they', 'PROPN'), 8), (('behind', 'ADP'), 8), (('stopped', 'VERB'), 7), (('that', 'DET'), 7), (('in', 'ADP'), 7)]


#### What are we seeing here
Let me translate what exactly is happening here with the languages. First, it's important to recognize that some of these recordings come from a situation in which the participants were asked to describe a video they saw about a car crash as if they had witnessed it, which is why words like 'Polizei' (English: police), 'Auto', and 'car' are common.

Secondly, the distribution of stop words is a little different. For one, the most common word in both German texts is 'und' (English: and), while the most common word in both English texts is 'the'. This is likely because German has several different words for 'the' (der, die, das, den, dem, des), so the distribution is spread across these several words. The bilingual sets are pretty comparable because they have similar sizes, and 'und' and 'and' have pretty similar usage in them.
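
To make the point about German articles concrete, the counts of the definite-article forms can be pooled and set next to 'the'. This is a rough sketch using only the determiner counts visible in the bilingual top-20 outputs above; 'des' does not appear in that top 20, so it is left out, and the variable names are made up for illustration.

```python
# (token, 'DET') counts copied from the German bilingual top-20 output above.
de_bi_articles = {"die": 155, "der": 120, "das": 60, "dem": 49, "den": 43}
# ('the', 'DET') count copied from the English bilingual top-20 output above.
en_bi_the = 428

pooled = sum(de_bi_articles.values())
print("pooled German definite articles:", pooled)
print("English 'the':", en_bi_the)
```

With these numbers the pooled German forms come to 427, nearly identical to the 428 occurrences of 'the', which is consistent with the "spread across several words" explanation.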

Additionally, some of these texts are transcriptions of *spoken* audio, so the punctuation is a little tricky to analyze and is not going to be of much importance here anyway.

### Combining the Bilingual v Monolingual
Let's combine the bilingual and monolingual data and look at just the POS tags to see if that shows any greater differences.

In [98]:
bilingual_uni_pos = EN_bi_pos + DE_bi_pos
print(len(bilingual_uni_pos))

monolingual_uni_pos = EN_mono_pos + DE_mono_pos
print(len(monolingual_uni_pos))

## big size discrepancy to keep in mind

9093
2375


In [99]:
biuni_postags = [x[1] for x in bilingual_uni_pos]
biuni_postagsfd = nltk.FreqDist(biuni_postags)
biuni_postagsfd.most_common(20)

[('NOUN', 1387), ('DET', 1291), ('VERB', 1233), ('ADV', 784), ('ADP', 712), ('CCONJ', 609), ('AUX', 605), ('PRON', 539), ('ADJ', 503), ('PUNCT', 451), ('PROPN', 381), ('INTJ', 250), ('PART', 142), ('SCONJ', 119), ('NUM', 58), ('SYM', 29)]

In [100]:
monouni_postags = [x[1] for x in monolingual_uni_pos]
monouni_postagsfd = nltk.FreqDist(monouni_postags)
monouni_postagsfd.most_common(20)

## with just the POS, we can compare how the first four groups are similar,
## however pronoun usage is clearly different and greater in the monolingual
## speakers, but everything else is nearly identical- very cool!!

[('NOUN', 343), ('DET', 312), ('VERB', 288), ('ADV', 238), ('PRON', 193), ('ADP', 176), ('CCONJ', 163), ('AUX', 160), ('ADJ', 147), ('PUNCT', 121), ('INTJ', 99), ('PROPN', 60), ('PART', 26), ('SCONJ', 20), ('SYM', 20), ('NUM', 9)]

In [171]:
## add in machine learning for giggles because I know it's not going to be good at this stage

## Bigram Exploration
Unigram exploration is interesting, but it can really only do so much for us. What is (hopefully) more telling will be the bigram and possibly trigram trends.
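
For reference, the usual definition of a bigram is a pair of adjacent items. The sketch below builds such pairs from the first few (token, POS) tuples shown earlier. It is only an illustration of the concept: the pickled bigram lists loaded next were built in a separate notebook and are far larger than the token lists, so they were presumably constructed differently. `adjacent_bigrams` is a hypothetical helper.

```python
def adjacent_bigrams(items):
    """Return all pairs of neighbouring items, in order."""
    return list(zip(items, items[1:]))

# First five (token, POS) pairs from DE_bi_pos[:10] above.
sample = [("und", "CCONJ"), ("die", "PRON"), ("haben", "AUX"),
          ("die", "DET"), ("Polizei", "NOUN")]
pairs = adjacent_bigrams(sample)
for first, second in pairs:
    print(first, second)
```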

In [101]:
with open ('bigram_debi_pos.pkl', 'rb') as file:
    DE_bi_bigram_pos = pickle.load(file)
with open ('bigram_demono_pos.pkl', 'rb') as file:
    DE_mono_bigram_pos = pickle.load(file)
with open ('bigram_enbi_pos.pkl', 'rb') as file:
    EN_bi_bigram_pos = pickle.load(file)
with open ('bigram_enmono_pos.pkl', 'rb') as file:
    EN_mono_bigram_pos = pickle.load(file)

In [103]:
DE_bi_bigram_pos[:10]

[(('und', 'CCONJ'), ('die', 'PRON')), (('die', 'PRON'), ('haben', 'AUX')), (('haben', 'AUX'), ('die', 'DET')), (('die', 'DET'), ('Polizei', 'NOUN')), (('Polizei', 'NOUN'), ('äh', 'INTJ')), (('äh', 'INTJ'), ('angerufen', 'VERB')), (('und', 'CCONJ'), ('die', 'PRON')), (('die', 'PRON'), ('haben', 'AUX')), (('haben', 'AUX'), ('die', 'DET')), (('die', 'DET'), ('Polizei', 'NOUN'))]

In [104]:
len(DE_bi_bigram_pos)

1194250

In [105]:
DE_bi_bigram_pos = [(x,y) for (x,y) in DE_bi_bigram_pos if (x[1] not in ['NE', '_', '$.', 'X']) and (y[1] not in ['NE', '_', '$.', 'X'])]

In [None]:
len(DE_bi_bigram_pos)
## yay it worked!

1180238

In [107]:
debi_bipostags = [(x[1], y[1]) for (x, y) in DE_bi_bigram_pos]

In [108]:
debi_bipostags[:10]

[('CCONJ', 'PRON'), ('PRON', 'AUX'), ('AUX', 'DET'), ('DET', 'NOUN'), ('NOUN', 'INTJ'), ('INTJ', 'VERB'), ('CCONJ', 'PRON'), ('PRON', 'AUX'), ('AUX', 'DET'), ('DET', 'NOUN')]

In [143]:
debi_bitoks = [(x[0], y[0]) for (x, y) in DE_bi_bigram_pos]
debi_bitoks[:10]

[('und', 'die'), ('die', 'haben'), ('haben', 'die'), ('die', 'Polizei'), ('Polizei', 'äh'), ('äh', 'angerufen'), ('und', 'die'), ('die', 'haben'), ('haben', 'die'), ('die', 'Polizei')]

In [144]:
debibitokfd = nltk.FreqDist(debi_bitoks)
debibitokfd.most_common(20)

[(('die', 'Polizei'), 27973), (('und', 'der'), 9930), (('Polizei', 'gerufen'), 8095), (('der', 'Mann'), 5678), (('das', 'war'), 5191), (('und', 'dann'), 4861), (('war', 'es'), 4714), (('Polizei', '.'), 4534), (('Polizei', 'angerufen'), 4486), (('und', 'das'), 4415), (('mit', 'dem'), 3805), (('und', 'die'), 3773), (('ist', 'dann'), 3756), (('und', 'ja'), 3695), (('das', 'Auto'), 3487), (('Auto', 'reingefahren'), 3406), (('gerufen', '.'), 3353), (('und', 'äh'), 3299), (('zweite', 'Auto'), 3191), (('Mann', 'im'), 3186)]

In [168]:
DE_mono_bigram_pos = [(x,y) for (x,y) in DE_mono_bigram_pos if (x[1] not in ['PPER', 'VAFIN', 'KON', 'PIAT', 'NN']) and (y[1] not in ['PPER', 'VAFIN', 'KON', 'PIAT', 'NN'])]
EN_bi_bigram_pos = [(x,y) for (x,y) in EN_bi_bigram_pos if (x[1] != '_') and (y[1] != '_')]
EN_mono_bigram_pos = [(x,y) for (x,y) in EN_mono_bigram_pos]
## pos tags extraction
demono_bipostags = [(x[1], y[1]) for (x, y) in DE_mono_bigram_pos]
enbi_bipostags = [(x[1], y[1]) for (x, y) in EN_bi_bigram_pos]
enmono_bipostags = [(x[1], y[1]) for (x, y) in EN_mono_bigram_pos]
## text extraction
demono_bitoks = [(x[0], y[0]) for (x, y) in DE_mono_bigram_pos]
enbi_bitoks = [(x[0], y[0]) for (x, y) in EN_bi_bigram_pos]
enmono_bitoks = [(x[0], y[0]) for (x, y) in EN_mono_bigram_pos]
## printing len of each bigram pos
print(len(DE_bi_bigram_pos))
print(len(DE_mono_bigram_pos))
print(len(EN_bi_bigram_pos))
print(len(EN_mono_bigram_pos))

1180238
198093
901501
18023


In [115]:
debibiposfd = nltk.FreqDist(DE_bi_bigram_pos)
debibiposfd.most_common(20)
## lots of very topical words, given the prompt the participants were given

[((('die', 'DET'), ('Polizei', 'NOUN')), 27973), ((('und', 'CCONJ'), ('der', 'DET')), 8874), ((('Polizei', 'NOUN'), ('gerufen', 'VERB')), 8095), ((('der', 'DET'), ('Mann', 'NOUN')), 5678), ((('und', 'CCONJ'), ('dann', 'ADV')), 4861), ((('das', 'PRON'), ('war', 'AUX')), 4828), ((('Polizei', 'NOUN'), ('.', 'PUNCT')), 4534), ((('Polizei', 'NOUN'), ('angerufen', 'VERB')), 4486), ((('war', 'AUX'), ('es', 'PRON')), 4351), ((('mit', 'ADP'), ('dem', 'DET')), 3805), ((('ist', 'AUX'), ('dann', 'ADV')), 3756), ((('und', 'CCONJ'), ('ja', 'INTJ')), 3695), ((('das', 'DET'), ('Auto', 'NOUN')), 3487), ((('Auto', 'NOUN'), ('reingefahren', 'VERB')), 3406), ((('gerufen', 'VERB'), ('.', 'PUNCT')), 3353), ((('zweite', 'ADJ'), ('Auto', 'NOUN')), 3191), ((('Mann', 'NOUN'), ('im', 'ADP')), 3186), ((('in', 'ADP'), ('das', 'DET')), 3098), ((('an', 'ADV'), ('.', 'PUNCT')), 3087), ((('und', 'CCONJ'), ('das', 'DET')), 3056)]

In [117]:
debibipostagfd = nltk.FreqDist(debi_bipostags)
debibipostagfd.most_common(20)
## det noun makes a lot of sense, as does noun verb to be the most common

[(('DET', 'NOUN'), 119603), (('NOUN', 'VERB'), 65140), (('ADJ', 'NOUN'), 48252), (('DET', 'ADJ'), 41112), (('ADP', 'DET'), 40121), (('VERB', 'PUNCT'), 32132), (('ADV', 'VERB'), 31215), (('AUX', 'ADV'), 29194), (('PRON', 'ADV'), 26520), (('ADV', 'ADV'), 25606), (('CCONJ', 'DET'), 25210), (('PRON', 'VERB'), 23776), (('PRON', 'AUX'), 23713), (('CCONJ', 'PRON'), 23328), (('ADV', 'DET'), 23170), (('AUX', 'PRON'), 22636), (('NOUN', 'AUX'), 19400), (('VERB', 'AUX'), 19363), (('NOUN', 'ADP'), 19271), (('NOUN', 'ADV'), 18892)]

In [None]:
demonobitoksfd = nltk.FreqDist(demono_bitoks)
enbibitoksfd = nltk.FreqDist(enbi_bitoks)
enmonobitoksfd = nltk.FreqDist(enmono_bitoks)
## use the token-only distribution for German Bilingual too, so all four lists are comparable
bitokfds = [debibitokfd, demonobitoksfd, enbibitoksfd, enmonobitoksfd]
for x in bitokfds:
    print(x.most_common(20))

## Order:
# German Bilingual
# German Monolingual
# English Bilingual
# English Monolingual

[(('die', 'Polizei'), 27973), (('und', 'der'), 9930), (('Polizei', 'gerufen'), 8095), (('der', 'Mann'), 5678), (('das', 'war'), 5191), (('und', 'dann'), 4861), (('war', 'es'), 4714), (('Polizei', '.'), 4534), (('Polizei', 'angerufen'), 4486), (('und', 'das'), 4415), (('mit', 'dem'), 3805), (('und', 'die'), 3773), (('ist', 'dann'), 3756), (('und', 'ja'), 3695), (('das', 'Auto'), 3487), (('Auto', 'reingefahren'), 3406), (('gerufen', '.'), 3353), (('und', 'äh'), 3299), (('zweite', 'Auto'), 3191), (('Mann', 'im'), 3186)]
[(('die', 'Polizei'), 2860), (('mit', 'dem'), 1755), (('dem', 'Ball'), 1124), (('das', 'wa

In [137]:
demonobiposfd = nltk.FreqDist(DE_mono_bigram_pos)
enbibiposfd = nltk.FreqDist(EN_bi_bigram_pos)
enmonobiposfd = nltk.FreqDist(EN_mono_bigram_pos)
biposfds = [debibiposfd, demonobiposfd, enbibiposfd, enmonobiposfd]
for x in biposfds:
    print(x.most_common(20))
        
## Order:
# German Bilingual
# German Monolingual
# English Bilingual
# English Monolingual

[((('die', 'DET'), ('Polizei', 'NOUN')), 27973), ((('und', 'CCONJ'), ('der', 'DET')), 8874), ((('Polizei', 'NOUN'), ('gerufen', 'VERB')), 8095), ((('der', 'DET'), ('Mann', 'NOUN')), 5678), ((('und', 'CCONJ'), ('dann', 'ADV')), 4861), ((('das', 'PRON'), ('war', 'AUX')), 4828), ((('Polizei', 'NOUN'), ('.', 'PUNCT')), 4534), ((('Polizei', 'NOUN'), ('angerufen', 'VERB')), 4486), ((('war', 'AUX'), ('es', 'PRON')), 4351), ((('mit', 'ADP'), ('dem', 'DET')), 3805), ((('ist', 'AUX'), ('dann', 'ADV')), 3756), ((('und', 'CCONJ'), ('ja', 'INTJ')), 3695), ((('das', 'DET'), ('Auto', 'NOUN')), 3487), ((('Auto', 'NOUN'), ('reingefahren', 'VERB')), 3406), ((('gerufen', 'VERB'), ('.', 'PUNCT')), 3353), ((('zweite', 'ADJ'), ('Auto', 'NOUN')), 3191), ((('Mann', 'NOUN'), ('im', 'ADP')), 3186), ((('in', 'ADP'), ('das', 'DET')), 3098), ((('an', 'ADV'), ('.', 'PUNCT')), 3087), ((('und', 'CCONJ'), ('das', 'DET')), 3056)]
[((('die', 'DET'), ('Polizei', 'NOUN')), 2860), ((('mit', 'ADP'), ('dem', 'DET')), 1666), 

In [136]:
demonobipostagfd = nltk.FreqDist(demono_bipostags)
enbibipostagfd = nltk.FreqDist(enbi_bipostags)
enmonobipostagfd = nltk.FreqDist(enmono_bipostags)
bipostagfds = [debibipostagfd, demonobipostagfd, enbibipostagfd, enmonobipostagfd]
for x in bipostagfds:
    print(x.most_common(20))
    
## Order:
# German Bilingual
# German Monolingual
# English Bilingual
# English Monolingual

[(('DET', 'NOUN'), 119603), (('NOUN', 'VERB'), 65140), (('ADJ', 'NOUN'), 48252), (('DET', 'ADJ'), 41112), (('ADP', 'DET'), 40121), (('VERB', 'PUNCT'), 32132), (('ADV', 'VERB'), 31215), (('AUX', 'ADV'), 29194), (('PRON', 'ADV'), 26520), (('ADV', 'ADV'), 25606), (('CCONJ', 'DET'), 25210), (('PRON', 'VERB'), 23776), (('PRON', 'AUX'), 23713), (('CCONJ', 'PRON'), 23328), (('ADV', 'DET'), 23170), (('AUX', 'PRON'), 22636), (('NOUN', 'AUX'), 19400), (('VERB', 'AUX'), 19363), (('NOUN', 'ADP'), 19271), (('NOUN', 'ADV'), 18892)]
[(('DET', 'NOUN'), 19281), (('ADP', 'DET'), 9688), (('NOUN', 'VERB'), 9018), (('ADJ', 'NOUN'), 6068), (('DET', 'ADJ'), 5763), (('VERB', 'PUNCT'), 5635), (('PRON', 'AUX'), 5134), (('ADV', 'ADV'), 4891), (('ADV', 'VERB'), 4676), (('NOUN', 'ADP'), 4444), (('AUX', 'PRON'), 4243), (('PRON', 'ADV'), 4132), (('PRON', 'VERB'), 3572), (('CCONJ', 'DET'), 3558), (('NOUN', 'ADV'), 3389), (('VERB', 'AUX'), 3367), (('CCONJ', 'PRON'), 3301), (('VERB', 'PRON'), 3237), (('ADV', 'AUX'), 31

In [164]:
debibipostagcfd = nltk.ConditionalFreqDist(debi_bipostags)
demonobipostagcfd = nltk.ConditionalFreqDist(demono_bipostags)
enbibipostagcfd = nltk.ConditionalFreqDist(enbi_bipostags)
enmonobipostagcfd = nltk.ConditionalFreqDist(enmono_bipostags)
bipostagcfds = [debibipostagcfd, demonobipostagcfd, enbibipostagcfd, enmonobipostagcfd]

print('Noun, Verb')
for x in bipostagcfds:
    print(x['NOUN'].freq('VERB'))
print('Adjective, Noun')
for x in bipostagcfds:
    print(x['ADJ'].freq('NOUN'))
print('Adjective, Adjective')
for x in bipostagcfds:
    print(x['ADJ'].freq('ADJ'))
print('Adverb, Verb')
for x in bipostagcfds:
    print(x['ADV'].freq('VERB'))

## Order:
# German Bilingual
# German Monolingual
# English Bilingual
# English Monolingual

Noun, Verb
0.35290930761729333
0.3193342776203966
0.19764318831929775
0.1419828641370869
Adjective, Noun
0.5896470818261804
0.46155016353540734
0.7200904020003847
0.4186991869918699
Adjective, Adjective
0.02189852380486851
0.03034912907887731
0.0031256010771302176
0.0
Adverb, Verb
0.20999838540405263
0.19375155382447998
0.26983454398708634
0.37158469945355194


In [166]:
print('Adposition, Noun')
for x in bipostagcfds:
    print(x['ADP'].freq('NOUN'))
print('Adposition, Determiner')
for x in bipostagcfds:
    print(x['ADP'].freq('DET'))
print('Verb, Noun')
for x in bipostagcfds:
    print(x['VERB'].freq('NOUN'))
print('Verb, Determiner')
for x in bipostagcfds:
    print(x['VERB'].freq('DET'))

## Order:
# German Bilingual
# German Monolingual
# English Bilingual
# English Monolingual

Adposition, Noun
0.14500856373868362
0.15211870070846142
0.04937817427470697
0.0
Adposition, Determiner
0.4908367996085148
0.6475070177783718
0.5092430791869769
0.44450431034482757
Verb, Noun
0.0
0.003353317752260949
0.09656873392798335
0.06950207468879668
Verb, Determiner
0.13888797957313082
0.08982826948480846
0.262617627892963
0.27420470262793917


#### Again, let's pause and look what's happening
This is a lot of data to take in, but let me break down what we're seeing with these sets of bigrams. Of course, the most popular token bigrams are topical things like 'die Polizei' (English: 'the police'), 'dem Ball' (English: 'the ball'), or 'car behind', but there are also function word combinations like 'and the', 'of the', 'mit dem' (English: 'with the'), or 'und der' (English: 'and the'). A lot of these make some sense.

As for 'die Polizei' being the most popular in German, I have some guesses. Firstly, as I stated before, 'the' in German can take many forms, so while 'and the' is very popular in English, in German this could be 'und die/der/das/dem/den', etc., which 'waters down', so to speak, the frequency of the 'the' token in German. Because these were made-up police reports, however, 'die Polizei' was likely used in the nominative case more frequently, as it was the police doing something in these scenarios (and for a feminine noun like 'Polizei' the accusative article is 'die' as well), so it doesn't suffer from the 'watering down' issue other determiners may face
- like 'Ball', which may have been used in the accusative or dative case, acting as either the patient or theme in a lot of these sentences

What is interesting is that the bilinguals overall seem to have used content bigrams more frequently/consistently as opposed to stopwords, which may make sense: grammatical rules and stopwords change a lot more across languages and have different meanings to speakers, whereas content words have more 'reliable' and 'consistent' translations
- i.e. 'Apfel' and 'apple' are going to mean the same thing to both speakers, but where English has one word, 'because', German has several, like 'weil', 'da', and 'denn', and 'denn' is only one letter away from 'dann', which means 'then' (as in the time indicator: x happened and *then* y happened)
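
The impression that bilinguals lean more on content bigrams could be checked with a rough measure such as the share of bigram tokens in which both words carry a content POS tag. The sketch below is one hypothetical way to do that: the choice of `CONTENT_TAGS` is an assumption, `content_share` is a made-up helper, and the toy input only imitates the format of `debi_bipostags`.

```python
# Assumed split: open-class tags treated as content words.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

def content_share(tag_bigrams):
    """Fraction of (tag1, tag2) bigrams where both tags are content tags."""
    if not tag_bigrams:
        return 0.0
    hits = sum(1 for a, b in tag_bigrams
               if a in CONTENT_TAGS and b in CONTENT_TAGS)
    return hits / len(tag_bigrams)

# Toy input in the same format as debi_bipostags.
toy = [("DET", "NOUN"), ("NOUN", "VERB"), ("ADJ", "NOUN"), ("CCONJ", "DET")]
print(content_share(toy))
```

Running the same function on each of the four tag-bigram lists would give one number per partition, which is easier to compare than top-20 lists.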

As for the part of speech, it's harder to see bilingual/monolingual similarities with the naked eye, which is why I implemented the conditional frequency distribution to get normalized results for targeted conditions.
- Condition 1: Noun, Verb
    - This is one of the most basic ways to structure a sentence in both English and German (although German does have somewhat more flexible word order), so I used this to test whether bilingual speakers might be influenced by their other language and use more basic or more free grammatical structures
    - Result: The numbers within each language are pretty similar, but each bilingual partition uses this structure more than its monolingual counterpart. Again, and I can never emphasize this enough, both monolingual sections are smaller than their bilingual counterparts, and with bigrams this problem only gets amplified: the gap in bigram counts is much larger than the gap in token counts. Since this is a very common sentence formation, it would make sense that in larger texts it takes up more of the bigrams (similar to the TTR caveat)
- Condition 2: Adjective, Noun
    - This was mostly just to compare with the next test. 
- Condition 3: Adjective, Adjective
    - This test was to compare the use of adjective strings in sentences. I wanted to compare it with the use of (adjective, noun) phrases because the most common places to use an adjective are before nouns and before other adjectives.
    - Results: It does seem that the groups that used adjectives more in front of nouns used them less in front of other adjectives, meaning that in general these groups used fewer multi-adjective phrases. Although there does appear to be a potential difference between bilingual and monolingual speakers in (adjective, noun) usage (not yet tested statistically), no such difference is apparent in the (adjective, adjective) data.
- Condition 4: Adverb, Verb
    - 
- Condition 5: Adposition, Noun
    - 
- Condition 6: Adposition, Determiner
    - 
- Condition 7: Verb, Determiner
    - 
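
The conditions above all rely on the same quantity: of all bigrams whose first tag is A, what fraction have B as the second tag. That is what `cfd[A].freq(B)` returns for an `nltk.ConditionalFreqDist` built from (first, second) pairs. The sketch below recomputes it with the standard library only, to make the normalization explicit; `cond_freq` is a hypothetical helper and the input is the ten tag pairs shown in `debi_bipostags[:10]`.

```python
from collections import Counter, defaultdict

def cond_freq(pairs, first, second):
    """Share of bigrams starting with `first` that continue with `second`."""
    by_first = defaultdict(Counter)
    for a, b in pairs:
        by_first[a][b] += 1
    total = sum(by_first[first].values())
    return by_first[first][second] / total if total else 0.0

# The ten tag pairs printed for debi_bipostags[:10] above.
tags = [("CCONJ", "PRON"), ("PRON", "AUX"), ("AUX", "DET"), ("DET", "NOUN"),
        ("NOUN", "INTJ"), ("INTJ", "VERB"), ("CCONJ", "PRON"), ("PRON", "AUX"),
        ("AUX", "DET"), ("DET", "NOUN")]

print(cond_freq(tags, "DET", "NOUN"))
print(cond_freq(tags, "NOUN", "VERB"))
```

Because each value is divided by the number of bigrams that start with the condition tag, the results are proportions within a partition and do not depend directly on how large that partition is.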

Note to self: implement statistical testing for the ones I believe to have stronger correlation
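
As a starting point for that note, here is a minimal sketch of a chi-square test on a 2x2 table, applied to the pronoun counts from the combined unigram outputs above (539 of 9093 bilingual tokens and 193 of 2375 monolingual tokens tagged PRON). The helper `chi2_2x2` is hypothetical and uses only the standard library. The test treats every token as an independent observation, an assumption that transcribed speech from a limited number of speakers does not really satisfy, so the result should be read as a rough indication at best.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: bilingual, monolingual. Columns: PRON, every other tag.
bi_pron, bi_total = 539, 9093
mono_pron, mono_total = 193, 2375
stat, p = chi2_2x2(bi_pron, bi_total - bi_pron,
                   mono_pron, mono_total - mono_pron)
print(f"chi2 = {stat:.2f}, p = {p:.4g}")
```

The same function could be reused for any of the bigram conditions by putting the condition's count and the remaining bigrams with that first tag into the two columns.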
