# Observations on the Significance of Word Choices in the European Parliament

This is for a university course. I hope you're not an actual researcher who stumbled across this while looking for academically important information.

## Importing NLTK & Loading Europarl

This section sets up the environment for the analysis by installing NLTK and reading the Europarl corpus.

In [None]:
# Install NLTK
%pip install nltk
import nltk
nltk.download('punkt_tab', quiet=True)

Note: you may need to restart the kernel to use updated packages.


True

In [None]:
from nltk.corpus import EuroparlCorpusReader
from nltk.text import Text

# Read the English portion of our Europarl excerpt
europarl_reader = EuroparlCorpusReader('./europarl/en', '.*')
europarl = Text(europarl_reader.words())

This cell defines the words we're looking for and the n-gram classes we're going to use.

In [None]:
# Give the imports generic names so that switching to bigrams later on is a single change here
from nltk.collocations import (
    TrigramCollocationFinder as NgramCollocationFinder,
    TrigramAssocMeasures as NgramAssocMeasures,
)

# The words to examine
word_a = 'substantial'
word_b = 'significant'

## Statistics

This cell shows some general statistics of the words we're studying.

In [155]:
europarl_word_count = len(europarl)

word_a_abs_freq = europarl.count(word_a)
word_b_abs_freq = europarl.count(word_b)
word_a_freq = word_a_abs_freq / europarl_word_count
word_b_freq = word_b_abs_freq / europarl_word_count

print(f"Total corpus size: {europarl_word_count}")
print(f"Uses of '{word_a}': {word_a_abs_freq}, {word_a_freq*100:.2}%")
print(f"Uses of '{word_b}': {word_b_abs_freq}, {word_b_freq*100:.2}%")

word_a_to_word_b_ratio = word_a_abs_freq / word_b_abs_freq
print(f"'{word_a}' to '{word_b}' ratio: {word_a_to_word_b_ratio:.2}")

Total corpus size: 11235644
Uses of 'substantial': 888, 0.0079%
Uses of 'significant': 1645, 0.015%
'substantial' to 'significant' ratio: 0.54
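
Percentages this small are hard to compare at a glance. As a minimal sketch, the same figures can be restated as occurrences per million words. The counts are copied from the output above. The helper name `per_million` is hypothetical and not part of NLTK.

```python
# Counts copied from the output of the statistics cell above
CORPUS_SIZE = 11_235_644
COUNTS = {'substantial': 888, 'significant': 1645}


def per_million(count: int, corpus_size: int) -> float:
    """Return occurrences per million tokens (hypothetical helper, not NLTK)."""
    return count / corpus_size * 1_000_000


for word, count in COUNTS.items():
    print(f"'{word}': {per_million(count, CORPUS_SIZE):.1f} per million words")
```

The ratio between the two words is unchanged by this rescaling. Only the unit is easier to read.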


In [None]:
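# Build one trigram finder per word, keeping only the trigrams that contain that word.
# apply_ngram_filter removes every n-gram for which the given function returns True.
# Approach from https://stackoverflow.com/questions/49197667/nltk-how-to-get-bigrams-containing-a-specific-word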
word_a_finder = NgramCollocationFinder.from_documents([europarl])
word_a_finder.apply_ngram_filter(lambda *ngram: word_a not in ngram)

word_b_finder = NgramCollocationFinder.from_documents([europarl])
word_b_finder.apply_ngram_filter(lambda *ngram: word_b not in ngram)
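
To make the filter logic concrete, here is a rough pure-Python sketch of the effect the cell above is after: count every trigram, then drop those that do not contain the target word. It is not NLTK's implementation. The toy token list and the helper name `trigrams_containing` are made up for illustration. The predicate returns true for trigrams that should be removed, which is why it tests `not in`.

```python
from collections import Counter


def trigrams_containing(tokens, target):
    """Count trigrams in `tokens` and keep only those containing `target`.

    Hypothetical helper that mimics the effect of the filter above.
    """
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Same predicate as in the notebook: True means "remove this trigram"
    should_remove = lambda *ngram: target not in ngram
    return Counter({ng: n for ng, n in counts.items() if not should_remove(*ng)})


# Made-up example sentence, not taken from Europarl
toy_tokens = 'this is a substantial step and a substantial risk'.split()
for ngram, n in trigrams_containing(toy_tokens, 'substantial').items():
    print(ngram, n)
```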

## Finding uses of the words

The following cells examine how the two words are used.

In [157]:
# Check which words are used in similar contexts
print(f'Similar contexts to {word_a}')
europarl.similar(word_a)
print(f'\nSimilar contexts to {word_b}')
europarl.similar(word_b)

print('\nCommon contexts')
europarl.common_contexts([word_a, word_b])

Similar contexts to substantial
the significant new european great specific a this major good real
more considerable and serious in political clear of that

Similar contexts to significant
the and important a major new real serious great good in political
positive european this clear specific that fundamental considerable

Common contexts
a_contribution a_number a_increase a_part a_amount made_progress
a_reduction a_majority make_progress more_and a_and a_step
the_increase a_proportion a_improvement a_effort a_but that_progress
and_progress make_changes
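
Each entry in the common-contexts output has the form `left_right`: the word immediately before and the word immediately after the target. The sketch below reproduces that idea in plain Python on a made-up token list. It is an approximation for illustration, not NLTK's code. It skips the first and last token to sidestep text boundaries, and lower-casing is a choice made for this sketch. `contexts_of` is a hypothetical helper.

```python
def contexts_of(tokens, target):
    """Return the set of (left, right) neighbour pairs around `target`."""
    lowered = [t.lower() for t in tokens]
    return {
        (lowered[i - 1], lowered[i + 1])
        for i in range(1, len(lowered) - 1)
        if lowered[i] == target
    }


# Made-up tokens, not taken from Europarl
toy_tokens = ('We made a substantial contribution and a significant contribution '
              'but only significant progress').split()

# A context is "common" when both words occur between the same two neighbours
shared = contexts_of(toy_tokens, 'substantial') & contexts_of(toy_tokens, 'significant')
print(' '.join(f'{left}_{right}' for left, right in sorted(shared)))
```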


In [158]:
%pip install tabulate
from tabulate import tabulate

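def score_by(func_name: str, func):
    """Print the 15 best-scoring trigrams for each word under the given association measure."""
    print(f'as scored via {func_name}')
    # score_ngrams returns (ngram, score) pairs ordered from highest to lowest score
    best_a = word_a_finder.score_ngrams(func)
    best_b = word_b_finder.score_ngrams(func)
    print(tabulate(best_a[:15]))
    print(tabulate(best_b[:15]))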

score_by('raw frequency', NgramAssocMeasures.raw_freq)
score_by('PMI', NgramAssocMeasures.pmi)

Note: you may need to restart the kernel to use updated packages.
as scored via raw frequency
-------------------------------------  -----------
('substantial', 'contribution', 'to')  2.84808e-06
('a', 'substantial', 'contribution')   2.49207e-06
('substantial', 'increase', 'in')      2.31406e-06
('make', 'a', 'substantial')           1.86905e-06
('a', 'substantial', 'increase')       1.78005e-06
('a', 'very', 'substantial')           1.78005e-06
('substantial', 'number', 'of')        1.78005e-06
('a', 'substantial', 'part')           1.69105e-06
('is', 'a', 'substantial')             1.69105e-06
('substantial', 'part', 'of')          1.69105e-06
('a', 'substantial', 'number')         1.51304e-06
('and', 'a', 'substantial')            1.33504e-06
('to', 'a', 'substantial')             1.15703e-06
('by', 'a', 'substantial')             1.06803e-06
('substantial', 'progress', 'in')      1.06803e-06
-------------------------------------  -----------
-------------------------------------  
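
The raw-frequency scores above are easier to read as counts. Judging by the numbers, each score is the trigram count divided by the corpus size, so multiplying by the corpus size printed earlier recovers the count. The sketch below does that for the first three rows. It also spells out the usual n-gram PMI formula, log2 of (trigram count × N²) over the product of the three unigram counts, which is my understanding of what NLTK's `pmi` computes for trigrams. Both helper names are hypothetical, and the counts passed to `trigram_pmi` are invented purely to show the call.

```python
import math

CORPUS_SIZE = 11_235_644  # total corpus size printed by the statistics cell

# Scores copied from the first three rows of the table above
raw_freq_scores = {
    ('substantial', 'contribution', 'to'): 2.84808e-06,
    ('a', 'substantial', 'contribution'): 2.49207e-06,
    ('substantial', 'increase', 'in'): 2.31406e-06,
}


def score_to_count(score: float, corpus_size: int) -> int:
    """Convert a raw-frequency score back to an absolute count."""
    return round(score * corpus_size)


def trigram_pmi(n_trigram: int, n_w1: int, n_w2: int, n_w3: int, total: int) -> float:
    """Pointwise mutual information of a trigram from plain counts."""
    return math.log2(n_trigram * total ** 2) - math.log2(n_w1 * n_w2 * n_w3)


for ngram, score in raw_freq_scores.items():
    print(ngram, score_to_count(score, CORPUS_SIZE))

# Invented counts, only to show the call; not measured on Europarl
print(trigram_pmi(n_trigram=5, n_w1=10, n_w2=20, n_w3=50, total=1000))
```

Raw frequency favours trigrams built from common words such as "a" and "to", whereas PMI rewards trigrams whose words occur together more often than their individual frequencies would suggest.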

## Examine specific concordances

This section takes a closer look at some of the more interesting results.

In [159]:

def compare_uses(query: str, count=10):
    """Show concordance lines for `query`, with '$' replaced by each word in turn."""
    # Ask for as many lines as the word occurs, so len(a) and len(b) are full match counts
    a = europarl.concordance_list(query.replace('$', word_a).split(' '), lines=word_a_abs_freq)
    b = europarl.concordance_list(query.replace('$', word_b).split(' '), lines=word_b_abs_freq)
    print(f"showing {min(len(a),count)}/{len(a)} results for '{word_a}'")
    for con in a[:count]:
        print(con.line)
    print('---')
    print(f"showing {min(len(b),count)}/{len(b)} results for '{word_b}'")
    for con in b[:count]:
        print(con.line)
    print('---')
    # Avoid dividing by zero when the query never matches word_b
    ratio = f"{len(a)/len(b):.2}" if b else "n/a"
    print(f"ratio: {ratio}; compare to {word_a_to_word_b_ratio:.2}")

compare_uses('is a $')
print('\n=====\n')
compare_uses('$ impact')
print('\n=====\n')
compare_uses('$ role')
print('\n=====\n')
compare_uses('$ contribution')
print('\n=====\n')
compare_uses('$ increase')

showing 10/19 results for 'substantial'
ences of globalisation, but it is a substantial and necessary response. <P> In
nd the rest from the EIB. That is a substantial investment in the future prosp
. <P> On the other hand, there is a substantial risk of too many plans and pro
le European Food Safety Agency is a substantial step which satisfies the need 
made to me as to whether there is a substantial legal basis or not. "In princi
clearly demonstrate that there is a substantial link between vehicles that hav
rs are not, and of those there is a substantial number which is carried over, 
liation that we have before us is a substantial improvement upon the rules for
pon which we are about to vote is a substantial step forward in the Union's in
 benefited citizens. But there is a substantial danger here. We face a situati
---
showing 10/67 results for 'significant'
 number of jobs and when there is a significant reduction in the unemployment 
 NAME="Okking"> . (DA) Tourism is a significant