# Named Entity Recognition on the SE articles and extraction of Subject-Verb-Object tuples

### Step 1. Loading Spacy models
***

On the first run we install spaCy's language model; afterwards the download command can be commented out. Note that we load spaCy's "medium" English model (`en_core_web_md`).


In [1]:
import re
import pandas as pd
import numpy as np
import spacy
import sys
from collections import Counter

## Run to install the language library, then comment-out
## !{sys.executable} -m spacy download en
!{sys.executable} -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_md')
print('Finished loading.')


[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Finished loading.
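
spaCy models are installed as ordinary Python packages, so one way to avoid commenting the download command in and out by hand is to test whether the package can be found. The snippet below is a minimal sketch of that idea, not part of the original notebook: the helper names `model_installed` and `download_command` are illustrative, and the command is only printed, never executed.

```python
import importlib.util
import sys

MODEL = "en_core_web_md"

def model_installed(name):
    """Return True if a package with this name can be found by the import system."""
    return importlib.util.find_spec(name) is not None

def download_command(name):
    """Build the same command that the notebook cell runs with '!'."""
    return [sys.executable, "-m", "spacy", "download", name]

if model_installed(MODEL):
    print(MODEL, "is already installed")
else:
    print("Run once:", " ".join(download_command(MODEL)))
```

In a notebook, the printed command could be passed to `subprocess.run` so that the download happens only when the model is missing.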


### Step 2. Pre-processing
***

* Read the file with the scraped content from the SE articles, i.e. the abstracts and the full contents from the **constructed "raw content" column** (see Deliverable D2.2). 
* In later versions, the corresponding tables will be directly exported from the database.
* Discard records with duplicate or missing titles, abstracts or raw contents, then clean the text.
* Drop records whose abstract or raw content is reduced to an empty string by the cleaning.


In [13]:
dat = pd.read_excel('articles_5_1_15_25.xlsx')
dat = dat[['title','abstract','categories','raw content']]

dat = dat.drop_duplicates(subset=["title"])
dat = dat.dropna(axis=0,subset=["title"])
dat = dat.drop_duplicates(subset=["abstract"])
dat = dat.dropna(axis=0,subset=["abstract"])
dat = dat.drop_duplicates(subset=["raw content"])
dat = dat.dropna(axis=0,subset=["raw content"])
dat.reset_index(drop=True, inplace=True)


## NO !!! .apply(lambda x: re.sub(r'[a-zA-Z]+\d+', ' ', x))## letters+digits -> space

## parentheses with only digits, dots, spaces, percentage sign, minus sign: replace with space
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(r"\([\d .%-]+\)", ' ',x)) ## replace numbers in parentheses with a space
##dat['raw content'] = dat['raw content'].apply(lambda x: re.sub("[^a-z\\.,A-Z0-9\-]", " ",x)) ## replace anything except digits,letters,comma,dot,dash by space 

dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(' +', ' ',x)) ## collapse runs of spaces into a single space
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub('^ +| +$', '',x,flags=re.MULTILINE)) ## strip leading and trailing spaces on every line
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(' ,',',',x)) ## space-comma -> comma

## parentheses with only digits, dots, spaces, percentage sign, minus sign: replace with space
dat['abstract'] = dat['abstract'].apply(lambda x: re.sub(r"\([\d .%-]+\)", ' ',x)) ## replace numbers in parentheses with a space
##dat['abstract'] = dat['abstract'].apply(lambda x: re.sub("[^a-z\\.,A-Z0-9\-]", " ",x)) ## replace anything except digits,letters,comma,dot,dash by space 

dat['abstract'] = dat['abstract'].apply(lambda x: re.sub(' +', ' ',x)) ## collapse runs of spaces into a single space
dat['abstract'] = dat['abstract'].apply(lambda x: re.sub('^ +| +$', '',x,flags=re.MULTILINE)) ## strip leading and trailing spaces on every line
dat['abstract'] = dat['abstract'].apply(lambda x: re.sub(' ,',',',x)) ## space-comma -> comma


dat = dat.replace('', np.nan) ## check if empty strings produced and drop records if necessary
dat = dat.dropna(axis=0,subset=["abstract"])
dat = dat.dropna(axis=0,subset=["raw content"])
dat.reset_index(drop=True, inplace=True)

dat

Unnamed: 0,title,abstract,categories,raw content
0,Absences from work - quarterly statistics,Absences from work can be classified into two ...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...
1,Balance of payments statistics - quarterly data,This article presents quarterly statistics on ...,"['Balance of payments', 'Statistical article']",Current account. The EU non-seasonally adjuste...
2,Accidents and injuries statistics,This article presents an overview of European ...,"['Health', 'Health status', 'Statistical artic...","Deaths from accidents, injuries and assault. I..."
3,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents. In 2018, there were 3.1 m..."
4,Accidents at work - statistics by economic act...,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time. Non-fatal accidents In...
...,...,...,...,...
601,Africa-EU - international trade in goods stati...,This article provides a picture of internation...,"['Non-EU countries', 'Trade in goods', 'Statis...",Africa’s main trade in goods partner is the EU...
602,Adult learning statistics,This article provides an overview of adult lea...,"['Education and training', 'Lifelong learning'...",Participation rate of adults in learning in th...
603,Acquisition of citizenship statistics,This article presents recent statistics on the...,"['Asylum and migration', 'Population', 'Acquis...",EU-27 Member States granted citizenship to 706...
604,Adult learning statistics - characteristics of...,This article presents an overview of European ...,"['Education and training', 'Participation in e...",Formal and non-formal adult education and trai...
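
To see what the four substitutions of Step 2 do to a single string, they can be wrapped in one function and applied to a sample. This is a small sketch using only the `re` module; the function name `clean_text` and the sample sentence are made up for illustration.

```python
import re

def clean_text(text):
    """Apply the same four substitutions used on 'abstract' and 'raw content'."""
    text = re.sub(r"\([\d .%-]+\)", " ", text)               # parenthesised numbers -> space
    text = re.sub(" +", " ", text)                            # collapse runs of spaces
    text = re.sub("^ +| +$", "", text, flags=re.MULTILINE)    # strip spaces at line ends
    text = re.sub(" ,", ",", text)                            # space-comma -> comma
    return text

sample = "  In 2018 (3.1 %) , there were  accidents (see Table 1).  "
cleaned = clean_text(sample)
print(repr(cleaned))
# 'In 2018, there were accidents (see Table 1).'
```

The parenthesis `(see Table 1)` survives because it contains letters; only parentheses made up of digits, dots, spaces, percent signs and minus signs are removed.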


### Step 3. An improved version of a Subject-Verb-Object extraction function using Spacy
***

* By Peter de Vocht, see [GitHub code](https://github.com/peter3125/enhanced-subject-verb-object-extraction/blob/master/subject_verb_object_extract.py).
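* `findSVOs(tokens)` returns a list of tuples: `(subject, verb, object)` when an object is found and `(subject, verb)` otherwise. A verb prefixed with `!` marks a negated verb or object, and for passive sentences the grammatical subject and object are swapped.
* Subjects and objects are expanded with their neighbouring tokens (see `expand`), so the elements of the returned tuples can be longer than the entity of interest.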


In [3]:
# Copyright 2017 Peter de Vocht
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#import en_core_web_sm
from collections.abc import Iterable

# use spacy small model
#nlp = en_core_web_sm.load()

# dependency markers for subjects
SUBJECTS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
# dependency markers for objects
OBJECTS = {"dobj", "dative", "attr", "oprd"}
# POS tags that will break adjoining items
BREAKER_POS = {"CCONJ", "VERB"}
# words that are negations
NEGATIONS = {"no", "not", "n't", "never", "none"}


# does dependency set contain any coordinating conjunctions?
def contains_conj(depSet):
    return "and" in depSet or "or" in depSet or "nor" in depSet or \
           "but" in depSet or "yet" in depSet or "so" in depSet or "for" in depSet


# get subs joined by conjunctions
def _get_subs_from_conjunctions(subs):
    more_subs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if contains_conj(rightDeps):
            more_subs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(more_subs) > 0:
                more_subs.extend(_get_subs_from_conjunctions(more_subs))
    return more_subs


# get objects joined by conjunctions
def _get_objs_from_conjunctions(objs):
    more_objs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if contains_conj(rightDeps):
            more_objs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(more_objs) > 0:
                more_objs.extend(_get_objs_from_conjunctions(more_objs))
    return more_objs


# find sub dependencies
def _find_subs(tok):
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        subs = [tok for tok in head.lefts if tok.dep_ == "SUB"]
        if len(subs) > 0:
            verb_negated = _is_negated(head)
            subs.extend(_get_subs_from_conjunctions(subs))
            return subs, verb_negated
        elif head.head != head:
            return _find_subs(head)
    elif head.pos_ == "NOUN":
        return [head], _is_negated(tok)
    return [], False


# is the tok set's left or right negated?
def _is_negated(tok):
    parts = list(tok.lefts) + list(tok.rights)
    for dep in parts:
        if dep.lower_ in NEGATIONS:
            return True
    return False


# get all the verbs on tokens with negation marker
def _find_svs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs


# get grammatical objects for a given set of dependencies (including passive sentences)
def _get_objs_from_prepositions(deps, is_pas):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or
                         (tok.pos_ == "PRON" and tok.lower_ == "me") or
                         (is_pas and tok.dep_ == 'pobj')])
    return objs


# get objects from the dependencies using the attribute dependency
def _get_objs_from_attrs(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(_get_objs_from_prepositions(rights, is_pas))
                    if len(objs) > 0:
                        return v, objs
    return None, None


# xcomp; open complement - verb has no subject
def _get_obj_from_xcomp(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(_get_objs_from_prepositions(rights, is_pas))
            if len(objs) > 0:
                return v, objs
    return None, None


# get all functional subjects adjacent to the verb passed in
def _get_all_subs(v):
    verb_negated = _is_negated(v)
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(_get_subs_from_conjunctions(subs))
    else:
        foundSubs, verb_negated = _find_subs(v)
        subs.extend(foundSubs)
    return subs, verb_negated


# find the main verb - or any aux verb if we can't find it
def _find_verbs(tokens):
    verbs = [tok for tok in tokens if _is_non_aux_verb(tok)]
    if len(verbs) == 0:
        verbs = [tok for tok in tokens if _is_verb(tok)]
    return verbs


# is the token a verb?  (excluding auxiliary verbs)
def _is_non_aux_verb(tok):
    return tok.pos_ == "VERB" and (tok.dep_ != "aux" and tok.dep_ != "auxpass")


# is the token a verb?  (including auxiliary verbs)
def _is_verb(tok):
    return tok.pos_ == "VERB" or tok.pos_ == "AUX"


# return the verb to the right of this verb in a CCONJ relationship if applicable
# returns a tuple, first part True|False and second part the modified verb if True
def _right_of_verb_is_conj_verb(v):
    # rights is a generator
    rights = list(v.rights)

    # VERB CCONJ VERB (e.g. he beat and hurt me)
    if len(rights) > 1 and rights[0].pos_ == 'CCONJ':
        for tok in rights[1:]:
            if _is_non_aux_verb(tok):
                return True, tok

    return False, v


# get all objects for an active/passive sentence
def _get_all_objs(v, is_pas):
    # rights is a generator
    rights = list(v.rights)

    objs = [tok for tok in rights if tok.dep_ in OBJECTS or (is_pas and tok.dep_ == 'pobj')]
    objs.extend(_get_objs_from_prepositions(rights, is_pas))

    #potentialNewVerb, potentialNewObjs = _get_objs_from_attrs(rights)
    #if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
    #    objs.extend(potentialNewObjs)
    #    v = potentialNewVerb

    potential_new_verb, potential_new_objs = _get_obj_from_xcomp(rights, is_pas)
    if potential_new_verb is not None and potential_new_objs is not None and len(potential_new_objs) > 0:
        objs.extend(potential_new_objs)
        v = potential_new_verb
    if len(objs) > 0:
        objs.extend(_get_objs_from_conjunctions(objs))
    return v, objs


# return true if the sentence is passive - at the moment a sentence is assumed passive if
# it has an auxpass (auxiliary passive) verb
def _is_passive(tokens):
    for tok in tokens:
        if tok.dep_ == "auxpass":
            return True
    return False


# resolve a 'that' where/if appropriate
def _get_that_resolution(toks):
    for tok in toks:
        if 'that' in [t.orth_ for t in tok.lefts]:
            return tok.head
    return None


# simple stemmer using lemmas
def _get_lemma(word: str):
    tokens = nlp(word)
    if len(tokens) == 1:
        return tokens[0].lemma_
    return word


# print text, dependency label, POS tag, head and left/right children of each token in the parse tree
def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, tok.head.orth_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])


# expand an obj / subj np using its chunk
def expand(item, tokens, visited):
    if item.lower_ == 'that':
        temp_item = _get_that_resolution(tokens)
        if temp_item is not None:
            item = temp_item

    parts = []

    if hasattr(item, 'lefts'):
        for part in item.lefts:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    parts.append(item)

    if hasattr(item, 'rights'):
        for part in item.rights:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    if hasattr(parts[-1], 'rights'):
        for item2 in parts[-1].rights:
            if item2.pos_ == "DET" or item2.pos_ == "NOUN":
                if item2.i not in visited:
                    visited.add(item2.i)
                    parts.extend(expand(item2, tokens, visited))
            break

    return parts


# convert a list of tokens to a string
def to_str(tokens):
    if isinstance(tokens, Iterable):
        return ' '.join([item.text for item in tokens])
    else:
        return ''


# find verbs and their subjects / objects to create SVOs, detect passive/active sentences
def findSVOs(tokens):
    svos = []
    is_pas = _is_passive(tokens) ## is an "auxpass" verb contained in the tokens?
    verbs = _find_verbs(tokens) ## get the main verbs (or aux verbs if none) 
    visited = set()  # recursion detection
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            isConjVerb, conjV = _right_of_verb_is_conj_verb(v)
            if isConjVerb:
                v2, objs = _get_all_objs(conjV, is_pas)
                for sub in subs:
                    for obj in objs:
                        objNegated = _is_negated(obj)
                        if is_pas:  # reverse object / subject for passive
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v2.lemma_ if verbNegated or objNegated else v2.lemma_, to_str(expand(sub, tokens, visited))))
                        else:
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v2.lower_ if verbNegated or objNegated else v2.lower_, to_str(expand(obj, tokens, visited))))
            else:
                v, objs = _get_all_objs(v, is_pas)
                for sub in subs:
                    if len(objs) > 0:
                        for obj in objs:
                            objNegated = _is_negated(obj)
                            if is_pas:  # reverse object / subject for passive
                                svos.append((to_str(expand(obj, tokens, visited)),
                                             "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            else:
                                svos.append((to_str(expand(sub, tokens, visited)),
                                             "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                    else:
                        # no obj - just return the SV parts
                        svos.append((to_str(expand(sub, tokens, visited)),
                                     "!" + v.lower_ if verbNegated else v.lower_,))
    ##print(len(svos))
    return svos
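
The tuples appended by `findSVOs` have two shapes, `(subject, verb, object)` and `(subject, verb)`, and the verb carries a leading `!` when the verb or the object is negated. The helper below is a small sketch for reading such tuples back. The name `parse_svo` is made up; the first two example tuples are copied from outputs in this notebook and the third is hypothetical.

```python
def parse_svo(svo):
    """Split an SVO tuple into (subject, verb, object_or_None, negated)."""
    subject, verb = svo[0], svo[1]
    obj = svo[2] if len(svo) > 2 else None   # two-element tuples have no object
    negated = verb.startswith("!")           # '!' prefix marks negation
    if negated:
        verb = verb[1:]
    return subject, verb, obj, negated

examples = [
    ("Table 2", "shows", "the GHG emissions in grams of CO 2 -equivalents"),
    ("The EU non - account", "adjusted"),     # no object was found
    ("the data", "!include", "estimates"),    # hypothetical negated tuple
]
for svo in examples:
    print(parse_svo(svo))
```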

### Step 4. Apply the NER engine and the SVO function to the SE articles abstracts
***

* Create column ORG, which will hold dictionaries with the entities recognized as organizations.
* Create column ORG_SVOs, which will hold dictionaries with the SVOs involving entities recognized as organizations.
* A similar procedure can be applied to countries, cities and states (code GPE), nationalities or religious or political groups (code NORP), non-GPE locations, mountain ranges and bodies of water (code LOC), etc.
* In each dictionary in column ORG, the key is the upper-cased entity text and the value is a two-element list: the list of (*start*, *end*) token-index pairs of the entity span within its sentence, and the number of matches recorded for the entity in the text processed.
* In each dictionary in column ORG_SVOs, the key is the upper-cased entity text and the value is a two-element list: the list of SVO tuples in which the entity was found, and their count.
* We also create a separate global dictionary ORG_SVOs gathering the SVOs from all articles.
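
The `[[items], count]` layout of these dictionaries can be sketched with plain Python. The function `record` below is an illustrative name, not part of the notebook; it uses `dict.setdefault`, which has the same effect as the if/else branches in the cell that follows.

```python
def record(store, entity_text, item):
    """Append item under the upper-cased entity key and increase its counter."""
    entry = store.setdefault(entity_text.upper(), [[], 0])
    entry[0].append(item)   # first element: list of spans (or SVO tuples)
    entry[1] += 1           # second element: running count
    return store

org = {}
record(org, "Eurostat", (3, 4))
record(org, "EUROSTAT", (10, 11))   # same key after upper-casing
record(org, "EU", (1, 2))
print(org)
# {'EUROSTAT': [[(3, 4), (10, 11)], 2], 'EU': [[(1, 2)], 1]}
```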

In [4]:


dat['ORG'] = [dict() for i in range(len(dat))]
dat['ORG_SVOs'] = [dict() for i in range(len(dat))]
ORG_SVOs = dict() ## a separate dictionary holding all SVOs from all articles

def process_texts(dat,column):

    nlp.max_length = 1500000
    
    for i in range(len(dat)):
        if (i+1) % 100 == 0: print('article i = ',i+1,' of ',len(dat))
        doc = nlp(dat.loc[i,column]) ## pre-process article

        sents = doc.sents ## segment into sentences
        sents_list = [sent for sent in doc.sents]
        num_sents = len(sents_list)
        if num_sents ==0: 
            print(sents_list)
            raise Exception("Error A!") 

        for (j,sent) in enumerate(sents_list):  
            doc_sent = nlp(sent.text) ## pre-process sentence
        
            entities = doc_sent.ents ## entities in sentence       
            #print('General entities: ',len(entities),': ',entities)
            if len(entities) == 0: 
                #print('\nSentence ',j+1,' of ',num_sents,': No general entities - passed\n')
                continue

            for ent in entities: ## just a check to verify the span of each entity IN THE SENTENCE
                if ent.text != doc_sent.text[ent.start_char: ent.end_char]:
                    raise Exception("Error B!")             
            
            org_ents = [ent for ent in entities if ent.label_=='ORG'] ## ORG entities
            if len(org_ents) == 0:
                #print('\nSentence ',j+1,' of ',num_sents,': No ORG entities - passed\n')
                continue
 
            svos = findSVOs(doc_sent)
            for (k,sv) in enumerate(svos):
                for s in sv: 
                    #print('searching in: ',s)
                    for e in org_ents:
                        #print('searching for ',e.text)
                        if s.find(e.text) != -1:
                            #print(k,': ',sv,' : found ',e.text)
                            if e.text.upper() in dat.loc[i,'ORG'].keys():
                                dat.loc[i,'ORG'][e.text.upper()][0].append((e.start,e.end)) 
                                dat.loc[i,'ORG'][e.text.upper()][1] += 1 
                            else:    
                                dat.loc[i,'ORG'][e.text.upper()] = [[(e.start,e.end)],1]
                        
                            if e.text.upper() in dat.loc[i,'ORG_SVOs'].keys():
                                dat.loc[i,'ORG_SVOs'][e.text.upper()][0].append(sv) 
                                dat.loc[i,'ORG_SVOs'][e.text.upper()][1] += 1 
                            else:    
                                dat.loc[i,'ORG_SVOs'][e.text.upper()] = [[sv],1] 
                        
                            ## global dictionary
                            if e.text.upper() in ORG_SVOs.keys():
                                ORG_SVOs[e.text.upper()][0].append(sv) 
                                ORG_SVOs[e.text.upper()][1] += 1 
                            else:    
                                ORG_SVOs[e.text.upper()] = [[sv],1] 
    return dat  

dat = process_texts(dat,'abstract')

## Entity labels used by the spaCy English models:
#PERSON People, including fictional
#NORP Nationalities or religious or political groups
#FAC Buildings, airports, highways, bridges, etc.
#ORG Companies, agencies, institutions, etc.
#GPE Countries, cities, states
#LOC Non-GPE locations, mountain ranges, bodies of water
#PRODUCT Vehicles, weapons, foods, etc. (Not services)
#EVENT Named hurricanes, battles, wars, sports events, etc.
#WORK_OF_ART Titles of books, songs, etc.
#LAW Named documents made into laws
#LANGUAGE Any named language
#The following values are also annotated in a style similar to names:
#DATE Absolute or relative dates or periods
#TIME Times smaller than a day
#PERCENT Percentage (including “%”)
#MONEY Monetary values, including unit
#QUANTITY Measurements, as of weight or distance
#ORDINAL “first”, “second”
#CARDINAL Numerals that do not fall under another type



article i =  100  of  606
article i =  200  of  606
article i =  300  of  606
article i =  400  of  606
article i =  500  of  606
article i =  600  of  606
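
Inside `process_texts`, an entity is attached to an SVO when its text occurs as a substring of any element of the tuple (`s.find(e.text) != -1`). The sketch below reproduces that test on plain strings and contrasts it with a whole-word variant. The whole-word variant is an assumption added for comparison, not the rule used in the notebook, and the first tuple is made up.

```python
import re

def mentions_substring(svo, entity):
    """The notebook's rule: entity text occurs anywhere in some SVO element."""
    return any(part.find(entity) != -1 for part in svo)

def mentions_word(svo, entity):
    """Stricter variant (assumption): entity must not touch other word characters."""
    pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
    return any(re.search(pattern, part) for part in svo)

made_up = ("exports", "reached", "EUR 5 billion")
from_output = ("Statistics on goods", "evaluate", "the EU")

print(mentions_substring(made_up, "EU"))   # True: 'EU' is a prefix of 'EUR'
print(mentions_word(made_up, "EU"))        # False
print(mentions_word(from_output, "EU"))    # True
```

Note that the notebook records one match per matching element, so an entity appearing in both the subject and the object of the same tuple is counted twice; `any` is used here only to express the test itself.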


### Step 5. Apply the same procedure to the SE articles full contents
***

* Update column ORG. 
* Update column ORG_SVOs. 
* Update the separate global dictionary ORG_SVOs.

In [5]:

dat = process_texts(dat,'raw content')
              


article i =  100  of  606
article i =  200  of  606
article i =  300  of  606
article i =  400  of  606
article i =  500  of  606
article i =  600  of  606


In [6]:
dat

Unnamed: 0,title,abstract,categories,raw content,ORG,ORG_SVOs
0,Absences from work - quarterly statistics,Absences from work can be classified into two ...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...,{},{}
1,Balance of payments statistics - quarterly data,This article presents quarterly statistics on ...,"['Balance of payments', 'Statistical article']",Current account. The EU non-seasonally adjuste...,"{'EU': [[(1, 2), (1, 2), (5, 6), (8, 9), (1, 2...","{'EU': [[('The EU non - account', 'adjusted'),..."
2,Accidents and injuries statistics,This article presents an overview of European ...,"['Health', 'Health status', 'Statistical artic...","Deaths from accidents, injuries and assault. I...","{'EU': [[(9, 10), (9, 10), (14, 15), (14, 15)]...","{'EU': [[('This article', 'presents', 'an over..."
3,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents. In 2018, there were 3.1 m...","{'ESAW': [[(48, 49), (7, 8)], 2], 'NACE SECTIO...","{'ESAW': [[('accidents at work ESAW exercise',..."
4,Accidents at work - statistics by economic act...,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time. Non-fatal accidents In...,"{'ESAW': [[(48, 49)], 1], 'EU': [[(23, 24)], 1]}","{'ESAW': [[('accidents at work ESAW exercise',..."
...,...,...,...,...,...,...
601,Africa-EU - international trade in goods stati...,This article provides a picture of internation...,"['Non-EU countries', 'Trade in goods', 'Statis...",Africa’s main trade in goods partner is the EU...,"{'UN': [[(4, 5)], 1], 'EU': [[(9, 10), (11, 12...",{'UN': [[('the UN subdivision of in five diffe...
602,Adult learning statistics,This article provides an overview of adult lea...,"['Education and training', 'Lifelong learning'...",Participation rate of adults in learning in th...,"{'CZECHIA': [[(32, 33)], 1]}","{'CZECHIA': [[('Czechia', 'reported', 'the sam..."
603,Acquisition of citizenship statistics,This article presents recent statistics on the...,"['Asylum and migration', 'Population', 'Acquis...",EU-27 Member States granted citizenship to 706...,"{'CZECHIA': [[(8, 9)], 1], 'EU': [[(15, 16), (...","{'CZECHIA': [[('Other countries with under', '..."
604,Adult learning statistics - characteristics of...,This article presents an overview of European ...,"['Education and training', 'Participation in e...",Formal and non-formal adult education and trai...,"{'EU': [[(9, 10), (9, 10), (9, 10), (18, 19), ...","{'EU': [[('This article', 'presents', 'an over..."


### Step 6. Gathering the most common entities: example with ORG entities
***

The output shows a few errors and repetitions, which call for further cleansing steps and fine-tuning of the NER engine (not yet carried out). In total, 353 distinct terms are identified as named entities of type organization.


In [7]:
from itertools import chain
org_list=sorted(list(chain.from_iterable(dat['ORG'].apply(lambda x: x.keys()))))
org_all_freqs = sorted(Counter(org_list))
print('Total terms identified as ORG: ',len(org_all_freqs))

print('\n100 most common:\n')
org_common_freqs = Counter(org_list).most_common(100)
org_common = sorted([x[0] for x in org_common_freqs])

print(org_common_freqs)

Total terms identified as ORG:  353

100 most common:

[('EU', 476), ('EUROSTAT', 191), ('CZECHIA', 67), ('EFTA', 38), ('PPS', 27), ('NACE', 25), ('ISCED', 21), ('EU-28', 19), ('SDG', 19), ('SITC', 18), ('THE EUROPEAN COMMISSION', 17), ('EHIS', 15), ('ICT', 14), ('EUR', 13), ('GHG', 12), ('THE EUROPEAN UNION', 12), ('ASEAN', 11), ('FDI', 11), ('HOUSEHOLDS', 11), ('UAA', 11), ('OECD', 9), ('THE ÎLE DE FRANCE', 9), ('EC', 8), ('ESA', 7), ('NEET', 7), ('STS', 7), ('UN', 7), ('ASEM', 6), ('EEA', 6), ('THE EUROPEAN ENVIRONMENT AGENCY', 6), ('COICOP', 5), ('EGSS', 5), ('LFS', 5), ('NESA', 5), ('NPISH', 5), ('REGULATION', 5), ('SBS', 5), ('STATE', 5), ('THE UNITED NATIONS', 5), ('AGRICULTURAL', 4), ('CO 2', 4), ('COFOG', 4), ('COMMISSION', 4), ('DATA', 4), ('EUROBASE', 4), ('EUROPEAN UNION', 4), ('GNI', 4), ('HBS', 4), ('ICD', 4), ('NATURA', 4), ('THE WORLD HEALTH ORGANISATION', 4), ('AEA', 3), ('CFP', 3), ('COUNCIL', 3), ('DMC', 3), ('ESAW', 3), ('ESSPROS', 3), ('EUROBAROMETER', 3), ('FLEVOL
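
Because `org_list` is built from the dictionary keys of each article, the frequencies above appear to count the number of articles in which an entity was recorded, not the number of mentions. The sketch below illustrates this on made-up data and then shows one possible cleansing step, merging keys that differ only by a leading "THE " (for example 'THE EUROPEAN UNION' and 'EUROPEAN UNION' in the list above). The `normalise` helper is an assumption for illustration and is not a step carried out in the notebook.

```python
from collections import Counter
from itertools import chain

# Made-up per-article ORG dictionaries with the notebook's [[spans], count] values
org_per_article = [
    {"EU": [[(1, 2), (5, 6)], 2], "THE EUROPEAN UNION": [[(0, 3)], 1]},
    {"EU": [[(9, 10)], 1], "EUROPEAN UNION": [[(4, 6)], 1]},
    {},
]

# Same counting as in Step 6: each article contributes a key at most once
doc_freq = Counter(chain.from_iterable(d.keys() for d in org_per_article))
print(doc_freq.most_common())

def normalise(key):
    """Drop a leading 'THE ' so that article variants share one key."""
    return key[4:] if key.startswith("THE ") else key

merged = Counter()
for d in org_per_article:
    # A set avoids counting an article twice when two variants collapse to one key
    merged.update({normalise(k) for k in d})
print(merged)
```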

### Step 7. Storing information on these most common entities per article: example with ORG entities
***

This is one way to store, in the same pandas dataframe, both the full set of entities with their counts and the subset restricted to the most common ones.


In [8]:
dat['ORG_COMMON_100'] = dat['ORG'].apply(lambda x: {y:x[y] for y in x.keys() if y in org_common})
dat

Unnamed: 0,title,abstract,categories,raw content,ORG,ORG_SVOs,ORG_COMMON_100
0,Absences from work - quarterly statistics,Absences from work can be classified into two ...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...,{},{},{}
1,Balance of payments statistics - quarterly data,This article presents quarterly statistics on ...,"['Balance of payments', 'Statistical article']",Current account. The EU non-seasonally adjuste...,"{'EU': [[(1, 2), (1, 2), (5, 6), (8, 9), (1, 2...","{'EU': [[('The EU non - account', 'adjusted'),...","{'EU': [[(1, 2), (1, 2), (5, 6), (8, 9), (1, 2..."
2,Accidents and injuries statistics,This article presents an overview of European ...,"['Health', 'Health status', 'Statistical artic...","Deaths from accidents, injuries and assault. I...","{'EU': [[(9, 10), (9, 10), (14, 15), (14, 15)]...","{'EU': [[('This article', 'presents', 'an over...","{'EU': [[(9, 10), (9, 10), (14, 15), (14, 15)]..."
3,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents. In 2018, there were 3.1 m...","{'ESAW': [[(48, 49), (7, 8)], 2], 'NACE SECTIO...","{'ESAW': [[('accidents at work ESAW exercise',...","{'ESAW': [[(48, 49), (7, 8)], 2], 'NACE SECTIO..."
4,Accidents at work - statistics by economic act...,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time. Non-fatal accidents In...,"{'ESAW': [[(48, 49)], 1], 'EU': [[(23, 24)], 1]}","{'ESAW': [[('accidents at work ESAW exercise',...","{'ESAW': [[(48, 49)], 1], 'EU': [[(23, 24)], 1]}"
...,...,...,...,...,...,...,...
601,Africa-EU - international trade in goods stati...,This article provides a picture of internation...,"['Non-EU countries', 'Trade in goods', 'Statis...",Africa’s main trade in goods partner is the EU...,"{'UN': [[(4, 5)], 1], 'EU': [[(9, 10), (11, 12...",{'UN': [[('the UN subdivision of in five diffe...,"{'UN': [[(4, 5)], 1], 'EU': [[(9, 10), (11, 12..."
602,Adult learning statistics,This article provides an overview of adult lea...,"['Education and training', 'Lifelong learning'...",Participation rate of adults in learning in th...,"{'CZECHIA': [[(32, 33)], 1]}","{'CZECHIA': [[('Czechia', 'reported', 'the sam...","{'CZECHIA': [[(32, 33)], 1]}"
603,Acquisition of citizenship statistics,This article presents recent statistics on the...,"['Asylum and migration', 'Population', 'Acquis...",EU-27 Member States granted citizenship to 706...,"{'CZECHIA': [[(8, 9)], 1], 'EU': [[(15, 16), (...","{'CZECHIA': [[('Other countries with under', '...","{'CZECHIA': [[(8, 9)], 1], 'EU': [[(15, 16), (..."
604,Adult learning statistics - characteristics of...,This article presents an overview of European ...,"['Education and training', 'Participation in e...",Formal and non-formal adult education and trai...,"{'EU': [[(9, 10), (9, 10), (9, 10), (18, 19), ...","{'EU': [[('This article', 'presents', 'an over...","{'EU': [[(9, 10), (9, 10), (9, 10), (18, 19), ..."
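
The filtering done by the dictionary comprehension in Step 7 can be tried on a single made-up record. In this sketch `org_common` is a small stand-in for the list of the 100 most common entities, stored as a set for fast membership tests, and the second entity is hypothetical.

```python
org_common = {"EU", "EUROSTAT"}           # stand-in for the top-100 list

article_orgs = {                          # made-up ORG value for one article
    "EU": [[(1, 2), (5, 6)], 2],
    "SOME RARE BODY": [[(7, 10)], 1],     # hypothetical entity, not in org_common
}

org_common_only = {k: v for k, v in article_orgs.items() if k in org_common}
print(org_common_only)
# {'EU': [[(1, 2), (5, 6)], 2]}
```

The filtered dictionary holds references to the same list objects as the original one, so a later in-place change to an entry is visible through both.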


### Step 8. Exporting the dataframe to Excel
***

This is useful for manual inspection and for designing rules to fine-tune the NER engine. The output can then be imported directly into the database.


In [9]:
dat.to_excel('SE_NERs.xlsx')

### Step 9. Checking the dictionary with all SVOs collected
***
* Print the SVOs collected for each entity and write them all to a text file.

In [10]:
import unidecode

ORG_SVOs = {k:v for k,v in sorted(ORG_SVOs.items(), key=lambda item: item[0])}

with open('SVOs.txt', 'w') as file:
    for key in ORG_SVOs.keys():
        print('<',key,'>',end=' ')
        number = ORG_SVOs[key][1]
        print(number, ' entries\n')
        phrases = ORG_SVOs[key][0]
        for (i,phrase) in enumerate(phrases):
            s = unidecode.unidecode(str(phrase)) ## transliterate to plain ASCII
            print(s)
            print('{0:s}: {1:4d}: {2:s}\n'.format(key,i,s))
            file.write('{0:40s}: {1:4d}: {2:s}\n'.format(unidecode.unidecode(key),i,s))


< +8.3 > 1  entries

('+8.3 pp in (', 'project', 'the range of increase')
+8.3:    0: ('+8.3 pp in (', 'project', 'the range of increase')

< -EQUIVALENTS > 1  entries

('Table 2', 'shows', 'the GHG emissions in grams of CO 2 -equivalents')
-EQUIVALENTS:    0: ('Table 2', 'shows', 'the GHG emissions in grams of CO 2 -equivalents')

< 2015/2174 > 1  entries

('2015/2174', 'proposes', 'an indicative compendium of environmental goods')
2015/2174:    0: ('2015/2174', 'proposes', 'an indicative compendium of environmental goods')

< A01 > 1  entries

('agricultural products ( A01 )', 'generate', 'Spillover value')
A01:    0: ('agricultural products ( A01 )', 'generate', 'Spillover value')

< AAGR > 3  entries

('compound rates of change', 'refer', 'AAGR ,')
AAGR:    0: ('compound rates of change', 'refer', 'AAGR ,')

('both indicators', 'refer', 'AAGR ,')
AAGR:    1: ('both indicators', 'refer', 'AAGR ,')

('the annual average growth rate AAGR of', 'was', '1.4 % per year')
AAGR:    2: ('the

('Statistics on goods', 'evaluate', 'the EU')
EU:  261: ('Statistics on goods', 'evaluate', 'the EU')

('Statistics on goods', 'evaluate', 'the EU')
EU:  262: ('Statistics on goods', 'evaluate', 'the EU')

('Statistics on goods', 'evaluate', 'the EU')
EU:  263: ('Statistics on goods', 'evaluate', 'the EU')

('This article', 'presents', 'some characteristics of Union ( EU ) international trade in sporting goods')
EU:  264: ('This article', 'presents', 'some characteristics of Union ( EU ) international trade in sporting goods')

('an online Eurostat publication', 'presenting', 'a summary of recent EU ) statistics on economic aspects of globalisation')
EU:  265: ('an online Eurostat publication', 'presenting', 'a summary of recent EU ) statistics on economic aspects of globalisation')

('an online Eurostat publication', 'presenting', 'a summary of recent EU ) statistics on economic aspects of globalisation')
EU:  266: ('an online Eurostat publication', 'presenting', 'a summary of recent 

('EUR billion , % )', 'was', 'the largest partner of , between')
EU:  900: ('EUR billion , % )', 'was', 'the largest partner of , between')

('the EU', 'had', 'a trade surplus with')
EU:  901: ('the EU', 'had', 'a trade surplus with')

('EU exports to', 'were')
EU:  902: ('EU exports to', 'were')

('EU imports from', 'were')
EU:  903: ('EU imports from', 'were')

('EU exports of goods', 'had', 'a higher share than primary goods')
EU:  904: ('EU exports of goods', 'had', 'a higher share than primary goods')

('EU imports of goods', 'had', 'a higher share than primary goods')
EU:  905: ('EU imports of goods', 'had', 'a higher share than primary goods')

('Figure 8', 'shows', 'the evolution of EU imports')
EU:  906: ('Figure 8', 'shows', 'the evolution of EU imports')

('the EU', 'had', 'trade surpluses')
EU:  907: ('the EU', 'had', 'trade surpluses')

('other products ( EUR )', 'manufactured')
EU:  908: ('other products ( EUR )', 'manufactured')

('The three largest importers from in', '

EU: 1914: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1915: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1916: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1917: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1918: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1919: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1920: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1921: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1922: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1923: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1924: ('the EU', 'locate', 'EU affiliates')

('the EU', 'locate', 'EU affiliates')
EU: 1925: ('the EU', 'locate', 'EU affiliat

('more than a fifth of EU households', 'included', 'one immigrant adult')
EU: 2989: ('more than a fifth of EU households', 'included', 'one immigrant adult')

('the EU', 'had', 'million immigrants aged ,')
EU: 2990: ('the EU', 'had', 'million immigrants aged ,')

('the EU', 'had', 'million immigrants aged ,')
EU: 2991: ('the EU', 'had', 'million immigrants aged ,')

('the EU', 'basing', 'its economic growth')
EU: 2992: ('the EU', 'basing', 'its economic growth')

('the EU', 'basing', 'policies')
EU: 2993: ('the EU', 'basing', 'policies')

('the EU', 'basing', 'social cohesion')
EU: 2994: ('the EU', 'basing', 'social cohesion')

('EU background', 'make', 'further distinction')
EU: 2995: ('EU background', 'make', 'further distinction')

('EU background', 'means')
EU: 2996: ('EU background', 'means')

('EU background', 'means')
EU: 2997: ('EU background', 'means')

('an EU background', 'have', 'the respective adult')
EU: 2998: ('an EU background', 'have', 'the respective adult')

('an EU 


('EUROSTAT PROJECTIONS Population projections', 'give', 'a picture of')
EUROSTAT:  232: ('EUROSTAT PROJECTIONS Population projections', 'give', 'a picture of')

('Eurostat latest population projections', 'indicate', 'further increases in during the decades')
EUROSTAT:  233: ('Eurostat latest population projections', 'indicate', 'further increases in during the decades')

('the release of a wide range of official statistics for', 'highlight', 'Eurostat , with other services of')
EUROSTAT:  234: ('the release of a wide range of official statistics for', 'highlight', 'Eurostat , with other services of')

('Eurostat online database ( Eurobase )', 'find', 'the data')
EUROSTAT:  235: ('Eurostat online database ( Eurobase )', 'find', 'the data')

('Eurostat online database ( Eurobase )', 'find', 'the data')
EUROSTAT:  236: ('Eurostat online database ( Eurobase )', 'find', 'the data')

('Eurostat', 'provides', 'information on labour productivity')
EUROSTAT:  237: ('Eurostat', 'provides', 'inf

ISCED:   62: ('ISCED', 'levels')

('ISCED', 'levels')
ISCED:   63: ('ISCED', 'levels')

('ISCED', 'levels', '5 - 8)')
ISCED:   64: ('ISCED', 'levels', '5 - 8)')

('ISCED', 'levels', '5 - 8)')
ISCED:   65: ('ISCED', 'levels', '5 - 8)')

('The international standard classification of ( ISCED )', 'covers', 'nine levels of education')
ISCED:   66: ('The international standard classification of ( ISCED )', 'covers', 'nine levels of education')

('The international standard classification of ( ISCED )', 'covers', 'nine levels of education')
ISCED:   67: ('The international standard classification of ( ISCED )', 'covers', 'nine levels of education')

('The international standard classification of ( ISCED )', 'covers', 'nine levels of education')
ISCED:   68: ('The international standard classification of ( ISCED )', 'covers', 'nine levels of education')

('The international standard classification of ( ISCED )', 'covers', 'nine levels of education')
ISCED:   69: ('The international standard c


('all elderly persons', 'pay', 'a sub)population , for , a fuel allowance ( for )')
SUB)POPULATION:    1: ('all elderly persons', 'pay', 'a sub)population , for , a fuel allowance ( for )')

< SWD/2016/249 > 2  entries

('EU climate policy ( SWD/2016/249 )', 'integrate', 'Commission proposal on')
SWD/2016/249:    0: ('EU climate policy ( SWD/2016/249 )', 'integrate', 'Commission proposal on')

('EU climate policy ( SWD/2016/249 )', 'integrate', 'Commission proposal on')
SWD/2016/249:    1: ('EU climate policy ( SWD/2016/249 )', 'integrate', 'Commission proposal on')

< SWOT > 1  entries

('the different priorities', 'following', 'their SWOT - analyses')
SWOT:    0: ('the different priorities', 'following', 'their SWOT - analyses')

< SÃO PAULO > 1  entries

('The largest urban agglomerations outside', 'were', 'Sao Paulo ( Brazil ) , City )')
SÃO PAULO:    0: ('The largest urban agglomerations outside', 'were', 'Sao Paulo ( Brazil ) , City )')

< TARGET > 7  entries

('2030 Target', 'd

* Verify the information written to the file.
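
Beyond reading the file by eye, the check can also be done programmatically. The following is a minimal sketch, assuming every line follows the `'{0:40s}: {1:4d}: {2:s}'` layout used when writing `SVOs.txt` and that entity keys contain no `: <digits>: ` sequence. The names `LINE_RE`, `parse_svo_line`, `rows` and `sample_lines` are hypothetical helpers introduced for illustration only; they are not part of the notebook.

```python
import ast
import re
from collections import Counter

# Assumed layout of one line: key (padded to 40 chars), entry index, tuple repr
LINE_RE = re.compile(r"^(?P<key>.*?)\s*:\s*(?P<index>\d+): (?P<svo>\(.*\))$")

def parse_svo_line(line):
    """Return (key, index, svo_tuple) for one line, or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    # The tuple was written with str(), so literal_eval turns it back into a tuple
    svo = ast.literal_eval(match.group("svo"))
    return match.group("key"), int(match.group("index")), svo

# Sample lines built with the same format string as the export cell
rows = [
    ("AAGR", 0, ('compound rates of change', 'refer', 'AAGR ,')),
    ("AAGR", 1, ('both indicators', 'refer', 'AAGR ,')),
    ("A01", 0, ('agricultural products ( A01 )', 'generate', 'Spillover value')),
]
sample_lines = ['{0:40s}: {1:4d}: {2:s}\n'.format(k, i, str(p)) for k, i, p in rows]

parsed = [parse_svo_line(line) for line in sample_lines]
# Number of entries recovered per entity key
entries_per_key = Counter(key for key, _, _ in parsed)
print(entries_per_key)
```

On the real file, the same function could be applied to each line of `open('SVOs.txt')`, and the per-key counts compared with `ORG_SVOs[key][1]`. Keys in the file are ASCII-transliterated, so a key such as `SÃO PAULO` would need the same `unidecode` step before comparison.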

In [11]:
with open('SVOs.txt', 'r') as f:
    ## Print the file line by line
    for line in f:
        print(line)
 


+8.3                                    :    0: ('+8.3 pp in (', 'project', 'the range of increase')

-EQUIVALENTS                            :    0: ('Table 2', 'shows', 'the GHG emissions in grams of CO 2 -equivalents')

2015/2174                               :    0: ('2015/2174', 'proposes', 'an indicative compendium of environmental goods')

A01                                     :    0: ('agricultural products ( A01 )', 'generate', 'Spillover value')

AAGR                                    :    0: ('compound rates of change', 'refer', 'AAGR ,')

AAGR                                    :    1: ('both indicators', 'refer', 'AAGR ,')

AAGR                                    :    2: ('the annual average growth rate AAGR of', 'was', '1.4 % per year')

AEA                                     :    0: ('Eurostat', 'records', 'the AEA ,')

AEA                                     :    1: ('Eurostat', 'publishes', 'the AEA ,')

AEA                                     :    2: ("analyses as


EU                                      :  415: ('all EU Member States as Kingdom', 'provide', 'Data 2')

EU                                      :  416: ('This article', 'offers', 'a statistical overview of the pulses sector in EU agriculture')

EU                                      :  417: ('EU policies', 'range')

EU                                      :  418: ('a range of European Union ( EU ) policy objectives in fields diverse', 'apply', 'tools')

EU                                      :  419: ('part of EU statistics on income conditions SILC', 'form', 'a 2015 hoc module on')

EU                                      :  420: ('a Union ( EU ) cycle indicator', 'showing', 'the trend in the cost for new residential buildings')

EU                                      :  421: ('The group of countries', 'includes', 'the 27 EU Member States , Kingdom , countries')

EU                                      :  422: ('The coronavirus pandemic', 'impact', 'EU economies')

EU            


EU                                      : 2233: ('Some EU Member States', 'record', 'considerable regional disparities in their unemployment rates')

EU                                      : 2234: ('several EU Member States', 'enacted', 'new employment laws')

EU                                      : 2235: ('four EU Member States', 'registered', 'an unemployment rate of 4.0 % of the extended labour force : Czechia')

EU                                      : 2236: ('EU level', 'find', 'Gender differences')

EU                                      : 2237: ('all EU Member States', 'stay', 'it')

EU                                      : 2238: ('all EU Member States', 'stay', 'it')

EU                                      : 2239: ('all EU Member States', 'stay', 'it')

EU                                      : 2240: ('the EU', 'work', 'Hours')

EU                                      : 2241: ('6 EU Member States : Czechia', 'record', 'The highest shares , % ,')

EU                     


EU                                      : 3315: ('one third of EU total production', 'be', 'the most important of in')

EU                                      : 3316: ('one third of EU total production', 'be', 'the most important of in')

EU                                      : 3317: ('The growth of EU primary production from renewable energy sources', 'exceeded', 'that of all the other energy types')

EU                                      : 3318: ('The EU', 'are', 'net importers of energy')

EU                                      : 3319: ('the EU', 'become')

EU                                      : 3320: ('EU imports of energy', 'exceeded', 'exports by')

EU                                      : 3321: ('The main origins of EU energy imports', 'changed')

EU                                      : 3322: ('42.4 % of EU imports of hard coal', 'were')

EU                                      : 3323: ('Russia', 'was', 'the principal supplier of EU oil imports')

EU                


EUROSTAT                                :  313: ('Eurostat', 'publish', 'These estimates for')

EUROSTAT                                :  314: ('Eurostat', 'collects', 'data')

EUROSTAT                                :  315: ('data', 'collect', 'Eurostat')

EUROSTAT                                :  316: ('three concepts', 'collect', 'Eurostat')

EUROSTAT                                :  317: ('data', 'collect', 'Eurostat')

EUROSTAT                                :  318: ('three concepts', 'collect', 'Eurostat')

EUROSTAT                                :  319: ('data', 'collect', 'Eurostat')

EUROSTAT                                :  320: ('three concepts', 'collect', 'Eurostat')

EUROSTAT                                :  321: ('an EU-27 average production technology', 'use', 'Eurostat carbon footprint of ,')

EUROSTAT                                :  322: ('Eurostat footprint estimate', 'is')

EUROSTAT                                :  323: ('Eurostat', 'carried', 'the decompos

LCI                                     :    1: ('the LCI', 'measures', 'the cost pressure')

LCS                                     :    0: ('the latest vintage of the yearly Labour cost survey ( LCS )', 'base', 'This article')

LCS                                     :    1: ('the comparison of two vintages of the yearly Labour cost survey ( LCS )', 'base', 'This article')

LCS                                     :    2: ('LCS data', '!transmitted')

LEVADAS                                 :    0: ('Madeira', 'provides', 'Levadas ,')

LEVADAS                                 :    1: ('Madeira', 'provides', 'Levadas ,')

LFS                                     :    0: ('a force survey ( LFS )', 'compile', 'Indicators on households with low work intensity')

LFS                                     :    1: ('data from EU force survey ( LFS )', 'base', 'The comparison of (')

LFS                                     :    2: ('the EU force survey ( LFS )', 'collect', 'These data')

LFS    

* Also print all phrases found, sorted alphabetically.
* **TODO: phrases containing numbers still need to be discarded.**
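
One possible way to discard phrases with numbers is sketched below. It assumes that "with numbers" means "any element of the SVO tuple contains a digit"; the helper names `has_digit` and `drop_numeric_phrases` are hypothetical and the sample tuples are copied from the output above.

```python
import re

DIGIT_RE = re.compile(r"\d")

def has_digit(svo):
    """True if any element of the SVO tuple contains at least one digit."""
    return any(DIGIT_RE.search(part) for part in svo)

def drop_numeric_phrases(phrases):
    """Keep only the SVO tuples in which no element contains a digit."""
    return [svo for svo in phrases if not has_digit(svo)]

sample_phrases = [
    ('+8.3 pp in (', 'project', 'the range of increase'),
    ('Eurostat', 'collects', 'data'),
    ('the annual average growth rate AAGR of', 'was', '1.4 % per year'),
    ('the EU', 'had', 'trade surpluses'),
]

kept = drop_numeric_phrases(sample_phrases)
print(kept)
```

This rule is deliberately blunt: it would also remove phrases such as `'Table 2'`, `'ISCED level 2'` or `'EU-27 average'`, so a narrower pattern may be preferable if those should be kept.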

In [12]:
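## Collect the phrase list of every entity, then flatten into a single list
llist = [v[0] for k,v in ORG_SVOs.items()]
flattened_list = [item for sublist in llist for item in sublist]
## Total number of SVO tuples, followed by the tuples in sorted order
print(len(flattened_list))
for el in sorted(flattened_list):
    print(el)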




6864
("' footprints '", 'disseminate', 'Eurostat')
("' footprints '", 'estimate', 'Eurostat')
('( + imports – exports )', 'estimate', 'Eurostat JFSQ data')
('( + imports – exports )', 'estimate', 'Eurostat JFSQ data')
('( 2016 data ) , Czechia', 'reach', 'this difference')
('( EUR +20.1 bn', 'compared')
('( EUR +20.1 bn', 'compared')
('( GHG ) emissions from by', 'compared')
('( GHG ) emissions in', 'see', 'the article on SDG 13 ‘ Climate action ’')
('( HETUS ) data', 'shows')
('( ICT ) sector', 'led')
('( ISCED level 2 )', 'varies')
('( ISCED level 2 )', 'varies')
('( ISCED level 2 )', 'varies')
('( ISCED level 4 )', 'starts')
('( ISCED level 4 )', 'varies')
('( ISCED level 4 )', 'varies')
('( ISCED level 4 )', 'varies')
('( ISCED level 8) in', 'see', 'Table 3')
('( ITSS data', 'excluding')
('( ITSS data', 'excluding')
('( ITSS data', 'excluding')
('( ITSS data', 'including')
('( ITSS data', 'including')
('( ITSS data', 'including')
('( Mtoe).The EU candidate countries Albania', 'have

('Construction', 'has', 'the highest RMC')
('Conversely , HICP rates', 'are')
('Conversely , HICP rates', 'see', '" Limitations "')
('Countries', 'use', 'the price indices ( CPPI ,')
('Countries with ferry connections to other EU countries , as', 'have', 'high shares of international intra - EU transport')
('Countries with ferry connections to other EU countries , as', 'have', 'high shares of international intra - EU transport')
('Croatia', 'record', 'the only other EU Member States')
('Croatia', 'reported', 'EU lowest amounts of waste')
('Croplands', 'occupy', 'million km² of ,')
('Cross - border trade in services with the EU partners', 'was')
('Cyprus', 'is', 'the expensive EU Member State for software')
('Cyprus , Belgium', 'recorded', 'a similar pattern to the EU average')
('Cyprus , Czechia', 'formed', 'the largest group')
('Cyprus , Czechia', 'formed', 'the third')
('Cyprus , Czechia', 'record', 'high increases')
('Czechia', '!reported', 'transport of in')
('Czechia', 'be', 'the 

('Eurostat', 'transmit', 'national statistical institutes')
('Eurostat', 'transmit', 'only countries in')
('Eurostat', 'update', 'the information on effects of in the online publication')
('Eurostat', 'update', 'the information on effects of in the online publication')
('Eurostat', 'update', 'the information on the effects of in the online publication')
('Eurostat', 'used', 'it')
('Eurostat', 'uses', 'the average number of deaths for each month')
('Eurostat', 'uses', 'the ‘ scale ’')
('Eurostat', 'working')
('Eurostat (', 'publish', 'these dates')
('Eurostat ,', 'transmit', 'Member States')
('Eurostat , States', 'filling', 'that gap snapshot')
('Eurostat , office', 'send', 'registrations')
('Eurostat , with', 'launched', 'a plan for the regular collection')
('Eurostat baseline scenario', 'projects')
('Eurostat codes', 'allow', 'easy access to the recent data [ 31 ]')
('Eurostat data on )', 'show', 'a % increase between')
('Eurostat data on )', 'show', 'a % increase between')
('Eurostat

('Other Member States', 'included', 'Germany ( pp ) , Czechia')
('Other countries with under', 'were', 'Czechia , Estonia')
('Other economic activities with high GHG intensities', 'are', 'the primary sectors ; agriculture')
('Other goods (', 'cover', 'SITC Sections 6')
('Over , EU road transport', 'increased')
('Over , EU road transport', 'performed')
('Overig Groningen Netherlands ; decrease ) , regions ) , Thesprotia )', 'record', 'the largest decreases in relative labour productivity')
('PAYG schemes of', 'go')
('PAYG schemes of general government ( schemes', 'pay')
('PPS', 'base', 'market prices (')
('PPS', 'express', 'GDP in')
('PPS', 'express', 'GDP in')
('PPS', 'express', 'Values')
('PPS (', 'express', 'data')
('PPS (', 'express', 'data')
('PPS (', 'express', 'data')
('PPS 10 730', 'increase', 'social transfers (')
('PPS 876', 'contribute', 'Social transfers')
('PPS per', 'present', 'the results')
('PPS per adult', 'present', 'Figures 3 present : adult')
('PPS relative to the EU

('The Manufacture of food products', 'increased')
('The Member State with the median age', 'was', 'Greece')
('The Member States Denmark , Ireland', 'have', 'price levels above the EU-27 average')
('The Netherlands', 'held', 'more than half of EU-28 FDI stocks')
('The Netherlands', 'held', 'more than half of EU-28 FDI stocks')
('The Netherlands', 'held', 'more than half of EU-28 FDI stocks')
('The Netherlands', 'held', 'more than half of EU-28 FDI stocks')
('The Netherlands', 'held', 'more than half of EU-28 FDI stocks')
('The Netherlands', 'is', 'EU largest maritime transport country')
('The Netherlands ( EUR , % )', 'was', 'the largest exporter')
('The Netherlands ( EUR , % )', 'was', 'the largest importer')
('The Netherlands ( EUR , % )', 'was', 'the largest importer')
('The Netherlands ( EUR , % )', 'was', 'the largest importer')
('The Netherlands , Austria', 'are', 'the other EU Member States')
('The Netherlands , Belgium', 'recorded', 'the highest shares of EU imports of medicinal

('an EU Blue Card', 'grant', 'who')
('an EU Member State', 'bear', 'who')
('an EU Member State', 'leave', 'an order')
('an EU Member State', 'leave', 'million emigrants')
('an EU Member State', 'leave', 'million emigrants')
('an EU Member State ( other )', 'bear', 'some 44.1 % of the Union population')
('an EU Member State ( other )', 'bear', 'some 44.1 % of the Union population')
('an EU Member State ( other )', 'bear', 'some 44.1 % of the Union population')
('an EU Member State ( other )', 'bear', 'some 44.1 % of the Union population')
('an EU Member State ( other )', 'bear', 'some 44.1 % of the Union population')
('an EU Member State ( other )', 'bear', 'some 44.1 % of the Union population')
('an EU Member State ( other )', 'bear', 'some 44.1 % of the Union population')
('an EU affiliate', 'employ', 'every five persons')
('an EU affiliate', 'employ', 'every five persons')
('an EU affiliate', 'employ', 'every five persons')
('an EU background', 'have', 'the respective adult')
('an EU

('skilled mobile EU citizens', 'had', 'a lower employment rate')
('skilled mobile EU citizens', 'had', 'a lower employment rate')
('small IMPRO companies', 'cover', '34 % of their total gas market')
('small IMPRO companies', 'cover', '34 % of their total gas market')
('small MNE groups', 'covered')
('smaller surpluses', 'record', 'most ASEM partners , ones ,')
('social contributions ( D122 ) both', 'impute', 'employers ’')
('social contributions ( D122 ) both', 'impute', 'employers ’')
('some EFTA', 'calculate', 'it')
('some EU Member States', 'link', 'the cost of in an attempt')
('some EU Member States', 'link', 'their pensions')
('some EU Member States', 'present', 'alternative years')
('some EU Member States', 'present', 'alternative years')
('some EU Member States', 'recorded', 'cost rates for the population')
('some EU Member States', 'share', 'common languages')
('some EU Member States', 'share', 'common languages')
('some EU Member States', 'use', 'data for')
('some EU countries

('the EU employment rate for persons aged', 'stood')
('the EU employment rate for recent graduates', 'stood')
('the EU employment rate of recent male graduates', 'stood')
('the EU employment rate of recent male graduates', 'stood')
('the EU energy taxes', 'grew')
('the EU external current account', 'recorded', 'a surplus with ,')
('the EU force survey', 'precede', 'the four weeks')
('the EU force survey ( LFS )', 'collect', 'These data')
('the EU force survey ( LFS )', 'collect', 'These data')
('the EU force survey ( LFS )', 'link', 'ad hoc modules')
('the EU force survey ( LFS )', 'link', 'ad hoc modules')
('the EU four', 'locate', '10')
('the EU gender gap for', 'increased')
('the EU immigrant population', 'increased')
('the EU immigrant population', 'increased')
('the EU immigrant population', 'increased')
('the EU imports', 'focus', 'It')
('the EU imports', 'focus', 'It')
('the EU institutions', 'record', 'The remainder ( % )')
('the EU institutions', 'record', 'The remainder ( % )

('the share of over )', 'was', 'Czechia , Cyprus')
('the share of the EU adult population (', 'reported')
('the share of total area UAA', 'define', 'This indicator')
('the share of total organic area in the total area UAA within', 'rose')
('the shares of ASEAN members', 'increased')
('the significant COFOG group in by expenditure', 'spend', 'the equivalent of')
('the significant COFOG group in by expenditure', 'spend', 'the equivalent of')
('the single market ( in the form of intra - EU trade flows', 'take', 'a majority of')
('the situation in', 'supplied', '5.2 % of EU tomato harvest')
('the situation in', 'supplied', '5.2 % of EU tomato harvest')
('the six EU Member States', 'refused', 'entry into')
('the size of the enterprises', 'look', 'STEC data')
('the size of the population', 'divide', 'the GNI data')
('the size of the population', 'divide', 'the GNI data')
('the size of the population', 'divide', 'the GNI data')
('the size of the population', 'divide', 'the GNI data')
('the sl