# Efficiency and Diversity of R&D in Knowledge‑Intensive Services (2005‑2023)

## Introduction

The knowledge‑intensive services sector (the 'G‑N' aggregate in the NACE Rev. 2 classification, list available: __[LINK](https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:High-tech_classification_of_manufacturing_industries)__), which includes wholesale & retail trade, transportation, information & communication, finance, professional activities and administrative services, has become a major engine of innovation in Europe.

A growing body of evidence suggests that gender diversity within research and development (R&D) teams enhances creativity and innovation. As Scientific American reports, diverse groups “are more innovative, more diligent, and better at solving complex problems” because they are able to produce and apply different perspectives (Phillips, 2014). Furthermore, firms with greater gender diversity tend to achieve higher productivity and innovation performance, particularly in knowledge-intensive sectors where collaboration and problem-solving are key (Hoogendoorn et al., 2019). The business press echoes this view: Forbes (2024) reports that companies introducing diversity consistently outperform their rivals.

The goal is to analyse how efficiently countries from the EU and EFTA convert R&D spending in the knowledge‑intensive services sector into human capital and to examine whether increasing female participation correlates with improved efficiency and labour intensity. Relative shares and growth rates over time will be central to the analysis.

**External data sources:**

- Business enterprise R&D expenditure in high-tech sectors by NACE Rev. 2 __[LINK: htec_sti_exp2](https://ec.europa.eu/eurostat/databrowser/view/htec_sti_exp2/default/table)__

- Business enterprise R&D personnel in high-tech sectors by NACE Rev. 2 __[LINK: htec_sti_pers2](https://ec.europa.eu/eurostat/databrowser/view/htec_sti_pers2/default/table)__

- R&D personnel and researchers in business enterprise sector by NACE Rev. 2 activity and sex __[LINK: rd_p_bempoccr2](https://db.nomics.world/Eurostat/rd_p_bempoccr2?dimensions=%7B%22freq%22%3A%5B%22A%22%5D%2C%22nace_r2%22%3A%5B%22G-N%22%5D%7D&tab=table)__ 

**Objectives:** 

- O1 Extract datasets and metadata from the above-mentioned sources.

- O2 Pre-process the datasets to unify variable names.

- O3 Merge datasets and save in the .csv format.

File metadata:

In [1]:
# __author__ = Dominika Drazyk
# __maintainer__ = Dominika Drazyk
# __email__ = dominika.a.drazyk@gmail.com
# __copyright__ = Dominika Drazyk
# __license__ = Apache License 2.0
# __version__ = 1.0.0
# __status__ = Production
# __date__ = 30/09/2025

Required libraries:

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
from pyjstat import pyjstat
from datetime import datetime
import pandas as pd
import numpy as np
import requests
import json
import time
import csv
import re
import os

## O1 Data extraction using available API

### O1.1 Datasets

Eurostat's dissemination API serves the first two datasets (htec_sti_exp2 and htec_sti_pers2) as JSON-stat files. The following code downloads each file and converts it from JSON-stat into a pandas dataframe.

In [3]:
url_exp2 = "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/htec_sti_exp2?lang=en"

response_exp2 = requests.get(url_exp2)
response_exp2.raise_for_status()

data_exp2 = response_exp2.json()
data_exp2 = pyjstat.from_json_stat(data_exp2, naming = 'id')[0]

print('Data from ', url_exp2,' was extracted.')
print(data_exp2.head())

Data from  https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/htec_sti_exp2?lang=en  was extracted.
  freq     unit nace_r2        geo  time       value
0    A  MIO_EUR   TOTAL  EU27_2020  2005  107641.827
1    A  MIO_EUR   TOTAL  EU27_2020  2006  116175.251
2    A  MIO_EUR   TOTAL  EU27_2020  2007  123195.979
3    A  MIO_EUR   TOTAL  EU27_2020  2008  131732.974
4    A  MIO_EUR   TOTAL  EU27_2020  2009  129092.773


In [4]:
url_pers2 = "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/htec_sti_pers2?lang=en"

response_pers2 = requests.get(url_pers2)
response_pers2.raise_for_status()

data_pers2 = response_pers2.json()
data_pers2 = pyjstat.from_json_stat(data_pers2, naming = 'id')[0]

print('Data from ', url_pers2,' was extracted.')
print(data_pers2.head())

Data from  https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/htec_sti_pers2?lang=en  was extracted.
  freq nace_r2 unit prof_pos        geo  time      value
0    A   TOTAL  FTE    TOTAL  EU27_2020  2005   982304.1
1    A   TOTAL  FTE    TOTAL  EU27_2020  2006  1036105.4
2    A   TOTAL  FTE    TOTAL  EU27_2020  2007  1077303.3
3    A   TOTAL  FTE    TOTAL  EU27_2020  2008  1130735.5
4    A   TOTAL  FTE    TOTAL  EU27_2020  2009  1123274.0
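
The conversion performed by `pyjstat.from_json_stat` can be illustrated without any library. The sketch below is not pyjstat's implementation. It assumes the JSON-stat 2.0 layout in which `id` lists the dimensions, each dimension's `category.index` maps a code to its position (the array form of `index` is not handled here), and `value` is either a full list or a sparse object keyed by the flat, row-major cell position. The function name `jsonstat_to_records` and the `toy` dataset are hypothetical.

```python
from itertools import product


def jsonstat_to_records(dataset):
    """Expand a JSON-stat 2.0 dataset dict into a list of flat row dicts."""
    dim_ids = dataset["id"]
    # Category codes of each dimension, ordered by their declared position.
    codes = []
    for dim in dim_ids:
        index = dataset["dimension"][dim]["category"]["index"]
        codes.append(sorted(index, key=index.get))
    values = dataset["value"]
    records = []
    # Row-major order: the last dimension varies fastest.
    for flat_pos, combo in enumerate(product(*codes)):
        if isinstance(values, dict):
            # Sparse form: cells without data are simply absent.
            value = values.get(str(flat_pos))
        else:
            value = values[flat_pos]
        records.append({**dict(zip(dim_ids, combo)), "value": value})
    return records


# Toy dataset with made-up numbers; cell 2 (BE, 2005) is missing.
toy = {
    "id": ["geo", "time"],
    "size": [2, 2],
    "dimension": {
        "geo": {"category": {"index": {"AT": 0, "BE": 1}}},
        "time": {"category": {"index": {"2005": 0, "2006": 1}}},
    },
    "value": {"0": 1.0, "1": 2.0, "3": 4.0},
}
rows = jsonstat_to_records(toy)
for row in rows:
    print(row)
```

Missing cells come out as `None`, which corresponds to the `NaN` entries that appear in the *value* column of the dataframes above.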


The last dataset is published on a different kind of webpage (DBnomics), so it had to be filtered before downloading.
The pre-filtered dataset was downloaded as a *.csv* file (*rd_p_bempoccr2.csv*).
After loading, it was reshaped to match the structure of the other two datasets.

In [5]:
url_fem2 = "https://db.nomics.world/Eurostat/rd_p_bempoccr2?dimensions=%7B%22freq%22%3A%5B%22A%22%5D%2C%22nace_r2%22%3A%5B%22G-N%22%5D%7D&tab=table"
data_fem2 = pd.read_csv('../data/rd_p_bempoccr2.csv')

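# Rename every 'Annual ...' column to the geo code that ends its label;
# labels without a 2-3 letter code are tagged 'UU' and dropped further below.
for col in data_fem2.columns:
    if col.startswith('Annual'):
        match = re.search(r'\.([A-Z]{2,3})\)$', col)
        geo = match.group(1) if match else 'UU'
        data_fem2 = data_fem2.rename(columns={col: geo})
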
data_fem2 = data_fem2.melt(id_vars = ['period'], 
                           value_vars = data_fem2.columns[1:274], 
                           var_name = 'geo', 
                           value_name = 'fem2_FTE_RSE')
data_fem2 = data_fem2[data_fem2['geo'] != "UU"]
data_fem2 = data_fem2.rename(columns={'period': 'time'})
data_fem2['time'] = data_fem2['time'].map(str)
data_fem2['nace_r2'] = 'G-N'

print('Data from ', url_fem2,' was downloaded.')
print(data_fem2.head())

Data from  https://db.nomics.world/Eurostat/rd_p_bempoccr2?dimensions=%7B%22freq%22%3A%5B%22A%22%5D%2C%22nace_r2%22%3A%5B%22G-N%22%5D%7D&tab=table  was downloaded.
   time geo  fem2_FTE_RSE nace_r2
0  2005  AT           NaN     G-N
1  2006  AT         999.2     G-N
2  2007  AT        1110.0     G-N
3  2008  AT           NaN     G-N
4  2009  AT        1829.8     G-N
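
The reshaping step above can be sketched on a toy frame. This is an illustration under assumptions, not the notebook's exact logic: the column labels below are hypothetical stand-ins for the real export's labels (only the trailing `.XX)` pattern is taken from the regular expression used above), and the values are made up. Unlike the loop above, the sketch selects the matching columns first instead of tagging the rest as `UU`.

```python
import re

import pandas as pd

# Hypothetical wide export: one column per country, one row per year.
wide = pd.DataFrame({
    "period": [2005, 2006],
    "Annual (A.FTE.F.RSE.G-N.AT)": [None, 999.2],
    "Annual (A.FTE.F.RSE.G-N.BE)": [10.0, 12.5],
    "Annual (A.FTE.F.RSE.G-N.EU27_2020)": [50.0, 60.0],  # no 2-3 letter code
})


def geo_from_label(label):
    """Return the 2-3 letter geo code that ends the label, or None."""
    match = re.search(r"\.([A-Z]{2,3})\)$", label)
    return match.group(1) if match else None


# Map each usable column label to its geo code; skip labels without one.
rename_map = {}
for col in wide.columns:
    geo = geo_from_label(col) if col.startswith("Annual") else None
    if geo:
        rename_map[col] = geo

long = (
    wide[["period", *rename_map]]
    .rename(columns=rename_map)
    .melt(id_vars="period", var_name="geo", value_name="fem2_FTE_RSE")
    .rename(columns={"period": "time"})
)
long["time"] = long["time"].astype(str)  # match the string years of the API data
print(long)
```

Selecting columns by name avoids the hard-coded positional slice (`columns[1:274]`), which would need updating if the number of columns in the export changed.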


### O1.2 Metadata

Each source webpage includes useful metadata that is updated whenever the data is refreshed and can be used for reporting. The following code scrapes this metadata for all three datasets: the datetime of the last update, the source, the ID and the long title.

In [7]:
url_exp2_meta = 'https://ec.europa.eu/eurostat/databrowser/view/htec_sti_exp2/default/table'
print('htec_sti_exp2 source webpage: ', url_exp2_meta)
chrome_options = Options()
driver_exp2 = webdriver.Chrome(options = chrome_options)
driver_exp2.get(url_exp2_meta)
print('>>> opened')
time.sleep(20) 
r = driver_exp2.page_source
print('>>> extracted')
soup_exp2 = bs(r, "html.parser")
driver_exp2.close()
print('>>> closed')

htec_sti_exp2 source webpage:  https://ec.europa.eu/eurostat/databrowser/view/htec_sti_exp2/default/table
>>> opened
>>> extracted
>>> closed


In [8]:
print('htec_sti_exp2 dataset metadata:\n')

body = soup_exp2.find('body')

marker = body.find('span', string = "last update")
tag = marker.find_next("b", class_ = "infobox-text-data")
exp2_date = tag.get_text(strip = True)
print('\t dataset_last_updated: ', exp2_date)

marker = body.find('span', string = "Source of data:")
tag = marker.find_next("span")
exp2_source = tag.get_text(strip = True)
print('\t dataset_source: ', exp2_source)

exp2_title = soup_exp2.find('h1', class_ = "ecl-page-header__title").get_text()
print('\t dataset_title: ', exp2_title)

marker = body.find('span', string = "Online data code:")
tag = marker.find_next("b", class_ = "infobox-text-data")
exp2_id = tag.get_text(strip = True)
print('\t dataset_id: ', exp2_id)

exp2_meta = [exp2_id, exp2_source, exp2_title, exp2_date]

htec_sti_exp2 dataset metadata:

	 dataset_last_updated:  29/09/2025 23:00
	 dataset_source:  Eurostat
	 dataset_title:  Business enterprise R&D expenditure in high-tech sectors by NACE Rev. 2
	 dataset_id:  htec_sti_exp2


In [9]:
url_pers2_meta = 'https://ec.europa.eu/eurostat/databrowser/view/htec_sti_pers2/default/table'
print('htec_sti_pers2 source webpage: ', url_pers2_meta)
chrome_options = Options()
driver_pers2 = webdriver.Chrome(options = chrome_options)
driver_pers2.get(url_pers2_meta)
print('>>> opened')
time.sleep(20) 
r = driver_pers2.page_source
print('>>> extracted')
soup_pers2 = bs(r, "html.parser")
driver_pers2.close()
print('>>> closed')

htec_sti_pers2 source webpage:  https://ec.europa.eu/eurostat/databrowser/view/htec_sti_pers2/default/table
>>> opened
>>> extracted
>>> closed


In [10]:
print('htec_sti_pers2 dataset metadata:\n')

body = soup_pers2.find('body')

marker = body.find('span', string = "last update")
tag = marker.find_next("b", class_ = "infobox-text-data")
pers2_date = tag.get_text(strip = True)
print('\t dataset_last_updated: ', pers2_date)

marker = body.find('span', string = "Source of data:")
tag = marker.find_next("span")
pers2_source = tag.get_text(strip = True)
print('\t dataset_source: ', pers2_source)

pers2_title = soup_pers2.find('h1', class_ = "ecl-page-header__title").get_text()
print('\t dataset_title: ', pers2_title)

marker = body.find('span', string = "Online data code:")
tag = marker.find_next("b", class_ = "infobox-text-data")
pers2_id = tag.get_text(strip = True)
print('\t dataset_id: ', pers2_id)

pers2_meta = [pers2_id, pers2_source, pers2_title, pers2_date]

htec_sti_pers2 dataset metadata:

	 dataset_last_updated:  29/09/2025 23:00
	 dataset_source:  Eurostat
	 dataset_title:  Business enterprise R&D personnel in high-tech sectors by NACE Rev. 2
	 dataset_id:  htec_sti_pers2


In [11]:
url_fem2_meta = 'https://db.nomics.world/Eurostat/rd_p_bempoccr2?dimensions=%7B%22freq%22%3A%5B%22A%22%5D%2C%22nace_r2%22%3A%5B%22G-N%22%5D%7D&tab=table'
print('rd_p_bempoccr2 source webpage: ', url_fem2_meta)
chrome_options = Options()
driver_fem2 = webdriver.Chrome(options = chrome_options)
driver_fem2.get(url_fem2_meta)
print('>>> opened')
time.sleep(20) 
r = driver_fem2.page_source
print('>>> extracted')
soup_fem2 = bs(r, "html.parser")
driver_fem2.close()
print('>>> closed')

rd_p_bempoccr2 source webpage:  https://db.nomics.world/Eurostat/rd_p_bempoccr2?dimensions=%7B%22freq%22%3A%5B%22A%22%5D%2C%22nace_r2%22%3A%5B%22G-N%22%5D%7D&tab=table
>>> opened
>>> extracted
>>> closed


In [12]:
print('rd_p_bempoccr2 dataset metadata:\n')

body = soup_fem2.find('body')

marker = body.find("p", class_ = "my-8")
text = marker.get_text(strip = True)
match = re.search(r'on(\w+\s+\d+,\s+\d+)\s+\((\d+:\d+\s+[AP]M)\)', text)
if match:
    date_str = match.group(1)
    time_str = match.group(2)
    
    dt = datetime.strptime(f"{date_str} {time_str}", "%B %d, %Y %I:%M %p")
    fem2_date = dt.strftime("%d/%m/%Y %H:%M")
    print('\t dataset_last_updated: ', fem2_date)
else: fem2_date = 'None'

div = body.find('div', class_ = "container")
span = div.find('span', class_ = "hover:text-foreground transition-colors")
a = span.find_next('a', class_ = "text-muted-foreground link")
fem2_source = a.get_text(strip = True)[1:-1]
print('\t dataset_source: ', fem2_source)

div = body.find('div', class_ = "container")
h1 = div.find('h1', class_ = "text-3xl mb-10")
spans = h1.find_all('span')
fem2_title = spans[3].get_text(strip = True)
print('\t dataset_title: ', fem2_title)

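marker = h1.find('span', class_ = "text-muted-foreground")
# Drop the surrounding brackets and any invisible formatting characters
# (e.g. zero-width spaces) that the page inserts into the displayed ID
fem2_id = re.sub(r'[\u200b\u200c\u200d\u2060\ufeff\u00ad]', '', marker.get_text(strip = True)[1:-1])
print('\t dataset_id: ', fem2_id)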

fem2_meta = [fem2_id, fem2_source, fem2_title, fem2_date]

rd_p_bempoccr2 dataset metadata:

	 dataset_last_updated:  02/05/2025 11:00
	 dataset_source:  Eurostat
	 dataset_title:  R&D personnel and researchers in business enterprise sector by NACE Rev. 2 activity and sex
	 dataset_id:  rd_p_bempoccr2
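
The date handling in the cell above can be isolated into a small helper. The sketch below is a hedged illustration: the function name `parse_last_update` and the sample strings are hypothetical, and the regular expression is a slightly relaxed variant of the one above (it tolerates optional whitespace after "on"). Note that `%B` and `%p` depend on the active locale, so the sketch assumes English month names.

```python
import re
from datetime import datetime


def parse_last_update(text):
    """Return 'dd/mm/YYYY HH:MM' parsed from text, or None if no timestamp is found."""
    match = re.search(r"on\s*(\w+\s+\d+,\s+\d+)\s+\((\d+:\d+\s+[AP]M)\)", text)
    if match is None:
        return None
    stamp = datetime.strptime(
        f"{match.group(1)} {match.group(2)}", "%B %d, %Y %I:%M %p"
    )
    return stamp.strftime("%d/%m/%Y %H:%M")


# Hypothetical page text; get_text(strip=True) can glue "on" to the month name.
print(parse_last_update("Last updated onMay 2, 2025 (11:00 AM)"))
print(parse_last_update("Updated on September 29, 2025 (11:00 PM)"))
print(parse_last_update("no timestamp here"))
```

Returning `None` when nothing matches keeps a missing timestamp distinguishable from a real value, whereas the string `'None'` used above would be written to the metadata file as ordinary text.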


In [13]:
meta = pd.DataFrame([exp2_meta,pers2_meta,fem2_meta], columns = ['dataset_id', 'dataset_source', 'dataset_title', 'dataset_last_updated'])
print('Metadata dataset was created:')
meta

Metadata dataset was created:


Unnamed: 0,dataset_id,dataset_source,dataset_title,dataset_last_updated
0,htec_sti_exp2,Eurostat,Business enterprise R&D expenditure in high-te...,29/09/2025 23:00
1,htec_sti_pers2,Eurostat,Business enterprise R&D personnel in high-tech...,29/09/2025 23:00
2,rd_p_bempoccr2,Eurostat,R&D personnel and researchers in business ente...,02/05/2025 11:00


In [14]:
meta.to_csv('../data/scraper_metadata.csv', encoding='utf-8', index = False)
print('Metadata dataset was saved into \'../data/scraper_metadata.csv\'.')

Metadata dataset was saved into '../data/scraper_metadata.csv'.


List of EU + EFTA countries:

In [15]:
url_countries_meta = 'https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Country_codes'
print('European+EFTA countries source webpage: ', url_countries_meta)
chrome_options = Options()
driver_co = webdriver.Chrome(options = chrome_options)
driver_co.get(url_countries_meta)
print('>>> opened')
time.sleep(20) 
r = driver_co.page_source
print('>>> extracted')
soup_co = bs(r, "html.parser")
driver_co.close()
print('>>> closed')

European+EFTA countries source webpage:  https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Country_codes
>>> opened
>>> extracted
>>> closed


In [16]:
content_div = soup_co.find('div', {'id': 'mw-content-text'})
tables = content_div.find_all('table')
    
eu_efta_countries = []
    
for i, table in enumerate(tables):
    if (i == 0)  or (i == 1):  
        rows = table.find_all('tr')
        
        for row in rows:
            cells = row.find_all('td')
            
            for j in range(0, len(cells), 2):
                if j + 1 < len(cells):
                    if cells[j].get_text(strip=True) == '':
                        j = j + 1
                    country_name = cells[j].get_text(strip=True)
                    country_code = cells[j + 1].get_text(strip=True)
                    country_code = country_code.replace('(', '').replace(')', '').strip()
                    print(f"Country: {country_name}, geo: {country_code}")
    
                    country_data = {'Country': country_name, 'geo': country_code}
                    eu_efta_countries.append(country_data)
eu_efta_countries_df = pd.DataFrame.from_dict(eu_efta_countries)

Country: Belgium, geo: BE
Country: Greece, geo: EL
Country: Lithuania, geo: LT
Country: Portugal, geo: PT
Country: Bulgaria, geo: BG
Country: Spain, geo: ES
Country: Luxembourg, geo: LU
Country: Romania, geo: RO
Country: Czechia, geo: CZ
Country: France, geo: FR
Country: Hungary, geo: HU
Country: Slovenia, geo: SI
Country: Denmark, geo: DK
Country: Croatia, geo: HR
Country: Malta, geo: MT
Country: Slovakia, geo: SK
Country: Germany, geo: DE
Country: Italy, geo: IT
Country: Netherlands, geo: NL
Country: Finland, geo: FI
Country: Estonia, geo: EE
Country: Cyprus, geo: CY
Country: Austria, geo: AT
Country: Sweden, geo: SE
Country: Ireland, geo: IE
Country: Latvia, geo: LV
Country: Poland, geo: PL
Country: Iceland, geo: IS
Country: Norway, geo: NO
Country: Liechtenstein, geo: LI
Country: Switzerland, geo: CH
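
The pairing of name and code cells can also be written without index arithmetic. The sketch below is an alternative under a stated assumption, not a description of the actual page: it assumes that, once empty cells are removed, each table row alternates country name and bracketed code. The helper `pair_cells` and the sample row are hypothetical.

```python
def pair_cells(cell_texts):
    """Pair (name, code) cells from one table row, ignoring empty cells."""
    filled = [text.strip() for text in cell_texts if text.strip()]
    pairs = []
    # Even positions hold names, odd positions hold the matching codes.
    for name, code in zip(filled[0::2], filled[1::2]):
        pairs.append({"Country": name, "geo": code.strip("()").strip()})
    return pairs


# Hypothetical row with an empty spacer cell between two name/code pairs.
row = ["Belgium", "(BE)", "", "Greece", "(EL)"]
print(pair_cells(row))
```

Because `zip` stops at the shorter sequence, a trailing name without a code is skipped instead of raising an `IndexError`.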


In [17]:
eu_efta_countries_df.to_csv('../data/eu_efta_countries.csv', encoding = 'utf-8', index = False)
print('EU + EFTA countries list was saved into \'../data/eu_efta_countries.csv\'.')

EU + EFTA countries list was saved into '../data/eu_efta_countries.csv'.


## O2 Cleaning and pre-processing datasets

In [18]:
print(data_exp2.info())
print(data_pers2.info())
print(data_fem2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11970 entries, 0 to 11969
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   freq     11970 non-null  object 
 1   unit     11970 non-null  object 
 2   nace_r2  11970 non-null  object 
 3   geo      11970 non-null  object 
 4   time     11970 non-null  object 
 5   value    7262 non-null   float64
dtypes: float64(1), object(5)
memory usage: 561.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23940 entries, 0 to 23939
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   freq      23940 non-null  object 
 1   nace_r2   23940 non-null  object 
 2   unit      23940 non-null  object 
 3   prof_pos  23940 non-null  object 
 4   geo       23940 non-null  object 
 5   time      23940 non-null  object 
 6   value     12908 non-null  float64
dtypes: float64(1), object(6)
memory usage: 1.3+ MB
None
<class 'panda

In [19]:
print(data_exp2.nunique())
print(data_pers2.nunique())
print(data_fem2.nunique())

freq          1
unit          2
nace_r2       7
geo          45
time         19
value      5739
dtype: int64
freq            1
nace_r2         7
unit            2
prof_pos        2
geo            45
time           19
value       10163
dtype: int64
time              19
geo               38
fem2_FTE_RSE    2685
nace_r2            1
dtype: int64


The exp2 and pers2 datasets include a categorical variable *freq* with only one level (*A*). Because it carries no information, it is removed during data cleaning.

In [20]:
data_exp2.drop(columns = ['freq'], inplace = True)
data_pers2.drop(columns = ['freq'], inplace = True) 
print('Unused columns were removed.')

Unused columns were removed.
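
Single-level columns such as *freq* can also be detected programmatically instead of being read off the `nunique()` output. This is a minimal sketch on a toy frame with made-up values; the names `toy`, `constant_cols` and `trimmed` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "freq": ["A", "A", "A"],
    "unit": ["MIO_EUR", "PC_TOT", "MIO_EUR"],
    "value": [1.0, 2.0, None],
})

# A column is constant when it holds exactly one distinct value (NaN counted).
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
trimmed = toy.drop(columns=constant_cols)

print(constant_cols)
print(trimmed.columns.tolist())
```

Counting `NaN` as a value (`dropna=False`) prevents a column that mixes one level with missing entries from being dropped by mistake.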


The exp2 and pers2 datasets share three categorical variables (*nace_r2*, *geo*, *time*), which serve as the merge keys. The remaining categorical variables (*unit* in exp2; *unit* and *prof_pos* in pers2) identify which metric the *value* column holds. Each level combination of *unit* and *prof_pos* is therefore pivoted into its own column (*exp2_MIO_EUR*, *exp2_PC_TOT*, *pers2_FTE_RSE*, *pers2_FTE_TOTAL*, *pers2_HC_RSE*, *pers2_HC_TOTAL*), and the resulting multi-index columns are flattened so that the datasets can be merged on the keys alone.

In [21]:
print('Dataset exp2, \'unit\' levels  : ', data_exp2['unit'].unique().tolist())
print('Dataset pers2, \'unit\' levels : ', data_pers2['unit'].unique().tolist(), '\n')
print('Dataset pers2, \'prof_pos\' levels : ', data_pers2['prof_pos'].unique().tolist(), '\n')
print('Dataset exp2, \'nace_r2\' levels  : ', data_exp2['nace_r2'].unique().tolist())
print('Dataset pers2, \'nace_r2\' levels : ', data_pers2['nace_r2'].unique().tolist())
print('Dataset fem2, \'nace_r2\' levels : ', data_fem2['nace_r2'].unique().tolist(), '\n')
print('Dataset exp2, \'geo\' levels  : ', data_exp2['geo'].unique().tolist())
print('Dataset pers2, \'geo\' levels : ', data_pers2['geo'].unique().tolist())
print('Dataset fem2, \'geo\' levels : ', data_fem2['geo'].unique().tolist(), '\n')
print('Dataset exp2, \'time\' levels  : ', data_exp2['time'].unique().tolist())
print('Dataset pers2, \'time\' levels : ', data_pers2['time'].unique().tolist())
print('Dataset fem2, \'time\' levels : ', data_fem2['time'].unique().tolist())

Dataset exp2, 'unit' levels  :  ['MIO_EUR', 'PC_TOT']
Dataset pers2, 'unit' levels :  ['FTE', 'HC'] 

Dataset pers2, 'prof_pos' levels :  ['TOTAL', 'RSE'] 

Dataset exp2, 'nace_r2' levels  :  ['TOTAL', 'C', 'C_HTC_M', 'C_HTC', 'C_LTC_M', 'C_LTC', 'G-N']
Dataset pers2, 'nace_r2' levels :  ['TOTAL', 'C', 'C_HTC_M', 'C_HTC', 'C_LTC_M', 'C_LTC', 'G-N']
Dataset fem2, 'nace_r2' levels :  ['G-N'] 

Dataset exp2, 'geo' levels  :  ['EU27_2020', 'EA20', 'EA19', 'BE', 'BG', 'CZ', 'DK', 'DE', 'EE', 'IE', 'EL', 'ES', 'FR', 'HR', 'IT', 'CY', 'LV', 'LT', 'LU', 'HU', 'MT', 'NL', 'AT', 'PL', 'PT', 'RO', 'SI', 'SK', 'FI', 'SE', 'IS', 'NO', 'CH', 'UK', 'BA', 'ME', 'MK', 'AL', 'RS', 'TR', 'RU', 'US', 'CN_X_HK', 'JP', 'KR']
Dataset pers2, 'geo' levels :  ['EU27_2020', 'EA20', 'EA19', 'BE', 'BG', 'CZ', 'DK', 'DE', 'EE', 'IE', 'EL', 'ES', 'FR', 'HR', 'IT', 'CY', 'LV', 'LT', 'LU', 'HU', 'MT', 'NL', 'AT', 'PL', 'PT', 'RO', 'SI', 'SK', 'FI', 'SE', 'IS', 'NO', 'CH', 'UK', 'BA', 'ME', 'MK', 'AL', 'RS', 'TR', 'RU'

In [22]:
data_exp2_wide = data_exp2.pivot(index = ['nace_r2', 'geo', 'time'], columns = 'unit', values = 'value').reset_index()
data_exp2_wide = data_exp2_wide.rename(columns=lambda x: f"exp2_{x}" if x not in ['nace_r2', 'geo', 'time'] else x)
print('Dataset exp2 was transformed into a wide format:')
data_exp2_wide.head()

Dataset exp2 was transformed into a wide format:


unit,nace_r2,geo,time,exp2_MIO_EUR,exp2_PC_TOT
0,C,AL,2005,,
1,C,AL,2006,,
2,C,AL,2007,,
3,C,AL,2008,,
4,C,AL,2009,,


In [23]:
data_pers2_wide = data_pers2.pivot(index = ['nace_r2', 'geo', 'time', 'prof_pos'], columns = ['unit'], values = 'value').reset_index()
data_pers2_wide = data_pers2_wide.pivot(index = ['nace_r2', 'geo', 'time'], columns = ['prof_pos'], values = ['FTE','HC']).reset_index()
data_pers2_wide.columns = ["_".join([str(c) for c in col if c != ""])
                            for col in data_pers2_wide.columns.to_flat_index()]
data_pers2_wide = data_pers2_wide.rename(columns=lambda x: f"pers2_{x}" if x not in ['nace_r2', 'geo', 'time'] else x)
print('Dataset pers2 was transformed into a wide format:')
data_pers2_wide.head()

Dataset pers2 was transformed into a wide format:


Unnamed: 0,nace_r2,geo,time,pers2_FTE_RSE,pers2_FTE_TOTAL,pers2_HC_RSE,pers2_HC_TOTAL
0,C,AL,2005,,,,
1,C,AL,2006,,,,
2,C,AL,2007,,,,
3,C,AL,2008,,,,
4,C,AL,2009,,,,
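
The two-step pivot of pers2 can, as an alternative, be done in a single call by passing both categorical variables as columns and flattening the resulting MultiIndex. The sketch below uses a toy frame with made-up values and assumes a pandas version in which `DataFrame.pivot` accepts lists for `index` and `columns` (1.1 or later).

```python
import pandas as pd

keys = ["nace_r2", "geo", "time"]

# Toy long-format data: one row per (unit, prof_pos) metric.
long = pd.DataFrame({
    "nace_r2": ["G-N"] * 4,
    "geo": ["AT"] * 4,
    "time": ["2005"] * 4,
    "unit": ["FTE", "FTE", "HC", "HC"],
    "prof_pos": ["TOTAL", "RSE", "TOTAL", "RSE"],
    "value": [100.0, 60.0, 130.0, 80.0],
})

# One pivot over both categorical variables gives (unit, prof_pos) column pairs.
wide = long.pivot(index=keys, columns=["unit", "prof_pos"], values="value")

# Flatten each (unit, prof_pos) pair into a single prefixed column name.
wide.columns = [f"pers2_{unit}_{pos}" for unit, pos in wide.columns]
wide = wide.reset_index()

print(wide)
```

The column names produced this way match those of `data_pers2_wide` above, so the merge step would not need to change.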


## O3 Merging datasets and saving files

In [24]:
data = pd.merge(data_pers2_wide, data_exp2_wide, on = ['nace_r2', 'geo', 'time'], how = 'left') 
data = pd.merge(data, data_fem2, on = ['nace_r2', 'geo', 'time'], how = 'left') 
print('Datasets were merged:')
data.sample(10)

Datasets were merged:


Unnamed: 0,nace_r2,geo,time,pers2_FTE_RSE,pers2_FTE_TOTAL,pers2_HC_RSE,pers2_HC_TOTAL,exp2_MIO_EUR,exp2_PC_TOT,fem2_FTE_RSE
4630,G-N,BE,2015,15114.0,22689.0,20244.0,32337.0,3031.829,42.84,
606,C,MT,2022,182.0,311.0,192.0,337.0,12.34,18.91,
2396,C_HTC_M,RO,2007,,,,,,,
8061,G-N,PL,2019,35891.4,53129.1,48783.0,73510.0,,,19946.0
180,C,DE,2014,166970.0,310533.0,,,49482.4,86.82,
8980,G-N,TR,2014,23388.3,32143.5,26895.0,37219.0,1404.302,46.59,37219.0
6615,G-N,IS,2008,,,,,,,
5966,G-N,FI,2008,5489.0,8715.7,7598.0,12962.0,950.48,18.63,2854.0
1139,C_HTC,EL,2023,,,,,,,
3909,C_LTC_M,KR,2019,21125.8,25332.8,,,2414.276,4.41,
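
A left merge silently keeps rows that find no partner and duplicates rows when keys repeat. The sketch below shows one way to check both on toy data with made-up values: `validate` makes pandas raise an error if the keys are not unique on either side, and `indicator` adds a `_merge` column recording where each row came from. The frames `pers` and `fem` are hypothetical stand-ins for the wide datasets.

```python
import pandas as pd

keys = ["nace_r2", "geo", "time"]

pers = pd.DataFrame({
    "nace_r2": ["G-N", "G-N"],
    "geo": ["AT", "BE"],
    "time": ["2005", "2005"],
    "pers2_FTE_RSE": [100.0, 200.0],
})
fem = pd.DataFrame({
    "nace_r2": ["G-N"],
    "geo": ["AT"],
    "time": ["2005"],
    "fem2_FTE_RSE": [30.0],
})

merged = pd.merge(
    pers, fem, on=keys, how="left",
    validate="one_to_one",  # raise if a key combination repeats on either side
    indicator=True,         # add '_merge' with 'both' / 'left_only'
)

# Key combinations from the left frame that found no match on the right.
unmatched = merged.loc[merged["_merge"] == "left_only", keys]
print(merged)
print(unmatched)
```

The key columns must share a dtype on both sides, which is why *time* in the fem2 dataset was converted to strings before merging.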


In [25]:
print('Dataset \'nace_r2\' levels  : ', data['nace_r2'].unique().tolist())
print('Dataset \'geo\' levels  : ', data['geo'].unique().tolist())
print('Dataset \'time\' levels  : ', data['time'].unique().tolist())

Dataset 'nace_r2' levels  :  ['C', 'C_HTC', 'C_HTC_M', 'C_LTC', 'C_LTC_M', 'G-N', 'TOTAL']
Dataset 'geo' levels  :  ['AL', 'AT', 'BA', 'BE', 'BG', 'CH', 'CN_X_HK', 'CY', 'CZ', 'DE', 'DK', 'EA19', 'EA20', 'EE', 'EL', 'ES', 'EU27_2020', 'FI', 'FR', 'HR', 'HU', 'IE', 'IS', 'IT', 'JP', 'KR', 'LT', 'LU', 'LV', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'RS', 'RU', 'SE', 'SI', 'SK', 'TR', 'UK', 'US']
Dataset 'time' levels  :  ['2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023']


In [26]:
data.to_csv('../data/scraper_data.csv', encoding='utf-8', index = False)
print('Merged dataset was saved into \'../data/scraper_data.csv\'.')

Merged dataset was saved into '../data/scraper_data.csv'.
