# Data Preparation

`Overview`
This notebook handles the initial data processing pipeline:
- Loading raw data from source files
- Performing exploratory data analysis (EDA)
- Cleaning and handling missing values
- Feature preprocessing and engineering
- Exporting processed datasets for modeling

`Inputs`
- Raw data files from `../data/raw/` 

`Outputs`
- Processed datasets in `../data/processed/`
- EDA visualizations in `../reports/figures/`

`Dependencies`
- pandas
- numpy
- matplotlib
- seaborn

*Note: This is notebook 1 of the analysis pipeline*

In [8]:
# Imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path

# Import custom modules
# from src.save_load import save_parquet

In [9]:
# Show which Python interpreter is running.
# `!which python` fails on Windows; sys.executable works on every platform.
import sys
print(sys.executable)


Here we load the project-specific datasets as CSV files. In the follow-up cell, we load the auxiliary datasets containing extra information on the CORDIS-HORIZON projects. These include:
- Scientific vocabulary 
- legal basis documents
- organization
- project
- topics
- webItem 
- webLink

In [17]:
# Import the dataset as pandas DataFrame
run_dir = os.getcwd()
parent_dir = os.path.dirname(run_dir)

raw_dir = f'{parent_dir}/data/raw'
interim_dir = f'{parent_dir}/data/interim'
processed_dir = f'{parent_dir}/data/processed'

# define file paths to project-specific files
data_report_path = f'{raw_dir}/reportSummaries.csv'
data_filereport_path = f'{raw_dir}/file_report.csv'
data_publications_path = f'{raw_dir}/projectPublications.csv'
data_deliverables_path = f'{raw_dir}/projectDeliverables.csv'
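
The notebook already imports `Path`, so the same directory layout could also be expressed with `pathlib` instead of string formatting. The following is a minimal sketch, assuming the notebook runs from a subfolder of the project root (as the `os.getcwd()` logic above implies); `project_paths` is a hypothetical helper name, not part of the project code.

```python
from pathlib import Path


def project_paths(run_dir: Path) -> dict:
    """Return the data directories relative to the parent of `run_dir`.

    `project_paths` is a hypothetical helper used only for illustration.
    """
    data_dir = run_dir.parent / "data"
    return {
        "raw": data_dir / "raw",
        "interim": data_dir / "interim",
        "processed": data_dir / "processed",
    }


paths = project_paths(Path.cwd())
# Path objects are joined with "/" and behave the same on Windows and POSIX
data_report_path = paths["raw"] / "reportSummaries.csv"
print(data_report_path.name)
```

`pd.read_csv` accepts `Path` objects directly, so the rest of the notebook would not need to change.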



## Inspect Reports

In [18]:
# get DataFrame keys
data_report = pd.read_csv(data_report_path, delimiter=';')
data_report.keys()

Index(['id', 'title', 'projectID', 'projectAcronym', 'attachment',
       'contentUpdateDate', 'rcn'],
      dtype='object')

In [19]:
data_report.head()

Unnamed: 0,id,title,projectID,projectAcronym,attachment,contentUpdateDate,rcn
0,101066069_PSHORIZON,Periodic Reporting for period 1 - ERASMUS (Ear...,101066069,ERASMUS,,2025-03-17 10:38:00,1267558
1,101073231_PSHORIZON,Periodic Reporting for period 1 - OncoProTools...,101073231,OncoProTools,/docs/results/horizon/101073/101073231_PS/2024...,2025-03-18 12:31:34,1270628
2,101068156_PSHORIZON,Periodic Reporting for period 1 - BLISS (Beta-...,101068156,BLISS,/docs/results/horizon/101068/101068156_PS/pict...,2025-03-05 11:47:45,1260626
3,101072180_PSHORIZON,Periodic Reporting for period 1 - Green2Ice (W...,101072180,Green2Ice,/docs/results/horizon/101072/101072180_PS/2023...,2025-02-14 10:36:27,1252991
4,101063407_PSHORIZON,Periodic Reporting for period 1 - GHost (His E...,101063407,GHost,/docs/results/horizon/101063/101063407_PS/pa-1...,2025-02-26 17:32:14,1257475


### Missing values
1. Check each column for missing values
2. Define a decision tree for handling missing values
3. Change values algorithmically
4. Store the updated DataFrame in the interim directory


In [20]:
# look for missing values
report_missing = data_report.isnull()

# check which columns are missing data
for key in data_report:
    missing = report_missing[report_missing[key] == True]
    print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

For key id:
     0 elements are missing.
For key title:
     0 elements are missing.
For key projectID:
     0 elements are missing.
For key projectAcronym:
     0 elements are missing.
For key attachment:
     1861 elements are missing.
For key contentUpdateDate:
     0 elements are missing.
For key rcn:
     0 elements are missing.
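
The per-column loop above can be condensed with `isnull().sum()`, which counts the missing entries of every column in one call. This is a small sketch on a made-up toy frame standing in for `data_report`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data_report` (values are made up for illustration)
toy_report = pd.DataFrame({
    "id": ["a", "b", "c"],
    "attachment": ["/docs/x.png", np.nan, np.nan],
})

# isnull() gives a boolean frame; summing each column counts the True entries
missing_counts = toy_report.isnull().sum()

# Show only the columns that actually have gaps
print(missing_counts[missing_counts > 0])
```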


We see that only attachments are missing. These attachments refer to additional documents, mostly PNG images.
We can handle this in three ways:
- Look up the missing attachments manually
- Ignore this column during analysis
- If an attachment is present, add it to the dashboard when the user inspects a particular project; if not, leave it blank

I recommend the last approach.
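
The recommended rule (show the attachment if present, otherwise leave it blank) can be sketched as a small helper. `attachment_or_blank` is a hypothetical name, and the example path is made up; the dashboard itself is not part of this notebook.

```python
import math


def attachment_or_blank(value) -> str:
    """Return the attachment path, or an empty string when it is missing.

    Hypothetical helper: treats None and float NaN (how pandas marks a
    missing CSV field) as "no attachment".
    """
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


print(repr(attachment_or_blank(float("nan"))))
print(repr(attachment_or_blank("/docs/results/example.png")))  # made-up path
```

On a whole column, `data_report['attachment'].fillna('')` applies the same rule in one step.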

In [21]:
# handle missing values

# define missing values rule here

# change the missing values in dataframe
project_reports_interim = data_report
# save updated dataframe to data/interim
project_reports_interim.to_csv(f'{interim_dir}/reportSummaries_interim.csv', sep=';')

### Inspect other report file
This CSV file does not contain useful information

In [22]:
data_filereport = pd.read_csv(data_filereport_path, delimiter=';')
data_filereport

Unnamed: 0,"filename,status, issue_cause downloadURL, issue_cause accessURL"
0,HORIZON Report summaries (individual XML files...
1,"HORIZON Projects,delivered,,"
2,"HORIZON Projects Deliverables,delivered,,"
3,"HORIZON Projects (individual XML files),delive..."
4,HORIZON Projects Deliverables (individual XML ...
5,"HORIZON Report summaries,delivered,,"
6,"HORIZON Publications,delivered,,"
7,"HORIZON Projects Deliverables,delivered,,"
8,"HORIZON Publications,delivered,,"
9,"HORIZON Projects Deliverables,delivered,,"


## Inspect deliverables

In [23]:
# Inspect Dataframe
data_deliverables = pd.read_csv(data_deliverables_path, delimiter=';')
data_deliverables.keys()

ParserError: Error tokenizing data. C error: Expected 10 fields in line 1416, saw 11


In [None]:
data_deliverables

I have changed the following lines to enable opening the file with pandas:
- 1412: something wrong in the deliverable description
- 1412: wrong use of delimiter
- 6677: wrong use of quotation marks
- 6678: wrong use of quotation marks
- 8812: use of delimiter inside string
- 8826: use of delimiter inside string
- 9360: use of delimiter inside string
- 9524: use of delimiter inside string
- 10128: use of delimiter inside string
- 13108: use of delimiter inside string
- 19931: use of delimiter inside string
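
Instead of fixing one `ParserError` at a time, the problematic lines could be located in a single pass with the standard `csv` module. This is a sketch, not the procedure used above: `find_malformed_lines` is a hypothetical helper, and the sample text is made up.

```python
import csv
import io


def find_malformed_lines(handle, delimiter=";"):
    """Yield (line_number, n_fields) for rows whose field count differs from the header.

    Hypothetical helper. Line numbers come from the csv reader, so a quoted
    field that spans several physical lines is reported at its last line.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    for row in reader:
        # Empty rows are skipped; everything else must match the header width
        if row and len(row) != len(header):
            yield reader.line_num, len(row)


# Made-up sample: line 3 has an unquoted delimiter inside the description
sample = io.StringIO('id;title;description\n1;A;fine\n2;B;bad; extra\n3;C;"ok; quoted"\n')
print(list(find_malformed_lines(sample)))
```

For the real file, the handle would come from something like `open(data_deliverables_path, newline='', encoding='utf-8')`; the encoding is an assumption and may need adjusting.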

### Missing values
Here we handle the missing values in the dataset

In [None]:
# look for missing values
deliverables_missing = data_deliverables.isnull()

# check which columns are missing data
for key in deliverables_missing.keys():
    missing = deliverables_missing[deliverables_missing[key] == True]
    print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

We are missing elements in the following columns:
- deliverableType
    - option 1: change to `'Other'`
    - option 2: look up individual titles and add manually
- description
    - option 1: add empty string
    - option 2: inspect manually to gain more insight into what they represent
        - Update: the titles of these rows are closely related to the projects, so I suggest we copy the title values into the description column.
- url
    - 1 missing URL. Add the URL of the project's main page (SELFY, id = 101069748_16_DELIVHORIZON) instead of a link to the deliverable?
- rcn
    - 1 rcn is missing.
    - Looked this number up in the publication list based on projectAcronym = `'GeneBEcon'`. There the rcn number is given as `1077637.0`.


In [None]:
# change unknown deliverable types to other
data_deliverables['deliverableType'] = data_deliverables['deliverableType'].fillna('Other') 

# change empty descriptions to title of that particular row
data_deliverables['description'] = data_deliverables['description'].fillna(data_deliverables['title'])

# change missing url to homepage of the particular project
data_deliverables['url'] = data_deliverables['url'].fillna('https://selfy-project.eu/')

# add missing rcn number
data_deliverables['rcn'] = data_deliverables['rcn'].fillna(1077637.0)
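
The `fillna` calls for `url` and `rcn` above apply to every missing value in the column, which is only safe while exactly one value is missing. A more targeted variant restricts the fill to the row that was inspected. This is a sketch on a made-up toy frame; only the id, acronyms, URL and rcn come from the notes above, and the presence of a `projectAcronym` column in the deliverables file is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data_deliverables` (rows are made up, apart from
# the id, acronyms, URL and rcn quoted in the notes above)
toy = pd.DataFrame({
    "id": ["101069748_16_DELIVHORIZON", "x_1", "x_2"],
    "projectAcronym": ["SELFY", "GeneBEcon", "Other"],
    "url": [np.nan, "https://example.org/d1", np.nan],
    "rcn": [1.0, np.nan, 3.0],
})

# Fill only the row that was inspected, and only if the value is still missing
selfy_row = (toy["id"] == "101069748_16_DELIVHORIZON") & toy["url"].isna()
toy.loc[selfy_row, "url"] = "https://selfy-project.eu/"

gene_row = (toy["projectAcronym"] == "GeneBEcon") & toy["rcn"].isna()
toy.loc[gene_row, "rcn"] = 1077637.0

# Any other gap stays visible instead of silently receiving the wrong value
print(toy["url"].isna().sum())
```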

In [None]:
# check whether filling executed correctly
data_deliverables[deliverables_missing.deliverableType == True]

In [None]:
# save updated dataframe to data/interim
data_deliverables.to_csv(f'{interim_dir}/projectdeliverables_interim.csv', sep=';')

## Inspect Publications

In [None]:
# Inspect Dataframe
data_publications = pd.read_csv(data_publications_path, delimiter=';')
data_publications.keys()

ParserError: Error tokenizing data. C error: Expected 16 fields in line 7588, saw 17


Some entries in the publications CSV have been changed by hand, in order to allow loading them:
- 7588
- 7748

Both are from the same conference. Problem: the order of the columns was switched and one additional empty column was added, causing the pandas loader to crash.

Next problems:
- 12036: wrong notation of authors' names + use of the `;` delimiter inside a string.
- 12043: authors string starts with four `"` + uses `;` to separate names.
- 12099: same problem as stated above
- 12110: same problem
- 12115: same problem
- 12270: same problem
- 18735: same problem
- 24019: same problem




In [None]:
data_publications

### Missing values
Here we inspect the missing data in this file and outline how we are going to treat these missing data points.

In [None]:
# look for missing values
publications_missing = data_publications.isnull()

# check which columns are missing data
for key in publications_missing.keys():
    missing = publications_missing[publications_missing[key] == True]
    if len(missing.id) > 0:
        print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

There is quite a lot of missing data in this file. Let us go through each column individually.
- authors:
    - Unfortunate: it would have been very useful to decompose author strings into single authors and make the connections
    - How to treat this: look into the article title string to check whether it contains more author information
- journalTitle:
    - Check the publication title. Sometimes the whole article reference has simply been copy-pasted there
- journalNumber:
    - Not the most relevant parameter in my opinion. Set all NaN to zero
- publishedYear:
    - Look this up manually
- publishedPages:
    - Not the most relevant parameter in my opinion. Set all NaN to zero
- issn:
    - Not the most relevant parameter in my opinion. Fill all NaN with a placeholder value
- isbn:
    - Not the most relevant parameter in my opinion. Fill all NaN with a placeholder value
- doi:
    - Not worth recovering; just pass `about:blank` as the URL
- rcn:
    - Adjust this one manually
        - Update: this entry was missing an entry for authors, so all following fields were shifted one column to the left. Fixed manually.



In [None]:
data_publications.keys()

In [None]:
# check missing rcn. 
data_publications[publications_missing.rcn == True]

In [None]:
# fill some gaps in the data structure
data_publications['isbn'] = data_publications['isbn'].fillna('0000-0000')
data_publications['issn'] = data_publications['issn'].fillna('0000-0000')
data_publications['publishedPages'] = data_publications['publishedPages'].fillna(0)
data_publications['doi'] = data_publications['doi'].fillna('about:blank')
data_publications['journalTitle'] = data_publications['journalTitle'].fillna('Miscellaneous')
data_publications['journalNumber'] = data_publications['journalNumber'].fillna(0)
data_publications['authors'] = data_publications['authors'].fillna('sine nome')
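
The column-by-column fills above can also be written as a single `fillna` call with a mapping from column name to placeholder. The sketch below uses a made-up toy frame and only a subset of the placeholders from the cell above; `FILL_VALUES` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Subset of the placeholders used above, collected in one place
FILL_VALUES = {"isbn": "0000-0000", "publishedPages": 0, "doi": "about:blank"}

# Toy frame standing in for `data_publications` (values are made up)
toy_pubs = pd.DataFrame({
    "isbn": [np.nan, "978-3-16-148410-0"],
    "publishedPages": [np.nan, 12.0],
    "doi": ["10.1000/example", np.nan],
})

# DataFrame.fillna accepts a column -> value mapping, so one call covers all columns
filled = toy_pubs.fillna(value=FILL_VALUES)
print(filled.isnull().sum().sum())
```

Keeping the placeholders in one dictionary makes it easier to review which sentinel value belongs to which column.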


In [None]:
# check data_publications again
publications_missing = data_publications.isnull()

# check which columns are missing data
for key in publications_missing.keys():
    missing = publications_missing[publications_missing[key] == True]
    if len(missing.id) > 0:
        print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

Now there are no empty entries left. We store the completed dataset in the interim folder

In [None]:
# Save to intermediate
data_publications.to_csv(f'{interim_dir}/projectPublications_interim.csv', sep=';')

## Inspect CORDIS-HORIZON projects data files
This is the folder containing some more datasets on the different projects.

In [None]:
# define file paths
CORDIS_framework_docs_dir = f'{raw_dir}/cordis-HORIZONprojects-csv'

SciVoc_path = f'{CORDIS_framework_docs_dir}/euroSciVoc.csv'
legalBasis_path = f'{CORDIS_framework_docs_dir}/legalBasis.csv'
organization_path = f'{CORDIS_framework_docs_dir}/organization.csv'
project_path = f'{CORDIS_framework_docs_dir}/project.csv'
topics_path = f'{CORDIS_framework_docs_dir}/topics.csv'
webItems_path = f'{CORDIS_framework_docs_dir}/webItem.csv'
webLink_path = f'{CORDIS_framework_docs_dir}/webLink.csv'

In [None]:
# Import some informative files

# load datasets
read_csv_options = {
    "delimiter": ";",
    "quotechar": '"',
    "escapechar": "\\",
    'doublequote': False,
    # "on_bad_lines": "skip",   # we skip lines that do not import properly for now
    "engine": "python"  # 'python' engine handles complex parsing better
}


sci_voc_df = pd.read_csv(SciVoc_path, **read_csv_options)
legal_basis_df = pd.read_csv(legalBasis_path, **read_csv_options)
organization_df = pd.read_csv(organization_path, delimiter=';')
topics_df = pd.read_csv(topics_path, **read_csv_options)
web_items_df = pd.read_csv(webItems_path, **read_csv_options)
web_link_df = pd.read_csv(webLink_path, **read_csv_options)


In [None]:
read_csv_options['on_bad_lines'] = 'skip'
try:
    project_df = pd.read_csv(project_path, **read_csv_options)
    print(len(project_df.id))
except pd.errors.ParserError as e:
    print("Parsing error:", e)

15736


There are 15863 - 15737 = 126 lines that could not be read. Let us inspect those lines further.
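
One way to inspect the skipped rows is to pass a callable as `on_bad_lines` (available in pandas 1.4 and later, with `engine="python"`). pandas calls it with the split fields of each row that has too many fields; returning `None` skips the row. This is a sketch on made-up sample text, and `remember_bad_line` is a hypothetical helper name. Rows with too few fields are not reported this way.

```python
import io

import pandas as pd

bad_lines = []


def remember_bad_line(fields):
    """Record the offending row and return None so pandas skips it."""
    bad_lines.append(fields)
    return None


# Made-up sample: the third line has one field too many
sample = io.StringIO("id;acronym;title\n1;A;ok\n2;B;too;many\n3;C;ok\n")
df = pd.read_csv(sample, delimiter=";", engine="python", on_bad_lines=remember_bad_line)

print(len(df), len(bad_lines))
```

The collected `bad_lines` list can then be examined by hand or written to a file for later repair.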

In [None]:
project_df

Unnamed: 0,id,acronym,status,title,startDate,endDate,totalCost,ecMaxContribution,legalBasis,topics,ecSignatureDate,frameworkProgramme,masterCall,subCall,fundingScheme,nature,objective,contentUpdateDate,rcn,grantDoi
0,101159220,PvSeroRDT,SIGNED,A point-of-care serological rapid diagnostic t...,2025-02-01,2030-01-31,406239623,406239623,HORIZON.2.1,HORIZON-JU-GH-EDCTP3-2023-02-02-two-stage,2024-12-09,HORIZON,HORIZON-JU-GH-EDCTP3-2023-02-two-stage,HORIZON-JU-GH-EDCTP3-2023-02-two-stage,HORIZON-JU-RIA,,Plasmodium vivax is considered the most diffic...,2024-12-24 11:18:48,268210,10.3030/101159220
1,101096150,BIOBoost,SIGNED,Boosting innovation agencies for bioeconomy va...,2023-02-01,2025-01-31,0,500000,HORIZON.3.2,HORIZON-EIE-2022-CONNECT-01-01,2022-11-25,HORIZON,HORIZON-EIE-2022-CONNECT-01,HORIZON-EIE-2022-CONNECT-01,HORIZON-CSA,,The overall objectives of the BIOBoost project...,2022-12-01 14:09:06,243343,10.3030/101096150
2,101093997,GlycanTrigger,SIGNED,GLYCANS AS MASTER TRIGGERS OF HEALTH TO INTEST...,2023-01-01,2028-12-31,6771571,6771571,HORIZON.2.1,HORIZON-HLTH-2022-STAYHLTH-02-01,2022-12-05,HORIZON,HORIZON-HLTH-2022-STAYHLTH-02,HORIZON-HLTH-2022-STAYHLTH-02,HORIZON-RIA,,Chronic inflammation underlies several disease...,2022-12-11 19:02:29,243439,10.3030/101093997
3,101126531,CHIKVAX_CHIM,SIGNED,Late-stage clinical development of Chikungunya...,2023-06-01,2028-11-30,100000000,70000000,HORIZON.2.1,HORIZON-HLTH-2022-CEPI-15-01-IBA,2023-06-15,HORIZON,HORIZON-HLTH-2022-CEPI-15-IBA,HORIZON-HLTH-2022-CEPI-15-IBA,HORIZON-COFUND,,A Framework Partnership Agreement (FPA) betwee...,2023-09-19 19:01:01,256925,10.3030/101126531
4,101113979,The Oater,CLOSED,The Oater develops a compact machine for hyper...,2023-07-01,2023-12-31,0,75000,HORIZON.3.2,HORIZON-EIE-2022-SCALEUP-02-02,2023-06-05,HORIZON,HORIZON-EIE-2022-SCALEUP-02,HORIZON-EIE-2022-SCALEUP-02,HORIZON-CSA,,The Oater is a female-founded food tech start-...,2023-07-11 15:45:49,253030,10.3030/101113979
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15731,101052410,EUCYS2022,CLOSED,EUCYS Leiden2022,2021-09-01,2023-02-28,2000000,2000000,HORIZON.4.2,HORIZON-WIDERA-2021-EUCYS-IBA,2022-02-07,HORIZON,HORIZON-WIDERA-2021-EUCYS-IBA,HORIZON-WIDERA-2021-EUCYS-IBA,HORIZON-CSA,,The main objective of this Proposal is the org...,2023-03-10 20:23:58,241771,10.3030/101052410
15732,101124648,RESAVER_2023,SIGNED,Support to Retirement Savings Vehicle for Euro...,2023-09-01,2026-08-31,249963825,249963825,HORIZON.4.2,HORIZON-WIDERA-2023-RESAVER-IBA,2023-07-10,HORIZON,HORIZON-WIDERA-2023-RESAVER-IBA,HORIZON-WIDERA-2023-RESAVER-IBA,HORIZON-CSA,,The overall aim of the RESAVER Pension Fund as...,2023-10-13 14:43:57,257324,10.3030/101124648
15733,101052247,Leiden2022-ECS-ESOF,CLOSED,European City of Science and EuroScience Open ...,2021-08-01,2023-03-31,370914925,2000000,HORIZON.4.2,HORIZON-WIDERA-2021-ESOF-IBA,2021-12-13,HORIZON,HORIZON-WIDERA-2021-ESOF-IBA,HORIZON-WIDERA-2021-ESOF-IBA,HORIZON-CSA,,The main objective of this proposal is the org...,2022-09-14 19:17:24,241770,10.3030/101052247
15734,101172981,EUCYS2024,SIGNED,European Union Contest for Young Scientists (E...,2024-02-01,2025-02-28,999500,999500,HORIZON.4.2,HORIZON-WIDERA-2024-EUCYS-IBA,2024-04-15,HORIZON,HORIZON-WIDERA-2024-EUCYS-IBA,HORIZON-WIDERA-2024-EUCYS-IBA,HORIZON-CSA,,This proposal concerns the organization of the...,2024-04-22 17:56:03,262788,10.3030/101172981


## Inspect organization data files

In [None]:
# set or correct country values
organization_df.loc[organization_df['city']=='Windhoek', 'country']='NA' # 'NA' (Namibia) was probably interpreted as NaN; can also be fixed using keep_default_na=False when loading data
organization_df.loc[organization_df['city']=='WINDHOEK', 'country']='NA'
organization_df.loc[organization_df['city']=='Crawley', 'country']='UK'
organization_df.loc[organization_df['name']=='CEREGE', 'country']='FR'
organization_df.loc[organization_df['name']=='Purdue University', 'country']='US'
organization_df.loc[organization_df['name']=='Rijk Zwaan', 'country']='NL'

# make data numeric where necessary
organization_df['totalCost'] = pd.to_numeric(organization_df['totalCost'].str.replace(',','.'))
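
The Namibia issue can be reproduced on a tiny example: by default pandas treats the string `NA` as missing. The sketch below, on made-up sample text, shows the effect of `keep_default_na=False` combined with `na_values=[""]`, which keeps `NA` as text while still treating empty fields as missing.

```python
import io

import pandas as pd

# Made-up sample: one organisation in Namibia ("NA"), one with an empty country
sample = "name;country\nOrg A;NA\nOrg B;\n"

default = pd.read_csv(io.StringIO(sample), delimiter=";")

# keep_default_na=False disables the built-in NaN markers (which include "NA");
# na_values=[""] keeps genuinely empty fields as missing
fixed = pd.read_csv(
    io.StringIO(sample), delimiter=";", keep_default_na=False, na_values=[""]
)

print(default["country"].isna().sum(), fixed["country"].isna().sum())
```

With this option set at load time, the two manual `Windhoek` corrections would no longer be needed.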

In [None]:
# look for missing values
org_missing = organization_df.isnull()

# check which columns are missing data
for key in organization_df:
    missing = org_missing[org_missing[key] == True]
    print(f'For key {key}:\n     {len(missing.index)} elements are missing.')

For key projectID:
     0 elements are missing.
For key projectAcronym:
     0 elements are missing.
For key organisationID:
     0 elements are missing.
For key vatNumber:
     15490 elements are missing.
For key name:
     0 elements are missing.
For key shortName:
     25689 elements are missing.
For key SME:
     263 elements are missing.
For key activityType:
     23 elements are missing.
For key street:
     305 elements are missing.
For key postCode:
     786 elements are missing.
For key city:
     263 elements are missing.
For key country:
     0 elements are missing.
For key nutsCode:
     278 elements are missing.
For key geolocation:
     673 elements are missing.
For key organizationURL:
     39135 elements are missing.
For key contactForm:
     0 elements are missing.
For key contentUpdateDate:
     0 elements are missing.
For key rcn:
     0 elements are missing.
For key order:
     0 elements are missing.
For key role:
     0 elements are missing.
For key ecContribution

There are quite a few missing values, but I don't think we will need most of them anyway.

- All entries with role="associatedPartner" have NaN values in the column 'ecContribution' and sometimes in 'netEcContribution' and 'totalCost'. These will be set to 0.
- 'geolocation' has missing values that would be useful for visualization on a map. They could be (approximately) added by looking at the city or address, but I am not sure it is worth the effort.
- 'activityType' can be added manually for the 23 missing values if necessary, but maybe they are undefined for a reason.
- I don't think any of the others are very important.

I think the best metric to quantify the funding will be 'netEcContribution' (not 'ecContribution' or 'totalCost'), although I don't know the exact definitions of these. Note that 'totalCost' is often 0 when the contribution is not, which is probably wrong and suggests incomplete data entry.

In [None]:
# fill nans with 0 in funding metrics
organization_df[['ecContribution', 'netEcContribution', 'totalCost']] = organization_df[['ecContribution', 'netEcContribution', 'totalCost']].fillna(0)

In [None]:
#save to csv
organization_df.to_csv(f'{interim_dir}/organization_interim.csv', sep=';')

## Construct functions to access cleaned data

We now define some functions that allow easy access to all aspects of different projects. 


- Merge datasets into one object
- Standardize column names => they are compatible
- Create function that allow access to project-specific data:
    - function argument: project name / acronym / identifier
    - function output: data class with project information as attributes
    - Or: approach this from a class init perspective

Open question: find a way to pass the loaded datasets to the class, so that the full dataset does not have to be reloaded each time the class is initialized.


In [None]:
# class to load datasets

class CORDIS_data():
    def __init__(self):
        '''
        Initialize class: load data from the CSV files

        set some global variables that we 

        '''

        # reports and deliverables are stored in separate interim files
        self.data_report = pd.read_csv(f'{interim_dir}/reportSummaries_interim.csv', delimiter=';')
        self.data_deliverables = pd.read_csv(f'{interim_dir}/projectdeliverables_interim.csv', delimiter=';')

    
    def list_of_acronyms(self):
        '''
        This function prints out a dataframe 
        '''
        pass

class Project_data(CORDIS_data):
    ''' 
    This class inherits all the attributes from the class CORDIS_data, including the loaded datasets. 
    Intended use case: 
        set some new variables related to a project:
            - number of deliverables
            - deliverable_types
            - number of publications
            - publication types
            - ... (whatever might be useful later on in the project)
    '''
    
    def project(self, id=None, acronym=None):
        '''
        class to access all information related to a certain project:
        create some a

        '''
        pass
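
One possible way to avoid re-reading the CSV files on every instantiation is to load the DataFrames once, keep them in a single store object, and hand out lightweight per-project summaries. The sketch below is only an illustration under stated assumptions: `CordisStore` and `ProjectView` are hypothetical names, the toy rows are made up, and the deliverables data is assumed to have `projectAcronym` and `deliverableType` columns.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class ProjectView:
    """Per-project summary; the field names are illustrative assumptions."""
    acronym: str
    n_deliverables: int
    deliverable_types: tuple


class CordisStore:
    """Holds already-loaded DataFrames so they are read only once."""

    def __init__(self, deliverables: pd.DataFrame):
        self.deliverables = deliverables

    def list_of_acronyms(self) -> list:
        return sorted(self.deliverables["projectAcronym"].unique())

    def project(self, acronym: str) -> ProjectView:
        # Select the rows of one project and summarise them
        rows = self.deliverables[self.deliverables["projectAcronym"] == acronym]
        return ProjectView(
            acronym=acronym,
            n_deliverables=len(rows),
            deliverable_types=tuple(sorted(rows["deliverableType"].unique())),
        )


# Toy data; in the notebook this would be the frame read from the interim CSV
toy = pd.DataFrame({
    "projectAcronym": ["SELFY", "SELFY", "GeneBEcon"],
    "deliverableType": ["Other", "Documents, reports", "Other"],
})
store = CordisStore(toy)
print(store.project("SELFY"))
```

Passing the DataFrames into the constructor, rather than reading files inside `__init__`, keeps loading and querying separate and makes the class easy to test with small frames.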


In [None]:
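# store post-processed and feature-enriched datasets here
# NOTE: `data_full` is not defined anywhere in this notebook yet; it has to be
# built from the interim datasets before this cell can run.
data_full.to_csv(f'{processed_dir}/CORDIS_enriched.csv', sep=';')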