<p>The EU Open Data Portal is a source of data produced by the institutions and other bodies of the European Union. These open data are free and can be used for research, applications, and other commercial or non-commercial purposes.
The dataset contains all the projects funded by the European Union from 1994 to 2020. These projects were approved under the framework programmes (FP) for research and technological development. For each project, the dataset provides information such as reference, acronym, dates, funding, programmes, participant countries, subjects and objectives.
Datasets uploaded to the Open Data Portal are produced on a monthly basis.</p>

In [1]:
import pandas as pd
import xlrd
import numpy as np

Load datasets of all projects funded by the European Union for research and technological development under the:
- FP4: fourth framework programme (1994–1998)
- FP5: fifth framework programme (1998–2002)
- FP6: sixth framework programme (2002–2006)
- FP7: seventh framework programme (2007–2013)
- H2020: Horizon 2020 framework programme (2014–2020)

In [2]:
xlsxFP4 = pd.ExcelFile("dataset/cordisfp4projects.xlsx")
xlsxFP5 = pd.ExcelFile("dataset/cordis-fp5projects.xlsx")
xlsxFP6 = pd.ExcelFile("dataset/cordis-fp6projects.xlsx")
xlsxFP7 = pd.ExcelFile("dataset/cordis-fp7projects.xlsx")
xlsxH2020 = pd.ExcelFile("dataset/cordis-h2020projects.xlsx")

In [3]:
dataFP4 = xlsxFP4.parse()
dataFP5 = xlsxFP5.parse()
dataFP6 = xlsxFP6.parse()
dataFP7 = xlsxFP7.parse()
dataH2020 = xlsxH2020.parse()

<h4>Common attributes in all files:</h4>
<ul>
<li>rcn: reference code (type: int)</li>
<li>project title (type: string)</li>
<li>start date (type: datetime)</li>
<li>end date (type: datetime)</li>
<li>status (type: string):
    <ul>
    <li>completed/accepted (FP4)</li>
    <li>null (FP5, FP6)</li>
    <li>ong(oing)/can(celled) (FP7)</li>
    <li>signed (H2020)</li>
    </ul></li>
<li>acronym (type: string)</li>
<li>programme/pga (type: string)</li>
<li>framework programme (type: string)</li>
<li>total Cost (type: int)</li>
<li>objective (type: string)</li>
<li>projectUrl</li>
<li>(project) call (type: string)</li>
<li>subject (type: list)</li>
<li>coordinatorCountry (type: string)</li>
<li>participantCountries (type: list)</li>
</ul>
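
One way to check a list like this against the loaded files is to intersect the column names of the frames. The sketch below is a minimal illustration using plain lists: `columns_by_programme`, its shortened column lists, and the helper `common_columns` are hypothetical stand-ins for `list(dataFP4.columns)`, `list(dataFP5.columns)` and so on, not the real CORDIS headers.

```python
# Hypothetical, shortened column lists; in the notebook these would come from
# list(dataFP4.columns), list(dataFP5.columns), ...
columns_by_programme = {
    "FP4": ["rcn", "title", "objective", "subjects", "contractNumber"],
    "FP5": ["rcn", "id", "title", "objective", "subjects", "topics"],
    "H2020": ["rcn", "title", "objective", "subjects", "topics"],
}


def common_columns(columns_by_programme):
    """Return the sorted column names that appear in every programme."""
    column_sets = [set(cols) for cols in columns_by_programme.values()]
    return sorted(set.intersection(*column_sets))


print(common_columns(columns_by_programme))
```

Columns that fall outside the intersection correspond to the "extra attributes" listed for each programme.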

<br>
<h3>FP4: fourth framework programme (1994–1998)</h3>
<br>
https://data.europa.eu/euodp/en/data/dataset/cordisfp4projects 

<p>Extra attributes: Contract Number, Keywords, Date of Signature, Total Funding, General Information, Achievements, Activity Area, Contract Type</p><br>


In [5]:
# dataFP4.head(10)
# .copy() returns an independent frame, so the assignment below does not act on a slice of dataFP4
df4 = dataFP4[['rcn', 'title', 'objective', 'subjects', 'frameworkProgramme']].copy()
# Label every row with its framework programme
df4['frameworkProgramme'] = 'FP4'
df4.head(10)

Unnamed: 0,rcn,title,objective,subjects,frameworkProgramme
0,29005,Spot IV-VΘgΘtation,,Environmental Protection; Forecasting; Meteoro...,FP4
1,30802,Formation and occurrence of nitrous acd in the...,%LTo understand the mechanisms leading to the ...,Environmental Protection; Forecasting; Measure...,FP4
2,31031,Process for Production of Light Olefins by Deh...,,Industrial Manufacture; Materials Technology,FP4
3,30803,High resolution diode laser carbon dioxide env...,%LTo develop a new instrument for measuring at...,Environmental Protection; Measurement Methods;...,FP4
4,31004,Subsurface Radar as a Tool for Non-destructive...,,Industrial Manufacture; Materials Technology; ...,FP4
5,30804,Pollution from aircraft emissions In the North...,To determine by measurements and analysis the ...,Environmental Protection; Forecasting; Measure...,FP4
6,30805,Diversity Effects in Grassland Ecosystems of E...,DEGREE aims at investigating the modifications...,Environmental Protection; Meteorology,FP4
7,31108,Improvement of moisture content measuring syst...,,Industrial Manufacture; Measurement Methods; R...,FP4
8,31109,Robust process analytical methods for industri...,The problem of the agreement between analytica...,Industrial Manufacture; Measurement Methods; R...,FP4
9,31110,Improvement of robot industrial standardisation,,"Electronics, Microelectronics; Industrial Manu...",FP4


<br>
<h3>FP5: fifth framework programme (1998–2002)</h3>
<br>
https://data.europa.eu/euodp/en/data/dataset/cordisfp5projects

<p>Extra attributes: id, topics, ecMaxContribution, fundingScheme, coordinator, participants</p><br>


In [6]:
# dataFP5.head(10)
# .copy() returns an independent frame, so the assignment below does not act on a slice of dataFP5
df5 = dataFP5[['rcn', 'title', 'objective', 'subjects', 'frameworkProgramme']].copy()
# Label every row with its framework programme
df5['frameworkProgramme'] = 'FP5'
df5.head(5)

Unnamed: 0,rcn,title,objective,subjects,frameworkProgramme
0,64570,Genetic diversity in agriculture: temporal flu...,The overall objective of this project is to de...,ECO;SEA;LIF;ENV;AGR,FP5
1,64192,Sensing and controlling single molecules by no...,This project concerns controlling and sensing ...,BIO;LIF;ENV;MED;WAS;ITT,FP5
2,61977,Transduction mechanisms for non-noxious and no...,,,FP5
3,54932,Portable measurement systems for atmospheric p...,The primary objective of the proposed project ...,SEA;MET;ENV;FOR,FP5
4,56044,Benthic primary production - carbon cycling an...,,,FP5


<br>
<h3>FP6: sixth framework programme (2002–2006)</h3>
<br>
https://data.europa.eu/euodp/en/data/dataset/cordisfp6projects

<p>Extra attributes: reference, topics, ecMaxContribution, fundingScheme, coordinator, participants</p><br>


In [7]:
# dataFP6.head(2)
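# .copy() returns an independent frame, so the assignment below does not act on a slice of dataFP6
df6 = dataFP6[['rcn', 'title', 'objective', 'subjects', 'frameworkProgramme']].copy()
# Fill missing framework programme values with the programme label
df6['frameworkProgramme'] = df6['frameworkProgramme'].fillna('FP6')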
df6.head(10)

Unnamed: 0,rcn,title,objective,subjects,frameworkProgramme
0,71920,Amigo Ambient Intelligence for the networked h...,The networked home environment leads to many n...,IPS,FP6
1,85502,Genetic component of the low dose risk of thyr...,Cancer of the non-medullary (follicular epithe...,BIO;RAD,FP6
2,74968,European food information resource network,EuroFIR will form a world-leading collaboratio...,IPS;FOO,FP6
3,74155,Global allergy and asthma european network,Allergic diseases and asthma pose an important...,SEA;LIF;MED;FOO;AGR,FP6
4,74297,Advanced Protection Systems (APROSYS),The IP on Advanced Protective Systems (APROSYS...,,FP6
5,81228,Flavonoids and related phenolics for healthy L...,There is growing evidence that bioactives in t...,MED;FOO;AGR;SAF,FP6
6,82431,Multi-functional carbon nanotubes for biomedic...,We will exploit the potential of multi-functio...,,FP6
7,79163,Enzyme Microarrays-An integrated technology fo...,The deciphering of the human genome laid the g...,,FP6
8,75937,Healthy Lifestyle in Europe by Nutrition in Ad...,The key to health promotion and disease preven...,SOC;MED;FOO,FP6
9,73971,"Crystalline Silicon PV: Low-cost, highly effic...",Crystal-Clear intends to develop innovative ma...,,FP6


<br>
<h3>FP7: seventh framework programme (2007–2013)</h3>
<br>
https://data.europa.eu/euodp/en/data/dataset/cordisfp7projects 

<p>Extra attributes: reference, topics, ecMaxContribution, fundingScheme, coordinator, participants</p><br>

In [8]:
# dataFP7.head(10)
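# .copy() returns an independent frame, so the assignment below does not act on a slice of dataFP7
df7 = dataFP7[['rcn', 'title', 'objective', 'subjects', 'frameworkProgramme']].copy()
# Fill missing framework programme values with the programme label
df7['frameworkProgramme'] = df7['frameworkProgramme'].fillna('FP7')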
df7.head(10)

Unnamed: 0,rcn,title,objective,subjects,frameworkProgramme
0,110629,ALFRED - Personal Interactive Assistant for In...,***Personal Interactive Assistant for Independ...,INF,FP7
1,104117,Microbial Biomarker Records in Tibetan Peats: ...,It is crucial to understand terrestrial microb...,SCI,FP7
2,188177,Post-glacial recolonisation and Holocene anthr...,"At the end of last glaciation, ca. 15 000 cal....",,FP7
3,188066,Molecular Mechanisms Employed by the Newly Ass...,Posttranscriptional gene regulation is an esse...,,FP7
4,187919,Identifying the targets and mechanism of actio...,The Ubiquitin (UB) and SUMO modification pathw...,,FP7
5,107182,Archaeological Investigations of the Extra-Urb...,'The research project 'ARIEL' proposes the arc...,LIF,FP7
6,109557,Nano -structural and -dynamic events in the T-...,The organization of the T-cell in its resting ...,LIF,FP7
7,107890,Synthesis and Biological Evaluation of a Poten...,'This project is directed toward the total syn...,LIF,FP7
8,108334,Microbially catalyzed electricity driven biopr...,'The breakthroughs in extracellular electron t...,SCI,FP7
9,109580,Thin-film Hybrid Interfaces: a training initia...,Organic thin-films constitute a fast growing a...,LIF,FP7


<h3>H2020: Horizon 2020 framework programme (2014–2020)</h3>
<br>
https://data.europa.eu/euodp/en/data/dataset/cordisH2020projects

<p>Extra attributes: reference, topics, ecMaxContribution, fundingScheme, coordinator, participants</p><br>

In [9]:
# dataH2020.head(10)
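# .copy() returns an independent frame, so the assignment below does not act on a slice of dataH2020
df20 = dataH2020[['rcn', 'title', 'objective', 'subjects', 'frameworkProgramme']].copy()
# Fill missing framework programme values with the programme label
df20['frameworkProgramme'] = df20['frameworkProgramme'].fillna('H2020')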
df20.head(10)

Unnamed: 0,rcn,title,objective,subjects,frameworkProgramme
0,193982,Carbon Cascades from Land to Ocean in the Anth...,C-CASCADES will produce a new generation of yo...,,H2020
1,193979,Applied mathematics for risk measures in finan...,The EID WAKEUPCALL has been set up with the kn...,,H2020
2,193971,BigStorage: Storage-based Convergence between ...,'The consortium of this European Training Netw...,,H2020
3,193970,Perception and Action in Complex Environments,The PACE research and training programme sits ...,,H2020
4,193969,Industrial optimal design using adjoint CFD,Adjoint-based methods have become the most int...,,H2020
5,193967,Environmental Humanities for a Concerned Europe,ENHANCE (Environmental Humanities for a Concer...,,H2020
6,193961,Improved production strategies for endangered ...,The IMPRESS European Training Network will pro...,,H2020
7,193956,Development of Selective Carbohydrate Immunomo...,IMMUNOSHAPE aims at training a new generation ...,,H2020
8,193958,The Extracellular Matrix in Epileptogenesis,Over 50 million people worldwide have epilepsy...,,H2020
9,193952,Engineering of new-generation protein secretio...,"The market for recombinant proteins, including...",,H2020
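
The five cells above repeat the same select-and-label step, so it could be factored into one helper. The sketch below is only an illustration: `select_and_label`, `COLUMNS` and the toy frame are hypothetical names, and the helper overwrites the label on every row, which assumes the column holds nothing else worth keeping.

```python
import pandas as pd

COLUMNS = ['rcn', 'title', 'objective', 'subjects', 'frameworkProgramme']


def select_and_label(frame, label):
    """Keep the shared columns and set frameworkProgramme to `label` on every row."""
    subset = frame[COLUMNS].copy()  # copy so the source frame is left untouched
    subset['frameworkProgramme'] = label
    return subset


# Toy frame standing in for one of the parsed Excel sheets (made-up values)
toy = pd.DataFrame({
    'rcn': [1, 2],
    'title': ['A', 'B'],
    'objective': ['first objective', None],
    'subjects': ['ENV', None],
    'frameworkProgramme': [None, None],
    'extra': ['x', 'y'],
})
labelled = select_and_label(toy, 'FP7')
print(labelled['frameworkProgramme'].tolist())
```

With such a helper, each programme would need a single call, for example `select_and_label(dataFP7, 'FP7')`.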


<br>
<h4>Data preparation</h4>

<p>Data cleaning is absolutely crucial for generating a useful topic model: as the saying goes, “garbage in, garbage out.” The steps below are common to most natural language processing methods:
<ul>
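<li>Tokenizing: splitting a document into its atomic elements (here, words).</li>
<li>Stopping: removing very common words, such as "the" or "and", that carry little meaning on their own.</li>
<li>Stemming: reducing words to their stem so that inflected forms of the same word are merged.</li>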
</ul>
</p>
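
To make the three steps concrete, here is a minimal sketch on a single made-up sentence, using only the standard library. The stop-word list is a tiny hand-picked assumption, and `crude_stem` is a deliberately crude suffix-stripping rule that stands in for a real stemmer such as the Porter stemmer; it is not equivalent to it.

```python
import re

# Tiny hand-picked stop-word list (assumption; a real run would use a full list)
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "will", "is"}


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(document):
    tokens = re.findall(r"[a-z]+", document.lower())   # tokenizing
    kept = [t for t in tokens if t not in STOP_WORDS]   # stopping
    return [crude_stem(t) for t in kept]                # stemming


print(preprocess("The project will develop new measuring systems"))
# ['project', 'develop', 'new', 'measur', 'system']
```

Note that stems such as `measur` need not be dictionary words; they only have to be the same for all inflected forms.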

In [10]:
# Compile the 'objective' of all the projects into a list
list_obj = df4['objective'].tolist()
list5 = df5['objective'].tolist()
list6 = df6['objective'].tolist()
list7 = df7['objective'].tolist()
list20 = df20['objective'].tolist()

In [11]:
list_obj.extend(list5)
list_obj.extend(list6)
list_obj.extend(list7)
list_obj.extend(list20)

In [12]:
# %-formatting prints the same line under both Python 2 and Python 3
print('Number of documents:  %d' % len(list_obj))

Number of documents:  76522


In [13]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import string

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')
en_stop.append('will')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

texts = []

In [14]:
# Characters to strip: punctuation and digits (built once, outside the loop)
exclude_chars = list(string.punctuation) + list(string.digits)
# Set lookup is much faster than scanning the stop-word list for every token
stop_set = set(en_stop)

for text in list_obj:

    try:
        # Missing objectives are NaN floats; str() turns them into 'nan'
        text = str(text)
    except UnicodeEncodeError:
        # Python 2 only: unicode text with non-ASCII characters is kept as-is
        pass

    text = text.lower()

    # Skip projects without an objective
    if text == 'nan':
        continue

    # Drop the '%L' marker that prefixes some objectives, then punctuation and digits
    text = text.replace('%l', '')
    for c in exclude_chars:
        text = text.replace(c, " ")

    # tokenize document string
    tokens = tokenizer.tokenize(text)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in stop_set]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)
        


In [15]:
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

In [16]:
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
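
In this representation each document becomes a list of (token id, count) pairs. The sketch below mimics that idea with the standard library only; `build_vocabulary` and `to_bow` are hypothetical helpers, not gensim functions, and the ids they assign (order of first appearance) need not match those of `corpora.Dictionary`.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Two made-up, already stemmed documents
toy_texts = [["cell", "function", "cell"], ["market", "system", "cell"]]
token2id = build_vocabulary(toy_texts)
toy_corpus = [to_bow(tokens, token2id) for tokens in toy_texts]
print(toy_corpus)
# [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Word order is discarded; only how often each term occurs in a document is kept, which is what a bag-of-words topic model works with.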

In [17]:
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary)

print(ldamodel.print_topics(num_topics=5, num_words=5))

[(0, u'0.010*product + 0.010*use + 0.007*process + 0.007*project + 0.007*water'), (1, u'0.013*market + 0.011*system + 0.011*develop + 0.010*project + 0.009*energi'), (2, u'0.014*cell + 0.008*use + 0.007*develop + 0.007*studi + 0.007*function'), (3, u'0.010*develop + 0.010*system + 0.009*use + 0.008*new + 0.008*applic'), (4, u'0.026*research + 0.013*project + 0.008*european + 0.007*develop + 0.007*innov')]
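
Each topic is printed as a weighted sum of stemmed terms (hence forms such as `energi` and `studi`). To work with the weights programmatically, the strings can be split into (term, weight) pairs. The sketch below assumes the `weight*term + weight*term` format shown above; `parse_topic` is a hypothetical helper, and it also strips double quotes in case the terms are quoted.

```python
def parse_topic(topic_string):
    """Split 'weight*term + weight*term' into a list of (term, weight) pairs."""
    pairs = []
    for part in topic_string.split(' + '):
        weight, term = part.split('*', 1)
        pairs.append((term.strip().strip('"'), float(weight)))
    return pairs


# First topic string copied from the output above
topic_0 = '0.010*product + 0.010*use + 0.007*process + 0.007*project + 0.007*water'
for term, weight in parse_topic(topic_0):
    print(term, weight)
```

Generic terms such as `project`, `develop` and `use` appear in several of the five topics, which suggests they could be added to the stop-word list before refitting the model.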
