## H2020 project objective analysis using ML

Starting from the data on H2020 projects published by the European Commission at https://data.europa.eu/euodp/en/data/dataset/cordisH2020projects, we do some data mining.

Now we try ML, using Naive Bayes, on a dataset containing textile and non-textile projects.

##### Creating the dataset

In [70]:
import pandas as pd
import csv

#modules for NLP
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

from collections import Counter

##### Dataset files were downloaded locally

reports = pd.read_csv('C:/Users/jl/Downloads/h2020reports.csv', sep= ";", encoding='LATIN1') <br/>
organisations = pd.read_csv('C:/Users/jl/Downloads/h2020organizations.csv', sep= ";", encoding='LATIN1') <br/>
deliverables = pd.read_csv('C:/Users/jl/Downloads/h2020projectDeliverables.csv', sep= ";",encoding='LATIN1') <br/>
projects = pd.read_csv('C:/Users/jl/Downloads/h2020projects.csv', sep= ";",encoding='LATIN1')

In [71]:
projects = pd.read_csv('C:/Users/jl/Downloads/h2020projects.csv', sep= ";",encoding='UTF-8')
programmes = pd.read_csv('C:/Users/jl/Downloads/h2020programmes.csv', sep= ";",encoding='UTF-8')
topics = pd.read_csv('C:/Users/jl/Downloads/h2020topics.csv', sep= ";",encoding='LATIN1')

In [72]:
# concatenate programmes & topics
topics_programmes = pd.concat([programmes,topics], sort = False)

In [73]:
topics_programmes.head()

Unnamed: 0,rcn,code,title,shortTitle,language
0,664357,H2020-EU.3.4.,"RETOS DE LA SOCIEDAD - Transporte inteligente,...",Transport,es
1,664321,H2020-EU.3.3.,"WYZWANIA SPOŁECZNE - Bezpieczna, czysta i efek...",Energy,pl
2,664233,H2020-EU.2.3.2.3.,Wsparcie innowacji rynkowych,Supporting market-driven innovation,pl
3,664281,H2020-EU.3.2.,"RETOS DE LA SOCIEDAD - Seguridad alimentaria, ...","Food, agriculture, forestry, marine research a...",es
4,664235,H2020-EU.3.,PRIORITÉ «Défis de société»,Societal Challenges,fr


In [74]:
# Total number of projects in the file
print("Total number of projects in file is {}".format(len(projects)))

Total number of projects in file is 24554


In [75]:
# Adding programme and topic information to the projects
data = pd.merge(projects, topics_programmes, how='left', left_on ="topics", right_on ="code")
data.drop(['code','language', 'rcn_x','rcn_y','id','subjects','shortTitle'], axis=1, inplace=True)
data.rename(columns = {'title_x': 'titleProject','title_y':'callTitle'}, inplace=True)

In [76]:
data.head()

Unnamed: 0,acronym,status,programme,topics,frameworkProgramme,titleProject,startDate,endDate,projectUrl,objective,totalCost,ecMaxContribution,call,fundingScheme,coordinator,coordinatorCountry,participants,participantCountries,callTitle
0,FARMYNG,SIGNED,H2020-EU.2.1.4.;H2020-EU.3.2.6.,BBI.2018.SO3.F2,H2020,FlAgship demonstration of industrial scale pro...,2019-06-01,2022-06-30,,The world faces a major challenge with the sha...,469348491,1963041118,H2020-BBI-JTI-2018,BBI-IA-FLAG,YNSECT,FR,EUROFINS ANALYTICS FRANCE SAS;VIRBAC NUTRITION...,FR;NO;BE;ES;PL;CH;DE;NL,Large-scale production of proteins for food an...
1,FF-IPM,SIGNED,H2020-EU.3.2.1.1.,SFS-05-2018-2019-2020,H2020,"In-silico boosted, pest prevention and off-sea...",2019-09-01,2023-08-31,,The FF-IPM project targets three highly polyph...,60042525,60042525,H2020-SFS-2018-2,RIA,PANEPISTIMIO THESSALIAS,EL,UNIVERSITAT JAUME I DE CASTELLON;ANECOOP SOCIE...,ES;PT;HR;IL;DE;PL;AT;ZA;IT;BE;EL;CY;FR;US;AU;CN,New and emerging risks to plant health
2,BELENUS,SIGNED,H2020-EU.3.3.2.,LC-SC3-RES-11-2018,H2020,Lowering Costs by Improving Efficiencies in Bi...,2019-03-01,2023-02-28,,The primary objective of BELENUS is to lower b...,499132375,499132375,H2020-LC-SC3-2018-RES-TwoStages,RIA,UNIVERSIDAD COMPLUTENSE DE MADRID,ES,EIFER EUROPAISCHES INSTITUT FUR ENERGIEFORSCHU...,DE;ES;FI;PT;UK;SE;FR,Developing solutions to reduce the cost and in...
3,RURALIZATION,SIGNED,H2020-EU.3.2.1.3.,RUR-01-2018-2019,H2020,The opening of rural areas to renew rural gene...,2019-05-01,2023-04-30,,"European economic, social and territorial cohe...",5995904,5995904,H2020-RUR-2018-2,RIA,TECHNISCHE UNIVERSITEIT DELFT,NL,MAGYAR TUDOMANYOS AKADEMIA TARSADALOMTUDOMANYI...,HU;FR;IE;PL;DE;RO;ES;UK;BE;FI;IT,Building modern rural policies on long-term vi...
4,iLIVE,SIGNED,H2020-EU.3.1.3.,SC1-BHC-23-2018,H2020,"Living well, dying well. A research programme ...",2019-01-01,2022-12-31,,Every year around 4 million people die in the ...,41088175,40178175,H2020-SC1-2018-Single-Stage-RTD,RIA,ERASMUS UNIVERSITAIR MEDISCH CENTRUM ROTTERDAM,NL,MEDIZINISCHE UNIVERSITAET WIEN;KLINIKUM DER UN...,AT;DE;SE;AU;NZ;CH;NL;SI;UK;ES;IS;NO;AR,Novel patient-centred approaches for survivors...


In [77]:
# Selecting the projects which have the words textiles or clothing in the objectives
data['lobjective'] = data['objective'].str.lower()
data['memberTrainingSet'] = False
data['textileProject'] = False
terms = ['textile','textiles','clothing']
terms = '|'.join(terms)
data.loc[data['lobjective'].str.contains(terms, na=False),'memberTrainingSet'] = True
data.loc[data['lobjective'].str.contains(terms, na=False),'textileProject'] = True
print("Total number of textile & clothing related projects is {}".format(data['textileProject'].sum()))

Total number of textile & clothing related projects is 188
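
The pattern built above, `textile|textiles|clothing`, is searched for anywhere in the text, so the `textiles` alternative is redundant and a word such as `nontextile` also counts as a hit. The sketch below uses the standard `re` module on made-up objective strings to show the difference that word boundaries would make. The sample sentences and `WORD_PATTERN` are illustrative assumptions, not part of the notebook.

```python
import re

# Pattern as built in the notebook: plain alternation, matched anywhere in the text.
SUBSTRING_PATTERN = re.compile("textile|textiles|clothing")

# Hypothetical stricter variant: whole words only, with an optional plural "s".
WORD_PATTERN = re.compile(r"\b(?:textiles?|clothing)\b")

# Made-up examples, already lower-cased like the 'lobjective' column.
samples = [
    "recycling of textiles into new fibres",
    "smart clothing for firefighters",
    "a nontextile composite for aircraft wings",
    "perovskite solar cells",
]

for text in samples:
    substring_hit = bool(SUBSTRING_PATTERN.search(text))
    word_hit = bool(WORD_PATTERN.search(text))
    print(f"{substring_hit!s:5} {word_hit!s:5} {text}")
```

Whether the stricter variant is preferable depends on how many of the 188 hits are compound words that should still count as textile projects.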


In [78]:
# selecting 200 random samples from the non-textile projects
# (no seed is set, so a different sample is drawn on every run)
ran = data.loc[~data['memberTrainingSet']].sample(n=200)
data.loc[data.index.isin(ran.index),'memberTrainingSet'] = True
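
Because the sample is drawn without a seed, the 200 negative examples change on every run. pandas' `DataFrame.sample` accepts a `random_state` argument for this purpose. The sketch below shows the same idea with the standard `random` module so that it runs without the project files. The toy `is_textile` table and the value of `SEED` are assumptions made only for illustration.

```python
import random

SEED = 42  # arbitrary seed, chosen only for this illustration

# Hypothetical stand-in for the project table: index -> keyword filter matched?
is_textile = {i: (i % 10 == 0) for i in range(1000)}

# Candidates are the projects not already selected by the keyword filter.
candidates = [i for i, flag in is_textile.items() if not flag]


def draw_negatives(seed, k=200):
    """Draw k non-textile indices without replacement, using a fixed seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, k))


first = draw_negatives(SEED)
second = draw_negatives(SEED)

# Same seed, same selection: the training set can be rebuilt exactly.
print(len(first), first == second)
```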

In [79]:
data.head()

Unnamed: 0,acronym,status,programme,topics,frameworkProgramme,titleProject,startDate,endDate,projectUrl,objective,...,call,fundingScheme,coordinator,coordinatorCountry,participants,participantCountries,callTitle,lobjective,memberTrainingSet,textileProject
0,FARMYNG,SIGNED,H2020-EU.2.1.4.;H2020-EU.3.2.6.,BBI.2018.SO3.F2,H2020,FlAgship demonstration of industrial scale pro...,2019-06-01,2022-06-30,,The world faces a major challenge with the sha...,...,H2020-BBI-JTI-2018,BBI-IA-FLAG,YNSECT,FR,EUROFINS ANALYTICS FRANCE SAS;VIRBAC NUTRITION...,FR;NO;BE;ES;PL;CH;DE;NL,Large-scale production of proteins for food an...,the world faces a major challenge with the sha...,False,False
1,FF-IPM,SIGNED,H2020-EU.3.2.1.1.,SFS-05-2018-2019-2020,H2020,"In-silico boosted, pest prevention and off-sea...",2019-09-01,2023-08-31,,The FF-IPM project targets three highly polyph...,...,H2020-SFS-2018-2,RIA,PANEPISTIMIO THESSALIAS,EL,UNIVERSITAT JAUME I DE CASTELLON;ANECOOP SOCIE...,ES;PT;HR;IL;DE;PL;AT;ZA;IT;BE;EL;CY;FR;US;AU;CN,New and emerging risks to plant health,the ff-ipm project targets three highly polyph...,False,False
2,BELENUS,SIGNED,H2020-EU.3.3.2.,LC-SC3-RES-11-2018,H2020,Lowering Costs by Improving Efficiencies in Bi...,2019-03-01,2023-02-28,,The primary objective of BELENUS is to lower b...,...,H2020-LC-SC3-2018-RES-TwoStages,RIA,UNIVERSIDAD COMPLUTENSE DE MADRID,ES,EIFER EUROPAISCHES INSTITUT FUR ENERGIEFORSCHU...,DE;ES;FI;PT;UK;SE;FR,Developing solutions to reduce the cost and in...,the primary objective of belenus is to lower b...,False,False
3,RURALIZATION,SIGNED,H2020-EU.3.2.1.3.,RUR-01-2018-2019,H2020,The opening of rural areas to renew rural gene...,2019-05-01,2023-04-30,,"European economic, social and territorial cohe...",...,H2020-RUR-2018-2,RIA,TECHNISCHE UNIVERSITEIT DELFT,NL,MAGYAR TUDOMANYOS AKADEMIA TARSADALOMTUDOMANYI...,HU;FR;IE;PL;DE;RO;ES;UK;BE;FI;IT,Building modern rural policies on long-term vi...,"european economic, social and territorial cohe...",False,False
4,iLIVE,SIGNED,H2020-EU.3.1.3.,SC1-BHC-23-2018,H2020,"Living well, dying well. A research programme ...",2019-01-01,2022-12-31,,Every year around 4 million people die in the ...,...,H2020-SC1-2018-Single-Stage-RTD,RIA,ERASMUS UNIVERSITAIR MEDISCH CENTRUM ROTTERDAM,NL,MEDIZINISCHE UNIVERSITAET WIEN;KLINIKUM DER UN...,AT;DE;SE;AU;NZ;CH;NL;SI;UK;ES;IS;NO;AR,Novel patient-centred approaches for survivors...,every year around 4 million people die in the ...,False,False


In [80]:
data.to_csv('training.csv', sep = ";")

### Using ML

The sample dataset is saved, and the projects flagged as textile are reviewed to make sure that they really are textile projects. This is of course where the discussion starts: "What is a textile project?" The reviewed data is stored on disk in the file 'training_corrected.csv'.

In [81]:
dataX = pd.read_csv('training_corrected.csv', sep=';', index_col=0)
dataX.dtypes

acronym                 object
status                  object
programme               object
topics                  object
frameworkProgramme      object
titleProject            object
startDate               object
endDate                 object
projectUrl              object
objective               object
totalCost               object
ecMaxContribution       object
call                    object
fundingScheme           object
coordinator             object
coordinatorCountry      object
participants            object
participantCountries    object
callTitle               object
lobjective              object
memberTrainingSet         bool
textileProject            bool
dtype: object

In [82]:
model_data = dataX.loc[dataX['memberTrainingSet'] == True,['lobjective','textileProject']]
model_data['textileProject']=model_data['textileProject'].map({True : 1, False : 0})

In [83]:
from nltk.stem import WordNetLemmatizer
tokenizer = RegexpTokenizer(r'\w+\D+.\D+')
wordnet_lemmatizer = WordNetLemmatizer()
stopWords = set(stopwords.words('english'))
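
It is worth checking what the pattern `\w+\D+.\D+` actually matches. The sketch below assumes that `RegexpTokenizer.tokenize` returns the same matches as `re.findall` with the flags shown (to my knowledge the tokenizer's defaults) for a pattern without capture groups. The example sentence is made up, and `\w+` is only a commonly used alternative, not the notebook's choice.

```python
import re

# Flags assumed to mirror the defaults of nltk's RegexpTokenizer.
FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL

text = "textile fibres are recycled"

# Pattern used in the notebook: one word, then as many non-digit characters
# as possible, so a digit-free sentence comes back as a single chunk.
chunks = re.findall(r"\w+\D+.\D+", text, FLAGS)

# Alternative pattern: one token per run of word characters.
words = re.findall(r"\w+", text, FLAGS)

print(chunks)
print(words)
```

If the tokenizer does behave this way, each "token" in the vocabulary is a long stretch of text rather than a word, which would limit how much the classifier can learn from shared vocabulary.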

In [84]:
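def my_tokenizer(s):
    # split the text with the regexp tokenizer defined above
    tokens = tokenizer.tokenize(s)
    # drop very short tokens (one or two characters)
    tokens = [t for t in tokens if len(t) > 2]
    # reduce each token to its WordNet lemma
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
    # remove English stop words
    tokens = [t for t in tokens if t not in stopWords]
    return tokens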

In [85]:
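def tokens_to_vector(tokens, label):
    # note: numpy (np) is imported in a later cell, before this function is first called
    x = np.zeros(len(word_index_map) + 1) # last element is for the label
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    total = x.sum()
    if total > 0: # avoid dividing by zero when there are no tokens
        x = x / total # normalize it before setting label
    x[-1] = label
    return x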
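
To make the vector layout concrete, here is a small pure-Python sketch of the same idea: every distinct token gets a column, the counts are turned into relative frequencies, and the label goes in the last position. The helper names `build_index` and `to_vector` and the two toy documents are assumptions for illustration only.

```python
def build_index(documents):
    """Assign each distinct token a column index in order of first appearance."""
    index = {}
    for tokens in documents:
        for token in tokens:
            index.setdefault(token, len(index))
    return index


def to_vector(tokens, label, index):
    """Return relative token frequencies followed by the label."""
    counts = [0.0] * len(index)
    for token in tokens:
        counts[index[token]] += 1
    total = sum(counts)
    # Leave an all-zero vector untouched instead of dividing by zero.
    freqs = [c / total for c in counts] if total else counts
    return freqs + [float(label)]


# Two made-up tokenized documents.
docs = [["wool", "fibre", "wool"], ["solar", "cell"]]
index = build_index(docs)

print(index)
print(to_vector(docs[0], 1, index))
```

The frequency part of each vector sums to 1, so long and short objectives end up on the same scale.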

In [86]:
model_data.loc[model_data['textileProject'] == 0,'lobjective']

4        every year around 4 million people die in the ...
42       wherever heat is when heat is needed (for indu...
291      air cargo has experienced tremendous growth. e...
348      the main goal of co-val is to discover, analys...
373      plants are the foundation of all ecosystems an...
563      the goal of m2m is to conduct a market analysi...
16115    this proposal addresses the topic jti-cs2-2015...
13317    perovskite solar cells (psc) have shown an imp...
11630    a fundamental challenge in design research tod...
1165     90% of antifouling paints contain copper and s...
1302     regio, a leading provider of gis solutions in ...
1383     'car manufacturers have developed and incorpor...
1413     'lemon 'less energy more opportunities' focuss...
19214    visolis represent the future in cheap sustaina...
1523     the idea of the proposal is to prove the relev...
17695    bodypass aims to break barriers between health...
19503    with nearly 50 years of experience and holding.

In [87]:
word_index_map = {}
current_index = 0
orig_reviews = []
textile_tokens = []
nontextile_tokens = []

for obj in model_data.loc[model_data['textileProject'] == 1,'lobjective']:
    orig_reviews.append(obj)
    tokens = my_tokenizer(obj)
    textile_tokens.append(tokens)

    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
            
for obj in model_data.loc[model_data['textileProject'] == 0,'lobjective']:
    orig_reviews.append(obj)
    tokens = my_tokenizer(obj)
    nontextile_tokens.append(tokens)

    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1

In [88]:
import numpy as np
# note: from here on 'data' is the feature matrix, not the projects DataFrame
data = np.zeros((len(textile_tokens)+len(nontextile_tokens),len(word_index_map)+1))
i=0

for tokens in textile_tokens:
    x = tokens_to_vector(tokens,1)
    data[i,:] = x
    i +=1
    
for tokens in nontextile_tokens:
    x = tokens_to_vector(tokens,0)
    data[i,:] = x
    i +=1

In [89]:
from sklearn.utils import shuffle
orig_reviews, data = shuffle(orig_reviews, data)

In [90]:
data.shape

(388, 1464)

In [99]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier

X = data[:,:-1]
Y = data[:,-1]

X.shape
Y.shape

CV = CountVectorizer(encoding = 'LATIN1')
#Xf = CV.fit_transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, shuffle = True, train_size = 0.8)

# AdaBoost is the model actually fitted here; the alternatives are commented out
model = AdaBoostClassifier()
#model = MultinomialNB()
#model = LogisticRegression(solver = 'lbfgs')
model.fit(X_train, Y_train)

print("Classification rate for AdaBoost:", model.score(X_test, Y_test))

Classification rate for AdaBoost: 0.717948717948718
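
The introduction names Naive Bayes, while the cell above fits an `AdaBoostClassifier` and leaves `MultinomialNB` commented out. To make explicit what a multinomial Naive Bayes classifier computes, here is a minimal pure-Python sketch with Laplace smoothing. It illustrates the technique on a made-up toy corpus and is not scikit-learn's implementation. The names `train_nb`, `predict_nb` and `alpha` are assumptions.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed token likelihoods."""
    vocab = {t for doc in docs for t in doc}
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for doc in class_docs for t in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((counts[t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the class with the highest posterior log-probability."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Tokens never seen during training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Made-up toy corpus: 1 = textile, 0 = non-textile.
docs = [
    ["recycled", "cotton", "fibre"],
    ["smart", "garment", "fibre"],
    ["solar", "cell", "efficiency"],
    ["aircraft", "wing", "composite"],
]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, ["cotton", "garment"]))
print(predict_nb(model, ["solar", "wing"]))
```

The classifier depends entirely on tokens being shared between documents, so the quality of the tokenization step directly bounds what it can achieve.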
