In [1]:
import json

from gensim.models.doc2vec import Doc2Vec

from doc2vec_prep import stem_text

In [2]:
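def print_results(simdocs):
    # Print the title of each similar document. An id is looked up in
    # publications first and falls back to projects if it is not found there.
    for doc_id, sim in simdocs:
        try:
            print(publications[doc_id]['title'])
        except KeyError:
            print(projects[doc_id]['title'])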
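
print_results discards the similarity score and relies on the global publications and projects dictionaries. A minimal sketch of a variant that keeps the score and takes the collections as arguments is shown here. The name format_results and the toy ids and titles are hypothetical, and the sketch assumes each collection is a dict keyed by document id.

```python
def format_results(simdocs, *collections):
    """Return one 'score  title' line per (doc_id, similarity) pair."""
    lines = []
    for doc_id, sim in simdocs:
        # Use the first collection that contains the id, else a placeholder.
        title = next(
            (c[doc_id]['title'] for c in collections if doc_id in c),
            f'<unknown id {doc_id}>',
        )
        lines.append(f'{sim:.3f}  {title}')
    return lines


# Toy data standing in for the real JSON contents (hypothetical ids and titles).
toy_publications = {'pub1': {'title': 'A toy publication'}}
toy_projects = {'proj1': {'title': 'A toy project'}}
toy_simdocs = [('pub1', 0.91), ('proj1', 0.72), ('missing', 0.5)]

for line in format_results(toy_simdocs, toy_publications, toy_projects):
    print(line)
```

Printing the score next to each title makes it easier to judge how much weight a match deserves.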

In [3]:
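def read_input(filename, doc_type):
    # doc_type is only used in the log message (renamed from `type`,
    # which shadowed the built-in).
    with open(filename) as f:
        docs = json.load(f)
    print(f'Loaded {len(docs)} {doc_type}')
    return docs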

In [4]:
publications_file = 'epcc_inf_publications.json'
projects_file = 'epcc_inf_projects.json'
staff_file = 'epcc_inf_staff.json'

publications = read_input(publications_file, 'publications')
projects = read_input(projects_file, 'projects')

Loaded 19451 publications
Loaded 2202 projects


In [5]:
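def similar_docs(model, text):
    # Infer a vector for the unseen text, then return the (doc_id, similarity)
    # pairs of the closest training documents (10 by default).
    vector = model.infer_vector(stem_text(text))
    # model.docvecs is the gensim 3.x attribute; gensim 4 renamed it to model.dv.
    simdocs = model.docvecs.most_similar(positive=[vector])
    return simdocs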
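
most_similar ranks the stored document vectors by cosine similarity to the query vector. The ranking step can be sketched in plain Python as follows. This is an illustration on made-up two-dimensional vectors, not gensim's implementation, and the names cosine, rank_by_cosine and toy_vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_by_cosine(query, doc_vectors, topn=10):
    """Return the topn (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors; real Doc2Vec vectors have many more dimensions.
toy_vectors = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}
ranked = rank_by_cosine([1.0, 0.1], toy_vectors, topn=2)
print(ranked)
```

The query points almost along the first axis, so 'a' ranks first, followed by 'c'.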

In [6]:
dare = '''
DARE will deliver a new working environment for the teams of professionals wrestling with the challenge of extreme data, computing and complexity. It will present methods, in abstract terms, so that domain experts can understand, change and use them effectively. It will provide a set of tools that visualise the runs of these methods in summary form still without distracting technical detail. Those tools will allow drill down for diagnostics and validation, and help with the organisation of campaigns involving multiple runs and immense amount of data. This holistic abstract presentation together with automation that eliminates chores will push back the complexity barrier, accelerate innovation and improve the productivity of our hard-pressed expert teams. The data-scale barrier will be pushed by a combination of optimised mappings and automation. To achieve this, we depend on learning the critical parameters in the cost functions dynamically, taking into account data movement, storage costs, limits and other resource costs in formulae weighted by community choices and priorities. The computational scale barrier will be pushed by a similar strategy. However, the methods we enable often have a mixture of computationally challenging parts and data challenging parts, best allocated to different platforms. In today’s R&D the practitioners have to organise this and the inherent data movement themselves. DARE’s optimised mappings will automatically partition parts of the work to different platforms and organise the coupled use of those platforms including any necessary data movements and adaptations. Most professional R&D requires sustained use of such methods. Sustaining their meaning across platforms means that working practices do not need to change and that the original investment in learning and in method development is retained. 
DARE will work with two research infrastructures: EPOS (European Plate Observing System) and IS-ENES (Infrastructure for the European Network of Earth System Modelling), engaging in the co-design and production use of extreme methods that address these challenges. With our partners, we will show:Accelerated innovation in the face of all three extremes. Significantly increased productivity for expert teams and a wide range of users. Substantial advances in the science and applications achievable in campaigns.
'''

In [7]:
rse_fellowship = '''
This call will support Research Software Engineer (RSE) Fellowships for a period of up to five years. The RSE Fellowship describes exceptional individuals in the software field, who demonstrate leadership and have combined expertise in programming and a solid knowledge of the research environment. The Research Software Engineer works with researchers to gain an understanding of the problems they face, and then develops, maintains and extends software to provide the answers.
As well as having expertise in computational software development and engineering, the RSE Fellow should be an ambassador for the research software community and have the potential to be a future research leader in the RSE community.'''

In [8]:
model_cbow = Doc2Vec.load('cbow.model', mmap='r')

In [9]:
# test CBOW model
simdocs = similar_docs(model_cbow, dare)
print_results(simdocs)

Establishing Core Concepts for Information-Powered Collaborations
Providing Dependability and Resilience in the Cloud: Challenges and Opportunities
A scientist's guide to cloud computing
dispel4py: An Agile Framework for Data-Intensive methods using HPC
DIALOGUE Data intergration applications: Linking organisations to gain understanding & Experience
iPregel: Strategies to Deal with an Extreme Form of Irregularity in Vertex-Centric Graph Processing
Navigating the Landscape for Real-time Localisation and Mapping for Robotics, Virtual and Augmented Reality
A characterization of workflow management systems for extreme-scale applications
Computing with Structured Connectionist Networks
Comprehensible Control for Researchers and Developers facing Data Challenges


In [10]:
# test CBOW model
simdocs = similar_docs(model_cbow, rse_fellowship)
print_results(simdocs)

ADA Lovelace Computer Scientist
CHSS Mid-Career Research Development Fellowship
Computational Modelling of Mathematical Reasoning
RSE Travel assistance grant
Computational Modelling of Mathematical Reasoning
Computational Modelling of Mathematical Reasoning
Computational Modelling of Mathematical Reasoning
Computational Modelling of Mathematical Reasoning
Computational Modelling of Mathematical Reasoning
Computational Modelling of Mathematical Reasoning
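
The result above lists "Computational Modelling of Mathematical Reasoning" seven times. The ids are not printed, so it is only an assumption that several distinct documents share that title. A minimal sketch of collapsing repeated titles while keeping the ranking order follows; unique_titles is a hypothetical helper, and the list reuses the titles printed above.

```python
def unique_titles(titles):
    """Drop repeated titles, keeping the first occurrence of each in order."""
    seen = set()
    unique = []
    for title in titles:
        if title not in seen:
            seen.add(title)
            unique.append(title)
    return unique


# Titles copied from the output above.
cbow_rse_titles = [
    'ADA Lovelace Computer Scientist',
    'CHSS Mid-Career Research Development Fellowship',
    'Computational Modelling of Mathematical Reasoning',
    'RSE Travel assistance grant',
] + ['Computational Modelling of Mathematical Reasoning'] * 6

for title in unique_titles(cbow_rse_titles):
    print(title)
```

Deduplicating by title hides how many ids matched, so printing the ids alongside may be preferable when diagnosing the model.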


In [11]:
model_dmv1 = Doc2Vec.load('dmv1.model', mmap='r')

In [12]:
# test the model loaded from dmv1.model
simdocs = similar_docs(model_dmv1, dare)
print_results(simdocs)

Asterism: Pegasus and dispel4py hybrid workflows for data-intensive science
Autoencoding Variational Inference for Topic Models
Strategies to promote adoption and usage of an application to support asthma self-management:
Data Wrangling for Big Data: Challenges and Opportunities
Learning Continuous Semantic Representations of Symbolic Expressions
DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud
EGEE: Building a pan-European Grid Training Organisation
Working time reduction policy in a sustainable economy
Sonification of Gestures Using Specknets
Observations and Research Challenges in Map Generalisation and Multiple Representation


In [13]:
# test the model loaded from dmv1.model
simdocs = similar_docs(model_dmv1, rse_fellowship)
print_results(simdocs)

Visiting Fellowship - Ozsoy
Constructing, Selecting and Repairing Representations of Knowledge
Compilers for High Performance
Beyond bitext: Five open problems in machine translation
Dealing With Software: the Research Data Issues
EU-BRIDGE MT: Combined Machine Translation
Computer Animation
What can national data sets tell us about inclusion and pupil achievement?
Statistical Techniques for Translating to Morphologically Rich Languages (Dagstuhl Seminar 14061)
ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom - June 09 - 11, 2014


In [14]:
model_dmv2 = Doc2Vec.load('dmv2.model', mmap='r')

In [15]:
# test the model loaded from dmv2.model
simdocs = similar_docs(model_dmv2, dare)
print_results(simdocs)

Ad hockery in secondhand markets, design and ethnomethodological studies
Building the Peeragogy Accelerator
Journeys in mathematical landscapes: genius or craft?
Categories, Software and Meaning
Human instruments, imagined return
Reducing Construction Carbon Emissions in Logistics (ReCCEL)
Enhancing Student Employability with Simulation: The Virtual Oil Rig and DART
The power of synthetic biology for bioproduction, remediation and pollution control
Unexpected encounters with Deep Time
Pass the Bucks: Credit, Blame, and the Global Competition for Investment


In [16]:
# test the model loaded from dmv2.model
simdocs = similar_docs(model_dmv2, rse_fellowship)
print_results(simdocs)

Strategies and Policies to Support and Advance Education in e-Science
An Interview with Ivette Gomes
Exploring new business models for monetising digitisation beyond image licensing to promote adoption of OpenGLAM
Negotiating in a brave new world: Challenges and opportunities for the field of negotiation science
Journeys in mathematical landscapes: genius or craft?
E-science Core Programme Senior Research Fellow
Picture-Book Professors
Hardy, Littlewood and polymath
Mapping the biosphere
RSE Travel assistance grant
