
Documentation

Index

Stance4Health is a European Union Horizon 2020 project that fosters a global ecosystem for the collection, management, and analysis of healthcare and dietary data, with the aim of developing a Smart Personalized Nutrition service based on food production that optimizes gut microbiota activity. The service will be tailored to different target groups, from healthy children and adults to children with coeliac disease or food allergies, as well as overweight children and adults, a condition that has an impact on the development of NCDs such as obesity or type 2 diabetes. The personalised nutrition tools developed throughout Stance4Health will be based on robust scientific evidence and knowledge from fields such as nutrition, medicine, food science, microbiology and computer science, as well as social sciences and humanities such as economics, marketing, psychology and social anthropology.

In this document you can see how the project is being implemented, which technologies will be used, which programming technique is most appropriate, and how the data will be handled in terms of management and migration. This repository was created to collect information for the development of Stance4Health.

Index

How are developers going to be organized to implement the system? Through continuous integration. In development based on continuous integration, the developers implementing the system share their work, merging it with that of the rest of the team in an automatic and systematic way. The development is continuous because each developer merges their work constantly, every time they fix a bug or implement a new feature.

You can implement CI without third-party tools, but it becomes very tedious, because we need to keep control of the source code, manage versions, run unit tests, and so on. There are services that help with this, such as the git version control system, which is what we use for the project; thanks to it, the development becomes systematic. The project advances through changes: the modifications we make to the files tracked by the version control system, which are then published as commits so the repository is updated. We can also pull the changes that other members have pushed to the repository, keeping our workspace up to date. Git keeps information about what was modified, who modified it, when it was modified and, in some cases, why. In this way we have a history of all the changes in the project, which lets us move between versions and roll back if something goes wrong.

What can we keep in a version control system? Mainly the source code, but not only that: we can store anything digital, for example images, icons, sound, videos, binary files, libraries, and even the project documentation itself, which you are reading.

Git allows us to work on a local copy of the main project, without a connection. This makes it much faster than centralized alternatives such as Subversion (SVN) or Perforce, among others, whose commands need the server in order to complete. Git can fetch the repository information and then work offline, allowing you to compare, merge and view the logs of your branch or of the other branches of the repository, even if they are not checked out locally.

Index

Test-driven development (TDD) is the programming technique used to develop the system. The technique follows a short cycle:

  • First, write the tests.
  • Then, write the minimum code needed to make the tests pass.
  • Refactoring: eliminate duplicate code and unnecessary dependencies.

This technique gives us certain advantages, such as minimizing errors, implementing the software's functionality incrementally and producing modular software, among others. To improve efficiency and testing time, automated tests are often used; a minimal sketch of the cycle follows the list below. Their advantages:

  • Running a greater number of tests.
  • Greater frequency of testing.
  • Greater depth of testing.
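
Below is a minimal sketch of that cycle using Python's standard unittest module. The test module name and the function kcal_from_nutrients() are hypothetical, used only to illustrate the idea: the tests are written first, then just enough code to make them pass, then the code is refactored.

# test_energy.py -- a minimal TDD-style sketch using the standard unittest module.
# kcal_from_nutrients() is hypothetical: in TDD the tests below are written first,
# and this implementation is then written (and refactored) to make them pass.
import unittest


def kcal_from_nutrients(protein_g, carbs_g, fat_g):
    """Energy estimate using the usual 4/4/9 kcal-per-gram conversion factors."""
    if min(protein_g, carbs_g, fat_g) < 0:
        raise ValueError("nutrient amounts cannot be negative")
    return 4 * protein_g + 4 * carbs_g + 9 * fat_g


class TestEnergy(unittest.TestCase):
    def test_known_values(self):
        self.assertEqual(kcal_from_nutrients(0, 0, 0), 0)
        self.assertEqual(kcal_from_nutrients(10, 20, 5), 165)

    def test_rejects_negative_amounts(self):
        with self.assertRaises(ValueError):
            kcal_from_nutrients(-1, 0, 0)


if __name__ == '__main__':
    unittest.main()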

Using Travis CI

Travis CI is a continuous integration system that runs our system's tests every time a commit is pushed to GitHub. If the last commit broke something, it becomes easier to detect it, fix it or roll back to the previous commit/state. Tying Travis to the repository is very simple. In the main directory of the repository we have the following Python script: setup.py

# setup.py -- packaging/test configuration read when running `setup.py test`
from setuptools import setup, find_packages

setup(
    name='common',                                    # name of the repository/package
    url='https://github.com/Stance4Health-Dev/common',
    license='GNU General Public License v3.0',
    author='jimcase',
    keywords='test unittest common',
    description=u'Testing using unittest for common',
    packages=find_packages(include=['test_*.py']),    # pattern for the test modules to pick up
    python_requires='>=3',                            # Python version required to run the tests
)

With this script, the execution of the tests is ready. We need to import the 'setuptools' module, with which we establish the configuration and parameters for our specific repository: the name of the repository, its address on GitHub, the license, the author, the keywords of the repo, the description of the test domain, which tests we want to be executed according to a pattern in the file names, and the Python version with which we want to test the code. In addition, we will use Coverage, a tool for measuring code coverage of Python programs. It monitors the program, noting which parts of the code have been executed, and then analyzes the source to identify code that could have been executed but was not.[x]

To run the tests, we go to the directory containing setup.py, open a terminal and execute the following: coverage run --source=src setup.py test

Once we have the test execution ready, we need to link the Travis CI system with our repo. To do this, we sign in to the Travis platform and add our repository. But this is not enough: we also need to create a configuration script that Travis can read and execute as specified.

.travis.yml

language: python
python:
- '3.6'
- '3.5'

install:
- pip install coveralls
- pip install -r requirements.txt
- pip install setuptools
script:
- coverage run --source=src setup.py test
after_success:
- coveralls

In this script we tell Travis how it has to run the tests. It indicates the programming language, the supported versions (Travis will run the tests against each version in a separate virtual environment), the required modules, the command that executes the tests and, finally, a call to Coveralls, a web service that helps you track your code coverage over time and ensure that all new code is fully covered.[x]

With this configuration, Travis will now be able to verify our repository in each commit.

Travis CI repository status: https://www.travis-ci.org/Stance4Health-Dev/common

Build Status

Repository status in Coveralls: https://coveralls.io/github/Stance4Health-Dev/common

Coverage Status

Battery of tests

It is important to have a list of test cases that is complete, concrete and unambiguous. List of actions focused on passing the tests:

  • ...
  • ...

Some classic vulnerabilities to consider

  • Fuzzing: massive injection of random data into the entry points of the system, e.g. negative numbers, very large numbers, URLs, HTML text, etc.
  • SQL injection: unauthorized requests to databases.
  • XSS: injection of malicious JavaScript code, e.g. in web-based applications.
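
To make the fuzzing idea concrete, here is a rough sketch of a fuzz-style unit test in Python. The validate_quantity() function is hypothetical and stands in for any entry point of the system; the point is only to show random and edge-case inputs being thrown at it and to check that it never fails with an unexpected exception.

# test_fuzz_quantity.py -- a rough fuzzing sketch: feed random and edge-case inputs
# to an entry point and assert it never crashes with an unexpected exception.
# validate_quantity() is a hypothetical input-validation function used for illustration.
import random
import string
import unittest


def validate_quantity(raw):
    """Hypothetical entry point: parse a quantity in mg, rejecting bad input."""
    value = float(raw)          # raises ValueError/TypeError on garbage input
    if value < 0 or value > 1e6:
        raise ValueError("quantity out of range")
    return value


class TestFuzzQuantity(unittest.TestCase):
    def test_random_inputs_never_crash_unexpectedly(self):
        samples = [-1, 0, 10**12, "1e400", "", "<script>alert(1)</script>", "' OR '1'='1"]
        samples += ["".join(random.choice(string.printable) for _ in range(20))
                    for _ in range(100)]
        for raw in samples:
            try:
                validate_quantity(raw)
            except (ValueError, TypeError):
                pass            # rejecting bad input is the expected behaviour
            # any other exception propagates and fails the test


if __name__ == '__main__':
    unittest.main()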

Index

Python for data management

The system needs to handle a lot of information constantly, so we have chosen the Python programming language as the main development language. Python is an interpreted language, slower than others such as C++, but it only needs an interpreter to run, which makes it cross-platform. In addition, it comes pre-installed on many systems such as Linux or macOS. It is multi-paradigm: unlike other languages such as R, it can be oriented to different needs, such as object-oriented programming, modular development, functional programming and scripting. Scripts can be used for system administration tasks, tests, bug fixes and direct interaction with the database, among other useful purposes.

It is open source and is supported by a large community that keeps developing libraries and modules that make our work easier. Many of these libraries are oriented towards massive data computing (Awesome Python).

Python has a very broad and easy-to-read syntax, similar to pseudocode. It makes it easy to manipulate data in a tabular way. It is not necessary to declare the type of a variable: depending on the content a variable holds, it is of one type or another, which gives more flexibility when processing different types of data.
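
As a quick, illustrative sketch of the dynamic typing and table-style data handling mentioned above (the nutrient figures are made up):

# Dynamic typing and table-style handling of rows; the figures are illustrative only.
foods = [
    {"name": "onion", "vitamin_c_mg": 7.4, "energy_kcal": 40},
    {"name": "apple", "vitamin_c_mg": 4.6, "energy_kcal": 52},
]

# No type declarations: the same variable can hold a float and then a string.
value = foods[0]["vitamin_c_mg"]   # float
value = foods[0]["name"]           # now a str

# Filter and project rows, much like a small table query.
low_energy = [f["name"] for f in foods if f["energy_kcal"] < 50]
print(low_energy)                  # ['onion']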

Python 2.x vs 3.x

Both versions are incompatible, so we have chosen the 3.x branch for development because it is the most recent (3.6, released in 2016) compared with the latest 2.x release (2.7, from 2010), which will also lose support next year. Most 2.x libraries are already available for 3.x; in the following link we can see, in red, some of those that are not.

Index

The set of relationships between the components of a system forms its architecture. The hexagonal architecture is also known as the 'onion' architecture, the 'ports and adapters' architecture, or even the 'clean' architecture. It stands out for encapsulating the core of the system, making it agnostic of the outside world. It implements ports as the inputs and outputs of the system, which are its only way of relating to the outside. This architecture separates the system into three main components.

Hexagonal architecture diagram

  • Application: the user interface; the set of user interactions that the system receives as requests, e.g. HTTP requests sending and receiving JSON data.

  • Core:

    • Domain model: represents the objects and states of the system.
    • Domain services: the behavior of the system, represented by interfaces acting as ports; the abstraction that communicates with the external world.
    • Application services: the more specific behaviors of the system.
  • Infrastructure: contains the essential infrastructure; everything related to data storage, database management, use of the file system for storage, and dependency management.

For example, to digitize a food type we need to implement everything related to the food itself. Implementing the serialization that converts a food object to JSON is not the responsibility of the food itself, so that function should not live in the food class; instead, the food class should simply expose its data for easy reading. For this, a bridge module is used that takes care of these needs; in Python the 'json' and 'pickle' modules help us and save a lot of work. In the same way, when some function of the core is called from another system, through an HTTP request for example, the core is agnostic of who is calling and how. In terms of the database, the core does not care how data will be stored, whether in memory, in SQL or in graphs (a small sketch follows the list below). If the code is modularized respecting the limits of the domain, we get certain development characteristics:

  • Faster build time for each module.
  • Isolation: modules can be built and tested separately.
  • It makes your code easier to test.
  • More precise error detection and resolution.
  • Being so self-contained, modules can be transferred to another project.
  • Being agnostic of the database, the same scheme can combine different types of databases according to the problem to solve.
  • Independence from frameworks: the system should not depend on any particular framework or library.
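
As mentioned above, here is a minimal sketch of the ports-and-adapters idea, assuming a hypothetical Food domain object, a FoodRepository port and an in-memory adapter. All names are illustrative, not the project's actual classes; the point is that serialization and storage live outside the domain class.

import json
from abc import ABC, abstractmethod


class Food:
    """Core domain model: knows nothing about JSON, HTTP or databases."""
    def __init__(self, name, energy_kcal):
        self.name = name
        self.energy_kcal = energy_kcal


class FoodRepository(ABC):
    """Port: the core depends only on this interface, not on a concrete store."""
    @abstractmethod
    def save(self, food):
        pass

    @abstractmethod
    def find(self, name):
        pass


class InMemoryFoodRepository(FoodRepository):
    """Adapter in the infrastructure layer; could equally be SQL or a graph store."""
    def __init__(self):
        self._store = {}

    def save(self, food):
        self._store[food.name] = food

    def find(self, name):
        return self._store[name]


def food_to_json(food):
    """Bridge/adapter: serialization lives outside the Food class itself."""
    return json.dumps({"name": food.name, "energy_kcal": food.energy_kcal})


repo = InMemoryFoodRepository()
repo.save(Food("onion", 40.0))
print(food_to_json(repo.find("onion")))   # {"name": "onion", "energy_kcal": 40.0}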

Use Cases: Using Personas

Personas give us a tool for creating models that represent a user or group of users focused on a specific activity. With the publication in 1999 of the book 'The Inmates Are Running the Asylum', Alan Cooper began the approach that led to what we know as Personas today. Personas [4,5,6] are a practical way to design the interaction and user experience with the system. They are a good option when we have a complex design, since they allow us to distinguish without ambiguity between the functionalities and requirements that are necessary and those that are not. Users tend to be divided according to three criteria:

  • Goals.
  • Chores.
  • Skill level.

There are different criteria and ways to represent a future user, let's see an example:

|                  | Sofía Morales |
|------------------|---------------|
| Vital statistics | 24 years old, female, Puertollano, Spain |
| Diseases         | Coeliac disease |
| Job              | Nurse, currently working with cancer patients |
| Queries          | Usually buys the same products she already knows are gluten-free. Quick access to a database that allows her to check product components. Discover new recipes suitable for her. Notifications of new gluten-free restaurants. |
| Links            | |

You can find more user stories of the system here, intended to capture the prerequisites and needs of the future application.

[4] Cooper, A. 1999. The Inmates Are Running the Asylum. Indianapolis: Sams.
[5] Cooper, A., Reimann, R., Cronin, D. 2007. About Face 3: The Essentials of Interaction Design. Indianapolis: Wiley.
[6] Cooper, A. 2003. The Origin of Personas. Retrieved June 12, 2008, from http://www.cooper.com/journal/2003/08/the_origin_of_personas.html

Index

Requirements

We are facing one of the most important phases in the development of the project: the requirements constitute a complete description of the behavior of the system to be developed, and fulfilling those behaviors is their objective. The requirements specification acts as a link between developers, the client and future users, allowing them to maintain initial and constant communication during the development of the project and to attend to possible contingencies. To understand the business logic and its needs, it is not enough to intuit them; it is recommended to use a requirements-elicitation technique already established and proven by the industry (or a combination of techniques). In our case, we use the "Personas" technique described in the previous sections, where we simulate future users of the system based on evidence such as news, articles, surveys and studies, and extract their behaviors. From these behaviors we extract the different needs, which we express in the form of tests using TDD. Once the tests are satisfied, we are, in the same way, fulfilling the behavior that the system requires.

Here you can check the Behaviors based on Personas list:

Database

The industry started storing data in relational databases such as SQL systems (from 1970, at IBM) and in human-readable tables such as Excel (1985, Microsoft). Relational databases use a ledger-style representation and come with a key feature: we can identify each piece of data using a primary key (sometimes auto-generated) and establish relationships with other pieces using foreign keys.

Nowadays there has been a data explosion, of petabytes and exabytes, known as Big Data. We can store it on disk, but how do we interact with all this data? How do we manage this big volume of primary and foreign keys? Although we continue to use these tools in their newer versions, they are still based on the initial scheme, so they are not well suited to these needs. But we still need to store all this information in a system where relationships play a crucial role. Innovation reappeared, and databases based on graphs emerged.

Graph Database

The base idea of a graph database is that the data we already know and store has a relational structure and is easy to represent as a graph, like a social network. The whiteboard model becomes the physical model: what you draw on the whiteboard to represent the data model is represented on disk in exactly the same way, using graphs. What perhaps differentiates graphs from many other data modeling techniques is this close affinity between the logical and the physical model. Another interesting aspect of graph diagrams is that they tend to contain specific instances of nodes and relationships, rather than classes or archetypes.

There are three dominant graph data models: the labeled property graph, hypergraphs (which allow any number of nodes at either end of a relationship) and Resource Description Framework (RDF) triples, standardized by the W3C.

For this project we will develop the labeled property graph. This model has the following characteristics:

  • It contains nodes, relationships, properties, and labels.
  • Nodes contain properties (key-value pairs).
  • Nodes can be labeled with one or more labels.
  • Relationships are named and directed, and always have a start and end node.
  • Relationships can also contain properties.

Let's make an example to understand the characteristics above:

In the next image we can see a graph with two nodes: one is an "ingredient/food" (label 0) and, more specifically, label 1 tells us it is an "onion" object. The second node is a "nutrient" (label 0); more specifically, it is a "vitamin" (label 1) with the name "C" (label 2). Both nodes are well defined by their labels and properties.

In terms of relationships, the "onion" object starts an edge that ends in the "vitamin C" node, with the relationship name "contains" and a value equal to 19, which gives us the total amount of vitamin C in the onion object. According to the assigned value we can then know whether the amount of vitamin C is significant, always according to the business logic. The second relationship starts at the zinc node and ends at the apple node; it has the relationship name "make up", and tells us that zinc is part of the apple.

Graph example
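
To make the model concrete, here is a plain-Python sketch of the graph just described. The node ids, label names and the value 19 follow the example above; the structure is only illustrative, not the database's storage format.

# A labeled property graph sketched with plain Python structures (illustrative only).
nodes = {
    1: {"labels": ["ingredient/food", "onion"],
        "properties": {"name": "onion"}},
    2: {"labels": ["nutrient", "vitamin", "C"],
        "properties": {"name": "vitamin C", "unit": "mg"}},
}

relationships = [
    # (start node, relationship name, end node, relationship properties)
    (1, "contains", 2, {"amount": 19}),
]

# Every relationship is named, directed, joins a start and an end node,
# and can carry its own properties.
for start, name, end, props in relationships:
    print(nodes[start]["properties"]["name"], name,
          nodes[end]["properties"]["name"], props)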

Use cases and goals when developers evaluate databases:

  • Flexibility: a way to create and maintain your data in a logical way; not just the translation from code into database calls, but also the translation between the business logic describing the application's requirements and the developers satisfying those requirements. The flexibility of the graph model allows us to add new nodes and new relationships without compromising the existing network or migrating data: the original data and its intent remain intact. Because of the graph model's flexibility, we don't have to model our domain in exhaustive detail ahead of time, a practice that is all but foolhardy in the face of changing business requirements. The additive nature of graphs also means we tend to perform fewer migrations, thereby reducing maintenance overhead and risk.

  • Performance: in contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, graph database performance tends to remain relatively constant even as the dataset grows. This is because queries are localized to a portion of the graph: the execution time of each query is proportional only to the size of the part of the graph traversed to satisfy it, rather than to the size of the overall graph.

  • Agility: graph databases are schema-free, so they lack the kind of schema-oriented data governance mechanisms we are familiar with in the relational world. This is not a risk; rather, it calls for a far more visible and actionable kind of governance. Agility is another measure of speed: how easily and quickly can your code adapt to a changing business? Graph databases are in step with changing business environments.

Native graph technology

Native graph storage

Some graph databases use native graph storage that is optimized and designed for storing and managing graphs. The benefit of native graph storage is that its purpose-built stack is engineered for performance and scalability. Non-native graph storage, in contrast, typically relies on a mature non-graph backend (such as MySQL) whose production characteristics are well understood by operations teams; however, this non-native approach leads to higher latency because its storage layer is not optimized for graphs.

Native graph processing

Other graph databases use index-free adjacency, meaning that connected nodes physically "point" to each other in the database. Native graph processing (index-free adjacency) benefits traversal performance, but at the expense of making some queries that don't use traversals difficult or memory-intensive. In non-native databases, traversing the graph remains expensive, because each operation (Create, Read, Update and Delete, CRUD) requires an index lookup; aggregates (as in SQL) have no notion of locality, unlike graph databases, which naturally provide index-free adjacency. This cost is amplified when traversing deeper than one hop: friends are easy enough, but imagine trying to compute friends-of-friends (depth 2) or friends-of-friends-of-friends (depth 3) in real time. It will be slow because of the number of index lookups involved. Graphs, again, use index-free adjacency to ensure that traversing connected data is extremely fast. Graph databases expose the graph data model through a set of CRUD operations.

How the performance of traversing changes with the size of the dataset

A graph database provides a constant-order search for such queries. In our case, we simply find the node in the graph that represents the "apple" object (an "ingredient"/"food" type) and then follow its incoming nutrient relationships; these relationships lead to the "nutrient" nodes that "make up" the "apple" object. This is far cheaper than brute-forcing the result, because it considers far fewer members of the network: only those connected to "apple". Of course, if all nutrients make up the "apple" object, we will still end up considering the entire dataset.
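
As a rough illustration of why this is cheap (not the database's actual storage format), the sketch below keeps, for each node, direct references to its incident relationships, so the traversal only ever touches the part of the graph connected to "apple". The data are made up.

# Index-free adjacency, sketched with plain Python structures (illustrative data).
incoming_relationships = {
    "apple": [("zinc", "make up"), ("vitamin C", "make up")],
    "bread": [("vitamin B1", "make up")],
}

def nutrients_of(food):
    # No global index scan: start at the food node and follow only the
    # relationships attached to it, ignoring the rest of the dataset.
    return [nutrient for nutrient, rel in incoming_relationships.get(food, [])
            if rel == "make up"]

print(nutrients_of("apple"))   # ['zinc', 'vitamin C']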

Example: finding extended friends in a relational database versus Neo4j (source: Graph Databases, O'Reilly; via http://aitorrm.github.io/t%C3%A9cnicas%20y%20metodolog%C3%ADas/arquitectura_software_limpia/).

| Depth | RDBMS execution time (s) | Neo4j execution time (s) | Dataset (records) |
|-------|--------------------------|--------------------------|-------------------|
| 2     | 0.016                    | 0.01                     | ~2500             |
| 3     | 30.267                   | 0.168                    | ~110000           |
| 4     | 1543.505                 | 1.359                    | ~600000           |
| 5     | Unfinished               | 2.132                    | ~800000           |

Query language

Diagrams are great for describing graphs outside of any technology context, but when it comes to using a database, we need some other mechanism for creating, manipulating, and querying data. We need a query language. The two main paradigms of database query languages are imperative and declarative languages.

Imperative Query Languages

Imperative query languages are used to describe how you want something done, step by step; the sequence and wording of each line of code plays a critical role. Imperative database query languages can also be limiting and not very user-friendly, requiring extensive knowledge of the language and a deep technical understanding of physical implementation details prior to usage; writing one part incorrectly creates faulty outcomes. Examples: Gremlin, GraphQL (see the Neo4j-GraphQL integration, "Simplifying Data-Intensive Development", 2019).

Declarative Query Languages

Declarative query languages let users express what data to retrieve, leaving the engine underneath to take care of retrieving it seamlessly, rather than specifying how to do it. Using a declarative database query language may also result in better code than what can be created manually, and it is usually easier to understand the purpose of code written in a declarative language. Examples: Cypher (Neo4j), Gremlin.
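
As a small illustration, a declarative Cypher query can be sent from Python with the official neo4j driver. The connection details, labels, property names and the CONTAINS relationship below are assumptions for illustration only, not the project's final schema.

# Querying a graph declaratively with Cypher from Python (illustrative schema/values).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = (
    "MATCH (i:Ingredient {name: $name})-[r:CONTAINS]->(n:Nutrient) "
    "RETURN n.name AS nutrient, r.amount AS amount"
)

with driver.session() as session:
    for record in session.run(query, name="onion"):
        # We state *what* we want; the engine decides how to traverse the graph.
        print(record["nutrient"], record["amount"])

driver.close()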

Database objects

The objects/nodes of the graph represent the real-world objects that we want to digitize and store. For our food database we distinguish the following basic objects:

Menu Object: set of recipes that together form a menu.

| Labels | Properties |
|--------|------------|
| Menu   | Name: menu_name |
|        | Type: [breakfast, lunch, snack, dinner, other] |
|        | Approx. price: _ € |

Recipe object: a tutorial with a set of ingredients and properties that make up a recipe.

| Labels      | Properties |
|-------------|------------|
| Recipe      | Name: recipe_name |
| Recipe_name | Tutorial: string |
|             | Cooking time: min |
|             | Country: _ |

Ingredient object: the food itself. Any supermarket product, anything to which nutritional values can be attributed, is considered an ingredient/food.

| Labels                          | Properties |
|---------------------------------|------------|
| Ingredient/food_product         | Energy: _ kcal |
| Ingredient_name/commercial_name | Cholesterol: _ mg |
|                                 | Unit: _ |
|                                 | Quantity: _ |
|                                 | State: [solid, liquid, gas] |
|                                 | Alcohol: _ mg |

Nutrient object: There are many types and variants of nutrients, but we are not going to break down nutrients into more atomic components.

| Labels                              | Properties |
|-------------------------------------|------------|
| Nutrient                            | Unit: _ |
| Nutrient_type (nutrients_type.txt)  | Source: _ |
| Nutrient_name                       | |

Additive object: a non-nutritive contribution. There are many types and variants of additives.

| Labels                             | Properties |
|------------------------------------|------------|
| Aditive                            | Unit: _ |
| Aditive_type (aditives-types.txt)  | Source: _ |
| Aditive_name                       | |

The properties of each object above are not definitive: properties can be removed and new ones added without interfering with the operation of the graph. This leaves open a wide range of possibilities and variants within the same database.

Lists of different types of data:

  • aditives.txt
  • nutrients.txt
  • vitamins.txt
  • minerals.txt
  • aminoacids.txt
  • fatty-acids-saturated.txt
  • fatty-acids-unsatured.txt
  • countries.txt

Database relationships

The relational edges represent the relationships between the objects/nodes. For our food database we distinguish the following basic relationship labels:

| Labels  | Meaning |
|---------|---------|
| Contain | Carry or have [something] inside |
| Compose | Be part of something |
| ...     | ... |

These labels are what allow us to traverse the graph while respecting the coherence and logic of the business. More labels are added to the list as new relationships appear; the inclusion of new objects fosters new relationships.
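
As an illustrative sketch, the objects and relationship labels above could be created with Cypher from Python as follows. Property names loosely follow the tables above; the values and connection details are made-up assumptions, not real project data.

# Creating an Ingredient node, a Nutrient node and a Contain relationship
# (labels and properties loosely follow the tables above; values are illustrative).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(
        "CREATE (i:Ingredient {name: $ingredient, energy_kcal: $kcal, unit: 'g'}) "
        "CREATE (n:Nutrient:Vitamin {name: $nutrient, unit: 'mg'}) "
        "CREATE (i)-[:Contain {amount: $amount}]->(n)",
        ingredient="onion", kcal=40, nutrient="vitamin C", amount=7.4,
    )

driver.close()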

Improving relationships

Probabilistic Label Relation Graphs with Ising Models

Developing

At the moment, you can find information about:

Migration

Developing..

European Agreement (Convenio Europeo)

Translating..

Reglamento (UE) 1169/2011 .pdf (Regulation (EU) No 1169/2011)

  • Nutrient reference values.
  • Tables for the expression and presentation of nutrients.
  • Conversion-factor tables for the energy calculation of each nutrient.
  • Ideal 2000 kcal diet broken down by nutrients.

Index

References

  • https://blog.eizinger.io/5835/rust-s-custom-derives-in-a-hexagonal-architecture-incompatible-ideas
  • https://stackoverflow.com/questions/22587148/trying-to-understand-what-travis-ci-does-and-when-it-should-be-used
  • https://coverage.readthedocs.io/en/v4.5.x/
  • https://docs.coveralls.io/
  • https://blog.octo.com/en/hexagonal-architecture-three-principles-and-an-implementation-example/
  • http://www.dossier-andreas.net/software_architecture/ports_and_adapters.html
