# Venn Diagram Generation

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_


__License:__ Apache 2.0

__Acknowledgements:__ This work was funded as part of the [ELIXIR Interoperability Platform](https://elixir-europe.org/platforms/interoperability) Strategic Implementation Study [Exploiting Bioschemas Markup to Support ELIXIR Communities](https://elixir-europe.org/about-us/commissioned-services/exploiting-bioschemas-markup-support-elixir-communities). This notebook builds upon the work conducted during the Virtual BioHackathon-Europe 2020 reported in [BioHackrXiv](https://biohackrxiv.org/v3jct/).

## Introduction

This notebook plots a Venn diagram showing how the proteins in the three IDP data sources (DisProt, MobiDB and PED) overlap.

In [None]:
import json
import rdflib
from rdflib import ConjunctiveGraph, plugin
from rdflib.serializer import Serializer
from matplotlib import pyplot as plt
from matplotlib_venn import venn3

In [None]:
# Read query in from file
queryFile = 'proteins/proteins-by-dataset-groupings.rq'
print(f'Reading query {queryFile} ...')
with open('../queries/'+queryFile) as f:
    query = f.read()

## Query Execution

The query that gathers the data for the Venn diagram can be executed either against the SWeL server or against a local in-memory RDFLib store.

The default, and fastest, option is the SWeL triplestore, which needs no code changes. It does require that the version of the dataset in the triplestore matches the version you want to generate the figure for.

To use a local store instead, change the type of the following cell to a Code cell and execute it. The data is expected to be in the file `IDPKG-Full.nq`, which should correspond to the output of the generation notebook.
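
For orientation, a local-store query could look roughly like the sketch below. It is an illustrative assumption rather than the notebook's own cell: `run_local_query` is a hypothetical helper, the location of `IDPKG-Full.nq` is assumed to be the working directory, and `ConjunctiveGraph` is used only because it is the class this notebook imports.

```python
import json


def run_local_query(query, data_file="IDPKG-Full.nq"):
    """Run a SPARQL query against an in-memory RDFLib store loaded from N-Quads.

    Returns a dict in the SPARQL JSON results layout, so code written for the
    remote endpoint's response can consume it unchanged.
    """
    # Imported here so rdflib is only required when the local path is used.
    from rdflib import ConjunctiveGraph

    graph = ConjunctiveGraph()
    # ASSUMPTION: data_file points at the N-Quads dump; adjust to your layout.
    graph.parse(data_file, format="nquads")
    result = graph.query(query)
    # Serialise to SPARQL JSON, then parse back into plain Python structures.
    return json.loads(result.serialize(format="json"))
```

With this helper, `results = run_local_query(query)` would stand in for the remote call. Loading the full dump into memory is likely to be noticeably slower than querying the endpoint.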

In [None]:
# Set up querying through the remote SPARQL endpoint
from SPARQLWrapper import SPARQLWrapper, POST, JSON

endpoint = "https://swel.macs.hw.ac.uk/data/repositories/idpkg"
print(f"Executing query against external endpoint:\n\t{endpoint}")

sparql = SPARQLWrapper(endpoint)
sparql.setReturnFormat(JSON)  # request SPARQL JSON results
sparql.setMethod(POST)        # send the query in the request body
sparql.setQuery(query)
results = sparql.queryAndConvert()  # response parsed into a Python dict
print(f"Number of Results: {len(results['results']['bindings'])}")
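
The plotting step relies on the response containing one row for each of the seven region descriptions. The sketch below illustrates the response layout and a check for absent rows. `EXPECTED_REGIONS`, `missing_regions` and `sample_results` are hypothetical names introduced here. The two sample rows reuse counts from the 2021-09-28 figures quoted in the plotting cell, and the `datatype` field is an assumption about how the endpoint types the count.

```python
# Region descriptions that the plotting cell looks up in the query results.
EXPECTED_REGIONS = [
    "MobiDB \\ (DisProt U PED)",
    "DisProt \\ (MobiDB U PED)",
    "(DisProt n MobiDB) \\ PED",
    "PED \\ (DisProt U MobiDB)",
    "(MobiDB n PED) \\ DisProt",
    "(DisProt n PED) \\ MobiDB",
    "DisProt n MobiDB n PED",
]

# ASSUMPTION: a trimmed two-row response in the SPARQL JSON results layout.
sample_results = {
    "head": {"vars": ["description", "count"]},
    "results": {
        "bindings": [
            {
                "description": {"type": "literal", "value": "DisProt n MobiDB n PED"},
                "count": {
                    "type": "literal",
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "value": "44",
                },
            },
            {
                "description": {"type": "literal", "value": "PED \\ (DisProt U MobiDB)"},
                "count": {
                    "type": "literal",
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "value": "34",
                },
            },
        ]
    },
}


def missing_regions(results):
    """Return the expected region descriptions that have no row in results."""
    found = {row["description"]["value"] for row in results["results"]["bindings"]}
    return [label for label in EXPECTED_REGIONS if label not in found]


# Lists the five regions that the trimmed sample leaves out.
print(missing_regions(sample_results))
```

Running `missing_regions(results)` on the real response before plotting gives a readable list of absent regions, which is easier to diagnose than a `StopIteration` raised by a failed lookup.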

## Venn Diagram Creation

The results of the query are now used to plot the Venn diagram.

The diagram is written out to `venn.png`.

In [None]:
# Extract the count for each region of the diagram.
# Naming follows the venn3 subsets order: upper case = inside the set,
# lower case = outside it (A = MobiDB, B = DisProt, C = PED).
def region_count(description):
    """Return the count of the result row with the given description."""
    return int(next(i['count']['value'] for i in results['results']['bindings']
                    if i['description']['value'] == description))

# Backslashes are escaped so each string contains a single literal '\'
Abc = region_count('MobiDB \\ (DisProt U PED)')
aBc = region_count('DisProt \\ (MobiDB U PED)')
ABc = region_count('(DisProt n MobiDB) \\ PED')
abC = region_count('PED \\ (DisProt U MobiDB)')
AbC = region_count('(MobiDB n PED) \\ DisProt')
aBC = region_count('(DisProt n PED) \\ MobiDB')
ABC = region_count('DisProt n MobiDB n PED')

# plot Venn diagram
plt.figure(figsize=(11, 9))
venn3(subsets=(Abc, aBc, ABc, abC, AbC, aBC, ABC), set_labels=('MobiDB', 'DisProt', 'PED'))
# Venn for the 2021-09-28 dataset (hard-coded counts)
# venn3(subsets=(624, 586, 1401, 34, 5, 7, 44), set_labels=('MobiDB', 'DisProt', 'PED'))
plt.savefig('venn.png')
plt.show()
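
As a sanity check on the seven region counts passed to `venn3`, the size of each data source and of their union can be recovered by summing the regions that belong to it. The sketch below does this for the 2021-09-28 counts from the commented-out call above. The names `totals` and `union` are introduced here for illustration only.

```python
# Region counts for the 2021-09-28 dataset, in venn3 subsets order:
# (Abc, aBc, ABc, abC, AbC, aBC, ABC) with A = MobiDB, B = DisProt, C = PED.
Abc, aBc, ABc, abC, AbC, aBC, ABC = 624, 586, 1401, 34, 5, 7, 44

# Each source is the sum of the four regions whose name has its letter in upper case.
totals = {
    "MobiDB": Abc + ABc + AbC + ABC,
    "DisProt": aBc + ABc + aBC + ABC,
    "PED": abC + AbC + aBC + ABC,
}

# The regions are disjoint, so the union is the sum of all seven.
union = Abc + aBc + ABc + abC + AbC + aBC + ABC

for name, size in totals.items():
    print(f"{name}: {size}")
print(f"Union: {union}")
```

Comparing these totals with independent per-source counts is a quick way to spot a region that was assigned to the wrong position in the `subsets` tuple.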