# MIxS to RDF
### This notebook demonstrates how to use the mixs_to_rdf library to convert MIxS spreadsheets to RDF.

## Load mixs_to_rdf library
* #### To import the library, add its directory to `sys.path`.
* #### rdflib is needed to work with the output graphs.

In [None]:
import os, sys
sys.path.append(os.path.abspath('../../code/mixs_to_rdf/')) # add the mixs_to_rdf module directory to sys.path
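
A relative path like this only resolves correctly when the notebook is run from its own directory, and re-running the cell appends the same entry again. A minimal sketch of a slightly more defensive variant, assuming the same `../../code/mixs_to_rdf/` layout; the helper name `add_to_sys_path` is hypothetical and not part of the library:

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Resolve `directory` and append it to sys.path once; return the resolved path."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:  # avoid duplicate entries when the cell is re-run
        sys.path.append(resolved)
    return resolved


# Same relative location the notebook uses; adjust to your checkout.
module_dir = add_to_sys_path("../../code/mixs_to_rdf/")
if not Path(module_dir).is_dir():
    print(f"warning: {module_dir} does not exist; importing mixs_file_to_rdf would fail")
```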

In [2]:
from mixs_file_to_rdf import mixs_package_file_to_rdf, mixs_package_directory_to_rdf
from rdflib import Graph

## Review help information for mixs_package_file_to_rdf function.

In [3]:
help(mixs_package_file_to_rdf)

Help on function mixs_package_file_to_rdf in module mixs_file_to_rdf:

mixs_package_file_to_rdf(file_name, mixs_version, package_name='', term_type='class', file_type='excel', sep='\t', base_iri='https://gensc.org/mixs#', ontology_iri='https://gensc.org/mixs.owl', output_file='', ontology_format='turtle', print_output=False)
    Builds an ontology (rdflib graph) from a MIxS package file.
    
    Args:
        file_name: The name of MIxS package file.
        mixs_version: The version of MIxS package.
        package_name: Overrides the package name provided in the package Excel spreadsheet.
                      This argument if necessary when using a file is a csv or tsv.
        term_type: Specifies if the MIxS terms will be represented as classes or data properties.
                   Accepted values: 'class', 'data property'
                   Default: 'class'
        file_type: The type of file being processed. 
                   If file type is not 'excel', a field separator/de

## Test creating RDF versions of the MIxS-air, version 5, package.
#### RDF files are output to the output directory.
* #### test_classes.ttl will contain MIxS terms converted to classes.
* #### test_dataproperties.ttl will contain MIxS terms converted to data properties.

In [4]:
test_file = "../../mixs_data/mixs_v5_packages/MIxSair_20180621.xlsx"

In [5]:
graph_cls = mixs_package_file_to_rdf(test_file, 5, output_file='output/test_classes.ttl')

In [6]:
graph_dp = mixs_package_file_to_rdf(test_file, 5, term_type='data property', output_file='output/test_dataproperties.ttl')
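
Both calls write into an `output/` directory. The help text does not say whether the library creates that directory, so one cautious option is to create it before passing the path. A minimal sketch; the helper name `ensure_parent_dir` is an assumption for illustration:

```python
from pathlib import Path


def ensure_parent_dir(output_file):
    """Create the directory that will hold `output_file` if it is missing."""
    path = Path(output_file)
    # parents=True creates intermediate directories; exist_ok=True makes re-runs harmless
    path.parent.mkdir(parents=True, exist_ok=True)
    return str(path)


# Possible use with the notebook's call (commented so the sketch has no side effects):
# graph_cls = mixs_package_file_to_rdf(
#     test_file, 5, output_file=ensure_parent_dir('output/test_classes.ttl'))
```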

## Test creating RDF versions of all MIxS version 4 & 5 packages from the specified directories.
#### RDF files are output to the output directory.
* #### mixs_package_class.ttl will contain MIxS terms converted to classes.
* #### mixs_package_dp.ttl will contain MIxS terms converted to data properties.

## Review help information for mixs_package_directory_to_rdf function.

In [7]:
help(mixs_package_directory_to_rdf)

Help on function mixs_package_directory_to_rdf in module mixs_file_to_rdf:

mixs_package_directory_to_rdf(package_directory, mixs_version, term_type='class', file_name_start='M', file_type='excel', sep='\t', base_iri='https://gensc.org/mixs#', ontology_iri='https://gensc.org/mixs.owl', output_file='', ontology_format='turtle', print_output=False)
    Builds an ontology (rdflib graph) from each Excel file in the specified package diretory.
    The graph retured is a union of each of the graphs builts from each file.
    
    Args:
        package_directory: The directory containing the MIxS package files.
        mixs_version: The version of MIxS package.
        package_name: Overrides the package name provided in the package Excel spreadsheet.
                      This argument if necessary when using a file is a csv or tsv.
        term_type: Specifies if the MIxS terms will be represented as classes or data properties.
                   Accepted values: 'class', 'data property'
  

In [8]:
version_4_dir = '../../mixs_data/mixs_v4_packages/'
version_5_dir = '../../mixs_data/mixs_v5_packages/'
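
Before processing a whole directory, it can help to see which files are likely to be picked up. The sketch below lists spreadsheet files whose names start with a given prefix, mirroring the `file_name_start='M'` default in the signature above. How the library actually selects files is not shown in this notebook, so the helper `list_package_files` and its extension filter are assumptions:

```python
from pathlib import Path


def list_package_files(package_directory, file_name_start="M",
                       extensions=(".xls", ".xlsx")):
    """Return sorted names of files matching a name prefix and an extension."""
    directory = Path(package_directory)
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file()
        and p.name.startswith(file_name_start)
        and p.suffix.lower() in extensions
    )
```

It could be called as `list_package_files(version_4_dir)` or `list_package_files(version_5_dir)` to preview the inputs.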

#### First create version with terms as classes.
**NB:** The base IRI is changed to `https://gensc.org/mixs-class#`

In [9]:
mixs_4_package_class_graph = mixs_package_directory_to_rdf(version_4_dir, 4, base_iri="https://gensc.org/mixs-class#")

processing: MIxSwastesludge_210514.xls
processing: MIxSplantassoc_210514.xls
processing: MIxShumanvaginal_210514.xls
processing: MIxSmisc_210514.xls
processing: MIxSwater_210514.xls
processing: MIxSbuiltenv_210514.xls
processing: MIxSmatbiofilm_210514.xls
processing: MIxShumanoral_210514.xls
processing: MIxShumanassoc_210514.xls
processing: MIxSsoil_210514.xls
processing: MIxShumangut_210514.xls
processing: MIxSsediment_210514.xls
processing: MIxSair_210514.xls
processing: MIxShostassoc_210514.xls
processing: MIxShumanskin_210514.xls


In [10]:
mixs_5_package_class_graph = mixs_package_directory_to_rdf(version_5_dir, 5, base_iri="https://gensc.org/mixs-class#")

processing: MIxShumanskin_20180621.xlsx
processing: MIxSwater_20180621.xlsx
processing: MIxShydrocarbcores_20180621.xlsx
processing: MIxShumangut_20180621.xlsx
processing: MIxSair_20180621.xlsx
processing: MIxShumanoral_20180621.xlsx
processing: MIxShydrocarbfs_20180621.xlsx
processing: MIxSbuiltenv_20180621.xlsx
processing: MIxShumanassoc_20180621.xlsx
processing: MIxSsoil_20180621.xlsx
processing: MIxSsediment_20180621.xlsx
processing: MIxShostassoc_20180621.xlsx
processing: MIxSwastesludge_20180621.xlsx
processing: MIxShumanvaginal_20180621.xlsx
processing: MIxSplantassoc_20180621.xlsx
processing: MIxSmatbiofilm_20180621.xlsx
processing: MIxSmisc_20180621.xlsx


#### Merge MIxS 4 & 5 class graphs and save output

In [11]:
# Adding two rdflib graphs returns a new graph with the union of their triples
mixs_package_class_graph = mixs_4_package_class_graph + mixs_5_package_class_graph

## save output
mixs_package_class_graph.serialize(format='turtle', destination='output/mixs_package_class.ttl')

#### Next create version with terms as data properties.
**NB:** The base IRI is changed to `https://gensc.org/mixs-data-property#`

In [12]:
mixs_4_package_dp_graph = mixs_package_directory_to_rdf(version_4_dir, 4, term_type='data property', base_iri="https://gensc.org/mixs-data-property#")

processing: MIxSwastesludge_210514.xls
processing: MIxSplantassoc_210514.xls
processing: MIxShumanvaginal_210514.xls
processing: MIxSmisc_210514.xls
processing: MIxSwater_210514.xls
processing: MIxSbuiltenv_210514.xls
processing: MIxSmatbiofilm_210514.xls
processing: MIxShumanoral_210514.xls
processing: MIxShumanassoc_210514.xls
processing: MIxSsoil_210514.xls
processing: MIxShumangut_210514.xls
processing: MIxSsediment_210514.xls
processing: MIxSair_210514.xls
processing: MIxShostassoc_210514.xls
processing: MIxShumanskin_210514.xls


In [13]:
mixs_5_package_dp_graph = mixs_package_directory_to_rdf(version_5_dir, 5, term_type='data property', base_iri="https://gensc.org/mixs-data-property#")

processing: MIxShumanskin_20180621.xlsx
processing: MIxSwater_20180621.xlsx
processing: MIxShydrocarbcores_20180621.xlsx
processing: MIxShumangut_20180621.xlsx
processing: MIxSair_20180621.xlsx
processing: MIxShumanoral_20180621.xlsx
processing: MIxShydrocarbfs_20180621.xlsx
processing: MIxSbuiltenv_20180621.xlsx
processing: MIxShumanassoc_20180621.xlsx
processing: MIxSsoil_20180621.xlsx
processing: MIxSsediment_20180621.xlsx
processing: MIxShostassoc_20180621.xlsx
processing: MIxSwastesludge_20180621.xlsx
processing: MIxShumanvaginal_20180621.xlsx
processing: MIxSplantassoc_20180621.xlsx
processing: MIxSmatbiofilm_20180621.xlsx
processing: MIxSmisc_20180621.xlsx


#### Merge MIxS 4 & 5 data property graphs and save output

In [14]:
# Adding two rdflib graphs returns a new graph with the union of their triples
mixs_package_dp_graph = mixs_4_package_dp_graph + mixs_5_package_dp_graph

## save output
mixs_package_dp_graph.serialize(format='turtle', destination='output/mixs_package_dp.ttl')

## Test SPARQL queries on ontologies
#### As an example, I'll use the class version of MIxS terms.
#### Note: rdflib is not the best library for running queries because it is slow, but it is fine for demonstration purposes.

### Find the first five terms and their labels

In [15]:
query = """
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix mixs: <https://gensc.org/mixs-class#>

select 
    ?iri ?label
where {
    ?iri rdfs:subClassOf mixs:mixs_term ;
         rdfs:label ?label .
}
limit 5
"""

In [16]:
results = mixs_package_class_graph.query(query)

In [17]:
for r in results:
    print(f"""{r.iri:60}  {r.label}""")

https://gensc.org/mixs-class#part_org_nitro                   particulate organic nitrogen
https://gensc.org/mixs-class#wall_area                        wall area
https://gensc.org/mixs-class#root_med_carbon                  rooting medium carbon
https://gensc.org/mixs-class#diether_lipids                   diether lipids
https://gensc.org/mixs-class#lib_size                         library size


### Find the number of terms in versions 4 & 5

In [18]:
query = """
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix mixs: <https://gensc.org/mixs-class#>

select
    (count (?iri_v4) as ?num_v4) 
    (count (?iri_v5) as ?num_v5)
where {
    {
        ?iri_v4 rdfs:subClassOf mixs:mixs_term ;
                mixs:mixs_version ?version .
        filter (?version = 4)
    } union {
        ?iri_v5 rdfs:subClassOf mixs:mixs_term ;
                mixs:mixs_version ?version .
        filter (?version = 5)
    }
}
"""

In [19]:
results = mixs_package_class_graph.query(query)

In [20]:
for r in results:
    print(f"""
        number of mixs 4 terms:  {r.num_v4}
        number of mixs 5 terms:  {r.num_v5}
    """)


        number of mixs 4 terms:  343
        number of mixs 5 terms:  601
    


### Find terms that are in both versions 4 & 5

In [21]:
query = """
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix mixs: <https://gensc.org/mixs-class#>

select     
    ?iri ?version_4 ?version_5
where {
    ?iri rdfs:subClassOf mixs:mixs_term ;
         mixs:mixs_version ?version_4, ?version_5 .
    values (?version_4 ?version_5) { (4 5) }
}
limit 5
"""

In [22]:
results = mixs_package_class_graph.query(query)

In [23]:
for r in results:
    print(f"""{r.iri:60}  {r.version_4}  {r.version_5}""")

https://gensc.org/mixs-class#part_org_nitro                   4  5
https://gensc.org/mixs-class#diether_lipids                   4  5
https://gensc.org/mixs-class#lib_size                         4  5
https://gensc.org/mixs-class#chlorophyll                      4  5
https://gensc.org/mixs-class#gynecologic_disord               4  5


### Find total number of terms that are in both versions 4 & 5

In [24]:
query = """
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix mixs: <https://gensc.org/mixs-class#>

select     
    (count (?iri) as ?num) 
where {
    ?iri rdfs:subClassOf mixs:mixs_term ;
         mixs:mixs_version ?version_4, ?version_5 .
    values (?version_4 ?version_5) { (4 5) }
}
"""

In [25]:
results = mixs_package_class_graph.query(query)

In [26]:
for r in results:
    print(f"""number of mixs terms in version 4 & 5:  {r.num}""")

number of mixs terms in version 4 & 5:  329
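
Combining the three counts reported above gives the number of terms unique to each version. The arithmetic below assumes each count refers to distinct term IRIs, which the queries suggest but do not guarantee:

```python
# Counts reported by the queries above
num_v4 = 343    # terms with mixs_version 4
num_v5 = 601    # terms with mixs_version 5
num_both = 329  # terms with both versions

only_v4 = num_v4 - num_both                  # 14
only_v5 = num_v5 - num_both                  # 272
total_distinct = num_v4 + num_v5 - num_both  # 615 (inclusion-exclusion)

print(only_v4, only_v5, total_distinct)  # 14 272 615
```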
