Introduction to Pygenprop
=========================
A Python library for interactive programmatic usage of Genome Properties
------------------------------------------------------------------------

InterProScan files used in this tutorial can be found at:
- https://raw.githubusercontent.com/Micromeda/pygenprop/master/docs/source/_static/tutorial/E_coli_K12.tsv
- https://raw.githubusercontent.com/Micromeda/pygenprop/master/docs/source/_static/tutorial/E_coli_O157_H7.tsv
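
The parsing cells further down open these files by name from the working directory. A minimal sketch of fetching them with the standard library follows; the helper names `local_filename` and `fetch_tutorial_files` are made up for this example and are not part of Pygenprop.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse

TUTORIAL_URLS = [
    'https://raw.githubusercontent.com/Micromeda/pygenprop/master/docs/source/_static/tutorial/E_coli_K12.tsv',
    'https://raw.githubusercontent.com/Micromeda/pygenprop/master/docs/source/_static/tutorial/E_coli_O157_H7.tsv',
]


def local_filename(url):
    """Return the last path component of a URL, e.g. 'E_coli_K12.tsv'."""
    return posixpath.basename(urlparse(url).path)


def fetch_tutorial_files(urls=TUTORIAL_URLS):
    """Download each URL into the current working directory (needs network access)."""
    for url in urls:
        urllib.request.urlretrieve(url, local_filename(url))
```

Calling `fetch_tutorial_files()` once before the parsing cells should leave `E_coli_K12.tsv` and `E_coli_O157_H7.tsv` next to the notebook.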

In [83]:
import requests
from io import StringIO
from pygenprop.results import GenomePropertiesResults
from pygenprop.database_file_parser import parse_genome_properties_flat_file
from pygenprop.assignment_file_parser import parse_interproscan_file, parse_genome_property_longform_file

In [84]:
# Genome Properties is a flat-file database that can be found on GitHub.
# The latest release of the database can be found at the following URL.

genome_properties_database_url = 'https://raw.githubusercontent.com/ebi-pf-team/genome-properties/master/flatfiles/genomeProperties.txt'

# For this tutorial we will stream the file directly into the Jupyter notebook. Alternatively,
# the file could be downloaded with the Unix wget or curl commands.

with requests.Session() as current_download:
    response = current_download.get(genome_properties_database_url, stream=True)
    response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page.
    tree = parse_genome_properties_flat_file(StringIO(response.text))

In [85]:
# There are 1286 properties in the Genome Properties tree.
len(tree)

1286

In [86]:
# Find all properties of type "GUILD".
for genome_property in tree:
    if genome_property.type == 'GUILD':
        print(genome_property.name)

Coenzyme F420 utilization
CRISPR region
Reduction of oxidized methionine
Phage: major features
Resistance to Reactive Oxygen Species (ROS)
tRNA aminoacylation
Toxin-antitoxin system, type II
Protein-coding palindromic elements
Flagellar components of unknown function
Bacillithiol utilization
Toxin-antitoxin system, type I
Toxin-antitoxin system, type III
Abortive infection proteins
Energy-coupling factor transporters
Initiator caspases of the apoptosis extrinsic pathway
Executor caspases of apoptosis
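
The same loop generalises to a tally of how many properties fall under each type. The sketch below uses stand-in objects exposing only the `name` and `type` attributes used above, so it runs without the database. The three entries are a tiny sample whose names and types are taken from this tutorial's output.

```python
from collections import Counter, namedtuple

# Stand-in for a genome property; real objects come from iterating over `tree`.
Property = namedtuple('Property', ['name', 'type'])

properties = [
    Property('CRISPR region', 'GUILD'),
    Property('tRNA aminoacylation', 'GUILD'),
    Property('Virulence', 'CATEGORY'),
]

# Tally properties by type, then pull out the names of one type.
type_counts = Counter(genome_property.type for genome_property in properties)
guild_names = [p.name for p in properties if p.type == 'GUILD']

print(type_counts)
print(guild_names)
```

With the real database loaded, replacing `properties` with `tree` should give the tally across all 1286 properties.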


In [87]:
# Parse InterProScan files
with open('E_coli_K12.tsv') as ipr5_file_one:
    assignment_cache_1 = parse_interproscan_file(ipr5_file_one)

In [88]:
with open('E_coli_O157_H7.tsv') as ipr5_file_two:
    assignment_cache_2 = parse_interproscan_file(ipr5_file_two)
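
For orientation, an InterProScan TSV file holds one tab-separated match per line, with the protein accession in column 1, the member-database signature accession in column 5, and (when present) the InterPro accession in column 12. The sketch below reads those columns with the standard `csv` module. The sample row is invented for illustration, and which columns `parse_interproscan_file` actually consumes is not shown in this tutorial.

```python
import csv
from io import StringIO

# One invented InterProScan-style row (tab-separated); real files have many.
sample_tsv = '\t'.join([
    'example_protein', '0' * 32, '250', 'Pfam', 'PF00000',
    'Placeholder domain', '10', '120', '1.0E-20', 'T', '01-01-2020',
    'IPR000000', 'Placeholder InterPro entry',
]) + '\n'

matches = []
for row in csv.reader(StringIO(sample_tsv), delimiter='\t'):
    matches.append({
        'protein': row[0],      # column 1: protein accession
        'signature': row[4],    # column 5: signature accession
        # column 12: InterPro accession, absent on rows without an InterPro mapping
        'interpro': row[11] if len(row) > 11 else None,
    })

print(matches)
```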

In [89]:
# Create results comparison object
results = GenomePropertiesResults(assignment_cache_1, assignment_cache_2, properties_tree=tree)

In [90]:
# Get property by identifier
virulence = tree['GenProp0074']

In [91]:
virulence

GenProp0074, Type: CATEGORY, Name: Virulence, Thresh: 0, References: False, Databases: False, Steps: True, Parents: True, Children: True, Public: False

In [92]:
# Iterate to get the identifiers of child properties of virulence
types_of_vir = [genprop.id for genprop in virulence.children]
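
`children` reaches only one level down. Where every descendant is wanted, a recursive walk is one option. The sketch below uses a stand-in `Node` class with the same `id` and `children` attributes; `descendant_ids` is a hypothetical helper, and `GenPropXXXX` is a placeholder grandchild that exists only to show the recursion.

```python
class Node:
    """Stand-in for a genome property with `id` and `children` attributes."""

    def __init__(self, identifier, children=()):
        self.id = identifier
        self.children = list(children)


def descendant_ids(genome_property):
    """Return the ids of all properties below `genome_property`, depth-first."""
    identifiers = []
    for child in genome_property.children:
        identifiers.append(child.id)
        identifiers.extend(descendant_ids(child))
    return identifiers


# 'GenPropXXXX' is a hypothetical grandchild used only to show the recursion.
virulence_stub = Node('GenProp0074', [
    Node('GenProp0052', [Node('GenPropXXXX')]),
    Node('GenProp0648'),
    Node('GenProp0707'),
])

print(descendant_ids(virulence_stub))
```

If a property can be reached by more than one path it would be listed more than once; passing the result through `dict.fromkeys` removes such repeats while keeping the order.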

In [93]:
# The property_results attribute lists each property's assignment per sample, side by side for comparison.
results.property_results

Property_Identifier,E_coli_K12,E_coli_O157_H7
GenProp0724,NO,NO
GenProp0757,YES,YES
GenProp0809,NO,NO
GenProp0853,NO,NO
GenProp0861,NO,NO
GenProp0901,NO,NO
GenProp0919,NO,NO
GenProp0920,NO,NO
GenProp0921,NO,NO
GenProp0936,NO,NO


In [94]:
# The step_results attribute lists each step's assignment per sample, side by side for comparison.
results.step_results

Property_Identifier,Step_Number,E_coli_K12,E_coli_O157_H7
GenProp0724,1,NO,NO
GenProp0724,2,NO,NO
GenProp0724,3,NO,NO
GenProp0724,4,YES,YES
GenProp0724,5,YES,YES
GenProp0724,6,NO,NO
GenProp0724,7,YES,YES
GenProp0724,8,NO,NO
GenProp0077,2,NO,NO
GenProp0077,3,YES,YES


In [95]:
# Get properties with differing assignments
results.differing_property_results

Property_Identifier,E_coli_K12,E_coli_O157_H7
GenProp0111,YES,PARTIAL
GenProp1032,YES,PARTIAL
GenProp0183,YES,PARTIAL
GenProp1695,NO,PARTIAL
GenProp1331,PARTIAL,YES
GenProp1388,NO,YES
GenProp0051,NO,YES
GenProp0232,PARTIAL,YES
GenProp0236,PARTIAL,YES
GenProp0455,YES,PARTIAL
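
Conceptually, a property differs when the samples do not all share one assignment. A plain-Python sketch of that filter, using three rows copied from the tables above (this illustrates the idea and is not Pygenprop's implementation):

```python
# property identifier -> (E_coli_K12, E_coli_O157_H7) assignments
assignments = {
    'GenProp0757': ('YES', 'YES'),
    'GenProp0111': ('YES', 'PARTIAL'),
    'GenProp1388': ('NO', 'YES'),
}

# Keep rows whose assignments are not all identical across samples.
differing = {
    identifier: values
    for identifier, values in assignments.items()
    if len(set(values)) > 1
}

print(differing)
```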


In [96]:
# Get property assignments for virulence properties
results.get_results(*types_of_vir, steps=False)

Property_Identifier,E_coli_K12,E_coli_O157_H7
GenProp0052,NO,PARTIAL
GenProp0648,YES,YES
GenProp0707,NO,NO


In [97]:
# Get step assignments for virulence properties
results.get_results(*types_of_vir, steps=True)

Property_Identifier,Step_Number,E_coli_K12,E_coli_O157_H7
GenProp0052,1,NO,NO
GenProp0052,2,NO,NO
GenProp0052,3,NO,NO
GenProp0052,4,NO,NO
GenProp0052,5,NO,NO
GenProp0052,6,NO,YES
GenProp0052,7,NO,NO
GenProp0052,8,NO,YES
GenProp0052,9,NO,NO
GenProp0052,10,YES,YES


In [98]:
# Get counts of virulence properties assigned YES, NO, and PARTIAL per organism
results.get_results_summary(*types_of_vir, steps=False, normalize=False)

Assignment,E_coli_K12,E_coli_O157_H7
NO,2.0,1
PARTIAL,0.0,1
YES,1.0,1
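
These counts follow directly from the three virulence rows returned by `get_results(*types_of_vir, steps=False)`. As a cross-check, the same tally in plain Python (a sketch of the arithmetic, not of how `get_results_summary` is implemented):

```python
from collections import Counter

# Rows from the get_results(..., steps=False) output above.
virulence_results = {
    'GenProp0052': {'E_coli_K12': 'NO', 'E_coli_O157_H7': 'PARTIAL'},
    'GenProp0648': {'E_coli_K12': 'YES', 'E_coli_O157_H7': 'YES'},
    'GenProp0707': {'E_coli_K12': 'NO', 'E_coli_O157_H7': 'NO'},
}

summary = {}
for sample in ('E_coli_K12', 'E_coli_O157_H7'):
    counts = Counter(row[sample] for row in virulence_results.values())
    # Report every category, including those with zero hits.
    summary[sample] = {label: counts[label] for label in ('NO', 'PARTIAL', 'YES')}

print(summary)
```

The floating-point values in the E_coli_K12 column are probably a side effect of the missing PARTIAL count for that sample being filled in with zero.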


In [99]:
# Get counts of virulence steps assigned YES, NO, and PARTIAL per organism
results.get_results_summary(*types_of_vir, steps=True, normalize=False)

Assignment,E_coli_K12,E_coli_O157_H7
NO,46,27
YES,9,28


In [100]:
# Get percentages of virulence steps assigned YES, NO, and PARTIAL per organism
results.get_results_summary(*types_of_vir, steps=True, normalize=True)

Assignment,E_coli_K12,E_coli_O157_H7
NO,83.636364,49.090909
YES,16.363636,50.909091
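
With `normalize=True` each count is expressed as a percentage of that sample's total, which here is 55 steps per organism. The figures can be reproduced from the step counts reported with `normalize=False`:

```python
# Step counts from the normalize=False output above.
step_counts = {
    'E_coli_K12': {'NO': 46, 'YES': 9},
    'E_coli_O157_H7': {'NO': 27, 'YES': 28},
}

percentages = {}
for sample, counts in step_counts.items():
    total = sum(counts.values())  # 55 steps for each organism
    percentages[sample] = {
        label: round(100 * count / total, 6) for label, count in counts.items()
    }

print(percentages)
```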
