### Suicide Comorbidities (using Brown MySQL server)

This script runs a PubMed-Comorbidities pipeline with the following characteristics:

* Main MeSH Heading: Suicide
* UMLS filtering concept: "Disease or Syndrome", "Mental or Behavioral Dysfunction" or "Neoplastic Process"
* Articles analysed: All MEDLINE 2017AA articles tagged with Suicide as a MeSH Heading. Note that this is equivalent to searching PubMed using [MH:noexp]
  Total number of articles found: 33806
* UMLS concept filtering: Comorbidities are analysed on all other MeSH descriptors associated with the specified UMLS concepts
* This script uses Brown MySQL databases:
    * medline
    * umls_meta
    * pubmed_miner

In [2]:
#Optional for running Step 1. However, concurrent database access sometimes causes errors.
#If that happens, reduce the number of processes or comment out the line completely.
#The fewer the processes, the longer the task takes. At times I've successfully used 12 workers, at others only 2.
addprocs(2); 

In [3]:
using Revise #used during development to detect changes in module - unknown behavior if using multiple processes
using PubMedMiner

In [4]:
#Settings
const mh = "Suicide"
const concepts = ("Disease or Syndrome", "Mental or Behavioral Dysfunction", "Neoplastic Process");

## 1. Save related occurrences to database

The following code saves to the pubmed_miner database a table containing the list of pmids and MeSH descriptors that match the specified filtering criteria.

In [5]:
overwrite = false
@time save_semantic_occurrences(mh, concepts...; overwrite = overwrite) 

[1m[36mINFO: [39m[22m[36m33806 Articles related to MH:Suicide
[39m[1m[36mINFO: [39m[22m[36m----------------------------------------
[39m[1m[36mINFO: [39m[22m[36mStart all articles
[39m[1m[36mINFO: [39m[22m[36mUsing concept table: MESH_T047
[39m[1m[36mINFO: [39m[22m[36mUsing results table: suicide_mesh_t047
[39m[1m[36mINFO: [39m[22m[36mTable exists and will remain unchanged
[39m[1m[36mINFO: [39m[22m[36mUsing concept table: MESH_T048
[39m[1m[36mINFO: [39m[22m[36mUsing results table: suicide_mesh_t048
[39m[1m[36mINFO: [39m[22m[36mTable exists and will remain unchanged
[39m[1m[36mINFO: [39m[22m[36mUsing concept table: MESH_T191
[39m[1m[36mINFO: [39m[22m[36mUsing results table: suicide_mesh_t191
[39m[1m[36mINFO: [39m[22m[36mTable doesn't exist, create
[39m

244.662948 seconds (5.00 M allocations: 185.202 MiB, 0.02% gc time)


## 2. Retrieve results and analyze simple occurrences and co-occurrences

In [6]:
using FreqTables

@time occurrence_df = get_semantic_occurrences_df(mh, concepts...)
@time mesh_frequencies = freqtable(occurrence_df, :pmid, :descriptor);

info("Found ", size(occurrence_df, 1), " related descriptors")

[1m[36mINFO: [39m[22m[36mUsing concept table: MESH_T047
[39m[1m[36mINFO: [39m[22m[36mUsing results table: suicide_mesh_t047
[39m[1m[36mINFO: [39m[22m[36mUsing concept table: MESH_T048
[39m[1m[36mINFO: [39m[22m[36mUsing results table: suicide_mesh_t048
[39m

  0.994594 seconds (1.03 M allocations: 46.199 MiB, 2.21% gc time)


[1m[36mINFO: [39m[22m[36mUsing concept table: MESH_T191
[39m[1m[36mINFO: [39m[22m[36mUsing results table: suicide_mesh_t191
[39m

  1.353433 seconds (1.19 M allocations: 176.308 MiB, 3.00% gc time)


[1m[36mINFO: [39m[22m[36mFound 32017 related descriptors
[39m
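
To make the cross-tabulation step concrete, here is a minimal sketch of what a pmid-by-descriptor frequency table holds. It uses plain arrays and made-up records instead of `FreqTables` (which returns a named array), so the names `records`, `row_of`, `col_of` and `toy_counts` are illustrative assumptions, not part of the pipeline.

```julia
# Made-up (pmid, descriptor) records standing in for the rows of occurrence_df.
records = [(101, "Depression"), (101, "Alcoholism"),
           (102, "Depression"), (103, "Schizophrenia")]

# Distinct row and column labels, sorted for a stable layout.
toy_pmids = sort(unique([r[1] for r in records]))
toy_descriptors = sort(unique([r[2] for r in records]))

# Lookup tables from label to matrix index.
row_of = Dict(zip(toy_pmids, 1:length(toy_pmids)))
col_of = Dict(zip(toy_descriptors, 1:length(toy_descriptors)))

# Cross-tabulate: one row per article, one column per descriptor.
toy_counts = zeros(Int, length(toy_pmids), length(toy_descriptors))
for (pmid, descriptor) in records
    toy_counts[row_of[pmid], col_of[descriptor]] += 1
end

# Per-descriptor totals and the ordering used to pick the most frequent terms.
toy_totals = [sum(toy_counts[:, j]) for j in 1:size(toy_counts, 2)]
toy_order = sortperm(toy_totals, rev=true)
```

Indexing `toy_descriptors[toy_order]` then lists the descriptors from most to least frequent, which is the same idea the plotting cell applies with `count_perm`.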

In [18]:
using PlotlyJS
using NamedArrays

# Visualize frequency 
topn = 50
mesh_counts = vec(sum(mesh_frequencies, 1))
count_perm = sortperm(mesh_counts, rev=true)
mesh_names = collect(keys(mesh_frequencies.dicts[2]))

#traces
#remove from plot for better scaling
freq_trace = PlotlyJS.bar(; x = mesh_names[count_perm[1:topn]], y= mesh_counts[count_perm[1:topn]], marker_color="orange")

data = [freq_trace]
layout = Layout(;title="$(topn)-Most Frequent MeSH ",
                 showlegend=false,
                 margin= Dict(:t=> 70, :r=> 0, :l=> 50, :b=>200),
                 xaxis_tickangle = 90,)
plot(data, layout)

## 3. Pair Statistics

* Mutual information
* Chi-Square
* Co-occurrence matrix
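
As a rough illustration of two of these statistics, the sketch below builds a co-occurrence matrix from a toy binary occurrence matrix and derives pointwise mutual information from it using the textbook definition log(p(i,j) / (p(i) p(j))). The helper `toy_pmi` is hypothetical; `BCBIStats.COOccur.pmi_mat` may normalize or treat zero counts differently.

```julia
# Toy binary occurrence matrix: rows are articles, columns are descriptors.
toy_occ = [1 1 0;
           1 0 1;
           1 1 0;
           0 0 1]

n_articles = size(toy_occ, 1)

# Co-occurrence counts: entry (i, j) is the number of articles tagged with
# both descriptor i and descriptor j; the diagonal holds per-descriptor counts.
toy_coo = toy_occ' * toy_occ

# Pointwise mutual information from co-occurrence counts.
# Assumption: pairs that never co-occur are left at 0.0 here, although the
# mathematical value is -Inf.
function toy_pmi(coo, n)
    k = size(coo, 1)
    out = zeros(k, k)
    for i in 1:k, j in 1:k
        if coo[i, j] > 0
            p_ij = coo[i, j] / n
            p_i = coo[i, i] / n
            p_j = coo[j, j] / n
            out[i, j] = log(p_ij / (p_i * p_j))
        end
    end
    return out
end

toy_pmi_mat = toy_pmi(toy_coo, n_articles)
```

A positive entry means the two descriptors appear together more often than independence would predict; a negative entry means less often.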

In [19]:
using BCBIStats.COOccur
using StatsBase

#co-occurrence matrix - only for top MeSH
# min_frequency = 5 -- alternatively compute topn based on min-frequency
top_occ = mesh_frequencies.array[:, count_perm[1:topn]]
top_mesh_labels = mesh_names[count_perm[1:topn]]
top_occ_sp = sparse(top_occ)
top_coo_sp = top_occ_sp' * top_occ_sp


#Pointwise Mutual Information
pmi_sp = BCBIStats.COOccur.pmi_mat(top_coo_sp)
#chi2
top_chi2= BCBIStats.COOccur.chi2_mat(top_occ, min_freq=0);

In [36]:
# typeof(full(top_chi2))
#Display full matrices as heatmaps
pmi_trace = heatmap(; z=full(pmi_sp), showscale=false)
chi2_trace = heatmap(; z=full(top_chi2), showscale=false)
coo_stats_plot = [plot(pmi_trace) plot(chi2_trace)]


In [21]:
using PlotlyJSFactory

p = create_chord_plot(top_coo_sp, labels = top_mesh_labels)
relayout!(p, title="Co-occurrences between top 50 MeSH terms")
JupyterPlot(p)

### Association Rules

* Compute using apriori algorithm (eclat version) 
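
For orientation, the three numbers reported for each rule (support, confidence, lift) can be sketched on a toy occurrence matrix using their standard definitions. The helpers `toy_support`, `toy_confidence` and `toy_lift` are hypothetical and written for clarity, not speed; this is not how `ARules` computes them internally.

```julia
# Toy transactions: rows are articles, columns are descriptors (true = tagged).
toy_tx = Bool[1 1 0;
              1 0 1;
              1 1 0;
              0 0 1]

# Support: fraction of articles that contain every descriptor in `items`.
function toy_support(tx, items)
    n = size(tx, 1)
    hits = 0
    for r in 1:n
        if all(tx[r, c] for c in items)
            hits += 1
        end
    end
    return hits / n
end

# Confidence of lhs => rhs: support of the combined itemset over support of lhs.
toy_confidence(tx, lhs, rhs) = toy_support(tx, vcat(lhs, rhs)) / toy_support(tx, lhs)

# Lift of lhs => rhs: confidence relative to the baseline support of rhs.
toy_lift(tx, lhs, rhs) = toy_confidence(tx, lhs, rhs) / toy_support(tx, rhs)

# Rule {descriptor 2} => {descriptor 1}
rule_supp = toy_support(toy_tx, [1, 2])
rule_conf = toy_confidence(toy_tx, [2], [1])
rule_lift = toy_lift(toy_tx, [2], [1])
```

A lift above 1 suggests the right-hand side is more likely when the left-hand side is present. The `supp` and `conf` arguments passed to `apriori` act as lower bounds on the first two quantities.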

In [14]:
using ARules
using DataTables



In [22]:
mh_occ = convert(BitArray{2}, mesh_frequencies.array)

# We don't need to remove Suicide (MH) because it is not one of the selected UMLS types
# mh_col = mesh_frequencies.dicts[2][mh]
# mh_occ[:, mh_col] = zeros(size(mh_occ,1))

epilepsy_lkup = convert(DataStructures.OrderedDict{String,Int16}, mesh_frequencies.dicts[2]) 
@time epilepsy_rules = apriori(mh_occ, supp = 0.001, conf = 0.1, maxlen = 9)

#Pretty print of rules
epilepsy_lkup = Dict(zip(values(mesh_frequencies.dicts[2]), keys(mesh_frequencies.dicts[2])))
rules_dt= ARules.rules_to_datatable(epilepsy_rules, epilepsy_lkup, join_str = " | ");

  0.846432 seconds (513.53 k allocations: 59.463 MiB, 3.92% gc time)


In [23]:
println(head(rules_dt))
println("Found ", size(rules_dt, 1), " rules")

6×5 DataTables.DataTable
│ Row │ lhs                                  │
├─────┼──────────────────────────────────────┤
│ 1   │ {Acne Vulgaris}                      │
│ 2   │ {HIV Infections}                     │
│ 3   │ {Acquired Immunodeficiency Syndrome} │
│ 4   │ {Acquired Immunodeficiency Syndrome} │
│ 5   │ {Acute Disease}                      │
│ 6   │ {Acute Disease}                      │

│ Row │ rhs                                │ supp        │ conf     │ lift     │
├─────┼────────────────────────────────────┼─────────────┼──────────┼──────────┤
│ 1   │ Depression                         │ 0.000996133 │ 0.515152 │ 3.07505  │
│ 2   │ Acquired Immunodeficiency Syndrome │ 0.00158209  │ 0.2      │ 20.4383  │
│ 3   │ HIV Infections                     │ 0.00158209  │ 0.161677 │ 20.4383  │
│ 4   │ Substance-Related Disorders        │ 0.00123052  │ 0.125749 │ 1.28813  │
│ 5   │ Chronic Disease                    │ 0.0014063   │ 0.102564 │ 5.10309  │
│ 6   │ Mental Disorders       

## Frequent Item Sets

In [24]:
supp_int = round(Int, 0.001 * size(mh_occ, 1))
@time root = frequent_item_tree(mh_occ, supp_int, 9);

supp_lkup = gen_support_dict(root, size(mh_occ, 1))
item_lkup = mesh_frequencies.dicts[2]
item_lkup_t = Dict(zip(values(item_lkup), keys(item_lkup)))
freq = ARules.suppdict_to_datatable(supp_lkup, item_lkup_t);

  0.034929 seconds (68.66 k allocations: 34.563 MiB, 26.16% gc time)


In [25]:
println(head(freq))
println("Found ", size(freq, 1), " frequent itemsets")

6×2 DataTables.DataTable
│ Row │ itemset                                                │ supp │
├─────┼────────────────────────────────────────────────────────┼──────┤
│ 1   │ {Alcoholism,Schizophrenia,Substance-Related Disorders} │ 33   │
│ 2   │ {Hypertension}                                         │ 57   │
│ 3   │ {Depression,Substance-Related Disorders}               │ 236  │
│ 4   │ {Adjustment Disorders,Schizophrenia}                   │ 19   │
│ 5   │ {Alcoholism,Depressive Disorder, Major}                │ 41   │
│ 6   │ {Child Abuse,Substance-Related Disorders}              │ 31   │
Found 518 frequent itemsets


## Visualization of Frequent Item Sets

* Basic visualization of frequent item sets using Sankey diagram (experimental - use with caution)
* Future work includes a better layout for more links as well as the ability to dynamically change the number of itemsets

In [26]:
function fill_sankey_data!(node, sources, targets, vals)
    if length(node.item_ids) >1
        push!(sources, node.item_ids[end-1]-1)
        push!(targets, node.item_ids[end]-1)
        push!(vals, node.supp)
    end
    if has_children(node)     
        for nd in node.children
            fill_sankey_data!(nd,  sources, targets, vals)
        end
    end
end

fill_sankey_data! (generic function with 1 method)

In [27]:
sources = []
targets = []
vals = []
fill_sankey_data!(root, sources, targets, vals);

In [28]:
# size(sources)
topn_links = 50
freq_vals_perm = sortperm(vals, rev=true)
s = sources[freq_vals_perm[1:topn_links]]
t = targets[freq_vals_perm[1:topn_links]]
v = vals[freq_vals_perm[1:topn_links]]
l = mesh_names

println("Found ", length(sources), " links, showing ", topn_links)

Found 358 links, showing 50


In [33]:
pad = 1e-7
trace=sankey(orientation="h",
             node = attr(domain=attr(x=[0,1], y=[0,1]), pad=pad, thickness=pad, line = attr(color="black", width= 0.5),
                         label=l), 
             link = attr(source=s, target=t, value = v))

layout = Layout(width=900, height=1100)
    

plot([trace], layout)