# Examples of running Root Cause Discovery in Julia

First, install all necessary packages for this tutorial:

In [1]:
using Pkg
Pkg.add(PackageSpec(url="https://github.com/Jinzhou-Li/RootCauseDiscovery.git", subdir="julia"))
pkg"add CSV DataFrames DelimitedFiles Random Distributions"

    Updating git-repo `https://github.com/Jinzhou-Li/RootCauseDiscovery.git`
   Resolving package versions...
  No Changes to `/home/groups/sabatti/.julia/environments/v1.10/Project.toml`
  No Changes to `/home/groups/sabatti/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `/home/groups/sabatti/.julia/environments/v1.10/Project.toml`
  No Changes to `/home/groups/sabatti/.julia/environments/v1.10/Manifest.toml`


Now load all necessary packages:

In [2]:
using RootCauseDiscovery
using DataFrames
using CSV
using DelimitedFiles
using Random
using Distributions

## Example 1 in our paper

First, simulate observational data:

In [3]:
p = 3    # number of genes
n = 500  # number of observational samples
m = 30   # number of interventional samples

b1 = 10
b2 = 10
b3 = 10

sigma_C = 1
sigma_j = 1
sigma_k = 1
alpha = -1
beta = 2
gamma = 1

int_mean = 20
int_sd = 1

## check the success condition for the squared z-score method
## (a positive value is needed when Xj is the root cause)
@show sigma_k^2 + gamma^2 * sigma_C^2 + 2*alpha*beta*gamma*sigma_C^2

## generate observational data
C = fill(b1, n) + randn(n)
Xj = fill(b2, n) + (beta  .* C  .+ randn(n))
Xk = fill(b3, n) + (alpha .* Xj .+ gamma .* C .+ randn(n))
Data_obs = [C Xj Xk]

sigma_k ^ 2 + gamma ^ 2 * sigma_C ^ 2 + 2 * alpha * beta * gamma * sigma_C ^ 2 = -2


500×3 Matrix{Float64}:
 11.7749   33.2377  -11.0605
 10.2999   30.9394  -10.2166
  7.64437  24.1637   -4.30992
 10.6465   31.2668  -10.1584
  9.60473  29.0874  -11.8339
  9.71713  29.4703   -9.60271
  9.4181   29.5954   -9.52052
  9.43421  26.4391   -6.26241
  9.99574  31.3675  -10.6651
  9.28728  31.6905  -11.9375
  9.54214  28.3879  -10.1283
  9.26575  27.9804   -8.39688
 12.1383   34.9517  -12.945
  ⋮                 
 10.3674   30.8747   -9.80652
  9.43913  28.5363  -10.4432
  9.79669  29.0912   -9.39033
  9.1604   27.1757   -9.13248
  8.77839  24.9609   -5.91331
  9.97182  30.8838  -11.3957
  8.22934  26.7756   -9.84687
 10.4741   29.713    -8.02177
  8.59956  26.6469   -8.03789
  9.58737  30.2365   -9.93392
  9.30406  28.1135   -7.07729
 10.4593   30.7137  -11.1154

Next, simulate interventional data:

In [4]:
## random intervention for each sample
delta_C = zeros(m)
delta_j = zeros(m)
delta_k = zeros(m)

Inter_target = vcat(ones(Int(m/3)), 2*ones(Int(m/3)), 3*ones(Int(m/3)))
delta_C[findall(x -> x == 1, Inter_target)] .+= rand(Normal(int_mean, int_sd), Int(m/3))
delta_j[findall(x -> x == 2, Inter_target)] .+= rand(Normal(int_mean, int_sd), Int(m/3))
delta_k[findall(x -> x == 3, Inter_target)] .+= rand(Normal(int_mean, int_sd), Int(m/3))

## generate interventional sample
C_I = fill(b1, m) + randn(m) + delta_C
Xj_I = fill(b2, m) + (beta  .* C_I   .+ randn(m)) + delta_j
Xk_I = fill(b3, m) + (alpha .* Xj_I  .+ gamma .* C_I .+ randn(m)) + delta_k
Data_int = [C_I Xj_I Xk_I]

30×3 Matrix{Float64}:
 30.2648   70.5028  -30.2158
 31.0169   71.9286  -29.0156
 30.5716   72.15    -31.0863
 28.9595   65.4909  -28.3884
 29.5375   69.0942  -28.6559
 30.7187   72.9984  -32.9171
 30.8852   70.9372  -27.8639
 28.7607   67.0083  -27.1495
 32.1153   74.3822  -30.7306
 30.6361   71.5663  -31.0648
 10.1932   50.8628  -31.1493
 10.5131   51.8016  -32.1184
  9.39787  50.8825  -31.1033
  ⋮                 
  9.37807  49.5286  -29.952
  9.45777  49.7125  -32.0673
  8.62872  28.7689   10.4634
  9.05468  27.1013   10.8782
 10.2932   30.0656   10.4432
  9.66403  28.5712   10.602
  9.2411   27.8437   13.0132
  9.2275   28.3324   10.1477
  9.97503  29.6553    8.69068
 11.5834   35.3329    7.36686
 11.319    32.0129    8.79304
 10.2686   30.7925   11.1374

Let's try the squared z-score method. We simulated 30 interventional samples (patients), where:
+ For the first 10 interventional samples, the root cause is the 1st variable
+ For the next 10 interventional samples, the root cause is the 2nd variable
+ For the last 10 interventional samples, the root cause is the 3rd variable

In [5]:
z = zscore(Data_obs, Data_int)

# the first column should be the largest
z[1:10, :] 

10×3 Matrix{Float64}:
 415.662  330.988  133.729
 446.984  354.62   118.324
 428.303  358.362  145.493
 363.999  254.387  110.648
 386.455  308.443  113.89
 434.428  372.886  171.85
 441.416  338.103  104.427
 356.431  276.516   96.2429
 494.775  397.192  140.626
 430.981  348.538  145.196

In [6]:
# The second column should be the largest, but it is not,
# so the squared z-score method fails here.
z[11:20, :]

10×3 Matrix{Float64}:
 0.0672721     88.3831  146.362
 0.336613      96.4618  160.08
 0.289644      88.5486  145.726
 0.19664       76.4947  110.433
 0.000887935   75.6477  104.472
 0.535067     100.558   127.281
 3.32548       60.9611  109.848
 0.0536713    122.451   202.014
 0.311412      77.51    130.262
 0.228595      78.9666  159.342
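
The comparison above is what the success condition from the simulation cell predicts. Here is a short derivation, assuming the population version of the simulated model. The noise terms $\varepsilon$ and the mean shift $\delta$ on $X_j$ are notation introduced here for illustration:

$$
\begin{aligned}
C &= b_1 + \varepsilon_C, \qquad X_j = b_2 + \beta C + \varepsilon_j, \qquad X_k = b_3 + \alpha X_j + \gamma C + \varepsilon_k,\\
\operatorname{Var}(X_j) &= \beta^2\sigma_C^2 + \sigma_j^2,\\
\operatorname{Var}(X_k) &= \alpha^2\operatorname{Var}(X_j) + \gamma^2\sigma_C^2 + \sigma_k^2 + 2\alpha\beta\gamma\sigma_C^2 .
\end{aligned}
$$

A mean shift $\delta$ on $X_j$ moves $X_k$ by $\alpha\delta$, so the population squared z-scores are $\delta^2/\operatorname{Var}(X_j)$ for $X_j$ and $\alpha^2\delta^2/\operatorname{Var}(X_k)$ for $X_k$. The first exceeds the second exactly when $\sigma_k^2 + \gamma^2\sigma_C^2 + 2\alpha\beta\gamma\sigma_C^2 > 0$. With the values used here the left side is $1 + 1 - 4 = -2$, and with $\delta = 20$ the two scores are $400/5 = 80$ and $400/3 \approx 133$, in line with the second and third columns above.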

In [7]:
# the third column should be the largest
z[21:30, :]

10×3 Matrix{Float64}:
 1.71482     0.239229    136.945
 0.778535    1.52682     142.553
 0.129401    0.00834567  136.674
 0.0735895   0.333654    138.807
 0.483579    0.815942    173.204
 0.502743    0.468609    132.752
 0.00164859  0.0085285   114.243
 2.73416     5.99875      98.6302
 1.92746     0.927467    115.498
 0.112256    0.173666    146.115
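
For reference, the squared z-scores shown above can be sketched in a few lines. This is a minimal illustration that assumes the score is the squared deviation from the observational column mean, scaled by the observational column standard deviation. The name `squared_zscore_sketch` is hypothetical, and the package's own `zscore` may differ in details.

```julia
using Statistics

# Hypothetical helper, not the package's `zscore`: squared z-score of each
# interventional entry relative to the observational column mean and std.
function squared_zscore_sketch(Xobs::AbstractMatrix, Xint::AbstractMatrix)
    mu = mean(Xobs, dims=1)   # 1×p row of column means
    sd = std(Xobs, dims=1)    # 1×p row of column standard deviations
    return ((Xint .- mu) ./ sd) .^ 2   # broadcasts over the rows of Xint
end

# Toy check: observational columns with mean 0, std 1 and mean 10, std 2
Xobs_toy = [-1.0 8.0; 0.0 10.0; 1.0 12.0]
Xint_toy = [3.0 10.0; 0.0 16.0]
z_toy = squared_zscore_sketch(Xobs_toy, Xint_toy)
```

In the toy input each interventional row deviates by three standard deviations in exactly one column, so that column carries the whole score.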

Now try our proposed RC-score method:

In [8]:
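# one row of RC-scores per interventional sample
Xtilde_all = zeros(size(Data_int, 1), 3)
for sample in 1:size(Data_int, 1)
    Xint = Data_int[sample, :]
    perm = collect(1:3)  # variable ordering to use; here the simulation order C, Xj, Xk
    Xtilde = RootCauseDiscovery.root_cause_discovery(Data_obs, Xint, perm)
    Xtilde_all[sample, :] .= Xtilde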
end
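
For intuition about what a score of this kind measures, here is a minimal sketch. It assumes that, for a given ordering `perm`, the score is the absolute value of the sample's deviation from the observational mean after whitening with the Cholesky factor of the permuted observational covariance. The helper `cholesky_score_sketch` is hypothetical and is not the package's implementation, which may differ.

```julia
using LinearAlgebra, Statistics, Random

# Hypothetical helper for illustration only (not the package's implementation):
# whiten the deviation of one interventional sample with the Cholesky factor of
# the observational covariance, with variables ordered by `perm`.
function cholesky_score_sketch(Xobs::AbstractMatrix, xint::AbstractVector,
                               perm::AbstractVector{<:Integer})
    mu = vec(mean(Xobs, dims=1))
    Sigma = cov(Xobs)                               # columns are variables
    L = cholesky(Symmetric(Sigma[perm, perm])).L    # lower-triangular factor
    return abs.(L \ (xint[perm] .- mu[perm]))       # entries follow the order in `perm`
end

# Toy data from the same three-variable model, intervening on the 2nd variable
Random.seed!(1)
n_toy = 500
C_toy  = 10 .+ randn(n_toy)
Xj_toy = 10 .+ 2 .* C_toy .+ randn(n_toy)
Xk_toy = 10 .- Xj_toy .+ C_toy .+ randn(n_toy)
Xobs_toy = [C_toy Xj_toy Xk_toy]

c  = 10 + randn()
xj = 10 + 2c + randn() + 20      # mean shift of 20 on the 2nd variable
xk = 10 - xj + c + randn()
score = cholesky_score_sketch(Xobs_toy, [c, xj, xk], 1:3)
```

When `perm` is a causal ordering, the whitened deviation recovers the standardized noise terms, so only the intervened coordinate picks up the shift and the downstream variable does not inherit it.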

First 10 interventional samples:

In [9]:
Xtilde_all[1:10, :]

10×3 Matrix{Float64}:
 20.3878  0.101183   0.182993
 21.142   0.0286997  2.08749
 20.6955  1.11808    0.691634
 19.0787  2.2649     1.80149
 19.6585  0.142053   1.07357
 20.8429  1.66344    0.438719
 21.0099  0.687042   2.36234
 18.8794  0.385221   1.25507
 22.2435  0.287266   1.73423
 20.7601  0.418671   0.0347593

Next 10 interventional samples. Note that the 2nd column is the largest here, so the RC-score method succeeds where the squared z-score method failed.

In [10]:
Xtilde_all[11:20, :]

10×3 Matrix{Float64}:
 0.259369   20.12    0.18167
 0.580183   20.4155  0.163878
 0.538186   21.6968  1.09957
 0.443442   18.3232  1.27874
 0.0297982  19.1408  2.18092
 0.731483   20.5728  2.59015
 1.82359    20.6916  1.55593
 0.231671   23.8261  0.340128
 0.558043   20.4059  0.875614
 0.478116   20.4305  1.16606

Last 10 interventional samples:

In [11]:
Xtilde_all[21:30, :]

10×3 Matrix{Float64}:
 1.30951    1.4842    21.0377
 0.882346   0.987922  19.2609
 0.359724   0.502062  20.5917
 0.271274   0.737564  19.8645
 0.695399   0.62382   22.0059
 0.709043   0.117243  19.6204
 0.0406029  0.28191   18.7232
 1.65353    2.14454   21.5906
 1.38833    0.598281  19.8531
 0.335046   0.260015  22.0826
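
Instead of reading the tables by eye, the predicted root cause of each sample can be taken as the column with the largest score. The sketch below uses one row copied from each of the three output blocks above. The names `predict_root_cause`, `scores_demo`, and `truth_demo` are hypothetical and not part of the package.

```julia
# Predicted root cause per sample: index of the largest score in each row.
# `predict_root_cause` is a hypothetical helper, not part of the package.
predict_root_cause(scores::AbstractMatrix) = [argmax(row) for row in eachrow(scores)]

# One row copied from each of the three RC-score output blocks
scores_demo = [20.3878   0.101183  0.182993;
               0.259369  20.12     0.18167;
               1.30951   1.4842    21.0377]
truth_demo = [1, 2, 3]   # intervened variable for each row

pred_demo = predict_root_cause(scores_demo)
accuracy_demo = sum(pred_demo .== truth_demo) / length(truth_demo)
```

On the full simulation, the same helper could be applied to `Xtilde_all` and compared elementwise against `Inter_target`.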

## Example 2: real data

First, we need to obtain quality-controlled (QC'd) gene expression data.

In [12]:
transform_int, transform_obs, ground_truth = QC_gene_expression_data(
    low_count = 10,
    threshold = 0.1,
    max_cor = 0.999,
)

Progress: 100%|█████████████████████████████████████████| Time: 0:01:53


(19736×60 DataFrame
   Row │ geneID           R62943    R98254      R89912    R19100      R15264   ⋯
       │ SubStrin…        Float64   Float64     Float64   Float64     Float64  ⋯
───────┼────────────────────────────────────────────────────────────────────────
     1 │ ENSG00000000003  7.16243    7.28255    7.32937    7.31479    7.95855  ⋯
     2 │ ENSG00000000419  7.52393    7.48436    7.57685    7.64937    7.62452
     3 │ ENSG00000000457  6.15732    5.72704    5.63371    5.95707    6.30952
     4 │ ENSG00000000460  5.75861    6.77764    5.66632    5.71676    5.60984
     5 │ ENSG00000000971  6.71062    6.82252    7.63302    7.53938    7.64743  ⋯
     6 │ ENSG00000001036  8.64898    8.73812    8.58827    8.6068     8.45188
     7 │ ENSG00000001084  5.92993    6.59335    7.10089    6.76644    6.75437
     8 │ ENSG00000001167  7.22844    7.35934    7.04254    7.07121    7.04258
     9 │ ENSG00000001460  6.46056    6.59598 

Check the data dimensions:

In [13]:
size(transform_int) # (19736, 60)
size(transform_obs) # (19736, 365)
size(ground_truth) # (70, 6)

(70, 6)

Convert to numeric matrices with rows as samples and columns as genes:

In [21]:
Xobs = transform_obs[:, 2:end] |> Matrix |> transpose
Xint = transform_int[:, 2:end] |> Matrix |> transpose

59×19736 transpose(::Matrix{Float64}) with eltype Float64:
 7.16243  7.52393  6.15732  5.75861  6.71062  …  4.2431    0.875808   5.86397
 7.28255  7.48436  5.72704  6.77764  6.82252     3.82992  -0.0412821  6.12833
 7.32937  7.57685  5.63371  5.66632  7.63302     4.19641   0.43521    5.80585
 7.31479  7.64937  5.95707  5.71676  7.53938     4.37997  -0.0973676  6.23057
 7.95855  7.62452  6.30952  5.60984  7.64743     4.83117   0.582672   5.7784
 6.8259   7.12483  5.88441  5.75847  6.6716   …  4.58047   0.511446   5.41672
 7.39362  6.96415  5.92161  5.20977  7.54939     4.37497   1.7503     5.50684
 6.96874  7.23932  5.89634  6.6735   6.15781     4.18727   0.0441326  5.95493
 6.6786   7.15623  5.7589   6.28329  5.65468     4.20586   0.0787295  5.74169
 7.24618  7.48998  5.87212  6.37998  7.67891     4.17893  -0.286979   6.37743
 7.24228  7.43298  6.17179  6.44091  7.89227  …  4.40687   0.42719    5.87823
 6.9665   7.20722  6.33893  4.99845  8.11296     4.53143   0.10061    5.41873
 7.424

These are the gene IDs:

In [15]:
gene_ids = transform_obs[:, 1]

19736-element Vector{SubString{String31}}:
 "ENSG00000000003"
 "ENSG00000000419"
 "ENSG00000000457"
 "ENSG00000000460"
 "ENSG00000000971"
 "ENSG00000001036"
 "ENSG00000001084"
 "ENSG00000001167"
 "ENSG00000001460"
 "ENSG00000001461"
 "ENSG00000001497"
 "ENSG00000001561"
 "ENSG00000001617"
 ⋮
 "ENSG00000288538"
 "ENSG00000288541"
 "ENSG00000288542"
 "ENSG00000288550"
 "ENSG00000288559"
 "ENSG00000288564"
 "ENSG00000288585"
 "ENSG00000288586"
 "ENSG00000288591"
 "ENSG00000288596"
 "ENSG00000288598"
 "ENSG00000288602"

Finally, run our package to perform root cause discovery for high-dimensional data. The parameters below are chosen for a fast demonstration; the comments give the values we used for the real data application in our paper, where the runtime often exceeds roughly 24 hours.

In [23]:
# parameters for current run
patient_id = "R62943"
method = "cv"
nshuffles = 1 # change this to 30 to reproduce our real data results
y_idx_z_threshold = 10.0 # change this to 1.5 to reproduce our real data results

# Use this patient as interventional sample.
# All other patients are treated as observational samples. 
i = findfirst(x -> x == patient_id, names(transform_int)[2:end])
Xint_sample = Xint[i, :]

# append the remaining interventional samples to the observational data
nint = size(Xint, 1)
Xobs_full = vcat(Xobs, Xint[setdiff(1:nint, i), :])

# run the main algorithm
Random.seed!(2024)
@time root_cause_score = root_cause_discovery_high_dimensional(
    Xobs_full, Xint_sample, method, y_idx_z_threshold=y_idx_z_threshold,
    nshuffles=nshuffles
);

# check result
root_cause_score

Trying 4 y_idx
Lasso found 80 non-zero entries
Lasso found 109 non-zero entries
Lasso found 67 non-zero entries
Lasso found 74 non-zero entries
 56.379518 seconds (695.65 k allocations: 5.256 GiB, 2.05% gc time, 4.65% compilation time)


19736-element Vector{Float64}:
 0.009712399177859711
 0.11895399507042398
 0.11505446567348963
 0.003145725965842752
 7.98086308277622e-5
 0.18364854524560556
 0.21397352374249412
 0.0027313249759257038
 0.026723261531083193
 0.025074526316290497
 0.002061273880593377
 0.0929657516840995
 0.2216556882620838
 ⋮
 0.021247528745765878
 0.23810686611367735
 0.06144366690152794
 0.014035910087709304
 0.08598333652970208
 0.05805265394896295
 0.0044697226556585775
 0.09449393832360445
 0.08043065388404888
 0.005298879649329278
 0.008924187060126074
 0.003353693830550429

The vector `root_cause_score` holds the RC-score defined in our paper, with one entry per variable (gene).
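
To turn the scores into a ranked list of candidate genes, one option is to sort them in decreasing order and look up the matching IDs. The sketch below uses toy stand-ins, with scores rounded from the first few entries of the output above. The names `top_k_indices`, `scores_toy`, and `ids_toy` are hypothetical.

```julia
# Hypothetical helper: indices of the k largest scores, best first.
function top_k_indices(scores::AbstractVector, k::Integer)
    order = sortperm(scores; rev=true)      # indices sorted by decreasing score
    return order[1:min(k, length(order))]
end

# Toy stand-ins for `root_cause_score` and `gene_ids`
scores_toy = [0.0097, 0.1190, 0.1151, 0.0031]
ids_toy = ["ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460"]

top = top_k_indices(scores_toy, 2)
top_ids = ids_toy[top]
```

With the real objects, `gene_ids[top_k_indices(root_cause_score, 10)]` would list the ten highest-scoring genes for the chosen patient.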