<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Cluster-components" data-toc-modified-id="Cluster-components-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Cluster components</a></span></li><li><span><a href="#Save-final-robust-components" data-toc-modified-id="Save-final-robust-components-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Save final robust components</a></span></li></ul></div>

This notebook gathers the components generated by the NERSC runs. It is not necessary if you are only doing exploratory analysis.

# Setup

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

These values are determined by the NERSC consistency call:
* n_runs: Number of times the run_ica script is called within consistency.sh
* n_iter: Number of iterations within the run_ica script (after -i flag)

In [None]:
n_runs = 10
n_iter = 100

# Load data

In [None]:
DATA_DIRS = ['../data/outputs/%d/' % i for i in range(1, n_runs + 1)]

We combine the S matrices from all runs, similar to what is done in run_ica, so that we keep only the components that the run_ica script picks in every run.

In [None]:
all_S = pd.concat([pd.read_csv(data_dir+'S.csv',index_col=0) for data_dir in DATA_DIRS],axis=1)
all_S.columns = range(len(all_S.columns))
all_A = pd.concat([pd.read_csv(data_dir+'A.csv',index_col=0) for data_dir in DATA_DIRS]).reset_index(drop=True)
all_stats = pd.concat([pd.read_csv(data_dir+'component_stats.csv',index_col=0) for data_dir in DATA_DIRS]).reset_index()
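As a minimal sketch of what the column-wise concatenation above does, the toy example below joins two small S matrices and relabels their columns. The row labels, the values, and the `run_of_origin` list are hypothetical and only for illustration; they are not part of the run_ica outputs.

```python
import pandas as pd

# Two toy S matrices (rows x components) standing in for two runs' S.csv files.
# Row labels and values are made up for illustration.
rows = ['r1', 'r2', 'r3']
S_run1 = pd.DataFrame([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]], index=rows, columns=[0, 1])
S_run2 = pd.DataFrame([[0.9, 0.1], [0.1, 1.8], [0.4, 0.6]], index=rows, columns=[0, 1])

# Side-by-side concatenation aligns the matrices on their shared row index.
toy_S = pd.concat([S_run1, S_run2], axis=1)

# Relabel columns 0..3 so that every component has a unique integer ID.
toy_S.columns = range(len(toy_S.columns))

# Hypothetical helper list: the run each combined column came from.
run_of_origin = [run for run, S in enumerate([S_run1, S_run2], start=1) for _ in S.columns]
```

After relabeling, the column IDs no longer say which run a component came from, which is why the sketch records that separately.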

If a cluster from any run contains more than n_iter components, incorrect parameters were used in run_ica.

In [None]:
print('No cluster contains over n_iter components:',all(all_stats['count'] <= n_iter))

# Cluster components

In [None]:
# Distance between components: 1 - |correlation|, so sign-flipped copies count as identical
diff_mat = 1-abs(all_S.corr()).values
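The effect of taking the absolute correlation can be seen on a small example. This is only a sketch with made-up columns: a sign-flipped copy of a component ends up at distance 0 from it, while an uncorrelated column ends up at distance 1.

```python
import pandas as pd

# Toy components (columns); values are made up for illustration.
toy = pd.DataFrame({
    0: [1.0, 2.0, 3.0, 4.0],
    1: [-1.0, -2.0, -3.0, -4.0],  # sign-flipped copy of column 0
    2: [1.0, -1.0, -1.0, 1.0],    # uncorrelated with column 0
})

# Same distance as in the notebook: 1 - |Pearson correlation|.
toy_diff = 1 - abs(toy.corr()).values
```

Here `toy_diff[0, 1]` is (numerically) 0 and `toy_diff[0, 2]` is 1, so only columns 0 and 1 would fall under a 0.1 threshold.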

In [None]:
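# Greedy clustering: take the first unassigned component as a seed and
# group it with every component whose distance to the seed is below 0.1
comp_idx = range(len(diff_mat))
comp_dict = {}
while len(comp_idx) > 0:
    i = comp_idx[0]
    identical = np.where(diff_mat[i] < 0.1)[0]
    comp_dict[i] = identical
    # Remove the clustered components from the pool of unassigned components
    comp_idx = sorted(set(comp_idx).difference(set(identical)))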
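The loop above can be sketched as a small function and tried on a hand-made distance matrix. The function name `greedy_clusters` and the toy matrix are hypothetical. The one deliberate difference from the cell above is that the seed is always removed from the pool, which guards against a non-terminating loop if a seed's self-distance were NaN (for example, for a constant column).

```python
import numpy as np

def greedy_clusters(diff, threshold=0.1):
    """Greedily group indices: each unassigned seed claims every index within `threshold` of it."""
    remaining = list(range(len(diff)))
    clusters = {}
    while remaining:
        seed = remaining[0]
        members = np.where(diff[seed] < threshold)[0]
        clusters[seed] = members
        # Always drop the seed as well, so the loop terminates even if
        # the seed is not within the threshold of itself (NaN distance).
        remaining = sorted(set(remaining) - set(members.tolist()) - {seed})
    return clusters

# Toy symmetric distance matrix with two obvious groups: {0, 1} and {2, 3}.
toy_diff = np.array([
    [0.00, 0.05, 0.90, 0.95],
    [0.05, 0.00, 0.92, 0.90],
    [0.90, 0.92, 0.00, 0.02],
    [0.95, 0.90, 0.02, 0.00],
])
toy_clusters = greedy_clusters(toy_diff)
```

Note that membership is decided relative to the seed only and is not restricted to unassigned components, so a component that is close to two different seeds can appear in more than one cluster.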

In [None]:
# For each cluster, collect the 'count' values of its member components
comp_dist = {}
for i,lst in comp_dict.items():
    comp_dist[i] = all_stats.loc[lst,'count']

In [None]:
resdf = pd.DataFrame([[lst.min(),lst.max(),lst.mean(),lst.std(),len(lst)] for lst in comp_dist.values()],
             index=comp_dist.keys(),
             columns=['Min','Max','Mean','STD','Length'])
resdf.sort_values(['Length','Min'])

# Save final robust components

In [None]:
# Only keep components that reproducibly occur in every run
good_comps = resdf[resdf.Length == n_runs].index
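A cluster of length n_runs does not by itself show that its members come from n_runs different runs. If that stricter guarantee is wanted, a check along the following lines could be added. This is a sketch under assumptions: `one_per_run` and `run_of_origin` (a list giving the run each combined column came from) are hypothetical names that do not exist in the notebook.

```python
import numpy as np

def one_per_run(members, run_of_origin, n_runs):
    """Return True if the cluster has exactly one member from each of the n_runs runs."""
    runs = np.asarray(run_of_origin)[np.asarray(members)]
    return len(runs) == n_runs and len(set(runs.tolist())) == n_runs

# Toy setup: 3 runs with 2 components each (hypothetical labels).
toy_run_of_origin = [1, 1, 2, 2, 3, 3]

spread_cluster = np.array([0, 2, 4])   # one component from each run
crowded_cluster = np.array([0, 1, 2])  # two components from run 1, none from run 3

spread_ok = one_per_run(spread_cluster, toy_run_of_origin, n_runs=3)
crowded_ok = one_per_run(crowded_cluster, toy_run_of_origin, n_runs=3)
```

Both toy clusters have length 3, but only the first one passes the stricter check.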

In [None]:
all_S[good_comps].T.reset_index(drop=True).T.to_csv('../data/S.csv')
all_A.loc[good_comps].reset_index(drop=True).to_csv('../data/A.csv')
all_stats.loc[good_comps].reset_index(drop=True).to_csv('../data/component_counts.csv')

In [None]:
len(good_comps)