# Example Dictionaries

We store almost all of our data, including the dataset and intermediate results, as pickled Python dictionaries.
This notebook shows small examples of what these dictionaries contain.

In [1]:
import pickle
def load_dictionary(filename):
    with open(filename, 'rb') as fp:
        dict_ = pickle.load(fp)
    return dict_

# What the processed dataset looks like

The cell below shows what the dataset, data.pickle (the output of generate_dataset.py), looks like for a small dataset of 5 pairs of sites and 5 cells.

* The dictionary is keyed by position index, stored as a string, ranging over 0,...,num_sites-1.

    * Each position pair has the following keys:
        
        * pos_pair: actual positions in the genome, formatted as pos1_pos2
        * bulk: 2x2 NumPy array of the bulk nucleotides. The first column is the gSNV, the second column is the potentially mutated site.
        * commonZ: 2x2 NumPy array of the true mutation nucleotides. The first column is the gSNV, the second column is the potentially mutated site.
        * mut_count: integer, the total number of mutated cells at this position pair.
        * z_list: list of 2x2 arrays covering all 12 possible mutation types.
        * cell_list: list of Cell objects holding the data of each cell at this position pair.
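
Judging from the printed output for position index 0, the 12 entries of z_list look like every way of replacing exactly one of the four nucleotides in bulk with one of the three other nucleotides (4 x 3 = 12). The sketch below reproduces that enumeration with plain nested lists. It is an inference from the output, not the code of generate_dataset.py; the function name and the assumption that nucleotides are encoded as the integers 0 to 3 are ours.

```python
import copy

NUM_NUCLEOTIDES = 4  # assumption: nucleotides are encoded as integers 0..3


def enumerate_single_substitutions(bulk):
    """Return all 2x2 matrices that differ from `bulk` in exactly one entry.

    Hypothetical helper: the ordering (entry by entry, cycling through the
    three other nucleotides) is chosen to match the printed z_list.
    """
    z_list = []
    for row in range(2):
        for col in range(2):
            for shift in range(1, NUM_NUCLEOTIDES):
                z = copy.deepcopy(bulk)
                z[row][col] = (bulk[row][col] + shift) % NUM_NUCLEOTIDES
                z_list.append(z)
    return z_list


bulk = [[3, 1], [0, 1]]  # bulk of position index 0 in the example dataset
z_list = enumerate_single_substitutions(bulk)
print(len(z_list))
print(z_list[4])
```

For the bulk of position index 0, entry 4 of this list is [[3, 3], [0, 1]], which is the commonZ of that position pair.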

In [2]:
data_dict = load_dictionary("../data/2020_07_09_1/processed_data_dict/data.pickle")
num_pos = len(data_dict)
print("Number of position pairs in the dataset: \t", num_pos)

# Keys are strings, so sort them numerically rather than lexicographically.
for pos_idx in sorted(data_dict, key=int):
    print("\nPos idx: \t", pos_idx, "\tContents: \n\t", data_dict[pos_idx])

Number of position pairs in the dataset: 	 5

Pos idx: 	 0 	Contents: 
	 {'mut_count': 2, 'commonZ': array([[3, 3],
       [0, 1]]), 'bulk': array([[3, 1],
       [0, 1]]), 'pos_pair': '2679_2628', 'cell_list': [<cell_class.Cell object at 0x104306780>, <cell_class.Cell object at 0x10aeeecc0>, <cell_class.Cell object at 0x10aeeecf8>, <cell_class.Cell object at 0x10aeeed30>, <cell_class.Cell object at 0x10aeeed68>], 'z_list': [array([[0, 1],
       [0, 1]]), array([[1, 1],
       [0, 1]]), array([[2, 1],
       [0, 1]]), array([[3, 2],
       [0, 1]]), array([[3, 3],
       [0, 1]]), array([[3, 0],
       [0, 1]]), array([[3, 1],
       [1, 1]]), array([[3, 1],
       [2, 1]]), array([[3, 1],
       [3, 1]]), array([[3, 1],
       [0, 2]]), array([[3, 1],
       [0, 3]]), array([[3, 1],
       [0, 0]])]}

Pos idx: 	 1 	Contents: 
	 {'mut_count': 2, 'commonZ': array([[2, 3],
       [3, 0]]), 'bulk': array([[2, 3],
       [3, 3]]), 'pos_pair': '2305_2341', 'cell_list': [<cell_class.Cell ob

## What each cell object looks like

The cell below shows how the data of each single cell is stored in data.pickle (the output of generate_dataset.py).

* Each cell has the following information:
        
    * Y: binary, the genotype of the cell. 0 if the cell is not mutated, 1 if it is mutated.
    * X: integer in {0, 1, 2}, the allelic dropout (ADO) status of the cell. 0 if no ADO, 1 if ADO on the first allele, 2 if ADO on the second allele.
    * lc: integer, the number of reads.
    * reads: lc x 2 array, each row holds the nucleotides of one read.
    * p_error: lc x 2 array, each row holds the error probabilities of one read (converted from the Phred scores).
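
As a reminder of how Phred scores relate to the values in p_error, the sketch below applies the standard Phred definition, p = 10^(-Q/10). The function name is hypothetical and the conversion code in generate_dataset.py is not shown here, so treat this as an illustration of the formula only.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


# A few example quality scores (hypothetical inputs).
for q in (30, 31, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, Q = 30 corresponds to an error probability of 1e-3 and Q = 31 to roughly 7.94e-4, both of which appear in the p_error output for cell 0.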

In [3]:
pos_idx = 0
entry = data_dict[str(pos_idx)]  # the keys of data_dict are strings
num_cells = len(entry['cell_list'])
print("Number of cells: \t", num_cells)
print("Bulk: \t", entry['bulk'][0], entry['bulk'][1])
print("Mutation: \t", entry['commonZ'][0], entry['commonZ'][1])

for cell_idx, current_cell in enumerate(entry['cell_list']):
    print("\nCell: \t", cell_idx)
    print("Mutation: ", current_cell.Y, "\tDropout: ", current_cell.X, "\tRead Count: ", current_cell.lc)
    print("Reads: ", current_cell.reads, "\tError probabilities: ", current_cell.p_error)

Number of cells: 	 5
Bulk: 	 [3 1] [0 1]
Mutation: 	 [3 3] [0 1]

Cell: 	 0
Mutation:  1 	Dropout:  0 	Read Count:  10
Reads:  [[3 3]
 [3 3]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [3 3]
 [0 1]
 [3 3]] 	Error probabilities:  [[7.94328235e-04 6.30957344e-04]
 [1.00000000e-03 1.00000000e-03]
 [3.16227766e-04 1.25892541e-04]
 [5.01187234e-04 1.00000000e-04]
 [3.98107171e-04 3.16227766e-04]
 [1.00000000e-04 7.94328235e-05]
 [5.01187234e-04 5.01187234e-04]
 [6.30957344e-04 3.16227766e-04]
 [1.58489319e-04 1.58489319e-04]
 [1.00000000e-04 2.51188643e-03]]

Cell: 	 1
Mutation:  1 	Dropout:  0 	Read Count:  10
Reads:  [[0 1]
 [0 1]
 [0 1]
 [3 3]
 [0 1]
 [0 1]
 [3 3]
 [3 3]
 [0 1]
 [0 1]] 	Error probabilities:  [[1.58489319e-04 1.58489319e-04]
 [7.94328235e-05 1.58489319e-04]
 [1.99526231e-04 3.16227766e-03]
 [1.00000000e-04 3.16227766e-04]
 [6.30957344e-04 1.25892541e-03]
 [1.00000000e-04 7.94328235e-05]
 [2.51188643e-04 1.58489319e-04]
 [1.00000000e-04 3.98107171e-04]
 [1.25892541e-04 7.94328235e

# What a read dictionary looks like

The cell below shows what the read dictionary of a specific position pair, i.e., read_probabilities/read_dict_0.pickle, looks like for a small dataset of 5 pairs of sites and 5 cells.

* The dictionary is keyed by cell index, stored as an integer, ranging over 0,...,num_cells-1.

    * The contents of each cell entry depend on the cell's reads. Each entry holds the log probability of the cell's reads given a particular fragment configuration, that is, $\log P(R_c | \pi_c, Q_c)$.
    * Each key of the dictionary has the following structure: 
    
        * If there are 3 types of fragments: Fi_Fj_Fk_lambdai_lambdaj_lambdak.
        * If there are 2 types of fragments: Fi_Fj_lambdai_lambdaj.
        * If there is 1 type of fragment: Fi_lambdai.
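
To make the key structure concrete, the sketch below splits a key such as `[0 0]_[0 1]_2_1` into its fragments and their lambda values. The helper name is hypothetical and the parsing rule (fragment tokens start with `[`, the remaining tokens are integers) is inferred from the keys printed in this notebook.

```python
def parse_read_key(key):
    """Split a read-dictionary key into (fragment, lambda) pairs.

    Hypothetical helper, assuming keys look like 'F1_F2_lambda1_lambda2'
    where each fragment is printed as '[a b]'.
    """
    tokens = key.split("_")
    fragments = [t for t in tokens if t.startswith("[")]
    lambdas = [int(t) for t in tokens if not t.startswith("[")]
    if len(fragments) != len(lambdas):
        raise ValueError("Expected one lambda per fragment in key: " + key)

    parsed = []
    for fragment, lam in zip(fragments, lambdas):
        # '[0 1]' -> (0, 1)
        nucleotides = tuple(int(x) for x in fragment.strip("[]").split())
        parsed.append((nucleotides, lam))
    return parsed


print(parse_read_key("[0 1]_3"))
print(parse_read_key("[0 0]_[0 1]_2_1"))
print(parse_read_key("[0 0]_[0 1]_[3 1]_1_1_1"))
```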

In [4]:
read_dict = load_dictionary("../data/2020_07_09_1/processed_data_dict/read_probabilities/read_dict_" + str(pos_idx) + ".pickle")
num_cells = len(read_dict)
print("Number of cells: \t", num_cells)

cell_idx = 2
print("Cell: ", cell_idx)

for key in sorted(read_dict[cell_idx]):
    print("\t", key, read_dict[cell_idx][key])

Number of cells: 	 5
Cell:  2
	 [0 0]_3 -30.006313102789996
	 [0 0]_[0 1]_2_1 -18.67896829167958
	 [0 0]_[0 1]_[1 1]_1_1_1 -17.887249772339878
	 [0 0]_[0 1]_[2 1]_1_1_1 -17.887249772339878
	 [0 0]_[0 1]_[3 0]_1_1_1 -27.71190773721061
	 [0 0]_[0 1]_[3 1]_1_1_1 -17.887249772339874
	 [0 0]_[0 1]_[3 2]_1_1_1 -27.71190773721061
	 [0 0]_[0 1]_[3 3]_1_1_1 -27.71190773721061
	 [0 0]_[0 2]_2_1 -28.907700814121885
	 [0 0]_[0 2]_[3 1]_1_1_1 -28.024957296629694
	 [0 0]_[0 3]_2_1 -28.907700814121885
	 [0 0]_[0 3]_[3 1]_1_1_1 -28.024957296629694
	 [0 0]_[1 0]_2_1 -38.727653649557624
	 [0 0]_[1 0]_[3 1]_1_1_1 -37.81403481009703
	 [0 0]_[1 1]_2_1 -28.718104477189637
	 [0 0]_[1 1]_[3 1]_1_1_1 -27.95601085822497
	 [0 0]_[2 0]_2_1 -38.727653649557624
	 [0 0]_[2 0]_[3 1]_1_1_1 -37.81403481009703
	 [0 0]_[2 1]_2_1 -28.718104477189637
	 [0 0]_[2 1]_[3 1]_1_1_1 -27.95601085822497
	 [0 0]_[3 0]_2_1 -38.727653649557624
	 [0 0]_[3 0]_[3 1]_1_1_1 -37.81403481009703
	 [0 0]_[3 1]_2_1 -28.718104477189637
	 [0 0]_[

In [5]:
read_dict[cell_idx]['[0 1]_3']

-0.0010462164478378467

# What a log_zcy dictionary looks like

The name zcy summarizes the key format of the dictionary: z is the mutation type, c is the cell index, and y is the cell genotype indicator. Each value of the dictionary is the following log probability:

$$ \log( \sum_{\pi_c} P(R_c|\pi_c, Q_c) P(\pi_c | G_c, B, Z, L_c, \theta) ) $$

This dictionary is then used in the cell distance calculation:

$$ \sum_Z P(Z|B,\theta) \sum_m \sum_{G_{1:C}} P(G_{1:C}|\theta) \prod_{c=1}^C \sum_{\pi_c} P(R_c|\pi_c, Q_c) P(\pi_c | G_c, B, Z, L_c, \theta) $$
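
Both expressions above contain a sum over fragment configurations $\pi_c$ of a product of two probabilities. When the terms are available as log probabilities, as in the read dictionary, such a sum is commonly evaluated with the log-sum-exp trick to avoid numerical underflow. The sketch below illustrates that trick; it is not necessarily how the pipeline computes the values, and the function names and input numbers are hypothetical.

```python
import math


def log_sum_exp(log_terms):
    """Return log(sum(exp(t) for t in log_terms)) in a numerically stable way."""
    largest = max(log_terms)
    if largest == float("-inf"):
        return largest  # every term has probability zero
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


def log_marginal(log_read_probs, log_config_probs):
    """log sum_pi P(R | pi) P(pi | ...) from per-configuration log terms."""
    joint = [lr + lp for lr, lp in zip(log_read_probs, log_config_probs)]
    return log_sum_exp(joint)


# Hypothetical values for three fragment configurations.
log_read_probs = [-30.0, -18.7, -17.9]
log_config_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
print(log_marginal(log_read_probs, log_config_probs))
```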

The cell below shows what the zcy dictionary of a specific position pair and a specific parameter setup ($p_{ado}$ and $p_{ae}$), i.e., read_probabilities/log_zcy_0.2_0.001_0.pickle, looks like for a small dataset of 5 pairs of sites and 5 cells.

* The keys have the form z_c_y, where z is the mutation type, c is the cell index, and y is the cell genotype indicator.
* The values are the log probabilities described above.
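
Because the keys are strings, sorting them directly gives lexicographic order (0, 10, 11, 1, ...), which is the order seen in the output of the next cell. The sketch below shows one way to split a z_c_y key into integers and sort numerically instead. The helper name and the example keys are hypothetical.

```python
def parse_zcy_key(key):
    """Split a 'z_c_y' key into a tuple of integers (z, c, y)."""
    z, c, y = (int(part) for part in key.split("_"))
    return z, c, y


# Hypothetical subset of keys, deliberately out of order.
keys = ["10_0_1", "0_0_0", "1_0_1", "2_4_0", "11_3_1"]
ordered = sorted(keys, key=parse_zcy_key)
print(ordered)
```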

In [6]:
zcy_dict = load_dictionary("../data/2020_07_09_1/processed_data_dict/read_probabilities/log_zcy_0.2_0.001_" + str(pos_idx) + ".pickle")

for key in sorted(zcy_dict):
    print(key, zcy_dict[key])

0_0_0 -11.368958058326092
0_0_1 -41.38264681716879
0_1_0 -11.367351948194784
0_1_1 -36.0681740135955
0_2_0 -1.8374687727305716
0_2_1 -0.04620307749754837
0_3_0 -2.661462595860279
0_3_1 -7.557905643870927
0_4_0 -8.156784678753835
0_4_1 -18.558793624110344
10_0_0 -11.368958058326092
10_0_1 -41.58204049830741
10_1_0 -11.367351948194784
10_1_1 -37.92967190802441
10_2_0 -1.8374687727305716
10_2_1 -9.843473749820452
10_3_0 -2.661462595860279
10_3_1 -10.878562929669325
10_4_0 -8.156784678753835
10_4_1 -8.107782944824475
11_0_0 -11.368958058326092
11_0_1 -43.35229821001333
11_1_0 -11.367351948194784
11_1_1 -38.70979785019817
11_2_0 -1.8374687727305716
11_2_1 -9.843473749820452
11_3_0 -2.661462595860279
11_3_1 -10.878562929669325
11_4_0 -8.156784678753835
11_4_1 -8.157155392739096
1_0_0 -11.368958058326092
1_0_1 -42.80628673261369
1_1_0 -11.367351948194784
1_1_1 -37.62366294658303
1_2_0 -1.8374687727305716
1_2_1 -1.8374687727305716
1_3_0 -2.661462595860279
1_3_1 -9.22447945116533
1_4_0 -8.15678

In [7]:
cell_idx = 0

for y in range(2):
    for z in range(12):
        key = str(z) + "_" + str(cell_idx) + "_" + str(y)
        print(key, zcy_dict[key])

0_0_0 -11.368958058326092
1_0_0 -11.368958058326092
2_0_0 -11.368958058326092
3_0_0 -11.368958058326092
4_0_0 -11.368958058326092
5_0_0 -11.368958058326092
6_0_0 -11.368958058326092
7_0_0 -11.368958058326092
8_0_0 -11.368958058326092
9_0_0 -11.368958058326092
10_0_0 -11.368958058326092
11_0_0 -11.368958058326092
0_0_1 -41.38264681716879
1_0_1 -42.80628673261369
2_0_1 -42.80628673261369
3_0_1 -11.369287902680506
4_0_1 -2.6717551704447025
5_0_1 -11.3692879026805
6_0_1 -43.35218633277957
7_0_1 -43.35218633277957
8_0_1 -42.22009130327464
9_0_1 -43.35229821001333
10_0_1 -41.58204049830741
11_0_1 -43.35229821001333
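
In the output above, the values for y = 0 are identical for every z, while the values for y = 1 depend on the mutation type. One way to read the y = 1 rows is to look for the mutation type with the largest log probability. The sketch below does this for cell 0 using the values printed above; the helper name is hypothetical.

```python
def best_mutation_type(zcy_dict, cell_idx, y=1, num_z=12):
    """Return the z whose 'z_c_y' entry has the largest log probability."""
    return max(
        range(num_z),
        key=lambda z: zcy_dict[str(z) + "_" + str(cell_idx) + "_" + str(y)],
    )


# y = 1 values for cell 0, copied from the output above.
cell0_y1 = [
    -41.38264681716879, -42.80628673261369, -42.80628673261369,
    -11.369287902680506, -2.6717551704447025, -11.3692879026805,
    -43.35218633277957, -43.35218633277957, -42.22009130327464,
    -43.35229821001333, -41.58204049830741, -43.35229821001333,
]
example_dict = {str(z) + "_0_1": value for z, value in enumerate(cell0_y1)}

best_z = best_mutation_type(example_dict, cell_idx=0)
print(best_z)
```

For cell 0 the largest value belongs to z = 4, so that entry of z_list is the natural one to inspect next.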


In [8]:
data_dict[str(pos_idx)]['z_list'][4]

array([[3, 3],
       [0, 1]])

In [9]:
data_dict[str(pos_idx)]['z_list'][3]

array([[3, 2],
       [0, 1]])

In [10]:
data_dict[str(pos_idx)]['z_list'][11]

array([[3, 1],
       [0, 0]])

Older versions of our code use the dictionary keys 'Bulk' and 'Z_list' rather than 'bulk' and 'z_list'. The code segment below saves a dictionary in the new format.

In [4]:
import pickle
def load_dictionary(filename):
    with open(filename, 'rb') as fp:
        dict_ = pickle.load(fp)
    return dict_

def save_dictionary(filename, cell_dict):
    with open(filename, 'wb') as fp:
        pickle.dump(cell_dict, fp)

In [16]:
filename = "../../sciphi_results/medium_complete_pileup/medium_50_0.6_7.5e-05_0_123/data.pickle"
orig_dict = load_dictionary(filename)

# Copy every position pair, renaming the old 'Bulk' and 'Z_list' keys.
new_dict = {}
for key, old_entry in orig_dict.items():
    new_dict[key] = {
        'mut_count': old_entry['mut_count'],
        'pos_pair': old_entry['pos_pair'],
        'commonZ': old_entry['commonZ'],
        'cell_list': old_entry['cell_list'],
        'bulk': old_entry['Bulk'],      # renamed from 'Bulk'
        'z_list': old_entry['Z_list'],  # renamed from 'Z_list'
    }

# The converted dictionary is written to the current working directory;
# the original file is left untouched.
filename = "data.pickle"
save_dictionary(filename, new_dict)
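
If it is not known in advance whether a file uses the old or the new key names, a rename table can make the conversion safe to run on either format. The sketch below is one possible variant, with hypothetical helper names and a made-up example entry; it keeps every other key unchanged.

```python
# Old key name -> new key name.
KEY_RENAMES = {'Bulk': 'bulk', 'Z_list': 'z_list'}


def rename_entry_keys(entry, renames=KEY_RENAMES):
    """Return a copy of one position-pair entry with old key names replaced."""
    return {renames.get(key, key): value for key, value in entry.items()}


def convert_dictionary(old_dict):
    """Apply the key renaming to every position pair in the dataset."""
    return {pos_idx: rename_entry_keys(entry) for pos_idx, entry in old_dict.items()}


# Made-up example with one old-format and one new-format entry.
example = {
    '0': {'mut_count': 2, 'pos_pair': '2679_2628', 'Bulk': [[3, 1], [0, 1]], 'Z_list': []},
    '1': {'mut_count': 2, 'pos_pair': '2305_2341', 'bulk': [[2, 3], [3, 3]], 'z_list': []},
}
converted = convert_dictionary(example)
print(sorted(converted['0']))
print(sorted(converted['1']))
```

Running the conversion a second time leaves the dictionary unchanged, so it does not matter if a file has already been converted.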