# Summary

**Goal: Calculate peptide distribution and protein distribution P(Zj)** from P(Xi|Zj) fluorosequencing scores 

___

**Prior to my code**

1. **Acquisition of experimental data (reads)**
    - **4-5 fluorescence level plots vs Edman degradation cycle, 1 read for each molecule**


2. **Matt's classifier converts reads into P(Xi|Zj)**, the likelihood of each partial sequence given a peptide.
    - **Calculating P(Xi) directly is impossible or at least difficult.** You might think it is simply the number of occurrences of a partial sequence divided by the number of all reads, but that only holds if the number of reads is much larger than the number of possible outcomes. It is like rolling a die a couple of times and then asking how likely each number is: the estimate is only reliable if the number of rolls adequately samples the possibilities. In our case we do not know whether it does, so it is more consistent to **calculate P(Xi|Zj) and use that to calculate P(Zj), the target variable.**
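
The die analogy can be made concrete with a small simulation. This is only an illustrative sketch: the helper name `empirical_frequencies`, the seed and the roll counts are assumptions chosen for the example, not part of the analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def empirical_frequencies(n_rolls):
    """Estimate P(face) for a fair six-sided die as count / n_rolls."""
    rolls = rng.integers(1, 7, size=n_rolls)        # faces 1..6 (upper bound is exclusive)
    counts = np.bincount(rolls, minlength=7)[1:]    # drop the unused bin for face 0
    return counts / n_rolls

freq_few = empirical_frequencies(4)         # "a couple of rolls"
freq_many = empirical_frequencies(60_000)   # many more rolls than possible outcomes

# With 4 rolls at most 4 faces can appear, so at least 2 faces are estimated as impossible.
print("4 rolls:    ", freq_few)
print("60000 rolls:", freq_many.round(3))
```

With only four rolls, at least two faces necessarily get an estimated probability of 0, while the large sample lands close to 1/6 for every face.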
    
___

**This code**

1. Calculation of peptide distribution (P(Zj)) using P(Xi|Zj) scores
    1. **imports P(Xi|Zj) table, generated by Matt's program**
    2. Asks user for settings
    3. For each bootstrap run:
        1. Create a subtable created from randomly sampling rows in P(Xi|Zj)
        2. Use this subtable of P(Xi|Zj) to run EM until P(Zj) converges. The P(Zj) calculation is like Monte Carlo integration, because P(Zj|Xi) is averaged over the sampled reads  \
           **For each EM run:**
            
           - **Chicken-and-egg problem: there are two unknowns, P(Zj|Xi) and P(Zj), and calculating either requires the other**
           - for the **first calculation of P(Zj|Xi), approximate by assuming that all P(Zj) are equal**
           - **alternate between updating P(Zj|Xi) and P(Zj) until convergence**
                
    4. Calculate the **average P(Zj) from all bootstrap runs, and calculate the 95% CI**
    
2. Calculation of protein distribution using calculated peptide distribution
    1. **Calculate P(Xi|Zj) from number of matches against digested proteome**
    1. run **bootstrap-EM, like in peptide inferring step**
        - First calculation of P(Zj|Xi) assumes equal distribution of all P(Zj)
        - **Uses peptide distribution P(Zj) values from (1) as P(Xi), which simplifies the Bayes-derived equation.** However, the principle is the same
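
The alternation between the two updates can be sketched on a toy table. This is a minimal sketch under stated assumptions, not the notebook's implementation: the function name `em_mixing_proportions`, the tolerance, and the 4 x 2 likelihood table are made up for illustration, and convergence is checked on the largest element-wise change.

```python
import numpy as np

def em_mixing_proportions(p_x_given_z, tol=1e-6, max_iter=500):
    """Estimate P(Zj) from a fixed likelihood table of shape (reads, peptides)."""
    n_reads, n_peptides = p_x_given_z.shape
    p_z = np.full(n_peptides, 1.0 / n_peptides)   # start: all peptides equally likely
    for _ in range(max_iter):
        # E-step: P(Zj|Xi) = P(Xi|Zj) P(Zj) / sum_k P(Xi|Zk) P(Zk)
        joint = p_x_given_z * p_z
        row_sums = joint.sum(axis=1, keepdims=True)
        posterior = np.divide(joint, row_sums, out=np.zeros_like(joint), where=row_sums != 0)
        # M-step: P(Zj) = average of P(Zj|Xi) over all reads, renormalised to sum to 1
        p_z_new = posterior.mean(axis=0)
        p_z_new = p_z_new / p_z_new.sum()
        if np.max(np.abs(p_z_new - p_z)) < tol:
            return p_z_new
        p_z = p_z_new
    return p_z

# Toy table (made-up numbers): 3 reads favour peptide 0, 1 read favours peptide 1.
toy = np.array([[0.9, 0.1],
                [0.9, 0.1],
                [0.9, 0.1],
                [0.1, 0.9]])
p_z = em_mixing_proportions(toy)
print(p_z)  # approaches [0.8125, 0.1875], the maximum-likelihood mixing proportions
```

Note that the result is not simply 3/4 and 1/4: because each read carries some likelihood for both peptides, the estimate is pulled towards the better-supported peptide.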
        
___
**Open questions**
- Is there a case in which bootstrapping improves the guess? Test dataset with many rare peptides
- make protein inferring work
- suggestions for data I should be showing?
    - plot of results depending on number of EM runs, compared to argmax and weighted score
    - show effect of bootstrap runs and see if it's advantageous to use that for any dataset
    - both for peptide and for protein inferring

# Functions

## import of packages

In [1]:
import numpy as np # better arrays than inbuilt arrays
import matplotlib.pyplot as plt # to plot stuff

import pandas as pd #for DataFrame tables
from IPython.display import display #to display dfs more nicely. works similar to head(), but is more flexible in how many columns/rows are shown
pd.set_option("display.max_rows", None) #to let display show full df
pd.set_option("display.max_columns", None)

import scipy.stats
#from scipy.stats import norm
import statistics

import math
import time #to measure runtime

## EM    

In [2]:
def update_p_zj_given_xi(p_xi_given_zj, p_zj, p_zj_given_xi):
    ## for PEPTIDE and PROTEIN inference (both currently use the same update)
    if inference_mode in ("peptide", "protein"):
        numerator = p_xi_given_zj * p_zj # = ARRAY with same shape as P(Xi|Zj)
        # denominator = one sum per read Xi. expand_dims turns its shape from (number of reads Xi,) into (number of reads Xi, 1).
        # The entries are identical, but without the extra axis broadcasting against the numerator fails
        denominator = np.expand_dims(np.sum(p_xi_given_zj * p_zj, axis=1), axis=-1)
        # if denominator = 0, return 0. This happens when every P(Xi|Zj)*P(Zj) term of a read is 0.
        # It could alternatively be avoided by cropping the rows where all input P(Xi|Zj) scores are 0
        p_zj_given_xi = np.divide(numerator, denominator, out=np.zeros_like(numerator), where=denominator!=0)
        p_zj_given_xi = np.transpose(p_zj_given_xi)
       
    ## for PROTEIN inference (denominator has no loop)
    elif inference_mode == "not protein - doesn't work for some reason, even though I expect it to...?":
        p_zj_given_xi = (p_xi_given_zj * p_zj)/np.expand_dims(p_xi, axis=-1)
        p_zj_given_xi = np.transpose(p_zj_given_xi)
        
    else:
        print("ERROR: inference_mode was not set to peptide or protein")
        
    return p_zj_given_xi
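
Why the extra axis on the denominator matters can be seen on a tiny example. This is a sketch with made-up numbers; `keepdims=True` is shown as an equivalent NumPy spelling, not as something the function above uses.

```python
import numpy as np

p_xi_given_zj = np.array([[0.2, 0.8],
                          [0.5, 0.5],
                          [0.0, 0.0]])   # 3 reads x 2 peptides (made-up values)
p_zj = np.array([0.5, 0.5])

numerator = p_xi_given_zj * p_zj                    # shape (3, 2)
row_sum_1d = numerator.sum(axis=1)                  # shape (3,): cannot broadcast against (3, 2)
row_sum_2d = np.expand_dims(row_sum_1d, axis=-1)    # shape (3, 1): one denominator per read
same_thing = numerator.sum(axis=1, keepdims=True)   # equivalent shortcut

# The all-zero read would give 0/0; where=... leaves the pre-filled zeros in place instead.
posterior = np.divide(numerator, row_sum_2d,
                      out=np.zeros_like(numerator), where=row_sum_2d != 0)
print(posterior)
```

The first two rows are normalised to sum to 1, and the all-zero read stays at 0 instead of producing NaN.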

In [3]:
def update_p_zj(p_zj, p_zj_given_xi):  
    if inference_mode == "peptide":
        p_zj = p_zj_given_xi.sum(axis=1)/p_xi_given_zj.shape[1]

        p_zj_sum = np.sum(p_zj, axis = 0)
        p_zj = p_zj/p_zj_sum

    elif inference_mode == "protein":
        # weight each P(Zj|Xi) by P(Xi) (= the peptide distribution from the peptide inference step) and sum over all Xi
        p_zj = np.sum(p_zj_given_xi * p_xi, axis = 1)
        p_zj_sum = np.sum(p_zj, axis = 0)
        p_zj = p_zj/p_zj_sum    
    
    else:
        print("ERROR: inference_mode was not set to peptide or protein")
    
    return p_zj

In [4]:
def EM_convergence_checker(p_zj_old, p_zj, EM_convergence_minimum):
    # compares the summed second half of the P(Zj) vector before and after the update;
    # returns True (converged) unless that sum changed by more than EM_convergence_minimum
    half = round(0.5*len(p_zj_old))
    difference_abs = abs(np.sum(p_zj_old[half:], axis=0, dtype=float) - np.sum(p_zj[half:], axis=0, dtype=float))
    return not (difference_abs > EM_convergence_minimum)
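
A check based on a sum can miss changes that cancel each other out. The sketch below shows a possible alternative criterion based on the largest element-wise change; `max_abs_change_converged` is a hypothetical helper, not something the notebook uses, and the vectors are made up.

```python
import numpy as np

def max_abs_change_converged(p_zj_old, p_zj, tolerance):
    """Hypothetical alternative: True once no single P(Zj) moved by more than tolerance."""
    change = np.max(np.abs(np.asarray(p_zj) - np.asarray(p_zj_old)))
    return bool(change <= tolerance)

old = np.array([0.25, 0.25, 0.30, 0.20])
new = np.array([0.25, 0.25, 0.20, 0.30])   # two entries swap mass, their sum is unchanged

# The summed second half is identical in both vectors, so a sum-based check sees no change ...
print(abs(old[2:].sum() - new[2:].sum()))
# ... while the element-wise check still reports that the estimate moved.
print(max_abs_change_converged(old, new, 1e-4))
```

Whether the stricter criterion is worth the extra EM loops depends on the dataset.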

In [5]:
def EM(p_xi_given_zj):
    ### initialise p_zj_given_xi
    # p_xi_given_zj = np.arange(0,102,1).reshape(34,3) # for testing EM iterating through columns and rows
    p_zj_given_xi = np.full((p_xi_given_zj.shape[1], p_xi_given_zj.shape[0]), 0, dtype=float) #Initialisation based on array size of p_xi_given_zj -- same size, but transposed
    
    ### initialise p_zj
    n = p_xi_given_zj.shape[1] # number of peptides
    p_zj_initial = 1/n #initial approximation: all zj equally likely, to jumpstart first iteration

    global p_zj
    p_zj = np.full(n, p_zj_initial)
    p_zj_old = np.full(n, 0)
    
    global EM_loopcounter 
    EM_loopcounter = 0
    EM_convergence_checker_result = False
        
    while not EM_convergence_checker_result and EM_loopcounter < EM_loopcounter_max:
        p_zj_given_xi = update_p_zj_given_xi(p_xi_given_zj, p_zj, p_zj_given_xi)
        
        p_zj_old = np.copy(p_zj, order='K', subok=False)
        p_zj = update_p_zj(p_zj, p_zj_given_xi)
        EM_convergence_checker_result = EM_convergence_checker(p_zj_old, p_zj, EM_convergence_minimum)

        EM_loopcounter = EM_loopcounter + 1
    return p_zj

## bootstrapping

**create_subarray_of_p_xi_given_zj(p_xi_given_zj)**

   - Returns subarray of P(Xi|Zj) input data. 
   - Subarray's size is a fraction of the full data size; this fraction is user-specified (default: 0.8). Sampling is done "with replacement".
   
**bootstrap_EM()**

   - stores the P(Zj) values after bootstrap/EM in the global p_zj_bootstrap_results_fraction, in relative values (i. e. in numbers <= 1)
   - if bootstrap_sampled_fraction is set to -1, no bootstrapping occurs, and just one EM using full P(Xi|Zj) dataset is run
   - else, multiple bootstraps are run (default: 200), in which each EM uses only a subarray created from the full dataset using the create_subarray_of_p_xi_given_zj function
   
**bootstrap_EM_analytics_AVG()**

   - stores the average P(Zj) from all bootstrap runs in the global p_zj_bootstrap_results_fraction_avg
   
**bootstrap_EM_analytics_CI()**

   - returns: lower and upper bound of the confidence interval over all bootstrap runs (set by the global CI_percent, default: 95% CI)
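
Row resampling with replacement can also be written directly in NumPy. This is a sketch of the idea only: `resample_rows`, the seed and the stand-in table are assumptions for illustration, and the function below uses `DataFrame.sample(frac=..., replace=True)` instead.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen arbitrarily for reproducibility

def resample_rows(table, fraction, rng):
    """Draw round(fraction * n_rows) rows with replacement, so rows may repeat."""
    n_rows = table.shape[0]
    n_draw = int(round(fraction * n_rows))
    idx = rng.integers(0, n_rows, size=n_draw)   # random row indices, repeats allowed
    return table[idx]

table = np.arange(20, dtype=float).reshape(10, 2)   # stand-in for a P(Xi|Zj) table
sub = resample_rows(table, 0.8, rng)
print(sub.shape)   # (8, 2)
```

Each bootstrap run would then feed such a subarray into the EM and collect the resulting P(Zj) vector as one row of the results table.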

In [6]:
def create_subarray_of_p_xi_given_zj(p_xi_given_zj):
    df_p_xi_given_zj = pd.DataFrame(p_xi_given_zj)
    df_p_xi_given_zj_sample = df_p_xi_given_zj.sample(frac=bootstrap_sampled_fraction, axis='rows', replace=True) # filters for a random partial dataset
    # display(pd.DataFrame(df_p_xi_given_zj_sample))
    p_xi_given_zj_subarray = df_p_xi_given_zj_sample.to_numpy()
    
    # print("p_xi_given_zj_subarray")
    # display(pd.DataFrame(p_xi_given_zj_subarray))
    
    return p_xi_given_zj_subarray

In [7]:
def bootstrap_EM():
    global p_zj_bootstrap_results_fraction
    global p_zj_peptide_copy
    global p_xi
    i = 0
    
    p_zj_bootstrap_results_absolute = np.full((p_xi_given_zj.shape[1]), 0, dtype=float)
      
    if bootstrap_sampled_fraction == -1: # no bootstrapping, i. e. use full dataset for EM. Since the EM is deterministic, there is no point in running multiple bootstrap runs
        if inference_mode == "peptide": 
            p_xi_given_zj_subarray = p_xi_given_zj
            p_zj_bootstrap_results_absolute = EM(p_xi_given_zj_subarray)

            p_zj_bootstrap_results_fraction = p_zj_bootstrap_results_absolute/np.sum(p_zj_bootstrap_results_absolute, axis = 0)
            print("Peptide Zj values from every bootstrap run (columns: peptides, rows: bootstrap run, displayed as fractions: ")
    
        elif inference_mode == "protein": # only difference: uses the p(zj) values from peptide inference as the p_xi values. Also, update_p_zj() called by EM() uses a different formula than in peptide inference
            p_xi = np.copy(p_zj_bootstrap_results_fraction_avg) # input is just a vector, because no bootstrapping.

            p_xi_given_zj_subarray = p_xi_given_zj
            p_zj_bootstrap_results_absolute = EM(p_xi_given_zj_subarray)

            p_zj_bootstrap_results_fraction = p_zj_bootstrap_results_absolute/np.sum(p_zj_bootstrap_results_absolute, axis = 0)
            print("PRE-CORRECTION: Protein Zj values from every bootstrap run (columns: protein, rows: bootstrap run, displayed as fractions: ")
    
    elif bootstrap_sampled_fraction != -1: # with bootstrapping, i. e. a subarray of the dataset is created for each bootstrap run
        if inference_mode == "peptide":
            while i < n_bootstrap_runs:
                p_xi_given_zj_subarray = create_subarray_of_p_xi_given_zj(p_xi_given_zj) 
                if i == 0:
                    p_zj_bootstrap_results_absolute = EM(p_xi_given_zj_subarray)
                else:
                    p_zj_bootstrap_results_absolute = np.vstack((p_zj_bootstrap_results_absolute, EM(p_xi_given_zj_subarray)))

                print("Bootstrap run #", i, ". EM loops: ", EM_loopcounter, sep="")
                if EM_loopcounter == EM_loopcounter_max:
                    print("WARNING: EM_loopcounter_max was reached, so convergence has likely not been reached yet. Consider increasing the maximum number of EM loops.")

                i = i + 1
            p_zj_bootstrap_results_fraction = p_zj_bootstrap_results_absolute/np.sum(p_zj_bootstrap_results_absolute, axis = 1, keepdims = True) # normalise every bootstrap run (row) by its own sum

        elif inference_mode == "protein":
            p_zj_peptide_copy = np.copy(p_zj_bootstrap_results_fraction) # input: an array

            while i < n_bootstrap_runs:
                p_xi_given_zj_subarray = p_xi_given_zj
                p_xi = p_zj_peptide_copy[i]
                
                if i == 0:
                    p_zj_bootstrap_results_absolute = EM(p_xi_given_zj_subarray)
                else:
                    p_zj_bootstrap_results_absolute = np.vstack((p_zj_bootstrap_results_absolute, EM(p_xi_given_zj_subarray)))

                print("Bootstrap run #", i, ". EM loops: ", EM_loopcounter, sep="")
                if EM_loopcounter == EM_loopcounter_max:
                    print("WARNING: EM_loopcounter_max was reached, so convergence has likely not been reached yet. Consider increasing the maximum number of EM loops.")

                i = i + 1
                
            p_zj_bootstrap_results_fraction = p_zj_bootstrap_results_absolute/np.sum(p_zj_bootstrap_results_absolute, axis = 1, keepdims = True) # normalise every bootstrap run (row) by its own sum

In [8]:
def bootstrap_EM_analytics_AVG():
    global p_zj_bootstrap_results_fraction_avg
    
    if bootstrap_sampled_fraction == -1:
        p_zj_bootstrap_results_fraction_avg = np.copy(p_zj_bootstrap_results_fraction,order='K')
    elif bootstrap_sampled_fraction != -1:
        p_zj_bootstrap_results_fraction_avg = np.sum(p_zj_bootstrap_results_fraction, axis = 0)/p_zj_bootstrap_results_fraction.shape[0]

In [9]:
def bootstrap_EM_analytics_CI():
    n_runs = len(p_zj_bootstrap_results_fraction)
    CI_fraction = CI_percent/100 if CI_percent > 1 else CI_percent # accepts 95 as well as 0.95
    tail = (1 - CI_fraction)/2 # fraction cut off on each side, e.g. 0.025 for a 95% CI
    lowindex = math.floor(tail*n_runs)
    highindex = math.ceil((1 - tail)*n_runs)
    
    p_zj_sorted = np.sort(p_zj_bootstrap_results_fraction, axis=0) # axis=0: sorting along each column
    bootstrap_CI_max = p_zj_sorted[lowindex:highindex][-1] # last row of the slice, i.e. the highest values inside the CI
    bootstrap_CI_min = p_zj_sorted[lowindex:highindex][0] # first row of the slice, i.e. the lowest values inside the CI
    bootstrap_CI_minmax = np.stack((bootstrap_CI_min, bootstrap_CI_max))
    
    ### show full sorted table
    # global p_zj_bootstrap_results_fraction_CI
    # print("The", CI_percent*100, "confidence interval of all zj bootstrapping values sorted, displayed as fractions: ")    
    # p_zj_bootstrap_results_fraction_CI = np.sort(p_zj_bootstrap_results_fraction, axis=0, kind="quicksort", order=None)[lowindex:highindex:] #axis=sorting along each column
    # display(pd.DataFrame(p_zj_bootstrap_results_fraction_CI))
    
    return bootstrap_CI_minmax
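
An equal-tailed interval can also be read off with `np.percentile`, which avoids manual index arithmetic. The sketch below uses made-up stand-in data (200 runs, 3 peptides) and is an alternative formulation, not the code used above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for bootstrap results: 200 runs (rows) x 3 peptides (columns), made-up values.
runs = rng.normal(loc=[0.5, 0.3, 0.2], scale=0.02, size=(200, 3))

ci = 0.95
tail = (1 - ci) / 2 * 100                          # 2.5 % cut off on each side
lower = np.percentile(runs, tail, axis=0)          # per-peptide lower bound
upper = np.percentile(runs, 100 - tail, axis=0)    # per-peptide upper bound
inside = ((runs >= lower) & (runs <= upper)).mean(axis=0)

print(np.stack((lower, upper)))
print(inside)   # fraction of runs inside the interval, per peptide
```

Cutting the same fraction from both ends is what makes this a 95% interval; trimming 5% from each side would give a 90% interval instead.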

In [10]:
def correct_protein_pzj():
    global p_zj_bootstrap_results_fraction_avg
    p_zj_bootstrap_results_fraction_avg = p_zj_bootstrap_results_fraction_avg/dyeseq_in_prot_match_count / np.sum(p_zj_bootstrap_results_fraction_avg/dyeseq_in_prot_match_count)

## output

In [11]:
def output():
    global p_zj_bootstrap_results_fraction_avg
    bootstrap_EM_analytics_AVG()
    bootstrap_CI_minmax = bootstrap_EM_analytics_CI()

    ### STDs (alternative to CI)
    p_zj_bootstrap_results__fraction_std = np.std(p_zj_bootstrap_results_fraction, axis = 0)
                 
    # Combines AVG zj values (from all bootstrap runs), the STD, and the two bounds for the user-chosen confidence interval in one table
    if bootstrap_sampled_fraction == -1: #ie if no bootstrapping. Fills -CI and +CI columns with "N/A" values
        p_zj_avg_plus_CI = np.stack((p_zj_bootstrap_results_fraction_avg*100, np.full((len(p_zj_bootstrap_results_fraction_avg)), fill_value="N/A", dtype=None), np.full((len(p_zj_bootstrap_results_fraction_avg)), fill_value="N/A", dtype=None), np.full((len(p_zj_bootstrap_results_fraction_avg)), fill_value="N/A", dtype=None)), axis=1)    
        display(pd.DataFrame(p_zj_avg_plus_CI, columns = ["AVG [%]", "±STD [%]", "-CI [%]", "+CI [%]"]).astype({"AVG [%]":float}).round(4))
    else:
        p_zj_avg_plus_CI = np.stack((p_zj_bootstrap_results_fraction_avg, p_zj_bootstrap_results__fraction_std, bootstrap_CI_minmax[0], bootstrap_CI_minmax[1]), axis = 1)
        display((pd.DataFrame(p_zj_avg_plus_CI, columns = ["AVG [%]", "±STD [%]", "-CI [%]", "+CI [%]"])*100).round(4)) # convert to percent first, then round, so both branches show 4 decimals

    ### print user-settings
    print('\033[1m' + 'Settings of this run' + '\033[0m')
    print("Inference mode:", inference_mode)
    print("EM_convergence_minimum:", EM_convergence_minimum)
    print("EM_loopcounter_max:", EM_loopcounter_max)
    print("bootstrap_sampled_fraction:", bootstrap_sampled_fraction)
    print("n_bootstrap_runs:", n_bootstrap_runs)
    print("CI_percent:", CI_percent, "\n")

    ### print other
    print("Runtime: %s seconds ---" % round((time.time() - start_time),2)) #run time

# call functions

In [12]:
# esc, ctrl+a, ctrl+enter to run all cells

#p_xi_given_zj = create_3_gaussians_and_calculate_p_xi_given_zj(140, 150, 160, 15, 15, 15, 5, 5, 5) #mean1-3, std1-3, n1-3
# plot_histograms_and_pdfs_from_gaussians(140, 150, 160, 15, 15, 15, 5, 5, 5)

EM_convergence_minimum = float(input("EM_convergence_minimum? If nothing is entered, it is set to 0.0001.") or "0.0001")
EM_loopcounter_max = int(input("Maximum number of EM runs (per bootstrap run)? If nothing is entered, it is set to 200.") or "200") 
bootstrap_sampled_fraction = float(input("Fraction of subarray sampled for each bootstrap run? If nothing is entered, it is set to 0.8. If -1 is entered, bootstrapping is turned off. Note: Bootstrapping is always turned OFF for the protein inference part.") or "0.8")                   
n_bootstrap_runs = int(input("Number of bootstrap runs? If nothing is entered, it is set to 200.") or "200") 
CI_percent = float(input("Confidence interval? If nothing is entered, it is set to 95.") or "95")

start_time = time.time() # to start measuring runtime
p_xi_given_zj = np.genfromtxt('../110 data set/110-peps-shuffled-scores.csv', delimiter=',') # import P(Xi|Zj) table (on the read level - i. e. Likelihood of any given read Xi assuming a peptide Zj), calculated by Matt

EM_convergence_minimum? If nothing is entered, it is set to 0.0001. 
Maximum number of EM runs (per bootstrap run)? If nothing is entered, it is set to 200. 
Fraction of subarray sampled for each bootstrap run? If nothing is entered, it is set to 0.8. If -1 is entered, bootstrapping is turned off. Note: Bootstrapping is always turned OFF for the protein inference part. -1
Number of bootstrap runs? If nothing is entered, it is set to 200. 
Confidence interval? If nothing is entered, it is set to 95. 


In [13]:
inference_mode = "peptide"
bootstrap_EM()
output()

Peptide Zj values from every bootstrap run (columns: peptides, rows: bootstrap run, displayed as fractions: 


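|   | AVG [%] | ±STD [%] | -CI [%] | +CI [%] |
|---|---------|----------|---------|---------|
| 0 | 3.9375  |          |         |         |
| 1 | 5.7233  |          |         |         |
| 2 | 0.6284  |          |         |         |
| 3 | 0.5387  |          |         |         |
| 4 | 3.1243  |          |         |         |
| 5 | 0.0253  |          |         |         |
| 6 | 1.7519  |          |         |         |
| 7 | 1.5901  |          |         |         |
| 8 | 4.835   |          |         |         |
| 9 | 1.8827  |          |         |         |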


**Settings of this run**
Inference mode: peptide
EM_convergence_minimum: 0.0001
EM_loopcounter_max: 200
bootstrap_sampled_fraction: -1.0
n_bootstrap_runs: 200
CI_percent: 95.0 

Runtime: 21.65 seconds ---


In [14]:
inference_mode = "protein"

dyeseq_in_prot_match_count = np.genfromtxt("../110 data set/dyeseq_in_prot_match_count.csv", delimiter=',') # To calculate this the target proteome needs to be virtually digested and labelled (i.e. turned into dye sequences). Next, the reads' dye sequences are matched against this, and the number of matches per protein is counted. I did not write any code to solve this particular problem, but it seems like it should be fairly easy
p_xi_given_zj = np.genfromtxt("../110 data set/p_xi_given_zj prot_infer input.csv", delimiter=',') # The likelihood for any dye seq Xi given a protein Zj is equal to the number of matches of that dye seq in a particular protein, divided by the sum of all dye seq matches of that protein. I did not write any code to solve this problem either

bootstrap_EM()
output()

PRE-CORRECTION: Protein Zj values from every bootstrap run (columns: protein, rows: bootstrap run, displayed as fractions: 


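|   | AVG [%] | ±STD [%] | -CI [%] | +CI [%] |
|---|---------|----------|---------|---------|
| 0 | 29.0558 |          |         |         |
| 1 | 26.3001 |          |         |         |
| 2 | 19.8327 |          |         |         |
| 3 | 12.6345 |          |         |         |
| 4 | 6.906   |          |         |         |
| 5 | 3.242   |          |         |         |
| 6 | 0.9698  |          |         |         |
| 7 | 0.1493  |          |         |         |
| 8 | 0.6909  |          |         |         |
| 9 | 0.2189  |          |         |         |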


**Settings of this run**
Inference mode: protein
EM_convergence_minimum: 0.0001
EM_loopcounter_max: 200
bootstrap_sampled_fraction: -1.0
n_bootstrap_runs: 200
CI_percent: 95.0 

Runtime: 21.69 seconds ---
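
The protein-level P(Xi|Zj) input described in the comments above (matches of a dye sequence in a protein, divided by all dye sequence matches of that protein) amounts to a column normalisation of a match-count table. The sketch below assumes a hypothetical count matrix with dye sequences as rows and proteins as columns; the numbers are made up and the real input files may be laid out differently.

```python
import numpy as np

# Hypothetical match counts: rows = dye sequences Xi, columns = proteins Zj.
match_counts = np.array([[2, 0, 1],
                         [1, 3, 0],
                         [1, 1, 1]], dtype=float)

# Total number of dye sequence matches per protein, kept 2-D for broadcasting.
matches_per_protein = match_counts.sum(axis=0, keepdims=True)   # shape (1, n_proteins)

# P(Xi|Zj) = matches of Xi in Zj / all matches in Zj; proteins without matches stay 0.
p_xi_given_zj_protein = np.divide(match_counts, matches_per_protein,
                                  out=np.zeros_like(match_counts),
                                  where=matches_per_protein != 0)
print(p_xi_given_zj_protein)
```

Every column of the result sums to 1, as expected for a likelihood over dye sequences given one protein.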


# Legacy stuff

## Generation of random distributions from 3 Gaussians and plotting them

In [15]:
def create_data_from_3_gaussians(mean1, mean2, mean3, std1, std2, std3, n1, n2, n3): # Creation of random data points from multiple Gaussians ki
    k1 = np.random.normal(mean1, std1, n1) #creates array with values created through Gaussian
    k2 = np.random.normal(mean2, std2, n2)
    k3 = np.random.normal(mean3, std3, n3)

    return np.concatenate([k1, k2, k3])

In [16]:
def p_xi_given_zj_from_gaussian_datapoints(kall, mean1, mean2, mean3, std1, std2, std3):
    # Calculating p_xi_given_zj (this is what Matt is working on with the simulated data)

    pdf_probability_k1 = scipy.stats.norm.pdf(kall, loc=mean1, scale=std1)
    pdf_probability_k2 = scipy.stats.norm.pdf(kall, loc=mean2, scale=std2)
    pdf_probability_k3 = scipy.stats.norm.pdf(kall, loc=mean3, scale=std3)

    p_xi_given_zj = np.vstack((pdf_probability_k1,pdf_probability_k2,pdf_probability_k3))
    p_xi_given_zj = np.transpose(p_xi_given_zj)

    print("p_xi_given_zj")
    display(pd.DataFrame(p_xi_given_zj))
    print("\n")
    
    print("likeliest peptide z of each datapoint x according to scipy.stats.norm.pdf (should be more accurate than EM because its dedicated to Gaussians)")
    print(np.argmax(p_xi_given_zj, axis=1)) # for each row of P(Xi|Zj), i.e. each data point Xi, the index of the most likely peptide is returned
    print("\n")
    
    return p_xi_given_zj

In [17]:
def create_3_gaussians_and_calculate_p_xi_given_zj(mean1, mean2, mean3, std1, std2, std3, n1, n2, n3):
    kall = create_data_from_3_gaussians(mean1, mean2, mean3, std1, std2, std3, n1, n2, n3)
    p_xi_given_zj = p_xi_given_zj_from_gaussian_datapoints(kall, mean1, mean2, mean3, std1, std2, std3)
    
    return p_xi_given_zj

In [18]:
def plot_histograms_and_pdfs_from_gaussians(mean1, mean2, mean3, std1, std2, std3, n1, n2, n3):
    k1 = np.random.normal(mean1, std1, n1) #creates array with values created through Gaussian
    k2 = np.random.normal(mean2, std2, n2)
    k3 = np.random.normal(mean3, std3, n3)
    
    # plotting histograms
    nbins = 50
    plt.hist(k1, label = "Peptide 0", bins=nbins, alpha=0.3, density=True, color="orange") # alpha=transparency, density=True normalises to 1 
    plt.hist(k2, label = "Peptide 1", bins=nbins, alpha=0.3, density=True, color="green")
    plt.hist(k3, label = "Peptide 2", bins=nbins, alpha=0.3, density=True, color="blue")

    # PDF plot
    xmin, xmax = plt.xlim() #finds lower and upper bounds of histogram data
    x = np.linspace(start=xmin, stop=xmax, num=100) #num is the number of returned data points - the more points, the finer the fit is plotted
    p1 = scipy.stats.norm.pdf(x, mean1, std1)
    p2 = scipy.stats.norm.pdf(x, mean2, std2)
    p3 = scipy.stats.norm.pdf(x, mean3, std3)

    plt.plot(x, p1, linewidth=2, color = "orange", label = "Gauss function k1: mean = {:.2f}, STD = {:.2f}".format(mean1, std1))
    plt.plot(x, p2, linewidth=2, color = "green", label = "Gauss function k2: mean = {:.2f}, STD = {:.2f}".format(mean2, std2))
    plt.plot(x, p3, linewidth=2, color = "blue", label = "Gauss function k3: mean = {:.2f}, STD = {:.2f}".format(mean3, std3))

    plt.legend(loc='upper right')
    plt.title("Histograms and PDFs of the three simulated peptides")

    plt.show()

## EM functions WITHOUT broadcasting (slow as hell!)

In [19]:
def NOBROADCAST_update_p_zj_given_xi(p_xi_given_zj, p_zj, p_zj_given_xi):
    ## for PEPTIDE and PROTEIN inference (both currently use the same update)
    if inference_mode in ("peptide", "protein"):
        denominator = 0
        for i, row in enumerate(p_xi_given_zj): # Calculating/Updating P(Zj|Xi)
            #print("ROW of P(Xi|Zj):", i)
            for j, cell in enumerate(row):
                #print("COLUMN of P(Xi|Zj):", j)
                numerator = cell * p_zj[j]
                #print("numerator:", numerator)

                for l, cell in enumerate(p_zj):
                    # print("cell", i, l, p_xi_given_zj[i, l], end="")
                    # print(" * zl", p_zj[l])
                    denominator = denominator + p_xi_given_zj[i, l] * p_zj[l]
                
                # print("p_zj_given_xi[j][i] = numerator/denominator", numerator, "/", denominator)
                #display(pd.DataFrame(p_zj_given_xi))
                p_zj_given_xi[j][i] = numerator/denominator
                denominator = 0                                
        #display(pd.DataFrame(p_zj_given_xi))    
        #print((np.argmax(p_zj_given_xi, axis=1))) # reports index of max value from each row

        
    ## for PROTEIN inference (denominator has no loop)
    elif inference_mode == "not protein idk why this doesnt work":
        # print("p_xi_given_zj (subarray)")
        # display(pd.DataFrame(p_xi_given_zj))
        for i, row in enumerate(p_xi_given_zj): # Calculating/Updating P(Zj|Xi)#
            # print("ROW of P(Xi|Zj):", i)
            for j, cell in enumerate(row):
                # print("COLUMN of P(Xi|Zj):", j)
                # print("p_zj_given_xi", cell, "*",  p_zj[j], "/", p_xi[i],"=", cell*p_zj[j]/p_xi[i])
                p_zj_given_xi[j][i] = cell * p_zj[j]/p_xi[i]

        # display(pd.DataFrame(p_zj_given_xi))
       # print((np.argmax(p_zj_given_xi, axis=1))) # reports index of max value from each row
        
    else:
        print("ERROR: inference_mode was not set to peptide or protein")
        
    return p_zj_given_xi

In [20]:
def NOBROADCAST_update_p_zj(p_zj, p_zj_given_xi):  
    if inference_mode == "peptide":
        for j, element in enumerate(p_zj): #updating the expectation value of Zj
            p_zj[j] = p_zj_given_xi[j].sum()/p_xi_given_zj.shape[1] # divide by number of peptides z

        p_zj_sum = np.sum(p_zj, axis = 0)
        p_zj = p_zj/p_zj_sum

        # print("EM loop", EM_loopcounter, ")")
        # display(pd.DataFrame(p_zj))
    
    elif inference_mode == "protein":
        p_zj.fill(0) # overwrite all values of pzj with 0

        print("p_zj_given_xi after update")
        display(pd.DataFrame(p_zj_given_xi))
        
        for j, row in enumerate(p_zj_given_xi):            
            for i, element in enumerate(row):
                print("p_zj_given_xi[j][i] * p_xi[i]", p_zj_given_xi[j][i], p_xi[i], p_zj_given_xi[j][i] * p_xi[i])
                p_zj[j] = p_zj[j] + (p_zj_given_xi[j][i] * p_xi[i])
        
        p_zj_sum = np.sum(p_zj, axis = 0)
        p_zj = p_zj/p_zj_sum
        print("p_zj (EM loop", EM_loopcounter, ")")
        display(pd.DataFrame(p_zj))
    
    else:
        print("ERROR: inference_mode was not set to peptide or protein")
    
    return p_zj
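
That the loop and broadcasting variants of the P(Zj|Xi) update agree can be checked on random input. This is a condensed sketch with hypothetical helper names (`e_step_loops`, `e_step_broadcast`) and made-up data; unlike the legacy function above, the loop version computes the denominator once per read.

```python
import numpy as np

def e_step_loops(p_xi_given_zj, p_zj):
    """Nested-loop update of P(Zj|Xi); result has shape (peptides, reads)."""
    n_reads, n_peptides = p_xi_given_zj.shape
    out = np.zeros((n_peptides, n_reads))
    for i in range(n_reads):
        # denominator: sum over all peptides of P(Xi|Zl) * P(Zl) for this read
        denominator = sum(p_xi_given_zj[i, l] * p_zj[l] for l in range(n_peptides))
        for j in range(n_peptides):
            out[j, i] = p_xi_given_zj[i, j] * p_zj[j] / denominator
    return out

def e_step_broadcast(p_xi_given_zj, p_zj):
    """Same calculation with NumPy broadcasting."""
    numerator = p_xi_given_zj * p_zj
    return (numerator / numerator.sum(axis=1, keepdims=True)).T

rng = np.random.default_rng(3)
table = rng.random((50, 4)) + 0.01   # strictly positive made-up likelihoods
prior = np.full(4, 0.25)
print(np.allclose(e_step_loops(table, prior), e_step_broadcast(table, prior)))
```

The broadcasting version replaces three nested Python loops with a handful of array operations, which is where the speed difference comes from.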