# A quick guide to using msBayesImpute for imputing missing values in mass-spectrometry proteomics/metabolomics data

msBayesImpute is a method that combines Bayesian factorization models with probabilistic dropout models to impute missing values in mass-spectrometry (MS) proteomics data, accounting for the fact that missing values in MS data do not occur at random.  
msBayesImpute is built on Pyro, a universal probabilistic programming language (PPL) written in Python.  
The Python packages required to run msBayesImpute are **numpy** and **pandas** (for reading and writing csv files).

## Install package

In [1]:
pip install ../dist/msbayesimputepy-0.1.0-py3-none-any.whl

Processing /Users/jiaojiaohe/Desktop/Project2 - Imputation/Factorization/msBayesImpute/dist/msbayesimputepy-0.1.0-py3-none-any.whl
Installing collected packages: msbayesimputepy
Successfully installed msbayesimputepy-0.1.0
Note: you may need to restart the kernel to use updated packages.


## Import required packages

In [2]:
from msbayesimputepy.generation import gen_data, gen_prob_miss
from msbayesimputepy.core import msBayesImpute
import numpy as np
import pandas as pd

## Generate data with artificial missing values

- The complete data are generated by specifying the number of proteins (D), the number of samples (N), the number of latent factors (K), and the protein intercept (feature_inter).
- Missing values are then introduced using a probabilistic dropout distribution characterized by an inflection parameter and a scale parameter.
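
To make the idea of a dropout curve concrete, here is a minimal sketch that is independent of the package. It assumes a logistic curve in which the probability of a value being missing falls as intensity rises; the exact functional form used by `gen_prob_miss` is not stated in this guide and may differ. The function name `dropout_probability` and the toy data are hypothetical.

```python
import numpy as np

def dropout_probability(intensity, inflection, scale):
    """Probability that a value is missing, ASSUMING a logistic dropout curve.

    inflection: intensity at which the missing probability is 0.5
    scale:      controls how quickly the probability changes around the inflection
    """
    intensity = np.asarray(intensity, dtype=float)
    return 1.0 / (1.0 + np.exp((intensity - inflection) / scale))

rng = np.random.default_rng(0)

# Toy complete matrix of log-intensities (5 features x 4 samples), illustration only.
complete = rng.normal(loc=20.0, scale=3.0, size=(5, 4))

# Low intensities get a high missing probability, high intensities a low one.
p_miss = dropout_probability(complete, inflection=17.5, scale=2.0)

# Draw the missingness mask and blank out the selected entries.
mask = rng.random(complete.shape) < p_miss
with_missing = np.where(mask, np.nan, complete)
```

Under this assumed form, a value exactly at the inflection point is dropped half of the time, and a smaller scale makes the transition between "mostly observed" and "mostly missing" sharper.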

In [3]:
# generate data
np.random.seed(202411)
D = 5000
N = 200
K = 10
feature_inter = 20
ratio = 0.1
generation = gen_data(n_features = D, n_samples = N, n_factors = K, alpha_col = feature_inter)
generate_data = generation["data"] 

In [4]:
# generate missings
simuData = gen_prob_miss(generate_data, rho = 17.5, zeta = 2, rho_sd = 2, zeta_sd = 1, model = "perFeature")
simuData["X_miss"]

Unnamed: 0,sample_1,sample_2,sample_3,sample_4,sample_5,sample_6,sample_7,sample_8,sample_9,sample_10,...,sample_191,sample_192,sample_193,sample_194,sample_195,sample_196,sample_197,sample_198,sample_199,sample_200
feature_1,,,,,,,17.30993,,19.36536,,...,,23.28569,,,,,,,,
feature_2,21.35771,18.15361,25.70704,16.94694,20.50456,24.48452,21.99810,22.63902,25.21135,21.02091,...,27.28163,26.39487,20.52996,,,24.08542,26.66449,20.79465,20.65267,
feature_3,21.27044,,,15.46760,13.54686,25.59514,20.41242,,22.90487,18.25608,...,24.31340,25.65968,,19.79213,,23.21405,22.11616,20.80227,22.55273,
feature_4,,,20.70437,20.01081,23.54113,,23.70579,,,20.37735,...,19.97506,21.75430,,22.94145,21.23125,20.16064,,23.32555,,
feature_5,19.24773,19.72803,24.08629,,,26.66391,20.33935,17.18838,24.08511,23.71116,...,25.31210,24.71970,19.46055,27.40503,,26.30030,25.52853,20.54043,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
feature_4996,,,19.60327,21.05771,19.93693,22.27992,20.48328,20.83228,20.71718,19.82733,...,23.72726,22.78724,,19.25925,18.15226,,,,,
feature_4997,,22.32774,25.81730,19.30176,,24.40884,21.64394,20.78337,23.04999,,...,,,19.71806,26.14963,21.62542,21.28454,21.17954,24.10425,21.73265,
feature_4998,22.44802,21.05945,20.06088,22.95352,21.75321,,20.05362,,22.30025,25.25971,...,,,,23.82556,25.43966,23.65458,,26.92724,,
feature_4999,,22.63368,24.66836,22.98442,17.88588,24.81728,16.86489,23.40688,19.94300,18.62826,...,26.30401,18.09135,26.18410,30.52415,22.66261,,23.26888,16.76908,19.38203,21.79644


## Read example dataset

Before running the model, two preprocessing steps are required:
1. Remove proteins/metabolites that are missing in all samples.
2. Ensure the data are approximately normally distributed, for example by applying a log transformation to the raw data.
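
The two steps above can be sketched with pandas as follows. This is an illustration on a small made-up table, not part of msBayesImpute: the sample and protein names are hypothetical, the choice of log2 is one common option, and treating non-positive intensities as missing is an assumption that may not suit every dataset.

```python
import numpy as np
import pandas as pd

# Toy raw-intensity table: rows are proteins, columns are samples (made-up values).
raw = pd.DataFrame(
    {
        "S1": [1024.0, np.nan, 0.0, 4096.0],
        "S2": [2048.0, np.nan, 512.0, np.nan],
        "S3": [512.0, np.nan, 256.0, 8192.0],
    },
    index=["P1", "P2", "P3", "P4"],
)

# ASSUMPTION: zeros (or negative values) encode "not detected", so mark them as missing.
raw = raw.where(raw > 0)

# Step 1: drop proteins with no observed value in any sample.
filtered = raw.dropna(axis=0, how="all")

# Step 2: log-transform so the intensities are closer to a normal distribution.
log_data = np.log2(filtered)
```

Here `P2` is removed because it is missing in every sample, while partially observed proteins such as `P3` and `P4` are kept for imputation.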

In [5]:
#read in data as dataframe
df = pd.read_csv('../data/hematology_proteomics_example.csv', delimiter = "\t", index_col = 0)
df

Unnamed: 0,S1199,S1207,S1215,S1223,S1231,S1239,S1247,S1255,S1263,S1271,...,S1847,S1855,S1863,S1871,S1879,S1887,S1895,S1903,S1911,S1919
A0A0B4J2D5;P0DPI2,16.116681,14.742736,15.510665,16.556805,15.112663,15.943939,15.275033,15.111898,16.985809,15.966992,...,15.830669,13.861688,15.567502,15.659748,13.688950,15.265659,14.454235,16.284690,15.164003,15.867440
A0AVT1,,16.177974,15.526808,16.253670,15.365721,15.239020,15.351181,15.713115,15.305431,15.777399,...,15.724647,15.954483,16.077510,15.943236,,16.492938,16.154227,16.133190,16.425780,15.866668
A0FGR8,14.973173,15.783661,15.545794,14.591382,17.044608,17.667375,16.985008,16.510355,14.922426,14.792912,...,14.500892,14.670662,14.679497,14.393303,14.516697,14.170363,14.900853,15.146115,15.208181,15.244843
A0JLT2,,12.939801,13.501315,,,,13.405474,13.509132,12.731601,13.320025,...,,,12.992579,12.768710,12.302490,,,,13.193582,13.036364
A0PJW6,,,,,12.327912,12.554208,12.814807,12.469543,13.291732,,...,13.105339,,12.743749,12.861310,12.414321,,,12.409447,12.563559,12.639381
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q9Y6W3,15.130530,14.364545,13.983002,14.233193,,14.287417,14.175175,14.671834,13.774375,14.705028,...,14.327068,13.996544,14.625149,14.320025,14.635582,15.307155,14.751026,14.701426,14.274393,14.430420
Q9Y6W5,15.630825,15.277614,15.056180,16.297731,15.491599,15.519980,15.481371,15.492711,15.273872,15.565168,...,15.545426,15.300585,15.494568,16.092293,16.347782,16.340391,16.717502,16.193079,16.933160,15.818258
Q9Y6X3,13.786903,13.205598,,13.559831,13.909921,13.536174,13.352595,13.905086,13.500145,13.232004,...,13.451134,13.603406,13.915319,13.332624,13.343644,13.751481,13.765524,13.539389,13.888715,13.593520
Q9Y6X9,13.606613,13.131547,14.023694,13.035376,13.538541,13.272991,13.262631,13.787250,13.727857,13.535178,...,13.173521,13.699204,13.246320,13.088866,13.019200,13.508773,12.815473,13.472043,13.569037,13.088278


## Perform imputation

- msBayesImpute is user-friendly because it requires no parameter tuning.
- If needed, users can fix the number of latent factors or set the minimum explained variance through the msBayesImpute arguments **n_components** (default is None) and **drop_factor_threshold** (default is 0.01).
- There are also two options for **convergence_mode**: "fast" and "slow" (default).
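
As a sketch of how the arguments listed above might be passed, the snippet below collects them in a dictionary and hands them to the constructor. The argument names come from this guide; the values are illustrative only, and the guarded import is there just so the snippet also runs where the package is not installed.

```python
# Keyword arguments named in the list above; the values are examples, not recommendations.
model_kwargs = {
    "n_components": 10,             # fix the number of latent factors (default: None)
    "drop_factor_threshold": 0.01,  # minimum explained variance (default: 0.01)
    "convergence_mode": "fast",     # one of "fast" or "slow"
}

try:
    from msbayesimputepy.core import msBayesImpute
except ImportError:
    msBayesImpute = None  # package not installed; skip model construction

if msBayesImpute is not None:
    # Build the model with the chosen settings instead of the defaults.
    msBayes_model = msBayesImpute(**model_kwargs)
```

Leaving `n_components` at its default of None lets the model choose the number of factors during training, as in the run shown below.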

In [6]:
# perform model training
msBayes_model = msBayesImpute()  # arguments
msBayes_model.train(df)
impute_data = msBayes_model.predict(df)

Model option: alternating_featureWise, convergence mode: fast, shinrakge: HorseShoe

1. Initialize dropout curve:  rho - 12.26, zeta - 0.80
2. Initialize training and optimize number of factors.
  - Current factor number is: 33
  Epoch 1: elapsed time=00:00, LOSS=3.440875
  Epoch 100: elapsed time=00:08, deltaLOSS=-0.013737 (0.39924%)
  Epoch 200: elapsed time=00:15, deltaLOSS=0.006306 (0.18328%)
  Epoch 300: elapsed time=00:22, deltaLOSS=0.003946 (0.11467%)
  Epoch 400: elapsed time=00:30, deltaLOSS=0.001612 (0.04684%)
  - Current factor number is: 15
  Epoch 500: elapsed time=00:39, deltaLOSS=0.016757 (0.48701%)
  Epoch 600: elapsed time=00:45, deltaLOSS=-0.001609 (0.04677%)
  Epoch 700: elapsed time=00:52, deltaLOSS=-0.001507 (0.04378%)
  The final number is 15.
3. Start final model training.
  Step 1, refine factorization model.
  Epoch 1: elapsed time=00:57, LOSS=1.886024
  Epoch 100: elapsed time=01:02, deltaLOSS=-0.000914 (0.02657%)
  

In [7]:
#save the results to a csv file
impute_data.to_csv('imputation_python.csv')