# Normalization toolbox

Why normalize your data at all? So that your samples become comparable. Data processing pipelines introduce many sources of variation, and (metabol)omics data is itself variable: the differences in concentration between individual metabolites, and between individual samples, can both be very large. Good normalization methods need to preserve these relative differences and make sure that neither low- nor highly abundant metabolites are disadvantaged in the process [(Kohl 2012)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3337420/pdf/11306_2011_Article_350.pdf).

[This paper here](https://www.nature.com/articles/srep38881.pdf) is a great study that compares different normalization methods for LC/MS-based metabolomics. I'd use it as a starting point if we decide to do one for FIA-MS.

# Note to self:
here is more literature to include:
- Impact of normalization on the outcome [(Saccenti 2017)](https://pubs.acs.org/doi/10.1021/acs.jproteome.6b00704)
- Novel developments [(Engel 2013)](https://www.sciencedirect.com/science/article/abs/pii/S0165993613001465)
- Normalization in metabolomics [(Misra 2020)](https://journals.sagepub.com/doi/abs/10.1177/1469066720918446?journalCode=emsa)
- A Python-Based Pipeline for Preprocessing LC–MS Data for Untargeted Metabolomics Workflows [(Riquelme 2020)](https://www.mdpi.com/2218-1989/10/10/416)
- Hints about PLR (don't know how to implement it), [.](http://www.compositionaldata.com/codawork2017/abstracts/9thFriday/Oral/OMI-2-2.pdf) [.](https://www.sciencedirect.com/science/article/abs/pii/S0021967314012990?via%3Dihub) [.](https://www.sciencedirect.com/science/article/abs/pii/S0003267012005685)
- Batch effect [(Gagnon-Bartsch and Speed 2012)](https://watermark.silverchair.com/kxr034.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAArYwggKyBgkqhkiG9w0BBwagggKjMIICnwIBADCCApgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMipKBNk2L1l3QyfmiAgEQgIICaaHlZyuwWjWC5T5v2hDtYSKaPIiYhhbr0szzbcrp3xtOaziuxhzkB0NFyJJhE37sGWeBsuyKdyswTmtfJTlPcLDkYP-vvXv9ykjyulkCuTRQ2MHivovM1UkGWB5AnQvM-ZFcokabvt9JrG8j6xXzTJmE1fmifs8KCzIl0yC9i5V9Py7ctlS4odzj5uYKEqqSRceEyZk_QAFwtr9AUrqwP-q9MUZ3Wh28JLEHcKgVUFyNHbnk-r2Or2ouIdUDkwwQnRTPXZDfrwLAASs0E5yir1Kdsi4_Jq7m1klvinDAaYEuxiHo7O1AjOs5U-ltPbh2sQUrhWGvanWoaH_6fXFIviKsuRssL7FCaxzOsbbdY7UAB13pRAiHIHGu7-qhpX4hakVT_OGTF5dD2q8O6tVibfBBzoLFvIjbMYLFGZWhHXxLnfnTjFDuojdo6QuVefpWrjHEHHg3seCPFJx2Yv3Z_hu8CvRJU1VMwwCQ5SmCjBh7_TY8MpbKpj3gcR85wj0akjq_gd02RL8LMpvjVFu1rb1-KpZ__bokI1DN02pOGT-WJl2ot7l9xqUyF9u458MMHQ5xc0CvzNMVCrmppFtGzCTtB_z6AgkXzseaGZrTqlGt-JOWbKev7de9BRM58HCYIxV5zIThZoZkPEUE2-cZOV_Klc3FvEN2i1eoyYIASH8iODuwdsDm8RcjjECecgSPG31dfO9HMFUjaOUWGSK40VcTBuc9OeNzuHOybIEAGsgYQ5XEO8R3R9pMvMi7KcsxK0xQv6SSwX6AjVQJnkVlCdaQObxMMOYqMavxuJPNoYYAdDZs2tbifRag), [(Cheng Li and 
Rabinovic)](https://watermark.silverchair.com/kxj037.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAArQwggKwBgkqhkiG9w0BBwagggKhMIICnQIBADCCApYGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMtFy6c03qWXwVaMgwAgEQgIICZ7O9atjsPumn5RrucSZ4xM9V9Z9CYeM59bDNp7tD_7iZVAu9d9ODm2gl7PH--EMwgoMO6EJ6H0uQ3JUojod0dbiEbxfjv0ELlZCnPg1X9akLZB_yRJE7aPAE7MRr6I0hF8_374Z2irAuvr7mjDQ_bZ40a1tujXvwwxMztgn4ozKC6sL1NLj0Uhs2KYhJrjHed2be4gwsARFjjv0FRzwzIR81ov2dMq3-qtBdgmSWqMzsuKLE0qY4VWH_trPDg393UWGhw5ZdyyOaLts70hUNL22-7qDLSRdbTxJs_IkKxOVDfdrrPTNADm1TRPyAWlT7nsaOuLtNguH4fxj-gBx31PTuYnyiOgK1LXy5gSJtgoSslyL3dyNtUU9ttNjnALYX9-YXjWTCtgt8pByuJiupy9b6xc_IPfDdnK4Ef7PfcZLXYJUjtlLoxyJdhbAquYvr-KjL8l0wqg0BDyrrL5fML8GNh3-ss_d2ubg2QzgPgWAhFN1sOvfRJDXirtM_9MrK69s_LIo5rdezrWQfqDotAXKq2J3mhg95viR_FH3sf3SG4PK2PKy5rEhKIXDuIuXIYqF1YD8n3DMDyt1ASpmH4JaluO8-8QnL-XZtIkbimkmbNg53cTD8rujcp-e0J47wbsMpHWNs3pCN_2SivxQ7jzSpT_9HDhdnBxH-uaBVj3xZkixvcb8wUDhsl4oxUpFwwXaWwzs-3ToKkadGbM0lWjIYWSUg7B34FmMDcUVdy4UgDrH8FS2QuhtK_uGB8_-FcBaixeloZfsk0QSEgK6rrGIcrVXC5WxGcOes8QyfesPAqwYxydwslg), [(Wherens 2016)](https://link.springer.com/article/10.1007/s11306-016-1015-8)
- Internal Standard normalization taking care of Cross Contribution [(Redestig 2009)](https://pubs.acs.org/doi/abs/10.1021/ac901143w?casa_token=SMFtt8olvVEAAAAA:btSEUUFPXdGg4Cx90tG1DXWGDOiIXP0vrsQ0pmLDccitoP_amy8j-oRBDBJXHBkZ84jqgWmtIhPRokM)
- Workflow for metabolomics [(De Livera)](https://pubs.acs.org/doi/10.1021/ac302748b) (it's from the Terry Speed group and he seems to be good at his job)
- Surrogate variable analysis (SVA), not sure how relevant that is though [(Leek 2007)](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161)

### to include:
- constant sum
- [quantile normalization](https://watermark.silverchair.com/190185.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAArswggK3BgkqhkiG9w0BBwagggKoMIICpAIBADCCAp0GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMW8dHBAPZhc2xZmniAgEQgIICbp3S-UhHiEsDmL2jlngSyxCN-rWd2LquzmRg58iAmZqt4Wi5UGEMpLC8dhD1sMovXXZfs09Ff_ae62aww8h_iIYWLEyVi-W19oDdljxe-FprdiBFq_bwEuNo2znrbzZnyH6hegGXUVdUrp0CfYpG1-H1pKlj3cxuF-bZGnhzB2J7fZ66fLz02w1CQM9WpuHgNG1R_VebtwOPPuJ_RMMoxbXxh5589w5si7tEuDnGu8vUy4Hh3W1caCbKwutQomBANDASYMQpjGFLmtHZlzhTdIPozbnue8oduA4urZoKt5WFlMegaNRmUybyfZQJNJ48SZshcefMTa-q5EE5xAC_kJmITGubMMKcBjUOH0KdXBfGFjb8OFi6XkwuZo2lfYtk5DyW08GQAip18uDiVANM2xkfldmWnnDbD264QAUZbtcFhSpTcCF7trHOGYHhNV-e36EMt3L14t2U7q_UGJVDkozXrxf-eh_m4Cl2w0yB8H6OptK3zdSjH77iLBxMr2SI79ZX77H4WWkwyyH22qjXCVJXVUuHEFAo6Kbcttrjt2IFHMot9bcLbsuetfATjYVVU0OWtGyhYHefkmkYsE2iJ1mrt8EEwNn9PgfqCBiXHLhBMCM4ML6jNByT6QDTSPp9gXAThQO8LVf3pvGzaog1FBQ7br9-riLHkd4Ub8AlL3f1yIn42Ec632VpaQBc-S6rI_vn4qV-1hbf3Xy1_h9s2u4iH4mS-EJueu0VUO4QSsWi8wVknKA8sc1zcBrPDLH-ektOWvtl3kw7zi6xSJoM6JHyF3nIJAKpwCWYMVtja4qnXGCknYVgaM3tsB3r95k) (reviewed in Kohl, but only really good for large 50+ datasets) 
- Pairwise log-ratios
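
For the pairwise log-ratios, the basic transform itself is simple, so here is a minimal sketch. It assumes a wide table (rows are samples, columns are metabolites) with strictly positive intensities. The function name `pairwise_log_ratios` and the toy numbers are made up for this example and are not part of BFAIR. What to do with the ratios afterwards is not covered here.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_log_ratios(wide):
    """Hypothetical helper: log(x_a / x_b) for every metabolite pair (a, b).

    wide: rows = samples, columns = metabolites, all values > 0.
    """
    logs = np.log(wide)
    ratios = {
        f"{a}/{b}": logs[a] - logs[b]  # log(a / b) = log(a) - log(b)
        for a, b in combinations(wide.columns, 2)
    }
    return pd.DataFrame(ratios, index=wide.index)


# Invented toy data: the second sample is a 10x dilution of the first
wide = pd.DataFrame(
    {"atp_c": [100.0, 10.0], "glu__L_c": [50.0, 5.0], "pyr_c": [25.0, 2.5]},
    index=["1xDil", "10xDil"],
)
plr = pairwise_log_ratios(wide)
print(plr)
```

Both rows come out identical because a pure dilution cancels in every ratio. The number of columns grows as p(p-1)/2 for p metabolites, so this gets large quickly.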

### to include because identified to be great for LC/MS data in [this paper here](https://www.nature.com/articles/srep38881.pdf):
- Log transform
- VSN

### to include because identified to be great for GC/MS data in [this paper here](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-7-142):
- Auto Scaling
- Range Scaling
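
As a reminder of what these two do, here is a minimal sketch using their usual definitions: auto scaling divides the mean-centered values by the standard deviation, and range scaling divides them by the range (max minus min). It assumes a wide table with samples as rows and metabolites as columns. `auto_scale` and `range_scale` are hypothetical helper names, not BFAIR functions, and the numbers are invented.

```python
import pandas as pd


def auto_scale(wide):
    """Hypothetical helper: (x - mean) / standard deviation, per metabolite."""
    return (wide - wide.mean(axis=0)) / wide.std(axis=0, ddof=1)


def range_scale(wide):
    """Hypothetical helper: (x - mean) / (max - min), per metabolite."""
    centered = wide - wide.mean(axis=0)
    return centered / (wide.max(axis=0) - wide.min(axis=0))


# Invented toy data: one highly abundant and one low-abundant metabolite
toy_wide = pd.DataFrame(
    {"high": [1000.0, 2000.0, 3000.0], "low": [1.0, 2.0, 3.0]},
    index=["s1", "s2", "s3"],
)
auto = auto_scale(toy_wide)
ranged = range_scale(toy_wide)
print(auto)
print(ranged)
```

After either scaling, the two columns are identical even though their raw intensities differ by a factor of 1000, so both metabolites carry the same weight in downstream analysis.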

Re-write this part if you want to use it because it's mainly copied.

[(Vu et. al 2017)](http://bionmr.unl.edu/files/publications/152.pdf) compared 9 different normalization methods for NMR metabolomics data by adding Gaussian noise and dilution factors and comparing their peak recovery. The methods tested were: probabilistic quotient (PQ) [(Dieterle et. al, 2006)](https://pubs.acs.org/doi/10.1021/ac051632c), histogram matching (HM) [(Torgrip et al. 2008)](https://link.springer.com/article/10.1007%2Fs11306-007-0102-2#auth-J_-Lindberg), standard normal variate (SNV) [(Barnes et al. 1989)](https://journals.sagepub.com/doi/abs/10.1366/0003702894202201?casa_token=Ywnp29pqbS0AAAAA:_peitRZdEDIBNjf28GvJaHHBlGDzJJlVQejdthxBxMxWyo5wFllaXHh_F7Cff4mfdXcb1IEgB7kLAQ), multiplicative scatter correction (MSC) [(Windig et al. 2008)](https://journals.sagepub.com/doi/10.1366/000370208786049097), quantile (Q) [(Bolstad 2003)](https://watermark.silverchair.com/190185.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAArswggK3BgkqhkiG9w0BBwagggKoMIICpAIBADCCAp0GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMW8dHBAPZhc2xZmniAgEQgIICbp3S-UhHiEsDmL2jlngSyxCN-rWd2LquzmRg58iAmZqt4Wi5UGEMpLC8dhD1sMovXXZfs09Ff_ae62aww8h_iIYWLEyVi-W19oDdljxe-FprdiBFq_bwEuNo2znrbzZnyH6hegGXUVdUrp0CfYpG1-H1pKlj3cxuF-bZGnhzB2J7fZ66fLz02w1CQM9WpuHgNG1R_VebtwOPPuJ_RMMoxbXxh5589w5si7tEuDnGu8vUy4Hh3W1caCbKwutQomBANDASYMQpjGFLmtHZlzhTdIPozbnue8oduA4urZoKt5WFlMegaNRmUybyfZQJNJ48SZshcefMTa-q5EE5xAC_kJmITGubMMKcBjUOH0KdXBfGFjb8OFi6XkwuZo2lfYtk5DyW08GQAip18uDiVANM2xkfldmWnnDbD264QAUZbtcFhSpTcCF7trHOGYHhNV-e36EMt3L14t2U7q_UGJVDkozXrxf-eh_m4Cl2w0yB8H6OptK3zdSjH77iLBxMr2SI79ZX77H4WWkwyyH22qjXCVJXVUuHEFAo6Kbcttrjt2IFHMot9bcLbsuetfATjYVVU0OWtGyhYHefkmkYsE2iJ1mrt8EEwNn9PgfqCBiXHLhBMCM4ML6jNByT6QDTSPp9gXAThQO8LVf3pvGzaog1FBQ7br9-riLHkd4Ub8AlL3f1yIn42Ec632VpaQBc-S6rI_vn4qV-1hbf3Xy1_h9s2u4iH4mS-EJueu0VUO4QSsWi8wVknKA8sc1zcBrPDLH-ektOWvtl3kw7zi6xSJoM6JHyF3nIJAKpwCWYMVtja4qnXGCknYVgaM3tsB3r95k), natural cubic splines (CSpline) [(Workman et al. 
2002)](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-9-research0048), smoothing splines (SSpline) [(Fujioka and Kano 2005)](https://magnus.ece.gatech.edu/Papers/BSplinesIJICIC05.pdf), constant sum (CS) and region of interest (ROI) [(Dieterle et al., 2006)](https://pubs.acs.org/doi/10.1021/ac051632c). At low noise levels, all of the methods except HM performed similarly well. The three best-performing methods were PQ, CS and ROI, and this also held for high dilution factors. At high variance levels, however, ROI dropped off, so the authors recommend using either PQ or CS whenever possible. The two methods also differed from each other: CS performed slightly better at high noise levels, while PQ performed significantly better than CS at low to moderate noise levels.

This is an example notebook for the normalization module. In general, the module provides different custom functions to normalize your data in different ways. Normalization is a tricky subject, and the approach you choose can impact your results, so please be aware of that. Some considerations are discussed here in the notebook.

Data used: AUT-440-446_FIAMSMatrixLinearityR2-Reproducibility_DataProcessing\20201223_AddAdducts-SumInt-BackSub

This data set is blank-corrected.

In [2]:
import pandas as pd
import numpy as np
import copy
import pickle
import BFAIR.normalization as normalization

## Data

In [4]:
with open("data/normalization_Data/Ecoli_intensities_linearity.txt", 'rb') as handle:
    intensities_Ecoli = pickle.loads(handle.read())

In [5]:
intensities_Ecoli

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,9.583652e+03
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,1.276414e+03
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,1.282265e+03
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,2.968548e+04
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,2.144749e+04
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,1.575200e+06
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,9.985480e+04
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,3.546930e+05
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,9.593014e+05


## Normalization methods

### min/max feature scaling: (value - min) / (max - min)

A standard method to scale your data between 0 (min) and 1 (max). This method depends solely on the smallest and the largest value in your dataset, so the results can easily be influenced by dominant elements (e.g. very highly abundant metabolites) or outliers. This method should probably only be used if your data has to be scaled from 0 to 1.

In [6]:
normalization.min_max_norm(intensities_Ecoli, columnname='Intensity')

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,1.416323e-05
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,4.546221e-07
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,4.642775e-07
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,4.733527e-05
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,3.374094e-05
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,5.051276e-04
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,3.167472e-05
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,1.134548e-04
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,3.074797e-04
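
To make the formula concrete, here is a minimal pandas sketch of min/max scaling. It is not the BFAIR implementation: the function name `min_max_scale`, the choice to scale within each sample, and the toy numbers are assumptions made for illustration.

```python
import pandas as pd


def min_max_scale(df, value_col="Intensity", group_col="sample_group_name"):
    """Hypothetical helper: (x - min) / (max - min) within each group.

    A group whose values are all equal gives 0/0, i.e. NaN.
    """
    out = df.copy()
    grouped = out.groupby(group_col)[value_col]
    lowest = grouped.transform("min")
    highest = grouped.transform("max")
    out[value_col] = (out[value_col] - lowest) / (highest - lowest)
    return out


# Invented toy data: sample s2 contains one very dominant value
toy = pd.DataFrame(
    {
        "sample_group_name": ["s1", "s1", "s1", "s2", "s2", "s2"],
        "Metabolite": ["a", "b", "c", "a", "b", "c"],
        "Intensity": [10.0, 20.0, 110.0, 1.0, 2.0, 1001.0],
    }
)
scaled = min_max_scale(toy)
print(scaled)
```

In `s2` the single large value pushes the other two values close to 0, which is the sensitivity to dominant elements and outliers described above.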


### Total Sum Intensity, value / tsi

The following methods are all based on total sum normalization, which comes with caveats. Keep in mind that the result depends on the intensities of all measured metabolites. If the values are unequally distributed or one metabolite is particularly dominant (as can easily be the case in metabolic engineering), this alters the normalized concentrations of the other metabolites (a known problem with closed data, [(Filzmoser & Walczak, 2014)](https://www.sciencedirect.com/science/article/abs/pii/S0021967314012990?via%3Dihub)). Other methods such as PQN, which makes use of TSI in its initial step, are recommended instead [(Dieterle et al., 2006)](https://pubs.acs.org/doi/10.1021/ac051632c). Also, while this method preserves the overall signal quality, it can skew correlations between signals [(Craig et al., 2006)](https://pubs.acs.org/doi/10.1021/ac0519312), and intensities for the same compound in different KOs can appear different.

In [7]:
normalization.tsi_norm(intensities_Ecoli, columnname='Intensity')

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,1.401056e-06
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,1.866019e-07
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,1.874573e-07
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,4.339788e-06
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,3.135457e-06
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,5.544914e-05
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,3.515023e-06
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,1.248567e-05
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,3.376870e-05
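
The closure problem is easy to see on a toy example. The sketch below divides each intensity by the total intensity of its sample. `tsi_scale` is a hypothetical helper, not the BFAIR function, and the numbers are invented.

```python
import pandas as pd


def tsi_scale(df, value_col="Intensity", group_col="sample_group_name"):
    """Hypothetical helper: divide each value by the total of its sample."""
    out = df.copy()
    totals = out.groupby(group_col)[value_col].transform("sum")
    out[value_col] = out[value_col] / totals
    return out


# Invented toy data: only metabolite c differs between the two samples
toy = pd.DataFrame(
    {
        "sample_group_name": ["s1", "s1", "s1", "s2", "s2", "s2"],
        "Metabolite": ["a", "b", "c", "a", "b", "c"],
        "Intensity": [10.0, 10.0, 80.0, 10.0, 10.0, 180.0],
    }
)
tsi = tsi_scale(toy)
print(tsi)
```

Metabolites `a` and `b` have the same raw intensity in both samples, but their normalized values halve in `s2` (0.1 versus 0.05) only because `c` went up.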


## These methods are based on the E. coli model that we used in the INCA example for now; [iJS2012](https://bmcsystbiol.biomedcentral.com/articles/10.1186/1752-0509-6-9).
Biomass reactions and the corresponding metabolite identifiers will be added for different organisms. All metabolites that are part of the biomass function, both substrates and products, are considered.

### Main Biomass components

In [8]:
biomass_mets = ['phe__L_c', 'mlthf_c' ,'oaa_c', 'lys__L_c',
                'atp_c', 'ser__L_c', 'g3p_c', 'tyr__L_c', 'pep_c',
                'met__L_c', 'g6p_c', 'akg_c', 'glu__L_c',
                'gln__L_c', 'r5p_c', 'f6p_c', 'pyr_c', 'gly_c',
                'thr_c', 'asp__L_c', 'nadph_c', 'cys__L_c',
                '3pg_c', 'val__L_c', 'ala__L_c', 'ile__L_c',
                'asn__L_c', 'his__L_c', 'leu__L_c', 'accoa_c',
                'arg__L_c', 'pro__L_c', 'trp__L_c', 'nadh_c']
bm_vals = [0.176, 0.443, 0.34, 0.326, 33.247, 0.205, 0.129, 
           0.131, 0.051, 0.146, 0.205, 0.087, 0.25, 0.25, 0.754,
           0.071, 0.083, 0.582, 0.241, 0.229, 5.363, 0.087, 0.619,
           0.402, 0.488, 0.276, 0.229, 0.09, 0.428, 2.51, 0.281,
           0.21, 0.054, -1.455]
biomass_value = 39.68
biomass_df = pd.DataFrame()
biomass_df['Metabolite'] = biomass_mets
biomass_df['Value'] = bm_vals

#### Using just the metabolites

In [9]:
normalization.lim_tsi_norm(biomass_df['Metabolite'], intensities_Ecoli, columnname='Intensity')

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,0.000073
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,0.000010
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,0.000010
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,0.000227
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,0.000164
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,0.000289
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,0.000018
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,0.000065
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,0.000176
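
The idea of restricting the sum to a reference set can be sketched as follows. This assumes that each intensity is divided by the per-sample sum over the listed metabolites only. Whether `lim_tsi_norm` does exactly this is not shown in this notebook, and `reference_sum_scale` is a hypothetical name. The toy numbers are invented.

```python
import pandas as pd


def reference_sum_scale(df, reference_mets, value_col="Intensity",
                        group_col="sample_group_name", met_col="Metabolite"):
    """Hypothetical helper: divide by the per-sample sum of reference metabolites.

    Samples without any reference metabolite end up as NaN.
    """
    out = df.copy()
    is_reference = out[met_col].isin(list(reference_mets))
    # Sum only the reference metabolites, separately for each sample
    ref_sums = out.loc[is_reference].groupby(group_col)[value_col].sum()
    out[value_col] = out[value_col] / out[group_col].map(ref_sums)
    return out


# Invented toy data: xtp_c is dominant in s1 but is not a reference metabolite
toy = pd.DataFrame(
    {
        "sample_group_name": ["s1", "s1", "s1", "s2", "s2", "s2"],
        "Metabolite": ["atp_c", "glu__L_c", "xtp_c"] * 2,
        "Intensity": [10.0, 30.0, 500.0, 20.0, 60.0, 50.0],
    }
)
ref_scaled = reference_sum_scale(toy, ["atp_c", "glu__L_c"])
print(ref_scaled)
```

Here the dominant `xtp_c` signal in `s1` no longer changes the normalized values of `atp_c` and `glu__L_c`, because it is not part of the denominator. The trade-off is that everything now depends on the reference metabolites being measured reliably.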


#### Also using the stoichiometric coefficients

In [10]:
normalization.lim_tsi_norm(biomass_df, intensities_Ecoli, biomass_value=biomass_value, columnname='Intensity')

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,0.000896
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,0.000119
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,0.000120
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,0.002774
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,0.002004
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,0.007044
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,0.000447
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,0.001586
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,0.004290


### Amino acids
The same approach, but using only the amino acids (AAs).

In [11]:
# maybe add L-Selenocysteine and L-Pyrrolysine
amino_acids = ['ala__L_c', 'arg__L_c', 'asn__L_c', 'asp__L_c',
               'cys__L_c', 'glu__L_c', 'gln__L_c', 'gly_c',
               'his__L_c', 'ile__L_c', 'leu__L_c', 'lys__L_c',
               'met__L_c', 'phe__L_c', 'pro__L_c', 'ser__L_c',
               'thr_c', 'trp__L_c', 'tyr__L_c', 'val__L_c']

In [13]:
normalization.lim_tsi_norm(amino_acids, intensities_Ecoli, columnname='Intensity')

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,0.000101
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,0.000013
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,0.000014
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,0.000313
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,0.000226
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,0.000394
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,0.000025
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,0.000089
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,0.000240


### Probabilistic Quotient Normalization
This method [(Dieterle et al., 2006)](https://pubs.acs.org/doi/10.1021/ac051632c) adjusts for dilution, so we would use it to compare different dilutions of the same sample, e.g. E. coli 1x, 10x, 100x, etc.

**0)** Put all the corresponding samples into one dataframe. Given the architecture, I'd say columns should correspond to samples and rows to metabolites. Use suitable values for missing metabolites; NA is probably fine, but check that this is compatible with the mean/median functions.

That step is numbered 0 because it is still pre-processing. Here is the logic of the method:

**1)** tsi correct each sample separately

**2)** Set up a QC vector (or additional column) with the mean/median for each metabolite over all samples

**3)** Divide each value for each sample by the corresponding QC value

**4)** Get the mean/median over all metabolites for each sample -> This is your dilution factor for this sample

**5)** Take the values from 1) and divide each value by the dilution factor corresponding to this sample, as in the original method [(Dieterle et al., 2006)](https://pubs.acs.org/doi/10.1021/ac051632c)

And to finish, export the result in the same format as the input:

**6)** Copy the input df and insert the new values
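
Steps 0 to 5 can be sketched in a few lines of pandas. This is a minimal illustration, not the code behind `pqn_norm`: the function name `pqn_sketch`, its parameters, and the toy data are assumptions. The sketch divides by the per-sample factor, which is how Dieterle et al. define the correction, and it returns the wide table instead of performing step 6.

```python
import pandas as pd


def pqn_sketch(df, sample_col="sample_group_name", met_col="Metabolite",
               value_col="Intensity", stat="median"):
    """Hypothetical helper following steps 0-5; returns (wide table, factors)."""
    if stat not in ("median", "mean"):
        raise ValueError("stat must be 'median' or 'mean'")
    # 0) rows = metabolites, columns = samples; missing metabolites stay NaN
    wide = df.pivot(index=met_col, columns=sample_col, values=value_col)
    # 1) total sum scaling per sample (NaN is skipped in the sums)
    tsi = wide / wide.sum(axis=0)
    # 2) reference (QC) value per metabolite across all samples
    reference = getattr(tsi, stat)(axis=1)
    # 3) quotient of each value and its reference value
    quotients = tsi.div(reference, axis=0)
    # 4) one dilution factor per sample
    factor = getattr(quotients, stat)(axis=0)
    # 5) divide the TSI values by the factor of their sample
    return tsi / factor, factor


# Invented toy data: s2 equals s1 and s3 except for a much higher metabolite d
toy = pd.DataFrame(
    {
        "sample_group_name": ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4,
        "Metabolite": ["a", "b", "c", "d"] * 3,
        "Intensity": [10.0, 20.0, 30.0, 40.0,
                      10.0, 20.0, 30.0, 140.0,
                      10.0, 20.0, 30.0, 40.0],
    }
)
pqn, factor = pqn_sketch(toy)
print(factor)
print(pqn)
```

In this toy case, TSI alone would halve `a`, `b` and `c` in `s2`. The median quotient for `s2` is 0.5, so after the correction those three match `s1` again and only `d` stands out (1.4 versus 0.4, the same 3.5-fold change as in the raw data).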

In [29]:
normalization.pqn_norm(intensities_Ecoli, 'sample_group_name', 'Intensity', 'median')

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,1.401056e-06
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,1.866019e-07
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,1.874573e-07
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,4.339788e-06
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,3.135457e-06
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,5.773747e-05
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,3.660085e-06
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,1.300094e-05
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,3.516230e-05


In [16]:
normalization.pqn_norm(intensities_Ecoli, 'sample_group_name', 'Intensity', 'mean')

Unnamed: 0,sample_group_name,Metabolite,Formula,Intensity
0,Linearity_P1Ecoli_100xDil_1,12dgr120_c,C27H52O5,1.386942e-06
1,Linearity_P1Ecoli_100xDil_1,12dgr140_c,C31H60O5,1.847222e-07
2,Linearity_P1Ecoli_100xDil_1,12dgr141_c,C31H56O5,1.855689e-07
3,Linearity_P1Ecoli_100xDil_1,12dgr160_c,C35H68O5,4.296071e-06
4,Linearity_P1Ecoli_100xDil_1,12dgr161_c,C35H64O5,3.103872e-06
...,...,...,...,...
10388,Linearity_P1Ecoli_1xDil_3,xdp_c,C10H14N4O12P2,6.273600e-05
10389,Linearity_P1Ecoli_1xDil_3,xmp_c,C10H13N4O9P1,3.976950e-06
10390,Linearity_P1Ecoli_1xDil_3,xtp_c,C10H15N4O15P3,1.412648e-05
10391,Linearity_P1Ecoli_1xDil_3,xtsn_c,C10H12N4O6,3.820641e-05
