## **IPython script to calculate NA Correction and Fractional Enrichment for a demo dataset.**

Prerequisite Knowledge:

**Natural Abundance Correction**- Natural abundance (NA) refers to the relative abundance of the isotopes of a chemical element as found naturally on the planet. In a labeling experiment, the observed intensity of each labeled form includes a contribution from naturally occurring heavy isotopes, and this contribution must be removed before the intensities are interpreted. This process is referred to as NA Correction.

**Pool Total**- The sum of the intensities of a metabolite over every number of labeled atoms of the isotope element, within one sample.

**Fractional enrichment**- The intensity of each labeled form of a metabolite divided by the metabolite's pool total, so that the values lie between 0 and 1.
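
As a minimal sketch of the two definitions above, the snippet below computes a pool total and fractional enrichments in plain Python. The intensities are the NA Corrected values for 2-Isopropylmalic acid in SAMPLE_2_10 taken from the output further down in this notebook. Treating negative corrected intensities as zero is an assumption inferred from that output, not a documented corna rule, and the helper name `fractional_enrichment_sketch` is hypothetical.

```python
def fractional_enrichment_sketch(corrected):
    """Return (pool_total, fractions) for one metabolite in one sample.

    Assumption: negative NA Corrected intensities are treated as zero.
    """
    clipped = [max(value, 0.0) for value in corrected]
    pool_total = sum(clipped)
    if pool_total == 0:
        # Nothing was measured, so every fraction is reported as zero.
        return 0.0, [0.0] * len(clipped)
    return pool_total, [value / pool_total for value in clipped]


# NA Corrected intensities for labels C13-label-0 ... C13-label-7.
corrected = [3.150842e+05, -1.452241e+04, 1.519405e+02, 3.324733e+00,
             -1.050100e-01, 1.152950e-03, -5.944136e-06, 1.480564e+05]
pool_total, fractions = fractional_enrichment_sketch(corrected)
```

With these inputs the pool total comes out near 4.633e+05 and the unlabeled fraction near 0.68, in line with the `Pool_total` and `Fractional enrichment` columns in the final table of this notebook.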

**Welcome to the interactive Polly IPython Notebook.**

**With this interactive Polly notebook you can calculate NA Corrected intensities as well as fractional enrichment for an LCMS input file.** Information on some of the functions used:

 - corna - the package that performs NA Correction.
 - lcms.csv - demo raw_intensity file.
 - parser.convert_inputdata_to_stdfrom - converts the input dataframe from wide to long format.
 - parser.save_original_label_and_processed_label - saves the original dataframe so that the original labels and original intensities can be retrieved after the NA Corrected intensities are obtained.
 - parser.get_isotope_columns_frm_label_col - from a label of the form C13N15-label-1-1, gets dataframe columns of the form

       C13 N15
        1   1

 - parser.filter_required_col_and_get_formula_dict - filters the dataframe columns that are required for the NA Correction calculation and gets the formula dictionary of the form (element: number of atoms of the element in the formula).
   Ex: C5H2O3 -> formula dictionary {'C': 5, 'H': 2, 'O': 3}
 - algo.make_all_corr_matrices - forms the correction matrix M such that Mx = y, where y is the observed isotopic distribution and x is the expected distribution of input labels. The matrix is formed for each indistinguishable element of a particular isotracer, one by one.
 - nacorr_lcms.correct_df_for_multiplication - arranges the dataframe by isotope, grouping it on each isotope column, and then passes it as input for multiplication with the correction matrix.
 - parser.add_name_formula_label_col - adds the metabolite name, formula and label columns back to the NA Corrected dataframe.
 - fractional_enrichment - calculates fractional enrichment for the NA Corrected dataframe.
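
To make the formula dictionary described in the list above concrete, here is a small standalone sketch that parses a simple molecular formula with a regular expression. It is not corna's implementation. The function name `formula_to_dict` is hypothetical, and the sketch assumes formulas without parentheses, charges or isotope prefixes.

```python
import re


def formula_to_dict(formula):
    """Map each element symbol in a formula such as 'C5H2O3' to its atom count."""
    counts = {}
    # An element is an uppercase letter plus an optional lowercase letter,
    # followed by an optional count (a missing count means one atom).
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


example = formula_to_dict('C5H2O3')
glycine = formula_to_dict('C2H5NO2')
```

Here `example` is `{'C': 5, 'H': 2, 'O': 3}`, matching the example in the list, and glycine's single nitrogen is counted as 1.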






In [3]:
import pandas as pd
import numpy as np

from corna.inputs import maven_parser as parser
from corna import constants as cons
from corna.autodetect_isotopes import get_element_correction_dict
from corna.algorithms import matrix_calc as algo
from corna.algorithms import nacorr_lcms
from corna.postprocess import fractional_enrichment
from corna.helpers import _merge_dfs

**Validating raw_intensity_file and metadata_file to check for empty dataframes, empty cells, incorrect format etc.**

**If a metadata file is present, merge the metadata with the raw intensity file.**

In [5]:
input_files= {
'raw_input_files': ['raw_intensity_file_LCMS.csv']
}
raw_input_file = input_files.get('raw_input_files')[0]
# meta_input_file = input_files.get('metadata_file', None)
validated_raw_df = pd.read_csv(raw_input_file)
merged_df, iso_tracer_data, element_list = parser.read_maven_file(
    raw_input_file, validated_raw_df, pd.DataFrame())

**Defining the natural abundance values of different elements and the isotracers present in the file.**

In [6]:
#df= pd.read_csv('raw_intensity_file_LCMS.csv')
iso_tracers= ['C13', 'N15']
autodetect= False
ppm_input_user= 13
na_dict = {'C':[0.9889,0.0111],
'N':[0.9964,0.0036],
'O':[0.9976,0.0004,0.002],
'H':[0.99985,0.00015],
'S':[0.95,0.0076,0.0424],
}
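
As a quick sanity check on the values above, each element's abundances should sum to 1. The sketch below verifies that and then uses the carbon values to estimate how often an unlabeled molecule carries heavy carbon atoms purely by chance, which is the kind of contribution NA Correction accounts for. It assumes a simple binomial model; the helper name `natural_isotopologue_fractions` is hypothetical and not part of corna.

```python
from math import factorial

na_dict = {'C': [0.9889, 0.0111],
           'N': [0.9964, 0.0036],
           'O': [0.9976, 0.0004, 0.002],
           'H': [0.99985, 0.00015],
           'S': [0.95, 0.0076, 0.0424]}

# Every element's isotope abundances should add up to 1 (within rounding).
for element, abundances in na_dict.items():
    assert abs(sum(abundances) - 1.0) < 1e-9, element


def natural_isotopologue_fractions(n_atoms, heavy_abundance):
    """Binomial probability of k heavy atoms among n_atoms, for k = 0..n_atoms."""
    fractions = []
    for k in range(n_atoms + 1):
        ways = factorial(n_atoms) // (factorial(k) * factorial(n_atoms - k))
        fractions.append(ways * heavy_abundance ** k
                         * (1 - heavy_abundance) ** (n_atoms - k))
    return fractions


# Seven carbons, as in 2-Isopropylmalic acid (C7H12O5).
c7 = natural_isotopologue_fractions(7, na_dict['C'][1])
```

Under this model roughly 92.5% of unlabeled seven-carbon molecules contain no C13 and roughly 7.3% contain exactly one.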

In [7]:
#merged_df= parser.convert_inputdata_to_stdfrom(df)

**Saving original label column and original intensities to be mapped after performing NA Correction.**

In [8]:
original_df= parser.save_original_label_and_processed_label(merged_df, iso_tracers)
print(original_df)
sample_list=merged_df.Sample.unique()
required_col=np.append(sample_list, iso_tracers)
final_df=pd.DataFrame()
eleme_corr_dict = {}
eleme_corr={}

                          Name   Formula       Sample     Intensity  \
0                      Glycine   C2H5NO2  SAMPLE_2_10  1.527025e+06   
1                      Glycine   C2H5NO2  SAMPLE_2_10  5.085417e+04   
2                      Glycine   C2H5NO2  SAMPLE_2_10  0.000000e+00   
3                      Glycine   C2H5NO2  SAMPLE_2_10  3.031570e+03   
4                      Glycine   C2H5NO2  SAMPLE_2_10  0.000000e+00   
5                      Glycine   C2H5NO2  SAMPLE_2_10  0.000000e+00   
6                 Pyruvic acid    C3H4O3  SAMPLE_2_10  2.603249e+05   
7                 Pyruvic acid    C3H4O3  SAMPLE_2_10  0.000000e+00   
8                 Pyruvic acid    C3H4O3  SAMPLE_2_10  0.000000e+00   
9                 Pyruvic acid    C3H4O3  SAMPLE_2_10  0.000000e+00   
10                   L-Alanine   C3H7NO2  SAMPLE_2_10  1.398106e+06   
11                   L-Alanine   C3H7NO2  SAMPLE_2_10  5.677794e+04   
12                   L-Alanine   C3H7NO2  SAMPLE_2_10  1.604293e+05   
13    

**Converting input dataframe from long to wide format**

In [9]:
merged_df=merged_df.pivot_table(index=[cons.NAME_COL, cons.FORMULA_COL, cons.LABEL_COL],
                                            columns=cons.SAMPLE_COL, values= cons.INTENSITY_COL)
merged_df =merged_df.rename_axis(None, axis=1).reset_index()
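
The reshaping above can be tried on a toy frame. This sketch uses the literal column names `Name`, `Formula`, `Label`, `Sample` and `Intensity` in place of the `cons.*` constants, which is an assumption about their values based on the column headers printed in this notebook. The four rows reuse intensities that appear in the notebook's output for 2-Isopropylmalic acid.

```python
import pandas as pd

# Long format: one row per (metabolite, label, sample).
long_df = pd.DataFrame({
    'Name': ['2-Isopropylmalic acid'] * 4,
    'Formula': ['C7H12O5'] * 4,
    'Label': ['C13N15-label-0-0', 'C13N15-label-0-0',
              'C13N15-label-1-0', 'C13N15-label-1-0'],
    'Sample': ['SAMPLE_2_10', 'SAMPLE_2_2', 'SAMPLE_2_10', 'SAMPLE_2_2'],
    'Intensity': [291402.5, 276754.8, 9314.458, 0.0],
})

# Wide format: one row per (metabolite, label), one column per sample.
wide_df = long_df.pivot_table(index=['Name', 'Formula', 'Label'],
                              columns='Sample', values='Intensity')
# Drop the "Sample" axis name and turn the index levels back into columns.
wide_df = wide_df.rename_axis(None, axis=1).reset_index()
```

Note that `pivot_table` averages duplicate (index, column) entries by default, so this reshaping assumes each metabolite, label and sample combination occurs only once.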

**From label of the form C13N15-label-1-1 get dataframe columns of the form -** 

    C13 N15
     1   1

In [10]:
std_label_df = parser.get_isotope_columns_frm_label_col(merged_df, iso_tracers)
print(std_label_df)

                       Name   Formula  SAMPLE_2_10  SAMPLE_2_2  SAMPLE_2_3  \
0     2-Isopropylmalic acid   C7H12O5   291402.500  276754.800  220976.300   
1     2-Isopropylmalic acid   C7H12O5     9314.458       0.000    7499.382   
2     2-Isopropylmalic acid   C7H12O5        0.000       0.000       0.000   
3     2-Isopropylmalic acid   C7H12O5        0.000       0.000       0.000   
4     2-Isopropylmalic acid   C7H12O5        0.000       0.000       0.000   
5     2-Isopropylmalic acid   C7H12O5        0.000       0.000       0.000   
6     2-Isopropylmalic acid   C7H12O5        0.000       0.000       0.000   
7     2-Isopropylmalic acid   C7H12O5   148056.400  160234.500       0.000   
8    3-Phosphoglyceric acid   C3H7O7P        0.000   84080.180   89475.740   
9    3-Phosphoglyceric acid   C3H7O7P        0.000       0.000       0.000   
10   3-Phosphoglyceric acid   C3H7O7P        0.000       0.000       0.000   
11   3-Phosphoglyceric acid   C3H7O7P        0.000   18325.710  
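
The label-splitting step can be illustrated without corna. The sketch below parses a label of the form `C13N15-label-1-1` into one count per isotracer. The function name `split_label` is hypothetical, and labels in other styles (for example `C12 PARENT`) are not handled here.

```python
import re


def split_label(label):
    """Split a label such as 'C13N15-label-1-1' into {'C13': 1, 'N15': 1}."""
    isotopes, counts = label.split('-label-')
    # Isotracer names are an element symbol followed by a mass number.
    names = re.findall(r'[A-Z][a-z]?\d+', isotopes)
    numbers = [int(count) for count in counts.split('-')]
    if len(names) != len(numbers):
        raise ValueError('label does not match its isotracers: %r' % label)
    return dict(zip(names, numbers))


parsed = split_label('C13N15-label-1-1')
```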

**For each metabolite in the file-**

**1) Create correction matrix according to the formula of metabolite.**

**2) Multiply the correction matrix with the dataframe of metabolite.**

**3) Add Name, Formula and Label column back to corrected dataframe.**

**4) Append to final dataframe.**
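
Steps 1 and 2 can be sketched for a single carbon tracer. Assuming a simple binomial model in which each unlabeled carbon is C13 with probability 0.0111 (the value in `na_dict`), the correction matrix M is lower triangular, so Mx = y can be solved by forward substitution. This is an illustration only, not corna's `make_all_corr_matrices`, and all function names below are hypothetical. The observed intensities are the SAMPLE_2_10 values for 2-Isopropylmalic acid printed above.

```python
from math import factorial


def n_choose_k(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))


def carbon_correction_matrix(n_carbons, p_heavy):
    """M[i][j]: chance that a molecule with j labeled carbons is observed
    with i heavy carbons because i - j unlabeled carbons are natural C13."""
    size = n_carbons + 1
    matrix = [[0.0] * size for _ in range(size)]
    for j in range(size):
        unlabeled = n_carbons - j
        for i in range(j, size):
            extra = i - j
            matrix[i][j] = (n_choose_k(unlabeled, extra) * p_heavy ** extra
                            * (1 - p_heavy) ** (unlabeled - extra))
    return matrix


def solve_lower_triangular(matrix, observed):
    """Forward substitution for M x = y with a lower-triangular M."""
    corrected = []
    for i, y_i in enumerate(observed):
        known = sum(matrix[i][j] * corrected[j] for j in range(i))
        corrected.append((y_i - known) / matrix[i][i])
    return corrected


# Observed intensities for 0..7 labeled carbons (C7H12O5).
observed = [291402.5, 9314.458, 0.0, 0.0, 0.0, 0.0, 0.0, 148056.4]
M = carbon_correction_matrix(7, 0.0111)
corrected = solve_lower_triangular(M, observed)
```

With these inputs the first two corrected values come out near 3.1508e+05 and -1.4522e+04, in line with the NA Corrected values shown later for this metabolite. A negative value can appear when the observed intensity is lower than the natural-abundance contribution predicted from the lighter labels.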

In [11]:
for metab in std_label_df.Name.unique():
    required_df, formula, formula_dict = parser.filter_required_col_and_get_formula_dict(std_label_df, metab,
                                                                     iso_tracers, required_col)
    corr_mats = algo.make_all_corr_matrices(iso_tracers, formula_dict, na_dict, eleme_corr)
    df_corr_C_N = nacorr_lcms.correct_df_for_multiplication(iso_tracers, required_df, corr_mats) 
    info_df= parser.add_name_formula_label_col(df_corr_C_N, metab, formula[0], iso_tracers, eleme_corr)
    # pd.concat is used because DataFrame.append was removed in pandas 2.0.
    final_df = pd.concat([final_df, info_df])

**Convert the NA Corrected dataframe back from wide format to long format.**

In [12]:
df_long = pd.melt(final_df, id_vars=[cons.NAME_COL, cons.FORMULA_COL, cons.LABEL_COL, cons.INDIS_ISOTOPE_COL])
df_long.rename(columns={'variable': cons.SAMPLE_COL, 'value':cons.NA_CORRECTED_COL},inplace=True)
df_long

| | Name | Formula | Label | Indistinguishable_isotope | Sample | NA Corrected |
|---|---|---|---|---|---|---|
| 0 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-0-0 | {} | SAMPLE_2_10 | 3.150842e+05 |
| 1 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-1-0 | {} | SAMPLE_2_10 | -1.452241e+04 |
| 2 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-2-0 | {} | SAMPLE_2_10 | 1.519405e+02 |
| 3 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-3-0 | {} | SAMPLE_2_10 | 3.324733e+00 |
| 4 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-4-0 | {} | SAMPLE_2_10 | -1.050100e-01 |
| 5 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-5-0 | {} | SAMPLE_2_10 | 1.152950e-03 |
| 6 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-6-0 | {} | SAMPLE_2_10 | -5.944136e-06 |
| 7 | 2-Isopropylmalic acid | C7H12O5 | C13N15-label-7-0 | {} | SAMPLE_2_10 | 1.480564e+05 |
| 8 | 3-Phosphoglyceric acid | C3H7O7P | C13N15-label-0-0 | {} | SAMPLE_2_10 | 0.000000e+00 |
| 9 | 3-Phosphoglyceric acid | C3H7O7P | C13N15-label-1-0 | {} | SAMPLE_2_10 | 0.000000e+00 |
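
The wide-to-long step can be reproduced on a toy frame with `pd.melt`. In this sketch the literal column names stand in for the `cons.*` constants (an assumption based on the headers shown in this notebook), and raw intensities from the earlier output are reused as stand-in values, since corrected values are only shown for one sample.

```python
import pandas as pd

# Wide format: one column per sample.
wide_df = pd.DataFrame({
    'Name': ['2-Isopropylmalic acid'] * 2,
    'Label': ['C13N15-label-0-0', 'C13N15-label-1-0'],
    'SAMPLE_2_10': [291402.5, 9314.458],
    'SAMPLE_2_2': [276754.8, 0.0],
})

# melt stacks every non-id column into 'variable' / 'value' pairs.
long_df = pd.melt(wide_df, id_vars=['Name', 'Label'])
long_df = long_df.rename(columns={'variable': 'Sample', 'value': 'Intensity'})
```

Every column not listed in `id_vars` is treated as a sample column, which is why the identifying columns must all be named explicitly.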


**Merging NACorrected_df with the original_df to retrieve the original labels and original intensities.**

In [13]:
joined= pd.merge(df_long, original_df, on=[cons.NAME_COL, cons.FORMULA_COL, cons.LABEL_COL, cons.SAMPLE_COL],
                                                                                 how='left').fillna(0)     
joined.loc[joined.Original_label == 0, cons.ORIGINAL_LABEL_COL] = joined.Label
joined.drop(cons.LABEL_COL, axis=1, inplace=True)
joined.rename(index= str, columns={cons.ORIGINAL_LABEL_COL :cons.LABEL_COL}, inplace=True)
print(joined)

                         Name   Formula Indistinguishable_isotope  \
0       2-Isopropylmalic acid   C7H12O5                        {}   
1       2-Isopropylmalic acid   C7H12O5                        {}   
2       2-Isopropylmalic acid   C7H12O5                        {}   
3       2-Isopropylmalic acid   C7H12O5                        {}   
4       2-Isopropylmalic acid   C7H12O5                        {}   
5       2-Isopropylmalic acid   C7H12O5                        {}   
6       2-Isopropylmalic acid   C7H12O5                        {}   
7       2-Isopropylmalic acid   C7H12O5                        {}   
8      3-Phosphoglyceric acid   C3H7O7P                        {}   
9      3-Phosphoglyceric acid   C3H7O7P                        {}   
10     3-Phosphoglyceric acid   C3H7O7P                        {}   
11     3-Phosphoglyceric acid   C3H7O7P                        {}   
12      5-Aminolevulinic acid   C5H9NO3                        {}   
13      5-Aminolevulinic acid   C5

**Fractional Enrichment Calculation**

In [14]:
output_df = fractional_enrichment(joined)
print(output_df)

            Sample                    Name             Label   Formula  \
0      SAMPLE_2_10   2-Isopropylmalic acid        C12 PARENT   C7H12O5   
1      SAMPLE_2_10   2-Isopropylmalic acid       C13-label-1   C7H12O5   
2      SAMPLE_2_10   2-Isopropylmalic acid       C13-label-2   C7H12O5   
3      SAMPLE_2_10   2-Isopropylmalic acid       C13-label-3   C7H12O5   
4      SAMPLE_2_10   2-Isopropylmalic acid       C13-label-4   C7H12O5   
5      SAMPLE_2_10   2-Isopropylmalic acid       C13-label-5   C7H12O5   
6      SAMPLE_2_10   2-Isopropylmalic acid       C13-label-6   C7H12O5   
7      SAMPLE_2_10   2-Isopropylmalic acid       C13-label-7   C7H12O5   
8      SAMPLE_2_10  3-Phosphoglyceric acid        C12 PARENT   C3H7O7P   
9      SAMPLE_2_10  3-Phosphoglyceric acid       C13-label-1   C3H7O7P   
10     SAMPLE_2_10  3-Phosphoglyceric acid       C13-label-2   C3H7O7P   
11     SAMPLE_2_10  3-Phosphoglyceric acid       C13-label-3   C3H7O7P   
12     SAMPLE_2_10   5-Aminolevulinic 

In [15]:
df= _merge_dfs(output_df, joined)
df

| | Sample | Name | Label | Formula | Pool_total | Fractional enrichment | Indistinguishable_isotope | NA Corrected | Intensity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SAMPLE_2_10 | 2-Isopropylmalic acid | C12 PARENT | C7H12O5 | 4.632959e+05 | 6.800928e-01 | {} | 3.150842e+05 | 291402.500 |
| 1 | SAMPLE_2_10 | 2-Isopropylmalic acid | C13-label-1 | C7H12O5 | 4.632959e+05 | 0.000000e+00 | {} | -1.452241e+04 | 9314.458 |
| 2 | SAMPLE_2_10 | 2-Isopropylmalic acid | C13-label-2 | C7H12O5 | 4.632959e+05 | 3.279556e-04 | {} | 1.519405e+02 | 0.000 |
| 3 | SAMPLE_2_10 | 2-Isopropylmalic acid | C13-label-3 | C7H12O5 | 4.632959e+05 | 7.176263e-06 | {} | 3.324733e+00 | 0.000 |
| 4 | SAMPLE_2_10 | 2-Isopropylmalic acid | C13-label-4 | C7H12O5 | 4.632959e+05 | 0.000000e+00 | {} | -1.050100e-01 | 0.000 |
| 5 | SAMPLE_2_10 | 2-Isopropylmalic acid | C13-label-5 | C7H12O5 | 4.632959e+05 | 2.488582e-09 | {} | 1.152950e-03 | 0.000 |
| 6 | SAMPLE_2_10 | 2-Isopropylmalic acid | C13-label-6 | C7H12O5 | 4.632959e+05 | 0.000000e+00 | {} | -5.944136e-06 | 0.000 |
| 7 | SAMPLE_2_10 | 2-Isopropylmalic acid | C13-label-7 | C7H12O5 | 4.632959e+05 | 3.195720e-01 | {} | 1.480564e+05 | 148056.400 |
| 8 | SAMPLE_2_10 | 3-Phosphoglyceric acid | C12 PARENT | C3H7O7P | 0.000000e+00 | 0.000000e+00 | {} | 0.000000e+00 | 0.000 |
| 9 | SAMPLE_2_10 | 3-Phosphoglyceric acid | C13-label-1 | C3H7O7P | 0.000000e+00 | 0.000000e+00 | {} | 0.000000e+00 | 0.000 |
