# Sampling resources to generate test dataset input files
qtp-services includes a utils package that helps generate small to medium-sized datasets.
Starting from an MS experiment file in csv format, proteins can be selected and their related information extracted from STRING and UniProt records.

You will need to describe the expected format of the MS csv file:
* the record format, as a dictionary matching the `dtype` option of `pandas.read_csv`
* the name of the column featuring the UniProt identifiers
* the name of the column featuring the quantitative variable used to sort the csv records
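
As a rough sketch of how these three settings fit together, the snippet below reads a small in-memory table with `pandas.read_csv` and sorts it on the quantitative column. The column names, values, and variable names (`example_dtype`, `id_column`, `value_column`) are made up for illustration, and the descending sort order is an assumption; this is not necessarily what `MS_frame` does internally.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical tab-separated MS export; names and values are illustrative only.
raw = (
    "Accession\tGene Symbol\tAbundance Ratio\n"
    "P00001\tgeneA\t0.8\n"
    "P00002\tgeneB\t2.5\n"
    "P00003\tgeneC\t1.2\n"
)

# 1. Record format: column name -> dtype, as accepted by read_csv(dtype=...).
example_dtype = {"Accession": str, "Gene Symbol": str, "Abundance Ratio": np.float64}
# 2. Column holding the UniProt identifiers.
id_column = "Accession"
# 3. Column holding the quantitative variable used for sorting.
value_column = "Abundance Ratio"

frame = pd.read_csv(io.StringIO(raw), sep="\t", dtype=example_dtype)
# Assumption: highest values first.
ranked = frame.sort_values(value_column, ascending=False)
ranked_ids = ranked[id_column].tolist()
print(ranked_ids)
```

Keys of the dtype dictionary have to match the column headers of the file exactly for the types to be applied.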

The cells below showcase the following steps:
1. Sort the UniProt entries from the MS experiment
2. Pick UniProt identifiers according to the user's settings
3. Extract the matching xml elements
4. Extract the matching STRING entries
5. Generate the corresponding xml/string input files

From a list of UniProt identifiers extracted from the MS csv file, you will obtain sample/reduced sets for the following data:
* uniprot_xml
* string_alias
* string_details

These reduced sets cover exactly the information available for the extracted UniProt identifiers.


In [1]:
import numpy as np

# Column name -> dtype, following the pandas.read_csv 'dtype' option
data_type = {
    'Accession': str,
    'Description': str,
    'Gene Symbol': str,
    'Log2 Corrected Abundance Ratio': np.float64,
    'Corrected Abundance ratio (1.53)': np.float64,
    'Abundance Ratio Adj. P-Value: (127. T3 Tc WT) / (126. T0 WT)': np.float64,
    '-LOG10 Adj.P-val': np.float64,
}

ms_data_csv_file     = 'data/exp/Nolivos/Nolivos_wt1_subset.tsv'
uniprot_proteome_xml = 'data/proteomes/Escherichia_coli_K12_and_TMT_21026.xml'
string_alias         = 'data/ppi/string/Escherichia_coli/511145.protein.aliases.v11.5.txt'
string_details       = 'data/ppi/string/Escherichia_coli/511145.protein.links.detailed.v11.5.txt'

## Process MS experimental input (csv)
### Sort UniProt entries from the MS experiment
### Pick UniProt identifiers according to the user's settings

In [2]:
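from qtp_services.utils.ms_data_csv import MS_frame

ms_frame = MS_frame(ms_data_csv_file)
# Arguments: input file, quantitative column used for sorting,
# dtype dictionary, column holding the UniProt identifiers
ms_frame.parse(ms_data_csv_file,
               'Corrected Abundance ratio (1,53)',
               data_type, "Accession")

# Keep the first 50 identifiers returned for min_value=0.1
sample_uniprot_list = ms_frame.transform(min_value=0.1).uniprot_ids[:50]
sample_uniprot_list[:5]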

{'Pfam IDs', 'Description', 'Ensembl Gene ID', 'Corrected Abundance ratio (1,53)', 'Accession', 'Log2 Corrected Abundance Ratio', 'Abundance Ratio Adj. P-Value: (127, T3 Tc WT) / (126, T0 WT)', 'LOG10 Adj.P-val', 'Molecular Function', 'Abundance Ratio: (127, T3 Tc WT) / (126, T0 WT)', 'KEGG Pathways', 'Cellular Component', 'Biological Process', 'Entrez Gene ID', 'Gene Symbol'}


['P0A8S9', 'P05706', 'P29744', 'P43533', 'P69741']

## Extract all the corresponding xml proteome entries 

In [4]:
from qtp_services.utils.uniprot_xml import transform_xml_tree
sample_proteome_xml = transform_xml_tree(uniprot_proteome_xml, sample_uniprot_list)

indexing
successfully indexed 10698
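
To illustrate the idea behind this step, here is a minimal standard-library sketch that keeps only the `<entry>` elements whose `<accession>` matches a list of identifiers. It is not the implementation of `transform_xml_tree`; the helper `filter_entries`, the sample document, and the namespace URI used in it are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a UniProt proteome file; real entries carry many more fields.
# The namespace URI is an assumption and is read back from the root tag below.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><accession>P00001</accession><name>A_ECOLI</name></entry>
  <entry><accession>P00002</accession><accession>Q00002</accession><name>B_ECOLI</name></entry>
  <entry><accession>P00003</accession><name>C_ECOLI</name></entry>
</uniprot>"""


def filter_entries(xml_text, wanted_ids):
    """Return a new root element holding only entries with a wanted accession."""
    root = ET.fromstring(xml_text)
    # Root tag looks like "{namespace}uniprot"; reuse the same prefix for children.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    wanted = set(wanted_ids)
    reduced = ET.Element(root.tag)
    for entry in root.findall(f"{prefix}entry"):
        # An entry may list several accessions (primary and secondary).
        accessions = {acc.text for acc in entry.findall(f"{prefix}accession")}
        if accessions & wanted:
            reduced.append(entry)
    return reduced


reduced_root = filter_entries(SAMPLE_XML, ["P00001", "Q00002"])
kept_names = [
    child.text
    for entry in reduced_root
    for child in entry
    if child.tag.endswith("}name")
]
print(kept_names)
```

Matching on any accession of an entry, rather than only the first one, is a design choice of this sketch so that secondary accessions are not missed.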


## Extract matching STRING entries from the *alias* and *details* flat files
### Generate the corresponding xml/string input files

In [5]:
from qtp_services.utils.string_flat import extract_string
new_alias, new_detail = extract_string(string_alias, string_details,
                                       sample_uniprot_list)
label = "my_dataset"
with open(f"{label}.protein.links.detailed.txt", 'w') as f_detail:
    f_detail.write(new_detail)
with open(f"{label}.protein.aliases.txt", 'w') as f_alias:
    f_alias.write(new_alias)
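
The filtering itself can be pictured with the following standard-library sketch. It is only an approximation of what `extract_string` may do: the helper `reduce_string`, the simplified file layouts (a tab-separated alias table and a space-separated link table with fewer columns than the real files), and the rule that a link is kept only when both partners are selected are all assumptions.

```python
# Simplified stand-ins for the STRING flat files; identifiers are made up.
ALIASES = (
    "#string_protein_id\talias\tsource\n"
    "511145.b0001\tP00001\tUniProt_AC\n"
    "511145.b0001\tgeneA\tBLAST_UniProt_GN\n"
    "511145.b0002\tP00002\tUniProt_AC\n"
    "511145.b0003\tP00003\tUniProt_AC\n"
)
LINKS = (
    "protein1 protein2 combined_score\n"
    "511145.b0001 511145.b0002 900\n"
    "511145.b0001 511145.b0003 450\n"
    "511145.b0002 511145.b0003 700\n"
)


def reduce_string(alias_text, links_text, uniprot_ids):
    """Return (alias, links) texts restricted to the given UniProt identifiers."""
    wanted = set(uniprot_ids)
    alias_lines = alias_text.splitlines()
    # First pass: STRING ids having one of the wanted UniProt ids as alias.
    string_ids = {
        line.split("\t")[0]
        for line in alias_lines[1:]
        if line.split("\t")[1] in wanted
    }
    # Keep every alias row of the selected proteins, header included.
    kept_alias = [alias_lines[0]] + [
        line for line in alias_lines[1:] if line.split("\t")[0] in string_ids
    ]
    link_lines = links_text.splitlines()
    # Assumption: keep a link only if both partners are selected.
    kept_links = [link_lines[0]] + [
        line for line in link_lines[1:] if set(line.split()[:2]) <= string_ids
    ]
    return "\n".join(kept_alias) + "\n", "\n".join(kept_links) + "\n"


sketch_alias, sketch_links = reduce_string(ALIASES, LINKS, ["P00001", "P00002"])
print(sketch_links)
```

Both results are plain strings, so they can be written to disk with `open(...).write(...)` in the same way as in the cell above.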
      