In [1]:
import pandas as pd
import numpy as np
import sys

The purpose of this experiment is to work out how best to automate a large number of searches on the beta server. The goal is to search all existing public datasets on the beta server with 'hdmb-v4' for the 'H2O' neutral loss at FDR=0.5. Results will later be filtered to FDR<=0.2 for the parent and FDR<=0.5 for the neutral loss, since the current FDR calculations may unfairly penalize neutral losses.
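
The later filtering step could look roughly like the sketch below. It is not the code used in this project: the table and its column names (`fdr_parent`, `fdr_nl`) are hypothetical and chosen only to illustrate the two thresholds.

```python
import pandas as pd

# Hypothetical results table; the column names are assumptions for illustration only
results = pd.DataFrame({
    'formula': ['C6H12O6', 'C5H10O5', 'C4H8O4', 'C3H6O3'],
    'fdr_parent': [0.05, 0.2, 0.5, 0.1],
    'fdr_nl': [0.5, 0.2, 0.1, 0.6],
})

def filter_by_fdr(df, parent_max=0.2, nl_max=0.5):
    """Keep rows where both the parent and the neutral loss pass their FDR cut-off."""
    keep = (df['fdr_parent'] <= parent_max) & (df['fdr_nl'] <= nl_max)
    return df[keep].reset_index(drop=True)

filtered = filter_by_fdr(results)
print(filtered)
```

Keeping the two thresholds as separate arguments makes it easy to relax the neutral-loss cut-off without touching the parent cut-off.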

Here, we use a subset of "high-quality" datasets from the top 4 labs submitting Orbitrap/FTMS data in the positive and negative ion modes.

Data will be searched against a "core metabolome DB".

# Off-line steps:
1. Metadata as csv was downloaded from: https://beta.metaspace2020.eu/ on 2020 Feb 06.
2. Data were imported into Google Sheets, since the file is comma separated but some text fields themselves contain commas:
https://docs.google.com/spreadsheets/d/1DOLikG1euG-brCrMB5jrKwiu7bknPUVGVf8JsZXY3hM/edit?usp=sharing
3. Data were further filtered for quality as described above in: "neutral_loss/good_nl_reports/high_quality_data_investigations.ipynb"
4. Core metabolome was calculated as in: "core_metabolome/core_metabolome_v1.pickle" 
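
The comma problem in step 2 can be illustrated with a toy example. This is only a sketch: the dataset name is made up, and whether the real export quotes such fields is not confirmed here. When fields are quoted, `pd.read_csv` keeps embedded commas intact, and re-saving as tab-separated text removes the ambiguity for later reads.

```python
import io
import pandas as pd

# Toy export (assumed layout): the second field contains a comma but is quoted
raw = 'datasetId,datasetName\n2018-07-09_16h30m44s,"brain, section 1"\n'
df = pd.read_csv(io.StringIO(raw))

# Re-save as tab-separated text so commas inside fields are no longer ambiguous
tsv = df.to_csv(sep='\t', index=False)
round_trip = pd.read_csv(io.StringIO(tsv), sep='\t')
print(round_trip)
```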

In [2]:
# All datasets on beta server
beta_raw = pd.read_csv('/Users/dis/PycharmProjects/neutal_loss_2/Metaspace_beta_2020_Feb.csv',
                       sep='\t')

# Good quality datasets
good_ds_list = list(pd.read_csv('good_ds_2020_Feb_25.txt').good_datasets)

# Filter for only good datasets
beta_raw = beta_raw[beta_raw['datasetId'].isin(good_ds_list)]

# Core metabolome
core_metabolome = pd.read_pickle('core_metabolome_v1.pickle')

In [7]:
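# Strip commas from all text fields so they cannot break the comma-joined query string
beta_df = beta_raw.replace({',':''}, regex=True)
columns = ['datasetId', 'datasetName', 'polarity', 'organism', 'organismPart', 'analyzer', 
           'ionisationSource', 'maldiMatrix']
temp_df = beta_df[columns].copy(deep=True)
# One query string per dataset: non-missing metadata values joined by commas, dataset ID first
beta_df['query'] = temp_df.apply(lambda x: ','.join(x.dropna().values.tolist()), axis=1)
# Placeholder column for per-dataset results
beta_df['result'] = ''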
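
A small worked example of the query construction above, on made-up rows (the IDs and values are placeholders, not real datasets). Missing values are dropped before joining, and the dataset ID stays in first position. The `astype(str)` call is an addition in this sketch to guard against non-string values.

```python
import numpy as np
import pandas as pd

# Toy metadata; values are placeholders for illustration only
toy = pd.DataFrame({
    'datasetId': ['ds_1', 'ds_2'],
    'polarity': ['POSITIVE', 'NEGATIVE'],
    'maldiMatrix': ['DHB', np.nan],
})

# Join the non-missing values of each row into one comma-separated string
toy['query'] = toy.apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
print(toy['query'].tolist())
```

Because the ID is always the first field, `query.split(',')[0]` recovers it later.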

In [8]:
beta_df.shape

(433, 20)

In [9]:
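#%%capture cap_out
start = 0                # first row to process (raise this to resume an interrupted run)
stop = len(beta_df) - 1  # last valid row index; 432 for the 433 filtered datasets

# Read the template script once; it does not change between iterations
with open('neutral_loss_report_beta_test_3.py', 'r') as file:
    template = file.read()

for i in range(start, stop + 1):
    print("Current row is: " + str(i) + " of " + str(stop))
    # Index the row before moving on, so row 0 is not skipped and the last row stays in bounds
    row = beta_df.iloc[i, :]

    # Fill in the query string and the output name (the dataset ID, first field of the query)
    filedata = template.replace('Literal_to_replace_ds_id', 
                                row['query'])
    filedata = filedata.replace('Literal_to_replace_out', 
                                row['query'].split(',')[0])

    with open('neutral_loss_report_beta_test_4.py', 'w') as file:
        file.write(filedata)

    %run -i 'neutral_loss_report_beta_test_4.py'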

Current row is: 246 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 247 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 248 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 249 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 250 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 251 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 252 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 253 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 254 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
C

Authorized.
Current row is: 320 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 321 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 322 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 323 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 324 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 325 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 326 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 327 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 328 of 3333
Unauthorized. Only public but not private datasets will be accessible.
A

Authorized.
Current row is: 394 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 395 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 396 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 397 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 398 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 399 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 400 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 401 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Authorized.
Current row is: 402 of 3333
Unauthorized. Only public but not private datasets will be accessible.
A

IndexError: single positional indexer is out-of-bounds

3248
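
The templating step in the loop above can also be written as a small pure function, which is easier to check in isolation than the file rewrite. This is a sketch only: `render_script` is a hypothetical helper, and the template text below is invented, since the contents of 'neutral_loss_report_beta_test_3.py' are not shown here.

```python
def render_script(template, query):
    """Return the template with both placeholders filled in from one query string."""
    dataset_id = query.split(',')[0]  # the dataset ID is the first query field
    return (template
            .replace('Literal_to_replace_ds_id', query)
            .replace('Literal_to_replace_out', dataset_id))

# Invented template text, standing in for the real script
template = "search('Literal_to_replace_ds_id')\nout = 'Literal_to_replace_out.csv'\n"
rendered = render_script(template, 'ds_1,POSITIVE,DHB')
print(rendered)
```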

In [None]:
i

In [9]:
cap_out.show()

Current row is: 0 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Got 70 annotations for 2018-07-09_16h30m44s @ 0.2
Current row is: 1 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Got 8 annotations for 2018-07-10_18h24m20s @ 0.2
Current row is: 2 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Got 1 annotations for 2018-08-03_10h46m39s @ 0.2
Current row is: 3 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Got 713 annotations for 2018-08-08_09h27m46s @ 0.2
Current row is: 4 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Got 771 annotations for 2018-08-10_12h02m46s @ 0.2
Current row is: 5 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Got 436 annotations for 2018-09-19_09h15m47s @ 0.2
Current row is: 6 of 3333
Unauthorized. Only public but not private datasets will be accessible.
Got 611 annotations 

In [10]:
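# Write the captured output to a text file; the with-block restores stdout and closes the file
from contextlib import redirect_stdout

with open('cap_out_Feb__14_2020.txt', 'w') as f, redirect_stdout(f):
    cap_out.show()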

In [None]:
# Repeat download after all processing complete with reprocess turned off!
# Download to a fresh new folder

In [None]:
# Next script is: 
'http://localhost:8888/notebooks/PycharmProjects/neutral_loss/nl_0_3_clean_join_nb.ipynb'