This notebook uses FlashLFQ to quantify peptides and proteins in the label-free projects.

FlashLFQ requires 3 files: <br>

### **1. MS/MS identification file**

- File Name - With or without file extension (e.g. MyFile or MyFile.mzML)

- Base Sequence - Should only contain an amino acid sequence (e.g., PEPTIDE and not PEPT[Phosphorylation]IDE)

- Full Sequence - Modified sequence. Can contain any characters (e.g., PEPT[Phosphorylation]IDE is fine), but must be consistent between the same peptidoform to get accurate results

- Peptide Monoisotopic Mass - Theoretical monoisotopic mass, including modification mass

- Scan Retention Time - MS/MS identification scan retention time in minutes

- Precursor Charge - Charge of the ion selected for MS/MS resulting in the identification. Use the number only (e.g., "3" and not "+3")

- Protein Accession - Protein accession(s) for the peptide. It is important to list all of the parent protein options if you want the "shared peptides" to be accurate. Use the semicolon (;) to delimit different proteins.
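The column list above can be sketched as a small file writer. The example below writes an identification file with the seven columns; the tab delimiter, the helper name `write_identification_file`, the output file name and every row value are assumptions for illustration, not output from a real search.

```python
import csv
import os
import tempfile

# Column headers as listed above
COLUMNS = [
    "File Name", "Base Sequence", "Full Sequence",
    "Peptide Monoisotopic Mass", "Scan Retention Time",
    "Precursor Charge", "Protein Accession",
]

def write_identification_file(rows, path):
    """Write PSM rows (dicts keyed by COLUMNS) to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical PSM; every value is a placeholder
example_rows = [
    {
        "File Name": "MyFile",
        "Base Sequence": "PEPTIDE",
        "Full Sequence": "PEPT[Phosphorylation]IDE",
        "Peptide Monoisotopic Mass": 879.3263,   # includes the modification mass
        "Scan Retention Time": 42.5,             # minutes
        "Precursor Charge": 2,                   # number only, not "+2"
        "Protein Accession": "P12345;Q67890",    # all parent proteins, ';'-delimited
    },
]

# Written to a temporary folder so the sketch does not touch real data
out_path = os.path.join(tempfile.mkdtemp(), "msms_example.tsv")
write_identification_file(example_rows, out_path)
```

Keeping the Full Sequence string identical for every PSM of the same peptidoform is what lets the rows be grouped correctly downstream.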

### **2. Experimental design file**

This file must be placed in the same folder as the RAW files.

- FileName	

- Condition	

- Biorep	

- Fraction	

- Techrep
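A minimal sketch of generating the design file for one project. The file name `ExperimentalDesign.tsv` is the one used later in this notebook; the tab delimiter, the helper name, the condition labels and the replicate numbers are assumptions for illustration.

```python
import csv
import os
import tempfile

DESIGN_COLUMNS = ["FileName", "Condition", "Biorep", "Fraction", "Techrep"]

def write_experimental_design(raw_dir, design_rows):
    """Write ExperimentalDesign.tsv into the folder that holds the RAW files."""
    design_path = os.path.join(raw_dir, "ExperimentalDesign.tsv")
    with open(design_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(DESIGN_COLUMNS)
        writer.writerows(design_rows)
    return design_path

# Hypothetical annotation: two conditions, one replicate each, no fractionation
raw_dir = tempfile.mkdtemp()  # stand-in for the real RAW directory
rows = [
    ("H7-1", "conditionA", 1, 1, 1),
    ("3B10-1", "conditionB", 1, 1, 1),
]
design_path = write_experimental_design(raw_dir, rows)
```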

### **3. RAW file**

### **CMD commands**

To run FlashLFQ:

example: dotnet CMD.dll --idt "C:\MyFolder\msms.txt" --rep "C:\MyFolder" --ppm 5 --chg

  --idt        Required. string; identification file path <br>
  --rep        Required. string; directory containing spectral data files <br>
  --sil        (Default: false) bool; silent mode <br>
  --rea        (Default: false) bool; filesystem is readonly, prevents writing
               to the FlashLFQ folder <br>
  --pth        (Default: false) bool; print Thermo's RawFileReader licence;
               required to read .raw files <br>
  --ath        (Default: false) bool; accept Thermo's RawFileReader licence;
               required to read .raw files <br>
  --out        string; output directory <br>
  --nor        (Default: false) bool; normalize intensity results <br>
  --ppm        (Default: 10) double; ppm tolerance <br>
  --iso        (Default: 5) double; isotopic distribution tolerance in ppm <br>
  --int        (Default: false) bool; integrate peak areas (not recommended) <br>
  --nis        (Default: 2) int; number of isotopes required to be observed <br>
  --chg        (Default: false) bool; use only precursor charge state <br>
  --thr        (Default: -1) int; number of CPU threads to use <br>
  --mbr        (Default: false) bool; match between runs <br>
  --mrt        (Default: 2.5) double; maximum MBR window in minutes <br>
  --rmc        (Default: false) bool; require MS/MS ID in condition <br>
  --bay        (Default: false) bool; Bayesian protein fold-change analysis <br>
  --ctr        string; control condition for Bayesian protein fold-change 
               analysis <br>
  --fcc        (Default: 0.1) double; fold-change cutoff for Bayesian protein 
               fold-change analysis <br>
  --mcm        (Default: 3000) int; number of markov-chain monte carlo 
               iterations for the Bayesian protein fold-change analysis <br>
  --bur        (Default: 1000) int; number of markov-chain monte carlo burn-in 
               iterations <br>
  --sha        (Default: false) bool; use shared peptides for protein 
               quantification <br>
  --rns        int; random seed for the Bayesian protein fold-change analysis <br>
  --help       Display this help screen. <br>
  --version    Display version information. <br>

source: https://github.com/smith-chem-wisc/FlashLFQ
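To launch FlashLFQ from the notebook, the command can be assembled as an argument list. The sketch below uses only options from the help text above; the helper name `build_flashlfq_command` is hypothetical, and it assumes that boolean options are passed as bare flags, as `--chg` is in the example command.

```python
def build_flashlfq_command(idt, rep, out=None, ppm=10, chg=False, mbr=False, nor=False):
    """Assemble the argument list for a FlashLFQ command-line run."""
    command = ["dotnet", "CMD.dll", "--idt", idt, "--rep", rep, "--ppm", str(ppm)]
    if out is not None:
        command += ["--out", out]
    # Boolean options are appended as bare flags only when enabled
    for flag, enabled in (("--chg", chg), ("--mbr", mbr), ("--nor", nor)):
        if enabled:
            command.append(flag)
    return command

command = build_flashlfq_command(r"C:\MyFolder\msms.txt", r"C:\MyFolder", ppm=5, chg=True)
print(" ".join(command))

# To actually run it (requires dotnet and FlashLFQ's CMD.dll in the working directory):
# import subprocess; subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids quoting problems with paths that contain spaces.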

# Test

In [1]:
import pandas as pd
import mysql.connector
import glob
import os
import numpy as np
import logging

In [7]:
def find_pxd_path(pxds: list, return_path = False):
    '''Search the compomics mounts for the directory of each PXD accession in pxds.

    Returns the list of PXD accessions that were found.
    If return_path is True, returns (x, y, z) instead:

    - x = list of PXD accessions that were found
    - y = dictionary mapping each PXD path to its number of ionbot result entries
    - z = list with the first ionbot version directory found for each PXD path'''
        
    path_found = []
    version_path = []
    file_output = {}
    pxd_found = []
    for pxd in pxds:
        for path in glob.glob("/home/compomics/mounts/*/*/PRIDE_DATA/" + str(pxd)):
            if pxd not in pxd_found:
                pxd_found.append(pxd)
            path_found.append(path)
      
    for path in path_found:
        count = 0
        for version in glob.glob(path + '/IONBOT_v*/*'):
            if count == 0:
                version_path.append("/".join(version.split('/')[0:-1]))
            count += 1
            
        file_output[path] = count

    print(f"Found {len(pxd_found)} out of {len(pxds)}.")
    if not return_path:
        return pxd_found
    return pxd_found, file_output, version_path

def find_file_path(pxd, raw):
    '''Used with DataFrame.apply to return the ionbot results path for a given RAW file. If several ionbot versions have results, returns the path from the newest version; returns np.nan if no results file is found.'''

    files = []
       
    for file in glob.glob("/home/compomics/mounts/*/*/PRIDE_DATA/" + str(pxd) + '/IONBOT_v*/*/' + str(raw) + '.mgf.gzip.ionbot.csv'):
        files.append(file)
       
    if files == []:
        for file in glob.glob("/home/compomics/mounts/*/*/PRIDE_DATA/" + str(pxd) + '/IONBOT_v*/' + str(raw) + '.mgf.ionbot.csv'):
            files.append(file)

    if len(files) == 1:
        return files[0]
        
    if len(files) > 1:
        versions = [file.split("/")[8] for file in files]
        versions = [[int(version.split('.')[1]), int(version.split('.')[2])] for version in versions]
        best = [0,0]
        for version in versions:
            if version[0] > best[0]:
                best[0], best[1] = version[0], version[1]

            if version[0] == best[0]:
                if version[1] > best[1]:
                    best[0], best[1] = version[0], version[1]
        return files[versions.index(best)]
            
    return np.nan

def ionbot_parse(file, version):
    '''Read an ionbot results file and keep only confident target PSMs.'''
    df = pd.read_csv(file, sep=',')
        
    if df.empty:
        logging.debug(f"{file.split('/')[-1]} is empty.")

    # keep only the best PSM per spectrum (best_psm equals 1)
    df = df.loc[df['best_psm'] == 1]
    # keep PSMs with q-value <= 0.01
    df = df.loc[df['q_value'] <= 0.01]
    # keep target hits only: DB must equal 'T' (otherwise it is a decoy hit)
    df = df.loc[df['DB'] == 'T']

    return df
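When several ionbot versions have results for the same RAW file, `find_file_path` keeps the newest one. The same selection can be sketched with `max` and tuple comparison. The helper names and the example paths below are hypothetical, and the sketch assumes the version directory is named `IONBOT_v<major>.<minor>.<patch>`, as in the glob patterns above.

```python
import re

def ionbot_version(path):
    """Return the (major, minor, patch) tuple from the IONBOT_v* directory in a path."""
    match = re.search(r"IONBOT_v(\d+)\.(\d+)\.(\d+)", path)
    if match is None:
        raise ValueError(f"no ionbot version directory in {path}")
    return tuple(int(part) for part in match.groups())

def newest_result(paths):
    """Pick the results file produced by the highest ionbot version."""
    return max(paths, key=ionbot_version)

# Hypothetical paths for one RAW file processed with two ionbot versions
paths = [
    "/data/PXD000000/IONBOT_v0.6.3/sample.mgf.ionbot.csv",
    "/data/PXD000000/IONBOT_v0.8.0/sample.mgf.gzip/sample.mgf.gzip.ionbot.csv",
]
print(newest_result(paths))
```

Unlike the fixed index `split("/")[8]`, the regular expression does not depend on how deep the mount point sits in the file system.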

In [21]:
conn = mysql.connector.connect(user='root', password='password', host='127.0.0.1', port='3306',
                               database='expression_atlas_cells')
mycursor = conn.cursor(buffered=True)

# check the connection
if conn.is_connected():
    print("connection successful")
else:
    print("no connection")

connection successful


In [22]:
projects = "SELECT * FROM project"
projects = pd.read_sql_query(projects, conn)
projects

Unnamed: 0,project_id,PXD_accession,experiment_type,instrument,pmid
0,1815,PXD000533,/,/,/
1,1816,PXD004280,/,/,/
2,1817,PXD002842,/,/,/
3,1818,PXD003594,/,/,/
4,1819,PXD008996,/,/,/
...,...,...,...,...,...
57,1872,PXD005354,/,/,/
58,1873,PXD004900,/,/,/
59,1874,PXD008222,/,/,/
60,1875,PXD005946,/,/,/


In [12]:
assays = "SELECT * FROM assay"
assays = pd.read_sql_query(assays, conn)
assays

Unnamed: 0,assay_id,project_id,filename
0,30960,1815,3B10-1
1,30961,1815,3B10-2
2,30962,1815,3B1-1
3,30963,1815,3B11-1
4,30964,1815,3B11-2
...,...,...,...
3147,34107,1876,00524_G02_P003819_B0O_A00_R2
3148,34108,1876,00524_G03_P003819_B0W_A00_R1
3149,34109,1876,00524_H01_P003819_B0H_A00_R2
3150,34110,1876,00524_H02_P003819_B0P_A00_R1


In [3]:
file_annotation = pd.read_csv("/home/compomics/Sam/git/python/master_thesis/Database/parsed_manual_meta2.csv")

In [4]:
extra_lfq = ['PXD000263', 'PXD022752', 'PXD029525', 'PXD011347', 'PXD033373', 'PXD003819', 'PXD005453', 'PXD014258', 'PXD003547', 'PXD031847', 'PXD022927', 'PXD014855', 'PXD018450', 'PXD003914', 'PXD014448', 'PXD029805', 'PXD005880', 'PXD017452', 'PXD028647']

In [5]:
all_lfq_pxds = set(file_annotation.PXD.unique().tolist() + extra_lfq)

In [8]:
found, num_files, version_path = find_pxd_path(extra_lfq, return_path = True)

Found 2 out of 19.


In [26]:
found, num_files, version_path = find_pxd_path(all_lfq_pxds, return_path = True)

Found 64 out of 81.


In [14]:
annotation_file = pd.read_csv("../Metadata/annotation_excel4.csv")

In [24]:
projects.PXD_accession.unique()

array(['PXD000533', 'PXD004280', 'PXD002842', 'PXD003594', 'PXD008996',
       'PXD006035', 'PXD008719', 'PXD006591', 'PXD003406', 'PXD003407',
       'PXD001327', 'PXD002057', 'PXD001352', 'PXD000661', 'PXD008381',
       'PXD009149', 'PXD005045', 'PXD000529', 'PXD000443', 'PXD010538',
       'PXD009600', 'PXD009442', 'PXD009560', 'PXD003252', 'PXD003587',
       'PXD007543', 'PXD008693', 'PXD000426', 'PXD007759', 'PXD018066',
       'PXD017391', 'PXD016742', 'PXD014381', 'PXD001468', 'PXD004051',
       'PXD001952', 'PXD000612', 'PXD002117', 'PXD006653', 'PXD001511',
       'PXD001668', 'PXD009185', 'PXD005507', 'PXD001974', 'PXD003903',
       'PXD003896', 'PXD002613', 'PXD003596', 'PXD004182', 'PXD005912',
       'PXD004940', 'PXD001441', 'PXD003790', 'PXD003530', 'PXD006112',
       'PXD003668', 'PXD004452', 'PXD005354', 'PXD004900', 'PXD008222',
       'PXD005946', 'PXD005940'], dtype=object)

## Finding the results paths that have associated RAW-files for the projects

In [39]:
# Create dataframe 
file_path_df = pd.DataFrame(columns = ["PXD", "RAW", "results_path"])

In [55]:
extensions = []
for pxd in projects.PXD_accession.unique():
    print(pxd)
    # find pxd dir
    counter = 0

    for path in glob.glob("/home/compomics/mounts/*/*/PRIDE_DATA/" + str(pxd)):
        counter += 1
        
        # Check only 1 directory for a pxd
        if counter > 1:
            print(f"multiple dirs for {pxd}")
        
        pxd_path = path
    
    if counter == 0:
        print(f"no dir for {pxd}")
        continue

    # find the RAW files
    raw_files = []
    file_counter = 0
    for path in glob.glob(str(pxd_path) + "/RAW/*"):
        file_counter += 1
        if file_counter == 1:
            extension = ".".join(path.split("/")[-1].split(".")[1:])
            if extension not in extensions:
                extensions.append(extension)
        
        raw_files.append(path.split("/")[-1].split(".")[0])

    print(raw_files)
    for raw_file in raw_files:
        results_file = find_file_path(pxd, raw_file)
        print(results_file)
        # Add the entry (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        file_path_df = pd.concat([file_path_df, pd.DataFrame(data = {"PXD": [pxd], "RAW": [raw_file], "results_path": [results_file]})])

    break


Extension: raw.xz
['H7-1', 'H10-1', 'H11-2', 'H13-1', 'H8-2', '3B20-2', 'H24-2', 'H17-2', '3B21-2', '3B16-1', 'H12-1', 'H19-1', '3B19-2', '3B2-1', '3B24-2', '3B6-2', '3B1-1', '3B23-2', 'H14-2', 'H9-1', 'H20-2', 'H5-1', '3B15-2', 'H16-1', 'H19-2', '3B24-1', 'H14-1', 'H18-2', 'H3-1', '3B7-2', 'H22-1', 'H22-2', '3B12-2', '3B5-1', '3B16-2', '3B8-1', 'H4-2', 'H1-1', '3B10-1', '3B13-1', '3B17-1', 'H2-2', '3B9-1', '3B1-2', '3B17-2', 'H8-1', 'H16-2', '3B23-1', '3B11-2', 'H12-2', '3B3-2', 'H7-2', '3B4-1', '3B14-2', '3B19-1', '3B11-1', '3B9-2', 'H23-2', '3B22-2', '3B6-1', '3B8-2', 'H23-1', 'H24-1', '3B22-1', 'H6-2', '3B21-1', 'H5-2', 'H11-1', '3B14-1', 'H21-2', 'H2-1', '3B2-2', 'H6-1', 'H17-1', 'H9-2', '3B18-2', '3B3-1', '3B12-1', 'H4-1', 'H1-2', '3B15-1', 'H20-1', 'H13-2', 'H18-1', 'H10-2', 'H15-2', '3B20-1', '3B5-2', '3B10-2', 'H15-1', 'H3-2', '3B13-2', 'H21-1', '3B18-1', '3B7-1', '3B4-2']
/home/compomics/mounts/conode53/pride/PRIDE_DATA/PXD000533/IONBOT_v0.8.0/H7-1.mgf.gzip/H7-1.mgf.gzip.ionb

In [56]:
file_path_df

Unnamed: 0,PXD,RAW,results_path
0,PXD000533,H7-1,/home/compomics/mounts/conode53/pride/PRIDE_DA...
0,PXD000533,H10-1,/home/compomics/mounts/conode53/pride/PRIDE_DA...
0,PXD000533,H11-2,/home/compomics/mounts/conode53/pride/PRIDE_DA...
0,PXD000533,H13-1,/home/compomics/mounts/conode53/pride/PRIDE_DA...
0,PXD000533,H8-2,/home/compomics/mounts/conode53/pride/PRIDE_DA...
...,...,...,...
0,PXD000533,3B13-2,/home/compomics/mounts/conode53/pride/PRIDE_DA...
0,PXD000533,H21-1,/home/compomics/mounts/conode53/pride/PRIDE_DA...
0,PXD000533,3B18-1,/home/compomics/mounts/conode53/pride/PRIDE_DA...
0,PXD000533,3B7-1,/home/compomics/mounts/conode53/pride/PRIDE_DA...
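
Growing a DataFrame one row at a time inside the loop copies the whole frame on every iteration, and it leaves every row with index 0, as in the table above. A lighter pattern is to collect plain dictionaries and build the frame once. The sketch below uses a hypothetical `fake_lookup` function in place of `find_file_path` so that it runs without the compomics mounts.

```python
import numpy as np
import pandas as pd

def collect_result_paths(raw_files_per_pxd, lookup):
    """Build the PXD / RAW / results_path table in a single DataFrame construction."""
    records = []
    for pxd, raw_files in raw_files_per_pxd.items():
        for raw_file in raw_files:
            records.append({"PXD": pxd, "RAW": raw_file, "results_path": lookup(pxd, raw_file)})
    return pd.DataFrame(records, columns=["PXD", "RAW", "results_path"])

# Hypothetical stand-in for find_file_path: pretends only H7-1 has results
def fake_lookup(pxd, raw_file):
    return f"/results/{pxd}/{raw_file}.ionbot.csv" if raw_file == "H7-1" else np.nan

table = collect_result_paths({"PXD000533": ["H7-1", "H10-1"]}, fake_lookup)
print(table)
```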


## Initializing ExperimentalDesign.tsv for each project separately

Manually annotated

In [None]:
# Must be stored in the RAW-file directory



## Parsing the ionbot results for each sample

1 file per project

In [74]:
import re

def id_regex(string):
    # raw strings avoid invalid-escape warnings for the escaped parentheses
    uniprot_regex = re.compile(r'\(\([a-zA-Z0-9]+\)\)')
    alt_uniprot_regex = re.compile(r'\(\([a-zA-Z0-9\|_-]+\)\)')
    protID = uniprot_regex.search(string)
    
    if protID is None: 
        protID = alt_uniprot_regex.findall(string)
            
        if len(protID) != 2:
            logging.warning(f"{string} does not match regex.")
            return np.nan

        protID = protID[1]
            
        if protID[2:5] == "sp|":
            return protID[5:-2]
        else:
            return protID[2:-2]
        
    return protID.group()[2:-2]

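def ionbot2LFQ(file):
    '''Read an ionbot results file and return the confident, uniquely mapped target PSMs.

    Returns False if no usable PSMs remain or if the ionbot version is not supported.'''

    df = pd.read_csv(file, sep = ",")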

    if df.empty:
        logging.debug(f"{file.split('/')[-1]} is empty.")

    df = df.loc[df["best_psm"] == 1]
    df = df.loc[df["q_value"] <= .01]
    df = df.loc[df["DB"] == "T"]

    if df.empty:
        logging.debug(f"{file.split('/')[-1]} lowest q-value: {pd.read_csv(file, sep = ',').q_value.min()}")
        return False

    # Only supports certain ionbot versions
    versions = "IONBOT_v0.6.2 IONBOT_v0.6.3 IONBOT_v0.7.0 IONBOT_v0.8.0".split()
    
    version = file.split("/")[8]
    if version not in versions:
        logging.debug(f"{file} not supported by versions {versions}. Version given: {version}")
        return False

    if version in ["IONBOT_v0.6.2", "IONBOT_v0.6.3"]:

        
        df_validated = df[df['proteins'].astype(str).str.startswith('sp')]
        # remove peptides that are not uniquely identified and are linked to multiple proteins = containing || in proteins
        x = '||'
        # regex is False otherwise it also detects a single | which is in every protein present
        df_validated = df_validated[~df_validated['proteins'].str.contains(x, regex=False)]

        if df_validated.empty:
            logging.debug(f"{file.split('/')[-1]}: no proteins after excluding duplicates.")
            return False

        df_validated["proteins"] = df_validated.apply(lambda x: x["proteins"].split('|')[1], axis = 1)

    elif version in ["IONBOT_v0.7.0", "IONBOT_v0.8.0"]:
        x = "||"
        df_validated = df[~df['proteins'].str.contains(x, regex = False)]

        if df_validated.empty:
            logging.debug(f"{file.split('/')[-1]}: no proteins after excluding duplicates.")
            return False

        df_validated["proteins"] = df_validated.apply(lambda x: id_regex(x["proteins"]), axis = 1)
        df_validated = df_validated[df_validated["proteins"].notna()]

    if df_validated.empty:
        return False
    
    return df_validated
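
The `regex=False` argument in `ionbot2LFQ` matters because `|` is the alternation operator in a regular expression. A small sketch with two made-up protein strings shows the difference:

```python
import pandas as pd

# Hypothetical 'proteins' values: one unique hit, one shared between two proteins
proteins = pd.Series([
    "sp|P11111|AAA_HUMAN",
    "sp|P11111|AAA_HUMAN||sp|P22222|BBB_HUMAN",
])

# Literal match: only the shared peptide contains the two-character string '||'
literal = proteins.str.contains("||", regex=False)

# Regex match: '||' is an alternation of empty patterns, which matches every string
as_regex = proteins.str.contains("||", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

With the literal match only the second entry is flagged, so negating the mask keeps the uniquely mapped peptide. With the regex match both entries are flagged, and negating that mask would drop every row.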

In [66]:
file_path_df.results_path.reset_index(drop  = True)[0]

'/home/compomics/mounts/conode53/pride/PRIDE_DATA/PXD000533/IONBOT_v0.8.0/H7-1.mgf.gzip/H7-1.mgf.gzip.ionbot.csv'

In [75]:
ionbot2LFQ('/home/compomics/mounts/conode53/pride/PRIDE_DATA/PXD000533/IONBOT_v0.8.0/H7-1.mgf.gzip/H7-1.mgf.gzip.ionbot.csv')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,title,scan_id,psm_id,scan_psm_id,matched_peptide,modifications,DB,best_psm,charge,numpeaks,...,explained_x,explained_a2,explained_b2_h2o,explained_b2_nh3,explained_b2,explained_c2,explained_y2_h2o,explained_z2,explained_y2,explained_x2
0,H7-1: controllerType=0 controllerNumber=1 scan...,0_10000,2,0_10000_2,MATPGNIGSSVIASK,1|[766]Met-loss+Acetyl[M],T,1,2,224,...,67,42,0,0,0,0,22,147,526,48
1,H7-1: controllerType=0 controllerNumber=1 scan...,0_10001,6,0_10001_6,IIVAYVDDIDRR,,T,1,3,121,...,0,40,0,0,0,0,0,13,2388,0
2,H7-1: controllerType=0 controllerNumber=1 scan...,0_10002,4,0_10002_4,IIEEAIIR,,T,1,2,106,...,26,0,0,0,102,0,0,0,17,9
3,H7-1: controllerType=0 controllerNumber=1 scan...,0_10004,2,0_10004_2,DIAVPAAITPR,,T,1,2,138,...,0,71,0,31,33,32,0,0,209,105
4,H7-1: controllerType=0 controllerNumber=1 scan...,0_10008,1,0_10008_1,IGSSIYAIGTQDSTDICK,13|[4]Carbamidomethyl[S],T,1,2,147,...,0,0,0,0,0,70,0,0,22,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11466,H7-1: controllerType=0 controllerNumber=1 scan...,0_9994,1,0_9994_1,GDIAPIWK,,T,1,2,83,...,0,0,0,0,0,0,0,0,264,0
11467,H7-1: controllerType=0 controllerNumber=1 scan...,0_9995,2,0_9995_2,DIPIHACSYCGIHDPACVVYCNTSK,21|[4]carbamidomethyl[C]|17|[767]Menadione-HQ[C],T,1,3,449,...,2,21,37,19,165,128,61,38,479,0
11469,H7-1: controllerType=0 controllerNumber=1 scan...,0_9997,1,0_9997_1,EGNVPNIIIAGPPGTGK,,T,1,2,320,...,13,3,10,131,0,0,3,10,75,3
11470,H7-1: controllerType=0 controllerNumber=1 scan...,0_9998,1,0_9998_1,VTGESHIGGVIIK,0|[122]Formyl[N-TERM],T,1,2,232,...,0,10,22,0,0,13,127,0,264,0
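
The warning printed at the top of this output is pandas' SettingWithCopyWarning: `ionbot2LFQ` assigns to the `proteins` column of a frame that was produced by boolean filtering. One common way to avoid it, sketched here on a toy frame with made-up values, is to take an explicit `.copy()` after filtering and then assign through `.loc`.

```python
import pandas as pd

# Toy stand-in for an ionbot results table (values are made up)
df = pd.DataFrame({
    "proteins": ["sp|P11111|AAA_HUMAN", "sp|P11111|AAA_HUMAN||sp|P22222|BBB_HUMAN"],
    "q_value": [0.001, 0.002],
})

# Filter, then copy so the result owns its data
unique_hits = df[~df["proteins"].str.contains("||", regex=False)].copy()

# Assigning to the copy leaves the original frame untouched
unique_hits.loc[:, "proteins"] = unique_hits["proteins"].str.split("|").str[1]

print(unique_hits)
```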


In [None]:
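pxds = file_path_df.PXD.unique()

for pxd in pxds:
    # Skip RAW files for which no ionbot results file was found (NaN path)
    files = file_path_df[file_path_df.PXD == pxd].results_path.dropna()

    for file_path in files:
        # ionbot2LFQ reads and filters the file itself; it returns False when nothing usable remains
        ionbot_results = ionbot2LFQ(file_path)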