### Combining stripped peptides from *de novo* sequencing and PeaksDB search results Trocas 8 non-incubation samples

The dataset:

    30 samples: 4 (including x-channel, really 8 stations, 2 depths (50% and surface), 1 size fraction (GF75):
    
    Stations: 
    
        - Macapa South (MS) South stem, upriver (left, right, middle)
        - Macapa North (MN) North stem, upriver (left, right, middle)
        - Chaves (CV) South stem, downriver
        - Baylique (BY) North stem, downriver


    Proteomics samples from 1 trips to UWPR (June 2021 on the Fusion)
    A couple were injected twice.

Starting with:

    Peaks de novo results of PTM-optimized sequencing
    PeaksDB de novo-assisted results from PTM-optimized database searches
    
    Multiple samples per treatment

Goal:

    Txt files with combined de novo and PeaksDB for each sample
    
Using:

    - pandas
    - matplotlib
    - numpy

In [3]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [4]:
cd /home/millieginty/Documents/git-repos/amazon/data/Trocas8-notincs/

/home/millieginty/Documents/git-repos/amazon/data/Trocas8-notincs


### 1. Macapa S, 50% depth, size fraction 0.3-0.7 um (denoted as SMCP_50_GF75)
### T8 samples #s: 679, 687
### Exported NAAF and stripped peptides contained in the following directories:

    Trocas8-notincs/processed/PeaksDB/679_SMCP_50_GF75_PDB
    Trocas8-notincs/processed/PeaksDN/679_SMCP_50_GF75_DN

In [5]:
# for each of the 4 incubation samples:
# read in NAAF totals csvs made in PeaksDN (de novo), and PeaksDB (de novo assisted database search) notebooks
# bringing de novo peptides >50% ALC and PeaksDB peptides <1% FDR

peaks50_679 = pd.read_csv("processed/PeaksDN/679_SMCP_50_GF75_DN/679_SMCP_50_GF75_DN50_stripped_peptides.txt", header=None)
peaks50_687 = pd.read_csv("processed/PeaksDN/679_SMCP_50_GF75_DN/687_SMCP_50_GF75_DN50_stripped_peptides.txt", header=None)

peaksdb_679 = pd.read_csv("processed/PeaksDB/679_SMCP_50_GF75_PDB/679_SMCP_50_GF75_PDB_stripped_peptides.txt", header=None)
peaksdb_687 = pd.read_csv("processed/PeaksDB/679_SMCP_50_GF75_PDB/687_SMCP_50_GF75_PDB_stripped_peptides.txt", header=None)

frames = [peaks50_679, peaks50_687, peaksdb_679, peaksdb_687]
#index = [index]

# concatenate dataframes
tot_679 = pd.concat(frames)

# deduplicate
tot_679_nr = tot_679.drop_duplicates()

print('total 679 peptides, redundant', len(tot_679))
print('total 679 peptides, nonredundant', len(tot_679_nr))

tot_679.to_csv("processed/stripped_peptides/679_SMCP_50_GF75_stripped_peptides.txt", header=False, index=False)

tot_679.head()

total 679 peptides, redundant 279
total 679 peptides, nonredundant 202


Unnamed: 0,0
0,LSSPATLNSR
1,LSSPATLNSR
2,LSSPATLDSR
3,LSSPATLDSR
4,LSSPATLNSR


### 2. Macapa S, surface depth, size fraction 0.3-0.7 um (denoted as SMCP_surf_GF75)
### T8 samples #s:  680, 688
### Exported NAAF and stripped peptides contained in the following directories:

    Trocas8-notincs/processed/PeaksDB/680_SMCP_surf_GF75_PDB
    Trocas8-notincs/processed/PeaksDN/680_SMCP_surf_GF75_DN

In [6]:
# for each of the 4 incubation samples:
# read in NAAF totals csvs made in PeaksDN (de novo), and PeaksDB (de novo assisted database search) notebooks
# bringing de novo peptides >50% ALC and PeaksDB peptides <1% FDR

peaks50_680 = pd.read_csv("processed/PeaksDN/680_SMCP_surf_GF75_DN/680_SMCP_surf_GF75_DN50_stripped_peptides.txt", header=None)
peaks50_688 = pd.read_csv("processed/PeaksDN/680_SMCP_surf_GF75_DN/680_SMCP_surf_GF75_DN50_stripped_peptides.txt", header=None)
peaksdb_680 = pd.read_csv("processed/PeaksDB/680_SMCP_surf_GF75_PDB/688_SMCP_surf_GF75_PDB_stripped_peptides.txt", header=None)
peaksdb_688 = pd.read_csv("processed/PeaksDB/680_SMCP_surf_GF75_PDB/688_SMCP_surf_GF75_PDB_stripped_peptides.txt", header=None)

frames = [peaks50_680, peaks50_688, peaksdb_680, peaksdb_688]
#index = [index]

# concatenate dataframes
tot_680 = pd.concat(frames)

# deduplicate
tot_680_nr = tot_680.drop_duplicates()

print('total 680 peptides, redundant', len(tot_680))
print('total 680 peptides, nonredundant', len(tot_680_nr))

tot_680.to_csv("processed/stripped_peptides/680_SMCP_surf_GF75_stripped_peptides.txt", header=False, index=False)

tot_680.head()

total 680 peptides, redundant 240
total 680 peptides, nonredundant 99


Unnamed: 0,0
0,LSSPATLNSR
1,LSSPATLDSR
2,LSSPATLNSR
3,SPATLNSR
4,LSSPATLNSR
