### Combining stripped peptides from *de novo* sequencing and PeaksDB search results Trocas 8 non-incubation samples

The dataset:

    30 samples: 4 (including x-channel, really 8 stations, 2 depths (50% and surface), 1 size fraction (GF75):
    
    Stations: 
    
        - Macapa South (MS) South stem, upriver (left, right, middle)
        - Macapa North (MN) North stem, upriver (left, right, middle)
        - Chaves (CV) South stem, downriver
        - Baylique (BY) North stem, downriver


    Proteomics samples from 1 trips to UWPR (June 2021 on the Fusion)
    A couple were injected twice.

Starting with:

    Peaks de novo results of PTM-optimized sequencing
    PeaksDB de novo-assisted results from PTM-optimized database searches
    
    Multiple samples per treatment

Goal:

    Txt files with combined de novo and PeaksDB for each sample
    
Using:

    - pandas
    - matplotlib
    - numpy

In [3]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [4]:
cd /home/millieginty/Documents/git-repos/amazon/data/Trocas8-notincs/

/home/millieginty/Documents/git-repos/amazon/data/Trocas8-notincs


### 1. Macapa N, 50% depth, size fraction 0.3-0.7 um (denoted as NMCP_50_GF75)
### T8 samples #s: 681, 689
### Exported NAAF and stripped peptides contained in the following directories:

    Trocas8-notincs/processed/PeaksDB/681_NMCP_50_GF75_PDB
    Trocas8-notincs/processed/PeaksDN/681_NMCP_50_GF75_DN

In [5]:
# for each of the 4 incubation samples:
# read in NAAF totals csvs made in PeaksDN (de novo), and PeaksDB (de novo assisted database search) notebooks
# bringing de novo peptides >50% ALC and PeaksDB peptides <1% FDR

peaks50_681 = pd.read_csv("processed/PeaksDN/681_NMCP_50_GF75_DN/681_NMCP_50_GF75_DN50_stripped_peptides.txt", header=None)
peaks50_689 = pd.read_csv("processed/PeaksDN/681_NMCP_50_GF75_DN/689_NMCP_50_GF75_DN50_stripped_peptides.txt", header=None)

peaksdb_681 = pd.read_csv("processed/PeaksDB/681_NMCP_50_GF75_PDB/681_NMCP_50_GF75_PDB_stripped_peptides.txt", header=None)
peaksdb_689 = pd.read_csv("processed/PeaksDB/681_NMCP_50_GF75_PDB/689_NMCP_50_GF75_PDB_stripped_peptides.txt", header=None)

frames = [peaks50_681, peaks50_689, peaksdb_681, peaksdb_689]
#index = [index]

# concatenate dataframes
tot_681 = pd.concat(frames)

# deduplicate
tot_681_nr = tot_681.drop_duplicates()

print('total 681 peptides, redundant', len(tot_681))
print('total 681 peptides, nonredundant', len(tot_681_nr))

tot_681.to_csv("processed/stripped_peptides/681_NMCP_50_GF75_stripped_peptides.txt", header=False, index=False)

tot_681.head()

total 681 peptides, redundant 247
total 681 peptides, nonredundant 164


Unnamed: 0,0
0,SPATLNSR
1,LSSPATLNSR
2,SPATLNSR
3,LSSPATLNSR
4,LSSPATLDSR


### 2. Macapa N, surface depth, size fraction 0.3-0.7 um (denoted as NMCP_surf_GF75)
### T8 samples #s:  682, 690
### Exported NAAF and stripped peptides contained in the following directories:

    Trocas8-notincs/processed/PeaksDB/682_NMCP_surf_GF75_PDB
    Trocas8-notincs/processed/PeaksDN/682_NMCP_surf_GF75_DN

In [6]:
# for each of the 4 incubation samples:
# read in NAAF totals csvs made in PeaksDN (de novo), and PeaksDB (de novo assisted database search) notebooks
# bringing de novo peptides >50% ALC and PeaksDB peptides <1% FDR

peaks50_682 = pd.read_csv("processed/PeaksDN/682_NMCP_surf_GF75_DN/682_NMCP_surf_GF75_DN50_stripped_peptides.txt", header=None)
peaks50_690 = pd.read_csv("processed/PeaksDN/682_NMCP_surf_GF75_DN/682_NMCP_surf_GF75_DN50_stripped_peptides.txt", header=None)
peaksdb_682 = pd.read_csv("processed/PeaksDB/682_NMCP_surf_GF75_PDB/690_NMCP_surf_GF75_PDB_stripped_peptides.txt", header=None)
peaksdb_690 = pd.read_csv("processed/PeaksDB/682_NMCP_surf_GF75_PDB/690_NMCP_surf_GF75_PDB_stripped_peptides.txt", header=None)

frames = [peaks50_682, peaks50_690, peaksdb_682, peaksdb_690]
#index = [index]

# concatenate dataframes
tot_682 = pd.concat(frames)

# deduplicate
tot_682_nr = tot_682.drop_duplicates()

print('total 682 peptides, redundant', len(tot_682))
print('total 682 peptides, nonredundant', len(tot_682_nr))

tot_682.to_csv("processed/stripped_peptides/682_NMCP_surf_GF75_stripped_peptides.txt", header=False, index=False)

tot_682.head()

total 682 peptides, redundant 204
total 682 peptides, nonredundant 86


Unnamed: 0,0
0,LSSPATLNSR
1,LSSPATLNSR
2,LSSPATLNSR
3,YLYELAR
4,VATVSPLR
