### Combining stripped peptides from *de novo* sequencing and PeaksDB search results Trocas 8 non-incubation samples

The dataset:

    30 samples: 4 (including x-channel, really 8 stations, 2 depths (50% and surface), 1 size fraction (GF75):
    
    Stations: 
    
        - Macapa South (MS) South stem, upriver (left, right, middle)
        - Macapa North (MN) North stem, upriver (left, right, middle)
        - Chaves (CV) South stem, downriver
        - Baylique (BY) North stem, downriver


    Proteomics samples from 1 trips to UWPR (June 2021 on the Fusion)
    A couple were injected twice.

Starting with:

    Peaks de novo results of PTM-optimized sequencing
    PeaksDB de novo-assisted results from PTM-optimized database searches
    
    Multiple samples per treatment

Goal:

    Txt files with combined de novo and PeaksDB for each sample
    
Using:

    - pandas
    - matplotlib
    - numpy

In [2]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

In [3]:
cd /home/millieginty/Documents/git-repos/amazon/data/Trocas8-notincs/

/home/millieginty/Documents/git-repos/amazon/data/Trocas8-notincs


### 1. Chaves, 50% depth, size fraction 0.3-0.7 um (denoted as CV_50_GF75)
### T8 samples #s: 677, 685
### Exported NAAF and stripped peptides contained in the following directories:

    Trocas8-notincs/processed/PeaksDB/677_CV_50_GF75_PDB
    Trocas8-notincs/processed/PeaksDN/677_CV_50_GF75_DN

In [4]:
# for each of the 4 incubation samples:
# read in NAAF totals csvs made in PeaksDN (de novo), and PeaksDB (de novo assisted database search) notebooks
# bringing de novo peptides >50% ALC and PeaksDB peptides <1% FDR

peaks50_677 = pd.read_csv("processed/PeaksDN/677_CV_50_GF75_DN/677_CV_50_GF75_DN50_stripped_peptides.txt", header=None)
peaks50_685 = pd.read_csv("processed/PeaksDN/677_CV_50_GF75_DN/685_CV_50_GF75_DN50_stripped_peptides.txt", header=None)

peaksdb_677 = pd.read_csv("processed/PeaksDB/677_CV_50_GF75_PDB/677_CV_50_GF75_PDB_stripped_peptides.txt", header=None)
peaksdb_685 = pd.read_csv("processed/PeaksDB/677_CV_50_GF75_PDB/685_CV_50_GF75_PDB_stripped_peptides.txt", header=None)

frames = [peaks50_677, peaks50_685, peaksdb_677, peaksdb_685]
#index = [index]

# concatenate dataframes
tot_677 = pd.concat(frames)

# deduplicate
tot_677_nr = tot_677.drop_duplicates()

print('total 677 peptides, redundant', len(tot_677))
print('total 677 peptides, nonredundant', len(tot_677_nr))

tot_677.to_csv("processed/stripped_peptides/677_CV_50_GF75_stripped_peptides.txt", header=False, index=False)

tot_677.head()

total 677 peptides, redundant 252
total 677 peptides, nonredundant 172


Unnamed: 0,0
0,YLYELAR
1,LSSPATLNSR
2,LSSPATLDSR
3,LSSPATLNSR
4,LATVLSPR


### 2. Chaves, surface depth, size fraction 0.3-0.7 um (denoted as CV_surf_GF75)
### T8 samples #s:  678, 686
### Exported NAAF and stripped peptides contained in the following directories:

    Trocas8-notincs/processed/PeaksDB/678_CV_surf_GF75_PDB
    Trocas8-notincs/processed/PeaksDN/678_CV_surf_GF75_DN

In [5]:
# for each of the 4 incubation samples:
# read in NAAF totals csvs made in PeaksDN (de novo), and PeaksDB (de novo assisted database search) notebooks
# bringing de novo peptides >50% ALC and PeaksDB peptides <1% FDR

peaks50_678 = pd.read_csv("processed/PeaksDN/678_CV_surf_GF75_DN/678_CV_surf_GF75_DN50_stripped_peptides.txt", header=None)
peaks50_686 = pd.read_csv("processed/PeaksDN/678_CV_surf_GF75_DN/678_CV_surf_GF75_DN50_stripped_peptides.txt", header=None)
peaksdb_678 = pd.read_csv("processed/PeaksDB/678_CV_surf_GF75_PDB/686_CV_surf_GF75_PDB_stripped_peptides.txt", header=None)
peaksdb_686 = pd.read_csv("processed/PeaksDB/678_CV_surf_GF75_PDB/686_CV_surf_GF75_PDB_stripped_peptides.txt", header=None)

frames = [peaks50_678, peaks50_686, peaksdb_678, peaksdb_686]
#index = [index]

# concatenate dataframes
tot_678 = pd.concat(frames)

# deduplicate
tot_678_nr = tot_678.drop_duplicates()

print('total 678 peptides, redundant', len(tot_678))
print('total 678 peptides, nonredundant', len(tot_678_nr))

tot_678.to_csv("processed/stripped_peptides/678_CV_surf_GF75_stripped_peptides.txt", header=False, index=False)

tot_678.head()

total 678 peptides, redundant 300
total 678 peptides, nonredundant 87


Unnamed: 0,0
0,LSSPATLNSR
1,LSSPATLNSR
2,LSSPATLNSR
3,LSSPATLNSR
4,LSSPATLNSR
