# Concatenate tables


The main purpose of this script is to concatenate several tables that share the same format into a single table.

First, let's check what we have:

In [1]:
import os
import pandas as pd


In [2]:
files = os.listdir('proteins')
print('The folder "proteins" has %d files' % len(files))
extensions = set()
for file in files:
    ext = file.split('.')[-1]
    extensions.add(ext)
    
print("The files have the following extensions:", ','.join(extensions))

The folder "proteins" has 632 files
The files have the following extensions: xls


So we have 632 files in .xls format. 
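
If the folder ever contained a mix of formats, a count per extension would be more informative than a plain set. The sketch below is one possible variant, not part of the original notebook: `count_extensions` is a hypothetical helper, and the file names in the example are made up for illustration.

```python
import os
from collections import Counter


def count_extensions(filenames):
    """Count how many file names carry each extension (lowercased, no dot)."""
    counts = Counter()
    for name in filenames:
        # splitext returns ('name', '.ext'); names without an extension give ''
        ext = os.path.splitext(name)[1].lstrip('.').lower()
        counts[ext] += 1
    return counts


# Hypothetical file names, for illustration only
example_counts = count_extensions(['gi_1.xls', 'gi_2.XLS', 'notes.txt', 'README'])
print(example_counts)
```

With the real data, `count_extensions(os.listdir('proteins'))` should report a single `xls` entry covering all 632 files.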

Despite the .xls extension, the files are plain-text files with tab-separated columns.
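
A quick way to tell a real binary Excel file from a text file with an `.xls` name is to look at the first bytes: legacy binary `.xls` files are OLE2 compound documents and start with a fixed 8-byte signature. The sketch below assumes that check is good enough for this folder; `is_binary_xls` is a hypothetical helper, and the demo runs on a throwaway temporary file rather than on the real protein tables.

```python
import os
import tempfile

# Signature that opens every OLE2 compound document (legacy binary .xls)
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'


def is_binary_xls(path):
    """Return True if the file starts with the OLE2 signature."""
    with open(path, 'rb') as handle:
        return handle.read(len(OLE2_MAGIC)) == OLE2_MAGIC


# Demo on a throwaway tab-separated file (a stand-in, not a real table)
with tempfile.NamedTemporaryFile('w', suffix='.xls', delete=False) as tmp:
    tmp.write('col_a\tcol_b\n1\t2\n')
text_file_is_binary = is_binary_xls(tmp.name)
os.remove(tmp.name)
print(text_file_is_binary)
```

If the description above is right, calling `is_binary_xls` on any file in `proteins` should return `False`, which is why `pd.read_table` is the right reader here and `pd.read_excel` is not.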

In [3]:
pd.read_table('proteins/gi_110758919.xls', sep='\t').head()

Unnamed: 0,Experiment name,Biological sample category,Biological sample name,MS/MS sample name,Protein group,Protein accession number,Protein name,Protein identification probability,Protein percentage of total spectra,Protein molecular weight (AMU),...,Mascot Ion score,Mascot Identity score,Mascot Delta Ion score,Modifications identified by spectrum,Actual peptide mass (AMU),Spectrum charge,Actual minus calculated peptide mass (AMU),Actual minus calculated peptide mass (PPM),Peptide start index,Peptide stop index
0,TUB Arachinidar e pep go,Shotgun Proteomics,TUB,Mudpit_DATA.TXT (F003448 TUB NCBInr),PREDICTED: constitutive coactivator of PPAR-ga...,gi|110758919,PREDICTED: constitutive coactivator of PPAR-ga...,100.0%,0.00131%,109122.2,...,12.5,36.7,0.0,Carbamidomethyl (+57),2035.73,3,-0.13,-64,63,79
1,TUB Arachinidar e pep go,Shotgun Proteomics,TUB,Mudpit_DATA.TXT (F003448 TUB NCBInr),PREDICTED: constitutive coactivator of PPAR-ga...,gi|110758919,PREDICTED: constitutive coactivator of PPAR-ga...,100.0%,0.00131%,109122.2,...,13.0,36.8,-3.02,Oxidation (+16),996.53,2,-1000.0,-1000000,458,474
2,TUB Arachinidar e pep go,Shotgun Proteomics,TUB,Mudpit_DATA.TXT (F003448 TUB NCBInr),PREDICTED: constitutive coactivator of PPAR-ga...,gi|110758919,PREDICTED: constitutive coactivator of PPAR-ga...,100.0%,0.00131%,109122.2,...,12.5,36.1,-0.71,Oxidation (+16),3382.33,4,850.0,250000,458,479


Time to concatenate all the files into one to make future analysis easier. To keep track of where each row came from, we add a new column (file_name) holding the name of the source file.

In [4]:
def parse_protein_table(fname):
    df = pd.read_table('proteins/{}'.format(fname))
    df['file_name'] = fname
    return df
    

In [5]:
%%time
df = parse_protein_table(files[0])

for f in files[1:]:
    df = pd.concat([df, parse_protein_table(f)])

CPU times: user 5.64 s, sys: 28 ms, total: 5.66 s
Wall time: 5.66 s
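
Calling `pd.concat` inside the loop copies the growing DataFrame on every iteration. A common alternative is to collect the tables in a list and concatenate once. The sketch below shows that variant under a few assumptions: `concat_protein_tables` is a hypothetical name, `ignore_index=True` renumbers the rows (the loop above keeps each file's own index), and the demo uses two tiny temporary files instead of the real data. It was not timed against the original.

```python
import os
import tempfile

import pandas as pd


def concat_protein_tables(folder, filenames):
    """Read every tab-separated table in `folder` and stack them in one call."""
    frames = []
    for fname in filenames:
        frame = pd.read_csv(os.path.join(folder, fname), sep='\t')
        frame['file_name'] = fname  # remember which file each row came from
        frames.append(frame)
    # One concat at the end avoids re-copying the accumulated rows each time
    return pd.concat(frames, ignore_index=True)


# Demo with two made-up tables written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    for name, value in [('a.xls', 1), ('b.xls', 2)]:
        with open(os.path.join(folder, name), 'w') as handle:
            handle.write('x\ty\n{0}\t{0}\n'.format(value))
    combined = concat_protein_tables(folder, ['a.xls', 'b.xls'])
print(combined.shape)
```

On the real data the call would be `concat_protein_tables('proteins', files)`.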


## Thanks, Python! <3

Wow! In under ten seconds we solved a problem that would have been a very boring task to do by hand.

In [6]:
df.shape

(2214, 29)

In [7]:
df.to_csv('all_proteins.tsv', sep='\t', index=False)