FTP Site

The Pfam FTP site is organised into the directories described below.

The most important directory is probably the current_release directory. It contains the flat-files for the current release.

AntiFam

The AntiFam directory contains the different releases of the AntiFam database, which is used to identify spurious proteins.

RoseTTAfold_aln

The RoseTTAfold_aln directory contains the Pfam alignments used by RoseTTAfold to predict structural models.

Tools

The Tools directory contains code for running pfam_scan.pl.

The README file in this directory contains detailed information on how to install and run the script. Note that we have gone for a modular design for the script, so that its functionality can be easily incorporated into other Perl scripts. The ChangeLog file lists the versions of pfam_scan.pl (and its modules) and the changes made up to the current version.

There is also an archived version of pfam_scan.pl that works with HMMER2. This is no longer supported.

The ActSitePred directory contains Perl code for predicting active sites. This functionality has been rolled into the latest version of pfam_scan.pl.
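
As a rough illustration of calling the script from another program, the sketch below assembles a pfam_scan.pl command line in Python. The `-fasta`, `-dir`, and `-outfile` options are quoted from memory of the script's README and should be checked against the copy in the Tools directory. The file paths and the helper names `build_pfam_scan_command` and `run_pfam_scan` are hypothetical.

```python
import shlex
import subprocess


def build_pfam_scan_command(fasta_path, pfam_dir, outfile=None, script="pfam_scan.pl"):
    """Return the argument list for one pfam_scan.pl run (options assumed from the README)."""
    command = [script, "-fasta", fasta_path, "-dir", pfam_dir]
    if outfile is not None:
        command += ["-outfile", outfile]
    return command


def run_pfam_scan(fasta_path, pfam_dir, outfile=None):
    """Run the script; check=True raises CalledProcessError on a non-zero exit status."""
    command = build_pfam_scan_command(fasta_path, pfam_dir, outfile)
    return subprocess.run(command, check=True, capture_output=True, text=True)


# Only build and display the command here; nothing is executed.
command = build_pfam_scan_command("proteins.fasta", "pfam_data", outfile="hits.txt")
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces. The directory given to `-dir` needs to be prepared as the README describes before `run_pfam_scan` will succeed.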

current_release

This directory contains the flat-files for the current release. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection. The files, most of which are compressed using gzip, are:

Pfam-A.dead.gz
Listing of families that have been deleted from the database
Pfam-A.fasta.gz
A 90% non-redundant set of FASTA-formatted sequences for each Pfam-A family. The sequences are only the regions hit by the model, not full-length protein sequences.
Pfam-A.full.gz
The full alignments of the curated families, searched against pfamseq/UniProtKB reference proteomes (prior to Pfam 29.0, this file contained matches against the whole of UniProtKB).
Pfam-A.full.uniprot.gz
The full alignments of the curated families, searched against UniProtKB.
Pfam-A.full.metagenomics.gz
The full alignments of the curated families, searched against Metagenomic proteins.
Pfam-A.full.ncbi.gz
The full alignments of the curated families, searched against NCBI GenPept proteins.
Pfam-A.hmm.dat.gz
A data file that contains information about each Pfam-A family
Pfam-A.hmm.gz
The Pfam HMM library for Pfam-A families
Pfam-A.seed.gz
The SEED alignments of the curated families. Please note that from Pfam 36.0 onwards we do not process PDB data. Hence secondary structure annotations aren't available in the SEED alignments anymore. However, PDBe provides mappings to Pfam which might be of interest.
Pfam-C.gz
A file that contains the information about clans and the Pfam-A membership
active_site.dat.gz
Tar-ball of data required for the predictions of active sites by Pfam scan.
database.tar
A tar-ball of the database_files directory.
database_files
Directory containing two files per table from the MySQL database. The .sql.gz file contains the table structure; the .txt.gz file contains the content of the table as a tab-delimited file with each field enclosed by a single quote (').
diff.gz
Stores the change status of entries between this release and the previous one.
md5_checksums
A file containing the MD5 checksum for each release file
metaseq.gz
Metagenomic sequence database used in this release
ncbi.gz
NCBI GenPept sequence database used in this release.
pdbmap.gz
Mapping between PDB structures and Pfam domains.
pfamseq.gz
A fasta version of Pfam's underlying sequence database
relnotes.txt
Release notes
swisspfam.gz
ASCII representation of the domain structure of UniProt proteins according to Pfam
uniprot_sprot.dat.gz
Data files from UniProt containing SwissProt annotations.
uniprot_trembl.dat.gz
Data files from UniProt containing TrEMBL annotations.
userman.txt
File containing information about the flatfile format
Pfam-A.regions.tsv.gz
A tab separated file containing UniProtKB reference proteome sequences and Pfam-A family information
Pfam-A.regions.uniprot.tsv.gz
A tab separated file containing UniProtKB sequences and Pfam-A family information
Pfam-A.clans.tsv.gz
A tab separated file containing Pfam-A family and clan information for all Pfam-A families
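
The quoting convention described for the database_files tables maps onto Python's `csv` module. The sketch below assumes a dump is plain tab-delimited text with single-quoted fields and no header row. The helper name `read_table_dump` and the sample rows are hypothetical, and any escaping of embedded quotes or NULL markers in the real dumps is not handled.

```python
import csv
import gzip
import io


def read_table_dump(path_or_file):
    """Yield rows from a gzipped, tab-delimited dump whose fields are single-quoted."""
    # gzip.open accepts either a file name or an already open binary file object.
    with gzip.open(path_or_file, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quotechar="'")
        for row in reader:
            yield row


# A made-up two-row table, compressed in memory instead of downloaded.
sample = "'ACC0001'\t'example_one'\t'Family'\n'ACC0002'\t'example_two'\t'Domain'\n"
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))
rows = list(read_table_dump(buffer))
print(rows)
```

The column names for a real table are not in the .txt.gz file, so they would have to be taken from the matching .sql.gz file.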

mappings

The mappings directory contains the mapping between PDB structures and Pfam entries.

papers

The papers directory contains each NAR database issue article describing Pfam. For a detailed description of the latest changes to Pfam, please consult (and cite) these papers.

releases

The releases directory contains all the flat files and database dumps (where appropriate) for all versions of Pfam to date. The files in more recent releases are the same as those described for the current release, but the contents of older releases differ.
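
Large downloads from these directories can be checked against the md5_checksums file listed under current_release. A minimal sketch, assuming that file follows the usual `md5sum` layout of one hex digest and one file name per line (an assumption, not confirmed here); `md5_of_file`, `parse_checksums`, and `verify` are hypothetical helper names, and the demo file is a local stand-in.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse '<digest> <name>' lines (assumed md5sum layout) into a name -> digest dict."""
    checksums = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name.
            checksums[name.lstrip("*")] = digest.lower()
    return checksums


def verify(path, checksums):
    """Return True only if the file is listed and its digest matches."""
    expected = checksums.get(os.path.basename(path))
    return expected is not None and md5_of_file(path) == expected


# Demonstrate with a local stand-in file rather than a real download.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "relnotes.txt")
    with open(demo_path, "wb") as handle:
        handle.write(b"stand-in release file\n")
    listing = parse_checksums(f"{md5_of_file(demo_path)}  relnotes.txt\n")
    ok = verify(demo_path, listing)
    bad = verify(demo_path, {"relnotes.txt": "0" * 32})
print(ok, bad)
```

Reading in chunks matters here because some release files are several hundred megabytes.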