# Collect protein PDBs

Collect additional protein structures from the PDB database.  
Various summaries of the data currently in the PDB archive are available on the [RCSB summaries page](https://www.rcsb.org/pages/general/summaries).
Download [`pdb_entry_type.txt`](ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_entry_type.txt), which lists every PDB entry ID together with its molecule type and structure determination method. Based on the entry ID, we will download the corresponding `*.pdb` files.
We are only interested in protein entries whose structure was determined by **EM**.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_entry_type.txt", header=None, names=["id", "acid", "structure_determination"], sep="\t")

In [3]:
print(df.shape)
df.head()

(168095, 3)


|   | id | acid | structure_determination |
|---|---|---|---|
| 0 | 100d | nuc | diffraction |
| 1 | 101d | nuc | diffraction |
| 2 | 101m | prot | diffraction |
| 3 | 102d | nuc | diffraction |
| 4 | 102l | prot | diffraction |


In [4]:
df.structure_determination.unique()

array(['diffraction', 'NMR', 'other', 'EM'], dtype=object)

In [5]:
df.acid.unique()

array(['nuc', 'prot', 'prot-nuc', 'other'], dtype=object)

In [6]:
df_EM = df[(df.structure_determination=='EM')&df.acid.isin(['prot', 'prot-nuc'])]
print(df_EM.shape)
df_EM.head()

(5455, 3)


|   | id | acid | structure_determination |
|---|---|---|---|
| 3804 | 1d3e | prot | EM |
| 3808 | 1d3i | prot | EM |
| 4221 | 1dgi | prot | EM |
| 4799 | 1dyl | prot | EM |
| 5383 | 1eg0 | prot-nuc | EM |


In [28]:
import random

In [30]:
# Draw 50 distinct row positions in the range [0, len(df_EM))
rand_proteins = random.sample(range(len(df_EM)), 50)

In [38]:
selected_proteins = list(df_EM.id.iloc[rand_proteins].values)
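
Because the selection is random, rerunning the notebook picks a different set of entries. Below is a minimal sketch of a reproducible alternative, assuming a fixed seed is acceptable. The helper name `sample_ids`, the seed value and the stand-in ID list are illustrative and not part of the original notebook.

```python
import random

def sample_ids(ids, k, seed=0):
    """Return k distinct IDs, chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    return rng.sample(list(ids), k)

# Small stand-in list; in the notebook this would be df_EM.id
example_ids = ["1d3e", "1d3i", "1dgi", "1dyl", "1eg0"]
picked = sample_ids(example_ids, 3)
print(picked)
```

Calling `sample_ids` again with the same seed returns the same IDs, so the downloaded set can be recreated later.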

In [34]:
for i in selected_proteins:
    get_ipython().system_raw(f'wget http://files.rcsb.org/download/{i}.pdb -O /mnt/scratch/students/PDB/{i}.pdb')
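
The `wget` loop above does not report downloads that fail, for instance when no file is available at the constructed URL. Below is a sketch of one way to surface such failures using only the Python standard library. The function names and the injectable `fetch` parameter are assumptions made for illustration. The URL pattern is the one used in the loop.

```python
import os
import urllib.error
import urllib.request

def pdb_url(pdb_id):
    """Build the download URL used in the loop above."""
    return f"http://files.rcsb.org/download/{pdb_id}.pdb"

def download_pdbs(pdb_ids, out_dir, fetch=urllib.request.urlretrieve):
    """Download each ID into out_dir and return the IDs that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for pdb_id in pdb_ids:
        target = os.path.join(out_dir, f"{pdb_id}.pdb")
        try:
            fetch(pdb_url(pdb_id), target)
        except (urllib.error.URLError, OSError):
            # Network errors and HTTP errors (e.g. 404) both end up here
            failed.append(pdb_id)
    return failed
```

In the notebook this could be called as `failed = download_pdbs(selected_proteins, "/mnt/scratch/students/PDB")`, after which `failed` lists the entries to inspect by hand.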

In [35]:
!ls /mnt/scratch/students/PDB

1d3e.pdb  1gw8.pdb  3j6g.pdb  6bf8.pdb	6h25.pdb  6qvu.pdb  6usf.pdb  7bte.pdb
1d3i.pdb  1hb5.pdb  3jam.pdb  6c21.pdb	6iok.pdb  6qwl.pdb  6vam.pdb  7btr.pdb
1dgi.pdb  2j37.pdb  3jb7.pdb  6caa.pdb	6iy7.pdb  6rr7.pdb  6vkt.pdb
1dyl.pdb  2om7.pdb  4v3a.pdb  6cet.pdb	6mk1.pdb  6snb.pdb  6vm2.pdb
1eg0.pdb  2r1g.pdb  5a9k.pdb  6dso.pdb	6muw.pdb  6t1y.pdb  6wut.pdb
1gr5.pdb  3iyk.pdb  5fwp.pdb  6ejf.pdb	6ofj.pdb  6th3.pdb  6y9x.pdb
1gru.pdb  3j0i.pdb  5owx.pdb  6fhl.pdb	6osj.pdb  6u1s.pdb  6ysu.pdb
1gw7.pdb  3j1u.pdb  6asx.pdb  6g8z.pdb	6qm5.pdb  6up6.pdb  6zhd.pdb


# EMAN2 script for PDB to MRC conversion

Installation instructions available [here](https://blake.bcm.edu/emanwiki/EMAN2/Install/BinaryInstallAnaconda/2.31).  
Download available [here](https://cryoem.bcm.edu/cryoem/downloads/view_eman2_versions).  
Usage instructions for the `pdb2mrc` command are available [here](https://blake.bcm.edu/emanwiki/PdbToMrc).

In [36]:
for i in selected_proteins:
    get_ipython().system_raw(f'export PATH="/mnt/scratch/students/EMAN2/bin:$PATH";/mnt/scratch/students/EMAN2/bin/e2pdb2mrc.py /mnt/scratch/students/PDB/{i}.pdb /mnt/scratch/students/MRC/{i}.mrc res=5')
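
The loop above does not check whether each conversion succeeded. Below is a sketch that runs the same command through `subprocess` and collects the IDs whose command returned a non-zero exit status. The paths and the `res=5` argument are copied unchanged from the cell above. The function names and the injectable `runner` parameter are hypothetical.

```python
import os
import subprocess

# Paths taken from the cell above
EMAN2_BIN = "/mnt/scratch/students/EMAN2/bin"
PDB_DIR = "/mnt/scratch/students/PDB"
MRC_DIR = "/mnt/scratch/students/MRC"

def convert_cmd(pdb_id):
    """Argument list equivalent to the shell command in the cell above."""
    return [
        f"{EMAN2_BIN}/e2pdb2mrc.py",
        f"{PDB_DIR}/{pdb_id}.pdb",
        f"{MRC_DIR}/{pdb_id}.mrc",
        "res=5",
    ]

def convert_all(pdb_ids, runner=subprocess.run):
    """Run the conversion for each ID and return the IDs whose command failed."""
    # Prepend the EMAN2 bin directory to PATH, as the original export does
    env = dict(os.environ, PATH=f"{EMAN2_BIN}:{os.environ.get('PATH', '')}")
    failed = []
    for pdb_id in pdb_ids:
        result = runner(convert_cmd(pdb_id), env=env)
        if result.returncode != 0:
            failed.append(pdb_id)
    return failed
```

Calling `convert_all(selected_proteins)` would then return the entries that need a closer look.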

In [37]:
!ls /mnt/scratch/students/MRC

1d3e.mrc  1gw8.mrc  3j6g.mrc  6bf8.mrc	6h25.mrc  6qvu.mrc  6usf.mrc  7btr.mrc
1d3i.mrc  1hb5.mrc  3jam.mrc  6c21.mrc	6iok.mrc  6qwl.mrc  6vam.mrc
1dgi.mrc  2j37.mrc  3jb7.mrc  6caa.mrc	6iy7.mrc  6rr7.mrc  6vkt.mrc
1dyl.mrc  2om7.mrc  4v3a.mrc  6cet.mrc	6mk1.mrc  6snb.mrc  6vm2.mrc
1eg0.mrc  2r1g.mrc  5a9k.mrc  6dso.mrc	6muw.mrc  6t1y.mrc  6wut.mrc
1gr5.mrc  3iyk.mrc  5fwp.mrc  6ejf.mrc	6ofj.mrc  6th3.mrc  6y9x.mrc
1gru.mrc  3j0i.mrc  5owx.mrc  6fhl.mrc	6osj.mrc  6u1s.mrc  6zhd.mrc
1gw7.mrc  3j1u.mrc  6asx.mrc  6g8z.mrc	6qm5.mrc  6up6.mrc  7bte.mrc
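
Comparing the two directory listings, `6ysu.pdb` has no matching `.mrc` file. Below is a small sketch for finding such gaps automatically. `missing_conversions` is an illustrative helper and not part of EMAN2.

```python
from pathlib import Path

def missing_conversions(pdb_dir, mrc_dir):
    """Return the sorted IDs that have a .pdb file but no .mrc file."""
    pdb_ids = {p.stem for p in Path(pdb_dir).glob("*.pdb")}
    mrc_ids = {p.stem for p in Path(mrc_dir).glob("*.mrc")}
    return sorted(pdb_ids - mrc_ids)
```

For the directories used here, the call would be `missing_conversions("/mnt/scratch/students/PDB", "/mnt/scratch/students/MRC")`.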
