# Collect protein PDBs

Collect additional protein structures from the PDB database.  
Various summaries of the current data in the PDB archive are available on the [RCSB summaries page](https://www.rcsb.org/pages/general/summaries).
Download [`pdb_entry_type.txt`](ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_entry_type.txt), which lists every PDB entry ID together with its molecule type and structure determination method. Based on the entry ID, we will download the corresponding protein `*.pdb` files.
We are only interested in proteins whose structure was determined by electron microscopy (**EM**).

In [1]:
import pandas as pd
import random
import pathlib

In [2]:
df = pd.read_csv("ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_entry_type.txt", header=None, names=["id", "acid", "structure_determination"], sep="\t")

In [3]:
print(df.shape)
df.head()

(170383, 3)


| | id | acid | structure_determination |
|---|---|---|---|
| 0 | 100d | nuc | diffraction |
| 1 | 101d | nuc | diffraction |
| 2 | 101m | prot | diffraction |
| 3 | 102d | nuc | diffraction |
| 4 | 102l | prot | diffraction |


In [4]:
df.structure_determination.unique()

array(['diffraction', 'NMR', 'other', 'EM'], dtype=object)

In [5]:
df.acid.unique()

array(['nuc', 'prot', 'prot-nuc', 'other'], dtype=object)

In [6]:
df_EM = df[(df.structure_determination=='EM')&df.acid.isin(['prot', 'prot-nuc'])]
print(df_EM.shape)
df_EM.head()

(5995, 3)


| | id | acid | structure_determination |
|---|---|---|---|
| 3804 | 1d3e | prot | EM |
| 3808 | 1d3i | prot | EM |
| 4221 | 1dgi | prot | EM |
| 4799 | 1dyl | prot | EM |
| 5383 | 1eg0 | prot-nuc | EM |


In [7]:
# Number of distinct IDs; equal to the row count, so there are no duplicates
df_EM.id.nunique()

5995
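
The filtering step above can also be sketched without pandas, which may help when checking the file format by hand. This is a minimal illustration, not the notebook's method: the helper name `em_protein_ids` is hypothetical, and the sample rows are copied from the outputs shown above.

```python
import csv
import io

# Sample rows copied from the outputs above; the real file has one such
# tab-separated line (id, molecule type, method) per PDB entry.
SAMPLE = (
    "100d\tnuc\tdiffraction\n"
    "101m\tprot\tdiffraction\n"
    "1d3e\tprot\tEM\n"
    "1eg0\tprot-nuc\tEM\n"
)


def em_protein_ids(text):
    """Return the IDs of entries solved by EM that contain protein."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [
        entry_id
        for entry_id, acid, method in reader
        if method == "EM" and acid in ("prot", "prot-nuc")
    ]


print(em_protein_ids(SAMPLE))  # ['1d3e', '1eg0']
```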

Collect proteins with the following properties:

- Asymmetric (C1) symmetry
- EM experimental method

Search query used can be seen [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22parameters%22%3A%7B%22value%22%3A%22Asymmetric%20-%20C1%22%7D%2C%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22or%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22exptl.method%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22ELECTRON%20MICROSCOPY%22%7D%2C%22node_id%22%3A1%7D%5D%7D%2C%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22or%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_struct_symmetry.type%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22Asymmetric%22%7D%2C%22node_id%22%3A2%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_struct_symmetry.kind%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22Global%20Symmetry%22%7D%2C%22node_id%22%3A3%7D%5D%7D%5D%7D%5D%2C%22label%22%3A%22refinements%22%7D%5D%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22b9a06adb84a9ffb88c6bb8186b90b9f9%22%7D%7D).

In [8]:
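# Alternative: pick 20 random EM entries. random.sample avoids duplicates and
# the out-of-range index that random.randint(0, len(df_EM)) could produce.
# rand_proteins = random.sample(range(len(df_EM)), 20)
# selected_proteins = list(df_EM.id.iloc[rand_proteins].values)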

selected_proteins = ['4usn', 
                     '5nvu',
                     '5nvs',
                     '6mem',
                     '6o1o',
                     '6ran',
                     '6ram',
                     '5j0n']
# Every selected ID must be unique and present among the EM entries
assert len(selected_proteins) == len(set(df_EM.id) & set(selected_proteins)), "All selected proteins must use the EM experimental method"
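
If a random selection is preferred over the hand-picked list, a seeded generator keeps the choice reproducible between runs. The sketch below is an assumption-laden illustration: `sample_ids` and its `seed` parameter are hypothetical names, and the short ID list stands in for `df_EM.id`.

```python
import random


def sample_ids(ids, k, seed=0):
    """Pick k distinct IDs; the same seed always yields the same pick."""
    rng = random.Random(seed)
    # Sorting the de-duplicated IDs makes the result independent of input order.
    return rng.sample(sorted(set(ids)), k)


# Stand-in for df_EM.id, using IDs from the output above.
em_ids = ["1d3e", "1d3i", "1dgi", "1dyl", "1eg0"]
picked = sample_ids(em_ids, 3, seed=42)
print(picked)
```

Using a private `random.Random` instance rather than the module-level functions leaves the global random state untouched.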

In [9]:
PDB_DIR = "/home/jelena/PDB"  #"/mnt/scratch/students/PDB"
pathlib.Path(PDB_DIR).mkdir(parents=True, exist_ok=True)

In [10]:
# Download each selected entry in PDB format from RCSB
for i in selected_proteins:
    get_ipython().system_raw(f'wget https://files.rcsb.org/download/{i}.pdb -O {PDB_DIR}/{i}.pdb')
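
Because `wget -O` creates the output file even when the request fails, a failed download can leave an empty `.pdb` file behind. A pure-Python alternative that writes the file only after a successful response might look like the sketch below. The helper names `pdb_url` and `download_pdb` are hypothetical, the use of `https` is an assumption, and the download loop is left commented out so the block does not touch the network.

```python
import pathlib
import urllib.request

BASE_URL = "https://files.rcsb.org/download"  # same host as the wget call


def pdb_url(pdb_id):
    """Build the download URL for one entry in PDB format."""
    return f"{BASE_URL}/{pdb_id}.pdb"


def download_pdb(pdb_id, dest_dir):
    """Download one entry; return the file path, or None on failure."""
    dest = pathlib.Path(dest_dir) / f"{pdb_id}.pdb"
    try:
        with urllib.request.urlopen(pdb_url(pdb_id), timeout=60) as response:
            data = response.read()
    except OSError as err:  # covers URLError, HTTPError and timeouts
        print(f"{pdb_id}: download failed ({err})")
        return None
    dest.write_bytes(data)  # written only after a successful download
    return dest


# Usage inside the notebook:
# for pdb_id in selected_proteins:
#     download_pdb(pdb_id, PDB_DIR)
```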

In [11]:
get_ipython().getoutput(f"ls {PDB_DIR}", split=True)

['2cse.pdb',
 '4bed.pdb',
 '4usn.pdb',
 '4v71.pdb',
 '5iou.pdb',
 '5j0n.pdb',
 '5nvs.pdb',
 '5nvu.pdb',
 '5o5b.pdb',
 '5t4p.pdb',
 '5zlu.pdb',
 '6buz.pdb',
 '6lz1.pdb',
 '6mem.pdb',
 '6n8m.pdb',
 '6o1o.pdb',
 '6psf.pdb',
 '6qee.pdb',
 '6ram.pdb',
 '6ran.pdb',
 '6rd5.pdb',
 '6re5.pdb',
 '6sjl.pdb',
 '6vkn.pdb',
 '6w4o.pdb',
 '6wbk.pdb',
 '6xe0.pdb',
 '7c79.pdb']
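
The directory listing shows which files exist, but not whether every selected entry was downloaded with non-empty content. A small check along these lines could make that explicit. The function name `missing_or_empty` is hypothetical, and the demo runs in a temporary directory rather than in `PDB_DIR`.

```python
import pathlib
import tempfile


def missing_or_empty(ids, directory, suffix):
    """Return the IDs whose file is absent or has zero size."""
    directory = pathlib.Path(directory)
    problems = []
    for entry_id in ids:
        path = directory / f"{entry_id}{suffix}"
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(entry_id)
    return problems


# Demo: one valid file, one empty file, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "4usn.pdb").write_text("HEADER\n")
    (pathlib.Path(tmp) / "5nvu.pdb").write_text("")
    print(missing_or_empty(["4usn", "5nvu", "6mem"], tmp, ".pdb"))
    # ['5nvu', '6mem']
```

In the notebook, the same function could be called as `missing_or_empty(selected_proteins, PDB_DIR, ".pdb")`, and later with `MRC_DIR` and `".mrc"`.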

# EMAN2 script for PDB to MRC conversion

Installation instructions are available [here](https://blake.bcm.edu/emanwiki/EMAN2/Install/BinaryInstallAnaconda/2.31).  
Downloads are available [here](https://cryoem.bcm.edu/cryoem/downloads/view_eman2_versions).  
Usage instructions for the `pdb2mrc` command are available [here](https://blake.bcm.edu/emanwiki/PdbToMrc).

The optional parameters of `e2pdb2mrc` are documented [here](https://blake.bcm.edu/eman2/EMAN2.html/node100.html).

In [32]:
MRC_DIR = "/home/jelena/MRC"  #"/mnt/scratch/students/MRC"
pathlib.Path(MRC_DIR).mkdir(parents=True, exist_ok=True)

In [33]:
EMAN2 = "/home/jelena/EMAN2"

In [34]:
for i in selected_proteins:
    get_ipython().system_raw(f'export PATH="{EMAN2}/bin:$PATH";{EMAN2}/bin/e2pdb2mrc.py -R 8 -A 2 {PDB_DIR}/{i}.pdb {MRC_DIR}/{i}.mrc')
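
The conversion call can also be expressed with `subprocess` and an argument list, which avoids shell quoting issues and raises an error when the script fails. This is a sketch under stated assumptions: the function names and the `res` and `apix` parameter names are hypothetical, and reading `-R` as the target resolution in Å and `-A` as the sampling in Å per pixel is an assumption that should be checked against `e2pdb2mrc.py --help`.

```python
import os
import subprocess


def e2pdb2mrc_command(eman2_dir, pdb_path, mrc_path, res=8, apix=2):
    """Build the same argument list that the shell call above uses."""
    return [
        f"{eman2_dir}/bin/e2pdb2mrc.py",
        "-R", str(res),   # assumed: resolution in Angstrom
        "-A", str(apix),  # assumed: Angstrom per pixel
        str(pdb_path),
        str(mrc_path),
    ]


def run_conversion(eman2_dir, pdb_path, mrc_path):
    """Run the conversion with EMAN2's bin directory first on PATH."""
    env = dict(os.environ)
    env["PATH"] = f"{eman2_dir}/bin{os.pathsep}{env.get('PATH', '')}"
    subprocess.run(
        e2pdb2mrc_command(eman2_dir, pdb_path, mrc_path), check=True, env=env
    )


cmd = e2pdb2mrc_command(
    "/home/jelena/EMAN2", "/home/jelena/PDB/4usn.pdb", "/home/jelena/MRC/4usn.mrc"
)
print(" ".join(cmd))
```

Only the command is built and printed here; `run_conversion` is defined but not called, since it requires a local EMAN2 installation.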

In [35]:
get_ipython().getoutput(f"ls {MRC_DIR}", split=True)

['4usn.mrc',
 '5j0n.mrc',
 '5nvs.mrc',
 '5nvu.mrc',
 '6mem.mrc',
 '6o1o.mrc',
 '6ram.mrc',
 '6ran.mrc']