# Arpeggio

#### Let's use Arpeggio with 1dod

1dod structure:

| MOLECULE_CHEMBL_ID | MOLECULE_PDB_ID | STRUCTURE_ID |
|--------------------|-----------------|--------------|
| CHEMBL328910 | DOB | [1dod, 1doe, 1pbb] |


#### Arpeggio is a command line tool.
The imports are `os`, `timeit`, PySpark and the structure viewer `nglview`.

In [1]:
import os
import timeit
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
import nglview as nv
from pyspark.sql.functions import posexplode



#### Let's visualise the complex

In [2]:
struct_of_interest = '1DOD'

In [3]:
ngl_viewer = nv.show_pdbid(struct_of_interest)
# Add the ligands
ngl_viewer.add_representation(repr_type="ball+stick", selection="hetero and not water")
# Center view on binding site
ngl_viewer.center("ligand")

In [4]:
ngl_viewer

NGLWidget()

#### Investigate the 1dod-DOB complex with Arpeggio

Open a Spark session

In [5]:
spark = SparkSession.builder.getOrCreate()

2022-02-15 16:05:28 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


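Download the gzipped mmCIF file of the structure and decompress it to `1dod.cif`

In [6]:
# Download the archive, then decompress it: Arpeggio is run on 1dod.cif below
dl_unzip_cif = 'curl http://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/mmCIF/do/1dod.cif.gz -o 1dod.cif.gz && gunzip -f 1dod.cif.gz'
os.system(dl_unzip_cif)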

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 93068  100 93068    0     0  2273k      0 --:--:-- --:--:-- --:--:-- 2754k


0
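
The same download step can also be sketched in pure Python, without shelling out to `curl`. This is a minimal sketch, not the notebook's method: the helper names `download` and `gunzip` are hypothetical, and the URL is the one from the command above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# URL copied from the curl command above.
CIF_GZ_URL = (
    "http://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/"
    "divided/mmCIF/do/1dod.cif.gz"
)


def download(url: str, dest: Path) -> Path:
    """Stream the content at `url` into `dest` (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def gunzip(src: Path, dest: Path) -> Path:
    """Decompress the gzip archive `src` into `dest` (hypothetical helper)."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest


# Usage (requires network access):
# archive = download(CIF_GZ_URL, Path("1dod.cif.gz"))
# gunzip(archive, Path("1dod.cif"))
```

Keeping the two steps separate makes it possible to reuse `gunzip` on an archive that is already on disk.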

Run Arpeggio from the command line.
Reminder: we investigate the 1dod structure and its interaction with the DOB molecule.

| MOLECULE_CHEMBL_ID | MOLECULE_PDB_ID | STRUCTURE_ID |
|--------------------|-----------------|--------------|
| CHEMBL328910 | DOB | [1dod, 1doe, 1pbb] |

**-s** option specifies the selection: the chain (A) and the residue number of the ligand (396), which corresponds to the DOB molecule

**-o** option specifies the output directory (here 'arpeggio_result', the folder in which the `1dod.json` result file will be written)
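
The two options above can be assembled into an argument list rather than a shell string, which avoids quoting issues and lets a failed run raise an error. This is a sketch under assumptions: `build_arpeggio_cmd` and `run_arpeggio_cmd` are hypothetical helpers, and only the `-s` and `-o` flags used in this notebook are relied on.

```python
import subprocess
from typing import List, Optional


def build_arpeggio_cmd(cif_path: str, out_dir: str,
                       selection: Optional[str] = None) -> List[str]:
    """Build the argument list for an Arpeggio run (hypothetical helper).

    `selection` uses the /chain/residue/ form shown in this notebook,
    for example "/A/396/". When it is None, no -s option is passed.
    """
    cmd = ["arpeggio"]
    if selection is not None:
        cmd += ["-s", selection]
    cmd += ["-o", out_dir, cif_path]
    return cmd


def run_arpeggio_cmd(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


# Usage (requires the arpeggio executable on the PATH):
# run_arpeggio_cmd(build_arpeggio_cmd("1dod.cif", "arpeggio_result", "/A/396/"))
```

Unlike `os.system`, which only returns the exit status, `subprocess.run(..., check=True)` stops the notebook as soon as the command fails.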

In [7]:
# Start timer for the Arpeggio command line when the chain and residue are NOT specified
start_1 = timeit.default_timer()

In [8]:
run_arpeggio = 'arpeggio -o arpeggio_result 1dod.cif'
os.system(run_arpeggio)

INFO//16:05:30.608//Program begin.
DEBUG//16:05:30.667//Loaded PDB structure (BioPython)
DEBUG//16:05:30.769//Loaded MMCIF structure (OpenBabel)
DEBUG//16:05:30.777//Mapped OB to BioPython atoms and vice-versa.
DEBUG//16:05:30.927//Added hydrogens.
DEBUG//16:05:31.162//Determined atom explicit and implicit valences, bond orders, atomic numbers, formal charge and number of bound hydrogens.
DEBUG//16:05:31.185//Initialised SIFts.
DEBUG//16:05:31.188//Determined polypeptide residues, chain breaks, termini
DEBUG//16:05:31.395//Percieved and stored rings.
DEBUG//16:05:31.412//Perceived and stored amide groups.
DEBUG//16:05:31.416//Added hydrogens to BioPython atoms.
DEBUG//16:05:31.421//Added VdW radii.
DEBUG//16:05:31.425//Added covalent radii.
DEBUG//16:05:31.429//Completed NeighborSearch.
DEBUG//16:05:31.431//Assigned rings to residues.
DEBUG//16:05:31.434//Made selection.
DEBUG//16:05:31.574//Expanded to binding site.
DEBUG//16:05:31.576//Flagged selection rings.
DEBUG//16:05:31.580//Co

0

In [9]:
# Stop timer
stop_1 = timeit.default_timer()

In [10]:
# Start timer for the Arpeggio command line when the chain and residue ARE specified
start_2 = timeit.default_timer()

In [11]:
run_arpeggio_precised_res = 'arpeggio -s /A/396/ -o arpeggio_result 1dod.cif'
os.system(run_arpeggio_precised_res)

INFO//16:06:39.663//Program begin.
INFO//16:06:39.663//Selection perceived: ['/A/396/']
DEBUG//16:06:39.727//Loaded PDB structure (BioPython)
DEBUG//16:06:39.828//Loaded MMCIF structure (OpenBabel)
DEBUG//16:06:39.836//Mapped OB to BioPython atoms and vice-versa.
DEBUG//16:06:39.981//Added hydrogens.
DEBUG//16:06:40.223//Determined atom explicit and implicit valences, bond orders, atomic numbers, formal charge and number of bound hydrogens.
DEBUG//16:06:40.246//Initialised SIFts.
DEBUG//16:06:40.251//Determined polypeptide residues, chain breaks, termini
DEBUG//16:06:40.476//Percieved and stored rings.
DEBUG//16:06:40.493//Perceived and stored amide groups.
DEBUG//16:06:40.497//Added hydrogens to BioPython atoms.
DEBUG//16:06:40.502//Added VdW radii.
DEBUG//16:06:40.506//Added covalent radii.
DEBUG//16:06:40.510//Completed NeighborSearch.
DEBUG//16:06:40.512//Assigned rings to residues.
DEBUG//16:06:40.518//Made selection.
DEBUG//16:06:40.619//Expanded to binding site.
DEBUG//16:06:40.

0

In [12]:
# Stop timer
stop_2 = timeit.default_timer()

#### Get the data generated by Arpeggio

Creation of a PySpark DataFrame

In [13]:
# Read JSON file into dataframe
df = spark.read.json("arpeggio_result/1dod.json", multiLine=True)
df.printSchema()
df.show()
df.count()

                                                                                

root
 |-- bgn: struct (nullable = true)
 |    |-- auth_asym_id: string (nullable = true)
 |    |-- auth_atom_id: string (nullable = true)
 |    |-- auth_seq_id: long (nullable = true)
 |    |-- label_comp_id: string (nullable = true)
 |    |-- pdbx_PDB_ins_code: string (nullable = true)
 |-- contact: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- distance: double (nullable = true)
 |-- end: struct (nullable = true)
 |    |-- auth_asym_id: string (nullable = true)
 |    |-- auth_atom_id: string (nullable = true)
 |    |-- auth_seq_id: long (nullable = true)
 |    |-- label_comp_id: string (nullable = true)
 |    |-- pdbx_PDB_ins_code: string (nullable = true)
 |-- interacting_entities: string (nullable = true)
 |-- type: string (nullable = true)

+--------------------+--------------------+--------+--------------------+--------------------+---------+
|                 bgn|             contact|distance|                 end|interacting_entities|     type|
+---

561
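
For a quick check without Spark, the result file can also be read with the standard `json` module. The sketch below assumes the file holds a single JSON array of contact records, which is consistent with the `multiLine=True` option used above; `load_contacts` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List


def load_contacts(path: str) -> List[Dict[str, Any]]:
    """Load contact records from an Arpeggio JSON file (hypothetical helper).

    Assumes the file contains one JSON array whose elements are objects
    with the keys shown in the schema above (bgn, contact, distance, ...).
    """
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of contact records")
    return records


# Usage:
# contacts = load_contacts("arpeggio_result/1dod.json")
# print(len(contacts))
```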

#### Data manipulation with PySpark
Keep only the rows whose `end` atom belongs to the DOB ligand

In [14]:
df = df.filter(F.col('end.label_comp_id') == 'DOB')

#### According to the [Arpeggio doc](https://github.com/PDBeurope/arpeggio):

**Proximal** : 'Denotes if the atom is > the VdW interaction distance, but within 5 Angstroms of other atom(s).'

**Hydrophobic** : 'Denotes hydrophobic interaction.'

**Aromatic** : 'Denotes an aromatic ring atom interacting with another aromatic ring atom.'

**INTER** : 'Between an atom from the user's selection and a non-selected atom'

In [15]:
df.show()
df.count()

+--------------------+--------------------+--------+--------------------+--------------------+---------+
|                 bgn|             contact|distance|                 end|interacting_entities|     type|
+--------------------+--------------------+--------+--------------------+--------------------+---------+
|[A, CZ, 201, TYR,  ]|          [proximal]|    4.33|[A, C5, 396, DOB,  ]|               INTER|atom-atom|
| [A, O, 428, HOH,  ]|          [proximal]|    4.78|[A, O4, 396, DOB,  ]|     SELECTION_WATER|atom-atom|
|[A, CA, 296, ALA,  ]|          [proximal]|    3.98|[A, O4, 396, DOB,  ]|               INTER|atom-atom|
|[A, CA, 295, GLY,  ]|          [proximal]|    4.65|[A, O4, 396, DOB,  ]|               INTER|atom-atom|
| [A, N, 296, ALA,  ]|          [proximal]|    4.45|[A, C4, 396, DOB,  ]|               INTER|atom-atom|
| [A, N, 296, ALA,  ]|          [proximal]|    3.67|[A, O4, 396, DOB,  ]|               INTER|atom-atom|
|[A, CB, 296, ALA,  ]|          [proximal]|    3.82|[A,

94
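
The DOB filter used here inspects only the `end` side of each contact. Whether Arpeggio can also report the ligand on the `bgn` side has not been verified in this notebook; the plain-Python sketch below simply assumes it might and checks both sides. The names `involves_residue` and `filter_by_residue` are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Contact = Dict[str, Any]


def involves_residue(record: Contact, comp_id: str) -> bool:
    """Return True if either atom of the contact belongs to `comp_id`."""
    return any(
        record[side]["label_comp_id"] == comp_id for side in ("bgn", "end")
    )


def filter_by_residue(records: Iterable[Contact], comp_id: str) -> List[Contact]:
    """Keep the contact records that involve `comp_id` on either side."""
    return [record for record in records if involves_residue(record, comp_id)]
```

In PySpark, the same idea would combine the two column comparisons with `|`, for example `(F.col('bgn.label_comp_id') == 'DOB') | (F.col('end.label_comp_id') == 'DOB')`.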

Keep only the rows whose contact list includes a hydrophobic interaction

In [16]:
df = df.filter(F.array_contains(F.col('contact'), 'hydrophobic'))
df.show()
df.count()

+--------------------+--------------------+--------+--------------------+--------------------+---------+
|                 bgn|             contact|distance|                 end|interacting_entities|     type|
+--------------------+--------------------+--------+--------------------+--------------------+---------+
|[A, CB, 296, ALA,  ]|[proximal, hydrop...|    4.23|[A, C5, 396, DOB,  ]|               INTER|atom-atom|
|[A, CD2, 210, LEU...|[proximal, hydrop...|    4.02|[A, C3, 396, DOB,  ]|               INTER|atom-atom|
|[A, CB, 293, PRO,  ]|[proximal, hydrop...|    4.43|[A, C3, 396, DOB,  ]|               INTER|atom-atom|
|[A, CD2, 210, LEU...|[proximal, hydrop...|    3.95|[A, C5, 396, DOB,  ]|               INTER|atom-atom|
|[A, CB, 296, ALA,  ]|[proximal, hydrop...|    4.41|[A, C3, 396, DOB,  ]|               INTER|atom-atom|
|[A, CD2, 210, LEU...|[proximal, hydrop...|    4.34|[A, C1, 396, DOB,  ]|               INTER|atom-atom|
|[A, CE2, 222, TYR...|[proximal, aromat...|    3.95|[A,

7
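
Once the hydrophobic rows are isolated, it can be useful to summarise them per protein residue. The sketch below is one possible way to do it in plain Python, not part of the original analysis: `summarise_by_residue` is a hypothetical helper, and the sample rows are abridged from the table above (atom names omitted).

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

ResidueKey = Tuple[str, int, str]  # (chain, residue number, residue name)


def summarise_by_residue(
    records: Iterable[Dict[str, Any]], contact_type: str
) -> Dict[ResidueKey, Tuple[int, float]]:
    """Map each `bgn` residue to (number of contacts, shortest distance)
    for the records whose contact list contains `contact_type`."""
    distances: Dict[ResidueKey, List[float]] = defaultdict(list)
    for record in records:
        if contact_type in record["contact"]:
            bgn = record["bgn"]
            key = (bgn["auth_asym_id"], bgn["auth_seq_id"], bgn["label_comp_id"])
            distances[key].append(record["distance"])
    return {key: (len(values), min(values)) for key, values in distances.items()}


# Rows abridged from the table above.
SAMPLE_ROWS = [
    {"bgn": {"auth_asym_id": "A", "auth_seq_id": 296, "label_comp_id": "ALA"},
     "contact": ["proximal", "hydrophobic"], "distance": 4.23},
    {"bgn": {"auth_asym_id": "A", "auth_seq_id": 210, "label_comp_id": "LEU"},
     "contact": ["proximal", "hydrophobic"], "distance": 4.02},
    {"bgn": {"auth_asym_id": "A", "auth_seq_id": 296, "label_comp_id": "ALA"},
     "contact": ["proximal", "hydrophobic"], "distance": 4.41},
]
```

Calling `summarise_by_residue(SAMPLE_ROWS, "hydrophobic")` groups the two ALA 296 rows together and keeps LEU 210 as a separate entry.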

Explode the contact column into one row per contact type, keeping the position of each type in the list

In [17]:
df.select('*', posexplode('contact')).show()

+--------------------+--------------------+--------+--------------------+--------------------+---------+---+-----------+
|                 bgn|             contact|distance|                 end|interacting_entities|     type|pos|        col|
+--------------------+--------------------+--------+--------------------+--------------------+---------+---+-----------+
|[A, CB, 296, ALA,  ]|[proximal, hydrop...|    4.23|[A, C5, 396, DOB,  ]|               INTER|atom-atom|  0|   proximal|
|[A, CB, 296, ALA,  ]|[proximal, hydrop...|    4.23|[A, C5, 396, DOB,  ]|               INTER|atom-atom|  1|hydrophobic|
|[A, CD2, 210, LEU...|[proximal, hydrop...|    4.02|[A, C3, 396, DOB,  ]|               INTER|atom-atom|  0|   proximal|
|[A, CD2, 210, LEU...|[proximal, hydrop...|    4.02|[A, C3, 396, DOB,  ]|               INTER|atom-atom|  1|hydrophobic|
|[A, CB, 293, PRO,  ]|[proximal, hydrop...|    4.43|[A, C3, 396, DOB,  ]|               INTER|atom-atom|  0|   proximal|
|[A, CB, 293, PRO,  ]|[proximal,
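
For readers unfamiliar with `posexplode`, its row expansion can be mimicked in plain Python: each element of the array produces one output row carrying the element's position (`pos`) and its value (`col`). This is only an illustrative sketch of the semantics, with a hypothetical function name, and not how Spark implements it.

```python
from typing import Any, Dict, Iterable, Iterator


def posexplode_rows(
    rows: Iterable[Dict[str, Any]], column: str
) -> Iterator[Dict[str, Any]]:
    """Yield one row per element of `row[column]`, adding `pos` and `col`.

    Rows whose array is empty or None produce no output.
    """
    for row in rows:
        for pos, value in enumerate(row[column] or []):
            yield {**row, "pos": pos, "col": value}


# Usage:
# rows = [{"distance": 4.23, "contact": ["proximal", "hydrophobic"]}]
# for expanded in posexplode_rows(rows, "contact"):
#     print(expanded["pos"], expanded["col"])
```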

In [18]:
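# Display elapsed times (truncated to whole seconds)
time_no_precised_res = int(round(stop_1 - start_1, 3))  # run without -s
time_precised_res = int(round(stop_2 - start_2, 3))  # run with -s /A/396/

# First line: run with the selection; second line: run without it
print(str(time_precised_res) + ' seconds')
print(str(time_no_precised_res) + ' seconds')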

1 seconds
68 seconds
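
Because the timers in this notebook start and stop in separate cells, the measured interval can also include any delay between executing those cells. A small wrapper that times a single call avoids that; the sketch below uses `time.perf_counter`, and `timed` is a hypothetical helper name.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call `func` and return (result, elapsed seconds) for that call only."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the commands defined in this notebook:
# status, seconds = timed(os.system, run_arpeggio)
# print(f"{seconds:.1f} seconds")
```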


Number of rows

In [19]:
df.count()

7

#### PDB line for DOB ligand

```ATOM   3178  C1' DOB X 396      17.205  96.843  45.676  1.00 11.04      A     ```

#### JSON line for DOB ligand

```
{
        "bgn": {
            "auth_asym_id": "A",
            "auth_atom_id": "CZ",
            "auth_seq_id": 201,
            "label_comp_id": "TYR",
            "pdbx_PDB_ins_code": " "
        },
        "contact": [
            "proximal"
        ],
        "distance": 3.78,
        "end": {
            "auth_asym_id": "A",
            "auth_atom_id": "O4",
            "auth_seq_id": 396,
            "label_comp_id": "DOB",
            "pdbx_PDB_ins_code": " "
        },
        "interacting_entities": "INTRA_SELECTION",
        "type": "atom-atom"
    }
```
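
A record like the one above can be turned into a compact, human-readable line. The label layout used here (chain/residue number/residue name/atom name) is an arbitrary choice for illustration, and both function names are hypothetical.

```python
from typing import Any, Dict


def atom_label(atom: Dict[str, Any]) -> str:
    """Format an atom as chain/residue number/residue name/atom name."""
    return "/".join(
        str(atom[key])
        for key in ("auth_asym_id", "auth_seq_id", "label_comp_id", "auth_atom_id")
    )


def describe_contact(record: Dict[str, Any]) -> str:
    """Summarise one contact record on a single line."""
    kinds = ",".join(record["contact"])
    return (
        f"{atom_label(record['bgn'])} -> {atom_label(record['end'])} "
        f"{record['distance']:.2f} A [{kinds}]"
    )
```

Applied to the record above, `describe_contact` gives `A/201/TYR/CZ -> A/396/DOB/O4 3.78 A [proximal]`.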
