# mmtfPyspark 1 - Input & Filtering
mmtf-pyspark operates on 3D structures in the compressed binary MMTF file format.

Info about MMTF:
* [Website](http://mmtf.rcsb.org/index.html)
* [Format paper](https://doi.org/10.1371/journal.pcbi.1005575)
* [Compression paper](https://doi.org/10.1371/journal.pone.0174846)
* [Specification](https://github.com/rcsb/mmtf/blob/master/spec.md)

Protein Data Bank structures are available in two MMTF data representations:
* full
 * All atom representation 
 * 0.001Å coordinate precision, 0.01 B-factor and occupancy precision
* reduced
 * C-alpha atoms only for polypeptides 
 * P-backbone atoms only for polynucleotides 
 * All atom representation for all other residue types 
 * 0.1Å coordinate precision, 0.1 B-factor and occupancy precision.
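
The precision values above can be read as fixed-point quantization: a coordinate is kept as an integer count of small steps. The sketch below only illustrates what the two precisions do to a single value. The divisors 1000 and 10 are assumptions derived from the precisions listed above, and the helper names `encode` and `decode` are hypothetical; see the specification for the actual encoding.

```python
def encode(value, divisor):
    """Store a float as an integer number of 1/divisor steps (illustrative only)."""
    return round(value * divisor)


def decode(encoded, divisor):
    """Recover the quantized float from its integer form."""
    return encoded / divisor


x = 12.34567  # hypothetical coordinate in Å

full = decode(encode(x, 1000), 1000)  # assumed divisor for 0.001 Å precision
reduced = decode(encode(x, 10), 10)   # assumed divisor for 0.1 Å precision

print(full, reduced)
```

The reduced value keeps one decimal place, which is usually sufficient for coarse-grained analyses of backbone traces.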

## Import pyspark and mmtfPyspark

In [None]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader

## Configure Spark

In [None]:
spark = SparkSession.builder.appName("1-Input").getOrCreate()
sc = spark.sparkContext

In [None]:
sc.defaultParallelism

## Download Structures
For a small list of PDB entries (tens to about 100), the download methods are the quickest way to import structures. Here we download a list of 4 structures in the full representation.

In [None]:
pdbids = ['1LQ9','1LXJ','4XPX','1P1J']
structures = mmtfReader.download_full_mmtf_files(pdbids)

Structures are represented as key-value pairs (tuples):
* key: structure identifier (e.g., PDB ID)
* value: MmtfStructure (structure data)

We can print the keys and values using the collect() method. Note that the structures are loaded in an arbitrary order, so you cannot rely on the order of structures.
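
Because the order is arbitrary, it helps to sort the keys or build a dictionary before comparing results. The following is a minimal plain-Python sketch, without Spark, in which the string values are placeholders standing in for the structure data.

```python
pdbids = ['1LQ9', '1LXJ', '4XPX', '1P1J']

# Pairs as they might come back from collect(), in no particular order.
# The string values are placeholders, not real structure objects.
pairs = [
    ('4XPX', '<structure 4XPX>'),
    ('1P1J', '<structure 1P1J>'),
    ('1LQ9', '<structure 1LQ9>'),
    ('1LXJ', '<structure 1LXJ>'),
]

# Compare keys independent of order.
keys = sorted(key for key, _ in pairs)
all_present = keys == sorted(pdbids)

# Look up a structure by its identifier instead of by position.
lookup = dict(pairs)
print(all_present, lookup['1LQ9'])
```

On a pair RDD, `collectAsMap()` returns a comparable dictionary, which is only practical for small datasets that fit in driver memory.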

In [None]:
structures.keys().collect()

In [None]:
structures.values().collect()

Spark represents these key-value pairs as a Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel. To see how the dataset was distributed, we can print the number of partitions.

In [None]:
structures.getNumPartitions()
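
To build intuition for what a partition is, here is a rough plain-Python sketch that splits a list into contiguous slices. The function `split_into_partitions` is hypothetical, and Spark's actual partitioning depends on the data source and configuration, so the split it produces may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, roughly equal slices."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


pdbids = ['1LQ9', '1LXJ', '4XPX', '1P1J']
parts = split_into_partitions(pdbids, 2)

# Each slice could be processed by a separate worker.
for index, part in enumerate(parts):
    print(index, part)
```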

## Reading structures from an MMTF Hadoop Sequence File
Next, we read PDB structures from a local copy of an MMTF Hadoop Sequence file. For the following examples to work, the MMTF_FULL and MMTF_REDUCED environment variables need to be set. See installation instructions for details.

If you have a long list (1000s) of PDB IDs, you can read the list of structures from a local copy of the MMTF Hadoop Sequence file. However, this is very inefficient for a few structures, as in the example below.

In [None]:
path = "../resources/mmtf_reduced_sample/"
structures = mmtfReader.read_sequence_file(path, pdbids)

Let's print the keys again and see how long this takes. You can see that Spark loads the data only when and if it's required.

In [None]:
structures.keys().collect()

Now, let's read a sample of the PDB archive from the MMTF Hadoop Sequence file.

In [None]:
structures = mmtfReader.read_sequence_file(path).cache()

#### There are 9756 structures in the sample file

In [None]:
%%time
structures.count()

### About data flow and caching in Spark
Now, let's count the number of structures again. Should this be faster this time since we already loaded the data?

Not necessarily. The data from the Hadoop Sequence file are streamed through parallel threads. If you need the data again, they must be reloaded from scratch unless they are cached. See the .cache() method call after reading the MMTF Hadoop Sequence file.

Remove the .cache() method call, run this notebook again and compare the time it takes to count the number of structures.
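
The effect of caching can be imitated in a few lines of plain Python. This is a toy model, not how Spark is implemented: the class `LazyDataset` and the function `load_records` are hypothetical, and the counter only tracks how often the source is read.

```python
class LazyDataset:
    """Toy dataset that reads its source on demand and optionally caches it."""

    def __init__(self, loader):
        self._loader = loader
        self._cached = False
        self._data = None
        self.loads = 0  # number of times the source was actually read

    def cache(self):
        # Mark the dataset for caching; nothing is loaded yet.
        self._cached = True
        return self

    def _materialize(self):
        if self._cached and self._data is not None:
            return self._data
        self.loads += 1
        data = list(self._loader())
        if self._cached:
            self._data = data
        return data

    def count(self):
        return len(self._materialize())


def load_records():
    # Stand-in for reading a sequence file.
    return range(1000)


uncached = LazyDataset(load_records)
uncached.count()
uncached.count()

cached = LazyDataset(load_records).cache()
cached.count()
cached.count()

print(uncached.loads, cached.loads)
```

Without caching, every action re-reads the source; with caching, the first action pays the loading cost and later actions reuse the result.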

In [None]:
%%time
structures.count()

## Reading the whole PDB from MMTF-Hadoop Sequence files

In this workshop we use a sample set of the PDB with about 10,000 structures.

To use the entire PDB, the MMTF_FULL and MMTF_REDUCED environment variables must be set. See the mmtf-pyspark [installation instructions](https://github.com/sbl-sdsc/mmtf-pyspark#installation) for details.

### Read whole PDB in the full (all atom) representation
We commented out the lines below, since we are using a smaller sample of the PDB for the tutorials.

To use the whole PDB, the MMTF_FULL and MMTF_REDUCED environment variables need to be set to the `full` and `reduced` MMTF Hadoop Sequence file locations. See [installation instructions](https://github.com/sbl-sdsc/mmtf-pyspark#hadoop-sequence-files) for details.
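
Before uncommenting the cells below, it can be useful to confirm that both variables are visible to the Python process. The following is a small sketch using only the standard library; the helper `missing_env_vars` is hypothetical and not part of mmtf-pyspark.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(["MMTF_FULL", "MMTF_REDUCED"])
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("MMTF_FULL and MMTF_REDUCED are set")
```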

In [None]:
# %%time
# pdb_full = mmtfReader.read_full_sequence_file()
# pdb_full.count()

### Read whole PDB in the reduced representation

In [None]:
# %%time
# pdb_reduced = mmtfReader.read_reduced_sequence_file()
# pdb_reduced.count()

# Very Important: Stop Spark!!!
It is very important to run the notebook all the way to the spark.stop() statement to terminate Spark. Otherwise, you may end up running multiple instances of Spark that interfere with each other.

In [None]:
spark.stop()

# 2-Filtering
This tutorial demonstrates how to filter the PDB to create subsets of structures. For details see [filters](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/mmtfPyspark/filters) and [demos](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/demos/filters).

### Import pyspark and mmtfPyspark

In [None]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsGroup, ContainsLProteinChain, PolymerComposition, Resolution 
from mmtfPyspark.structureViewer import view_group_interaction

### Configure Spark

In [None]:
spark = SparkSession.builder.appName("2-Filtering").getOrCreate()

### Read PDB structures

In [None]:
path = "../resources/mmtf_reduced_sample"
pdb = mmtfReader.read_sequence_file(path).cache()
pdb.count()

## Filter by Quality Metrics
Structures can be filtered by [Resolution](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/resolution) and [R-free](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/r-value-and-r-free). Each filter takes a minimum and a maximum value. The example below returns structures with a resolution in the inclusive range [0.0, 1.5].

In [None]:
pdb = pdb.filter(Resolution(0.0, 1.5))
pdb.count()
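
An inclusive range means that both boundary values pass the filter. The plain-Python sketch below shows this with stand-in objects. The identifiers, the resolution values, and the `resolution` attribute on the stand-ins are all made up for illustration, and this is not the library's implementation of the filter.

```python
from types import SimpleNamespace


def in_resolution_range(pair, minimum, maximum):
    """Keep a (key, structure) pair if its resolution lies in [minimum, maximum]."""
    _, structure = pair
    return minimum <= structure.resolution <= maximum


# Hypothetical entries; SimpleNamespace stands in for the structure data.
sample = [
    ("ID_A", SimpleNamespace(resolution=0.9)),
    ("ID_B", SimpleNamespace(resolution=1.5)),  # on the upper boundary: kept
    ("ID_C", SimpleNamespace(resolution=2.1)),
]

kept = [key for key, s in sample if in_resolution_range((key, s), 0.0, 1.5)]
print(kept)
```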

## Filter by Polymer Chain Types
A number of filters are available to filter by the type of the polymer chain.

### Create a subset of structures that contain at least one L-protein chain

In [None]:
pdb = pdb.filter(ContainsLProteinChain())
pdb.count()

### Create a subset of structures that exclusively contain L-protein chains (e.g., exclude protein-nucleic acid complexes)

In [None]:
pdb = pdb.filter(ContainsLProteinChain(exclusive=True))
pdb.count()
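
The difference between the default and the exclusive mode corresponds to "any chain matches" versus "all chains match". The sketch below assumes that reading of the two modes and illustrates it in plain Python. The function `contains_chain_type` and the chain-type labels are hypothetical and do not reflect how the library stores chain types.

```python
def contains_chain_type(chain_types, wanted, exclusive=False):
    """Return True if any chain (default) or every chain (exclusive) has the wanted type."""
    matches = [chain_type == wanted for chain_type in chain_types]
    if exclusive:
        # An entry without chains should not count as a match.
        return bool(matches) and all(matches)
    return any(matches)


protein_only = ["L-protein", "L-protein"]
complex_entry = ["L-protein", "DNA"]

print(contains_chain_type(complex_entry, "L-protein"))                  # any chain matches
print(contains_chain_type(complex_entry, "L-protein", exclusive=True))  # not all chains match
print(contains_chain_type(protein_only, "L-protein", exclusive=True))   # all chains match
```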

### Keep protein structures that exclusively contain chains made out of the 20 standard amino acids

In [None]:
pdb = pdb.filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20, exclusive=True))
pdb.count()

## Find the subset of structures that contains ATP

In [None]:
pdb = pdb.filter(ContainsGroup("ATP"))

## Visualize the hits

In [None]:
view_group_interaction(pdb.keys().collect(),"ATP");

## Filter with a lambda expression
Rather than using a pre-made filter, we can create simple filters using lambda expressions. The expression needs to evaluate to a boolean type.

The variable t in the lambda expression below represents a tuple, and t[1] is the second element of the tuple, the MmtfStructure.

Here, we filter by the number of atoms in an entry. You will learn more about extracting structural information from an MmtfStructure in future tutorials.

In [None]:
pdb = pdb.filter(lambda t: t[1].num_atoms < 500)
pdb.count()
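
The same lambda can be tried on an ordinary Python list to see how the predicate treats each tuple. In this sketch the identifiers and atom counts are made up, and `SimpleNamespace` stands in for the structure object.

```python
from types import SimpleNamespace

# Hypothetical (key, structure) tuples with a num_atoms attribute.
entries = [
    ("ID_A", SimpleNamespace(num_atoms=320)),
    ("ID_B", SimpleNamespace(num_atoms=4800)),
    ("ID_C", SimpleNamespace(num_atoms=499)),
]

# t[0] is the key and t[1] is the structure, as in the RDD filter.
small = list(filter(lambda t: t[1].num_atoms < 500, entries))
small_keys = [t[0] for t in small]
print(small_keys)
```

Testing a predicate on a handful of tuples like this is a quick way to catch mistakes before running it over the full dataset.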

Or, we can filter by the key, represented by the first element in a tuple: t[0].

**Keys are case sensitive. Always use upper case PDB IDs in mmtf-pyspark!**

In [None]:
pdb = pdb.filter(lambda t: t[0] in ["4AFF", "4CBU"])
pdb.count()
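
If the PDB IDs come from user input or a file, normalizing them to upper case before filtering avoids silent mismatches. The sketch below is plain Python with made-up input; a set is used because membership tests on a set stay fast for long ID lists.

```python
raw_ids = ["4aff", " 4CBU "]  # hypothetical user input with mixed case and stray spaces

# Normalize once, then test membership against the set.
wanted = {pdb_id.strip().upper() for pdb_id in raw_ids}

keys = ["4AFF", "1LQ9", "4CBU"]
hits = [key for key in keys if key in wanted]
print(hits)
```

The resulting set can be used directly in the lambda, for example `lambda t: t[0] in wanted`.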

In [None]:
spark.stop()