# Flatmapping

This tutorial demonstrates how to split PDB structures into subcomponents or create biological assemblies. In Spark, a flatMap transformation splits each data record into zero or more records.
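
As a rough, Spark-free illustration of the flatMap semantics described above, the sketch below applies the same idea to a plain Python list. The `split_chains` helper and the toy records (including the `EMPTY` entry) are hypothetical and are not part of mmtfPyspark.

```python
from itertools import chain

# Hypothetical toy records: (structure id, list of chain ids).
# "EMPTY" is a made-up entry with no chains, to show the "zero records" case.
records = [("4HHB", ["A", "B", "C", "D"]), ("EMPTY", [])]

def split_chains(record):
    """Return zero or more (key, chain id) records for one input record."""
    structure_id, chain_ids = record
    return [(f"{structure_id}.{c}", c) for c in chain_ids]

# map: exactly one output per input, so the result is a list of lists.
mapped = [split_chains(r) for r in records]

# flatMap: the per-record outputs are concatenated into one flat collection.
flat_mapped = list(chain.from_iterable(split_chains(r) for r in records))

print(len(mapped))                  # prints 2
print([k for k, _ in flat_mapped])  # prints ['4HHB.A', '4HHB.B', '4HHB.C', '4HHB.D']
```

The record with no chains contributes nothing to the flattened result, which is the "zero" case of "zero or more records".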


### Import pyspark and mmtfPyspark


In [None]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsDnaChain
from mmtfPyspark.mappers import StructureToPolymerChains, StructureToPolymerSequences
from mmtfPyspark.structureViewer import view_structure

### Configure Spark


In [None]:
spark = SparkSession.builder.appName("mmtfPyspark-03-Flatmapping").getOrCreate()

## Read PDB structures

In this example we download the hemoglobin structure 4HHB, consisting of two alpha subunits and two beta subunits.


In [None]:
quaternary = mmtfReader.download_reduced_mmtf_files(["4HHB"])

In [None]:
view_structure(quaternary.keys().collect())

## Flatmap by protein sequence

Here we extract the polymer sequences using a flatMap transformation. Chains A and C (alpha subunits) share one sequence, and chains B and D (beta subunits) share another.


In [None]:
sequences = quaternary.flatMap(StructureToPolymerSequences())
sequences.take(4)

## Flatmap structures

A flatMap operation splits each data record into zero or more records. Here, we use the StructureToPolymerChains class to flatMap a PDB entry (quaternary structure) to its polymer chains (tertiary structure). Note that the chain ID is appended to the PDB ID. The two alpha subunits are 4HHB.A and 4HHB.C, and the two beta subunits are 4HHB.B and 4HHB.D.


In [None]:
tertiary = quaternary.flatMap(StructureToPolymerChains())
tertiary.keys().collect()

In [None]:
view_structure(tertiary.keys().collect())

For some analyses we may need only one copy of each unique subunit (chains with an identical polymer sequence). This can be done by setting excludeDuplicates=True.


In [None]:
tertiary = quaternary.flatMap(StructureToPolymerChains(excludeDuplicates=True))
tertiary.keys().collect()
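
Conceptually, excluding duplicates means keeping one chain per distinct polymer sequence. The plain-Python sketch below shows that idea only. The sequences are placeholders (not the real hemoglobin sequences), `unique_by_sequence` is a hypothetical helper, and mmtfPyspark may implement the option differently.

```python
# Placeholder sequences, chosen so that A/C and B/D match as in 4HHB.
chains = [
    ("4HHB.A", "SEQ_ALPHA"),
    ("4HHB.B", "SEQ_BETA"),
    ("4HHB.C", "SEQ_ALPHA"),
    ("4HHB.D", "SEQ_BETA"),
]

def unique_by_sequence(chains):
    """Keep the first chain encountered for each distinct sequence."""
    seen = set()
    unique = []
    for key, sequence in chains:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((key, sequence))
    return unique

unique_chains = unique_by_sequence(chains)
print([key for key, _ in unique_chains])  # prints ['4HHB.A', '4HHB.B']
```

In this toy version the first chain seen for each sequence is kept; which representative a real implementation keeps is an assumption here.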

### Combine FlatMap with Filter

The filter operations we used previously for whole structures can also be applied to single polymer chains. Here we flatMap PDB structures into polymer chains and then select the DNA chains.


In [None]:
path = "../data/mmtf_reduced_sample"

dna_chains = (
    mmtfReader.read_sequence_file(path)
    .flatMap(StructureToPolymerChains(excludeDuplicates=True))
    .filter(ContainsDnaChain())
)
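
The order of the two steps matters: filtering after the flatMap tests each chain on its own, while filtering before it tests whole entries. The sketch below illustrates this with plain Python on toy data. The structure IDs, polymer-type labels, and the `to_chains` and `is_dna` helpers are all hypothetical stand-ins, not mmtfPyspark objects.

```python
from itertools import chain

# Hypothetical toy structures: id -> list of (chain id, polymer type).
structures = {
    "S1": [("A", "protein"), ("B", "DNA")],
    "S2": [("A", "protein")],
    "S3": [("A", "DNA"), ("B", "DNA")],
}

def to_chains(item):
    """Split one (id, chains) entry into per-chain (key, polymer type) records."""
    structure_id, chains = item
    return [(f"{structure_id}.{cid}", ptype) for cid, ptype in chains]

def is_dna(record):
    """Chain-level predicate: keep records whose polymer type is DNA."""
    _, ptype = record
    return ptype == "DNA"

# flatMap first, then filter: every chain is tested individually.
all_chains = chain.from_iterable(to_chains(s) for s in structures.items())
dna_keys = [key for key, _ in filter(is_dna, all_chains)]

# Filter first (entry level), then flatMap: protein chains of mixed entries survive.
with_dna = [s for s in structures.items() if any(p == "DNA" for _, p in s[1])]
mixed_keys = [key for s in with_dna for key, _ in to_chains(s)]

print(dna_keys)    # prints ['S1.B', 'S3.A', 'S3.B']
print(mixed_keys)  # prints ['S1.A', 'S1.B', 'S3.A', 'S3.B']
```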

In [None]:
view_structure(dna_chains.keys().collect())

In [None]:
spark.stop()