# Fondant Pipeline

This notebook implements a Fondant pipeline built from the components created during an internship at ML6. The pipeline is a sequence of steps, each executed in a Docker container, that calculate and extract features from protein sequences. The notebook is organized into the following sections:

1. Data Loading
2. Applying Components
3. Executing the Pipeline

## Data Loading

The `./data` folder already contains a sample dataset of protein sequences, `uniprotkb_cyp_2024_05_17.xlsx.gz`. The cells below convert it to Parquet so that the pipeline can load it and extract features from it.

In [3]:
import gzip
import pandas as pd

with gzip.open('./data/uniprotkb_cyp_2024_05_17.xlsx.gz', 'rb') as f:
    df = pd.read_excel(f)

df.to_parquet('./data/uniprotkb_cyp_2024_05_17.parquet')

  warn("Workbook contains no default style, apply openpyxl's default")


In [4]:
df

Unnamed: 0,Entry,Entry Name,Protein names,Gene Names,Sequence
0,A0A1U7Q236,A0A1U7Q236_MESAU,Cytochrome P450 1A (EC 1.14.14.1),LOC101825565 CP450 CYP-IA2 CYP1A2 CYPIA2,MALSQYTSLSTELVLATAIFCIVFWVARALRTQVPKGLKTPPGPWG...
1,A0A1U7Q283,A0A1U7Q283_MESAU,Cytochrome P450 1A (EC 1.14.14.1),LOC101844540 CYP-IA1 CYP1A1 CYPIA1,MSSIYGLLNFMSATELLVAITVFCLGFWVVRALRTQVPKGLKTPPG...
2,F4IX26,F4IX26_ARATH,peptidylprolyl isomerase (EC 5.2.1.8),ROC4 cyclophilin 20-3 CYP20-3 PEPTIDYLPROLYL I...,MFRLLLLPYAVGAQQKLLQTPRETKVADAWNIKCQNLLLSKANQQK...
3,G4N2X9,OXEAS_PYRO7,Bifunctional dioxygenase (DOX)-epoxy alcohol s...,MGG_10859,MDGAVRLDWTGLDLTGHEIHDGVPIASRVQVMVSFPLFKDQHIIMS...
4,G4N4J5,LIDS_PYRO7,"7,8-linoleate diol synthase (LDS) [Includes: L...",MGG_13239,MASSSSSGSSTRSSSPSDPPSSFFQKLGAFLGLFSKPQPPRPDYPH...
...,...,...,...,...,...
20391,X8IV98,X8IV98_9AGAM,Cytochrome P450 family protein,RSOL_013040,MDYLADLTDQLTLKHLFFIACGVAVIKLRHDFIYRPIRNWRSPLRN...
20392,X8IW15,X8IW15_9AGAM,Cytochrome P450 family protein,RSOL_018950,MSQPQLTFDLNRIQLLGNSVLKIFQNQPVGCTIALSVLTCLWYTFR...
20393,X8J6H9,X8J6H9_9AGAM,Cytochrome P450 family protein,RSOL_226310,MSNLAITSLASLRSLLNGALQSESGTLGPHVLETAKQHSNQLVLAL...
20394,X8JU90,X8JU90_9AGAM,Cytochrome P450 family protein,RSOL_478290,MDSSTYLSVIVFIYVVVTLLRKWHRAWAYRALNHLPGPPREKWSKG...


In [5]:
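# Keep a random sample of 2650 rows and rename the "Sequence" column to
# "sequence", the column name that the later cells and the pipeline read.

df = df.sample(2650).rename(columns={"Sequence": "sequence"})
df.to_parquet('./data/data.parquet')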

## Applying Components

The cells below apply the following components to the dataset:

- generate_protein_sequence_checksum_component
  - This component generates a checksum for the protein sequence.

- biopython_component
  - Extracts features from the protein sequence using Biopython.

- iFeatureOmega_component
  - Extracts features from the protein sequence using the [iFeatureOmega-CLI GitHub repo](https://github.com/Superzchen/iFeatureOmega-CLI). Arguments are used to specify the type of features to extract.

- unikp_component
  - Calls the UniKP endpoint on Hugging Face to predict the kinetic parameters of each protein sequence and substrate (SMILES) combination. This component only runs if you have access to a UniKP endpoint. [A Hugging Face Space](https://huggingface.co/spaces/ml6team/ML6-UniKP) hosting the UniKP model was created during the internship, so you can use it to run the component.

- peptide_features_component
  - Calculates features from the protein sequence using the `peptides` package.
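
To make the first component in the list more concrete, the sketch below shows one way a sequence checksum could be computed. It is an assumption, not the component's actual code: the helper name `sequence_checksum` and the choice of SHA-256 are hypothetical, and the real component may use a different algorithm.

```python
import hashlib


def sequence_checksum(sequence: str) -> str:
    """Return a hex digest for a protein sequence (hypothetical helper)."""
    # Normalise first so that whitespace and letter case do not change the result.
    normalised = "".join(sequence.split()).upper()
    # SHA-256 is an assumption made for this sketch only.
    return hashlib.sha256(normalised.encode("ascii")).hexdigest()


# Two spellings of the same sequence give the same checksum.
same = sequence_checksum("MALSQYTSLS") == sequence_checksum(" malsqy tsls\n")
print(same)
```

A checksum like this gives every distinct sequence a short, fixed-length key, which is convenient for de-duplication and lookups.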

In [8]:
import pyarrow as pa
import pandas as pd
import json

from fondant.pipeline import Pipeline
from fondant.pipeline.runner import DockerRunner

DATA_PATH_FONDANT = "/data/data.parquet"
DATA_PATH = "./data/data.parquet"

BASE_PATH = ".fondant"
PIPELINE_NAME = "feature_extraction_pipeline"


# create a new pipeline
pipeline = Pipeline(
    name=PIPELINE_NAME,
    base_path=BASE_PATH,
    description="A pipeline to extract features from protein sequences."
)

In [18]:
def create_protein_smiles():
	"""
	Create a JSON file that maps each protein sequence to a list of substrate SMILES.
	The substrate is naringin for all protein sequences.
	"""
	naringin_smiles = "O=C4c5c(O)cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H]2O[C@H]1O[C@@H]([C@H](O)[C@H](O)[C@H]1O)C)cc5O[C@H](c3ccc(O)cc3)C4"

	# map every protein sequence in the dataset to the same substrate
	dataset = pd.read_parquet(DATA_PATH)
	protein_smiles = {sequence: [naringin_smiles] for sequence in dataset["sequence"]}

	# save the mapping to a JSON file
	with open("./data/protein_smiles.json", "w") as f:
		json.dump(protein_smiles, f)

create_protein_smiles()
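
The file written by the cell above is a JSON object with one key per protein sequence and a list of substrate SMILES as the value. The sketch below shows that shape on toy data; the two short sequences and the ethanol SMILES `CCO` are stand-ins chosen for illustration, not values from the dataset.

```python
import json

# Hypothetical toy inputs; real sequences are far longer.
sequences = ["MALSQY", "MSSIYG"]
substrate_smiles = "CCO"  # ethanol, used here only as a short stand-in

# Same structure as protein_smiles.json: sequence -> list of SMILES strings.
protein_smiles = {sequence: [substrate_smiles] for sequence in sequences}

payload = json.dumps(protein_smiles, indent=2)
print(payload)

# Reading the JSON back yields the same mapping.
round_trip = json.loads(payload)
```

Because the value is a list, one sequence can be paired with several substrates by adding more SMILES strings to it.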

In [3]:
dataset = pipeline.read(
	"load_from_parquet",
	arguments={
		"dataset_uri": DATA_PATH_FONDANT,
	},
	produces={
		"sequence": pa.string()
	}
)


In [None]:
_ = dataset.apply(
	"./components/biopython_component"
).apply(
	"./components/generate_protein_sequence_checksum_component"
).apply(
	"./components/iFeatureOmega_component",
	input_partition_rows=5,
	arguments={
		"descriptors": ["AAC", "CTDC", "CTDT"]
	}
).apply(
	"./components/unikp_component",
	arguments={
		"protein_smiles_path": "/data/protein_smiles.json",
	},
).apply(
	"./components/peptide_features_component"
)
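
The `descriptors` argument above selects which iFeatureOmega descriptors are computed. As an illustration of the simplest one, AAC (amino acid composition) is the fraction of each of the 20 standard amino acids in a sequence. The sketch below is a plain-Python approximation, not iFeatureOmega's own code: the function name `amino_acid_composition` is hypothetical, and it assumes that non-standard residue letters are ignored when normalising.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_composition(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    # Assumption: only standard residues count towards the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


aac = amino_acid_composition("AAC")
print(round(aac["A"], 3), round(aac["C"], 3))
```

Each sequence therefore yields 20 numeric features that sum to 1, regardless of its length.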

## Executing the Pipeline

The first cell below runs the pipeline with Fondant's `DockerRunner` class, which starts the Docker containers and executes the components described above. The remaining cells collect the results of the pipeline and save them to a Parquet file.

In [None]:
import os

# absolute path of the current working directory
pipeline_path = os.getcwd()

runner = DockerRunner()
runner.run(
	input=pipeline,
	# mount the local ./data folder at /data inside the containers
	extra_volumes=[f"{pipeline_path}/data:/data"]
)
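
The `extra_volumes` entry follows Docker's `host_path:container_path` syntax, which is why the notebook keeps two paths to the same file: `DATA_PATH` on the host and `DATA_PATH_FONDANT` inside the containers. The sketch below illustrates that translation; the helper `to_container_path` and the example host directory are hypothetical and not part of Fondant.

```python
from pathlib import PurePosixPath


def to_container_path(host_path: str, host_root: str, container_root: str = "/data") -> str:
    """Translate a file path under the mounted host folder to its in-container path."""
    # relative_to raises ValueError if host_path is not inside host_root.
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(host_root))
    return str(PurePosixPath(container_root) / relative)


# Hypothetical project location on the host.
container_path = to_container_path(
    "/home/user/project/data/data.parquet",
    "/home/user/project/data",
)
print(container_path)
```

Any path passed to a component as an argument, such as `protein_smiles_path`, has to be the in-container form, because the component code only sees the mounted folder.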

In [9]:
import glob
import os

# find the most recent run folder: BASE_PATH/PIPELINE_NAME/PIPELINE_NAME-<timestamp>
matching_folders = glob.glob(f"{BASE_PATH}/{PIPELINE_NAME}/{PIPELINE_NAME}-*")

if not matching_folders:
	# stop here with an explicit error instead of calling exit()
	raise FileNotFoundError("No matching folders found")

OUTPUT_FOLDER = max(matching_folders, key=os.path.getctime)

# define the flag up front so that later cells do not fail when no manifest is found
REMOVED_MANIFEST = False

# remove the manifest file from each folder in the output folder
for root, dirs, files in os.walk(OUTPUT_FOLDER):
	for file in files:
		if file == "manifest.json":
			os.remove(os.path.join(root, file))
			REMOVED_MANIFEST = True

In [10]:
import os
import pandas as pd

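def merge_parquet_folders(folder_path):
	"""Merge the parquet output of every folder in folder_path on the "sequence" column."""
	merge_df = None

	for folder in os.listdir(folder_path):
		parquet_partitions = os.path.join(folder_path, folder)
		# skip anything that is not a folder of parquet partitions
		if not os.path.isdir(parquet_partitions):
			continue
		df = pd.read_parquet(parquet_partitions)

		if merge_df is None:
			merge_df = df
		else:
			merge_df = merge_df.merge(df, on="sequence")

	# return an empty frame when no folder was found
	if merge_df is None:
		return pd.DataFrame()
	return merge_df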
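
The merge step can be illustrated on toy data. The sketch below joins two small per-component frames on `sequence`, the same key used above; the column names `checksum` and `length` and all values are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical outputs of two components, both keyed by "sequence".
checksums = pd.DataFrame({"sequence": ["MALSQY", "MSSIYG"], "checksum": ["a1", "b2"]})
lengths = pd.DataFrame({"sequence": ["MSSIYG", "MALSQY"], "length": [6, 6]})

# Fold the frames together with an inner join on the shared key.
merged = reduce(lambda left, right: left.merge(right, on="sequence"), [checksums, lengths])
print(merged)
```

Note that an inner join on a key with repeated values multiplies rows, so if the same sequence occurs more than once in the input, the merged frame can contain more rows than any single component output.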

In [11]:
if REMOVED_MANIFEST and os.path.exists(OUTPUT_FOLDER):
	merged_df = merge_parquet_folders(OUTPUT_FOLDER)
	# a bare expression inside an if block is not rendered, so display it explicitly
	display(merged_df)

In [12]:
if REMOVED_MANIFEST and os.path.exists(OUTPUT_FOLDER):
	# create ./data/export if it does not exist yet
	output_path = os.path.join(os.path.abspath("data"), "export")
	os.makedirs(output_path, exist_ok=True)

	merged_df.to_parquet(os.path.join(output_path, "results.parquet"))

In [None]:
# read the output file

output_df = pd.read_parquet("./data/export/results.parquet")
output_df