# Genomic Data - DemoHumanOrWorm.ipynb

This notebook uses `demo_human_or_worm`, one of the genomic datasets from the [genomic_benchmarks](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks) repository.
The classification task is evaluated using the _SeqRep_ package.

You can run this notebook in the browser: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MIR-MU/seqrep/blob/main/examples/genomic_data/DemoHumanOrWorm.ipynb)
or
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/MIR-MU/seqrep/main?labpath=examples%2Fgenomic_data%2FDemoHumanOrWorm.ipynb).

## Install _SeqRep_ Package

In [2]:
!pip install seqrep

Collecting seqrep
  Downloading seqrep-0.0.2-py3-none-any.whl (19 kB)
Collecting pandas-ta>=0.3.14b0
  Downloading pandas_ta-0.3.14b.tar.gz (115 kB)
Collecting ta>=0.8.0
  Downloading ta-0.9.0.tar.gz (25 kB)
Collecting numpy-ext>=0.9.6
  Downloading numpy_ext-0.9.6-py3-none-any.whl (6.9 kB)
Collecting hrv-analysis>=1.0.4
  Downloading hrv_analysis-1.0.4-py3-none-any.whl (28 kB)
Collecting nolds>=0.4.1
  Downloading nolds-0.5.2-py2.py3-none-any.whl (39 kB)
Collecting numpy>=1.15.1
  Downloading numpy-1.20.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.3 MB)
Collecting joblib<1.1.0,>=1.0.1
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
Building wheels for collected packages: pandas-ta, ta
  Building wheel for pandas-ta (setup.py) ... done
  Created wheel for pan

## Import Needed Packages

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neural_network import MLPClassifier

!pip install icecream
from icecream import ic

from seqrep import *
from seqrep.feature_engineering import *
from seqrep.labeling import *
from seqrep.splitting import *
from seqrep.scaling import *
from seqrep.feature_reduction import *
from seqrep.evaluation import *
from seqrep.pipeline_evaluation import *



## Load or Download Data

In [5]:
!pip install genomic-benchmarks

Collecting genomic-benchmarks
  Downloading genomic_benchmarks-0.0.6.tar.gz (17 kB)
Collecting biopython>=1.79
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)
Collecting pyyaml>=5.3.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting yarl
  Downloading yarl-1.7.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (271 kB)
Collecting multidict>=4.0
  Downloading multidict-6.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (94 kB)
Building wheels for collected packages: genomic-benchmarks
  Building wheel for genomic-benchmarks (setup.py) ...

In [6]:
from genomic_benchmarks.data_check import list_datasets

list_datasets()

['demo_human_or_worm',
 'demo_mouse_enhancers',
 'human_nontata_promoters',
 'demo_coding_vs_intergenomic_seqs',
 'human_enhancers_cohn',
 'human_enhancers_ensembl']

In [7]:
["".join([w.capitalize() for w in dat.split("_")]) for dat in list_datasets()]

['DemoHumanOrWorm',
 'DemoMouseEnhancers',
 'HumanNontataPromoters',
 'DemoCodingVsIntergenomicSeqs',
 'HumanEnhancersCohn',
 'HumanEnhancersEnsembl']
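
The conversion above turns each snake_case dataset name into a CamelCase name; for `demo_human_or_worm` the result matches the `DemoHumanOrWorm` class imported in the next cell. As a minimal sketch, the same logic can be wrapped in a small helper. The function name `dataset_class_name` is hypothetical and not part of `genomic_benchmarks`.

```python
def dataset_class_name(dataset_name: str) -> str:
    """Convert a snake_case dataset name to a CamelCase name.

    Hypothetical helper for illustration; it only reformats the string
    and does not check that a matching class actually exists.
    """
    return "".join(word.capitalize() for word in dataset_name.split("_"))


names = ["demo_human_or_worm", "human_nontata_promoters"]
class_names = [dataset_class_name(name) for name in names]
print(class_names)
```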

In [8]:
# demo_human_or_worm
from genomic_benchmarks.dataset_getters.pytorch_datasets import (
    DemoHumanOrWorm,
)

X_train = DemoHumanOrWorm(split="train", version=0)
X_test = DemoHumanOrWorm(split="test", version=0)


y_train = pd.Series([y for _, y in X_train])
X_train = pd.DataFrame([x for x, _ in X_train], columns=["genom"])
y_test = pd.Series([y for _, y in X_test])
X_test = pd.DataFrame([x for x, _ in X_test], columns=["genom"])

Downloading 1Vuc44bXRISqRDXNrxt5lGYLpLsJbrSg8 into /root/.genomic_benchmarks/demo_human_or_worm.zip... Done.
Unzipping...Done.


In [9]:
# Random shuffling (optional, disabled by default)
# idx = np.random.permutation(len(y_train))
# X_train = X_train.iloc[idx, :].reset_index(drop=True)
# y_train = y_train.iloc[idx].reset_index(drop=True)

# idx = np.random.permutation(len(y_test))
# X_test = X_test.iloc[idx, :].reset_index(drop=True)
# y_test = y_test.iloc[idx].reset_index(drop=True)

X_train.join(pd.DataFrame(y_train, columns=["label"]))

|       | genom                                              | label |
|------:|:---------------------------------------------------|------:|
|     0 | CTAAAAATACAAAAATTAGCTGGGTGTGGTGGCGCGCGCCTGTAAT... |     0 |
|     1 | CTGGTGATGCTGGAAGCATTGGATGCCCTGTAAGGACATGATTTTG... |     0 |
|     2 | ATTAAAAGCATACTTGTTCAAATTTGGTATAAATAGGACATATTAC... |     0 |
|     3 | GGAGGCCAAGGCGGGTGGATCACCTGAGGTCGGGCGTTCAAGACCA... |     0 |
|     4 | TATAAGACCTAAAGGCAGCAACTAGCTAATATCTGTCCAGTGTTAT... |     0 |
|   ... | ...                                                |   ... |
| 74995 | CGAAGTTTGGTTCTCGGATTGTGTGCTGGCACTTTCCTGCCAAATG... |     1 |
| 74996 | AGACACCCTGAGAGTCGATTTGTCTCATTTTTCGTCGATAAATGTA... |     1 |
| 74997 | CGTATCTCTGGTTGCCAGTTTATTTCTACGATGAGCCATTTCAATT... |     1 |
| 74998 | TTTCGTTCCATGCATCAATGTCTAATCCAGCCTTCATAGAGTTTCT... |     1 |
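
The training rows are ordered by label, and the commented-out lines in the cell above shuffle features and labels with one shared permutation so that each sequence keeps its label. The sketch below illustrates that idea on a toy frame; `X_toy`, `y_toy`, and the fixed seed are assumptions made for the example only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / y_train (illustrative values only).
X_toy = pd.DataFrame({"genom": ["AAAA", "CCCC", "GGGG", "TTTT"]})
y_toy = pd.Series([0, 0, 1, 1])

# One permutation, applied to both objects, keeps rows and labels aligned.
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(y_toy))

X_shuffled = X_toy.iloc[idx, :].reset_index(drop=True)
y_shuffled = y_toy.iloc[idx].reset_index(drop=True)

# The set of (sequence, label) pairs is unchanged by the shuffle.
pairs_before = set(zip(X_toy["genom"], y_toy))
pairs_after = set(zip(X_shuffled["genom"], y_shuffled))
print(pairs_before == pairs_after)
```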


## Run Pipeline Evaluation

In [10]:
# This DataFrame collects the results of various runs for comparison.

# Uncomment the following line if you want to clear the DataFrame with the results.
# del results_for_comparison

try:
    results_for_comparison
except NameError:
    print("Create new empty DataFrame.")
    results_for_comparison = pd.DataFrame()
else:
    print("DataFrame already exists!")

Create new empty DataFrame.


In [11]:
from typing import List, Union

from tqdm.auto import tqdm


class SubstringsExtractor(FeatureExtractor):
    """Create one feature per substring: its count in each string column."""

    def __init__(
        self,
        substrings: List[str],
        columns_to_apply: Union[str, List[str]] = None,
        return_original_columns: bool = False,
        normalize: bool = True,
        verbose: bool = True,
        inplace: bool = False,
    ):
        self.substrings = substrings
        self.columns_to_apply = columns_to_apply
        self.return_original_columns = return_original_columns
        self.normalize = normalize
        self.verbose = verbose
        self.inplace = inplace

    def fit(self, X, y=None):
        if self.verbose:
            print(f"\tNumber of substrings BEFORE fit: {len(self.substrings)}")
        # Default to all columns; wrap a single column name in a list.
        if self.columns_to_apply is None or len(self.columns_to_apply) == 0:
            self.columns_to_apply = list(X.columns)
        elif isinstance(self.columns_to_apply, str):
            self.columns_to_apply = [self.columns_to_apply]

        # Keep only the substrings that occur somewhere in the training data.
        # "@" separates sequences so that no match can span two of them.
        tmp = "@".join("@".join(X[c]) for c in self.columns_to_apply)
        self.substrings = [
            s
            for s in tqdm(
                self.substrings, leave=False, desc="Fitting SubstringsExtractor"
            )
            if s in tmp
        ]
        del tmp

        if self.verbose:
            print(f"\tNumber of substrings AFTER fit:  {len(self.substrings)}")
        return self

    def transform(self, X):
        if not self.inplace:
            X = X.copy()
        for column in tqdm(
            self.columns_to_apply,
            leave=False,
            desc="Transforming SubstringsExtractor - columns",
        ):
            # Prefix feature names with the column name only when needed.
            col_pref = column + "_" if len(self.columns_to_apply) > 1 else ""
            for substr in tqdm(
                self.substrings,
                leave=False,
                desc="Transforming SubstringsExtractor - substrings",
            ):
                counts = X[column].str.count(substr)
                # Divide by the sequence length only when normalization is on;
                # otherwise keep the raw counts.
                if self.normalize:
                    counts = counts / X[column].str.len()
                X.loc[:, f"{col_pref}count-{substr}"] = counts

        if self.return_original_columns:
            return X
        return X.drop(columns=self.columns_to_apply)
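
To make the feature definition concrete, here is a minimal sketch of the normalized substring count on a toy frame, independent of _SeqRep_. The sequences and the substring list are made up for illustration. Note that `Series.str.count` treats its argument as a regular expression and counts non-overlapping matches, so `"AA"` occurs twice in `"AAAA"`, not three times.

```python
import pandas as pd

# Toy sequences (illustrative only, not taken from the dataset).
toy = pd.DataFrame({"genom": ["ACGTAC", "AAAA", "GGTT"]})
substrings = ["A", "AC", "AA"]

features = pd.DataFrame(index=toy.index)
for s in substrings:
    # Non-overlapping occurrences of s in each sequence ...
    counts = toy["genom"].str.count(s)
    # ... divided by the sequence length (the normalize=True behaviour).
    features[f"count-{s}"] = counts / toy["genom"].str.len()

print(features)
```

Because the counts are non-overlapping, features for repetitive substrings underestimate the number of positions at which the substring starts; whether that matters depends on the task.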

In [12]:
from itertools import product

perms = [
    "".join(p)
    for i in range(1, 6)
    for p in product("ACTG", repeat=i)
    if len(p) < 5 or len(set(p)) > 2
]
ic(len(perms))
# ic(perms[:10])

ic| len(perms): 1180


1180
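
The count of 1180 can be checked by hand. All strings of length 1 to 4 are kept (4 + 16 + 64 + 256 = 340). Of the 4^5 = 1024 strings of length 5, the filter drops those built from at most two distinct letters: 4 with one letter and 6 × (2^5 − 2) = 180 with exactly two, leaving 840. The sketch below reproduces this arithmetic and compares it with a direct enumeration; the variable names are chosen for the example.

```python
from itertools import combinations, product

alphabet = "ACTG"

# Lengths 1..4: every string is kept.
short = sum(len(alphabet) ** k for k in range(1, 5))

# Length 5: drop strings that use only one or exactly two distinct letters.
one_letter = len(alphabet)
letter_pairs = len(list(combinations(alphabet, 2)))
two_letters = letter_pairs * (2**5 - 2)
long_kept = len(alphabet) ** 5 - one_letter - two_letters

expected = short + long_kept

# Direct enumeration with the same filter as in the cell above.
enumerated = sum(
    1
    for k in range(1, 6)
    for p in product(alphabet, repeat=k)
    if len(p) < 5 or len(set(p)) > 2
)

print(short, long_kept, expected, enumerated)
```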

In [13]:
%%capture --no-stdout --no-display
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

run_identification = f"{len(perms)} substrings"

# 1. step - define your pipeline
pipe = Pipeline(
    [
        (
            "fext_substr",
            SubstringsExtractor(
                substrings=perms,
            ),
        ),
        ("scale_u", UniversalScaler(scaler=MinMaxScaler())),
    ]
)

# 2. step - define your workflow
pipe_eval = PipelineEvaluator(
    pipeline=pipe,
    model=MLPClassifier(
        hidden_layer_sizes=(128, 32, 8),
        batch_size=32,
    ),
    evaluator=SequentialEvaluator(
        [
            ClassificationEvaluator(),
            UniversalEvaluator(metrics=[f1_score]),
        ]
    ),
)
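# 3. step - provide the data and run the evaluation
pipe_eval.X_train = X_train.copy()
pipe_eval.y_train = y_train.copy()
pipe_eval.X_test = X_test.copy()
pipe_eval.y_test = y_test.copy()

result = pipe_eval.run()

# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.
results_for_comparison = pd.concat(
    [results_for_comparison, pd.Series(result, name=run_identification).to_frame().T]
)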

15:04:00.380 Fitting pipeline
	Number of substrings BEFORE fit: 1180


Fitting SubstringsExtractor:   0%|          | 0/1180 [00:00<?, ?it/s]

	Number of substrings AFTER fit:  1180


Transforming SubstringsExtractor - columns:   0%|          | 0/1 [00:00<?, ?it/s]

Transforming SubstringsExtractor - substrings:   0%|          | 0/1180 [00:00<?, ?it/s]

15:07:22.702 Applying pipeline transformations


Transforming SubstringsExtractor - columns:   0%|          | 0/1 [00:00<?, ?it/s]

Transforming SubstringsExtractor - substrings:   0%|          | 0/1180 [00:00<?, ?it/s]

15:08:29.621 	Original shape:		(75000, 1180); 
		shape after removing NaNs: (74998, 1180).
15:08:29.918 	Original shape:		(25000, 1180); 
		shape after removing NaNs: (25000, 1180).
15:08:29.918 Fitting model
15:48:00.773 Predicting
15:48:08.026 Evaluating predictions
[[12102   398]
 [  708 11792]] 
 95.57600000000001 % accuracy
 96.73502871205906 % precision of 1 classes
 94.336 % recall of 1 classes

              precision    recall  f1-score   support

           0       0.94      0.97      0.96     12500
           1       0.97      0.94      0.96     12500

    accuracy                           0.96     25000
   macro avg       0.96      0.96      0.96     25000
weighted avg       0.96      0.96      0.96     25000

f1_score:
	0.9552045362494936


In [14]:
results_for_comparison

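|                 | accuracy_score | precision_score | recall_score | confusion_matrix             | f1_score |
|:----------------|---------------:|----------------:|-------------:|:-----------------------------|---------:|
| 1180 substrings |        0.95576 |         0.96735 |      0.94336 | [[12102, 398], [708, 11792]] | 0.955205 |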


| Dataset            | Accuracy (%) | F1 score (%) | Features        |
|:-------------------|-------------:|-------------:|:----------------|
| demo_human_or_worm |       95.224 |        95.26 | 4900 substrings |
| demo_human_or_worm |       95.576 |      95.5205 | 1180 substrings |
| demo_human_or_worm |            0 |            0 |                 |