# Genomic Data - HumanEnhancersCohn

This is one of the genomic datasets from the [genomic_benchmarks](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks) repository.
The classification task is evaluated using the _SeqRep_ package.

You can run this notebook in [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MIR-MU/seqrep/blob/main/examples/genomic_data/HumanEnhancersCohn.ipynb)
or
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/MIR-MU/seqrep/main?labpath=examples%2Fgenomic_data%2FHumanEnhancersCohn.ipynb).

## Install _SeqRep_ Package

In [None]:
!pip install seqrep


## Import Needed Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neural_network import MLPClassifier

!pip install icecream
from icecream import ic

from seqrep import *
from seqrep.feature_engineering import *
from seqrep.labeling import *
from seqrep.splitting import *
from seqrep.scaling import *
from seqrep.feature_reduction import *
from seqrep.evaluation import *
from seqrep.pipeline_evaluation import *



## Load or Download Data

In [None]:
!pip install genomic-benchmarks


In [None]:
from genomic_benchmarks.data_check import list_datasets

list_datasets()

['demo_mouse_enhancers',
 'human_nontata_promoters',
 'human_enhancers_cohn',
 'human_enhancers_ensembl',
 'demo_human_or_worm',
 'demo_coding_vs_intergenomic_seqs']

In [None]:
from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersCohn

X_train = HumanEnhancersCohn(split="train", version=0)
X_test = HumanEnhancersCohn(split="test", version=0)

y_train = pd.Series([y for _, y in X_train])
X_train = pd.DataFrame([x for x, _ in X_train], columns=["genom"])
y_test = pd.Series([y for _, y in X_test])
X_test = pd.DataFrame([x for x, _ in X_test], columns=["genom"])

Downloading 176563cDPQ5Y094WyoSBF02QjoVQhWuCh into /root/.genomic_benchmarks/human_enhancers_cohn.zip... Done.
Unzipping...Done.


In [None]:
# Optional: random shuffling (uncomment the lines below to enable)
# idx = np.random.permutation(len(y_train))
# X_train = X_train.iloc[idx, :].reset_index(drop=True)
# y_train = y_train.iloc[idx].reset_index(drop=True)

# idx = np.random.permutation(len(y_test))
# X_test = X_test.iloc[idx, :].reset_index(drop=True)
# y_test = y_test.iloc[idx].reset_index(drop=True)

# Show the sequences next to their labels.
X_train.join(y_train.rename("label"))

|       | genom | label |
|------:|:------|------:|
| 0     | CCCCCAGCTTTAAGCAGTTTCATAAGTAGATGTTAACAACTGTGTT... | 0 |
| 1     | TACCCATTGGGCAGGGAAGGAAGCTTGAGAAATCAGACTTGATTTT... | 0 |
| 2     | CTGATGCGGGTGGTCTGCAAACCACACTTGCAGCAACCCTGGCACA... | 0 |
| 3     | CATCCTCCTCCAGACACCGTCCCTTCTTCTGTCTCTGCATTTCCCA... | 0 |
| 4     | TGTTTATACAGTTTTCATGAGAATTTGCTTTGAAAAGCACTCAGCC... | 0 |
| ...   | ... | ... |
| 20838 | GTAGATGGCTGTATTCTCGTTGTATCCTCACACAGCAGAGAGCCGA... | 1 |
| 20839 | CCAGGAGGCGGAGGTTGCAGTGAGCTGAGATCGTGCCACTGCACTC... | 1 |
| 20840 | ATGTGAACCTTGCATTAAAATAGGGGAACATCACACACCGGGGCCT... | 1 |
| 20841 | CCAAATGTAAACTTCCCTTTAAAAAAATTTTTTTTGCAAGATAAAC... | 1 |


## Run Pipeline Evaluation

In [None]:
# This DataFrame collects the results of various runs for comparison.

# Uncomment following line if you want to clear the DataFrame with the results.
# del results_for_comparison

try:
    results_for_comparison
except NameError:
    print("Create new empty DataFrame.")
    results_for_comparison = pd.DataFrame()
else:
    print("DataFrame already exists!")

Create new empty DataFrame.


In [85]:
%%time
from typing import List, Union

from tqdm.auto import tqdm


class SubstringsExtractor(FeatureExtractor):
    def __init__(
        self,
        substrings: List,
        occurrences: Union[int, float] = 1,
        columns_to_apply: Union[str, List[str]] = None,
        return_original_columns: bool = False,
        normalize: bool = True,
        verbose: bool = True,
        inplace: bool = False,
    ):
        self.substrings = substrings
        self.occurrences = occurrences
        self.columns_to_apply = columns_to_apply
        self.return_original_columns = return_original_columns
        self.normalize = normalize
        self.verbose = verbose
        self.inplace = inplace

    def fit(self, X, y=None):
        # Default to all columns and always store the selection as a list.
        if self.columns_to_apply is None:
            self.columns_to_apply = list(X.columns)
        if isinstance(self.columns_to_apply, str):
            self.columns_to_apply = [self.columns_to_apply]
        if self.occurrences <= 0:
            return self
        if self.occurrences < 1:
            self.occurrences = int(self.occurrences * X.shape[0])
        if self.verbose:
            print(f"\tNumber of substrings BEFORE fit: {len(self.substrings)}")

        new_substrings = []
        for c in self.columns_to_apply:
            for s in tqdm(
                self.substrings, leave=False, desc="Fitting SubstringsExtractor"
            ):
                count = 0
                for x in X[c]:
                    if s in x:
                        count += 1
                        if count >= self.occurrences:
                            new_substrings.append(s)
                            break

        self.substrings = new_substrings

        if self.verbose:
            print(f"\tNumber of substrings AFTER fit:  {len(self.substrings)}")
        return self

    def transform(self, X):
        if not self.inplace:
            X = X.copy()
        for column in tqdm(
            self.columns_to_apply,
            leave=False,
            desc="Transforming SubstringsExtractor - columns",
        ):
            col_pref = column + "_" if len(self.columns_to_apply) > 1 else ""
            for substr in tqdm(
                self.substrings,
                leave=False,
                desc="Transforming SubstringsExtractor - substrings",
            ):
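                # Count non-overlapping occurrences; optionally divide by the sequence length.
                counts = X[column].str.count(substr)
                X.loc[:, f"{col_pref}count-{substr}"] = (
                    counts / X[column].str.len() if self.normalize else counts
                )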

        if self.return_original_columns:
            return X
        return X.drop(columns=self.columns_to_apply)


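# np.random.permutation(1000) selects the first 1000 training rows in random order.
# With occurrences=0.1, a substring is kept if it occurs in at least 100 of them.
tmp = SubstringsExtractor(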
    # substrings=["A", "C", "T", "G"]
    substrings=["".join(p) for i in range(1, 11) for p in product("ACTG", repeat=i)],
    occurrences=0.1,
).fit(X_train.iloc[np.random.permutation(1000), :])
# ).fit_transform(X_train)
tmp

	Number of substrings BEFORE fit: 1398100


	Number of substrings AFTER fit:  3520
CPU times: user 47min 16s, sys: 20.6 s, total: 47min 37s
Wall time: 47min 43s
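
The candidate count reported above and the filtering done in `fit` can be reproduced on a small scale. The following is a minimal sketch, not the notebook's implementation: the toy sequences, the 50 % threshold, and the names `toy`, `candidates`, `min_rows`, `kept`, and `ac_feature` are all assumptions made for illustration.

```python
from itertools import product

import pandas as pd

alphabet = "ACTG"

# Number of all substrings of length 1..10 over a 4-letter alphabet.
n_candidates = sum(len(alphabet) ** k for k in range(1, 11))
print(n_candidates)  # 1398100, the "BEFORE fit" count in the output above

# Hypothetical toy sequences (not taken from the dataset).
toy = pd.DataFrame({"genom": ["ACGTAC", "AACCGG", "TTTTAC", "GGGACA"]})

# Restrict candidates to lengths 1 and 2 to keep the example fast.
candidates = ["".join(p) for k in (1, 2) for p in product(alphabet, repeat=k)]

# A fractional threshold is converted to an absolute number of rows.
min_rows = int(0.5 * len(toy))

# Keep a substring if at least `min_rows` sequences contain it.
kept = [
    s
    for s in candidates
    if toy["genom"].str.contains(s, regex=False).sum() >= min_rows
]
print(kept)

# Feature for one substring: non-overlapping count divided by sequence length.
ac_feature = toy["genom"].str.count("AC") / toy["genom"].str.len()
print(ac_feature.tolist())
```

The sketch replaces the nested Python loops with vectorized pandas string methods, which is a design choice of this example only. Note that `str.count` counts non-overlapping matches, so `"AAA"` contains `"AA"` once.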


In [86]:
%%capture --no-stdout --no-display
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

run_identification = f"{len(tmp.substrings)} substrings"

# 1. step - define your pipeline
pipe = Pipeline(
    [
        (
            "fext_substr",
            SubstringsExtractor(
                substrings=tmp.substrings,
                occurrences=0,
            ),
        ),
        ("scale_u", UniversalScaler(scaler=MinMaxScaler())),
    ]
)

# 2. step - define your workflow
pipe_eval = PipelineEvaluator(
    pipeline=pipe,
    model=MLPClassifier(
        hidden_layer_sizes=(128, 32, 8),
        batch_size=32,
    ),
    evaluator=SequentialEvaluator(
        [
            ClassificationEvaluator(),
            UniversalEvaluator(metrics=[f1_score]),
        ]
    ),
)
# 3. step - pass the data and run the evaluation
pipe_eval.X_train = X_train.copy()
pipe_eval.y_train = y_train.copy()
pipe_eval.X_test = X_test.copy()
pipe_eval.y_test = y_test.copy()

result = pipe_eval.run()

# DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
results_for_comparison = pd.concat(
    [results_for_comparison, pd.Series(result, name=run_identification).to_frame().T]
)

11:34:49.830 Fitting pipeline


11:40:09.430 Applying pipeline transformations


11:42:05.444 	Original shape:		(20843, 3520); 
		shape after removing NaNs: (20843, 3520).
11:42:05.711 	Original shape:		(6948, 3520); 
		shape after removing NaNs: (6948, 3520).
11:42:05.712 Fitting model
12:05:28.650 Predicting
12:05:33.954 Evaluating predictions
[[2541  933]
 [1193 2281]] 
 69.40126655152562 % accuracy
 70.9707529558183 % precision of 1 classes
 65.65918249856074 % recall of 1 classes

              precision    recall  f1-score   support

           0       0.68      0.73      0.71      3474
           1       0.71      0.66      0.68      3474

    accuracy                           0.69      6948
   macro avg       0.70      0.69      0.69      6948
weighted avg       0.70      0.69      0.69      6948

f1_score:
	0.6821172248803827
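
The reported scores follow directly from the confusion matrix printed above. As a minimal sketch, assuming the scikit-learn layout (rows are true classes, columns are predicted classes) and treating class 1 as the positive class:

```python
import numpy as np

# Confusion matrix from the evaluation output above.
cm = np.array([[2541, 933], [1193, 2281]])

# scikit-learn layout for binary labels: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.6f}")
print(f"precision: {precision:.6f}")
print(f"recall:    {recall:.6f}")
print(f"f1:        {f1:.6f}")
```

The values agree with the accuracy, precision, recall, and F1 score in the output of this run.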


In [87]:
results_for_comparison

|                   | accuracy_score | precision_score | recall_score | confusion_matrix             | f1_score |
|:------------------|---------------:|----------------:|-------------:|:-----------------------------|---------:|
| 7906 substrings   | 0.659039 | 0.650258 | 0.688256 | [[2188, 1286], [1083, 2391]] | 0.668718 |
| 344932 substrings | 0.694876 | 0.729959 | 0.618595 | [[2679, 795], [1325, 2149]]  | 0.669679 |
| 344932 substrings | 0.684082 | 0.675542 | 0.708405 | [[2292, 1182], [1013, 2461]] | 0.691584 |
| 3520 substrings   | 0.694013 | 0.709708 | 0.656592 | [[2541, 933], [1193, 2281]]  | 0.682117 |


| Dataset              | Accuracy (%) | F1 score (%) | Run                    |
|:---------------------|-------------:|-------------:|:-----------------------|
| human_enhancers_cohn |      71.8768 |      70.7660 | 7906 substrings - comb |
| human_enhancers_cohn |      67.1704 |       67.521 | 4900 substrings        |
| human_enhancers_cohn |      69.9626 |      69.1409 | 1180 substrings        |
| human_enhancers_cohn |      68.4082 |      69.1584 | 3520 substrings        |