This notebook uses a Python kernel.

# Sequence Processing for DL

Author: Zhongyi (James) Guo <br>
Date: 10/29/2024

## Import Packages

In [None]:
import os
import pandas as pd
import numpy as np
import pickle

In [None]:
os.getcwd()

## Import Data

In [None]:
all_gene_sequence = pd.read_csv('../../result/deseq2/all_gene_sequence.tsv', sep='\t')
all_gene_sequence.head()

In [None]:
all_gene_sequence.shape

In [None]:
upstream_region = all_gene_sequence[['ensembl_gene_id', 'upstream_region']]
upstream_region.head()

## Quality Control

The promoter/enhancer region must contain only A, T, C, and G. Rows violating this rule will be dropped.

In [None]:
upstream_region_filtered = upstream_region[upstream_region['upstream_region'].str.fullmatch(r'[ATCG]+')]
upstream_region_filtered.head()
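To make the filter's behavior concrete, here is a minimal sketch on toy data. The gene IDs and sequences are hypothetical, and `na=False` is an addition not present in the cell above; it is assumed here so that missing sequences are treated as invalid instead of producing a non-boolean mask.

```python
import pandas as pd

# Toy data with hypothetical gene IDs (not taken from the real dataset)
toy = pd.DataFrame({
    'ensembl_gene_id': ['g1', 'g2', 'g3', 'g4'],
    'upstream_region': ['ATCG', 'ATNG', 'atcg', None],
})

# fullmatch requires the entire string to match the pattern;
# na=False (an assumption of this sketch) marks missing values as failures
mask = toy['upstream_region'].str.fullmatch(r'[ATCG]+', na=False)
toy_filtered = toy[mask]
print(toy_filtered['ensembl_gene_id'].tolist())
```

Note that the pattern is case-sensitive, so a soft-masked (lowercase) sequence such as `atcg` is dropped as well.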

<hr>

Next, we manually inspect the dropped rows.

In [None]:
dropped_rows = upstream_region[~upstream_region.index.isin(upstream_region_filtered.index)]
dropped_rows.head()

Many of the dropped rows contain the ambiguous base "N".
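The manual inspection could be backed by a count of which non-ACGT characters actually occur. The following is a minimal sketch; `count_invalid_characters` is a hypothetical helper and the toy sequences are made up for illustration.

```python
from collections import Counter

import pandas as pd


def count_invalid_characters(sequences, valid='ATCG'):
    """Count every character outside `valid` across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch not in valid)
    return counts


# Hypothetical dropped sequences, for illustration only
toy_dropped = pd.Series(['ATNG', 'NNNN', 'ATCGR'])
invalid_counts = count_invalid_characters(toy_dropped)
print(invalid_counts.most_common())
```

On the real data, this would presumably be called as `count_invalid_characters(dropped_rows['upstream_region'].dropna())`.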

<hr>

In [None]:
upstream_region_filtered.shape

Of the 56523 total genes, 55226 had valid promoter/enhancer sequences; the remaining 1297 were dropped.

## Label Genes

We label every significant gene with a DE status of 1 (DE = 1). All other genes are labeled 0 after the merge below.

In [None]:
sig_gene_sequence = pd.read_csv('../../result/deseq2/sig_gene_sequence.tsv', sep='\t')
sig_gene_sequence.head()

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
sig_gene_sequence = sig_gene_sequence[['ensembl_gene_id']].copy()
sig_gene_sequence['DE'] = 1

In [None]:
sig_gene_sequence

There are 8195 significant genes.

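Left join `sig_gene_sequence` onto `upstream_region_filtered` on the column `ensembl_gene_id`. Genes without a match in `sig_gene_sequence` get a missing `DE` value, which is then filled with 0.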

In [None]:
merged_df = upstream_region_filtered.merge(sig_gene_sequence, on='ensembl_gene_id', how='left')
merged_df['DE'] = merged_df['DE'].fillna(0).astype(int)

In [None]:
merged_df.head()

In [None]:
merged_df.shape

### Sanity Check

In [None]:
'ENSG00000000457' in sig_gene_sequence['ensembl_gene_id'].values

In [None]:
'ENSG00000001460' in sig_gene_sequence['ensembl_gene_id'].values

In [None]:
'ENSG00000000938' in sig_gene_sequence['ensembl_gene_id'].values
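The spot checks above could be complemented by programmatic assertions on the merge itself. This is a sketch on toy data with hypothetical gene IDs; `validate='many_to_one'` is an addition that assumes each gene ID appears at most once in the table of significant genes.

```python
import pandas as pd

# Toy stand-ins for upstream_region_filtered and sig_gene_sequence
genes = pd.DataFrame({
    'ensembl_gene_id': ['g1', 'g2', 'g3'],
    'upstream_region': ['ATCG', 'GGCC', 'TTAA'],
})
sig = pd.DataFrame({'ensembl_gene_id': ['g2', 'g9'], 'DE': [1, 1]})

# validate raises MergeError if a gene ID is duplicated on the right side
merged = genes.merge(sig, on='ensembl_gene_id', how='left', validate='many_to_one')
merged['DE'] = merged['DE'].fillna(0).astype(int)

# A left join against unique keys must preserve the row count
assert len(merged) == len(genes)

# Genes labeled 1 must be exactly those present in both tables
expected = set(genes['ensembl_gene_id']) & set(sig['ensembl_gene_id'])
assert set(merged.loc[merged['DE'] == 1, 'ensembl_gene_id']) == expected
```

Significant genes that were removed during quality control (like `g9` here) do not appear in the merged table, so the number of rows with DE = 1 can be smaller than the number of significant genes.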

## One Hot Encoding

In [None]:
def one_hot_encode(sequence):
    """One-hot encode a DNA sequence; columns are ordered A, T, C, G."""
    # Define mapping for each base
    encoding_dict = {
        'A': [1, 0, 0, 0],
        'T': [0, 1, 0, 0],
        'C': [0, 0, 1, 0],
        'G': [0, 0, 0, 1]
    }

    encoded_sequence = []
    for base in sequence:
        # Fail loudly on anything other than A, T, C, or G
        if base not in encoding_dict:
            raise KeyError(f"Invalid base: {base}")
        encoded_sequence.append(encoding_dict[base])

    return np.array(encoded_sequence)

In [None]:
merged_df['upstream_region_encoded'] = merged_df['upstream_region'].apply(one_hot_encode)
merged_df.head()
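For tens of thousands of sequences, a per-base Python loop can be slow. The sketch below shows one possible vectorized alternative using a NumPy lookup table; `one_hot_encode_fast` is a hypothetical name, and the `uint8` dtype is an assumption chosen to reduce memory compared with the default integer dtype.

```python
import numpy as np

# Lookup table indexed by ASCII code; rows for unknown characters stay all-zero.
# Column order matches the encoding above: A, T, C, G.
BASES = 'ATCG'
lookup = np.zeros((256, 4), dtype=np.uint8)
for i, base in enumerate(BASES):
    lookup[ord(base), i] = 1


def one_hot_encode_fast(sequence):
    """Vectorized one-hot encoding of an ASCII DNA string (hypothetical helper)."""
    codes = np.frombuffer(sequence.encode('ascii'), dtype=np.uint8)
    encoded = lookup[codes]
    # An all-zero row means the character was not one of A, T, C, G
    if not encoded.any(axis=1).all():
        raise KeyError("Invalid base in sequence")
    return encoded


encoded_example = one_hot_encode_fast('ATCG')
print(encoded_example.shape)
```

Both versions return an array of shape `(sequence_length, 4)`, so this one could be passed to `.apply` in the same way.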

In [None]:
merged_df.to_pickle('../../result/one_hot_encoding/gene_id_seq_label_ohe.pkl')
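A downstream notebook would typically reload the pickle and stack the per-gene arrays into a single tensor. The sketch below illustrates this round trip on toy data written to a temporary file. It assumes that all sequences have the same length, which the source does not state; if they do not, they would need padding or truncation first.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame mimicking the saved columns (hypothetical gene IDs)
toy = pd.DataFrame({
    'ensembl_gene_id': ['g1', 'g2'],
    'DE': [1, 0],
    'upstream_region_encoded': [np.eye(4, dtype=int), np.eye(4, dtype=int)[::-1]],
})

# Round trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_ohe.pkl')
    toy.to_pickle(path)
    loaded = pd.read_pickle(path)

# Stacking only works if every encoded sequence has the same length
lengths = loaded['upstream_region_encoded'].apply(len)
assert lengths.nunique() == 1, "Sequences differ in length; pad or truncate first"

X = np.stack(list(loaded['upstream_region_encoded']))  # (n_genes, seq_len, 4)
y = loaded['DE'].to_numpy()                            # (n_genes,)
print(X.shape, y.shape)
```

Only unpickle files from trusted sources, since loading a pickle can execute arbitrary code.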

## Conclusion

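In this notebook, we filtered the upstream sequences to those containing only A, T, C, and G, labeled each gene by DE status, one-hot-encoded the sequences, and saved the result as a pickle file.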

In [None]:
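# pkg_resources is deprecated; importlib.metadata (standard library, Python 3.8+) provides the same information
from importlib.metadata import distributions
installed_packages = {d.metadata['Name']: d.version for d in distributions()}
print(installed_packages)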