This notebook uses a Python kernel.

# Sequence Processing for DL

Author: Zhongyi (James) Guo <br>
Date: 10/29/2024

## Import Packages

In [1]:
import os
import pandas as pd
import numpy as np
import pickle

In [2]:
os.getcwd()

'/home/ubuntu/SCA-DL-DGE/script/seq_process'

## Import Data

In [3]:
all_gene_sequence = pd.read_csv('../../result/deseq2/all_gene_sequence.tsv.gz', sep='\t', compression='gzip')
all_gene_sequence.head()

Unnamed: 0,ensembl_gene_id,hgnc_symbol,chromosome_name,start_position,end_position,strand,upstream_start,upstream_end,upstream_region
0,ENSG00000000457,SCYL3,1,169849631,169894267,-1,169894268,169896267,ACATAAAATGTGGTGTATCCCTCTAGACTAGTATATGCAACTATTA...
1,ENSG00000000460,FIRRM,1,169662007,169854080,1,169660007,169662006,GGGGCAGGGGAAAGGAGAGCATTTCATTGTGAATCAAGGAATTTCT...
2,ENSG00000000938,FGR,1,27612064,27635185,-1,27635186,27637185,ACTAAATTGATTTCACATATGCAAGTTTTTGAAGTGCCCTGGATTA...
3,ENSG00000000971,CFH,1,196651754,196752476,1,196649754,196651753,CTGCTGCAATTAGTGAAACAAGGAACAGTGTTACCACATATGGTCC...
4,ENSG00000001460,STPG1,1,24356999,24416934,-1,24416935,24418934,GGAGGCCGTGTCCCCGCACTCGAGCTTAAGGACATCTGACAGGTGC...


In [4]:
all_gene_sequence.shape

(56523, 9)

In [5]:
upstream_region = all_gene_sequence[['ensembl_gene_id', 'upstream_region']]
upstream_region.head()

Unnamed: 0,ensembl_gene_id,upstream_region
0,ENSG00000000457,ACATAAAATGTGGTGTATCCCTCTAGACTAGTATATGCAACTATTA...
1,ENSG00000000460,GGGGCAGGGGAAAGGAGAGCATTTCATTGTGAATCAAGGAATTTCT...
2,ENSG00000000938,ACTAAATTGATTTCACATATGCAAGTTTTTGAAGTGCCCTGGATTA...
3,ENSG00000000971,CTGCTGCAATTAGTGAAACAAGGAACAGTGTTACCACATATGGTCC...
4,ENSG00000001460,GGAGGCCGTGTCCCCGCACTCGAGCTTAAGGACATCTGACAGGTGC...


## Quality Control

The promoter/enhancer region must contain only A, T, C, and G. Rows violating this rule will be dropped.

In [6]:
upstream_region_filtered = upstream_region[upstream_region['upstream_region'].str.fullmatch(r'[ATCG]+')]
upstream_region_filtered.head()

Unnamed: 0,ensembl_gene_id,upstream_region
0,ENSG00000000457,ACATAAAATGTGGTGTATCCCTCTAGACTAGTATATGCAACTATTA...
1,ENSG00000000460,GGGGCAGGGGAAAGGAGAGCATTTCATTGTGAATCAAGGAATTTCT...
2,ENSG00000000938,ACTAAATTGATTTCACATATGCAAGTTTTTGAAGTGCCCTGGATTA...
3,ENSG00000000971,CTGCTGCAATTAGTGAAACAAGGAACAGTGTTACCACATATGGTCC...
4,ENSG00000001460,GGAGGCCGTGTCCCCGCACTCGAGCTTAAGGACATCTGACAGGTGC...


<hr>

We will manually inspect dropped rows.

In [7]:
dropped_rows = upstream_region[~upstream_region.index.isin(upstream_region_filtered.index)]
dropped_rows.head()

Unnamed: 0,ensembl_gene_id,upstream_region
207,ENSG00000116198,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
534,ENSG00000131788,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
666,ENSG00000142606,ACCCCCAGGTGAGCATCTGGCAACCTGGAACAGCATCTACAGCCCC...
1090,ENSG00000162493,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
1287,ENSG00000169598,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...


Many of the dropped rows contain "N" (an unknown base).
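To quantify this instead of relying on the first five rows, one could tally every character outside A/T/C/G in the dropped sequences. The following is a minimal sketch on a toy frame; `toy` and `count_invalid_chars` are hypothetical names introduced for illustration, and the sequences are made up.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for `dropped_rows` (hypothetical data, not the real sequences)
toy = pd.DataFrame({
    "ensembl_gene_id": ["g1", "g2", "g3"],
    "upstream_region": ["NNNNAC", "ACGTRA", "ACNNGT"],
})

def count_invalid_chars(sequences):
    """Count every character outside A/T/C/G across an iterable of strings."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch not in "ATCG")
    return counts

invalid_counts = count_invalid_chars(toy["upstream_region"])
print(invalid_counts)
```

In the notebook, the same function could be applied to `dropped_rows['upstream_region']` to see whether anything other than "N" caused rows to be dropped.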

<hr>

In [8]:
upstream_region_filtered.shape

(55226, 2)

Of 56523 total genes, 55226 had valid promoter/enhancer sequences; the remaining 1297 were dropped.

## Label Genes

We label every significant (differentially expressed, DE) gene with DE = 1; all other genes will later be assigned DE = 0.

In [9]:
sig_gene_sequence = pd.read_csv('../../result/deseq2/sig_gene_sequence.tsv', sep='\t')
sig_gene_sequence.head()

Unnamed: 0,ensembl_gene_id,hgnc_symbol,chromosome_name,start_position,end_position,strand,upstream_start,upstream_end,... <- NULL
0,ENSG00000000971,CFH,1,196651754,196752476,1,196649754,196651753,CTGCTGCAATTAGTGAAACAAGGAACAGTGTTACCACATATGGTCC...
1,ENSG00000001617,SEMA3F,3,50155045,50189075,1,50153045,50155044,CCATGCTTCTGAGCTGATCTGAGGGGGCGAGGGGAGGCAGTGAGAC...
2,ENSG00000001630,CYP51A1,7,92084987,92134803,-1,92134804,92136803,TCATCCCATTGAGAATTTTAGATGGCTTACAACAAAATAAAAAAAG...
3,ENSG00000002330,BAD,11,64269830,64284704,-1,64284705,64286704,AGGAAGTCCATGGATGCATTACCAAGCAAGTGTCAATGTGAGCCAC...
4,ENSG00000002726,AOC1,7,150824627,150861504,1,150822627,150824626,GATGCAGTGACACCCTTTAGGGGGTGCCCGCTGCTTCCCGTTCTGT...


In [10]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
sig_gene_sequence = sig_gene_sequence[['ensembl_gene_id']].copy()
sig_gene_sequence['DE'] = 1

In [11]:
sig_gene_sequence

Unnamed: 0,ensembl_gene_id,DE
0,ENSG00000000971,1
1,ENSG00000001617,1
2,ENSG00000001630,1
3,ENSG00000002330,1
4,ENSG00000002726,1
...,...,...
4248,ENSG00000283541,1
4249,ENSG00000283930,1
4250,ENSG00000284070,1
4251,ENSG00000284128,1


There are 4252 significant genes (the index above runs from 0 to 4251).

Left-join `sig_gene_sequence` onto `upstream_region_filtered` on the column `ensembl_gene_id`. Genes without a match in `sig_gene_sequence` get a missing `DE` value, which we fill with 0.

In [12]:
merged_df = upstream_region_filtered.merge(sig_gene_sequence, on='ensembl_gene_id', how='left')
merged_df['DE'] = merged_df['DE'].fillna(0).astype(int)

In [13]:
merged_df.head()

Unnamed: 0,ensembl_gene_id,upstream_region,DE
0,ENSG00000000457,ACATAAAATGTGGTGTATCCCTCTAGACTAGTATATGCAACTATTA...,0
1,ENSG00000000460,GGGGCAGGGGAAAGGAGAGCATTTCATTGTGAATCAAGGAATTTCT...,0
2,ENSG00000000938,ACTAAATTGATTTCACATATGCAAGTTTTTGAAGTGCCCTGGATTA...,0
3,ENSG00000000971,CTGCTGCAATTAGTGAAACAAGGAACAGTGTTACCACATATGGTCC...,1
4,ENSG00000001460,GGAGGCCGTGTCCCCGCACTCGAGCTTAAGGACATCTGACAGGTGC...,0


In [14]:
merged_df.shape

(55226, 3)

### Sanity Check

In [15]:
'ENSG00000000457' in sig_gene_sequence['ensembl_gene_id'].values

False

In [16]:
'ENSG00000001460' in sig_gene_sequence['ensembl_gene_id'].values

False

In [17]:
'ENSG00000000938' in sig_gene_sequence['ensembl_gene_id'].values

False
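The three spot checks above could be generalized to every row. The sketch below rebuilds the labelling step on toy data and checks that `DE` equals 1 exactly for genes present in the significant list, and that the left join did not add rows. All `toy_*` names and gene IDs are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `upstream_region_filtered` and `sig_gene_sequence`
toy_genes = pd.DataFrame({
    "ensembl_gene_id": ["g1", "g2", "g3"],
    "upstream_region": ["ACGT", "TTGA", "CCAG"],
})
toy_sig = pd.DataFrame({"ensembl_gene_id": ["g2", "g4"], "DE": [1, 1]})

# Same labelling logic as in the notebook
toy_merged = toy_genes.merge(toy_sig, on="ensembl_gene_id", how="left")
toy_merged["DE"] = toy_merged["DE"].fillna(0).astype(int)

# DE should be 1 exactly when the gene appears in the significant list
expected = toy_merged["ensembl_gene_id"].isin(toy_sig["ensembl_gene_id"]).astype(int)
labels_ok = bool((toy_merged["DE"] == expected).all())

# A left join on a unique key should not change the number of rows
n_rows_unchanged = len(toy_merged) == len(toy_genes)

print(labels_ok, n_rows_unchanged)
```

If `ensembl_gene_id` were duplicated in the significant-gene table, the row count check would fail, which is a useful signal before training.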

Verify that every sequence in the `upstream_region` column is 2000 bp long, so that each one will encode to an array of shape (2000, 4).

In [18]:
lengths = [len(row) for row in merged_df['upstream_region']]
if len(set(lengths)) == 1:
    print("All rows have the same length:", lengths[0])
else:
    print("Rows have different lengths.")
    print("Unique lengths:", set(lengths))

Rows have different lengths.
Unique lengths: {2000, 546, 1827, 1896}


Some entries in `upstream_region` are not 2000 bp long, which we expected. This likely occurs because not every gene has at least 2000 bp of upstream sequence available; for example, a gene located close to a chromosome end has a truncated upstream region. We therefore remove genes whose upstream region is not exactly 2000 bp.

In [19]:
# Keep only 2000 bp sequences; copy so that later column assignments act on an independent frame
merged_df = merged_df[merged_df['upstream_region'].str.len() == 2000].copy()

In [20]:
merged_df

Unnamed: 0,ensembl_gene_id,upstream_region,DE
0,ENSG00000000457,ACATAAAATGTGGTGTATCCCTCTAGACTAGTATATGCAACTATTA...,0
1,ENSG00000000460,GGGGCAGGGGAAAGGAGAGCATTTCATTGTGAATCAAGGAATTTCT...,0
2,ENSG00000000938,ACTAAATTGATTTCACATATGCAAGTTTTTGAAGTGCCCTGGATTA...,0
3,ENSG00000000971,CTGCTGCAATTAGTGAAACAAGGAACAGTGTTACCACATATGGTCC...,1
4,ENSG00000001460,GGAGGCCGTGTCCCCGCACTCGAGCTTAAGGACATCTGACAGGTGC...,0
...,...,...,...
55221,ENSG00000284520,TGTAAGGGATTGAAATCATAGAGCACATTCTCTTACTCATAAGATG...,0
55222,ENSG00000284544,TGGAGTGCAATGGCGCGATTTTGGCTCACTGTAACCTCTGCCTCCC...,0
55223,ENSG00000284554,ACAGAGGCATCTTTCAAGAAGTTCTGGGCTGGATGTGGCTCATGCC...,0
55224,ENSG00000284568,AGACCCCTGCACTGAGAACTCAGCTCCCGGATGTGGGCGCTGTGCA...,0


In [21]:
55226 - 55223

3

Only 3 rows (genes) were removed.

## One Hot Encoding

In [22]:
def one_hot_encode(sequence):
    # Map each base to a one-hot row; column order is A, T, C, G
    encoding_dict = {
        'A': [1, 0, 0, 0],
        'T': [0, 1, 0, 0],
        'C': [0, 0, 1, 0],
        'G': [0, 0, 0, 1],
    }

    # Fail loudly on any character outside A/T/C/G
    for base in sequence:
        if base not in encoding_dict:
            raise KeyError(f"Invalid base: {base}")

    # Result has shape (len(sequence), 4)
    return np.array([encoding_dict[base] for base in sequence])

In [23]:
merged_df['upstream_region_encoded'] = merged_df['upstream_region'].apply(one_hot_encode)
merged_df.head()

Unnamed: 0,ensembl_gene_id,upstream_region,DE,upstream_region_encoded
0,ENSG00000000457,ACATAAAATGTGGTGTATCCCTCTAGACTAGTATATGCAACTATTA...,0,"[[1, 0, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0,..."
1,ENSG00000000460,GGGGCAGGGGAAAGGAGAGCATTTCATTGTGAATCAAGGAATTTCT...,0,"[[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0,..."
2,ENSG00000000938,ACTAAATTGATTTCACATATGCAAGTTTTTGAAGTGCCCTGGATTA...,0,"[[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1,..."
3,ENSG00000000971,CTGCTGCAATTAGTGAAACAAGGAACAGTGTTACCACATATGGTCC...,1,"[[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0,..."
4,ENSG00000001460,GGAGGCCGTGTCCCCGCACTCGAGCTTAAGGACATCTGACAGGTGC...,0,"[[0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0,..."
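Encoding tens of thousands of 2000 bp sequences base by base in pure Python can be slow. As an alternative (not the approach used in this notebook), the sketch below uses a NumPy lookup table indexed by ASCII code. `LOOKUP` and `one_hot_encode_fast` are hypothetical names; the column order A, T, C, G matches the function above, but the result has dtype `uint8` rather than the default integer type.

```python
import numpy as np

# Lookup table indexed by ASCII code; column order A, T, C, G as in the notebook
LOOKUP = np.zeros((256, 4), dtype=np.uint8)
for col, base in enumerate("ATCG"):
    LOOKUP[ord(base), col] = 1

def one_hot_encode_fast(sequence):
    """One-hot encode an ASCII DNA string into an array of shape (len, 4)."""
    codes = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    encoded = LOOKUP[codes]
    # An all-zero row means the character was not one of A/T/C/G
    if not encoded.any(axis=1).all():
        raise KeyError("Invalid base in sequence")
    return encoded

example = one_hot_encode_fast("ACGT")
print(example)
```

The indexing step `LOOKUP[codes]` encodes the whole sequence in one vectorized operation, so the per-base Python loop disappears.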


In [24]:
# remove the sequence column
merged_df_filtered = merged_df[['ensembl_gene_id', 'DE', 'upstream_region_encoded']]

In [25]:
# compress to save space
merged_df_filtered.to_pickle('../../result/one_hot_encoding/gene_id_label_ohe.pkl.gz', compression='gzip')
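A downstream training script would typically need to read this file back and stack the per-gene arrays into one tensor. The sketch below shows that round trip on a toy frame written to a temporary directory; the file name `toy_ohe.pkl.gz`, the helper `encode`, and the variables `X` and `y` are assumptions made for illustration, not part of the original pipeline.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def encode(seq):
    """Tiny one-hot encoder (column order A, T, C, G) for the toy example."""
    return np.array([[int(b == base) for base in "ATCG"] for b in seq])

# Toy frame with the same three columns that the notebook saves
toy = pd.DataFrame({
    "ensembl_gene_id": ["g1", "g2"],
    "DE": [0, 1],
    "seq": ["ACGT", "TTGA"],
})
toy["upstream_region_encoded"] = toy["seq"].apply(encode)
toy = toy[["ensembl_gene_id", "DE", "upstream_region_encoded"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_ohe.pkl.gz")
    toy.to_pickle(path, compression="gzip")
    loaded = pd.read_pickle(path, compression="gzip")

# Stack per-gene arrays into (n_genes, seq_len, 4); labels into (n_genes,)
X = np.stack(loaded["upstream_region_encoded"].tolist())
y = loaded["DE"].to_numpy()
print(X.shape, y.shape)
```

With the real file, `X` would be expected to have shape (n_genes, 2000, 4), since only 2000 bp sequences were kept.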

## Conclusion

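In this notebook, we filtered the upstream (promoter/enhancer) sequences, labelled each gene by DE status, one-hot-encoded the 2000 bp sequences, and saved the result as a gzip-compressed pickle file.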

In [26]:
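# pkg_resources is deprecated; importlib.metadata is the standard-library replacement
from importlib.metadata import distributions
installed_packages = {d.metadata['Name']: d.version for d in distributions()}
print(installed_packages)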

{'Automat': '24.8.1', 'Markdown': '3.7', 'MarkupSafe': '3.0.2', 'Protego': '0.3.1', 'PyDispatcher': '2.0.7', 'PyJWT': '2.9.0', 'PySocks': '1.7.1', 'PyYAML': '6.0.2', 'SQLAlchemy': '2.0.24', 'Scrapy': '2.11.2', 'SecretStorage': '3.3.3', 'Send2Trash': '1.8.3', 'absl-py': '2.1.0', 'aiohappyeyeballs': '2.4.3', 'aiohttp': '3.10.10', 'aiosignal': '1.3.1', 'amqp': '5.2.0', 'annotated-types': '0.7.0', 'anyio': '4.6.2.post1', 'argon2-cffi': '23.1.0', 'argon2-cffi-bindings': '21.2.0', 'arrow': '1.3.0', 'astroid': '3.3.5', 'asttokens': '2.4.1', 'astunparse': '1.6.3', 'async-lru': '2.0.4', 'attrs': '24.2.0', 'awscli': '1.35.14', 'babel': '2.16.0', 'beautifulsoup4': '4.12.3', 'billiard': '4.2.1', 'black': '24.10.0', 'bleach': '6.1.0', 'blinker': '1.8.2', 'bokeh': '3.6.0', 'boto3': '1.35.48', 'botocore': '1.35.48', 'build': '1.2.2.post1', 'celery': '5.4.0', 'certifi': '2024.8.30', 'cffi': '1.17.1', 'charset-normalizer': '3.4.0', 'click': '8.1.7', 'click-didyoumean': '0.3.1', 'click-plugins': '1.1.1'

