# Genome Sequencing
Authors:  
- Minh Duc Ngo  
- Catharina Hoppensack  

This notebook is our own work. Any other sources have been clearly marked and cited.

All authors contributed equally.

# 1 Environment
We worked with the following environment for this Jupyter notebook:

- Python version: (3.10.19) 3.11.9  
- OS: Windows 11  
- Environment: Visual Studio Code

In [69]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import torch
import torch.nn as nn

In [70]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

url="https://github.com/thomasmanke/DataSets/raw/refs/heads/main/labeled_sequences.csv.gz"

df = pd.read_csv(url)
seqs = df["sequence"].tolist()
y    = df["label"].values

S = len(seqs)     # number of sequences
L = len(seqs[0])  # sequence length; assumes all sequences have the same length

print(S)
print(L)

df.head()

20000
10


     sequence  label
0  GTAGGTAAGC      0
1  GGGGTATTTG      0
2  CACTTCCCTT      0
3  AATCCATAAG      0
4  GGCTTTTGCC      0


In [71]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sequence  20000 non-null  object
 1   label     20000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


In [72]:
print(df)
print(df.shape)
print(df.ndim)

         sequence  label
0      GTAGGTAAGC      0
1      GGGGTATTTG      0
2      CACTTCCCTT      0
3      AATCCATAAG      0
4      GGCTTTTGCC      0
...           ...    ...
19995  GGGGTGAGAG      0
19996  ACCGCGACTA      0
19997  TCCTGCGAAT      0
19998  GATGCAGGCT      0
19999  ATCATTACGT      1

[20000 rows x 2 columns]
(20000, 2)
2


In [73]:
print(type(df))
print(len(df[df["label"] == 1]))
onlyOne = df[df["label"] == 1]
noDuplicates = onlyOne.drop_duplicates()
print(len(noDuplicates))


<class 'pandas.core.frame.DataFrame'>
547
543


### Task 1
Load the dataset, inspect its structure and dimensions, and briefly describe its content.


- 20000 rows, 2 columns: `sequence` [object], `label` [int64]
- The dataset contains 20000 genome sequences, of which 19797 are unique, so 203 are duplicates.
- Each sequence is stored as a string of 10 letters drawn from the alphabet A, T, C, G.
- Each sequence has a label of either 1 or 0.
- 547 sequences have the label 1, including 4 duplicates.
- The DataFrame is 2-dimensional.
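
The counts in the list above can be checked with a few pandas calls. The following is a minimal sketch on a small made-up DataFrame (the name `toy_df` and its sequences are assumptions for illustration, not rows from the dataset); the same calls should carry over to `df`.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the real dataset
toy_df = pd.DataFrame({
    "sequence": ["ACGT", "ACGT", "TTGA", "CCGA", "TTGA", "GGGA"],
    "label":    [0, 0, 1, 0, 1, 0],
})

n_rows = len(toy_df)
n_unique = toy_df["sequence"].nunique()        # number of distinct sequences
n_duplicates = n_rows - n_unique               # rows beyond the first occurrence
label_counts = toy_df["label"].value_counts()  # number of rows per class

# Duplicated rows among the positives (label == 1)
positives = toy_df[toy_df["label"] == 1]
n_pos_duplicates = len(positives) - len(positives.drop_duplicates())

print(n_rows, n_unique, n_duplicates)
print(label_counts.to_dict())
print(n_pos_duplicates)
```

Counting duplicates on the `sequence` column alone treats two rows with the same sequence but different labels as duplicates, whereas `drop_duplicates()` on the whole frame does not; which definition is wanted depends on the question being asked.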



### Task 2
Transform the data into a form suitable for MLP input:
– Convert to numpy / torch tensors
– Flatten or reshape inputs as needed
– Split into training and test sets

- Flattening is unnecessary here; otherwise we would lose track of the information.
- Is reshaping necessary? Probably only if the model requires a particular input shape.
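
For comparison, a common alternative input representation for sequence data is one-hot encoding, where flattening does become necessary for an MLP. The sketch below is not the encoding used in this notebook; the names `example_seqs`, `char_to_idx`, and `X_flat` are assumptions for illustration, and the two sequences are made up.

```python
import numpy as np

# Hypothetical example sequences (not taken from the dataset)
example_seqs = ["ACGT", "TTGA"]
alphabet = "ACGT"
char_to_idx = {c: i for i, c in enumerate(alphabet)}

# Integer-encode each position: shape (S, L)
idx = np.array([[char_to_idx[c] for c in s] for s in example_seqs])

# One-hot encode by indexing into an identity matrix: shape (S, L, 4)
one_hot = np.eye(len(alphabet), dtype=np.float32)[idx]

# Flatten each sequence into one vector of length L * 4 for an MLP
X_flat = one_hot.reshape(len(example_seqs), -1)

print(idx.shape, one_hot.shape, X_flat.shape)
```

In this representation every position contributes four input features, so a sequence of length 10 would yield a vector of length 40.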

In [75]:
# Convert the sequences to numbers, since a neural network can only work with numbers
chars = sorted(set(''.join(seqs)))
print("Alphabet:", chars)
print("Alphabet size:", len(chars))

# Translation table: each nucleotide is mapped to one digit
mapping = str.maketrans({
    'A': '1',
    'C': '2',
    'G': '3',
    'T': '4'
})

# Quick check of the mapping: characters without an entry (B, space) stay unchanged
text = "AA BB CC"
result = text.translate(mapping)

print(result)


Alphabet: ['A', 'C', 'G', 'T']
Alphabet size: 4
11 BB 22


In [76]:

# Concatenate all sequence strings into one long string
seqs_combined = "".join(seqs)
print(seqs)

# Convert the combined string into a string of digits
result = seqs_combined.translate(mapping)
print(result)

# Split the digit string back into blocks of length L (10) and convert each block to int
seqs_number = [int(result[i:i+L]) for i in range(0, len(result), L)]
print(seqs_number)

['GTAGGTAAGC', 'GGGGTATTTG', 'CACTTCCCTT', 'AATCCATAAG', 'GGCTTTTGCC', 'GCGTGTTAGA', 'GGAAGCTATC', 'CCACACTTGT', 'GTATGGCATC', 'TTCCCCCTCA', ...]

In [78]:
X_train, X_test, y_train, y_test = train_test_split(
    seqs_number, y, test_size=0.2, random_state=42)
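
Since only 547 of the 20000 labels are 1, a random split can leave the two sets with slightly different class ratios. As a hedged sketch of one option (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class proportions. The toy arrays `X_toy` and `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 1000 samples, 5% positives
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 4, size=(1000, 10))
y_toy = np.array([1] * 50 + [0] * 950)

# stratify=y_toy keeps the share of positives equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```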


In [79]:
print(type(y_train))

<class 'numpy.ndarray'>


In [84]:
X_train_tensor = torch.tensor(X_train, dtype=torch.long)
X_test_tensor  = torch.tensor(X_test, dtype=torch.long)
y_train_tensor = torch.tensor(y_train, dtype=torch.int)
y_test_tensor  = torch.tensor(y_test, dtype=torch.int)
print(f'X_train_tensor.shape: {X_train_tensor.shape}  y_train_tensor.shape: {y_train_tensor.shape}')
print(f'X_test_tensor.shape:  {X_test_tensor.shape}   y_test_tensor.shape:  {y_test_tensor.shape}')



X_train_tensor.shape: torch.Size([16000])  y_train_tensor.shape: torch.Size([16000])
X_test_tensor.shape:  torch.Size([4000])   y_test_tensor.shape:  torch.Size([4000])
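
The shapes printed above show one integer value per sequence. As a hedged illustration (not part of the original notebook), the sketch below shows what an `nn.Linear` layer would need from such a tensor: a floating-point dtype and a 2-D shape `(batch, in_features)`. The tensor `x_long` is a small stand-in built from the digit encoding of the first three sequences; the layer size of 8 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

# Stand-in for a 1-D integer feature tensor like X_train_tensor
x_long = torch.tensor([3413341132, 3333414443, 2124422244], dtype=torch.long)

# nn.Linear expects floating-point input of shape (batch, in_features).
# Note: float32 has a 24-bit significand, so 10-digit integers may be rounded.
x_float = x_long.float().unsqueeze(1)   # shape (3, 1)

layer = nn.Linear(in_features=1, out_features=8)  # 8 is an arbitrary choice
out = layer(x_float)                               # shape (3, 8)

print(x_long.shape, x_float.shape, out.shape)
```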


### Task 3
Define an appropriate MLP architecture (e.g. with PyTorch `nn.Sequential`). Include number of layers, activation functions, and output dimensions. Report structure and number of parameters.
