# Pre-processing sequences

In this notebook we illustrate how to use utility functions implemented in the **utils.py** file to pre-process the sequences.

These functions transform a list of sequences (available in the **datachallenge-traindata.csv** file) into matrices that can be used by machine-learning and deep-learning algorithms.
More precisely, three types of matrices can be built:
* one-hot encoding matrices, as already seen in TP8 (exercise about CNNs and one-hot encoding)
* matrices of kmer tokens, as already seen in TP8 (exercise about CNNs and embeddings)
* matrices of kmer profiles, which are equivalent to the well-known "bag of words" representation in text analysis.

You can refer to the last lecture and to TP8 for more details.


## We first import packages

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# generic imports 
import pandas as pd
import numpy as np
# plotting (matplotlib and seaborn)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# utility functions
import utils

## 1. Load the training dataset

In [None]:
train_file = "../dataset/datachallenge-traindata.csv"
df_train = pd.read_csv(train_file, sep = ';')
df_train

### Extract sequences

In [None]:
# extract sequences
seqs = df_train["seq"].values

In [None]:
# print minimum and maximum sequence length
seq_len = [len(x) for x in seqs]
print("minimum / maximum sequence length = {} / {}".format(np.min(seq_len),np.max(seq_len)))

In [None]:
# show histogram
sns.histplot(x=seq_len)

## 2. Representing sequences using "one-hot-encoding" of ATGC bases

Note that we can choose either to **pad** short sequences up to the maximum sequence length (the `padd` argument), or to **truncate** long sequences down to the minimum sequence length (the `truncate` argument).

In [None]:
X = utils.build_onehot_matrix(seqs, padd = True, truncate = False)
print(X.shape)

In [None]:
X = utils.build_onehot_matrix(seqs, padd = False, truncate = True)
print(X.shape)
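To make the padding / truncation choice concrete, here is a minimal sketch of what a one-hot encoder could look like. It is not the code of **utils.py**: the function name `onehot_sketch`, its `pad` argument, the A/T/G/C column order and the choice to encode unknown symbols as all-zero rows are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical helper for illustration; NOT the utils.py implementation.
BASES = "ATGC"  # assumed column order


def onehot_sketch(seqs, pad=True):
    """Return an array of shape (n_sequences, target_length, 4)."""
    lengths = [len(s) for s in seqs]
    # pad up to the longest sequence, or truncate down to the shortest one
    target_len = max(lengths) if pad else min(lengths)
    base_index = {b: i for i, b in enumerate(BASES)}
    X = np.zeros((len(seqs), target_len, len(BASES)), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq[:target_len]):
            col = base_index.get(base)
            if col is not None:  # unknown symbols (e.g. N) stay all-zero
                X[i, j, col] = 1.0
    return X


toy_seqs = ["ATGCA", "GGT", "ACNT"]
X_pad = onehot_sketch(toy_seqs, pad=True)     # padded positions are all-zero
X_trunc = onehot_sketch(toy_seqs, pad=False)  # every sequence cut to 3 bases
print(X_pad.shape, X_trunc.shape)
```

With padding, the second dimension equals the longest sequence length; with truncation, it equals the shortest one.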

## 3. Representing sequences as vectors of kmer tokens / indices

We first build a dictionary associating an index to each kmer. Note that we only consider kmers made exclusively of A, T, G and C.

In [None]:
k = 7
kmer_dic = utils.build_kmer_dic(seqs, k)
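As an illustration of the idea, the sketch below builds such a dictionary on toy sequences. The function name `kmer_dic_sketch` is hypothetical, and starting the indices at 1 (keeping 0 free, for instance for padding) is an assumption; the actual **utils.py** function may number kmers differently.

```python
# Hypothetical helper for illustration; NOT the utils.py implementation.
def kmer_dic_sketch(seqs, k):
    """Map each kmer made only of A, T, G, C to an integer index."""
    alphabet = set("ATGC")
    dic = {}
    for seq in seqs:
        # slide a window of width k along the sequence
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= alphabet and kmer not in dic:
                dic[kmer] = len(dic) + 1  # assumption: index 0 is kept free
    return dic


toy_dic = kmer_dic_sketch(["ATGCA", "TGCNA"], k=3)
print(toy_dic)
```

Here the kmers `GCN` and `CNA` are skipped because they contain an `N`.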

We then extract a matrix containing sequences of kmer indices. Note that here too, we need to choose whether we want to **pad** short sequences or **truncate** long sequences.

In [None]:
X = utils.build_kmer_tokens_matrix(seqs, k, kmer_dic, padd = True, truncate = False)
print(X.shape)

In [None]:
X = utils.build_kmer_tokens_matrix(seqs, k, kmer_dic, padd = False, truncate = True)
print(X.shape)

We can then check that the values contained in the matrix lie between zero and the number of kmers found in the dictionary.

In [None]:
print("min/max value in X = {}/{} and number of kmers of the dictionary = {}".format(np.min(X), np.max(X), len(kmer_dic)))
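The following sketch shows one possible way to turn sequences into rows of kmer indices. It is only an illustration: the name `kmer_tokens_sketch`, its `pad` argument, the toy dictionary, and the use of index 0 both for padding and for kmers missing from the dictionary are assumptions, not the behaviour of **utils.py**.

```python
import numpy as np


# Hypothetical helper for illustration; NOT the utils.py implementation.
def kmer_tokens_sketch(seqs, k, kmer_dic, pad=True):
    """Return an integer matrix with one row of kmer indices per sequence."""
    # a sequence of length L contains L - k + 1 overlapping kmers
    n_tokens = [max(len(s) - k + 1, 0) for s in seqs]
    target = max(n_tokens) if pad else min(n_tokens)
    X = np.zeros((len(seqs), target), dtype=np.int64)
    for i, seq in enumerate(seqs):
        for j in range(min(n_tokens[i], target)):
            # assumption: 0 encodes both padding and unknown kmers
            X[i, j] = kmer_dic.get(seq[j:j + k], 0)
    return X


toy_dic = {"ATG": 1, "TGC": 2, "GCA": 3}  # toy dictionary, indices start at 1
toy_seqs = ["ATGCA", "TGCNA", "ATGC"]
X_tok_pad = kmer_tokens_sketch(toy_seqs, 3, toy_dic, pad=True)
X_tok_trunc = kmer_tokens_sketch(toy_seqs, 3, toy_dic, pad=False)
print(X_tok_pad)
print(X_tok_trunc)
```

Under these assumptions, every value lies between 0 and `len(toy_dic)`, which mirrors the check made above.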

## 4. Representing sequences as kmer profiles - vectors counting kmer occurrences

Likewise, we first need to build a dictionary associating an index to each kmer. Note that we only consider kmers made exclusively of A, T, G and C.

In [None]:
k = 7
kmer_dic = utils.build_kmer_dic(seqs, k)

We then extract a matrix containing kmer profiles. Each column of the matrix corresponds to a kmer of the dictionary, and counts the number of occurrences of this kmer in each sequence.

Note that here we don't need to pad or truncate sequences: a profile has one entry per kmer of the dictionary, whatever the sequence length.

In [None]:
X = utils.build_kmer_profile_matrix(seqs, k, kmer_dic)
print(X.shape)

In [None]:
print("number of columns of X = {} and number of kmers of the dictionary = {}".format(X.shape[1], len(kmer_dic)))
print("min/max value in X = {}/{} ".format(np.min(X), np.max(X)))
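A kmer profile can be sketched in a few lines, as shown below on toy data. The name `kmer_profile_sketch`, the toy dictionary and the rule used to order the columns (by increasing dictionary index) are assumptions for illustration; they are not taken from **utils.py**.

```python
import numpy as np


# Hypothetical helper for illustration; NOT the utils.py implementation.
def kmer_profile_sketch(seqs, k, kmer_dic):
    """Return a count matrix of shape (n_sequences, n_kmers)."""
    # assumption: columns follow the increasing order of dictionary indices
    ordered = sorted(kmer_dic, key=kmer_dic.get)
    columns = {kmer: col for col, kmer in enumerate(ordered)}
    X = np.zeros((len(seqs), len(columns)), dtype=np.int64)
    for i, seq in enumerate(seqs):
        for j in range(len(seq) - k + 1):
            col = columns.get(seq[j:j + k])
            if col is not None:  # kmers absent from the dictionary are ignored
                X[i, col] += 1
    return X


toy_dic = {"ATG": 1, "TGC": 2, "GCA": 3}
toy_seqs = ["ATGATG", "TGCNA", "GC"]
X_prof = kmer_profile_sketch(toy_seqs, 3, toy_dic)
print(X_prof)
```

All rows have the same length even though the toy sequences do not, which is why no padding or truncation is required for this representation.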