# Data preparation
---
This notebook contains the functions used to preprocess the datasets before the machine learning training performed in the project:
1. **amyloid_preprocess**
2. **gap_sequences**
3. **one_hot_encoding**

---
Import the libraries needed for dataset processing and manipulation.

In [27]:
import numpy as np
import pandas as pd
import random

---
Function performing initial preprocessing of a dataset by:
1. reading the CSV file from the path given as an argument,
2. extracting the columns with interactor and interactee sequences based on the names given as arguments,
3. removing hyphens (-) present in the initial sequences,
4. creating a DataFrame with interactor and interactee columns,
5. removing rows with incorrectly read data points ('nan'),
6. removing duplicated rows,
7. resetting the index,
8. returning the DataFrame with the processed data.

In [2]:
def amyloid_preprocess(csv_file, interactor_column, interactee_column, delimiter=";", skiprows=0):
  """
  Preprocess amyloid data from a CSV file and return it as a DataFrame.

  Args:
      csv_file (str): Path to the CSV file containing the amyloid data.
      interactor_column (str): Name of the column containing the interactor sequences.
      interactee_column (str): Name of the column containing the interactee sequences.
      delimiter (str, optional): Delimiter used in the CSV file. Defaults to ";".
      skiprows (int, optional): Number of initial rows to skip while reading the CSV file. Defaults to 0.

  Returns:
      pandas.DataFrame: Preprocessed DataFrame with columns 'interactor' and 'interactee'.
  """

  data_file = pd.read_csv(csv_file, delimiter=delimiter, skiprows=skiprows)

  interactor = data_file[interactor_column]
  interactee = data_file[interactee_column]

  new_interactor = []
  new_interactee = []

  # Strip alignment hyphens; str() turns missing values into the string 'nan'
  for left, right in zip(interactor, interactee):
    new_interactor.append(str(left).replace("-", ""))
    new_interactee.append(str(right).replace("-", ""))

  columns = ["interactor", "interactee"]
  data = pd.DataFrame(list(zip(new_interactor, new_interactee)), columns=columns)

  data.replace('nan', np.nan, inplace=True)
  data.dropna(inplace=True)
  data.drop_duplicates(inplace=True)
  data.reset_index(drop=True, inplace=True)

  return data
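
The same steps can also be expressed with pandas string methods instead of an explicit loop. The following is only a sketch: the name `amyloid_preprocess_vectorized` and the toy CSV content are made up for illustration, and it assumes both columns are read as text (the loop version above additionally converts non-string values with `str()`).

```python
import io

import pandas as pd


def amyloid_preprocess_vectorized(csv_file, interactor_column, interactee_column,
                                  delimiter=";", skiprows=0):
    """Hypothetical vectorized counterpart of amyloid_preprocess."""
    data_file = pd.read_csv(csv_file, delimiter=delimiter, skiprows=skiprows)

    data = pd.DataFrame({
        # .str.replace works element-wise and leaves missing values as NaN
        "interactor": data_file[interactor_column].str.replace("-", "", regex=False),
        "interactee": data_file[interactee_column].str.replace("-", "", regex=False),
    })

    # Drop incomplete pairs and duplicated pairs, then renumber the rows
    return data.dropna().drop_duplicates().reset_index(drop=True)


# Toy input (made up for illustration, not taken from the project files)
toy_csv = io.StringIO(
    "a;b\n"
    "GR-NS;TT--NS\n"
    "GRNS;TTNS\n"
    ";TTNS\n"
    "AK-D;VE-T\n"
)
toy = amyloid_preprocess_vectorized(toy_csv, "a", "b")
print(toy)
```

In the toy input, the first two rows become identical once hyphens are removed and the third row lacks an interactor, so two pairs remain.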

---
Call the *amyloid_preprocess* function on the "HRAM.csv" file, which contains interacting sequence pairs of the HRAM amyloid motif.

In [8]:
HRAM_data = amyloid_preprocess("HRAM.csv", "motif_1_aligned", "motif_2_aligned")

print("Rows: ", HRAM_data.shape[0])
print("Columns: ", HRAM_data.shape[1])
print("First 5 rows of data:")
print(HRAM_data.head())

Rows:  146
Columns:  2
First 5 rows of data:
              interactor             interactee
0  GRNSAKDIRTEERARVQLGNV  TTNSVETVVGKGESRVLIGNE
1  GRNSAKDIRTEKRARVQLGNV  TTNSVETVVGKGESKVLIGNE
2  VRIYAKDIKSEEMARVRVGNE  TVSRVDSVAARGKSAVHIGHQ
3    GKNSAGRINGPGMVNIGNS  TVNHVDEINTAEPSRVHIGNT
4  HRIKIGKVTQASNAKAVIGVH  MNVEIDDVSVGPGSWSLVGVS


---
Call the *amyloid_preprocess* function on the "BASS.csv" file, which contains interacting sequence pairs of the BASS amyloid motif. The first row of the original file needs to be skipped in order to correctly collect the interactor and interactee sequences.

In [9]:
BASS_data = amyloid_preprocess("BASS.csv", "Bell-side", "NLR-side", skiprows=1)

print("Rows: ", BASS_data.shape[0])
print("Columns: ", BASS_data.shape[1])
print("First 5 rows of data:")
print(BASS_data.head())

Rows:  279
Columns:  2
First 5 rows of data:
                    interactor              interactee
0  RHVHLRARASGSARIYQAGRDQHITER  MEGRASGSARIYQAGGDQYIEE
1  RDVHLRARASGSARIYQAGRDQHITER  MEGRASGSARIYQSGGDQYIEE
2  RHVHLRARASGSARIYQAGRDQHITER  MEGRASGSARIYQTGGDQYIEE
3    VHLRARASGSARIYQAGRDQHITER  MEGRASGSARIYQSGGDQYIEE
4  RSVHLRARASGSARIYQAGRDQHITER  MEGRASGSARIYQTGGDQYIEE
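
The effect of `skiprows` can be illustrated on a small in-memory file. The layout below (one extra line above the real header) is an assumption made for the example; the actual structure of "BASS.csv" is not shown in this notebook.

```python
import io

import pandas as pd

# Toy file (made up): one extra line precedes the real header row
raw = (
    "Supplementary table;v1\n"
    "Bell-side;NLR-side\n"
    "RH-VH;ME-GR\n"
)

# Without skiprows, the extra line is taken as the header
wrong = pd.read_csv(io.StringIO(raw), delimiter=";")
print(list(wrong.columns))

# With skiprows=1, the second line becomes the header
right = pd.read_csv(io.StringIO(raw), delimiter=";", skiprows=1)
print(list(right.columns))
```

With the wrong header, selecting the "Bell-side" column by name would fail with a `KeyError`, which is why the number of skipped rows has to match the file.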



---
Call the *amyloid_preprocess* function on the "many.csv" file, which contains interacting sequence pairs of several different amyloid motifs. The first two rows of the original file need to be skipped in order to correctly collect the interactor and interactee sequences.

In [10]:
many_data = amyloid_preprocess("many.csv", "i/r_seq_aln", "i/e_seq_aln", skiprows=2)

print("Rows: ", many_data.shape[0])
print("Columns: ", many_data.shape[1])
print("First 5 rows of data:")
print(many_data.head())

Rows:  175
Columns:  2
First 5 rows of data:
                   interactor                 interactee
0        NQVRSIHAEGQARVHVGNSY      GRNSAKDIRTEKRARVQLGNV
1        NQVRSIHAEGQARVHVGNSY   TTNSVETVVGKGESKVLIGNEYGG
2     SFGDNNSGFQAGIINGAVNTNFY           SFGSQNSGFQAGIING
3  NNSGTGTQNNNSGAGRQYNAHTITIV  NNTGSGTQNNNSGDGRQYNAQTMNF
4          VQSLNASGSSRVHVGNSY  STVNHVNEVNTTEPSRVNIGNTWGG


---
Merge the three created DataFrames into one dataset called "merged".

In [11]:
merged_data = pd.concat([HRAM_data, BASS_data, many_data], ignore_index=True)

print("Rows: ", merged_data.shape[0])
print("Columns: ", merged_data.shape[1])
print("First 5 rows of data:")
print(merged_data.head())

Rows:  600
Columns:  2
First 5 rows of data:
              interactor             interactee
0  GRNSAKDIRTEERARVQLGNV  TTNSVETVVGKGESRVLIGNE
1  GRNSAKDIRTEKRARVQLGNV  TTNSVETVVGKGESKVLIGNE
2  VRIYAKDIKSEEMARVRVGNE  TVSRVDSVAARGKSAVHIGHQ
3    GKNSAGRINGPGMVNIGNS  TVNHVDEINTAEPSRVHIGNT
4  HRIKIGKVTQASNAKAVIGVH  MNVEIDDVSVGPGSWSLVGVS
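
Concatenation keeps every row of every input, so a pair that occurs in more than one source file would stay in the merged dataset twice. The sketch below uses made-up toy frames to show how `ignore_index=True` renumbers the rows and how such shared pairs could be counted and, if desired, removed. Whether that is appropriate for this project is a separate decision.

```python
import pandas as pd

# Toy frames (made up for illustration); the pair ("AKD", "VET") occurs in both
a = pd.DataFrame({"interactor": ["GRNS", "AKD"], "interactee": ["TTNS", "VET"]})
b = pd.DataFrame({"interactor": ["AKD", "HRIK"], "interactee": ["VET", "MNVE"]})

# ignore_index=True replaces the original row labels with 0..n-1
merged = pd.concat([a, b], ignore_index=True)
print(merged.index.tolist())

# Count rows that repeat an earlier row, then optionally drop them
n_shared = int(merged.duplicated().sum())
deduplicated = merged.drop_duplicates().reset_index(drop=True)
print(n_shared, len(deduplicated))
```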


---
Save the "merged" dataset to a CSV file for use in the machine learning training performed during the project.

In [None]:
merged_data.to_csv("merged_data.csv", index=False)

---
Function putting gaps into sequences by:
1. iterating over the number of repeats,
2. iterating over the sequences in the list,
3. getting the indices of the elements in the sequence,
4. randomly selecting the indices where gaps will be placed, based on the "gaps" argument,
5. iterating over the sampled indices and replacing the residue at each one with an underscore (_) as a representation of a gap,
6. returning the lists of full and gapped sequences.

In [29]:
def gap_sequences(column, gaps, repeat=1):
  """
    Generate gapped sequences based on the input column.

    Args:
        column (list): List of sequences.
        gaps (int): Number of positions in each sequence to replace with a gap ('_').
            Must not exceed the length of the shortest sequence.
        repeat (int, optional): Number of times to repeat the process. Defaults to 1.

    Returns:
        tuple: A tuple containing two lists:
            - seqs: The original sequences from the input column.
            - gapped: The sequences with the selected residues replaced by gaps.

    The function generates gapped sequences by replacing randomly chosen residues
    of each sequence with '_', so the sequence length is unchanged.
    The number of replaced positions is determined by the 'gaps' parameter.

    The process can be repeated multiple times by specifying the 'repeat' parameter.
    The resulting sequences are collected in the 'gapped' list.

    The original sequences from the input column are stored in the 'seqs' list.

    Both 'seqs' and 'gapped' lists are returned as a tuple.
  """
  inds = []
  seqs = []
  gapped = []

  for _ in range(repeat):
    for seq in column:
      seqs.append(seq)
      inds = list(range(len(seq)))

      # Pick distinct positions; raises ValueError if gaps > len(seq)
      samp = random.sample(inds, gaps)

      for index in samp:
        seq = seq[:index] + "_" + seq[index + 1:]

      gapped.append(seq)

  return seqs, gapped

---
Call the *gap_sequences* function on the interactor sequences from the "merged" dataset.

In [32]:
seq_list = merged_data["interactor"].values

full, gapped = gap_sequences(seq_list, 5)

print(full[0:5])
print(gapped[0:5])

['GRNSAKDIRTEERARVQLGNV', 'GRNSAKDIRTEKRARVQLGNV', 'VRIYAKDIKSEEMARVRVGNE', 'GKNSAGRINGPGMVNIGNS', 'HRIKIGKVTQASNAKAVIGVH']
['G_NS_KDIRTE_R_R_QLGNV', 'GRN_AK_IRT_K_A_VQLGNV', 'VRIY_KDI__EEMA_V_VGNE', 'G_NSAGRI__PGMVN_GN_', '_RIKIG_VTQ_SN_KAV_GVH']
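
Because the positions are drawn at random, the gapped sequences differ between runs. If reproducible output is needed, one option is to draw from a seeded generator. The sketch below is an illustration rather than part of the project code: the helper name `mask_positions` and the seed value are arbitrary choices.

```python
import random


def mask_positions(seq, gaps, rng):
    """Replace `gaps` distinct positions of `seq` with '_' (hypothetical helper)."""
    if gaps > len(seq):
        raise ValueError(
            f"cannot mask {gaps} positions in a sequence of length {len(seq)}"
        )
    masked = list(seq)
    for index in rng.sample(range(len(seq)), gaps):
        masked[index] = "_"
    return "".join(masked)


toy = ["GRNSAKDIRTEERARVQLGNV", "GKNSAGRINGPGMVNIGNS"]

# A local generator keeps the seed from affecting other uses of `random`
rng = random.Random(0)
masked = [mask_positions(seq, 5, rng) for seq in toy]

# The same seed reproduces the same gapped sequences
rng_again = random.Random(0)
masked_again = [mask_positions(seq, 5, rng_again) for seq in toy]
print(masked == masked_again)
```

The explicit length check turns the generic error from `random.sample` into a message that names the offending sequence length.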


---
Function performing one-hot encoding on a list of sequences by:
1. determining the number of rows (n_rows),
2. determining the maximal sequence length (max_seq_len),
3. taking the greater of the maximal sequence length and the "length" argument to determine the number of positions in the encoded sequences,
4. creating a 3D array filled with zeros,
5. iterating over the sequences and their amino acids in a nested loop,
6. setting the array positions corresponding to the amino acids to 1,
7. returning the array with the one-hot encoded representation of the initial list of sequences.

In [28]:
char_to_int = {'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4, 'G': 5, 'H': 6,
               'I': 7, 'K': 8, 'L': 9, 'M': 10, 'N': 11, 'P': 12,
               'Q': 13, 'R': 14, 'S': 15, 'T': 16, 'V': 17, 'W': 18,
               'Y': 19, '_': 20}

def one_hot_encoding(seqs, length=1):
  """
    Perform one-hot encoding on a list of sequences.

    Args:
        seqs (list): List of sequences to be encoded.
        length (int, optional): Minimum length of the encoded sequences; shorter sequences are padded with all-zero positions. Defaults to 1.

    Returns:
        numpy.ndarray: One-hot encoded representation of the sequences.
            The array shape is (n_rows, max_len, n_classes), where:
                - n_rows is the number of sequences in seqs.
                - max_len is the maximum length among all sequences in seqs or length, whichever is greater.
                - n_classes is the number of characters in char_to_int (20 amino acids plus the gap character '_').

  """
  n_rows = len(seqs)
  max_seq_len = max(len(seqs[i]) for i in range(n_rows))
  max_len = max(max_seq_len, length)
  n_classes = len(char_to_int)

  encoded = np.zeros((n_rows, max_len, n_classes))

  for i in range(n_rows):
    seq = seqs[i]
    for j, letter in enumerate(seq):
      encoded[i, j, char_to_int[letter.upper()]] = 1

  return encoded

---
Call the *one_hot_encoding* function on the interactor sequences from the "merged" dataset.



In [26]:
seq_list = merged_data["interactor"].values

one_hot_encoded = one_hot_encoding(seq_list)

print("Number of sequences: ", one_hot_encoded.shape[0])
print("Length of encoded sequences: ", one_hot_encoded.shape[1])
print("Possible amino acids at a sequence position: ", one_hot_encoded.shape[2])
print("Vector encoding a G amino acid: ", one_hot_encoded[0][0])

Number of sequences:  600
Length of encoded sequences:  65
Possible amino acids at a sequence position:  21
Vector encoding a G amino acid:  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
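
To check an encoded array, it can be mapped back to strings. The following is a minimal sketch of such an inverse, not a function from the project: the name `one_hot_decode` is hypothetical, the alphabet string repeats the ordering of `char_to_int`, and all-zero positions are assumed to be padding.

```python
import numpy as np

# Same character order as char_to_int (index 20 is the gap character)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY_"


def one_hot_decode(encoded):
    """Hypothetical inverse of one_hot_encoding; all-zero positions are padding."""
    decoded = []
    for matrix in encoded:
        used = matrix.sum(axis=1) > 0          # positions that hold a character
        indices = matrix[used].argmax(axis=1)  # class index at each used position
        decoded.append("".join(ALPHABET[k] for k in indices))
    return decoded


# Hand-built toy array: two sequences padded to length 3
toy = np.zeros((2, 3, len(ALPHABET)))
for i, seq in enumerate(["G_A", "RV"]):
    for j, letter in enumerate(seq):
        toy[i, j, ALPHABET.index(letter)] = 1

print(one_hot_decode(toy))  # ['G_A', 'RV']
```

The padding check matters because `argmax` on an all-zero position returns index 0, which would otherwise be decoded as 'A'.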
