# Processing
This notebook walks through the processing for the 3 different input forms, and the trial and error taken to arrive at them.

# NOTE (DEPRECATION OF FORMS)
Most of this notebook is now deprecated. I originally used previous work as a blueprint, which is how these 3 forms came to be. However, after further inspection of that paper's code, I realized that our data is different, and the representations used to build the forms are not applicable to it. As a result, forms 1 and 2 (and the 2 subsequent types) are being removed. Form 3 is being kept and used, and will still be referred to in code as form 3 so that it matches other documentation.

In [4]:
%pip install numpy matplotlib pandas biopython keras tensorflow

Collecting tensorflow
  Downloading tensorflow-2.4.1-cp38-cp38-win_amd64.whl (370.7 MB)
Collecting wrapt~=1.12.1
  Using cached wrapt-1.12.1.tar.gz (27 kB)
Collecting tensorboard~=2.4
  Downloading tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
Collecting gast==0.3.3
  Using cached gast-0.3.3-py2.py3-none-any.whl (9.7 kB)
Collecting grpcio~=1.32.0
  Downloading grpcio-1.32.0-cp38-cp38-win_amd64.whl (2.6 MB)
Collecting keras-preprocessing~=1.1.2
  Using cached Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Collecting tensorflow-estimator<2.5.0,>=2.4.0
  Using cached tensorflow_estimator-2.4.0-py2.py3-none-any.whl (462 kB)
Collecting flatbuffers~=1.12.0
  Using cached flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Collecting astunparse~=1.6.3
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting termcolor~=1.1.0
  Using cached termcolor-1.1.0.tar.gz (3.9 kB)
Collecting opt-einsum~=3.3.0
  Using cached opt_einsum-3.3.0-py3-none-any.whl (65 kB)
Collecting wheel~=0.35

In [5]:
# Load all necessary libraries
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences

# Bio libraries needed
from Bio.Align import substitution_matrices

**Load needed variables**

In [7]:
data_dir = "../data/test/"
blosum62 = substitution_matrices.load('BLOSUM62')

# Form 1
Each annotation is converted into a score, giving one list of scores per isolate, using the following rules:
- Substitutions are scored using the BLOSUM-62 matrix
- Frame shifts                    = -10
- Inserts (no matter length)      = -5
- Deletions (no matter length)    = -5
- Duplications (no matter length) = -2

## Getting data
First, we need to get the data. The first column of the CSV file should be Isolate_ID, but it may not be unique, as 1 isolate can have many annotations.

In [8]:
ann_df = pd.read_csv(f'{data_dir}export_69.csv')

In [9]:
# See what the original data annotations look like
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Type,Ref,Position(s),AA,% coverage
0,1,SHV-12,sub,G,232,R,25
1,2,SHV-12,fs,I,9,,100


## Dropping columns
We do not care about the `% coverage` column, so we can drop it. Even though Isolate_ID is not unique, we still want to keep it so we can group by this column later, as each datapoint will be the list of annotations that share an Isolate_ID.

In [10]:
ann_df = ann_df.drop(['% coverage'], axis=1)
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Type,Ref,Position(s),AA
0,1,SHV-12,sub,G,232,R
1,2,SHV-12,fs,I,9,


## Create Score column
Next, we need to create the score column that we will use later.

### Score function
To make it more clear, we can make a score function that will calculate each annotation's score.

In [11]:
def calc_score(row):
    if row.Type == 'sub':
        return blosum62[row.Ref, row.AA]    # BLOSUM62 is symmetric, so the order of Ref and AA does not matter
    elif row.Type == 'fs':
        return -10
    elif row.Type == 'ins':
        return -5
    elif row.Type == 'del':
        return -5
    elif row.Type == 'dup':
        return -2
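
One thing to keep in mind with `calc_score`: a Type outside the five handled values falls through every branch and returns `None`, which leaves a missing value in the Score column. Below is a minimal sketch of a stricter variant that raises instead. The names `PENALTIES`, `TOY_MATRIX`, and `calc_score_strict` are hypothetical, and `TOY_MATRIX` is only a one-entry stand-in for the real BLOSUM62 matrix so that the example runs without Biopython.

```python
# Fixed penalties taken from the rule list for form 1.
PENALTIES = {"fs": -10, "ins": -5, "del": -5, "dup": -2}

# Hypothetical stand-in for BLOSUM62: only the single pair used below.
TOY_MATRIX = {("G", "R"): -2}


def calc_score_strict(ann_type, ref, aa, matrix=TOY_MATRIX):
    """Score one annotation, raising on annotation types that are not handled."""
    if ann_type == "sub":
        return matrix[(ref, aa)]
    if ann_type in PENALTIES:
        return PENALTIES[ann_type]
    raise ValueError(f"Unknown annotation type: {ann_type!r}")


scores = [
    calc_score_strict("sub", "G", "R"),  # substitution G -> R
    calc_score_strict("fs", "I", None),  # frameshift, no AA given
]
print(scores)  # [-2, -10]
```

Failing loudly here is a design choice: a silent missing score would otherwise only show up later, when the lists are padded or fed to a model.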

### Apply the score function
Here, the score function is applied over all rows (axis = 1).

In [12]:
ann_df['Score'] = ann_df.apply(calc_score, axis=1)

In [13]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Type,Ref,Position(s),AA,Score
0,1,SHV-12,sub,G,232,R,-2.0
1,2,SHV-12,fs,I,9,,-10.0


## Drop columns
Now, we do not need Type, Ref, Position(s), and AA.

In [14]:
ann_df = ann_df.drop(['Type', 'Ref', 'Position(s)', 'AA'], axis=1)

In [15]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Score
0,1,SHV-12,-2.0
1,2,SHV-12,-10.0


## Group all scores into a list per isolate/gene
Our final dataframe needs to have each row being a single datapoint representing an isolate and gene combination. The datapoint's value should be the list of annotations found. We want to go over a few options to see which is more efficient for the amount of data we have.

Code below found in [this StackOverflow answer](https://stackoverflow.com/a/22221675).

In [118]:
# First, group by isolate ID and gene so that all scores for a pair end up together
# Second, select the Score column of each group
# Third, turn each group's scores into a list
# Finally, reset the index so that each isolate/gene pair becomes its own row
%timeit ann_df.groupby(['Isolate_ID', 'Gene'])['Score'].apply(list).reset_index(name='ann_lists')

6.84 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


This next option is more convenient, but it is about twice as slow and has weird output ([found here](https://stackoverflow.com/a/66853746)).

In [119]:
%timeit pd.pivot_table(ann_df, values='Score', index=['Isolate_ID', 'Gene'], aggfunc={'Score': list})

13.2 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Trying the same line as the first one, but this time using `np.array` instead of `list`. It seems to be about the same time as just using `list`.

In [120]:
%timeit ann_df.groupby(['Isolate_ID', 'Gene'])['Score'].apply(np.array).reset_index(name='ann_lists')

6.97 ms ± 98.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Since this will only be run once to generate the datasets, I will use the first option for convenience and ease of reading.

In [16]:
ann_df = ann_df.groupby(['Isolate_ID', 'Gene'])['Score'].apply(list).reset_index(name='ann_lists')

In [17]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,ann_lists
0,1,SHV-12,[-2.0]
1,2,SHV-12,[-10.0]
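
The test data above only has one annotation per isolate, so the grouping is hard to see. Here is a small sketch of the same `groupby` pattern on made-up rows (the `toy_df` values, including the second gene name, are invented for illustration) where some isolate/gene pairs have several scores.

```python
import pandas as pd

# Made-up scores: isolate 1 has two annotations on one gene,
# isolate 2 has annotations on two different genes.
toy_df = pd.DataFrame({
    "Isolate_ID": [1, 1, 2, 2, 2],
    "Gene": ["SHV-12", "SHV-12", "SHV-12", "TEM-1", "TEM-1"],
    "Score": [-2.0, -10.0, -5.0, 1.0, -2.0],
})

# Same pattern as the cell above: one row per (Isolate_ID, Gene) pair,
# with all of that pair's scores collected into a list.
grouped = (
    toy_df.groupby(["Isolate_ID", "Gene"])["Score"]
    .apply(list)
    .reset_index(name="ann_lists")
)
print(grouped)
```

Each row of `grouped` is one datapoint, and the lists can have different lengths, which is what the padding update further down deals with.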


## To CSV
Lastly, we want to export as CSV without the indices. We do not need the pandas pre-made indices, so we can leave them out of the CSV file. Since this file is just for testing, it will be saved to the current folder.

In [123]:
ann_df.to_csv('form_1.csv', index=False)

Test to see if the file can be loaded, and what it will look like if it is.

In [124]:
pd.read_csv('form_1.csv').head()

Unnamed: 0,Isolate_ID,Gene,ann_lists
0,1,SHV-12,[-2.0]
1,2,SHV-12,[-10.0]
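
Note that a list column does not survive a CSV round trip as a list: when read back, each cell is the string form of the list. A minimal sketch of this, and of one way to parse it back, is below. It uses an in-memory buffer and made-up rows instead of `form_1.csv`, and passing `ast.literal_eval` as a converter is my suggestion, not something the notebook does.

```python
import ast
import io

import pandas as pd

# Made-up frame with a list column, shaped like the form 1 output.
toy_df = pd.DataFrame({
    "Isolate_ID": [1, 2],
    "Gene": ["SHV-12", "SHV-12"],
    "ann_lists": [[-2.0, -5.0], [-10.0]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# A plain read gives back strings such as "[-2.0, -5.0]".
buffer.seek(0)
as_text = pd.read_csv(buffer)
print(type(as_text.loc[0, "ann_lists"]))

# A converter turns each cell back into a real Python list.
buffer.seek(0)
as_lists = pd.read_csv(buffer, converters={"ann_lists": ast.literal_eval})
print(as_lists.loc[0, "ann_lists"])
```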


## Update 1 (Padding)
To make it easier for the XGBoost algorithm, the list of annotations needs to be padded so that each isolate has an array of the same size.

In [22]:
# 5 is a placeholder; in practice, set maxlen to the length of the gene sequence or another suitable bound.
new_lists = pd.DataFrame(pad_sequences(ann_df["ann_lists"], padding='post', maxlen=5, value=-99, truncating='post'))

In [23]:
new_lists.head()

Unnamed: 0,0,1,2,3,4
0,-2,-99,-99,-99,-99
1,-10,-99,-99,-99,-99
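
To make it clearer what the `pad_sequences` call is doing, here is a rough NumPy-only sketch of post-padding and post-truncation. The helper `pad_post` is hypothetical and only mirrors the arguments used above; it is not the Keras implementation. It also stores values as `int32`, which is the default dtype of `pad_sequences` and the reason the score -2.0 shows up as -2 in the output.

```python
import numpy as np


def pad_post(sequences, maxlen, value=-99):
    """Pad at the end with `value` and cut off anything beyond `maxlen`."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[:maxlen]          # truncating='post'
        padded[row, :len(kept)] = kept     # padding='post'
    return padded


padded = pad_post(
    [[-2.0], [-10.0, -5.0, 1.0], [0, 1, 2, 3, 4, 5, 6]],
    maxlen=5,
)
print(padded)
```

The third toy sequence is longer than `maxlen`, so its last two values are dropped. That is the risk of picking a `maxlen` that is too small for real data.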


In [26]:
# Concatenate old and new dataframe so that rows match up.
ann_df = pd.concat([ann_df, new_lists], axis=1)

In [27]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,ann_lists,0,1,2,3,4
0,1,SHV-12,[-2.0],-2,-99,-99,-99,-99
1,2,SHV-12,[-10.0],-10,-99,-99,-99,-99


In [31]:
# Drop the old list column
ann_df = ann_df.drop(['ann_lists'], axis=1)

In [32]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,0,1,2,3,4
0,1,SHV-12,-2,-99,-99,-99,-99
1,2,SHV-12,-10,-99,-99,-99,-99


In [33]:
ann_df.to_csv('form_1.csv', index=False)

# Form 2
Create input form 2, where the annotated amino acid sequences are converted into -1 or 1 using the following criteria:
- -1 if:
    1. The amino acid had no mutation
- 1 if:
    1. A SNP occurred at the position
    2. An insertion occurred (no matter how long the inserted sequence was, only a single 1 is given)
    3. A deletion occurred (no matter how long the deleted sequence was, only a single 1 is given)
- Duplications are considered a -1 for the unmutated amino acid, and then the rest of the duplication is treated as a single insert (1).
- Frameshifts are considered a 1 for the position where the frameshift occurred and for every position after it.


## Get and show
First, let's get and see what the data looks like as a dataframe.

In [35]:
msa_df = pd.read_csv(f'{data_dir}export_msa_69.csv')

In [36]:
msa_df.head()

Unnamed: 0,Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
0,1,M,R,Y,I,R,L,C,I,I,...,A,A,L,I,E,H,W,Q,R,X
1,2,M,R,Y,I,R,L,C,I,?,...,-,-,-,-,-,-,-,-,-,-
2,consensus,M,R,Y,I,R,L,C,I,I,...,A,A,L,I,E,H,W,Q,R,X
3,reference,M,R,Y,I,R,L,C,I,I,...,A,A,L,I,E,H,W,Q,R,X


## Remove consensus row
We do not need the consensus row, so we can remove it.

In [37]:
msa_df = msa_df[msa_df.Name != 'consensus']

In [38]:
msa_df.head()

Unnamed: 0,Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
0,1,M,R,Y,I,R,L,C,I,I,...,A,A,L,I,E,H,W,Q,R,X
1,2,M,R,Y,I,R,L,C,I,?,...,-,-,-,-,-,-,-,-,-,-
3,reference,M,R,Y,I,R,L,C,I,I,...,A,A,L,I,E,H,W,Q,R,X


## Change Amino Acids to 1's and -1's
We will next need to remove the reference row and use it to turn the Amino Acid sequence into a sequence of 1's and -1's. The resulting list of numbers should be the same size as the reference.

In [39]:
# Get the reference row, turn it into a list of lists (there will only be 1 row, so 1 list)
# Then, remove the first element which is the name of the row 'reference'
reference = msa_df[msa_df.Name == 'reference'].values.tolist()[0][1:]

# Remove reference row
msa_df = msa_df[msa_df.Name != 'reference']

We need a function that can go over each row, create the list of 1's and -1's, and return that list so it can be added back to the df.

In [40]:
def create_ones_list(row, reference = []):
    # The first element of the row is the isolate id, so we must remove that first
    row = row[1:]

    # Since both sequences were part of an MSA, we can assume the lists are the same length.
    frameshift = False
    insert = False
    delete = False
    ones = []
    for p, r in zip(row, reference):

        # Frameshift occurred in the sequence
        if frameshift:
            ones.append(1)
            continue
        elif p == '?':
            frameshift = True
            ones.append(1)
            continue

        # Insert occurred in the sequence meaning reference has a -
        if insert and r == '-':
            continue
        elif r == '-':
            insert = True
            ones.append(1)
            continue
        else:
            insert = False

        # Delete occurred in the sequence meaning sequence has a -
        if delete and p == '-':
            continue
        elif p == '-':
            delete = True
            ones.append(1)
            continue
        else:
            delete = False

        # Either a substitution occurred or the 2 sequences are the same
        if p != r:
            ones.append(1)
        else:
            ones.append(-1)

    return ones
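
To check the rules on something small, here is a condensed restatement of the same logic that works on two aligned strings instead of a dataframe row. The name `encode_ones` and the toy alignments are made up for illustration; the branching is meant to follow `create_ones_list`, not replace it.

```python
def encode_ones(sample, reference):
    """Condensed restatement of the form 2 rules for two aligned strings."""
    ones = []
    frameshift = in_insert = in_delete = False
    for p, r in zip(sample, reference):
        if frameshift or p == "?":
            # Frameshift: this column and every later one becomes 1.
            frameshift = True
            ones.append(1)
        elif r == "-":
            # Insertion (gap in the reference): a single 1 per run of gaps.
            if not in_insert:
                ones.append(1)
            in_insert = True
        elif p == "-":
            # Deletion (gap in the sample): a single 1 per run of gaps.
            in_insert = False
            if not in_delete:
                ones.append(1)
            in_delete = True
        else:
            # Plain column: 1 for a substitution, -1 for a match.
            in_insert = in_delete = False
            ones.append(1 if p != r else -1)
    return ones


print(encode_ones("MRAY-RL", "MR-YIRL"))  # [-1, -1, 1, -1, 1, -1, -1]
print(encode_ones("MAAR", "M--R"))        # [-1, 1, -1]
print(encode_ones("MR?----", "MRYIRLC"))  # [-1, -1, 1, 1, 1, 1, 1]
```

The second call shows why the lists end up with different lengths: a two-column insertion collapses into a single 1, so the result is shorter than the alignment.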

In [41]:
msa_df.head()

Unnamed: 0,Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
0,1,M,R,Y,I,R,L,C,I,I,...,A,A,L,I,E,H,W,Q,R,X
1,2,M,R,Y,I,R,L,C,I,?,...,-,-,-,-,-,-,-,-,-,-


In [42]:

msa_df['ones'] = msa_df.apply(create_ones_list, axis=1, reference=reference)

In [43]:
msa_df.head()

Unnamed: 0,Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287,ones
0,1,M,R,Y,I,R,L,C,I,I,...,A,L,I,E,H,W,Q,R,X,"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -..."
1,2,M,R,Y,I,R,L,C,I,?,...,-,-,-,-,-,-,-,-,-,"[-1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1..."


## Remove all gene columns
Now that we have the ones list created, we can now remove all other gene columns.

In [44]:
msa_df = msa_df[['Name', 'ones']]

In [45]:
msa_df.head()

Unnamed: 0,Name,ones
0,1,"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -..."
1,2,"[-1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1..."


## To CSV
Lastly, we want to export as CSV without the indices. We do not need the pandas pre-made indices, so we can leave them out of the CSV file. Since this file is just for testing, it will be saved to the current folder.

In [168]:
msa_df.to_csv('form_2.csv', index=False)

Test to see if the file can be loaded, and what it will look like if it is.

In [169]:
pd.read_csv('form_2.csv').head()

Unnamed: 0,Name,ones
0,1,"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -..."
1,2,"[-1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1..."


## Update 1
The sequences need to be padded to all be the same length (similar to form 1). In order to pad, we need to have a default length that all sequences should be. We will use the reference sequence to figure out that max length.

In [48]:
# If there are any '-' in the reference sequence, remove them
reference = [x for x in reference if x != '-']

# Used for max length in padding
reference_length = len(reference)

In [49]:
new_lists = pd.DataFrame(pad_sequences(msa_df["ones"], padding='post', maxlen=reference_length, value=-99, truncating='post'))

In [50]:
# Concatenate old and new dataframe so that rows match up.
msa_df = pd.concat([msa_df, new_lists], axis=1)

In [51]:
msa_df.head()

Unnamed: 0,Name,ones,0,1,2,3,4,5,6,7,...,277,278,279,280,281,282,283,284,285,286
0,1,"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...",-1,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,2,"[-1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1...",-1,-1,-1,-1,-1,-1,-1,-1,...,1,1,1,1,1,1,1,1,1,1


In [52]:
# Drop the old list column
msa_df = msa_df.drop(['ones'], axis=1)

In [53]:
msa_df.head()

Unnamed: 0,Name,0,1,2,3,4,5,6,7,8,...,277,278,279,280,281,282,283,284,285,286
0,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,2,-1,-1,-1,-1,-1,-1,-1,-1,1,...,1,1,1,1,1,1,1,1,1,1


In [54]:
msa_df.to_csv('form_2.csv', index=False)

# Form 3
Create input form 3, where the annotated amino acid sequences are converted into a sequence of numbers, each corresponding to the character that occurs at that position in the annotated sequence.

NOTE: It does not rely on the reference sequence

## Start
To start, we will do the same things as we did for form 2 up to the reference. We will be removing the reference row and not using it.

In [178]:
msa_df = pd.read_csv(f'{data_dir}.csv', index_col=0)
msa_df = msa_df[msa_df.index != 'consensus']
msa_df = msa_df[msa_df.index != 'reference']

In [179]:
msa_df.head()

Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,10,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
1,M,R,Y,I,R,L,C,I,I,S,...,A,A,L,I,E,H,W,Q,R,X
2,M,R,Y,I,R,L,C,I,?,-,...,-,-,-,-,-,-,-,-,-,-


## Create function for values
We need to create a conversion function that will take in an Amino Acid (plus a few extra characters) and return an index of that character.

In [187]:
def convert(col):
    characters = {
        'A': 0,
        'R': 1, 
        'N': 2,
        'D': 3,
        'C': 4,
        'Q': 5,
        'E': 6,
        'G': 7,
        'H': 8,
        'I': 9,
        'L': 10,
        'K': 11,
        'M': 12,
        'F': 13,
        'P': 14,
        'S': 15,
        'T': 16,
        'W': 17,
        'Y': 18,
        'V': 19,
        'B': 20,
        'Z': 21,
        'X': 22,
        '?': 23,
        '-': 24
    }

    new_col = []
    for c in col:
        tmp = characters.get(c, None)
        if tmp is None:
            raise ValueError(f'Value {c} does not exist in characters dictionary')

        new_col.append(tmp)

    return new_col
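
As an alternative sketch (not what this notebook runs), the same lookup can be done with `Series.map` on each column. The names `ALPHABET`, `CHAR_TO_INDEX`, and `toy` are hypothetical; the character order is copied from the dictionary in `convert`, so the resulting indices are the same.

```python
import pandas as pd

# Same character order as the dictionary in convert(), so indices match.
ALPHABET = "ARNDCQEGHILKMFPSTWYVBZX?-"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

# Made-up four-column alignment for two isolates.
toy = pd.DataFrame(
    [list("MRYI"), list("MR?-")],
    index=pd.Index(["1", "2"], name="Name"),
)

# Map every column through the lookup; unknown characters become NaN.
encoded = toy.apply(lambda column: column.map(CHAR_TO_INDEX))

# Keep the fail-loudly behavior of convert() for characters not in the table.
if encoded.isna().any().any():
    raise ValueError("Found a character that is not in CHAR_TO_INDEX")

encoded = encoded.astype(int)
print(encoded)
```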

In [181]:
# See old head of dataframe
msa_df.head()

Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,10,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
1,M,R,Y,I,R,L,C,I,I,S,...,A,A,L,I,E,H,W,Q,R,X
2,M,R,Y,I,R,L,C,I,?,-,...,-,-,-,-,-,-,-,-,-,-


In [189]:
# Apply function
msa_df = msa_df.apply(convert, axis=1, result_type='broadcast')

In [190]:
msa_df.head()

Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,10,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
1,12,1,18,9,1,10,4,9,9,15,...,0,0,10,9,6,8,17,5,1,22
2,12,1,18,9,1,10,4,9,23,24,...,24,24,24,24,24,24,24,24,24,24


## To CSV
Lastly, we want to export as CSV. Unlike forms 1 and 2, the index here holds the isolate name (because the file was loaded with `index_col=0`), so it is kept in the CSV file. Since this file is just for testing, it will be saved to the current folder.

In [193]:
msa_df.to_csv('form_3.csv')

Test to see if the file can be loaded, and what it will look like if it is.

In [194]:
pd.read_csv('form_3.csv').head()

Unnamed: 0,Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
0,1,12,1,18,9,1,10,4,9,9,...,0,0,10,9,6,8,17,5,1,22
1,2,12,1,18,9,1,10,4,9,23,...,24,24,24,24,24,24,24,24,24,24
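
In the output above, the isolate names come back as an ordinary `Name` column, because `read_csv` was called without `index_col`. If the names should be the index again, passing `index_col=0` on load restores them. A small sketch with an in-memory buffer and made-up values (the `pos_*` column names are hypothetical):

```python
import io

import pandas as pd

# Made-up encoded frame, indexed by isolate name like the form 3 output.
encoded = pd.DataFrame(
    [[12, 1, 18, 9], [12, 1, 23, 24]],
    index=pd.Index([1, 2], name="Name"),
    columns=["pos_1", "pos_2", "pos_3", "pos_4"],
)

# Write with the index (the default), as in the to_csv call above.
buffer = io.StringIO()
encoded.to_csv(buffer)

# Read back, telling pandas that the first column is the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```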
