In [78]:
# Processing
This notebook walks through the processing for the three input forms, along with the trial and error that led to each of them.

In [77]:
%pip install numpy matplotlib pandas biopython

Note: you may need to restart the kernel to use updated packages.


# Load all necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Bio libraries needed
from Bio.Align import substitution_matrices

**Load needed variables**

In [79]:
data_dir = "../data/test/"
blosum62 = substitution_matrices.load('BLOSUM62')

# Form 1
The list of annotations is converted into a list of scores using the following rules:
- comparing substitutions using BLOSUM-62 matrix
- Frame shifts                    = -10
- Inserts (no matter length)      = -5
- Deletions (no matter length)    = -5
- Duplications (no matter length) = -2

## Getting data
First, we need to load the data. The first column of the CSV file should be Isolate_ID, but its values may not be unique because one isolate can have many annotations.

In [80]:
ann_df = pd.read_csv(f'{data_dir}export_62.csv')

In [81]:
# See what the original data annotations look like
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Type,Ref,Position(s),AA,% coverage
0,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,A,182,S,100.0
1,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,V,183,I,100.0
2,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,R,184,H,100.0
3,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,E,257,N,100.0
4,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,D,258,Q,100.0
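
To see why `Isolate_ID` cannot serve as a key on its own, a quick check along the following lines may help. This is only a sketch: the `toy` frame and its gene names are made-up stand-ins, not rows from the real export.

```python
import pandas as pd

# Toy stand-in for the export: isolate 1 has three annotations, isolate 2 has one.
toy = pd.DataFrame({
    "Isolate_ID": [1, 1, 1, 2],
    "Gene": ["geneA", "geneA", "geneB", "geneA"],
    "Type": ["sub", "sub", "ins", "fs"],
})

# False here, because Isolate_ID 1 occurs on more than one row.
id_is_unique = toy["Isolate_ID"].is_unique

# Number of annotation rows for each (isolate, gene) pair.
rows_per_pair = toy.groupby(["Isolate_ID", "Gene"]).size()

print(id_is_unique)
print(rows_per_pair)
```

Running the same two checks on `ann_df` would show how many annotations each isolate and gene combination contributes.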


## Dropping columns
We do not need the `% coverage` column, so we can drop it. Even though Isolate_ID is not unique, we keep it so that we can group by it later, since each datapoint will be the list of annotations that share this column's value.

In [82]:
ann_df = ann_df.drop(['% coverage'], axis=1)
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Type,Ref,Position(s),AA
0,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,A,182,S
1,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,V,183,I
2,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,R,184,H
3,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,E,257,N
4,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,D,258,Q


## Create Score column
Next, we need to create the score column that we will use later.

### Score function
To make this clearer, we define a score function that calculates the score of a single annotation.

In [83]:
def calc_score(row):
    if row.Type == 'sub':
        return blosum62[row.Ref, row.AA]    # BLOSUM62 is symmetric, so the order of Ref and AA does not matter
    elif row.Type == 'fs':
        return -10
    elif row.Type == 'ins':
        return -5
    elif row.Type == 'del':
        return -5
    elif row.Type == 'dup':
        return -2
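
As a quick sanity check of the scoring rules, the sketch below reimplements them without Biopython. `TOY_BLOSUM` is a hypothetical stand-in that holds only a handful of BLOSUM62 entries, and `score_annotation` and `INDEL_PENALTIES` are illustrative names that do not appear elsewhere in this notebook. Unlike `calc_score`, this variant raises on an unrecognised `Type` instead of silently returning `None`; whether that is preferable depends on how clean the export is.

```python
# Hypothetical stand-in for the BLOSUM62 lookup: only a few pairs are listed.
TOY_BLOSUM = {("A", "S"): 1, ("V", "I"): 3, ("R", "H"): 0}

# Fixed penalties for the non-substitution annotation types listed under Form 1.
INDEL_PENALTIES = {"fs": -10, "ins": -5, "del": -5, "dup": -2}


def score_annotation(ann_type, ref=None, aa=None):
    """Score one annotation; raise on an unknown type instead of returning None."""
    if ann_type == "sub":
        # The matrix is symmetric, so accept either ordering of the pair.
        if (ref, aa) in TOY_BLOSUM:
            return TOY_BLOSUM[(ref, aa)]
        return TOY_BLOSUM[(aa, ref)]
    if ann_type in INDEL_PENALTIES:
        return INDEL_PENALTIES[ann_type]
    raise ValueError(f"Unknown annotation type: {ann_type!r}")


scores = [
    score_annotation("sub", "A", "S"),
    score_annotation("sub", "I", "V"),  # reversed order, same score as V -> I
    score_annotation("fs"),
]
print(scores)  # [1, 3, -10]
```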

### Apply the score function
Here, the score function is applied over all rows (axis = 1).

In [84]:
ann_df['Score'] = ann_df.apply(calc_score, axis=1)

In [85]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Type,Ref,Position(s),AA,Score
0,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,A,182,S,1.0
1,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,V,183,I,3.0
2,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,R,184,H,0.0
3,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,E,257,N,0.0
4,1,OMPK35_YP002239423.1_OMPK35_KPN,sub,D,258,Q,0.0


## Drop columns
Now that each annotation has a score, we no longer need the Type, Ref, Position(s), and AA columns.

In [86]:
ann_df = ann_df.drop(['Type', 'Ref', 'Position(s)', 'AA'], axis=1)

In [87]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,Score
0,1,OMPK35_YP002239423.1_OMPK35_KPN,1.0
1,1,OMPK35_YP002239423.1_OMPK35_KPN,3.0
2,1,OMPK35_YP002239423.1_OMPK35_KPN,0.0
3,1,OMPK35_YP002239423.1_OMPK35_KPN,0.0
4,1,OMPK35_YP002239423.1_OMPK35_KPN,0.0


## Group all scores into a list per isolate/gene
In the final dataframe, each row should be a single datapoint representing one isolate and gene combination, and its value should be the list of annotation scores found for that combination. We compare a few options to see which is most efficient for the amount of data we have.

Code below found in [this StackOverflow answer](https://stackoverflow.com/a/22221675).

In [93]:
# First, group by isolate ID and gene so that all scores for a combination end up together
# Second, select the Score column of each group
# Third, turn the scores of each group into a list
# Finally, reset the index so that each group becomes its own row with the list in 'ann_lists'
%timeit ann_df.groupby(['Isolate_ID', 'Gene'])['Score'].apply(list).reset_index(name='ann_lists')

7.74 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
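
The chained call can be easier to follow on a small example. The sketch below uses a toy frame with made-up scores (not the real data) and splits the chain so that each intermediate result can be inspected.

```python
import pandas as pd

# Toy data: three (isolate, gene) combinations with made-up scores.
toy = pd.DataFrame({
    "Isolate_ID": [1, 1, 1, 2],
    "Gene": ["geneA", "geneA", "geneB", "geneA"],
    "Score": [1.0, 3.0, -5.0, -10.0],
})

# Steps 1-2: group by isolate and gene, then select the Score column.
grouped = toy.groupby(["Isolate_ID", "Gene"])["Score"]

# Step 3: collapse each group into a Python list (a Series with a MultiIndex).
as_lists = grouped.apply(list)

# Step 4: move the group keys back into columns and name the list column.
result = as_lists.reset_index(name="ann_lists")
print(result)
```

The result has one row per combination, with the columns `Isolate_ID`, `Gene`, and `ann_lists`.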


This next option is more convenient to write, but it is nearly twice as slow and has weird output ([found here](https://stackoverflow.com/a/66853746)).

In [92]:
%timeit pd.pivot_table(ann_df, values='Score', index=['Isolate_ID', 'Gene'], aggfunc={'Score': list})

14.1 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
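
One visible difference, at least on a toy frame with made-up scores, is that `pivot_table` leaves the group keys in the index and keeps the column name `Score`, so an extra `reset_index` and `rename` would be needed to match the shape produced by the first option. Whether this is the oddity meant above is an assumption on my part.

```python
import pandas as pd

# Toy data with float scores, mirroring the Score column of ann_df.
toy = pd.DataFrame({
    "Isolate_ID": [1, 1, 1, 2],
    "Gene": ["geneA", "geneA", "geneB", "geneA"],
    "Score": [1.0, 3.0, -5.0, -10.0],
})

pivoted = pd.pivot_table(
    toy, values="Score", index=["Isolate_ID", "Gene"], aggfunc={"Score": list}
)

# The group keys live in a MultiIndex, and the list column is still called Score.
print(list(pivoted.index.names))
print(list(pivoted.columns))

# Extra steps needed to get the same layout as the groupby version.
flat = pivoted.reset_index().rename(columns={"Score": "ann_lists"})
print(list(flat.columns))
```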


Here we try the same line as the first option, but with `np.array` instead of `list`. It takes about the same time as using `list`.

In [95]:
%timeit ann_df.groupby(['Isolate_ID', 'Gene'])['Score'].apply(np.array).reset_index(name='ann_lists')

7.94 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Since this will only be run once to generate the datasets, I will use the first option for convenience and ease of reading.

In [96]:
ann_df = ann_df.groupby(['Isolate_ID', 'Gene'])['Score'].apply(list).reset_index(name='ann_lists')

In [97]:
ann_df.head()

Unnamed: 0,Isolate_ID,Gene,ann_lists
0,1,OMPK35_YP002239423.1_OMPK35_KPN,"[1.0, 3.0, 0.0, 0.0, 0.0, -5.0, 2.0, 1.0, 0.0,..."
1,1,OMPK36_YP002237369.1_OMPK36_KPN,"[1.0, 3.0, 2.0]"
2,1,OMPK37_YP002238854.1_OMPK37_KPN,"[0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 2.0, 3.0, 1.0, ..."
3,2,OMPK35_YP002239423.1_OMPK35_KPN,"[-1.0, -10.0, -10.0, 3.0, 0.0, 3.0]"
4,2,OMPK36_YP002237369.1_OMPK36_KPN,"[-3.0, 0.0, 3.0, 1.0, 1.0, 2.0, 1.0, 1.0, -5.0..."


## To CSV
Lastly, we export the dataframe as a CSV file. We do not need the index that pandas generates, so we leave it out of the file. Since this file is only for testing, it is saved to the current folder.

In [99]:
ann_df.to_csv('form_1.csv', index=False)

Test to see if the file can be loaded, and what it will look like if it is.

In [100]:
pd.read_csv('form_1.csv').head()

Unnamed: 0,Isolate_ID,Gene,ann_lists
0,1,OMPK35_YP002239423.1_OMPK35_KPN,"[1.0, 3.0, 0.0, 0.0, 0.0, -5.0, 2.0, 1.0, 0.0,..."
1,1,OMPK36_YP002237369.1_OMPK36_KPN,"[1.0, 3.0, 2.0]"
2,1,OMPK37_YP002238854.1_OMPK37_KPN,"[0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 2.0, 3.0, 1.0, ..."
3,2,OMPK35_YP002239423.1_OMPK35_KPN,"[-1.0, -10.0, -10.0, 3.0, 0.0, 3.0]"
4,2,OMPK36_YP002237369.1_OMPK36_KPN,"[-3.0, 0.0, 3.0, 1.0, 1.0, 2.0, 1.0, 1.0, -5.0..."
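
One caveat worth checking when reloading: CSV stores each list as text, so `read_csv` returns the `ann_lists` column as strings rather than Python lists. Below is a minimal sketch of converting them back with `ast.literal_eval`. It uses an in-memory buffer and made-up values instead of `form_1.csv`, so the names `toy`, `buffer`, and `reloaded` are illustrative only.

```python
import ast
import io

import pandas as pd

# Toy version of the grouped dataframe with made-up scores.
toy = pd.DataFrame({
    "Isolate_ID": [1, 2],
    "Gene": ["geneA", "geneA"],
    "ann_lists": [[1.0, 3.0], [-10.0]],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Straight after loading, each cell is the text "[1.0, 3.0]", not a list.
first_raw = reloaded.loc[0, "ann_lists"]
print(type(first_raw))

# Parse the text back into real Python lists.
reloaded["ann_lists"] = reloaded["ann_lists"].apply(ast.literal_eval)
print(reloaded.loc[0, "ann_lists"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.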


# Form 2
Create input form 2, where each position of the annotated amino acid sequence is converted into either -1 or 1 according to the following criteria:
- -1 if:
    1. The amino acid had no mutation
- 1 if:
    1. A SNP occurred at the position
    2. An insertion occurred (no matter how long the inserted sequence was, only a single 1 is given)
    3. A deletion occurred (no matter how long the deleted sequence was, only a single 1 is given)