# Make several datasets
This is my first NLP LSTM neural network, and the first neural network of any kind that I have built from my own data, so I don't know how best to preprocess the data for the model.


**Setup**: <br>
I have chosen to use Arg (all anticodons) because the distribution of tRNAs with scores 74% and above seemed fairly normal, and there were 1000+ tRNAs from my 400+ genomes.


Unfortunately, there were only *103 unique Arg tRNAs* among the 1000+ Arg tRNAs detected in the Enterobacterales taxon.


**Paths forward**:
1. I can filter and keep only the unique tRNAs to train my LSTM fake tRNA generator. <br>
        **Pros**: Gets rid of potential codon biases and of over-sequenced groups of Enterobacterales bacteria. <br>
        **Cons**: Trims my sample size from 1000+ to only 103 sequences. Will this be enough data? Also, short of naively taking the unique sequences, I don't know how to retain the metadata about which organism each tRNA originally came from.
2. I can continue without filtering. <br>
        **Pros**: I keep a sample size of 1000+ Arg tRNAs. Repeated sequences may even be a good thing, training the model to go with the more "conserved" tRNA bases more often than not. I also retain the metadata structure. <br>
        **Cons**: The repeats may just be artificially inflating my data. The bias they introduce might be too strong, so the model might not train effectively on the rarer Arg tRNAs.
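
One possible way around the metadata problem in option 1 is to group on the sequence and aggregate the accession column, so each unique tRNA keeps a copy count and a list of the genomes it came from. This is only a sketch on a toy table: the column names `fasta_accession` and `sequence` match the CSV used in this notebook, but the rows, the helper name `collapse_duplicates`, and the output columns `n_copies` and `accessions` are made up for illustration.

```python
import pandas as pd

def collapse_duplicates(df):
    """Keep one row per unique sequence, with its copy number and source accessions."""
    return (
        df.groupby('sequence', sort=False)
          .agg(
              # number of rows (tRNA hits) sharing this exact sequence
              n_copies=('fasta_accession', 'size'),
              # sorted, de-duplicated accessions joined into one string
              accessions=('fasta_accession', lambda s: ';'.join(sorted(set(s)))),
          )
          .reset_index()
    )

# Toy data (hypothetical accessions and sequences, not real tRNAs)
toy = pd.DataFrame({
    'fasta_accession': ['genomeA', 'genomeB', 'genomeA', 'genomeC'],
    'sequence': ['GCATCC', 'GCATCC', 'GCGCTT', 'GCATCC'],
})

collapsed = collapse_duplicates(toy)
print(collapsed)
```

The `n_copies` column could also serve as a sampling weight later, which would sit somewhere between options 1 and 2, though I have not tried that here.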

## I kind of want to go with option 2 and just see what happens

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("organism_tRNAs.csv", names=['fasta_accession','tRNA_num','begin','end','length','isotype','anticodon','score','sequence','ss'])
df = df[['fasta_accession', 'length', 'isotype', 'anticodon', 'score', 'sequence', 'ss']]

score_filter = df['score'] > 60
arg_filter = df['isotype'] == 'Arg'

df_worthy = df[ score_filter & arg_filter ]
df_worthy.shape

(1894, 7)

## Actual datasets
1. All Arg
2. All Arg w/ scores > 74
3. **One copy of each unique Arg**
4. Remove tRNAs found in MODOMICS

In [2]:
file_allArg = "entero_arg.csv"
file_goodArg = "entero_highscore_arg.csv"
file_uniqueArg = "entero_unique_arg.csv"

In [3]:
# File types 1 and 2
#df[arg_filter].to_csv(file_allArg, index=False)
#df_worthy.to_csv(file_goodArg, index=False)

In [4]:
df_worthy.groupby('sequence')['sequence'].count().shape

(228,)

In [5]:
# Count how often each sequence occurs, then keep sequences seen more than 10 times
seq_counts = df.groupby('sequence')['sequence'].count()
moreAbundant = seq_counts[seq_counts > 10]
filtered = df.sequence.isin(moreAbundant.index)
excludingRare = df[filtered]['sequence']

In [6]:
tRNA_modomics_ecoli1 = 'CATCCGTAGCTCAGCTGGAtAGAGTACTCGGCTACGAACCGAGCGGtCGGAGGTTCGAATCCTCCCGGATGCA'
tRNA_modomics_ecoli2 = 'CGCCCGUAGCUCAGCTGGAtAGAGCGCUGCCCTCCGGAGGCAGAGGtCTCAGGTTCGAATCCTGTCGGGCGCG'.replace('U', 'T')
tRNA_modomics_ecoli3 = 'CGCCCTTAGCTCAGTTGGAtAGAGCAACGACCTTCTAAGTCGTGGGcCGCAGGTTCGAATCCTGCAGGGCG'.replace('U', 'T')
tRNAs_in_modomics = [tRNA_modomics_ecoli1, tRNA_modomics_ecoli2, tRNA_modomics_ecoli3]

for tRNA in tRNAs_in_modomics:
    print( df_worthy[ df_worthy['sequence'].str.contains(tRNA) ].shape )

(534, 7)
(131, 7)
(13, 7)


I think I would like to exclude the first tRNA because there is a ton of it (534 matches), so it might introduce more bias. It also seems relatively similar to, yet different from, the other two, which I am leaving in.
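
Dropping every row that contains a given MODOMICS sequence could look roughly like the sketch below. It assumes a literal substring match is what is wanted (hence `regex=False`). The helper name `exclude_containing` and the short toy sequences are hypothetical and only there to show the masking step.

```python
import pandas as pd

def exclude_containing(df, motif, column='sequence'):
    """Return the rows whose `column` does not contain `motif` as a literal substring."""
    mask = df[column].str.contains(motif, regex=False)
    # .copy() so later edits to the result do not touch the original frame
    return df[~mask].copy()

# Toy sequences (hypothetical); the first and third contain the motif
toy = pd.DataFrame({
    'sequence': ['GGCATCCGTAGCA', 'GCGCCCGTAGCTC', 'CATCCGTAGCACCA'],
})

kept = exclude_containing(toy, 'CATCCGTAGCA')
print(kept['sequence'].tolist())
```

With the real data, the call would presumably take `df_worthy` and `tRNA_modomics_ecoli1` as its two arguments.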

In [10]:
#================================================================
#================================================================
#================================================================

In [12]:
df_worthy['sequence'].unique()[0:10]

array(['GCATCCGTAGTTCAGTTGGAtAGAGCACTCGGCTACGAACCGAGAGGtCGGAGGTTCAAATCCTTCCGGATGCA',
       'GCGCTTGTAGCTCAGTTGGAtAGAGCGCTACCCTCCGAAGGTAGAGGcCTCAGGTTCGAATCCTGTCAAGCGCA',
       'GCGCTCTTAGCTCAGATGGAtAGAGCAACGGCCTTCTAAGCCGTAGGtCATAGGTTCGAATCCTATAGAGCGCA',
       'GCGCCCTTAGCTCAGTTGGAtAGAGCAACGACCTTCTAAGTCGTGGGcCGCAGGTTCGAATCCTGCAGGGCGCGCCA',
       'GTCCTCTTAGTTAAATGGAtATAACGAGCCCCTCCTAAGGGCTAAtTGCAGGTTCGATTCCTGCAGGGGACACCA',
       'GCGCCCGTAGCTCAGCTGGAtAGAGCGCTGCCCTCCGGAGGCAGAGGtCTCAGGTTCGAATCCTGTCGGGCGCGCCA',
       'GCATCCGTAGCTCAGCTGGAtAGAGTACTCGGCTACGAACCGAGCGGtCGGAGGTTCGAATCCTCCCGGATGCACCA',
       'CCACCACTAGCTCATCCGGAtAGAGCATCAACCTTCTAAGTTGACGGtGCGAGGTTCGAGTCCTCGGTGGTGGGCCA',
       'GTATCCGTAGCTCAGCTGGAtAGAGTACTCGGCTACGAACCGAGCGGtCGGAGGTTCGAATCCTCCCGGATGCACCA',
       'CCACCACTAGCTCATCCGGAtAGAGCATCAACCTTCTAAGTTGACGGtCCGAGGTTCGAGTCCTCGGTGGTGGGCCA'],
      dtype=object)

In [21]:
def trim_end(tRNA):
    if tRNA[-3:] == 'CCA':
        return tRNA[:-3]
    else:
        return tRNA

In [23]:
# df_worthy was created by boolean indexing on df, so take an explicit copy
# before assigning to it; this avoids pandas' SettingWithCopyWarning
df_worthy = df_worthy.copy()
df_worthy['sequence'] = df_worthy['sequence'].map(trim_end)


In [30]:
unique_seqs = df_worthy['sequence'].unique()
with open(file_uniqueArg, 'w') as file:
    # Write the actual count instead of hard-coding it (224 at the time of writing)
    file.write(f'# This file contains {len(unique_seqs)} Enterobacterales Arg tRNAs\n')
    file.write('# I trimmed off the CCA tails and left only unique seqs\n')
    file.write('# I left the lowercase letters (unconserved)\n')
    for item in unique_seqs:
        file.write(f'{item}\n')
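
Since the file written above starts with `#` comment lines, anything that reads it back needs to skip them. A minimal sketch of that, using an in-memory stand-in for the file so it runs on its own; the function name `read_sequences` and the two short sequences are hypothetical.

```python
import io

def read_sequences(handle):
    """Return the non-empty lines of `handle` that are not '#' comments."""
    return [
        line.strip()
        for line in handle
        if line.strip() and not line.startswith('#')
    ]

# In-memory stand-in for the written file (made-up sequences)
example = io.StringIO(
    '# This file contains 2 sequences\n'
    'GCATCCGTAGtTCA\n'
    'GCGCTTGTAGcTCA\n'
)

seqs = read_sequences(example)
print(seqs)
```

For the real file, `open(file_uniqueArg)` would take the place of the `io.StringIO` object.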