# Kill-Neighbors:

Logic: I've been operating on a particular guess, which I believe I've proven by now, that there are a number of "local minima". That is:

a. Each time you run the program, it outputs a number (in this case, 100) of sequences which are not random, but are clusters of related sequences whose binding affinities are also related to each other.

b. On sequential runs of the program, the groups produced "overlap" around the aforementioned clusters.

Taken together, this implies that each cluster contains one or more sequences with the highest API-score in that cluster. Running the program more times, and gathering more sequences, makes it more and more likely that you'll hit that "local minimum". So, generating more and more sequences, while killing off the sub-optimal sequences, should function as a sort of optimization procedure. "How close do you want to get?" / "How likely do you want to be to get the optimal solution?" That depends on how many times you can afford to run Apta-MCTS and the neighbor-killing function, which in turn depends on your computing power.

Ideally, I'll be able to run Apta-MCTS directly in Colab in the future, taking advantage of Google's runtime, and making it much simpler to iterate.

kill_neighbors(D=0.1, N=0)
- The function takes in a list of sequences and their respective API scores.
- It determines the number of neighbors each sequence has, based on 'D', the neighborhood cutoff distance.
- The neighborhoods are then ranked by size.
- For the neighborhoods of the largest size, the sequence in that neighborhood with the lowest API score is "killed".
- Iterate the whole process until the largest neighborhood is no larger than the maximum size, N.
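The steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the notebook's final implementation: distances are held in a plain nested dict `dist[a][b]` (a stand-in for the pairwise matrix built later), `scores` maps each sequence to its API score, and exactly one sequence is killed per pass. Neighborhoods whose members all share one score are not treated specially here.

```python
def kill_neighbors(scores, dist, D=0.1, N=0):
    """Prune low-scoring sequences until no neighborhood has more than N neighbors.

    scores: dict mapping sequence -> API score (assumed input format).
    dist:   nested dict, dist[a][b] = distance between a and b (assumed format).
    """
    alive = set(scores)
    while alive:
        # Neighbors of each survivor: the other survivors within distance D.
        nbrs = {s: [t for t in alive if t != s and dist[s][t] <= D] for s in alive}
        # Largest neighborhood; sorting makes tie-breaking deterministic.
        centroid = max(sorted(alive), key=lambda s: len(nbrs[s]))
        if len(nbrs[centroid]) <= N:
            break
        # Kill the lowest-scoring member of that neighborhood (centroid included).
        members = sorted([centroid] + nbrs[centroid])
        alive.remove(min(members, key=lambda s: scores[s]))
    return {s: scores[s] for s in alive}


# Made-up toy data: 'a' is close to 'b' and 'c'; 'd' is far from everything.
names = ["a", "b", "c", "d"]
scores = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.2}
dist = {s: {t: 0.0 if s == t else 0.9 for t in names} for s in names}
for (s, t), d in {("a", "b"): 0.05, ("a", "c"): 0.05, ("b", "c"): 0.2}.items():
    dist[s][t] = dist[t][s] = d

survivors = kill_neighbors(scores, dist, D=0.1, N=0)
```

With `N=0` the loop runs until no surviving sequence has any neighbor within `D`; a larger `N` stops earlier and keeps more of each cluster.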


## Installing packages and such:

In [None]:
pip install tn93 --q

In [None]:
pip install Bio --q


In [None]:
import tn93

In [None]:
import pandas as pd

In [None]:
from tn93 import tn93 as model

In [None]:
from Bio import SeqIO

## Import the data:

In [None]:
#From StackExchange
from google.colab import files
def getLocalFiles():
    _files = files.upload()
    if len(_files) > 0:
        for k, v in _files.items():
            #Write each uploaded file to the working directory, closing it afterwards.
            with open(k, 'wb') as fh:
                fh.write(v)
getLocalFiles()

Saving Exp_1_1.csv to Exp_1_1.csv
Saving Exp_1_16.csv to Exp_1_16.csv
Saving Exp_2_1.csv to Exp_2_1.csv
Saving Exp_2_16.csv to Exp_2_16.csv
Saving Exp_3_1.csv to Exp_3_1.csv
Saving Exp_3_16.csv to Exp_3_16.csv
Saving Exp_4_1.csv to Exp_4_1.csv
Saving Exp_4_16.csv to Exp_4_16.csv
Saving Exp_5_1.csv to Exp_5_1.csv
Saving Exp_5_16.csv to Exp_5_16.csv


In [None]:
#Give the names of the files you're working with:
fl = ['Exp_1_1.csv','Exp_2_1.csv', 'Exp_3_1.csv', 'Exp_4_1.csv', 'Exp_5_1.csv']

## Tamura_Nei Distance function:

In [None]:
def tamura_nei(s1,s2):
  #Make the SeqRecord objects for each, using the sequence itself as the id.
  seq1 = SeqIO.SeqRecord(seq=s1)
  seq1.id=s1
  seq2 = SeqIO.SeqRecord(seq=s2)
  seq2.id=s2
  #And feed into the model.
  tn_model = model.TN93(minimum_overlap=8)
  distance = tn_model.tn93_distance(seq1, seq2, "RESOLVE")
  #The output is a single comma-separated string ('id1,id2,distance'), not a list, so split it.
  distance = distance.split(',')
  return distance

In [None]:
#Test:
tamura_nei('ATTATAGGTGATG','ATTAAAGGTGATG')

['ATTATAGGTGATG', 'ATTAAAGGTGATG', '0.0817849']

## Making necessary objects:

1. 'API_dict': The function requires a dictionary of sequences vs their respective API-scores, in order to determine which sequences must be cut.


2. 'pairwise_tn_df': It also requires an n x n matrix of pairwise distances. It makes sense to generate this matrix "up-front" and keep resampling from it, rather than to regenerate the distances on every iteration of the function.

### Make 'o_seqs'
Read in data, build the all-in-one table and pull out 'o_seqs', the nx2 matrix of all sequences and their API-scores.

In [None]:
#fl = ['Exp_1_1.csv','Exp_2_1.csv', 'Exp_3_1.csv', 'Exp_4_1.csv', 'Exp_5_1.csv']
frames = []
for file in fl:
  df = pd.read_csv(file, delimiter=',', encoding_errors = 'replace')
  df['file'] = file  #Record which run each row came from.

  frames.append(df)

fin_df = pd.concat(frames)
fin_df

Unnamed: 0,aptamer_protein_interaction_score,primary_sequence,secondary_structure,minimum_free_energy,file
0,0.400000,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,....((((.((((.......)))).))))..,-3.4,Exp_1_1.csv
1,0.400000,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,....(((((((((.......)))))))))..,-6.9,Exp_1_1.csv
2,0.400000,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,....(((((((...........)))))))..,-4.7,Exp_1_1.csv
3,0.400000,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,....(((.((((..........)))))))..,-3.6,Exp_1_1.csv
4,0.400000,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,(((((((..........)).)))))......,-2.9,Exp_1_1.csv
...,...,...,...,...,...
95,0.371429,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,..(((((((...........)))))))....,-3.5,Exp_5_1.csv
96,0.371429,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,....((((((.((........))))))))..,-5.6,Exp_5_1.csv
97,0.371429,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,....((((((.............))))))..,-3.6,Exp_5_1.csv
98,0.371429,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,.((((....))))((.(((....))).))..,-6.1,Exp_5_1.csv


In [None]:
o_seqs = fin_df.drop(['secondary_structure', 'minimum_free_energy', 'file'],axis=1)
o_seqs = o_seqs.values
o_seqs


array([[0.4, 'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA'],
       [0.4, 'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA'],
       [0.4, 'AACCGGTTAAUAAUUUACGCUAATTGGCCAA'],
       [0.4, 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA'],
       [0.4, 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA'],
       [0.4, 'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTAAAGUUAAUUUACGGTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTAAAUUAAUUUACGCGTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTUCAAUUUACACUAAUTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTAUUUAUUUAUGGAUCTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTUCGCAAUUUAUGGUGTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTGGAAUUAAUUUACGCTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTUUCAUAAUUUACAAATTGGCCAA'],
       [0.3714285714285714, 'AACCGGTTCCAAAUUAAUUUACGTTGGCCAA'],
  

### Dictionary of API-scores

Convert 'o_seqs' into a dictionary, for later use.

In [None]:
API_dict = {}
for n in o_seqs:
  API_dict[n[1]] = n[0]

API_dict

{'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA': 0.4,
 'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA': 0.4,
 'AACCGGTTAAUAAUUUACGCUAATTGGCCAA': 0.4,
 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA': 0.4,
 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA': 0.4,
 'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA': 0.4,
 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAAAGUUAAUUUACGGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAAAUUAAUUUACGCGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUCAAUUUACACUAAUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAUUUAUUUAUGGAUCTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUCGCAAUUUAUGGUGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTGGAAUUAAUUUACGCTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUUCAUAAUUUACAAATTGGCCAA': 0.3714285714285714,
 'AACCGGTTCCAAAUUAAUUUACGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAGUUAAUUUACGGCUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUGUGUAAUUUACGUATTGGCCAA': 0.3714285714285714,
 'AACCGGTTUUCAUUAAUUUACGGT

### Make the nxn distance matrix:

This is currently the longest, most time consuming part of the algorithm.

In [None]:
#For a pairwise comparison of *every* sequence:
pairwise_tn = []
for n in API_dict:
  row = []
  for m in API_dict:
    try: #A few of the pairs raise a 'math domain error'
      row.append(float(tamura_nei(n,m)[2]))
    except ValueError: #Catch only the math error, not every possible exception.
      print('Math_error', n,m)
      row.append(1.0)    #Impute the maximum distance.
  pairwise_tn.append(row)


print()
pairwise_tn

In [None]:
print(len(pairwise_tn),'x',len(pairwise_tn[0]))

500 x 500
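
Because the distance between two sequences should not depend on the order in which they are given, roughly half of the calls in the loop above are repeats. A minimal sketch of filling only the upper triangle and mirroring it follows. `toy_distance` is a hypothetical stand-in for `tamura_nei` (a simple mismatch fraction, not a Tamura-Nei distance), and both the symmetry of the real `tn93` output and a self-distance of zero are assumptions worth checking before relying on this.

```python
from itertools import combinations


def toy_distance(s1, s2):
    # Hypothetical stand-in for tamura_nei: fraction of mismatched positions.
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def pairwise_matrix(seqs, dist_fn, fallback=1.0):
    """Build a symmetric n x n list of lists, calling dist_fn once per unordered pair."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays at 0.0 (assumed self-distance)
    for i, j in combinations(range(n), 2):
        try:
            d = float(dist_fn(seqs[i], seqs[j]))
        except ValueError:   # e.g. a 'math domain error'
            d = fallback     # same imputation as in the loop above
        matrix[i][j] = matrix[j][i] = d
    return matrix


seqs = ["ATTATAGG", "ATTAAAGG", "GTTAAAGC"]
matrix = pairwise_matrix(seqs, toy_distance)
```

For n sequences this makes n(n-1)/2 distance calls instead of n squared, which for 500 sequences is 124,750 rather than 250,000.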


Given that we're intending to do both row and column operations, the most intuitive way to see what we're doing is for the array to become a dataframe.

In [None]:
pairwise_tn_df = pd.DataFrame(pairwise_tn, columns=API_dict, index=API_dict)
#pairwise_tn_df.style.set_table_styles([{'selector': 'th.row_heading', 'props': [('font-size', '1pt')]}])
pairwise_tn_df

Unnamed: 0,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA,AACCGGTTUUAAUUAAUUUACGCTTGGCCAA,AACCGGTTAAAGUUAAUUUACGGTTGGCCAA,AACCGGTTAAAUUAAUUUACGCGTTGGCCAA,AACCGGTTAUGUAAUUUACGUGUTTGGCCAA,...,AACCGGTTAUUGUGGUUGUUGGGTTGGCCAA,AACCGGTTUUGUGGUUGUUUGGCTTGGCCAA,AACCGGTTUUGUGGUUGUUACUCTTGGCCAA,AACCGGTTAAAAAAUGUGGUUACTTGGCCAA,AACCGGTTUCCGCUUCAUUUACGTTGGCCAA,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,AACCGGTTACAUGUGGUUGUGUATTGGCCAA
AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,0.000000,0.536284,0.495175,0.034723,0.564295,0.372718,0.325768,0.380993,0.143741,0.514167,...,0.500044,0.487884,0.496017,0.433961,0.637859,0.559587,0.647795,0.373681,0.519769,0.427483
AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,0.536284,0.000000,0.327228,0.534410,0.104792,0.224886,0.632869,0.433130,0.332553,0.104864,...,0.371428,0.491149,0.564824,0.324056,0.650225,0.486450,0.435475,0.427191,0.509457,0.574162
AACCGGTTAAUAAUUUACGCUAATTGGCCAA,0.495175,0.327228,0.000000,0.494202,0.504296,0.384633,0.633141,0.557520,0.489237,0.442288,...,0.560691,0.630595,0.621325,0.276985,0.568445,0.486208,0.560177,0.555729,0.714769,0.490981
AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,0.034723,0.534410,0.494202,0.000000,0.510740,0.372018,0.379921,0.380363,0.143571,0.512753,...,0.499167,0.558790,0.551832,0.432227,0.753709,0.491765,0.646530,0.372164,0.466866,0.426784
AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,0.564295,0.104792,0.504296,0.510740,0.000000,0.325193,0.555551,0.548395,0.440332,0.067947,...,0.429014,0.374209,0.446446,0.444472,0.665790,0.383185,0.555729,0.563786,0.325085,0.639640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,0.559587,0.486450,0.486208,0.491765,0.383185,0.574411,0.489569,0.499518,0.428057,0.499167,...,0.274994,0.339000,0.431633,0.458788,0.488051,0.000000,0.438718,0.325286,0.515013,0.370783
AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,0.647795,0.435475,0.560177,0.646530,0.555729,0.329303,0.441818,0.377396,0.521533,0.486191,...,0.105272,0.384141,0.537709,0.463304,0.504558,0.438718,0.000000,0.285076,0.319295,0.510667
AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,0.373681,0.427191,0.555729,0.372164,0.563786,0.325286,0.589766,0.509457,0.269658,0.429541,...,0.279358,0.373368,0.488793,0.493195,0.551507,0.325286,0.285076,0.000000,0.592129,0.493008
AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,0.519769,0.509457,0.714769,0.466866,0.325085,0.370348,0.525195,0.491794,0.559801,0.325286,...,0.275424,0.443780,0.588972,0.626902,0.650694,0.515013,0.319295,0.592129,0.000000,0.487932


It's a bit ugly, but it works. Notably, you can see that every entry on the diagonal is zero, as you'd expect.

## The actual function:

To implement this, the function should be built up as usual, from simple functions to complex ones.

In [None]:
#Check:
API_dict

In [None]:
#Check:
pairwise_tn_df

### Generate a list of neighborhoods, using the indexing.

#### Exploration:

In [None]:
#Using this handy loop as a template:
for index, row in pairwise_tn_df.iterrows():
    print(f"Row {index}:")
    for column, value in row.items():
        print(f"  {column}: {value}")

Streaming output truncated to the last 5000 lines.
  AACCGGTTAUGUAAUUUACGUGUTTGGCCAA: 0.428957
  AACCGGTTUCAAUUUACACUAAUTTGGCCAA: 0.663354
  AACCGGTTAUUUAUUUAUGGAUCTTGGCCAA: 0.514984
  AACCGGTTUCGCAAUUUAUGGUGTTGGCCAA: 0.429014
  AACCGGTTGGAAUUAAUUUACGCTTGGCCAA: 0.513006
  AACCGGTTUAAAUUAAUUUACGUTTGGCCAA: 0.505809
  AACCGGTTUUCAUAAUUUACAAATTGGCCAA: 0.678885
  AACCGGTTCCAAAUUAAUUUACGTTGGCCAA: 0.554383
  AACCGGTTAGUUAAUUUACGGCUTTGGCCAA: 0.428081
  AACCGGTTUGUGUAAUUUACGUATTGGCCAA: 0.385207
  AACCGGTTUUCAUUAAUUUACGGTTGGCCAA: 0.373455
  AACCGGTTCUGGUAAUUUACAACTTGGCCAA: 0.468275
  AACCGGTTCAAAUUAAUUUACGUTTGGCCAA: 0.504666
  AACCGGTTUUUAAUUUAAAAUCGTTGGCCAA: 0.505809
  AACCGGTTGUAAUUAAUUUACGUTTGGCCAA: 0.454235
  AACCGGTTACAGGCUAGUUUAUGTTGGCCAA: 0.429614
  AACCGGTTUGCUUCAUUUACGCUTTGGCCAA: 0.487604
  AACCGGTTUAAAUUUACAAUGGCTTGGCCAA: 0.486611
  AACCGGTTUCGAUUUAAUUUACGTTGGCCAA: 0.491149
  AACCGGTTCAUUAAUUUACGUGUTTGGCCAA: 0.491149
  AACCGGTTUAAAUAAUUUACGGCTTGGCCAA: 0.435475
  AACCGGTTU

In [None]:
#Using this handy loop as a template:
for index, row in pairwise_tn_df.iterrows():
    print(f"{index}")
    for column, value in row.items():
        print(f"{value}")
        print(f"{column}:{value}")
    print()

In [None]:
D = 0.1
groups = []

for index, row in pairwise_tn_df.iterrows():
    group = []
    group.append(f"{index}")
    for column, value in row.items():
        if float(f"{value}") <= D:
          group.append(f"{column}")
    groups.append(group)

print(groups)

[['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA'], ['AACCGGTTAAUUAAUUUACGUGGTTGGCCAA', 'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA', 'AACCGGTTCAUUAAUUUACGUGUTTGGCCAA'], ['AACCGGTTAAUAAUUUACGCUAATTGGCCAA', 'AACCGGTTAAUAAUUUACGCUAATTGGCCAA'], ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA'], ['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA'], ['AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA', 'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA', 'AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA'], ['AACCGGTTUUAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA'], ['AAC

In [None]:
print(len(groups))

500


In [None]:
def mean_list_length(list_of_lists):
    total_length = 0
    num_lists = len(list_of_lists)

    for sublist in list_of_lists:
        total_length += len(sublist)

    mean_length = total_length / num_lists

    return mean_length

mean_length = mean_list_length(groups)
print("Mean list length:", mean_length)

Mean list length: 3.952


In [None]:
#With a few checks and balances:

D = 0.1
groups = []

for index, row in pairwise_tn_df.iterrows():
    group = []
    group.append(f"{index}")
    for column, value in row.items():
        if float(f"{value}") <= D and float(f"{value}") != 0: #Add self-match condition.
          group.append(f"{column}")
    if len(group) > 1: #Kill off neighborhoods with only one sequence, which are terminal anyway.
      groups.append(group)

print(groups)

[['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA'], ['AACCGGTTAAUUAAUUUACGUGGTTGGCCAA', 'AACCGGTTCAUUAAUUUACGUGUTTGGCCAA'], ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA'], ['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA'], ['AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA', 'AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA'], ['AACCGGTTUUAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA'], ['AACCGGTTAAAUUAAUUUACGCGTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGCCTTGGCCAA'], ['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA', 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA'], ['AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA', 'A

In [None]:
print(len(groups))

268


In [None]:
#mean_list_length is already defined above; rerun it on the filtered groups.
mean_length = mean_list_length(groups)
print("Mean list length:", mean_length)

Mean list length: 4.641791044776119


Finally, don't forget to kill off redundant groups. These would pose a huge problem: in later stages, you would attempt to kill the same sequence twice.

The issue is that these are lists, meaning that the order matters. A list containing the same elements, reordered, is not equal to the original. So, each list must become a set, and a set must be built of all of them.

According to the internet, you can do this, but the inner sets need to be converted to frozen sets, since only hashable objects can be members of a set. The more you know...
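
A small illustration of why the conversion is needed: ordinary sets are mutable and therefore unhashable, so they cannot be members of another set, whereas frozensets can. The toy groups below are made up for illustration.

```python
toy_groups = [["a", "b", "c"], ["c", "a", "b"], ["a", "d"]]

# A set of ordinary sets fails, because set objects are unhashable.
try:
    {set(g) for g in toy_groups}
    set_of_sets_ok = True
except TypeError:
    set_of_sets_ok = False

# A set of frozensets works, and reordered duplicates collapse into one entry.
unique = {frozenset(g) for g in toy_groups}
```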

In [None]:
groups_non_red = set()
for group in groups:
  print (set(group))
  groups_non_red.add(frozenset(group))

{'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA'}
{'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA', 'AACCGGTTCAUUAAUUUACGUGUTTGGCCAA'}
{'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA'}
{'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA'}
{'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA', 'AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA'}
{'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA'}
{'AACCGGTTUAAUUAAUUUACGCCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 'AACCGGTTAAAUUAAUUUACGCGTTGGCCAA'}
{'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA'}
{'AACCGGTTGCAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGG

In [None]:
print(len(groups_non_red))

242


In [None]:
#mean_list_length is already defined above.
#Note: this call still measures 'groups', not 'groups_non_red', so the result matches the previous one.
mean_length = mean_list_length(groups)
print("Mean list length:", mean_length)

Mean list length: 4.641791044776119


Just for completeness, here is the final workflow:

In [None]:
D = 0.1
groups = []

for index, row in pairwise_tn_df.iterrows():
    group = []
    group.append(f"{index}")
    for column, value in row.items():
        if float(f"{value}") <= D and float(f"{value}") != 0: #Add self-match condition.
          group.append(f"{column}")
    if len(group) > 1: #Kill off neighborhoods with only one sequence, which are terminal anyway.
      groups.append(group)

groups_non_red = set()
for group in groups:
  groups_non_red.add(frozenset(group))

#And now convert back to a list of lists, because sets are annoying...
output_groups = []
for group in groups_non_red:
  output_groups.append(list(group))

output_groups

[['AACCGGTTAUGAGGUUGGUUGUCTTGGCCAA', 'AACCGGTTCUGAGGUUGGUUCUCTTGGCCAA'],
 ['AACCGGTTCUUGUGGUUUAUGAATTGGCCAA',
  'AACCGGTTAUUGUGGUUUAUGAATTGGCCAA',
  'AACCGGTTAUUGUGGUUUAUGACTTGGCCAA',
  'AACCGGTTGAUGUGGUUUAUGAATTGGCCAA',
  'AACCGGTTGUUGUGGUUUAUGACTTGGCCAA',
  'AACCGGTTCGUGUGGUUUAUGUATTGGCCAA',
  'AACCGGTTUUUGUGGUUUAUGAATTGGCCAA'],
 ['AACCGGTTGGUUUGUGGUUUAUGTTGGCCAA',
  'AACCGGTTACGUUGUGGUUUAUGTTGGCCAA',
  'AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA',
  'AACCGGTTAGGUUGUGGUUUAUGTTGGCCAA',
  'AACCGGTTAAUUUGUGGUUUAUGTTGGCCAA',
  'AACCGGTTAUGUUGUGGUUUAUGTTGGCCAA'],
 ['AACCGGTTAUAAUUAAAUCGUGUTTGGCCAA',
  'AACCGGTTUCAAUUAAAUCGUUATTGGCCAA',
  'AACCGGTTAAAAUUAAAUCGUUGTTGGCCAA',
  'AACCGGTTGUAAUUAAAUCGUAATTGGCCAA',
  'AACCGGTTAUAAUUAAAUCGUUATTGGCCAA',
  'AACCGGTTACAAUUAAAUCGUAATTGGCCAA'],
 ['AACCGGTTCGUAAUUAAAUCGUATTGGCCAA',
  'AACCGGTTAGUAAUUAAAUCGUUTTGGCCAA',
  'AACCGGTTCACAAUUAAAUCGUATTGGCCAA',
  'AACCGGTTCGGAAUUAAAUCGUATTGGCCAA',
  'AACCGGTTGGUAAUUAAAUCGUATTGGCCAA',
  'AACCGGTTCGUAAUUAAAUCGUCTTGGCCAA',

#### Old function:

In [None]:
def get_neighbors(pairwise_tn_df, D = 0.1):
  groups = []

  for index, row in pairwise_tn_df.iterrows():
      group = []
      group.append(f"{index}")
      for column, value in row.items():
          if float(f"{value}") <= D and float(f"{value}") != 0: #Add self-match condition.
            group.append(f"{column}")
      if len(group) > 1: #Kill off neighborhoods with only one sequence, which are terminal anyway.
        groups.append(group)

  groups_non_red = set()
  for group in groups:
    groups_non_red.add(frozenset(group))

  #And now convert back to a list of lists, because sets are annoying...
  output_groups = []
  for group in groups_non_red:
    output_groups.append(list(group))

  return(output_groups)

This previous version of the function has a significant error in its logic:
If you create a "neighborhood" from a single row of the distance matrix, what you're saying is that all the other sequences are sufficiently close to that particular index sequence. So, that sequence is the centroid of the neighborhood, and it must be known. Otherwise, presenting a neighborhood as simply a group of sequences implies that all of these sequences are close to each other, which is false.

Similarly, [a], [b, c, d] and [b], [a, c, d] do not convey the same information. So, they are not redundant, and both must be preserved.

Additional: Don't kill off neighborhoods with only one sequence (that is, a centroid and no neighbors). Those neighborhoods also contain information that shouldn't be discarded.

I'm preserving it here to maintain a sense of the exploration process.
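
A toy example may make the point concrete. Assuming three made-up sequences `a`, `b` and `c` with the distances below, `b` is within `D` of both `a` and `c`, yet `a` and `c` are not within `D` of each other. Treating `b`'s neighborhood as an unordered group would wrongly suggest that they are.

```python
from itertools import combinations

D = 0.1
# Made-up distances, for illustration only.
dist = {
    frozenset(("a", "b")): 0.08,
    frozenset(("b", "c")): 0.08,
    frozenset(("a", "c")): 0.16,
}


def within(x, y):
    """True if x and y are within the cutoff distance D of each other."""
    return dist[frozenset((x, y))] <= D


names = ["a", "b", "c"]
# Centroid -> neighbors, as read off a single row of the distance matrix.
neighborhoods = {s: [t for t in names if t != s and within(s, t)] for s in names}

# Check whether every pair in b's neighborhood is mutually close.
group = ["b"] + neighborhoods["b"]
mutually_close = all(within(x, y) for x, y in combinations(group, 2))
```

Here `neighborhoods["b"]` contains both `a` and `c`, but `mutually_close` is `False`, which is exactly the information lost when the centroid is discarded.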

#### get_neighbors: Function.

In [None]:
def get_neighbors(pairwise_tn_df, D=0.1):
  #Map each centroid sequence to the list of its neighbors (the sequences within distance D of it).
  groups = {}

  for index, row in pairwise_tn_df.iterrows():
      centroid = str(index)
      neighbors = []
      for column, value in row.items():
          #Skip self-matches (distance 0). Note: this also skips any distinct sequence at a distance of exactly 0.
          if float(value) <= D and float(value) != 0:
            neighbors.append(str(column))

      #Add to the dictionary:
      groups[centroid] = neighbors

  return groups

In [None]:
get_neighbors(pairwise_tn_df)

{'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA': ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA',
  'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA',
  'AACCGGTTUAAUUAAUUUACGUATTGGCCAA',
  'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA'],
 'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA': ['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA'],
 'AACCGGTTAAUAAUUUACGCUAATTGGCCAA': [],
 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA': ['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA'],
 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA': ['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA'],
 'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA': ['AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA'],
 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA': ['AACCGGTTGGAAUUAAUUUACGCTTGGCCAA',
  'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA',
  'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA',
  'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA',
  'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA',
  'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA'],
 'AACCGGTTAAAGUUAAUUUACGGTTGGCCAA': [],
 'AACCGGTTAAAUUAAUUUACGCGTTGGCCAA': ['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA',
  'AACCGGTTUAAUUAAUUUACGCCTTGGCCAA'],
 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA': ['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA'],
 'A

#### Function testing:

In [None]:
#Check:
#(Formatted for presentation)
nbs =  get_neighbors(pairwise_tn_df, D=0.1)
lengs = 0
leng_list = 20*[0]
for n in nbs:
  print(n)
  print(nbs[n])
  print()

  lengs += len(nbs[n])
  leng_list[len(nbs[n])] += 1



print('number of groups:',len(nbs))
print('mean group size:',lengs/len(nbs))
print('distribution of sizes:',leng_list)
print()

AACCGGTTUAAAUAAUUUACGUUTTGGCCAA
['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA']

AACCGGTTAAUUAAUUUACGUGGTTGGCCAA
['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA']

AACCGGTTAAUAAUUUACGCUAATTGGCCAA
[]

AACCGGTTCAAAUAAUUUACGUUTTGGCCAA
['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA']

AACCGGTTCUGUAAUUUACGUGGTTGGCCAA
['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA']

AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA
['AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA']

AACCGGTTUUAAUUAAUUUACGCTTGGCCAA
['AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA']

AACCGGTTAAAGUUAAUUUACGGTTGGCCAA
[]

AACCGGTTAAAUUAAUUUACGCGTTGGCCAA
['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGCCTTGGCCAA']

AACCGGTTAUGUAAUUUACGUGUTTGGCCAA
['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA']

AACCGGTTUCAAUUUACACUAAUTTGGCCAA
[]

AACCGGTTAUUUAUUUAUGGAUCTT

In [None]:
#Scan over a range of cutoff distances D:
for x in range (10, 30, 2):
  nbs =  get_neighbors(pairwise_tn_df, D=(x/100))
  lengs = 0
  leng_list = 100*[0]
  for n in nbs:
    lengs += len(nbs[n])
    leng_list[len(nbs[n])] += 1

  print('D:',x/100)
  print('mean group size:',lengs/len(nbs))
  print('distribution of sizes:',leng_list)
  print()

D: 0.1
mean group size: 1.952
distribution of sizes: [232, 59, 51, 39, 26, 37, 19, 17, 10, 6, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

D: 0.12
mean group size: 4.848
distribution of sizes: [146, 41, 30, 28, 26, 38, 20, 21, 19, 32, 25, 17, 22, 8, 6, 6, 2, 7, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

D: 0.14
mean group size: 4.86
distribution of sizes: [146, 41, 30, 28, 26, 38, 20, 20, 20, 31, 25, 15, 25, 8, 6, 6, 2, 7, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

The group sizes get a bit ridiculous as you scan through a fairly reasonable range of T-N distance values. In other words, the capacity of this function to weed out sequences depends strongly on the variable 'D'. This could probably be operationalized into a "tree" process of sorts: scan through a range of D values, and present the resulting sequences and their API scores in a tree, where the levels are weeded progressively. Based on the number of sequences you'd like to end up with, you can pick a cutoff point.
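
The tallies above rely on a list of fixed length (`100*[0]`), which raises an `IndexError` once any neighborhood reaches that size and prints long runs of zeros otherwise. A sketch using `collections.Counter` avoids both. `toy_nbs` is a made-up centroid-to-neighbors dictionary in the same shape that `get_neighbors` returns, and `size_distribution` is a hypothetical helper name.

```python
from collections import Counter


def size_distribution(nbs):
    """Return {neighborhood size: number of centroids with that size}, sorted by size."""
    counts = Counter(len(neighbors) for neighbors in nbs.values())
    return dict(sorted(counts.items()))


# Made-up example in the shape returned by get_neighbors.
toy_nbs = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}

dist_of_sizes = size_distribution(toy_nbs)
mean_size = sum(len(v) for v in toy_nbs.values()) / len(toy_nbs)
```

Only sizes that actually occur appear in the result, so the printout stays short however large 'D' becomes.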

### Hit-list:

1. 'NS': The first algorithm that I came up with was:
   a. Look at the sizes of all neighborhoods, and determine the maximum size.
   b. For all neighborhoods of that size, determine which sequence has the lowest API-score.
   c. Add that sequence to the hit-list, to be pruned.

   Note that sequences are prioritized for pruning by "neighborhood size", that is, by proximity to a sequence that has a lot of neighbors.

2. 'API': Another potential algorithm:
   a. Look for all sequences which have the lowest API score in their own neighborhood.
   b. Of these, determine which has the largest neighborhood (is the most redundant), and add it to the hit-list.

In both cases, the function needs to return a "stop" signal, or something like it, if the largest group size is at or below the threshold 'N'. This would be how we iterate and stop, progressively weeding out sequences until we reach the desired size. Unless... maybe doing that is pointless.
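
Since the cells that follow explore 'NS', here is a rough sketch of what the 'API' rule might look like, including the "stop" signal. It is one possible reading of the description, not a settled design: `api_hit` is a hypothetical name, "lowest in its own neighborhood" is interpreted as strictly lower than every neighbor, `None` serves as the stop signal, and the data are made up.

```python
def api_hit(nbs, scores, N=0):
    """Return one sequence to prune under the 'API' rule, or None as the stop signal.

    nbs:    dict mapping centroid -> list of neighbors (shape of get_neighbors output).
    scores: dict mapping sequence -> API score.
    """
    if not nbs or max(len(v) for v in nbs.values()) <= N:
        return None  # the largest neighborhood is already at or below the threshold
    # Centroids that score strictly below every one of their neighbors (assumed reading).
    candidates = [
        s for s, neighbors in nbs.items()
        if neighbors and all(scores[s] < scores[t] for t in neighbors)
    ]
    if not candidates:
        return None  # nothing can be pruned under this rule
    # Of these, pick the most redundant one; sorting makes ties deterministic.
    return max(sorted(candidates), key=lambda s: len(nbs[s]))


# Made-up toy data.
toy_nbs = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
toy_scores = {"a": 0.3, "b": 0.5, "c": 0.4, "d": 0.2}

hit = api_hit(toy_nbs, toy_scores)
```

Requiring a strictly lower score means a centroid that merely ties with its neighbors is never selected by this sketch.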

#### Exploration:

In [None]:
inp = get_neighbors(pairwise_tn_df, D=0.1)

In [None]:
inp

In [None]:
def NS(inp):
  ls = []
  for n in inp:
    print(n)
    print(inp[n])
    print()

    ls.append(len(inp[n]))

  print(ls)
  print(max(ls))

In [None]:
NS(inp)

AACCGGTTUAAAUAAUUUACGUUTTGGCCAA
['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA']

AACCGGTTAAUUAAUUUACGUGGTTGGCCAA
['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA']

AACCGGTTAAUAAUUUACGCUAATTGGCCAA
[]

AACCGGTTCAAAUAAUUUACGUUTTGGCCAA
['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA']

AACCGGTTCUGUAAUUUACGUGGTTGGCCAA
['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA']

AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA
['AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA']

AACCGGTTUUAAUUAAUUUACGCTTGGCCAA
['AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA']

AACCGGTTAAAGUUAAUUUACGGTTGGCCAA
[]

AACCGGTTAAAUUAAUUUACGCGTTGGCCAA
['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGCCTTGGCCAA']

AACCGGTTAUGUAAUUUACGUGUTTGGCCAA
['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA']

AACCGGTTUCAAUUUACACUAAUTTGGCCAA
[]

AACCGGTTAUUUAUUUAUGGAUCTT

In [None]:
def NS(inp):
  ls = []
  for n in inp:
    ls.append(len(inp[n]))
  # Get the maximum neighborhood size:
  ml = max(ls)
  #Now, for all neighborhoods of that size,
  #return the neighborhood.
  for n in inp:
    if len(inp[n]) == ml:
      print (n)
      print (inp[n])
      print ()

In [None]:
NS(inp)

AACCGGTTAUUUGUGGUUUAUGATTGGCCAA
['AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA', 'AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA', 'AACCGGTTUGUUGUGGUUUAUGATTGGCCAA', 'AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA', 'AACCGGTTAUCUGUGGUUUAUGUTTGGCCAA', 'AACCGGTTCAUUGUGGUUUAUGATTGGCCAA', 'AACCGGTTAAGUGUGGUUUAUGATTGGCCAA', 'AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 'AACCGGTTACUUGUGGUUUAUGGTTGGCCAA', 'AACCGGTTGUAUGUGGUUUAUGATTGGCCAA', 'AACCGGTTCUAUGUGGUUUAUGATTGGCCAA']

AACCGGTTGUUGUGGUUUAUGACTTGGCCAA
['AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA', 'AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA', 'AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA', 'AACCGGTTGUUGUGGUUUAUGCGTTGGCCAA', 'AACCGGTTGGUGUGGUUUAUGAGTTGGCCAA', 'AACCGGTTCUUGUGGUUUAUGAATTGGCCAA', 'AACCGGTTGUUGUGGUUUAAUACTTGGCCAA', 'AACCGGTTUUUGUGGUUUAUGAATTGGCCAA', 'AACCGGTTAUUGUGGUUUAUGACTTGGCCAA', 'AACCGGTTGAUGUGGUUUAUGAATTGGCCAA', 'AACCGGTTAUUGUGGUUUAUGAATTGGCCAA']



Now, we return to 'API_dict', the object we made way back when.
The function must report the API score associated with each sequence.

In [None]:
def NS(inp):
  ls = []
  for n in inp:
    ls.append(len(inp[n]))
  ml = max(ls)
  for n in inp:
    if len(inp[n]) == ml:
      print (n, API_dict[n])
      #Get API value for each. It's a bit awkward given that this is a list:
      lw_APIs = []
      for i in inp[n]:
        lw_APIs.append([i,API_dict[i]])
      print (lw_APIs)
      print ()
      print ()

In [None]:
NS(inp)

AACCGGTTAUUUGUGGUUUAUGATTGGCCAA 0.4571428571428571
[['AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTUGUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTAUCUGUGGUUUAUGUTTGGCCAA', 0.4571428571428571], ['AACCGGTTCAUUGUGGUUUAUGATTGGCCAA', 0.4285714285714285], ['AACCGGTTAAGUGUGGUUUAUGATTGGCCAA', 0.4285714285714285], ['AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 0.4285714285714285], ['AACCGGTTACUUGUGGUUUAUGGTTGGCCAA', 0.4285714285714285], ['AACCGGTTGUAUGUGGUUUAUGATTGGCCAA', 0.4], ['AACCGGTTCUAUGUGGUUUAUGATTGGCCAA', 0.3714285714285714]]


AACCGGTTGUUGUGGUUUAUGACTTGGCCAA 0.4571428571428571
[['AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGCGTTGGCCAA', 0.4571428571428571], ['AACCGGTTGGUGUGGUUUAUGAGTTGGCCAA', 0.4285714285714285

This looks good, but it brings up a somewhat distressing point. What do you do if the largest neighborhood has all sequences with the same API score?

My plan is to put every sequence that shares the lowest API score in the neighborhood on the hit-list. But you reach a "pruning impasse" when every sequence in the neighborhood has the same score, so there is no worst sequence left to cut. Would the API algorithm have the same problem? It would. The equivalent is a sequence that technically has the lowest API score in its neighborhood only because all seqs in the neighborhood have the same score.

So, in both cases, the program needs to be able to ignore these impasses and continue pruning.

For this function:

1. If some, but not all, of the largest neighborhoods have reached an impasse, continue to prune the ones that have not.
2. If all of the largest neighborhoods have reached impasse, then you should skip to the next level.

Algorithm: Write a loop that starts at the maximum neighborhood size and IDs every sequence that can be killed at that size. The sequences to be killed are those with the minimal API score in their neighborhood, excluding impasse neighborhoods. If the loop finds anything to kill, it terminates. Otherwise, it keeps stepping down one size at a time until it does.
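As a rough illustration of the step-down idea, here is a minimal sketch on made-up data. The function name `step_down_hit_list`, the toy sequence labels, and the toy scores are all assumptions for illustration; they are not part of the notebook. Unlike the cells that follow, this sketch takes the scores as an argument and records each sequence only once.

```python
def step_down_hit_list(neighborhoods, scores):
    """Return [sequence, score] pairs to prune, starting from the largest neighborhoods."""
    if not neighborhoods:
        return []
    max_size = max(len(nbrs) for nbrs in neighborhoods.values())

    # Walk down from the largest neighborhood size until something is prunable.
    for size in range(max_size, 0, -1):
        hits = {}
        for center, nbrs in neighborhoods.items():
            if len(nbrs) != size:
                continue
            members = [center] + list(nbrs)
            member_scores = [scores[m] for m in members]
            low, high = min(member_scores), max(member_scores)
            if low == high:
                continue  # impasse: every member has the same score
            for m in members:
                if scores[m] == low:
                    hits[m] = scores[m]  # a dict keeps each sequence only once
        if hits:
            return [[seq, score] for seq, score in hits.items()]
    return []


# Toy data (hypothetical labels and scores, not real Apta-MCTS output).
toy_neighborhoods = {
    "A": ["B", "C"],  # largest size, but an impasse
    "B": ["A", "C"],  # largest size, also an impasse
    "C": ["A", "B"],  # largest size, also an impasse
    "D": ["E"],       # one level down, prunable
    "E": ["D"],
}
toy_scores = {"A": 0.4, "B": 0.4, "C": 0.4, "D": 0.5, "E": 0.3}

result = step_down_hit_list(toy_neighborhoods, toy_scores)
print(result)
```

In this toy case all three neighborhoods of size 2 are impasses, so the loop steps down to size 1 and picks "E".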

In [None]:
def NS(inp):

  #Finally creating the hit-list:
  hit_list = []

  #This block simply gives the maximum size of any neighborhood.
  ls = []
  for n in inp:
    ls.append(len(inp[n]))
  ml = max(ls)

  #Looking at all neighborhoods of this maximum size.
  for n in inp:
    if len(inp[n]) == ml:
      #Create an output list.
      lw_APIs = []
      #Append the centroid
      lw_APIs.append([n, API_dict[n]])
      for i in inp[n]:
        #Append the neighbors
        lw_APIs.append([i,API_dict[i]])
      print (lw_APIs)

      #To generate the spread of API scores
      lAPIs = []
      for j in lw_APIs:
        lAPIs.append(j[1])
      print(min(lAPIs),max(lAPIs))

      for j in lw_APIs:
        if j[1] == min(lAPIs) and j[1] != max(lAPIs):
          hit_list.append(j)


      print ()
      print ()

  print('hit_list:', hit_list)

In [None]:
NS(inp)

[['AACCGGTTAUUUGUGGUUUAUGATTGGCCAA', 0.4571428571428571], ['AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTUGUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTAUCUGUGGUUUAUGUTTGGCCAA', 0.4571428571428571], ['AACCGGTTCAUUGUGGUUUAUGATTGGCCAA', 0.4285714285714285], ['AACCGGTTAAGUGUGGUUUAUGATTGGCCAA', 0.4285714285714285], ['AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 0.4285714285714285], ['AACCGGTTACUUGUGGUUUAUGGTTGGCCAA', 0.4285714285714285], ['AACCGGTTGUAUGUGGUUUAUGATTGGCCAA', 0.4], ['AACCGGTTCUAUGUGGUUUAUGATTGGCCAA', 0.3714285714285714]]
0.3714285714285714 0.4857142857142857


[['AACCGGTTGUUGUGGUUUAUGACTTGGCCAA', 0.4571428571428571], ['AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGCGTTGGCCAA', 0.4571428571428571], ['AA

Almost there....

In [None]:
def NS(inp):

  #Finally creating the hit-list:
  hit_list = []

  #This block simply gives the maximum size of any neighborhood.
  ls = []
  for n in inp:
    ls.append(len(inp[n]))
  ml = max(ls)


  #Iterate this block based on whether the loop has found something to prune yet:
  for m in range(0, ml):
    print('legnth of interest:', ml-m)
    print('legnth of hit-list',len(hit_list))
    print()

    if len(hit_list) == 0:

      #Looking at all neighborhoods of this maximum size.
      for n in inp:
        if len(inp[n]) == (ml-m):  #Notice the new '-m' condition.
          #Create an output list.
          lw_APIs = []
          #Append the centroid
          lw_APIs.append([n, API_dict[n]])
          for i in inp[n]:
            #Append the neighbors
            lw_APIs.append([i,API_dict[i]])
          #Output the output list.
          print (lw_APIs)

          #To generate the spread of API scores in each list.
          lAPIs = []
          for j in lw_APIs:
            lAPIs.append(j[1])
          print(min(lAPIs),max(lAPIs))

          #Add all the prunable sequences to the hit-list:
          for j in lw_APIs:
            if j[1] == min(lAPIs) and j[1] != max(lAPIs):
              hit_list.append(j)

          print ()
          print ()

  print('hit_list:', hit_list)

In [None]:
NS(inp)

legnth of interest: 11
legnth of hit-list 0

[['AACCGGTTAUUUGUGGUUUAUGATTGGCCAA', 0.4571428571428571], ['AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTUGUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTAUCUGUGGUUUAUGUTTGGCCAA', 0.4571428571428571], ['AACCGGTTCAUUGUGGUUUAUGATTGGCCAA', 0.4285714285714285], ['AACCGGTTAAGUGUGGUUUAUGATTGGCCAA', 0.4285714285714285], ['AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 0.4285714285714285], ['AACCGGTTACUUGUGGUUUAUGGTTGGCCAA', 0.4285714285714285], ['AACCGGTTGUAUGUGGUUUAUGATTGGCCAA', 0.4], ['AACCGGTTCUAUGUGGUUUAUGATTGGCCAA', 0.3714285714285714]]
0.3714285714285714 0.4857142857142857


[['AACCGGTTGUUGUGGUUUAUGACTTGGCCAA', 0.4571428571428571], ['AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUG

I'm fairly satisfied. But I still need to check whether the two conditions I mentioned are working. To do so, I need to temporarily "break" API_dict.

In [None]:
#Taking the first group of 11, I convert it into an impasse group of 11.
change_list = ['AACCGGTTAUUUGUGGUUUAUGATTGGCCAA', 'AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA', 'AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA', 'AACCGGTTUGUUGUGGUUUAUGATTGGCCAA', 'AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA', 'AACCGGTTAUCUGUGGUUUAUGUTTGGCCAA', 'AACCGGTTCAUUGUGGUUUAUGATTGGCCAA', 'AACCGGTTAAGUGUGGUUUAUGATTGGCCAA', 'AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 'AACCGGTTACUUGUGGUUUAUGGTTGGCCAA', 'AACCGGTTGUAUGUGGUUUAUGATTGGCCAA', 'AACCGGTTCUAUGUGGUUUAUGATTGGCCAA']
for item in change_list:
  API_dict[item] = 0.4857142857142857

#Test:
NS(inp)

legnth of interest: 11
legnth of hit-list 0

[['AACCGGTTAUUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTUGUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTAUCUGUGGUUUAUGUTTGGCCAA', 0.4857142857142857], ['AACCGGTTCAUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAAGUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTACUUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUAUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTCUAUGUGGUUUAUGATTGGCCAA', 0.4857142857142857]]
0.4857142857142857 0.4857142857142857


[['AACCGGTTGUUGUGGUUUAUGACTTGGCCAA', 0.4571428571428571], ['AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA', 0.4857142857142857], [

Success! The impasse group is ignored, and the loop only adds the sequences from the other group. Now to test whether the function will shift to a lower level if it does not find anything to drop.

In [None]:
#Change the second group as well, so that the function will hopefully skip down a level.
change_list = ['AACCGGTTGUUGUGGUUUAUGACTTGGCCAA', 'AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA', 'AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA', 'AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA', 'AACCGGTTGUUGUGGUUUAUGCGTTGGCCAA', 'AACCGGTTGGUGUGGUUUAUGAGTTGGCCAA', 'AACCGGTTCUUGUGGUUUAUGAATTGGCCAA', 'AACCGGTTGUUGUGGUUUAAUACTTGGCCAA', 'AACCGGTTUUUGUGGUUUAUGAATTGGCCAA', 'AACCGGTTAUUGUGGUUUAUGACTTGGCCAA', 'AACCGGTTGAUGUGGUUUAUGAATTGGCCAA', 'AACCGGTTAUUGUGGUUUAUGAATTGGCCAA']
for item in change_list:
  API_dict[item] = 0.4857142857142857
#Test:
NS(inp)

legnth of interest: 11
legnth of hit-list 0

[['AACCGGTTAUUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTUGUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA', 0.4857142857142857], ['AACCGGTTAUCUGUGGUUUAUGUTTGGCCAA', 0.4857142857142857], ['AACCGGTTCAUUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAAGUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTACUUGUGGUUUAUGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUAUGUGGUUUAUGATTGGCCAA', 0.4857142857142857], ['AACCGGTTCUAUGUGGUUUAUGATTGGCCAA', 0.4857142857142857]]
0.4857142857142857 0.4857142857142857


[['AACCGGTTGUUGUGGUUUAUGACTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA', 0.4857142857142857], ['AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA', 0.4857142857142857], [

I think it works!
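A side note on the two tests above: overwriting the global dictionary by hand means it has to be rebuilt afterwards. One possible alternative, sketched here under assumptions, is `unittest.mock.patch.dict` from the standard library, which applies temporary values and restores the dictionary on exit. The names `toy_API_dict` and `lowest_scorers` are hypothetical stand-ins, not functions from this notebook.

```python
from unittest.mock import patch

# Toy stand-in for the notebook's global API_dict (values made up for illustration).
toy_API_dict = {"SEQ_A": 0.40, "SEQ_B": 0.37, "SEQ_C": 0.34}


def lowest_scorers():
    """Toy function that, like NS, reads the global dictionary directly."""
    low = min(toy_API_dict.values())
    return [seq for seq, score in toy_API_dict.items() if score == low]


before = dict(toy_API_dict)

# Force an impasse (all scores equal) only for the duration of the with-block.
overrides = {seq: 0.40 for seq in toy_API_dict}
with patch.dict(toy_API_dict, overrides):
    tied = lowest_scorers()  # every sequence ties while the override is active

# On leaving the block, patch.dict restores the original contents.
restored = (toy_API_dict == before)
print(tied, restored)
```

With this pattern the reset step would not be needed, since the original scores come back as soon as the `with` block ends.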

##### Don't forget to reset the API_dict

In [None]:
API_dict = {}
for n in o_seqs:
  API_dict[n[1]] = n[0]

API_dict

{'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA': 0.4,
 'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA': 0.4,
 'AACCGGTTAAUAAUUUACGCUAATTGGCCAA': 0.4,
 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA': 0.4,
 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA': 0.4,
 'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA': 0.4,
 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAAAGUUAAUUUACGGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAAAUUAAUUUACGCGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUCAAUUUACACUAAUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAUUUAUUUAUGGAUCTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUCGCAAUUUAUGGUGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTGGAAUUAAUUUACGCTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUUCAUAAUUUACAAATTGGCCAA': 0.3714285714285714,
 'AACCGGTTCCAAAUUAAUUUACGTTGGCCAA': 0.3714285714285714,
 'AACCGGTTAGUUAAUUUACGGCUTTGGCCAA': 0.3714285714285714,
 'AACCGGTTUGUGUAAUUUACGUATTGGCCAA': 0.3714285714285714,
 'AACCGGTTUUCAUUAAUUUACGGT

#### 'NS' Function

In [None]:
def NS(inp):
  hit_list = []
  #This block simply gives the maximum size of any neighborhood.
  ls = []
  for n in inp:
    ls.append(len(inp[n]))
  ml = max(ls)
  #Iterate this block based on whether the loop has found something to prune yet:
  for m in range(0, ml):
    if len(hit_list) == 0:
      #Looking at all neighborhoods of this maximum size.
      for n in inp:
        if len(inp[n]) == (ml-m):  #Notice the new '-m' condition.
          #Create an output list.
          lw_APIs = []
          #Append the centroid
          lw_APIs.append([n, API_dict[n]])
          for i in inp[n]:
            #Append the neighbors
            lw_APIs.append([i,API_dict[i]])

          #To generate the spread of API scores in each list.
          lAPIs = []
          for j in lw_APIs:
            lAPIs.append(j[1])

          #Add all the prunable sequences to the hit-list:
          for j in lw_APIs:
            if j[1] == min(lAPIs) and j[1] != max(lAPIs):
              hit_list.append(j)

  return ('hit_list:', hit_list)

In [None]:
inp = get_neighbors(pairwise_tn_df, D=0.1)

In [None]:
NS(inp)

('hit_list:',
 [['AACCGGTTCUAUGUGGUUUAUGATTGGCCAA', 0.3714285714285714],
  ['AACCGGTTCUUGUGGUUUAUGAATTGGCCAA', 0.4],
  ['AACCGGTTGUUGUGGUUUAAUACTTGGCCAA', 0.4],
  ['AACCGGTTUUUGUGGUUUAUGAATTGGCCAA', 0.4],
  ['AACCGGTTAUUGUGGUUUAUGACTTGGCCAA', 0.4],
  ['AACCGGTTGAUGUGGUUUAUGAATTGGCCAA', 0.4],
  ['AACCGGTTAUUGUGGUUUAUGAATTGGCCAA', 0.4]])

#### Exploration:

In [None]:
def API(inp):
  hit_list = []
  #First, identify all sequences which have the minimum API score in their own neighborhood.
  for n in inp:
    print(n, inp[n])

In [None]:
API(inp)

AACCGGTTUAAAUAAUUUACGUUTTGGCCAA ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA']
AACCGGTTAAUUAAUUUACGUGGTTGGCCAA ['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA']
AACCGGTTAAUAAUUUACGCUAATTGGCCAA []
AACCGGTTCAAAUAAUUUACGUUTTGGCCAA ['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA']
AACCGGTTCUGUAAUUUACGUGGTTGGCCAA ['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA']
AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA ['AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA']
AACCGGTTUUAAUUAAUUUACGCTTGGCCAA ['AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA', 'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA', 'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA']
AACCGGTTAAAGUUAAUUUACGGTTGGCCAA []
AACCGGTTAAAUUAAUUUACGCGTTGGCCAA ['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGCCTTGGCCAA']
AACCGGTTAUGUAAUUUACGUGUTTGGCCAA ['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA']
AACCGGTTUCAAUUUACACUAAUTTGGCCAA []
AACCGGTTAUUUAUUUAUGGAUCTTGGCCAA []
A

Identify all min sequences.

In [None]:
def API(inp):
  hit_list = []
  #First, identify all sequences which have the minimum API score in their own neighborhood.
  for n in inp:
    print(n, inp[n])

    #Create an output list.
    lw_APIs = []
    #Append the centroid
    lw_APIs.append([n, API_dict[n]])
    for i in inp[n]:
      #Append the neighbors
      lw_APIs.append([i,API_dict[i]])

    #To generate the spread of API scores in each list.
    lAPIs = []
    for j in lw_APIs:
      lAPIs.append(j[1])

    print(lw_APIs)
    print(min(lAPIs), max(lAPIs))

    if API_dict[n] == min(lAPIs) and API_dict[n] != max(lAPIs):
      print('Add to list, if max_l.')

    print()

In [None]:
API(inp)

AACCGGTTUAAAUAAUUUACGUUTTGGCCAA ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA']
[['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 0.4], ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 0.4], ['AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 0.3714285714285714], ['AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 0.3428571428571428], ['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 0.3428571428571428]]
0.3428571428571428 0.4

AACCGGTTAAUUAAUUUACGUGGTTGGCCAA ['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA']
[['AACCGGTTAAUUAAUUUACGUGGTTGGCCAA', 0.4], ['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA', 0.3714285714285714]]
0.3714285714285714 0.4

AACCGGTTAAUAAUUUACGCUAATTGGCCAA []
[['AACCGGTTAAUAAUUUACGCUAATTGGCCAA', 0.4]]
0.4 0.4

AACCGGTTCAAAUAAUUUACGUUTTGGCCAA ['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA']
[['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 0.4], ['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 0.4]]
0.4 0.4

AACCGGTTCUGUAAUUUACGUGGTTGGCCAA ['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA']
[['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', 0

This time around, a simpler strategy would be to add all candidates for the "cut" to a "provisional hit list", and then select whichever are of the maximum length.

In [None]:
def NS(inp):
  hit_list = []
  #This block simply gives the maximum size of any neighborhood.
  ls = []
  for n in inp:
    ls.append(len(inp[n]))
  ml = max(ls)
  #Iterate this block based on whether the loop has found something to prune yet:
  for m in range(0, ml):
    if len(hit_list) == 0:
      #Looking at all neighborhoods of this maximum size.
      for n in inp:
        if len(inp[n]) == (ml-m):  #Notice the new '-m' condition.
          #Create an output list.
          lw_APIs = []
          #Append the centroid
          lw_APIs.append([n, API_dict[n]])
          for i in inp[n]:
            #Append the neighbors
            lw_APIs.append([i,API_dict[i]])

          #To generate the spread of API scores in each list.
          lAPIs = []
          for j in lw_APIs:
            lAPIs.append(j[1])

          #Add all the prunable sequences to the hit-list:
          for j in lw_APIs:
            if j[1] == min(lAPIs) and j[1] != max(lAPIs):
              hit_list.append(j)

  return ('hit_list:', hit_list)

In [None]:
def API(inp):
  hit_list = []
  provis_hit_list = []

  #First, identify all sequences which have the minimum API score in their own neighborhood.
  for n in inp:
    print(n, inp[n])

    #Create an output list.
    lw_APIs = []
    #Append the centroid
    lw_APIs.append([n, API_dict[n]])
    for i in inp[n]:
      #Append the neighbors
      lw_APIs.append([i,API_dict[i]])

    #To generate the spread of API scores in each list.
    lAPIs = []
    for j in lw_APIs:
      lAPIs.append(j[1])

    print(lw_APIs)
    print(min(lAPIs), max(lAPIs))

    if API_dict[n] == min(lAPIs) and API_dict[n] != max(lAPIs):
      provis_hit_list.append([n, API_dict[n], len(inp[n])])


    print()

  print(provis_hit_list)
  print(hit_list)

In [None]:
API(inp)

AACCGGTTUAAAUAAUUUACGUUTTGGCCAA ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA']
[['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 0.4], ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 0.4], ['AACCGGTTUAAAUAAUUUACGGCTTGGCCAA', 0.3714285714285714], ['AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 0.3428571428571428], ['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 0.3428571428571428]]
0.3428571428571428 0.4

AACCGGTTAAUUAAUUUACGUGGTTGGCCAA ['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA']
[['AACCGGTTAAUUAAUUUACGUGGTTGGCCAA', 0.4], ['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA', 0.3714285714285714]]
0.3714285714285714 0.4

AACCGGTTAAUAAUUUACGCUAATTGGCCAA []
[['AACCGGTTAAUAAUUUACGCUAATTGGCCAA', 0.4]]
0.4 0.4

AACCGGTTCAAAUAAUUUACGUUTTGGCCAA ['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA']
[['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', 0.4], ['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', 0.4]]
0.4 0.4

AACCGGTTCUGUAAUUUACGUGGTTGGCCAA ['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA']
[['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', 0

The final step would be to select sequences from the provisional hit-list to then add to the actual hit-list. I had in mind to do so based on neighborhood size, such that the "minimum" sequence with the most neighbors gets cut first.

I'm noticing, though, that there isn't really a correlation between API score and neighborhood size. Some sequences have many neighbors and also a high API score. Just looking at the provisional hit-list, I'd want to cut the sequences with the lowest API scores first.

I don't quite have time to try both approaches, so I will for now go with cutting the "lowest local minimum". Clearly though, this selection algorithm will warrant further attention.
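To make the two candidate selection rules concrete, here is a small sketch that applies both to the same provisional hit-list. The entries follow the `[sequence, API score, neighborhood size]` layout used above, but the labels and numbers are invented, and the helper names `pick_lowest_score` and `pick_most_neighbors` are hypothetical.

```python
# Provisional hit-list entries: [sequence, API score, neighborhood size].
# Toy values for illustration only.
provisional = [
    ["SEQ_A", 0.34, 1],
    ["SEQ_B", 0.31, 5],
    ["SEQ_C", 0.31, 2],
    ["SEQ_D", 0.45, 7],
]


def pick_lowest_score(entries):
    """Rule used for now: cut the 'lowest local minimum' first."""
    if not entries:
        return []
    low = min(entry[1] for entry in entries)
    return [entry for entry in entries if entry[1] == low]


def pick_most_neighbors(entries):
    """Alternative rule: cut the local minimum with the most neighbors first."""
    if not entries:
        return []
    largest = max(entry[2] for entry in entries)
    return [entry for entry in entries if entry[2] == largest]


print(pick_lowest_score(provisional))
print(pick_most_neighbors(provisional))
```

On this toy list the two rules disagree: the first selects SEQ_B and SEQ_C, the second selects SEQ_D, which has a comparatively high score. That is the tension described above.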

In [None]:
def API(inp):
  hit_list = []
  provis_hit_list = []

  #First, identify all sequences which have the minimum API score in their own neighborhood.
  for n in inp:
    #Create an output list.
    lw_APIs = []
    #Append the centroid
    lw_APIs.append([n, API_dict[n]])
    for i in inp[n]:
      #Append the neighbors
      lw_APIs.append([i,API_dict[i]])

    #To generate the spread of API scores in each list.
    lAPIs = []
    for j in lw_APIs:
      lAPIs.append(j[1])

    if API_dict[n] == min(lAPIs) and API_dict[n] != max(lAPIs):
      provis_hit_list.append([n, API_dict[n], len(inp[n])])

  #Selection of sequences from the provisional hit-list which have the lowest score.
  seq_APIs = []
  for seq in provis_hit_list:
    seq_APIs.append(seq[1])

  print(provis_hit_list)
  print(min(seq_APIs))

  for seq in provis_hit_list:
    if seq[1] == min(seq_APIs):
      hit_list.append(seq)

  print(hit_list)

In [None]:
API(inp)

[['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA', 0.3714285714285714, 1], ['AACCGGTTGGUGUGAGGUCUGGCTTGGCCAA', 0.3428571428571428, 1], ['AACCGGTTUUUUAGUUUAUGCGUTTGGCCAA', 0.3428571428571428, 1], ['AACCGGTTCGUAAUUAAUUUACGTTGGCCAA', 0.3428571428571428, 2], ['AACCGGTTUAAUUAAUUUACGUATTGGCCAA', 0.3428571428571428, 4], ['AACCGGTTGCCAAUUAAUUUACGTTGGCCAA', 0.3428571428571428, 2], ['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA', 0.3428571428571428, 4], ['AACCGGTTGGCAUUAAUUUACGUTTGGCCAA', 0.3428571428571428, 4], ['AACCGGTTCCCAUUAAUUUACGUTTGGCCAA', 0.3428571428571428, 3], ['AACCGGTTCGUUAAUUUACGUGUTTGGCCAA', 0.3428571428571428, 1], ['AACCGGTTUAAUUAAUUUACGCCTTGGCCAA', 0.3428571428571428, 4], ['AACCGGTTGUAAUUAAUUUACGGTTGGCCAA', 0.3142857142857143, 5], ['AACCGGTTGGUUUGUGGUUUAUGTTGGCCAA', 0.4571428571428571, 5], ['AACCGGTTCGGUUGUGGUUUAUGTTGGCCAA', 0.4571428571428571, 7], ['AACCGGTTUUGUGGUUUAUGUAATTGGCCAA', 0.4285714285714285, 2], ['AACCGGTTAUCUGUGGUUUAUGGTTGGCCAA', 0.4285714285714285, 5], ['AACCGGTTGUGUGGUUUAUGGGGTTGGCCAA', 0.4

Most of these are fairly well connected. I feel good about cutting them.

This development raises a different question: What if there were an additional argument in the overall 'kill_neighbors' function that limits the number of rounds of cutting? The number of candidates remaining could then be tailored to the experimenter's needs. Its default would be 10,000, which is effectively no limit.

#### API Function:

In [None]:
def API(inp):
  hit_list = []
  provis_hit_list = []

  #First, identify all sequences which have the minimum API score in their own neighborhood.
  for n in inp:
    #Create an output list.
    lw_APIs = []
    #Append the centroid
    lw_APIs.append([n, API_dict[n]])
    for i in inp[n]:
      #Append the neighbors
      lw_APIs.append([i,API_dict[i]])

    #To generate the spread of API scores in each list.
    lAPIs = []
    for j in lw_APIs:
      lAPIs.append(j[1])

    if API_dict[n] == min(lAPIs) and API_dict[n] != max(lAPIs):
      provis_hit_list.append([n, API_dict[n], len(inp[n])])

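  #If nothing is prunable (every neighborhood is an impasse), return the empty hit-list.
  #Otherwise min() below would be called on an empty list and raise a ValueError.
  if len(provis_hit_list) == 0:
    return(hit_list)

  #Selection of sequences from the provisional hit-list which have the lowest score.
  seq_APIs = []
  for seq in provis_hit_list:
    seq_APIs.append(seq[1])
  lowest = min(seq_APIs)

  for seq in provis_hit_list:
    if seq[1] == lowest:
      hit_list.append(seq)

  return(hit_list)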

In [None]:
API(inp)

[['AACCGGTTGUAAUUAAUUUACGGTTGGCCAA', 0.3142857142857143, 5],
 ['AACCGGTTACUAAUUAAAUCGUUTTGGCCAA', 0.3142857142857143, 3],
 ['AACCGGTTAUAAUUAAAUCGUGUTTGGCCAA', 0.3142857142857143, 5],
 ['AACCGGTTUAAUUAAAUCCUCGCTTGGCCAA', 0.3142857142857143, 1],
 ['AACCGGTTCAAAUUAAAUCGUUGTTGGCCAA', 0.3142857142857143, 2],
 ['AACCGGTTAUAAUUAAAUCGUUATTGGCCAA', 0.3142857142857143, 5],
 ['AACCGGTTAUAAAUUAAAUCGUUTTGGCCAA', 0.3142857142857143, 2],
 ['AACCGGTTACAAUUAAAUCGUAATTGGCCAA', 0.3142857142857143, 4],
 ['AACCGGTTAUAAUUAAAUCGCGGTTGGCCAA', 0.3142857142857143, 2],
 ['AACCGGTTACAUAAUUAAAUCGUTTGGCCAA', 0.3142857142857143, 5],
 ['AACCGGTTAAAUUAAAUCGUACUTTGGCCAA', 0.3142857142857143, 2],
 ['AACCGGTTCAAUUAAAUCGUUGUTTGGCCAA', 0.3142857142857143, 2],
 ['AACCGGTTCGGAAUUAAAUCGUATTGGCCAA', 0.3142857142857143, 4],
 ['AACCGGTTAGCUAAUUAAAUCGUTTGGCCAA', 0.3142857142857143, 9],
 ['AACCGGTTCAAUUAAAUCAUCGGTTGGCCAA', 0.3142857142857143, 1]]

### Combined Function:

In [None]:
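def hit_list(pairwise_tn_df, API_dict, D=0.1, alg='API'):
  #Note: NS and API read the global 'API_dict' directly; the argument is not passed on to them.
  inp = get_neighbors(pairwise_tn_df, D)
  if alg == 'NS':
    #NS returns a ('hit_list:', list) tuple, so keep only the list.
    hl = NS(inp)[1]
  elif alg == 'API':
    hl = API(inp)
  else:
    #Stop here: continuing would fail below because 'hl' was never assigned.
    raise ValueError("Unacceptable algorithm choice. Pick either: a. 'NS' or b. 'API'")

  #Keep only the sequence strings, skipping repeats. NS can list a sequence more than once,
  #and dropping the same label twice in the pruning step would raise a KeyError.
  hl_out = []
  for i in hl:
    if i[0] not in hl_out:
      hl_out.append(i[0])

  return(hl_out)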

In [None]:
hit_list(pairwise_tn_df, API_dict, D=0.1, alg='API')

['AACCGGTTGUAAUUAAUUUACGGTTGGCCAA',
 'AACCGGTTACUAAUUAAAUCGUUTTGGCCAA',
 'AACCGGTTAUAAUUAAAUCGUGUTTGGCCAA',
 'AACCGGTTUAAUUAAAUCCUCGCTTGGCCAA',
 'AACCGGTTCAAAUUAAAUCGUUGTTGGCCAA',
 'AACCGGTTAUAAUUAAAUCGUUATTGGCCAA',
 'AACCGGTTAUAAAUUAAAUCGUUTTGGCCAA',
 'AACCGGTTACAAUUAAAUCGUAATTGGCCAA',
 'AACCGGTTAUAAUUAAAUCGCGGTTGGCCAA',
 'AACCGGTTACAUAAUUAAAUCGUTTGGCCAA',
 'AACCGGTTAAAUUAAAUCGUACUTTGGCCAA',
 'AACCGGTTCAAUUAAAUCGUUGUTTGGCCAA',
 'AACCGGTTCGGAAUUAAAUCGUATTGGCCAA',
 'AACCGGTTAGCUAAUUAAAUCGUTTGGCCAA',
 'AACCGGTTCAAUUAAAUCAUCGGTTGGCCAA']

Looks good.

### Prune_Matrix

Remove all sequences on the hitlist from 'pairwise_tn_df' and 'API_dict'.
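As a minimal sketch of what this removal could look like, the example below drops a hit-list from both a small distance matrix and a score dictionary. The names `toy_df`, `toy_scores` and `toy_hits` and all values are made up for illustration. It assumes the matrix is a pandas DataFrame labelled by sequence on both axes, as 'pairwise_tn_df' appears to be.

```python
import pandas as pd

# Toy symmetric distance matrix, labelled by sequence on both axes (made-up values).
seqs = ["SEQ_A", "SEQ_B", "SEQ_C"]
toy_df = pd.DataFrame(
    [[0.00, 0.05, 0.40],
     [0.05, 0.00, 0.30],
     [0.40, 0.30, 0.00]],
    index=seqs,
    columns=seqs,
)
toy_scores = {"SEQ_A": 0.40, "SEQ_B": 0.31, "SEQ_C": 0.37}
toy_hits = ["SEQ_B"]

# drop() returns a new DataFrame by default, so toy_df itself is left untouched.
# Passing index= and columns= removes the rows and the columns in one call.
pruned_df = toy_df.drop(index=toy_hits, columns=toy_hits)

# Rebuild the score dictionary without the pruned sequences.
hit_set = set(toy_hits)
pruned_scores = {seq: score for seq, score in toy_scores.items() if seq not in hit_set}

print(pruned_df)
print(pruned_scores)
```

Because nothing is modified in place here, no explicit copy is needed before pruning.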

#### Exploration

In [None]:
#reminder:
pairwise_tn_df

Unnamed: 0,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA,AACCGGTTUUAAUUAAUUUACGCTTGGCCAA,AACCGGTTAAAGUUAAUUUACGGTTGGCCAA,AACCGGTTAAAUUAAUUUACGCGTTGGCCAA,AACCGGTTAUGUAAUUUACGUGUTTGGCCAA,...,AACCGGTTAUUGUGGUUGUUGGGTTGGCCAA,AACCGGTTUUGUGGUUGUUUGGCTTGGCCAA,AACCGGTTUUGUGGUUGUUACUCTTGGCCAA,AACCGGTTAAAAAAUGUGGUUACTTGGCCAA,AACCGGTTUCCGCUUCAUUUACGTTGGCCAA,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,AACCGGTTACAUGUGGUUGUGUATTGGCCAA
AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,0.000000,0.536284,0.495175,0.034723,0.564295,0.372718,0.325768,0.380993,0.143741,0.514167,...,0.500044,0.487884,0.496017,0.433961,0.637859,0.559587,0.647795,0.373681,0.519769,0.427483
AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,0.536284,0.000000,0.327228,0.534410,0.104792,0.224886,0.632869,0.433130,0.332553,0.104864,...,0.371428,0.491149,0.564824,0.324056,0.650225,0.486450,0.435475,0.427191,0.509457,0.574162
AACCGGTTAAUAAUUUACGCUAATTGGCCAA,0.495175,0.327228,0.000000,0.494202,0.504296,0.384633,0.633141,0.557520,0.489237,0.442288,...,0.560691,0.630595,0.621325,0.276985,0.568445,0.486208,0.560177,0.555729,0.714769,0.490981
AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,0.034723,0.534410,0.494202,0.000000,0.510740,0.372018,0.379921,0.380363,0.143571,0.512753,...,0.499167,0.558790,0.551832,0.432227,0.753709,0.491765,0.646530,0.372164,0.466866,0.426784
AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,0.564295,0.104792,0.504296,0.510740,0.000000,0.325193,0.555551,0.548395,0.440332,0.067947,...,0.429014,0.374209,0.446446,0.444472,0.665790,0.383185,0.555729,0.563786,0.325085,0.639640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,0.559587,0.486450,0.486208,0.491765,0.383185,0.574411,0.489569,0.499518,0.428057,0.499167,...,0.274994,0.339000,0.431633,0.458788,0.488051,0.000000,0.438718,0.325286,0.515013,0.370783
AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,0.647795,0.435475,0.560177,0.646530,0.555729,0.329303,0.441818,0.377396,0.521533,0.486191,...,0.105272,0.384141,0.537709,0.463304,0.504558,0.438718,0.000000,0.285076,0.319295,0.510667
AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,0.373681,0.427191,0.555729,0.372164,0.563786,0.325286,0.589766,0.509457,0.269658,0.429541,...,0.279358,0.373368,0.488793,0.493195,0.551507,0.325286,0.285076,0.000000,0.592129,0.493008
AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,0.519769,0.509457,0.714769,0.466866,0.325085,0.370348,0.525195,0.491794,0.559801,0.325286,...,0.275424,0.443780,0.588972,0.626902,0.650694,0.515013,0.319295,0.592129,0.000000,0.487932


In [None]:
pruned_df = pairwise_tn_df.copy() ##
for seq in hit_list(pairwise_tn_df, API_dict, D=0.1, alg='API'):
  pruned_df.drop([seq], axis=1, inplace=True)
  pruned_df.drop([seq], axis=0, inplace=True)
pruned_df

Unnamed: 0,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA,AACCGGTTUUAAUUAAUUUACGCTTGGCCAA,AACCGGTTAAAGUUAAUUUACGGTTGGCCAA,AACCGGTTAAAUUAAUUUACGCGTTGGCCAA,AACCGGTTAUGUAAUUUACGUGUTTGGCCAA,...,AACCGGTTAUUGUGGUUGUUGGGTTGGCCAA,AACCGGTTUUGUGGUUGUUUGGCTTGGCCAA,AACCGGTTUUGUGGUUGUUACUCTTGGCCAA,AACCGGTTAAAAAAUGUGGUUACTTGGCCAA,AACCGGTTUCCGCUUCAUUUACGTTGGCCAA,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,AACCGGTTACAUGUGGUUGUGUATTGGCCAA
AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,0.000000,0.536284,0.495175,0.034723,0.564295,0.372718,0.325768,0.380993,0.143741,0.514167,...,0.500044,0.487884,0.496017,0.433961,0.637859,0.559587,0.647795,0.373681,0.519769,0.427483
AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,0.536284,0.000000,0.327228,0.534410,0.104792,0.224886,0.632869,0.433130,0.332553,0.104864,...,0.371428,0.491149,0.564824,0.324056,0.650225,0.486450,0.435475,0.427191,0.509457,0.574162
AACCGGTTAAUAAUUUACGCUAATTGGCCAA,0.495175,0.327228,0.000000,0.494202,0.504296,0.384633,0.633141,0.557520,0.489237,0.442288,...,0.560691,0.630595,0.621325,0.276985,0.568445,0.486208,0.560177,0.555729,0.714769,0.490981
AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,0.034723,0.534410,0.494202,0.000000,0.510740,0.372018,0.379921,0.380363,0.143571,0.512753,...,0.499167,0.558790,0.551832,0.432227,0.753709,0.491765,0.646530,0.372164,0.466866,0.426784
AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,0.564295,0.104792,0.504296,0.510740,0.000000,0.325193,0.555551,0.548395,0.440332,0.067947,...,0.429014,0.374209,0.446446,0.444472,0.665790,0.383185,0.555729,0.563786,0.325085,0.639640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,0.559587,0.486450,0.486208,0.491765,0.383185,0.574411,0.489569,0.499518,0.428057,0.499167,...,0.274994,0.339000,0.431633,0.458788,0.488051,0.000000,0.438718,0.325286,0.515013,0.370783
AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,0.647795,0.435475,0.560177,0.646530,0.555729,0.329303,0.441818,0.377396,0.521533,0.486191,...,0.105272,0.384141,0.537709,0.463304,0.504558,0.438718,0.000000,0.285076,0.319295,0.510667
AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,0.373681,0.427191,0.555729,0.372164,0.563786,0.325286,0.589766,0.509457,0.269658,0.429541,...,0.279358,0.373368,0.488793,0.493195,0.551507,0.325286,0.285076,0.000000,0.592129,0.493008
AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,0.519769,0.509457,0.714769,0.466866,0.325085,0.370348,0.525195,0.491794,0.559801,0.325286,...,0.275424,0.443780,0.588972,0.626902,0.650694,0.515013,0.319295,0.592129,0.000000,0.487932


Success! One thing worth noting: 'pruned_df = pairwise_tn_df' does not make a copy. In Python, assignment only binds a second name to the same object, so modifying the one ends up modifying the other. This is true of any mutable object, not just Pandas DataFrames, and it is why the cell above uses '.copy()'.
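The behavior can be reproduced without Pandas at all. The sketch below uses a plain dictionary with made-up contents to show the difference between a second name for the same object and an actual copy.

```python
# Toy data for illustration only.
original = {"SEQ_A": 0.40, "SEQ_B": 0.37}

alias = original               # a second name for the SAME object
independent = original.copy()  # a separate (shallow) copy

del alias["SEQ_B"]             # this also removes SEQ_B from 'original'
del independent["SEQ_A"]       # this leaves 'original' untouched

print(alias is original)        # the two names point at one object
print(independent is original)  # the copy is a different object
print(original)
```

After these two deletions, 'original' has lost SEQ_B (through the alias) but still has SEQ_A (the copy was changed instead).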

#### Pruning Function:

In [None]:
#Note that this function is meant to be called repeatedly:
#To begin, input_df = pairwise_tn_df, but the next time, it will be the output of the prev. step
def prune_matrix(input_df, API_dict, D=0.1, alg='API'):
  pruned_df = input_df.copy()
  for seq in hit_list(input_df, API_dict, D, alg):
    pruned_df.drop([seq], axis=1, inplace=True)
    pruned_df.drop([seq], axis=0, inplace=True)
  return(pruned_df)

In [None]:
prune_matrix(pairwise_tn_df, API_dict, D=0.1)

Unnamed: 0,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA,AACCGGTTUUAAUUAAUUUACGCTTGGCCAA,AACCGGTTAAAGUUAAUUUACGGTTGGCCAA,AACCGGTTAAAUUAAUUUACGCGTTGGCCAA,AACCGGTTAUGUAAUUUACGUGUTTGGCCAA,...,AACCGGTTAUUGUGGUUGUUGGGTTGGCCAA,AACCGGTTUUGUGGUUGUUUGGCTTGGCCAA,AACCGGTTUUGUGGUUGUUACUCTTGGCCAA,AACCGGTTAAAAAAUGUGGUUACTTGGCCAA,AACCGGTTUCCGCUUCAUUUACGTTGGCCAA,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,AACCGGTTACAUGUGGUUGUGUATTGGCCAA
AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,0.000000,0.536284,0.495175,0.034723,0.564295,0.372718,0.325768,0.380993,0.143741,0.514167,...,0.500044,0.487884,0.496017,0.433961,0.637859,0.559587,0.647795,0.373681,0.519769,0.427483
AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,0.536284,0.000000,0.327228,0.534410,0.104792,0.224886,0.632869,0.433130,0.332553,0.104864,...,0.371428,0.491149,0.564824,0.324056,0.650225,0.486450,0.435475,0.427191,0.509457,0.574162
AACCGGTTAAUAAUUUACGCUAATTGGCCAA,0.495175,0.327228,0.000000,0.494202,0.504296,0.384633,0.633141,0.557520,0.489237,0.442288,...,0.560691,0.630595,0.621325,0.276985,0.568445,0.486208,0.560177,0.555729,0.714769,0.490981
AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,0.034723,0.534410,0.494202,0.000000,0.510740,0.372018,0.379921,0.380363,0.143571,0.512753,...,0.499167,0.558790,0.551832,0.432227,0.753709,0.491765,0.646530,0.372164,0.466866,0.426784
AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,0.564295,0.104792,0.504296,0.510740,0.000000,0.325193,0.555551,0.548395,0.440332,0.067947,...,0.429014,0.374209,0.446446,0.444472,0.665790,0.383185,0.555729,0.563786,0.325085,0.639640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,0.559587,0.486450,0.486208,0.491765,0.383185,0.574411,0.489569,0.499518,0.428057,0.499167,...,0.274994,0.339000,0.431633,0.458788,0.488051,0.000000,0.438718,0.325286,0.515013,0.370783
AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,0.647795,0.435475,0.560177,0.646530,0.555729,0.329303,0.441818,0.377396,0.521533,0.486191,...,0.105272,0.384141,0.537709,0.463304,0.504558,0.438718,0.000000,0.285076,0.319295,0.510667
AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,0.373681,0.427191,0.555729,0.372164,0.563786,0.325286,0.589766,0.509457,0.269658,0.429541,...,0.279358,0.373368,0.488793,0.493195,0.551507,0.325286,0.285076,0.000000,0.592129,0.493008
AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,0.519769,0.509457,0.714769,0.466866,0.325085,0.370348,0.525195,0.491794,0.559801,0.325286,...,0.275424,0.443780,0.588972,0.626902,0.650694,0.515013,0.319295,0.592129,0.000000,0.487932


### Kill_Neighbors:

Finally, the assembly:

1. Take in 'API_dict' and 'pairwise_tn_df'.

2. Iterate ['get_neighbors' -> 'hit_list' -> 'prune_matrix'] until a cutoff is reached.
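As a rough illustration of one pass through that chain, here is a minimal sketch on a toy distance matrix. The helpers `toy_get_neighbors`, `toy_hit_list` and `toy_prune_matrix` are simplified stand-ins written for this example, not the notebook's own 'get_neighbors', 'hit_list' and 'prune_matrix' (defined earlier), and the labels, distances and scores are made up.

```python
import pandas as pd

def toy_get_neighbors(dist_df, D):
    """Map each sequence to the other sequences closer than D."""
    return {
        seq: [other for other in dist_df.columns
              if other != seq and dist_df.loc[seq, other] < D]
        for seq in dist_df.index
    }

def toy_hit_list(dist_df, scores, D):
    """Pick the lowest-scoring member of each largest neighborhood."""
    neighbors = toy_get_neighbors(dist_df, D)
    largest = max((len(near) for near in neighbors.values()), default=0)
    if largest == 0:
        return set()  # no sequence has a neighbor, so nothing to cut
    hits = set()
    for seq, near in neighbors.items():
        if len(near) == largest:
            members = [seq] + near
            hits.add(min(members, key=lambda s: scores[s]))
    return hits

def toy_prune_matrix(dist_df, scores, D):
    """Drop the hit-list sequences from both rows and columns."""
    hits = list(toy_hit_list(dist_df, scores, D))
    return dist_df.drop(index=hits, columns=hits)

# Made-up data: s1, s2 and s3 form one cluster, s4 sits alone.
labels = ['s1', 's2', 's3', 's4']
toy_df = pd.DataFrame(
    [[0.00, 0.03, 0.07, 0.50],
     [0.03, 0.00, 0.06, 0.50],
     [0.07, 0.06, 0.00, 0.50],
     [0.50, 0.50, 0.50, 0.00]],
    index=labels, columns=labels)
toy_scores = {'s1': 0.40, 's2': 0.37, 's3': 0.45, 's4': 0.30}

pruned = toy_prune_matrix(toy_df, toy_scores, D=0.1)
print(list(pruned.columns))
```

In this toy case the three clustered sequences all have two neighbors, so the lowest-scoring one among them ('s2') is removed in the first pass, while the isolated 's4' survives despite having the lowest score overall.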

#### Exploration:

In [None]:
def kill_neighbors(pairwise_tn_df, API_dict, D=0.1, alg='API'):
  input_df = pairwise_tn_df.copy()

  neighborhood_list = get_neighbors(input_df, D)

  return(neighborhood_list)

In [None]:
kill_neighbors(pairwise_tn_df, API_dict, D=0.1, alg='API')

{'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA': ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA',
  'AACCGGTTUAAAUAAUUUACGGCTTGGCCAA',
  'AACCGGTTUAAUUAAUUUACGUATTGGCCAA',
  'AACCGGTTUAAUUAAUUUACGUGTTGGCCAA'],
 'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA': ['AACCGGTTCAUUAAUUUACGUGUTTGGCCAA'],
 'AACCGGTTAAUAAUUUACGCUAATTGGCCAA': [],
 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA': ['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA'],
 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA': ['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA'],
 'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA': ['AACCGGTTAGUUAAUUAAAUCGUTTGGCCAA'],
 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA': ['AACCGGTTGGAAUUAAUUUACGCTTGGCCAA',
  'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA',
  'AACCGGTTUUCAUUAAUUUACGGTTGGCCAA',
  'AACCGGTTGUAAUUAAUUUACGUTTGGCCAA',
  'AACCGGTTUAAAUUAAUUUACGCTTGGCCAA',
  'AACCGGTTGUAAUUAAUUUACGGTTGGCCAA'],
 'AACCGGTTAAAGUUAAUUUACGGTTGGCCAA': [],
 'AACCGGTTAAAUUAAUUUACGCGTTGGCCAA': ['AACCGGTTUAAUUAAUUUACGUGTTGGCCAA',
  'AACCGGTTUAAUUAAUUUACGCCTTGGCCAA'],
 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA': ['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA'],
 'A

In [None]:
def kill_neighbors(pairwise_tn_df, API_dict, D=0.1, alg='API'):
  input_df = pairwise_tn_df.copy()

  neighborhood_list = get_neighbors(input_df, D)
  kill_list = hit_list(input_df, API_dict, D, alg)
  output_df = prune_matrix(input_df, API_dict, D, alg)

  #return(neighborhood_list)
  #return(kill_list)
  return(output_df)

In [None]:
kill_neighbors(pairwise_tn_df, API_dict, D=0.2, alg='API')

Unnamed: 0,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA,AACCGGTTUUAAUUAAUUUACGCTTGGCCAA,AACCGGTTAAAGUUAAUUUACGGTTGGCCAA,AACCGGTTAAAUUAAUUUACGCGTTGGCCAA,AACCGGTTAUGUAAUUUACGUGUTTGGCCAA,...,AACCGGTTAUUGUGGUUGUUGGGTTGGCCAA,AACCGGTTUUGUGGUUGUUUGGCTTGGCCAA,AACCGGTTUUGUGGUUGUUACUCTTGGCCAA,AACCGGTTAAAAAAUGUGGUUACTTGGCCAA,AACCGGTTUCCGCUUCAUUUACGTTGGCCAA,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,AACCGGTTACAUGUGGUUGUGUATTGGCCAA
AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,0.000000,0.536284,0.495175,0.034723,0.564295,0.372718,0.325768,0.380993,0.143741,0.514167,...,0.500044,0.487884,0.496017,0.433961,0.637859,0.559587,0.647795,0.373681,0.519769,0.427483
AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,0.536284,0.000000,0.327228,0.534410,0.104792,0.224886,0.632869,0.433130,0.332553,0.104864,...,0.371428,0.491149,0.564824,0.324056,0.650225,0.486450,0.435475,0.427191,0.509457,0.574162
AACCGGTTAAUAAUUUACGCUAATTGGCCAA,0.495175,0.327228,0.000000,0.494202,0.504296,0.384633,0.633141,0.557520,0.489237,0.442288,...,0.560691,0.630595,0.621325,0.276985,0.568445,0.486208,0.560177,0.555729,0.714769,0.490981
AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,0.034723,0.534410,0.494202,0.000000,0.510740,0.372018,0.379921,0.380363,0.143571,0.512753,...,0.499167,0.558790,0.551832,0.432227,0.753709,0.491765,0.646530,0.372164,0.466866,0.426784
AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,0.564295,0.104792,0.504296,0.510740,0.000000,0.325193,0.555551,0.548395,0.440332,0.067947,...,0.429014,0.374209,0.446446,0.444472,0.665790,0.383185,0.555729,0.563786,0.325085,0.639640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,0.559587,0.486450,0.486208,0.491765,0.383185,0.574411,0.489569,0.499518,0.428057,0.499167,...,0.274994,0.339000,0.431633,0.458788,0.488051,0.000000,0.438718,0.325286,0.515013,0.370783
AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,0.647795,0.435475,0.560177,0.646530,0.555729,0.329303,0.441818,0.377396,0.521533,0.486191,...,0.105272,0.384141,0.537709,0.463304,0.504558,0.438718,0.000000,0.285076,0.319295,0.510667
AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,0.373681,0.427191,0.555729,0.372164,0.563786,0.325286,0.589766,0.509457,0.269658,0.429541,...,0.279358,0.373368,0.488793,0.493195,0.551507,0.325286,0.285076,0.000000,0.592129,0.493008
AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,0.519769,0.509457,0.714769,0.466866,0.325085,0.370348,0.525195,0.491794,0.559801,0.325286,...,0.275424,0.443780,0.588972,0.626902,0.650694,0.515013,0.319295,0.592129,0.000000,0.487932


There's not a whole lot more setup to do, given that I wrote 'get_neighbors', 'hit_list' and 'prune_matrix' to step into each other.

I believe what's left is to iterate until the maximum number of steps is reached, or until all possible cuts have been made.

#### (Old) Function

In [None]:
def kill_neighbors(pairwise_tn_df, API_dict, D=0.1, alg='API', lim=100, show=''):
  input_df = pairwise_tn_df.copy()

  count = 0
  output_dim = [(0, 0), input_df.values.shape]
  while count < lim and output_dim[count+1] != output_dim[count]:
    count +=1

    #A quick way to perhaps make the step-wise output available.
    neighborhood_list = get_neighbors(input_df, D)
    if show == 'neigh':
      print(neighborhood_list)
    kill_list = hit_list(input_df, API_dict, D, alg)
    if show == 'kill':
      print(kill_list)
    output_df = prune_matrix(input_df, API_dict, D, alg)
    if show == 'matr':
      print(output_df)

    #The aim of doing this is that you should be able to see "convergence."
    #If the matrix from two successive iterations is the same size, no cuts
    #were made in the last pass, so the loop stops.
    output_dim.append(output_df.values.shape)
    if show == 'dims':
      print(output_dim[count+1])
    input_df = output_df.copy()

  #return(neighborhood_list)
  #return(kill_list)
  return(output_df)

In [None]:
test = kill_neighbors(pairwise_tn_df, API_dict, D=0.5, alg='API', lim=100, show='dims')
test

(425, 425)
(328, 328)
(223, 223)
(132, 132)
(63, 63)
(26, 26)
(26, 26)


Unnamed: 0,AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA,AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA,AACCGGTTGUGUGGUUUAUGAGCTTGGCCAA,AACCGGTTACGUUGUGGUUUAUGTTGGCCAA,AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA,AACCGGTTGCGUUGUGGUUUAUGTTGGCCAA,AACCGGTTCGUUGUGGUUUAUGUTTGGCCAA,AACCGGTTAUGUUGUGGUUUAUGTTGGCCAA,AACCGGTTAUGUGGUUUAUGUUGTTGGCCAA,AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA,...,AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA,AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA,AACCGGTTUGUUGUGGUUUAUGGTTGGCCAA,AACCGGTTAGGUUGUGGUUUAUGTTGGCCAA,AACCGGTTUGUUUUGAGGUUGGUTTGGCCAA,AACCGGTTUAUGAGGUUGGUUUGTTGGCCAA,AACCGGTTAAUUGUGGUUGUUUATTGGCCAA,AACCGGTTUAUGUGGUUGUUUAUTTGGCCAA,AACCGGTTUUAUGUGGUUGUUUATTGGCCAA,AACCGGTTUAAAAUGUGGUUGUUTTGGCCAA
AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA,0.0,0.394306,0.38117,0.033096,0.445222,0.06902,0.515839,0.033104,0.281051,0.38208,...,0.441613,0.508558,0.445719,0.034548,0.458882,0.462039,0.326488,0.462808,0.445145,0.439077
AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA,0.394306,0.0,0.428957,0.460879,0.497262,0.515839,0.069247,0.462808,0.383912,0.068951,...,0.445719,0.431581,0.105485,0.445,0.331253,0.445,0.144444,0.328704,0.281718,0.444289
AACCGGTTGUGUGGUUUAUGAGCTTGGCCAA,0.38117,0.428957,0.0,0.372175,0.383251,0.326725,0.431065,0.326501,0.143703,0.440703,...,0.382538,0.326488,0.445358,0.381763,0.561883,0.579866,0.593375,0.493388,0.515839,0.670479
AACCGGTTACGUUGUGGUUUAUGTTGGCCAA,0.033096,0.460879,0.372175,0.0,0.433597,0.034822,0.539795,0.035042,0.275783,0.393925,...,0.43121,0.493739,0.46256,0.033099,0.446766,0.538533,0.382942,0.540982,0.433315,0.506252
AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA,0.445222,0.497262,0.383251,0.433597,0.0,0.374002,0.5156,0.389006,0.448299,0.51705,...,0.068069,0.033107,0.543891,0.446766,0.446347,0.276409,0.3878,0.275808,0.387102,0.431107
AACCGGTTGCGUUGUGGUUUAUGTTGGCCAA,0.06902,0.515839,0.326725,0.034822,0.374002,0.0,0.541457,0.069864,0.320448,0.445358,...,0.374294,0.431955,0.463821,0.069129,0.448906,0.540033,0.445222,0.542229,0.43452,0.507155
AACCGGTTCGUUGUGGUUUAUGUTTGGCCAA,0.515839,0.069247,0.431065,0.539795,0.5156,0.541457,0.0,0.542229,0.448299,0.068046,...,0.463821,0.444722,0.069314,0.46256,0.271909,0.494931,0.229374,0.37919,0.275808,0.492341
AACCGGTTAUGUUGUGGUUUAUGTTGGCCAA,0.033104,0.462808,0.326501,0.035042,0.389006,0.069864,0.542229,0.0,0.232191,0.395259,...,0.385307,0.446766,0.464489,0.033107,0.447735,0.541158,0.383912,0.543763,0.388492,0.507586
AACCGGTTAUGUGGUUUAUGUUGTTGGCCAA,0.281051,0.383912,0.143703,0.275783,0.448299,0.320448,0.448299,0.232191,0.0,0.327148,...,0.446692,0.51705,0.385061,0.281329,0.699552,0.383987,0.383912,0.448891,0.388492,0.5924
AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA,0.38208,0.068951,0.440703,0.393925,0.51705,0.445358,0.068046,0.395259,0.327148,0.0,...,0.383448,0.445358,0.033112,0.335315,0.327691,0.441613,0.18959,0.446192,0.27648,0.57853


In [None]:
for i in test.columns:
  print(i, API_dict[i])

AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA 0.4857142857142857
AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA 0.4857142857142857
AACCGGTTGUGUGGUUUAUGAGCTTGGCCAA 0.4857142857142857
AACCGGTTACGUUGUGGUUUAUGTTGGCCAA 0.4857142857142857
AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA 0.4857142857142857
AACCGGTTGCGUUGUGGUUUAUGTTGGCCAA 0.4857142857142857
AACCGGTTCGUUGUGGUUUAUGUTTGGCCAA 0.4857142857142857
AACCGGTTAUGUUGUGGUUUAUGTTGGCCAA 0.4857142857142857
AACCGGTTAUGUGGUUUAUGUUGTTGGCCAA 0.4857142857142857
AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA 0.4857142857142857
AACCGGTTAAUUUGUGGUUUAUGTTGGCCAA 0.4857142857142857
AACCGGTTUCAUGUGGUUUAUGUTTGGCCAA 0.4857142857142857
AACCGGTTUGUUGUGGUUUAUGATTGGCCAA 0.4857142857142857
AACCGGTTAUGUGGUUUAUGAGCTTGGCCAA 0.4857142857142857
AACCGGTTAGUUGUGGUUUAUGCTTGGCCAA 0.4857142857142857
AACCGGTTGGUUGUGGUUUAUGUTTGGCCAA 0.4857142857142857
AACCGGTTGUUGUGGUUUAUGGGTTGGCCAA 0.4857142857142857
AACCGGTTGUUGUGGUUUAUGGCTTGGCCAA 0.4857142857142857
AACCGGTTUGUUGUGGUUUAUGGTTGGCCAA 0.4857142857142857
AACCGGTTAGGUUGUGGUUUAUGTTGGCCAA

### Bonus: Free-Energy Dict

When the selection is very heavy, what you end up with are all the sequences with the highest API scores. So, the program in this case is not much different from just picking the top sequences.

Dr. Bell does make the point, however, that it would be useful to further select based on free-energy change. That is: the program so far actually discards crucial information. It should be easy enough to add this to the output.
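To make the "just picking the top sequences" observation checkable, here is a minimal sketch. The helper `only_top_scoring` is a hypothetical name introduced for this example, and the small `scores` dict holds only three scores copied from outputs in this notebook, not the full 'API_dict'.

```python
def only_top_scoring(survivors, scores):
    """True if every survivor carries the highest score in `scores`."""
    best = max(scores.values())
    return all(scores[seq] == best for seq in survivors)

# Three scores copied from outputs shown in this notebook.
scores = {
    'AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA': 0.4857142857142857,
    'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA': 0.400000,
    'AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA': 0.371429,
}
survivors = ['AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA']
print(only_top_scoring(survivors, scores))
```

If this returns True for a given cutoff, the pruning has added nothing beyond a top-score filter, which is where a second criterion such as free energy becomes useful.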

#### Making the 'E_dict' object:

In [None]:
fin_df

Unnamed: 0,aptamer_protein_interaction_score,primary_sequence,secondary_structure,minimum_free_energy,file
0,0.400000,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,....((((.((((.......)))).))))..,-3.4,Exp_1_1.csv
1,0.400000,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,....(((((((((.......)))))))))..,-6.9,Exp_1_1.csv
2,0.400000,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,....(((((((...........)))))))..,-4.7,Exp_1_1.csv
3,0.400000,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,....(((.((((..........)))))))..,-3.6,Exp_1_1.csv
4,0.400000,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,(((((((..........)).)))))......,-2.9,Exp_1_1.csv
...,...,...,...,...,...
95,0.371429,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,..(((((((...........)))))))....,-3.5,Exp_5_1.csv
96,0.371429,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,....((((((.((........))))))))..,-5.6,Exp_5_1.csv
97,0.371429,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,....((((((.............))))))..,-3.6,Exp_5_1.csv
98,0.371429,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,.((((....))))((.(((....))).))..,-6.1,Exp_5_1.csv


In [None]:
e_seqs = fin_df.drop(['aptamer_protein_interaction_score','secondary_structure', 'file'],axis=1)
e_seqs = e_seqs.values
e_seqs

array([['AACCGGTTUAAAUAAUUUACGUUTTGGCCAA', -3.400000095367432],
       ['AACCGGTTAAUUAAUUUACGUGGTTGGCCAA', -6.900000095367432],
       ['AACCGGTTAAUAAUUUACGCUAATTGGCCAA', -4.699999809265137],
       ['AACCGGTTCAAAUAAUUUACGUUTTGGCCAA', -3.5999999046325684],
       ['AACCGGTTCUGUAAUUUACGUGGTTGGCCAA', -2.900000095367432],
       ['AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA', -4.400000095367432],
       ['AACCGGTTUUAAUUAAUUUACGCTTGGCCAA', -1.600000023841858],
       ['AACCGGTTAAAGUUAAUUUACGGTTGGCCAA', -5.5],
       ['AACCGGTTAAAUUAAUUUACGCGTTGGCCAA', -3.200000047683716],
       ['AACCGGTTAUGUAAUUUACGUGUTTGGCCAA', -6.5],
       ['AACCGGTTUCAAUUUACACUAAUTTGGCCAA', -3.200000047683716],
       ['AACCGGTTAUUUAUUUAUGGAUCTTGGCCAA', -3.0999999046325684],
       ['AACCGGTTUCGCAAUUUAUGGUGTTGGCCAA', -5.400000095367432],
       ['AACCGGTTGGAAUUAAUUUACGCTTGGCCAA', -1.7000000476837158],
       ['AACCGGTTUAAAUUAAUUUACGUTTGGCCAA', -2.799999952316284],
       ['AACCGGTTUUCAUAAUUUACAAATTGGCCAA', -3.0],
       ['AACCGG

In [None]:
E_dict = {}
for n in e_seqs:
  E_dict[n[0]] = n[1]

E_dict

{'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA': -3.400000095367432,
 'AACCGGTTAAUUAAUUUACGUGGTTGGCCAA': -6.900000095367432,
 'AACCGGTTAAUAAUUUACGCUAATTGGCCAA': -4.699999809265137,
 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA': -3.5999999046325684,
 'AACCGGTTCUGUAAUUUACGUGGTTGGCCAA': -2.900000095367432,
 'AACCGGTTAGUAAAUUUAAUCGUTTGGCCAA': -4.400000095367432,
 'AACCGGTTUUAAUUAAUUUACGCTTGGCCAA': -1.600000023841858,
 'AACCGGTTAAAGUUAAUUUACGGTTGGCCAA': -5.5,
 'AACCGGTTAAAUUAAUUUACGCGTTGGCCAA': -3.200000047683716,
 'AACCGGTTAUGUAAUUUACGUGUTTGGCCAA': -6.5,
 'AACCGGTTUCAAUUUACACUAAUTTGGCCAA': -3.200000047683716,
 'AACCGGTTAUUUAUUUAUGGAUCTTGGCCAA': -3.0999999046325684,
 'AACCGGTTUCGCAAUUUAUGGUGTTGGCCAA': -5.400000095367432,
 'AACCGGTTGGAAUUAAUUUACGCTTGGCCAA': -1.7000000476837158,
 'AACCGGTTUAAAUUAAUUUACGUTTGGCCAA': -2.799999952316284,
 'AACCGGTTUUCAUAAUUUACAAATTGGCCAA': -3.0,
 'AACCGGTTCCAAAUUAAUUUACGTTGGCCAA': -2.0,
 'AACCGGTTAGUUAAUUUACGGCUTTGGCCAA': -3.5999999046325684,
 'AACCGGTTUGUGUAAUUUACGUATTGGCCAA': -6.0,
 '


### Modified 'kill_neighbors' function:

While I'm adding the free energy change, I realize that I should have written functions to make the objects 'pairwise_tn_df', 'API_dict' and (now) 'E_dict'.
That way, all these things can be packaged directly into the 'kill_neighbors' function. The user simply inputs a list of filenames, and can easily change which files the program runs on without having to edit anything else.

#### 'make_necessary_objects' function

In [None]:
def make_necessary_objects(file_list):
  #Read each file and record which file every row came from.
  frames = []
  for file in file_list:
    df = pd.read_csv(file, delimiter=',', encoding_errors='replace')
    df['file'] = file
    frames.append(df)
  fin_df = pd.concat(frames)
  #Make the 'API_dict' (sequence -> API score)
  API_dict = dict(zip(fin_df['primary_sequence'], fin_df['aptamer_protein_interaction_score']))
  #Make the 'E_dict' (sequence -> minimum free energy)
  E_dict = dict(zip(fin_df['primary_sequence'], fin_df['minimum_free_energy']))
  #Make 'pairwise_tn_df'
  pairwise_tn = []
  for n in API_dict:
    row = []
    for m in API_dict:
      try: #A few of the pairs return a 'math domain error'
        row.append(float(tamura_nei(n,m)[2]))
      except Exception: #Not a bare 'except', so the loop can still be interrupted.
        #print('Math_error', n,m)
        row.append(1.0)    #Impute the maximum distance.
    pairwise_tn.append(row)
  pairwise_tn_df = pd.DataFrame(pairwise_tn, columns=list(API_dict), index=list(API_dict))

  #Return everything:
  return(fin_df, API_dict, E_dict, pairwise_tn_df)

In [None]:
All_Sequences, API_dict, E_dict, pairwise_tn_df = make_necessary_objects(file_list=['Exp_1_1.csv','Exp_2_1.csv', 'Exp_3_1.csv', 'Exp_4_1.csv', 'Exp_5_1.csv'])

In [None]:
#The only one of these that the user really might need to see.....
All_Sequences

Unnamed: 0,aptamer_protein_interaction_score,primary_sequence,secondary_structure,minimum_free_energy,file
0,0.400000,AACCGGTTUAAAUAAUUUACGUUTTGGCCAA,....((((.((((.......)))).))))..,-3.4,Exp_1_1.csv
1,0.400000,AACCGGTTAAUUAAUUUACGUGGTTGGCCAA,....(((((((((.......)))))))))..,-6.9,Exp_1_1.csv
2,0.400000,AACCGGTTAAUAAUUUACGCUAATTGGCCAA,....(((((((...........)))))))..,-4.7,Exp_1_1.csv
3,0.400000,AACCGGTTCAAAUAAUUUACGUUTTGGCCAA,....(((.((((..........)))))))..,-3.6,Exp_1_1.csv
4,0.400000,AACCGGTTCUGUAAUUUACGUGGTTGGCCAA,(((((((..........)).)))))......,-2.9,Exp_1_1.csv
...,...,...,...,...,...
95,0.371429,AACCGGTTCUAUAUGUGGUUGAGTTGGCCAA,..(((((((...........)))))))....,-3.5,Exp_5_1.csv
96,0.371429,AACCGGTTAGUGUGGUUGUUAGCTTGGCCAA,....((((((.((........))))))))..,-5.6,Exp_5_1.csv
97,0.371429,AACCGGTTAGUUUAAUGGUUGCUTTGGCCAA,....((((((.............))))))..,-3.6,Exp_5_1.csv
98,0.371429,AACCGGTTCUGGUGGGUAAUUGUTTGGCCAA,.((((....))))((.(((....))).))..,-6.1,Exp_5_1.csv


Actually, it doesn't make much sense to put this function directly into 'kill_neighbors', because making these objects is far more time-consuming than pruning them: you would have to remake them every time you changed a single parameter. I think what I've done does allow me to make these into optional arguments, which is good.
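Since building the objects is the slow part, one option is to build them once and store them on disk between sessions. The following is only a sketch: `save_objects` and `load_objects` are hypothetical helpers, not part of the notebook, and they assume the dict values are plain Python floats (which `json` can serialize). The two sequences and their values are copied from the outputs above.

```python
import json
import os
import tempfile

import pandas as pd

def save_objects(prefix, pairwise_tn_df, API_dict, E_dict):
    """Write the distance matrix to CSV and the two dicts to JSON."""
    pairwise_tn_df.to_csv(prefix + '_pairwise.csv')
    with open(prefix + '_dicts.json', 'w') as handle:
        json.dump({'API_dict': API_dict, 'E_dict': E_dict}, handle)

def load_objects(prefix):
    """Reload the objects written by save_objects."""
    pairwise_tn_df = pd.read_csv(prefix + '_pairwise.csv', index_col=0,
                                 float_precision='round_trip')
    with open(prefix + '_dicts.json') as handle:
        dicts = json.load(handle)
    return pairwise_tn_df, dicts['API_dict'], dicts['E_dict']

# Two sequences and their values, taken from the outputs shown earlier.
seq_a = 'AACCGGTTUAAAUAAUUUACGUUTTGGCCAA'
seq_b = 'AACCGGTTCAAAUAAUUUACGUUTTGGCCAA'
small_df = pd.DataFrame([[0.0, 0.034723], [0.034723, 0.0]],
                        index=[seq_a, seq_b], columns=[seq_a, seq_b])
small_API = {seq_a: 0.4, seq_b: 0.4}
small_E = {seq_a: -3.4, seq_b: -3.6}

with tempfile.TemporaryDirectory() as folder:
    prefix = os.path.join(folder, 'objects')
    save_objects(prefix, small_df, small_API, small_E)
    loaded_df, loaded_API, loaded_E = load_objects(prefix)

print(loaded_df.shape, loaded_API == small_API, loaded_E == small_E)
```

With the objects reloaded, 'kill_neighbors' can be rerun with different values of D or lim without recomputing the pairwise distances.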

#### Final(?) version of 'kill_neighbors'

In [None]:
def kill_neighbors(pairwise_tn_df, API_dict, E_dict, D=0.1, alg='API', lim=100, show=''):
  input_df = pairwise_tn_df.copy()

  count = 0
  output_dim = [(0, 0), input_df.values.shape]
  while count < lim and output_dim[count+1] != output_dim[count]:
    count +=1

    #A quick way to perhaps make the step-wise output available.
    neighborhood_list = get_neighbors(input_df, D)
    if show == 'neigh':
      print(neighborhood_list)
    kill_list = hit_list(input_df, API_dict, D, alg)
    if show == 'kill':
      print(kill_list)
    output_df = prune_matrix(input_df, API_dict, D, alg)
    if show == 'matr':
      print(output_df)

    #The aim of doing this is that you should be able to see "convergence."
    #If the matrix from two successive iterations is the same size, no cuts
    #were made in the last pass, so the loop stops.
    output_dim.append(output_df.values.shape)
    if show == 'dims':
      print(output_dim[count+1])
    input_df = output_df.copy()

  #Outputs the final distance matrix:
  if show == 'dist':
    return(output_df)

  #Otherwise, outputs the survivors with their API scores and free energies:
  else:
    survivors = []
    for i in output_df.columns:
      survivors.append([i, API_dict[i], E_dict[i]])
    #Build the DataFrame once, after the loop, so it exists even if nothing survives.
    survivors_df = pd.DataFrame(survivors, columns=['sequence','API_score', 'Free_Energy_Change'])
    return(survivors_df)

In [None]:
test = kill_neighbors(pairwise_tn_df, API_dict, E_dict, D=0.5, alg='API', lim=100, show='')
test

#for i in test.columns:
#  print(i, API_dict[i], E_dict[i])

Unnamed: 0,sequence,API_score,Free_Energy_Change
0,AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA,0.485714,-4.8
1,AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA,0.485714,-4.7
2,AACCGGTTGUGUGGUUUAUGAGCTTGGCCAA,0.485714,-3.6
3,AACCGGTTACGUUGUGGUUUAUGTTGGCCAA,0.485714,-5.4
4,AACCGGTTGUUGUGGUUUAUGUCTTGGCCAA,0.485714,-4.1
5,AACCGGTTGCGUUGUGGUUUAUGTTGGCCAA,0.485714,-5.0
6,AACCGGTTCGUUGUGGUUUAUGUTTGGCCAA,0.485714,-3.6
7,AACCGGTTAUGUUGUGGUUUAUGTTGGCCAA,0.485714,-4.2
8,AACCGGTTAUGUGGUUUAUGUUGTTGGCCAA,0.485714,-3.2
9,AACCGGTTAGUUGUGGUUUAUGGTTGGCCAA,0.485714,-4.8
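
Because every survivor here has the same API score, the free energy is the only column left to separate them. A minimal sketch of that second selection step follows, using the first four rows of the output above. It assumes that a more negative free energy is the preferred direction, which is an assumption of this example rather than something the notebook establishes.

```python
import pandas as pd

# First four survivor rows, copied from the output above.
survivors_df = pd.DataFrame({
    'sequence': ['AACCGGTTAAGUUGUGGUUUAUGTTGGCCAA',
                 'AACCGGTTAAUUGUGGUUUAUGUTTGGCCAA',
                 'AACCGGTTGUGUGGUUUAUGAGCTTGGCCAA',
                 'AACCGGTTACGUUGUGGUUUAUGTTGGCCAA'],
    'API_score': [0.485714, 0.485714, 0.485714, 0.485714],
    'Free_Energy_Change': [-4.8, -4.7, -3.6, -5.4],
})

# Highest API score first; ties broken by the most negative free energy.
ranked = survivors_df.sort_values(
    by=['API_score', 'Free_Energy_Change'],
    ascending=[False, True],
).reset_index(drop=True)
print(ranked)
```

The same call works unchanged on the full DataFrame returned by 'kill_neighbors'.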
