In [1]:
import pandas as pd
import os

## Acquiring Data

This will acquire the data from Drive once fully developed. For now it is read from the local data folder, which is ignored in GitHub. At the moment only the Hits and Query folders are used, not Assembly.

In [2]:
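# This will have to change once we set it up to pull from Drive.
# The trailing "" keeps the final path separator, so datapath + "<file>.csv" still works.
datapath = os.path.join(os.getcwd(), "data", "")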

In [3]:
# Load the dataframes. The CSV names are currently hardcoded.
query = pd.read_csv(datapath + "barcode13-query.csv")
hits = pd.read_csv(datapath + "barcode13-hits.csv")
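
Until the Drive download exists, the hardcoded file names could be replaced by a small loader keyed on the barcode. The following is only a sketch: the function name `load_barcode_tables` is hypothetical, and it assumes every export follows the `<barcode>-query.csv` / `<barcode>-hits.csv` pattern seen above. It writes throwaway CSV files so that it runs without the real data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_barcode_tables(data_dir, barcode):
    """Read "<barcode>-query.csv" and "<barcode>-hits.csv" from data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    tables = {}
    for kind in ("query", "hits"):
        csv_path = data_dir / f"{barcode}-{kind}.csv"
        if not csv_path.exists():
            # Fail early with the full path, which is easier to debug than a pandas error.
            raise FileNotFoundError(f"Expected file is missing: {csv_path}")
        tables[kind] = pd.read_csv(csv_path)
    return tables


# Demonstration on made-up files, not on the real export.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Name": ["read-1"]}).to_csv(Path(tmp) / "barcode13-query.csv", index=False)
    pd.DataFrame({"Name": ["hit-1"], "Query": ["read-1"]}).to_csv(
        Path(tmp) / "barcode13-hits.csv", index=False
    )
    demo_tables = load_barcode_tables(tmp, "barcode13")

print(demo_tables["query"].shape, demo_tables["hits"].shape)
```

Keeping the barcode as a parameter means the same call can later be pointed at whatever folder the Drive download writes to.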

## Cleaning Data

This section cleans the data and also shows exactly what was dropped/manipulated.

In [4]:
# Helpers for inspecting and cleaning the raw CSV structure (and for raising exceptions later).
# Analysis functions will come later, in a different block.
def compare_columns(df1, df2):
    """Return (columns only in df1, columns only in df2) as two sets."""
    set1_c, set2_c = set(df1.columns), set(df2.columns)
    # Set difference: "a - b" keeps the items of a that are not in b.
    df1_dif = set1_c - set2_c
    df2_dif = set2_c - set1_c
    return df1_dif, df2_dif

def drop_nan_columns(df):
    """Drop the columns in which every value is NaN."""
    return df.dropna(axis=1, how='all')

def unusual_row_mask(df, col, threshold=0.5):
    """Return a Boolean mask for rows where the column has values,
    but the column is mostly NaN based on the threshold."""
    if df[col].isna().mean() >= threshold:
        return df[col].notna()
    return pd.Series(False, index=df.index)

def get_peculiar_columns(df, threshold=0.5):
    """Return the NaN fraction of every column whose NaN fraction exceeds the threshold."""
    nan_fraction = df.isna().mean()
    return nan_fraction[nan_fraction > threshold]

def rows_for_peculiar_columns(df, threshold=0.5):
    """Return a dictionary mapping each peculiar column to the mask of its non-NaN rows."""
    masks = {}
    p_series = get_peculiar_columns(df, threshold)
    for p_col in p_series.index:
        mask = unusual_row_mask(df, p_col, threshold)
        if mask.any():  # Only keep masks that select at least one row.
            masks[p_col] = mask
    return masks

def split_query_name(row, splitting_column='Name'):
    """Copy each "key=value" token found in row[splitting_column] into its own entry of the row."""
    name = row[splitting_column]
    if not isinstance(name, str):  # e.g. NaN: nothing to split
        return row
    for item in name.split():
        if "=" in item:
            key, value = item.split("=", 1)  # split on the first "=" only
            # An empty value (such as "sample_id=") is stored as NaN.
            row[key] = value if value else float('nan')
    return row
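
As an illustration of what the name splitting does, here is a minimal sketch that parses the `key=value` tokens into a dictionary per name and builds the new columns in one step. The helper `parse_key_values` and the sample names are made up for this example; it is an alternative way to express the idea, not the method used in this notebook.

```python
import pandas as pd


def parse_key_values(text):
    """Turn "id key1=a key2=b" into {"key1": "a", "key2": "b"}; empty values become None."""
    if not isinstance(text, str):
        return {}
    fields = {}
    for token in text.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value or None
    return fields


# Made-up names in the same shape as the exported Name/Query cells.
names = pd.Series([
    "read-a runid=abc ch=328 sample_id= barcode=barcode13",
    "Contig 4",
])

# One dictionary per row becomes one row of new columns; names without tokens give all-NaN rows.
expanded = pd.DataFrame(names.map(parse_key_values).tolist(), index=names.index)
print(expanded)
```

Note that every parsed value is a string (for example `ch` is `"328"`), so numeric fields would still need an explicit conversion before analysis.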


### Immediate Cleanup

Some columns contain only NaN values, so they are dropped. A few properties of the uncleaned data are also shown.

In [5]:
print(len(query.columns), "columns in uncleaned query")
print(len(hits.columns), "columns in uncleaned hits")

31 columns in uncleaned query
38 columns in uncleaned hits


In [6]:
query.columns

Index(['Name', '# Source Sequences', '% GC', '% HQ', '% Identical Sites',
       '% LQ', '% MQ', '% Pairwise Identity', 'Ambiguities', 'At least Q20',
       'At least Q30', 'At least Q40', 'Bin', 'Created', 'Created Date',
       'Description', 'Failed Binning Fields', 'Free end gaps',
       'Mean Coverage', 'Modified', 'Molecular Weight (kDa)', 'Molecule Type',
       'Post-Trim Length', 'Rough Temperature (°C)', 'Sample', 'Sequence',
       'Sequence Length', 'Sequence List Name', 'Size', 'Topology', 'URN'],
      dtype='object')

In [7]:
hits.columns

Index(['Name', '# Nucleotides', '# Sequences', '% GC', '% Identical Sites',
       '% Pairwise Identity', 'Accession', 'Bit-Score', 'Created',
       'Created Date', 'Database', 'Description', 'E Value', 'Free end gaps',
       'Grade', 'Hit end', 'Hit start', 'Max Sequence Length', 'Mean Coverage',
       'Min Sequence Length', 'Modified', 'Molecular Weight (kDa)',
       'Molecule Type', 'Organism', 'Original Query Frame', 'Query',
       'Query Id', 'Query coverage', 'Query end', 'Query start',
       'Ref Seq Index', 'Ref Seq Length', 'Ref Seq Name',
       'Rough Temperature (°C)', 'Sequence', 'Sequence Length', 'Topology',
       'URN'],
      dtype='object')

In [8]:
dropped_query = drop_nan_columns(query)
dropped_hits = drop_nan_columns(hits)
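# Show which columns were dropped. TODO: replace the prints with proper error handling.
try:
    print("For dropped query", compare_columns(query, dropped_query))
except Exception as err:
    print("Could not compare query columns:", err)
try:
    print("For dropped hits", compare_columns(hits, dropped_hits))
except Exception as err:
    print("Could not compare hits columns:", err)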

For dropped query ({'Sequence List Name'}, set())
For dropped hits ({'Ref Seq Name', 'Query Id'}, set())
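
The drop-and-compare step can be reproduced on a toy table. This sketch uses invented values; only the column names are borrowed from the export. It shows that `dropna(axis=1, how="all")` removes a column only when every value in it is NaN.

```python
import numpy as np
import pandas as pd

# Toy table: one column is entirely NaN, another is only partly NaN.
toy = pd.DataFrame({
    "Name": ["read-a", "read-b"],
    "Sequence List Name": [np.nan, np.nan],
    "% GC": [51.2, np.nan],
})

cleaned = toy.dropna(axis=1, how="all")
# Index.difference lists the columns that disappeared.
dropped = toy.columns.difference(cleaned.columns).tolist()
print(dropped)
```

The partly filled `% GC` column survives, which is why the mostly-NaN columns need separate handling.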


### Further Cleanup

The export from Geneious Prime packs extra data into the Name column (for sequences in the Query folder) and the Query column (for sequences in the Hits folder). These fields will let us partially link the two tables later in the analysis.

The next block shows exactly what we have to split up, as Geneious Prime put more data inside certain cells than others.

In [9]:
query_val = dropped_hits.loc[10,["Query"]]
pd.set_option('display.max_colwidth', None)
print("From the Query column in hits:\n", query_val)
name_val = dropped_query.loc[20,["Name"]]
print("\nFrom the Name column in query:\n", name_val)

From the Query column in hits:
 Query    ace289cb-a9d5-470a-9e5f-86a43409224e runid=d0b33ea7460a391678012986097c20ea1c294534 ch=328 start_time=2025-02-25T14:51:59.325974-06:00 flow_cell_id=FBA87864 protocol_group_id=B3_T9_Seq_Run_25_02_25 sample_id= barcode=barcode13 barcode_alias=barcode13 parent_read_id=ace289cb-a9d5-470a-9e5f-86a43409224e basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.3.0
Name: 10, dtype: object

From the Name column in query:
 Name    1a42e9ad-2006-4a0d-8c47-45e11a2a0354 runid=d0b33ea7460a391678012986097c20ea1c294534 ch=358 start_time=2025-02-26T05:02:08.325974-06:00 flow_cell_id=FBA87864 protocol_group_id=B3_T9_Seq_Run_25_02_25 sample_id= barcode=barcode13 barcode_alias=barcode13 parent_read_id=1a42e9ad-2006-4a0d-8c47-45e11a2a0354 basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.3.0
Name: 20, dtype: object


However, for queries that are contigs, "Name" does not carry the extra data (it is just "Contig #"), so that information is missing for them. We therefore need to split the query dataframe into contigs and non-contigs.

In [10]:
# Splits the query table into the contigs and noncontigs.
# TODO May need to do equals() to see if each column really does correspond to a contig.
query_masks = rows_for_peculiar_columns(dropped_query)
print("These keys are what contigs but not regular sequences have:\n", query_masks.keys())
contig_query = dropped_query[query_masks['# Source Sequences']]
noncontig_query = drop_nan_columns(dropped_query[~query_masks['# Source Sequences']])

These keys are what contigs but not regular sequences have:
 dict_keys(['# Source Sequences', '% Identical Sites', '% Pairwise Identity', 'Description', 'Free end gaps', 'Mean Coverage', 'Sample'])
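
The split can be sketched on a toy table. All values below are invented, and the check against the name prefix is only one assumed way to address the TODO above, not a confirmed property of the real export.

```python
import numpy as np
import pandas as pd

# Toy query table: only the contig row has a value in "# Source Sequences".
toy_query = pd.DataFrame({
    "Name": ["read-a runid=abc", "read-b runid=abc", "Contig 1", "read-c runid=abc"],
    "# Source Sequences": [np.nan, np.nan, 12.0, np.nan],
    "Sequence Length": [210, 305, 1480, 198],
})

# Columns that are NaN in more than half of the rows.
nan_fraction = toy_query.isna().mean()
mostly_empty = nan_fraction[nan_fraction > 0.5].index.tolist()

# Rows that do have a value in such a column are treated as contigs.
is_contig = toy_query["# Source Sequences"].notna()
toy_contigs = toy_query[is_contig]
toy_reads = toy_query[~is_contig].dropna(axis=1, how="all")

# Cross-check (assumption: contig names start with "Contig").
name_says_contig = toy_query["Name"].str.startswith("Contig")
masks_agree = bool((is_contig == name_says_contig).all())
print(mostly_empty, masks_agree)
```

If `masks_agree` were False on the real data, the mask would be selecting rows that are not named like contigs, which would be worth inspecting by hand.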


Some of the extracted fields are empty (for example, sample_id has no value), so those also need to be cleaned up.

In [11]:
# Splits the packed cell into separate columns. Also drops sample_id, as it is empty here.
c_contig_query = contig_query.apply(split_query_name, axis=1) # Nothing happens as Name is just Contig #.
c_noncontig_query = drop_nan_columns(noncontig_query.apply(split_query_name, axis=1))
c_hits = drop_nan_columns(dropped_hits.apply(split_query_name, axis=1, splitting_column="Query"))

In [12]:
in_contig, not_in_contig = compare_columns(c_contig_query, c_noncontig_query)
print("This is what is uniquely inside contigs:\n", in_contig, "\nThis is what is uniquely in non-contigs:\n", not_in_contig)

This is what is uniquely inside contigs:
 {'Sample', '% Identical Sites', '% Pairwise Identity', 'Description', '# Source Sequences', 'Free end gaps', 'Mean Coverage'} 
This is what is uniquely in non-contigs:
 {'protocol_group_id', 'basecall_model_version_id', 'start_time', 'barcode', 'parent_read_id', 'barcode_alias', 'runid', 'ch', 'flow_cell_id'}


## Analysis/Combining

The data has been cleaned up. Now it is time to use it to look at the details of the run.

In [13]:
c_hits_val = c_hits.loc[5,:]
pd.set_option('display.max_colwidth', None)
print("From a row in c_hits:\n", c_hits_val)

From a row in c_hits:
 # Nucleotides                                                                                                                                                                                                                                                                                                                                                                                        106
# Sequences                                                                                                                                                                                                                                                                                                                                                                                            2
% GC                                                                                                                                                                                           

In [14]:
# I believe the two tables map to each other through parent_read_id: the one in c_noncontig_query to the one in c_hits.
# Check whether any hit has a contig name (e.g. "Contig 1") as its parent_read_id.
c_hits['parent_read_id'].isin(['Contig 1','Contig']).any()

np.False_
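
If parent_read_id does turn out to be the linking key, the mapping could look like the sketch below. The tables and ids are invented, and whether the key is unique per read in the real data is an open assumption. The `indicator=True` option of `merge` adds a `_merge` column that marks reads with no matching hit.

```python
import pandas as pd

# Toy stand-ins for c_noncontig_query and c_hits.
toy_reads = pd.DataFrame({
    "parent_read_id": ["r1", "r2", "r3"],
    "ch": ["328", "358", "12"],
})
toy_hits = pd.DataFrame({
    "parent_read_id": ["r1", "r1", "r3"],
    "Organism": ["Escherichia coli", "Klebsiella pneumoniae", "Mus musculus"],
})

# Left join: every read is kept, and a read with several hits appears once per hit.
linked = toy_reads.merge(toy_hits, on="parent_read_id", how="left", indicator=True)

# Reads marked "left_only" had no hit at all.
reads_without_hits = linked.loc[linked["_merge"] == "left_only", "parent_read_id"].tolist()
print(reads_without_hits)
```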

In [15]:
c_contig_val = c_contig_query.iloc[3,:]
pd.set_option('display.max_colwidth', None)
print("From a row in c_contig_query:\n", c_contig_val)

From a row in c_contig_query:
 Name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Contig 4
# Source Sequences                                                                                                                                                                                                                                                                                                                                                     

In [16]:
# I don't know if this means that there were simply no results for the contigs, or if I am just not searching correctly.
c_hits['Sequence'].isin(['CGTTCAGTTMCGTATTGCTAAGGTTAAAGAACGACTTCCATACTCGTGTGACAGCACCTSGTGCTGKYSKATGCGTTCAAGTTCTATATCTCGATCATGGCCGGCTATGTGATTTTTCGTGTGGACTTGCTGATCGTCAATCACTTTCGCGGCGCGAGCGAAGCCGGCGTATACGCCGTGGCCTCAGGTTTCGTTCCTTCTGCTGATGATTCCGGGCGTAATCGCGACGTT',
                         'TAAGGTTAAAGAACGACTTCCATACTCGTGTGACAGCACCTAAGGTTAAAGAACGACTTCCATACTCGTGTGACAGCACCTGCGGGTCGGCGGCCTCCACGAGCAGGCGCTCGTTCCTCAGCCGGTGGTCGATCATCGGGTGCGGCCGGCCGCGCGTGAAGACGTCGTCGCCGAGGTCGATGAGCGTGTGCTGCTTCGACTTCCCAGGGATCCGCGATCTTGCGCGCGGGCTCCACCGGCGCGTTCGACCAGGCATCGCCCAGCAGGGCGGAGGCCTCGTAGCAGAAAGTGCCGCCGCTATA',
                         'AAGGTTAAAGAACGACTTCCATACTCGTGTGACAGCACCTAAGGTTTAAAGACGACTTCCATACTCGTGTGACAGCACCTGCAGCCGATCAAGGCAGTCACGGTGAGTGATCCGAAGTGCGTGGAAGCCTCCATCGATCTCAAGGACGCAACCACCGTTCGCGTGCGTGC',
                         'GTGTTAATAAAAGGACTTAAAAAGGTTGTAAATGTTAAATTCAAACATGCATCTTATAGAAACGTCCTATGATAGGTTGAAATCAAGAGAAATCACATTTCAGCAATACAGGGAAAATCTTGCTAAAGCAGGAGTTTTCCGATGGGTTACAAATATCCATGAACATAAAAGATATTACTATACCTTTGATAATTCATTACTATTTACTGAGAGCATTCAGAACACTACACAAATCTTTCCACGCTAAATCATAACGTCCGGTTTCTTCCGTGTCAGCACCGGGGCGTTGGCATAATGCAATACGTGTACGCGCTAAACCCTGTGTGCATCGTTTTAATTATTCCCGGACACTCCCGCAGAGAAGTTCCCCGTCAGGGCTGTGGACATAGTTAATCCGGGAATACAATGACGATTCATCGCACCTGACATACATTAATAAATATTAACAATATGAAATTTCAACTCATTGTTTAGGGTTTGTTTAATTTTCTACACATACGATTCTGCGAACTTCAAAAAGCATCGGGAATAACACCATGAAAAAAATGCTACTCGCTACTGCGCTGGCCCTGCTTATTACAGGATGTGCTC',
                         ]).any()
# Adding CGACCTCGCCGCCGAAGTCGGCGTGGCCGGGCGTGTCGACGATGTTGATGTGC to the list does make it np.True_, so the contigs just might not have been searched (or something else).

np.False_

In [17]:
assert all(hits.columns == query.columns), "Columns in hits and query do not match"

ValueError: Lengths must match to compare
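
The ValueError comes from the elementwise `==` itself: two column indexes of different lengths cannot be compared item by item, so the assert message is never reached. A minimal sketch of a comparison that does not raise, using toy column lists rather than the real tables:

```python
import pandas as pd

# Toy frames with different numbers of columns, like hits and query.
toy_hits = pd.DataFrame(columns=["Name", "Query", "Organism"])
toy_query = pd.DataFrame(columns=["Name", "Sample"])

# Index.equals returns a plain bool even when the lengths differ.
same_columns = toy_hits.columns.equals(toy_query.columns)

# Index.difference shows what each side has that the other lacks (sorted).
only_in_hits = toy_hits.columns.difference(toy_query.columns).tolist()
only_in_query = toy_query.columns.difference(toy_hits.columns).tolist()
print(same_columns, only_in_hits, only_in_query)
```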

In [None]:
filtered_hits = pd.read_csv(datapath + "barcode13-filtered-hits.csv")

In [None]:
filtered_hits = filtered_hits.apply(split_query_name, axis=1)

In [None]:
filtered_hits.columns

Index(['# Source Sequences', '% Identical Sites', '% Pairwise Identity',
       'Created Date', 'Description', 'Free end gaps', 'Mean Coverage',
       'Modified', 'Name', 'Sample', 'Sequence', 'Sequence List Name',
       'Topology', 'URN', 'barcode', 'barcode_alias',
       'basecall_model_version_id', 'ch', 'flow_cell_id', 'parent_read_id',
       'protocol_group_id', 'runid', 'sample_id', 'start_time'],
      dtype='object')

In [None]:
hits[hits["parent_read_id"] != filtered_hits["parent_read_id"]]

KeyError: 'parent_read_id'
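
The KeyError is expected here: the raw `hits` table has no parent_read_id column, since that column only exists after `split_query_name` has been applied (as in `c_hits`). An elementwise `!=` would also not line up when the two tables have different numbers of rows. A sketch of a membership-based comparison, on invented ids:

```python
import pandas as pd

# Toy stand-ins: the filtered table keeps a subset of the hits.
toy_hits = pd.DataFrame({"parent_read_id": ["r1", "r2", "r3", "r4"]})
toy_filtered = pd.DataFrame({"parent_read_id": ["r2", "r4"]})

# Guard against the KeyError by checking that the split has been done.
for label, frame in (("hits", toy_hits), ("filtered hits", toy_filtered)):
    if "parent_read_id" not in frame.columns:
        raise KeyError(f"{label} has no parent_read_id column; run split_query_name first")

# Rows of hits whose id does not appear anywhere in the filtered table.
removed = toy_hits[~toy_hits["parent_read_id"].isin(toy_filtered["parent_read_id"])]
print(removed["parent_read_id"].tolist())
```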

## Number of Unique Organisms Found

In [18]:
c_hits['Organism'].unique()

array(['Rubrivivax gelatinosus', 'Candidatus Viadribacter',
       'Kaustia mangrovi', 'Marinomonas arctica', 'Enterococcus faecium',
       'MAG: Burkholderiales', 'Aquabacterium sp.',
       'Agrobacterium pusense', 'Acinetobacter baumannii',
       'Aspergillus citrinoterreus', 'Klebsiella pneumoniae',
       'Escherichia coli', 'Mus musculus', 'MAG: Pyrinomonadaceae',
       'MAG: Terriglobia', 'MAG: Phototrophicaceae', 'MAG: Hoeflea',
       'MAG: Alphaproteobacteria', 'MAG: Gammaproteobacteria',
       'Escherichia phage'], dtype=object)
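
`unique()` lists the organisms; to get the count itself, and the number of hits per organism, `nunique()` and `value_counts()` can be used. The sketch below runs on a made-up column, so the numbers say nothing about the real run.

```python
import pandas as pd

# Toy Organism column with a repeated name and one missing value.
toy_hits = pd.DataFrame({
    "Organism": ["Escherichia coli", "Escherichia coli", "Mus musculus", "MAG: Hoeflea", None],
})

# nunique() ignores missing values by default.
n_organisms = toy_hits["Organism"].nunique()

# value_counts() gives the number of hits per organism, most frequent first.
hits_per_organism = toy_hits["Organism"].value_counts()
print(n_organisms)
print(hits_per_organism)
```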