# Label processing
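This notebook walks through how the antibiotic MIC (minimum inhibitory concentration) labels are processed: loading the raw values, stripping the `>` and `<=` qualifiers, encoding each MIC as an integer class, and saving the result.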

In [None]:
%pip install pandas numpy

In [1]:
import pandas as pd
import numpy as np

In [56]:
data_dir = "../data/test/"

# DEPRECATED
possible_mics = [0.001, 0.003, 0.007, 0.01, 0.015, 0.02, 0.03, 0.06, 0.12, 0.25, 0.5, 1., 2., 4., 8., 16., 32., 64., 128.,
                 256., 512., 1024.]

# Update 2
We will be working with three genes (OMPK35, OMPK36, and OMPK37). Each gene may use a different set of isolates, because an isolate can have missing coverage (holes) for one gene but not another, and an isolate with missing coverage for a gene is excluded for that gene. We still want the antibiotic MIC values for all isolates in a single table, so we load all antibiotic files and concatenate them first. The per-gene loads are kept below as commented-out code; the current version reads a single `antibiotics.tsv`.
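
As a rough sketch of the concatenation step, assuming each per-gene table is indexed by isolate name: `pd.concat` with `axis=1` aligns rows on the index and, by default, keeps the union of isolates, leaving NaN wherever a table has no row for an isolate. The isolate names, column names, and values below are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Hypothetical per-gene tables; "iso1" is missing from gene_b and "iso3" from gene_a.
gene_a = pd.DataFrame(
    {"Antibiotic_1": [">16.00", "0.5"]},
    index=pd.Index(["iso1", "iso2"], name="Name"),
)
gene_b = pd.DataFrame(
    {"Antibiotic_2": ["<=0.06", "4.0"]},
    index=pd.Index(["iso2", "iso3"], name="Name"),
)

# axis=1 places the tables side by side; the default outer join keeps every isolate.
combined = pd.concat([gene_a, gene_b], axis=1)
print(combined)
```

Isolates that appear in only one table end up with NaN in the other table's columns, which is why the later cleaning step has to handle missing values.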

In [49]:
#ompk35 = pd.read_csv(f'{data_dir}antibiotics_OMPK35.tsv', sep='\t', index_col=0)
#ompk36 = pd.read_csv(f'{data_dir}antibiotics_OMPK36.tsv', sep='\t', index_col=0)
#ompk37 = pd.read_csv(f'{data_dir}antibiotics_OMPK37.tsv', sep='\t', index_col=0)

In [57]:
#labels_df = pd.concat([ompk35, ompk36, ompk37], axis=1)
labels_df = pd.read_csv(f'{data_dir}antibiotics.tsv', sep='\t', index_col=0)

In [58]:
labels_df.head()

| Name | Antibiotic_1 | Antibiotic_2 | Antibiotic_3 | Antibiotic_4 | Antibiotic_5 | Antibiotic_6 | Antibiotic_7 | Antibiotic_8 | Antibiotic_9 | Antibiotic_10 | Antibiotic_11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | >16.00 | NaN | 0.06 | 0.25 | 1.0 | NaN | <=0.06 | 0.12 | >64.00 | 4.0 | 4.00 |
| 2 | >16.00 | NaN | 0.5 | NaN | 8.0 | >32.00 | >4.00 | 0.25 | NaN | NaN | >16.00 |


## Dropping characters
Now that the data is in a dataframe, we need to drop the `>` and `<=` prefixes from all values. A value that carries a prefix is read as a string, while a value without one may be an int or a float. We therefore check the type, strip the prefix when it is present, and convert every value to a float. We wrap this in a function and apply it to the whole dataframe.

Efficient ways to accomplish this task are discussed in [this StackOverflow answer](https://stackoverflow.com/a/54302517).
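
One possible vectorized alternative, shown here only as a sketch and not as the approach used in this notebook, is to strip the prefix with a regular expression and let `pd.to_numeric` do the conversion. The helper name `strip_qualifiers` is hypothetical, and this version leaves missing values as NaN.

```python
import numpy as np
import pandas as pd

def strip_qualifiers(col: pd.Series) -> pd.Series:
    """Remove a leading '<', '>', '<=' or '>=' and convert the column to floats."""
    # Cast to str so numeric entries and strings are handled uniformly; NaN becomes "nan".
    as_text = col.astype(str).str.replace(r"^[<>]=?", "", regex=True)
    # Anything that cannot be parsed (including "nan") comes back as NaN.
    return pd.to_numeric(as_text, errors="coerce")

# Made-up sample mixing strings, a missing value, a float and an int.
sample = pd.Series([">16.00", np.nan, "<=0.06", 0.25, 4])
cleaned_sample = strip_qualifiers(sample)
print(cleaned_sample)
```

For a whole dataframe, `df.apply(strip_qualifiers)` would process one column at a time.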

# Update 1 (for dropping characters and encoding)
The original `encode_mics` function has been split into `get_mics` and `encode_mics`, because the dataframes can contain any number of distinct MIC values. `get_mics` first strips the qualifier characters and replaces NaN values with -1. Then, once the values from all genes are combined, we collect the set of MICs that actually occur, and `encode_mics` replaces each value with its index in that sorted set.
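
The encoding idea can be sketched on a tiny table as follows. This version uses a dictionary lookup instead of `list.index`, which is an illustrative substitution rather than the code used below; the name `mic_to_code` is hypothetical, and the values are three columns taken from the two preview rows shown earlier, already cleaned.

```python
import numpy as np
import pandas as pd

# Cleaned MIC values, with -1.0 standing in for missing measurements.
cleaned = pd.DataFrame(
    {
        "Antibiotic_1": [16.0, 16.0],
        "Antibiotic_2": [-1.0, -1.0],
        "Antibiotic_3": [0.06, 0.5],
    },
    index=pd.Index([1, 2], name="Name"),
)

# np.unique flattens the table and returns the distinct values in ascending order.
set_mics = np.unique(cleaned.to_numpy()).tolist()

# Map each MIC to its position in the sorted list.
mic_to_code = {mic: code for code, mic in enumerate(set_mics)}
encoded = cleaned.apply(lambda col: col.map(mic_to_code)).astype(int)
print(set_mics)
print(encoded)
```

Because -1.0 sorts before every real MIC, missing values always receive code 0, and `set_mics[code]` recovers the original MIC for any code.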

In [59]:
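def get_mics(col):
    """Strip '>' / '<=' qualifiers and return floats; missing values become -1.0."""
    def try_extract(x):
        if isinstance(x, str):
            # lstrip takes a set of characters, so this removes any leading '<', '>' or '='
            return float(x.lstrip('<>='))
        elif np.isnan(x):
            return -1.0
        else:
            return float(x)
    return pd.Series([try_extract(x) for x in col], dtype=float)

def encode_mics(col, set_mics=()):
    """Replace each MIC with its position in the sorted sequence `set_mics`."""
    # An immutable default avoids the shared mutable-default pitfall of `set_mics=[]`
    set_mics = list(set_mics)
    return pd.Series([set_mics.index(x) for x in col], dtype=int)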

In [60]:
# axis=1 passes one row at a time; result_type='broadcast' keeps the original index and columns
labels_df = labels_df.apply(get_mics, axis=1, result_type='broadcast')
labels_df.head()

| Name | Antibiotic_1 | Antibiotic_2 | Antibiotic_3 | Antibiotic_4 | Antibiotic_5 | Antibiotic_6 | Antibiotic_7 | Antibiotic_8 | Antibiotic_9 | Antibiotic_10 | Antibiotic_11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 16.0 | -1.0 | 0.06 | 0.25 | 1.0 | -1.0 | 0.06 | 0.12 | 64.0 | 4.0 | 4.0 |
| 2 | 16.0 | -1.0 | 0.5 | -1.0 | 8.0 | 32.0 | 4.0 | 0.25 | -1.0 | -1.0 | 16.0 |


In [61]:
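# Collect every distinct MIC in the table (including the -1.0 missing marker), sorted ascending
set_mics = sorted(set(labels_df.to_numpy().ravel()))
# Replace each MIC with its index in set_mics
labels_df = labels_df.apply(encode_mics, axis=1, result_type='broadcast', set_mics=set_mics)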

In [62]:
labels_df.head()

| Name | Antibiotic_1 | Antibiotic_2 | Antibiotic_3 | Antibiotic_4 | Antibiotic_5 | Antibiotic_6 | Antibiotic_7 | Antibiotic_8 | Antibiotic_9 | Antibiotic_10 | Antibiotic_11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 8 | 0 | 1 | 3 | 5 | 0 | 1 | 2 | 10 | 6 | 6 |
| 2 | 8 | 0 | 4 | 0 | 7 | 9 | 6 | 3 | 0 | 0 | 8 |


## Saving
Lastly, we save the encoded labels to `labels.csv` and read the file back as a quick check.
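
A minimal round-trip sketch, assuming the isolate names should come back as the index: passing `index_col=0` to `pd.read_csv` restores the `Name` index instead of adding a fresh integer index. An in-memory buffer stands in for `labels.csv` here, and the two columns are illustrative.

```python
import io
import pandas as pd

# Small encoded table with the isolate name as the index.
encoded = pd.DataFrame(
    {"Antibiotic_1": [8, 8], "Antibiotic_2": [0, 0]},
    index=pd.Index([1, 2], name="Name"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
encoded.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, `Name` is read as an ordinary column and pandas adds a new 0-based index.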

In [22]:
labels_df.to_csv('labels.csv')

In [23]:
pd.read_csv('labels.csv').head()

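|  | Name | Antibiotic_1 | Antibiotic_2 | Antibiotic_3 | Antibiotic_4 | Antibiotic_5 | Antibiotic_6 | Antibiotic_7 | Antibiotic_8 | Antibiotic_9 | Antibiotic_10 | Antibiotic_11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 13 | -1 | 5 | 7 | 9 | -1 | 5 | 6 | 15 | 11 | 11 |
| 1 | 2 | 13 | -1 | 8 | -1 | 12 | 14 | 11 | 7 | -1 | -1 | 13 |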

The preview above appears to come from an earlier run of the notebook (its cells are numbered 22 and 23, while the encoding cells are numbered 59 to 62), so its codes do not match the encoded table shown earlier; for example, missing values appear as -1 here rather than 0.
