# Label processing
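This notebook walks through how the label file is processed: loading the raw MIC (minimum inhibitory concentration) values, encoding them as integer indices, and saving the result.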

In [None]:
%pip install pandas numpy

In [3]:
import pandas as pd
import numpy as np

In [4]:
data_dir = "../data/test/"
possible_mics = [0.001, 0.003, 0.007, 0.015, 0.03, 0.06, 0.12, 0.25, 0.5, 1., 2., 4., 8., 16., 32., 64., 128.,
                 256., 512., 1024.]

In [5]:
labels_df = pd.read_csv(f'{data_dir}antibiotics.tsv', sep='\t', index_col=0)

In [6]:
labels_df.head()

Name,Antibiotic_1,Antibiotic_2,Antibiotic_3,Antibiotic_4,Antibiotic_5,Antibiotic_6,Antibiotic_7,Antibiotic_8,Antibiotic_9,Antibiotic_10,Antibiotic_11
1,>16.00,,0.06,0.25,1.0,,<=0.06,0.12,>64.00,4.0,4.00
2,>16.00,,0.5,,8.0,>32.00,>4.00,0.25,,,>16.00


## Dropping characters
Now that the data is in a DataFrame, we need to drop the `>` and `<=` characters from the values. A value that carries one of these qualifiers is read as a string, a plain value may be an int or a float, and a missing value is NaN. For each value we check the type, strip the qualifier if it is present, and convert the result to a float. The float is then replaced by its index in `possible_mics`, and missing values are encoded as `-1`. We wrap this in a function and apply it across the DataFrame.

Helpful guidance on efficient ways to do this can be found in [this StackOverflow answer](https://stackoverflow.com/a/54302517).
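
The per-value cleaning described above can be sketched in isolation. This is only an illustration: the helper name `strip_qualifier` and the sample values are hypothetical and not part of the notebook's own code.

```python
import math

def strip_qualifier(value):
    """Return the numeric part of a single MIC reading as a float."""
    if isinstance(value, str):
        # str.lstrip takes a set of characters, so this removes any
        # leading mix of '<', '>' and '=' (e.g. '>16.00' or '<=0.06').
        return float(value.lstrip("<>="))
    # Numeric readings (including NaN for missing ones) only need a float cast.
    return float(value)

# Hypothetical sample values covering the cases seen in the table above.
examples = [">16.00", "<=0.06", 0.25, 4, float("nan")]
cleaned = [strip_qualifier(v) for v in examples]
print(cleaned)
```

Note that the qualifier is discarded here, so `>16.00` and `16.00` become indistinguishable after cleaning.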

In [19]:
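def encode_mics(col):
    """Encode each MIC value as its index in possible_mics (-1 if missing)."""
    def try_extract(x):
        # list.index raises ValueError if the value is not in possible_mics
        if isinstance(x, str):
            # Qualified values such as '>16.00' or '<=0.06': strip the prefix first
            return possible_mics.index(float(x.lstrip('<=').lstrip('>')))
        elif np.isnan(x):
            # Missing measurement
            return -1
        else:
            # Already numeric (int or float)
            return possible_mics.index(float(x))
    return pd.Series([try_extract(x) for x in col], dtype=int)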

In [20]:
# axis=1 passes each row to encode_mics; 'broadcast' keeps the original index and column labels
labels_df = labels_df.apply(encode_mics, axis=1, result_type='broadcast')

In [21]:
labels_df.head()

Name,Antibiotic_1,Antibiotic_2,Antibiotic_3,Antibiotic_4,Antibiotic_5,Antibiotic_6,Antibiotic_7,Antibiotic_8,Antibiotic_9,Antibiotic_10,Antibiotic_11
1,13,-1,5,7,9,-1,5,6,15,11,11
2,13,-1,8,-1,12,14,11,7,-1,-1,13
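
To sanity-check the encoding, the indices can be mapped back to MIC values. The sketch below is an assumption about how one might do that, not something the notebook does: `decode_mics` is a hypothetical helper, and `possible_mics` is repeated so the block runs on its own.

```python
import numpy as np

# Same dilution series as `possible_mics` earlier in the notebook.
possible_mics = [0.001, 0.003, 0.007, 0.015, 0.03, 0.06, 0.12, 0.25, 0.5, 1., 2., 4.,
                 8., 16., 32., 64., 128., 256., 512., 1024.]

def decode_mics(indices):
    """Map encoded indices back to MIC values; -1 becomes NaN."""
    idx = np.asarray(indices, dtype=int)
    # Fancy indexing with -1 wraps around to the last element,
    # so those positions are overwritten with NaN afterwards.
    values = np.array(possible_mics)[idx]
    return np.where(idx == -1, np.nan, values)

# First encoded row from the output above.
decoded = decode_mics([13, -1, 5, 7, 9, -1, 5, 6, 15, 11, 11])
print(decoded)
```

The decoded values match the numeric part of the first raw row, but the `>` and `<=` qualifiers cannot be recovered because the encoding discards them.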


## Saving
Lastly, we save the encoded labels to `labels.csv` and read the file back to check the result.

In [22]:
labels_df.to_csv('labels.csv')

In [23]:
pd.read_csv('labels.csv').head()

,Name,Antibiotic_1,Antibiotic_2,Antibiotic_3,Antibiotic_4,Antibiotic_5,Antibiotic_6,Antibiotic_7,Antibiotic_8,Antibiotic_9,Antibiotic_10,Antibiotic_11
0,1,13,-1,5,7,9,-1,5,6,15,11,11
1,2,13,-1,8,-1,12,14,11,7,-1,-1,13
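
As the output above shows, reading the file back without `index_col` turns `Name` into an ordinary column next to a fresh integer index. If you would rather restore `Name` as the index, passing `index_col=0` to `pd.read_csv` is one option. The sketch below uses a small stand-in frame (`encoded`, a hypothetical name, with values copied from the first three columns above) and an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Small stand-in for the encoded labels_df; values copied from the output above.
encoded = pd.DataFrame(
    {"Antibiotic_1": [13, 13], "Antibiotic_2": [-1, -1], "Antibiotic_3": [5, 8]},
    index=pd.Index([1, 2], name="Name"),
)

# Write to an in-memory buffer rather than 'labels.csv' so nothing touches disk.
buffer = io.StringIO()
encoded.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first CSV column ('Name') as the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```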
