# Multiclass SANITY CHECK

This notebook is for me to figure out where the heck I'm going wrong with breaking down the multiclass stuff labeling.

The problem was discovered when I tried to do a 1-to-1 comparison of the binary labels to the multiclass labels. If a multiclass label was present, it ought to have been matched by a '1' in the binary label set. This was not the case.

I circled back and parsed apart the binary label treatment and am fairly confident that I didn't make an error there.

Thus, the error must reside with how I preprocessed the multiclass labels. 

Hopefully, I can identify the source of the errors in this notebook. 

This notebook was derived from _20201229_multiclass.ipynb_.

# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# FINISHED!

# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In [1]:
import copy
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

from datetime import datetime
from progressbar import ProgressBar
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the list of stations ID #s.
data_dir = "../data/stations/"
station_filenames_list = [
    filename for filename in os.listdir(path=data_dir)
    if filename!=".DS_Store"
]

# Load just the list of ACCLIMA station IDs.
file_path = "../data/acclima_stations_id_list.txt"
acclima_stations_list = pd.read_csv(file_path, header=None).iloc[:,0].values.tolist()

## Load expanded array

In [3]:
# This operation takes a couple minutes, so only do it if you really need to reload the stuff.
df_expanded = pd.read_pickle("../data/acclima_soil_water_rleeper_1214.pickle")
print(df_expanded.shape)

# The previous line loads column names as values in the first row. Set them
# as the actual column names and then delete the first row.
df_expanded.columns = df_expanded.iloc[0].values
print(df_expanded.shape)

df_expanded = df_expanded.iloc[1:, :]
print(df_expanded.shape)

(34343059, 13)
(34343059, 13)
(34343058, 13)


Checks out thus far.

## Subset the expanded 2020-12-14 dataset to ACCLIMA only

In [4]:
# This takes about 1 minute to run, so only run when necessary.
# Filter the expanded dataset down to just the ACCLIMA stations so that
# it's easier to wield in memory.
df_acclima = df_expanded.isin({"WBANNO":acclima_stations_list})
df_acclima = df_expanded.iloc[df_acclima.WBANNO.values]

# Delete df_expanded to free up some dang memory.
del(df_expanded)

In [5]:
# Rename the ACCLIMA DF's TAG columns to not be "TAGS, NaN, NaN, NaN".
df_acclima.columns =\
    df_acclima.columns[:9].tolist() + ["TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03",]

In [6]:
print(df_acclima.shape)

print(34343059 - 14641744)

(14641744, 13)
19701315


Again, checks out. Successfully subsetted to ACCLIMA sensor stations only.

In [7]:
df_acclima.head()

Unnamed: 0,WBANNO,UTC_START,NAME,VALUE,ACCLIMA,RANGE_FLAG,DOOR_FLAG,FROZEN_FLAG,MANUAL_FLAG,TAGS_00,TAGS_01,TAGS_02,TAGS_03
740067,3054,2017-05-01 00:00:00,p_official,0.0,,0,0,0,0,,,,
740068,3054,2017-05-01 00:00:00,t_official,16.491,,0,0,0,0,,,,
740069,3054,2017-05-01 01:00:00,p_official,0.0,,0,0,0,0,,,,
740070,3054,2017-05-01 01:00:00,t_official,12.662,,0,0,0,0,,,,
740071,3054,2017-05-01 02:00:00,p_official,0.0,,0,0,0,0,,,,


# Subset down to a single station and create a data-prep regime that works here for reducing **from** multilabel **to** multiclass

First, create a few constants and functions.

In [133]:
TAGS_SET = {
    'Acclima-Zero', 'Acclima-Toohigh', 'Acclima-Too high', 'Acclima-NoPrcpResponse', 
    'Acclima-FrozenRecovery', 'Acclima-Noise', 'Acclima-Failure',
    'Acclima-Spike', 'Acclima-DiurnalNoise', 'Acclima-Erratic', 
    'Acclima-Static'
}

def clean_tags_dataframe(df_targets):
    
    """
    This function takes a target data frame and replaces the tags with their cleaned-up, space-less versions.
    """
    
    # Make a copy of the dataframe so we don't overwrite the original.
    df_targets_cleaned = copy.deepcopy(df_targets)
    
    # Loop through all the cleaned versions of the tags and replace the original versions,
    # which have extra whitespace pre-pended to them, with the cleaned versions.
    for tag in TAGS_SET:
        df_targets_cleaned.replace(
            to_replace=" "+tag,
            value=tag,
            inplace=True,
        )
    
    # Replace "None" tags with an empty string.
    df_targets_cleaned.replace(
        to_replace=[None],
        value=[""],
        inplace=True,
    )
    
    return df_targets_cleaned
    
    
def rename_tags_in_df(df_targets):
    """
    Replaces 'Acclima-Spike' with 'spike' and noise-related Acclima tags with 'noise'.
    Returns a dataframe with renamed tags.
    """
    df_targets_renamed = copy.deepcopy(df_targets)
    
    # Rename SPIKES.
    df_targets_renamed.replace(
        to_replace="Acclima-Spike",
        value="spike",
        inplace=True,
    )
    # Rename NOISE.
    noise_tag_list = [
        "Acclima-Noise",
        "Acclima-Diurnal Noise", 
        "Acclima-FrozenRecovery", 
        "Acclima-Erratic",
    ]
    for noise_tag in noise_tag_list:
        df_targets_renamed.replace(
            to_replace=noise_tag,
            value="noise",
            inplace=True,
        )
    return df_targets_renamed

# Prepare targets and then features for single station

Need to select station that has multiple types of tags

In [134]:
# Select station to treat. I selected 14 b/c it's the first one in the station series with >=5 tags.
station_list_idx = 14
station_id_num = acclima_stations_list[station_list_idx]

# Subset down to the single ACCLIMA station of interest.
df_station = df_acclima[df_acclima.WBANNO.eq(station_id_num)]

# Subset down to just the columns of interest.
df_station =\
    df_station[
        ["UTC_START", "NAME", "VALUE", "TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]
    ]

# Convert all datetimes to actual datetime datatypes.
df_station.UTC_START = pd.to_datetime(df_station.UTC_START, format="%Y-%m-%d %H:%M:%S")

# Get just the station's targets.
df_station_targets = df_station[["TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]]

# Clean up and then rename the targets.
df_station_targets =\
        rename_tags_in_df(
            clean_tags_dataframe(
                df_station[["TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]]
            )
        )

print(df_station_targets.TAGS_00.shape)

print(np.unique(df_station_targets.values))

print(df_station_targets.TAGS_01[df_station_targets.TAGS_01.isna()])

print(df_station_targets.TAGS_01[df_station_targets.TAGS_01 != ""])

(457391,)
['' 'Acclima-Too high' 'Acclima-Zero' 'noise' 'spike']
Series([], Name: TAGS_01, dtype: object)
13435593    Acclima-Too high
13435610    Acclima-Too high
13435627    Acclima-Too high
13435644    Acclima-Too high
13435661    Acclima-Too high
                  ...       
13454155    Acclima-Too high
13454172    Acclima-Too high
13454189    Acclima-Too high
13454206    Acclima-Too high
13454223    Acclima-Too high
Name: TAGS_01, Length: 257, dtype: object


In [125]:
# # Drop all tags except for '', 'spikes', and 'noise'.
# array_is_normal = df_station_targets.eq('').values

# vec_is_normal = np.logical_or(np.logical_or(np.logical_or(array_is_normal[:,0], array_is_normal[:,1]), array_is_normal[:,2]), array_is_normal[:, 3])

# print(vec_is_normal.sum())

# print(vec_is_normal.shape)

The preceding approach won't work b/c nearly every row will have a 'normal' tag associated with it.

In [89]:
"""For each row in the targets, loop through and drop rows with more than one tag.
I'm taking this approach b/c there's plenty of data and the rows that have multiple tags
represent odd cases that clearly are not **only** an example of a spike or noise."""

# play = df_station_targets.iloc[:, 1:]

# playmore = play[play.TAGS_01!=''].values
# # playmore = play[play.TAGS_01==''].values

# # "".join(playmore[0])

# ## THE FOLLOWING LINE DOESN'T WORK AS INTENDED. Need to use np.array(["".join(row) for row in play]) instead!
# arr_tags_concatenated = np.apply_along_axis("".join, 1, playmore)
# # arr_tags_concatenated

# arr_concattags_lengths = np.array([l for l in map(len, arr_tags_concatenated)])
# # arr_concattags_lengths

# arr_tags_to_keep = ~(arr_concattags_lengths > 3)
# # ~(arr_concattags_lengths > 3)

Trying a different approach now.

In [181]:
# play = df_station_targets.iloc[:, 1:].values
play = np.char.array([ ["","",""], [""," Acclima-Too high", ""], ["","",""]])
print(np.unique(play))

['' ' Acclima-Too high']


In [182]:
arr_tags_concatenated = np.array(["".join(row) for row in play])

print(np.unique(arr_tags_concatenated))

['' ' Acclima-Too high']


In [184]:
arr_concattags_lengths = np.array([l for l in map(len, arr_tags_concatenated)])
print(arr_concattags_lengths)

[ 0 17  0]


In [185]:
arr_tags_to_keep = ~(arr_concattags_lengths > 3)
print(arr_tags_to_keep)

[ True False  True]


In [187]:
play_updated = play[arr_tags_to_keep]
print(play_updated)

[['' '' '']
 ['' '' '']]


The stuff above works great! I'm going to generate that array for the full targets list.

In [192]:
play = df_station_targets.iloc[:, 1:].values
print(np.unique(play))

['' 'Acclima-Too high']


In [193]:
arr_tags_concatenated = np.array(["".join(row) for row in play])
print(np.unique(arr_tags_concatenated))

['' 'Acclima-Too high']


In [195]:
arr_concattags_lengths = np.array([l for l in map(len, arr_tags_concatenated)])
print(arr_concattags_lengths)
print(arr_concattags_lengths.sum())

[0 0 0 ... 0 0 0]
4112


In [198]:
arr_tags_to_keep = ~(arr_concattags_lengths > 3)
print(arr_tags_to_keep)
print((~arr_tags_to_keep).sum())

[ True  True  True ...  True  True  True]
257


In [234]:
df_tags_reduced = df_station_targets[arr_tags_to_keep].iloc[:,0]
print(df_tags_reduced.shape[0])
print(df_station_targets.shape[0])
print((~arr_tags_to_keep).sum() + df_tags_reduced.shape[0])

457134
457391
457391


In [231]:
print(df_tags_reduced.unique())

['' 'spike' 'Acclima-Zero' 'noise']


Excellent. It's working. It's quick. It's time to serialize this for use this to subset the targets down to just single-label rows.

In [223]:
df_station_targets_reduced_final = df_tags_reduced[df_tags_reduced.isin(["", "spike", "noise"])]

print(df_tags_reduced.shape)
print(df_station_targets_reduced_final.shape)

(457134,)
(454740,)


Now it's time to reduce the feature set.

In [224]:
# All station data w/out tag filtering.
print(df_station.iloc[:, :3].shape)

# Station data w/tag filtering.
print(df_station.loc[df_station_targets_reduced_final.index, ["UTC_START", "NAME", "VALUE"]].shape)

(457391, 3)
(454740, 3)


The dimensions for the tag-filtered feature set check out!

Now I'll filter and double check that all the data matches.

In [230]:
df_station_features = df_station.loc[df_station_targets_reduced_final.index, ["UTC_START", "NAME", "VALUE"]]

print(df_station_features.shape)

print(df_station_targets_reduced_final.shape)

print(df_station_targets_reduced_final.unique())

(454740, 3)
(454740,)
['' 'spike' 'noise']


This looks like it's working.  I'm going functionalize these routines that I've written, test the functions, and then iterate over all the stations, caching as I go.

In [241]:
# Isolate the last three columns of targets. 
# If there's a row where they're non-empty, then that's indicative of a multilabel example. 
# Our goal here is to eliminate all multilabel examples.
array_multilabel_targets = df_station_targets.iloc[:, 1:].values

# Iterate through the rows of multilabel targets and concatenate all tags into a single string.
# For rows that don't have any multilabel tag, the resulting entry will be an empty string of length 0.
# For multilabel rows, there will be a string with non-zero length.
arr_tags_concatenated = np.array(["".join(row) for row in array_multilabel_targets])

# Iterate through the concatenated tags and calculate their lengths. 
# These lengths will be stored in the new array defined below.
arr_concattags_lengths = np.array([l for l in map(len, arr_tags_concatenated)])

# Find all zero-length elements of the array. 
# These entries are the rows in the original targets dataframe that we want to keep,
# since they are the single-label (ie, non-multilabel) rows.
arr_tags_to_keep = (arr_concattags_lengths == 0)

# Reduce the targets dataframe to the first column.
# This column represents all of the single-label targets.
df_tags_reduced = df_station_targets[arr_tags_to_keep].iloc[:,0]

# Print some sanity statistics.
print("Remaining unique labels:                                 ", df_tags_reduced.unique())
print("Number of reduced targets:                               ", df_tags_reduced.shape[0])
print("Original number of targets:                              ", df_station_targets.shape[0])
print("Number of reduced targets plus number of dropped targets:", (~arr_tags_to_keep).sum() + df_tags_reduced.shape[0])

# Get the final set of targets by filtering out anything that's not a spike, noise, or normal.
df_station_targets_reduced_final = df_tags_reduced[df_tags_reduced.isin(["", "spike", "noise"])]

# Get the final set of station features by using the indices of the remaining targets.
df_station_features = df_station.loc[df_station_targets_reduced_final.index, ["UTC_START", "NAME", "VALUE"]]

print()
print("Final set of unique labels:", df_station_targets_reduced_final.unique())
print("Number of final labels:    ", df_station_features.shape[0])
print("Number of final features:  ",df_station_targets_reduced_final.shape[0])

Remaining unique labels:                                  ['' 'spike' 'Acclima-Zero' 'noise']
Number of reduced targets:                                457134
Original number of targets:                               457391
Number of reduced targets plus number of dropped targets: 457391

Final set of unique labels: ['' 'spike' 'noise']
Number of final labels:     454740
Number of final features:   454740


Let's test this for a few other stations before moving onto actually functionalizing and implementing it.

In [256]:
# Select station to test.
# station_list_idx = 14
for station_list_idx in range(20, 25):
    print("############################", station_list_idx, "############################")

    station_id_num = acclima_stations_list[station_list_idx]

    # Subset down to the single ACCLIMA station of interest.
    df_station = df_acclima[df_acclima.WBANNO.eq(station_id_num)]

    # Subset down to just the columns of interest.
    df_station =\
        df_station[
            ["UTC_START", "NAME", "VALUE", "TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]
        ]

    # Convert all datetimes to actual datetime datatypes.
    df_station.UTC_START = pd.to_datetime(df_station.UTC_START, format="%Y-%m-%d %H:%M:%S")

    # Get just the station's targets.
    df_station_targets = df_station[["TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]]

    # Clean up and then rename the targets.
    df_station_targets =\
            rename_tags_in_df(
                clean_tags_dataframe(
                    df_station[["TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]]
                )
            )

    # Print sanity-check statistics.
    print("Original set of targets:                                         ", np.unique(df_station_targets.values))
    print("Original number of targets:                                      ", df_station_targets.shape[0])

    ################################################################
    ################################################################
    # FILTER OUT MULTILABEL INSTANCES AND THEN RUN SANITY CHECKS.
    ################################################################
    ################################################################

    # Isolate the last three columns of targets. 
    # If there's a row where they're non-empty, then that's indicative of a multilabel example. 
    # Our goal here is to eliminate all multilabel examples.
    array_multilabel_targets = df_station_targets.iloc[:, 1:].values

    # Iterate through the rows of multilabel targets and concatenate all tags into a single string.
    # For rows that don't have any multilabel tag, the resulting entry will be an empty string of length 0.
    # For multilabel rows, there will be a string with non-zero length.
    arr_tags_concatenated = np.array(["".join(row) for row in array_multilabel_targets])

    # Iterate through the concatenated tags and calculate their lengths. 
    # These lengths will be stored in the new array defined below.
    arr_concattags_lengths = np.array([l for l in map(len, arr_tags_concatenated)])

    # Find all zero-length elements of the array. 
    # These entries are the rows in the original targets dataframe that we want to keep,
    # since they are the single-label (ie, non-multilabel) rows.
    arr_tags_to_keep = (arr_concattags_lengths == 0)

    # Reduce the targets dataframe to the first column.
    # This column represents all of the single-label targets.
    df_tags_reduced = df_station_targets[arr_tags_to_keep].iloc[:,0]

    # Print some sanity statistics.
    print()
    print("Remaining unique labels:                                         ", df_tags_reduced.unique())
    print("Original number of targets:                                      ", df_station_targets.shape[0])
    print("Number of reduced targets:                                       ", df_tags_reduced.shape[0])
    print("Number of reduced targets plus number of dropped targets:        ", (~arr_tags_to_keep).sum() + df_tags_reduced.shape[0])

    # Get the final set of targets by filtering out anything that's not a spike, noise, or normal.
    df_station_targets_reduced_final = df_tags_reduced[df_tags_reduced.isin(["", "spike", "noise"])]

    # Get the final set of station features by using the indices of the remaining targets.
    df_station_features = df_station.loc[df_station_targets_reduced_final.index, ["UTC_START", "NAME", "VALUE"]]

    print()
    print("Final set of unique labels:                                      ", df_station_targets_reduced_final.unique())
    print("Number of final labels:                                          ", df_station_features.shape[0])
    print("Number of final features:                                        ",df_station_targets_reduced_final.shape[0])
    print()
    print()

############################ 20 ############################
Original set of targets:                                          ['']
Original number of targets:                                       248753

Remaining unique labels:                                          ['']
Original number of targets:                                       248753
Number of reduced targets:                                        248753
Number of reduced targets plus number of dropped targets:         248753

Final set of unique labels:                                       ['']
Number of final labels:                                           248753
Number of final features:                                         248753


############################ 21 ############################
Original set of targets:                                          ['' 'Acclima-Zero' 'noise']
Original number of targets:                                       249528

Remaining unique labels:                               

#### Looks like this is working well! I reviewed the sanity-check print-outs. I'm going to functionalize these methods now.

In [251]:
def get_station_dataframe(station_id_num, df_acclima):
    # Subset down to the single ACCLIMA station of interest.
    df_station = df_acclima[df_acclima.WBANNO.eq(station_id_num)]
    return df_station


def reduce_station_df_and_convert_dates(df_station):
    # Subset down to just the columns of interest.
    df_station =\
        df_station[
            ["UTC_START", "NAME", "VALUE", "TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]
        ]
    # Convert all datetimes to actual datetime datatypes.
    df_station.UTC_START = pd.to_datetime(df_station.UTC_START, format="%Y-%m-%d %H:%M:%S")
    return df_station


def get_station_targets(df_station):

    # Get just the station's targets.
    df_station_targets = df_station[["TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]]

    # Clean up and then rename the targets.
    df_station_targets =\
            rename_tags_in_df(
                clean_tags_dataframe(
                    df_station[["TAGS_00", "TAGS_01", "TAGS_02", "TAGS_03"]]
                )
            )
    return df_station_targets


def get_filtered_targets(df_station_targets):
    # Isolate the last three columns of targets. 
    # If there's a row where they're non-empty, then that's indicative of a multilabel example. 
    # Our goal here is to eliminate all multilabel examples.
    array_multilabel_targets = df_station_targets.iloc[:, 1:].values

    # Iterate through the rows of multilabel targets and concatenate all tags into a single string.
    # For rows that don't have any multilabel tag, the resulting entry will be an empty string of length 0.
    # For multilabel rows, there will be a string with non-zero length.
    arr_tags_concatenated = np.array(["".join(row) for row in array_multilabel_targets])

    # Iterate through the concatenated tags and calculate their lengths. 
    # These lengths will be stored in the new array defined below.
    arr_concattags_lengths = np.array([l for l in map(len, arr_tags_concatenated)])

    # Find all zero-length elements of the array. 
    # These entries are the rows in the original targets dataframe that we want to keep,
    # since they are the single-label (ie, non-multilabel) rows.
    arr_tags_to_keep = (arr_concattags_lengths == 0)

    # Reduce the targets dataframe to the first column.
    # This column represents all of the single-label targets.
    df_tags_reduced = df_station_targets[arr_tags_to_keep].iloc[:,0]

    # Get the final set of targets by filtering out anything that's not a spike, noise, or normal.
    df_station_targets_reduced_final = df_tags_reduced[df_tags_reduced.isin(["", "spike", "noise"])]
    
    return df_station_targets_reduced_final, df_tags_reduced
    
    
def get_filtered_features(df_station, df_station_targets_reduced_final):
    # Get the final set of station features by using the indices of the remaining targets.
    df_station_features = df_station.loc[df_station_targets_reduced_final.index, ["UTC_START", "NAME", "VALUE"]]
    return df_station_features

Test these functionalized methods.

In [253]:
# Select station to test.
# station_list_idx = 14

for station_list_idx in range(15, 20):
    
    station_id_num = acclima_stations_list[station_list_idx]
    print("############################", station_list_idx, "############################")

    # Get just the station of interest.
    df_station = get_station_dataframe(station_id_num, df_acclima)

    # Cut the station data down to just the columns-of-interest.
    # Convert the date-times in the UTC_START column to datetime objects.
    df_station = reduce_station_df_and_convert_dates(df_station)

    # Isolate the station targets for filtering.
    df_station_targets = get_station_targets(df_station)

    # Print sanity-check statistics.
    print("Original set of targets:                                         ", np.unique(df_station_targets.values))
    print("Original number of targets:                                      ", df_station_targets.shape[0])
    
    # Filter the targets down to single-label targets-of-interest (ie, just normal, "spike" and "noise").
    df_station_targets_reduced_final, df_tags_reduced = get_filtered_targets(df_station_targets)

    # Print some sanity statistics.
    print()
    print("Remaining unique labels:                                         ", df_tags_reduced.unique())
    print("Original number of targets:                                      ", df_station_targets.shape[0])
    print("Number of reduced targets:                                       ", df_tags_reduced.shape[0])
    print("Number of reduced targets plus number of dropped targets:        ", (~arr_tags_to_keep).sum() + df_tags_reduced.shape[0])

    # Get the final feature-set by filtering to feature-rows that have labels remaining after the labels were filtered.
    df_station_features = get_filtered_features(df_station, df_station_targets_reduced_final)

    print()
    print("Final set of unique labels:                                      ", df_station_targets_reduced_final.unique())
    print("Number of final labels:                                          ", df_station_features.shape[0])
    print("Number of final features:                                        ",df_station_targets_reduced_final.shape[0])
    print()
    print()

############################ 15 ############################
Original set of targets:                                          ['' 'spike']
Original number of targets:                                       367029

Remaining unique labels:                                          ['' 'spike']
Original number of targets:                                       367029
Number of reduced targets:                                        367029
Number of reduced targets plus number of dropped targets:         383523

Final set of unique labels:                                       ['' 'spike']
Number of final labels:                                           367029
Number of final features:                                         367029


############################ 16 ############################
Original set of targets:                                          ['' 'Acclima-NoPrcpResponse' 'Acclima-Zero' 'spike']
Original number of targets:                                       368442

Remain

##### Functionalizing the methods produces the same results as when they were non-functionalized. Whoooo!

Moving onto creating pivot tables of features now, with appropriate filtering of missing values.

In [259]:
# Select station to test.
station_list_idx = 24
station_id_num = acclima_stations_list[station_list_idx]

# Get just the station of interest.
df_station = get_station_dataframe(station_id_num, df_acclima)

# Cut the station data down to just the columns-of-interest.
# Convert the date-times in the UTC_START column to datetime objects.
df_station = reduce_station_df_and_convert_dates(df_station)

# Isolate the station targets for filtering.
df_station_targets = get_station_targets(df_station)

# Filter the targets down to single-label targets-of-interest (ie, just normal, "spike" and "noise").
df_station_targets_reduced_final, df_tags_reduced = get_filtered_targets(df_station_targets)

# Get the final feature-set by filtering to feature-rows that have labels remaining after the labels were filtered.
df_station_features = get_filtered_features(df_station, df_station_targets_reduced_final)

In [267]:
df_station_combined = pd.concat([df_station_features, df_station_targets_reduced_final],axis=1)
df_station_combined

Unnamed: 0,UTC_START,NAME,VALUE,TAGS_00
20526527,2017-05-01 00:00:00,p_official,0.7,
20526528,2017-05-01 00:00:00,t_official,6.55,
20526529,2017-05-01 01:00:00,p_official,3.6,
20526530,2017-05-01 01:00:00,t_official,6.502,
20526531,2017-05-01 02:00:00,p_official,5.6,
...,...,...,...,...
20871673,2020-07-31 23:00:00,sw3010,0.3,
20871674,2020-07-31 23:00:00,sw3020,0.319,
20871675,2020-07-31 23:00:00,sw3050,0.292,
20871676,2020-07-31 23:00:00,sw3100,0.35,


In [277]:
df_station_combined.TAGS_00.unique()

array(['', 'spike', 'noise'], dtype=object)

#### Identified a new potential problem:
##### Even though I've eliminated multilable instances for single sensors at single timepoints, there may be single timepoints where **multiple sensors** have differing tags. I'll need to drop those intances from this analysis in order to get to the pure single-label, multiclass case.

In [378]:
df_pivoted_targets = df_station_combined.pivot(index="UTC_START", columns="NAME", values="TAGS_00")
df_pivoted_targets = df_pivoted_targets.drop(["p_official", "t_official"], axis=1)

print(df_pivoted_targets.shape[0])

df_pivoted_targets

28491


NAME,sw1005,sw1010,sw1020,sw1050,sw1100,sw2005,sw2010,sw2020,sw2050,sw2100,sw3005,sw3010,sw3020,sw3050,sw3100
UTC_START,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2017-05-01 00:00:00,,,,,,,,,,,,,,,
2017-05-01 01:00:00,,,,,,,,,,,,,,,
2017-05-01 02:00:00,,,,,,,,,,,,,,,
2017-05-01 03:00:00,,,,,,,,,,,,,,,
2017-05-01 04:00:00,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-07-31 19:00:00,,,,,noise,,,,,,,,,,
2020-07-31 20:00:00,,,,,noise,,,,,,,,,,
2020-07-31 21:00:00,,,,,noise,,,,,,,,,,
2020-07-31 22:00:00,,,,,noise,,,,,,,,,,


In [377]:
# # Figure out how many rows have a missing value. It seems like there may be a lot. Ooof.
# print(
#     (df_pivoted_targets.isna().values.sum(axis=1) > 0).sum()
# )

# Clean the tags up; ie, convert all NaN to ''.
df_pivoted_targets_clean = clean_tags_dataframe(df_pivoted_targets)

print(df_pivoted_targets.shape[0])
print(df_pivoted_targets_clean.shape[0])

df_pivoted_targets_clean.head()

28491
28491


NAME,sw1005,sw1010,sw1020,sw1050,sw1100,sw2005,sw2010,sw2020,sw2050,sw2100,sw3005,sw3010,sw3020,sw3050,sw3100
UTC_START,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2017-05-01 00:00:00,,,,,,,,,,,,,,,
2017-05-01 01:00:00,,,,,,,,,,,,,,,
2017-05-01 02:00:00,,,,,,,,,,,,,,,
2017-05-01 03:00:00,,,,,,,,,,,,,,,
2017-05-01 04:00:00,,,,,,,,,,,,,,,


In [333]:
# Concatenate all the values in each row and check for non-unique labels.
# Ie, check for time points where there's normal/spike, etc.
arr_tags_concatenated = np.array(["".join(row) for row in df_pivoted_targets_clean.values])

print(arr_tags_concatenated)
print(np.unique(arr_tags_concatenated))

['' '' '' ... 'noise' 'noise' 'noise']
['' 'noise' 'spikespikespike' 'spikespikespikespikespike']


In [334]:
play = df_pivoted_targets_clean.values
play

array([['', '', '', ..., '', '', ''],
       ['', '', '', ..., '', '', ''],
       ['', '', '', ..., '', '', ''],
       ...,
       ['', '', '', ..., '', '', ''],
       ['', '', '', ..., '', '', ''],
       ['', '', '', ..., '', '', '']], dtype=object)

In [330]:
# Use set operations to the ID unique tags in each row. Then, 
# set(np.char.array([ ["","",""], [""," Acclima-Too high", ""], ["a","b","c"]])[2])
unique_tags_by_row = [list(set(row)) for row in play]
unique_tags_by_row[:10]

[[''], [''], [''], [''], [''], [''], [''], [''], [''], ['']]

In [340]:
# Check where there's more than one unique tag per row.
# These ought to be the timepoints with more than one unique label.
# These should be dropped from the analysis.
print(
    np.array(
        [len(row) > 1 for row in unique_tags_by_row]
    ).sum()
)
print(
    len(unique_tags_by_row)
)

70
28491


#### WELLLLL, ACTUALLY, I'm not concerned with multilabel rows that have both normal and then one tag. 

#### I'm concerned with multilabel rows that have "spike" and "noise".

#### So, I'll concatenate the unique labels together. Anything over length 5 ( len("spike")=5 and len("noise")=5 ) will be a multilabel instance, since len("spikenoise")=10.  

In [379]:
# Find all the multilabel row locations via some string method trickery.
multilabel_row_locations = np.array(
    [
        len(
            "".join(row)    # Join all the unique tags in each row together; ie, 
                            # ["", "noise"] --> "noise", while
                            # ["noise", "spike"] --> "noisespike"
        ) > 5               # Check for anything that has length > 5. This will only occur where
                            # "".join(row) --> "spikenoise" or "noisespike".
        for row in unique_tags_by_row
    ]
)

print(multilabel_row_locations.shape[0])

28491


##### Now I need to drop the multilabel timepoints from the analysis.

In [359]:
# Filter the pivoted targets DF of any multilabel row locations (ie, locations that are co-labeled "spike" and "noise").
df_pivoted_targets_singlelabel =\
    df_pivoted_targets[~multilabel_row_locations]

In [382]:
df_pivoted_features_singlelabel = df_station_features.pivot(
    index="UTC_START", columns="NAME", values="VALUE"            # Create a pivoted DF of the features.
)

print(df_pivoted_features_singlelabel.shape[0])

# Use the filtered targets to dataframe's index to filter the pivoted features dataframe.
# That way, we have only feature locations with single-label multiclass features.
df_pivoted_features_singlelabel = df_station_features.pivot(
    index="UTC_START", columns="NAME", values="VALUE"            # Create a pivoted DF of the features.
).loc[
    df_pivoted_targets_singlelabel.index                         # Filter the pivoted features DF using the datetimes of the remaining targets.
]

print(df_pivoted_features_singlelabel.shape[0])

28491
28490


It appears that filtering by the single-label DF's index reduces the pivoted feature DF's date range. 

So, I need to reverse-filter the single-label DF using the remaining indices from the feature DF.

Ugh. There's so many flipping tricky parts of this problem.

In [385]:
# Filter the pivoted single-label targets DF using the remaining feature DF datetime indices.
df_pivoted_targets_singlelabel =\
    df_pivoted_targets_singlelabel.loc[
        df_pivoted_features_singlelabel.index
    ]

print(df_pivoted_targets_singlelabel.shape[0])

28490


That's done the trick.

Now both the pivoted targets and the pivoted features have the same datetime indices.

Now I need to go back and reformat the remaining targets so that they're a single-column series, rather than a multi-dimensional dataframe.

In [386]:
# Get just the remaining single-label target values.
# I'll use these to get down to one label entry per datetime row.
array_pivoted_targets_singlelabel = clean_tags_dataframe(df_pivoted_targets_singlelabel).values

# Use the list(set()) trick to filter down to the unique entries in each row of the targets array.
unique_tags_by_row = [list(set(row)) for row in array_pivoted_targets_singlelabel]

# Join these unique entries together to form a single entry per target row.
# Since each row only has either {""}, {"", "spike"} or {"", "noise"}, the result will be a single label per row.
multiclass_targets_array = np.char.array(
    [
        "".join(row)                     # Join together the unique single-labels;
                                         # ie, ["","spike"] --> "spike" and ["", "noise"] --> "noise".
        for row in unique_tags_by_row
    ]
)

# Recombine the newly-filtered multiclass targets with their original datetime index.
df_targets_singlelabel = pd.DataFrame(index=df_pivoted_targets_singlelabel.index, values=mutliclass_targets_array)

TypeError: sequence item 0: expected str instance, float found

In [376]:
# multiclass_targets_array = np.char.array(
#     [
#         "".join(row)                     # Join together the unique single-labels;
#                                          # ie, ["","spike"] --> "spike" and ["", "noise"] --> "noise".
#         for row in unique_tags_by_row
#     ]
# )
# print(len(unique_tags_by_row))
# print(multiclass_targets_array.shape[0])
# print(df_station_features.shape[0])
# print(df_pivoted_features_singlelabel.shape[0])


28491
28491
343999
28490


In [370]:
multiclass_targets_series = pd.Series(index)

In [364]:
# Now I need to drop any and all feature rows that have NaN values.
df_pivoted_features_singlelabel = df_pivoted_features_singlelabel.dropna(how="any", axis=0)

df_pivoted_features_singlelabel

NAME,p_official,sw1005,sw1010,sw1020,sw1050,sw1100,sw2005,sw2010,sw2020,sw2050,sw2100,sw3005,sw3010,sw3020,sw3050,sw3100,t_official
UTC_START,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2018-05-22 21:00:00,0,0.371,0,0.394,0.142,0,0.426,0.458,0.407,0.424,0.121,0.452,0.491,0.468,0,0,19.849
2018-05-22 22:00:00,0,0.37,0,0.397,0.142,0,0.425,0.45,0.406,0.424,0.121,0.45,0.491,0.468,0,0,19.868
2018-05-22 23:00:00,0,0.369,0,0.396,0.142,0,0.424,0.454,0.407,0.425,0.121,0.45,0.488,0.469,0,0,19.833
2018-05-23 00:00:00,0,0.369,0,0.395,0.142,0,0.423,0.448,0.408,0.424,0.121,0.449,0.487,0.47,0,0,19.046
2018-05-23 01:00:00,0,0.367,0,0.399,0.142,0,0.423,0.45,0.407,0.424,0.121,0.446,0.487,0.469,0,0,17.553
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-06-15 13:00:00,0,0.132,0.175,0.19,0.224,0.308,0.201,0.225,0.212,0.33,0.117,0.247,0.267,0.254,0.273,0.343,17.344
2020-06-15 14:00:00,0,0.132,0.175,0.19,0.225,0.307,0.202,0.225,0.21,0.33,0.117,0.248,0.267,0.254,0.273,0.342,19.362
2020-06-15 15:00:00,0,0.133,0.175,0.19,0.225,0.306,0.203,0.225,0.209,0.33,0.117,0.249,0.266,0.254,0.273,0.341,20.833
2020-06-15 16:00:00,0,0.133,0.175,0.19,0.225,0.307,0.205,0.225,0.209,0.329,0.117,0.25,0.267,0.254,0.273,0.342,22.185


In [372]:
print(df_pivoted_features_singlelabel.shape[0])
print(multiclass_targets_array.shape[0])

18079
28491
