## Package Load
Load the packages we'll use to explore the data. All of them are standard imports except for our custom facets tool, from which we import the `dive` module.

In [4]:
import numpy as np
import pandas as pd
from PIL import Image
from facets import dive # <-- Our custom version of facets, get at https://github.com/jsiddique/facets
from sklearn import metrics
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
plt.ioff()

## Load Data from CSV

We've created a CSV file of anonymized voltage data extracted from our `BigQuery` environment. Some of this data is valid and some is invalid; most of it contains noise, and some of it is corrupted. This is the raw data we would start with, straight off our devices.

In [2]:
df = pd.read_csv('CrankingVoltages.csv') # <-- Anonymized cranking voltage data
df = df.sort_values(by=['EventID', 'Milliseconds'])
df = df.reset_index(drop=True)

In [5]:
df.head()

| | EventID | Milliseconds | Voltage | NumRecords | EventSpan | MinVoltage | AvgVoltage | MaxVoltage | MaxMinDiff | FirstMinDiff | ... | AvgDiffAfterMin | MaxDiffBeforeMin | MaxDiffAfterMin | MaxVoltageBeforeMinMinusMin | MaxVoltageAfterMinMinusMin | VoltageSumBeforeMin | VoltageSumAfterMin | SumOfVoltageDiffsBeforeMin | SumOfVoltageDiffsAfterMin | DipLocation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 006f7c744980 | 0 | 12.2863 | 10 | 9918 | 9.6 | 11.73 | 14.25 | 4.65 | 2.68 | ... | 0.91 | 2.08 | 2.15 | 2.69 | 4.65 | 57.03 | 60.25 | 4.55 | 3.65 | 0.5 |
| 1 | 006f7c744980 | 3699 | 12.21474 | 10 | 9918 | 9.6 | 11.73 | 14.25 | 4.65 | 2.68 | ... | 0.91 | 2.08 | 2.15 | 2.69 | 4.65 | 57.03 | 60.25 | 4.55 | 3.65 | 0.5 |
| 2 | 006f7c744980 | 3800 | 11.24868 | 10 | 9918 | 9.6 | 11.73 | 14.25 | 4.65 | 2.68 | ... | 0.91 | 2.08 | 2.15 | 2.69 | 4.65 | 57.03 | 60.25 | 4.55 | 3.65 | 0.5 |
| 3 | 006f7c744980 | 4000 | 11.67804 | 10 | 9918 | 9.6 | 11.73 | 14.25 | 4.65 | 2.68 | ... | 0.91 | 2.08 | 2.15 | 2.69 | 4.65 | 57.03 | 60.25 | 4.55 | 3.65 | 0.5 |
| 4 | 006f7c744980 | 4101 | 9.6028 | 10 | 9918 | 9.6 | 11.73 | 14.25 | 4.65 | 2.68 | ... | 0.91 | 2.08 | 2.15 | 2.69 | 4.65 | 57.03 | 60.25 | 4.55 | 3.65 | 0.5 |


## DataFrame of Unique Event Labels
We'll now create a dataframe which holds each of the unique event ID values. We will iterate over this to create separate plots to visualize each event.

In [6]:
events = df[['EventID']].drop_duplicates()
events = events.reset_index(drop=True)

In [7]:
events.head()

| | EventID |
|---|---|
| 0 | 006f7c744980 |
| 1 | 00cada351560 |
| 2 | 00ccfd693891 |
| 3 | 0100307c65a8 |
| 4 | 01b64b570930 |


## Extract Simple Characteristics
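Next we'll extract some simple characteristics of each curve to demonstrate the utility of the facets exploration tool. This example uses simple minimum, maximum, and average values, but you can extract any kind of feature you want.

For our example, this has already been done: the CSV already includes these values as columns (`MinVoltage`, `AvgVoltage`, `MaxVoltage`), so the cell below only shows how they could be computed.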

In [8]:
MaxVoltage = df[['EventID', 'Voltage']].groupby(by=['EventID']).max()
MaxVoltage = MaxVoltage.rename(index=str, columns={'Voltage': 'MaxVoltage'})

MinVoltage = df[['EventID', 'Voltage']].groupby(by=['EventID']).min()
MinVoltage = MinVoltage.rename(index=str, columns={'Voltage': 'MinVoltage'})

AvgVoltage = df[['EventID', 'Voltage']].groupby(by=['EventID']).mean()
AvgVoltage = AvgVoltage.rename(index=str, columns={'Voltage': 'AvgVoltage'})
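
As an alternative to three separate `groupby` calls, the same per-event summaries can be computed in one pass with pandas named aggregation. This is a minimal sketch on a made-up two-event frame (`toy` and its values are illustrative, not taken from `CrankingVoltages.csv`), and it assumes pandas 0.25 or newer.

```python
import pandas as pd

# Toy stand-in for the cranking data: two events, a few voltage samples each.
toy = pd.DataFrame({
    'EventID': ['a', 'a', 'a', 'b', 'b'],
    'Voltage': [12.0, 9.5, 14.0, 12.5, 10.0],
})

# One groupby pass that yields all three summaries as named columns.
summary = toy.groupby('EventID')['Voltage'].agg(
    MinVoltage='min',
    AvgVoltage='mean',
    MaxVoltage='max',
).reset_index()

print(summary)
```

The result has one row per `EventID`, which makes it straightforward to merge back onto an events table if needed.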

## Create Plots
We'll now create the individual plots that we will visualize with facets.

![Cranking](./reference/002.png)

In [9]:
%%capture
!mkdir image_files

In [10]:
id_array = []
id_features = pd.DataFrame([], columns=['MaxVoltage', 'MinVoltage', 'AvgVoltage', 'EventSpan', 'MaxMinDiff', 'FirstMinDiff', 'LastVoltage'])
img_dim_inches = 1.5
img_dpi = 150
num_examples = 400

In [11]:
reRun = False

for i in range(num_examples):
    id_array.append(events.iloc[i]['EventID'])
    
    example = pd.merge(df, events.iloc[[i]])
    id_features = pd.concat([id_features, example[['MaxVoltage', 'MinVoltage', 'AvgVoltage', 'EventSpan', 'MaxMinDiff', 'FirstMinDiff', 'LastVoltage']].drop_duplicates()], axis=0)
    
    if reRun:
        fig = plt.figure(figsize=(img_dim_inches, img_dim_inches), dpi=img_dpi)
        ax = fig.add_axes([0.17, 0.03, 0.81, 0.93])
        _ = ax.plot(example['Milliseconds'], example['Voltage'], linewidth=2, c='red', zorder=2)
        _ = ax.scatter(example['Milliseconds'], example['Voltage'], s=11, c='black', zorder=3)
        _ = ax.set_xticks([])
        _ = ax.set_ylim([8, example['Voltage'].max() + 1])
        _ = ax.yaxis.set_major_formatter(FormatStrFormatter('%.1f'))
        _ = ax.tick_params(axis='both', which='major', labelsize=6, pad=1)
        _ = ax.tick_params(axis='both', which='minor', labelsize=6, pad=1)
        _ = fig.savefig('./image_files/' + str(i).zfill(3) + '.png', transparent=False, dpi=img_dpi)
        _ = plt.close(fig)

## Create Stitched Image
Each of these images can now be stitched together into one "master" image that will be manipulated by the facets tool.

![Stitch Image](./reference/image_array.png)

In [33]:
num_examples = 400

In [34]:
img_arr = np.zeros([int(20*img_dim_inches*img_dpi), int(20*img_dim_inches*img_dpi), 4])

In [35]:
if reRun:
    tile_px = int(img_dim_inches*img_dpi)  # each sprite is tile_px x tile_px pixels
    for i in range(0, num_examples):
        colnum = i % 20   # 20 sprites per atlas row
        rownum = i // 20  # integer division keeps the row index an int on Python 3
        # Read the saved plot as an RGBA array of shape (tile_px, tile_px, 4)
        tmp_img = np.array(Image.open('./image_files/' + str(i).zfill(3) + '.png').getdata()).reshape(tile_px, tile_px, 4)
        # Copy the sprite into its row-major slot in the atlas
        img_arr[rownum*tile_px:(rownum + 1)*tile_px,
                colnum*tile_px:(colnum + 1)*tile_px,
                :] = tmp_img

In [36]:
if reRun:
    imgout = Image.fromarray(img_arr.astype('uint8'))

In [37]:
if reRun:
    imgout.save('atlas_transparent.png', 'PNG')
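
The row-major layout used for the stitched image can be illustrated on tiny arrays. The sketch below is not part of the notebook: `stitch_atlas` is a hypothetical helper, and the 2x2 tiles stand in for the saved plots. It shows how a sprite index maps to a (row, column) slot via integer division and remainder.

```python
import numpy as np

def stitch_atlas(tiles, tiles_per_row):
    """Place equally sized tiles into one atlas in row-major order."""
    tile_h, tile_w, channels = tiles[0].shape
    n_rows = -(-len(tiles) // tiles_per_row)  # ceiling division
    atlas = np.zeros((n_rows * tile_h, tiles_per_row * tile_w, channels),
                     dtype=np.uint8)
    for i, tile in enumerate(tiles):
        # divmod gives the integer row and column for sprite i
        row, col = divmod(i, tiles_per_row)
        atlas[row * tile_h:(row + 1) * tile_h,
              col * tile_w:(col + 1) * tile_w, :] = tile
    return atlas

# Five 2x2 RGBA tiles, each filled with its index + 1 so slots are easy to spot.
tiles = [np.full((2, 2, 4), i + 1, dtype=np.uint8) for i in range(5)]
atlas = stitch_atlas(tiles, tiles_per_row=3)
print(atlas.shape)
```

Any unused slots in the last row are left as zeros, which for RGBA data means fully transparent black.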

## Create Facets ID Metadata
The Facets JSON object holds all of the metadata for each of the events. We construct it in the same order that we created the image so facets can keep track of each image location.

![Stitch Image](./reference/image_array2.png)

In [39]:
reRun = True

In [40]:
if reRun:
    id_features.reset_index(inplace=True, drop=True)
    id_df = pd.DataFrame(id_array, columns=['Id'])
    id_df = pd.concat([id_df, id_features], axis=1)
    id_json = id_df.to_json(orient='records')
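
To make the ordering point concrete, here is a small sketch of what `to_json(orient='records')` produces. The frame `toy_ids` and its values are invented for illustration; the assumption, based on the description above, is that record `i` in the JSON corresponds to sprite `i` in the atlas.

```python
import json
import pandas as pd

# Hypothetical three-event metadata table; row i describes sprite i.
toy_ids = pd.DataFrame({
    'Id': ['evt_a', 'evt_b', 'evt_c'],
    'MaxVoltage': [14.25, 13.9, 14.1],
})

# orient='records' emits a JSON list with one object per row, in row order.
records = json.loads(toy_ids.to_json(orient='records'))

for sprite_index, record in enumerate(records):
    print(sprite_index, record['Id'], record['MaxVoltage'])
```

Because the list preserves row order, reordering the DataFrame without regenerating the atlas would pair events with the wrong images.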

## Construct Facets HTML
We're now ready to visualize!

In [41]:
if reRun:
    fc = dive.Facets()
    results = fc.create_classes(labels=['Valid', 'Invalid'])
    fc.define_atlas(id_df, atlas_height=1000, sprite_width=int(img_dim_inches*img_dpi), sprite_height=int(img_dim_inches*img_dpi), atlas_url='atlas_transparent.png')
    fc.render_html('CrankingVoltage.html')

In [None]:
fc.create_labeled_variables('results')

## Train a Random Forest Model

In [13]:
features = df.copy()
features = features.sort_values(by=['EventID'])
features = features.drop(columns=['Voltage', 'Milliseconds'])
features = features.drop_duplicates()
features = features.reset_index(drop=True)
features = features.drop(columns=['EventID'])

labels = pd.read_csv('CrankingLabels.csv')
labels = labels.drop_duplicates()
labels = labels.sort_values(by=['EventID'])
labels = labels.reset_index(drop=True)
labels = labels.drop(columns=['EventID'])
labels = labels.values
labels = labels.reshape(labels.shape[0])

features_m = features.values
xtrain, xtest, ytrain, ytest = train_test_split(features_m, labels)

In [None]:
rf = RandomForestClassifier(n_estimators=1500, warm_start=False)
rffit = rf.fit(xtrain, ytrain)

In [None]:
predict = rffit.predict(xtest)

In [None]:
report = metrics.classification_report(ytest, predict)

In [None]:
print(report)

             precision    recall  f1-score   support

          0       0.99      1.00      0.99        81
          1       1.00      0.99      1.00       149

avg / total       1.00      1.00      1.00       230



## Evaluate Efficacy
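These scores seem suspiciously high! Is the model really this accurate, or did we mislabel some events? To check, we add the labels to the facets metadata so we can inspect each labeled event visually.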

In [None]:
id_features.reset_index(inplace=True, drop=True)
id_df = pd.DataFrame(id_array, columns=['Id'])
id_df = pd.concat([id_df, id_features, pd.DataFrame(labels[0:400], columns=['Valid'])], axis=1)
id_json = id_df.to_json(orient='records')

In [None]:
fc = dive.Facets()
results = fc.create_classes(labels=['Valid', 'Invalid'])
fc.define_atlas(id_df, atlas_height=1000, sprite_width=int(img_dim_inches*img_dpi), sprite_height=int(img_dim_inches*img_dpi), atlas_url='atlas_transparent.png')
fc.render_html('CrankingVoltage.html')
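
Alongside the visual check, a numeric complement is to tabulate predictions against labels and list the rows where they disagree, so those events can be looked up in the facets view. This is only a sketch: the `actual` and `predicted` arrays below are made up and are not the notebook's `ytest` and `predict`.

```python
import numpy as np
import pandas as pd

# Made-up labels and predictions, purely for illustration.
actual = np.array([0, 0, 1, 1, 1, 0])
predicted = np.array([0, 1, 1, 1, 0, 0])

# Rows are the true class, columns are the predicted class.
confusion = pd.crosstab(
    pd.Series(actual, name='actual'),
    pd.Series(predicted, name='predicted'),
)

# Positions where the prediction disagrees with the label.
misclassified = np.flatnonzero(actual != predicted)

print(confusion)
print(misclassified)
```

The off-diagonal cells count the disagreements, and `misclassified` gives their positions for closer inspection.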