# Exploratory Data Analysis

Before we start training models and running inference, let's take a brief look at the contents of the dataset, which contains telemetry from moontracers deployed all around the world.

Data files have been aggregated and anonymized for your security and convenience.

## Loading and exploring data with Pandas

The pandas library offers tools to load, manipulate, and analyse data efficiently.

In [1]:
!pip install --upgrade pip



In [2]:
!pip install --upgrade pandas



In [3]:
import pandas as pd

A pandas DataFrame is a table-like data structure used to store and manipulate data in memory.
Let's create one from a CSV file on disk.

In [4]:
filename = "moontracer-dataset.csv.gz"
data = pd.read_csv(filename, 
                   parse_dates = ["timestamp"], 
                   dtype = {"device_id":"string"}, 
                   index_col = "timestamp")
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4496828 entries, 2020-02-22 23:59:59 to 2020-02-25 22:03:11
Data columns (total 3 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   device_id      string
 1   motor_peak_mA  int64 
 2   battery        int64 
dtypes: int64(2), string(1)
memory usage: 137.2 MB
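To see what the `read_csv` options used above do, here is a minimal sketch on a tiny in-memory CSV. The three rows and the short device IDs are made up for illustration and are not taken from the workshop file; only the column names match the dataset.

```python
import io

import pandas as pd

# Made-up rows in the same column layout as the workshop file (illustrative only).
sample_csv = io.StringIO(
    "timestamp,device_id,motor_peak_mA,battery\n"
    "2020-02-22 23:59:59,0001,1335,73\n"
    "2020-02-23 00:00:09,0002,0,72\n"
    "2020-02-23 00:00:19,0001,1286,73\n"
)

sample = pd.read_csv(
    sample_csv,
    parse_dates=["timestamp"],      # parse the column as datetimes, not text
    dtype={"device_id": "string"},  # keep IDs as text so leading zeros survive
    index_col="timestamp",          # use the parsed timestamps as the row index
)

print(sample.index.dtype)
print(sample["device_id"].tolist())
```

Without the `dtype` hint, an all-digit ID such as `0001` would be read as the integer `1`. Indexing by timestamp gives a `DatetimeIndex`, which is what time-based selection and resampling operate on.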


Let's look at the first records to check that the data was loaded correctly:

In [5]:
data.head()

timestamp,device_id,motor_peak_mA,battery
2020-02-22 23:59:59,7517a917b42450470661cec1bd4654f8,1335,73
2020-02-22 23:59:59,8e4a851ed2317a249a0903f29d894361,1577,73
2020-02-22 23:59:59,572ddf9d82d5675ed2db832081b70103,1585,73
2020-02-22 23:59:59,b17bbc29ce61265a6212c689a597d4d8,0,73
2020-02-22 23:59:59,19d3c55b134ab7780d2b711211b7cf7c,1286,73


And basic statistics for each column:

In [6]:
data.describe()

,motor_peak_mA,battery
count,4496828.0,4496828.0
mean,359.6213,70.78824
std,617.8175,5.374838
min,0.0,0.0
25%,0.0,67.0
50%,10.0,71.0
75%,527.0,74.0
max,7730.0,100.0
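The statistics above show a 25th percentile of 0 and a median of 10 for `motor_peak_mA`, so a large share of readings report no motor activity. One way to quantify that is sketched below on a small made-up frame; the values and the names `idle_share` and `active_median` are illustrative assumptions, not part of the workshop code.

```python
import pandas as pd

# Made-up readings standing in for the motor_peak_mA column (illustrative only).
toy = pd.DataFrame({"motor_peak_mA": [0, 0, 10, 527, 1335, 0, 7730, 12]})

# The mean of a boolean mask is the fraction of rows where the mask is True.
idle_share = (toy["motor_peak_mA"] == 0).mean()

# Median over the non-zero readings only, to describe the active periods.
active_median = toy.loc[toy["motor_peak_mA"] > 0, "motor_peak_mA"].median()

print(f"idle share: {idle_share:.1%}, median when active: {active_median} mA")
```

The same two expressions should work unchanged on the full `data` frame.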


Let's take a look at the number of records per device:

In [7]:
data.groupby("device_id").count()

device_id,motor_peak_mA,battery
0001495ce5f079703599a94c32dab2b0,124,124
00134c004e33e830e5dbce3355a485b9,121,121
0019400877c460d9b66298649162179d,124,124
001e70f66ab7a9d4bd6a5a074f288f0f,109,109
00211448f7814aea70f2c8d5aebd2aa9,116,116
...,...,...
ffe45f8328b406920ac3267ade0cdc90,123,123
ffe8ea2425f74fff2f23b5945f706b33,96,96
fff814f36d802edbcff9b2fff16907fd,122,122
fff844e4ff999c11cc6ffa9b46ab26c1,124,124
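`count()` reports the number of non-null values in each column, which is why both columns show the same figure here. If a single number per device is enough, `size()` counts rows directly and can be summarised with `describe()`. The sketch below uses a made-up frame; the device IDs `a`, `b`, and `c` are placeholders.

```python
import pandas as pd

# Made-up telemetry for three placeholder devices (illustrative only).
toy = pd.DataFrame(
    {
        "device_id": ["a", "a", "a", "b", "b", "c"],
        "motor_peak_mA": [10, 0, 5, 7, 0, 3],
        "battery": [70, 71, 72, 60, 61, 80],
    }
)

# One row count per device, returned as a Series indexed by device_id.
records_per_device = toy.groupby("device_id").size()

# Distribution of those counts: how evenly are devices represented?
summary = records_per_device.describe()

print(records_per_device)
print(summary[["min", "50%", "max"]])
```

Devices with far fewer records than the rest may need to be filtered out or handled separately before training.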


In [8]:
%store data

Stored 'data' (DataFrame)


# Amazon Web Services

In this workshop we'll use AWS services to store, process, and visualize data. Let's make sure we can use the AWS command line interface by creating an S3 bucket and uploading our dataset to it.

In [9]:
import random
import string 

rand_id = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
%store rand_id
bucket = "mt-ml-workshop-{}".format(rand_id)
%store bucket
bucket

Stored 'rand_id' (str)
Stored 'bucket' (str)


'mt-ml-workshop-gohvk4kt'
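The random suffix is there because S3 bucket names must be unique across accounts, so a fixed name would collide between workshop participants. As a sketch, the generated name can also be checked against the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit) before calling the CLI. The helper names `make_bucket_name` and `is_valid_bucket_name` are hypothetical, and the check covers only these basic rules.

```python
import random
import re
import string

# Basic S3 naming rules only: 3-63 characters, lowercase letters, digits,
# dots and hyphens, beginning and ending with a letter or digit.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix="mt-ml-workshop-", suffix_len=8):
    """Hypothetical helper: append a random lowercase alphanumeric suffix."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(alphabet, k=suffix_len))
    return prefix + suffix


def is_valid_bucket_name(name):
    """Hypothetical helper: check a name against the basic rules above."""
    return BUCKET_NAME_RE.match(name) is not None


candidate = make_bucket_name()
print(candidate, is_valid_bucket_name(candidate))
```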

In [10]:
!aws s3 mb "s3://{bucket}"

make_bucket: mt-ml-workshop-gohvk4kt


In [11]:
!aws s3 cp "moontracer-dataset.csv.gz" "s3://{bucket}/"

upload: ./moontracer-dataset.csv.gz to s3://mt-ml-workshop-gohvk4kt/moontracer-dataset.csv.gz


In [12]:
!aws s3 ls "s3://{bucket}"

2021-05-10 17:02:53   58662143 moontracer-dataset.csv.gz


# Battery Forecasting

Now that we understand the dataset, let's use Amazon SageMaker and the DeepAR algorithm to [prevent battery outages](mt-battery-deepar.ipynb).