# Exploratory Data Analysis

Before we start training models and running inference, let's take a brief look at the contents of our data.
The dataset contains actual telemetry from moontracers deployed all around the world. 
Data files have been aggregated and anonymized for your security and convenience.

## Loading and exploring data with Pandas

The pandas library offers tools to effectively load, manipulate and analyse data.  

In [1]:
!pip install --upgrade pip

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.1)


In [2]:
!pip install --upgrade pandas

Requirement already up-to-date: pandas in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.0.3)


In [3]:
import pandas as pd

A pandas DataFrame is a table-like data structure used to store and manipulate data in memory.
Let's create one from a CSV file on disk.

In [4]:
filename = "moontracer-dataset.csv"
data = pd.read_csv(filename, 
                   parse_dates = ["timestamp"], 
                   dtype = {"device_id":"string"}, 
                   index_col = "timestamp")
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4496828 entries, 2020-02-22 23:59:59 to 2020-02-25 22:03:11
Data columns (total 3 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   device_id      string
 1   motor_peak_mA  int64 
 2   battery        int64 
dtypes: int64(2), string(1)
memory usage: 137.2 MB
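To see what each `read_csv` argument contributes, here is a minimal sketch on a hypothetical three-row sample held in memory. The rows and the short device IDs are made up for illustration; only the column names follow the dataset above.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the column layout of the dataset.
sample_csv = io.StringIO(
    "timestamp,device_id,motor_peak_mA,battery\n"
    "2020-02-22 23:59:59,0012,1335,73\n"
    "2020-02-23 00:00:04,0456,1577,72\n"
    "2020-02-23 00:00:09,0012,0,73\n"
)

sample = pd.read_csv(
    sample_csv,
    parse_dates=["timestamp"],      # parse the column into datetime values
    dtype={"device_id": "string"},  # keep IDs as text, preserving leading zeros
    index_col="timestamp",          # use the timestamps as the row index
)

# With a DatetimeIndex, rows can be selected by a partial date string.
feb_23 = sample.loc["2020-02-23"]

print(type(sample.index).__name__)
print(sample["device_id"].tolist())
print(len(feb_23))
```

Without the explicit `dtype`, pandas would likely infer these made-up IDs as integers and drop the leading zeros, which is one reason to declare identifier columns as strings.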


Let's look at the first records to check that the data was loaded correctly:

In [5]:
data.head()

| timestamp | device_id | motor_peak_mA | battery |
| --- | --- | --- | --- |
| 2020-02-22 23:59:59 | 7517a917b42450470661cec1bd4654f8 | 1335 | 73 |
| 2020-02-22 23:59:59 | 8e4a851ed2317a249a0903f29d894361 | 1577 | 73 |
| 2020-02-22 23:59:59 | 572ddf9d82d5675ed2db832081b70103 | 1585 | 73 |
| 2020-02-22 23:59:59 | b17bbc29ce61265a6212c689a597d4d8 | 0 | 73 |
| 2020-02-22 23:59:59 | 19d3c55b134ab7780d2b711211b7cf7c | 1286 | 73 |


And count the number of records for each device:

In [6]:
data.groupby("device_id").count()

| device_id | motor_peak_mA | battery |
| --- | --- | --- |
| 0001495ce5f079703599a94c32dab2b0 | 124 | 124 |
| 00134c004e33e830e5dbce3355a485b9 | 121 | 121 |
| 0019400877c460d9b66298649162179d | 124 | 124 |
| 001e70f66ab7a9d4bd6a5a074f288f0f | 109 | 109 |
| 00211448f7814aea70f2c8d5aebd2aa9 | 116 | 116 |
| ... | ... | ... |
| ffe45f8328b406920ac3267ade0cdc90 | 123 | 123 |
| ffe8ea2425f74fff2f23b5945f706b33 | 96 | 96 |
| fff814f36d802edbcff9b2fff16907fd | 122 | 122 |
| fff844e4ff999c11cc6ffa9b46ab26c1 | 124 | 124 |
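
The `groupby("device_id").count()` call above only reports how many records each device contributed. If summary statistics such as mean, minimum, and maximum are also of interest, `describe()` and `agg()` are one possible approach. The sketch below runs on a hypothetical four-row frame (the values and the short device IDs are invented), not on the real dataset.

```python
import pandas as pd

# Hypothetical readings; only the column names follow the dataset above.
toy = pd.DataFrame(
    {
        "device_id": ["a", "a", "b", "b"],
        "motor_peak_mA": [1335, 1577, 0, 1286],
        "battery": [73, 72, 73, 71],
    }
)

# Count, mean, std, min, quartiles, and max for every numeric column.
summary = toy.describe()

# Per-device aggregates for a single column.
per_device = toy.groupby("device_id")["motor_peak_mA"].agg(["count", "mean", "max"])

print(summary)
print(per_device)
```

On the full dataset the same calls would be applied to `data` instead of `toy`.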


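Save the DataFrame with IPython's `%store` magic so that other notebooks can restore it with `%store -r data`:

In [7]: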
%store data

Stored 'data' (DataFrame)


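S3 bucket names are globally unique, so we append a random suffix to the bucket name and store both values for later use:

In [8]: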
import random
import string 

rand_id = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
%store rand_id
bucket = "mt-ml-workshop-{}".format(rand_id)
%store bucket
bucket

Stored 'rand_id' (str)
Stored 'bucket' (str)


'mt-ml-workshop-wzejasmw'
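
Before creating the bucket, it can help to check the generated name locally. The sketch below assumes a simplified reading of the S3 bucket naming rules (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit). The helper names `is_valid_bucket_name` and `make_bucket_name` are hypothetical and not part of any AWS library.

```python
import random
import re
import string

# Simplified, assumed version of the S3 naming rules: 3 to 63 characters,
# lowercase letters, digits, and hyphens, starting and ending with a letter
# or digit. S3 also permits dots, which this notebook never generates.
BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the whole name matches the simplified rules."""
    return BUCKET_NAME_RE.fullmatch(name) is not None


def make_bucket_name(prefix: str = "mt-ml-workshop", k: int = 8) -> str:
    """Build a name the same way as the cell above: prefix plus random suffix."""
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=k))
    return "{}-{}".format(prefix, suffix)


candidate = make_bucket_name()
print(candidate, is_valid_bucket_name(candidate))
```

A local check like this cannot tell whether the name is already taken by another account; only the `aws s3 mb` call reveals that.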

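Finally, create the bucket, upload the dataset, and list the bucket contents to confirm the upload:

In [9]: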
!aws s3 mb "s3://{bucket}"

make_bucket: mt-ml-workshop-wzejasmw


In [10]:
!aws s3 cp "moontracer-dataset.csv" "s3://{bucket}/"

upload: ./moontracer-dataset.csv to s3://mt-ml-workshop-wzejasmw/moontracer-dataset.csv


In [11]:
!aws s3 ls "s3://{bucket}"

2020-05-13 12:54:23  265405696 moontracer-dataset.csv
