In [1]:
#Notebook dependencies (uncomment if you need to install packages)
#install.packages('feather')
#install.packages('ckanr')
#install.packages('plotly')
#install.packages('tictoc')
#install.packages('fasttime')

library(feather) #for fast data writes and reads
library(ckanr) #for accessing data from ckan data portals
library(plotly) #for beautiful visualisations
library(tictoc) #for measuring how long your functions take to execute
library(dplyr) #for data manipulation (filter, mutate, summarise, ...)
library(fasttime) #for fast parsing of datetime strings

Loading required package: ggplot2

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout


Attaching package: ‘dplyr’

The following object is masked from ‘package:ckanr’:

    changes

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## Get Data

### Read from file

In [6]:
wd <- getwd()
filepath <- paste(wd,'power_sample_data.csv',sep='/')
tic('Reading file from csv is so sloooow....')
data <- read.csv(filepath, stringsAsFactors = FALSE )
toc()

Reading file from csv is so sloooow....: 87.311 sec elapsed


Feather reads and writes much faster than CSV. Let's save the data as a feather file so that it loads quickly the next time we need it.

#### Convert to feather format

In [7]:
write_feather(data, 'power_sample_data.feather')

In [8]:
tic('Reading file from feather is much faster!')
data <- read_feather(paste(wd,'power_sample_data.feather',sep='/'))
toc()

Reading file from feather is much faster!: 0.016 sec elapsed
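
The convert-once, read-fast-afterwards pattern above can be wrapped in a small helper. This is only a sketch: `read_cached` is a hypothetical name, not part of feather, and it assumes the feather cache should sit next to the CSV with the same base name. It is demonstrated on a tiny temporary file rather than the real dataset.

```r
library(feather)

# Hypothetical helper: read the feather cache if it exists, otherwise parse
# the CSV once and write the cache for next time.
read_cached <- function(csv_path,
                        feather_path = sub('\\.csv$', '.feather', csv_path)) {
  if (file.exists(feather_path)) {
    return(read_feather(feather_path))
  }
  df <- read.csv(csv_path, stringsAsFactors = FALSE)
  write_feather(df, feather_path)
  df
}

# Demo on a small temporary CSV (made-up rows in the shape of the real data)
tmp_csv <- tempfile(fileext = '.csv')
write.csv(data.frame(RecorderID = c('GNK001', 'GNK001'),
                     Unitsread = c(0.178, 0.192)),
          tmp_csv, row.names = FALSE)

first_read <- read_cached(tmp_csv)   # parses the CSV and writes the cache
second_read <- read_cached(tmp_csv)  # served from the feather file
```

Note that the first call returns a plain data.frame while later calls return the tibble produced by `read_feather`, so downstream code should not depend on the exact class.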


Now let's take a look at what we have in our dataset.

In [34]:
str(data)
object.size(data)
data[1:5,]

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	5899533 obs. of  5 variables:
 $ RecorderID: chr  "GNK001" "GNK001" "GNK001" "GNK001" ...
 $ ProfileID : int  12001461 12001461 12001461 12001461 12001461 12001461 12001461 12001461 12001461 12001461 ...
 $ Datefield : chr  "2010-09-01 00:00:00" "2010-09-01 00:30:00" "2010-09-01 01:00:00" "2010-09-01 01:30:00" ...
 $ Unitsread : num  0.178 0.192 0.128 0.198 0.143 ...
 $ Valid     : num  1 1 1 1 1 1 1 1 1 1 ...


213766320 bytes

RecorderID,ProfileID,Datefield,Unitsread,Valid
GNK001,12001461,2010-09-01 00:00:00,0.1783333,1
GNK001,12001461,2010-09-01 00:30:00,0.1916667,1
GNK001,12001461,2010-09-01 01:00:00,0.1283333,1
GNK001,12001461,2010-09-01 01:30:00,0.1983333,1
GNK001,12001461,2010-09-01 02:00:00,0.1433333,1


#### Convert data format

Some of the column types need to be fixed. We'd like the following types:

* RecorderID - factor
* ProfileID - factor
* Datefield - datetime
* Unitsread - numeric (no change)
* Valid - numeric (no change)

In [36]:
data[c('RecorderID','ProfileID')] <- lapply(data[c('RecorderID','ProfileID')], factor)
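# fastPOSIXct always parses the text as UTC; the tz argument only sets the display zone.
# 'GMT+2' is read POSIX-style (2 hours behind UTC), hence the shifted times in the output.
data$Datefield <- fastPOSIXct(data$Datefield, 'GMT+2')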

object.size(data)
str(data)

188830168 bytes

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	5899533 obs. of  5 variables:
 $ RecorderID: Factor w/ 172 levels "GNK001","GNK002",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ProfileID : Factor w/ 516 levels "12000772","12000776",..: 160 160 160 160 160 160 160 160 160 160 ...
 $ Datefield : POSIXct, format: "2010-08-31 22:00:00" "2010-08-31 22:30:00" ...
 $ Unitsread : num  0.178 0.192 0.128 0.198 0.143 ...
 $ Valid     : num  1 1 1 1 1 1 1 1 1 1 ...
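
The two-hour shift in `Datefield` (00:00 became 22:00 on the previous day) follows from how the time zone is interpreted. The sketch below reproduces it with base R only. It uses the Olson names `Etc/GMT+2` and `Etc/GMT-2`, which follow the same inverted-sign convention as the POSIX-style string `GMT+2`. Treating the readings as South African local time (`Africa/Johannesburg`) is an assumption about the data, not something the source states.

```r
# The sample timestamp is the first Datefield value in the dataset
stamp <- '2010-09-01 00:00:00'
fmt <- '%Y-%m-%d %H:%M:%S'

# As in the conversion cell: the text is parsed as UTC ...
parsed_utc <- as.POSIXct(stamp, tz = 'UTC')

# ... and then displayed in a zone two hours behind UTC
behind <- format(parsed_utc, fmt, tz = 'Etc/GMT+2')

# 'Etc/GMT-2' is the zone two hours ahead of UTC
ahead <- format(parsed_utc, fmt, tz = 'Etc/GMT-2')

# Assumption: if the readings are local South African time, parsing directly
# in that zone keeps the wall-clock values unchanged
parsed_local <- as.POSIXct(stamp, tz = 'Africa/Johannesburg')
local_shown <- format(parsed_local, fmt)

print(c(behind = behind, ahead = ahead, local = local_shown))
```

Here `behind` is "2010-08-31 22:00:00", matching the notebook output, while `local_shown` keeps the original "2010-09-01 00:00:00". Which of the two is correct depends on the zone the recorders actually logged in.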


### Read from CKAN data portal
[energydata.uct.ac.za](http://energydata.uct.ac.za)

ckanr on GitHub: https://github.com/ropensci/ckanr

ckanr documentation: https://cran.r-project.org/web/packages/ckanr/ckanr.pdf

In [28]:
ckanr_setup(url = 'http://energydata.uct.ac.za')

In [53]:
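# Look up the download URL of the resource by its CKAN resource ID
fileurl <- resource_show('649cb8f9-cbef-467e-b585-710b8e20c9ac', as = 'table')$url
# Download into the R session (newer ckanr releases name this function ckan_fetch())
data <- fetch(fileurl, store = 'session')
# Drop the column X (presumably an unnamed row-index column in the CSV)
data$X <- NULL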

No encoding supplied: defaulting to UTF-8.


In [54]:
data[1:5,]

RecorderID,ProfileID,Datefield,Unitsread,Valid
GNK001,12001461,2010-09-01 00:00:00,0.1783333,1
GNK001,12001461,2010-09-01 00:30:00,0.1916667,1
GNK001,12001461,2010-09-01 01:00:00,0.1283333,1
GNK001,12001461,2010-09-01 01:30:00,0.1983333,1
GNK001,12001461,2010-09-01 02:00:00,0.1433333,1
