# TSML (Time Series Machine Learning)
- **Speaker:  Paulito Palmes**
- **IBM Dublin Research Lab**
- July 23, 2019

## Motivations
- innovations across industry sectors have brought increased automation
- automation requires the installation of sensor networks
- main challenges:
  - collect large volume of data, detect anomalies, monitor status
  - discover patterns to reduce downtimes and manufacturing errors
  - reduce energy usage
  - predict faults/failures
  - effective maintenance schedules

_TSML leverages AI and ML libraries from ScikitLearn, Caret, and Julia as building blocks for processing large volumes of industrial time series data._

## Typical TSML Workflow

## First, let's create artificial data with missing values

In [None]:
using DataFrames
using Dates
using Random
ENV["COLUMNS"]=1000 # for dataframe column size

function generateXY()
    Random.seed!(123)
    gdate = DateTime(2014,1,1):Dates.Minute(15):DateTime(2014,1,5)
    gval = Array{Union{Missing,Float64}}(rand(length(gdate)))
    gmissing = floor(0.30*length(gdate)) |> Integer
    gndxmissing = Random.shuffle(1:length(gdate))[1:gmissing]
    X = DataFrame(Date=gdate,Value=gval)
    X.Value[gndxmissing] .= missing
    Y = rand(length(gdate))
    (X,Y)
end;
(df,outY)=generateXY(); first(df,10)

## Let's load the TSML modules and filters to process data

In [None]:
using TSML
using TSML.Utils
using TSML.TSMLTypes
using TSML: CSVDateValReader, CSVDateValWriter, Statifier
using TSML: Monotonicer, Outliernicer, Plotter

## Let's use Pipeline with Plotter filter to plot artificial data

In [None]:
pltr=Plotter(Dict(:interactive => true))

mypipeline = Pipeline(Dict(
  :transformers => [pltr]
 )
)

fit!(mypipeline, df)
transform!(mypipeline, df)  

## Let's get the statistics/data quality including blocks of missing data

In [None]:
statfier = Statifier(Dict(:processmissing=>true))

mypipeline = Pipeline(Dict(
  :transformers => [statfier]
 )
)

fit!(mypipeline, df)
res = transform!(mypipeline, df)
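
To make "blocks of missing data" concrete, here is a minimal standalone sketch that measures runs of consecutive missing values in a plain vector. The helper `missing_block_lengths` is a hypothetical name used for illustration only; it is not part of TSML and does not claim to reproduce how `Statifier` computes its metrics.

```julia
# Lengths of each run of consecutive missing values (illustrative helper,
# not a TSML function).
function missing_block_lengths(v::AbstractVector)
    lengths = Int[]
    current = 0
    for x in v
        if ismissing(x)
            current += 1
        elseif current > 0
            push!(lengths, current)   # a run just ended
            current = 0
        end
    end
    current > 0 && push!(lengths, current)   # run that reaches the end
    return lengths
end

vals = [1.0, missing, missing, 4.0, missing, 6.0, missing, missing, missing]
blocks = missing_block_lengths(vals)

println("missing count: ", count(ismissing, vals))
println("block lengths: ", blocks)
println("largest block: ", maximum(blocks))
```

Summaries such as the number of blocks and the largest block size help decide whether imputation is reasonable or whether the gaps are too long to fill.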

## Let's extend the Pipeline workflow with aggregation and plotting

In [None]:
valgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))

mypipeline = Pipeline(Dict(
  :transformers => [valgator,pltr]
 )
)

fit!(mypipeline, df)
transform!(mypipeline, df)
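
The idea behind aggregating by a date interval can be sketched with the standard library alone. This is only an illustration under stated assumptions: binning with `floor` and summarizing with the median are choices made for this sketch, and `aggregate_hourly` is a hypothetical helper, not the `DateValgator` implementation.

```julia
using Dates
using Statistics

# Group (timestamp, value) pairs into hourly bins and take the median of the
# observed values in each bin; a bin with no observed values stays missing.
function aggregate_hourly(dates::Vector{DateTime}, vals::AbstractVector)
    bins = Dict{DateTime,Vector{Float64}}()
    for (d, v) in zip(dates, vals)
        slot = get!(bins, floor(d, Dates.Hour(1)), Float64[])
        ismissing(v) || push!(slot, v)
    end
    hours = sort!(collect(keys(bins)))
    agg = Union{Missing,Float64}[isempty(bins[h]) ? missing : median(bins[h]) for h in hours]
    return hours, agg
end

# Two hours of 15-minute readings; the second hour has no observed values.
dates = collect(DateTime(2014,1,1,0,0):Dates.Minute(15):DateTime(2014,1,1,1,45))
vals = Union{Missing,Float64}[1.0, 2.0, missing, 4.0, missing, missing, missing, missing]

hours, agg = aggregate_hourly(dates, vals)
for (h, a) in zip(hours, agg)
    println(h, " => ", a)
end
```

Aggregation reduces the number of missing entries only when a bin contains at least one observed value, which is why fully empty bins remain missing after this step.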

## Let's now try real data

In [None]:
fname = joinpath(dirname(pathof(TSML)),"../data/testdata.csv")
csvreader = CSVDateValReader(Dict(:filename=>fname,:dateformat=>"dd/mm/yyyy HH:MM"))

outputname = joinpath(tempdir(),"testdata_output.csv") # write the output to the system temp directory
csvwriter = CSVDateValWriter(Dict(:filename=>outputname))

valgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))
valputer = DateValNNer(Dict(:dateinterval=>Dates.Hour(1)))
stfier = Statifier(Dict(:processmissing=>true))
outliernicer = Outliernicer(Dict(:dateinterval=>Dates.Hour(1)));

## Let's plot the real data and check for missing values

In [None]:
mpipeline1 = Pipeline(Dict(
  :transformers => [csvreader,valgator,pltr]
 )
)

fit!(mpipeline1)
transform!(mpipeline1)

## Let's get the statistics to assess data quality

In [None]:
mpipeline1 = Pipeline(Dict(
  :transformers => [csvreader,valgator,stfier]
 )
)

fit!(mpipeline1)
respipe1 = transform!(mpipeline1)

## Let's try imputing and verify the statistical features

In [None]:
mpipeline2 = Pipeline(Dict(
  :transformers => [csvreader,valgator,valputer,stfier]
 )
)

fit!(mpipeline2)
respipe2 = transform!(mpipeline2)
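
As a rough illustration of nearest-neighbor imputation, the sketch below fills each missing value with the median of the observed values in a symmetric window around it, widening the window until at least one observed value is found. The function `impute_nearest`, the window parameter `k`, and the widening rule are assumptions made for this example; `DateValNNer` may use different rules.

```julia
using Statistics

# Replace each missing value with the median of the observed values found
# within `k` positions on either side; widen the window when it is empty.
function impute_nearest(v::AbstractVector, k::Int=1)
    out = Vector{Union{Missing,Float64}}(v)
    n = length(v)
    for i in 1:n
        ismissing(v[i]) || continue
        w = k
        while w < n
            lo, hi = max(1, i - w), min(n, i + w)
            neighbors = collect(skipmissing(v[lo:hi]))
            if !isempty(neighbors)
                out[i] = median(neighbors)
                break
            end
            w += 1
        end
    end
    return out
end

vals = [1.0, missing, 3.0, missing, missing, 6.0]
filled = impute_nearest(vals)
println(filled)
```

Neighbors are always read from the original vector, so previously imputed values do not influence later ones.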

## Let's visualize the imputed data

In [None]:
mpipeline2 = Pipeline(Dict(
  :transformers => [csvreader,valgator,valputer,pltr]
 )
)

fit!(mpipeline2)
transform!(mpipeline2)

## Let's look at examples of monotonic data

In [None]:
regularfile = joinpath(dirname(pathof(TSML)),"../data/typedetection/regular.csv")
monofile = joinpath(dirname(pathof(TSML)),"../data/typedetection/monotonic.csv")
dailymonofile = joinpath(dirname(pathof(TSML)),"../data/typedetection/dailymonotonic.csv")

regularfilecsv = CSVDateValReader(Dict(:filename=>regularfile,:dateformat=>"dd/mm/yyyy HH:MM"))
monofilecsv = CSVDateValReader(Dict(:filename=>monofile,:dateformat=>"dd/mm/yyyy HH:MM"))
dailymonofilecsv = CSVDateValReader(Dict(:filename=>dailymonofile,:dateformat=>"dd/mm/yyyy HH:MM"))

valgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))
valputer = DateValNNer(Dict(:dateinterval=>Dates.Hour(1)))
stfier = Statifier(Dict(:processmissing=>true))
mononicer = Monotonicer(Dict())
outliernicer = Outliernicer(Dict(:dateinterval=>Dates.Hour(1)));

## Let's plot an example of monotonic data

In [None]:
monopipeline = Pipeline(Dict(
  :transformers => [monofilecsv,valgator,valputer,pltr]
 )
)

fit!(monopipeline)
transform!(monopipeline)

## Let's plot after normalizing the monotonic data

In [None]:
monopipeline = Pipeline(Dict(
  :transformers => [monofilecsv,valgator,valputer,mononicer, pltr]
 )
)

fit!(monopipeline)
transform!(monopipeline)
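
One common way to normalize monotonic (cumulative) readings is to convert them into per-step increments. The sketch below shows that idea for both a regular counter and one that resets periodically. The helper `to_increments` and its reset rule are assumptions for illustration; the exact rule applied by `Monotonicer` may differ.

```julia
# Convert cumulative readings into per-step increments.
# A negative difference is treated as a counter reset, so the new reading
# itself is taken as the increment since the reset. The first increment is
# set to zero because there is no earlier reading to compare with.
function to_increments(v::AbstractVector{<:Real})
    out = zeros(Float64, length(v))
    for i in 2:length(v)
        d = v[i] - v[i-1]
        out[i] = d >= 0 ? d : v[i]
    end
    return out
end

regular = [10.0, 12.0, 15.0, 15.0, 21.0]   # always non-decreasing
daily   = [5.0, 9.0, 14.0, 2.0, 6.0]       # resets after the third reading

println(to_increments(regular))
println(to_increments(daily))
```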

## Let's remove outliers and plot the result

In [None]:
monopipeline = Pipeline(Dict(
  :transformers => [monofilecsv,valgator,valputer,mononicer,outliernicer,pltr]
 )
)

fit!(monopipeline)
transform!(monopipeline)
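
A simple way to flag outliers is the interquartile-range rule: values outside `[q25 - 1.5*iqr, q75 + 1.5*iqr]` are marked as missing so that an imputation step can fill them afterwards. The sketch below is a standalone illustration of that rule; `mask_outliers` is a hypothetical helper, and the thresholds are an assumption about a typical setup rather than a statement of what `Outliernicer` does internally.

```julia
using Statistics

# Mark values outside the Tukey fences as missing.
function mask_outliers(v::AbstractVector{<:Real})
    q25, q75 = quantile(v, [0.25, 0.75])
    iqr = q75 - q25
    lo, hi = q25 - 1.5 * iqr, q75 + 1.5 * iqr
    return Union{Missing,Float64}[(lo <= x <= hi) ? x : missing for x in v]
end

vals = [10.0, 11.0, 9.0, 10.5, 95.0, 10.2, 9.8]
masked = mask_outliers(vals)
println(masked)
```

Masking instead of deleting keeps the series aligned with its timestamps, so the result can be passed to an imputation step without re-indexing.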


## Let's plot an example of daily monotonic data

In [None]:
dailymonopipeline = Pipeline(Dict(
  :transformers => [dailymonofilecsv,valgator,valputer,pltr]
 )
)

fit!(dailymonopipeline)
transform!(dailymonopipeline)

## Let's normalize and plot

In [None]:
dailymonopipeline = Pipeline(Dict(
  :transformers => [dailymonofilecsv,valgator,valputer,mononicer,pltr]
 )
)
fit!(dailymonopipeline)
transform!(dailymonopipeline)

## Let's add the Outliernicer filter and plot

In [None]:
dailymonopipeline = Pipeline(Dict(
  :transformers => [dailymonofilecsv,valgator,valputer,mononicer,outliernicer,pltr]
 )
)
fit!(dailymonopipeline)
transform!(dailymonopipeline)

## Let's use what we have learned so far to perform automatic data type classification

In [None]:
using TSML: TSClassifier
Random.seed!(12)

trdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/training")
tstdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/testing")
modeldirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/model")

tscl = TSClassifier(Dict(:trdirectory=>trdirname,
           :tstdirectory=>tstdirname,
           :modeldirectory=>modeldirname,
           :feature_range => 6:20,
           :num_trees=>50)
)

fit!(tscl)
dfresults = transform!(tscl);
apredict = dfresults.predtype
fnames = dfresults.fname
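# dtype: letters, spaces, underscores, hyphens; number: optional trailing digits; ext: file extension
myregex = r"(?<dtype>[A-Za-z _-]+)(?<number>\d*)\.(?<ext>\w+)"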
mtypes=map(fnames) do fname
  mymatch=match(myregex,fname)
  mymatch[:dtype]
end

sum(mtypes .== apredict)/length(mtypes) * 100 |> x-> round(x,digits=2)
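
The accuracy computation relies on each file name encoding its true data type as a prefix. The sketch below replays that step on made-up inputs: the file names and predictions are hypothetical and exist only to show how the named capture group and the percentage are obtained.

```julia
# Hypothetical file names and predictions, for illustration only.
fnames    = ["AirOffTemp1.csv", "AirOffTemp2.csv", "Energy7.csv", "Pressure3.csv"]
predicted = ["AirOffTemp", "Energy", "Energy", "Pressure"]

# dtype = leading letters, number = optional digits, ext = file extension.
pattern = r"^(?<dtype>[A-Za-z _-]+)(?<number>\d*)\.(?<ext>\w+)$"

actual = map(fnames) do fname
    m = match(pattern, fname)
    m === nothing ? "" : String(m[:dtype])   # empty label when the name does not match
end

accuracy = round(100 * count(actual .== predicted) / length(actual), digits=2)
println("actual:   ", actual)
println("accuracy: ", accuracy, "%")
```

Guarding against `nothing` avoids an error when a file name does not follow the expected naming scheme.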

## TSML features
- TS data type clustering/classification for automatic data discovery
- TS aggregation based on date/time interval
- TS imputation based on symmetric Nearest Neighbors
- TS statistical metrics for data quality assessment
- TS ML wrapper with more than 100 libraries from Caret, ScikitLearn, and Julia
- TS date/value matrix conversion of 1-D TS using sliding windows for ML input
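
The sliding-window conversion mentioned in the last item can be sketched as follows: each row of the matrix holds a fixed number of consecutive values, and the target is the value that follows them. The helper `sliding_windows` and the one-step-ahead target are assumptions for this illustration, not TSML's own API.

```julia
# Turn a 1-D series into (X, y): each row of X holds `window` consecutive
# values and y holds the value that immediately follows them.
function sliding_windows(v::AbstractVector{<:Real}, window::Int)
    nrows = length(v) - window
    nrows > 0 || throw(ArgumentError("series must be longer than the window"))
    X = Matrix{Float64}(undef, nrows, window)
    y = Vector{Float64}(undef, nrows)
    for i in 1:nrows
        X[i, :] = v[i:i+window-1]
        y[i] = v[i+window]
    end
    return X, y
end

X, y = sliding_windows(1:6, 3)
display(X)
println(y)
```

The resulting `X` and `y` have the shape most supervised learners expect, which is what makes a 1-D series usable as ML input.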

## More TSML features
- Common API wrappers for ML libs from JuliaML, PyCall, and RCall
- Pipeline API allows high-level description of the processing workflow
- Specific cleaning/normalization workflow based on data type
- Automatic selection of optimised ML model
- Automatic segmentation of time-series data into matrix form for ML training and prediction
- Easily extensible architecture by using just two main interfaces: fit and transform
- Meta-ensembles for robust prediction
- Support for distributed computation for scalability and speed
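
To illustrate why two interfaces are enough for extensibility, here is a self-contained toy version of the pattern. Everything in it is hypothetical: the module `MiniFlow`, the supertype `Step`, and the two filters are invented for this sketch and are kept inside their own module so they do not clash with TSML's `fit!` and `transform!`.

```julia
module MiniFlow

using Statistics

abstract type Step end   # hypothetical stand-in for a filter supertype

# A filter that learns the mean during fit! and subtracts it during transform!.
mutable struct Centerer <: Step
    mu::Float64
    Centerer() = new(0.0)
end
fit!(s::Centerer, x::Vector{Float64}) = (s.mu = mean(x); s)
transform!(s::Centerer, x::Vector{Float64}) = x .- s.mu

# A filter with nothing to learn: fit! is a no-op.
struct Scaler <: Step
    factor::Float64
end
fit!(s::Scaler, x::Vector{Float64}) = s
transform!(s::Scaler, x::Vector{Float64}) = x .* s.factor

# Fit and apply each step in order, feeding each output to the next step.
function run_steps(steps::Vector{<:Step}, x::Vector{Float64})
    for s in steps
        fit!(s, x)
        x = transform!(s, x)
    end
    return x
end

end # module

result = MiniFlow.run_steps(MiniFlow.Step[MiniFlow.Centerer(), MiniFlow.Scaler(2.0)],
                            [1.0, 2.0, 3.0])
println(result)
```

Adding a new processing step in this design means defining one type and two methods; the loop that chains the steps never changes.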