# JuliaCon 2024
<br/>

## End-to-End AI (E2EAI) with Julia, K0s, and Argo Workflow
<br/>

##### Presenter: Paulito Palmes, IBM Research
#####      Date: July 11, 2024 

# OUTLINE

- ### The Motivations Behind E2EAI (End-to-End AI)
- ### Components of E2EAI
- ### The Julia AI/ML Solution Use-case
- ### The Future

# The Motivations Behind E2EAI

- the current paradigm does not exploit tight integration of IaC and MLOps when deploying AI solutions
- without tight integration of IaC and MLOps, it is:
  - difficult to identify optimal infrastructure 
  - difficult to predict resource viability and feasibility
  - difficult to infer the cost 
  - difficult to identify performance bottlenecks and perform root-cause analysis

# End-to-End AI (E2EAI)
<center><img  src="./yamlcontents.png" width="600"/></center>

- E2EAI is a unified framework tightly integrating MLOps and IaC 
  - a single YAML file describes both the IaC and the MLOps: Infrastructure + ML Pipeline + Lifecycle Management
  - YAML workflow templates require zero to minimal coding
  - a collection of YAML files becomes the input to an LLM for intent-driven E2EAI

# Components of E2EAI
<center><img  src="./e2eai-components.png" width="700"/></center>

- SUNRISE-6G
  - SUstainable federatioN of Research Infrastructures \
    for Scaling-up Experimentation in 6G
  - H2020 EU Project (3 years)

# The Julia AI/ML Solution Use-case

- AutoMLPipeline workflow
- Integrating AutoMLPipeline in E2EAI

### Load ML pipeline preprocessing components and models

In [None]:
using AutoMLPipeline;
import PythonCall; const PYC=PythonCall; warnings = PYC.pyimport("warnings"); warnings.filterwarnings("ignore")

#### Decomposition
pca = skoperator("PCA"); fa  = skoperator("FactorAnalysis"); ica = skoperator("FastICA")
#### Scaler 
rb   = skoperator("RobustScaler"); pt   = skoperator("PowerTransformer"); norm = skoperator("Normalizer")
mx   = skoperator("MinMaxScaler"); std  = skoperator("StandardScaler")
#### categorical preprocessing
ohe = OneHotEncoder()
#### Column selector
catf = CatFeatureSelector(); numf = NumFeatureSelector(); disc = CatNumDiscriminator()
#### Learners
rf = skoperator("RandomForestClassifier"); gb = skoperator("GradientBoostingClassifier"); lsvc = skoperator("LinearSVC")
svc = skoperator("SVC"); mlp = skoperator("MLPClassifier")
ada = skoperator("AdaBoostClassifier"); sgd = skoperator("SGDClassifier")
skrf_reg = skoperator("RandomForestRegressor"); skgb_reg = skoperator("GradientBoostingRegressor")
jrf = RandomForest(); tree = PrunedTree()
vote = VoteEnsemble(); stack = StackEnsemble(); best = BestLearner();

### Prepare dataset for classification

In [None]:
# Make sure that the input features are a DataFrame and the target output is a 1-D vector.
using AutoMLPipeline
profbdata = getprofb()
X = profbdata[:,2:end] 
Y = profbdata[:,1] |> Vector;
head(x)=first(x,10)
head(profbdata)

### Pipeline to transform non-numeric features to one-hot encoding

In [None]:
pohe = catf |> ohe
tr = fit_transform!(pohe,X,Y)
head(tr)
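
To make the effect of `catf |> ohe` more concrete, here is a minimal standalone sketch of one-hot encoding in plain Julia. It is not AutoMLPipeline's implementation: the `onehot` helper and the sample labels are hypothetical and only illustrate that each distinct category becomes its own 0/1 column.

```julia
# One-hot encode a vector of category labels into a Bool matrix
# with one row per observation and one column per distinct label
# (columns follow the sorted label order).
function onehot(labels::AbstractVector)
    levels = sort(unique(labels))
    encoded = [label == level for label in labels, level in levels]
    return levels, encoded
end

levels, encoded = onehot(["red", "green", "red", "blue"])
println(levels)
println(encoded)
```

Every row of `encoded` contains exactly one `true`, located in the column of that row's category.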

### Sample pipeline to transform numerical features with PCA and ICA and concatenate the results

In [None]:
pdec = (numf |> pca) + (numf |> ica)
tr = fit_transform!(pdec,X,Y)
head(tr)
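
The two operators in the pipeline expression can be read as "chain" (`|>`) and "concatenate the outputs" (`+`). The sketch below imitates that reading with ordinary Julia functions. It is an illustration only, under the assumption that each branch maps a matrix to a matrix; `center`, `rescale`, and `concat` are hypothetical names and are not part of AutoMLPipeline.

```julia
using Statistics

# Two toy column-wise transforms standing in for real pipeline operators.
center(X) = X .- mean(X, dims=1)
rescale(X) = X ./ std(X, dims=1)

# Hypothetical stand-in for `+`: run every branch on the same input
# and concatenate the resulting columns side by side.
concat(branches...) = X -> hcat((branch(X) for branch in branches)...)

branch1 = X -> X |> center              # one-step branch
branch2 = X -> X |> center |> rescale   # two-step branch
combined = concat(branch1, branch2)

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]
out = combined(X)   # 3 rows, 2 + 2 = 4 columns
println(size(out))
```

Each branch sees the same input, so the output width is the sum of the widths produced by the individual branches.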

### Another example of a more complex pipeline

In [None]:
ppt = (numf |> rb |> ica) + (numf |> pt |> pca)
tr = fit_transform!(ppt,X,Y)
head(tr)

### Evaluating a complex preprocessing pipeline together with a RandomForest learner

In [None]:
prf = (catf |> ohe) + (numf |> rb |> fa) + (numf |> pt |> pca) |> rf
crossvalidate(prf,X,Y,"accuracy_score")
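
The `crossvalidate` call above evaluates the pipeline on held-out folds and, as used later in this notebook, yields a mean and a standard deviation of the chosen metric. The following is a rough sketch of that general k-fold idea in plain Julia. It is not the library's implementation: `kfolds`, `kfold_accuracy`, and the toy `always_a` learner are hypothetical names introduced only for illustration.

```julia
using Random, Statistics

# Split the row indices 1:n into k shuffled folds of nearly equal size.
function kfolds(n::Integer, k::Integer; rng=Random.default_rng())
    idx = shuffle(rng, 1:n)
    return [idx[i:k:n] for i in 1:k]
end

# Generic k-fold loop. `fitpredict(train, test)` is a hypothetical callback
# that trains on the `train` rows and returns predictions for the `test` rows.
function kfold_accuracy(fitpredict, y::AbstractVector, k::Integer;
                        rng=Random.default_rng())
    scores = map(kfolds(length(y), k; rng=rng)) do test
        train = setdiff(1:length(y), test)
        mean(fitpredict(train, test) .== y[test])
    end
    return (mean=mean(scores), sd=std(scores))
end

y = repeat(["a", "b"], 10)                        # 20 balanced toy labels
always_a(train, test) = fill("a", length(test))   # toy learner, ignores training data
result = kfold_accuracy(always_a, y, 5; rng=MersenneTwister(1))
println(result)
```

A real learner would fit on `train` inside the callback; the toy learner here only exercises the fold bookkeeping.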

### Evaluating a complex preprocessing pipeline together with a Linear SVM learner

In [None]:
plsvc = ((numf |> rb |> pca)+(numf |> rb |> fa)+(numf |> rb |> ica)+(catf |> ohe )) |> lsvc
crossvalidate(plsvc,X,Y,"accuracy_score")

### Parallel search for the best ML pipeline

In [None]:
using Random, DataFrames, Distributed
nprocs() == 1 && addprocs()
@everywhere using DataFrames; @everywhere using AutoMLPipeline
@everywhere begin
    import PythonCall; const PYC=PythonCall; warnings = PYC.pyimport("warnings"); warnings.filterwarnings("ignore")
end
@everywhere begin
  profbdata = getprofb(); X = profbdata[:,2:end]; Y = profbdata[:,1] |> Vector;
end
@everywhere begin
  jrf  = RandomForest(); ohe  = OneHotEncoder(); catf = CatFeatureSelector(); numf = NumFeatureSelector()
  tree = PrunedTree(); ada  = skoperator("AdaBoostClassifier"); disc = CatNumDiscriminator()
  sgd  = skoperator("SGDClassifier"); std  = skoperator("StandardScaler"); lsvc = skoperator("LinearSVC")
end

learners = @sync @distributed (vcat) for learner in [jrf,ada,sgd,lsvc,tree]
   pcmc = disc |> ((catf |> ohe) + (numf |> std)) |> learner
   println(learner.name[1:end-4])
   mean,sd,_ = crossvalidate(pcmc,X,Y,"accuracy_score",3)
   DataFrame(name=learner.name[1:end-4],mean=mean,sd=sd)
end;
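
The search above relies on the `@distributed (vcat) for ...` pattern from Julia's `Distributed` standard library: each iteration returns a small table, and `vcat` stacks the partial results into one. Below is a minimal sketch of the same pattern with no ML dependencies. The `evaluate` function and its scores are hypothetical placeholders for an expensive cross-validation run.

```julia
using Distributed

# Hypothetical stand-in for an expensive evaluation of one candidate.
# When worker processes are added, this definition must be wrapped in
# `@everywhere` so that every worker knows it.
evaluate(n) = (name="candidate-$n", score=1 / n)

# Each iteration yields a one-element vector; `vcat` reduces them into one vector.
results = @distributed (vcat) for n in 1:4
    [evaluate(n)]
end

# Pick the candidate with the highest score.
best = results[argmax([r.score for r in results])]
println(best)
```

With a reducer supplied, `@distributed` waits for all iterations and returns the reduced value, so the result can be used immediately afterwards.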

### Best Pipeline

In [None]:
@show sort!(learners,:mean,rev=true);

# E2EAI Application

## Infrastructure Creation Automation

<center><img src="./k0s.png" width="500"/></center>

<center><img src="./cluster.png" width="800"/></center>

## AI as a Service: Zero Coding Using Workflow Template
<center><img src="./template.png" width="1000"/></center>

## Explicit ML Pipeline

<center><img src="./mlpipeline.png" width="600"/></center>

## Optimal Pipeline Discovery by AutoML
<center><img src="./lowcode.png" width="500"/></center>

## Low vs High Pipeline Complexity
<center><img src="./low-high-comp.png" width="300"/></center>

### Low Complexity Pipeline
<center><img src="./low-comp.png" width="500"/></center>

### High Complexity Pipeline 
<center><img src="./high-comp.png" width="500"/></center>