This repository has been archived by the owner on Apr 19, 2019. It is now read-only.

Machine Learning Roadmap #11

Closed
lindahua opened this issue Feb 8, 2014 · 138 comments

Comments

@lindahua

lindahua commented Feb 8, 2014

Currently, the development of machine learning tools is spread across several different packages with little coordination. Consequently, some efforts are duplicated, while some important aspects remain lacking.

Hopefully, we may coordinate our efforts through this issue. Below, I try to outline a tentative roadmap:

  • Generalized Linear Models

    • Linear Regression
    • Logistic Regression
    • Lasso, Elastic Net, and their variants
    • Stochastic Gradient Descent

    Current efforts: GLMNet, GLM, Regression

  • Support Vector Machines

    Current efforts: SVM, LIBSVM

  • DimensionalityReduction

    • PCA
    • ICA
    • CCA
    • Linear Discriminant Analysis
    • Kernel-based methods

    Current efforts: DimensionalityReduction

  • Non-negative Matrix Factorization

    This could be categorized under dimensionality reduction. However, NNMF in itself has a plethora of methodologies, and thus deserves a separate package.

  • Classification

    There are many techniques for classification. It may be useful to have multiple packages for the respective techniques (e.g. GLM, SVM, kNN), and a meta-package Classification.jl to incorporate them all.

  • Clustering

    Current efforts: Clustering.jl

  • Many machine learning applications also require some supporting functionality, such as performance evaluation, data preprocessing, etc. These can all go into MLBase.

  • Probabilistic Modeling (e.g. Bayesian Network, Markov Random Field, etc)

    This is a huge field in itself, and may be discussed separately.

cc: @johnmyleswhite @dmbates @simonster @ViralBShah



I created an NMF.jl package, which is dedicated to non-negative matrix factorization.

Also, a detailed plan for DimensionalityReduction is outlined here.

@johnmyleswhite
Member

I agree with all of this. I've got a lot of prototype SGD code already.

I like the idea of meta-packages. If we're going to have Classification.jl, maybe Regression.jl should be a similar meta-package?

@jiahao
Member

jiahao commented Feb 8, 2014

I'm not an expert in this area, but I've been interested for a while and am willing to help.

@lindahua
Author

lindahua commented Feb 8, 2014

@johnmyleswhite: Will you please move Clustering, SVM, and DimensionalityReduction over to JuliaStats? These are very basic for machine learning. I recently got some time to work on those.

For regression, when there are several quite different techniques implemented, it will make sense to make a meta package.

@johnmyleswhite
Member

I transferred Clustering and SVM over. I'm going to announce that I'm moving DimensionalityReduction over, then we can go ahead and make the move tomorrow.

@lindahua
Author

lindahua commented Feb 8, 2014

Also, I think it is important to separate packages that provide core algorithms and those integrated with DataFrames.

We may consider providing tools so that data frames work nicely with machine learning algorithms. However, I think core machine learning packages should not depend on DataFrames -- which is not used as frequently in machine learning.

@johnmyleswhite
Member

I agree completely. I would very strongly prefer that we implement integration with DataFrames in the following way throughout all packages:

  • Packages should always define algorithms that operate on Vector{Float64} and Matrix{Float64}.
  • DataFrames.jl exposes a set of tools via formulas that translate between DataFrame and Matrix{Float64}.

This makes it easy to work with pure numerical data without any dependencies on DataFrames, while making it easy for people working with DataFrames to take advantage of the core ML algorithms by efficiently translating DataFrames into matrices.
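As a rough sketch of this layering (all names here are illustrative, not an actual package API), the core routine sees only plain arrays, while a thin wrapper owns the table-to-matrix step:

```julia
# Illustrative two-layer design: the core algorithm depends only on plain
# arrays; a wrapper assembles the design matrix and delegates.

# Core layer (no DataFrames dependency): ordinary least squares on arrays.
fit_ols(X::Matrix{Float64}, y::Vector{Float64}) = X \ y

# Wrapper layer: assemble a design matrix from numeric columns, then delegate.
# In practice this is where the formula/ModelMatrix machinery would sit.
function fit_ols_columns(cols::Vector{Vector{Float64}}, y::Vector{Float64})
    X = hcat(cols...)
    return fit_ols(X, y)
end
```

The point of the split is that only the wrapper layer would ever need a DataFrames dependency.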

@johnmyleswhite
Member

The only hiccup with what I just described is deciding where the interfaces that mix DataFrames + ML should live. Arguably there should be one big package that does all of this by wrapping the other ML packages with a DataFrames interface.

@lindahua
Author

lindahua commented Feb 8, 2014

@johnmyleswhite are there issues with providing these in DataFrames.jl?

@johnmyleswhite
Member

Providing what?

@lindahua
Author

lindahua commented Feb 8, 2014

Sorry, I seem to have misread part of your comments. I agree with your suggestions.

@lindahua
Author

lindahua commented Feb 8, 2014

I am just not sure whether we really need another meta-package to couple DataFrames and ML, if the tools provided in DataFrames are convenient enough.

@johnmyleswhite
Member

You're right: we could encourage users to explicitly call the DataFrame -> Matrix conversion routines. That would simplify things considerably.

@johnmyleswhite
Member

The two main difficulties with this approach:

  • Getting the community to adopt this kind of strategy consistently.
  • Dealing with packages that legitimately need additional information to do their work. In GLM, for example, the entire model estimation step needs nothing more than access to the design matrix. But presenting the results in a convenient way requires access to information about the original coefficient labels.

@lindahua
Author

lindahua commented Feb 8, 2014

For GLM, my consideration is to have two packages:

  1. A package that provides the core algorithms that only work with numerical arrays.
  2. A higher-level package that builds on top of the core package and provides a more friendly interface. (This package may depend on DataFrames.)

@lindahua
Author

lindahua commented Feb 8, 2014

So this is basically your idea of having a higher-level package that relies on core ML packages + DataFrames to provide useful tools for analyzing data frames.

@IainNZ

IainNZ commented Feb 9, 2014

On my phone right now, but weren't there some CART/Random Forest packages, if not in METADATA then at least mentioned on the mailing list?
One thing about those is that they can use factors quite well, so I imagine they would depend directly on DataFrames, as that is the package-of-choice for representing that kind of data. So when talking about best practices etc., it might be worth keeping in mind that some packages might really be most efficiently built on top of DataFrames instead of the Matrix{Float64} abstraction.

@lindahua
Author

lindahua commented Feb 9, 2014

Decision trees, by their nature, can work on heterogeneous data (each observation may be composed of variables of different kinds). For such methods, implementation based on DataFrames makes sense.
I don't mind a decision tree package depending on DataFrames.jl

There do exist a large number of machine learning methods (e.g. PCA, SVM, LASSO, K-means, etc) that are designed to work with real vectors/matrices. Heterogeneous data needs to be converted to numerical arrays before such methods can be applied. Packages that provide such methodologies are encouraged to be independent of DataFrames.

@johnmyleswhite
Member

You're right: there's a DecisionTree package.

To me, working with factors is actually a really strong argument for pushing a representation of categorical data into an earlier layer of our infrastructure like StatsBase. But we're actively debating ways to do this in JuliaStats/DataArrays.jl/issues/73.

If we could avoid some of the issues @simonster raised in his issue, I think it would be a big help to move the representation of categorical data closer to Julia's Base.

It is also worth keeping in mind that nominal data is often handled using dummy variables, which do fit in the Matrix{Float64} abstraction. That's actually how GLM handles those kinds of variables.

If DecisionTree.jl needs DataFrames.jl, I fully agree with Dahua: that's not a problem. But if it only needs a simpler abstraction, pushing things towards that simpler abstraction seems desirable.
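The dummy-variable idea can be sketched with a hypothetical helper (not an existing function in any of these packages):

```julia
# Hypothetical helper: one-hot (dummy) code a categorical column so it fits
# the Matrix{Float64} abstraction, similar in spirit to how GLM expands
# factors into indicator columns.
function dummy_code(labels::AbstractVector)
    levels = unique(labels)
    X = zeros(Float64, length(labels), length(levels))
    for (i, lab) in enumerate(labels)
        X[i, findfirst(==(lab), levels)] = 1.0
    end
    return X, levels
end
```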

@simonster
Member

There are some cases where Matrix{Float64} is too specific an abstraction. I have experimented with fitting point process GLMs in Julia, where the design matrix is theoretically expressible as a Matrix{Float64}, but it would require a huge amount of memory (for my models, probably >100 GB). On the other hand, it is easy to express the design matrix as an AbstractMatrix{Float64} that efficiently implements A_mul_B! and At_mul_B!. I wrote code that does this and directly minimizes the negative log likelihood via L-BFGS using NLopt, which fits my model in a reasonable amount of time with reasonable memory requirements, but I'm not sure what to do with this code, since the GLM package is still about 3x faster with a Matrix{Float64} (for the benchmark included with the GLM package with the same convergence criterion, excluding the non-negligible time to construct the ModelFrame).
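The implicit-design-matrix idea can be sketched with a custom AbstractMatrix. This is a toy illustration (a lagged design, loosely in the spirit of a point-process model), not the actual code described above:

```julia
# Toy version of the idea: a design matrix defined implicitly, never stored.
# Column j is the signal lagged by j - 1 samples, so storage is O(n), not O(n*p).
struct LaggedDesign <: AbstractMatrix{Float64}
    x::Vector{Float64}   # underlying signal
    nlags::Int
end

Base.size(A::LaggedDesign) = (length(A.x), A.nlags)
# Entry (i, j) is x[i - j + 1]; out-of-range entries are zero.
Base.getindex(A::LaggedDesign, i::Int, j::Int) =
    (k = i - j + 1; 1 <= k <= length(A.x) ? A.x[k] : 0.0)
```

With `size` and `getindex` defined, generic matrix-vector products work through the AbstractMatrix fallbacks, so an optimizer that only needs products (and their efficient specializations) never has to materialize the full matrix.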

As far as the model fitting interface for DataFrames, it would be cool if we could get this to work on top of StatisticalModel. Packages could implement:

fit(::Type{MyModelType}, X::AbstractMatrix, y::AbstractVector, args...)

and DataFrames could implement:

function fit{T<:StatisticalModel}(::Type{T}, f::Formula, df::DataFrame, args...)
    mf = ModelFrame(f, df)
    DFStatisticalModel(mf, fit(T, ModelMatrix(mf).m, model_response(mf), args...))
end

or similar. DFStatisticalModel could provide a wrapper that maps between coefficients and their labels when calling coef, predict, etc. Of course, doing this right requires that we have a reasonable StatisticalModel interface (#4) so that we can make the relevant functionality accessible for DataFrames.
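A package's side of that contract might look like the following minimal sketch (MeanModel is a made-up stand-in for a real StatisticalModel subtype):

```julia
using Statistics

# Made-up stand-in for a real model type: "fits" an intercept-only model.
struct MeanModel
    mu::Float64
end

# The array-only method a package would provide; the formula-based fit in
# DataFrames would funnel into this after building the design matrix.
fit(::Type{MeanModel}, X::AbstractMatrix, y::AbstractVector) = MeanModel(mean(y))

predict(m::MeanModel, X::AbstractMatrix) = fill(m.mu, size(X, 1))
```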

@jiahao
Member

jiahao commented Feb 10, 2014

There are some cases where Matrix{Float64} is too specific an abstraction.

This sounds a lot like the discussion we had in JuliaLinearAlgebra/IterativeSolvers.jl#2 a little while ago.

@andreasnoack
Member

@simonster GLM can use a sparse model matrix, but I think you'll have to define your own subtype of LinPred.

@ViralBShah

It would be great if, as part of the roadmap, we can also plan to put some large datasets in place, so that the community can work on optimizing performance and designing APIs accordingly. Having RDatasets is so useful, and something that makes large public datasets easily available for people to work with will greatly help this effort.

@lindahua
Author

@ViralBShah Good point. Datasets are important. I think we already have an MNIST package; we can definitely add more.

Just that we need to be cautious about the licenses that come with the datasets.

@johnmyleswhite
Member

There are surprisingly few large data sets that are publicly available. I'd guess that the easiest way to generate "large" data is to do n-grams on something like the 20 Newsgroup data set. Classifying one of the newsgroup against all the others is a simple enough binary classification problem that we can scale out to arbitrarily high size (in terms of features) by working with 2-grams, 3-grams, etc. Other useful examples might be processing the old Audioscrobbler data (http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html) or something similar.
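The n-gram blow-up mentioned above is cheap to generate. A hypothetical helper:

```julia
# Hypothetical helper: word n-grams over a tokenized document. Moving from
# unigrams to 2-grams, 3-grams, etc. multiplies the feature vocabulary,
# which is the scaling knob suggested above.
ngrams(tokens::Vector{String}, n::Int) =
    [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]
```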

@ViralBShah

We also have CommonCrawl.jl. The point about the datasets is not so much to distribute them as julia packages, but to have easy APIs to access them, load them, and work with them. Often, I find that the pain of figuring out all the plumbing is enough to discourage people, and making the plumbing easy could get a lot more people to contribute.

@ViralBShah

Perhaps not too big, but there are also the Netflix and MovieLens datasets, which could be made easier to access.

@johnmyleswhite
Member

The Netflix data set is illegal to distribute.

simonster added a commit to simonster/GLM.jl that referenced this issue Feb 24, 2014
This adds a method for fitting a GLM by explicitly specifying the
design matrix and response vectors. The resulting GlmMod object has
empty ModelFrame and formula fields, and I've changed the few
functions that reference these fields to first check if they are
defined.

Eventually it is probably a good idea to follow @lindahua's suggestion
from JuliaStats/Roadmap.jl#11 and split out functionality that depends
on DataFrames into a separate package, but most of these changes will
be necessary for that as well.

I have also added a method for fitting a GLM on a new response vector
using the same design matrix.

Closes JuliaStats#54
@tkelman

tkelman commented Oct 10, 2015

It sounds like there probably would be enough interest for a dedicated JuliaDeepLearning organization. It would have some requirements for interoperating with classical subcomponents that exist in JuliaStats. If there were a Julia equivalent of scikit-learn it would probably go in JuliaStats, but a julia equivalent of theano or cgt could go in a new JuliaDeepLearning org. At a bare minimum, start by moving Mocha there, and figure out the best next steps from there?

@Evizero

Evizero commented Oct 10, 2015

I am assuming you - @pluskid - should have a pretty good picture of the state of the Julia deep learning community. So my guess is you probably have the most educated idea of what needs to be done to move it forward. We all know that deep learning is pretty much the most active subfield of ML right now, so I think it would be a good investment to make the Julia part more official.

The question is whether there is enough of a community to maintain the packages once there is no explicit owner any more. MLBase is a good example of a package that I don't touch (even though it would make sense to add some code to it), simply because it takes a week to get a version tagging request replied to. Basically, I don't think organizations are automatically a good idea; especially not if the author is actively maintaining his/her packages.

As a side note, I agree with @tbreloff and think a general JuliaLearn/JuliaML org would make more sense than moving the deep learning packages into JuliaStats, especially given the MLBase situation. To be frank, I don't think the JuliaStats community currently has the resources to maintain ML packages. I don't want to step on anyone's toes here. All the JuliaStats members that I have had contact with were very nice and very helpful. I just think that they are busy with other things (such as Nullable Arrays) these days and don't have enough time to spend on Machine Learning.

@lucasb-eyer

Hi @pluskid, thanks for starting the discussion. I'm currently working on a DL library in Julia which closely follows the design of Torch7, but makes use of Julia's features. It's not on github yet and progress is unfortunately slow because it's a side-project; my research is (still) in Theano. A friend of mine is doing a similar thing, so there definitely is interest. I also believe there's interest in the DL community at large, because Theano is suboptimal for RNNs and people generally don't like lua.

I agree that the current state of GPU array operations makes this task more painful than it ought to be, and a lot of work on this could probably be shared across DL packages.

PS: CGT looks promising, but it is not a successor of Theano.

@tbreloff

I have OnlineAI.jl, which extends OnlineStats.jl into neural nets and reservoir computing. I don't think it's appropriate for inclusion in a new organization, but there are pieces which overlap with other packages, and I think it would be great to have a unifying initiative for an MLBase that can support many different approaches to learning from data.

My experience with many learning frameworks is that they tend to focus heavily on image classification and other similar (static) problems. I would really like to see something like the OnlineStats interface, which allows for both static (image classification, deep learning, etc) and dynamic (video analysis, time series, reinforcement learning, etc) modeling, allowing for analyzing both large distributed datasets and streaming data. Some of this exists already, and I hope we can create a best of breed base package to supply overlapping functionality.


@droidicus

Just wanted to make sure you were aware of another SVM package in Julia called SALSA.jl: https://github.com/jumutc/SALSA.jl @jumutc

@Evizero

Evizero commented Oct 10, 2015

... I think it would be great to have a unifying initiative for an MLBase that can support many different approaches to learning from data.

I agree. Actually, I think MLBase is more of an "MLTools" in the sense that it provides design-agnostic functionality. We should maybe think about collaborating on a common MLBase or MLAbstractions that does impose some design decisions, such as function names. I know that I will sooner or later reach a point where I need to factor out a common base package for my stuff. I don't know much about OnlineStats.jl, but I was thinking of something more high-level and really lightweight that evolves as we go along. Not everything falls under online learning, and probably not everything can be boxed into the same kind of framework. Avoiding name collisions and settling on function names would be a good first step.

@dfdx

dfdx commented Oct 10, 2015

@pluskid If you create an organization, I'll be glad to join. Recently I added cuRAND.jl to JuliaGPU to support stochastic algorithms, and I am currently in the process of designing a common library for unified CPU/GPU array programming - something similar to Theano/Torch7 (we should probably start a separate discussion about it). So if you are looking for people ready to contribute, include me on the list.

@denizyuret

I'd be interested. I am currently working on a Theano-alternative model compiler for Julia; it should take shape in the next couple of months.



@tbreloff

I think there is a lot of value in a consistent API, and I'm ready to put in some effort to make this roadmap a reality. For the last few weeks I've been working on a very similar process with Plots.jl... putting a complex-but-lightweight interface into the plotting world. I think the approach should be very similar for the ML community.

I propose that we create an organization JuliaLearn, and that we create a repo LearnBase.jl which will be home to both the design discussions and an implementation of what I describe (or something similar):

  • Design a bare minimum of verbs: fit, fit!, predict, transform, etc
  • Design a method of mapping different data inputs to a consistent (and more verbose) "backend API". i.e. fit(model, dataframe) would be mapped to a call like fit(model, Any[convertColumn(c) for c in columns(dataframe)]; labels = names(df)). This way the user interface is simple, but we still retain full value in the data structure that was passed in, and final algorithms don't need special handling for DataFrames, etc
  • Design traits or a type hierarchy which defines methods that a type of LearningModel must define... all probably require fit, online models require fit!, regression models require predict, etc. Ideally I think the type hierarchy is not fully defined beforehand, but is implicitly defined given the common methods that a model implements. With a robust API layer, we can use multiple dispatch to our advantage and forego many of the "type tree" problems that exist in other languages.
  • Implement placeholder "linking code" that implements the "backend API", essentially converting calls from the backend API to existing packages. For example, the user may call fit(neuralnet, data...) which gets converted into a call to the backend API fit(neuralnet, processed_data), which then in turn will build a neural net in OnlineAI or Mocha or whatever default is chosen (likely chosen from installed packages with a priority list), and then return a wrapper around that Mocha object. The user never knows about the differences in implementation between OnlineAI and Mocha, assuming they can both accomplish the request.
  • Most backend packages are loaded as-needed, so that the REQUIRE file is small. As such, LearnBase can be be used in many places without worrying about the massive dependency tree that plagues other projects.
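A minimal sketch of that backend-dispatch idea (all names invented; real backends like Mocha or OnlineAI would slot in where the placeholder backends sit):

```julia
# Frontend verb + pluggable backends, in miniature. Each backend implements
# the verbose "backend API"; the frontend normalizes input and dispatches.
abstract type Backend end
struct BackendA <: Backend end   # placeholder for e.g. a wrapped library
struct BackendB <: Backend end   # placeholder for a pure-Julia implementation

# Backend API: least squares here, tagged so we can see which backend answered.
backend_fit(::BackendA, X::Matrix{Float64}, y::Vector{Float64}) = (:A, X \ y)
backend_fit(::BackendB, X::Matrix{Float64}, y::Vector{Float64}) = (:B, X \ y)

const DEFAULT_BACKEND = Ref{Backend}(BackendA())

# Frontend: users pass columns; matrix assembly happens once, here, so the
# backends never need special handling for different input structures.
fit(cols::Vector{Vector{Float64}}, y::Vector{Float64}) =
    backend_fit(DEFAULT_BACKEND[], hcat(cols...), y)
```

Swapping `DEFAULT_BACKEND[]` changes which implementation answers without any change to user code, which is the property described above.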

This methodology has been (in my opinion) incredibly powerful for Plots.jl. I have a simple, flexible API which can still access functionality from very different underlying packages, and requiring no cooperation from existing package authors. It requires a little extra work up front to support a new backend package, but that is a much smaller effort than if that package would need to be re-written with a new interface, or to start a new package from scratch. A user can make an API call which initially calls a python-wrapped library, but is then later replaced by a better julia implementation, with no change to their code.

There are two really important advantages to the approach that I described:

  • The framework/API can be developed and designed without worry of package breakage or community turmoil. Since you aren't requiring other packages to conform to any specific interface initially, it will be faster to achieve proof-of-concept and to iterate through design decisions. (contributors to LearnBase would not be dependent on other package authors for PR responses, etc)
  • New packages can implement the "backend API" and instantly get the front end preprocessing for free, along with any other niceties that may be available (such as some variation of the @stream macro in OnlineStats.jl, or cross validation, or plotting recipes, etc). This means the barrier to entry for brand new models/techniques is extremely low, and we may see more efficient development efforts in the future.

I am willing to take the lead on this effort, if you'll let me. With a few 👍 I will form the org and get this started.

cc: @StefanKarpinski @joshday

@simonster
Member

The first three points are already implemented. StatsBase defines StatisticalModel and RegressionModel types along with methods for them. DataFrames defines a fit method that takes a subtype of StatisticalModel, a formula, and a DataFrame, converts the DataFrame into a design matrix according to the formula, and calls the fit method for the type with the model response and a design matrix.

I'm not sure there's a need for separate organizations for statistics and machine learning. It may make sense to have a separate organization for deep learning, since it's substantially different from both stats and classical ML. But much of what is currently in JuliaStats could qualify as either statistics or ML. At least at this time I think it's better to do this work in a single organization.

@tbreloff

To be clear, I see immense value in using as much of the current stats framework as is reasonable. I think StatsBase would be one of the few required dependencies, and things like StatsBase.fit should be extended, not replaced. There will be abstractions that are appropriate for online models, or deep neural nets, etc. that are not appropriate for StatsBase, and those could fall into LearnBase (which is, at its core, an extension of StatsBase).

As to whether LearnBase.jl should live in JuliaStats or JuliaLearn (or some other name the community agrees on), I can see both sides.

Pros for JuliaStats:

  • Already exists, people know about it
  • Much overlap between stats and machine learning

Pros for JuliaLearn:

  • Clean separation of owners (i.e. can have owners that don't impact JuliaStats)
  • More focused community

I could be convinced either way...

@pluskid

pluskid commented Oct 11, 2015

@Evizero Thank you for the comments! I'm starting to agree with you about the concerns of hosting projects under organizations. But I'm very glad to see that there are quite a few people with interest in, or who have already started working on, theano / torch-like systems in Julia. We might consider creating an umbrella organization to host wiki pages pointing to those related projects and maybe host general discussions about deep learning libraries in Julia. May I ask, @lucasb-eyer, @dfdx, @denizyuret, when your new projects start to take shape, could you come back and comment here? At that stage, we could consider creating such a repo. Having a wiki page summarizing the different possible choices of deep learning libraries in Julia would be at least very helpful for new users.

@Evizero

Evizero commented Oct 11, 2015

The first three points are already implemented

@simonster I don't think that is true. I have been playing with the idea of using StatisticalModel or RegressionModel as a base type, but they are simply not abstract enough. Not every learning model has coefficients, and not every ML model that has coefficients has a probabilistic interpretation for them (i.e. things like confint don't always make sense). Herein lies the small difference between Machine Learning and Statistical Learning, in my opinion.

@tbreloff I like the way you think, but I would really like to keep it much simpler and more realistic for now. I wouldn't go the Plots.jl route with the backends. For now we should just dictate the interface and type hierarchy, otherwise it is going to get ugly at one point or another. There should just be enough stuff in there that it would be reasonable to expect new ML packages to follow. I think the two main goals should be

  1. a user is able to import multiple ML packages without having name collisions occur
  2. a user can expect similar things to have a similar interface that behaves similarly

I would also like to move my class-encoding code there (that builds on MLBase labelmap). Since it influences both our current efforts I'd suggest we just establish the package and get as many ML people in the loop as we can so that people can provide feedback. Since this package is a group effort it would make sense to me if it lived in an org. We can always move the package to JuliaStats later if it makes more sense, but for now let's just make some progress while we're motivated

EDIT: And to address the potential question of why not put this into MLBase: It doesn't even define the function name accuracy and the PR that would add it is sitting there unaddressed since April

@simonster
Member

If you don't define confint for a StatisticalModel nothing bad will happen. But there could be another level in the hierarchy if there is a perceived need.

I'm sympathetic to the concern that MLBase is not being sufficiently actively maintained, but it also looks like that PR failed its own tests.

@Evizero

Evizero commented Oct 11, 2015

If you don't define confint for a StatisticalModel nothing bad will happen.

I get that, but that doesn't sound like a good solution

But there could be another level in the hierarchy if there is a perceived need.

Yes, but I think this does need to be a group outcome. Since it is a problem that some people (which includes me) are currently actively concerned with I think it is a good time to brainstorm about this

I'm sympathetic to the concern that MLBase is not being sufficiently actively maintained, but it also looks like that PR failed its own tests.

It's the not-even-replied-to part that bugs me, in the sense that anything non-trivial gets no reaction. I don't blame anyone who loses interest in contributing to Julia (or just a specific package) if no one even takes the time to acknowledge the attempted contribution. I am not pointing fingers here. It's no one's fault. In fact, it's pretty cool that MLBase even exists to begin with. I think the StatsBase community is doing a tremendous job. But I do think it is a problem that needs to be addressed.

I just think that, given that a few people are currently very interested in actively working on and improving Julia's ML aspects, we should talk about and address such problems that are crippling (for lack of a better word) to the progress of the ML ecosystem.

But long story short, @tbreloff and I have started the discussion in LearnBase and we will try to code up a good solution. Anyone who is interested in the discussion or in providing feedback is very welcome.

@johnmyleswhite
Member

FWIW, I think the best way to move forward is to punt on the abstraction layer problem for now (since we don't all agree on it and reaching group consensus is always extremely difficult) -- and instead focus on just nailing certain specific models. Simon's done amazing work to get regularized linear regression working well in pure Julia. It would be great to have similarly nice tools for things like kNN. I suspect it's easier to get people to collaborate (or at least offer useful feedback to one another) if everyone is coordinating on a single purely technical problem (e.g. how to make nearest neighbor search fast) that doesn't require people to come to consensus about purely aesthetic considerations.

@quinnj
Member

quinnj commented Oct 11, 2015

+1 John.
I haven't actively engaged in the thread here, but have followed since the beginning. I do think at this point, it's probably more productive for everyone to pick an area/model they're most interested in and really work on getting fast, feature-rich, usable ML code.
It might be more productive for everyone to plan on attending JuliaCon 2016 where we could plan a workshop(s) where ML Roadmap/Vision is discussed specifically. I think everyone meeting in person, coming with some solid code and ideas, would end up being much more productive in hashing out a coordinated vision for ML in Julia.

@Evizero

Evizero commented Oct 11, 2015

Hmm maybe I have gone a little off track. I didn't know about the JuliaCon 2016 plans and I am very happy to hear about them (or at least the consideration)

But the two points I stated before still make sense to me

  • a user is able to import multiple ML packages without having name collisions occur
  • a user can expect similar things to have a similar interface that behaves similarly

I don't think settling on function names and defining them in a single place to avoid collisions is too far out there. I'm not talking about some fictitious issues here. These are things that currently concern me in my efforts for SVMs. Some coordination, even if it's just for exchanging ideas, is at least educational. I want to at least try and fail rather than not attempt at all.
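The collision-avoidance point can be made concrete with a tiny sketch (all module names invented): if a shared base package owns the generic functions, independent packages extend the same `fit` rather than each defining their own.

```julia
# Sketch: a shared base module owns the generic function names; packages
# only add methods, so loading both packages causes no name collision.
module TinyLearnBase
    function fit end
    function predict end
end

module PackageA
    import ..TinyLearnBase: fit
    struct ModelA end
    fit(::Type{ModelA}) = ModelA()   # extends the shared generic function
end

module PackageB
    import ..TinyLearnBase: fit
    struct ModelB end
    fit(::Type{ModelB}) = ModelB()   # same function, different method
end
```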

@Evizero

Evizero commented Oct 11, 2015

Let's leave it at this for now: It looks like @tbreloff and I will put our heads together and try to coordinate at least both of our current ML efforts in a meaningful way. Hopefully the outcome will be useful to others as well.

@ValdarT

ValdarT commented May 1, 2017

Hi

I am interested in the current state of the ML ecosystem in Julia. From reading this (and other) issue(s) and having a look at the mentioned packages, it seems to me that:

  • Hard work going on in JuliaML but Learn.jl will not be ready for use any time soon
  • Orchestra.jl and SupervisedLearning.jl not maintained (I assume Learn.jl will fill their place in the future)
  • ScikitLearn.jl maintained and works well but is not very actively improved/developed. (As it currently stands, it is more of an interface to the original code rather than a reimplementation in Julia.)

Are my impressions correct? If so, I assume people are not using Julia for day-to-day ML experiments the way they use, for example, Python+scikit-learn? Or is there perhaps an ML 'workbench' package I missed?

@denizyuret

denizyuret commented May 1, 2017 via email

@rofinn
Member

rofinn commented May 1, 2017

@ValdarT I think most people using julia for "day-to-day ML" either use very specific packages for their use case (e.g. Boltzmann.jl, BayesNets.jl, GaussianProcesses.jl, Mocha.jl) or implement their own methods. I imagine the folks in the JuliaML organization are the most likely to come up with a good, cohesive julia framework for all the different ML methods out there, but that's a pretty tough job.

@amueller

amueller commented May 1, 2017

Wow, JuliaML looks pretty great but also pretty ambitious. It has a much larger scope than scikit-learn and tensorflow combined... Is there any documentation on the "learn" package or a simple intro somewhere?

@ararslan
Member

ararslan commented May 1, 2017

Discussion of the JuliaML organization should take place in their roadmap: https://github.com/JuliaML/Roadmap.jl/issues. The focus of JuliaStats is more classical statistics, as the more ML-oriented packages in this organization are unmaintained (e.g. SVM and RegERMs).

@JuliaStats JuliaStats locked and limited conversation to collaborators May 2, 2017
@ViralBShah

Locking and closing this issue so that discussion can continue in the right place: https://github.com/JuliaML/Roadmap.jl
