This repository has been archived by the owner on Apr 19, 2019. It is now read-only.

Machine Learning Roadmap #11

Closed
lindahua opened this issue Feb 8, 2014 · 138 comments

Comments

@lindahua

lindahua commented Feb 8, 2014

Currently, the development of machine learning tools is spread across several different packages with little coordination. Consequently, some efforts are duplicated, while some important aspects remain lacking.

Hopefully, we may coordinate our efforts through this issue. Below, I try to outline a tentative roadmap:

  • Generalized Linear Models

    • Linear Regression
    • Logistic Regression
    • Lasso, Elastic Net, and their variants
    • Stochastic Gradient Descent

    Current efforts: GLMNet, GLM, Regression

  • Support Vector Machines

    Current efforts: SVM, LIBSVM

  • DimensionalityReduction

    • PCA
    • ICA
    • CCA
    • Linear Discriminant Analysis
    • Kernel-based methods

    Current efforts: DimensionalityReduction

  • Non-negative Matrix Factorization

    This could be categorized under dimensionality reduction. However, NNMF in itself has a plethora of methodologies, and thus deserves a separate package.

  • Classification

    There are many techniques for classification. It may be useful to have multiple packages for the respective techniques (e.g. GLM, SVM, kNN), and a meta-package Classification.jl to incorporate them all.

  • Clustering

    Current efforts: Clustering.jl

  • Many machine learning applications also require some supporting functionality, such as performance evaluation, data preprocessing, etc. These can all go into MLBase.

  • Probabilistic Modeling (e.g. Bayesian Network, Markov Random Field, etc)

    This is a huge field in itself, and may be discussed separately.

cc: @johnmyleswhite @dmbates @simonster @ViralBShah



I created an NMF.jl package, which is dedicated to non-negative matrix factorization.

Also, a detailed plan for DimensionalityReduction is outlined here.

@johnmyleswhite
Member

I agree with all of this. I've got a lot of prototype SGD code already.

I like the idea of meta-packages. If we're going to have Classification.jl, maybe Regression.jl should be a similar meta-package?

@jiahao
Member

jiahao commented Feb 8, 2014

I'm not an expert in this area, but I've been interested for a while and am willing to help.

@lindahua
Author

lindahua commented Feb 8, 2014

@johnmyleswhite: Will you please move Clustering, SVM, and DimensionalityReduction over to JuliaStats? These are very basic for machine learning. I recently got some time to work on those.

For regression, when there are several quite different techniques implemented, it will make sense to make a meta package.

@johnmyleswhite
Member

I transferred Clustering and SVM over. I'm going to announce that I'm moving DimensionalityReduction over, then we can go ahead and make the move tomorrow.

@lindahua
Author

lindahua commented Feb 8, 2014

Also, I think it is important to separate packages that provide core algorithms and those integrated with DataFrames.

We may consider providing tools so that data frames work nicely with machine learning algorithms. However, I think core machine learning packages should not depend on DataFrames -- which is not used as frequently in machine learning.

@johnmyleswhite
Member

I agree completely. I would very strongly prefer that we implement integration with DataFrames in the following way throughout all packages:

  • Packages should always define algorithms that operate on Vector{Float64} and Matrix{Float64}.
  • DataFrames.jl exposes a set of tools via formulas that translate between DataFrame and Matrix{Float64}.

This makes it easy to work with pure numerical data without any dependencies on DataFrames, while making it easy for people working with DataFrames to take advantage of the core ML algorithms by efficiently translating DataFrames into matrices.
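As a rough sketch of this layering (all names here are illustrative, not an actual package API), the core routine sees only plain arrays, while a thin wrapper owns the table-to-matrix step:

```julia
# Illustrative two-layer design: the core algorithm depends only on plain
# arrays; a wrapper assembles the design matrix and delegates.

# Core layer (no DataFrames dependency): ordinary least squares on arrays.
fit_ols(X::Matrix{Float64}, y::Vector{Float64}) = X \ y

# Wrapper layer: assemble a design matrix from numeric columns, then delegate.
# In practice this is where the formula/ModelMatrix machinery would sit.
function fit_ols_columns(cols::Vector{Vector{Float64}}, y::Vector{Float64})
    X = hcat(cols...)
    return fit_ols(X, y)
end
```

The point of the split is that only the wrapper layer would ever need a DataFrames dependency.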

@johnmyleswhite
Member

The only hiccup with what I just described is deciding where the interfaces that mix DataFrames + ML should live. Arguably there should be one big package that does all of this by wrapping the other ML packages with a DataFrames interface.

@lindahua
Author

lindahua commented Feb 8, 2014

@johnmyleswhite are there issues with providing these in DataFrames.jl?

@johnmyleswhite
Member

Providing what?

@lindahua
Author

lindahua commented Feb 8, 2014

Sorry, I seem to have misread part of your comments. I agree with your suggestions.

@lindahua
Author

lindahua commented Feb 8, 2014

I am just not sure whether we really need another meta-package to couple DataFrames and ML, if the tools provided in DataFrames are convenient enough.

@johnmyleswhite
Member

You're right: we could encourage users to explicitly call the DataFrame -> Matrix conversion routines. That would simplify things considerably.

@johnmyleswhite
Member

The two main difficulties with this approach:

  • Getting the community to adopt this kind of strategy consistently.
  • Dealing with packages that legitimately need additional information to do their work. In GLM, for example, the entire model estimation step needs nothing more than access to the design matrix. But presenting the results in a convenient way requires access to information about the original coefficient labels.

@lindahua
Author

lindahua commented Feb 8, 2014

For GLM, my consideration is to have two packages:

  1. A package that provides the core algorithms that only work with numerical arrays.
  2. A higher-level package that builds on top of the core package and provides a more friendly interface. (This package may depend on DataFrames.)

@lindahua
Author

lindahua commented Feb 8, 2014

So this is basically your idea of having a higher-level package that relies on core ML packages + DataFrames to provide useful tools for analyzing data frames.

@IainNZ

IainNZ commented Feb 9, 2014

On my phone right now, but weren't there some CART/Random Forest packages, if not in METADATA then at least mentioned on the mailing list?
One thing about those is that they can use factors quite well, so I imagine they would depend directly on DataFrames, as that is the package-of-choice for representing that kind of data. So when talking about best practices etc., it might be worth keeping in mind that some packages might really be most efficiently built on top of DataFrames instead of the Matrix{Float64} abstraction.

@lindahua
Author

lindahua commented Feb 9, 2014

Decision trees, by their nature, can work on heterogeneous data (each observation may be composed of variables of different kinds). For such methods, implementation based on DataFrames makes sense.
I don't mind a decision tree package depending on DataFrames.jl

There do exist a large number of machine learning methods (e.g. PCA, SVM, LASSO, K-means, etc) that are designed to work with real vectors/matrices. Heterogeneous data needs to be converted to numerical arrays before such methods can be applied. Packages that provide such methodologies are encouraged to be independent of DataFrames.

@johnmyleswhite
Member

You're right: there's a DecisionTree package.

To me, working with factors is actually a really strong argument for pushing a representation of categorical data into an earlier layer of our infrastructure like StatsBase. But we're actively debating ways to do this in JuliaStats/DataArrays.jl/issues/73.

If we could avoid some of the issues @simonster raised in his issue, I think it would be a big help to move the representation of categorical data closer to Julia's Base.

It is also worth keeping in mind that nominal data is often handled using dummy variables, which do fit in the Matrix{Float64} abstraction. That's actually how GLM handles those kinds of variables.

If DecisionTree.jl needs DataFrames.jl, I fully agree with Dahua: that's not a problem. But if it only needs a simpler abstraction, pushing things towards that simpler abstraction seems desirable.
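The dummy-variable idea can be sketched with a hypothetical helper (not an existing function in any of these packages):

```julia
# Hypothetical helper: one-hot (dummy) code a categorical column so it fits
# the Matrix{Float64} abstraction, similar in spirit to how GLM expands
# factors into indicator columns.
function dummy_code(labels::AbstractVector)
    levels = unique(labels)
    X = zeros(Float64, length(labels), length(levels))
    for (i, lab) in enumerate(labels)
        X[i, findfirst(==(lab), levels)] = 1.0
    end
    return X, levels
end
```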

@simonster
Member

There are some cases where Matrix{Float64} is too specific an abstraction. I have experimented with fitting point process GLMs in Julia, where the design matrix is theoretically expressible as a Matrix{Float64}, but it would require a huge amount of memory (for my models, probably >100 GB). On the other hand, it is easy to express the design matrix as an AbstractMatrix{Float64} that efficiently implements A_mul_B! and At_mul_B!. I wrote code that does this and directly minimizes the negative log likelihood via L-BFGS using NLopt, which fits my model in a reasonable amount of time with reasonable memory requirements, but I'm not sure what to do with this code, since the GLM package is still about 3x faster with a Matrix{Float64} (for the benchmark included with the GLM package with the same convergence criterion, excluding the non-negligible time to construct the ModelFrame).
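The implicit-design-matrix idea can be sketched with a custom AbstractMatrix. This is a toy illustration (a lagged design, loosely in the spirit of a point-process model), not the actual code described above:

```julia
# Toy version of the idea: a design matrix defined implicitly, never stored.
# Column j is the signal lagged by j - 1 samples, so storage is O(n), not O(n*p).
struct LaggedDesign <: AbstractMatrix{Float64}
    x::Vector{Float64}   # underlying signal
    nlags::Int
end

Base.size(A::LaggedDesign) = (length(A.x), A.nlags)
# Entry (i, j) is x[i - j + 1]; out-of-range entries are zero.
Base.getindex(A::LaggedDesign, i::Int, j::Int) =
    (k = i - j + 1; 1 <= k <= length(A.x) ? A.x[k] : 0.0)
```

With `size` and `getindex` defined, generic matrix-vector products work through the AbstractMatrix fallbacks, so an optimizer that only needs products (and their efficient specializations) never has to materialize the full matrix.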

As far as the model fitting interface for DataFrames, it would be cool if we could get this to work on top of StatisticalModel. Packages could implement:

fit(::Type{MyModelType}, X::AbstractMatrix, y::AbstractVector, args...)

and DataFrames could implement:

function fit{T<:StatisticalModel}(::Type{T}, f::Formula, df::DataFrame, args...)
    mf = ModelFrame(f, df)
    DFStatisticalModel(mf, fit(T, ModelMatrix(mf).m, model_response(mf), args...))
end

or similar. DFStatisticalModel could provide a wrapper that maps between coefficients and their labels when calling coef, predict, etc. Of course, doing this right requires that we have a reasonable StatisticalModel interface (#4) so that we can make the relevant functionality accessible for DataFrames.
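A package's side of that contract might look like the following minimal sketch (MeanModel is a made-up stand-in for a real StatisticalModel subtype):

```julia
using Statistics

# Made-up stand-in for a real model type: "fits" an intercept-only model.
struct MeanModel
    mu::Float64
end

# The array-only method a package would provide; the formula-based fit in
# DataFrames would funnel into this after building the design matrix.
fit(::Type{MeanModel}, X::AbstractMatrix, y::AbstractVector) = MeanModel(mean(y))

predict(m::MeanModel, X::AbstractMatrix) = fill(m.mu, size(X, 1))
```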

@jiahao
Member

jiahao commented Feb 10, 2014

There are some cases where Matrix{Float64} is too specific an abstraction.

This sounds a lot like the discussion we had in JuliaLinearAlgebra/IterativeSolvers.jl#2 a little while ago.

@andreasnoack
Member

@simonster GLM can use a sparse model matrix, but I think you'll have to define your own subtype of LinPred.

@ViralBShah

It would be great if, as part of the roadmap, we can also plan to put some large datasets in place, so that the community can work on optimizing performance and designing APIs accordingly. Having RDatasets is so useful, and something that makes large public datasets easily available for people to work with will greatly help this effort.

@lindahua
Author

@ViralBShah Good point. Datasets are important. I think we already have an MNIST package; we can definitely add more.

Just that we need to be cautious about the licenses that come with the datasets.

@johnmyleswhite
Member

There are surprisingly few large data sets that are publicly available. I'd guess that the easiest way to generate "large" data is to do n-grams on something like the 20 Newsgroup data set. Classifying one of the newsgroup against all the others is a simple enough binary classification problem that we can scale out to arbitrarily high size (in terms of features) by working with 2-grams, 3-grams, etc. Other useful examples might be processing the old Audioscrobbler data (http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html) or something similar.
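The n-gram blow-up mentioned above is cheap to generate. A hypothetical helper:

```julia
# Hypothetical helper: word n-grams over a tokenized document. Moving from
# unigrams to 2-grams, 3-grams, etc. multiplies the feature vocabulary,
# which is the scaling knob suggested above.
ngrams(tokens::Vector{String}, n::Int) =
    [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]
```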

@ViralBShah

We also have CommonCrawl.jl. The point about the datasets is not so much to distribute them as julia packages, but to have easy APIs to access them, load them, and work with them. Often, I find that the pain of figuring out all the plumbing is enough to discourage people, and making the plumbing easy could get a lot more people to contribute.

@ViralBShah

Perhaps not too big, but there are also the Netflix and MovieLens datasets, which could be made easier to access.

@johnmyleswhite
Member

The Netflix data set is illegal to distribute.

simonster added a commit to simonster/GLM.jl that referenced this issue Feb 24, 2014
This adds a method for fitting a GLM by explicitly specifying the
design matrix and response vectors. The resulting GlmMod object has
empty ModelFrame and formula fields, and I've changed the few
functions that reference these fields to first check if they are
defined.

Eventually it is probably a good idea to follow @lindahua's suggestion
from JuliaStats/Roadmap.jl#11 and split out functionality that depends
on DataFrames into a separate package, but most of these changes will
be necessary for that as well.

I have also added a method for fitting a GLM on a new response vector
using the same design matrix.

Closes JuliaStats#54
@tkelman

tkelman commented Oct 10, 2015

It sounds like there probably would be enough interest for a dedicated JuliaDeepLearning organization. It would have some requirements for interoperating with classical subcomponents that exist in JuliaStats. If there were a Julia equivalent of scikit-learn it would probably go in JuliaStats, but a julia equivalent of theano or cgt could go in a new JuliaDeepLearning org. At a bare minimum, start by moving Mocha there, and figure out the best next steps from there?

@Evizero

Evizero commented Oct 10, 2015

I am assuming you - @pluskid - should have a pretty good picture of the state of the Julia deep learning community. So my guess is you probably have the most educated idea of what needs to be done to move it forward. We all know that deep learning is pretty much the most active subfield of ML right now, so I think it would be a good investment to make the Julia part more official.

The question is whether there is enough of a community to maintain the packages once there is no explicit owner any more. MLBase is a good example of a package that I don't touch (even though it would make sense to add some code to it), simply because it takes a week to get a version tagging request replied to. Basically, I don't think organizations are automatically a good idea; especially not if the author is actively maintaining his/her packages.

As a side note, I agree with @tbreloff and think a general JuliaLearn/JuliaML org would make more sense than moving the deep learning packages into JuliaStats, especially given the MLBase situation. To be frank, I don't think the JuliaStats community currently has the resources to maintain ML packages. I don't want to step on anyone's toes here. All the JuliaStats members that I have had contact with were very nice and very helpful. I just think that they are busy with other things (such as Nullable Arrays) these days and don't have enough time to spend on Machine Learning.

@lucasb-eyer

Hi @pluskid, thanks for starting the discussion. I'm currently working on a DL library in Julia which closely follows the design of Torch7, but makes use of Julia's features. It's not on github yet and progress is unfortunately slow because it's a side-project; my research is (still) in Theano. A friend of mine is doing a similar thing, so there definitely is interest. I also believe there's interest in the DL community at large, because Theano is suboptimal for RNNs and people generally don't like lua.

I agree that the current state of GPU array operations makes this task more painful than it ought to be, and a lot of work on this could probably be shared across DL packages.

PS: CGT looks promising, but it is not a successor of Theano.

@tbreloff

I have OnlineAI.jl, which extends OnlineStats.jl into neural nets and reservoir computing. I don't think it's appropriate for inclusion in a new organization, but there are pieces which overlap with other packages, and I think it would be great to have a unifying initiative for an MLBase that can support many different approaches to learning from data.

My experience with many learning frameworks is that they tend to focus heavily on image classification and other similar (static) problems. I would really like to see something like the OnlineStats interface, which allows for both static (image classification, deep learning, etc) and dynamic (video analysis, time series, reinforcement learning, etc) modeling, allowing for analyzing both large distributed datasets and streaming data. Some of this exists already, and I hope we can create a best of breed base package to supply overlapping functionality.


@droidicus

Just wanted to make sure you were aware of another SVM package in Julia called SALSA.jl: https://github.com/jumutc/SALSA.jl @jumutc

@Evizero

Evizero commented Oct 10, 2015

... I think it would be great to have a unifying initiative for an MLBase that can support many different approaches to learning from data.

I agree. Actually, I think MLBase is more of an "MLTools" in the sense that it provides design-agnostic functionality. We should maybe think about collaborating on a common MLBase or MLAbstractions that does impose some design decisions, such as function names. I know that I will sooner or later reach a point where I need to factor out a common base package for my stuff. I don't know much about OnlineStats.jl, but I was thinking of something more high-level and really lightweight that evolves as we go along. Not everything falls under online learning, and probably not everything can be boxed into the same kind of framework. Avoiding name collisions and settling on function names would be a good first step.

@dfdx

dfdx commented Oct 10, 2015

@pluskid If you create an organization, I'll be glad to join. Recently I added cuRAND.jl to JuliaGPU to support stochastic algorithms, and I am currently in the process of designing a common library for unified CPU/GPU array programming - something similar to Theano/Torch7 (we should probably start a separate discussion about it). So if you are looking for people ready to contribute, include me on the list.

@denizyuret

I'd be interested. I am currently working on a Theano-alternative model compiler for Julia; it should take shape in the next couple of months.



@tbreloff

I think there is a lot of value in a consistent API, and I'm ready to put in some effort to make this roadmap a reality. For the last few weeks I've been working on a very similar process with Plots.jl... putting a complex-but-lightweight interface into the plotting world. I think the approach should be very similar for the ML community.

I propose that we create an organization JuliaLearn, and that we create a repo LearnBase.jl which will be home to both the design discussions and an implementation of what I describe (or something similar):

  • Design a bare minimum of verbs: fit, fit!, predict, transform, etc
  • Design a method of mapping different data inputs to a consistent (and more verbose) "backend API". i.e. fit(model, dataframe) would be mapped to a call like fit(model, Any[convertColumn(c) for c in columns(dataframe)]; labels = names(df)). This way the user interface is simple, but we still retain full value in the data structure that was passed in, and final algorithms don't need special handling for DataFrames, etc
  • Design traits or a type hierarchy which defines methods that a type of LearningModel must define... all probably require fit, online models require fit!, regression models require predict, etc. Ideally I think the type hierarchy is not fully defined beforehand, but is implicitly defined given the common methods that a model implements. With a robust API layer, we can use multiple dispatch to our advantage and forego many of the "type tree" problems that exist in other languages.
  • Implement placeholder "linking code" that implements the "backend API", essentially converting calls from the backend API to existing packages. For example, the user may call fit(neuralnet, data...) which gets converted into a call to the backend API fit(neuralnet, processed_data), which then in turn will build a neural net in OnlineAI or Mocha or whatever default is chosen (likely chosen from installed packages with a priority list), and then return a wrapper around that Mocha object. The user never knows about the differences in implementation between OnlineAI and Mocha, assuming they can both accomplish the request.
  • Most backend packages are loaded as-needed, so that the REQUIRE file is small. As such, LearnBase can be be used in many places without worrying about the massive dependency tree that plagues other projects.
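A minimal sketch of that backend-dispatch idea (all names invented; real backends like Mocha or OnlineAI would slot in where the placeholder backends sit):

```julia
# Frontend verb + pluggable backends, in miniature. Each backend implements
# the verbose "backend API"; the frontend normalizes input and dispatches.
abstract type Backend end
struct BackendA <: Backend end   # placeholder for e.g. a wrapped library
struct BackendB <: Backend end   # placeholder for a pure-Julia implementation

# Backend API: least squares here, tagged so we can see which backend answered.
backend_fit(::BackendA, X::Matrix{Float64}, y::Vector{Float64}) = (:A, X \ y)
backend_fit(::BackendB, X::Matrix{Float64}, y::Vector{Float64}) = (:B, X \ y)

const DEFAULT_BACKEND = Ref{Backend}(BackendA())

# Frontend: users pass columns; matrix assembly happens once, here, so the
# backends never need special handling for different input structures.
fit(cols::Vector{Vector{Float64}}, y::Vector{Float64}) =
    backend_fit(DEFAULT_BACKEND[], hcat(cols...), y)
```

Swapping `DEFAULT_BACKEND[]` changes which implementation answers without any change to user code, which is the property described above.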

This methodology has been (in my opinion) incredibly powerful for Plots.jl. I have a simple, flexible API which can still access functionality from very different underlying packages, and requiring no cooperation from existing package authors. It requires a little extra work up front to support a new backend package, but that is a much smaller effort than if that package would need to be re-written with a new interface, or to start a new package from scratch. A user can make an API call which initially calls a python-wrapped library, but is then later replaced by a better julia implementation, with no change to their code.

There are two really important advantages to the approach that I described:

  • The framework/API can be developed and designed without worry of package breakage or community turmoil. Since you aren't requiring other packages to conform to any specific interface initially, it will be faster to achieve proof-of-concept and to iterate through design decisions. (contributors to LearnBase would not be dependent on other package authors for PR responses, etc)
  • New packages can implement the "backend API" and instantly get the front end preprocessing for free, along with any other niceties that may be available (such as some variation of the @stream macro in OnlineStats.jl, or cross validation, or plotting recipes, etc). This means the barrier to entry for brand new models/techniques is extremely low, and we may see more efficient development efforts in the future.

I am willing to take the lead on this effort, if you'll let me. With a few 👍 I will form the org and get this started.

cc: @StefanKarpinski @joshday

@simonster
Member

The first three points are already implemented. StatsBase defines StatisticalModel and RegressionModel types along with methods for them. DataFrames defines a fit method that takes a subtype of StatisticalModel, a formula, and a DataFrame, converts the DataFrame into a design matrix according to the formula, and calls the fit method for the type with the model response and a design matrix.

I'm not sure there's a need for separate organizations for statistics and machine learning. It may make sense to have a separate organization for deep learning, since it's substantially different from both stats and classical ML. But much of what is currently in JuliaStats could qualify as either statistics or ML. At least at this time I think it's better to do this work in a single organization.

@tbreloff

To be clear, I see immense value in using as much of the current stats framework as is reasonable. I think StatsBase would be one of the few required dependencies, and things like StatsBase.fit should be extended, not replaced. There will be abstractions that are appropriate for online models, or deep neural nets, etc. that are not appropriate for StatsBase, and those could fall into LearnBase (which is, at its core, an extension of StatsBase).

As to whether LearnBase.jl should live in JuliaStats or JuliaLearn (or some other name the community agrees on), I can see both sides.

Pros for JuliaStats:

  • Already exists, people know about it
  • Much overlap between stats and machine learning

Pros for JuliaLearn:

  • Clean separation of owners (i.e. can have owners that don't impact JuliaStats)
  • More focused community

I could be convinced either way...

@pluskid

pluskid commented Oct 11, 2015

@Evizero Thank you for the comments! I'm starting to agree with you about the concerns of hosting projects under organizations. But I'm very glad to see that there are quite a few people with interest in, or who have already started working on, theano / torch-like systems in Julia. We might consider creating an umbrella organization to host wiki pages pointing to those related projects and maybe host general discussions about deep learning libraries in Julia. May I ask, @lucasb-eyer, @dfdx, @denizyuret, when your new projects start to take shape, could you come back and comment here? At that stage, we could consider creating such a repo. Having a wiki page summarizing the different possible choices of deep learning libraries in Julia would be at least very helpful for new users.

@Evizero

Evizero commented Oct 11, 2015

The first three points are already implemented

@simonster I don't think that is true. I have been playing with the idea of using StatisticalModel or RegressionModel as a base type, but they are simply not abstract enough. Not every learning model has coefficients, and not every ML model that has coefficients has a probabilistic interpretation for them (i.e. things like confint don't always make sense). Herein lies the small difference between Machine Learning and Statistical Learning, in my opinion.

@tbreloff I like the way you think, but I would really like to keep it much simpler and more realistic for now. I wouldn't go the Plots.jl route with the backends. For now we should just dictate the interface and type hierarchy, otherwise it is going to get ugly at one point or another. There should just be enough stuff in there that it would be reasonable to expect new ML packages to follow. I think the two main goals should be

  1. a user is able to import multiple ML packages without having name collisions occur
  2. a user can expect similar things to have a similar interface that behaves similarly

I would also like to move my class-encoding code there (that builds on MLBase labelmap). Since it influences both our current efforts I'd suggest we just establish the package and get as many ML people in the loop as we can so that people can provide feedback. Since this package is a group effort it would make sense to me if it lived in an org. We can always move the package to JuliaStats later if it makes more sense, but for now let's just make some progress while we're motivated

EDIT: And to address the potential question of why not put this into MLBase: It doesn't even define the function name accuracy and the PR that would add it is sitting there unaddressed since April

@simonster
Member

If you don't define confint for a StatisticalModel nothing bad will happen. But there could be another level in the hierarchy if there is a perceived need.

I'm sympathetic to the concern that MLBase is not being sufficiently actively maintained, but it also looks like that PR failed its own tests.

@Evizero

Evizero commented Oct 11, 2015

If you don't define confint for a StatisticalModel nothing bad will happen.

I get that, but that doesn't sound like a good solution

But there could be another level in the hierarchy if there is a perceived need.

Yes, but I think this does need to be a group outcome. Since it is a problem that some people (which includes me) are currently actively concerned with I think it is a good time to brainstorm about this

I'm sympathetic to the concern that MLBase is not being sufficiently actively maintained, but it also looks like that PR failed its own tests.

It's the not-even-replied-to part that bugs me, in the sense that anything non-trivial gets no reaction. I don't blame anyone who loses interest in contributing to Julia (or just a specific package) if no one even takes the time to acknowledge the attempted contribution. I am not pointing fingers here. It's no one's fault. In fact, it's pretty cool that MLBase even exists to begin with. I think the StatsBase community is doing a tremendous job. But I do think it is a problem that needs to be addressed.

I just think that, given that a few people are currently very interested in actively working on and improving Julia's ML aspects, we should talk about and address such problems that are crippling (for lack of a better word) to the progress of the ML ecosystem.

But long story short, @tbreloff and I have started the discussion in LearnBase and we will try to code up a good solution. Anyone who is interested in the discussion or in providing feedback is very welcome.

@johnmyleswhite
Member

FWIW, I think the best way to move forward is to punt on the abstraction layer problem for now (since we don't all agree on it and reaching group consensus is always extremely difficult) -- and instead focus on just nailing certain specific models. Simon's done amazing work to get regularized linear regression working well in pure Julia. It would be great to have similarly nice tools for things like kNN. I suspect it's easier to get people to collaborate (or at least offer useful feedback to one another) if everyone is coordinating on a single purely technical problem (e.g. how to make nearest neighbor search fast) that doesn't require people to come to consensus about purely aesthetic considerations.

@quinnj
Member

quinnj commented Oct 11, 2015

+1 John.
I haven't actively engaged in the thread here, but have followed since the beginning. I do think at this point, it's probably more productive for everyone to pick an area/model they're most interested in and really work on getting fast, feature-rich, usable ML code.
It might be more productive for everyone to plan on attending JuliaCon 2016 where we could plan a workshop(s) where ML Roadmap/Vision is discussed specifically. I think everyone meeting in person, coming with some solid code and ideas, would end up being much more productive in hashing out a coordinated vision for ML in Julia.

@Evizero

Evizero commented Oct 11, 2015

Hmm maybe I have gone a little off track. I didn't know about the JuliaCon 2016 plans and I am very happy to hear about them (or at least the consideration)

But the two points I stated before still make sense to me

  • a user is able to import multiple ML packages without having name collisions occur
  • a user can expect similar things to have a similar interface that behaves similarly

I don't think settling on function names and defining them in a single place to avoid collisions is too far out there. I'm not talking about some fictitious issues here. These are things that currently concern me in my efforts for SVMs. Some coordination, even if it's just for exchanging ideas, is at least educational. I want to at least try and fail rather than not attempt at all.
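The collision-avoidance point can be made concrete with a tiny sketch (all module names invented): if a shared base package owns the generic functions, independent packages extend the same `fit` rather than each defining their own.

```julia
# Sketch: a shared base module owns the generic function names; packages
# only add methods, so loading both packages causes no name collision.
module TinyLearnBase
    function fit end
    function predict end
end

module PackageA
    import ..TinyLearnBase: fit
    struct ModelA end
    fit(::Type{ModelA}) = ModelA()   # extends the shared generic function
end

module PackageB
    import ..TinyLearnBase: fit
    struct ModelB end
    fit(::Type{ModelB}) = ModelB()   # same function, different method
end
```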

@Evizero

Evizero commented Oct 11, 2015

Let's leave it at this for now: It looks like @tbreloff and I will put our heads together and try to coordinate at least both of our current ML efforts in a meaningful way. Hopefully the outcome will be useful to others as well.

@ValdarT

ValdarT commented May 1, 2017

Hi

I am interested in the current state of the ML ecosystem in Julia. From reading this (and other) issue(s) and having a look at the mentioned packages, it seems to me that:

  • Hard work going on in JuliaML but Learn.jl will not be ready for use any time soon
  • Orchestra.jl and SupervisedLearning.jl not maintained (I assume Learn.jl will fill their place in the future)
  • ScikitLearn.jl maintained and works well but is not very actively improved/developed. (As it currently stands, it is more of an interface to the original code rather than a reimplementation in Julia.)

Are my impressions correct? If so, I assume people are not using Julia for day-to-day ML experiments the way they use, for example, Python+scikit-learn? Or is there perhaps an ML 'workbench' package I missed?

@denizyuret

denizyuret commented May 1, 2017 via email

@rofinn
Member

rofinn commented May 1, 2017

@ValdarT I think most people using julia for "day-to-day ML" either use very specific packages for their use case (e.g. Boltzmann.jl, BayesNets.jl, GaussianProcesses.jl, Mocha.jl) or implement their own methods. I imagine the folks in the JuliaML organization are the most likely to come up with a good, cohesive julia framework for all the different ML methods out there, but that's a pretty tough job.

@amueller

amueller commented May 1, 2017

Wow, JuliaML looks pretty great but also pretty ambitious. It has a much larger scope than scikit-learn and tensorflow combined... Is there any documentation on the "learn" package or a simple intro somewhere?

@ararslan
Member

ararslan commented May 1, 2017

Discussion of the JuliaML organization should take place in their roadmap: https://github.com/JuliaML/Roadmap.jl/issues. The focus of JuliaStats is more classical statistics, as the more ML-oriented packages in this organization are unmaintained (e.g. SVM and RegERMs).

@JuliaStats JuliaStats locked and limited conversation to collaborators May 2, 2017
@ViralBShah

Locking and closing this issue so that discussion can continue in the right place: https://github.com/JuliaML/Roadmap.jl
