Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAST Portability #4

Closed
Craigacp opened this issue Aug 3, 2016 · 18 comments
Closed

FEAST Portability #4

Craigacp opened this issue Aug 3, 2016 · 18 comments
Assignees

Comments

@Craigacp
Copy link
Owner

Craigacp commented Aug 3, 2016

@alecuba16 and @mbq have worked on different R interfaces for FEAST (see #3). This issue is to discuss merging in any changes to the core C library or build system to improve usability for R and other possible ports.

@Craigacp
Copy link
Owner Author

Craigacp commented Aug 3, 2016

I'd prefer to avoid forking FEAST so there are different versions of the C library embedded in different projects, as I only have enough time to deal with one set of issues.

@alecuba16
Copy link

alecuba16 commented Aug 4, 2016

I have modified the arrayoperations of MItoolbox, in order to include the extraction of first element of 2DFeatures, because is a repetitive function in almost all FEAST functions, maybe it could be included in arrayoperations or in a new "functions" files in the FEAST toolbox.

Another function that I have implemented is a transpose function inside my feast interface file (FSToolboxR.c). Maybe it must me included on the new "functions" in the FEAST toolbox or in arrayoperations of MItoolbox.

@mbq
Copy link

mbq commented Aug 4, 2016

Certainly:

  1. Macro for OOM handling in checkedCalloc, to avoid links to printf and exit.

Nice to haves:

  1. Static integration of MIToolbox inside FEAST -- it is small and it doesn't seem to be actually shared with any other public software. Integration may allow some smarter compiler optimizations to happen.
  2. Move from doubles to integers for representation of categorical data, or even to (0..numCategories-1, numCategories) structures -- this would be a substantial gain in performance and memory use, also many possible callers would already have data in such form. MATLAB would have to get some compatibility layer, though.

@Craigacp
Copy link
Owner Author

Craigacp commented Aug 4, 2016

@mbq I'm not going to integrate MIToolbox into FEAST as there are a lot of users of the mutual information functions from MATLAB separate from FEAST (judging by the emails I get). That said, refactoring the mutual information functions to expose versions that operate on dense ordered integer arrays would improve FEAST's runtime performance. The macro changes I'm happy to make, could you point me at the CRAN requirements so I can figure out how to handle errors appropriately when used as a library.

@alecuba16 I'll pull out the 2d array operation, it should make things a bit smaller. To be honest there is a lot of things that are shared across FS implementations that could be refactored out, but it's ugly to do while maintaining compatibility with ANSI C for 32-bit Windows MATLAB. I notice you've added support for preselected features, did you only do this for CMIM or for other algorithms as well?

@mbq
Copy link

mbq commented Aug 10, 2016

As a side note, FEAST does not compile with -Wall -Werror due to a few unused variables; this may be a problem for some uses (including, obviously, CRAN publication).

@Craigacp
Copy link
Owner Author

Ok, I've removed the unused variables, I'll push that out later.

@Craigacp
Copy link
Owner Author

Craigacp commented Sep 5, 2016

I'm going to do a bigger refactor of FEAST and MIToolbox to make them look a little more like real projects. It's taken a while before I got enough time to do some coding, but the initial refactor of MIToolbox seems like it will be a little cleaner. Next steps are to fix the Matlab entry point in MIToolbox (as currently it's broken in the dev branch), and then to push the discretisation changes through FEAST.

@Craigacp
Copy link
Owner Author

Craigacp commented Sep 6, 2016

I've updated FEAST to call the same functions in MIToolbox. It's not currently optimised as it doesn't use the int entry points for all mutual information calls (so it's still normalising things a few too many times). The current changes have improved the speed of DISR by about 10%, but there is still more to be done.

Both MIToolbox and FEAST now have src & include directories, and the MATLAB code is in a matlab directory. The makefiles have been updated to include "-Wall -Werror" and both packages compile cleanly. The next time I get a few coding hours I'm going to update FEAST so it only discretises things once which should improve things further. I'll also go through and remove some of the duplicated code from FEAST and generally clean it up, I was focusing on the lower level changes to MIToolbox today.

@mbq what kind of OOM macro is used in the R code?

@alecuba16
Copy link

alecuba16 commented Sep 6, 2016

Hello, I'm back from holidays.

I have the idea of modify or improve feast problems and limitations about the matrix normalization that prevents to be used with real datasets like mine's, otherwise it will be useless in my case.

I will be reviewing over internet how to overcome this limitation.

@Craigacp I have included the preselected only for CMIM as first try, because was the main algorithm that gave the best results with a subset of my real dataset

When you are talking about OOM macro in R , maybe I'm wrong , but R doesn't have this kind of low level memory management, is more like java, there is a garbage collector.

R Garbage collector manual call

Regards

@mbq
Copy link

mbq commented Sep 6, 2016

R has Calloc, Realloc and Free functions for C-like memory management and R_alloc for allocating from the GC pool (which would be inefficient for FEAST/MIToolbox as memory allocated there is usually freed shortly after, way more often than GC sweeps); all of those functions handle OOM on their own, namely jump back to R session generating R error. I though about defining macros like ALLOC_AND_HANDLE_OOM and FREE which could be defined conditionally whether it is MATLAB, plain C, R or whatever.

About this normalisation, it seems almost standard that information-based algorithms only deal with discretised data, at most offer discetisation routines somewhere around (e.g. infotheo, R's MIToolbox). Handling this transparently seems fishy, though, since in principle it should work by somehow fitting proper distributions and calculating integrals for MI and stuff.

@alecuba16
Copy link

alecuba16 commented Sep 6, 2016

@mbq Maybe I'm wrong , but , there is no Calloc, Realloc and free on R interpreter, you have R_alloc for this R-C interface, but is the same as Calloc of C but only works with its data types that are macro's of the C/C++ calloc,etc.

For variables generated outside of this interface, you can deal with GC on the interpreter side or C/C++ functions with the variables generated at C/C++ side.

R_calloc is to deal with variables generated with that interface, then the FEAST code have to be modified to use this interface "typed" variables, example Real vector , then the toolbox will turn into a specifically R code, instead general code with R or Matlab interface as it is now, or could be.

As described in Writing R Extensions you have to deal with the memory reserved on the "binary" part in this case the C code and is totally different than the R memory part.

About normalization, I have used previously infotheo, but the results is very bad due the discretization (freq and same distance bins) I'm looking a alternative way of reduce the memory consumption and be able to normalize with the minimal impact on the results.

@Craigacp
Copy link
Owner Author

Craigacp commented Sep 6, 2016

Calculating a continuous mutual information is really really hard, I've tried to include it in FS algorithms before and it's slow and gives odd results (when using a KDE, and other techniques require juggling or really high dimensions to make them rank equivalent). It's especially hard for things like conditional or joint mutual informations which are necessary for CMIM and JMI. Discretisation is the best way of getting around these issues.

Meyer's R package does include a few discrete entropy estimation corrections which would probably be useful in FEAST/MIToolbox but it'll require a bigger rewrite than I have time for. We found the equal frequency discretisation that is used by that package to be less useful than equal width bins, but there are definitely better discretisation strategies out there. I'm planning on adding an equal width discretisation to MIToolbox once I've figured out how to return the bin size information appropriately to Matlab (as you should discretise once based on your training data, and apply the same bins to your test data).

If R is handling errors by taking control off FEAST when it hits an OOM, then provided it's calling the R allocation code it shouldn't cause any problems right? FEAST & MIToolbox only allocate on the heap through the macro'd calloc function.

@mbq
Copy link

mbq commented Sep 6, 2016

To be precise, my mention of infotheo was about UX; fsx(continuous_x,y) wrongly suggests that fsx can naively work with such data, fsx(discretise(continuous_x),y) is way more explicit.

About alloc, the problem is in checkedCalloc, around here; without MEX macro the if is compiled-in, and pollutes object with links to fprintfand exit, even if CALLOC_FUNC never returns NULL, as when it is R's Calloc.

@Craigacp
Copy link
Owner Author

Craigacp commented Sep 6, 2016

Ah ok. I can fix that. I can also ifdef out the print functions from MIToolbox. I was considering strapping a couple of places with an ifndef R_WRAPPER (or some other symbol if R already defines one) which should make it fairly easy to compile out. That way we might be able to avoid source level changes in FEAST/MIToolbox to make the R wrapper compile cleanly.

I wish I had exceptions to deal with failing to allocate memory, I'd prefer not to have it call exit, but strapping every calloc to check for the return value and passing out a sensible looking error value is going to make the code look pretty different. For Matlab, R (and maybe Python if I get time) I can turn those checks off and assume the runtime system is going to deal with it for me.

@Craigacp
Copy link
Owner Author

Craigacp commented Sep 9, 2016

I've added the first entry point which takes integer values (it's mRMR_D because that's the first file I looked at). I'm going to add the rest of them at some point next week. The code which takes doubles now copies out the inputs, discretises them and then calls the integer code, which makes it about 20% faster than it used to be. Next on the list is the discretisation function, and then I'll start cleaning things up for the R wrapper.

@Craigacp
Copy link
Owner Author

I've pushed all the integer entry points. Things are between 20-30% faster than they used to be, even when using the double entry point. I've not tested the integer entry point for speed on it own, as I haven't written extra MATLAB interfaces.

I've basically finished all the work I want to do apart from adding an equal width discretisation function. Once I've added that in I'm going to merge devel into master and make releases for both FEAST and MIToolbox, giving both of them a major version bump (as all the interfaces have changed name and type).

I modified the checkedCalloc function so it only exits if it's running as a C library. I've also added a COMPILE_R check to MIToolbox to switch over to using Rprintf and R's Calloc and Free which should mean you can use the libraries unmodified. Let me know if it still requires modification to the header files and I'll fix it.

@alecuba16
Copy link

alecuba16 commented Nov 26, 2016

Hello Again @Craigacp, I'm sorry to be a bit disconnected about feast the last two months.

I have an Idea about my memory bounded problem with the MIToolbox, since the little investigation that I have done so far is the limitation on the calculate probabilities.

Optimization strategy 1: Overcome big ranges, more than possible values (array length)

I'm thinking about replace the code at lines 37->66
that contains the full space of states that could be unused by a dual dimension array that stores [pos,count], this may reduce the number of empty states on variables that have very big ranges, this will reduce the reserved memory to the vectorLength*2 of uint type in the worst case (all values are different) instead of the MAXstate(value) that could be huge number.

Or just a selector that analyzes if is better use a fixed vectorLength with all the possible states or vectorLength*2 (pos,count) in the worst case.

Another optimization could be detect if we are able to use one uint in order to store the pos,count checking the maxstate value, and define a mask to store in the same uint the position and count, if it doesn't fit we can use a separate uint element for pos and count or use the same mask mechanism with ulong.

Optimization strategy 2: Remove repeated values , expecting that there are repeated elements.

Previous reserve memory for the possible states, remove the repeated ones , remove the original vector from memory (free) since its not longer used and create another vector with counter values, first use ushort then if they are bigger than the limit, then copy to a uint one.

What do you think? I may try it in the following weeks, since I have to focus again into feature.

I don't know how much memory I can save, I will have to profile with some debugger.

@Craigacp
Copy link
Owner Author

Hi @alecuba16,

The first approach would be the most performant. Replacing that whole array with a real hash map would improve things no end, as it would preserve the O(1) lookup time and keep the complexity of the MI calculation in O(n + k) where n is the number of examples and k is the number of states. The modification you propose would make the lookups into O(k), and thus an MI calculation would be O(nk + k) which might be acceptable. If the aim is to still use the double entry point then this code could live in normaliseArray as that's called before all the feature selection algorithms, and that makes the initial preprocessing O(dnk) rather O(dn), but after that point all the mutual information code is still O(n + k).

Removing repeated elements from the source array isn't going to work, the aim is that the original arrays are never modified by FEAST/MIToolbox so you can use them repeatedly. It would cause more problems with the interface code with other languages as anything that used the int entry points would have memory unexpectedly freed out from under them.

FEAST is pretty memory efficient provided you don't use a wide range of input values. It's really just the overallocation in the JointProbabilityState that wastes memory, and that only occurs if the data is using a wider range of values than is necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants