This repository has been archived by the owner on Oct 20, 2022. It is now read-only.
Here are some explanations of the tools needed to evaluate similarity measurement using base 64 information transcoding.
The purpose
Base 64 is a generic information encoding based on an alphabet of 64 bytes, which frees 64 other byte values; those can be used to improve the similarity distance by replacing run-length encoding with a more efficient algorithm. Another way forward would consist of applying a "combinatorial pattern matching" algorithm to evaluate direct byte-sequence segmentation (the delimiters would be coded with the 64 released bytes after alphabet compression).
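As a quick illustration of the released byte values, the following sketch (standard library only) checks that a base64-encoded stream draws on at most the 64 alphabet bytes plus the `=` padding character, so every remaining byte value is free to serve as a delimiter:

```python
import base64
import os

# Encode an arbitrary binary payload to base64.
data = os.urandom(4096)
encoded = base64.b64encode(data)

# The encoded stream uses at most the 64 alphabet bytes plus '=' padding;
# every other byte value is released and could serve as a delimiter.
used = set(encoded)
free = set(range(256)) - used - {ord('=')}
print(len(used), len(free))
```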
In practice, this means operating a reversible alphabet compression for data coding.
A first step for this experiment consists of measuring, with the current algorithm, the similarity of information coded in base64; base64 would become a pivot format if good results are obtained.
Globally, we just have to add a base 64 transcoding option to the data-preparation step.
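A minimal sketch of such an option in data preparation; the `prepare_file` helper and its `to_base64` flag are assumptions for illustration, not the project's actual interface:

```python
import base64
import tempfile
from pathlib import Path

def prepare_file(path, to_base64=False):
    """Read a file as bytes, optionally transcoding it to base64.

    Hypothetical helper: the real data-preparation interface may differ.
    """
    data = Path(path).read_bytes()
    if to_base64:
        data = base64.b64encode(data)
    return data

# Quick round-trip check on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01binary payload\xff")
    name = tmp.name
prepared = prepare_file(name, to_base64=True)
```

Since base64 coding is reversible, `base64.b64decode(prepared)` restores the original bytes exactly, which is what makes it usable as a pivot format.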
For pictures
We need to preserve the linearity of the bitmap, so transcoding to a raw format may be necessary before base64 coding. If possible, maintain the -raw option.
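Assuming the picture has already been decoded to a raw row-major pixel buffer, base64 is a pure byte-stream transform, so the bitmap's linearity survives a round trip; a small sketch with a hypothetical 4×2 grayscale bitmap:

```python
import base64

# Hypothetical 4x2 grayscale bitmap, one byte per pixel, rows stored consecutively.
width, height = 4, 2
raw = bytes(range(width * height))  # row-major raw buffer

encoded = base64.b64encode(raw)
decoded = base64.b64decode(encoded)

# Base64 is a pure byte-stream transcoding: decoding restores the exact
# row-major layout, so the bitmap's linearity is preserved.
assert decoded == raw
```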
For other content
Generic content (a raw PDF, for example): add a base64 coding facility.
Work on metadata (a vector of extracted words, for example): transcode the extracted data to base 64 and preserve its alignment with the source files through the canonical filename (i.e. the filename without its extension), to permit thumbnail computation in the graph program with the -src option.
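A sketch of that metadata path; the `transcode_word_vector` helper is hypothetical and only illustrates the alignment scheme (the canonical filename is the source name with its extension stripped), not the graph program's -src handling:

```python
import base64
from pathlib import Path

def transcode_word_vector(words, source_name):
    """Base64-encode each extracted word and pair the result with the
    canonical filename (i.e. the source filename without its extension).

    Hypothetical helper illustrating the alignment scheme only.
    """
    canonical = Path(source_name).stem  # e.g. "report.pdf" -> "report"
    encoded = [base64.b64encode(w.encode("utf-8")).decode("ascii") for w in words]
    return canonical, encoded

canonical, encoded = transcode_word_vector(["alpha", "beta"], "report.pdf")
```

Keying the transcoded vector on the canonical name means the base64 metadata file and its source (and thus its thumbnail) can be matched back up without any extra index.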
Is that OK?