Two lines of input don't work #69

Closed
ElBartel opened this issue Sep 30, 2018 · 10 comments

@ElBartel

The command line

fst map two.csv two.fst

does work, but the following command:

fst range -o two.fst

prints only a single line, and the command

fst dot two.fst | dot -Tpng > two.png

produces a graph with three nodes, which is OK. But it has only a single final node, a wrong output (1) on label 'a', and no output on label 'b'.

@ElBartel
Author

The contents of the file two.csv were:

a,2
ab,1
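
One possible reading of the graph described above, offered as an assumption rather than something confirmed in this thread: a transducer of this kind can store part of a value as a final output on an accepting state, in addition to the outputs on its edges, and a rendering that shows only edge outputs would then look incomplete. The sketch below hand-builds such a machine for the two rows in plain Rust. The `State` type, `two_key_machine`, and `lookup` are hypothetical names for illustration and are not the fst crate's API.

```rust
use std::collections::HashMap;

/// One state of a tiny hand-built transducer (hypothetical type, not from the fst crate).
struct State {
    is_final: bool,
    /// Added to the running sum only when a key ends in this state.
    final_output: u64,
    /// Maps an input byte to (edge output, index of the target state).
    edges: HashMap<u8, (u64, usize)>,
}

/// Assumed layout for the keys "a" -> 2 and "ab" -> 1:
/// state 0 --a/1--> state 1 (final, final output 1) --b/0--> state 2 (final).
fn two_key_machine() -> Vec<State> {
    let mut root_edges = HashMap::new();
    root_edges.insert(b'a', (1, 1));
    let mut mid_edges = HashMap::new();
    mid_edges.insert(b'b', (0, 2));
    vec![
        State { is_final: false, final_output: 0, edges: root_edges },
        State { is_final: true, final_output: 1, edges: mid_edges },
        State { is_final: true, final_output: 0, edges: HashMap::new() },
    ]
}

/// Follow the key byte by byte, summing edge outputs, then add the
/// final output of the state where the key ends.
fn lookup(states: &[State], key: &[u8]) -> Option<u64> {
    let mut cur = 0usize;
    let mut sum = 0u64;
    for b in key {
        let &(out, next) = states[cur].edges.get(b)?;
        sum += out;
        cur = next;
    }
    let state = &states[cur];
    if state.is_final {
        Some(sum + state.final_output)
    } else {
        None
    }
}

fn main() {
    let machine = two_key_machine();
    println!("a  -> {:?}", lookup(&machine, b"a"));
    println!("ab -> {:?}", lookup(&machine, b"ab"));
}
```

With this layout the edge labelled 'a' carries 1 and the edge labelled 'b' carries nothing, yet both lookups return the values from two.csv. That would be consistent with the graph in the report, provided the rendering omits final outputs.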

@BurntSushi
Owner

I can reproduce this issue. It would be helpful to have a Rust program using the fst library that reproduces this issue.

@ElBartel
Author

That means I have to write my first Rust program.
I came across this after reading your blog post "Index 1,600,000,000 Keys with Automata and Rust" and trying to understand what output should be attached to the edges for these two strings.

But OK, there's a first time for everything. I'll try.

@ElBartel
Author

ElBartel commented Oct 1, 2018

So, finally I got it compiled :-)
Find below my first Rust program.
It shows the expected behaviour of the API.
But this behaviour cannot be explained by looking at the generated automaton.

```rust
extern crate fst;

use fst::{Map, MapBuilder};

fn main() {
    // Build an in-memory map from the two rows of two.csv.
    let mut map_builder = MapBuilder::memory();
    map_builder.insert("a", 2).unwrap();
    map_builder.insert("ab", 1).unwrap();

    let fst_bytes = map_builder.into_inner().unwrap();
    let map = Map::from_bytes(fst_bytes).unwrap();

    // Lookups through the map API.
    println!("contains a: {}", map.contains_key("a"));
    println!("contains ab: {}", map.contains_key("ab"));

    println!("value of a:  {:?}", map.get("a"));
    println!("value of ab: {:?}", map.get("ab"));

    println!("Up to this point everything works as expected.");
    println!("But this result cannot be explained by the created automaton:");

    // Dump the raw nodes of the underlying automaton.
    let fst = map.as_fst();
    let root = fst.root();

    println!("root: {:?}", root);
    println!("n21: {:?}", fst.node(21));
    println!("n0: {:?}", fst.node(0));

    println!("This automaton reflects exactly the behaviour of the commandline.");
}
```

@BurntSushi
Owner

@ElBartel Thanks so much! Awesome work. I'll try to take a look at this soon.

@DiSToAGe

I have the same problem. I'm using fst-bin (not tested with the fst library and my own code).
It seems the first line of the CSV imported with "fst map" is always dropped, but ONLY if you let fst-bin sort it.
If you sort your CSV externally (sort test.csv > test-sorted.csv ; fst map --sorted test-sorted.csv test.fst ; fst query test.csv '.*') there is no problem. So perhaps there is a bug in the sort routine in fst-bin?

@BurntSushi
Owner

Are you sure the first row isn't being skipped because it's being interpreted as a header row and thus is not part of the data?

@DiSToAGe

It seems you are right. But there appears to be no consistency between fst-bin map/set and sorted/unsorted input:

set + external sorted => no header considered
set + internal sort => no header considered
map + external sorted => no header considered
map + internal sort => WITH header considered

The problem could be in run_unsorted() in fst-bin/src/cmd/map.rs.
But I'm not sure whether it is actually in util::ConcatCsv.

For comparison:

In fst-bin/src/cmd/map.rs, run_sorted() explicitly sets ".has_headers(false)" on csv::ReaderBuilder::new(),

but "impl Iterator for ConcatCsv" does not seem to set ".has_headers(false)" when it calls csv::Reader::from_reader(rdr).

@BurntSushi
Owner

Sorry, but I can't make sense of what you're saying. Please provide a reproducible test case, along with inputs, actual output and expected output.

@DiSToAGe

DiSToAGe commented Sep 6, 2019

No problem, here it is.

import csv for set

I made a simple "test.csv" file with the following content (the lines are not in sorted order):

test1a
test3a
test2a

1. set "external" sort

The sorting is done with the Unix /usr/bin/sort command, followed by creating the set with fst set and a query that lists everything that was imported.

sort test.csv > test-sorted.csv ; fst set --sorted test-sorted.csv test.fst ; fst grep test.fst '.*'

Result (3 lines correctly imported and correctly sorted, so the first line of the CSV is not treated as a header):

test1a
test2a
test3a

2. set internal sort

The sorting is done in Rust.

( rm test.fst )
fst set test.csv test.fst ; fst grep test.fst '.*'

Result (same as the previous test; the first line is not treated as a header):

test1a
test2a
test3a

import csv for map

I modified "test.csv" to add numbers for the map. As before, the lines are not in sorted order.

test1a,1
test3a,3
test2a,2

3. map "external" sort

The sorting is done with the Unix sort command.

( rm test-sorted.csv ; rm test.fst )
sort test.csv > test-sorted.csv ; fst map --sorted test-sorted.csv test.fst ; fst grep test.fst '.*'

Result (same again: 3 lines, correctly sorted, and the first line is not treated as a header):

test1a
test2a
test3a

4. map internal sort

The sorting is done in Rust.

( rm test.fst )
fst map test.csv test.fst ; fst grep test.fst '.*'

Result (!!! 2 lines !!! They are correctly sorted, but it seems the first line of the CSV is treated as a header. Why does this differ from the 3 previous tests?):

test2a
test3a

I hope this is easier to understand.

That is why I suspected the difference comes from run_sorted(), which sets ".has_headers(false)", while run_unsorted() does not.

(run_sorted() with ".has_headers(false)")
https://github.com/BurntSushi/fst/blob/master/fst-bin/src/cmd/map.rs
(usage of "csv::Reader::from_reader(rdr);" in ConcatCsv implementation, called from run_unsorted())
https://github.com/BurntSushi/fst/blob/master/fst-bin/src/util.rs
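
The suspected effect can be sketched without the csv crate. The csv crate's readers treat the first record as a header by default unless `has_headers(false)` is set; the hypothetical `read_rows` helper below imitates that flag on the map test data from this comment. It only illustrates the suspected cause and is not the code in fst-bin.

```rust
/// Parse "key,value" lines and sort them by key.
/// `has_headers` imitates the CSV reader flag: when true, the first
/// line is consumed as a header and never reaches the data rows.
/// (Hypothetical helper for illustration; not part of fst-bin.)
fn read_rows(input: &str, has_headers: bool) -> Vec<(String, u64)> {
    let mut rows: Vec<(String, u64)> = input
        .lines()
        .skip(if has_headers { 1 } else { 0 })
        .filter_map(|line| {
            let (key, value) = line.split_once(',')?;
            let value = value.trim().parse::<u64>().ok()?;
            Some((key.to_string(), value))
        })
        .collect();
    // Keys must be in sorted order before they can be inserted into an fst.
    rows.sort();
    rows
}

fn main() {
    let csv = "test1a,1\ntest3a,3\ntest2a,2\n";

    // Header handling left on: the first row is dropped.
    let with_header = read_rows(csv, true);
    println!("has_headers = true:  {:?}", with_header);

    // Header handling disabled: every row is data.
    let without_header = read_rows(csv, false);
    println!("has_headers = false: {:?}", without_header);
}
```

With the flag on, only test2a and test3a survive, which matches the result of test 4. With it off, all three keys are kept, as in tests 1 to 3.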

3 participants