How can I see the result of simplifiers/tokenizers on Strings rather than just the final result #26

Open
ijabz opened this issue Nov 28, 2016 · 5 comments

@ijabz

ijabz commented Nov 28, 2016

So I may typically have:

StringMetric metric = with(new CosineSimilarity<String>())
		.simplify(Simplifiers.toLowerCase())
		.simplify(Simplifiers.removeDiacritics())
		.simplify(new SpecialReplacementsSimplifier())
		.tokenize(Tokenizers.whitespace())
		.build();

float result = metric.compare(s1, s2);

For debugging, I would like an easy way to see the final step before the cosine similarity, i.e. the contents of the sets created by applying the simplifiers and then the tokenizer(s). Is this possible?

@mpkorstanje
Contributor

mpkorstanje commented Nov 28, 2016

Sure. You can put a breakpoint in CosineSimilarity.java at line 62.

Or, if you want to log what goes in, the builder relies on interfaces rather than concrete implementations, so you can wrap the metric in your own metric.

But I think you should write unit tests to validate that your SpecialReplacementsSimplifier works as it should, rather than relying on visual inspection.

MultisetMetric<String> loggingMetric = new MultisetMetric<String>() {

	final CosineSimilarity<String> cos = new CosineSimilarity<>();

	@Override
	public float compare(Multiset<String> a, Multiset<String> b) {
		System.out.println("CosineSimilarity [");
		System.out.println("a: " + a);
		System.out.println("b: " + b);
		System.out.println("]");
		return cos.compare(a,b);
	}
};

StringMetric metric = with(loggingMetric)
		.simplify(Simplifiers.toLowerCase())
		.simplify(Simplifiers.removeDiacritics())
		.simplify(new SpecialReplacementsSimplifier())
		.tokenize(Tokenizers.whitespace())
		.build();
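
For the unit-test route suggested above, a minimal sketch could look like the following. The real SpecialReplacementsSimplifier is not shown in this thread, so the `simplify` method and its replacement table ("&" and "+" both becoming "and") are hypothetical stand-ins, and the checks use plain Java rather than any particular test framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SpecialReplacementsCheck {

    // Hypothetical stand-in for SpecialReplacementsSimplifier: the actual
    // replacement rules are not part of this thread and are assumed here.
    static String simplify(String input) {
        Map<String, String> replacements = new LinkedHashMap<>();
        replacements.put("&", "and");
        replacements.put("+", "and");

        String out = input;
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            // String.replace substitutes literal text, not a regex.
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Fails loudly with both values so a broken rule is easy to spot.
    static void check(String input, String expected) {
        String actual = simplify(input);
        if (!expected.equals(actual)) {
            throw new AssertionError(
                    "simplify(\"" + input + "\") gave \"" + actual
                    + "\", expected \"" + expected + "\"");
        }
    }

    public static void main(String[] args) {
        check("rock & roll", "rock and roll");
        check("drum + bass", "drum and bass");
        check("no specials", "no specials"); // untouched input passes through
    }
}
```

Testing the simplifier in isolation like this keeps each expectation explicit, which the logging wrapper cannot do.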

@ijabz
Author

ijabz commented Nov 29, 2016

Thanks, that works, but ideally I would like it to output the two original strings as well. Of course I can output these myself before making the compare call, but in a multithreaded system other calls may get interleaved. I wanted this to check my whole simmetrics stack; access to the tokenized sets (as you've shown me above) is needed to write unit tests anyway.

@mpkorstanje
Contributor

Then you shouldn't use the builder. Its design relies on being indifferent towards the individual components as long as they adhere to their interface.
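
A minimal sketch of what composing the steps by hand might look like, so that the original strings, the token lists, and the score travel together and can be printed in a single call. Everything here is an assumption for illustration: `TracedCosine`, `Trace`, `simplify`, `tokenize`, and `cosine` are hypothetical stand-ins written in plain Java, not simmetrics classes, and the cosine over token counts (including the convention for empty inputs) may differ in detail from simmetrics' CosineSimilarity.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TracedCosine {

    /** Everything one comparison produced, kept together for single-call logging. */
    static final class Trace {
        public final String a, b;
        public final List<String> tokensA, tokensB;
        public final float score;

        Trace(String a, String b, List<String> tokensA, List<String> tokensB, float score) {
            this.a = a;
            this.b = b;
            this.tokensA = tokensA;
            this.tokensB = tokensB;
            this.score = score;
        }

        @Override
        public String toString() {
            return "compare[a=\"" + a + "\" -> " + tokensA
                    + ", b=\"" + b + "\" -> " + tokensB
                    + ", score=" + score + "]";
        }
    }

    // Stand-in for the simplifier chain: lower-case, then strip diacritics.
    static String simplify(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lower, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Stand-in for a whitespace tokenizer; blank input yields no tokens.
    static List<String> tokenize(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new ArrayList<String>() : Arrays.asList(trimmed.split("\\s+"));
    }

    static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : tokens) {
            c.merge(t, 1, Integer::sum);
        }
        return c;
    }

    static double norm(Map<String, Integer> c) {
        double sum = 0;
        for (int n : c.values()) {
            sum += (double) n * n;
        }
        return Math.sqrt(sum);
    }

    // Plain cosine over token counts. Assumed convention: two empty inputs
    // score 1, exactly one empty input scores 0.
    static float cosine(List<String> a, List<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0f;
        if (a.isEmpty() || b.isEmpty()) return 0.0f;
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            Integer other = cb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        return (float) (dot / (norm(ca) * norm(cb)));
    }

    // Runs the whole stack and returns every intermediate result at once.
    static Trace compare(String a, String b) {
        List<String> ta = tokenize(simplify(a));
        List<String> tb = tokenize(simplify(b));
        return new Trace(a, b, ta, tb, cosine(ta, tb));
    }
}
```

With something like this, `System.out.println(TracedCosine.compare(s1, s2))` emits one line per comparison, so the inputs and their token lists stay together even when several threads are logging. The same `Trace` object can also be asserted on directly in unit tests.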

@ijabz
Author

ijabz commented Nov 29, 2016

If you say so, though it would seem quite useful to have a way of seeing the effects of a builder on some inputs without having to break down the individual steps.

@mpkorstanje
Contributor

What would you do with this information?
