Unexpected translation results #1

AvianFlu opened this Issue · 4 comments



I'm attempting to use your library in an IRC bot that has a twitter feed, to help reduce the non-English tweets that make it through the filtering. The filtering was mostly successful, except for a couple like this. (The first result from your language detector is printed first, then the tweet itself. I have left out the user names attached to the tweets.)

 [ 'pidgin', 0.22029761904761902 ]
 node-db-oracle - Oracle database bindings for Node.js

 [ 'danish', 0.14746268656716421 ]
 Hiring Systems Engineer scaling node.js ...

I can completely understand where an algorithm would have difficulty with the strange text that most Tweets tend to consist of - but do you have any thoughts here? Even if I had to make a list of three or four languages that can count as 'probably English', I'd be okay with it, but I haven't really used this kind of library before - so I'd like your input.

Thanks again for putting this together - it's a big improvement for me already!


Hi!

I've checked these tweets with the original library (PEAR::LanguageDetect) and here is what came out:

node-db-oracle - Oracle database bindings for Node.js
[pidgin] => 0.220297619048
[english] => 0.16375
[danish] => 0.160119047619
[dutch] => 0.15625

Hiring Systems Engineer scaling node.js ...
[danish] => 0.147462686567
[norwegian] => 0.147213930348
[latin] => 0.142885572139
[portuguese] => 0.134676616915
[dutch] => 0.132039800995

Same as node-language-detect. So these results are inaccurate because the input is so short. If you want to be sure about a tweet's language, the first result must have a score over 0.3-0.4; otherwise, chances are the detection will be wrong.
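For what it's worth, the score cutoff suggested above could be sketched roughly like this. The helper name `topResultIsReliable` and the default of 0.3 are assumptions for illustration, not part of the library; the function only expects an array of `[language, score]` pairs sorted best-first, which is the shape printed earlier in this thread.

```javascript
// Hypothetical helper (not part of node-language-detect): decides whether
// the best candidate scores high enough to be trusted.
// `results` is an array of [language, score] pairs, best match first.
function topResultIsReliable(results, minScore = 0.3) {
  if (!Array.isArray(results) || results.length === 0) return false;
  const [, score] = results[0];
  return score >= minScore;
}

// Scores quoted above for "node-db-oracle - Oracle database bindings for Node.js".
const oracleTweet = [
  ['pidgin', 0.220297619048],
  ['english', 0.16375],
  ['danish', 0.160119047619],
  ['dutch', 0.15625],
];

console.log(topResultIsReliable(oracleTweet)); // false: 0.22 is below the 0.3 cutoff
```

With a check like this, both tweets from the report would be flagged as "unknown" rather than mislabeled as pidgin or danish. Where exactly in the 0.3-0.4 range to put the cutoff probably needs tuning on real tweets.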

@FGRibreau FGRibreau closed this

Thanks! I figured it would ultimately be Twitter's fault. :-P


Don't get me wrong, it's just that the input data is insufficient sometimes :s


Yeah, and I'm sure that the shortlink URLs some tweets contain don't help, either. I'll just put together a 'best guess' system and see what I can do.
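A 'best guess' system along those lines might look something like the sketch below. Both helper names (`stripUrls`, `probablyEnglish`), the `topN` default of 3, and the example.com link are made up for illustration; nothing here comes from the library itself. The idea is to drop links before detection, then accept a tweet when 'english' shows up anywhere among the top few candidates.

```javascript
// Hypothetical helper: remove http(s) links so shortlink fragments
// do not end up in the text handed to the detector.
function stripUrls(text) {
  return text.replace(/https?:\/\/\S+/g, ' ').replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: treat a tweet as "probably English" when 'english'
// is among the first `topN` candidates.
// `results` is an array of [language, score] pairs, best match first.
function probablyEnglish(results, topN = 3) {
  return results.slice(0, topN).some(([language]) => language === 'english');
}

// Scores quoted earlier in this thread for the two example tweets.
const oracleTweet = [
  ['pidgin', 0.220297619048],
  ['english', 0.16375],
  ['danish', 0.160119047619],
  ['dutch', 0.15625],
];
const hiringTweet = [
  ['danish', 0.147462686567],
  ['norwegian', 0.147213930348],
  ['latin', 0.142885572139],
  ['portuguese', 0.134676616915],
  ['dutch', 0.132039800995],
];

console.log(stripUrls('Hiring Systems Engineer http://example.com/abc scaling node.js'));
console.log(probablyEnglish(oracleTweet)); // true: 'english' is the second candidate
console.log(probablyEnglish(hiringTweet)); // false: 'english' is not among the listed candidates
```

Note that the second tweet would still be rejected, since English is not among the candidates listed for it, so a heuristic like this only helps with part of the problem.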
