i-Dentificateur

A simple programme to identify the human language of a document or text fragment.

The methodologies used herein were selected on the basis of:

Garg, Archana, Vishal Gupta, and Manish Jindal. "A Survey of Language Identification Techniques and Applications." Journal of Emerging Technologies in Web Intelligence 6.4 (2014): 388-400.

How Does It Work?

The programme stores a dictionary for each supported language in the form of a Bloom filter.
For an input text, each word is tested against the Bloom filter of every language, and a language's count is incremented whenever the word matches an entry in its dictionary.
Finally, the language whose dictionary has the highest count is declared the winner.

This method is surprisingly effective, able to correctly identify such potentially perplexing text fragments as "Schwarzenegger in Kindergarten Cop".
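
The counting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the project's own BloomFilter.java or Processing.java: the class and method names (LanguageVoteSketch, TinyBloom, filterOf, identify), the filter size, the number of hash functions, and the toy word lists are all assumptions made for the example. It sticks to Java 7 syntax to match the requirement stated under Tech.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of dictionary voting with Bloom filters.
class LanguageVoteSketch {

    /** Tiny Bloom filter: each word sets `hashes` bit positions derived from two hashes. */
    static final class TinyBloom {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        TinyBloom(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        void add(String word) {
            for (int i = 0; i < hashes; i++) {
                bits.set(index(word, i));
            }
        }

        /** True if the word may be present (false positives are possible), false if it is definitely absent. */
        boolean mightContain(String word) {
            for (int i = 0; i < hashes; i++) {
                if (!bits.get(index(word, i))) {
                    return false;
                }
            }
            return true;
        }

        // Double hashing: position_i = h1 + i * h2, reduced to [0, size).
        private int index(String word, int i) {
            int combined = word.hashCode() + i * fnv1a(word); // int overflow simply wraps
            return ((combined % size) + size) % size;
        }

        // 32-bit FNV-1a over the UTF-8 bytes, used as the second hash.
        private static int fnv1a(String word) {
            int hash = 0x811C9DC5;
            for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
                hash ^= (b & 0xFF);
                hash *= 16777619;
            }
            return hash;
        }
    }

    /** Builds a filter from a toy word list (assumed size: 4096 bits, 3 hashes). */
    static TinyBloom filterOf(String... words) {
        TinyBloom filter = new TinyBloom(1 << 12, 3);
        for (String word : words) {
            filter.add(word);
        }
        return filter;
    }

    /** Returns the language with the highest match count, or null if no word matched. */
    static String identify(String text, Map<String, TinyBloom> filters) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String language : filters.keySet()) {
            counts.put(language, 0);
        }
        // Split on anything that is not a letter; compare in lower case.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            for (Map.Entry<String, TinyBloom> entry : filters.entrySet()) {
                if (entry.getValue().mightContain(token)) {
                    counts.put(entry.getKey(), counts.get(entry.getKey()) + 1);
                }
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, TinyBloom> filters = new LinkedHashMap<String, TinyBloom>();
        filters.put("English", filterOf("the", "cat", "sits", "on", "mat"));
        filters.put("Deutsch", filterOf("die", "katze", "sitzt", "auf", "der", "matte"));
        System.out.println(identify("The cat sits on the mat", filters));
    }
}
```

Because a Bloom filter can report a word as present when it is not, an occasional spurious match is possible. The vote is taken over every word in the text, so a few false positives rarely change the winner.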

Version

β

Tech

To work properly:

You should be running Java 7 or higher.

Installation

Download the project zip file.

$ wget https://github.com/h1395010/i-Dentificateur/archive/master.zip

Navigate to the download directory and unzip the archive.

$ unzip master.zip

'cd' into the directory containing the source files.

$ cd i-Dentificateur-master/

Compile with the command

$ javac *.java

Run with the command

$ java iDentificateur

Usage

Execution of the above commands will invoke the User Interface:

Text Fragment

Select the radio button next to the word 'Text', then enter a fragment or sentence in the corresponding box to assess the language in which it is written.

Document

Select the radio button beside the word 'File'. This opens a local file explorer window; navigate to the file of interest and select it to start the evaluation.

Development

Want to contribute? Great!

Pull-requests are quite welcome.

Alternatively, if you want to see a new feature, let me know!

TODO

"a software is never done, you just stop working on it"

Handle exceptions at a higher level in the program and exit if an unusable state is detected
Add support for Chinese (Pinyin) and Russian
Programmatically enforce specific encoding of stored lists and input text
Optimize size of Bloom filters
Unify input UI and output UI
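
For the item on optimizing the size of the Bloom filters, one possible starting point is the standard sizing rule: for n stored words and a target false-positive rate p, use about m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below only evaluates those formulas. The class name BloomSizingSketch and the example figures (100,000 words, 1% false positives) are assumptions for illustration, not measurements of the project's word lists.

```java
// Hypothetical helper for choosing Bloom filter parameters.
class BloomSizingSketch {

    /** Number of bits for n entries at false-positive rate p (0 < p < 1). */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions for n entries stored in m bits (at least 1). */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long words = 100000L;      // assumed dictionary size
        double falsePositives = 0.01; // assumed target rate
        long bits = optimalBits(words, falsePositives);
        System.out.println(bits + " bits, " + optimalHashes(words, bits) + " hash functions");
    }
}
```

With these example figures the rule asks for a little under 10 bits per word, so a lower false-positive target mainly costs memory rather than lookup time.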

Currently Supported Languages

Italiano
Français
English
Deutsch
Español
Nederlands
Português

References

Word lists employed were downloaded from the WinEdt Dictionaries.
The concept of utilizing Bloom filters for this project was inspired, in addition to the above-mentioned paper, by Daniel Spiewak.
This Bloom filter implementation draws on the work of Magnus Skjegstad.

License

MIT

Note: Regarding the placement of braces, this code follows a symmetrical style in lieu of the traditional Java convention.
