Skip to content
Switch branches/tags

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


Polyglot is a language detector written in Python 2.7.

It's heavily inspired in Linguist by GitHub.

This project was born as a learning experience with the Python language. It's meant to be taken seriously but, still, is far from what Linguist is capable of.

We still have a lot to work on.

    	polyglot - Know your languages. A language detector written in Python 2.7.
    	python [options] [args]
        -p	print programming languages only
        -d	print data files only
        -m  print markup files only
        -a	print every file
        --json	used before any of the above to produce JSON output


Polyglot can be used a stand-alone application, it can be imported as a module or it can be used as a backend script producing JSON. Furthermore, there is a debug unknown languages option and several flags to choose from.


When running Polyglot, as an app or by importing it as a module, there are some variants you can opt for. There are flags that narrow down the search to a specific type of file and also a debug option that includes unknown files in the output. You can read about them in the flags and debug chapters.

The only required argument is the file name that can be a relative or absolute path or just a file name (in which case Polyglot will try to find the file path). The optional flag argument must be one of the following strings ["all", "programming", "data", "markup"]. "all" is the default value. Additionally, you can also use the --json option along with one of the flags to receive the output as valid JSON. The debug argument can either be True or False, being False the default value.


Apart from this, there are still some flags that impact the output of polyglot. If you run with -h option you can see all of them:

$ python -h

Polyglot supports the following searches:
    -p  	programming languages only
    -d  	data files only
    -m      markup files only
    -a  	every file
    --json 	used before any of the above to produce JSON output
    -h  	print help menu

The -a flag is used by default if no option is used. The --json flag is by default set to False.


The debug option is a specific variable located at the top of the file which is False by default. With the debug option:

$ python ../polyglot/

** Python: 60.0%

** YAML: 40.0%
Unknown extensions:
'txt', 'pyc'

Stand-alone app

There are two ways of running it as an app: with or without the debug unknown languages' option. The debug option is a specific variable located at the top of the file which is False by default.

Running the program on its own polyglot/ folder in it's simplest form (without any flags or the debug option):

$ python ../polyglot/

** Python: 60.0%

** YAML: 40.0%


If you want to integrate Polyglot in a different application, there is the possibility of importing it as a module. An simple example of this usage would be the following script:

import polyglot

fileName, flag = parseCommandLine()
if flag == -1:

poly = polyglot.Polyglot (fileName, flag)
results = poly.startPolyglot()

print results

Now an example that enables the debug option:

import polyglot

fileName, flag = polyglot.parseCommandLine()
if flag == -1:

poly = polyglot.Polyglot (fileName, flag, True)
results = poly.startPolyglot()

print "Files recognized:\n ", results[0]
print "Files not recognized: \n", results[1]

As you can see, if the debug option is enabled Polyglot will return a tuple containing the language statistics dictionary in it's first index and an array containing every unkown extension found (stuff like build and compiled files will be the vast majority of what you see here).

JSON output

If you want to get JSON output instead of those nasty strings, being it for use in a backend script with PHP or because you want to integrate the module with another language, this would be the result:

$ python --json ../polyglot/

    "Python": 0.6,
    "YAML": 0.4,
    "nFiles": 5


Language Detection

Language detection works in three stages. If any of the stages succeds in identifying the file language, the next stages are not executed.

First, there is the basic detection stage, in which Polyglot tries to match the file extension with a language stored in a .yml file. If a file with a ambiguous extension (like .pl) is found, this stage fails and Polyglot moves on to the next stage of detection.

After that, there is the heuristics stage, in which Polyglot tries to disambiguate between multiple languages that can be assigned to a file.

Lastly, there is the (as of yet non-existent) classifier stage. This feature isn't implemented yet as we would like to solidify the basis of Polyglot before we move on.

Basic Detection

First we run a simple extension analyzer. Since most filename extensions aren't common to several languages, in most cases, specially with simple projects this works like a charm and is quite fast.

We run the runAnalysis recursively. This function differentiates between directories and files. If it's running with a file as an argument it calls updateStats to update the info or, if it finds a directory, it runs itself with every file it finds as an argument.


For extensions that Polyglot can't assign to a single language like '.pl' (Prolog and Perl) or '.h' (headers of most C variants), heuristics are employed to find the language in function of what's written on the file. Some language specific syntax strings (like ':-' for Prolog) are used as a way to disambiguate between languages. For now it's pretty indecent but it'll work just fine.


For extensions that cannot be disambiguated by the application of heuristics we are building a statistical classifier that determines what is the most likely language for the file in question.


A language detector written in Python 2.7.




No packages published