Juan Pablo Ruiz de Chávez Diez de Urdanivia
Joaquín Badillo Granillo
This repository contains a Mix project that highlights a file using the syntax of Python code; it is therefore intended to be used with Python files.
You will need to have Elixir installed on your computer. There are many installation methods, and you should follow the one that suits your operating system and preferences. For Ubuntu users we recommend using asdf or Docker to have control over multiple versions of Elixir.
The Mix project exposes two main public functions: highlight/2 and highlight/1. Documentation for this module has been generated with docstrings and ExDoc, and it can be found in the docs directory. There you will also find markdown files listing the tokens we decided to highlight, some examples, and information regarding the time complexity of the lexer.
If you are unfamiliar with mix but want to use this project, just move to the lexer directory and run Elixir interactively with the following command:
$ iex -S mix
This command also compiles and loads the modules from the lib directory, so you will have access to the public functions of the Lexer module.
To highlight a file with a path path/to/file.py and get the results in the directory dir, you can run the following command:
iex> Lexer.highlight("path/to/file.py", "dir")
The directory will be created in the same place as the file. The dir argument is optional: if you do not provide it, the function will try to create a default directory called highlighted in the same place as the file.
iex> Lexer.highlight("path/to/file.py")
Some examples of code already passed through this function can be found in the /lexer/examples directory.
You might also need to install some dependencies in order to run the project. If this is the case, run the following command in the lexer directory:
mix deps.get
The only dependency is ExDoc, which is used to generate the documentation that is shown in the GitHub Pages.
This document describes the different tokens and characters recognized from the Python syntax. These keywords and special characters can all be evaluated by the program found in this repository, as each category is painted in a different color for differentiation purposes.
as | def | finally | import | return |
assert | del | for | lambda | try |
break | elif | from | None | while |
class | else | global | pass | with |
continue | except | if | raise | yield |
Color code: #c792ea;
Text inside single or double quotes:
- "This is a string"
- 'This is also a string'
Color code: #c3e88d;
- Integers - e.g. 1, 2, 3
- Floating point numbers - e.g. 1.0, 1.5, 0.8
- Scientific notation - e.g. 1e10, -2e5, 8.0e-2
- Complex numbers - e.g. 1+1j, 2.0-2j, -3+4j
Color code: #f78c6c;
- True
- False
Color code: #bdded4;
+ | - | * | / | % | ** | // | << | >> | & |
| | ^ | ~ | < | <= | > | >= | <> | != | == |
+= | -= | *= | /= | //= | %= | >>= | <<= | **= | ^= |
&= | |= | = |
Color code: #ffe68f;
, | : | . | ` | \ | ; | -> |
Color code: #89ddff;
( | ) | [ | ] | { | } |
Color code: #ffa053;
Built-in data types by class name
- | - | - | - | - |
---|---|---|---|---|
str | int | float | complex | list |
tuple | range | dict | set | frozenset |
bool | bytes | bytearray | memoryview | NoneType |
Color code: #fce1f7;
Variables, constants and function names. These consist of alphanumeric and underscore characters, although they cannot start with a number.
Color code: #82aaff;
Anything after a # sign and before a line break
Color code: #b6b6b6;
spaces = ~r<^(\s+)>
keywords = ~r<^(as|assert|break|class|continue|def|del|elif|else|except|finally|for|from|global|if|import|lambda|None|pass|raise|return|try|while|with|yield)\b>
strings = ~r<^("[^"]*"|'[^']*')>
comments = ~r<^(#.*)>
numbers = ~r<^([+-]?\d+(\.\d+)?(e[+-]?\d+)?([+-]\d+(\.\d+)?(e[+-]?\d+)?j)?)>
bools = ~r<^(True|False)\b>
dtypes = ~r<^(str|int|float|complex|list|tuple|range|dict|set|frozenset|bool|bytes|bytearray|memoryview|NoneType)\b>
operators = ~r[^(\*\*?=?|<<?=?|>>?=?|\^=?|\|=?|&=?|%=?|\/?\/=?|-=?|\+=?|==?|!=?|~|(and|or|is( not)?|(not )?in|not)\b)]
delimiters = ~r<^[.:,;`\\]>
brackets = ~r<^[()\[\]{}]>
ids = ~r<^([a-zA-Z_]\w*)>
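To make the matching procedure concrete, here is a hedged Python translation of these patterns (names such as TOKEN_PATTERNS and tokenize are ours, and the operator alternation is slightly simplified; the order mirrors the Elixir list, where earlier entries win):

```python
import re

# Hypothetical Python translation of the lexer's anchored patterns.
# Order matters: the first pattern that matches at the current position wins.
TOKEN_PATTERNS = [
    ("space",      r"\s+"),
    ("keyword",    r"(?:as|assert|break|class|continue|def|del|elif|else|"
                   r"except|finally|for|from|global|if|import|lambda|None|"
                   r"pass|raise|return|try|while|with|yield)\b"),
    ("string",     r"\"[^\"]*\"|'[^']*'"),
    ("comment",    r"#.*"),
    ("number",     r"[+-]?\d+(?:\.\d+)?(?:e[+-]?\d+)?"
                   r"(?:[+-]\d+(?:\.\d+)?(?:e[+-]?\d+)?j)?"),
    ("bool",       r"(?:True|False)\b"),
    ("dtype",      r"(?:str|int|float|complex|list|tuple|range|dict|set|"
                   r"frozenset|bool|bytes|bytearray|memoryview|NoneType)\b"),
    ("delimiter",  r"[.:,;`\\]|->"),
    ("operator",   r"\*\*?=?|<<?=?|>>?=?|\^=?|\|=?|&=?|%=?|/?/=?|-=?|\+=?|"
                   r"==?|!=?|~|(?:and|or|is(?: not)?|(?:not )?in|not)\b"),
    ("bracket",    r"[()\[\]{}]"),
    ("identifier", r"[a-zA-Z_]\w*"),
]

def tokenize(line):
    """Repeatedly match the first pattern that fits at the start of the line."""
    tokens = []
    while line:
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, line)
            if m:
                tokens.append((name, m.group(0)))
                line = line[m.end():]
                break
        else:
            raise ValueError(f"invalid syntax: {line!r}")
    return tokens

print(tokenize("x = 1 # ok"))
```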
Creating a syntax highlighter is always a challenge, but it is a very interesting one. There were different approaches we could take to solve this problem. The first was to evaluate the code with a DFA (state machine), going through each character. The other possible solution, which was a little easier but still very challenging, was to analyze the code using different regular expressions. Because of its simplicity and flexibility, we decided to use this alternative as our proposed solution to the problem.
The first thing we did was to create a regular expression for each token, and then analyze each line of the document. As the challenge was to create a syntax highlighter, it was imperative to create functions to initialize both a CSS and an HTML document. Within these documents we store the styles used to color the tokens, and the HTML tags needed to display the code in a web browser.
The next step was to analyze each line of the document. Once this final step was done, the only thing left was to test our code and make sure it was optimized and working properly. This last point led us to a very interesting question: what is the time complexity of our code?
First let's find the time complexities of some helper functions that we will use in the Lexer
module.
defp find_token(line) do
[{:space, ~r<^(\s+)>},
{:keyword, ~r<^(as|assert|break|class|continue|def|del|elif|else|except|finally|for|from|global|if|import|lambda|None|pass|raise|return|self|try|while|with|yield)\b>},
{:string, ~r<^("[^"]*"|'[^']*')>},
{:comment, ~r<^(#.*)>},
{:number, ~r<^([+-]?\d+(\.\d+)?(e[+-]?\d+)?([+-]\d+(\.\d+)?(e[+-]?\d+)?j)?)>},
{:bool, ~r<^(True|False)\b>},
{:dtype, ~r<^(str|int|float|complex|list|tuple|range|dict|set|frozenset|bool|bytes|bytearray|memoryview|NoneType)\b>},
{:delimiter, ~r/^([.:,;`\\]|->)/ },
{:operator, ~r[^(\*\*?=?|<<?=?|>>?=?|\^=?|\|=?|&=?|%=?|\/?\/=?|-=?|\+=?|==?|!=?|~|(and|or|is( not)?|(not )?in|not)\b)]},
{:bracket, ~r<^[()\[\]{}]>},
{:identifier, ~r<^([a-zA-Z_]\w*)>}]
|> Enum.find(fn {_token, regex} -> Regex.match?(regex, line) end)
|> case do
{token, regex} -> {:ok, Regex.split(regex, line, trim: true, include_captures: true), token}
nil -> {:error, line}
end
end
Let n be the number of characters in a line. The Enum.find/2 call in the find_token/1 function tries a fixed list of regular expressions, and every pattern is anchored at the start of the line, so each attempt costs at most time proportional to the line length. The worst-case time complexity of the find_token/1 function is therefore O(n).
defp highlight_line(line, fileIn) do
case find_token(line) do
{:ok, [content | [rest | _list]], type} ->
recursive_write(fileIn, content, type, rest)
{:ok, [content | _rest], type} ->
File.write(fileIn, ~s(<span class="#{type}">#{content}</span>), [:append])
{:error, line} ->
File.write(fileIn, ~s(<span class="error">Invalid syntax</span> > <span class="invalid">#{line}</span>), [:append])
end
end
Now looking at the highlight_line/2 function, we can see that we call the find_token/1 procedure on the line and try to match its result against different patterns. Checking the patterns takes constant time, and the find_token/1 function has a complexity of O(n). But what about the recursive_write/4 function? Let us derive its complexity.
defp recursive_write(fileIn, content, type, rest) do
if type == :space,
do: File.write(fileIn, content, [:append]),
else: File.write(fileIn, ~s(<span class="#{type}">#{content}</span>), [:append])
highlight_line(rest, fileIn)
end
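As a rough illustration (not the project's code), the same mutual recursion can be sketched in Python, with spaces written verbatim and every other token wrapped in a span, just as recursive_write/4 does:

```python
import re

# Simplified stand-ins for the lexer's patterns (hypothetical names).
SPACE = re.compile(r"\s+")
WORD = re.compile(r"\w+")

def highlight_line(line: str) -> str:
    """Mirror highlight_line/2: find a token, then hand off to recursive_write."""
    if not line:
        return ""
    for kind, pattern in (("space", SPACE), ("identifier", WORD)):
        m = pattern.match(line)
        if m:
            return recursive_write(m.group(0), kind, line[m.end():])
    # Anything else is treated as a one-character delimiter for this sketch.
    return recursive_write(line[0], "delimiter", line[1:])

def recursive_write(content: str, kind: str, rest: str) -> str:
    """Mirror recursive_write/4: emit the content, then recurse on the rest."""
    if kind == "space":
        emitted = content                       # spaces are written verbatim
    else:
        emitted = f'<span class="{kind}">{content}</span>'
    return emitted + highlight_line(rest)

print(highlight_line("a = b"))
```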
Recursive write is a simple function: it writes the content to the file verbatim if the token type is a space, otherwise it wraps the content in the corresponding HTML tag. However, it also calls the highlight_line/2 function with the rest of the line. Writing to the file is as expensive as the number of characters written, so a single call costs O(m), where m is the length of the matched content. The recursive call to highlight_line/2 receives a strictly shorter line, of length n - m, and thus the longest expression that could match will be smaller. The running time therefore satisfies the recurrence T(n) = T(n - m) + O(m); since every character of the line is matched and written a constant number of times, solving the recurrence (for instance with the substitution method described in Cormen's Introduction to Algorithms) shows that the complexity of highlighting a whole line is indeed linear, O(n).
Now that we have derived the time complexity of the highlight_line/2 function, we can look at the main public function, highlight/2, and finally show that the overall time complexity is linear in the size of the input file.
def highlight(fileIn, dir) do
py_path = Path.expand(fileIn)
css_file = ~s(#{Path.dirname(fileIn)}/#{dir}/style.css)
html_file = ~s(#{Path.dirname(fileIn)}/#{dir}/index.html)
try do
IO.puts(~s<Trying to create directory #{dir}>)
File.mkdir_p(~s(#{Path.dirname(fileIn)}/#{dir}/))
|> case do
:ok -> :ok
{:error, _reason} -> raise "Failed to create directory"
end
IO.puts(~s<Writing CSS file in #{dir} directory>)
create_css(css_file)
IO.puts(~s<Writing HTML file in #{dir} directory...>)
create_html(html_file)
File.stream!(py_path)
|> Enum.each(&highlight_line(&1, html_file))
File.write(html_file, "</code></pre>\n\t</body>\n</html>", [:append])
rescue
exception ->
IO.puts("Process Failed. Reason: #{inspect(exception)}")
end
end
As we can see, the highlight function first initializes its variables and expands the paths. These lines have a constant time complexity, O(1).
The next two functions, create_css and create_html, write fixed-size boilerplate, so they also take constant time.
Let us define the total number of characters in the file as c. The File.stream! function reads the text file line by line and sends each line to the highlight_line/2 function. Reading a line takes time linear in its length and highlight_line/2 has the same complexity, so each streamed line costs time linear in its length; summing over every line, the overall time complexity of highlight/2 is O(c).
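This linear bound can be sanity-checked with a small sketch: every step of the recursion consumes at least one character, so the number of token-consuming steps never exceeds the line length (simplified token classes and hypothetical names):

```python
import re

# One anchored attempt per step: a word, a run of spaces, or a single char.
TOKEN = re.compile(r"\w+|\s+|.")

def count_steps(line: str) -> int:
    """Count how many token-consuming steps a line needs."""
    steps = 0
    while line:
        m = TOKEN.match(line)
        line = line[m.end():]   # each step strictly shortens the line
        steps += 1
    return steps

# The number of steps is bounded by the number of characters.
line = "result = compute(x) + 1  # total"
assert count_steps(line) <= len(line)
```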
Parallel programming is of enormous importance in today's computing context, as it enables the efficient utilization of multiple processors or cores to tackle complex tasks concurrently, significantly enhancing the speed and performance of the code. By dividing a program into smaller, independent tasks that can be executed simultaneously, parallel programming can make a program considerably faster. Because of this, we decided to implement a parallel version of our lexer, which we call highlight_parallel. This function takes a directory as its argument and uses Task.async to run the do_highlight_parallel function in parallel, as shown in the following code:
@spec highlight_parallel(dir()) :: :ok | :error
def highlight_parallel(dir \\ ".") do
try do
# Find python files and highlight syntax in parallel
File.ls(dir)
|> case do
{:ok, files} -> files
{:error, _reason} -> raise "Failed to read directory"
end
|> Enum.filter(&Regex.match?(~r|.py$|, &1))
|> Enum.map(&Task.async(fn -> do_highlight_parallel(&1, dir) end))
|> Enum.map(&Task.await(&1, :infinity))
# Write CSS once
css_file = ~s(#{dir}/style.css)
create_css(css_file)
|> case do
:ok -> :ok
{:error, _reason} -> raise "Failed to create css file"
end
rescue
exception ->
IO.puts("Error: #{inspect(exception)}")
end
end
defp do_highlight_parallel(fileIn, dir) do
file = ~s|#{dir}/#{fileIn}|
html_file = Regex.replace(~r|.py$|, file, ".html")
create_html(html_file)
|> case do
:ok -> :ok
{:error, _reason} -> raise "Failed to create html file"
end
# Highlight the file line by line and close the HTML document
File.stream!(file)
|> Enum.each(&highlight_line(&1, html_file))
File.write(html_file, "</code></pre>\n\t</body>\n</html>", [:append])
end
To use this function you will need to run commands in the iex terminal. First, execute the following in the root of the mix project (PythonLexer/lexer):
iex -S mix
Then you can run the following command to highlight the files in the PythonLexer/lexer/examples/parallel directory:
iex> Lexer.highlight_parallel("./examples/parallel")
This will create an HTML file for each Python file in the directory, plus a single CSS file with the styles. The files will be created in the same directory and will have the same names but with different extensions.
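The same fan-out pattern can be sketched in Python with concurrent.futures, where submitting and joining futures corresponds roughly to Task.async and Task.await. This is an illustration only; the do_highlight stand-in here just wraps the source in a pre/code block instead of running the real lexer:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def do_highlight(py_file: Path) -> Path:
    """Stand-in for do_highlight_parallel/2: emit an .html next to the .py."""
    html_file = py_file.with_suffix(".html")
    body = py_file.read_text()
    html_file.write_text(f"<pre><code>{body}</code></pre>")
    return html_file

def highlight_parallel(directory: str) -> list:
    """Find the .py files and highlight them concurrently, one task per file."""
    files = [p for p in Path(directory).iterdir() if p.suffix == ".py"]
    with ThreadPoolExecutor() as pool:        # joins all tasks on exit
        return list(pool.map(do_highlight, files))
```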
There is also a function that does the same (highlight the contents of all the files in a directory) but does it sequentially. This function is highlight_sequential/1 and can be used in the same way as the parallel function; the only difference is that it runs everything in a single process.
If we consider the case where we highlight m files, each file is processed with the same logic as the highlight/2 function, which takes time linear in the number of characters of that file; the sequential version therefore takes time linear in the combined size of all the files.
Interestingly, highlight_parallel has the same asymptotic time complexity as the sequential function: the number of threads that can execute simultaneously is fixed by the system, so parallelism divides the work by at most a constant factor, and constants are dropped in big O notation.
However, the fact that the asymptotic time complexity is the same does not mean that the execution times will be equal. For instance, if we highlight 8 files on a computer with 8 cores, the execution times of the sequential and parallel implementations should differ, and depending on the size of the files that difference may become very significant.
On the other hand, if the files are sufficiently small, the parallel function may actually be slower. This is because the time it takes to create a new process, plus all the operations the system performs to manage it, may exceed the time it takes to highlight the file. In this case the sequential function would be faster. However, notice that this only occurs with "small" files, so the total execution time is expected to be small anyway.
A benchmark function was created in order to compare the performance of the sequential and parallel functions. The results can be found in the following table or in the img
folder of this repository.
Run | Sequential time (s) | Parallel time (s) |
---|---|---|
1. | 7.927765 | 5.490092 |
2. | 7.537254 | 5.775834 |
3. | 7.880397 | 5.608249 |
4. | 7.612272 | 5.791548 |
5. | 7.606539 | 5.756286 |
6. | 7.496936 | 5.824183 |
7. | 7.475266 | 5.91838 |
8. | 7.704623 | 6.032369 |
9. | 7.720456 | 5.48416 |
10. | 7.929119 | 5.769395 |
The speedup S is defined as the ratio between the sequential and the parallel execution times, S = T_seq / T_par.
As shown in the table we used 10 iterations, and we benchmarked both implementations on a computer with 8 cores. Averaging the runs, this resulted in a speedup of approximately 1.34.
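The average speedup implied by the table can be verified with a few lines of Python (the run times are copied from the table above):

```python
# Benchmark times (seconds) from the ten runs in the table above.
sequential = [7.927765, 7.537254, 7.880397, 7.612272, 7.606539,
              7.496936, 7.475266, 7.704623, 7.720456, 7.929119]
parallel = [5.490092, 5.775834, 5.608249, 5.791548, 5.756286,
            5.824183, 5.91838, 6.032369, 5.48416, 5.769395]

mean_seq = sum(sequential) / len(sequential)
mean_par = sum(parallel) / len(parallel)
speedup = mean_seq / mean_par
print(f"S = {mean_seq:.3f} s / {mean_par:.3f} s = {speedup:.2f}")
```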
It is also important to mention that, as both functions receive more and larger files, the speedup will increase and the better performance of the parallel function will become more noticeable. As a matter of fact, one of the files was particularly heavy: examples/parallel/03_large-file.py took 4.75 seconds on its own, which is about 80% of the execution time of the parallel function (see the image of the second execution, iex(2)>, in the img folder).
This shows that the other 7 files added about 3 seconds to the sequential run; their running times are small enough that the parallel function essentially just waits for this single file to finish.
Coding is always going to be a tool that can be used for the benefit of society or for the opposite. Like any kind of knowledge, it can be used for good or for bad, and it is important to always keep an ethical view of what our code can be used for.
In our case, we have created a syntax highlighter for Python. This program by itself may not be a very useful tool in its current version; we are aware that there are far more powerful lexers out there. The interesting part of this project is that we decided to make it open source through a public GitHub repository. This means that anyone can use it and modify it for their own benefit, but we have also taken the time to document it properly so that people can learn from it; the time complexity section of this file could also be helpful to some. We believe humanity can always evolve for the better as long as society is willing to learn, share its knowledge and help those who need it. From an ethical point of view, making knowledge accessible is not only correct but also necessary for the development of our society, and we are glad to contribute to this cause.
As it becomes harder to improve computer hardware, the need for better software becomes more important. This is why concurrency is such an important topic in computer science: it allows us to run different parts of an algorithm separately, and when those parts are slow enough, the gains can be very significant. This is why we decided to implement a parallel version of our lexer.
Concurrency comes with many difficulties, but the functional paradigm gives us better control over the state of the program thanks to immutability. Moreover, Elixir makes the Erlang Virtual Machine (BEAM) more accessible, a very powerful piece of technology that makes spawning processes very cheap.