
Introduction

In general there are many different ways and opinions on how to teach people something new. However, most people agree that a hands-on experience is one of the best ways to make the human brain remember a new skill. Learning must be entertaining and interactive, with fast and frequent feedback. Some areas are more suitable for this practical way of learning than others, and fortunately, programming is one of them.

The university education system is one of the areas where this approach can be applied. In computer programming, there are several requirements a program should satisfy: the code should be syntactically correct, efficient, and easy to read, maintain and extend.

Checking programs written by students by hand takes time and requires a lot of repetitive work -- reviewing the source code, compiling it and running it through test scenarios. It is therefore desirable to automate as much of this process as possible.

The first idea of an automatic evaluation system came from Stanford University professors in 1965. They implemented a system which evaluated Algol code submitted on punch cards. In the following years, many similar products were written.

Nowadays, properties like correctness and efficiency can be tested automatically to a large extent. This fact should be exploited to save teachers' time for tasks that are difficult to automate, such as spotting bad design, bad coding habits, or logical mistakes.

There are two basic ways of automatically evaluating code:

  • statically -- by checking the source code without running it. This is safe, but not very practical.
  • dynamically -- by running the code on test inputs and checking the correctness of its outputs. This provides a good real-world experience, but requires extensive security measures.

This project focuses on the machine-controlled part of source code evaluation. First we observed the general concepts of grading systems and discussed the problems of the software previously used at the Charles University in Prague. Then the new requirements were specified and we examined projects with similar functionality. With the acquired knowledge from these projects, we set up goals for the new evaluation system, designed the architecture and implemented a fully operational solution based on dynamic evaluation. The system is now ready for production testing at the university.

Assignment

The major goal of this project is to create a grading application which will be used for programming classes at the Faculty of Mathematics and Physics of the Charles University in Prague. However, the application should be designed in a modular fashion to be easily extended or even modified to make other ways of usage possible.

The system should be capable of performing a dynamic analysis of the submitted source code. This consists of the following basic steps (a code sketch of the whole sequence follows the list):

  1. compile the code and check for compilation errors
  2. run the compiled program in a sandbox with predefined inputs
  3. check that the program does not exceed the configured time and memory limits
  4. compare the outputs of the program with the defined expected outputs
  5. award the solution with a numeric score
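
The following sketch (in Python, written only for illustration) shows how these five steps could be chained together. The choice of gcc, the helper names, the limit values and the scoring formula are assumptions of this example, not the actual implementation of the system.

    import subprocess
    import resource

    def evaluate(source_path, test_cases, time_limit=1.0, memory_limit_mb=64):
        # 1. compile the code and check for compilation errors
        build = subprocess.run(["gcc", source_path, "-o", "solution"],
                               capture_output=True, text=True)
        if build.returncode != 0:
            return {"score": 0.0, "error": build.stderr}

        def set_limits():
            # 3. enforce the memory limit on the child process
            limit = memory_limit_mb * 1024 * 1024
            resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

        passed = 0
        for test in test_cases:  # each test is {"input": ..., "expected": ...}
            try:
                # 2. run the compiled program on a predefined input
                #    (a real system would run it inside a sandbox)
                run = subprocess.run(["./solution"], input=test["input"],
                                     capture_output=True, text=True,
                                     timeout=time_limit, preexec_fn=set_limits)
            except subprocess.TimeoutExpired:
                continue  # 3. time limit exceeded, the test fails
            # 4. compare the output with the expected output
            if run.returncode == 0 and run.stdout.strip() == test["expected"].strip():
                passed += 1

        # 5. award a numeric score (here simply the fraction of passed tests)
        return {"score": passed / len(test_cases)}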

The whole system is intended to help both the teachers (supervisors) and the students. To achieve this, it is crucial to keep in mind the typical usage scenarios of the system and to make these tasks as simple as possible. The project has a great starting point for this -- an old grading system (CodEx) is currently used at the university, so its flaws and weaknesses can be addressed. Furthermore, many teachers want to use and test the new system and are willing to discuss our ideas and problems with us during development.

Current System

The grading solution currently used at the Faculty of Mathematics and Physics of the Charles University in Prague was implemented in 2006 by a group of students. It is called CodEx -- The Code Examiner -- and it has been in use, with some improvements, ever since. The original plan was to use the system only for the basic programming courses, but there was demand to adapt it for several other courses.

CodEx is based on dynamic analysis. It features a web-based interface, where supervisors can assign exercises to their students and the students have a time window to submit their solutions. Each solution is compiled and run in a sandbox (MO-Eval). The checked metrics are the correctness of the output and compliance with the time and memory limits. It supports programs written in C, C++, C#, Java, Pascal, Python and Haskell.

The system has a database of users. Each user is assigned a role, which corresponds to his/her privileges. There are user groups reflecting the structure of lectured courses.

A database of exercises (algorithmic problems) is another part of the project. Each exercise consists of a text describing the problem, a configuration of the evaluation (machine-readable instructions on how to evaluate solutions to the exercise), time and memory limits for all supported runtimes (e.g. programming languages), a configuration for calculating the final score and a set of inputs and reference outputs. Exercises are created by instructed privileged users. Assigning an exercise to a group means choosing one of the available exercises and specifying additional properties: a deadline (optionally a second deadline), a maximum amount of points, a maximum number of submissions and a list of supported runtime environments.
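
For illustration, an exercise record with the properties listed above might look roughly like the following Python structure. The field names are invented for this example and do not correspond to the actual CodEx database schema.

    # A hypothetical exercise record; all field names are illustrative only.
    exercise = {
        "name": "Binary search",
        "description": "Find a given value in a sorted sequence ...",
        "evaluation_config": "config/binary-search.yml",  # machine-readable instructions
        "limits": {                                        # per supported runtime
            "c":    {"time_s": 1.0, "memory_kb": 65536},
            "java": {"time_s": 3.0, "memory_kb": 262144},
        },
        "tests": [                                         # inputs and reference outputs
            {"input": "tests/01.in", "expected": "tests/01.out", "weight": 1},
            {"input": "tests/02.in", "expected": "tests/02.out", "weight": 2},
        ],
        "scoring": "weighted-sum",                         # final score calculation
    }

    # Assigning the exercise to a group adds assignment-specific properties.
    assignment = {
        "exercise": exercise,
        "deadline": "2016-06-30 23:59",
        "second_deadline": None,       # optional
        "max_points": 10,
        "max_submissions": 20,
        "runtimes": ["c", "java"],
    }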

The typical use cases for the user roles are the following:

  • student
    • create a new user account via a registration form
    • join groups (e.g., the courses he attends)
    • get assignments in the groups
    • submit a solution to an assignment -- upload one source file and start the evaluation process
    • view the results of the solution -- which parts succeeded and failed, the total number of acquired points, bonus points
  • supervisor (similar to CodEx operator)
    • create a new exercise -- create description text and evaluation configuration (for each programming environment), upload testing inputs and outputs
    • assign an exercise to a group -- choose an exercise and set the deadlines, the number of allowed submissions, the weights of all test cases and the amount of points for the correct solutions
    • modify an assignment
    • view all of the results of the students in a group
    • review the automatic solution evaluation -- view the submitted source files and optionally set bonus points (including negative points)
  • administrator
    • create groups
    • alter user privileges -- make supervisor accounts
    • check system logs

Exercise Evaluation Chain

The most important part of the system is the evaluation of solutions submitted by the students. The process from the source code to final results (score) is described in more detail below to give readers a solid overview of what is happening during the evaluation process.

The first thing students have to do is to submit their solutions through the web user interface. The system checks assignment invariants (e.g., deadlines, number of submissions) and stores the submitted code. The runtime environment is automatically detected based on the extension of the input file, and a suitable evaluation configuration variant is chosen (one exercise can have multiple variants; for example, both C and Java may be allowed). This exercise configuration is then used for the evaluation process.
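
A minimal sketch of such extension-based runtime detection follows; the mapping and the function name are assumptions made for this example only.

    import os

    # Hypothetical mapping of file extensions to runtime environments.
    RUNTIMES = {
        ".c":    "c-gcc",
        ".cpp":  "cxx-gcc",
        ".cs":   "mono",
        ".java": "java",
        ".pas":  "freepascal",
        ".py":   "python3",
        ".hs":   "ghc",
    }

    def detect_runtime(submitted_file):
        """Pick the evaluation configuration variant based on the file extension."""
        _, ext = os.path.splitext(submitted_file)
        try:
            return RUNTIMES[ext.lower()]
        except KeyError:
            raise ValueError("unsupported runtime: " + submitted_file)

    print(detect_runtime("solution.java"))   # -> java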

There is a pool of uniform worker machines dedicated to evaluation jobs. Incoming jobs are kept in a queue until a free worker picks them up. Workers evaluate jobs sequentially, one at a time.

The worker obtains the solution and its evaluation configuration, parses it and starts executing the contained instructions. Each job should have multiple test cases which examine invalid inputs, corner cases and data of different sizes to estimate the complexity of the program. It is crucial to keep the computer running the worker secure and stable, so a sandboxed environment is used for dealing with unknown source code. When the execution is finished, the results are saved and the student is notified.
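
The worker loop can be imagined roughly as follows; the in-process queue and the helper functions are simplified placeholders used only to illustrate the flow (CodEx uses MO-Eval for the sandboxed execution itself).

    import queue

    job_queue = queue.Queue()   # simplified stand-in for the shared job queue

    def run_in_sandbox(binary, test, limits):
        """Placeholder for sandboxed execution of one test case."""
        return {"test": test["name"], "correct": True,
                "time_s": 0.12, "memory_kb": 2048}

    def save_results(job_id, results):
        """Placeholder for storing the results and notifying the student."""
        print(job_id, results)

    def worker_loop():
        while True:
            job = job_queue.get()            # block until a job is available
            config = job["configuration"]    # parsed evaluation configuration
            results = [run_in_sandbox(job["binary"], test, config["limits"])
                       for test in config["tests"]]
            save_results(job["id"], results)
            job_queue.task_done()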

The output of the worker contains data about the evaluation, such as the time and memory spent on running the program for each test input and whether its output is correct. The system then calculates a numeric score from this data, which is presented to the student. If the solution is incorrect (e.g., it produces wrong output or exceeds the memory or time limits), error messages are also displayed to the student.
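
One possible way to turn such per-test data into a numeric score is sketched below; the weighting scheme is an assumption of this example, not the formula used by CodEx.

    def compute_score(results, limits):
        """Weighted fraction of tests with correct output within the limits
        (an illustrative formula only)."""
        total = sum(r["weight"] for r in results)
        earned = sum(r["weight"] for r in results
                     if r["correct"]
                     and r["time_s"] <= limits["time_s"]
                     and r["memory_kb"] <= limits["memory_kb"])
        return earned / total

    results = [
        {"weight": 1, "correct": True,  "time_s": 0.2, "memory_kb": 1024},
        {"weight": 2, "correct": True,  "time_s": 1.5, "memory_kb": 1024},  # too slow
        {"weight": 1, "correct": False, "time_s": 0.1, "memory_kb": 512},
    ]
    print(compute_score(results, {"time_s": 1.0, "memory_kb": 65536}))  # -> 0.25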

Possible Improvements

The current system is old, but robust. There were no major security incidents during its usage. However, from a present-day perspective there are several major drawbacks:

  • web interface -- The web interface is simple and fully functional. However, the recent rapid development of web technologies offers new possibilities for building web interfaces.
  • public API -- CodEx offers a very limited public XML API based on outdated technologies that are not sufficient for users who would like to create their custom interfaces such as a command line tool or a mobile application.
  • sandboxing -- the MO-Eval sandbox is based on the principle of monitoring system calls and blocking the forbidden ones. This can be sufficient for single-threaded programs, but proves to be difficult with multi-threaded ones. Since parallelism is now a very important area of computing, multi-threaded programs must be testable securely as well.
  • instances -- Different uses of CodEx require separate installations (e.g., Programming I and II, Java, C#). This configuration is not user-friendly, as students have to register in each installation separately, and it burdens administrators with unnecessary work. The CodEx architecture does not allow sharing workers between installations, which results in an inefficient use of the evaluation hardware.
  • task extensibility -- There is a need to test and evaluate complicated programs for courses such as Parallel programming or Compiler principles, which require a more complex evaluation chain than the simple compilation/execution/evaluation provided by CodEx.

Requirements

There are many different formal requirements for the system. Some of them are necessary for any source code evaluation system, some are specific to university deployment, and some arose during the ten-year lifetime of the old system. There are not many ways of improving the CodEx experience from the perspective of a student, but a lot of feature requests come from administrators and supervisors. The ideas were gathered mostly from our personal experience with the system and from meetings with the faculty staff who use the current system.

In general, CodEx features should be preserved, so only the differences are presented here. For clarity, all the requirements and wishes are grouped by user category.

Requirements of The Users

  • group hierarchy -- creating an arbitrarily nested tree structure should be supported to keep related groups together, such as in the example below. CodEx supported only a flat group structure. A group hierarchy also allows archiving data from past courses.
  Summer term 2016
  |-- Language C# and .NET platform
  |   |-- Labs Monday 10:30
  |   `-- Labs Thursday 9:00
  |-- Programming I
  |   |-- Labs Monday 14:00
   ...
  • a database of exercises -- teachers should be able to filter the displayed exercises according to several criteria, for example by the supported runtime environments or by the author. It should also be possible to link exercises to a group so that group supervisors do not have to browse hundreds of exercises when their group only uses a few of them
  • advanced exercises -- the system should support a more advanced evaluation pipeline than basic compilation/execution/evaluation which is in CodEx
  • customizable grading system -- teachers need to specify the way of calculating the final score which will be allocated to the submissions depending on their correctness and quality
  • marking a solution as accepted -- a supervisor should be able to choose one of the submitted solutions of a student as accepted. The score of this particular solution will be used as the score which the student receives for the given assignment instead of the one with the highest score.
  • solution resubmission -- teachers should be able to edit the solutions of the student and privately resubmit them, optionally saving all results (including temporary ones); this feature can be used to quickly fix obvious errors in the solution and see if it is otherwise correct
  • localization -- all texts (the UI and the assignments of the exercises) should be translatable into several languages
  • formatted texts of assignments -- Markdown or another lightweight markup language should be supported for the formatting of the texts of the exercises
  • comments -- adding both private and public comments to exercises, tests and solutions should be supported
  • plagiarism detection

Administrative Requirements

  • independent user interface -- the system should allow the use of an alternative user interface, such as a command line client; implementation of such clients should be as straightforward as possible
  • privilege separation -- there should be at least two roles -- student and supervisor. The cases when a student of a course is also a teacher of another course must be handled correctly.
  • alternative authentication methods -- logging in through a university authentication system (e.g. LDAP) and potentially other services, such as Github or some other OAuth service, should be supported
  • querying SIS -- loading user data from the university information system (SIS) should be supported
  • sandboxing -- there should be a more advanced sandbox which supports execution of parallel programs and an easy integration of different programming environments and tools; the sandboxed environment should have the minimum possible impact on the measurement of results (most importantly on the measured duration of execution)
  • heterogeneous worker pool -- there must be a support for submission evaluation in multiple programming environments in a single installation to avoid unacceptable workload of the administrator (i.e., maintaining a separate installation for every course) and high hardware requirements
  • advanced low-level evaluation flow configuration with a high-level abstraction layer for ordinary configuration cases; the configuration should be able to express more complicated flows than just compiling a source code and running the program against the test inputs -- for example, some exercises need to build the source code with a tool, run some tests, then run the program through another tool and perform additional tests (a sketch of such a flow follows this list)
  • use of modern technologies with state-of-the-art compilers
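
To illustrate the kind of multi-stage flow meant by the two requirements above, an evaluation configuration for a hypothetical compiler-course exercise could look roughly like the following; the stage names and the structure are our own assumptions, not a format prescribed by any existing system.

    # A hypothetical multi-stage evaluation flow; illustrative only.
    pipeline = [
        {"stage": "build",     "tool": "make",        "args": ["student-compiler"]},
        {"stage": "unit-test", "tool": "./run_tests", "args": ["--suite", "parser"]},
        {"stage": "compile",   "tool": "./compiler",  "args": ["tests/program.src"]},
        {"stage": "execute",   "tool": "./a.out",     "stdin": "tests/program.in"},
        {"stage": "judge",     "tool": "diff",        "args": ["program.out", "tests/program.ref"]},
    ]

    for step in pipeline:
        print(step["stage"], "->", step["tool"])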

Non-functional Requirements

  • no installation -- the primary user interface of the system must be accessible on the computers of the users without the need to install any additional software except for a web browser which is installed on almost every personal computer
  • performance -- the system must be ready for at least hundreds of students and tens of supervisors who are using it at the same time
  • automated deployment -- all of the components of the system must be easy to deploy in an automated fashion
  • open source licensing -- the source code should be released under a permissive licence allowing further development; this also applies to the used libraries and frameworks
  • multi-platform worker -- worker machines running Linux, Windows and potentially other operating systems must be supported

Conclusion

The survey shows that there are a lot of different requirements and wishes for the new system. Once the system is in use, new ideas on how to use it are likely to appear, so it must be designed to be easily extensible, whether by us or by community members. This also means that widely used programming languages and techniques should be employed, so that programmers can quickly understand the code and make changes with ease.

Related work

To find out the current state in the field of automatic grading systems, we conducted a short survey of the tools used at universities, in programming contests, and in other places where similar software is needed.

This is not a complete list of available evaluators, but only a few projects which are used these days and can serve as an inspiration for our project. Each project on the list is provided with a brief description and some of its key features.

Progtest

Progtest is a private project of FIT ČVUT in Prague. As far as we know, it is used for C/C++ and Bash programming and for knowledge-based quizzes. Each submitted solution can receive several bonus points or penalties, and a few hints about what is incorrect in the solution can be attached. It is very strict about source code quality: for example, the -pedantic option of GCC is used, Valgrind detects memory leaks, and array boundaries are checked via the mudflap library.

Codility

Codility is a web-based solution primarily targeted at company recruiters. It is a commercial product available as SaaS and it supports 16 programming languages. The UI of Codility is open source, but the rest of the source code is not available. One interesting feature is the 'task timeline' -- the captured progress of writing the code for each user.

CMS

CMS is an open-source distributed system for running and organizing programming contests. It is written in Python and contains several modules. CMS supports the C/C++, Pascal, Python, PHP, and Java programming languages. PostgreSQL is a single point of failure; all modules heavily depend on the database connection. Task evaluation is limited to a three-step pipeline -- compilation, execution, evaluation. Execution is performed in Isolate, a sandbox written by the consultant of our project, Mgr. Martin Mareš, Ph.D.

MOE

MOE is a grading system written in shell scripts, C and Python. It does not provide a default GUI; all actions have to be performed from the command line. The system does not evaluate submissions in real time; results are computed in a batch mode after the exercise deadline, using Isolate for sandboxing. Parts of MOE are used in other systems like CodEx or CMS, but the system as a whole is obsolete.

Kattis

Kattis is another SaaS solution. It provides a clean and functional web UI, but the rest of the application is too simple. A nice feature is the usage of a standardized format for exercises. Kattis is primarily used by programming contest organizers, company recruiters and also some universities.