\documentclass[runningheads]{llncs} %%% Local Variables: %%% ispell-local-dictionary: "english" %%% End: \usepackage[utf8]{inputenc} \usepackage{booktabs} % For formal tables \usepackage{graphicx} \begin{document} %\SweaveOpts{concordance=TRUE} <>= library(ggplot2) library(ggthemes) data <- read.csv("data/evostar2019.csv") data$gap = as.factor(data$gap) @ \title{Exploring concurrent and stateless evolutionary algorithms} \author{Juan J. Merelo\inst{1} \and J.L.J. Laredo\inst{2} \and Pedro A. Castillo\inst{1} \and José-Mario García-Valdez\inst{3} \and Sergio Rojas-Galeano\inst{4} } \institute{% Universidad de Granada/CITIC\\ Granada, Spain\\ \email{\{jmerelo,pacv\}@ugr.es}\\ \and RI2C-LITIS, Université du Havre Normandie\\ Le Havre, France\\ \email{juanlu.jimenez@univ-lehavre.fr}\\ \and Instituto Tecnológico de Tijuana\\ Calzada Tecnológico, s/n\\ Tijuana, Mexico\\ \email{mario@tectijuana.edu.mx} \and School of Engineering\\ Universidad Distrital Francisco José de Caldas\\ Bogotá, Colombia\\ \email{srojas@udistrital.edu.co} } %\authorrunning{Merelo et al.} %\titlerunning{Exploring concurrent and stateless EAs} \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{abstract} Creating a concurrent and stateless version of an evolutionary algorithm implies changes in its algorithmic model. From the performance point of view, the main challenge is to balance computation with communication, but from the evolutionary point of view another challenge is to keep diversity high so that the algorithm is not stuck in local minima. In a concurrent setting, we will have to find the right balance so that improvements in both facets do not cancel out. In this paper we address such an issue, by exploring the space of parameters of a population based concurrent evolutionary algorithm that yields to find out the best combination for a particular problem. \end{abstract} \keywords{Concurrent algorithms, distributed computing, stateless algorithms, algorithm implementation, performance evaluation.} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} Concurrent programming adds a layer of abstraction on the machinery of processors and operating systems to offer a high-level interface that enables the user to program code that might be executed in parallel either in threads or in different processes \cite{andrews1991concurrent}. Different languages offer different facilities for concurrency at the primitive level, and mainly differ on how they deal with shared state, that is, variables that are accessed from several processes. Actor-based concurrency \cite{schippers2009towards} eliminates shared state by introducing a series of {\em actors} that store state and can change it; on the other hand, channel based concurrency follows the {\em communicating sequential processes} methodology \cite{Hoare:1978:CSP:359576.359585}, which is effectively stateless, with different processes reacting to channel input without changing state, and writing to these channels. This kind of concurrency is the one implemented by many modern languages such as Go or Perl 6 \cite{lenzperl}. However, the statelessness of the implementation requests for a change in the implementation of any algorithm. In particular, evolutionary algorithms need to migrate its computation model to an architecture that creates and processes streams of data using functions that do not change state. Despite the emphasis on hardware-based techniques such as cloud computing or GPGPU, there are not many papers \cite{Xia2010} dealing with creating concurrent evolutionary algorithms that work in a single computing node or that extend seamlessly from single to many computers. For instance, the EvAg model \cite{evag:gpem} is a locally concurrent and globally parallel evolutionary algorithm that leaves the management of the different agents (i.e. threads) to the underlying platform scheduler and displays an interesting feature: the model is able to scale seamlessly and take full advantage of CPU threads. In a first attempt to measure the scalability of the approach experiments were conducted in \cite{wcci:evoag} for a single and a dual-core processor showing that, for cost functions passing some milliseconds of computing time, the model was able to achieve near linear speed-ups . This study was later on extended in \cite{DBLP:conf/evoW/LaredoBMG12} by scaling up the experimentation to up to 188 parallel machines. The reported speed-up was $\times 960$ which is beyond the linear $\times 188$ that could be expected if local concurrency were not taken into account. The aforementioned algorithm used a protocol that worked asynchronously, leveraging its peer-to-peer capabilities; in general the design of concurrent EAs has to take into account the communication/synchronization between processes, which nowadays will be mainly threads. Although the paper above was original in its approach, other authors targeted explicitly multi-core architectures, such as Tagawa \cite{Tagawa201212} which used shared memory and a clever mechanism to avoid deadlock. Other authors \cite{kerdprasop2012concurrent} actually use a message-based architecture based in the concurrent functional language Erlang, which separates GA populations as different processes, although all communication takes place with a common central thread. In our previous papers \cite{Merelo:2018:MEA:3205651.3208317,Garcia-Valdez:2018:MEA:3205651.3205719}, we presented the proof of concept and initial results with this kind of stateless evolutionary algorithms, implemented in the Perl 6 language. These evolutionary algorithms use a single channel where entire populations are sent. The (stateless) functions read a single population from the channel, run an evolutionary algorithm for a fixed number of generations, which we call the {\em generation gap} or simply {\em gap}, and send the population in the final generation back to the channel. Several populations are created initially, and a concurrent {\em mixer} is run which takes populations in couples, mixes them leaving only a single population with the best individuals selected from the two merged populations. This {\em gap} is then conceptually, if not functionally, similar to the {\em time to migration} in parallel evolutionary algorithms (with which concurrent evolutionary algorithms show a big resemblance). We did some initial exploration of the parameter space in \cite{merelo:WEA}. In these initial explorations we realized that the parameters we used had an influence at the algorithmic level, but also at the implementation level, changing the wallclock performance of the algorithm. In this paper we will explore the parameter space systematically looking particularly at two parameters that have a very important influence on performance: population size and generation gap. Our intention is to create a rule of thumb for setting them in this kind of concurrent model, so that they are able to achieve the best performance. We will present the experimental setup next. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Experimental setup and results} \label{sec:res} Initially, we will work with two threads, one for mixing the populations and another for evolution. This setup, while being concurrent, allows us to focus more on the basic features of the algorithm, which is what we aim exploring in this paper. What we want to find out in these set of experiments is what is the generation gap that gives the best performance in terms of raw time to find a solution, as well as the best number of evaluations per second. In order to do that, we prepared an experiment using the OneMax function with 64 bits, a concurrent evolutionary algorithm such as the one described in \cite{Merelo:2018:MEA:3205651.3208317}, which is based in the free Perl 6 evolutionary algorithm library {\tt Algorithm::Evolutionary::Simple}, and run the experiments in a machine with Ubuntu 18.04, an AMD Ryzen 7 2700X Eight-Core Processor at 2195MHz. The Rakudo version was 6.d, which had recently been released with many improvements to the concurrent core of the language. All processing scripts as well as data obtained in the experiments are available, under a free license, from our GitHub repository. % \begin{figure*}[h!tb] \fbox{ \centering <>= ggplot(data,aes(x=evaluations,y=time,color=gap))+scale_color_brewer(palette="Set1")+geom_point(size=3,aes(shape=gap))+theme_tufte()+labs(x="Evaluations",y="Time",title="Evaluations vs Time per generation gap") @ } \caption{Number of evaluations vs. time needed for different generation gaps (8 to 64) } \label{fig:evals} \end{figure*} % \begin{figure*}[h!tb] \fbox{ \centering <>= ggplot(data,aes(x=gap,y=evaluations,group=gap))+geom_boxplot()+theme_tufte()+labs(title="Evaluations vs generation gap") @ } \caption{Number of evaluations needed for different generation gaps (8 to 64) } \label{fig:evals:bp} \end{figure*} % \begin{figure*}[h!tb] \fbox{ \centering <>= ggplot(data,aes(x=gap,y=avg.eval,group=gap))+geom_boxplot()+theme_tufte()+labs(title="Average number of evaluations vs generation gap") @ } \caption{Average number of evaluations per second for different generation gaps (8 to 64) } \label{fig:evals:avg} \end{figure*} % \begin{figure*}[h!tb] \fbox{ \centering <>= ggplot(data,aes(x=gap,y=time,group=gap))+geom_boxplot()+theme_tufte()+labs(title="Time vs generation gap") @ } \caption{Boxplot of the average wallclock time for different generation gaps (8 to 64) } \label{fig:time} \end{figure*} We used a population size of 256, as well as generation gaps increasing from 8 to 64. Many experiments were run for every configuration, up to 150 in some cases. We logged the upper bound of the number of evaluations needed (by multiplying the number of messages by the number of generations and number of individuals evaluated; this means that this number will be an upper bound, and not the exact number of evaluations until a solution is reached). We will first look at the general picture by plotting the wallclock time in seconds (measured by taking the time of the starting of the algorithm and the last message and subtracting the latter from the former) vs the number of evaluations that have been performed. The result is shown in Figure \ref{fig:evals}. Experiments with different generation gaps are shown with different colors (where available) and shapes, and they spread in an angle which is roughly bracketed by the experiments with a generation gap of 8, which need the most time for the same number of evaluations, and the experiments with a gap of 16, which usually need the least. The experiments with gaps = 32 or 64 are somewhere in between. In that same chart it can also be observed that the number of evaluations needed to find the 64 bit OneMax solution is quite different. We make a boxplot of the number of evaluations vs the generation gap in Figure \ref{fig:evals:bp}. This figure shows an increasing number of evaluations per gap size. Differences are significant between every generation gap and the next. This increasing number of evaluations per generation gap is probably due to the fact that the increasing number of isolated generations makes the population lose diversity, making finding the solution increasingly difficult. This is the same effect observed in parallel algorithms, as reported in \cite{Cantu-Paz:1999:MPT:2933923.2934003}, so it is not unexpected. What is unexpected is the combination of generation gap size and the concurrent algorithm, since it is impossible to know in advance what is the optimal computation to communication balance. % Here it would be interesting to describe why is unexpected? - Juanlu % done - JJ We plot the number of evaluations per second in Figure \ref{fig:evals:avg}. These show a big difference for a generation gap of 16, with a number of evaluations which is almost 50\% higher than for the rest of the generation gaps, where the difference is not so high. The number of evaluation per second does not follow a clear trend. It falls and remains flat for a generation gap higher than 16; it is also slightly higher than for the minimum generation gap that has been evaluated, 8. This generation gap, however, presents also the lowest number of evaluations to solution, which means that, on average, the solution will be found faster with a generation gap of 8 or 16. This is shown in Figure \ref{fig:time}. % I don't feel confortable modifying this paragraph, but I find it a bit twisted - Juanlu % Rewritten - JJ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Conclusions and discussion} \label{sec:conclusions} In this paper we have set out to explore the interaction between the generation gap and the performance measures in a concurrent and stateless evolutionary algorithm. From the point of view of the algorithm, increasing the generation gap favors exploitation over exploration, which might be a plus in some problems, but also decreases diversity, which might lead to premature convergence; in a parallel setting, this will make the algorithm need more evaluations to find a solution. The effect in a concurrent program goes in the opposite direction: by decreasing communication, the amount of code that can be executed concurrently increases, thus improving performance. Since the two effects cancel out, in this paper we have used a experimental methodology to find out what is the combination that is able to minimize wallclock time, which is eventually what we are interested in by maximizing the number of evaluations per second while, at the same time, increasing by a small quantity the number of evaluations needed to find the solution. For the specific problem we have used in this short paper, a 64-bit Onemax, the generation gap that seems to be more appropriate is around 16. The time to communication for that specific generation gap is around 2 seconds, since 16 generations imply 4096 evaluations and evaluation speed is approximately 2K/s. This gives us a ballpark of the amount of computation that is needed for concurrency to be efficient. In this case, we are sending the whole population to the communication channel, and this implies a certain overhead in reading, transmitting and writing. Increasing the population size also increases that overhead. We can thus reason than the amount of computation, for this particular hardware, should be in the order of 2 seconds, so that it effectively overcomes the amount of communication needed. This amount could be played out in different way, for instance by increasing the population; if the evaluation function takes more time, different combinations should be tested so that no message is sent unless that particular amount of time is reached. With these conclusions in mind, we can set out to work with other parameters, such as population size or number of initial populations, so that the loss of diversity for bigger population sizes can be overcome. Also we have to somehow address the problem of the message size during communication to and from the channel; in this respect, we are planning to resort to a compact representation of the population, obtained either by a statistical distribution model or a standard compression algorithm. This is left as an avenue for future work. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Acknowledgements} This paper has been supported in part by projects TIN2014-56494-C4-3-P s (Spanish Ministry of Economy and Competitiveness), DeepBio (TIN2017-85727-C4-2-P) and AMED (co-funded by European Regional Development Fund and the region Normandy). I would like to express my gratefulness to the users in the \#perl6 IRC channel, specially Elizabeth Mattijsen, Timo Paulsen and Zoffix Znet, who helped us with the adventure of programming efficient concurrent evolutionary algorithms. \bibliographystyle{splncs03} \bibliography{geneura,concurrent,perl6} \end{document}