/
github-rankings.Rnw
557 lines (468 loc) · 30.1 KB
/
github-rankings.Rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
\documentclass[review]{elsarticle}
\bibliographystyle{elsarticle-num}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\begin{document}
<<setup, cache=FALSE,echo=FALSE>>=
library("ggplot2")
library("RCurl") #To download stuff directly from the GitHub repo
load(".RData") #I'll try to obsolete this to make everything clearer
agg.top.Spain.evol.file <- getURL("https://raw.githubusercontent.com/JJ/top-github-users-data/master/data/processed/aggregated-top-Spain-evol.csv")
agg.top.Spain.evol <- read.csv(text = agg.top.Spain.evol.file,sep=";")
@
\begin{frontmatter}
\title{Measuring local GitHub developer communities in Spain}
\author[ugr,geneura,osl]{J.J. Merelo\corref{cor1}}
\ead{jmerelo@ugr.es}
\author[osl,ugr]{Nuria Rico}
\ead{nrico@ugr.es}
\author[ugr]{Israel Blancas}
\ead{iblancasa@gmail.com}
\author[geneura,ugr]{M. G. Arenas}
\ead{mgarenas@ugr.es}
\author[unizar]{Fernando Tricas}
\ead{ftricas@unizar.es}
\author[cacharreo]{José Antonio Vacas}
\ead{javacasm@gmail.com}
\address[ugr]{University of Granada, Spain}
\address[osl]{Free Software Office, University of Granada}
\address[geneura]{GeNeura Team, \url{http://geneura.wordpress.com}
\address[unizar]{University of Zaragoza, Spain}
\address[cacharreo]{\url{http://elcacharreo.com}}
\cortext[cor1]{Corresponding author. He can be
reached at his email address or at the
\href{https://github.com/geneura-papers/github-ranking/issues}{issues
section} of the repo for this paper}}
\date{}
%\maketitle
\begin{abstract}
Creating rankings might seem like a vain exercise in belly-button
gazing, even more so for people so unlike that kind of things as
programmers. However, in this paper we will try to prove how
creating city (or province) based rankings in Spain has led to
all kind of interesting effects, including increased productivity and
community building. We describe the methodology we have used to
search for programmers residing in a particular province focusing
on those where most population is concentrated and apply different
measures to show how these communities differ in structure, number
and productivity.
\end{abstract}
\begin{keyword}
Free software, libre software, software engineering, software communities
\end{keyword}
\end{frontmatter}
\section{Introduction}
One of the keys to create a community is to actually identify who is
part of it and how they participate. As part of the effort by the Free Software Office of the
University of Granada, we have tried, through the years, to know who
is involved in the creation of open source projects. However, the only
% quito software porque algunos projectos no son solo software
% Totalmente correcto. Gracias - JJ
way of finding out who was is to make them come to any of our events
or contact us through any means.
So the initial intention for creating a ranking of FLOSS (Free/Libre/Open/So\-ur\-ce Software)
was to know who is out there and the kind of things they are doing, be
them part of the academic world or outside it, in business; creating a census would allow us
to discover new FLOSS developers in our own city and even to collaborate with them.
So we used initially the GitHub top-1000 generation script by Paul
Miller \href{https://github.com/paulmillr/top-github-users} to achieve
that, making small modifications to the source and creating our own
version, which was eventually
\href{https://github.com/JJ/github-city-rankings}{moved to a new repository}. But this had
several effects. First, as soon as the ranking was published some
people contacted us and the first GitHub meet-up in Granada took
place. More modifications and changes were added and new data was
obtained. As part of the code tests, more cities were tested and we
ended up with lots of data. And data begs for analysis, which we
eventually started to do. And, along the way, we built a community of users that previously had not known each other. We discovered that
the only fact that a census exists does not imply that there is a
community, but it definitely helps.
We have had some experience with this kind of reactions in the
past. In~\cite{blogtalk} we did some studies on social network
analysis and other measures of the Spanish-speaking blogosphere. Then,
the reactions were two fold: on one side, people showed big interest
in the index in order to be listed there; some blog providers provided
also data. On the other side, people that were expecting to appear in
better positions were a bit angry about it\footnote{You can see some
of the discussions and links -unfortunately most of them do not
work- at: \url{http://www.blogalia.com/historias/7744} (in
Spanish)}. Anyway we feel that self-consciousness is always a good
thing and this work can serve as a driving force for more code sharing
and increase relations among developers.
In this paper, using the tool that we have created for searching for
the users in particular cities or provinces, we will show how the
GitHub activity in these cities or provinces compare with each other
and what kind of characteristics they have, including basic
metrics. We will also delve into the effect of publishing the ranking
itself, which has surprisingly increased the productivity of all
communities measured. Finally, we will try to draw some conclusions on
how measuring activity affects that activity and what are the general
characteristics of open source developers in the provinces measured,
which are the top 20 in population in Spain.
Coming up next we will review different papers that deal with creating
lists and rankings of contributions and trying to measure or explain
the dynamics of communities. In section \ref{sec:exp} we will show the
methodology for obtaining the users in a particular province in Spain;
next we will analyze data obtained and show how different provinces
stack up in terms of contributions and finally we will draw some
conclusions.
\section{State of the art}
Geographically based community metrics have had some attention in the last years~\cite{BarahonaRoblesAndradasGhosh08,TakhteyevHilts10,EngelhardtFreytagSchulz13} but they do not seem to have arisen a
lot of interest in the FLOSS metrics~\cite{herraiz2009flossmetrics} community. Most efforts seem to
be focused in creating tools for actually measuring repositories for
activity; for instance, Laura Arjona describes the {\em Debian
Contributors} tool in \cite{arjona:debian} whose results are dumped
to a website that includes information on the projects that every user
has contributed and some other data such as how users are identified
in the databases. Other authors use existent online development communities, such as Sourceforge \cite{Xu05topologicalSourceforge,Hahn08PriorTies}, to perform statistical analysis on the social structure of the projects.
%TODO hay muchos artículos que usan Sourceforge como fuente de datos, deberíamos indicar las ventajas de usar Github FERGU
However, geography seems to be relevant in human interactions even when we are in Internet, where one could expect this factor to be less important. See, for example, `Visualizing Friendships'\footnote{\url{https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919}} where the authors studied interactions inside the Facebook social network, or~\cite{RattiSobolevskyCalabreseAndrisReadesMartinoClaxtonStrogatz10} where the subject of analysis are phone calls. We can see that even when there is technology-mediated communication, the geography seems to be an important driver.
It is interesting to note that some models \cite{robles05} use the
concept of stigmergy, that is, interaction using the environment, to
model the dynamics of libre software projects; the mere existence of
these tools can be a catalyst of this interaction and the harbinger
of new software projects. In fact, this seems to be what has happened
in the community (or communities) under observation: the mere creation of
a document that mentions many different users acts as a substrate that
allows the creation and growth of the community through the stigmergy paradigm.
For example, e-mail social networks are used by Sowe et al. \cite{Sowe06knowledge} to identify knowledge brokers using the Debian mailing lists. This technique was also used by Bird et al. \cite{Bird08Latent} to study the social structure in other big Open Source projects. They shown that different subcommunities of participants can be formed and detected, with statistically significant levels of modularity and that these social networks are mainly built around {\em products} of each project (APIs, bug fixes, etc.). Finally, they proved that pairs of members in the mailing lists social networks have more modified files in common than pairs from different communities.
Next we will briefly explain the tool that was designed to search for
geographically based GitHub users.
\section{GitHub city rankings, the tool}
\label{sec:exp}
\begin{table}[htb]
\centering
<<kable,results="asis",echo=FALSE>>=
kable(top20.data)
@
\caption{Raw aggregate measures for the 20 most populated provinces,
including the population (taken from the National Statistics
Institute), the number of users and their contributions, stars and
followers. Please note that province names do not correspond to
official names, having rather been chosen a bit arbitrarily from the
search strings used.}
\label{tab:data}
\end{table}
%
It is quite unlikely that if you are reading this you do not know what
is \href{http://github.com}{GitHub}. GitHub is a web-based git
repository that has a number of ``social'' features, including the
declaration of a profile and {\tt @} mentions in commit messages and
issues. The profile page includes information on the number of
followers, as well as the repositories and the number of contributions
every person has made during the last year. Besides this
easily-scrapeable information, GitHub has a REST API that can be
accessed from any language.
Some other web-based \emph{repos} do have many of the characteristics, and,
besides, are based in free software themselves, like
Gitorious\footnote{\url{http://gitorious.org/}}, SourceForge\footnote{\url{http://sourceforge.net/}}, Google Code\footnote{\url{http://code.google.com/}}, just to name a few of them. However the number (and the activity) of users of these repositories is quite small compared to GitHub, which has become the tool of choice for
FLOSS developers. That is why GitHub was chosen, apart from the availability of tools to mine profile information: it provides an API that allows us to study things in an easier way.
Notice also that all the projects in GitHub can not be considered FLOSS as we can see at~\cite{Williamson13}.
Nevertheless, people seems to be following the web culture where sharing and broadcasting are the usual ways of relation and they are not paying attention to licensing issues.
The \href{https://github.com/paulmillr/top-github-users}{tool used initially, by Paul
Miller} was
written in CoffeeScript and designed for creating a ranking of the top
1000 users with more than an (arbitrary) numbers of followers equal to
256. The tool used the GitHub REST API to make requests, and saved
them in a human-readable form in Markdown and also CSV and JSON. It
was separated in three scripts that were called from a Makefile. Some
utility functions were written in Node.js; the node.js module was
called from several scripts.
Our intention was to look for users in a particular location, that is,
to limit them not by minimum number of users but for the location
declared in their profiles. Every run of the program required 10 API
requests which are limited to 20 per hour, so our first modification \cite{Merelo2015} was to ochange it so that it
used authenticated requests. Finally, we had to rearrange the
whole code so that it counted the number of stars and could be
filtered according to regular expressions, since the country a city or
province is might be ambiguous (you know Toledo, Ohio, but there is
also a Toledo in Spain); Markdown handling was hard-coded into the program
so it was moved to a template-based solution. The resulting solution
\cite{Merelo2015} kept the same license and included also a few
additional tools for data processing.
One of the main problem we found in Spain was the different forms of the
province name. Besides the fact that people write it in any of the
official languages (Spanish and, in some cases, national languages
like Basque or Catalan) and, well, sometimes with typos (with or
without tildes), some people do not mention their province when
writing their location. To make a long story short, we had to provide
a configuration file (in JSON) which lists several possible names that
might be used by people in a particular province; for instance, for
Majorca we had to include this: {\tt "location":
["Balears","Baleares","Palma de Mallorca"]}. This, of course,
excludes those that simply do not care about listing their location,
but more on this later on.
The script is then run with a city name (if there is no particular
configuration option, {\tt Madrid}, for instance) or a configuration
file name {\tt granada}. This can be launched weekly, or simply at a
particular moment or under request.
The results for each user list the number of followers, contributions,
the number of stars his/her repositories have received, the longest
and the current
contribution streak as well as the {\em predominant} language and
avatar. Some of these metrics are shown in the Markdown rankings; for
instance, see the
\href{https://github.com/JJ/top-github-users-data/blob/master/formatted/top-Madrid.md}{one for Madrid}.
Data is saved to a
\href{https://github.com/JJ/top-github-users-data/}{different repository} and is aggregated
and processed using R and Perl scripts. All of them are included in
the same repository. As part of our commitment to free/open science,
all graphics and data were published as soon as they were produced in
Twitter from my {\tt @jjmerelo} account.
\section{Results and analysis}
\label{sec:res}
We reduced the search to the 20 most populated provinces in Spain, for
which appropriate search strings and filters were created\footnote{Population data was obtained from the National Statistics
Institute \url{http://www.ine.es/}}. All data
was downloaded during January 2015, and is available from the already
mentioned \emph{repo}. Aggregate data for these 20 provinces is shown in
table \ref{tab:data}.
The range of users shown in the table hovers around the hundreds, with
the one in the biggest provinces (and cities) approaching
1000. However, population and users/contributions are not directly
related. We can already see some differences in figure \ref{fig:user:contrib}, that shows the number of contributions (left) and users (right) in decreasing order. The first two, Madrid and Barcelona, should only be expected, but then Granada (17th in population) and Zaragoza also occupy a place that does not correspond exactly to the population they have; same as Bilbao (actually Vizcaya), but in the opposite direction.
\begin{figure}[htb]
\centering
<<byfollow, echo=FALSE, out.width='.49\\linewidth',fig.height=6, fig.subcap=c('# Contributions', '#Users'),>>=
ggplot(top20.data.bycontrib,aes(x=province,y=contributions)) + geom_bar(stat='identity')+xlab("Cities")+ coord_flip()
ggplot(top20.data.byuser,aes(x=province,y=users)) + geom_bar(stat='identity')+xlab("Cities")+ coord_flip()
@
\caption{Provinces ranked by number of contributions (left) and users (right)}
\label{fig:user:contrib}
\end{figure}
%
\begin{figure}[htb]
\centering
<<zipf, echo=FALSE, out.width='.49\\linewidth',fig.height=6, fig.subcap=c('Contributions', 'Users'),>>=
ggplot()+geom_point(data=barcelona.data, aes(x=log(seq_along(contributions)), y= log10(contributions),color='BCN')) +geom_point(data=zaragoza.data,aes(y=log10(contributions), x=log(seq_along(contributions)),color='ZGZ'))+geom_point(data=madrid.data, aes(x=log(seq_along(contributions)), y= log10(contributions),color='MAD')) +geom_point(data=granada.data,aes(y=log10(contributions), x=log(seq_along(contributions)),color='GRX'))+geom_point(data=sevilla.data, aes(x=log(seq_along(contributions)), y= log10(contributions),color='SVQ')) +geom_point(data=valencia.data,aes(y=log10(contributions), x=log(seq_along(contributions)),color='VLC'))+labs(color="City")+xlab("Rank")
ggplot()+geom_point(data=barcelona.data, aes(x=log(seq_along(followers)), y=log10(sort(followers,decreasing=TRUE)),color='BCN')) +geom_point(data=zaragoza.data,aes(y=log10(sort(followers,decreasing=TRUE)), x=log(seq_along(followers)),color='ZGZ'))+geom_point(data=madrid.data, aes(x=log(seq_along(followers)), y=log10(sort(followers,decreasing=TRUE)),color='MAD')) +geom_point(data=granada.data,aes(y=log10(sort(followers,decreasing=TRUE)), x=log(seq_along(followers)),color='GRX'))+geom_point(data=sevilla.data, aes(x=log(seq_along(followers)), y=log10(sort(followers,decreasing=TRUE)),color='SVQ')) +geom_point(data=valencia.data,aes(y=log10(sort(followers,decreasing=TRUE)), x=log(seq_along(followers)),color='VLC'))+labs(color="City")+ylab('log10(followers)')+xlab("Rank")
@
\caption{Zipf graph, rank vs. number of contributions (left) or followers (right) for the top 6 provinces in number of users and contributions: Madrid, Barcelona, Valencia, Granada, Sevilla, Zaragoza}
\label{fig:zipf}
\end{figure}
%
So let us look at the distribution of contributions and users looking
for an explanation of the dancing places in the ranking. A users and
contributions vs. rank plot is shown in figure \ref{fig:zipf}; it
shows different slopes which imply different distribution, but there
is a clear indication that a Zipf-like distribution is taking place in
all cases. So let us compute the Zipf exponent and objective, which we
show in table \ref{tab:zipf}.
\begin{table}[htb]
\centering
<<zipfexp,echo=FALSE>>=
#Taken from http://stats.stackexchange.com/questions/6780/how-to-calculate-zipfs-law-coefficient-from-a-set-of-top-frequencies
p<-zaragoza.data$contributions/sum(zaragoza.data$contributions)
lzipf <- function(s,N) -s*log(1:N)-log(sum(1/(1:N)^s))
opt.f <- function(s) sum((log(p)-lzipf(s,length(p)))^2)
opt <- optimize(opt.f,c(0.5,10))
zipf.city <- data.frame(city='Zaragoza',exponent=opt$minimum, obj=opt$objective)
p<-sevilla.data$contributions/sum(sevilla.data$contributions)
opt.f <- function(s) sum((log(p)-lzipf(s,length(p)))^2)
opt<- optimize(opt.f,c(0.5,10))
zipf.city <- rbind(zipf.city,data.frame(city='Sevilla',exponent=opt$minimum, obj=opt$objective))
p<-granada.data$contributions/sum(granada.data$contributions)
opt.f <- function(s) sum((log(p)-lzipf(s,length(p)))^2)
opt <- optimize(opt.f,c(0.5,10))
zipf.city <- rbind(zipf.city,data.frame(city='Granada',exponent=opt$minimum, obj=opt$objective))
p<-valencia.data$contributions/sum(valencia.data$contributions)
opt.f <- function(s) sum((log(p)-lzipf(s,length(p)))^2)
opt <- optimize(opt.f,c(0.5,10))
zipf.city <- rbind(zipf.city,data.frame(city='Valencia',exponent=opt$minimum, obj=opt$objective))
p<-madrid.data$contributions/sum(madrid.data$contributions)
opt.f <- function(s) sum((log(p)-lzipf(s,length(p)))^2)
opt <- optimize(opt.f,c(0.5,10))
zipf.city <- rbind(zipf.city,data.frame(city='Madrid',exponent=opt$minimum, obj=opt$objective))
p<-barcelona.data$contributions/sum(barcelona.data$contributions)
opt.f <- function(s) sum((log(p)-lzipf(s,length(p)))^2)
opt <- optimize(opt.f,c(0.5,10))
zipf.city <- rbind(zipf.city,data.frame(city='Barcelona',exponent=opt$minimum, obj=opt$objective))
kable(zipf.city)
@
\caption{Zipf coefficients for all 6 ``big'' cities, with exponent and objective }
\label{tab:zipf}
\end{table}
This can be interpreted in a different way by plotting the Lorenz
curve, which is the accumulated normalized sum of contributions for
these six cities. This is shown in figure \ref{fig:lorenz}; this
Lorenz curve tends to represent the inequality between those that
contribute more and those that contribute less and is usually
represented by the Gini index, which is shown in table
\ref{tab:gini}.
%
\begin{figure}[htb]
\centering
<<lorenz, echo=FALSE, fig.height=4, fig.width=8>>=
ggplot()+geom_point(data=barcelona.data, aes(x=seq_along(contributions)/length(contributions), y=cumsum(sort(contributions))/sum(contributions),color='BCN'))+geom_point(data=zaragoza.data, aes(x=seq_along(contributions)/length(contributions), y=cumsum(sort(contributions))/sum(contributions),color='ZGZ'))+geom_point(data=madrid.data, aes(x=seq_along(contributions)/length(contributions), y=cumsum(sort(contributions))/sum(contributions),color='MAD'))+geom_point(data=granada.data, aes(x=seq_along(contributions)/length(contributions), y=cumsum(sort(contributions))/sum(contributions),color='GRX'))+geom_point(data=sevilla.data, aes(x=seq_along(contributions)/length(contributions), y=cumsum(sort(contributions))/sum(contributions),color='SVQ'))+geom_point(data=valencia.data, aes(x=seq_along(contributions)/length(contributions), y=cumsum(sort(contributions))/sum(contributions),color='VLC'))+xlab('rank')+ylab('contributions')+labs(color="City")
@
\caption{Lorenz graph, that is, accumulated number of contributions vs rank for the top 6 provinces in number of users and contributions: Madrid, Barcelona, Valencia, Granada, Sevilla, Zaragoza}
\label{fig:lorenz}
\end{figure}
%
\begin{table}[htb]
\centering
<<gini,echo=FALSE,results='asis'>>=
library(ineq)
gini.city <- data.frame(city='Madrid', gini= ineq(madrid.data$contributions,type='Gini'))
gini.city <- rbind(gini.city, data.frame(city='Granada', gini= ineq(granada.data$contributions,type='Gini')))
gini.city <- rbind(gini.city, data.frame(city='Zaragoza', gini= ineq(zaragoza.data$contributions,type='Gini')))
gini.city <- rbind(gini.city, data.frame(city='Sevilla', gini= ineq(sevilla.data$contributions,type='Gini')))
gini.city <- rbind(gini.city, data.frame(city='Barcelona', gini= ineq(barcelona.data$contributions,type='Gini')))
gini.city <- rbind(gini.city, data.frame(city='Valencia', gini= ineq(valencia.data$contributions,type='Gini')))
knitr::kable(gini.city,format='latex')
@
\caption{Gini coefficients for all 6 ``big'' cities }
\label{tab:gini}
\end{table}
The Gini coefficient measures {\em inequality} in the sense of {\em
share} of, in this case, contributions between those with the most
contributions and those with the least. An index equal to 1 would mean
a single person did all the contributions, while the rest did 0, and
index equal to 0 would mean all users make the same number of
contributions. The table \ref{tab:gini} ranks the cities from least
equal (Madrid) to most egalitarian, Valencia. However, there is no
big range of variation, hovering around 0.70, which is way over the
inequality of the most unequal country in the world, the
Seychelles. However, this is meaningless in absolute terms; in
relative terms, it means roughly that the top contributors contribute
roughly 70\% more than the bottom contributors, and that there is no
big variation among the different cities/provinces. It is quite clear,
however, that the contributions by the top contributor, as well as
those made by the {\em average} one, are quite different from place to
place. So we represent in figure \ref{fig:avg} the average number of
contributions, that is, the number of contributions divided by the
number of users.
%
\begin{figure}[htb]
\centering
<<avg, echo=FALSE, fig.height=4, fig.width=8>>=
ggplot(top20.data.byavgcontrib,aes(x=province,y=avgcontrib)) + geom_bar(stat='identity')+coord_flip()+xlab("Contributions/users")
@
\caption{Average number of contributions, sorted from top to bottom,
for the 20 provinces with the most inhabitants.}
\label{fig:avg}
\end{figure}
Figure \ref{fig:avg} shows that average contributions have a bigger
range than the Gini coefficient; Valencia is right in the middle,
around 100; in fact, the top contributor ({\tt pakozm} has 1462
contributions and 25\% have more than 100). In Madrid, however, the top
10 have more than 2000 contributions and there are 200 users (25\%
with more than 150), so that accounts for the bigger inequality. But
productivity is highest in Madrid, Zaragoza and, once again, in
Granada, if we consider productivity exclusively the number of
contributions.
Finally, it is interesting to find out if the publication of these
rankings had any kind of impact. This is shown in figure
\ref{fig:impact} for the three cities for which we have the most
tests, including Granada.
%
\begin{figure}[htb]
\centering
<<impact, echo=FALSE, fig.height=4, fig.width=8>>=
ggplot() + geom_line(data=malaga.evol.data, aes(x=seq_along(users), y=users,color='AGP')) + geom_line(data=sevilla.evol.data, aes(x=seq_along(users),y=users,color='SVQ'))+ geom_line(data=granada.evol.data, aes(x=seq_along(users),y=users,color='GRX')) + xlab("Sequence")+labs(color="City")
@
\caption{Number of users in each ranking; every point indicates simply
a new ranking and are not uniformly distributed in time.}
\label{fig:impact}
\end{figure}
%
The first time the census for Granada was published it included around
140 users. A month (approximately) later, there were more than 180, a
28\% increase, more or less the same than for Málaga and more than for
Seville, where a small increase was noted. Take into account that this
is not absolute number of users, but only {\em active} users; in fact,
it might decrease due to some user becoming inactive (no contribution)
in the last year (this happens in some of the cases in Granada). So,
in general, it should be expected to go every which way, depending on
the city. The fact that all cities whose rankings have been published
have {\em increased} the number of active users after diffusion,
mainly in Twitter, might be an indication more users becoming active,
more mentioning their city/province in their profile, or small
competitions taking place locally to scale up the rankings if there is
a chance to do so. All hypothesis are equally valid lacking other
evidence, but we would say that at least some increase will be due to
the fact that the rankings exist.
% This stuff below is new
If we look at the aggregated data, total number of users returned by
API searches in Spain and how they evolve in time, we will get the
picture shown in Figure \ref{fig:spain:evol}
%
\begin{figure}[htb]
\centering
<<evolSpain, echo=FALSE, fig.height=4, fig.width=8>>=
ggplot(agg.top.Spain.evol,aes(x=users,y=contributions))+geom_line()+geom_point()
@
\caption{Number of contributions plotted against number of users for
different points in time, joined sequentially. Differences between
two points are usually one week. }
\label{fig:spain:evol}
\end{figure}
%
\begin{figure}[htb]
\centering
<<evolSpainrel, echo=FALSE, fig.height=4, fig.width=8>>=
ggplot(data=agg.top.Spain.evol,aes(x=row.names(agg.top.Spain.evol),y=contributions/users))+geom_point()+ xlab("Sequence")
@
\caption{Evolution of the number of contributions per user along time.
Points were taken (roughtly) in consecutive weeks. }
\label{fig:spain:evol:rel}
\end{figure}
%
This steady increment might be due to several factors, once the effect
of making API searches more complete is excluded. One of them,
probably the most critical one, is people actually putting their city
and country in the profile as weekly rankings are published. Another
one is people making more contributions (and getting more followers)
as their work becomes public and notorious. Of course, GitHub and open
source are thriving communities so more developers will eventually
release their work under a free license and new users will be
incorporated to the pool users. The evidence of which one is the most
important is mostly anecdotal, but from comments in social networks
and our own experience it is quite clear that the publication of these
rankings has had a positive impact in the open source contributions,
at least quantitatively, in the whole.
A hint on what happens might
be gleaned from Figure \ref{fig:spain:evol:rel}, which plots the
evolution of the number of contributions along time. In fact, it
remains more or less stable with a certain drop at the beginning due
to the search returning users with a low number of contributions in
big cities and the whole country, but remains stable afterwards, with
a certain trend downwards. Since the number of users is actually
increasing, this would indicate that the new users have a number of
contributions, in fact a number of followers and stars too, that is
close to the average or slightly lower. This does not allow us to say
which contributions is more important: new users (who would have a low
number of contributions) or existing users setting their name in the
profile (which would be expected to have contributions distributed in
the same way as existing users).
\section{Conclusions}
In this paper we have shown what happened when city/province based
rankings were created using GitHub search API and what conclusions can
be extracted from measuring the number of users and contributions made
by these users.
In general, it is interesting to note that Spain hosts a vibrant,
abundant and diverse community of developers. Looking at the raw
numbers, most of them are based in the big cities, Madrid, Barcelona
and Valencia, but some smaller cities like Zaragoza and Granada also
host a numerous and productive group of developers. The publishing of
the rankings has created a lively discussion in Twitter, and also
allowed the discovery of many developers in many areas. In Granada
GitHub monthly meetings have started, and many interesting projects,
including this paper, have been started.
There are many things that remain to be done. The first one is to
check the ability of GitHub to act as a social network; we would like
to analyze how developers in a city connect to each other and how
these actual communities change with time and what is their
background, companies, academia or user groups. Other productivity
measures could also be taken, including number of lines; besides, a
differentiation between code and artifacts could be made, in the same
way it was done by Robles et al. in \cite{robles06}.
\section{Acknowledgements}
This paper is part of the open science effort at the university of
Granada. It has been written using knitr, and its source as well as
the data used to create it can be downloaded from
\href{https://github.com/geneura-papers/github-ranking}{the GitHub repository}. It has been supported in part by
\href{http://geneura.wordpress.com}{GeNeura Team}.
\bibliographystyle{alpha}
\bibliography{geneura,rankings,blogs}
\end{document}