<span style="font-size:250%">Curating Repositories</span>

This notebook was initially designed as the active component of the [repoLibrarian](repoLibrarian.py) module (which in turn grew out of the needs of the [historical](RepoAnalysis_Historical.ipynb) part of the [RepoAnalysis](RepoAnalysis.ipynb) notebook). As many of the management queries and optimization measures were done on the fly in those notebooks, this notebook only acts as a usage example (and testing ground) for the [repoLibrarian](repoLibrarian.py) module and showcases some additional optimization measures.

# How to use the repoLibrarian module

In [2]:
%load_ext autoreload
%aimport repoLibrarian

Where are repos currently saved?

In [3]:
repoLibrarian.getReposFolder()

'/mnt/brick/crm20/repos/'

Let's use a local folder instead:

In [4]:
repoLibrarian.setReposFolder('./repos/')

'./repos/'

Which repos are already saved there?

In [13]:
list(repoLibrarian.knownRepos())

['MarioLizana/RadioControlSED.git',
 'json-iterator/java.git',
 'craigslist206/huffman.git',
 'pleonex/ChatRMI.git',
 'pleonex/CocoKiller.git',
 'pleonex/NiKate-Origins.git',
 'pleonex/locaviewer.git',
 'miken22/304-Project.git',
 'NeebalLearningPvtLtd/InventoryManagementSystem.git',
 'alibaba/arthas.git',
 'zzjove/MOMA.git',
 'tainarareis/Urutau.git',
 'italopaiva/EuVou.git',
 'ShutUpPaulo/TecProg_2016-01.git',
 'andrevctr12/PAA_HUFFMAN.git',
 'Elena-Zhao/Weibao.git',
 'Elena-Zhao/MOMA.git',
 'Elena-Zhao/Database-Auto-troubleshooting.git',
 'Elena-Zhao/Guimi.git',
 'Elena-Zhao/Mini-Chatter.git',
 'alstonlo/Bongo-Cat-Attacc.git',
 'dataspy/surprise-theory.git',
 'ieeeugrsb/ieeextreme8.git',
 'bptlab/Unicorn.git',
 'bptlab/correlation-analysis.git',
 'bptlab/scylla.git',
 'bptlab/cepta.git',
 'allantsai123/COSC310project.git',
 'JTReed/Porygon2.git',
 'JTReed/Porygon.git',
 'shengnwen/WeiBaoSSE.git',
 'lucasBrilhante/campus-party-mobile.git',
 'lucasBrilhante/das-framework-teste.git',
 

What about specific repos?

In [12]:
print(repoLibrarian.hasRepo('bptlab', 'scylla'))
print(repoLibrarian.hasRepo('bptlab', 'fcm2cpn'))

True
False


But I want that particular repo!

In [17]:
user = 'bptlab'
project = 'fcm2cpn'
repoLibrarian.downloadRepo(user, project)
print(repoLibrarian.hasRepo('bptlab', 'fcm2cpn'))

Cloned repo "bptlab/fcm2cpn"
True


If I don't want it anymore ...

In [20]:
repoLibrarian.deleteRepo('bptlab', 'fcm2cpn')

Deleted repo "bptlab/fcm2cpn"


And if I don't want to check whether the repo exists and just want a handle?

In [21]:
print(repoLibrarian.getRepo('bptlab', 'fcm2cpn'))
print(repoLibrarian.getRepo('bptlab', 'scylla'))

Cloned repo "bptlab/fcm2cpn"
<git.repo.base.Repo '/mnt/brick/home/lbein/jupyterNotebook/repos/bptlab/fcm2cpn.git'>
<git.repo.base.Repo '/mnt/brick/home/lbein/jupyterNotebook/repos/bptlab/scylla.git'>


# Optimization measures

This section includes some of the optimization measures that have been applied over time to the module.

## Choose the GitPython database

GitPython provides two git object databases: GitDB and GitCmdObjectDB. According to the [GitPython documentation](https://gitpython.readthedocs.io/en/stable/tutorial.html#object-databases) and also personal tests, GitDB is 2 to 5 times slower when extracting large quantities of small objects from densely packed repositories. As this is exactly what we want to do, GitCmdObjectDB is chosen (as opposed to GitDB, which the documentation lists as the default).
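As a minimal sketch of how the backend can be selected, assuming GitPython is installed: `Repo` accepts an `odbt` argument that picks the object database type. The helper name `open_repo_with_cmd_db` is hypothetical and not part of repoLibrarian; how the module itself passes the option is not shown in this notebook.

```python
def open_repo_with_cmd_db(path):
    """Open the repository at `path` using the git-command-backed object database.

    Hypothetical helper for illustration only; not part of repoLibrarian.
    """
    # Imported inside the function so the sketch can be pasted into a cell
    # even before GitPython is available; a top-level import works just as well.
    from git import GitCmdObjectDB, Repo

    # odbt selects the object database implementation used by this Repo.
    return Repo(path, odbt=GitCmdObjectDB)
```

A call such as `open_repo_with_cmd_db('./repos/bptlab/scylla.git')` would then return a handle whose object lookups go through the `git` command line rather than the pure-Python GitDB.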

## Faster way to check if a repo is a Java repo

Checking whether a repo is a Java repo by iterating over all files of the head commit is quite expensive for some repositories. <br>
`git ls-tree` can list all files of the head commit directly and seems promising:

In [22]:
repoLibrarian.setReposFolder('./repos/')

'./repos/'

In [24]:
repo = repoLibrarian.getRepo('alibaba', 'arthas')

The approach is to list all files currently in HEAD and check whether any of them ends with `.java`. This way, no GitPython wrapper objects need to be created or traversed.

In [25]:
next(filter(lambda x: x.endswith('.java'), repo.git.ls_tree('--full-tree', '--name-only', '-r', 'HEAD').split('\n')), None) is not None

True

In [28]:
%%timeit 
commit = list(repo.iter_commits())[0]
any(repoLibrarian.isJavaFile(obj) for obj in commit.tree.traverse())
repoLibrarian.isJavaRepo('alibaba', 'arthas')

155 ms ± 9.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [29]:
%timeit next(filter(lambda x: x.endswith('.java'), repo.git.ls_tree('--full-tree', '--name-only', '-r', 'HEAD').split('\n')), None) is not None

52.3 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The `%timeit` results show that, for this repository, the `ls-tree` approach takes roughly a third of the time of the traversal-based check (52.3 ms versus 155 ms), so this optimization was adopted.
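The `ls-tree` check can be wrapped into small reusable functions, sketched below. The names `has_java_file` and `is_java_repo_fast` are hypothetical and not taken from repoLibrarian; `repo` is assumed to be a GitPython `Repo` handle such as the one returned by `repoLibrarian.getRepo`. Splitting the parsing from the git call keeps the string check testable without a repository on disk.

```python
def has_java_file(ls_tree_output: str) -> bool:
    """Return True if any path in `git ls-tree --name-only` output ends with .java."""
    # any() stops at the first match, like next(filter(...)) in the one-liner.
    return any(path.endswith(".java") for path in ls_tree_output.splitlines())


def is_java_repo_fast(repo) -> bool:
    """Check HEAD for Java files via `git ls-tree` instead of traversing the tree.

    `repo` is assumed to expose GitPython's `repo.git` command interface.
    """
    # Same arguments as in the cells above: recurse through the whole tree
    # of HEAD and print file names only.
    listing = repo.git.ls_tree("--full-tree", "--name-only", "-r", "HEAD")
    return has_java_file(listing)
```

With the handle from above, `is_java_repo_fast(repo)` would perform the same check as the timed one-liner.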