<span style="font-size:250%">But actually, we want to analyze repos ...</span>

This notebook was aimed at developing a methodology for analyzing repositories quickly. Most of the early development history has been moved to the [RepoAnalysis History](./RepoAnalysis_Historical.ipynb) notebook. This notebook mainly contains the three big analysis runs and the improvements to the analysis methods. The [repoAnalysis.py](./repoAnalysis.py) module was created from the results of both notebooks.

Additionally, the [RepoLibrarian](./RepoLibrarian.ipynb) notebook, which manages downloaded repos, was split off from this notebook.

In [1]:
import time
from datetime import datetime
from multiprocessing import Pool
from multiprocessing import Semaphore
import multiprocessing
import sqlalchemy
import pandas
from IPython.utils import io
from IPython.display import Audio

In [2]:
%load_ext autoreload
%autoreload 2
%aimport repoAnalysis
%aimport repoLibrarian
%aimport dbUtils

---
## Preparation: Set repos work folder
(Initially I downloaded the repositories to a local folder, which I later used as a small test set for developing analysis routines.)

Local folder:

In [31]:
repoLibrarian.setReposFolder('./repos/')

'./repos/'

Shared folder:

In [13]:
repoLibrarian.setReposFolder('/mnt/brick/crm20/repos/')

'/mnt/brick/crm20/repos/'

In [14]:
repoLibrarian.getReposFolder()

'/mnt/brick/crm20/repos/'

## Recap: Current state of analysis

At this point in time, the [repoAnalysis](./repoAnalysis.py) module already included the `calculateMetrics` method, which calculates a given set of metrics for all files of a repository. A set of metrics had been developed (`repoAnalysis.metricSuite`). The result data should be written to a database, which is why the analysis of each repository returns a pandas dataframe that can easily be written to a table.
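To make the idea of a metric suite concrete, here is a minimal sketch. It is not the actual `repoAnalysis` code. It assumes that a metric is a plain function that takes the contents of one file and returns a number, and that the function names double as column names. The names `loc` and `num_comment_lines` are borrowed from the result columns, but their bodies and the `calculate_row` helper are simplified assumptions.

```python
# Hypothetical, simplified metrics: each takes one file's text and returns a number.
def loc(content):
    """Count all lines of the file."""
    return len(content.splitlines())

def num_comment_lines(content):
    """Count lines that start with a Java line or block comment marker."""
    markers = ('//', '/*', '*')
    return sum(1 for line in content.splitlines() if line.strip().startswith(markers))

# A suite is simply a list of metric functions.
suite = [loc, num_comment_lines]

def calculate_row(files, suite):
    """Sum every metric over all files; the keys are the function names."""
    return {metric.__name__: sum(metric(content) for content in files) for metric in suite}

files = [
    "// greeting\nclass A {\n}\n",
    "/* block */\nclass B {\n    int x; // trailing\n}\n",
]
row = calculate_row(files, suite)
print(row)
```

Because the keys come from `__name__`, adding a metric to the list is enough to add a column to the result.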

# Prerequisites to run suites and generate database

Metric suites could already be run on single repos. Now a way was needed to do this for all repos and write the data into a given table. Note that the column names were later renamed to snake case, because camel case caused some errors.

In [17]:
suite = repoAnalysis.metricSuite
data = repoAnalysis.calculateMetrics(('bptlab', 'scylla', 123456789), suite)
data

Time used for ('bptlab', 'scylla', 123456789): 65.44256091117859


Unnamed: 0,sha,parent,timestamp,repo_id,loc,cloc,file_count,num_methods,num_lambdas,num_comment_lines,num_reflection,num_snakes,total_indent
0,37bcca3d0bc7b1f03818ecf24b15951b444b4f76,71e5cb71cd634f5170a4285caf2636308d5eb999,1590332677,123456789,29844,25672,276,1627,226,4235,93,1741,57175.75
1,71e5cb71cd634f5170a4285caf2636308d5eb999,a49e5be01918f8012875befe18821676dc3ce98d,1590329536,123456789,29859,25687,276,1627,226,4233,93,1741,57268.75
2,a49e5be01918f8012875befe18821676dc3ce98d,54ac50b59b456cdc156b4aa18c7be17db1511950,1588277133,123456789,29849,25675,276,1627,226,4233,93,1741,57249.75
3,54ac50b59b456cdc156b4aa18c7be17db1511950,0c42536c5657adf2e910833cc72fea577d8ee1eb,1588096131,123456789,29825,25648,276,1627,226,4230,93,1741,57069.75
4,0c42536c5657adf2e910833cc72fea577d8ee1eb,81171bf7a031077d010518344a28b0d89c6f1e90,1588095924,123456789,29781,25610,276,1627,226,4223,93,1741,56867.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
475,411d51771ee6c714db13c23e8912086840b84f51,9e8faf6f0b516334451c53960dee001a6e9e42c4,1485386460,123456789,12972,10850,158,691,15,1454,31,117,25773.50
476,e36dd96a1201ad35578a5e5d65a5db02135f4614,0d893a471f932b921473f72563a540da46f15fa5,1484674213,123456789,12966,10836,155,687,15,1463,30,117,25851.00
477,0d893a471f932b921473f72563a540da46f15fa5,9e8faf6f0b516334451c53960dee001a6e9e42c4,1484673978,123456789,12966,10836,155,687,15,1463,30,117,25851.00
478,9e8faf6f0b516334451c53960dee001a6e9e42c4,e1ca54916ff570656eb1d40c13b7e58c10d8bc0a,1484305822,123456789,12922,10795,155,686,15,1466,30,117,25716.50


## Exploring database creation 

In [18]:
from sqlalchemy import MetaData, Table, Column, Integer, String

SQLAlchemy provides direct methods to create tables. The metric function names can be used as column names. Attention: the schema name has to be set!

In [22]:
columns = [Column('sha', String), Column('parent', String), Column('timestamp', Integer), Column('repo_id', Integer)]
columns = columns + list(map(lambda func: Column(func.__name__, Integer), suite))
meta = MetaData(schema='crm20')
tableName = 'lb_test2'
table = Table(
    tableName, meta,
    *columns
)
meta.create_all(dbUtils.engine)

The table has been created successfully:

In [355]:
dbUtils.runQuery('''
    SELECT column_name, data_type 
    FROM information_schema.columns
    WHERE table_name = '''+"'"+tableName+'''';
''')

Time used: 0.3105020523071289


Unnamed: 0,column_name,data_type
0,totalIndent,integer
1,numReflection,integer
2,numSnakes,integer
3,timestamp,integer
4,repoId,integer
5,loc,integer
6,cloc,integer
7,fileCount,integer
8,numMethods,integer
9,numLambdas,integer


Because of the pandas dataframe, data can easily be written to the database:

In [23]:
data.to_sql(tableName, schema='crm20', con=dbUtils.engine, if_exists='append', index=False)

And the data has been written successfully:

In [24]:
dbUtils.runQuery('''
    SELECT *
    FROM crm20.'''+tableName+'''
''')

Time used: 0.005092144012451172


Unnamed: 0,sha,parent,timestamp,repo_id,loc,cloc,file_count,num_methods,num_lambdas,num_comment_lines,num_reflection,num_snakes,total_indent
0,37bcca3d0bc7b1f03818ecf24b15951b444b4f76,71e5cb71cd634f5170a4285caf2636308d5eb999,1590332677,123456789,29844,25672,276,1627,226,4235,93,1741,57176
1,71e5cb71cd634f5170a4285caf2636308d5eb999,a49e5be01918f8012875befe18821676dc3ce98d,1590329536,123456789,29859,25687,276,1627,226,4233,93,1741,57269
2,a49e5be01918f8012875befe18821676dc3ce98d,54ac50b59b456cdc156b4aa18c7be17db1511950,1588277133,123456789,29849,25675,276,1627,226,4233,93,1741,57250
3,54ac50b59b456cdc156b4aa18c7be17db1511950,0c42536c5657adf2e910833cc72fea577d8ee1eb,1588096131,123456789,29825,25648,276,1627,226,4230,93,1741,57070
4,0c42536c5657adf2e910833cc72fea577d8ee1eb,81171bf7a031077d010518344a28b0d89c6f1e90,1588095924,123456789,29781,25610,276,1627,226,4223,93,1741,56867
...,...,...,...,...,...,...,...,...,...,...,...,...,...
475,411d51771ee6c714db13c23e8912086840b84f51,9e8faf6f0b516334451c53960dee001a6e9e42c4,1485386460,123456789,12972,10850,158,691,15,1454,31,117,25774
476,e36dd96a1201ad35578a5e5d65a5db02135f4614,0d893a471f932b921473f72563a540da46f15fa5,1484674213,123456789,12966,10836,155,687,15,1463,30,117,25851
477,0d893a471f932b921473f72563a540da46f15fa5,9e8faf6f0b516334451c53960dee001a6e9e42c4,1484673978,123456789,12966,10836,155,687,15,1463,30,117,25851
478,9e8faf6f0b516334451c53960dee001a6e9e42c4,e1ca54916ff570656eb1d40c13b7e58c10d8bc0a,1484305822,123456789,12922,10795,155,686,15,1466,30,117,25717


Mistakes will be made and experiments need to be undertaken, so cleaning up is necessary:<br>
Note: Because dropping a table does not return any rows, an error is thrown; it can safely be ignored for now.

In [25]:
dbUtils.runQuery('''
    DROP TABLE crm20.'''+tableName+'''
''')

ResourceClosedError: This result object does not return rows. It has been closed automatically.
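The error above appears because `DROP TABLE` produces no result set while the query helper still tries to read rows. One way around it is to fetch rows only when the statement actually returns some. The sketch below illustrates the idea with the standard library's `sqlite3` as a stand-in for the real database engine; `run_statement` is a hypothetical helper and not part of `dbUtils`.

```python
import sqlite3

def run_statement(connection, sql):
    """Execute sql; return rows for queries, None for statements without a result set."""
    cursor = connection.execute(sql)
    # cursor.description is None when the statement produced no result columns.
    if cursor.description is None:
        connection.commit()
        return None
    return cursor.fetchall()

connection = sqlite3.connect(':memory:')
run_statement(connection, 'CREATE TABLE lb_test (sha TEXT, loc INTEGER)')
run_statement(connection, "INSERT INTO lb_test VALUES ('abc', 42)")
rows = run_statement(connection, 'SELECT sha, loc FROM lb_test')
dropped = run_statement(connection, 'DROP TABLE lb_test')
print(rows, dropped)
```

SQLAlchemy result objects expose a comparable `returns_rows` attribute, so the same check could probably be applied inside the query helper instead of catching the exception.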

## Create utility functions

All the results from the previous section can be extracted into utility functions. These can later be moved to an appropriate module.

At the core is the `runSuite` function: it calculates the metrics for a repo and then writes them to the table.

In [28]:
dataBaseSemaphore = multiprocessing.Semaphore()
def runSuite(repo):
    with io.capture_output() as output:
        data = repoAnalysis.calculateMetrics(repo, suite)
        with dataBaseSemaphore:
            data.to_sql(tableName, schema='crm20', con=dbUtils.engine, if_exists='append', index=False)
            dbUtils.engine.dispose()
    log(output)
    return len(data) > 0

For this, the table has to be created:

In [30]:
def createTable():
    columns = [Column('sha', String), Column('parent', String), Column('timestamp', Integer), Column('repoId', Integer)]
    columns = columns + list(map(lambda func: Column(func.__name__, Integer), suite))
    meta = MetaData(schema='crm20')
    table = Table(
        tableName, meta,
        *columns
    )
    meta.create_all(dbUtils.engine)
    dbUtils.engine.dispose()

Additionally, table deletion is needed to clean up after failed runs (of which there were a lot).

In [29]:
def deleteTable():
    # Dropping a table returns no rows, so runQuery raises an error that can be ignored here
    try:
        dbUtils.runQuery('''
            DROP TABLE crm20.'''+tableName+'''
        ''')
    except Exception:
        pass

Lastly, it turned out that notebook display sessions are not suited to holding the output of longer-running processes. Therefore, output was written to a log file. This function was later extracted to the [dbUtils](dbUtils.py) module.

In [27]:
logSemaphore = multiprocessing.Semaphore()

def log(text):
    with logSemaphore:
        with open('log.txt', 'a') as file:
            file.write('========= '+str(datetime.now())+' ==========\n'+str(text))

# First iteration

After preparing a first iteration of analysis, the first run started with approx. 200 users (100 in each group).

In [12]:
tableName = 'lb_results1'

In [9]:
repoLibrarian.setReposFolder('/mnt/brick/crm20/repos/')

'/mnt/brick/crm20/repos/'

Now all the results from the other sub-projects are brought in:<br>
The projects to analyze are taken from the [data explorer](DataExplorer.ipynb):

In [14]:
polyglotProjects = dbUtils.runQuery('''SELECT * FROM lb_polyglotProjects''')
controlgroupProjects = dbUtils.runQuery('''SELECT * FROM lb_controlgroupProjects''')
bothProjects = pandas.concat([polyglotProjects, controlgroupProjects], axis=0)

Time used: 0.37739133834838867


Time used: 0.36337733268737793


Repos are downloaded and managed with the help of the [repoLibrarian](repoLibrarian.py):<br>
For this, however, the table rows must be preprocessed:

In [6]:
def rowToTuple(x):
    (index, row) = x
    split = row['url'].split('/')
    user = split[-2]
    project = split[-1]
    return (user, project, row['repo_id'])
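As a quick illustration of the preprocessing, the following sketch applies the same splitting logic to one hand-made row. A plain dictionary stands in for the pandas row, and the URL layout is an assumption about how the project tables store their links; the tuple values are taken from the first repo shown later in this notebook.

```python
def row_to_tuple(indexed_row):
    """Turn an (index, row) pair into a (user, project, repo_id) tuple."""
    index, row = indexed_row
    parts = row['url'].split('/')
    # The last two path segments are expected to be the user and the project name.
    return (parts[-2], parts[-1], row['repo_id'])

# Hypothetical row; the URL layout is an assumption about the project table.
example = (0, {'url': 'https://api.github.com/repos/twitter/scalding', 'repo_id': 1182})
result = row_to_tuple(example)
print(result)
```

Note that a URL with a trailing slash would shift the segments, so such rows would need to be normalized first.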

Downloading the repos takes longer than expected and shows that many repos are not accessible anymore. Many of them turned out to have just been renamed; however, implementing a GitHub rename detection strategy is out of scope for this project. It is also useful to filter the repositories by Java, so that no unnecessary downloads occur.

In [16]:
repos = list(filter(lambda tupl: repoLibrarian.isJavaRepo(tupl[0], tupl[1]), map(rowToTuple, bothProjects.iterrows())))

Failed to check ('thiagoruis', 'dotNET-Grupo-2'): Reference at 'refs/heads/master' does not exist
Could not download repo "tananaev/traccar-client": Cmd('git') failed due to: exit code(128)
  cmdline: git clone --bare -v https://github.com/tananaev/traccar-client.git /mnt/brick/crm20/repos/tananaev/traccar-client.git
  stderr: 'Cloning into bare repository '/mnt/brick/crm20/repos/tananaev/traccar-client.git'...
fatal: could not read Username for 'https://github.com': No such device or address
'
Failed to check ('tananaev', 'traccar-client'): Reference at 'refs/heads/master' does not exist
Could not download repo "topicusonderwijs/wicket-openid": Cmd('git') failed due to: exit code(128)
  cmdline: git clone --bare -v https://github.com/topicusonderwijs/wicket-openid.git /mnt/brick/crm20/repos/topicusonderwijs/wicket-openid.git
  stderr: 'Cloning into bare repository '/mnt/brick/crm20/repos/topicusonderwijs/wicket-openid.git'...
fatal: could not read Username for 'https://github.com': No

A comparison shows how many repos of the experiment groups can actually be downloaded and are Java repos: a bit less than half.

In [700]:
print(len(bothProjects))
print(len(repos))

1377
620


Then the run started. The commented line shows that there were preliminary runs with the reduced local dataset (and random repo ids).<br>
The results of this run can be found in table `lb_results1` and their evaluation in the [Results_Iteration#1](Results_Iteration#1.ipynb) notebook.

In [None]:
deleteTable()
createTable()
#repos = list(map(lambda tupl: (*tupl, int.from_bytes(bytearray(str(tupl), 'utf-8'), byteorder='big', signed=False) % 10000000), repoLibrarian.managedRepos()))
start = time.time()
with Pool(int(multiprocessing.cpu_count()/4)) as pool:
    allMetrics = pool.map(runSuite, repos)
end = time.time()
dbUtils.log('Total Time used: '+str(end - start))
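The commented-out line in the cell above derives placeholder repo ids for the local test set. Strictly speaking, these ids are not random but deterministic: the text of the tuple is read as one big integer and reduced modulo ten million. A minimal sketch of the same idea follows, assuming that `repoLibrarian.managedRepos()` yields `(user, project)` tuples; `pseudo_repo_id` is a hypothetical name.

```python
def pseudo_repo_id(user, project, modulus=10_000_000):
    """Derive a reproducible placeholder id from the repo's (user, project) tuple."""
    raw = str((user, project)).encode('utf-8')
    # Interpret the bytes as one large unsigned integer and keep the last seven digits.
    return int.from_bytes(raw, byteorder='big', signed=False) % modulus

first = pseudo_repo_id('bptlab', 'scylla')
second = pseudo_repo_id('bptlab', 'scylla')
print(first, first == second)
```

The same repo always receives the same id, which keeps test runs comparable, but collisions between different repos are possible.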

# Second iteration

In [37]:
tableName = 'lb_results2'

`createTable` had to be revised for two reasons. First, the additional columns `additions` and `deletions` were introduced (these were formerly extracted from the table and are now calculated directly during analysis). Second, the column names were switched from camel case to snake case, as there were errors when running queries on attributes in camel case.

In [38]:
def createTable():
    columns = [Column('sha', String), Column('parent', String), Column('timestamp', Integer), Column('repo_id', Integer), Column('additions', Integer), Column('deletions', Integer)]
    columns = columns + list(map(lambda func: Column(func.__name__, Integer), suite))
    meta = MetaData(schema='crm20')
    table = Table(
        tableName, meta,
        *columns
    )
    meta.create_all(dbUtils.engine)
    dbUtils.engine.dispose()
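The switch from camel case to snake case described above can be sketched with a small conversion helper. A plausible cause of the errors, assuming the database is PostgreSQL, is that unquoted identifiers are folded to lower case, so a column created as `repoId` can only be addressed with quotes. The notebook does not record the exact error, so this remains an assumption, and `camel_to_snake` is a hypothetical helper.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case."""
    # Put an underscore before every upper-case letter that follows a lower-case letter or digit.
    return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()

old_columns = ['repoId', 'fileCount', 'numMethods', 'totalIndent', 'loc']
new_columns = [camel_to_snake(column) for column in old_columns]
print(new_columns)
```

Since the column names are taken from the metric function names, renaming the functions themselves has the same effect and avoids a separate mapping step.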

Writing to the database gained its own function for better decomposition:

In [39]:
dataBaseSemaphore = multiprocessing.Semaphore()
def writeDataToDb(data):
    with dataBaseSemaphore:
        data.to_sql(tableName, schema='crm20', con=dbUtils.engine, if_exists='append', index=False)
        dbUtils.engine.dispose()

Running the suite was changed from the old calculation of total counts to calculating deltas only:

In [40]:
def runDeltaSuite(repo):
    with io.capture_output() as output:
        data = repoAnalysis.calculateDeltaMetrics(repo, suite)
        writeDataToDb(data)
    dbUtils.log(output)
    return len(data) > 0
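The delta idea can be sketched as follows: instead of evaluating every file of the project for every commit, only the files touched by a commit are evaluated, once in the parent version and once in the new version, and the difference is recorded. This is a simplified illustration and not the actual `calculateDeltaMetrics` implementation; `delta_for_commit` and the `loc` metric shown here are hypothetical.

```python
def loc(content):
    """Hypothetical metric: number of lines in a file."""
    return len(content.splitlines())

def delta_for_commit(changed_files, suite):
    """changed_files maps path -> (old_content, new_content); '' stands for a missing file."""
    deltas = {metric.__name__: 0 for metric in suite}
    for old_content, new_content in changed_files.values():
        for metric in suite:
            deltas[metric.__name__] += metric(new_content) - metric(old_content)
    return deltas

changed = {
    'src/A.java': ('class A {\n}\n', 'class A {\n    int x;\n}\n'),  # one line added
    'src/B.java': ('class B {\n}\n', ''),                            # file deleted
}
delta = delta_for_commit(changed, [loc])
print(delta)
```

This only reproduces the totals when the metrics are additive over files, which holds for simple counts such as lines or methods.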

In [12]:
repoLibrarian.setReposFolder('/mnt/brick/crm20/repos/')

'/mnt/brick/crm20/repos/'

In [13]:
polyglotProjects = dbUtils.runQuery('''SELECT * FROM lb_polyglotProjects''', mute=True)
controlgroupProjects = dbUtils.runQuery('''SELECT * FROM lb_controlgroupProjects''', mute=True)
bothProjects = pandas.concat([polyglotProjects, controlgroupProjects], axis=0)

Time used: 0.4349958896636963
Time used: 0.34889984130859375


In [24]:
start = time.time()
repos = list(filter(lambda tupl: repoLibrarian.isJavaRepo(tupl[0], tupl[1]), map(rowToTuple, bothProjects.iterrows())))
end = time.time()
dbUtils.log('Total Time used: '+str(end - start))

Failed to check ('thiagoruis', 'dotNET-Grupo-2'): Reference at 'refs/heads/master' does not exist
Failed to check ('tananaev', 'traccar-client'): Reference at 'refs/heads/master' does not exist
Failed to check ('topicusonderwijs', 'wicket-openid'): Reference at 'refs/heads/master' does not exist
Failed to check ('dav009', 'graficadora-tweets'): Reference at 'refs/heads/master' does not exist
Failed to check ('dav009', 'MWE-DictionaryExtractor'): Reference at 'refs/heads/master' does not exist
Failed to check ('OpenGamma', 'OG-RStats'): Reference at 'refs/heads/master' does not exist


KeyboardInterrupt: 

In [25]:
print(len(bothProjects))
print(len(repos))

1377
620


In [26]:
dbUtils.log('test')

#### Dry test to see if everything is up

In [21]:
repos[0]

('twitter', 'scalding', 1182)

In [27]:
deleteTable()
createTable()
runDeltaSuite(repos[0])
display(Audio('./beep.mp3', autoplay=True))

#### Let's go

In [None]:
deleteTable()
createTable()
start = time.time()
with Pool(int(multiprocessing.cpu_count()*3/4)) as pool:
    allMetrics = pool.map(runDeltaSuite, repos)
end = time.time()
dbUtils.log('Total Time used: '+str(end - start))

# Third (final) iteration 

The final run was started shortly before the submission. It mainly uses the same technology as the run before, but scales up by a factor of 10. It is meant to revise and refactor the methodology and to run a big stress test on it, so that it can be released.

In [31]:
tableName = 'lb_results3'

### Prepare the repos

In [10]:
count = 0
total = len(repoTuples)
def processRepoTuple(tupl):
    isJavaRepo = False
    with io.capture_output() as output:
        global count
        count = count + 1
        print('Looking at '+str(tupl)+' - ('+str(count)+'/'+str(total)+')')
        isJavaRepo = repoLibrarian.isJavaRepo(tupl[0], tupl[1])
    dbUtils.log(output)
    return isJavaRepo

In [33]:
def prepareRepos():
    dbUtils.log('Starting to download repos')
    start = time.time()
    repos = list(filter(processRepoTuple, repoTuples))
    end = time.time()
    dbUtils.log('Total Time used: '+str(end - start))
    return repos

In [8]:
polyglotProjects = dbUtils.runQuery('''SELECT * FROM lb_polyglotProjects''', mute=True)
controlgroupProjects = dbUtils.runQuery('''SELECT * FROM lb_controlgroupProjects''', mute=True)
bothProjects = pandas.concat([polyglotProjects, controlgroupProjects], axis=0)
repoTuples = list(map(rowToTuple, bothProjects.iterrows()))

Time used: 0.5708019733428955
Time used: 0.42449331283569336


In [None]:
repos = prepareRepos()

In [34]:
print(len(repoTuples))
print(len(repos))

16692
7676


## Start the run

In [35]:
def runFullAnalysis():
    deleteTable()
    createTable()
    start = time.time()
    with Pool(int(multiprocessing.cpu_count()*3/4)) as pool:
        allMetrics = pool.map(runDeltaSuite, repos)
    end = time.time()
    dbUtils.log('Total Time used: '+str(end - start))

In [None]:
runFullAnalysis()

---
# Can we get commit changes information from repositories? <a class="anchor" id="analysis-second"></a>
### (And can we do it fast enough?)

In [359]:
testRepoId = ('tfox12', 'REST-Debugger')

In [360]:
repoLibrarian.hasRepo(*testRepoId)

True

In [412]:
from git.db import GitDB
from git.db import GitCmdObjectDB
from git import Repo 

In [586]:
testRepo = Repo.init(repoLibrarian.pathFor(*testRepoId), bare=True, odbt=GitCmdObjectDB)

In [418]:
start = time.time()
for commit in testRepo.iter_commits():
    for obj in commit.tree.traverse():
        if obj.type == 'blob' and obj.name.endswith('.java'):
            contentWithHeader = obj.data_stream.read().decode("CP437")#.decode("utf-8")
end = time.time()
print('Total Time used: '+str(end - start))

start = time.time()
res = []
for commit in testRepo.iter_commits():
    for obj in commit.tree.traverse():
        if obj.type == 'blob' and obj.name.endswith('.java'):
            contentWithHeader = obj.data_stream.read().decode("CP437")#.decode("utf-8")
    res.append(sum(map(lambda file: file[1]['lines'], (filter(lambda file: file[0].endswith('.java'), commit.stats.files.items())))))
end = time.time()
print('Total Time used: '+str(end - start))

Total Time used: 5.757571458816528
Total Time used: 140.97127866744995


In [477]:
EMPTY_TREE_SHA   = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
start = time.time()
res = []
for commit in testRepo.iter_commits():
    for obj in commit.tree.traverse():
        if obj.type == 'blob' and obj.name.endswith('.java'):
            contentWithHeader = obj.data_stream.read().decode("CP437")#.decode("utf-8")
    if len(commit.parents) == 1:
        print(testRepo.git.diff(commit.parents[0].hexsha, commit.hexsha, '--', numstat=True))
end = time.time()
print('Total Time used: '+str(end - start))

144	0	web/lib/debugger.js
1	1	src/capstone/wrapper/GdbWrapper.java
11	5	web/test.html
22	2	src/capstone/wrapper/GdbWrapper.java
2	1	makefile
11	0	src/capstone/daemon/DaemonHandler.java
1	1	src/capstone/wrapper/PdbWrapper.java
2	0	test/capstone/wrapper/GdbWrapperTest.java
20	108	web/test.html
10	6	src/capstone/daemon/DaemonHandler.java
231	0	src/capstone/wrapper/PdbWrapper.java
129	123	src/capstone/wrapper/Wrapper.java
31	0	test/capstone/wrapper/PdbWrapperTest.java
1	0	.gitignore
5	1	src/capstone/wrapper/Wrapper.java
12	2	src/capstone/wrapper/GdbWrapper.java
1	1	src/capstone/wrapper/Wrapper.java
1	1	src/capstone/util/StackFrame.java
1	1	src/capstone/daemon/DaemonHandler.java
0	1	src/capstone/wrapper/GdbWrapper.java
1	1	src/capstone/daemon/DaemonHandler.java
1	0	src/capstone/wrapper/Wrapper.java
57	3	src/capstone/wrapper/GdbWrapper.java
2	2	src/capstone/wrapper/Wrapper.java
5	0	test/capstone/wrapper/GdbWrapperTest.java
28	15	src/capstone/wrapper/GdbWrapper.java
120	13	src/capstone/wrappe

KeyboardInterrupt: 

- That works, but it takes far too long
- Try to look into the internals of GitPython; iterating over commits (and all files!) is fast
- Iterates over the rev-list and creates a commit for each entry ... `testRepo.git.rev_list(testRepo.head.commit)`
- Iterating over the objects of a commit takes surprisingly little time -> `iter_commits` needs 90% of the time
- The time is spent in cmd.py>>_call_process, which is reached via `__getattr__` --> gets called on attribute access
- Comes from repo.git, which is GitCommandWrapperType, which is cmd.py>>Git
- `_iter_from_process_or_stream` just reads hexshas and reads commits
- `data_stream` calls odb.stream(binsha)
- So file contents are accessed directly via a hash - no way for me to get the diff that way...

Soooo, remember `iter_commits` taking 90% of the time? That's because it calls rev-list. If we replace that with `git log --numstat`, it's not much slower, but we get detailed diff information - we can filter that for Java files, sum it up, and even pursue a diff-based approach instead of calculating the whole project metrics every time.

In [482]:
def toPrun():
    for commit in testRepo.iter_commits():
        for obj in commit.tree.traverse(predicate=lambda obj, depth: obj.type == 'blob' and obj.name.endswith('.java')):
            contentWithHeader = obj.data_stream.read().decode("CP437")#.decode("utf-8")
            print(commit.stats.files.items())
#%prun -s cumulative toPrun()
toPrun()

dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])
dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])
dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])
dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])
dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])
dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])
dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])
dict_items([('web/lib/debugger.js', {'insertions': 144, 'deletions': 0, 'lines': 144})])


KeyboardInterrupt: 

In [469]:
from git.objects import Commit
proc = testRepo.git.rev_list(testRepo.head.commit, as_process=True)
print(proc)
commits = Commit._iter_from_process_or_stream(testRepo, proc)
for commit in commits:
    print(commit.hexsha)

<git.cmd.Git.AutoInterrupt object at 0x7f54bb7e1970>
ea393258a29f93539e371156cb77ea8eb2755b31
8a94a36e5448dd83ef4dafc76cc7650e8b20f791
cf84234e66b5a9940a3654b3bf6c17e8ec23e03c
468a8e7939c7328d3af5b95fe8c6832966df1b52
e78ae0397b209312ba80b709ba6996b558bb13bf
d85a3642c97d44d6c4b0bcd145c68f420b3c2d96
9dcfe3d3e498dfb06ae0f8c00dd1eda72bda070d
81a242f338713512b8d6b52d2a228a2be2f70d27
6fcc68689f3772ecaa1a6a011510ff26d7ee48d6
3f93813edef9786f6cb8ced6a7441783da3f8884
34503c976c764ce103baf2c10721ffa3656f20f8
1b3e586b70006b9a21d9a35c5029c1e48cd95837
a198e51031eadf8a013cc3711e37656f9edbe27b
3682dc8523e86a00251a237258047429c1bb6cf9
58c4dcce5ee8f618c0b12fd9053a79df8558b5a6
0f4e0bb6b43d088baca4e874d2641c6bc3e427cf
49f326042cf26b2b9ead9b675b6a830704092fa3
5be9b9fa33c509896fc1ad3c0e9d09680135153f
e7576bd7f72e7e5286d23bbad167cf3f394bed9c
ed4dd39de749f287ad3036832ebcf24126140b72
6c51645d3341163b45552ccfc47c162c8a0d0dc8
377ccc918b4b7d3f390bcba84777e1555e501622
0c5ddda62b6911aaec177be1f0e3d028dd94f94c
9c6d

In [476]:
testRepo.git.for_each_ref()

'ea393258a29f93539e371156cb77ea8eb2755b31 commit\trefs/heads/master'

### Git diff shortstat approach for delta approach

In [711]:
start = time.time()
res = []
for commit in testRepo.iter_commits():
    for obj in commit.tree.traverse():
        if obj.type == 'blob' and obj.name.endswith('.java'):
            contentWithHeader = obj.data_stream.read().decode("CP437")#.decode("utf-8")
    if len(commit.parents) == 1:
        print(testRepo.git.diff('--shortstat', commit.parents[0].hexsha, commit.hexsha))
end = time.time()
print('Total Time used: '+str(end - start))

 3 files changed, 8 insertions(+), 6 deletions(-)
 1 file changed, 3 insertions(+), 1 deletion(-)
 2 files changed, 11 insertions(+), 5 deletions(-)
 6 files changed, 179 insertions(+), 216 deletions(-)
 7 files changed, 178 insertions(+), 161 deletions(-)


KeyboardInterrupt: 
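To turn the `--shortstat` lines shown above into numbers, a small parser is enough. The sketch below is an illustration under the assumption that the lines follow the format seen in the output; `parse_shortstat` is a hypothetical helper. Git omits the insertions or deletions part when its count is zero, so both parts are optional.

```python
import re

SHORTSTAT = re.compile(
    r'(\d+) files? changed'
    r'(?:, (\d+) insertions?\(\+\))?'
    r'(?:, (\d+) deletions?\(-\))?'
)

def parse_shortstat(line):
    """Return (files, insertions, deletions); parts that git omits count as 0."""
    match = SHORTSTAT.search(line)
    if match is None:
        return (0, 0, 0)
    return tuple(int(group or 0) for group in match.groups())

full = parse_shortstat(' 3 files changed, 8 insertions(+), 6 deletions(-)')
partial = parse_shortstat(' 1 file changed, 144 insertions(+)')
print(full, partial)
```

Note that the `--shortstat` totals cover all changed files, so a Java-only count would need a path filter or the per-file output of `--numstat`.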

### Manual diff approach for delta approach

In [559]:
import diff as diff_lib
start = time.time()
for commit in testRepo.iter_commits():
    if len(commit.parents) == 1:
        #print(commit.hexsha)
        for obj in commit.tree.traverse(predicate=lambda obj, depth: obj.type == 'blob' and obj.name.endswith('.java')):
            try:
                obj2 = commit.parents[0].tree / obj.path
                content1 = obj.data_stream.read().decode("CP437")#.decode("utf-8")
                content2 = obj2.data_stream.read().decode("CP437")#.decode("utf-8")
                if not (content1 == content2):
                    #print('\tchanged '+obj.path)
                    diff = diff_lib.diff(content1.split('\n'), content2.split('\n'))
                    additions = sum(map(lambda change: change[1], diff))
                    deletions = sum(map(lambda change: change[3], diff))
                    #print(str(additions)+'\t'+str(deletions)+'\t'+obj.path)
            except KeyError:
                pass
                #print('\tadded '+obj.path)
        #print('---\n'+testRepo.git.diff(commit.parents[0].hexsha, commit.hexsha, '--', numstat=True))
        print(testRepo.git.diff(commit.parents[0].hexsha, commit.hexsha))
        #print('\n')
end = time.time()
print('Total Time used: '+str(end - start))

diff --git a/web/lib/debugger.js b/web/lib/debugger.js
new file mode 100644
index 0000000..a6f5752
--- /dev/null
+++ b/web/lib/debugger.js
@@ -0,0 +1,144 @@
+// Debugger Web API
+// Author: ntietz
+// Date:   April 19 2013
+
+var currentline = 1;
+var userId = 1;
+var debuggerId = 1;
+
+function showresponse(data)
+{
+    alert(JSON.stringify(data));
+}
+
+function failhandler(call, textStatus, errorThrown)
+{
+    alert("Error: the request failed!");
+    alert(JSON.stringify(call));
+    alert(textStatus);
+    alert(errorThrown);
+}
+
+function clearHighlightedLines()
+{
+    for (var i = 0; i < codeEditor.lineCount(); i++)
+    {
+        var line = codeEditor.getLineHandle(i);
+        codeEditor.removeLineClass(line, "background", "capstone-errorline-background");
+        codeEditor.removeLineClass(line, "background", "capstone-currentline-background");
+    }
+}
+
+function highlightErrorLine(linenumber)
+{
+    var line = codeEditor.getLineHandle(linenumber-1);
+    codeEditor

KeyboardInterrupt: 
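The manual approach boils down to comparing two versions of a file line by line and counting added and removed lines. The following sketch shows that counting step in isolation. It swaps in the standard library's `difflib` for the `diff` module imported above, so the interface differs from the cell; `count_changes` is a hypothetical helper.

```python
import difflib

def count_changes(old_text, new_text):
    """Return (additions, deletions) in lines between two versions of a file."""
    additions = deletions = 0
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith('+ '):
            additions += 1   # line exists only in the new version
        elif line.startswith('- '):
            deletions += 1   # line exists only in the old version
    return additions, deletions

old = 'class A {\n    int x;\n}\n'
new = 'class A {\n    int y;\n    int z;\n}\n'
changes = count_changes(old, new)
print(changes)
```

A changed line counts as one deletion plus one addition, which is also how git reports it in `--numstat`.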

In [522]:
import diff as diff_lib
start = time.time()
for commit in testRepo.iter_commits():
    if len(commit.parents) == 1:
        #print(commit.hexsha)
        for obj in commit.tree.traverse(predicate=lambda obj, depth: obj.type == 'blob' and obj.name.endswith('.java')):
            content1 = obj.data_stream.read().decode("CP437")#.decode("utf-8")
                #print('\tadded '+obj.path)
        #print('---\n'+testRepo.git.diff(commit.parents[0].hexsha, commit.hexsha, '--', numstat=True))
        #print(testRepo.git.diff(commit.parents[0].hexsha, commit.hexsha))
        #print('\n')
end = time.time()
print('Total Time used: '+str(end - start))

Total Time used: 2.238713026046753


### Actual git log approach for delta approach

In [641]:
from git.util import hex_to_bin
from git import Commit

def safeToInt(string):
    return 0 if string == '-' else int(string)

def block_to_stats(block):
    lines = block.split('\n')
    header = lines[0]
    lines = filter(lambda line: line.endswith('.java'), lines)
    changed_files = list(map(lambda line: line.split('\t'), lines))
    #changed_files = list(changed_files)
    additions = sum(map(lambda file: safeToInt(file[0]), changed_files))
    deletions = sum(map(lambda file: safeToInt(file[1]), changed_files))
    return (header, (changed_files, additions, deletions))

def file_contents(tree, path):
    try: 
        obj = tree / path
        return obj.data_stream.read().decode("CP437")
    except KeyError:
        return ''
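To show what the parsing helpers above produce, here is a self-contained sketch that repeats their logic and applies it to a short, made-up log in the shape of `git log --numstat --format=//%H`. The shas and paths in the sample are hypothetical, and the function names use snake case only to keep the example independent of the cell above.

```python
def safe_to_int(value):
    """git prints '-' instead of a count for binary files."""
    return 0 if value == '-' else int(value)

def block_to_stats(block):
    """Split one '//<sha>' log block into the sha and its Java-only change counts."""
    lines = block.split('\n')
    sha = lines[0]
    changed_files = [line.split('\t') for line in lines if line.endswith('.java')]
    additions = sum(safe_to_int(added) for added, _, _ in changed_files)
    deletions = sum(safe_to_int(removed) for _, removed, _ in changed_files)
    return sha, (changed_files, additions, deletions)

# Made-up sample in the shape produced by: git log --numstat --format=//%H
log = (
    '//aaaa\n\n'
    '3\t2\tsrc/main/java/Plugin.java\n'
    '-\t-\tdocs/logo.png\n'
    '//bbbb\n\n'
    '1\t1\tREADME.md\n'
)
stats = dict(block_to_stats(block) for block in log.split('//')[1:])
print(stats['aaaa'][1:], stats['bbbb'][1:])
```

Commits that touch no Java file end up with empty lists and zero counts, so they can be skipped cheaply before any file contents are read.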

In [640]:
testRepo = Repo.init('./repos/bptlab/scylla.git', bare=True)
start = time.time()
log = testRepo.git.log('--numstat', '--format=//%H', '--all')
#print(log)
commits = log.split('//')[1:]
changes = map(block_to_stats, commits)

for hexsha, change in changes:
    commit = Commit(testRepo, hex_to_bin(hexsha))
    if len(commit.parents) == 1:
        changed_files, additions, deletions = change
        for added, removed, file in changed_files:
            content1 = file_contents(commit.tree, file)
            content2 = file_contents(commit.parents[0].tree, file)

end = time.time()
print('Total Time used: '+str(end - start))

Total Time used: 9.903306484222412


In [577]:
print(testRepo.git.log('--shortstat', '--format=//%H'))

//ea393258a29f93539e371156cb77ea8eb2755b31

 1 file changed, 144 insertions(+)
//8a94a36e5448dd83ef4dafc76cc7650e8b20f791

 2 files changed, 12 insertions(+), 6 deletions(-)
//cf84234e66b5a9940a3654b3bf6c17e8ec23e03c

 1 file changed, 22 insertions(+), 2 deletions(-)
//468a8e7939c7328d3af5b95fe8c6832966df1b52

 5 files changed, 36 insertions(+), 110 deletions(-)
//e78ae0397b209312ba80b709ba6996b558bb13bf

 4 files changed, 401 insertions(+), 129 deletions(-)
//d85a3642c97d44d6c4b0bcd145c68f420b3c2d96

 1 file changed, 1 insertion(+)
//9dcfe3d3e498dfb06ae0f8c00dd1eda72bda070d

 1 file changed, 5 insertions(+), 1 deletion(-)
//81a242f338713512b8d6b52d2a228a2be2f70d27

 2 files changed, 13 insertions(+), 3 deletions(-)
//6fcc68689f3772ecaa1a6a011510ff26d7ee48d6

 1 file changed, 1 insertion(+), 1 deletion(-)
//3f93813edef9786f6cb8ced6a7441783da3f8884

 2 files changed, 1 insertion(+), 2 deletions(-)
//34503c976c764ce103baf2c10721ffa3656f20f8

 1 file changed, 1 insertion(+), 1 deletion(-)

In [557]:
import subprocess
start = time.time()
print(subprocess.check_output('git -C scylla.git log --numstat --format=//%H', shell=True).decode('utf-8'))
end = time.time()
print('Total Time used: '+str(end - start))

//ee00a205a2225ad73b0264cf7ba64be5c7044d0b

3	2	src/main/java/de/hpi/bpt/scylla/plugin/batch/BatchResourceQueueUpdatedPlugin.java
4	3	src/main/java/de/hpi/bpt/scylla/plugin_type/simulation/resource/ResourceQueueUpdatedPluggable.java
1	1	src/main/java/de/hpi/bpt/scylla/simulation/SimulationModel.java
//aa9335c1b629e660d48cf6cb923bf345f79f0619

3	1	src/main/java/de/hpi/bpt/scylla/plugin_loader/ExternalJarLoader.java
//b215b9f8ea72fcbadb5881c73ed949a21f456601

9	3	src/main/java/de/hpi/bpt/scylla/plugin_type/simulation/resource/ResourceAssignmentPluggable.java
2	2	src/main/java/de/hpi/bpt/scylla/simulation/QueueManager.java
//d27a25f6aca8eb7820b5ee4e2d87f65ecf7c454e

2	2	src/main/java/de/hpi/bpt/scylla/plugin/batch/BatchStashResourceEvent.java
88	208	src/main/java/de/hpi/bpt/scylla/simulation/QueueManager.java
86	3	src/main/java/de/hpi/bpt/scylla/simulation/SimulationModel.java
1	1	src/main/java/de/hpi/bpt/scylla/simulation/event/BPMNEndEvent.java
1	1	src/main/java/de/hpi/bpt/scylla/simula

In [608]:
from git.util import hex_to_bin
from git import Commit, Repo

repo = Repo.init('./repos/bptlab/scylla.git', bare=True)
start = time.time()
log = repo.git.log('--numstat', '--format=//%H', '--all')
blocks = log.split('//')[1:]
commits = map(lambda block: block.split('\n')[0], blocks)
for hexsha in commits:
    commit = Commit(repo, hex_to_bin(hexsha))
    for obj in commit.tree.traverse(predicate=lambda obj, depth: obj.type == 'blob' and obj.name.endswith('.java')):
        content = obj.data_stream.read().decode("CP437")
end = time.time()
print('Total Time used: '+str(end - start))

In [616]:
start = time.time()
for commit in repo.iter_commits('--all'):
    for obj in commit.tree.traverse(predicate=lambda obj, depth: obj.type == 'blob' and obj.name.endswith('.java')):
        content = obj.data_stream.read().decode("CP437")
end = time.time()
print('Total Time used: '+str(end - start))

Total Time used: 11.602530241012573
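
The `start = time.time()` / `print(...)` pattern is repeated in most of the timing cells here. A minimal sketch of how it could be wrapped in a context manager; `timed` is a hypothetical helper and not part of `repoAnalysis`:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label='Total Time used'):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Store the duration so the caller can also use it programmatically.
        result['seconds'] = time.perf_counter() - start
        print(label + ': ' + str(result['seconds']))

# Stand-in workload; in the notebook this would be the commit traversal.
with timed() as timing:
    total = sum(range(100000))
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic, which makes it a safer choice for measuring durations.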


In [671]:
%aimport repoAnalysis
#testRepoTuple = ('tfox12', 'REST-Debugger', '1337')
testRepoTuple = ('brockn', 'incubator-parquet-mr', 11108627)

In [672]:
repoAnalysis.calculateMetrics(testRepoTuple)

Time used for ('brockn', 'incubator-parquet-mr', 11108627): 435.4707381725311


Unnamed: 0,sha,parent,timestamp,repoId,additions,deletions,loc,cloc,fileCount,numMethods,numLambdas,numCommentLines,numReflection,numSnakes,totalIndent
0,69ba4844730426a212c609facd93b33bf6692b3a,be1222ef4a3260ddcf516d73c6ceecd144a134cb,1412699955,11108627,0,0,71466,56538,489,5214,11,6229,360,1966,54355.75
1,be1222ef4a3260ddcf516d73c6ceecd144a134cb,da9129927bce90feb6d2860745263f4d74d0dfa8,1412198064,11108627,0,0,71461,56534,489,5214,11,6229,360,1964,54351.75
2,da9129927bce90feb6d2860745263f4d74d0dfa8,0b17cbee9541998df66d33c8a99b675ced80d9aa,1412196285,11108627,0,0,71454,56527,489,5214,11,6229,360,1960,54340.75
3,0b17cbee9541998df66d33c8a99b675ced80d9aa,bf20abbf4825fa5892d8e15c066e768671a39289,1412017203,11108627,0,0,71157,56235,489,5197,11,6229,360,1849,53845.25
4,bf20abbf4825fa5892d8e15c066e768671a39289,3a082e8e390898646c094d20f4ec1eeba45b79ac,1411688756,11108627,0,0,71153,56231,489,5197,11,6229,360,1849,53840.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1830,dbbf9443ffe219862c9744806f34deb2606d21c7,f3cdad30930c2cd73404804cf4ceeca237509a2c,1346885521,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50
1831,24028799609ef606359d6bee39acd34d8d67a067,f3cdad30930c2cd73404804cf4ceeca237509a2c,1346454047,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50
1832,f3cdad30930c2cd73404804cf4ceeca237509a2c,a8c10efccf35977193cab80b0f17d6a2f7d066d9,1346452819,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50
1833,a8c10efccf35977193cab80b0f17d6a2f7d066d9,576c709724551a5122ae9b9e314b6c400f5f778d,1346452652,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50


In [673]:
def alternativeCalculateMetrics(repoTuple, metricSuite=repoAnalysis.metricSuite):
    (user, project, repoId) = repoTuple
    repo = repoLibrarian.getRepo(user, project)
    columns = ['sha', 'parent', 'timestamp', 'repoId', 'additions', 'deletions'] + list(map(lambda fun: fun.__name__, metricSuite))
    results = []
    try:
        start = time.time()
        log = repo.git.log('--numstat', '--format=//%H', '--all')
        blocks = log.split('//')[1:]
        commits = map(lambda block: block.split('\n')[0], blocks)
        for hexsha in commits:
            commit = Commit(repo, hex_to_bin(hexsha))
            results.append(repoAnalysis.metricsForCommit(commit, metricSuite, repoId))
        df = pandas.DataFrame(results, columns=columns)
        end = time.time()
        print('Time used for '+str(repoTuple)+': '+str(end - start))
        return df
    except Exception as e:
        print('Failed to analyze '+str(repoTuple)+': '+str(e))
        return []

In [674]:
alternativeCalculateMetrics(testRepoTuple)

Time used for ('brockn', 'incubator-parquet-mr', 11108627): 439.26439571380615


Unnamed: 0,sha,parent,timestamp,repoId,additions,deletions,loc,cloc,fileCount,numMethods,numLambdas,numCommentLines,numReflection,numSnakes,totalIndent
0,69ba4844730426a212c609facd93b33bf6692b3a,be1222ef4a3260ddcf516d73c6ceecd144a134cb,1412699955,11108627,0,0,71466,56538,489,5214,11,6229,360,1966,54355.75
1,be1222ef4a3260ddcf516d73c6ceecd144a134cb,da9129927bce90feb6d2860745263f4d74d0dfa8,1412198064,11108627,0,0,71461,56534,489,5214,11,6229,360,1964,54351.75
2,da9129927bce90feb6d2860745263f4d74d0dfa8,0b17cbee9541998df66d33c8a99b675ced80d9aa,1412196285,11108627,0,0,71454,56527,489,5214,11,6229,360,1960,54340.75
3,0b17cbee9541998df66d33c8a99b675ced80d9aa,bf20abbf4825fa5892d8e15c066e768671a39289,1412017203,11108627,0,0,71157,56235,489,5197,11,6229,360,1849,53845.25
4,bf20abbf4825fa5892d8e15c066e768671a39289,3a082e8e390898646c094d20f4ec1eeba45b79ac,1411688756,11108627,0,0,71153,56231,489,5197,11,6229,360,1849,53840.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1830,dbbf9443ffe219862c9744806f34deb2606d21c7,f3cdad30930c2cd73404804cf4ceeca237509a2c,1346885521,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50
1831,24028799609ef606359d6bee39acd34d8d67a067,f3cdad30930c2cd73404804cf4ceeca237509a2c,1346454047,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50
1832,f3cdad30930c2cd73404804cf4ceeca237509a2c,a8c10efccf35977193cab80b0f17d6a2f7d066d9,1346452819,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50
1833,a8c10efccf35977193cab80b0f17d6a2f7d066d9,576c709724551a5122ae9b9e314b6c400f5f778d,1346452652,11108627,0,0,2858,2502,40,323,3,71,7,2,2190.50


In [695]:
def deltaMetricsForCommit(commit, metricSuite, repoId, change):
    # Only meaningful for commits with exactly one parent; the caller filters for that.
    resultTuple = {
        'sha' : commit.hexsha,
        'parent' : commit.parents[0].hexsha if len(commit.parents) == 1 else None,
        'timestamp' : commit.committed_date,
        'repoId' : repoId
    }
    for metricFunction in metricSuite:
        resultTuple[metricFunction.__name__] = 0

    changed_files, additions, deletions = change
    resultTuple['additions'] = additions
    resultTuple['deletions'] = deletions
    for added, removed, file in changed_files:
        # Add the metrics of the file as of this commit (sign +1), then subtract
        # the metrics of the same file in the parent commit (sign -1).
        for tree, sign in ((commit.tree, 1), (commit.parents[0].tree, -1)):
            contentWithHeader = file_contents(tree, file)
            content = repoAnalysis.removeHeader(contentWithHeader)
            contentWithoutStrings = repoAnalysis.stringRemoveRegex.sub("\"...\"", content)
            contentWithoutComments = repoAnalysis.commentRegex.sub("/*...*/", contentWithoutStrings)
            for metricFunction in metricSuite:
                metric = metricFunction(content=content, contentWithHeader=contentWithHeader, contentWithoutComments=contentWithoutComments)
                resultTuple[metricFunction.__name__] = resultTuple[metricFunction.__name__] + sign * metric

    return resultTuple
    

def calculateDeltaMetrics(repoTuple, metricSuite=repoAnalysis.metricSuite):
    (user, project, repoId) = repoTuple
    repo = repoLibrarian.getRepo(user, project)
    # Column names must match the keys set in deltaMetricsForCommit ('repoId', not 'repo_id')
    columns = ['sha', 'parent', 'timestamp', 'repoId', 'additions', 'deletions'] + list(map(lambda fun: fun.__name__, metricSuite))
    results = []
    try:
        start = time.time()
        log = repo.git.log('--numstat', '--format=//%H', '--all')
        commits = log.split('//')[1:]  # drop the empty part before the first '//'
        changes = map(block_to_stats, commits)

        for hexsha, change in changes:
            commit = Commit(repo, hex_to_bin(hexsha))
            if len(commit.parents) == 1:
                results.append(deltaMetricsForCommit(commit, metricSuite, repoId, change))
                
        df = pandas.DataFrame(results, columns=columns)
        end = time.time()
        print('Time used for '+str(repoTuple)+': '+str(end - start))
        return df
    except Exception as e:
        print('Failed to analyze '+str(repoTuple)+': '+str(e))
        return []
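
`calculateDeltaMetrics` relies on `block_to_stats`, which is not shown in this part of the notebook. As a rough sketch of what such a parser might look like: `parse_numstat_block` below is a hypothetical stand-in, and its return shape `(hexsha, (changed_files, additions, deletions))` is only inferred from how the loop above unpacks the result. The real helper may differ, for example by filtering for `.java` files.

```python
def parse_numstat_block(block):
    """Parse one block of `git log --numstat --format=//%H` output.

    `block` is the text between two '//' markers: the commit hash on the
    first line, then one 'added<TAB>removed<TAB>path' line per changed file.
    """
    lines = block.split('\n')
    hexsha = lines[0].strip()
    changed_files = []
    additions = 0
    deletions = 0
    for line in lines[1:]:
        parts = line.split('\t')
        if len(parts) != 3:
            continue  # skip the blank separator lines
        added, removed, path = parts
        # Binary files are reported as '-'; count them as 0 changed lines here.
        added = int(added) if added.isdigit() else 0
        removed = int(removed) if removed.isdigit() else 0
        changed_files.append((added, removed, path))
        additions += added
        deletions += removed
    return hexsha, (changed_files, additions, deletions)

# Two commits taken from the scylla log output shown earlier.
sample_log = (
    "//ee00a205a2225ad73b0264cf7ba64be5c7044d0b\n"
    "\n"
    "3\t2\tsrc/main/java/de/hpi/bpt/scylla/plugin/batch/BatchResourceQueueUpdatedPlugin.java\n"
    "4\t3\tsrc/main/java/de/hpi/bpt/scylla/plugin_type/simulation/resource/ResourceQueueUpdatedPluggable.java\n"
    "1\t1\tsrc/main/java/de/hpi/bpt/scylla/simulation/SimulationModel.java\n"
    "//aa9335c1b629e660d48cf6cb923bf345f79f0619\n"
    "\n"
    "3\t1\tsrc/main/java/de/hpi/bpt/scylla/plugin_loader/ExternalJarLoader.java\n"
)
parsed = [parse_numstat_block(block) for block in sample_log.split('//')[1:]]
for hexsha, (changed_files, additions, deletions) in parsed:
    print(hexsha, len(changed_files), additions, deletions)
```

Note that splitting on `//` assumes no file path in the log contains that sequence, and that renamed files appear in numstat with an `old => new` path that would need extra handling before looking the file up in a tree.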

In [693]:
(user, project, repoId) = testRepoTuple
repo = repoLibrarian.getRepo(user, project)
print(len(list(filter(lambda commit: len(commit.parents) == 1, repo.iter_commits('--all')))))
print(len(list(repo.iter_commits('--all'))))
print(len(list(repo.iter_commits())))

1454
1835
1581


In [696]:
%autoreload
calculateDeltaMetrics(testRepoTuple)

Time used for ('brockn', 'incubator-parquet-mr', 11108627): 22.396914958953857


Unnamed: 0,sha,parent,timestamp,repoId,additions,deletions,loc,cloc,fileCount,numMethods,numLambdas,numCommentLines,numReflection,numSnakes,totalIndent
0,69ba4844730426a212c609facd93b33bf6692b3a,be1222ef4a3260ddcf516d73c6ceecd144a134cb,1412699955,11108627,7,2,5,4,0,0,0,0,0,2,4.00
1,be1222ef4a3260ddcf516d73c6ceecd144a134cb,da9129927bce90feb6d2860745263f4d74d0dfa8,1412198064,11108627,19,12,7,7,0,0,0,0,0,4,11.00
2,da9129927bce90feb6d2860745263f4d74d0dfa8,0b17cbee9541998df66d33c8a99b675ced80d9aa,1412196285,11108627,346,49,297,292,0,17,0,0,0,111,495.50
3,0b17cbee9541998df66d33c8a99b675ced80d9aa,bf20abbf4825fa5892d8e15c066e768671a39289,1412017203,11108627,34,30,4,4,0,0,0,0,0,0,4.50
4,bf20abbf4825fa5892d8e15c066e768671a39289,3a082e8e390898646c094d20f4ec1eeba45b79ac,1411688756,11108627,34,1,33,33,0,11,0,0,0,0,15.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1449,5fb63e9a0743a23735f5b6043abd202a85173202,7ed952872d68eacc02d5a13e6b6ded4cf8babc55,1347128023,11108627,304,146,158,149,0,30,2,32,1,0,83.75
1450,dbbf9443ffe219862c9744806f34deb2606d21c7,f3cdad30930c2cd73404804cf4ceeca237509a2c,1346885521,11108627,0,0,0,0,0,0,0,0,0,0,0.00
1451,24028799609ef606359d6bee39acd34d8d67a067,f3cdad30930c2cd73404804cf4ceeca237509a2c,1346454047,11108627,0,0,0,0,0,0,0,0,0,0,0.00
1452,f3cdad30930c2cd73404804cf4ceeca237509a2c,a8c10efccf35977193cab80b0f17d6a2f7d066d9,1346452819,11108627,0,0,0,0,0,0,0,0,0,0,0.00


In [699]:
dbUtils.runQuery('''
        SELECT
            child.sha, 
            parent.sha AS parent,
            child.timestamp,
            child."repoId",
            child.loc - parent.loc AS d_loc,
            child.cloc - parent.cloc AS d_cloc,
            child."fileCount" - parent."fileCount" AS d_filecount,
            child."numMethods" - parent."numMethods" AS d_methods,
            child."numLambdas" - parent."numLambdas" AS d_lambdas,
            child."numCommentLines" - parent."numCommentLines" AS d_commentlines,
            child."numReflection" - parent."numReflection" AS d_reflection,
            child."numSnakes" - parent."numSnakes" AS d_snakes,
            child."totalIndent" - parent."totalIndent" AS d_totalindent
        FROM 
            crm20.'''+tableName+''' child, crm20.'''+tableName+''' parent
        WHERE child."repoId"=11108627
        AND parent."repoId"=11108627
        AND child.parent = parent.sha
        AND child.sha = 'a8c10efccf35977193cab80b0f17d6a2f7d066d9'
''')

Time used: 0.11280083656311035


Unnamed: 0,sha,parent,timestamp,repoId,d_loc,d_cloc,d_filecount,d_methods,d_lambdas,d_commentlines,d_reflection,d_snakes,d_totalindent
0,a8c10efccf35977193cab80b0f17d6a2f7d066d9,576c709724551a5122ae9b9e314b6c400f5f778d,1346452652,11108627,2858,2502,40,323,3,71,7,2,2191
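
The query above derives deltas from the absolute metrics by joining each commit to its parent. The same consistency check can be sketched in plain Python (kept free of pandas so that it runs standalone). The rows are copied from the first rows of the `calculateMetrics` output earlier in this notebook, with only a subset of the metric columns; `deltas_from_absolute` is a hypothetical helper name.

```python
# Absolute metrics for three consecutive commits (subset of columns).
absolute = [
    {'sha': '69ba4844730426a212c609facd93b33bf6692b3a',
     'parent': 'be1222ef4a3260ddcf516d73c6ceecd144a134cb',
     'loc': 71466, 'cloc': 56538, 'numSnakes': 1966, 'totalIndent': 54355.75},
    {'sha': 'be1222ef4a3260ddcf516d73c6ceecd144a134cb',
     'parent': 'da9129927bce90feb6d2860745263f4d74d0dfa8',
     'loc': 71461, 'cloc': 56534, 'numSnakes': 1964, 'totalIndent': 54351.75},
    {'sha': 'da9129927bce90feb6d2860745263f4d74d0dfa8',
     'parent': '0b17cbee9541998df66d33c8a99b675ced80d9aa',
     'loc': 71454, 'cloc': 56527, 'numSnakes': 1960, 'totalIndent': 54340.75},
]
metric_names = ['loc', 'cloc', 'numSnakes', 'totalIndent']

def deltas_from_absolute(rows, metric_names):
    """Return {sha: {metric: child - parent}} for commits whose parent row is known."""
    by_sha = {row['sha']: row for row in rows}
    deltas = {}
    for row in rows:
        parent = by_sha.get(row['parent'])
        if parent is None:
            continue  # parent not in this sample, nothing to compare against
        deltas[row['sha']] = {name: row[name] - parent[name] for name in metric_names}
    return deltas

deltas = deltas_from_absolute(absolute, metric_names)
for sha, delta in deltas.items():
    print(sha[:7], delta)
```

For the first two commits this gives `loc` deltas of 5 and 7, which match the corresponding rows of the `calculateDeltaMetrics` output.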
