# Working with Different Software Data Groups

The ScrumSaga system extracts many different (100+) data fields, with transformation alorithms for many more.  This can seem unmanageable, but becomes intuitive after learning the various categories and sub-categories of metric Data Groups.

This guide describes a few of the basic Data Groups, how they are related.  While all of the data can be represented in any of the categories, for instructional purposes, we can categorize them by their structure and typical use:

* Metric (timeseries)
* Hierarchical (parent-child)
* Entity-Relation (graph/network)
* Descriptive

We also provide the _processing data_ for those with interest.

### Preparations
_Set-up Environment_

In [1]:
# Ensure API Wrapper is available and load it
! ls ./ScrumSaga

Account.py
Portfolio.py
Project.py
README.md
Repo.py
__init__.py
__pycache__


In [2]:
import sys
path = r'C:\Users\Jason\Documents\IPython Notebooks\SS-Reports\ScrumSaga'
sys.path.append(path)
import ScrumSaga as saga

In [3]:
# Acocunt information (must be manipulated on website: scrumsaga.com)
SAGA_ACCT = {"email":"dev.team@mgmt-tech.org","password":"*********"}

Acct = saga.Account(acct_email=SAGA_ACCT['email'], acct_password=SAGA_ACCT['password'])
Acct.login()

passwords match


_Check Available Repo Data_

In [23]:
Acct.view_data()

['IMTorg--testprj_Java_aSimple', 'IMTorgTestCode--testprj_Java_aSimple', 'IMTorgTestProj--demoprj_Java_HumanResourceApp']


### Simple Java Project

In [24]:
# create project
JavaRepo = saga.Repo('IMTorgTestProj','information@mgmt-tech.org','demoprj_Java_HumanResourceApp')
jHrApp = saga.Project(Acct, JavaRepo)

In [27]:
jHrApp.extract(selection='all')

No need to process, again


<Response [200]>

_Load Data from Repo_ 

In [28]:
# load all metric groups
jHrApp.load_all()

ERROR group records:  4274
 -elapsed time: 4.323942
SIZE group records:  69
 -elapsed time: 0.348959
AUTHOR group records:  4
 -elapsed time: 0.226758
PROJECT group records:  69
 -elapsed time: 0.466084
PROCESS_LOG group records:  11
 -elapsed time: 0.229918
ENTITY_CHARACTERISTIC group records:  4438
 -elapsed time: 7.090637
COMPLEXITY group records:  465
 -elapsed time: 0.595327
QUALITY group records:  874
 -elapsed time: 0.721322
ENTITY_STRUCTURE group records:  1114
 -elapsed time: 1.159667
RELATION group records:  12
 -elapsed time: 0.234500
TAG group records:  204
 -elapsed time: 0.452131
Loading completed with no errors


In [78]:
# timeline of events
jHrApp['project'][['stamp', 'hash','subject']].tail()

Unnamed: 0,stamp,hash,subject
64,2016-05-01 05:15:25.000000,e9f59b54d0bc10569ae86abae7607658ca2503d9,Downgrading Java to 1.7
65,2016-05-01 05:20:24.000000,fb0b48d81a4d9be7dfcdd2119fde8e321894cc8a,Updating DB password as DEMO server
66,2016-07-13 22:40:33.000000,8bc5a79bf92a7cee970d5e4c5bfa8c94f88a5a53,1. Add new Employee bug is resolved. 2. Search...
67,2016-07-15 21:33:56.000000,e633c4898600159cf6b62c37256c39e9bc563203,1. Search Assignment functionality is updated.
68,2016-07-17 04:40:04.000000,c73e2824188fe99555f6f435bbe7959631b5d1c6,Updated missing js file usage with existing js...


### Metric Data

This section looks at project, size, and complexity data in order to better understand how they might be used in measures for business decisions.

In [62]:
import pandas
import numpy as np

# per commit (all commits)
print('PROJECT: ', jHrApp['project'].columns ) 
print('SIZE: ', jHrApp['size'].columns )

# per commit (10 commits), per entity
print('COMPLEXITY: ', jHrApp['complexity'].columns) 

PROJECT:  Index(['author_add', 'author_commits_count', 'author_del', 'author_files_size',
       'author_id', 'author_modified_count', 'author_original_count',
       'author_paths_count', 'author_total', 'authors_count', 'hash', 'id',
       'prj_id', 'project', 'release_count', 'reviewer_add',
       'reviewer_commits_count', 'reviewer_del', 'reviewer_files_size',
       'reviewer_modified_count', 'reviewer_name', 'reviewer_original_count',
       'reviewer_paths_count', 'reviewer_total', 'stamp', 'stamp_author',
       'subject'],
      dtype='object')
SIZE:  Index(['count', 'files_count', 'files_size', 'hash', 'id', 'loc_add',
       'loc_del', 'loc_total', 'modified_file_count', 'original_file_count',
       'prj_id', 'project', 'stamp', 'tag_count'],
      dtype='object')
COMPLEXITY:  Index(['bugs', 'calculated_length', 'cyclomatic_complexity', 'difficulty',
       'effort', 'entity_id', 'hash', 'id', 'n1', 'n2', 'nn1', 'nn2', 'time',
       'volume'],
      dtype='object')


In [57]:
jHrApp['complexity']['volume'] = jHrApp['complexity']['volume'].astype('float')
jHrApp['complexity']['difficulty'] = jHrApp['complexity']['difficulty'].astype('float')
jHrApp['complexity']['effort'] = jHrApp['complexity']['effort'].astype('float')

cmplx = jHrApp['complexity'].groupby('hash').agg({'volume': np.sum, 'difficulty':np.sum, 'effort':np.sum})
cmplx['hash'] = cmplx.index.values

In [58]:
m1 = pandas.merge(jHrApp['project'], jHrApp['size'], on='hash', how='left')
metric = pandas.merge(m1, cmplx, on='hash', how='right')

In [79]:
# External stakeholders may look at data, such as this for an overview of project progress
metric[['stamp_x','authors_count','files_size','loc_total','loc_add','loc_del','volume','difficulty','effort']]

Unnamed: 0,stamp_x,authors_count,files_size,loc_total,loc_add,loc_del,volume,difficulty,effort
0,2016-04-16 12:11:18.000000,1,8584450,2935,2935,0,2304.0,106.5,40906.08
1,2016-04-16 17:23:21.000000,1,8724989,3701,3717,16,13733.25,609.3,303455.13
2,2016-04-21 18:21:39.000000,1,8756060,4700,4745,45,28847.13,1014.02,694932.82
3,2016-05-10 19:09:34.000000,2,768312,5000,5228,228,50281.08,1409.6,1402362.56
4,2016-06-08 12:43:32.000000,2,768730,5027,5259,232,50281.08,1409.6,1402362.56
5,2016-06-16 10:03:12.000000,2,768797,5032,5304,272,50281.08,1409.6,1402362.56
6,2016-04-23 11:14:43.000000,3,8766352,5041,5322,281,30920.61,1059.11,787295.47
7,2016-04-28 20:23:38.000000,3,9085553,6444,6801,357,45262.26,1310.09,1166944.23
8,2016-04-30 16:37:40.000000,3,9110644,7131,7576,445,47297.79,1339.17,1258266.39
9,2016-07-17 04:40:04.000000,4,778936,7382,12245,4863,51805.56,1445.4,1482433.08


### Hierarchical Data

This secion will look at the hierarchical structure from two sets of data: entity_structure and tags.  While entity_structure comes naturally from code, tags are manually pasted into code comments by the developer in order to organize additional aspects.

In [64]:
print('Structure: ', jHrApp['entity_structure'].columns )
print('Tags: ', jHrApp['tag'].columns )

Structure:  Index(['child_of', 'child_of_id', 'created_hash', 'entity_name', 'entity_type',
       'ext', 'id', 'last_before_removed_hash', 'prj_id', 'type'],
      dtype='object')
Tags:  Index(['class_name', 'file_path', 'func_id', 'hash', 'id', 'project',
       'tag_key', 'tag_value', 'user', 'var_name'],
      dtype='object')


In [85]:
tmp = jHrApp['entity_structure']
tmp = tmp[tmp['last_before_removed_hash'] =='']
tmp = jHrApp['entity_structure'].groupby('entity_type').agg({'entity_type':np.size})
struct = tmp.reindex(['project','directory','file','class','method','param','variable'])
struct

Unnamed: 0_level_0,entity_type
entity_type,Unnamed: 1_level_1
project,1
directory,41
file,135
class,329
method,245
param,162
variable,140


In [86]:
# 
tag = jHrApp['tag']
tag = tag[tag['hash']=='c73e2824188fe99555f6f435bbe7959631b5d1c6']
tag[['id','tag_key','tag_value']]

Unnamed: 0,id,tag_key,tag_value
201,202,Func,List_Job
202,203,Func,List_Job
203,204,Func,Add_Job


### Entity-Relation Data

In [None]:
entity_structure, relation

In [95]:
jHrApp['relation'].shape

(13, 8)

### Descriptive Data

entity_characteristic, quality, error, author

In [87]:
print('Author: ', jHrApp['author'].columns )
print('Characteristics: ', jHrApp['entity_characteristic'].columns )
print('Quality: ', jHrApp['quality'].columns )
print('Error: ', jHrApp['error'].columns )

Author:  Index(['author_domain', 'author_email', 'author_name', 'date_author_join_prj',
       'id', 'prj_id'],
      dtype='object')
Characteristics:  Index(['blank', 'brief_desc', 'code', 'comment', 'detailed_desc', 'end_line',
       'entity_id', 'hash', 'id', 'inbody_desc', 'last_modification_hash',
       'last_modification_loc_added', 'last_modification_loc_changes',
       'last_modification_loc_removed', 'last_modification_user', 'loc_add',
       'loc_del', 'loc_total', 'location', 'modifications', 'reimplements_id',
       'start_line', 'total_loc_added', 'total_loc_removed',
       'total_references'],
      dtype='object')
Error:  Index(['entity', 'hash', 'id', 'location', 'message', 'project', 'type',
       'user'],
      dtype='object')


In [106]:
tmp = jHrApp['entity_structure']
tmp = tmp[['id','entity_name']]
m1 = pandas.merge(tmp,jHrApp['entity_characteristic'], on='id', how='left')
m2 = pandas.merge(m1, jHrApp['quality'], on='hash', how='left')

Unnamed: 0,child_of,child_of_id,created_hash,entity_name,entity_type,ext,id,last_before_removed_hash,prj_id,type
254,aSimple/bin/aSimple,127,7405846a24596c8fdcadec8be1f392783d1517fc,Cat.class,file,.class,255,,1,
255,aSimple/src/aSimple,130,7405846a24596c8fdcadec8be1f392783d1517fc,Cat.java,file,.java,256,,1,
256,Cat.java,256,7405846a24596c8fdcadec8be1f392783d1517fc,aSimple::Cat,class,.java,257,,1,
257,aSimple::Cat,257,7405846a24596c8fdcadec8be1f392783d1517fc,Cat,method,.java,258,,1,
258,Cat,258,7405846a24596c8fdcadec8be1f392783d1517fc,name,param,.java,259,,1,String


### Processing - Related

process_log

In [31]:
jHrApp['process_log'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
hash       11 non-null object
id         11 non-null int64
project    11 non-null object
seconds    11 non-null int64
user       11 non-null object
dtypes: int64(2), object(3)
memory usage: 520.0+ bytes


### Conclusion

This overview of the raw data collected provides a basis for workflows and understanding advanced, calculated data.  You can learn more in follow-on [guides](http://guides.scrumsaga.com/).