- This repo detects bad smells in the development process of Project 1, CSC 505, Spring 2015, NCSU
- All project names and member names have been replaced by numbers for privacy reasons
- Contact: Jianfeng Chen (email@example.com)
This section describes how to collect the data recorded during the development process. I collect three basic types of data: issues, commits, and milestones.
All of the collection code is adapted from gitable.py. A GitHub token is required to use the GitHub API. In this repo, my token has been hidden.
Issues are a great way to keep track of tasks, enhancements, and bugs in a project, and they let us detect or predict some bad smells. To collect the issues for a project, we request https://api.github.com/repos/org_name/repo_name/issues/events and then parse the returned JSON. For example, to get the issues for my own project, I request https://api.github.com/repos/smartSE/constraintAnalysis/issues/events.
If the request succeeds, I get all the information about the issues, tagged as "id", "url", "html_url", "state", "title", "body", "user", "label", "milestone", "comments", etc.
This report discusses how to parse this information in the following section.
Developers typically set a milestone that corresponds to a project, feature, or time period. Consequently, milestones are important information when analysing bad smells.
As with issues, one can fetch the milestones through the GitHub API link https://api.github.com/repos/org_name/repo_name/milestones . Again taking my own project as an example, the request for extracting its milestones is https://api.github.com/repos/smartSE/constraintAnalysis/milestones?state=all. The state can be all, closed, or open (the default).
The information for a single milestone includes "id", "url", "creator", "title", "description", "created_at", "closed_at", etc.
A project contains many commits, and tracking them lets us detect or predict many bad smells. As with the former two collections, we use the link https://api.github.com/repos/org_name/repo_name/commits . For example, to get the commits for my own project 1, I use https://api.github.com/repos/smartSE/constraintAnalysis/commits.
Many parameters can be attached to the URL to narrow the result; see the GitHub API documentation.
Common information for a commit includes "url", "author", "committer", "parents", etc.
Labels are the tags for the issues, through which the developer can classify their issues.
As with the former collections, we use the link https://api.github.com/repos/org_name/repo_name/labels . For example, to get the labels for my own project 1, I use https://api.github.com/repos/smartSE/constraintAnalysis/labels
Many parameters can be attached to the URL to narrow the result; see the GitHub API documentation.
Common information for a label includes "name", "url", "color", etc.
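The four collection steps share one request pattern, which can be sketched as follows. This is a minimal sketch, not the code from gitable.py: the helper names `build_url` and `fetch_json` are hypothetical, and the token is assumed to be passed in by the caller. Only one page of results is requested; pagination is left out.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com/repos"

def build_url(org_name, repo_name, resource, **params):
    """Return the API URL for one resource (hypothetical helper)."""
    url = "{}/{}/{}/{}".format(API_ROOT, org_name, repo_name, resource)
    if params:
        # Sort the parameters so the URL is reproducible.
        url += "?" + urllib.parse.urlencode(sorted(params.items()))
    return url

def fetch_json(url, token):
    """Request one page and decode the JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"Authorization": "token " + token})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    # Only prints the URLs; no network request is made here.
    for resource in ("issues/events", "milestones", "commits", "labels"):
        print(build_url("org_name", "repo_name", resource))
```

Calling `fetch_json(build_url("smartSE", "constraintAnalysis", "milestones", state="all"), token)` would return the decoded milestone list, assuming the token is valid.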
The main purpose of this repo is to find bad smells in other teams' projects, smells that are also widespread in our own development process. It is therefore the conclusions that matter, and all names are hidden to protect privacy: developers are called "D1", "D2", "D3"; groups are called "G1", "G2", "G3".
I defined a mapping method to substitute the real names. It simply applies "str.replace(s1,s2)". Naturally, this function is not published in this repo, because it contains the mapping itself.
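The substitution step can be sketched as below. The real mapping is private, so the names in `EXAMPLE_MAPPING` are invented for illustration, and the function name `anonymize` is hypothetical. Replacing longer names first is my own precaution, not something the original method is known to do.

```python
def anonymize(text, mapping):
    """Replace every real name in `text` with its alias via str.replace."""
    # Longest names first, so a name that is a prefix of a longer name
    # does not partially overwrite it.
    for real_name in sorted(mapping, key=len, reverse=True):
        text = text.replace(real_name, mapping[real_name])
    return text

# Invented example mapping; the real one is not published.
EXAMPLE_MAPPING = {"alice": "D1", "alice-bot": "D2", "team-rocket": "G1"}

if __name__ == "__main__":
    print(anonymize("alice-bot pushed to team-rocket; alice reviewed",
                    EXAMPLE_MAPPING))
```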
All of the results are stored in separate spreadsheets in CSV format, which is easy to write programmatically and to analyze in Excel. All of the figures in this repo were created with MS Excel.
The following table shows how much data I collected for the later analysis.
|Committer||Commit time(epoch secs)|
The actual collected data can be found here:
The commit record covers the commit history during the development of the software. It contains the committer and the commit time.
|Issue Id||State||Creator||Create time||Labels||Milestone due||Last update|
The actual collected data can be found here:
The issue records are used to trace issue creation during the development of the software. A time of '-1' means "not applicable".
|Milestone id||Title||Create time||Due|
The actual collected data can be found here:
Milestones are an important development tool. A time of '-1' means "not applicable/not set".
|Name|Url|Color|
|---|---|---|
|bug|https://api.github.com/repos/g2/labels/bug|fc2929|
|team discussion|https://api.github.com/g2/labels/team%20discussion|eb6420|
The actual collected data can be found here:
6.PART I. Feature Detection and Results
|1||Commit distribution for the whole team|
|2||Commit for a single person|
|3||Not in-time issue|
|4||Issue creator distribution|
|5||Weekly issue distribution|
|6||Labelled issue distribution|
|7||Number of not labelled issue|
|8||Not closed issue|
|10||Number of issues without milestone|
|11||Number of milestones|
1.Commit distribution for the whole team
The commit distribution can be derived from dataset 1 (the commit record). Here I ignore the committer. Since all times in dataset 1 are given in epoch seconds, they must first be grouped into weeks so that I can count the total number of commits per week. The counting method is as follows. The detailed code can be found here.

    import csv

    # Read (committer, epoch seconds) rows and keep the commit times, sorted.
    t = []
    with open('proj2.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            [a, b] = line
            t.append(int(float(b)))
    t.sort()

    week = []
    total_week = (t[-1] - t[0]) // (7 * 24 * 3600)
    # end[i] is the closing timestamp of week i, counted from the first commit.
    end = []
    for i in range(total_week):
        end.append(t[0] + (i + 1) * 7 * 24 * 3600)

    # Commits in the first week.
    c = 0
    for x in t:
        if x < end[0]:
            c += 1
    week.append(c)
    # Commits in each later week.
    for alpha in range(1, total_week):
        c = 0
        for x in t:
            if x >= end[alpha - 1] and x < end[alpha]:
                c += 1
        week.append(c)
    print(week)
2.Commit for a single person
Another important feature of the commit history is the commit rate of each individual. Within a project team, each member should contribute roughly equally.
The code for fetching the commit rate for each person can be found here.
The following figures show the commit rate for each person.
3.Not in-time issue
A not-in-time issue is an issue that is updated, or still not closed, after the milestone due date. A milestone is an important checkpoint for controlling the development process; if a team cannot finish a milestone, it means that 1) they underestimated the project, or 2) they ran into problems during coding and testing.
The following function counts the not-in-time issues. The detailed code can be found here.

    import csv

    # project 1
    count = 0
    with open('proj1.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            [a, b, c, d, e, f, g] = line
            # An issue that is still open counts as not in time.
            if b == 'open':
                count += 1
                continue
            # f = milestone due, g = last update; -1 means "not applicable".
            if int(float(f)) > 0 and int(float(g)) > int(float(f)):
                count += 1
    print(count)
Project 1: 28 Project 2: 51 Project 3: 11
4.Issue creator distribution
Here we look at the distribution of issue creators. During the development process, any problem should be raised as an issue; issues are a fantastic way to communicate within a team. The issue creator distribution can be counted as follows. The detailed code can be found here.

    import csv

    # project 1
    with open('proj1.csv', newline='') as csvfile:
        rows = list(csv.reader(csvfile))

    # Collect the distinct issue creators (third column).
    creator = set()
    for line in rows:
        [a, b, c, d, e, f, g] = line
        creator.add(c)

    # Count the issues opened by each creator.
    dis = []
    for mm in creator:
        count = 0
        for line in rows:
            [a, b, c, d, e, f, g] = line
            if c == mm:
                count += 1
        dis.append(count)
    print(dis)
Project 1: [48, 8, 13] Project 2: [1, 54, 7, 19, 1] Project 3: [50, 12, 0, 16]
5.Weekly issue distribution
The weekly issue distribution is another important feature. Starting from Feb 1, 2015, we count the number of issues created in each of the following weeks. The code to compute this feature is as follows. The detailed code can be found here.

    import csv

    # project 1
    week = []
    with open('proj1.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            [a, b, c, d, e, f, g] = line
            # d is the creation time in epoch seconds; 604800 s = one week.
            x = int(float(d))
            week.append(x // 604800 - 2351)
    week.sort()

    frequency = []
    # Note: range(1, max(week)) stops before the last week index.
    for i in range(1, max(week)):
        count = 0
        for p in week:
            if p == i:
                count += 1
        frequency.append(count)
    print(frequency)
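The constant 2351 in the week computation is easier to read with a small worked example. The sketch below converts the offset back into a calendar date; the names `SECONDS_PER_WEEK`, `WEEK_OFFSET`, and `week_index` are mine, introduced only for illustration.

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 3600   # 604800
WEEK_OFFSET = 2351                 # constant used in the weekly issue count

def week_index(epoch_secs):
    """Whole weeks since the epoch, shifted by the offset."""
    return epoch_secs // SECONDS_PER_WEEK - WEEK_OFFSET

# Start of week index 0: 2351 whole weeks after 1970-01-01 (UTC).
week_zero = datetime.fromtimestamp(WEEK_OFFSET * SECONDS_PER_WEEK,
                                   tz=timezone.utc)
print(week_zero.date())            # 2015-01-22

# Feb 1, 2015 00:00 UTC falls into week index 1.
feb_first = int(datetime(2015, 2, 1, tzinfo=timezone.utc).timestamp())
print(week_index(feb_first))       # 1
```

Week index 1 therefore begins on Jan 29, 2015 (UTC), so an issue created on Feb 1, 2015 lands in the first counted week.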
6.Labelled issue distribution
Now we focus on the labels. From dataset 2 (the issue record), we can get the labels of each issue. But does each label carry a similar number of issues? Unused or rarely used labels are not helpful. This feature counts the number of issues carrying each label. The function for computing this feature is as follows. The detailed code can be found here.

    import csv

    def splits(e):
        # Split a stringified list of labels into the individual label names.
        result = []
        strs = ''
        for c in e:
            if c != ',' and c != '[' and c != ']' and c != '\'':
                strs += c
            else:
                if len(strs) > 0 and strs != ' ':
                    result.append(strs)
                strs = ''
        return result

    print("ISSUE CREATOR DISTRIBUTION===========")
    # project 1
    with open('proj1.csv', newline='') as csvfile:
        rows = list(csv.reader(csvfile))

    # Every distinct label that appears on any issue (fifth column).
    labels = set()
    for line in rows:
        [a, b, c, d, e, f, g] = line
        for ll in splits(e):
            labels.add(ll)

    # Write one (label, issue count) row per label.
    with open('labelDis1.csv', 'w', newline='') as csvfile2:
        writer = csv.writer(csvfile2)
        for la in labels:
            count = 0
            for line in rows:
                [a, b, c, d, e, f, g] = line
                for ll in splits(e):
                    if ll == la:
                        count += 1
            writer.writerows([[la, count]])
7.Number of not labelled issue
A tag (label) on an issue helps others find the right issue faster, and thus promotes communication within a team. Unlabelled issues are not a good thing. This feature counts the number of unlabelled issues. The function is as follows. The detailed code can be found here.

    import csv

    # project 1
    count = 0
    with open('proj1.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            [a, b, c, d, e, f, g] = line
            # e holds the labels; an empty field means the issue has no label.
            if e == '':
                count += 1
    print(count)
Project 1: 6 Project 2: 26 Project 3: 36
8.Not closed issue
An issue should be closed once it is solved or once the developers decide not to solve it. An open issue indicates a problem still to be figured out. All of the projects have now ended, so all issues should be closed. This feature counts the issues that have not yet been closed. The function is as follows. The detailed code can be found here.

    import csv

    count = 0
    with open('proj1.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            [a, b, c, d, e, f, g] = line
            # b is the issue state.
            if b != 'closed':
                count += 1
    print(count)
Project 1: 1 Project 2: 0 Project 3: 0
Here we focus on the names of the labels. A meaningful name can help the development process. The label names can be collected with the following function. The detailed code can be found here.

    import csv

    names = []
    with open('proj1.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            # Label rows hold (name, url, color).
            [a, b, c] = line
            names.append(a)
    names.sort()
    print(names)
    print("===============")
    ===LABEL NAMES#===========
    ['Solved', 'bug', 'design', 'develop', 'duplicate', 'enhancement', 'help wanted', 'invalid', 'question', 'test', 'wontfix']
    ===============
    ['Configure Problem', 'Design Problem', 'Test Problem', 'bug', 'fixed', 'help wanted', 'info!', 'wontfix']
    ===============
    ['Testing', 'Training', 'bug', 'enhancement', 'generate script', 'help wanted', 'question', 'task', 'team discussion']
    ===============
10.Number of issues without milestone
A milestone sets a deadline for solving an issue, which makes it a very helpful tool. This feature counts the number of issues without a milestone. The following function computes it. The detailed code can be found here.

    import csv

    count = 0
    with open('proj1.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            [a, b, c, d, e, f, g] = line
            # f is the milestone due time; '-1' means no milestone is set.
            if f == '-1':
                count += 1
    print(count)
Project 1: 13 Project 2: 29 Project 3: 42
11.Number of milestones
This feature counts how many milestones the developers created. Typically one milestone represents one development step. Too few milestones is not a good thing. The number of milestones can be computed as follows. The detailed code can be found here.

    import csv

    # project 1
    count = 0
    with open('proj1.csv', newline='') as csvfile:
        # Each row of the milestone file is one milestone.
        for line in csv.reader(csvfile):
            count += 1
    print(count)
PROJECT 1: 5 PROJECT 2: 5 PROJECT 3: 7
For each milestone, this feature gives the milestone duration in days. A duration that is either too long or too short is not good. This feature can be computed with the following function. The detailed code can be found here.

    import csv

    with open('proj1.csv', newline='') as csvfile:
        for line in csv.reader(csvfile):
            [a, b, c, d] = line
            # c = create time, d = due time, both in epoch seconds.
            days = (int(float(d)) - int(float(c))) // (24 * 3600)
            print(b + ":" + str(int(days)))
    print("===============")
    ===REPO MILESTONE #===========
    Beta Launch :9
    V1:12
    V2:26
    System test and Report:24
    Final release:10
    ===============
    Test points:2
    Basic Service and Test:10
    Small Scale Test and Comparison:34
    Large Scale Test:39
    Final:39
    ===============
    Data Collection and Preliminary Analysis:13
    Tasks for Week#3:13
    Tasks for 02/27:7
    Milestone-03/07:3
    105 Model:10
    Milestone-03/15:5
    Milestone-03/30:9
    ===============
7.PART II. Bad Smells Detector and Results
Many bad smells may arise during the development process. In this section, I discuss some of them based on the features generated above. Some bad smells can be derived from a single feature, while others need two or more features.
1.Uneven commit distribution
The result of feature 1 (commit distribution for the whole team) suggests that, to some extent, the commits of each team are not evenly distributed. To confirm this, I use the quantity (standard deviation/total commit*project duration) as a measure of how even the commit distribution is.
Code for this detector can be found here
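The formula above can be read in more than one way. The sketch below shows one possible reading, (standard deviation / total commits) * number of weeks, which is the same as the coefficient of variation of the weekly commit counts. Both this grouping of the terms and the use of the population standard deviation are my assumptions, so the sketch may not reproduce the figures reported in this section. The function name `commit_unevenness` is hypothetical.

```python
import math

def commit_unevenness(weekly_commits):
    """One reading of std / total * duration (an assumption, see above)."""
    total = sum(weekly_commits)
    weeks = len(weekly_commits)
    if total == 0 or weeks == 0:
        return 0.0
    mean = total / weeks
    # Population variance of the weekly commit counts.
    variance = sum((c - mean) ** 2 for c in weekly_commits) / weeks
    std = math.sqrt(variance)
    # std / total * weeks equals std / mean.
    return std / total * weeks

if __name__ == "__main__":
    print(commit_unevenness([5, 5, 5, 5]))   # perfectly even weeks
    print(commit_unevenness([0, 10]))        # all commits in one week
```

Under this reading, a perfectly even history scores 0 and the score grows as commits concentrate in fewer weeks.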
project 1: 0.0843137254902
project 2: 0.071004659249
project 3: 0.0751008549106
From this result we can conclude that all three teams had an uneven commit history, especially the group for project 1.
A super leader may be harmful to a team. The super leader does most of the work in a team, which violates the principle of co-operation. For instance, during a team meeting, the team may suffer from the cheerleader effect. The following function is the detector for this.

    def detectSuperLeader(weeklyCommit, totalCommit):
        # True when this member's commits exceed 30% of the team's total.
        if sum(weeklyCommit) > totalCommit * 0.3:
            return True
        return False
project 1: False
project 2: False
project 3: False
We're glad to see that there is no super leader among these groups.
In contrast to the super leader, a passenger does even more harm to the project. Passengers do not take an active part in the development process and thus reduce the quality of the product/software. We define a passenger as follows:
Their commits account for less than 20% of the total commits; OR
They made no commit at all in many weeks (at least a quarter of the weeks, in the detector below)
The detector for the passenger can be found here
    def isPassenger(weekCommit, totalCommit):
        # Rule 1: fewer than 20% of the team's commits.
        if sum(weekCommit) < totalCommit * 0.2:
            return True
        # Rule 2: no commit in at least a quarter of the weeks.
        c = 0
        for i in weekCommit:
            if i == 0:
                c += 1
        if c >= 0.25 * len(weekCommit):
            return True
        return False

    G1-M1: High commit proportion. Not passenger
    G1-M2: High commit proportion. Not passenger
    G1-M3: Zero weekly commit warning. Possible Passenger!
    G2-M1: High commit proportion. Not passenger
    G2-M2: High commit proportion. Not passenger
    G2-M3: Zero weekly commit warning. Possible Passenger!
    G2-M4: Zero weekly commit warning. Possible Passenger!
    G3-M1: High commit proportion. Not passenger
    G3-M2: Zero weekly commit warning. Possible Passenger!
    G3-M3: Low commit proportion. Possible passenger!
    G3-M4: Low commit proportion. Possible passenger!
Note that this result is consistent with our intuition.
4.Poor time management
Time management includes time/effort estimation and plan execution management. For now I cannot distinguish between the two. However, a failure in either can lead to overdue issues. Tasks that are not finished in time harm software development, especially when many issues are overdue.
The following function is a detector for poor time management. This detector is mainly based on feature 3.

    def poorTimeManageDetector(overdue, totalIssue):
        # float() forces true division; under Python 2, dividing two ints
        # here would truncate the ratio to 0.
        if overdue / float(totalIssue) > 0.15:
            return True
        else:
            return False
Project 1: Poor Management Project 2: Poor Management Project 3: Poor Management
4.Not number label
As Dr. Menzies said in the lecture, numbering the labels is a good habit. The following function checks whether the labels have been numbered.

    def numberLabelDetector(labels):
        for label in labels:
            # A label counts as numbered when its first character is a digit.
            if label[:1].isdigit():
                continue
            else:
                return False
        return True
Project 1: Not numbered Project 2: Not numbered Project 3: Not numbered
All teams ignored this!!
5.Poor Issue management
Issue management is essential too. Issues should be closed in time, and each one should carry a label. Consequently, I wrote a detector for poor issue management. The code is as follows:

    def poorIssueDetector(totalIssue, notLabelIssue, notcloseIssue):
        # Any issue left open counts as poor issue management.
        if notcloseIssue > 0:
            return True
        # So does leaving more than 20% of the issues unlabelled.
        if notLabelIssue > 0.2 * totalIssue:
            return True
        return False
Project 1: Poor issue management(slightly.) Project 2: Normal Project 3: Poor issue management(too many non-labelled issues)
6.Poor milestones setting
This detector detects two things: 1) lack of milestones; 2) the duration for the milestone is too long.
This is a 10-week project. It is suggested to set a milestone at least every three weeks (21 days).
The following function is the detector.
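    def poorMileStoneDetector(mileStoneDurations):
        # Fewer than five milestones counts as a lack of milestones.
        if len(mileStoneDurations) < 5:
            return True
        # A milestone longer than 21 days is too long.
        if max(mileStoneDurations) > 21:
            return True
        return False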
Project 1: containing long-time milestone Project 2: containing long-time milestone Project 3: normal
8.PART III. Early Warning and Results
1.Lazy guy early warning
According to the bad smell detectors above, nobody in the teams did nothing at all. But there are always one or two lazy guys in a team. In this subsection, I introduce how to detect the lazy guy early in the development life cycle.
In the early-warning analysis we cannot use the whole project data, but using the data or experience accumulated so far is acceptable.
To detect the lazy guy, we first check how many weeks with ZERO commits one guy has. This function can check this.
We also need to count the weeks in which one guy's commits are far fewer than those of the others. This function can check this.
Finally, we take the accumulated number of issues a guy has created as time passes. This is not included in the features. It can be calculated by this code
If a guy performs badly on all three points above, he is very likely a lazy guy.
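The three checks can be combined roughly as in the sketch below. It is not the repo's code: the function name `early_warning`, the input layout (one list of weekly commits and one accumulated issue count per member), and the `ratio=0.5` cut-off for "far fewer than the others" are all assumptions made for illustration.

```python
def early_warning(commits, issues, ratio=0.5):
    """Return {member: (zero_weeks, behind_weeks, few_issues)}.

    commits: {member: [commits in week 1, week 2, ...]} for the weeks so far
    issues:  {member: issues created so far}
    ratio:   hypothetical cut-off relative to the team average
    """
    members = list(commits)
    weeks = len(commits[members[0]])
    avg_issues = sum(issues.values()) / len(members)
    report = {}
    for m in members:
        # Check 1: weeks without any commit.
        zero_weeks = sum(1 for c in commits[m] if c == 0)
        # Check 2: weeks far below the team's average for that week.
        behind_weeks = 0
        for w in range(weeks):
            team_avg = sum(commits[x][w] for x in members) / len(members)
            if commits[m][w] < ratio * team_avg:
                behind_weeks += 1
        # Check 3: far fewer accumulated issues than the team average.
        few_issues = issues[m] < ratio * avg_issues
        report[m] = (zero_weeks, behind_weeks, few_issues)
    return report

if __name__ == "__main__":
    # Invented four-week data for a three-person team.
    commits = {"M1": [4, 6, 5, 7], "M2": [3, 5, 6, 4], "M3": [0, 1, 0, 0]}
    issues = {"M1": 10, "M2": 8, "M3": 1}
    for member, flags in early_warning(commits, issues).items():
        print(member, flags)
```

A member who scores high on the first two counts and is flagged on the third would be the candidate for an early warning.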
In the Group1 (data recorded at the end of fourth week)
|Member||Zero commits||Obvious Commit||Less Issues|
Please notice member 3: he performed badly in the first four weeks. As expected, he was a passenger (based on the bad smell detector in Part II)!
One more example: let's look at group 3 (data recorded at the end of fourth week)
|Member||Zero commits||Obvious Commit||Less Issues|
Again, please notice members 3 and 4: they performed badly in the first four weeks. As expected, both of them were possible passengers (based on the bad smell detector in Part II)!