# Step 3.1: Apache Status Analysis

In [1]:
import pandas as pd 
import os

In [2]:
# Open Results
df = pd.read_csv('ApacheStatusCheckerResults.csv')

In the following table we can see, for each project:
- Original: # of commits where the original study detected a build system (Maven) and built successfully (previous experiment).
- Replicated: # of commits where the original study detected a build system (Maven) and we built successfully (current experiment).
- Real Replicated: # of commits where we detected a build system (Maven) and built successfully (current experiment).
- Our Replicated + Fixes (column `Real Replicated + Ant`): # of commits where we detected any build system (Ant/Maven) and built successfully.
- Complete: commits we have been able to build successfully, shown as a percentage of TotalCommits.
- Original Buildable commits: # of commits in which the original study detected a pom.xml.
- Buildable commits (column `Real Buildable commits`): # of commits in which we detected a pom.xml.
- TotalCommits: # of commits in the repository.
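
The percentage columns appear to be derived from these counts. The following is a minimal sketch, assuming that `Replicated (%)` is `Replicated` divided by `Original Buildable commits` and that `Complete` is `Real Replicated + Ant` divided by `TotalCommits`. Both definitions are inferred from the displayed values rather than stated in the notebook, and the `pct` helper is hypothetical. The two rows are copied from the results table.

```python
import pandas as pd

# Two rows copied from the results table (isis and jena).
sample = pd.DataFrame({
    'Project': ['isis', 'jena'],
    'TotalCommits': [4817, 2680],
    'Original Buildable commits': [2062, 2647],
    'Replicated': [90, 329],
    'Real Replicated + Ant': [90, 329],
})

def pct(numerator, denominator):
    """Hypothetical helper: numerator as a percentage of denominator."""
    return 100.0 * numerator / denominator

# Assumed definitions, inferred from the values shown in the table.
sample['Replicated (%)'] = pct(sample['Replicated'], sample['Original Buildable commits'])
sample['Complete'] = pct(sample['Real Replicated + Ant'], sample['TotalCommits'])

print(sample[['Project', 'Replicated (%)', 'Complete']].round(6))
```

If the assumption holds, the computed values match the `Replicated (%)` and `Complete` entries shown for these two projects.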

In [3]:
df

Unnamed: 0,Project,TotalCommits,Original Buildable commits,Original,Original (%),Replicated,Replicated (%),Real Buildable commits,Real Replicated,Real Replicated (%),Ant Fails,Ant Success,Real Buildable commits + Ant,Real Replicated + Ant,Real Replicated + Ant (%),Complete
0,isis,4817,2062,300,14.548982,90,4.364694,2062,90,4.364694,0,0,2062,90,4.364694,1.868383
1,james-hupa,686,677,96,14.180207,0,0.000000,677,0,0.000000,0,0,677,0,0.000000,0.000000
2,james-jdkim,124,123,9,7.317073,9,7.317073,123,9,7.317073,0,0,123,9,7.317073,7.258065
3,james-jsieve,527,393,140,29.598309,0,0.000000,393,0,0.000000,51,81,525,81,15.428571,15.370019
4,james-jspf,621,384,166,37.219731,166,43.229167,384,166,43.229167,0,62,446,228,51.121076,36.714976
5,james-mime4j,733,722,108,14.958449,70,9.695291,722,70,9.695291,0,0,722,70,9.695291,9.549795
6,james-postage,74,63,0,0.000000,0,0.000000,63,0,0.000000,8,0,71,0,0.000000,0.000000
7,jclouds,5074,1039,0,0.000000,0,0.000000,5072,94,1.853312,0,0,5072,94,1.853312,1.852582
8,jena,2680,2647,376,14.204760,329,12.429165,2647,329,12.429165,0,0,2647,329,12.429165,12.276119
9,kalumet,172,170,63,37.058824,5,2.941176,170,5,2.941176,0,0,170,5,2.941176,2.906977


In [4]:
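# numeric_only=True skips the non-numeric 'Project' column (required by recent pandas versions)
df.mean(numeric_only=True)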

TotalCommits                    1764.417722
Original Buildable commits      1283.253165
Original                         287.810127
Original (%)                      37.197470
Replicated                       185.620253
Replicated (%)                    25.363013
Real Buildable commits          1366.582278
Real Replicated                  191.215190
Real Replicated (%)               25.423654
Ant Fails                         83.265823
Ant Success                       32.734177
Real Buildable commits + Ant    1482.582278
Real Replicated + Ant            223.949367
Real Replicated + Ant (%)         24.858411
Complete                          22.382664
dtype: float64

In [5]:
# Sum of each field
df.sum()

Project                         isisjames-hupajames-jdkimjames-jsievejames-jsp...
TotalCommits                                                               139389
Original Buildable commits                                                 101377
Original                                                                    22737
Original (%)                                                               2938.6
Replicated                                                                  14664
Replicated (%)                                                            2003.68
Real Buildable commits                                                     107960
Real Replicated                                                             15106
Real Replicated (%)                                                       2008.47
Ant Fails                                                                    6578
Ant Success                                                                  2586
Real Buildable c

In [6]:
df[['TotalCommits']].describe()

Unnamed: 0,TotalCommits
count,79.0
mean,1764.417722
std,2694.935048
min,25.0
25%,234.0
50%,726.0
75%,1898.0
max,14818.0


In [7]:
ant_success = int(df['Ant Success'].sum())
ant_fails   = int(df['Ant Fails'].sum())
print("Ant success: %d"%ant_success)
print("Ant fails: %d"%ant_fails)

Ant success: 2586
Ant fails: 6578


In [8]:
print("Buildable commits (Original) %d"%int(df['Original Buildable commits'].sum()))
print("Buildable commits (Our Study) %d"%int(df['Real Buildable commits'].sum()))
print("Buildable commits (Our Study + Ant) %d"%int(df['Real Buildable commits + Ant'].sum()))

Buildable commits (Original) 101377
Buildable commits (Our Study) 107960
Buildable commits (Our Study + Ant) 117124


Generate a summary of the replication experiment

In [9]:
report_df = df[['Project', 'TotalCommits','Original Buildable commits', 'Replicated']]
report_df.to_csv('replication_experiment_buildability_summary.csv', index=False)  

In the next cell, we can see the projects in which we detected a build system in commits where the original study did not. For these projects the values in `Replicated` and `Real Replicated` differ.

In [10]:
# Check projects with mistakes on original experiment
df[df['Replicated'] != df['Real Replicated']]

Unnamed: 0,Project,TotalCommits,Original Buildable commits,Original,Original (%),Replicated,Replicated (%),Real Buildable commits,Real Replicated,Real Replicated (%),Ant Fails,Ant Success,Real Buildable commits + Ant,Real Replicated + Ant,Real Replicated + Ant (%),Complete
7,jclouds,5074,1039,0,0.0,0,0.0,5072,94,1.853312,0,0,5072,94,1.853312,1.852582
26,maven-plugins,11812,11221,21,0.187149,21,0.187149,11810,369,3.124471,0,0,11810,369,3.124471,3.123942


In [11]:
# Projects in which we detect more buildable commits
diff_in_buildable_df = df[df['Original Buildable commits'] != df['Real Buildable commits']]
diff_in_buildable_df

Unnamed: 0,Project,TotalCommits,Original Buildable commits,Original,Original (%),Replicated,Replicated (%),Real Buildable commits,Real Replicated,Real Replicated (%),Ant Fails,Ant Success,Real Buildable commits + Ant,Real Replicated + Ant,Real Replicated + Ant (%),Complete
7,jclouds,5074,1039,0,0.0,0,0.0,5072,94,1.853312,0,0,5072,94,1.853312,1.852582
26,maven-plugins,11812,11221,21,0.187149,21,0.187149,11810,369,3.124471,0,0,11810,369,3.124471,3.123942
61,tomee,8916,6955,0,0.0,0,0.0,8916,0,0.0,0,0,8916,0,0.0,0.0


In [12]:
# Diff between our buildable commits and their buildable commits
# Select each column as a Series so that .sum() returns a scalar
their_buildable_commits = int(diff_in_buildable_df['Original Buildable commits'].sum())
our_buildable_commits   = int(diff_in_buildable_df['Real Buildable commits'].sum())
our_buildable_commits-their_buildable_commits

6583

In [13]:
# Diff between our successfully built commits and theirs (same projects as above)
their_successfully_built_commits = int(diff_in_buildable_df['Original'].sum())
our_successfully_built_commits   = int(diff_in_buildable_df['Real Replicated'].sum())
our_successfully_built_commits-their_successfully_built_commits

442
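
The two totals above can be broken down per project. This is a minimal sketch that rebuilds the three affected rows by hand from the table shown earlier. The column names `Extra buildable` and `Extra built` are illustrative and do not exist in the original CSV.

```python
import pandas as pd

# Rows copied from the table of projects whose buildable-commit counts differ.
diff_rows = pd.DataFrame({
    'Project': ['jclouds', 'maven-plugins', 'tomee'],
    'Original Buildable commits': [1039, 11221, 6955],
    'Real Buildable commits': [5072, 11810, 8916],
    'Original': [0, 21, 0],
    'Real Replicated': [94, 369, 0],
})

# Per-project gaps between our detection/build results and the original ones.
diff_rows['Extra buildable'] = (
    diff_rows['Real Buildable commits'] - diff_rows['Original Buildable commits']
)
diff_rows['Extra built'] = diff_rows['Real Replicated'] - diff_rows['Original']

print(diff_rows[['Project', 'Extra buildable', 'Extra built']])

# Column totals; these should agree with the two cell outputs above.
totals = diff_rows[['Extra buildable', 'Extra built']].sum()
print(totals)
```

The breakdown shows that jclouds accounts for most of the additional buildable commits, while maven-plugins accounts for most of the additional successful builds.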

Below, we can see an overview of the buildability of all projects. The values are percentages.

In [14]:
# Total
print("Successfully built commits (Original): %s"%df['Original'].sum())
print("Successfully built commits (Replicated): %s"%df['Replicated'].sum())
print("Buildable commits (Original): %s"%df['Original Buildable commits'].sum())
print("Total commits: %s"%df['TotalCommits'].sum())
df[['Original (%)','Replicated (%)']].describe()

Successfully built commits (Original): 22737
Successfully built commits (Replicated): 14664
Buildable commits (Original): 101377
Total commits: 139389


Unnamed: 0,Original (%),Replicated (%)
count,79.0,79.0
mean,37.19747,25.363013
std,34.259306,33.277697
min,0.0,0.0
25%,2.908276,0.0
50%,28.638498,8.737347
75%,65.997442,43.614583
max,100.0,100.0


We calculate quartiles Q1 and Q3 of `TotalCommits`, which we use to split the projects by size.

In [15]:
q1 = df['TotalCommits'].quantile(0.25)
q1

234.0

In [16]:
q3 = df['TotalCommits'].quantile(0.75)
q3

1898.0

We get the buildability results for short projects (# of commits < Q1)

In [17]:
# Short projects (< Q1)
short_df = df[ df['TotalCommits']< q1 ]
print("Successfully built commits (Original): %s"%short_df['Original'].sum())
print("Successfully built commits (Replicated): %s"%short_df['Replicated'].sum())
print("Buildable commits (Original): %s"%short_df['Original Buildable commits'].sum())
print("Total commits: %s"%short_df['TotalCommits'].sum())
short_df[['Original (%)','Replicated (%)']].describe()

Successfully built commits (Original): 1129
Successfully built commits (Replicated): 876
Buildable commits (Original): 2303
Total commits: 2349


Unnamed: 0,Original (%),Replicated (%)
count,20.0,20.0
mean,49.846115,41.080037
std,35.084146,40.341705
min,0.0,0.0
25%,23.671875,0.555556
50%,48.635514,31.495912
75%,82.726359,80.622066
max,97.014925,99.118943


In [18]:
len(short_df)

20

We get the buildability results for medium projects (Q1 <= # of commits < Q3)

In [19]:
# Medium projects (>= Q1 AND < Q3)
# Boolean masks avoid truncating q1/q3 to integers, which the '%d' formatting would do
medium_df = df[(df['TotalCommits'] >= q1) & (df['TotalCommits'] < q3)]
print("Successfully built commits (Original): %s"%medium_df['Original'].sum())
print("Successfully built commits (Replicated): %s"%medium_df['Replicated'].sum())
print("Buildable commits (Original): %s"%medium_df['Original Buildable commits'].sum())
print("Total commits: %s"%medium_df['TotalCommits'].sum())
medium_df[['Original (%)','Replicated (%)']].describe()

Successfully built commits (Original): 9602
Successfully built commits (Replicated): 6049
Buildable commits (Original): 26761
Total commits: 33103


Unnamed: 0,Original (%),Replicated (%)
count,39.0,39.0
mean,38.433182,21.146679
std,34.035972,29.631415
min,0.0,0.0
25%,4.994934,0.0
50%,36.242083,1.702128
75%,68.426267,41.251547
max,100.0,95.238095


We get the buildability results for large projects (# of commits >= Q3)

In [20]:
# Large projects (>= Q3)
large_df = df[ df['TotalCommits'] >= q3 ]
print("Successfully built commits (Original): %s"%large_df['Original'].sum())
print("Successfully built commits (Replicated): %s"%large_df['Replicated'].sum())
print("Buildable commits (Original): %s"%large_df['Original Buildable commits'].sum())
print("Total commits: %s"%large_df['TotalCommits'].sum())
large_df[['Original (%)','Replicated (%)']].describe()

Successfully built commits (Original): 12006
Successfully built commits (Replicated): 7739
Buildable commits (Original): 72313
Total commits: 103937


Unnamed: 0,Original (%),Replicated (%)
count,20.0,20.0
mean,22.139185,17.867842
std,29.331318,28.408516
min,0.0,0.0
25%,0.0,0.0
50%,16.485017,9.440663
75%,25.67389,19.171552
max,100.0,100.0
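
The three size classes used above (short, medium, large) can also be assigned in a single step. This is a minimal sketch on toy data, not the study's data. The `label_by_size` helper and the `Size` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

def label_by_size(frame, column='TotalCommits'):
    """Hypothetical helper: label rows as short (< Q1), medium (Q1 <= x < Q3) or large (>= Q3)."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    # Start with 'medium' everywhere, then overwrite both tails.
    labels = pd.Series('medium', index=frame.index)
    labels[frame[column] < q1] = 'short'
    labels[frame[column] >= q3] = 'large'
    return labels

# Toy data, not taken from the study.
toy = pd.DataFrame({'TotalCommits': [10, 20, 30, 40, 50, 60, 70, 80, 90]})
toy['Size'] = label_by_size(toy)
print(toy['Size'].value_counts())
```

With such a label, a call like `df.groupby(label_by_size(df))[['Original (%)', 'Replicated (%)']].describe()` would presumably produce the three summaries in one table, with the same boundary conventions as the separate cells.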


We used the Wilcoxon signed-rank test to compare, project by project, the buildability percentages obtained in the previous experiment with those obtained in the current experiment.

In [21]:
from scipy.stats import wilcoxon

original = list(df['Original (%)'])
replicated = list(df['Replicated (%)'])

wilcoxon(original,replicated)

WilcoxonResult(statistic=268.0, pvalue=1.9077637128196546e-06)

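The p-value (about 1.9e-06) is far below the usual 0.05 threshold, so we reject the hypothesis that the paired results are equivalent: the buildability percentages of the two experiments differ significantly.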
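
To illustrate how the test behaves, here is a minimal sketch on toy paired values, not taken from the study. It assumes the default two-sided alternative of `scipy.stats.wilcoxon` and a significance level of 0.05. The test only looks at the paired differences, so a significant result says the two series differ but does not by itself report the size of the difference.

```python
from scipy.stats import wilcoxon

# Toy paired percentages, not taken from the study: every "after" value is lower.
before = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
after = [9.0, 18.0, 27.0, 36.0, 45.0, 54.0, 63.0, 72.0]

result = wilcoxon(before, after)  # two-sided by default

# For the two-sided test, the statistic is the smaller of the positive and
# negative signed-rank sums. All differences here are positive, so the
# negative rank sum, and therefore the statistic, is 0.
print(result.statistic, result.pvalue)

alpha = 0.05  # conventional significance level, an assumption of this sketch
print("different" if result.pvalue < alpha else "no evidence of a difference")
```

A statistic close to 0, as in this toy case, means that nearly all pairs changed in the same direction.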