Assuming you have a copy of LLVM installed and *opt* is on your path, generate a dataset by running:
``` 
./genProjDataset.py -n <project-name> -p <project-root-dir> -o <output-file>
```

For example, assuming the source code of *slepc* is stored in ~/xSDK/hpc-apps/slepc (and a compile_commands.json compilation database has been generated for it), running
```
 ./genProjDataset.py -n slepc -p ~/xSDK/hpc-apps/slepc -o slepc_xsdk_dataset.csv
```
generates slepc_xsdk_dataset.csv, which can be analyzed with the help of pandas as shown below:

```
import pandas as pd
import numpy as np

# Load the generated dataset and drop the index column written by the CSV export.
slepc_data = pd.read_csv("slepc_xsdk_dataset.csv")
slepc_data = slepc_data.drop(columns=["Unnamed: 0"])
slepc_data.head()
```

Unnamed: 0,CC,D,E,I,L,LOC,N,N1,N2,T,...,mu2',AvgShortestPath,Betweenness,Closeness,Eccentricity,FanIn,FanOut,IsIsolated,Katz,Name
0,5,18.9498,9680.0,0.612653,0.0527711,1797,220,81,139,537.778,...,3,0.0,0.0,0.0,0,0.0,0.0,0,0.0,AddNorm2
1,46,42.9815,64343.3,0.270108,0.0232658,329,499,154,345,3574.63,...,3,0.0,0.0,0.0,0,0.0,0.0,0,0.0,ApplyTranspose_FullBasis
2,46,42.9815,64343.3,0.270108,0.0232658,311,499,154,345,3574.63,...,3,0.0,0.0,0.0,0,0.0,0.0,0,0.0,Apply_FullBasis
3,46,42.7231,63572.0,0.271741,0.0234065,254,496,153,343,3531.78,...,3,0.0,0.0,0.0,0,0.0,0.0,0,0.0,Apply_Linear
4,666,377.58,15009900.0,0.0520459,0.00264845,1268,7420,2342,5078,833885.0,...,5,0.0,0.0,0.0,0,0.0,0.0,0,0.0,ArrowTridiag


It is [said](https://www.cppdepend.com/Metrics#MetricsOnApplication) (**original source unclear**) that methods with a *cyclomatic complexity* higher than 15 are hard to understand, while those where the metric rises above 30 are extremely complex and should be split into smaller methods.

However, this makes me wonder where these numbers came from. It could be argued that one of the major indicators of a well-designed system/code/repository is its lifespan. We have access to multiple codes that have been serving the scientific community for two or three decades or more. Wouldn't it be a good idea to *re-define* these thresholds based on analyses of those codes?
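One way to start on this question is to look at where the cyclomatic complexity of a long-lived code actually falls. The sketch below is only an illustration: `empirical_thresholds` is a hypothetical helper, the choice of quantiles is an assumption, and the values in `toy_cc` are made up rather than taken from any project. In practice the input would be the `CC` column of the generated dataset.

```python
import pandas as pd


def empirical_thresholds(values, quantiles=(0.50, 0.90, 0.99)):
    """Return the requested quantiles of a metric as a plain dict.

    Hypothetical helper: the quantile levels are an assumption, not an
    established convention for code-quality thresholds.
    """
    series = pd.Series(values, dtype="float64")
    return {q: float(series.quantile(q)) for q in quantiles}


# Made-up CC values for illustration; replace with slepc_data["CC"].
toy_cc = [1, 2, 2, 3, 5, 8, 13, 21, 46, 666]
thresholds = empirical_thresholds(toy_cc)

for level, value in thresholds.items():
    print(f"{level:.0%} of functions have CC <= {value:.1f}")
```

Comparing such quantiles across several mature projects would show whether 15 and 30 sit anywhere near the observed distribution.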

A similar issue arises with *mu2': the number of parameters to a function*. Five or fewer is recommended, and the suggestion is that methods with more than 5 parameters are painful to call, and even that they have the potential to degrade performance; add more fields and properties to classes and structs instead, they say. But is there any actual evidence for the choice of this number in modern codes, or does it just sound "reasonable"?

*mu2: the number of variables declared in a function body* is similar. The magic number is 8 in this case.
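A quick check of how well a code conforms to the three conventional limits mentioned above (15 for CC, 5 for mu2', 8 for mu2) could look like the following. This is a sketch under assumptions: `violation_rates` is a hypothetical helper, the toy rows are invented, and the `mu2` column name is assumed to exist in the dataset (it falls in the elided part of the output shown earlier).

```python
import pandas as pd

# Conventional limits quoted in the text; "mu2" as a column name is an assumption.
LIMITS = {"CC": 15, "mu2'": 5, "mu2": 8}


def violation_rates(df, limits=LIMITS):
    """Fraction of functions whose metric strictly exceeds each limit.

    Columns missing from the DataFrame are skipped silently.
    """
    return {
        col: float((df[col] > limit).mean())
        for col, limit in limits.items()
        if col in df.columns
    }


# Invented rows for illustration; replace with the real dataset.
toy = pd.DataFrame({
    "Name": ["f", "g", "h", "k"],
    "CC": [5, 46, 46, 666],
    "mu2'": [3, 3, 6, 5],
    "mu2": [2, 9, 12, 8],
})
rates = violation_rates(toy)
print(rates)
```

If a large share of a decades-old, actively used code exceeds a limit, that is at least a hint that the limit deserves a second look.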

---

In addition to "re-defining" the above thresholds, other metrics worth taking a closer look at in my opinion are the metrics for 
*   Difficulty
*   Effort
*   Time

All of these are algebraically related, so results from investigating one would naturally carry over to the rest. I propose *time*. For example, one research question could be: **To what extent does the calculated total time needed to develop all of a project's functions reflect the actual time the project has been under active development?** This doesn't seem a hard question to answer: using version control data, we can subtract the time the project was on a break from its total running period, while the amount calculated using Halstead's metric is simply the sum of the T column in our dataset.
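The comparison described above can be sketched as follows. Several things here are assumptions: that T is in seconds (Halstead defines T = E / 18 seconds, and the E and T values in the rows shown earlier do differ by a factor of 18), that a gap of more than 30 days between commits counts as a "break", and that commit timestamps have already been extracted from version control. Both helper functions and the commit dates are hypothetical.

```python
from datetime import datetime, timedelta


def halstead_hours(t_seconds):
    """Sum per-function Halstead time estimates (assumed seconds) into hours."""
    return sum(t_seconds) / 3600.0


def active_days(commit_times, max_gap=timedelta(days=30)):
    """Days of active development: sum of gaps between consecutive commits,
    ignoring any gap longer than max_gap (treated as a break)."""
    times = sorted(commit_times)
    total = timedelta(0)
    for earlier, later in zip(times, times[1:]):
        gap = later - earlier
        if gap <= max_gap:
            total += gap
    return total.total_seconds() / 86400.0


# T values from the five rows shown earlier; in practice use slepc_data["T"].
toy_t = [537.778, 3574.63, 3574.63, 3531.78, 833885.0]

# Invented commit dates for illustration only.
toy_commits = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 11),
    datetime(2020, 6, 1),
    datetime(2020, 6, 21),
]

estimated_hours = halstead_hours(toy_t)
observed_days = active_days(toy_commits)
print(f"Halstead estimate: {estimated_hours:.1f} hours")
print(f"Active development: {observed_days:.1f} days")
```

The two quantities are not in the same unit of effort (one is estimated programming time, the other is calendar time), so any comparison would also need an assumption about team size and working hours.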

----

Another research question these datasets could help us answer is: "How does the area of focus/complexity of the project change over time/across versions?" Here an area could refer to a section of the call graph (a community, say), or a subdirectory or class (as in the case of PETSc). We could then rank the different areas by how many of their members appear among the top n functions for a given metric (CC or LOC, for example). I think it would be interesting to see how these rankings change across versions as the project matures.
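The ranking step could be sketched like this. The `Area` column is hypothetical: the dataset shown earlier has no such column, so it would have to be derived first (from file paths or from community detection on the call graph). The helper name and the toy rows are likewise made up.

```python
import pandas as pd


def rank_areas(df, metric, n, area_col="Area"):
    """Count how many of the n largest functions (by metric) fall in each area.

    Returns a Series indexed by area, sorted by count in descending order.
    """
    top = df.nlargest(n, metric)
    return top[area_col].value_counts()


# Invented rows; "Area" is an assumed, pre-computed label per function.
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e", "f"],
    "Area": ["area_1", "area_1", "area_2", "area_2", "area_3", "area_1"],
    "CC": [50, 40, 30, 5, 20, 1],
})
ranking = rank_areas(toy, metric="CC", n=4)
print(ranking)
```

Running the same function on datasets generated from successive releases would give one ranking per version, which could then be compared directly.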

--- 
Similarly, we could ask whether the quality (for some definition of ideal quality) of the most important functions (the core of the graph, say, or those with high FanOut and FanIn) is better than that of the rest, and whether this has always been the case.
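A first pass at this comparison might look like the sketch below. The definition of "core" used here (FanIn and FanOut both above their 75th percentile) and the use of median CC as a stand-in for quality are assumptions made for illustration; `compare_core` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd


def compare_core(df, metric="CC", q=0.75):
    """Median of a metric for 'core' functions versus all other functions.

    A function counts as core when both its FanIn and FanOut lie strictly
    above the q-th quantile of the respective column (an assumed definition).
    """
    core_mask = (
        (df["FanIn"] > df["FanIn"].quantile(q))
        & (df["FanOut"] > df["FanOut"].quantile(q))
    )
    return {
        "core": float(df.loc[core_mask, metric].median()),
        "rest": float(df.loc[~core_mask, metric].median()),
    }


# Invented values for illustration; replace with the real dataset.
toy = pd.DataFrame({
    "FanIn": [10, 9, 1, 0, 2, 1, 0, 1],
    "FanOut": [8, 7, 1, 2, 0, 1, 1, 0],
    "CC": [4, 6, 20, 30, 10, 12, 40, 8],
})
medians = compare_core(toy)
print(medians)
```

Applying this to each released version would show whether any gap between core and non-core functions has been stable over the project's history.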

--- 
In short, given that we can implement a proof of concept (datasets can already be generated and only the data analysis remains), I think a title such as "Refining Code-quality Metrics Using Exascale Computing Project Codes and a Clang-LLVM based Tool" would be appropriate. Among our contributions would be:

*   Datasets other SE analysts can re-employ (we can hide names of functions if needed.)
*   A code-quality tool based on Clang/LLVM 
*   A quality analysis of several ECP codes. 
