Module Code Model
The Code Model (code link) is our abstract representation of the code graph and of the metrics and features relevant for code smell detection. It is a utility service that supports the main code quality analysis use case, as well as the use cases offered by the Dataset Explorer.
Below is a class diagram of the most significant segment of our CaDET code model (code link). The CaDETModel is built from source code using our CodeParser, which is the second major package of this module (code link). The parser ignores external code (e.g., built-in classes and methods, such as those from the System namespace, and imported packages) and creates objects only for the supplied classes.
The CaDETClass and CaDETMember are the most sophisticated part of the CaDETModel. The CaDETClass maps to a class, while the CaDETMember maps to an executable member of a class, such as a method, constructor, or property (i.e., getter/setter). Together, these classes build the code graph, and they include:
- Structural connections, where a CaDETClass has a list of Fields, a list of executable Members, links to its Parent and OuterClass, while a CaDETMember knows its ReturnType, Variables, Parameters, and Parent.
- Access and invocation connections, where the CaDETMember knows which Fields and Accessors (including Mutators) it accesses, and which Methods it invokes.
- The source code of the code snippet.
- A collection of metrics described below.
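The connections listed above can be pictured with a small, language-neutral sketch. It is written in Python for brevity and is not the actual C# implementation: the names ClassNode, MemberNode, FieldNode, and coupled_classes are hypothetical stand-ins, and the return type, variables, and parameters of a member are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class FieldNode:
    """Hypothetical stand-in for a field of a class."""
    name: str
    parent: "ClassNode"


@dataclass(eq=False)
class MemberNode:
    """Hypothetical stand-in for an executable member (method, constructor, property)."""
    name: str
    kind: str
    parent: "ClassNode"
    source_code: str = ""
    # Access and invocation connections.
    accessed_fields: List[FieldNode] = field(default_factory=list)
    accessed_accessors: List["MemberNode"] = field(default_factory=list)
    invoked_methods: List["MemberNode"] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass(eq=False)
class ClassNode:
    """Hypothetical stand-in for a class in the code graph."""
    name: str
    # Structural connections.
    parent: Optional["ClassNode"] = None
    outer_class: Optional["ClassNode"] = None
    fields: List[FieldNode] = field(default_factory=list)
    members: List[MemberNode] = field(default_factory=list)
    source_code: str = ""
    metrics: Dict[str, float] = field(default_factory=dict)


def coupled_classes(member):
    """Names of other classes the member reaches through fields, accessors, or calls."""
    targets = member.accessed_fields + member.accessed_accessors + member.invoked_methods
    return {t.parent.name for t in targets if t.parent is not member.parent}


# Build a two-class graph: Invoice.Print invokes Order.GetTotal, which reads Order.total.
order = ClassNode("Order")
total = FieldNode("total", order)
get_total = MemberNode("GetTotal", "method", order, accessed_fields=[total])
order.fields.append(total)
order.members.append(get_total)

invoice = ClassNode("Invoice")
print_invoice = MemberNode("Print", "method", invoice, invoked_methods=[get_total])
invoice.members.append(print_invoice)

print(coupled_classes(print_invoice))  # {'Order'}
print(coupled_classes(get_total))      # set()
```

Graph-based metrics can then be computed by walking these links rather than by re-parsing the source text.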
The CaDETProject encapsulates the list of compiled classes, read from a folder or supplied as an array of strings to the CodeParser. It can optionally include the CodeLocationLinks and any syntax errors discovered in the code.
The CodeLocationLinks map the discovered code snippets (i.e., classes and members) to specific files and lines of code. They are a utility used by the Dataset Explorer to create more usable datasets. Each link enables easy access to the code snippet, as shown with this CreateProjectWithCodeLinks method example.
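As a rough illustration of the idea, a link can be thought of as a file path plus a line range. The sketch below is in Python and makes several assumptions: LocationLink, snippet_for, the "Order.GetTotal" key format, and the file name Order.cs are all hypothetical and do not reflect the actual CodeLocationLinks structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocationLink:
    """Hypothetical link from a code snippet to a file and a line range."""
    file_path: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def snippet_for(link, file_lines):
    """Return the source lines that the link points to."""
    return file_lines[link.start_line - 1 : link.end_line]


# Contents of a hypothetical Order.cs, one list entry per line.
file_lines = [
    "class Order",
    "{",
    "    private int total;",
    "    public int GetTotal()",
    "    {",
    "        return total;",
    "    }",
    "}",
]

# One link per discovered snippet (class or member).
links = {
    "Order": LocationLink("Order.cs", 1, 8),
    "Order.GetTotal": LocationLink("Order.cs", 4, 7),
}

method_lines = snippet_for(links["Order.GetTotal"], file_lines)
print("\n".join(method_lines))
```

With such a mapping, a dataset entry for a class or member can point an annotator directly to the relevant lines of the relevant file.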
Currently, we support modest C# processing and metric extraction, where the full list of supported metrics can be found here.
Notably, code metric calculation is error-prone for two reasons:
- Many code metrics have multiple definitions. For example, some papers treat the LOC metric as a simple count of newline characters, while others define LOC as ignoring whitespace and comments (some call this metric effective LOC, or ELOC). Some define ELOC as ignoring all curly brackets, while others ignore only the brackets that scope the method body. For metrics calculated from the code graph, such as ATFD (access to foreign data), some papers treat invocations of accessors and mutators (i.e., getters and setters) the same as direct field access, while others do not.
- Many code metric definitions do not account for "syntactic sugar" and advanced language features, as many were designed ten to thirty years ago.
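To make the first point concrete, the sketch below applies three LOC definitions to the same C# snippet. It is a simplified Python illustration, not our metric implementation: physical_loc and effective_loc are hypothetical names, only `//` line comments are recognized, and the variant that ignores only the method body-scoping brackets is not shown.

```python
def physical_loc(source):
    """LOC as the number of physical lines."""
    return len(source.splitlines())


def effective_loc(source, ignore_brace_lines=False):
    """LOC without blank lines and // comment lines.

    When ignore_brace_lines is set, lines holding only a curly bracket are skipped too.
    """
    count = 0
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        if ignore_brace_lines and line in ("{", "}"):
            continue
        count += 1
    return count


snippet = """public int GetTotal()
{
    // Sum the item prices.
    int sum = 0;

    foreach (var item in items)
    {
        sum += item.Price;
    }
    return sum;
}"""

print(physical_loc(snippet))                            # 11
print(effective_loc(snippet))                           # 9
print(effective_loc(snippet, ignore_brace_lines=True))  # 5
```

The same eleven-line method yields 11, 9, or 5 depending on the definition, which is enough to shift a method across a threshold-based smell detection rule.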
For these reasons, there is a high probability that some of our metric calculations are incorrect, or at least do not produce the same results as other tools. Our metric calculations can be viewed here, and we welcome any feedback, issue submissions, or PRs that can help us improve them.
It is crucial to be aware of these ambiguities when working with metric calculation tools in general. Whichever tool is chosen, researchers should apply it consistently on their datasets to avoid introducing inconsistencies in their work.