Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVFG: Obtain SVFGNode corresponding to argument of CallSite #13

Open
obraunsdorf opened this issue Mar 23, 2017 · 10 comments
Open

SVFG: Obtain SVFGNode corresponding to argument of CallSite #13

obraunsdorf opened this issue Mar 23, 2017 · 10 comments
Labels

Comments

@obraunsdorf
Copy link

Hi,

I recently tested SVF (commit 5355fc2). Great piece of work from my point of view!
Unfortunatly I have problems using the API correctly and I would be pleased if you could guide me a little.
I initialize SVF with the following Instuctions:

bool runOnModule(Module &m) override {
        FlowSensitive* fspa = FlowSensitive::createFSWPA(m);
        SVFG* svfg = fspa->getSVFG();
        PAG* pag = fspa->getPAG();
        PTACallGraph* ptaCallGraph = fspa->getPTACallGraph();

Later I obtain a llvm::CallSite and want to access the SVFGNodes corresponding to the arguments of that CallSite with

if (svfg->hasActualINSVFGNodes(callSite)) {  // why never true?
    auto set = svfg->getActualINSVFGNodes(callSite);
    int i = 0;
    for(auto it = set.begin(); it!=set.end(); ++it) {
       errs() << "param No. " << i++;
       errs() << "node id: " << *it << "\n";
    }
} else {
    errs() << "no actual INSVFGNodes\n";
}

But the SVFG::callSiteToActualINMap (include/MSSA/SVFG.h:100) is empty everytime. What am I missing here? Do I have wrong initialization steps?
I attached the code of my LLVM Pass as well as source code and LLVM IR of the module under test.
Logger.zip
See LoggerOO.cpp: my ultimate goal is to track back the value of parameter 1 of Logger::log2() [line 75] so that SVF reports its value either originates as return value of Encryptor::encrypt() [line 69] or as output parameter of assign() [line 71]

It would be nice if you could help me with this.
Thank you.

@yuleisui
Copy link
Collaborator

yuleisui commented Mar 24, 2017

Hi Oliver,

Thanks for your question. You case looks a bit complex.

If I understand correctly, do you want to track the value-flows of a particular pointer parameter at a callsite, say "* sp" in your example? Then, You may wish to refer to "getActualParmSVFGNode(const PAGNode* aparm,llvm::CallSite cs)".

If you want to track the value-flows of an address-taken object, say the target pointed to by "*sp" at callsite "log2(*sp,foo)". Then, you may wish to refer to "getActualINSVFGNodes(llvm::CallSite cs)". Any object that may be modified inside callee "log2" is put as a ActualIN SVFGNode before the callsite.

During the past few months, we've made SVF to support C++ programs, e.g., virtual calls. We tried to model as many C++ library calls as possible. However, when lowering a C++ program into a bitcode file, they are still things may not be modelled, especially some of the C++ internal classes, such as class string in your case. String is internal in C++, whose function bodies are not included in the a bc file. Any string operations e.g., =, +, <<, * are no longer simple assignment or dereference, rather, they are translated into a series of invocations to its (string) overloading functions.

You can see some C++ function names like:

line 624 in Logger.bc
"_ZStlsIcSt11char_traitsIcESaIcEERSt13basic_ostreamIT_T0_ES7_RKNSt7__cxx1112basic_st"
corresponds to "<<" at line 39 in Logger.cpp

line 629 in Logger.bc
"_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEaSEPKc"
corresponds to "<=" at line 40 in Logger.cpp

The above function calls are not modeled by SVF, so the value-flows of the address-taken objects pointed to by "*sp" may be missed. We will need to add the side-effects of those functions in file Util/ExtAPI.cpp after demangling their names (Util/CPPUtil.cpp).

For now, another way to achieve you goal is to change "string&" to be "char*&". Then the value-flows will be soundly generated.

@obraunsdorf
Copy link
Author

Thanks for the detailed help!!

Seems like getActualParmSVFGNode() provides the functionality I was looking for. However I can't traverse the value flow graph from there because the returned SVFGNode doesn't have any incoming or outgoing edges. Is this intended?

Fortunately, I managed to get the SVFGNode corresponding to a function argument using

for(const PAGNode* pagNode : pag->getCallSiteArgsList(callSite)) {
	const SVFGNode* svfgNode = svfg->getDefSVFGNode(pagNode);
}

From this SVFGNode I can traverse the Edges through svfgNode->getInEdges() / svfg->getOutEdges().

With the information from your answer about C++ internal classes, I was able to track back the value flow of *sp to the expected locations, simply by inserting
{"_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEaSEOS4_", ExtAPI::EFT_A1R_A0R},
into Util/CPPUtil.cpp. If I understood your code correctly, this causes generation of a load and store edge from the right hand side to the left hand side of the std::string::operator=() (string assingment function). Am I right?
As changing std::string to char* is no option for my project, I guess I have to find all external functions (e.g stdlib functions) which are not linked into the LLVM bit code file and model their behaviour within your ExtAPI class.

Thinking a little bit further the following question comes in my mind:
Do you think the process of building the SVFG could be sped up by also summarizing the behaviour of non-library functions known by the user (and unknown at compile time of SVF)? Maybe the user could specify a list of known functions and their internal dependencies in the ExtAPI-fashion, so that llvm::StringMap<extf_type> info is extended dynamically within ExtAPI::init()?
I guess this would require a more generic fashion, since the ExtAPI is bounded to extf_t-Enum.

@yuleisui
Copy link
Collaborator

A very good try!

  1. On getActualParmSVFGNode
    Yes, you can get ActualParmSVFGNode via "pag->getCallSiteArgsList(callSite)"
    For example, your callsite "logger.log2(*sp, foo);" has three arguments: "logger" (hidden this pointer), "*sp" and "foo".

  2. On SVFGNodes for top-level pointers and address-taken objects
    From my understanding, you want to track the value-flows of the object that "*p" points to. Is this your intention?
    If so, you may wish to use one of the following three ways to get an SVFGNode

  • use"getActualINSVFGNodes(llvm::CallSite cs)" at the callsite.
  • use getFormalINSVFGNodes(llvm::Function) at the entry of callee function "log2(std::string& msg, std::string& msg2)".
  • find a StoreSVFGNode or LoadSVFGNode which is accessing the object (this is a commonly used way).
    After you obtain the SVFGNode, you can iterative its incoming/outgoing edges on SVFG to track the value-flows. Remember that every IndirectSVFGEdge has a "PointsTo" field (representing a group of objects) to specify the value-flows of those objects (https://github.com/unsw-corg/SVF/wiki/Technical-documentation#322-svfgedge).
  1. On C++ internal classes
    We have just committed a new patch which summaries the side-effects of the C++ string APIs. Please see 53dafe6.
    We have also added an option ("-modelConsts") to enable analysis of constant strings for your case. I believe this will solve your problem.
    Please use the following command line to dump memory SSA and SVFG
wpa -fspta -modelConsts=true -stat=false  -dump-mssa -dump-svfg LoggerOO.ll

You will see two ActualINSVFGNodes and one ActualOUTSVFGNode. These are the ones you want to obtain.

CALMU(MR_6V_4)  pts{222 }
CALMU(MR_8V_2)  pts{226 }
  invoke void @_ZN6Logger4log2ERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_(%class.Logger* %logger, %"class.std::__cxx11::basic_string"* dereferenceable(32) %23, %"class.std::__cxx11::basic_string"* dereferenceable(32) %foo)
          to label %invoke.cont27 unwind label %lpad15, !dbg !722 
6V_5 = CALCHI(MR_6V_4)  pts{222 }
  1. On function summarization
    It is a good idea. In pointer analysis, function summarization (either external or internal) is an alternative approach to traditional "whole-program" analysis. Summarization is particularly useful for modular analysis of programs which heavily use frameworks and libraries. You can have a try, however, making summarization of every function may be error prone, because you have to understand the behavior of every app function.

@obraunsdorf
Copy link
Author

Hi Yulei,

sorry for this delayed reply! Unfortunately my time to work with SVF is limited at the moment.

The patch (53dafe6) you provided works great, thanks! I was able to get a big part of my work done after applying it.
Unimportant information: I had to add the following lines to ExtAPI to get the particular overload of std::string::operator=() that my code uses.

/// string operator=: operator= (string&& str)
+    {"_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEaSEOS4_", ExtAPI::CPP_EFT_A0R_A1R}, // c++11

So it seems that there is more work to be done with the standard c++ lib. But I will try to cover this topic in another issue on a higher level.

Regarding points 1 and 2 of your last post:
My intention is to track the value flow of a particular parameter of a particular function X, no matter if the parameter is given by value, by pointer or by reference (which equals a pointer in LLVM IR if I'm correct). When tracking the value flow of this parameter I want to be able to tell that the value comes from
(a) the return value of another function that has been called before function X
(b) the output parameter i of another function that takes 0 <= i <= n arguments and that has been called before function X.
(c) the input parameter of the function that calls function X

So far I accomplished (a) and (b) but there are a few bugs due to my lack of understanding of SVFG and how it is built in all the corner cases.
Currently I replaced SVFGOPT (it confused me that it deleted nodes that I thought I need for my analysis) with unoptimized SVFG. Then for the particular function X, I iterate over FormalInSVFGNodes (as you said) to get a parameter node.

Question 1: The FormalINSVFGNodeSet seems to not guarantee that its order corresponds to parameter ordering. How can I know that the FormalINSVFGNode represents the i-th parameter of function X?
Related Question 1.1: FormalINSVFGNodeSet seems to contain more FormalINSVFGNodes. Only some of them have incoming Edges (which I use for traversal). Where do they come from and how can I know that I it does not correspond to a parameter of function X?

For (a) and (b): Sometimes while traversing SVFG I visit a FormalRetSVFGNode. Then it's obvious that value comes from the return value of another function. But sometimes I only get FormalOutSVFGNodes. It's obvious from which function it comes but it's hard to tell from which parameter (No. 0, 1, 2, ..., n) of the function it comes or if it actually comes from the return value of the function.
I used a little workaround(*) which searches for a StoreSVFGNode within the incoming edges of the FormalOutSVFGNode. Then I can use storeNode->getPAGDstNode()->getValue(), look for a llvm::LoadInstruction which uses this value, call loadInstuction->getPointerOperand() to get the llvm::Value which actually represents the formal parameter and finally cast it to llvm::Argument which provides me with getArgNo(). If the argument number is i=0 and the function has a return value then I know the FormalOutSVFGNode represents this return value.

Question 2: Is this workaround the way to go? In all corner cases? Or is there a simpler and reliable way to get the parameter No. i?
So far my workaround seems to work in all my test cases.

For (c): Very rarely while traversing SVFG I visit a FormalParmSVFGNode. Then it's easy to know from which input parameter of which function the value comes from. But most of the time I get a FormalINSVFGNode. As with FormalOUTSVFGNodes I cannot tell to which parameter number the node corresponds. And I cannot use the workaround(*) as I can't detect any StmtSVFGNode around.

Question 3: How can I know to which parameter number i the FormalINSVFGNode corresponds?
[This question seems to be solved, if question 1 is solved and vice versa]

Many thanks in advance!

@yuleisui
Copy link
Collaborator

yuleisui commented May 4, 2017

Hi Oliver,

We didn’t label an index (e.g., i) for an address-taken object o via i-th formal parameter p passing into the entry of a function. This is because the object o may not only refer to *p, but also **p or ***p, their indexes are the same, which will be confusing for users. In addition, another parameter e.g., q may also point to this object, so object o may not have a fixed index. Besides, if o is a global object passing indirectly as a FormalInSVFGNode via global pointers, which does not have an argument index.

My advice is to start the backtracking from a statement SVFGNode (as your tainted source) rather than a formal parameter on SVFG. It the can always reach a formal parameter if the object/pointer is passed directly or indirect via a parameter.

I have listed several examples for you to understand.

  • Example 1
f(p){
*p = q
} 

A DirectSVFGEdge (value-flow of p) from parameter p (FormParmSVFGNode) to *p=q (StoreSVFGNode)

  • Example 2
f(r){
q = *r
p = *q
} 

A DirectSVFGEdge (value-flow of r) from parameter r (FormParmSVFGNode) to q=*r (LoadSVFGNode), and another DirectSVFGEdge (value-flow of q) from q=*r (LoadSVFGNode) to p=*q (LoadSVFGNode)

  • Example 3
f(x){
*r = x  CHI(o)   // r points to o
p = *q   CHI(o)	  // q points to o
} 

A DirectSVFGEdge (value-flow of “x”) from parameter “x” (FormParmSVFGNode) to “*r=x” (StoreSVFGNode), and another IndirectSVFGEdge (value-flow of “o”) from “*r=x” (StoreSVFGNode) to “p=*q” (LoadSVFGNode)

  • Example 4
f(x){
r = *x    // r points to o
p = *r	  
} 

A DirectSVFGEdge (value-flow of x) from parameter x (FormParmSVFGNode) to *r=x (LoadSVFGNode), and another DirectSVFGEdge (value-flow of r) from *r=x (LoadSVFGNode) to p=*r (LoadSVFGNode).

  • Example 5
    The following case can't reach a formal parameter when backtracking from p=*r, because o is a global object.
caller:
CHI(o)    // AuctualInSVFGNode of o and o is global
f(y)

f(q){
CHI(o)    // FormalInSVFGNode of o
r = *x    // x is a global pointer that points to o
p = *r	  
} 

An IndirectSVFGEdge (value-flow of o) from the AuctualInSVFGNode to FormalInSVFGNode, and then to r=*x (LoadSVFGNode) via IndirectSVFGEdge (value-flow of o), and we have another DirectSVFGEdge (value-flow of r) from r=*x (LoadSVFGNode) to p=*r (LoadSVFGNode).

Please let me know if you have anything more want to discuss.

@yuleisui
Copy link
Collaborator

yuleisui commented May 4, 2017

Hi Oliver,

If possible, could you help us collect the summaries of some C++ library calls (as what you did for string operator=)?

It will benefit other users too.

Thanks

@obraunsdorf
Copy link
Author

Hi Yulei,

yes I will help you collecting the libc++ summaries. After I get my little example to work, I plan to analyze a real world c++ project with lots of c++ library calls. I will provide a git patch or pull request or whatever you prefer to commit the ExtAPI summaries.

But I am really confused at the moment. Your examples seem straight forward to me, except example 5. Where does x come from?(Just a typo and should be q? => I assume x is q for the following).
My Question is: what is the reason to add an Entry-CHI(o) in f(q)? Because there is a Ref of o (which is pointed to by q) in r = *q - a possible side effect, right? So why can't we get a connection between Entry-CHI and the FormalParm q? The are obviously related to each other.
=> Is there any way to get to the llvm::Argument from an Entry-CHI node or Call-CHI node in all corner cases?
If no, should I consider building a custom SVFG?

Another case: if my particular parameter is not a pointer, but just an i8 for example. If I understood you correctly there is no Actual/FormalParmSVFGNode for this parameter and no ActualIN/FormalINSVFGNode, right? Do I have to use PAG then to track its value?

@yuleisui
Copy link
Collaborator

yuleisui commented May 5, 2017

Oliver,

x in example 5 is a global pointer. q is intended to sit there. Let us assume it won't be used in the body of function f. Sorry about the confusion.

Please see the revised one below:

main(){
L1: *x = z;
STORE-CHI(o)   // x is a global pointer that points to o

CALL-CHI(o)    // AuctualInSVFGNode of o and o is global
L2: f(y);
}


f(q){
L3: ENTRY-CHI(o)    // FormalInSVFGNode of o

LOAD-MU(o)
L4: p = *x;    // x is a global pointer that points to o	  
} 

In this example, when you perform backtracking from "p=*r" (tainted source) to "*x=z", the traversed value-flows have nothing to do with parameter q.

L4 --o--> L3 --o--> L2 --o--> L1

We don't model value-flows (def-use chains) for no-pointer values, as they are explicit on LLVM SSA.

Please let me know if the above still doesn't answer your question.

@obraunsdorf
Copy link
Author

Ah okay, that makes sense. So there are definitly CHI and MU nodes, that do not correspond to function parameters. What about this example, where x and y are local allocations in the main function. Is the placement of the CHIs and MUs correct and can we relate them to their actual and formal parameters?

main() {
	x = alloc()
	y = alloc()
	a = *x
	b = *y

	CSCHI(a=>x)
	CSCHI(b=>y)
	functionA(a, b);
	CSMU(a=>y)
	CSMU(b=>x)

	CSCHI(a=>y)
	functionX(a)
	CSMU(a=>?)
}


functionA(*p, *q) {
	ENCHI(p=>x)
	ENCHI(q=>y)
	t = *p
	*p = *q
	*q = t;
	RETMU(p=>y)
	RETMU(q=>x)
}

@yuleisui
Copy link
Collaborator

yuleisui commented May 5, 2017

I believe you can use CollectPtsChain (lines 428-450) in MemRegion.cpp for your case.

You can apply CollectPtsChain for each parameter p to decide whether an object (at an ENTRY-CHI) is passed via p into a callee function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants