Skip to content
Ryan Richard edited this page Jun 24, 2016 · 7 revisions

#Good Programming Practices This page is meant to give you some tips and tricks regarding what are some good programming practices to adopt when contributing to Psi4. At the moment it largely focuses on C++, but similar ideas apply to Python in most scenarios.

##Contents

## Headers Headers are essential to C++. They serve as the mechanism for separating interface from implementation. There are many advantages to this that I won't go into here, but can be found with a simple Google search. What I want to do here is provide a section on good header practices. Note most of Psi4 does NOT obey these practices so this is a "do what I say, not what I do" sort of section. In fact Psi4's header situation is what inspired me to write this. I also present many of these practices without explanation. You can either trust me, or Google why they are this way (the answer is usually prevents hard to find bugs, reduces code complexity, and decreases compile time).

What goes into a header?

The real point of a header is to show the public interface to a class to the compiler. This allows the compiler to put your program together without knowing every intimate detail all at once. In particular you header file should avoid providing code for details that a user of the class does not need to know (it's fine to discuss them in documentation). One particularly powerful way to do this is through the "opaque pointer" idiom (a.k.a Cheshire cat idiom a.k.a pimpl idiom). Put simply we do something like:

//In the header file
class MyClass {
   struct MyClassData;
   MyClassData* Data_;
  public:
     MyClass();
     double GetDataPieceOne();
     int GetDataPieceTwo();
};

//In the .cpp file
#include "Path/From/Project/Root/To/MyClass.hpp" //Note the quotes
#include <vector> //Note the angle brackets

struct MyClassData {
     double DataPieceOne;
     int DataPieceTwo;
     std::vector<double> OtherData;
};
MyClass::MyClass() : Data_(new MyClassData())
{}

double MyClass::GetDataPieceOne() 
{
   return Data_->DataPieceOne;
}

int MyClass::GetDataPieceTwo()
{
   return Data-->DataPieceTwo;
}

Let's walk through a couple of points in this example. First note the line struct MyClassData. This is a forward declaration. If you use a class in a manner such that the only thing the compiler needs to know about the class is that it names a type (i.e. you only use pointers or references to it) it should only be included in your header file as a forward declaration and not with its header file. Conversely, if you use an actual instance of a class (i.e. not a pointer or reference), for example:

class MyClass2 {
   boost::shared_ptr<double> Data_;//Needs #include <boost/shared_ptr.hpp>
   public:
      std::vector<double> MakeVector();//Needs #include<vector>
}

then you need to include the header file for it (yes, I'm picking on all the forward declarations of boost::shared_ptr, which only work because somewhere in your include tree the actual header was included).

Next point, when we do use an actual include they come in two flavors: quotes and angle brackets. They are NOT the same (compiler handles them differently)!!!! To put it simply, angle brackets should be used for header files that are not part of your current project and quotes should be used for header files that are part of your project.

Point three. Note the header include order. It is commonly agreed upon that headers should be included in an order similar to:

#include "Header file this source is implementing"
#include "Header files that are part of this library"
#include "Header files that are part of this project"
#include <Header files that are part of external library>
#include <C standard headers>
#include <C++ standard headers>

Also note that we include the entire path (from the project root) to the header file. Although sometimes verbose, this greatly simplifies the multiple uses of the header file during the building, staging, installing, and user experience stages of the code life cycle.

While on the topic of header files I should point out that so called convenience header files should not be used within the code base (if they are provided they are for users of your library only). A convenience header is a header file that does nothing, but include other header files. These are bad for many reasons the chief of which is code coupling. Say for example I have a convenience header file for mints that includes the molecule, basis set, and integral interfaces. Now say I use this header file all over the code anytime I need a molecule or basis set (i.e. basically everywhere). Thanks to coupling, every part of the code that needed a molecule or basis set now depends on the integral interface. In other words, if I add a new flag to the integral interface, code that computes bond distances from a molecule will have to recompile. In general, this is one of the worst culprits of increasing compile time. The reason they are fine for user use is the user is not modifying your code so they are safe from compilation time increase. I realize that having to include many header files increases code duplication, but having to do so is usually a sign of poor API design (or APIs not providing the proper includes)

Note, that it is usually considered ideal to be able to compile a simple program like the following for each header file in your project's public interface:

#include "YourHeaderFile.hpp"

int main() {
   MyClass A;
   A.FirstFxn();
   A.SecondFxn();
   return 1;
}

That is by simply including the header file of your class I should be able to call every member function of your class and use every result, without including any other header files (even those found in the standard libraries).

## Namespaces Namespaces are a coding concept designed to prevent clashes between various libraries. For example, it is relatively common to define a wrapper function Print for printing. For every library that your project links against, the odds increase of someone else having also defined a Print function. In C++, to put our Print function in a namespace MyNamespace we simply write:

namespace MyNamespace{
   void Print();
}

In Python this is done automatically for us by using the name of the file as a sort of namespace. In both C++ and Python, namespaces admittedly make the name of an object/class longer to type; so there is a propensity to try to shorten the name. So let's talk about good and bad ways to do this. One common way to accomplish this in C++ is:

using namespace MyNamespace;

This can be done globally in a file or inside a function definition. If done in a function definition, the statement only applies to that function body. Using it globally should only be done in .cpp files or else you have essentially gotten rid of the namespace (as it will apply to all headers included afterward and all source files that include it). More specifically, header files should always contain the full type (e.g. MyNamespace::Print()) to avoid clashes, or using should be used in function definitions.

In C++11, individual elements and be used as well (with the same caveats as above):

using MyNamespace::Print();

//Some time later
Print();//This will call MyNamespace::Print();

That is we can pull in just the function we want; we no longer have to pull in all of the MyNamespace namespace. In general, in Psi4 your files should have the following namespace structure:

namespace psi{
namespace MyLib{
}//End namespace MyLib
}//End namespace psi

All Psi4 contributions need to be in the psi namespace and then should be in a namespace specific to your library.

##Constness

The const keyword is one of the most misunderstood and underused keywords in C++. Because it's easy to make sure it's in code at the time of writing, and almost impossible to add back in later, you should be aware of what const means and how to use it. Many good explanations can be found on the web, such as ​Why How Cpp Const and Const Correctness.

A quick summary. As you probably can gather, const means something is supposed to not change. This is pretty straightforward in most cases, e.g.:

void MyFxn(const &Variable);

In this code our fxn, MyFxn takes a reference to a variable Variable and agrees not to change it. In general, when writing a function you should strive to take only const arguments; this of course is not always practical, but striving for it will go a long way.

There's a few gotcha's for const that are worth pointing out:

const double* Mat;
double const* Mat;

These are actually equivalent and both mean the data pointed to by Mat is const.

double *const Mat;

This means the pointer is const, i.e. you can't do:

double *const Mat;
Mat=new double[9];

Const can also be on member functions:

struct MyClass{
void MyFxn()const;
 };

this means that calling MyClass::MyFxn() will not change the state of MyClass. With C++11 we have a new related const-ness concept: constexpr. This is sort of an inline but additionally signifies that the result of the function are constant for the literal input it is given. This is typically for computing literal values, e.g. to compute a factorial:

int factorial(int n)constexpr{
    return n==1? 1 : n*factorial(n-1);
}

#### Pass by Reference or Pointer

The default way to pass an object in C++ is by copy. For example,

void MyFxn(Type Variable);

Here we are passing an object of type Type called variable by copy. This is fine for small objects like integers and doubles, but when we are passing large arrays the copy becomes very expensive. To get around this C++ allows you to pass an object by reference or pointer. In general, you should prefer references over pointers as the syntax is nicer and the meaning tends to be clearer (it's worth mentioning to somewhat more advanced users that polymorphism works with references as well, i.e. a function that takes a reference to the base class can be called with a derived class). To pass by reference:

void MyFxn(Type& Variable);

and by pointer:

void MyFxn(Type* Variable);

Note that you are actually passing the object so in the two above code snippets you should assume that MyFxn changes your input (if it doesn't, it should be using the const keyword).

References make it clear that MyFxn isn't responsible for the memory associated with Variable since the memory was necessarily allocated before the call (there's no such thing as a NULL reference) and only under rare circumstances-- that you hopefully will never encounter-- should you call delete on a reference (consider it a convention that you never call delete on a reference). Pointers on the other hand are somewhat more ambiguous, in particular who is responsible for the memory associated with Variable. Should you have allocated it before calling MyFxn? Is MyFxn going to delete it? Again C++11 helps us.

###Smart Pointers Note: Previously smart pointers were only available in the Boost library, but they have since become part of the C++11 standard. So instead of seeing, for example, std::shared_ptr you may see boost::shared_ptr throughout the source code, but this section applies equally to both. However, this does not imply that the two types of shared pointers are compatible with one another (they are not). You should consider the boost versions deprecated and start using the std versions when possible.

As mentioned in the previous section, pointers carry some ambiguity as to memory semantics. Partially to avoid this, there exists two types of smart pointer: unique_ptr's and shared_ptr's. In general you will want to use the former, so let's consider that one first. Basic usage is:

{
  //For the time being you have to do this because of a C++11 oversight
  std::unique_ptr<Type> MyPtr(new Type());
  //,but once we can use C++14 you should do
  std::unique_ptr<Type> MyPtr=std::make_unique<Type>();
  if(MyPtr){ //Will be true if MyPtr holds memory and false otherwise
      //Do something with MyPtr like:
      MyPtr->MyFxn();//Calls Type::MyFxn()
  }//End if
}//End first scope

The code is pretty simple at this level, but behind the scenes a lot happens. First we allocate some memory and give it to the unique pointer. That pointer is now responsible for that memory and will delete it when that pointer goes out of scope (presently at the last right curly bracket). You are not responsible for calling delete on it, nor should any subfunctions do so either (although there isn't really a reason for a function to take an unique_ptr, but that's beside the point). As the name unique_ptr should suggest, only one pointer can point to the memory at a given time. Hence:

std::unique_ptr<Type> MyPtr(new Type()), AnotherPtr;
AnotherPtr=MyPtr;
bool Point2SameThing=(AnotherPtr==MyPtr);//This is false
if(MyPtr){}//This is false, MyPtr is not managing anything
if(AnotherPtr){}//This is true

The other type of smart pointer is the shared_ptr. Because unique_ptr relies on move semantics, which weren't available until C++11, it was common throughout Psi4 to use boost::shared_ptr in many situations where one would have wanted unique_ptr. Please, try to break from this habit and only use shared_ptr when you really need it. As the name shared_ptr suggests the memory it points to is shared. Thus repeating the example above:

//C++11 does have this function
std::shared_ptr<Type> MyPtr=std::make_shared<Type>(), AnotherPtr;
AnotherPtr=MyPtr;
if(AnotherPtr==MyPtr){}//This is True
if(MyPtr){}//This is true
if(AnotherPtr){}//This is true
MyPtr.reset();//MyPtr is now NULL
if(MyPtr){}//So this is now false
if(AnotherPtr){}//This is still true

As the last three lines try to demonstrate, shared_ptr keeps track of how many shared_ptr instances still point to the memory so that the memory is only freed when no one is still using it. shared_ptr should only be used when multiple instances need to share the same memory and they all need to be responsible for it. If you are just passing a large chunk of memory, consider passing by reference.

I guess I should note the third type which is weak_ptr. If you are using this I highly suspect you are misusing shared_ptr. Basically, weak_ptr allows you to break cyclic shared_ptr cycles (canonical example if an instance of class A contains a shared_ptr to an instance of class B, which contains a shared_ptr to the same instance of class A). If you made one of them a weak_ptr the cycle breaks.

## With C++11 we have a new type of reference: the "RValue" reference. If you don't know the difference between an rvalue and an lvalue, you should probably just skip this section as compilers typically implement this sort of thing for you already (it's called copy elision). We now just have a syntax for it. Basically it boils down to if you have a function:

Type MyFxn(){
   return Type();
}

as written this function returns a copy of an unnamed temporary of Type, but that unnamed object is created just to be copied and then destroyed; seems silly. What if there is a way to tell the compiler to just give me the original object and don't copy it. That's where RValue references come in. If my object knows how to be constructed from an rvalue we can do this:

//Calls Type::operator(Type &&)
Type MyObject=std::move(MyFxn());

The double ampersand is the rvalue reference.

##Doxygen

Make sure you comment all code! Doxygen provides a very neat way of producing HTML documentation for a project. Full details are provided ​here, but in a nutshell you should do the following. A function with the signature ::

int Factorial(int n, bool print)

should be commented as follows::

/**
 * This function computes the factorial of a number, optionally printing 
 * @param n - the number whose factorial is to be computed.
 * @param print - whether to print the result
 * @return the factorial
 */
int Factorial(int n, bool print)

Class variables can be commented using the inline version::

/// Stores the amount of memory available, in bytes.
size_t memory_;
/// The number of irreducible representations.
int nIrreps_;

## How to name keywords in src/bin/psi4/read_options.cc

A few guidelines for standardizing option names among modules.

  • TRIPLES (not trip), TRIPLETS (not trip), SINGLES (not sing), SINGLETS (not sing)
  • CONVERGENCE (not conv, not converge) and TOLERANCE (not tol)
  • Convergence of a method should be governed by an E_CONVERGENCE for energy and either a D_CONVERGENCE for density or a R_CONVERGENCE for residual/amplitudes. All of these should be doubles- let the input parser handle the flexible input format.
  • Diis should have a boolean DIIS (not do_diis, not use_diis) to turn on/off diis extrapolation, a DIIS_MIN_VECS and DIIS_MAX_VECS for minimum and maximum number of diis vectors to use, and a DIIS_START which is the iteration at which to start saving vectors for diis. Not all modules conform to all these at present, but they're as standardized as they can be without changing code.
  • AMPS (not amplitude, not amp) for amplitudes
  • NUM_ (not n) for number (e.g., NUM_AMPS_PRINT, MAX_NUM_VECS, NUM_THREADS)
  • Some names that could be split into multiple words are staying as one. Use MAXITER, CACHELEVEL, PUREAM, DERTYPE.
  • INTS (not integrals), also OEI (not oe_integrals) for one-electron integrals and TEI (not te_integrals) for two-electron integrals
  • PERTURB (not pert) for perturbation
  • Use PRINT options to indicate printing to output file. Use WRITE options to indicate printing to another file. This probably isn't entirely valid now but should be observed in future. The complement to WRITE is READ. PRINT, READ, and WRITE will usually be the last words in an option name.
  • Use FOLLOW_ROOT for the state to be followed in geometry optimizations
  • WFN (not wavefunction)
  • You're welcome to use WFN and DERTYPE as internal options, but plan to have these set by the python driver and mark them as !expert options. Really avoid using JOBTYPE.
  • You're not welcome to add CHARGE or MULTP options. Plan to get these quantities from the molecule object. Since we frequently use subsets of systems (with their own charge and multiplicity), this is safer.
  • Conform. Just grep 'add' psi4/src/bin/psi4/read_options.cc to get a list of all the option names in PSI4 and try to match any conventions you find.
  • If you have a quantity you'd like to call a cutoff, a threshold, a tolerance, or a convergence, consider the following guidelines in naming it.
    • If its value is typically greater than ~0.001, give it a name with CUTOFF.
    • If its value is typically less than ~0.001 and quantities being tested against the option are more valuable with larger values (e.g., integrals, occupations, eigenvectors), give it a name with TOLERANCE.
    • If its value is typically less than ~0.001 and quantities being tested against the option are more valuable with smaller values (e.g., energy changes, residual errors, gradients), give it a name with CONVERGENCE.
  • In deciding how to arrange words in an option name, place the context first (e.g., MP2_AMPS_PRINT, TRIPLES_DIIS). This means PRINT will generally be at the end of an option name.
  • Use INTS_TOLERANCE (not schwarz_cutoff)
  • H in an option name is reserved for Hamiltonian (or hydrogen). Hessian should be HESS.
  • All option names should be all caps and separated by underscores.
  • If you have an option that instructs your module to do something not too computationally intensive and then quit, append _EXIT to the option name.
  • Scaling terms (like for scs) should follow the pattern MP2_SS_SCALE and SAPT_OS_SCALE.
  • FRAG for fragment.
  • AVG for average.
  • For level-shifting, let's try to have it governed by (double) LEVEL_SHIFT only and not a boolean/double combo since the procedure can be turned on (role of boolean) if the value (role of double) has changed.
  • For Tikhonow regularization, use TIKONOW_OMEGA, not regularizer.
  • SYM for symmetry.
  • OCC for occupied/occupation (e.g., DOCC, LOCK_OCC, OCC_TOLERANCE).
  • COND for condition and CONDITIONER for conditioner.
  • LOCAL (not localize).
  • Use AO and MO for atomic and molecular orbitals. When 'O' for orbitals is too obsure or would make for too short a keyword, as in "bool NO" for "Do use natural orbitals", use ORBS for orbitals. So natural orbitals are NAT_ORBS and Brueckner orbitals are BRUECKNER_ORBS.
  • LEVEL (not LVL, not LEV).
  • EX for excitation.
  • VAL for valence.
  • GEOM (not geo, not geometry).
  • SYM (not symm, not symmetry).
  • FILE (unless truly multiple FILES).
  • WRITE/READ for info transfer across jobs. SAVE/RESTART for same in context of restart.
  • Damping should interface through option (double) DAMPING_PERCENTAGE, where a value of 0.0 indicates no damping.
  • Try to avoid COMPUTE or CALC in an option name. If it's a boolean like "opdm_compute" for "Do compute the one-particle density matrix", just use OPDM.
  • Properties should be governed by a PROPERTIES array for the root of interest or by a PROPERTIES_ALL array for all roots in a multi-root calc. Since no module conforms to this right now, use PROPERTY alone and PROP in multi-part option as PROP_ROOT, PROP_ALL, PROP_SYM to conform.
  • Use DF (not ri) for density-fitting and resolution-of-the-identity option names. Only the basis sets are staying as -RI since that's what EMSL uses.
  • more to come, perhaps

## Other C++11 Tidbits

###Range-Based For Loops The C++ way of iterating over a container is to use iterators. For any object that has iterators that are accessible with begin() and end(), we can use a range-based for loop:

std::vector<double> AnArray;
double sum=0.0;
for(const auto& di: AnArray){
   //auto here is double
   sum+=di;
}

This is also the only recommended acceptable use of auto.

###Constructor Delegation This is a fancy name for chaining constructors together:

class MyClass{
   int i_;
   int j_;
   public:
      MyClass(int i, int j):i_(i),j_(j){}
      MyClass(int i):MyClass(i,0){}
      MyClass():MyClass(0){}
};

For objects that have very advanced setup steps, this can save us from a lot of copy/paste (or a factored general initialization function). In the scenario above, we build the class up by calling chaining the constructors. Many novice C++ users have likely tried this, and pre-C++11 it was a compiler error, but now you can do it.

###Variadic Templates Most people have experience with the printf function which takes an arbitrary number of arguments. Such functions are called variadic functions. There's a problem with the old way, they are not type safe. One of the powerful aspects of a strongly typed language is well types. Now we have a way to do this in a type safe way: variadic templates. Basically the syntax is:

template<typename T,typename... Args>
void MyFxn(T arg1, Args...args){
   //Do stuff with args, likely by recursion as:
   //MyFxn(args...);
}

Typically you unpack the arguments one by one via recursion, where the syntax above plucks the first argument out of the args... that was passed to it and then passes the remaining arguments on. This implies that you need to specialize the MyFxn to the end recursion stage of no additional arguments (one gotcha of which is that the definition needs to come before the variadic template). The syntax typename... Args means unpack all of the (possibly 0) remaining types in a "parameter pack" called Args. Then we unpack all of the (possibly 0) actual instances which have types Args into a variable called args. Finally, args... says drop all of the instances here. Other than recursion you can also just dump the instances into an initializer list for say a vector or some other object.

###Initializer Lists Before C++11 there wasn't a common way to initialize objects. You had to go through the constructor or possibly through a series of member functions. Now we have initializer lists, which provide a more standard interface. We already could do this for array types, but now we can do it for all types. The syntax for adding to it a user defined class is a bit involved so I'll skip that, but it provides a nice way to, say, initialize STL containers:

std::vector<int> MyVector={1,2,3,4,5,6};

###Aggregate Initialization Aggregate data types (commonly, structures without a constructor) can be initialized with similar syntax to initializer lists.

struct MyStruct {
  int i;
  double d;
  std::string str;
};

void SomeFunction(void)
{
    MyStruct m{5, 3.14, "Some String"};
}