lothian edited this page Sep 7, 2014 · 2 revisions

The BLAS/LAPACK interface

Computational chemistry is essentially linear algebra on molecular systems, so using stable, portable, scalable, and efficient numerical linear algebra methods in PSI4 is critical. To that end, we use BLAS1 (vector-vector operations, like dot products), BLAS2 (matrix-vector operations, like rank-1 updates), BLAS3 (matrix-matrix operations, like matrix multiplication), and LAPACK (advanced matrix decompositions and solutions). The routines provided by BLAS and LAPACK are standardized, but the performance of actual implementations differs greatly from one version to another. Moreover, the standard interfaces to the libraries are in FORTRAN, so PSI4 provides a common set of C wrappers in libqt/qt.h.

BLAS Wrappers

BLAS wrappers are currently fully supported in double precision.

  • BLAS commands involving matrices are wrapped so as to follow conventional C-style "row-major" indexing, meaning that the column is the fast (contiguous) index, as usual in C/C++.
  • The BLAS1 wrappers allow operations on vectors with more than 2^{31} elements (~16 GB of doubles, no longer an unusual size), so the length argument may be a signed or unsigned long. The stride arguments, however, must be plain ints.

All routines are declared in qt.h. Each routine is prefixed with C_, followed by the standard FORTRAN name of the routine in capital letters. Input parameters that are single primitives (int, double, unsigned long int, char, ...) are passed by value. Arrays, including multidimensional arrays, must be in contiguous memory (as provided by block_matrix, for example) and are passed as a pointer to the first double or int element of the data (this is array[0] if array is double**). BLAS1 routines occasionally return values (DDOT, for instance); BLAS2 and BLAS3 routines always return void. char arguments are case-insensitive. A few examples are provided:

// BLAS/LAPACK
#include <libqt/qt.h>
// block_matrix, init_array
#include <libciomr/libciomr.h>

using namespace psi;
...

// Allocate a,b vectors
int n = 100;
double* a = init_array(n);
double* b = init_array(n);

// Allocate A matrix;
double** A = block_matrix(n,n);
double** B = block_matrix(n,n);
double** C = block_matrix(n,n);

// Call the BLAS1 dot product between a and b
// n can be a ULI with the BLAS1 wrappers,
// All strides must be ints though
double dot = C_DDOT(n, a, 1, b, 1);

// Call the BLAS2 GEMV without transposition
// Note this works in row-major order
C_DGEMV('N', n, n, 1.0, A[0], n, a, 1, 0.0, b, 1);

// Call the BLAS3 GEMM without transposition
// Note this works in row-major order
C_DGEMM('N','N', n, n, n, 1.0, A[0], n, B[0], n, 0.0, C[0], n);

// Arrays allocated with init_array must be free'd, not delete[]'d
free(a);
free(b);

// Block matrices must be freed with free_block
free_block(A);
free_block(B);
free_block(C);
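As an aside on how the row-major convention works: a row-major matrix is exactly the column-major view of its transpose, so a wrapper can compute C = A*B by handing a FORTRAN-style column-major routine the operands in swapped order (C^T = B^T * A^T). The sketch below is illustrative only; the naive kernel stands in for FORTRAN DGEMM, and this is not the actual libqt implementation:

```cpp
#include <cassert>

// Naive column-major C = A*B, a stand-in for FORTRAN DGEMM('N','N',...).
// Element (i,j) of a column-major matrix M with leading dimension ld
// lives at M[i + j*ld].
static void naive_dgemm_colmajor(int m, int n, int k,
                                 const double* A, int lda,
                                 const double* B, int ldb,
                                 double* C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int p = 0; p < k; ++p)
                s += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = s;
        }
}

// Row-major wrapper (hypothetical name): since a row-major buffer is the
// column-major view of the transposed matrix, C = A*B in row-major is
// obtained by computing C^T = B^T * A^T, i.e., calling the column-major
// kernel with A and B swapped. No copies or explicit transposes needed.
void rowmajor_dgemm(int m, int n, int k,
                    const double* A, int lda,
                    const double* B, int ldb,
                    double* C, int ldc) {
    naive_dgemm_colmajor(n, m, k, B, ldb, A, lda, C, ldc);
}
```

For example, with row-major 2x2 matrices A = {{1,2},{3,4}} and B = {{5,6},{7,8}}, `rowmajor_dgemm(2,2,2,A,2,B,2,C,2)` leaves C holding {{19,22},{43,50}}, the ordinary row-major product.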

Important BLAS Routines

BLAS1

  • DDOT: dot product
  • DCOPY: efficient memory copy (with variable stride)
  • DAXPY: y = y + alpha*x
  • DROT: Givens Rotation
  • DNRM2: Euclidean (2-)norm of a vector
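As a reference for the stride (increment) semantics these routines share, here is a naive illustration of DDOT and DAXPY. The function names are hypothetical, and only positive strides are handled (real BLAS also supports negative increments); note the large unsigned length but plain int strides, matching the wrapper convention above:

```cpp
#include <cassert>
#include <cstddef>

// Reference semantics of DDOT: returns sum_i x[i*incx] * y[i*incy].
double ref_ddot(std::size_t n, const double* x, int incx,
                const double* y, int incy) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += x[i * incx] * y[i * incy];
    return s;
}

// Reference semantics of DAXPY: y <- y + alpha * x, with strides.
void ref_daxpy(std::size_t n, double alpha, const double* x, int incx,
               double* y, int incy) {
    for (std::size_t i = 0; i < n; ++i)
        y[i * incy] += alpha * x[i * incx];
}
```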

BLAS2

  • DGEMV: General Matrix-Vector product
  • DTRMV: Triangular Matrix-Vector product (about 2x faster than DGEMV, not wrapped yet)
  • DTRSV: Triangular Matrix-Vector solution via back substitution (just as fast as DTRMV)
  • DGER: Rank-1 update (not wrapped yet)

BLAS3

  • DGEMM: General Matrix-Matrix product
  • DTRMM: Triangular Matrix-General Matrix product
  • DTRSM: Triangular Matrix-General Matrix solution via back substitution (just as fast as DTRMM)
  • DSYMM/DSYMV calls are not appreciably faster than DGEMM calls, and should only be used in expert situations (like using the other half of the matrix for some form of other transformation).

DTRMM/DTRMV calls are about 2x faster than the corresponding DGEMM/DGEMV calls, and should be used where possible.
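The triangular routines are fast because substitution touches only the nonzero triangle once. A minimal sketch of forward substitution, the operation DTRSV/DTRSM perform for a lower-triangular, non-transposed, non-unit-diagonal system (naive and unblocked, unlike a real BLAS; the function name is hypothetical):

```cpp
#include <cassert>

// Solve L*x = b by forward substitution, where L is n x n lower
// triangular, stored row-major with leading dimension ldl.
// x enters holding b and is overwritten with the solution,
// mirroring the in-place convention of DTRSV/DTRSM.
void lower_tri_solve(int n, const double* L, int ldl, double* x) {
    for (int i = 0; i < n; ++i) {
        double s = x[i];
        for (int j = 0; j < i; ++j)
            s -= L[i * ldl + j] * x[j];   // subtract known components
        x[i] = s / L[i * ldl + i];        // divide by the diagonal
    }
}
```

The inner loops run over only i+1 entries per row, roughly half the work of a general matrix-vector product, which is where the ~2x advantage of the triangular routines comes from.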

LAPACK Wrappers

All standard LAPACK 3.2 double precision routines are provided.

  • LAPACK commands remain in FORTRAN's "column-major" indexing, so all results come back transposed, and leading dimensions may need care (using lda = n in both directions for square matrices is highly recommended). An example of the transposition problem is a Cholesky decomposition: you expect to get back a lower triangular matrix L such that L L^T = A, but it is returned in column-major order, so recovering A with the row-major BLAS wrappers effectively involves L^T L = A. One of the biggest consequences is in linear equations: the input/output forcing/solution vectors must be explicitly formed in column-major indexing, i.e., each vector is placed in a C++ row, with its entries along the C++ columns. Viewed in C++, this is the transpose of the forcing/solution matrix.

All routines are declared in qt.h. Each routine is prefixed with C_, followed by the standard FORTRAN name of the routine in capital letters. Input parameters that are single primitives (int, double, unsigned long int, char, ...) are passed by value. Arrays, including multidimensional arrays, must be in contiguous memory (as provided by block_matrix, for example) and are passed as a pointer to the first double or int element of the data (this is array[0] if array is double**). All routines return an int INFO with error and status information specific to the routine; in FORTRAN this is the last argument of every LAPACK call, but it should not be provided as an argument here. char arguments are case-insensitive. A Cholesky decomposition example is shown:

// BLAS/LAPACK
#include <libqt/qt.h>
// block_matrix, init_array
#include <libciomr/libciomr.h>

using namespace psi;
...
int n = 100;

// Allocate A matrix;
double** A = block_matrix(n,n);

// Call the LAPACK DPOTRF to get the Cholesky factor
// Note this works in column-major order
// The result fills like:
//   * * * *
//     * * *
//       * *
//         *
// instead of the expected:
//   *
//   * *
//   * * *
//   * * * *
//
int info = C_DPOTRF('L', n, A[0], n);

// A bit painful, see below
fprintf(outfile, "A:\n");
print_mat(A,n,n,outfile);

// Block matrices must be freed with free_block
free_block(A);
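The transpose effect can be checked with a toy stand-in for DPOTRF('L', ...) (naive, unblocked, no error handling; the function name is hypothetical): the factor lands in the buffer so that a row-major view sees R = L^T, an upper triangle, and A is recovered as R^T R, exactly as described above.

```cpp
#include <cassert>
#include <cmath>

// Toy stand-in for LAPACK DPOTRF('L', ...): Cholesky A = L*L^T on a
// COLUMN-major buffer a, where element (i,j) lives at a[i + j*lda].
// Only the lower triangle (in the column-major sense) is referenced
// and overwritten; the strict upper triangle is left untouched.
void toy_dpotrf_lower_colmajor(int n, double* a, int lda) {
    for (int j = 0; j < n; ++j) {
        double d = a[j + j * lda];
        for (int k = 0; k < j; ++k)
            d -= a[j + k * lda] * a[j + k * lda];
        a[j + j * lda] = std::sqrt(d);          // diagonal L(j,j)
        for (int i = j + 1; i < n; ++i) {
            double s = a[i + j * lda];
            for (int k = 0; k < j; ++k)
                s -= a[i + k * lda] * a[j + k * lda];
            a[i + j * lda] = s / a[j + j * lda]; // subdiagonal L(i,j)
        }
    }
}
```

For the symmetric positive definite A = {{4,2},{2,3}}, the factor is L = {{2,0},{1,sqrt(2)}}; after the call, the buffer read row-major shows the upper-triangular R = L^T = {{2,1},{0,sqrt(2)}} (with stale data in the strict lower triangle), so R^T R reproduces A with the row-major BLAS wrappers.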

Important LAPACK Routines

  • DSYEV: Eigenvalues and, optionally, eigenvectors of a symmetric matrix. Computing eigenvectors can take up to 10x longer than eigenvalues alone.
  • DGEEV: Eigenvalues and, optionally, eigenvectors of a general matrix. Up to 10x slower than DSYEV.
  • DGESV: General solver (uses LU decomposition).
  • DGESVD: General singular value decomposition.
  • DGETRF: LU decomposition.
  • DPOTRF: Cholesky decomposition (much faster and more stable than LU, but requires a symmetric positive definite matrix)
  • DGETRS: Solver, given LU decomposition by DGETRF
  • DPOTRS: Solver, given Cholesky decomposition by DPOTRF
  • DGETRI: Inverse, given LU decomposition by DGETRF (Warning: it's faster and more stable just to solve with DGETRS)
  • DPOTRI: Inverse, given Cholesky decomposition by DPOTRF (Warning: it's faster and more stable just to solve with DPOTRS)

Low-Level BLAS/LAPACK with Jet's Matrix Object

Jet's awesome new Matrix object has a lot of simple BLAS/LAPACK built in, but you can just as easily use the double*** array underneath if you are careful (the outer index is the submatrix for each irrep). Here's an example:

// BLAS/LAPACK
#include <libqt/qt.h>
// Matrix
#include <libmints/matrix.h>

using namespace psi;
...
int n = 100;

// Allocate A Matrix (new C1 convenience constructor);
shared_ptr<Matrix> A(new Matrix("Still A, but way cooler", n,n));
// Get the pointer to the 0 irrep (C1 for now, it errors if you ask for too high of an index)
double** A_pointer = A->get_pointer(0);

// Call the LAPACK DPOTRF to get the Cholesky factor
// Note this works in column-major order
// The result fills like:
//   * * * *
//     * * *
//       * *
//         *
// instead of the expected:
//   *
//   * *
//   * * *
//   * * * *
//
int info = C_DPOTRF('L', n, A_pointer[0], n);

// Wow that's a lot easier
A->print();

// Don't free, it's shared_ptr!

Choosing a BLAS/LAPACK Implementation

There are many different vendor- and architecture-specific BLAS/LAPACK implementations out there. Here are some, in a rough order of preference:

  • MKL: Intel's MKL BLAS3 and LAPACK routines are extremely well threaded for matrix contraction and solution routines, which provides free threading for most of PSI4's heavy operations (eigenvalues/vectors not quite as much). They work on all common platforms and are free for academic use.
  • ATLAS: Very efficient for single-core use and can be compiled with threading, though this is tricky.
  • GotoBLAS: Also threaded, not used often.
  • ACML/IBM/SUN: Vendor-specific, MKL destroys them, even on their design architecture.
  • NETLIB: The default, naive FORTRAN definitions, used only for testing.

The functions DROTG, DROTMG, and DROTM are not provided, as they are deprecated. You shouldn't need them anyway.