## Using Pandas DataFrames as a small database





Pandas DataFrames can serve as small databases. You can build tables with them and then run queries against those tables to compute results. In this example, we will explore a database of molecules and atomic element properties.





### A Table for the chemical elements and their atomic masses





We will use the Atomic Simulation Environment (ASE) library for this because it has data about the chemical elements and some molecules we will use. It is already installed in Deepnote.

First, we make a DataFrame containing the chemical elements and their atomic masses.





In [None]:
import ase.data  # import the submodule explicitly rather than relying on `import ase`
import pandas as pd

# Build the table directly from ASE's element data.
# Both sequences are indexed by atomic number, so they line up row by row.
elements = pd.DataFrame({'symbol': ase.data.chemical_symbols,
                         'atomic mass': ase.data.atomic_masses})
elements.info()



You can use the `query` method to select rows from the database that meet some criteria.





In [None]:
?elements.query



For example, to get carbon we can use a query like this:





In [None]:
elements.query('symbol == "C"')
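For comparison, the same selection can be written as a boolean mask. The sketch below uses a small hand-made table (`elements_demo`, an illustrative stand-in for the `elements` DataFrame, with rounded standard atomic masses) so that it runs without ASE.

```python
import pandas as pd

# Tiny illustrative table, not the full ASE data.
elements_demo = pd.DataFrame({
    'symbol': ['H', 'C', 'N', 'O'],
    'atomic mass': [1.008, 12.011, 14.007, 15.999],
})

by_query = elements_demo.query('symbol == "C"')          # query string
by_mask = elements_demo[elements_demo['symbol'] == 'C']  # boolean mask

# Both approaches select the same single row.
print(by_query.equals(by_mask))
```

The query string is usually shorter to read, while the mask form is plain Python and works with any expression that produces a boolean Series.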



We can refer to Python variables inside a query by prefixing their names with `@`, like this:





In [None]:
sym = 'H'
elements.query('symbol == @sym')




f-strings also work for this, as long as you add the quotes around the value yourself.





In [None]:
sym = 'H'
elements.query(f'symbol == "{sym}"')



The `atomic mass` column is trickier to work with because its name has a space in it. We have to use back-ticks to "quote" the name.





In [None]:
elements.query('`atomic mass` < 4')



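Note that `@` substitutes a variable's value, so it cannot stand in for a column name. An f-string only builds the query text, so a column name inserted that way still needs its own back-ticks. Also note the placeholder element `X` in the first row of the table. ASE puts it at index 0 so that an element's position in the list matches its atomic number. We can ignore it.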
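Back-ticks and `@` can be combined in one query: the back-ticks quote the column name and `@` supplies the value. This is a minimal sketch on a hand-made table; `elements_demo` and `max_mass` are illustrative names, not part of the example above.

```python
import pandas as pd

# Tiny illustrative table, not the full ASE data.
elements_demo = pd.DataFrame({
    'symbol': ['H', 'C', 'N', 'O'],
    'atomic mass': [1.008, 12.011, 14.007, 15.999],
})

max_mass = 13.0  # hypothetical threshold, chosen only for this example

# Back-ticks quote the column name; @ pulls in the Python variable's value.
light = elements_demo.query('`atomic mass` < @max_mass')
print(light['symbol'].tolist())
```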





### A Table for some molecules





Next, let's build the molecule database. This will be a table where each row corresponds to an atom in a molecule. We will be able to get the atoms in a molecule by aggregating these rows on a `Molecule-ID`. We build this table up row by row, by first getting a molecule and then iterating over each atom in it. It is conventional to use an integer for an ID, but we will just use the molecular formula in this example. That has some limitations for larger databases (e.g. isomers have different properties but the same molecular formula), but we will not have that problem here.





In [None]:
from ase.build import molecule

df = pd.DataFrame(columns=['Molecule-ID', 'Atom symbol', 'x', 'y', 'z'])

i = 0  # running row index
for mlc in ['H2O', 'NH3', 'CH4']:
    # molecule() returns an Atoms object; iterating over it yields one atom at a time
    for atom in molecule(mlc):
        df.loc[i] = [mlc, atom.symbol, atom.x, atom.y, atom.z]
        i += 1
df
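An alternative to filling the table one row at a time is to collect the rows in a list first and build the DataFrame in one step. The sketch below assumes a hypothetical dictionary, `molecules_demo`, in place of the ASE molecules; its coordinates are placeholders, not real geometries.

```python
import pandas as pd

# Hypothetical stand-in for the ASE molecules: name -> list of (symbol, x, y, z).
# The coordinates are placeholders, not real geometries.
molecules_demo = {
    'H2O': [('O', 0.0, 0.0, 0.0), ('H', 1.0, 0.0, 0.0), ('H', 0.0, 1.0, 0.0)],
    'NH3': [('N', 0.0, 0.0, 0.0), ('H', 1.0, 0.0, 0.0),
            ('H', 0.0, 1.0, 0.0), ('H', 0.0, 0.0, 1.0)],
}

# One dictionary per atom; the keys become the column names.
rows = [
    {'Molecule-ID': name, 'Atom symbol': sym, 'x': x, 'y': y, 'z': z}
    for name, atoms in molecules_demo.items()
    for sym, x, y, z in atoms
]

df_demo = pd.DataFrame(rows)
print(df_demo.dtypes)
```

Building the frame once lets Pandas infer a numeric dtype for the coordinate columns, and it avoids growing the table on every iteration.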



#### What molecules are in the database?





OK, now we are ready to do some queries. First, let's see what molecules we have in our database. We want the unique values of the `Molecule-ID` column.





In [None]:
df['Molecule-ID'].unique()



#### Which molecules have three H atoms?





Now, how do we find molecules that have 3 H atoms? We need to do some grouping. First, we select the H rows, and then we group by the `Molecule-ID`. Then, we need a count for each group. I prefer the `size` method for this over `count`: `size` returns a Series with one row count per group, while `count` returns a DataFrame with a count of non-null values for every remaining column.





In [None]:
tf = df.query('`Atom symbol` == "H"').groupby(['Molecule-ID']).size()
tf



Finally, we can select the rows that have 3 hydrogen atoms.





In [None]:
tf[tf == 3]
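The difference between `size` and `count` is easiest to see by checking the type of each result. This is a small sketch on a hand-made table (`df_demo` is an illustrative stand-in for the molecule table, without coordinates).

```python
import pandas as pd

# Illustrative stand-in for the molecule table (coordinates omitted).
df_demo = pd.DataFrame({
    'Molecule-ID': ['H2O', 'H2O', 'H2O', 'NH3', 'NH3', 'NH3', 'NH3'],
    'Atom symbol': ['O', 'H', 'H', 'N', 'H', 'H', 'H'],
})

grouped = df_demo.query('`Atom symbol` == "H"').groupby('Molecule-ID')

sizes = grouped.size()    # Series: number of rows in each group
counts = grouped.count()  # DataFrame: non-null count for each remaining column

print(type(sizes).__name__, type(counts).__name__)
print(sizes[sizes == 3].index.tolist())
```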



### Getting the molecular weight





Getting the molecular weight requires us to combine information from two DataFrames. To do this, we need to merge them, aligning the rows on a common key. That key is the `Atom symbol` in the molecule DataFrame, and `symbol` in the elements DataFrame. Then we have to do the right grouping, and use the sum aggregation method on each group.





In [None]:
mf = pd.merge(df, elements, how='inner', left_on='Atom symbol', right_on='symbol')
mf
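Here is a minimal sketch of the same kind of merge on two hand-made tables, so the alignment is easy to follow. The names `atoms_demo` and `elements_demo` are illustrative. The `validate='many_to_one'` argument is an optional extra, not used in the cell above; it asks Pandas to raise an error if a symbol appears more than once in the right-hand table.

```python
import pandas as pd

# One row per atom in water.
atoms_demo = pd.DataFrame({
    'Molecule-ID': ['H2O', 'H2O', 'H2O'],
    'Atom symbol': ['O', 'H', 'H'],
})

# One row per element (rounded standard atomic masses).
elements_demo = pd.DataFrame({
    'symbol': ['H', 'O'],
    'atomic mass': [1.008, 15.999],
})

# Many atoms map to one element row, so each atom picks up its element's mass.
merged = pd.merge(atoms_demo, elements_demo, how='inner',
                  left_on='Atom symbol', right_on='symbol',
                  validate='many_to_one')
print(merged.columns.tolist())
```

Both key columns are kept in the result because they have different names, so `Atom symbol` and `symbol` hold the same values in every row.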



Now, we group by the `Molecule-ID`, select the `atomic mass` column, and aggregate with the sum.





In [None]:
MW = mf.groupby('Molecule-ID')['atomic mass'].sum()
MW
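As a check on the grouping step, the sums can be compared with hand arithmetic. The sketch below starts from a small, already merged table (`merged_demo`, an illustrative stand-in with rounded standard atomic masses).

```python
import pandas as pd

# Illustrative, already merged table: one row per atom with its atomic mass.
merged_demo = pd.DataFrame({
    'Molecule-ID': ['H2O', 'H2O', 'H2O', 'CH4', 'CH4', 'CH4', 'CH4', 'CH4'],
    'Atom symbol': ['O', 'H', 'H', 'C', 'H', 'H', 'H', 'H'],
    'atomic mass': [15.999, 1.008, 1.008, 12.011, 1.008, 1.008, 1.008, 1.008],
})

# Sum the atomic masses within each molecule.
mw_demo = merged_demo.groupby('Molecule-ID')['atomic mass'].sum()

# Hand arithmetic for comparison.
water_by_hand = 15.999 + 2 * 1.008
methane_by_hand = 12.011 + 4 * 1.008
print(abs(mw_demo['H2O'] - water_by_hand) < 1e-9)
print(abs(mw_demo['CH4'] - methane_by_hand) < 1e-9)
```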



Here is one of many ways to print this in a different format:





In [None]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
for mlc, mw in MW.items():
    print(f'The molecular weight of {mlc} is {mw} g/mol')



## Chaining commands in Pandas





So far, we have mostly seen sequential commands in Pandas, like this:





In [None]:
tf = df.query('`Atom symbol` == "H"').groupby(['Molecule-ID']).size()
tf[tf == 3]



Using `count` and a second `query`, we can chain all of this into one line.





In [None]:
df.query('`Atom symbol` == "H"').groupby(['Molecule-ID']).count().query("`Atom symbol` == 3")



It is common to see this syntax where parentheses allow us to separate these into multiple lines. This may enhance readability.





In [None]:
(df
 .query('`Atom symbol` == "H"')
 .groupby(['Molecule-ID'])
 .count()
 .query("`Atom symbol` == 3"))



The main benefit of chaining is that you do not have to define temporary variables that exist only to be reused on the next line. The downside is that a chain is harder to debug, so it is common to build one up iteratively in a notebook, adding one step at a time.
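One way to inspect a chain without breaking it apart is the `pipe` method, which passes the intermediate result through any function you give it. This is a sketch, assuming a hypothetical helper named `peek` and a hand-made table `df_demo`.

```python
import pandas as pd

def peek(obj, label):
    """Hypothetical helper: report the intermediate result, then pass it on unchanged."""
    print(f'{label}: {type(obj).__name__} with shape {obj.shape}')
    return obj

# Illustrative stand-in for the molecule table (coordinates omitted).
df_demo = pd.DataFrame({
    'Molecule-ID': ['H2O', 'H2O', 'H2O', 'NH3', 'NH3', 'NH3', 'NH3'],
    'Atom symbol': ['O', 'H', 'H', 'N', 'H', 'H', 'H'],
})

result = (df_demo
          .query('`Atom symbol` == "H"')
          .pipe(peek, 'after query')   # still a DataFrame here
          .groupby('Molecule-ID')
          .size()
          .pipe(peek, 'after size'))   # now a Series
```

Because `peek` returns its input, the `pipe` lines can be removed once the chain works without changing the result.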

Here is another example of chaining to get the molecular weight of water.





In [None]:
(pd.merge(df, elements, how='inner',
          left_on='Atom symbol', right_on='symbol')
 .groupby('Molecule-ID')['atomic mass']
 .sum()
 ['H2O'])



## Subtle points





Pandas offers many ways to do what appears to be the same thing, but they are not always interchangeable. For example, this works:





In [None]:
(df
 .query('`Atom symbol` == "H"')
 .groupby(['Molecule-ID'])
 .count()
 .query("`Atom symbol` == 3"))



And this doesn't; it raises an `AttributeError`.





In [None]:
(df
 .query('`Atom symbol` == "H"')
 .groupby(['Molecule-ID'])
 .size()
 .query("`Atom symbol` == 3"))



The problem is that `size` returns a Series here, and a Series has no `query` method. We can get back to a DataFrame with some acrobatics.





In [None]:
(df
 .query('`Atom symbol` == "H"')
 .groupby(['Molecule-ID'])
 .size()  # This is a series
 .rename('counts') # we give the Series a name
 .to_frame() # Convert to dataframe so we can query
 .query('counts == 3'))
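Another option, sketched below on a hand-made table, is to stay with the Series and filter it using `.loc` with a function. The function receives the Series at that point in the chain and returns a boolean mask. `df_demo` is an illustrative stand-in for the molecule table.

```python
import pandas as pd

# Illustrative stand-in for the molecule table (coordinates omitted).
df_demo = pd.DataFrame({
    'Molecule-ID': ['H2O', 'H2O', 'H2O', 'NH3', 'NH3', 'NH3', 'NH3'],
    'Atom symbol': ['O', 'H', 'H', 'N', 'H', 'H', 'H'],
})

three_h = (df_demo
           .query('`Atom symbol` == "H"')
           .groupby('Molecule-ID')
           .size()                   # This is a Series
           .loc[lambda s: s == 3])   # keep the groups whose size is 3

print(three_h.index.tolist())
```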



In the beginning, it will be challenging to remember all of this, and figure out how to debug it. With practice, it will get easier!

The Pandas manual ([https://pandas.pydata.org/docs/pandas.pdf](https://pandas.pydata.org/docs/pandas.pdf)) is ~3000 pages long! You cannot learn it all, and most likely you don't need to as it covers a lot of use cases that may fall outside your needs.

It is also challenging that there are many ways to do the same thing. For example, here we solve this problem in a different way that has a subtly different syntax. You cannot just cut and paste bits of code between these two examples without knowing what each one does.





In [None]:
(df
 .query('`Atom symbol` == "H"')
 .groupby(['Molecule-ID'])
 .agg({'Atom symbol': 'size'})  # Now this is a DataFrame
 .query('`Atom symbol` == 3'))
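A variation on this is named aggregation, where `agg` takes keyword arguments of the form `new_name=(column, function)`. The result column then gets a name that needs no back-ticks. This is a sketch on a hand-made table; `df_demo` and the column name `n_H` are illustrative choices.

```python
import pandas as pd

# Illustrative stand-in for the molecule table (coordinates omitted).
df_demo = pd.DataFrame({
    'Molecule-ID': ['H2O', 'H2O', 'H2O', 'NH3', 'NH3', 'NH3', 'NH3'],
    'Atom symbol': ['O', 'H', 'H', 'N', 'H', 'H', 'H'],
})

named = (df_demo
         .query('`Atom symbol` == "H"')
         .groupby('Molecule-ID')
         .agg(n_H=('Atom symbol', 'size'))  # DataFrame with one column, n_H
         .query('n_H == 3'))

print(named.index.tolist())
```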



How do you learn/remember these? One way is reading lots of code, and running lots of code. You can read code in the manual. You can also use the notebook to access documentation on these methods.

Here are some of the commands we used today:





In [None]:
?pd.DataFrame.query



In [None]:
?pd.DataFrame.groupby



Getting help on the `agg` method is a little trickier. There are several `agg` methods, so we want to make sure we get the one that applies to the result of a `groupby` call. First, we get the type of the object that is returned:





In [None]:
type(df
 .query('`Atom symbol` == "H"')
 .groupby(['Molecule-ID']))



Then, we get the help for the `agg` method of that type.





In [None]:
?pd.core.groupby.generic.DataFrameGroupBy.agg



In [None]:
?pd.Series.to_frame



In [None]:
?pd.Series.rename



## Summary

Pandas can do many things. It is a sophisticated interface to arrays, and it can do things that you might otherwise do with Excel or SQL. A critical feature that might make you choose Pandas over one of these is the ability to integrate it into your Python code. This comes at some cost: Pandas is not like Excel or SQL, and it uses its own domain-specific language (DSL) for queries and filters.

