ispyb.sqlalchemy

SQLAlchemy is a Python SQL toolkit and Object Relational Mapper that is used to provide access to the ISPyB database with the full power and flexibility of SQL. For a more general introduction to SQLAlchemy, see the official reference documentation and tutorials.

ORM models for all database tables have been automatically generated using sqlacodegen and are available inside the ispyb.sqlalchemy module.
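Each generated model is a standard SQLAlchemy declarative class, so its table and column names can be inspected directly; a minimal sketch (ordinary SQLAlchemy introspection, not part of the ispyb.sqlalchemy API itself):

from ispyb.sqlalchemy import DataCollection

# Every model is a declarative class mapped to the table of the same name,
# so the usual SQLAlchemy introspection attributes are available
print(DataCollection.__tablename__)
print([column.name for column in DataCollection.__table__.columns][:5])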

Connecting to the database

First obtain the database connection URL by calling ispyb.sqlalchemy.url(), passing either the path to a config file or a Python dictionary containing the database credentials. If credentials=None then the function will look for a credentials file pointed to by the ISPYB_CREDENTIALS environment variable.

Example credentials file:

[ispyb_sqlalchemy]
username = user
password = password
host = localhost
port = 3306
database = ispyb_build

Example credentials dictionary:

{
   "username": "user",
   "password": "password",
   "host": "localhost",
   "port": 3306,
   "database": "ispyb",
}
import ispyb.sqlalchemy

credentials = "/dls_sw/dasc/mariadb/credentials/ispyb.cfg"
url = ispyb.sqlalchemy.url(credentials)
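Equivalently, the credentials dictionary can be passed directly, or the credentials argument can be omitted entirely and the ISPYB_CREDENTIALS environment variable used instead; a sketch (the path below is a placeholder):

import os

import ispyb.sqlalchemy

# passing a credentials dictionary directly
credentials = {
    "username": "user",
    "password": "password",
    "host": "localhost",
    "port": 3306,
    "database": "ispyb",
}
url = ispyb.sqlalchemy.url(credentials)

# or rely on the ISPYB_CREDENTIALS environment variable (placeholder path)
os.environ["ISPYB_CREDENTIALS"] = "/path/to/credentials.cfg"
url = ispyb.sqlalchemy.url()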

We interact with the database via an SQLAlchemy Session, which is used to manage database transactions. Usually we should construct a new Session at the beginning of a logical operation involving database access. We can use the sessionmaker class to provide a factory for Session objects:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# an Engine, which the Session will use for connection
# resources, typically in module scope
engine = create_engine(url, connect_args={"use_pure": True})

# a sessionmaker(), also in the same scope as the engine
Session = sessionmaker(engine)

Making a basic query

First we import the relevant database table models from ispyb.sqlalchemy, in this case, the DataCollection class, which represents the database table of the same name. Next we query the database to ask how many entries there are in this table:

from ispyb.sqlalchemy import DataCollection

# in production code it is typical to use Session as a
# context manager around a logical block of code
with Session() as session:
    query = session.query(DataCollection)
    print(query.count())

# for convenience in this tutorial we won't use it as a
# context manager
session = Session()
query = session.query(DataCollection)

3081314

More interesting is to query the database for the record of a specific data collection. Calling .filter() on the original query returns a new query restricted to the entries that match the conditions provided to the .filter() method:

query = query.filter(DataCollection.dataCollectionId == 6036518)
# .first() returns the first row of the results, or None if there are no results
dc = query.first()
# .one() returns exactly one row, raising an error if there are no results
# or more than one result
dc = query.one()
print(dc)

<ispyb.sqlalchemy._auto_db_schema.DataCollection object at 0x7f2041b05460>

We can use the resulting DataCollection object to print some basic information from the DataCollection table:

print(
    f"""\
dcid: {dc.dataCollectionId}
image template: {dc.imageDirectory}{dc.fileTemplate}
start time: {dc.startTime}\
"""
)

dcid: 6036518
image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5
start time: 2021-03-02 13:38:10

We can easily traverse related objects (as defined in the database by foreign key relationships), e.g.:

dcg = dc.DataCollectionGroup
sample = dcg.BLSample
container = sample.Container
crystal = sample.Crystal
protein = crystal.Protein

print(
    f"""\
experiment type: {dcg.experimentType}
sample name: {sample.name}
container code: {container.code}
container type: {container.containerType}
container capacity: {container.capacity}
space group: {crystal.spaceGroup}
protein name: {protein.name}
protein acronym: {protein.acronym}\
"""
)

experiment type: SAD
sample name: ZincInsulinB4
container code: I04R-012
container type: Puck
container capacity: 16
space group:
protein name: TestInsulin
protein acronym: TestInsulin

An important point to note is that by default, relationships are loaded lazily, i.e. only the data for the DataCollection entry itself was returned by the initial database query, and when we access (e.g.) the dc.DataCollectionGroup attribute of the resulting DataCollection object, another query is made to the database. This can be inefficient, especially if you want to examine rows across several related tables for many data collections.
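One way to see these extra queries for yourself is to create the engine with echo=True, which makes SQLAlchemy log every SQL statement it emits; a minimal sketch:

# echo=True makes SQLAlchemy log each SQL statement, so the additional SELECT
# issued when dc.DataCollectionGroup is first accessed becomes visible
engine = create_engine(url, connect_args={"use_pure": True}, echo=True)
session = sessionmaker(engine)()

dc = (
    session.query(DataCollection)
    .filter(DataCollection.dataCollectionId == 6036518)
    .one()
)
dcg = dc.DataCollectionGroup  # triggers a second SELECT, visible in the log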

SQLAlchemy provides a number of different ways to control how relationships are loaded, including lazy loading (the default for most relationships), joined loading, subquery loading and select IN loading. For more detailed information see the SQLAlchemy documentation on loading relationships. In the example below we use joined eager loading by passing the joinedload option to query.options() (a sketch using select IN loading follows the example):

from sqlalchemy.orm import joinedload
from ispyb.sqlalchemy import DataCollectionGroup, BLSample, Container, Crystal, Protein

query = (
    session.query(DataCollection)
    .options(
        joinedload(DataCollection.DataCollectionGroup)
        .joinedload(DataCollectionGroup.BLSample)
        .options(
            joinedload(BLSample.Container),
            joinedload(BLSample.Crystal).joinedload(Crystal.Protein),
        )
    )
    .filter(DataCollection.dataCollectionId == 6036518)
)

dc = query.first()
dcg = dc.DataCollectionGroup
sample = dcg.BLSample
container = sample.Container
crystal = sample.Crystal
protein = crystal.Protein

print(
    f"""\
dcid: {dc.dataCollectionId}
image template: {dc.imageDirectory}{dc.fileTemplate}
start time: {dc.startTime}
experiment type: {dcg.experimentType}
sample name: {sample.name}
container code: {container.code}
container type: {container.containerType}
container capacity: {container.capacity}
space group: {crystal.spaceGroup}
protein name: {protein.name}
protein acronym: {protein.acronym}\
"""
)

dcid: 6036518
image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5
start time: 2021-03-02 13:38:10
experiment type: SAD
sample name: ZincInsulinB4
container code: I04R-012
container type: Puck
container capacity: 16
space group:
protein name: TestInsulin
protein acronym: TestInsulin
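For comparison, the same relationships could instead be loaded with the select IN strategy via selectinload, which issues one additional SELECT per relationship rather than one large joined query; a sketch using the same relationships as above:

from sqlalchemy.orm import selectinload

# Select IN eager loading: one extra SELECT per relationship instead of a
# single large JOIN (choose whichever strategy profiles better for your query)
query = (
    session.query(DataCollection)
    .options(
        selectinload(DataCollection.DataCollectionGroup)
        .selectinload(DataCollectionGroup.BLSample)
        .options(
            selectinload(BLSample.Container),
            selectinload(BLSample.Crystal).selectinload(Crystal.Protein),
        )
    )
    .filter(DataCollection.dataCollectionId == 6036518)
)
dc = query.one()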

Alternatively we can use an SQL JOIN query. Here we indicate that we’re only interested in the DataCollection and Protein entries, but we must specify how the two tables are linked via intermediates:

query = (
    session.query(DataCollection, Protein)
    .join(DataCollectionGroup)
    .join(BLSample)
    .join(Crystal)
    .join(Protein)
    .filter(DataCollection.dataCollectionId == 6036518)
)
dc, protein = query.one()
print(dc.dataCollectionId, protein.name)

6036518 TestInsulin

Below we query the BLSession and Proposal tables to determine the full visit name. The DataCollection and BLSession tables aren't linked via a foreign key, so it is necessary to specify the join condition explicitly in our query:

from ispyb.sqlalchemy import BLSample, BLSession, Proposal

query = (
    session.query(BLSession, Proposal)
    .join(DataCollection, DataCollection.SESSIONID == BLSession.sessionId)
    .join(Proposal)
    .filter(DataCollection.dataCollectionId == 6036455)
)
bs, p = query.first()
visit = f"visit: {p.proposalCode}{p.proposalNumber}-{bs.visit_number}"
print(visit)

visit: cm28182-1
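With the session id in hand we can restrict any query to data collections belonging to this visit; a small sketch that simply counts them (the result will depend on the database you are querying):

# Count the data collections recorded during this visit
n_collections = (
    session.query(DataCollection)
    .filter(DataCollection.SESSIONID == bs.sessionId)
    .count()
)
print(n_collections)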

Extracting autoprocessing statistics

In this next example, we extract some autoprocessing statistics for a given visit, taking advantage of the fact that SQLAlchemy queries can easily be loaded into a pandas.DataFrame for further analysis. First, we create our query to select the relevant information from the database.

from ispyb.sqlalchemy import (
    AutoProcIntegration,
    AutoProc,
    AutoProcProgram,
    AutoProcScaling,
    AutoProcScalingStatistics,
)

query = (
    session.query(DataCollection, AutoProcProgram, AutoProcScalingStatistics)
    .join(
        AutoProcIntegration,
        AutoProcIntegration.dataCollectionId == DataCollection.dataCollectionId,
    )
    .join(
        AutoProcProgram,
        AutoProcIntegration.autoProcProgramId == AutoProcProgram.autoProcProgramId,
    )
    .join(AutoProc, AutoProcProgram.autoProcProgramId == AutoProc.autoProcProgramId)
    .join(AutoProcScaling, AutoProc.autoProcId == AutoProcScaling.autoProcId)
    .join(
        AutoProcScalingStatistics,
        AutoProcScalingStatistics.autoProcScalingId
        == AutoProcScaling.autoProcScalingId,
    )
    .filter(AutoProcScalingStatistics.scalingStatisticsType == "outerShell")
    .filter(DataCollection.SESSIONID == bs.sessionId)
)
print(query.count())

1221

Since the raw SQL statement that would be run by a query is accessible as query.statement, we can use the pandas.read_sql() function to load the results of this query directly into a pandas.DataFrame:

import pandas as pd

df = pd.read_sql(query.statement, session.bind)
print(df.head())

   dataCollectionId  BLSAMPLEID  SESSIONID experimenttype  \
0           5791190   3224381.0   27446315           None
1           5791190   3224381.0   27446315           None
2           5791190   3224381.0   27446315           None
3           5791190   3224381.0   27446315           None
4           5791220   3224384.0   27446315           None

   dataCollectionNumber            startTime              endTime  \
0                     1  2021-01-14 13:47:30  2021-01-14 13:48:29
1                     1  2021-01-14 13:47:30  2021-01-14 13:48:29
2                     1  2021-01-14 13:47:30  2021-01-14 13:48:29
3                     1  2021-01-14 13:47:30  2021-01-14 13:48:29
4                     1  2021-01-14 13:51:01  2021-01-14 13:51:56

                    runStatus  axisStart  axisEnd  ...  meanIOverSigI  \
0  DataCollection Successful         0.0      0.1  ...       0.330998
1  DataCollection Successful         0.0      0.1  ...       0.588335
2  DataCollection Successful         0.0      0.1  ...       0.600000
3  DataCollection Successful         0.0      0.1  ...       1.400000
4  DataCollection Successful         0.0      0.1  ...       0.043246

   completeness  multiplicity  anomalousCompleteness  anomalousMultiplicity  \
0     99.900400       39.4367                99.7773                20.8294
1    100.000000       38.7516               100.0000                20.7655
2    100.000000       37.6000               100.0000                20.1000
3     69.100000       39.3000                70.3000                21.6000
4      0.053611        1.0000                 0.0000                 1.0000

     recordTimeStamp_1 anomalous    ccHalf  ccAnomalous resIOverSigI2
0  2021-01-14 14:01:36      None  0.436763    -0.213406          None
1  2021-01-14 14:02:49      None  0.284781     0.020924          None
2  2021-01-14 14:09:14      None  0.366000    -0.026000          None
3  2021-01-14 14:09:15      None  0.728000    -0.008000          None
4  2021-01-14 16:17:06      None  0.000000     0.000000          None

[5 rows x 137 columns]

We can see that this query gave us a large number of columns (137 in fact!), most of which we're probably not interested in. There are even some duplicate column names, since several of the tables share column names, e.g. dataCollectionId. This will be problematic if we want to use pandas groupby functionality on dataCollectionId.

print(df["dataCollectionId"].head())

0    5791190
1    5791190
2    5791190
3    5791190
4    5791220
Name: dataCollectionId, dtype: int64
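If we did want to keep overlapping columns from several tables, one option is to give them unambiguous names with .label() when selecting; a small sketch (the "dcId" and "appId" aliases are arbitrary illustrative names):

# Rename clashing columns at query time so that the resulting DataFrame
# column names are unique (aliases here are purely illustrative)
labelled = query.with_entities(
    DataCollection.dataCollectionId.label("dcId"),
    AutoProcProgram.autoProcProgramId.label("appId"),
    AutoProcScalingStatistics.ccHalf,
)
df_labelled = pd.read_sql(labelled.statement, session.bind)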

We can modify our query using the query.with_entities() method to select only those columns we’re interested in:

query = query.with_entities(
    DataCollection.dataCollectionId,
    AutoProcProgram.processingPrograms,
    AutoProcProgram.processingMessage,
    AutoProcProgram.processingStartTime,
    AutoProcProgram.processingEndTime,
    AutoProcScalingStatistics.ccHalf,
    AutoProcScalingStatistics.resolutionLimitHigh,
)
df = pd.read_sql(query.statement, session.bind)
print(df.head())

   dataCollectionId  processingPrograms      processingMessage  \
0           5791190          xia2 dials  processing successful
1           5791190           xia2 3dii  processing successful
2           5791190            autoPROC  processing successful
3           5791190  autoPROC+STARANISO  processing successful
4           5791220           xia2 3dii  processing successful

   processingStartTime    processingEndTime    ccHalf  resolutionLimitHigh
0  2021-01-14 13:48:34  2021-01-14 14:01:36  0.436763              2.05890
1  2021-01-14 13:48:35  2021-01-14 14:02:49  0.284781              2.26003
2  2021-01-14 13:48:41  2021-01-14 14:09:14  0.366000              2.32900
3  2021-01-14 14:09:15  2021-01-14 14:09:15  0.728000              2.29700
4  2021-01-14 13:52:04  2021-01-14 16:17:06  0.000000              1.06354

Next, we compute the runtime for each program, keeping only those programs that completed successfully (note that we could equally have done this filtering as part of the original database query; see the sketch after the plots below), group by dataCollectionId and processingPrograms, and then plot a histogram of the runtimes:

# Calculate the processing time
df["processingDuration"] = df["processingEndTime"] - df["processingStartTime"]

# Select only those programs that successfully completed and group on
# dataCollectionid and processingPrograms
grouped = df[df["processingMessage"] == "processing successful"].groupby(
    ["dataCollectionId", "processingPrograms"]
)

# Calculate the minimum processing time for each group
processing_duration = grouped["processingDuration"].min().dt.total_seconds().unstack()

# Drop those columns that aren't strictly comparable
processing_duration.drop(
    ["autoPROC+STARANISO", "xia2 dials (multi)", "fast_dp", "xia2.multiplex"],
    inplace=True,
    axis=1,
)
print(processing_duration.median())

import math
import matplotlib.pyplot as plt

plt.style.use("ggplot")

# Plot processing duration as a histogram and boxplot
processing_duration.plot(kind="hist", bins=50, subplots=True)
plt.xlabel("Processing duration (seconds)")
plt.show()
processing_duration.plot(kind="box", vert=False)
plt.xlabel("Processing duration (seconds)")
plt.show()

processingPrograms
autoPROC             1876.0
xia2 3dii            1352.5
xia2 3dii (multi)    1472.0
xia2 dials           1316.0
dtype: float64

[figure: histograms of processing duration (seconds) per processing program]

[figure: box plot of processing duration (seconds) per processing program]
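As noted above, the filtering on successful processing jobs could equally have been applied in the original database query rather than in pandas; a minimal sketch:

# Apply the "processing successful" filter at the SQL level instead of in
# pandas, so that only successful jobs are transferred from the database
query = query.filter(AutoProcProgram.processingMessage == "processing successful")
df = pd.read_sql(query.statement, session.bind)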

def comparison_scatter_plot(df, reference, compare):
    ncols = 2
    nrows = int(math.ceil(len(compare) / ncols))
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True)
    for ax, comparison_prg in zip(axes.flatten(), compare):
        df.plot(kind="scatter", x=reference, y=comparison_prg, s=2, ax=ax)
        ax.set_title(comparison_prg)
    for ax in axes.flatten():
        vmin = min(ax.get_xlim()[0], ax.get_ylim()[0])
        vmax = max(ax.get_xlim()[1], ax.get_ylim()[1])
        ax.plot([vmin, vmax], [vmin, vmax], c="black", linestyle="dotted")
        ax.set_aspect("equal")
    return fig, axes


# Scatter plot for xia2 3dii and autoPROC vs xia2 dials
compare = ("xia2 3dii", "autoPROC")
fig, axes = comparison_scatter_plot(
    processing_duration, reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials runtime (seconds)")
    ax.set_ylabel(f"runtime (seconds)")
fig.suptitle(f"Runtime vs. xia2-dials for {visit}")
plt.show()

[figure: runtime of xia2 3dii and autoPROC vs. xia2 dials runtime (seconds)]

# Compare resolution limits
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
    grouped["resolutionLimitHigh"].min().unstack(),
    reference="xia2 dials",
    compare=compare,
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials d_min (Å^-1)")
    ax.set_ylabel(f"d_min (Å^-1)")
fig.suptitle(f"Resolution vs. xia2-dials for {visit}")
plt.show()

[figure: d_min of fast_dp, xia2 3dii, autoPROC and autoPROC+STARANISO vs. xia2 dials d_min]

# Compare outer shell CC1/2
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
    grouped["ccHalf"].min().unstack(), reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials d_min (Å^-1)")
    ax.set_ylabel(f"d_min (Å^-1)")
fig.suptitle(f"Resolution vs. xia2-dials for {visit}")
plt.show()

[figure: outer shell CC1/2 of fast_dp, xia2 3dii, autoPROC and autoPROC+STARANISO vs. xia2 dials]