# Quickstart Guide

In [1]:
import ciffile

## Creating Files

You can create a CIF file from any table-like data structure
(e.g., a `polars.DataFrame`, `pandas.DataFrame`,
dictionary of columns, list of rows, etc.)
that can be converted to a `polars.DataFrame`.
The resulting DataFrame must contain one row
for each unique data item in the CIF file,
with columns specifying:
- **Block code** (i.e., data block name) of the data item.
- **Frame code** (i.e., save frame name within the block) of the data item (optional; for CIF dictionary files).
- **Category** of the data item name (tag).
    For mmCIF files, this corresponds to
    the part before the period in the data name.
    For CIF files, this must be `None` for single data items
    (i.e., not part of a loop/table),
    and a unique value (e.g., "1", "2", ...) for each table,
    shared among all data items in that table.
- **Keyword** of the data item name (tag).
    For mmCIF files, this corresponds to
    the part after the period in the data name.
    For CIF files, this is the data name itself.
- **Values** of the data item as a list.
    For single data items, the list contains a single string.
    For tabular (looped) data items,
    it contains multiple strings,
    corresponding to row values
    for that data item column in the table.

For more information about these terms, refer to the official source: [CIF Version 1.1 Common Semantic Features](https://www.iucr.org/resources/cif/spec/version1.1/semantics#definitions).
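
The row layout described above can be sketched in plain Python. This is only an illustration of how mmCIF data names map onto the `category` and `keyword` columns; the helper names `split_data_name` and `build_rows` are hypothetical and not part of `ciffile`.

```python
def split_data_name(data_name: str) -> tuple[str, str]:
    """Split an mmCIF data name like '_atom_site.id' into (category, keyword)."""
    # Drop the single leading underscore, then split at the first period.
    category, _, keyword = data_name.removeprefix("_").partition(".")
    return category, keyword


def build_rows(block: str, items: dict[str, list[str]]) -> list[dict]:
    """Build one row per data item, using the column names from this guide."""
    rows = []
    for data_name, values in items.items():
        category, keyword = split_data_name(data_name)
        rows.append(
            {"block": block, "category": category, "keyword": keyword, "values": values}
        )
    return rows


rows = build_rows(
    "MyCIFData",
    {
        "_my_table_category.col1": ["1", "10", "100"],  # looped item: several values
        "_my_single_category.key1": ["value1"],         # single item: one value
    },
)
for row in rows:
    print(row)
```

Since the guide states that a list of rows is an accepted input, rows shaped like these could presumably be handed to `ciffile.create()`, but that has not been verified here.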

In [2]:
file_data = {
    "block": "MyCIFData",
    "category": ["my_table_category"] * 3 + ["my_single_category"] * 3,
    "keyword": ["col1", "col2", "col3", "key1", "key2", "key3"],
    "values": [[1, 10, 100], [2, 20, 200], [3, 30, 300], ["value1"], ["value2 with spaces"], ["value3 \n with \n newlines"]],
}
created_file = ciffile.create(file_data)

## Reading Files

You can read a CIF file from content, path, or a file-like object.
The following example downloads the
[PDB Exchange Dictionary (PDBx/mmCIF)](https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/)
from its URL and reads it directly.

In [3]:
from urllib.request import urlopen

with urlopen("https://mmcif.wwpdb.org/dictionaries/ascii/mmcif_pdbx.dic") as response:
    pdbx = ciffile.read(response)

## Writing Files

Once a file has been created or read,
it can be written out as a string in CIF syntax.
The simplest way is to convert the `CIFFile` object to a string (which calls its `__str__()` method), for example by printing it:

In [4]:
print(created_file)

data_MyCIFData
loop_
_my_table_category.col1  _my_table_category.col2  _my_table_category.col3
1                        2                        3                      
10                       20                       30                     
100                      200                      300                    
_my_single_category.key1  value1
_my_single_category.key2  'value2 with spaces'
_my_single_category.key3
;value3 
 with 
 newlines
;



Alternatively,
you can use the `CIFFile`'s `write()` method
for more control over writing options,
or for writing directly (and incrementally) to an output.
The method accepts any callable 
that takes a string and writes it to the desired output.
This could be a file write method or any other string-consuming function.
The following example passes the `print` function for demonstration,
and changes the default styling parameters:

In [5]:
created_file.write(
    writer=lambda s: print(s, end=""),
    list_style="horizontal",
    table_style="tabular-vertical",
    space_items=5,
    min_space_columns=2,
    indent=0,
    indent_inner=3,
    delimiter_preference=("double", "single", "semicolon"),
)

data_MyCIFData
   loop_
      _my_table_category.col1
      _my_table_category.col2
      _my_table_category.col3
      1    2    3  
      10   20   30 
      100  200  300
   _my_single_category.key1  value1
   _my_single_category.key2  "value2 with spaces"
   _my_single_category.key3
;value3 
 with 
 newlines
;
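
The writer contract described above (any callable that accepts a string) can be illustrated without `ciffile`. In this minimal sketch, `emit_chunks` is a hypothetical stand-in for a method that writes incrementally; it only shows which kinds of callables fit the contract.

```python
import io
from typing import Callable


def emit_chunks(writer: Callable[[str], object]) -> None:
    """Hypothetical stand-in for an incremental writer: passes text piece by piece."""
    for chunk in ("data_MyCIFData\n", "_my_single_category.key1  value1\n"):
        writer(chunk)


# An in-memory text buffer: its bound `write` method consumes strings.
buffer = io.StringIO()
emit_chunks(buffer.write)

# A plain list: `append` also consumes strings, keeping the chunks separate.
chunks: list[str] = []
emit_chunks(chunks.append)

text = buffer.getvalue()
print(text, end="")
```

With a real `CIFFile`, the same callables would be passed through the `writer` argument shown in the cell above, for example the `write` method of an open text file.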


## Exploring Files

The `CIFFile` object returned by the `ciffile.create()` and `ciffile.read()` functions
is a data structure with various methods to access and process the data in the file:


In [6]:
pdbx

CIFFile(type='dict', variant='mmcif', blocks=1)

In addition to the data exploration methods described below,
the entire file data is also stored in a `polars.DataFrame`
(with the same format discussed in the Creating Files section above)
and is accessible using the `CIFFile.df` property:

In [7]:
pdbx.df

block,frame,category,keyword,values
str,str,str,str,list[str]
"""mmcif_pdbx.dic""",,"""datablock""","""id""","[""mmcif_pdbx.dic""]"
"""mmcif_pdbx.dic""",,"""datablock""","""description""","[""  This data block holds the Protein Data Bank Exchange Data dictionary.""]"
"""mmcif_pdbx.dic""",,"""dictionary""","""title""","[""mmcif_pdbx.dic""]"
"""mmcif_pdbx.dic""",,"""dictionary""","""datablock_id""","[""mmcif_pdbx.dic""]"
"""mmcif_pdbx.dic""",,"""dictionary""","""version""","[""5.409""]"
…,…,…,…,…
"""mmcif_pdbx.dic""","""diffrn_detector_element.refere…","""item""","""category_id""","[""diffrn_detector_element""]"
"""mmcif_pdbx.dic""","""diffrn_detector_element.refere…","""item""","""mandatory_code""","[""no""]"
"""mmcif_pdbx.dic""","""diffrn_detector_element.refere…","""item_type""","""code""","[""code""]"
"""mmcif_pdbx.dic""","""diffrn_detector_element.refere…","""item_enumeration""","""value""","[""mm"", ""pixels"", ""bins""]"


### Data vs. Dictionary Files

There are two main types of CIF files:
- **Data files** contain information about the subject of a (crystallography-related) study or experiment.
- **Dictionary files** contain information about the data items in data files, as identified by their data names.

Although there is no way to distinguish between dictionary and data files at a purely syntactic level,
save frames may only be used in dictionary files.
Therefore, any CIF file containing at least one save frame
is a dictionary file (note that not all dictionary files contain save frames).
The `CIFFile.type` property tells whether a file is a `data` or `dict` file,
based on whether it contains any save frames:

In [8]:
pdbx.type

'dict'
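
The rule stated above (a file counts as a dictionary as soon as any data item sits inside a save frame) can be written down in a few lines. This is a sketch of the rule only, not the library's implementation; `infer_type` is a hypothetical name, and the rows mimic the `frame` column of the `df` table.

```python
def infer_type(rows: list[dict]) -> str:
    """Return 'dict' if any row belongs to a save frame, otherwise 'data'."""
    return "dict" if any(row.get("frame") is not None for row in rows) else "data"


data_rows = [
    {"frame": None, "category": "my_single_category", "keyword": "key1"},
]
dict_rows = [
    {"frame": None, "category": "dictionary", "keyword": "title"},
    {"frame": "atom_site", "category": "category", "keyword": "id"},
]

print(infer_type(data_rows))
print(infer_type(dict_rows))
```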



Dictionary files usually contain two main types of information:
- General information about the dictionary itself
  (e.g., title, version, change logs, and other identifiers).
  These are stored as data items directly under data blocks
  (i.e., not in any save frames).
- Definitions and attributes of the data items that the dictionary describes.
  These are stored as data items within save frames of each data block.
  Moreover, for mmCIF dictionaries, these definitions can be divided into:
  - Definition of data categories,
    stored in save frames whose frame code is the category code
    (i.e., no period in the frame code).
  - Definition of data keywords within each category,
    stored in save frames whose frame code consists of both category and keyword codes
    (i.e., period in the frame code).

Therefore, it is useful to be able to isolate these parts
and process them separately.
This can be done using the `CIFFile.part()` method;
when called with no arguments,
it returns all different parts of the file as separate `CIFFile` objects:

In [9]:
pdbx_parts = pdbx.part()
pdbx_parts

{'dict_key': CIFFile(type='dict', variant='mmcif', blocks=1),
 'dict': CIFFile(type='dict', variant='mmcif', blocks=1),
 'dict_cat': CIFFile(type='dict', variant='mmcif', blocks=1),
 'data': CIFFile(type='data', variant='mmcif', blocks=1)}
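
The frame-code convention described above for mmCIF dictionaries can be sketched as a small classifier. The labels returned here are hypothetical and chosen for readability; they are not the keys returned by `part()`, and this is not how the library itself performs the split.

```python
from typing import Optional


def classify_frame_code(frame_code: Optional[str]) -> str:
    """Classify a data item by the save frame it belongs to (mmCIF convention)."""
    if frame_code is None:
        # Stored directly under the data block, outside any save frame.
        return "general"
    if "." in frame_code:
        # Frame code holds both category and keyword codes.
        return "keyword_definition"
    # Frame code is a bare category code.
    return "category_definition"


for code in (None, "atom_site", "atom_site.aniso_B[1][1]"):
    print(repr(code), "->", classify_frame_code(code))
```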

### Data Blocks

A CIF file is a collection of data blocks.
The length of the `CIFFile` tells you how many data blocks
are in the file:

In [10]:
len(pdbx)

1

The block codes (data block names) can be accessed via the `CIFFile.block_codes` property:

In [11]:
pdbx.block_codes

['mmcif_pdbx.dic']

A data block can be accessed by its name or index:

In [12]:
assert pdbx[0] is pdbx["mmcif_pdbx.dic"]
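
The same access pattern (length, iteration, and lookup by either position or code) recurs at every level of the hierarchy. A minimal sketch of a container with that behavior is shown below; the class `NamedSequence` is hypothetical and only illustrates the pattern, not how `ciffile` implements it.

```python
class NamedSequence:
    """Ordered container whose items can be fetched by integer index or by code."""

    def __init__(self, items: dict[str, object]):
        self._items = dict(items)        # code -> item, in insertion order
        self._codes = list(self._items)  # position -> code

    def __len__(self) -> int:
        return len(self._codes)

    def __iter__(self):
        return iter(self._items.values())

    def __getitem__(self, key):
        if isinstance(key, int):
            key = self._codes[key]  # translate a position into its code
        return self._items[key]


blocks = NamedSequence({"mmcif_pdbx.dic": object()})
assert blocks[0] is blocks["mmcif_pdbx.dic"]
print(len(blocks))
```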

Iterating over the `CIFFile` yields data blocks:

In [13]:

for block in pdbx:
    print(block.code)

mmcif_pdbx.dic


Each returned data block is a `CIFBlock` object:

In [14]:
pdbx_block = pdbx[0]
pdbx_block

CIFBlock(code='mmcif_pdbx.dic', type='dict', variant='mmcif', categories=14)

Similar to `CIFFile`, the entire data of the block can be accessed from the `df` table:

In [15]:
pdbx_block.df

frame,category,keyword,values
str,str,str,list[str]
,"""datablock""","""id""","[""mmcif_pdbx.dic""]"
,"""datablock""","""description""","[""  This data block holds the Protein Data Bank Exchange Data dictionary.""]"
,"""dictionary""","""title""","[""mmcif_pdbx.dic""]"
,"""dictionary""","""datablock_id""","[""mmcif_pdbx.dic""]"
,"""dictionary""","""version""","[""5.409""]"
…,…,…,…
"""diffrn_detector_element.refere…","""item""","""category_id""","[""diffrn_detector_element""]"
"""diffrn_detector_element.refere…","""item""","""mandatory_code""","[""no""]"
"""diffrn_detector_element.refere…","""item_type""","""code""","[""code""]"
"""diffrn_detector_element.refere…","""item_enumeration""","""value""","[""mm"", ""pixels"", ""bins""]"


`CIFBlock` shares many of its other methods and properties with `CIFFile`, including `write()`, `part()`, and `type`:

In [16]:
pdbx_block.type

'dict'

The block code is stored in the `CIFBlock.code` property:

In [17]:
pdbx_block.code

'mmcif_pdbx.dic'

### Data Categories

A CIF block is a collection of data categories
(and for dictionary files, also save frames).
The length of the `CIFBlock` tells you how many data categories
are directly in the block (excluding save frames):

In [18]:
len(pdbx_block)

14

The category codes can be accessed via the `CIFBlock.category_codes` property:

In [19]:
pdbx_block.category_codes

['datablock',
 'dictionary',
 'dictionary_history',
 'sub_category',
 'category_group_list',
 'item_type_list',
 'item_units_list',
 'item_units_conversion',
 'pdbx_comparison_operator_list',
 'pdbx_conditional_context_list',
 'pdbx_dictionary_component',
 'pdbx_dictionary_component_history',
 'pdbx_item_linked_group',
 'pdbx_item_linked_group_list']

A data category can be accessed by its name or index:

In [20]:
assert pdbx_block[0] is pdbx_block["datablock"]

Iterating over the `CIFBlock` yields categories:

In [21]:

for category in pdbx_block:
    print(category.code)

datablock
dictionary
dictionary_history
sub_category
category_group_list
item_type_list
item_units_list
item_units_conversion
pdbx_comparison_operator_list
pdbx_conditional_context_list
pdbx_dictionary_component
pdbx_dictionary_component_history
pdbx_item_linked_group
pdbx_item_linked_group_list


Each returned category is a `CIFDataCategory` object:

In [22]:
pdbx_cat = pdbx_block[0]
pdbx_cat

CIFDataCategory(name='datablock', shape=(1, 2))

Similar to `CIFFile` and `CIFBlock`, the entire data of the category can be accessed from the `df` table.
Unlike in those classes, however, the DataFrame of a `CIFDataCategory` is transposed,
i.e., each column corresponds to a data item in the category
(with column name being the data keyword),
and each row corresponds to one observation of that data item
(for non-tabular categories, the DataFrame only has one row).
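
The relationship between the two layouts can be sketched in plain Python: the file-level table has one row per data item with a list of values, while the category-level table has one column per keyword. The variable names are hypothetical, and the values are taken from the table created earlier in this guide.

```python
# File/block layout: one row per data item, values stored as a list.
item_rows = [
    {"keyword": "col1", "values": ["1", "10", "100"]},
    {"keyword": "col2", "values": ["2", "20", "200"]},
    {"keyword": "col3", "values": ["3", "30", "300"]},
]

# Category layout: one column per keyword...
columns = {row["keyword"]: row["values"] for row in item_rows}

# ...and one record per observation (i.e., per row of the loop).
records = [dict(zip(columns, values)) for values in zip(*columns.values())]

for record in records:
    print(record)
```

The number of keywords is `len(columns)` and the number of observations is `len(records)`, which corresponds to the two entries of the `shape` shown in the `CIFDataCategory` representation.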

In [23]:
pdbx_cat.df

id,description
str,str
"""mmcif_pdbx.dic""","""  This data block holds th…"


`CIFDataCategory` shares methods and properties with `CIFFile` and `CIFBlock`, such as `write()`:

In [24]:
pdbx_block[0].write(writer=lambda s: print(s, end=""))

_datablock.id           mmcif_pdbx.dic
_datablock.description
;
     This data block holds the Protein Data Bank Exchange Data dictionary.
;


The category code is stored in the `CIFDataCategory.code` property:

In [25]:
pdbx_cat.code

'datablock'

### Data Items

A CIF data category is a collection of data items,
i.e., data name keywords each with one or multiple observed values.
The length of the `CIFDataCategory` tells you how many keywords
(not how many observations) are in the category:

In [26]:
len(pdbx_cat)

2

The keyword codes can be accessed via the `CIFDataCategory.keyword_codes` property:

In [27]:
pdbx_cat.keyword_codes

['id', 'description']

A data item can be accessed by its name or index:

In [28]:
assert pdbx_cat[0].equals(pdbx_cat["id"])

Iterating over the `CIFDataCategory` yields data items:

In [29]:

for item in pdbx_cat:
    print(item.shape)

(1,)
(1,)


Each returned data item is a `polars.Series` object:

In [30]:
pdbx_cat[0]

id
str
"""mmcif_pdbx.dic"""


### Save Frames

In dictionary files,
data blocks can also contain save frames.
These can be accessed via the `CIFBlock.frames` property:

In [31]:
pdbx_block.frames

CIFBlockFrames(variant='mmcif', frames=7247)

The length of the `CIFBlockFrames` tells you
how many save frames are in the block:

In [32]:
len(pdbx_block.frames)

7247

The frame codes can be accessed via the `CIFBlockFrames.codes` property:

In [33]:
pdbx_block.frames.codes

['atom_site',
 'atom_site.aniso_B[1][1]',
 'atom_site.aniso_B[1][1]_esd',
 'atom_site.aniso_B[1][2]',
 'atom_site.aniso_B[1][2]_esd',
 'atom_site.aniso_B[1][3]',
 'atom_site.aniso_B[1][3]_esd',
 'atom_site.aniso_B[2][2]',
 'atom_site.aniso_B[2][2]_esd',
 'atom_site.aniso_B[2][3]',
 'atom_site.aniso_B[2][3]_esd',
 'atom_site.aniso_B[3][3]',
 'atom_site.aniso_B[3][3]_esd',
 'atom_site.aniso_ratio',
 'atom_site.aniso_U[1][1]',
 'atom_site.aniso_U[1][1]_esd',
 'atom_site.aniso_U[1][2]',
 'atom_site.aniso_U[1][2]_esd',
 'atom_site.aniso_U[1][3]',
 'atom_site.aniso_U[1][3]_esd',
 'atom_site.aniso_U[2][2]',
 'atom_site.aniso_U[2][2]_esd',
 'atom_site.aniso_U[2][3]',
 'atom_site.aniso_U[2][3]_esd',
 'atom_site.aniso_U[3][3]',
 'atom_site.aniso_U[3][3]_esd',
 'atom_site.attached_hydrogens',
 'atom_site.auth_asym_id',
 'atom_site.auth_atom_id',
 'atom_site.auth_comp_id',
 'atom_site.auth_seq_id',
 'atom_site.B_equiv_geom_mean',
 'atom_site.B_equiv_geom_mean_esd',
 'atom_site.B_iso_or_equiv',
 …

A save frame can be accessed by its name or index:

In [34]:
assert pdbx_block.frames[0] is pdbx_block.frames["atom_site"]

Iterating over the `CIFBlockFrames` yields save frames:

In [35]:

for frame in pdbx_block.frames:
    print(frame.code)

atom_site
atom_site.aniso_B[1][1]
atom_site.aniso_B[1][1]_esd
atom_site.aniso_B[1][2]
atom_site.aniso_B[1][2]_esd
atom_site.aniso_B[1][3]
atom_site.aniso_B[1][3]_esd
atom_site.aniso_B[2][2]
atom_site.aniso_B[2][2]_esd
atom_site.aniso_B[2][3]
atom_site.aniso_B[2][3]_esd
atom_site.aniso_B[3][3]
atom_site.aniso_B[3][3]_esd
atom_site.aniso_ratio
atom_site.aniso_U[1][1]
atom_site.aniso_U[1][1]_esd
atom_site.aniso_U[1][2]
atom_site.aniso_U[1][2]_esd
atom_site.aniso_U[1][3]
atom_site.aniso_U[1][3]_esd
atom_site.aniso_U[2][2]
atom_site.aniso_U[2][2]_esd
atom_site.aniso_U[2][3]
atom_site.aniso_U[2][3]_esd
atom_site.aniso_U[3][3]
atom_site.aniso_U[3][3]_esd
atom_site.attached_hydrogens
atom_site.auth_asym_id
atom_site.auth_atom_id
atom_site.auth_comp_id
atom_site.auth_seq_id
atom_site.B_equiv_geom_mean
atom_site.B_equiv_geom_mean_esd
atom_site.B_iso_or_equiv
atom_site.B_iso_or_equiv_esd
atom_site.calc_attached_atom
atom_site.calc_flag
atom_site.Cartn_x
atom_site.Cartn_x_esd
atom_site.Cartn_y
…

Each returned save frame is a `CIFFrame` object:

In [36]:
pdbx_frame = pdbx_block.frames[0]
pdbx_frame

CIFFrame(code='atom_site', variant='mmcif', categories=4)

Similar to `CIFBlock`, the entire data of the frame can be accessed from the `df` table:

In [37]:
pdbx_frame.df

category,keyword,values
str,str,list[str]
"""category""","""description""","["" Data items in the ATOM_SITE category record details about  the atom sites in a macromolecular crystal structure, such as  the positional coordinates, atomic displacement parameters,  magnetic moments and directions.  The data items for describing anisotropic atomic  displacement factors are only used if the corresponding items  are not given in the ATOM_SITE_ANISOTROP category.  wwPDB recommends wwPDB-assigned residue number, residue ID,  and chain ID, _atom_site.auth_seq_id _atom_site.auth_comp_id, and  _atom_site.auth_asym_id, respectively, to be used for publication  materials.""]"
"""category""","""id""","[""atom_site""]"
"""category""","""mandatory_code""","[""no""]"
"""category_key""","""name""","[""_atom_site.id""]"
"""category_group""","""id""","[""inclusive_group"", ""atom_group""]"
"""category_examples""","""detail""","[""  Example 1 - based on PDB entry 5HVP and laboratory records for the  structure corresponding to PDB entry 5HVP.""]"
"""category_examples""","""case""","[""  loop_  _atom_site.group_PDB  _atom_site.type_symbol  _atom_site.label_atom_id  _atom_site.label_comp_id  _atom_site.label_asym_id  _atom_site.label_seq_id  _atom_site.label_alt_id  _atom_site.Cartn_x  _atom_site.Cartn_y  _atom_site.Cartn_z  _atom_site.occupancy  _atom_site.B_iso_or_equiv  _atom_site.footnote_id  _atom_site.auth_seq_id  _atom_site.id  ATOM N N VAL A 11 . 25.369 30.691 11.795 1.00 17.93 . 11 1  ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 11 2  ATOM C C VAL A 11 . 25.569 32.010 13.808 1.00 17.83 . 11 3  ATOM O O VAL A 11 . 24.735 31.190 14.167 1.00 17.53 . 11 4  ATOM C CB VAL A 11 . 25.379 33.146 11.540 1.00 17.66 . 11 5  ATOM C CG1 VAL A 11 . 25.584 33.034 10.030 1.00 18.86 . 11 6  ATOM C CG2 VAL A 11 . 23.933 33.309 11.872 1.00 17.12 . 11 7  ATOM N N THR A 12 . 26.095 32.930 14.590 1.00 18.97 4 12 8  ATOM C CA THR A 12 . 25.734 32.995 16.032 1.00 19.80 4 12 9  ATOM C C THR A 12 . 24.695 34.106 16.113 1.00 20.92 4 12 10  ATOM O O THR A 12 . 24.869 35.118 15.421 1.00 21.84 4 12 11  ATOM C CB THR A 12 . 26.911 33.346 17.018 1.00 20.51 4 12 12  ATOM O OG1 THR A 12 3 27.946 33.921 16.183 0.50 20.29 4 12 13  ATOM O OG1 THR A 12 4 27.769 32.142 17.103 0.50 20.59 4 12 14  ATOM C CG2 THR A 12 3 27.418 32.181 17.878 0.50 20.47 4 12 15  ATOM C CG2 THR A 12 4 26.489 33.778 18.426 0.50 20.00 4 12 16  ATOM N N ILE A 13 . 23.664 33.855 16.884 1.00 22.08 . 13 17  ATOM C CA ILE A 13 . 22.623 34.850 17.093 1.00 23.44 . 13 18  ATOM C C ILE A 13 . 22.657 35.113 18.610 1.00 25.77 . 13 19  ATOM O O ILE A 13 . 23.123 34.250 19.406 1.00 26.28 . 13 20  ATOM C CB ILE A 13 . 21.236 34.463 16.492 1.00 22.67 . 13 21  ATOM C CG1 ILE A 13 . 20.478 33.469 17.371 1.00 22.14 . 13 22  ATOM C CG2 ILE A 13 . 21.357 33.986 15.016 1.00 21.75 . 13 23  # - - - - data truncated for brevity - - - -  HETATM C C1 APS C . 1 4.171 29.012 7.116 0.58 17.27 1 300 101  HETATM C C2 APS C . 
1 4.949 27.758 6.793 0.58 16.95 1 300 102  HETATM O O3 APS C . 1 4.800 26.678 7.393 0.58 16.85 1 300 103  HETATM N N4 APS C . 1 5.930 27.841 5.869 0.58 16.43 1 300 104  # - - - - data truncated for brevity - - - -""]"


The frame code is stored in the `CIFFrame.code` property:

In [38]:
pdbx_frame.code

'atom_site'

`CIFFrame` is very similar to `CIFBlock`,
and provides the same methods and properties
to access its categories:

In [39]:
len(pdbx_frame)

4

In [40]:
pdbx_frame.category_codes

['category', 'category_key', 'category_group', 'category_examples']

In [41]:
assert pdbx_frame[0] is pdbx_frame["category"]

In [42]:

for category in pdbx_frame:
    print(category.code)

category
category_key
category_group
category_examples


In [43]:
pdbx_frame[0]

CIFDataCategory(name='category', shape=(1, 3))

### Multi-Block/Frame Category Tables

Sometimes it is useful to have a multi-block/frame view of a certain data category,
i.e., to access a category within all data blocks and/or save frames in a file.
This can be done using the `category()` method of the `CIFFile` and `CIFBlock`:

In [44]:
pdbx_multicat = pdbx_block.category("item")
pdbx_multicat

CIFDataCategory(name='item', shape=(7070, 4))
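
Conceptually, such a view stacks the rows of one category from every save frame and records where each row came from. The sketch below shows the idea in plain Python; `collect_category` is a hypothetical helper, the nested dictionary is a simplified stand-in for the real objects, and the few values are copied from the PDBx dictionary output in this guide.

```python
# Simplified stand-in: frame code -> category code -> keyword -> list of values.
frames = {
    "atom_site": {"category": {"id": ["atom_site"]}},
    "atom_site.aniso_B[1][1]": {
        "item": {"name": ["_atom_site.aniso_B[1][1]"], "mandatory_code": ["no"]}
    },
    "diffrn_detector_element.id": {
        "item": {"name": ["_diffrn_detector_element.id"], "mandatory_code": ["yes"]}
    },
}


def collect_category(frames: dict, category: str, id_col: str = "_frame") -> list[dict]:
    """Stack one category across all frames, tagging each row with its frame code."""
    table = []
    for frame_code, categories in frames.items():
        columns = categories.get(category)
        if columns is None:
            continue  # this frame does not contain the requested category
        for values in zip(*columns.values()):
            table.append({id_col: frame_code, **dict(zip(columns, values))})
    return table


item_table = collect_category(frames, "item")
for row in item_table:
    print(row)
```

The `category()` call above does the equivalent for the real file.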

The output is still a `CIFDataCategory`, but with additional identifier columns (`_block` and `_frame` by default) in the table,
specifying which data block and/or save frame each row comes from
(in this example, which was called on a single block, only `_frame` appears):

In [45]:
pdbx_multicat.df

_frame,name,category_id,mandatory_code
str,str,str,str
"""atom_site.aniso_B[1][1]""","""_atom_site.aniso_B[1][1]""","""atom_site""","""no"""
"""atom_site.aniso_B[1][1]_esd""","""_atom_site.aniso_B[1][1]_esd""","""atom_site""","""no"""
"""atom_site.aniso_B[1][2]""","""_atom_site.aniso_B[1][2]""","""atom_site""","""no"""
"""atom_site.aniso_B[1][2]_esd""","""_atom_site.aniso_B[1][2]_esd""","""atom_site""","""no"""
"""atom_site.aniso_B[1][3]""","""_atom_site.aniso_B[1][3]""","""atom_site""","""no"""
…,…,…,…
"""diffrn_detector_element.id""","""_diffrn_detector_element.id""","""diffrn_detector_element""","""yes"""
"""diffrn_detector_element.detect…","""_diffrn_detector_element.detec…","""diffrn_detector_element""","""yes"""
"""diffrn_detector_element.refere…","""_diffrn_detector_element.refer…","""diffrn_detector_element""","""no"""
"""diffrn_detector_element.refere…","""_diffrn_detector_element.refer…","""diffrn_detector_element""","""no"""
