# Quickstart Guide

In [None]:
import ciffile

## Creating Files

You can create a CIF file from any table-like data structure
(e.g., a `polars.DataFrame`, `pandas.DataFrame`,
dictionary of columns, list of rows, etc.)
that can be converted to a `polars.DataFrame`.
The resulting DataFrame must contain one row
for each unique data item in the CIF file,
with columns specifying:
- **Block code** (i.e., data block name) of the data item.
- **Frame code** (i.e., save frame name within the block) of the data item (optional; for CIF dictionary files).
- **Category** of the data item name (tag).
    For mmCIF files, this corresponds to
    the part before the period in the data name.
    For CIF files, this must be `None` for single data items
    (i.e., not part of a loop/table),
    and a unique value (e.g., "1", "2", ...) for each table,
    shared among all data items in that table.
- **Keyword** of the data item name (tag).
    For mmCIF files, this corresponds to
    the part after the period in the data name.
    For CIF files, this is the data name itself.
- **Values** of the data item as a list.
    For single data items, the list contains a single string.
    For tabular (looped) data items,
    it contains multiple strings,
    corresponding to row values
    for that data item column in the table.

For more information about these terms, refer to the official source: [CIF Version 1.1 Common Semantic Features](https://www.iucr.org/resources/cif/spec/version1.1/semantics#definitions)
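
As a rough illustration of how the category and keyword columns described above relate to a data name, the following standard-library sketch splits a name at its first period. The helper `split_data_name` is hypothetical and not part of `ciffile`; stripping the leading underscore is an assumption.

```python
def split_data_name(name, mmcif=True):
    """Split a data name (tag) into a (category, keyword) pair.

    Hypothetical helper for illustration only; not part of ciffile.
    """
    # Assumption: the leading underscore is not part of category/keyword.
    body = name[1:] if name.startswith("_") else name
    if mmcif and "." in body:
        # mmCIF: category before the first period, keyword after it.
        category, keyword = body.split(".", 1)
        return category, keyword
    # Plain CIF, single data item: no category; keyword is the name itself.
    # (Looped items would instead share a per-table value such as "1".)
    return None, body


# One record per data item, mirroring the columns listed above.
names_and_values = {
    "_my_single_category.key1": ["value1"],
    "_my_table_category.col1": ["1", "10", "100"],
}
rows = []
for name, values in names_and_values.items():
    category, keyword = split_data_name(name)
    rows.append(
        {"block": "MyCIFData", "category": category, "keyword": keyword, "values": values}
    )
```

Each resulting record corresponds to one row of the table-like input, with looped items carrying several values and single items carrying one.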

In [None]:
sample_file_data = {
    "block": "MyCIFData",
    "category": ["my_table_category"] * 3 + ["my_single_category"] * 3,
    "keyword": ["col1", "col2", "col3", "key1", "key2", "key3"],
    "values": [[1, 10, 100], [2, 20, 200], [3, 30, 300], ["value1"], ["value2 with spaces"], ["value3 \n with \n newlines"]],
}
sample = ciffile.create(sample_file_data)

## Reading Files

You can read a CIF file from content, path, or a file-like object.
The following example downloads the
[PDB Exchange Dictionary (PDBx/mmCIF)](https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/)
from its URL and reads it directly.

In [None]:
from urllib.request import urlopen

with urlopen("https://mmcif.wwpdb.org/dictionaries/ascii/mmcif_pdbx.dic") as response:
    pdbx = ciffile.read(response)

## Writing Files

Once you have created/read the file,
it can be readily written as a string in CIF syntax.
One simple way is to invoke the `CIFFile` object's `__str__()` method; for example:

In [None]:
print(sample)

Alternatively,
you can use the `CIFFile`'s `write()` method
for more control over writing options,
or for writing directly (and incrementally) to an output.
The method accepts any callable
that takes a string and writes it to the desired output.
This could be a file write method or any other string-consuming function.
The following example passes the `print` function for demonstration,
and changes the default styling parameters:

In [None]:
sample.write(
    writer=lambda s: print(s, end=""),
    list_style="horizontal",
    table_style="tabular-vertical",
    space_items=5,
    min_space_columns=2,
    indent=0,
    indent_inner=3,
    delimiter_preference=("double", "single", "semicolon"),
)
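
Because the writer is just a callable that receives strings, common standard-library objects can serve as writers. The sketch below uses a hypothetical stand-in function, `emit_cif_chunks`, in place of the real `write()` method (whose chunking behavior is not specified here), and passes it `io.StringIO.write` and `list.append`:

```python
import io


def emit_cif_chunks(writer):
    """Hypothetical stand-in for a method that writes incrementally.

    It calls `writer` once per chunk, the way a string-consuming
    callable would be used; the chunks themselves are made up.
    """
    for chunk in ("data_MyCIFData\n", "_my_single_category.key1  value1\n"):
        writer(chunk)


# Writer 1: the bound `write` method of an in-memory text buffer.
buffer = io.StringIO()
emit_cif_chunks(buffer.write)
text_from_buffer = buffer.getvalue()

# Writer 2: `list.append`, which collects the chunks for later joining.
chunks = []
emit_cif_chunks(chunks.append)
text_from_chunks = "".join(chunks)
```

The same pattern should apply to the `write` method of a file opened in text mode.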

## Validating Files



In [None]:
validator = ciffile.dictionary(pdbx)

In [None]:
validator.dict_title

In [None]:
validator.dict_description

In [None]:
validator.dict_version

In [None]:
validator.sub_category

In [None]:
validator.category_group_description

In [None]:
validator.category_group_parent

In [None]:
with urlopen("https://files.rcsb.org/view/3W32.cif") as response:
    pdbentry = ciffile.read(response)

## Exploring Files

`CIFFile` provides a robust data structure with various methods to access and process the data in the file.
For each hierarchical level in the CIF file, there is a corresponding object:

- `CIFFile`: The object returned by the `ciffile.create()` and `ciffile.read()` functions;
  it corresponds to the entire CIF file,
  and is a container of `CIFBlock` objects.
- `CIFBlock`: The top-level grouping in a CIF file, corresponding to a data block,
  which is a container of `CIFDataCategory` objects.
  In case of CIF dictionary files,
  it contains a `CIFFrames` object as well.
- `CIFFrames`: A container for all `CIFFrame` objects
  within a data block in CIF dictionary files.
- `CIFFrame`: Corresponds to a save frame within a data block,
  containing `CIFDataCategory` objects.
- `CIFDataCategory`: Corresponds to a data category, containing `CIFDataItem` objects.
- `CIFDataItem`: The last level in a CIF file, corresponding to a data item containing data values.
  
All data structures provide the following methods and properties:
- `code`: Block/frame code or data name category/keyword of the container.
- `codes`: Block/frame codes, data name categories/keywords, or data value indices of the container's elements.
- `container_type`: Type (level) identifier of the container.
- `get()`: Get an element by its code/index, and return an empty element if not found.
- `__iter__()`: Iterates over elements within the container.
- `__getitem__()`: Gets elements by their code/index.
- `__contains__()`: Checks whether a code/index exists for a container element.
- `__len__()`: Number of elements in the container.
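
To make the shared container protocol listed above concrete, here is a minimal sketch using a hypothetical `MiniContainer` class. It is not the library's implementation; it only mimics the listed members, assuming that `[]` raises on a missing code while `get()` returns an empty element.

```python
class MiniContainer:
    """Hypothetical stand-in mimicking the container protocol above."""

    def __init__(self, code, elements):
        self.code = code
        self._elements = dict(elements)  # code -> element, insertion-ordered

    @property
    def codes(self):
        return list(self._elements)

    def __getitem__(self, key):
        if isinstance(key, int):
            key = self.codes[key]  # IndexError if the index is out of range
        return self._elements[key]  # KeyError if the code is unknown

    def get(self, key):
        # Return an empty container instead of raising.
        try:
            return self[key]
        except (KeyError, IndexError):
            return MiniContainer(code=None, elements={})

    def __contains__(self, key):
        return key in self._elements

    def __iter__(self):
        return iter(self._elements.values())

    def __len__(self):
        return len(self._elements)


block = MiniContainer(
    "MyCIFData",
    {"my_table_category": "table", "my_single_category": "single"},
)
same_element = block[0] is block["my_table_category"]
empty = block.get("missing")
```

Returning an empty element from `get()` lets lookups be chained without checking for a missing code at every level.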

Other than `CIFDataItem` (which is the terminal data structure),
all other data structures also have:

- `df`: A `polars.DataFrame` representation of the CIF data structure containing all available data.
  For all data structures other than `CIFDataCategory`,
  it has the same format discussed above in the Creating Files section.
  For `CIFDataCategory` the DataFrame is transposed,
  i.e., each column corresponds to a data item in the category
  (with column name being the data keyword),
  and each row corresponds to one observation of that data item
  (for non-tabular categories, the DataFrame only has one row).
- `to_id_dict()`: Creates a dictionary representation of the CIF data structure.
- `write()`: Writes the data structure in CIF format.
- `__str__()`: Convenient method using `write()` to generate a string representation for the data structure in CIF format.
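
The difference between the one-row-per-data-item layout and the transposed `CIFDataCategory` layout can be sketched in plain Python. The helper `transpose_items` is hypothetical and uses lists of dictionaries rather than a `polars.DataFrame`:

```python
def transpose_items(items):
    """Turn {keyword: [values...]} into a list of row dictionaries.

    Hypothetical helper: each output row is one observation, and each
    key is a data keyword, as in the transposed category table.
    """
    keywords = list(items)
    return [dict(zip(keywords, row)) for row in zip(*items.values())]


# One entry per data item (the layout used when creating files).
looped_items = {
    "col1": ["1", "10", "100"],
    "col2": ["2", "20", "200"],
    "col3": ["3", "30", "300"],
}
table = transpose_items(looped_items)

# A non-tabular category has single-value items, so it yields one row.
single_items = {"key1": ["value1"], "key2": ["value2 with spaces"]}
single_row = transpose_items(single_items)
```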

Other than `CIFDataItem` and `CIFDataCategory`,
the remaining data structures also provide:
- `type`: Type of the CIF file (either `"data"` or `"dict"`).
  There are two main types of CIF files:
  - **Data files** contain information about the subject of a (crystallography related) study or experiment.
  - **Dictionary files** contain information about the data items in data files, as identified by their data names.
  
  Although there is no way to distinguish between dictionary and data files at a purely syntactic level,
  save frames may only be used in dictionary files.
  Therefore, any CIF file containing at least one save frame
  is a dictionary file (note that not all dictionary files contain save frames).
  This property tells whether a container is a `data` or `dict` container,
  based on whether it is or contains any save frames.
- `category()`: Extracts and combines data category tables from all data blocks/save frames within the container.
  This is useful for obtaining a multi-block/frame view of a certain data category,
  i.e., to access a category within all data blocks and/or save frames in a file.
  The output is still a `CIFDataCategory`,
  but with additional identifier columns (`_block` and `_frame` by default) in the table,
  specifying which data block and/or save frame each row is coming from.
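
The merging performed by `category()` can be pictured with the following standard-library sketch. The function `merge_category` and the sample codes and rows are hypothetical; only the default identifier column names `_block` and `_frame` come from the description above.

```python
def merge_category(tables, block_col="_block", frame_col="_frame"):
    """Combine per-frame rows of one category into a single table.

    `tables` maps (block code, frame code) to that frame's rows.
    Hypothetical helper; not the library's implementation.
    """
    merged = []
    for (block_code, frame_code), rows in tables.items():
        for row in rows:
            # Prefix every row with where it came from.
            merged.append({block_col: block_code, frame_col: frame_code, **row})
    return merged


# Made-up rows of one category found in two save frames of one block.
tables = {
    ("my_dict", "frame_a"): [{"name": "a", "mandatory_code": "yes"}],
    ("my_dict", "frame_b"): [
        {"name": "b1", "mandatory_code": "no"},
        {"name": "b2", "mandatory_code": "no"},
    ],
}
merged = merge_category(tables)
```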
  
Other than `CIFDataItem`, `CIFDataCategory`, and `CIFFrame`,
the rest of the data structures also provide:
- `part()`: Isolates data/dictionary parts of the container.
  Dictionary files usually contain two main types of information:
  - General information, such as that about the dictionary itself
    (e.g., title, version, change logs, and other identifiers).
    These are stored as data items directly under data blocks
    (i.e., not in any save frames).
  - Definition and attributes of data items that the dictionary describes.
    These are stored as data items within save frames of each data block.
    Moreover, for mmCIF dictionaries, these definitions can be divided into:
    - Definition of data categories,
      stored in save frames whose frame code is the category code
      (i.e., no period in the frame code).
    - Definition of data keywords within each category,
      stored in save frames whose frame code consists of both category and keyword codes
      (i.e., period in the frame code).
  
  Therefore, it is useful to be able to isolate these parts
  and process them separately.
  This can be done using the `part()` method;
  when called with no arguments,
  it returns all different parts of the file as separate objects.
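
The period rule that separates category definitions from keyword definitions can be sketched as follows. The function `classify_frame_code` is hypothetical, the frame codes are illustrative, and the labels `"dict_cat"` and `"dict_key"` are borrowed from the part names used later in this guide.

```python
def classify_frame_code(frame_code):
    """Label a save frame by whether its code contains a period.

    Hypothetical helper: no period means a category definition,
    a period means a keyword (data item) definition.
    """
    return "dict_key" if "." in frame_code else "dict_cat"


# Illustrative frame codes; real dictionaries may format them differently.
frame_codes = ["atom_site", "_atom_site.id", "_atom_site.type_symbol", "cell"]

parts = {"dict_cat": [], "dict_key": []}
for code in frame_codes:
    parts[classify_frame_code(code)].append(code)
```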

### `CIFFile`

The `ciffile.create()` and `ciffile.read()` functions
return a `CIFFile`:

In [None]:
pdbx, sample

In [None]:
pdbx.container_type, sample.container_type

Files have no code:

In [None]:
pdbx.code is None, sample.code is None

The entire file data is stored as a `polars.DataFrame` in `CIFFile.df`:

In [None]:
pdbx.df

The `type` tells whether the file contains any save frames within its data blocks:

In [None]:
pdbx.type, sample.type
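
The save-frame rule behind `type` can be approximated on raw CIF text with a regular expression. This is a naive sketch, not how the library determines the type: the function `guess_container_type` is hypothetical, and it would be fooled by a line starting with `save_...` inside a multi-line text field.

```python
import re

# A save frame header is "save_" followed by a frame code;
# a bare "save_" only terminates a frame and is not matched.
SAVE_FRAME_HEADER = re.compile(r"^[ \t]*save_\S+", re.IGNORECASE | re.MULTILINE)


def guess_container_type(cif_text):
    """Return "dict" if the text has a save frame header, else "data"."""
    return "dict" if SAVE_FRAME_HEADER.search(cif_text) else "data"


dictionary_text = "data_my_dict\nsave_atom_site\n_category.id atom_site\nsave_\n"
data_text = "data_MyCIFData\n_my_single_category.key1 value1\n"

dictionary_type = guess_container_type(dictionary_text)
data_type = guess_container_type(data_text)
```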

The data and dictionary parts can be isolated:

In [None]:
pdbx.part()

A dictionary view of the file can be generated:

In [None]:
sample.to_id_dict(["block", "category", "keyword"])

A `CIFFile` is a container of data blocks.
The length of the `CIFFile` tells you how many data blocks
are in the file:

In [None]:
len(pdbx), len(sample)

The block codes (data block names) can be accessed via the `codes` property:

In [None]:
pdbx.codes, sample.codes

It can be checked whether a block code exists in the file:

In [None]:
"mmcif_pdbx.dic" in pdbx, "non_existent_code" in sample

A data block can be accessed by its name or index:

In [None]:
pdbx[0] is pdbx["mmcif_pdbx.dic"]

Accessing a non-existent code with `[]` raises a `KeyError`, whereas the `get()` method returns an empty `CIFBlock` instead:

In [None]:
try:
    pdbx["non_existent_code"]
except KeyError as e:
    print(f"Caught expected exception: {e}")

In [None]:
pdbx.get("non_existent_code")

Iterating over the `CIFFile` yields data blocks:

In [None]:

for block in pdbx:
    print(block.container_type)

Categories can be merged across data blocks/save frames as well:

In [None]:
pdbx_multicat = pdbx.category("item")
pdbx_multicat

In [None]:
pdbx_multicat.df

### `CIFBlock`

`CIFFile` elements are `CIFBlock` objects, corresponding to data blocks within the file:

In [None]:
pdbx_block = pdbx[0]
pdbx_block

In [None]:
pdbx_block.container_type

The block code is stored in the `CIFBlock.code` property:

In [None]:
pdbx_block.code

The entire data of the block can be accessed from the `df` table:

In [None]:
pdbx_block.df

The `type` tells whether the data block contains any save frames:

In [None]:
pdbx_block.type

The data and dictionary parts can be isolated:

In [None]:
pdbx_block.part()

A dictionary view of the block can be generated:

In [None]:
pdbx_block.to_id_dict(["frame", "category", "keyword"])

A `CIFBlock` is a container of data categories.
The length of the `CIFBlock` tells you how many data categories
are directly in the block (excluding save frames):

In [None]:
len(pdbx_block)

The category codes can be accessed via the `codes` property:

In [None]:
pdbx_block.codes

It can be checked whether a data category name exists in the block:

In [None]:
"item_type_list" in pdbx_block

A data category can be accessed by its name or index:

In [None]:
pdbx_block[0] is pdbx_block["datablock"]

The `get()` method can be used to get an empty `CIFDataCategory` when the code/index does not exist:

In [None]:
pdbx_block.get("non_existent_category")

Iterating over the `CIFBlock` yields categories:

In [None]:

for category in pdbx_block:
    print(category.container_type)

Categories can be merged across the entire block and its save frames:

In [None]:
pdbx_block_multicat = pdbx_block.category("item")
pdbx_block_multicat

In [None]:
pdbx_block_multicat.df

### `CIFFrames`

In dictionary files,
data blocks can also contain save frames.
These can be accessed via the `CIFBlock.frames` property:

In [None]:
pdbx_block.frames

In [None]:
pdbx_block.frames.container_type

The entire data of the save frames can be accessed from the `df` table:

In [None]:
pdbx_block.frames.df

The category and keyword definition parts can be isolated:

In [None]:
pdbx_block.frames.part("dict_cat", "dict_key")

A dictionary view of the frames can be generated:

In [None]:
pdbx_block.frames.to_id_dict(["frame", "category", "keyword"])

The length of the `CIFFrames` tells you
how many save frames are in the block:

In [None]:
len(pdbx_block.frames)

The frame codes can be accessed via the `codes` property:

In [None]:
pdbx_block.frames.codes

It can be checked whether a frame code exists:

In [None]:
"atom_site" in pdbx_block.frames

A save frame can be accessed by its name or index:

In [None]:
pdbx_block.frames[0] is pdbx_block.frames["atom_site"]

The `get()` method can be used to get an empty `CIFFrame` when the code/index does not exist:

In [None]:
pdbx_block.frames.get("non_existent_frame_code")

Iterating over the `CIFFrames` yields save frames:

In [None]:

for frame in pdbx_block.frames:
    print(frame.container_type)

Categories can be merged across all save frames:

In [None]:
pdbx_frames_multicat = pdbx_block.frames.category("item")
pdbx_frames_multicat

In [None]:
pdbx_frames_multicat.df

### `CIFFrame`

`CIFFrames` contains `CIFFrame` objects, each corresponding to a single save frame in the data block:

In [None]:
pdbx_frame = pdbx_block.frames[0]
pdbx_frame

In [None]:
pdbx_frame.container_type

The frame code is stored in the `code` property:

In [None]:
pdbx_frame.code

The entire data of the frame can be accessed from the `df` table:

In [None]:
pdbx_frame.df

A dictionary view of the save frame can be generated:

In [None]:
pdbx_frame.to_id_dict(["category", "keyword"])

Similar to `CIFBlock`,
`CIFFrame` is also a container of data categories:

In [None]:
len(pdbx_frame)

In [None]:
pdbx_frame.codes

In [None]:
"category_examples" in pdbx_frame

In [None]:
pdbx_frame[0] is pdbx_frame["category"]

In [None]:
pdbx_frame.get("non_existent_category")

In [None]:
for category in pdbx_frame:
    print(category.container_type)

### Data Categories



`CIFBlock` and `CIFFrame` objects contain `CIFDataCategory` objects:

In [None]:
pdbx_frame[0]

In [None]:
pdbx_cat = pdbx_block[0]
pdbx_cat

The category code is stored in the `code` property:

In [None]:
pdbx_cat.code

The entire data of the category can be accessed from the `df` table.
However, in contrast to the earlier data structures,
the DataFrame of a `CIFDataCategory` is transposed,
i.e., each column corresponds to a data item in the category
(with column name being the data keyword),
and each row corresponds to one observation of that data item
(for non-tabular categories, the DataFrame only has one row):

In [None]:
pdbx_cat.df

A dictionary view of the category can be generated:

In [None]:
pdbx_cat.to_id_dict(["id"])

A `CIFDataCategory` is a collection of `CIFDataItem` objects,
i.e., data name keywords each with one or multiple observed values.
The length of the `CIFDataCategory` tells you how many keywords
(not how many observations) are in the category:

In [None]:
len(pdbx_cat)

The keyword codes can be accessed via the `codes` property:

In [None]:
pdbx_cat.codes

It can be checked whether a data keyword exists in the category:

In [None]:
"description" in pdbx_cat

A data item can be accessed by its name or index:

In [None]:
pdbx_cat[0] is pdbx_cat["id"]

The `get()` method can be used to get an empty `CIFDataItem` when the code/index does not exist:

In [None]:
pdbx_cat.get("non_existent_keyword")

Iterating over the `CIFDataCategory` yields data items:

In [None]:

for item in pdbx_cat:
    print(item.container_type)

### Data Items

`CIFDataCategory` elements are `CIFDataItem` objects,
corresponding to a data item within the category:

In [None]:
pdbx_item = pdbx_cat[0]
pdbx_item

The data keyword is stored in the `code` property:

In [None]:
pdbx_item.code

The (full) data name is stored in the `CIFDataItem.name` property:

In [None]:
pdbx_item.name

The length of the `CIFDataItem` tells you how many values the data item contains:

In [None]:
len(pdbx_item)

The values are accessible via the `CIFDataItem.values` property:

In [None]:
pdbx_item.values

While `CIFDataItem.values` always returns a `polars.Series`,
the `CIFDataItem.value` property returns the singular value
when the data item contains a single value:

In [None]:
pdbx_item.value

Values can also be indexed directly:

In [None]:
pdbx_item[0:]

They can also be iterated:

In [None]:
for value in pdbx_item:
    print(value)