# Quickstart Guide

In [None]:
import json  # for demonstrating dictionary reuse

import pdbapi  # for downloading example CIF files
import polars as pl  # for saving DataFrames

import ciffile

## Creating Files

Using the `ciffile.create()` function,
CIF files can be created from any table-like data structure
(e.g., a `polars.DataFrame`, `pandas.DataFrame`,
dictionary of columns, list of rows, etc.)
that can be converted to a `polars.DataFrame`.
The resulting DataFrame must contain one row
for each unique data item in the CIF file,
with columns specifying:
- **Block code** (i.e., data block name) of the data item.
- **Frame code** (i.e., save frame name within the block) of the data item (optional; for CIF dictionary files).
- **Category** of the data item name (tag).
    For mmCIF files, this corresponds to
    the part before the period in the data name.
    For CIF files, this must be `None` for single data items
    (i.e., not part of a loop/table),
    and a unique value (e.g., "1", "2", ...) for each table,
    shared among all data items in that table.
- **Keyword** of the data item name (tag).
    For mmCIF files, this corresponds to
    the part after the period in the data name.
    For CIF files, this is the data name itself.
- **Values** of the data item as a list.
    For single data items, the list contains a single string.
    For tabular (looped) data items,
    it contains multiple strings,
    corresponding to row values
    for that data item column in the table.

For more information about these terms, refer to the official source: [CIF Version 1.1 Common Semantic Features](https://www.iucr.org/resources/cif/spec/version1.1/semantics#definitions)
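
As a rough illustration of the row layout described above, the following sketch uses only the standard library to split mmCIF-style data names into category and keyword and build one row per data item. The helper `split_data_name`, the block code `"example"`, and the stripping of the leading underscore are assumptions made for this sketch and are not part of `ciffile`; the row keys mirror the column names used in the example that follows.

```python
from typing import Optional, Tuple


def split_data_name(name: str) -> Tuple[Optional[str], str]:
    """Split a data name into (category, keyword).

    Hypothetical helper: mmCIF names look like "_category.keyword";
    names without a period get a category of None.
    """
    name = name.lstrip("_")
    category, sep, keyword = name.partition(".")
    if not sep:
        return None, name
    return category, keyword


# Data names mapped to their values (single items hold a one-element list).
items = {
    "_cell.length_a": ["10.5"],
    "_atom_site.id": ["1", "2", "3"],
    "_atom_site.type_symbol": ["C", "N", "O"],
}

# One row per unique data item.
rows = []
for name, values in items.items():
    category, keyword = split_data_name(name)
    rows.append(
        {"block": "example", "category": category, "keyword": keyword, "values": values}
    )

for row in rows:
    print(row)
```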

In [None]:
sample_data = {
    "block": "MyCIFData",
    "category": ["my_table_category"] * 3 + ["my_single_category"] * 3,
    "keyword": ["col1", "col2", "col3", "key1", "key2", "key3"],
    "values": [[1, 10, 100], [2, 20, 200], [3, 30, 300], ["value1"], ["value2 with spaces"], ["value3 \n with \n newlines"]],
}
sample_file = ciffile.create(sample_data)

## Reading Files

Using the `ciffile.read()` function,
CIF files can be read from content, file paths, or file-like objects.
The following example downloads a PDB file from the RCSB database
and reads it directly.

In [None]:
pdb_file = ciffile.read(pdbapi.file.entry("3w32"))

## Writing Files

Once you have created/read a file,
it can be readily written as a string in CIF syntax.
One simple way is to invoke the `CIFFile` object's `__str__()` method; for example:

In [None]:
print(sample_file)

Alternatively,
the `CIFFile.write()` method allows
for more control over writing options:

In [None]:
sample_file_string = sample_file.write(
    list_style="horizontal",
    table_style="tabular-vertical",
    space_items=5,
    min_space_columns=2,
    indent=0,
    indent_inner=3,
    delimiter_preference=("double", "single", "semicolon"),
)
print(sample_file_string)

It can also be used for writing directly (and incrementally) to an output.
The method accepts any callable
that takes a string and writes it to the desired output.
This could be a file write method or any other string-consuming function.
The following example passes the `print` function for demonstration:

In [None]:
sample_file.write(lambda s: print(s, end=""))
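
To illustrate the callable-writer pattern without relying on `ciffile` itself, the sketch below uses a hypothetical stand-in function, `write_chunks`, that pushes strings to whatever callable it receives. It suggests that a `list.append` or an `io.StringIO.write` bound method could serve as a sink just as well as `print`; whether `CIFFile.write()` emits chunks of this exact shape is an assumption.

```python
import io


def write_chunks(writer) -> None:
    """Hypothetical stand-in for an incremental writer: emits text piece by piece."""
    for chunk in ("data_MyCIFData\n", "_my_single_category.key1   value1\n"):
        writer(chunk)


# Sink 1: collect the chunks in a list.
chunks = []
write_chunks(chunks.append)

# Sink 2: write into an in-memory text buffer
# (the write method of an open text file has the same call shape).
buffer = io.StringIO()
write_chunks(buffer.write)

print(buffer.getvalue(), end="")
```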

CIF files can also be written directly from a dictionary representing the data.
This must be a mapping of data block codes to mappings of
save frame codes to lists of data categories,
where each data category is either a `CIFDataCategory` instance (see below),
or a Polars `DataFrame` (or any data convertible to it).
A `None` save frame code indicates data categories directly
in the data block (no save frame).
This is particularly useful when you have a collection of table-like data structures
each representing a CIF data category, and want to directly convert them to CIF format
without having to first reformat them into the exploded format
required for creating a `CIFFile` instance.

For example, the `sample_data` above can also be directly converted to CIF
when represented in the following format:

In [None]:
sample_ciffile_dict = {
    "MyCIFData": {
        None: [
            {
                "my_table_category.col1": [1, 10, 100],
                "my_table_category.col2": [2, 20, 200],
                "my_table_category.col3": [3, 30, 300],
            },
            {
                "my_single_category.key1": ["value1"],
                "my_single_category.key2": ["value2 with spaces"],
                "my_single_category.key3": ["value3 \n with \n newlines"],
            },
        ]
    }
}

print(ciffile.write(sample_ciffile_dict))
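
The relationship between the two layouts can be sketched in plain Python. The function below, `rows_to_nested` (a hypothetical helper, not part of `ciffile`), groups exploded rows by block, frame, and category, and produces a mapping shaped like `sample_ciffile_dict`. It assumes each row is a dictionary with `block`, `category`, `keyword`, and `values` keys, plus an optional `frame` key.

```python
def rows_to_nested(rows):
    """Group one-row-per-item records into {block: {frame: [category_table, ...]}}."""
    tables = {}  # (block, frame, category) -> {"category.keyword": values}
    for row in rows:
        key = (row["block"], row.get("frame"), row["category"])
        column = f'{row["category"]}.{row["keyword"]}'
        tables.setdefault(key, {})[column] = row["values"]

    nested = {}
    for (block, frame, _category), table in tables.items():
        nested.setdefault(block, {}).setdefault(frame, []).append(table)
    return nested


rows = [
    {"block": "MyCIFData", "category": "my_table_category", "keyword": "col1", "values": [1, 10, 100]},
    {"block": "MyCIFData", "category": "my_table_category", "keyword": "col2", "values": [2, 20, 200]},
    {"block": "MyCIFData", "category": "my_single_category", "keyword": "key1", "values": ["value1"]},
]

nested = rows_to_nested(rows)
print(nested)
```

A missing `frame` key maps to `None`, matching the convention that a `None` save frame code means the category sits directly in the data block.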

## Saving Files

You can save a created/read CIF file to regenerate it later without having to parse the CIF syntax again.
To do so, save the underlying `polars.DataFrame` object in any desired format,
and use it to recreate the CIF file:

In [None]:
saved_filepath = "./ciffile.parquet"

# Save the DataFrame in parquet format
pdb_file.df.write_parquet(saved_filepath)

# Read the DataFrame from the parquet file
pdb_file_df = pl.read_parquet(saved_filepath)

# Recreate the CIF file
pdb_file_recreated = ciffile.create(pdb_file_df)

# Ensure both CIF files are the same
pdb_file == pdb_file_recreated

## Validating Files

CIF files can be validated against a corresponding CIF dictionary file,
to add metadata, find violations, and cast data into appropriate types and formats.
To do so, we first need to create/read a CIF dictionary file.

The following example downloads the
[PDB Exchange Dictionary (PDBx/mmCIF)](https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/)
from its URL and reads it directly.

In [None]:
pdbx_file = ciffile.read(pdbapi.file.dictionary())

Next, the dictionary definitions must be parsed into a format ready for validation.
This is done by calling the `CIFFile.to_validator_dict()` method,
which returns a Python `dict` object with the necessary data.
The method will also issue warnings if any (possibly harmless) issues are found in the dictionary.

In [None]:
pdbx_validation_dict = pdbx_file.to_validator_dict()

The dictionary is serializable, so it can be easily saved in your preferred format
(e.g., JSON, YAML, etc.) and used any time to readily validate CIF files
(instead of having to parse the original CIF dictionary file every time).

In [None]:
pdbx_validation_dict_saved = json.dumps(pdbx_validation_dict)
pdbx_validation_dict_loaded = json.loads(pdbx_validation_dict_saved)
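
For persistence across sessions, the same round trip could go through a file on disk. This sketch uses a small placeholder dictionary in place of the real validation dictionary (whose full structure is not shown here) and a temporary directory, so the file name is illustrative only.

```python
import json
import tempfile
from pathlib import Path

# Placeholder standing in for the output of to_validator_dict().
validation_dict = {"item": {"entry.id": {"type": "code"}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "validator.json"

    # Save once...
    path.write_text(json.dumps(validation_dict), encoding="utf-8")

    # ...and load it again in any later session.
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == validation_dict)
```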

To validate files, first use the `ciffile.validator` function to create a validator from the validation dictionary:

In [None]:
pdbx_validator = ciffile.validator(pdbx_validation_dict_loaded)

Now you can validate any CIF file. This will:
1. Check for the presence of mandatory data categories,
   and issue `"missing_category"` errors for missing mandatory categories.
2. Check that all data categories are defined in the dictionary,
   and issue `"undefined_category"` errors for missing category definitions.
3. Check for the presence of mandatory data items within each available data category,
   and issue `"missing_item"` errors for missing mandatory items.
4. Check that all data items are defined in the dictionary,
   and issue `"undefined_item"` errors for missing item definitions.
5. Set default values for missing (`"?"`) data,
   and issue `"missing_value"` errors for missing data without a defined default value.
6. Check the values against their data type construct (regular expression),
   and issue `"regex_violation"` errors for values that violate the construct.
7. If specified, normalize the case of case-insensitive ("uchar") data types
   to the specified case (upper/lowercase).
8. Cast the data values into the defined type and format.
9. Check that all data values are equal to one of the defined enumerations, if any,
   and issue `"enum_violation"` errors for values that violate the enumeration.
10. Check that all data values are within the defined ranges,
    and issue `"range_violation"` errors for values that do not fall within any of the specified ranges.
11. If requested, add category and item metadata to categories and data items in the file,
    including description, category groups and subgroups, category key items,
    value units, enumeration definitions, mandatory flags, etc.
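
A few of these checks can be sketched in plain Python. The toy definitions and the `check_values` function below are hypothetical and far simpler than the real validator; they only illustrate how the regex, enumeration, and range checks (steps 6, 9, and 10) might each yield a typed error record. The error type names are taken from the list above.

```python
import re

# Toy item definitions (hypothetical; real dictionaries carry much more detail).
definitions = {
    "atom_site.group_PDB": {"regex": r"[A-Z]+", "enum": {"ATOM", "HETATM"}},
    "atom_site.occupancy": {"regex": r"-?\d+(\.\d+)?", "cast": float, "range": (0.0, 1.0)},
}


def check_values(name, values):
    """Return a list of error records for one data item."""
    definition = definitions[name]
    errors = []
    for index, raw in enumerate(values):
        if not re.fullmatch(definition["regex"], raw):
            errors.append({"item": name, "row": index, "type": "regex_violation"})
            continue  # a malformed value cannot be cast or compared
        value = definition.get("cast", str)(raw)
        if "enum" in definition and value not in definition["enum"]:
            errors.append({"item": name, "row": index, "type": "enum_violation"})
        if "range" in definition:
            low, high = definition["range"]
            if not low <= value <= high:
                errors.append({"item": name, "row": index, "type": "range_violation"})
    return errors


errors = check_values("atom_site.group_PDB", ["ATOM", "WATER", "x1"])
errors += check_values("atom_site.occupancy", ["1.00", "1.50", "n/a"])
for error in errors:
    print(error)
```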

The data value casting and metadata addition
are performed in-place on the input `CIFFile` object,
while a DataFrame of all errors is returned:

In [None]:
errors = pdbx_validator.validate(pdb_file)

In [None]:
errors

You can analyze the errors downstream and decide whether you
accept or reject the validation depending on your pipeline.
For example, check which types of errors were detected:

In [None]:
errors["type"].unique().to_list()

In this case, only `"missing_value"` errors were detected,
which is quite common (and usually inconsequential)
in PDB entries.

### Converting Back to Strings

When writing data in CIF format, non-string data values (e.g., boolean, integer, float, datetime, list)
need to first be converted back to their string representations.
The writing functionalities described above can already handle simple type castings
from floating-point, integer, and boolean data types to strings.
However, for more complex data types, this must be separately handled by the validator's
`values_to_str()` method, which reverses the type casting performed by `validate()`.

As described above, the validator casts all data values to appropriate data types
according to their dictionary specifications. These include lists/arrays of numbers
generated from comma-separated values,
standard deviation columns extracted from floating-point strings (e.g., `"1.23(4)"` becomes `1.23` and `4`),
or datetime types. For example, the PDBx dictionary defines the data item `diffrn_detector.pdbx_collection_date`
with a datetime data type:

In [None]:
pdbx_validation_dict["item"]["diffrn_detector.pdbx_collection_date"]["type"]

Therefore, the validator casts the value to a datetime format:

In [None]:
pdb_file[0]["diffrn_detector"]["pdbx_collection_date"].value

Now assume you have performed some data manipulation after type casting
and want to write the results back to CIF.
For this, you need to first use the validator to cast these complex data types
back to their string representations:

In [None]:
errors_roundtrip = pdbx_validator.values_to_str(pdb_file)
errors_roundtrip

All values are now cast back to their corresponding string format:

In [None]:
pdb_file[0]["diffrn_detector"]["pdbx_collection_date"].value

Now you can write the data to CIF as before:

In [None]:
print(pdb_file[0]["diffrn_detector"])
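
As an aside on the standard-deviation notation mentioned earlier in this section, the split and its reversal can be sketched with a regular expression. The helpers `split_esd` and `join_esd` are hypothetical and only handle simple decimal strings; how the validator itself stores the two parts is not shown here, so the sketch simply mirrors the description above (`"1.23(4)"` becomes `1.23` and `4`).

```python
import re

ESD_PATTERN = re.compile(r"^(?P<value>[+-]?\d+(?:\.\d+)?)(?:\((?P<esd>\d+)\))?$")


def split_esd(text):
    """Split "1.23(4)" into (1.23, 4); the second part is None when absent."""
    match = ESD_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a simple number with optional uncertainty: {text!r}")
    esd = match.group("esd")
    return float(match.group("value")), (int(esd) if esd is not None else None)


def join_esd(value, esd, decimals):
    """Reverse split_esd; the number of decimals must be supplied by the caller."""
    text = f"{value:.{decimals}f}"
    return text if esd is None else f"{text}({esd})"


print(split_esd("1.23(4)"))
print(join_esd(1.23, 4, decimals=2))
```

Because casting to `float` discards trailing zeros, a faithful round trip needs the number of decimals to be tracked separately, which is one reason the reverse conversion is more than a plain `str()` call.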

## Exploring Files

CIFFile provides a robust data structure with various methods to access and process the data in the file.
For each hierarchical level in the CIF file, there is a corresponding object:

- `CIFFile`: The object returned by the `ciffile.create()` and `ciffile.read()` functions;
  it corresponds to the entire CIF file,
  and is a container of `CIFBlock` objects.
- `CIFBlock`: The top-level grouping in a CIF file, corresponding to a data block,
  which is a container of `CIFDataCategory` objects.
  In case of CIF dictionary files,
  it contains a `CIFFrames` object as well.
- `CIFFrames`: A container for all `CIFFrame` objects
  within a data block in CIF dictionary files.
- `CIFFrame`: Corresponds to a save frame within a data block,
  containing `CIFDataCategory` objects.
- `CIFDataCategory`: Corresponds to a data category, containing `CIFDataItem` objects.
- `CIFDataItem`: The last level in a CIF file, corresponding to a data item containing data values.
  
All data structures provide the following methods and properties:
- `code`: Block/frame code or data name category/keyword of the container.
- `codes`: Block/frame code, data name category/keyword, or data value index of the container's element.
- `container_type`: Type (level) identifier of the container.
- `get()`: Get an element by its code/index, and return an empty element if not found.
- `__iter__()`: Iterates over elements within the container.
- `__getitem__()`: Gets elements by their code/index.
- `__contains__()`: Checks whether a code/index exists for a container element.
- `__len__()`: Number of elements in the container.
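
The shared protocol can be pictured with a minimal stand-in class. The `Container` class below is hypothetical and not the library's implementation; it only shows how access by code or by position, membership tests, iteration, and length fit together. Unlike the real `get()`, which returns an empty element, this sketch falls back to a plain default value.

```python
class Container:
    """Minimal hypothetical container addressable by code or by position."""

    def __init__(self, code, elements):
        self.code = code
        self._elements = dict(elements)  # element code -> element

    @property
    def codes(self):
        return list(self._elements)

    def __getitem__(self, key):
        if isinstance(key, int):
            key = self.codes[key]  # positional access
        return self._elements[key]  # raises KeyError for unknown codes

    def get(self, key, default=None):
        try:
            return self[key]
        except (KeyError, IndexError):
            return default

    def __contains__(self, key):
        return key in self._elements

    def __iter__(self):
        return iter(self._elements.values())

    def __len__(self):
        return len(self._elements)


block = Container("MyCIFData", {"my_table_category": "table", "my_single_category": "single"})
print(len(block), block.codes)
print(block[0] is block["my_table_category"])
print("missing" in block, block.get("missing"))
```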

Other than `CIFDataItem` (which is the terminal data structure),
all other data structures also have:

- `df`: A `polars.DataFrame` representation of the CIF data structure containing all available data.
  For all data structures other than `CIFDataCategory`,
  it has the same format discussed above in the Creating Files section.
  For `CIFDataCategory` the DataFrame is transposed,
  i.e., each column corresponds to a data item in the category
  (with column name being the data keyword),
  and each row corresponds to one observation of that data item
  (for non-tabular categories, the DataFrame only has one row).
- `to_id_dict()`: Creates a dictionary representation of the CIF data structure.
- `write()`: Writes the data structure in CIF format.
- `__str__()`: Convenience method using `write()` to generate a string representation for the data structure in CIF format.

Other than `CIFDataItem` and `CIFDataCategory`,
the remaining data structures also provide:
- `type`: Type of the CIF file (either `"data"` or `"dict"`).
  There are two main types of CIF files:
  - **Data files** contain information about the subject of a (crystallography related) study or experiment.
  - **Dictionary files** contain information about the data items in data files, as identified by their data names.
  
  Although there is no way to distinguish between dictionary and data files at a purely syntactic level,
  save frames may only be used in dictionary files.
  Therefore, any CIF file containing at least one save frame
  is a dictionary file (note that not all dictionary files contain save frames).
  This property tells whether a container is a `data` or `dict` container,
  based on whether it is or contains any save frames.
- `category()`: Extracts and combines data category tables from all data blocks/save frames within the container.
  This is useful for obtaining a multi-block/frame view of a certain data category,
  i.e., to access a category within all data blocks and/or save frames in a file.
  The output is still a `CIFDataCategory`,
  but with additional identifier columns (`_block` and `_frame` by default) in the table,
  specifying which data block and/or save frame each row is coming from.
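
The idea behind the multi-block view can be sketched with plain dictionaries. The `merge_category` function and the two-block input below are hypothetical; they only show how rows of one category might be stacked across blocks while an identifier column (here `_block`, following the default name mentioned above) records where each row came from.

```python
def merge_category(blocks, category):
    """Stack the rows of one category across blocks, tagging each row with its origin."""
    merged = []
    for block_code, categories in blocks.items():
        for row in categories.get(category, []):
            merged.append({"_block": block_code, **row})
    return merged


# Hypothetical two-block file; each category is a list of row dictionaries.
blocks = {
    "1ABC": {"cell": [{"length_a": "10.0"}], "entity": [{"id": "1"}, {"id": "2"}]},
    "2XYZ": {"cell": [{"length_a": "25.5"}]},
}

for row in merge_category(blocks, "cell"):
    print(row)
```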
  
Other than `CIFDataItem`, `CIFDataCategory`, and `CIFFrame`,
the remaining data structures also provide:
- `part()`: Isolates data/dictionary parts of the container.
  Dictionary files usually contain two main types of information:
  - General information, such as those about the dictionary itself
    (e.g., title, version, change logs, and other identifiers).
    These are stored as data items directly under data blocks
    (i.e., not in any save frames).
  - Definition and attributes of data items that the dictionary describes.
    These are stored as data items within save frames of each data block.
    Moreover, for mmCIF dictionaries, these definitions can be divided into:
    - Definition of data categories,
      stored in save frames whose frame code is the category code
      (i.e., no period in the frame code).
    - Definition of data keywords within each category,
      stored in save frames whose frame code consists of both category and keyword codes
      (i.e., period in the frame code).
  
  Therefore, it is useful to be able to isolate these parts
  and process them separately.
  This can be done using the `part()` method;
  when called with no arguments,
  it returns all different parts of the file as separate objects.
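
The period rule for mmCIF dictionary save frames can be sketched in a few lines. The `classify_frame_code` function is hypothetical, and the labels `"dict_cat"` and `"dict_key"` are borrowed from the `part("dict_cat", "dict_key")` call shown later in this guide; the example frame codes are illustrative.

```python
def classify_frame_code(frame_code):
    """Classify an mmCIF dictionary save frame by its frame code."""
    # A period means the frame defines a keyword within a category;
    # no period means it defines the category itself.
    return "dict_key" if "." in frame_code else "dict_cat"


frame_codes = ["atom_site", "_atom_site.id", "_atom_site.Cartn_x", "cell"]

parts = {"dict_cat": [], "dict_key": []}
for frame_code in frame_codes:
    parts[classify_frame_code(frame_code)].append(frame_code)

print(parts)
```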

### `CIFFile`

The `ciffile.create()` and `ciffile.read()` functions
return a `CIFFile`:

In [None]:
pdbx_file, pdb_file

In [None]:
pdb_file.container_type

Files have no code:

In [None]:
pdb_file.code is None

The entire file data is stored as a `polars.DataFrame` in `CIFFile.df`:

In [None]:
pdbx_file.df

The `type` tells whether the file contains any save frames within its data blocks:

In [None]:
pdbx_file.type, pdb_file.type

The data and dictionary parts can be isolated:

In [None]:
pdbx_file.part()

A dictionary view of the file can be generated:

In [None]:
sample_file.to_id_dict(["block", "category", "keyword"])

A `CIFFile` is a container of data blocks.
The length of the `CIFFile` tells you how many data blocks
are in the file:

In [None]:
len(pdbx_file)

The block codes (data block names) can be accessed via the `codes` property:

In [None]:
pdbx_file.codes

It can be checked whether a block code exists in the file:

In [None]:
"mmcif_pdbx.dic" in pdbx_file, "non_existent_code" in sample_file

A data block can be accessed by its name or index:

In [None]:
pdbx_file[0] is pdbx_file["mmcif_pdbx.dic"]

Indexing with a code that does not exist raises a `KeyError`, whereas the `get()` method returns an empty `CIFBlock` instead:

In [None]:
try:
    pdbx_file["non_existent_code"]
except KeyError as e:
    print(f"Caught expected exception: {e}")

In [None]:
pdbx_file.get("non_existent_code")

Iterating over the `CIFFile` yields data blocks:

In [None]:
for pdbx_block in pdbx_file:
    print(pdbx_block.container_type)

Categories can be merged across data blocks/save frames as well:

In [None]:
pdbx_multicat = pdbx_file.category("item")
pdbx_multicat

In [None]:
pdbx_multicat.df

### `CIFBlock`

`CIFFile` elements are `CIFBlock` objects, corresponding to data blocks within the file:

In [None]:
pdbx_block = pdbx_file[0]
pdbx_block

In [None]:
pdbx_block.container_type

The block code is stored in the `CIFBlock.code` property:

In [None]:
pdbx_block.code

The entire data of the block can be accessed from the `df` table:

In [None]:
pdbx_block.df

The `type` tells whether the data block contains any save frames:

In [None]:
pdbx_block.type

The data and dictionary parts can be isolated:

In [None]:
pdbx_block.part()

A dictionary view of the block can be generated:

In [None]:
pdbx_block.to_id_dict(["frame", "category", "keyword"])

A `CIFBlock` is a container of data categories.
The length of the `CIFBlock` tells you how many data categories
are directly in the block (excluding save frames):

In [None]:
len(pdbx_block)

The category codes can be accessed via the `codes` property:

In [None]:
pdbx_block.codes

It can be checked whether a data category name exists in the block:

In [None]:
"item_type_list" in pdbx_block

A data category can be accessed by its name or index:

In [None]:
pdbx_block[0] is pdbx_block["datablock"]

The `get()` method can be used to get an empty `CIFDataCategory` when the code/index does not exist:

In [None]:
pdbx_block.get("non_existent_category")

Iterating over the `CIFBlock` yields categories:

In [None]:
for pdbx_category in pdbx_block:
    print(pdbx_category.container_type)

Categories can be merged across the entire block and its save frames:

In [None]:
pdbx_block_multicat = pdbx_block.category("item")
pdbx_block_multicat

In [None]:
pdbx_block_multicat.df

### `CIFFrames`

In dictionary files,
data blocks can also contain save frames.
These can be accessed via the `CIFBlock.frames` property:

In [None]:
pdbx_block.frames

In [None]:
pdbx_block.frames.container_type

The entire data of the save frames can be accessed from the `df` table:

In [None]:
pdbx_block.frames.df

The category and keyword definition parts can be isolated:

In [None]:
pdbx_block.frames.part("dict_cat", "dict_key")

A dictionary view of the frames can be generated:

In [None]:
pdbx_block.frames.to_id_dict(["frame", "category", "keyword"])

The length of the `CIFFrames` tells you
how many save frames are in the block:

In [None]:
len(pdbx_block.frames)

The frame codes can be accessed via the `codes` property:

In [None]:
pdbx_block.frames.codes

It can be checked whether a frame code exists:

In [None]:
"atom_site" in pdbx_block.frames

A save frame can be accessed by its name or index:

In [None]:
pdbx_block.frames[0] is pdbx_block.frames["atom_site"]

The `get()` method can be used to get an empty `CIFFrame` when the code/index does not exist:

In [None]:
pdbx_block.frames.get("non_existent_frame_code")

Iterating over the `CIFFrames` yields save frames:

In [None]:
for pdbx_frame in pdbx_block.frames:
    print(pdbx_frame.container_type)

Categories can be merged across all save frames:

In [None]:
pdbx_frames_multicat = pdbx_block.frames.category("item")
pdbx_frames_multicat

In [None]:
pdbx_frames_multicat.df

### `CIFFrame`

`CIFFrames` contains `CIFFrame` objects, each corresponding to a single save frame in the data block:

In [None]:
pdbx_frame = pdbx_block.frames[0]
pdbx_frame

In [None]:
pdbx_frame.container_type

The frame code is stored in the `code` property:

In [None]:
pdbx_frame.code

The entire data of the frame can be accessed from the `df` table:

In [None]:
pdbx_frame.df

A dictionary view of the save frame can be generated:

In [None]:
pdbx_frame.to_id_dict(["category", "keyword"])

Similar to `CIFBlock`,
`CIFFrame` is also a container of data categories:

In [None]:
len(pdbx_frame)

In [None]:
pdbx_frame.codes

In [None]:
"category_examples" in pdbx_frame

In [None]:
pdbx_frame[0] is pdbx_frame["category"]

In [None]:
pdbx_frame.get("non_existent_category")

In [None]:
for category in pdbx_frame:
    print(category.container_type)

### Data Categories



`CIFBlock` and `CIFFrame` objects contain `CIFDataCategory` objects:

In [None]:
pdbx_frame[0], pdbx_block[0]

In [None]:
pdb_cat = pdb_file[0]["atom_site"]
pdb_cat

The category code is stored in the `code` property:

In [None]:
pdb_cat.code

The entire data of the category can be accessed from the `df` table.
However, in contrast to the earlier data structures,
in `CIFDataCategory` the DataFrame is transposed,
i.e., each column corresponds to a data item in the category
(with column name being the data keyword),
and each row corresponds to one observation of that data item
(for non-tabular categories, the DataFrame only has one row).
Moreover, if the file has been validated,
the data values each have their appropriate type:

In [None]:
pdb_cat.df
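
The transposition can be pictured with plain Python data. The sketch below uses made-up keywords and values, and is not how the library builds its DataFrame; it only shows how one entry per data item (the exploded layout) turns into one column per keyword and one row per observation.

```python
# Exploded layout: one entry per data item, each holding its column of values.
items = [
    {"keyword": "id", "values": ["1", "2", "3"]},
    {"keyword": "type_symbol", "values": ["N", "C", "O"]},
]

# Category layout: one column per keyword...
columns = {item["keyword"]: item["values"] for item in items}

# ...so each row is one observation across all data items.
rows = [dict(zip(columns, observation)) for observation in zip(*columns.values())]

for row in rows:
    print(row)
```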

A dictionary view of the category can be generated:

In [None]:
pdb_cat.to_id_dict(["id"])

A `CIFDataCategory` is a collection of `CIFDataItem` objects,
i.e., data name keywords each with one or multiple observed values.
The length of the `CIFDataCategory` tells you how many keywords
(not how many observations) are in the category:

In [None]:
len(pdb_cat)

The keyword codes can be accessed via the `codes` property:

In [None]:
pdb_cat.codes

It can be checked whether a data keyword exists in the category:

In [None]:
"pdbx_pdb_model_num" in pdb_cat

A data item can be accessed by its name or index:

In [None]:
pdb_cat[0] is pdb_cat["id"]

The `get()` method can be used to get an empty `CIFDataItem` when the code/index does not exist:

In [None]:
pdb_cat.get("non_existent_keyword")

Iterating over the `CIFDataCategory` yields data items:

In [None]:
for pdb_item in pdb_cat:
    print(pdb_item.container_type, pdb_item.code)

The full names of data items can be accessed by the `item_names` property:

In [None]:
pdb_cat.item_names

If the file has been validated with the option to add metadata,
each category also has available metadata:

In [None]:
pdb_cat.description

In [None]:
pdb_cat.groups

In [None]:
pdb_cat.keys

### Data Items

`CIFDataCategory` elements are `CIFDataItem` objects,
corresponding to a data item within the category:

In [None]:
pdb_item = pdb_cat[0]
pdb_item

The data keyword is stored in the `code` property:

In [None]:
pdb_item.code

The (full) data name is stored in the `CIFDataItem.name` property:

In [None]:
pdb_item.name

The length of the `CIFDataItem` tells you how many values the data item contains:

In [None]:
len(pdb_item)

The values are accessible via the `CIFDataItem.values` property:

In [None]:
pdb_item.values

While `CIFDataItem.values` always returns `polars.Series` objects,
the `CIFDataItem.value` property returns the singular value
when the data item contains a single value:

In [None]:
pdb_file[0]["entry"]["id"].values

In [None]:
pdb_file[0]["entry"]["id"].value

Values can also be indexed directly:

In [None]:
pdb_item[10:14]

They can also be iterated:

In [None]:
for pdb_value in pdb_item:
    print(pdb_value)

If the file has been validated with the option to add metadata,
each item also has available metadata:

In [None]:
pdb_item.description

In [None]:
pdb_item.mandatory

In [None]:
pdb_cat["occupancy"].default

In [None]:
pdb_cat["group_pdb"].enum

In [None]:
pdb_item.dtype

In [None]:
pdb_cat["pdbx_pdb_model_num"].range

In [None]:
pdb_cat["cartn_x"].unit

## Summary

This quickstart guide has covered the essential features of the CIFFile library:

1. **Creating CIF files** from tabular data structures
2. **Reading CIF files** from various sources (paths, strings, file objects)
3. **Writing CIF files** with customizable formatting
4. **Saving CIF files** to disk
5. **Hierarchical data access** through files, blocks, frames, categories, and items
6. **DataFrame integration** for data manipulation and analysis
7. **Validation** against DDL2 dictionaries
8. **Type casting** for proper data types

### Key Concepts

- **CIFFile**: Top-level container representing the entire CIF file
- **CIFBlock**: Data block (prefixed with `data_` in CIF syntax)
- **CIFFrame**: Save frame within a block (prefixed with `save_` in CIF syntax)
- **CIFDataCategory**: Group of related data items (a `loop_` construct in CIF when the category is tabular)
- **CIFDataItem**: Individual data item with a name and value(s)

### Best Practices

- Use **mmCIF variant** for macromolecular structures
- Use **CIF 1.1 variant** for small molecule structures
- **Validate files** when reading from external sources
- **Cast data types** for numerical operations
- Leverage **Polars DataFrames** for efficient data manipulation
- Use **customizable writing options** to match your formatting requirements

### Next Steps

- Explore the inline documentation with `help(ciffile.read)`, `help(CIFFile)`, etc.
- Check the comprehensive [README.md](./README.md) for more examples
- Visit the [official CIF specification](https://www.iucr.org/resources/cif) for format details
- See [mmCIF documentation](https://mmcif.wwpdb.org/) for PDB-specific features

### Additional Resources

- [IUCr CIF Resources](https://www.iucr.org/resources/cif)
- [PDBx/mmCIF Dictionary](https://mmcif.wwpdb.org/dictionaries/)
- [Polars Documentation](https://pola.rs/)
- [wwPDB](https://www.wwpdb.org/)