Added examples for parsing DFT
BowenD-UCB committed Sep 25, 2023
1 parent 195d48c commit 2ebc57f
Showing 1 changed file with 109 additions and 12 deletions.
121 changes: 109 additions & 12 deletions examples/fine_tuning.ipynb
@@ -50,34 +50,120 @@
"id": "16eeae1e",
"metadata": {},
"source": [
"## 1. Prepare Training Data\n"
"## 0. Parse DFT outputs to CHGNet readable formats\n"
]
},
{
"cell_type": "markdown",
"id": "286c110a",
"metadata": {},
"source": [
"CHGNet is interfaced to [Pymatgen](https://pymatgen.org/), the training samples (normally coming from different DFTs like VASP),\n",
"need to be converted to [pymatgen.core.structure](https://pymatgen.org/pymatgen.core.html#module-pymatgen.core.structure).\n",
"\n",
"To convert VASP calculation to pymatgen structures and CHGNet labels, you can use the following [code](https://github.com/CederGroupHub/chgnet/blob/main/chgnet/utils/vasp_utils.py):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "208fa4aa",
"id": "72ada11a",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from chgnet.utils import parse_vasp_dir\n",
"\n",
"# ./my_vasp_calc_dir contains vasprun.xml OSZICAR etc.\n",
"dataset_dict = parse_vasp_dir(file_root=\"./my_vasp_calc_dir\")\n",
"print(dataset_dict.keys())"
]
},
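{
"cell_type": "markdown",
"id": "3c1f2a9e",
"metadata": {},
"source": [
"As a quick sanity check (a minimal sketch, not part of the original workflow), the parsed dictionary can be inspected before saving.\n",
"Only the `\"structure\"` key is assumed here, since it is the key used by the saving examples below;\n",
"the other keys (energies, forces, magmoms, ...) depend on the `parse_vasp_dir` version and are printed by the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d2e7b10",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: inspect the parsed dataset. Only the \"structure\" key is assumed,\n",
"# as it is the one used by the saving examples below.\n",
"n_frames = len(dataset_dict[\"structure\"])\n",
"print(f\"Parsed {n_frames} ionic steps\")\n",
"\n",
"# Each entry is a pymatgen Structure, so printing the first one shows\n",
"# the composition and lattice of that frame.\n",
"print(dataset_dict[\"structure\"][0])"
]
},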
{
"cell_type": "markdown",
"id": "b8b3a8cd",
"metadata": {},
"source": [
"After the DFT calculations are parsed, we can save the parsed structures and labels to disk,\n",
"so that they can be easily reloaded during multiple rounds of training.\n",
"The Pymatgen structures can be saved in either json, pickle, cif, or CHGNet graph.\n",
"\n",
"For super-large training dataset, like MPtrj dataset, we recommend [converting them to CHGNet graphs](https://github.com/CederGroupHub/chgnet/blob/main/examples/make_graphs.py). This will save significant memory and graph computing time.\n",
"\n",
"Below are the example codes to save the structures."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9a74cae",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"try:\n",
" from chgnet import ROOT\n",
"# Structure to json\n",
"from chgnet.utils import write_json\n",
"\n",
" lmo = Structure.from_file(f\"{ROOT}/examples/mp-18767-LiMnO2.cif\")\n",
"except Exception:\n",
" from urllib.request import urlopen\n",
"dict_to_json = [struct.as_dict() for struct in dataset_dict[\"structure\"]]\n",
"write_json(dict_to_json, \"CHGNet_structures.json\")\n",
"\n",
" url = \"https://raw.githubusercontent.com/CederGroupHub/chgnet/main/examples/mp-18767-LiMnO2.cif\"\n",
" cif = urlopen(url).read().decode(\"utf-8\")\n",
" lmo = Structure.from_str(cif, fmt=\"cif\")"
"\n",
"# Structure to pickle\n",
"import pickle\n",
"\n",
"with open(\"CHGNet_structures.p\", \"wb\") as f:\n",
" pickle.dump(dataset_dict, f)\n",
"\n",
"\n",
"# Structure to cif\n",
"for idx, struct in enumerate(dataset_dict[\"structure\"]):\n",
" struct.to(filename=f\"{idx}.cif\")\n",
"\n",
"\n",
"# Structure to CHGNet graph\n",
"from chgnet.graph import CrystalGraphConverter\n",
"\n",
"converter = CrystalGraphConverter()\n",
"for idx, struct in enumerate(dataset_dict[\"structure\"]):\n",
" graph = converter(struct)\n",
" graph.save(fname=f\"{idx}.pt\")"
]
},
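{
"cell_type": "markdown",
"id": "7f4b2d36",
"metadata": {},
"source": [
"Below is a minimal sketch (not part of the original workflow) of reloading the saved json and pickle files in a later training session.\n",
"It only relies on the standard library and `pymatgen.core.Structure.from_dict`;\n",
"reloading the saved CHGNet graphs is handled by the `GraphData` class discussed later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b61a0f4",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: reload the structures saved above.\n",
"import json\n",
"import pickle\n",
"\n",
"from pymatgen.core import Structure\n",
"\n",
"# From json: each entry was saved with Structure.as_dict()\n",
"with open(\"CHGNet_structures.json\") as file:\n",
"    structures_from_json = [Structure.from_dict(d) for d in json.load(file)]\n",
"\n",
"# From pickle: the whole parsed dictionary was dumped\n",
"with open(\"CHGNet_structures.p\", \"rb\") as file:\n",
"    reloaded_dataset_dict = pickle.load(file)"
]
},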
{
"cell_type": "markdown",
"id": "61c551cd",
"metadata": {},
"source": [
"For other types of DFT calculations, please refer to their interfaces\n",
"in [pymatgen.io](https://pymatgen.org/pymatgen.io.html#module-pymatgen.io).\n",
"\n",
"see: [Quantum Espresso](https://pymatgen.org/pymatgen.io.html#module-pymatgen.io.pwscf)\n",
"\n",
"see: [CP2K](https://pymatgen.org/pymatgen.io.cp2k.html#module-pymatgen.io.cp2k)\n",
"\n",
"see: [Gaussian](https://pymatgen.org/pymatgen.io.html#module-pymatgen.io.gaussian)\n"
]
},
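{
"cell_type": "markdown",
"id": "c4d8e5a1",
"metadata": {},
"source": [
"Whichever interface you use, the end goal is the same: a list of pymatgen `Structure` objects with matching energies, forces and (optionally) stresses and magmoms, in the list format used in section 1 below.\n",
"The generic sketch below only illustrates that format; `parse_single_calc` is a hypothetical placeholder whose body should be replaced by the appropriate `pymatgen.io` parser for your DFT code."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0a3b7c2",
"metadata": {},
"outputs": [],
"source": [
"# Generic sketch: assemble CHGNet training data from any DFT code.\n",
"# `parse_single_calc` is a hypothetical placeholder -- replace its body with\n",
"# the matching pymatgen.io parser (pwscf, cp2k, gaussian, ...) for your code.\n",
"\n",
"\n",
"def parse_single_calc(calc_dir):\n",
"    \"\"\"Hypothetical helper: return (Structure, total_energy_in_eV, forces).\"\"\"\n",
"    raise NotImplementedError(f\"parse {calc_dir} with the matching pymatgen.io module\")\n",
"\n",
"\n",
"structures, energies_per_atom, forces = [], [], []\n",
"\n",
"for calc_dir in [\"calc_0\", \"calc_1\"]:  # your own calculation folders\n",
"    structure, total_energy, force = parse_single_calc(calc_dir)\n",
"    structures.append(structure)\n",
"    energies_per_atom.append(total_energy / len(structure))\n",
"    forces.append(force)"
]
},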
{
"cell_type": "markdown",
"id": "e1611921",
"metadata": {},
"source": [
"## 1. Prepare Training Data"
]
},
{
"cell_type": "markdown",
"id": "9ec2524a",
"metadata": {},
"source": [
"We create a dummy fine-tuning dataset by using CHGNet prediction with some random noise.\n",
"Below we will create a dummy fine-tuning dataset by using CHGNet prediction with some random noise.\n",
"For your purpose of fine-tuning to a specific chemical system or AIMD data, please modify the block below\n"
]
},
@@ -88,6 +174,17 @@
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" from chgnet import ROOT\n",
"\n",
" lmo = Structure.from_file(f\"{ROOT}/examples/mp-18767-LiMnO2.cif\")\n",
"except Exception:\n",
" from urllib.request import urlopen\n",
"\n",
" url = \"https://raw.githubusercontent.com/CederGroupHub/chgnet/main/examples/mp-18767-LiMnO2.cif\"\n",
" cif = urlopen(url).read().decode(\"utf-8\")\n",
" lmo = Structure.from_str(cif, fmt=\"cif\")\n",
"\n",
"structures, energies_per_atom, forces, stresses, magmoms = [], [], [], [], []\n",
"\n",
"for _ in range(100):\n",
@@ -172,7 +269,7 @@
"\n",
"The `batch_size` is defined to be 8 for small GPU-memory. If > 10 GB memory is available, we highly recommend to increase `batch_size` for better speed.\n",
"\n",
"If you have very large numbers of structures (which is typical for AIMD), putting them all in a python list can quickly run into memory issues. In this case we highly recommend you to pre-convert all the structures into graphs and save them as shown in `examples/make_graphs.py`. Then directly train CHGNet by loading the graphs from disk instead of memory using the `GraphData` class defined in `data/dataset.py`.\n"
"If you have very large numbers (>100K) of structures (which is typical for AIMD), putting them all in a python list can quickly run into memory issues. In this case we highly recommend you to pre-convert all the structures into graphs and save them as shown in `examples/make_graphs.py`. Then directly train CHGNet by loading the graphs from disk instead of memory using the `GraphData` class defined in `data/dataset.py`.\n"
]
},
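{
"cell_type": "markdown",
"id": "f2a9c6d8",
"metadata": {},
"source": [
"A hedged sketch of that graph-based workflow is shown below.\n",
"The `GraphData` argument names and the loader signature are assumptions based on `data/dataset.py` and `examples/make_graphs.py`; please check those files for the authoritative interface."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b7d4e92",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: train from pre-saved graphs instead of in-memory structures.\n",
"# Argument names below are assumptions -- see chgnet/data/dataset.py and\n",
"# examples/make_graphs.py for the exact interface.\n",
"from chgnet.data.dataset import GraphData, get_train_val_test_loader\n",
"\n",
"graph_data = GraphData(\n",
"    graph_path=\"./graphs\",  # directory of graphs from make_graphs.py (assumed layout)\n",
"    labels=\"labels.json\",  # label file written alongside the graphs (assumed name)\n",
")\n",
"train_loader, val_loader, test_loader = get_train_val_test_loader(\n",
"    graph_data, batch_size=8, train_ratio=0.9, val_ratio=0.05\n",
")"
]
},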
{