## Machine learning with Catalysis-Hub datasets

This tutorial uses the following publication, listed on the Catalysis-Hub website (https://www.catalysis-hub.org/publications), as its example:<br>
"Application of machine learning to discover new intermetallic catalysts for the hydrogen evolution and the oxygen reduction reactions". Martínez-Alonso, Carmen; Vassilev-Galindo, Valentin; Comer, Benjamin M.; Abild-Pedersen, Frank; Winther, Kirsten T.; Llorca, Javier. Catalysis Science & Technology. (2024) #AlonsoStrain2023.<br>
It shows how to download the dataset from Catalysis-Hub and how to use the adsorption energies for machine learning afterwards.

<img src="article.png" width="1200">

Data availability statement<br>
The database of DFT adsorption energy calculations is made openly available at the Catalysis-Hub platform [43] via the link https://www.catalysis-hub.org/publications/AlonsoStrain2023, and the values of the descriptors along with the ML code are published in https://zenodo.org/doi/10.5281/zenodo.11486422.

In [13]:
# pip install openpyxl # read xlsx
import requests
import json
import pandas as pd
import ase

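## Downloading a Catalysis-Hub dataset

The Catalysis-Hub website lists the datasets that accompany journal articles, together with their download links, at https://www.catalysis-hub.org/publications

Select an article and click [CHECKOUT REACTIONS] to see the detailed reaction information and the structure visualizations. Option 1, TABLE, lets you browse the dataset on the website. Option 2, GRAPHQL QUERY, returns the dataset as JSON, including descriptors such as the reaction facet. Option 3, FETCH CSV, downloads the dataset as a CSV table with fewer descriptor fields. Option 4, ASE ATOMS, provides the original reactant and product structures as extxyz files.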

<img src="detail1.png" width="1200">

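### Option 2: GRAPHQL QUERY

The GRAPHQL QUERY link next to the article returns the publication metadata. The GRAPHQL QUERY link further down, under reactions, returns the dataset itself: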

<img src="detail2.png" width="1200">

https://api.catalysis-hub.org/graphql?query=query%7Breactions%20(pubId%3A%20%22AlonsoStrain2023%22%2C%0A%20%20%20%20%20%20%20%20first%3A%20200%2C%20after%3A%20%22YXJyYXljb25uZWN0aW9uOjE5OQ%3D%3D%22%2C%20%23%E4%B8%80%E5%85%B12628%E6%9D%A1%E6%95%B0%E6%8D%AE%EF%BC%8C%E8%BF%99%E9%87%8C%E5%8F%AA%E8%BE%93%E5%87%BA%E5%89%8D200%E4%B8%AA%0A%20%20%20%20%20%20%20%20order%3A%20%22chemicalComposition%22)%20%7B%0A%20%20%20%20%20%20%20%20totalCount%0A%20%20%20%20%20%20%20%20pageInfo%20%7B%0A%20%20%20%20%20%20%20%20hasNextPage%0A%20%20%20%20%20%20%20%20hasPreviousPage%0A%20%20%20%20%20%20%20%20startCursor%0A%20%20%20%20%20%20%20%20endCursor%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20edges%20%7B%0A%20%20%20%20%20%20%20%20%20%20node%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20Equation%0A%20%20%20%20%20%20%20%20%20%20%20%20sites%0A%20%20%20%20%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20%20%20%20%20pubId%0A%20%20%20%20%20%20%20%20%20%20%20%20dftCode%0A%20%20%20%20%20%20%20%20%20%20%20%20dftFunctional%0A%20%20%20%20%20%20%20%20%20%20%20%20reactants%0A%20%20%20%20%20%20%20%20%20%20%20%20products%0A%20%20%20%20%20%20%20%20%20%20%20%20facet%0A%20%20%20%20%20%20%20%20%20%20%20%20reactionEnergy%0A%20%20%20%20%20%20%20%20%20%20%20%20activationEnergy%0A%20%20%20%20%20%20%20%20%20%20%20%20surfaceComposition%0A%20%20%20%20%20%20%20%20%20%20%20%20chemicalComposition%0A%20%20%20%20%20%20%20%20%20%20%20%20reactionSystems%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20aseId%0A%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%7D%0A%7D&operationName=undefined

The same records shown under TABLE can be downloaded as a dictionary with Python.

In [21]:
url = "https://api.catalysis-hub.org/graphql"

query = """
query{reactions (pubId: "AlonsoStrain2023",
        first: 200, after: "", # first 200 records
        order: "chemicalComposition") {
        totalCount
        pageInfo {
        hasNextPage
        hasPreviousPage
        startCursor
        endCursor
        }
        edges {
          node {
            Equation
            sites
            id
            pubId
            dftCode
            dftFunctional
            reactants
            products
            facet
            reactionEnergy
            activationEnergy
            surfaceComposition
            chemicalComposition
            reactionSystems {
              name
              aseId
            }
          }
        }
        }
}
"""

response = requests.post(url, json={"query": query})
response.raise_for_status()  # stop here if the HTTP request failed
data = response.json()
print(data)

{'data': {'reactions': {'totalCount': 2628, 'pageInfo': {'hasNextPage': True, 'hasPreviousPage': False, 'startCursor': 'YXJyYXljb25uZWN0aW9uOjA=', 'endCursor': 'YXJyYXljb25uZWN0aW9uOjE5OQ=='}, 'edges': [{'node': {'Equation': 'H2O(g) - 0.5H2(g) + * -> OH*', 'sites': '{"OH": "ontopB"}', 'id': 'UmVhY3Rpb246NDU3MDc5', 'pubId': 'AlonsoStrain2023', 'dftCode': 'Quantum Espresso', 'dftFunctional': 'PBE', 'reactants': '{"star": 1, "H2gas": -0.5, "H2Ogas": 1}', 'products': '{"OHstar": 1}', 'facet': '111-5%', 'reactionEnergy': 0.5379286501192837, 'activationEnergy': None, 'surfaceComposition': 'Ag3In-fcc', 'chemicalComposition': 'Ag12In4', 'reactionSystems': [{'name': 'OHstar', 'aseId': '961d40fc904e6605e6f87dda2919a28e'}, {'name': 'star', 'aseId': 'a9840669a4c8fa011bf2280b53bc51dd'}, {'name': 'H2Ogas', 'aseId': 'adc8705ec1f8c02ba106e36811191cb6'}, {'name': 'H2gas', 'aseId': 'e1f32563679f163c00aa78f304a75f22'}]}}, {'node': {'Equation': 'H2O(g) - 0.5H2(g) + * -> OH*', 'sites': '{"OH": "hcpAAA"}', 
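The `startCursor` and `endCursor` strings in the response above look opaque, but decoding them as base64 suggests that they wrap a zero-based record offset. The sketch below decodes the two cursors printed above. The helper names `decode_cursor` and `encode_cursor` are illustrative and not part of any Catalysis-Hub client, and the `arrayconnection:<offset>` layout is inferred from these two values only, so it is safer to treat cursors as opaque in real code.

```python
import base64


def decode_cursor(cursor: str) -> str:
    """Decode a base64 pagination cursor into its plain-text form."""
    return base64.b64decode(cursor).decode("ascii")


def encode_cursor(offset: int) -> str:
    """Build a cursor for a zero-based offset (assumed layout, inferred from the output)."""
    text = f"arrayconnection:{offset}"
    return base64.b64encode(text.encode("ascii")).decode("ascii")


# Cursors copied from the response printed above.
start_cursor = "YXJyYXljb25uZWN0aW9uOjA="
end_cursor = "YXJyYXljb25uZWN0aW9uOjE5OQ=="

print(decode_cursor(start_cursor))  # arrayconnection:0
print(decode_cursor(end_cursor))    # arrayconnection:199
```

If this reading is right, passing `endCursor` as `after` requests the records that follow offset 199. The browser URL shown earlier passes exactly this cursor, so it would return the second page of 200 records rather than the first.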

In [22]:
# Save the 200 records fetched above as JSON, then inspect them
from pprint import pprint

with open("AlonsoStrain2023_full_dataset.json", "w") as f:
    json.dump(data["data"], f, indent=4, sort_keys=True)
pprint(data["data"])

{'reactions': {'edges': [{'node': {'Equation': 'H2O(g) - 0.5H2(g) + * -> OH*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag12In4',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '111-5%',
                                   'id': 'UmVhY3Rpb246NDU3MDc5',
                                   'products': '{"OHstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": -0.5, '
                                                '"H2Ogas": 1}',
                                   'reactionEnergy': 0.5379286501192837,
                                   'reactionSystems': [{'aseId': '961d40fc904e6605e6f87dda2919a28e',
                                                        'name': 'OHstar'},
                            

                                   'reactionSystems': [{'aseId': '2568c958538f24dda7e34ce5bb0e7b01',
                                                        'name': 'star'},
                                                       {'aseId': '2d29eef0497705ade82af1e342e6f6db',
                                                        'name': 'O2gas'},
                                                       {'aseId': 'ab89b1ece7eb26d37d42657ec83da377',
                                                        'name': 'Ostar'}],
                                   'sites': '{"O": "fccAAA"}',
                                   'surfaceComposition': 'Ag3In-fcc'}},
                         {'node': {'Equation': '0.5H2(g) + * -> H*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag12In4',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                

                                                        'name': 'OHstar'}],
                                   'sites': '{"OH": "ontopB-d"}',
                                   'surfaceComposition': 'Ag3In-fcc'}},
                         {'node': {'Equation': 'H2O(g) - 0.5H2(g) + * -> OH*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag12In4',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '111+2%',
                                   'id': 'UmVhY3Rpb246NDU3MDQw',
                                   'products': '{"OHstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": -0.5, '
                                                '"H2Ogas": 1}',
                                   'reactionEnergy': 0.29132181

                                                        'name': 'star'},
                                                       {'aseId': 'adc8705ec1f8c02ba106e36811191cb6',
                                                        'name': 'H2Ogas'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                                                        'name': 'H2gas'}],
                                   'sites': '{"OH": "fccAAA"}',
                                   'surfaceComposition': 'Ag3In-fcc'}},
                         {'node': {'Equation': '0.5O2(g) + * -> O*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag12In4',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '111',
                                   'id': 'UmVhY3Rpb246NDU3MDIz',

                                   'reactionSystems': [{'aseId': '4ceabf839de72b3b62f0af32fa607fed',
                                                        'name': 'star'},
                                                       {'aseId': '66822d4aced17442743463ce173ad9f2',
                                                        'name': 'Hstar'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                                                        'name': 'H2gas'}],
                                   'sites': '{"H": "hcpAAB"}',
                                   'surfaceComposition': 'Ag3In-fcc'}},
                         {'node': {'Equation': '0.5H2(g) + * -> H*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag12In4',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                

                         {'node': {'Equation': '0.5H2(g) + * -> H*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag12Mg4',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '111+5%',
                                   'id': 'UmVhY3Rpb246NDU3MDk4',
                                   'products': '{"Hstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": 0.5}',
                                   'reactionEnergy': 0.07126579800024047,
                                   'reactionSystems': [{'aseId': '6381bc9149f1db6c41bebbd75509a9bf',
                                                        'name': 'Hstar'},
                                                       {'aseId': 'e0425c8d302aeffa8109c9db299d0b58',
   

                                   'reactants': '{"star": 1, "H2gas": -0.5, '
                                                '"H2Ogas": 1}',
                                   'reactionEnergy': 0.6675054504958098,
                                   'reactionSystems': [{'aseId': '271c5ca78072c02d61e01adee7a23553',
                                                        'name': 'star'},
                                                       {'aseId': '76417694a8995a2a86fcb739a99088d9',
                                                        'name': 'OHstar'},
                                                       {'aseId': 'adc8705ec1f8c02ba106e36811191cb6',
                                                        'name': 'H2Ogas'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                                                        'name': 'H2gas'}],
                                   'sites': '{"OH": "fccAAA"}',
                   

                                   'facet': '111-5%',
                                   'id': 'UmVhY3Rpb246NDU3MDEz',
                                   'products': '{"Ostar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "O2gas": 0.5}',
                                   'reactionEnergy': -0.9087241641900619,
                                   'reactionSystems': [{'aseId': '2d29eef0497705ade82af1e342e6f6db',
                                                        'name': 'O2gas'},
                                                       {'aseId': '39569af48edc0f20b30bbe7f7187542d',
                                                        'name': 'star'},
                                                       {'aseId': 'd437950cec69b0a5fd006ad424c35832',
                                                        'name': 'Ostar'}],
                                   'sites': '{"O": "fcc"}',
                  

                         {'node': {'Equation': 'H2O(g) - 0.5H2(g) + * -> OH*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag16',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '111+2%',
                                   'id': 'UmVhY3Rpb246NDU2OTk2',
                                   'products': '{"OHstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": -0.5, '
                                                '"H2Ogas": 1}',
                                   'reactionEnergy': 0.5650899650390784,
                                   'reactionSystems': [{'aseId': 'a57e858a40af8e584fa15090e255c28d',
                                                        'name': 'OHstar'},
                               

                                                        'name': 'star'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                                                        'name': 'H2gas'}],
                                   'sites': '{"OH": "fcc"}',
                                   'surfaceComposition': 'Ag-fcc'}},
                         {'node': {'Equation': '0.5O2(g) + * -> O*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag16',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '111',
                                   'id': 'UmVhY3Rpb246NDU2OTkw',
                                   'products': '{"Ostar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star":

                                                        'name': 'Ostar'}],
                                   'sites': '{"O": "longbridgeA"}',
                                   'surfaceComposition': 'NdAg-bcc'}},
                         {'node': {'Equation': '0.5H2(g) + * -> H*',
                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag8Nd8',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '101+3%',
                                   'id': 'UmVhY3Rpb246NDU4MTM1',
                                   'products': '{"Hstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": 0.5}',
                                   'reactionEnergy': -0.6434957580531773,
                                   'reactionSystems': [{'aseId': '

                                                '"H2Ogas": 1}',
                                   'reactionEnergy': -2.274081472573016,
                                   'reactionSystems': [{'aseId': '579488dd8d587a4ea0742b4e8ddc9f0c',
                                                        'name': 'star'},
                                                       {'aseId': '6002b487b32db73a5df71da89febafce',
                                                        'name': 'OHstar'},
                                                       {'aseId': 'adc8705ec1f8c02ba106e36811191cb6',
                                                        'name': 'H2Ogas'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                                                        'name': 'H2gas'}],
                                   'sites': '{"OH": "threefoldAAB"}',
                                   'surfaceComposition': 'ScAg-bcc'}},
                    

                                   'activationEnergy': None,
                                   'chemicalComposition': 'Ag8Sc8',
                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '101',
                                   'id': 'UmVhY3Rpb246NDU4NzM2',
                                   'products': '{"OHstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": -0.5, '
                                                '"H2Ogas": 1}',
                                   'reactionEnergy': -2.2676338712226425,
                                   'reactionSystems': [{'aseId': '818b08d4364962114bdd8ce1d1da9cd9',
                                                        'name': 'OHstar'},
                                                       {'aseId': 'adc8705ec1f8c02ba106e36811191cb6',
         

                                   'facet': '101',
                                   'id': 'UmVhY3Rpb246NDU4NzI2',
                                   'products': '{"Hstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": 0.5}',
                                   'reactionEnergy': -0.024000168457860127,
                                   'reactionSystems': [{'aseId': 'cffe3f2f901e19632bb063b90918a463',
                                                        'name': 'Hstar'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                                                        'name': 'H2gas'},
                                                       {'aseId': 'ec19011f204829d2eb5edb12bf822711',
                                                        'name': 'star'}],
                                   'sites': '{"H": "longbridgeB"}',
           

                                   'dftCode': 'Quantum Espresso',
                                   'dftFunctional': 'PBE',
                                   'facet': '0001+5%',
                                   'id': 'UmVhY3Rpb246NDU3MTM2',
                                   'products': '{"Hstar": 1}',
                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": 0.5}',
                                   'reactionEnergy': -0.091076365742083,
                                   'reactionSystems': [{'aseId': '64adf7d1a2b5439a5282db811a2b06e0',
                                                        'name': 'Hstar'},
                                                       {'aseId': '974acb932baf4c397b7ebcebd77c792c',
                                                        'name': 'star'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                            

                                   'pubId': 'AlonsoStrain2023',
                                   'reactants': '{"star": 1, "H2gas": 0.5}',
                                   'reactionEnergy': -0.31409665013870836,
                                   'reactionSystems': [{'aseId': 'a3e185d7ccb48ccfaf989f78e3b07739',
                                                        'name': 'star'},
                                                       {'aseId': 'c97fff8683af0c6a66e191255f29e8f7',
                                                        'name': 'Hstar'},
                                                       {'aseId': 'e1f32563679f163c00aa78f304a75f22',
                                                        'name': 'H2gas'}],
                                   'sites': '{"H": "hcpAAB"}',
                                   'surfaceComposition': 'Al3Sm-hcp'}},
                         {'node': {'Equation': '0.5O2(g) + * -> O*',
                                   'activationEnergy': 
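For machine learning the nested payload printed above is easier to handle as a flat table. The following is a minimal sketch that flattens each `node` into one row. It assumes that a facet label such as `111-5%` combines the Miller index with a strain percentage, which is my reading of the labels and is not stated in the API output. The function names and the output keys are illustrative.

```python
import json
import re


def split_facet(facet: str):
    """Split a facet label such as '111-5%' into ('111', -5.0)."""
    match = re.fullmatch(r"(\d+)([+-]\d+(?:\.\d+)?)%", facet)
    if match is None:
        return facet, 0.0  # no strain suffix, e.g. '111'
    return match.group(1), float(match.group(2))


def flatten_reactions(payload: dict) -> list:
    """Turn the nested GraphQL payload into a list of flat row dictionaries."""
    rows = []
    for edge in payload["reactions"]["edges"]:
        node = edge["node"]
        facet, strain = split_facet(node["facet"])
        sites = json.loads(node["sites"])  # 'sites' arrives as a JSON string
        adsorbate, site = next(iter(sites.items()))
        rows.append({
            "material": node["surfaceComposition"],
            "facet": facet,
            "strain_percent": strain,
            "adsorbate": adsorbate,
            "site": site,
            "reaction_energy": node["reactionEnergy"],
        })
    return rows


# One record copied from the output above.
example = {"reactions": {"edges": [{"node": {
    "facet": "111-5%",
    "sites": '{"OH": "ontopB"}',
    "surfaceComposition": "Ag3In-fcc",
    "reactionEnergy": 0.5379286501192837,
}}]}}

print(flatten_reactions(example))
```

With the real response, `pd.DataFrame(flatten_reactions(data["data"]))` would give a table with one row per adsorption calculation.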

### Option 3: FETCH CSV

Use curl and jq to write the dataset in CSV format. The CSV contains fewer query fields than the JSON.

In [1]:
%%bash
# This is a shell command, so run it in a bash cell (or a terminal), not as Python.
# brew install jq          # macOS
# sudo apt install jq -y   # Linux
# Windows: download jq from GitHub, https://github.com/jqlang/jq/releases
curl "http://api.catalysis-hub.org/graphql?query=%7Breactions(pubId%3A%22AlonsoStrain2023%22)%20%7B%0A%20%20edges%20%7B%0A%20%20%20%20node%20%7B%0A%20%20%20%20%20%20Equation%0A%20%20%20%20%20%20chemicalComposition%0A%20%20%20%20%20%20facet%0A%20%20%20%20%20%20reactionEnergy%0A%20%20%20%20%20%20activationEnergy%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D%7D" | jq -r '.data.reactions.edges[].node | [.chemicalComposition,.facet,.Equation,.reactionEnergy] | @csv' > reactions.csv

In [4]:
data = pd.read_csv("reactions.csv")
data

Unnamed: 0,Ag16,111,0.5H2(g) + * -> H*,0.16557052232747083
0,Ag16,111,0.5H2(g) + * -> H*,0.182189
1,Ag16,111,0.5H2(g) + * -> H*,0.731601
2,Ag16,111,0.5O2(g) + * -> O*,-1.006922
3,Ag16,111,0.5O2(g) + * -> O*,-0.896975
4,Ag16,111,0.5O2(g) + * -> O*,0.496794
...,...,...,...,...
2622,Zn8Zr8,101+2%,0.5O2(g) + * -> O*,-5.615362
2623,Zn8Zr8,101-3%,0.5H2(g) + * -> H*,-0.853984
2624,Zn8Zr8,101-3%,0.5O2(g) + * -> O*,-5.371142
2625,Zn8Zr8,101-5%,0.5H2(g) + * -> H*,-0.926421
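In the table above the header position holds `Ag16`, `111`, an equation and an energy. The CSV appears to have no header line, so the first record is consumed as column names and one data row is lost. The sketch below reads header-less rows with explicit column names using the standard-library `csv` module. The names in `COLUMNS` are my own labels that mirror the field order of the jq filter, and the two sample lines stand in for `reactions.csv`.

```python
import csv
import io

# Column labels follow the field order of the jq filter; the names are illustrative.
COLUMNS = ["chemicalComposition", "facet", "Equation", "reactionEnergy"]

# Two header-less lines in the quoting style of jq's @csv (values from the table above).
sample = io.StringIO(
    '"Ag16","111","0.5H2(g) + * -> H*",0.16557052232747083\n'
    '"Ag16","111","0.5H2(g) + * -> H*",0.182189\n'
)


def read_reactions(handle):
    """Read header-less reaction rows and convert the energy column to float."""
    rows = []
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        row["reactionEnergy"] = float(row["reactionEnergy"])
        rows.append(row)
    return rows


rows = read_reactions(sample)
print(len(rows), rows[0])
```

The pandas equivalent is `pd.read_csv("reactions.csv", header=None, names=COLUMNS)`, which keeps the first record as data.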


### Option 4: FETCH ASE ATOMS

The ASE atoms option yields the structures as extxyz files, which can be used directly to reproduce the DFT calculations.<br>

In [4]:
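# Modules used below that the first cell did not import
import copy
import io

import ase.io
import ase.calculators.singlepoint

GRAPHQL = 'http://api.catalysis-hub.org/graphql'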

def fetch(query):
    return requests.get(
        GRAPHQL, {'query': query}
    ).json()['data']
def reactions_from_dataset(pub_id, page_size=10):
    reactions = []
    has_next_page = True
    start_cursor = ''
    page = 0
    while has_next_page:
        data = fetch("""{{
      reactions(pubId: "{pub_id}", first: {page_size}, after: "{start_cursor}") {{
        totalCount
        pageInfo {{
          hasNextPage
          hasPreviousPage
          startCursor
          endCursor 
        }}  
        edges {{
          node {{
            Equation
            reactants
            products
            reactionEnergy
            reactionSystems {{
              name
              systems {{
                energy
                InputFile(format: "json")
              }}
            }}  
          }}  
        }}  
      }}    
    }}""".format(start_cursor=start_cursor,
                 page_size=page_size,
                 pub_id=pub_id,
                ))
        has_next_page = data['reactions']['pageInfo']['hasNextPage']
        start_cursor = data['reactions']['pageInfo']['endCursor']
        page += 1
        print(has_next_page, start_cursor, page_size * page, data['reactions']['totalCount'])
        reactions.extend(map(lambda x: x['node'], data['reactions']['edges']))

    return reactions

raw_reactions = reactions_from_dataset("AlonsoStrain2023")

def aseify_reactions(reactions):
    for i, reaction in enumerate(reactions):
        for j, _ in enumerate(reactions[i]['reactionSystems']):
            with io.StringIO() as tmp_file:
                system = reactions[i]['reactionSystems'][j].pop('systems')
                tmp_file.write(system.pop('InputFile'))
                tmp_file.seek(0)
                atoms = ase.io.read(tmp_file, format='json')
            calculator = ase.calculators.singlepoint.SinglePointCalculator(
                atoms,
                energy=system.pop('energy')
            )
            atoms.calc = calculator  # replaces the deprecated atoms.set_calculator()
            #print(atoms.get_potential_energy())
            reactions[i]['reactionSystems'][j]['atoms'] = atoms
        # flatten list further into {name: atoms, ...} dictionary
        reactions[i]['reactionSystems'] = {x['name']: x['atoms']
                                          for x in reactions[i]['reactionSystems']}
        
reactions = copy.deepcopy(raw_reactions)
aseify_reactions(reactions)

reactions[5]

from ase.io import write

# save as .extxyz
all_atoms = []

for i, r in enumerate(reactions):
    for name, atoms in r["reactionSystems"].items():
        # add atom info
        atoms.info["reaction_index"] = i
        atoms.info["system_name"] = name
        atoms.info["reaction_equation"] = r["Equation"]
        atoms.info["reaction_energy"] = r["reactionEnergy"]
        all_atoms.append(atoms)

write("AlonsoStrain2023_all.extxyz", all_atoms)

True YXJyYXljb25uZWN0aW9uOjk= 10 2628
True YXJyYXljb25uZWN0aW9uOjE5 20 2628
True YXJyYXljb25uZWN0aW9uOjI5 30 2628
True YXJyYXljb25uZWN0aW9uOjM5 40 2628
True YXJyYXljb25uZWN0aW9uOjQ5 50 2628
True YXJyYXljb25uZWN0aW9uOjU5 60 2628
True YXJyYXljb25uZWN0aW9uOjY5 70 2628
True YXJyYXljb25uZWN0aW9uOjc5 80 2628
True YXJyYXljb25uZWN0aW9uOjg5 90 2628
True YXJyYXljb25uZWN0aW9uOjk5 100 2628
True YXJyYXljb25uZWN0aW9uOjEwOQ== 110 2628
True YXJyYXljb25uZWN0aW9uOjExOQ== 120 2628
True YXJyYXljb25uZWN0aW9uOjEyOQ== 130 2628
True YXJyYXljb25uZWN0aW9uOjEzOQ== 140 2628
True YXJyYXljb25uZWN0aW9uOjE0OQ== 150 2628
True YXJyYXljb25uZWN0aW9uOjE1OQ== 160 2628
True YXJyYXljb25uZWN0aW9uOjE2OQ== 170 2628
True YXJyYXljb25uZWN0aW9uOjE3OQ== 180 2628
True YXJyYXljb25uZWN0aW9uOjE4OQ== 190 2628
True YXJyYXljb25uZWN0aW9uOjE5OQ== 200 2628
True YXJyYXljb25uZWN0aW9uOjIwOQ== 210 2628
True YXJyYXljb25uZWN0aW9uOjIxOQ== 220 2628
True YXJyYXljb25uZWN0aW9uOjIyOQ== 230 2628
True YXJyYXljb25uZWN0aW9uOjIzOQ== 240 2628
True YXJyYXljb25u

KeyboardInterrupt: 
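The run above was interrupted before it finished: with `page_size=10`, the 2628 records need 263 requests, each of which carries full structure files. Below is a sketch of a page loop with a larger page size and an optional record cap, so that the rest of the workflow can be tried on a subset first. `fetch_page` is a hypothetical callable standing in for the GraphQL request, and `fake_fetch_page` only simulates the server with plain integer offsets as cursors. Whether the server accepts a given page size for this query has not been checked here.

```python
import math


def paginate(fetch_page, page_size=100, max_records=None):
    """Collect records page by page until the server reports no next page.

    fetch_page(first, after) must return (records, end_cursor, has_next_page).
    """
    records, cursor, has_next = [], "", True
    while has_next:
        page, cursor, has_next = fetch_page(page_size, cursor)
        records.extend(page)
        if max_records is not None and len(records) >= max_records:
            return records[:max_records]  # stop early for a quick test run
    return records


def fake_fetch_page(first, after, total=2628):
    """Stand-in for the API: cursors are plain offsets here, not base64 strings."""
    start = int(after) + 1 if after else 0
    stop = min(start + first, total)
    return list(range(start, stop)), str(stop - 1), stop < total


# Requests needed for 2628 records at page sizes 10 and 100.
print(math.ceil(2628 / 10), math.ceil(2628 / 100))
# Full run against the stand-in, then a capped run.
print(len(paginate(fake_fetch_page)))
print(len(paginate(fake_fetch_page, max_records=50)))
```

In the notebook, the body of `fetch_page` would wrap the existing `fetch(...)` call and return the `edges`, `endCursor` and `hasNextPage` values from its result.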

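### Downloading the dataset via the article links

Follow the article link and look in the paper for its GitHub project and Zenodo record. From there you can download either the raw dataset or the dataset after export and cleaning.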
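The cleaned file loaded in the next cell, `Dataset_Eads_H.json`, stores a `columns` list and a `data` list of rows. The sketch below turns that layout into a feature matrix and a target vector. Using `Eads` as the target and leaving out the label and categorical columns are assumptions made for illustration, not the authors' pipeline. The two rows are copied from the file.

```python
# Two rows copied from Dataset_Eads_H.json (same 'columns' / 'data' layout).
dataset = {
    "columns": ["Biaxial Strain", "PSI", "outer electrons A", "outer electrons B",
                "Unit cell volume", "WEN", "WIE", "WAR", "GCN", "Eads",
                "label", "Binding site", "Material", "adsorbate"],
    "data": [
        [0, 62.6943005181, 11, 11, 17.2852312008, 1.93, 7.576234, 160.0, 5.25,
         0.1655704075, 1, "fcc", "Ag", "H"],
        [0, 62.6943005181, 11, 11, 17.2852312008, 1.93, 7.576234, 160.0, 3.25,
         0.1821883857, 1, "hcp", "Ag", "H"],
    ],
}

TARGET = "Eads"
# Assumption: identifiers and categorical labels are not used as numeric features.
NON_FEATURES = {TARGET, "label", "Binding site", "Material", "adsorbate"}


def to_features_and_target(dataset):
    """Split the table into numeric feature rows X and target values y."""
    columns = dataset["columns"]
    feature_names = [c for c in columns if c not in NON_FEATURES]
    records = [dict(zip(columns, row)) for row in dataset["data"]]
    X = [[float(r[name]) for name in feature_names] for r in records]
    y = [float(r[TARGET]) for r in records]
    return feature_names, X, y


feature_names, X, y = to_features_and_target(dataset)
print(feature_names)
print(X[0], y[0])
```

With pandas, `pd.DataFrame(dataset["data"], columns=dataset["columns"])` builds the same table in one call.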

In [29]:
with open("Dataset_Eads_H.json", "r") as f:
    data = json.load(f)
from pprint import pprint
pprint(data) 

{'columns': ['Biaxial Strain',
             'PSI',
             'outer electrons A',
             'outer electrons B',
             'Unit cell volume',
             'WEN',
             'WIE',
             'WAR',
             'GCN',
             'Eads',
             'label',
             'Binding site',
             'Material',
             'adsorbate'],
 'data': [[0,
           62.6943005181,
           11,
           11,
           17.2852312008,
           1.93,
           7.576234,
           160.0,
           5.25,
           0.1655704075,
           1,
           'fcc',
           'Ag',
           'H'],
          [0,
           62.6943005181,
           11,
           11,
           17.2852312008,
           1.93,
           7.576234,
           160.0,
           3.25,
           0.1821883857,
           1,
           'hcp',
           'Ag',
           'H'],
          [0,
           62.6943005181,
           11,
           11,
           17.2852312008,
           1.93,
           

           1.5,
           7.54957,
           145.0,
           1.5,
           -0.7317246488,
           18,
           'shortbridge',
           'Ta',
           'H'],
          [0,
           23.3333333333,
           7,
           7,
           28.503381002,
           2.1,
           7.119381,
           135.0,
           5.25,
           -0.7233224962,
           19,
           'fcc',
           'Tc',
           'H'],
          [0,
           23.3333333333,
           7,
           7,
           28.503381002,
           2.1,
           7.119381,
           135.0,
           3.25,
           -0.6987408276,
           19,
           'hcp',
           'Tc',
           'H'],
          [0,
           23.3333333333,
           7,
           7,
           28.503381002,
           2.1,
           7.119381,
           135.0,
           0.75,
           0.0367532245,
           19,
           'ontop',
           'Tc',
           'H'],
          [0,
           15.3374233129,
           5,


           1.71,
           7.137955,
           157.5,
           0.75,
           -0.0622821155,
           37,
           'ontopB',
           'DyPd',
           'H'],
          [0,
           85.8738394261,
           12,
           10,
           42.3119099088,
           1.71,
           7.137955,
           157.5,
           3.25,
           -0.4591694465,
           37,
           'threefoldAAB',
           'DyPd',
           'H'],
          [0,
           79.1015811997,
           12,
           9,
           39.4000071072,
           1.75,
           6.698975,
           155.0,
           2.5,
           0.0327962573,
           38,
           'longbridgeA',
           'DyRh',
           'H'],
          [0,
           53.0112040341,
           12,
           9,
           39.4000071072,
           1.75,
           6.698975,
           155.0,
           2.5,
           -0.7219942113,
           38,
           'longbridgeB',
           'DyRh',
           'H'],
          [0,
   

           7.0901025,
           137.5,
           0.75,
           0.392510585,
           53,
           'ontopB',
           'MnV',
           'H'],
          [0,
           19.5189475381,
           7,
           5,
           23.2899782706,
           1.59,
           7.0901025,
           137.5,
           3.25,
           -0.9506735191,
           53,
           'threefoldABB',
           'MnV',
           'H'],
          [0,
           39.6893926735,
           6,
           11,
           52.0569037613,
           1.535,
           6.550617,
           172.5,
           2.5,
           -0.6229606699,
           54,
           'longbridgeA',
           'NdAg',
           'H'],
          [0,
           49.882849875,
           6,
           11,
           52.0569037613,
           1.535,
           6.550617,
           172.5,
           2.5,
           -0.0792458685,
           54,
           'longbridgeB',
           'NdAg',
           'H'],
          [0,
           56.84350640

           8,
           32.7229697628,
           1.78,
           6.960995,
           145.0,
           0.75,
           0.8826079004,
           65,
           'ontopA',
           'ScRu',
           'H'],
          [0,
           8.1765183214,
           3,
           8,
           32.7229697628,
           1.78,
           6.960995,
           145.0,
           0.75,
           -0.387108758,
           65,
           'ontopB',
           'ScRu',
           'H'],
          [0,
           10.8406377365,
           3,
           8,
           32.7229697628,
           1.78,
           6.960995,
           145.0,
           3.25,
           -0.9199787115,
           65,
           'threefoldAAB',
           'ScRu',
           'H'],
          [0,
           15.6349051173,
           3,
           12,
           37.0598869567,
           1.505,
           7.9778445,
           147.5,
           2.5,
           -0.8230649862,
           66,
           'longbridgeA',
           'ScZn',
 

           35.9067343357,
           4,
           9,
           33.0077041524,
           1.605,
           7.257455,
           145.0,
           0.75,
           0.3928622879,
           79,
           'ontopA',
           'ZrCo',
           'H'],
          [0,
           14.4350935516,
           4,
           9,
           33.0077041524,
           1.605,
           7.257455,
           145.0,
           0.75,
           -0.4786900493,
           79,
           'ontopB',
           'ZrCo',
           'H'],
          [0,
           18.4057965694,
           4,
           9,
           33.0077041524,
           1.605,
           7.257455,
           145.0,
           3.25,
           -0.7936600397,
           79,
           'threefoldAAB-d',
           'ZrCo',
           'H'],
          [0,
           16.1472285425,
           4,
           8,
           35.1742403747,
           1.765,
           7.536065,
           142.5,
           3.25,
           -1.1257820383,
           80,


           7.58841225,
           136.25,
           3.25,
           -0.6649186663,
           100,
           'hcpAAA',
           'Ni3Mn',
           'H'],
          [0,
           44.2519638686,
           10,
           7,
           44.4923380106,
           1.82,
           7.58841225,
           136.25,
           3.25,
           -0.5654811817,
           100,
           'hcpAAB',
           'Ni3Mn',
           'H'],
          [0,
           52.3560209424,
           10,
           4,
           41.5562047552,
           1.9075,
           7.7678285,
           128.75,
           5.25,
           -0.4875676771,
           101,
           'fccAAA',
           'Ni3Si',
           'H'],
          [0,
           52.3560209424,
           10,
           4,
           41.5562047552,
           1.9075,
           7.7678285,
           128.75,
           3.25,
           -0.9240854814,
           101,
           'hcpAAA',
           'Ni3Si',
           'H'],
          [0,
           4

           112,
           'hcpAAA',
           'Zr3Al',
           'H'],
          [0,
           11.1425135493,
           4,
           3,
           83.8403674789,
           1.4,
           6.4718671,
           147.5,
           0.75,
           0.4498648315,
           112,
           'ontopB',
           'Zr3Al',
           'H'],
          [0,
           12.030075188,
           4,
           3,
           89.130794708,
           1.4425,
           6.4220138,
           155.0,
           5.25,
           -1.2049234519,
           113,
           'fccAAA',
           'Zr3In',
           'H'],
          [0,
           12.030075188,
           4,
           3,
           89.130794708,
           1.4425,
           6.4220138,
           155.0,
           3.25,
           -1.0663097784,
           113,
           'hcpAAA',
           'Zr3In',
           'H'],
          [0,
           11.0312253912,
           4,
           3,
           89.130794708,
           1.4425,
           6

           5.25,
           -0.4552508324,
           125,
           'fccAAA',
           'Rh3Mo',
           'H'],
          [0,
           27.6046999626,
           9,
           6,
           112.4639335344,
           2.25,
           7.3672825,
           137.5,
           5.25,
           -0.7093127006,
           125,
           'fccAAB',
           'Rh3Mo',
           'H'],
          [0,
           35.5263157895,
           9,
           6,
           112.4639335344,
           2.25,
           7.3672825,
           137.5,
           3.25,
           -0.6998013037,
           125,
           'hcpAAA',
           'Rh3Mo',
           'H'],
          [0,
           27.6046999626,
           9,
           6,
           112.4639335344,
           2.25,
           7.3672825,
           137.5,
           3.25,
           -0.5539186053,
           125,
           'hcpAAB',
           'Rh3Mo',
           'H'],
          [0,
           32.9366958707,
           9,
           6,
        

           'Os',
           'H'],
          [-5,
           29.0909090909,
           8,
           8,
           28.2548005385,
           2.2,
           8.43823,
           130.0,
           5.25,
           -0.4381434006,
           12,
           'fcc',
           'Os',
           'H'],
          [2,
           29.0909090909,
           8,
           8,
           28.2548005385,
           2.2,
           8.43823,
           130.0,
           5.25,
           -0.6223947047,
           12,
           'fcc',
           'Os',
           'H'],
          [5,
           29.0909090909,
           8,
           8,
           28.2548005385,
           2.2,
           8.43823,
           130.0,
           5.25,
           -0.6557708541,
           12,
           'fcc',
           'Os',
           'H'],
          [8,
           29.0909090909,
           8,
           8,
           28.2548005385,
           2.2,
           8.43823,
           130.0,
           5.25,
           -0.7510207469,


          [2,
           87.2727272727,
           12,
           12,
           28.8451089139,
           1.65,
           9.394199,
           135.0,
           3.25,
           0.5295240571,
           23,
           'hcp',
           'Zn',
           'H'],
          [5,
           87.2727272727,
           12,
           12,
           28.8451089139,
           1.65,
           9.394199,
           135.0,
           3.25,
           0.43107302,
           23,
           'hcp',
           'Zn',
           'H'],
          [8,
           87.2727272727,
           12,
           12,
           28.8451089139,
           1.65,
           9.394199,
           135.0,
           3.25,
           0.3968813677,
           23,
           'hcp',
           'Zn',
           'H'],
          [-1,
           12.030075188,
           4,
           4,
           46.9993188747,
           1.33,
           6.6339,
           155.0,
           5.25,
           -1.0058886349,
           24,
           'f

           -1.1472507112,
           45,
           'threefoldAAB',
           'HfOs',
           'H'],
          [-3,
           17.524049753,
           4,
           9,
           34.2979309163,
           1.79,
           7.141985,
           145.0,
           3.25,
           -0.5426594452,
           46,
           'threefoldAAB',
           'HfRh',
           'H'],
          [-5,
           17.524049753,
           4,
           9,
           34.2979309163,
           1.79,
           7.141985,
           145.0,
           3.25,
           -0.562263576,
           46,
           'threefoldAAB',
           'HfRh',
           'H'],
          [3,
           17.524049753,
           4,
           9,
           34.2979309163,
           1.79,
           7.141985,
           145.0,
           3.25,
           -0.6030328368,
           46,
           'threefoldAAB',
           'HfRh',
           'H'],
          [5,
           17.524049753,
           4,
           9,
           34.2979

           'H'],
          [5,
           14.0760129322,
           3,
           11,
           34.0395711706,
           1.63,
           7.143935,
           147.5,
           3.25,
           -0.7590629326,
           59,
           'threefoldAAB',
           'ScCu',
           'H'],
          [-3,
           11.7261788379,
           3,
           9,
           33.3360718532,
           1.78,
           7.764255,
           147.5,
           3.25,
           -0.9423772962,
           60,
           'threefoldAAB-d',
           'ScIr',
           'H'],
          [-5,
           11.7261788379,
           3,
           9,
           33.3360718532,
           1.78,
           7.764255,
           147.5,
           3.25,
           -0.9015284856,
           60,
           'threefoldAAB-d',
           'ScIr',
           'H'],
          [3,
           11.7261788379,
           3,
           9,
           33.3360718532,
           1.78,
           7.764255,
           147.5,
           3.

          [-3,
           47.9357614613,
           12,
           9,
           26.7369702345,
           1.965,
           8.4265495,
           135.0,
           2.5,
           -0.6862102979,
           78,
           'longbridgeB',
           'ZnRh',
           'H'],
          [-3,
           18.4057965694,
           4,
           9,
           33.0077041524,
           1.605,
           7.257455,
           145.0,
           3.25,
           -0.8241987303,
           79,
           'threefoldAAB-d',
           'ZrCo',
           'H'],
          [2,
           18.4057965694,
           4,
           9,
           33.0077041524,
           1.605,
           7.257455,
           145.0,
           3.25,
           -0.8142295496,
           79,
           'threefoldAAB',
           'ZrCo',
           'H'],
          [-3,
           16.1472285425,
           4,
           8,
           35.1742403747,
           1.765,
           7.536065,
           142.5,
           3.25,
           

           3.25,
           0.0586176473,
           95,
           'hcpAAB',
           'In3Y',
           'H'],
          [5,
           5.7346805558,
           3,
           3,
           98.9730321218,
           1.64,
           5.8940814,
           161.25,
           3.25,
           0.0193908072,
           95,
           'hcpAAB',
           'In3Y',
           'H'],
          [-3,
           24.1494908762,
           9,
           4,
           57.4088929964,
           2.035,
           8.432295,
           136.25,
           5.25,
           -0.6955824529,
           96,
           'fccAAB',
           'Ir3Ti',
           'H'],
          [-5,
           24.1494908762,
           9,
           4,
           57.4088929964,
           2.035,
           8.432295,
           136.25,
           5.25,
           -0.6229110786,
           96,
           'fccAAB',
           'Ir3Ti',
           'H'],
          [3,
           24.1494908762,
           9,
           4,
           57.4

           49.1879859817,
           12,
           5,
           59.7757128663,
           1.6375,
           8.73536175,
           137.5,
           3.25,
           -0.6724158722,
           109,
           'hcpAAB',
           'Zn3Nb',
           'H'],
          [5,
           49.1879859817,
           12,
           5,
           59.7757128663,
           1.6375,
           8.73536175,
           137.5,
           3.25,
           -0.6565047012,
           109,
           'hcpAAB',
           'Zn3Nb',
           'H'],
          [-3,
           42.9324282243,
           12,
           4,
           58.4487109632,
           1.6225,
           8.75267925,
           136.25,
           5.25,
           -0.4838615171,
           110,
           'fccAAB',
           'Zn3Ti',
           'H'],
          [-5,
           42.9324282243,
           12,
           4,
           58.4487109632,
           1.6225,
           8.75267925,
           136.25,
           5.25,
           -0.48389094

           384,
           385,
           386,
           387,
           388,
           389,
           390,
           391,
           392,
           393,
           394,
           395,
           396,
           397,
           398,
           399,
           400,
           401,
           402,
           403,
           404,
           405,
           406,
           407,
           408,
           409,
           410,
           411,
           412,
           413,
           414,
           415,
           416,
           417,
           418,
           419,
           420,
           421,
           422,
           423,
           424,
           425,
           426,
           427,
           428,
           429,
           430,
           431,
           432,
           433,
           434,
           435,
           436,
           437,
           438,
           439,
           440,
           441,
           442,
           443,
           444,
           445,
        

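The downloaded JSON dataset was extracted from the Quantum ESPRESSO (QE) DFT results by Get_Dataset.py, so its contents differ from the AlonsoStrain2023_full_dataset.json downloaded from Catalysis-Hub.<Br>
There are two ways to continue: reproduce the article by using its own descriptor-extraction method, then derive descriptors and predict adsorption energies with machine learning;<Br>
or extract other descriptors and predict adsorption energies from the contents of AlonsoStrain2023_full_dataset.json. [This part only introduces some common descriptors.]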

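## Machine learning on the dataset

Taking the article "Application of machine learning to discover new intermetallic catalysts for the hydrogen evolution and the oxygen reduction reactions" as an example, this section shows how machine learning predicts adsorption energies for the HER and ORR from descriptors. It covers reproducing the article, applying different scikit-learn regression methods, and introducing other descriptors.

### Data cleaning

The article uses the script Get_Dataset.py to extract and compute descriptors from the Quantum ESPRESSO output files. The GCN descriptor is defined from the geometric facet, the facet index, and the adsorption site; the PSI descriptor is defined from the geometric structure, the number of outer electrons, and the electronegativity.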

In [None]:
python Get_Dataset.py API.txt  # API.txt comes from the Materials Project; it is used to look up the most stable unit-cell volume and point group

This produces the Full_dataset_H-O-OH.json dataset, which after data cleaning gives the Dataset_Eads_O-OH.json dataset.

In [68]:
with open("Full_dataset_H-O-OH.json", "r") as f:
    data = json.load(f)
print("keys:", list(data.keys()))

adsorbates = data.get("adsorbate", [])
unique_adsorbates = sorted(set(adsorbates))
print(f"Number of distinct adsorbates: {len(unique_adsorbates)}")
print(unique_adsorbates)

keys: ['is_valid', 'Eads', 'frac_coord', 'cart_coord', 'Z', 'chem_symb', 'cell', 'init_frac_coord', 'magnetic_moment', 'initial_ad_pos', 'final_ad_pos', 'WEN', 'WAR', 'WIE', 'out_eA', 'out_eB', 'PSI', 'GCN', 'magnetic_moment_A', 'magnetic_moment_B', 'magnetic_moment_ads', 'input', 'output', 'dir', 'system', 'adsorbate', 'strain', 'site', 'Type', 'Volume', 'Point_group', 'special', 'ID', 'geom']
Number of distinct adsorbates: 3
['H', 'O', 'OH']


In [66]:
with open("Full_dataset_H-O-OH.json", "r") as f:
    full_data = json.load(f)

columns = ['Biaxial Strain', 'PSI', 'outer electrons A', 'outer electrons B', 
           'Unit cell volume', 'WEN', 'WIE', 'WAR', 'GCN', 'Eads', 
           'label', 'Binding site', 'Material', 'adsorbate']

key_map = {
    'Biaxial Strain': 'strain',
    'PSI': 'PSI',
    'outer electrons A': 'out_eA',
    'outer electrons B': 'out_eB',
    'Unit cell volume': 'Volume',
    'WEN': 'WEN',
    'WIE': 'WIE',
    'WAR': 'WAR',
    'GCN': 'GCN',
    'Eads': 'Eads',
    'label': 'Type',
    'Binding site': 'site',
    'Material': 'system',
    'adsorbate': 'adsorbate'
}

num_rows = len(full_data['Eads'])

data = []
for i in range(num_rows):
    row = []
    for col in columns:
        key = key_map[col]
        value = full_data.get(key, [None]*num_rows)[i] 
        row.append(value)
    data.append(row)

table_json = {
    "columns": columns,
    "index": list(range(num_rows)),
    "data": data
}

with open("Full_dataset_H-O-OH_table.json", "w") as f:
    json.dump(table_json, f, indent=2)

In [69]:
import json

with open("Full_dataset_H-O-OH_table.json", "r") as f:
    data1 = json.load(f)

print("keys:", list(data1.keys()))

columns = data1['columns']
print("Column names:", columns)
rows = data1['data']
num_rows = len(rows)
print(f"Total number of rows: {num_rows}")

adsorbate_idx = columns.index("adsorbate")
adsorbates = [row[adsorbate_idx] for row in rows]
all_adsorbates = sorted(set(adsorbates))
print("Adsorbates:", all_adsorbates)

keys: ['columns', 'index', 'data']
Column names: ['Biaxial Strain', 'PSI', 'outer electrons A', 'outer electrons B', 'Unit cell volume', 'WEN', 'WIE', 'WAR', 'GCN', 'Eads', 'label', 'Binding site', 'Material', 'adsorbate']
Total number of rows: 2623
Adsorbates: ['H', 'O', 'OH']


The table is split by adsorbate into two datasets: the H adsorption-energy dataset and the O/OH adsorption-energy dataset.

In [70]:
import json

with open("Full_dataset_H-O-OH_table.json", "r") as f:
    table_data = json.load(f)

columns = table_data["columns"]
rows = table_data["data"]

adsorbate_idx = columns.index("adsorbate")

H_dataset = {"columns": columns, "index": [], "data": []}
O_OH_dataset = {"columns": columns, "index": [], "data": []}

for i, row in enumerate(rows):
    ads = row[adsorbate_idx]
    if ads == "H":
        H_dataset["data"].append(row)
        H_dataset["index"].append(i)
    elif ads in ["O", "OH"]:
        O_OH_dataset["data"].append(row)
        O_OH_dataset["index"].append(i)

with open("Full_dataset_H_table.json", "w") as f:
    json.dump(H_dataset, f, indent=2)
with open("Full_dataset_O_OH_table.json", "w") as f:
    json.dump(O_OH_dataset, f, indent=2)

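print(f"Rows in the H adsorption-energy dataset: {len(H_dataset['data'])}")
print(f"Rows in the O/OH adsorption-energy dataset: {len(O_OH_dataset['data'])}")


Rows in the H adsorption-energy dataset: 924
Rows in the O/OH adsorption-energy dataset: 1699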


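Next, compare the cleaned dataset published with the original article (Dataset_Eads_O-OH.json) with the table built above (Full_dataset_O_OH_table.json).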

In [72]:
with open("Dataset_Eads_O-OH.json", "r") as f:
    data4 = json.load(f)
columns = data4['columns']
rows = data4['data']
for i, row in enumerate(rows[:2]):
    print(f"Row {i+1}:", dict(zip(columns, row)))

with open("Full_dataset_O_OH_table.json", "r") as f:
    data5 = json.load(f)
columns = data5['columns']
rows = data5['data']
for i, row in enumerate(rows[:2]):
    print(f"Row {i+1}:", dict(zip(columns, row)))

Row 1: {'Biaxial Strain': 0, 'PSI': 62.6943005181, 'outer electrons A': 11, 'outer electrons B': 11, 'Unit cell volume': 17.2852312008, 'WEN': 1.93, 'WIE': 7.576234, 'WAR': 160.0, 'GCN': 5.25, 'Eads': 0.7180481436, 'label': 1, 'Binding site': 'fcc', 'Material': 'Ag', 'adsorbate': 'OH'}
Row 2: {'Biaxial Strain': 0, 'PSI': 62.6943005181, 'outer electrons A': 11, 'outer electrons B': 11, 'Unit cell volume': 17.2852312008, 'WEN': 1.93, 'WIE': 7.576234, 'WAR': 160.0, 'GCN': 3.25, 'Eads': 0.7461350026, 'label': 1, 'Binding site': 'hcp', 'Material': 'Ag', 'adsorbate': 'OH'}
Row 1: {'Biaxial Strain': '0', 'PSI': 62.69430051813472, 'outer electrons A': 11.0, 'outer electrons B': 11.0, 'Unit cell volume': 69.45695019121693, 'WEN': 1.93, 'WIE': 7.576234, 'WAR': 160.0, 'GCN': 5.25, 'Eads': 0.7180481435708284, 'label': 'Pure', 'Binding site': 'fcc', 'Material': 'Ag', 'adsorbate': 'OH'}
Row 2: {'Biaxial Strain': '0', 'PSI': 62.69430051813472, 'outer electrons A': 11.0, 'outer electrons B': 11.0, 'Unit cell 

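The comparison shows that, during cleaning of the JSON data, the article's authors rescaled Unit cell volume: for Ag it is 17.2852312008 in the cleaned file and 69.45695019121693 in the full table. Unit cell volume comes from the three-dimensional cell volume of the material as returned by the Materials Project API. The two files also encode label differently (1 versus 'Pure') and store Biaxial Strain with different types (0 versus '0'). When the dataset is converted to xlsx, every parameter except Eads is normalized.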
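The column-by-column comparison described above can also be done programmatically rather than by eye. The following is a minimal sketch, not part of the original workflow: the helper `differing_columns` is a hypothetical name, and the two rows are a subset of the Ag values printed above.

```python
import math

def differing_columns(columns, row_a, row_b, rel_tol=1e-6):
    """Return the names of the columns whose values differ between two rows."""
    differing = []
    for name, a, b in zip(columns, row_a, row_b):
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, b)
        )
        if both_numeric:
            # Tolerate rounding differences such as 62.6943005181 vs 62.69430051813472
            same = math.isclose(a, b, rel_tol=rel_tol)
        else:
            # Strings, or mixed types such as 0 vs '0', are compared exactly
            same = a == b
        if not same:
            differing.append(name)
    return differing

# Subset of the first Ag row from each file (values copied from the output above)
columns = ['Biaxial Strain', 'PSI', 'Unit cell volume', 'Eads', 'label', 'Binding site']
cleaned_row = [0, 62.6943005181, 17.2852312008, 0.7180481436, 1, 'fcc']
full_row = ['0', 62.69430051813472, 69.45695019121693, 0.7180481435708284, 'Pure', 'fcc']

changed = differing_columns(columns, cleaned_row, full_row)
print(changed)
```

Using a relative tolerance for numeric columns keeps truncated decimals from being reported as differences, so only columns that were actually rescaled or re-encoded are flagged.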

In [61]:
with open("Dataset_Eads_O-OH.json", "r") as f:
    data2 = json.load(f)

print("keys:", list(data2.keys()))
columns = data2['columns']
print("Column names:", columns)
rows = data2['data']
num_rows = len(rows)
print(f"Total number of rows: {num_rows}")

adsorbate_idx = columns.index("adsorbate")
adsorbates = [row[adsorbate_idx] for row in rows]
all_adsorbates = sorted(set(adsorbates))
print("Adsorbates:", all_adsorbates)

keys: ['columns', 'index', 'data']
Column names: ['Biaxial Strain', 'PSI', 'outer electrons A', 'outer electrons B', 'Unit cell volume', 'WEN', 'WIE', 'WAR', 'GCN', 'Eads', 'label', 'Binding site', 'Material', 'adsorbate']
Total number of rows: 1699
Adsorbates: ['O', 'OH']


### Dataset normalization

In [78]:
import json
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

with open("Dataset_Eads_O-OH.json", "r") as f:
    data = json.load(f)

columns = data["columns"]
rows = data["data"]

exclude_cols = ["Eads", "Binding site", "Material", "adsorbate", "label"]
num_cols_idx = [i for i, col in enumerate(columns) if col not in exclude_cols]

X = np.array([[float(row[i]) for i in num_cols_idx] for row in rows])

nms = MinMaxScaler()
X_scaled = nms.fit_transform(X)

for row_idx, row in enumerate(rows):
    for j, col_idx in enumerate(num_cols_idx):
        row[col_idx] = round(float(X_scaled[row_idx, j]), 6)

df = pd.DataFrame(rows, columns=columns)

df.to_excel("Dataset_Eads_O-OH_normalized.xlsx", index=False)
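Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The sketch below applies this by hand to a list of biaxial strain values. It assumes the strains span -5 to 8, which is what the normalized values in the tables here suggest; `min_max_scale` is an illustrative helper, not a function from the article's code.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Assumed strain values (in %); the true set in the dataset may contain more levels
strains = [-5, -3, 0, 3, 5, 8]
scaled = [round(s, 6) for s in min_max_scale(strains)]
print(dict(zip(strains, scaled)))
```

Under this assumption an unstrained surface (0) maps to 0.384615, the Biaxial Strain value shown for the first Ag rows of the normalized table.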

In [76]:
data = pd.read_excel("Dataset_Eads_O-OH_normalized.xlsx")
data

Unnamed: 0,Biaxial Strain,PSI,outer electrons A,outer electrons B,Unit cell volume,WEN,WIE,WAR,GCN,Eads,label,Binding site,Material,adsorbate
0,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,1.000000,0.718048,1,fcc,Ag,OH
1,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,0.578947,0.746135,1,hcp,Ag,OH
2,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,0.052632,1.571387,1,ontop,Ag,OH
3,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,1.000000,-1.006922,1,fcc,Ag,O
4,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,0.578947,-0.896974,1,hcp,Ag,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1694,0.769231,0.031044,0.1,0.181818,0.507105,0.207627,0.144242,0.595238,0.578947,-6.875316,124,hcpAAA,Sc3In,O
1695,0.153846,0.067550,0.2,0.181818,0.386516,0.322034,0.200785,0.309524,1.000000,-6.286348,125,fccAAA,Ti3In,O
1696,0.000000,0.067550,0.2,0.181818,0.386516,0.322034,0.200785,0.309524,1.000000,-6.076012,125,fccAAA,Ti3In,O
1697,0.615385,0.067550,0.2,0.181818,0.386516,0.322034,0.200785,0.309524,1.000000,-6.653066,125,fccAAA,Ti3In,O


In [15]:
data = pd.read_excel("Dataset_Eads_O-OH_scaled.xlsx")
data

Unnamed: 0,Biaxial Strain,PSI,outer electrons A,outer electrons B,Unit cell volume,WEN,WIE,WAR,GCN,Eads,label,Binding site,Material,adsorbate
0,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,1.000000,0.718048,1,fcc,Ag,OH
1,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,0.578947,0.746135,1,hcp,Ag,OH
2,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,0.052632,1.571387,1,ontop,Ag,OH
3,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,1.000000,-1.006922,1,fcc,Ag,O
4,0.384615,0.573781,0.9,0.909091,0.019658,0.601695,0.485960,0.619048,0.578947,-0.896974,1,hcp,Ag,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1694,0.769231,0.031044,0.1,0.181818,0.507105,0.207627,0.144242,0.595238,0.578947,-6.875316,124,hcpAAA,Sc3In,O
1695,0.153846,0.067550,0.2,0.181818,0.386516,0.322034,0.200785,0.309524,1.000000,-6.286348,125,fccAAA,Ti3In,O
1696,0.000000,0.067550,0.2,0.181818,0.386516,0.322034,0.200785,0.309524,1.000000,-6.076012,125,fccAAA,Ti3In,O
1697,0.615385,0.067550,0.2,0.181818,0.386516,0.322034,0.200785,0.309524,1.000000,-6.653066,125,fccAAA,Ti3In,O


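### Machine learning methods

The article uses only a random forest for regression. Other machine-learning regression methods are added here, and the evaluation metrics and results for each model are shown below. The data were only normalized (min-max scaling), not standardized, and the dataset used is [Dataset_Eads_H_scaled.xlsx].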

In [None]:
python MLP.py Dataset_Eads_H_scaled.xlsx  # the other models are run the same way: KNN/SVM/RF/ENET/BAYES/XGBoost.py

In [2]:
import pandas as pd

data = pd.read_excel("ML_regression.xlsx")
data

Unnamed: 0.1,Unnamed: 0,RMSE,R2,Pearson r,Pearson p,Hyperparameter,MedAE
0,Random forest (article),0.1117,0.8465,0.9201,1.2309e-57,"{'max_depth': 300, 'max_features': np.int64(9)...",0.0418
1,Neural network,0.5983,0.3179,0.8687,1.2249000000000002e-43,,
2,KNN + PCA dimensionality reduction,0.4934,0.061,0.6409,1.9326e-17,,
3,KNN,0.5047,0.3555,0.85,7.203e-40,,
4,Bayesian (hyperparameter-optimized),0.4788,0.1128,0.6038,3.587e-15,Best BayesianRidge params:\n{'alpha_1': 0.0001...,
5,Bayesian,0.4788,0.11277,0.6038,3.5876e-15,,
6,SVM (linear kernel),0.4874,0.0956,0.5918,1.6937e-14,,
7,SVM (RBF kernel),0.3387,0.5597,0.8397,3.773e-38,,
8,XGBOOST,0.2706,0.6923,0.9152,6.3414e-56,Best XGBoost Parameters \n{'colsample_bytree':...,
9,Elastic net (ridge regression),0.4769,0.1187,0.6003,5.668e-15,"Best Enet params: {'alpha': np.float64(0.001),...",


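In terms of the Pearson coefficient, the XGBoost regressor comes closest to the random forest (extremely randomized trees) regressor (0.9152 versus 0.9201). Bayesian hyperparameter optimization has little effect, and in the elastic-net regularization test all nine features are retained.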
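The four metrics in the table measure different things, which is why a model can show a high Pearson r and a low R2 at once (the neural network row has r = 0.8687 but R2 = 0.3179). The sketch below computes RMSE, R2, Pearson r, and MedAE from their textbook definitions on toy numbers. It is an illustration only, not the evaluation code used to produce the table, and `regression_metrics` is a hypothetical helper name.

```python
import math
from statistics import mean, median

def regression_metrics(y_true, y_pred):
    """Compute RMSE, R2, Pearson r and median absolute error from their definitions."""
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    t_mean, p_mean = mean(y_true), mean(y_pred)
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((t - t_mean) ** 2 for t in y_true)   # total sum of squares
    cov = sum((t - t_mean) * (p - p_mean) for t, p in zip(y_true, y_pred))
    var_p = sum((p - p_mean) ** 2 for p in y_pred)
    return {
        "RMSE": math.sqrt(ss_res / len(residuals)),
        "R2": 1.0 - ss_res / ss_tot,
        "Pearson r": cov / math.sqrt(ss_tot * var_p),
        "MedAE": median(abs(r) for r in residuals),
    }

# Toy data: every prediction is offset by a constant +0.5
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the Pearson r stays at 1 because the predictions are perfectly linearly related to the targets, while R2 drops to 0.8 because R2 penalizes the constant offset. Comparing both columns therefore helps separate systematic bias from scatter.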

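### Other descriptors

The example article builds its descriptors from the quantum-chemistry (Quantum ESPRESSO) output files with the custom Get_Dataset.py script. If descriptors are to be added starting only from the downloaded dataset, the following approaches are available.

### Coulomb matrix

A Coulomb matrix is computed from each structure in the extxyz file and added to the dataset as a machine-learning descriptor.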
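For reference, the standard (non-periodic) Coulomb matrix has diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. The sketch below builds it in plain Python for a toy two-atom system. The coordinates are made up, the distance unit is left to the caller, and a library implementation may convert units or handle periodicity differently.

```python
import math

def coulomb_matrix(atomic_numbers, positions):
    """Build the non-periodic Coulomb matrix for a set of atoms.

    Diagonal: 0.5 * Z_i ** 2.4; off-diagonal: Z_i * Z_j / |R_i - R_j|.
    """
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / distance
    return matrix

# Toy O-H pair separated by 1.0 distance unit (hypothetical coordinates)
cm = coulomb_matrix([8, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)])
for row in cm:
    print(row)
```

The matrix is symmetric and its size grows with the number of atoms, so structures with different atom counts yield vectors of different lengths once flattened. This usually needs padding or a dimensionality-reduction step before the vectors are used as features.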

In [38]:
import json
import pandas as pd
import ase
from ase.io import read
from matminer.featurizers.structure.matrix import CoulombMatrix, SineCoulombMatrix
from pymatgen.io.ase import AseAtomsAdaptor


# read JSON
json_file = "AlonsoStrain2023_full_dataset.json"
with open(json_file) as f:
    data = json.load(f)

if 'reactions' not in data or 'edges' not in data['reactions']:
    raise ValueError("JSON does not contain 'reactions.edges' key!")

edges = data['reactions']['edges']
print(f"Total reactions: {len(edges)}")

# cm_featurizer
cm_featurizer = CoulombMatrix(flatten=False)
scm_featurizer = SineCoulombMatrix(flatten=False)

# read extxyz
extxyz_file = "AlonsoStrain2023_all.extxyz"
all_atoms = read(extxyz_file, index=':')  # returns a list of ASE Atoms objects

atoms_dict = {}
for atoms in all_atoms:
    sys_name = atoms.info.get('system_name', None)
    if sys_name is not None:
        atoms_dict[sys_name] = atoms

print(f"Total structures read from extxyz: {len(atoms_dict)}")

# put cm to json
for i, entry in enumerate(edges):
    node = entry['node']
    node.setdefault('CoulombMatrix', [])
    node.setdefault('SineCoulombMatrix', [])

    if 'reactionSystems' not in node or len(node['reactionSystems']) == 0:
        print(f"Reaction {i} has no reactionSystems, skipping.")
        continue

    # reactionSystems[0] 
    system_name = node['reactionSystems'][0]['name']
    if system_name not in atoms_dict:
        print(f"Reaction {i}, system_name {system_name} not found in extxyz, skipping.")
        continue

    atoms = atoms_dict[system_name]

    # ASE -> pymatgen Structure
    structure = AseAtomsAdaptor.get_structure(atoms)

    # featurize and flatten
    cm_vec = cm_featurizer.featurize(structure)[0].ravel().tolist()
    scm_vec = scm_featurizer.featurize(structure)[0].ravel().tolist()
    node['CoulombMatrix'] = cm_vec
    node['SineCoulombMatrix'] = scm_vec

print("Coulomb matrices computed.")


enhanced_json_file = "AlonsoStrain2023_full_dataset_with_CM.json"
with open(enhanced_json_file, "w") as f:
    json.dump(data, f, indent=2)
print(f"Enhanced JSON saved to {enhanced_json_file}")

df = pd.DataFrame([entry['node'] for entry in edges])

# CoulombMatrix 
if 'CoulombMatrix' in df.columns and df['CoulombMatrix'].apply(len).max() > 0:
    cm_cols = pd.DataFrame(df['CoulombMatrix'].to_list())
    cm_cols.columns = [f'CM_{i}' for i in range(cm_cols.shape[1])]
    df = pd.concat([df.drop(columns=['CoulombMatrix']), cm_cols], axis=1)

# SineCoulombMatrix
if 'SineCoulombMatrix' in df.columns and df['SineCoulombMatrix'].apply(len).max() > 0:
    scm_cols = pd.DataFrame(df['SineCoulombMatrix'].to_list())
    scm_cols.columns = [f'SCM_{i}' for i in range(scm_cols.shape[1])]
    df = pd.concat([df.drop(columns=['SineCoulombMatrix']), scm_cols], axis=1)

csv_file = "AlonsoStrain2023_full_dataset_with_CM.csv"
df.to_csv(csv_file, index=False)
print(f"Enhanced CSV saved to {csv_file}") #PCA CM&SCM features


ModuleNotFoundError: No module named 'ase'

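### SOAP descriptor

Smooth Overlap of Atomic Positions (SOAP), which expands the local atomic density in radial basis functions and spherical harmonics: https://www.sciencedirect.com/science/article/pii/S0010465519303042

### featurizers module

Descriptors such as element types, oxidation states, and density features can be added with the matminer featurizers module.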

In [None]:
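from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import OxidationStates
from matminer.featurizers.structure import DensityFeatures

# Assumes df already has a "composition_oxid" column of oxidation-state-decorated compositions
os_feat = OxidationStates()
df = os_feat.featurize_dataframe(df, "composition_oxid")
df
# Assumes df has a "structure" column of pymatgen Structure objects
df_feat = DensityFeatures()
df = df_feat.featurize_dataframe(df, "structure")
df  # density, vpa, packing fraction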

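### Magpie descriptors

For inorganic materials, a formula (chemical formula) string is required. The string is parsed into a composition object, which is then converted into the corresponding descriptors according to its elemental makeup.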

https://hachmannlab.github.io/chemml/chemml.chem.magpie_python.html

In [None]:
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty

# Parse the "formula" strings into composition objects (adds a "composition" column)
df = StrToComposition().featurize_dataframe(df, "formula")
# Compute the Magpie element-property features from the "composition" column
ep_feat = ElementProperty.from_preset(preset_name="magpie")
df = ep_feat.featurize_dataframe(df, "composition")


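### Other

datana library: adds the atomic number and the period number.

One-hot encoding turns the facet into descriptors while keeping the other columns: df = pd.get_dummies(df, columns=["facet"])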
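To make explicit what one-hot encoding produces, the sketch below encodes a list of facet labels in plain Python: each distinct label becomes one 0/1 column. The facet values are hypothetical examples, and `one_hot` is an illustrative helper rather than a pandas function.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical facet labels; the real dataset may use other strings
facets = ["111", "0001", "111", "110"]
categories, encoded = one_hot(facets)
print(categories)
for facet, row in zip(facets, encoded):
    print(facet, row)
```

Each row contains exactly one 1, so a model treats the facets as unordered categories instead of reading a numeric order into the Miller indices.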