# Dataset exploration

In this notebook, we explore the dataset used to train the model. The dataset is stored as a pandas dataframe in which each row is a material. Its columns hold both the input features used for prediction and the output properties that we want to predict.

In [1]:
import pandas as pd

# Load the data
filepath = 'data/df_outputs_filtout.pkl'
df: pd.DataFrame = pd.read_pickle(filepath)

# Display the first few rows of the dataframe
df.head()

id,formula_reduced,crystal_system,dRMS,dijk,elements,epsij,src_bandgap,src_DB_IDs,src_ehull,src_epsij,...,nsites,pg_symbol,spg_number,spg_symbol,structure,dKP,src,origin,dinv2,dinv3
mp-34078,H3ClO,trigonal,0.554479,"[[[0.0, 1.232027588815095, -0.374056043202926]...","[Cl, H, O]","[[2.53058371, -0.0, -0.0], [-0.0, 2.53058371, ...",5.3899,{},0.005834,"[[2.486930581530363, -5.922864326768362e-07, -...",...,5,3m,160,R3m,"{'@module': 'pymatgen.core.structure', '@class...",0.999857,Materials Project,Naccarato,1.707401,2.065073
mp-632326,HCl,orthorhombic,0.653691,"[[[0.0, 0.0, 0.0027236759621490003], [0.0, 0.0...","[Cl, H]","[[2.15315296, -0.0, -0.0], [-0.0, 2.33339089, ...",5.9709,{'icsd': ['icsd-27037']},0.0,"[[2.05131012, 0.0, 0.0], [0.0, 1.80706055, 0.0...",...,4,mm2,36,Cmc2_1,"{'@module': 'pymatgen.core.structure', '@class...",1.516039,Materials Project,Naccarato,2.075681,2.917119
mp-3277,BAsO4,tetragonal,1.631147,"[[[0.032043250933791005, 0.032043250941036, 1....","[As, B, O]","[[2.93707095, -0.0, -0.0], [-0.0, 2.93707095, ...",4.4406,"{'icsd': ['icsd-413438', 'icsd-26891']}",0.0,"[[3.0300598691968603, -1.4644051369749889e-05,...",...,6,-4,82,I-4,"{'@module': 'pymatgen.core.structure', '@class...",2.924868,Materials Project,Naccarato,3.46229,5.880784
mp-570935,LiI,hexagonal,0.103603,"[[[0.0, 0.0, -0.061151779065493], [0.0, 0.0, 2...","[I, Li]","[[3.26178546, -0.0, -0.0], [-0.0, 3.26178546, ...",4.381,{'icsd': ['icsd-414242']},0.0,"[[3.1340718616738847, 3.9628310656241467e-07, ...",...,4,6mm,186,P6_3mc,"{'@module': 'pymatgen.core.structure', '@class...",0.244556,Materials Project,Naccarato,0.650335,0.524261
mp-20459,TiPbO3,tetragonal,10.949745,"[[[0.0, 0.0, -23.192675819051072], [0.0, 0.0, ...","[O, Pb, Ti]","[[7.09469421, -0.0, -0.0], [-0.0, 7.09469421, ...",1.9569,"{'icsd': ['icsd-1612', 'icsd-61168', 'icsd-550...",0.0,"[[6.598517479999999, 0.0, 0.0], [0.0, 6.598517...",...,5,4mm,99,P4mm,"{'@module': 'pymatgen.core.structure', '@class...",23.177082,Materials Project,Naccarato,12.317948,39.450222


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2699 entries, mp-34078 to agm006074940
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   formula_reduced    2699 non-null   object 
 1   crystal_system     2699 non-null   object 
 2   dRMS               2699 non-null   float64
 3   dijk               2699 non-null   object 
 4   elements           2699 non-null   object 
 5   epsij              2699 non-null   object 
 6   src_bandgap        2699 non-null   float64
 7   src_DB_IDs         1565 non-null   object 
 8   src_ehull          2699 non-null   float64
 9   src_epsij          809 non-null    object 
 10  src_is_gap_direct  1810 non-null   object 
 11  src_is_magnetic    1550 non-null   object 
 12  src_n              928 non-null    float64
 13  src_theoretical    1565 non-null   object 
 14  n                  2699 non-null   float64
 15  nelements          2699 non-null   int64  
 16  nsites        
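
The non-null counts above show that several `src_*` columns are only partially filled (for example `src_n` and `src_epsij`). As a minimal sketch, the helper below summarises missing values per column. The name `missing_summary` and the toy dataframe are hypothetical and only stand in for the real `df`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most incomplete first."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "n_missing": n_missing,
            "frac_missing": n_missing / len(frame),
        }
    )
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame(
    {
        "src_bandgap": [5.4, 6.0, 4.4, 4.4],
        "src_n": [1.6, None, None, 2.0],
        "src_is_magnetic": [None, None, None, False],
    }
)
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` would give the same kind of table for all 26 columns.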

## Dataset description
To get a better understanding of the data, here is a short description of each column in the dataset:
- `id`: The identifier of the material, used as the dataframe index (e.g. `mp-34078` for a Materials Project entry)
- `formula_reduced`: The chemical formula of the material
- `crystal_system`: The crystal system of the material (e.g. cubic, hexagonal), which classifies crystals by the symmetry of their lattice
- `dRMS`: The root mean square of the components of `dijk`
- `dijk`: The second-harmonic generation (SHG) tensor, a rank-3 tensor that describes the second-order nonlinear optical response of the material, much as the refractive index describes the linear response
- `elements`: The elements that compose the material
- `epsij`: The static dielectric tensor of the material
- `src_bandgap`: The bandgap of the material (coming from the source, Materials Project)
- `src_DB_IDs`: The identifiers of the material in other databases, as reported by the source (e.g. ICSD entries for experimental data)
- `src_ehull`: The energy above the hull of the material (coming from the source, Materials Project)
- `src_epsij`: The dielectric static tensor of the material (coming from the source, Materials Project)
- `src_is_gap_direct`: Whether the bandgap is direct or indirect (coming from the source, Materials Project)
- `src_is_magnetic`: Whether the material is magnetic (coming from the source, Materials Project)
- `src_n`: The refractive index of the material (coming from the source, Materials Project)
- `src_theoretical`: Whether the data is theoretical or experimental (coming from the source, Materials Project)
- `n`: The refractive index of the material, computed by the method used in the paper
- `nelements`: The number of elements in the material
- `nsites`: The number of sites in the material
- `pg_symbol`: The point group symbol of the material
- `spg_number`: The space group number of the material
- `spg_symbol`: The space group symbol of the material
- `structure`: The structure of the material, represented as a pymatgen structure object
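- `dKP`: The Kurtz-Perry (powder-averaged) effective SHG coefficient, derived from `dijk`
- `src`: The source database of the material (e.g. Materials Project)
- `origin`: The origin of the data (e.g. `Naccarato`), indicating who produced it or when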
- `dinv2`: Invariant of dijk
- `dinv3`: Invariant of dijk

More information is available in the accompanying paper: https://www.nature.com/articles/s41597-024-03590-9
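
To make the relation between `dRMS` and `dijk` concrete, here is a minimal sketch of a root mean square over the components of a rank-3 tensor. It assumes that `dijk` is stored as a 3×3×3 nested list and that the average runs over all 27 components. Neither assumption is confirmed by this notebook, so the stored `dRMS` values may follow a different convention. The function name `rms_of_tensor` is hypothetical.

```python
import numpy as np


def rms_of_tensor(dijk) -> float:
    """Root mean square over all components of a 3x3x3 tensor (assumed shape)."""
    d = np.asarray(dijk, dtype=float)
    if d.shape != (3, 3, 3):
        raise ValueError(f"expected shape (3, 3, 3), got {d.shape}")
    return float(np.sqrt(np.mean(d**2)))


# Made-up tensor with a single non-zero component, for illustration only
toy_dijk = np.zeros((3, 3, 3))
toy_dijk[2, 2, 2] = 3.0
toy_rms = rms_of_tensor(toy_dijk)  # sqrt(9 / 27)
print(toy_rms)
```

If the assumptions hold, `df["dijk"].apply(rms_of_tensor)` should be close to `df["dRMS"]`. A systematic mismatch would indicate a different normalisation.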

In [3]:
df.describe()

statistic,dRMS,src_bandgap,src_ehull,src_n,n,nelements,nsites,spg_number,dKP,dinv2,dinv3
count,2699.0,2699.0,2699.0,928.0,2699.0,2699.0,2699.0,2699.0,2699.0,2699.0,2699.0
mean,7.189294,3.495047,0.011076,2.317621,2.194162,3.232308,16.127825,78.472397,13.687724,7.107651,28.071741
std,14.390805,1.853512,0.031892,0.944312,0.834897,0.765098,13.258351,74.703339,27.072612,9.790768,55.727365
min,1.6e-05,0.0001,0.0,1.231677,1.003305,1.0,2.0,1.0,3.2e-05,0.001667,6.4e-05
25%,0.207173,2.03615,0.0,1.648786,1.619882,3.0,7.0,8.0,0.40101,0.922247,0.827737
50%,1.156382,3.2646,0.00402,1.994195,1.982314,3.0,12.0,36.0,2.272419,2.903343,4.676774
75%,6.344945,4.9652,0.016746,2.579856,2.491361,4.0,22.0,149.0,12.29067,8.770633,24.571075
max,94.086284,9.7197,1.468057,8.311524,7.49452,6.0,96.0,220.0,169.444013,54.991803,393.254751
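
Two things stand out in this summary. `describe()` only covers the numeric columns, and `dRMS` is strongly right-skewed (median about 1.16, mean about 7.19, maximum about 94.1). As a sketch of a possible next step, the snippet below counts a categorical column and summarises `dRMS` on a log scale. It uses only the five rows shown by `df.head()` so that it runs on its own, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head(), restricted to two columns
toy = pd.DataFrame(
    {
        "crystal_system": ["trigonal", "orthorhombic", "tetragonal", "hexagonal", "tetragonal"],
        "dRMS": [0.554479, 0.653691, 1.631147, 0.103603, 10.949745],
    },
    index=["mp-34078", "mp-632326", "mp-3277", "mp-570935", "mp-20459"],
)

# Frequency of each crystal system (a categorical column that describe() skips)
counts = toy["crystal_system"].value_counts()

# log10 is well defined here because the minimum dRMS in the dataset is positive
log_drms = np.log10(toy["dRMS"])
log_summary = log_drms.describe()

print(counts)
print(log_summary)
```

The same two calls can be applied to the full `df`. The log transform spreads out the many small `dRMS` values that the linear summary compresses near zero.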
