Introduction

When developing drugs, it is important to consider not only their biological activity but also their physicochemical properties, such as stability and reactivity. One key parameter for evaluating these properties is the HOMO-LUMO gap, the energy difference between the highest occupied and the lowest unoccupied molecular orbital. The smaller the gap, the more reactive and less stable the compound tends to be, which can affect its synthesis and practical application.

QM9.csv

QM9 is a dataset widely used in computational chemistry and machine learning to study molecular properties. All molecules in it are made up of carbon (C), hydrogen (H), nitrogen (N), oxygen (O), and fluorine (F) atoms. The properties of each molecule come from quantum mechanical calculations.

Properties

Symbol	Definition
μ	Dipole moment
α	Polarizability
HOMO / LUMO	Energies of the highest occupied and lowest unoccupied molecular orbitals
Gap	Energy difference between LUMO and HOMO (LUMO minus HOMO)
E_0	Internal energy
Cv	Heat capacity
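
As a small illustration of the Gap property, the sketch below computes the gap as LUMO minus HOMO and optionally converts it to electronvolts. It assumes the orbital energies are stored in hartree, as in the original QM9 release; check the units in your own CSV before relying on the conversion. The function name `homo_lumo_gap` and the sample values are hypothetical.

```python
HARTREE_TO_EV = 27.211386  # 1 hartree in electronvolts (rounded)


def homo_lumo_gap(homo: float, lumo: float, to_ev: bool = False) -> float:
    """Return the HOMO-LUMO gap (LUMO - HOMO), optionally converted to eV."""
    gap = lumo - homo
    return gap * HARTREE_TO_EV if to_ev else gap


# Hypothetical orbital energies in hartree, for illustration only.
gap_hartree = homo_lumo_gap(-0.25, 0.05)
gap_ev = homo_lumo_gap(-0.25, 0.05, to_ev=True)
print(f"gap = {gap_hartree:.2f} Ha = {gap_ev:.2f} eV")
```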

Purpose:

Clean and prepare the data in Jupyter notebooks so that it can be loaded into and used by a machine learning model.

Tasks

Extract descriptors with Mordred and RDKit
Curate the data
Drop duplicates
Select features
Transform features
Store the results in a PostgreSQL database
Train the model
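
Two of the tasks above, dropping duplicates and correlation-based feature selection, can be sketched in a few lines. This is a minimal standard-library illustration, not the code from the notebooks: the helper names, the 0.95 threshold, and the toy data are all assumptions. In a notebook the same idea is usually expressed with pandas (`DataFrame.drop_duplicates()` and `DataFrame.corr()`).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_correlated(columns, threshold=0.95):
    """Drop any column whose |r| with an already kept column exceeds threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical toy data: three descriptors, one duplicated row.
rows = [(1.0, 2.0, 5.0), (2.0, 4.0, 1.0), (2.0, 4.0, 1.0), (3.0, 6.0, 4.0), (4.0, 8.0, 2.0)]
rows = drop_duplicate_rows(rows)
columns = dict(zip(["desc_a", "desc_b", "desc_c"], map(list, zip(*rows))))
selected = drop_correlated(columns)
print(len(rows), list(selected))
```

Here `desc_b` is an exact multiple of `desc_a`, so it is removed, while the weakly correlated `desc_c` is kept.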

Project Structure

project_root/
│
├── Split_Mordred_set/       # Split into multiple CSV files due to large size. Contains all descriptors from the Mordred library.
├── Split_data_Merged_data/  # Split into multiple CSV files due to large size. This is the merged dataset of Mordred, RdKit, and new_qm9 descriptors.
├── .gitignore               # Specifies intentionally untracked files to ignore.
├── Merged_data2.csv         # A merged dataset in CSV format.
├── NewDataset.csv           # A new dataset generated for analysis.
├── Project1.ipynb           # Extracting descriptors from Mordred and RdKit.
├── Project2.ipynb           # Feature selection methods: Pearson correlation and visualization.
├── Project3.ipynb           # Data curation: encoding categorical data, detecting outliers.
├── Project4 (1).ipynb       # Feature transformation methods: PCA and t-SNE.
├── Project5.ipynb           # Demonstration of additional data processing methods using t-SNE.
├── Project6.ipynb           # Model training with XGBoost and LightGBM.
├── README.md                # Project documentation.
├── RdKitSet.csv             # Dataset containing RdKit descriptors.
├── Without_HOMO_LUMO.csv    # Final dataset prepared for PostgreSQL database upload.
├── new_qm9.csv              # The original dataset used initially.
├── split_csv.ipynb          # Script to split large datasets into smaller parts.
└── transformed_df.csv       # Dataset after normalization and dimensionality reduction.
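
Several entries above are split into multiple CSV files because of their size. The sketch below shows one way such a split could work using only the standard library; it is not the content of split_csv.ipynb, and the function name `split_csv`, the `part_N.csv` naming, and the demo data are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def split_csv(source, out_dir, rows_per_file):
    """Write `source` into numbered part files, repeating the header in each part."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    handle, writer = None, None
    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        for index, row in enumerate(reader):
            if index % rows_per_file == 0:
                # Start a new part file every `rows_per_file` data rows.
                if handle:
                    handle.close()
                path = out_dir / f"part_{len(parts) + 1}.csv"
                handle = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(handle)
                writer.writerow(header)
                parts.append(path)
            writer.writerow(row)
    if handle:
        handle.close()
    return parts


# Demo on a tiny hypothetical file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "demo.csv"
    source.write_text("smiles,gap\nC,0.5\nCC,0.4\nCCC,0.3\n", encoding="utf-8")
    parts = split_csv(source, Path(tmp) / "parts", rows_per_file=2)
    part_names = [p.name for p in parts]
    first_part_lines = parts[0].read_text(encoding="utf-8").splitlines()
    print(part_names)
```

Repeating the header in every part keeps each file loadable on its own.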

🚀 Getting Started

To start working on the project, create a separate conda environment from the environment.yml file in the repository. Run the following command in the terminal:

conda env create -f environment.yml

If you change the dependencies later, regenerate environment.yml from the active environment:

conda env export > environment.yml

Dataset cleaning and model training

For molecule optimization:

Use the model to predict the gap of new molecules.
Based on the importance of descriptors, change key parameters (e.g. cyclicity, mass) to obtain molecules with the desired gap.
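
To decide which parameters to change, the descriptors first have to be ranked by importance. The sketch below assumes the scores are already available as a plain list, for example from the `feature_importances_` attribute of a fitted scikit-learn-style XGBoost or LightGBM estimator. The helper name `top_descriptors`, the descriptor names, and the scores are hypothetical.

```python
def top_descriptors(names, importances, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical descriptor names and importance scores, for illustration only.
names = ["n_rings", "mol_weight", "n_h_donors", "tpsa"]
importances = [0.42, 0.31, 0.07, 0.20]
top = top_descriptors(names, importances, k=2)
print(top)
```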

For model selection:

Of the two models trained (XGBoost and LightGBM), LightGBM is the better choice because it:

Predicts more accurately (lower MSE, RMSE, and MAE).
Trains faster.
Explains more of the variance in the data (higher R²).
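
For reference, the four metrics named above can be computed from true and predicted gap values as in the following sketch. It uses only the standard library; the function name `regression_metrics` and the sample values are hypothetical, and in practice `sklearn.metrics` provides equivalent functions.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot  # undefined if all true values are identical
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical true and predicted values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Lower MSE, RMSE, and MAE mean smaller prediction errors, while an R² closer to 1 means the model explains more of the variance in the target.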

Loading data into PostgreSQL

To do this, you need to install the psycopg2 library:

conda install psycopg2

After that, you need to connect to the server using the following details:

import psycopg2

# Connect to the local PostgreSQL server (replace the credentials with your own).
pgconn = psycopg2.connect(
    host='localhost',
    user='postgres',
    password='123456QWERTY',
    database='postgres')
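
Rather than hard-coding the password as in the snippet above, the connection details can be read from environment variables. The sketch below is one possible approach, not part of the original notebooks: the helper name `connection_params` is hypothetical, and it assumes the standard libpq variable names (`PGHOST`, `PGUSER`, `PGPASSWORD`, `PGDATABASE`) are set in your shell.

```python
import os


def connection_params(env=os.environ):
    """Build keyword arguments for psycopg2.connect from PG* environment variables."""
    return {
        "host": env.get("PGHOST", "localhost"),
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", ""),
        "dbname": env.get("PGDATABASE", "postgres"),
    }


# A plain dict stands in for the real environment in this demonstration.
params = connection_params({"PGPASSWORD": "example-secret"})

# With psycopg2 installed and a server running, the connection would then be:
#     import psycopg2
#     pgconn = psycopg2.connect(**params)
print({key: value for key, value in params.items() if key != "password"})
```

This keeps credentials out of the notebook and out of version control.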

The data can be viewed by logging in to pgAdmin 4 with the connection details above. Then run the following SQL query:

SELECT * FROM my_table LIMIT 10;

Remember: This is a learning tool, not a production-ready solution. Use it to understand concepts and build your own improved versions!

Igor-source/Data-Management-Engineering
