GitHub - Kubabob/chemical-data-explorer

This repository applies multivariate data exploration techniques to a dataset of chemical compounds and their molecular descriptors. It demonstrates how these statistical methods can be used to analyze and visualize complex chemical data.

Project Overview

This project applies two key data exploration techniques to analyze relationships between various chemical compounds:

Hierarchical Cluster Analysis (HCA) - Used to identify natural groupings of chemical compounds based on their molecular descriptors
Principal Component Analysis (PCA) - Applied to reduce dimensionality and identify the most important variables that explain variation in the dataset

Dataset

The dataset consists of chemical compounds (pesticides/insecticides) including:

Organophosphate compounds
Neonicotinoids
DDT and its analogs
Pyridine carboxylic acids

For each compound, molecular descriptors were generated from its SMILES (Simplified Molecular Input Line Entry System) code with the RDKit library. These descriptors represent chemical properties and structural features numerically.

Analysis Process

Data Preparation

Conversion of SMILES codes to molecular structures using RDKit
Generation of chemical descriptors (~147 features per compound)
Data standardization/autoscaling to handle different measurement scales
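
The autoscaling step can be sketched as follows. This is a minimal illustration with NumPy, not the code from the notebooks: the function name autoscale, the toy values, the use of the sample standard deviation (ddof=1), and the decision to drop constant descriptors are all assumptions made for the example.

```python
import numpy as np


def autoscale(X):
    """Column-wise autoscaling (z-scores); constant columns are dropped.

    Returns the scaled matrix and a boolean mask of the columns that were kept.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # sample standard deviation (assumption)
    keep = std > 0               # a constant descriptor cannot be scaled
    return (X[:, keep] - mean[keep]) / std[keep], keep


# Hypothetical descriptor table: 4 compounds x 3 descriptors (made-up values).
X = np.array([
    [250.0, 2.1, 1.0],
    [310.0, 3.4, 1.0],
    [180.0, 0.7, 1.0],
    [402.0, 5.0, 1.0],
])
X_scaled, kept = autoscale(X)
```

After scaling, every retained descriptor has zero mean and unit variance, so descriptors measured on large scales no longer dominate the distance calculations.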

Hierarchical Cluster Analysis

Calculation of distance matrices using Euclidean and Manhattan metrics
Application of different linkage methods (single, complete)
Generation of dendrograms to visualize hierarchical relationships
Cluster identification and interpretation
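
The steps above can be sketched with SciPy. This is an illustrative example under stated assumptions rather than the notebook code: the data values and compound labels are made up, and cutting each tree into two clusters is an arbitrary choice for the toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical autoscaled data: 6 compounds x 2 descriptors (made-up values).
X = np.array([
    [-1.0, -1.1],
    [-0.9, -1.0],
    [-1.1, -0.9],
    [1.0, 1.1],
    [0.9, 1.0],
    [1.1, 0.9],
])
labels = ["A1", "A2", "A3", "B1", "B2", "B3"]  # placeholder compound names

results = {}
for metric in ("euclidean", "cityblock"):  # "cityblock" is SciPy's name for Manhattan
    condensed = pdist(X, metric=metric)    # pairwise distances in condensed form
    square = squareform(condensed)         # full symmetric matrix, e.g. for a heatmap
    for method in ("single", "complete"):
        Z = linkage(condensed, method=method)              # build the cluster tree
        clusters = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 groups
        tree = dendrogram(Z, labels=labels, no_plot=True)  # leaf layout only, no figure
        results[(metric, method)] = (square, clusters, tree["ivl"])
```

Dropping no_plot=True draws the dendrogram with Matplotlib. Comparing the cluster assignments across the four metric and linkage combinations shows how sensitive the grouping is to those choices.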

Principal Component Analysis

Correlation matrix analysis
Eigenvalue decomposition
Scree plot analysis for dimension determination
Component interpretation
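
The PCA steps listed above can be sketched with NumPy alone. This is a minimal example, assuming PCA is carried out by eigendecomposition of the correlation matrix; the function name pca_correlation and the random toy data are hypothetical and are not taken from the notebooks.

```python
import numpy as np


def pca_correlation(X):
    """PCA via eigendecomposition of the correlation matrix."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscaled data
    R = np.corrcoef(X, rowvar=False)                  # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)              # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]                 # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()               # values shown on a scree plot
    scores = Z @ eigvecs                              # compound coordinates on the PCs
    return eigvals, eigvecs, explained, scores


# Hypothetical data: 20 compounds x 4 descriptors, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(20, 1)),
    base + 0.1 * rng.normal(size=(20, 1)),
    rng.normal(size=(20, 2)),
])
eigvals, loadings, explained, scores = pca_correlation(X)
cumulative = np.cumsum(explained)  # cumulative share of variance per component
```

Plotting explained (or eigvals) against the component index gives the scree plot, and the columns of loadings show how strongly each descriptor contributes to each component.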

Key Findings

Successfully identified chemical groups based on structural similarity
Discovered meaningful patterns in complex high-dimensional chemical data
Reduced the dimensionality while preserving essential information
Visualized relationships between compounds in a more interpretable form

Folder Structure

├── HCA/               # Hierarchical Cluster Analysis
│   ├── hca.ipynb     # Main analysis notebook
│   ├── smiles.ipynb  # SMILES processing
│   └── sprawozdanie/ # Report and documentation
│
├── PCA/               # Principal Component Analysis
│   ├── pca.ipynb     # Analysis notebook
│   └── sprawko/      # Report and documentation
│
├── figures/           # Generated visualizations
└── sprawozdanie/      # Overall project documentation

Technologies Used

Python - Primary programming language
Pandas - Data manipulation and analysis
NumPy - Numerical computing
SciPy - Scientific computing, hierarchical clustering
Matplotlib - Data visualization
RDKit - Cheminformatics and molecular processing
Jupyter Notebooks - Interactive development and analysis

Visualizations

The project includes various visualizations:

Dendrograms showing hierarchical relationships between compounds
Distance matrices represented as heatmaps
Principal component plots
Molecular structure visualizations

How to Run

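Clone this repository
Install the required dependencies with uv (the repository includes pyproject.toml and uv.lock): uv sync
Start Jupyter and open the notebooks listed under Folder Structure: jupyter notebook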

License

This project is available under the MIT License.

Acknowledgments

This project was completed as part of a data exploration techniques course at university.

