This repository contains implementations of advanced data exploration techniques applied to a dataset of chemical compounds and their molecular descriptors. The project demonstrates the application of multivariate statistical methods to analyze and visualize complex chemical data.
This project applies two key data exploration techniques to analyze relationships between various chemical compounds:
- Hierarchical Cluster Analysis (HCA) - Used to identify natural groupings of chemical compounds based on their molecular descriptors
- Principal Component Analysis (PCA) - Applied to reduce dimensionality and identify the most important variables that explain variation in the dataset
The dataset consists of chemical compounds (pesticides/insecticides) including:
- Organophosphate compounds
- Neonicotinoids
- DDT and its analogs
- Pyridine carboxylic acids
For each compound, molecular descriptors were generated from its SMILES (Simplified Molecular Input Line Entry System) code with the RDKit library. These descriptors provide a numerical representation of chemical properties and structural features.
- Conversion of SMILES codes to molecular structures using RDKit
- Generation of chemical descriptors (~147 features per compound)
- Data standardization/autoscaling to handle different measurement scales
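
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the four SMILES strings are illustrative examples of the compound classes rather than the project's input data, and the helper names `descriptor_table` and `autoscale` are hypothetical. The number of descriptors returned by `Descriptors.descList` depends on the installed RDKit version, so it may differ from the ~147 features mentioned above.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Illustrative inputs only (assumed, not taken from the project's data files)
smiles = {
    "DDT": "ClC(Cl)(Cl)C(c1ccc(Cl)cc1)c1ccc(Cl)cc1",
    "imidacloprid": "[O-][N+](=O)NC1=NCCN1Cc1ccc(Cl)nc1",
    "malathion": "CCOC(=O)CC(SP(=S)(OC)OC)C(=O)OCC",
    "clopyralid": "OC(=O)c1nc(Cl)ccc1Cl",
}


def descriptor_table(smiles_by_name):
    """Build a compounds x descriptors table from SMILES strings."""
    rows = {}
    for name, smi in smiles_by_name.items():
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES for {name}: {smi}")
        # Descriptors.descList is a list of (descriptor_name, function) pairs
        rows[name] = {desc: fn(mol) for desc, fn in Descriptors.descList}
    return pd.DataFrame.from_dict(rows, orient="index")


def autoscale(df):
    """Centre each column to mean 0 and scale it to unit standard deviation."""
    # Drop descriptors that are undefined (NaN/inf) for any compound
    df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1)
    # Drop constant descriptors, which cannot be scaled
    df = df.loc[:, df.std(ddof=1) > 0]
    return (df - df.mean()) / df.std(ddof=1)


raw = descriptor_table(smiles)
scaled = autoscale(raw)
print(raw.shape, scaled.shape)
```

Autoscaling matters here because descriptors such as molecular weight and atom counts live on very different numeric scales, and unscaled distances would be dominated by the largest ones.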
- Calculation of distance matrices using Euclidean and Manhattan metrics
- Application of different linkage methods (single, complete)
- Generation of dendrograms to visualize hierarchical relationships
- Cluster identification and interpretation
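
A short sketch of the clustering steps above, using SciPy. The data matrix is synthetic (two well-separated groups of made-up "compounds") and stands in for the autoscaled descriptor table; the labels and the choice of two clusters are assumptions for illustration only. Note that SciPy calls the Manhattan metric `cityblock`.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Synthetic stand-in for an autoscaled descriptor matrix: 6 compounds x 5 descriptors
X = np.vstack([rng.normal(0.0, 0.1, (3, 5)), rng.normal(5.0, 0.1, (3, 5))])
labels = [f"cmpd_{i}" for i in range(X.shape[0])]

# Compare every combination of distance metric and linkage method
assignments = {}
for metric in ("euclidean", "cityblock"):
    condensed = pdist(X, metric=metric)  # condensed pairwise distance vector
    for method in ("single", "complete"):
        Z = linkage(condensed, method=method)
        # Cut the tree into (at most) two flat clusters
        assignments[(metric, method)] = fcluster(Z, t=2, criterion="maxclust")

# Square distance matrix, e.g. for plotting as a heatmap
D = squareform(pdist(X, metric="euclidean"))

# no_plot=True returns the dendrogram layout without drawing it;
# drop that argument inside a notebook to render the plot with Matplotlib
tree = dendrogram(linkage(pdist(X), method="complete"), labels=labels, no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```

Single linkage tends to produce elongated, chained clusters, while complete linkage favours compact ones, so comparing both on the same distance matrix is a quick check of how stable the groupings are.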
- Correlation matrix analysis
- Eigenvalue decomposition
- Scree plot analysis for dimension determination
- Component interpretation
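
The PCA steps above can be illustrated with plain NumPy. This is a sketch under stated assumptions, not the notebook's implementation: the input is synthetic, and the function name `pca_from_correlation` is hypothetical. Working from the correlation matrix is equivalent to running PCA on autoscaled data.

```python
import numpy as np


def pca_from_correlation(X):
    """PCA via eigendecomposition of the correlation matrix of X (samples x variables)."""
    R = np.corrcoef(X, rowvar=False)          # correlation matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]         # sort components by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()       # fraction of total variance per component
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscaled data
    scores = Z @ eigvecs                      # sample coordinates in component space
    return eigvals, eigvecs, explained, scores


rng = np.random.default_rng(1)
# Synthetic data: 20 samples, 4 variables, the first two strongly correlated
base = rng.normal(size=(20, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(20, 1)),
    base + 0.1 * rng.normal(size=(20, 1)),
    rng.normal(size=(20, 2)),
])

eigvals, loadings, explained, scores = pca_from_correlation(X)
# Values typically shown in a scree plot, plus the cumulative share
print(np.round(eigvals, 3))
print(np.round(np.cumsum(explained), 3))
```

The eigenvalues are what a scree plot displays, and the columns of `loadings` show how strongly each original variable contributes to a component, which is the basis for interpreting it.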
- Successfully identified chemical groups based on structural similarity
- Discovered meaningful patterns in complex high-dimensional chemical data
- Reduced the dimensionality while preserving essential information
- Visualized relationships between compounds in a more interpretable form
├── HCA/ # Hierarchical Cluster Analysis
│ ├── hca.ipynb # Main analysis notebook
│ ├── smiles.ipynb # SMILES processing
│ └── sprawozdanie/ # Report and documentation
│
├── PCA/ # Principal Component Analysis
│ ├── pca.ipynb # Analysis notebook
│ └── sprawko/ # Report and documentation
│
├── figures/ # Generated visualizations
└── sprawozdanie/ # Overall project documentation
- Python - Primary programming language
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- SciPy - Scientific computing, hierarchical clustering
- Matplotlib - Data visualization
- RDKit - Cheminformatics and molecular processing
- Jupyter Notebooks - Interactive development and analysis
The project includes various visualizations:
- Dendrograms showing hierarchical relationships between compounds
- Distance matrices represented as heatmaps
- Principal component plots
- Molecular structure visualizations
- Clone this repository
- Install required dependencies: `uv sync`
- Run Jupyter notebooks: `jupyter notebook`
This project is available under the MIT License.
This project was completed as part of a data exploration techniques course at university.