Clustar is an astronomical data analysis project that uses machine-learning clustering techniques to automate the detection and analysis of variable stars and transients from large-scale astronomical surveys.
Modern astronomical surveys like TESS and LSST generate terabytes of data daily, creating an unprecedented challenge for traditional manual analysis methods. Clustar addresses this challenge by:
- Automating detection of anomalous variability, scaling to large datasets
- Prioritizing high-variability clusters, optimizing telescope use
- Providing robust noise handling, capturing diverse transients (e.g., supernovae, pulsators)
- Data Volume: Millions of stars observed in modern surveys generate terabytes of data daily
- Limited Telescope Time: Expensive telescope time (e.g., JWST at $100K/hour) requires optimal utilization
- Manual Analysis Limitations: Traditional methods are slow and miss subtle patterns
- Missed Opportunities: Delayed detection of transients impacts our understanding of stellar evolution and cosmology
- Accelerate discovery of variable stars, supernovae, and other transients
- Enhance understanding of stellar evolution and galactic dynamics
- Transform transient astronomy through automated analysis
- Optimize telescope scheduling, potentially saving $1M-$5M annually
- Prioritize follow-up observations for high-impact targets
- Streamline telescope use through intelligent clustering
Our dataset is sourced from the NASA Exoplanet Science Institute, providing rich photometric and light curve data. Key features include:
The dataset is also available on Kaggle: https://www.kaggle.com/datasets/edgarabasov/star-observations-dataset
- `star_id`: Unique identifier for each star
- `region`: Part of the sky surveyed
- `ra`: Right Ascension in degrees
- `dec`: Declination in degrees
- `starthjd`: Start time of observation (Heliocentric Julian Date)
- `endhjd`: End time of observation (Heliocentric Julian Date)
- `vmag`: V-band magnitude
- `verr`: V-band magnitude uncertainty
- `imag`: I-band magnitude
- `ierr`: I-band magnitude uncertainty
- `npts`: Number of points in the light curve
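As a quick illustration of the schema above, here is a minimal pandas sketch. The rows below are invented stand-ins for the real CSV; only the column names come from the dataset description:

```python
import pandas as pd

# Toy rows standing in for the real export (Cluster_2025.04.16_04.44.28.csv);
# the column names follow the schema listed above, the values are made up.
df = pd.DataFrame({
    "star_id": [1, 2],
    "ra": [270.1, 270.3], "dec": [-28.5, -28.6],
    "starthjd": [2453100.0, 2453100.0], "endhjd": [2453160.0, 2453155.0],
    "vmag": [14.2, 15.8], "verr": [0.01, 0.03],
    "imag": [13.5, 14.9], "ierr": [0.01, 0.02],
    "npts": [1200, 950],
})

# Derived quantities commonly used when characterizing variability:
df["v_i"] = df["vmag"] - df["imag"]              # V-I color index
df["span_days"] = df["endhjd"] - df["starthjd"]  # observation span (HJD days)
```

In practice the same columns would come from `pd.read_csv` on the raw archive export.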
The project implements a comprehensive analysis pipeline:
- Data Extraction and Loading
  - Direct integration with the NASA Exoplanet Archive database
  - Efficient data loading and preprocessing
- Feature Analysis and Processing
  - Raw data transformation
  - Standardization using scalers
  - Dimensionality reduction techniques (e.g., PCA)
- Clustering Analysis
  - Multiple clustering algorithms (e.g., KMeans)
  - Parameter optimization
  - Parallel processing capabilities
- Visualization and Interpretation
  - Interactive result visualization
  - Comprehensive reporting
  - Feature importance analysis
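The core of the pipeline can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the number of features, the 95% variance threshold, and the range of cluster counts are assumptions for the example, not the project's actual settings:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Stand-in feature matrix (e.g., vmag, verr, imag, ierr, npts per star).
X = rng.normal(size=(300, 5))

# 1. Standardize so magnitude-scale features don't dominate the distances.
X_std = StandardScaler().fit_transform(X)

# 2. Reduce dimensionality, keeping enough components for 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X_std)

# 3. Choose k via silhouette score: a simple parameter-optimization loop.
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    score = silhouette_score(X_pca, labels)
    if score > best_score:
        best_k, best_score = k, score
```

The resulting labels can then be plotted (e.g., a scatter of the first two principal components colored by cluster) for the visualization stage.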
- `explore.ipynb`: Jupyter notebook containing the data exploration and analysis
- `Cluster_2025.04.16_04.44.28.csv`: Raw dataset from the NASA Exoplanet Archive
The project requires the following Python packages:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- Clone the repository
- Install the required packages:
pip install pandas numpy matplotlib seaborn scikit-learn
- Open and run the Jupyter notebook `explore.ipynb`
The data was obtained from the NASA Exoplanet Archive (http://exoplanetarchive.ipac.caltech.edu) on April 16, 2025.
This project is open source and available under the MIT License.
This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program.
Contributions are welcome! Please feel free to submit a Pull Request.
