Clustar is an astronomical data analysis project that uses machine-learning clustering techniques to automate the detection and analysis of variable stars and transients from large-scale astronomical surveys.
Modern astronomical surveys like TESS and LSST generate terabytes of data daily, creating an unprecedented challenge for traditional manual analysis methods. Clustar addresses this challenge by:
- Automating detection of anomalous variability, scaling to large datasets
- Prioritizing high-variability clusters, optimizing telescope use
- Providing robust noise handling, capturing diverse transients (e.g., supernovae, pulsators)
- Data Volume: Millions of stars observed in modern surveys generate terabytes of data daily
- Limited Telescope Time: Expensive telescope time (e.g., JWST at $100K/hour) requires optimal utilization
- Manual Analysis Limitations: Traditional methods are slow and miss subtle patterns
- Missed Opportunities: Delayed detection of transients impacts our understanding of stellar evolution and cosmology
- Accelerate discovery of variable stars, supernovae, and other transients
- Enhance understanding of stellar evolution and galactic dynamics
- Transform transient astronomy through automated analysis
- Optimize telescope scheduling, potentially saving $1M-$5M annually
- Prioritize follow-up observations for high-impact targets
- Streamline telescope use through intelligent clustering
Our dataset is sourced from the NASA Exoplanet Science Institute, providing rich photometric and light curve data. Key features include:
The dataset is also available on Kaggle: https://www.kaggle.com/datasets/edgarabasov/star-observations-dataset
- `star_id`: Unique identifier for each star
- `region`: Part of the sky surveyed
- `ra`: Right Ascension in degrees
- `dec`: Declination in degrees
- `starthjd`: Start time of observation (Heliocentric Julian Date)
- `endhjd`: End time of observation (Heliocentric Julian Date)
- `vmag`: V-band magnitude
- `verr`: V-band magnitude uncertainty
- `imag`: I-band magnitude
- `ierr`: I-band magnitude uncertainty
- `npts`: Number of points in the light curve
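As a quick illustration of the schema above, here is a minimal pandas sketch. The rows below are invented stand-ins for the real CSV; only the column names come from the dataset description:

```python
import pandas as pd

# Toy rows standing in for the real export (Cluster_2025.04.16_04.44.28.csv);
# the column names follow the schema listed above, the values are made up.
df = pd.DataFrame({
    "star_id": [1, 2],
    "ra": [270.1, 270.3], "dec": [-28.5, -28.6],
    "starthjd": [2453100.0, 2453100.0], "endhjd": [2453160.0, 2453155.0],
    "vmag": [14.2, 15.8], "verr": [0.01, 0.03],
    "imag": [13.5, 14.9], "ierr": [0.01, 0.02],
    "npts": [1200, 950],
})

# Derived quantities commonly used when characterizing variability:
df["v_i"] = df["vmag"] - df["imag"]              # V-I color index
df["span_days"] = df["endhjd"] - df["starthjd"]  # observation span (HJD days)
```

In practice the same columns would come from `pd.read_csv` on the raw archive export.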
The project implements a comprehensive analysis pipeline:
- Data Extraction and Loading
  - Direct integration with the NASA Exoplanet Archive database
  - Efficient data loading and preprocessing
- Feature Analysis and Processing
  - Raw data transformation
  - Standardization using scalers
  - Dimensionality reduction techniques (e.g., PCA)
- Clustering Analysis
  - Multiple clustering algorithms (e.g., KMeans)
  - Parameter optimization
  - Parallel processing capabilities
- Visualization and Interpretation
  - Interactive result visualization
  - Comprehensive reporting
  - Feature importance analysis
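The core of the pipeline can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the number of features, the 95% variance threshold, and the range of cluster counts are assumptions for the example, not the project's actual settings:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Stand-in feature matrix (e.g., vmag, verr, imag, ierr, npts per star).
X = rng.normal(size=(300, 5))

# 1. Standardize so magnitude-scale features don't dominate the distances.
X_std = StandardScaler().fit_transform(X)

# 2. Reduce dimensionality, keeping enough components for 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X_std)

# 3. Choose k via silhouette score: a simple parameter-optimization loop.
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    score = silhouette_score(X_pca, labels)
    if score > best_score:
        best_k, best_score = k, score
```

The resulting labels can then be plotted (e.g., a scatter of the first two principal components colored by cluster) for the visualization stage.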
- `explore.ipynb`: Jupyter notebook containing the data exploration and analysis
- `Cluster_2025.04.16_04.44.28.csv`: Raw dataset from the NASA Exoplanet Archive
The project requires the following Python packages:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- Clone the repository
- Install the required packages:
pip install pandas numpy matplotlib seaborn scikit-learn
- Open and run the Jupyter notebook `explore.ipynb`
The data was obtained from the NASA Exoplanet Archive (http://exoplanetarchive.ipac.caltech.edu) on April 16, 2025.
This project is open source and available under the MIT License.
This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program.
Contributions are welcome! Please feel free to submit a Pull Request.
