<a href="https://colab.research.google.com/github/Bishtrahulsingh/stellerclassification/blob/main/stellerClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Overview

This project implements and evaluates a machine learning pipeline for the automated classification of celestial objects. Using a public dataset of 100,000 observations from a large-scale astronomical survey, this work compares three different classification algorithms—Logistic Regression, Support Vector Machines (SVM), and Random Forest—to accurately categorize objects into **Galaxies**, **Quasars (QSOs)**, or **Stars**.

The best model, a Random Forest, reaches **97.8% accuracy** on a held-out test set, which suggests that ensemble methods can reliably classify large volumes of astronomical data.

In [42]:
import pandas as pd
import numpy as np

## The Dataset

The dataset consists of 100,000 entries, each representing a unique celestial object. The classification is based on 8 key features.

* **Target Variable:** `class` - The object's classification (GALAXY, STAR, or QSO).

* **Key Features Used:**

    * `alpha`, `delta`: The celestial coordinates (Right Ascension and Declination) specifying the object's position on the sky.

    * `u`, `g`, `r`, `i`, `z`: The object's brightness (magnitude) as measured through five different photometric filters (ultraviolet, green, red, near-infrared, and infrared), which is a primary indicator of an object's type and properties.

    * `redshift`: A measure of how much an object's light has been "stretched" toward longer wavelengths, mainly by the expansion of the universe. It is a critical indicator of distance: Galaxies and Quasars typically have much higher redshifts than Stars, which lie within our own galaxy and have redshifts close to zero.

In [43]:
stars = pd.read_csv('https://raw.githubusercontent.com/Bishtrahulsingh/Datacsv/refs/heads/main/star_classification.csv')

In [44]:
stars.sample(5)

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,class,redshift,plate,MJD,fiber_ID
30206,1.237662e+18,232.217631,36.600869,23.91907,24.36888,21.87049,20.52274,19.78129,3926,301,4,42,1.209693e+19,GALAXY,0.717772,10744,58199,951
20650,1.237667e+18,132.462776,18.517356,23.29053,21.78326,19.98514,19.19535,18.6814,5061,301,2,123,5.827696e+18,GALAXY,0.481086,5176,56221,137
84530,1.237658e+18,170.334003,53.970492,19.46489,17.94998,17.16981,16.78232,16.49136,2821,301,3,145,1.140584e+18,GALAXY,0.10192,1013,52707,171
90915,1.237668e+18,200.995507,21.660002,18.86089,18.49016,17.98857,17.52521,17.45863,5183,301,3,448,2.985918e+18,QSO,0.13457,2652,54508,114
91925,1.237679e+18,23.30042,12.936932,20.73827,20.51104,20.02464,19.86069,19.73269,7787,301,6,353,1.245256e+19,QSO,1.35804,11060,58523,401


In [45]:
stars.shape

(100000, 18)

In [46]:
stars.columns

Index(['obj_ID', 'alpha', 'delta', 'u', 'g', 'r', 'i', 'z', 'run_ID',
       'rerun_ID', 'cam_col', 'field_ID', 'spec_obj_ID', 'class', 'redshift',
       'plate', 'MJD', 'fiber_ID'],
      dtype='object')

## Data Preprocessing and Cleaning

1.  **Feature Selection:** Nine non-physical metadata columns were removed from the dataset. Columns like `obj_ID`, `run_ID`, `plate`, `MJD`, and `fiber_ID` are identifiers that record how and when the observation was taken, not intrinsic properties of the object itself. Keeping them risks the model learning survey artifacts instead of physical signal, which would hurt its ability to generalize to new data.

2.  **Target Encoding:** The categorical `class` label was converted into a numerical format using `sklearn.preprocessing.LabelEncoder` for compatibility with the models. The encoding was:
    * `GALAXY`: 0
    * `QSO`: 1
    * `STAR`: 2

3.  **Train-Test Split:** The data was split into a 70% training set and a 30% test set. The `stratify` parameter of `train_test_split` was used so that the proportions of GALAXY, QSO, and STAR are nearly identical in the training and test sets. This matters because the dataset is imbalanced (about 59% of objects are galaxies).

4.  **Feature Scaling:** All 8 input features were scaled using `sklearn.preprocessing.StandardScaler`. This step is essential as features like `redshift` (values often 0-1) and `alpha` (values 0-360) are on vastly different scales. Scaling centers all features to a mean of 0 and a standard deviation of 1, which improves the performance and convergence of algorithms like Logistic Regression and SVM.
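Steps 3 and 4 can be checked on a small example. The sketch below uses synthetic stand-in data (an assumption, not the survey data) with roughly the class proportions reported above. It shows that a stratified split keeps class proportions close in both splits and that a scaler fitted on the training split yields zero mean and unit variance there.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 1,000 rows, 8 features on different scales.
n_rows = 1000
X = rng.normal(
    loc=[180, 20, 22, 21, 20, 19, 19, 0.5],
    scale=[90, 20, 2, 2, 2, 2, 2, 0.7],
    size=(n_rows, 8),
)
# Labels 0/1/2 drawn in roughly 59% / 19% / 22% proportions.
y = rng.choice([0, 1, 2], size=n_rows, p=[0.59, 0.19, 0.22])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# With stratify, class proportions are close in both splits (not always identical).
train_props = np.bincount(y_train, minlength=3) / len(y_train)
test_props = np.bincount(y_test, minlength=3) / len(y_test)
print("train proportions:", train_props.round(3))
print("test proportions: ", test_props.round(3))

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("train means:", X_train_scaled.mean(axis=0).round(3))
print("train stds: ", X_train_scaled.std(axis=0).round(3))
```

Fitting the scaler on the training split alone keeps test-set statistics out of the preprocessing, so the scaled test features will generally not have exactly zero mean.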

In [47]:
cols_to_drop = ['obj_ID', 'run_ID', 'rerun_ID', 'cam_col', 'field_ID',
                'spec_obj_ID', 'plate', 'MJD', 'fiber_ID']
stars_cleaned = stars.drop(columns=cols_to_drop)

In [48]:
stars_cleaned.head()

Unnamed: 0,alpha,delta,u,g,r,i,z,class,redshift
0,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,GALAXY,0.634794
1,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,GALAXY,0.779136
2,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,GALAXY,0.644195
3,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,GALAXY,0.932346
4,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,GALAXY,0.116123


In [49]:
# Encode the class labels as integers (label encoding, not dummy variables)

In [50]:
stars_cleaned['class'].value_counts()

class,count
GALAXY,59445
STAR,21594
QSO,18961
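These counts give a useful reference point for any accuracy figure. As a minimal sketch using only the counts printed above, the share of each class and the accuracy of a trivial "always predict the most common class" classifier can be computed as follows:

```python
# Class counts as printed by value_counts() above.
class_counts = {"GALAXY": 59445, "STAR": 21594, "QSO": 18961}

total = sum(class_counts.values())
proportions = {name: count / total for name, count in class_counts.items()}

# A classifier that always predicts the most common class is correct
# exactly as often as that class occurs.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = proportions[majority_class]

for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
print(f"Majority-class baseline ({majority_class}): {baseline_accuracy:.1%}")
```

Any model worth keeping should clearly beat this baseline of just under 60%.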


In [51]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [52]:
stars_cleaned['class'] = le.fit_transform(stars_cleaned['class'])

In [53]:
stars_cleaned['class'].value_counts()

class,count
0,59445
2,21594
1,18961
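The integer codes follow from how `LabelEncoder` works: it sorts the class names and uses each name's position as its code. The sketch below uses a few toy labels (an assumption, standing in for the `class` column) to show how the mapping can be read back and how predictions can be decoded.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption): only the three class names matter here.
labels = ["STAR", "GALAXY", "QSO", "GALAXY", "STAR"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each name is its integer code.
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into class names.
decoded = encoder.inverse_transform([0, 1, 2])
print(list(decoded))
```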


In [54]:
stars_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   alpha     100000 non-null  float64
 1   delta     100000 non-null  float64
 2   u         100000 non-null  float64
 3   g         100000 non-null  float64
 4   r         100000 non-null  float64
 5   i         100000 non-null  float64
 6   z         100000 non-null  float64
 7   class     100000 non-null  int64  
 8   redshift  100000 non-null  float64
dtypes: float64(8), int64(1)
memory usage: 6.9 MB


## Comparative Model Analysis

Three different classification algorithms were trained on the preprocessed data to compare their effectiveness.

* **Logistic Regression:** Chosen as a simple, fast, and highly interpretable linear baseline model.

* **Support Vector Classifier (SVC):** A powerful model chosen for its ability to find complex, non-linear decision boundaries in high-dimensional feature spaces by using the "kernel trick".

* **Random Forest Classifier:** An ensemble, tree-based model. Chosen for its high performance, robustness to outliers, and its ability to capture complex, non-linear interactions between features (e.g., the relationship between the five color bands and `redshift`) without extensive feature engineering.
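A single train-test split gives one accuracy number per model. As a hedged alternative, the three candidates could also be compared with cross-validation, which gives a mean and a spread. The sketch below uses synthetic data from `make_classification` (an assumption, not the survey data), and the dataset size and fold count are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class stand-in data (assumption) with 8 features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=1,
    n_classes=3, random_state=42,
)

# Pipelines refit the scaler inside each fold, so validation folds
# never influence the scaling statistics.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, C=10)
    ),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The numbers this prints describe the synthetic data only; they say nothing about how the models rank on the real catalog.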

In [55]:
#now perform classification

In [56]:
x_data = stars_cleaned.drop(columns=['class'])
y_data = stars_cleaned['class']

In [57]:
x_data.head()

Unnamed: 0,alpha,delta,u,g,r,i,z,redshift
0,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,0.634794
1,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,0.779136
2,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,0.644195
3,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,0.932346
4,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,0.116123


In [58]:
y_data.head()

Unnamed: 0,class
0,0
1,0
2,0
3,0
4,0


In [59]:
from sklearn import model_selection, preprocessing

In [60]:
x_train,x_test,y_train,y_test = model_selection.train_test_split(x_data,y_data,test_size=0.3,stratify=y_data,random_state=42)

In [61]:
scaler = preprocessing.StandardScaler()

In [62]:
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [63]:
# Now start fitting models on this data

## Performing Logistic Regression

In [64]:
#logistic regression

In [65]:
from sklearn.linear_model import LogisticRegression

In [66]:
log_reg = LogisticRegression(max_iter = 1000,C=10)

In [67]:
log_reg.fit(x_train_scaled,y_train)

In [68]:
log_reg.score(x_test_scaled,y_test)

0.9594333333333334

## Performing Random Forest Classification

In [69]:
#Random forests

In [70]:
from sklearn import ensemble


In [71]:
random_forest = ensemble.RandomForestClassifier(random_state=42)

In [72]:
random_forest.fit(x_train_scaled,y_train)

In [73]:
random_forest.score(x_test_scaled,y_test)

0.9780333333333333

## Performing Support Vector Machine Classification

In [74]:
#using svm to classify

In [75]:
from sklearn.svm import SVC

In [76]:
svc = SVC()

In [77]:
svc.fit(x_train_scaled,y_train)

In [78]:
svc.score(x_test_scaled,y_test)

0.9596333333333333

## Results and Evaluation

The three trained models were evaluated on the unseen 30% test set. The results, measured by classification accuracy, are as follows:

| Model | Test Accuracy |
| :--- | :--- |
| Logistic Regression | 95.94% |
| Support Vector Machine (SVC)| 95.96% |
| **Random Forest** | **97.80%** |
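Because the classes are imbalanced, overall accuracy can hide weak performance on the smaller classes. A minimal sketch of a per-class evaluation is shown below. It uses synthetic data with roughly the same class proportions (an assumption), and the class names are attached to the integer labels for readability only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced 3-class data (assumption), roughly 59% / 19% / 22%.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_redundant=1,
    n_classes=3, weights=[0.59, 0.19, 0.22], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(
    y_test, y_pred, labels=[0, 1, 2],
    target_names=["GALAXY", "QSO", "STAR"],
))
```

In the notebook itself, the same two calls could be applied to `y_test` and the predictions of each fitted model.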

## Performance Analysis
The Random Forest Classifier achieved the highest test accuracy (97.80%), just under 2 percentage points above both the linear model (Logistic Regression, 95.94%) and the kernel-based model (SVC, 95.96%). Logistic Regression and SVC are effectively tied: 0.02 percentage points corresponds to about six objects in the 30,000-object test set.

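This gap suggests that the classes are not cleanly linearly separable in the scaled feature space, and that tree-based partitioning suits these features well. A plausible explanation is that the Random Forest's ensemble of decision trees captures interactions between the photometric magnitudes and redshift that the other two models miss. The evidence is indirect, however: the SVC was trained with default hyperparameters (an RBF kernel in scikit-learn) and still only matched the linear baseline, so non-linearity alone does not account for the difference. Tuning the SVC and inspecting per-class metrics would be needed before drawing firmer conclusions.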
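One way to probe which features a Random Forest relies on is its `feature_importances_` attribute. The sketch below uses random synthetic features whose labels are constructed to depend only on the last column (an assumption made purely so the result is easy to interpret). It does not reproduce or predict the importances on the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["alpha", "delta", "u", "g", "r", "i", "z", "redshift"]

rng = np.random.default_rng(0)
# Synthetic features (assumption): independent standard-normal columns.
X = rng.normal(size=(900, len(feature_names)))
# Synthetic labels built from the last column only, giving classes 0, 1, 2.
y = np.digitize(X[:, -1], bins=[-0.5, 0.5])

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1 across features.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>8}: {importance:.3f}")
```

Applied to the fitted `random_forest` from the notebook, the same ranking would show how much weight the model places on `redshift` relative to the magnitudes and coordinates.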

## Conclusion
This project built and evaluated a machine learning pipeline for the automated classification of celestial objects. The 97.8% test accuracy achieved by the Random Forest model suggests that tree-based ensemble methods are well suited to this common astronomical research task.