# Stellar Classification

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# SDSS Astronomical Dataset - Feature Description

This dataset consists of **100,000 astronomical observations** collected by the **Sloan Digital Sky Survey (SDSS)**. Each observation contains **17 features** describing the object and **1 class label** indicating whether the object is a **star**, **galaxy**, or **quasar**.

## 📦 Dataset Columns

| Column Name      | Description |
|------------------|-------------|
| `obj_ID`         | Object Identifier — a unique ID for the object in the image catalog used by the CAS |
| `alpha`          | Right Ascension angle (at J2000 epoch) |
| `delta`          | Declination angle (at J2000 epoch) |
| `u`              | Ultraviolet filter magnitude in the photometric system |
| `g`              | Green filter magnitude in the photometric system |
| `r`              | Red filter magnitude in the photometric system |
| `i`              | Near Infrared filter magnitude in the photometric system |
| `z`              | Infrared filter magnitude in the photometric system |
| `run_ID`         | Run Number — identifies the specific scan |
| `rerun_ID`       | Rerun Number — specifies how the image was processed |
| `cam_col`        | Camera column — identifies the scanline within the run |
| `field_ID`       | Field number — identifies each field |
| `spec_obj_ID`    | Spectroscopic Object ID — unique for optical spectroscopic observations<br>**Note:** Two different observations with the same `spec_obj_ID` must have the same class label |
| `class`          | Object class — one of: `GALAXY`, `STAR`, `QSO` (quasar) |
| `redshift`       | Redshift value — indicates the increase in wavelength |
| `plate`          | Plate ID — identifies each plate in the SDSS |
| `MJD`            | Modified Julian Date — indicates the date the data was taken |
| `fiber_ID`       | Fiber ID — identifies the fiber that pointed the light at the focal plane |
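
The note on `spec_obj_ID` above states an invariant that can be checked directly once the data is loaded. The sketch below is one possible way to do that; the helper name `inconsistent_spec_ids` and the toy frame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def inconsistent_spec_ids(df, id_col="spec_obj_ID", label_col="class"):
    """Return the IDs that appear with more than one distinct class label.

    Hypothetical helper: an empty list means the invariant holds.
    """
    labels_per_id = df.groupby(id_col)[label_col].nunique()
    return labels_per_id[labels_per_id > 1].index.tolist()


# Toy data for illustration only (not taken from the SDSS file).
toy = pd.DataFrame({
    "spec_obj_ID": [101, 101, 202, 202, 303],
    "class": ["STAR", "STAR", "GALAXY", "QSO", "QSO"],
})
conflicts = inconsistent_spec_ids(toy)
print(conflicts)  # ID 202 carries two different labels in the toy data
```

On the real data the same call would be `inconsistent_spec_ids(Stars_df)`, assuming the frame has been loaded under that name.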



In [9]:
# A forward slash works on Windows as well as on Linux and macOS
file_path = 'Data/star_classification.csv'
Stars_df = pd.read_csv(filepath_or_buffer=file_path)
Stars_df.head()

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,class,redshift,plate,MJD,fiber_ID
0,1.237661e+18,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,3606,301,2,79,6.543777e+18,GALAXY,0.634794,5812,56354,171
1,1.237665e+18,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,4518,301,5,119,1.176014e+19,GALAXY,0.779136,10445,58158,427
2,1.237661e+18,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,3606,301,2,120,5.1522e+18,GALAXY,0.644195,4576,55592,299
3,1.237663e+18,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,4192,301,3,214,1.030107e+19,GALAXY,0.932346,9149,58039,775
4,1.23768e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,8102,301,3,137,6.891865e+18,GALAXY,0.116123,6121,56187,842


In [10]:
Stars_df.count()

obj_ID         100000
alpha          100000
delta          100000
u              100000
g              100000
r              100000
i              100000
z              100000
run_ID         100000
rerun_ID       100000
cam_col        100000
field_ID       100000
spec_obj_ID    100000
class          100000
redshift       100000
plate          100000
MJD            100000
fiber_ID       100000
dtype: int64

# Data leakage considerations
Some variables might cause data leakage, such as:
...
Throughout modeling we consider the following variables:
`alpha`, `delta`, `u`, `g`, `r`, `i`, `z`, `redshift`
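
A minimal sketch of how this selection could be applied is shown below. The names `FEATURE_COLUMNS`, `TARGET_COLUMN`, and `split_features_target` are hypothetical, and the two sample rows reuse values from the `head()` output above, with a few columns omitted for brevity.

```python
import pandas as pd

# Columns kept for modeling, as listed above; all other columns are dropped.
FEATURE_COLUMNS = ["alpha", "delta", "u", "g", "r", "i", "z", "redshift"]
TARGET_COLUMN = "class"


def split_features_target(df, feature_columns=FEATURE_COLUMNS,
                          target_column=TARGET_COLUMN):
    """Split a frame into a feature matrix X and a label vector y.

    Hypothetical helper: raises KeyError if an expected column is absent.
    """
    expected = [*feature_columns, target_column]
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    X = df[list(feature_columns)].copy()
    y = df[target_column].copy()
    return X, y


# Two rows copied from the head() output, used only as a demonstration.
sample = pd.DataFrame({
    "obj_ID": [1.237661e18, 1.23768e18],
    "alpha": [135.689107, 345.282593],
    "delta": [32.494632, 21.183866],
    "u": [23.87882, 19.43718],
    "g": [22.2753, 17.58028],
    "r": [20.39501, 16.49747],
    "i": [19.16573, 15.97711],
    "z": [18.79371, 15.54461],
    "class": ["GALAXY", "GALAXY"],
    "redshift": [0.634794, 0.116123],
    "plate": [5812, 6121],
})
X, y = split_features_target(sample)
print(X.columns.tolist())
```

With the full dataset the call would be `split_features_target(Stars_df)`. Keeping the column list in one constant makes it easy to revisit the choice if the leakage analysis changes.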