# **Stellar Dataset**

### **Data Description:**
In astronomy, stellar classification is the classification of stars based on their spectral characteristics. The classification scheme of galaxies, quasars, and stars is one of the most fundamental in astronomy. Early cataloguing of stars and their distribution in the sky led to the understanding that they make up our own galaxy. Once Andromeda was recognized as a galaxy separate from our own, numerous galaxies began to be surveyed as more powerful telescopes were built. This dataset aims to classify stars, galaxies, and quasars based on their spectral characteristics.


### **Data Content:**
The data consists of observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and 1 class column, which identifies it as a star, galaxy, or quasar.

### **Data Dictionary:**
- obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
- alpha = Right Ascension angle (at J2000 epoch)
- delta = Declination angle (at J2000 epoch)
- u = Ultraviolet filter in the photometric system
- g = Green filter in the photometric system
- r = Red filter in the photometric system
- i = Near Infrared filter in the photometric system
- z = Infrared filter in the photometric system
- run_ID = Run Number used to identify the specific scan
- rerun_ID = Rerun Number to specify how the image was processed
- cam_col = Camera column to identify the scanline within the run
- field_ID = Field number to identify each field
- spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
- class = object class (galaxy, star or quasar object); this is the target variable
- redshift = redshift value based on the increase in wavelength
- plate = plate ID, identifies each plate in SDSS
- MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
- fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation

In [48]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.tree import export_graphviz
from io import StringIO # standard-library StringIO; no need for the third-party six package
from IPython.display import Image
import pydotplus


In [5]:
df = pd.read_csv("/workspaces/StellarDatasetML/data/stellar.csv")
df.head()

|  | Unnamed: 0 | obj_ID | alpha | delta | u | g | r | i | z | run_ID | rerun_ID | cam_col | field_ID | spec_obj_ID | class | redshift | plate | MJD | fiber_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41029 | 1.237649e+18 | 194.748212 | -0.911226 | 25.77469 | 22.72579 | 20.84263 | 19.80384 | 19.29726 | 756 | 301 | 1 | 527 | 4.271919e+18 | GALAXY | 0.52577 | 3794 | 55241 | 926 |
| 1 | 42888 | 1.237661e+18 | 140.525977 | 35.614836 | 21.94718 | 21.31617 | 20.21319 | 19.45814 | 19.09832 | 3560 | 301 | 4 | 221 | 5.22871e+18 | GALAXY | 0.439029 | 4644 | 55922 | 111 |
| 2 | 82610 | 1.237658e+18 | 125.922894 | 38.044046 | 23.47268 | 21.3439 | 19.41544 | 18.67742 | 18.14655 | 2822 | 301 | 2 | 135 | 4.233595e+18 | GALAXY | 0.414493 | 3760 | 55268 | 770 |
| 3 | 89586 | 1.237664e+18 | 18.634831 | 0.468756 | 20.03793 | 18.13051 | 17.21534 | 16.80004 | 16.48915 | 4263 | 301 | 5 | 240 | 1.217236e+18 | GALAXY | 0.091736 | 1081 | 52531 | 503 |
| 4 | 14627 | 1.237666e+18 | 52.832458 | 1.215699 | 20.72916 | 20.34843 | 20.11169 | 19.75053 | 19.74247 | 4849 | 301 | 6 | 807 | 8.02867e+17 | QSO | 1.562706 | 713 | 52178 | 365 |


In [6]:
df.shape

(80000, 19)

In [None]:
df.columns

Index(['Unnamed: 0', 'obj_ID', 'alpha', 'delta', 'u', 'g', 'r', 'i', 'z',
       'run_ID', 'rerun_ID', 'cam_col', 'field_ID', 'spec_obj_ID', 'class',
       'redshift', 'plate', 'MJD', 'fiber_ID'],
      dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   80000 non-null  int64  
 1   obj_ID       80000 non-null  float64
 2   alpha        80000 non-null  float64
 3   delta        80000 non-null  float64
 4   u            80000 non-null  float64
 5   g            80000 non-null  float64
 6   r            80000 non-null  float64
 7   i            80000 non-null  float64
 8   z            80000 non-null  float64
 9   run_ID       80000 non-null  int64  
 10  rerun_ID     80000 non-null  int64  
 11  cam_col      80000 non-null  int64  
 12  field_ID     80000 non-null  int64  
 13  spec_obj_ID  80000 non-null  float64
 14  class        80000 non-null  object 
 15  redshift     80000 non-null  float64
 16  plate        80000 non-null  int64  
 17  MJD          80000 non-null  int64  
 18  fiber_ID     80000 non-null  int64  
dtypes: float64(10), int64(8), object(1)

In [22]:
df['fiber_ID'] = df['fiber_ID'].astype(str)
df['spec_obj_ID'] = df['spec_obj_ID'].astype(str)
df['field_ID'] = df['field_ID'].astype(str)
df['rerun_ID'] = df['rerun_ID'].astype(str)
df['run_ID'] = df['run_ID'].astype(str)
df['obj_ID'] = df['obj_ID'].astype(str)
df['plate'] = df['plate'].astype(str)
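
The per-column conversions above can also be written as a single `astype` call with a mapping. A minimal sketch on a toy frame, assuming the goal is only to mark identifier-like columns as labels rather than quantities; the `toy` frame and the `id_columns` name are illustrative and not part of the notebook:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "plate": [3794, 4644],
    "fiber_ID": [926, 111],
    "redshift": [0.52577, 0.439029],
})

# Identifier-like columns to convert; everything else keeps its dtype.
id_columns = ["plate", "fiber_ID"]
toy = toy.astype({col: str for col in id_columns})

print(toy.dtypes)
```

Keeping the column names in one list makes it easier to see at a glance which fields are treated as identifiers.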

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   obj_ID       80000 non-null  object 
 1   alpha        80000 non-null  float64
 2   delta        80000 non-null  float64
 3   u            80000 non-null  float64
 4   g            80000 non-null  float64
 5   r            80000 non-null  float64
 6   i            80000 non-null  float64
 7   z            80000 non-null  float64
 8   run_ID       80000 non-null  object 
 9   rerun_ID     80000 non-null  object 
 10  cam_col      80000 non-null  int64  
 11  field_ID     80000 non-null  object 
 12  spec_obj_ID  80000 non-null  object 
 13  class        80000 non-null  object 
 14  redshift     80000 non-null  float64
 15  plate        80000 non-null  object 
 16  MJD          80000 non-null  int64  
 17  fiber_ID     80000 non-null  object 
dtypes: float64(8), int64(2), object(8)
memory usag

In [24]:
df.duplicated().sum()

0

In [25]:
df.isnull().sum().sum()

0

In [27]:
# errors='ignore' keeps this cell safe to re-run once the column is already gone
df.drop(columns='Unnamed: 0', errors='ignore', inplace=True)

In [28]:
df.columns

Index(['obj_ID', 'alpha', 'delta', 'u', 'g', 'r', 'i', 'z', 'run_ID',
       'rerun_ID', 'cam_col', 'field_ID', 'spec_obj_ID', 'class', 'redshift',
       'plate', 'MJD', 'fiber_ID'],
      dtype='object')

In [73]:
column_list = ['u', 'g', 'i', 'z','redshift', 'plate']


In [74]:
x = df[column_list]
y = df["class"]

print(x)
print(y)

              u         g         i         z  redshift  plate
0      25.77469  22.72579  19.80384  19.29726  0.525770   3794
1      21.94718  21.31617  19.45814  19.09832  0.439029   4644
2      23.47268  21.34390  18.67742  18.14655  0.414493   3760
3      20.03793  18.13051  16.80004  16.48915  0.091736   1081
4      20.72916  20.34843  19.75053  19.74247  1.562706    713
...         ...       ...       ...       ...       ...    ...
79995  22.37756  21.50994  18.78385  18.34169  0.538937   7089
79996  18.02005  17.03582  16.44467  16.35642 -0.001132   1886
79997  25.69788  23.32481  20.35131  19.68414  0.765474  11122
79998  18.29851  17.99857  17.72803  17.81232  1.501470   2970
79999  21.45405  19.93726  17.59074  17.21519  0.323639   2504

[80000 rows x 6 columns]
0        GALAXY
1        GALAXY
2        GALAXY
3        GALAXY
4           QSO
          ...  
79995    GALAXY
79996      STAR
79997    GALAXY
79998       QSO
79999    GALAXY
Name: class, Length: 80000, dtype: object


In [75]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1) # 70% training and 30% test
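
If the three classes turn out to be unevenly represented, a plain random split can shift their proportions between the training and test sets. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels; the 60/20/20 class mix and the `X_toy`/`y_toy` names are assumptions for illustration, not figures from this dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% GALAXY, 20% QSO, 20% STAR.
rng = np.random.default_rng(0)
y_toy = pd.Series(["GALAXY"] * 600 + ["QSO"] * 200 + ["STAR"] * 200, name="class")
X_toy = pd.DataFrame({"redshift": rng.normal(size=1000)})

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(y_te.value_counts(normalize=True))
```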

In [76]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train,y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)

In [77]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9647083333333333
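
A single accuracy figure does not show which classes get confused with one another. A per-class view can be obtained with `confusion_matrix` and `classification_report` from `sklearn.metrics`. The sketch below runs on synthetic three-class data from `make_classification`, so its numbers say nothing about the stellar data; names such as `toy_clf` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the stellar features.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

toy_clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
toy_pred = toy_clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, toy_pred)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_te, toy_pred))
```

In the notebook itself, the same two calls would take `y_test` and `y_pred` in place of the toy variables.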


In [None]:
dot_data = StringIO()

# class_names must give the target classes in sorted order, as clf.classes_ does
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                feature_names=column_list, class_names=list(clf.classes_))


graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

graph.write_png('Stellar.png')

Image(graph.create_png())

In [79]:
from sklearn.model_selection import GridSearchCV


In [80]:
params = {'max_leaf_nodes': list(range(2,50)), 'min_samples_split':[2,3,4]}

grid_search_cv = GridSearchCV(DecisionTreeClassifier(criterion='gini',random_state=42), params, verbose=1,cv=3)

grid_search_cv.fit(X_train,y_train)

Fitting 3 folds for each of 144 candidates, totalling 432 fits


GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=42),
             param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                            13, 14, 15, 16, 17, 18, 19, 20, 21,
                                            22, 23, 24, 25, 26, 27, 28, 29, 30,
                                            31, ...],
                         'min_samples_split': [2, 3, 4]},
             verbose=1)

In [81]:
grid_search_cv.best_estimator_


DecisionTreeClassifier(max_leaf_nodes=43, random_state=42)

In [82]:
params = {'max_depth': list(range(3,30)), 'min_samples_split':[2,3,4]}

grid_search_cv = GridSearchCV(DecisionTreeClassifier(criterion='gini',random_state=42, max_leaf_nodes =43), params, verbose=1,cv=3)

grid_search_cv.fit(X_train,y_train)

Fitting 3 folds for each of 81 candidates, totalling 243 fits


GridSearchCV(cv=3,
             estimator=DecisionTreeClassifier(max_leaf_nodes=43,
                                              random_state=42),
             param_grid={'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                                       15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
                                       25, 26, 27, 28, 29],
                         'min_samples_split': [2, 3, 4]},
             verbose=1)

In [83]:
grid_search_cv.best_estimator_


DecisionTreeClassifier(max_depth=12, max_leaf_nodes=43, random_state=42)
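
The two searches above tune `max_leaf_nodes` first and `max_depth` second, so combinations of the two are never compared directly. One alternative is a single grid over both parameters. The sketch below uses synthetic data and a deliberately small grid so it runs quickly; the grid values and the `joint_search` name are assumptions, not settings from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for X_train / y_train.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Small illustrative grid: 3 * 3 * 2 = 18 candidate combinations.
joint_params = {
    "max_leaf_nodes": [5, 10, 20],
    "max_depth": [3, 6, 12],
    "min_samples_split": [2, 4],
}

joint_search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    joint_params,
    cv=3,
)
joint_search.fit(X_toy, y_toy)

print(joint_search.best_params_)
```

The cost grows with the product of the grid sizes, so a joint search over the full ranges used above would need far more fits than the two sequential searches.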

In [84]:
clf1 = DecisionTreeClassifier(criterion='gini',random_state=42, max_leaf_nodes =43, max_depth=12 )
clf1 = clf1.fit(X_train,y_train)

In [85]:
y_pred = clf1.predict(X_test)

In [86]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9746666666666667


In [87]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf1.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf1.score(X_test, y_test)))

Training set score: 0.9779
Test set score: 0.9747


In [92]:
# Note: clf is the first, untuned tree; use clf1.feature_importances_ to inspect the tuned model
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(clf.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)
importances

|  | feature | importance |
|---|---|---|
| 4 | redshift | 0.884 |
| 1 | g | 0.048 |
| 0 | u | 0.023 |
| 2 | i | 0.016 |
| 5 | plate | 0.015 |
| 3 | z | 0.013 |
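
Impurity-based importances like those above are computed from the training data. One way to cross-check them is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data; the column names are borrowed from the notebook only to label the output, and the values carry no astronomical meaning:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Column names reused for display only; the data itself is synthetic.
feature_names = ["u", "g", "i", "z", "redshift", "plate"]
X_arr, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=feature_names)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

toy_clf = DecisionTreeClassifier(max_leaf_nodes=43, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in test accuracy.
result = permutation_importance(toy_clf, X_te, y_te, n_repeats=5, random_state=0)
perm = pd.Series(result.importances_mean, index=feature_names)
perm = perm.sort_values(ascending=False)

print(perm)
```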


In [93]:
dot_data = StringIO()

# class_names must follow the sorted class order (GALAXY, QSO, STAR), which clf1.classes_ provides
export_graphviz(clf1, out_file=dot_data, filled=True, rounded=True,
                feature_names=column_list, class_names=list(clf1.classes_))


graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

graph.write_png('Stellar.png')

Image(graph.create_png())

InvocationException: GraphViz's executables not found
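
The error above means the GraphViz binaries are not installed on the machine, so `pydotplus` cannot render the tree. Until they are installed, the tree can still be inspected as text with `sklearn.tree.export_text`, which needs no external program. The sketch below trains a shallow tree on synthetic data; in the notebook, `clf1` and `column_list` would take the place of `toy_clf` and `feature_names`:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names reused for display only; the data itself is synthetic.
feature_names = ["u", "g", "i", "z", "redshift", "plate"]
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

toy_clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)

# Plain-text rendering of the decision rules; no GraphViz required.
rules = export_text(toy_clf, feature_names=feature_names)
print(rules)
```

`sklearn.tree.plot_tree` is another option when matplotlib is available and a graphical view is preferred.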