<a href="https://colab.research.google.com/github/Dansah2/Sloan-Digital-Sky-Survey---DR18/blob/main/EDA_Sloan_Digital_Sky_Survey_DR18.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sloan Digital Sky Survey - DR18

This dataset consists of 100,000 observations from Data Release (DR) 18 of the Sloan Digital Sky Survey (SDSS). Each observation is described by 42 features and 1 class column that labels it as one of:

a STAR
a GALAXY
a QSO (Quasi-Stellar Object, i.e. a quasar)

Kaggle Dataset Download API Command:

kaggle datasets download -d diraf0/sloan-digital-sky-survey-dr18

#Project Outline:
1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data for ML training

4) Set appropriate weights

5) Create and Train model

##Download / Read the Dataset
1) Install required libraries

2) Import required libraries

3) Download / Read data from Kaggle

###Install required libraries

In [None]:
!pip install -q -U kaggle
!pip install -q -U numpy

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.25.2 which is incompatible.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.25.2 which is incompatible.

###Import required libraries

In [None]:
# handling data
import numpy as np
import pandas as pd

# graphing data
pd.options.plotting.backend = "plotly"
import plotly.graph_objects as go
import plotly.express as px

# downloading data
from google.colab import drive

# feature exploration
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
# Mount google drive to store Kaggle API for future use
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# create the directory where the Kaggle CLI looks for credentials in this Colab session
! mkdir -p ~/.kaggle

In [None]:
# upload the kaggle.json file to Google Drive, then copy it to the temporary location
!cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [None]:
# restrict the file permissions to read/write for the owner only
! chmod 600 ~/.kaggle/kaggle.json
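
The three shell steps above (create the directory, copy the token, restrict permissions) can also be sketched in plain Python. This is a minimal sketch, not the notebook's method: `install_kaggle_credentials` is a hypothetical helper, and the demonstration uses a throwaway directory with a placeholder token rather than a real Drive path.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_credentials(source_json, kaggle_dir):
    """Copy a Kaggle API token into kaggle_dir and restrict it to the owner."""
    kaggle_dir = Path(kaggle_dir)
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # like `mkdir -p`
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(source_json, target)            # like `cp`
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like `chmod 600`
    return target


# Demonstration in a temporary directory with a placeholder token (assumption).
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "drive" / "kaggle.json"
    fake_token.parent.mkdir()
    fake_token.write_text('{"username": "example", "key": "placeholder"}')

    installed = install_kaggle_credentials(fake_token, Path(tmp) / ".kaggle")
    installed_mode = stat.S_IMODE(installed.stat().st_mode)
    print(installed.name, oct(installed_mode))
```

In Colab, the call would presumably take the Drive path of the token and `Path.home() / ".kaggle"` as arguments.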

In [None]:
! kaggle datasets download -d diraf0/sloan-digital-sky-survey-dr18

Downloading sloan-digital-sky-survey-dr18.zip to /content
100% 14.7M/14.7M [00:00<00:00, 92.6MB/s]


In [None]:
! unzip sloan-digital-sky-survey-dr18.zip

Archive:  sloan-digital-sky-survey-dr18.zip
  inflating: SDSS_DR18.csv           


In [None]:
def read_function(csv_file):
    return pd.read_csv(csv_file)

raw_data = read_function('SDSS_DR18.csv')


##Explore/Analyze the Data
1) Obtain basic information about the dataset.

2) Visualize the data.

3) Make observations about the data.

Note: there are no null values or duplicated rows. The `int64` dtype printed with the null counts describes the count Series, not the feature columns.

In [None]:
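def exp_data_cols(data_frame):

  # list every column name
  print(f'Column Names: \n{data_frame.columns}')

  # count missing values per column
  print(f'\nNull Values: \n{data_frame.isna().sum()}\n')

  # show any fully duplicated rows (an empty frame means there are none)
  print(f'\nDuplicated values:{data_frame.loc[data_frame.duplicated()]}\n')

exp_data_cols(raw_data)

Column Names: 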
Index(['objid', 'specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run',
       'rerun', 'camcol', 'field', 'plate', 'mjd', 'fiberid', 'petroRad_u',
       'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u',
       'petroFlux_g', 'petroFlux_i', 'petroFlux_r', 'petroFlux_z',
       'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r', 'petroR50_z',
       'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u',
       'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z', 'redshift', 'class'],
      dtype='object')

Null Values: 
objid          0
specobjid      0
ra             0
dec            0
u              0
g              0
r              0
i              0
z              0
run            0
rerun          0
camcol         0
field          0
plate          0
mjd            0
fiberid        0
petroRad_u     0
petroRad_g     0
petroRad_i     0
petroRad_r     0
petroRad_z     0
petroFlux_u    0
petroFlux_g    0
petroFlux_i    0
petroFlux_r    0
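
The null and duplicate checks above can also be condensed into a few summary numbers, which is easier to scan than a per-column listing. The following is a minimal sketch on a toy frame, not on the SDSS data; `summarize_quality` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd


def summarize_quality(data_frame):
    """Return total null cells, duplicated rows, and a count of columns per dtype."""
    return {
        "null_cells": int(data_frame.isna().sum().sum()),
        "duplicate_rows": int(data_frame.duplicated().sum()),
        "dtype_counts": data_frame.dtypes.astype(str).value_counts().to_dict(),
    }


# Toy data (assumption): one missing redshift and one repeated row.
toy = pd.DataFrame({
    "objid": [1, 2, 2, 3],
    "redshift": [0.1, 0.5, 0.5, None],
    "class": ["STAR", "QSO", "QSO", "GALAXY"],
})

summary = summarize_quality(toy)
print(summary)
```

Calling `summarize_quality(raw_data)` would report the same three quantities for the full dataset, including how many columns share each dtype.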

The classes are imbalanced: GALAXY accounts for roughly 52% of the observations, STAR for 37%, and QSO for only 10%. I will use class weighting to address this issue.

Data: [100000 rows, 43 columns]

In [None]:
def exp_graph_data(data_frame, target_col_name=None):

  print(f"Data shape: {data_frame.shape}\n")

  print(f'Column Names: {list(data_frame.columns)}\n')

  if target_col_name:
    class_counts = data_frame[target_col_name].value_counts()

    print(f'Label Count:\n{class_counts}')

    fig = go.Figure(go.Bar(x=class_counts.index,
                           y=class_counts.values))

    fig.update_layout(xaxis_title_text='Classes',
                      yaxis_title_text='Count',
                      title_text='Count of Each Class')
    fig.show()

exp_graph_data(raw_data, 'class')

Data shape: (100000, 43)

Column Names: ['objid', 'specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run', 'rerun', 'camcol', 'field', 'plate', 'mjd', 'fiberid', 'petroRad_u', 'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u', 'petroFlux_g', 'petroFlux_i', 'petroFlux_r', 'petroFlux_z', 'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r', 'petroR50_z', 'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u', 'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z', 'redshift', 'class']

Label Count:
GALAXY    52343
STAR      37232
QSO       10425
Name: class, dtype: int64
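
Given the label counts above, one common way to derive class weights is the "balanced" heuristic, where each class gets `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the counts copied from the output; it is an assumption about how the weights might be set, not necessarily the scheme used later in this project, and `balanced_class_weights` is a hypothetical helper.

```python
# Label counts copied from the output above.
class_counts = {"GALAXY": 52343, "STAR": 37232, "QSO": 10425}


def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for every class label."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


class_weights = balanced_class_weights(class_counts)

# Rarer classes receive larger weights.
for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights every class contributes the same total weight to the loss, so the rare QSO class is not drowned out by GALAXY and STAR.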


`redshift` appears to have the highest feature importance.

In [None]:
# Feature Importance
def tree_classifier(data_frame, target, num_desired_features):
  # create X and y variables
  y = data_frame[target]
  X = data_frame.drop(columns=target)

  # fit an extra-trees ensemble; importances vary slightly between runs
  classifier = ExtraTreesClassifier()
  classifier.fit(X, y)
  print(classifier.feature_importances_)

  # plot the largest feature importances for better visualization
  feat_importances = pd.Series(classifier.feature_importances_, index=X.columns)
  feat_importances = feat_importances.nlargest(num_desired_features)

  fig = px.bar(feat_importances, orientation='h',
               labels={'index': 'Feature', 'value': 'Importance'},
               title='Top Feature Importances')

  fig.show()

tree_classifier(raw_data, 'class', 15)

[0.         0.06462703 0.00236475 0.00197371 0.01206864 0.02988212
 0.0323123  0.02032082 0.02292582 0.00314409 0.         0.00120775
 0.00158848 0.06640938 0.08102201 0.00253383 0.02025517 0.02224899
 0.02050184 0.02913162 0.00805003 0.00572019 0.01002227 0.0134776
 0.0095507  0.00953435 0.00893472 0.02061961 0.01153497 0.01690973
 0.00886422 0.08110723 0.0526524  0.07120512 0.02241948 0.01251406
 0.02322377 0.01885766 0.022074   0.01301794 0.0140027  0.11118892]
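
The printed array is unlabeled, so pairing it with the column names makes it easier to read. This is a minimal sketch with the column order and values copied from the outputs above, assuming the array follows the order of `X.columns` (all columns except `class`); `rank_importances` is a hypothetical helper.

```python
# Feature columns in the order printed earlier (everything except 'class').
columns = [
    "objid", "specobjid", "ra", "dec", "u", "g", "r", "i", "z", "run",
    "rerun", "camcol", "field", "plate", "mjd", "fiberid",
    "petroRad_u", "petroRad_g", "petroRad_i", "petroRad_r", "petroRad_z",
    "petroFlux_u", "petroFlux_g", "petroFlux_i", "petroFlux_r", "petroFlux_z",
    "petroR50_u", "petroR50_g", "petroR50_i", "petroR50_r", "petroR50_z",
    "psfMag_u", "psfMag_r", "psfMag_g", "psfMag_i", "psfMag_z",
    "expAB_u", "expAB_g", "expAB_r", "expAB_i", "expAB_z", "redshift",
]

# Importances copied from the printed array above.
importances = [
    0.0, 0.06462703, 0.00236475, 0.00197371, 0.01206864, 0.02988212,
    0.0323123, 0.02032082, 0.02292582, 0.00314409, 0.0, 0.00120775,
    0.00158848, 0.06640938, 0.08102201, 0.00253383, 0.02025517, 0.02224899,
    0.02050184, 0.02913162, 0.00805003, 0.00572019, 0.01002227, 0.0134776,
    0.0095507, 0.00953435, 0.00893472, 0.02061961, 0.01153497, 0.01690973,
    0.00886422, 0.08110723, 0.0526524, 0.07120512, 0.02241948, 0.01251406,
    0.02322377, 0.01885766, 0.022074, 0.01301794, 0.0140027, 0.11118892,
]


def rank_importances(names, values, top_n):
    """Pair each feature with its importance and return the top_n largest."""
    ranked = sorted(zip(names, values), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


top_five = rank_importances(columns, importances, 5)
for name, value in top_five:
    print(f"{name:<12} {value:.4f}")
```

In this run, observation-metadata columns such as `mjd` and `plate` also rank in the top five, which may be worth keeping in mind when selecting features for training.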
