<a href="https://colab.research.google.com/github/Dansah2/Sloan-Digital-Sky-Survey---DR18/blob/main/Preprocess_Sloan_Digital_Sky_Survey_DR18.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sloan Digital Sky Survey - DR18

This dataset consists of 100,000 observations from Data Release 18 (DR18) of the Sloan Digital Sky Survey (SDSS). Each observation is described by 42 features and 1 class column that labels the observation as one of:

a STAR
a GALAXY
a QSO (Quasi-Stellar Object), also known as a quasar

Kaggle Dataset Download API Command:

kaggle datasets download -d diraf0/sloan-digital-sky-survey-dr18

#Project Outline:
1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data for ML training

4) Set appropriate weights

5) Create and Train model
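
For step 4, one common way to derive class weights for an imbalanced three-class target is scikit-learn's `compute_class_weight` with the `'balanced'` heuristic. The sketch below is only an illustration under that assumption: the labels are toy values, not counts from this dataset, and the project itself may weight the classes differently.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy integer labels for three classes; made up for illustration, not taken from the dataset.
toy_labels = np.array([0] * 6 + [1] * 1 + [2] * 3)

classes = np.unique(toy_labels)

# 'balanced' computes n_samples / (n_classes * count_per_class) for each class
weights = compute_class_weight(class_weight='balanced', classes=classes, y=toy_labels)

# Many training APIs accept the weights as a {class_index: weight} dictionary.
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)
```

Rarer classes receive larger weights, so a misclassified example from a small class costs more during training.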

##Download / Read the Dataset
1) Install required libraries

2) Import required libraries

3) Download / Read data from Kaggle

###Install required libraries

In [None]:
!pip install -q -U kaggle
!pip install -q -U scikit-learn
!pip install -q -U numpy

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.26.0 which is incompatible.
tensorflow 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 1.26.0 which is incompatible.

###Import required libraries

In [None]:
# handling data
import numpy as np
import pandas as pd

# downloading data
from google.colab import drive

# label encoding / scaling
from sklearn.preprocessing import LabelEncoder, StandardScaler

# graphing data
import plotly.graph_objects as go

In [None]:
# Mount Google Drive, where the Kaggle API token (kaggle.json) is kept for future use
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# create the ~/.kaggle directory in the temporary Colab instance (-p avoids an error on re-run)
! mkdir -p ~/.kaggle

In [None]:
# upload the kaggle.json file to Google Drive, then copy it to the temporary location
!cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [None]:
# restrict the file permissions to read/write for the owner only
! chmod 600 ~/.kaggle/kaggle.json
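
The three shell cells above (create `~/.kaggle`, copy `kaggle.json`, restrict its permissions) can also be done in one re-runnable Python step. The following is a minimal sketch rather than part of the original notebook: `install_kaggle_token` is a hypothetical helper name, and the demonstration uses a throwaway directory and a placeholder token instead of real credentials.

```python
import os
import shutil
import stat
import tempfile


def install_kaggle_token(token_path, kaggle_dir):
    """Copy a Kaggle API token into kaggle_dir and make it owner read/write only."""
    os.makedirs(kaggle_dir, exist_ok=True)  # like `mkdir -p`: no error on re-run
    dest = os.path.join(kaggle_dir, 'kaggle.json')
    shutil.copyfile(token_path, dest)
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)  # same effect as `chmod 600`
    return dest


# Demonstration with temporary paths and a placeholder token (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    fake_token = os.path.join(tmp, 'kaggle.json')
    with open(fake_token, 'w') as fh:
        fh.write('{"username": "example", "key": "placeholder"}')

    dest = install_kaggle_token(fake_token, os.path.join(tmp, '.kaggle'))
    mode = stat.S_IMODE(os.stat(dest).st_mode)
    print(oct(mode))
```

In Colab the call would take the Drive path of the token and `os.path.expanduser('~/.kaggle')` as arguments.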

In [None]:
! kaggle datasets download -d diraf0/sloan-digital-sky-survey-dr18

Downloading sloan-digital-sky-survey-dr18.zip to /content
 95% 14.0M/14.7M [00:01<00:00, 22.3MB/s]
100% 14.7M/14.7M [00:01<00:00, 15.2MB/s]


In [None]:
! unzip sloan-digital-sky-survey-dr18.zip

Archive:  sloan-digital-sky-survey-dr18.zip
  inflating: SDSS_DR18.csv           


In [None]:
def read_function(csv_file):
    return pd.read_csv(csv_file)

raw_data = read_function('SDSS_DR18.csv')
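
Right after loading, it helps to confirm that the frame matches the description at the top (100,000 rows, 42 features plus the `class` column) and to look at the class balance. The sketch below shows such checks on a tiny made-up archive so that it runs on its own; the file name `toy_sdss.zip` and its rows are illustrative assumptions. It also shows that `pd.read_csv` can read a zip archive directly when the archive holds a single CSV, which makes a separate unzip step optional.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive; the rows are made up for illustration only.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, 'toy_sdss.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr(
        'toy_sdss.csv',
        'ra,dec,redshift,class\n'
        '1.0,2.0,0.1,GALAXY\n'
        '3.0,4.0,1.5,QSO\n'
        '5.0,6.0,0.0,STAR\n',
    )

# read_csv infers zip compression from the extension when the archive holds one file
toy = pd.read_csv(zip_path)

print(toy.shape)                    # (rows, columns)
print(toy['class'].value_counts())  # class balance
print(int(toy.isna().sum().sum()))  # total number of missing values
```

On the real data, the same three checks applied to `raw_data` would be expected to report a shape of `(100000, 43)` if the dataset description holds.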


##Preprocess and organize the data for ML training
1) Label Encoding

2) Drop unnecessary columns

3) Save Preprocessed data to Google Drive

####Label Encoding

In [None]:
# label encoding
def encode_label(data_frame, target, new_col_name):
  # creating instance of labelencoder
  labelencoder = LabelEncoder()

  #assign numerical values and store in another column
  data_frame[new_col_name] = labelencoder.fit_transform(data_frame[target])

  return data_frame

raw_data = encode_label(raw_data, 'class', 'e_class')

In [None]:
# compare the original 'class' feature to the label encoded 'e_class' feature
raw_data.iloc[19:22, -2:]

     class  e_class
19    STAR        2
20     QSO        1
21  GALAXY        0
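
The codes in the output above follow from how `LabelEncoder` works: it sorts the distinct labels and numbers them from 0, so GALAXY, QSO, STAR become 0, 1, 2. The sketch below, using a small made-up frame, shows how to read that mapping from the fitted encoder's `classes_` attribute and how to decode predictions with `inverse_transform`. Note that `encode_label` as written discards its encoder; keeping a reference to it is an assumption of this sketch.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small made-up frame with the three class names used in the dataset.
toy = pd.DataFrame({'class': ['STAR', 'QSO', 'GALAXY', 'STAR']})

encoder = LabelEncoder()
toy['e_class'] = encoder.fit_transform(toy['class'])

# classes_ is sorted, so position i holds the label that was encoded as i
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels
decoded = encoder.inverse_transform(toy['e_class'])
print(list(decoded))
```

Returning the encoder alongside the frame would let later steps decode model predictions without hard-coding the mapping.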


In [None]:
# confirm the names of each column
raw_data.columns

Index(['objid', 'specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run',
       'rerun', 'camcol', 'field', 'plate', 'mjd', 'fiberid', 'petroRad_u',
       'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u',
       'petroFlux_g', 'petroFlux_i', 'petroFlux_r', 'petroFlux_z',
       'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r', 'petroR50_z',
       'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u',
       'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z', 'redshift', 'class',
       'e_class'],
      dtype='object')

####Drop Unnecessary Columns

In [None]:
# drop columns that are not needed for training
def drop_columns(data_frame, col_name):
  try:
    data_frame = data_frame.drop(columns=col_name)
    print(f'The remaining columns are: {data_frame.columns}')
  except KeyError:
    # drop() raises KeyError if a column is missing, e.g. when this cell is re-run
    print(f'The column(s) have already been dropped, the remaining are: {data_frame.columns}')
  # return the frame in both cases so the caller never receives None
  return data_frame

raw_data = drop_columns(raw_data, ['class', 'objid', 'run', 'rerun', 'camcol', 'field', 'fiberid'])

The remaining columns are: Index(['specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'plate', 'mjd',
       'petroRad_u', 'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z',
       'petroFlux_u', 'petroFlux_g', 'petroFlux_i', 'petroFlux_r',
       'petroFlux_z', 'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r',
       'petroR50_z', 'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i',
       'psfMag_z', 'expAB_u', 'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z',
       'redshift', 'e_class'],
      dtype='object')
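
An alternative to wrapping the drop in `try`/`except` is the `errors='ignore'` option of `DataFrame.drop`, which skips labels that are not present. The sketch below uses a small made-up frame to show that this makes the step safe to run more than once; the column values are illustrative only.

```python
import pandas as pd

# Small made-up frame; only the column names echo the dataset.
toy = pd.DataFrame({'objid': [1, 2], 'run': [10, 11], 'redshift': [0.1, 1.5]})
to_drop = ['objid', 'run']

# errors='ignore' skips labels that do not exist instead of raising KeyError
once = toy.drop(columns=to_drop, errors='ignore')
twice = once.drop(columns=to_drop, errors='ignore')  # second call is a no-op

print(list(once.columns))
print(list(twice.columns))
```

The trade-off is that a misspelled column name is also ignored silently, so the explicit `try`/`except` version gives more feedback.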


####Save Preprocessed data to Google Drive
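
The export in this step writes into a `Sloan_Sky_Survey` folder on Drive, and `to_csv` fails if that folder does not exist. The following is a minimal sketch of a save-then-verify step that creates the folder first and compares the reloaded frame with the original instead of printing it. `save_and_verify` is a hypothetical helper name, and the demonstration writes a small made-up frame to a temporary directory rather than to Drive.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(data_frame, csv_path):
    """Write data_frame to csv_path, reload it, and compare shape and column names."""
    os.makedirs(os.path.dirname(csv_path), exist_ok=True)  # create the target folder if missing
    data_frame.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)
    same_shape = reloaded.shape == data_frame.shape
    same_columns = list(reloaded.columns) == list(data_frame.columns)
    return same_shape and same_columns


# Demonstration with a made-up frame and a temporary directory.
toy = pd.DataFrame({'ra': [184.95, 185.73], 'redshift': [0.1, 1.5], 'e_class': [0, 1]})
with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(toy, os.path.join(tmp, 'Sloan_Sky_Survey', 'train_df.csv'))
print(ok)
```

This check covers only shape and column names; it does not compare individual values.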

In [None]:
# export the data
raw_data.to_csv('/content/drive/My Drive/Sloan_Sky_Survey/train_df.csv', index=False)

#verify it was exported
df_verify = pd.read_csv('/content/drive/My Drive/Sloan_Sky_Survey/train_df.csv')
print(df_verify)

          specobjid          ra        dec         u         g         r  \
0      3.240000e+17  184.950869   0.733068  18.87062  17.59612  17.11245   
1      3.250000e+17  185.729201   0.679704  19.59560  19.92153  20.34448   
2      3.240000e+17  185.687690   0.823480  19.26421  17.87891  17.09593   
3      2.880000e+18  185.677904   0.768362  19.49739  17.96166  17.41269   
4      2.880000e+18  185.814763   0.776940  18.31519  16.83033  16.26352   
...             ...         ...        ...       ...       ...       ...   
99995  3.580000e+18  154.077143  55.614066  19.39861  18.35476  18.00348   
99996  3.580000e+18  154.067926  55.635794  19.07703  18.05159  17.78332   
99997  1.070000e+18  153.897018  55.712582  19.07982  17.51349  16.64037   
99998  6.950000e+17  235.656141  56.297044  17.27528  16.41704  16.11662   
99999  6.950000e+17  235.821749  56.400331  17.90598  16.86471  16.51673   

              i         z  plate    mjd  ...  psfMag_g  psfMag_i  psfMag_z  \
0      16