# The CosmoClassifier
In this project, I use Data Release 18 of the Sloan Digital Sky Survey (SDSS) dataset to train a classifier that predicts whether a given set of features corresponds to a Galaxy (class 0), Star (class 1), or Quasar (class 2). This notebook is a playground for testing different hyperparameter settings and preprocessing approaches.

We will test 3 different classifier algorithms and select the one that gives the best result. These are:
1. Random Forest **(This File)**
2. Logistic Regression
3. Support Vector Classifier (with RBF kernel)

The models are implemented in separate `.ipynb` files to avoid confusion in one notebook. You can find them all in the `notebooks` subdirectory.
   
We will also use 3 different dimensionality reduction techniques, which include:
1. Sequential Feature Selection (SFS)
2. Linear Discriminant Analysis (LDA)
3. Principal Component Analysis (PCA)

Importing the libraries

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import cross_val_score,KFold,train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
import numpy as np

## Basic Preprocessing

Importing the dataset

In [33]:
import pandas as pd

df_raw = pd.read_csv("Datasets/SDSS_DR18.csv")
df_raw.columns

Index(['objid', 'specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run',
       'rerun', 'camcol', 'field', 'plate', 'mjd', 'fiberid', 'petroRad_u',
       'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u',
       'petroFlux_g', 'petroFlux_i', 'petroFlux_r', 'petroFlux_z',
       'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r', 'petroR50_z',
       'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u',
       'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z', 'redshift', 'class'],
      dtype='object')

Dropping the identifier columns which may lead to data leakage

In [34]:
df_raw = df_raw.drop(columns=["objid", "specobjid", "run", "rerun", "camcol", "field", "plate", "mjd", "fiberid"])

Identifying and mapping the classes

In [35]:
print(df_raw["class"].value_counts())
df_1 = df_raw.copy()

class
GALAXY    52343
STAR      37232
QSO       10425
Name: count, dtype: int64


In [36]:
df_1["class"] = df_1["class"].map({
  "GALAXY":0,
  "STAR":1,
  "QSO":2
})
df_1["class"].head(10)

0    0
1    1
2    0
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: class, dtype: int64
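One thing worth guarding against when mapping labels by hand: `Series.map` with a dictionary turns any label missing from the dictionary into `NaN` instead of raising an error. Below is a minimal sketch of that check on a toy series; the `"UNKNOWN"` label is made up for illustration and does not appear in the SDSS data above.

```python
import pandas as pd

# Toy labels; "UNKNOWN" is a hypothetical label used only to trigger the check
labels = pd.Series(["GALAXY", "STAR", "QSO", "UNKNOWN"])
class_map = {"GALAXY": 0, "STAR": 1, "QSO": 2}

mapped = labels.map(class_map)
n_unmapped = int(mapped.isna().sum())  # labels that had no entry in class_map

print(mapped)
print(f"Unmapped labels: {n_unmapped}")
```

In this notebook the mapped column keeps the `int64` dtype, which already indicates that every label found a match.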

Checking for null values

In [37]:
df_1.isna().value_counts()

ra     dec    u      g      r      i      z      petroRad_u  petroRad_g  petroRad_i  petroRad_r  petroRad_z  petroFlux_u  petroFlux_g  petroFlux_i  petroFlux_r  petroFlux_z  petroR50_u  petroR50_g  petroR50_i  petroR50_r  petroR50_z  psfMag_u  psfMag_r  psfMag_g  psfMag_i  psfMag_z  expAB_u  expAB_g  expAB_r  expAB_i  expAB_z  redshift  class
False  False  False  False  False  False  False  False       False       False       False       False       False        False        False        False        False        False       False       False       False       False       False     False     False     False     False     False    False    False    False    False    False     False    100000
Name: count, dtype: int64

No null values were found, so we are going to skip dropping nulls.
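The `value_counts()` output above is wide and hard to scan. A more compact alternative is to count nulls per column with `isna().sum()`. The sketch below uses a small made-up frame (not the SDSS data) with one missing value so that the result is visible.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_1; the values are hypothetical
toy = pd.DataFrame({
    "u": [19.1, np.nan, 18.2],
    "g": [17.9, 17.5, 17.0],
    "class": [0, 1, 2],
})

null_counts = toy.isna().sum()        # one count per column
total_nulls = int(null_counts.sum())  # single number for the whole frame

print(null_counts)
print(f"Total nulls: {total_nulls}")
```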

Copying the dataset and specifying the target & feature columns

In [38]:
df = df_1.copy()
y = df.iloc[:,-1]      # Target column ("class")
x = df.iloc[:,:-1]     # Feature columns (everything except "class")

## ML Preprocessing, Model training, Evaluation 

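Performing a stratified train-test split. Note that the cross-validation cells below score on the full `x` and `y`, so this hold-out split is not used by them.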

In [39]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=2/10,random_state=120,shuffle=True,stratify=y)
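The `stratify=y` argument matters here because the classes are imbalanced: it keeps the class proportions the same in the training and test sets. The sketch below checks this on synthetic labels whose proportions only roughly imitate the class counts printed earlier; the features are random noise and all names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels (about 52% / 37% / 11%)
y_toy = np.repeat([0, 1, 2], [520, 370, 110])
x_toy = rng.normal(size=(len(y_toy), 4))

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=120, shuffle=True, stratify=y_toy)

# Share of each class in the two splits
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print("train:", train_share.round(3))
print("test: ", test_share.round(3))
```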

Building Pipeline (SFS)

In [71]:
rf_model = RandomForestClassifier(
  n_estimators=150,max_depth=10,random_state=102,class_weight="balanced",n_jobs=-1)
sfs = SequentialFeatureSelector(
  rf_model,n_features_to_select="auto",tol=0.007,direction="forward",cv=None)

preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("sfs",sfs)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",rf_model)
])

kfold = KFold(n_splits=3,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.98716026 0.98649986 0.98745987]
Average = 0.9870399987974201


Building Pipeline (PCA)

In [44]:
rf_model = RandomForestClassifier(
  n_estimators=150,max_depth=10,random_state=103,class_weight="balanced",n_jobs=-1)
pca = PCA(n_components=20,random_state=19)

preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("pca",pca)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",rf_model)
])

kfold = KFold(n_splits=3,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.98050039 0.97929979 0.98100981]
Average = 0.980269997696077
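The value `n_components=20` is a manual choice. One common way to sanity-check such a choice is to look at the cumulative explained variance of the principal components. The sketch below does this on synthetic data built from 3 latent factors; the 95% threshold and all variable names are assumptions for illustration and are not taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(19)

# Hypothetical data: 10 observed features driven by 3 latent factors plus noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
x_toy = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

x_scaled = StandardScaler().fit_transform(x_toy)
pca_toy = PCA().fit(x_scaled)  # keep all components to inspect the spectrum

cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3))
print(f"Components needed for 95% variance: {n_for_95}")
```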


Building Pipeline (LDA)

In [45]:
rf_model = RandomForestClassifier(
  n_estimators=150,max_depth=10,random_state=104,class_weight="balanced",n_jobs=-1)
lda = LDA(n_components=2)    
# With 3 classes, LDA allows at most n_classes - 1 = 2 components, so n_components can only be 1 or 2.
# n_components=2 gave the best score, so I kept it for the final version.
preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("lda",lda)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",rf_model)
])

kfold = KFold(n_splits=2,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.9844  0.98478]
Average = 0.9845900000000001
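The limit on `n_components` comes from LDA itself: it can produce at most `min(n_classes - 1, n_features)` discriminant axes. The sketch below checks this on synthetic 3-class data (random values, hypothetical names): two components work, while asking for three raises a `ValueError`.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical 3-class data with 5 features and well separated class centres
y_toy = np.repeat([0, 1, 2], 50)
centres = rng.normal(scale=3.0, size=(3, 5))
x_toy = centres[y_toy] + rng.normal(size=(150, 5))

# Upper bound on the number of discriminant axes
max_components = min(len(np.unique(y_toy)) - 1, x_toy.shape[1])
projected = LinearDiscriminantAnalysis(
    n_components=max_components).fit_transform(x_toy, y_toy)

try:
    LinearDiscriminantAnalysis(n_components=3).fit(x_toy, y_toy)
    three_components_ok = True
except ValueError:
    three_components_ok = False  # more components than n_classes - 1

print(max_components, projected.shape, three_components_ok)
```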


## Summary

Here is a summary of the average cross-validation scores for every combination tried:

| Dimensionality Reduction (rows) / Model (columns) | Random Forest | Support Vector Classifier | Logistic Regression |
| :--- | :--- | :--- | :--- |
| Principal Component Analysis | 0.9803 | 0.9471 | 0.9725 |
| Linear Discriminant Analysis | 0.9846 | 0.9832 | 0.9832 |
| Sequential Feature Selection | 0.9870 | Null | Null |

From this table, we can see that Random Forest with SFS achieved the best cross-validation score overall. However, we are going to avoid it because of its high computational cost. On my MacBook Air M2, it took more than 20 minutes to run even with `n_jobs` set to -1 (all CPU cores used).
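A rough count of model fits helps explain the runtime. In forward selection, each round scores every feature that has not been selected yet, and each of those scores is itself cross-validated (`cv=None` in `SequentialFeatureSelector` means the default 5 folds). All of this is then repeated inside each of the 3 outer folds. The helper below is a hypothetical back-of-the-envelope calculation, not part of scikit-learn. The notebook does not report how many rounds SFS actually ran, so `n_rounds=4` is an assumed value.

```python
def sfs_forward_fits(n_features: int, n_rounds: int,
                     inner_cv: int = 5, outer_cv: int = 3) -> int:
    """Count estimator fits spent inside forward SFS across an outer cross-validation."""
    # Round r scores every feature not yet selected: n_features - r candidates
    candidates = sum(n_features - r for r in range(n_rounds))
    return candidates * inner_cv * outer_cv


# 33 features remain after dropping the identifiers; n_rounds=4 is hypothetical
fits = sfs_forward_fits(n_features=33, n_rounds=4)
print(fits)  # 1890
```

Under these assumptions, selection alone costs close to two thousand random forest fits of 150 trees each, while the PCA and LDA pipelines fit the forest once per outer fold.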

So, we have settled on the Random Forest model with LDA as our final version for the Web App!

**But** there is a catch! Check the `research_2.ipynb` file to learn about it.