# The CosmoClassifier
In this project, I use Data Release 18 (DR18) of the Sloan Digital Sky Survey (SDSS) to train a classifier that predicts whether a given set of measurements corresponds to a galaxy (class 0), a star (class 1), or a quasar (class 2). This notebook serves as a playground for testing different hyperparameter settings and preprocessing approaches.

We will test three classifier algorithms and select the one that gives the best result. These are:
1. Random Forest 
2. Logistic Regression **(This File)**
3. Support Vector Classifier (with RBF kernel)

The models are implemented in separate `.ipynb` files to avoid confusion in one notebook. You can find them all in the `notebooks` subdirectory.
   
We will also use 2 different dimensionality reduction techniques, which include:
1. Linear Discriminant Analysis (LDA)
2. Principal Component Analysis (PCA)

Importing the libraries

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import cross_val_score,KFold,train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np

## Basic Preprocessing

Importing the dataset

In [11]:
import pandas as pd

df_raw = pd.read_csv("Datasets/SDSS_DR18.csv")
df_raw.columns

Index(['objid', 'specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run',
       'rerun', 'camcol', 'field', 'plate', 'mjd', 'fiberid', 'petroRad_u',
       'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u',
       'petroFlux_g', 'petroFlux_i', 'petroFlux_r', 'petroFlux_z',
       'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r', 'petroR50_z',
       'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u',
       'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z', 'redshift', 'class'],
      dtype='object')

Dropping the identifier and survey bookkeeping columns, which could lead to data leakage

In [12]:
df_raw = df_raw.drop(columns=["objid", "specobjid", "run", "rerun", "camcol", "field", "plate", "mjd", "fiberid"])

Identifying and mapping the classes

In [13]:
print(df_raw["class"].value_counts())
df_1 = df_raw.copy()

class
GALAXY    52343
STAR      37232
QSO       10425
Name: count, dtype: int64


In [14]:
df_1["class"] = df_1["class"].map({
  "GALAXY":0,
  "STAR":1,
  "QSO":2
})
df_1["class"].head(10)

0    0
1    1
2    0
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: class, dtype: int64
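
One thing worth guarding against when mapping labels with a dictionary is that `Series.map` returns `NaN` for any label not present in the dictionary. A minimal sketch of such a check, using a small hypothetical series (`labels`) in place of `df_raw["class"]`:

```python
import pandas as pd

# Hypothetical stand-in for df_raw["class"]; the real column has 100,000 rows.
labels = pd.Series(["GALAXY", "STAR", "QSO", "STAR"], name="class")
class_map = {"GALAXY": 0, "STAR": 1, "QSO": 2}

encoded = labels.map(class_map)

# Series.map yields NaN for any label missing from the dictionary,
# so a non-zero count here would reveal an unexpected class name.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

If `n_unmapped` were non-zero, the dictionary would need an entry for the extra label before training.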

Checking for null values

In [15]:
df_1.isna().value_counts()

ra     dec    u      g      r      i      z      petroRad_u  petroRad_g  petroRad_i  petroRad_r  petroRad_z  petroFlux_u  petroFlux_g  petroFlux_i  petroFlux_r  petroFlux_z  petroR50_u  petroR50_g  petroR50_i  petroR50_r  petroR50_z  psfMag_u  psfMag_r  psfMag_g  psfMag_i  psfMag_z  expAB_u  expAB_g  expAB_r  expAB_i  expAB_z  redshift  class
False  False  False  False  False  False  False  False       False       False       False       False       False        False        False        False        False        False       False       False       False       False       False     False     False     False     False     False    False    False    False    False    False     False    100000
Name: count, dtype: int64

No null values were found, so no rows need to be dropped.
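
`isna().value_counts()` counts distinct row patterns, which gets hard to read with 34 columns. A per-column count is often easier to scan. The sketch below assumes a small hypothetical frame (`toy`) with one missing value rather than the real `df_1`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, standing in for df_1.
toy = pd.DataFrame({"u": [19.1, np.nan, 18.4], "g": [17.9, 18.2, 17.5]})

nulls_per_column = toy.isna().sum()        # one count per column
total_nulls = int(nulls_per_column.sum())  # single number for the whole frame
print(nulls_per_column)
print(total_nulls)
```

On the real data, `df_1.isna().sum().sum()` should evaluate to zero if the check above is correct.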

Copying the dataset and specifying the target & feature columns

In [16]:
df = df_1.copy()
y = df.iloc[:,-1]      # Target column (`class`)
x = df.iloc[:,:-1]     # Feature columns (everything except `class`)

## ML Preprocessing, Model training, Evaluation

Performing train-test split

In [17]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=2/10,random_state=120,shuffle=True,stratify=y)
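
Because the classes are imbalanced (52,343 galaxies, 37,232 stars, 10,425 quasars), `stratify=y` keeps the class proportions nearly identical in both splits. A minimal sketch that verifies this on synthetic labels; `x_demo` and `y_demo` are hypothetical stand-ins, not the SDSS data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the notebook's class imbalance.
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=10_000, p=[0.52, 0.37, 0.11])
x_demo = rng.normal(size=(10_000, 3))

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=120, shuffle=True, stratify=y_demo
)

# Class proportions in each split; with stratify they should nearly match.
train_frac = np.bincount(y_tr) / len(y_tr)
test_frac = np.bincount(y_te) / len(y_te)
print(train_frac.round(3), test_frac.round(3))
```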

Building Pipeline (PCA)

In [26]:
lr_model = LogisticRegression(
  C=0.001,penalty="l2",solver="saga",
  class_weight="balanced",random_state=191,max_iter=10_000,n_jobs=-1)
pca = PCA(n_components=20,random_state=69)

preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("pca",pca)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",lr_model)
])

kfold = KFold(n_splits=3,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.97348053 0.97083971 0.97308973]
Average = 0.972469989894595
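
The pipeline above keeps 20 principal components out of 33 features. One common way to sanity-check such a choice is to inspect the cumulative explained variance. The sketch below runs on synthetic data from `make_classification`, and the 95% threshold is an arbitrary assumption for illustration, not a value used in this notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 33 SDSS feature columns (assumption, not real data).
x_demo, _ = make_classification(
    n_samples=2_000, n_features=33, n_informative=10, n_redundant=10,
    n_classes=3, random_state=0,
)

x_scaled = StandardScaler().fit_transform(x_demo)
pca_full = PCA().fit(x_scaled)   # keep all components to see the full spectrum

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches 95 %.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_for_95, cumulative[19])  # cumulative[19] is the share kept by 20 components
```

Running the same steps on the scaled SDSS features would show how much variance `n_components=20` actually retains.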


Building Pipeline (LDA)

In [30]:
lr_model = LogisticRegression(
  C=0.001,penalty="l2",solver="saga",
  class_weight="balanced",random_state=191,max_iter=10_000,n_jobs=-1)
lda = LDA(n_components=2)
# LDA allows at most min(n_classes - 1, n_features) components; with 3 classes that means 1 or 2.
# n_components=2 gave a much better score, so I kept it.

preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("lda",lda)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",lr_model)
])

kfold = KFold(n_splits=3,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.98341033 0.98286983 0.98346983]
Average = 0.983249998396666
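
The limit on `n_components` for LDA follows from the bound `min(n_classes - 1, n_features)`. A minimal sketch on synthetic three-class data (`x_demo`, `y_demo` are hypothetical) that computes the bound and shows what happens when it is exceeded:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic three-class problem (assumption, not the SDSS data).
x_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)

n_classes = len(np.unique(y_demo))
max_components = min(n_classes - 1, x_demo.shape[1])

lda_demo = LinearDiscriminantAnalysis(n_components=max_components)
x_lda = lda_demo.fit_transform(x_demo, y_demo)
print(max_components, x_lda.shape)

# Asking for more components than the bound raises a ValueError in scikit-learn.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(x_demo, y_demo)
    raised = False
except ValueError:
    raised = True
print(raised)
```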


Here is a summary of the scores for every combination of model and dimensionality reduction technique:

| Dimensionality Reduction $\downarrow$ / Model $\rightarrow$ | Random Forest | Support Vector Classifier | Logistic Regression |
| :--- | :--- | :--- | :--- |
| Principal Component Analysis | 0.9803 | 0.9471 | 0.9725 |
| Linear Discriminant Analysis | 0.9846 | 0.9832 | 0.9832 |
| Sequential Feature Selection | 0.9870 | Null | Null |
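
The cross-validation cells in this notebook score the pipelines on the full `x`, `y`, so the `x_test`, `y_test` split is never used. A minimal sketch of a held-out evaluation follows. It assumes synthetic data and a simplified pipeline (`pipe_demo`) with default logistic regression settings, so the printed number says nothing about the SDSS results:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the SDSS features (assumption).
x_demo, y_demo = make_classification(
    n_samples=3_000, n_features=12, n_informative=6, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=120, stratify=y_demo
)

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=2)),
    ("model", LogisticRegression(max_iter=1_000)),
])

pipe_demo.fit(x_tr, y_tr)                    # fit on the training split only
test_accuracy = pipe_demo.score(x_te, y_te)  # accuracy on unseen rows
print(test_accuracy)
```

In the notebook itself, the equivalent would be `pipe.fit(x_train, y_train)` followed by `pipe.score(x_test, y_test)`.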