*This notebook continues the `research_RF.ipynb` notebook. Please read that one first.*

# The Problem

At the time of writing, I have already trained the model and developed the front end and the web app. I have now realized that asking for 33 feature columns as input was a bad call: it would be a frustrating experience for any user. So, in this notebook, I reduce the number of features and engineer new ones where necessary.

**Importing modules**

In [74]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import cross_val_score,KFold,train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

## Basic Preprocessing

Importing the dataset

In [75]:
df_raw = pd.read_csv("Datasets/SDSS_DR18.csv")
df_raw.columns

Index(['objid', 'specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run',
       'rerun', 'camcol', 'field', 'plate', 'mjd', 'fiberid', 'petroRad_u',
       'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u',
       'petroFlux_g', 'petroFlux_i', 'petroFlux_r', 'petroFlux_z',
       'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r', 'petroR50_z',
       'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u',
       'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z', 'redshift', 'class'],
      dtype='object')

Dropping the identifier columns which may lead to data leakage

In [76]:
df_raw = df_raw.drop(columns=["objid", "specobjid", "run", "rerun", "camcol", "field", "plate", "mjd", "fiberid"])
df_raw.columns

Index(['ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'petroRad_u', 'petroRad_g',
       'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u', 'petroFlux_g',
       'petroFlux_i', 'petroFlux_r', 'petroFlux_z', 'petroR50_u', 'petroR50_g',
       'petroR50_i', 'petroR50_r', 'petroR50_z', 'psfMag_u', 'psfMag_r',
       'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u', 'expAB_g', 'expAB_r',
       'expAB_i', 'expAB_z', 'redshift', 'class'],
      dtype='object')

Identifying and mapping the classes

In [77]:
print(df_raw["class"].value_counts())
df_1 = df_raw.copy()

class
GALAXY    52343
STAR      37232
QSO       10425
Name: count, dtype: int64


In [78]:
df_1["class"] = df_1["class"].map({
  "GALAXY":0,
  "STAR":1,
  "QSO":2
})
df_1["class"].head(10)

0    0
1    1
2    0
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: class, dtype: int64

## Feature Reduction and Engineering

Dropping the raw fluxes (`petroFlux_*`), the radii (`petroRad_*`, `petroR50_*`), the axis ratios (`expAB_*`), and the duplicate magnitudes (every `psfMag_*` except `psfMag_r`). To do this, I copy only the remaining columns from `df_1` into `df_2`.

In [79]:
df_2 = df_1[["ra","dec","redshift","u","g","r","i","z","psfMag_r","class"]].copy()

Engineering color features (differences between adjacent magnitude bands)

In [80]:
df_2["u_g_color"] = df_2["u"] - df_2["g"]
df_2["g_r_color"] = df_2["g"] - df_2["r"]
df_2["r_i_color"] = df_2["r"] - df_2["i"]
df_2["i_z_color"] = df_2["i"] - df_2["z"]

Dropping the raw magnitude columns (`u`, `g`, `r`, `i`, `z`), which are now encoded in the color features

In [81]:
df_2.head(2)

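| | ra | dec | redshift | u | g | r | i | z | psfMag_r | class | u_g_color | g_r_color | r_i_color | i_z_color |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 184.950869 | 0.733068 | 0.041691 | 18.87062 | 17.59612 | 17.11245 | 16.83899 | 16.70908 | 19.50324 | 0 | 1.2745 | 0.48367 | 0.27346 | 0.12991 |
| 1 | 185.729201 | 0.679704 | -0.000814 | 19.5956 | 19.92153 | 20.34448 | 20.66213 | 20.59599 | 20.34491 | 1 | -0.32593 | -0.42295 | -0.31765 | 0.06614 |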


In [82]:
df_2 = df_2.drop(columns=["u","g","r","i","z"])

In [83]:
df_2.head(2)

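| | ra | dec | redshift | psfMag_r | class | u_g_color | g_r_color | r_i_color | i_z_color |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 184.950869 | 0.733068 | 0.041691 | 19.50324 | 0 | 1.2745 | 0.48367 | 0.27346 | 0.12991 |
| 1 | 185.729201 | 0.679704 | -0.000814 | 20.34491 | 1 | -0.32593 | -0.42295 | -0.31765 | 0.06614 |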


Moving the class column to the end

In [84]:
popped_class = df_2.pop("class")
df_2.insert(len(df_2.columns), "class", popped_class)

In [85]:
df_2.head(3)

| | ra | dec | redshift | psfMag_r | u_g_color | g_r_color | r_i_color | i_z_color | class |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 184.950869 | 0.733068 | 0.041691 | 19.50324 | 1.2745 | 0.48367 | 0.27346 | 0.12991 | 0 |
| 1 | 185.729201 | 0.679704 | -0.000814 | 20.34491 | -0.32593 | -0.42295 | -0.31765 | 0.06614 | 1 |
| 2 | 185.68769 | 0.82348 | 0.113069 | 18.54832 | 1.3853 | 0.78298 | 0.44434 | 0.2983 | 0 |


Copying `df_2` into the main dataframe and separating feature and target columns

In [86]:
df = df_2.copy()
x = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()

feature_columns = df.columns[:-1]
feature_columns

Index(['ra', 'dec', 'redshift', 'psfMag_r', 'u_g_color', 'g_r_color',
       'r_i_color', 'i_z_color'],
      dtype='object')
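
Since the web app will presumably collect raw magnitudes rather than colors, the same transformation has to be repeated at prediction time. The following is a minimal sketch of how that could look. The helper `build_features` and the constant `FEATURE_ORDER` are hypothetical names that do not exist in the original notebook; the example row is the first row shown in the `df_2.head(2)` output earlier.

```python
import pandas as pd

# Hypothetical constant: the 8 model features, in the order printed above.
FEATURE_ORDER = ["ra", "dec", "redshift", "psfMag_r",
                 "u_g_color", "g_r_color", "r_i_color", "i_z_color"]

def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn the 9 raw inputs into the 8 model features."""
    out = raw[["ra", "dec", "redshift", "psfMag_r"]].copy()
    # Same color definitions as in the feature engineering cell
    out["u_g_color"] = raw["u"] - raw["g"]
    out["g_r_color"] = raw["g"] - raw["r"]
    out["r_i_color"] = raw["r"] - raw["i"]
    out["i_z_color"] = raw["i"] - raw["z"]
    return out[FEATURE_ORDER]

# First row of the dataset, copied from the head() output
raw_row = pd.DataFrame([{
    "ra": 184.950869, "dec": 0.733068, "redshift": 0.041691,
    "u": 18.87062, "g": 17.59612, "r": 17.11245, "i": 16.83899, "z": 16.70908,
    "psfMag_r": 19.50324,
}])
features = build_features(raw_row)
print(features.round(5).to_dict(orient="records")[0])
```

Keeping the column order in one place reduces the risk of the web app feeding features to the model in a different order than the one used for training.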

## ML Preprocessing, Model training, Evaluation 

Performing train-test split

In [87]:
x_train,x_test,y_train,y_test = train_test_split(
  x,y,test_size=2/10,random_state=120,shuffle=True,stratify=y)
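
The classes are imbalanced (QSO is roughly a tenth of the data), which is why `stratify=y` matters here. As a rough illustration, the sketch below rebuilds a label vector from the class counts printed earlier and checks that the test split keeps the same class shares. The names `y_demo` and `x_demo` are placeholders, and the feature matrix is filled with zeros because only the labels matter for this check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts from the value_counts() output (0=GALAXY, 1=STAR, 2=QSO)
counts = {0: 52343, 1: 37232, 2: 10425}
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
x_demo = np.zeros((y_demo.size, 1))  # placeholder features, labels only

_, _, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=2/10, random_state=120, shuffle=True, stratify=y_demo)

full_share = np.bincount(y_demo) / y_demo.size
test_share = np.bincount(y_te) / y_te.size
print("full:", full_share.round(4))
print("test:", test_share.round(4))
```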

Building Pipeline (PCA)

In [88]:
rf_model = RandomForestClassifier(
  n_estimators=150,max_depth=10,random_state=103,class_weight="balanced",n_jobs=-1)
pca = PCA(n_components=8,random_state=19)

preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("pca",pca)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",rf_model)
])

kfold = KFold(n_splits=5,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.96295 0.9646  0.96155 0.97585 0.96245]
Average = 0.96548
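
One thing worth keeping in mind: with 8 input features, `n_components=8` keeps every principal component, so PCA rotates the scaled data instead of shrinking it. The sketch below illustrates this on synthetic data (not the SDSS features); `x_demo` and `pca_demo` are illustrative names. The cumulative explained variance reaches 1.0 only at the last component, and the earlier entries show how much a smaller `n_components` would retain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, correlated 8-column data standing in for the real features
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))

x_scaled = StandardScaler().fit_transform(x_demo)
pca_demo = PCA(n_components=8, random_state=19).fit(x_scaled)

# Share of total variance kept by the first k components, k = 1..8
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative.round(3))
```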


Building Pipeline (LDA)

In [89]:
rf_model = RandomForestClassifier(
  n_estimators=150,max_depth=10,random_state=104,class_weight="balanced",n_jobs=-1)
lda = LDA(n_components=2)    
# n_components can only be 1 or 2 here, since LDA allows at most n_classes - 1 components and there are 3 classes.
# n_components=2 gave the best score, so I kept it for the final version.
preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("lda",lda)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",rf_model)
])

kfold = KFold(n_splits=3,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.95758085 0.95550956 0.96399964]
Average = 0.9590300144916611


Building Pipeline (SFS)

In [90]:
rf_model = RandomForestClassifier(
  n_estimators=150,max_depth=10,random_state=102,class_weight="balanced",n_jobs=-1)
sfs = SequentialFeatureSelector(
  rf_model,n_features_to_select="auto",tol=0.007,direction="forward",cv=None)

preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scale", StandardScaler()),
  ("sfs",sfs)
])
pipe = Pipeline([
  ("preprocessor",preprocessor),
  ("model",rf_model)
])

kfold = KFold(n_splits=3,shuffle=True,random_state=10)
score = cross_val_score(pipe,x,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")

[0.98716026 0.98649986 0.98745987]
Average = 0.9870399987974201


## Summary
With this feature reduction, the model now uses 8 features instead of 33, and a user only has to supply 9 raw values (`ra`, `dec`, `redshift`, `psfMag_r` and the five magnitudes `u`, `g`, `r`, `i`, `z`). This reduction comes at a cost. I tested with PCA, LDA and SFS and got the following results:

| Dimensionality Reduction $\Rightarrow$ | Principal Component Analysis | Linear Discriminant Analysis | Sequential Feature Selection |
| :--- | :---: | :---: | :---: |
| Before feature reduction | 0.9803 | 0.9846 | 0.9870 |
| After feature reduction | 0.9655 | 0.9590 | 0.9870 |

As you can see, the PCA and LDA scores decreased by roughly 0.015 and 0.026 respectively. SFS, however, kept the same score as before (to four decimal places). This is because SFS is a feature selection method, not a feature extraction method: it eliminates feature columns but does not change the values in the columns it keeps.
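
The size of each drop can be read off the table with a few lines of arithmetic. This is only a restatement of the table values; the dictionary names are illustrative.

```python
# Mean cross-validation scores copied from the summary table
before = {"PCA": 0.9803, "LDA": 0.9846, "SFS": 0.9870}
after = {"PCA": 0.9655, "LDA": 0.9590, "SFS": 0.9870}

# Absolute drop in accuracy per method, rounded to 4 decimals
drops = {name: round(before[name] - after[name], 4) for name in before}
for name, drop in drops.items():
    print(f"{name}: {drop:.4f}")
```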

Because of the reduced dimension, the runtime of the SFS implementation also dropped drastically.

So, we are going to use the RandomForest implementation with SFS dimensionality reduction in the main web app.