*This notebook continues `research_RF.ipynb`. Please read that notebook first.*

# The Problem

At the time of writing, I have already trained the model and built the front end and the web app. I have since realized that asking for 33 feature columns as input was a bad call: it would be a frustrating experience for any user. So, in this notebook, I reduce the number of features and engineer new ones where necessary.

**Importing modules**

In [4]:
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

from sklearn.model_selection import train_test_split,learning_curve,RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

## Basic Preprocessing

Importing the dataset

In [5]:
df_raw = pd.read_csv("Datasets/SDSS_DR18.csv")
df_raw.columns

Index(['objid', 'specobjid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run',
       'rerun', 'camcol', 'field', 'plate', 'mjd', 'fiberid', 'petroRad_u',
       'petroRad_g', 'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u',
       'petroFlux_g', 'petroFlux_i', 'petroFlux_r', 'petroFlux_z',
       'petroR50_u', 'petroR50_g', 'petroR50_i', 'petroR50_r', 'petroR50_z',
       'psfMag_u', 'psfMag_r', 'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u',
       'expAB_g', 'expAB_r', 'expAB_i', 'expAB_z', 'redshift', 'class'],
      dtype='object')

Dropping the identifier columns, which could lead to data leakage

In [6]:
df_raw = df_raw.drop(columns=["objid", "specobjid", "run", "rerun", "camcol", "field", "plate", "mjd", "fiberid"])
df_raw.columns

Index(['ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'petroRad_u', 'petroRad_g',
       'petroRad_i', 'petroRad_r', 'petroRad_z', 'petroFlux_u', 'petroFlux_g',
       'petroFlux_i', 'petroFlux_r', 'petroFlux_z', 'petroR50_u', 'petroR50_g',
       'petroR50_i', 'petroR50_r', 'petroR50_z', 'psfMag_u', 'psfMag_r',
       'psfMag_g', 'psfMag_i', 'psfMag_z', 'expAB_u', 'expAB_g', 'expAB_r',
       'expAB_i', 'expAB_z', 'redshift', 'class'],
      dtype='object')

Identifying and mapping the classes

In [7]:
print(df_raw["class"].value_counts())
df_1 = df_raw.copy()

class
GALAXY    52343
STAR      37232
QSO       10425
Name: count, dtype: int64


In [8]:
df_1["class"] = df_1["class"].map({
  "GALAXY":0,
  "STAR":1,
  "QSO":2
})
df_1["class"].head(10)

0    0
1    1
2    0
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: class, dtype: int64
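Because everything downstream (including the classification report) uses these integer codes, it can help to keep the mapping in one place and invert it when showing results. A minimal sketch; the names `CLASS_TO_ID`, `ID_TO_CLASS`, and `decode_predictions` are hypothetical and not part of the original notebook:

```python
# Hypothetical helper names; only the mapping itself comes from this notebook.
CLASS_TO_ID = {"GALAXY": 0, "STAR": 1, "QSO": 2}

# Invert the mapping so integer predictions can be shown as class names.
ID_TO_CLASS = {code: name for name, code in CLASS_TO_ID.items()}


def decode_predictions(codes):
    """Translate integer class codes back into their class names."""
    return [ID_TO_CLASS[int(code)] for code in codes]


print(decode_predictions([0, 1, 2, 1]))
```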

## Feature Reduction and Engineering

Dropping the Petrosian fluxes (`petroFlux_*`), the radii (`petroRad_*`, `petroR50_*`), the `expAB_*` columns, and every PSF magnitude except `psfMag_r`. To do this, I copy only the features I want to keep from `df_1` into `df_2`.

In [9]:
df_2 = df_1[["ra","dec","redshift","u","g","r","i","z","psfMag_r","class"]].copy()

Feature engineering color contrast columns

In [10]:
df_2["u_g_color"] = df_2["u"] - df_2["g"]
df_2["g_r_color"] = df_2["g"] - df_2["r"]
df_2["r_i_color"] = df_2["r"] - df_2["i"]
df_2["i_z_color"] = df_2["i"] - df_2["z"]
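Each color column is simply the difference between two adjacent bands, so the same arithmetic can be applied to a single object (for example, one set of magnitudes typed into the web app). A small sketch under that assumption; `magnitudes_to_colors` is a hypothetical helper, and the sample values are the first row of this dataset:

```python
def magnitudes_to_colors(u, g, r, i, z):
    """Return the four adjacent-band color values for one object."""
    return {
        "u_g_color": u - g,
        "g_r_color": g - r,
        "r_i_color": r - i,
        "i_z_color": i - z,
    }


# Magnitudes of the first row in the dataset
colors = magnitudes_to_colors(
    u=18.87062, g=17.59612, r=17.11245, i=16.83899, z=16.70908
)
for name, value in colors.items():
    print(f"{name}: {value:.5f}")
```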

Dropping the raw magnitude columns (`u`, `g`, `r`, `i`, `z`), since the color columns are derived from them

In [11]:
df_2.head(2)

| | ra | dec | redshift | u | g | r | i | z | psfMag_r | class | u_g_color | g_r_color | r_i_color | i_z_color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 184.950869 | 0.733068 | 0.041691 | 18.87062 | 17.59612 | 17.11245 | 16.83899 | 16.70908 | 19.50324 | 0 | 1.2745 | 0.48367 | 0.27346 | 0.12991 |
| 1 | 185.729201 | 0.679704 | -0.000814 | 19.5956 | 19.92153 | 20.34448 | 20.66213 | 20.59599 | 20.34491 | 1 | -0.32593 | -0.42295 | -0.31765 | 0.06614 |


In [12]:
df_2 = df_2.drop(columns=["u","g","r","i","z"])

In [13]:
df_2.head(2)

| | ra | dec | redshift | psfMag_r | class | u_g_color | g_r_color | r_i_color | i_z_color |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 184.950869 | 0.733068 | 0.041691 | 19.50324 | 0 | 1.2745 | 0.48367 | 0.27346 | 0.12991 |
| 1 | 185.729201 | 0.679704 | -0.000814 | 20.34491 | 1 | -0.32593 | -0.42295 | -0.31765 | 0.06614 |


Moving the class column to the end

In [14]:
popped_class = df_2.pop("class")
df_2.insert(len(df_2.columns), "class", popped_class)

In [15]:
df_2.head(3)

| | ra | dec | redshift | psfMag_r | u_g_color | g_r_color | r_i_color | i_z_color | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 184.950869 | 0.733068 | 0.041691 | 19.50324 | 1.2745 | 0.48367 | 0.27346 | 0.12991 | 0 |
| 1 | 185.729201 | 0.679704 | -0.000814 | 20.34491 | -0.32593 | -0.42295 | -0.31765 | 0.06614 | 1 |
| 2 | 185.68769 | 0.82348 | 0.113069 | 18.54832 | 1.3853 | 0.78298 | 0.44434 | 0.2983 | 0 |


Copying `df_2` into the main dataframe and separating feature and target columns

In [16]:
df = df_2.copy()
x = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()

feature_columns = df.columns[:-1]
feature_columns

Index(['ra', 'dec', 'redshift', 'psfMag_r', 'u_g_color', 'g_r_color',
       'r_i_color', 'i_z_color'],
      dtype='object')

## ML Preprocessing, Model Training, and Evaluation

Performing train-test split

In [17]:
x_train,x_test,y_train,y_test = train_test_split(
  x,y,test_size=2/10,random_state=120,shuffle=True,stratify=y)
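With `stratify=y`, each class should keep roughly the same share in the train and test sets. The check below is plain arithmetic on the class counts printed earlier, not a description of how scikit-learn allocates rows internally:

```python
# Class counts from the value_counts() output earlier in this notebook
class_counts = {"GALAXY": 52343, "STAR": 37232, "QSO": 10425}
test_size = 2 / 10

total = sum(class_counts.values())

# Approximate number of test rows per class if proportions are preserved
expected_test = {
    name: round(count * test_size) for name, count in class_counts.items()
}

for name, count in class_counts.items():
    share = count / total
    print(f"{name}: {share:.1%} of the data, about {expected_test[name]} test rows")
```

These estimates agree with the per-class support reported in the classification report later in this notebook.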

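Defining the candidate models and the pipeline. The `Pipeline` class comes from `imblearn`, so the SMOTE step resamples only the data used for fitting and is skipped at prediction time.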

In [18]:
# RF, SVC, LR, XGB
rf_model = RandomForestClassifier(random_state=40)
svc_model = SVC(random_state=41)
lr_model = LogisticRegression(random_state=42,max_iter=10_000)
xgb_model = XGBClassifier(random_state=43)

pca = PCA(random_state=44)
lda = LDA(n_components=2)

pipe = Pipeline([
  ("impute",SimpleImputer(strategy="median")),
  ("scale",StandardScaler()),
  ("smote",SMOTE(random_state=101)),
  ("dimen",pca),
  ("model",rf_model)
])

Performing Randomized Search CV

In [19]:
param_list = [
  { # Random Forest, PCA On
    "model": [rf_model],"model__n_estimators":np.arange(150,650,100),
    "model__max_depth":np.arange(7,14,2), "dimen" : [pca], "dimen__n_components": np.arange(5,8,1)
  },

  # { # SVC, No dimen. reduction
  #   "model": [svc_model], "model__C":[0.01,0.1,1,10], "model__kernel":["rbf"], "model__gamma":[0.01,0.1,1,10],
  #   "dimen":["passthrough"]
  # },

  { # Logistic Regression, No dimen. reduction, l1 penalty, `saga` solver
    "model": [lr_model], "model__C": [0.01,0.1,1,10], "model__penalty":["l1"], "model__solver":["saga"],
    "dimen": ["passthrough"]
  },
  { # Logistic Regression, No dimen. reduction, l2 penalty, `lbfgs` solver
    "model": [lr_model], "model__C": [0.01,0.1,1,10], "model__penalty":["l2"], "model__solver":["lbfgs"],
    "dimen": ["passthrough"]
  },
  { # XGBoost, PCA On
    "dimen": [pca], "dimen__n_components": np.arange(5,8,1),
    "model": [xgb_model], "model__n_estimators" : np.linspace(500,1100,3,dtype=int),"model__learning_rate": [0.01,0.1], "model__max_depth":np.arange(7,14,3)
  },
  { # XGBoost, LDA On
    "dimen": [lda],
    "model": [xgb_model], "model__n_estimators" : [500,700,900],"model__learning_rate": [0.01,0.1], "model__max_depth":np.arange(7,14,3)
  },
  { # XGBoost, No dimen. reduction
    "dimen": ["passthrough"],
    "model": [xgb_model], "model__n_estimators" : [500,700,900],"model__learning_rate": [0.01,0.1], "model__max_depth":np.arange(7,14,3)
  }
]

rscv = RandomizedSearchCV(
  estimator=pipe,param_distributions=param_list,n_iter=8,cv=5,n_jobs=-1,random_state=50,refit=True
)

rscv.fit(x_train,y_train)
estimator = rscv.best_estimator_
score = rscv.best_score_
config = rscv.best_params_
print(f"Best Configuration:\n{config}")
print(f"Best Score = {score}")



Best Configuration:
{'model__solver': 'saga', 'model__penalty': 'l1', 'model__C': 10, 'model': LogisticRegression(max_iter=10000, random_state=42), 'dimen': 'passthrough'}
Best Score = 0.9758125
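Note that `n_iter=8` means only eight candidate configurations are tried. To put that in perspective, the sketch below counts how many combinations the active blocks of `param_list` contain; the per-parameter counts are read off the cell above by hand, and the commented-out SVC block is excluded:

```python
from math import prod

# Number of values per hyperparameter in each active block of `param_list`.
grid_blocks = {
    "RF + PCA": [5, 4, 3],           # n_estimators, max_depth, n_components
    "LR (l1, saga)": [4],            # C
    "LR (l2, lbfgs)": [4],           # C
    "XGB + PCA": [3, 3, 2, 3],       # n_components, n_estimators, learning_rate, max_depth
    "XGB + LDA": [3, 2, 3],          # n_estimators, learning_rate, max_depth
    "XGB, no reduction": [3, 2, 3],  # n_estimators, learning_rate, max_depth
}

block_sizes = {name: prod(sizes) for name, sizes in grid_blocks.items()}
total_candidates = sum(block_sizes.values())
n_iter = 8

for name, size in block_sizes.items():
    print(f"{name}: {size} combinations")
print(f"Sampled {n_iter} of {total_candidates} ({n_iter / total_candidates:.1%})")
```

The best score is therefore the best among a small random sample of the search space; a larger `n_iter` might surface a different winner.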


Now, let's compute the **classification report** on the held-out test set

In [20]:
y_true = y_test
y_pred = rscv.predict(x_test)
print(classification_report(y_true=y_true,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.99      0.96      0.98     10469
           1       0.96      1.00      0.98      7446
           2       0.96      0.96      0.96      2085

    accuracy                           0.97     20000
   macro avg       0.97      0.97      0.97     20000
weighted avg       0.98      0.97      0.97     20000
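The `macro avg` row is the unweighted mean over the three classes, while `weighted avg` weights each class by its support. A quick sketch that reproduces both rows from the per-class numbers; since those inputs are already rounded to two decimals, the results are only approximate:

```python
# Per-class values copied from the report (already rounded to 2 decimals)
precision = [0.99, 0.96, 0.96]
recall = [0.96, 1.00, 0.96]
support = [10469, 7446, 2085]


def macro_avg(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, weights):
    """Mean weighted by the number of test rows in each class."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


print(f"macro precision    ~ {macro_avg(precision):.2f}")
print(f"weighted precision ~ {weighted_avg(precision, support):.2f}")
print(f"macro recall       ~ {macro_avg(recall):.2f}")
print(f"weighted recall    ~ {weighted_avg(recall, support):.2f}")
```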



As we can see, the overall accuracy dropped by about 1% compared with the previous notebook. That is still a huge win considering the simplicity gained: the model now needs only 8 input features instead of 33.

## Summary

In this notebook, we explored how to make the model more practical for users by reducing the number of input features. The model became much simpler at the cost of only about 1% accuracy.

We will use the code from this notebook to write `fit.py` from the ground up. 