# STELLAR ENTITY CLASSIFICATION ⭐🪐🌃
- This notebook walks through exploratory data analysis and machine learning modelling for classifying stellar entities, step by step.
- The resulting model classifies a stellar entity as a <b>Galaxy, Quasar, or Star</b>.

<img src="https://wallpapercave.com/dwp1x/wp2088405.jpg" alt="Image Description" style="width: 1080px"/>

## Import dependencies

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay, classification_report
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier


import warnings
warnings.filterwarnings("ignore")
plt.style.use('ggplot')

## Data Preparation
Load the data into a pandas dataframe

In [None]:
df = pd.read_csv('/kaggle/input/stellar-classification-dataset-sdss17/star_classification.csv')
df.head()

In [None]:
df.shape

Based on the shape of the dataframe, the dataset contains 100,000 rows (instances) and 18 columns.

## Statistical summary of the data

In [None]:
df.describe()

## Check if there are null values

In [None]:
df.isnull().sum()

## Check the data types of each column

In [None]:
df.dtypes

Based on the cell output above, all the fields seem to have the correct data type

## Data distribution per class

In [None]:
plt.figure(figsize=(6, 4))
sns.histplot(data=df, x='class', hue='class', alpha=.7)
plt.title('Number of Instances in Each Class')
plt.show()

- Based on the graph above, the classes are imbalanced. To deal with this, oversampling will be performed later.
- <b>Oversampling</b> adds examples of the minority classes to the dataset until the classes are balanced. This notebook uses SMOTE, which creates synthetic minority examples by interpolating between an existing minority sample and one of its nearest minority-class neighbours, rather than duplicating existing rows.
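The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the helper name `smote_like_sample`, the toy points, and the choice of `k=2` are all assumptions made for the example.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(rng)
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the base point to every minority sample
    dists = np.linalg.norm(minority - base, axis=1)
    # Position 0 of the sorted order is the base point itself (distance 0)
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbour_idx)]
    gap = rng.random()  # uniform in [0, 1)
    # The new point lies on the segment joining base and neighbour
    return base + gap * (neighbour - base)

# Hypothetical minority-class points (corners of the unit square)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority, k=2, rng=0)
print(synthetic)
```

Because the synthetic point is a convex combination of two real points, it always stays inside the region already occupied by the minority class.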

## Convert the target variable to numeric values
- <b>Galaxy</b>: 0
- <b>Quasar</b>: 1
- <b>Star</b>: 2

In [None]:
LE = LabelEncoder()
df['class'] = LE.fit_transform(df['class'])
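`LabelEncoder` assigns integer codes in sorted order of the labels, which is where a 0/1/2 mapping like the one above comes from. The sketch below checks this on a small list, assuming the raw labels are the strings `GALAXY`, `QSO`, and `STAR`; the list itself is made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed raw label strings; the order in the list does not matter
labels = ["STAR", "GALAXY", "QSO", "GALAXY"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
print(encoder.inverse_transform([0, 1, 2]))
```

Keeping the fitted encoder around (here `encoder`, `LE` in the notebook) makes it possible to translate predicted codes back into class names later.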

## Check whether there are significant relationships between each feature and the target variable
To decide which columns to include as features for the machine learning model, it is important to examine the relationship between each independent variable and the dependent variable.

In [None]:
feature_columns = df.columns.drop('class')

### Box Plots

In [None]:
num_plots = len(feature_columns)
num_cols = 3  # Number of columns in the grid
num_rows = (num_plots - 1) // num_cols + 1

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))

for i, cols in enumerate(feature_columns):
    row = i // num_cols
    col = i % num_cols
    sns.boxplot(x='class', y=cols, data=df, ax=axes[row, col])
    axes[row, col].set_title(cols.upper())
    axes[row, col].set_xlabel('Class')
    axes[row, col].set_ylabel(cols)

for i in range(len(feature_columns), num_rows * num_cols):
    row = i // num_cols
    col = i % num_cols
    fig.delaxes(axes[row, col])

plt.tight_layout()
plt.show()

### Violin Plots

In [None]:
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))

for i, cols in enumerate(feature_columns):
    row = i // num_cols
    col = i % num_cols
    sns.violinplot(x='class', y=cols, data=df, ax=axes[row, col])
    axes[row, col].set_title(cols.upper())
    axes[row, col].set_xlabel('Class')
    axes[row, col].set_ylabel(cols)

for i in range(len(feature_columns), num_rows * num_cols):
    row = i // num_cols
    col = i % num_cols
    fig.delaxes(axes[row, col])

plt.tight_layout()
plt.show()

## Check if there are correlated fields

In [None]:
plt.figure(figsize=(15, 10))
sns.heatmap(df.drop('rerun_ID', axis=1).corr(), annot=True, cmap="YlGnBu")
plt.show()
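A heatmap with many columns can be hard to read, so it may help to list the strongly correlated pairs directly. The sketch below is one way to do that; the helper name `correlated_pairs`, the 0.9 threshold, and the toy dataframe are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: "b" is almost a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
result = correlated_pairs(toy)
print(result)
```

On the notebook's data the same helper could be called as `correlated_pairs(df.drop('rerun_ID', axis=1))`.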

## Select the feature variables

Based on the visualizations above, the columns that appear to be good predictors were identified.
These columns are selected from the data and fed into the machine learning model as feature variables.

In [None]:
X = df[['u', 'g', 'r', 'i', 'z', 'spec_obj_ID', 'redshift', 'plate', 'MJD']]

## Define the target variable

In [None]:
y = df['class']

## Feature Variable Distributions

In [None]:
num_plots = len(X.columns)
num_cols = 3  # Number of columns in the grid
num_rows = (num_plots - 1) // num_cols + 1

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))

for i, cols in enumerate(X.columns):
    row = i // num_cols
    col = i % num_cols
    sns.histplot(X[cols], kde=True, alpha=0.5, color='blue', ax=axes[row, col])
    axes[row, col].set_title(cols.upper())

for i in range(len(X.columns), num_rows * num_cols):  # remove unused axes, based on the plotted columns
    row = i // num_cols
    col = i % num_cols
    fig.delaxes(axes[row, col])

plt.tight_layout()
plt.show()

## Oversampling

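Since the number of instances in each class is imbalanced, oversampling is performed with SMOTE. Note that the resampling here is applied to the full dataset before the train/test split, so synthetic points in the training set can be interpolated from rows that end up in the test set; the test accuracy reported below may therefore be optimistic.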

In [None]:
sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y)

In [None]:
y.value_counts()
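A common alternative ordering is to split first and resample only the training portion, so the held-out set keeps the original class mix and contains no synthetic rows. The sketch below illustrates that ordering on made-up data. To stay self-contained it uses simple duplication as a stand-in for SMOTE; the toy arrays and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 3))
y_toy = np.array([0] * 200 + [1] * 70 + [2] * 30)  # imbalanced toy labels

# 1. Split first; stratify keeps the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=777
)

# 2. Oversample the training split only (duplication stand-in for SMOTE)
counts = np.bincount(y_tr)
target = counts.max()
parts = []
for cls, n in enumerate(counts):
    idx = np.flatnonzero(y_tr == cls)
    extra = rng.choice(idx, size=target - n, replace=True)
    parts.append(np.concatenate([idx, extra]))
keep = np.concatenate(parts)
X_bal, y_bal = X_tr[keep], y_tr[keep]

print("train:", np.bincount(y_bal), "test:", np.bincount(y_te))
```

With `imblearn`, step 2 would correspond to calling `SMOTE(random_state=42).fit_resample(x_train, y_train)` after `train_test_split`.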

## Prepare the training and testing datasets

In [None]:
X = np.array(X)
y = np.array(y)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=777)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

## Model Training

In [None]:
# store the different classification algorithms in a list
classifiers = [
    LogisticRegression(),
    SVC(),
    RandomForestClassifier(),
    KNeighborsClassifier(),
    GaussianNB(),
    DecisionTreeClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    ExtraTreesClassifier(),
    BaggingClassifier(),
    MLPClassifier(),
    XGBClassifier()
]

In [None]:
best_model = None
best_accuracy = 0

for classifier in classifiers:
    pipeline = make_pipeline(StandardScaler(), classifier)
    pipeline.fit(x_train, y_train)
    
    y_pred = pipeline.predict(x_test)
    
    accuracy = accuracy_score(y_test, y_pred)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = classifier.__class__.__name__

    print(f"{classifier.__class__.__name__} - Accuracy: {accuracy}")

print(f"The best performing model is: {best_model} with accuracy: {best_accuracy}")


Based on the accuracy of all the models trained, the <b>Random Forest Classifier</b> showed the best performance, with an accuracy of 98.27%.

## Hyperparameter Tuning

Because the Random Forest Classifier achieved the highest accuracy among the models trained, its hyperparameters are tuned next. A grid search with 5-fold cross-validation is run over the number of trees (`n_estimators`) and the maximum tree depth (`max_depth`).

In [None]:
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier())

param_grid = {
    'randomforestclassifier__n_estimators': [50, 100, 200],
    'randomforestclassifier__max_depth': [None, 10, 20],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(x_train, y_train)

best_params = grid_search.best_params_

In [None]:
print("Best Parameters:", best_params)
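The keys in `param_grid` follow the pattern `<step name>__<parameter name>`, and `make_pipeline` names each step after its lowercased class name. A small sketch of how to check the step names and set a nested parameter; the variable name `pipe` is an assumption for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

# Step names are generated from the lowercased class names
step_names = list(pipe.named_steps)
print(step_names)

# The same "<step>__<param>" syntax works with set_params
pipe.set_params(randomforestclassifier__n_estimators=50)
print(pipe.named_steps["randomforestclassifier"].n_estimators)
```

After the search has run, `grid_search.best_estimator_` holds the whole pipeline refitted with the best parameters, which avoids copying the values by hand.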

## Model training with the best model and hyperparameters

In [None]:
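# Refit a standalone forest with the selected hyperparameters.
# No StandardScaler here: tree splits depend only on the ordering of feature values.
model = RandomForestClassifier(n_estimators=200, max_depth=None)
model.fit(x_train, y_train)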

## Generate predictions

In [None]:
predictions = model.predict(x_test)

## Model Evaluation

### Determine the importance of each feature

In [None]:
feature_names = ['u', 'g', 'r', 'i', 'z', 'spec_obj_ID', 'redshift', 'plate', 'MJD']
feature_importances = model.feature_importances_

f, ax = plt.subplots(figsize=(10, 7))
ax.barh(range(len(feature_importances)), feature_importances, color='midnightblue', alpha=.7)
ax.set_yticks(range(len(feature_importances)))
ax.set_yticklabels(feature_names)
ax.set_title("Importance of each feature")
ax.set_xlabel("Importance")
plt.show()

Based on the graph above, the <b>redshift value</b> is the most important feature, followed by <b>z</b> (the infrared filter in the photometric system) and <b>g</b> (the green filter in the photometric system).
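To read a ranking like this off the numbers instead of the bar chart, the importances can be sorted and paired with their names. The sketch below does this on synthetic data where only one column carries signal; the feature names and the toy data are hypothetical and unrelated to the notebook's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 1] > 0).astype(int)   # only the second column matters
names = ["noise_a", "signal", "noise_b"]  # hypothetical feature names

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Sort feature indices from most to least important
order = np.argsort(forest.feature_importances_)[::-1]
ranking = [(names[i], float(forest.feature_importances_[i])) for i in order]
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In the notebook the same pattern would use `feature_names` and `model.feature_importances_`. These impurity-based importances sum to 1, so each value can be read as a share of the total.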

## Classification Report
Evaluate the model with per-class precision, recall, and F1-score, together with the overall accuracy and the support (number of test instances) of each class.

In [None]:
print(classification_report(y_test, predictions, digits=5))
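The quantities in the report can all be derived from a confusion matrix. A minimal sketch, using a made-up 3-class matrix (the counts are illustrative, not results from this notebook):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [50, 2, 3],
    [4, 40, 1],
    [1, 0, 49],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)               # number of true instances per class
accuracy = true_pos.sum() / cm.sum()

for cls in range(len(cm)):
    print(f"class {cls}: precision={precision[cls]:.3f} "
          f"recall={recall[cls]:.3f} f1={f1[cls]:.3f} support={support[cls]}")
print(f"accuracy={accuracy:.3f}")
```

Precision is computed down the columns and recall along the rows, which is why a class can score well on one and poorly on the other.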

## Confusion Matrix

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, predictions, cmap='cividis',
                                        display_labels=['GALAXY', 'QUASAR', 'STAR'])

Using this machine learning model, stellar entities can be classified as galaxies, quasars, or stars from their photometric magnitudes, redshift, and the other selected features.