# Introduction
This dataset comprises 100,000 observations of space captured by the Sloan Digital Sky Survey (SDSS). Each observation is described by 17 feature columns and a class column that labels it as a star, galaxy, or quasar. The objective of this project was to build and evaluate machine learning models that classify these cosmic objects from their spectral characteristics.

The scope of the project encompassed the following steps:
- Data exploration and understanding the distribution of classes.
- Preprocessing the data, including splitting and scaling.
- Training several machine learning models: Logistic Regression, Decision Tree, Random Forest, SVM, and a Neural Network.
- Evaluating the performance of each model and comparing their accuracies.
- Visualizing the distribution of cosmic object classes in the dataset.

Link to the dataset: https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17

In [None]:
import pandas as pd

# Load the CSV file
data = pd.read_csv('star_classification.csv')

# Display the first few rows of the dataset
data.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Fix the bar order so the legend labels always match the bars
class_order = data['class'].value_counts().index.tolist()

# Plot the distribution of classes with labels and a legend
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='class', data=data, order=class_order, palette='pastel')
plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Distribution of Classes')
plt.xticks(rotation=0)

# Get the colors of the bars (one per class, in plotting order)
colors = [patch.get_facecolor() for patch in ax.patches[:len(class_order)]]

# Create legend handles
handles = [plt.Line2D([0], [0], color=color, marker='o', linestyle='') for color in colors]

ax.legend(handles=handles, title='Class', labels=class_order)
plt.show()


In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Drop identifier columns that are not needed for classification
data.drop(columns=['obj_ID', 'spec_obj_ID'], inplace=True, errors='ignore')

# Encode the target variable
le = LabelEncoder()
data['class'] = le.fit_transform(data['class'])

# Split data into features and target
X = data.drop('class', axis=1)
y = data['class']

# Split first so that the test set stays unseen while fitting the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train.shape, X_test.shape
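
Scaling can also be wrapped in a scikit-learn `Pipeline`, so that the scaler is fitted only on the training portion of each cross-validation fold. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption standing in for the SDSS features); the names `X_demo`, `y_demo`, and `pipe` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: 3 classes, 15 features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=6,
    n_classes=3, random_state=42,
)

# A stratified split keeps the class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Inside cross_val_score the scaler is re-fitted on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)

# Final fit on the full training split, scored once on the held-out test split
pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)
print(cv_scores.mean(), test_acc)
```

Any of the scikit-learn classifiers used in this notebook could take the place of `LogisticRegression` in the pipeline.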

In [None]:
import seaborn as sns

# Compute the correlation matrix
corr = X.corr()

# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, linewidths=0.5, linecolor='black')
plt.title('Feature Correlation Heatmap')
plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Initialize the models
lr = LogisticRegression(max_iter=1000, random_state=42)
dt = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(random_state=42)
svm = SVC(random_state=42)

# Train the models
lr.fit(X_train, y_train)
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)
svm.fit(X_train, y_train)

# Predict on the test set
lr_preds = lr.predict(X_test)
dt_preds = dt.predict(X_test)
rf_preds = rf.predict(X_test)
svm_preds = svm.predict(X_test)

# Calculate accuracies
lr_acc = accuracy_score(y_test, lr_preds)
dt_acc = accuracy_score(y_test, dt_preds)
rf_acc = accuracy_score(y_test, rf_preds)
svm_acc = accuracy_score(y_test, svm_preds)

lr_acc, dt_acc, rf_acc, svm_acc
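
Because the classes are not equally frequent, overall accuracy can hide poor performance on a rare class. The sketch below uses a small hand-made example (hypothetical labels, not results from this notebook) to show how `balanced_accuracy_score` and `classification_report` expose this. It assumes the encoded order GALAXY = 0, QSO = 1, STAR = 2, which is what `LabelEncoder` would produce if the CSV uses those class strings.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
)

# Hypothetical ground truth: 6 galaxies, 2 quasars, 2 stars
y_true_demo = np.array([0] * 6 + [1] * 2 + [2] * 2)
# Hypothetical predictions: every quasar is mislabeled as a galaxy
y_pred_demo = np.array([0] * 6 + [0] * 2 + [2] * 2)

acc = accuracy_score(y_true_demo, y_pred_demo)               # 8 of 10 correct
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)  # mean of per-class recalls

print(f"accuracy: {acc:.3f}, balanced accuracy: {bal_acc:.3f}")
print(classification_report(
    y_true_demo, y_pred_demo,
    target_names=['GALAXY', 'QSO', 'STAR'],  # assumed class names
    zero_division=0,
))
```

In the notebook, the same calls could be applied to `y_test` and any of the prediction arrays, such as `rf_preds`.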

In [None]:
!pip install tensorflow

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define the neural network model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=1)

# Evaluate the model on the test set
loss, nn_acc = model.evaluate(X_test, y_test, verbose=0)

nn_acc

In [None]:
import matplotlib.pyplot as plt

# Model names and their accuracies
models = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM', 'Neural Network']
accuracies = [lr_acc, dt_acc, rf_acc, svm_acc, nn_acc]

# Plotting the accuracies
plt.figure(figsize=(10, 6))
plt.bar(models, accuracies, color=['blue', 'green', 'red', 'cyan', 'purple'])
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.xticks(rotation=45)
plt.ylim(0.9, 1)
plt.tight_layout()
plt.show()

In [None]:
# Feature Importance using Random Forest
importances = rf.feature_importances_
features = X.columns

# Sorting the features based on importance
sorted_idx = importances.argsort()

# Plotting the feature importance
plt.figure(figsize=(10, 8))
plt.barh(features[sorted_idx], importances[sorted_idx], align='center')
plt.xlabel('Importance')
plt.title('Feature Importance using Random Forest')
plt.show()
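
Impurity-based importances from a Random Forest are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The following is a minimal sketch on synthetic data (an assumption, not the SDSS features); `rf_demo` and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, of which only 3 are informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn on the test split and record the drop in accuracy
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)

# Features ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```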

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Define a function to plot a row-normalized confusion matrix for a fitted model.
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
def plot_cm(model, X_test, y_test, name):
    fig, ax = plt.subplots(figsize=(6, 5))
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=le.classes_, cmap=plt.cm.Blues, normalize='true', ax=ax)
    ax.set_title(f'Confusion Matrix for {name}')
    plt.show()

# Plot confusion matrices
plot_cm(lr, X_test, y_test, 'Logistic Regression')
plot_cm(dt, X_test, y_test, 'Decision Tree')
plot_cm(rf, X_test, y_test, 'Random Forest')
plot_cm(svm, X_test, y_test, 'SVM')

# Conclusion and Further Proposals
## Results Summary:
- The **Random Forest Classifier** emerged as the top-performing model with an accuracy of approximately 99% on the test set.
- The **Neural Network model**, built using TensorFlow, also performed well, with an accuracy close to 98%.
- **SVM** and **KNN** models achieved accuracies of 95% and 93%, respectively (the KNN model is not among the code cells above).
- The dataset predominantly consists of galaxies, followed by stars and then quasars.
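
A KNN classifier like the one mentioned in the summary could be trained along the following lines. This is a minimal sketch on synthetic data, not the run that produced the 93% figure; `n_neighbors=5` and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=6,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# KNN is distance-based, so features are standardized first
scaler_demo = StandardScaler().fit(X_tr)

knn = KNeighborsClassifier(n_neighbors=5)  # assumed neighbor count
knn.fit(scaler_demo.transform(X_tr), y_tr)

knn_acc = accuracy_score(y_te, knn.predict(scaler_demo.transform(X_te)))
print(knn_acc)
```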

## Insights:
The high accuracy achieved by the models, especially the Random Forest and Neural Network, shows that machine learning can classify cosmic objects from their spectral characteristics. Features in the dataset such as the filter values and redshift play a crucial role in determining the class of a cosmic object.

## Proposals for Taking the Project Further:
- **Feature Engineering**: Investigate the creation of new features or the combination of existing ones to enhance model performance.
- **Deep Learning**: Explore deeper neural network architectures or convolutional neural networks (CNNs) for classification, especially if images or spectral data are available.
- **Anomaly Detection**: Given the vastness of space and the uniqueness of cosmic objects, implementing anomaly detection could help in identifying rare or previously unknown objects.
- **Real-time Classification**: Develop a system that can classify cosmic objects in real-time as data is captured by telescopes or satellites.
- **Collaboration with Astronomers**: Work closely with experts in the field of astronomy to gain insights that can guide the modeling process and ensure the models' findings are aligned with astronomical knowledge.
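
As one concrete starting point for the feature engineering proposal, differences between adjacent photometric bands (color indices) are commonly used in astronomy. The sketch below assumes the filter columns are named `u`, `g`, `r`, `i`, and `z`; the magnitudes in `demo` are made up for illustration, and `add_color_indices` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical magnitudes for three objects (not taken from the dataset)
demo = pd.DataFrame({
    'u': [23.9, 24.8, 19.4],
    'g': [22.3, 22.8, 17.5],
    'r': [20.4, 22.6, 16.9],
    'i': [19.2, 21.2, 16.7],
    'z': [18.8, 21.6, 16.6],
})

def add_color_indices(df):
    """Return a copy of df with adjacent-band differences u_g, g_r, r_i, i_z."""
    out = df.copy()
    bands = ['u', 'g', 'r', 'i', 'z']  # assumed column names
    for bluer, redder in zip(bands[:-1], bands[1:]):
        out[f'{bluer}_{redder}'] = out[bluer] - out[redder]
    return out

demo_colors = add_color_indices(demo)
print(demo_colors[['u_g', 'g_r', 'r_i', 'i_z']])
```

Applied to the notebook, the helper would be called on the feature frame before the train/test split, so that the new columns are scaled together with the existing ones.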