# Stellar Classification Dataset

Our project explores how machine learning can help distinguish between types of stellar objects, offering insight into the composition of the universe at a scale of data that individual astronomers could not classify by hand.
Stellar classification is a system for distinguishing stars by their spectra, where earlier approaches relied simply on measures such as brightness (magnitude) or gravity. In this method a star's light is passed through a prism or diffraction grating, which disperses it into a spectrum. The absorption lines in that spectrum, which arise from atomic energy levels, indicate the strengths of different elements and are used to separate stars into classes.
Using spectral features for stellar categorisation is now standard practice. The scheme in current use combines the Harvard system (temperature), ‘developed at Harvard Observatory in the early 20th century’, with the MK system (luminosity).
In this instance we are using a dataset that relies on the SDSS (Sloan Digital Sky Survey) photometric system to capture the spectral energy of stellar objects.



## 1.1.1 Problem statement 

The analysis of stellar spectra is fundamental to astronomers' understanding of the composition and properties of stars. Our primary goal is to build a model that can automate this repetitive classification task across large volumes of data, using artificial intelligence to support the human side of space exploration. To reach this goal we will need to answer:

•	What models are suitable for this goal?

•	How should we build these models?

•	Which methods are fitting for the task?

## 1.1.2 Setup

This Stellar Classification dataset is provided by fedesoriano, a data scientist on Kaggle. Kaggle hosts an extensive collection of datasets for use by its online community. The raw dataset consists of 100,000 samples described by 17 features, making it quite large.
For our project we will be using Spyder, as it displays graphics and diagrams well alongside basic console output. Because the dataset is very large, it also benefits from being processed in Python 3 using the pandas package.
Additionally, frameworks such as scikit-learn (a machine learning framework for Python), TensorFlow (Google’s deep learning framework) and Keras (a high-level neural network API) will be used.

In terms of evaluation metrics, we will primarily be looking at accuracy, which is the ratio of correct predictions to the total number of predictions made.

Accuracy = Number of correct predictions / Total number of predictions made

This metric works best when there is an equal number of samples per class. If this is not the case, we will need to consider correcting the imbalance through resampling techniques (oversampling or undersampling) and detecting as many anomalies as possible.
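
To make the concern about class balance concrete, the following is a minimal sketch of the accuracy formula above in plain Python. The labels are made up for illustration, and the "model" is a hypothetical baseline that always predicts the most common class; neither comes from the project itself.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up, imbalanced labels: 8 Galaxy, 1 QSO, 1 Star.
y_true = ["Galaxy"] * 8 + ["QSO", "Star"]

# Hypothetical baseline that always predicts the majority class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_majority = [majority_class] * len(y_true)

majority_accuracy = accuracy(y_true, y_majority)
print(majority_accuracy)  # 0.8
```

The baseline scores 0.8 without ever identifying a QSO or a Star, which is why accuracy on its own can be misleading when the classes are unevenly represented.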

## 1.2.1 Features 

This dataset is built mainly around five photometric filters, each measuring the flux (brightness) of an object in a different wavelength band:

–	u - ultraviolet light, shorter wavelengths. 

–	g - green light, also sensitive to red and blue portions of light.

–	r - red light, solely capturing red.

–	i - near-infrared light.

–	z - infrared light, at longer wavelengths than the i band.

•	Additional numeric features

–	obj_ID - A unique number that identifies each object in the image catalogue.

–	alpha - The right ascension angle (akin to longitude)

–	delta - The declination angle (akin to latitude)

–	run_ID - The run number used to identify different scans conducted by a specific survey.

–	rerun_ID - A number specifying how the image was processed.

–	cam_col - Stands for ‘camera column’ and is a number to identify the scanline within the run.

–	field_ID - A number to identify each field.

–	spec_obj_ID - A spectroscopic object’s unique ID.

–	redshift - The redshift value, which measures the increase in the wavelength of the object's light.

–	plate - An identifier for each plate in SDSS.

–	MJD - Stands for ‘Modified Julian Date’, which is used to describe when a specific part of the SDSS data was taken.

–	fiber_ID - The identifier of the fiber that directed light to the focal plane in each observation.

The only categorical feature in the dataset is the target class, which is one of Galaxy, QSO (quasar) or Star.


## 1.2.2 Analysis

Exploratory data analysis, beginning with .head() and .tail(), showed that the original data contained surprisingly few missing values, so no fields needed to be dropped.
As a preprocessing step, the percentage distribution of the classes was displayed, and it immediately showed a clear imbalance (Galaxy dominates, with a smaller imbalance between QSO and Star), which would undermine the accuracy metric. As a result, SMOTE (Synthetic Minority Over-sampling Technique) was used to balance the classes, as shown in the figure below.

[Figure omitted: class distribution and SMOTE balancing]
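
As a rough illustration of the balancing step, the sketch below computes class percentages and then oversamples the minority classes in the SMOTE style: each synthetic sample is interpolated between a minority sample and one of its nearest same-class neighbours. This is a simplified, dependency-free stand-in written for illustration, not the implementation used in the project (in practice a library such as imbalanced-learn would normally be used). The function names, the toy data and the choice of `k` are all assumptions.

```python
import math
import random
from collections import Counter


def class_percentages(labels):
    """Percentage of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: 100.0 * n / total for cls, n in counts.items()}


def smote_like_oversample(X, y, k=2, seed=0):
    """Oversample every minority class up to the majority class size.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest neighbours of the same class.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        members = [x for x, label in zip(X, y) if label == cls]
        if len(members) < 2:
            continue  # interpolation needs at least two samples of the class
        for _ in range(target - n):
            base = rng.choice(members)
            # Nearest same-class neighbours of the base point, excluding itself.
            others = [m for m in members if m is not base]
            neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
            neighbour = rng.choice(neighbours)
            lam = rng.random()  # interpolation factor in [0, 1)
            synthetic = [b + lam * (nb - b) for b, nb in zip(base, neighbour)]
            X_out.append(synthetic)
            y_out.append(cls)
    return X_out, y_out


# Made-up samples with two illustrative features per object.
X = [
    [18.5, 0.10], [19.0, 0.12], [18.8, 0.09],
    [19.4, 0.15], [18.2, 0.11], [19.1, 0.13],   # Galaxy
    [20.5, 1.50], [21.0, 2.10],                 # QSO
    [17.0, 0.00], [16.5, 0.00], [17.4, 0.00],   # Star
]
y = ["Galaxy"] * 6 + ["QSO"] * 2 + ["Star"] * 3

before = class_percentages(y)
X_bal, y_bal = smote_like_oversample(X, y)
after = class_percentages(y_bal)
print(before)
print(after)
```

Because the synthetic points are interpolated rather than duplicated, the minority classes gain new samples that stay within the region already occupied by their real members.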

Alongside this, a statistical summary was visualised to spot anomalies and outliers. Boxplots were used initially as a discovery tool, but some features did not display well because of extreme outliers.
Histograms gave a better representation of the data before the outliers were addressed, and were used to show the distributions before and after standardisation with the Z-score. The equation is as follows:

Z standardisation = (x - μ) / σ

where μ is the mean and σ is the standard deviation.
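
The effect of an extreme outlier on this formula can be seen in a small worked sketch. The values below are made up, and the population standard deviation is assumed for σ (the sample standard deviation would give slightly different numbers).

```python
import statistics


def z_scores(values):
    """Standardise values as (x - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("standard deviation is zero; cannot standardise")
    return [(x - mu) / sigma for x in values]


# Made-up feature column: five typical values and one extreme outlier.
values = [18.0, 19.0, 20.0, 21.0, 22.0, 1000.0]
scaled = z_scores(values)

typical = scaled[:5]
print(scaled)
print(max(typical) - min(typical))  # spread of the typical values after scaling
```

The outlier inflates both the mean and the standard deviation, so the five typical values end up squeezed into a very narrow range while the outlier still stands apart.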

The figure below shows an example of what happened to most feature variables before and after Z-score normalisation.

[Figure omitted: example feature distribution before and after Z-score normalisation]

Z-score normalisation had a detrimental effect because of the extreme outliers, so another method for tidying up the features was used.
Robust scaling was more effective, as it normalises data using the median and interquartile range rather than the mean and standard deviation, making it less sensitive to outliers. This had a much more positive impact, as shown in the figure below.

Robust scale(x) = (x - median(x)) / (Q3(x) - Q1(x))

[Figure omitted: feature distribution after robust scaling]
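
A minimal sketch of the robust scaling formula above, using only the standard library, is given here. The data are made up, and the quartiles are computed with linear interpolation between data points (the `inclusive` method), which is one of several common quartile conventions and is an assumption of this example.

```python
import statistics


def robust_scale(values):
    """Scale values as (x - median) / (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; cannot scale")
    return [(x - median) / iqr for x in values]


# Made-up feature column: five typical values and one extreme outlier.
values = [18.0, 19.0, 20.0, 21.0, 22.0, 1000.0]
scaled = robust_scale(values)

typical = scaled[:5]
print(scaled)
print(max(typical) - min(typical))  # spread of the typical values after scaling
```

The median and interquartile range are barely affected by the single extreme value, so the typical values keep a usable spread. The outlier itself is not removed; it simply no longer dictates the scale of everything else.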

To train the model more quickly, variables with less influence on the class were treated as secondary features. The primary features appeared to be ‘u’, ‘g’, ‘r’, ‘i’, ‘z’, ‘alpha’, ‘delta’, ‘plate’ and especially ‘redshift’.

## 1.2.3 Cleaning up the data

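```python
import os

import pandas as pd

# Read the dataset into a DataFrame.
# Adjust the filename to match the downloaded Stellar Classification CSV.
path = "."
filename_read = os.path.join(path, "heart_attack_prediction_dataset.csv")
df = pd.read_csv(filename_read)
```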