# COGS 118A - Final Project

# Insert title here

# Names

The Elite Fantasy Team.

- Boning Yang (BY)
- Chenhao Zhu (CZ)
- Jason Chen (JC)
- Muchan Li (ML)
- Anni Li (AL)

# Abstract 
Thousands of stars are visible each night, yet we rarely stop to ask what they have in common and how they differ. For our project, we train a model to classify each star's Morgan-Keenan spectral class from its observable features. We pay particular attention to the brightness, color, and magnitude of the stars, since these are the decisive distinctions. After cleaning the data, we will build two separate models on our dataset, a support vector classifier (SVC) and a K-Nearest Neighbors classifier, and evaluate them on accuracy, precision, and recall. We expect to achieve a high classification accuracy (above 90%) so that we can confidently rely on our model to make convenient yet precise predictions when a new star is observed.

# Background

Since astronomy is a brand new field for every member of our team, we went through a good deal of crash courses and background research to better position ourselves to answer the question we came up with.

First off, what is the Morgan-Keenan spectral classification system? In short, it is a system introduced by William Wilson Morgan and Philip C. Keenan in 1943 that assigns a spectral class to each star based on its effective temperature. The scale, from hottest to coolest, is as follows:
- O, >= 30,000 Kelvin
- B, between 10,000 and 30,000 Kelvin
- A, between 7,500 and 10,000 Kelvin
- F, between 6,000 and 7,500 Kelvin
- G, between 5,200 and 6,000 Kelvin
- K, between 3,700 and 5,200 Kelvin
- M, between 2,400 and 3,700 Kelvin
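
The scale above can be written as a small lookup. This is only an illustrative sketch: the function name `mk_class_from_temperature` is hypothetical, and treating each lower bound as inclusive is our assumption, since the list does not say which class a boundary value belongs to.

```python
from typing import Optional

# Lower temperature bound (Kelvin) of each class, hottest first,
# copied from the list above.
MK_LOWER_BOUNDS = [
    ("O", 30_000),
    ("B", 10_000),
    ("A", 7_500),
    ("F", 6_000),
    ("G", 5_200),
    ("K", 3_700),
    ("M", 2_400),
]


def mk_class_from_temperature(t_eff: float) -> Optional[str]:
    """Return the MK class letter for an effective temperature in Kelvin.

    Lower bounds are treated as inclusive (an assumption). Temperatures
    below the M range listed above return None.
    """
    for letter, lower in MK_LOWER_BOUNDS:
        if t_eff >= lower:
            return letter
    return None


if __name__ == "__main__":
    for t in (35_000, 9_000, 5_800, 3_000, 2_000):
        print(t, mk_class_from_temperature(t))
```
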

We even found a mnemonic for remembering this sequence: "**O**h **B**e **A** **F**ine **G**irl (**G**uy), **K**iss **M**e". More factual information about spectral types and the MK system can be found here<a name="MK"></a>[<sup>[1]</sup>](#MK).

Now, one may wonder: since there is a rather obvious connection between a star's temperature and its MK spectral class, why not just scrape the temperature data and match them up? The main objection we found is that "temperature" is not a measurement that can be taken for granted (notice that our dataset has no temperature feature either). It requires a large number of assumptions, observations, and calculations, and in the end the result an astrophysicist arrives at may still not be accurate<a name="stellartemp"></a>[<sup>[2]</sup>](#stellartemp). In a more scholarly article<a name="infrared"></a>[<sup>[3]</sup>](#infrared), the authors proposed a revised Infrared Flux method to determine stellar effective temperature with reduced numerical work and further insight into the method. They revisited the assumption that the effective temperature $T_e$ is related to both the integrated stellar flux and the monochromatic stellar flux; combining these two relations, they derived an equation for the ratio of integrated stellar flux to monochromatic flux, which contains the $T_e$ term and can be solved for it. However, as they later concluded in the "Sensitivity and Accuracy of Method" section, the accuracy of the Infrared Flux method is limited by the accuracy with which the ratio can be measured, which in turn depends on the accuracy of the absolute flux measurements across spectral regions. A more recent research paper<a name="gaia"></a>[<sup>[4]</sup>](#gaia), published 40 years after the first reference article, presents yet another advance in the precise derivation of stellar effective temperature. Nonetheless, the Gaia EDR3 photometry method is still limited by other measurements and factors such as the $K_s$ band, Gaia color, stellar blending, etc. All of this shows that "temperature" is not as straightforward a measurement as it may seem when it comes to astronomy.

Hence, we believe that if we can use machine learning techniques to classify a star reliably by "brute force", it would make the classification process more interpretable to astronomy newbies like us and, at the same time, save the experts a great deal of work whenever deriving the stellar temperature is not absolutely necessary.

# Problem Statement

In this project, we propose to predict the Morgan-Keenan spectral class of a given or newly observed star using the smallest possible subset of features, drawn from a pool that includes luminosity, absolute visual magnitude, right ascension/declination, etc., while maintaining above 90% accuracy. This problem can be approached with practical machine learning methods such as decision trees, KNN, etc. The result can be evaluated by accuracy, that is, by comparing the predicted class with the true class in our training and testing datasets. With the right method and model, this solution should be replicable.
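
To make the KNN idea and the accuracy measure concrete, here is a minimal standard-library sketch. It is not our actual pipeline: the helper names `knn_predict` and `accuracy` are hypothetical, the two-feature toy points (color index, absolute magnitude) are made up for illustration, and in practice a library implementation such as scikit-learn's `KNeighborsClassifier` would be the more likely choice.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Made-up toy data: [color index, absolute magnitude] per star.
    train_x = [[0.0, 1.0], [0.1, 1.2], [0.05, 0.9],
               [1.5, 9.0], [1.6, 9.5], [1.4, 8.8]]
    train_y = ["A", "A", "A", "M", "M", "M"]

    test_x = [[0.08, 1.1], [1.55, 9.2]]
    test_y = ["A", "M"]

    predictions = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
    print(predictions, accuracy(test_y, predictions))
```

Because KNN relies on raw distances, features on very different scales (for example luminosity versus color index) would normally be standardized first; the sketch skips that step.
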

# Data

### Dataset name:

- hygdata_v3.csv

### Github link:

- https://github.com/astronexus/HYG-Database

### Size of the dataset:

- This dataset has 36 variables and 119613 observations. The actual shape of this dataset is 119,614 rows × 37 columns

### What an observation consists of:

- Each observation contains the star's location (where it is and how far away it is) and some of its critical characteristics, including its brightness, energy, color, etc. There are a total of 36 variables in each observation (some of which might not be available). We will use some of the critical variables below to classify star types.

### Some critical variables in the dataset:

- Spect: The star's spectral type. There are seven main types of stars: OBAFGKM, where O corresponds to the hottest, most powerful stars (and usually the largest) and M corresponds to the coolest, least powerful stars (and usually the smallest). The spectral type of the star will be the label in our supervised machine learning model.
- Distance: The star's distance in parsecs. Distance is one of the traits we need to consider in our model because the farther away a star is, the dimmer it appears. We will need distance to measure the luminosity and brightness of the stars more accurately.
- Mag: The star's apparent visual magnitude. The apparent magnitude tells us how bright the star appears when observed from the Earth.
- AbsMag: The star's absolute visual magnitude, which is the visual magnitude the star would have if observed from 10 parsecs away. This quantity is crucial because it tells us how bright the star "actually" is. Note that the magnitude scale is inverted: the lower (more negative) a star's absolute visual magnitude, the brighter the star, and the more likely it is to be large and energetic.
- Ci: The star's B-V color index. This index indicates the star's color, which is an efficient proxy for its temperature and energy. It is defined as the difference between the star's blue magnitude and its visual magnitude. The smaller the B-V color index, the hotter the star (and the bluer it appears); a larger index indicates a cooler, redder star.
- Lum: The star's luminosity. This variable is the total electromagnetic power emitted by the star per unit time, expressed as a multiple of the solar luminosity (the Sun's luminosity is defined as 1). The higher the star's luminosity, the more energy it releases in a given time period. In other words, the higher the luminosity, the more powerful the star is.
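
The relation between apparent magnitude, distance, and absolute magnitude described above is the standard distance modulus, $M = m - 5\log_{10}(d / 10\,\text{pc})$. The sketch below illustrates it, together with a rough conversion from absolute magnitude to luminosity in solar units. The function names are hypothetical, the Sun's absolute visual magnitude of 4.83 is an assumed reference value, and the conversion ignores bolometric corrections, so it will not necessarily reproduce the dataset's own `lum` column.

```python
import math

# Assumed absolute visual magnitude of the Sun (reference value for the sketch).
SUN_ABS_MAG = 4.83


def absolute_magnitude(apparent_mag: float, distance_pc: float) -> float:
    """Distance modulus: magnitude the star would have at 10 parsecs."""
    return apparent_mag - 5.0 * math.log10(distance_pc / 10.0)


def luminosity_solar(abs_mag: float, sun_abs_mag: float = SUN_ABS_MAG) -> float:
    """Approximate visual luminosity in solar units from absolute magnitude.

    Each 5 magnitudes corresponds to a factor of 100 in brightness,
    and lower magnitudes mean brighter stars.
    """
    return 10.0 ** (0.4 * (sun_abs_mag - abs_mag))


if __name__ == "__main__":
    # A star of apparent magnitude 5 at 100 parsecs.
    m_abs = absolute_magnitude(5.0, 100.0)
    print(m_abs, luminosity_solar(m_abs))
```
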

### Any special handling, transformations, cleaning, etc will be needed:

- This dataset contains some variables that will not be used in classification, so we will clean the data to keep only the variables we need. Also, the star's spectral type contains a subtype for each class. For example, in class F, we might have F5 or F0V, etc. We will clean the data so that only the general type is used as the label, in order to reduce model complexity.
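
A minimal sketch of the cleaning step described above, using only the standard library. The helper names `main_spectral_class` and `clean_rows` are hypothetical. The lowercase column names (`spect`, `dist`, `mag`, `absmag`, `ci`, `lum`) are assumed to match the CSV header, and dropping every row whose spectral string does not start with one of the seven letters (for example white dwarfs or prefixed types) is a simplifying assumption rather than a settled decision.

```python
import re
from typing import Optional

# Feature columns we assume are present in the CSV header.
FEATURES = ("dist", "mag", "absmag", "ci", "lum")

# Matches one of the seven main class letters at the start of the string.
_MAIN_CLASS = re.compile(r"[OBAFGKM]")


def main_spectral_class(spect: Optional[str]) -> Optional[str]:
    """Reduce a spectral type such as 'F5' or 'F0V' to its main class letter."""
    if not spect:
        return None
    match = _MAIN_CLASS.match(spect.strip())
    return match.group(0) if match else None


def clean_rows(rows):
    """Keep only rows with a usable label and numeric values for every feature.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    Returns a list of (feature_list, label) pairs.
    """
    cleaned = []
    for row in rows:
        label = main_spectral_class(row.get("spect"))
        if label is None:
            continue
        try:
            features = [float(row[name]) for name in FEATURES]
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with absent, empty, or non-numeric features
        cleaned.append((features, label))
    return cleaned


if __name__ == "__main__":
    sample = [
        {"spect": "F0V", "dist": "10", "mag": "5", "absmag": "5", "ci": "0.3", "lum": "1"},
        {"spect": "", "dist": "10", "mag": "5", "absmag": "5", "ci": "0.3", "lum": "1"},
    ]
    print(clean_rows(sample))
```

With the real file, the rows could come from `csv.DictReader(open("hygdata_v3.csv", newline=""))`.
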

# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned.  Consider deleting things that are not important to your narrative.  If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
